I think the problem is that it's hard to curate feeds in a language you don't understand. I've been building an uncurated index of OPML blogrolls, with no language restriction. The OPML blogrolls are curated by their owners, so someone decided they met some inclusion criteria, but the overall list is uncurated.
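Since OPML blogrolls are just nested XML `<outline>` elements, collecting the feed URLs from one is a short exercise. A minimal sketch with Python's standard library (the sample document and `feed_urls` helper are my own illustration, not part of the index's actual code):

```python
import xml.etree.ElementTree as ET

# A minimal OPML blogroll; real files nest <outline> elements arbitrarily deep.
opml = """<?xml version="1.0"?>
<opml version="2.0">
  <body>
    <outline text="Tech">
      <outline text="Example Blog" type="rss"
               xmlUrl="https://example.com/feed.xml"
               htmlUrl="https://example.com/"/>
    </outline>
  </body>
</opml>"""

def feed_urls(opml_text):
    """Collect every feed URL from an OPML document, at any nesting depth."""
    root = ET.fromstring(opml_text)
    # .iter() walks the whole tree, so nested folders are handled for free.
    return [o.get("xmlUrl") for o in root.iter("outline") if o.get("xmlUrl")]

print(feed_urls(opml))  # ['https://example.com/feed.xml']
```

Folder `<outline>` elements carry no `xmlUrl`, which is why the filter is needed.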
You know the funny thing about this is that I have talked, relatively recently, to one of the very few cryptographers who was an author on a DNSSEC standard, and they declined the interview I want to do --- they're not sold enough on DNSSEC anymore.
The broader answer is: the relevant RFCs weren't authored by cryptography engineers. This was a major problem in the "old" IETF, before the cryptographers "took over" tls-wg and CFRG.
At any rate, the reason I asked in that particular place on the thread was that the preceding comment was attempting to draw a line between "sysadmins" who hate DNSSEC and the serious technologists who like it a lot.
Kudos on your site effort; I immediately see your point.
In fact I took your topmost entry, which had no helpful site/update tags, and dove in a little to try to understand why an RSS-friendly blogger might not be passing along tags for better reader discovery.
Turns out my scarce-info test-case blogger has a Mastodon profile that immediately lists all these tags about himself [I've stripped it down] ...
#FrontEnd Developer #CSS
#Halifax #London #Singapore
Technical writer and rabbit-hole deep-diver
Former Organiser for https://londonwebstandards.org & https://stateofthebrowser.com
Interests: #Bushcraft #Outdoors #DnD #Fantasy #SciFi #HipHop #CSS #Eleventy #IndieWeb #OpenSource #OpenWeb
I conclude that if he knew such site and post tags would be useful once they reached RSS, he'd probably make the tiny effort to wire up the descriptions.
Nonetheless I merely crawled links for a minute to find this info, so I imagine something like the free tier of the Cloudflare crawling API might suffice over time as a simplistic automated fix to decorate blog sites with hints.
I mean, given that we're not trying to recreate PageRank, just trying to tip the balance in favor of desirable initial discovery.
Crawling related sites for tags could work (open graph tags on the website are another good source). I'm wary of mixing data across contexts though. A blog and a Mastodon profile may intend to present a different face to the world or could discuss different topics.
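Open Graph tags are easy to pull from a page with nothing but the standard library. A sketch, assuming we only want `<meta property="og:...">` pairs from a fetched page (the sample HTML and `OGParser` class are illustrative):

```python
from html.parser import HTMLParser

class OGParser(HTMLParser):
    """Collect <meta property="og:..."> tags from a page."""
    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        prop = a.get("property", "")
        if prop.startswith("og:") and "content" in a:
            self.og[prop] = a["content"]

html = """<html><head>
<meta property="og:title" content="My Blog">
<meta property="og:locale" content="en_GB">
</head><body></body></html>"""

p = OGParser()
p.feed(html)
print(p.og)  # {'og:title': 'My Blog', 'og:locale': 'en_GB'}
```

`og:locale` in particular could feed the language-tagging problem discussed above, with the same caveat about mixing contexts.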
One objection I have to the Kagi Small Web approach is the avoidance of infrequently updated sites. Some of my favorite blogs post very rarely, but when they post it's a great read. When I discover a great new blog that hasn't been updated in years I'm excited to add it to my feed reader, because it's a really good signal that when they publish again it will be worth reading.
One of the many things I disagree with Scott Alexander on is that, to me, frequent blog updates signal poor quality rather than excellent writing. It's hard to come up with an independent, evidence-based opinion on something worth sharing every week, but easy to post about whatever you've been reading lots of angry or scary posts about. People who post a lot also tend to have trouble finding useful things to do in their offline life. It is very unusual that he managed to be both a psychiatrist and a prolific blogger, and he quit the psychiatry job before he had children or other care responsibilities.
I have a "frequent post" section of my blog and a "deeper" section. Unless you're interested in the frequent posts they aren't in your face on my blog. It's kind of a best of both worlds type thing.
The frequent posts also let me quickly try out new methods of telling stories or presenting information, or new techniques. I think this tends to speed up how often I post larger-effort things because I can practice skills with the frequent posts.
A good comparison would be a YouTuber with a Patreon. The YouTube channel gets the produced media, whereas the Patreon gets "cell phone in the moment" updates.
But I totally agree that when folks are straining to find things to post about, it can be problematic and annoying.
I'm with you. Also, sometimes I'm specifically looking for some dusty old site that has long been forgotten about. Maybe I'm trying to find something I remember from ages ago. Or maybe I'm trying to deeply research something.
There's a lot more to fixing search than prioritizing recency. In fact, I think recency bias sometimes makes search worse.
* The blog must have a recent post, no older than 12 months, to meet the recency criteria for inclusion.
* Criteria for posts to show on the website: Blog has recent posts (<7 days old), The website can appear in an iframe
The latter criterion is for the website / post to appear in Kagi's random Small Web feature, where they display the blog post in an iframe. (So I think only posts from the last week are displayed there.) Being on the list should ensure that any new posts can be displayed in Small Web though, and presumably that the website is indexed in Kagi's Teclis index as well. At least, I really hope that the Teclis index includes all of those old blog posts too, rather than discarding them.
EDIT: I just realized freediver actually is Vladimir - I'd love to know if Teclis does index all those older blog posts too. I assume it does index everything that is still present in the RSS feeds?
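My reading of the two criteria above, as a minimal sketch (the feed URLs, dates, and helper names are hypothetical, and I'm assuming "12 months" means roughly 365 days):

```python
from datetime import datetime, timedelta, timezone

now = datetime(2024, 6, 1, tzinfo=timezone.utc)  # fixed "now" for the example

blogs = [
    ("https://a.example/feed", now - timedelta(days=3)),    # fresh post
    ("https://b.example/feed", now - timedelta(days=200)),  # within a year
    ("https://c.example/feed", now - timedelta(days=500)),  # too stale
]

def eligible_for_index(last_post, now):
    """Inclusion rule: newest post no older than ~12 months."""
    return now - last_post <= timedelta(days=365)

def eligible_for_iframe(last_post, now):
    """Display rule: newest post under 7 days old."""
    return now - last_post < timedelta(days=7)

index = [u for u, d in blogs if eligible_for_index(d, now)]
feature = [u for u, d in blogs if eligible_for_iframe(d, now)]
print(index)    # ['https://a.example/feed', 'https://b.example/feed']
print(feature)  # ['https://a.example/feed']
```

The point being that the 12-month inclusion filter and the 7-day display filter are independent: a blog can stay in the index long after it drops out of the iframe rotation.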
I know Kagi doesn't do it, but it is possible to specify the language in the feed (xml:lang) so that a feed reader can filter out languages the user doesn't understand from multi-language feeds. One challenge is that lots of bloggers forget to add that tag.
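A sketch of how a reader might use xml:lang, assuming an Atom feed; the sample feed and `entries_in` helper are illustrative. Note that xml:lang set on the `<feed>` element is inherited by entries that don't override it:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the fixed XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ATOM = "{http://www.w3.org/2005/Atom}"

feed = """<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>Mixed-language blog</title>
  <entry xml:lang="de"><title>Hallo Welt</title></entry>
  <entry><title>Inherits the feed language</title></entry>
</feed>"""

def entries_in(feed_text, wanted):
    """Return entry titles whose (inherited) language matches `wanted`."""
    root = ET.fromstring(feed_text)
    default = root.get(XML_LANG, "")
    out = []
    for entry in root.findall(ATOM + "entry"):
        lang = entry.get(XML_LANG, default)
        # Compare only the primary subtag, so "en-GB" matches "en".
        if lang.split("-")[0] == wanted:
            out.append(entry.findtext(ATOM + "title"))
    return out

print(entries_in(feed, "en"))  # ['Inherits the feed language']
print(entries_in(feed, "de"))  # ['Hallo Welt']
```

When bloggers omit the tag entirely, `default` is empty and nothing matches, which is exactly the discovery problem described above.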
If it takes off in any amount, then LLMs will just subscribe and pull said data from sites at a reasonable pace (or not, since it's free, so make many accounts).
~As far as I know, bucket names are public via certificate transparency logs.~ There are tools for collecting those names. Besides you'd leak the subdomain to (typically) unencrypted DNS when you do a lookup and maybe via SNI.
> Besides you'd leak the subdomain to (typically) unencrypted DNS when you do a lookup and maybe via SNI.
"Leak" is maybe an overstatement, although if someone MitM'd you they'd definitely be able to see it. But "leak" makes it seem like it's broadcast somehow, which obviously it isn't.
You'd need to check the privacy policy of your DNS provider to know if they share the data with anyone else. I've commonly seen the source IP address considered PII, but not the content of the query. Cloudflare's DNS, for example, shares queries with APNIC for research purposes. https://developers.cloudflare.com/1.1.1.1/privacy/public-dns... Other providers share much more broadly.
> No man-in-the-middle is needed [...] Check out passive DNS
How does one execute this "passive DNS" without quite literally being on the receiving end, or at least sitting between the sending and receiving ends? You're quite literally describing what I'm saying, which makes it less of a "leak" and more like "others might collect your data, even your ISP", which I'd say would be more accurate than "your DNS leaks".
There's a lot of online documentation about passive DNS. Here's one example:
> Passive DNS is a historical database of how domains have resolved to IP addresses over time, collected from recursive DNS servers around the world. It has been an industry-standard tool for more than a decade.
> Spamhaus’ Passive DNS cluster handles more than 200 million DNS records per hour and stores hundreds of billions of records per month, providing you with access to a vast lake of threat intelligence data.
> collected from recursive DNS servers around the world
Yes, of course, because those DNS servers are literally receiving the queries, i.e. "receiving the data".
Again, there is nothing "leaking" here; that's like saying you leak what HTTP path you're requesting from a server when you send an HTTP request to that server. Of course, that's how the protocol works!
Putting a secret subdomain in a DNS query shares it with the recursive resolver, whose privacy policy may permit them to share it with others. This is common practice, and attackers have access to the aggregated datasets. You are correct that third-party web servers or CDNs could share your HTTP path, but I am not aware of any examples, and most privacy policies should prohibit them from doing so. If your web server provider or CDN does this, change providers. DNS recursive resolvers are chosen client-side, so you can't always choose which one handles the query. Even privacy-focused DNS recursive resolvers share anonymized query data. They remove the source IP address, since it's PII, but still "leak" the secret subdomain.
Any time you send secret data such that it travels to an attacker visible dataset it is vulnerable to attack. I call that a leak but we can use a different term.
These days people are fearful of their work ending up in LLM training datasets. A private but statically hosted website is on a lot of people's minds. Most social networks have privacy settings these days, which feels like a missing feature of standard, static blogs.
That should scale pretty well. The HTTP fetch of posts/index.json could use conditional GET requests to avoid downloading the body when there are no changes. Static files are dirt cheap to serve.
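The conditional-GET handshake can be sketched offline: on the first fetch the client stores the ETag, and on later polls it sends If-None-Match so an unchanged file comes back as a bodiless 304. The `serve` function below simulates the server-side check; its name and signature are illustrative, not a real API:

```python
def serve(headers, current_etag, body):
    """Simulated server: honor If-None-Match with a 304 when nothing changed."""
    if headers.get("If-None-Match") == current_etag:
        return 304, None  # nothing changed: no body transferred
    return 200, body

etag = '"v1"'
body = '{"items": []}'

# First poll: no cached ETag yet, so the full body is downloaded.
status, payload = serve({}, etag, body)
assert (status, payload) == (200, body)
cached_etag = etag

# Later polls: present the cached ETag and skip the body if unchanged.
status, payload = serve({"If-None-Match": cached_etag}, etag, body)
print(status, payload)  # 304 None
```

With real HTTP you'd get the ETag (or Last-Modified) from the response headers of the first fetch; the point is that steady-state polling costs a header exchange, not a download.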
The PR doesn't disclose that "an LLM did it", so maybe the project allowed a violation of their policy by mistake. I guess they could revert the commit if they happen to see the submitter's HN comment.
Dunno, but a commenter already noted that some projects are beginning to say "no LLM-generated PRs, but we'll accept your prompt", and another person answered that he saw that too.
https://alexsci.com/rss-blogroll-network/