
I think the problem is that it's hard to curate feeds in a language you don't understand. I've been building an uncurated index of OPML blogrolls, with no language restriction. The OPML blogrolls are curated by their owners, so someone decided they met some inclusion criteria, but the overall list is uncurated.

https://alexsci.com/rss-blogroll-network/
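For anyone curious what indexing OPML blogrolls involves, here's a rough sketch of extracting feed URLs from one, using only Python's stdlib. The element and attribute names (`outline`, `xmlUrl`, `htmlUrl`) follow the OPML 2.0 convention; the sample blogroll and URLs are made up for illustration:

```python
import xml.etree.ElementTree as ET

# A minimal, made-up OPML blogroll of the kind a blog owner might publish.
OPML = """<?xml version="1.0" encoding="UTF-8"?>
<opml version="2.0">
  <head><title>Example blogroll</title></head>
  <body>
    <outline type="rss" text="Blog A" xmlUrl="https://a.example/feed.xml" htmlUrl="https://a.example/"/>
    <outline type="rss" text="Blog B" xmlUrl="https://b.example/atom.xml"/>
  </body>
</opml>"""

def feed_urls(opml_text: str) -> list[str]:
    """Return the xmlUrl of every <outline> that declares one."""
    root = ET.fromstring(opml_text)
    return [o.get("xmlUrl") for o in root.iter("outline") if o.get("xmlUrl")]

print(feed_urls(OPML))
# → ['https://a.example/feed.xml', 'https://b.example/atom.xml']
```

Each blogroll is curated by its owner, so an index like this just aggregates those per-site decisions without adding its own curation layer.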


Can you not check the RFCs?

You know the funny thing about this is that I have talked, relatively recently, to one of the very few cryptographers who was an author on a DNSSEC standard, and they wouldn't agree to the interview I want to do --- they're not sold enough on DNSSEC anymore.

The broader answer is: the relevant RFCs weren't authored by cryptography engineers. This was a major problem in the "old" IETF, before the cryptographers "took over" tls-wg and CFRG.

At any rate, the reason I asked in that particular place on the thread was that the preceding comment was attempting to draw a line between "sysadmins" who hate DNSSEC and the serious technologists who like it a lot.



The tag cloud part may be a challenge. Web feeds don't always tag their content.

I have a blog filter that does something similar (https://alexsci.com/rss-blogroll-network/discover/), but the UI I ended up with isn't great and too many things are uncategorized.


Kudos on your site effort and I immediately see your point.

In fact, I took your topmost entry with no helpful site/update tags and dove in a little to try to understand why an RSS-friendly blogger might not be passing along tags for better reader discovery.

It turns out my scarce-info test-case blogger has a Mastodon profile that immediately lists all these tags about himself [I've stripped it down] ...

#FrontEnd Developer #CSS #Halifax #London #Singapore Technical writer and rabbit-hole deep-diver Former Organiser for https://londonwebstandards.org & https://stateofthebrowser.com Interests: #Bushcraft #Outdoors #DnD #Fantasy #SciFi #HipHop #CSS #Eleventy #IndieWeb #OpenSource #OpenWeb

I conclude that if he knew site and post tags flowing into RSS would be of use, he'd probably make the tiny effort to wire up the descriptions.

Nonetheless, I merely crawled links for a minute to find this info, so I imagine something like the free tier of the Cloudflare crawling API might suffice over time as a simplistic automated fix to decorate blog sites with hints.

I mean, given that we're not trying to recreate PageRank, but just trying to tip the balance in favor of desirable initial discovery.


Very cool.

Crawling related sites for tags could work (open graph tags on the website are another good source). I'm wary of mixing data across contexts though. A blog and a Mastodon profile may intend to present a different face to the world or could discuss different topics.
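For concreteness, here's a minimal sketch of pulling Open Graph tags out of a page with Python's stdlib `html.parser` (no third-party crawler needed); the `OGParser` class name and the sample page are my own invention for illustration:

```python
from html.parser import HTMLParser

class OGParser(HTMLParser):
    """Collect <meta property="og:*" content="..."> pairs from a page."""
    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        prop = a.get("property", "")
        if prop.startswith("og:") and "content" in a:
            self.og[prop] = a["content"]

def og_tags(html: str) -> dict:
    p = OGParser()
    p.feed(html)
    return p.og

PAGE = """<html><head>
<meta property="og:title" content="My Blog"/>
<meta property="og:type" content="website"/>
<meta name="description" content="ignored: not an og: tag"/>
</head><body></body></html>"""

print(og_tags(PAGE))
# → {'og:title': 'My Blog', 'og:type': 'website'}
```

That gets you the site's self-description from its own context, which sidesteps the cross-context mixing concern above.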


One objection I have to the kagi smallweb approach is the avoidance of infrequently updated sites. Some of my favorite blogs post very rarely; but when they post it's a great read. When I discover a great new blog that hasn't been updated in years I'm excited to add it to my feed reader, because it's a really good signal that when they publish again it will be worth reading.

One of the many things I disagree with Scott Alexander on is that, to me, frequent blog updates signal poor quality, not excellent writing. It's hard to come up with an independent, evidence-based opinion on something worth sharing every week, but easy to post about whatever you've read lots of angry or scary posts about. People who post a lot also tend to have trouble finding useful things to do in their offline life. It is very unusual that he managed to be both a psychiatrist and a prolific blogger, and he quit the psychiatry job before he had children or other care responsibilities.

I have a "frequent post" section of my blog and a "deeper" section. Unless you're interested in the frequent posts they aren't in your face on my blog. It's kind of a best of both worlds type thing.

The frequent posts also let me quickly try out new methods of telling stories or presenting information, or new techniques. I think this tends to speed up how often I post larger-effort things because I can practice skills with frequent posts.

A good comparison would be a YouTuber with a Patreon. The YouTube channel gets the produced media, whereas the Patreon gets "cell phone in the moment" updates.

But I totally agree that when folks are straining to find things to post about, that can be problematic and annoying.


> frequent blog updates signal poor quality not excellent writing

It might be true, but there are exceptions, like ACOUP (history-focused), which is written by an ancient history professor.


I'm with you. Also, sometimes I'm specifically looking for some dusty old site that has long been forgotten about. Maybe I'm trying to find something I remember from ages ago. Or maybe I'm trying to deeply research something.

There's a lot more to fixing search than prioritizing recency. In fact, I think recency bias sometimes makes search worse.



To clarify, the criterion is less than 2 years since the last blog post.

You may want to clarify that on https://github.com/kagisearch/smallweb because the README there says:

> Blog has recent posts (<7 days old)

This may be different from the inclusion criteria for websites in general, but on first read it looks like it has to be very active.

I might have missed something while skimming it, but would assume others would miss it as well.


There are two criteria; I agree it's hard to skim:

* The blog must have a recent post, no older than 12 months, to meet the recency criteria for inclusion.

* Criteria for posts to show on the website: Blog has recent posts (<7 days old), The website can appear in an iframe

The latter criterion is for the website / post to appear in Kagi's random Small Web feature, where they display the blog post in an iframe. (So I think only posts from the last week are displayed there.) Being on the list should ensure that any new posts could be displayed in Small Web though, and presumably that the website is indexed in Kagi's Teclis index as well. At least, I really hope that the Teclis index is including all of those old blog posts too, and not discarding them.

EDIT: I just realized freediver actually is Vladimir - I'd love to know if Teclis does index all those older blog posts too. I assume it does index everything that is still present in the RSS feeds?


Thank you. I swear I read that three times and missed the other criteria until you pointed it out and then I found it. :/

Also, Kagi excludes non-English sites. Sad for mixed-language blogs like mine.

I know Kagi doesn't do it, but it is possible to specify the language in the feed (xml:lang) so that a feed reader can filter out languages the user doesn't understand from multi-language feeds. One challenge is that lots of bloggers forget to add that tag.
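As a sketch of how a reader could do that filtering: in Atom, `xml:lang` can sit on the feed or on individual entries, and entries without the attribute inherit the feed's value (standard XML language inheritance). The sample feed and the `entries_in` helper below are mine, assuming a well-formed Atom document:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ATOM = "{http://www.w3.org/2005/Atom}"

FEED = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>Mixed-language blog</title>
  <entry xml:lang="en"><title>Hello</title></entry>
  <entry xml:lang="de"><title>Hallo</title></entry>
  <entry><title>No tag, inherits feed language</title></entry>
</feed>"""

def entries_in(feed_text: str, wanted: str) -> list[str]:
    """Titles of entries whose (possibly inherited) language matches."""
    root = ET.fromstring(feed_text)
    default = root.get(XML_LANG, "")
    titles = []
    for entry in root.findall(ATOM + "entry"):
        lang = entry.get(XML_LANG, default)
        if lang.split("-")[0] == wanted:  # "en-GB" still matches "en"
            titles.append(entry.find(ATOM + "title").text)
    return titles

print(entries_in(FEED, "en"))
# → ['Hello', 'No tag, inherits feed language']
```

When bloggers omit the tag entirely (the common case), the reader has nothing to filter on, which is exactly the problem.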

start a small web directory for your language!

Is there a good free-but-subscriber-only solution for blogs? It seems like a contradiction, but in practice it may be manageable.

If it takes off in any amount, then LLMs will just subscribe and pull said data from sites at a reasonable pace (or not; it's free, so make many accounts).

A login wall, or an email newsletter with a summary on the open web.

~As far as I know, bucket names are public via certificate transparency logs.~ There are tools for collecting those names. Besides, you'd leak the subdomain to (typically) unencrypted DNS when you do a lookup, and maybe via SNI.

Edit: crossout incorrect info


I'm pretty sure buckets use star certs and thus the individual bucket names won't be in the transparency logs.

Ah, you're right, they are always wildcard certs. I think I was misremembering https://news.ycombinator.com/item?id=15826906, which guesses names based on CT logs.

In either case, the subdomain you use in DNS requests is not private. Attackers can collect those from passive DNS logs or in other ways.
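To make the mechanism concrete: in the DNS wire format (RFC 1035), the queried name travels as plaintext length-prefixed labels, so whoever handles or records the query sees the full hostname. A tiny sketch of that encoding, with a made-up bucket name for illustration:

```python
import struct

def encode_qname(hostname: str) -> bytes:
    """RFC 1035 name encoding: each label prefixed by its length, then a zero byte."""
    out = b""
    for label in hostname.split("."):
        out += struct.pack("B", len(label)) + label.encode("ascii")
    return out + b"\x00"

# Hypothetical "secret" bucket name: the label is right there in the packet bytes.
wire = encode_qname("secret-bucket.s3.amazonaws.com")
print(wire)
print(b"secret-bucket" in wire)  # → True
```

Nothing here requires breaking any crypto; the resolver (and anyone it shares logs with) simply receives the name in the clear.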


> Besides you'd leak the subdomain to (typically) unencrypted DNS when you do a lookup and maybe via SNI.

"Leak" is maybe a bit exaggerated, although if someone MitM'd you they'd definitely be able to see it. But "leak" makes it seem like it's broadcast somehow, which obviously it isn't.


No man-in-the-middle is needed; DNS queries are often collected into large datasets which can be analyzed by threat hunters or attackers. Check out passive DNS: https://www.spamhaus.com/resource-center/what-is-passive-dns...

You'd need to check the privacy policy of your DNS provider to know if they share the data with anyone else. I've commonly seen the source IP address considered PII, but not the content of the query. Cloudflare's DNS, for example, shares queries with APNIC for research purposes. https://developers.cloudflare.com/1.1.1.1/privacy/public-dns... Other providers share much more broadly.


> No man-in-the-middle is needed [...] Check out passive DNS

How does one execute this "passive DNS" without quite literally being on the receiving end, or at least sitting in between the sending and receiving ends? You're quite literally describing what I'm saying, which makes it less of a "leak" and more like "others might collect your data, even your ISP", which I'd say would be more accurate than "your DNS leaks".


There's a lot of online documentation about passive DNS. Here's one example

> Passive DNS is a historical database of how domains have resolved to IP addresses over time, collected from recursive DNS servers around the world. It has been an industry-standard tool for more than a decade.

> Spamhaus’ Passive DNS cluster handles more than 200 million DNS records per hour and stores hundreds of billions of records per month, providing you with access to a vast lake of threat intelligence data.

https://www.spamhaus.com/resource-center/what-is-passive-dns...


> collected from recursive DNS servers around the world

Yes, of course, because those DNS servers are literally receiving the queries, eg "receiving the data".

Again, there is nothing "leaking" here. That's like saying you leak what HTTP path you're requesting to a server when you're sending an HTTP request to that server. Of course, that's how the protocol works!


I think you are hung up on the word "leak".

Putting a secret subdomain in a DNS query shares it with the recursive resolver, whose privacy policy may permit them to share it with others. This is a common practice, and attackers have access to the aggregated datasets. You are correct that a third-party web server or CDN could share your HTTP path, but I am not aware of any examples, and most privacy policies should prohibit them from doing so. If your web server provider or CDN does this, change providers. DNS recursive resolvers are chosen client-side, so you can't always choose which one handles the query. Even privacy-focused DNS recursive resolvers share anonymized query data. They remove the source IP address, since it's PII, but still "leak" the secret subdomain.

Any time you send secret data such that it travels to an attacker-visible dataset, it is vulnerable to attack. I call that a leak, but we can use a different term.


> I think you are hung up on the word "leak".

What gave you that idea? Maybe because my initial comment started with:

> "Leak" is maybe a bit over-exaggerated...

And continues with why I think so?

I raised this sub-thread specifically because I got hung up on "leak"; that's the entire point of the conversation in my mind.


So nothing to do with your DNS queries at all? Why did you bring it up?

These days people are fearful of their work ending up in LLM training datasets. A private, but statically hosted, website is on a lot of people's minds. Most social networks have privacy settings these days, which feels like a missing feature of standard, static blogs.

That should scale pretty well. The HTTP fetch of posts/index.json could use conditional GET requests to avoid downloading the body when there are no changes. Static files are dirt cheap to serve.
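The conditional-GET dance is simple: remember the ETag from the last fetch, send it back as `If-None-Match`, and treat a 304 as "cache still fresh". Here's a network-free sketch of just that logic (the `request_headers`/`handle_response` helper names are mine; a real client would pass these headers to its HTTP library):

```python
def request_headers(cached_etag):
    """Headers for a conditional GET: only ask for the body if it changed."""
    return {"If-None-Match": cached_etag} if cached_etag else {}

def handle_response(status, new_etag, body, cache):
    """304 means our cached copy is still fresh; 200 replaces it."""
    if status == 304:
        return cache  # nothing downloaded beyond headers
    if status == 200:
        return {"etag": new_etag, "body": body}
    raise RuntimeError(f"unexpected status {status}")

# First fetch: server returns the body and an ETag.
cache = handle_response(200, '"v1"', b'{"posts": []}', {})
# Later fetches send the validator back...
assert request_headers(cache["etag"]) == {"If-None-Match": '"v1"'}
# ...and an unchanged file costs only a 304, no body transfer.
cache = handle_response(304, None, None, cache)
print(cache["etag"])  # → "v1"
```

Since most polls of a quiet feed end in a 304, the steady-state cost per subscriber is close to one round trip of headers.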

The commit you listed was merged upstream.

https://github.com/zigimg/zigimg/pull/313


So does that mean they contradicted their own no LLM policy?

The PR doesn't disclose that "an LLM did it", so maybe the project allowed a violation of their policy by mistake. I guess they could revert the commit if they happen to see the submitter's HN comment.

Dunno, but a commenter already noted that some projects are beginning to say "No LLM-generated PRs, but we'll accept your prompt", and another person answered that he'd seen that too.

It makes lots of sense to me.


I've never had a one-shot prompt ever work. It's always an interactive session to eventually get to the working solution.

The policy listed is for the Zig compiler; the commit in question is in a fork of a third-party library.

LOL, ah, whoops then =)
