Hacker News | neoromantique's comments


The vast majority of websites today can and should be static, which makes even aggressive LLM scraping a non-issue.


One of the things that a lot of LLM scrapers are fetching are git repositories. They could just use git clone to fetch everything at once. But instead, they fetch them commit by commit. That's about as static as you can get, and it is absolutely NOT a non-issue.


No... Basically all git servers have to generate the file contents, diffs etc. on-demand because they don't store static pages for every single possible combination of view parameters. Git repositories also typically don't store full copies of all versions of a file that have ever existed either; they're incremental. You could pre-render everything statically, but that could take up gigabytes or more for any repo of non-trivial size.


> Git repositories also typically don't store full copies of all versions of a file that have ever existed either; they're incremental

This is wrong. Git does store full copies.


git stores files as objects, which are stored as full copies, unless those objects are stored in packfiles and are deltified, in which case they're stored as deltas. https://codewords.recurse.com/issues/three/unpacking-git-pac...


Thank you for the insights.


... which, in the context that is being discussed, is unusual.


that's a pretty niche issue, but fairly easy to solve.

Statically prebuild the most common commits (the last XX) and heavily rate-limit deeper ones.


1. that doesn't appear to match the fetching patterns of the scrapers at all

2. 1M independent IPs hitting random commits from across a 25-year history is not, in fact, "easy to solve". It is addressable, but not easy ...

3. why should I have to do anything at all to deal with these scrapers? why is the onus not on them to do the right thing?


I did not imply that it does. I meant having a budget allocated for 'unauthenticated deep-history queries': when it's over, it's over, and you only handle dynamic fetching for authorized users until the cooldown expires.

Is it pretty? No, but git repo hosting is also a pretty niche thing overall.
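The budget idea can be sketched as a toy shell function (the counter file, the limit of 3, and the function name are all made up for illustration; a real server would track this per IP and reset the budget on a timer):

```shell
# Toy sketch of the budget above: a shared counter of allowed
# unauthenticated deep-history queries. Once spent, anonymous
# requests are denied until the counter is reset.
BUDGET_FILE=$(mktemp)
echo 3 > "$BUDGET_FILE"          # allow 3 unauthenticated deep queries

allow_deep_query() {             # $1 = "auth" or "anon"
    if [ "$1" = auth ]; then
        return 0                 # authorized users always get dynamic fetching
    fi
    n=$(cat "$BUDGET_FILE")
    if [ "$n" -le 0 ]; then
        return 1                 # budget spent: deny until reset/cooldown
    fi
    echo $((n - 1)) > "$BUDGET_FILE"
}
```

The fourth anonymous call in a row fails, while authenticated callers are never throttled.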


You don't lower the cost of killing through improved targeting; you lower it by thugs shooting people in broad daylight with no consequences.

I understand the argument that moving the decision-making power to a black box would clear the operator's conscience, yadda yadda yadda, but newsflash: the price of a human life is falling so quickly that I think we're far beyond the point where it matters.


Less severe than killing, you’re essentially describing the “broken windows” theory. https://en.wikipedia.org/wiki/Broken_windows_theory


Why does Israel use expensive precision munitions wherever possible rather than its stockpiles of much deadlier "dumb" ones?


maybe because they are trying to act ethically toward a murderous neighbor that is conducting asymmetric warfare and those are the best tools to accomplish that.

or, maybe because they came to the conclusion that the repercussions on the world stage of even more horrific media coming out of Gaza is too steep of a price to pay.

i don't know which, but i do know it is naive to conclude that because they COULD end the war in a day and did not, they are driven by morality and ethical concerns rather than pragmatic ones.


I didn’t say they were driven by morality, though I’m sure they are more so than Hamas. I just think what they’re doing is ethnic cleansing (which is not a compliment) rather than genocide. I’m actually pretty sure that most of the people who call it by “genocide” don’t know the difference between the two.


Expensive for whom? US taxpayers?


because it would be admitting to the world that it has said weapons.

Israel has always said it doesn't have nuclear weapons. They would have absolutely zero sympathy going forward from any major nation if they decided to drop a nuclear bomb on Gaza, and they want that land so rendering that land uninhabitable might not be a good idea.


The argument conveniently always goes such that Israel is the baddie.

Curious how that goes, especially since Israel's ulterior motives are always implied; they are not taken at their word.

And Islamists, who share their motives openly with anyone willing to listen, are ignored.


Genocide was literally in Hamas’s charter and yet somehow they’re the good guys because modern leftists can’t think past “colonialism bad”.


I never said they were the good guys. Fuck Hamas.

But let's not equate an inexperienced group of starved and impoverished guerilla fighters with a first-world, nuclear-armed genocidal ethnostate.


By dumb munitions I mean older bombs vs. JDAMs and the like.

Anyone who seriously utters the words 'nuclear weapon' and 'Gaza' together is basically admitting they have zero clue about the situation and are an uninformed LARPer for either side.


Korea


I don't know, just an anecdote:

I populated my Instagram/FB account with my interests (I mainly have the accounts to follow local racing leagues and marketplaces), and my feeds are mostly cars and tech stuff; seldom do I see any thirst traps (including in Reels).


>As a contrast, in the early web, plenty of people were hosting their own website, and messing around with all the basic tools available to see what novel thing they could create

I'm hoping that the centralized platforms, already full of slop and now facing an LLM-fueled implosion, will overflow and lead to a renaissance of sorts for the small and open web, niche communities, and decoupling from big tech.

It's already gaining traction among the young, as far as I can see.


Ask HN: How does one archive websites like this without being a d-ck?

I want to save this for offline use, but I think recursive wget is a bit poor manners. Is there an established way one should approach it, like getting it from an archive somehow?


As long as you don't mirror daily and use a rate limit, there is no reason you would be a dick doing it.

FWIW, I have a local copy of Sheldon Brown's website that I mirrored a few years back when they announced the shop would close, as I expected they would eventually shut down the website too. I don't know if his wife, who had her own space on the site, is still alive, nor if someone has taken over the maintenance.


A single user's one-off recursive wget seems fine? Browsers also support saving pages, IIRC, individual pages at the very least (and if saved to the same place, the links will work).

No doubt it's already on many archive sites, though; you could just fetch from them instead of the original?


I'm asking in a more general sense: whether there is a way to fetch this stuff directly from the Internet Archive or something along those lines.

Gotta hit the search I feel :)


In the old-web days, I just used wget with slow pacing (and by "pacing" I mean: I don't need it to be done today or even this week, so if it takes a rather long time then that's fine. Slow helped keep me from mucking up the latency on my dial-up connection, too.)

I don't think that's being a dick for old-web sites that still exist today. Most of the information is text, the photos tend to be small, it's all generally static (ie, light-weight to serve), and the implicit intent is for people to use it.

But it's pretty slow-moving, so getting it from archive.org would probably suffice if being zero-impact is the goal.

(Or, you know: Just email the dude that runs it like it's 1998 again, say hi, and ask. In this particular instance, it's still being maintained.)
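For the slow-paced wget approach described above, a polite invocation might look like this (all flags are real wget options, but the wait and rate values are arbitrary; the command is echoed rather than executed so the sketch makes no network requests when run):

```shell
# A polite one-off mirror: --wait/--random-wait pace the requests,
# --limit-rate caps bandwidth, --no-parent keeps the crawl inside
# the site, --convert-links/--adjust-extension make the copy
# browsable offline. Echoed only; remove the variable/echo to run it.
cmd="wget --mirror --no-parent --convert-links --adjust-extension \
--wait=2 --random-wait --limit-rate=50k https://www.sheldonbrown.com/"
echo "$cmd"
```

With `--wait=2` the mirror may take a long time on a large site, which is exactly the point: you don't need it done today.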


The Internet Archive probably has it already.


Yep, saved by ArchiveTeam a couple of years ago:

https://archive.fart.website/archivebot/viewer/?q=sheldonbro...


>Blaming DPRK's "economic mismanagement" while making no mention of the Western sanctions on DPRK which are the cause of its catastrophic economic and humanitarian situation

The catastrophic humanitarian situation IS the cause for the sanctions.


also the nukes. and shooting missiles over japan.

parent poster seems to want to ignore their decades of poor behavior and sheer brutality.

e.g. NK just executed people for watching squid game.


>First of all, the Guardian is known to be heavily biased against Musk.

Which is good, that is the sane position to take these days.

