> I mean there’s not 16B people in the world, so a row per person can be ruled out pretty easily
In a hypothetical "master dump", a mix of all the dumps ever leaked, you'd expect dozens if not more entries for every "real person" out there. Think about how many people had a yahoo account, then how many had several yahoo accounts, and then multiply it with hundreds of leaks out there. I can see the number getting into billions easily, just because of how many accounts people have on many platforms that got hacked in the past ~20 years.
Sure, 99% of those won't be active accounts anymore, but the passwords used serve as a signal, at least for "what kinds of passwords do people use". There's lots to be learned about wordwordnumber wordnumbernumber, and so on.
> There's lots to be learned about wordwordnumber wordnumbernumber, and so on.
i had a plan to do statistical studies of some password dumps to try and make a "compressed password list" that could generate password guesses on the fly, and i forgot why i didn't do it, but i'm sure it's because the "model" - the statistical dataset upon which the program would generate output, wouldn't really be that much smaller; at least not with my poor maths skills.
I'm assuming that someone who really knew what they were doing could get close to 20% - 15% of the full password list. I doubt i could do better than just compressing the dataset and extracting it on the fly.
> I doubt i could do better than just compressing the dataset and extracting it on the fly.
The meta in that field is to extract "rules" (i.e. for hashcat) from datasets. Then you run the rules over the encrypted dumps. Rules can be word^number^number, word^number^word^number, or letter^upper^number^lower... etc. Then you build a dictionary and dict + rules = passwords.
Pretty sure you can extract some nice signals nowadays with embeddings and what not.
In a hypothetical "master dump", a mix of all the dumps ever leaked, you'd expect dozens if not more entries for every "real person" out there. Think about how many people had a yahoo account, then how many had several yahoo accounts, and then multiply it with hundreds of leaks out there. I can see the number getting into billions easily, just because of how many accounts people have on many platforms that got hacked in the past ~20 years.
Sure, 99% of those won't be active accounts anymore, but the passwords used serve as a signal, at least for "what kinds of passwords do people use". There's lots to be learned about wordwordnumber wordnumbernumber, and so on.