Simple file downloads and CSV files are fantastic. I wish more folks would publish data in simple formats like this. Every time I have to fill a "shopping cart" for a US government data download I die a little.
There's a bunch of wrapper tools to make this particular pipeline easier. Also something like Datasette is great if you want a web view and some fancier features.
One nice thing about CSV files being zipped and served via the web is they can be streamed directly into the database incredibly fast without having to persist them anywhere (aside from the db).
You can load the zip file as a stream, read the CSV line by line, transform it, and then load it to the db using COPY FROM stdin (assuming Postgres).
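A minimal sketch of that pipeline in Python, assuming a psycopg2-style cursor (`copy_expert`); the transform, zip path, and table name are illustrative, not prescriptive:

```python
# Sketch: stream a zipped CSV into Postgres via COPY FROM STDIN, without
# persisting the CSV anywhere except the database itself.
import csv
import io
import zipfile

def transform(row):
    # Illustrative per-row transform: strip whitespace from every field.
    return [field.strip() for field in row]

def rows_to_copy_buffer(lines):
    """Parse CSV lines, transform each row, return a buffer for COPY FROM STDIN."""
    out = io.StringIO()
    writer = csv.writer(out)
    for row in csv.reader(lines):
        writer.writerow(transform(row))
    out.seek(0)
    return out

def load_zip(zip_path, table, conn):
    # Read the first member of the zip as a text stream -- no temp file needed.
    with zipfile.ZipFile(zip_path) as zf:
        name = zf.namelist()[0]
        with zf.open(name) as raw:
            lines = io.TextIOWrapper(raw, encoding="utf-8")
            buf = rows_to_copy_buffer(lines)
            with conn.cursor() as cur:
                # psycopg2-style; psycopg 3 would use cur.copy() instead.
                cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", buf)
            conn.commit()
```

For very large files you would chunk the buffer rather than materialize it all at once, but the shape of the pipeline is the same.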
It isn't. But that's easily mitigated with temp tables, ephemeral databases, COPY, etc.
Upstream can easily f-up and (accidentally) delete production data if you do this on a live db, which is why PostgreSQL and nearly all other DBs have a myriad of tools to solve this by not doing it directly on a production database.
I had this last week, but instead it was a 3rd party api and their service started returning null instead of true for the has_more property beyond the second page of results.
In either case, the solution is probably to check rough counts and error out if they aren't reasonable.
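A sketch of that "rough counts" sanity check: refuse to accept a new snapshot whose row count deviates too far from the previous load. The thresholds here are purely illustrative:

```python
# Guard against an upstream source silently truncating (or exploding) the data.
# max_drop / max_growth are illustrative thresholds, not recommendations.
def sane_row_count(new_count, previous_count, max_drop=0.5, max_growth=3.0):
    if previous_count == 0:
        return new_count > 0
    ratio = new_count / previous_count
    return max_drop <= ratio <= max_growth

assert sane_row_count(95, 100)        # small drift is fine
assert not sane_row_count(10, 100)    # suspicious: 90% of rows vanished
assert not sane_row_count(400, 100)   # suspicious: 4x growth overnight
```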
That’s not how COPY FROM works in postgres. You give it a csv and a table matching the structure and it hammers the data into the table faster than anything else can.
You should check out PowerShell; it supports converting CSV into in-memory structured data, and then you can run regular PowerShell queries on that data:
$> $csvData = @"
Name,Department,Salary
John Doe,IT,60000
Jane Smith,Finance,75000
Alice Johnson,HR,65000
Bob Anderson,IT,71000
"@;
$> $csvData
| ConvertFrom-Csv
| Select Name, Salary
| Sort Salary -Descending
Name Salary
---- ------
Jane Smith 75000
Bob Anderson 71000
Alice Johnson 65000
John Doe 60000
You can also then convert the results back into CSV by piping into ConvertTo-Csv
/tmp/> "Name,Department,Salary
::: John Doe,IT,60000
::: Jane Smith,Finance,75000
::: Alice Johnson,HR,65000
::: Bob Anderson,IT,71000" |
::: from csv |
::: select Name Salary |
::: sort-by -r Salary
╭───┬───────────────┬────────╮
│ # │ Name │ Salary │
├───┼───────────────┼────────┤
│ 0 │ Jane Smith │ 75000 │
│ 1 │ Bob Anderson │ 71000 │
│ 2 │ Alice Johnson │ 65000 │
│ 3 │ John Doe │ 60000 │
╰───┴───────────────┴────────╯
For something a bit more robust, check out DuckDB. It's a library you can embed; use it to run SQL on local files as well as to connect to databases, do joins, analytics, etc.
Agreed. The article mentioned DuckDB, and I'm here to thumbs-up the use of DuckDB wholeheartedly. If you like the world of public CSV files as data sources that you can query or cross-query, DuckDB is the tool for you. Just follow the demo on the DuckDB website and you'll be wow'd for sure.
I use both, and I have found it helpful to have nushell around when munging csv and parquet files, even when working with duckdb - I find it quicker to ask for the first few rows of a thing or do simple stuff with it, then deeper analysis with duckdb.
- Send the csv data from stdin (using echo and referred to in the command by -)
- Refer to the data in the query by stdin. You may also use the _t_N syntax (first table is _t_1, then _t_2, etc.), or the file name itself before the .csv extension if using files.
- Pipe the output to the table command for formatting.
- Also, the shape of the result is printed to stderr (the (4, 2) below).
$ echo 'Name,Department,Salary
John Doe,IT,60000
Jane Smith,Finance,75000
Alice Johnson,HR,65000
Bob Anderson,IT,71000' |
qsv sqlp - 'SELECT Name, Salary FROM stdin ORDER BY Salary DESC' |
qsv table
(4, 2)
Name Salary
Jane Smith 75000
Bob Anderson 71000
Alice Johnson 65000
John Doe 60000
You might find a lot of interesting tools in the Datasette ecosystem. Data dashboarding for SQLite with all sorts of import and export and visualization plugins.
What’s really interesting about it is that Awk is now, finally, getting support for CSV. But I bet a large amount of Awk+CSV use cases can be met with SQLite and SQL+CSV.
The clickhouse-local tool is also really great for querying a bunch of systems, and it has connectors and converters for a whole bunch of other systems and formats.
CSV would be great if there were one CSV, and if you absolutely guarantee that nobody has "improved" the CSV by editing it with a spreadsheet program (including opening it with a spreadsheet program) or their own deft little fingers in a text editor.
For example:
"Look, this contains \"quotes\"!",012345
Or:
"Look, this contains ""quotes""!",012345
Or, for some degenerate examples:
"Look, this contains "quotes"!",012345
Or:
Look, this contains "quotes"!,012345
Or the spoor of a spreadsheet:
"Look, this contains ""quotes""!",12345
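For what it's worth, here is how Python's `csv` module (a reasonably RFC 4180-ish parser) treats two of the variants above. The standard form with doubled quotes parses out of the box; the backslash-escaped form needs a non-default dialect:

```python
# Standard CSV doubles quotes inside a quoted field.
import csv
import io

standard = '"Look, this contains ""quotes""!",012345\n'
row = next(csv.reader(io.StringIO(standard)))
print(row)  # ['Look, this contains "quotes"!', '012345'] -- leading zero survives

# The default dialect does NOT understand backslash escapes and will
# mis-parse this variant; an explicit dialect recovers it.
backslashed = '"Look, this contains \\"quotes\\"!",012345\n'
row = next(csv.reader(io.StringIO(backslashed), escapechar="\\", doublequote=False))
print(row)  # ['Look, this contains "quotes"!', '012345']
```

Note that because the parser returns strings, the `012345` keeps its leading zero; the "spoor of a spreadsheet" case only appears once something re-interprets the field as a number.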
Theoretically, JSON isn't immune to being hand-hacked into a semi-coherent mess. In practice, people don't seem to do that to JSON files, at least not that I've seen. Ditto number problems, in that in JSON, serial numbers and such tend to be strings instead of integers that a "helpful" application can lop a few zeroes off of.
> CSV would be great if there were one CSV, and if you absolutely guarantee that nobody has "improved" the CSV by editing it with a spreadsheet program (including opening it with a spreadsheet program)
Practically no formats actually pass those rules. Even plain text is bound to be "improved" by text editors frequently (normalization of line endings, removal of data not in a known encoding, UTF BOM, UTF normalization, etc.).
Just don't do what, edit your data? The problem is that all JSON implementations agree on what a properly encoded JSON file looks like, and that's not true for CSV.
Arbitrary CSV is not, in general, round-trippable.
JSON is JavaScript Object Notation, and JavaScript itself can't handle large numbers like 2^65, or represent numbers like 2^55+75 accurately. The JSON spec doesn't limit numbers to 2^53 or 2^64, but it also doesn't guarantee that they will be supported.
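You can simulate the JavaScript behavior in Python's `json` module by forcing all integers through a 64-bit double (which is how JS stores every JSON number), and watch the precision disappear:

```python
# JS stores all JSON numbers as IEEE-754 doubles (53-bit mantissa).
# parse_int=float simulates that in Python and shows where precision is lost.
import json

exact = 2**55 + 75
assert json.loads(str(exact)) == exact            # Python keeps arbitrary precision
as_double = json.loads(str(exact), parse_int=float)
assert as_double != exact                         # a double cannot hold 2**55 + 75
```

This is why serial numbers and IDs in JSON APIs are so often strings: it sidesteps the question of what the consumer's number type can represent.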
In practice, you open the file a disappointed customer sends you because your software doesn't parse it. You resist yelling at whomever decided to ever do CSV support in your software, because you know what it's like. But the customer is important, so you either write a one-off converter for them to a more proper CSV format, or you add - not change - several lines of code in mainline to be able to also parse the customer's format. God forbid, maybe also write it. Repeat that a couple of times. See the mess increasing. Decide to use a so-called battle-tested CSV parser after all, only to find out it doesn't pass all your tests. Weep. Especially because at the same time you also support a binary format which is 100 times faster to read and write and has no issues whatsoever.
I'm not blind to the benefits of CSV, but oh boy, it would be nice if it were, for starters, actually comma-separated.
JSONL should replace CSV. It is standardized, and the escapes are mostly well specified. With one array per line, it is effectively CSV with "[" and "]" surrounding each line.
Regular JSON would work fine for a static file, and would make schemas and links (JSON-LD) possible. But then the file could be any structure. JSONL works better for systems that assume line-based records, and its files are more likely to have consistent, simple records.
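The conversion is nearly mechanical. A sketch in Python, emitting one JSON array per line so the line-per-record shape survives while the escaping becomes well defined:

```python
# Convert CSV text to JSON Lines: one JSON array per input row.
# Quoting ambiguity disappears because JSON string escaping is unambiguous.
import csv
import io
import json

def csv_to_jsonl(csv_text):
    return "\n".join(json.dumps(row) for row in csv.reader(io.StringIO(csv_text)))

sample = 'Name,Salary\n"Doe, John",60000\n'
print(csv_to_jsonl(sample))
# ["Name", "Salary"]
# ["Doe, John", "60000"]
```

Note all values come out as strings; deciding which columns are numbers is a schema question CSV never answered either.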
I can almost assure you it was an explicit requirement in an RFP that was copied from some master template. So not a good justification - but a justification in the eyes of the bureaucracy.
The book Recoding America has a lot of anecdotes to this effect; most of these situations reduce to a Congressional mandate that got misinterpreted along the way. My favorite was for an update to Social Security. The department in charge of the implementation swore that Congress was forcing them to build a facebook for doctors (literally where doctors could friend other doctors and message them). Congress had no such intention; it was actually 3rd party lobbying that wanted the requirement so they could build their own solution outside of government. Really crazy stuff.
> Congress had no such intention; it was actually 3rd party lobbying
Right, but 3rd party lobbying can't force anyone to do anything, whereas Congress can (and did) give this mandate the force of law. The fact that lobbyists got Congress to do something that they had "no such intention" to do is its own problem, but let's not lose sight of who is responsible for laws.
That's the interesting part of this story; Congress didn't think this requirement existed, neither did the lobbyists. But the language that congress adopted (with the consultation of this lobbying group) made the agencies _think_ it was what congress wanted. So the agency was left holding the bag for something no one actually wanted in the first place. Like a big game of telephone.
I agree with your broader point however. Congress needs to do a better job of owning outcomes.
I for one was tickled the first time I paid an NYC parking ticket online and had to add it to my cart, as if they might upsell me on some other more serious violation. Act now before it's too late!
When you're paying for things, and can pay for several things at the same time, it makes sense. I helped my mother pay her property taxes this year, and for two properties we had to make four payments. Without the shopping cart (or a functional equivalent) I would have had to enter my payment information four times instead of once.
I assume it's a vestige from the old days when you ordered actual tapes or printouts from government agencies. The notion of an order that had to be prepared is baked into how they think about the service and product.
All sorts of strange things happen with accessing US government data. But most agencies have a lot of excellent data available for free and motivated data scientists who want to make it available to you.
Some records are only released for a fee, so I always assumed implementing a standard interface to handle commercial transactions and making select items free was easier than maintaining separate commercial/noncommercial sites.
(As a side note, I can understand why in years past it would cost multiple cents per page to physically photocopy a federal document - but it is absolutely absurd that already-digitized documents, documents which are fundamentally part of the precedent that decides whether our behavior does or doesn't cause civil or criminal liability, are behind a paywall for a digital download!)
I’d go further and say that government websites and services in general are weird and often feel outdated, especially the more local you get.
Some of them look like they’re straight from the early 2000s.
Another thing is that privacy is dead and almost everything is deemed public information.
Your address, mugshot, voter rolls, you name it, it’s all deemed public information.
But once you actually want to access information that is useful to society as whole, more often than not it’s behind a paywall.
Despite this, they still call it “public” information because theoretically anyone can pay for it and get access.
It’s one of the first things I noticed when I moved to the US.
Another thing that I’ve noticed is that, if possible, there always needs to be a middle man inserted, a private corporation that can make a profit out of it.
You want to identify yourself to the government?
Well you’re gonna need an ID.me account or complete a quiz provided to you by LexisNexis or one of the credit reporting agencies.
Why? How is it that the government of all entities, isn’t capable of verifying me themselves?
Zooming out even further you’ll start to recognize even more ancient processes you interact with in daily life.
The whole banking situation and the backbone that’s running it is a great example.
The concept of pending transactions, checks and expensive and slow transfers is baffling to me.
It’s so weird, like ooh, aah this country is the pinnacle of technological innovation, yet in daily life there’s so much that hinges on ancient and suboptimal processes that you don’t see in, say, Western Europe.
My best guess is that this is because of a mix of lack of funding and politicians wanting to placate profit-seeking corporations.
Ironically, and I have no hard evidence for this because I'm too lazy to look into it, I suspect that in the long term it costs more to outsource it than it does to do it themselves.
I feel like the USA was probably ahead of the curve on things like the banking system in the 90s but removing all of that cruft and replacing with a v2 is not an easy thing to do. Especially when the population are generally skeptical of government and outsourcing to a private corporation is seen as the cheaper option. Short term it's potentially cheaper/quicker to outsource but long term maintenance contracts and lobbying to keep things as they are is much more expensive.
In the USA you find these whole industries that exist due to the inadequacies of the old systems. E.g. Venmo doesn't need to exist in Western Europe (and probably the rest of the world?) because person-to-person bank transfers are free and easy.
The web development companies that are subcontracted by the government agencies to repurpose their silly shopping cart software.
I will decline to share my personal anecdotes about these companies because I am like 10+ years out of date, but I can tell you that most of these companies seemed to have certain very specific non-technical things in common.
> Simple file downloads and CSV files are fantastic. I wish more folks would publish data in simple formats like this.
The document format doesn't seem to have much to do with the problem. I mean, if the CSV is replaced by a zipped JSON doc then the benefits are the same.
> Every time I have to fill a "shopping cart" for a US government data download I die a little.
Now that seems to be the real problem: too many hurdles in the way of a simple download of a statically-served file.
csvtool is probably what you're looking for, though I think the use case for JSON vs CSV is different as one is hierarchy-oriented and the other is flat.
> For CSV? DuckDB, Datasette, awk, pawk, and tons of others.
Those tools are either line- or table-oriented, and thus don't handle CSVs as a structured format. For instance, is there any easy way to use AWK to handle a doc where one line has N columns but some lines have N+1, with a random column entered in an undetermined point?
The CSV Spec (1) specifies that “Each line should contain the same number of fields throughout the file”, so what you describe would be a noncompliant file. Nevertheless, I would expect both awk and pawk to be able to work with it just fine.
Both pawk and the recent version of awk can parse the CSV format correctly (not just splitting on commas, but also handling the quoting rules for example).
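Whatever tool does the querying, it helps to flag "ragged" rows up front before handing the file to anything column-oriented. A small sketch of that check (the sample data is illustrative):

```python
# Flag rows whose field count differs from the header row -- the
# noncompliant case described above, where a stray column appears mid-file.
import csv
import io

def ragged_rows(csv_text):
    rows = list(csv.reader(io.StringIO(csv_text)))
    width = len(rows[0])
    return [i for i, row in enumerate(rows) if len(row) != width]

sample = "a,b,c\n1,2,3\n1,2,3,4\n"
print(ragged_rows(sample))  # [2] -- the third row has an extra column
```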
gron gives you both the complete key and value in a single line (which you then grep). Directly grep-ing a specific column in a CSV file isn't very straightforward.
You can search for companies, select the documents you’d like to see (like shareholder lists), then you go through a checkout process and pay 0 EUR (used to be like a few euros years ago), and then you can finally download your file. Still a super tedious process, but at least for free nowadays.
Quite literally yes. There are often multiple click throughs. Every Department, agency, sub-agency, all the way down to federal dog catcher has decided the most important thing is to invent a new way of getting data from them.
There are other ways to access the data on here, but they’re fragmented. It’s nicely organized here so it’s a bummer they make it hard to programmatically retrieve files once you find what you’re looking for.
Agreed! People are surprised, but CSV files (while I would not use them for everything) work great in low-tech environments, for instance, or when files must be produced by many different actors with different technical levels.
They also work nicely as an interop between different stacks (Cobol <-> Ruby).
One concrete example is the French standard for describing Electric Vehicle Charge Points, which is made of 2 parts (static = position & overall description, dynamic = current state, occupancy, etc.). Both "files" are just CSV files:
Files are produced directly by electrical charge points operators, which can have widely different sizes & technicality, so CSV works nicely in that case.
Lol. Tell me you never had to parse CSV files without telling me.
CSV files can be a nightmare to work with depending where they come from and various liberties that were taken when generating the file or reading the file.
Use a goddamn battle-tested library, people, and don't reinvent the wheel. /oldman rant over
Yes, you eventually realize the hard way that "CSV" is actually a blanket term for various similar formats, with different rules and conventions. The way one program outputs CSVs may be completely different from another.
Actually, I worked on OpenAddresses, a project to parse thousands of CSV files containing address data from many different county, state, and national systems around the world. It really wasn't that hard, even Python's basic csv parser was sufficient to the task (and there are plenty of better options).
Looking at the parser I see a few problems with it just by skimming the code. I'm not saying it wouldn't work or that it's not good enough for certain purposes.
Oh yeah? Such as? What purposes do you think it wouldn't be good for? The author will probably be interested in your feedback. Apparently it's getting over a million downloads a week on npm.
I have not used it, so this is mostly speculation, but I would be curious about character-set handling, mixed line-ending handling, large-file handling, and invalid or almost-valid file handling.
The author of the library probably has learned, the hard way many many lessons (and probably also decided to prioritize some of the requested issues / feature requests along the way).
The above is not meant as a ding on the project itself and I am sure it is used successfully by many people. The point here is that your claim that you can easily write a csv parser in 200 lines of code does not hold water. It's anything but easy and you should use a battle tested library and not reinvent the wheel.
If you had read my original comment, you would see I didn't claim it's easy to do, only that it can be done in around 200 lines. That's clearly the case.
Character set handling isn't really an issue for JavaScript, as strings are always UTF-16.
When a file is read into a string, the runtime handles the needed conversion.
As for handling large files, I've used this with 50 MB CSVs, which would need a 32-bit integer to index. Is that large enough? It's not like Windows Notepad, which can only read 64 KB files.
Windows notepad can read multiple megabyte files. It can read files that are hundreds of megabytes. It's not pleasant, loading is incredibly slow, and resizing the window when reflow is enabled makes it take that much longer, but it's definitely possible.
My point was that it's not trivial and it's hard to get it right. The way I read your comment was that it's not hard and can easily be done in 200 lines. It's possible I misread it.
I think the original point I was making still stands.
> Every time I have to fill a "shopping cart" for a US government data download I die a little.
I've worked in gov/civic tech for 5+ years and, as you're probably aware, there is now a highly lucrative business in centralizing and consequently selling easy access to this data to lobbyists/fortune 500s/nonprofits.
USDS/18f are sort of addressing this, but haven't really made a dent in the market afaik since they're focusing on modernizing specific agencies rather than centralizing everything.
I don't know if this is the cart you're talking about but the USGS does this for LiDAR data and yeah I'm with you I die a little every time I use that site. I love that the data is available but why in the world do we have to cart everything? Just give me a one click download
The main thing for me is that the data isn't broken down in a way that ends up with you having to make a thousand requests and collate them. If the client is expected to be able to pull the whole set of data, provide the whole set. There are other ways to rate limit.
clinicaltrials.gov lets you save a search term and then download the results as a zip. But there's an XML file for each trial in the search results.
One of the first things I played around with was using Python to get that file, unzip it, then iterate through the XML files grabbing the info I wanted and putting it into ElasticSearch to make it searchable, then putting an Angular front end on it.
I used to have it published somewhere but I think I let it all die. :(