
Simple file downloads and CSV files are fantastic. I wish more folks would publish data in simple formats like this. Every time I have to fill a "shopping cart" for a US government data download I die a little.

There are a bunch of wrapper tools to make this particular pipeline easier. Also, something like Datasette is great if you want a web view and some fancier features.



One nice thing about CSV files being zipped and served via the web is they can be streamed directly into the database incredibly fast without having to persist them anywhere (aside from the db).

You can load the zip file as a stream, read the CSV line by line, transform it, and then load it to the db using COPY FROM stdin (assuming Postgres).
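A minimal stdlib-only sketch of that pipeline (the table and column names in the comment, and the psycopg COPY call, are illustrative, not from any particular setup):

```python
import csv
import io
import zipfile

def stream_zip_csv(fileobj, transform=lambda row: row):
    """Yield (optionally transformed) rows from the first CSV member
    of a zip, without persisting anything to disk. The zip's central
    directory lives at the end of the file, so the download has to be
    buffered (e.g. into io.BytesIO) or otherwise seekable."""
    with zipfile.ZipFile(fileobj) as zf:
        member = zf.namelist()[0]
        with zf.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.reader(text):
                yield transform(row)

# With psycopg 3, the rows could then feed COPY directly, e.g.:
#   with cur.copy("COPY target (a, b) FROM STDIN") as copy:
#       for row in stream_zip_csv(buffered_download):
#           copy.write_row(row)
```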


Definitely, it is much easier to stream CSV than, say, JSON or XML (even if JSONL and SAX parsers exist, etc.).


That doesn't sound like an amazingly safe idea


It isn't. But that's easily mitigated with temp tables, an ephemeral database, COPY, etc.

Upstream can easily f-up and (accidentally) delete production data if you do this on a live db. Which is why PostgreSQL and nearly all other DBs have a myriad of tools to solve this by not doing it directly on a production database.
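A sketch of that staging-table-and-swap idea, using the stdlib sqlite3 module so it's self-contained (in Postgres you'd do the same with a staging table inside a transaction; the table and column names here are made up):

```python
import sqlite3

def safe_reload(conn, rows, min_expected=1):
    """Load new data into a staging table; only swap it in if it
    passes a sanity check, so a bad upstream file can't wipe prod."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE staging (name TEXT, value INTEGER)")
    cur.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    (n,) = cur.execute("SELECT COUNT(*) FROM staging").fetchone()
    if n < min_expected:
        # refuse the swap; the live table is untouched
        cur.execute("DROP TABLE staging")
        raise ValueError(f"only {n} rows in upload, refusing to swap")
    cur.execute("DROP TABLE IF EXISTS live")
    cur.execute("ALTER TABLE staging RENAME TO live")
    conn.commit()
```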


Maybe I'm missing something but I don't see how it's possible for a COPY statement alone to remove existing data.


If in the regular scenario you load 10000 rows of new data and delete the old then it’s fine.

What if someone screws up the zip and instead of 10000 today, it’s only 10?


I had this last week, but instead it was a 3rd-party API whose service started returning null instead of true for the has_more property beyond the second page of results.

In either case, the solution is probably to check rough counts and error if they're not reasonable.


I think the general rule is: don't replace the prod db until the new one passes tests.


What specific risks do you foresee with this approach?


Seems totally fine to me. As long as you can roll back if the download is truncated or the crc checksum doesn’t match.


> or the crc checksum doesn’t match.

Which wouldn't exist if the API is simply a single CSV file?

At least with a zip, the CRC exists (an incomplete zip file is detectable; an incomplete but syntactically correct CSV file is not).
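A quick illustration of the difference, assuming Python: the zip's CRCs and central directory make truncation detectable, while a truncated CSV parses without complaint:

```python
import io
import zipfile

def looks_complete_zip(data: bytes) -> bool:
    """True if the bytes form a structurally complete zip whose
    members all pass their CRC checks."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return zf.testzip() is None  # None = every member's CRC is OK
    except zipfile.BadZipFile:
        return False
```

Chop any bytes off the end of a zip and this returns False; chop rows off the end of a CSV and any parser will happily accept what's left.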


DROP DATABASE blah;


That’s not how COPY FROM works in Postgres. You give it a CSV and a table matching the structure, and it hammers the data into the table faster than anything else can.


If you feel risky, try a Foreign Data Wrapper ;)


It blows my mind that you can use SQLite with CSV as input and then query it. It sounds so logical and useful, yet I never came across it.

We have lots of reporting in CSV; can't wait to start using it to run queries quickly.
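In the SQLite CLI that's the `.import` command (recent versions take a `--csv` flag); a rough stdlib-Python sketch of the same idea, with the table name `t` chosen arbitrarily:

```python
import csv
import io
import sqlite3

def query_csv(csv_text, sql):
    """Load CSV text into an in-memory SQLite table named `t`
    and run a query against it. First row is taken as the header.
    Everything is stored as TEXT, mirroring a plain .import."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f'"{c}"' for c in header)
    conn.execute(f"CREATE TABLE t ({cols})")
    marks = ", ".join("?" for _ in header)
    conn.executemany(f"INSERT INTO t VALUES ({marks})", data)
    return conn.execute(sql).fetchall()
```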


You should check out PowerShell; it supports converting CSV into in-memory structured data, and then you can run regular PowerShell queries on that data:

    $> $csvData = @"
       Name,Department,Salary
       John Doe,IT,60000
       Jane Smith,Finance,75000
       Alice Johnson,HR,65000
       Bob Anderson,IT,71000
       "@;
    
    $> $csvData
         | ConvertFrom-Csv
         | Select Name, Salary
         | Sort Salary -Descending
    
       Name          Salary
       ----          ------
       Jane Smith    75000
       Bob Anderson  71000
       Alice Johnson 65000
       John Doe      60000
You can also then convert the results back into CSV by piping into ConvertTo-Csv

    $> $csvData
         | ConvertFrom-Csv
         | Select Name, Salary
         | Sort Salary -Descending
         | ConvertTo-Csv
         
       "Name","Salary"
       "Jane Smith","75000"
       "Bob Anderson","71000"
       "Alice Johnson","65000"
       "John Doe","60000"


nushell does too:

    /tmp/> "Name,Department,Salary
    ::: John Doe,IT,60000
    ::: Jane Smith,Finance,75000
    ::: Alice Johnson,HR,65000
    ::: Bob Anderson,IT,71000" |
    :::     from csv | 
    :::     select Name Salary | 
    :::     sort-by -r Salary
    ╭───┬───────────────┬────────╮
    │ # │     Name      │ Salary │
    ├───┼───────────────┼────────┤
    │ 0 │ Jane Smith    │  75000 │
    │ 1 │ Bob Anderson  │  71000 │
    │ 2 │ Alice Johnson │  65000 │
    │ 3 │ John Doe      │  60000 │
    ╰───┴───────────────┴────────╯


for something a bit more robust, check out DuckDB. It's a library you can embed, use it to run SQL on local files as well as connect to databases, do joins, analytics, etc.


Agreed. The article mentioned DuckDB and I'm here to thumbs-up the use of DuckDB wholeheartedly. If you like the world of public CSV files as data sources that you can query or cross-query, DuckDB is the tool for you. Just follow the demo on the DuckDB website and you'll be wow'd for sure.


I use both, and I have found it helpful to have nushell around when munging csv and parquet files, even when working with duckdb - I find it quicker to ask for the first few rows of a thing or do simple stuff with it, then deeper analysis with duckdb.

They're a powerful pairing


Crazy how similar the commands are.


qsv (https://github.com/jqnatividad/qsv) also has a sqlp command which lets you run Polars SQL queries on CSV(s).

Here I'll:

  - Send the csv data from stdin (using echo and referred to in the command by -)
  - Refer to the data in the query by stdin. You may also use the _t_N syntax (first table is _t_1, then _t_2, etc.), or the file name itself before the .csv extension if we were using files.
  - Pipe the output to the table command for formatting.
  - Also, the shape of the result is printed to stderr (the (4, 2) below).

  $ echo 'Name,Department,Salary
    John Doe,IT,60000
    Jane Smith,Finance,75000
    Alice Johnson,HR,65000
    Bob Anderson,IT,71000' |
    qsv sqlp - 'SELECT Name, Salary FROM stdin ORDER BY Salary DESC' |
    qsv table
    
    (4, 2)
    Name           Salary
    Jane Smith     75000
    Bob Anderson   71000
    Alice Johnson  65000
    John Doe       60000


You do realize that the article does this too? I.e., in-memory and auto-inference.


You might find a lot of interesting tools in the Datasette ecosystem. Data dashboarding for SQLite with all sorts of import and export and visualization plugins.

https://datasette.io/


What’s really interesting about it is that Awk is now, finally, getting support for CSV. But I bet a large amount of Awk+CSV use cases can be met with SQLite and SQL+CSV.


AWK's new CSV and UTF-8 support is great, but when querying data I think in terms of SQL.


The clickhouse-local tool is also really great for querying a bunch of systems, and it has connectors and converters for a whole bunch of other systems and formats.


Have you used duckdb? It's great for that.


Look at DuckDB. Queries against CSV, Parquet, JSON, locally or via HTTP. It’s like SQLite, but faster and better.


Or the vnlog tools can do this. There are many ways to do data processing on the commandline now.


There's a chance this HN post is going to more than halve my awk-usage....


Perl with the right modules makes data-janitor situations much simpler.


CSV would be great if there were one CSV, and if you absolutely guarantee that nobody has "improved" the CSV by editing it with a spreadsheet program (including opening it with a spreadsheet program) or their own deft little fingers in a text editor.

For example:

    "Look, this contains \"quotes\"!",012345
Or:

    "Look, this contains ""quotes""!",012345
Or, for some degenerate examples:

    "Look, this contains "quotes"!",012345
Or:

    Look, this contains "quotes"!,012345
Or the spoor of a spreadsheet:

    "Look, this contains ""quotes""!",12345
Theoretically, JSON isn't immune to being hand-hacked into a semi-coherent mess. In practice, people don't seem to do that to JSON files, at least not that I've seen. Ditto number problems, in that in JSON, serial numbers and such tend to be strings instead of integers a "helpful" application can lop a few zeroes off of.
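For what it's worth, Python's csv module can be coaxed into reading both the doubled-quote and the backslash-escaped variants above; the dialect knobs are exactly where the "many CSVs" problem lives:

```python
import csv
import io

line_doubled = '"Look, this contains ""quotes""!",012345\n'
line_escaped = '"Look, this contains \\"quotes\\"!",012345\n'

# RFC 4180 style: quotes are doubled (the csv module's default dialect)
row1 = next(csv.reader(io.StringIO(line_doubled)))

# Backslash style: quotes are escaped, doubling disabled
row2 = next(csv.reader(io.StringIO(line_escaped),
                       escapechar="\\", doublequote=False))

# Both decode to the same field -- and the leading zero survives
# only because everything stays a string
assert row1 == row2 == ['Look, this contains "quotes"!', "012345"]
```

The catch, of course, is that you have to know (or sniff) which dialect you were handed before you can pick the right knobs.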


> CSV would be great if there were one CSV, and if you absolutely guarantee that nobody has "improved" the CSV by editing it with a spreadsheet program (including opening it with a spreadsheet program)

Practically no formats actually pass those rules. Even plain text is bound to be "improved" by text editors frequently (normalization of line endings, removal of data not in a known encoding, UTF BOMs, UTF normalization, etc.)

Just don't do that.


Just don't do what, edit your data? The problem is all JSON implementations agree on what a properly encoded JSON file looks like, and that's not true for CSV.

Arbitrary CSV is not, in general, round-trippable.


> all JSON implementations agree what a properly encoded JSON file looks like

You clearly haven't had to deal with tools that cannot read and write a JSON file without messing up and truncating large numbers.


JSON is JavaScript Object Notation. JavaScript itself doesn't support large numbers like 2^65, or exact integers like 2^55+75. The JSON spec doesn't limit numbers to 2^53 or 2^64, but it also doesn't guarantee that larger ones will be supported.


I mean in practice you open up the CSV file, figure out the escaping rules, change one line of your code, and get on with your life.


You're lucky.

In practice, you open the file a disappointed customer sends you because your software doesn't parse it. You resist yelling at whoever decided to ever do CSV support in your software, because you know what it's like. But the customer is important, so you either write a one-off converter for them to a more proper CSV format, or you add - not change - several lines of code in mainline to be able to also parse the customer's format. God forbid maybe also write it. Repeat that a couple of times. See the mess increasing. Decide to use a so-called battle-tested CSV parser after all, only to find out it doesn't pass all your tests. Weep. Especially because at the same time you also support a binary format which is 100 times faster to read and write and has no issues whatsoever.

I'm not blind to the benefits of CSV, but oh boy it would be nice if it were, for starters, actually comma-separated.


Stick to files conforming to RFC-4180 then


JSONL should replace CSV. It is standardized and the escaping is mostly well specified. A file of JSON arrays, one per line, is effectively CSV with "[" and "]" surrounding each line.

Regular JSON would work fine for a static file, and would make schemas and links (JSON-LD) possible. But then the file could be any structure. JSONL works better for systems that assume line-based records, and its files are more likely to have consistent, simple records.
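A minimal sketch of the JSONL round trip, using only the stdlib -- note that escaping and numbers-vs-strings are handled by the JSON layer, not by convention:

```python
import json

rows = [
    {"name": "Jane Smith", "salary": 75000},
    {"name": 'John "JD" Doe', "salary": 60000},
]

# Write: one JSON document per line -- quoting is json's problem now
jsonl = "\n".join(json.dumps(r) for r in rows)

# Read: parse line by line, streaming just like CSV
parsed = [json.loads(line) for line in jsonl.splitlines()]
assert parsed == rows
```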


I've never stopped to question the absurdity of the gov data shopping cart thing. WHY!?!? Is there some justification for that?


I can almost assure you it was an explicit requirement in an RFP that was copied from some master template. So not a good justification - but a justification in the eyes of the bureaucracy.

The book Recoding America has a lot of anecdotes to this effect; most of these situations reduce to a Congressional mandate that got misinterpreted along the way. My favorite was for an update to Social Security. The department in charge of the implementation swore that Congress was forcing them to build a facebook for doctors (literally where doctors could friend other doctors and message them). Congress had no such intention; it was actually 3rd party lobbying that wanted the requirement so they could build their own solution outside of government. Really crazy stuff.


> Congress had no such intention; it was actually 3rd party lobbying

Right, but 3rd party lobbying can't force anyone to do anything, whereas Congress can (and did) give this mandate the force of law. The fact that lobbyists got Congress to do something that they had "no such intention" to do is its own problem, but let's not lose sight of who is responsible for laws.


That's the interesting part of this story; Congress didn't think this requirement existed, neither did the lobbyists. But the language that Congress adopted (with the consultation of this lobbying group) made the agencies _think_ it was what Congress wanted. So the agency was left holding the bag for something no one actually wanted in the first place. Like a big game of telephone.

I agree with your broader point however. Congress needs to do a better job of owning outcomes.


I for one was tickled the first time I paid an NYC parking ticket online and had to add it to my cart, as if they might upsell me on some other more serious violation. Act now before it's too late!


When you're paying for things, and can pay for several things at the same time, it makes sense. I helped my mother pay her property taxes this year, and for two properties we had to make four payments. Without the shopping cart (or a functional equivalent) I would have had to enter my payment information four times instead of once.


And if you don't like the service you have free returns for 14 days no questions asked


I assume it's a vestige from the old days when you ordered actual tapes or printouts from government agencies. The notion of an order that had to be prepared is baked into how they think about the service and product.

All sorts of strange things happen with accessing US government data. But most agencies have a lot of excellent data available for free and motivated data scientists who want to make it available to you.


Some records are only released for a fee, so I always assumed implementing a standard interface to handle commercial transactions and making select items free was easier than maintaining separate commercial/noncommercial sites.


Which in itself is fairly antidemocratic.


Probably because it’s the out-of-the-box functionality on ServiceNow or whatever tool they’re using.


My guess would be to gather information about who is using the data - and present it to the stakeholders.

Sometimes they send questionnaires to data consumers.


Probably made by the same people who made https://pacer.uscourts.gov/pacer-pricing-how-fees-work ...

(As a side note, I can understand why in years past it would cost multiple cents per page to physically photocopy a federal document - but it is absolutely absurd that already-digitized documents, documents which are fundamentally part of the precedent that decides whether our behavior does or doesn't cause civil or criminal liability, are behind a paywall for a digital download!)


Pacer physically hurts to use. They should have to pay us for having to use it.


I’d go further and say that government websites and services in general are weird and often feel outdated, especially the more local you get.

Some of them look like they’re straight from the early 2000s.

Another thing is that privacy is dead and almost everything is deemed public information. Your address, mugshot, voter rolls, you name it, it’s all deemed public information.

But once you actually want to access information that is useful to society as whole, more often than not it’s behind a paywall. Despite this, they still call it “public” information because theoretically anyone can pay for it and get access.

It’s one of the first things I noticed when I moved to the US.

Another thing that I’ve noticed is that, if possible, there always needs to be a middle man inserted, a private corporation that can make a profit out of it.

You want to identify yourself to the government? Well you’re gonna need an ID.me account or complete a quiz provided to you by LexisNexis or one of the credit reporting agencies.

Why? How is it that the government of all entities, isn’t capable of verifying me themselves?

Zooming out even further you’ll start to recognize even more ancient processes you interact with in daily life.

The whole banking situation and the backbone that’s running it is a great example. The concept of pending transactions, checks and expensive and slow transfers is baffling to me.

It’s so weird, like ooh, aah this country is the pinnacle of technological innovation, yet in daily life there’s so much that hinges on ancient and suboptimal processes that you don’t see in, say, Western Europe.

My best guess is that this is because of a mix of lack of funding and politicians wanting to placate profit-seeking corporations.

Ironically, and I have no hard evidence for this because I'm too lazy to look into it, I suspect that in the long term it costs more to outsource it than it does to do it themselves.

/rant


I feel like the USA was probably ahead of the curve on things like the banking system in the 90s, but removing all of that cruft and replacing it with a v2 is not an easy thing to do. Especially when the population is generally skeptical of government, and outsourcing to a private corporation is seen as the cheaper option. Short term it's potentially cheaper/quicker to outsource, but long term, maintenance contracts and lobbying to keep things as they are cost much more.

In the USA you find whole industries that exist due to the inadequacies of the old systems. E.g. Venmo doesn't need to exist in Western Europe (and probably the rest of the world?) because person-to-person bank transactions are free and easy.


Bureaucracy = $$$


Who is making money with shopping cart-style downloads?


The web development companies that are subcontracted by the government agencies to repurpose their silly shopping cart software.

I will decline to share my personal anecdotes about these companies because I am like 10+ years out of date, but I can tell you that most of these companies seemed to have certain very specific non-technical things in common.


> Simple file downloads and CSV files are fantastic. I wish more folks would publish data in simple formats like this.

The document format doesn't seem to have much to do with the problem. I mean, if the CSV is replaced by a zipped JSON doc then the benefits are the same.

> Every time I have to fill a "shopping cart" for a US government data download I die a little.

Now that seems to be the real problem: too many hurdles in the way of a simple download of a statically-served file.


> if the CSV is replaced by a zipped JSON doc then the benefits are the same.

Being able to use things like jq and gron might make simple use cases extremely straightforward. I'm not aware of anything similarly nimble for CSV.


csvtool is probably what you're looking for, though I think the use case for JSON vs CSV is different as one is hierarchy-oriented and the other is flat.


For CSV? DuckDB, Datasette, awk, pawk, and tons of others.


> For CSV? DuckDB, Datasette, awk, pawk, and tons of others.

Those tools are either line- or table-oriented, and thus don't handle CSVs as a structured format. For instance, is there any easy way to use AWK to handle a doc where one line has N columns but some lines have N+1, with a random column inserted at an undetermined point?


The CSV Spec (1) specifies that “Each line should contain the same number of fields throughout the file”, so what you describe would be a noncompliant file. Nevertheless, I would expect both awk and pawk to be able to work with it just fine.

Both pawk and the recent version of awk can parse the CSV format correctly (not just splitting on commas, but also handling the quoting rules for example).

1: https://www.rfc-editor.org/rfc/rfc4180
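That said, a ragged file is easy to handle defensively in a few lines of Python (a sketch; the split into clean vs. suspect rows is one possible policy, not a standard):

```python
import csv
import io

def read_ragged(csv_text):
    """Split rows into clean ones (matching the header width) and
    suspect ones, instead of letting a stray column silently shift
    data into the wrong fields."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    clean, suspect = [], []
    for row in reader:
        (clean if len(row) == len(header) else suspect).append(row)
    return header, clean, suspect
```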


You don't need gron with CSV; normal grep will work wonders.


gron gives you both the complete key and value in a single line (which you then grep). Directly grep-ing a specific column in a CSV file isn't very straightforward.


Can you explain what the shopping cart is?

I mean... do they make you select one or more files, then navigate to another page to download your selected files?


Sounds like Germany’s company register[1].

You can search for companies, select the documents you’d like to see (like shareholder lists), then you go through a checkout process and pay 0 EUR (used to be like a few euros years ago), and then you can finally download your file. Still a super tedious process, but at least for free nowadays.

[1] https://www.unternehmensregister.de/ureg/


Quite literally yes. There are often multiple click throughs. Every Department, agency, sub-agency, all the way down to federal dog catcher has decided the most important thing is to invent a new way of getting data from them.


Solutions vary as the budgets and talent assigned to the projects do.

Federal, State, local and hyper local solutions cannot be the same unless the financier is also the same.


Pretty much. USGS’s EarthExplorer requires a login just to download free files https://earthexplorer.usgs.gov/

There are other ways to access the data on here, but they’re fragmented. It’s nicely organized here so it’s a bummer they make it hard to programmatically retrieve files once you find what you’re looking for.


Agreed! People are surprised, but CSV files (while I would not use them for everything) work great in low-tech environments, for instance, or when files must be produced by many different actors with different technical levels.

They also work nicely as an interop between different stacks (Cobol <-> Ruby).

One concrete example is the French standard to describe Electrical Vehicles Charge Points, which is made of 2 parts (static = position & overall description, dynamic = current state, occupancy etc). Both "files" are just CSV files:

https://github.com/etalab/schema-irve

Both sub-formats are specified via TableSchema (https://frictionlessdata.io/schemas/table-schema.json).

Files are produced directly by electrical charge points operators, which can have widely different sizes & technicality, so CSV works nicely in that case.


I worked for a company that used TSV for huge datasets. It was extremely efficient, far more so than any database.


TSV is criminally underrated compared to CSV.


Lol. Tell me you never had to parse CSV files without telling me.

CSV files can be a nightmare to work with depending where they come from and various liberties that were taken when generating the file or reading the file.

Use a goddamn battle-tested library, people, and don't reinvent the wheel. /oldman rant over


Yes, you eventually realize the hard way that "CSV" is actually a blanket term for various similar formats, with different rules and conventions. The way one program outputs CSVs may be completely different from another.


Actually, I worked on OpenAddresses, a project to parse thousands of CSV files containing address data from many different county, state, and national systems around the world. It really wasn't that hard, even Python's basic csv parser was sufficient to the task (and there are plenty of better options).

CSV is remarkably robust in practice.


Hey, CSV is hard, guys.

I've found template injection in a CSV upload before because they didn't anticipate a doublequote being syntactically relevant or something.

It was my job to find these things, and I still felt betrayed by a file format I didn't realize wasn't just comma-separated values.


Someone can find all these issues on 200 lines of code :) see sibling comment


This isn't my code, but gives you an idea of the level of complexity involved. But don't reimplement what you don't need to.

https://github.com/mafintosh/csv-parser/blob/master/index.js


To be clear my comment was meant as a joke.

Looking at the parser I see a few problems with it just by skimming the code. I'm not saying it wouldn't work or that it's not good enough for certain purposes.


Oh yeah? Such as? What purposes do you think it wouldn't be good for? The author will probably be interested in your feedback. Apparently it's getting over a million downloads a week on npm.


I have not used it, so this is mostly speculation, but I would be curious about character-set handling, mixed line-ending handling, large-file handling, and invalid or almost-valid file handling.

You can pick on some of the corner case issues here: https://github.com/mafintosh/csv-parser/issues Also look at ones that were solved. https://github.com/mafintosh/csv-parser/pulls

Some interesting ones: https://github.com/mafintosh/csv-parser/pull/121 https://github.com/mafintosh/csv-parser/pull/151 https://github.com/mafintosh/csv-parser/issues/218

The author of the library probably has learned, the hard way many many lessons (and probably also decided to prioritize some of the requested issues / feature requests along the way).

The above is not meant as a ding on the project itself and I am sure it is used successfully by many people. The point here is that your claim that you can easily write a csv parser in 200 lines of code does not hold water. It's anything but easy and you should use a battle tested library and not reinvent the wheel.


If you had read my original comment, you would see I didn't claim it's easy to do, only that it can be done in around 200 lines. That's clearly the case.

Character set handling isn't really an issue for JavaScript, as strings are always UTF-16. When a file is read into a string, the runtime handles the needed conversion.

As for handling large files, I've used this with 50 MB CSVs, which would need a 32-bit integer to index. Is that large enough? It's not like Windows Notepad, which can only read 64 KB files.


Windows notepad can read multiple megabyte files. It can read files that are hundreds of megabytes. It's not pleasant, loading is incredibly slow, and resizing the window when reflow is enabled makes it take that much longer, but it's definitely possible.


My point was that it's not trivial and it's hard to get it right. The way I read your comment was that it's not hard and can easily be done in 200 lines. It's possible I misread it.

I think the original point I was making still stands.


It's not as bad as all that. There are some gotchas, sure, but you can cover them all with about 200 lines of code.

However, I would recommend using a tested library to do the parsing, sqlite for example, rather than rolling your own. Unless you have to of course.


> Every time I have to fill a "shopping cart" for a US government data download I die a little.

I've worked in gov/civic tech for 5+ years and, as you're probably aware, there is now a highly lucrative business in centralizing and consequently selling easy access to this data to lobbyists/fortune 500s/nonprofits.

USDS/18f are sort of addressing this, but haven't really made a dent in the market afaik since they're focusing on modernizing specific agencies rather than centralizing everything.


I don't know if this is the cart you're talking about but the USGS does this for LiDAR data and yeah I'm with you I die a little every time I use that site. I love that the data is available but why in the world do we have to cart everything? Just give me a one click download


The main thing for me is that the data shouldn't be broken down in a way that ends with you having to make a thousand requests and collate them. If the client is expected to be able to pull the whole set of data, provide the whole set. There are other ways to rate limit.


clinicaltrials.gov lets you save a search term, then download the results as a zip. But there's an XML file for each trial in the search results.

One of the first things I played around with was using Python to get that file, unzip it, then iterate through the xml files grabbing the info I wanted and putting it into ElasticSearch to make it searchable then putting an angular front end on it.

I used to have it published somewhere but I think I let it all die. :(


Pretty much all data eventually ends up in 2 dimensions (rows & columns), so all these complicated data models mostly just complicate things.


Someone posted me a pdf in the last year! Online even! (But posted!)



