Hacker News

One nice thing about CSV files being zipped and served via the web is that they can be streamed directly into the database incredibly fast without having to persist them anywhere (aside from the db).

You can load the zip file as a stream, read the CSV line by line, transform it, and then load it into the db using COPY FROM STDIN (assuming Postgres).
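A minimal sketch of that pipeline, assuming psycopg2 (the connection, table name, and per-row transform here are all made up for illustration). One caveat: zipfile needs the whole archive in hand, since a zip's directory sits at the end, so "streamed" here means "never persisted to disk", not chunked decompression.

```python
import csv
import io
import zipfile


def load_zipped_csv(conn, zip_bytes: bytes, table: str) -> None:
    """Unpack a zipped CSV in memory, transform each row, and bulk-load
    the result with COPY ... FROM STDIN. Nothing touches local disk."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        csv_name = zf.namelist()[0]  # assume one CSV per archive
        with zf.open(csv_name) as raw:
            reader = csv.reader(io.TextIOWrapper(raw, encoding="utf-8"))
            buf = io.StringIO()
            writer = csv.writer(buf)
            for row in reader:
                # Example transform: strip whitespace from each cell.
                writer.writerow([cell.strip() for cell in row])
            buf.seek(0)
            # conn is assumed to be a psycopg2 connection.
            with conn.cursor() as cur:
                cur.copy_expert(
                    f"COPY {table} FROM STDIN WITH (FORMAT csv)", buf
                )
```

For brevity this buffers the transformed rows in memory before the COPY; a fully streaming variant would hand copy_expert a file-like wrapper around the row iterator instead.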



Definitely, it is much easier to stream CSV than, say, JSON or XML (even if JSONL, SAX parsers, etc. exist).


That doesn't sound like an amazingly safe idea


It isn't. But that's easily mitigated with temp tables, an ephemeral database, COPY, etc.

Upstream can easily f-up and (accidentally) delete production data if you do this on a live db, which is why PostgreSQL and nearly all other DBs have a myriad of tools to solve this by not doing it directly on a production database.
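One common shape for that mitigation is load-into-staging, validate, then swap. A sketch, with made-up table names (the statements would run inside a single transaction via e.g. psycopg2):

```python
def swap_statements(table: str) -> list[str]:
    """Statements to promote a freshly loaded staging table to live,
    meant to run in one transaction after validation has passed."""
    staging, old = f"{table}_staging", f"{table}_old"
    return [
        f"CREATE TABLE {staging} (LIKE {table} INCLUDING ALL)",
        # ... COPY {staging} FROM STDIN and sanity checks go here ...
        f"ALTER TABLE {table} RENAME TO {old}",
        f"ALTER TABLE {staging} RENAME TO {table}",
        f"DROP TABLE {old}",
    ]
```

If the COPY or any check fails mid-transaction, the rollback leaves the live table untouched.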


Maybe I'm missing something but I don't see how it's possible for a COPY statement alone to remove existing data.


If in the regular scenario you load 10000 rows of new data and delete the old, then it's fine.

What if someone screws up the zip and instead of 10000 today, it’s only 10?


I had this last week, except it was a third-party API whose service started returning null instead of true for the has_more property beyond the second page of results.

In either case, the solution is probably to check rough counts and error out if they're not reasonable.
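That rough-count guard can be a one-liner; the thresholds below are arbitrary examples:

```python
def counts_look_sane(new_count: int, old_count: int,
                     min_ratio: float = 0.5, max_ratio: float = 2.0) -> bool:
    """Crude guard: refuse an import whose row count differs wildly
    from the previous import's count."""
    if old_count == 0:
        return new_count >= 0  # nothing to compare against
    ratio = new_count / old_count
    return min_ratio <= ratio <= max_ratio
```

With this, the 10-rows-instead-of-10000 case above fails fast instead of silently replacing the table.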


I think the general rule is: don't replace the prod db until the new one passes tests.


What specific risks do you foresee with this approach?


Seems totally fine to me, as long as you can roll back if the download is truncated or the CRC checksum doesn't match.


> or the crc checksum doesn’t match.

Which wouldn't exist if the API is just a single CSV file?

At least with a zip, the CRC exists (an incomplete zip file is detectable; an incomplete but syntactically correct CSV file is not).
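Python's zipfile module exposes both checks: opening a truncated archive raises BadZipFile (no end-of-central-directory record), and testzip() re-reads every member and verifies its CRC. A small sketch:

```python
import io
import zipfile


def zip_is_complete(data: bytes) -> bool:
    """Return False for truncated or CRC-corrupted archives."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            # testzip() returns the name of the first bad member, or None.
            return zf.testzip() is None
    except zipfile.BadZipFile:
        return False
```

A CSV served bare gives you neither signal; a connection dropped mid-transfer just looks like a shorter file.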


DROP DATABASE blah;


That’s not how COPY FROM works in Postgres. You give it a CSV and a table matching the structure, and it hammers the data into the table faster than anything else can.


If you're feeling risky, try a Foreign Data Wrapper ;)



