Notes on the SQLite DuckDB Paper (simonwillison.net)
195 points by todsacerdoti on Sept 1, 2022 | 28 comments


I wish more people would publish this kind of content.

Academic papers aren't really designed for casual reading - I always find them to be a bit of a slog. It would be great if it was much more common for people to share their notes like this.

I'm going to try and remember to write more of these in the future.

I should add that I found this particular paper to be a whole lot more readable than most!


I try to do paper reviews (like https://news.ycombinator.com/item?id=31379383) because it's a great way for me to learn more and _hopefully_ share in a way that is useful for others. My writing abilities are still a work in progress :)


That one is great, thank you!


Simon's writing has always been crisp and insightful. If I remember correctly, Simon comes from a newspaper background. I wonder how that's influenced his approach to these sorts of technical topics.


You realise you're replying to Simon about Simon, right?


Now you made me wonder if Simon realizes that he's talking about Simon's post.


I've spent a bunch of time working at newspapers and surrounded by journalists, but I've never done any actual writing as part of that. My writing experience almost all comes from having a blog for twenty years!


For anyone interested in a blog with a bunch of similar summaries of papers, here's a good one:

https://blog.acolyer.org/


Great suggestion and thank you for leading with a nice example!

I enjoyed reading your notes and benefited immensely from them.


DuckDB is obviously faster than SQLite on analytical queries: from 10× to 11,000× on one benchmark [1]. But unlike SQLite, it cannot run all of the queries, failing on 12 out of 43 due to weaker memory management.

[1]: https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQXRoZW5hIC...


I’m a big fan of DuckDB. However it is true that memory usage is not optimized right now — I’ve had trouble with very large groupbys even with pragmas on memory limits (but I’m talking huge limits like 250 GB RAM). I’ve had to remodel and break down my problem to work around these limitations.

I believe it's on the DuckDB roadmap to get better at memory use, but apart from that it's very, very good at what it does, i.e. complex analytical queries. It vectorizes operations, so everything is fast and efficient.


I'm surprised duckdb failed to run simple count/groupby queries. Is it simply OOM? If so, do you know how much more memory it would need to complete those queries?


My wild guess is a lack of some functions, e.g. regexp, but I'd need to take a deeper look.


It's likely out-of-memory issues with many, many distinct values in an aggregated column, or with a very large join intermediate. Joins can now handle larger-than-RAM data, and larger-than-RAM intermediates in a GROUP BY are on the roadmap!


Looking at the failed queries, yes, OOM might be one big factor due to aggregations (there is no join in these queries though).


I wonder how well-motivated the OLAP use cases are.

Is there really a lot of demand for doing analytics on edge devices?

I see a windmill example in the introduction of a paper linked from the post above. Is that really a big thing in the windmill space? Do modern windmills need embedded devices on them to perform the kind of real-time analytics that would be enabled by OLAP engines and queries like that?

Are there other such cases that folks know where this would be very useful?


A lot of recent tooling around SQLite has opened up some new possibilities. The biggest one I can think of is Litestream which has allowed SQLite to have distributed reads. This makes SQLite a strong contender for serverless and extremely cheap OLAP. There have been a lot of posts about serverless Postgres as well lately in the form of neon.tech.

If SQLite had better performance for OLAP queries, there is no way I wouldn't explore it as an option for work, because it's so darn easy to use. The operational, hosting, and cloud costs of other OLAP databases are insanely high in comparison.

P.S. It's my personal opinion and experience that most companies have only about 1 TB or less of analytics data. Large OLAP databases are just a waste for such small data, but currently, apart from DuckDB, there aren't any cheap competitors for these small use cases. Postgres and SQLite are too slow (without some additional engineering).


I've been thinking about very similar needs, but perhaps accomplished in a somewhat different way.

It would be really neat if SQLite or DuckDB implemented a way to create (register) read-only external tables (like Apache Iceberg). That way your large OLAP database (Snowflake, etc) could manage (DDL & CRUD) the external table, while SQLite and DuckDB would consume from it and automatically get all of the updates/deltas to the external table.

    clicks -->                     Iceberg
    bits   --> Snowflake/MPP <-->   table   --> DuckDB 
    pieces -->                      on S3 
Snowflake could take care of all the analytical heavy lifting, continually preparing/updating/unloading data to the Iceberg table in an analytics-ready/friendly form (lags, leads, etc. built into the data).

DuckDB could watch for changes to the Iceberg table (manifest) and update its internal representation of the external table as it changes. DuckDB can already read Parquet files on S3, so it should be somewhat trivial to implement? DuckDB could then handle all of the lightweight slicing/dicing of the analytics data: projections, counts, pagination, etc.

Here's an example of Snowflake managing an Iceberg table: https://www.youtube.com/watch?v=Kz5cWY_vRwU


DuckDB can actually read SQLite or Postgres directly! In the SQLite case, something like Litestream plus DuckDB could work really well!

Also, with Pyarrow's help, DuckDB can already do this with Delta tables!

https://github.com/duckdblabs/sqlite_scanner

https://github.com/duckdblabs/postgresscanner


For comparison/reference, a typical single 2 MW wind turbine platform costs about $3M USD. I would expect a wind park, which has several wind turbines, to have traditional wired Ethernet connectivity (or a 5G private network, in the future) connecting all onboard turbine diagnostics electronics to a central control platform onsite for sending/receiving signals/commands. A typical part failure and the resulting downtime cost a lot of money. So they would typically invest in this 'optional' observability/diagnostics platform, which costs a few thousand dollars and can easily accommodate a full server stack for local collection, buffering/storage, aggregation, ETL, and pushing to a remote cloud for further real-time monitoring and long-term component-failure prediction/analysis.


I think there is definitely some interest in building 'data apps' where you send a data file to the client/browser and the user slices and dices as they please, instead of a set of predetermined queries/analytics. I've heard of companies like 'motherduck' exploring this space.


How do people see Apache DataFusion fitting into this space?

I've also been following DuckDB for a while and remain very excited about it. I've only recently become aware of DataFusion and am building a small project with it, but I've not seen much discussion of it or any benchmark comparisons.

DataFusion seems to be part of the Apache Arrow stable, and they've done a great service to the analytics space, with DuckDB also being big supporters. I'm hoping this means DataFusion has a good chance of getting traction and adding another option in this space.

Also Substrait for future backend independence.


I found this tweet from Andy Grove and in the linked benchmarks DataFusion is just behind DuckDB.

https://twitter.com/andygrove_io/status/1559920848882073600?...


Discussion of the paper yesterday https://news.ycombinator.com/item?id=32675861


I am writing a pet project and will try this as my first SQL database.


When I look at comparisons like this, one of the first red flags I check for is binary size. First SQLite:

https://sqlite.org/2022/sqlite-tools-win32-x86-3390200.zip

Result: "1,108,480 sqlite3.exe". Then DuckDB:

https://github.com/duckdb/duckdb/releases/download/v0.4.0/du...

Result: "32,052,736 duckdb.exe" (already stripped)

To me, this says that SQLite is a carefully considered codebase, with great care put into making a lean, high-quality piece of software, while DuckDB has the hallmarks of bloated C++ code, with plenty of waste and little attention to these details.


Hmm, I would recommend reading a sample of the code first!


That's certainly one metric. But you can pick any single metric and find one is better than the other (e.g. data type support). You have to look at it overall.



