I wish more people would publish this kind of content.
Academic papers aren't really designed for casual reading - I always find them to be a bit of a slog. It would be great if it was much more common for people to share their notes like this.
I'm going to try and remember to write more of these in the future.
I should add that I found this particular paper to be a whole lot more readable than most!
I try to do paper reviews (like https://news.ycombinator.com/item?id=31379383) because it is a great way for me to learn more and _hopefully_ share in a way that is useful for others. My writing abilities are still a work in progress :)
Simon's writing has always been crisp and insightful. If I remember correctly, Simon comes out of a newspaper background. I wonder how that's influenced his approach on these sorts of technical topics.
I've spent a bunch of time working at newspapers, surrounded by journalists, but I've never done any actual writing as part of that. Almost all of my writing experience comes from having a blog for twenty years!
DuckDB is obviously faster than SQLite on analytical queries, from 10 to 11,000 times faster on one benchmark [1]. But unlike SQLite, DuckDB cannot run all of the queries, due to weaker memory management: it fails on 12 out of 43 queries.
I’m a big fan of DuckDB. However, it is true that memory usage is not optimized right now — I’ve had trouble with very large GROUP BYs even with pragmas setting memory limits (and I’m talking huge limits, like 250 GB of RAM). I’ve had to remodel and break down my problem to work around these limitations.
I believe better memory use is on the DuckDB roadmap, but apart from that it’s very, very good at what it does, i.e. complex analytical queries. It vectorizes operations, so everything is fast and efficient.
I'm surprised DuckDB failed to run simple count/group-by queries. Is it simply OOM? If so, do you know how much more memory it would take to complete those queries?
It's likely out-of-memory issues with very many distinct values in an aggregated column, or with a very large join intermediate. Joins can now handle larger-than-RAM data, and larger-than-RAM intermediates in a GROUP BY are on the roadmap!
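As a sketch of the memory-limit workaround mentioned above: `memory_limit` and `temp_directory` are DuckDB's documented settings, while the table and column names here are made up for illustration.

```sql
-- Cap DuckDB's memory use; with a limit set, operators that support
-- spilling write intermediates to the temp directory instead of OOMing.
PRAGMA memory_limit = '8GB';
PRAGMA temp_directory = '/tmp/duckdb_spill';

-- A high-cardinality aggregation of the kind that can exhaust RAM:
SELECT user_id, COUNT(*) AS events
FROM events
GROUP BY user_id;
```

Even with a limit set, an aggregation whose hash table itself exceeds RAM could still fail until larger-than-RAM GROUP BY intermediates land, which is the roadmap item referred to above.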
I wonder how well-motivated the OLAP use cases are.
Is there really a lot of demand for doing analytics on edge devices?
I see a windmill example in the introduction of a paper linked from the post above. Is that really a big thing in the windmill space? Do modern windmills need embedded devices on them to perform real-time analytics of the kind enabled by OLAP engines and queries like that?
Are there other such cases that folks know where this would be very useful?
A lot of recent tooling around SQLite has opened up some new possibilities. The biggest one I can think of is Litestream, which has allowed SQLite to have distributed reads. This makes SQLite a strong contender for serverless and extremely cheap OLAP. There have also been a lot of posts lately about serverless Postgres, in the form of neon.tech.
If SQLite had better performance for OLAP queries, there is no way I wouldn't explore it as an option for work, because it's so darn easy to use. The operational, hosting, and cloud costs of other OLAP databases are insanely high in comparison.
P.S. It's my personal opinion and experience that most companies have only about 1 TB or less of analytics data. Large OLAP databases are just a waste for such small data, but apart from DuckDB there currently aren't any cheap competitors for these small use cases. Postgres and SQLite are too slow (without some additional engineering).
I've been thinking about very similar needs, but perhaps accomplished in a somewhat different way.
It would be really neat if SQLite or DuckDB implemented a way to create (register) read-only external tables (like Apache Iceberg). That way your large OLAP database (Snowflake, etc.) could manage (DDL & CRUD) the external table, while SQLite and DuckDB would consume from it and automatically pick up all of the updates/deltas to the external table.
Snowflake could take care of all the analytical heavy lifting, continually preparing/updating/unloading data to the Iceberg table in an analytics-ready/friendly form (lags, leads, etc. built into the data).
DuckDB could watch for changes on the Iceberg table (manifest) and update its internal representation of the external table as it changes. DuckDB can already read Parquet files out on S3, so it should be somewhat trivial to implement? DuckDB could then handle all of the lightweight slicing/dicing of the analytics data: projections, counts, pagination, etc.
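For reference, the "read Parquet on S3" part already works today via DuckDB's httpfs extension and `read_parquet` table function; the bucket path and columns below are placeholders, and in practice you would also configure S3 credentials.

```sql
-- Enable remote file access (one-time install, then load per session).
INSTALL httpfs;
LOAD httpfs;

-- Lightweight slicing/dicing directly over Parquet files unloaded
-- by the upstream warehouse (hypothetical bucket and schema):
SELECT country, COUNT(*) AS n
FROM read_parquet('s3://my-bucket/iceberg-unload/*.parquet')
GROUP BY country
ORDER BY n DESC
LIMIT 10;
```

The missing piece the comment describes is the change-tracking layer on top: watching the Iceberg manifest and refreshing which files the query reads.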
For comparison/reference, a typical single 2 MW wind turbine platform costs about $3M USD. I would expect a wind park, which has several wind turbines, to have traditional wired Ethernet connectivity (or a private 5G network, in the future) connecting all onboard turbine diagnostics electronics, to send/receive signals/commands from a central control platform onsite. A typical part failure and its downtime cost a lot of money. So, typically, operators would invest in this 'optional' observability/diagnostics platform, which costs a few thousand dollars and can easily accommodate a full server stack for local collection, buffering/storage, aggregation, ETL, and push to a remote cloud for further real-time monitoring and long-term component failure prediction/analysis.
I think there is definitely some interest in building 'data apps' where you send a data file to the client/browser and the user slices and dices as they please, instead of a set of predetermined queries/analytics. I've heard of companies like 'motherduck' exploring this space.
How do people see Apache DataFusion fitting into this space?
I've also been following DuckDB for a while and remain very excited about it. I've only recently become aware of DataFusion and am building a small project with it, but I've not seen much discussion of it or any benchmark comparisons.
DataFusion seems to be part of the Apache Arrow stable, and they've done a great service to the analytics space, with DuckDB also being big supporters. I'm hoping this means DataFusion has a good chance of gaining traction too, adding another option in this space.
To me, this says that SQLite is a carefully considered codebase, with great care put into making a lean, high-quality piece of software, while DuckDB seems to have the hallmarks of bloated C++ code, with plenty of waste and little care for these details.
That's certainly one metric. But you could pick any single metric and find that one is better than the other (e.g. data type support). You have to look at the overall picture.