#4 seems like an overzealous generalisation. JOINs get a lot of crap.
Yes, it's no secret that large joins on huge amounts of data can be very intensive. It does not follow, however, that they should simply be avoided categorically. That's too much of a blanket statement for me.
For instance, joins are often used in situations where there is a numeric type column that refers to a very small table of enumerated values that have a textual or other translation, and there is a need to produce the latter in a single query. There's nothing wrong with that join from a performance standpoint, even for very large values of n.
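To make the lookup-table case concrete, here's a minimal sketch (using SQLite via Python for portability; the table and column names are hypothetical). A large table stores a small integer code, and a tiny enumeration table translates it to a label — the join key is a table the planner can keep entirely in memory, so the translation adds negligible cost per row:

```python
import sqlite3

# Hypothetical schema: a large "orders" table stores a small integer
# status code; a tiny "statuses" lookup table maps codes to labels.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE statuses (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
    INSERT INTO statuses VALUES (1, 'pending'), (2, 'shipped'), (3, 'cancelled');
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status_id INTEGER NOT NULL);
""")
conn.executemany("INSERT INTO orders (status_id) VALUES (?)",
                 [((i % 3) + 1,) for i in range(10_000)])

# Translating code -> label in a single query via the join.
rows = conn.execute("""
    SELECT s.label, COUNT(*)
    FROM orders o
    JOIN statuses s ON s.id = o.status_id
    GROUP BY s.label
    ORDER BY s.label
""").fetchall()
print(rows)
```

The same shape works identically in PostgreSQL; this is the kind of join that stays cheap no matter how big the fact table gets.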
JOINs rightly get a lot of crap - they are massive performance bottlenecks if you are working with large tables.
A single convenient counter-example doesn't detract from that.
I'll add to this by noting the post does say "For smaller tables it doesn’t matter but as tables get bigger avoid joining when you can" which is sound advice.
I'd say joining scales up to perhaps a million or two rows unless you have a lot of RAM (you can see the join spill to disk in the EXPLAIN ANALYZE output). I often work with tables in the 10-30 million row range, so my perspective is probably a little slanted towards the negative.
It's not particularly bad advice. Perhaps for someone who hasn't worked with databases for long, it's dangerous to be told "don't worry about normality." But denormalizing certain groups of tables can certainly yield a significant performance gain.
Yes, you're right. I agree that there are times that denormalizing makes sense. My point was that the item said:
* Throw away the normal forms you learned in school
* Denormalize [...] whenever it makes a query faster.
I think that's bad advice as stated. As you imply, you have to know why the rules are there to know when to break them intelligently. I didn't get that message from the item as posted, rather that the author thought that NF had little value, which probably doesn't belong in a list of database tips and tricks.
Been unplugged all day so haven't been around to respond. This is a completely fair criticism - the wording in my post is indeed too blunt. Without understanding the normal forms, making smart decisions about denormalization is going to be very difficult.
It's bad advice because people denormalize the wrong tables. They'll actually slow queries down by denormalizing a table with a few rows into a table with many rows.
I think "don't worry about normality" is a ridiculously bad statement. But perhaps "selectively denormalize when you run out of other optimizations" would be a little more reasonable.
One approach is to make all your denormalizations populated by triggers -- that way you can still think about the DB in "clean" terms in your code when modifying data, but get your denormalized tables for queries, too.
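A minimal sketch of that trigger approach (SQLite via Python for a self-contained demo; in PostgreSQL you'd write the trigger body as a PL/pgSQL function instead, and the table names here are hypothetical). Application code only writes normalized line items; the trigger keeps a denormalized per-order total current for fast reads:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE line_items (order_id INTEGER, amount REAL);
    -- Denormalized running total per order, kept current by a trigger.
    CREATE TABLE order_totals (order_id INTEGER PRIMARY KEY, total REAL);
    CREATE TRIGGER line_items_ai AFTER INSERT ON line_items
    BEGIN
        INSERT OR IGNORE INTO order_totals (order_id, total)
        VALUES (NEW.order_id, 0);
        UPDATE order_totals SET total = total + NEW.amount
        WHERE order_id = NEW.order_id;
    END;
""")

# Application code touches only the normalized table...
conn.executemany("INSERT INTO line_items VALUES (?, ?)",
                 [(1, 10.0), (1, 2.5), (2, 7.0)])

# ...while queries read the precomputed total without aggregating.
total = conn.execute(
    "SELECT total FROM order_totals WHERE order_id = 1").fetchone()[0]
print(total)
```

The write path stays "clean" from the application's point of view, and the read path never has to sum the detail rows.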
The difference is in a matter of degree. I agree that denormalisation can be beneficial - even necessary - in certain scenarios, particularly when other design optimisations are exhausted. But that's a far cry from "the normal forms you learned about in school are an utter waste of time that will throw your database performance in the gutter."
Agreed it was too strong of a statement, but denormalization really is something you do all the time in big databases. It is not some targeted last resort technique.
If you have 30 million rows in a table you absolutely cannot do joins so you bring everything in that you commonly need.
Because if you're going to do that - why not drop the SQL and the ORM altogether? Most shops at that level aren't doing analysis on their production SQL anyway - they dump it into a data warehouse. So why not do your CRUD on NoSQL and dump it somewhere people can run SQL on it?
Because often that's going too far. Quite often you still need transactions, established technologies, some structured data, and so on. Sometimes a regular SQL database with some denormalised tweaks is all you need.
There is definitely a place for NoSQL. However, many projects simply don't need to bite this off. They could start with SQL, from there you can decide how rigid/structured you want to be... I definitely agree that this comes with baggage; but if you're prepared to be pragmatic it's often got a lot of upside.
SQL Server + NHibernate does handle the CRUD stuff, custom fields and DDD-based domain logic very well.
NoSQL (CouchDB in our case) doesn't handle transactions, locking and business rules effectively. However it excels at providing extremely fast views on schemaless data such as our core domain model plus custom fields.
Good post, but most of those are SQL-specific, and not really unique to Postgres, right? But yeah, these are very crucial points that you see a lot of people not implementing!
A lot of generic SQL stuff in there for sure. But, last I worked with MySQL (it has admittedly been a few years) the indexing options were severely limited when compared with PostgreSQL. Last I used it you could not index an expression or qualify an index with a where clause for example. Also at that time MySQL could only make use of one index per query.
All of those deficiencies may have since been eliminated.
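For anyone unfamiliar with the two index features mentioned, here's a quick sketch. The syntax shown is essentially the same in PostgreSQL (`CREATE INDEX ... ON table (expression)` and `CREATE INDEX ... WHERE predicate`); the demo runs on SQLite via Python just to be self-contained, and the schema is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, active INTEGER)")

# Expression index: case-insensitive lookups on lower(email) can use it.
conn.execute("CREATE INDEX users_email_lower ON users (lower(email))")

# Partial index: only rows matching the WHERE clause are indexed, which
# keeps the index small when queries only ever touch active users.
conn.execute("CREATE INDEX users_active ON users (id) WHERE active = 1")

conn.execute("INSERT INTO users VALUES (1, 'Ada@Example.com', 1)")
row = conn.execute(
    "SELECT id FROM users WHERE lower(email) = ?",
    ("ada@example.com",)).fetchone()
print(row)
```

Without the expression index, a query filtering on `lower(email)` can't use a plain index on `email` at all - which is exactly the limitation being described.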
So glad to see PostgreSQL bubbling up these days as it's been my DB of choice for many years. I'd love to see a more portable version and of course something like MySQL's group_concat(). If you are game for trying it out be sure to check out the tools at sqlmanager.net (unfortunately they don't offer Linux versions anymore).
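For what it's worth, newer PostgreSQL versions (9.0+) do have an equivalent in string_agg(expr, delimiter). Here's what the group_concat() pattern looks like, sketched on SQLite via Python (which has group_concat natively; the schema is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (post_id INTEGER, tag TEXT);
    INSERT INTO tags VALUES (1, 'sql'), (1, 'postgres'), (2, 'nosql');
""")

# Collapse each post's tags into one comma-separated string.
# In PostgreSQL 9.0+ the same query uses string_agg(tag, ',').
rows = conn.execute("""
    SELECT post_id, group_concat(tag, ',')
    FROM tags
    GROUP BY post_id
    ORDER BY post_id
""").fetchall()
print(rows)
```

Note that neither group_concat nor string_agg guarantees element order unless you ask for it (PostgreSQL accepts an ORDER BY inside the aggregate call).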
I used to be a huge MySQL fan until a coworker forced us to use PostgreSQL for a big project. At the start I wasn't happy about it but have since completely changed sides. It is an amazing piece of software with some exceptional code and engineering behind it. When you are faced with high concurrency (especially if you have nontrivial write volume) PostgreSQL is a beast.
We peak at 1000s of transactions a second on an OLTP database that is over 100GB in size and PostgreSQL handles it like a champ.
Any nice reference about pgsql and query optimization? Any book recommendation?
As an old MySQL guy, insofar as I do databases at all, the advice re subqueries (#4) was unusual... :-)
(I looked at trying pgsql for a hobby project a few years back and found it lacked support for different character sets for different tables, etc. Is that [still] so?)
If you are looking for a book on PostgreSQL, the Douglas book http://www.amazon.com/PostgreSQL-Developers-Library-Korry-Do... is a pretty good introduction to most of the topics specific to Postgres. It includes a sample implementation of a Postgres-accessing script in most of the languages commonly used for the task, which can be a bit repetitive (Perl, Python, PHP, Java, C: same tune, different instrument), but that does make it a fairly good reference.
Thought I would add that the PostgreSQL documentation is really, really good for learning. It is a bit dense but overall very well written, includes nontrivial examples, and you can really get a solid dive into a topic reading it.