He's a Russian hacker. That's where awesome new algorithms come from these days, including things like Dual-Pivot Quicksort. (And QuickLZ may have been invented by some Scandinavian guy, but I'm pretty sure it was announced on encode.ru.) He's not an academic, but that doesn't mean he can't do competent algorithmic analysis.
I concur with the other commenters that it's silly for you to complain about his not comparing his online string-search algorithm against offline string-search algorithms that search an index of the text, such as suffix-tree algorithms.
His being Russian has no bearing on the issue. nginx and the pivot improvement to quicksort don't imply that Russian mathematicians no longer need to prove their claims. Not being an academic doesn't exempt you from having to do rigorous analysis either.
As for the online algorithm issue, see my reply below.
Dual-Pivot Quicksort is an "awesome new algorithm"? Hardly. It's a very small step in the direction of Samplesort -- which is asymptotically optimal and has been around for four decades.
Dual-Pivot Quicksort is a demonstration that someone didn't read the existing literature; nothing more.
Unfortunately I don't have access to the Samplesort paper, so I don't know which sense of "asymptotically optimal" you're using here. Quicksort is already asymptotically optimal in the sense that its asymptotic average performance is Θ(N log N), which is the best a sorting algorithm can do.
It's true that dual-pivot quicksort is a step in the direction of Samplesort. (I do understand Samplesort well enough to say that.) That doesn't mean it's not a worthwhile contribution in its own right. Jon Bentley (advisor of Brian Reid, Ousterhout, Josh Bloch, and Gosling while at CMU; later at Bell Labs in its glory days) was quoted as having a substantially different opinion from yours:
When that thread came up originally, I was intrigued by the idea of the dual-pivot quicksort, and went on to browse Robert Sedgewick's Ph.D. thesis, _Quicksort_, published in 1978. Sedgewick considers and analyzes several variants of quicksort in his thesis, and I quickly discovered that he analyzes and dismisses the dual-pivot variant under a different name. I don't know whose analysis is better, but it does follow that we shouldn't treat dual-pivot qsort as an original invention (not to detract from Vladimir Yaroslavskiy's undoubted ingenuity).
Here's an excerpt from the email I wrote to Prof. Sedgewick back then; he didn't reply, so I've no idea what he thinks about this:
"I believe that your PhD thesis studies this under the name of "double partition modification". I could find no real difference between the two algorithms. Yaroslavsky claims to achieve a factor-of-0.8 improvement over the standard quicksort in the number of exchanges (neutral on comparisons), while your analysis led you to reject the technique after concluding it required many more comparisons than the standard quicksort. I've only attempted very cursorily to reconcile the math; it seems that both your thesis and Yaroslavsky arrive at 0.8 n log n as the average number of swaps, but you have (1/3) n log n as the average number of swaps in the classic algorithm while Yaroslavsky has 1.0 n log n for the same thing, so what he sees as an improvement you viewed as a disaster. My interpretation of this may be wrong, and I haven't attempted to verify either analysis yet."
The "classic algorithm" in the above is the classic quicksort as presented by Sedgewick in the thesis, which is slightly different from e.g. the standard Java implementation. I lost interest soon thereafter and stopped investigating, but it shouldn't be too difficult to find out who was right in that algorithm comparison, if someone's interested enough.
Yes, but the number of comparisons and the number of swaps wouldn't change. Sedgewick and Yaroslavskiy can't both be right, unless anatoly misread something.
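One way to check empirically is to instrument both partition schemes and count comparisons and swaps directly. Here's a rough sketch I put together: a Lomuto-style single-pivot quicksort against a Yaroslavskiy-style dual-pivot one. Caveat: Sedgewick's thesis analyzes a Hoare-style scan, not Lomuto, so these counts are only a sanity check on random data, not a reproduction of either analysis.

```python
import random

def classic_quicksort(a, lo, hi, stats):
    """Single-pivot quicksort (Lomuto partition, last element as pivot)."""
    if lo >= hi:
        return
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):
        stats["cmp"] += 1
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            stats["swap"] += 1
            i += 1
    a[i], a[hi] = a[hi], a[i]
    stats["swap"] += 1
    classic_quicksort(a, lo, i - 1, stats)
    classic_quicksort(a, i + 1, hi, stats)

def dual_pivot_quicksort(a, lo, hi, stats):
    """Yaroslavskiy-style dual-pivot quicksort: partition into three parts
    around pivots p <= q (outermost elements after one swap)."""
    if lo >= hi:
        return
    stats["cmp"] += 1
    if a[lo] > a[hi]:
        a[lo], a[hi] = a[hi], a[lo]
        stats["swap"] += 1
    p, q = a[lo], a[hi]
    l, g, k = lo + 1, hi - 1, lo + 1
    while k <= g:
        stats["cmp"] += 1
        if a[k] < p:                       # belongs in the left part
            a[k], a[l] = a[l], a[k]
            stats["swap"] += 1
            l += 1
        else:
            stats["cmp"] += 1
            if a[k] > q:                   # belongs in the right part
                while a[g] > q and k < g:
                    stats["cmp"] += 1
                    g -= 1
                a[k], a[g] = a[g], a[k]
                stats["swap"] += 1
                g -= 1
                stats["cmp"] += 1
                if a[k] < p:
                    a[k], a[l] = a[l], a[k]
                    stats["swap"] += 1
                    l += 1
        k += 1
    l -= 1
    g += 1
    a[lo], a[l] = a[l], a[lo]              # move pivots into final positions
    a[hi], a[g] = a[g], a[hi]
    stats["swap"] += 2
    dual_pivot_quicksort(a, lo, l - 1, stats)
    dual_pivot_quicksort(a, l + 1, g - 1, stats)
    dual_pivot_quicksort(a, g + 1, hi, stats)

random.seed(1)
data = [random.randrange(10**6) for _ in range(20000)]
results = {}
for name, fn in [("classic", classic_quicksort), ("dual-pivot", dual_pivot_quicksort)]:
    a = data[:]
    stats = {"cmp": 0, "swap": 0}
    fn(a, 0, len(a) - 1, stats)
    assert a == sorted(data)
    results[name] = stats
    print(name, stats)
```

Counting conventions differ between analyses (e.g. whether pivot-selection comparisons count), so treat the absolute numbers loosely; the interesting part is the ratio between the two schemes.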
The asymptotic optimality of samplesort is in the sense that it takes (1 + o(1)) N log N / log 2 comparisons, i.e., you can't beat it by any constant factor.
Bentley's comments from that email don't make any sense to me. I can't understand what he was thinking, unless he's just unusually prone to flattering other people's work (I don't know Bentley personally, but I've met other people who do this, much to my irritation).
> The asymptotic optimality of samplesort is in the sense that it takes (1 + o(1)) N log N / log 2 comparisons, i.e., you can't beat it by any constant factor.
Why isn't it used everywhere that people currently use Quicksort? I've never actually heard of anyone using Samplesort except in a parallel context.
Two reasons: 1. It's a PITA to implement; 2. You need a large N to overcome the o(1), and at that point you're usually dominated by the performance of your memory hierarchy rather than by the cost of doing compares.
Last I heard, python's built-in sort was a type of samplesort, though, so it isn't completely unused.
The previous Python sort was a type of samplesort, but the current one is an “adaptive, stable, natural mergesort” by Tim Peters (“timsort”). On random data I think it performs about the same or slightly worse, but when the data has some structure, it performs much better. See http://bugs.python.org/file4451/timsort.txt
Your #2 suggests that it would be slower than ordinary Quicksort on normal-sized datasets. But dual-pivot quicksort is faster than ordinary Quicksort on normal-sized datasets. So in at least one sense, dual-pivot is not a step in the direction of samplesort.
Python's built-in sort is timsort, last I heard, which is a variant of mergesort that looks for existing sorted or backwards runs in order to run faster in the common cases of mostly-sorted or mostly-backwards data. It's true that Python's built-in sort was a samplesort variant until 2002, though, which I didn't know.
(It does seem like Samplesort's virtue of minimizing comparisons would be particularly desirable in an environment like Python, where every comparison potentially involves a call into slow interpreted code.)
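You can see how much the comparison count matters in CPython with a small wrapper that counts Python-level `__lt__` calls (CPython's sort compares exclusively via `<`). It also shows off timsort's run detection: on already-sorted input it needs only n - 1 comparisons.

```python
import random

class Counted:
    """Wraps a value; counts how many times Python-level __lt__ runs."""
    calls = 0
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        Counted.calls += 1
        return self.v < other.v

n = 1000
random.seed(0)

Counted.calls = 0
sorted(Counted(random.random()) for _ in range(n))
random_cmps = Counted.calls

Counted.calls = 0
sorted(Counted(i) for i in range(n))
presorted_cmps = Counted.calls

print("random input:   ", random_cmps, "comparisons")   # roughly n log2 n
print("presorted input:", presorted_cmps, "comparisons")  # exactly n - 1
```

Every one of those comparisons is a call back into interpreted code, which is exactly why shaving a constant factor off the comparison count pays in this environment.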
I just read the original paper on samplesort, so this might not be the most up to date description, but the basic idea is as follows:
- From an input sequence of length n, choose a random subsequence of length k.
- Sort the subsequence.
- Partition the remaining (n-k) elements of the original sequence into the (k+1) partitions given by the subsequence.
- Sort the (k+1) partitions.
In the paper, the partitioning and sorting are always done using quicksort. Any O(n log n) sorting algorithm should work, though, which includes using samplesort recursively.
If you use an optimal value for k (and this might be a significant fraction of n), you can prove that lim_{n->infty} E(C_n)/log(n!) = 1, where E(C_n) is the expected number of comparisons for a sequence of length n.
Now, viewing samplesort recursively, quicksort is samplesort with k = 1, and dual-pivot quicksort is samplesort with k = 2.
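The steps above can be sketched in a few lines. This is only an illustration: the choice k = sqrt(n) is arbitrary rather than the paper's optimal value, and I partition via binary search against the sorted sample instead of the paper's quicksort-based partitioning (the comparison counts are the same either way).

```python
import bisect
import random

def samplesort(seq, k=None):
    """Sketch of samplesort per the four steps above."""
    n = len(seq)
    if n <= 16:                          # small base case: any sort will do
        return sorted(seq)
    if k is None:
        k = max(1, int(n ** 0.5))        # illustrative choice, not optimal
    # 1. Choose a random subsequence of length k.
    chosen = set(random.sample(range(n), k))
    sample = [seq[i] for i in chosen]
    rest = [seq[i] for i in range(n) if i not in chosen]
    # 2. Sort the subsequence (here: recursively).
    sample = samplesort(sample)
    # 3. Partition the remaining n - k elements into the k + 1 buckets
    #    delimited by the sorted sample.
    buckets = [[] for _ in range(k + 1)]
    for x in rest:
        buckets[bisect.bisect_left(sample, x)].append(x)
    # 4. Sort the buckets and splice the sample elements back in between.
    out = []
    for i, b in enumerate(buckets):
        out.extend(samplesort(b))
        if i < k:
            out.append(sample[i])
    return out

random.seed(2)
data = [random.randrange(1000) for _ in range(5000)]
print(samplesort(data) == sorted(data))  # True
```

With k = 1 step 3 degenerates into a single quicksort partition, which is the recursive view mentioned above.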
I guess it depends on what type of queries you expect; if you want to find the same (fixed) substring across a body of (dynamic) texts, the O(n) cost of preprocessing (suffix trees/arrays) is terrible.
If, on the other hand, you have a fixed "corpus" and a dynamic query, O(n) search time (this algo, purportedly) is terrible.
Hmm...there's no mention of it being suited for a dynamic or "online" setting as another commenter notes (not sure why my other comment was downvoted for that). I don't know what to make of this from the description:
"This algorithm especially well suited for long S, multi-substrings, small alphabet, regex/parsing and search for common substrings."
Long source text: the O(n*m) worst-case time per search kills it. Even for a single search, it's O(n*m) worst case here versus O(n + m) for suffix trees.
Multi-substrings: suffix trees handle each substring of length m in O(m), but the article doesn't compare against them.
Search for common substrings: again, suffix trees would be more appropriate.
Nevertheless, it's inexcusable to be proposing a string matching algorithm without at least mentioning why suffix trees can't do the job. At the very least, showing that your algorithm beats suffix trees in any instantiation does wonders for credibility. For the record, suffix trees can be built online as well:
Sure, but they use a lot more space (even if you use suffix arrays instead of suffix trees) and are generally only worth it if you search for more than one pattern over the same text.
In most academic papers I have read that deal with online pattern matching suffix arrays/trees are generally not compared.
EDIT: Am I missing something???
Complexity analysis according to the author (m = length of the search term, n = length of the text):
O(m) preprocessing -- that's right, proportional to the search term -- and O(n*m) worst-case string search, so the worst case traverses the whole text.
Now compare that to suffix trees:
O(n) preprocessing, O(m) string search,
where worst-case search complexity is linear in the length of the search term.
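To make the O(n*m) worst case concrete, here's the generic naive scan (not the author's algorithm, just the textbook baseline that shares the same worst-case bound) on an adversarial input where every window matches the first m - 1 pattern characters:

```python
def naive_search(text, pattern):
    """Naive left-to-right scan; returns (match index or -1, char comparisons)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
            j += 1
        else:                      # no break: full match at position i
            return i, comparisons
    return -1, comparisons

# Adversarial input: each of the n - m + 1 windows costs m comparisons.
n, m = 10000, 100
pos, comps = naive_search("a" * n, "a" * (m - 1) + "b")
print(pos, comps)  # -1, (n - m + 1) * m = 990100 comparisons
```

A suffix tree built over the same text answers that query in O(m), after the one-time O(n) construction cost -- which is the trade-off the whole subthread is about.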