If you are interested in these ideas, you should know that this essay kicks off a series of essays that culminates, a year later, with an examination of the Amazon-style Weekly Business Review:
(It took that long because of a) an NDA, and b) the time it takes to put the ideas to practice and understand them, and then teach them to other business operators!)
The ideas presented in this particular essay are really due to W. Edwards Deming, Donald Wheeler, and Brian Joiner (who created Minitab; ‘Joiner’s Rule’, the variant of Goodhart’s Law cited in the link above, is attributed to him).
Most of these ideas were developed in manufacturing, in the post WW2 period. The Amazon-style WBR merely adapts them for the tech industry.
I hope you will enjoy these essays — and better yet, put them to practice. Multiple executives have told me the series of posts have completely changed the way they see and run their businesses.
FYI, you can also upvote or favorite a comment, and then view those upvoted/favorited comments from your profile (same for submissions). Favorites are public.
I should note that this essay kicks off an entire series that eventually culminates in a detailed examination of the Amazon Weekly Business Review (which takes some time to get to because of a) an NDA, and b) it took some time to test it in practice). The Goodhart’s Law essay uses publicly available information about the WBR to explain how to defeat Goodhart’s Law (the ideas it draws from are five decades old); the WBR itself is a two-decades-old mechanism for actually accomplishing these high-falutin’ goals.
Over the past year, Roger and I have been talking about the difficulty of spreading these ideas. The WBR works, but as the essay shows, it is an interlocking set of processes that solves for a bunch of socio-technical problems. It is not easy to get companies to adopt such large changes.
As a companion to the essay, here is a sequence of cases about companies putting these ideas to practice:
The common thread in all these essays is that they don’t stop at high-falutin’ (or conceptual) recommendations, but actually dive into real-world application and practice. Yes, it’s nice to say “let’s have a re-evaluation date.” But what does it actually look like to get folks to do that at scale?
Well, the WBR is one way that works in practice, at scale, and with some success in multiple companies. And we keep finding nuances in our own practice: https://x.com/ejames_c/status/1849648179337371816
It looks like any other decision record where you set a date to evaluate the impact of a policy or course of action and make sure it's working out the way that you had anticipated.
And how are you going to tell, when a) every metric displays variation (that is, it wiggles wildly)? And b) how will you tell whether it has or hasn’t impacted other parts of your business if you do not have a method for uncovering the causal model of your business (like that aquarium quote you cited earlier)?
Reality has a lot of detail. It’s nice to quote books about goals. It’s a different thing entirely to achieve them in practice with a real business.
I agree that reality is complex, but I worry you are conflating the challenges of running an Amazon-scale business with running the smaller businesses that most of the entrepreneurs on HN will need to manage. I thought Roger offered a more practical approach in about 10% of the words that you took. I am sorry if I have offended you; I was trying to save the entrepreneurs on HN time.
As to Jack Stack's book, I think the genius of his approach is communicating simple decision rules to the folks on the front line instead of trying to establish a complex model at the executive level that can become more removed from day-to-day realities. In my experience, which involves working in a variety of roles in startups and multi-billion dollar businesses over the better part of five decades, simple rules updated based on your best judgment risk "extinction by instinct" but outperform the "analysis paralysis" that comes from trying to develop overly complex models.
How do you answer? Notice that this is a problem regardless of whether you are a big company or a small company.
b) 3 months later, your client comes back and asks: “we are having trouble with customer support. How do we know that it’s not related to this change we made?” With your superior experience working with hundreds of startups, you are able to tell them if it is or isn’t after some investigation. Your client asks you: “how can we do that for ourselves without calling on you every time we see something weird?”
How do you answer?
(My answers are in the WBR essay and the essay that comes immediately before that, natch)
It is a common excuse to wave away these ideas with “oh, these are big company solutions, not applicable to small businesses.” But a) I have applied these ideas to my own small business and doubled revenue; also b) in 1992 Donald Wheeler applied these methods to a small Japanese night club and then wrote a whole book about the results: https://www.amazon.sg/Spc-Esquire-Club-Donald-Wheeler/dp/094...
Wheeler wanted to prove (and I wanted to verify) that ‘tools to understand how your business ACTUALLY works’ are uniformly applicable regardless of company size.
If anyone reading this is interested in being able to answer both questions confidently, I recommend reading my essays to start with (there’s enough in front of the paywall to be useful) and then jumping straight to Wheeler. I recommend Understanding Variation, which was originally developed as a 1993 presentation to managers at DuPont (which means it is light on statistics).
To be fair to OP, Wheeler never claims that for stable/in-control/predictable processes roughly half of the measurements will lie above the average. The only claim he makes is that 97% of all data points for a stable process (assuming the process draws from a J-curve or single-mound distribution) will fall between the limit lines.
He can't make this claim (about ~half falling above/below the average line), because one of the core arguments he makes is that XmR charts are usable even when you're not dealing with normal distributions. He argues that the intuition behind how they work is that they detect the presence of more than one probability distribution in the variation of a time series.
I don't have the stats-fu to back it up but I would be very surprised if someone could point to a process where XmR charts are useful, but where the mean is not within 10–20 percentiles of the median.
So I've got a dumb question here: what happens when you use vanilla XmR charts with J-curve shaped or sub-exponential distributions?
My current simplistic (and very dumb!) solution that I've used for power-law type distributions — like HN virality, for instance — is to count the number of days between viral events, and then subject that to process control.[1] I basically take Wheeler's approach to chunky data and use that for J-curve type data, which tells me if the behaviour of my 'HN virality process' has changed.
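To sketch what this looks like (with hypothetical front-page dates, not my actual HN data): convert the timestamps of viral events into days-between-events, then compute XmR-style limits on those gaps. The 2.66 constant is Wheeler’s standard scaling factor for the average moving range; everything else here is invented for illustration.

```python
from datetime import date

# Hypothetical front-page dates, for illustration only (not real data).
viral_days = [date(2024, 1, 3), date(2024, 2, 14), date(2024, 3, 1),
              date(2024, 5, 20), date(2024, 6, 2)]

# Transform the rare-event series into days-between-events. The gaps
# are far better behaved than the raw power-law traffic counts.
gaps = [(b - a).days for a, b in zip(viral_days, viral_days[1:])]
print(gaps)  # [42, 16, 80, 13]

# XmR-style limits on the gaps: 2.66 scales the average moving range
# into natural process limits.
mean_gap = sum(gaps) / len(gaps)
mr_bar = sum(abs(b - a) for a, b in zip(gaps, gaps[1:])) / (len(gaps) - 1)
upper_limit = mean_gap + 2.66 * mr_bar

# A run of gaps hugging zero, or a gap beyond upper_limit, signals that
# the behaviour of the 'virality process' has changed.
```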
I'd be very interested to learn of other approaches.
[1] HN traffic for commoncog.com displays routine variation most weeks, with an Upper Process Limit of 192 and a Lower Process Limit of 0 — unless one of my articles hits the front page, at which point I get 11-16k additional uniques.
I have an upcoming article about my own lack of understanding of how to do this. It's not finished, but you may enjoy a near-finished draft. https://two-wrongs.com/extreme-value-spc
I did forget to bring up the Poisson approximation you mention though. I'll include that too.
The example of performance is interesting because as you say, there are often multiple jostling distributions under the surface (GC is one, but another doozy is CPU frequency scaling).
One possible way out is to look for measurements that contribute to running time but which are not affected by other factors. I remember the YJIT folks talking about using CPU instruction counters, but I can't find it on the benchmark website.
Time between events is an approach Montgomery (8th EMEA) discusses in 7.3.5. The application there is for dealing with very low error/defect rates. I am not familiar with Wheeler's approach.
A couple of quick notes, from someone who has actually put this to practice — and in a non-manufacturing context, to boot!
(From a brief reading of this thread, it seems like kqr, jacques_chester, and I are the only ones who have put this to practice in non-manufacturing contexts — though correct me if I'm wrong.)
The bulk of the debate in this HN thread seems to be centred around what is or isn't a 'stable process'. I think this is partially a terminology issue, which Donald Wheeler called out in the appendix of Understanding Variation. He recommends not using words like 'stable' or 'in-control', or even 'special cause variation', as the words are confusing ... and in his experience lead people to unfruitful discussions.
Instead, he suggests:
- Instead of calling this 'Statistical Process Control', call this 'Methods of Continual Improvement'
- Use the terms 'routine variation' and 'exceptional variation' whenever possible. In practice, I tend to use 'special variation' in discussion, not 'exceptional variation', simply because it's easier to say.
- Use the term 'process behaviour chart' instead of 'process control chart' — we use these charts to characterise the behaviour of a process, not merely to 'control' it.
- Use 'predictable process' and 'unpredictable process' (instead of 'stable'/'in-control' vs 'unstable'/'out-of-control' processes) because these are more reflective of the process behaviours. (e.g. a predictable process should reliably show us data between two limit lines).
Using this terminology, the right question to ask is: are there processes in software development that display routine variation? And the answer is yes, absolutely. kqr has given a list in this comment: https://news.ycombinator.com/item?id=39638491
In my experience, people who haven't actually tried to apply SPC techniques outside of manufacturing do not typically have a good sense for what kinds of processes display routine variation. I would urge you to see for yourself: collect data, and then plot it on an XmR chart. It usually takes you only a couple of seconds to see if it does or does not apply — at which point you may discard the chart if you do not find it useful. But you should discover that a surprisingly large chunk of processes do display some form of routine variation. (Source: I've taught this to a handful of folk by now — in various marketing/sales and software engineering roles — and they typically find some way to use XmR charts relatively quickly within their work domains).
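For anyone who wants to try this, here is a minimal sketch of the XmR computation (the weekly signup numbers are made up for illustration; 2.66 is Wheeler's standard constant, 3/1.128, applied to the average moving range):

```python
# Minimal XmR (individuals / moving range) chart sketch.

def xmr_limits(values):
    """Return (mean, lower natural process limit, upper limit)."""
    mean = sum(values) / len(values)
    # Moving ranges: absolute differences between consecutive points.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return mean, mean - 2.66 * mr_bar, mean + 2.66 * mr_bar

weekly_signups = [42, 38, 45, 40, 37, 44, 41, 90, 39, 43]
mean, lnpl, unpl = xmr_limits(weekly_signups)

# Points outside the limits are exceptional variation worth investigating.
exceptional = [x for x in weekly_signups if x < lnpl or x > unpl]
print(exceptional)  # [90] — the one week that isn't routine variation
```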
[Note: this 'XmR charts are surprisingly useful' is actually one of the major themes in Wheeler's Making Sense of Data — which was written specifically for usage in non-manufacturing contexts; the subtitle of the book is 'SPC for the Service Sector'. You should buy that book if you are serious about application!]
I realise that a bigger challenge with getting SPC adopted is as follows: why should I even use these techniques? What benefits might there be for me? If you don't think SPC is a powerful toolkit, you won't be bothered to look past the janky terminology or the weird statistics.
So here's my pitch: every Wednesday morning, Amazon's leaders get together to go through 400-500 metrics within one hour. This is the Amazon-style Weekly Business Review, or WBR. The WBR draws directly from SPC (early Amazon exec Colin Bryar told me that the WBR is but a 'process control tool' ... and the truth is that it stems from the same style of thinking that gives you the process behaviour chart). What is it good for? Well, the WBR helps Amazon's leaders build a shared causal model of their business, at which point they can iterate on that model to turn the screws on their competition and drive them out of business.
But in order to understand and implement the WBR, you must first understand some of the ideas of SPC.
If that whets your interest, here is a 9000 word essay I wrote to do exactly that, which stems from 1.5 years of personal research, and then practice, and then bad attempts at teaching it to other startup operator friends: https://commoncog.com/becoming-data-driven-first-principles/
I don't get into it too much, but the essay calls out various other applications of these ideas, amongst them the Toyota Production System (which was bootstrapped off a combination of ideas taught by W. Edwards Deming — including the SPC theory of variation), Koch Industries' rise to a powerful conglomerate, Iams pet foods, etc etc.
> (From a brief reading of this thread, it seems like kqr, jacques_chester, and I are the only ones who have put this to practice in non-manufacturing contexts — though correct me if I'm wrong.)
And roenxi.
> So here's my pitch: every Wednesday morning, Amazon's leaders get together to go through 400-500 metrics within one hour.
Amazon's core value proposition is they maintain a large and very physical fleet of machines that they rent out. With serious standards for up-time that they can take real pride in.
They don't sell themselves as a software house. I'm sure they have tentacles everywhere and they aren't bad at it (if anything I'd expect them to be pretty good on a given project), but they've greatly benefited from using other people's software - they don't have their own DB for example, they reuse others and have a couple of PostgreSQL forks for more at-scale use cases.
I'm sure they get huge value from SPC (anything physical generally benefits from it), and I'm sure they use SPC for software out of reflex; but it doesn't follow that it is driving productive behaviour in the software branch of the business. A fleet of ~infinite servers benefits from controlling 400 metrics. Software development does not.
What would you say if I told you Bryar has lots of stories of this style of thinking applied in early Amazon? This is pre-AWS Amazon, mind you — where they were trying to figure out how to build e-commerce web software at scale, from scratch. Granted, the bulk of their process control was directed at customer-facing controllable input metrics, but the software engineers were as much a part of it as the operational folks.
(To be fair to you, you are adamant that SPC does not apply to software development — which I take to mean measuring the productivity or act of building software. And I think we are all in agreement there! (That said, like kqr and jacques_chester, I want to believe that this has not been sufficiently explored) But it's not true that SPC has no place in software development — one way I've used this is that because XmR charts detect changes in variation, you can use it in a customer-facing software context to see if a feature change has resulted in user behaviour change without running an A/B test. Naturally, it makes sense to have the software engineer be responsible for observing this behaviour change themselves, since XmR charts are easy enough for the layman to use, and it gives them a sense of ownership for the feature or change. Some detail (on usage vs A/B tests) here: https://commoncog.com/two-types-of-data-analysis/)
Saw this on twitter...I actually think SPC can apply to Software Development in that the concept of normal variation, and being able to understand and measure the range, can be pretty useful. More detailed comment here if interested...
Very interesting to get the perspective of someone who did this in a non-manufacturing environment. One interesting bit, for someone like me who knows SPC from manufacturing-related processes, is the discussion around what a stable process is. Because I cannot remember a single one of those discussions ever in manufacturing-related fields. Intriguing, especially since on HN discussions sometimes miss the point by turning into disputes about the exact definition of a term — something that sounds very similar to the "misunderstandings" about stuff like special-cause variation you described.
Edit: Fully agree on the Amazon style WBR, what you said is exactly what is happening at Amazon. Daily during Q4 peak for a large enough subset of metrics.
Loved the post. Just to add to the principal observation about the vocab point — One thing I’ve noticed is that experts who can communicate/teach well are very effective at getting you to the vocab point (to the extent possible without experience) quickly! And once you reach the vocab point, the field is legible enough that you have the means to organize deliberate practice and interpret experience (if you want to go towards mastery) or delegate details (if you don’t need them). I wonder if this provokes any more thoughts from you.
I'm glad you were able to articulate this idea so clearly. I think similar advice which maybe isn't so clear is to try to avoid "being the smartest person in the room".
On the topic, I'm reminded of coworkers talking about the most recent Super Bowl (I think, I don't follow sports) which included commentary from players who played as quarterbacks in the same season. One of the things my coworkers mentioned was how interesting it was to hear them talk about the game because they used unfamiliar terms.
Yes, rest assured the previous article talks about exactly this. (I mean, the title of this piece should give you a hint: “Pay attention to deviations from mainstream incentives” because the mainstream incentives in F&B are just so, so bad.)
Discourse is one of the best pieces of software I've ever used — I self-host, and it's never given me any problems. There's built-in backups to S3, I can upgrade the software 80% of the time from a web interface (even on my phone), and the community and moderation features are incredibly well thought out. So much so that if you take the time to slowly go through all the options in the admin panels, you'd realise that the team has thought through 90% of all community interaction models.
I'm looking forward to trying Discourse chat — if the level of engineering and thoughtfulness is the same as the rest of the feature-set, it'll be a great addition to my community.
https://commoncog.com/becoming-data-driven-first-principles/
https://commoncog.com/the-amazon-weekly-business-review/