This blog looks at studies comparing expertise in many fields over decades, including work by Tetlock and Kahneman, and at problems such as why people don’t learn to use even simple tools to stop children dying unnecessarily. There is a summary of some basic lessons at the end.
The reason for writing about this is that we will only improve the performance of government (at individual, team and institutional levels) if we reflect on:
- what expertise really is and why some very successful fields cultivate it effectively while others, like government, do not;
- how to select much higher quality people (it’s insane that people as ignorant and limited as me can have the influence we do in the way we do; us limited duffers can help in limited ways, but why do we deliberately exclude ~100% of the most intelligent, talented, relentless, high performing people from fields with genuine expertise? Why do we not have people like Fields Medallist Tim Gowers or Michael Nielsen as Chief Scientist sitting ex officio in Cabinet?);
- how to train people effectively to develop true expertise in skills relevant to government: it needs different intellectual content (PPE/economics are NOT good introductory degrees) and practice in practical skills (project management, making predictions and in general ‘thinking rationally’) with lots of fast, accurate feedback;
- how to give them effective tools: e.g the Cabinet Room is worse in this respect than it was in July 1914 — at least then the clock and fireplace worked, and Lord Salisbury in the 1890s would walk round the Cabinet table gathering papers to burn in the grate — while today No10 is decades behind the state-of-the-art in old technologies like TV, doesn’t understand simple tools like checklists, and is nowhere with advanced technologies;
- and how to ‘program’ institutions differently so that 1) people are more incentivised to optimise things we want them to optimise, like error-correction and predictive accuracy, and less incentivised to optimise bureaucratic process, prestige, and signalling as our institutions now do to a dangerous extent, and, connected, so that 2) institutions are much better at building high performance teams rather than continuing with normal rules that make this practically illegal, and so that 3) we have ‘immune systems’ to minimise the inevitable failures of even the best people and teams.
In SW1 now, those at the apex of power practically never think in a serious way about the reasons for the endemic dysfunctional decision-making that constitutes most of their daily experience or how to change it. What looks like omnishambles to the public and high performers in technology or business is seen by Insiders, always implicitly and often explicitly, as ‘normal performance’. ‘Crises’ such as the collapse of Carillion or our farcical multi-decade multi-billion ‘aircraft carrier’ project occasionally provoke a few days of headlines but it’s very rare anything important changes in the underlying structures and there is no real reflection on system failure.
This fact is why, for example, a startup created in a few months could win a referendum that should have been unwinnable. It was the systemic and consistent dysfunction of Establishment decision-making systems over a long period, with very poor mechanisms for good accurate feedback from reality, that created the space for a guerrilla operation to exploit.
This makes it particularly ironic that even after Westminster and Whitehall have allowed their internal consensus about UK national strategy to be shattered by the referendum, there is essentially no serious reflection on this system failure. It is much more psychologically appealing for Insiders to blame ‘lies’ (Blair and Osborne really say this without blushing), devilish use of technology to twist minds and so on. Perhaps the most profound aspect of broken systems is they cannot reflect on the reasons why they’re broken — never mind take effective action. Instead of serious thought, we have high status Insiders like Campbell reduced to bathos with whining on social media about Brexit ‘impacting mental health’. This lack of reflection is why Remain-dominated Insiders lurched from failure over the referendum to failure over negotiations. OODA loops across SW1 are broken and this is very hard to fix — if you can’t orient to reality how do you even see your problem well? (NB. It should go without saying that there is a faction of pro-Brexit MPs, ‘campaigners’ and ‘pro-Brexit economists’ who are at least as disconnected from reality, often more, as the May/Hammond bunker.)
In the commercial world, big companies mostly die within a few decades because they cannot maintain an internal system to keep them aligned to reality, and startups pop up. These two factors create learning at a system level: there is lots of micro failure but macro productivity/learning in which useful information is compressed and abstracted. In the political world, big established failing systems control the rules, suck in more and more resources rather than go bust, make it almost impossible for startups to contribute and so on. Even failures on the scale of the 2008 Crash or the 2016 referendum do not necessarily make broken systems face reality, at least quickly. Watching Parliament’s obsession with trivia in the face of the Cabinet’s and Whitehall’s contemptible failure to protect the interests of millions in the farcical Brexit negotiations is like watching the secretary to the Singapore Golf Club objecting to guns being placed on the links as the Japanese troops advanced.
Neither of the main parties has internalised the reality of these two crises. The Tories won’t face reality on things like corporate looting and the NHS, Labour won’t face reality on things like immigration and the limits of bureaucratic centralism. Neither can cope with the complexity of Brexit and both look as I would in the ring with a professional fighter: baffled, terrified and desperate for a way to escape. There are so many simple ways to improve performance (and their own popularity!) but the system is stuck in such a closed loop it wilfully avoids seeing even the most obvious things and suppresses Insiders who want to do things differently…
But… there is a network of almost entirely younger people inside or close to the system thinking ‘we could do so much better than this’. Few senior Insiders are interested in these questions but that’s OK — few of them listened before the referendum either. It’s not the people now in power and running the parties and Whitehall who will determine whether we make Brexit a platform to contribute usefully to humanity’s biggest challenges but those that take over.
Doing better requires reflecting on what we know about real expertise…
How to distinguish between fields dominated by real expertise and those dominated by confident ‘experts’ who make bad predictions?
We know a lot about the distinction between fields in which there is real expertise and fields dominated by bogus expertise. Daniel Kahneman, who has published some of the most important research about expertise and prediction, summarises the two fundamental tests to ask about a field: 1) is there enough informational structure in the environment to allow good predictions, and 2) is there timely and effective feedback that enables error-correction and learning.
‘To know whether you can trust a particular intuitive judgment, there are two questions you should ask: Is the environment in which the judgment is made sufficiently regular to enable predictions from the available evidence? The answer is yes for diagnosticians, no for stock pickers. Do the professionals have an adequate opportunity to learn the cues and the regularities? The answer here depends on the professionals’ experience and on the quality and speed with which they discover their mistakes. Anesthesiologists have a better chance to develop intuitions than radiologists do. Many of the professionals we encounter easily pass both tests, and their off-the-cuff judgments deserve to be taken seriously. In general, however, you should not take assertive and confident people at their own evaluation unless you have independent reason to believe that they know what they are talking about.’ (Emphasis added.)
In fields where these two elements are present there is genuine expertise and people build new knowledge on the reliable foundations of previous knowledge. Some fields make a transition from stories (e.g Icarus) and authority (e.g ‘witch doctor’) to quantitative models (e.g modern aircraft) and evidence/experiment (e.g some parts of modern medicine/surgery). As scientists have said since Newton, they stand on the shoulders of giants.
How do we assess predictions / judgement about the future?
‘Good judgment is often gauged against two gold standards – coherence and correspondence. Judgments are coherent if they demonstrate consistency with the axioms of probability theory or propositional logic. Judgments are correspondent if they agree with ground truth. When gold standards are unavailable, silver standards such as consistency and discrimination can be used to evaluate judgment quality. Individuals are consistent if they assign similar judgments to comparable stimuli, and they discriminate if they assign different judgments to dissimilar stimuli.
‘Coherence violations range from base rate neglect and confirmation bias to overconfidence and framing effects (Gilovich, Griffin & Kahneman, 2002; Kahneman, Slovic & Tversky, 1982). Experts are not immune. Statisticians (Christensen-Szalanski & Bushyhead, 1981), doctors (Eddy, 1982), and nurses (Bennett, 1980) neglect base rates. Physicians and intelligence professionals are susceptible to framing effects and financial investors are prone to overconfidence.
‘Research on correspondence tells a similar story. Numerous studies show that human predictions are frequently inaccurate and worse than simple linear models in many domains (e.g. Meehl, 1954; Dawes, Faust & Meehl, 1989). Once again, expertise doesn’t necessarily help. Inaccurate predictions have been found in parole officers, court judges, investment managers in the US and Taiwan, and politicians. However, expert predictions are better when the forecasting environment provides regular, clear feedback and there are repeated opportunities to learn (Kahneman & Klein, 2009; Shanteau, 1992). Examples include meteorologists, professional bridge players, and bookmakers at the racetrack, all of whom are well-calibrated in their own domains.’ (Tetlock, How generalizable is good judgment?, 2017.)
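‘Correspondence’ can be given a number. Here is a minimal sketch, assuming probability forecasts for yes/no events and using the Brier score, a common scoring rule that the passage above does not itself name. The function name and the three forecasters are invented for illustration.

```python
def brier_score(forecasts, outcomes):
    """Mean squared gap between forecast probabilities and what happened (1 or 0).

    0.0 is perfect, always saying 50% scores 0.25, and lower is better.
    """
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equal-length, non-empty lists")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical data: four events, 1 = happened, 0 = did not.
outcomes = [1, 0, 1, 0]
confident_pundit = [0.9, 0.9, 0.9, 0.9]  # always sure, right half the time
hedger = [0.5, 0.5, 0.5, 0.5]            # never commits
good_forecaster = [0.8, 0.2, 0.7, 0.1]   # discriminates between cases

scores = {
    "confident_pundit": brier_score(confident_pundit, outcomes),
    "hedger": brier_score(hedger, outcomes),
    "good_forecaster": brier_score(good_forecaster, outcomes),
}
```

On these made-up numbers the confident pundit scores worse than someone who says 50% to everything, which is the point about not taking assertive people at their own evaluation.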
In another 2017 piece Tetlock explored the studies further. In the 1920s researchers built simple models based on expert assessments of 500 ears of corn and the price they would fetch in the market. They found that ‘to everyone’s surprise, the models that mimicked the judges’ strategies nearly always performed better than the judges themselves’ (Tetlock, cf. ‘What Is in the Corn Judge’s Mind?’, Journal of the American Society of Agronomy, 1923). Banks found the same when they introduced models for credit decisions.
‘In other fields, from predicting the performance of newly hired salespeople to the bankruptcy risks of companies to the life expectancies of terminally ill cancer patients, the experience has been essentially the same. Even though experts usually possess deep knowledge, they often do not make good predictions…
‘When humans make predictions, wisdom gets mixed with “random noise.”… Bootstrapping, which incorporates expert judgment into a decision-making model, eliminates such inconsistencies while preserving the expert’s insights. But this does not occur when human judgment is employed on its own…
‘In fields ranging from medicine to finance, scores of studies have shown that replacing experts with models of experts produces superior judgments. In most cases, the bootstrapping model performed better than experts on their own. Nonetheless, bootstrapping models tend to be rather rudimentary in that human experts are usually needed to identify the factors that matter most in making predictions. Humans are also instrumental in assigning scores to the predictor variables (such as judging the strength of recommendation letters for college applications or the overall health of patients in medical cases). What’s more, humans are good at spotting when the model is getting out of date and needs updating…
‘Human experts typically provide signal, noise, and bias in unknown proportions, which makes it difficult to disentangle these three components in field settings. Whether humans or computers have the upper hand depends on many factors, including whether the tasks being undertaken are familiar or unique. When tasks are familiar and much data is available, computers will likely beat humans by being data-driven and highly consistent from one case to the next. But when tasks are unique (where creativity may matter more) and when data overload is not a problem for humans, humans will likely have an advantage…
‘One might think that humans have an advantage over models in understanding dynamically complex domains, with feedback loops, delays, and instability. But psychologists have examined how people learn about complex relationships in simulated dynamic environments (for example, a computer game modeling an airline’s strategic decisions or those of an electronics company managing a new product). Even after receiving extensive feedback after each round of play, the human subjects improved only slowly over time and failed to beat simple computer models. This raises questions about how much human expertise is desirable when building models for complex dynamic environments. The best way to find out is to compare how well humans and models do in specific domains and perhaps develop hybrid models that integrate different approaches.’ (Tetlock)
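The ‘bootstrapping’ idea (a model of the judge beating the judge) can be sketched in a few lines. This is a toy simulation under loud assumptions: two cues, a judge whose policy is right on average but noisy, and ordinary least squares as the model of the judge. All names, weights and noise levels are invented; nothing here reproduces the corn study.

```python
import random


def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def fit_two_cue_model(x1, x2, y):
    """Least squares for y ~ b0 + b1*x1 + b2*x2, via the 2x2 normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return my - b1 * m1 - b2 * m2, b1, b2


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 500
cue1 = [rng.gauss(0, 1) for _ in range(n)]
cue2 = [rng.gauss(0, 1) for _ in range(n)]

# Assumed world: the outcome depends on the cues plus luck.
truth = [0.6 * a + 0.4 * b + rng.gauss(0, 0.5) for a, b in zip(cue1, cue2)]
# Assumed judge: same policy on average, plus a lot of case-by-case noise.
judge = [0.6 * a + 0.4 * b + rng.gauss(0, 1.0) for a, b in zip(cue1, cue2)]

# The model is fitted to the judge's ratings only; it never sees the outcomes.
b0, b1, b2 = fit_two_cue_model(cue1, cue2, judge)
model = [b0 + b1 * a + b2 * b for a, b in zip(cue1, cue2)]

judge_r = pearson(judge, truth)  # how well the judge tracks reality
model_r = pearson(model, truth)  # how well the model of the judge does
```

The model keeps the judge’s weighting of the cues and throws away the inconsistency, so in this setup it tracks the outcome more closely than the judge. If the judge’s policy were systematically wrong rather than just noisy, the model would faithfully copy the bias too.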
Kahneman also recently published new work relevant to this.
‘Research has confirmed that in many tasks, experts’ decisions are highly variable: valuing stocks, appraising real estate, sentencing criminals, evaluating job performance, auditing financial statements, and more. The unavoidable conclusion is that professionals often make decisions that deviate significantly from those of their peers, from their own prior decisions, and from rules that they themselves claim to follow.’
In general organisations spend almost no effort figuring out how noisy the predictions made by senior staff are and how much this costs. Kahneman has done some ‘noise audits’ and shown companies that management make MUCH more variable predictions than people realise.
‘What prevents companies from recognizing that the judgments of their employees are noisy? The answer lies in two familiar phenomena: Experienced professionals tend to have high confidence in the accuracy of their own judgments, and they also have high regard for their colleagues’ intelligence. This combination inevitably leads to an overestimation of agreement. When asked about what their colleagues would say, professionals expect others’ judgments to be much closer to their own than they actually are. Most of the time, of course, experienced professionals are completely unconcerned with what others might think and simply assume that theirs is the best answer. One reason the problem of noise is invisible is that people do not go through life imagining plausible alternatives to every judgment they make.
‘High skill develops in chess and driving through years of practice in a predictable environment, in which actions are followed by feedback that is both immediate and clear. Unfortunately, few professionals operate in such a world. In most jobs people learn to make judgments by hearing managers and colleagues explain and criticize—a much less reliable source of knowledge than learning from one’s mistakes. Long experience on a job always increases people’s confidence in their judgments, but in the absence of rapid feedback, confidence is no guarantee of either accuracy or consensus.’
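A ‘noise audit’ needs very little machinery. One simple way to run the sums, sketched here as an assumption rather than as Kahneman’s exact procedure: give several professionals the same case, then for every pair take the difference between their judgments as a fraction of the pair’s average. The function name and figures are invented, and the sketch assumes positive numeric judgments such as valuations.

```python
from itertools import combinations


def noise_index(judgments):
    """Average over all pairs of |a - b| / mean(a, b) for one shared case."""
    pairs = list(combinations(judgments, 2))
    if not pairs:
        raise ValueError("need at least two judgments of the same case")
    return sum(abs(a - b) / ((a + b) / 2) for a, b in pairs) / len(pairs)


# Hypothetical: three professionals value the same claim.
case_valuations = [600, 1000, 800]
index = noise_index(case_valuations)

# Two judges saying 600 and 1000 differ by 400 on an average of 800.
two_judge_index = noise_index([600, 1000])
```

The useful step is the comparison: ask managers beforehand what spread they expect, then show them the measured figure.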
Reviewing the point that Tetlock makes about simple models beating experts in many fields, Kahneman summarises the evidence:
‘People have competed against algorithms in several hundred contests of accuracy over the past 60 years, in tasks ranging from predicting the life expectancy of cancer patients to predicting the success of graduate students. Algorithms were more accurate than human professionals in about half the studies, and approximately tied with the humans in the others. The ties should also count as victories for the algorithms, which are more cost-effective…
‘The common assumption is that algorithms require statistical analysis of large amounts of data. For example, most people we talk to believe that data on thousands of loan applications and their outcomes is needed to develop an equation that predicts commercial loan defaults. Very few know that adequate algorithms can be developed without any outcome data at all — and with input information on only a small number of cases. We call predictive formulas that are built without outcome data “reasoned rules,” because they draw on commonsense reasoning.
‘The construction of a reasoned rule starts with the selection of a few (perhaps six to eight) variables that are incontrovertibly related to the outcome being predicted. If the outcome is loan default, for example, assets and liabilities will surely be included in the list. The next step is to assign these variables equal weight in the prediction formula, setting their sign in the obvious direction (positive for assets, negative for liabilities). The rule can then be constructed by a few simple calculations.
‘The surprising result of much research is that in many contexts reasoned rules are about as accurate as statistical models built with outcome data. Standard statistical models combine a set of predictive variables, which are assigned weights based on their relationship to the predicted outcomes and to one another. In many situations, however, these weights are both statistically unstable and practically unimportant. A simple rule that assigns equal weights to the selected variables is likely to be just as valid. Algorithms that weight variables equally and don’t rely on outcome data have proved successful in personnel selection, election forecasting, predictions about football games, and other applications.
‘The bottom line here is that if you plan to use an algorithm to reduce noise, you need not wait for outcome data. You can reap most of the benefits by using common sense to select variables and the simplest possible rule to combine them…
‘Uncomfortable as people may be with the idea, studies have shown that while humans can provide useful input to formulas, algorithms do better in the role of final decision maker. If the avoidance of errors is the only criterion, managers should be strongly advised to overrule the algorithm only in exceptional circumstances.’
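The ‘reasoned rule’ recipe above (pick a few variables, fix each sign in the obvious direction, weight them equally) is short enough to write down. A minimal sketch: standardising each variable before adding is my reading of ‘a few simple calculations’, and the applicants, variables and figures are invented.

```python
from statistics import mean, pstdev


def zscores(values):
    """Put a variable on a common scale (assumes the values are not all equal)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def reasoned_rule(cases, signs):
    """Score each case as an equal-weight sum of standardised variables.

    cases: list of dicts, one per case. signs: {variable: +1 or -1}.
    Higher score = better predicted outcome. No outcome data is used.
    """
    columns = {var: zscores([c[var] for c in cases]) for var in signs}
    return [
        sum(signs[var] * columns[var][i] for var in signs)
        for i in range(len(cases))
    ]


# Hypothetical loan applicants.
applicants = [
    {"name": "A", "assets": 900, "liabilities": 100, "years_trading": 12},
    {"name": "B", "assets": 300, "liabilities": 400, "years_trading": 3},
    {"name": "C", "assets": 500, "liabilities": 200, "years_trading": 7},
]
# Signs set in the obvious direction: assets help, liabilities hurt.
signs = {"assets": +1, "liabilities": -1, "years_trading": +1}

scores = reasoned_rule(applicants, signs)
ranking = [
    a["name"]
    for _, a in sorted(zip(scores, applicants), key=lambda pair: -pair[0])
]
```

The same rule applied to every case gives the same answer every time, which is where the noise reduction comes from.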
Jim Simons is a mathematician and founder of the world’s most successful ‘quant fund’, Renaissance Technologies. While market prices appear close to random and are therefore extremely hard to predict, they are not quite random and the right models/technology can exploit these small and fleeting opportunities. One of the lessons he learned early was: Don’t turn off the model and go with your gut. At Renaissance, they trust models over instincts. The Bridgewater hedge fund led by Ray Dalio is similar. After near destruction early in his career, Dalio turned towards explicit model building as the basis for decisions combined with radical attempts to create an internal system that incentivises the optimisation of error-correction. It works.
People fail to learn from even the great examples of success and the simplest lessons
One of the most interesting meta-lessons of studying high performance, though, is that simply demonstrating extreme success does NOT lead to much learning. For example:
- ARPA and PARC created the internet and PC. The PARC research team was an extraordinary collection of about two dozen people who were managed in a very unusual way that created super-productive processes extremely different to normal bureaucracies. XEROX, which owned PARC, had the entire future of the computer industry in its own hands, paid for by its own budgets, yet it let Bill Gates and Steve Jobs steal everything and then shut down the research team that did it. And then, as Silicon Valley grew on the back of these efforts, almost nobody, including most of the billionaires who got rich from the dynamics created by ARPA-PARC, studied the nature of the organisation and processes and copied it. Even today, those trying to do edge-of-the-art research in a similar way to PARC right at the heart of the Valley ecosystem are struggling for long-term patient funding. As Alan Kay, one of the PARC team, said, ‘The most interesting thing has been the contrast between appreciation/exploitation of the inventions/contributions [of PARC] versus the almost complete lack of curiosity and interest in the processes that produced them.’ ARPA survived being abolished in the 1970s but it was significantly changed and is no longer the freewheeling place that it was in the 1960s when it funded the internet. In many ways DARPA’s approach now is explicitly different to the old ARPA (the addition of the ‘D’ was a sign of internal bureaucratic changes).
- ‘Systems management’ was invented in the 1950s and 1960s (partly based on wartime experience of large complex projects) to deal with the classified ICBM project and Apollo. It put man on the moon, then NASA largely abandoned the approach and reverted to being (relative to 1963-9) a normal bureaucracy. Most of Washington has ignored the lessons ever since: look for example at the collapse of ObamaCare’s rollout, after which Insiders said ‘oh, looks like it was a system failure, wonder how we deal with this’, mostly unaware that America had developed a successful approach to such projects half a century earlier. This is particularly interesting given that China also studied George Mueller’s approach to systems management in Apollo and as we speak is copying it in projects across China. The EU’s bureaucracy is, like Whitehall, an anti-checklist to high level systems management, i.e. both violate almost every principle of effective action.
- Buffett and Munger are the most successful investment partnership in world history. Every year for half a century they have explained some basic principles, particularly concerning incentives, behind organisational success. Practically no public companies take their advice and all around us in Britain we see vast corporate looting and politicians of all parties failing to act — they don’t even read the Buffett/Munger lessons and think about them. Even when given these lessons to read, they won’t read them (I know this because I’ve tried).
Perhaps you’re thinking — well, learning from these brilliant examples might be intrinsically really hard, much harder than Cummings thinks. I don’t think this is quite right. Why? Partly because millions of well-educated and normally-ethical people don’t learn even from much simpler things.
I will explore this separately soon but I’ll give just one example. The world of healthcare unnecessarily kills and injures people on a vast scale. Two aspects of this are 1) a deep resistance to learning from the success of very simple tools like checklists and 2) a deep resistance to facing the fact that most medical experts do not understand statistics properly and their routine misjudgements cause vast suffering, plus warped incentives encourage widespread lies about statistics and irrational management. E.g. people are constantly told things like ‘you’ve tested positive for X therefore you have X’ and they then kill themselves. We KNOW how to practically eliminate certain sorts of medical injury/death. We KNOW how to teach and communicate statistics better. (Cf. Professor Gigerenzer for details. He was the motivation for including things like conditional probabilities in the new National Curriculum.) These are MUCH simpler than building ICBMs, putting man on the moon, creating the internet and PC, or being great investors. Yet our societies don’t do them.
Why? Because we do not incentivise error-correction and predictive accuracy. People are not incentivised to consider the cost of their noisy judgements. Where incentives and culture are changed, performance magically changes. It is the nature of the systems, not (mostly) the nature of the people, that is the crucial ingredient in learning from proven simple success. In healthcare, as in government generally, people are incentivised to engage in wasteful/dangerous signalling to a terrifying degree, not in rigorous thinking and solving problems.
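To make the ‘you’ve tested positive for X therefore you have X’ error concrete, here is a minimal sketch of the conditional probability involved. The prevalence, sensitivity and false positive rate are invented for illustration and describe no real test; the second half restates the same sum as counts of people, which is the kind of presentation Gigerenzer argues is easier to grasp.

```python
def probability_of_condition_given_positive(prevalence, sensitivity, false_positive_rate):
    """Bayes' rule: P(has X | tested positive)."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Hypothetical test: 1% of people have X, the test catches 90% of them,
# and it wrongly flags 9% of the people who do not have X.
p = probability_of_condition_given_positive(
    prevalence=0.01, sensitivity=0.90, false_positive_rate=0.09
)

# The same sum as natural frequencies, per 1,000 people tested.
population = 1000
with_x = round(population * 0.01)                            # 10 people have X
with_x_positive = round(with_x * 0.90)                       # 9 of them test positive
without_x_positive = round((population - with_x) * 0.09)     # about 89 false alarms
share_of_positives_with_x = with_x_positive / (with_x_positive + without_x_positive)
```

On these assumed numbers, roughly nine in ten people who test positive do not have X, which is a long way from ‘therefore you have X’.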
I have experienced the problem with checklists first hand in the Department for Education when trying to get the social worker bureaucracy to think about checklists in the context of avoiding child killings like Baby P. Professionals tend to see them as undermining their status and bureaucracies fight against learning, even when some great officials try really hard (as some in the DfE did such as Pamela Dow and Victoria Woodcock). ‘Social work is not the same as an airline Dominic’. No shit. Airlines can handle millions of people without killing one of them because they align incentives with predictive accuracy and error-correction.
Some appalling killings are inevitable but the social work bureaucracy will keep allowing unnecessary killings because they will not align incentives with error-correction. Undoing flawed incentives threatens the system so they’ll keep killing children instead, and they’re not particularly bad people, they’re normal people in a normal bureaucracy. The pilot dies with the passengers. The ‘CEO’ on over £150,000 a year presiding over another unnecessary death despite constantly increasing taxpayers’ money pouring in? Issue a statement that ‘this must never happen again’, tell the lawyers to redact embarrassing cockups on the grounds of ‘protecting someone’s anonymity’ (the ECHR is a great tool to cover up death by incompetence), fuck off to the golf course, and wait for the media circus to move on.
Why do so many things go wrong? Because usually nobody is incentivised to work relentlessly to suppress entropy, never mind come up with something new.
We can see some reasonably clear conclusions from decades of study on expertise and prediction in many fields.
- Some fields are like extreme sport or physics: genuine expertise emerges because of fast effective feedback on errors.
- Abstracting human wisdom into models often works better than relying on human experts as models are often more consistent and less noisy.
- Models are also often cheaper and simpler to use.
- Models do not have to be complex to be highly effective — quite the opposite, often simpler models outperform more sophisticated and expensive ones.
- In many fields (which I’ve explored before but won’t go into again here) low tech very simple checklists have been extremely effective: e.g flying aircraft or surgery.
- Successful individuals like Warren Buffett and Ray Dalio also create cognitive checklists to trap and correct normal cognitive biases that degrade individual and team performance.
- Fields make progress towards genuine expertise when they make a transition from stories (e.g Icarus) and authority (e.g ‘witch doctor’) to quantitative models (e.g modern aircraft) and evidence/experiment (e.g some parts of modern medicine/surgery).
- In the intellectual realm, maths and physics are fields dominated by genuine expertise and provide a useful benchmark to compare others against. They are also hierarchical: new knowledge is built on the reliable foundations of previous knowledge. Social sciences have little in common with this.
- Even when we have great examples of learning and progress, and we can see the principles behind them are relatively simple and do not require high intelligence to understand, they are so psychologically hard and run so counter to the dynamics of normal big organisations that almost nobody learns from them. Extreme success is ‘easy to learn from’ in one sense and ‘the hardest thing in the world to learn from’ in another sense.
It is fascinating how remarkably little interest there is in the world of politics/government, and in the social sciences analysing politics/government, in all this evidence. This is partly because politics/government is an anti-learning and anti-expertise field, partly because the social sciences are swamped by what Feynman called ‘cargo cult science’ with very noisy predictions, little good feedback and learning, and a lot of chippiness at criticism whether it’s from statistics experts or the ‘ignorant masses’. Fields like ‘education research’ and ‘political science’ are particularly dreadful and packed with charlatans but much of economics is not much better (much pro- and anti-Brexit mainstream economics is classic ‘cargo cult’).
I have found there is overwhelmingly more interest in high technology circles than in government circles, but in high technology circles there is also a lot of incredulity and naivety about how government works — many assume politicians are trying and failing to achieve high performance and don’t realise that in fact nobody is actually trying. This illusion extends to many well-connected businessmen who just can’t internalise the reality of the apex of power. I find that uneducated people on 20k living hundreds of miles from SW1 generally have a more accurate picture of daily No10 work than extremely well-connected billionaires.
This is all sobering and is another reason to be pessimistic about the chances of changing government from ‘normal’ to ‘high performance’ — but, pessimism of the intellect, optimism of the will…
If you are in Whitehall now watching the Brexit farce or abroad looking at similar, you will see from page 26 HERE a checklist for how to manage complex government projects at world class levels (if you find this interesting then read the whole paper). I will elaborate on this. I am also thinking about a project to look at the intersection of (roughly) five fields in order to make large improvements in the quality of people, ideas, tools, and institutions that determine political/government decisions and performance:
- the science of prediction across different fields (e.g early warning systems, the Tetlock/IARPA project showing dramatic performance improvements),
- what we know about high performance (individual/team/organisation) in different fields (e.g China’s application of ‘systems management’ to government),
- technology and tools (e.g Bret Victor’s work, Michael Nielsen’s work on cognitive technologies, work on human-AI ‘minotaur’ teams),
- political/government decision making affecting millions of people and trillions of dollars (e.g WMD, health), and
- communication (e.g crisis management, applied psychology).
Progress requires attacking the ‘system of systems’ problem at the right ‘level’. Attacking the problems directly — let’s improve policy X and Y, let’s swap ‘incompetent’ A for ‘competent’ B — cannot touch the core problems, particularly the hardest meta-problem that government systems bitterly fight improvement. Solving the explicit surface problems of politics and government is best approached by a more general focus on applying abstract principles of effective action. We need to surround relatively specific problems with a more general approach. Attack at the right level will see specific solutions automatically ‘pop out’ of the system. One of the most powerful simplicities in all conflict (almost always unrecognised) is: ‘winning without fighting is the highest form of war’. If we approach the problem of government performance at the right level of generality then we have a chance to solve specific problems ‘without fighting’ — or, rather, without fighting nearly so much and the fighting will be more fruitful.
This is not a theoretical argument. If you look carefully at ancient texts and modern case studies, you see that applying a small number of very simple, powerful, but largely unrecognised principles (that are very hard for organisations to operationalise) can produce extremely surprising results.
How to jump from the Idea to Reality? More soon…
PS. Just as I was about to hit publish on this, the DCMS Select Committee released their report on me. The sentence about the Singapore golf club at the top comes to mind.