Why Evidence-based Policymaking Is Overrated
From parachutes and drugs to asylum bans, and the virtue of honest judgment
Three years ago, the United States declared the end of the pandemic emergency, and the country exhaled. You, dear reader, have probably filed the whole thing away by now. I’m sorry, but I just cannot let it go, and reading Macedo and Lee’s excellent “In Covid’s Wake” on my way back from Ireland last week reminded me why. The virus, of course, was not the fault of any of us in particular.[1] But how we reasoned about it as a democratic society was absolutely our fault. The so-called experts spent years telling the rest of us to “follow the science,” then claimed a certainty the science never gave them, dressed value judgments up as technical ones, told plenty of noble lies for our own good, and waved away every cost that sat outside their own narrow expertise. We were drowning in data, but what we lacked was honest judgment.
This knee-jerk evidence-based mindset is not just a pandemic thing, and it shows up wherever we treat one kind of data as the only kind that counts. In 2018, a top medical journal published a randomized controlled trial (RCT) finding that parachutes did nothing to prevent death or injury when people jumped from an aircraft. The catch was that the planes were parked on the ground, and the mean jump altitude was about half a meter. The whole study, of course, was a joke. The authors were not against experiments as such, but they were mocking the common reflex among their colleagues that treats RCTs as the only respectable form of knowledge, even about a claim you could check by looking out the window of a plane.[2]
I have spent much of my professional life contributing to and asking for better evidence in immigration debates, usually to the quiet exasperation of people on my own side, so I am not about to start sneering at data and acting on “vibes.” Good policymaking does need evidence, cost-benefit analysis, and careful counterfactual thinking. But it also needs humility and better judgment about what kind of evidence, if any, a given question actually requires.
There are some policies that are so obviously good that we should not need a perfect study to try them, like legally allowing more housing where demand is high, keeping reliable low-carbon power online when the substitute is fossil fuel, or increasing visas for foreign top talent that everyone says they want. Reasonable people can argue about details and trade-offs. But on questions like these, the case for action does not depend on a perfect RCT, and the burden of proof should not be infinite.
There are also policies so obviously bad that we should not need a study to stop them, like asylum seeker work bans. Seriously, you do not need a randomized trial, or any hard evidence for that matter, to predict what happens when you forbid a willing adult in a legal limbo to work, pay to shelter them instead, and then point to their idleness as proof the system is broken. Most of the hardest calls in policy look more like this than some great mystery requiring a well-designed experiment.
What “evidence” actually means
One rung down from the parachute sits dental floss. I personally don’t enjoy flossing, but most of us can probably feel the difference after flossing out a big meal, and our dentists can see it at the next cleaning. Yet when some journalists seized on a review that found little high-quality randomized evidence for flossing, a run of headlines announced there was no evidence it works at all. As the historian of science Naomi Oreskes points out, that was a misreading. We should be “broad-minded about evidence,” she argues, counting professional experience and ordinary observation, especially where a clean long-term trial is impractical or will never be funded.
The same confusion, though with far higher stakes, ran through the Covid pandemic. In their recent (highly recommended) book on the topic, Stephen Macedo and Frances Lee make a startling case: much of the relevant knowledge on how to deal with a respiratory pandemic was already there, but governments across the West largely set it aside. Before 2020, the dominant pandemic preparedness plans were skeptical of sweeping measures like lockdowns and prolonged school closures, warning that the evidence for them was weak and the human and economic costs high. In the panic of early 2020, those governments scrapped that guidance almost overnight and then projected a confidence that science never supported.
It helps to sort things by the kind of feedback they give, and to what end. A parachute gives the simplest kind: the benefit of staying alive is immediate, individual, and impossible to miss, so a trial would only confirm what anyone can already see. A new drug or vaccine gives the opposite: the benefits and costs may be genuine, but they are also much more diverse and often invisible. The infection that never arrives is easy to confuse with luck or the body healing on its own, which is exactly why a randomized trial with placebos is essential and why modern medicine runs on them.
Most government policy sits in between. You can usually tell whether a policy points toward gain or loss long before you can put a clean number on it, and the effects run through labor markets, prices, and politics all at once, so no single trial can isolate them, and often none can be run at all. But you cannot settle it just by looking, the way you can with a parachute, and you cannot settle it with one clean experiment, the way you can with a drug. What is left is judgment based on cumulative evidence, comparisons across places and times, and honest cost-benefit reasoning under uncertainty, all stated openly instead of disguised as settled science.
Another casualty of our actually existing knee-jerk evidence-based policy is the prior question of what even counts as relevant evidence, and whose expertise gets to decide. Here, Macedo and Lee are devastating. As they document, a niche cohort of public health and infectious disease experts was suddenly treated as the only legitimate authority on a crisis that touched every part of life, and their lens was narrow by design: fixed on minimizing infections, it pushed almost every other effect, economic and non-economic alike, off the table as someone else’s department.
Francis Collins, who ran the National Institutes of Health, admitted that the “public health mindset” he shared led him to “attach infinite value to stopping the disease” and “zero value to whether this actually totally disrupts people’s lives, ruins the economy, and has many kids kept out of school.” In Britain, the chief medical officer, Chris Whitty, told the official Covid inquiry that adding economic or social experts to the government’s advisory group would have made it too “unwieldy.” Stopping the virus became the only goal that counted as following the science, and the costs any honest policy has to weigh were ruled out of scope.
The economists already had this fight
Development economists have been arguing about this for 20 years. The “credibility revolution” taught social science to distrust sloppy causal claims and to prize identification, which was a genuine advance. But the economist Lant Pritchett argues that some of its champions then performed a strange trick: having demanded the most stringent possible evidence for a narrow estimate inside a paper, they would turn around and accept sweeping, system-level claims built on those estimates “with complete and total gullibility.” He calls it the credulity revolution. An experiment can be internally airtight and still tell you almost nothing about whether a program will work at a national scale in another country.
The complaint comes from the center of the field. Angus Deaton and Nancy Cartwright, hardly enemies of quantification, argue that randomized trials earn their keep only “as part of a cumulative program” alongside theory and mechanism. Demanding external validity from a single trial “expects too much of an RCT while undervaluing its contribution.”
Pritchett offers a blunter smell test for his own field of economic growth: if rich countries do not have more of some fashionable factor than poor ones, we should be suspicious of claims that it explains development. No country has ever run a randomized trial on its way to wealth. Poland did not climb out of communism and into prosperity because of a well-run cash-transfer experiment; it did so through messy, large-scale changes in markets, institutions, and politics that no trial could have tested in advance. The method should fit the question. When it does not, more rigor on the wrong question is just a costlier way to be confidently beside the point.
When the trial earns its keep
Don’t get me wrong—I like my RCTs and have carried out some myself. Randomized studies are essential when intuition gets ahead of knowledge. The best case for a trial is the mirror image of the parachute: sometimes the answer is not obvious at all, the intuition almost everyone shares turns out to be wrong, and the only way to find out is to run the experiment and pay its cost. When you cannot see the effect immediately just by looking, you absolutely have to measure it carefully and systematically.
Cash transfers in rich countries can be an interesting example here. In poor countries, the evidence that handing people money improves their lives is about as strong as social science gets, with randomized studies finding large gains in income, assets, food security, and even lower infant mortality. When economists first proposed simply handing poor people cash, the worry was that they would fritter it away on alcohol and other temptations, which is why aid so often came as food or as cash with strings attached. The trials found that worry was largely unfounded: across dozens of studies, cash did not raise spending on alcohol or tobacco and often reduced it, because people in poverty turn out to be good judges of what they need.
Now, since it’s becoming obvious to more and more folks that cash transfers work, you might assume the same logic carries over to a rich country: if you give a struggling American a few hundred dollars a month, their life gets measurably better. At least, it did seem obvious to the researchers who believed in it. But then they ran the trials.
As Kelsey Piper details, a run of careful American studies, including an OpenResearch experiment that gave people $1,000 a month for three years, found no sustained improvement in health, employment, stress, or children’s outcomes. Piper, who went in expecting cash to help more, called the evidence “shocking.” The developing-country findings were sound, but they did not transfer cleanly to this context, which is exactly the external validity problem Angus Deaton was so worried about. Here the trial earned every dollar it cost, because the stakes were high and the intuition was strong but ultimately wrong. The problem, of course, is telling a parachute from a medical trial before you decide whether the study is worth running.
Even then, the right tool is not always an experiment, and sometimes it is not available at all. Some of the most important things we know rest on theory and modeling more than on any single trial. No one has run, or could run, a randomized study on whether a country should open itself to free trade. The case here rests on the theory of comparative advantage, worked out two centuries ago and refined by mountains of non-experimental evidence since. No one randomized nations into democracy and dictatorship to learn which produces better lives. We reason about questions like these from theory and values, and we lean on thought experiments and formal models, the kind of simulations, computational or purely mental, that social scientists build precisely because some questions can never be put to an experimental test.
When everyone is wrong at once
I’m not at all an expert on the pandemic. But it is a very important case that still shapes public trust for the worse. We saw the uncertainty in the evidence already. But the deeper failure was the refusal to admit it. Officials at every level kept repeating that “we know what works against Covid-19,” as Macedo and Lee document, “even as it became increasingly evident that policymakers were improvising and did not, in fact, know with certainty what worked.” Certainty was claimed where it did not exist. Science can tell you what a policy is likely to do. But it cannot tell you what you ought to value, and pretending otherwise spends the credibility you need the next time.
The critics often made the mirror-image mistake, treating the absence of a clean trial as proof that a measure was worthless, which is the flossing error scaled up to a national emergency. Macedo and Lee fault both reflexes at once: it was wrong, they write, “to mock and censor mask skeptics, as it was to insist with certainty that masks do not work.” And underneath the shouting sat the trade-off nobody wanted to name. The studies on masking measured “only one side of the equation,” they point out, and said “nothing about the costs of masking to children’s learning, communication, socialization, and psychological well-being,” while school closures “most hurt poor kids.” What both errors share is a refusal to say the honest thing: the evidence was incomplete, and the choices imposed heavy costs on people who never got a vote.
The “follow the science” failure took a disagreement about values, about how much weight to put on the schooling of the young against the safety of the old, on liberty against caution, and dressed it up as a technical dispute the data had already settled. Immigration runs on the same move constantly. A great deal of what looks like an argument about evidence is actually an argument about what a government owes its own citizens against what it owes foreigners. No experiment can tell you how much to weigh a citizen’s wages against a stranger’s safety, or whether a more diverse society is better than a more cohesive one.
These are all questions of value, and if people simply do not want a certain outcome such as a more diverse society, no clean estimate that a policy would deliver it will change their minds. I have made this case at length: the most useful thing evidence can do here is discipline a values argument by telling us what a given policy will cost and produce. It cannot make the values argument disappear, and recasting a value conflict as a scientific one mostly just hides what people are fighting about.
Dan Williams makes a closely related point in his new essay on how political tribes construct rival realities. Political disagreement often runs through rival systems of interpretation that decide which facts matter, what counts as representative, and who belongs in the story as victim, villain, or hero. That is why appeals to “the evidence” so often disappoint. The fight is partly over facts, but also over the frame that tells people what the facts mean.
Demonstrable benefits
Nor will any randomized trial ever tell you everything you want to know about admitting a productive, skilled worker who pays taxes, fills a documented shortage, starts a company, or treats patients in a town that cannot recruit a doctor. The feedback here runs through a variety of institutions, and you cannot hold the rest of the world fixed while you turn the dial. Yet the basic direction is not at all mysterious. A policymaker weighing whether to route more eligible scientists through a specialized visa or to clear a needless skilled-worker backlog will never have clean experimental evidence that settles the whole question, and waiting for it just means letting a bad default persist.
This is why skilled immigration is so popular: its benefits are intuitive and visible without anyone reading an econometrics paper. About 80 percent of American voters support high-skilled immigration across party lines, and policymakers who wanted to act on that could streamline the uncapped O-1A visa for extraordinary-ability workers tomorrow, without a single new law. This is what I mean by demonstrably beneficial: a policy whose contribution to the country ordinary people can see in practical terms, explicitly and straightforwardly serving the national interest. The persuasive framing is already baked into the policy itself, so you do not need a campaign to explain it.
But setting immigration policy, of course, is not the same thing as choosing parachutes. “A skilled worker who pays taxes and fills a documented shortage is an asset to the country” is a claim about as close to observational saturation as social facts get. “This particular visa reform will produce that benefit at this particular magnitude” is a genuine empirical question that does need identification, and it is the kind of thing my colleagues spend careers trying to estimate. Treating the second claim as if it were as self-evident as the first is the same error the parachute authors mocked, just aimed in a friendlier direction. As I have argued before, “immigration” in the abstract has no clean effect waiting to be discovered; specific policies admitting specific people under specific rules do.
Not every visible benefit is as fraught as immigration. Since we were already talking about flossing, consider… the idea of a Japanese toilet. You only have to use a heated washlet once or twice to know it beats what most regular bathrooms offer, and nobody needs a randomized trial measuring cleanliness or satisfaction to settle the matter, as Noah Smith just argued. I also bought one recently after spending some time in Japan and have opinions.[3] I am fairly confident that installing washlets at scale, in hotels and airports, would be a good policy to have for a lot of American institutions, and just as sure it does not need to be evidence-based in any serious sense. The holdup here is not a shortage of evidence. It’s our cultural habits, building codes, and the electrical wiring most American bathrooms were never built for.
Demonstrable harms
The same logic runs in reverse, and it produces what may be the clearest self-defeating immigration policy in the rich world: the ban on letting asylum seekers work while their claims are being processed. In many countries, that exclusion lasts half a year or more, and in practice it can stretch much longer.
Before considering any of the politics, let’s think about what this does to a person. A government accepts an asylum claim for consideration, pays to shelter the applicant, then forbids the one activity that would let them support themselves, build a record, and start to belong, which is work. Months of forced idleness drain savings, erode skills, and corrode the habits and confidence that make someone employable, and the damage outlasts the ban by years. You do not need a clever instrument, or a randomized trial, to see that this is a bad trade.
Then there is the fraught politics of the issue. The ban manufactures the exact image that anti-immigration politicians point to as proof the system is broken: able-bodied newcomers idle in taxpayer-funded hotels, or selling fruit at stoplights because the formal economy is closed to them. That visible dependence can sour voters on immigration more broadly, including the skilled pathways voters otherwise like.
Whenever I argue that immigration policy has to be intuitively beneficial to stay popular, someone asks what a non-beneficial pro-immigration policy would even look like. This is what it looks like, and no amount of aggregate evidence about the long-run fiscal contribution of refugees will make a voter unsee the visible disorder of the asylum crisis in the streets. The honest fix is a policy that stops generating the visible failure in the first place.
Against evidence theater
It helps to see why the maximalist version of evidence-based policy keeps disappointing its champions. Pushing every chip onto “the science,” and deferring to whatever the experts currently agree on, still does not deliver a policy, because the hardest questions are not the ones science was built to answer. Macedo and Lee make the case for Covid: scientists, they write, “with their narrow bases of expertise, should not make policy,” because the choices turn on values and trade-offs no study can weigh. The same gap shows up on climate, where the economist Matt Burgess argues that the loud framings, “no big deal” and “existential threat,” are both probably wrong, and that responsible policy has to weigh trade-offs and judge which scenarios are plausible, something no appeal to the consensus can settle. And it shows up on immigration, where “the evidence” gets invoked by both sides to settle what is at bottom a fight over what we owe one another.
So what does this look like in practice, for someone who wants to take evidence seriously without hiding behind it? The following questions help:
What kind of claim is actually on the table: an empirical one that data can settle or a normative one no regression will ever resolve?
What kind of feedback does the policy produce, and would a clean study even be valid and available before the decision has to be made?
And what does it cost to gather the evidence, what does it cost to wait, and who pays for the delay?
When the mechanism is clear and strong, the downside is reversible, and the people who would benefit cannot afford to wait, the honest move is to act now and keep studying as you go. And when the disagreement is at bottom about values, experts should say so, instead of laundering it through “the science.”
From evidence-based policy to honest democratic judgment
Every study has a price, and so does every delay. In a recent piece for The Argument, the writer Jeremiah Johnson named a failure mode he calls the tyranny of the edge case, where any conceivable harm to anyone, however rare, becomes a reason for nobody to act, and the demand for one more study is among its favorite instruments. That demand almost never falls evenly: the bar rises for the reform someone dislikes and disappears for the status quo, which usually rests on no trial at all.
An isolated demand for rigor, aimed only at the conclusion you would rather not reach, is one of the most effective ways to block action while sounding admirably careful about the evidence. Running a trial means spending money, burning time, and sometimes withholding a promising policy from the people in the control group, while every month spent waiting is a month the current rule stays in force. You can usually weigh this informally: nobody needs a randomized trial to see that paying people to sit idle, as the asylum bans do, and then resenting them for it is a bad trade.
The discipline that keeps all of this honest is a willingness to say what would change your mind. If I claim the asylum work ban is obviously indefensible, I owe you the conditions under which I would drop the claim: solid evidence that it deters fraudulent applications at a scale worth years of lost earnings and delayed integration for people who would otherwise work. I have not seen that evidence, but I would genuinely look at it. Naming the finding that would move you is what separates a considered judgment from a convenient one, and it is a test the loudest demanders of “evidence-based policy” too rarely apply to themselves.
Don’t get me wrong—none of this is a license for vibes. The parachute study is funny because obviously nobody needs a randomized trial at altitude, but most of the choices that matter look less like a parachute and more like our immigration system: the evidence is partial and the feedback is slow or invisible. Evidence is indispensable for those choices. But it still cannot make the hard judgment for you, and a good judgment will not survive long in a democracy unless the policy behind it can show its value in the world ordinary citizens of any political persuasion actually live in.
[1] Allegedly. As well summarized by Scott Alexander, “[e]ither a zoonotic virus crossed over to humans fifteen miles from the biggest coronavirus laboratory in the Eastern Hemisphere. Or a lab leak virus first rose to public attention right near a raccoon-dog stall in a wet market. Either way is one of the century’s biggest coincidences, designed by some cosmic joker who wanted to keep the debate acrimonious for years to come.”
[2] Of course, even in this case, if you need to decide on a better parachute design with various tweaks, you may want to measure things systematically, whether through a randomized trial or an observational study.
[3] A TOTO Nexus, since you ask. I have become more insufferable about it than even about Europe’s air-conditioning situation.




During the last century, the social sciences have started to focus much more on internal validity, i.e., identifying a causal effect as well as possible. This came at the expense of other factors like external validity, and it led to a focus on relatively simple causal relationships that include only a small set of variables, because only those can be causally identified well.
I guess that this strong focus has selected and formed researchers in a way that makes them less likely to acknowledge the trade-offs you describe, compared to previous generations of scientists. For the progress of science, a focus on internal validity may not be a problem, though one could argue about whether it is currently too pronounced. But when such thinking is transferred to politics, the result is a narrow focus on maximizing one particular variable, at the cost of everything else, which is rarely welfare optimal.