Headlines about machine learning promise godlike predictive power. Here are four examples:
With articles like these, the press would have you believe that machine learning can reliably predict whether you’re gay, whether you’ll develop psychosis, whether you’ll have a heart attack and whether you’re a criminal, along with other ambitious predictions such as when you’ll die and whether your unpublished book will be a bestseller.
It’s all a lie. Machine learning can’t confidently tell such things about each individual. In most cases, these things are simply too difficult to predict with certainty.
Here’s how the lie works. Researchers report high “accuracy,” but the details of their technical paper reveal that they were misusing the word to mean a different measure of performance, one that is related to accuracy but not nearly as impressive.
But the press runs with it. Time and again, this scheme succeeds in hoodwinking the media and generating misleading publicity.
Now, don’t get me wrong; machine learning does deserve high praise. The ability to predict better than random guessing, even if not with high confidence for most cases, serves to improve all kinds of business and health care processes. That’s pay dirt. And, in certain limited areas, machine learning can deliver strikingly high performance, such as for recognizing objects like traffic lights within photographs or recognizing the presence of certain diseases from medical images.
But, in other cases, researchers are falsely advertising high performance. Take Stanford University’s infamous “gaydar” study. In its opening summary, the 2018 report claims its predictive model achieves 91 percent accuracy distinguishing gay and straight males from facial images. This inspired journalists to broadcast gross exaggerations. The Newsweek article highlighted above kicked off with “Artificial intelligence can now tell whether you are gay or straight simply by analyzing a picture of your face.”
This deceptive media coverage is to be expected. The researchers’ opening claim has tacitly conveyed—to lay readers, nontechnical journalists and even casual technical readers—that the system can tell who’s gay and who isn’t and usually be correct about it.
That assertion is false. The model can’t confidently “tell” for any given photograph. Rather, what Stanford’s model can actually do 91 percent of the time is much less remarkable: It can identify which of two males is gay when it’s already been established that one is and one is not.
This “pairing test” tells a seductive story, but it’s a deceptive one. It translates to low performance outside the research lab, where there’s no contrived scenario presenting such pairings. Employing the model in the real world would require a tough trade-off. You could tune the model to correctly identify, say, two thirds of all gay individuals, but that would come at a price: When it predicted someone to be gay, it would be wrong more than half of the time. In other words, most of its positive predictions would be false positives. And if you configure its settings so that it correctly identifies even more than two thirds, the model will produce even more false positives.
The reason for this is that one of the two categories is infrequent: in this case, gay individuals, who amount to about 7 percent of males (according to the Stanford report). When one category is in the minority, it is intrinsically more challenging to predict reliably.
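The arithmetic behind that trade-off can be sketched in a few lines. This is only an illustration under a loudly stated assumption that does not come from the Stanford report: the model's scores for the two groups are taken to follow two bell curves of equal width (a "binormal" model), spaced so that the AUC is 0.91. The function name and parameters are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def precision_at_recall(auc: float, prevalence: float, recall: float) -> float:
    """Share of positive predictions that are correct.

    Assumption (not from the study): negatives score ~ N(0, 1) and
    positives score ~ N(d, 1), with d chosen to give the requested AUC.
    """
    # For two unit-variance normals, AUC = Phi(d / sqrt(2)); invert for d.
    d = 2 ** 0.5 * STD_NORMAL.inv_cdf(auc)
    # Threshold above which exactly `recall` of the positives fall.
    threshold = d - STD_NORMAL.inv_cdf(recall)
    # Fraction of negatives that also score above that threshold.
    false_positive_rate = 1.0 - STD_NORMAL.cdf(threshold)
    true_pos = prevalence * recall
    false_pos = (1.0 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)


two_thirds = precision_at_recall(auc=0.91, prevalence=0.07, recall=2 / 3)
four_fifths = precision_at_recall(auc=0.91, prevalence=0.07, recall=0.8)
print(f"precision when catching two thirds:  {two_thirds:.2f}")
print(f"precision when catching four fifths: {four_fifths:.2f}")
```

Under these assumptions, fewer than half of the positive predictions are correct when the model catches two thirds of the rare group, and the share drops further when it is tuned to catch more.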
Now, the researchers did report on a viable measure of performance, called AUC—albeit mislabeled in their report as “accuracy.” AUC (Area Under the receiver operating characteristic Curve) indicates the extent of performance trade-offs available. The higher the AUC, the better the trade-off options offered by the predictive model.
In the field of machine learning, accuracy means something simpler: “How often the predictive model is correct: the percent of cases it gets right.” When researchers use the word to mean anything else, they’re at best being willfully ignorant and at worst consciously laying a trap to ensnare the media.
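To make that definition concrete, here is a minimal sketch. The population of 100 is made up for illustration; it simply applies the 7 percent figure cited from the Stanford report. It shows that true accuracy, taken alone, can look high even for a "model" that never predicts the rare category at all.

```python
def accuracy(predictions: list[bool], labels: list[bool]) -> float:
    """Fraction of cases where the prediction matches the true label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical population: 100 males, 7 of them gay (the 7 percent figure).
labels = [True] * 7 + [False] * 93
# A do-nothing "model" that always predicts the common category.
always_negative = [False] * 100

baseline = accuracy(always_negative, labels)
print(f"accuracy of always predicting 'not gay': {baseline:.2f}")
```

A claim of 91 percent accuracy in the proper sense would therefore fall short of this do-nothing baseline, which is one more reason the word needs to be used precisely.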
But researchers face two publicity challenges: How can you make something as technical as AUC sexy and at the same time sell your predictive model’s performance? No problem. As it turns out, the AUC is mathematically equal to the result you get running the pairing test. And so, a 91 percent AUC can be explained with a story about distinguishing between pairs that sounds to many journalists like “high accuracy”—especially when the researchers commit the cardinal sin of just baldly—and falsely—calling it “accuracy.” Voila! Both the journalists and their readers believe the model can “tell” whether you’re gay.
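The equivalence between AUC and the pairing test can be checked directly. The sketch below uses a handful of made-up scores (not data from any study) and two hypothetical helper functions: one runs the pairing test over every positive/negative pair, the other traces the ROC curve and measures the area under it.

```python
from itertools import product


def pairing_test(pos_scores: list[float], neg_scores: list[float]) -> float:
    """Fraction of (positive, negative) pairs in which the positive case
    gets the higher score. Ties count as half a win."""
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def auc_from_roc(pos_scores: list[float], neg_scores: list[float]) -> float:
    """Area under the ROC curve, using the trapezoid rule over thresholds."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]  # (false positive rate, true positive rate)
    for t in thresholds:
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up model scores, for illustration only.
pos = [0.9, 0.8, 0.6, 0.4]
neg = [0.7, 0.5, 0.4, 0.3, 0.2, 0.1]

pairing = pairing_test(pos, neg)
area = auc_from_roc(pos, neg)
print(f"pairing test: {pairing:.4f}  area under ROC: {area:.4f}")
```

The two numbers agree, which is why a pairing story can be told about any AUC. Note that neither calculation involves how common the positive category is, so the result says nothing about how often a positive prediction would be right in the real world.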
This “accuracy fallacy” scheme is applied far and wide, with overblown claims about machine learning accurately predicting, among other things, psychosis, criminality, death, suicide, bestselling books, fraudulent dating profiles, banana crop diseases and various medical conditions. An addendum to this article covers 20 more examples.
In some of these cases, researchers perpetrate a variation on the accuracy fallacy scheme: they report the accuracy you would get if half the cases were positive, that is, if the common and rare categories occurred equally often. Mathematically, this usually inflates the reported “accuracy” a bit less than AUC does, but it’s a similar maneuver and overstates performance in much the same way.
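A small sketch shows why the assumed half-and-half mix matters. The model below is entirely hypothetical: the figures of 90 percent of positives caught and 70 percent of negatives correctly cleared are invented for illustration, as are the function names. Only the 7 percent prevalence comes from the article.

```python
def accuracy_at_prevalence(recall: float, specificity: float,
                           prevalence: float) -> float:
    """Accuracy when a `prevalence` share of all cases is positive.

    recall: share of positives the model flags as positive.
    specificity: share of negatives the model leaves unflagged.
    """
    return prevalence * recall + (1.0 - prevalence) * specificity


def precision_at_prevalence(recall: float, specificity: float,
                            prevalence: float) -> float:
    """Share of positive predictions that are actually positive."""
    true_pos = prevalence * recall
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)


# Hypothetical model, numbers invented for illustration.
RECALL, SPECIFICITY = 0.9, 0.7

reported = accuracy_at_prevalence(RECALL, SPECIFICITY, prevalence=0.5)
precision_balanced = precision_at_prevalence(RECALL, SPECIFICITY, prevalence=0.5)
precision_real = precision_at_prevalence(RECALL, SPECIFICITY, prevalence=0.07)

print(f"'accuracy' at a 50/50 mix:        {reported:.2f}")
print(f"positive predictions right, 50/50: {precision_balanced:.2f}")
print(f"positive predictions right, 7%:    {precision_real:.2f}")
```

For this invented model, the headline figure at a 50/50 mix looks respectable, yet at 7 percent prevalence fewer than one in five positive predictions would be correct.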
In popular culture, “gaydar” refers to an unattainable form of human clairvoyance. We shouldn’t expect machine learning to attain supernatural abilities either. Many human behaviors defy reliable prediction. It’s like predicting the weather many weeks in advance. There’s no achieving high certainty. There’s no magic crystal ball. Readers at large must hone a certain vigilance: Be wary of claims of “high accuracy” in machine learning. If it sounds too good to be true, it probably is.