“Dude, Machine Learning is Just Glorified Curve Fitting”

There is this video of David Bowie being interviewed by Jeremy Paxman that I keep coming back to. The interview took place in 1999 and one of the topics they discuss is the Internet. Specifically, Bowie expresses his opinion that “the potential of what the Internet is going to do to society—both good and bad—is unimaginable”. What is much more interesting to me though is Paxman’s skepticism: “It’s just a tool though, isn’t it? <…> It’s simply a different delivery system”. You can watch the full exchange on YouTube using this link (the most relevant part runs from 10:45 to 11:30).

For us, the enlightened people of 2020, it feels like Paxman made a fool of himself. But my point is that Paxman was technically correct. It is just that saying “$X$ is just $Y$” does not necessarily diminish the importance of $X$ if we are underestimating the power of $Y$ or if $Y$ is too vaguely defined.

One of the most popular variants of this meme today is the saying “machine learning is just glorified curve fitting”. It seems to have been popularized by Judea Pearl and is now the main argument of those who question the significance of recent advances in machine learning or are skeptical about its potential to replace jobs, for example. Although you might already (correctly) guess that I consider this way of thinking mostly flawed, in some ways I can sympathize with it too.

Curve Fitting

2-D

Before digging into machine learning, let us clarify what curve fitting even is; a classic experiment illustrates it well. Suppose you have a container of gas and you measure its pressure, $P$, at various temperatures, $T$. You then plot $P$ versus $T$ and find that the relationship between the two (at least in the range that you measured) is almost perfectly linear, i.e. $P(T) = aT + b$, where $a$ and $b$ are some constants. This is an example of fitting a straight line to data in two variables; the fitted line can then be used to predict the value of one variable given the value of the other.
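
To make this concrete, here is a minimal sketch of such a fit in Python. The temperature and pressure readings below are invented for illustration, not real measurements:

```python
import numpy as np

# Hypothetical measurements: temperature in kelvin, pressure in kilopascals.
T = np.array([280.0, 290.0, 300.0, 310.0, 320.0])
P = np.array([ 93.5,  96.8, 100.1, 103.4, 106.8])

# Fit a straight line P(T) = a*T + b by least squares.
a, b = np.polyfit(T, P, deg=1)

# Use the fitted line to predict the pressure at a temperature we did not measure.
print(f"P(T) = {a:.3f}*T + {b:.2f}")
print(f"Predicted pressure at 305 K: {a * 305 + b:.1f} kPa")
```

The fit boils the data down to two numbers, $a$ and $b$, and from then on prediction is just plugging a temperature into the line.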

9000-D

In machine learning, one might try to achieve something similar. For example, a neural network could be trained to differentiate between dogs and cats when provided with their images (which are collections of pixels with certain intensities of red, green and blue colors). A trained fully-connected neural network with one hidden layer would encode the relationship $\mathbf{F}(\mathbf{x}) = \sigma_2(\mathbf{A}_2 \sigma_1(\mathbf{A}_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2)$ between an input image $\mathbf{x}$ and an output $\mathbf{F}(\mathbf{x})$ whose components are labeled “dog” and “cat”, where $\mathbf{A}_1$, $\mathbf{b}_1$, $\mathbf{A}_2$ and $\mathbf{b}_2$ are just collections of constants (weight matrices and bias vectors) and $\sigma_1$ and $\sigma_2$ are some non-linear functions. Although scarier looking, this model performs the same task: it predicts an output given some input. You can visualize¹ this as fitting a multi-thousand-dimensional surface using millions of parameters.
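
To unpack the formula, here is a sketch of that function in Python with NumPy. The layer sizes, the particular choices of $\sigma_1$ and $\sigma_2$, and the random parameters are placeholders of my own; in a trained network the constants would come out of the fitting procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary sizes for illustration: a tiny 32x32 RGB "image" and a small hidden layer.
n_in, n_hidden, n_out = 32 * 32 * 3, 128, 2

# In a trained network these constants come from fitting; here they are random placeholders.
A1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros(n_hidden)
A2, b2 = rng.normal(size=(n_out, n_hidden)), np.zeros(n_out)

def sigma1(z):
    # One common choice of non-linearity for the hidden layer (ReLU).
    return np.maximum(z, 0.0)

def sigma2(z):
    # Softmax turns the two raw outputs into probabilities for "dog" and "cat".
    e = np.exp(z - z.max())
    return e / e.sum()

def F(x):
    # F(x) = sigma2(A2 @ sigma1(A1 @ x + b1) + b2)
    return sigma2(A2 @ sigma1(A1 @ x + b1) + b2)

x = rng.random(n_in)              # a flattened stand-in for an input image
dog_prob, cat_prob = F(x)
print(f"dog: {dog_prob:.2f}, cat: {cat_prob:.2f}")
```

Structurally it is the same game as the pressure-versus-temperature line, except the “curve” now lives in a space with thousands of input dimensions and the constants number in the hundreds of thousands.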

Large machine learning models are not elegant at all. They have millions or even billions of parameters that model the relationships between inputs and outputs. Our intuition is often that this is just an extension of the same curve-fitting principle we saw with pressure and temperature, only with more inputs and more outputs. However, as these models grow in size, behaviors sometimes emerge that mimic human cognition.

Alien Life Form

GPT-3 is the largest language model in the world; it might be the best example to illustrate my point. In the simplest terms, a language model is an auto-completer of words, sentences, paragraphs, etc. Even the most basic of them should be able to complete sentences like “My favorite color is __” with words like “red” or “light blue” instead of “hamburger” or “Richard Nixon”. GPT-3 is much more sophisticated. For example, when provided with a definition of a made-up word, this language model can use it in context [1]:

Human input:
A “Burringo” is a car with very fast acceleration. An example of a sentence that uses the word Burringo is:
GPT-3 output:
In our garage we have a Burringo that my father drives to work every day.
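
To make the idea of auto-completion a bit more concrete, here is a toy sketch in Python. The candidate words and their probabilities are invented; a real model like GPT-3 learns distributions of this kind (over tokens rather than whole words) from enormous amounts of text:

```python
# A toy "language model": for the prompt "My favorite color is", it assigns
# a made-up probability to a handful of candidate next words.
next_word_probs = {
    "red": 0.31,
    "blue": 0.27,
    "green": 0.22,
    "hamburger": 0.0001,
    "Richard Nixon": 0.00001,
}

# Complete the sentence with the most likely candidate.
completion = max(next_word_probs, key=next_word_probs.get)
print("My favorite color is", completion)  # -> My favorite color is red
```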

There are many examples of what GPT-3 is capable of doing, but to me the most fascinating one is its ability to perform basic arithmetic. When asked “What is 3 plus 2?”, the model will almost certainly correctly answer “5”. Of course, the model might have simply found such a sentence in its training data set and memorized it. That is almost certain for one-digit addition given how much text GPT-3 was trained on. However, it performs almost as well on two- or three-digit addition and subtraction. Out of 2,000 three-digit subtraction problems, only 0.1% of which appeared in the training data set, GPT-3 computed the answer correctly 94% of the time [1]. And the mistakes it makes are human-like: “forgetting” to borrow a “1”, for example. But most of the time it does carry or borrow a “1” when it needs to, which is impressive in itself because it was never explicitly told the rules of addition and subtraction. It is amazing that with enough data even a relatively simple² model can “learn” mathematical rules by analyzing text and extracting a pattern.

Conclusion

Most machine learning models are indeed just a high-dimensional equivalent of curve fitting. But thinking about them at this level of abstraction is simply not helpful. I agree with the sentiment that just building larger models is not a feasible strategy for achieving artificial general intelligence. However, we keep getting surprised by how good these dumb curve-fitting algorithms can be at performing cognitive tasks. My enthusiasm and concern regarding machine learning stem from the belief that with large enough models, emergent behaviors arise that emulate the way we, humans, perform many tasks. I fear that, no matter how anticlimactic it sounds, functions with a few billion parameters might replace millions of jobs. Not because these mathematical functions can develop some sort of advanced intelligence, but because a lot of what we do does not require it.

References

  1. T. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, 2020. [Online]. Available: https://arxiv.org/abs/2005.14165

  1. just kidding

  2. structure-wise, not size-wise