The release of ChatGPT in 2022 and GPT-4 this year has put artificial intelligence and machine learning in the public spotlight. These advances are due to the development and deployment of Large Language Models (LLMs), deep neural networks trained on millions of examples. Before the GPTs, image generation and protein folding models had also effectively harnessed the architectures behind LLMs. For the average computer and internet user, however, the ability to use natural language to interact with computational systems will have the greatest impact.

The tech hype machine has latched onto these language models as the beginning of artificial general intelligence that will change society forever. Skeptics contend that LLMs don’t really understand language and are merely fancy copy-paste machines.

As an ML researcher, I want to log a few thoughts on how we should think about these models and what their success means for research and society. I’ll start with a high-level overview of how LLMs generate sentences from prompts. Then I briefly outline some of the debate around language “understanding.” Finally, I try to outline how these LLMs integrate into current internet structures, particularly advertising. Fundamentally, these Large Language Models are still number-processing tools, like any machine learning model or algorithm.

It's just autocomplete!

But it’s still a big deal

I think it’s important to start with how these LLMs work, so that the general public can understand (at a high level) the “workflow” for generating text outputs from text inputs.

Computers are number processors. Everything a computer does is at some level represented by a set of ones and zeros. Today there is good enough abstraction that we can think of everything as decimal numbers without a loss of understanding, but deep down, a computer is just rearranging ones and zeros. That fundamental point informs all of my takes on machine learning, computers, etc. Computers work with numbers - the physical world mostly does not, and the interesting problems involve converting between the two.
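A minimal illustration in plain Python (the word is arbitrary): to the machine, text is already just integers, and the integers are just bits.

```python
# A word is stored as numbers; the numbers are stored as ones and zeros.
word = "love"
codes = [ord(c) for c in word]             # Unicode code points, e.g. [108, 111, 118, 101]
bits = [format(c, "08b") for c in codes]   # the same numbers written out in binary
print(codes)
print(bits)
```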

So we’ve established that computers work with numbers. Large language models take text in and produce text out. So, we have to turn the text into numbers! That’s step 1 in our workflow. How do we do this? Long ago, researchers created “word embeddings.” Basically, we take a lot of text (let’s call this the embedding dataset) and tag each word or subword with an id. Then, we count how often the subwords appear together within some window. Now we have a big words-by-words table with a co-occurrence count in each row/column pair. We can think of each row as representing a word numerically. Subwords that co-occur often will have similar counts with other similar subwords. The space of all the subwords in a vocabulary is a lot of numbers! To represent each word a bit more compactly, we compress that table into a words-by-a-few-dimensions table, one that could be returned to the original table with some set of mathematical transformations.
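Here’s a toy sketch of that count-and-compress idea. The two-sentence corpus, window size, and embedding size below are made up for illustration, and real embedding models are learned rather than built this literally, but the intuition - numbers from co-occurrence, then compression - is the same.

```python
import numpy as np

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Count how often each word appears near each other word.
window = 2
counts = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[index[w], index[sent[j]]] += 1

# Compress the big words-by-words table: keep the top-k singular vectors.
k = 3
U, S, Vt = np.linalg.svd(counts)
embeddings = U[:, :k] * S[:k]            # one k-dimensional vector per word
print(embeddings[index["cat"]])
```

Each row of `embeddings` is now a short vector of numbers standing in for a word; words that appear in similar contexts end up with similar vectors.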

Now each input word has a numerical representation, the embedding, that is some small number of dimensions (probably around 1,000). That’s step 1. Yay, our words are now numbers, and the computer is happy. We’ve turned a list of words into a list of numbers. We want to turn that input list of words into a list of output words. So, we go to another set of lots of text (call this the model dataset). The model gets a bunch of word sequences, each partitioned into a training input (the first N words) and a training output (the remaining M words). Then, we train a “deep neural network” to predict the M words from the N words. I won’t get too far into the mechanics of transformers or attention or backpropagation, but at a high level there’s a big ensemble of small linear pieces that approximately maps the distribution of the possible M words given the original N words. So, step 2: push our list of word-numbers through this complex statistical distribution generator.
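A caricature of step 2, with made-up shapes and a single random matrix standing in for the whole trained network. The point is only the interface: word-number vectors go in, a probability for every word in the vocabulary comes out.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 50, 8                    # toy sizes; real models use tens of thousands of tokens
context = rng.normal(size=(4, dim))        # embeddings of the N input words (here N = 4)
W = rng.normal(size=(dim, vocab_size))     # stand-in for all the trained layers of the network

logits = context.mean(axis=0) @ W          # a score for every word in the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()                       # softmax: a probability distribution over next words
print(probs.argmax())                      # id of the single most likely next word
```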

And finally, step 3, sample a new word from this distribution! You can keep repeating the sampling to get multiple words until some sort of stop condition is reached. OpenAI also specifically had people (indirectly) modify these distributions to be more “human-like,” fancifully called Reinforcement Learning from Human Feedback.
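And a sketch of step 3. The `next_word_distribution` function here is a hypothetical placeholder for the model from step 2, but the loop - sample a word, append it, repeat until a stop token appears or a length limit is hit - is the generation procedure itself.

```python
import numpy as np

rng = np.random.default_rng(0)
STOP = 0  # pretend word id 0 means "stop generating"

def next_word_distribution(word_ids, vocab_size=50):
    # Hypothetical stand-in for the trained model: returns some probabilities.
    logits = rng.normal(size=vocab_size)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

words = [12, 7, 33]                          # the prompt, already turned into numbers (step 1)
for _ in range(20):                          # a length limit is also a stop condition
    probs = next_word_distribution(words)
    new_word = int(rng.choice(len(probs), p=probs))
    if new_word == STOP:
        break
    words.append(new_word)                   # the sampled word becomes part of the next input
print(words)
```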

So, our workflow:

  1. Turn words into numbers
  2. Turn those numbers into the probability of other numbers
  3. Sample those probabilities to pick the next set of words.

This general strategy is how iPhone autocomplete and Gmail’s sentence completion also work, though in the early days I suspect they didn’t use word embeddings and just had words or common phrases indexed as unique tokens. And I think we’ve forgotten how impressed we were when we first saw those! People were autocompleting texts in iMessage to get insights into their lives!

But ChatGPT (and Bard and Sydney etc.) are a leap in ability compared to those tools. The transformer architecture, based on a concept called “attention,” definitely helped make these models better at predicting the right next subword. Other advances, however, seem to have just needed more data and more computing resources to get a better statistical distribution. ChatGPT can take in a larger N and produce a larger M because OpenAI invested in more computing units for this purpose. It’s a bit disheartening from an elegance perspective, but I think the success of AlphaGo (just compute the possible trees of outcomes downstream efficiently!) had already low-key stunted the elegant-models approach to ML.
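For the curious, the core of that “attention” concept is one well-known formula, scaled dot-product attention from the original transformer paper. The queries Q, keys K, and values V are learned linear transformations of the word representations, and d_k is the dimension of the keys:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

Each word’s output is a weighted average of the values, with the weights set by how well its query matches every other word’s key.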

Do LLMs understand language?

Chris Hayes summed this up perfectly on Twitter: we’re rehashing Searle’s Chinese Room. At a high level, this thought experiment asks whether correctly answering a question given a set of instructions is “understanding” a language. Read the Stanford Encyclopedia of Philosophy article and you can skip reading basically all the discourse, because it’s mostly all been thought and said before.

For those curious, some notable critics of LLMs are Yann LeCun, a big-name Facebook AI researcher, and THE linguist Noam Chomsky. Sam Altman and the OpenAI team (plus the Twitter tech grifter hype squad) seem to think this is the path to Artificial General Intelligence, basically machines that are conscious. (When I met Altman in late 2021 at a Retro interview, which I believe I can now talk about since the news is public, he did say that the path to AGI was reinforcement learning on top of a word embedding model, so props for sticking to the plan!) Read these guys if you want to get into the topic.

I’m skeptical of the AGI potential of LLMs. I think the numbers/words gap is still too big here. LLMs turn word subpieces into numbers and do computations on those numbers. You can put lipstick on it and slice and dice prompts to show understanding, but at some level it’s all number processing. I think a lot of what people interpret as reasoning, emotions, or intelligence is a projection of our own abilities onto what are still number-processing models. If you start talking about love, the model will look at what words are usually used around love, and say things like “I love you.” For reasoning, a common “prompt engineering” trick is to tell the GPT to think step by step, and this has been shown to improve computational answers. But what I’d guess the model is doing is going to a distribution of words where “step-by-step” is used, and text where people listed out their process step by step probably has fewer errors or is better thought out!
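The trick itself is nothing more than extra text on the prompt (the wording below is an illustrative assumption, not a canonical recipe); it just steers the model toward the slice of its training distribution where answers were worked out carefully.

```python
question = "A train leaves at 3:40 pm and the trip takes 95 minutes. When does it arrive?"

plain_prompt = question                                          # just ask
step_by_step_prompt = question + " Let's think step by step."   # nudge toward worked-out text
```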

That being said, I had a habit of skipping steps on math problem sets and making lots of dumb mistakes. Often teachers would have me go over my work step-by-step to fix those mistakes, so maybe I am an LLM. (People who hear me talk about things I don’t know much about probably think so!) It is interesting that the same processes that make us think better make the Chatbot “think” better, but it is difficult to disentangle how much of that is because of the data the model relies on, and how much is “innate” to the model itself.

How do we understand language? Is it possible we are just number processors as well? Are our neurons and their connections just on/off binary representations of words and concepts? Maybe! My inclination is no. We know how LLMs work. What we don’t know as well is how humans reason and process language. Perhaps LLMs can serve as a kind of “null hypothesis” that we can disprove. The success of LLMs means that models based on numerical representations of language can effectively use human language. Now, we need to figure out how humans process and generate language to figure out what, if anything, makes us different.

How will they change the internet? And in turn, us?

LLMs are computer systems, and computers are good for two things: Ads and Porn. The application of LLMs and generative image models to porn seems relatively obvious, so we need not get into that here.

So the economic and societal question for LLMs is: how will they be used to serve ads? Most of our informational channels are basically ad platforms: newspaper classifieds, TV commercial breaks, the sidebar on Google, etc. A lot of modern machine learning is an attempt to optimize ad placement. I remember reading causal inference papers in group meeting around 2019, one of the hot topics of the time, only to realize that A/B testing was the motivating factor behind most of the industry research. This isn’t to trivialize the importance of those developments! Each of these channels has reshaped the fabric of society! Advertising is a central part of how each operates and how society is going to be impacted.

In some cases, LLMs can replace other text services that serve ads. If you replace Google search with “ask the LLM,” like Bing is testing right now, you just keep that ad bar right back there on the side using ye olde search techniques/sponsorship models. If the future is enhancing search with the word embeddings we mentioned before, websites and advertisers might toy around with their Search Engine Optimization to find the words that make the embedding search happy. Again, this isn’t as minimal as it sounds: people’s careers soared and tanked when Facebook pivoted to video feeds. What we’ll likely see is a similar push to hire prompt hackers who can make content appear across a broad range of LLM topics.

Generating advertisements cheaply with generative models is likely to be a hot field. It satisfies the investor desire for the ever-so-dear “product-market fit” with the allure of being part of “the hot thing.” Someone sent me a job post for Jasper AI, a generative AI for marketing copy, about two days after the ChatGPT reveal. While undoubtedly lucrative, this may be the most boring and least society-changing part of LLMs/generative AI. They’re going to produce a lot more ad prototypes cheaply and quickly, but the fundamental attention market has the same human limits. Anecdotally, a startup a friend worked at, which A/B tested different picture options (e.g., green background vs. blue background) for social media ads, found that this was not super viable as an independent business, even though it began as a nice in-house tool for an advertising and marketing consultant. Generating content is already relatively cheap. The burden is still going to be finding a unique voice that cuts through the noise; the noise is just much cheaper to create.

What I’m most curious about is how advertisers will try to influence and control the inputs and outputs of large language models. This can be as simple as adding invisible text to spam websites that will make its way into the model and embedding training data. Just as recipe sites added all those introductory paragraphs to get ranked higher in Google search, product sites might amp up the amount of text using positive and complimentary vocabulary. This can draw the number representation of “Guinness” and “good for you” even closer. It can also influence the language model’s outputs, so anyone prompting “Which toaster should I buy?” will see the preferred brand.

More insidiously, will advertisers pay to adjust the statistical distribution to favor their own products? We know that human feedback is used to tune ChatGPT; how hard will it be to tune the model so that the text outputs Merck or Pfizer’s drug instead of listing generics? Fine-tuning “foundation” models is already a common practice in machine learning operations. If free chatbots are how we interface with the internet going forward, the economic incentive to skew the model gets pretty high. OpenAI can probably fund their projects from the users of their embeddings and models. A smaller company fine-tuning a language model for a web app in a competitive space needs revenue, and advertising is a good source. If you’re using a chatbot for your mental health or legal counsel, as is already being developed, you want to know where the advice is coming from.

This noise from advertisers and cheap text is where I think LLMs can really change how we interact with each other. Our little pseudorandom word generator is going to make lots of short-form text cheap to produce. Think of all the cover letters and job prompts and blog posts we create to sell ourselves as professionals. The number of “thought leaders” and “public intellectuals” on social media is already so high! Academic (and some technical industry) jobs want you to have a website that not only shows your research, but also a blog (yes this is partly why I’m doing this) to show your scope, and also some interesting personal touches about your (now commodified) hobbies.

An LLM makes all these bits of text cheap to generate. I see two extremes that bound the outcomes: we cut the crap, or we drown in it. In some ways, this just accelerates the fundamental problem of existing with the internet - lots of information, very little of it original or useful. Maybe the cheapness of text generation will lead to a reduction in the amount of filler text we have to generate. If everyone generates the perfect cover letter that fits the job description text, we can get rid of that requirement. If those thought leadership blogs are just chatbot productions, and identifiable as such, we don’t need to have them posted! Pare down interviews and communication in these productive settings to what really matters. If a chatbot can produce most of what you’re saying, it might not need to be said.

The likely outcome, however, is that this just raises the number of hoops we jump through in these settings. If one cover letter is easy to generate, then why not add some more prompts and questions? We need to differentiate the candidates! Add more steps! And so we muddle on as citizens of the information age, trying to filter out the signal from the noise.

Where do we go from here?

“Archit,” you might say, “you do biology, why do you care?”

I think it’s important to repeat the message that computers are number processors, fancy abaci for the 21st century, and we have to understand them as such. I work on machine learning for biology, where we try to reconcile the exactness of numbers with pretty strange physical systems that defy easy encoding. Optimization in biological contexts is really tough because we have to define what we want in a medicine or a microbe with a number, and the same problems arise in natural language processing and image processing.

These numbers matter. They’ve reshaped the way we work for the last thirty years, in quantitative finance and journalism and sports. These numbers can be intimidating, and numeracy is a skill that decays rather fast when you’re outside educational environments. In the next few years, the VC class and the consultants and influencers are going to push AI. Thinking clearly about how the words (and images and videos) are generated, and knowing that they are a series of numbers drawn from a statistical model of some set of training data, can help prevent the inevitable and foreseeable failure modes. We can’t blindly let these LLMs into critical systems without public awareness of the statistical limits of these tools.

A lot of chatter is around “alignment” in AI, both getting AI to do what we want and interfacing language models with physical systems. Will we create C-3PO robots by hooking up ChatGPT to speakers and Boston Dynamics robots? It’s funny in some way that the translation parts seem mostly doable, but converting words to smooth, robust robotic motion is the blocker today. The success of transformers with protein folding models bodes well for interactions with the real world and the potential to drive beneficial discovery in the sciences.

OpenAI, through a combination of technical prowess and clever hyping, has accelerated the transition to a mass AI age. Accessible language interfaces are coming, but they need to be used carefully. Thankfully, however, the situation remains the same as with the internet or the radio - there’s a lot of content, but you have to be careful about what you trust.