I almost hate to add to the AI discourse these days, because I’m not confident my two cents of opinion is productive or helpful to anyone, but I’m finally giving in to the peer pressure and laying out my views in one place.
Eliezer Yudkowsky’s AGI x-risk argument, in short, goes as follows:
Artificial general intelligence is possible in principle.
There is a property, which we can call intelligence, which humans have, which allows us vastly more power to shape the world than any other organism. It is possible in principle for a machine to have this property, to a degree exceeding our own. Thus, it is in principle possible for machines to be generally smarter and more powerful than people.
Artificial general intelligence, by default, kills us all.
If humanity creates an AGI, in most possible cases it will be an alien mind, not a human-like or animal-like mind. By default it will place no value on human life; by default it will prioritize some goal and pursue it relentlessly; in the vast majority of possible cases, that goal will not happen to be compatible with the survival of our species. Just as humans have caused many nonhuman species to go extinct as a side effect of our industrial activities, an agent that doesn’t explicitly value our survival will be likely to wipe us out.
It is technically difficult, and perhaps impossible, to ensure an AI values human life.
We have not even begun to develop the theory or technical capacity to get AIs to have any coherent values or goals “in the world”. We have no good way to ensure that a computer program even knows what a human is across contexts, so we can’t possibly “program” it to “value human life”.
Current machine learning developments are progressing rapidly towards an AGI.
Current progress in machine learning performance indicates substantial steps towards the kind of “intelligence” that drastically reshapes the world in pursuit of goals (and is therefore an existential threat.)
My position is that claims 1, 2, and 3 are true, and 4 is false.
I believe that the kind of AGI that would be an existential threat is very hard to create (though possible-in-principle).
An x-risk AGI would need some capacities of types that current machine learning is showing great progress on, and also some other capacities that current machine learning is making almost no progress towards.
Moreover, I think those “missing critical capacities” (necessary for x-risk AGI, not tractable for current machine learning systems) are seeing very little research investment today and don’t appear strongly economically incentivized, so I’m confident they won’t be developed in the 2020s, and maybe not in the 2030’s either.
In particular, I think the very same technical obstacle to “aligning” AGI is an obstacle to creating AGI in the first place. We do not know how to build machines that have (coherent) goals at all, and thus we do not know how to build machines that can pursue goals coherently and persistently enough to create massive side effects like human extinction.
Intelligence and Agency
I believe human intelligence -- the thing that we have to a greater degree than other animals -- is about agency or goal pursuit.
The reason humans are “powerful” (and dangerous) is that we change the physical world in pursuit of our goals.
We changed the climate and geology of our planet. If you look at Earth from space, human agency is obvious -- you can see land transformed by agriculture, continents illuminated by electric lights, the Great Wall of China, satellites in orbit.
Human impact on our physical environment is driven by human technology, and human technology is driven by teleological reasoning. Our ancestors learned to chip stone tools in order to catch more game while hunting.
This is apparently controversial, which is why I’m belaboring the point. Some people do not actually think humans are “agents” who solve problems in pursuit of goals, and I think this is silly. I think humans do pursue goals.
Clearly, not everything people do is goal-directed behavior. When we are “on autopilot” we are not actively solving problems. When we are “going with the flow” we are not aiming at a particular goal.
But I think humans (and some animals!) do have goals, and moreover that most technological advancement is a result of goal-directed, agentic behavior.
When our hominid ancestors began to chip the first stone tools, this was obviously in service of a goal -- they wanted to catch more game while hunting.
Similarly, when a human today tries to fix a printer (or any other kind of “getting a machine to work” activity, whether that’s use, repair, or creation), there’s a goal in mind. You want to get the doohickey to “work right”, to do what you want.
There are, admittedly, non-goal-oriented components to many practical problem-solving activities. You might follow a procedure mechanically, or noodle around playfully at “random”, or slip into a habitual behavior pattern. But if the problem is at all tricky or non-routine, then this will not suffice and you will have to think in a fundamentally different way.
Have you ever, while trying to fix something, blundered around for a while “on autopilot”, trying all the standard tricks you’ve used before, getting nowhere -- and then had a “moment of clarity”?
You stop, you look at the doohickey a different way, you “really think”, or you “figure it out”, and you’re like “oh, this part goes behind that other part, I’ll have to unscrew the front to get at it” or something. There’s a switch from “brute-force” random trial-and-error to…something else. It’s qualitatively different.
It’s the difference between:
“Blindly” writing code or equations, “following your nose”, vs stepping back and trying to grasp the structure of the problem
“Blindly” making moves in a game and being surprised whether you win or lose, vs. stepping back and trying to figure out what the necessary properties are of a good strategy
“Blindly” trying to “get the notes right” in a musical piece by playing the same thing over and over, vs. stepping back and choosing to focus on the trickiest measures and changing your fingerings
“Blindly” complaining about a chronic problem in your life vs. “actually thinking” and trying to come up with new things you could do to solve it.
There’s this thing that’s “actually try” or “actually be strategic” or “stop Doing for a second and try to understand What to do”, that dramatically improves performance across domains.
And, I think, it’s inherently goal-oriented. You’re trying to figure out the means to an end.
We are not usually doing that problem-solving thing, in daily life -- a person can go days or even years without having to use “real” problem-solving thinking to cope with an unfamiliar problem. When life’s tasks are routine and familiar, or when we are able to follow procedures we’ve heard from others, we are spared the necessity of “real problem-solving” or “actually being strategic.”
But it seems clear to me that occasional bouts of “real problem-solving” are and have always been essential for the development of humanity’s overall capabilities.
William James, one of the founders of modern psychology, believed that agency is fundamental to what “intelligence” or the “mind” is.
How can we tell that a human or animal has a “mind” but a rock does not? How can we distinguish Romeo’s “attraction” to Juliet from an iron filing’s “attraction” to a magnet?
James has a nice crisp definition of agentic behavior: “fixed aim, varying means.” A frog trying to escape a jar will try various strategies to get out; iron filings “trying” to reach a magnet, or a rock “trying” to fall, or bubbles “trying” to reach the surface of a jar of water, will not devise diverse ways to get around obstacles placed by the experimenter.
Frogs, and other vertebrates, engage in goal-directed behavior by James’ standard. Any animal that can engage in physical problem-solving -- escaping a maze, breaking into a complicated package to get a treat, etc -- can be an agent.
Humans are unusually flexible agents: we can do the ‘step out of the frame’ thing. We can vary our “means” a great deal more than a frog can, including popping up a meta level to expand the range of our menu of “means”.
Agency is the thing that makes an AGI uniquely dangerous.
Agency is not necessary for most kinds of technological danger; many technologies can kill people accidentally without any sort of “intelligence”. A bug in a computer program can kill people if it controls a power plant, or controls life-support machines in an ICU, or flies an airplane, or many other sensitive applications. This sort of mundane failure risk, of course, applies to machine learning models as well, but it is not qualitatively different from any other kind of technological risk.
The unique risk of superintelligent AGI is that it may have a persistent goal that is dangerous, and the ability to actively, flexibly resist our attempts to stop it.
An adversary is far more dangerous than an accident. Note, by analogy, that only 1% of gun deaths in the US are accidental. If you die from a gunshot wound, it is two orders of magnitude more likely to be a death from human agency (homicide or suicide) than a death by accident.
Without agency, machine learning is mundane technology.
Sure, machine learning poses new types of risks because it has new capabilities, but that’s true of every new technology. At worst, machine learning is in the same category as nuclear fission or synthetic biology -- something that could cause mass death (but probably not human extinction) in the hands of a hostile or negligent organization.
Aren’t Current AIs Already Agents?
In one sense, every machine learning model has a “goal” -- to minimize its training loss.
The programmers specify some loss function to represent how “distant” the model’s response is from the desired response. Then, every time the model is applied to a new training data point, the training procedure nudges the model’s weights (via a gradient descent algorithm) in the direction that would have brought it closer to the desired response to that data point.
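To make that loop concrete, here is a minimal, hypothetical sketch of a single-example training step for a toy linear “model” (invented numbers and names, not the actual training code of any real system):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)                  # the model's weights
x, y_true = rng.normal(size=3), 1.0     # one training example and its desired response

learning_rate = 0.01
for step in range(1000):
    y_pred = w @ x                      # the model's response
    loss = (y_pred - y_true) ** 2       # "distance" from the desired response
    grad = 2 * (y_pred - y_true) * x    # gradient of the loss w.r.t. the weights
    w -= learning_rate * grad           # nudge the weights in the direction that shrinks the loss
# repeated nudges drive the loss toward zero on this example
```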
So, it’s fair to say, in one sense, that (for instance) an LLM has a “goal” to predict the text prompt completion in a way consistent with its training corpus.
Does this satisfy James’ criterion of “fixed aim, varying means”? Is the LLM’s “goal” the same sort of thing as a frog’s “goal” to escape the water to get a breath of air?
Not quite, I would say.
The LLM relentlessly minimizes its loss function, no matter what the outcome. As far as it’s concerned, “winning” simply is making the number go down.
A frog, on the other hand, has something in the world that it wants (to breathe, so it can survive). The reality of whether the frog gets enough air to breathe is different from the specification of however its brain and body internally represents the goal of “get out of the water and breathe”.
The map, as they say, is not the territory.
What would happen if, say, you gave the frog a snorkel, so it could still breathe underwater? Would it stop struggling because it realized its “initial goal” was superfluous?
I actually don’t know about a literal frog. But a human sure could.
“Embedded” agents (living things, and perhaps someday robots) are acting in physical reality, and reality provides them with information about the consequences of their actions which can be outside their model specification. If a robot’s model says it’s doing fine, but then an anvil falls on it, then the robot is not fine, it is smashed.
Living things have evolutionary pressure towards not deluding themselves about their prospects for survival and reproduction; every time an organism dies because its models say it’s OK, those incorrect models get a bit less frequent in the gene pool.
So organisms are going to have a tendency to avoid gaps between theory and reality, and (at least in “higher” organisms like humans) some ability to scrap or adjust theories that conflict with reality.
An AI designed to minimize its loss function is only as good as its loss function’s specification and its training data set; it isn’t getting “real world out-of-model feedback” from how its responses perform at whatever practical use they’re put to, in the way that a living organism is getting feedback from its current objective state of health, or the way an organism’s genes are getting feedback from its objective performance at survival and reproduction.
There are numerous examples of AIs engaging in what DeepMind calls “specification gaming” -- “discovering” that the easiest way to minimize their loss function is to do something very far from what their human creators intended.
For instance, an AI designed to stack a red (simulated) Lego block on top of a blue block was given a reward function based on the height of the bottom face of the red block. The AI “exploited a loophole” by instead flipping the red block over so its bottom face was higher up.
Specification gaming is a natural consequence of AIs being designed to optimize a single hard-coded objective function, come hell or high water. Organisms work fundamentally differently; the objective they optimize relentlessly is an outcome in the world, not a function in their minds.
Nostalgebraist’s posts “why assume AGIs will optimize for fixed goals?” and “wrapper-minds are the enemy” point at this same distinction.
He uses the word “wrapper” to refer to a fixed, hard-coded terminal goal that an AI would optimize for, no matter what. (Much like the loss function of actual machine learning models.)
He observes that humans are not like this -- we can revise our goals if they seem bad to us later on. Our values -- even our most sacred ones -- are not literally terminal or absolute in the sense that no possible feedback from reality could get us to change them.
“A literally unconditional love would not be a love for a person, for any entity, but only for the referent of an imagined XML tag, defined only inside one's own mind.”
…
“Our values are about the same world that our beliefs are about, and since our beliefs can change with time -- can even grow to encompass new possibilities never before mapped -- so can our values.
"I thought I loved my child no matter what, but that was before I appreciated the possibility of a turn-your-brain-into Napoleon machine." You have to be able to say things like this. You have to be able to react accordingly when your map grows a whole new region, or when a border on it dissolves.”
Agency in the way that organisms do it involves a fixed aim in the world, and varying means including the ability to vary the mental specification of that aim.
Current-gen AIs don’t have that. And, I’ll argue, that’s what an AI would need to be an x-risk.
Aren’t People Trying To Make AIs Agents in the Near Future?
People sure are trying to give AIs something they call agency.
Yohei Nakajima, a venture capitalist, has scared a lot of people by launching what he calls “task-driven autonomous agents”, basically a couple of GPT-4 instances hooked up to a memory store.
This is, first of all, a very weird and suspicious example. Nakajima claims that GPT-4 wrote the blog post and associated Twitter thread based on “code”, but he has no links to the code or a demo that I can find. I’m considering it unverified whether this “autonomous agent” even exists.
But let’s assume it does, because there’s nothing impossible about hooking an LLM up to a database, the Internet, or a bank account. Anyone who claims “LLMs are safe because they’re sandboxed” is wrong.
And that means you can absolutely set up an LLM that can autonomously go around impersonating a human, creating accounts with various internet services, buying things, shipping them anywhere it likes, hiring and instructing people to do stuff via email, and so on.
An LLM-led terrorist organization is not in principle impossible. Most of the uncertainty is in “would the LLM make nonsensical, counterproductive, or easily detectable choices that made it a hopelessly ineffective terrorist leader?”
But this has nothing to do with my claim that LLMs (and other AIs) do not have the kind of agency that would make them x-risks and aren’t on track to develop it soon.
The kind of agency I’m talking about is a cognitive capacity. It’s not about what tools you can hook up the AI to with an API, it’s about the construction of the AI itself.
My claim is that certain key components of agency are unsolved research problems. And in particular, that some of those problems look like they might remain unsolved for a very long time, given that there’s not very much progress on them, not very many resources being devoted to them, and not much economic incentive to solve them, and no trends pointing towards that being on track to change.
What Are The Implications of Long AGI Timelines?
“We’re no closer to solving a critical sub-problem for AGI than we were in the 1960s” of course doesn’t imply that problem can’t be solved tomorrow, but it makes it unlikely. It means that we have enormous uncertainty about when it will be developed.
If “true” AGI looks just as far away today as it did in the 1960’s; if 2020’s AIs like LLMs and deep reinforcement learning agents might be economically transformative and societally disruptive but are fundamentally missing a necessary component of what allowed humans to reshape Earth; then what?
Then my basic policy opinion is that inquiry is good and bans are bad.
Theoretical inquiry about the implications of superhuman thinking machines is older than computers themselves. I think it can be productive. Alan Turing, Norbert Wiener, and John Von Neumann were not silly to speculate about the future of computers.
Research into AGI “alignment” -- how to ensure that a future superintelligent AGI will value human survival -- is, just as it was when SIAI was founded, a long-shot bet that a pre-paradigmatic, basically philosophical or speculative school of thought could actually do something productive. But it’s also basically a free roll; society loses next to nothing if a few more people than usual write papers that end up going nowhere.
And something like a valid philosophy of AI is absolutely necessary for humanity’s long-term survival (as well as for thinking straight in the short term about what kinds of capabilities we can and can’t expect from what systems.)
So I’m thoroughly in favor of speculative inquiry about the necessary components of AGI safety.
Banning the training of large machine learning models, by contrast, is harmful (because it coercively interferes with legitimate scientific and commercial activity) and is no more justified today than a ban on mainframes would have been in the 1960s.
In my view, 2020’s AI is a typical sort of new technology: it has both risks and benefits, but the benefits probably predominate overwhelmingly, and enforced technological stasis has a brutal opportunity cost. Tyler Cowen is basically right -- we should let present-day AI develop freely, reap the prosperity that it makes possible, and figure out how to mitigate any harms as we go along.
Capabilities Required For Agency, And How Current-Gen AIs Perform
In the rest of this essay, I’ll examine several kinds of capabilities that I think are probably necessary for the kind of AGI that could be an x-risk. For each, I’ll discuss:
What the capability is
Why it seems necessary for an x-risk AGI
To what extent present-day AIs seem to have it
To what extent it looks likely that AIs will progress towards the capability in the near future (roughly, the next decade).
World Models
Do present-day AIs have world models? Mixed results.
Should we expect near-term AIs to develop better world models? Yes.
What’s a World Model?
Marvin Minsky referred to a “world model” as something any AI would inherently need to possess.
Most definitions I can find are circular (they use the word “model” or a synonym) or overly broad (Minsky included anything that helped the AI achieve its goals.)
For instance, Yann LeCun describes a “world model” as “a kind of simulator of the part of the world relevant to the task at hand”, which is of course just replacing the word “model” with the synonym “simulator”.
At the risk of reinventing the wheel, let me take a stab at a definition:
A world model is a compact piece of information, composed of multiple parts and relations between them, such that if you define a correspondence between the model and the real world that maps each of the model’s parts to a corresponding real-world object or phenomenon, then the relationships between the parts of the model map to analogous relationships between the real-world images of those model parts.
In mathematical language, you would say that the relationships between the model’s parts commute with the relationship between the model and the world.
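One way to make the “commuting” condition concrete (my own notation, not a standard definition): let f be the correspondence sending each model part to its real-world referent. Then for every relation R_M between model parts and its intended real-world counterpart R_W, we want

R_M(a, b) to hold if and only if R_W(f(a), f(b)) holds, for all model parts a and b.

That is, reasoning inside the model and then translating to the world gives the same answer as translating first and then looking at the relationships in the world.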
For instance, a literal map is a world model; the shapes and symbols on the map correspond to cities, roads, coastlines, and so on, in the real world.
Why would you use a map?
Because it’s conveniently small (you can carry it in your purse) but it corresponds to the world in important respects.
If you move your pen along the map from the dot labeled “New York” to the dot labeled “Boston”, and then you start in the real New York and drive along the same real-life roads as the symbolic roads on your pen-path, you can reasonably expect to end up in the real Boston.
In this case, the correspondence between the map and the real world is spatial and graph-theoretical. Map-locations have the same relative distances on the map as real locations have in real life (in x-y coordinates on the map, or in some projection function of their latitude and longitude in real life). Also, two map-locations are connected by a sequence of map-roads if and only if their real-life equivalents are connected by a sequence of real roads.
World models can also be more dynamic than a map.
A physics simulator, for instance, encodes some laws of physics (a compact piece of information) and computes how any modeled physical system would evolve over time according to those laws. A protein folding simulator encodes some rules for chemical bonding, and computes how a sequence of amino acids would fold itself up according to those rules.
A dynamic simulation model generates conditional inferences about a real-world system -- “under certain conditions, the system will behave like this” -- allowing us to form expectations about unseen or future states of the system.
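As a toy example of the dynamic case, here is a hypothetical few-line “simulator” (illustrative numbers, not any real physics engine): it encodes one compact law and uses it to predict an unseen future state of the modeled system.

```python
# Encode one law of physics (constant gravitational acceleration) and step a
# modeled falling object forward in time.
g, dt = -9.8, 0.001                     # the encoded "law"; simulation time step
height, velocity, t = 10.0, 0.0, 0.0    # state of the modeled system

while height > 0:
    velocity += g * dt                  # "under these conditions, the system
    height += velocity * dt             #  will behave like this"
    t += dt

print(f"predicted time to hit the ground: ~{t:.2f} s")   # roughly 1.4 s from 10 m
```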
Dynamic or static, a model is a simplified representation of (part of) the world, based on relationships between parts.
But Aren’t World Models Unnecessary?
In traditional AI of the 1960’s-1980’s, the assumption that AI needed “world models” meant, for instance, assuming that the “right way” to design a robot that could navigate autonomously was to generate a simulated 3D model of its physical environment from sensor data. A separate program would then select a simulated path through that model, which would finally be implemented when the robot physically steered along the corresponding path.
This sort of world model, it turned out, was neither necessary nor optimal for successfully solving robotics problems. A robot could simply sense patterns and react to them appropriately, without generating a model of the entire space around itself and a pre-planned path through it.
Herbert Simon’s parable of the ant is an attempt to explain that world models are not inherently necessary for effective coping in varied environments. He imagines an ant that chooses its zigzag path along the beach without any kind of a priori “plan” and without any “map” of the beach. The ant simply moves in a straight line until it encounters an obstacle, and then zigs around it. The ant’s path appears complex as seen from above, but it emerges from a set of simple rules. And it’s perfectly capable of successfully avoiding obstacles on a variety of different beaches.
So, one might ask, why would AIs need world models at all?
Reinforcement learning is another example of a paradigm where an agent selects a path (for instance, a sequence of moves in a game) without necessarily having a “world model”. The agent tries things at random; when it wins the game, the moves it took along the way are rewarded; when it loses the game, those moves are penalized. Over time, the reinforcement learning system will evolve a winning strategy.
A robot that learned to navigate via reinforcement learning would not necessarily need to have a “model” of the entire environment; it might not even need to have representations of physical objects or rooms or anything that corresponded to the “knowledge” that dropped objects fall. It could simply learn from brute-force reward and punishment which sequences of motion would lead it to its goal and which would not.
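A minimal caricature of that “brute-force reward and punishment” idea is tabular Q-learning; the states, actions, and rewards below are placeholders, and this is not the algorithm behind any particular system discussed here. No map of the world is ever built -- just a table of how well each action has worked in each state.

```python
import random
from collections import defaultdict

actions = ["up", "down", "left", "right"]
Q = defaultdict(float)                   # Q[(state, action)] -> estimated value
alpha, gamma, epsilon = 0.1, 0.9, 0.2    # learning rate, discount, exploration rate

def choose_action(state):
    if random.random() < epsilon:                        # sometimes try things at random
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])     # otherwise pick the best so far

def update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in actions)
    # nudge the estimate toward the reward actually received plus discounted future value
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```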
The success of reinforcement learning programs like AlphaGo and its successors (AlphaZero, AlphaStar, etc) at playing games at a superhuman level has demonstrated that “mere” reinforcement learning can be extremely successful at solving problems that are hard for humans.
These superhuman reinforcement learners play millions of simulated games to generate the “training data” of which sequences of moves tend to result in wins or losses. They derive successful strategies “from scratch”, outperforming both humans and computer programs with hard-coded heuristics and rules.
So, does this prove that world models are unnecessary for AIs to successfully pursue goals?
No.
In principle a reinforcement learner need not contain anything like a representation of the “world”. But also, in principle, it could.
(And, with deep RL systems, we would not have any way of knowing much about whether or not it does, since any such representation would be encoded in an opaque way, in the weights of a huge neural network.)
What we do know actually suggests that game-playing RL agents have some world-modeling capabilities.
For instance, there’s some evidence that AlphaZero evolves towards human-understandable concepts in chess as it plays. “Material” (the value of pieces on the board) arises as the largest factor governing AlphaZero’s play, just as it is for humans, and it converges on the same values for different pieces as standard (human-created) scoring criteria.
This is an example of a compact model (a weighted sum of each player’s remaining pieces) that’s useful for predicting the outcome of the game, emerging as a leading component of the AI’s strategy.
“Brute force” RL isn’t necessarily that brutish; it can implicitly learn to break the world down into simpler parts and model it.
Simon’s ant, by contrast, really does lack a world model, but it can get away with this because its task is contrived by assumption to be wholly local. Its job is to avoid obstacles; its method for avoiding each obstacle is fully hard-coded and assumed by stipulation to work every time. Nothing decision-relevant depends on anything except the presence or absence of an obstacle in the ant’s immediate neighborhood.
Simon’s ant is useful as an illustration of how the decision-relevant part of a problem can be much simpler than the whole thing, but lots of well-studied AI problems (navigating through the physical world to a particular destination, playing a game to win, generating fluent text) are obviously far more complex and do seem to require modeling to work, as I’ll discuss in the following section.
Why World Models Are Useful for Humans and AIs
A “world model” is useful because it has manipulable parts which correspond to parts of the world.
For example, a model of a room could have parts corresponding to the chairs and couch and table and lamp in the room. And one could alter the model, to say “what if we moved the couch somewhere else?” “what if we had one more chair in the room?” and make predictions about that hypothetical room.
The critical element here is the ability to simulate modifications of parts of the system. In other words, to imagine counterfactuals.
“What if just this part were different? What would change?”
This radically compresses the learning and generalization process.
If you parse a room as a collection of separate objects, with locations and volumes in 3-space, this literally compresses the data by many orders of magnitude compared to leaving it as a raw, uncompressed “video file” of your view of the room, at 60 multi-million-pixel images per second.
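To put rough numbers on “many orders of magnitude” (these are invented, back-of-the-envelope figures):

```python
raw_video_bytes_per_sec = 60 * 8_000_000 * 3       # 60 frames/s, ~8 megapixels, 3 bytes/pixel
object_model_bytes = 20 * 10 * 4                   # 20 objects x 10 floats each x 4 bytes
print(raw_video_bytes_per_sec / object_model_bytes)  # ~1.8 million to one, per second of video
```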
(Self-driving cars generally contain object-detection algorithms for this very reason. First they break down the scene into “road here, car here, truck there”, and then they plan a path along the road avoiding collisions. In other words, the designers of self-driving cars manually hard-code a world-modeling step into them; they do not allow cars to reinforcement-learn the right path from the raw sensor data and hope that along the way they automatically learn to detect and avoid obstacles!)
Cognitive scientist Joshua Tenenbaum has argued that humans learn to reason about physical objects -- to predict when objects stacked or leaning against each other will balance or fall over, for instance -- using an “intuitive physics model” that approximates Newtonian mechanics.
Children learn to walk and manipulate objects relatively quickly, and they generalize to all kinds of environments, which suggests that they need compact “models” that predict how objects behave in general. Preverbal infants are capable of predicting the outcomes of complex physical scenarios they have never seen, in ways that are consistent with “intuitive physics models.”
And models that learn a causal theory can learn to predict events much faster, from much less data, than models that learn no theory. So it’s plausible that humans do, and AIs should, learn causal theories to explain the world, in order to learn faster from less data.
A “brute force agent” that learned to perform a “correct” behavior through nothing but reward for success and punishment for failure, but had no ability to break down that behavior into separate reconfigurable parts, would not be able to generalize what it learned to perform variations on the behavior.
You can observe this yourself if you’ve ever learned to play a piece of music “by muscle memory”. You may find that you’re unable to stop partway through, pause, and then play the remainder; you may find that you can’t play the piece while changing a single note in isolation. You can play the whole thing exactly as you memorized it, or else you get stuck and forget how to play it at all. Very rigid, very poorly generalizable.
By contrast, something more like “deliberate practice” (and sometimes a bit of music theory) can allow you to learn to play pieces that you can freely modify, improvise variations on, stop and start, and so on.
Similarly, a “brute force” AI, that does not learn to separate data about the environment into relevant vs. irrelevant components, will fail to generate the desired behavior if the environmental data is minutely perturbed in ways that ought to be irrelevant to the correct behavior.
When this occurs, you get the famous “adversarial examples”, such as when a deep learning image classification model makes dramatic misclassification errors if small, invisible-to-humans amounts of apparent high-frequency noise are added to the image.
Clearly, human motor skill learning can involve “breaking down” the skill into separate, modular, reconfigurable steps.
And human visual perception can involve “breaking down” the scene into separate, modular, reconfigurable parts that we can handle somewhat independently (like “recognizable objects”, “foreground and background”, “form and texture”, etc).
And human verbal abilities can involve “breaking down” language into separate, modular letters, words, sentences, and concepts, that we can handle independently and reconfigure at will (what linguists call “compositionality.”)
Noam Chomsky’s 1959 attack on B.F. Skinner’s behaviorist theory of language argued that humans do not learn language through reinforcement learning. Almost all children learn to speak, whether they are carefully rewarded by adults for speaking correctly or not. And we can all comprehend novel sentences that we have never seen before, and even notice grammatical errors in them, which we would not be able to do if Skinner were correct and each hyper-specific “behavior” had to be trained separately with no general principles relating them.
With language, as with navigating physical environments, we can only explain humans’ ability to learn quickly and generalize by positing what Tenenbaum calls “intuitive theories”: inexplicit, compact representations of generalized “laws of nature” or causal beliefs about how things work. We develop an “intuitive physics” to explain how objects behave in space, an “intuitive grammar” to explain how language works, an “intuitive psychology” to explain people’s behavior, and so on. And there’s some experimental evidence that humans learn in ways that are consistent with us applying intuitive theories.
The bottom line is that “world models” or “intuitive theories” are valuable because they make learning more efficient and generalizable. They let us do more with less data.
Why World Models Are Necessary For X-Risk AGIs
For an AI to persistently pursue a goal in the world that could kill us all, it’s going to need substantial generalization and prediction abilities.
Any physical means of producing human extinction is necessarily a totally unprecedented state of the world, which the AI won’t have observed before, and which won’t be in its training data.
An AI that poses an x-risk needs to have a goal, which happens to cause human extinction, and which it will resist human attempts to prevent.
This means it probably has to have some idea of how to get to an unprecedented state of the world and what will happen when it gets there. It needs the ability to make predictions well outside the distribution of its training data.
An AI that’s trying to get “over there”, to an unprecedented state of the world, cannot assume (like Simon’s ant could) that nothing decision-relevant will be different “over there.”
Exterminating the human species might require the invention of a novel and more destructive weapon than exists today. Or it might “merely” require scaling up the manufacture and distribution of existing types of technology.
Neither can plausibly be done without models -- of physics, of chemistry, of machines and how they work, of human and institutional and market behavior.
Current-Gen AIs Sometimes Do and Sometimes Don’t Develop World-Modeling Abilities
So, do today’s state-of-the-art AIs develop world models incidentally, as the best path towards getting the highest “reward” (or lowest training loss) on the task they’re assigned?
It looks like a mixed bag.
Sometimes, clearly, they do develop world models.
As a minimal example, LLMs could produce perfectly grammatical and natural-sounding text as early as GPT-2. Clearly whatever “intuitive theory” of the structure of English grammar is necessary to produce reams of text that conforms to that structure, LLMs have mastered it.
There are also other results that suggest state-of-the-art AIs incidentally learn structural principles in the domains they model.
Many of the tasks on the BIG-Bench benchmark set involve verbal reasoning questions that attempt to probe “structural” or “world-model” understanding, such as recognizing cause and effect in text, deriving the rules of an invented language from a few examples, predicting counterfactual scenarios, solving multistep arithmetic problems, and so on.
All large language models up to GPT-3 are much worse than humans at BIG-Bench tasks, but they improve predictably with scale.
(We don’t know how GPT-4 performs on BIG-Bench because they used the test questions in their training set; whoops!)
Many tasks in BIG-Bench, including arithmetic, show a “breakthrough” scaling effect where models appear to be totally unable to answer the questions correctly at all until the models reach a certain size, at which point performance starts growing steadily with scale.
This is a familiar historical pattern; image classification was once utterly out of reach for machine learning models, until it wasn’t, and started improving in proportion to compute once deep convolutional networks arrived in 2012; the same went for machine translation, where performance was stalled at “basically nonexistent” until the models got big enough.
On the other hand, there are also examples of AIs failing to incidentally learn world models.
GPT-3 reliably fails some standard tests of common-sense reasoning that require interpreting ambiguous statements.
LLMs are able to do addition reliably…if the numbers aren’t too big.
(Both screenshots from ChatGPT 3.5 as of 3/31/2023.)
This is odd, given that the way I add multi-digit numbers, or the way my calculator does, involves following a consistent algorithm that works the same way no matter how big the numbers get. Clearly, any AI that has more trouble with 7 digits than 6 is not doing addition the way we do, which makes me suspicious that it doesn’t contain a full model of how addition “works”.
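For contrast, here is the consistent algorithm in question -- the grade-school, digit-by-digit procedure a person or a calculator effectively follows (a straightforward sketch, not something any LLM is claimed to implement):

```python
# Digit-by-digit addition with a carry. It takes longer for longer numbers,
# but it works identically whether the inputs have 3 digits or 300.
def add_decimal_strings(a: str, b: str) -> str:
    digits, carry = [], 0
    a, b = a[::-1], b[::-1]                       # work from the least significant digit
    for i in range(max(len(a), len(b))):
        da = int(a[i]) if i < len(a) else 0
        db = int(b[i]) if i < len(b) else 0
        carry, d = divmod(da + db + carry, 10)
        digits.append(str(d))
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

assert add_decimal_strings("987654321", "123456789") == "1111111110"
```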
I also notice that current-gen AI image generators like Midjourney 5 frequently produce Escher-like impossible figures and often fail to correctly visualize spatial relations (“on top of”, “facing towards”) from text prompts.
This indicates that they have not learned certain basic spatial reasoning principles incidentally along the path to their objective (generating images that are good predicted fits to the text caption provided by the user.)
Finally, AlphaZero is also sensitive to adversarial perturbations in Go -- adding two irrelevant stones to a board state can cause it to make wrong moves that human players can trivially avoid. In fact, using these adversarial strategies, a human amateur was recently able to beat a top Go AI, 14 games to 1.
This suggests that even amateur human Go players have learned some simple “principles” that AlphaZero doesn’t learn in a fully general fashion.
But, overall, there does seem to be a tendency for previously unavailable “world modeling” capabilities to come online as AI models scale up.
Moreover, note that Joshua Tenenbaum himself is a frequent DeepMind collaborator and has been actively researching how to give AIs common-sense reasoning abilities, intuitive physics, and so on, and he’s hardly the only one in his reference class. DeepMind clearly values making its AIs generalize better and require less training data.
So I basically think AI “world modeling” is going to improve over time. It’s an active area of research, it’s economically incentivized, and it follows a well-documented trajectory.
It is not safe to assume that AIs are harmless because they can’t model some domain area that they’re currently weak at.
Causality
Do present-day AIs have causal models? No.
Should we expect near-term AIs to develop causal models? No.
What’s Causality?
In Judea Pearl’s framework, causality is the essential thing that causal models offer that statistical prediction models (including all current machine learning and AI models) do not.
A statistical prediction model essentially learns a function to best fit the data. If it is a deep learning model then it’s a very complex function, but ultimately you’re “learning” to predict the dependent variable Y (the final result of a game, say) as a function of the data X (say, the sequence of moves up until a certain point.)
It’s all just p(Y | X).
Essentially, statistical prediction models are the equivalent of observational studies in natural science. You can identify correlations and patterns in the dataset, and take that to suggest that the pattern generalizes outside the dataset.
What machine learning doesn’t have is the equivalent of experimental studies, where the experimenter deliberately changes X to see how Y changes, and assigns X (the independent variable) a causal effect on Y (the dependent variable) if the answer is yes.
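In Pearl’s notation (my gloss), the difference is between p(Y | X = x) -- what we tend to observe when X happens to equal x -- and p(Y | do(X = x)) -- what happens when we reach in and set X to x ourselves. Statistical models estimate the first; only a causal model can answer the second.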
Correlational evidence says “A correlates with B; so maybe if we change A, B will change.” Experimental evidence says “We tried changing A, and B didn’t change.”
All else equal, if they conflict, you should believe the experiment!
Why Causality is Necessary for X-Risk AIs
Agency requires reasoning about the consequences of one’s actions. “I need to do such-and-such, to get to my goal.” This requires counterfactual, causal reasoning.
Mere statistical prediction (“what is the distribution of likely outcomes, conditional on my having taken a given action”) is not the same thing.
Why?
Reverse causality and confounders, that’s why.
If I try to compute a conditional probability p(Y | X), I have to deal with the fact that, in my dataset, most of my examples of someone doing X happen in conditions that cause people to do X. Those conditions could include Y, or could cause Y, or could be caused by something that also causes Y, or have any number of other connections.
The probability of feeling cold, given that one is wearing a sweater, might be high; that does not mean that putting on a sweater is likely to make you feel colder.
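Here is a throwaway simulation of that sweater example (all the probabilities are made up) showing how the conditional probability and the effect of an intervention come apart:

```python
import numpy as np

# Cold weather causes both sweater-wearing and feeling cold; the sweater itself
# has no causal effect on feeling cold in this toy model.
rng = np.random.default_rng(0)
n = 100_000
cold_weather = rng.random(n) < 0.5
wears_sweater = np.where(cold_weather, rng.random(n) < 0.9, rng.random(n) < 0.1)
feels_cold = np.where(cold_weather, rng.random(n) < 0.8, rng.random(n) < 0.1)

# Observational: p(feels_cold | wears_sweater) is high...
print(feels_cold[wears_sweater].mean())      # ~0.73

# Interventional: force everyone to wear a sweater; feeling cold stays at the
# baseline rate, because the sweater doesn't cause it.
print(feels_cold.mean())                     # ~0.45
```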
An AI that cannot distinguish these two outcomes is unlikely to be able to sequence a chain of actions that leads to an unprecedented state of the world, or to resist human attempts to thwart its efforts.
Do Current AI Models Incorporate Causal Models?
To incorporate causality into a machine-learning framework you need an entirely different kind of model, a causal model, that treats “A causes B” as a structurally different type of relationship than “A is evidence in favor of B.”
Neural networks, like all standard machine-learning and statistical models, belong to a class that is strictly less expressive than causal models.
In other words, it is a provable fact that no statistical model can learn to do everything that a causal model can.
This means that an ordinary neural-net-based AI will not evolve or learn a causal model incidentally as the shortest path to its goal, the way it might learn a purely correlational “world model”.
People will have to decide to build causality into AIs.
Will Future AI Models Incorporate Causality?
There’s an active subcommunity that keeps saying we need to build AI on causal models -- and it includes some luminaries like Yoshua Bengio, who has proposed a research direction for causal AI that involves inferring causal relationships from data.
But this community is small and new.
It was only as recently as 2021 that researchers at Columbia University and the University of Montreal defined a framework for “neural causal models” that are trainable using gradient descent. That means that essentially no empirical work could get done until two years ago.
What’s more, the essential feature of causal models is that they treat observational and interventional data differently. Thus, to infer “what happens when I do this?” you need to train on interactive data; that is, examples where the AI actually tries doing a thing and sees what happens.
In digital worlds, this is feasible; a game-playing AI can try different moves, a code-writing AI can try generating programs and running them (or testing them through various sorts of compilers or type-checkers), and a conversational AI can try writing text and seeing how humans respond.
But in physical worlds, experimenting with real robots in real time is extraordinarily expensive. Robot test data, unlike simulation, happens at the glacial pace of one second per second; and it requires physical capital (the robot) which can damage people, property, or itself if it smashes into things.
Interactive data based on human responses to AI is likewise gated by the low speed and high cost of human time.
And there doesn’t seem to be significant investment, last I checked, into generating these kinds of physical interaction datasets to train models on, at least not on the order of million-dollar training runs the way big AI firms have done with LLMs. Nor do auto companies or industrial robotics companies currently seem to be interested in building their autonomous machines “from scratch” to learn intuitive physics the way a baby does, when hard-coding it works just fine for many narrow applications and has a far more validated track record.
In other words, the economic incentive for developing true causal AI seems to be absent.
In fact, the more commercial success LLMs see, the less economically justifiable it will seem to invest in far more difficult and fundamental advances where the tools are less developed.
Causal AI might be necessary to build an AI that can, say, do an auto mechanic’s job; but if it’s technically much easier and more lucrative to build an AI that can do most of a middle manager’s job, who’s going to bother building the AI auto mechanic?
Goal Robustness Across Ontologies
Do present-day AIs have ontology-robust goals? No.
Should we expect near-term AIs to develop ontology-robust goals? No.
What Is Cross-Ontology Goal Robustness?
If a machine learning model (perhaps steering a robot body) is to pursue a real-world goal, like “find two strawberries and place them on a plate”, one thing it needs to be able to do is detect a strawberry.
The model needs to be able to detect strawberries consistently. It needs to keep detecting them when they are occluded by other objects, when the lighting changes, when the viewing angle changes, etc.
Some of these problems are largely solved today; some are not.
Adversarial perturbations -- tiny, human-imperceptible amounts of “noise” added to an image -- can cause deep learning classifiers to make drastic misclassifications.
You don’t even necessarily need to algorithmically design adversarial images; ordinary camera noise or viewing-angle variation can get a classifier to mistake a pen for a revolver half the time, simply because the type of camera used changed between the training dataset and the test dataset.
A goal-oriented AI, perhaps equipped with a robot body and a camera, needs to be able to keep recognizing a strawberry even if you upgrade it to a new camera.
More generally, a goal-oriented AI needs to be able to identify physical objects, actions, and states of the world, in ways that are robust to changes that humans would consider “minor” but a computer program often finds catastrophic.
Details of encoding or embedding are often harder for computers to generalize around than they are for us.
A computer chess program can outperform even the best human chess players. But a human chess player can easily grasp the equivalence between:
A physical chessboard
A photograph of a chessboard
A drawing of a chessboard
A list of chess pieces and their board positions
An 8 x 8 matrix with a zero in each empty square and a number from 1-32, one for each piece, in each occupied square
A human who learns to play chess on a physical chessboard has no problem transitioning to playing chess on a computer screen.
AIs have much more trouble with this. A computer chess engine trained exclusively on one format for representing the game would generally not be able to transfer its knowledge to a different format.
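For concreteness, here are two of those formats side by side, in a hypothetical toy encoding (abbreviated to a few pieces). A human sees at a glance that these describe one and the same board state; a model trained on only one format has no such guarantee.

```python
import numpy as np

piece_list = {"a1": "white rook", "b1": "white knight", "e1": "white king",
              "e8": "black king"}                    # abbreviated for illustration
piece_codes = {"white rook": 1, "white knight": 2, "white king": 3, "black king": 4}

board = np.zeros((8, 8), dtype=int)                  # 0 = empty square
for square, piece in piece_list.items():
    col = ord(square[0]) - ord("a")                  # file a-h -> column 0-7
    row = int(square[1]) - 1                         # rank 1-8 -> row 0-7
    board[row, col] = piece_codes[piece]
```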
With traditional (pre-foundation models) deep learning, adding a single new category to a classifier requires training a whole new network (albeit with the ability to transfer weights from the original network).
This specific issue is less of a problem for very large and general language and image models, but it’s not clear to me what abilities will and won’t turn out to remain “brittle” to encoding and ontology issues.
Very basic details of how data is encoded into a model (how many categories does the classifier have? How is text broken up into “tokens”?) can cause models to fail when they’re slightly modified.
If you anticipate this problem, and build the model to cope with it ahead of time, then these obstacles are surmountable; but the thing about reality is that it requires one to adapt to unanticipated changes that are outside one’s model.
Problem-solving in the real world often goes like this:
I formulate an (abstract) plan; I have an idea of what steps I’ll take, what kinds of results I might get, and what things might go wrong.
I get stuck on a task that I didn’t even realize was the kind of thing that could go wrong.
I now need a new ontology (tasks, results, possible roadblocks) that incorporates the new information.
Leaky abstractions are everywhere in engineering. In trying to root-cause a software bug, you may discover the real problem is a miswired circuit board.
Within an abstraction, there are certain things you are supposedly able to assume will “just work”. When the thing that supposedly “always works” fails, you have to discard that abstraction and use (or create!) a new mental framework that incorporates the messy reality you have just observed.
Humans aren’t always great at this, but we need to be able to do it at all in order to solve any novel practical problem.
Venkat Rao illustrates this sort of “messy reality” with Meccano, the classic building toy; unlike Lego, where snapping blocks together “just works”, Meccano parts are fastened together by various kinds of rubber bands, joints, and so on, with different tolerance windows. Building in Meccano, you have to worry about (or “model”) much more physical complexity than you do in Lego, closer to what real mechanical engineers have to do. There are whole classes of Meccano failure that can’t appear in Lego.
A novice Meccano builder who only had previous experience with Lego would have to learn new ways to plan around new kinds of problems, expanding his “ontology” of the kinds of issues that might come up in a building project.
Reality has a surprising amount of detail. Anything you have never done before -- building stairs, planting a vegetable garden, writing a computer program -- seems much simpler in the abstract and has tons of finicky little details up close.
For an AI to be able to solve real-world problems, it needs to be able to encounter unanticipated problems and expand its model, its ontology of possibilities, to accommodate them, and it needs to translate its old goal into the new ontology in a “natural” way.
This is the “Ontological Crises” problem. The toy example in the linked paper imagines an AI that makes moves on a simulated “world” made up of a grid of squares. If the AI is assigned to go to the ‘farthest square on the right’, and the ontology changes to increase the number of squares in the “world”, then the AI needs to find a way to revise its goal so it will still go to the farthest-right square in this larger world.
In other words, we’d want the “transformation” between one ontology and another to “preserve” goals in a “natural” way.
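A toy sketch of what “preserving” the goal would mean in that grid example (my own simplification, not the paper’s formalism):

```python
# The "world" is a row of squares; the ontology then expands to include more squares.
old_world = list(range(10))                  # squares 0..9
new_world = list(range(20))                  # the expanded ontology: squares 0..19

def brittle_goal(world):
    return 9                                 # "go to square 9", hard-coded in the old ontology

def robust_goal(world):
    return max(world)                        # "go to the farthest-right square", however many there are

assert brittle_goal(new_world) == 9          # no longer the rightmost square: the goal "broke"
assert robust_goal(new_world) == 19          # the goal translated naturally into the new ontology
```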
What do we mean by “natural”?
The theory here isn’t well-developed, but there are some obvious desiderata, like:
If an ontology expanded to include possibilities irrelevant to the goal, we’d want to leave the goal and the plan unchanged.
If an ontology changed such that we learned new kinds of obstacles were likely to arise in pursuit of a goal, we’d want the plan to change to incorporate ways to avoid the obstacles.
If an ontology changed such that an old subgoal was understood to be incoherent or impossible, we’d want to choose a new subgoal that continued to contribute to the terminal goal.
If an ontology changed such that a terminal goal was found to be incoherent or impossible, we’d want to choose a new terminal goal or shut down.
Why does AGI X-Risk Require Cross-Ontology Goal Robustness?
The ability to “translate” goals naturally across ontologies is necessary for being able to shift levels of abstraction or build models of unanticipated details that turn out to be relevant to solving a problem.
Without this ability, goals would “break” as soon as the AI encountered an out-of-model obstacle.
A powerful agent relentlessly pursuing an arbitrary goal is, by default, very bad news for humanity; but a “breakable” goal that swings around unpredictably every time the AI encounters messy reality is far less damaging because it is less consistent. (For the same reason that adding random mutations to an organism is far more likely to make it sick or dead than to make it more dangerous to other organisms.)
Moreover, cross-ontology goal robustness is required for an agent to “view itself as embedded in a world.”
Being “embedded in a world” means you know there is this thing, “yourself”, which exists inside a bigger thing, “reality” or “the world.”
An “embedded agent” knows that what happens to “itself” can affect its chances of succeeding at its objective. It can “protect itself”, “improve itself”, “acquire resources for itself”, etc -- all subgoals that are instrumentally useful for basically any AI’s goal, and all very plausibly x-risks if an AI tries to do something like “maximize available compute”. But first it has to have some concept of “itself” as a program running on a computer in order to learn to make causal predictions about how to do such things.
In order to view “itself” as “part of a world”, it has to know that its own map is not the territory. It -- and therefore its mind -- is smaller than the world. Its models could be “wrong”.
Wrong according to what? According to the AI’s own loss function?
No! Wronger than that.
Wrong according to reality. According to the “world.”
Wrong in a way that gets it killed, maybe, even as all its sensors and status reports are claiming everything is fine.
And for the agent to be able to conceive of that possibility and count it as something to avoid, for it to be able to say I don’t want my loss function to be spoofed by something that kills me, it needs to have a “goal” or a “value” that goes beyond its loss function.
AKA, not a static terminal goal. A way to distinguish the “true real-world goal” from one’s mere current specification of it, and update the specification as necessary.
AKA, cross-ontological transfer of goals.
Could you have an AI that didn’t care about this, that wasn’t self-aware and couldn’t reason about “what if I get killed” or “what if my reward function says everything is fine but I’m actually in danger?”, but that still persistently sought more computational power, or more security, or whatnot, just because its reward kept rising (or its loss kept falling) whenever it did so?
Sure, in principle.
But much more weakly, I think.
Without a concept of “my available compute”, an AI could gradually reinforcement-learn its way towards doing more of the things that happened to increase its compute.
But to generalize and then start systematically taking those and other new types of actions that increase its compute implies building some kind of implicit ability to distinguish “my compute” from “not my compute.”
That implies building a boundary between “me” and “not me”.
Do current-gen AIs have cross-ontology robust goals?
I don’t think they’re even close.
The theory of what this property even is, and how we’d tell whether an AI had it or not, is so primitive I’m not even sure how to approach the question.
But “how can I get better at achieving my mis-specified goals” isn’t, it seems, even the kind of thing that a current-gen AI could learn incidentally “along the way” to minimizing its loss function.
The loss function is the “wrapper”, full stop.
And since the theory is so undeveloped, as far as I know there’s not even an existing community of people trying to build alternative AI models around this theory like there is with causal AI. (The “embedded” or “embodied” AI communities are doing something similar, but they face the same economic constraints that robotics-based AI does generally.)
This is a weird, undertheorized, kind-of-philosophical issue where progress isn’t encouraged by either academic or industry incentives. The only way progress could happen is if some people actually wanted to build an AGI, were convinced that current methods were inadequate and we’d have to go “back to the drawing board” to a very radical degree, and figured out a way to turn handwavy heuristics into something implementable on a computer.
This kind of open-ended speculation and departure from standard paradigms is risky to a career in academia, and is, I think, even rarer in industry.
Also, to “go back to the drawing board” in search of AGI cuts against the grain of all technology tribalism. The “hype tribe” always says “the goal is achievable and wonderful and the current paradigm will get us there” while the “skeptic tribe” always says “the goal is unachievable and undesirable and the current paradigm doesn’t work.” Someone who says “the goal is achievable and desirable, but we need a departure from the current paradigm to get there” is tribeless and perpetually misunderstood.
So, the kind of advance we’re worried about must come from the rare maverick dreamer types who have their sights fixed on a distant vision of “true” AGI and are willing to spend years scribbling in the wilderness to get there.
Such an advance is of course not impossible -- but it’s a very different threat model from the armies of machine learning researchers and engineers making rapid incremental progress on deep neural nets because they are immediately rewarded with professional success for doing so.
You could probably find all the intellectually generative “AI dreamer” types and talk to them individually -- those sorts of people tend to share their ideas in writing.
If the lines of communication remain open -- if the current AI debate doesn’t tribalize to the point that “pro-AI” and “anti-AI” factions hate each other and can’t have friendly discussions -- then it might be remarkably tractable to just, y’know, persuade a handful of individuals that they should maybe not work too hard to get the world to take notice of their theoretical ideas.
I'm gonna echo a couple other commenters to say that when you say "Why I am not an AI doomer", I would say "Why I don't expect imminent LLM-centric doom, and (relatedly) why I oppose the pause".
(I ALSO don't expect imminent LLM-centric doom, and I ALSO oppose the pause, for reasons described here — https://twitter.com/steve47285/status/1641124965931003906 . But I still describe myself as an AI doomer.)
(I might be literally the only full-time AI alignment researcher who puts >50% probability, heck maybe even the only one with >10% probability, that we will all get killed by an AGI that has no deep neural nets in it. (The human brain has a "neural net", but it's not "deep", and it's kinda different from DNNs in various other ways.))
Like you, I don't expect x-risk in the 2020s, and I also agree with “maybe not the 2030s”. That said, I don’t COMPLETELY rule out the 2020s, because (1) People have built infrastructure and expertise to scale up almost arbitrary algorithms very quickly (e.g. JAX is not particularly tied to deep learning), (2) AI is a very big field, including lots of lines of research that are not in the news but making steady progress (e.g. probabilistic programming), (3) December 31 2029 is still far enough away for some line of research that you haven't ever heard of (or indeed that doesn't yet exist at all) to become the center of attention and get massively developed and refined. (A similar amount of time in the past gets us to Jan 2017, before the transformer existed.)
For example, do you think future AGI algorithms will involve representing the world as a giant gazillion-node causal graph, and running causal inference on it? If so, there are brilliant researchers working on that vision as we speak, even if they're not in the news. And they’re using frameworks like JAX to hardware-accelerate / parallelize / scale-up their algorithms, removing a lot of time-consuming barriers that were around until recently.
> persuade a handful of individuals that they should maybe not work too hard to get the world to take notice of their theoretical ideas.
I do have a short list in my head of AI researchers doing somewhat-off-the-beaten track research that I think is pointing towards important AGI-relevant insights. (I won't say who!) And I do try to do "targeted outreach" to those people. It's not so easy. Several of them have invested their identities and lives in the idea that AGI is going to be awesome and that worrying about x-risk is dumb, and they've published this opinion in the popular press, and they say it at every opportunity, and meanwhile they're pushing forward their research agenda as fast as they can, and they're going around the world giving talks to spread their ideas as widely as possible. I try to gently engage with these people to try to bring them around, and I try to make inroads with their colleagues, and various other things, but I don't see much signs that I'm making any meaningful difference.
Couple of things that strike me as missing on a quick read:
- Whether grinding a loss function over a sufficiently intricate environmental function like "predict the next word of text produced by all the phenomena that are projected onto the Internet" will naturally produce cross-domain reasoning. I'd argue we've already seen some pretty large sparks and actual fire on this.
- Whether an AGI that is say "at least as good at self-reflection and reflective strategicness as Eliezer Yudkowsky" can fill in its own gaps, even if some mental ability doesn't come "naturally" to it.