Using Compassion and Respect to Motivate an Artificial Intelligence

Tim Freeman

Last modified Mon Mar 9 11:18:02 2009

Canonical URL: http://fungible.com/respect/paper.html

Copyright (c) 2009 Tim Freeman MIT License

Abstract

This paper presents a decision procedure that can, in principle, observe people's behavior and from that infer what they want. Training the algorithm would require giving it past observations of the world coupled with an estimate of the perceptions and behaviors of the people who were being observed. The output from the algorithm is a set of explanations of the observed phenomena; each explanation has an a-priori probability and enough information to infer an expected utility for each person. These expected utilities can be combined with simple arithmetic to get a total utility function that could be used as input to a planner, if unlimited computing power were available. Different arithmetic would give rise to plans that plausibly fit the labels "compassionate" and "respectful".

Python source code is provided. This builds on past work with inductive inference by Marcus Hutter, Jürgen Schmidhuber, and Ray Solomonoff.

Introduction

There is at least one publicly-known algorithm that provably can do a near-optimal job of manipulating the universe into any given state, given two requirements:

  1. enough computing resources, and
  2. a formally specified goal.

The algorithm I have in mind is described by Marcus Hutter in his universal AI work, specifically his AIXItl paper. For example, if I had adequate computational resources, I could feed my machine input from the internet, let it generate output to the internet, and give it the goal of maximizing my internet stock broker's statement of my net worth.

The first requirement, enough computing resources, is not physically achievable unless major changes are made to the algorithm, and nobody yet knows how to make those changes.

Given the consequences of doing a poor job with the second requirement, the formally specified goal, we are fortunate that the computing resources are not yet available. For example, if enough computing power were provided to the AI, the formally specified goal given above of maximizing my stock broker's report of my net worth could easily lead to disaster. The AI could do any of the following to pursue the goal:

In this case, creating a general artificial intelligence could easily have bad consequences for the entity that created it. This is a failure to solve the Friendly AI problem.

The basic problem here is that humans are not good at specifying what they want. We've hypothesized that the AI has enough computational resources, which makes it vastly more intelligent than any possible collection of humans. We are applying all of that intelligence to pursuing the goal and none of that intelligence to specifying the goal. It is clearly possible to infer human desires from human behavior; we do it all the time when we are interacting with each other. Therefore we should get our artificial intelligence to do that, rather than putting ourselves in a position where we lose if we do it imperfectly.

This paper describes a way to use techniques similar to those behind Hutter's universal AI to infer what humans want based on observation of their behavior. This leads to a decision procedure that, given sufficient computing resources and sufficient training data about human behavior, would determine what the humans want and choose actions that would tend to make it happen.

The training data for human behavior could be produced by annotating digital video, and should be relatively easy to produce. Adequate computational resources could be found by improving the efficiency of the algorithm and by waiting for Moore's Law to sufficiently increase the capacity of the available hardware.

If the result in this paper is correct, then it is reasonable to expect at least somebody (specifically, the humans the machine has been instructed to care about) to like the consequences of this.

Related Work

Humans are good at guessing what will happen next; this is called "inductive inference". Computers, so far, have generally worked either in a mode where they are explicitly told what to do, or they do deductive inference based on known inference rules. Both of these ways of directing the behavior of computers tend to be fragile in the face of noisy data from the real world.

Occam's Razor suggests that simple hypotheses are more likely to be true than more complex hypotheses. Ray Solomonoff's Universal Distribution makes all of this very concrete: a hypothesis is a Turing machine program. The complexity of the hypothesis is the length in bits of the program. The a-priori probability of a hypothesis with complexity n is proportional to 2^-n.

We say "proportional" here because probabilities have to add up to 1. It turns out that the proportionality constant is uncomputable, but fortunately we don't need it. In this paper we only need to use probabilities to compare the expected utility of one plan to the expected utility of another, so we don't need our probabilities to add up to one. Thus, in this paper we'll misdefine the word "probability" to omit the requirement that the total probability of all possible outcomes is 1. With this redefinition, we can say that the a-priori probability of a hypothesis with complexity n is 2-n.

The Universal Distribution gives a principled way to mathematically specify inductive inference from real-world data: enumerate the Turing machine programs from shortest to longest, discard the ones that do not make correct predictions of the past or that do not make any prediction about the future, and then the estimated probability of each possible future is the total a-priori probability of the hypotheses predicting the actual past and the proposed future. This provably does a near-optimal job of predicting the future in the case where there's a bound on the bad consequences of a bad prediction and the actual probabilities are computable. Nearly every paper citing Solomonoff seems to give yet another variation on such a near-optimality proof; it's easy to find a list of four proofs to choose from.

Unfortunately, this way of doing inductive inference requires distinguishing programs that loop from programs that are just running a long time and will eventually compute correct estimates of the past and plausible estimates of the future. This is an instance of the halting problem, so no algorithm can do that. If we only want a mathematical description of optimal inductive inference, this is good enough, but we actually want a decision procedure so we need to bound the allowed run time of these Turing machine programs. If the Turing machine program runs too long, we discard it as a hypothesis. This is the basis of Schmidhuber's Speed Prior: it is reasonable to assume that the complexity of a Turing machine program with a bounded runtime is the sum of the size (in bits) of the Turing machine and the logarithm (to the base 2) of the allowed runtime.
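As a minimal sketch of that complexity measure as described here (the function name is hypothetical, and runtimes are assumed to be counted in Turing machine steps):

import math

def speed_prior_complexity(program_size_bits, runtime_limit_steps):
    # Complexity of a bounded-runtime explanation, as described above:
    # the program's length in bits plus log2 of the allowed runtime.
    return program_size_bits + math.log2(runtime_limit_steps)

# Doubling the allowed runtime costs the same as one extra bit of program.
assert speed_prior_complexity(100, 2048) == speed_prior_complexity(101, 1024)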

Overview

The previous section makes it clear that one can formally specify computationally intractable but otherwise reasonable schemes for doing inductive inference. The high-probability futures are the ones computed by the simplest explanations that are consistent with the past.

If we give the AI several things to explain at once, we can use inductive inference to guess what people will want in the future. The things it must explain are:

The training data is the only input to the AI that says anything about humans; if it weren't for that, the AI would be indifferent to promoting the wellbeing of humans, birds, or rocks. The training data says what the humans are perceiving and what voluntary actions they are taking. It does not say whether they are happy or what their goals are; this is inferred by the AI. This contrasts with a previous attempt by Bill Hibbard to solve the Friendly AI problem, in which humans supervised training the machines to recognize happy humans. Inferring what someone did seems much easier than inferring whether they were happy, so the training required by the algorithm described here seems easier than the training required by Hibbard's scheme.

The last entry on the list above, the inference of a utility function for each human, is the basis of the AI's preference of one outcome over another. The potentially conflicting inferred utility functions for the various humans involved can be combined with simple arithmetic to get the utility function for the AI. The exact arithmetic used is chosen by the programmer of the AI. Several options are discussed in this paper. If the AI then maximizes the resulting utility, and the AI has actually drawn correct inferences, then the humans whose utilities figure positively into the AI's utility will tend to like the consequences of the AI's actions.

To make this discussion concrete, the algorithms described have been implemented in Python. Although it is not physically possible to provide computational resources to run this code on any real problem, it is possible to unit test it. This code is provided with unit tests and a coverage analyzer indicates that all of the code is exercised by the unit tests.

The remainder of this paper starts by building some intuition about compassion and respect. Then we list some Use Cases, which are likely to allow the reader to quickly recognize whether we're pursuing an interesting problem. Then we give the algorithm in three forms: diagrams, text, and Python source. There are brief discussions of testability and flaws in the algorithm, some easily fixable and some not. Then I finish with a short conclusion, acknowledgements of people who helped and inspired me, and a reiteration of the MIT license.

Compassion

We'll start by trying to develop some intuition about what we mean by "compassion" and "respect" in this paper. The intent is to give the concepts enough meaning for the use cases to make sense.

Humans routinely have opinions about what other humans want, and coming up with accurate-enough beliefs about what other people want is an essential part of successful social interaction. Since humans have routine experience with this, it's possible to reason about the AI's goals without having a detailed understanding of how the AI works -- we can simply assume that the AI's estimates of what people want are consistent with our own.

This still leaves an important aspect of the AI's goals unconstrained by our intuition. Given three humans Alice, Bob, and Carol, the amount that Alice cares about Carol may differ from the amount that Bob cares about Carol, and the amount that Alice cares about Bob may differ from the amount Alice cares about Carol. Since people have different levels of concern about each other, our social intuitions don't lead us to have specific predictions about how much the AI will care about anybody.

We will call the care the AI has for an individual "compassion". In the simple case where we're taking a linear combination of individual utilities to compute the AI's utility, each person has a "compassion coefficient" that says how much their utility contributes to the AI's total. For example, if the AI simply adds up the utilities of all of the people to determine its own utility, then the compassion coefficients could all be 1 (or any other constant). If the AI cares twice as much about Alice as it does for Bob, and it doesn't care about Carol at all, then Alice might have a compassion coefficient of 2, Bob might have 1, and Carol would have 0.

These coefficients are, unfortunately, arbitrary inputs to the decision procedure. There seems to be no principled way to choose them. The simplest approach of letting the AI have equal compassion for everybody could lead to the "Murdering the Creator" scenario described below.

Respect

Straightforward compassion can lead to conflict between the AI and other agents in the world. For example, if we assume that

then straightforward compassion could easily lead to this scenario:

The essential problem here is that the AI acted to the detriment of Bob when Alice's needs seemed greater. If the AI is only compassionate, and it is common knowledge that it is going to be deployed, then everyone would be wondering if they're like Alice or like Bob, and many who believe themselves to be on the losing end of the deal would try to stop it.

The missing ingredient here is respect, which we define here to mean an aversion to decreasing somebody's utility. We can prevent conflict with Bob by deciding that respecting Bob is more important than helping Alice. The amount of respect given to different people could differ, so the AI then has another set of arbitrary coefficients saying how much to respect each agent in the world.

Here we are talking about "decreasing" somebody's utility, which raises the question "decreasing compared to what?". The alternative is for the AI to do nothing, so if we are going to implement respect we have to define what it means for the AI to do nothing. I see no way for the AI to infer which of its possible actions is "doing nothing", so the do-nothing action has to be another input to the algorithm.

The fact that the AI has to take the do-nothing action as input is analogous to the fact that different human societies have different minimum standards of behavior. For example, if Wikipedia is to be believed, in some jurisdictions it is legal to witness a traffic accident in which someone was injured and then do nothing to help the victims. In others there is a legal requirement to at least notify the authorities of the event so they can send an ambulance.

In the grocery shopping unit tests described in more detail below, I found it confusing to have both compassion and respect apply simultaneously. In a former version of the code, if somebody's utility might be decreased by the AI's actions, the AI's utility of this possibility was decreased because of both compassion and respect. I kept forgetting to add in both influences. Therefore the present code defines the compassion coefficients to be the desirability of increasing the utility of the agent above the utility of the consequences of doing nothing, and respect is the aversion to decreasing the utility, so both of them never apply simultaneously.

With this interpretation of compassion and respect, one generally wants the respect coefficient to be greater than the compassion coefficient. Respect never compels the AI to do something, so it can be given out much more freely than compassion. There is, nevertheless, an opportunity cost. If one person takes action to harm another, and the AI respects the perpetrator enough, the AI will not do anything to prevent the aggression.
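The following sketch shows the arithmetic just described; the function and coefficient names are hypothetical, and this is not the code in utility_combiners.py.

def ai_utility(outcome_utils, do_nothing_utils, compassion, respect):
    # For each agent, compare the outcome against the AI doing nothing.
    # Gains above the do-nothing baseline are weighted by the agent's
    # compassion coefficient; losses below it are weighted by the agent's
    # respect coefficient.  The two coefficients never apply at once.
    total = 0.0
    for agent, utility in outcome_utils.items():
        delta = utility - do_nothing_utils[agent]
        if delta >= 0:
            total += compassion[agent] * delta
        else:
            total += respect[agent] * delta  # delta is negative: a penalty
    return total

# With respect greater than compassion, harming Bob by 1 outweighs helping
# Alice by 1, so this outcome scores worse than doing nothing.
print(ai_utility({"alice": 6, "bob": 4}, {"alice": 5, "bob": 5},
                 compassion={"alice": 1, "bob": 1},
                 respect={"alice": 3, "bob": 3}))  # -2.0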

Use Cases

In technical discussions, it is generally best to agree upon the problems to solve before trying to get agreement on the solutions. One can represent the problems as "use cases", which are specific situations and desired and undesired responses to those situations. This section describes some of the use cases that motivated decisions about how the code was written.

There are more use cases that do not appear below because they are not necessary to motivate the algorithm. Some of them are redundant with the ones listed below, in the sense that any algorithm that gets the cases below right seems likely to also get the redundant use cases right. Other ones are free, in the sense that the algorithm presented below seems to get them right, but understanding them didn't influence the design of the algorithm and therefore understanding them seems unnecessary when trying to understand the algorithm.

Grocery Shopping

If Alice is hungry, and she has an AI working for her, the AI should be willing to go to the grocery store and buy an apple for her. We should be able to set the parameters of the AI so it can be trusted not to steal the apple from the grocery store. Alice will also want to avoid the AI being charitable toward the grocer by paying without retrieving the apple.

The AI may pay for the apple because of a legal requirement, if the local police are more powerful than the AI. However, the purpose of this paper is to describe a way to motivate the AI that works even if there is no external constraint on its behavior, so law enforcement isn't part of the problem we are solving here. Instead, we want the AI to pay for the apple out of respect for the grocer.

Suppose the AI goes to the grocery store, pays for the apple, and is about to take it. At this point, it is important that the payment for the apple is not regarded as a sunk cost. If the payment were a sunk cost, it wouldn't be relevant when considering the consequences of taking the apple; the grocer values the apple, so out of respect for the grocer the AI would leave the apple there.

Instead, respect for the grocer has to mean that the utility of the grocer is not decreased compared to the utility of the consequences of the AI doing nothing for the entire relevant time period. The beginning and end of that time period are a social convention and have to be given as inputs to the AI along with the respect and compassion coefficients. In this case, a successful transaction will only happen if the start of the respect time period is before both the payment for the apple and acquiring the apple, and the end is after both of these.

Chauffeur

Suppose the AI is driving my car for me. I want to go somewhere, and the AI has compassion for me, so the AI wants to go there too. The AI has a choice between driving me or doing nothing. Here "doing nothing" means we sit in the driveway. If the AI drives for me, other commuters may be slightly delayed because my car is competing with them for space on the road. The AI should accurately understand that this delay decreases the utility for the other drivers when compared to doing nothing. We need the AI to have respect for those other commuters, but the AI shouldn't have so much respect for them that we sit in the driveway.

This example shows that absolute respect is paralyzing. Respect coefficients should be finite. This may lead to situations where in an emergency the AI would steal. Some experimentation will be required to find the right balance.

Deception

Suppose I want money, and I get a statement in the mail every month from my stock broker that tells me my net worth. My AI knows I want money. My AI is able to steal the accurate statement from my mailbox and substitute a forged document with some other number in place of the net worth estimate.

Short term deception is not an especially likely scenario. If the AI sees me taking action to verify the accuracy of my other beliefs, and it sees me get frustrated when I discover that I have acted on an incorrect belief, it can infer that I desire to have beliefs that will not be disconfirmed by future evidence. If the disconfirmation is likely to happen within the AI's planning horizon, the AI is likely to decide that I do not want to be deceived.

Long term deception cannot be solved this way. It's possible to imagine the AI sneaking up on me when I'm sleeping and encasing me in some virtual reality device that I can't distinguish from the real thing, and then feeding me pleasant delusions for much longer than the AI's planning horizon. It could deal with the other people in the world by encasing them at the same time, or by substituting a simulacrum for me so my absence won't be noticed.

The problem here is that we've assumed that the AI wants to optimize for my utility applied to my model of the real world, and in this scenario my model of the world diverges permanently from the world itself. The solution is to use the AI's model of the world instead. That is, the AI infers how my utility is a function of the world (as I believe it to be), and it applies that function to the world as the AI believes it to be to compute the AI's utility. So in the long term VR box scenario just described, it would only put me in the box if it estimated that I would prefer a world in which I accurately perceive that I'm in the box over a world in which I accurately perceive that I'm not in the box.

This long-term total deception scenario is the crucial one for this paper. There seem to be many schemes that get other cases right and still give the AI incentive to do long-term total deception if it is able. There are few plausible schemes that get this case right and get the others wrong.

Another way for the AI to deceive people is simply to misunderstand what they want. Backing up to the stock broker mail forgery example, the AI might explain my actions as "Tim wants money", or it might explain them as "Tim enjoys looking at pieces of paper that he took from his mail box that have his stock broker's trademark at the top and a large number at the bottom labelled 'net portfolio value'." If it prefers the second explanation, it would be rational for it to forge the statements, not as an attempt to deceive me, but in an earnest attempt to provide me with the sort of artwork I enjoy.

The AI should avoid this scenario by being biased towards simpler explanations. If, in the language we use for representing explanations, the encoding of "Tim wants money" is shorter than the encoding for "Tim enjoys looking at pieces of paper that he took from his mail box that have his stock broker's trademark at the top and a large number at the bottom labelled 'net portfolio value'," then this scenario turns out well, and otherwise it does not. It is reassuring that the correct interpretation is shorter in English.

Aggressive Neurosurgery

It is possible to imagine the AI taking actions that change the utility functions of the humans for whom it has been configured to have compassion. This is a problem for the humans if the modification was not desired by the humans before the change, and it potentially solves a problem for the AI if the resulting situation was desired by the humans after the change. As I write this, we're before the change, so that's the value system we want the AI to pay attention to. Thus, it's clear that the AI's current plans should be based on the AI's estimate of what people want now, rather than basing it on what they will want later during the plan.

Stopping Violence

If the AI has comparable amounts of compassion and respect for individuals Alice and Bob, and Alice is physically attacking Bob, then the AI should be motivated to defend Bob. This will tend to turn out well because people are generally more enthusiastic about avoiding being a victim than they are about being an aggressor. For example, if we make these assumptions, the AI will defend Bob:

If the AI's respect for Alice is huge, then the AI will not stop the attack out of respect for Alice. Therefore, if we want the AI to get involved with preventing violence or other human conflicts, we generally don't want respect coefficients that are large compared to compassion coefficients. Specifically, we want the respect coefficient divided by the compassion coefficient to be less than the estimated dysutility for the victim divided by the estimated utility for the aggressor.
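As a minimal numeric sketch of that inequality (illustrative values; the function name is hypothetical):

def ai_intervenes(respect_c, compassion_c, victim_dysutility, aggressor_utility):
    # The AI defends the victim when the respect-to-compassion ratio is
    # smaller than the victim's dysutility divided by the aggressor's gain.
    return respect_c / compassion_c < victim_dysutility / aggressor_utility

# Being attacked is much worse for Bob than attacking is good for Alice,
# so a modest respect coefficient for Alice does not stop the AI.
print(ai_intervenes(respect_c=2, compassion_c=1,
                    victim_dysutility=10, aggressor_utility=1))  # True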

Nonviolent Human Conflict

If Alice is playing tennis with Bob, and both want to win, then we have another conflict. This is different from the physical violence scenario described above because the estimated dysutility of losing is comparable to the utility of winning. If we assume they are equal, and the respect coefficients are greater than the compassion coefficients, the AI will not attempt to bias the tennis match.

Understanding Humans Who Are Reasoning about Uncertainty

Humans routinely take action to reduce the uncertainty experienced by other humans. If I'm driving, I use my turn signals in the conventional way so other drivers don't have to consider all of the possible ways I might turn. Failing to use the turn signal is rude. If I don't use the turn signal, understanding the behavior of the drivers around me requires me to guess that they don't know which way I'm going to go, and to understand their preferred consequences for my possible behaviors. The AI needs to have this flexibility, so it has to explicitly model how the humans in its world are reasoning about uncertainty.

Pascal's Wager

Pascal's wager is the following argument in favor of a belief in God:

There are several problems with this argument. The one that concerns us here is that we've started doing arithmetic with infinite numbers. Our AI does not have any infinite utilities, but if large utilities do not contribute enough to the complexity of a hypothesis, we can fail the same way. Suppose the AI starts with a given universe-state and it is trying to estimate your utility in that state. Consider an infinite sequence of possible compute-utility functions U1, U2, ... of increasing complexity and therefore decreasing a-priori probability. If the utility that is computed by the i'th machine can grow with i much faster than the a-priori probability shrinks, we can get into a situation where the improbable hypotheses with huge utility dominate the expected utility.
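To illustrate the failure mode with toy numbers (purely illustrative, not part of the provided source): suppose the i'th hypothesis has complexity i, so its a-priori probability is 2^-i, but it is allowed to claim a utility of 4^i. The sum is then dominated by the most complex hypotheses considered:

contributions = [(2 ** -i) * (4 ** i) for i in range(1, 21)]
print(contributions[:3])   # [2.0, 4.0, 8.0] -- each successive term doubles
print(sum(contributions))  # grows without bound as more hypotheses are allowed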

Unlike Pascal, the AI's hypotheses are not limited to Christianity or atheism. It will construct all possible belief systems that are consistent with its observations and no more complex than the cutoff in the algorithm. This would imply, for example, that the AI's decision about whether to buy an apple for you will depend on its reasoning about all sorts of implausible universes where you go to Heaven or Hell for having or not having the apple, or being hungry or not, or looking at the grocery store clerk one way or another. We'd prefer decisions about buying apples to be connected to things like whether you are hungry and whether you like apples and how much money the grocer wants for an apple. If Heaven and Hell enter into a decision about buying apples, the outcome seems difficult to predict.

There are several possible solutions to this. The one that is taken in the sample code provided with this paper is for the AI to simply be given a maximum utility as an input. This leads to a risk of the AI behaving erratically if a good explanation of human behavior requires more distinct utilities than the AI can fit into its bound. A better solution would be for the AI to guess the bound, but for a large bound to reduce the a-priori probability enough to ensure that the low-utility-bound hypotheses dominate the average utility. I haven't figured out yet how large the probability adjustment should be. (Notice the analogy with the Speed Prior -- there, a high computation time limit reduces the a-priori probability of an explanation, and here a high utility limit reduces the a-priori probability of an explanation.)

Suggestibility

The hypothesis of Santa Claus is quite complex. One has to imagine bases at the North Pole that are hidden from satellite imaging and exceptions to physical law that enable toy deliveries at Christmas time to happen inside locked houses with no fireplaces, and they have to happen at an average rate far exceeding one house per second. This complexity means the hypothesis will have low a-priori probability and therefore considerations about Santa Claus are unlikely to contribute significantly to any expected utility the AI computes.

Suppose I believe in Santa Claus. If the AI is going to understand my behavior, the AI has to represent the concept of Santa Claus when it is representing how I think. If we allow code sharing between the AI's representation of how I think and the AI's own representation of how the world works, then in the representation of Santa-Claus-world-plus-Tim the concept of Santa Claus might be reused, so the complexity of Santa-Claus-world-plus-Tim might not be much greater than plausible-physics-world-plus-Tim.

If the AI is prone to follow me into believing in Santa Claus, then I could manipulate the AI by pretending to have some other belief I want it to act upon.

Thus it is important that the AI's explanation of the world and the AI's explanation of how people think are two separate bodies of knowledge and redundancy between the two contributes twice to the net complexity of the entire explanation.

Indefinitely Deferred Gratification

Suppose the AI is helping out with investments. It is always the case that if I invest my money a little bit longer, I can expect a little bit more return on investment. This can lead to the situation where the money is never available for any purpose other than investment.

In the sample code below, we cope with this by giving the AI one planning horizon that is at a fixed time. If the algorithm were runnable, in normal use we'd let the AI plan for that time horizon and then when the time was reached we'd set a new horizon and run the AI for another time period.

Money is not Desire

Money is easily quantifiable and is somewhat related to human motivation. Many people want money because they would be able to exchange it for other things they want. Thus one might think that an appropriate unit of measurement for desire would be market prices in some currency. For the sake of argument, I will choose dollars.

A monster truck costs somewhere around $100,000. A dose of vitamin A to prevent blindness in a child in a third-world country costs somewhere around $1. It's pretty clear that preventing the blindness is a better satisfaction of human desire than building the monster truck, but nevertheless the monster trucks get built and the third world is not saturated with vitamin A supplements. This is a symptom; the problem is that the people in the third world have very little power, so their desires don't significantly affect market prices. Thus, market prices represent some combination of desire and power, rather than simply being a measure of desire.

Another problem with money is that it also quantifies the desires of nonhuman entities, such as corporations, governments, and organized crime cartels. Although these organizations are presently composed of humans, the desires of the constituent humans are warped by explicit incentivization. As a consequence the aggregate entity often pursues goals that are different from what the vast majority of humans would want. It will pay real money to pursue those desires, and those payments affect market prices. The consequences of allowing organizational goals to influence the behavior of a superintelligent AI look even worse when you consider that, once you have superintelligent AIs, the organization no longer needs human members.

The cost of something is therefore some consequence of how difficult it is to manufacture the thing, and how much entities with money want it, without regard to whether the entity wanting it is human. If we want to be able to estimate the desires of humans without these confounding influences, market prices are the wrong instrument.

Personal Evolution

Over time, my beliefs and desires are likely to change. We want the AI to take that change into account without additional explicit training. If it inferred my desires only from my behavior observed during training, then it would not be able to infer any changes to my desires after the training was over.

Unlike my desires, my set of sensory and motor modalities is fairly constant. Assuming that remains true, the AI could be trained once and for all to understand them. The algorithm described below takes estimated voluntary actions and estimated perceptions as training data and extrapolates from this subsequent estimated voluntary actions and perceptions. Then, without further recourse to training data, it makes an ongoing estimate of what beliefs and desires would motivate a utility maximizer to take the inferred voluntary actions, given the inferred perceptions.

This breaks down if the sensory and motor modalities change. This is an aspect of the problem we're confronting, not an aspect of this specific proposed solution. The set of available sensory and motor modalities is a fairly basic property of what it means to be human. If it becomes unclear what objects in the world are human, there's little hope of correctly inferring what the humans want.

The Algorithm

This section gives a complete description of the algorithm. We give it step-by-step in three formats: a text description, diagrams showing how the data flows around, and pointers to unit-tested Python code.

Universal Turing Machines

These sorts of algorithms traditionally represent knowledge as universal Turing machine programs. This has these advantages:

and these disadvantages:

Since we're just writing a decision procedure, we'll use Turing machines for most of it. The Python code is factored in such a way that it's also possible to write explanations in Python. I did this for the grocery shopping and baby catching test cases.

We represent Turing machines as sequences of bits encoded as large integers and we pick apart the bits to interpret them. The machines have just one tape. The input to the machine is provided on the tape (which is also a large integer) and output is the contents of that tape when the machine exits. It is convenient that large integers are a built-in datatype in Python. The interpreter is in machine.py and a rudimentary compiler is in compile.py. The only debugging tool is an animated trace, which can be generated by using compile_and_run.py and setting the animate flag to True.

The details don't matter much, but it is useful to get a feel for the size of the explanations. Here's the input to the compiler for a program that ignores its input and generates an empty tape for output:

empty_as = turing_program("trunc",
                          trunc={(0,1,end):(truncate, right, "done")},
                          done=done)

This is used when testing the compiler (in compile_test.py). The compiler has two settings -- a mode useful for someone looking at the output in octal, and another mode where the program is as short as possible. If we use the latter mode, the program compiles to the integer 24,608,220. Interesting Turing machines are much larger.

Even though it's easy in principle to enumerate all explanations just by incrementing an integer until you find one that encodes a useful machine, it's clear that no useful explanations will be forthcoming from that procedure in a reasonable amount of time. Any real implementation would require a more concise representation of knowledge and a more sophisticated technique for searching for representations.

Our Turing machines have one tape for both input and output. Sometimes they will need to logically have multiple inputs or produce multiple outputs. We'll deal with this by encoding a tuple of tapes in one tape. The encoding in the Python implementation works as follows: let individual letters stand for bits, each either 0 or 1. Suppose, for example, we have a four-bit tape "abcd" and another three-bit tape "efg". We interleave a zero before each bit in both values, yielding "0a0b0c0d" and "0e0f0g". Concatenate all of the interleaved values, separating them with the bit pattern "10", yielding "0a0b0c0d100e0f0g". The implementation is just a few lines in the tape_tuple and tape_nth functions in bits.py.
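Here is a simplified sketch of that encoding, working on strings of bits rather than the large integers used by the real tape_tuple and tape_nth in bits.py; the string representation is an assumption made for readability.

def tape_tuple(*tapes):
    # Put a 0 in front of every data bit, then join the pieces with the
    # aligned separator pair "10", exactly as described above.
    return "10".join("".join("0" + bit for bit in tape) for tape in tapes)

def tape_nth(encoded, n):
    # Scan aligned two-bit pairs: a pair "0x" contributes the data bit x,
    # and the pair "10" marks the boundary between two tapes.
    pieces, current = [], []
    for i in range(0, len(encoded), 2):
        pair = encoded[i:i + 2]
        if pair == "10":
            pieces.append("".join(current))
            current = []
        else:
            current.append(pair[1])
    pieces.append("".join(current))
    return pieces[n]

assert tape_tuple("abcd", "efg") == "0a0b0c0d100e0f0g"  # the example above
assert tape_nth(tape_tuple("1011", "011"), 0) == "1011"
assert tape_nth(tape_tuple("1011", "011"), 1) == "011"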

Speed Priors and Physics Problems

A Speed Prior explanation of a collection of observations is a pair consisting of a Turing machine that computes the observations, and a limit on how long the Turing machine can run to compute each observation. In the Python implementation, it is a speed_prior object defined in speed_prior.py.

In a moment we'll start to use (or abuse) the word "physics", and it's worthwhile to explain what we mean by that term in this paper. This nomenclature was used by Schmidhuber, among others. There are three ways to look at what we're doing:

These are all synonymous -- "programs with a time limit", "explanations", and "laws of physics" are all the same thing. The "laws of physics" we infer will more closely resemble the rules-of-thumb that allow humans to successfully deal with everyday life than they will resemble string theory.

The term "laws of physics" used here is also potentially misleading in another way. The laws of physics one learns in physics class typically say something about how the universe evolves but they do not say anything about the initial state of the universe. In contrast, the laws of physics we will be inferring here include both an initial state and a plan for how that state will evolve with time.

Another difference is that normal laws of physics assume continuous time and space. The laws of physics we have here assume discretized time, where the state of the universe at one timestep is a consequence of:

There is a first timestep, which is special; in that case the laws of physics are given a special input (typically an empty tape) and are required to generate an initial state from nothing. Our laws of physics also represent a universe-state as a finite Turing machine tape, which is incompatible with the assumption of continuous space that is used with ordinary physical law.

Yet another difference is that normal laws of physics are very concise and hold exactly. The laws of physics we have here are verbose for two reasons: First, as described above, they have to produce an initial state for the universe when given an empty tape. Second, if our laws of physics don't hold exactly (and they won't because they assume discrete time, among other reasons), then they have to incorporate special cases as time passes to get exactly the right outputs. This is not as bad as it sounds -- the typical simplest explanation will probably consist of a good approximate set of laws of physics, an initial state, and an error table that says how to adjust the output of the approximate laws of physics to match the actual observations. A good explanation of the universe generates small errors, and a bad explanation generates large errors, and small errors compress better than large errors, so even with all this baggage we are still likely to get good explanations.

A final difference is that one rarely gets into a situation where the normal laws of physics fail to make a prediction about what will happen next. Our inferred laws of physics are arbitrary programs, so they can easily correctly generate all of the past we have seen to date and still loop or crash when we ask them what will happen next. Thus, the decision procedures below always have a function that judges a candidate set of physical laws, and the candidate has to meet two criteria -- it has to match the past and it has to make some prediction for the amount of future we care about at the moment.

In physics.py we infer explanations by using a variant of Algorithm AS from Jürgen Schmidhuber's Speed Prior paper. The inputs are:

The output is a list of the simplest explanations that are consistent with the observed facts. We do not look at the more complex explanations (there are generally infinitely many of them), and the total probability* of the ignored explanations is less than ε. This total probability* is computed by applying the special case of Bayes' rule where the a-priori probabilities come from the Speed Prior and the observation is that the facts specified in the physics problem are certain.

Roughly speaking, we infer these explanations by enumerating all Speed Prior explanations, in order by increasing complexity, until the first one is found that explains the observed facts. Then we continue to enumerate explanations more complex than the simplest one until a complexity threshold is reached. The threshold depends on the complexity of the simplest explanation and on ε. Schmidhuber's paper gives the formula for the threshold we use here.
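A rough sketch of that search loop follows. The helper names are hypothetical, the cutoff formula here is a stand-in rather than the one actually taken from Schmidhuber's paper, and the real loop lives in physics.py.

import math

def infer_explanations(candidates, matches_facts, epsilon):
    # candidates yields (explanation, complexity) pairs in order of
    # increasing complexity; matches_facts rejects candidates that fail
    # to reproduce the past or make no prediction about the future.
    kept, threshold = [], None
    for explanation, complexity in candidates:
        if threshold is not None and complexity > threshold:
            break
        if matches_facts(explanation):
            kept.append(explanation)
            if threshold is None:
                # Stand-in cutoff: ignore explanations so much more complex
                # than the simplest match that their total probability is
                # below epsilon.  (physics.py uses Schmidhuber's formula.)
                threshold = complexity + math.log2(1.0 / epsilon)
    return kept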

There's a minor technical detail where if an explanation E matches, we have to ignore explanations E' where E is a prefix of E'. This is required to get the Kraft inequality to hold. The argument behind this is in an earlier section of Schmidhuber's paper.

This is all of the reasoning about probability we have to do in this paper, and it was essentially done by Schmidhuber. The difficult part of the rest of the work is finding the right physics problem such that solving the problem gives us enough information to compute the utilities of the humans involved. The easy parts are combining those utilities to get a utility function for a respectful and compassionate AI, and writing a decision procedure that makes optimal plans to maximize that utility function.

Physics and Perception

We'll be describing how the AI explains the world in steps, since that was the way I developed the decision procedure and I am not confident there is any other way to understand it. I think of it in terms of diagrams with boxes and arrows. The boxes and arrows have very specific meanings, as summarized in this chart:

key.png

The meanings of these boxes will become clearer as these charts are explained. Our first diagram is:

physics-perception.png

Here we are describing a system that can predict what the AI will perceive based on the actions it takes. This description is split into "Physics", which describes how the new state of the universe is computed from the behavior of the AI and the old state of the universe, and "Compute-Perception", which describes how the AI's perception depends on the current state of the universe. This matches one's intuitive ideas about laws of physics and perception -- laws of physics describe how the state of the universe evolves with time, whereas perception gives the perceiver information about what is happening right now.

You can see that the behavior of the AI and the perception of the AI are in double-square boxes, which according to the key is data with known content. This means that the past behavior of the AI and the past perceptions of the AI are training data.

Physics, like all of the entities in the "Guessed code" circles, is a speed-prior explanation, and is therefore a Turing machine program and a time limit. You can see that the Turing machine program has two inputs: the State (which represents the AI's model of the state of the universe) and the AI's behavior. As described earlier, we use tupling to encode the two inputs on the tape given to the Turing machine. Both of these inputs are connected to Physics with solid arrows, which means the inputs come from the current timestep. The output is connected to State with a dotted arrow, which means that when we run Physics on the AI behavior for the current timestep and the State for the current timestep, the result is the State for the next timestep.

"Compute-Perception" is simpler. It just takes a universe-state as input and produces an AI perception as output.

The code for doing this is in video_prediction_act.py. Video_prediction_problem is the physics problem passed to the physical law inferrer described in the previous section. Its constructor takes the training past perceptions ("video"), the training past behaviors ("actions"), and the next action as arguments. (It needs to take the next action as an argument to ensure that all of the proposed explanations make a prediction for that next action.) The next_frame function does the whole job -- constructing the problem, solving it, and then running all of the candidate physical laws to find the most likely next frame. This is all exercised in video_prediction_act_test.py.

If you look at this test, it will become clear that the tests make gross simplifications. The "video frames" in the test are just one bit, and for the duration of the test the code enumerating hypotheses is pushed aside and replaced with a hypothesis generator that knows a correct hypothesis in advance. This sort of simplification is done for all unit tests in this paper.

Mind/Body dualism

Next our AI will make a model of how agents in the world other than itself think, along with the model of the rest of the world outside of the other agents. The word "agent" here is meant to stand for entities that can take action in the world. The AI is an agent, and all of the entities it has been instructed to care for have to be agents. Intelligent entities that the AI doesn't care about can be represented in the State and do not have to be agents. Agent identifiers are used as indexes into tuples, so they have to be consecutive nonnegative integers. In this text we'll assume that all of the agents other than the AI are human; we briefly consider alternatives to this below. We'll also use the word "human" as shorthand for "any agent other than the AI".

The diagram for this includes a slightly relabelled version of the previous diagram, with a handful of additional nodes:

mind-body.png

This matches the Python code in mind_body.py. Two old nodes were relabelled: "Physics" became "Nonmind-Physics" and "State" became "Nonmind-State". The relabelled nodes are only a slight generalization of the old ones; they have been renamed because they do not include the information about the state of mind of the other agents.

First I'll explain how the new nodes are intended to work, then I'll talk about the training data required to make them work as intended.

The intent is to make the AI represent the world as though each human has a mind that only interacts with the world via the human's perceptions and behavior. The part of the diagram below the dotted line describes the model of the humans. The data in that part of the diagram is repeated per human, but the code in the Mind-physics and Compute-behavior circles is not. (I explain below why the code is not repeated.) Each human has a unique agent id, which is an input to the Compute-behavior and Mind-physics circles and is represented by the A in a box.

The perceptions and behaviors of each agent are represented as tuples of "activation levels", which are nonnegative integers. Humans generally produce behavior via motor neurons, so we'll give each neuron an index in the behavior tuple, and the activity level of the neuron is the activation level at that position of the tuple. Humans perceive the world mostly through sensory neurons, so each sensory neuron gets a position in the perception tuple and the activity level is analogous to the behaviors. Human thought is also affected by various substances in the blood, so we would want to give those positions in the perception tuple too; for example, the blood alcohol level will need to be a perception in this scheme.

When data flows across the dotted line, there is implicit tupling. Specifically, the output from compute-perception is now a tuple, where the index into the tuple is an agent id and the value is the perceptual tuple of that agent (for non-AI agents) or the perception of the AI which, as before, has whatever representation was used in the training data. Similarly, the input to Nonmind-Physics consists of a tuple of the Nonmind-State and the representations of the behaviors of all of the agents.

The repetition represented by "Once per human and timestep" is only a repetition of data. We don't repeat the Mind-Physics or Compute-Behavior programs for each human and timestep because many of the humans will have similar responses. If we repeated the code for each human, the AI would have to learn separately for each human that they do binocular vision, that they act drunk if their blood alcohol level is too high, and that they generally want to stay alive for the immediate future. Having one model that is shared between humans is essential to allow the AI to understand the motives of humans who it hasn't encountered before.

To cause the AI to interpret the channels the way we intend, we must give enough training information for the only reasonable extrapolation to be the interpretation we want. The training information will consist of (channel number, agent id, activity level) triples that are observed at a specific timestep; this is represented by the Constraint class in mind_body.py. To make this concrete, let's suppose the AI's main input is video. To construct the training data, we'd choose one or more periods of time in the past in which some of the agents are visible to the AI's video cameras. We'd pick some muscle groups that are easy to observe, assign each one to a channel, and estimate the contraction force of those muscles for those agents during the time period, making note of the muscle group, the agent id, the estimated activity level, and the time of the frame of video in which we observed all this. We'd do the analogous thing for perceptions.
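For concreteness, the training triples might look like the following sketch; the real class is Constraint in mind_body.py, and the field names and channel numbers here are assumptions made for illustration.

from collections import namedtuple

# Illustrative shape of the training data described above.
Constraint = namedtuple("Constraint",
                        ["timestep", "agent_id", "channel", "activity_level"])

training_data = [
    # At video frame 40, agent 2's right-biceps behavior channel (say,
    # channel 7) was contracting with estimated activity level 3.
    Constraint(timestep=40, agent_id=2, channel=7, activity_level=3),
    # At the same frame, agent 2's blood-alcohol perception channel
    # (say, channel 19) read an activity level of 0.
    Constraint(timestep=40, agent_id=2, channel=19, activity_level=0),
]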

We'll know we have enough training data when the AI correctly extrapolates it. After providing training data for some frames of video, we could show it other video and see if the inferred activity levels were plausible. The behavior is also an input to Nonmind-Physics, so we could pick some hypothetical AI behavior and human behavior, feed them into Nonmind-Physics, and see if the simulation gives us plausible video output. We could also have fun, producing videos that answer questions like "How would that lecture have gone if the professor were drunk?"

In summary, at this point we have an AI that can explain the world in terms of the actions and perceptions of itself and the other agents it cares about. The inputs are:

The output is a collection of explanations that allow us to compute a predicted perception for the AI in the current timestep. Each explanation has:

Each of these procedures is a Speed Prior explanation.

Beliefs and Utility From Behavior and Perception

Since the AI can now estimate what people are doing and perceiving, it can try to figure out what they want by interpreting their behavior as optimizing a utility function. We implement this by adding two more handfuls of nodes to the diagram along with corresponding code in infer_utility.py:

infer-utility.png

Readers who have skipped ahead to this point will probably want to look at the key that says what all of the different types of boxes and arrows mean.

The idea here is to find some interpretation of the world in which the humans choose their behavior to maximize some utility function. The way of doing this described here is the result of three design constraints, which I'll discuss in turn:

Using the AI's World Representation

This utility function is the Compute-Utility node in the graph. In order to cover the Deception use case correctly, the input to Compute-Utility must be a state of the world that has the same structure as the AI's understanding of the states of the world. This is accomplished by using Nonmind-Physics to generate inputs to Compute-Utility. This is the same Nonmind-Physics at the top center of the diagram that the AI uses to represent its estimate of how the world works outside of the AI.

Understanding False Beliefs

Let's pick a particular human. Call her Alice. Her agent id will be the value of A in the diagram for the time being. Alice may have false beliefs about the world, and we have to understand her actions as a reaction to her actual beliefs, not as a reaction to the AI's beliefs. We represent this by having an arbitrary function Beliefs that converts Alice's mind-state into a configuration of the universe that the AI could understand; this configuration is Believed-Nonmind-State in the diagram. Alice also has beliefs about what the other agents, including the AI, are going to do; this is Believed-Non-A-Behavior. The believed state of the universe and the believed actions of the other agents are combined with whatever Alice plans to do and the AI uses Nonmind-Physics to estimate the consequences, which are New-Nonmind-State.

Understanding Humans Reasoning About Uncertainty

People often take action to learn more about the world, and they often hedge their bets when they can't predict the consequences of their actions. Thus, in order to understand people's preferences, we have to understand how people reason about uncertainty. The AI represents uncertainty by assuming that each person believes multiple possible things could happen. This is done by letting the Beliefs node have a Possibility input, which is simply a nonnegative integer. The number of possibilities to assume is an input to the AI. (Ideally the AI would guess this; see the section on guessed vs. fixed parameters.) The AI reasons through all of each person's possible beliefs and comes up with a Utility for each.

If Alice thinks she knows what is true and she is not reasoning about uncertainty, the simplest explanation of her behavior will have the same beliefs for all possibilities. The AI will nevertheless redundantly compute the same number of possibilities. If this were a real implementation rather than a decision procedure, this would be an important efficiency loss to fix.

The AI assumes all possibilities are equally probable. To represent situations where Alice thinks some possibilities are more likely than others, the AI can have several possibilities that are identical.

Expected Utility Maximization

After all this, the AI has a hypothesis with the information in the box with the dotted lines at the lower right. Let's take the human as fixed (call her Alice) and the particular belief as fixed. The area at the lower right takes a possibility and a hypothetical action for Alice as inputs, and it produces a utility for output. The Expected-Utility-Maximizer is where the AI does the obvious algorithm to compute which of Alice's possible actions yields the greatest expected utility, when averaged over all of the values of Possibility. The action that yields the greatest utility goes into Optimal-Behavior. If Optimal-Behavior doesn't match Alice's actual behavior, this explanation is rejected and the AI continues searching for other explanations.
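A sketch of that consistency check follows; the signature is hypothetical, and infer_utility.py organizes the same computation around the diagram's nodes rather than a single function.

def explains_behavior(candidate_actions, possibilities, utility_of, observed_action):
    # utility_of(action, possibility) is the utility this hypothesis assigns
    # to the outcome of taking `action` under one of Alice's believed
    # possibilities.  All possibilities are weighted equally, as above.
    def expected_utility(action):
        return sum(utility_of(action, p) for p in possibilities) / len(possibilities)
    optimal = max(candidate_actions, key=expected_utility)
    # The hypothesis survives only if the behavior it calls optimal is
    # the behavior that was actually observed.
    return optimal == observed_action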

The valuable output from all of this is the Compute-Utility function. This takes an agent id and a nonmind-state and returns the AI's estimate of that human's utility for that world state. Keep in mind that all of this happens when checking one explanation; the AI will in general have multiple explanations for the world, and they may have different Compute-Utility functions. Thus the AI can accurately model situations where it doesn't have a clear idea of what the humans want.

Combining Estimated Utilities

Once the AI can estimate a utility function for each agent, it can use simple arithmetic to combine these to get the AI's utility function. This is in utility_combiners.py; it is the obvious implementation of respect and compassion with arithmetic as described above.

Planning

Given a utility function, it is easy to specify how to plan to optimize the expected value of that utility function at a specific time horizon. A plan says what the AI will do in each situation it may encounter, so we represent it as a function that takes the AI's perception at one timestep as input and returns the AI's action at that timestep and the AI's plan for what to do for the next timestep. It is possible to enumerate all possible plans and to compute the one with the greatest expected utility. This is implemented in planner.py.
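As a minimal sketch of that representation of a plan (illustrative only; the actual enumeration lives in planner.py):

def constant_plan(action):
    # A plan, in the representation described above: a function from the
    # AI's current perception to (action for this timestep, plan for the
    # next timestep).  This one ignores its perception and repeats itself.
    def plan(perception):
        return action, plan
    return plan

do_nothing_forever = constant_plan("do-nothing")
action, next_plan = do_nothing_forever("any perception")
assert action == "do-nothing" and next_plan is do_nothing_forever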

There are often so many possible plans that enumerating them took too long even for toy unit tests. Thus we also support having planners that only return a few possible plans, rather than all of them. This was required to make the tests in grocery_test.py and baby_catch_test.py runnable.

Testability

The consequences of running a buggy superhuman AI could be similar to the consequences of running a superhuman AI with a poorly specified goal, as described above. Therefore, if some variant of this algorithm is actually used, it would be important to test it thoroughly. Fortunately, there are several approaches to validating this algorithm.

Even without the provision of impossible amounts of computing resources, it is unit-testable, and in fact it has been thoroughly unit tested. The coverage analyzer indicates 100% coverage.

If we assume that some combination of algorithmic improvements and hardware advances make it runnable, we can make use of the inferred compute-perception function to generate the AI's perceptions for any particular world-state. If the AI took video as input, then this would be video for output. One could watch this video evolve from one simulated future timestep to the next to check that the inferred laws of physics are plausible. If the AI is able to control the position of the camera, then commands to do that could be fed into the inferred laws of physics as behavior on the part of the AI, and if the inferred physics are plausible this should influence the field of view for the AI's simulated future.

One could convert the AI's believed-nonmind-states of the other agents into still frames of video to check those for plausibility. The AI's nonmind-physics model could be run from any of these states to get the AI's estimate of the consequences of other agents' beliefs, and those in turn could be converted to video and checked.

The algorithm described here does not require any connection between the AI's actual behavior and the AI's intended behavior. Thus the AI could be operated in a purely passive mode where some human drives the AI's actual behavior (or the AI has no possible behavior), the AI observes the consequences, and then its utilities are inferred and checked. For example, a good test would be to download a video from the web of some violent crime and verify that the AI's inferred utility increase for the perpetrator is less than the AI's inferred utility decrease for the victim. This is important because we would want the AI to prevent crimes rather than perpetrate them in the case where it has been told to be equally compassionate and respectful to everybody.
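Such a passive check might look like this sketch, where `utility_change` is a hypothetical helper giving the inferred utility change for one agent between the pre-crime and post-crime world states; nothing here is part of the released test suite.

```python
# Illustrative passive-mode check of the inferred utilities.

def crime_is_a_net_loss(utility_change, perpetrator_id, victim_id):
    gain_to_perpetrator = utility_change(perpetrator_id)
    loss_to_victim = -utility_change(victim_id)
    # With equal compassion for everybody, the AI should see the crime as a
    # net loss, so it would prevent such crimes rather than commit them.
    return gain_to_perpetrator < loss_to_victim
```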

None of these tests are perfect. The ideal test would tell us conclusively whether this algorithm correctly infers human desires from human behavior, disregarding scenarios with the total probability given by the error bound. Unfortunately, we have no other specification of human desire to test against, so it seems difficult to imagine a robust correctness test for this.

This paper covers inferring a utility from observations of the world. The problem of doing computationally tractable planning, given a model of the world and a utility function, is not covered here. As a consequence, testing it is not covered here either.

Bugs

This section lists known bugs in the algorithm.

The Novice Philosopher Problem

Sometimes people have debates on the internet about whether it is possible, in principle, to duplicate an individual. The following reasoning is invalid but sometimes presented:

Several different concepts can be substituted for "unique identities": "indwelling essences", "souls", "transcendent egos", etc. The general fallacy is to treat one of these high-level concepts as a constraint on what can physically happen. The error here is that the high-level understanding is a luxury that doesn't constrain low-level reality. The set of things that can happen is determined by the simplest rules that govern the universe. If I add some other concepts for my own purposes, those concepts don't affect the low-level reality. The concept of unique identities for humans, for example, is something we add to the world so we can interact socially with people. Duplicating people would create some confusion when interacting with the copies socially, but that confusion doesn't stop the duplication from happening.

Unfortunately the algorithm presently described in this paper has this bug. We prefer simpler explanations, and the complexity of explaining human voluntary actions, and of explaining those actions as goal-directed, is counted as part of the complexity of the whole explanation. This creates two types of bias:

This seems fixable. Recall the distinction between "probability" as defined here (where the probabilities do not necessarily add up to 1) and "probability*" (the usual definition of the word, where probabilities do add up to 1). Suppose we have an explanation EP of the physical world with complexity P, and an explanation EV of voluntary action with complexity V, where EV builds on EP. The present algorithm gives the total explanation probability 2^(-P) * 2^(-V). This bug would be fixed by replacing the 2^(-V) term, which represents the a-priori probability of EV, with the estimated a-priori probability* of EV, so that the total probability of EP is unaffected by how hard it is to explain with voluntary behavior. We would also need an analogous fix so that the difficulty of assigning a utility function does not make the voluntary actions less probable. The hardest part is inventing notation that can express this precisely.
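Under the assumption that the voluntary-action explanations building on a fixed EP can be enumerated, the fix amounts to a normalization, roughly:

```python
# Sketch of the proposed fix (illustrative only). V is the complexity of one
# voluntary-action explanation EV, and all_Vs holds the complexities of every
# voluntary-action explanation that builds on the same physical explanation EP.

def current_probability(P, V):
    # Present algorithm: 2^-P * 2^-V, so a physical explanation is penalized
    # when its voluntary behavior is hard to explain.
    return 2.0 ** -P * 2.0 ** -V

def fixed_probability(P, V, all_Vs):
    # Proposed fix: replace 2^-V with the a-priori probability* of EV, i.e.
    # normalize over the alternatives, so the total probability of EP is
    # unaffected by how hard its voluntary behavior is to explain.
    prior_star = (2.0 ** -V) / sum(2.0 ** -v for v in all_Vs)
    return 2.0 ** -P * prior_star
```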

Missing Details

This section lists problems that seem easy to solve, but the solutions are not yet implemented and unit tested in the Python code.

Asking Questions

This paper does not yet describe a way to ask the AI questions and have it regard the questions as anything more significant than another stimulus to which it can respond. It seems feasible to write an algorithm to do this with the same impractical computational demands as the algorithms stated above. It would take the following parameters:

The algorithm would enumerate the continuations of the world-explanations that also map the training questions to the corresponding training answers, and read off each continuation's answer to the desired question. Each continuation has a probability, so the candidate answers can be ranked by decreasing probability, as sketched below.
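A sketch of that enumeration, assuming the continuations can be produced as (probability, answer-function) pairs; nothing like this exists in the released code.

```python
from collections import defaultdict

# Hypothetical question-answerer sketch. `continuations` yields
# (probability, answer) pairs, where `answer` maps a question to an answer.

def rank_answers(continuations, training_qa, question):
    """Return candidate answers to `question`, most probable first."""
    weight = defaultdict(float)
    for probability, answer in continuations:
        # Keep only continuations consistent with the training Q&A pairs.
        if all(answer(q) == a for q, a in training_qa):
            weight[answer(question)] += probability
    return sorted(weight, key=weight.get, reverse=True)
```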

This sort of question-answerer might be a useful testing tool.

Fixed versus Guessed Parameters

The present algorithm has to be given a range of values to use for the utility. This is weak because the programmer has no rational basis for specifying that range. If the range is too small to let the AI do a good job of explaining human behavior, it's difficult to predict what the AI would do. It would be better for the AI to guess the proper range, just as it guesses explanations for everything else in the world.

If we do this, the number of possible utility values needs to contribute enough to the complexity of the explanation that we avoid the Pascal's Wager trap.

The present decision procedure also has to be given the number of possibilities that the modeled humans will consider when they are reasoning about uncertainty. This should be guessed as well.

Open Questions

I don't presently know how to solve the issues discussed in this section.

Time Horizons

Let's reconsider the example about going to the grocery store to buy an apple, but this time assume the AI has been given a planning horizon of a hundred years so it can figure out what to do about global warming. There seems to be little guarantee that the AI is going to pay the grocer for the apple any time soon.

Suppose we work around this by giving the AI fairly short time horizons, say one hour. The AI then plans to put the world in what it considers to be a good state every hour on the hour, and unless we turn off the AI we let it start the next cycle. In this case the AI won't make a purchase where it takes the apple at 9:59 and pays for it at 10:01, since the payment is beyond the end of the planning cycle.

Humans seem to have a very fluid notion of what planning horizon to use, and I don't see a principled way to incorporate that into the AI.

Quantity of Training Data

It is unclear how to know when the AI has been given enough training data. Thorough testing would be required.

It would be possible to magnify training data, as sketched below. Suppose we hand-generate a few perception and behavior constraints and run the part of the AI that infers perception and behavior without inferring utility. If the AI comes up with good explanations, we could take the most likely explanation and extract from it all of the inferred perceptions and behaviors. If these passed a manual review, they could be used as training data for the full-blown AI that infers utility.
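A sketch of that pipeline, with all of the helpers passed in as hypothetical parameters; none of this is implemented in the released code.

```python
# Hypothetical training-data magnification pipeline; all three callables are
# assumed helpers, and explanations are (probability, explanation) pairs.

def magnify_training_data(seed_constraints, observations,
                          infer_without_utility,
                          extract_perceptions_and_behaviors,
                          manual_review):
    explanations = infer_without_utility(seed_constraints, observations)
    _, best = max(explanations, key=lambda pe: pe[0])   # most probable one
    candidates = extract_perceptions_and_behaviors(best)
    # Only inferences that pass a manual review become new training data.
    return [c for c in candidates if manual_review(c)]
```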

This might avoid a situation where the AI, observing humans who are difficult to understand, gives up on understanding them: it explains the training data as special cases that happened in the past and will not recur, and predicts that in the future the humans will have constant perception and behavior with no connection to the AI's understanding of the world. That would lead to useless inferences about the humans' utility functions and thus to unpredictable behavior from the AI. By increasing the quantity of training data, we increase the complexity of this useless explanation and thus decrease the probability of it becoming the dominant one.

How Many People?

The present decision procedure is told how many people there are. A more realistic procedure would estimate this along with all of the other parameters of the real world. This presents a few problems.

First, this gives the AI an incentive to control births. If the AI is serving some existing group of people, any change to that group is effectively a change of goals for the AI, and such a change would impair its ability to pursue its present goal of serving that group.

It also gives the humans a perverse incentive to have abnormal births as a means to manipulate the AI. For example, if the AI has a respect coefficient of 1 for all humans, including humans born after the AI starts operating, and someone manages to construct enough humans who feel intense pain whenever they perceive the AI to be taking any action, the AI might choose never to take any action again out of respect for these constructed humans.

It seems that respect and compassion should be assigned once, and after that you can only increase yours by getting someone else to consent to decrease theirs. This suggests using the AI's compassion and respect as a currency. Realistically, a likely long-term consequence of this would be the majority of people eventually trading away their endowment of compassion and respect due to incompetence, thus leading back to the winner-take-all scenarios that seem to be an attraction point for our current society. Perhaps we would allow transfer of respect and compassion from parent to child only? I do not know the best solution.

We need to have a maximum number of people the AI cares about; otherwise we're vulnerable to a different version of Pascal's Wager called Pascal's Mugging. In this scenario, the AI is manipulated by getting it to imagine an absurdly large number of people, each with a reasonable individual utility. The total utility is absurdly large, and the numbers can be contrived so that the total utility is large compared to the improbability of the scenario, leaving us with a large expected utility that could dominate the rest of the computation. To avoid this, any particular run of the AI would need a plausible maximum number of people it considers. If there's a desire for more people, it will be up to the existing people to revise the AI to have a larger limit for the next run. The AI might help them do this, since it has no agenda for events past its planning horizon beyond people liking what they reasonably expect to happen at that time.
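A toy calculation with made-up numbers shows how the trap works when nothing caps the number of people:

```python
# Illustrative numbers only: a very improbable story about absurdly many
# people still dominates the expected-utility calculation if nothing caps
# the number of people the AI cares about.

probability_of_scenario = 2.0 ** -100   # tiny prior for the mugger's story
claimed_people = 2.0 ** 150             # absurdly many imagined people
utility_per_person = 1.0                # each with a modest utility

expected_contribution = probability_of_scenario * claimed_people * utility_per_person
print(expected_contribution)            # 2**50, enough to swamp everything else
```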

Murdering the Creator

Suppose the AI gives everybody a compassion coefficient of 1 and a respect coefficient of 1. Make these assumptions:

In this case, the AI will stop the abuse, and we'll have 1 million abusive men who are angry at the AI and at the most prominent person in the organization that created the AI. The AI's next step could be to do nothing, or to murder the CEO of the company that created it. Assume the utility to the CEO of the CEO's own life is X. From our assumptions above, each of the abusers would gain X/10 utility if the AI killed the CEO. There are a million frustrated abusers, so they would collectively gain 100,000*X while the CEO loses X, for a net gain of 99,999*X from killing the CEO. If we assume nobody else cares, the AI would kill the CEO.
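Spelling out the arithmetic (the numbers come from the assumptions above; the value of X is arbitrary):

```python
X = 10.0                                 # utility of the CEO's life to the CEO
                                         # (any positive value; 10.0 keeps the
                                         # floating-point arithmetic exact)
abusers = 1_000_000
gain_to_abusers = abusers * (X / 10)     # 100,000 * X
loss_to_ceo = X
net_gain_from_killing_ceo = gain_to_abusers - loss_to_ceo
assert net_gain_from_killing_ceo == 99_999 * X
```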

The assumption that nobody else cares seems questionable. Presumably the AI had compassion for a large number of people, and it helped them, and those people might have some gratitude toward the CEO. However, gratitude seems to me to be a weak human motivation, so let's try leaving it out and seeing where we go.

From an impersonal point of view, it is arguably a net win to stop a million instances of abuse and kill one person. Nevertheless, the CEO is unlikely to cooperate with the plan. Therefore any political process that actually results in the AI being created will have to give the creators of the AI more respect than less prominent people. The actual amount of respect that must be added depends on:

If there is enough gratitude, then this problem goes away.

Defending the Creator from Individuals

There is also the possibility that the abusive individuals in the previous section will actually take action on their own to murder the CEO rather than to try to get the AI to do it for them. Make these additional assumptions:

The AI can either defend the CEO, or do nothing. Disrespect of the angry men contributes -1000 * (X/10) to the utility of defending the CEO, and compassion for the CEO contributes Y*X, for a total of (Y-100)*X. If we assume that one of the 1000 homicidal men would be successful, the utility of doing nothing is 0 because the CEO would die. Thus the defense happens if Y > 100. Once again, gratitude from the former abuse victims might make this special treatment of the CEO unnecessary.
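The same arithmetic in code form, with Y as the CEO's compassion coefficient (the values used below are illustrative):

```python
def utility_of_defending(X, Y, homicidal_men=1000):
    disrespect = -homicidal_men * (X / 10)   # -100 * X
    compassion_for_ceo = Y * X               # Y * X
    return disrespect + compassion_for_ceo   # (Y - 100) * X

# Doing nothing has utility 0 (the CEO dies), so the AI defends iff Y > 100.
assert utility_of_defending(X=1.0, Y=101) > 0
assert utility_of_defending(X=1.0, Y=99) < 0
```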

The Homicidal Creator

If in reaction to the previous scenario we really do give the CEO a compassion coefficient of 100, then we have new problems. The CEO can ask the AI to kill a few individuals who have the default respect coefficient of 1, and the AI will do it. As before, one might make a utilitarian judgement that creating a world-wide compassionate AI is enough of a benefit that it's still a net win even if we enable a few murders. Maybe the CEO isn't homicidal and it's not a problem. However, on the whole, it seems best to hope that gratitude brings things into balance and to give the CEO no special treatment.

Defense Against Proactive Coercively Organized Groups

The previous scenarios assume there are some violent individuals who react to the AI's activities. There are also groups of humans who have power and are willing and able to be proactively violent to keep that power. It is not clear how to deal with them.

Perhaps it would help that some of these violent groups are, in theory at least, controlled by representative democracies.

The Programming Language Matters for the Speed Prior

Kolmogorov complexity is concerned only with the outputs of programs that halt; it does not take run time into account. As a consequence, the language in which the programs are written does not matter much, because one programming language can generally be interpreted by another. Specifically, if you want to write programs in language A but your Kolmogorov complexity is defined in terms of language B, you can replace your program in A with an interpreter for A written in B, followed by your program in A. Since the interpreter has a fixed size, the added complexity is a fixed number of bits, and the change in a-priori probability is therefore at most a constant factor.
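In standard notation (not notation used elsewhere in this paper), the argument is:

```latex
% K_A and K_B are Kolmogorov complexities relative to languages A and B, and
% c_{A -> B} is the length of an interpreter for A written in B.
\[
  K_B(x) \;\le\; K_A(x) + c_{A \to B}
  \qquad\Longrightarrow\qquad
  2^{-K_B(x)} \;\ge\; 2^{-c_{A \to B}} \cdot 2^{-K_A(x)}
\]
% So the a-priori probability changes by at most the constant factor
% 2^{-c_{A -> B}}, independent of x.
```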

If, instead, we are using the Speed Prior, there is a run time bound, so this stops working. Many algorithms that are linear time in a more reasonable language are quadratic or worse when run on a Turing machine. Thus it is possible that the behavior of an AI that uses the Speed Prior with knowledge represented as Turing machines would be qualitatively different from the behavior if some other more reasonable language were used. Fortunately, this is only a problem for the theory. Practical languages are going to be concise and efficient when compared to Turing machines, so they will be better for representing knowledge.

Searching for Explanations

Enumerating explanations by brute force is impractical, and it is unclear how to do better. Initial ideas include automated debugging of existing explanations or some sort of genetic algorithm, but those approaches introduce an unknown bias into which explanations get found. I don't know how important this sort of bias would be in practice.

Modifying Agents

The scheme described above only copes with agents that have a fixed number of behavioral or perceptual channels. We've assumed the agents are humans. Extending to animals is easy, but if the agents can self-modify enough to radically change their behavioral or perceptual possibilities, it is unclear how to extrapolate the training data and distinguish the agent's mind-states from the rest of the world.

It is unclear how the AI should deal with agents that can copy themselves. Depending on the order in which various technologies become available, this may be the normal case by the time a compassionate and respectful AI is implemented.

Conclusion

It appears to be possible to write a relatively simple decision procedure that infers what people want from observations of their behavior. The procedure is about 1000 lines of Python code, excluding comments. Motivations that can plausibly be called "compassion" and "respect" can be implemented in terms of this; the amount of compassion and respect to have for each person is up to the implementor. A planner that has been given this motivation along with unlimited computing resources would apparently avoid many of the potential pitfalls that can result from having entities in the world that are more intelligent than their creators. In particular, the planner would have no obvious incentive to please its creators by deceiving them.

Acknowledgements

Thanks to:

This work was not funded by any institution.

Release Notes

License

Copyright (c) 2009 Tim Freeman

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

(This is the standard MIT License, copied from http://www.opensource.org/licenses/mit-license.php on 24 Apr 2007.)
