Quantum Computing Since Democritus

PHYS771 Quantum Computing Since Democritus

University of Waterloo, Fall 2006
Tuesdays and Thursdays, 1:00-2:30pm
BFG Building, 2nd floor seminar room (BFG2125)

Instructor: Scott Aaronson
3141 Davis Centre
Email: scott at scottaaronson dot com
Office hours: After class or by appointment

Description: This course tries to connect quantum computing to the wider intellectual world. We'll start out with various scientific, mathematical, or philosophical problems that predate quantum computing: for example, the measurement problem, P versus NP, the existence of secure cryptography, the Humean problem of induction, or the possibility of closed timelike curves. We'll then examine in what ways, if any, quantum computing affects how we should think about the problem. To keep things grounded, each session will end with a concrete puzzle that students will be expected to have thought about (if not solved) by the next session. The class format will strongly encourage participation, discussion, and debate.

Prerequisites: Mathematical maturity and some previous exposure to quantum computing.

Responsibilities: The main one is to scribe one or two sessions; we might experiment with using audio recordings to help with this. Besides that, students are expected to

show up,
actively participate in discussion,
work on the puzzles,
do the readings (which will generally be light), and
turn in one or two problem sets (having either solved the problems or else explained why they couldn't solve them).

Suggested Readings

Democritus (from Stanford Encyclopedia of Philosophy)
Alan Turing, On Computable Numbers
Alan Turing, Computing Machinery and Intelligence
Roger Penrose, The Emperor's New Mind (our "textbook")
Sanjeev Arora and Boaz Barak, Complexity Theory: A Modern Approach (free draft available on the web)
Lucien Hardy, Quantum theory from five simple axioms

Lecture 1: Atoms and the Void

Alright. So why Democritus? First of all, who was Democritus? He was this ancient Greek dude. He was born around 450BC in Abdera, which was sort of this podunk town. where people from Athens said that even the air causes stupidity. He was a disciple of Leucippus, according to my source, which is Wikipedia. He's called a "pre-Socratic," even though actually he was a contemporary of Socrates. That gives you a sense of how important he's considered: "Yeah, the pre-Socratics -- maybe stick 'em in somewhere in the first week of class." (Incidentally, there's a story that Democritus journeyed to Athens to meet Socrates, but then was too shy to introduce himself.)

Almost none of Democritus's writings survive. (Some of them apparently survived into the Middle Ages, but they're lost now.) What we know about him is mostly due to the fact that other philosophers, like Aristotle, brought him up in order to criticize him.

So, what were the ideas they criticized? Democritus thought the whole universe is composed of atoms in a void, constantly moving around according to determinate, understandable laws. These atoms can hit each other and bounce off, and they can stick together to make bigger things. They can have different sizes, weights, and shapes -- maybe some are spheres, some are cylinders, whatever. On the other hand, Democritus says that properties like color and taste are not intrinsic to atoms, but instead emerge out of the interactions of many atoms. For if the atoms that made up the ocean were "intrinsically blue," then how could they form the white froth on waves?

Remember, this is 400BC. So far we're batting pretty well.

Why does Democritus think there are these atoms surrounded by void? He gives a few arguments, one of which can be paraphrased as follows (following Carl Sagan). Suppose we have an apple, and suppose the apple's not made of atoms but is instead this continuous, hard stuff. And suppose we take a knife and cut the apple into two pieces. It's clear that the points on one side go into the first piece and the points on the other side go into the second piece, but what about the points exactly on the boundary? Do they "disappear"? Do they get duplicated? Does the symmetry get broken? None of these possibilities seem particularly elegant.

Incidentally, some of you might know that there's a debate raging even today between atomists and anti-atomists. This time, the headquarters of the atomist side aren't in Abdera; they're a mile down the train tracks from here, in a certain sleek black building. At issue in this debate is whether space and time themselves are made up of indivisible atoms, at the Planck scale of 10^-33 centimeters or 10^-43 seconds. Ironically, the physicists have almost no experimental evidence to go on, and are basically in the same situation that Democritus was in 2400 years ago. If you want an ignorant, uninformed layperson's opinion, my money is on the atomist side. And the arguments I'd use are not entirely different from the ones Democritus used: mostly they hinge on inherent mathematical difficulties with the continuum.

One passage of Democritus that does survive is a dialogue between the intellect and the senses. The intellect starts out, saying: "By convention there is sweetness, by convention bitterness, by convention color, in reality only atoms and the void." In my book, this one line already puts Democritus shoulder-to-shoulder with Plato, Aristotle, or any other ancient philosopher you care to name. But the dialogue doesn't stop there. The senses respond, saying: "Foolish intellect! Do you seek to overthrow us, while it is from us that you take your evidence?"

I first came across this dialogue in a book by Schrödinger. Ah, Schrödinger! -- you see we're inching toward the "quantum computing" in the course title. We're gonna get there, don't worry about that.

But why would Schrödinger be interested in this dialogue? Well, Schrödinger was interested in a lot of things. He was not an intellectual monogamist (or really any kind of monogamist). But one reason he might've been interested is a certain equation he was involved with, which you've probably heard about:

i dψ/dt = Hψ

(Did I get it right, Ray?)

Actually, let me write it in a more correct form:

|ψ_t+1⟩ = U |ψ_t⟩

What is this equation? Well, maybe you have to add a few details to it -- like the physics -- but once you do, it describes the evolution of a quantum pure state. For any isolated region of the universe that you want to consider, this equation describes the evolution in time of the state of that region, which we represent as a normalized linear combination -- a superposition -- of all the possible configurations of elementary particles in that region. So you can think of this equation as the sophisticated, modern version of Democritus's "atoms and the void." And as we all know, it does pretty well at the atoms and the void part.

The part where it maybe doesn't do so well is the "from us you take your evidence" part. Where's the "us"? Remember, the equation describes a superposition over all possible configurations of particles. So, I don't know -- are you in superposition? I don't feel like I am!

Incidentally, one thing I'm not going to do in this class is try to sell you on some favorite interpretation of quantum mechanics. You're free to believe any interpretation your conscience dictates. (What's my own view? Well, I agree with every interpretation to the extent it says there's a problem, and disagree with every interpretation to the extent it claims to have solved the problem!)

Anyway, just like we can classify religions as monotheistic and polytheistic, we can classify interpretations of quantum mechanics by where they come down on the "putting-yourself-in-coherent-superposition" issue. On the one side, we've got the interpretations that enthusiastically sweep the issue under the rug: Copenhagen and its Bayesian and epistemic grandchildren. In these interpretations, you've got your quantum system, you've got your measuring device, and there's a line between them. Sure, the line can shift from one experiment to the next, but for any given experiment, it's gotta be somewhere. In principle, you can even imagine putting other people on the quantum side, but you yourself are always on the classical side. Why? Because a quantum state is just a representation of your knowledge -- and you, by definition, are a classical being.

But what if you want to apply quantum mechanics to the whole universe, including yourself? The answer, in the epistemic-type interpretations, is simply that you don't ask that sort of question! Incidentally, that was Bohr's all-time favorite philosophical move, his WWF piledrive: "You're not allowed to ask such a question!"

On the other side we've got the interpretations that do try in different ways to make sense of putting yourself in superposition: many-worlds, Bohmian mechanics, etc.

Now, to hardheaded problem-solvers like ourselves, this might seem like a big dispute over words -- why bother? I actually agree with that: if it were just a dispute over words, then we shouldn't bother! But as David Deutsch pointed out in the late 1970's, we can conceive of experiments that would differentiate the first type of interpretation from the second type. The simplest experiment would just be to put yourself in coherent superposition and see what happens! Or if that's too dangerous, put someone else in coherent superposition. The point being that, if human beings were regularly being put into superposition, then the whole business of drawing a line between "classical observers" and the rest of the universe would become untenable.

But alright -- human brains are wet, goopy, sloppy things, and maybe we won't be able to maintain them in coherent superposition for 500 million years. So what's the next best thing? Well, we could try to put a computer in superposition. The more sophisticated the computer was -- the more it resembled something like a brain, like ourselves -- the further up we would have pushed the 'line' between quantum and classical. You can see how it's only a miniscule step from here to quantum computing.

I'd like to draw a more general lesson here. What's the point of talking about philosophical questions? Because we're going to be doing a fair bit of it in this class -- I mean, of philosophical bullshitting. Well, there's a standard answer, and that's that philosophy is an intellectual clean-up job -- the janitors who come in after the scientists have made a mess, to try and pick up the pieces. So on this view, philosophers should sit in their armchairs waiting for something surprising to happen in science -- like quantum mechanics, like the Bell inequality, like Gödel's Theorem -- and then swoop in like vultures and say, ah, this is what it really meant.

Well, on its face, that seems sort of boring. But as you get more accustomed to this sort of work, I think what you'll find is ... it's still boring!

Like most of you, I'm interested in results -- in finding solutions to nontrivial, well-defined open problems. So, what's the role of philosophy in that? I want to suggest a more exalted role than intellectual janitor: philosophy can be a scout. It can be an explorer -- mapping out intellectual terrain for science to later move in on, and build condominiums on or whatever. Not every branch of science was "scouted out ahead of time" by philosophy, but some of them were. And in recent history, I think quantum computing is really the poster child here. It's fine to tell people to "Shut up and calculate," but the question is, what should they calculate? At least here at IQC, the sorts of things we like to calculate -- capacities of quantum channels, error probabilities of quantum algorithms, etc. -- are things people would never have thought to calculate if not for philosophy.

Alright, I promised you a puzzle. Earlier I mentioned inherent mathematical difficulties with the continuum, so I've got a puzzle somewhat related to that. If it's too easy, let me know and I'll give you a harder one.

You know the real line, right? Suppose we want a union of open intervals that covers every rational point. Question: does the sum of the lengths of the intervals have to be infinite? One would certainly think so! After all, there are rational numbers pretty much everywhere!

[Richard Cleve immediately solves the puzzle.]

Alright, I guess that was too easy.

[Solution: Not only can the sum of the lengths of the intervals be finite, it can be arbitrarily close to zero! Simply enumerate the rational numbers, r₀, r₁, etc. Then put an interval of size ε/2ⁱ around r_i for every i.]

Here's a harder one: we have the unit square, [0,1]². Consider a function S, which maps every real number x∈[0,1] to a countable subset S(x) of [0,1]. Can we choose S so that, for every (x,y)∈[0,1]², either y∈S(x) or x∈S(y)?

Lecture 2: Sets

Thursday's class started out with a brief presentation by Rahul Jain about atomist ideas in Jainism (circa 500BC). It seems the Jain (the ancient ones, not Rahul) were barking up more or less the same tree as Democritus, but their ideas (like many of the pre-Socratics') were mixed with generous helpings of mysticism.

I mentioned another example of East-West convergence: apparently, several ancient cultures independently came up with the same proof that A = π r². It's obvious that the area of a circle should go like the radius²; the question is why the constant of proportionality (π) should be the same one that relates circumference to diameter. Proof by pizza: Cut a circle of radius r into thin pizza slices, and then "Sicilianize" (i.e. stack the slices into a rectangle of height r and length π r). QED.

One thing I forgot to share with you on Tuesday was a quote from Democritus:

"I would rather discover a single cause than become king of the Persians."

Something to keep in mind when you consider those job offers from Microsoft or Google...

Today we're gonna talk about sets. What will these sets contain? Other sets! Like a bunch of cardboard boxes that you open only to find more cardboard boxes, and so on all the way down.

You might ask, "how is this relevant to a class on quantum computing?" I can give three answers:

When I gave that puzzle on Tuesday (which by the way, we're going to "answer" today), some of you asked what "countable" means. OK, dude. Math is the foundation of all human thought, and set theory -- countable, uncountable, etc. -- that's the foundation of math. So even if this class was about Sanskrit literature, it should still probably start with set theory.
I have a hidden agenda: I'm told we have some physicists here, and I intend to browbeat you into thinking like mathematicians. I mean, what you do in the lab is your own business, but now you're in theorem country.
There actually is a tenuous connection between quantum computing and set theory, which I'll touch on in the next lecture. To give a sneak preview, the connection is that quantum mechanics applied to finite-dimensional systems (like qubits) seems like an interesting "intermediate" case between a continuous and a discrete theory. That is, it involves quantities (the amplitudes) that vary continuously, but that are not directly observable. In this way, it seems to avoid the "paradoxes" associated with the continuum in a way that other continuous physical theories do not. But what are those paradoxes? Well, welcome to my haunted house horror tour of the continuous and the transfinite...

So let's start with the empty set and see how far we get.

THE EMPTY SET.

Any questions so far?

Actually, before we talk about sets, we need a language for talking about sets. The language that Frege, Russell, and others developed is called first-order logic. It includes Boolean connectives (and, or, not), the equals sign, parentheses, variables, predicates, quantifiers ("there exists" and "for all") -- and that's about it. So for example, here are the Peano axioms for the nonnegative integers (where S(x) is the successor function, intuitively S(x)=x+1, and I'm assuming functions have already been defined):

Zero Exists: There exists a z such that for all x, S(x) is not equal to z.
Every Integer Has At Most One Predecessor: If S(x)=S(y) then x=y.

The nonnegative integers themselves are called a model for the axioms (though interestingly, they're not the only model).

Writing down these axioms seems like pointless hairsplitting -- and indeed, as someone pointed out in class, there's an obvious chicken-and-egg problem. How can we state axioms that will "put the integers on a more secure foundation," when the very symbols and so on that we're using to write down the axioms presuppose that we already know what the integers are? Well, precisely because of this point, I don't think that axioms and formal logic can be used to "place arithmetic on a more secure foundation" (whatever that would mean). But this stuff is still extremely interesting for at least three reasons:

The situation will change once we start talking not about integers, but about different sizes of infinity. There, writing down axioms and working out their consequences is pretty much all we have to go on!
Once we've formalized everything, we can then program a computer to reason for us:
- Premise 1: For all x, if A(x) is true then B(x) is true.
- Premise 2: There exists an x such that A(x) is true.
- Conclusion: There exists an x such that B(x) is true.
Well, you get the idea. The point is that deriving the conclusion from the premises is purely a syntactic operation -- one that doesn't require any understanding of what the statements mean.
Besides having a computer find proofs for us, we can also treat proofs themselves as mathematical objects, which opens the way to metamathematics.

Anyway, enough pussyfooting around. Let's see some axioms for set theory. (I'll state the axioms in English; converting them to first-order logic is left as an "exercise for the reader.")

Empty Set: There exists an empty set.
Extensionality: If two sets have the same members then they're equal.
Pairing: For all sets x,y there exists a set {x,y}.
Union: For all sets x, there exists a set equal to the union of all sets in x.
Existence of Infinite Sets: There exists a set x that contains the empty set and that contains y∪{y} for every y∈x.
Power Set: For all sets x there exists a set consisting of the subsets of x.
Replacement (for every function A): For all sets x, there exists a set {A(y) | y∈x}.
Foundation: All nonempty sets x have a member y such that for all z, either z∉x or z∉y. (This is a technical axiom, whose point is to rule out sets like {{{{...}}}}.)

These axioms -- called the Zermelo-Fraenkel axioms -- are the foundation for basically all of math. So I thought you should see them at least once in your life.

Alright, one of the most basic questions we can ask about a set is, how big is it? What's its size, its cardinality? You might say, just count how many elements it has. But what if there are infinitely many? Are there more integers than even integers? This brings us to Georg Cantor (1845-1918), and the first of his several enormous contributions to human knowledge. He says, two sets have the same cardinality if and only if their elements can be put in one-to-one correspondence. Period. And if, whenever you try to pair off the elements, one set always has elements left over, the set with the elements left over is the bigger set.

What possible cardinalities are there? Of course there are finite ones, and then there's the first infinite cardinality, the cardinality of the integers, which Cantor called ℵ₀ ("Aleph-Zero"). The rational numbers have the same cardinality ℵ₀, a fact that's also expressed by saying that the rational numbers are "countable" (i.e., can be placed in one-to-one correspondence with the integers).

What's the proof that the rational numbers are countable? You haven't seen it before? Oh, alright. First list all the rational numbers where the sum of the numerator and the denominator is 2. Then list all the rational numbers where the sum of the numerator and the denominator is 3. And so on. It's clear that every rational number will eventually appear in this list. Hence there's only a countable infinity of them. QED.

But Cantor's biggest contribution was to show that not every infinity is countable -- so for example, the infinity of real numbers is greater than the infinity of integers. More generally, just as there are infinitely many numbers, there are also infinitely many infinities.

You haven't seen the proof of that either? Alright, alright. Let's say you have an infinite set A. We'll show how to produce another infinite set, B, which is even bigger than A. This B will simply be the set of all subsets of A (which is guaranteed to exist by the Zermelo-Fraenkel axioms). How do we know B is bigger than A? Well, suppose we could pair off every element a∈A with an element f(a)∈B, in such a way that no elements of B were left over. Then we could define a new subset S⊆A, consisting of every a that's not contained in f(a). Notice that this S can't have been paired off with any a∈A -- since otherwise, a would be contained in f(a) if and only if it wasn't contained in f(a), contradiction. Therefore B is larger than A, and we've ended up with a bigger infinity than the one we started with.

This is certainly one of the four or five greatest proofs in all of math -- again, good to see at least once in your life.

Besides cardinal numbers, it's also useful to discuss ordinal numbers. Rather than defining these, it's easier to just illustrate them. We start with the natural numbers:

1, 2, 3, ...

Then we say, let's define something that's greater than every natural number:

What comes after omega?

ω+1, ω+2, ...

Now, what comes after all of these?

2ω

Alright, we get the idea:

3ω, 4ω, ...

Alright, we get the idea:

ω², ω³, ...

Alright, we get the idea:

ω^ω, , ...

We could go on for quite a while! Basically, for any set of ordinal numbers (finite or infinite), we stipulate that there's a first ordinal number that comes after everything in that set.

The set of ordinal numbers has the important property of being well-ordered, which means that every subset has a minimum element.

Now, here's something interesting. All of the ordinal numbers I've listed have a special property, which is that they have at most countably many predecessors (i.e., at most ℵ₀ of them). What if we consider the set of all ordinals with at most countably many predecessors? Well, that set also has a successor, call it α. But does α itself have ℵ₀ predecessors? Certainly not, since otherwise α wouldn't be the successor to the set; it would be in the set! The set of predecessors of α has the next possible cardinality, which is called ℵ₁.

What this sort of argument proves is that the set of cardinalities is itself well-ordered. After the infinity of the integers, there's a "next bigger infinity," and a "next bigger infinity after that," and so on. You never see an infinite decreasing sequence of infinities, as you do with the real numbers.

So, starting from ℵ₀ (the cardinality of the integers), we've seen two different ways to produce "bigger infinities than infinity." One of those ways yields the cardinality of sets of integers (or equivalently, the cardinality of real numbers), which we denote 2. The other way yields ℵ₁. Is 2 equal to ℵ₁? Or to put it another way: is there any infinity of intermediate size between the infinity of the integers and the infinity of the reals?

(Note: No sooner had I revealed that there were more reals than integers, than a student actually asked this. He claimed never to have heard of the question before; he thought he was just asking for a technical clarification.)

Well, the question of whether there are any "intermediate" infinities between the integers and the reals was David Hilbert's first problem in his famous 1900 address. It stood as one of the great math problems for over half a century, until it was finally "solved" (in a rather disappointing way, as you'll see).

Cantor himself believed there were no intermediate infinities, and called this conjecture the Continuum Hypothesis. Cantor was extremely frustrated with himself for not being able to prove it.

Besides the Continuum Hypothesis, there's another statement about these infinite sets that no one could prove or disprove from the Zermelo-Fraenkel axioms. This statement is the infamous Axiom of Choice. It says that, if you have a (possibly infinite) set of sets, then it's possible to form a new set by choosing one item from each set. Sound reasonable? Well, if you accept it, you also have to accept that there's a way to cut a solid sphere into a finite number of pieces, and then rearrange those pieces into another solid sphere a thousand times its size. (That's the "Banach-Tarski paradox." Admittedly, the "pieces" are a bit hard to cut out with a knife...)

Why does the Axiom of Choice have such dramatic consequences? Basically, because it asserts that certain sets exist, but without giving any rule for forming those sets. As Bertrand Russell put it: "To choose one sock from each of infinitely many pairs of socks requires the Axiom of Choice, but for shoes the Axiom is not needed." (What's the difference?)

The Axiom of Choice turns out to be equivalent to the statement that every set can be well-ordered: in other words, the elements of any set can be paired off with the ordinals 1, 2, ..., ω, ω+1, ... 2ω, 3ω, ... If you think (for example) about the set of real numbers, this seems far from obvious.

It's easy to see that well-ordering implies the Axiom of Choice: just well-order the whole infinity of socks, then choose the sock from each pair that comes first in the ordering.

Do you want to see the other direction: why the Axiom of Choice implies that every set can be well-ordered? Yes?

OK! We have a set A that we want to well-order. For every proper subset B⊂A, we'll use the Axiom of Choice to pick an element f(B)∈A\B. Now we can start well-ordering A, as follows: first let s₀ = f({}), then let s₁ = f({s₀}), s₂ = f({s₀,s₁}), and so on.

Can this process go on forever? No, it can't. For if it did, then by a process of "transfinite induction," we could stuff arbitrarily large infinite cardinalities into A. And while admittedly A is infinite, it has at most a fixed infinite size! So the process has to stop somewhere. But where? At a proper subset B of A? No, it can't do that either -- since if it did, then we'd just continue the process by adding f(B). So the only place it can stop is A itself. Therefore A can be well-ordered.

OK, should we come back to the puzzle from Tuesday? We have a box, [0,1]². To each real number x∈[0,1], we associate a countable subset S(x)⊂[0,1]. Now, can we choose S in such a way that for every (x,y) pair, either y∈S(x) or x∈S(y)? What do you think?

I'll give you two answers: that it isn't possible, and that it is possible. Which answer do you want to see first?

Alright, we'll start with why it isn't possible. For this I'll assume that the Continuum Hypothesis is false. Then there's some proper subset A⊂[0,1] that has cardinality ℵ₁. Let B be the union of S(x) over all x∈A. Then B also has cardinality ℵ₁. So, since we assumed that ℵ₁ is less than 2, there must be some y∈[0,1] not in B. Now observe that there are ℵ₁ real numbers x∈A, but none of them satisfy y∈S(x), and only ℵ₀ < ℵ₁ of them can satisfy x∈S(y).

Now let's see why it is possible. For this I want to assume both the Axiom of Choice and the Continuum Hypothesis. By the Continuum Hypothesis, there are only ℵ₁ real numbers in [0,1]. So by the Axiom of Choice, we can well-order those real numbers, and do it in such a way that every number has at most ℵ₀ predecessors. Now put y in S(x) if and only if y≤x, where ≤ means with respect to the well-ordering (not the usual ordering on real numbers). Then for every (x,y), clearly either y∈S(x) or x∈S(y).

Today's puzzle is about the power of self-esteem and positive thinking. Is there any theorem that you can only prove by assuming as an axiom that the theorem can be proved?

Lecture 3: Gödel, Turing, and Friends

On Thursday, I probably should've told you explicitly that I was compressing a whole math course into one lecture. On the one hand, that means I don't really expect you to have understood everything. On the other hand, to the extent you did understand -- hey! You got a whole math course in one lecture! You're welcome.

But I do realize that in the last lecture, I went too fast in some places. In particular, I wrote an example of logical inference on the board. The example was, if all A's are B's, and there is an A, then there is a B. I'm told that the physicists were having trouble with that?

Hey, I'm just ribbin' ya. If you haven't seen this way of thinking before, then you haven't seen it. But maybe, for the benefit of the physicists, we should go over the basic rules of logic?

Propositional Tautologies: A or not A, not(A and not A), etc. are valid.
Modus Ponens: If A is valid and A implies B is valid then B is valid.
Equality Rules: x=x, x=y implies y=x, x=y and y=z implies x=z, and x=y implies f(x)=f(y) are all valid.
Change of Variables: Changing variable names leaves a statement valid.
Quantifier Elimination: If For all x, A(x) is valid, then A(y) is valid for any y.
Quantifier Addition: If A(y) is valid where y is an unrestricted variable, then For all x, A(x) is valid.
Quantifier Rules: If Not(For all x, A(x)) is valid, then There exists an x such that Not(A(x)) is valid. Etc.

There's an amazing result called Gödel's Completeness Theorem, which says that these rules are all you ever need. In other words: if, starting from some set of axioms, you can't derive a contradiction using these rules, then the axioms must have a model (i.e., they must be consistent). Conversely, if the axioms are inconsistent, then the inconsistency can be proven using these rules alone.

Think about what that means. It means that Fermat's Last Theorem, the PoincarÃ© Conjecture, or any other mathematical achievement you care to name can be proved by starting from the axioms for set theory, and then applying these piddling little rules over and over again. Probably 300 million times, but still...

(How does Gödel prove the Completeness Theorem? The proof has been described as "extracting semantics from syntax." We simply "cook up objects to order" as the axioms request them! And if we ever run into an inconsistency, that can only be because there was an inconsistency in the original axioms.)

One immediate consequence of the Completeness Theorem is the Löwenheim-Skolem Theorem: every set of axioms has a model of at most countable cardinality. (Note: One of the best predictors of success in mathematical logic is having an umlaut in your name.) Why? Because the process of "cooking up objects to order as the axioms request them" can only go on for a countably infinite number of steps!

It's a shame that, after proving his Completeness Theorem, Gödel never really did anything else of note. [Pause for comic effect] Well, alright, I guess a year later he proved the Incompleteness Theorem. See, the Completeness Theorem was his Master's thesis, and the Incompleteness Theorem was his PhD thesis. Apparently, one of his PhD examiners didn't want to give him a degree because the PhD thesis was "too similar to the Master's thesis."

The Incompleteness Theorem says that, given any consistent, computable set of axioms, there's a true statement about the integers that can never be proved from those axioms. Here consistent means that you can't derive a contradiction, while computable means that either there are finitely many axioms, or else if there are infinitely many, at least there's an algorithm to generate all the axioms.

(If we didn't have the computability requirement, then we could simply take our "axioms" to consist of all true statements about the integers! In practice, that isn't a very useful set of axioms.)

But wait! Doesn't the Incompleteness Theorem contradict the Completeness Theorem, which says that any statement that's entailed by the axioms can be proved from the axioms? Hold that question; we're gonna clear it up later.

First, though, let's see how the Incompleteness Theorem is proved. People always say, "the proof of the Incompleteness Theorem was a technical tour de force, it took 30 pages, it requires an elaborate construction involving prime numbers," etc. Unbelievably, 80 years after Gödel, that's still how the proof is presented in math classes!

Alright, should I let you in on a secret? The proof of the Incompleteness Theorem is about two lines. The caveat is that, to give the two-line proof, you first need the concept of a computer.

When I was in junior high, I had a friend who was really good at math, but maybe not so good at programming. He wanted to write a program using arrays, but he didn't know what an array was. So what did he do? He associated each element of the array with a unique prime number, then he multiplied them all together; then, whenever he wanted to read something out of the array, he factored the product. (If he was programming a quantum computer, maybe that wouldn't be quite so bad!) Anyway, what my friend did, that's basically what Gödel did. He made up an elaborate hack in order to program without programming.

Turing Machines

OK, time to bring Mr. T. on the scene. How many of you have seen Turing machines before? About three-quarters of you? I'll go pretty quickly then.

In 1936, the word "computer" meant a person (usually a woman) whose job was to compute with pencil and paper. Turing wanted to show that, in principle, such a "computer" could be simulated by a machine. What would the machine look like? Well, it would have to able to write down its calculations somewhere. Since we don't really care about handwriting, font size, etc., it's easiest to imagine that the calculations are written on a sheet of paper divided into squares, with one symbol per square, and a finite number of possible symbols. Traditionally paper has two dimensions, but without loss of generality we can imagine a long, one-dimensional paper tape. How long? For the time being, we'll assume as long as we need.

What can the machine do? Well, clearly it has to be able to read symbols off the tape and modify them based on what it reads. We'll assume for simplicity that the machine reads only one symbol at a time. But in that case, it had better be able to move back and forth on the tape. It would also be nice if, once it's computed an answer, the machine can halt! But at any time, how does the machine decide which things to do? According to Turing, this decision should depend only on two pieces of information: (1) the symbol currently being read, and (2) the machine's current "internal configuration" or "state." Based on its internal state and the symbol currently being read, the machine should (1) write a new symbol in the current square, (2) move backwards or forwards one square, and (3) switch to a new state or halt.

Finally, since we want this machine to be physically realizable, the number of possible internal states should be finite. These are the only requirements.

Turing's first result is the existence of a "universal" machine: a machine whose job is to simulate any other machine described via symbols on the tape. In other words, universal programmable computers can exist. You don't have to build one machine for email, another for playing DVD's, another for Tomb Raider, and so on: you can build a single machine that simulates any of the other machines, by running different programs stored in memory. This result is actually a lemma, which Turing uses to prove his "real" result.

So what's the real result? It's that there's a basic problem, called the halting problem, that no program can ever solve. The halting problem is this: we're given a program, and we want to decide if it ever halts. Of course we can run the program for a while, but what if the program hasn't halted after a million years? At what point should we give up?

One piece of evidence that this problem might be hard is that, if we could solve it, then we could also solve many famous unsolved math problems. For example, Goldbach's Conjecture says that every even number 4 or greater can be written as a sum of two primes. Now, we can easily write a program that tests 4, 6, 8, and so on, halting only if it finds a number that can't be written as a sum of two primes. Then deciding whether that program ever halts is equivalent to deciding the truth of Goldbach's Conjecture.

But can we prove there's no program to solve the halting problem? This is what Turing does. His key idea is not even to try to analyze the internal dynamics of such a program, supposing it existed. Instead he simply says, suppose by way of contradiction that such a program P exists. Then we can modify P to produce a new program P' that does the following. Given another program Q as input, P'

runs forever if Q halts given its own code as input, or
halts if Q runs forever given its own code as input.

Now we just feed P' its own code as input. By the conditions above, P' will run forever if it halts, or halt if it runs forever. Therefore P' -- and by implication P -- can't have existed in the first place.

As I said, once you have Turing's results, Gödel's results fall out for free as a bonus. Why? Well, suppose the Incompleteness Theorem was false -- that is, there existed a consistent, computable proof system F from which any statement about integers could be either proved or disproved. Then given a computer program, we could simply search through every possible proof in F, until we found either a proof that the program halts or a proof that it doesn't halt. (This is possible because the statement that a particular computer program halts is ultimately just a statement about integers.) But this would give us an algorithm to solve the halting problem, which we already know is impossible. Therefore F can't exist.

By thinking more carefully, we can actually squeeze out a stronger result. Let P be a program that, given as input another program Q, tries to decide whether Q halts by the strategy above (i.e., searching through every possible proof and disproof that Q halts in some formal system F). Then as in Turing's proof, suppose we modify P to produce a new program P' that

runs forever if Q is proved to halt given its own code as input, or
halts if Q is proved to run forever given its own code as input.

Now suppose we feed P' its own code as input. Then we know that P' will run forever, without ever discovering a proof or disproof that it halts. For P' finds a proof that it halts, then it will run forever, and if it finds a proof that it runs forever, then it will halt, which is a contradiction.

But there's an obvious paradox: why isn't the above argument, itself, a proof that P' will run forever given its own code as input? And why won't P' discover this proof that it runs forever -- and therefore halt, and therefore run forever, and therefore halt, etc.?

The answer is that, in "proving" that P' runs forever, we made a hidden assumption: namely that the proof system F is consistent. If F was inconsistent, then there could perfectly well be a proof that P' halts, even if the reality was that P' ran forever.

But this means that, if F could prove that F was consistent, then F could also prove that P' ran forever -- thereby bringing back the above contradiction. The only possible conclusion is that if F is consistent, then F can't prove its own consistency. This result is sometimes called Gödel's Second Incompleteness Theorem.

The Second Incompleteness Theorem establishes what we maybe should have expected all along: that the only mathematical theories pompous enough to prove their own consistency, are the ones that don't have any consistency to brag about! If we want to prove that a theory F is consistent, then we can only do it within a more powerful theory -- a trivial example being F+Con(F) (F plus the axiom that F is consistent). But then how do we know that F+Con(F) is itself consistent? Well, we can only prove that in a still stronger theory: F+Con(F)+Con(F+Con(F)) (F+Con(F) plus the axiom that F+Con(F) is consistent). And so on infinitely. (Indeed, even beyond infinitely, into the countable ordinals.)

To take a concrete example: the Second Incompleteness Theorem tells us that the most popular axiom system for the integers, Peano Arithmetic, can't prove its own consistency. Or in symbols, PA can't prove Con(PA). If we want to prove Con(PA), then we need to move to a stronger axiom system, such as ZF (the Zermelo-Fraenkel axioms for set theory). In ZF we can prove Con(PA) pretty easily, by using the Axiom of Infinity to conjure up an infinite set that then serves as a model for PA.

On the other hand, again by the Second Incompleteness Theorem, ZF can't prove its own consistency. If we want to prove Con(ZF), the simplest way to do it is to posit the existence of infinities bigger than anything that can be defined in ZF. Such infinities are called "large cardinals." (When set theorists say large, they mean large.) Once again, we can prove the consistency of ZF in ZF+LC (where LC is the axiom that large cardinals exist). But if we want to prove that ZF+LC is itself consistent, then we need a still more powerful theory, such as one with even bigger infinities.

A quick question to test your understanding: while we can't prove in PA that Con(PA), can we least prove in PA that Con(PA) implies Con(ZF)?

No, we can't. For then we could also prove in ZF that Con(PA) implies Con(ZF). But since ZF can prove Con(PA), this would mean that ZF can prove Con(ZF), which contradicts the Second Incompleteness Theorem.

I promised to explain why the Incompleteness Theorem doesn't contradict the Completeness Theorem. The easiest way to do this is probably through an example. Consider the "self-hating theory" PA+Not(Con(PA)), or Peano Arithmetic plus the assertion of its own inconsistency. We know that if PA is consistent, then this strange theory must be consistent as well -- since otherwise PA would prove its own consistency, which the Incompleteness Theorem doesn't allow. It follows, by the Completeness Theorem, that PA+Not(Con(PA)) must have a model. But what could such a model possibly look like? In particular, what you happen if, within that model, you just asked to see the proof that PA was inconsistent?

I'll tell you what would happen: the axioms would tell you that proof of PA's inconsistency is encoded by a positive integer X. And then you would say, "but what is X?" And the axioms would say, "X." And you would say, "But what is X, as an ordinary positive integer?"

"No, no, no! Talk to the axioms."

"Alright, is X greater or less than 10^500,000?"

"Greater." (The axioms aren't stupid: they know that if they said "smaller", then you could simply try every smaller number and verify that none of them encode a proof of PA's inconsistency.)

"Alright then, what's X+1?"

"Y."

And so on. The axioms will keep cooking up fictitious numbers to satisfy your requests, and assuming that PA itself is consistent, you'll never be able to trap them in an inconsistency. The point of the Completeness Theorem is that the whole infinite set of fictitious numbers the axioms cook up will constitute a model for PA -- just not the usual model (i.e., the ordinary positive integers)! If we insist on talking about the usual model, then we switch from the domain of the Completeness Theorem to the domain of the Incompleteness Theorem.

Do you remember the puzzle from Thursday? The puzzle was whether there's any theorem that can only be proved by assuming as an axiom that it can be proved. In other words, does "just believing in yourself" make any formal difference in mathematics? We're now in a position to answer that question.

Let's suppose, for concreteness, that the theorem we want to prove is the Riemann Hypothesis (RH), and the formal system we want to prove it in is Zermelo-Fraenkel set theory (ZF). Suppose we can prove in ZF that, if ZF proves RH, then RH is true. Then taking the contrapositive, we can also prove in ZF that if RH is false, then ZF does not prove RH. In other words, we can prove in ZF+not(RH) that not(RH) is perfectly consistent with ZF. But this means that the theory ZF+not(RH) proves its own consistency -- and this, by Gödel, means that ZF+not(RH) is inconsistent. But saying that ZF+not(RH) is inconsistent is equivalent to saying that RH is a theorem of ZF. Therefore we've proved RH. In general we find that, if a statement can be proved by assuming as an axiom that it's provable, then it can also be proved without assuming that axiom. This result is known as Löb's Theorem (again with the umlauts), though personally I think that a better name would be the "You-Had-The-Mojo-All-Along Theorem."

Oh, you remember on Thursday we talked about the Axiom of Choice and the Continuum Hypothesis? Natural statements about the continuum that, since the continuum is such a well-defined mathematical entity, must certainly be either true or false? So, how did those things ever get decided? Well, Gödel proved in 1939 that assuming the Axiom of Choice (AC) or the Continuum Hypothesis (CH) can never lead to an inconsistency. In other words, if the theories ZF+AC or ZF+CH were inconsistent, that could only be because ZF itself was inconsistent.

This raised an obvious question: can we also consistently assume that AC and CH are false? Gödel worked on this problem but wasn't able to answer it. Finally Paul Cohen gave an affirmative answer in 1963, by inventing a new technique called "forcing." (For that, he won the only Fields Medal that's ever been given for set theory and the foundations of math.)

So, we now know that the usual axioms of mathematics don't decide the Axiom of Choice and the Continuum Hypothesis one way or another. You're free to believe both, neither, or one and not the other without fear of contradiction. And sure enough, opinion among mathematicians about AC and CH remains divided to this day, with many interesting arguments for and against (which we unfortunately don't have time to explore the details of).

Let me end with a possibly-surprising observation: the independence of AC and CH from ZF set theory is itself a theorem of Peano Arithmetic. For, ultimately, Gödel and Cohen's consistency theorems boil down to combinatorial assertions about manipulations of first-order sentences -- which can in principle be proven directly, without ever thinking about the transfinite sets that those sentences purport to describe. This provides a nice illustration of what, to me, is the central philosophical question underlying this whole business: do we ever really talk about the continuum, or do we only ever talk about finite sequences of symbols that talk about the continuum?

Bonus Addendum

What does any of this have to do with quantum mechanics? I will now attempt the heroic task of making a connection. What I've tried to impress on you is that there are profound difficulties if we want to assume the world is continuous. Take this pen, for example: how many different positions can I put it in on the surface of the table? ℵ₁? More than ℵ₁? Less than ℵ₁? We don't want the answers to "physics" questions to depend on the axioms of set theory!

Ah, but you say my question is physically meaningless, since the pen's position could never actually be measured to infinite precision? Sure -- but the point is, you need a physical theory to tell you that!

Of course, quantum mechanics gets its very name from the fact that a lot of the observables in the theory, like energy levels, are discrete -- "quantized." This seems paradoxical, since one of the criticisms that computer scientists level against quantum computing is that, as they see it, it's a continuous model of computation!

My own view is that quantum mechanics, like classical probability theory, should be seen as somehow "intermediate" between a continuous and discrete theory. (Here I'm assuming that the Hilbert space or probability space are finite-dimensional.) What I mean is that, while there are continuous parameters (the probabilities or amplitudes respectively), those parameters are not directly observable, and that has the effect of "shielding" us from the bizarro universe of the Axiom of Choice and the Continuum Hypothesis. We don't need a detailed physical theory to tell us that whether amplitudes are rational or irrational, whether there are more or less than ℵ₁ possible amplitudes, etc., are physically meaningless questions. This follows directly from the fact that, if we wanted to learn an amplitude exactly, then (even assuming no error!) we would need to measure the appropriate state infinitely many times.

Homework Assignment

Let BB(n), or the "nth Busy Beaver number," be the maximum number of steps that an n-state Turing machine can make on an initially blank tape before halting. (Here the maximum is over all n-state Turing machines that eventually halt.)

Prove that BB(n) grows faster than any computable function.
Let S = 1/BB(1) + 1/BB(2) + 1/BB(3) + ...
Is S a computable real number? In other words, is there an algorithm that, given as input a positive integer k, outputs a rational number S' such that |S-S'|<1/k?

Lecture 4: Minds and Machines

Today we're going to launch into something I know you've all been waiting for: a philosophical foodfight about minds, machines, and intelligence!

First, though, let's finish talking about computability. One concept we'll need again and again in this class is that of an oracle. The idea is a pretty obvious one: we assume we have a "black box" or "oracle" that immediately solves some hard computational problem, and then see what the consequences are! (When I was a freshman, I once started talking to my professor about the consequences of a hypothetical "NP-completeness fairy": a being that would instantly tell you whether a given Boolean formula was satisfiable or not. The professor had to correct me: they're not called "fairies"; they're called "oracles." Much more professional!)

Oracles were apparently first studied by Turing, in his 1938 PhD thesis. Obviously, anyone who could write a whole thesis about these fictitious entities would have to be an extremely pure theorist, someone who wouldn't be caught dead doing anything relevant. This was certainly true in Turing's case -- indeed he spent the years after his PhD, from 1939 to 1943, studying certain abstruse symmetry transformations on a 26-letter alphabet.

Anyway, we say that problem A is Turing-reducible to problem B, if A is solvable by a Turing machine given an oracle for B. In other words, "A is no harder than B": if we had a hypothetical device to solve B, then we could also solve A. Two problems are Turing-equivalent if each is Turing-reducible to the other. So for example, the problem of whether a statement can be proved from the axioms of set theory is Turing-equivalent to the halting problem: if you can solve one, you can solve the other.

Now, a Turing-degree is the set of all problems that are Turing-equivalent to a given problem. What are some examples of Turing-degrees? Well, we've already seen two examples: (1) the set of computable problems, and (2) the set of problems that are Turing-equivalent to the halting problem. Saying that these Turing-degrees aren't equal is just another way of saying that the halting problem isn't solvable.

Are there any Turing-degrees above these two? In other words, is there any problem even harder than the halting problem? Well, consider the following "super halting problem": given a Turing machine with an oracle for the halting problem, decide if it halts! Can we prove that this super halting problem is unsolvable, even given an oracle for the ordinary halting problem? Yes, we can! We simply take Turing's original proof that the halting problem is unsolvable, and "shift everything up a level" by giving all the machines an oracle for the halting problem. Everything in the proof goes through as before, a fact we express by saying that the proof "relativizes."

Here's a subtler question: is there any problem of intermediate difficulty between the computable problems and the halting problem? This question was first asked by Emil Post in 1944, and was finally answered in 1956, by Richard Friedberg in the US and (independently) A. A. Muchnik in the USSR. The answer is yes. Indeed, Friedberg and Muchnik actually proved a stronger result: that there are two problems A and B, both of which are solvable given an oracle for the halting problem, but neither of which is solvable given an oracle for the other. These problems are constructed via an infinite process whose purpose is to kill off every Turing machine that might reduce A to B or B to A. Unfortunately, the resulting problems are extremely contrived; they don't look like anything that might arise in practice. And even today, we don't have a single example of a "natural" problem with intermediate Turing degree.

Since Friedberg and Muchnik's breakthrough, the structure of the Turing degrees has been studied in more detail than you can possibly imagine. Here's one of the simplest questions: if two problems A and B are both reducible to the halting problem, then must there be a problem C that's reducible to A and B, such that any problem that's reducible to both A and B is also reducible to C? Hey, whatever floats your boat! But this is the point where some of us say, maybe we should move on to the next topic... (Incidentally, the answer to the question is no.)

Alright, the main philosophical idea underlying computability is what's called the Church-Turing Thesis. It's named after Turing and his adviser Alonzo Church, even though what they themselves believed about "their" thesis is open to dispute! Basically, the thesis is that any function "naturally to be regarded as computable" is computable by a Turing machine. Or in other words, any "reasonable" model of computation will give you either the same set of computable functions as the Turing machine model, or else a proper subset.

Already there's an obvious question: what sort of claim is this? Is it an empirical claim, about which functions can be computed in physical reality? Is it a definitional claim, about the meaning of the word "computable"? Is it a little of both?

Well, whatever it is, the Church-Turing Thesis can only be regarded as extremely successful, as theses go. As you know -- and as we'll discuss later -- quantum computing presents a serious challenge to the so-called "Extended" Church-Turing Thesis: that any function naturally to be regarded as efficiently computable is efficiently computable by a Turing machine. But in my view, so far there hasn't been any serious challenge to the original Church-Turing Thesis -- neither as a claim about physical reality, nor as a definition of ‘computable.'

There have been plenty of non-serious challenges to the Church-Turing Thesis. In fact there are whole conferences and journals devoted to these challenges -- google "hypercomputation." I've read some of this stuff, and it's mostly along the lines of, well, suppose you could do the first step of a computation in one second, the next step in a half second, the next step in a quarter second, the next step in an eighth second, and so on. Then in two seconds you'll have done an infinite amount of computation! Well, as stated it sounds a bit silly, so maybe sex it up by throwing in a black hole or something. How could the hidebound Turing reactionaries possibly object? (It reminds me of the joke about the supercomputer that was so fast, it could do an infinite loop in 2.5 seconds.)

We should immediately be skeptical that, if Nature was going to give us these vast computational powers, she would do so in a way that's so mundane, so uninteresting. Without making us sweat or anything. But admittedly, to really see why the hypercomputing proposals fail, you need the entropy bounds of Bekenstein, Bousso, and others -- which are among the few things the physicists think they know about quantum gravity, and which hopefully we'll say something about later in the course. So the Church-Turing Thesis -- even its original, non-extended version -- really is connected to some of the deepest questions in physics. But in my opinion, neither quantum computing, nor analog computing, nor anything else has mounted a serious challenge to that thesis in the seventy years since it was formulated.

If we interpret the Church-Turing Thesis as a claim about physical reality, then it should encompass everything in that reality, including the goopy neural nets between your respective ears. This leads us, of course, straight into the cratered intellectual battlefield that I promised to lead you into.

As a historical remark, it's interesting that the possibility of thinking machines isn't something that occurred to people gradually, after they'd already been using computers for decades. Instead it occurred to them immediately, the minute they started talking about computers themselves. People like Leibniz and Babbage and Lovelace and Turing and von Neumann understood from the beginning that a computer wouldn't just be another steam engine or toaster -- that, because of the property of universality (whether or not they called it that), it's difficult even to talk about computers without also talking about ourselves.

So, I asked you to read Turing's second famous paper, Computing Machinery and Intelligence. Reactions?

What's the main idea of this paper? As I read it, it's a plea against meat chauvinism. Sure, Turing makes some scientific arguments, some mathematical arguments, some epistemological arguments. But beneath everything else is a moral argument. Namely: if a computer interacted with us in a way that was indistinguishable from a human, then of course we could say the computer wasn't "really" thinking, that it was just a simulation. But on the same grounds, we could also say that other people aren't really thinking, that they merely act as if they're thinking. So what is it that entitles us to go through such intellectual acrobatics in the one case but not the other?

If you'll allow me to editorialize (as if I ever do otherwise...), this moral question, this question of double standards, is really where Searle, Penrose, and every other "strong AI skeptic" comes up empty for me. One can indeed give weighty and compelling arguments against the possibility of thinking machines. The only problem with these arguments is that they're also arguments against the possibility of thinking brains!

So for example: one popular argument is that, if a computer appears to be intelligent, that's merely a reflection of the intelligence of the humans who programmed it. But what if humans' intelligence is just a reflection of the billion-year evolutionary process that gave rise to it? What frustrates me every time I read the AI skeptics is their failure to consider these parallels honestly. The "qualia" and "aboutness" of other people is simply taken for granted. It's only the qualia of machines that's ever in question.

But perhaps a skeptic could retort: I believe other people think because I know I think, and other people look sort of similar to me -- they've also got five fingers, hair in their armpits, etc. But a robot looks different -- it's made of metal, it's got an antenna, it lumbers across the room, etc. So even if the robot acts like it's thinking, who knows? But if I accept this argument, why not go further? Why can't I say, I accept that white people think, but those blacks and Asians, who knows about them? They look too dissimilar from me.

In my view, one can divide everything that's been said about artificial intelligence into two categories: the 70% that's somewhere in Turing's paper from 1950, and the 30% that's emerged from a half-century of research since then.

So, after 56 years, there are some things we can say that would've surprised Alan Turing. What are those things? Well, one of them is how little progress has been made, compared to what he expected! Do you remember, Turing made a falsifiable prediction?

I believe that in about fifty years' time it will be possible, to programme computers, with a storage capacity of about 10⁹, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning.

How well has his prediction fared? First let's note that the prediction about computers themselves was damn good. Turing predicted that in 50 years' time (i.e., 2000), we'd be programming computers with a storage capacity of about 10⁹ (i.e., one gig).

But what about programming the computers to pass the imitation game? How well has Turing's prediction fared there?

Well, some of you might have heard of a program called ELIZA, written by Joseph Weizenbaum in 1966. This program simulates a psychotherapist who keeps spitting back whatever you said. The amazing thing Weizenbaum found is that many people will spill their hearts out to this program! And sometimes, if you then tell them they were talking to a program (and an extremely simple one at that), they won't believe you.

A few years ago, someone had the brilliant idea to take the original ELIZA program and let it loose in AOL chat rooms to see what happened. If you go to fury.com/aoliza, you can see some hilarious (fortunately anonymized) examples of people trying to seduce the program, telling it about their infidelities, etc. Here's one of my favorite exchanges, from a guy who (before moving on to amorous solicitations) had told the program that he planned to finish his B of A and then "move into corporate business alliance with Starbucks":

Guy: I think that u are getting off on playing with my mind
ELIZA: your mind?
Guy: frustrating me by not joining in my fantasy
ELIZA: your fantasy?
Guy: about having mind blowing sex with u
ELIZA: does that have anything to do with the fact that your b of a and then you will move into corporate business alliance with starbucks?

So this is about the state of the art in terms of man-machine repartee. It seems one actually needs to revise the Turing Test, to say that, if we want to verify intelligence in a computer, then we need some minimal level of intelligence in the human interrogator.

Question from the floor: But isn't the problem that these guys assumed they were talking to a human? To satisfy the conditions of the Turing Test, wouldn't we need to tell them they were trying to distinguish a human from a computer?

Response: Excellent question! Yes, for exactly that reason, this isn't a real Turing Test; it's merely amusing. However, for the past 15 years a guy named Hugh Loebner has been running a contest that's closer to what Turing had in mind. Here the testers are told that they're trying to distinguish humans from computers -- yet many of the transcripts have been just as depressing, both from the standpoint of machine intelligence and from that of human intelligence. (E.g., a woman who tried to converse intelligently about Shakespeare got classified as a computer, since "no human would know that much about Shakespeare...")

Another question from the floor: What if we had a computer doing the interrogation instead of a human?

Response: Another excellent question! As it turns out, that's not at all a hypothetical situation. A few days ago, a guy named Luis von Ahn won a MacArthur award for (among other things) his work on CAPTCHA's, which are those tests that websites use to distinguish legitimate users from spambots. I'm sure you've encountered them -- you know, the things where you see those weird curvy letters that you have to retype. The key property of these tests is that a computer should be able to generate and grade them, but not pass them! (A lot like professors making up a midterm...) Only humans should be able to pass the tests. So basically, these tests capitalize on the failures of AI. (Well, they also capitalize on the computational hardness of inverting one-way functions, which we'll get to later.)

One interesting aspect of CAPTCHA's is that they've already led to an arms race between the CAPTCHA programmers and the AI programmers. When I was at Berkeley, some of my fellow grad students wrote a program that broke a CAPTCHA called Gimpy maybe 30% of the time. So then the CAPTCHA's have to be made harder, and then the AI people get back to work, and so on. Who will win?

You see: every time you set up a Yahoo Mail account, you're directly confronting age-old mysteries about what it means to be human...

Despite what I said about the Turing Test, there have been some dramatic successes of AI. We all know about Kasparov and Deep Blue. Maybe less well-known is that, in 1996, a program called Otter was used to solve a 60-year-old open problem in algebra called the Robbins Conjecture, which Tarski and other famous mathematicians had worked on. (Apparently, for decades Tarski would give the problem to his best students. Then, eventually, he started giving it to his worst students...) The problem is easy to state: given the three axioms

A or (B or C) = (A or B) or C
A or B = B or A
Not(Not(A or B) or Not(A or Not(B))) = A,

can one derive as a consequence that Not(Not(A)) = A?

Let me stress that this was not a case like Appel and Haken's proof of the Four-Color Theorem, where the computer's role was basically to check thousands of cases. In this case, the proof was 17 lines long. A human could check the proof by hand, and say, yeah, I could've come up with that. (In principle!)

What else? Arguably there's a pretty sophisticated AI system that almost all of you used this morning and will use many more times today. What is it? Right, Google.

You can look at any of these examples -- Deep Blue, the Robbins conjecture, Google -- and say, that's not really AI. That's just massive search, helped along by clever programming. Now, this kind of talk drives AI researchers up a wall. They say: if you told someone in the sixties that in 30 years we'd be able to beat the world grandmaster at chess, and asked if that would count as AI, they'd say, of course it's AI! But now that we know how to do it, now it's no longer AI. Now it's just search. (Philosophers have a similar complaint: as soon as a branch of philosophy leads to anything concrete, it's no longer called philosophy! It's called math or science.)

There's another thing we appreciate now that people in Turing's time didn't really appreciate. This is that, in trying to write programs to simulate human intelligence, we're competing against a billion years of evolution. And that's damn hard. One counterintuitive consequence is that it's much easier to program a computer to beat Gary Kasparov at chess, than to program a computer to recognize faces under varied lighting conditions. Often the hardest tasks for AI are the ones that are trivial for a 5-year-old -- since those are the ones that are so hardwired by evolution that we don't even think about them.

In the last fifty years, have there been any new insights about the Turing Test itself? In my opinion, no. There has, on the other hand, been a non-insight, which is called Searle's Chinese Room. This is supposed to be an argument that even a computer that did pass the Turing Test wouldn't be intelligent. The way it goes is, let's say you don't speak Chinese. (Debbie isn't here today, so I think that's a safe assumption.) You sit in a room, and someone passes you paper slips through a hole in the wall with questions written in Chinese, and you're able to answer the questions (again in Chinese) just by consulting a rule book. In this case, you might be carrying out an intelligent Chinese conversation, yet by assumption, you don't understand a word of Chinese! Therefore symbol-manipulation can't produce understanding.

So, class, how might a strong AI proponent respond to this argument? Duh: you might not understand Chinese, but the rule book does! Or if you like, understanding Chinese is an emergent property of the system consisting of you and the rule book, in the same sense that understanding English is an emergent property of the neurons in your brain. Like many other thought experiments, the Chinese Room gets its mileage from a deceptive choice of imagery -- and more to the point, from ignoring computational complexity. We're invited to imagine someone pushing around slips of paper with zero understanding or insight -- much like the doofus freshmen who write (a+b)²=a²+b² on their math tests. But how many slips of paper are we talking about? How big would the rule book have to be, and how quickly would you have to consult it, to carry out an intelligent Chinese conversation in anything resembling real time? If each page of the rule book corresponded to one neuron of (say) Debbie's brain, then probably we'd be talking about a "rule book" at least the size of the Earth, its pages searchable by a swarm of robots traveling at close to the speed of light. When you put it that way, maybe it's not so hard to imagine that this enormous Chinese-speaking entity -- this dian nao -- that we've brought into being might have something we'd be prepared to call understanding or insight.

Of course, everyone who talks about this stuff is really tiptoeing around the hoary question of consciousness. See, consciousness has this weird dual property, that on the one hand, it's arguably the most mysterious thing we know about, and the other hand, not only are we directly aware of it, but in some sense it's the only thing we're directly aware of. You know, cogito ergo sum and all that. So to give an example, I might be mistaken about Richard's shirt being blue -- I might be hallucinating or whatever -- but I really can't be mistaken about my perceiving it as blue. (Or if I could, then we get an infinite regress.)

Question from the floor: What about optical illusions? You might know that a dot isn't moving, yet still perceive it as moving...

Response: There's no contradiction between the dot not moving (and my knowing that it isn't moving), and my being certain that I perceive it as moving. What I perceive and what's out there are two different things (even if I know they're different).

Now, is there anything else that also produces the feeling of absolute certainty? Right -- math! Incidentally, I think this similarity between math and subjective experience might go a long away toward explaining mathematicians' "quasi-mystical" tendencies. (I can already hear Greg Kuperberg wincing. Wince, Greg, wince!) This is a good thing for physicists to understand: when you're talking to a mathematician, you might not be talking to someone who fears the real world and who's therefore retreated into intellectual masturbation. You might be talking to someone for whom the real world was never especially real to begin with! I mean, to come back to something we mentioned earlier: why did many mathematicians look askance at the computer proof of the Four-Color Theorem? Sure, the computer might have made a mistake, but humans make plenty of mistakes too!

What it boils down to, I think, is that there is a sense in which the Four-Color Theorem has been proved, and there's another sense in which many mathematicians understand proof, and those two senses aren't the same. For many mathematicians, a statement isn't proved when a physical process (which might be a classical computation, a quantum computation, an interactive protocol, or something else) terminates saying that it's been proved -- however good the reasons might be to believe that physical process is reliable. Rather, the statement is proved when they (the mathematicians) feel that their minds can directly perceive its truth.

Of course, it's hard to discuss these things directly. But what I'm trying to point out is that many people's "anti-robot animus" is probably a combination of two ingredients:

the directly-experienced certainty that they're conscious -- that they perceive colors, sounds, positive integers, etc., regardless of whether anyone else does, and
the belief that if they were just a computation, then they could not be conscious in this way.

For example, I think Penrose's objections to strong AI derive from these two ingredients. I think his arguments about Gödel's Theorem are window dressing added later.

For people who think this way (as even I do, at least in certain moods), granting consciousness to a robot seems strangely equivalent to denying that one is conscious oneself. Is there any respectable way out of this dilemma -- or in other words, any way out that doesn't rely on a meatist double standard, with one rule for ourselves and a different rule for robots?

My own favorite way out is one that's been advocated by the philosopher David Chalmers. Basically, what Chalmers proposes is a "philosophical NP-completeness reduction": a reduction of one mystery to another. He says that if computers someday pass the Turing Test, then we'll be compelled to regard them as conscious. And as for how they could be conscious, we'll understand that just as well and as poorly as we understand how a bundle of neurons could be conscious. Yes, it's mysterious, but the one mystery doesn't seem so different from the other.

Today's Puzzles

[The well-defined puzzle] Can we assume without loss of generality that a computer program has access to its own source code?
[The vague, ill-defined puzzle] If that which before the 1800's was called water turned out to be CH₄ instead of H₂O, would it still be water, or would it be something else?

Answers to Homework

Recall that BB(n), or the "n^th Busy Beaver number," is the largest number of steps that an n-state Turing machine can make on an initially blank tape before halting.

The first problem was to prove that BB(n) grows faster than any computable function. Did people get this one? Excellent!

Yeah, suppose there were a computable function f(n) such that f(n)>BB(n) for every n. Then given an n-state Turing machine M, we could first compute f(n), then simulate M for up to f(n) steps. If M hasn't halted by then, then we know it never will halt, since f(n) is greater than the maximum number of steps any n-state machine could make. But this gives us a way to solve the halting problem, which we already know is impossible. Therefore the function f doesn't exist.

So the BB(n) function grows really, really, really fast. (In case you're curious, here are the first few values, insofar as they've been computed by people with too much free time: BB(1)=1, BB(2)=6, BB(3)=21, BB(4)=107, BB(5)≥47,176,870.)

The second problem was whether

is a computable real number. In other words, is there an algorithm that given an integer k, outputs a rational number S' such that |S-S'| < 1/k?

People had more trouble with this one? Alright, let's see the answer. The answer is no -- it isn't computable. For suppose it were computable; then we'll give an algorithm to compute BB(n) itself, which we know is impossible.

Assume by induction that we've already computed BB(1), BB(2), ..., BB(n-1). Then consider the sum of the "higher-order terms":

If S is computable, then S_n must be computable as well. But this means we can approximate S_n within 1/2, 1/4, 1/8, and so on, until the interval that we've bounded S_n in no longer contains 0. When that happens, we get an upper bound on 1/S_n. But since 1/BB(n+1), 1/BB(n+2), and so on are so much smaller than 1/BB(n), any upper bound on 1/S_n immediately yields an upper bound on BB(n) as well. But once we have an upper bound on BB(n), we can then compute BB(n) itself, by simply simulating all n-state Turing machines. So assuming we could compute S, we could also compute BB(n), which we already know is impossible. Therefore S is not computable.

Lecture 5: Paleocomplexity

By any objective standard, the theory of computational complexity ranks as one of the greatest intellectual achievements of humankind -- along with fire, the wheel, and computability theory. That it isn't taught in high schools is really just an accident of history. In any case, we'll certainly need complexity theory for everything else we're going to do in this course, which is why the next five or six lectures will be devoted to it. So before we dive in, let's step back and pontificate about where we're going.

What I've been trying to do is show you the conceptual underpinnings of the universe, before quantum mechanics comes on the scene. The amazing thing about quantum mechanics is that, despite being a grubby empirical discovery, it changes some of the underpinnings! Others it doesn't change, and others it's not so clear whether it changes them or not. But if we want to debate how things are changed by quantum mechanics, then we'd better understand what they looked like before quantum mechanics.

It's useful to divide complexity theory into historical epochs:

1950's: Late Turingzoic
1960's: Dawn of the Asymptotic Age
1971: The Cook-Levin Asteroid; extinction of the Diagonalosaurs
Early 1970's: The Karpian Explosion
1978: Early Cryptozoic
1980's: Randomaceous Era
1993: Eruption of Mt. Razborudich; extinction of the Combinataurs
1994: Invasion of the Quantodactyls
Mid-1990's to present: Derandomaceous Era

This lecture will be about "paleocomplexity": complexity in the age before P, NP, and NP-completeness, when Diagonalosaurs ruled the earth. Then Lecture 6 will cover the Karpian Explosion, Lecture 7 the Randomaceous Era, Lecture 8 the Early Cryptozoic, and Lecture 9 the Invasion of the Quantodactyls.

We talked on Thursday about computability theory. We saw how certain problems are uncomputable -- like, given a statement about positive integers, is it true or false? (If we could solve that, then we could solve the halting problem, which we already know is impossible.)

But now let's suppose we're given a statement about real numbers -- for example,

For all real x,y, (x+y)²=x²+2xy+y²

-- and we want to know if it's true or false. In this case, it turns out that there is a decision procedure -- this was proved by Tarski in the 1930's, at least when the statement only involves addition, multiplication, comparisons, the constants 0 and 1, and universal and existential quantifiers (no exponentials or trig functions).

Intuitively, if all our variables range over real numbers instead of integers, then everything is forced to be smooth and continuous, and there's no way to build up Gödel sentences like "this sentence can't be proved."

(If we throw in the exponential function, then apparently there's still no way to encode Gödel sentences, modulo an unsolved problem in analysis. But if we throw in the exponential function and switch from real numbers to complex numbers, then we're again able to encode Gödel sentences -- and the theory goes back to being undecidable! Can you guess why? Well, once we have complex numbers, we can force a number n to be an integer, by saying that we want e^2πin to equal 1. So we're then back to where we were with integers.)

Anyway, the attitude back then was, OK, we found an algorithm to decide the truth or falsehood of any sentence about real numbers! We can go home! Problem solved!

Trouble is, if you worked out how many steps that algorithm took to decide the truth of a sentence with n symbols, it grew like an enormous stack of exponentials: . So I was reading in a biography of Tarski that, when actual computers came on the scene in the 1950's, one of the first things anyone thought to do was to implement Tarski's algorithm for deciding statements about the real numbers. And it was hopeless -- indeed, it would've been hopeless even on the computers of today! On the computers of the 1950's, it was .

So, these days we talk about complexity. (Or at least most of us do.) The idea is, you impose an upper bound on how much of some resource your computer can use. The most obvious resources are (1) amount of time and (2) amount of memory, but many others can be defined. (Indeed, if you visit the Complexity Zoo, you'll find several hundred of them.)

One of the very first insights is, if you ask how much can be computed in 10 million steps, or 20 billion bits of memory, you won't get anywhere. Your theory of computing will be at the mercy of arbitrary choices about the underlying model. In other words, you won't be doing theoretical computer science at all: you'll be doing architecture, which is an endlessly-fascinating, non-dreary, non-boring topic in its own right, but not our topic.

So instead you have to ask a looser question: how much can be computed in an amount of time that grows linearly (or quadratically, or logarithmically) with the problem size? Asking this sort of question lets you ignore constant factors.

So, we define TIME(f(n)) to be the class of problems for which every instance of size n is solvable (by some "reference" computer) in an amount of time that grows like a constant times f(n). Likewise, SPACE(f(n)) is the class of problems solvable using an amount of space (i.e., bits of memory) that grows like a constant times f(n).

What can we say? Well, for every function f(n), TIME(f(n)) is contained in SPACE(f(n)). Why? Because a Turing machine can access at most one memory location per time step.

What else? Presumably you agree that TIME(n²) is contained in TIME(n³). Here's a question: is it strictly contained? In other words, can you solve more problems in n³ time than in n² time? (Here the choice of the exponents 3 and 2 is obviously essential. Asking whether you can solve more problems in n⁴ time than n³ time would just be ridiculous!)

Seriously, it turns out that you can solve more problems in n³ time than in n² time. This is a consequence of a fundamental result called the Time Hierarchy Theorem, which was proven by Hartmanis and Stearns in the mid-1960's and later rewarded with a Turing Award. (Not to diminish their contribution, but back then Turing Awards were hanging pretty low on the tree! Of course you had to know to be looking for them, which not many people did.)

Let's see how the proof goes. We need to find a problem that's solvable in n³ time but not n² time. What will this problem be? It'll be the simplest thing you could imagine: a time-bounded analogue of Turing's halting problem.

Given a Turing machine M, does M halt in at most n^2.5 steps? (Here n^2.5 is just some function between n² and n³.)

Clearly we can solve the above problem in n³ steps, by simulating M for n^2.5 steps and seeing whether it halts or not. (Indeed, we can solve the problem in something like n^2.5 log n steps. We always need some overhead when running a simulation, but the overhead can be made extremely small.)

But now suppose there were a program P to solve the problem in n² steps. We'll derive a contradiction. By using P as a subroutine, clearly we could produce a new program P' with the following behavior. Given a program M as input, P'

runs forever if M halts in at most n^2.5 steps given its own code as input, or
halts in n^2.5 steps if M runs for more than n^2.5 steps given its own code as input.

Furthermore, P' does all of this in at most n^2.5 steps (indeed, n² steps plus some overhead).

Now what do we do? Duh, we feed P' its own code as input! This gives us a contradiction, which implies that P can never have existed in the first place.

Obviously I was joking when I said the choice of n³ versus n² was essential. We can substitute n¹⁷ versus n¹⁶, 3ⁿ versus 2ⁿ, etc. But there's actually an interesting question here: can we substitute any functions f and g such that f grows significantly faster than g? The surprising answer is no! The function g needs a property called time-constructibility, which means (basically) that there's some program that halts in g(n) steps given n as input. Without this property, the program P' wouldn't know how many steps to simulate M for, and the argument wouldn't go through.

Now, every function you'll ever encounter in civilian life will be time-constructible. But in the early 1970's, complexity theorists made up some bizarre, rapidly-growing functions that aren't. And for these functions, you really can get arbitrarily large gaps in the complexity hierarchy! So for example, there's a function f such that TIME(f(n))=TIME(2^f(n)). (Duuuuude. To those who doubt that complexity is better than cannabis, I rest my case.)

Anyway, completely analogous to the Time Hierarchy Theorem is the Space Hierarchy Theorem, which says there's a problem solvable with n³ bits of memory that's not solvable with n² bits of memory.

Alright, next question: in computer science, we're usually interested in the fastest algorithm to solve a given problem. But is it clear that every problem has a fastest algorithm? Or could there be a problem that admits an infinite sequence of algorithms, with each one faster than the last but slower than some other algorithm?

Contrary to what you might think, this is not just a theoretical armchair question: it's a concrete, down-to-earth armchair question! As an example, consider the problem of multiplying two n-by-n matrices. The obvious algorithm takes O(n³) time. In 1968 Strassen gave a more complicated algorithm that takes O(n^2.78) time. Improvements followed, culminating in an O(n^2.376) algorithm of Coppersmith and Winograd. But is that the end of the line? Might there be an algorithm to multiply matrices in n² time? Here's a weirder possibility: could it be that for every ε>0, there exists an algorithm to multiply n-by-n matrices in time O(n^2+ε), but as ε approaches 0, these algorithms become more and more complicated without end?

See, some of this paleocomplexity stuff is actually nontrivial! (T-Rex might've been a dinosaur, but it still had pretty sharp teeth!) In this case, a 1967 result called the Blum Speedup Theorem says that there really are problems that admit no fastest algorithm. Not only that: there exists a problem P such that for every function f, if P has an O(f(n)) algorithm then it also has an O(log f(n)) algorithm!

Ray Laflamme: I wouldn't know how to begin to prove that...

Neither would I! So let's see how it goes. Let t(n) be a complexity bound. Our goal is to define a function f, from integers to {0,1}, such that if f can be computed in O(t(n)) steps, then it can also be computed in O(t(n-i)) steps for any positive integer i. Taking t to be sufficiently large then gives us as dramatic a speedup as we want: for example, if we set t(n):=2^t(n-1), then certainly t(n-1)=O(log t(n)).

Let M₁,M₂,... be an enumeration of Turing machines. Then let S_i = {M₁,...,M_i} be the set consisting of the first i machines. Here's what we do: given an integer n as input, we loop over all i from 1 to n. In the i^th iteration, we simulate every machine in S_i that wasn't "cancelled" in iterations 1 to i-1. If none of these machines halt in at most t(n-i) steps, then set f(i)=0. Otherwise, let M_j be the first machine that halts in at most t(n-i) steps. Then we define f(i) to be 1 if M_j outputs 0, or 0 if M_j outputs 1. (In other words, we cause M_j to fail at computing f(i).) We also "cancel" M_j, meaning that M_j doesn't need to be simulated in any later iteration. This defines the function f.

Certainly f(n) can be computed in O(n² t(n)) steps, by simply simulating the entire iterative procedure above. The key observation is this: for any integer i, if we hardwire the outcomes of iterations 1 to i into our simulation algorithm (i.e. tell the algorithm which M_j's get cancelled in those iterations), then we can skip iterations 1 to i, and proceed immediately to iteration i+1. Furthermore, assuming we start from iteration i+1, we can compute f(n) in only O(n² t(n-i)) steps, instead of O(n² t(n)) steps. So the more information we "precompute," the faster the algorithm will run on sufficiently large inputs n.

To turn this idea into a proof, the main thing one needs to show is that simulating the iterative procedure is pretty much the only way to compute f: or more precisely, any algorithm to compute f needs at least t(n-i) steps for some i. This then implies that f has no fastest algorithm.

Puzzle 1 From Last Week

Can we assume, without loss of generality, that a computer program has access to its own code? As a simple example, is there a program that prints itself as output?

The answer is yes: there are such programs. In fact, there have even been competitions to write the shortest self-printing program. At the IOCCC (the International Obfuscated C Code Contest), this competition was won some years ago by an extremely short program. Can you guess how long it was: 30 characters? 10? 5?

The winning program had zero characters. (Think about it!) Admittedly, a blank file is not exactly a kosher C program, but apparently some compilers will compile it to a program that does nothing.

Alright, alright, but what if we want a nontrivial self-printing program? In that case, the standard trick is to do something like the following (which you can translate into your favorite programming language):

Print the following twice, the second time in quotes.
"Print the following twice, the second time in quotes."

In general, if you want a program to have access to its own source code, the trick is to divide the program into three parts: (1) a part that actually does something useful (this is optional), (2) a "replicator," and (3) a string to be replicated. The string to be replicated should consist of the complete code of the program, including the replicator. (In other words, it should consist of parts (1) and (2).) Then by running the replicator twice, we get a spanking-new copy of parts (1), (2), and (3).

This idea was elaborated by von Neumann in the early 1950's. Shortly afterward, two guys (I think their names were Crick and Watson) found a physical system that actually obeys these rules. You and I, along with all living things on Earth, are basically walking computer programs with the semantics

Build a baby that acts on the following instructions, and also contains a copy of those instructions in its reproductive organs.
"Build a baby that acts on the following instructions, and also contains a copy of those instructions in its reproductive organs."

Puzzle 2 From Last Week

If water weren't H₂O, would it still be water?

Yeah, this isn't really a well-defined question: it all boils down to what we mean by the word water. Is water a "predicate": if x is clear and wet and drinkable and tasteless and freezable to ice, etc. ... then x is water? On this view, what water "is" is determined by sitting in our armchairs and listing necessary and sufficient conditions for something to be water. We then venture out into the world, and anything that meets the conditions is water by definition. This was the view of Frege and Russell, and it implies that anything with the "intuitive properties" of water is water, whether or not it's H₂O.

The other view, famously associated with Saul Kripke, is that the word water "rigidly designates" a particular substance (H₂O). On this view, we now know that when the Greeks and Babylonians talked about water, they were really talking about H₂O, even though they didn't realize it. Interestingly, "water = H₂O" is thus a necessary truth that was discovered by empirical observation. Something with the same properties as water but a different chemical structure would not be water.

Kripke argues that, if you accept this "rigid designator" view, then there's an implication for the mind-body problem.

The idea is this: the reductionist dream would be to explain consciousness in terms of neural firings, in the same way that science explained water as being H₂O. But Kripke says there's a disanalogy between these two cases. In the case of water, we can at least talk coherently about a hypothetical substance that feels like water, tastes like water, etc., but isn't H₂O and therefore isn't water. But suppose we discovered that pain is always associated with the firings of certain nerves called C-fibers. Could we then say that pain is C-fiber firings? Well, if something felt like pain but had a different neurobiological origin, would we say that it felt like pain but wasn't pain? Presumably we wouldn't. Anything that feels like pain is pain, by definition! Because of this difference, Kripke thinks that we can't explain pain as "being" C-fiber firings, in the same sense that we can explain water as "being" H₂O.

Some of you look bored. Dude -- this is considered one of the greatest philosophical insights of the last four decades! I'm serious! Well, I guess if you don't find it interesting, philosophy is not the field for you.

Lecture 6: P, NP, and Friends

We've seen that if we want to make progress in complexity, then we need to talk about asymptotics: not which problems can be solved in 10,000 steps, but for which problems can instances of size n be solved in cn² steps as n goes to infinity? We met TIME(f(n)), the class of all problems solvable in O(f(n)) steps, and SPACE(f(n)), the class of all problems solvable using O(f(n)) bits of memory.

But if we really want to make progress, then it's useful to take an even coarser-grained view: one where we distinguish between polynomial and exponential time, but not between O(n²) and O(n³) time. From this remove, we think of any polynomial bound as "fast," and any exponential bound as "slow."

Now, I realize people will immediately object: what if a problem is solvable in polynomial time, but the polynomial is n⁵⁰⁰⁰⁰? Or what if a problem takes exponential time, but the exponential is 1.00000001ⁿ? My answer is pragmatic: if cases like that regularly arose in practice, then it would've turned out that we were using the wrong abstraction. But so far, it seems like we're using the right abstraction. Of the big problems solvable in polynomial time -- matching, linear programming, primality testing, etc. -- most of them really do have practical algorithms. And of the big problems that we think take exponential time -- theorem-proving, circuit minimization, etc. -- most of them really don't have practical algorithms. So, that's the empirical skeleton holding up our fat and muscle.

Petting Zoo

It's now time to meet the most basic complexity classes -- the sheep and goats of the Complexity Zoo.

P is the class of problems solvable by a Turing machine in polynomial time. In other words, P is the union, over all positive integers k, of TIME(n^k). (Note that by "problem," we'll always mean decision problem: a problem where the inputs are n-bit strings and the outputs are either yes or no.)

PSPACE is the class of problems solvable in polynomial space (but unlimited time). In other words, it's the union over all integers k of SPACE(n^k).

EXP is the class of problems solvable in exponential time. In other words, it's the union over all integers k of TIME(2^{n^k}).

Certainly P is contained in PSPACE. I claim that PSPACE is contained in EXP. Why?

Right: a machine with n^k bits of memory can only go through 2^{n^k} different configurations, before it either halts or else gets stuck in an infinite loop.

Now, NP is the class of problems for which, if the answer is yes, then there's a polynomial-size proof of that fact that you can check in polynomial time. (The NP stands for "Nondeterministic Polynomial," in case you were wondering.) I could get more technical, but it's easiest to give an example: say I give you a 10,000-digit number, and I ask whether it has a divisor ending in 3. Well, answering that question might take a Long, Long Time^TM. But if your grad student finds such a divisor for you, then you can easily check that it works: you don't need to trust your student (always a plus).

I claim that NP is contained in PSPACE. Why?

Right: in polynomial space, you can loop over all possible n^k-bit proofs and check them one by one. If the answer is "yes," then one of the proofs will work, while if the answer is "no," then none of them will work.

Certainly P is contained in NP: if you can answer a question yourself, then someone else can convince you that the answer is yes (if it is yes) without even telling you anything.

One interesting little puzzle, which I'll leave as a homework exercise for Tuesday, is whether P equals NP. In other words, if you can recognize an answer efficiently, can you also find one efficiently?

Oh, alright, you don't have to solve this one by Tuesday. You can have till Thursday.

Look, this question of whether P=NP, what can I say? People like to describe it as "probably the central unsolved problem of theoretical computer science." That's a comical understatement. P vs. NP is one of the deepest questions that human beings have ever asked.

And not only that: it's one of the seven million-dollar prize problems of the Clay Math Institute! What an honor! Imagine: our mathematician friends have decided that P vs. NP is as important as the Hodge Conjecture, or even Navier-Stokes existence and smoothness! (Apparently they weren't going to include it, until they asked around to make sure it was important enough.)

Dude. One way to measure P vs. NP's importance is this. If NP problems were feasible, then mathematical creativity could be automated. The ability to check a proof would entail the ability to find one. Every Apple II, every Commodore, would have the reasoning power of Archimedes or Gauss. So by just programming your computer and letting it run, presumably you could immediately solve not only P vs. NP, but also the other six Clay problems. (Or five, now that Poincaré is down.)

But if that's the case, then why isn't it obvious that P doesn't equal NP? Surely God wouldn't be so benign as to grant us these extravagant powers! Surely our physicist-intuition tells us that brute-force search is unavoidable! (Leonid Levin told me that Feynman -- the king (or maybe court jester) of physicist-intuition -- had trouble even being convinced that P vs. NP was an open problem!)

Well, we certainly believe P≠NP. Indeed, we don't even believe there's a general way to solve NP problems that's dramatically better than brute-force search through every possibility. But if you want to understand why it's so hard to prove these things, let me tell you something.

Let's say you're given an N-digit number, but instead of factoring it, you just want to know if it's prime or composite.

Or let's say you're given a list of freshmen, together with which ones are willing to room with which other ones, and you want to pair off as many willing roommates as you can.

Or let's say you're given two DNA sequences, and you want to know how many insertions and deletions are needed to convert the one sequence to the other.

Surely these are fine examples of the sort of exponentially-hard NP problem we were talking about! Surely they, too, require brute-force search!

Except they don't. As it turns out, all of these problems have clever polynomial-time algorithms. The central challenge any P≠NP proof will have to overcome is to separate the NP problems that really are hard from the ones that merely look hard. I'm not just making a philosophical point. While there have been dozens of purported P≠NP proofs over the years, almost all of them could be rejected immediately for the simple reason that, if they worked, then they would rule out polynomial-time algorithms that we already know to exist.

So to summarize, there are problems like primality testing and pairing off roommates, for which computer scientists (often after decades of work) have been able to devise polynomial-time algorithms. But then there are other problems, like proving theorems, for which we don't know of any algorithm fundamentally better than brute-force search. But is that all we can say -- that we have a bunch of these NP problems, and some of them seem easy and others seem hard?

As it turns out, we can say something much more interesting than that. We can say that almost all of the "hard" problems are the same "hard" problem in different guises -- in the sense that, if we had a polynomial-time algorithm for any one of them, then we'd also have polynomial-time algorithms for all the rest. This is the upshot of the theory of NP-completeness, which was created in the early 1970's by Cook, Karp, and Levin.

The way it goes is, we define a problem B to be "NP-hard" if any NP problem can be efficiently reduced to B. What the hell does that mean? It means that, if we had an oracle to immediately solve problem B, then we could solve any NP problem in polynomial time.

That gives one notion of reduction, which is called Cook reduction. There's also a weaker notion of reduction, which is called Karp reduction. In a Karp reduction from problem A to problem B, we insist that there should be a polynomial-time algorithm that transforms any instance of A to an instance of B having the same answer.

What's the difference between Cook and Karp?

Right: with a Cook reduction, in solving problem A we get to call the oracle for problem B more than once. We can even call the oracle adaptively -- that is, in ways that depend on the outcomes of the previous calls. A Karp reduction is weaker in that we don't allow ourselves these liberties. Perhaps surprisingly, almost every reduction we know of is a Karp reduction -- the full power of Cook reductions is rarely needed in practice.

Now, we say a problem is NP-complete if it's both NP-hard and in NP. In other words, NP-complete problems are the "hardest" problems in NP: the problems that single-handedly capture the difficulty of every other NP problem. As a first question, is it obvious that NP-complete problems even exist?

I claim that it is obvious. Why?

Well, consider the following problem, called DUH: we're given a polynomial-time Turing machine M, and we want to know if there exists an n^k-bit input string that causes M to accept. I claim that any instance of any NP problem can be converted, in polynomial time, into a DUH instance having the same answer. Why? Well, DUH! Because that's what it means for a problem to be in NP!

The discovery of Cook, Karp, and Levin was not that there exist NP-complete problems -- that's obvious -- but rather that many natural problems are NP-complete.

The king of these natural NP-complete problems is called 3-Satisfiability, or 3SAT. (How do I know it's the king? Because it appeared on the TV show "NUMB3RS.") Here we're given n Boolean variables, x₁,...,x_n, as well as a set of logical constraints called clauses that relate at most three variables each:

x₂ or x₅ or not(x₆)
not(x₂) or x₄
not(x₄) or not(x₅) or x₆
...

Then the question is whether there's some way to set the variables x₁,...,x_n to TRUE or FALSE, in such a way that every clause is "satisfied" (that is, every clause evaluates to TRUE).

It's obvious that 3SAT is in NP. Why? Right: Because if someone gives you a setting of x₁,...,x_n that works, it's easy to check that it works!

Our goal is to prove that 3SAT is NP-complete. What will that take? Well, we need to show that if we had an oracle for 3SAT, then we could use it to solve not only 3SAT in polynomial time, but any NP problem whatsoever. That seems like a tall order! Yet in retrospect, you'll see that it's almost a triviality.

The proof has two steps. Step 1 is to show that, if we could solve 3SAT, then we could solve a more "general" problem called CircuitSAT. Step 2 is to show that, if we could solve CircuitSAT, then we could solve any NP problem.

In CircuitSAT, we're given a Boolean circuit and ... wait, do we have any engineers in this class? We do? Then listen up: in computer science, a "circuit" never has loops! Nor does it have resistors or diodes or anything weird like that. For us, a circuit is just an object where you start with n Boolean variables x₁,...,x_n, and then you can repeatedly define a new variable that's equal to the AND, OR, or NOT of variables that you've previously defined. Like so:

x_n+1 := x₃ or x_n
x_n+2 := not(x_n+1)
x_n+3 := x₁ and x_n+2
...

We designate the last variable in the list as the circuit's "output." Then the goal, in CircuitSAT, is to decide whether there's a setting of x₁,...,x_n such that the output is TRUE.

I claim that, if we could solve 3SAT, then we could also solve CircuitSAT. Why?

Well, all we need to do is notice that every CircuitSAT instance is really a 3SAT instance in disguise! Every time we compute an AND, OR, or NOT, we're relating one new variable to one or two old variables. And any such relationship can be expressed by a set of clauses involving at most three variables each. So for example,

x_n+1 := x₃ or x_n

becomes

x_n+1 or not(x₃)
x_n+1 or not(x_n)
not(x_n+1) or x₃ or x_n.

So, that was Step 1. Step 2 is to show that, if we can solve CircuitSAT, then we can solve any NP problem.

Alright, so consider some instance of an NP problem. Then by the definition of NP, there's a polynomial-time Turing machine M such that the answer is "yes" if and only if there's a polynomial-size witness string w that causes M to accept.

Now, given this Turing machine M, our goal is to create a circuit that "mimics" M. In other words, we want there to exist a setting of the circuit's input variables that makes it evaluate to TRUE, if and only if there exists a string w that causes M to accept.

How do we achieve that? Simple: by defining a whole shitload of variables! We'll have a variable that equals TRUE if and only if the 37^th bit of M's tape is set to '1' at the 42^nd time step. We'll have another variable that equals TRUE if and only if the 14^th bit is set to '1' at the 52^nd time step. We'll have another variable that equals TRUE if and only if M's tape head is in the 15^th internal state and the 74^th tape position at the 33^rd time step. Well, you get the idea.

Then, having written down this shitload of variables, we write down a shitload of logical relations between them. If the 17^th bit of the tape is '0' at the 22^nd time step, and the tape head is nowhere near the 17^th bit at that time, then the 17^th bit will still be '0' at the 23^rd time step. If the tape head is in internal state 5 at the 44^th time step, and it's reading a '1' at that time step, and internal state 5 transitions to internal state 7 on reading a '1', then the tape head will be in internal state 7 at the 45^th time step. And so on, and so on. The only variables that are left unrestricted are the ones that constitute the string w at the first time step.

The key point is that, while this is a very large shitload of variables and relations, it's still only a polynomial shitload. We therefore get a polynomially-large CircuitSAT instance, which is satisfiable if and only if there exists a w that causes M to accept.

[Devin, our resident engineer, is laughing hysterically]

Scott: What?

Devin: Nothing! That is just so simple in the basic idea, and so messy in the details...

Scott: Exactly. You've understood completely.

We've just proved the celebrated Cook-Levin Theorem: that 3SAT is NP-complete. This theorem can be thought of as the "initial infection" of the NP-completeness virus. Since then, the virus has spread to thousands of other problems. What I mean is this: if you want to prove that your favorite problem is NP-complete, all you have to do is prove it's as hard as some other problem that's already been proved NP-complete. (Well, you also have to prove that it's in NP, but that's usually trivial.) So there's a rich-get-richer effect: the more problems that have already been proved NP-complete, the easier it is to induct a new problem into the club. Indeed, proving problems NP-complete had become so routine by the 80's or 90's, and people had gotten so good at it, that (with rare exceptions) STOC and FOCS stopped publishing yet more NP-completeness proofs.

I'll just give you a tiny sampling of some of the earliest problems that were proved NP-complete:

Map Colorability: Given a map, can you color every country red, green, or blue, in such a way that no two neighboring countries are colored the same? (Interestingly, if you're only allowed to use two colors, then it's easy to decide whether or not such a coloring is possible -- why? On the other hand, if you're allowed four colors, then it always is possible, at least for maps drawn in the plane -- that's the famous Four-Color Theorem. So then the problem is again easy. Only with three colors is the problem NP-complete.)

Clique: Given a set of N high-school students, together with which ones will sit at a cafeteria table with which other ones, is there a "clique" of N/3 students who will all sit at a table with each other?

Packing: Given a set of boxes of specified dimensions, can you fit them into the trunk of your car?

Etc., etc., etc.

To reiterate: although these problems might look unrelated, they're actually the same problem in different costumes. if any one of them has an efficient solution, then all of them do, and P=NP. If any one of them doesn't have an efficient solution, then none of them do, and P≠NP. To prove P=NP, it's enough to show that some NP-complete problem (no matter which one) has an efficient solution. To prove P≠NP, it's enough to show that some NP-complete problem has no efficient solution. One for all and all for one.

So, there are the P problems, and then there are the NP-complete problems. Is there anything in between? (You should be used to this sort of "intermediate" question by now -- we saw it both in set theory and in computability theory!)

If P=NP, then NP-complete problems are P problems, so obviously the answer is no.

But what if P≠NP? In that case, a beautiful result called Ladner's Theorem says that there must be "intermediate" problems between P and NP-complete: in other words, problems that are in NP, but neither NP-complete nor solvable in polynomial time.

How would we create such an intermediate problem? Well, I'll give you the idea. The first step is to define an extremely slow-growing function t. Then given a 3SAT instance F of size n, the problem will be to decide whether F is satisfiable and t(n) is odd. In other words: if t(n) is odd then solve the 3SAT problem, while if t(n) is even then always output "no."

If you think about what we're doing, we're alternating long stretches of an NP-complete problem with long stretches of nothing! Intuitively, each stretch of 3SAT should kill off another polynomial-time algorithm for our problem, where we use the assumption that P≠NP. Likewise, each stretch of nothing should kill off another NP-completeness reduction, where we again use the assumption that P≠NP. This ensures that the problem is neither in P nor NP-complete. The main technical trick is to make the stretches get longer at an exponential rate. That way, given an input of size n, we can simulate the whole iterative process up to n in time polynomial in n. That ensures that the problem is still in NP.

Besides P and NP, another major complexity class is coNP: the "complement" of NP. A problem is in coNP if a "no" answer can be checked in polynomial-time. To every NP-complete problem, there's a corresponding coNP-complete problem. We've got unsatisfiability, map non-colorability, etc.

Now, why would anyone bother to define such a stupid thing? Because then we can ask a new question: does NP equal coNP? In other words: if a Boolean formula is unsatisfiable, is there at least a short proof that it's unsatisfiable, even if finding the proof would take exponential time? Once again, the answer is that we don't know.

Certainly if P=NP then NP=coNP. (Why?) On the other hand, the other direction isn't known: it could be that P≠NP but still NP=coNP. So if proving P≠NP is too easy, you can instead try to prove NP≠coNP!

This seems like a good time to mention a special complexity class, a class we quantum computing people know and love: NP intersect coNP.

This is the class for which either a yes answer or a no answer has an efficiently checkable proof. As an example, consider the problem of factoring an integer into primes. Over the course of my life, I must've met at least two dozen people who "knew" that factoring is NP-complete, and therefore that Shor's algorithm -- since it lets us factor on a quantum computer -- also lets us solve NP-complete problems on a quantum computer. Often these people were supremely confident of their "knowledge."

But if you think about it for two seconds, you'll realize that factoring has profound differences from the known NP-complete problems. If I give you a Boolean formula, it might have zero satisfying truth assignments, it might have one, or it might have 10 trillion. You simply don't know, a priori. But if I give you a 5000-digit integer, you probably won't know its factorization into primes, but you'll know that has one and only one factorization. (I think some guy named Euclid proved that a while ago.) This already tells us that factoring is somehow "special": that, unlike what we believe about the NP-complete problems, factoring has some structure that algorithms could try to exploit. And indeed, algorithms do exploit it: we know of a classical algorithm called the Number Field Sieve, which factors an n-bit integer in roughly 2^{n^1/3} steps, compared to the ~2^n/2 steps that would be needed for trying all possible divisors. (Why only ~2^n/2 steps, instead of ~2ⁿ?) And of course, we know of Shor's algorithm, which factors an n-bit integer in ~n² steps on a quantum computer: that is, in quantum polynomial time. Contrary to popular belief, we don't know of a quantum algorithm to solve NP-complete problems in polynomial time. If such an algorithm existed, it would have to be dramatically different from Shor's algorithm.

But can we pinpoint just how factoring differs from the known NP-complete problems, in terms of complexity theory? Yes, we can. First of all, in order to make factoring a decision (yes-or-no) problem, we need to ask something like this: given a positive integer N, does N have a prime factor whose last digit is 7? I claim that this problem is not merely in NP, but in NP intersect coNP. Why? Well, suppose someone gives you the prime factorization of N. There's only one of them. So if there is a prime factor whose last digit is 7, then you can verify that, and if there's no prime factor whose last digit is 7, then you can also verify that.

You might say, "but how do I know that I really was given the prime factorization? Sure, if someone gives me a bunch of numbers, I can check that they multiply to N, but how do I know they're prime?" For this you'll have to take on faith something that I told you earlier: that if you just want to know whether a number is prime or composite, and not what its factors are, then you can do that in polynomial time. OK, so if you accept that, then the factoring problem is in NP intersect coNP.

From this we can conclude that, if factoring were NP-complete, then NP would equal coNP. (Why?) Since we don't believe NP=coNP, this gives us a strong indication (though not a proof) that, all those people I told you about notwithstanding, factoring is not NP-complete. If we accept that, then only two possibilities remain: either factoring is in P, or else factoring is one of those "intermediate" problems whose existence is guaranteed by Ladner's Theorem. Most of us incline toward the latter possibility -- though not with as much conviction as we believe P≠NP.

Indeed, for all we know, it could be the case that P = NP intersect coNP but still P≠NP. (This possibility would imply that NP≠coNP.) So, if proving P≠NP and NP≠coNP are both too easy for you, your next challenge can be to prove P ≠ NP intersect coNP!

If P, NP, and coNP aren't enough to rock your world, you can generalize these classes to a giant teetering mess that we computer scientists call the polynomial hierarchy.

Observe that you can put any NP problem instance into the form

Does there exist an n-bit string X such that A(X)=1?

Here A is a function computable in polynomial time.

Likewise, you can put any coNP problem into the form

Does A(X)=1 for every X?

But what happens if you throw in another quantifier, like so?

Does there exist an X such that for every Y, A(X,Y)=1?

For every X, does there exist a Y such that A(X,Y)=1?

Problems like these lead to two new complexity classes, which are called Σ₂P and Π₂P respectively. Π₂P is the "complement" of Σ₂P, in the same sense that coNP is the complement of NP. We can also throw in a third quantifier:

Does there exist an X such that for every Y, there exists a Z such that A(X,Y,Z)=1?

For every X, does there exist a Y such that for every Z, A(X,Y,Z)=1?

This gives us Σ₃P and Π₃P respectively. It should be obvious how to generalize this to Σ_kP and Π_kP for any larger k. (As a side note, when k=1, we get Σ₁P=NP and Π₁P=coNP. Why?) Then taking the union of these classes over all positive integers k gives us the polynomial hierarchy PH.

The polynomial hierarchy really is a substantial generalization of NP and coNP -- in the sense that, even if we had an oracle for NP-complete problems, it's not at all clear how we could use it to solve (say) Σ₂P problems. On the other hand, just to complicate matters further, I claim that if P=NP, then the whole polynomial hierarchy would collapse down to P! Why?

Right: if P=NP, then we could take our algorithm for solving NP-complete problems in polynomial time, and modify it to call itself as a subroutine. And that would let us "flatten PH like a steamroller": first simulating NP and coNP, then Σ₂P and Π₂P, and so on through the entire hierarchy.

Likewise, it's not hard to prove that if NP=coNP, then the entire polynomial hierarchy collapses down to NP (or in other words, to coNP). If Σ₂P=Π₂P, then the entire polynomial hierarchy collapses down to Σ₂P, and so on. If you think about it, this gives us a whole infinite sequence of generalizations of the P≠NP conjecture, each one "harder" to prove than the last. Why do we care about these generalizations? Because often, we're trying to study conjecture BLAH, and we can't prove that BLAH is true, and we can't even prove that if BLAH were false then P would equal NP. But -- and here's the punchline -- we can prove that if BLAH were false, then the polynomial hierarchy would collapse to the second or the third level. And this gives us some sort of evidence that BLAH is true.

Welcome to complexity theory!

Since I talked about how lots of problems have non-obvious polynomial-time algorithms, I thought I should give you at least one example. So, let's do one of the simplest and most elegant in all of computer science -- the so-called Stable Marriage Problem. Have you seen this before? You haven't?

Alright, so we have N men and N women. Our goal is to marry them off. We assume for simplicity that they're all straight. (Marrying off gays and lesbians is technically harder, though also solvable in polynomial time!) We also assume, for simplicity and with much loss of generality, that everyone would rather be married than single.

So, each man ranks the women, in order from his first to last choice. Each woman likewise ranks the men. There are no ties.

Obviously, not every man can marry his first-choice woman, and not every woman can marry her first-choice man. Life sucks that way.

So let's try for something weaker. Given a way of pairing off the men and women, say that it's stable if no man and woman who aren't married to each other both prefer each other to their spouses. In other words, you might despise your husband, but no man who you like better than him likes you better than his wife, so you have no incentive to leave. This is the, um, desirable property that we call "stability."

Now, given the men's and women's stated preferences, our goal as matchmakers is to find a stable way of pairing them off. Matchmaker, matchmaker, make me a match, find me a find, catch me a catch, etc.

First obvious question: does there always exist a stable pairing of men and women? What do you think? Yes? No? As it turns out, the answer is yes, but the easiest way to prove it is just to give an algorithm for finding the pairing!

So let's concentrate on the question of how to find a pairing. In total, there are N! ways of pairing off men with women. For the soon-to-be-newlyweds' sake, hopefully we won't have to search through all of them.

Fortunately we won't. In the early 1960's, Gale and Shapley invented a polynomial-time -- in fact linear-time -- algorithm to solve this problem. And the beautiful thing about this algorithm is, it's exactly what you'd come up with from reading a Victorian romance novel. Later they found out that the same algorithm had been in use since the 1950's -- not to pair off men with women, but to pair off medical-school students with hospitals to do their residencies in. Indeed, hospitals and medical schools are still using a version of the algorithm today.

But back to the men and women. If we want to pair them off by the Gale-Shapley algorithm, then as a first step, we need to break the symmetry between the sexes: which sex "proposes" to the other? This being the early 1960's, you can guess how that question was answered. The men propose to the women.

So, we loop through all the men. The first man proposes to his first-choice woman. She provisionally accepts him. Then the next man proposes to his first-choice woman. She provisionally accepts him, and so on. But what happens when a man proposes to a woman who's already provisionally accepted another man? She chooses the one she prefers, and boots the other one out! Then, the next time we come around to that man in our loop over the men, he'll propose to his second-choice woman. And if she rejects him, then the next time we come around to him he'll propose to his third-choice woman. And so on, until everyone is married off. Pretty simple, huh?

First question: why does this algorithm terminate in linear time?

Right: because each man proposes to a given woman at most once. So the total number of proposals is at most N², which is just the amount of memory we need to write down the preference lists in the first place.

Second question: when the algorithm does terminate, why is everyone married off?

Right: because if they weren't, then there'd be some woman who'd never been proposed to, and some man who'd never proposed to her. But this is impossible. Eventually the man no one else wants will cave in, and propose to the woman no one else wants.

Third question: why is the pairing produced by this algorithm a stable one?

Right: because if it weren't, then there'd be one married couple (say Bob and Alice), and another married couple (say Charlie and Eve), such that Bob and Eve both prefer each other to their spouses. But in that case, Bob would've proposed to Eve before proposing to Alice. And if Charlie also proposed to Eve, then Eve would've made clear at the time that she preferred Bob. And this gives a contradiction.

In particular we've shown, as promised, that there exists a stable pairing: namely the pairing found by the Gale-Shapley algorithm.

Problem Set

We saw that 3SAT is NP-complete. By contrast, it turns out that 2SAT -- the version where we only allow two variables per clause -- is solvable in polynomial time. Explain why.
Recall that EXP is the class of problems solvable in exponential time. One can also define NEXP: the class of problems for which a "yes" answer can be verified in exponential time. In other words, NEXP is to EXP as NP is to P. Now, we don't know if P=NP, and we also don't know if EXP=NEXP. But we do know that if P=NP, then EXP=NEXP. Why?
Show that P doesn't equal SPACE(n) (the set of problems solvable in linear space). Hint: You don't need to prove that P is not in SPACE(n), or that SPACE(n) is not in P -- only that one or the other is true!
Show that if P=NP, then there's a polynomial-time algorithm not only to decide whether a Boolean formula has a satisfying assignment, but to find such an assignment whenever one exists.
[Extra credit] Give an explicit algorithm that finds a satisfying assignment whenever one exists, and that runs in polynomial time assuming P=NP. (If there's no satisfying assignment, your algorithm can behave arbitrarily.) In other words, give an algorithm for problem 4 that you could implement and run right now -- without invoking any subroutine that you've assumed to exist but can't actually describe.

Lecture 7: Randomness

(Thanks to Jibran Rashid for help preparing these notes.)

In the last two lectures, we talked about computational complexity up till the early 1970's. Today we'll add a new ingredient to our already simmering stew -- something that was thrown in around the mid-1970's, and that now pervades complexity to such an extent that it's hard to imagine doing anything without it. This new ingredient is randomness.

Certainly, if you want to study quantum computing, then you first have to understand randomized computing. I mean, quantum amplitudes only become interesting when they exhibit some behavior that classical probabilities don't: contextuality, interference, entanglement (as opposed to correlation), etc. So we can't even begin to discuss quantum mechanics without first knowing what it is that we're comparing against.

Alright, so what is randomness? Well, that's a profound philosophical question, but I'm a simpleminded person. So, you've got some probability p, which is a real number in the unit interval [0,1]. That's randomness.

Alright, so given some "event" A -- say, the event that it will rain tomorrow -- we can talk about a real number Pr[A] in [0,1], which is the probability that A will happen. (Or rather, the probability we think A will happen -- but I told you I'm a simpleminded person.) And the probabilities of different events satisfy some obvious relations, but it might be helpful to see them explicitly if you never have before.

First, the probability that A doesn't happen equals 1 minus the probability that it happens:

Pr[not(A)] = 1 - Pr[A].

Agree? I thought so.

Second, if we've got two events A and B, then

Pr[A or B] = Pr[A] + Pr[B] - Pr[A and B].

Third, an immediate consequence of the above, called the union bound:

Pr[A or B] ≤ Pr[A] + Pr[B].

Or in English: if you're unlikely to drown and you're unlikely to get struck by lightning, then chances are you'll neither drown nor get struck by lightning, regardless of whether getting struck by lightning makes you more or less likely to drown. One of the few causes for optimism in this life.

Despite its triviality, the union bound is probably the most useful fact in all of theoretical computer science. I use it maybe 200 times in every paper I write.

What else? Given a random variable X, the expectation of X, or E[X], is defined to be Σ_k Pr[X=k] k. Then given any two random variables X and Y, we have

E[X+Y] = E[X] + E[Y].

This is called linearity of expectation, and is probably the second most useful fact in all of theoretical computer science, after the union bound. Again, the key point is that any dependencies between X and Y are irrelevant.

Do we also have

E[XY] = E[X] E[Y]?

Right: we don't! Or rather, we do if X and Y are independent, but not in general.

Another important fact is Markov's inequality (or rather, one of his many inequalities): if X ≥ 0 is a nonnegative random variable, then for all k,

Pr[X ≥ k E[X]] ≤ 1/k.

Markov's inequality leads immediately to the third most useful fact in theoretical computer science, called the Chernoff bound. The Chernoff bound says that if you flip a coin 1,000 times, and you get heads 900 times, then chances are the coin was crooked. This is the theorem that casino managers implicitly use when they decide whether to send goons to break someone's legs.

Formally, let h be the number of times you get heads if you flip a fair coin n times. Then one way to state the Chernoff bound is

,

where c is a constant that you look up since you don't remember it. (Oh, all right: c=2 will work.)

How can we prove the Chernoff bound? Well, there's a simple trick: let x_i=1 if the i^th coin flip comes up heads, and let x_i=0 if tails. Then consider the expectation, not of x₁+...+x_n itself, but of exp(x₁+...+x_n). Since the coin flips had better be uncorrelated with each other, we have

Now we can just use Markov's inequality, and then take logs on both sides to get the Chernoff bound. I'll spare you the calculation (or rather, spare myself).

What do we need randomness for?

Even the ancients -- Turing, Shannon, and von Neumann -- understood that a random number source might be useful for writing programs. So for example, back in the forties and fifties, physicists invented (or rather re-invented) a technique called Monte Carlo simulation, to study some weird question they were interested in at the time involving the implosion of hollow plutonium spheres. Statistical sampling -- say, of the different ways a hollow plutonium sphere might go kaboom! -- is one perfectly legitimate use of randomness.

There are many, many reasons you might want randomness -- for foiling an eavesdropper in cryptography, for avoiding deadlocks in communication protocols, and so on. But within complexity theory, the usual purpose of randomness is to "smear out ignorance": that is, to take an algorithm that works on most inputs, and turn it into an algorithm that works on all inputs most of the time.

Let's see an example of a randomized algorithm. Suppose I describe a number to you by starting from 1, and then repeatedly adding, subtracting, or multiplying two numbers that were previously described (as in the card game "24"). Like so:

a=1
b=a+a
c=b²
d=c²
e=d²
f=e-a
g=d-a
h=d+a
i=gh
j=f-i

You can verify (if you're so inclined) that j, the "output" of the above program, equals zero. Now consider the following general problem: given such a program, does it output 0 or not? How could you tell?

Well, one way would just be to run the program, and see what it outputs! What's the problem with that?

Right: Even if the program is very short, the numbers it produces at intermediate steps might be enormous -- that is, you might need exponentially many digits even to write them down. This can happen, for example, if the program repeatedly generates a new number by squaring the previous one. So a straightforward simulation isn't going to be efficient.

What can you do instead? Well, suppose the program has n operations. Then here's the trick: first pick a random prime number p with n² digits. Then simulate the program, but doing all the arithmetic modulo p. This algorithm will certainly be efficient: that is, it will run in time polynomial in n. Also, if the output isn't zero modulo p, then you certainly conclude that isn't zero. However, this still leaves two questions unanswered:

Supposing the output is 0 modulo p, how confident can you be that it wasn't just a lucky fluke, and that the output is actually 0?
How do you pick a random prime number?

For the first question, let x be the program's output. Then |x| can be at most , where n is the number of operations -- since the fastest way to get big numbers is by repeated squaring. This immediately implies that x can have at most 2ⁿ prime factors.

On the other hand, how many prime numbers are there with n² digits? The famous Prime Number Theorem tells us the answer: about . Since is a lot bigger than 2ⁿ, most of those primes can't possibly divide x (unless of course x=0). So if we pick a random prime and it does divide x, then we can be very, very confident (but admittedly not certain) that x=0.

So much for the first question. Now on to the second: how do you pick a random prime with n² digits? Well, our old friend the Prime Number Theorem tells us that, if you pick a random number with n² digits, then it has about a one in n² chance of being prime. So all you have to do is keep picking random numbers; after about n² tries you'll probably hit a prime!

Question: Instead of repeatedly picking a random number, why couldn't you just start at a fixed number, and then keep adding 1 until you hit a prime?
Answer: Sure, that would work -- assuming a far-reaching extension of the Riemann Hypothesis! What you need is that the n²-digit prime numbers are more-or-less evenly spaced, so that you can't get unlucky and hit some exponentially-long stretch where everything's composite. Not even the Extended Riemann Hypothesis would give you that, but there is something called Cramér's Conjecture that would.

Of course, we've merely reduced the problem of picking a random prime to a different problem: namely, once you've picked a random number, how do you tell if it's prime? As I mentioned in the last lecture, figuring out if a number is prime or composite turns out to be much easier than actually factoring the number. Until recently, this primality-testing problem was another example where it seemed like you needed to use randomness -- indeed, it was the granddaddy of all such examples.

The idea was this. Fermat's Little Theorem (not to be confused with his Last theorem!) tells us that, if p is a prime, then x^p=x (mod p) for every integer x. So if you found an x for which x^p≠x (mod p), that would immediately tell you that p was composite -- even though you'd still know nothing about what its divisors were. The hope would be that, if you couldn't find an x for which x^p≠x (mod p), then you could say with high confidence that p was prime.

Alas, 'twas not to be. It turns out that there are composite numbers p that "pretend" to be prime, in the sense that x^p=x (mod p) for every x. The first few of these pretenders (called the Carmichael numbers) are 561, 1105, 1729, 2465, and 2821. Of course, if there were only finitely many pretenders, and we knew what they were, everything would be fine. But Alford, Granville, and Pomerance showed in 1994 that there are infinitely many pretenders.

But already in 1976, Miller and Rabin had figured out how to unmask the pretenders by tweaking the test a little bit. In other words, they found a modification of the Fermat test that always passes if p is prime, and that fails with high probability if p is composite. So, this gave a polynomial-time randomized algorithm for primality testing.

Then, in a breakthrough a few years ago that you've probably heard about, Agrawal, Kayal, and Saxena found a deterministic polynomial-time algorithm to decide whether a number is prime. This breakthrough has no practical application whatsoever, since we've long known of randomized algorithms that are faster, and whose error probability can easily be made smaller than the probability of an asteroid hitting your computer in mid-calculation. But it's wonderful to know.

To summarize, we wanted an efficient algorithm that would examine a program consisting entirely of additions, subtractions, and multiplications, and decide whether or not it output 0. I gave you such an algorithm, but it needed randomness in two places: first, in picking a random number; and second, in testing whether the random number was prime. The second use of randomness turned out to be inessential -- since we now have a deterministic polynomial-time algorithm for primality testing. But what about the first use of randomness? Was that use also inessential? As of 2006, no one knows! But large theoretical cruise-missiles have been pummeling this very problem, and the situation on the ground is volatile. Consult your local STOC proceedings for more on this developing story.

Alright, it's time to define some complexity classes. (Then again, when isn't it time?)

When we talk about probabilistic computation, chances are we're talking about one of the following four complexity classes, which were defined in a 1977 paper of John Gill.

PP (Probabilistic Polynomial-Time): Yeah, apparently even Gill himself recently admitted that it's a lousy name. But this is a serious course, and I will not tolerate any seventh-grade humor. Basically, PP is the class of all decision problems for which there exists a polynomial-time randomized algorithm that accepts with probability greater than 1/2 if the answer is yes, or less than 1/2 if the answer is no. In other words, we imagine a Turing machine M that receives both an n-bit input string x, and an unlimited source of random bits. If x is a yes-input, then at least half of the random bit settings should cause M to accept; while if x is a no-input, then at least half of the random bit settings should cause M to reject. Furthermore, M needs to halt after a number of steps bounded by a polynomial in n.

Here's the standard example of a PP problem: given a Boolean formula φ with n variables, do at least half of the 2ⁿ possible settings of the variables make the formula evaluate to TRUE? (Incidentally, just like deciding whether there exists a satisfying assignment is NP-complete, so this majority-vote variant can be shown to be PP-complete: that is, any other PP problem is efficiently reducible to it.)

Now, why might PP not capture our intuitive notion of problems solvable by randomized algorithms?

Right: because we want to avoid "Florida recount" situations! As far as PP is concerned, an algorithm is free to accept with probability 1/2+2^-n if the answer is yes, and probability 1/2-2^-n if the answer is no. But how would a mortal actually distinguish those two cases? If n was (say) 5000, then we'd have to gather statistics for longer than the age of the universe!

And indeed, PP is an extremely big class: for example, it certainly contains the NP-complete problems. Why? Well, given a Boolean formula φ with n variables, what you can do is accept right away with probability 1/2-2^-2n, and otherwise choose a random truth assignment and accept if and only if it satisfies φ. Then your total acceptance probability will be more than 1/2 if there's at least one satisfying assignment for φ, or less than 1/2 if there isn't.

Indeed, complexity theorists believe that PP is strictly larger than NP -- although, as usual, we can't prove it.

The above considerations led Gill to define a more "reasonable" variant of PP:

BPP (Bounded-Error Probabilistic Polynomial-Time): This is the class of decision problems for which there exists a polynomial-time randomized algorithm that accepts with probability greater than 2/3 if the answer is yes, or less than 1/3 if the answer is no. In other words: given any input, the algorithm can be wrong with probability at most 1/3.

What's important about 1/3 is just that it's some constant smaller than 1/2. Any such constant would be as good as other. Why? Well, suppose we're given a BPP algorithm that errs with probability 1/3. If we're so inclined, we can easily modify the algorithm to err with probability at most (say) 2^-100. How?

Right: just rerun the algorithm a few hundred times; then output the majority answer! If we take the majority answer out of T independent trials, then our good friend the Chernoff bound tells us we'll be wrong with a probability that decreases exponentially in T.

Indeed, not only could we replace 1/3 by any constant smaller than 1/2; we could even replace it by 1/2-1/p(n) where p is any polynomial.

So, that was BPP: if you like, the class of all problems that are feasibly solvable by computer in a universe governed by classical physics.
RP (Randomized Polynomial-Time): As I said before, the error probability of a BPP algorithm can easily be made smaller than the probability of an asteroid hitting the computer. And that's good enough for most applications: say, administering radiation doses in a hospital, or encrypting multibillion-dollar bank transactions, or controlling the launch of nuclear missiles. But what about proving theorems? For certain applications, you really can't take chances.

And that leads us to RP: the class of problems for which there exists a polynomial-time randomized algorithm that accepts with probability greater than 1/2 if the answer is yes, or probability zero if the answer is no. To put it another way: if the algorithm accepts even once, then you can be certain that the answer is yes. If the algorithm keeps rejecting, then you can be extremely confident (but never certain) that the answer is no.

RP has an obvious "complement," called coRP. This is just the class of problems for which there's a polynomial-time randomized algorithm that accepts with probability 1 if the answer is yes, or less than 1/2 if the answer is no.
ZPP (Zero-Error Probabilistic Polynomial-Time): This class can be defined as the intersection of RP and coRP -- the class of problems in both of them. Equivalently, ZPP is the class of problems solvable by a polynomial-time randomized algorithm that has to be correct whenever it does output an answer, but can output "don't know" up to half the time. Again equivalently, ZPP is the class of problems solvable by an algorithm that never errs, but that only runs expected polynomial time.

Sometimes you see BPP algorithms called "Monte Carlo algorithms," and ZPP algorithms called "Las Vegas algorithms." I've even seen RP algorithms called "Atlantic City algorithms." This always struck me as stupid terminology. (Are there also Indian reservation algorithms?)

Here are the known relationships among the basic complexity classes that we've seen so far in this course. The relationships I didn't discuss explicitly are left as exercises for the reader (i.e., you).

It might surprise you that we still don't know whether BPP is contained in NP. But think about it: even if a BPP machine accepted with probability close to 1, how would you prove that to a deterministic polynomial-time verifier who didn't believe you? Sure, you could show the verifier some random runs of the machine, but then she'd always suspect you of skewing your samples to get a favorable outcome.

Fortunately, the situation isn't quite as pathetic as it seems: we at least know that BPP is contained in NP^NP (that is, NP with NP oracle), and hence in the second level of the polynomial hierarchy PH. Sipser, Gács, and Lautemann proved that in 1983. I went through the proof in class, but I'm actually going to skip it in these notes, because it's a bit technical. If you want it, here it is.

Incidentally, while we know that BPP is contained in NP^NP, we don't know anything similar for BQP, the class of problems solvable in polynomial time on a quantum computer. BQP hasn't yet made its official entrance in this course -- you'll have to wait a couple more lectures! -- but I'm trying to foreshadow it by telling you what it apparently isn't. In other words, what do we know to be true of BPP that we don't know to be true of BQP? Containment in PH is only the first of three examples we'll see in this lecture.

In complexity theory, it's hard to talk about randomness without also talking about a closely-related concept called nonuniformity. Nonuniformity basically means that you get to choose a different algorithm for each input length n. Now, why would you want such a stupid thing? Well, remember in Lecture 5 I showed you the Blum Speedup Theorem -- which says that it's possible to construct weird problems that admit no fastest algorithm, but only an infinite sequence of algorithms, with each one faster than the last on sufficiently large inputs? In such a case, nonuniformity would let you pick and choose from all algorithms, and thereby achieve the optimal performance. In other words, given an input of length n, you could simply pick the algorithm that's fastest for inputs of that particular length!

But even in a world with nonuniformity, complexity theorists believe there would still be strong limits on what could efficiently be computed. When we want to talk about those limits, we use a terminology invented by Karp and Lipton in 1982. Karp and Lipton defined the complexity class P/f(n), or P with f(n)-size advice, to consist of all problems solvable in deterministic polynomial time on a Turing machine, with help from an f(n)-bit "advice string" a_n that depends only on the input length n.

You can think of the polynomial-time Turing machine as a grad student, and the advice string a_n as wisdom from the student's advisor. Like most advisors, this one is infinitely wise, benevolent, and trustworthy. He wants nothing more than to help his students solve their respective thesis problems: that is, to decide whether their respective inputs x in {0,1}ⁿ are yes-inputs or no-inputs. But also like most advisors, he's too busy to find out what specific problems his students are working on. He therefore just doles out the same advice a_n to all of them, trusting them to apply it to their particular inputs x.

We'll be particularly interested in the class P/poly, which consists of all problems solvable in polynomial time using polynomial-size advice. In other words, P/poly is the union of P/n^k over all positive integers k.

Now, is it possible that P = P/poly? As a first (trivial) observation, I claim the answer is no: P is strictly contained in P/poly, and indeed in P/1. In other words, even with a single bit of advice, you really can do more than with no advice. Why?

Right! Consider the following problem:

Given an input of length n, decide whether the n^th Turing machine halts.

Not only is this problem not in P, it's not even computable -- for it's nothing other than a slow, "unary" encoding of the halting problem. On the other hand, it's easy to solve with a single advice bit an that depends only on the input length n. For that advice bit could just tell you what the answer is!

Here's another way to understand the power of advice: while the number of problems in P is only countably infinite, the number of problems in P/1 is uncountably infinite. (Why?)

On the other hand, just because you can solve vastly more problems with advice than you can without, that doesn't mean advice will help you solve any particular problem you might be interested in. Indeed, a second easy observation is that advice doesn't let you do everything: there exist problems not in P/poly. Why?

Well, here's a simple diagonalization argument. I'll actually show a stronger result, that there exist problems not in P/n^{log n}. Let M₁,M₂,M₃,... be a list of polynomial-time Turing machines. Also, fix an input length n. Then I claim that there exists a Boolean function f:{0,1}ⁿ→{0,1} that the first n machines (M₁,...,M_n) all fail to compute, even given any n^{log n}-bit advice string. Why? Just a counting argument: there are Boolean functions, but only n Turing machines and advice strings. So choose such a function f for every n; you'll then cause each machine M_i to fail on all but finitely many input lengths. Indeed, we didn't even need the assumption that the M_i's run in polynomial time.

Of course, all this time we've been dancing around the real question: can advice help us solve problems that we actually care about, like the NP-complete problems? In particular, is NP contained in P/poly? Intuitively, it seems unlikely: there are exponentially many Boolean formulas of size n, so even if you somehow received a polynomial-size advice string from God, how would that help you to decide satisfiability for more than a tiny fraction of those formulas?

But -- and I'm sure this will come as a complete shock to you -- we can't prove it's impossible. Well, at least in this case we have a good excuse for our ignorance, since if P=NP, then obviously NP would be in P/poly as well. But here's a question: if we did succeed in proving P≠NP, then would we also have proved that NP is not in P/poly? In other words, would NP in P/poly imply P=NP? Alas, we don't even know the answer to that.

But as with BPP and NP, the situation isn't quite as pathetic as it seems. Karp and Lipton did manage to prove in 1982 that, if NP were contained in P/poly, then the polynomial hierarchy PH would collapse to the second level (that is, to NP^NP). In other words, if you believe the polynomial hierarchy is infinite, then you must also believe that NP-complete problems are not efficiently solvable by a nonuniform algorithm.

This "Karp-Lipton Theorem" is the most famous example of a very large class of complexity results, a class that's been characterized as "if donkeys could whistle, then pigs could fly." In other words, if one thing no one really believes is true were true, then another thing no one really believes is true would be true! Intellectual onanism, you say? Nonsense! What makes it interesting is that the two things that no one really believes are true would've previously seemed completely unrelated to each other.

It's a bit of a digression, but the proof of the Karp-Lipton Theorem is more fun than a barrel full of carp. So let's see the proof right now. We assume NP is contained in P/poly; what we need to prove is that the polynomial hierarchy collapses to the second level -- or equivalently, that coNP^NP = NP^NP. So let's consider an arbitrary problem in coNP^NP, like so:

For all n-bit strings x, does there exist an n-bit string y such that φ(x,y) evaluates to TRUE?

(Here φ is some arbitrary polynomial-size Boolean formula.)

We need to find an NP^NP question -- that is, a question where the existential quantifier comes before the universal quantifier -- that has the same answer as the question above. But what could such a question possibly be? Here's the trick: we'll first use the existential quantifier to guess a polynomial-size advice string a_n. We'll then use the universal quantifier to guess the string x. Finally, we'll use the advice string a_n -- together with the assumption that NP is in P/poly -- to guess y on our own. Thus:

Does there exist an advice string a_n such that for all n-bit strings x, φ(x,M(x,a_n)) evaluates to TRUE?

Here M is a polynomial-time Turing machine that, given x as input and an as advice, outputs an n-bit string y such that φ(x,y) evaluates to TRUE whenever such a y exists. By one of your homework problems from last week, we can easily construct such an M provided we can solve NP-complete problems in P/poly.

Alright, I told you before that nonuniformity was closely related to randomness -- so much so that it's hard to talk about one without talking about the other. So in the rest of this lecture, I want to tell you about two connections between randomness and nonuniformity: a simple one that was discovered by Adleman in the 70's, and a deep one that was discovered by Impagliazzo, Nisan, and Wigderson in the 90's.

The simple connection is that BPP is contained in P/poly: in other words, nonuniformity is at least as powerful as randomness. Why do you think that is?

Well, let's see why it is. Given a BPP computation, the first thing we'll do is amplify the computation to exponentially small error. In other words, we'll repeat the computation (say) n² times and then output the majority answer, so that the probability of making a mistake drops from 1/3 to roughly . (If you're trying to prove something about BPP, amplifying to exponentially small error is almost always a good first step!)

Now, how many inputs are there of length n? Right: 2ⁿ. And for each input, only a fraction of random strings cause us to err. By the union bound (the most useful fact in all of theoretical computer science), this implies that at most a fraction of random strings can ever cause us to err on inputs of length n. Since < 1, this means there exists a random string, call it r, that never causes us to err on inputs of length n. So fix such an r, feed it as advice to the P/poly machine, and we're done!

So that was the simple connection between randomness and nonuniformity. Before moving on to the deep connection, let me make two remarks.

Even if P≠NP, you might wonder whether NP-complete problems can be solved in probabilistic polynomial time. In other words, is NP in BPP? Well, we can already say something concrete about that question. If NP is in BPP, then certainly NP is also in P/poly (since BPP is in P/poly). But that means PH collapses by the Karp-Lipton Theorem. So if you believe the polynomial hierarchy is infinite, then you also believe NP-complete problems are not efficiently solvable by randomized algorithms.
If nonuniformity can simulate randomness, then can it also simulate quantumness? In other words, is BQP in P/poly? Well, we don't know, but it isn't considered likely. Certainly Adleman's proof that BPP is in P/poly completely breaks down if we replace the BPP by BQP. But this raises an interesting question: why does it break down? What's the crucial difference between quantum theory and classical probability theory, which causes the proof to work in the one case but not the other? I'll leave the answer as an exercise for you.

Alright, now for the deep connection. Do you remember the primality-testing problem from earlier in the lecture? Over the years, this problem crept steadily down the complexity hierarchy, like a monkey from branch to branch:

It's obvious that primality-testing is in coNP.
In 1975, Pratt showed it was in NP.
In 1977, Solovay, Strassen, and Rabin showed it was in coRP.
In 1992, Adleman and Huang showed it was in ZPP.
In 2002, Agrawal, Kayal, and Saxena showed it was in P.

The general project of taking randomized algorithms and converting them to deterministic ones is called derandomization (a name only a theoretical computer scientist could love). The history of the primality-testing problem can only be seen as a spectacular success of this project. But with such success comes an obvious question: can every randomized algorithm be derandomized? In other words, does P equal BPP?

Once again the answer is that we don't know. Usually, if we don't know if two complexity classes are equal, the "default conjecture" is that they're different. And so it was with P and BPP -- (ominous music) -- until now. Over the last decade and a half, mounting evidence has convinced almost all of us that in fact P=BPP. In the remaining ten minutes of this lecture, we certainly won't be able to review this evidence in any depth. But let me quote one theorem, just to give you a flavor of it:

Theorem (Impagliazzo-Wigderson 1997): Suppose there exists a problem that's solvable in exponential time, and that's not solvable in subexponential time even with the help of a subexponential-size advice string. Then P=BPP.

Notice how this theorem relates derandomization to nonuniformity -- and in particular, to proving that certain problems are hard for nonuniform algorithms. The premise certainly seems plausible. From our current perspective, the conclusion (P=BPP) also seems plausible. And yet the two seem to have nothing to do with each other. So, this theorem might be characterized as "If donkeys can bray, then pigs can oink."

Where does this connection between randomness and nonuniformity come from? It comes from the theory of pseudorandom generators. We're gonna see a lot more about pseudorandom generators in the next lecture, when we talk about cryptography. But basically, a pseudorandom generator is just a function that takes as input a short string (called the seed), and produces as output a long string, in such a way that, if the seed is random, then the output looks random. Obviously the output can't be random, since it doesn't have enough entropy: if the seed is k bits long, then there are only 2^k possible output strings, regardless of how long those output strings are. What we ask, instead, is that no polynomial-time algorithm can successfully distinguish the output of the pseudorandom generator from "true" randomness. Of course, we'd also like for the function mapping the seed to the output to be computable in polynomial time.

Already in 1982, Andy Yao realized that, if you could create a "good enough" pseudorandom generator, then you could prove P=BPP. Why? Well, suppose that for any integer k, you had a way of stretching an O(log n)-bit seed to an n-bit output in polynomial time, in such a way that no algorithm running in n^k time could successfully distinguish the output from true randomness. And suppose you had a BPP machine that ran in n^k time. In that case, you could simply loop over all possible seeds (of which there are only polynomially many), feed the corresponding outputs to the BPP machine, and then output the majority answer. The probability that the BPP machine accepts given a pseudorandom string has to be about the same as the probability that it accepts given a truly random string -- since otherwise the machine would be distinguishing random strings from pseudorandom ones, contrary to assumption!

But what's the role of nonuniformity in all this? Well, here's the point: in addition to a random (or pseudorandom) string, a BPP machine also receives an input, x. And we need the derandomization to work for every x. But that means that, for the purposes of derandomization, we must think of x as an advice string provided by some superintelligent adversary for the sole purpose of foiling the pseudorandom generator. You see, this is why we had to assume a problem that was hard even in the presence of advice: because we need to construct a pseudorandom generator that's indistinguishable from random even in the presence of the "adversary," x.

(That reminds me of something: why are there so many Israelis in complexity, and particularly in the more cryptographic kinds of complexity? I have a theory about this: it's because complexity is basically mathematicized paranoia. It's that field where, whenever anyone else has any choice in what to do, you immediately assume that person will do the worst possible thing to you and proceed accordingly.)

To summarize: if we could prove that certain problems are sufficiently hard for nonuniform algorithms, then we would prove P=BPP.

This leads to my third difference between BPP and BQP: while most of us believe that P=BPP, most of us certainly don't believe that P=BQP. (Indeed we can't believe that, if we believe factoring is hard for classical computers.) We don't have any "dequantization" program that's been remotely as successful as the derandomization program. Once again, it would seem there's a crucial difference between quantum theory and classical probability theory, which allows certain ideas (like those of Sipser-Gács-Lautemann, Adleman, and Impagliazzo-Wigderson) to work for the latter but not for the former.

Incidentally, over the last few years, Kabanets, Impagliazzo, and others managed to obtain a sort of converse to the derandomization theorems. What they've shown is that, if we want to prove P=BPP, then we'll have to prove that certain problems are hard for nonuniform algorithms. This could be taken as providing some sort of explanation for why, assuming P=BPP, no one has yet managed to prove it. Namely, it's because if you want to prove P=BPP, then you'll have to prove certain problems are hard -- and if you could prove those problems were hard, then you would be (at least indirectly) attacking questions like P versus NP. In complexity theory, pretty much everything eventually comes back to P versus NP.

Puzzles for Thursday

You and a friend want to flip a coin, but the only coin you have is crooked: it lands heads with some fixed but unknown probability p. Can you use this coin to simulate a fair coin flip? (I mean perfectly fair, not just approximately fair.)
n people are standing in a circle. They're each wearing either a red hat or a blue hat, assigned uniformly and independently at random. They can each see everyone else's hats but not their own. They want to vote on whether the number of red hats is even or odd. Each person votes at the same time, so that no one's vote depends on anyone else's. What's the maximum probability with which the people can win this game? (By "win," I mean that their vote corresponds to the truth.) Assume for simplicity that n is odd.

Lecture 8: Crypto

(Thanks to Gus Gutoski for help preparing these notes.)

This lecture begins with Scott bitching out the class for not attempting the puzzle questions. Bunch of lazy punks.

Answers to Puzzles from Lecture 7

Puzzle 1. We are given a biased coin that comes up heads with probability p. Using this coin, construct an unbiased coin.

Solution. The solution is the "von Neumann trick": flip the biased coin twice, interpreting HT as heads and TH as tails. If the flips come up HH or TT then try again. Under this scheme, "heads" and "tails" are equiprobable, each occurring with probability p(1-p) in any given trial. Conditioned on either HT or TH occurring, it follows that the simulated coin is unbiased.

Puzzle 2. n people sit in a circle. Each person wears either a red hat or a blue hat, chosen independently and uniformly at random. Each person can see the hats of all the other people, but not his/her own hat. Based only upon what they see, each person votes on whether or not the total number of red hats is odd. Is there a scheme by which the outcome of the vote is correct with probability greater than 1/2?

Solution. Each person decides his/her vote as follows: if the number of visible blue hats is larger than the number of visible red hats then vote according to the parity of the number of visible red hats. Otherwise, vote the opposite of the parity of the number of visible red hats. If the number of red hats differs from the number of blue hats by at least 2 then this scheme succeeds with certainty. Otherwise the scheme might fail. However, the probability that the number of red hats differs from the number of blue hats by less than 2 is small -- O(1/√n).

Crypto

Cryptography has been a major force in human history for more than 3,000 years. Numerous wars have been won or lost by the sophistication or stupidity of cryptosystems. If you think I'm exaggerating, read The Codebreakers by David Kahn, and keep in mind that it was written before people knew about the biggest cryptographic story of all, Turing's victory in the Second World War.

And yet, even though cryptography has influenced human affairs for millennia, developments over the last thirty years have completely -- yes, completely -- changed our understanding of it. If you plotted when the basic mathematical discoveries in cryptography were made, you'd see a few in antiquity, maybe a few from the Middle Ages till the 1800's, one in the 1920's (the one-time pad), a few more around World War II, and then, after the birth of computational complexity theory in the 1970's, boom boom boom boom boom boom boom...

Our journey through the history of cryptography begins with the famous and pathetic "Caesar cipher" used by the Roman Empire. Here the plaintext message is converted into a ciphertext by simply adding 3 to each letter, wrapping around to A after you reach Z. Thus D becomes G, Y becomes B, and DEMOCRITUS becomes GHPRFULWXV. More complex variants of the Caesar cipher have appeared, but given enough ciphertext they're all easy to crack, by using (for example) a frequency analysis of the letters appearing in the ciphertext. Not that that's stopped people from using these things! Indeed, as recently as last April, the head of the Sicilian mafia was finally caught after forty years because he used the Caesar cipher -- the original one -- to send messages to his subordinates!

It wasn't until the 1920's that an information-theoretically secure cryptosystem was devised: the one-time pad. The idea is simple: the plaintext message is represented by a binary string p, which is exclusive-OR'ed with a random binary key k of the same length. That is, the ciphertext c is equal to p + k, where + denotes bitwise addition mod 2.

The recipient (who knows k) can decrypt the ciphertext with another XOR operation:

c + k = p + k + k = p.

To an eavesdropper who doesn't know k, the ciphertext is just a string of random bits -- since XOR'ing any string of bits with a random string just produces another random string. (To drive home just how random the ciphertext is, Scott makes up an example in class of a plaintext and key, which turn out to encrypt to the all-1 string -- very random-looking indeed!)

The problem with the one-time pad, of course, is that the sender and recipient have to share a key that's as long as the message itself. Furthermore, if the same key is ever used to encrypt two or more messages, then the cryptosystem is no longer information-theoretically secure. (Hence the name "one-time pad.") To see why, suppose two plaintexts p₁ and p₂ are both encrypted via the same key k to ciphertexts c₁ and c₂ respectively. Then we have

c₁ + c₂ = p₁ + k + p₂ + k = p₁ + p₂,

and hence an eavesdropper can obtain the string p₁ + p₂. By itself, this might or might not be useful, but it at least constitutes some information that an eavesdropper could learn about the plaintexts. But this is just a mathematical curiosity, right? Well, in the 1950's the Soviets got sloppy and reused some of their one-time pads. As a result, the NSA, through its VENONA project, was able to recover some (though not all) of the plaintext encrypted in this way. This seems to be how Julius and Ethel Rosenberg were caught.

In the 1940's, Claude Shannon proved that information-theoretically secure cryptography requires the sender and recipient to share a key at least as long as the message they want to communicate. Like pretty much all of Shannon's results, this one is trivial in retrospect. (It's good to be in on the ground floor!) Here's his proof: given the ciphertext and the key, the plaintext had better be uniquely recoverable. In other words, for any fixed key, the function that maps plaintexts to ciphertexts had better be an injective function. But this immediately implies that, for a given ciphertext c, the number of plaintexts that could possibly have produced c is at most the number of keys. In other words, if there are fewer possible keys than plaintexts, then an eavesdropper will be able to rule out some of the plaintexts -- the ones that wouldn't encrypt to c for any value of the key. Therefore our cryptosystem won't be perfectly secure. It follows that, if we want perfect security, then we need at least as many keys as plaintexts -- or equivalently, the key needs to have at least as many bits as the plaintext.

I mentioned before that sharing huge keys is usually impractical -- not even the KGB managed to do it perfectly! So we want a cryptosystem that lets us get away with smaller keys. Of course, Shannon's result implies that such a cryptosystem can't be information-theoretically secure. But what if we relax our requirements? In particular, what if we assume that the eavesdropper is restricted to running in polynomial time? This question leads naturally to our next topic...

Pseudorandom Generators

As I mentioned in the last lecture, a pseudorandom generator (PRG) is basically a function that takes as input a short, truly random string, and produces as output a long, seemingly random string. More formally, a pseudorandom generator is a function f with the following properties:

f maps an n-bit input string (called the seed) to a p(n)-bit output string, where p(n) is some polynomial larger than n.
f is computable in time polynomial in n.
For every polynomial-time algorithm A (called the adversary), the difference

| Pr_{n-bit strings x} [ A accepts f(x) ] - Pr_{p(n)-bit strings y} [ A accepts y ] |

is negligibly small -- by which I mean, it decreases faster than 1/q(n) for any polynomial q. (Of course, decreasing at an exponential rate is even better.) Or in English, no polynomial-time adversary can distinguish the output of f from a truly random string with any non-negligible bias.

Now, you might wonder: how "stretchy" a PRG are we looking for? Do we want to stretch an n-bit seed to 2n bits? To n² bits? n¹⁰⁰ bits? The answer turns out to be irrelevant!

Why? Because even if we only had a PRG f that stretched n bits to n+1 bits, we could keep applying f recursively to its own output, and thereby stretch n bits to p(n) bits for any polynomial p. Furthermore, if the output of this recursive process were efficiently distinguishable from a random p(n)-bit string, then the output of f itself would have been efficiently distinguishable from a random (n+1)-bit string -- contrary to assumption! Of course, there's something that needs to be proved here, but the something that needs to be proved can be proved, and I'll leave it at that.

Now, I claim that if pseudorandom generators exist, then it's possible to build a computationally-secure cryptosystem using only short encryption keys. Does anyone see why?

Right: first use the PRG to stretch a short encryption key to a long one -- as long as the plaintext message itself. Then pretend that the long key is truly random, and use it exactly as you'd use a one-time pad!

Why is this scheme secure? As always in modern cryptography, what we do is to argue by reduction. Suppose that, given only the ciphertext message, an eavesdropper could learn something about the plaintext in polynomial time. We saw before that, if the encryption key were truly random (that is, were a one-time pad), then this would be impossible. It follows, then, that the eavesdropper would in effect be distinguishing the pseudorandom key from a random one. But this contradicts our assumption that no polynomial-time algorithm can distinguish the two!

Admittedly, this has all been pretty abstract and conceptual. Sure, we could do wonderful things if we had a PRG -- but is there any reason to suppose PRG's actually exist?

A first, trivial observation is that PRG's can only exist if P≠NP. Why?

Right: because if P=NP, then given a supposedly random string y, we can decide in polynomial time whether there's a short seed x such that f(x)=y. If y is random, then such a seed almost certainly won't exist -- so if it does exist, we can be almost certain that y isn't random. We can therefore distinguish the output of f from true randomness.

Alright, but suppose we do assume P≠NP. What are some concrete examples of functions that are believed to be pseudorandom generators?

One example is what's called the Blum-Blum-Shub generator. Here's how it works: pick a large composite number N. Then the seed, x, will be a random element of Z_N. Given this seed, first compute x² mod N, (x²)² mod N, ((x²)²)² mod N, and so on. Then concatenate the least-significant bits in the binary representations of these numbers, and output that as your pseudorandom string f(x).

Blum et al. were able to show that, if we had a polynomial-time algorithm to distinguish f(x) from a random string, then (modulo some technicalities) we could use that algorithm to factor N in polynomial time. Or equivalently, if factoring is hard, then Blum-Blum-Shub is a PRG. This is yet another example where we "prove" something is hard by showing that, if it were easy, then something else that we think is hard would also be easy.

Alas, we don't think factoring is hard -- at least, not in a world with quantum computers! So can we base the security of PRG's on a more quantum-safe assumption? Yes, we can. There are many, many ways to build a candidate PRG, and we have no reason to think that quantum computers will be able to break all of them. Indeed, you could even base a candidate PRG on the apparent unpredictability of (say) the "Rule 110" cellular automaton, as advocated by Stephen Wolfram in his groundbreaking, revolutionary, paradigm-smashing book.

Of course, our dream would be to base a PRG's security on the weakest possible assumption: P≠NP itself! But when people try to do that, they run into two interesting problems.

The first problem is that P versus NP deals only with the worst case. Imagine if you were a general or a bank president, and someone tried to sell you an encryption system with the sales pitch that there exists a message that's hard to decode. You see what the difficulty is: for both encryption systems and PRG's, we need NP problems that are hard on average, not just in the worst case. (Technically, we need problems that are hard on average with respect to some efficiently samplable distribution over the inputs -- not necessarily the uniform distribution.) But no one has been able to prove that such problems exist, even if we assume P≠NP.

That's not to say, though, that we know nothing about average-case hardness. As an example, consider the Shortest Vector Problem (SVP). Here we're given a lattice L in Rⁿ, consisting of all integer linear combinations of some given vectors v₁,...,v_n in Rⁿ. Then the problem is to approximate the length of the shortest nonzero vector in L to within some multiplicative factor k.

SVP is one of the few problems for which we can prove a worst-case / average-case equivalence (that is, the average case is every bit as hard as the worst case), at least when the approximation ratio k is big enough. Based on that equivalence, Ajtai, Dwork, Regev, and others have constructed cryptosystems and pseudorandom generators whose security rests on the worst-case hardness of SVP. Unfortunately, the same properties that let us prove worst-case / average-case equivalence also make it unlikely that SVP is NP-complete for the relevant values of k! It seems more likely that SVP is intermediate between P and NP-complete, just like we think factoring is.

Alright, so suppose we just assume NP-complete problems are hard on average. Even then, there's a further difficulty in using NP-complete problems to build a PRG. This is that breaking PRG's just doesn't seem to have the right "shape" to be NP-complete. What do I mean by that? Well, think about how we prove a problem B is NP-complete: we take some problem A that's already known to be NP-complete, and we give a polynomial-time reduction that maps yes-instances of A to yes-instances of B, and no-instances of A to no-instances of B. In the case of breaking a PRG, presumably the yes-instances would be pseudorandom strings and the no-instances would be truly random strings (or maybe vice versa).

Do you see the problem here? If not, let me spell it out for you: how do we describe a "truly random string" for the purpose of mapping to it in the reduction? The whole point of a string being random is that we can't describe it by anything shorter than itself! Admittedly, this argument is full of loopholes, one of which is that the reduction might be randomized. Nevertheless, it is possible to conclude something from the argument: that if breaking PRG's is NP-complete, then the proof will have to be very different from the sort of NP-completeness proofs that we're used to.

One-Way Functions

One-way functions are the cousins of pseudorandom generators. Intuitively, a one-way function (OWF) is just a function that's easy to compute but hard to invert. More formally, a function f from n bits to p(n) bits is a one way function if

f is computable in time polynomial in n.
For every polynomial-time adversary A, the probability that A succeeds at inverting f,

Pr_{n-bit strings x} [f(A(f(x))) = f(x)],

is negligibly small -- that is, smaller than 1/q(n) for any polynomial q.

The event f(A(f(x))) = f(x) appears in the definition instead of just A(f(x)) = x in order to account for the fact that f might have multiple inverses. With this definition, we consider algorithms A that find anything in the preimage of f(x), not just x itself.

I claim that the existence of PRG's implies the existence of OWF's. Can anyone tell me why? Anyone?

Right: because a PRG is an OWF!

Alright then, can anyone prove that the existence of OWF's implies the existence of PRG's?

Yeah, this one's a little harder! The main reason is that the output of an OWF f doesn't have to appear random in order for f to be hard to invert. And indeed, it took more than a decade of work -- culminating in a behemoth 1997 paper of Håstad, Impagliazzo, Levin, and Luby -- to figure out how to construct a pseudorandom generator from any one-way function. Because of Håstad et al.'s result, we now know that OWF's exist if and only if PRG's do. The proof, as you'd expect, is pretty complicated, and the reduction is not exactly practical: the blowup is by about n⁴⁰! This is the sort of thing that gives polynomial-time a bad name -- but it's the exception, not the rule! If we assume that the one-way function is a permutation, then the proof becomes much easier (it was already shown by Yao in 1982) and the reduction becomes much faster. But of course that yields a less general result.

So far we've restricted ourselves to private-key cryptosystems, which take for granted that the sender and receiver share a secret key. But how would you share a secret key with (say) Amazon.com before sending them your credit card number? Would you email them the key? Oops -- if you did that, then you'd better encrypt your email using another secret key, and so on ad infinitum! The solution, of course, is to meet with an Amazon employee in an abandoned garage at midnight.

No, wait ... I meant that the solution is public-key cryptography.

Public-Key Cryptography

It's amazing, if you think about it, that so basic an idea had to wait until the 1970's to be discovered. Physicists were tidying up the Standard Model while cryptographers were still at the Copernicus stage!

So, how did public-key cryptography finally come to be? The first inventors -- or rather discoverers -- were Ellis, Cocks, and Williamson, working for the GCHQ (the British NSA) in the early 70's. Of course they couldn't publish their work, so today they don't get much credit! Let that be a lesson to you.

The first public public-key cryptosystem was that of Diffie and Hellman, in 1976. A couple years later, Rivest, Shamir, and Adleman discovered the famous RSA system that bears their initials. Do any of you know how RSA was first revealed to the world? Right: as a puzzle in Martin Gardner's Mathematical Games column for Scientific American!

RSA had several advantages over Diffie-Hellman: for example, it only required one party to generate a public key instead of both, and it let users authenticate themselves in addition to communicating in private. But if you read Diffie and Hellman's paper, pretty much all the main ideas are there.

Anyway, the core of any public-key cryptosystem is what's called a trapdoor one-way function. This is a function that's

easy to compute,
hard to invert, and
easy to invert given some secret "trapdoor" information.

The first two requirements are just the same as for ordinary OWF's. The third requirement -- that the OWF should have a "trapdoor" that makes the inversion problem easy -- is the new one. For comparison, notice that the existence of ordinary one-way functions implies the existence of secure private-key cryptosystems, whereas the existence of trapdoor one-way functions implies the existence of secure public-key cryptosystems.

So, what's an actual example of a public-key cryptosystem? Well, most of you have seen RSA at some point in your mathematical lives, so I'll go through it quickly.

Let's say you want to send your credit card number to Amazon.com. What happens? First Amazon randomly selects two large prime numbers p and q (which can be done in polynomial time), subject to the technical constraint that p-1 and q-1 should not be divisible by 3. (We'll see the reason for that later.) Then Amazon computes the product N = pq and publishes it for all the world to see, while keeping p and q themselves a closely-guarded secret.

Now, assume without loss of generality your credit card number is encoded as a positive integer x, smaller but not too much smaller than N. Then what do you do? Simple: you compute x³ mod N and send it over to Amazon! If a credit card thief intercepted your message en route, then she would have to recover x given only x³ mod N. But computing cube roots modulo a composite number is believed to be an extremely hard problem, at least for classical computers! If p and q are both reasonably large (say 10,000 digits each), then our hope would be that any classical eavesdropper would need millions of years to recover x.

This leaves an obvious question: how does Amazon itself recover x? Duh -- by using its knowledge of p and q! We know from our friend Mr. Euler, way back in 1761, that the sequence

x mod N, x² mod N, x³ mod N, ...

repeats with period (p-1)(q-1). So provided Amazon can find an integer k such that

3k = 1 mod (p-1)(q-1),

it'll then have

(x³)^k mod N = x^3k mod N = x mod N.

Now, we know that such a k exists, by the assumption that p-1 and q-1 are not divisible by 3. Furthermore, Amazon can find such a k in polynomial time, using Euclid's algorithm (from way way back, around 300BC). Finally, given x³ mod N, Amazon can compute (x³)^k in polynomial time by using a simple repeated squaring trick. So that's RSA.

(Note: to make everything as concrete and visceral as possible, I assumed that x always gets raised to the third power. The resulting cryptosystem is by no means a toy: as far as anyone knows, it's secure! In practice, though, people can and do raise x to arbitrary powers. As another remark, squaring x instead of cubing it would open a whole new can of worms, since any nonzero number that has a square root mod N has more than one of them.)

Of course, if the credit card thief could factor N into pq, then she could run the exact same decoding algorithm that Amazon runs, and thereby recover the message x. So the whole scheme relies crucially on the assumption that factoring is hard! This immediately implies that RSA could be broken by a credit card thief with a quantum computer. Classically, however, the best known factoring algorithm is the Number Field Sieve, which takes about steps.

As a side note, no one has yet proved that breaking RSA requires factoring: it's possible that there's a more direct way to recover the message x, one that doesn't entail learning p and q. On the other hand, in 1979 Rabin discovered a variant of RSA for which recovering the plaintext is provably as hard as factoring.

Alright, but all this talk of cryptosystems based on factoring and modular arithmetic is so 1993! Today we realize that as soon as we build a quantum computer, Shor's algorithm will break the whole lot of these things. Of course, this point hasn't been lost on complexity theorists, many of whom have since set to work looking for trapdoor OWF's that still seem safe against quantum computers. Currently, our best candidates for such trapdoor OWF's are based on lattice problems, like the Shortest Vector Problem (SVP) that I described earlier. Whereas factoring reduces to the abelian hidden subgroup problem, which is solvable in quantum polynomial time, SVP is only known to reduce to the dihedral hidden subgroup problem, which is not known to be solvable in quantum polynomial time despite a decade of effort.

Inspired by this observation, and building on earlier work by Ajtai and Dwork, Oded Regev has recently proposed public-key cryptosystems that are provably secure against quantum eavesdroppers, assuming SVP is hard for quantum computers. Note that his cryptosystems themselves are purely classical. On the other hand, even if you only wanted security against classical eavesdroppers, you'd still have to assume that SVP was hard for quantum computers, since the reduction from SVP to breaking the cryptosystem is a quantum reduction!

A decade ago, the key and message lengths of these lattice-based cryptosystems were so impractical it was almost a joke. But today, largely because of Regev's work, that's no longer true. I'm still waiting for the first commercial implementations of his cryptosystems.

Question from the floor: What about elliptic-curve cryptosystems?

Answer: Yeah, those are still easily breakable by quantum computers, since the problem of breaking them can be expressed as an abelian hidden subgroup problem. (Elliptic curve groups are abelian.) On the other hand, the best-known classical algorithms for breaking elliptic curve cryptosystems apparently have much higher running times than the Number Field Sieve for breaking RSA -- it's a question of ~2ⁿ versus ~2^{n^1/3}. That could be fundamental, or it could just be because algorithms for elliptic curve groups haven't been studied as much.

Thus completes our whirlwind tour of classical complexity and cryptography. I'll be in Europe for the next 10 days and hence the next three lectures are cancelled. We'll reconvene on Thursday, October 19, at which point we'll talk about quantum mechanics and Roger Penrose's The Emperor's New Mind. I'll expect everyone to have read the book by then. But if you read the "sequel," Shadows of the Mind, then you receive negative credit. You have to read another book, say The Road to Reality, in order to compensate for the damage you caused to yourself.

Lecture 9: Quantum

There are two ways to teach quantum mechanics. The first way -- which for most physicists today is still the only way -- follows the historical order in which the ideas were discovered. So, you start with classical mechanics and electrodynamics, solving lots of grueling differential equations at every step. Then you learn about the "blackbody paradox" and various strange experimental results, and the great crisis these things posed for physics. Next you learn a complicated patchwork of ideas that physicists invented between 1900 and 1926 to try to make the crisis go away. Then, if you're lucky, after years of study you finally get around to the central conceptual point: that nature is described not by probabilities (which are always nonnegative), but by numbers called amplitudes that can be positive, negative, or even complex.

Today, in the quantum information age, the fact that all the physicists had to learn quantum this way seems increasingly humorous. For example, I've had experts in quantum field theory -- people who've spent years calculating path integrals of mind-boggling complexity -- ask me to explain the Bell inequality to them. That's like Andrew Wiles asking me to explain the Pythagorean Theorem.

As a direct result of this "QWERTY" approach to explaining quantum mechanics - which you can see reflected in almost every popular book and article, down to the present -- the subject acquired an undeserved reputation for being hard. Educated people memorized the slogans -- "light is both a wave and a particle," "the cat is neither dead nor alive until you look," "you can ask about the position or the momentum, but not both," "one particle instantly learns the spin of the other through spooky action-at-a-distance," etc. -- and also learned that they shouldn't even try to understand such things without years of painstaking work.

The second way to teach quantum mechanics leaves a blow-by-blow account of its discovery to the historians, and instead starts directly from the conceptual core -- namely, a certain generalization of probability theory to allow minus signs. Once you know what the theory is actually about, you can then sprinkle in physics to taste, and calculate the spectrum of whatever atom you want. This second approach is the one I'll be following here.

So, what is quantum mechanics? Even though it was discovered by physicists, it's not a physical theory in the same sense as electromagnetism or general relativity. In the usual "hierarchy of sciences" -- with biology at the top, then chemistry, then physics, then math -- quantum mechanics sits at a level between math and physics that I don't know a good name for. Basically, quantum mechanics is the operating system that other physical theories run on as application software (with the exception of general relativity, which hasn't yet been successfully ported to this particular OS). There's even a word for taking a physical theory and porting it to this OS: "to quantize."

But if quantum mechanics isn't physics in the usual sense -- if it's not about matter, or energy, or waves, or particles -- then what is it about? From my perspective, it's about information and probabilities and observables, and how they relate to each other.

Ray Laflamme: That's very much a computer-science point of view.

Scott: Yes, it is.

My contention in this lecture is the following: Quantum mechanics is what you would inevitably come up with if you started from probability theory, and then said, let's try to generalize it so that the numbers we used to call "probabilities" can be negative numbers. As such, the theory could have been invented by mathematicians in the 19^th century without any input from experiment. It wasn't, but it could have been.

Ray Laflamme: And yet, with all the structures mathematicians studied, none of them came up with quantum mechanics until experiment forced it on them...

Scott: Yes -- and to me, that's a perfect illustration of why experiments are relevant in the first place! More often than not, the only reason we need experiments is that we're not smart enough. After the experiment has been done, if we've learned anything worth knowing at all, then hopefully we've learned why the experiment wasn't necessary to begin with -- why it wouldn't have made sense for the world to be any other way. But we're too dumb to figure it out ourselves!

Two other perfect examples of "obvious-in-retrospect" theories are evolution and special relativity. Admittedly, I don't know if the ancient Greeks, sitting around in their togas, could have figured out that these theories were true. But certainly -- certainly! -- they could've figured out that they were possibly true: that they're powerful principles that would've at least been on God's whiteboard when She was brainstorming the world.

In this lecture, I'm going to try to convince you -- without any recourse to experiment -- that quantum mechanics would also have been on God's whiteboard. I'm going to show you why, if you want a universe with certain very generic properties, you seem forced to one of three choices: (1) determinism, (2) classical probabilities, or (3) quantum mechanics. Even if the "mystery" of quantum mechanics can never be banished entirely, you might be surprised by just how far people could've gotten without leaving their armchairs! That they didn't get far until atomic spectra and so on forced the theory down their throats is one of the strongest arguments I know for experiments being necessary.

A Less Than 0% Chance

Alright, so what would it mean to have "probability theory" with negative numbers? Well, there's a reason you never hear the weather forecaster talk about a -20% chance of rain tomorrow -- it really does make as little sense as it sounds. But I'd like you to set any qualms aside, and just think abstractly about an event with N possible outcomes. We can express the probabilities of those events by a vector of N real numbers:

(p₁,....,p_N),

Mathematically, what can we say about this vector? Well, the probabilities had better be nonnegative, and they'd better sum to 1. We can express the latter fact by saying that the 1-norm of the probability vector has to be 1. (The 1-norm just means the sum of the absolute values of the entries.)

But the 1-norm is not the only norm in the world -- it's not the only way we know to define the "size" of a vector. There are other ways, and one of the recurring favorites since the days of Pythagoras has been the 2-norm or Euclidean norm. Formally, the Euclidean norm means the square root of the sum of the squares of the entries. Informally, it means you're late for class, so instead of going this way and then that way, you cut across the grass.

Now, what happens if you try to come up with a theory that's like probability theory, but based on the 2-norm instead of the 1-norm? I'm going to try to convince you that quantum mechanics is what inevitably results.

Let's consider a single bit. In probability theory, we can describe a bit as having a probability p of being 0, and a probability 1-p of being 1. But if we switch from the 1-norm to the 2-norm, now we no longer want two numbers that sum to 1, we want two numbers whose squares sum to 1. (I'm assuming we're still talking about real numbers.) In other words, we now want a vector (α,β) where α² + β² = 1. Of course, the set of all such vectors forms a circle:

The theory we're inventing will somehow have to connect to observation. So, suppose we have a bit that's described by this vector (α,β). Then we'll need to specify what happens if we look at the bit. Well, since it is a bit, we should see either 0 or 1! Furthermore, the probability of seeing 0 and the probability of seeing 1 had better add up to 1. Now, starting from the vector (α,β), how can we get two numbers that add up to 1? Simple: we can let α² be the probability of a 0 outcome, and let β² be the probability of a 1 outcome.

But in that case, why not forget about α and β, and just describe the bit directly in terms of probabilities? Ahhhhh. The difference comes in how the vector changes when we apply an operation to it. In probability theory, if we have a bit that's represented by the vector (p,1-p), then we can represent any operation on the bit by a stochastic matrix: that is, a matrix of nonnegative real numbers where every column adds up to 1. So for example, the "bit flip" operation -- which changes the probability of a 1 outcome from p to 1-p -- can be represented as follows:

Indeed, it turns out that a stochastic matrix is the most general sort of matrix that always maps a probability vector to another probability vector.

Exercise 1 for the Non-Lazy Reader: Prove this.

But now that we've switched from the 1-norm to the 2-norm, we have to ask: what's the most general sort of matrix that always maps a unit vector in the 2-norm to another unit vector in the 2-norm?

Well, we call such a matrix a unitary matrix -- indeed, that's one way to define what a unitary matrix is! (Oh, all right. As long as we're only talking about real numbers, it's called an orthogonal matrix. But same difference.) Another way to define a unitary matrix, again in the case of real numbers, is as a matrix whose inverse equals its transpose.

Exercise 2 for the Non-Lazy Reader: Prove that these two definitions are equivalent.

Gus Gutoski: So far you've given no motivation for why you've set the sum of the squares equal to 1, rather than the sum of the cubes or the sum of the fourth powers...

Scott: I'm gettin' to it -- don't you worry about that!

This "2-norm bit" that we've defined has a name, which as you know is qubit. Physicists like to represent qubits using what they call "Dirac ket notation," in which the vector (α,β) becomes . Here α is the amplitude of outcome |0〉, and β is the amplitude of outcome |1〉.

This notation usually drives computer scientists up a wall when they first see it -- especially because of the asymmetric brackets! But if you stick with it, you see that it's really not so bad. As an example, instead of writing out a vector like (0,0,3/5,0,0,0,4/5,0,0), you can simply write , omitting all of the 0 entries.

So given a qubit, we can transform it by applying any 2-by-2 unitary matrix -- and that leads already to the famous effect of quantum interference. For example, consider the unitary matrix

which takes a vector in the plane and rotates it by 45 degrees counterclockwise. Now consider the state |0〉. If we apply U once to this state, we'll get -- it's like taking a coin and flipping it. But then, if we apply the same operation U a second time, we'll get |1〉:

So in other words, applying a "randomizing" operation to a "random" state produces a deterministic outcome! Intuitively, even though there are two "paths" that lead to the outcome |0〉, one of those paths has positive amplitude and the other has negative amplitude. As a result, the two paths interfere destructively and cancel each other out. By contrast, the two paths leading to the outcome |1〉 both have positive amplitude, and therefore interfere constructively.

The reason you never see this sort of interference in the classical world is that probabilities can't be negative. So, cancellation between positive and negative amplitudes can be seen as the source of all "quantum weirdness" -- the one thing that makes quantum mechanics different from classical probability theory. How I wish someone had told me that when I first heard the word "quantum"!

Mixed States

Once we have these quantum states, one thing we can always do is to take classical probability theory and "layer it on top." In other words, we can always ask, what if we don't know which quantum state we have? For example, what if we have a 1/2 probability of and a 1/2 probability of ? This gives us what's called a mixed state, which is the most general kind of state in quantum mechanics.

Mathematically, we represent a mixed state by an object called a density matrix. Here's how it works: say you have this vector of N amplitudes, (α₁,...,α_N). Then you compute the outer product of the vector with itself -- that is, an N-by-N matrix whose (i,j) entry is α_iα_j (again in the case of real numbers). Then, if you have a probability distribution over several such vectors, you just take a linear combination of the resulting matrices. So for example, if you have probability p of some vector and probability 1-p of a different vector, then it's p times the one matrix plus 1-p times the other.

The density matrix encodes all the information that could ever be obtained from some probability distribution over quantum states, by first applying a unitary operation and then measuring.

Exercise 3 for the Non-Lazy Reader: Prove this.

This implies that if two distributions give rise to the same density matrix, then those distributions are empirically indistinguishable, or in other words are the same mixed state. As an example, let's say you have the state with 1/2 probability, and with 1/2 probability. Then the density matrix that describes your knowledge is

It follows, then, that no measurement you can ever perform will distinguish this mixture from a 1/2 probability of |0〉 and a 1/2 probability of |1〉.

The Squaring Rule

Now let's talk about the question Gus raised, which is, why do we square the amplitudes instead of cubing them or raising them to the fourth power or whatever?

Devin Smith: Because it gives you the right answer?

Scott: Yeah, you do want an answer that agrees with experiment. So let me put the question differently: why did God choose to do it that way and not some other way?

Ray Laflamme: Well, given that the numbers can be negative, squaring them just seems like the simplest thing to do!

Scott: Why not just take the absolute value?

Alright, I can give you a couple of arguments for why God decided to square the amplitudes.

The first argument is a famous result called Gleason's Theorem from the 1950's. Gleason's Theorem lets us assume part of quantum mechanics and then get out the rest of it! More concretely, suppose we have some procedure that takes as input a unit vector of real numbers, and that spits out the probability of an event. Formally, we have a function f that maps a unit vector to the unit interval [0,1]. And let's suppose N=3 -- the theorem actually works in any number of dimensions three or greater (but interestingly, not in two dimensions). Then the key requirement we impose is that, whenever three vectors v₁,v₂,v₃ are all orthogonal to each other,

f(v₁) + f(v₂) + f(v₃) = 1.

Intuitively, if these three vectors represent "orthogonal ways" of measuring a quantum state, then they should correspond to mutually-exclusive events. Crucially, we don't need any assumption other than that -- no continuity, no differentiability, no nuthin'.

So, that's the setup. The amazing conclusion of the theorem is that, for any such f, there exists a mixed state such that f arises by measuring that state according to the standard measurement rule of quantum mechanics. I won't be able prove this theorem here, since it's pretty hard. But it's one way that you can "derive" the squaring rule without exactly having to put it in at the outset.

Exercise 4 for the Non-Lazy Reader: Why does Gleason's Theorem not work in two dimensions?

If you like, I can give you a much more elementary argument. This is something I put it in one of my papers, though I'm sure many others knew it before.

Let's say we want to invent a theory that's not based on the 1-norm like classical probability theory, or on the 2-norm like quantum mechanics, but instead on the p-norm for some . Call (v₁,...,v_N) a unit vector in the p-norm if

|v₁|^p+...+|v_N|^p = 1.

Then we'll need some "nice" set of linear transformations that map any unit vector in the p-norm to another unit vector in the p-norm.

It's clear that for any p we choose, there will be some linear transformations that preserve the p-norm. Which ones? Well, we can permute the basis elements, shuffle them around. That'll preserve the p-norm. And we can stick in minus signs if we want. That'll preserve the p-norm too. But here's the little observation I made: if there are any linear transformations other than these trivial ones that preserve the p-norm, then either p=1 or p=2. If p=1 we get classical probability theory, while if p=2 we get quantum mechanics.

Ray Laflamme: So if you don't want something boring...

Scott: Exactly! Then you have to set p=1 or p=2.

Exercise 5 for the Non-Lazy Reader: Prove my little observation.

Alright, to get you started, let me give some intuition about why my observation might be true. Let's assume, for simplicity, that everything is real and that p is a positive even integer (though the observation also works with complex numbers and with any real p≥0). Then for a linear transformation A=(a_ij) to preserve the p-norm means that

whenever

Now we can ask: how many constraints are imposed on the matrix A by the requirement that this be true for every v₁,...,v_N? If we work it out, in the case p=2 we'll find that there are constraints. But since we're trying to pick an N-by-N matrix, that still leaves us N(N-1)/2 degrees of freedom to play with.

On the other hand, if (say) p=4, then the number of constraints grows like , which is greater than N² (the number of variables in the matrix). That suggests that it will be hard to find a nontrivial linear transformation that preserves 4-norm. Of course it doesn't prove that no such transformation exists -- that's left as a puzzle for you.

Incidentally, this isn't the only case where we find that the 1-norm and 2-norm are "more special" than other p-norms. So for example, have you ever seen the following equation?

xⁿ + yⁿ = zⁿ

There's a cute little fact -- unfortunately I won't have time to prove it in class -- that the above equation has nontrivial integer solutions when n=1 or n=2, but not for any larger integers n. Clearly, then, if we use the 1-norm and the 2-norm more than other vector norms, it's not some arbitrary whim -- these really are God's favorite norms! (And we didn't even need an experiment to tell us that.)

Real vs. Complex Numbers

Even after we've decided to base our theory on the 2-norm, we still have at least two choices: we could let our amplitudes be real numbers, or we could let them be complex numbers. We know the solution God chose: amplitudes in quantum mechanics are complex numbers. This means that you can't just square an amplitude to get a probability; first you have to take the absolute value, and then you square that. In other words, if the amplitude for some measurement outcome is α = β + γi, where β and γ are real, then the probability of seeing the outcome is |α|² = β² + γ².

Why did God go with the complex numbers and not the real numbers?

Years ago, at Berkeley, I was hanging out with some math grad students -- I fell in with the wrong crowd -- and I asked them that exact question. The mathematicians just snickered. "Give us a break -- the complex numbers are algebraically closed!" To them it wasn't a mystery at all.

But to me it is sort of strange. I mean, complex numbers were seen for centuries as fictitious entities that human beings made up, in order that every quadratic equation should have a root. (That's why we talk about their "imaginary" parts.) So why should Nature, at its most fundamental level, run on something that we invented for our convenience?

Answer: Well, if you want every unitary operation to have a square root, then you have to go to the complex numbers...

Scott: Dammit, you're getting ahead of me!

Alright, yeah: suppose we require that, for every linear transformation U that we can apply to a state, there must be another transformation V such that V² = U. This is basically a continuity assumption: we're saying that, if it makes sense to apply an operation for one second, then it ought to make sense to apply that same operation for only half a second.

Can we get that with only real amplitudes? Well, consider the following linear transformation:

This transformation is just a mirror reversal of the plane. That is, it takes a two-dimensional Flatland creature and flips it over like a pancake, sending its heart to the other side of its two-dimensional body. But how do you apply half of a mirror reversal without leaving the plane? You can't! If you want to flip a pancake by a continuous motion, then you need to go into ... dum dum dum ... THE THIRD DIMENSION.

More generally, if you want to flip over an N-dimensional object by a continuous motion, then you need to go into the (N+1)^st dimension.

Exercise 6 for the Non-Lazy: Prove that any norm-preserving linear transformation in N dimensions can be implemented by a continuous motion in N+1 dimensions.

But what if you want every linear transformation to have a square root in the same number of dimensions? Well, in that case, you have to allow complex numbers. So that's one reason God might have made the choice She did.

Alright, I can give you two other reasons why amplitudes should be complex numbers.

The first comes from asking, how many independent real parameters are there in an N-dimensional mixed state? As it turns out, the answer is exactly N² -- provided we assume, for convenience, that the state doesn't have to be normalized (i.e., that the probabilities can add up to less than 1). Why? Well, an N-dimensional mixed state is represented mathematically by a N-by-N Hermitian matrix with positive eigenvalues. Since we're not normalizing, we've got N independent real numbers along the main diagonal. Below the main diagonal, we've got N(N-1)/2 independent complex numbers, which means N(N-1) real numbers. Since the matrix is Hermitian, the complex numbers below the main diagonal determine the ones above the main diagonal. So the total number of independent real parameters is N + N(N-1) = N².

Now we bring in an aspect of quantum mechanics that I didn't mention before. If we know the states of two quantum systems individually, then how do we write their combined state? Well, we just form what's called the tensor product. So for example, the tensor product of two qubits, α|0〉+β|1〉 and γ|0〉+δ|1〉, is given by

Again one can ask: did God have to use the tensor product? Could She have chosen some other way of combining quantum states into bigger ones? Well, maybe someone else can say something useful about this question -- I have trouble even wrapping my head around it! For me, saying we take the tensor product is almost what we mean when we say we're putting together two systems that exist independently of each other.

As you all know, there are two-qubit states that can't be written as the tensor product of one-qubit states. The most famous of these is the EPR (Einstein-Podolsky-Rosen) pair:

Given a mixed state ρ on two subsystems A and B, if ρ can be written as a probability distribution over tensor product states , then we say ρ is separable. Otherwise we say ρ is entangled.

Now let's come back to the question of how many real parameters are needed to describe a mixed state. Suppose we have a (possibly-entangled) composite system AB. Then intuitively, it seems like the number of parameters needed to describe AB -- which I'll call d_AB -- should equal the product of the number of parameters needed to describe A and the number of parameters needed to describe B:

d_AB = d_A d_B.

If amplitudes are complex numbers, then happily this is true! Letting N_A and N_B be the number of dimensions of A and B respectively, we have

d_AB = (N_A N_B)² = N_A² N_B² = d_A d_B.

But what if the amplitudes are real numbers? In that case, in an N-by-N density matrix, we'd only have N(N+1)/2 independent real parameters. And it's not the case that if N = N_A N_B then

Question: Can this same argument be used to rule out quaternions?

Scott: Excellent question. Yes! With real numbers the left-hand side is too big, whereas with quaternions it's too small. Only with complex numbers is it juuuuust right!

There's actually another phenomenon with the same "Goldilocks" flavor, which was observed by Bill Wootters -- and this leads to my third reason why amplitudes should be complex numbers. Let's say we choose a quantum state

uniformly at random (if you're a mathematician, under the Haar measure). And then we measure it, obtaining outcome |i〉 with probability |α_i|². The question is, will the resulting probability vector also be distributed uniformly at random in the probability simplex? It turns out that if the amplitudes are complex numbers, then the answer is yes. But if the amplitudes are real numbers or quaternions, then the answer is no! (I used to think this fact was just a curiosity, but now I'm actually using it in a paper I'm working on...)

Linearity

We've talked about why the amplitudes should be complex numbers, and why the rule for converting amplitudes to probabilities should be a squaring rule. But all this time, the elephant of linearity has been sitting there undisturbed. Why would God have decided, in the first place, that quantum states should evolve to other quantum states by means of linear transformations?

Answer: Because if the transformations weren't linear, you could crunch vectors to be bigger or smaller...

Scott: Close! Steven Weinberg and others proposed nonlinear variants of quantum mechanics in which the state vectors do stay the same size. The trouble with these variants is that they'd let you take far-apart vectors and squash them together, or take extremely close vectors and pry them apart! Indeed, that's essentially what it means for such theories to be nonlinear. So our configuration space no longer has this intuitive meaning of measuring the distinguishability of vectors. Two states that are exponentially close might in fact be perfectly distinguishable. And indeed, in 1998 Abrams and Lloyd used exactly this observation to show that, if quantum mechanics were nonlinear, then one could build a computer to solve NP-complete problems in polynomial time.

Question: What's the problem with that?

Scott: What's the problem with being able to solve NP-complete problems in polynomial time? Oy, if by the end of this class you still don't think that's a problem, I will have failed you... [laughter]

Seriously, of course we don't know whether NP-complete problems are efficiently solvable in the physical world. But in a survey I wrote a couple years ago, I explained why the ability to solve NP-complete problems would give us "godlike" powers -- arguably, even more so than the ability to transmit superluminal signals or reverse the Second Law of Thermodynamics. The basic point is that, when we talk about NP-complete problems, we're not just talking about scheduling airline flights (or for that matter, breaking the RSA cryptosystem). We're talking about automating insight: proving the Riemann Hypothesis, modeling the stock market, seeing whatever patterns or chains of logical deduction are there in the world to be seen.

So, suppose I maintain the working hypothesis that NP-complete problems are not efficiently solvable by physical means, and that if a theory suggests otherwise, more likely than not that indicates a problem with the theory. Then there are only two possibilities: either I'm right, or else I'm a god! And either one sounds pretty good to me...

Exercise 7 for the Non-Lazy Reader: Prove that if quantum mechanics were nonlinear, then not only could you solve NP-complete problems in polynomial time, you could also use EPR pairs to transmit information faster than the speed of light.

Question: But if I were crafting a universe in my garage, I could choose to make the speed of light equal to infinity.

Scott: Yeah, you've touched on another one of my favorite questions: why should the speed of light be finite? Well, one reason I'd like it to be finite is that, if aliens from the Andromeda galaxy are going to hurt me, then I at least want them to have to come here first!

Further Reading

See this paper by Lucien Hardy for a "derivation" of quantum mechanics that's closely related to the arguments I gave, but much, much more serious and careful. Also see pretty much anything Chris Fuchs has written (and especially this paper by Caves, Fuchs, and Schack, which discusses why amplitudes should be complex numbers rather than reals or quaternions).

Lecture 10: Quantum Computing

Alright, so now we've got this beautiful theory of quantum mechanics, and the possibly-even-more-beautiful theory of computational complexity. Clearly, with two theories this beautiful, you can't just let them stay single -- you have to set them up, see if they hit it off, etc.

And that brings us to the class BQP: Bounded-Error Quantum Polynomial-Time. We talked in Lecture 7 about BPP, or Bounded-Error Probabilistic Polynomial-Time. Informally, BPP is the class of computational problems that are efficiently solvable in the physical world if classical physics is true. Now we ask, what problems are efficiently solvable in the physical world if (as seems more likely) quantum physics is true?

To me it's sort of astounding that it took until the 1990's for anyone to really seriously ask this question, given that all the tools for asking it were in place by the 1960's or even earlier. It makes you wonder, what similarly obvious questions are there today that no one's asking?

So how do we define BQP? Well, there are four things we need to take care of.

1. Initialization. We say, we have a system consisting of n quantum bits (or qubits), and these are all initialized to some simple, easy-to-prepare state. For convenience, we usually take that to be a "computational basis state," though later in the course we'll consider relaxing that assumption. In particular, if the input string is x, then the initial state will have the form |x⟩|0...0⟩: that is, |x⟩ together with as many "ancilla" qubits as we want initialized to the all-0 state.

2. Transformations. At any time, the state of our computer will be a superposition over all 2^p(n) p(n)-bit strings, where p is some polynomial in n:

But what operations can we use to transform one superposition state to another? Since this is quantum mechanics, the operations should be unitary transformations -- but which ones? Given any Boolean function f:{0,1}ⁿ→{0,1}, there's some unitary transformation that will instantly compute the function for us -- namely, the one that maps each input |x⟩|0⟩ to |x⟩|f(x)⟩!

But of course, for most functions f, we can't apply that transformation efficiently. Exactly by analogy to classical computing -- where we're only interested in those circuits that can be built up by composing a small number of AND, OR, and NOT gates -- here we're only interested in those unitary transformations that can be built up by composing a small number of quantum gates. By a "quantum gate," I just mean a unitary transformation that acts on a small number of qubits -- say 1, 2, or 3.

Alright, let's see some examples of quantum gates. One famous example is the Hadamard gate, which acts as follows on a single qubit:

Another example is the Toffoli gate, which acts as follows on three qubits:

|000⟩ → |000⟩
|001⟩ → |001⟩
|010⟩ → |010⟩
|011⟩ → |011⟩
|100⟩ → |100⟩
|101⟩ → |101⟩
|110⟩ → |111⟩
|111⟩ → |110⟩

Or in words, the Toffoli gate flips the third qubit if and only if the first two qubits are both 1. Note that the Toffoli gate actually makes sense for classical computers as well.

Now, it was shown by Shi that the Toffoli and Hadamard already constitute a universal set of quantum gates. This means, informally, that they're all you ever need for a quantum computer -- since if we wanted to, we could use them to approximate any other quantum gate arbitrarily closely. (Or technically, any gate whose unitary matrix has real numbers only, no complex numbers. But that turns out not to matter for computing purposes.) Furthermore, by a result called the Solovay-Kitaev Theorem, any universal set of gates can simulate any other universal set efficiently -- that is, with at most a polynomial increase in the number of gates. So as long as we're doing complexity theory, it really doesn't matter which universal gate set we choose.

(This is exactly analogous to how, in the classical world, we could build our circuits out of AND, OR, and NOT gates, out of AND and NOT gates only, or even out of NAND gates only.)

Now, you might ask: which quantum gate sets have this property of universality? Is it only very special ones? On the contrary, it turns out that in a certain precise sense, almost any set of 1- and 2-qubit gates (indeed, almost any single 2-qubit gate) will be universal. But there are certainly exceptions to the rule. For example, suppose you had only the Hadamard gate (defined above) together with the following "controlled-NOT" gate:

|00⟩ → |00⟩
|01⟩ → |01⟩
|10⟩ → |11⟩
|11⟩ → |10⟩

That seems like a natural universal set of quantum gates, but it isn't. The so-called Gottesman-Knill Theorem shows that any quantum circuit consisting entirely of Hadamard and controlled-NOT gates can be simulated efficiently by a classical computer.

Now, once we fix a universal set (any universal set) of quantum gates, we'll be interested in those circuits consisting of at most p(n) gates from our set, where p is a polynomial, and n is the number of bits of the problem instance we want to solve. We call these the polynomial-size quantum circuits.

3. Measurement. How do we read out the answer when the computation is all done? Simple: we measure some designated qubit, reject if we get outcome |0⟩, and accept if we get outcome |1⟩! (Recall that for simplicity, we're only interested here in decision problems -- that is, problems having a yes-or-no answer.)

We further stipulate that, if the answer to our problem was "yes," then the final measurement should accept with probability at least 2/3, whereas if the answer was "no," then it should accept with probability at most 1/3. This is exactly the same requirement as for BPP. And as with BPP, we can replace the 2/3 and 1/3 by any other numbers we want (for example, 1-2^-500 and 2^-500), by simply repeating the computation a suitable number of times and then outputting the majority answer.

Now, immediately there's a question: would we get a more powerful model of computation if we allowed not just one measurement, but many measurements throughout the computation!

It turns out that the answer is no -- the reason being that you can always simulate a measurement (other than the final measurement, the one that "counts") using a unitary quantum gate. You can say, instead of measuring qubit A, let's apply a controlled-NOT gate from qubit A into qubit B, and then ignore qubit B for the rest of the computation. Then it's as if some third party measured qubit A -- the two views are mathematically equivalent. (Is this a trivial technical point or a profound philosophical point? You be the judge...)

4. Uniformity. Before we can give the definition of BQP, there's one last technical issue we need to deal with. We talked about a "polynomial-size quantum circuit," but more correctly it's an infinite family of circuits, one for each input length n. Now, can the circuits in this family be chosen arbitrarily, completely independent of each other? If so, then we could use them to (for example) solve the halting problem, by just hardwiring into the n^th circuit whether or not the n^th Turing machine halts. If we want to rule that out, then we need to impose a requirement called uniformity. This means that there should exist a (classical) algorithm that, given n as input, outputs the n^th quantum circuit in time polynomial in n.

Exercise. Show that letting a polynomial-time quantum algorithm output the n^th circuit would lead to the same definition.

Alright, we're finally ready to put the pieces together and give a definition of BQP.

BQP is the class of languages L ⊆ {0,1}^* for which there exists a uniform family of polynomial-size quantum circuits, {C_n}, such that for all x ∈ {0,1}ⁿ:

If x ∈ L then C_n accepts input |x⟩|0...0⟩ with probability at least 2/3. If x ∉ L then C_n accepts input |x⟩|0...0⟩ with probability at most 1/3.

Uncomputing

So, what can we say about BQP?

Well, as a first question, let's say you have a BQP algorithm that calls another BQP algorithm as a subroutine. Could that be more powerful than BQP itself? Or in other words, could BQP^BQP (that is, BQP with a BQP oracle) be more powerful than BQP?

Answer: It better not be!

Right! Incidentally, this is related to something I was talking to Dave Bacon about. Why do physicists have so much trouble understanding the class NP? I suspect it's because NP, with its "magical" existential quantifier layered on top of a polynomial-time computation, is not the sort of thing they'd ever come up with. The classes that physicists would come up with -- the physicist complexity classes -- are hard to delineate precisely, but one property I think they'd definitely have is "closure under the obvious things," like one algorithm from the class calling another algorithm from the same class as a subroutine.

I claim that BQP is an acceptable "physicist complexity class" -- and in particular, that BQP^BQP = BQP. What's the problem in showing this?

Right, garbage! Recall that when a quantum algorithm is finished, you measure just a single qubit to obtain the yes-or-no answer. So, what to do with all the other qubits? Normally you'd just throw it away. But now let's say you've got a superposition over different runs of an algorithm, and you want to bring the results of those runs together and interfere them. In that case, the garbage might prevent the different branches from interfering! So what do you do to fix this?

The solution, proposed by Bennett in the 1980's, is to uncompute. Here's how it works:

You run the subroutine.
You copy the subroutine's answer qubit to a separate location.
You run the entire subroutine backwards, thereby erasing everything except the answer qubit. (If the subroutine had some probability of error, this erasing step won't work perfectly, but it will still work pretty well.)

As you'd see if you visited my apartment, this is not the solution I generally adopt. But if you're a quantum computer, cleaning up the messes you make is a good idea.

Relation to Classical Complexity Classes

Alright, so how does BQP relate to the complexity classes we've already seen?

First, I claim that BPP ⊆ BQP: in other words, anything you can do with a classical probabilistic computer, you can also do with a quantum computer. Why?

Right: because any time you were gonna flip a coin, you just apply a Hadamard gate instead. In textbooks, this usually takes about a page to prove. We just proved it.

Can we get any upper bound on BQP in terms of classical complexity classes?

Sure we can! First of all, it's pretty easy to see that BQP ⊆ EXP: anything you can compute in quantum polynomial time you can also compute in classical exponential time. Or to put it differently, quantum computers can provide at most an exponential advantage over classical computers. Why is that?

Right: because if you allow exponential slowdown, then a classical computer can just simulate the whole evolution of the state vector!

As it turns out, we can do a lot better than that. Recall that PP is the class of problems like the following:

Given a sum of exponentially many real numbers, each of which can be evaluated in polynomial time, is the sum positive or negative (promised that one of these is the case)? Given a Boolean formula in n variables, do at least half of the 2ⁿ possible variable settings make the formula evaluate to TRUE? Given a randomized polynomial-time Turing machine, does it accept with probability ≥ 1/2?

In other words, a PP problem involves summing up exponentially many terms, and then deciding whether the sum is greater or less than some threshold. Certainly PP is contained in PSPACE is contained in EXP.

In their original paper on quantum complexity, Bernstein and Vazirani showed that BQP ⊆ PSPACE. Shortly afterward, Adleman, DeMarrais, and Huang improved their result to show that BQP ⊆ PP. (This was also the first complexity result I proved. Had I known that Adleman et al. had proved it a year before, I might never have gotten started in this business! Occasionally it's better to have a small academic light-cone.)

So, why is BQP contained in PP? From a computer science perspective, the proof is maybe half a page. From a physics perspective, the proof is three words:

FEYNMAN PATH INTEGRAL!!!

Look, let's say you want to calculate the probability that a quantum computer accepts. The obvious way to do it is to multiply a bunch of 2ⁿ-by-2ⁿ unitary matrices, then take the sum of the squares of the absolute values of the amplitudes corresponding to accepting basis states (that is, basis states for which the output qubit is |1⟩). What Feynman noticed in the 1940's is that there's a better way -- a way that's vastly more efficient in terms of memory (or paper), though still exponential in terms of time.

The better way is to loop over accepting basis states, and for each one, loop over all computational paths that might contribute amplitude to that basis state. So for example, let α_x be the final amplitude of basis state |x⟩. Then we can write

where each α_x,i corresponds to a single leaf in an exponentially-large "possibility tree," and is therefore computable in classical polynomial time. Typically, the α_x,i's will be complex numbers with wildly-differing phases, which will interfere destructively and cancel each other out; then α_x will be the tiny residue left over. The reason quantum computing seems more powerful than classical computing is precisely that it seems hard to estimate that tiny residue using random sampling. Random sampling would work fine for (say) a typical US election, but estimating α_x is more like the 2000 election.

Now, let S be the set of all accepting basis states. Then we can write the probability that our quantum computer accepts as

where * denotes complex conjugate. But this is just a sum of exponentially many terms, each of which is computable in P. We can therefore decide in PP whether p_accept≤1/3 or p_accept≥2/3.

From my perspective, Richard Feynman won the Nobel Prize in physics essentially for showing BQP is contained in PP.

Currently-known inclusions.

Of course, the question that really gets people hot under the collar is whether BPP ≠ BQP: that is, whether quantum computing is more powerful than classical. Today we have evidence that this is indeed the case, most notably Shor's algorithm for factoring and discrete log. I'll assume you've seen this algorithm, since it was one of the major scientific achievements of the late 20^th century, and is why we're here in Waterloo talking about these things in the first place. If you haven't seen it, there are about 500,000 expositions on the web.

It's worth stressing that, even before Shor's algorithm, computer scientists had amassed formal evidence that quantum computers were more powerful than classical ones. Indeed, this evidence is what paved the way for Shor's algorithm.

One major piece of evidence was Simon's algorithm, which many of you have also seen. Suppose we have a function f:{0,1}ⁿ→ {0,1}ⁿ, which we can access only as a "black box" (that is, by feeding it inputs and examining the outputs). We're promised that there exists a "secret XOR-mask" s∈{0,1}ⁿ, such that for all distinct (x,y) pairs, f(x)=f(y) if and only if x⊕y=s. (Here ⊕ denotes bitwise XOR.) Our goal is to learn the identity of s. The question is, how many times do we need to query f to do that with high probability?

Classically, it's easy to see that ~2^n/2 queries are necessary and sufficient. As soon as we find a collision (a pair x≠y such that f(x)=f(y)), we know that s=x⊕y, and hence we're done. But until we find a collision, the function looks essentially random. In particular, if we query the function on T inputs, then the probability of finding a collision is at most ~T²/2ⁿ by the union bound. Hence we need T≈2^n/2 queries to find s with high probability.

On the other hand, Simon gave a quantum algorithm that finds s using only ~n queries. The basic idea is to query f in superposition, and thereby prepare quantum states of the form

for random (x,y) pairs such that x⊕y=s. We then use the so-called quantum Fourier transform to extract information about s from these states. This use of the Fourier transform to extract "hidden periodicity information" provided a direct inspiration for Shor's algorithm, which does something similar over the abelian group Z_N instead of . (In a by-now famous story, Simon's paper got rejected the first time he submitted it to a conference -- apparently Shor was one of the few people who got the point of it.)

Again, I won't go through the details of Simon's algorithm; see here if you want them.

So, the bottom line is that we get a problem -- Simon's problem -- that quantum computers can provably solve exponentially faster than classical computers. Admittedly, this problem is rather contrived, relying as it does on a mythical "black box" for computing a function f with a certain global symmetry. Because of its black-box formulation, Simon's problem certainly doesn't prove that BPP ≠ BQP. What it does prove that there exists an oracle relative to which BPP ≠ BQP. This is what I meant by formal evidence that quantum computers are more powerful than classical ones.

As it happens, Simon's problem was not the first to yield an oracle separation between BPP and BQP. Just as Shor was begotten of Simon, so Simon was begotten of Bernstein-Vazirani. In the long-ago dark ages of 1993, Bernstein and Vazirani devised a black-box problem called Recursive Fourier Sampling. They were able to prove that any classical algorithm needs at least ~n^{log n} queries to solve this problem, whereas there exists a quantum algorithm to solve it using only n queries.

Unfortunately, even to define the Recursive Fourier Sampling problem would take a longer digression than I feel is prudent. (If you think Simon's problem was artificial, you ain't seen nuthin'!) But the basic idea is this. Suppose we have black-box access to a Boolean function f:{0,1}ⁿ→{0,1}. We're promised that there exists a "secret string" s∈{0,1}ⁿ, such that f(x)=s•x for all x (where • denotes the inner product mod 2). Our goal is to learn s, using as few queries to f as possible.

In other words: we know that f(x) is just the XOR of some subset of input bits; our goal is to find out which subset.

Classically, it's obvious that n queries to f are necessary and sufficient: we're trying to learn n bits, and each query can only reveal one! But quantumly, Bernstein and Vazirani observed that you can learn s with just a single query. To do so, you simply prepare the state

then apply Hadamard gates to all n qubits. The result is easily checked to be |s⟩.

What Bernstein and Vazirani did was to start from the problem described above -- called Fourier sampling -- and then compose it recursively. In other words, they created a Fourier sampling problem where to learn one of the bits f(x), you need to solve another Fourier sampling problem, and to learn one of the bits in that problem you need to solve a third problem, and so on. They then showed that, if the recursion is d levels deep, then any randomized algorithm to solve this Recursive Fourier Sampling problem must make at least ~n^d queries. By contrast, there exists a quantum algorithm that solves the problem using only 2^d queries.

Why 2^d queries, you ask, instead of just 1^d = 1? Because at each level of recursion, the quantum algorithm needs to uncompute its garbage to get an interference effect -- and that keeps adding an additional factor of 2. Like so:

Compute { Compute { Compute Uncompute } Uncompute { Compute Uncompute } } Uncompute { Compute { Compute Uncompute } Uncompute { Compute Uncompute } }

Indeed, one of my results shows that this sort of recursive uncomputation is unavoidable feature of any quantum algorithm for Recursive Fourier Sampling.

So, once we have this gap of n^d versus 2^d, setting d=log n gives us n^{log n} queries on a classical computer versus 2^{log n}=n queries on a quantum computer. Admittedly, this separation is not exponential versus polynomial -- it's only "quasipolynomial" versus polynomial. But that's still enough to prove an oracle separation between BPP and BQP.

You might wonder: now that we have Simon's and Shor's algorithms -- which do achieve an exponential separation between quantum and classical -- why muck around with this recursive archeological relic? Well, I'll tell you why. One of the biggest open problems in quantum computing concerns the relationship between BQP and the polynomial hierarchy PH (defined in Lecture 6). Specifically, is BQP contained in PH? Sure, it seems unlikely -- but as Bernstein and Vazirani asked back in '93, can we actually find an oracle relative to which BQP ⊄ PH? Alas, fourteen years and maybe a half-dozen disillusioned grad students later, the answer is still no. Yet many of us still believe a separation should be possible -- and the significance of Recursive Fourier Sampling is that it's practically the only candidate problem we have for such a separation.

Quantum Computing and NP-complete Problems

From reading newspapers, magazines, Slashdot, and so on, one would think a quantum computer could "solve NP-complete problems in a heartbeat" by "trying every possible solution in parallel," and then instantly picking the correct one.

Well, that's a crock. Indeed, arguably it's the central crock about quantum computing.

Obviously, we can't yet prove that quantum computers can't solve NP-complete problems efficiently -- in other words, that NP ⊄ BQP -- since we can't even prove that P ≠ NP! Nor do we have any idea how to prove that if P ≠ NP then NP ⊄ BQP.

What we do have is the early result of Bennett, Bernstein, Brassard, and Vazirani, that there exists an oracle relative to which NP ⊄ BQP. More concretely, suppose you're searching a space of 2ⁿ possible solutions for a single valid one, and suppose that all you can do, given a candidate solution, is feed it to a 'black box' that tells you whether that solution is correct or not. Then how many times do you need to query the black box to find the valid solution? Classically, it's clear that you need to query it ~2ⁿ times in the worst case (or ~2ⁿ/2 times on average). On the other hand, Grover famously gave a quantum search algorithm that queries the black box only ~2^n/2 times. But even before Grover's algorithm was discovered, Bennett et al. had proved that it was optimal! In other words, any quantum algorithm to find a needle in a size-2ⁿ haystack needs at least ~2^n/2 steps. So the bottom line is that, for "generic" or "unstructured" search problems, quantum computers can give some speedup over classical computers -- specifically, a quadratic speedup -- but nothing like the exponential speedup of Shor's factoring algorithm.

You might wonder: why should the speedup be quadratic, rather than cubic or something else? Let me try to answer that question without getting into the specifics either of Grover's algorithm, or of the Bennett et al. optimality proof. Basically, the reason we get a quadratic speedup is that quantum mechanics is based on the L₂ norm rather than the L₁ norm. Classically, if there are N solutions, only one of which is right, then after one query we have a 1/N probability of having guessed the right solution, after two queries we have a 2/N probability, after three queries a 3/N probability, and so on. Thus we need ~N queries to have a non-negligible (i.e. close to 1) probability of having guessed the right solution. But quantumly, we get to apply linear transformations to vectors of amplitudes, which are the square roots of probabilities. So the way to think about it is this: after one query we have a 1/√N amplitude of having guessed the right solution, after two queries we have a 2/√N amplitude, after three queries a 3/√N amplitude, and so on. So after T queries, the amplitude of having guessed a right solution is T/√N, and the probability is |T/√N|² = T²/N. Hence the probability will be close to 1 after only T ≈ √N queries.

Alright, those of you who read my blog must be tired of polemics about the limitations of quantum computers on unstructured search problems. So I'm going to take the liberty of ending this section now.

Quantum Computing and Many-Worlds

Since this course is Quantum Computing Since Democritus, I guess I should end today's lecture with a deep philosophical question. Alright, so how about this one: if we managed to build a nontrivial quantum computer, would that demonstrate the existence of parallel universes?

David Deutsch, one of the founders of quantum computing in the 1980's, certainly thinks that it would. (Though to be fair, Deutsch thinks the impact would "merely" be psychological -- since for him, quantum mechanics has already proved the existence of parallel universes!) Deutsch is fond of asking questions like the following: if Shor's algorithm succeeds in factoring a 3000-digit integer, then where was the number factored? Where did the computational resources needed to factor the number come from, if not from some sort of 'multiverse' exponentially bigger than the universe we see? To my mind, Deutsch seems to be tacitly assuming here that factoring is not in BPP -- but no matter; for purposes of argument we can certainly grant him that assumption.

It should surprise no one that Deutsch's views about this are far from universally accepted. Many who agree about the possibility of building quantum computers, and the formalism needed to describe them, nevertheless disagree that the formalism is best interpreted in terms of "parallel universes." To Deutsch, these people are simply intellectual wusses -- like the Churchmen who agreed that Copernican system was practically useful, so long as one remembers that obviously the Earth doesn't really go around the sun.

So, how do the intellectual wusses respond to the charges? For one thing, they point out that viewing a quantum computer in terms of "parallel universes" raises serious difficulties of its own. In particular, there's what those condemned to worry about such things call the "preferred basis problem." The problem is basically this: how do we define a "split" between one parallel universe and another? There are infinitely many ways you could imagine slicing up a quantum state, and it's not clear why one is better than another!

One can push the argument further. The key thing that quantum computers rely on for speedups -- indeed, the thing that makes quantum mechanics different from classical probability theory in the first place -- is interference between positive and negative amplitudes. But to whatever extent different "branches" of the multiverse can usefully interfere for quantum computing, to that extent they don't seem like separate branches at all! I mean, the whole point of interference is to mix branches together so that they lose their individual identities. If they retain their identities, then for exactly that reason we don't see interference.

Of course a many-worlder could respond that, in order to lose their separate identities by interfering with each others, the branches had to be there in the first place! And the argument could go on (indeed, has gone on) for quite a while.

Rather than take sides in this fraught, fascinating, but perhaps ultimately-meaningless debate, I'd like to end with one observation that's not up for dispute. What Bennett et al.'s lower bound tells us is that, if quantum computing supports the existence of parallel universes, then it certainly doesn't do so in the way most people think! As we've seen, a quantum computer is not a device that could "try every possible solution in parallel" and then instantly pick the correct one. If we insist on seeing things in terms of parallel universes, then those universes all have to "collaborate" -- more than that, have to meld into each other -- to create an interference pattern that will lead to the correct answer being observed with high probability.

Lecture 10.5: Penrose

So, you guys finally finished reading Roger Penrose's The Emperor's New Mind? What did you think of it?

(Since I forgot to record this lecture, the class responses are tragically lost to history. But if I recall correctly, the entire class turned out to consist of -- YAWN -- straitlaced, clear-thinking materialistic reductionists who correctly pointed out the glaring holes in Penrose's arguments. No one took Penrose's side, even just for sport.)

Alright, so let me try a new tack: who can summarize Penrose's argument (or more correctly, a half-century-old argument adapted by Penrose) in a few sentences?

How about this: Gödel's First Incompleteness Theorem tells us that no computer, working within a fixed formal system F such as Zermelo-Fraenkel set theory, can prove the sentence

G(F) = "This sentence cannot be proved in F."

But we humans can just "see" the truth of G(F) -- since if G(F) were false, then it would be provable, which is absurd! Therefore the human mind can do something that no present-day computer can do. Therefore consciousness can't be reducible to computation.

Alright, class: problems with this argument?

Yeah, there are two rather immediate ones:

Why does the computer have to work within a fixed formal system F? Can humans "see" the truth of G(F)?

Actually, the response I prefer encapsulates both of the above responses as "limiting cases." Recall from Lecture 3 that, by the Second Incompleteness Theorem, G(F) is equivalent to Con(F): the statement that F is consistent. Furthermore, this equivalence can be proved in F itself for any reasonable F. This has two important implications.

First, it means that when Penrose claims that humans can "see" the truth of G(F), really he's just claiming that humans can see the consistency of F! When you put it that way, the problems become more apparent: how can humans see the consistency of F? Exactly which F's are we talking about: Peano Arithmetic? ZF? ZFC? ZFC with large cardinal axioms? Can all humans see the consistency of all these systems, or do you have to be a Penrose-caliber mathematician to see the consistency of the stronger ones? What about the systems that people thought were consistent, but that turned out not to be? And even if you did see the consistency of (say) ZF, how would you convince someone else that you'd seen it? How would the other person know you weren't just pretending?

(Models of Zermelo-Fraenkel set theory are like those 3D dot pictures: sometimes you really have to squint...)

The second implication is that, if we grant a computer the same freedom that Penrose effectively grants to humans -- namely, the freedom to assume the consistency of the underlying formal system -- then the computer can prove G(F).

So the question boils down to this: can the human mind somehow peer into the Platonic heavens, in order to directly perceive (let's say) the consistency of ZF set theory? If the answer is no -- if we can only approach mathematical truth with the same unreliable, savannah-optimized tools that we use for doing the laundry, ordering Chinese takeout, etc. -- then it seems we ought to grant computers the same liberty of being fallible. But in that case, the claimed distinction between humans and machines would seem to evaporate.

(Perhaps Turing himself said it best: "If we want a machine to be intelligent, it can't also be infallible. There are theorems that say almost exactly that.")

In my opinion, then, Penrose doesn't need to be talking about Gödel's theorem at all. The Gödel argument turns out to be just a mathematical restatement of the oldest argument against reductionism in the book: "sure a computer could say it perceives G(F), but it'd just be shuffling symbols around! When I say I perceive G(F), I really mean it! There's something it feels like to be me!"

The obvious response is equally old: "what makes you so sure that it doesn't feel like anything to be a computer?"

Years ago I parodied Penrose's argument by means of the Gödel CAPTCHA. Recall from Lecture 4 that a CAPTCHA (Completely Automated Public Turing Test to tell Computers and Humans and Apart) is a test that today's computers can generate and grade, but not pass. These are those "retype the curvy-looking nonsense word" deals that Yahoo and Google use all the time to root out spambots. Alas, today's CAPTCHA's are far from perfect; some of them have even been broken by clever researchers.

By exploiting Penrose's insights, I was able to create a completely unbreakable CAPTCHA. How does it work? It simply asks whether you believe the Gödel sentence G(F) for some reasonable formal system F! Assuming you answer yes, it then (and this is a minor security hole I should really patch sometime) asks whether you're a human or a machine. If you say you're a human, you pass. If, on the other hand, you say you're a machine, the program informs you that, while your answer happened to be correct in this instance, you clearly couldn't have arrived at it via a knowably sound procedure, since you don't possess the requisite microtubules. Therefore your request for an email account must unfortunately be denied.

Opening the Black Box

Alright, look: Roger Penrose is one of the greatest mathematical physicists on Earth. Is it possible that we've misconstrued his thinking?

To my mind, the most plausible-ish versions of Penrose's argument are the ones based on an "asymmetry of understanding": namely that, while we know the internal workings of a computer, we don't yet know the internal workings of the brain.

How can one exploit this asymmetry? Well, given any known Turing machine M, it's certainly possible to construct a sentence that stumps M:

S(M) = "Machine M will never output this sentence."

There are two cases: either M outputs S(M), in which case it utters a falsehood, or else M doesn't output S(M), in which case there's a mathematical truth to which it can never assent.

The obvious response is, why can't we play the same game with humans?

"Roger Penrose will never output this sentence."

Well, conceivably there's an answer: because we can formalize what it means for M to output something, by examining its inner workings. (Indeed, "M" is really just shorthand for the appropriate Turing machine state diagram.) But can we formalize what it means for Penrose to output something? The answer depends on what we believe about the internal workings of the brain (or more precisely, Penrose's brain)! And this leads to Penrose's view of the brain as "non-computational."

A common misconception is that Penrose thinks the brain is a quantum computer. In reality, a quantum computer would be much weaker than he wants! As we saw before, quantum computers don't even seem able to solve NP-complete problems in polynomial time. Penrose, by contrast, wants the brain to solve uncomputable problems, by exploiting hypothetical collapse effects from a yet-to-be-discovered quantum theory of gravity.

When Penrose visited Perimeter Institute a few years ago, I asked him: why not go further, and conjecture that the brain can solve problems that are uncomputable even given an oracle for the halting problem, or an oracle for the halting problem for Turing machines with an oracle for the halting problem, etc.? His response was that yes, he'd conjecture that as well.

My own view has always been that, if Penrose really wants to speculate about the impossibility of simulating the brain on a computer, then he ought to talk not about computability but about complexity. The reason is simply that, in principle, we can always simulate a person by building a huge lookup table, which encodes the person's responses to every question that could ever be asked within (say) a million years. If we liked, we could also have the table encode the person's voice, gestures, facial expressions, etc. Clearly such a table will be finite. So there's always some computational simulation of a human being -- the only question is whether or not it's an efficient one!

You might object that, if people could live for an infinite or even just an arbitrarily long time, then the lookup table wouldn't be finite. This is true but irrelevant. The fact is, people regularly do decide that other people have minds after interacting with them for just a few minutes! (Indeed, maybe just a few minutes of email or instant messaging.) So unless you want to retreat into Cartesian skepticism about everyone you've ever met on MySpace, Gmail chat, the Shtetl-Optimized comment section, etc., there must be a relatively small integer n such that by exchanging at most n bits, you can be reasonably sure that someone else has a mind.

In Shadows of the Mind (the "sequel" to The Emperor's New Mind), Penrose concedes that a human mathematician could always be simulated by a computer with a huge lookup table. He then argues that such a lookup table wouldn't constitute a "proper" simulation, since (for example) there'd be no reason to believe that any given statement in the table was true rather than false. The trouble with this argument is that it explicitly retreats from what one might have thought was Penrose's central claim: namely, that a machine can't even simulate human intelligence, let alone exhibit it!

In Shadows, Penrose offers the following classification of views on consciousness:

Consciousness is reducible to computation (the view of strong-AI proponents)
Sure, consciousness can be simulated by a computer, but the simulation couldn't produce "real understanding" (John Searle's view)
Consciousness can't even be simulated by computer, but nevertheless has a scientific explanation (Penrose's own view, according to Shadows)
Consciousness doesn't have a scientific explanation at all (the view of 99% of everyone who ever lived)

Now it seems to me that, in dismissing the lookup table as not a "real" simulation, Penrose is retreating from view C to view B. For as soon as we say that passing the Turing Test isn't good enough -- that one needs to "pry open the box" and examine a machine's internal workings to know whether it thinks or not -- what could possibly be the content of view C that would distinguish it from view B?

Again, though, I want to bend over backwards to see if I can figure out what Penrose might be saying.

In science, you can always cook up a theory to "explain" the data you've seen so far: just list all the data you've got, and call that your "theory"! The obvious problem here is overfitting. Since your theory doesn't achieve any compression of the original data -- i.e., since it takes as many bits to write down your theory as to write down the data itself -- there's no reason to expect your theory to predict future data. In other words, your theory is a useless piece of shit.

So, when Penrose says the lookup table isn't a "real" simulation, perhaps what he means is this. Of course one could write a computer program to converse like Disraeli or Churchill, by simply storing every possible quip and counterquip. But that's the sort of overfitting up with which we must not put! The relevant question is not whether we can simulate Sir Winston by any computer program. Rather, it's whether we can simulate him by a program that can be written down inside the observable universe -- one that, in particular, is dramatically shorter than a list of all possible conversations with him.

Now, here's the point I keep coming back to: if this is what Penrose means, then he's left the world of Gödel and Turing far behind, and entered my stomping grounds -- the Kingdom of Computational Complexity. How does Penrose, or anyone else, know that there's no small Boolean circuit to simulate Winston Churchill? Presumably we wouldn't be able to prove such a thing, even supposing (for the sake of argument) that we knew what a Churchill simulator meant! All ye who would claim the intractability of finite problems: that way lieth the P versus NP beast, from whose 2ⁿ jaws no mortal hath yet escaped.

At Risk of Stating the Obvious

Even if we supposed the brain was solving a hard computational problem, it's not clear why that would bring us any closer to understanding consciousness. If it doesn't feel like anything to be a Turing machine, then why does it feel like something to be a Turing machine with an oracle for the halting problem?

All Aboard the Holistic Quantum Gravy Train

Let's set aside the specifics of Penrose's ideas, and ask a more general question. Should quantum mechanics have any affect on how we think about the brain?

The temptation is certainly a natural one: consciousness is mysterious, quantum mechanics is also mysterious, therefore they must be related somehow! Well, maybe there's slightly more to it than that, since the source of the mysteriousness seems the same in both cases: namely, how do we reconcile a third-person description of the world with a first-person experience of it?

When people try to make the question more concrete, they often end up asking: "is the brain a quantum computer?" Well, it might be, but I can think of at least four good arguments against this possibility:

The problems for which quantum computers are believed to offer dramatic speedups -- factoring integers, solving Pell's equation, simulating quark-gluon plasmas, approximating the Jones polynomial, etc. -- just don't seem like the sorts of things that would have increased Oog the Caveman's reproductive success relative to his fellow cavemen.
Even if humans could benefit from quantum computing speedups, I don't see any evidence that they're actually doing so. (It's said that Gauss could immediately factor large integers in his head -- but if so, that only proves that Gauss's brain was a quantum computer, not that anyone else's is!)
The brain is a hot, wet environment, and it's hard to understand how long-range coherence could be maintained there. (With today's understanding of quantum error-correction, this is no longer a knock-down argument, but it's still an extremely strong one.)
As I mentioned earlier, even if we suppose the brain is a quantum computer, it doesn't seem to get us anywhere in explaining consciousness, which is the usual problem that these sorts of speculations are invoked to solve!

Alright, look. So as not to come across as a total curmudgeon -- for what could possibly be further from my personality? -- let me at least tell you what sort of direction I would pursue if I were a woo-woo quantum mystic.

Near the beginning of Emperor's New Mind, Penrose brings up one of my all-time favorite thought experiments: the teleportation machine. This is a machine that whisks you around the galaxy at the speed of light, by simply scanning your whole body, encoding all the cellular structures as pure information, and then transmitting the information as radio waves. When the information arrives at its destination, nanobots (of the sort we'll have in a few decades, according to Ray Kurzweil et al.) use the information to reconstitute your physical body down to the smallest detail.

Oh, I forgot to mention: since obviously we don't want two copies of you running around, the original is destroyed by a quick, painless gunshot to the head. So, fellow scientific reductionists: which one of you wants to be the first to travel to Mars this way?

What, you feel squeamish about it? Are you going to tell me you're somehow attached to the particular atoms that currently reside in your brain? As I'm sure you're aware, those atoms are replaced every few weeks anyway. So it can't be the atoms themselves that make you you; it has to be the patterns of information they encode. And as long as the information is safely on its way to Mars, who cares about the original meat hard-drive?

So, soul or bullet: take your pick!

Quantum mechanics does offer a third way out of this dilemma, one that wouldn't make sense in classical physics.

Suppose some of the information that made you you was actually quantum information. Then even if you were a thoroughgoing materialist, you could still have an excellent reason not to use the teleportation machine: because, as a consequence of the No-Cloning Theorem, no such machine could possibly work as claimed.

This is not to say that you couldn't be teleported around at the speed of light. But the teleportation process would have to be very different from the one above: it could not involve copying you and then killing the original copy. Either you could be sent as quantum information, or else -- if that wasn't practical -- you could use the famous BBCJPW protocol, which sends only classical information, but also requires prior entanglement between the sender and the receiver. In either case, the original copy of you would disappear unavoidably, as part of the teleportation process itself. Philosophically, it would be just like flying from Newark to LAX: you wouldn't face any profound metaphysical dilemma about "whether to destroy the copy of you still at Newark."

Of course, this neat solution can only work if the brain stores quantum information. But crucially, in this case we don't have to imagine that the brain is a quantum computer, or that it maintains entanglement across different neurons, or anything harebrained like that. As in quantum key distribution, all we need are individual coherent qubits.

Now, you might argue that in a hot, wet, decoherent place like the brain, not even a single qubit would survive for very long. And from what little I know of neuroscience, I'd tend to agree. In particular, it does seem that long-term memories are encoded as synaptic strengths, and that these strengths are purely classical information that a nanobot could in principle scan and duplicate without any damage to the original brain. On the other hand, what about (say) whether you're going to wiggle your left finger or your right finger three seconds from now? Is that decision determined in part by quantum events?

Well, whatever else you might think about such a hypothesis, it's clear what it would take to falsify it. You'd simply have to build a machine that scanned a person's brain, and correctly predicted which finger that person would wiggle three seconds from now. (If I remember correctly, computers hooked up to EEG machines can make these sorts of predictions today, but only a small fraction of a second in advance -- not three seconds.)

Lecture 11: Decoherence and Hidden Variables

Why have so many great thinkers found quantum mechanics so hard to swallow? To hear some people tell it, the whole source of the trouble is that "God plays dice with the universe" -- that whereas classical mechanics could in principle predict the fall of every sparrow, quantum mechanics gives you only statistical predictions.

Well, you know what? Whup-de-f@#%ing-doo! If indeterminism were the only mystery about quantum mechanics, quantum mechanics wouldn't be mysterious at all. We could imagine, if we liked, that the universe did have a definite state at any time, but that some fundamental principle (besides the obvious practical difficulties) kept us from knowing the whole state. This wouldn't require any serious revision of our worldview. Sure, "God would be throwing dice," but in such a benign way that not even Einstein could have any real beef with it.

The real trouble in quantum mechanics is not that the future trajectory of a particle is indeterministic -- it's that the past trajectory is also indeterministic! Or more accurately, the very notion of a "trajectory" is undefined, since until you measure, there's just an evolving wavefunction. And crucially, because of the defining feature of quantum mechanics -- interference between positive and negative amplitudes -- this wavefunction can't be seen as merely a product of our ignorance, in the same way that a probability distribution can.

Today I want to tell you about decoherence and hidden-variable theories, which are two kinds of stories that people tell themselves to feel better about these difficulties.

The hardheaded physicist will of course ask: given that quantum mechanics works, why should we waste our time trying to feel better about it? Look, if you teach an introductory course on quantum mechanics, and the students don't have nightmares for weeks, tear their hair out, wander around with bloodshot eyes, etc., then you probably didn't get the point across. So rather than deny this aspect of quantum mechanics -- rather than cede the field to the hucksters and charlatans, the Deepak Chopras and Brian Josephsons -- shouldn't we map it out ourselves, even sell tickets to the tourists? I mean, if you're going to leap into the abyss, better you should go with an experienced guide who's already been there and back.

Into the Abyss

Alright, so consider the following thought experiment. Let |R⟩ be a state of all the particles in your brain, that corresponds to you looking at a red dot. Let |B⟩ be a state that corresponds to you looking at a blue dot. Now imagine that, in the far future, it's possible to place your brain into a coherent superposition of these two states:

At least to a believer in the Many-Worlds Interpretation, this experiment should be dull as dirt. We've got two parallel universes, one where you see a red dot and the other where you see a blue dot. According to quantum mechanics, you'll find yourself in the first universe with probability |3/5|²=9/25, and in the second universe with probability |4/5|²=16/25. What's the problem?

Well, now imagine that we apply some unitary operation to your brain, which changes its state to

Still a cakewalk! Now you see the red dot with probability 16/25 and the blue dot with probability 9/25.

Aha! But conditioned on seeing the red dot at the earlier time, what's the probability that you'll see the blue dot at the later time?

In ordinary quantum mechanics, this is a meaningless question! Quantum mechanics gives you the probability of getting a certain outcome if you make a measurement at a certain time, period. It doesn't give you multiple-time or transition probabilities -- that is, the probability of an electron being found at point y at time t+1, given that had you measured the electron at time t (which you didn't), it "would have" been at point x. In the usual view, if you didn't actually measure the electron at time t, then it wasn't anywhere at time t: it was just in superposition. And if you did measure it at time t, then of course that would be a completely different experiment!

But why should we care about multiple-time probabilities? For me, it has to do with the reliability of memory. The issue is this: does the "past" have any objective meaning? Even if we don't know all the details, is there necessarily some fact-of-the-matter about what happened in history, about which trajectory the world followed to reach its present state? Or does the past only "exist" insofar as it's reflected in memories and records in the present?

The latter view is certainly the more natural one in quantum mechanics. But as John Bell pointed out, if we take it seriously, then it would seem difficult to do science! For what could it mean to make a prediction if there's no logical connection between past and future states -- if by the time you finish reading this sentence, you might as well find yourself deep in the Amazon rainforest, with all the memories of your trip there conveniently inserted, and all the memories of sitting at a computer reading quantum computing lecture notes conveniently erased?

(Still here? Good!)

Look, we all have fun ridiculing the creationists who think the world sprang into existence on October 23, 4004 BC at 9AM (presumably Babylonian time), with the fossils already in the ground, light from distant stars heading toward us, etc. But if we accept the usual picture of quantum mechanics, then in a certain sense the situation is far worse: the world (as you experience it) might as well not have existed 10^-43 seconds ago!

Story #1: Decoherence

The standard response to these difficulties appeals to a powerful idea called decoherence. Decoherence tries to explain why we don't notice "quantum weirdness" in everyday life -- why the world of our experience is a more-or-less classical world. From the standpoint of decoherence, sure there might not be any objective fact about which slit an electron went through, but there is an objective fact about what you ate for breakfast this morning: the two situations are not the same!

The basic idea is that, as soon as the information encoded in a quantum state "leaks out" into the external world, that state will look locally like a classical state. In other words, as far as a local observer is concerned, there's no difference between a classical bit and a qubit that's become hopelessly entangled with the rest of the universe.

So for example, suppose we have a qubit in the state

And suppose this qubit becomes entangled with a second qubit, to form the following joint state:

If we now ignore the second qubit and look only at the first qubit, the first qubit will be in what physicists call the maximally mixed state:

(Other people just call it a classical random bit.) In other words, no matter what measurement you make on the first qubit, you'll just get a random outcome. You're never going to see interference between the |00⟩ and |11⟩ "branches" of the wavefunction. Why? Because according to quantum mechanics, two branches will only interfere if they become identical in all respects. But there's simply no way, by changing the first qubit alone, to make |00⟩ identical to |11⟩. The second qubit will always give away our would-be lovers' differing origins.

To see an interference pattern, you'd have to perform a joint measurement on the two qubits together. But what if the second qubit was a stray photon, which happened to pass through your experiment on its way to the Andromeda galaxy? Indeed, when you consider all the junk that might be entangling itself with your delicate experiment -- air molecules, cosmic rays, geothermal radiation ... well, whatever, I'm not an experimentalist -- it's as if the entire rest of the universe is constantly trying to "measure" your quantum state, and thereby force it to become classical! Sure, even if your quantum state does collapse (i.e. become entangled with the rest of the world), in principle you can still get the state back -- by gathering together all the particles in the universe that your state has become entangled with, and then reversing everything that's happened since the moment of collapse. That would be sort of like Pamela Anderson trying to regain her privacy, by tracking down every computer on Earth that might contain photos of her!

If we accept this picture, then it explains two things:

Most obviously, it explains why in everyday life, we don't usually see objects quantumly interfering with their parallel-universe doppelgängers. (Unless we happen to live in a dark room with two slits in the wall...). Basically, it's the same reason why we don't see eggs unscramble themselves.
As the flip side, the picture also explains why it's so hard to build quantum computers: because not only are we trying to keep errors from leaking into our computer, we're trying to keep the computer from leaking into the rest of the world! We're fighting against decoherence, one of the most pervasive processes in the universe. Indeed, it's precisely because decoherence is so powerful that the quantum fault-tolerance theorem came as a shock to many physicists. (The fault-tolerance theorem says roughly that, if the rate of decoherence per qubit per gate operation is below a constant threshold, then it's possible in principle to correct errors faster than they occur, and thereby perform an arbitrarily long quantum computation.)

So, what about the thought experiment from before -- the one where we place your brain into coherent superpositions of seeing a blue dot and seeing a red dot, and then ask about the probability that you see the dot change color? From a decoherence perspective, the resolution is that the thought experiment is completely ridiculous, since brains are big, bulky things that constantly leak electrical signals, and therefore any quantum superposition of two neural firing patterns would collapse (i.e., become entangled with the rest of the universe) in a matter of nanoseconds.

Fine, a skeptic might retort. But what if in the far future, it were possible to upload your entire brain into a quantum computer, and then put the quantum computer into a superposition of seeing a blue dot and seeing a red dot? Huh? Then what's the probability that "you" (i.e. the quantum computer) would see the dot change color?

When I put this question to John Preskill years ago, he said that decoherence itself -- in other words, an approximately classical universe -- seemed to him like an important component of subjective experience as we understand it. And therefore, if you artificially removed decoherence, then it might no longer make sense to ask the same questions about subjective experience that we're used to asking. I'm guessing that this would be a relatively popular response, among those physicists who are philosophical enough to say anything at all.

Decoherence and the Second Law

We are going to get to hidden variables. But first, I want to say one more thing about decoherence.

When I was talking before about the fragility of quantum states -- how they're so easy to destroy, so hard to put back together -- you might have been struck by a parallel with the Second Law of Thermodynamics. Obviously that's just a coincidence, right? Duhhh, no. The way people think about it today, decoherence is just one more manifestation of the Second Law.

Let's see how this works. Given a probability distribution D=(p₁,...,p_N), recall that the entropy of D is

Then given a quantum mixed state ρ, the von Neumann entropy of ρ is defined to be the minimum, over all unitary transformations U, of the entropy of the probability distribution that results from measuring UρU^-1 in the standard basis. To illustrate, every pure state has an entropy of 0, whereas the one-qubit maximally mixed state has an entropy of 1.

Now, if we assume that the universe is always in a pure state, then the "entropy of the universe" starts out 0, and remains 0 for all time! On the other hand, the entropy of the universe isn't really what we care about -- we care about the entropy of this or that region. And we saw before that, as previously-separate physical systems interact with each other, they tend to evolve from pure states into mixed states -- and therefore their entropy goes up. In the decoherence perspective, this is simply the Second Law at work.

Another way to understand the relationship between decoherence and the Second Law, is by taking a "God's-eye view" of the entire multiverse. Generically speaking, the different branches of the wavefunction could be constantly interfering with each other, splitting and merging in a tangled bush:

What decoherence theory says is that in the real world, the branches look more like a nicely pruned tree:

In principle, any two branches of this tree could collide with each other, thereby leading to "macroscopic interference effects," like in my story with the blue and red dots. But in practice, this is astronomically unlikely -- since to collide, two branches would have to become identical in every respect.

Notice that if we accept this tree picture of multiverse, then it immediately gives us a way to define the "arrow of time" -- that is, to state non-circularly what the difference is between the future and the past. Namely, we can say that the past is the direction toward the root of the "multiverse tree," and the future is the direction toward the leaves. According to the decoherence picture, this is actually equivalent to saying that the future is the direction where entropy increases, and it's also equivalent to saying that the past is the direction we remember while the future is the direction we don't.

The tree picture also lets us answer the conundrums from before about the reliability of memory. According to the tree picture, even though in principle we need not have a unique "past," in practice we usually do: namely, the unique path that leads from the root of the multiverse tree to our current state. Likewise, even though in principle quantum mechanics need not provide multiple-time probabilities -- that is, probabilities for what we're going to experience tomorrow, conditioned on what we're experiencing today -- in practice such probabilities usually make perfect sense, for the same reason they make sense in the classical world. That is, when it comes to transitions between subjective experiences, in practice we're dealing not with unitary matrices but with stochastic matrices.

At this point the sharp-eyed reader might notice a problem: won't the branches have to collide eventually, when the tree "runs out of room to expand"? The answer is yes. Firstly, if the Hilbert space is finite-dimensional, then obviously the parallel universes can only branch off a finite number times before they start bumping into each other. But even in an infinite-dimensional Hilbert space, we need to think of each universe as having some finite "width" (think of Gaussian wavepackets for example), so again we can only have a finite number of splittings.

The answer of decoherence theory is that yes, eventually the branches of the multiverse will start interfering with each other -- just like eventually the universe will reach thermal equilibrium. But by that time we'll presumably all be dead.

Incidentally, the fact that our universe is expanding exponentially -- that there's this vacuum energy pushing the galaxies apart -- seems like it might play an important role in "thinning out the multiverse tree," and thereby buying us more time until the branches start interfering with each other. This is something I'd like to understand better.

Oh, yes: I should also mention the "deep" question that I'm glossing over entirely here. Namely, why did the universe start out in such a low-entropy, unentangled state to begin with? Of course one can try to give an anthropic answer to that question, but is there another answer?

Story #2: Hidden Variables

Despite how tidy the decoherence story seems, there are some people for whom it remains unsatisfying. One reason is that the decoherence story had to bring in a lot of assumptions seemingly extraneous to quantum mechanics itself: about the behavior of typical physical systems, the classicality of the brain, and even the nature of subjective experience. A second reason is that the decoherence story never did answer our question about the probability you see the dot change color -- instead the story simply "pulled a Wittgenstein" (that is, tried to convince us the question was meaningless)!

So if the decoherence story doesn't make you sleep easier, then what else is on offer at the quantum bazaar? Well, now it's the hidden-variable theorists' turn to hawk their wares. (Most of the rest of this lecture will follow my paper Quantum Computing and Hidden Variables.)

The idea of hidden-variable theories is simple. If we think of quantum mechanics as describing this vast roiling ocean of parallel universes, constantly branching off, merging, and cancelling each other out, then we're now going to stick a little boat in that ocean. We'll think of the boat's position as representing the "real," "actual" state of the universe at a given point in time, and the ocean as just a "field of potentialities" whose role is to buffet the boat around. For historical reasons, the boat's position is called a hidden variable -- even though in some sense, it's the only part of this setup that's not hidden! Now, our goal will be to make up an evolution rule for the boat, such that at any time, the probability distribution over possible boat positions is exactly the |ψ|² distribution predicted by standard quantum mechanics.

By construction, then, hidden-variable theories are experimentally indistinguishable from standard quantum mechanics. So presumably there can be no question of whether they're "true" or "false" -- the only question is whether they're good or bad stories.

You might say, why should we worry about these unfalsifiable goblins hiding in quantum mechanics' closet? Well, I'll give you four reasons.

For me, part of what it means to understand quantum mechanics is to explore the space of possible stories that can be told about it. If we don't do so, then we risk making fools ourselves by telling people that a certain sort of story can't be told when in fact it can, or vice versa. (There's plenty of historical precedent for this.)
As we'll see, hidden-variable theories lead to all sorts of meaty, nontrivial math problems, some of which are still open. And in the end, isn't that reason enough to study anything?
Thinking about hidden variables seems scientifically fruitful: it led Einstein, Podolsky, and Rosen to the EPR experiment, Bell to Bell's Inequality, Kochen and Specker to the Kochen-Specker Theorem, and me to the collision lower bound.
Hidden-variable theories will give me a perfect vehicle for discussing other issues in quantum foundations -- like nonlocality, contextuality, and the role of time. In other words, you get lots of goblins for the price of one!

From my perspective, a hidden-variable theory is simply a rule for converting a unitary transformation into a classical probabilistic transformation. In other words, it's a function that takes as input an N-by-N unitary matrix U=(u_ij) together with a quantum state

and that produces as output an N-by-N stochastic matrix S=(s_ij). (Recall that a stochastic matrix is just a nonnegative matrix where every column sums to 1.) Given as input the probability vector obtained from measuring |ψ⟩ in the standard basis, this S should produce as output the probability vector obtained from measuring U|ψ⟩ in the standard basis. In other words, if

then we must have

This is what it means for a hidden-variable theory to reproduce the predictions of quantum mechanics: it means that, whatever story we want to tell about correlations between boat positions at different times, certainly the marginal distribution over boat positions at any individual time had better be the usual quantum-mechanical one.

OK, obvious question: given a unitary matrix U and a state |ψ⟩, does a stochastic matrix satisfying the above condition necessarily exist?

Sure it does! For we can always take the product transformation

which just "picks the boat up and puts it back down at random," completely destroying any correlation between the initial and final positions.

No-Go Theorems Galore

So the question is not whether we can find a stochastic transformation S(|ψ⟩,U) that maps the initial distribution to the final one. Certainly we can! Rather, the question is whether we can find a stochastic transformation satisfying "nice" properties. But which "nice" properties might we want? I'm now going to suggest four possibilities -- and then show that, alas, not one of them can be satisfied. The point of going through this exercise is that, along the way, we're going to learn an enormous amount about how quantum mechanics differs from classical probability theory. In particular, we'll learn about Bell's Theorem, the Kochen-Specker Theorem, and two other no-go theorems that as far as I know don't have names.

1. Independence from the State

Alright, so recall the problem at hand: we're given a unitary matrix U and quantum state |ψ⟩, and want to cook up a stochastic matrix S = S(|ψ⟩,U) that maps the distribution obtained by measuring |ψ⟩ to the distribution obtained by measuring U|ψ⟩.

The first property we might want is that S should depend only on the unitary U, and not on the state |ψ⟩. However, this is easily seen to be impossible. For if we let

then

implies

whereas

implies

Therefore S must be a function of U and |ψ⟩ together.

2. Invariance under Time-Slicings

The second property we might want in our hidden-variable theory is invariance under time-slicings. This means that, if we perform two unitary transformations U and V in succession, we should get the same result if we apply the hidden-variable theory to VU, as if we apply the theory to U and V separately and then multiply the results. (Loosely speaking, the map from unitary to stochastic matrices should be "homomorphic.") Formally, what we want is that

S(|ψ⟩,VU) = S(U|ψ⟩,V) S(|ψ⟩,U).

But again one can show that this is impossible -- except in the "trivial" case that S is the product transformation S_prod, which destroys all correlations between the initial and final times.

To see this, observe that for all unitaries W and states |ψ⟩, we can write W as a product W = VU, in such a way that U|ψ⟩ equals a fixed basis state (|1⟩, for example). Then applying U "erases" all the information about the hidden variable's initial value -- so that if we later apply V, then the hidden variable's final value must be uncorrelated with its initial value. But this means that S(|ψ⟩,VU) equals S_prod(|ψ⟩,VU).

3. Independence from the Basis

When I defined hidden-variable theories, some of you were probably wondering: why should we only care about measurement results in some particular basis, when we could've just as well picked any other basis? So for example, if we're going to say that a particle has a "true, actual" location even before anyone measures that location, then shouldn't we say the same thing about the particle's momentum, and its spin, and its energy, and all the other observable properties of the particle? What singles out location as being more "real" than all the other properties?

Well, these are excellent questions! Alas, it turns out that we can't assign definite values to all possible properties of a particle in any "consistent" way. In other words, not only can we not define transition probabilities for all the particle's properties, we can't even handle all the properties simultaneously at any individual time!

This is the remarkable (if mathematically trivial) conclusion of the Kochen-Specker Theorem, which was proved by Simon Kochen and Ernst Specker in 1967. Formally, the theorem says the following: suppose that for every orthonormal basis B in ℜ³, the universe wants to "precompute" what the outcome would be of making a measurement in that basis. In other words, the universe wants to pick one of the three vectors in B, designate that one as the "marked" vector, and return that vector later should anyone happen to measure in B. Naturally, the marked vectors ought to be "consistent" across different bases. That is, if two bases share a common vector, like so:

then the common vector should be the marked vector of one basis if and only if it's also the marked vector of the other.

Kochen and Specker prove that this is impossible. Indeed, they construct an explicit set of 117 bases (!) in ℜ³, such that marked vectors can't be chosen for those bases in any consistent way.

NerdNote: The constant 117 has since been improved to 31; see here for example. Apparently it's still an open problem whether that's optimal; the best lower bound I've seen mentioned is 18.

The upshot is that any hidden-variable theory will have to be what those in the business call contextual. That is, it will sometimes have to give you an answer that depends on which basis you measured in, with no pretense that the answer would've been the same had you measured in a different basis that also contained the same answer.

Exercise: Prove that the Kochen-Specker Theorem is false in 2 dimensions.

4. Relativistic Causality

The final property we might want from a hidden-variable theory is adherence to the "spirit" of Einstein's special relativity. For our purposes, I'll define that to consist of two things:

Locality. This means that, if we have a quantum state |ψ_AB⟩ on two subsystems A and B, and we apply a unitary transformation U_A that acts only on the A system (i.e. is the identity on B), then the hidden-variable transformation S(|ψ_AB⟩,U_A) should also act only on the A system.
Commutativity. This means that, if we have a state |ψ_AB⟩, and we apply a unitary transformation U_A to the A system only followed by another unitary transformation U_B to the B system only, then the resulting hidden-variable transformation should be the same as if we'd first applied U_B and then U_A. Formally, we want that S(U_A|ψ_AB⟩,U_B) S(|ψ_AB⟩,U_A) = S(U_B|ψ_AB⟩,U_A) S(|ψ_AB⟩,U_B)

Now, you might've heard of a little thing called Bell's Inequality. As it turns out, Bell's Inequality doesn't quite rule out hidden-variable theories satisfying the two axioms above, but a slight strengthening of what Bell proved does the trick.

So what is Bell's Inequality? Well, if you look for an answer in almost any popular book or website, you'll find page after page about entangled photon sources, Stern-Gerlach apparatuses, etc., all of it helpfully illustrated with detailed experimental diagrams. This is necessary, of course, since if you took all the complications away, people might actually grasp the conceptual point!

However, since I'm not a member of the Physics Popularizers' Guild, I'm now going to break that profession's time-honored bylaws, and just tell you the conceptual point directly.

We've got two players, Alice and Bob, and they're playing the following game. Alice flips a fair coin; then, based on the result, she can either raise her hand or not. Bob flips another fair coin; then, based on the result, he can either raise his hand or not. What both players want is that exactly one of them should raise their hand, if and only if both coins landed heads. If that condition is satisfied then they win the game; if it isn't then they lose. (This is a cooperative rather than competitive game.)

Now here's the catch: Alice and Bob are both in sealed rooms (possibly even on different planets), and can't communicate with each other at all while the game is in progress.

The question that interests us is: what is the maximum probability with which Alice and Bob can win the game?

Well, certainly they can win 75% of the time. Why?

Right: they can both just decide never to raise their hands, regardless of how the coins land! In that case, the only way they'll lose is if both of the coins land heads.

Exercise: Prove that this is optimal. In other words, any strategy of Alice and Bob will win at most 75% of the time.

Now for the punchline: suppose that Alice and Bob share the entangled state

with Alice holding one half and Bob holding the other half. In that case, there exists a strategy by which they can win the game with probability

To be clear, having the state |Φ⟩ does not let Alice and Bob send messages to each other faster than the speed of light -- nothing does! What it lets them do is to win this particular game more than 75% of the time. Naïvely, we might have thought that would require Alice and Bob to "cheat" by sending each other messages, but that simply isn't true -- they can also cheat by using entanglement!

So that was Bell's Inequality.

But what does this dumb little game have to do with hidden variables? Well, suppose we tried to model Alice's and Bob's measurements of the state |Φ⟩ using two hidden variables: one on Alice's side and the other on Bob's side. And, in keeping with relativistic causality, suppose we demanded that nothing that happened to Alice's hidden variable could affect Bob's hidden variable or vice versa. In that case, we'd predict that Alice and Bob could win the game at most 75% of the time. But this prediction would be wrong!

It follows that, if we want it to agree with quantum mechanics, then any hidden-variable theory has to allow "instantaneous communication" between any two points in the universe. Once again, this doesn't mean that quantum mechanics itself allows instantaneous communication (it doesn't), or that we can exploit hidden variables to send messages faster than light (we can't). It only means that, if we choose to describe quantum mechanics using hidden variables, then our description will have to involve instantaneous communication.

Exercise: Generalize Bell's argument to show that there's no hidden-variable theory satisfying the locality and commutativity axioms as given above.

So what we've learned, from Alice and Bob's coin-flipping game, is that any attempt to describe quantum mechanics with hidden variables will necessarily lead to tension with relativity. Again, none of this has any experimental consequences, since it's perfectly possible for hidden-variable theories to violate the "spirit" of relativity while still obeying the "letter." Indeed, hidden-variable fans like to argue that all we're doing is unearthing the repressed marital tensions between relativity and quantum mechanics themselves!

Examples of Hidden-Variable Theories

I know what you're thinking: after the pummeling we just gave them, the outlook for hidden-variable theories looks pretty bleak. But here's the amazing thing: even in the teeth of four different no-go theorems, one can still construct interesting and mathematically nontrivial hidden-variable theories. I'd like to end this lecture by giving you three examples.

The Flow Theory

Remember the goal of hidden-variable theories: we start out with a unitary matrix U and a state |ψ⟩; from them we want to produce a stochastic matrix S that maps the initial distribution to the final distribution. Ideally, S should be derived from U in a "natural," "organic" way. So for example, if the (i,j) entry of U is zero, then the (i,j) entry of S should also be zero. Likewise, making a small change to U or |ψ⟩ should produce only a small change in S.

Now, it's not clear a priori that there even exists a hidden-variable theory satisfying the two requirements above. So what I want to do first is give you a simple, elegant theory that does satisfy those requirements.

The basic idea is to treat probability mass flowing through the multiverse just like oil flowing through pipes! We're going to imagine that initially, we have |α_i|² units of "oil" at each basis state |i⟩, while by the end, we want |β_i|² units of oil at each basis state |i⟩. Here α_i and β_i are the initial and final amplitudes of |i⟩ respectively. And we're also going to think of |u_ij|, the absolute value of the (i,j)^th entry of the unitary matrix, as the capacity of an "oil pipe" leading from |i⟩ to |j⟩.

The network G(U,|ψ⟩)

Then the first question is this: for any U and |ψ⟩, can all 1 units of oil be routed from s to t in the above network G(U,|ψ⟩), without exceeding the capacity of any of the pipes?

I proved that the answer is yes. My proof uses a fundamental result from the 1960's called the Max-Flow/Min-Cut Theorem. Those of you who were/are computer science majors will vaguely remember this from your undergrad classes. For the rest of you, well, it's really worth seeing at least once in your life. (It's useful not only for the interpretation of quantum mechanics, but also for stuff like Internet routing!)

So what does the Max-Flow/Min-Cut Theorem say? Well, suppose we have a network of oil pipes like in the figure above, with a designated "source" called s, and a designated "sink" called t. Each pipe has a known "capacity", which is a nonnegative real number measuring how much oil can be sent through that pipe each second. Then the max flow is just the maximum amount of oil that can be sent from s to t every second, if we route the oil through the pipes in as clever a way as possible. Conversely, the min cut is the smallest real number C such that, by blowing up oil pipes whose total capacity is C, a terrorist could prevent any oil from being sent from s to t.

As an example, what's the max flow and min cut for the network below?

Right: they're both 3.

As a trivial observation, I claim that for any network, the max flow can never be greater than the min cut. Why?

Right: because by definition, the min cut is the total capacity of some "choke point" that all the oil has to pass through eventually! In other words, if blowing up pipes of total capacity C is enough to cut the flow from s to t down to zero, then putting those same pipes back in can't increase the flow to more than C.

Now, the Max-Flow/Min-Cut Theorem says that the converse is also true: for any network, the max flow and min cut are actually equal.

Exercise (for those who've never seen it): Prove the Max-Flow/Min-Cut Theorem.

Exercise (hard): By using the Max-Flow/Min-Cut Theorem, prove that for any unitary U and any state |ψ⟩, there exists a way to route all the probability mass from s to t in the network G(U,|ψ⟩) shown before.

So, we've now got our candidate hidden-variable theory! Namely: given U and |ψ⟩, first find a "canonical" way to route all the probability mass from s to t in the network G(U,|ψ⟩). Then define the stochastic matrix S by s_ij := p_ij/|α_i|², where p_ij is the amount of probability mass routed from |i⟩ to |j⟩. (For simplicity, I'll ignore what happens when α_i=0.)

By construction, this S maps the vector of |α_i|²'s to the vector of |β_i|²'s. It also has the nice the property that for all i,j, if u_ij=0 then s_ij=0 as well.

Why?

Right! Because if u_ij=0, then no probability mass can get routed from |i⟩ to |j⟩.

Exercise (harder): Prove that making a small change to U or |ψ⟩ produces only a small change in the matrix (p_ij) of transition probabilities.

The Schrödinger Theory

So that was one cute example of a hidden-variable theory. I now want to show you an example that I think is even cuter. When I started thinking about hidden-variable theories, this was actually the first idea I came up with. Later I found out that Schrödinger had the same idea in a nearly-forgotten 1931 paper.

Specifically, Schrödinger's idea was to define transition probabilities in quantum mechanics by solving a system of coupled nonlinear equations. The trouble is that Schrödinger couldn't prove that his system had a solution (let alone a unique one); that had to wait for the work of Masao Nagasawa in the 1980's. Luckily for me, I only cared about finite-dimensional quantum systems, where everything was much simpler, and where I could give a reasonably elementary proof that the equation system was solvable.

So what's the idea? Well, recall that given a unitary matrix U, we want to "convert" it somehow into a stochastic matrix S that maps the initial distribution to the final one. This is basically equivalent to asking for a matrix P of transition probabilities: that is, a nonnegative matrix whose i^th column sums to |α_i|² and whose j^th row sums to |β_j|². (This is just the requirement that the marginal probabilities should be the usual quantum-mechanical ones.)

Since we want to end up with a nonnegative matrix, a reasonable first step would be to replace every entry of U by its absolute value:

What next? Well, we want the i^th column to sum to |α_i|². So let's continue doing the crudest thing imaginable, and for every 1≤i≤N, just normalize the i^th column to sum to |α_i|²!

Now, we also want the j^th row to sum to |β_j|². How do we get that? Well, for every 1≤j≤N, we just normalize the j^th row to sum to |β_j|².

Of course, after we normalize the rows, in general the i^th column will no longer sum to |α_i|². But that's no problem: we'll just normalize the columns again! Then we'll re-normalize the rows (which were messed up by normalizing the columns), then we'll re-normalize the columns (which were messed up by normalizing the rows), and so on ad infinitum.

Exercise (hard): Prove that this iterative process converges for any U and |ψ⟩, and that the limit is a matrix P=(p_ij) of transition probabilities -- that is, a nonnegative matrix whose i^th column sums to |α_i|² and whose j^th row sums to |β_j|².

Open Problem (if you get this, let me know): Prove that making a small change to U or |ψ⟩ produces only a small change in the matrix P=(p_ij) of transition probabilities.

Bohmian Mechanics

Some of you might be wondering why I haven't mentioned the most famous hidden-variable theory of all: Bohmian mechanics. The answer is that, to discuss Bohmian mechanics, I'd have to bring in infinite-dimensional Hilbert spaces (blech!), particles with positions and momenta (double blech!), and other ideas that go against everything I stand for as a computer scientist.

Still, I should tell you a little about what Bohmian mechanics is and why it doesn't fit into my framework. In 1952, David Bohm proposed a deterministic hidden-variable theory: that is, a theory where not only do you get transition probabilities, but the probabilities are all either 0 or 1! The way he did this was by taking as his hidden variable the positions of particles in ℜ³. He then stipulated that the probability mass for where the particles are should "flow" with the wavefunction, so that a region of configuration space with probability ε always gets mapped to another region with probability ε.

With one particle in one spatial dimension, it's easy to write down the (unique) differential equation for particle position that satisfies Bohm's probability constraint. Bohm showed how to generalize the equation to any number of particles in any number of dimensions.

To illustrate, here's what the Bohmian particle trajectories look like in the famous double-slit experiment:

Again, the amazing thing about this theory is that it's deterministic: specify the "actual" positions of all the particles in the universe at any one time, and you've specified their "actual" positions at all earlier and later times. So if you like, you can imagine that at the moment of the Big Bang, God sprinkled particles across the universe according to the usual |ψ|² distribution; but after that He smashed His dice, and let the particles evolve deterministically forever after. And that assumption will lead you to exactly the same experimental predictions as the usual picture of quantum mechanics, the one where God's throwing dice up the wazoo.

The catch, from my point of view, is that this sort of determinism can only work in an infinite-dimensional Hilbert space, like the space of particle positions. I've almost never seen this observation discussed in print, but I can explain it in a couple sentences.

Suppose we want a hidden-variable theory that's deterministic like Bohm's, but that works for quantum states in a finite number of dimensions. Then what happens if we apply a unitary transformation U that maps the state |0⟩ to

In this case, initially the hidden variable is |0⟩ with certainty; afterwards it's |0⟩ with probability 1/2 and |1⟩ with probability 1/2. In other words, applying U increases the entropy of the hidden variable from 0 to 1. So to decide which way the hidden variable goes, clearly Nature needs to flip a coin!

A Bohmian would say that the reason determinism broke down here is that our wavefunction was "degenerate": that is, it didn't satisfy the continuity and differentiability requirements that are needed for Bohm's differential equation. But in a finite-dimensional Hilbert space, every wavefunction will be degenerate in that sense! And that's why, if our universe is discrete at the Planck scale, then it can't also be deterministic in the Bohmian sense.

Lecture 12: Proofs

(Thanks to Bill Rosgen and Edwin Chen for help preparing these notes.)

We're going to start by beating a retreat from QuantumLand, back onto the safe territory of computational complexity. In particular, we're going to see how in the 80's and 90's, computational complexity theory reinvented the millennia-old concept of mathematical proof -- making it probabilistic, interactive, and cryptographic. But then, having fashioned our new pruning-hooks (proving-hooks?), we're going to return to QuantumLand and reap the harvest. In particular, I'll show you why, if you could see the entire trajectory of a hidden variable, then you could efficiently solve any problem that admits a "statistical zero-knowledge proof protocol," including problems like Graph Isomorphism for which no efficient quantum algorithm is yet known.

What Is A Proof?

Historically, mathematicians have had two very different notions of "proof."

The first is a that a proof is something that induces in the audience (or at least the prover!) an intuitive sense of certainty that the result is correct. On this view, a proof is an inner transformative experience -- a way for your soul to make contact with the eternal verities of Platonic heaven.

The second notion is that a proof is just a sequence of symbols obeying certain rules -- or more generally, if we're going to take this view to what I see as its logical conclusion, a proof is a computation. In other words, a proof is a physical, mechanical process, such that if the process terminates with a particular outcome, then you should accept that a given theorem is true. Naturally, you can never be more certain of the theorem than you are of the laws governing the machine. But as great logicians from Leibniz to Frege to Gödel understood, the weakness of this notion of proof is also its strength. If proof is purely mechanical, then in principle you can discover new mathematical truths by just turning a crank, without any understanding or insight. (As Leibniz imagined legal disputes one day being settled: "Gentlemen, let us calculate!")

The tension between the two notions of proof was thrown into sharper relief in 1976, when Kenneth Appel and Wolfgang Haken announced a proof of the famous Four-Color Map Theorem (that every planar map can be colored with four colors, in such a way that no two adjacent countries are colored the same). The proof basically consisted of a brute-force enumeration of thousands of cases by computer; there's no feasible way for a human to apprehend it in its entirety.

Devin: If the Four-Color Theorem was basically proved by brute force, then how can they be sure they hit all the cases?

Answer: Good question! The novel technical contribution that human mathematicians had to make was precisely that of reducing the problem to finitely many cases -- specifically, about 2000 of them -- which could then be checked by computer. Increasing our confidence is that the proof has since been redone by another group, which reduced the number of cases from about 2000 to about 1000.

Now, people will ask: how do you know that the computer didn't make a mistake? The obvious response is that human mathematicians also make mistakes. I mean, Roger Penrose likes to talk about making direct contact with Platonic reality, but it's a bit embarrassing when you think you've made such contact and it turns out the next morning that you were wrong!

We know the computer didn't make a mistake because we trust the laws of physics that the computer relies on, and that it wasn't hit by a cosmic ray during the computation. But in the last 20 years, there's been the question -- why should we trust physics? We trust it in life-and-death situations every day, but should we really trust it with something as important as proving the Four-Color Theorem? The truth is, we can play games with the definition of proof in order to expand it to unsettling levels, and we'll be doing this for the rest of the lecture.

Probabilistic Proofs

Recall that we can think of a proof as a computation -- a purely mechanical process that spits out theorems. But what about a computation that errs with 2^-1000 probability -- is that a proof? That is, are BPP computations legitimate proofs? Well, if we can make the probability of error so small that it's more likely for a comet to suddenly smash our computer into pieces than for our proof to be wrong, it certainly seems plausible!

Now do you remember our definition of NP, as the class of problems with polynomial-size certificates (for positive answers) that can be verified in polynomial time? In other words, it's the class of problems we can efficiently prove and check. So once we admit probabilistic algorithms as proofs, we should combine them with NP to get MA (named by Laszlo Babai after Merlin and Arthur), the class of problems with proofs efficiently verifiable by some randomized algorithm.

We can also consider the class where you get to Ask Merlin a question -- this is AM. What happens if you get to ask Merlin more than one question? You'd think you'd be able to solve more problems or prove more theorems, right? Wrong! It turns out that if you get to ask Merlin a constant number of questions, say, AMAMAM, then you have exactly the same power as just asking him once.

Zero-Knowledge Proofs

I was talking before about stochastic proofs, proofs that have an element of uncertainty about them. We can also generalize the notion of proof to include zero-knowledge proofs, proofs where the person seeing the proof doesn't learn anything about the statement in question except that it's true.

Intuitively that sounds impossible, but I'll illustrate this with an example. Suppose we have two graphs. If they're isomorphic, that's easy to prove. But suppose they're not isomorphic. How could you prove that to someone, assuming you're an omniscient wizard?

Simple: Have the person you're trying to convince pick one of the two graphs at random, randomly permute it, and send you the result. That person then asks: "which graph did I start with?" If the graphs are not isomorphic, then you should be able to answer this question with certainty. Otherwise you'll only be able to answer it with probability 1/2. And thus you'll almost surely make a mistake if the test is repeated a small number of times.

This is an example of an interactive proof system. Are we making any assumptions? We're assuming you don't know which graph the verifier started with, or that you can't access the state of his brain to figure it out. Or as theoretical computer scientists would say, we're assuming you can't access the verifier's "private random bits."

What's perhaps even more interesting about this proof system is that the verifier becomes convinced that the graphs are not isomorphic without learning anything else! In particular, the verifier becomes convinced of something, but is not thereby enabled to convince anyone else of the same statement.

A proof with this property -- that the verifier doesn't learn anything besides the truth of the statement being proved -- is called is called a zero-knowledge proof. Yeah, alright, you have to do some more work to define what it means for the verifier to "not learn anything." Basically, what it means is that, if the verifier were already convinced of the statement, he could've just simulated the entire protocol on his own, without any help from the prover.

Under a certain computational assumption -- namely that one-way functions exist -- it can be shown that zero-knowledge proofs exist for every NP-complete problem. This was the remarkable discovery of Goldreich, Micali, and Wigderson in 1986.

Because all NP-complete problems are reducible to each other (i.e., are "the same problem in different guises"), it's enough to give a zero-knowledge protocol for one NP-complete problem. And it turns out that a convenient choice is the problem of 3-coloring a graph (meaning: coloring every vertex red, blue, or green, so that no two neighboring vertices are colored the same).

The question is: how can you convince someone that a graph is 3-colorable, without revealing anything about the coloring?

Well, here's how. Given a 3-coloring, first randomly permute the colors -- for example by changing every blue country to green, every green country to red, and every red country to blue. (There are 3!=6 possible permutations.) Next, use a one-way function (whose existence we've assumed) to encrypt the color of every country. Then send the encrypted colors to the verifier.

Given these encrypted colors, what can the verifier do? Simple: he can pick two neighboring vertices, ask you to decrypt the colors, and then check that (1) the decryptions are valid and (2) the colors are actually different. Note that, if the graph wasn't 3-colorable, then either two adjacent countries must gotten the same color, or else some country must not even have been colored red, blue, or green. In either case, the verifier will catch you cheating with probability at least 1/m, where m is the number of edges.

Finally, if the verifier wants to increase his confidence, we can simply repeat the protocol a large (but still polynomial) number of times. Note that each time you choose a fresh permutation of the colors as well as fresh encryptions. If after (say) m³ repetitions, the verifier still hasn't caught you cheating, he can be sure that the probability you were cheating is vanishingly small.

But why is this protocol zero-knowledge? Intuitively it's "obvious": when you decrypt two colors, all the verifier learns is that two neighboring vertices were colored differently -- but then, they would be colored differently if it's a valid 3-coloring, wouldn't they? Alright, to make this more formal, you need to prove that the verifier "doesn't learn anything," by which we mean that by himself, in polynomial time, the verifier could've produced a probability distribution over sequences of messages that was indistinguishable, by any polynomial-time algorithm, from the actual sequence of messages that the verifier exchanged with you. As you can imagine, it gets a bit technical.

Is there any difference between the two zero-knowledge examples I just showed you? Sure: the zero-knowledge proof for 3-coloring a map depended crucially on the assumption that the verifier can't, in polynomial time, decrypt the map by himself. (If he could, he would be able to learn the 3-coloring!) This is called a computational zero-knowledge proof, and the class of all problems admitting such a proof is called CZK. By contrast, in the proof for Graph Non-Isomorphism, the verifier couldn't cheat even with unlimited computational power. This is called is a statistical zero-knowledge proof, a proof in which the distributions given by an honest prover and a cheating prover need to be close in the statistical sense. The class of all problems admitting this kind of proof is called SZK.

Clearly SZK⊆CZK, but is the containment strict? Intuitively, we'd guess that CZK is a larger class, since we only require a protocol to be zero-knowledge against polynomial-time verifiers, not verifiers with unlimited computation. And indeed, it's known that if one-way functions exist, then CZK = IP = PSPACE -- in other words, CZK is "as big as it could possibly be." On the other hand, it's also known that SZK is contained in the polynomial hierarchy. (In fact, under a derandomization assumpition, SZK is even in NP∩coNP).

PCP

A PCP (Probabilistically Checkable Proof) is yet another impossible-seeming game one can play with the concept of "proof." It's a proof that's written down in such a way that you, the lazy grader, only need to flip it open to a few random places to check (in a statistical sense) that it's correct. Indeed, if you want very high confidence (say, to one part in a thousand) that the proof is correct, you never need to examine more than about 30 bits. Of course, the hard part is encoding the proof so that this is possible.

It's probably easier to see this with an example. Do you remember the Graph Non-Isomorphism problem? We'll show that there is a proof that two graphs are non-isomorphic, such that anyone verifying the proof only needs to look at a constant number of bits (though admittedly, the proof itself will be exponentially long).

First, given any pair of graphs G₀ and G₁ with n nodes each, the prover sends the verifier a specially encoded string proving that G₀ and G₁ are non-isomorphic. What's in this string? Well, we can choose some ordering of all possible graphs with n nodes, so call the i^th graph H_i. Then for the i^th bit of the string, the prover puts a 0 there if H_i is isomorphic to G₀, a 1 if H_i is isomorphic to G₁, and otherwise (if H_i is isomorphic to neither) he arbitrarily places a 0 or a 1. How does this string prove to the verifier that G₀ and G₁ are non-isomorphic? Easy: the verifier flips a coin to get G₀ or G₁, and randomly permutes it to get a new graph H. Then, she queries for the bit of the proof corresponding to H, and accepts if and only if the queried bit matches her original graph. If indeed G₀ and G₁ are non-isomorphic, then the verifier will always accept, and if not, then the probability of acceptance is at most 1/2.

In this example, though, the proof was exponentially long and only worked for Graph Non-Isomorphism. What kind of results do we have in general? The famous PCP Theorem says that every NP problem admits PCP's -- and furthermore, PCP's with polynomially long proofs! This means that every mathematical proof can be encoded in such a way that any error in the original proof translates into errors almost everywhere in the new proof.

One way of understanding this is through 3SAT. The PCP theorem is equivalent to the NP-completeness of the problem of solving 3SAT with the promise that either the formula is satisfiable, or else there's no truth assignment that satisfies more than (say) 90% of the clauses. Why? Because you can encode the question of whether some mathematical statement has a proof with at most n symbols as a 3SAT instance -- in such a way that if there's a valid proof, then the formula is satisfiable, and if not, then no assignment satisfies more than 90% of the clauses. So given a truth assignment, you only need to distinguish the case that it satisfies all the clauses from the case that it satisfies at most 90% of them -- and this can be done by examining a few dozen random clauses, completely independently of the length of the proof.

Complexity of Simulating Hidden-Variable Theories

We talked last week about the path of a particle's hidden variable in a hidden-variable theory, but what is the complexity of finding such a path? As Devin points out, this problem is certainly at least as hard as quantum computing -- since even to sample a hidden variable's value at any single time would in general require a full-scale quantum computation. Is sampling a whole trajectory an even harder problem?

Here's another way to ask this question. Suppose that at the moment of your death, your whole life flashes before you in an instant -- and suppose you can then perform a polynomial-time computation on your life history. What does that let you compute? (Assuming, of course, that a hidden-variable theory is true, and that while you were alive, you somehow managed to place your own brain in various nontrivial superpositions.)

To study this question, we can define a new complexity class called DQP, or Dynamical Quantum Polynomial-Time. The formal definition of this class is a bit hairy (see my paper for details). Intuitively, though, DQP is the class of problems that are efficiently solvable in the "model" where you get to sample the whole trajectory of a hidden variable, under some hidden-variable theory that satisfies "reasonable" assumptions.

Now, you remember the class SZK, of problems that have statistical zero-knowledge proof protocols? The main result from my paper was that SZK⊆DQP. In other words, if only we could measure the whole trajectory of a hidden variable, we could use a quantum computer to solve every SZK problem -- including Graph Isomorphism and many other problems not yet known to have efficient quantum algorithms!

To explain why that is, I need to tell you that in 1997, Sahai and Vadhan discovered an extremely nice "complete promise problem" for SZK. That problem is the following:

Given two efficiently-samplable probability distributions D₁ and D₂, are they close or far in statistical distance (promised that one of those is the case)?

This means that when thinking about SZK, we can forget about zero-knowledge proofs, and just assume we have two probability distributions and we want to know whether they're close or far.

But let me make it even more concrete. Let's say that you have a function f:{1,2,...,N}→{1,2,...,N}, and you want to decide whether f is one-to-one or two-to-one, promised that one of these is the case. This problem -- which is called the collision problem -- doesn't quite capture the difficulty of all SZK problems, but it's close enough for our purposes.

Now, how many queries to f do you need to solve the collision problem? If you use a classical probabilistic algorithm, then it's not hard to see that √N queries are necessary and sufficient. As in the famous "birthday paradox" (where if you put 23 people in a room, there's at least even odds that two of the people share a birthday), you get a square-root savings over the naïve bound, since what matters is the number of pairs for which a collision could occur. But unfortunately, if N is exponentially large (as it is in the situations we're thinking about), then √N is still completely prohibitive: the square root of an exponential is still an exponential.

So what about quantum algorithms? In 1997, Brassard, Høyer, and Tapp showed how to combine the √N savings from the birthday paradox with the unrelated √N savings from Grover's algorithm, to obtain a quantum algorithm that solves the collision problem in (this is going to sound a joke) ~N^1/3 queries. So, yes, quantum computers do give at least a slight advantage for this problem. But is that the best one can do? Or could there be a better quantum algorithm, that solves the collision problem in (say) log(N) queries, or maybe even less?

In 2002 I proved the first nontrivial lower bound on the quantum query complexity of the collision problem, showing that any quantum algorithm needs at least ~N^1/5 queries. This was later improved to ~N^1/3 by Yaoyun Shi, thereby showing that Brassard, Høyer, and Tapp's algorithm was indeed optimal.

On the other hand -- to get back to our topic -- suppose you could see the whole trajectory of a hidden variable. In that case, I claim that you could solve the collision problem with only a constant number of queries (independent of N)! How? The first step is to prepare the state

Now measure the second register (which we won't need from this point onwards), and think only about the resulting state of the first register. If f is one-to-one, then in the first register you'll get a classical state of the form i, for some random i. If f is two-to-one, on the other hand, then you'll get a state of the form , where i and j are two values with f(i) = f(j). If only you could perform a further measurement to tell these states apart! But alas, as soon as you measure you destroy the quantum coherence, and the two types of states look completely identical to you.

Aha, but remember we get to see an entire hidden-variable trajectory! Here's how we can exploit that. Starting from the state , first apply a Hadamard gate to every qubit. This produces a "soup" of exponentially many basis vectors -- but if we then Hadamard every qubit a second time, we get back to the original state . Now, the idea is that when we Hadamard everything, the particle "forgets" whether it was at i or j. (This can be proven under some weak assumptions on the hidden-variable theory.) Then, when we observe the history of the particle, we'll learn something about whether the state had the form i or . For in the former case the particle will always return to i, but in the latter case it will "forget," and will need to pick randomly between i and j. As usual, by repeating the "juggling" process polynomially many times one can make the probability of failure exponentially small. (Note that this does not require observing more than one hidden-variable trajectory: the repetitions can all happen within a single trajectory.)

What are the assumptions on the hidden-variable theory that are needed for this to work? The first is basically that if you have a bunch of qubits and you apply a Hadamard to one of them, then you should only get to transfer between hidden-variable basis states that differ in the first qubit.

Raymond: Does this say that the hidden variables are related one-to-one to qubits?

A: Well, it does nontrivially constrain how the hidden variables can work. Note, though, that this assumption is very different from (and weaker than) requiring the hidden-variable theory to be "local", in the sense physicists usually mean by that. No hidden-variable theory can be local (I think some guy named Bell proved that).

And the second assumption is that the hidden-variable theory is "robust" to small errors in the unitaries and quantum states. (This assumption is needed to define the complexity class DQP in a reasonable way.)

As we've seen, DQP contains both BQP and the Graph Isomorphism problem. But interestingly, at least in the black-box model, DQP does not contain the NP-complete problems. (More formally, there exists an oracle A such that NP^A⊄DQP^A.) The proof of this formalizes the intuition that, even as the hidden variable bounces around the quantum haystack, the chance that it ever hits the needle is vanishingly small. It turns out that in the hidden-variable model, you can search an unordered list of size N using ~³√N queries instead of the ~√N you'd get from Grover's algorithm, but this is still exponential. The upshot is even that DQP has severe computational complexity limitations.

Gus: Does this imply that hidden-variable theories aren't that outlandish?

A: Well, at least they don't fail this one test!

Lecture 13: How Big are Quantum States?

Scribe: Chris Granade

I've received some complaints that we've done too much complexity and not enough physics, so today we're going to talk about physics. In particular, we're going to talk about QMA, BQP/poly and many other physics topics. Maybe I should explain my point of view, which is that I've been rolling over backwards to give you as much physics as possible. Here's where I'm coming from: there's this traditional hierarchy where you have biology on top, and chemistry underlies it, and then physics underlies chemistry. If the physicists are in a generous mood, they'll say that math underlies physics. Then, computer science is over somewhere with soil engineering or some other non-science.

Now, my point of view is a bit different: computer science is what mediates between the physical world and the Platonic world. With that in mind, "computer science" is a bit of a misnomer; maybe it should be called "quantitative epistemology." It's sort of the study of the capacity of finite beings such as ourselves to learn mathematical truths. I hope I've been showing you some of that.

Gus: How do you reconcile this with the notion that any actual implementation of a computer must be based on physics, so wouldn't the order of physics and CS be reversed?

This is not a very well-defined defined. One could also say that any mathematical proof has to be written on paper, and so math should therefore go above physics. You could also say that math is basically a field that studies whether particular kinds of Turing machines will halt or not, and so CS is ground that everything sits on. Math is then just the special case where the Turing machines deal with topological spaces or something. But then, the strange thing is that physics has been kind of seeping down towards math and CS. This is kind of how I think of quantum computing: physics isn't staying where it's supposed to! If you like, I'm professionally interested in physics precisely to the extent that it seeps down into CS!

I think that it's helpful to classify interpretations of quantum mechanics, or at least to reframe debates about them, by asking where various interpretations come down on the question of the exponentiality of quantum states. To describe the state of a hundred or a thousand atoms, do you really need more classical bits of information than you could write down in the universe?

Roughly speaking, the Many-Worlds interpretation would say "absolutely." This is a view that David Deutch defends very explicitly; if the different universes (or components of the wave function) used in Shor's Algorithm are not physically there, then where was the number factored?

We also talked about Bohmian mechanics, which says "yes," but that one component of the vector is "more real" than the rest. Then, there is the view that used to be called the Copenhagen view, but is now called the Bayesian view, the information-theoretic view or one of a host of other names.

On the Bayesian view, a quantum state is an exponentially long vector of amplitudes in more-or-less the same sense that a classical probability distribution is an exponentially long vector of probabilities. If you were to take a coin and flip it 1000 times, you would have some set of 2¹⁰⁰⁰ possible outcomes, but we don't because of that decide to regard all of those outcomes as physically real.

At this point, I should clarify that I'm not talking about the formalism of quantum mechanics; that's something that (almost) everyone agrees about. What I'm asking is whether quantum mechanics describes an actual, real "exponential-sized object" existing in the physical world. So, the move that you make when you take the Copenhagen view is to say that the exponentially-long vector is "just in our heads."

Niel: In the sense of a classical probability distribution, isn't this just a different way of putting what you've said about the Bohmian view?
Scott: So the Bohmian view is this strange kind of intermediate position. In the Bohmian view, you do sort of see these exponential numbers of possibilities as somehow real; they're the guiding field, but there's this one "more real" thing that they're guiding. In the Copenhagen interpretation, these exponentially many possibilities really are just in your head. Presumably, they correspond to something in the external world, but what that something is, we either don't know or aren't allowed to ask. Chris Fuchs says that there's some physical context to quantum mechanics -- something outside of our heads -- but that we don't know what that context is. Niels Bohr tended to make the move towards "you aren't allowed to ask."

Now that we have quantum computing, can we bring the intellectual arsenal of computational complexity theory to bear on this sort of question? I hate to disappoint you, but we can't resolve this debate using computational complexity. It's not well-defined enough. Although we can't declare one of these views to be the ultimate victor, what we can do is to put them into various "staged battles" with each other and see which one comes out the winner. To me, this is sort of the motivation for studying all sorts of questions about quantum proofs, advice, and communication. The question that all of these are trying to get at is if you have a quantum state of n qubits, does it act more like n classical bits, or does it act more like 2ⁿ bits? Note that there's this kind of exponentiality in our formal description of our quantum state, but we want to know to what extent we can actually get at it, or root it out.

We have all these complexity classes, and they seem kind of esoteric. Maybe it's just a bad historical accident that we use all of these acronyms. It's like the joke about the prisoners where one of them calls out "37" and all of them will fall on the floor laughing, then another calls out "22" but no one laughs because it's all in the telling. There's all of these enormous concepts and ideas, and we encapsulate them in these sequences of three or four capital letters, and maybe we shouldn't do that.

QMA. You can think of it as the set of truths such that if you had a quantum computer, you could be convinced of the answer by being given a quantum state. More formally, it's the set of problems that admit a polynomial-time quantum algorithm Q such that for every input x the following holds:

If, on the input x, the answer the problem is "yes," then there exists some quantum state |φ⟩ of a polynomial number of qubits such that Q accepts |x⟩|φ⟩ with probability greater than 2/3. If, on the input x, the answer the problem is "no," then there does not exist any polynomial-sized quantum state |φ⟩ such that Q accepts |x⟩|φ⟩ with probability greater than 1/3.

What I mean is that the number of qubits of |φ⟩ should be bounded by a polynomial in the length n of x. You can't be given some state of 2ⁿ qubits. If you could, then that would sort of trivialize the problem.

We want there to be a quantum state of reasonable size that convinces you of a "yes" answer. So when the answer is "yes," there's a state that convinces you, and when the answer is "no," there's no such state. QMA is sort of the quantum analogue of NP. Recall that we have the Cook-Levin Theorem which gives us that the Boolean satisfiability problem (SAT) is NP-complete. There is also a Quantum Cook-Levin Theorem -- which is a great name, since both Cook and Levin are quantum computing skeptics (though Levin much more so than Cook). The Quantum Cook-Levin theorem tells us that we can define a quantum version of the 3SAT problem, which turns out to be QMA-complete as a promise problem.

A promise problem is some problem you only have to get the right answer if there's some promise on the input. If you, as the algorithm, have been screwed over by crappy input, then any court is going to rule in your favor and you can do whatever you want. It may even be a very difficult computation to decide if the promise holds or not, but that's not your department. There are certain complexity classes for which we don't really believe that there are complete problems, but for which there are complete promise problems. QMA is one such class. The basic reason we need a promise is because of the gap between 1/3 and 2/3. Maybe you would be given some input, but you'd accept with some probability that is neither greater than 2/3 nor less than 1/3. In that case, you've done something illegal and so we assume that you aren't given such an input.

So what is this quantum 3SAT problem? Basically, think of n qubits stuck in an ion trap (since I'm suppose to be bringing in some physics), and now we describe a bunch of measurements, each of which involves at most three of the qubits. Each measurement i accepts with probability equal to P sub i . These measurements are not hard to describe, since they involve at most three qubits. Let's say that we add up n of the measurements. Then, the promise will be either there is a state such that this sum is very large, or that for all states, the sum is much smaller. Then, the problem is to decide which of the two conditions holds. This problem is complete in QMA in the same sense that the classical analogue, 3SAT, is complete in NP. This was first proved by Kitaev, and was later improved by many others.

The real interest comes with the question of how powerful the QMA class is. Are there truths that you can verify in a reasonable amount of time with quantum computers, but which you can't verify with a classical computer? This is an example of what we talked about earlier, where we're trying to put realistic and subjective views of quantum states into "staged battle" with each other and see which one comes out the winner.

There's a result of John Watrous which gives an example where it seems that being given an exponentially long vector really does give you some sort of power. The problem is called group non-membership. You're given a finite group . We think of this as being exponentially large, so that you can't be given it explicitly by a giant multiplication table. You're given it in some more subtle way. We will think of it as a black-box group, which means that we have some sort of black box which will perform the group operations for you. That is, it will multiply and invert group elements for you. You're also given a list of generators of the group.

From the floor: How long is this list?
Scott: Polynomially long. Good question.

Each element of the group is encoded in some way by some n-bit string, though you have no idea how it's encoded.

Gus: So what's n here?
Scott: Well, you can define n to be the number of bits needed to write down one of the group elements.
Gus: It seems like the whole point, though, is that the number of elements in the group is exponential in the number of generators.
Scott:Yes. You're right.

So now we're given a subgroup H≤G, which can also be given to us as a list of generators. Now the problem is an extremely simple one: we're given an element x of the group, and want to know whether or not it's in the subgroup. I've specified this problem abstractly in terms of these black boxes, but you can instantiate it, if you have a specific example of a group. For example, these generators could be matrices over some finite field, and you're given some other matrix and are asked whether you can get to it from your generators. It's a very natural question.

Let's say the answer is "yes." Then, could that be proven to you?

Devin: Show how x was generated.
Scott: Yes. There's one thing you need to say (not a very hard thing), which is that if x ∈ H, then there is some "short" way of getting to it. Not necessarily by multiplying the generators you started with, but by recursively generating new elements and adding those to your list, and using those to generate new elements, and so on.

For example, if we started with the group ℤ_n, the additive group modulo n, and if we have some single starting element 1, we can we just keep adding 1 to itself, but it will take us a while to get to 2⁵⁰⁰⁰. But if we recursively build 2 = 1 + 1, 4 = 2 + 2 and so on by repeatedly applying the group operation to our new elements, we'll get to whatever element we want quickly.

Q: Is it always possible to do it in polynomial time?
Scott: Yes. For any group. The way to see that is to construct a chain of subgroups from the one you started with. It takes a little work to show, but it's a theorem of Babai and Szemerédi, which holds whether or not the group is solvable.

Now here's the question: what if x ∉ H? Could you demonstrate that to someone? Sure, you could give them an exponentially long proof, and if you had an exponentially long time, you could demonstrate it, but this isn't feasible. We still don't know quite how to do with this, even if you were given a classical proof and allowed to check it via quantum computation, though we do have some conjectures about that case.

Watrous showed that you can prove non-membership if you're given a certain quantum state, which is a superposition over all the elements of the subgroup. Now this state might be very hard to prepare. Why?

Q: It's exponentially large?
Scott: Well, yeah, but there are other exponentially large superposition states which are easy to prepare, so that can't be the whole answer.
Q: There are too many of them.
Scott: But we can prepare a superposition over all n-bit strings, and there's a lot of those.
Gus: You could efficiently sample uniformly at random, but to create them all in a coherent superposition, you have to somehow get rid of the garbage.
Scott: Yes. That's exactly right. The problem is one of uncomputing garbage.

So we know how to take a random walk on a group, and so we know how to sample a random element of a group. But here, we're asked for something more. We're asked for a coherent superposition of the group's elements. It's not hard to prepare a state of the form ∑|g⟩|garbage_G⟩. Then how do you get rid of that garbage? That's the question. Basically, this garbage will be the random walk or whatever process you use to get to g, but how do you forget how you got to that element?

But what Watrous said is to suppose we had an omniscient prover, and suppose that prover was able to prepare that state and give it to you. Well then, you could verify that an element is not in the subgroup H. We can do this in two steps:

Verify that we really were given the state we needed (we'll just assume this part for now).
Use the state |H⟩ to prove that x∉H by using controlled left-multiplication: Then, do a Hadamard and measure the first qubit. This is basically like a SWAP-test. You have the left qubit act as the control qubit. If x ∈ H, then xH is a permutation of H, and so we get interference fringes (the light went both through the x slit and the xH slit). If x ∉ H, then we have that xH is a coset, and thus shares no elements in common with H. Hence, ⟨H|xH⟩ = 0, and so we measure random bits. We can tell these two cases apart.

You also have to verify that this state |H⟩ really was what we were given. To do this, we will do a test like what we just did. Here, we pick the element x by taking a classical random walk on the subgroup H. Then, if |H⟩ was really the superposition over the subgroup, then |xH⟩ would just be shifted around by x, whereas if x ∉ H, we get something else. You have to prove that this is not only a necessary test, but a sufficient one as well. That's basically what Watrous proved.

This gives us one example where it seems like having a quantum state actually helps you, as if you could really get at the exponentiality of the state. Maybe this isn't a staggering example, but it's something.

An obvious question is whether, in all of those cases where a quantum proof seems to helps you, you could do just as well if you were given a classical proof that you then verified via quantum computation. Are we really getting mileage from having the quantum state, or is our mileage coming from the fact that we have a quantum computer to do the checking? We can phrase the question by asking if QMA is equal to QCMA, where QCMA is like QMA except that the proof now has to be a classical proof. Greg Kuperberg and I wrote a paper where we tried to look at this question directly. One thing we showed looks kind of bad for the realistic view of quantum states (at least in this particular battle): if the Normal Hidden Subgroup problem (what the problem is isn't important right now) can be solved in quantum-polynomial time, and it seems like it can, and if we make some other group-theoretic assumptions that seem plausible according to all the group theorists that we asked, then the Group Non-Membership Problem is actually in QCMA. That is, you can dequantize the proof and replace it with a classical one.

On the other hand, we showed that there exists a quantum oracle A relative to which QMA^A ≠ QCMA^A. This is a really simple thing to describe. To start with, what is a quantum oracle? Quantum oracles are just quantum subroutines which we imagine that both a QMA and a QCMA machine have access. To see the idea behind the oracle that we used, let's say that you're given some n-qubit unitary operation U. Moreover, let's say that you're promised that either U is the identity matrix I or that there exists some secret "marked state" |ψ⟩ such that U|ψ⟩ = −|ψ⟩; that is, that U has some secret eigenvector corresponding to an eigenvalue of -1. The problem is then to decide which of these conditions holds.

It's not hard to see that this problem, as an oracle problem, is in QMA. Why is it in QMA?

Q: Why can you verify that the answer is yes if you were given a quantum state as a proof?
Scott: Because the prover would just have to give the verifier |ψ⟩, and the verifier would apply U|ψ⟩ to verify that, yes, U|ψ⟩ = −|ψ⟩. So that's not saying a whole lot.

What we proved is that this problem, as an oracle problem, is not in QCMA. So even if you had both of the resources of this unitary operation U and some polynomial-sized classical string to kind of guide you to this secret negative eigenvector, you'd still need exponentially many queries to find |ψ⟩.

This gives some evidence in the other direction, that maybe QMA is more powerful than QCMA. If they were equivalent in power, then that would have to be shown using a quantumly non-relativizing technique. That is, a technique that is sensitive to the presence of quantum oracles. We don't really know of such a technique right now, besides techniques that are also classically nonrelativizing and don't seem applicable to this problem.

Gus: This whole idea of quantum oracles is relatively new to me, so I wanted to make sure that I'm clear on the difference between them and classical oracles. For example, that black box that performs the group operation...
Scott: That's a classical oracle.
Gus: Which we can apply in superposition. So if we can already apply a classical oracle in superposition, then it seems like the only difference is that a quantum oracle can apply some phase or something.
Scott: It's not just that, because a quantum oracle can act in an arbitrary basis. Classical oracles always work in the computational basis. That's really the key difference.

So there's really another sort of metaquestion here, which is if there's some kind of separation between quantum and classical oracles. That is, if there's some kind of question that we can only answer with quantum oracles. Could we get a classical oracle separation between QMA and QCMA? All I can tell you is that Greg and I tried for a while and couldn't do it. If you can, that'd be great.

Devin: Has anyone found a separation between classical and quantum oracles for anything?
Scott: No. It's something we thought about. It's sort of a very new set of questions, and the jury is still out. These are not necessarily "hard" problems; they aren't ones that have been attacked for twenty years. Some people, mainly me and a few others, thought about them for several months, which is not a very good certificate of their hardness.

OK. So that was quantum proofs. There are other ways we can try and get at the question of how much stuff is there to be extracted from a quantum state. Holevo's Theorem deals with the following question: if Alice wants to send some classical information to Bob, and if she has access to a quantum channel, can use this to her advantage? If quantum states are these exponentially long vectors, then intuitively, we might expect that if Alice could send some n-qubit state, then maybe she could use this to send Bob 2ⁿ classical bits. We can arrive at this from a simple counting argument. The number of quantum states of n qubits, any pair of which are of almost zero inner product with each other, is doubly exponential in n. All we're saying is that in order to specify such a state, you need exponentially many bits. Thus, we might hope that we could get some kind of exponential compression of information. Alas, Holevo's Theorem tells us that it is not to be. You need n qubits to reliably transmit n classical bits, with just some constant factor representing that you're willing to tolerate some probability of error, but really nothing better than you would get with a classical probabilistic encoding.

Scott: Intuitively, why is this? Does anyone want to give me a handwaving "proof"?
Devin: You can only measure it once.
Scott: Thank you. That's it. Each bit of information you extract cuts in half the dimensionality of the Hilbert space. Sure, in some sense, you can encode more than n bits, but then you can't reliably retrieve them.

This theorem was actually known in the 70's, and was ahead of its time.

It was only recently that anyone asked a very natural and closely-related question: what if Bob doesn't want to retrieve the whole string? We know from Holevo's theorem that getting the whole string is impossible, but what if Bob only wants to retrieve one bit (Alice doesn't know which one ahead of time)? Can Alice create a quantum state |ψ_x⟩ such that for whichever bit x_i Bob wants to know, he can just measure |ψ_x⟩ in the appropriate basis and would then learn that particular bit? After he's learned x_i, then he's destroyed the state and can't learn any more, but that's OK. Alice wants to send Bob a quantum phonebook, and Bob only wants to look up one number. It turns out that, via a proof from Ambainis, Nayak et al., this is still not possible. What they proved is that to encode n bits in this manner, so that any one can be read out, you need at least n over log n qubits.

Maybe you could get some small savings, but certainly not an exponential savings. Shortly after, Nayak proved that actually, if you want to encode n bits, you need n qubits. If we're willing to lose a logarithmic factor or two, I can show rather easily how this is a consequence of Holevo's Theorem. The reason that it's true illustrates a technique that I've gotten a lot of mileage out of recently, and there might be more mileage that can still be gotten out of it.

Suppose, by way of contradiction, that we had such a protocol that would reliably encode n bits into no more than log n qubits in such a way that any one bit could then be retrieved with high probability -- we'll say with error at most one-third. Then, what we could do is to take a bunch of copies of the state. We just want to push down the error probability, so we take a tensor product of, say, log n copies. Given this state, what Bob can do is to run the original protocol on each copy to get x_i and then take the majority vote. For some sufficiently large constant times log n, this will push down the error rate to at most n^-2. So for any particular bit i, Bob will be able to output a bit y_i such that . Now, since Bob can do that, what else can he do? He can keep repeating this, and get greedy. I'm going to run this process and get x₁, but now, because the outcome of this measurement could be predicted almost with certainty given this state, you can prove because of that, that you aren't getting a lot of information, and so the state is only slightly disturbed by the measurement. This is just a general fact about quantum measurements. If you could predict the outcome with certainty, then the state wouldn't be disturbed at all by the measurement.

So this is what we do. We've learned x₁ and the state has been damaged only slightly. When we run the protocol again, we learn what x₂ is with only small damage. Since small damage plus small damage is still small damage, we can find what x₃ is and so on. So, we can recover all of the bits of the original string using fewer qubits then the bound shown by Holevo. Based on this, we can say that we can't have such a protocol.

Why do we care about any of this? Well, maybe we don't, but I can tell you how this stuff entered my radar screen. Now, we're not going to ask about quantum proofs, but about a closely-related concept called quantum advice. So we'll bring in a class called BQP/qpoly: the set of problems efficiently solvable by a quantum computer, given a polynomially-sized quantum advice state. What's the difference between advice and proof? The first difference is that advice doesn't depend on the input itself, but only on the input length n. The second difference is that advice comes from an advisor, and advisors (as we know) are inherently trustworthy beings. As we also know, advisors don't really know what problem their students are working on, so they're going to always give you the same advice no matter what problem you're working on. All joking aside, a proof is not trustworthy; that's why the entity who receives the proof is called the verifier.

So the advantage of advice is that you can trust it, but the disadvantage is that it might not be as useful as it isn't tailored to the particular problem instance that you're trying to solve. So we can imagine that maybe it's hard for quantum computers to solve NP-complete problems, but only if the quantum computer has to start in some all-zero initial state. Maybe there are some very special states that were created in the Big Bang and that have been sitting around in some nebula ever sense (somehow not decohering), and if we get on a spaceship and find these states, they obviously can't anticipate what particular instance of SAT we wanted to solve, but they sort of anticipated that we would want to solve some instance of SAT. Could there be this one generic SAT-solving state |ψ_n⟩, such that given any Boolean formula P of size n, we could, by performing some quantum computation on |ψ_n⟩ figure out whether P is satisfiable? What we're really asking here is if NP⊆BQP/qpoly.

Bill: Are you allowed to destroy the advice?
Scott: Yes. There's a bunch of these things sitting around in the nebula. It's like oil; there's an infinite supply of advice. Though it turns out, you only have to gather polynomially many advice states and then you can use them on an exponential number of inputs, because of the observation we made before than an exponentially-small error probability disturbs the state only slightly.

What can we say about the power of BQP/qpoly? We can adapt Watrous's result about quantum proofs to this setting of quantum advice. Returning to the Group Non-Membership Problem, if the Big Bang anticipated what subgroup we wanted to test membership in, but not what element we wanted to test, then it could provide us with the state |H⟩ that's a superposition over all the elements of H, and then whatever element we wanted to test for membership in H, we could do it. This shows that a version of the Group Non-Membership Problem is in BQP/qpoly.

I didn't mention this earlier, but we can prove that QMA is contained in PP, so there's evidently some limit on the power of QMA. You can see that, in the worst case, all you would have to do is search through all possible quantum proofs (all possible states of n qubits), and see if there's one that causes our machine to accept. You can do better than that, and that's where the bound of PP comes from.

What about BQP/qpoly? Can anyone see any upper bound on the power of this class? That is, can anyone see any way of arguing what it can't do?

Gus: BQEXP/qpoly?
Scott: Right, sure, but how do we know that both BQP/qpoly and BQEXP/qpoly aren't equal to ALL, the set of all languages whatsoever (including uncomputable languages)! Let's say you were given an exponentially long classical advice string. Well, then, it's not hard to see that you could then solve any kind of problem whatsoever. Why? Because say that is the Boolean function we want to compute. Then, we just let the advice be the entire truth table for the function and then we just need to look up the appropriate entry in the truth table, and we've solved any problem of size n we want to solve. The halting problem, you name it.
It would sort of be like having access to exponentially many bits of Chaitin's Ω string. You could certainly take that as your advice. Though actually, while the bits of Ω are very hard to compute, if you were given them, they'd be useless. Almost by definition, they'd just look like a random string. You couldn't extract anything useful from them. But you could certainly be given them if you wanted.

Intuitively, it seems a bit implausible that BQP/qpoly = ALL, because being given a polynomial number of qubits really isn't like being given an exponentially long string of classical bits. The question is, how much can this "sea" of exponentially many classical bits that are needed to describe a quantum state determine what we get out?

I guess I'll cut to the chase and tell you that at a workshop years ago, Harry Buhrman asked me this question, and it was obvious to me that BQP/qpoly wasn't everything, and he told me to prove it. And eventually I realized that anything you could do with polynomially-sized quantum advice, you could do with polynomially-sized classical advice, provided that you can make a measurement and then postselect on its outcome. That is, I proved that BQP/qpoly ⊆ PostBQP/poly. In particular, this implies that BQP/qpoly ⊆ PSPACE/poly. Anything you can be told by quantum advice, you can also be told by classical advice of a comparable size, provided that you're willing to spend exponentially more computational effort to extract what that advice is trying to tell you.

Gus: You seem really fascinated with the whole idea of advice, and I'm not convinced that it's entirely relevant other than to the extent that it develops tools that are useful to other, more relevant aspects of complexity theory.
Scott: Why do I care about advice? First of all, yes: it does show up again and again, even if, for example, all we want to know about is uniform computation. Even if all we want to know is if we can derandomize BPP, it turns out to be a question about advice. So it's very connected to the rest of complexity. Basically, you can think of an algorithm with advice as being no different than an infinite sequence of algorithms, just like what we saw with the Blum Speedup Theorem. It's just an algorithm, where as you go to larger and larger input lengths, you get to keep using new ideas and get more speedup. This is one way to think about advice.
Gus: Classes based on proofs capture the difficulty of natural problems, but I don't see that there are any natural problems based on advice.
Scott: I think that we have as many natural problems in BQP/qpoly as we have in QMA. QMA-complete problems are kind of in another category, since you wouldn't have thought of them without quantum, whereas Group Non-Membership is something you might think of just on its own without any connection to quantum.
I can give you another argument. We really don't know the initial conditions of the universe. We make the assumption that a quantum computer should always start in this all-zero state, but the question is if that's a justified assumption. The usual argument that it's a justified assumption is that, for whatever other state your quantum computer might start in, there's some physical process that gave rise to that state. Presumably, this is only a polynomial-time physical process. So you could simulate the whole process that gave rise to that state, tracing it back to the Big Bang if needed. But is this really reasonable?
Gus: But when we're thinking about advice, we're thinking about states that may not be preparable in polynomial time. It may not be possible for them to actually ever exist in the universe.
Niel: If there were some kind of fundamental process which gave rise to those kinds of states, wouldn't that motivate changing our elementary gate system?
Scott: It might. You can think of advice as -- I keep changing my metaphors here -- freeze-dried computation. There's some great, enormous sort of computational effort, that we then encapsulate in this convenient polynomially-sized string over in the frozen foods section and that you can go and heat into the microwave to do work with.
Nick: If you were given an advice string, how could you trust it?
Scott: Well, we rolled into the definition of the advice that you trust it, otherwise, it would be a proof. You could study advice that wasn't trustworthy, and in fact, I have. I defined some complexity classes based on untrustworthy advice, but in the usual definition of advice, we assume that it's trustworthy. It's a question of how much computation can we freeze-dry into this polynomially-sized state?
Gus: I agree that it's an interesting theoretical question, but how to you convince the NSF to fund BQP/qpoly studies?
Scott: If I'm ever cast out into the street, I'll have to think of something else, but right now, Ray [Laflamme] has me on board. Would he fire me because BQP/qpoly isn't a "realistic" class? He seems to have not made up his mind, so while I'm not starving, I'll pursue what interests me. That's advice, and it leads pretty directly into the interpretational question that we started with. We're trying to get at what quantum states really are. It's clear that, to try and learn about this question, we could do a billion experiments and maybe they'd all be consistent with the predictions of quantum mechanics. On the other hand, we could sit around and argue like philosophers do, and it's not very clear that such an approach would get us very far, either. What I'm suggesting is a third approach, which is that we put the different views of what quantum states are into various staged battles with each other, in some sort of concrete mathematical or computational setting, and see which ones come out the winner.
Gus: So, you could say that by studying advice, you're studying one way in which quantum states could be different from classical states.
Scott: Yes.

Returning the question of an upper bound for BQP/qpoly, it's again a two-minute endeavor to give a handwaving proof that BQP/qpoly ⊆ PSPACE/poly. I like the way that Greg Kuperberg described the proof. What he said is that what we do if we have some quantum advice and we want to simulate it using classical advice by post-selection, we use a "Darwinian training set" of inputs. We'll say that we've got this machine that takes classical advice, and then we want to describe to this machine some set of quantum advice using only classical advice. To do so, we consider some test inputs X₁, X₂, ..., X_T. Note, by the way, that our classical advice machine doesn't know the true quantum advice state |ψ⟩. The classical advice machine starts by guessing that the quantum advice is the maximally mixed state, since without a priori knowledge, any quantum state is equally likely to be the advice state. Then, X₁ is an input to the algorithm such that if the maximally mixed state is used in place of the quantum advice, the algorithm produces the wrong answer with a probability of greater than one-third. If the algorithm still guesses the right answer, then making a measurement changes the advice state to some new state ρ₁. So why is this process described as "Darwinian"? The next part of the classical advice, X₂ describes some input to the algorithm such that the wrong answer will be produced with probability greater than one-third if the state ρ₁ is used in place of the actual quantum advice. If, despite the high chance of getting the wrong answer when run with X₁ and X₂ as input, the algorithm still produces two correct answers, then we use the resultant estimate of the advice state ρ₂ to produce the next part of the classical advice X₃. Basically, we're trying to teach our classical advice machine the quantum state by repeatedly telling it, "supposing you got all the previous lessons right, here's a new test you're still going to fail. Go and learn, my child."

The point is that if we let |ψ_n⟩ be the true quantum advice, then since we can decompose the maximally mixed state into whatever basis we want, we can imagine it as a mixture of the true advice state that we're trying to learn, and a bunch of things that are all orthogonal to it. Each time that we give a wrong answer with a probability greater than one-third, it's like we're lopping off another third of this space. We then postselect on succeeding. We also know that if we were to start with the true advice state, then we would succeed, and so this process has to bottom out somewhere; we eventually winnow away all the chaff and run out of examples where the algorithm fails.

So, in this setting, quantum states are not acting like exponentially long vectors. They're acting like they only encode some polynomial amount of information, although extracting what you want to know might be exponentially more efficient than if the same information were presented to you classically. Again, we're getting ambiguous answers, but that's what we expected. We knew that quantum states occupy this weird kind of middle realm between probability distributions and exponentially long strings. It's nice to see exactly how this intuition plays out, though, in each of these concrete scenarios. I guess this is what attracts me to quantum complexity theory. In some sense, this is same stuff that Bohr and Heisenberg argued about, but we're now able to ask the questions in a much more concrete way -- and sometimes even answer them.

Lecture 14: Skepticism of Quantum Computing

Scribe: Chris Granade

Last time, we talked about whether quantum states should be thought of as exponentially long vectors, and I got challenged a bit about why I care about the class BQP/qpoly and concepts like quantum advice. Actually, I'd say that the main reason why I care is something I didn't even think to mention last time, which is that it relates whether we should expect quantum computing to be fundamentally possible or not. There are people, like Leonid Levin and Oded Goldreich, who just take it as obvious that quantum computing must be impossible. Part of their argument is that it's extravagant to imagine a world where describing the state of 200 particles takes more bits then there are particles in the universe. To them, this is a clear indication something is going to break down. So part of the reason that I like to study the power of quantum proofs and quantum advice is that it helps us answer the question of whether we really should think of a quantum state as encoding an exponential amount of information.

So, on to the Eleven Objections:

Works on paper, not in practice.
Violates Extended Church-Turing Thesis.
Not enough "real physics."
Small amplitudes are unphysical.
Exponentially large states are unphysical.
Quantum computers are just souped-up analog computers.
Quantum computers aren't like anything we've ever seen before.
Quantum mechanics is just an approximation to some deeper theory.
Decoherence will always be worse than the fault-tolerance threshold.
We don't need fault-tolerance for classical computers.
Errors aren't independent.

What I did is to write out every skeptical argument against the possibility of quantum computing that I could think of. We'll just go through them, and make commentary along the way. Let me just start by saying that my point of view has always been rather simple: it's entirely conceivable that quantum computing is impossible for some fundamental reason. If so, then that's by far the most exciting thing that could happen for us. That would be much more interesting than if quantum computing were possible, because it changes our understanding of physics. To have a quantum computer capable of factoring 10000-digit integers is the relatively boring outcome -- the outcome that we'd expect based on the theories we already have.

I like to engage skeptics for several reasons. First of all, because I like arguing. Secondly, often I find that the best way to come up with new results is to find someone who's saying something that seems clearly, manifestly wrong to me, and then try to think of counterarguments. Wrong people provide a fertile source of research ideas.

So what are some of the skeptical arguments that I've heard? The one I hear more than any other argument is "well, it works formally, on paper, but it's not gonna work in the real world." People actually say this, and they actually treat it like it was an argument. For me, the fallacy here is not that people can have ideas that don't work in the real world, but that if they don't work in the real world, they can still somehow "work on paper." Of course, there could be assumptions such that an idea only works if the assumptions are satisfied. Thus, the question becomes if the assumptions are stated clearly or not.

Q: Do you think maybe this is just a rather unsophisticated way of challenging the assumptions of a result?
Scott: Yes -- but in that case, one hopes the challenge will become more sophisticated!

I was happy to find out that I wasn't the first person to point out this particular fallacy. Immanuel Kant wrote an entire treatise demolishing it: On the Common Saying: "That may be right in theory but does not work in practice."

Before going into the second argument, I'd like to remind you that these are all actual arguments that I've heard -- they aren't strawman arguments. With that in mind, the second argument is that quantum computing must be impossible because it violates the Extended Church-Turing Thesis. That is, we know that quantum computing can't be possible (assuming BPP≠BQP), because we know that BPP defines the limit of the efficiently computable.

Q: What is the Extended Church-Turing Thesis?
Scott: That's the the thesis that anything that is efficiently computable in the physical world is computable in polynomial time on a standard Turing machine.

So, we have this thesis, and quantum computing violates the thesis, so it must be impossible. On the other hand, if you replaced Factoring with NP-complete problems, then this argument would actually become more plausible to me, because I would think that any world in which we could solve NP-complete problems efficiently would not look much like our world. For NP-intermediate problems like Factoring and Graph Isomorphism, I'm not willing to take some sort of a priori theological position.

Q: So you're saying that if somebody came up with a brilliant proposal for solving NP-complete problems, you would be skeptical?
Scott: Yeah. I might even take a position not far from the one that Leonid Levin takes toward quantum computing. People actually do have proposals where you could do the first step of your computation in one second, the next in half a second, then a quarter second and so on, so that after 2 seconds, you'd have done infinitely many steps. Of course, if you could do this, you could solve the Halting Problem. As it turns out, we do sort of understand why this model isn't physical: we believe that the very notion of time starts breaking down when you get down to around 10^-43 seconds (the Planck scale). We don't really know what happens there. Nevertheless, no matter what theory we have for quantum gravity, I would argue that it would have to rule out something like this.
Q: It seems that once you get to the Planck scale, you're getting into a really sophisticated argument. Why not just say you're always limited in practice by noise and imperfection?
Scott: The question is why are you limited? I think that if you try to make the argument precise, ultimately, you're going to be talking about the Planck scale.
Q: It's similar to saying that you can't store a real number in a register.
Scott: But why can't you store a real number in a register?
Q: Is there some reason you feel that Factoring is not in P?
Scott: Dare I say that the reason is that no one can solve it efficiently in practice? Though it's not a good argument, people are certainly counting on it not being in P. Admittedly, we don't have as strong a reason to believe that Factoring is not in P as we do to believe that P≠NP. It's even a semi-respectable opinion to say that maybe Factoring is in P, and that we just don't know enough about number theory to prove it. My own intuitive map of the complexity space is shown off to the side. Factoring, Graph Isomorphism, etc. have structure, and structure can potentially be exploited by algorithms. Maybe not by classical, polynomial-time algorithms, but in some cases it can be exploited by quantum algorithms and in some other cases by hidden-variable algorithms, etc. For NP-complete problems, we really don't have this structure --- at least by conjecture. That just serves to underscore the importance of the P≠NP conjecture. If it were false, then that would change how we think about all of this.

So that was the second argument. On to the third: "I'm suspicious of all these quantum computing papers because there isn't enough of the real physics that I learned in school. There's too many unitaries and not enough Hamiltonians. There's all this entanglement, but my professor told me not to even think about entanglement, because it's all just kind of weird and philosophical, and has nothing to do with the structure of the helium atom." What can one say to this? Certainly, this argument succeeds in establishing that we have a different way of talking about quantum mechanics now, in addition to the ways people have had for many years. Those making this argument are advancing an additional claim, though, which is that the way of talking about quantum mechanics they learned is the only way. I don't know if any further response is needed.

The fourth argument is that "these exponentially small amplitudes are clearly unphysical." This is another argument that Leonid Levin makes. Consider some state of 300 qubits, such that each component has an amplitude of 2^-150. We don't know of any physical law that holds to more than about a dozen decimal places, and you're asking for accuracy to hundreds of decimal points. Why should someone even imagine that makes any sense whatsoever?

Q: Intuitively, this is equivalent to the classical case where each 300-bit string has a 2^-300 probability. In that case, this argument would say that classical probability theory is also unphysical.
Scott: You know what? Why don't you tell the skeptics that? Maybe they'll listen to you...

The obvious repudiation of argument 4, then, is that I can take a classical coin and flip it a thousand times. Then, the probability of any particular sequence is 2^-1000, which is far smaller than any constant we could ever measure in nature. Does this mean that probability theory is some "mere" approximation of a deeper theory, or that it's going to break down if I start flipping the coin too many times?

Bill: Maybe I don't believe this argument myself, but you could make the argument that with classical probability theory, you've got these extremely small amplitudes, but it's over things that you don't know, and it's actually quite deterministic and only happens one way. Meanwhile, in the quantum case, all those amplitudes might matter.
Scott: Right. That is the difference, and that is the argument that is made. Now, though, there's a further problem with the argument, which is that I could take a state like |+⟩^⊗1000. This state has extremely small amplitudes, but presumably not even the staunchest quantum computing skeptic would dispute that we can prepare such a state.
Q: You could probably dispute that we can reliably prepare that state, and that you couldn't actually control 1,000 qubits well enough to verify that each of them is in the |+⟩ state.
Scott: Maybe a physicist could take 1,000 photons and put them through a mirror.
Q: But are you looking at each individual photon to see if it's in the right state?
Q: You can post-select using a beam splitter, and you might get one photon not in the state, but you can account for that.
Scott: Then the question becomes one of if it's somehow illegitimate to put all of the photons into one big tensor product...
Q: We need some kind of formulation of ultra-finitism for physics.
Scott: Right, that's a good way to put it.
Q: Going back to Bill's question, the problem is that the quantum state amplitudes interfere with each other, and they somehow have to "know" what each other's amplitudes are, as opposed to classical probability.
Scott: We'll get to this more later, but for me the key point is that amplitudes evolve linearly, and in that respect are similar to probabilities. We've got minus signs, and so we've got interference, but maybe if we really thought about why probabilities are okay, we could argue that it's not just that we're always in a deterministic state and just don't know what it is, but that this property of linearity is something more general. Linearity is the thing that prevents small errors from creeping up on us. If we have a bunch of small errors, the errors add rather than multiplying. That's linearity.

Argument 5 gets back to what we were talking about in the previous lecture: "it's obvious that quantum states are these extravagant objects; you can't just take 2ⁿ bits and pack them into n qubits." Actually, I was arguing with Paul Davies, and he was making this argument, appealing to the holographic principle and saying that we have a finite upper bound on the number of bits that can be stored in a finite region of spacetime. If you have some 1000-qubit quantum state, it requires 2¹⁰⁰⁰ bits, and according to Davies we've just violated the holographic bound.

So what does one say to that? First of all, this information, whether or not we think it's "there", can't generally be read out. This is the content of results like Holevo's Theorem. In some sense you might be able to pack 2ⁿ bits into a state, but the number of bits that you can reliably get out is only n.

Q: The holographic bound --- I know it's the number of bits per surface area, but what was the proportionality constant?
Scott: 1.4×10⁶⁹. It's a lot, but it's still a constant. Basically, it's one bit per Planck area.
Q: Why isn't it in volume?
Scott: That's a very profound question that people, like Witten and Maldacena, stay up at night worrying about. The doofus answer is that if you try to take lots and lots of bits and pack them into some volume (such as a cubical hard disk), then at some point, your cubical hard disk will collapse and form a black hole. A flat drive will also collapse, but a one-dimensional drive won't collapse.
Q: You haven't been making a very good case for why this bound is measured in square meters.
Scott: Here's the thing: a hard drive will collapse to a black hole when its information density becomes large enough, so at some point, it seems as if you have all these bits that are near the event horizon of the black hole. That's the part that no one really understands yet, but it would suggest that you talk about the surface area of the event horizon.
Q: Why are they at the event horizon?
Scott: If you're standing outside a black hole, you never see someone pass through the event horizon. Then, if you want to preserve unitarity, and not have pure states evolve into mixed states when something gets dropped into a black hole, you say that when the black hole evaporates via Hawking radiation, then the bits get peeled off like scales, and go flying out into space. Again, this is not something that people really understand. People treat the holographic bound (rightfully) as the one of the few clues we have for a quantum theory of gravity, but they don't yet have the detailed theory that implements the bound.
Q: I was wondering if the following would be another approach to understanding it, that doesn't involve black holes. If you're talking about getting the information, you basically have to access what's in the volume, and the only way to do that is to cut through the boundary.
Scott: Maybe, but the problem is why couldn't we say that we've got an amount of information that scales with the volume, but in order to get to the part in the middle, you have to peel away at the stuff on the outside. That seems like a consistent way of looking at it. The information is there, you just have to peel the other stuff away to get at it. The issue that you run up against then is one of gravitational collapse: to store information, you need some amount of energy. If you have enough energy within a bounded region of spacetime, then you pass the Schwarzschild limit, and it collapses.

There actually is an interesting question here. The holographic principle says that you can store only so much information within a region of space, but what does it mean to have stored that information? Do you have to have random access to the information? Do you have to be able to access whatever bit you want and get the answer in a reasonable amount of time? In the case that these bits are stored in a black hole, apparently if there are n bits on the surface, then it takes on the order of n^3/2 time for the bits to evaporate via Hawking radiation. So, the time-order of retrieval is polynomial in the number of bits, but it still isn't particularly efficient. A black hole should not be one's first choice for a hard disk.

The other funny thing about this is that, in classical general relativity, the event horizon doesn't play a particularly special role. You could pass through it and you wouldn't even notice. Eventually, you'll know you passed through it, because you'll be sucked into the singularity, but while you're passing through it, it doesn't feel special. On the other hand, this information point of view says that as you pass through, you'll pass a lot of bits near the event horizon. What is it that singles out the event horizon as being special in terms of information storage? It's very strange, and I wish I understood it.

Argument 6: "a quantum computer would merely be a souped-up analog computer." This I've heard again and again, from people like Robert Laughlin, Nobel laureate, who espoused this argument in his popular book A Different Universe. This is a popular view among physicists. We know that analog computers are not that reliable, and can go haywire because of small errors. The argument proceeds to ask why a quantum computer should be any different, since you have these amplitudes which are continuously varying quantities. Anyone want to answer this one for me?

A: The Threshold Theorem?
Scott: Thank you.

That argument describes what people thought before we had the Threshold Theorem (also called the Quantum Fault-Tolerance Theorem), and yet people are still arguing about it ten years after the theorem.

Q: OK, so you have the Threshold Theorem, but then you have to do some error correction, right? Your computation becomes longer, right?
Scott: Yeah, but by a factor of polylog(n). This isn't challenging the Church-Turing Thesis, but yeah, that's true.
Q: I'm not sure if you'd have to perform another error correction as you proceed.
Scott: The entire content of the Threshold Theorem is that you're correcting errors faster than they're created. That's the whole point, and the whole non-trivial thing that the theorem shows. That's the problem it solves.
Q: Isn't there a Threshold Theorem for classical computing as well?
Scott: There is.
Q: Is there a Threshold Theorem for analog computers?
Scott: No, and there can't be. The point is, there's a crucial property that is shared by discrete theories, probabilistic theories and quantum theories, but that is not shared by analog or continuous theories. That property is insensitivity to small errors. That's really a consequence of linearity. When I think about the Threshold Theorem, I try to take a step back and ask "what does this really mean?" It's really a consequence of the linearity of quantum mechanics. If we want a weaker Threshold Theorem, we could consider a computation taking t time steps, where the amount of error per time step could be 1/t. Then, the Threshold Theorem would be trivial to prove. If we have a product of unitaries U₁U₁...U₁₀₀, and each one were to be corrupted by 1/t (1/100 in this case), then we'd have a product like: (U₁ + U'₁/t) (U₂ + U'₂/t)... (U₁₀₀ + U'₁₀₀/t) The product of all these errors still won't be much, again because of linearity. An observation made by Bernstein and Vazirani was that quantum computation is sort of naturally robust against one-over-polynomial errors. In principle, that already answers the question.
Q: I heard from a physicist that the fidelity of a gate decreases exponentially with the physical distance between the gates. When you increase the number of qubits, then the fidelity decreases exponentially, but you've only gained a linear number of qubits.
Scott: But we know that you can do universal quantum computing in the nearest neighbor model. Thus, even supposing that what you said was true, I don't see how it's a fundamental obstacle. I didn't even bother to explicitly list here arguments that apply to one specific architecture, because I take for granted that what we're talking about is if it's possible in principle to build one of these things. If we want to talk about specific architectures, then we can do that too, but no bait-and switch!

On to argument 7. This is an argument that Dyakonov makes many times in his recent paper. The argument goes that all the systems we have experience with involve very rapid decoherence, and thus that it isn't plausible to think that we could "just" engineer some system which is not like any of the systems in nature that we have any experience with.

Q: Could we sic the "brains are quantum computers" people on these guys?
Scott: That'll be good... put them in a room together. I hadn't thought of that.

I actually had a less amusing reaction, which is that a nuclear fission reactor is also unlike any naturally occurring system in many ways. What about a spacecraft? Things don't normally use propulsion to escape the earth. We haven't seen anything doing that in nature. Or a classical computer. I don't know if more than that needs to be said.

Q: These arguments are all pretty bad. Aren't there good arguments against quantum computing?
Scott: I keep looking for them. What I'm listing, again, are the arguments that I actually hear most often.
Q: Maybe that's homework for the audience.

Next, there are the people who just take it for granted that quantum mechanics must be an approximate theory that only works for a small number of particles. When you go to a larger number of particles, something else must take over. The trouble is, there have been experiments that have tested quantum mechanics with fairly large numbers of particles, like the Zeilinger group's experiment with buckyballs. There have also been SQUID experiments that have prepared the "Schrödinger cat state" |0...0⟩ + |1...1⟩ on n qubits where, depending on what you want to count as a degree of freedom, n is as large as several billion.

Again, though, the fundamental point is that discovering a breakdown of QM would be the most exciting possible outcome of trying to build a quantum computer. And, how else are you going to discover that, but by investigating these things experimentally and seeing what happens? Astonishingly, I meet people (especially computer scientists) who ask me, "what, you're going to expect a Nobel Prize if your quantum computer doesn't work?" To them, it's just so obvious that a quantum computer isn't going to work that it isn't even interesting.

Q: Would this be a credible objection if it offered a reason why?
Scott: Yes.

Some people will say "no, no, I want to make a separate argument. I don't believe that quantum mechanics is going to break down, but even if it doesn't, quantum computing could still be fundamentally impossible, because there's just too much decoherence in the world." These people are claiming that decoherence is a fundamental problem. That is, that the error will always be worse than the fault-tolerance threshold, or maybe that some graviton will always come through and decohere your quantum computer.

Q: In all fairness, the people making this argument believe that we'll never get the error down to the threshold, but we believe the opposite, only because of faith. These are both arguments of faith, and our counter-argument isn't that much more sound than the original.
Response from the floor: It's much easier to posit that the minimum possible error is zero then it is, say, 22. They're just saying that "it's impossible, so there's no point in even trying or to investigate."
Q: I guess there's a fallacy there, but then again, we base some of our complexity theoretic assumptions on faith as well. We assume that anything with a polynomial-time algorithm is "efficient."
Scott: Shor's Algorithm, fortunately, isn't just polynomial, but is n². I love how, especially when you're discussing this on the Internet, people love to raise this issue as if it's something that no complexity theorist has ever thought of. Gosh, it never occurred to me that, even if it's polynomial, it could be n⁵⁰⁰. People love to lecture me about this. Q: What's the worst best known polynomial-time algorithm?
Scott: That's an actual question! Well, the problem is that we could consider the problem, given a Turing machine, does it halt after n^{Ackermann(10,000)} steps? But you could ask, what's the largest polynomial runtime that ever occurred in practice? Again, are we going to count cases where the time is like n^1/ε², where ε is the error we want the algorithm to return with? There are many such algorithms, where the exponent involves a parameter you can vary, and so you can always get an exponent as large as you want by saying that you want a small error. If we exclude that, there are also cases where there are whole sequences of reductions that you've composed with each other. For example, I talked a while ago about this proof from Håstad et al. that you can get a pseudorandom number generator from any one-way function. The whole sequence of reductions involves this kind of blow up, and you get something like n⁴⁰.
Q: And there's no known way to improve on that?
Scott: There might be a way; I don't know the latest. Often when there are these large polynomial running times, people find clever ways to bring them down. Hopefully the same will be true with the fault-tolerance threshold in quantum computing.

The next argument is a little more subtle: for a classical computer, we don't have to go through all this effort. You just get fault-tolerance naturally. You have some voltage that either is less than a lower threshold or is greater than an upper threshold, and that gives us two easily distinguishable states that we can identify as 0 and 1. We don't have to go through the same amount of work to get fault-tolerance. In modern microprocessors, for example, they don't even bother to build in much redundancy and fault-tolerance, because the components are just so reliable that such safeguards aren't needed. The argument then proceeds by noting that you can, in principle, do universal quantum computing by exploiting this fault-tolerant machinery, but that this should raise a red flag --- why do you need all that error correction machinery? Shouldn't this make you suspicious?

Anyone want to give this one a try?

A: The only reason we don't need fault-tolerance machinery for classical computers is that the components are so reliable, but we haven't been able to build reliable quantum computer components yet. Presumably, if we could build extremely reliable components, we wouldn't need error correction and fault-tolerance technology.
Scott: Yes, that's what I would say. In the early days of classical computing, it wasn't clear at all that reliable components would exist. Von Neumann actually proved a classical analog of the Threshold Theorem, then later, it was found that we didn't need it. He did this to answer skeptics who said there was always going to be something making a nest in your JOHNNIAC, insects would always fly into the machine, and that these things would impose a physical limit on classical computation. Sort of feels like history's repeating itself.

We can already see hints of how things might eventually turn out. People are currently looking at proposals such as non-abelian anyons where your quantum computer is "naturally fault-tolerant," since the only processes that can cause errors have to go around the quantum computer with a nontrivial topology. These proposals show that it's conceivable we'll someday be able to build quantum computers that have the same kind of "natural" error correction that we have in classical computers.

I wanted to have a round number of arguments, but I wound up with eleven. So, Argument 11 comes from people who understand the Fault-Tolerance Theorem, but who take issue with the assumption that the errors are independent. This argument posits that it's ridiculous to suppose that errors are uncorrelated, or even that they're only weakly correlated, from one qubit to the next. Instead, the claim is that such errors are correlated, albeit in some very complicated way. In order to understand this argument, you have to work from the skeptics' mindset: to them, this isn't an engineering issue, it's given a priori that quantum computing is not going to work. The question is how to correlate the errors such that quantum computing won't work.

My favorite response to this argument comes from Daniel Gottesman who was arguing about this against Levin, who of course believes that the errors will be correlated in some conspiracy that defies the imagination. Gottesman said, supposing the errors were correlated in such a diabolical fashion and that Nature should undergo so much work to kill off quantum computation, why couldn't you turn that around and use whatever diabolical process Nature employs to get access to even more computational power? Maybe you could even solve NP-complete problems. It seems like Nature would have to expend enormous amounts of effort just to correlate qubits so as to kill quantum computation.

Q: Not only would your errors have to be correlated in some diabolical way, they'd have to be correlated in some unpredictable diabolical way. Otherwise, you could deal with the problem in general.

To summarize, I think that arguing with skeptics is not only amusing but extremely useful. It could be that quantum computing is impossible for some fundamental reason. So far, though, I haven't seen an argument that's engaged me in a really nontrivial way. That's what I'm still waiting for. People are objecting to this or to that, but they aren't coming up with some alternative picture of the world in which quantum computing wouldn't be possible. That's what's missing for me, and what I keep looking for in skeptical arguments and not finding.

Q: What about theargument that quantum mechanics breaks down with so many particles?
Scott: Even then, they usually don't give an actual theory in which it would fall apart. They just say "it will fall apart."
Q: I was curious as to how you'd respond to a generic argument that tends to come from people outside of a field when discussing a possible invention: "Well, maybe X is useful, but until you build one, it doesn't seem like a good investment." I've heard this about quantum computing, fusion, etc.
Scott: Well, we know nuclear fusion is possible — the Sun does it!
Q: Sometimes, the reason why not is a fundamental problem, other times it's just an engineering problem. You do hear the same argument in other contexts.
Scott: Well, now it just boils down to a question of "what are you interested in?" These same people who say that it's not practical yet, and that we should go back to more practical work, will go do something like (if they're a theoretical computer scientist) improve an n log n algorithm to an n log n / log log log n algorithm. Very little of what we do in theoretical computer science is directly connected to a practical application. That's just not what we're trying to do. Of course, what we do has applications, but indirectly. We're trying to understand computation. If you take that as our goal, then it seems clear that starting from the best physical theories we have is a valuable activity. If you want to ask a different question, such as what we can do in the next five to ten years, then, that's fine. Just make it clear that's what you're doing. Again, what annoys me are people who say that they're talking about what's possible even in principle, who then switch to talking about what's possible in the next few years.

I'll close with a question that you should think about before the next lecture. If we see 500 crows, which are all black, should we expect that the 501^st crow we see will also be black? If so, why? Why would seeing 500 black crows give you any grounds whatsoever to draw such a conclusion?

Lecture 15: Computational Learning

Scribe: Chris Granade

Last lecture, you were given Hume's Problem of Induction as a homework assignment.

Puzzle. If you observe 500 black ravens, what basis do you have for supposing that the next one you observe will also be black?

Many people's answer would be to apply Bayes’ Theorem. For this to work, though, we need to make some assumption like that all the ravens are drawn from the same distribution. If we don’t assume that the future resembles the past at all, then it’s very difficult to get anything done. This kind of problem has led to lots of philosophical arguments like the following.

Suppose you see a bunch of emeralds, all of which are green. This would seem to lend support to the hypothesis that all emeralds are green. But then, define the word grue to mean “green before 2050 and blue afterwards.” Then, the evidence equally well supports the hypothesis that all emeralds are grue, not green. This is known as the grue paradox.

If you want to delve even “deeper,” then consider the “gavagai” paradox. Suppose that you’re trying to learn a language, and you’re an anthropologist visiting an Amazon tribe speaking the language. (Alternatively, maybe you’re a baby in the tribe. Either way, suppose you’re trying to learn the language from the tribe.) Then, suppose that some antelope runs by and some tribesman points to it and shouts “gavagai!” It seems reasonable to conclude from this that the word “gavagai” means “antelope” in their language, but how do you know that it doesn’t refer to just the antelope’s horn? Or it could be the name of the specific antelope that ran by. Worse still, it could mean that a specific antelope ran by on some given day of the week! There’s any number of situations that the tribesman could be using the word to refer to, and so we conclude that there is no way to learn the language, even if we spend an infinite amount of time with the tribe.

There’s a joke about a planet full of people who believe in anti-induction: if the sun has risen every day in the past, then today, we should expect that it won’t. As a result, these people are all starving and living in poverty. Someone visits the planet and tells them, “Hey, why are you still using this anti-induction philosophy? You’re living in horrible poverty!”

“Well, it never worked before...”

What we want to talk about today is the efficiency of learning. We’ve seen all these philosophical problems that seem to suggest that learning is impossible, but we also know that learning does happen, and so we want to give some explanation of how it happens. This is sort of a problem in philosophy, but in my opinion the whole landscape around the problem has been transformed in recent years by what's called “computational learning theory.” This is not as widely known as it should be. Even if you’re (say) a physicist, it’s nice to know something about this theory, since it gives you a framework---different from the better-known Bayesian framework, but related to it, and possibly more useful in some contexts---for deciding when you can expect a hypothesis to predict future data.

I think a key insight that any approach has to take on board---whether it's Bayesianism, computational learning theory, or something else---is that we're never considering all logically conceivable hypotheses on an equal footing. If you have 500 ravens, each either white or black, then in principle there are 2⁵⁰⁰ hypotheses that you have to consider. If the ravens could also be green, that would produce still more hypotheses. In reality, though, you’re never considering all of these as equally possible. You’re always restricting your attention to some minuscule subset of hypotheses---broadly speaking, those that are "sufficiently simple"---unless the evidence forces you to a more complex hypothesis. In other words, you're always implicitly using what we call Occam’s Razor (all though it isn’t at all clear if it’s what Occam meant).

Why does this work? Fundamentally, because the universe itself is not maximally complicated. We could well ask why it isn’t, and maybe there’s an anthropic explanation, but whatever the answer, we accept as an article of faith that the universe is reasonably simple, and we do science.

This is all talk and blather, though. Can we actually see what the tradeoffs are between the number of hypotheses we consider and how much confidence we can have in predicting the future? One way we do this was formalized by Leslie Valiant in 1984. His framework is called PAC-learning, where PAC stands for “probably approximately correct.” We aren’t going to predict everything that happens in the future, nor will we even predict most of it with certainty, but with high probability, we’ll try to get most of it right.

So how does this work? We’ll have a set S which could be finite or infinite, called our sample space. For example, we’re an infant trying to learn a language, and are given some examples of sentences which are grammatical or ungrammatical. From this, we need to come up with a rule for deciding whether a new sentence is grammatical or not. Here, our sample space is the set of possible sentences.

A concept is a Boolean function f : S → {0,1} (we can later remove the assumption that concepts are Boolean, but for simplicity, we’ll stick with it for now) that maps each element of the sample space to either 0 or 1. In our example, the concept is the language that we’re trying to learn; given a sentence, the concept tells us whether it is or isn’t grammatical. Then, we can have a concept class, denoted C. Here, C can be thought of as the set of languages which our baby comes in to the world thinking a priori to be possible, before gathering any data as to the actual language spoken.

Q: Good thing there aren’t any experimental philosophers.
Scott: You can actually connect some of this stuff to experiments. For example, this theory has been used in experiments on things like neural networks and machine learning. When I was writing a paper on PAC-learning recently, I wanted to find out how the theory was actually used, so I looked on Google Scholar. The paper by Valiant was cited about 2,000 times and about half of the citations seemed to be from experimenters of various sorts. Based on this, we can infer that further papers are likely.

For now, we’re going to say that we have some probability distribution D over the samples. In the infant example, this is like the distribution from which the child’s parents or peers draw what sentences to speak. The baby does not have to know what this distribution is. We just have to assume that it exists.

So what’s the goal? We’re given m examples x_i drawn independently from the distribution D, and for each x_i, we’re given f(x_i); that is we’re told whether each of our examples is or isn’t grammatical. Using this, we want to output a hypothesis language h such that:

That is, we want our hypothesis h to disagree with the concept f no more than ε of the time given examples x drawn from our distribution D. Can we hope to do this with certainty? No? Well, why not?

Response from the floor: You might get unlucky with the samples you’re given. If there are only finitely many of them, I guess you can do it with certainty.
Scott: Even then, you could be given the same sentence over and over again. If the only sentence you’re ever exposed to as a baby is “what a cute baby!” then you’re not going to have any basis for deciding whether “we hold these truths to be self-evident” is also a sentence.
Response from the floor: So if you could hear all the sentences, then you’d be sure.
Scott: That’s correct, but you don’t know that you’d hear all the sentences. In fact, we should assume that there are exponentially many possible sentences, of which the baby only hears a polynomial number.

So, we say that we only need to output an ε-good hypothesis with probability 1 − δ over the choice of samples. Now, we can give the basic theorem from Valiant's paper:

Theorem: In order to satisfy the requirement that the output hypothesis h agrees with 1 − ε of the future data from drawn from D, with probability 1 − δ over the choice of samples, it suffices to find any hypothesis h that agrees with samples chosen independently from D.

The key point about this bound is that it's logarithmic in the number of possible hypotheses |C|. Even if there are exponentially many hypotheses, this bound is still polynomial. Now, why do we ask that the distribution D on which the learning algorithm will be tested is the same as the distribution from which the training samples are drawn?

Response from the floor: Because if your example space is a limited subset of sample space, then you're hosed.
Scott: Right. This is like saying that nothing should be on the quiz that wasn’t covered in class. If the sentences that you hear people speaking have support only in English, and you want a hypothesis that agrees with French sentences, this is not going to be very possible. There’s going to have to be some assumption about the future resembling the past.

Once you make this assumption, then Valiant’s theorem says that for a finite number of hypotheses, with a reasonable number of samples, you can learn.

Q: So there’s no other assumption involved?
Scott: There’s really no other assumption involved.
Q: But a Bayesian would certainly tell you that if your priors are different, then you’ll come to an entirely different conclusion.
Scott: Certainly. If you like, you can see this entire lecture as a critique of the Bayesian religion. I mean, I respect their faith and all, but not when they try to impose it on others.
Q: But there should be some point where the two either are reconciled, or disagree.
Scott: I can speak to that. The Bayesians start out with a probability distribution over the possible hypotheses. As you get more and more data, you update this distribution using Bayes’ Rule. That’s one way to do it, but computational learning theory tells us that it's not the only way. You don’t need to start out with any assumption about a probability distribution over the hypotheses. You can make a worst-case assumption about the hypothesis (which we computer scientists love to do, being pessimists!), and then just say that you'd like to learn any hypothesis in the concept class, for any sample distribution, with high probability over the choice of samples. In other words, you can trade the Bayesians' probability distribution over hypotheses for a probability distribution over sample data. In a lot of cases, this is actually preferable: you have no idea what the true hypothesis is, which is the whole problem, so why should you assume some particular prior distribution? We don’t have to know what the prior distribution over hypotheses is in order to apply computational learning theory. We just have to assume that there is a distribution.

The proof of Valiant's theorem is really simple. Given a hypothesis h, call it bad if it disagrees with f for at least an ε fraction of the data. Then, for any specific bad hypothesis h, since x₁, ..., x_m are independent we have that:

Pr[h(x₁) = f(x₁), ..., h(x_m) = f(x_m)] < (1 −ε)^m

Now, what is the probability that there exists a bad hypothesis h ∈ C that agrees with all the sample data? We can use the union bound:

Pr[there exists a bad h that agrees with f for all samples] < |C| (1 −ε)^m

We can set this equal to δ and solve for m. Doing so gives that:

QED.

This gives us a bound on the number samples needed for a finite set of hypotheses, but what about infinite concept classes? For example, what if we're trying to learn a rectangle in the plane? Then our sample space is the set of points in the plane, and our concept class is the set of all filled-in rectangles. Suppose we’re given m points, and for each one are told whether or not it belongs to a a "secret rectangle."

Well, how many possible rectangles are there? There are 2^ℵ₀ possibilities, so we can’t apply the previous theorem! Nevertheless, given 20 or 30 random points in the rectangle, and 20 or 30 random points not in the rectangle but near it, intuitively it seems like we have a reasonable idea of where the rectangle is. Can we come up with a more general learning theorem to apply when the concept class is infinite? Yes, but first we need a concept called shattering.

For some concept class C, we say that a subset of the sample space {s₁, s₂, ..., s_k} is shattered by C if, for all 2^k possible classifications of s₁, s₂, ..., s_k, there is some function f ∈ C that agrees with that classification. Then, define the VC dimension of the class C, denoted VCdim(C) as the size of the largest subset shattered by C.

What is the VC dimension of the concept class of rectangles? We need the largest set of points such that for each possible setting of whether each point is or is not in the rectangle, there is some rectangle that contains only the points we want and not the other ones. The diagram below illustrates how to do this with four points. On the other hand, there's no way to do it with five points (proof: exercise for you!).

One corollary of the next theorem is that one can perform PAC learning, with a finite number of samples, if and only if the VC dimension of the concept class is finite.

Theorem (Blumer, Ehrenfeucht, Haussler, Warmuth 1989). In order to produce a hypothesis h that will explain a 1 − ε fraction of future data drawn from distribution D, with probability 1 − δ, it suffices to output any h in C that agrees with sample points drawn independently from D. Furthermore, this is tight (up to the dependence on ε).

This theorem is harder to prove than the last one, and would take a whole lecture in itself, so we'll skip it here. The intuition behind the proof, however, is simply Occam’s Razor. If the VC dimension is finite, then after seeing a number of samples that's larger than the VC dimension, the entropy of the data that you’ve already seen should only go roughly as the VC dimension. You make m observations, after which the possible number of things that you’ve seen is less than 2^m, otherwise VCdim(C) ≥ m. It follows that to describe these m observations takes less than m bits. This means that you can come up with a theory which explains the past data, and which has fewer parameters than the data itself.

If you can do that, then intuitively, you should be able to predict the next observation. On the other hand, supposing you had some hypothetical theory in (say) high-energy physics such that, no matter what the next particle accelerator found, there'd still be some way of–I don't know–curling up extra dimensions or something to reproduce those observations [pause for laughter]---well, in that case you’d have a concept class whose VC dimension was at least as great as the number of observations you were trying to explain. In such a situation, computational learning theory gives you no reason to expect that whatever hypothesis you output will be able to predict the next observation.

The upshot is that this intuitive trade-off between the compressibility of past data and the predictability of future data can actually be formalized and proven; given reasonable assumptions, Occam’s Razor is a theorem.

What if the thing that we’re trying to learn is a quantum state, say some mixed state ρ? We could have a POVM measurement E with two outcomes. In this case, I’ll say that E “accepts” ρ with probability tr (Eρ), and “rejects” ρ with probability 1 − tr (Eρ). For simplicity, we’ll restrict ourselves to two outcome measurements. If we’re given some state ρ, what we’d like to be able to do is to predict the outcome of any measurement made on the state. That is, to predict tr (Eρ) for any two-outcome POVM measurement E. This is easily seen to be equivalent to quantum state tomography, which is recovering the density matrix ρ itself.

But, what is ρ? It's some n-qubit state represented as a 2ⁿ × 2ⁿ matrix with 4ⁿ independent parameters. The number of measurements needed to do tomography on an n-qubit state is well-known to grow exponentially with n. Indeed, this is already a serious practical problem for the experimenters. To learn an 8-qubit state, you might have to set your detector in 65,536 different ways, and to measure in each way hundreds of times to get a reasonable accuracy.

So again, this is a practical problem for experimenters. But is it a conceptual problem as well? Some quantum computing skeptics seem to think so; we saw in the last lecture that one of the fundamental criticisms of quantum computing is that it involves manipulating these exponentially long vectors. To some skeptics, this is an inherently absurd way of describing the physical world, and either quantum mechanics is going to break down when we try to do this, or there's something else that we must not have taken into account, because you clearly can’t have 2ⁿ "independent parameters" in your description of n particles.

Now, if you need to make an exponential number of measurements on a quantum state before you know enough to predict the outcome of further measurements on it, then this would seem to be a way of formalizing the above argument and making it more persuasive. After all, our goal in science is always to come up with hypotheses that succinctly explain past observations, and thereby let us predict future observations. We might have other goals, but at the least we want to do that. So if, to characterize a general state of 500 qubits, you had to make more measurements than you could in the age of the universe, that would seem to be a problem with quantum mechanics itself, considered as a scientific theory. I'm actually inclined to agree with the skeptics about that.

Recently I had a paper where I tried to use computational learning theory to answer this argument. At a conference in London recently, Umesh Vazirani had a really nice way of explaining my result. He said, suppose you're a baby trying to learn a rule for predicting whether or not a given object is a chair. You see a bunch of objects labeled "chair" or "not-chair", and based on that you come up with general rules ("a chair has four legs," "you can sit on one," etc.) that work pretty well in most cases. Admittedly, these rules might break down if (say) you're in a modern art gallery, but we don’t worry about that. In computational learning theory, we only want to predict most of the future observations that you’ll actually make. If you’re a Philistine, and don’t go to MOMA, then don’t you worry about any chair-like objects that might be there. We need to take into account the future intentions of the learner, and for this reason, we relax the goal of quantum state tomography to the goal of predicting the outcomes of most measurements drawn from some probability distribution D.

More formally, given a mixed state ρ on n qubits, as well as given measurements E₁,E₂, ..., E_m ~ D and estimated probabilities p_j ≈ tr (E_jρ) for each j ∈ {1,2,...,m}, is to produce a hypothesis state σ which has, with probability at least 1 − δ, the property that:

For this goal, I gave a theorem that bounds the number of sample measurements needed:

Theorem. Fix error parameters ε, δ and γ and fix η > 0 such that γε ≥ 7η. Call E=(E₁,...,E_m) a "good" training set of measurements if any hypothesis state σ that satisfies |Tr(E_iσ)−Tr(E_iρ)|≤η also satisfies: Then, there exists a constant K > 0 such that E is a good training set with probability at least 1 − δ over E₁,...,E_m drawn from D, provided that m satisfies:

It’s important to note that this bound is only linear in the number of qubits n, and so this tells us that the dimensionality is not actually exponential in the number of qubits, if we only want to predict most of the measurement outcomes.

Why is this theorem true? Remember the result of Blumer et al, which said that you can learn with a number of samples that grows linearly with the VC dimension of your concept class. In the case of quantum states, we’re no longer dealing with Boolean functions. You can think of a quantum state as a real-valued function that takes as input a two-outcome measurement E, and produces as output a real number in [0,1] (namely, the probability that the measurement accepts). That is, ρ takes a measurement E and returns tr (Eρ).

So, can one generalize the Blumer et al. result to real-valued functions? Fortunately, this was already done for me by Alon, Ben-David, Cesa-Bianchi, and Haussler, and by Bartlett and Long among others.

Next, recall from Lecture 13 Ambainis, Nayak, et al.'s lower-bound on random access codes, which tells us how many classical bits can be reliably encoded into a state of n qubits. Given an m-bit classical string x, suppose we want to encode x into a quantum state of n qubits, in such a way that any bit x_i of our choice can later be retrieved with probability at least 1-ε. Ambainis et al. proved that we really can't get any savings by packing classical bits into a quantum state in this way. That is to say, n still must be linear in m. Since this is a lower-bound, we can view it as a limitation of quantum encoding schemes. But we can also turn it on its head, and say: this is actually good, as it implies some upper bound on the VC dimension of quantum states considered as a concept class. Roughly speaking, the theorem tells us that the VC dimension of n-qubit states considered as a concept class is at most m = O(n). To make things more formal, we need a real-valued analogue of VC dimension (called the “fat-shattering” dimension; don’t ask), as well as a theorem saying that we can learn any real-valued concept class using a number of samples that grows linearly with its fat-shattering dimension.

What about actually finding the state? Even in the classical case, I’ve completely ignored the computational complexity of finding a hypothesis. I’ve said that if you somehow found a hypothesis consistent with the data, then you’re set, and can explain future data, but how do you actually find the hypothesis? For that matter, how do you even write down the answer in the quantum case? Writing out the state explicitly would take exponentially many bits! On the other hand, maybe that’s not quite so bad, since even in the classical case, it can take exponential time to find your hypothesis.

What this tells us is that, in both cases, if you care about computational and representational efficiency, then you’re going to have to restrict the problem to some special case. The results from today's lecture, which tell us about sample complexity, are just the beginning of learning theory. They answer the first question, the information-theoretic question, telling us that it suffices to take a linear number of samples. The question of how to find and represent the hypothesis comprises much of the rest of the theory. As yet, almost nothing is known about this part of learning theory in the quantum world.

I can tell you, however, some of what's known in the classical case. Maybe disappointingly, a lot of what's known takes the form of hardness results. For example, with a concept class of Boolean circuits of polynomial size, we believe it's a computationally hard problem to find a circuit (or equivalently, a short efficient computer program) that outputs the data that you’ve already seen, even supposing such a circuit exists. Of course we can’t actually prove that this problem has no polynomial-time algorithm (for that would prove P≠NP), nor, as it turns out, can we even prove in our current state of knowledge that it's NP-complete. What we do know is that the problem is at least as hard as inverting one-way functions, and hence breaking almost all modern cryptography. Remember when we were talking about cryptography in Lecture 8, we talked about one-way functions, which are easy to compute but hard to invert? As we discussed then, Hâstad, Impagliazzo, Levin, and Luby proved in 1997 that from any one-way function one can construct a pseudorandom generator, which maps n "truly" random bits to (say) n² bits that are indistinguishable from random by any polynomial-time algorithm. And Goldreich, Goldwasser and Micali had shown earlier that from any pseudorandom generator, one can construct a pseudorandom function family: a family of Boolean functions f:{0,1}ⁿ→{0,1} that are computed by small circuits, but are nevertheless indistinguishable from random functions by any polynomial-time algorithm. And such a family of functions immediately leads to a computationally-intractable learning problem.

Thus, we can show based on cryptographic assumptions that these problems of finding a hypothesis to explain data that you’ve seen are probably hard in general. By tweaking this result a bit, we can say that if finding a quantum state consistent with measurements that you’ve made can always be done efficiently, then there’s no one-way function secure against quantum attack. What this is saying is that we kind of have to give up hope of solving these learning problems in general, and that we have to just look at special cases. In the classical case, there are special concept classes that we can learn efficiently, such as constant-depth circuits or parity functions. I expect that something similar will be true in the quantum world.

Next Week's Puzzle

In addition to the aforementioned rectangle learning puzzle, here's another raven puzzle, due to Carl Hempel. Let’s say that you want to test our favorite hypothesis that all ravens are black. How do we do this? We go out into the field, find some ravens, and see if they’re black. On the other hand, let’s take the contrapositive of our hypothesis, which is logically equivalent: “all not-black things are non-ravens.” This suggests that I can do ornithology research without leaving my office! I just have to look at random objects, note that they are not black, and check if they are ravens. As I go along, I gather data that increasingly confirms that all not-black things are non-ravens, confirming my hypothesis. The puzzle is whether this approach works. You’re allowed to assume for this problem that I do not go out bird-watching in fields, forests or anywhere else.

Lecture 16: Interactive Proofs and More

Scribe: Chris Granade

Last lecture, I ended by giving you a puzzle problem: can I do ornithology without leaving my office?

I want to know if all ravens are black. The old-fashioned approach would involve going outside, looking for a bunch of ravens and seeing if they're black or not. The more modern approach: look around the room at all of the things which are not black and note that they also are not ravens. In this way, I become increasingly convinced that all not-black things are not ravens, or equivalently, that all ravens are black. Can I be a leader in the field of ornithology this way?

A: Well, when you're doing the not-black things, there's a lot more possible observations. You could get the same effect as measuring millions of not-black things by measuring just one raven.
Scott: Yeah, I think that's a large part of it. Anyone else?
A: You wouldn't be getting a good random sample of non-black things by just sitting in your office.
Scott: I wouldn't be getting a random sample of all ravens either. I'd be getting some of those ravens that live in Waterloo.

Something completely tangential that I'm reminded of: there's this game where you're given four cards, each of which you're promised has a letter on one side and a number on the other. If what you can see of the cards is shown in the figure below, which cards do you need to flip over to test the rule that all cards with a K on one side have a 3 on the other?

Apparently, if you give this puzzle to people, the vast majority get it wrong. In order to test that K ⇒ 3, you need to flip the K and the 1. On the other hand, you can give people a completely equivalent problem, where they're a bouncer at a bar and need to know if anyone under 21 (or 19 in Canada) is drinking, and where they're told that there's someone who is drinking, someone who isn't drinking, someone who's over 21 and someone who's under 21. In this scenario, funny enough, most people get it right. You ask the person who's drinking, and the underage customer. This is a completely equivalent problem to the cards, but if you give it to the people in the abstract way, many say (for example) that you have to turn over the 3 and the Q, which is wrong. So, people seem to have this built-in ability to reason logically about social situations, but they have to be painstakingly taught to apply that same ability to abstract mathematical problems.

Anyway, the point is that there are many, many more not-black things then there are ravens, so if there were a pair (raven, not-black), then we would be much more likely to find it by randomly sampling a raven then sampling a not-black thing. Therefore, if we sample ravens and fail to find a not-black raven, then we're much more confident in saying that "all ravens are black," because our hypothesis had a much higher chance of being falsified by sampling ravens.

Interactive Proofs

Why should we in quantum computing care about interactive proofs? I'll answer this question in a rather unconventional way, by asking a different question: can quantum computers be simulated efficiently by classical computers?

I was talking to Ed Fredkin a while ago, and he said that he believes that the whole universe is a classical computer and thus everything can be simulated classically. But instead of saying that quantum computing is impossible, he takes things in a very interesting direction, and says that BQP must be equal to P. Even though we have factoring algorithms for quantum computers that are faster than known classical algorithms, that doesn't mean that there isn't a fast classical factoring algorithm that we don't know about. On the other side you have David Deutsch, who makes an argument that we've talked about several times before: if Shor's Algorithm doesn't involve these "parallel universes," then how is it factoring the number? Where was the number factored, if not using these exponentially many universes? I guess one way that you could criticize Deutsch's argument (certainly not the only way), is to say he's assuming that there isn't an efficient classical simulation. We believe that there's no way for Nature to perform the same computation using polynomial classical resources, but we don't know that. We can't prove that.

Q: How many people have tried to prove that you can't classically simulate a quantum computer?
Scott: I don't even know that anyone has looked specifically at this question or tried to prove this directly. The crucial point is that if you could prove that P≠BQP, then you would have also proved that P≠PSPACE. (Physicists might think it's obvious these classes are unequal and it doesn't even require proof, but that's another matter...) As for going in the other direction and proving P = BQP, I guess people have tried that. I don't know if I should say this in public, but I've even spent a day or two on it. It would at least be nice to put BQP in AM, or the polynomial hierarchy---some preliminary fact like that. Unfortunately, I think we simply don't yet understand efficient computation well enough to answer such questions, completely leaving aside the quantum aspect.

The question is, if P≠BQP, P≠NP, etc., why can't anyone prove these things? There are several arguments that have been given for that. One of them is relativaztion. We can talk about giving a P computer and a BQP computer access to the same oracle. That is, give them the same function that they can compute in a single computation step. There will exist an oracle that makes them equal and there will exist another oracle that makes them unequal. The oracle that makes them equal, for example, could just be a PSPACE oracle which kind of sandwiches everything and just makes everything equal to PSPACE. The oracle that makes them unequal could be an oracle for Simon's Problem, or some period-finding problem that the quantum computer can solve but the classical one can't. Then, you see that any proof technique is going to have to be sensitive to the presence of these oracles. This doesn't sound like such a big deal until you realize that almost every proof technique we have is not sensitive to the presence of oracles. It's very hard to come up with a technique that is sensitive, and that---to me---is why interactive proofs are interesting. This is the one clear and unambiguous example I can show you of a technique we have that doesn't relativize. In other words, we can prove that something is true, which wouldn't be true if you just gave everything an oracle. You can see this as the foot in the door or the one distant point of light in this cave that we're stuck in. Through the interactive proof results, we can get a tiny glimmer of what the separation proofs would eventually have to look like if we ever came up with them. The interactive proof techniques seem much too weak to prove anything like P≠NP, or else you would have heard about it. (Note: A year after giving this lecture, Avi Wigderson and I proposed algebrization, which gives a formal explanation for why the interactive proof techniques are too weak to prove P≠NP and other basic conjectures in complexity theory.) Already, though, we can use these techniques to get some non-relativizing separation results. I'll show you some examples of that also.

Q: What about P versus BPP? What's the consensus there?

The consensus is that P and BPP actually are equal. We know from Impagliazzo and Wigderson that if we could prove that there exists a problem solvable in 2ⁿ time that requires circuits of size 2^Ω(n), then we could construct a very good pseudorandom generator; that is, one which cannot be distinguished from random by any circuit of fixed polynomial size. Once you have such a generator, you can use it to derandomize any probabilistic polynomial-time algorithm. You can feed your algorithm the output of the pseudorandom generator, and your algorithm won't be able to tell the difference between it and a truly random string. Therefore, the probabilistic algorithm could be simulated deterministically. So we really seem to be seeing a difference between classical randomness and quantum randomness. It seems like classical randomness really can be efficiently simulated by a deterministic algorithm, whereas quantum "randomness" can't. One intuition for this is that, with a classical randomized algorithm, you can always just "pull the randomness out" (i.e., treat the algorithm as deterministic and the random bits as part of its input). On the other hand, if we want to simulate a quantum algorithm, what does it mean to "pull the quantumness out?"

Q: I could see how, with two classes that are different, adding an oracle could kind of "boost" them up to the same level, but if two classes are the same, intuitively, how can giving them more power make them different?
Scott: That's a good question, and the key is to realize that when we feed an oracle to a class, we aren't acting on the class itself. We're acting on the definition of the class. As an example, even though we believe P = BPP in the real world, it's very easy to construct an oracle O where P^O≠BPP^O. Clearly, if what we were doing was operating on the classes, then operating on two equal classes would give two results that were still equal. But that's not what we're doing, and maybe the notation is confusing that way. (A rough analogy: "The third planet from the Sun is the third planet from the Sun" is a tautology, whereas "Earth is the third planet from the Sun" is not a tautology---even though, as it turns out, Earth = the third planet from the Sun.)
Q: Are there any classes that are provably equal, for which there's an oracle that makes them unequal?
Scott: Yes. We're going to see that today.

So let's see this one example of a non-relativizing technique. So we've got a Boolean formula (like the ones used in SAT) in n variables which is not satisfiable. What we'd like is a proof that it's not satisfiable. That is, we'd like to be convinced that there is no setting of the n variables that causes our formula to evaluate to TRUE. This is what we saw before as an example of a coNP-complete problem. The trouble is that we don't have enough time to loop through every possible assignment and check that none of them work. Now the question that was asked in the 80s was, "what if we have some super-intelligent alien that comes to Earth and can interact with us?" We don't trust the alien and its technology, but we'd like it to prove to us that the formula is unsatisfiable in such a way that we don't have to trust it. Is this possible?

Normally in computational complexity, when you can't answer a question, the first thing you do is find an oracle where the question is true or false. It's probably like what physicists do when they do perturbative calculations. You do it because it you can, not because it necessarily tells you what you ultimately want to know. So this is what Fortnow and Sipser did in the late 80s. They said, all right, suppose you have an exponentially long string, and the alien wants to convince you that this exponentially long string is the all-zero string. That is, that there are no 1's anywhere. So can this prover do it? Let's think of what could happen. The prover could say, "the string is all zeroes."
"Well, I don't believe you. Convince me."
"Here, this location's a zero. This one's also a zero. So is this one..."
OK, now there's only 2^10,000 bits left to check, and so the alien says "trust me, they're all zeroes." There's not a whole lot the prover can do. Fortnow and Sipser basically formally proved this obvious intuition. Take any protocol of messages between you and the prover that terminates with you saying "yes" if you're convinced and "no" if you aren't. What we could then do is to then pick one of the bits of the string at random, surreptitiously change it to a 1, and almost certainly, the entire protocol goes through as before. You'll still say that the string is all zeroes.

As always, we can define a complexity class: IP. This is the set of problems where you can be convinced of a "yes" answer by interacting with the prover. So we talked before about these classes like MA and AM---those are where you have a constant number of interactions. MA is where the prover sends a message to you and you perform a probabilistic computation to check it. In AM, you send a message to the prover, and then the prover sends a message back to you and you run a probabilistic computation. It turns out that with any constant number of interactions, you get the same class AM, so let's be generous and allow polynomially many interactions. The resulting class is IP. So what Fortnow and Sipser did is they gave a way of constructing an oracle relative to which coNP is not in IP. They showed that, relative to this oracle, you cannot verify the unsatisfiability of a formula via a polynomial number of interactions with a prover. Following the standard paradigm of the field, of course we can't prove unconditionally that coNP is not in IP, but this gives us some evidence; that is, it tells us what we might expect to be true.

Now for the bombshell (which was discovered by Lund, Fortnow, Karloff, and Nisan): in the "real," unrelativized world, how do we show that a formula is unsatisfiable? We're going to somehow have to use the structure of the formula. We'll have to use that it's a Boolean formula that was explicitly given to us, and not just some abstract Boolean function. What will we do? Let's assume this is a 3SAT problem (since 3SAT is NP-complete, that assumption is without loss of generality). There's a bunch of clauses (n of them) involving three variables each, and we want to verify that there's no way to satisfy all the clauses. Now what we'll do is map this formula to a polynomial over a finite field. This trick is called arithmetization. Basically, we're going to convert this logic problem into an algebra problem, and that'll give us more leverage to work with. This is how it works: we rewrite our 3SAT instance as a product of degree-3 polynomials. Each clause---that is, each OR of three literals---just becomes 1 minus the product of 1 minus each of the literals: e.g., (x OR y OR z) becomes 1-(1-x)(1-y)(1-z). Notice that, so long as x, y, and z can only take the values 0 and 1, this polynomial is exactly equivalent to the logic expression that started with. But now, what we can do is reinterpret the polynomial as being over some much larger field. Pick some reasonably large prime number N, and we'll interpret the polynomial as being over GF_N (the field of N elements). I'll call the polynomial P(x₁,...,x_n). Now what we want to verify is this: there are no satisfying assignments, or equivalently, that if you take P(x₁,...,x_n) and sum it over all possible Boolean settings of x₁,...,x_n, then you get zero. The problem, of course, is that this doesn't seem any easier than what we started with! We've got this sum over exponentially many terms, and we have to check every one of them and make sure that they're all zero. But now, we can have the prover help us. If we just have this string of all zeroes, and he just tells us that it's all zeroes, we don't believe him. But now, we've lifted everything to a larger field and we have some more structure to work with.

Q: Why does it follow that if the formula is unsatisfiable, then the sum evaluates to zero?
Scott: If the formula is unsatisfiable, then no matter what setting x₁,...,x_n you pick for the variables, there's going to be some clause in the formula that isn't satisfied. Hence one of the degree-3 polynomials that we're multiplying together will be zero, and hence the product will itself be zero. And since this is true for all 2ⁿ Boolean settings of x₁,...,x_n, you'll still get zero if you sum P(x₁,...,x_n) over all of them.

So now what can we do? What we ask the prover to do is to sum for us over all 2ⁿ⁻¹ possible settings of the variables x₂, ..., x_n, leaving x₁ unfixed. Thus, the prover sends us a univariate polynomial Q₁ in the first variable. Since the polynomial we started with had poly(n) degree, the prover can do this by sending us a polynomial number of coefficients. He can send us this univariate polynomial. Then, what we have to verify is that Q₁(0)+Q₁(1)=0 (everything being mod N). How can we do that? The prover has given us the claimed value of the entire polynomial. So just pick an r₁ at random from our field. Now, what we would like to do is verify that Q₁(r₁) equals what it's supposed to. Forget about 0 and 1, we're just going to go somewhere else in the field. Thus, we send r₁ to the prover. Now the prover sends a new polynomial Q₂, where the first variable is fixed to be r₁, but where x₂ is left unfixed and x₃, ..., x_n are summed over all possible Boolean values (like before). We still don't know that the prover hasn't been lying to us and sending bullshit polynomials. So what can we do?

Check that Q₂(0)+Q₂(1)=Q₁(r₁), then pick another element r₂ at random and send it to the prover. In response, he'll send us a polynomial Q₃(X). This will be a sum of P(x₁,...,x_n) over all possible Boolean settings of x₄ up to x_n, with x₁ set to r₁ and x₂ set to r₂, and x₃ left unfixed. Again, we'll check and make sure that Q₃(0) + Q₃(1) = Q₂(r₂). We'll continue by picking a random r₃ and sending it along to the prover. This keeps going for n iterations, when we reach the last variable. What do we do when we reach the last iteration? At that point, we can just evaluate P(r₁,...,r_n) ourselves without the prover's help, and check directly if it equals Q_n(r_n).

We have a bunch of tests that we're doing along the way. My first claim is that if there is no satisfying assignment, and if the prover was not lying to us, then each of the n tests accepts with certainty. The second claim is that if there was a satisfying assignment, then with high probability, at least one of these tests would fail. Why is that the case? The way I think of it is that the prover is basically like the girl in Rumpelstiltskin. The prover is just going to get trapped in bigger and bigger lies as time goes on until finally the lies that will be so preposterous that we'll be able to catch them. This is what's going on. Why? Let's say that, for the first iteration, the real polynomial that the prover should give us is Q₁, but that the prover gives us Q₁' instead. Here's the thing: these are polynomials of not too large a degree. The final polynomial, P, has degree at most three times the number of clauses. We can easily fix the field size to be larger. So let the degree d of the polynomial be much smaller than the field size N.

A quick question: suppose we have two polynomials P₁ and P₂ of degree d. How many points can they be equal at (assuming they aren't identical)? Consider the difference P₁−P₂. Since this is also a polynomial of degree at most d, by the Fundamental Theorem of Algebra, it can have at most d distinct roots (again, assuming it's not identically zero). Thus, two polynomials that are not equal can agree in at most d places, where d is the degree. This means that if these are polynomials over a field of size N, and we pick a random element in the field, we can bound the probability that the two will agree at that point: it's at most d/N.

Going back to the protocol, we assumed that d is much less than N, and so the probability that Q₁ and Q₁' agree at some random element of the field is much less than 1. So when we pick r₁ at random, the probability that Q₁(r₁)=Q₁'(r₁) is at most d/N. Only if we've gotten very unlucky will we pick r₁ such that these are equal, so we can go on and assume that Q₁(r₁)≠Q₁'(r₁). Now, you can picture the prover sweating a little. He's trying to convince us of a lie, but maybe he can still recover. But next, we're going to pick an r₂ at random. Again, the probability that he'll be able to talk himself out of the next lie is going to be at most d/N. This is the same in each of the iterations, so the probability that he can talk himself out of any of the lies is at most nd/N. We can just choose N to be big enough that this will be much smaller than 1.

Q: Why not just run this protocol over the positive integers?
Scott: Because we don't have a way of generating a random positive integer, and we need to be able to do that. So we just pick a very large finite field.

So this protocol gives us that coNP ⊆ IP. Actually, it gives us something stronger. Does anyone see the stronger thing that it gives us?

A: Strictly contained?
Scott: No, we can't show that, though it would be nice.

A standard kind of argument shows us that the biggest IP could possibly be in our wildest dreams would be PSPACE. You can prove that anything you can do with an interactive protocol, you can simulate in PSPACE. Can we bring IP up? Make it bigger? What we were trying to verify was that all of these values of P(x₁,...,x_n) summed to zero, but the same proof would go through as before if we were trying to verify that they summed to some other constant (whatever we want). So that actually lets us do counting, and shows that IP contains P^♯P, which in turn we know to contain the entire polynomial hierarchy (by Toda's Theorem). After this "LFKN Theorem" came out, a number of people carried out a discussion by e-mail, and a month later, Shamir figured out that IP = PSPACE---that is, IP actually "hits the roof." I won't go through Shamir's result here, but this means that if a super-intelligent alien came to Earth, it could prove to us whether white or black has the winning strategy in chess, or if chess is a draw. It could play us and beat us, of course, but then all we'd know is that it's a better chess player. But it can prove to us which player has the winning strategy by reducing chess to this game of summing polynomials over large finite fields. (Technical note: this only works for chess with some reasonable limit on the number of moves, like the "50-move rule" used in tournament play.)

Q: Chess on an n×n board, right?
Scott: Sure. The protocol works in particular for n = 8, but you can generalize chess to arbitrary board sizes. You just have to limit the number of moves to some polynomial in n, or else you get EXP. Then you'd need two provers to convince you---which is another story!

This is already something that is---to me---pretty counterintuitive. Like I said, it gives us a very small glimpse of the kinds of techniques we'd need to use to prove non-relativizing results like P≠NP. A lot of people seem to think that the key is somehow to transform these problems from Boolean to algebraic ones. The question is how to do that. I can show you, though, how these techniques already let you get some new lower bounds. Heck, even some quantum circuit lower bounds.

First claim: if we imagine that there are polynomial-size circuits for counting the number of satisfying assignments of a Boolean formula, then there's also a way to prove to someone what the number of solutions is. Does anyone see why this would follow from the interactive proof result? Well, notice that, to convince the verifier about the number of satisfying assignments of a Boolean formula, the prover itself does not need to have more computational power than is needed to count the number of assignments. After all, the prover just keeps having to compute these exponentially large sums! In other words, the prover for ♯P can be implemented in ♯P. If you had a ♯P oracle, then you too could be the prover. Using this fact, Lund et al. pointed out that if ♯P⊂P/poly---that is, if there's some circuit of size polynomial in n for counting the number of solutions to a formula of size n---then P^♯P=MA. For in MA, Merlin can give Arthur the polynomial-size circuit for solving ♯P problems, and then Arthur just has to verify that it works. To do this, Arthur just runs the interactive protocol from before, but where he plays the part of both the prover and the verifier, and uses the circuit itself to simulate the prover. This is an example of what are called self-checking programs. You don't have to trust an alleged circuit for counting the number the solutions to a formula, since you can put it in the role of a prover in an interactive protocol.

Now, we can prove that the class PP, consisting of problems solvable in probabilistic polynomial-time with unbounded error, does not have linear-sized circuits. (This result is originally due to Vinodchandran.) Why? Well, there are two cases. If PP doesn't even have polynomial-sized circuits, then we're done. On the other hand, if PP does have polynomial-sized circuits, then so does P^♯P, by the basic fact (which you might enjoy proving) that P^♯P=P^PP. Therefore P^♯P = MA by the LFKN Theorem, so P^♯P = MA = PP, since PP is sandwiched in between MA and P^♯P. But one can prove (and we'll do this shortly) that P^♯P doesn't have linear-sized circuits, using a direct diagonalization argument. Therefore, PP doesn't have linear-sized circuits either.

All I'm trying to say is that once you have this interactive proof result, you can leverage it to get new circuit lower bounds. For example, you can show that there's a language in the class PP that doesn't have linear-sized circuits. In fact, for any fixed k, there's a language in PP that doesn't have circuits of size n^k. Of course, that's much weaker than showing that PP doesn't have polynomial-sized circuits, but it's something. (Note: After I gave this lecture, Santhanam improved on Vinodchandran's result, to show that for every fixed k, there's a language in the complexity class PromiseMA that doesn't have circuits of size n^k.)

I'd like to go back now and fill in the missing step in the argument. Let's say you wanted to show for some fixed k that P^♯P doesn't have circuits of size n^k. How many possible circuits are there of size n^k? Something like n^{2n^k}. Now what we can do is define a Boolean function f by looking at the behavior of all circuits of size n^k. Order the possible inputs of size n as x₁, ..., x_2^n. If at least half of the circuits accept x₁, then set f(x₁) = 0, while if more than half of the circuits reject x₁, then set f(x₁) = 1. This kills off at least half of the circuits of size n^k (i.e., causes them to fail at computing f on at least one input). Now, of those circuits that got the "right answer" for x₁, do the majority of them accept or reject x₂? If the majority accept, then set f(x₂) = 0. If the majority reject, then set f(x₂) = 1. Again, this kills off at least half of those circuits remaining. We continue this Darwinian process where each time we define a new value of our function, we kill off at least half of the remaining circuits of size n^k. After log₂(n^{2n^k})+1≈2n^klog(n) steps, we will have killed off all of the circuits of size n^k. Furthermore, the process of constructing f involves a polynomial number of counting problems, each of which we can solve in P^♯P. So the end result is a problem which is in P^♯P, but which by construction does not have circuits of size n^k (for any fixed k of our choice). This is an example of a relativizing argument, because we paid no attention to whether these circuits had any oracles or not. To get this argument to go down from P^♯P to the smaller class PP, we had to use a non-relativizing ingredient: namely, the interactive proof result of LFKN.

But does this actually give us a non-relativizing circuit lower bound? That is, does there exist an oracle relative to which PP has linear-sized circuits? A couple years ago, I was able to construct such an oracle. This shows that Vinodchandran's result was non-relativizing---indeed, it's one of the few examples in all of complexity theory of an indisputably non-relativizing separation result. In other words, the relativization barrier---which is one of the main barriers to showing P≠NP---can be overcome in some very limited cases. It would be nice to overcome it in other cases, but this is what we can do.

Q: So these arguments show that there are no quadratic-size circuits for PP?
Scott: Yes. Let me put this another way: for any fixed k, there exists a language L in PP such that L cannot be decided by a circuit of size O(n^k). That's very different from saying that there is a single language in PP that does not have any circuits of any polynomial size. The second statement is much harder to show! If you give me your (polynomial) bound, then I find a PP problem that defeats circuits constrained by your bound, but the problem might be solvable by circuits with some larger polynomial bound. I could also defeat that larger polynomial bound, but I'd have to construct a different problem, and so on indefinitely.

While we're waiting for better circuit lower bounds in the classical case, I can tell you about the quantum case. We always have to ask about the quantum case. Here, a simple extension of the previous argument shows that not only does PP not have circuits of size n^k, it doesn't even have quantum circuits of size n^k. You can get a quantum circuit lower bound, but that's peanuts. Let's try to throw in quantum to something and get a different answer.

We can define a complexity class QIP: Quantum Interactive Proofs. This is the same as IP, except that now you're a quantum polynomial-time verifier, and instead of exchanging classical messages with the prover, you can exchange quantum messages. For example, you could send the prover half of an EPR pair and keep the other half, and play whatever other such games you want.

Certainly, this class is at least as powerful as IP. You could just restrict yourself to classical messages if that's what you wanted to do. Since IP = PSPACE, QIP has to be at least as big as PSPACE. The other thing that was proved by Kitaev and Watrous, using a semidefinite programming argument, was that QIP is contained in EXP. This is actually all we know about where QIP lies. It would be a great Ph.D. thesis for any of you to show (for example) that QIP can be simulated in PSPACE, and hence QIP=PSPACE. The exciting thing that we do know (also due to Kitaev and Watrous) is that any quantum interactive protocol can be simulated by one that takes place in three rounds. In the classical case, we had to play this whole Rumpelstiltskin game, where we kept asking the prover one question after another until we finally caught him in a lie. We had to ask the prover polynomially many questions. But in the quantum case it's no longer necessary to do that. The prover sends you a message, you send a message back, then the prover sends you one more message and that's it. That's all you ever need.

We don't have time today to prove why that's true, but I can give you some intuition. Basically, the prover prepares a state that looks like ∑_r|r⟩|q(r)⟩. This r is the sequence of all the random bits that you would use in the classical interactive protocol. Let's say that we're taking the classical protocol for solving coNP or PSPACE, and we just want to simulate it by a three-round quantum protocol. We sort of glom together all the random bits that the verifier would use in the entire protocol and take a superposition over all possible settings of those random bits. Now what's q(r)? It's the sequence of messages that the prover would send back to you if you were to feed it the random bits in r. Now, the prover will just take the q register and second r register and will send it to you. Certainly, the verifier can check that then q(r) is a valid sequence of messages given r. What's the problem? Why isn't this a good protocol?

A: It could be a superposition over a subset of the possible random bits.

Right! How do we know that the prover didn't just cherry-pick r to be only drawn from those that he could successfully lie about? The verifier needs to pick the challenges. You can't have the prover picking them for you. But now, we're in the quantum world, so maybe things are better. If you imagine in the classical world that there was some way to verify that a bit is random, then maybe this would work. In the quantum world, there is such a way. For example, if you were given a state like

you could just rotate it and verify that, had you measured in the standard basis, you would have gotten 0 and 1 with roughly equal probability. (More precisely: if the outcome in the standard basis would have been random, then you'll accept with probability 1; if the outcome would have been far from random, then you'll reject with noticeable probability.)

Still, the trouble is that our |r⟩ is entangled with the |q(r)⟩ qubits. So we can't just apply Hadamard operations to |r⟩---if we did, we'd just get garbage out. However, it turns out that what the verifier can do is to pick a random round i of the protocol being simulated---say there are n such rounds---and then ask the prover to uncompute everything after round i. Once the prover has done that, he's eliminated the entanglement, and the verifier can then check by measuring in the Hadamard basis that the bits for round i really were random. If the prover cheated in some round and didn't send random bits, this lets the verifier detect that with probability that scales inversely with the number of rounds. Finally, you can repeat the whole protocol in parallel a polynomial number of times to increase your confidence. (I'm skipping a whole bunch of details---my goal here was just to give some intuition.)

Q: So this is kind of like quantum MAM (Merlin-Arthur-Merlin)?
Scott: Yes. In the classical world, you've just got MA and AM: every proof protocol between Arthur and Merlin with a larger constant number of rounds collapses to AM. If you allow a polynomial number of rounds, then you go up to IP (which equals PSPACE). In the quantum world, you've QMA, QAM and then QMAM which is the same as QIP. There's also another class, QIP[2], which is different from QAM in that Arthur can send any arbitrary string to Merlin (or even a quantum state) instead of just a random string. In the classical case, AM and IP[2] are the same, but in the quantum case, we have no idea.

That's our tour of interactive proofs, so I'll end with a puzzle for next week. God flips a fair coin. Assuming that the coin lands tails, She creates a room with a red-haired person. If the coin lands heads, She creates two rooms: one has a person with red hair and the other has a person with green hair. Suppose that you know that this is the whole situation, then wake up to find a mirror in the room. Your goal is to find out which way the coin landed. If you see that you've got green hair, then you know right away how the coin landed. Here's the puzzle: if you see that you have red hair, what is the probability that the coin landed heads?

Lecture 17: Fun With the Anthropic Principle

This is a lecture about the Anthropic Principle, and how you apply Bayesian reasoning where you have reason about the probability of your own existence, which seems like a very strange question. It's a fun question, though---which you can almost define as a question where it's much easier to have an opinion than to have a result. But then, we can try to at least clarify the issues, and there are some interesting results that we can get.

There's a central proposition that many people interested in rationality believe they should base their lives around---even though they generally don't in practice. This is Bayes's Theorem.

If you talk to philosophers, this is often the one mathematical fact that they know. (Kidding!) As a theorem, Bayes' Theorem is completely unobjectionable. The theorem tells you how to update your probability of a hypothesis H being true, given some evidence E.

The term P[E|H] describes how likely you are to observe the evidence E in the case that the hypothesis H holds. The remaining terms on the right-hand side, P[H] and P[E], are the two tricky ones. The first one describes the probability of the hypothesis being true, independent of any evidence, while the second describes the probability of the evidence being observed, averaged over all possible hypotheses. Here, you're making a commitment that there are such probabilities in the first place---in other words, that it makes sense to talk about what the Bayesians call a prior. When you're an infant first born into the world, you estimate there's some chance that you're living on the third planet around the local star, some other chance you're living on the fourth planet and so on. That's what a prior means: your beliefs before you're exposed to any evidence about anything. You can see already that maybe that's a little bit of a fiction, but supposing you have such a prior, Bayes' theorem tells you how to update it given new knowledge.

The proof of the theorem is trivial. Multiply both sides by P[E], and you get that P[H|E] P[E] = P[E|H] P[H]. This is clearly true, since both sides are equal to the probability of the evidence and the hypothesis together.

So if Bayes' Theorem seems unobjectionable, then I want to make you feel queasy about it. That's my goal. The way to do that is to take the theorem very, very seriously as an account of how we should reason about the state of the world.

I'm going to start with a nice thought experiment which is due to the philosopher Nick Bostrom. This is called God's Coin Toss. Last lecture, I described the thought experiment as a puzzle.

Imagine that at the beginning of time, God flips a fair coin (one that lands heads or tails with equal probability). If the coin lands heads, then God creates two rooms: one has a person with a red hair and the other has a person with green hair. If the coin lands tails, then God creates one room with a red-haired person.

Q: Are these rooms the entire universe?
Scott: Yes, and these are the only people in the universe, in either case.

We also imagine that everyone knows the entire situation and that the rooms have mirrors. Now suppose that you wake up, look in the mirror and learn your hair color. What you really want to know is which way the coin landed. Well, in one case, this is easy. If you have green hair, then the coin must have landed heads. Suppose that you find you have red hair. Conditioned on that, what probability should you assign to the coin having landed heads?

Q: So there's no door to the other room?
Scott: That's right. You can't see the other room. All you can see is the room that you're in. So does anyone have a guess?
A: One half.

One half is the first answer that someone could suggest. You could just say, "look, we know the coin was equally likely going to land heads or tails, and we know that in either case that there was going to be a red-haired person, so being red-haired doesn't really tell you anything about which way it landed, and therefore it should be a half." Can anyone defend a different answer?

A: It seems more likely for it to have landed tails, since in the heads case, the event of waking up with red hair is diluted with the other possible event of waking up with green hair. The effect would be more dramatic if there were a hundred rooms with green hair.
Scott: Exactly.
Q: It isn't clear at all to me that the choice of whether you're the red- or the green-haired person in the heads case is at all probabilistic. We aren't guaranteed that.
Scott: Right. This is a question.
Q: It could have been that before flipping the coin, God wrote down a rule saying that if the coin lands heads, make you red-haired.
Scott: Well, then we have to ask, what do we mean by "you?" Before you've looked in the mirror, you really don't know what your hair color is. It really could be either, unless you believe that it's really a part of the "essence" of you to have red hair. That is, if you believe that there is no possible state of the universe in which you had green hair, but otherwise would have been "you."
Q: Are both people asked this question, then?
Scott: Well, the green-haired person certainly knows how to answer, but you can just imagine that the red-haired people in the heads and tails cases are both asked the question.

To make the argument a little more formal, you can just plop things into Bayes's Theorem. We want to know the probability P[H|R] that the coin landed heads, given that you have red hair. We could do the calculation, using that the probability of us being red haired given that the coin lands heads is ½, all else being equal. There are two people, and you aren't a priori more likely to be either the red-haired or the green-haired person. Now, the probability of heads is also ½ — that's no problem. What's the total probability of your having red hair? That's just given by P[R|H] P[H] + P[R|T] P[T]. As we've said before, if the coin lands tails, you certainly have red hair, so P[R|T] = 1. Moreover, we've already assumed that P[R|H] = ½. Thus, what you get out is that P[H|R] = ¼ / ¾ = ⅓. So, if we do the Bayesian calculation, it tells us that the probability should be ⅓ and not ½.

Does anyone see an assumption that you could make that would get the probability back to ½?

A: You could make the assumption that whenever you exist, you have red hair.
Scott: Yeah, that's one way. But is there a way to do it that doesn't require a prior commitment about your hair color?

Well, there is a way to do it, but it's going to be a little bit strange. One way is to point out that in the heads world, there are twice as many people in the first place. So you could say that you're a priori twice as likely to exist in a world with twice as many people. In other words, you could say that your own existence is a piece of evidence you should condition upon. I guess if you want to make this concrete, the metaphysics that this would correspond to is that there's some sort of a warehouse full of souls, and depending on how many people there are in the world, a certain number of souls get picked out and placed into bodies. You should say that in a world with more people, it would be more likely that you'd be picked at all.

If you do make that assumption, then you can run through the same Bayesian ringer. You find that the assumption precisely negates the effect of reasoning that if the coin landed heads, then you could have had green hair. So you get back to ½.

Thus, we see that depending on how you want to do it, you can get an answer of either a third or a half. It's possible that there are other answers that could be defended, but these seem like the most plausible two.

That was a fairly serene thought experiment. Can we make it more dramatic?

Q: Part of the reason that this feels like philosophy is that there aren't any real stakes here.
Scott: Exactly. Let's get closer to something with real stakes in it.

The next thought experiment is, I think, due to the philosopher John Leslie. Let's call it the Dice Room. Imagine that there's a very, very large population of people in the world, and that there's a madman. What this madman does is, he kidnaps ten people and puts them in a room. He then throws a pair of dice. If the dice land snake-eyes, then he simply murders everyone in the room. If the dice do not land snake-eyes, then he releases everyone, then kidnaps 100 people. He now does the same thing: he rolls two dice; if they land snake-eyes then he kills everyone, and if they don't land snake-eyes, then he releases them and kidnaps 1,000 people. He keeps doing this until he gets snake-eyes, at which point he's done. So now, imagine that you've been kidnapped. You can assume either that you do or do not know how many other people are in the room.

Q: Have you been watching the news?
Scott: You have been watching the news and you know the entire situation. That's always a convenient assumption.

So you're in the room. Conditioned on that fact, how worried should you be? How likely is it that you're going to die?

A: 1/36.
Scott: OK. That would be one guess.

One answer is that the dice have a 1/36 chance of landing snake-eyes, so you should be only a "little bit" worried (considering). A second reflection you could make is to consider, of people who enter the room, what the fraction is of people who ever get out. Let's say that it ends at 1,000. Then, 110 people get out and 1,000 die. If it ends at 10,000, then 1,110 people get out and 10,000 die. In either case, about 8/9 of the people who ever go into the room will die.

Q: But that's not conditioning on the full set of information. That's just conditioning on the fact that I'm in the room at some point.
Scott: But you'll basically get the same answer, no matter what time you go into the room. No matter when you assume the process terminates, about 8/9 of the people who ever enter the room will be killed. For each termination point, you can imagine being a random person in the set of rooms leading up to that point. In that case, you're much more likely to die.
Q: But aren't you conditioning on future events?
Scott: Yes, but the point is that we can remove that conditioning. We can say that we're conditioning on a specific termination point, but that no matter what that point is, we get the same answer. It could be 10 steps or 50 steps, but no matter what the termination point is, almost all the people who go into the room are going to die, because the number of people is increasing exponentially.

If you're a Bayesian, then this kind of seems like a problem. You could see this as a bizarre madman-related thought experiment, or if you preferred to, you could defend the view that this is the actual situation the human species is in. We'd like to know what the probability is that we're going to suffer some cataclysm or go extinct for some reason. It could be an asteroid hitting the earth, nuclear war, global warming or whatever else. So there are kind of two ways of looking at it. The first way is to say that all of these risks seem pretty small---they haven't killed us yet! There have been many generations, and in each one, there have been people predicting imminent doom and it never materialized. So we should condition on that and assign a relatively small probability to our generation being the one to go extinct. That's the argument that conservatives like to make; I'll call it the Chicken Little Argument.

Against that, there's the argument that the population has been increasing exponentially, and that if you imagine that the population increases exponentially until it exhausts the resources of the planet and collapses, then the vast majority of the people who ever lived will live close to the end, much like in the dice room. Even supposing that with each generation there's only a small chance of doom, the vast majority of the people ever born will be there close to when that chance materializes.

Q: But it still seems to me like that's conditioning on a future event. Even if the answer is the same no matter which future event you choose, you're still conditioning on one of them.
Scott: Well, if you believe in the axioms of probability theory, then if p = P[A|B] = P[A|¬B], then P[A] = p.
Q: Yes, but we're not talking about B and ¬B, we're talking about an infinite list of choices.
Scott: So you're saying the infiniteness makes a difference here?
Q: Basically. It's not clear to me that you can just take that limit, and not worry about it. If your population is infinite, maybe the madman gets really unlucky and is just failing to roll snake-eyes for all eternity.
Scott: OK, we can admit that the lack of an upper bound on the number of rolls could maybe complicate matters. However, one can certainly give variants of this thought experiment that don't involve infinity.

The argument that I've been talking about goes by the name of The Doomsday Argument. What it's basically saying is that you should assign to the event of a cataclysm in the near future a much higher probability that you might naïvely think, because of this sort of reasoning. One can give a completely finitary version of the Doomsday Argument. Just imagine for simplicity that there are only two possibilities: Doom Soon and Doom Late. In one, the human race goes extinct very soon, whereas in the other it colonizes the galaxy. In each case, we can write down the number of people who will ever have existed. For the sake of discussion, suppose 80 billion people will have existed in the Doom Soon case, as opposed to 80 quadrillion in the Doom Late case. So now, suppose that we're at the point in history where almost 80 billion people have lived. Now, you basically apply the same sort of argument as in God's Coin Toss. You can make it stark and intuitive. If we're in the Doom Late situation, then the vast, vast majority of the people who will ever have lived will be born after us. We're in the very special position of being in the first 80 billion humans---we might as well be Adam and Eve! If we condition on that, we get a much lower probability of being in the Doom Late case than of being in the Doom Soon case. If you do the Bayesian calculation, you'll find that if naïvely view the two cases are equally likely, then after applying the Doomsday reasoning, we're almost certainly in the Doom Soon case. For conditioned on being in the Doom Late case, we almost certainly would not be in the special position of being amongst the first 80 billion people.

Maybe I should give a little history. The Doomsday Argument was introduced by a astrophysicist named Brandon Carter in 1974. The argument was then discussed intermittently throughout the 80s. Richard Gott, who was also an astrophysicist, proposed the "mediocrity principle": if you view the entire history of the human race from a timeless perspective, then all else being equal we should be somewhere in the middle of that history. That is, the number of people who live after us should not be too much different from the number of people who lived before us. If the population is increasing exponentially, then that's very bad news, because it means that humans are not much longer for the world. This argument seems intuitively appealing, but has been largely rejected because it doesn't really fit into the Bayesian formalism. Not only is it not clear what the prior distribution is, but you may have special information that indicates that you aren't likely to be in the middle.

So the modern form of the Doomsday Argument, which was formalized by Bostrom, is the Bayesian form where you just assume that you have some prior over the possible cases. Then, all the argument says is that you have to take your own existence into account and adjust the prior. Bostrom has a book about this where he concludes that the resolution of the Doomsday Argument really depends on how you'd resolve the God's Coin Toss puzzle. If you give ⅓ as your answer to the puzzle, that corresponds to the Self-Sampling Assumption (SSA) that you can sample a world according to your prior distribution and then sample a person within that world at random. is called the Self-Sampling Assumption (SSA). If you make that assumption about how to apply Bayes's Theorem, then it seems very hard to escape the doomsday conclusion.

If you want to negate that conclusion, then you need an assumption he calls the Self-Indication Assumption (SIA). That assumption says that you are more likely to exist in a world with more beings than one with less beings. You would say in the Doomsday Argument that if the "real" case is the Doom Late case, then while it's true that you are much less likely to be one of the first 80 billion people, it's also true that because there's so many more people, you're much more likely to exist in the first place. If you make both assumptions, then they cancel each other out, taking you back to your original prior distribution over Doom Soon and Doom Late, in exactly the same way that making the SIA led us to get back to ½ in the coin toss puzzle.

On this view, it all boils down to which of the SSA and SIA you believe. There are some arguments against Doomsday that don't accept these presuppositions at all, but those arguments themselves are open to different objections. One of the most common arguments that you hear against Doomsday is that cavemen could have made the same argument, but that they would have been completely wrong. The problem with that argument is that the Doomsday Argument doesn't at all ignore that effect. Sure, some people who make the argument will be wrong; the point is that the vast majority will be right.

Q: It seems as though there's a tension between wanting to be right yourself and wanting to design policies that try to maximize the number of people who are right.
Scott: That's interesting.
Q: Here's a variation I want to make on the red room business: suppose that God has a biased coin such that with 0.9 probability, there's one red and many, many greens. With 0.1 probability, there's just a red. In either case, in the red-haired person's room, there's a button. You have the option to push the button or not. If you're in the no-greens case, you get a cookie if you press the button, whereas if you're in the many-greens case, you get punched in the face if you press the button. You have to decide whether to press the button. So now, if I use the SSA and find that I'm in a red room, then probably we're in the no-greens case and I should press the button.
Scott: Absolutely. It's clear that what probabilities you assign to different states of the world can influence what decisions you consider to be rational. That, in some sense, is why we care about any of this.

There's also an objection to the Doomsday Argument that denies that it's valid at all to talk about you being drawn from some class of observers. "I'm not a random person, I'm just me." The response to that is that there are clearly cases where you think of yourself as a random observer. For example, suppose there's a drug that kills 99% of the people who take it, but such that 1% are fine. Are you going to say that since you aren't a random person, the fact that it kills 99% of the people is completely irrelevant? Are you going to just take it anyway? So for many purposes, you do think of yourself as being drawn over some distribution of people. The question is when is such an assumption valid and when isn't it?

Q: I guess to me, there's a difference between being drawn from a uniform distribution of people and a uniform distribution of time. Do you weight the probability of being alive in a given time by the population at that time?
Scott: I agree; the temporal aspect does introduce something unsettling into all of this. Later, we'll get to puzzles that don't have a temporal aspect. We'll see what you think of those.
Q: I've also sometimes wondered "why am I a human?" Maybe I'm not a random human, but a random fragment of consciousness. In that case, since humans have more brain matter than other animals, I'm much more likely to be a human.
Scott: Another question is if you're more likely to be someone who lives for a long time. We can go on like this. Suppose that there's lots of extraterrestrials. Does that change the Doomsday reasoning? Almost certainly, you wouldn't have been a human at all.
Q: Maybe this is where you're going, but it seems like a lot of this comes down to what you even mean by a probability. Do you mean that you're somehow encoding a lack of knowledge, or that something is truly random? With the Doomsday Argument, has the choice of Doom Soon or Doom Late already been fixed? With that drug argument, you could say, "no, I'm not randomly chosen—I am me—but I don't know this certain property of me."
Scott: That is one of the issues here. I think that as long as you're using Bayes's Theorem in the first place, you may have made some commitment about that. You certainly made a commitment that it makes sense to assign probabilities to the events in question. Even if we imagine the world was completely deterministic, and we're just using all this to encode our own uncertainty, then a Bayesian would tell you that's what you should do anytime you're uncertain about anything, no matter what the reason. You must have some prior over the possible answers, and you just assign probabilities and start updating them. Certainly, if you take that view and try to be consistent about it, then you're led to strange situations like these.

As the physicist John Baez has pointed out, anthropic reasoning is kind of like science on the cheap. When you do more experiments, you can get more knowledge, right? Checking to see if you exist is always an experiment that you can do easily. The question is, what can you learn from having done it? It seems like there are some situations where it's completely unobjectionable and uncontroversial to use anthropic reasoning. One example is asking why the Earth is 93 million miles away from the Sun and not some other distance. Can we derive 93 million miles as some sort of a physical constant, or get the number from first principles? It seems clear that we can't, and it seems clear that to the extent that there's an explanation, it has to be that if Earth were much closer, it would be too hot and life wouldn't have evolved, whereas if it were much further, it'd be too cold. This is called the "Goldilocks Principle": of course, life is going to only arise on those planets that are at the right temperature for life to evolve. It seems like even if there's a tiny chance of life evolving on a Venus or a Mars, it'd still be vastly more likely that life would evolve on a planet roughly our distance from the Sun, and so the reasoning still holds.

Then there are much more ambiguous situations. This is actually a current issue in physics, which the physicists argue over. Why is the fine structure constant roughly 1/137 and not some other value? You can give arguments to the effect that if it were much different, we wouldn't be here.

Q: Is that the case like with the inverse-square law of gravity? If it weren't r⁸², but just a little bit different, would the universe be kind of clumpy?
Scott: Yes. That's absolutely right. In the case of gravity, though, we can say that general relativity explains why it's an inverse square and not anything else, as a direct consequence of space having three dimensions.
Q: But we wouldn't need that kind of advanced explanation if we just did science on the cheap and said "it's gotta be this way by the Anthropic Principle."
Scott: This is exactly what people who object to the Anthropic Principle are worried about—that people will just get lazy and decide that there's no need to do any experiment about anything, because the world just is how it is. If it were any other way, we wouldn't be us; we'd be observers in a different world.
Q: But the Anthropic Principle doesn't predict, does it?
Scott: Right, in many cases, that's exactly the problem. The principle doesn't seem to constrain things before we've seen them. The reductio ad absurdum that I like is where a kid asks her parents why the moon is round. "Clearly, if the moon were square, you wouldn't be you, but you would be the counterpart of you in a square-moon universe. Given that you are you, clearly the moon has to be round." The problem being that if you hadn't seen the moon yet, you couldn't make a prediction. On the other hand, if you knew that life was much more likely to evolve on a planet that's 93 million miles away from the Sun than 300 million miles away, then even before you'd measured the distance, you could make such a prediction. In some cases, it seems like the principle really does give you predictions.
Q: So apply the principle exactly when it gives you a concrete prediction?
Scott: That would be one point of view, but I guess one question would be what if the prediction comes out wrong?

Like a student said before, this does feel like it's just philosophy. You can set things up, though, so that real decisions depend on it. Maybe you've heard of the surefire way of winning the lottery: buy a lottery ticket and if it doesn't win, then you kill yourself. Then, you clearly have to condition on being alive to ask the question of whether you are alive or not, and so because you're asking the question, you must be alive, and thus must have won the lottery. What can you say about this? You can say that in actual practice, most of us don't accept as a decision-theoretic axiom that you're allowed to condition on remaining alive. You could jump off a building and condition on there happening to be a trampoline or something that will rescue you. You have to take into account the possibility that your choices are going to kill you. On the other hand, tragically, some people do kill themselves. Was this in fact what they were doing? Were they eliminating the worlds where things didn't turn out how they wanted?

Of course, everything must come back to complexity theory at some point. And indeed, certain versions of the anthropic principle should have consequences for computation. We've already seen how it could be the case with the lottery example. Instead of winning the lottery, you want to do something even better: solve an NP-complete problem. You could use the same approach. Pick a random solution, check if it's satisfying, and if it isn't, kill yourself. Incidentally, there is a technical problem with that algorithm. Does anyone see what it is?

A: Is there a solution at all?
Scott: Right. If there's no solution then you'd seem to be in trouble. On the other hand, there's actually a very simple fix to this problem.
A: Add some dummy string like *ⁿ that acts as a "get out of jail free" card.
Scott: Exactly.

So we say that there are these 2ⁿ possible solutions, and that there's also this dummy solution you pick with some tiny probability like 2^-2n. If you pick the dummy solution, then you do nothing. Otherwise, you kill yourself if and only if the solution you picked is unsatisfying. Conditioned upon their being no solution and your being alive, then you'll have picked the dummy solution. Otherwise, if there is a solution, you'll almost certainly have picked a satisfying solution, again conditioned on your being alive.

As you might expect, you can define a complexity class based on this principle: BPP_path. Recall the definition of BPP: the class of problems solvable by a probabilistic polynomial time algorithm with bounded error. That is, if the answer to a problem is "yes," at least 2/3 of the BPP machine's paths must accept, whereas if the answer is "no," then at most 1/3 of the paths must accept. So BPP_path is just the same, except that the computation paths can all have different lengths. They have to be polynomial, but they can be different. A simple argument shows that BPP_path is equivalent to a class which I'll call PostBPP (BPP with post selection). PostBPP is again the set of problems solvable by a polynomial time probabilistic algorithm where again you have the 2/3-1/3 acceptance condition, but now if you don't like your choice of random bits, then you can just kill yourself. You can condition on having chosen random bits such that you remain alive. Physicists call this postselection. You can postselect on having random bits with some very special property. Conditioned on having that property, a "yes" answer should cause 2/3 of the paths to accept, and a "no" answer should cause no more than 1/3 to accept. Can you see why this is equivalent to BPP_path?

Q: First, I'm a little unclear on the difference between BPP and BPP_path. You said that the computation paths are of different lengths. The way I've always seen BPP, you're making random choices, but you can make a different number on each path.
Scott: OK. Here's the point: in BPP_path, if a choice leads to more different paths, then it can get counted more. Let's say for example that in 2^n − 1 branches, we just accept or reject---that is, we just halt---but in one branch, we're going to flip more coins and do something more. In BPP_path, we can make one branch completely dominate all the other branches. I've shown an example of this in the figure at right–suppose we want branch colored in red to dominate everything else. Then we can hang a whole tree from that path, and it will dominate the paths we don't want (colored in black).
A: So you just find one path you like, and then just sit there flipping a coin for a while?
Scott: Exactly, that's basically a proof that PostBPP ⊆ BPP_path. Given an algorithm with postselection, you make a bunch of random choices and if you like them, you make a bunch more random choices, and those paths overwhelm the paths where you didn't like your random bits. What about the other direction? BPP_path ⊆ PostBPP?

The point is, in BPP_path we've got this tree of paths that have different lengths. What we could do is complete it to make a balanced binary tree. Then, we could use postselection to give all these ghost paths suitably lower probabilities than the true paths, and thereby simulate BPP_path in PostBPP.

Completing a BPP_path tree.

Q: Can you go over the definition of PostBPP again?
Scott: Yes. PostBPP is the class of all languages L for which there exist polynomial time Turing machines A and B (A being the thing that decides whether to accept or reject and B being the thing that decides to postselect) such that for every x∈L, Pr_r[A(x,r)|B(x,r)] ≥ 2/3 and such that for every x∉L, Pr_r[A(x,r)|B(x,r)] ≤ 1/3. As a technical issue, we also require that Pr[B(x,r)] > 0.

Now that we know that PostBPP = BPP_path, we can ask how big BPP_path is. By the argument we gave before, NP ⊆ BPP_path.

Q: The argument we gave before picked the dummy solution with exponentially small, but non-zero probability.
Scott: That's right, but remember that's allowed here. It still satisfies our bounded-error conditions.

On the other hand, is NP = BPP_path? Certainly, that's going to be hard to show, even if it's true. One reason is that BPP_path is closed under complement. Another reason is that it contains BPP. In fact, you can show that BPP_path contains MA and P^||NP (P with parallel queries to an NP oracle). I'll leave that as an exercise. In the other direction, it's possible to show that BPP_path is contained in BPP^||NP, and thus in the polynomial hierarchy. Under a derandomization hypothesis, then, we find that the anthropic principle gives you the same computational power as P^||NP.

Q: How does BPP_path relate to PP?
Scott: That's a great question. BPP_path ⊆ PP.
Q: PP is so huge.
Scott: Why is BPP_path ⊆ PP? Deciding whether to accept or reject is kind of this exponential summation problem. You can say that each of the paths which is a dummy path contributes both an accept and a reject, while each of the accepting paths contributes two accepts and each of the rejecting paths contributes two rejects. Then, just ask if there are more accepts than rejects. That will simulate it in PP.

Of course, none of this would be complete if we didn't consider quantum postselection. That's what I wanted to end with. In direct analogy to PostBPP, we can define PostBQP as the class of decision problems solvable in polynomial time by a quantum computer with the power of post selection. What I mean is that this is the class of problems where you get to perform a polynomial-time quantum computation and then you get to make some measurement. If you don't like the measurement outcome, you get to kill yourself and condition on still being alive.

Q: Right, but is the B in the definition of PostBPP classical or quantum?
Scott: In PostBQP, we're going to have to define things a bit differently, because there's no analog of r. Instead, we'll say that you perform a polynomial-time quantum computation, make a measurement that accepts with probability greater than 0, and then condition on the outcome of that measurement. Finally, you perform a further measurement on the reduced quantum state that tells you whether to accept or reject. If the answer to the problem is "yes," then the second measurement should accept with probability at least 2/3, conditioned on the first measurement accepting. Likewise, if the answer to the problem is "no," the second measurement should accept with probability at most 1/3, conditioned on the first measurement accepting.

Then, we can ask how powerful PostBQP is. One of the first things you can say is that certainly, PostBPP ⊆ PostBQP. That is, we can simulate a classical computer with post selection. In the other direction, we have PostBQP ⊆ PP. There's this proof that BQP ⊆ PP due to Adleman, DeMorrais and Huang. In that proof, they basically do what physicists would call a Feynman path integral, where you sum over all possible contributions to each of the final amplitudes. It's just a big PP computation. From my point of view, Feynman won the Nobel Prize in physics for showing that BQP ⊆ PP, though he didn't state it that way. Anyway, the proof easily generalizes to show that PostBQP ⊆ PP, because you just have to restrict your summation to those paths where you end up in one of those states you postselect on. You can make all the other paths not matter by making them contribute an equal number of pluses and minuses.

Q: Once you postselect, can you postselect later on as well?
Scott: Can you simulate multiple postselections with one postselection? That's another great question. The answer is yes. We get to that by using the so-called Principle of Deferred Measurement, which tells us that in any quantum computation, we can assume without loss of generality that there's only one measurement at the end. You can simulate all the other measurements using controlled-NOT gates, and then just not look at the qubits containing the measurement outcomes. The same thing holds for postselection. You can save up all the postselections until the end.

What I showed three years ago is that the other direction holds as well: PP ⊆ PostBQP. In particular, this means that quantum postselection is much more powerful than classical postselection, which seems kind of surprising. Classical postselection leaves you in the polynomial hierarchy while quantum postselection takes you up to the counting classes, which we think of as much larger.

Let's run through the proof. So we've got some Boolean function f : {0,1}ⁿ→{0,1} where f is efficiently computable. Let s be the number of inputs x for which f(x) = 1. Our goal is to decide whether s ≥ 2^n-1. This is clearly a PP-complete problem. For simplicity, we will assume without loss of generality that s > 0. Now using standard quantum computing tricks (which I'll skip), it's relatively easy to prepare a single-qubit state like:

This also means that we can prepare the state:

That is, essentially a conditional Hadamard applied to |ψ⟩ for some α and β to be specified later. Let's write out explicitly what H|ψ⟩ is:

So now, I want to suppose that we take two-qubit state above and post-select on the second qubit being 1, then look at what that leaves in the first qubit. You can just do the calculation and you'll get the following state, which depends on what values of α and β we chose before:

Using post selection, we can prepare a state of that form for any fixed α and β that we want. So now, given that, how do we simulate PP? What we're going to do is keep preparing different versions of that state, varying the ratio β/α through {2^-n, 2^-(nn+1), ..., ½, 1, 2, ..., 2ⁿ}. Now, there are two cases: either s < 2^n-1 or s ≥ 2^n-1. Suppose that the first case holds. Then, s and 2ⁿ−2s have the same sign. Since α and β are real, the states |ψ_{α, β}⟩ lie along the unit circle:

If s<2^n − 1, then as we vary β/α, the state |ψ_{α, β}⟩ will always have a positive amplitude for both |0⟩ and |1⟩ (it will lie in the upper-right quadrant). It's not hard to see that at some point, the state is going to become reasonably balanced. That is, the amplitudes of |0⟩ and |1⟩ will come within a constant of each other, as is shown by the solid red vector in the figure above. If we keep measuring these states in the {|+⟩, |−⟩} basis, then one of the states will yield the outcome |+⟩ with high probability.

In the second case, where s ≥ 2^n-1, the amplitude of |1⟩ is never positive, no matter what α and β are set to, while the amplitude of |0⟩ is always positive. Hence, the state always stays in the lower-right quadrant. Now, as we vary β/α through a polynomial number of values, |ψ_{α, β}⟩ never gets close to |+⟩. This is a detectable difference.

So I wrote this up, thinking it was a kind of cute proof. A year later, I realized that there's this Beigel-Reingold-Spielman Theorem which showed that PP is closed under intersection. That is, if two languages are both in PP, then the AND of those two languages is also in PP. This solved a problem that was open for 20 years. What I noticed is that PostBQP is trivially closed under intersection, because if you want to find the intersection of two PostBQP languages, just run their respective PostBQP machines and postselect on both computations giving a valid outcome, and then see if they both accept. You can use amplification to stay within the right error bounds. Since PostBQP is trivially closed under intersection, it provides an alternate proof that PP is closed under intersection that I think is much simpler than the original proof. The way to get this simpler proof really is by thinking about quantum anthropic postselection. It's like a higher-level programming language for constructing the "threshold polynomials" that Beigel-Reingold-Spielman needed for their theorem to work. It's just that quantum mechanics and postselection give you a much more intuitive way of constructing these polynomials.

I just had a couple puzzles to leave you with. You asked about the temporal aspect and how it introduced additional confusion into the Doomsday Argument. There's one puzzle that doesn't involve any of that, but which is still quite unsettling. This puzzle—also due to Bostrom—is called the Presumptuous Philosophers. Imagine that physicists have narrowed the possibilities for a final theory of physics down to two a priori equally likely possibilities. The main difference between them is that Theory 1 predicts that the universe is a billion times bigger than Theory 2 does. In particular, assuming the universe is relatively homogenous (which both theories agree about), Theory 2 predicts that there are going to be about a billion times as many sentient observers in the universe. So the physicists are planning on building a massive particle accelerator to distinguish the two theories—a project that will cost many billions of dollars. Now, the philosophers come along and say that Theory 2 is the right one to within a billion-to-one confidence, since conditioned on Theory 2 being correct, we're a billion times more likely to exist in the first place. The question is whether the philosophers should get the Nobel Prize in Physics for their "discovery."

Of course, what the philosophers are assuming here is the Self-Indication Assumption. So here's where things stand with the SSA and SIA. The SSA leads to the Doomsday Argument, while the SIA leads to Presumptuous Philosophers. It seems like whichever one you believe, you get a bizarre consequence.

Finally, if we want to combine the anthropic computing idea with the Doomsday Argument, then there's the Adam and Eve puzzle. Suppose that Adam and Eve are the first two observers, and that what they'd really like is to solve an instance of an NP-complete problem, say 3SAT. To do so, they pick a random assignment, and form a very clear intention beforehand that if the assignment happens to be satisfying, then they won't have any kids, whereas if the assignment is not satisfying, then they will go forth and multiply. Now, let's assume the SSA. Then, conditioned on having chosen an unsatisfying assignment, how likely is it that they would be an Adam and Eve in the first place, as opposed to one of the vast number of future observers? If we assume that they'll ultimately have (say) 2²ⁿ descendants, then the probability would seem to be at most 2^{-2n + 1}. Therefore, conditioned upon the fact that they are the first two observers, the SSA predicts that with overwhelming probability, they will pick a satisfying assignment. If you're a hardcore Bayesian, you can take your pick between SSA and SIA and swallow the consequences either way!

Lecture 18: Free Will

Scribe: Chris Granade

So today we're going to ask---and hopefully answer---this question of whether there's free will or not. If you want to know where I stand, I'll tell you: I believe in free will. Why? Well, the neurons in my brain just fire in such a way that my mouth opens and I say I have free will. What choice do I have?

Before we start, there are two common misconceptions that we have to get out of the way. The first one is committed by the free will camp, and the second by the anti-free-will camp.

The misconception committed by the free will camp is the one I alluded to before: if there's no free will, then none of us are responsible for our actions, and hence (for example) the legal system would collapse. Well, I know of only one trial where the determinism of the laws of physics was actually invoked as a legal defense. It's the Leopold and Loeb trial in 1926. Have you heard of this? It was one of the most famous trials in American history, next to the OJ trial. So, Leopold and Loeb were these brilliant students at the University of Chicago (one of them had just finished his undergrad at 18), and they wanted to prove that they were Nietzschean supermen who were so smart that they could commit the perfect murder and get away with it. So they kidnapped this 14-year-old boy and bludgeoned him to death. And they got caught---Leopold dropped his glasses at the crime scene.

They were defended by Clarence Darrow---the same defense lawyer from the Scopes monkey trial, considered by some to be the greatest defense lawyer in American history. In his famous closing address, he actually made an argument appealing to the determinism of the universe. "Who are we to say what could have influenced these boys to do this? What kind of genetic or environmental influences could've caused them to commit the crime?" (Maybe Darrow thought he had nothing to lose.) Anyway, they got life in prison instead of the death penalty, but apparently it was because of their age, and not because of the determinism of the laws of physics.

Alright, what's the problem with using the non-existence of free will as a legal defense?

A: The judge and the jury don't have free will either.
Scott: Thank you! I'm glad someone got this immediately, because I've read whole essays about this, and the obvious point never gets brought up.

The judge can just respond, "The laws of physics might have predetermined your crime, but they also predetermined my sentence: DEATH!" (In the US, anyway. In Canada, maybe 30 days...)

Alright, that was the misconception of the free will camp. Now on to the misconception of the anti-free will camp. I've often heard the argument which says that not only is there no free will, but the very concept of free will is incoherent. Why? Because either our actions are determined by something, or else they're not determined by anything, in which case they're random. In neither case can we ascribe them to "free will."

For me, the glaring fallacy in the argument lies in the implication Not Determined ⇒ Random. If that was correct, then we couldn't have complexity classes like NP---we could only have BPP. The word "random" means something specific: it means you have a probability distribution over the possible choices. In computer science, we're able to talk perfectly coherently about things that are non-deterministic, but not random.

Look, in computer science we have many different sources of non-determinism. Arguably the most basic source is that we have some algorithm, and we don't know in advance what input it's going to get. If it were always determined in advance what input it was going to get, then we'd just hardwire the answer. Even talking about algorithms in the first place, we've sort of inherently assumed the idea that there's some agent that can freely choose what input to give the algorithm.

Q: Not necessarily. You can look at an algorithm as just a big compression scheme. Maybe we do know all the inputs we'll ever need, but we just can't write them in a big enough table, so we write them down in this compressed form.
Scott: OK, but then you're asking a technically different question. Maybe there's no efficient algorithm for some problem such that there is an efficient compression scheme. All I'm saying is that the way we use language---at least in talking about computation---it's very natural to say there's some transition where we have this set of possible things that could happen, but we don't know which is going to happen or even have a probability distribution over the possibilities. We would like to be able to account for all of them, or maybe at least one of them, or the majority of them, or whatever other quantifier we like. To say that something is either determined or random is leaving out whole swaths of the Complexity Zoo. We have lots of ways of producing a single answer from a set of possibilities, so I don't think it's logically incoherent to say that there could exist transitions in the universe with several allowed possibilities over which there isn't even a probability distribution.
Q: Then they're determined.
Scott: What?
Q: According to classical physics, everything is determined. Then, there's quantum mechanics, which is random. You can always build a probability distribution over the measurement outcomes. I don't think you can get away from the fact that those are the only two kinds of things you can have. You can't say that there's some particle which can go to one of three states, but that you can't build a probability distribution over them. Unless you want to be a frequentist about it, that's something that just can't happen.
Scott: I disagree with you. I think it does make sense. As one example, we talked about hidden-variable theories. In that case, you don't even have a probability distribution over the future until you specify which hidden-variable theory you're talking about. If we're just talking about measurement outcomes, then yes, if you know the state that you're measuring and you know what measurement you're applying, quantum mechanics gives you a probability distribution over the outcomes. But if you don't know the state or the measurement, then you don't even get a distribution.
Q: I know that there are things out there that aren't random, but I don't concede this argument.
Scott: Good! I'm glad someone doesn't agree with me.
Q: I disagree with your argument, but not your result that you believe in free will.
Scott: My "result"?
Q: Can we even define free will?
Scott: Yeah, that's an excellent question. It's very hard to separate the question of whether free will exists from the question of what the definition of it is. What I was trying to do is, by saying what I think free will is not, give some idea of what the concept seems to refer to. It seems to me to refer to some transition in the state of the universe where there are several possible outcomes and we can't even talk coherently about a probability distribution over them.
Q: Given the history?
Scott: Given the history.
Q: Not to beat this to death, but couldn't you at least infer a probability distribution by running your simulation many times and seeing what your free will entity chooses each time?
Scott: I guess where it becomes interesting is, what if (as in real life) we don't have the luxury of repeated trials?

Newcomb's Paradox

So let's put a little meat on this philosophical bone with a famous thought experiment. Suppose that a super-intelligent Predictor shows you two boxes: the first box has $1000, while the second box has either $1,000,000 or nothing. You don't know which is the case, but the Predictor has already made the choice and either put the money in or left the second box empty. You, the Chooser, have two choices: you can either take the second box only, or both boxes. Your goal, of course, is money and not understanding the universe.

Here's the thing: the Predictor made a prediction about your choice before the game started. If the Predictor predicted you'll take only the second box, then he put $1,000,000 in it. If he predicted you'll both boxes, then he left the second box empty. The Predictor has played this game thousands of times before, with thousands of people, and has never once been wrong. Every single time someone picked the second box, they found a million dollars in it. Every single time someone took both boxes, the found that the second box was empty.

First question: Why is it obvious that you should take both boxes? Right: because whatever's in the second box, you'll get $1,000 more by taking both boxes. The decision of what to put in the second box has already been made; your taking both boxes can't possibly affect it.

Second question: Why is it obvious that you should take only the second box? Right: because the Predictor's never been wrong! Again and again you've seen one-boxers walk away with $1,000,000, and two-boxers walk away with only $1,000. Why should this time be any different?

Q: How good is the Predictor's computer?
Scott: Well, clearly it's pretty good, given that he's never been wrong. We're going to get to that later.

This paradox was popularized by a philosopher named Robert Nozick in 1969. There's a famous line from his paper about it: "To almost everyone, it is perfectly clear and obvious what should be done. The difficulty is that these people seem to divide almost evenly on the problem, with large numbers thinking that the opposing half is just being silly."

There's actually a third position---a boring "Wittgenstein" position---which says that the problem is simply incoherent, like asking about the unstoppable force that hits the immovable object. If the Predictor actually existed, then you wouldn't have the freedom to make a choice in the first place; in other words, the very fact that you're debating which choice to make implies that the Predictor can't exist.

Q: Why can't you get out of the paradox by flipping a coin?
Scott: That's an excellent question. Why can't we evade the paradox using probabilities? Suppose the Predictor predicts you'll take only the second box with probability p. Then he'll put $1,000,000 in that box with the same probability p. So your expected payoff is: 1,000,000 p² + 1,001,000 p(1 − p) + 1,000(1 − p)² = 1,000,000p + 1,000(1-p) leading to exactly the same paradox as before, since your earnings will be maximized by setting p = 1. So my view is that randomness really doesn't change the fundamental nature of the paradox at all.

To review, there are three options: are you a one-boxer, a two-boxer or a Wittgenstein? Let's take a vote.

Results of Voting

Take both boxes:

Take only one box:

The question is meaningless:

Note: there were many double votes.

Well, it looks like the consensus coincides with my own point of view: (1) the question is meaningless, and (2) you should take only one box!

Q: It it really meaningless if you replace the question "what do you choose to do" with "how many boxes will you take?" It's not so much that you're choosing; you're reflecting on what you would in fact do, whether or not there's choice involved.
Scott: That is, you're just predicting your own future behavior? That's an interesting distinction.
Q: How good of a job does the Predictor have to do?
Scott: Maybe it doesn't have to be a perfect job. Even if he only gets it right 90% of the time, there's still a paradox here.
Q: So by the hypothesis of the problem, there's no free will and you have to take the Wittgenstein option.
Scott: Like with any good thought experiment, it's never any fun just to reject the premises. We should try to be good sports.

I can give you my own attempt at a resolution, which has helped me to be an intellectually-fulfilled one-boxer. First of all, we should ask what we really mean by the word "you." I'm going to define "you" to be anything that suffices to predict your future behavior. There's an obvious circularity to that definition, but what it means is that whatever "you" are, it ought to be closed with respect to predictability. That is, "you" ought to coincide with the set of things that can perfectly predict your future behavior.

Now let's get back to the earlier question of how powerful a computer the Predictor has. Here's you, and here's the Predictor's computer. Now, you could base your decision to pick one or two boxes on anything you want. You could just dredge up some childhood memory and count the letters in the name of your first-grade teacher or something and based on that, choose whether to take one or two boxes. In order to make its prediction, therefore, the Predictor has to know absolutely everything about you. It's not possible to state a priori what aspects of you are going to be relevant in making the decision. To me, that seems to indicate that the Predictor has to solve what one might call a "you-complete" problem. In other words, it seems the Predictor needs to run a simulation of you that's so accurate it would essentially bring into existence another copy of you.

Let's play with that assumption. Suppose that's the case, and that now you're pondering whether to take one box or two boxes. You say, "all right, two boxes sounds really good to me because that's another $1,000." But here's the problem: when you're pondering this, you have no way of knowing whether you're the "real" you, or just a simulation running in the Predictor's computer. If you're the simulation, and you choose both boxes, then that actually is going to affect the box contents: it will cause the Predictor not to put the million dollars in the box. And that's why you should take just the one box.

Q: I think you could predict very well most of the time with just a limited dataset.
Scott: Yeah, that's probably true. In a class I taught at Berkeley, I did an experiment where I wrote a simple little program that would let people type either "f" or "d" and would predict which key they were going to push next. It's actually very easy to write a program that will make the right prediction about 70% of the time. Most people don't really know how to type randomly. They'll have too many alternations and so on. There will be all sorts of patterns, so you just have to build some sort of probabilistic model. Even a very crude one will do well. I couldn't even beat my own program, knowing exactly how it worked. I challenged people to try this and the program was getting between 70% and 80% prediction rates. Then, we found one student that the program predicted exactly 50% of the time. We asked him what his secret was and he responded that he "just used his free will."
Q: It seems like a possible problem with "you-completeness" is that, at an intuitive level, you is not equal to me. But then, anything that can simulate me can also presumably simulate you, and so that means that the simulator is both you and me.
Scott: Let me put it this way: the simulation has to bring into being a copy of you. I'm not saying that the simulation is identical to you. The simulation could bring into being many other things as well, so that the problem it's solving is "you-hard" rather than "you-complete."
Q: What happens if you have a "you-oracle" and then decide to do whatever the simulation doesn't do?
Scott: Right. What can we conclude from that? If you had a copy of the Predictor's computer, then the Predictor is screwed, right? But you don't have a copy of the Predictor's computer.
Q: So this is a theory of metaphysics which includes a monopoly on prediction?
Scott: Well, it includes a Predictor, which is a strange sort of being, but what do you want from me? That's what the problem stipulates.

One thing that I liked about about my solution is that it completely sidesteps the mystery of whether there's free will or not, in much the same way that an NP-completeness proof sidesteps the mystery of P versus NP. What I mean is that, while it is mysterious how your free will could influence the output of the Predictor's simulation, it doesn't seem more mysterious than how your free will could influence the output of your own brain! It's six of one, half a dozen of the other.

One reason I like this Newcomb's Paradox is that it gets at a connection between "free will" and the inability to predict future behavior. Inability to predict the future behavior of an entity doesn't seem sufficient for free will, but it does seem somehow necessary. If we had some box, and if without looking inside this box, we could predict what the box was going to output, then we would probably agree among ourselves that the box doesn't have free will. Incidentally, what would it take to convince me that I don't have free will? If after I made a choice, you showed me a card that predicted what choice I was going to make, wellm that's the sort of evidence that seems both necessary and sufficient. Modern neuroscience does get close to this for certain kinds of decisions. There were some famous experiments in the 1980's, where someone would attach electrodes to someone's brain and would tell them that they could either press button 1 or 2. Something like 200ms before the person was conscious of making the decision of which button to press, certainly before they physically moved their finger, you could see the neurons spiking for that particular finger. So you can actually predict which button the person is going to press a fraction of a second before they're aware of having made a choice. This is the kind of thing that forces us to admit that some of our choices are less free than they feel to us---or at least, that whatever is determining these choices acts earlier in time than it seems to subjective awareness.

If free will depends on an inability to predict future behavior, then it would follow from that free will somehow depends on our being unique: on it being impossible to copy us. This brings up another of my favorite thought experiments: the teleportation machine.

Suppose that in the far future, there's a very simple way of getting to Mars---the Mars Express---in only 10 minutes. It encodes the positions of all the atoms in your body as information, then transmits it to Mars as radio waves, reconstitutes you on Mars, and (naturally) destroys the original. Who wants to be the first to sign up and buy tickets? You can assume that destroying the original is painless. If you believe that your mind consists solely of information, then you should be lining up to get a ticket, right?

Q: I think there's a big difference between the case where you take someone apart then put them together on the other end, and the case where you look inside someone to figure out how to build a copy, build a copy at the end and then kill the original. There's a big difference between moving and copying. I'd love to get moved, but I wouldn't go for the copying.
Scott: The way moving works in most operating systems and programming languages is that you just make a copy then delete the original. In a computer, moving means copy-and-deleting. So, say you have a string of bits x₁, ..., x_n and you want to move it from one location to another. Are you saying it matters whether we first copy all of the bits then delete the first string, or copy-and-delete just the first bit, then copy-and-delete the second bit and so on? Are you saying that makes a difference?
Q: It does if it's me.
Q: I think I'd just want to be copied, then based on my experiences decide whether the original should be destroyed or not, and if not, just accept that there's another version of me out there.
Scott: OK. So which of the two yous is going to make the decision? You'll make it together? I guess you could vote, but you might need a third you to break the tie.
Q: Are you a quantum state or a classical state?
Scott: You're ahead of me, which always makes me happy. One thing that's always really interested me about the famous quantum teleportation protocol (which lets you "dematerialize" a quantum state and "rematerialize" it at another location) is that in order for it to work, you need to measure---and hence destroy---the original state. But going back to the classical scenario, it seems even more problematic if you don't destroy the original than if you do. Then you have the problem of which one is "really" you. Q: This reminds me of the many-worlds interpretation.
Scott: At least there, two branches of a wave function are never going to interact with each other. At most, they might interfere and cancel each other out, but here the two copies could actually have a conversation with each other! That adds a whole new layer of difficulties.
Q: So if you replaced your classical computer with a quantum computer, you couldn't just copy-and-delete to move something...
Scott: Right! This seems to me like an important observation. We know that if you have an unknown quantum state, you can't just copy it, but you can move it. So then the following question arises: is the information in the human brain encoded in some orthonormal basis? Is it copyable information or non-copyable information? The answer does not seem obvious a priori. Notice that we aren't asking if the brain is a quantum computer (let alone a quantum gravity computer a la Penrose), or whether it can factor 300-digit integers. Maybe Gauss could, but it's pretty clear that the rest of us can't. But even if it's only doing classical computation, the brain could still be doing it in a way that involves single qubits in various bases, in such a way that it would be physically impossible to copy important parts of the brain's state. There wouldn't even have to be much entanglement for that to be the case. We know that there are all kinds of tiny effects that can play a role in determining whether a given neuron will fire or not. So, how much information do you need from a brain to predict a person's future behavior (at least probabilistically)? Is all the information that you need stored in "macroscopic" variables like synaptic strengths, which are presumably copyable in principle? Or is some of the information stored microscopically, and possibly not in a fixed orthonormal basis? These are not metaphysical questions. They are, in principle, empirically answerable ones.

Now that we've got quantum in the picture, let's stir the pot a little bit more and bring in relativity. There's this argument (again, you can read whole Ph. D. theses about all these things) called the block-universe argument. The idea is that somehow special relativity precludes the existence of free will. Here you are, and you're trying to decide whether to order pizza or Chinese take-out. Here's your friend, who's going to come over later and wants to know what you're going to order. As it happens, your friend is traveling close to the speed of light in your rest frame. Even though you perceive yourself agonizing over the decision, from her perspective, your decision has already been made.

Q: You and your friend are spacelike-separated, so what does that even mean?
Scott: Exactly. I don't really think, personally, that this argument says anything about the existence or non-existence of free will. The problem is that it only works with spacelike-separated observers. Your friend can say, in principle, that in what she perceives to be her spacelike hypersurface, you've already made your decision---but she still doesn't know what you actually ordered! The only way for the information to propagate to your friend is from the point where you actually made the decision. To me, this just says that we don't have a total time-ordering on the set of events---we just have a partial ordering. But I've never understood why that should preclude free will.

I have to rattle you up somehow, so let's throw quantum, relativity and free will all into the stew. There was a paper recently by Conway and Kochen called The Free Will Theorem, which got a fair bit of press. So what is this theorem? Basically, Bell's Theorem, or rather an interesting consequence of Bell's Theorem. It's kind of a mathematically-obvious consequence, but still very interesting. You can imagine that there's no fundamental randomness in the universe, and that all of the randomness we observe in quantum mechanics and the like was just predetermined at the beginning of time. God just fixed some big random string, and whenever people make measurements, they're just reading off this one random string. But now suppose we make the following three assumptions:

We have the free will to choose in what basis to measure a quantum state. That is, at least the detector settings are not predetermined by the history of the universe.
Relativity gives some way for two actors (Alice and Bob) to perform a measurement such that in one reference frame Alice measures first, and in another frame Bob measures first.
The universe cannot coordinate the measurement outcomes by sending information faster than light.

Given these three assumptions, the theorem concludes that there exists an experiment---namely, the standard Bell experiment---whose outcomes are also not predetermined by the history of the universe. Why is this true? Basically, because supposing that the two outcomes were predetermined by the history of the universe, you could get a local hidden-variable model, in contradiction to Bell's Theorem. You can think of this theorem as a slight generalization of Bell's Theorem: one that rules out not only local hidden-variable theories, but also hidden-variable theories that obey the postulates of special relativity. Even if there were some non-local communication between Alice and Bob in their different galaxies, as long as there are two reference frames such that Alice measures first in one and Bob measures first in the other, you can get the same inequality. The measurement outcomes can't have been determined in advance, even probabilistically; the universe must "make them up on the fly" after seeing how Alice and Bob set their detectors. I wrote a review of Steven Wolfram's book a while ago where I mentioned this, as a basic consequence of Bell's Theorem that ruled out the sort of deterministic model of physics that Wolfram was trying to construct. I didn't call my little result the Free Will Theorem, but now I've learned my lesson: if I want people to pay attention, I should be talking about free will! Hence this lecture.

Years ago, I was at one of John Preskill's group meetings at Caltech. Usually, it was about very physics-y stuff and I had trouble understanding. But once, we were talking about a quantum foundations paper by Chris Fuchs, and things got very philosophical very quickly. Finally, someone got up and wrote on the board: "Free Will or Machine?" And asked for a vote. "Machine" won, seven-to-five. But we don't have to accept their verdict! We can take our own vote.

The Results

Free Will:

Machines:

Note: The class was largely divided between those who abstained and those who voted for both.

I'll leave you with the following puzzle for next time:
Dr. Evil is on his moon base, and he has a very powerful laser pointed at the Earth. Of course, he's planning to obliterate the Earth, being evil and all. At the last minute, Austin Powers hatches a plan, and sends Dr. Evil the following message: "Back in my lab here on Earth, I've created a replica of your moon base down to the last detail. The replica even contains an exact copy of you. Everything is the same. Given that, you actually don't know if you're in your real moon base or in my copy here on Earth. So if you obliterate the Earth, there's a 50% chance you'll be killing yourself!" The puzzle is, what should Dr. Evil do? Should he fire the laser or not? (See here for the paper about this.)

Lecture 19: Time Travel

Scribe: Chris Granade

Last time we talked about free will, superintelligent predictors, and Dr. Evil planning to destroy the earth from his moon base. Today I'd like to talk about a more down-to-earth topic: time travel. The first point I have to make is one that Carl Sagan made: we're all time travelers—at the rate of one second per second! Har har! Moving on, we have to distinguish between time travel into the distant future and into the past. Those are very different.

Travel into the distant future is by far the easier of the two. There are several ways to do it:

Cryogenically freeze yourself and thaw yourself out later. Travel at relativistic speed. Go close to a black hole horizon.

This suggests one of my favorite proposals for how to solve NP-complete problems in polynomial time: why not just start your computer working on an NP-complete problem, then board a spaceship traveling at close to the speed of light and return to Earth to pick up the solution? If this idea worked, it would let us solve much more than just NP. It would also let us solve PSPACE-complete and EXP-complete problems–maybe even all computable problems, depending on how much speedup you want to assume is possible. So what are the problems with this approach?

A: The Earth ages, too.
Scott: Yeah, so all your friends will be dead when you get back. What's a solution to that?
A: Bring the whole Earth with you, and leave your computer floating in space.
Scott: Well, at least bring all your friends with you!

Let's suppose you're willing to deal with the inconvenience of the Earth having aged exponentially many years. Are there any other problems with this proposal? The biggest problem is, how much energy does it take to accelerate to relativistic speed? Ignoring the time spent accelerating and decelerating, if you travel at a v fraction of the speed of light for a proper time t, then the elapsed time in your computer's reference frame is:

It follows that, if you want t′ to be exponentially larger than t, then v has to be exponentially close to 1. There might already be fundamental difficulties with that, coming from quantum gravity, but let's ignore that for now. The more obvious problem is, you're going to need an exponential amount of energy to accelerate to this speed v. Think about your fuel tank, or whatever else is powering your spaceship. It's going to have to be exponentially large! Just for locality reasons, how is the fuel from the far parts of the tank going to affect you? Here, I'm using the fact that spacetime has a constant number of dimensions. (Well, and I'm also using the Schwarzchild bound, which limits the amount of energy that can be stored in a finite region of space: your fuel tank certainly can't be any denser than a black hole!)

Let's talk about the more interesting kind of time travel: the backwards kind. Can closed timelike curves (CTCs) exist in Nature? This question has a very long history of being studied by physicists on weekends. It was discovered early on, by Gödel and others, that classical general relativity admits CTC solutions. All of the known solutions, however, have some element that can be objected to as being "unphysical." For example, some solutions involve wormholes, but that requires "exotic matter" having negative mass to keep the wormhole open. They all, so far, involve either non-standard cosmologies or else types of matter or energy that have yet to be experimentally observed. But that's just classical general relativity. Once you put quantum mechanics in the picture, it becomes an even harder question. General relativity is not just a theory of some fields in spacetime, but of spacetime itself, and so once you quantize it, you'd expect there to be fluctuations in the causal structure of spacetime. The question is, why shouldn't that produce CTCs?

Incidentally, there's an interesting metaquestion here: why have physicists found it so hard to create a quantum theory of gravity? The technical answer usually given is that, unlike (say) Maxwell's equations, general relativity is not renormalizable. But I think there's also a simpler answer, one that's much more understandable to a doofus layperson like me. The real heart of the matter is that general relativity is a theory of spacetime itself, and so a quantum theory of gravity is going to have to be talking about superpositions over spacetime and fluctuations of spacetime. One of the things you'd expect such a theory to answer is whether closed timelike curves can exist. So quantum gravity seems "CTC-hard", in the sense that it's at least as hard as determining if CTCs are possible! And even I can see that this can't possibly be a trivial question to settle. Even if CTCs are impossible, presumably they're not going to be proven impossible without some far-reaching new insight. Of course, this is just one instantiation of a general problem: that no one really has a clear idea of what it means to treat spacetime itself quantum-mechanically.

In the field I come from, it's never our place to ask if some physical object exists or not, it's to assume it exists and see what computations we can do with it. Thus, from now on, we'll assume CTCs exist. What would the consequences be for computational complexity? Perhaps surprisingly, I'll be able to give a clear and specific answer to that.

So how would you exploit a closed timelike curve to speed up computation? First let's consider the naïve idea: compute the answer, then send it back in time to before your computer started.

From my point of view, this "algorithm" doesn't work even considered on its own terms. (It's nice that, even with something as wacky as time travel, we can definitively rule certain ideas out!) I know of at least two reasons why it doesn't work. Anyone want to take a shot?

A: The universe can still end in the time you're computing the answer.
Scott: Yes! Even in this model where you can go back in time, it seems to me that you still have to quantify how much time you spend in the computation. The fact that you already have the answer at the beginning doesn't change the fact that you still have to do the computation! Refusing to count the complexity of that computation is like maxing out your credit card, then not worrying about the bill. You're going to have to pay up later!
A: Couldn't you just run the computation for an hour, go back in time, continue the computation for another hour, then keep repeating until you're done?
Scott: Ah! That's getting toward my second reason. You just gave a slightly less naïve idea, which also fails, but in a more interesting way.
A: The naïve idea involves iterating over the solution space, which could be uncountably large.
Scott: Yeah, but let's assume we're talking about an NP-complete problem, so that the solution space is finite. If we could merely solve NP-complete problems, we'd be pretty happy.

Let's think some more about the proposal where you compute for an hour then go back in time, compute for another hour then go back again and so on. The trouble with this proposal is that it doesn't take seriously that you're going back in time. You're treating time as a spiral, as some sort of scratchpad that you can keep erasing and writing over, but you're not going back to some other time, you're going back to the time that you started from. Once you accept that this is what we're talking about, you immediately start having to worry about the Grandfather Paradox (i.e., where you go back in time and kill your grandfather). For example, what if your computation takes as input a bit b from the future, and produces as output a bit ¬b, which then goes back in time to become the input? Now when you use ¬b as input, you compute ¬¬b = b as output, and so on. This is just the Grandfather Paradox in a computational form. We have to come up with some account of what happens in this situation. If we're talking about closed timelike curves at all, then we're talking about something where this sort of behavior can happen, and we need some theory of what results.

My own favorite theory was proposed by David Deutsch in 1991. His proposal was that if you just go to quantum mechanics, the problem is solved. Indeed, quantum mechanics is overkill: it works just as well to go to a classical probabilistic theory. In the latter case, you have some probability distribution (p₁,...,p_n) over the possible states of your computer. Then the computation that takes place within the closed timelike curve can be modeled as a Markov chain, which transforms this distribution to a different one. What should we impose if we want to avoid grandfather paradoxes?

A: That the output distribution should be the same as the input one?
Scott: Exactly.

We should impose the requirement that Deutsch calls causal consistency: the computation within the CTC must map the input probability distribution to itself. In deterministic physics, we know that this sort of consistency can't always be achieved—that's just another way of stating the Grandfather Paradox. But as soon as we go to probabilistic theories, well, it's a basic fact that every Markov chain has at least one stationary distribution. In this case of the Grandfather Paradox, the unique solution is that you're born with probability ½, and if you're born, you go back in time and kill your grandfather. Thus, the probability that you go back in time and kill your grandfather is ½, and hence you're born with probability ½. Everything is consistent; there's no paradox.

One thing that I like about Deutsch's resolution is that it immediately suggests a model of computation. First we get to choose a polynomial-size circuit C : {0,1}ⁿ → {0,1}ⁿ. Then Nature chooses a probability distribution D over strings of length n such that C(D)=D, and gives us a sample y drawn from D. (If there's more than one fixed-point D, then we'll suppose to be conservative that Nature makes her choice adversarially.) Finally, we can perform an ordinary polynomial-time computation on the sample y. We'll call the complexity class resulting from this model P_CTC.

What can we say about this class? My first claim is that NP ⊆ P_CTC; that is, closed timelike curve computers can solve NP-complete problems in polynomial time. Does anyone see why? More concretely, suppose we have a Boolean formula φ in n variables and we want to know if there's a satisfying assignment. What should our circuit C do?

A: If the input is a satisfying assignment, spit it back out?
Scott: Good. And what if the input isn't a satisfying assignment?
A: Iterate to the next assignment?
Scott: Right! And go back to the beginning if you've reached the last assignment.

We'll just have this loop over all possible assignments, and we stop as soon as we get to a satisfying one. Assuming there exists a satisfying assignment, the only stationary distributions will be concentrated on satisfying assignments. So when we sample from a stationary distribution, we'll certainly see such an assignment. (If there are no satisfying assignments, then the stationary distribution is uniform.)

Q: So we're assuming that Nature gives us this stationary distribution for free?
Scott: Yes. Once we set up the CTC, its evolution has to be causally consistent to avoid grandfather paradoxes. But that means Nature has to solve a hard computational problem to make it consistent! That's the key idea that we're exploiting.

Related to this algorithm for solving NP-complete problems is what Deutsch calls the "knowledge creation paradox." The paradox is best illustrated through the movie Star Trek IV. The Enterprise crew has gone back in time to the present (meaning to 1986) in order to find a humpback whale and transport it to the 23^rd century. But to build a tank for the whale, they need a type of plexiglass that hasn't been invented yet. So in desperation, they go to the company that will invent the plexiglass, and reveal the molecular formula to that company. They then wonder: how did the company end up inventing the plexiglass? Hmmmm....

Note that the knowledge creation paradox is a time travel paradox that's fundamentally different from the grandfather paradox, because here there's no actual logical inconsistency. This paradox is purely one of computational complexity: somehow this hard computation gets performed, but where was the work put in? In the movie, somehow this plexiglass gets invented without anyone ever having taken the time to invent it!

As a side note, my biggest pet peeve about time travel movies is how they always say, "Be careful not to step on anything, or you might change the future!" "Make sure this guy goes out with that girl like he was supposed to!" Dude—you might as well step on anything you want. Just by disturbing the air molecules, you've already changed everything.

OK, so we can solve NP-complete problems efficiently using time travel. But can we do more than that? What is the actual computational power of closed timelike curves? I claim that certainly, P_CTC is contained in PSPACE. Does anyone see why?

Well, we've got this exponentially large set of possible inputs x ∈ {0, 1}ⁿ to the circuit C, and our basic goal is to find an input x that eventually cycles around (that is, such that C(x)=x, or C(C(x))=x, or...). For then we'll have found a stationary distribution. But finding such an x is clearly a PSPACE computation. For example, we can iterate over all possible starting states x, and for each one apply C up to 2ⁿ times and see if we ever get back to x. Certainly, this is in PSPACE.

My next claim is that P_CTC is equal to PSPACE. That is, CTC computers can solve not just NP-complete problems, but all problems in PSPACE. Why?

Well, let M₀, M₁, ... be the successive configurations of a PSPACE machine M. Also, let M_acc be the "halt and accept" configuration of M, and let M_rej be the "halt and reject" configuration. Our goal is to find which of these configurations the machine goes into. Note that each of these configurations takes a polynomial number of bits to write down. Then, we can define a polynomial-size circuit C that takes as input some configuration of M plus some auxiliary bit b. The circuit will act as follows:

C(⟨M_i, b⟩)	=	⟨M_i + 1, b⟩
C(⟨M_acc, b⟩)	=	⟨M₀, 1⟩
C(⟨M_rej, b⟩)	=	⟨M₀, 0⟩

So, for each configuration that isn't the accepting or rejecting configuration, C increments to the next configuration, leaving the auxiliary bit as it was. If it reaches an accepting configuration, then it loops back to the beginning and sets the auxiliary bit to 1. Similarly, if it reaches an rejecting configuration, then it loops back and sets the auxiliary bit to 0.

Simulating a PSPACE computation in P_CTC.

Now if we think about what's going on, we have two parallel computations: one with the answer bit set to 0, the other with the answer bit set to 1. If the true answer is 0, then the rejecting computation will go around in a loop, while the accepting computation will lead into that loop. Likewise, if the true answer is 1, it's the accepting computation that will go around in a loop. The only stationary distribution, then, is a uniform distribution over the computation steps with b set to the correct answer. We can then read off a sample and look at b, to find out whether the PSPACE machine accepts or rejects.

Thus we can tightly characterize P_CTC as equal to PSPACE. One way to think about it is that having a closed timelike curve makes time and space equivalent as computational resources. In retrospect, maybe we should have expected that all along, but we still have to show it!

Now, there's an obvious question that we have to ask: what if we have a quantum computer acting inside the closed timelike curve? Obviously, we need to know the answer. How does this work? Now we have a polynomial-sized quantum circuit instead of a classical circuit, and we say that we have two sets of qubits: "closed timelike curve qubits" and "chronology-respecting qubits." We can do some quantum computation on both of them, but we're only really going to care about the CTC qubits. There will be some induced superoperator S that acts on the CTC qubits. (Recall that a superoperator is just a general quantum operation, not necessarily unitary.) Then, Nature will adversarially find a mixed state ρ that is a fixed point of S: i.e., such that S(ρ) = ρ. It's not always possible to find a pure state ρ=|ψ⟩⟨ψ| with that property, but by basic linear algebra (Deutsch worked out the details, and I won't drag you through them) there is always such a mixed state.

Q: So ρ is a state just over the CTC qubits?
Scott: Yes. The only real reason for the other qubits is that without them, the superoperator would always be unitary, in which case the maximally mixed state I would always be fixed point. And that would trivialize the model.

As a general principle, quantum computers can simulate classical ones, and (as is easily shown) it's no different when we throw in closed timelike curves. So we can certainly say that BQP_CTC contains PSPACE. But what's an upper bound on BQP_CTC?

A: EXPSPACE?
Scott: EXPSPACE would certainly work, yes. Can we give a better upper bound?

So we're given an n-qubit superoperator (specified implicitly by a circuit) and we want to find a fixed point of it. This is basically a linear algebra problem. We know that you can do linear algebra in time polynomial in the dimension of the Hilbert space, which in this case is 2ⁿ. This implies that we can simulate BQP_CTC in EXP. So we now have that BQP_CTC is somewhere between PSPACE and EXP. In my survey paper on NP-complete Problems and Physical Reality a few years ago, pinning this down further was the main technical open problem!

Recently John Watrous and I were able to solve the problem. Our result is that BQP_CTC = P_CTC = PSPACE. In other words, if closed timelike curves existed, then quantum computers would be no more powerful than classical ones.

Q: Do we know anything about other classes with closed timelike curves? Like PSPACE_CTC?
Scott: That one is going to be PSPACE again. On the other hand, you can't just take any complexity class and append a _CTC to it. You have to say what that means, and for some classes (like NP) it won't even make any sense.

In the last part of the lecture I can give you a little hint of why BQP_CTC ⊆ PSPACE. Given a superoperator S that's described by a polynomial-size quantum circuit, which maps n qubits to n qubits, our goal is to compute a mixed state ρ such that S(ρ) = ρ. We won't be able to write down ρ explicitly (it would be far too large to fit in a PSPACE machine's memory), but all we're really aiming to do is to simulate the result of some polynomial-time computation that could have been performed on ρ.

Let vec(ρ) be the "vectorization" of ρ (a vector with 2²ⁿ components, one for each matrix entry of ρ). Then there exists a 2²ⁿ×2²ⁿ matrix M such that for all ρ, S(ρ)=ρ if and only if M vec(ρ) = vec(ρ). In other words, we can just expand everything out from matrices to vectors, and then our goal is to find a +1 eigenvector of M.

Define P := lim_z → 1 (1 − z)(I − zM)^-1. Then by Taylor expansion:

MP	=	M lim_z → 1 (1 − z) (I+zM+z²M²+⋅⋅⋅)
	=	lim_z → 1 (1 − z)(M+zM²+z²M³+⋅⋅⋅)
	=	lim_z → 1 (1 − z)/z (zM+z²M²+z³M³+⋅⋅⋅)
	=	lim_z → 1 (1 − z)/z [(I − zM)^-1 − I]
	=	lim_z → 1 (1 − z)/z (I − zM)^-1
	=	lim_z → 1 (1 − z) (I − zM)^-1
	=	P

In other words, P projects onto fixed-points of M. For all v, M(Pv) = (Pv).

So now all we need to do is start with some arbitrary vector v—say vec(I) where I is the maximally mixed state—and then compute:

But how do we do apply this matrix P in PSPACE? Well, we can apply M in PSPACE since it's just a polynomial-time quantum computation. But what about taking a matrix inverse? Here, we borrow something from computational linear algebra. Csanky's algorithm, proposed in the 1970's, lets us compute the inverse of an n×n matrix not merely in polynomial time, but by a circuit of depth log²n. Similar algorithms are actually used in practice today, for example when doing scientific computing with lots of parallel processors. Now, "shifting everything up" by an exponential, we find that it's possible to invert a 2²ⁿ×2²ⁿ matrix using a circuit of size 2^O(n) and depth O(n²). But computing the output of an exponential-size, polynomial-depth circuit (which is described to us implicitly) is a PSPACE computation—in fact it's PSPACE-complete. As a final step, one can take the limit as z → 1 using algebraic rules, and some further tricks due to Beame, Cook, and Hoover.

Obviously I'm skipping a lot of details.

Q: So does this P always project onto the vectorization of a density matrix?
Scott: OK, that's an additional point that needs to be argued. If you look at the power series above, each individual term maps a vectorization of a density matrix onto another such vectorization, so the sum has to project onto vectorizations of density matrices as well. (Well, you might worry about the normalization, but that works out also.)

As usual, I'll end with a puzzle for next lecture. Suppose you can only fit a single bit at a time through a CTC. You can make as many CTCs as you like, but you can only send one bit through each, not a polynomial number of bits. (After all, we don't want to be extravagant!) In this alternate model, can you solve NP-complete problems in polynomial time?

Lecture 20: Cosmology and Complexity

Scribe: Chris Granade

Puzzle from last week: What can you compute with "narrow" CTCs that only send one bit back in time?

Solution: Let x be a chronology-respecting bit, and let y be a CTC bit. Then, set x := x ⊕ y and y := x. Suppose that Pr[x = 1] = p and Pr[y = 1] = q. Then, causal consistency implies p = q. Hence, Pr[x ⊕ y = 1] = p(1 − q) + q(1 − p) = 2p(1 − p).

So we can start with p exponentially small, and then repeatedly amplify it. We can thereby solve NP-complete problems in polynomial time (and indeed PP ones also, provided we have a quantum computer).

I was going to offer some grand summation that would drive home the central message of this class, but then I thought about it and realized that there is no central message. I just stand up here and rant about whatever I deem interesting. What we'll do instead is to talk about cosmology and complexity.

I'll start with the "New York Times model" of cosmology– that is, the thing that you read about in popular articles until fairly recently– which says that everything depends on the density of matter in the universe. There's this parameter Ω which represents the mass density of the universe, and if it's greater than 1, the universe is closed. That is, the matter density of the universe is high enough that after the Big Bang, there has to be a Big Crunch. Furthermore, if Ω>1, spacetime has a spherical geometry (positive curvature). If Ω=1, the geometry of spacetime is flat and there's no Big Crunch. If Ω<1, then the universe is open, and has a hyperbolic geometry. The view was that these are the three cases.

Today, we know that this model is wrong in at least two ways. The first way it's wrong is of course that it ignores the cosmological constant. As far as astronomers can see, space is roughly flat. That is, no one has detected a non-trivial spacetime curvature at the scale of the universe. There could be some curvature, but if there is, then it's pretty small. The old picture would therefore lead you to think that the universe must be poised on the brink of a Big Crunch: change the matter density just a tiny bit, and you could get a spherical universe that collapses or a hyperbolic one that expands forever. But in fact, the universe is not anywhere near the regime where there would be a Big Crunch. Why are we safe? Well, you have to look at what the energy density of the universe is made up of. There's matter, including ordinary matter as well as dark matter, there's radiation, and then there's the famous cosmological constant detected a decade ago, which describes the energy density of empty space. Their (normalized) sum Ω seems to equal 1 as far as anyone can measure, which is what makes space flat, but the cosmological constant Λ is not zero, as had been assumed for most of the 20^th century. In fact, about 70% of the energy density of the observable universe (in this period of time) is due to the cosmological constant.

From http://www-supernova.lbl.gov/.

Along the diagonal black line is where space is flat. This is where the energy densities due to the cosmological constant and matter sum to 1. In the previous view, there was no cosmological constant, and space was flat, and so we're at the intersection of the two solid black lines. You can see the other solid black line slowly starts curving up. If you're above that line, then the universe expands forever, whereas if you're below this line, then the universe recollapses. So if you're at the intersection, then you really are right at the brink between expanding and collapsing. But, given that 70% of the energy density of the universe is due to Λ, you can see that we're somewhere around the intersection of the diagonal line with the blue oval–i.e., nowhere near where we recollapse.

But that's only one thing that's wrong with the simple "spherical/flat/hyperbolic" trichotomy. Another thing wrong with it is that the geometry of the universe and its topology are two separate questions. Just assuming the universe is flat doesn't imply that it's infinite. If the universe had a constant positive curvature, that would imply it was finite. Picture the Earth; on learning that it has a constant positive curvature, you would conclude it's round. I mean, yes, it could curve off to infinity where you can't see it, but assuming it's homogenous in curvature, mathematically it has to curve around in either a sphere or some other more complicated finite shape. If space is flat, however, that doesn't tell you whether it's is finite or infinite. It could be like one of the video games where when you go off one end of the screen, you reappear on the other end. That's perfectly compatible with geometric flatness, but would correspond to a closed topology. The answer, then, to whether the universe is finite or infinite, is unfortunately that we don't know. (For more see this paper by Cornish and Weeks.)

Q: But with positive curvature, you could have something that tapers off infinitely like a paraboloid.
Scott: Yes, but that wouldn't be uniform positive curvature. Uniform means that the curvature is the same everywhere.
Q: It seems like what's missing in all these pictures so far is time. Are we saying that time started at some fixed point, or that time goes all the way back to negative infinity?
Scott: All of these pictures assume that there was a Big Bang, right? All of these are Big Bang cosmologies.
Q: So if time started at some finite point, then time is finite. But relativity tells us that there's really no difference between space and time, right?
Scott: No, it doesn't tell us that. It tells us that time and space are interrelated in a non-trivial way, but time has a different metric signature than space. As an aside, this is one of my pet peeves. I actually had a physicist ask me once how P could be different from PSPACE since "relativity tells us that time and space are the same." Well, the point is that time has a negative signature. This is related to the fact that you can go backwards and forwards in space, but you can only go forwards in time. We talked in the last lecture about closed timelike curves. The point about CTCs is that they would let you go backwards in time and as a consequence, time and space really would become equivalent as computational resources. But as long as you can only go one direction in time, it's not the same as space.
Q: So can we go far in space enough to loop around?
Scott: If your arm was long enough, could you stretch it out in front of you and punch yourself in the back of your head? As I was saying, the answer is that we don't know.
Q: As far as the spread of mass is concerned, I think that people believe that is finite, because of the Big Bang.
Scott: That's a misconception about the Big Bang. The Big Bang is not something that happens at one point in space; the Big Bang is the event that creates spacetime itself. The standard analogy is that the galaxies are little spots on a balloon, and as the balloon expands, it's not that the spots are rushing away from each other, it's that the balloon is getting bigger. If spacetime is open, then it could well be that instead of just having a bunch of matter crowded around, you've actually got an infinite amount of matter at the moment of the Big Bang. As time goes by, the infinite universe gets stretched out, but at any point in time, it would still go on infinitely. If you look at our local horizon, we see things rushing away from each other, but that's just because we can't go past that horizon and see what's beyond it. So the Big Bang isn't some explosion that happened at some time and place; it's just the beginning of the whole manifold.
Q: But then shouldn't the mass/energy not spread out faster than the speed of light?
Scott: That's another great question; I'm glad to have something I can actually explain! Within a fixed reference frame, you can have two points appearing to recede from each other faster than light, but the reason is they appear to recede is just that the intervening space is expanding. Indeed, the empirical fact is that faraway galaxies do rush away from each other faster than light. What's limited by the speed of light is the speed with which an ant can move along the surface of the expanding balloon--not the expansion speed of the balloon itself.
Q: So would it be possible to observe an object moving away faster than the speed of light?
Scott: Well, if some light was emitted a long time ago (say, shortly after the Big Bang), then by the time that light reaches us, we may be able to infer that the galaxy the light came from must now be receding away from us faster than the speed of light.
Q: Can two galaxies move towards each other faster than the speed of light?
Scott: In a collapse, yes.
Q: How do we avoid all the old paradoxes that come with allowing objects to move faster than the speed of light?
Scott: In other words, why doesn't faster-than-light expansion or contraction cause causality problems? See, this is where I start having to defer to people who actually understand GR. But let me take a shot: there are certainly possible geometries of spacetime–for example, those involving wormholes, or Gödel's rotating universe–that do have causality problems. But what about the actual geometry we live in? Here, things are just receding away from each other, which is not something you can actually use to sendsignals faster than light. What you can get, in our geometry, are objects that are so far away from each other that naïvely they should "never have been in causal contact," but nevertheless seem like they must have been. So, the hypothesis is that there was a period of rapid inflation in the extremely early universe, so that objects could reach equilibrium with each other and only then be causally separated by inflation.

So what is this cosmological constant? Basically, a kind of anti-gravity. It's something that causes two given points in spacetime to recede away from each other at an exponential rate. What's the obvious problem with that? As the Woody Allen character's mother told him, "Brooklyn is not expanding." If this expansion is such an important force in the universe, why doesn't it matter within our own planet or galaxy? Because on the scale that we live, there are other forces like gravity that are constantly counteracting the expansion. Imagine two magnets on the surface of a slowly-expanding balloon: even though the balloon is expanding, the magnets still stick together. It's only on the scale of the entire universe that the cosmological constant is able to win over gravity.

You can talk about this in terms of the scale factor of the universe. Let's measure the time t since the beginning of the universe in the rest frame of the cosmic background radiation (the usual trick). How "big" is the universe as a function of t? Or to put it more carefully, given two test points, how has the distance between them changed as a function of time? The hypothesis behind inflation is that at the very beginning–at the Big Bang–there's this enormous exponential growth for a few Planck times. Following that, you've got some expansion, but also have gravity trying to pull the universe together. It works out there that the scale factor increases as t^2/3. Ten billion years after the Big Bang, when life is first starting to form on Earth, the cosmological constant starts winning out over gravity. After this, it's just exponential all the way, like in the very beginning but not as fast.

Evolution of the scale factor a(t) (not to scale).

It's an interesting question as to why we should be alive at a time when the cosmological constant is 70% and matter is 30% of the energy density. Why shouldn't one of them be almost all and the other negligible? Why should we be living in the small window where they're both of the same order of magnitude? One argument you can make is the anthropic one: if we were in a later epoch, then there'd maybe be two or three of us here, and the rest of us would be outside of the cosmological horizon. The universe would be a much thinner place.

So that's how physicists would describe the cosmological constant, but how I would describe it is just the inverse of the number of bits that can ever be used in a computation! More precisely:

In Planck units, the cosmological constant is about 10^-121, and so we find that 10¹²² is about the maximum number of bits that could ever be used in a computation in the physical world. (We're going to get later to what exactly we mean by the maximum number of bits that can be involved in a computation.) How do we get to that interpretation of the cosmological constant?

Q: What's the definition of the cosmological constant?
Scott: It's the vacuum energy. Again, this is physics. People don't define things, they observe them. They don't actually know what this vacuum energy is, they just know it's there. It's an energy of empty space, and could have many different possible origins.
Q: An average?
Scott: Well, yes, but it seems to be very close to constant wherever people can measure it and also seems to be very constant over time. No one has found any deviation from the assumption that it's the same everywhere. One way to think of it is that, in a vacuum, there's always these particle/anti-particle pairs forming and annihilating each other. Empty space is an extremely complicated thing! So maybe it's not so surprising that it should have a non-zero energy. Indeed, the hard problem in quantum field theory is not to explain why there's a cosmological constant, but rather to explain why it isn't 10¹²⁰ times larger than it is! A naïve quantum field theory argument gives you a prediction that the entire universe should just blow apart in an instant.
Q: So is this Ω_Λ?
Scott: No, Ω_Λ is the fraction of the total energy density that's comprised of the cosmological constant. So that also depends on the matter density, and unlike Λ itself it can change with time.

To see what any of this has to do with computation, we have to take a detour into the holographic bound. This is one of the few things that seems to be known about quantum gravity, with the string theorists and loop quantum gravity people actually agreeing. Plus it's a bound, which is a language I speak. My treatment will follow a nice survey paper by Bousso. I'm going to make this assigned reading, but only for the physicists. We saw way back in the first lecture that there's this Planck area ℓ_p² = Gℏ/c³. You can get it by combining a bunch of physical constants together until the units cancel such that you get length². Planck himself did that back around 1900. This is clearly very deep, because you're throwing together Newton's constant, Planck's constant and the speed of light and you're getting a length scale which is on the order of 10^-69 m².

The holographic bound says that in any region of spacetime, the amount of entropy that you can put in the region–or up to a small constant, the number of bits you can store in it–is at most the surface area of the region measured in Planck units divided by 4. This is the surprising part: the number of bits you can store doesn't grow with the volume, it grows with the surface area. I can show you a derivation of this (or rather, what the physicists take to be a derivation).

Q: Does the derivation tell you why you divide by 4 and not, say, 3?
Scott: The string theorists believe they have an explanation of that. It's one big success that they like to lord over other quantum gravity approaches! For the loop quantum gravity people, the constant comes out wrong and they have to adjust it by hand by what they call the Immirzi parameter.

The rough intuition is that if you try to build a cube of bits (say, a hard disk) and keep making it bigger and bigger, then it's eventually going to collapse to a black hole. At that point, you can still put more bits in it, but when you do that, the information just sort of gloms onto the event horizon in a way that people don't fully understand. But however it happens, from that point on, the information content is just going to increase like the surface area.

To "derive" this, the first ingredient we need is the so-called Bekenstein bound. Bekenstein was the guy who back in the 70s realized that black holes should have an entropy. Why? If there's no entropy and you drop something into a black hole, it disappears, which would seem to violate the Second Law of Thermodynamics. Furthermore, black holes exhibit all sorts of unidirectional properties: you can drop something in a black hole but you can't get it out, or you can merge two black holes and get a bigger one but then you can't split one black hole into multiple smaller black holes. This unidirectionality is extremely reminiscent of entropy. This is obvious in retrospect; even someone like me can see it in retrospect.

So what is this Bekenstein bound? It says that in Planck units, the entropy S of any given region satisfies:

where k is Boltzmann's constant, E is the energy of the region and R is the radius of the region (again, in Planck units). Why is this true? Basically, this formula combines π, Boltzmann's constant, Planck's constant and the speed of light. It has to be true. (I'm learning to think like a physicist. Kidding!) Seriously, it comes from a thought experiment where you drop some blob of stuff into the black hole and figure out how much the temperature of the black hole must increase (using physics we won't go into), and then use the relation between temperature and entropy to figure out how much the entropy of the black hole must have increased. You then apply the Second Law and say that the blob you dropped in must have had at most the entropy gained by the black hole. For otherwise, the total entropy of the universe would have decreased, contradicting the Second Law.

Q: Doesn't the area go like the square of the radius?
Scott: It does.
Q: Then why should R appear in the Bekenstein bound and not R²?
Scott: We're getting to that!

That's fact one. Fact two is the Schwarzschild bound, which says that the energy of a system can be at most proportional to its radius. In Planck units, E ≤ R/2. This is again because if you were to pack matter/energy more densely than that, it would eventually collapse to a black hole. If you want to build a hard disk where each bit takes a fixed amount of energy to represent, then you can make a one-dimensional Turing tape which could go on indefinitely, but if you tried to make it even two-dimensional, then when it became big enough, it would collapse to a black hole. The radius of a black hole is proportional to its mass (its energy) by this relationship. You could say that a black hole gives you the most bang for your buck in terms of having the most energy in a given radius. So black holes are maximal in at least two senses: they have the most energy per radius and also the most entropy per radius.

Now, if you accept these two facts, then you can put them together:

That is, the entropy of any region is at most the surface area in Planck divided by 4. As for explaining why we divide A by 4, in effect we've reduced the problem to explaining why E ≤R/2. The π goes away since the surface area of a sphere is 4πR².

There actually is a problem with the holographic bound as I've stated it—it clearly fails in some cases. One of them would be a closed spacetime. Let's say that space is closed—if you go far enough in one direction you appear back in another direction—and let's say that this region here can be at most proportional to the surface area. But how do I know that this is the inside? There's a joke where a farmer hires a mathematician to build a fence in as efficient a fashion as possible–that is, to build a fence with the most area inside given some perimeter. So the mathematician builds a tiny circle of fence, steps inside and declares the rest of the Earth to be outside. Maybe the whole rest of the universe is the inside! Clearly, the amount of entropy in the entire rest of the universe could be more than the surface area of this tiny little black hole, or whatever else it is. In general, the problem with the holographic bound is that it is not "relativistically covariant." You could have the same surface area, and in one reference frame, the holographic bound is true, whereas in another it might fail.

Anyway, it appears that Bousso and others have essentially solved these problems. The way they do it is by looking at "null hypersurfaces," which are made up of paths traced by photons (geodesics). These are relativistically invariant. So the idea is that you have some region, and you look at the light rays emanating from the surface of the region. Then, you define the inside of the region to be the direction in which the light rays are converging upon each other. One advantage of doing it this way is that you can switch to another reference frame, but these geodesics are unchanged. On this account, the way you should interpret the holographic bound is as upper bounding the amount of entropy you could see in the region if you could travel from the surface inwards at the speed of light. In other words, the entropy being upper-bounded is the entropy you would see along these null hypersurfaces. Doing it this way seems to solve the problems.

So what does any of this have to do with computation? You might say that if the universe is infinite, then clearly, in principle you could perform an arbitrarily long computation. You just need enough Turing machine tape. What's the problem with that argument?

Q: The tape would collapse to a black hole?
Scott: As I said, you could just have a one-dimensional tape, and that could be extended arbitrarily.
Q: What if the tape starts receding away from you?
Scott: Right! Your bits are right there, then after you turn your back for just a few tens of billions of years, they've receded beyond your cosmological horizon due to the expansion of the universe.

The point is, it's not enough just to have all of these bits available in the universe somewhere. You have to be able to control all of them—you have to be able to set them all—and then you need to be able to access them later while performing a computation. Bousso formalizes this notion with what he calls a "causal diamond," but I'd just call it a computation with an input and an output. The idea is you have some starting point P and some endpoint Q, and then you look at the intersection of the forward light-cone of P and the past light-cone of Q. That's a causal diamond. The idea is that for any experiment we could actually perform—any computation we could actually do—we're going to have to have some starting point of the experiment, and some end point where you collect the data (read the output). What's relevant isn't the total amount of entropy in the universe, but just the total amount of entropy that can be contained in one of these causal diamonds. So now, Bousso has this other paper where he argues that if you're in a de Sitter space—that is, a space with a cosmological constant, like the space we seem to live in—then, the amount of entropy that can be contained in one of these causal diamonds is at most 3π/Λ. That's why, in our universe, there's the bound of around 10¹²² bits. The point is that the universe is expanding at an exponential rate, and so a point that's at the edge of our horizon now will be, after another 15 billion years or so (another age of the universe), a constant factor as far away as it is now.

Q: So where do you place P and Q to get that number?
Scott: You could put them anywhere. You're maximizing over all P and Q. That's really the key point here.
Q: Then where does the maximum occur?
Scott: Well, pick P wherever you like, then pick Q maybe a couple tens of billion years in its causal future. If you don't wrap your computation up after 20 billion years or so, then the data at the other end of your memory is going to recede past your cosmological horizon. You can't actually build a working computer whose radius is more than 20 billion light years or whatever. It's depressing, but true.
Q: Does Λ change with time?
Scott: The prevailing belief is that it doesn't change with time. It might, but there are pretty strong experimental constraints on how much. Now the proportion Ω_Λ of the energy density taken up by Λ, that is changing. As the universe gets more and more dilute, the proportion of the energy taken up by Λ gets bigger and bigger, even though Λ itself stays the same.
Q: But the radius of the universe is changing.
Scott: Yes. In our current epoch, we get to see a larger and larger amount of the past as light reaches us from further and further away. But once Λ starts winning out over matter, the radius of the observable universe will reach a steady state of 10 billion light years or whatever it is.
Q: Why is it 10 billion light years?
Scott: Because that's the distance such that something that far away from you will appear to be receding away from you at the speed of light, if there's no countervailing influence of gravity.
Q: So it's just a coincidence that that distance happens to be about the current size of the observable universe?
Scott: Either a coincidence or something deeper that we don't fully understand yet!

This is fine, but I promised you that I'd talk about computational complexity. Well, if the holographic bound combined with the cosmological constant put a finite upper bound on the number of bits in any possible computation, then you might argue that we can only solve problems that are solvable in constant time! And you might feel that in some sense, this trivializes all of complexity theory. Fortunately, there's an elegant way out of that: we say that now we're interested in asymptotics not just in n (the size of the input), but in 1/Λ. Forget for now that Λ has a known (tiny) value, and think of it as a varying parameter—then complexity theory comes back! Taking that point of view, let me make the following claim: suppose the universe is (1+1)-dimensional (that is, one space and one time dimension) and has cosmological constant Λ. Then the class of problems that we can solve is contained in DSPACE(1/Λ): the class of problems solvable by a deterministic Turing machine using ~1/Λ tape squares. In fact it's equal to DSPACE(1/Λ), depending on what assumptions you want to make about the physics. Certainly it at least contains DSPACE(1/√Λ).

First of all, why can't we do more than DSPACE(1/Λ)?

Well, to be more formal, let me define a model of computation that I'll call the Cosmological Constant Turing Machine. In this model, you've got an infinite Turing machine tape, but now at every time step, between every two squares, there's an independent probability Λ of a new square forming with a '*' symbol in it. As a first pass, this seems like a reasonable model for how Λ would affect computation. Now, if your tape head is at some square, the squares at a distance 1/Λ will appear to be receding away from the tape head at a rate of one square per time step on average. So, you can't hope to ever journey to those squares. Every time you step towards them, a new square will probably be born in the intervening space. (You can think of the speed of light in this model as one tape square per time step.) So, the class of problems you can solve will be contained in DSPACE(1/Λ), since you can always just record the contents of the squares that are within 1/Λ of the current position of the tape head, and ignore the other squares.

But can we actually achieve DSPACE(1/Λ)? You might imagine a very simple algorithm for doing so. Namely, just think of your 1/Λ bits as a herd of cattle that keep wandering away from each other. You have to keep lassoing them together like a cosmological cowboy. In other words, your tape head will just keep going back and forth, compressing the bits together as they try to spread out while simultaneously performing the computation on them. Now, the question is, can you actually lasso the bits together in time O(1/Λ)? I haven't written out a proof of this, but I don't think it's possible in less than ~1/Λ² time with a standard Turing machine head (one without, e.g., the ability to delete tape squares). On the other hand, certainly you can lasso ~1/√Λ bits in O(1/Λ) time. You can therefore compute DSPACE(1/√Λ). I conjecture that this is tight.

A second interesting point is that in two or more dimensions, you don't get the same picture. In two dimensions, the radius still doubles on a timescale of about 1/Λ, but even to visit all the bits that need to be lassoed now takes on the order of 1/Λ² time. And so we can ask if there is something you can do on a 2-D square grid in time 1/Λ which you couldn't do in time 1/Λ on a 1-D tape. You've got this 1/Λ² space here, and intuitively you'd think that you can't make use of more than 1/Λ of the tape squares in 1/Λ time, but it's not clear if that's actually true. Of course, for added fun, you can also ask all of these questions for quantum Turing machines.

The other thing you can ask about is query complexity in this model. For example, what if you lost your keys and they could be anywhere in the universe? If your keys are somewhere within your cosmological horizon, and your space has one dimension, then in principle you can find them. You can traverse the entire space within your horizon in time O(1/Λ). But in two dimensions, the number of locations you can check before most of the observable universe has receded is only like the square root of the number of possible locations. You can pick some faraway place to go, take a journey there and by the time you come back, the region has doubled in size.

In the quantum case, there's actually a way out: use Grover's algorithm! Recall that Grover's algorithm lets us search a database of N items in only √N steps. So it would seem that this would let us search a 2-D database of size on the order of the observable universe. But there's a problem. Think about how Grover's algorithm actually works. You've got these query steps interleaved with the amplitude amplification steps. In order to amplify amplitudes, you've got to collect all the amplitudes in one place, so that you can perform the Grover reflection operation. If we think about some quantum robot searching a 2-D database having dimension √N × √N, then you only need to do √N iterations of Grover's algorithm, since there's only N items in the database, but each iteration takes √N time, since the robot has to gather the results of all the queries. That's a problem, because we don't seem to get any benefit over the classical case. Thus, the proposed solution for searching a database the size of the universe doesn't seem to work. It does seem to give us some advantage in three dimensions. If you think of a 3-D hard disk, here the side length is N^1/3, so we would need √N Grover iterations taking N^1/3 time each, giving a total time of N^5/6. At least that's somewhat better than N. As we add more dimensions, the performance would get closer to √N. For example, if space had 10 large dimensions, then we'd get a performance of N^12/22.

In a paper I wrote with Andris Ambainis some years ago, what we did is we showed that you can use a recursive variant of Grover's algorithm to search a 2-D grid using time of order √N log^3/2 N. For three or more dimensions, the time order is simply √N. I can give some very basic intuition as to how our algorithm works. What you do is use a divide-and-conquer strategy: that is, you divide your grid into a bunch of smaller grids. Then you can keep dividing the subgrid into smaller subgrids, and appoint regional Grover's algorithm commanders for each subgrid.

Even, as a first step, let's say that you search each row separately. Each row only takes √N time to search, and then you could come back and collect everything together. You can then do a Grover search of the √N rows, taking N^1/4 time, giving a total time of N^3/4.

That's the first way of solving the problem. Later, some other people discovered another way of doing it using quantum random walks, but the bottom line is that given a 2-D database the size of the universe, you actually can search it for a marked item before it recedes past the cosmological horizon. You can only do one search, or at best a constant number of searches, but at least you can find one thing you're really desperate for.

Lecture 21: Ask Me Anything

Today, we're going to follow the great tradition pioneered by Feynman, in which the last class should be one where you can ask the teacher anything. Feynman's rule was that you could ask about anything except politics, religion, or the final exam. Here, we have no final exam, you can ask about politics, and I've been telling you my religion all semester. So anything goes! Ask about something from the course you'd like to know more about, or anything else. Who would like to start?

Q: Do you often think about using computer science to limit or give us a hint about physical theories? Do you think that we'll be able to discover physical theories which give more powerful models than quantum computation?

Scott: Is BQP the end of the road, or is there more to be found? That's a fantastic question, and I wish more people would think about it. I'm being a bit of a politician here and not answering directly, because obviously the answer is "I don't know." I guess the whole idea with science is that if we don't know the answer, we don't try to sprout one out of our butt or something. We try to base our answers on something. So, everything we know is consistent with the idea that quantum computing is the end of the road. Greg Kuperberg had an analogy I really liked. He said that there are people who keep saying that we've gone from classical to quantum mechanics so what other surprises are in store? But maybe that's like first assuming the Earth is flat, and then on discovering that it's round, saying who knows, maybe it has the topology of a Klein bottle. There's a surprise in a given direction, but once you've assimilated it, there may not be any further surprise in that same direction.

The Earth is still as round as it was for Eratosthenes. We talked before in this class about the strange property of quantum mechanics that it seems like a very brittle theory. Even general relativity, you could imagine putting in torsion or other ways of playing around with it. But quantum mechanics is very hard to fool around with without making it inconsistent. Of course, that doesn't prove that there's nothing beyond it. To people in the 1700s, it probably looked like you couldn't twiddle around much with Euclidean geometry without making it inconsistent. But on the other hand, the mere fact that something is conceivable doesn't imply that we ought to spend time on it. So, are there actual ideas about what could be beyond quantum mechanics?

Well, there are these quantum gravity proposals where it looks like you don't even have unitarity–people can't even get the probabilities to sum to 1. The positive spin on that would be "woohoo! We found something beyond quantum mechanics!" The negative spin would be that these theories (as they currently stand) are just nonsense, and when quantum gravity people finally figure out what they're doing, they'll have recovered unitarity. And then there are phenomena that seem to change our understanding of quantum mechanics a little bit. One of these is the black hole information loss problem.

So here's you falling into a black hole. The basic problem is that all the information about you–if you fall into the black hole–is supposed to come out later as Hawking radiation. If the physics outside the event horizon is unitary, that information would have to come out. We don't know exactly how the information comes out, though. If you do a semi-classical calculation, it seems like only completely thermal noise coming out. However, most physicists (even Hawking) now believe that if we really understood what was going on, then we'd see that the information comes out.

The trouble is once you're in the black hole, you're not even near the event horizon. You're headed straight for the singularity. On the other hand, if the black hole is going to be leaking out information, then it seems like the information should somehow be on the event horizon or very close to it. This is especially so since we know that the amount of information in the black hole is proportional to the surface area. But from your perspective, you're just somewhere in the interior. So it seems like the information has to be in two places at once.

Anyway, one proposal that people like Gerard 't Hooft and Lenny Susskind have come up with is that, yes, the information does get "duplicated." On its face, that would seem to violate unitarity, and specifically the No-Cloning Theorem. But on the other hand, how would you ever see both copies of the information? If you're inside the black hole, then you're never going to see the outside copy. You can imagine that if you're really desperate to find out if the No-Cloning Theorem is violated–so desperate you'd sacrifice your life to find out–you could first measure the outside copy, then jump in to the black hole to look for the inside copy. But here's the funny thing: people actually calculated what would happen if you tried to do this, and they found that you'd have to wait a very long time for the information to come out as Hawking radiation, and by the time one copy comes out via Hawking radiation, the other copy is already at the singularity. It's like there's some kind of censorship that acts to keep you from seeing both copies at once. So from any one observer's perspective, it's as if unitarity is maintained. So it's funny that there are these little things that seem like they might cause a conflict with quantum mechanics or lead to a more powerful model of computation, but when you really examine them, it no longer seems like they do.

So that's a physics way to get at the question, but there's also a computer-science answer. We can ask, what kinds of complexity classes can we define or imagine above BQP? Well, firstly, if we had a model that could solve NP-complete problems in polynomial time, then from my perspective, that's where we enter the realm of dreams and fantasy.

I mean, at that point, maybe we should just quit doing math and science! If you want to prove the Riemann Hypothesis, just turn the crank, and there's your proof–no thinking involved. Of course, whether NP-complete problems can be efficiently solved is an empirical question–if you can do it, you can do it.

But suppose we want to believe that there's something more powerful than quantum computing, but that still can't solve NP-complete problems in polynomial time. Then, how much "room" is there for such a model? We do have some problems that seem to be easier than NP-complete, but that are still to hard to efficiently solve with a quantum computer. Two examples are graph isomorphism and approximate shortest vector. Very "close" to NP-complete, but probably not quite there, seem to be the problems of inverting one-way functions and distinguishing random from pseudorandom functions.

Years ago I came up with one example of a computational model (discussed earlier in the course), where you get to see the entire history of a hidden variable during the course of a quantum computation. I gave evidence that in this model, you do get more than with ordinary quantum computing–for example, you get graph isomorphism and approximate shortest vector–but still you don't get the NP-complete problems. On the other hand, my model was admittedly rather artificial. So maybe there is one more dramatic step before you get to NP-complete–I'm not sure.

Q: How can you say "one step?" You can theoretically always contrive a problem between any other two problems.

Scott: Of course, but here's the point: no one was interested in quantum computing when Bernstein and Vazirani discovered you could solve the Recursive Fourier Sampling problem. People only became interested when it was found that you solve problems that were previously considered to be important, like factoring. So if we judge our hypothetical new model by the same standard, and ask what problems it can solve that we already think are important, there are arguably not that many of them between factoring and NP-complete. So again, there could be some new model that gets you slightly beyond BQP–maybe it lets you solve Graph Isomorphism, or the Hidden Subgroup Problem for a few more non-Abelian groups–but at least in our current picture, there's only a limited amount of "room" between BQP and the NP-complete problems.

Q: BQP hasn't been shown to be in NP, right?

Scott: Right. That's also an excellent question. In fact, we don't even know if BQP is in the polynomial-time hierarchy; this has been a central open question in quantum computing theory for more than a decade. Of course, we can't hope to prove unconditionally that BQP is not in PH (since that would imply P≠PSPACE among other things). But we could at least hope to construct an oracle relative to which BQP isn't in the polynomial hierarchy. That's the least we could hope for, but years have gone by and we still can't even do that.

A number of us have worked on this, and it seems extremely hard. In fact, we can't even prove there's an oracle relative to which BQP is not in AM (Arthur-Merlin). AM is just like a probabilistic version of NP. So AM is where the progress stopped. We've got an oracle relative to which BQP is not in MA (Merlin-Arthur), but no such oracle for AM. I strongly believe that there exists such an oracle, but current techniques just don't seem strong enough to prove it. Indeed, I believe that there's an oracle relative to which BQP isn't in the polynomial hierarchy.

Q: But what about in the non-oracle world?

Scott: That's a whole other can of worms. If we ask if BQP is in the polynomial hierarchy in the real world, then I'm no longer willing to say. It would not surprise me enormously if, in the real world, BQP were in AM. We don't have any concrete example of a non-promise decision problem that we have good reason to believe is in BQP and not in AM. I mean, we've got these things like Recursive Fourier Sampling, but not only are they oracle problem, it's not clear how you'd ever get the oracle out of the picture.

Note that we believe that AM = NP, because of derandomization. So if BQP were in AM, then BQP would be in NP ∩ coNP. That would say that whenever a quantum computer produces an output, there'd be a short classical proof that it produced that output. That would really be astounding, but I don't think we can exclude it based on any of the evidence we have.

Q: Where would you ever get an oracle?

Scott: You just define it. Let A be an oracle...

Q: That's a bit of an issue.

Scott: It is, it is. It's strange to me that only computer scientists get this kind of flak for using the techniques that they have to answer questions. Like physicists say that they're going to do some calculation in the perturbative regime. "Oh! Of course, what else would you do? These are deep and difficult problems." Of course, you're going to do what works. Computer scientists say that we can't yet prove that P≠NP, but we'll study it in the relativized world. "That's cheating!" It just seems obvious that you just start with the kind of results that you can prove and work from there. One objection that could be made against oracle results in the past would be that some of them were just trivial. Some of them essentially just amounted to restatements of the question. But these days, we've got some very nontrivial oracle separations. I mean, I can tell you in very concrete terms what an oracle result's good for. About every month or so, I see another paper on the arXiv solving NP-complete problems in polynomial time on a quantum computer. This must be the easiest problem in the world. Often these papers are very long and complicated. But if you know about oracle results, you don't have to read the papers. That's a very useful application. You can say if this proof works, then it also works relative to oracles, but that can't be the case, because we know of an oracle where it's false. Of course, that probably won't convince the author, but it will at least convince you.

As another example, I gave this oracle relative to which SZK (Statistical Zero-Knowledge) is not in BQP. In other words, finding collisions is hard for a quantum computer. Sure enough, as the years go by, I see these papers that talk about how to find collisions with a constant number of queries on a quantum computer, and without reading the paper, I can say no, this has to fail, because it's not doing anything non-relativizing. So, oracles are there to tell you what approaches not to try. They direct you towards the non-relativizing techniques that we know we're eventually going to need.

Q: What complexity class are you?

Scott: I'm not even all of P. I'm not even LOGSPACE! Especially if I haven't had much sleep.

Q: What's the complexity class for creativity?

Scott: That's an excellent question. I was thinking about it just this morning. Someone asked me if humans have an oracle in their head for NP. Well, maybe Gauss or Wiles did. But for most of us, finding proofs is a very hit-or-miss business. You can change your perspective and it seems pathetic that after three billion years of natural selection and after this time building up civilizations, all the wars and everything else, we can solve a few instances of SAT–but if you switch to the Riemann Hypothesis or Goldbach's Conjecture instances, suddenly we can't solve those.

When it comes to proving theorems, you're dealing with a very special case of an NP-complete problem. You aren't just taking some arbitrary formula of size polynomial in n, you're taking some fixed question of fixed size and asking, does this have a proof of size n?. So you're uniformly generating these instances for whatever length proof you're looking for. But even for this sort of problem, the evidence is not good that we have some sort of general algorithm for solving them. A few people decided to forsake their social lives and spend their whole lives in this monastic existence, thinking about math problems. Finally, they've managed to succeed on a few problems and sometimes even win Fields Medals for that. But there's still this huge universe of problems that everyone knows about and no one can solve. So I would say that before reaching for Penrose-style speculations about human mathematical creativity transcending computation, we should first make sure sure the data actually supports the hypothesis that humans are good at finding proofs. I'm not convinced that it does.

Now, it's clear that in certain cases, we are very good at finding patterns or taking a problem that looks to be hard and decomposing it into easier subproblems. In many cases, we're much much better at that than any computer. We can ask "why is that?" That's a very big question, but I think part of the answer is we've got a billion year head-start. We've got the advantage of a billion years of natural selection giving us a very good toolbox of heuristics for solving certain kinds of search problems. Not all of them and not all the time, but in some cases, we can do really well. Like I said, I believe that NP-complete problems are not efficiently solvable in the physical universe, so I believe that there can never be a machine that can just prove any theorem efficiently, but there could certainly be machines that would take advantage of the same kind of creative insight that human mathematicians have. They don't have to beat God, they just have to beat Andrew Wiles. That could be an easier problem, but it takes us outside of the scope of complexity theory and into AI.

Q: So even if there's no way to solve NP-complete problems in polynomial time, human mathematicians could still be rendered obsolete?

Scott: Sure. And after the computers take over from us, maybe they'll worry that they'll be out of a job once some NP oracle comes along.

Q: When we discussed P_CTC (the class of problems efficiently solvable using a closed timelike curve), shouldn't that have been BPP_CTC, since P doesn't have access to any randomness, whereas with closed timelike curves you have to have a distribution?

Scott: That's a tricky question–even with a fixed-point distribution, we can still require the CTC computer to produce a deterministic output (so that in essence, randomness is only used to avoid the Grandfather Paradox and not for any other purpose). On the other hand, if you relax that requirement and let the answer have some probability of error, it turns out that you get the same complexity class. That is, one can show that P_CTC = BPP_CTC = PSPACE.

Q: Bell inequalities seem to be an important tool in studying the limitations of quantum mechanics. We know what happens if we have completely non-local boxes, but what happens (say, to computational complexity) if we allow correlations just above what, say, quantum entanglement gives?

Scott: That's a good question, and there are people at Perimeter and elsewhere who have been thinking about it. The fundamental problem is that you can imagine Tsirelson's bound is violated–that is, you can imagine that there are these non-local correlations stronger than anything allowed by quantum mechanics–but saying that still doesn't give us a model of computation. I mean, what are the allowed operations? What is the model you're assuming that lets you break Tsirelson's bound?

Admittedly, you can just assume there are these non-local boxes and then determine the consequences of that for communication complexity. For example, Brassard et al. (building on an earlier result of Wim van Dam) showed that if you have a good enough non-local box (if the error is small enough), then it makes communication complexity trivial (i.e., all communication problems can be solved with just a single bit).

Q: I didn't know about the error part. I just heard that result for a perfect non-local box.

Scott: It was extended to if the box has 7% error or something. So there's still some gap. Quantum mechanics lets you win the so-called CHSH game 85% of the time. If you had a magical box that let you win the CHSH game 93% of the time, then communication complexity would become trivial. There's still that gap between 85% and 93%, which it would be nice to close. As I said, if you want to ask about computational complexity, then it becomes more complicated. A non-local box does not a model of computation make.

Q: I'm thinking of what a quantum circuit would look like if you allowed some non-locality. You just define some unitary version of the non-local box and query it in superposition whenever you want to, and that's the only way to join two separated parties.

Scott: You mean you don't have multi-qubit gates anymore?

Q: You do, but any multi-qubit gates cannot span parties anymore.

Scott: OK, but usually when we define a model of computation, we don't even have a concept of parties. There's one party.

Q: Not necessarily. Think of multi-party interactive proofs.

Scott: All right, well now we have a different question. You've got a multi-prover interactive proof, and you've got a non-local box between the provers. What's the power of that? That's a great question, which I'd be happy to talk about offline. Getting back to the original question, of course I'd be extremely interested if you could come up with a model of computation that looked like BQP, but incorporated non-local boxes in an interesting way. I haven't seen that yet.

Q: I think you could have just one very big, complicated non-local box that just gives the right answer.

Scott: That reminds me of something: we've got this class BQP/qpoly (BQP with polynomial-sized advice states that only depend on the length of the input), and several people have asked me what happens if you extend that so that you have a polynomial-sized advice unitary operation. Well, how powerful is that? Whatever Boolean function f you want to compute, just design a a unitary operation that computes that function and you're done! So that class can solve any problem whatsoever.

Q: Do you see there being a bit more clearing up of the complexity classes? We just keep getting more and more.

Scott: To me, that's like asking a chemist if she sees a clearing up of the Periodic Table. Is Nitrogen going to collapse with Helium? In our case, it's a little bit better than for the chemist, since we can expect a collapse of some classes. For example, we hope and expect that P, RP, ZPP and BPP are going to collapse. We hope and expect that NP, AM and MA are going to collapse. IP and PSPACE already collapsed. So yeah, there are collapses, but we also know that there are other pairs of classes that can't collapse. We know, for example, that P is different from EXP, which immediately tells you that either P has to be different from PSPACE or PSPACE has to be different from EXP, or both. So not everything can collapse. That shouldn't really be surprising.

Now, maybe complexity theory took a wrong turn when it gave everything this string of random-looking capital letters as its name–I appreciate how they can look to people like codenames or inside jokes. But really, we're just talking about different notions of computation. Time, space, randomness, quantumness, having a prover around. There are as many complexity classes as there are different notions of computation. So, the richness of the complexity zoo just seems like an inevitable reflection of the richness of the computational world.

Q: Do you think that BPP will collapse with P?

Scott: Oh, yeah. Absolutely. We have not just one but several reasonable-looking circuit lower bound conjectures where we know that if they're true, then P = BPP. I mean, there were people who realized even in the 1980s that P should equal BPP. Even then, Yao pointed out that if you had good enough cryptographic pseudorandom number generators, then you could use them to derandomize any probabilistic algorithm, hence P = BPP. Now, what people have managed to do in the last ten years is to get the same conclusion with weaker and weaker assumptions.

Besides that, there's also an "empirical" case, in that two of the most spectacular results in complexity theory in the last decade were the AKS primality test showing that primality testing is in P, and Reingold's result that searching an undirected graph is in deterministic logspace. So, this program of taking specific randomized algorithm and derandomizing them has had considerable success. It sort of increases one's confidence that if we were smart enough or knew enough, then this would probably work for other BPP problems as well. You can also look at a specific case, like derandomizing polynomial identity testing, and maybe this is a good example to illustrate the point.

The question is, if you've got some polynomial like x²-y²-(x+y)(x-y), is it identically zero? In this case, the answer is yes. But you could have some very complicated polynomial identity involving variables raised to very high powers, and then it's not obvious how you would check it efficiently even with a computer. If you tried to expand everything out, you'd get an exponential number of terms.

Now, we do know of a fast randomized algorithm for this problem: namely, just plug in some random values (over some random finite field) and see whether the identity holds or not. The question is whether this algorithm can be derandomized. That is, is there an efficient deterministic algorithm to check whether a polynomial is identically zero? If you bang your head against this problem, you quickly get into some very deep questions in algebraic geometry. For example, can you come up with some small list of numbers, such that given any polynomial p(x) described by a small arithmetic formula, all you have to do is plug in the numbers in that list, and if p(x)=0 for every x in the list, then it's zero everywhere? That seems like it should be true, because all you should have to do is pick some "generic" set of numbers to test which is much larger than the size of the formula for p. For example, if you find that p(1)=0,p(2)=0,...,p(k)=0, then either p must be zero, or else it must be evenly divisible by the polynomial (x-1)...(x-k). But is there any nonzero multiple of (x-1)...(x-k), that can be represented by an arithmetic formula of size much smaller than k? That's really the crucial question. If you can prove that no such polynomial exists, then you'll give a way to derandomize polynomial identity testing (a major step towards proving P=BPP).

Q: What do you think the chances are that three Indian mathematicians will come up with an elementary proof?

Scott: I think it's gonna take at least four Indian mathematicians! We know today that if you prove good enough circuit lower bounds, then you can prove P=BPP. But Impagliazzo and Kabanets also proved a result in the other direction: if you want to derandomize, you're going to have to prove circuit lower bounds. To me that gives some explanation as to why people haven't succeeded yet in proving that P=BPP. It's all because we don't know how to prove circuit lower bounds. The two problems are almost–though not quite–the same.

Q: Does P = BPP imply that NP = MA?

Scott: Almost. If you derandomize PromiseBPP, then you derandomize MA. No one has any idea of how to derandomize BPP that wouldn't also derandomize PromiseBPP.

Any other questions? Maybe a less technical one?

Q: How would you answer an intelligent design advocate? Without getting shot?

Scott: You know, I'm genuinely not sure. It'sone of those cases where might be anthropic selection going on. If someone could be persuaded by evidence on this question, then wouldn't he or she already have been? I think we have to concede that there are people for whom the most important thing about a belief isn't whether it's true, but rather some other properties of the belief, such as its role in a community. So they're playing a different game where beliefs are judged by a different standard. It's like you're a basketball player on a football field. How are you supposed to win?

Q: Is complexity theory relevant to the evolution versus intelligent design controversy?

Scott: To the extent that you need complexity theory, it's all sort of trivial complexity theory. For example, just because we believe that NP is exponentially hard doesn't mean that we believe that every particular instance (say, evolving a working brain or a retina) has to be hard.

Q: When Steven Weinberg came to talk at the Perimeter Institute, the question was asked "where does God fit into all of this"? His answer was to just dismiss religion as an artifact of our evolution that now has no value, and that we'd eventually grow out of it.

Scott: Was that a question?

Q: Do you agree with him?

Scott: So I think that there are several questions here.

Q: You're being a politician.

Scott: Look, this is a hot topic right now, and I've read books like Richard Dawkins's.

Q: Was it a good book?

Scott: Yes. Dawkins is always amusing, and he's at his absolute best when he's ripping into bad arguments like a Rottweiler. Anyway, one way to think about it is that the world would clearly be a better place if there were no wars, or for that matter if there were no lawyers and no one sued anyone else. And there are those who want to turn that idea into an actual political program. I'm not talking about people who oppose specific wars like the one in Iraq for specific reasons, but absolute pacifists. And the obvious problem with their position is a game-theoretic one. Yes, the world would be a better place with no armies, but the other guys have an army.

It's clear that religion fills some sort of role for people; otherwise it wouldn't have been so ubiquitous for thousands of years or resisted very significant efforts to stamp it out. For example, maybe people who believe God is on their side are braver in battle. Or maybe religion is one of the factors (besides more obvious factors) that induces men and women to get married and have lots of babies, and is therefore adaptive just from a Darwinian point of view. Years ago I was struck by an irony: in contemporary America, you've got these stereotypical coastal elites who believe in Darwinism and often live by themselves well into their thirties or forties, and then you've got these stereotypical heartland folks who reject Darwinism but marry young and have 7 kids, 49 grandkids and 243 great-grandkids. So then it's not really a contest between Darwinists and anti-Darwinists; it's just a contest between the Darwinian theorists and the Darwinian practitioners!

If this speculation is at least partly correct–if religion survives because it helps induce people to win wars, have more babies, etc.–then the question arises, how are you ever going to counter it unless you've got a competing religion?

Q: Yeah, I'm sure that's what people are thinking about when they decide whether or not to believe in a religion.

Scott: I'm not saying it's conscious, or that people are thinking it through in these terms. Surely a few are, but the point is they don't have to in order for it to describe their behavior.

Q: We can have lots of kids without accepting a religion if we want to.

Scott: Sure, we can, but do we on average? I don't know the numbers offhand, but I believe there are studies backing up the claim that religious people have more children on average.

Now, there's another key factor, which is that sometimes irrationality can be supremely rational, because it's the only way of proving to someone else that you're committed to something. Like if someone shows up at your doorstep and asks for $100, you're much more likely to give it to him if his eyes are bloodshot and he looks really irrational–you don't know what he's going to do! The only way that this is actually effective is if the show of irrationality is convincing. The person can't just feign, or you'll see through it. He has to be really, really irrational and show that he's ready to get revenge on you no matter what. If you believe that the person's going to defend his honor to the death, you're probably not going to mess with him.

So the theory is that religion is a way of committing yourself. Someone might say that he believes in a certain moral code, but others might figure talk is cheap and not trust him. On the other hand, if he has a really long beard and prays every day and really seems to believe that he'll have an eternity in hellfire if he breaks the code, then he's making this very expensive commitment to his belief. It becomes much more plausible that he means it. So on this theory, religion functions as a way of publicly advertising a commitment to a certain set of rules. Of course, the rules might be good or they might be terrible. Nevertheless, this sort of public commitment to obeying a set of rules, backed up with supernatural rewards and punishments, seems like an important element of how societies organized themselves for thousands of years. It's why rulers trusted their subjects not to rebel, men trusted their wives to stay faithful, wives trusted their husbands not to abandon them, etc. etc.

So, I feel like these are the sorts of game-theoretic forces that Dawkins and Hitchens and the other anti-religion crusaders are up against, and that they maybe don't sufficiently acknowledge in their writing. What makes it easier for them, of course, is that their opponents can't just come out and say, "yes, of course it's all a load of hooey, but here are the important social functions it serves!" Instead, religious apologists often resort to arguments that are easily demolished (at least since the days of Hume and Darwin)–since their real case, though considerably stronger, is one that's so hard for them to make openly!

In summary, maybe it's true that humans (if we survive long enough) will eventually outgrow religion, now that we have something better to fill religion's explanatory role. But before that will happen, I think that at the least we'll need to better understand the social functions that religion played for most of history and still plays in most of the world, and maybe come up with alternative social mechanisms to solve the same sorts of problems.

Q: I was just thinking of if there's another case where irrationality might be preferred over rationality.

Scott: Where to begin?

Q: Especially if you have incomplete information. Like if you have a politician who's committed and won't change his ideals later on, you can feel more assured that he'll do what he said he would.

Scott: Because he has conviction. He believes in what he says. To most voters, that matters much more than the actual content of the beliefs. Bush gives a very good impression of belief.

Q: I'm not sure that's the best for the interests of the United States.

Scott: Right, that's the question! How do you defeat people who have mastered the mechanisms of rational irrationality? By saying "no, look here, you've got your facts wrong"? Which game are you going to play? Or take another example: a singles bar. The ones who succeed are the ones best able to convince themselves (at least temporarily) of certain falsehoods: "I'm the hottest guy/girl here." This is a very clear case where irrationality seems to be rational in some sense.

Q: The standard example is if you're playing Chicken with someone, it's advantageous to you if you break your steering wheel so it can't turn.

Scott: Exactly.

Q: Why is computer science not a branch of physics departments?

Scott: The answer to that isn't philosophical, it's historical. Computer scientists back in the day were either mathematicians or electrical engineers. People who would have been computer scientists when there wasn't such a department went into either math or EE. Physics had its plate full with other things, and to get into physics you had to learn this enormous amount of other stuff which maybe wasn't directly relevant if you just wanted to hack around and write programs, or if you wanted to think theoretically about computation. Paul Graham has said that computer science is not so much a unified discipline as a collection of people thrown together by accident of history, like Yugoslavia. You've got the "mathematicians," the "hackers," and the "experimentalists," and we just throw them all together in the same department and hope they sometimes talk to each other. But I do think (and this is a cliched thing to say) that the boundaries between CS, math, physics, and so on are going to look less and less relevant, more and more like a formality. It's clear that there's a terrain, but it's not clear where to draw the boundaries.