Empiricism, Rationalism, Super Empiricism: Recent Changes In the Concept of Mind

Matthew McKeever
Nov 4, 2018

A question of perennial interest to philosophers is to what extent our mental life depends on experience: on what is given by sense perception. The British empiricists like John Locke thought that the mind was a tabula rasa, a slate blank and ready to receive data from the world and on that basis form ideas which make up the contents of thought.

Others doubted this. For rationalists, there were ideas which couldn’t have come from experience, and so, since we had them, we must have been born with them: innate ideas. A couple of millennia before Locke, Plato wondered how we could acquire mathematical and geometrical ideas. We have the idea of circles but nature provides no perfect circles, we might think, or again we know that 2+2 is always or must be 4 but nature provides no experience of alwaysness or mustness.

The aims of this post are two: to trace the empiricist/rationalist debate in the twentieth century, and in particular to explain why rationalism came, thanks to the development of the field of generative linguistics, to be favoured as against the empiricist behaviourism that preceded it. To do so I will set out Chomsky’s famous response to the behaviourists as well as some of the basics of his early linguistic theory.

The second aim of the post is a bit weirder, and is maybe best classed as philosophical-fiction (along the lines of science-fiction). I will suggest that, with the advent of machine learning and big data, a third contender view about the nature of mind arises, or, otherwise put, that those recent developments open the conceptual space for recognising a new type of mind: a super-empirical mind. This is the sort of mind we can attribute to machine learning algorithms which harness big data sets to produce scarily accurate predictions but which, at least at the time of writing, have not yet achieved the holy grail of AI research, namely domain-general intelligence. AI exists that can predict pretty well if we’re going to get sick soon, but we can’t talk back to such AIs and ask them what, you know, we should do about that.

I’m not saying I buy super empiricism as a true description of the world. That depends at least in part on the massive question as to whether computers have minds, which I’m not going to touch. Maybe super empiricism is just a fiction. But just as we can evaluate science-fiction scenarios as cool or weird or hope-inspiring or depressing, so I will end by evaluating my philosophical-fiction idea of super empiricism, and I’ll suggest that super-empirical minds, minds deeply and spookily knowledgeable about one particular thing but of no help otherwise, should be objects of the sort of fear and reverence that those in tragedy held towards oracles and other supernatural creatures, like riddling sphinxes or the witches in Macbeth: things that know, and will tell you what they know, but won’t help you in any other way. Super empirical minds are terrifying, is my conclusion, basically.

This post is long and there’s a lot in it. Some sections are optional and I’ve marked them as such.

Part 1: From Behaviourism To Generative Grammar

Story 1

Let’s start with a story. It’s World War Two and an American soldier is in London. A weird pattern emerges: his one-night stands are all located in places that the Germans subsequently bomb. His penis seems to be able to predict war. How? Well, it’s unclear, but an explanation offered is that it’s an instance of classical conditioning, a central tenet of the behaviourist psychology that would have been big at the time.

You’ve probably heard of Pavlov’s dogs: dogs presented with food respond by salivating, and if you associate the presentation of food with a stimulus, say the ringing of a bell, you can bring about the response — the salivation — in the absence of the food. You can train animals to respond to stimuli, and later, wildly unethically, the psychologist J.B. Watson showed that you could do the same to children by inducing a phobia of toys in a young child by making a loud noise when he — the child — was presented with one.

The soldier, Tyrone, was experimented on in a similarly unethical manner as a child, and it’s suspected that he was trained to respond sexually somehow to signs of war. We don’t know what those stimuli were but they are — so this thought goes — there, and the war effort will be much helped if the allies can get to the heart of this conditioned response.

You’ll be glad and perhaps not too surprised to hear that story is not true: it’s a central plot arc of Thomas Pynchon’s 1973 novel Gravity’s Rainbow, a novel peppered with references to behaviourism. But it’s illustrative of some of the thinking in vogue around the second world war, and in particular the behaviourism of people like Watson: this was the thought that a scientifically respectable psychology could make do with concepts like stimulus and response to explain mindedness and behaviour. We don’t need, to explain phobias, for example, to postulate a complex mechanism of psychological forces as Freud would have done. The empirically unimpeachable concepts of stimulus and response suffice. Let me say a bit more about that.

The war-torn London Tyrone Slothrop inhabited (or would have inhabited if he weren’t fictional).

Behaviourism

There are a couple of key points. First, behaviourism is a paradigm empiricist theory, concentrating only on the given of experience, eschewing not only more speculative notions like the Freudian unconscious but even the notion of consciousness itself, in the hope that this could put psychology on a scientifically firm footing. Second, the response to a behaviourist theory of language use forms a sort of pivot in twentieth-century thinking about the mind, as it turned away from empiricism towards rationalism. To see this, let’s look at the behaviourist theory of language of B.F. Skinner and Noam Chomsky’s takedown of it in his famous 1959 article ‘A Review of B.F. Skinner’s Verbal Behavior’.

I don’t want to get into all the details, but Skinner’s thought is that we can use the concepts of response and stimulus to make sense of language use. In certain very simple cases this doesn’t seem ridiculously implausible. Thus the presence of a fire, which plays the role of a stimulus, would indeed pretty often elicit the response ‘fire!’ Unlike the phobia case, though, the response is voluntary rather than an automatic reflex, which makes it a case of what’s called operant conditioning. Maybe all of language use is based on operant conditioning?

Well, but the devil is in the details, Chomsky says. Try to make sense of the notions of stimulus and response outside of a simplistic laboratory setting and things get much more difficult. Here’s a nice easy case: take a given name for a person. One might think the behaviourist theory for it would be like the ‘fire’ case above: we use the name in the presence of the stimulus, namely the person themselves.

But that’s no good. Chomsky points out that he has used the words ‘Eisenhower’ and ‘Moscow’ despite having never been stimulated by either of them. Moreover, presumably I’m constantly presented with a me-stimulus. One might think the behaviourist would thus predict that I constantly or at least frequently say ‘Matthew’. But I assure you, I do not (see paragraph 3 of the review for these points).

These arguments might seem a bit quick and rhetorical; and maybe they are. But it’s up to the behaviourist to try to respond to them, and thinking about them, and others of the sort, makes one realize how uphill a battle the behaviourist is going to have.

Moving on, let’s consider an example of what would probably be called today Chomsky’s dumping on Skinner. Apparently the latter said that we could quantify the strength of a response to a stimulus by means of pitch, speed, repetition, and so on, such that saying something quickly in a high pitch repeatedly constitutes a strong response. Chomsky has his doubts. He quotes Skinner as saying:

if we are shown a prized work of art and exclaim Beautiful!, the speed and energy of the response will not be lost on the owner.

(same location as above, paragraph 3)

And responds, in a passage that never fails to make me laugh:

It does not appear completely obvious that in this case the way to impress the owner is to shriek Beautiful! in a loud, high-pitched voice, and with no delay (high response strength). It may be equally effective to look at the picture silently (long delay) and then to murmur Beautiful in a soft, low-pitched voice (by definition, very low response strength).

(same place)

This is snarky, indeed to such an extent that a reader might wonder whether it’s uncharitable, and a case of style over substance. In a sense, though, it doesn’t matter too much for our purposes, and that for two reasons: as a matter of fact, I think it’s reasonable to say that Chomsky’s review began a move away from behaviourism, even if his arguments were a bit uncharitable. And, more importantly, this destructive work was complemented with Chomsky’s positive views on linguistics, views which inaugurated a research program still flourishing today and one which has much to teach us about the topics of this essay, namely empiricism and rationalism. So next I’ll explain a bit of the basics of Chomsky’s theory.

(Pedantic and skippable note: I’m going to be concentrating on the very first iteration of the Chomskyian theory as found in 1957’s Syntactic Structures. This has been mostly superseded by other work by Chomsky and others, and cognoscenti might find it a strange text to concentrate on, but I think for the purposes of quickly bringing out some central features of Chomsky’s program it’s sufficient, and trying to think how to do the same either with the more canonical 1965 Aspects of the Theory Of Syntax or later work got me nowhere, so I figured I’d start (and end) at the beginning.)

Generative Grammar

Chomsky’s work in linguistics is marked by several important questions, theoretical tools, and a methodology which differ markedly from the behaviourist’s. Methodologically, it is concerned with nitty-gritty details about how language is used, with the small but still important difference between, say, the two readings of each of the below sentences:

  1. Old men and dogs get on well together
  2. The shooting of the hunters was atrocious

Consider the first sentence. On one reading, this is saying that old men and dogs of any age get on well together. On the second, slightly less available, reading, it’s saying that old men and old dogs get on well together: the ‘old’ applies to both.

And in the second sentence (discussed in chapter 8 of Syntactic Structures) we may construe ‘hunters’ either as the objects or as the subjects of the shooting: it could be that the hunters were shot and we’re saying it was atrocious that they were shot, or it could be that the hunters shot and they did so atrociously. One of the main aims of contemporary linguistic theory is to find out what the deal is with such ambiguities and other quirks of language, which serve as data points to constrain theorizing. And perhaps you can see that, even with the quick and partial presentation of it I gave above, the behaviourist theory would probably be too blunt an implement to shed interesting light on such matters.

Theoretically, Chomskyian linguistics attempts to harness tools from mathematical logic and computer science to model the grammar underlying these sentences, where a grammar can be conceived of as a machine that spits out (generates) all and only the grammatical sentences of a given language. One can think of a human’s capacity to use language as being underwritten by a sort of machine, and the question then is what sort of machine it is and what rules does it obey.
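
To make the ‘machine’ metaphor a bit more concrete, here is a minimal sketch in Python of a toy rewrite-rule grammar and a function that uses it to generate sentences. The rules and vocabulary are invented by me purely for illustration; they are nothing like the grammars linguists actually propose.

import random

# A toy phrase-structure grammar: each symbol rewrites as one of the listed
# sequences of symbols. Anything without a rule is treated as a word.
rules = {
    "S":  [["NP", "VP"]],
    "NP": [["old", "N"], ["the", "N"]],
    "VP": [["sleep"], ["chase", "NP"]],
    "N":  [["men"], ["dogs"], ["hunters"]],
}

def generate(symbol="S"):
    # Expand a symbol using the rules until only words remain.
    if symbol not in rules:
        return [symbol]
    words = []
    for part in random.choice(rules[symbol]):
        words.extend(generate(part))
    return words

print(" ".join(generate()))   # e.g. "the hunters chase old dogs"

Run it a few times and it spits out different sentences, all and only the ones the rules allow, which is just what the picture of a grammar as a sentence-generating machine says.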

Meticulous attention to linguistic data to be modeled with formal tools is part of what makes contemporary syntactic theory a rich and productive discipline. But the reason it’s of interest beyond linguistics departments is because the search for these grammars is motivated not only by its inherent interest, but because of the possibility it has of shedding light on fundamental questions about the nature of the mind, and in particular the nature of empiricism.

Because underlying the work is this important question: how can a language be learned, and in particular how can a child learn a language? What must language be like in light of the fact that, based on the extremely limited experience they have of their parents and others speaking, soon young children become able to produce sentences never before uttered (and thus to which they could never have been exposed)?

Roughly, Chomsky’s thought is that the Lockean blank slate model of the mind, according to which a mind gets everything from experience, just doesn’t work here. If a child can learn a language so easily, from such little experience, it must already be equipped at birth to do so. The child’s mind must meet the world half-way, it must bring something to the table to make from the poor and fragmentary data it receives a theory of language. So Chomsky is led to postulate a universal grammar, a set of rules for generating sentences that we all possess.

And the goal is to find such rules in part by attending to particularities of how language is used. Let me illustrate this by running through a couple of the arguments from that early book, not because they’re still accepted — maybe the first is, but the second isn’t — but because they give a good idea of the style of argument that you will still see if you open a recently published syntax textbook.

So return again to our example 1. It’s ambiguous. But it’s ambiguous in an interesting way. Some ambiguities aren’t so interesting: that ‘bank’ means both financial institution and side of a river isn’t a particularly deep fact about language. But notice that the ambiguity of 1. is systematic. Not only can we get it whatever we replace ‘old’ with, but we also get it for other sorts of expressions, like adverbs and what are called determiners. Thus consider:

3. They chatted and danced gracefully (they did both things gracefully; cf. 3' They ate chicken wings and danced gracefully. It’s hard to eat chicken wings gracefully, so the natural reading of 3' has it that it’s only the dancing which was graceful.)

4. Most men and women like Beto (Most men and most women like Beto; cf. 4' Most electronic devices and explosives aren’t allowed on board. We all know explosives aren’t allowed on board, so we know we’re not merely saying that most explosives aren’t.)

What are we to say about such things? In some sense it seems that on one reading a word (like ‘old’, ‘gracefully’, or ‘most’) ‘goes with’ just one of the conjoined expressions, and on the other it goes with both. We could capture this goingwithness with a system of bracketing. Thus we’d have

{Old men} and dogs get on well together

{Old {men and dogs}} get on well together

They {chatted and danced} gracefully

They ate chicken wings and {danced gracefully}

Most {men and women} like Beto

{Most electronic devices} and explosives aren’t allowed on board

Then here’s a thought: sentences, although they don’t come with this bracketing on their surface, nevertheless have associated with them this sort of bracketed description. Ambiguous sentences have two possible bracketings, and that’s what (this type of) ambiguity is: the existence of two bracketings, or, to use the technical terminology I’ve been avoiding, two phrase-structure trees (actually, the above aren’t phrase structure trees, which are a bit more complicated, but they exhibit the general idea.) The key point is the argumentative move from some systematic linguistic data to positing structure to the sentence not revealed on the surface, structure that we somehow must grasp given that we understand the ambiguities.
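
If it helps to see the idea written down explicitly, here is one entirely illustrative way of representing the two bracketings of sentence 1 as nested lists in Python (my own toy encoding, not a linguist’s notation):

# Nesting plays the role of the braces above: an inner list is a constituent.
reading_a = [["Old", "men"], "and", "dogs", "get", "on", "well", "together"]
reading_b = [["Old", ["men", "and", "dogs"]], "get", "on", "well", "together"]

def flatten(tree):
    # Recover the words of a (sub)tree in order.
    if isinstance(tree, str):
        return [tree]
    words = []
    for part in tree:
        words.extend(flatten(part))
    return words

def constituents(tree):
    # List every bracketed chunk in the structure.
    if isinstance(tree, list):
        yield " ".join(flatten(tree))
        for part in tree:
            yield from constituents(part)

print(list(constituents(reading_a)))   # contains 'Old men' but not 'men and dogs'
print(list(constituents(reading_b)))   # contains 'men and dogs' and 'Old men and dogs'

The same string of words, two different structures: that is all (this kind of) structural ambiguity amounts to.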

Consider now our second ambiguous sentence, repeated here:

2. The shooting of the hunters was atrocious

But note that bracketing won’t help here (to convince yourself of this, try to come up with a bracketing that will disambiguate the sentence!). Partly — and very much only partly — on this basis, in Syntactic Structures Chomsky is led to posit another sort of hidden operation or structure in addition to bracketing: what he calls transformations.

In particular, he posits rules that transform expressions into other expressions. (I’m being pretty sloppy here and below and using my own terminology for simplicity, so please forgive me linguists.) So, here is one transform. Given a sentence like:

5. The hunters shot the deer

We can perform the following participle transform (subject):

Participle transform (subject): (i) Change the verb to its present participle, (ii) delete the object and add an ‘of’ and (iii) move the subject to the end. We get:

(i) The hunters shooting the deer

(ii) The hunters shooting of

(iii) Shooting of the hunters

And we can then use that phrase to build up sentence 2. by treating (iii) as if it were any other noun phrase (like ‘man’ or ‘dog’).

Here’s another, the participle transform (object). Given:

6. The men shot the hunters

We can perform:

Participle Transform (object): (i) Change the verb to its present participle, (ii) add ‘of’ after it, (iii) delete the subject.

A little thought should tell you that this transform can turn 6. into something capable of generating the second reading of our sentence.

Then the thought is that though our original sentence isn’t structurally ambiguous in the sense of having two different structures associated with it, it is ambiguous as being the result of transforming two different underlying sentences to form its noun phrase: it is transformationally ambiguous.
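
Purely as an illustration (and emphatically not as serious linguistics), here is how the bookkeeping of the two transforms might look in code, if we pretend a sentence is just a subject–verb–object triple and look the participle up in a little table:

# The participle form is just looked up; real transformations are stated
# over phrase-structure trees, so this is only a toy rendering.
participle = {"shot": "shooting"}

def participle_transform_subject(subject, verb, obj):
    # (i) verb -> participle, (ii) drop the object and add 'of',
    # (iii) move the subject to the end
    return participle[verb] + " of " + subject

def participle_transform_object(subject, verb, obj):
    # (i) verb -> participle, (ii) add 'of', (iii) drop the subject
    return participle[verb] + " of " + obj

print(participle_transform_subject("the hunters", "shot", "the deer"))
print(participle_transform_object("the men", "shot", "the hunters"))
# Both print 'shooting of the hunters': one surface string, produced by two
# different transformations of two different underlying sentences.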

We are thus, by subtle arguments like this, led to posit underlying structure and operations, hidden brackets and transformations, to account for particular oddnesses of grammar. And moreover, since many languages exhibit similar sorts of behaviour, we are thereby given a reason for thinking that this underlying stuff might be a universal feature of human language.

And the big picture thing I want to note is the move away from empiricism. We are positing a whole heap of covert operations and structure, and we are doing so, moreover, with a view to making the case that these operations and structure, or some of them, are universal features of language that we come from the womb knowing, or at least ready, with little experience, to come to know.

Chomsky’s work has undoubtedly been very influential for this picture of language and the mind that it presents, and any intellectual history will recognise this. In the next part, though, and in a considerably more speculative vein, I want to suggest that another picture of the mind, from another discipline, is on the scene, a picture notably different from the Chomskyian rationalism we’ve just seen and one deserving of our attention from what it can teach us about the mind.

Part 2: Machine Learning And Super Empiricism

Again let me start with a story.

Story 2 (cn: suicide)

A man went to the doctor with a cold. They took his medical information, consulted his charts, and told him there was a good chance he would try to kill himself in the next two years. A piece of software they used had told them so, and they were wondering if he would consider staying on as an in-patient to receive the help the computer thinks he needs.

Unlike the first story this is, or could soon be, reality (see https://qz.com/1367197/machines-know-when-someones-about-to-attempt-suicide-how-should-we-use-that-information/). Researchers have created software that takes in medical records and on that basis is able to predict, with 80% accuracy, whether someone would try to kill themselves in the next two years. The scary software that managed this feat used machine learning, and the focus of the rest of this post is how (a particular and popular sort of) machine learning works, what it can tell us about the rationalism/empiricism debate and the nature of the mind or intelligence, and how we should feel about it.

Super empiricism as I use it is the harnessing of machine learning algorithms and big data sets to yield predictions above what humans are capable of in certain specific domains, such as predicting a person’s possible medical future on the basis of their records. If machines can think — and I’m not going to discuss that massive question here — then one could argue that the Chomskyian rationalist model of the mind is not apt for them, but super empiricism is. Because I’m not going to argue for the crucial premise underlying this thought, I call super empiricism philosophical-fiction (again, it’s meant to call to mind science fiction) and I think it’s worth exploring as a view because of its intrinsic interest. In order to do so, though, I want, as I did for Chomsky above, to provide an overview of how machine learning works, because, apart from the intrinsic interestingness of it, the particular nature of machine learning — and in particular the fact that it is (in a sense) not governed by any hard and fast rules — is what causes it to be distinct from the Chomskyian view and what accounts for the aesthetically interesting or suggestive parts of the theory. So, if you don’t have much of a taste for some of the mildly technical detail that follows, bear in mind there’ll be some big picture consequences drawn from them.

Neural Networks, 1

Oftentimes, an object has a property, or might have a property, thanks to its having another, different property or properties. For example, if someone has the properties of being over 35 and under 65, born in the US, male, and of having attended an Ivy League university and law school, worked in government, and been the subject of widespread media attention, then the probability of that person running for the presidency is higher than for someone lacking those properties.

In general, there are sets of properties on one hand and other properties on the other such that possessing the former increases the chance of possessing the latter. For brevity, let’s call these relations between properties connections: some properties are connected with some other property if the likelihood of an object possessing the latter property is greater if it possesses the former properties (and let’s just completely ignore any complicated correlation\causation questions, and also questions about the nature of what likelihood is. Sorry, philosophers of science!)

Now here’s a fact: the world is full of connections that we can’t see because we’re limited creatures. The presidency connection mentioned above is one we know of because the president is perhaps the most powerful man in the world and we all know off-hand the biographical details of many presidents.

But there are many connections we just don’t have enough worldly knowledge or time to work out. Thus think, for example, of what it is that makes a social media post go viral. There are some things we can guess immediately: it receiving a large audience, it being funny, etc. But there are many other properties that might have an effect without our realising it. Maybe the gender of the sender matters: maybe men go viral more than non-men. Maybe age matters: maybe being between 24–35 helps a lot. Maybe the time of day, or the day of the week matters. Conversely, maybe some things don’t matter: maybe for all our Anglocentrism, viral posts (defined as let’s say ones with over 1000 engagements) are equally likely to be in any language.

Or maybe it’s some wildly complicated mix: maybe, for whatever reason, a man posting something funny on Tuesdays to a lot of followers makes virality likely, as does a woman posting not in the evenings about politics. And maybe some other property I haven’t even thought about plays a surprising role: maybe if your post has more than one picture the chance of it going viral are extremely low. Who knows?

Neural networks know, or at least could know if someone were to care enough to code one to answer this problem. They are a way to harness the number-crunching abilities of computers and the big data sets that the information age gives us to try to find connections between sets of properties and target properties that are too complicated or surprising for humans to grok in the way we can grok (to some limited extent) possible presidential candidates. When successful, they give us in essence a machine that we can present an object with and be told an accurate estimation of how likely that object has that target property. To see this requires some technical detail, so I will begin by presenting a massively simplified example that nevertheless exhibits some of the key features. I rely on Tariq Rashid, Make Your Own Neural Network for the pedagogical idea of using a simple example like the below to introduce the topic, and in general warmly recommend the book for someone looking for a quick and accessible introduction that nevertheless doesn’t shirk much of the important mathematical detail. Also, I’m not going to go into questions like why they’re called neural networks or the history of the concept because that information is easily googleable and would needlessly make this post even longer than it already is.

Just as presidential candidateness is connected to some properties, and virality might be, so is the property of weighing 55 pounds connected to the (single) property of weighing 25 kilograms, and more generally for any weight in pounds there is a connected weight in kilograms.

Say we want a piece of software to classify a person as a possible president given their attributes, a post as possibly viral given facts about its creation, or a weight in pounds given a weight in kilograms. There is an obvious way to go: we simply define a function that takes the attributes or facts about creation or weight in kilograms as input and spits out the answer. In the weight case, it would look something like this:

def convert(kilogram):
    pounds = kilogram * 2.205   # one kilogram is about 2.205 pounds
    return pounds

A program could then call the function like this:

i = float(input("Tell me the weight in kgs you want converted: "))
print(convert(i), "is the weight in pounds")

Easy — but try doing that for the president or the viral case, and you’ll struggle. The cool and crucial idea which lies at the heart of neural networks is that we don’t have to define the function ourselves: we can let the code work out the function, in the sense that, when it is given a certain value, it will yield the right output.

To see this, let’s see how an algorithm could work out the function connecting weight in kilograms with weight in pounds. What should we do, if we can’t feed the algorithm the conversion rate, as I did above?

Well, here’s a thought: we just let it guess, and then try to correct itself. I italicise advisedly: this is pretty much all there is, conceptually, to neural networks; the devil is just, as always, in the details. So here’s how it could go. The computer might guess:

def convert(kilogram):
    pounds = kilogram * 3   # a first guess at the conversion factor
    return pounds

That is, it guesses that the relevant equation is pounds = kilograms x 3. Given our input of 25 kilograms, it will accordingly output 75, but since the right answer is 55, it is in error by 20 pounds.

Now here is a very important concept in neural networks, that will remain relevant even in the more complicated cases: error. We use the error to help the function improve. We look at the error and then let the computer make another guess based on how much it was in error. Let’s say its next guess is the following (where I henceforth just give the equation rather than the whole function each time):

pounds = kilograms x 2.5

That yields a guess of 62.5 and an error of 7.5.

The error is much smaller, so our next guess should be much closer to our previous one. Let’s try:

pounds = kilograms x 2.3

That yields a guess of 57.5 and a puny error of 2.5. You can see how this goes: we keep on this process til we’re as close as we want to be.

Now, I haven’t actually told you how to make a computer do this iteratively refined guessing, which is obviously the most important thing. You’ll just have to take it on trust that we can do so (hopefully you can maybe see that this wouldn’t be the most difficult thing in the world, if you have some coding experience, and there are many helpful tutorials a mere google away).
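
For those who’d like a little more than trust, here is a minimal sketch of the guess-and-correct loop in Python. The learning rate, the stopping threshold, and the update rule of simply nudging the factor against the error are all illustrative choices of mine, not the only way to do it:

target_input, target_output = 25, 55    # 25 kilograms is (roughly) 55 pounds
factor = 3.0                            # the initial guess at the conversion factor
learning_rate = 0.01

while True:
    guess = target_input * factor
    error = guess - target_output       # positive means we overshot
    if abs(error) < 0.1:                # close enough: stop guessing
        break
    factor -= learning_rate * error     # nudge the factor against the error

print(round(factor, 3))                 # ends up close to 2.2

Each pass through the loop is one round of guess, measure the error, adjust; after a couple of dozen rounds the factor settles near the true conversion rate.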

Let’s recap, because we’ve actually learned a lot. We want to have a computer determine functional relationships, so we let it guess, then use the error to refine guesses til we get the right answer. The reason why this sort of thing is useful is because we might not know how to work out the function given some attributes, but a computer can try out much more quickly many more possibilities than we can, and so we can leave the job of coming up with the function to it.

Philosophical Interlude: Big Empiricism

Before going on to a slightly more complicated — and optional — example, let me make some philosophical points. A function is basically a rule of the form: if you get this in, give this out. And the big idea of machine learning is that we can let computers work out functions of interest to us. Moreover, we’ve seen that machine learning can be used to yield almost spooky predictions about us, like about what our medical future has in store for us, in the face of which it can seem we are in the presence of a mind: the computer knows, we might think.

But if the computer knows, it does so in a very different way from the way that we know grammar on the Chomskyian picture. The big idea there, on at least a pretty natural reading, is that there are some inbuilt rules that explain how we acquire language based on a poverty of stimulus. And so, I suggest, as we move to an era where machine learning is intruding upon our lives more and more with its otherworldly predictions, so we’ll move more and more, philosophical commitments notwithstanding, to a concept of mind as something un-rule-governed, or at least governed by rules beyond our ken, to an understanding of the nature of intelligence according to which its paradigm is not language use, but big-data-enabled machine learning predictions, predictions which are the output of functions that are inscrutable to us because not created by us.

(I haven’t even so far mentioned one of the biggest reasons that this should freak us out: because neural networks rely on data about us, and because we’re super racist, sexist, and so on, so are our neural networks. Here perhaps more than anywhere else is the old computer science adage of garbage in garbage out applicable, because computers must learn from the garbage pile that is a lot of human behaviour.)

(Skippable-if-you-must responses to some counterarguments: One of the reasons I like blogging as opposed to writing academic articles is that you can get away with ignoring counterarguments, focusing instead on just putting forward a position about some topic. But I can’t resist replying to some counterarguments that might have occurred to the reader of the last paragraph. The first is that there is no tension here, because although computers are guessing rules, they do so by means of a rule that we put in them, namely the sort of algorithm I sketched above. Even if machine learning’s outputs are in a sense inscrutable, they’re nevertheless also scrutable in the sense that it’s our code that makes them produce those inscrutable outputs. When you trace things far enough back, you’re left with good old intra-cerebral human intelligence ruling the roost.

Fair enough: that seems right. Maybe there is no solid argument that there are notably different super-empirical minds out there in our computers. But that’s fine. As I said, I kind of treat this hypothesis as a philosophical-fiction, so not quite seriously, and what I’m really interested in is how we should feel about the idea even bracketing its truth. Moreover, even if my thinking here isn’t accurate, I do think it picks up something in the Zeitgeist: I can’t help but think that the image or thought or feeling or image-thought-feeling that our computers are ruleless beasts outside our control is out there, that such an image-thought-feeling would have been hard to comprehend a few decades ago, and that it’s unlikely to go away as more and more spooky machine learning feats hit the news, and I find it a fun image-thought-feeling to engage with intellectually and imaginatively, and that’s what I’m doing.

Here’s a second counterargument: the computer knows, I suggested. But, well, isn’t that a big contentious can of worms to be opening, or rather not opening and just dumping and moving on? Surely most of us don’t really think, either as theoreticians or as every day people, that computers know.

Well, maybe. I’m attracted to Sherry Turkle’s work on this topic (see, for example, her Alone Together), according to which we are wont to attribute mindedness to things that behave mindedly, even if we know they are not minded. I’m accordingly tempted to think that whatever scruples we might have about saying computers know, that will be belied by the way we navigate a machine-learnt world; that our navigation will be the navigation of a sort of being that thinks that it does know, no matter what we say. But that’s obviously a big issue that I can’t deal with satisfactorily here.)

Neural Networks, 2

The example of weight, while it does foreground the important concepts, is sufficiently trivial that you might wonder how the method underlying it could be used to give the great predictions machine learning seems to be capable of. Accordingly, in this section I’ll consider some details of how a slightly more realistic and powerful neural network might function, but if you don’t care about those details feel free to skip.

Recall our example of the viral post. The first thing to note is that there are several input properties (unlike just the one for weight), and we want them to be the input to our machine that tells us whether or not the post will go viral. Here’s a drawing:

Let me simplify massively and not very satisfactorily by assuming that these inputs all take only one of two possible numbers as values (say 1 and 2), where this represents some sort of division related to the properties in question (time would be divided into before noon or after noon, gender female or not-female, day weekday or weekend, topic comedy or politics, language English or not-English, followers over 1000 or under 1000).

The things on the left are the input properties, the middle bit is the function the computer will learn for itself, and the thing on the right is the output it will produce.

Let’s make some wildly simplistic assumptions. Let’s say that day of week turns out to be surprisingly very important — weekday posts are much more likely to go viral than weekend posts — and language surprisingly not important. What we want is for our network to work that fact out by itself, and represent that fact somehow in its network. And one way to do so would be if it assigned more importance to day of the week. Here’s how we could do that: think of the input travelling along a connection (the lines) into the function that magnifies or lessens its impact on the output of the function. We have drawings like this:

This is what a neural network would look like that thinks number of followers and day of the week is important to a tweet’s going viral but language isn’t

With a suitably defined function — which I’m not going to get into at all (there are many places on the net to help you — google ‘activation function’ to get started), it shouldn’t seem too implausible that this could yield the right output: when given a post, represented as a list of numbers standing for the properties mentioned above, it will assign a high probability of virality to the ones posted on a weekday to lots of followers, regardless of language. Of course, what makes a post go viral is much more complicated than that, but hopefully you can see how the idea could be made more realistic by adding more properties and allowing us to discriminate between tweets more accurately (as it stands, all tweets written on a weekday after noon by a man about politics in English to over 1000 followers will be assigned the same probability of going viral, which is obviously the wrong result).
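
Here is a toy sketch of what such a network amounts to in code. The weights, and the use of a logistic ‘squashing’ function as the activation, are illustrative choices of mine; a real network would learn its weights rather than have them typed in:

import math

# Each input is 1 or 2, as in the simplification above. The made-up weights
# say that day and followers matter a lot and language barely matters.
weights = {"time": 0.4, "gender": 0.2, "day": 0.9,
           "topic": 0.3, "language": 0.05, "followers": 0.8}

def predict_viral(post):
    # Weighted sum of the inputs, squashed into a number between 0 and 1.
    total = sum(weights[prop] * value for prop, value in post.items())
    return 1 / (1 + math.exp(-(total - 4)))   # logistic activation, arbitrary offset

post = {"time": 1, "gender": 2, "day": 2, "topic": 1, "language": 1, "followers": 2}
print(round(predict_viral(post), 2))          # about 0.63 on this toy model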

But even if we added a ton of properties, there would still be something arguably lacking from this sort of set up. Remember we want to find the input properties that make the output property likely. It could be — it probably is — that what makes most output properties likely is some wildly complicated combination of input properties. It could be, for example, that while language doesn’t matter by itself, a tweet in English by a woman on politics that’s written either on the weekend or after noon is likely to go viral. We would like to capture the fact that the very complicated combined property of being in English and on politics and either written on the weekend or written after noon makes virality likely. Can we do this? We can. What we do is add a set of hidden layers, which function to represent connections between properties. These hidden layers are partly like inputs and partly like outputs: they take inputs and yield a result, but then that result itself gets fed forward to yield the final prediction of the network.

Unfortunately at this point my powers of graphic design, or rather my patience for drawing many, many lines, wears out, so you’re going to have to use your imagination when looking at the below and imagine that the left circles are our input properties, and the colour of the wire leading from the first to the middle circles is indicative of how much weight we assign to those properties:

Imagine that the left nodes were labelled as in the previous diagram, and that the colour of the connection indicates the strength of the signal (where blue, say, means >0.7). The rightmost node is the output, which is determined by the middle nodes which represent combinations of input properties and themselves have a strength associated with them.

Look at the top middle node. It gets a strong signal from the time, day, and follower nodes, and calculates an intermediary output from them, which then goes strongly towards determining the final output. Roughly, this represents a network that thinks the properties of time, day, and follower count are important to determining virality. With a little thought you should be able to see how we could use those intermediary nodes to also represent things like disjunction (or-ness) and conjunction (and-ness), that is to say, to capture the importance of what I called above very complicated combined properties.
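
To see how a middle node can stand for a combined property, here is a tiny hand-wired example. The weights are chosen by me to make the point (nothing is learned here), and for simplicity the inputs are 1 for ‘yes’ and 0 for ‘no’ rather than the 1s and 2s above:

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def hidden_and(english, politics):
    # fires (outputs near 1) only when both inputs are 1: an 'and' node
    return sigmoid(10 * english + 10 * politics - 15)

def hidden_or(weekend, afternoon):
    # fires when at least one input is 1: an 'or' node
    return sigmoid(10 * weekend + 10 * afternoon - 5)

def viral_score(english, politics, weekend, afternoon):
    # the output node listens mostly to the two combination nodes
    return sigmoid(8 * hidden_and(english, politics)
                   + 8 * hidden_or(weekend, afternoon) - 12)

print(round(viral_score(1, 1, 0, 1), 2))   # English politics post after noon: high (~0.98)
print(round(viral_score(1, 0, 0, 0), 2))   # English, not politics, weekday morning: low (~0.0)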

But of course the crucial question is: how do we come upon such a network? The whole point is that we don’t know in advance what weight the connections have, and there are far too many possibilities to test them all. But the answer is the same as in the easy case: we guess, work out the answer, see how wrong the guess was, adjust the weights accordingly, work out the answer again and keep on going til we have a function that does well at classifying some known viral posts (which function as training data, a concept I’ve pretty much ignored). And then we put it to the real test and try it on a new post not in the training data, and see how well our network does.
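
Here, hedged as a sketch rather than anything like production code, is the whole guess / measure-the-error / adjust-the-weights loop written out for a tiny network with one hidden layer, using a toy dataset in which the target is a combined property (it is 1 exactly when one, but not both, of the two inputs is 1, which no single input weight could capture):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # training inputs
y = np.array([[0], [1], [1], [0]], dtype=float)               # known right answers

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def with_bias(a):
    # append a constant column so each layer also gets a bias weight
    return np.hstack([a, np.ones((a.shape[0], 1))])

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))    # (2 inputs + bias) -> 4 hidden nodes
W2 = rng.normal(size=(5, 1))    # (4 hidden nodes + bias) -> 1 output

for step in range(20000):
    hidden = sigmoid(with_bias(X) @ W1)          # guess: forward pass
    output = sigmoid(with_bias(hidden) @ W2)
    error = output - y                           # how wrong the guess was
    grad_out = error * output * (1 - output)     # gradient at the output node
    grad_hid = (grad_out @ W2[:4].T) * hidden * (1 - hidden)   # pass it back to the hidden layer
    W2 -= 0.5 * with_bias(hidden).T @ grad_out   # adjust the weights against the error
    W1 -= 0.5 * with_bias(X).T @ grad_hid

# With luck this prints something close to [0, 1, 1, 0]; gradient descent can
# occasionally get stuck from an unlucky starting point, in which case rerun
# with a different seed.
print(np.round(output.ravel(), 2))

The two lines computing grad_out and grad_hid are, in miniature, what ‘backpropagation’ names, and the two weight updates are ‘gradient descent’; the real thing differs mainly in scale and bookkeeping.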

I’m afraid that to properly explain it would take a couple more thousand words so, with some regret, I’m going to ask that either you take it on trust or — better — research it yourselves (google backpropagation and gradient descent, for starters). But if such a network were successful, it could know things we don’t know by means we don’t entirely understand. How should we feel about that?

Story 3, Conclusion

Here’s a third and final story to end things. Oedipus doesn’t know who his parents are; he goes to an oracle and it doesn’t tell him, but does give him the kind of bummerish news that he’ll kill his dad and have sex with his mum. Then his town gets sick and he’s told that it’s due to pollution caused by the murder of his father, and he needs to find out the murderer to make the town better, but again the oracle doesn’t tell him the somewhat important info that he himself is the murderer. And that’s comparatively straight speaking from the oracle, which is at other times prone to talk in annoying riddles.

An oracle, I think one could say, is an arational intelligence: it knows things, but it can’t be reasoned with. Another way to put it is that it’s not a general intelligence: it can’t do all the things that humans can do. It can tell, but cannot respond to questions.

AI researchers want general intelligence, and while they don’t have it, they do have some pretty decent specific intelligences, intelligences of which, I have suggested, super-empiricism is true. This is a new sort of mind capable of processing much much more data than our puny human brains could contemplate and finding patterns we never could.

What should we make of this new sort of mind? We should take such minds to be oracles, and should view our world as stalked by not-quite-all-there superintelligences, by minds capable of interfering in human affairs and scrutinising our data to tell us we might get sick, or go to prison, or die, or of judging us negatively because we belong to a historically persecuted group, but incapable of offering us anything other than those harsh predictions and judgements. And so I conclude: super-empirical minds are a fitting object of fear and pity for an age without myths.
