Gresham College Lectures

Big Data in Business

March 06, 2023

Big data has really taken off over the past decade because of the presence of ubiquitous sensor technology. For example, we are all constantly monitored by our phones, smart doorbells, heating systems, televisions, watches and jewellery. These devices generate a constant flow of information about us. But this data is pretty much meaningless without context.

This lecture talks about how data needs to be processed to make it useful to business.


A lecture by Raghavendra Rau recorded on 27 February 2023 at Barnard's Inn Hall, London.

The transcript and downloadable versions of the lecture are available from the Gresham College website: https://www.gresham.ac.uk/watch-now/data-business

Gresham College has offered free public lectures for over 400 years, thanks to the generosity of our supporters. There are currently over 2,500 lectures free to access. We believe that everyone should have the opportunity to learn from some of the greatest minds. To support Gresham's mission, please consider making a donation: https://gresham.ac.uk/support/

Website:  https://gresham.ac.uk
Twitter:  https://twitter.com/greshamcollege
Facebook: https://facebook.com/greshamcollege
Instagram: https://instagram.com/greshamcollege


(graphics whooshing)- So this year I've been giving a series of lectures on how technology is affecting the world of finance. This is the fourth in that series, and it's about how big data is being used in business to make inferences. So let's start with some popular examples, right? One of the most famous examples, and many of you may have heard of this example before, is that of the American company, Target. Now, for those of you who have never shopped at Target, it's a typical big box retailer, a little bit more like Waitrose than Tesco, a little upscale, but they had a big problem in 2012. And the problem was that people's shopping is "sticky." That means, in other words, when you go to a supermarket to buy stuff, you walk in almost on autopilot. You walk in, you look for the milk, you look for the groceries, you just know where these things are. And so you don't actually, you know, go from one supermarket to another. You know where things are available at your own supermarket. So Target, in trying to persuade you to move to their supermarket, had a great deal of difficulty. So they spent an enormous amount of time and money collecting details using Target loyalty cards, not on phones, but using your loyalty card to try to figure out exactly what you were doing

in order to predict one thing:

That is, were you, as a woman, were you likely to be pregnant? Now, why is that so interesting? Because if you're pregnant for the first time, you're doing something brand new. You've never done this before. You're buying stuff you've never bought before. So in other words, if Target can predict when you are pregnant, they can hit you with, you know, baby-related material. If you start shopping at Target then, well, stickiness again, you never leave Target again, right? So that was the idea. And so they collected a lot of information and, you know, apparently the story is they came up with data, for example, it turns out that if you suddenly start buying unscented hand lotion, where previously you were using scented hand lotion, that's an indicator that your sense of smell has changed. That's an indicator for a possible pregnancy. Things like, are you using more moisturizer than before? Things like that. In fact, apparently they were so accurate, and this is a book by a guy called Charles Duhigg, who wrote something called "The Power of Habit," apparently they were so successful that this man apparently stormed into a Target one day, and he said, "What are you guys doing?" And they said, "I'm sorry, I don't know what you're talking about." And so he goes to the manager and he says, "Look, my daughter is 16 years old, and you're giving her all this baby-related stuff. What the hell do you think you're doing?" And the manager apologizes, he says, "I'm so sorry, I don't know how this happened." And, you know, gave the guy a $100 coupon or whatever. And the guy goes away, you know? A week later he comes back and he's shopping again. The manager notices him and says, "Oh, I'm so sorry about last week, I really apologize." And the guy says, "Well, there were some things happening in my home that I really didn't know about. I'm afraid my daughter is indeed pregnant." So, you know, Target knew about this before the family did. Now, of course, now Target tries to be much less creepy about it, right? I mean, they don't want to hit you with this. So what they do now is basically when they send you their ad circulars, they have baby-related stuff, but they also put in things that no mother is going to buy, like a chainsaw or snow tires or something like that. But just the proportion of baby-related stuff is higher in your circular than in other people's. I mean, you know, nobody compares your ad circulars with your neighbors', right? So that's the idea here. So that was one example. It's a very famous example of companies being able to use big data to predict things that we don't know about ourselves. Another example is Facebook. Facebook earns billions of dollars every year by categorizing you as a particular type of consumer: somebody who drinks beer, somebody who's a Republican, somebody who's something else. And they charge ad companies fortunes to get your ad in front of precisely the right type of customer. Another example was this professor who bought tickets to fly on a plane, bought well in advance, because he wanted cheap tickets. But when he was actually flying, he checked with his neighbors, both his neighbors apparently had bought tickets just a few weeks before, but they paid much lower prices than him. So he got mad, he downloaded all the data from Sabre, which was the airline reservation system, and tried to see what patterns he could find out about airline prices. So he came up with a company called Farecast. So some of you may have used this.
And the way it works is you try to buy a ticket and it'll tell you, "Don't buy yet, prices are likely to drop in the next week." Or, "Buy right now, because prices are likely to go up." They actually do a good job using big data to predict what the airlines are going to do. Google Flu is another example where Google apparently said, okay, if people search on Google for things like, "What are the symptoms of flu?" "How do I cure a stuffed nose, or a blocked nose, or a sore throat?" And things like that, that's a leading indicator, a real-time indicator for flu. So that is much better than waiting till people actually go to the doctor and find out whether they have flu or not. Cambridge Analytica, which is based in Cambridge, took a lot of this data from Facebook and apparently tried to sell it to political parties in America to try to target particular types of people for political ads. And we have here lots more examples. We have Amazon, which makes recommendations to you about what to buy. We have Netflix, which also makes recommendations about what to watch next on Netflix. And all these companies are famous examples of big data. But there's one thing which we don't know yet, right? For example, what is big data? What is "big" about big data? Why do we call it "big data," right? So one way to think about it is very simple. When we analyze something, we have a choice. We can analyze a sample from a population, or analyze the whole population. If you are analyzing the whole population, that's a form of big data. If you're analyzing a sample from the population, that's regular data, okay? But all our lives, at least till a few years ago, we were analyzing samples. We were not analyzing the complete population. So what does that mean? First requirement, if you're analyzing a sample, not a population, is that you have to have clean data. I sometimes spend years cleaning databases just to make sure there are no errors when you actually run a regression, or do something like that, right? So cleaning the data is the first step. But beyond that, the second step is, okay, how do we choose the sample, right? And there are lots of problems here. The first problem is how do we get precision in our sample? And precision doesn't come with the size of the sample, it comes with the randomness in the sample. Two of the most famous examples, for example, were in 1948, when the New York Post announced "Dewey defeats Truman," based on early polls. But the sample was wrong, and in fact, Truman had defeated Dewey. Or the Trump election in 2016, where a lot of pollsters had no idea he was going to win, right? So why do these things happen? Well, let's take an example of a sample. Suppose you're a pollster, and you have a list of everybody's landlines, and you call them to find out what their voting intentions are.

But who are you getting:

people with landlines, right? I mean, and who has landlines? A particular type of people, possibly on the older side. I mean, I don't have a landline myself, but, you know, I use a mobile for everything. And people who answer the landlines once they're actually called, right? So that's a particular group of people. And so they may not be the kind of people who are actually going to be voting. And it gets worse as you go into finer and finer samples. For example, if you think about trying to find out, in an area, are the Republican women with two children going to vote, and who are they going to vote for? That's a very specific sample. So again, randomness is gone once you go to more and more precise data. And we have a lot of ways to handle that data. What's the average customer like? What's the average voter like? What's the modal voter like? So there are ways of summarizing the distribution. But the key part about all of this is, we are trying to solve the problem of causation. Does A cause B, right? I mean, this is a fundamental problem in science. We talk about causation all the time. We have a model in our heads, and we collect data to prove or disprove a particular hypothesis, okay? Now, what about big data? Well, the first thing is, what's the big difference between analyzing a population and analyzing a sample? Let's take some examples. One example, which I've referred to before, is the Domesday Book here in England, around 1066, after William the Conqueror came to England. He decided to count the number of people in his kingdom and check how much property they had, how much cattle, how many, you know, heads of livestock, things like that. There was an impending war with Denmark. He wanted to make sure that he had enough people to actually fight in that war. So they were trying to get the data on the whole population. Unfortunately, it was an incredibly labor-intensive job. And when William died, they stopped doing it. So that was pretty much the last attempt at trying to get data on the whole population. But that was revived. So for example, in America, they decided to have 10-year censuses. The problem is, till about 1880, the amount of data they were collecting on people from around America was so much, it was taking more than 10 years before the data could be analyzed. By that time, the next census had already started. I mean, what do you do, right? I mean, there's no way to get that data analyzed in real time. So that's when they contacted Hollerith, and he came up with punch cards to actually collate the data. That meant the 1890 census was processed in about two years, rather than the 13 years it had been taking. Of course, Hollerith went on, and his company eventually became IBM. So IBM was one of the first companies which was actually created to solve the problem of big data. Cambridge Analytica, I've already talked about, I'll spend a little more time later. But one of the things that we need to emphasize is it's a population, but the population doesn't need to be big. There's a book, you know, called "Freakonomics," which some of you may have read, by Levitt and Dubner, and they talked about the population of sumo wrestling bouts in Japan. And because they looked at the whole population, they were able to detect instances of cheating between sumo wrestlers. You can only see that if you look at the whole population, not if you look at a sample from that population. But what's the common theme about all of this, right? Just think about these issues, it's messy as hell, right?
So the data varies in quality. It's collected by different people at different points in time. It's kept in a wide variety of places. It's incredibly messy, it's not clean data. And that means you can't really make precise inferences. You can sort of point to general directions, and I'm going to talk a little bit more about that as well. But where is this big data coming from? Well, one of the big areas is we reuse the data from other sources, right? I mean, and that's one of the reasons why it's messy, because unlike a sample, where we are collecting data to test a particular hypothesis, here, there's already data out there. We repurpose it for a new question which that data was never meant to answer.

So let's take an example:

machine translation. Machine translation used to be an amazingly difficult job. So the earliest people who did it tried to codify the rules of language. What they said was, "Here's a noun, here's an adjective, here's a verb. This needs to be conjugated in this way." But language is messy, right? I mean, there are exceptions to every rule. There's ways to express yourself that don't really fit well with the rules. It was really tough to do machine translation, even with the biggest dataset they had, which was in Canada, at the Canadian parliament. What they had was about 30,000 speeches of Canadian parliamentarians, which had been translated into both French and English, right? And so, very precise translations, very well done. So they took the 30,000 pages and said, "Let's see if we can find patterns." And they couldn't. It was impossible to do. So what did Google do that was different? Well, they already had a project to digitize the world's books, Google Books, right? And they took all that data and they just shoved it into an algorithm, which we'll talk about in a bit, and they said, "Can you find patterns?" And they were able to find patterns. And the data's messy, but why were they able to find patterns? Because they had over 3 billion pages, not 30,000, 3 billion. That's several orders of magnitude more than, you know, just 30,000 pages. And now this is pretty accurate. I use Google Translate all the time. So, it's not that the machine knows what it's translating, it's just looking for patterns, the way these things correlate in different languages. Another example, a couple of professors at MIT launched this Billion Prices Project to solve the problem of measuring inflation. How do you figure out whether inflation is taking off or not within a country? Usually, the Bank of England or the Fed looks at a particular basket of goods and says, "What is the change in the price of that basket of goods?" So you have to have a representative sample. You have to monitor those prices very, very carefully. These guys did something very different. They just took every price they could find on all websites, on all retailers they could find in America. That means over a billion prices. They're not comparable, right? I mean, what one retailer might call a suitcase, somebody else might call a briefcase. They're all over the place. But the point is, this is real-time big data. And so they were able to make inferences about inflation much faster than entities like the Bank of England or the Federal Reserve can. And of course, there's lots of new data.

Examples:

Satellite photos, this is a classic example. So if you want to know, for example, how busy a retailer is, one possibility is to wait for the annual report, or their quarterly reports, right? They'll tell you their sales. Another possibility is to take satellite photos of their parking lots. How many cars are there in their parking lot on any one day? And that's available on a daily basis. So you can have a much faster estimate of just how much traffic is happening at these big box retailers. Another example, open source intelligence. For example, in the Ukraine war, before the Ukraine war started, you could use Google Maps to figure out where the invasion was going to happen. How? Well, it turns out that a lot of the people who were in those areas were reporting traffic jams, because the tanks, the Russian tanks, were all blocking up the traffic on those days. So you could look at Google Maps and get a good sense of where the tanks were massing to come across the border. Another example of, you know, data being repurposed for something else. And of course, smartphones, right? All of us have smartphones, which are collecting an immense amount of data on us. They collect data on what searches we have carried out, what books we have bought, what vacations we shop for. Literally everything we do is on our smartphones. And of course now, most of us have Oura Rings, we have Apple watches, I have a Google Pixel watch, we have Fitbits. They're all adding data, stuff which we don't even know about ourselves, all being captured as big data, right? So your phone knows, for example, when you're stressed. It knows when you're low on sugar. It knows when you like a person of the same or the opposite sex. It's all there. And we willingly carry some of these things with us wherever we go, right? I mean, most of us would feel lost without these, in a way, tracking devices, right? In the old days, the only people who would be made to wear these tracking devices were convicted felons on parole. But now, you know, we all have them, and we all generate our own data from them. So if you classify these different types of data, there are several different types. There's geospatial data, which tells you how you are moving from place to place. There is sociometric data, which is the study of your social relationships. There is psychometric data: what's your mental state, what's your personality? And of course, there's biometric data, which looks at your biological characteristics, right? I'll take examples of each of those. But before I show you those examples, why are we collecting so much data? I think a popular phrase, which a lot of people have been talking about, is "Data is the new oil," right? The idea is, if you have data, you are going to be the next powerful company around the world. So a lot of companies believe the more data we have, the more valuable insights we're going to get. And of course, regulators also assume that more data is better, right? I mean, people usually assume that more is always better, right? Bigger house, better than a smaller house. More money, better than less money. More children... No, it doesn't work for children, but it works for everything else. Okay, well, so if you are, for example, a company listed on the stock market, right? You have to file quarterly information. Banks and investment funds have stringent reporting obligations. Certain sectors have additional requirements. So everybody is being given more and more information.
The problem is we can't interpret this big data just like that. The world is just generating data at us, so what do we do? So what we have to do is we need to process the data. We're going to talk about three steps we need to take to process that data. The first step is a language. We need to come up with a language to describe that data. Without that language, big data is useless, okay? Second thing, we need to look at our preferences along multiple dimensions. And I'll talk about that using an example. And finally, we need to capture all those preferences to predict what we're going to do at any one point in time, right? So those are our three steps. So, why do we need a language? Well, let's just take a cup of tea, right? So what dimensions could we have with this cup of tea? The type of tea, of course, but beyond the type of tea, where is the tea coming from? Is it coming from China, is it coming from Assam? What's the carbon footprint? Is it organic, is it something else? There are so many dimensions you can use to classify even just a cup of tea. So that means that you need to quantify those dimensions. You need a way to convert each dimension into a number, because a computer doesn't understand words, it understands numbers. So we need to quantify each of those dimensions. So what does that quantification process consist of? Well, first of all, all those different dimensions have to have uniform tags. You would have to say, okay, it's a tea bag as opposed to loose tea. It's a Chinese tea as opposed to an Assam tea, and so on. So there are specific tags you need. And those tags can be incredibly complex. For example, this is a company called Zappos in America, which just sells shoes, that's it. If you type in "men's sneakers and athletic shoes," it comes up with 9,260 choices.

There are subcategories:

lifestyle sneakers, athletic shoes, sizes, width, brand, prices, colors. You name it, there are categories which you can classify this data into. Now, that's easy in a specialized marketplace, like shoes, like washing machines, like hard disks, where there's a specific number of dimensions. But once the data becomes more unstructured, it becomes more and more difficult to do. For example, think about YouTube. If you're trying to find something on YouTube, "how to juggle," right? How does YouTube know how to get you that kind of information, right? How does it find a video about juggling for you? Well, it looks for multiple things. For example, is the word "juggling" relevant, and which part of the video is it in, right? It looks at how many people have engaged with the video. It looks at the quality of that, right? And whether it can be personalized. Let me give you an example here. So it looks at what videos you watched in the past. What are the videos which are typically watched together? So if you have one video on juggling, what other videos is it typically watched with? So there's a grouping there. And topically related videos. For example, if you hunt for the word "cricket," right? There are two possibilities, right? You have a cricket chirping at night, or you could be playing cricket. Now, which one will it give you? It depends again on things like what have you watched in the past? If you watched a lot of insect videos, you're going to get that. If you're watching a lot of, you know, sports videos, you're going to get that, right? So it has to infer from all these bases what it is you want. How does it search for a concept in a tag database? Now, in this particular case, in YouTube's case, it has millions of possible videos which it can serve you, but it looks at your user history and your context, and it looks at other candidate sources, to winnow that down from millions of sources into hundreds, and eventually ranks them on the basis of video features. And what you get is just the tip of the iceberg. So here are some ways you can do this. For example, you want to be as precise as possible in your name, right? So if you have laparoscopic-appendectomy.mov, it's looking for that name there. So if you're looking for a video on appendectomy, that's where it'll show up, in the name, right? It'll look at the video title. You want to say, "A real life step by step laparoscopic appendectomy." Much better than if you have something very vague. So this is up to you when you're uploading to give it as many cues as possible to let people find that video for you, okay? Now, this is our first step, classifying our data, right? We are developing a data ontology. "Ontology" just means a language. So developing a data language is our first thing. And this is really important. Like eBay, if you try to find anything on eBay, it's a complete mess, right? Very difficult to find stuff on eBay. Much easier to find stuff on Amazon, because Amazon started with a classification system for the first items they sold: books, right? The Dewey Decimal classification system, easily available, easy to classify. But this is a really hot field for jobs, by the way. So, you know, when you guys were growing up, I mean, how many of you, when your parents or your uncles or whatever asked you, "What do you want to be when you grow up?" How many of you said, "I want to be a librarian!" Anybody? Well, it turns out, this is a really hot field these days. Being a librarian is a very, very good job, because it's about understanding data, not about just books anymore.
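To make that first step concrete, here is a minimal sketch, in Python, of what quantifying uniform tags might look like, using the cup-of-tea dimensions from earlier. The tag vocabulary, the encodings and the carbon-footprint numbers are illustrative assumptions, not any retailer's actual classification scheme.

```python
# A minimal sketch of turning uniform tags into numbers a computer can work with.
# The controlled vocabularies below are illustrative assumptions, not a real schema.
ORIGIN = {"china": 0, "assam": 1, "kenya": 2}
FORM = {"loose": 0, "teabag": 1}
ORGANIC = {"no": 0, "yes": 1}

def encode_tea(tags: dict) -> list[float]:
    """Convert a tagged cup of tea into a numeric feature vector."""
    return [
        ORIGIN[tags["origin"]],           # where the tea comes from
        FORM[tags["form"]],               # loose tea vs. tea bag
        ORGANIC[tags["organic"]],         # organic or not
        float(tags["carbon_footprint"]),  # carbon footprint (already a number)
    ]

# Two teas described with the same uniform tags...
assam_bag = {"origin": "assam", "form": "teabag", "organic": "no", "carbon_footprint": 7.1}
china_loose = {"origin": "china", "form": "loose", "organic": "yes", "carbon_footprint": 5.3}

# ...become comparable vectors that an algorithm can search, cluster or rank.
print(encode_tea(assam_bag))    # [1, 1, 0, 7.1]
print(encode_tea(china_loose))  # [0, 0, 1, 5.3]
```

Once every item is described by the same set of numbers, it can be searched, compared and ranked by an algorithm, which is what the next step builds on.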
Okay, so that's the first step, right? So you've got all this data, you've nicely classified this data. And the second step is how do you match the different dimensions of the data you've collected, the metadata, into something that is one number saying, "You will like this, or you won't like this." So this is an algorithm which looks at your multiple preferences and says, "This is the highest preference." How is that done? Well, we already have those algorithms. We have the same algorithms to manage your photo collection. We have Siri and Alexa understanding a voice command. Or our smartwatches detecting when our heart isn't beating rhythmically, right? So basically it's just a data stream. So what you're going to do is find a pattern-matching algorithm that takes that data stream and, whatever it is trying to do, makes sure it understands your preferences as one number.

So let's take an example:

Netflix. And in this example, I'm actually going to go back to Netflix in 2001, right? Right when it started. Some of us may remember this, but Netflix was not always a streaming service. It started out with DVDs by mail. In the UK they called it LoveFilm. But basically the idea was, you have a list of all the movies you want to watch on the website, and Netflix will mail you three DVDs, right? And then you watch them at your own pace, and then send them back and, you know, Netflix charges you 10 bucks a month, right? So what it was trying to do at that time was persuade people to keep paying that 10 bucks a month. It didn't really mind whether you actually borrowed any movies, so if you, you know, kept the same movie for one month, two months, no problem, there were no late fees. But at the same time, it didn't want you to stop paying the 10 bucks. So it wants to make you continue watching the movies. Fine. How do you persuade people to watch movies? You recommend good movies to them. But how do you get those recommendations? The first one was recommendations by employees or editors. You know how well that works. I mean, you know how many times do we read The Guardian or we read, you know, the FT or the New York Times, and we get these movie recommendations, then we watch it and you're like, "What the... this is awful!" Right? It happens to me all the time. At least, let me put it this way, it happens more with my wife than with me, because I pick the movies and she looks at the movies and she says, "That was a horrible, terrible movie." I'm like, "But the New York Times loved it!" I never want to see a New York Times movie again. So the moral of the story is this doesn't work very well. And of course people were hacking these things all the time. So I'll give you another example. If you have a brand new movie that's just been released, like, let's say "Avatar," right? Just released a couple of months ago. If two people have that recent DVD on their list, who gets that movie first? Well, the newer person, because I don't have enough data on him yet. The person who's already been in the system for a year does not get a loyalty bonus. I discriminate against them by sending them a different movie from their list because I'm trying to build up my information on the new guy first. So people knew this, of course. So what they would do is, you know, cancel your Netflix subscription and then sign up under a new name. So you are always a new subscriber. Or you would get together a group of your friends: you order the movie, one of you gets it, you share it around, and then you return it, right? So, pretty standard workarounds. So they were trying to solve these problems like this. Keep people hooked, keep people staying, paying the $10 a month, and staying within the system. So they came up with a system called Cinematch. How did that work? Something called "collaborative filtering." And collaborative filtering is a popular way in which you can actually try to find out, you know, how do you recommend things to people? So let's take a simple example. You have two people, two movies. We have Pauline who has rated both "Avengers" and "Spiderman." 2 for "Avengers," 2.5 for "Spiderman." Julien has rated "Avengers" as 3. The question is, what would Julien rate "Spiderman" as? The easiest one, the "slope one" strategy, says these two people are alike, they will rate it in the same way. So you could say 2 to 2.5 means 3 to 3.5, maybe. Or it's a 25% increase, so 3 to 3.75, right? You have to pick a strategy, but that's the basic story.
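Here is a minimal sketch of that additive "slope one" idea, using the Pauline and Julien numbers above. It is an illustration of the strategy, not Netflix's actual Cinematch code.

```python
# Minimal sketch of the additive "slope one" strategy: predict Julien's rating of
# "Spiderman" by shifting his "Avengers" rating by the average difference other
# users show between the two films. Illustration only, not Netflix's Cinematch.

ratings = {
    "Pauline": {"Avengers": 2.0, "Spiderman": 2.5},
    "Julien":  {"Avengers": 3.0},                    # Spiderman rating unknown
}

def slope_one(user: str, target: str, known: str) -> float:
    """Predict `user`'s rating of `target` from their rating of `known`,
    shifted by the average (target - known) difference across other users."""
    diffs = [r[target] - r[known]
             for who, r in ratings.items()
             if who != user and target in r and known in r]
    avg_diff = sum(diffs) / len(diffs)
    return ratings[user][known] + avg_diff

# Pauline rates "Spiderman" 0.5 higher than "Avengers", so Julien's 3 becomes 3.5.
print(slope_one("Julien", target="Spiderman", known="Avengers"))  # 3.5
```

The percentage-based strategy mentioned above would instead multiply Julien's 3 by Pauline's ratio of 2.5 to 2, giving 3.75; either way, you have to pick a rule.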
But then it becomes more complicated. So for example, now let's suppose we have three movies and three people. Cesar has seen all three. Blanche has only seen two. And Emma has only seen two. The question is, how does Emma rate "Avengers" versus how does Blanche rate "Wonder Woman?" Okay, so what do we do? One way is, okay, let's take a look at these two. Cesar likes "Avengers" more than "Spiderman." He gives it a rating of 2 higher, but Blanche gives it a rating of 1 lower.

So let's take the average:

2 higher, 1 lower, divided by two, is 0.5. So Emma would be 1 plus 0.5, is 1.5. That's one way of doing it. But you could also go with Cesar. Cesar rates "Avengers" 4 against 1, so there's a difference of 3. So this is 4 plus 3, which is 7, that's another number. So you've got 1.5 under one matching algorithm. You've got 7 under a different matching algorithm. And then you weight them. Two-thirds the weight for the first, one-third the weight for the second, because there are two sources for the first one, one source for the second. And, you know, you can get as complicated as you want. Then you start saying, okay, which of these people are closest to each other? Whose ratings are really close to each other? Let's take those. Or use, you know, only customers in the same cluster to make predictions. So the algorithm starts becoming more and more complicated. All right. They used what they called an ordinal logit model, right? Just a technical term. All that means is the dependent variable, which is how much these people like a movie, the thing you're trying to predict, is on a scale of one to five, right? So all we know is five is greater than four, three is greater than two, and so on. But the key point is, the differences between the levels are not the same. For example, if you think, some people might say, "Five is really, really unbelievably good." Very few movies will get a rating of five. So the difference between four and five is really big, the difference between three and four not so big. Below three, they're all the same, it doesn't matter. So you know, you just... So you have to adjust for all of these things. So what are the problems they had? First problem was the cold start problem, right? The cold start problem is, a movie's just been added, how do you rate it? 'Cause nobody's watched it yet. Or some new user's just arrived, he has no preferences. How do we rate him, what do you predict for him? Second problem, popularity bias. So the more popular a movie is, the more people will rate it. So you'll end up just recommending popular movies to everybody. The older movies never really get recommended. Third one, sparse data problem. It's a huge database, but most people don't rate movies. So you have mostly zeros, only a few data points in between. And finally, noisy data problem. You might love a movie, but you might feel a little embarrassed to tell your friends you really liked that movie. So you tell Netflix that you liked it, or you didn't, but what you report can be, you know, very different from what you really felt about the movie. Okay, so how did Cinematch deal with that? They called it the alternating least squares model. This is getting more and more complex. It's just ways of manipulating data. And what you have is a complicated database that looks like this. People, movies. And you're trying to fill in these missing bits. So what they did was separate it into two databases. One is called a user matrix and the other is called a movie matrix. So you fill in average numbers here, random numbers there, and then you multiply these two together to see how close you're getting to the original database, right? That difference is what you're trying to minimize. It's measured by the root mean square error, right? So that's what we are going to try to do. We're going to minimize that.
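Here is a compact sketch of that alternating least squares idea: hold the movie matrix fixed and solve for the user matrix, then hold the user matrix fixed and solve for the movie matrix, and repeat, so that the product of the two gets as close as possible to the ratings you actually observe. The toy ratings below are reconstructed from the three-person example above, and the number of latent factors and the regularization value are illustrative assumptions, not Netflix's actual data or settings.

```python
# Compact sketch of alternating least squares (ALS): factor the sparse ratings table
# into a user matrix U and a movie matrix M so that U @ M.T approximates the ratings
# we observe. Columns are "Avengers", "Spiderman", "Wonder Woman"; the numbers are
# reconstructed from the worked example and are assumptions, as are k and lam.
import numpy as np

R = np.array([[4.0, 2.0, 1.0],    # Cesar rated all three films
              [3.0, 4.0, 0.0],    # Blanche (0 means "not rated")
              [0.0, 1.0, 4.0]])   # Emma    (0 means "not rated")
observed = R > 0                  # mask of the cells we actually know
k, lam, sweeps = 2, 0.1, 50       # latent factors, regularization, iterations
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(R.shape[0], k))   # user matrix, random start
M = rng.normal(scale=0.1, size=(R.shape[1], k))   # movie matrix, random start

def solve_rows(other, ratings, mask):
    """Regularized least-squares solve for each row of one factor matrix,
    holding the other factor matrix fixed."""
    rows = np.zeros((ratings.shape[0], k))
    for i in range(ratings.shape[0]):
        obs = mask[i]                                    # entries observed for this row
        A = other[obs].T @ other[obs] + lam * np.eye(k)
        b = other[obs].T @ ratings[i, obs]
        rows[i] = np.linalg.solve(A, b)
    return rows

for _ in range(sweeps):
    U = solve_rows(M, R, observed)        # fix movies, solve for users
    M = solve_rows(U, R.T, observed.T)    # fix users, solve for movies

pred = U @ M.T
rmse = np.sqrt(np.mean((pred[observed] - R[observed]) ** 2))  # root mean square error
print(f"RMSE on the ratings we observe: {rmse:.3f}")
print(f"Filled-in guess for Emma's 'Avengers' rating: {pred[2, 0]:.2f}")
```

The root mean square error printed at the end is the quantity the Netflix contest, discussed next, asked entrants to improve by 10%.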
All right, so what did Netflix need this big data for? As I said, they wanted to keep you within their ecosystem, but they had other issues. They wanted new shows to commission. "House of Cards" was probably the one which people say, okay, that was the one where Netflix really used big data to predict that, you know, this is a great movie, a great series, which people will watch. Every other studio had turned it down. Or, how much to pay for new films, right? Or, how do you get people to actively rate a movie, as opposed to just passively watching the movie, right? Now, of course, it's much more sophisticated. Netflix has tagged every movie. Somebody has watched it and tagged everything: "Chinese American blinks eyes while doing this," right? And so when you are watching a movie, and you pause or you click, it knows exactly when you're pausing. And it's got all the tags on the movies already. So it knows everything about who watches a movie, at what point they stop, when they go on to a different movie, everything, okay? So more sophisticated. But at that time, they didn't have this. This was the DVD days. So what they did was announce a Netflix prize. It wasn't a new thing, right? Because remember, this has been going on for centuries. So in 1418, the Duomo in Firenze... The city announced a competition to build the dome of the Duomo, won by Brunelleschi, at that time. In 1714, the Longitude Act was to try to find a precise way of getting longitude from anywhere in the world. Won by John Harrison. The X-Prize in 1995 offered a whole bunch of prizes for different things. And you can still see other prizes today. Go to InnoCentive.com. You can go in there and try to find examples of prizes to win. So Netflix had a big problem, right? How do we design this thing? We are trying to collect this data. We're trying to solve the big data problem. Should I use my own platform? Then people will say, "They can manipulate the data." Should I use an outside host to be more neutral? How much data should I release? If I release too much data, it's confidential data, it'll be available to everybody. What about customer privacy? They can reverse-engineer who these people are. And that actually did happen to Netflix. They got sued. Intellectual property, right? Is it an anonymous contest, is it an open contest? What happens if the winner says, "I own the intellectual property here. You want me to give you my intellectual property, give me more money." Or what about the losing solutions? They may be interesting. Just not good enough, but they may be interesting. So who owns the intellectual property? And what happens if the algorithm is stolen by your competitors, right? So a huge number of problems they had. They had worries about designing the problem as well. How do you specify the problem? It has to be very clearly defined, right? How big an improvement in the root mean square error should they ask for? How long should the contest be kept open? If you say, "I want a 10% improvement," but nobody manages to get there, people will lose interest in two or three years. So how long do you keep the contest open, right? What happens if there are multiple winners? What should the size of the award be? They had to solve all these problems. And eventually, in October 2006, they announced the Netflix Prize, a million dollars, and invited the public to devise a recommendation algorithm that could beat their Cinematch program by 10%. And the contest would be open for five years. That was the thing they put up. That was the Netflix Prize. You could just click there. They took basically every piece of data that they had and gave it to the public, right?
And if no one won within a year, they'd give $50,000 to whoever had achieved the best result so far, and they would keep awarding the same amount every year until someone won the grand prize, right? And they released the database: a hundred million customer ratings for anyone interested in cracking the code. You can try this yourself, by the way. I'll put the website out, but it's called GroupLens.com. You can download that data and try winning. You won't win a million because it was already won in 2009. So these are the two top scorers. They merged with each other, and they came up with an algorithm and won. It was called BellKor's Pragmatic Chaos. And they won a million bucks. Now, they increased the accuracy of the Netflix system by at least 10%. Excellent. What was the problem? Well, the problem was there were some movies which nobody could classify.

Example:

"Napoleon Dynamite." Has anyone seen this movie? Yeah, some people have. Okay, it turns out this is a movie which is unclassifiable. Nobody knows why people like this movie. People who like romantic comedies like this movie; action movies, like this movie. You can't predict this.

Another movie:

"Miss Congeniality." Nobody knows why people like that movie. You take these movies and you keep them in the database, it screws up the accuracy of the entire algorithm by 5%. So the eventual thing is they had to take out all these movies which cannot be classified. Nobody knows, in spite of the big data, nobody knows why these things work. Okay, final part. We've got the data, we have classified the data. We have merged all those preference streams using an algorithm. Next question is, can we actually use that to make predictions about our behavior? So we have the huge volumes of data. We have frequent feedback for the system. And the system can self-adjust. So we don't know why the system is self-adjusting. All the system is trying to do is to keep errors within a certain boundary, but it changes the weight by itself. You don't know what it's doing. So that means it throws out some data, says,"That data is no longer fresh, I'm going to throw it out." We don't know why it's doing many of that thing. But let's take some examples. There's a paper on Twitter and Foursquare data, a paper on Facebook data, and a paper on loan application data. Just three examples here. So the first one, they say Hristova was actually a PhD student at Cambridge when I came across this paper. So it's an interesting paper because it looks at data from Twitter, data from Foursquare, and data from Flickr, right? So these were all geotagged tweets, geotagged photos, and when you check-in. For example, if you have someone who says,"I'm at El Palacio De Hierro in Mexico City" on Twitter, and she checks-in on Foursquare, you merge the two together. What she ended up with was 38,000 users of Foursquare and Twitter. 433,000 connections on Twitter. 550 check-ins. 3 million user transactions. She had converted a social network into a place network. Now why is this useful? Well, one of the things it's really useful for is real estate. Let's take an example. This is London. And so this is the deprivation rank, how deprived a particular area is. And this is a diversity rank. And the one I want to point out is Hackney. Hackney is apparently an incredibly deprived part of London, second most after Newham. But it's also ended up being incredibly diverse, right? So everybody was going to Hackney. Younger people were going, older people were going, you know, gay people, straight people, doesn't matter. Hackney was full of people. And so what these guys predicted was that house prices in Hackney would go up over a period of five years, and it nearly doubled. So it went from 326,000 to 546,000, just on the basis of big data. Another example, psychometric data. This is a paper by David Stillwell, who's a colleague of mine at the Judge Business School. So David, as part of his PhD program, developed a personality test. So what the idea was, you look at pictures like this and say,"Who do you prefer, A or B?" Or, "What word cloud is more you?" Is that "weekend, home, happy," or is it, "universe, music, dreams?" What is more you? So people will answer those question, not just one or two, it'll be a hundred or so questions. And you come up with a scale, right? Their are five scales, I'll talk about them in a bit, but this is the openness scale. So if you pick A, A, A, A all the way, you're conservative and traditional, if you pick B, B, B all the way you're liberal and artistic. That's an example, okay? 
Now, what he then did was people started sending those quizzes to their friends, and their friends, and eventually he came up with an app called the myPersonality app, where it turned out that 6 million people signed up to take the personality test through the app. And he asked them, "Okay guys, you know, you're taking this test, would you mind giving me your data, allowing me to take your data from Facebook?" And a significant chunk of them said, "Yeah, sure, you can take my data from Facebook." And that's what he did. So what he then did was, you've taken the test, so I know your personality, right? But I also know what stuff you're writing on your Facebook page. The question is, can I use the stuff on your Facebook page to predict your personality? That's what he wanted to find out. So if you're an extrovert,

you would use words like this:

"party, great, I'm missing you guys so much," really elongate the vowels over there. If you're an introvert, on the other hand,"anime, internet, computer, Pokemon, manga." You would use elongated vowels, but usually it's about negatives, so, you know,"dammit, noo, ooh," whatever, right? So, he went beyond that. He said, "Well, let's see how the predictions compare to human beings." So what you have over here is the accuracy. And what you have over here is the number of Facebook likes. Now what you see is your average work colleague is not really very accurate about your personality, because most of us are professional at work, right? We don't scream at our colleagues. We are pretty professional. So they're about 27% accurate about your true personality. Your friends are a little more accurate, but still edging on the positive side because, again, they're our friends because we don't scream at them, right? So that's about 45%. Your family's about 50% accurate. They can see both the positives and the negatives. David says that the computer's average accuracy is 56%.

To put that in another way:

Facebook knows you better than your own mother. Okay, well, he went further. So he had data on their Facebook likes, on art, CNN, BMW, and so on. And then he extracted a hundred components using a singular value decomposition method. And then he asked, "What can I predict based on the stuff which you like on Facebook?" So the dependent variable is your age, gender, political view, and so on. And this is what he can do. Predict whether you're single or in a relationship. Do you smoke cigarettes? 73% accurate. Are you Caucasian versus African American? 95% accurate. Are you gay? 88% accurate. Or Christianity versus Islam, 82% accurate, and so on. So this is all based on the data which you put on your Facebook page. Last one, loan application data. This is a paper which looked at people who are applying for a loan on a website called Prosper, which is a peer-to-peer lending site. People would come in and they would say, "Okay, let me try to borrow money from a bunch of strangers." So you write an essay about why you need the money, okay? And so here's what these guys did. We know that 13% of the borrowers eventually defaulted on the loan. So they asked, "Can I predict who's going to default based on the description they wrote?" Let's take two examples.

Borrower one writes:

"I'm a hardworking person, married for 25 years, have two wonderful boys. Please let me explain why I need help. I use a $2,000 loan to fix our roof. Thank you, God bless you, and I promise to pay you back." That's one.

Borrower two writes:

"While the past year in our new place has been more than great, the roof is now leaking, and I need to borrow $2,000 to cover the cost of the repair. I pay all bills, car loans, cable, utilities on time." How many of you would say borrower one is a better borrower, will more likely pay you back than borrower two? Okay, some hands up there. Okay, some hands up there. All right, borrower two? Okay, turns out actually borrower two is in fact much better than borrower one. And it turns out, if someone tells you they will pay you back, they will not pay you back. The more assertive the promise, the more likely they're going to break it. If someone says,"I promise I will pay you back, so help me, God!" Gone, your money's gone. Okay? Religion doesn't matter. Someone who mentions God is 2.2 times more likely to default on their loans. Another example of using, you know, big data to make inferences about us, right? So the ultimate goal of big data is complete personalization. This is fake. It was tweeted around that time as an example of personalized news. So you have the Wall Street Journal saying,"Trump softens his tone" sent to a Democrat area, and "Trump talks tough on wall" sent to a Republican area. This wasn't really true, because actually it was different editions on the same day. At the beginning he started talking tough, later he softened his tone. But they put it together to make it look like that. But anyway, the point is, this is the ultimate goal of big data, right? Give you exactly what you want, like this. But this means that there's a big problem here. And the problem is that, at the end of the day, we don't know why these things are happening. We are still measuring correlations. This is not causation, these are correlations. So what we are doing is putting people into groups and saying they are like this, using something like principal component analysis, discriminant analysis. You put them into groups, and we say,"You are like that because you are with that group." But are you like that? It's unclear, right? Let's take a couple of examples. Kaggle had a competition in 2012 to find out what are the best used cars, the ones with the highest quality on the market, the ones that don't get into accidents. Turns out the answer is orange cars, bright orange cars. Why? I don't know. They didn't know, right? Maybe orange car drivers are more careful, maybe they're so eye-catching, other people keep away from them, nobody knows. It just pops out of the system. Or another example here in England, the famous example of cholera, which broke out here in England after 1839 when it was brought in from India. So people were dying all over London. And there's a book called "The Ghost Map" by Steven Johnson, and he writes about the marvel of causal inference that John Snow did. So John Snow basically thought that it was transmitted through the water, eventually narrowed down his investigation to one pump on Broad Street where everyone was drinking water and dying subsequently of cholera. So he persuaded the authorities to remove the handle of the pump, and cholera stopped in that area. And in fact today, every year, they have the Pumphandle Lecture, which is on public health. And afterwards everybody goes to lunch at the John Snow pub. So worth checking out as well. But that's a causal inference story. The correlation story was his colleague William Farr, when the prevailing theory at that time was this is caused by miasma, right? It's the bad air which is causing cholera. 
And he did a beautiful analysis where he showed that the closer you were to the Thames, the closer you were living to the Thames, the higher the probability you were going to get cholera. And unfortunately, you know, he was wrong, but the data was there, the correlation. There was no causation there: what was actually going on was that, you know, the closer you were living to the Thames, the more likely you were drawing water from that same pool, the same watershed, while people living further away were drawing from a different water body, and that was what was causing it. Big data says we can try to solve these problems by collecting more and more data. But at the end, we don't know why. And that's going to be a problem. And I'm going to talk about that problem next time as well. So next time it's going to be on AI, and how AI works. Things like ChatGPT, we'll talk about that. And the question will be, you know, how do these things carry out a conversation with us, sometimes a very creepy conversation, but how does that work? How do these algorithms work? And this is an issue which I'm going to talk about again next time, all right? And I think we are at the end of it. Okay. (audience applauding)- First question from Anna Lisa. She says, "From this lecture, it looks like big data heavily rely on behavioral information. Do you think in the future new ways and metrics to capture users' behaviors will be used or developed?"- That's actually a very, very good question.- It is a good question, yeah.- Excellent question. The thing though, there are two parts to this, right? We use behavioral data, we use how people are actually behaving in big data. The point is, we don't know why, right? Take another example, which I didn't mention here. Turns out that one of the big patterns you see is that the sales of nappies and beer are highly correlated. Why? Is it that men, you know, when you're going shopping for the baby, you pick up a can of beer to keep you going, or, you know, maybe women like drinking beer? We don't know why, but that's a high correlation, right? So there is no behavioral story here. Without the story, it's really difficult to come up and say, okay, this is what we're doing. That's problem number one. But problem number two is actually more insidious. It is, I predict what you're going to do based on these behavioral patterns, and then I give you information about precisely the things which I think you're going to do. But because you have no other sources of information, you behave differently from how you would do it otherwise. So in a way it's like, you know, I give you information so that you do something, and you have nothing else to do, so you do that. So it reinforces, it becomes a vicious circle, right? How much of one is driving the other, I have no idea, but it's a very good and scary question.- Yeah, that's a good question. The second question actually is a bit related. It's from Pedro C, and he says, "What is your opinion on apps using users' data to purposefully make themselves more addictive? And at what point does it become unethical?" So I think that question sort of takes you in several possible directions, doesn't it? Are you affecting the thing you're measuring, or can you collect more data than is ethical through addiction?- That is true. I personally have to say that I don't have any social media of any type, right? I don't use Facebook, I don't have LinkedIn, I have nothing.
And whenever I go to Amazon or anything like that, I'm always using a VPN or some other thing to say where... I probably go a little too far, right? But for most of us, the question is, these are addictive, right? I mean, if you use TikTok for example, people can spend hours just scrolling through TikTok. I mean, it's a nice time-passer thing, and it keeps pushing stuff on you. I'll spend a lot of time talking about that in my final lecture, which is about the dark side of all these things, right? At what point does this go beyond giving you what you want, to making you do what they want you to do, in a way, right? This sounds vaguely, "they want you to," it sounds vaguely paranoid already, right? But wait till my sixth lecture.- Great answer. So I did have another question actually, if you don't mind. So harbingers, I wanted to ask you about. There's all this work on harbinger customers. And so a harbinger customer, if I've got this right, is, you know, let's say you are a supermarket and you decide to have turnip ice cream. And it's not very popular. But it turns out that certain customers are much more likely to buy these duff products, and they're called harbingers. And the interesting thing about harbingers, I think, is that they all tend to live in the same sort of place. And so they're called "harbingers" because they spell doom. So if you look at your customer base and harbingers are buying stuff, that product is a duff one. You know, it's the turnip ice cream. So my question, though, was there's an example of pulling data in from multiple different data sources, you know, where people live and where... and that seems to me the point where privacy really starts to become very challenging. I wondered if you'd sort of given any thought to either the ethical position or the sort of practical consequences of this?- It goes back to that question of how much should you protect yourself, right? I mean, at the end of the day, and one of the topics I'm also going to touch upon in my sixth lecture is what happens if they make a mistake, right? So if the algorithm classifies you as something you are not, how are you going to tell the algorithm, "I'm not that type of person." For example, let's take a simple example. Google Photos. A lot of you may be using Google Photos. Google Photos identifies everything, right? So it says this is a picture of a beach, this is a picture of a mountain. So when you type in "beach" or "mountain," it'll give you a picture of a beach or a mountain. But it turned out that when people uploaded pictures of Black people, Google Photos identified them as gorillas, right? And so, obviously, you know, the major reason was that the training dataset was pictures of employees of Google. Who works for Google? White people, Asians, Indians, people like that, but very few Black people. So how did they solve this problem? Any wild guesses?- [Chairman] Better training data.- Sorry?- [Chairman] Better training data.- Well, no. They couldn't solve it, because the algorithm had become so complex. So what they did was get rid of the ability to identify either Black people or gorillas. That's the only way they could solve it. They couldn't figure out what had gone wrong in the algorithm. So the question there is what happens if the algorithm makes a mistake? You know, you don't even know why the mistake is happening, and that's the worry I have, right?
If you are not let out of jail, because some algorithm says you're very likely to commit a crime again.- Yeah, explainable AI.- Exactly.- Or unexplainable AI. Now I'm just watching the time, because Gresham College, being an academic institution, sort of runs on the clock, and we have overrun the clock, which is bad chairmanship on my part. There's only one thing we have time left to say really, which is, Raghavendra, that was a really good lecture. I mean, I really enjoyed it. I mean, I felt it's an absolute treat listening to someone who's so distinguished sort of just talking about this stuff. I mean, for me, it was just a delight. So, thank you very much.- Thank you very much, I really appreciate that. Thank you. (audience applauding)