Art vs. Science: Using A/B Testing to Inform Design

Filmed on July 19, 2016 in San Francisco

A/B testing can reveal less obvious truths about human behavior. At Netflix, we use A/B testing at a strategic level to inform design from concept to shipped product in order to deliver a measurably better user experience.

In this talk, we will walk through insights gleaned from our years of A/B testing on tens of millions of Netflix members, showing relevant examples from the product to help you think about your own designs.

Anna Blaylock
Product Design & Strategy at Netflix
Anna Blaylock has been a designer for over 15 years, and she has spent the last 5+ leading experience design for growth and user acquisition for Netflix. Over her tenure, she has helped the company grow from 20M subscribers in 2 countries to over 80M in over 190 countries, relishing the challenges and lessons learned along the way. Prior to Netflix, she was Creative Director at Match.com. Anna is a Brit living in San Francisco with her husband Robert, and furry baby, Monk_e.
Navin Iyengar
Product Designer at Netflix
Navin Iyengar is a Product Designer at Netflix where he has led design on several shipped products including the Netflix website and Netflix for iPad and iPhone. During his time at Netflix he’s helped the company grow from 7M members in the US to a global internet TV network with 80M members. In his spare time Navin provides mentorship at 500 Startups and Tradecraft. He lives in San Francisco with his wife and son.

Presentation

Art vs. Science: Using A/B Testing to Inform Design at Netflix

[Navin Iyengar]
Hi everyone, my name is Navin Iyengar. I'm a product designer at Netflix, where I work on the member experience: what we sometimes call "discovery". It's the act of choosing something to watch, watching it, and binge-watching on to the next episode.

[Anna Blaylock]
I'm Anna Blaylock. I'm also a product designer at Netflix, and I focus on growth and memberships. I'm thinking about how to acquire new members globally.

[Navin Iyengar]
We're part of the larger Netflix product design team - about 40 of us. We are product designers, prototypers, and visual design specialists. We work on everything the Netflix product touches, from helping you sign up, to making sure your experience is so great that you want to stick around.

Anna and I were talking recently: we've been at Netflix for a long time - over five years each. We realized that Netflix culture is unique and has changed us, for the better in a lot of ways. It has affected the way we think about our design process and approach. Even after we leave Netflix, we're still going to carry around the habits we've formed. So how are we different at Netflix?

We are scientists at Netflix. We're not just designers. That's why we thought it would be fun to come talk to you about how we geek out on both design and science.

Some of you might be thinking that science is a dirty word - hopefully not. We don't think so - we're really into science. We're going to talk about our particular brand of science, which we call "consumer science." We're going to talk about what that specifically means for us, which is centered on A/B testing.

First we'll do a quick primer on A/B testing to make sure we're all on the same page. After that, we'll talk about some of the key learnings from A/B testing that we've taken away for our design process over the years.

To start: what do scientists do? They experiment. They run experiments. This idea - this lens of experimentation - is really what we look through to approach all of our work. At Netflix it's how we frame every project that we do.

The lens of experimentation allows us to try crazy, innovative, or even stupid ideas in a small-scale way before we roll them out to all of our 80 million members. Experimentation allows us to take big bets on risky ideas with confidence.

What underlies the idea of experimentation? The scientific method. Let's go back to school for two minutes - I promise we won't go super long. The systematic approach of the scientific method is a way of finding out reliable information about the world. It's what we want to do with experiments.

The scientific method starts with a hypothesis, which is the statement of a belief that you have about the world. A hypothesis is stated in a way that is provable or disprovable.

Next you go out and run experiments. This is you going out into the real world, and trying to understand real behavior, and trying to prove whether your hypothesis is true or false.

Next is the results phase, where you're using your observations and the measurements you've taken to determine whether your hypothesis was true or false. Whether it's true or false is actually not a big deal: either way you're learning something, right? You're learning that the hypothesis is true and that you should continue exploring that direction. Or you've learned that it is false and probably not an avenue that you want to go down.

Galileo was a scientist in 16th-century Italy. In his time, science had not changed for nearly 2,000 years. It had been defined by the Greek philosophers and had stayed fixed all through the Middle Ages.

One of the things that the Greeks believed was that heavier objects fall faster than light objects. That seems intuitive on the surface, right? If you see a stone falling through the air, it falls fast. If you see a feather falling through the air, it kind of floats down slowly, right?

Galileo decided to challenge this idea and he had a hypothesis. His hypothesis was that weight, or mass, has nothing to do with how fast something falls through the air. He designed an experiment to test this hypothesis, and it is now one of the most famous experiments of all time.

Here's what Galileo did: he went to the top of the Leaning Tower of Pisa and dropped two stones. They were about the same size, but one was heavier than the other.

They hit the ground at the same time. In one instant he had disproven what had been a scientific fact for 2,000 years.

Now, as to why a feather falls slower: he couldn't prove it. He couldn't go out and do an experiment. He suspected that it had something to do with air resistance.

As technology improves, you can run more sophisticated experiments. Recently a team from the BBC actually ran Galileo's intended experiment. They made a huge vacuum chamber, got a large stone and a large feather, and dropped them at the same time to show how gravity works.

At Netflix, we really appreciate the willingness to go into the real world and experiment. We use that lens on everything we do. The only difference is that we're not looking at the world of physical objects. We're looking at the world of customer behavior.

I want to talk about all the different tools we use at Netflix, not just A/B testing, because they really complement each other. There are a few different things that we use.

We use surveys. We ask members what they think of Netflix. We do focus groups where we talk to people. We try to understand some of the reasons behind the things people do - what their needs are.

These are qualitative methods. When people talk to us they're giving us a view of things that includes bias.

We also use quantitative methods. We look at trends in data to see what people are doing and how it changes over time. A/B testing allows us to observe differences in behavior without letting people know that their actions are being observed.

A/B testing is the most reliable way for us to learn about our customers. We're going to talk a little bit about how we A/B test.

First I want to address one thing: the "41 shades of blue" story. Do you know it? A few years ago, Google came under fire from the design community for testing 41 different shades of blue for their link colors. The experiment was cited as evidence that A/B testing and design sit at two ends of a spectrum - that A/B testing is only about really small ideas. I think it's a fair point; if this is the only type of test you're running, that's probably true.

One thing to remember is that Google is working at a very large scale. Small changes like this can have very large impacts. In fact, in this case the test created a $200 million-a-year revenue improvement for Google.

I'm not saying that we only run these types of tests - we do run them, because we're also a large-scale business. But I don't want you to think that we spend all day just moving buttons and pixels around. In this presentation, I'm talking about the really big conceptual tests that we run as experiments. We'll go through some examples that illustrate that.

So how does A/B testing work at Netflix? It's a little more sophisticated than the experiments I talked about previously because we use "controlled" experiments. A controlled experiment has two or more experiences: a control that is unchanged from your current product, and variations that are different.

We start every experiment with a hypothesis. Our hypotheses are about customer behavior. For example: we might think we can change acquisition numbers by doing something, or we think we can change our subscriber retention numbers by doing something else.

We start with the control experience, which is just our systems in production as they are today. No change.

Then we create variations. We may use a lot of different variations, but we refine down to the variations that we want to test. This is actually costly; we need to put different people into these variations to understand how they're actually experiencing them. At some point you run out of people.
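To make "you run out of people" concrete, here's a rough back-of-the-envelope sketch - not Netflix's actual tooling, and the function name and numbers are illustrative assumptions - using a standard two-proportion power calculation to estimate how many members every variation you keep in a test needs:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_cell(baseline_rate: float, min_detectable_lift: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate members needed in each cell to detect an absolute
    lift in a conversion rate with a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # e.g. 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
         / min_detectable_lift ** 2)
    return ceil(n)

# Detecting a half-point lift on a hypothetical 10% signup rate already
# takes roughly 58,000 members per cell - and each extra variation in
# the test needs its own 58,000.
print(sample_size_per_cell(0.10, 0.005))
```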

Our variations are generally very broad. They're not small differences.

Then we run a test. People are randomly put into different experiences. When people see one of the variations, they think that that's what Netflix is - they don't know about the other variations. This could be tens of thousands of people. In some cases it could be millions of people.
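As a sketch of how that random-but-stable assignment can work (an assumption on our part - the talk doesn't describe Netflix's allocation system), many experiment platforms hash a member ID together with the experiment name, so a member lands in the same cell on every visit:

```python
import hashlib

# Hypothetical test cells: the unchanged control plus two variations.
CELLS = ["control", "variation_a", "variation_b"]

def assign_cell(member_id: str, experiment: str) -> str:
    """Deterministically bucket a member into one cell.

    Hashing (experiment, member_id) gives an effectively random but
    stable assignment: the member sees the same experience every time,
    and different experiments bucket members independently.
    """
    digest = hashlib.sha256(f"{experiment}:{member_id}".encode()).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

# To this member, whatever cell they land in simply *is* Netflix.
print(assign_cell("member-12345", "nonmember_homepage_catalog"))
```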

We'll observe the experiment data for weeks or even months. After that, we determine a winner based on business metrics that tie back to our original hypothesis. Once we have a winner, it is rolled out to everyone. It becomes the new default experience for Netflix, and we run more tests on top of it.
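Determining a winner is a statistical call, not an eyeball. As an illustration only - Netflix's actual decisions rest on business metrics like retention observed over weeks, and the counts below are made up - here is a minimal two-proportion z-test on a single conversion metric:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conversions_a: int, n_a: int,
                           conversions_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    p_a = conversions_a / n_a
    p_b = conversions_b / n_b
    p_pooled = (conversions_a + conversions_b) / (n_a + n_b)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Made-up counts: control converts 10.0%, the variation 10.6%.
p = two_proportion_p_value(5000, 50000, 5300, 50000)
if p < 0.05:
    print(f"p = {p:.4f}: the lift is unlikely to be chance - roll it out.")
else:
    print(f"p = {p:.4f}: no reliable difference - keep the control.")
```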

In my time at Netflix, we've run thousands of tests. We're constantly iterating on the Netflix experience, moving only in directions that make sense to invest in further.

If there's one thing I could ask you to remember from today, it's this: think about your design work through the lens of experimentation. We know not everybody has access to A/B testing - we're not asking you to do that.

What we're really saying is that there is a systematic, methodical way of thinking about design, as if you are A/B testing. You can apply that thinking to your work, whether you A/B test or not. We think that doing so will lead to more successful, more thoughtful designs.

For our first learning, I'm gonna pass it over to Anna.

[Anna Blaylock]
Thank you. We're going to review three learnings today. The first one I'm going to talk about is the idea of observing what people do, not what they say.

This is Henry Ford. He said: "If I had asked people what they wanted, they would have said faster horses." He's talking about the invention of the car, and that people didn't really know that they wanted it. But once they had it, they were pleased they had it.

At Netflix we agree with Mr. Ford. We also think it's really important to listen to our users - we're not belittling that. But we think we need more to validate our ideas. Let me give you a Netflix example.

These are some of the ways we listen to and observe our users: we survey our members and our non-members. We talk directly to our users. Most importantly, we observe them interacting with prototypes, including through our gold-standard approach: A/B testing.

A few years ago we surveyed a bunch of people that didn't have Netflix, asking them what they would want to know more about before signing up for Netflix. What did the largest portion of people say?

46% said they wanted to know which movies and TV shows were available before signing up. That confirmed an intuition I had when I first started working at Netflix. I had wondered: "why on earth don't we show the whole catalog, like we do on the member side, so users can really understand what's on offer?"

When this survey went out, this is how the non-member home page looked. A lot of things have changed since then. You can see it's really quite minimal, and the focus was all about hitting that button to start your free month. There isn't a lot of information; the trial was the real focus of the experience.

The question was: could we design an experience that got as many sign-ups - ideally more - than this experience by showing the whole catalog? We could have gone straight to A/B testing at this point, and there are endless ways of designing experiments. Instead, we did lots of brainstorming, came up with a ton of ideas and designs, and then prototyped some of them. To get a clearer idea of what we wanted to A/B test, we took the prototypes to user research.

This is one of the new experiences we tested, and here is the control experience for your reference. At first glance you would think they're quite similar, but you can see that there's navigation at the top. As you scroll down the page, you can browse through a lot of different content. You can use the nav to go to different genres, and click on titles for more information. All the while, remember: you can't play a video at this point. You have to start a free trial to play, but you can browse to your heart's content.

When we took this to research, we were hearing from users: "Yes, this is exactly what I need before I sign up." But we were observing that they were getting pretty caught up in scrolling through the titles, trying to go through everything - sometimes even looking for specific titles, and if we didn't have them, they were disappointed.

We started to wonder: is this the right experience for us? We were passionate about this idea. Maybe it wasn't necessarily a slam dunk, but we believed in it - and we wanted to go from believing to knowing. That's where A/B testing comes in.

We ended up testing this idea five times. Why? The first time we tested it, we suspected that it was the right idea with the wrong execution. We built another test version that was very different but kept that same idea of showing the full catalog. Again, it didn't beat the control, and we thought: "Well, there are still a ton of other ways." So we kept repeating with new variations.

We learned that while users might think they need to see the full catalog to decide if Netflix is right for them, they really need to experience Netflix through the free trial. They need the magic of starting to watch something on their TV and finishing it on their phone later, of seeing how quickly videos play with little buffering. They need the experience of actually using the product to see if Netflix is right for them.

It doesn't end there. We did want to fulfill the users' need to understand the catalog more effectively. We continue to look into it. Of course we're also a business, so we need to make it work for customer acquisition.

This is the normal state of the home page as it stands today. We continued experimenting, and now we're able to represent a subset of the catalog so the user still gets that understanding of the breadth of content, without getting bogged down in all this navigation. This isn't interactive. You can't click on the titles.

If you go to the non-member home page today, we're running tests like this that are trying to push the boundaries on how we represent content. We try to balance what the user wants and what we need as a business.

What behavior can you observe? If you have access to A/B testing, that's awesome. But don't fret if you don't have A/B testing. You can be prototyping - there are so many different tools out there, and it's so easy to prototype your ideas. Prototype them, put them in front of users, and listen to what they're saying. More importantly: observe what they do. Are they finding the one thing you want them to find? Are they doing the one thing you want them to do? You might be surprised and uncover things you can improve upon.

I'm going to turn it over to Navin.

[Navin Iyengar]
Our second point: design to the extremes.

How many times have you done this: you get a design brief for a new project. You have a gut reaction about what the design solution is going to be. You crank out a design, give it back to whoever asked for it, and move on to the next thing?

There's this idea that great designers have a solution that just pops into their head, fully formed. They just commit it to paper or the screen, and that's it. That's what makes a great designer.

I think if you have an intuitive idea, it's something to explore. But we've found over years of A/B testing that our intuition is often wrong. Anna's example illustrates that, right? Not only is your intuition wrong, but your CEO's intuition is also probably wrong, your PM's intuition is probably wrong, etc.

The only way to prove it is through A/B testing. That's why we invest so much in A/B testing - it allows us to stay honest and go where the numbers lead, not where the opinion of whoever has the highest salary points.

If you think your first idea is the best idea, how can you really know that? You can explore a lot of different ideas, and if you still come back to that first idea, that's great. It probably was the best idea. But it doesn't make you a bad designer if you can't visualize that solution right away. In fact, we believe that the best solutions actually come from multiple rounds of exploration and continual refinement down to a final idea.

Let me give you an example: concept cars. A concept car is not something that you can design, build, and ship at scale to millions of people, right? A concept car is just a design exercise. The good concept cars help show what's possible. They try to create a true, working system, to see where it breaks down. It frees designers from having to think about exactly how to build something that's scalable right away. It allows them to push on the extremes of what's possible.

For example, with this car, you can imagine the designers might have said: "a car is supposed to be aerodynamic, so how aerodynamic can we make it? And as we make it more aerodynamic, will the cabin start to get too small, or will visibility of the road break down?"

We took the same approach the last time we redesigned the Netflix website. This is the website a year ago. When we set out to redesign it, we started at a high level. We asked: conceptually, what drives user behavior?

One of the axes we looked at runs from active to passive. At the highest level, what is the Netflix experience? You could argue that, since it's on a laptop, Netflix should be more active - most websites are. You click through a lot of pages, click on actors, click on similar titles. You want to spend a lot of time digging through the video catalog that's available.

At the other end of the spectrum, you can imagine that Netflix should be passive on the laptop. After all, Netflix is entertainment, right? You want to sit down in front of Netflix at the end of the day, crack open a beer, you don't really want to think, you just kind of want to have a few choices, and click on something that starts playing, right?

Another axis we looked at runs from uniform to differentiated. What I mean by this is: are all the titles you see on your screen presented on an even playing field, so you're just looking at the artwork to decide? Or are we promoting certain titles - making some more prominent, making them seem more important - to focus the user? We took those axes and created designs that push those ideas forward.

This is something I took directly out of our research sessions - a Keynote movie that we created. It's a non-interactive prototype that let us show people examples of what we could do if we pushed to the boundaries of these axes. In this example, you can see there is some differentiated sizing and there's not a lot of information up front, so we were going for a more passive experience. We're not showing as many titles as our current experience does.

Here's something that was on the opposite end of the spectrum. As you can see it has uniform title sizing, but when you clicked on a title you got a much larger canvas. You got a lot more information about the title. You could click on similar titles from here. You got larger actor photos, directors, creators, and genres - all the things that allow you to get lost in the catalog and dig around.

None of these ideas were actually "the thing" that we wanted to move forward with. But by exploring these we revealed some of the kernels of truth that we were able to apply back to what we thought was the best final design.

What we actually launched was probably around here. This was not the first intuitive idea we had, but by understanding what the range of possibilities were, we were able to come back to this.

Think about what the extremes are next time you're working on a design project. Think about what your axes are, and try to understand which directions you could go. Your instinct might be to start in the middle, but a lot of design solutions actually live out in these four quadrants. Exploring them could reveal better ideas and make your instinct even better.

One thing I want to be clear about: you don't have to share these ideas. These ideas might break down or they might seem very extreme. You don't have to share them with your PM or your client. These are "back-pocket" ideas for you, unless you want to share them.

When someone's asking if you pushed in a certain direction, you can say: "well, I thought of that, but it doesn't work, and here's why." In that sense you will have a much more thoughtful design discussion.

[Anna Blaylock]
Our third point is the idea of using metrics as your compass. To illustrate, I'm going to talk about shoes.

This is a stiletto. What's the success metric for designing a stiletto? You are designing for style. It's all about getting that tall heel that elongates the legs... makes them nice and slim, a bit sexy, etc. That's the stiletto.

When you are designing a running shoe, the success metrics are very different. You're designing for comfort. It's all about supporting the arches of your feet and providing a nice soft landing when you're running. It's all about stability, so you land evenly. There are all sorts of things that you can do with running shoes.

Stilettos and running shoes: they're both shoes, but you design them in very different ways depending on the success metrics. At Netflix we always have success metrics in mind when we're designing. We think that the design changes radically depending on the success metric.

If you don't think about success metrics, it could lead to some pretty unusable designs. No one is going to buy a running shoe that has a massive stiletto heel. Super uncomfortable. I might get a running shoe with a little glitter, but other than that, this does not look ideal for running.

Back to Netflix, away from the shoes. The shoes are finished.

Imagine your friend sends you a link to "House of Cards" on Netflix. If you're a member and you clicked on that link, you'd see a page like this. What are the success metrics for members at Netflix?

We think about streaming hours. The more people stream, the more likely they're going to stay with the service. They're basically satisfied if they're streaming more. We've designed this page to encourage members to play the content. The play button is very prominent on the page. You can play episodes and trailers, and if you scroll down there's all this other wonderful content to play. The focus is on enticing the user to click play.

Take that same link a friend sent you, and pretend you're a non-member (you haven't joined Netflix). The page looks very different. The success metrics for non-members are all focused on starting a free trial of the service, and then converting them to paid members.

The most prominent piece of this design is the free month button. It's far more minimal than the other experience I showed you. It's super focused, we still have a nice rich visual showing "House of Cards". But the main thing that we want the user to do is start their free trial and experience the service.

We continue to test ideas on both these experiences. But no matter what: they're always likely to be quite different because of the different success metrics.

How can you apply the concept of success metrics to your work? At the beginning of a design project, think about what your success metrics should be. Then as you work, try to stay true to those metrics. Think of them as the guiding light for how you design. You may or may not be able to see data on your designs, to tell whether you were successful. Either way, we think it will help you create more focused, usable designs.

[Navin Iyengar]
To conclude: thinking about your work as an experiment and framing it with this more systematic, experimental approach can lead to successful designs. That's what we've done at Netflix through the years, with plenty of A/B testing. To recap:

1) Observe what people do - not what they say - to get at the truth of their behavior.

2) Design to the extremes to really explore the full spectrum of solutions.

3) Use metrics as your compass to make sure that your design is doing the job that it was meant to do.

Thank you.