Split Testing for Product Discovery

by Bryan Woods

In the talk titled "Split Testing for Product Discovery" at Rails Conf 2013, Bryan Woods explores the concept of split testing as a strategic method for enhancing web businesses. The discussion emphasizes that split testing is not just about superficial changes aimed at improving revenue and conversion rates; it serves as a tool for uncovering customer preferences and potentially guiding product development. Woods, despite lacking a formal business or economics background, presents some foundational economic concepts to illustrate his points.

Key Points:

  • Understanding Scale: Brick-and-mortar businesses hit a ceiling because their costs scale with demand, but web businesses can turn small margins into good profits because the scale exists online. Woods likens the ideal web business to a "black box" that takes in one dollar and prints out two, and presents A/B testing as one way to move toward that goal.
  • Speed vs. Data: While rapid deployment of features is crucial in agile development, it is essential to back it up with data analysis and metrics. Woods warns against merely focusing on speed without assessing whether progress is meaningful.
  • Basic A/B Testing: The video outlines common A/B testing practices, which can lead to incremental gains. Bryan introduces tools like Visual Website Optimizer and Optimizely, which allow marketers to experiment with minor changes and measure their impact on user conversion rates.
  • Going Deeper with Testing: Beyond basic A/B tests, Woods encourages teams to identify customer needs and the value of potential features. He emphasizes the importance of testing fundamental assumptions about business models, service pricing, and user engagement strategies.
  • Case Studies: Woods shares experiences from howaboutwe.com, the dating site where he works. He explains that users who receive more messages are more inclined to engage, and describes features such as a "surprise me" button that suggests ideas to users who are stuck when posting a date.
  • Statistical Rigor: He discusses the importance of using sound statistical methods to validate test results, suggesting that data analysis should guide decision-making rather than merely following best practices blindly.
  • Tackling Customer Feedback: The talk highlights how A/B testing can serve as a defense against conflicting user feedback, allowing organizations to rely on data rather than subjective opinions.

Main Takeaways:

  • Rapid development is not fruitful without a direction backed by data.
  • Incremental improvements through consistent A/B testing can yield significant results over time.
  • Engaging deeply with user behavior and preferences can lead to better product discovery.
  • A/B testing should challenge and validate assumptions about the business model and user experience.

Woods advocates for robust, data-driven experimentation as a pathway to not only optimize web applications but also to foster genuine understanding of what leads users to convert and engage productively.

In this talk, we'll explore split testing as a way to not only increase revenue and conversion through simple, surface-level changes, but also to dig deeper in order to help guide a product's roadmap by discovering which features customers really want and how much they're willing to pay.

Rails Conf 2013

00:00:16.320 okay i'm gonna try this from a pdf and hopefully this works well thanks for being here uh thanks for having me in portland it's really great to be here
00:00:22.320 with all of you uh this is split testing for product discovery uh i've never taken a business or an economics class
00:00:27.760 so i'm sorry if i'm the worst person to do this but just as a 101 on economics and a thought experiment if i was
00:00:33.920 selling dollar bills for 99 cents each how many would you buy
00:00:39.280 all of them is the correct answer right so at least as many as you could possibly raise the money for um so to
00:00:46.399 take a step back and to talk about brick and mortar businesses for a second um let's talk about like stumptown coffee roasters down the street right
00:00:53.520 they presumably make a dollar or more for every 99 cents that they spend on goods and materials but how far can that
00:01:00.239 possibly scale like at some point you're going to hit this ceiling where your profits are just as high as they can get even if you have lines
00:01:06.080 stretching down the street you can't service customers as fast as possible so this kind of reminds me of lemonade
00:01:11.119 stand i don't know how many of you have maybe played this game i think it's an old computer game from even the 70s or the 80s it's been repurposed a few times
00:01:17.840 i think the last one i saw was called like lemonade stand tycoon or something the premise is very similar where you
00:01:23.840 open a lemonade stand and people are kind of walking up and down the street and they stop and they see you and they enjoy your lemonade and eventually they
00:01:30.479 tell their friends and more and more people start coming and they're wrapping along the street and everything's great but eventually you have a problem with
00:01:36.799 scale right because your costs scale linearly with the demand for your product so as people are coming down the street they start tapping their feet a
00:01:42.799 little as they're waiting in line they grumble to their friends a rainstorm comes it you know wipes out your supply
00:01:48.240 of uh lemons and water and that kind of thing so how does that contrast to a web
00:01:53.520 business like for the exact same thing if we could take 99 cents and make a dollar
00:02:00.000 every single time um is that any different than a brick-and-mortar business and i think it is but it all
00:02:05.280 comes down to scale um so on the internet the scale exists unless we're trying to
00:02:11.120 service the literally like the tiniest niche market in the world um the scale exists that we can with
00:02:17.120 pretty small margins make pretty good profits right i think that there's you know we've seen some stuff with daily
00:02:22.319 deals sites and uh this isn't always the case but eventually like this is the holy grail
00:02:27.760 for all web developers right we want to make a black box that is our web business where we funnel in dollar bills
00:02:33.680 and we print out two dollar bills on the other side so i think a b testing is one great way to get you towards that goal even if it's a difficult one to
00:02:39.599 ultimately obtain this is the holy grail right putting a dollar bill into our machine
00:02:45.920 spitting out two bucks is giving us a license to print money so how do we do this like as web
00:02:51.120 developers what do we do to try to get there we talk about all this stuff we talk about agile development rapid iteration continuous deployment what do
00:02:58.400 all these things have in common like they're essential to building agile software everybody needs them on their team but the one thing that they have in
00:03:05.040 common is speed right and this all makes sense we know as agile software developers and practitioners that we
00:03:11.599 don't want to code in a laboratory we want to basically be able to ship features as quickly as possible and see our users using them and we need it but
00:03:18.239 is speed the whole thing right like if we're just shipping code as fast as it can fly out of our fingertips then
00:03:23.760 there's a chance we could be going nowhere fast so we need to be collecting data and analyzing it and getting the right metrics for our business in order
00:03:30.319 to make sure that we're actually moving towards a discernible goal and ultimately that we're servicing our business's
00:03:36.560 bottom line so i think a b testing is often seen as this kind of fluffy thing for marketing
00:03:42.480 people or business guys to kind of tweak around in sheets and that kind of thing
00:03:47.680 but i think you can open up your mind and just think of this as a way to make use of your elite hacking skills or whatever right because it's kind of an
00:03:53.840 optimization problem this is obviously a totally fake graph but you know on one side we have money
00:03:59.120 in or whatever money coming out we see that like ultimately the goal is just to get this thing to kind of fan out
00:04:05.120 and just as the way that you kind of tweak the performance of an algorithm over time um you know little tiny tweaks over time
00:04:11.680 like oh here here's a percent improvement here's two percent as it goes on longer and longer over the course of months or years it can really
00:04:18.239 have drastic effects in aggregate so who am i i'm bryan woods uh this is
00:04:23.360 me with a cockatoo i met in the florida everglades i work at howaboutwe.com
00:04:29.120 uh we're in brooklyn new york we are in the dumbo neighborhood this isn't actually outside of our office window
00:04:34.400 but it's pretty close that's what it looks like uh sorry about this pdf but this is kind of
00:04:40.400 what our landing page looks like right now so we're a dating site primarily i started in 2010 and uh we have a singles
00:04:46.479 product which is just a dating site and then a new one that we launched about six months ago for couples which is um
00:04:51.919 similar kind of things where the premise is just to get people to go on awesome dates
00:04:57.600 so real quick about my business there's a few things this isn't an advertisement but i'm going to give you some examples so it's
00:05:02.720 good to know the one thing is that if you're going to launch a startup everybody says don't launch a dating
00:05:08.080 website the market is crazy saturated uh our competitors have really really deep pockets um and it's hard like it's just
00:05:15.280 everybody is already doing online dating and that kind of thing um so we needed to do a few things we needed to really
00:05:20.720 make sure that we could out compete our competitors um and do it with less money so the data that we're getting and the
00:05:26.560 kind of testing that we're doing is trying to get us towards that goal we're also a subscription paid service so people give us their credit card to use
00:05:32.560 our software and um that ends up being kind of important to kind of get that scale working
00:05:38.560 so let's just start with some very basic a b testing this is the stuff that people think about when they hear a b testing
00:05:43.680 generally i think so this is an almost real example again i've launched my kitten sharing website
00:05:49.280 or whatever i have twitter bootstrap i'm doing everything i know to launch my mvp my header kind of says hello world
00:05:56.240 but i think maybe it's a little bit stuffy so i try like sup everybody you know we're gonna really kind of lighten things up here
00:06:02.880 our customers are you know not easily fooled so it loses but maybe we can i don't know test it with some red
00:06:09.680 color uh maybe if we move the button into this like red cool gradient thing maybe that
00:06:15.039 works better who knows you know this is a common thing like i'll just slap some disgusting
00:06:20.880 satisfaction guaranteed every time sticker so this is just trivial stuff it's cosmetic it's tiny little tweaks onto
00:06:27.039 your pages and if you're skeptical you might be surprised so going back to that optimization problem that i was talking
00:06:32.080 about earlier um these things aren't going to like make or break your business right i think a lot of times you see people
00:06:38.160 kind of bragging on the internet like oh i threw a puppy on my subscription page and now i have a 300 percent conversion
00:06:43.759 improvement um in my experience this is not how it works but we do see improvements of one percent or two
00:06:49.199 percent or three percent and there's no reason that you can't just keep testing that stuff over and over again right there's no like high ceiling like you
00:06:55.120 this is as many people i mean i guess 100 percent right but um eventually like you can just keep
00:07:00.160 improving and over time in aggregate these things really really do add up
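
The compounding effect of small wins can be checked with a little arithmetic. The sketch below is illustrative only: the 10 percent baseline, the 2 percent relative lift per winning test, and both method names are assumptions, not figures or code from the talk.

```ruby
# Hypothetical numbers throughout: a 10% baseline conversion rate and a
# 2% relative lift per winning test. Neither comes from real data.

# Conversion rate after `wins` successive tests that each lift it by
# `relative_lift` (e.g. 0.02 for two percent).
def compounded_rate(baseline, relative_lift, wins)
  baseline * (1 + relative_lift)**wins
end

# Smallest number of winning tests needed to move from baseline to target.
def wins_needed(baseline, target, relative_lift)
  (Math.log(target / baseline) / Math.log(1 + relative_lift)).ceil
end

puts compounded_rate(0.10, 0.02, 12).round(4) # about 0.1268 after 12 wins
puts wins_needed(0.10, 0.15, 0.02)            # 21 wins to go from 10% to 15%
```

Under these assumed numbers, no single test moves the rate by more than a fraction of a point, yet roughly twenty of them raise it by half.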
00:07:05.280 so again fake graph the last one i promise but uh generally the idea here is that with this kind of basic cosmetic
00:07:12.479 kind of a b test this is the kind of trends that i've always seen where it's just small and tiny and you know month
00:07:17.759 over month it doesn't really look like it's adding up too much but if you look at it over time you see like oh wow we went from you know 10 to 15 percent conversion
00:07:24.160 rate um so if you want to do this kind of stuff it's great there's some great tools out there here are two that i've used visual
00:07:30.880 website optimizer and optimizely they don't need any server-side code to be written you can just drop a javascript snippet on the
00:07:37.440 page and marketers and product people can kind of tweak things to their heart's content um if you're using rails which i
00:07:45.120 assume we all are vanity and a/bingo are also great for this and they let you go a little bit deeper
00:07:51.360 so if i was testing a button on the sign up page and one says you know buy now and one says take a tour or whatever
00:07:57.680 both of these allow us to say like okay here's my goal i'm testing signing up a user so my user create action or
00:08:03.039 something i would say i have a goal and i've logged which button the person saw and then in an admin back
00:08:08.240 admin dashboard somewhere they're tallying up the results and showing you significance
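
The bookkeeping described above (record which variant a visitor saw, record when the goal action fires, tally the results) can be sketched in plain Ruby. The `GoalTally` class and its method names are hypothetical and are not the API of vanity or a/bingo; a real setup would persist these counts rather than hold them in memory.

```ruby
# Plain-Ruby sketch. GoalTally and its methods are hypothetical names made
# up for illustration; they are NOT the vanity or a/bingo API.
class GoalTally
  def initialize
    @seen = Hash.new(0)       # variant => number of visitors shown it
    @converted = Hash.new(0)  # variant => number who reached the goal
  end

  # Call when a visitor is shown a variant (e.g. which button text rendered).
  def record_view(variant)
    @seen[variant] += 1
  end

  # Call from the goal action (e.g. after a user record is created).
  def record_conversion(variant)
    @converted[variant] += 1
  end

  # Fraction of viewers of this variant who converted.
  def conversion_rate(variant)
    return 0.0 if @seen[variant].zero?
    @converted[variant].to_f / @seen[variant]
  end
end

tally = GoalTally.new
3.times { tally.record_view("buy now") }
4.times { tally.record_view("take a tour") }
tally.record_conversion("buy now")
2.times { tally.record_conversion("take a tour") }
puts tally.conversion_rate("take a tour") # 0.5
```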
00:08:14.319 so when to implement this kind of stuff i'm not talking about anything too crazy but just basic let's test some colors let's try some
00:08:21.120 different fonts and that kind of thing the first thing the one caveat is that
00:08:26.400 this is not going to be a substitute for finding product market fit um this probably goes without saying but if you've built something that nobody wants
00:08:32.240 and is willing to fork over some money for it's putting lipstick on a pig and it'll feel that way
00:08:38.159 that being said this kind of stuff is really low risk it's low cost and the rewards are possibly high um so if you really can
00:08:45.120 just at the end of work one day come up with some button ideas and just throw them into an optimizely thing and just
00:08:50.320 go home and the test runs there's really no reason that you can't do that from the very beginning of your product like from the moment you put up
00:08:56.480 that mvp there's no reason you can't just have a couple different headers that you're trying
00:09:01.680 but what i really want to get into in this talk is kind of going deeper deeper not just in terms of deeper into
00:09:08.480 our rails application which is the case but also trying to figure out like deep into our organization of what do we
00:09:14.480 want to actually build what do our customers want and how do we use split testing to figure that out
00:09:20.160 so let's go down the rabbit hole
00:09:25.200 so to start uh what should we test first we need to figure out what features do our customers want i know
00:09:31.600 this varies from business to business but often places i've worked there's kind of a product person who has ideas
00:09:36.880 or is reading news and kind of watching trends and has ideas of uh you know our product should have this road map
00:09:42.880 because you know anything ranging from this is like a hip new thing to users have been clamoring for it or whatever
00:09:49.120 but how do we figure out like what the value of these features that we're building is and when we should be building them
00:09:54.720 and what is the simplest version of the feature we can build to gauge interest like what is literally the tiniest
00:09:59.839 little thing that we can do that might just be the glimmer of that ultimate goal of a feature that we can just kind
00:10:05.279 of throw in there as a quick little test and see if there's even any interest or if it you know spikes and changes anything on your site
00:10:13.040 what are our customers willing to pay for which features are free and which require an upgrade
00:10:18.640 and furthermore can you trivially move your paywall around um i think there's no reason not to
00:10:24.240 question very fundamental assumptions about your business model right um for instance i said in our application
00:10:30.640 we charge you to message other people um that's always been the case but we've tried other stuff like
00:10:36.160 you know it's free to message people or you know certain other things are paid or it's paid under certain circumstances
00:10:42.079 or whatever i think most code bases if you haven't thought about this kind of thing it's not just trivial to change
00:10:47.760 the business model that way but it's important and i think that um you know if if you're even like an
00:10:53.680 iphone app right like you're free one day you want to charge five bucks the next day um
00:10:59.200 there's no reason not to test this stuff so how much will they pay are our pricing schemes and payment structure
00:11:05.519 flexible enough to be varied and are we under charging
00:11:12.160 so when and how often should we remarket to our users and are we oversaturating our users are
00:11:18.160 we missing possible sales opportunities um i think one thing here is that i don't know if you guys have noticed this
00:11:23.600 trend um where you sign up for an application or something some fancy new thing you read about on hacker news or whatever and you're at lunch and you're
00:11:30.000 playing with it and you kind of forget about it and you go home and then maybe two days later you get this email from
00:11:35.360 the founder and he's like hey what's up i saw you playing around with the application do you like it is there anything i can
00:11:40.560 do i think somebody probably published an article once that just said this worked great for us and everybody's kind of just doing it i think that's a great
00:11:47.279 example um because if it works then that's great it's getting retention it's getting your users back but it also could be totally bothering them
00:11:54.640 another thing is i know a lot of e-commerce stores have this system where you know if i wanted to buy all of cormac mccarthy's back catalog of
00:12:00.880 novels right i put them all on my shopping cart on amazon or something um i see the total price at the very end
00:12:06.639 and i get a little bit freaked out and i just go away and then maybe three days i get an email that just randomly and
00:12:12.000 surprisingly there's a 30 percent off cormac mccarthy novel sale um these kinds of things again are great
00:12:17.600 you don't want to be spamming people all the time but as long as you're gathering this data it's good to see like which of these
00:12:22.880 kinds of things are actually getting customers back and making sales and which ones aren't
00:12:29.200 and ultimately just any product curiosity you know every user facing feature anytime you think i wonder if
00:12:34.639 our users would like this if you have a cheap way of building these a b tests into your product there's no reason that you shouldn't at
00:12:40.639 least be gathering data about the effectiveness of them and seeing how they affect your customers and your business
00:12:46.480 so here's some examples uh that we've run recently the first thing is um how can we get
00:12:52.959 users to receive more messages not necessarily send we found that this is probably kind of
00:12:58.399 obvious but um since we charge to send messages a good way to get people to want to do that is to make sure that
00:13:04.560 they receive a lot of messages so this is one of the things that we really try to do on our site so here's an example of a feature we
00:13:11.120 have called speed date it's pretty straightforward we just you know we have a photo here and some new dates and some
00:13:16.959 information about them if you click yes it'll send this little message saying hey i'm intrigued that's it and if you
00:13:22.639 click skip nothing so the first embodiment or the first like vision of this feature was as a way to get users
00:13:30.320 who had kind of fallen off back we wanted to say like okay well they haven't logged in in 60 days or 30 or
00:13:36.160 something who knows what's going on with them but like maybe if we can just surface them again maybe they've gotten kind of stale and we can kind of bring
00:13:42.000 them back um that was the idea and it didn't really work i think generally people are gone for whatever reason and they're not
00:13:48.720 going to come back their profiles look kind of stale and it wasn't really that great so the first obvious a b test we
00:13:53.839 did here was well what happens if we show newer users and i thought that that would be kind of a no-brainer if this wasn't working
00:14:01.040 when we dug into the data though we found out that actually something interesting had happened which was that in our larger markets where we have you
00:14:07.199 know lots and lots of people uh new york los angeles san francisco that kind of thing that algorithm worked better
00:14:13.040 because we're showing fresher people they look more active and it's more obvious that they want to be on the site
00:14:18.240 and communicate with you but in the middle of kansas or something it was horrible we're just we run out of dates
00:14:23.760 to show them um so there's fewer results so then we we took that data and we said okay great can we do another algorithm
00:14:30.320 that you know improves for both can we say you know show newer people in new york
00:14:35.360 show older people in kansas and kind of balance it out and ultimately we did
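
One way to picture the balanced version is a selection rule that depends on market size. This is only a sketch: the `Profile` struct, the 1,000-user cutoff, and the 30-day window are assumptions made up for illustration, not howaboutwe's actual algorithm or data.

```ruby
require "date"

# Hypothetical sketch. The struct, the cutoff, and the window below are
# assumptions for illustration, not values from the talk.
Profile = Struct.new(:name, :joined_on)

LARGE_MARKET_MIN_USERS = 1_000 # assumed cutoff for a "large" market
RECENT_WINDOW_DAYS = 30        # assumed definition of a "newer" user

# Large markets show only recent sign-ups; small markets show everyone so
# the feature does not run out of people to display.
def speed_date_candidates(profiles, market_size, today: Date.today)
  return profiles if market_size < LARGE_MARKET_MIN_USERS
  profiles.select { |profile| (today - profile.joined_on) <= RECENT_WINDOW_DAYS }
end

today = Date.new(2013, 5, 1)
profiles = [
  Profile.new("recent", Date.new(2013, 4, 20)),
  Profile.new("older", Date.new(2012, 1, 1))
]
puts speed_date_candidates(profiles, 5_000, today: today).map(&:name).inspect
puts speed_date_candidates(profiles, 50, today: today).map(&:name).inspect
```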
00:14:41.680 so another thing we want to do is you know how can we encourage users to post more dates same kind of thing you're active on the
00:14:47.279 site people are going to like you you're going to have a better experience
00:14:52.639 so a big part of this is our site requires some kind of creativity just because
00:14:58.720 it's not about filling out your profile about yourself you need to kind of think of something clever to do like i want to
00:15:03.920 do you know i want to walk the highline in new york i want to try grimaldi's pizza or something and you know that's not a simple thing
00:15:09.600 to ask for somebody so we just a b tested this little surprise me button um versus not having it so we
00:15:16.560 said you know you click this surprise me thing and we just auto populate it with some ideas that we think would be great for your neighborhood
00:15:22.959 the idea here was that we were going to watch date posting metrics and just see if they're increased and then if people are getting more messages
00:15:29.199 probably no surprises this worked this was another same kind of idea like we want people
00:15:35.519 posting dates we realize it's hard so you click on this little form you're about to post a date and a few seconds
00:15:41.040 later this thing that says stumped kind of fades in and gives you some ideas that also won
00:15:47.360 so should we force them so here's the thing like we want people to post dates we know that at certain points um in their
00:15:55.040 life cycle it's really important to get them active and getting them engaged with other people so this might feel
00:16:00.320 weird right like should we force our users to post dates and this is what we tested so in our sign up flow we have these
00:16:07.040 modals that you can kind of they black out the page but you can kind of skip all of them we tested this as the last step which just says you know if you're
00:16:13.120 on our dating website we want you to post a date like that's what you're here for right and this is kind of a time
00:16:18.399 when they're showing a lot of intent anyway and we knew that this is just important if we're going to get them in and get them involved they should be posting a
00:16:25.279 date so there's no option out of this like you can close your browser but or you know if you know javascript you
00:16:30.720 could probably hide it but for all intents and purposes this is a forced thing so if you're getting this kind of spidey
00:16:36.639 sense going off in the back of your head like this is horrible this is obvious bad user experience we don't want to
00:16:42.560 harm people we want to coddle them um you know this is a famous user experience book that i've loved don't
00:16:49.120 make me think but i like it because it says it's a common sense approach to web usability um and i think it makes sense like this
00:16:55.600 is obviously common sense that you want to limit the friction of your users while they're using your software
00:17:01.199 but there's another thing so your spidey sense is going off because you know best practices you know a best practice in user experience is to say
00:17:08.400 like limit friction but like what is a best practice and if you have enough data do best practices
00:17:14.079 really matter and my answer to that is that sufficient data obviates best practices so what are
00:17:20.160 best practices like best practices are things that work for other people right and if you don't
00:17:25.280 have enough data that makes sense like that's all that you can rely on so if you know some user experience expert
00:17:30.720 says don't force your users to do something you crazy person it can make sense but if you go ahead
00:17:36.559 and try it and you see that actually people don't leave the site they're fine with it and it does get them more involved then your own data about your
00:17:43.600 own business means much more than any best practice
00:17:50.240 so once we start doing this stuff we're you know adding hooks into our application to try to move different levers kind of
00:17:56.799 back and forth we want to basically expand our conversion funnel we want to do that by finding feedback loops and exploiting them
00:18:03.120 so here's kind of a flow of our site right like you sign up you upload a photo you post a date you'll probably get some messages
00:18:10.080 which leads to more subscriptions so you might be noticing a little pattern here which is that this is just
00:18:15.600 a funnel right and i think your whole business is a conversion funnel i think people think
00:18:21.919 of their conversion funnels in overly simplistic terms right so you have a landing page and you convert by signing
00:18:27.039 up and then maybe you do a sale so people are really hyper optimized about trying to test those little places but
00:18:32.720 there's places all throughout your application that you can move just do tiny little lever things and you can find flows all over the place so if you
00:18:39.520 can broaden the base of that funnel you're going to be able to improve revenue and conversion rates everywhere
00:18:45.360 so the results i'm singing the praises of a b testing they're probably not that surprising but we really have seen huge
00:18:51.120 boost to both conversion rates and revenue doing this kind of stuff so i want to talk about our technical
00:18:57.200 implementation a little bit uh and beginning with this so
00:19:02.559 this is something that i think would be probably the most naive solution that you could think of if i wanted to just split users in a and b buckets for a
00:19:10.000 given experiment like what's the easiest thing i could do well maybe if the integer value of this second is odd i'll
00:19:15.919 show it if it's not i won't so this one might be obvious but can anyone think of why this isn't a great
00:19:21.600 idea sure yeah so that's the idea right that
00:19:28.960 if i'm trying to hide your mailbox or something you know you can click one time and it's going to be gone and then if you come in
00:19:34.400 a different second it'll be there and that kind of thing so i think the next somewhat naive solution and we actually did this for a
00:19:40.799 little while is something like this so should we show the feature okay great
00:19:45.840 let's just look at the user id and see if it's odd or even and then we'll show the feature or not
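
The two naive splits described so far might look like the following. The slides are not part of this transcript, so the method names are hypothetical, and treating even ids as "on" is an arbitrary choice.

```ruby
# Hypothetical reconstructions of the two naive approaches; the method
# names are made up, and "even means on" is an arbitrary choice.

# Naive split 1: decided by the clock, so the same user can see the feature
# on one request and lose it on the next.
def show_feature_by_clock?(now = Time.now)
  now.to_i.odd?
end

# Naive split 2: decided by the parity of the user id, so the answer is
# stable for a given user across requests.
def show_feature_by_id?(user_id)
  user_id.even?
end

t = Time.at(1_000_000_001)
puts show_feature_by_clock?(t)     # true
puts show_feature_by_clock?(t + 1) # false, one second later
puts show_feature_by_id?(4)        # true on every request
```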
00:19:51.919 and i think this is no it worked for us for a little while um but also can anyone think why this is
00:19:57.679 not perfect
00:20:04.160 exactly so the idea here is that um you know even though we didn't have like one method that was always on or off
00:20:10.799 we actually started realizing that we were very very often choosing even to be on and odd to be off so people with even
00:20:17.919 user ids just started having crazy experiences because you're just you know testing weird whims that you have and
00:20:23.200 suddenly if you have user id 4 or something the site is crazy and if you have an odd number it's not um so we've
00:20:30.159 gone a little bit further this is kind of what we ended up having to do we assumed that you have a users table
00:20:35.440 and we're just i mean the code doesn't really matter but we're taking you know uh
00:20:40.640 the experiment name and then your user id and we're basically just grabbing a sha of it and converting it from base 16 to
00:20:46.000 base 10 taking eight characters and then taking it modulo the number of buckets in the experiment so which is
00:20:51.440 just a long way of saying that if we have a two bucket experiment which is an a and a b test uh for this thing we
00:20:57.200 could say okay well for this user are they in bucket zero or bucket one we can filter it that way and kind of partition
00:21:02.559 users better which means we can also kind of wrap this at a higher level and we can say
00:21:07.600 well is this user in this experiment uh what is their experiment bucket and are they in the bucket that i'm asking
00:21:13.919 for so what we end up doing for reals in our code is
00:21:19.280 something more like this so is current user in this bucket one for the experiment that i'm running
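
The partitioning described above can be sketched in Ruby as follows. This is not the code from the slides, which appears to have been written in SQL: the method names are hypothetical, and the choice of SHA-1 and of string concatenation for combining the experiment name with the user id are assumptions.

```ruby
require "digest/sha1"

# Hypothetical helper names. SHA-1 and plain concatenation are assumptions;
# the talk only says "a sha" of the experiment name and the user id.

# Hash the experiment name together with the user id, read the first eight
# hex characters as an integer, and take it modulo the number of buckets.
def experiment_bucket(experiment_name, user_id, num_buckets = 2)
  digest = Digest::SHA1.hexdigest("#{experiment_name}#{user_id}")
  digest[0, 8].to_i(16) % num_buckets
end

# Convenience wrapper: is this user in the bucket being asked about?
def in_bucket?(user_id, experiment_name, bucket, num_buckets = 2)
  experiment_bucket(experiment_name, user_id, num_buckets) == bucket
end

puts experiment_bucket("surprise_me_button", 4)
puts in_bucket?(4, "forced_date_post", 1)
```

Because the experiment name is part of the hashed string, a given user lands in the same bucket on every request for one experiment but is partitioned independently for each other experiment, which avoids the problem of even-numbered ids always receiving the experimental variant.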
00:21:25.840 so the good news if you're interested in any of this is that this is yours free we've released it as experimental it's
00:21:32.240 on our github my slides will be up if you're interested in this you can grab it
00:21:37.840 has some included batteries the first thing is that user partitioning i was just talking about
00:21:43.200 those convenience methods so we can just have a nicer kind of friendlier way than that sql scariness um
00:21:50.559 ultimately we can start and end experiments from an admin dashboard uh in particular we ran one test that was
00:21:57.600 more crazy than usual i think maybe we tried tripling our prices one weekend and i know i just said weekend you probably know where this is going but um
00:22:04.640 nobody was around it sucked we lost some money for three days nobody could really we had to deploy new code
00:22:10.720 to end the experiment it wasn't great um so we added this ability that if nothing else you can at least end it or start an
00:22:16.799 experiment from an admin dashboard and it hooks into you know most of rails like admin frameworks that you know and
00:22:22.159 love so things that are not yet included that i would love to see added at some point
00:22:28.159 one thing is goals so when i was talking about a/bingo and vanity um you can you know you give it some kind of metric that you're trying to follow um ours i
00:22:35.280 showed you are kind of multi-dimensional like we want to see people post more dates and send more messages and that kind of thing but it'd be good if
00:22:40.960 nothing else if you could just say like here's the one metric i really want to move with this test i want to say you know i want more users posting more
00:22:48.159 dates so i should be able to track that i would like to have statistical significance built in just so that you
00:22:53.600 can kind of watch it in the admin dashboard we have a data scientist so we kind of let him be that number cruncher but it'd be cool if it was just doing it
00:22:59.520 on its own and maybe some visualization just to see how things are trending um really pull it down let us know if
00:23:05.280 you have any thoughts because i i think that it could do a lot more
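The point about ending an experiment without deploying new code can be sketched as a runtime flag that application code checks on every request. This is a hypothetical in-memory illustration, not the library's actual implementation: `ExperimentRegistry`, `price_for`, and the `"triple_prices"` name are invented here, and a real app would presumably persist the flag somewhere an admin page can flip it.

```ruby
# Hypothetical in-memory registry of running experiments.
class ExperimentRegistry
  def initialize
    @running = {}
  end

  def start(name)
    @running[name] = true
  end

  def stop(name)
    @running[name] = false
  end

  # Unknown experiments are treated as not running.
  def running?(name)
    @running.fetch(name, false)
  end
end

# The flag is read at call time, so stopping the experiment takes
# effect immediately, with no code change or deploy.
def price_for(registry, in_experiment_bucket, base_price)
  if registry.running?("triple_prices") && in_experiment_bucket
    base_price * 3
  else
    base_price
  end
end

registry = ExperimentRegistry.new
registry.start("triple_prices")
puts price_for(registry, true, 10)  # 30

registry.stop("triple_prices")
puts price_for(registry, true, 10)  # 10
```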
00:23:10.640 so to talk quickly though about technical debt particularly technical debt in a brave new world of doubled
00:23:16.320 complexity so i think you know it's not like i'm talking about a lot of debt that's going to be added to your code
00:23:21.360 base kind of inherently by adding these things but that being said if your application has to work in both you know bucket one
00:23:28.559 and bucket two there just is some inherent complexity there um i know in dhh's keynote yesterday he was talking
00:23:34.400 about you know hiding and showing you know delete buttons and that kind of thing those are just kind of often bug
00:23:39.760 prone areas even with good testing there's just another thing you have to kind of think about
00:23:45.600 i think my obvious first answer is that you know automated testing is essential i think we all probably know this i hope we know it um i'm surprised still how
00:23:53.520 often i hear you know automated testing get talked about as some kind of inertia that needs to be overcome but automated
00:23:59.039 testing will give you a good framework to just good baseline to know that if i'm throwing crazy code that i'm ripping
00:24:04.080 in and out i always know what things are supposed to be doing so all code also needs to be held to the
00:24:10.720 same standard of rigor you might want to under engineer things a tiny bit because again if you're testing crazy whims that you know in two
00:24:16.799 days you're just going to rip back out you don't want to like go crazy with how you first implement that feature
00:24:22.159 um but that doesn't mean that it shouldn't be well tested it should be well factored you should be able to reason about it and it should be
00:24:27.360 isolated and self-contained so test in isolation
00:24:32.559 um this obviously means something very specific to object-oriented programmers and people doing tdd what i mean in this
00:24:38.720 case is really making sure that your a b tests aren't bleeding into one another so we try to limit things to one part of
00:24:44.480 the application at a time if we're doing an a b test on the signup funnel we don't want to have five going on at any
00:24:49.760 time and not even just for statistical reasons it's also just the code gets crazy you know you don't want to have
00:24:55.279 that stuff nested you're either in or out of the experiment we want to keep it that way
00:25:01.200 and finally of course kill dead code immediately i know that you can't really kill things that are already dead
00:25:08.400 but i think the important thing is we all know this anyway like you don't leave dead code in your application but
00:25:13.600 especially this stuff is really risky to leave around especially if it's crazy you want to try something just you know
00:25:20.240 rather radical it gets really hard to reason about what your application is supposed to be doing if you just have a b test forks littered
00:25:27.360 all over the place and particularly you need to kill dead code thinking in terms of revertability
00:25:33.679 um so i think ideally in in a perfect world uh if you implement an a b test you know
00:25:40.159 you have it all in one commit or something pulling it out and ending it should be as simple as reverting the commit i
00:25:45.360 think in the real world you know you might have some qa stuff that you push on top of it maybe a bug fix here or there or whatever but as close as you
00:25:51.919 can get to just having an obvious way like here's where i started the experiment now i'm pulling the code back out
00:25:57.440 it's going to save you from a whole world of hell of complex stuff happening and finally just don't get emotionally
00:26:03.600 invested maybe it's silly to talk about emotions when we're talking about software i mean
00:26:08.799 we're ruby programmers so i guess we all think you can be happy while you write code but it's hard right like you spend
00:26:14.159 two or three days working on a feature you think it's going to be awesome you're excited for users to use it and
00:26:19.679 it costs your company money and you have to pull it out obviously there's just no use really
00:26:25.520 harping on it and just end the test move on and try something else so measuring success
00:26:32.640 statistical significance is obviously important so what i mean by this is that i mean this could be a whole nother talk i'm not a statistician but um you need
00:26:39.440 to be using a real tried and true uh statistical significance algorithm um we use something based on binomial
00:26:45.279 distribution we calculate confidence intervals um so if you've ever studied statistics obviously
00:26:51.279 you know that if i flip a coin 100 times it's not going to be heads 50 times and tails 50 times
00:26:56.320 so it's important that you're at least using something kind of scientific to make sure that the trends that you're seeing are
00:27:01.679 actually real and i've actually you know even if you're not using a framework with this
00:27:07.279 stuff built in which many have um the algorithms aren't that tricky it's worth just reading them on wikipedia or
00:27:12.480 something just just making sure that you're doing something that is real so another thing we rely on heavily is
00:27:18.480 cohort metrics we've been tracking these for a while just kind of for other reasons to
00:27:25.039 just see how our site's growing over time and that kind of stuff but they've been so helpful when we've been doing these a
00:27:31.200 b tests and the reason is because once you start getting kind of obsessed with tuning conversions and testing
00:27:36.640 everything um it's easy to get into a situation where you're tuning a landing page or you're tuning an upgrade page
00:27:42.240 and you know you're really really working on conversion rates and you don't really realize unless you're checking old users how it's affecting
00:27:48.880 them right so you can cause this like really hyper optimized site that ends up being really aggressive and kind of
00:27:54.159 spammy or something and the users who have been with you forever who love your product end up losing and leaving so if
00:28:00.240 you're not like watching what's happening to the people that have been here all along while you're doing stuff to the new people you're really going to
00:28:05.600 lose out we also have a site health dashboard as a dating website i think we have kind of
00:28:11.440 unique things we have to take a look at for instance gender ratios location ages
00:28:16.960 of our users so this is the kind of stuff that it might sound crazy but we've had experiments where uh we've a b tested
00:28:23.919 and the experiment has won for males and lost for females sometimes i mean it's anomalous we try
00:28:29.200 not to think too hard about why that might be um but you know we want to keep these ratios healthy so it's good to at least
00:28:34.559 know that if somehow we're running a test and these ratios get all screwed up we can at least see you know the dates
00:28:39.840 when we deployed these things and figure out why that happened and finally we just have a system of
00:28:45.600 daily emails where every day it sends us all the a b tests that are running the number of times the users have seen it
00:28:51.279 and statistical significance whether it's been obtained or whether it's kind of trending one way or another
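The talk mentions confidence intervals based on the binomial distribution without giving the exact algorithm. As a rough sketch under that assumption, the following uses the normal-approximation (Wald) interval for a conversion rate; the method names and sample numbers are hypothetical, and the speaker's actual calculation may differ.

```ruby
# Normal-approximation (Wald) confidence interval for a binomial proportion.
# z = 1.96 corresponds to roughly 95% confidence.
def conversion_interval(conversions, visitors, z: 1.96)
  rate = conversions.to_f / visitors
  margin = z * Math.sqrt(rate * (1 - rate) / visitors)
  [rate - margin, rate + margin]
end

# Conservative check: only call it a real difference when the two
# intervals do not overlap at all.
def clearly_different?(a, b)
  a_low, a_high = a
  b_low, b_high = b
  a_high < b_low || b_high < a_low
end

# The coin-flip point: 55 heads in 100 flips is still consistent with
# a fair coin, because 0.5 falls inside the interval.
low, high = conversion_interval(55, 100)
coin_is_plausibly_fair = low < 0.5 && high > 0.5
puts coin_is_plausibly_fair  # true

# Two buckets with 1000 visitors each: 10% vs 15% conversion.
control = conversion_interval(100, 1000)
variant = conversion_interval(150, 1000)
puts clearly_different?(control, variant)  # true
```

Requiring non-overlapping intervals is stricter than a proper two-sample test, so it will sometimes report a tie where a real difference exists; it is meant only to show the shape of the calculation.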
00:28:58.080 what do you do if there's a tie sorry um i think you know you can get kind of
00:29:04.080 obsessed with this stuff too you can see a lot of comments on on forums where people are debating like well you know if it looks like a tie it
00:29:10.559 probably isn't you're probably using the wrong algorithm but a lot of stuff ends up being kind of a wash and it doesn't really matter
00:29:16.159 right so um the first thing is that generally you should err on removing the
00:29:22.000 new test code i think that's just because you don't want to add new debt it's just more code to maintain
00:29:27.200 so if it really is just a wash just pull it out and forget about the feature there's some i think times when that's not
00:29:33.120 necessarily true the first is if if it really is kind of a tie but the thing that you're building
00:29:38.399 kind of is one incremental step towards the ultimate goal of your project if you're doing a redesign you're adding
00:29:44.000 some new colors that are going to be handy in the future or something like that it might be worth leaving in one other example i can think of is if
00:29:51.039 if you can really build something that's self-contained enough that doesn't have a lot of maintenance needs we have one page on our site that's a date map where
00:29:57.919 some engineers built it on just some friday afternoon time and we've never had to do any kind of
00:30:03.039 maintenance on it and it's cool it's just like this live updating thing where you see dates that are being posted
00:30:08.559 developers loved working on it made them happier at their job doesn't really cost us anything to leave it up so we did
00:30:15.520 so we've also seen some unexpected benefits from doing this testing at our company
00:30:20.880 the first and i think the most important is that there's we spend far less time arguing over new features
00:30:26.559 um i think it's really hard i know we talked about this before but you know when you have product people and tech
00:30:31.760 people and tech people know why it's crazy you know we why would we want to like implement a facebook connect login
00:30:37.440 facebook's api has changed every single day it's horrible but you know product people know that it's this viral cool
00:30:43.120 thing um why argue about it right and this is a good example like this comes from
00:30:48.440 forbes.com i don't know why they were writing about this but mixcloud claims to have increased their conversion rate to sign up by 200 to 300 percent
00:30:55.120 using facebook connect now although most sites won't see the same level of improvement i've chatted with several other developers that have quoted a 20
00:31:01.520 to 50 percent increase in sign-ups anecdotal data obviously uh reads kind
00:31:06.559 of like an advertisement but this is the kind of thing that a well-meaning product manager can read and say we're losing money like if we don't have this
00:31:13.279 built like we could be increasing our conversions by 200 to 300 percent so you get to work on the feature and
00:31:18.720 you're adding the little button and then the next day you see surprise people hate being forced to use facebook we
00:31:23.760 surveyed the internet they say hell no so this kind of stuff i found happened all the time until we started really
00:31:28.799 testing this stuff i think the important thing is we've been able to move our conversations from is it worth building
00:31:34.559 to just is it worth testing i think building has a lot of baggage to software developers like we know that
00:31:40.000 we're going to be maintaining this stuff like we really have to fight and make sure that scope doesn't creep out of everywhere um but if it's just worth
00:31:47.519 testing then at least those conversations become a little bit easier to have and a lot of things are at least worth testing if you can do it this way
00:31:56.320 finally it gives us a defense against conflicting customer feedback you love your customers they're
00:32:01.760 well-meaning often the people who love your site the most are the ones who scream the loudest about things and it's often really
00:32:07.919 really hard to make decisions when your gut tells you that one thing is good for your business but your customers are demanding something else right
00:32:14.720 for us i mean the most obvious thing being a paid dating site is if you go to our uservoice page we have hundreds and hundreds of votes to make our site free
00:32:22.000 and i get it like i get why having something free is cooler than paying money for it um
00:32:28.399 we've a b tested this so i don't know i don't know if it's weird that we've a b tested like a fundamental
00:32:34.799 thing like are we a paid site or not but we have so usually the premise that they argue with this is you know there aren't
00:32:40.480 many people in my neighborhood or my my whole city so why should i pay to meet them and i think that's totally a fair
00:32:46.880 criticism so we said okay great so you know markets that don't have this x number of people in it they're free
00:32:52.720 let's see how much faster they grow and absolutely to my surprise they haven't grown faster at all like actually asking
00:32:58.320 people for their credit card has had no impact on the growth of those markets
00:33:03.360 so if we had just listened to them like this would be the kind of stuff that would stress us out all the time and i
00:33:08.399 don't know what we would do about it finally of course more money doesn't hurt
00:33:16.240 so just to recap rapid development is useless if you're not moving in the right direction like
00:33:21.600 speed is one thing and it's important but you need to make sure that you're taking data to make sure that you're moving quickly in the right direction
00:33:29.279 tiny improvements over time can become huge in aggregate this to me is like the bread and butter of really good a b
00:33:35.120 test strategy where like don't get hung up or concerned that you're not seeing huge improvements right away just keep
00:33:40.559 at it and just keep relentlessly testing it let data inform your decisions uh
00:33:46.000 there's a whole lot of conversations you just don't have to have if you have data like you can just say sorry uh no
00:33:52.880 it's that's not how it is uh there are tools at your disposal so i know i mentioned optimizely visual
00:33:59.360 website optimizer a/bingo vanity our own experimental so many open source tools for every single framework
00:34:06.080 there's no reason not to just get your hands dirty and just start testing this stuff and finally just be rigorous and
00:34:11.520 relentless test everything you can test whims test fundamental assumptions about your business model
00:34:17.839 but do so with rigor you know test your code make sure that you're moving in a sustainable way and that this kind of
00:34:23.679 stuff isn't just going to bog you down so thank you so much