
Workshop - Machine Learning For Fun and Profit


by John Paul Ashenfelter

In the workshop "Machine Learning For Fun and Profit," presented by John Paul Ashenfelter at RailsConf 2014, participants are introduced to basic machine learning techniques they can apply in their Ruby on Rails applications. The main theme is using data from the users table to generate insights that improve business profitability.

Key Points Discussed:
- Understanding Users: The workshop begins with the importance of user data. Ashenfelter asks the audience about their users tables and business goals, emphasizing that meaningful insights depend on understanding who the users are.
- Machine Learning Techniques: Participants learn several foundational techniques: categorizing users, segmenting them by behavior, and applying recommendation algorithms. Ashenfelter stresses using science and data analytics to derive actionable business insights.
- Practical Implementation: A significant focus of the workshop is on hands-on coding examples, utilizing the 'sex machine' gem to assign gender to users based on first names and analyzing user data without extensive surveys. This approach aims to achieve better accuracy than traditional survey methods.
- Geolocation: The presenter also covers geolocation, explaining how to derive rough user locations from IP addresses using free geo-IP services, which helps in understanding user demographics and tailoring support strategies accordingly.
- Clustering Algorithms: The workshop covers user segmentation through clustering algorithms such as K-means and hierarchical clustering, which help identify user groups based on how they interact with the application. Ashenfelter provides practical demonstrations and encourages attendees to try these algorithms on their own data.
- Recommender Systems: Finally, attendees are introduced to recommendations using Singular Value Decomposition (SVD) to find similar users based on their interactions, which can help personalize the user experience and increase engagement.

Takeaways:
- The workshop emphasizes the need for Rails developers to become data scientists to utilize the wealth of user data effectively.
- Participants take home practical knowledge about applying machine learning in their own applications, equipped with the tools to answer user-related business questions and drive profitability.

Your Rails app is full of data that can (and should!) be turned into useful information with some simple machine learning techniques. We'll look at basic techniques that are both immediately applicable and the foundation for more advanced analysis -- starting with your Users table.

We will cover the basics of assigning users to categories, segmenting users by behavior, and simple recommendation algorithms. Come as a Rails dev, leave a data scientist.

Help us caption & translate this video!

http://amara.org/v/FGZx/

RailsConf 2014

00:00:18.279 all right people cool if I go ahead and kick into this right on time everybody okay with that looks like we got a lot
00:00:23.480 of seats full thank you guys so much for coming down um to talk I'm John I'm going to talk about machine learning I
00:00:29.599 know you could be at Sandy's talk right now um so I appreciate you coming down to this instead Sandy will be great on video uh
00:00:36.280 she's a wonderful person and I know that I have to deliver at least as much value as you would have gotten out of Sandy's
00:00:42.239 talk so you set a pretty high bar for me and I appreciate it and I hope I don't let you all
00:00:47.840 down so what's our goal my goal I like one takeaway three takeaways is great
00:00:53.120 one takeaway is even better I want to use Ruby to answer questions about your users and your business that's my goal
00:00:58.600 we're going to use machine learning to do it there's some chairs down here guys if you want to go ahead and grab them and Scoot them around somewhere um put
00:01:06.720 them around the back something like that this room set up kind of funky um so I've got a question for all of you this
00:01:13.119 is going to be interactive for a bit how many people have a users table
00:01:18.400 in their rails app okay this is a better question how many people don't have a users table all right yeah uh I'm just
00:01:25.560 curious what what's the god object in your table or in your app instead of users you asset sorry asset asset okay
00:01:33.520 that one makes sense what else assets are users so um machine
00:01:39.799 learning for Fun and Profit with your users table sorry assets and uh you know some things like that you'll probably
00:01:45.600 find the same techniques apply but everybody's got a user table which is what started this thing out now what
00:01:51.119 what is the goal of y'all's business anyone just shout it out what is the real goal of your business make money thank you thank you
00:01:58.079 so we got users and we got the profit right um who's got a plan for making
00:02:03.479 money from their users I just need to raise your hand all right you are first
00:02:09.479 so I'm going to just put you on the spot what what's your plan how do you turn users into profit so we give loans to
00:02:14.760 users and then they pay off awesome I understand that business
00:02:19.800 plan that is awesome so um so we we uh take take uh loan people pay off the
00:02:25.480 loans we make money on that profit that's awesome who works for a social network type company that's going to
00:02:31.640 monetize the attention economy of the yeah oh well okay yeah for sure um I've
00:02:38.480 done that too and so that that's what kind of frames the story for me because
00:02:43.560 we're probably all familiar with people with that plan right we've got users we want to make a profit everyone knows the
00:02:49.720 underwear gnomes our friends the underwear gnomes and you know there's that wonderful part where the Gnomes
00:02:55.599 explaining to the South Park Boys that uh step two is you know you know this big question mark after they collect the
00:03:01.680 Underpants and they're going to make profit and it's just strange to speculate on what business models
00:03:07.360 actually you could build by collecting Underpants and using machine learning on Underpants to create profit but we're
00:03:14.480 not going to do that today we're going to figure out how to fill in that bit how to fill in that question mark the
00:03:20.239 stuff that's in your users table right now that you can use to turn into money
00:03:25.640 um or hopefully some kind of money we're going to use a particular set of tools Ruby which everybody here is probably
00:03:31.480 pretty familiar with we already talked about how everybody's got a user table
00:03:36.840 um almost everybody sorry sorry almost everybody's got a user table and we are going to use science so um I'm a big fan
00:03:45.400 of science I was a chemist in another life so I like that um but science can take you down a bad path so I want to be
00:03:52.079 sure that when we're thinking about the data science we're going to do today that we're a little less this crazy guy um
00:03:58.400 the professor I think this is from the third movie Back to the Future and we're a little more kickass science guy Neil
00:04:03.760 deGrasse Tyson is one of my favorite guys in the world so um we're going to use our users table to figure out how to
00:04:09.760 make a profit with data science and we're going to try to do it thinking more kickass like Neil deGrasse Tyson
00:04:15.159 than crazy like like that so real quick the obligatory that's me um I'm John
00:04:22.160 Paul Ashenfelter I work here at Treehouse um I asked earlier but how many
00:04:27.919 fans at Treehouse a few okay before that I worked at General Assembly so I've got two of the the big ones taken
00:04:33.520 care of so I'll probably go over to Dev Bootcamp and get a job there next so I can just continue collecting working for
00:04:39.960 Education companies I've got Treehouse stickers for anybody that wants them up here because we do have pretty cool uh
00:04:45.240 branding you can come get Mike the frog or you can come get I got to say we've got they sent me these I really have no
00:04:51.639 clue what a boat is for Treehouse but it's wonderful so you can have one of those if you
00:04:56.919 wish but more importantly why should you care about me in data science and being someone to tell you I've been doing this
00:05:03.000 for a long long long long long time um these you can't see too many details but
00:05:08.199 this is from 2006 um I started the data warehousing track um which is another kind of data
00:05:13.960 science at the MySQL conference I taught it a lot at O'Reilly's open source convention and you can see what I highlighted here because it's just funny
00:05:20.240 to go back and look at what you know just about 10 years was we were talking about big databases that were in the 10
00:05:25.840 to 100 gigabyte range okay I mean that's just that's just huge it was hard to figure out how you store data that big
00:05:32.759 in those times um who's got a database bigger than 100 gigs just
00:05:37.840 curious yeah a fair handful of people uh bigger than a terabyte we have Facebook here with
00:05:44.120 their exabyte data I don't think there's any Facebook people here because they're all PHP right so but anyhow data has
00:05:49.759 changed a lot um and uh that means the tools have changed a lot just one more quick uh digression from the history of
00:05:56.680 me you can't see everything there but when I started doing this I actually started with neural networks back in
00:06:02.600 grad school um or actually undergrad to be honest I started with neural networks and you can't see it over here but down
00:06:08.360 at the bottom here it says for MS DOS so we did Visual Basic with no number for
00:06:14.080 MS-DOS and we had to buy a math coprocessor for the computers we ran it on because you know the 386 math
00:06:20.800 coprocessor was an additional cost and you slotted it in and all of a sudden your math got better um because most of the
00:06:26.199 time when you were running numerical simulations back in those days it was kind of like this you pressed play and
00:06:32.080 you waited for um on average uh you know 2 3 hours I had some runs that took 3
00:06:37.160 days between data points literally three days um and that's changed a lot so I've been doing this for a long time
00:06:43.319 ironically someone who started like a lot of things like this at exactly the same time um the month I started my
00:06:49.560 research project this was the cover of uh Inc magazine so just saying um there was some interesting stuff going on at
00:06:55.879 that time that far more interesting than what I was doing data science this is
00:07:01.120 our format we're going to start with a problem and some data and we're going to do some stuff with code to get some kind
00:07:06.840 of results we're going to learn something about our users and we're going to use that to make money I
00:07:12.479 started this as the machine learning for Fun and Profit um as I've been doing a lot more of this and thinking about it a
00:07:18.360 lot more I've started to think a lot more about it like storytelling storytelling what's going on with in my
00:07:23.400 case in our case for these example storytelling about your users because I think stories are a much more powerful metaphor so this is sort of arranged
00:07:30.759 into stories and we're going to start where people like to start with simple stories stories you tell around the campfire stories you tell to make people
00:07:37.319 happy stories you tell to teach people things stories that we all love and enjoy so I'm going to ask a question um
00:07:45.560 who knows who their users are do any of you actually really work with your users table you know maybe you're in marketing
00:07:53.039 maybe you're in um you know the business Dev side maybe you're very many people
00:07:58.800 really feel like they know who their users are so yeah so um no one's willing to
00:08:04.240 say I do I know my users um that's probably because you've got a lot of them right you know it's easy when
00:08:09.280 you've got five or 10 users I mean I you know I literally look out in here and I can't count how many people there are
00:08:14.400 because you know your mind goes one two many um you know and it's kind of gone so it's hard I'll tell you one thing I
00:08:20.319 bet all of you know about your users how many people are familiar with thinking about your users like that right you
00:08:27.000 know Google Analytics Heap Mixpanel um Kiss any of these things you're used to
00:08:32.279 thinking about your users like that which is another way of saying thinking about your users like that right they're
00:08:38.159 all exactly the same person and then what do you do well you take that and
00:08:43.240 you aggregate it and honestly if you if you look at a typical Google analytics dashboard about the only thing in there
00:08:48.959 that tells any sort of story is we've got a lot of people in North America in this one I mean it's a little more of a
00:08:55.680 story I could guess you know I can tell a story looking out at you guys there's a lot lot of white guys with facial hair
00:09:02.040 there's more women than there used to be here um we're missing all of the people with colored hair you know I mean I can
00:09:08.959 notice a few things about this but um you know it's very superficial data much
00:09:15.160 like you know that Google analytics dashboard showing me hey you know we got a lot more traffic from the US so this
00:09:21.000 is what our users look like and we add things up about users we use vanity metrics we've got you know 10 gabillion
00:09:27.399 users the user spent so so much you know this is what we can do we roll them up into Aggregates all the time all right
00:09:33.760 now Aggregates are okay but um they really don't tell the whole story
00:09:39.200 Aggregates tell you about your average user how many of you all dream of being the average user of a
00:09:47.920 company really no one wants to be the average user of a company um I mean you
00:09:53.560 know we all know that you know everybody's not a special snowflake we've been hearing that over and over and over you know we should all have the
00:09:59.800 same tools everybody wants to feel special though regardless you know of of how we're looking at the data so that
00:10:06.720 means we need to tell good stories aggregates are boring um SQL DBAs from
00:10:11.760 the past people who deal with reports any of anyone anyone right that's why
00:10:16.800 you're Ruby devs so you don't have to do SQL and reports and all those things that's for the Java guys running BIRT
00:10:23.560 and the people who are using Crystal BusinessObjects whatever it's called this week and Oracle um in a world most
00:10:29.320 most of us don't live in but um Aggregates can tell more of a story they can then turn into events in motion that
00:10:36.600 are more interesting seeing aggregates over time is wonderful being able to press the play button and see data change over
00:10:41.959 time seeing your users grow over time your number of tweets grow over time your uh your cash base grow over time
00:10:48.959 and then context makes it interesting so there's a lot of questions where you want to know things about the context
00:10:55.000 and when you're putting all those things together you're telling stories so
00:11:00.639 I was thinking that there is some users in my
00:11:05.720 database and the users in my database spent good money at my company and then I thought wonder how
00:11:13.480 many of them are female and then I had a revelation right I'm trying to do Ira Glass storytelling
00:11:19.079 which no one does as well as Ira Glass this is a napkin representation of the storytelling that This American
00:11:24.760 Life does something happened then something happened then something happened oh my God insight
00:11:30.120 happen happen happen oh my God more and you know if you ever listened to This
00:11:35.560 American Life uh you you learn stuff about it they're telling individual stories and then you come up with some
00:11:41.600 better picture some better understanding some insight to guide your life Morning Edition does it in a similar way the big
00:11:48.760 V there is after their intro they go way down into the trough to talk about all the details and then they come back out
00:11:54.760 of the trough and they say well we talked to John Ashenfelter about this and we talk to OBD G about this and we
00:12:01.519 talk to this other person about this to put it back into human context there different ways of telling stories and
00:12:06.639 there's the the very internet way which is still a good way to tell stories because you all click on stuff like this seven Unbelievable Facts about your
00:12:12.880 users click here for more um so you know I mean that all these things are ways people want to hear about data
00:12:20.880 so your users what do you know how do you know it what's missing this might be what your users
00:12:27.560 look like you got some vague outlines of them in your head how do you find out more about your users if you right now
00:12:33.320 wanted to know let's let's keep going with the male female distribution if you want how many people collect that right now like like actually ask in
00:12:39.639 registration or something how many so not many not many how would you figure it out just holler out if you needed to
00:12:45.800 know for whatever reason what the male female ratio was of your population go ahead and holler it out name analysis
00:12:51.839 name analysis wow that would be a good one if only someone was doing a talk at this conference on name analysis thank
00:12:57.680 you um there's there nothing paid for that um what's the traditional way to do it ask them how do you ask
00:13:05.360 them surveys thank you how how does anyone know what survey percentages are like have you ever done a survey through
00:13:10.920 SurveyMonkey or something like that yeah are you going to get them yeah go ahead tiny they're tiny they are they
00:13:17.360 are and that's assuming that people open the email in the first place which is more than likely how you shipped it out so you're multiplying tiny numbers which
00:13:23.360 leads to even tinier numbers and you end up with very small data sets that you extrapolate from and hope that they're
00:13:29.399 somehow relevant hope that they're somehow statistically significant and there are statistical techniques for dealing with that but wouldn't it be
00:13:35.440 better if you had more confidence knew more about your users descriptive data
00:13:40.880 lets you um slice your users into segments right um You can use things like lookup tables to do this um which
00:13:47.199 we're going to do in a second you can do uh you can do uh the name analysis which we're going to do in a second most of
00:13:53.320 these are fast easy to do they're going to give you way better results so if I said I could give you like 80% accuracy
00:14:00.079 on male and female based on first name male and female gender based on first name who thinks that's worse than what
00:14:07.639 they get from a survey right yeah I mean it's at least as good as what you're going to get from
00:14:12.800 a survey you know probably better and uh it takes very little time and effort so that's one thing I'm going to send you
00:14:18.120 home with today so let's talk about it this is one of the first examples we're going to do and we're going to see how
00:14:23.480 these examples go um the first two I know you can do without any sort of of
00:14:28.759 uh any any uh crazy gems or any linear algebra or anything like that so yeah the gem's called sex machine I I did not
00:14:35.600 write it um so uh this is literally the code you know you're selecting all your
00:14:41.800 users by first name and then we're going to take the sex machine Gem and uh and
00:14:47.279 analyze it and so let's run through code real quick and then we'll do the code we'll see see if that works how many
00:14:52.440 people sort of got gems installed so far okay so there should be
00:14:58.320 at least someone near nearby that you can kind of see and so I'm going to explain it we're going to take a minute to do it maybe while people are doing it
00:15:04.680 you can try one more network to see if you can get the gems installed from the repo and we'll go with it but so
00:15:11.320 basically sex machine is is pretty trivial I'll tell you a little bit about what's under the hood so you create a
00:15:16.639 detector there's a couple of cool things you can do there's case sensitivity so you know most people don't want to do case sensitivity you can also pass it uh
00:15:24.279 locales um because different names are masculine or feminine in different
00:15:29.560 locales how many people are British in here UK British okay just just the
00:15:34.800 British people and and the US people think to yourself your answer okay the name Jamie boy or girl British
00:15:42.399 people boy us people girl right I mean it's not
00:15:48.000 certain right Jamie is a little bit of an androgynous name but in Great Britain it is far more boy than girl in the US
00:15:53.800 it is far more girl than boy so this Library understands some of those things if you want to uh to uh lock it to
00:16:00.600 particular particular regions so um basically we've got this detector which
00:16:06.199 we're going to which we're going to create from the sex machine gem and we are going to basically get
00:16:12.279 the gender of names that's literally all this gem is now that's a lot easier even than putting together a Wufoo survey
00:16:18.040 and sending out an email to all your users and honestly later on you can check how right it is if you want to
00:16:23.519 send out a survey if you need better statistics so that is literally all
00:16:28.600 there is to the sex machine gem I'm going to tell you just a bit um about where it
00:16:33.639 came from because you should always question these black boxes right we've got this black box that you put a name
00:16:39.040 in and you get a gender out now for all you know it's random right you know um hopefully it's not random so what what
00:16:46.360 is interesting about this gem is the data is seven years old it's a collection of 40 42,000 names I think um
00:16:55.399 that uh some guy in Germany did by checking census data from all sorts of different countries it's uh got
00:17:00.880 percentages by country and it's packaged up because God forbid I use a gem that's
00:17:06.079 pure Ruby in this talk it's packaged up in a C extension and so the sex machine
00:17:11.240 gem wraps the C code that has all the names in it but it's very easy to muck about with this gem um and it runs
00:17:16.839 really quick so let's do our first exercise there's two files if you look
00:17:22.160 in the exercise one gender thing in this repo and I'll put the repo thing up again for people that might not have it
00:17:27.199 check your gender let you check your own gender um so you can put your name in it you can see some of the unusual ones I have
00:17:32.880 a couple of questions just before you see the results of it I I have literally polled friends that have children with
00:17:38.720 unusual names okay so I want to know male or female when I holler these names out
00:17:45.559 Cedar okay uh
00:17:50.799 River Justice you know it's just fascinating
00:17:56.760 right because you know um I I'll say cuz those are all in the check your gender file cuz that one I was I put together
00:18:03.080 just put your name in it and to put some other examples you can see what what uh sex machine has to say about those those are all true stories friends of mine all
00:18:09.440 have children name those various things so um and then the assigned gender to users so there's this ongoing story that
00:18:16.720 we can tell through what's in this database there's a machine learning there's a SQLite file so it's not hard I would encourage you if you've got
00:18:23.360 a slice of your users table on your local machine to go ahead and hook it up to your local machine you know use the
00:18:29.360 Postgres gem use the MySQL gem if you want to take this and actually run it against your real data right now there's no reason you have to use mine but I
00:18:35.880 gave you a set of data I pulled a bunch of the people that work at Treehouse I uh I dumped out some of the personal
00:18:42.080 data but I kept their name and I ran the uh ran that into a a SQLite database
00:18:48.080 so it was easy to distribute and that assigned gender to users file is actually more of a read from the database pick a gender write it out so
00:18:55.000 what I want to do is see how it works for us going ahead and trying that we're going to try it maybe for about 5 to 10
00:19:00.720 minutes this one's not too hard so it's either going to run or it's not going to run you either have the gem down or you're not and the Wi-Fi is going to
00:19:05.799 kill you and let's just see what happens so is everybody cool with that plan all right well let's take five minutes to
00:19:11.840 start and then we'll go a few more and see if we can get either check your gender and assign to gender I would be particularly interested when you do the
00:19:17.720 first one if you feel really upset about what it tells you you are um I usually go by John Paul because there's so many
00:19:24.039 Johns at most of the companies I work at and it says I'm androgynous so I guess it's because I have two names um
00:19:29.840 but if I do John or Paul obviously it says I am masculine so it'll be interesting to see what it says for you
00:19:35.120 um especially if your name is Justice uh or river or Cedar um or something like
00:19:40.559 that so um yeah let's let's give it a shot um and see how this goes this is
00:19:47.039 the the your time to do something so I'll go ahead and show some code up
00:19:57.120 here okay so this is is what we
00:20:06.919 got so that's the assigned gender to users hang on let me get the other one uh what's the other one called the
00:20:12.880 other ones so check my gender um all these you
00:20:18.640 can just run in Ruby from the command line if you've got it you can just Ruby these these files so you can put your
00:20:23.960 name here your name here by the way is androgynous as well if you actually run it just like this it'll tell you your name here is androgynous um test user
00:20:31.200 one is androgynous test user ten is androgynous I learned a lot about um about what is an androgynous name
00:20:37.919 looking through some of the junk data in our uh in our tables so anyhow let's give it a shot yes
00:20:45.400 sir yeah sorry I'll bring the repo up
00:20:55.640 too the repo is right there so let's see how this works nothing like
00:21:04.120 live coding with all of you so let's see how it
00:21:10.279 works in case you in case you really want to know like the before and after are in there all the code is there it's
00:21:17.760 no big deal if it doesn't work for you or if you don't want to do it right now or whatever it's got the before and after so you've got this so you can take
00:21:23.480 it back you can hook it into your user table and get something of value right away
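[Editor's sketch] The read-names, pick-a-gender, count-the-results flow described in this exercise might look roughly like the Ruby below. This is a minimal sketch, not the code from the workshop repo: `TinyDetector`, its made-up name table, and `gender_counts` are hypothetical stand-ins so the example runs without any gems. The `get_gender` method name and the `:andy` (androgynous) fallback are modeled on what the sex machine gem is understood to expose, which is an assumption to check against the gem's README.

```ruby
# Hypothetical stand-in for a name-to-gender detector, so this sketch runs
# without installing anything. The name table is invented for illustration.
class TinyDetector
  NAMES = {
    "john"  => :male,
    "paul"  => :male,
    "amy"   => :female,
    "jamie" => :mostly_female,
    "kyle"  => :mostly_male
  }.freeze

  # Unknown names fall back to :andy (androgynous), matching the behavior
  # the speaker describes for names the detector does not recognize.
  def get_gender(first_name)
    NAMES.fetch(first_name.to_s.strip.downcase, :andy)
  end
end

# Count how many first names land in each gender bucket.
def gender_counts(first_names, detector)
  counts = Hash.new(0)
  first_names.each { |name| counts[detector.get_gender(name)] += 1 }
  counts
end

detector = TinyDetector.new
first_names = ["John", "Amy", "Jamie", "Test User 1", "Kyle", "Paul"]
gender_counts(first_names, detector).each do |gender, count|
  puts "#{gender}: #{count}"
end
```

With the real gem installed, the stand-in could presumably be swapped for the gem's own detector object, and `first_names` could come from a query against your users table instead of a literal array.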
00:21:35.720 if you run it and you're surprised or or happy about the gender assignment just just raise your hand and tell people
00:21:42.159 yeah it says mostly female really how about that how about that met I would
00:21:48.000 have to agree with that too so um we we actually have Kyles at Treehouse so
00:21:53.559 yes I actually knew that that one was weird I don't know why um fascinating so
00:21:58.919 Kyle clearly is a problem what else could it be the locale huh it could be the locale the
00:22:04.919 default locale is US though which is weird um it also spits up a lot when it
00:22:11.240 doesn't know a name and so some of the newer names like it's not so good with uh common names now like um uh Khaleesi uh
00:22:19.760 which my wife delivers you everyone laughs you all think I'm making that up my wife delivers babies she that's what
00:22:26.000 she does and she hasn't delivered one but they keep track of like the hot names and you would be stunned at what
00:22:32.000 the hot names are um so uh we're kind of past the cheyen and the shayas that era
00:22:37.600 is kind of kind of kind of over um but it it changes over time this gem was
00:22:42.799 first done in 2007 um there's some interesting things in it the good news is it's really easy to hack the data
00:22:48.400 format for it because it's basically just a big text file how how are people doing are people getting this to run okay that have it I mean if you don't
00:22:54.520 have it downloaded so yeah cool anyone else surprised shocked upset
00:23:00.279 disappointed to find out they're androgynous or mostly male or mostly female I laugh every time I think gender
00:23:06.600 assignment is what what I mean that that's just is the wrong thing to call this but um yeah what
00:23:13.640 put Biff is androgynous there you go and you can see what it did for justice and
00:23:19.400 uh and I didn't put in charity um but Justice and uh we should see what it does for Khaleesi I think it throws up its
00:23:26.200 hands and calls anything it doesn't know androgynous too so um you can also set the uh the um
00:23:33.919 there there's some ways in the code to set the threshold for uh saying androgynous and not androgynous but so
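[Editor's sketch] The idea of a threshold between "sure", "mostly", and "androgynous" can be illustrated with a few lines of Ruby. This is only a sketch of the general idea, not the gem's actual algorithm: the `classify` method and both cutoff values are hypothetical, chosen loosely around the "something like 80, 85%" the speaker recalls.

```ruby
# Hypothetical cutoffs; the talk only recalls "something like 80-85%"
# for a confident answer, so both numbers here are assumptions.
SURE_THRESHOLD    = 0.85
LEANING_THRESHOLD = 0.60

# female_share is the fraction (0.0 to 1.0) of people with a given first
# name who are female, e.g. taken from census-style name counts.
def classify(female_share)
  male_share = 1.0 - female_share
  return :female        if female_share >= SURE_THRESHOLD
  return :male          if male_share   >= SURE_THRESHOLD
  return :mostly_female if female_share >= LEANING_THRESHOLD
  return :mostly_male   if male_share   >= LEANING_THRESHOLD
  :andy # too close to call, so treat the name as androgynous
end

puts classify(0.95) # a name that is almost always female
puts classify(0.70) # a name that leans female
puts classify(0.50) # an evenly split name
```

Raising `SURE_THRESHOLD` trades coverage for confidence: fewer names get a firm label, but the firm labels are wrong less often.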
00:23:48.760 um so are we close enough I should go on I have no clue how to do the pulse of of
00:23:53.960 this because you know either you've got it you can run it or you can't run it are we kind of yes sir go ahead I have a
00:23:59.039 question sure so I just gibberish me so doesn't know it's just
00:24:05.600 going to yes yeah yeah um there there's a threshold and off the top of my head I can't remember what it is but it it
00:24:12.320 basically like it's something like if it's somewhere like 80 85% sure that it
00:24:17.679 will say male or female and there's this window where it's kind of sure and it does the mostly male and mostly female
00:24:23.840 and so if you run that second file the um the uh this well I can run it so I mean we can at least see what's going on
00:24:29.679 if we if you do assign gender to users let's get a terminal up here uh hang on I know the terminal's
00:24:36.919 not there
00:24:48.000 yet so if we assign gender to users sorry I made this bigger so you could see it and then it doesn't fit on the
00:24:53.679 screen so it's so if we assign gender to users and run it across the uh the uh
00:25:01.039 database that I gave you you know you can see that Treehouse apparently skews pretty male and uh pretty androgynous
00:25:08.399 the androgynous is all sorts of test junk in there um I did a longer write up of this but you know garbage in garbage
00:25:14.240 out we have three mostly female and seven mostly male and uh then the whole bunch of androgynous names and just a
00:25:19.799 handful of women um we have one woman named Fabby who like it doesn't know what to do with um there there's all
00:25:26.000 sorts of people it's confused about Amy a a i m e it uh gets mostly female I
00:25:31.880 think um but that one it it was having interesting times with so anyhow gender so what I just did is I saved you having
00:25:38.880 to do a survey and having to compile the results and having to deal with the statistical sampling technique you'd
00:25:45.120 have to do to backfill it enough that you felt confident that you got an overall uh overall um decent segment of
00:25:52.120 your users so you could figure out who's male and female um the problem I originally did this for was to figure out how to order t-shirts so I want to
00:25:58.679 put this in a real context we were trying to figure out for one of the meetups how many male and female t-shirts we could do of course I could
00:26:04.159 have counted because Treehouse only has like 70 employees I could have counted um but I thought it was a good example
00:26:10.520 of figuring out how to use machine learning to do that because it let us know how many male and female t-shirts we needed it was about 10% female
00:26:16.679 t-shirts made it really easy um and now if we we literally have recently started hey all you Treehouse users that
00:26:22.880 mentioned you're there we've initially started we've started sending out Treehouse users or sorry sending out t-shirts people who subscribe and we're
00:26:30.360 like wonder how many male and female ones we need and what what it would cost well now we have a good idea to estimate how much that would be because we can
00:26:36.760 run this against our user base figure out how many are probably female and probably male and get good estimates on that so hopefully that's one take-home
00:26:43.960 you can assign gender to your users everybody cool with where we are so far all right so the next one also is not
00:26:50.360 very sexy machine learning this one is geolocation I bet a ton of you do geolocation right people who do
00:26:55.520 geolocation already from IP address for their users a fair bit um how many people use a third party service for
00:27:02.399 that MaxMind probably sorry what
00:27:09.120 else sorry cool so there's a handful of of
00:27:14.240 companies that do it anyone used freegeoip that's why that's why I was curious so freegeoip is the focus of the next
00:27:21.600 one again it's something you get to take home it's something you can use today and there's real reasons for using it um
00:27:27.760 we wanted at the the context for this oh let me get ahead of my talk um the context for this at Treehouse
00:27:36.960 the context was we wanted to know better what we needed to do with our support hours we wanted to see where
00:27:42.320 people were uh we didn't need super accuracy so this was a good technique for it we just needed to know roughly
00:27:48.480 how much West Coast time how much East Coast time how much European GMT time what does our profile look like of our
00:27:54.720 users so we knew better how to staff to support people again this is going to let us put a financial value on our
00:28:01.240 users a financial value on how much we spend on support and make sure we can use a really good way of spending money
00:28:08.000 effectively to support our users make people love us more and uh have a really good experience so um basically our
00:28:14.640 technique is very similar we're going to select an IP address who who has the IP address in their users table right it
00:28:20.080 would be anyone who uses Devise right by default that has an IP address so almost anyone almost any vanilla Rails
00:28:26.159 app has IP address in there a lot of other people just put IP address in there in general um so freegeoip.net is
00:28:34.120 a service but the code is all open source they use two things they use the um the uh MaxMind free location database
00:28:43.039 which has about 20 miles accuracy maybe five miles accuracy depending on where you are it's good enough for a lot of
00:28:49.559 the kind of things that people need to do for us we need a time zone resolution easy easy enough um though I guess
00:28:55.559 people here in Chicago are right on the wrong Edge of a time zone and then there's the people in Indiana that are
00:29:01.440 on the other edge of that time zone so I guess maybe 20 miles does matter so if you want to run this you can get it all
00:29:07.240 from GitHub it needs Python and Go because it wouldn't be fun if we didn't have as many possible languages it uses
00:29:13.320 Python to pull down all the data um from MaxMind it then takes that data munges it
00:29:18.679 with a local CSV file that adds more information about locations and countries to it then you spin up a Go
00:29:25.480 server how many Go programmers we got awesome so you know we spin up a Go
00:29:30.600 server and then we can use Ruby to throw uh IP addresses at it seems like a lot
00:29:36.640 of work one nice thing about this is you can control it inside you can keep it inside you can use the data and add your
00:29:42.919 own data to it to make um make the country information richer um to make
00:29:48.919 the uh the uh IP address information more rich and you can uh can uh basically have a good time with it so
00:29:54.840 the code's pretty straightforward we're going to walk through the code give you an opportunity to do it um in my perfect world I thought I'd sit
00:30:00.240 up here and I'd just run the server so you all could hit it so you didn't have to install Go pretty sure that's not
00:30:06.640 going to work um so uh against a conference Wi-Fi so this might be a take it home so basically if we look through
00:30:14.080 the code and this is all in exercise two the location thing this is all ruby right we're going to set a geocoder
00:30:19.519 which for us is going to be localhost we're going to use Faraday because I'm old school just to grab a request and
00:30:26.519 then when we do that we're going to grab the user we're going to make sure it's an IPv4 regular
00:30:32.919 expression that's what that little bit here does the reason I do that is our load balancer at treehouse was
00:30:38.320 misconfigured for a while and some of our users have um the load balancer's IPv6
00:30:43.559 data in it both of which are problematic because a it says it's coming from the load balancer which has nothing to do
00:30:48.760 with the user and B IPv6 can be a problem with some of the uh some of the libraries so matching it against IPv4 so
00:30:55.799 we're throwing away the data that's bad everyone following Ruby so far right nothing Rock science here rocket science
00:31:01.200 here Rockstar rocket science going to get it straight and then we're just going to JSON get we're going to just grab some body so we're calling that uh
00:31:07.799 that uh connection uh that we set up with Faraday to get the JSON representation of the current login IP
00:31:15.320 and we're going to parse that out as some JSON data and what that's going to give us among other things is a latitude and a longitude and a big JSON packet so
00:31:23.039 again what I did is I set up a bunch of uh data um in the machine learning SQLite
00:31:28.440 database uh this runs um against that it throws data against a Go server
00:31:35.080 and then it stuffs the result into a new table so you've got it so again we could ask people where they
00:31:41.039 live doesn't really matter where they live if we can figure out where they live based on their IP address is this perfect no right um I did a lot of
00:31:48.679 geolocation work at General Assembly and we tried to deal with things like I leave a plane in San Franc leave on a
00:31:54.720 plane from San Francisco and I'm going to New York both the places we were having both places where we had uh
00:32:00.240 presence and I have Wi-Fi on the plane should I show you New York San Francisco or something else my answer was let them
00:32:06.720 choose but the answer we had internally was let's use some sort of magic to figure out where they are and try to
00:32:12.679 assign them to San Francisco and New York which we could also do as well anyhow how do we do this
00:32:17.880 code you see the assign location to users uh in the second exercise it's
00:32:23.519 going to look like this
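What follows is not the workshop's own file but a minimal sketch of the lookup just described: check that the address looks like IPv4, ask the local geocoder for JSON, and pull out latitude and longitude. Several things here are assumptions: the method name `locate`, the `http://localhost:8080/json/<ip>` endpoint, and the `latitude` / `longitude` field names. `Net::HTTP` from the standard library stands in for Faraday so the sketch has no gem dependencies.

```ruby
require "json"
require "net/http"
require "uri"

# Assumed address of a locally running freegeoip server.
GEOCODER = "http://localhost:8080"

# Loose IPv4 shape check; anything else (IPv6, junk) is thrown away.
IPV4_PATTERN = /\A(?:\d{1,3}\.){3}\d{1,3}\z/

# Returns { latitude:, longitude: } for an IPv4 address, or nil otherwise.
# `fetch` is injectable so the lookup can be exercised without a server.
def locate(ip, fetch: ->(url) { Net::HTTP.get(URI(url)) })
  return nil unless ip.to_s.match?(IPV4_PATTERN)

  data = JSON.parse(fetch.call("#{GEOCODER}/json/#{ip}"))
  { latitude: data["latitude"], longitude: data["longitude"] }
end

# Example with a canned response instead of a live server:
canned = ->(_url) { '{"latitude": 41.85, "longitude": -87.65}' }
p locate("8.8.8.8", fetch: canned)
p locate("2001:db8::1", fetch: canned)
```

Passing the fetcher in as a lambda is a design choice for this sketch only; it keeps the parsing logic testable when the Go server is not running.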
00:32:31.720 and you can see that I was pretty honest about what it is right one thing that's really
00:32:38.000 important is this doesn't work if you're not running the go server so I'm going to go ahead because at least I can demo
00:32:46.159 this I'm going to run a Go server over here hello there is our Go server so
00:32:51.880 we're running a Go server we've already downloaded the data and parsed it with Python Go is running our server here and
00:32:58.120 then we were going to go over here sorry I keep saying over here and I'm saying it too
00:33:05.559 much and we are going to go over to exercise
00:33:10.880 two we're going to make it so you can actually read
00:33:15.960 this and you're welcome to do it right so we're going to just ruby assign location to
00:33:21.960 users it's going to go ahead and stuff a lot of users into our database we can go see that it
00:33:29.399 asked the Go server you can see the Go server is doing its thing Go server is wicked fast by the way I love this
00:33:34.639 little Go server this makes me want to try Go um so it's just sending a bunch of IP addresses from our
00:33:39.919 database and we're getting all the Json location data now we've got latitude and longitude so you want to take five minutes and try it if you've got your
00:33:46.320 users table you can hook it up how many people said they had like Devise or something with IP addresses handful um just a hint uh
00:33:55.240 127.0.0.1 doesn't geolocate 10.anything doesn't geolocate you know all
00:34:00.960 the 192.16 the 192.168 addresses don't geolocate the junk addresses don't
00:34:06.440 geolocate so you know you need to throw those out but so two things so far we've
00:34:12.919 got a way to assign gender to users and we've got a way to assign
00:34:18.560 location to users so we took these guys maybe we turned them into these
00:34:24.320 guys so they actually are people you know in full color um and uh look like
00:34:29.560 people as opposed to these vague silhouettes and we described our users
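The earlier hint about addresses that do not geolocate can be sketched as a small filter. This is an illustration only; the method name `geolocatable?` is hypothetical, and the 172.16.0.0/12 range is added here because it is the third standard private block alongside the 10.x and 192.168.x ranges mentioned in the talk.

```ruby
require "ipaddr"

# Loopback plus the three private IPv4 blocks; none of these map to a place.
NON_ROUTABLE = %w[
  127.0.0.0/8
  10.0.0.0/8
  172.16.0.0/12
  192.168.0.0/16
].map { |cidr| IPAddr.new(cidr) }

# True only for well-formed, publicly routable IPv4 addresses.
def geolocatable?(ip)
  address = IPAddr.new(ip.to_s)
  address.ipv4? && NON_ROUTABLE.none? { |range| range.include?(address) }
rescue IPAddr::Error
  false # junk strings are thrown out as well
end

%w[127.0.0.1 10.1.2.3 192.168.1.5 8.8.8.8 junk].each do |ip|
  puts "#{ip}: #{geolocatable?(ip)}"
end
```

Running a filter like this before the lookup avoids sending the geocoder addresses it cannot place.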
00:34:36.679 so that is kind of the end of Act One um so let's take a couple of minutes to go
00:34:42.839 ahead and see if you can get um the geolocation running would it help if I wandered around I'm not sure how useful
00:34:48.560 it is because the code either runs or it doesn't because the bundle install is such a pain in the butt with the internet in here so I'm trying to take a
00:34:55.919 pulse I'm taking a survey which I already told you is wrong I should just use some sort of machine learning to figure out whether I'm going fast or
00:35:01.880 slow enough I could use maybe eyeball contact or you know some sort of other
00:35:07.240 statistic um we doing okay
00:35:14.560 okay the linear algebra yeah there's
00:35:21.119 um yep yeah basically you need did did what you do who who was saying you were
00:35:27.359 saying that you were Kyle see I already Kyle is already a person to me because
00:35:32.720 we we uh use the gender assignment thing to make sure he was male I'm going to guess he's from the Pacific Northwest
00:35:39.000 and be wrong no okay good
00:35:50.720 okay command that I need to run myter had none of that software and it's working but I think I forgot go though
00:35:56.520 CU I can't run this yeah yeah exactly yeah and I try I was going to put it on like memory sticks but I can't put go on
00:36:02.760 a memory stick in a way that installs I can't put all the C libraries it it's we when we get to the conclusion you'll see
00:36:09.880 why um we chose this and what some of the options are but the question that I was looping back to it's probably that
00:36:16.680 oh sorry it's probably this qu post well anyhow I got the qu post here but what it turns into is
00:36:23.079 probably the installing the build the build uh to see so you can install that
00:36:29.359 if you can get an internet connection that'll probably let the linear algebra gem compile because since we're doing polyglot remember we've done Ruby we've
00:36:36.280 done Python we've done Go we've done C with sex machine and now we're adding Fortran to the mix because this is
00:36:42.720 taking a Ruby library and using C bindings to take the um the uh Fortran
00:36:47.920 C bindings and plug it all together um because that is how you win how many
00:36:53.800 people have ever used Fortran any time in their life wow wow I am so I mean really used
00:37:00.800 Fortran not used it under the hood right you know on so yeah um Fortran was my second
00:37:06.200 third third programming language um side project yeah everybody should
00:37:11.720 pick up a Fortran side project I I totally agree with that okay so um we've got
00:37:17.000 about 45 minutes to get through the second half how are people doing with the geolocation they're just fine because like no one has Go and they
00:37:22.760 can't run Go and we don't have any internet in here and I can't run things so okay to move on
00:37:28.200 all right so if you leave now you have taken two useful things hopefully you
00:37:33.520 can do the the gender assignment of your users you can do the location and you can find out more about your users
00:37:39.319 hopefully in a useful way we're going to start well I always get ahead of my slides hang on I'll shut up till I have
00:37:44.960 a slide to talk from so we're past this point so now we're going to stories of myth and
00:37:51.079 Legend we're going to take stories that were simple you know who you are tell me about yourself tell me uh your or tell
00:37:57.839 me your uh your location tell me the normal things you say when you're introducing yourself to somebody or telling a story about yourself now we're
00:38:04.440 going to do something crazy um I put this here there be dragons because I knew I knew we were going to have
00:38:10.000 trouble with linear algebra I knew we're going to have trouble with compilation I knew we're going to have trouble with Wi-Fi so you know all these things are a
00:38:17.839 problem dragons can be scary right um so Smaug right from The Hobbit is pretty
00:38:23.800 freaking scary with Benedict Cumberbatch but they're not always scary um who knows what this movie
00:38:29.839 is yeah a handful of people know it's Pete's Dragon I did not add anything to that it looks like he's holding a ruby
00:38:36.160 and that just came straight off Google Docs and that let me know I was on the right path cuz there are dragons at the
00:38:41.560 edge of the map right and we are at the edge of the Ruby map and we want to have
00:38:46.760 less him that's you know eating us and more him who is our friend um as we get
00:38:52.040 here at the end of the map um of of what ruby can do so we've been looking at how
00:38:58.480 people can be described so you know here we've got a whole bunch of people and now we can describe them a little better
00:39:04.480 we can put a gender on them we can put a location on them but the next step is to put people into clusters to put
00:39:10.960 people into groups because we form tribes naturally you know that's the people telling the stories around the campfire um myths grow up around uh
00:39:18.960 groups of people and individuals right but people agglomerate towards those uh
00:39:24.079 towards those people and this is a random grouping of people but you look at that and you're like there's some order you know I mean you can see
00:39:30.720 clusters in there you can argue about how you draw the cluster you know I could draw the cluster and say Here's a
00:39:36.520 cluster of people someone else might say here's the cluster of people you know someone else might say well that's a really good cluster and this is just
00:39:42.720 some weird shaped cluster but it doesn't matter visually you can look at that and you can say okay some of those people
00:39:49.200 are not like the other people they're grouped differently and your users are like that
00:39:55.280 too um when when we're looking at at users we often use that giant Google
00:40:01.040 Analytics pile of of aggregates right and you know Batman was slapping us for
00:40:06.800 using aggregates because they don't tell the whole story so what we can do is we can take important properties about your
00:40:13.000 users whatever those are I'm going to do it in a a sort of a treehouse context um
00:40:18.119 but you can figure out whatever your important properties are and we're going to find ways to see the inherent
00:40:23.400 structure there and to find ways to find similarity there those are the two things things we're going to do so this
00:40:28.839 is where we get into math too before I go into how many people had linear algebra in the past okay all of that was
00:40:35.280 because you were computer science majors right and they made you do it okay so linear algebra is is fascinating and
00:40:42.520 underlies all this um linear algebra is very easy to do wrong I've got one thing here that's done kind of by hand and
00:40:49.400 then we're going to use this wonderful library the uh AI4R the artificial intelligence for Ruby gem is chock-full
00:40:56.839 of clustering algorithms and ID3 decision trees and all sorts of wonderful so you don't have to do it by
00:41:03.400 hand um but it relies on the linear algebra gem anyhow we're going to take important properties from our users
00:41:10.000 we're going to use either by hand or we're going to use this gem we're going to do stuff with it so here is a
00:41:15.760 specific example at treehouse we treat all our users pretty much the same but it would not be out of the ordinary to
00:41:22.680 think maybe we have kind of casual users we maybe have professional users
00:41:27.839 and then we have the crazy people that earn every single badge we have and 25,000 points and all that you know we
00:41:33.280 got our super users we got normal users we got casual users that's a hypothesis let's you know
00:41:40.240 we could find out if that's true well what we're going to do is we're going to use a technique called um k-means clustering we'll define it in a second
00:41:46.599 but basically it's a way of saying I want to take all this data and put it into a particular number K of groups so
00:41:54.599 I'm not clustering and you know about uh about until so let me rewind I'm saying
00:42:01.920 I know how many clusters I want I want three I want five I want 10 the example I'm using right now is three I want to
00:42:07.720 break into what I think are casual users um super users and uh and professional
00:42:13.839 users let's say what you'll find if you do a lot of machine learning is you will take a lot
00:42:19.800 of those assumptions and you'll try it with three four five seven and ten or something like that and see if the results make sense because you don't know this
00:42:26.720 algorithm much like a lot of statistics you make assumptions at the beginning and then you have to kind of stick with them all the way through to the
00:42:32.599 end so for k-means clustering we're going to figure out K clusters we're going to put these users into groups and we're
00:42:39.040 going to see what we can learn from them so I want to talk through some code if you look in the EX3 uh the example three
00:42:44.800 uh folder in the repo I'm going to just pull bits and pieces out of the uh out
00:42:49.920 of the clustering the first clustering so I'm going to make some clusters so like I said for our example we're going
00:42:55.200 to do three so I'm going to do three clusters we don't have to know what a cluster is a cluster is just a group and
00:43:00.760 then I'm going to take all of my users and I'm just going to going to uh going
00:43:06.760 to uh modulo them by K so I end up randomly sprinkling them so the visual of this is I'm taking all of my
00:43:12.920 users and I'm just throwing them out on the floor in any random order that's what I'm
00:43:18.160 doing and the value I'm going to use I chose to use uh the bird the the number
00:43:23.559 of uh badges people have earned at Treehouse because I think the badges have a correlation with whether these people
00:43:28.760 are power users or casual users or stuff at Treehouse you get a badge for finishing a significant chunk of work
00:43:34.920 basically so I'm going to use badges as the one thing I'm going to measure I can use way more than one but it's easier to
00:43:40.720 work with one I'm going to just basically throw the people out on the floor don't really care how they're organized because I'm going to try to
00:43:46.920 find some order in there and then I'm going to go on to uh the actual math
00:43:53.400 math of it so um I'm basically going to for
00:43:58.520 each I'm going to find the center of each cluster remember I started with
00:44:03.640 three clusters kind of threw them all out randomly people are in these clusters I'm going to figure out the center
00:44:08.800 basically so I'm going to figure out the center using some mysterious math and then for each person I'm going to go
00:44:16.200 through all the other clusters and see if they're closer to another cluster than they are to the one they're in the
00:44:22.000 center of it so basically when I'm looking at all these people on the floor I find three visual points that are
00:44:28.599 centers and then if someone's kind of like hanging way out here on the edge between this one and this one the person
00:44:35.119 out on the edge probably really belongs over here and so I'm going to put him over there and I'm going to do that for each person and then when that's done
00:44:42.680 I'm going to do it again but I'm going to calculate new centers for everything and so what that's going to do is eventually it will stop moving people
00:44:49.200 will sort themselves out to the closest Center the center will move a little and it'll kind of separate people out slowly
00:44:55.240 it's really cool to see the visualization it's really hard to do the visualization in Ruby um so I've got a text a text visualization but I'm just
00:45:02.640 basically going to keep doing that till it's done um and at the end of that I'm going to have three groups and then I
00:45:08.760 can look at the statistics of those three groups so um you might wonder what calculate GD is calculate GD is
00:45:14.880 calculate the geomet geometric distance so you'll find for all of these algorithms calculating the distance is
00:45:20.640 the one true thing that uh differs between them there's all sorts of ways to do it I'm using a geometric distance
00:45:26.000 which is kind of you know if if you want to think about like a hypotenuse there's Manhattan distance which is like blocks
00:45:31.760 in Manhattan where you never take a diagonal because there's a building there there's all sorts of ways to do these things and that's when when uh you
00:45:38.400 know the linear algebra pays off to know some and to understand the bits and pieces so anyhow we've got the assigning
00:45:44.599 users to a segment um just curious how many people actually have the linear algebra gem installed and it's it's
00:45:50.079 probably like 10 or 12 of you right okay so I'm going to show it up here feel free to go ahead and do it but I'm going
00:45:55.359 to give you an idea of how this actually works um
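The two distance measures mentioned above can be written in a few lines of Ruby. This is a minimal sketch rather than the workshop's own distance code: the method names are assumptions, and "geometric distance" is taken here to mean ordinary Euclidean (hypotenuse) distance.

```ruby
# Straight-line distance: square root of the summed squared differences.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# City-block distance: sum of absolute differences, never a diagonal.
def manhattan_distance(a, b)
  a.zip(b).sum { |x, y| (x - y).abs }
end

here = [0, 0]
there = [3, 4]
puts euclidean_distance(here, there) # the hypotenuse of a 3-4-5 triangle
puts manhattan_distance(here, there) # three blocks over plus four blocks up
```

With a single measurement per user, such as badge count, both measures reduce to the absolute difference between two numbers.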
00:46:01.400 so we've got we can close off our go server because we don't need it
00:46:06.480 anymore we can close off this we can look at this we're going to
00:46:11.760 go ahead and open the uh what is this this is the um
00:46:18.240 cluster all right so if you skip through the code you can see there's our calculate centroid and
00:46:24.720 there's you know blah blah blah there's some math um math is not very complicated this is
00:46:29.760 all square root math and this is basically just what I said we're doing earlier when I went through all those
00:46:34.920 bits and pieces I put in a bunch of really ugly puts so we can see what's happening and it'll make sense when I run it so I'll just shut up and run it
00:46:41.880 so I'm going to cd EX3 and I'm going to pull this over so you can see
00:46:51.040 it and you'll see there's two different gems in here or sorry there's uh two different uh files here one's doing it
00:46:57.680 with AI4R so basically what that does is that hides all the details it does k-means you don't have to derive
00:47:03.160 anything you don't have to do any math it's wonderful um I'll tell you the first time I saw it just as an aside clustering I was at L comp last year in
00:47:11.680 Paris and um Tamer uh did a machine learning thing
00:47:18.160 where he basically said eight people sit at a table fill out the survey I will assign you based on machine learning to
00:47:24.520 tables because there are n or there are K tables each table can have so many people at it and he basically used a
00:47:30.839 version of this kind of clustering to figure out who should sit with who based on some interesting questions um most
00:47:36.079 people ended up just sitting where they wanted to but it was an interesting experiment um and he did not use the
00:47:41.359 linear algebra gem and the code was really really hard to understand so if
00:47:46.880 you think the linear algebra is hard to understand um doing it by hand is just horrible so anyhow enough slamming on
00:47:53.280 that so we're going to assign users to segment and so bunch of stuff happened
00:47:59.880 so I want to talk through it and I couldn't think of a big bigger way so remember we basically threw all these
00:48:05.319 users out onto the floor and so we're going through each cluster that we assign them
00:48:11.880 to we initially assigned them to a cluster and we're calculating the center of each cluster and then for each person
00:48:16.960 we look at all the other clusters and see if we should move them from the one they're in to one that is geometric
00:48:22.119 distance wise closer to them so you can see we moved a lot of people around from CL from cluster zero to one or two
00:48:29.079 because they were closer and then we went to Cluster one and we did the same thing and then we went through cluster 2
00:48:35.319 and then you see it says iterate again we're just starting again because we went through one pass of all the clusters moved people around and we
00:48:42.160 iterated again and we you know kind of skim through all these iterations and then at the end you can see what
00:48:49.240 happened the the movements got smaller right there were more and more people in the right cluster and fewer and fewer
00:48:55.359 people that needed to get moved and so eventually nobody moved and it said okay we're in a steady state let's stop
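The loop just walked through can be sketched in plain Ruby for a single measurement per user. This is a simplified illustration, not the workshop's EX3 code: it reassigns everyone in one pass per iteration rather than going cluster by cluster, the method names `kmeans_1d` and `recompute_centroids` are hypothetical, and the badge counts are made up.

```ruby
# Mean of each cluster's members; an empty cluster keeps its previous center.
def recompute_centroids(values, assignments, previous)
  previous.each_index.map do |cluster|
    members = values.each_index.select { |i| assignments[i] == cluster }
                    .map { |i| values[i] }
    members.empty? ? previous[cluster] : members.sum.to_f / members.size
  end
end

def kmeans_1d(values, k, max_iterations: 100)
  raise ArgumentError, "need at least k values" if values.size < k

  # "Throw everyone on the floor": user i starts in cluster i modulo k.
  assignments = values.each_index.map { |i| i % k }
  centroids = Array.new(k, 0.0)

  max_iterations.times do
    centroids = recompute_centroids(values, assignments, centroids)
    # Move each user to whichever center is closest.
    updated = values.map do |value|
      centroids.each_index.min_by { |cluster| (value - centroids[cluster]).abs }
    end
    break if updated == assignments # steady state: nobody moved

    assignments = updated
  end

  # Final centers for the final assignment (a no-op once converged).
  centroids = recompute_centroids(values, assignments, centroids)
  { assignments: assignments, centroids: centroids }
end

badges = [1, 2, 3, 50, 51, 52, 150, 151, 152] # made-up badge counts
result = kmeans_1d(badges, 3)
result[:centroids].each_with_index do |center, cluster|
  size = result[:assignments].count(cluster)
  puts "cluster #{cluster}: #{size} users, average #{center.round(1)} badges"
end
```

The summary loop at the end mirrors the kind of per-cluster size and badge average the talk looks at next.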
00:49:03.040 so what I did then is I spit out the cluster to see how many people were in each group and because badge count was
00:49:08.200 what was important to me I wanted to see what the badge average was so um out of the uh there were not quite 200 people
00:49:15.000 I'm sorry I don't remember we can add it up how many people were in this so 61 people were in cluster zero and they had
00:49:20.200 an average of 12 badges and if you look at what the badges are this is like I said the world's worst visualization because it's text but it was easier to
00:49:27.200 do it this way in Ruby um so you can see you know you're like okay oh you know
00:49:32.319 they're all not too many they don't have too many badges I could buy that and you look at this one well this cluster you
00:49:38.760 know has an average of 51 there's a lot fewer people in it there's 30 in this one and you look at it and just
00:49:44.200 intuitively you're kind of like yeah that makes sense I mean you know 56 39
00:49:49.280 40 all right you know just like we looked at the picture earlier we could intuitively say here's a cluster and here's a cluster and the edges might be
00:49:56.200 a little foggy but they're good enough and then when we get down to the third cluster our third cluster is like all
00:50:03.200 the people with crazy many badges um so and this is all staff we have students
00:50:08.319 with tons more but this average the centroid was around 150 so we had clusters that were really well separated the
00:50:14.040 first one was like 12 the second one was uh was uh what was it 48 um and the last
00:50:20.240 one's 150 what was the second 51 and 150 so we have really good separation here
00:50:26.240 and so maybe we do have three clusters of users I can run this again and cluster into five and I'll probably believe the clusters that come out you
00:50:32.799 know I also can put more dimensions in here it's easy to think about in two Dimensions it gets weirder to think
00:50:38.480 about when it gets multi-dimensional that's what I did in the next example I put in not only their total badges but
00:50:43.960 how many points they earned for each one of our 10 major areas HTML CSS JavaScript because I was like maybe we
00:50:50.520 have clusters of people it stands to reason JavaScript and CSS go together if you're in HTML if you're learning that
00:50:56.839 and maybe they're different from Ruby because Ruby's over here and you know there's overlap between the skills but they're probably different people and
00:51:02.960 the people that are taking WordPress are probably really different from those people too but they probably take some
00:51:08.040 HTML time too so what I did with the next one is I used um the gem to make
00:51:16.000 life easier actually sorry so linear algebra is all about matrices and vectors under the hood that's all
00:51:22.040 matrices and vectors um you already do a lot with sets which are kind of like matrices and vectors if you know SQL um
00:51:29.520 as we're finding there are way better numerical tools than Ruby okay um Ruby is good to get your feet wet Ruby's
00:51:35.799 good to do some basic stuff um Python of course is is the king for doing numerical programming and that's okay
00:51:42.200 it's okay to have something that's different R is crazy um R has been around forever it's like strangely like
00:51:50.559 Lisp in some ways which you know Lisps are hot right now and it's weird to think R is sort of like that but there
00:51:56.160 are some things about SC yeah MATLAB I was getting ready to say that so not only MATLAB but there's an open source MATLAB
00:52:03.040 called Octave which I didn't know about um until I just started I took that Stanford machine learning class to
00:52:09.079 see what it was that big giant MOOC on Coursera and they now pretty much standardize on Octave which is just as
00:52:15.160 hostile as MATLAB but $2,000 cheaper so um I mean I used to do MATLAB and
00:52:20.280 Mathematica way back when I when I was a chemist um and because those were the languages or the platforms tools
00:52:27.160 of uh of uh quantum physics and stuff like that but so R Python um Octave or
00:52:34.280 MATLAB Mathematica there are tools that are really good at this and they all have strengths and weaknesses just like for instance um R is memory bound so
00:52:42.440 it's really great there there are some weird ways around the memory binding of it but if you have a 16 gig data set you
00:52:49.240 need like 16 gigs to well you need more to get everything to run but you have to have enough memory to store your stuff R
00:52:54.960 is weird um Mo most other languages manage okay Ruby
00:53:00.040 is probably pretty bad when you get up there too and I was just going to say one more thing about k-means I should have started
00:53:05.200 with this so um you know I highlighted all the fun words vector quantization partition n observations into k
00:53:11.440 clusters which is what we did we had n users k clusters nearest mean my favorite word out of that is Voronoi
00:53:18.240 cells um which sounds like you know some sort of of uh science fiction something
00:53:24.200 apparently he's a mathematician from the 1800s late 1800s who did computational
00:53:29.599 geography so or geometry computational geometry so those are Voronoi cells so
00:53:35.520 that's apparently what we just worked with all right so the the next example
00:53:40.880 is where we talk about the alternatives to k-means so there are other clustering tools so basically it's the same thing
00:53:46.200 I've got a bunch of users I'm throwing them out on the floor and I'm going to try to put them together what we did
00:53:51.599 before is we arbitrarily chose three centers because we put people into three groups groups and then we move them to
00:53:57.599 the closest one hierarchal clustering is kind of cool what it does is it starts everybody off as their own cluster so
00:54:03.480 there's in clusters and then it looks to see if which two clusters are closest and the two closest clusters get merged
00:54:10.359 together so it kind of aerates up and then it stops when it gets to the number of clusters you say or to the distance
00:54:16.559 between the Clusters there's two different ways to have stop conditions so um if you look in the uh the AI forr
00:54:24.040 example I've got in there we're using something called complete linkage there are 11 different kinds of linkages which
00:54:31.000 are basically the way to measure the distance in the AI for arum and they will all give you slightly different
00:54:36.280 results there are different ways to agglomerate the users up and measure distance in between them the other way
00:54:42.359 is to do it in reverse the device of hierarchial clusterer starts with one cluster and then plucks the person
00:54:47.920 that's furthest out into a new one and then keeps doing that till it settles down and no people keep getting moved so
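The agglomerative procedure just described can be sketched in plain Ruby. This is only an illustration under stated assumptions, not the ai4r implementation used in the workshop: the method names (`euclidean`, `cluster_distance`, `agglomerate`), the `LINKAGES` table, and the badge counts are all hypothetical. It shows just two linkages and stops on a target number of clusters rather than on a distance threshold.

```ruby
# Euclidean distance between two points (arrays of numbers).
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Two ways to measure the distance between two clusters (hypothetical table):
# complete linkage uses the farthest pair of members, single the nearest pair.
LINKAGES = {
  complete: ->(dists) { dists.max },
  single:   ->(dists) { dists.min }
}.freeze

def cluster_distance(c1, c2, linkage)
  LINKAGES.fetch(linkage).call(c1.product(c2).map { |a, b| euclidean(a, b) })
end

# Start with every point as its own cluster, then repeatedly merge the two
# closest clusters until only k clusters remain.
def agglomerate(points, k, linkage: :complete)
  clusters = points.map { |point| [point] }
  while clusters.size > k
    i, j = (0...clusters.size).to_a.combination(2).min_by do |a, b|
      cluster_distance(clusters[a], clusters[b], linkage)
    end
    merged = clusters[i] + clusters[j]
    clusters = clusters.each_with_index
                       .reject { |_, idx| idx == i || idx == j }
                       .map(&:first)
    clusters << merged
  end
  clusters
end

# Made-up badge counts for eight users, one dimension per user.
badges = [[2], [5], [8], [60], [70], [150], [160], [180]]
groups = agglomerate(badges, 3)
groups.each { |g| puts g.flatten.sort.inspect }
```

Passing `linkage: :single` changes only how the distance between clusters is measured. On data this well separated, both linkages end up with the same three groups, which matches the point that the conclusions often do not move much.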
00:54:54.520 All of these work through similar techniques. The differences are whether they agglomerate, divide,
00:55:01.880 or kind of scatter and pull together, and the big difference is how they measure distance. You might say,
00:55:07.200 since they're all giving us different answers, what's the point of this? Well, in any sort of machine learning problem what you're doing is you're traversing
00:55:13.119 this multi-dimensional space and trying to find a minimum, and there's local minima and there's global minima, and the
00:55:18.640 goal is to get to the best answer in a reasonable amount of time, and all these tools are different ways to approach
00:55:24.880 that. You probably won't see too many differences. I ran the linkages, and you can do it too: you can just change the
00:55:30.480 name of the linkage. That's the ai4r gem; the ai4r gem is great. You can say complete linkage, single linkage, you
00:55:35.799 know, you just go through all the clusterers and see how it changes your data. It'll change the averages a little, it'll change the clusters a little, but
00:55:41.079 it probably won't change your conclusions all that much, which is what's really interesting. And if it does, it's probably because there was a
00:55:47.559 local minimum all the things got stuck in except the one that was different. So anyhow, alternatives to k-means: this is
00:55:53.240 all in the ai4r gem, and flip him back
00:56:05.760 over priz. Pretty cool, right, to be able to dig into this and see what's really going on. So ai4r, plus linear
00:56:13.359 algebra: it's worth installing the linalg gem just for getting ai4r, because it doesn't just have clustering. If you want to do ID3 decision trees and get your feet
00:56:20.119 wet with that, if you want to get your feet wet with neural networks, which I've got to say are a pain in the ass to
00:56:25.839 program. I mean, having started years and years ago, the backpropagation algorithm is like one of the most complicated
00:56:31.640 algorithms to implement, which is why you want a library to do it. Because if you are trying to make money from your users,
00:56:37.160 the last thing you want is to use the wrong algorithm. You're going through enough effort to get all the data
00:56:42.720 together and get the data clean and get the data into the system and figure out how the heck to run this stuff; you don't
00:56:49.119 want a bad algorithm on top of that. So there are far smarter people than me that do these algorithms, and I am very
00:56:54.799 happy to steal their algorithms. Same thing happens in MATLAB and Octave: there are libraries that do these, that
00:57:00.960 minimize functions. Awesome. It's good to know how it works, but it's much better to take it and use it to learn about
00:57:06.680 your users and make money instead of learning how to do a new numerical simulation,
00:57:12.280 unless you love numerical programming. If you love numerical programming, that's fine. So this ai4r gem: we grab the
00:57:19.240 users, we put them in a data set, and you can see right here, here's the money shot. So the clusterer, we're using complete linkage,
00:57:25.920 and we're telling complete linkage that we want a data set with three clusters. I mean, how much easier is it
00:57:32.280 than that? You know, there's no figuring out what the algorithms are, there's no figuring out the geometric distance, there's no math. All the
00:57:39.400 math is under the hood, and then we can spit out data about it. So I used this one to do complete linkage on the same
00:57:47.200 data set, and we get slightly different answers than we did before. And you're welcome to
00:57:54.599 do this, for the 12 of you that managed to get this
00:58:02.960 installed. Okay, so this data is just pure, just awful, but if you look at
00:58:09.200 it, this is the one I want to do. So we used two different things in here, if we
00:58:14.559 dig under the code. One, I did the same badge exercise, same people with badges. This
00:58:20.160 one came out kind of different than the other one did. There are a lot fewer people in the top one, and I should have
00:58:27.559 calculated the average. I think the average is probably a little lower for the second group, which is weird, and a lot more people are in the bottom
00:58:35.079 group. Does that change our analysis much? I'm not sure. You know, if I was trying to figure out how many badges you have to
00:58:41.000 have to be a, you know, super user up there in group three, the other one said 150; the average here is like 200. I'd
00:58:48.839 probably err on the side of 150, because I'd rather call more people super users than fewer super users,
00:58:54.880 but either way I know that I have this group of people, and it's a fairly rarefied group compared to the big group
00:59:00.440 of people down here. And if I'm really focused on marketing, maybe what I want to do is try to figure out how to get more of these people up there. And I know
00:59:07.160 that, you know, somewhere there's this magic point between 30 or 40 badges and up around 100 where they jump from one
00:59:13.400 group to another. What can I do to figure out how people get from that one group to another? The stuff down here at the
00:59:18.760 bottom is, I used something even more incomprehensible: I put in all of their
00:59:24.119 point earnings in the 10 major categories; that's why there's 10 chunks of data here. So that's saying I don't
00:59:29.760 care about your total points, I care about the point distribution among WordPerfect, sorry, WordPerfect, that
00:59:35.119 just dates me. Whenever I see WP I think WordPerfect, because it's just built into my DNA, and it means WordPress
00:59:42.480 where we are, which never crosses my mind as a tool I'd use, so sorry to the WordPress people. Anyhow, WordPress,
00:59:49.119 design, PHP, HTML, JS, CSS: that's what all these numbers are, how many points people earned. So I was digging down into the
00:59:55.200 points. This is still pretty incomprehensible, but you can see we have groups of people, you know, that based on
01:00:01.640 their points we could say, well, what do we know about those points? We could start aggregating the bits and pieces. We could throw this on a visualization,
01:00:07.079 which would help a ton, because this just looks like vomit. And yeah, so now
01:00:12.319 we've done two different kinds of grouping. If I want to change this from complete linkage to single linkage or one of the other ones that's supported
01:00:17.799 by ai4r, I change, you know, that one line of code to use the different linkage and I see if my results change significantly.
01:00:23.720 So now we've done the first of our really nasty sets of data analytics.
01:00:30.200 We've worked with the different kinds of clustering algorithms to take our
01:00:35.240 users and to segment them in different ways. I mean, we described them: we said they're male or female, we said where they're from. We could have used that as
01:00:42.240 input into here. We could have said male or female, latitude, longitude, cluster like that. Maybe we have a big female
01:00:48.640 following in the UK at Treehouse; that would be an interesting piece of data to know. Maybe we have a huge male following
01:00:54.720 in the Pacific Northwest. I mean, I think that's probably a given with the subject matter that we have, but who knows? Again,
01:01:00.319 that would be fun stuff to find out. I was interested in badges, and so that's what we were starting to dig out. And,
01:01:05.559 you know, visualizing this is kind of best done if people use things like D3, or I guess you could use some of
01:01:13.640 the ggplot things. Some of this ends up for me going into R, so I use some of the R tools to show it off. But this lets
01:01:19.960 us get Ruby results pretty quickly. You saw that that one ran for a little longer, though; there was a definite pause when that one ran. It was taxing my
01:01:26.440 little MacBook Air here a lot. Final thing: you all are troopers for dealing with the Wi-Fi situation and the
01:01:34.119 linear algebra compilation problems. What we're going to do in the last bit, getting back to the
01:01:41.280 talks: let's talk about likes, things that are like each other. Let's talk about similarity.
01:01:47.559 So again, I'm interested in how people collaborate, how people recommend
01:01:54.920 things to one another, how people find people who are similar to them. And
01:01:59.960 everybody here, I would bet, at some point has used something like Netflix or something similar that tells you what
01:02:05.359 you're going to like based on how you've rated movies or how you've rated purchases or something like that. So
01:02:12.039 again, we're starting with important properties; like before, we were using badges or point totals. This is pure
01:02:17.480 linear algebra and this magic thing called SVD, so singular value decomposition. It's one of a handful
01:02:25.039 of techniques for simplifying complex matrices. And so basically the gist of
01:02:31.920 SVD is you have users and ratings of something, you know, all the users,
01:02:37.920 all the movies. And I think I'm going to do something out of order here and give
01:02:43.559 you a visual. So we're going to come back to how the math gets calculated here,
01:02:48.720 but basically what we're doing is we're taking all the Netflix users and all the movies. Imagine how huge that matrix is,
01:02:56.000 you know, just in your mind, just this giant matrix, and we're collapsing it down using math into a two-dimensional
01:03:03.559 space. There are very clear proofs you can do if you care about linear
01:03:09.240 algebra proofs. They make my head explode, so I'm willing to trust the people that are much smarter than me that
01:03:14.559 have done the math. But basically we're going to summarize everybody onto a board like this, and then we're going to
01:03:20.559 take someone new, Bob in this example. We're going to throw Bob on the board by taking this greatly reduced
01:03:26.839 simplification that SVD gives us, putting Bob through a mathematical
01:03:32.240 process, and put him on this board and then see who he's similar to. And we're going to use something called cosine similarity,
01:03:38.160 which basically measures the angle that Bob has from the origin and
01:03:43.319 finds people that are very close to the line that he draws. So that line from the origin through Bob says this is Bob's
01:03:49.559 space of what Bob likes, or a good approximation of it, and we find the closest people and say, oh well, Ben and
01:03:55.359 Fred are probably really good people to use to recommend things to him.
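Cosine similarity itself is only a few lines of Ruby. The sketch below is an illustration, not the workshop's code: the helper names are hypothetical, the 2D coordinates for Bob, Ben, and Fred are made up, and Tom is an invented third user added for contrast.

```ruby
# Dot product of two equal-length vectors.
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

# Length of a vector measured from the origin.
def norm(a)
  Math.sqrt(dot(a, a))
end

# Cosine of the angle between two vectors: 1.0 means they point the same
# way from the origin, 0.0 means they are at right angles.
# Returns nil for an all-zero vector, where the angle is undefined and the
# division would otherwise produce NaN.
def cosine_similarity(a, b)
  denom = norm(a) * norm(b)
  return nil if denom.zero?

  dot(a, b) / denom
end

# Made-up positions on the two-dimensional board.
bob = [-0.38, 0.04]
others = {
  'Ben'  => [-0.45, 0.05],
  'Fred' => [-0.40, -0.10],
  'Tom'  => [0.30, 0.70]
}

# Rank everyone by how closely their direction matches Bob's.
ranked = others.map { |name, coords| [name, cosine_similarity(bob, coords)] }
               .sort_by { |_, score| -score }
ranked.each { |name, score| puts format('%-5s %.3f', name, score) }
```

Because only the angle matters, a user with ten times Bob's scores in the same proportions would still count as a near-perfect match.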
01:04:00.760 That's amazing if you think about taking giant sets of users and ratings and
01:04:06.160 collapsing it down into something where we can just throw Bob in against one simple matrix, spit out a number, and do
01:04:13.480 a distance calculation. It's amazing how that works. It is
01:04:19.760 nuts. So this also looks like the terminal vomited a bit. We're going to
01:04:25.240 make matrices in the linear algebra gem. I'm going to run through this real quick. The key part is right here: U,
01:04:34.440 Sigma, and V transpose; that's why those are named like they are, even though later on I transpose the VT transpose. The magic
01:04:40.960 happens here: there's actually a singular value decomposition method in the linear algebra gem. That's why you want it,
01:04:47.160 because it saves you having to do matrix math by hand, because matrix math by hand is just stunningly nasty, as that Lon
01:04:54.319 exercise that Tamer did would demonstrate. So what that does is that
01:04:59.720 basically ends up giving you just a two-dimensional representation of
01:05:05.680 this giant matrix. And the singular value decomposition theorem is that
01:05:12.240 you can take a matrix, and any matrix can be decomposed into the U, the Sigma, and the V transpose. You know, magic,
01:05:19.319 magic, magic happens, and then you end up extracting a two-dimensional space from it, and now you can magically apply
01:05:26.400 users. It's a black box, right? And unless you're writing the code for it, you don't really care about that black box.
01:05:34.520 So I hate to belittle the fact that there's really complex math in there and that, you know, it should be invisible, but
01:05:40.480 it should be invisible, because you don't care about that, you care about the users. You just want it, right? Anyhow, so we're going to use that magical singular
01:05:47.599 value decomposition method to get out the matrices we want. Then we're going to flatten it into 2D space, because a
01:05:54.880 theorem says we can, and then we're going to take Bob, we're going to put Bob's
01:06:01.279 values, we're going to turn Bob into a matrix, and then we're going to multiply him by the bits and
01:06:07.880 pieces that we extracted from that other matrix using math, and then we're going to use more matrix math to magically
01:06:16.440 calculate the cosine similarity by using normalized matrices and matrix dot
01:06:22.119 transpose, and I'm just going to hand-wave from there. I mean, I can sort of follow the math, but that's the magic of SVD.
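For anyone who wants the hand-waved steps written down, here is one common way to state them. It rests on an assumption the talk does not spell out: the ratings matrix $A$ has one row per item and one column per user. The symbols $b$, $\hat{b}$, and $v_j$ are introduced here for illustration only.

```latex
% Decompose the ratings matrix, then keep only the two largest singular values.
\[
  A = U \Sigma V^{T}
  \qquad\Longrightarrow\qquad
  A \approx U_{2}\, \Sigma_{2}\, V_{2}^{T}
\]
% U_2 and V_2 are the first two columns of U and V;
% \Sigma_2 is the top-left 2 x 2 block of \Sigma.

% Row j of V_2 is existing user j's position on the 2D board.
% A new user with rating row vector b (one entry per item) lands at
\[
  \hat{b} = b\, U_{2}\, \Sigma_{2}^{-1}
\]

% Cosine similarity between the new user and existing user j (row v_j of V_2):
\[
  \cos\theta_{j} = \frac{\hat{b} \cdot v_{j}}{\lVert \hat{b} \rVert \, \lVert v_{j} \rVert}
\]
```

The fold-in formula follows from the decomposition: $A^{T} U = V \Sigma$, so each existing user's column $a_j$ of $A$ satisfies $v_j = a_j^{T} U \Sigma^{-1}$, and a new user is placed on the board by the same map.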
01:06:30.319 And then you end up with being able to loop through all the users that are
01:06:36.079 similar to Bob and decide who's similar enough to him to recommend to him. So we end up here. So I'm going to run through this
01:06:43.720 real quick and then we will wrap up and talk about questions. SVD is the one that
01:06:50.039 I do least, as you can probably tell. We don't do a lot of recommendations at Treehouse, so
01:06:56.760 everything I've done with SVD is kind of more, what's the best word
01:07:01.960 to put it, more exploratory. What am I thinking of? Hang on. That's
01:07:07.520 what it is. So what I'm doing again is to figure
01:07:13.520 out similar users. I'm trying to figure out, based on the points they've earned in various courses, who would be most
01:07:20.960 like, what course we should recommend to people based on who they are most similar to, whether we should put more of them in HTML, put them more in a CS track,
01:07:28.480 put them more in a Ruby track, an iOS track, what have you. And the easiest way is probably just to run it. So we're
01:07:34.359 going to get rid of this, we're going to go over there, and we are going to
01:07:46.799 Ruby. Okay, so what we're doing, it's a
01:07:52.680 little off screen, what we're doing is we're taking all the users, and we use the same users for all of these things.
01:07:58.200 Oh, it wrapped, that's what it did, sorry. And you'll see on one line a bunch of our users in this test that never earned
01:08:04.160 anything. They're mostly test users, and if you put zeros in, zeros blow up to not-a-number, and zeros aren't useful for
01:08:09.680 recommendations, because how could someone who's never done anything give you a valid recommendation? So there's math and logic reasons for throwing them
01:08:16.199 out. So we've got 85 users left. We're going to get Bob's point scores; those are
01:08:21.400 actually my point scores on Treehouse, because like I said I don't do a lot of exercises, and then I'm going to find all
01:08:28.640 the similar users. I like that the most similar user in one way is an unsubscribed user from our test database;
01:08:33.679 that made me laugh. We've got .999 similarity, and, you know, here's
01:08:39.279 another user down here that's got .997. Pan, who runs our conference programs, is .996. Demo Demo is pretty
01:08:47.040 similar to me, that's great, and for some reason our guy who does finance is really similar to me,
01:08:53.239 which surprised me because he basically does accounts. I didn't actually know he had ever earned points, so that was
01:08:58.759 exciting. And honestly I don't know who Luke is, so I think Luke is an old employee or fake, looking at those
01:09:04.440 scores. Anyhow, so I put in my scores. I wanted to find the people that
01:09:10.040 are similar. You see I have a whole bunch of zeros for tracks I haven't done, and the goal here is what track should I do
01:09:15.359 next, because the SVD is best at saying, if you haven't done this, this is one you'd like. And unsubscribed user is
01:09:22.920 the most similar to me, and it's suggesting I start JavaScript,
01:09:28.279 CSS, or WordPerfect. Joking, I know it's WordPress. So it's suggesting that
01:09:34.880 those are the tracks I should start, in order of which one is probably the one that I would like most. And now I know
01:09:41.719 that I could tell the user, hey, JS track might be where you want to go next, based on what you've done and the recommendations of all of our users.
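One simple way to turn the most similar user into an ordered list of suggestions is sketched below. The talk does not show how its script does this step, so treat the approach as an assumption: the `TRACKS` ordering, the `suggest_tracks` helper, and every point value are hypothetical.

```ruby
# Track names in a fixed, made-up column order.
TRACKS = %w[WordPress Design PHP HTML JavaScript CSS].freeze

# Suggest tracks the user has not started (zero points) but the most similar
# user has, ordered by how many points that neighbor earned in each.
def suggest_tracks(my_points, neighbor_points)
  TRACKS.each_index
        .select { |i| my_points[i].zero? && neighbor_points[i].positive? }
        .sort_by { |i| -neighbor_points[i] }
        .map { |i| TRACKS[i] }
end

# Invented point totals per track, in TRACKS order.
me       = [0, 120, 0, 300, 0, 0]
neighbor = [40, 100, 0, 280, 210, 90]

puts suggest_tracks(me, neighbor).join(', ')
```

With these numbers the suggestions come out as JavaScript, CSS, then WordPress. PHP is skipped because the neighbor has no points there either, so there is nothing to base a recommendation on.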
01:09:50.000 SVD, we doing good? I'd give you a minute to run it, but I know linear algebra is
01:09:55.120 just going to blow up for everybody who doesn't have it installed, and you can run it whenever you want now, which is awesome
01:10:02.960 too. So to wrap up, the goal was to use Ruby to answer questions about your users and
01:10:08.719 your business, and I want to make sure you left with some tools, because you gave up two slots, and seeing Sandy talk, to
01:10:15.400 watch this, and fought with linear algebra and fought with Wi-Fi, and we're back here in the furthest part of the
01:10:21.719 dungeon in the basement that was possible. So I wanted to make sure you had something useful to take out of
01:10:26.920 here. So, you know, I keep thinking about a black box, and there are good black boxes and there are bad black boxes. A lot of
01:10:34.080 the machine learning, for our intents and purposes, especially if you're a Ruby person, can be a black box. You don't have
01:10:40.400 to know the details of SVD if you can get it implemented right. You don't have to know how a neural network
01:10:47.239 backpropagation algorithm works or an ID3 decision tree works if you can find one and use it. You
01:10:53.719 have to know what the goal of it is, and, you know, you have to know the foundation of what kinds of questions you can ask with it, but you don't
01:10:59.159 actually have to implement the math. And that's good, because the math isn't what's exciting unless you work at
01:11:05.159 MATLAB. The math is what gets you to having a better plan, right? Because we
01:11:10.320 go back to our friends, you know, the Underpants Gnomes that were at the very beginning. A lot of times we've got all
01:11:15.480 these users in our table, and there's got to be a way to make more money from them. At Treehouse we had a really interesting
01:11:21.640 discussion about how to increase ARPS, our average revenue per subscriber. We have two plans. What's the
01:11:26.880 best way to increase your average revenue per subscriber if you're a subscription-based business? Anybody?
01:11:32.320 Raise the price, thank you. So after that got shot down, we have multiple
01:11:37.440 tiers, we have silver and gold, and the only way ARPS can go up if you can't raise the price is to move people from the lower tier to the
01:11:44.440 higher tier. And so the goal was to figure out what we can do to move people from a lower tier, where they pay
01:11:50.199 a lower amount per month, to a higher tier. And so the way to start with that was to figure out more
01:11:56.120 information about our higher tier users. We're transitioning from calling them
01:12:01.600 gold to Pro, and I'm not sure which they are now, but so our gold or Pro users versus our basic silver users:
01:12:06.719 how do we get people to move from one to the other? And before we can do that we need to understand more about them. And we can offer discounts, we can treat them
01:12:12.560 as a big agglomerated mess of people who are all the same, or we can treat them individually. And the kinds of individual
01:12:19.960 treatment we were doing, we were looking at the male-female ratios, you know, do women and men do it the same way as far as silver and gold?
01:12:25.800 Does time zone make a difference? Because, you know, in foreign countries we have far fewer people in
01:12:32.080 gold, for instance. So it was fascinating, and the goal at the end of all this, right, was to use some black boxes
01:12:39.440 to have a better business plan so we can roll in money, because rolling in money is usually the goal that keeps
01:12:45.400 you employed. And that is what I hope the tools that you have from this
01:12:51.520 can kind of help you get started with. Real quick, I'm going to do credits and then I'll do questions. So thanks to the RailsConf team, which
01:12:58.199 is awesome for having me do this. Jeff runs the workshops. If you don't know Jeff, gSchool is, sorry, Turing.io is
01:13:04.520 awesome; it's also where Katrina works. They're great people. I work at
01:13:10.120 Treehouse. I have all sorts of Treehouse stickers, three different kinds, up here. I like taking the train, so
01:13:17.400 I ended up coding a lot on the train on the way out here. Thank you, Amtrak. And that is my contact info, which
01:13:23.120 will come back up, no worries. Real quick, recommendations, because people always ask where to look and what to do with
01:13:29.000 stuff. O'Reilly has great data science books, and they go on sale fairly regularly, and they also have something
01:13:35.480 called the data science toolkit, which is like five books or seven books, I forget however many it is, that goes on sale
01:13:41.199 pretty regularly, and I think all these are in it. None of them talk
01:13:47.080 about Ruby, so just be prepared, right, because Ruby is not the optimal language for this. But you can read about SVD in
01:13:53.639 one of these books and then go implement it or use one of the implementations of it in Ruby. You can read about what an
01:13:58.719 ID3 decision tree looks like and then come back and do it in Ruby. So these books certainly have their place, and
01:14:04.120 these are some of the ones I found most useful. I also really like Lean Analytics, from the Lean Startup
01:14:10.239 series. Even better, if you like more class-oriented stuff, Coursera: that one
01:14:17.120 on the far left is Stanford's well-known machine learning class, which is brutal. It uses Octave, teaches you
01:14:24.159 Octave, which is kind of open source MATLAB, and this week is actually backpropagation, so I'm taking my pass,
01:14:30.639 because I'm here at RailsConf, from doing the homework this week. And Coursera has a couple other ones. Johns
01:14:35.880 Hopkins has a whole data science theme. The first one is a gimme for anybody that's interested in it, because
01:14:41.679 basically by the end of it you have put a Markdown document on GitHub, so for most Ruby people the bar is pretty low for the first class. Then it's intro to R,
01:14:48.679 then it's data scraping and cleaning, and then it gets into using R to do stuff. There's like nine parts. It's one of the specializations they offer so they
01:14:55.280 can actually get money. So for like 500 bucks you can get a certificate; for $0 you can take the same stuff and do
01:15:01.800 the same homework and get all the same learning. That horribly copied image there
01:15:09.159 is the Try R thing. Code School has a Try R that they did in association with O'Reilly. So R is worth it just to
01:15:16.440 appreciate Ruby and to try a Lisp and to actually be able to do some really,
01:15:21.560 really elegant mathematical calculations and plot some horrible graphs. So all
01:15:26.719 those are tools of the trade, you'll see them. And then that's me. I would
01:15:33.239 happily answer questions. You can hit me up on Twitter, where, you know, I may or may not respond because I only Twitter
01:15:39.000 at conferences. I will be around today and tomorrow. I'll sit in here; we can try to
01:15:44.400 get the linear algebra gem installed if you're really upset that we couldn't do it. It's really internet related for the most part, and then the f2c thing. So
01:15:51.480 today: Fortran, C, Python, Go, Ruby, right? And then you also can go
01:15:59.400 home with users that you can figure out their gender, figure out their location, and then one day when you get the
01:16:05.560 linear algebra gem finally installed for real you can go ahead and do either SVD or k-means clustering. K-means
01:16:13.000 is so fun. I mean, it's ridiculous to talk about math like that. I love seeing what you can do with crazy clusters. You
01:16:19.120 want to cluster your people into 100? All right, what happens, what do you find from that? Cluster them into two groups:
01:16:25.040 wow, I didn't ever know that. Cluster among crazy things like points earned and how long they've been a member:
01:16:32.080 doesn't seem so crazy, maybe there is a correlation there. All sorts of cool things you can do with it. So I hope it was worthwhile. Thanks for skipping Sandy
01:16:37.760 to do this, thanks for skipping the other slot. I'm answering questions. Thank