LinkedIn – I am currently a data scientist at LinkedIn, one of the world's most advanced big data companies.
LivePerson – I previously worked at LivePerson, where I was the first person hired to build their big data solution, so I have experienced both the very beginning of big data solutions and the cutting edge.
I will share with you the lessons I've learned while working on big data at both ends of the spectrum.
I also have a business degree from the Technion (Israel Institute of Technology) and a computer science degree from Ben-Gurion University.
This is what I am going to talk about. I chose these subjects because they answer the questions that were most pressing for me, both when I was starting with big data and when I was perfecting my craft.
The term "Big Data" is used in many ways, so before I start talking about big data, I want to explain what it is.
Yes, there is an entry in the Oxford English Dictionary for Big Data
The main word here is "standard". Before Big Data, standard methods and tools were enough to process the data we had; now they are not. So what happened?
Data created opportunities, which in turn created demand for even more data, and the amount of data in the world grew larger and larger.
So what are those big data opportunities I've mentioned? The best way to see is through examples.
Amazon, the e-commerce giant, analyzes data about its shoppers. It analyzes what products they are looking at, what products they are searching for and, most importantly, what products they are buying. This analysis enables Amazon to produce a product I am sure you have all seen ...
Here we can see that if I look at the book "Big Data Analytics", Amazon recommends other, similar books. -- Show increase in sales -- So why did it increase sales so much? The logic here is simple: the more products customers see, the higher the chance they will buy something. Amazon wants to show us as many products as it can in order to get us to buy something.
My second example is Netflix. Netflix is an American company that started as a DVD rental service and quickly became a streaming platform for movies and TV shows. It has about 30 million subscribers. At the end of each movie, Netflix asks viewers to rate the movie they just watched. Netflix has billions of movie ratings from millions of users, and it uses this data to create the following product.
Using our rating history, Netflix calculates a unique "taste" for every one of its subscribers and uses this taste to recommend movies to them. This product is so important to Netflix that in 2006 it offered a prize of one million dollars to whoever could improve its algorithm by more than 10%. -- Show statistics -- So why is this recommendation engine so important? The more movies users find that they like on Netflix, the longer they will keep their subscription, which earns Netflix money.
My third example is a small Israeli startup. Waze is a GPS mobile app that tracks where people are and how fast they are travelling.
Waze uses this data to compute traffic maps that show which streets have traffic jams, and it routes you according to this data, providing much better route suggestions than apps that don't use traffic information. After gaining more than 50 million users for its app, Waze was acquired by Google for about 1.1 billion dollars. Side note: I understand there will be a talk later today by a Korean company that does something very similar.
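To make the idea concrete, here is a minimal sketch of how speed reports could be turned into a list of jammed streets. This is not Waze's actual method; the function name, the 40% threshold and the sample data are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical threshold: a street counts as jammed when the average
# reported speed falls below this fraction of its speed limit.
JAM_RATIO = 0.4


def find_jammed_streets(reports, speed_limits):
    """Return the set of streets whose average reported speed is too low.

    reports: iterable of (street, speed_kmh) samples from phones.
    speed_limits: dict mapping street -> speed limit in km/h.
    """
    totals = defaultdict(lambda: [0.0, 0])  # street -> [sum of speeds, count]
    for street, speed in reports:
        totals[street][0] += speed
        totals[street][1] += 1

    jammed = set()
    for street, (speed_sum, count) in totals.items():
        average = speed_sum / count
        if average < JAM_RATIO * speed_limits[street]:
            jammed.add(street)
    return jammed


# Made-up sample data for illustration only.
reports = [("Main St", 10), ("Main St", 14), ("Ocean Ave", 48), ("Ocean Ave", 52)]
limits = {"Main St": 50, "Ocean Ave": 50}
print(find_jammed_streets(reports, limits))
```

A routing step could then avoid, or penalize, the streets in the returned set.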
The above examples, and many more, lead me to the first lesson I've learned about big data
These are great examples. But to dive even deeper into big data applications, let's look at the company I currently work for, LinkedIn. Since we said that Big Data is more about business than data, let me first show you what LinkedIn's business is.
LinkedIn is the largest professional social network in the world. It has more than 225M members. Our largest markets today are North America and Europe, but Asia is growing very well too, with several countries having more than a million members on LinkedIn.
Not only does LinkedIn have a lot of members, it also makes significant revenue. Across its 3 business lines, LinkedIn made almost a billion dollars last year and about 325 million in the first quarter of 2013.
These 3 product lines are Premium Subscriptions, Marketing Solutions and Talent Solutions. Let's dive more deeply into each one of them to understand them better.
The premium subscriptions business is for LinkedIn users who want extra features on LinkedIn. Those features include better analytics about who viewed their profile and the ability to contact anyone on LinkedIn through InMail, LinkedIn's personal messaging system. This product really separates LinkedIn from other social networks, because some of the users of the network pay extra to use it.
Marketing solutions is more similar to what you can find on other social networks. We offer companies the ability to market their products to our members. Since LinkedIn is a professional network where most members have a job, often a lucrative one, the target population is very appealing to marketers.
Our third product line, and the largest in terms of revenue, is talent solutions. Here companies like Sony, Walmart and L'Oréal pay for their recruiters to have additional functionality for their recruiting needs. This is almost like another product inside LinkedIn for our recruiter members. This product line brings in about 57% of LinkedIn's revenue.
LinkedIn's number 1 mission is connecting talent with opportunity: both helping companies find new talent and helping our 225+ million members find new opportunities when they need them. One of the first big data applications at LinkedIn was to help members find a new job, and I will now dive deep into how it was done.
JYMBII (Jobs You May Be Interested In) is a big data product that matches members with job postings on LinkedIn. For example: here is me, and some of the jobs companies posted on LinkedIn. For every job, we compute a score for how good a fit the job is for the member. Here you can see that I am a good match for a data scientist position at Facebook, and not such a good match for a product manager position at Yahoo.
After creating scores for all the jobs in our database, we show a small widget on our homepage where every member can see their top matching jobs.
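As a rough sketch of the last step (this is an illustration, not LinkedIn's code), picking the jobs to show in the widget is just a top-N selection over the per-job scores. The function name and the score values below are made up.

```python
import heapq


def top_jobs(scores, n=3):
    """Return the n job titles with the highest fit score, best first.

    scores: dict mapping job title -> fit score (higher is better).
    """
    return heapq.nlargest(n, scores, key=scores.get)


# Hypothetical scores for one member; the numbers are invented.
scores = {
    "Data Scientist at Facebook": 0.92,
    "Product Manager at Yahoo": 0.31,
    "Data Engineer at Example Corp": 0.77,
    "Sales Lead at Example Inc": 0.12,
}
print(top_jobs(scores, n=2))
```

Using `heapq.nlargest` avoids sorting the whole job list when only a handful of jobs fit in the widget.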
I will walk you through the 3 pillars of every big data product – Design, Algorithms and Infrastructure/Framework.
Let's start with design. In a consumer-oriented company, design is very important, because this is how users interact with your product. Also, in many cases, design is the hardest thing for a single small team to change because so many teams are involved. In most companies the big data team is separate from the team that works on the main product, so those of you who have already started implementing big data solutions probably know how difficult it is to run tests on the main product. Do anything you can to avoid depending on other teams in your organization when you test your big data solutions. When LinkedIn's Data Science team decided to build JYMBII, they wanted a very simple way to test whether their product was working without making too many changes to the main site. This is how they did it: they started with email. Here you can see how the actual email looks today, where I got some recommendations for jobs I might be interested in. They chose email because it lets you test your product on a small subset of users, without affecting everyone who comes to your website and without making any changes to the main website.
After the initial emails showed great success and proved that people were actually interested, our team built this very small widget that shows the top jobs you might be interested in. Again, it was done with minimal integration with the main website, by having this widget replace one of the ads we had on the site for a certain percentage of the users.
After the great success of the widget, jobs now have their own section on the LinkedIn website, where users can search for jobs and more. Having the jobs section resulted in 1000 times more users looking at LinkedIn jobs than before. Remember, JYMBII did not start with its own section of the website, but grew to have one.
My main message about how to design data products is to start simple and grow with success.
Let's now talk about algorithms, or how LinkedIn matches members with job postings. The first iteration of the algorithm was very simple. We look at the member's profile, we look at the job posting and we do keyword matching, very similar to how recruiters screen candidate resumes for a potential match. In this example we can see that my profile is a pretty decent match for this job opportunity. There is no need for a natural language processing expert or a computer science PhD to implement this algorithm. It is pretty simple and worked pretty well for our first prototype.
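To show how little is needed for a first iteration, here is a minimal keyword-matching sketch. It is an illustration under my own assumptions, not the original LinkedIn code: the score is simply the fraction of the job posting's words that also appear in the profile, and the function names and sample texts are hypothetical.

```python
import re


def keywords(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_match_score(profile_text, job_text):
    """Fraction of the job posting's words that also appear in the profile."""
    job_words = keywords(job_text)
    if not job_words:
        return 0.0
    return len(keywords(profile_text) & job_words) / len(job_words)


# Made-up texts for illustration only.
profile = "Data scientist with Hadoop and machine learning experience"
job = "Seeking data scientist, Hadoop experience required"
print(round(keyword_match_score(profile, job), 2))
```

This version does not even remove common words like "with" or "and"; a stop-word list would be an obvious next step once the prototype proves people want the product.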
When the first prototype of the email succeeded, the team moved on to improve the algorithm a bit further, adding features like education and experience, which are also very important for determining a candidate's fit for a position. These changes improved the recommendations even further, resulting in more people engaging with jobs on the LinkedIn website.
Finally, now that we have our jobs page on the website, where users can search for jobs, save jobs and apply for jobs, we can use all of these signals to recommend jobs similar to the ones users found themselves. All of these improvements resulted in 50% more accurate job recommendations for our members.
The message for algorithms is the same as it is for design: don't try to implement something very difficult before you know your customers even want it. Start simple and grow with success.
Here is a quote from a Twitter engineering manager that I like very much. What it says is that most of the time, Hadoop doesn't solve a big data problem; it actually brings a set of new problems to deal with, even before we know that what we are trying to build is worth building.
The first JYMBII prototype was developed using very simple technology: Oracle, with some Perl scripts and some shell scripts in between. The process involved someone copying files manually from one computer to another, running some scripts on that computer and then copying back the results. The process was so inefficient that it took 6 weeks to run. But 6 weeks is better than never.
After the success of the initial product, LinkedIn decided to make an infrastructure investment and buy a parallel database from companies like Greenplum and Aster Data. This sped up the process so that it ran in a single week instead of 6.
Eventually LinkedIn not only moved to Hadoop but also built its own infrastructure, with projects like Kafka, Voldemort and Zoie. You can find more information about them on the LinkedIn open source page. Now we are generating new recommendations every day, which is more than 40 times as often as every 6 weeks. You have probably figured out the second lesson by now ...
One of the most important questions, and one that kept me busy for a long time as well, is where you find big data experts. Before I give you the answer, I would like to show you 2 graphs.
Here you can see that at the beginning of 2011 the demand for big data experts was 30 times higher than the year before. Now it is even higher. Everyone is looking for big data experts.
Here is a graph from LinkedIn's own analytics team. Here you can see that 33% of the people who started a job as data scientists or analysts are new to this job. You can probably see where I am going with this: a large share of the people who work in big data are new to big data. LinkedIn realized this quickly, and here is the proof ...
Here is an actual LinkedIn job posting from 2008, when LinkedIn had just started with big data. The key message is this ... No specific technical skills are required. Here is an example of how LinkedIn has applied this strategy with 2 of my colleagues.
Joseph Adler came to LinkedIn from Netflix, where he did Operations Engineering. Now he is one of our top experts on big data and has even written a very successful book about it.
Jason is a new data scientist at LinkedIn. Prior to that he was a radar signal processing expert. He is still just at the beginning of his career at LinkedIn, but so far he is doing very well and educating himself quickly.
My third lesson is a bit hard to swallow, but if you follow my previous 2, it becomes easier. Look for big data experts everywhere and at all times, but don't let the search stop you from starting your projects.
So how do you start a big data project? I would like to show you a very simple recipe you could follow
As always, in order to make it clearer, I will use an example to guide us through the recipe. People You May Know is a LinkedIn Big Data product that traverses your profile and the entire LinkedIn graph to suggest people you should connect with. Let's see how we can use our recipe to create big data applications such as People You May Know.
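For a feel of what "traversing the graph" can mean in its simplest form, here is a sketch that ranks people by the number of connections you share with them. This common-connections heuristic is my own simplification for illustration, not LinkedIn's actual algorithm, and the graph and names are made up.

```python
from collections import Counter


def people_you_may_know(graph, member, n=3):
    """Suggest non-connections ranked by number of shared connections.

    graph: dict mapping member -> set of that member's direct connections.
    Returns a list of (candidate, shared_connection_count), best first.
    """
    direct = graph[member]
    common = Counter()
    for friend in direct:
        for candidate in graph[friend]:
            # Skip the member themselves and people already connected.
            if candidate != member and candidate not in direct:
                common[candidate] += 1
    return common.most_common(n)


# Tiny made-up graph for illustration only.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan", "eve"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat"},
    "eve": {"bob"},
}
print(people_you_may_know(graph, "ann"))
```

Here "dan" ranks above "eve" because he shares two connections with "ann" while "eve" shares one.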
Important business metric – How often members visit the website.
Correlating factors – How many new items they have in their news feed. But that is not the root cause; something else is affecting it.
Causing factors – How many connections they have.
Product – Recommend new connections to users: People You May Know.
Beware of the second-system effect. How many of you have been involved with projects where the first prototype was pretty successful and the second one was much bigger and failed?