This document discusses tools and techniques used at LinkedIn for building data products, including Pig, Java, R, Hive, Voldemort, Kafka, and Azkaban. It describes how LinkedIn uses Pig extensively for its conciseness and expressiveness. DataFu is introduced as an open source library of user-defined functions (UDFs) for Pig that was created to share useful UDFs developed by different teams at LinkedIn. Examples of UDFs in DataFu include Assert, Coalesce, In, and CountEach, as well as UDFs for sessionization and for performing left joins of multiple relations efficiently in a single MapReduce job.
Today I'm going to talk about how we use Hadoop at LinkedIn to build products with data.
So far we have covered building data products at a high level. Now let's look more closely at the tools we use to work with the data.
This is a non-exhaustive list of some of the tools we use to develop data products at LinkedIn. I'm going to only focus on Pig here.
I will focus on Pig for the remainder, because it is used so heavily within LinkedIn for building data products.
Next I will talk about DataFu. The thing I want you to get out of this is that UDFs are very useful and you can write them yourselves. When you are writing Pig code, think about whether a problem could best be solved with a UDF. The advantage of UDFs is that they are reusable.
We use Coalesce because, with endorsements, we join features to candidates for ranking purposes. There may not be a feature corresponding to a candidate, in which case the joined value is null and we want to replace it with zero.
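To make the null-replacement idea concrete, here is a minimal sketch in plain Python rather than Pig. It only illustrates the behavior described above (take the first non-null value); it is not DataFu's implementation, and the candidate names and feature values are made up for illustration.

```python
def coalesce(*values):
    """Return the first argument that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None


# Hypothetical candidates after joining in a feature; None marks a missing feature.
candidates = [("alice", 0.8), ("bob", None), ("carol", 0.3)]

# Replace a missing feature with zero so the candidate can still be ranked.
scored = [(name, coalesce(feature, 0.0)) for name, feature in candidates]
```

Note that the check is against None specifically, so a legitimate feature value of zero is kept rather than treated as missing.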
CountEach is used by endorsements. We recommend items to members and want counts of those items to improve our algorithms.
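As a rough illustration of what counting each item in a bag means, here is a plain Python sketch. It is an analogy for the behavior, not the DataFu code, and the list of recommended items is hypothetical.

```python
from collections import Counter


def count_each(bag):
    """Return (item, count) pairs for every distinct item in the bag, sorted by item."""
    return sorted(Counter(bag).items())


# Hypothetical items recommended to members, with repeats.
recommended = ["java", "pig", "java", "hadoop", "pig", "java"]
counts = count_each(recommended)
```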
There are also non-streaming versions of median and quantiles, but these are less efficient because they require the input data to be sorted.
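The sorting requirement is easy to see in a small sketch. The plain Python below computes an exact quantile by sorting all values first and then indexing into the sorted list. The indexing rule used here is one simple convention chosen for illustration and is an assumption, not necessarily the rule DataFu uses.

```python
def exact_quantile(values, q):
    """Exact quantile for 0 <= q <= 1: sort every value, then pick by position."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)  # the whole input has to be sorted before indexing
    index = int(q * (len(ordered) - 1))
    return ordered[index]


# Hypothetical measurements, deliberately out of order.
samples = [5, 1, 9, 3, 7]
median = exact_quantile(samples, 0.5)
```

A streaming version avoids this full sort by keeping only a summary of the data as it passes through, at the cost of returning an estimate.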
Left joins are used quite often. We use them a lot in endorsements. Again, we have candidates and need to join in features for ranking. We don't want to eliminate a candidate if there isn't a corresponding feature.
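The sketch below shows the left-join semantics in plain Python: every candidate is kept, and a missing feature shows up as None. It assumes each feature relation has at most one row per key, and the relation names and values are hypothetical. It says nothing about how the join is executed on Hadoop.

```python
def left_join(left, *others):
    """Left-join (key, value) rows in `left` against each relation in `others` by key.

    Assumes each relation in `others` holds at most one (key, value) row per key.
    """
    lookups = [dict(relation) for relation in others]
    return [
        (key, value) + tuple(lookup.get(key) for lookup in lookups)
        for key, value in left
    ]


# Hypothetical candidates and two feature relations keyed by member.
candidates = [("alice", "java"), ("bob", "pig")]
feature_a = [("alice", 0.8)]
feature_b = [("bob", 3), ("dave", 7)]

joined = left_join(candidates, feature_a, feature_b)
```

Here "bob" has no row in feature_a and "alice" has none in feature_b, yet both candidates survive the join, while "dave" is dropped because he is not a candidate.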