Applied Recommender Systems 
Bob Brehm 
5/20/2014
Presentation Topics 
• Hadoop MapReduce Overview
• Mahout Overview
• Hive Overview
• Review recommender systems
• Introduction to Spring XD
• Demonstrations as we go
Hadoop Overview 
• History [7]
  • 2003: Apache Nutch (an open-source web search engine) was created by Doug Cutting and Mike Cafarella.
  • 2004: The Google File System and MapReduce papers are published.
  • 2005: Hadoop was created within Nutch as an open-source implementation of GFS and MapReduce.
Hadoop Overview 
• Today Hadoop is an independent Apache project consisting of 4 modules: [6]
  • Hadoop Common
  • HDFS – distributed, scalable file system
  • YARN (v2) – job scheduling and cluster resource management
  • MapReduce – system for parallel processing of large data sets
• The Hadoop market size is over $3 billion!
Hadoop Overview 
• Other Hadoop-related projects include:
  • Hive – data warehouse infrastructure
  • Mahout – machine learning library
• While there are many more projects, the rest of the talk will focus on these two, as well as MapReduce and HDFS.
Hadoop Overview 
• NameNode – keeps track of the file system metadata and all DataNodes
• JobTracker – main scheduler; farms work out to the TaskTrackers
• DataNode – stores HDFS data blocks on an individual cluster node
• TaskTracker – runs map and reduce tasks on each DataNode
Hadoop Overview 
• HDFS basic command examples:
  • put – copies from local to HDFS
    hadoop fs -put localfile /user/hadoop/hadoopfile
  • mkdir – makes a directory
    hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
  • tail – displays the last kilobyte of a file
    hadoop fs -tail pathname
• Very similar to Linux commands
Hadoop Overview 
• Input data – wrangling can be difficult
• Mapper – splits data into key/value pairs
• Sort – sorts values by key
• Reducer – combines values by key
Hadoop Overview 
• WordCount (the "Hello World" of MapReduce) – counts occurrences of each word in a document
• The term-frequency half of TF-IDF
Hadoop Overview 
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
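The driver on the next slide references a Reduce class that the deck never shows; for completeness, here is the canonical old-API WordCount reducer from the Hadoop tutorial, which sums the 1s emitted by the mapper:

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Sum the per-word counts emitted by the mapper (and combiner)
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}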
Hadoop Overview 
public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);

  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);

  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));

  JobClient.runJob(conf);
}
Hadoop Overview 
• Set up the data:
  • /usr/joe/wordcount/input – input directory in HDFS
  • /usr/joe/wordcount/output – output directory in HDFS
  • $ hadoop fs -ls /usr/joe/wordcount/input/
    /usr/joe/wordcount/input/file01
    /usr/joe/wordcount/input/file02
  • $ hadoop fs -cat /usr/joe/wordcount/input/file01
    Hello World Bye World
  • $ hadoop fs -cat /usr/joe/wordcount/input/file02
    Hello Hadoop Goodbye Hadoop
Hadoop Overview 
• Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the installed Hadoop version, compile WordCount.java and create a jar:
  $ mkdir wordcount_classes
  $ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java
  $ jar -cvf /usr/joe/wordcount.jar -C wordcount_classes/ .
• Run the application:
  $ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
• Output:
  $ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
  Bye 1
  Goodbye 1
  Hadoop 2
  Hello 2
  World 2
Hadoop Overview 
• Interesting facts about MapReduce:
  • MapReduce can run on any type of file, including images.
  • Hadoop Streaming allows MapReduce jobs to be written in other languages: Python, R, Ruby.
  • A Combiner can be included to reduce network traffic between the map and reduce phases.
  • A Reducer is not required (image processing, ETL).
  • Hadoop includes a JobTracker WebUI.
  • MRUnit – a JUnit-based test framework (see the sketch below)
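A minimal sketch of an MRUnit test for the mapper shown earlier; the test class and package layout are illustrative, and it assumes the mapper lives in a WordCount outer class as in the tutorial:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.MapDriver;
import org.junit.Test;

public class WordCountMapTest {
  @Test
  public void mapperEmitsOnePerToken() throws Exception {
    // Feed one input record to the old-API mapper and assert its key/value output
    MapDriver<LongWritable, Text, Text, IntWritable> driver =
        MapDriver.newMapDriver(new WordCount.Map());
    driver.withInput(new LongWritable(0), new Text("Hello World"))
          .withOutput(new Text("Hello"), new IntWritable(1))
          .withOutput(new Text("World"), new IntWritable(1))
          .runTest();
  }
}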
Hadoop Overview 
• Spring for Apache Hadoop project:
  • Configure and run MapReduce jobs as container-managed objects
  • Provides template helper classes for HDFS, HBase, Pig and Hive
  • Use the standard Spring approach for Hadoop!
  • Access all the Spring goodies – Messaging, Persistence, Security, Web Services, etc.
Hive 
• Hive is an alternative to writing MapReduce jobs; Hive compiles to MapReduce.
• Hive programs are written in HiveQL, which is similar to SQL.
• Examples:
  • Create a table: hive> CREATE TABLE pokes (foo INT, bar STRING);
  • Load data: hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Hive 
• Examples (cont.):
  • Getting data out of Hive: INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
  • Join: FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
• Hive may reduce the amount of code you have to write when you are doing data wrangling (see the word-count comparison below).
• It's a tool that has its place and is useful to know.
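To make the code-reduction point concrete: the entire Java WordCount job shown earlier collapses to a few lines of HiveQL. This is a sketch assuming a table docs with a single STRING column named line:

SELECT word, count(1) AS cnt
FROM (SELECT explode(split(line, ' ')) AS word FROM docs) words
GROUP BY word;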
Mahout 
• Started as a subproject of Lucene in 2008.
• The idea behind Mahout is that it provides a framework for the development and deployment of machine learning algorithms.
• Currently it has three distinct capabilities:
  • Classification
  • Clustering
  • Recommenders
Mahout 
• Support for recommenders includes:
  • DataModel – provides connections to data
  • UserSimilarity – computes the similarity between users
  • ItemSimilarity – computes the similarity between items
  • UserNeighborhood – finds a neighborhood (mini-cluster) of like-minded users
  • Recommender – the producer of recommendations
  • Algorithms! (see the sketch below for how these compose)
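A minimal sketch of how these pieces compose, using Mahout's Taste API; the ratings file name and user ID are illustrative, and the input is the userID,itemID,rating CSV shown later in this talk:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class ItemRecommenderDemo {
  public static void main(String[] args) throws Exception {
    // DataModel: one userID,itemID,rating triple per line
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // ItemSimilarity: Pearson correlation between item rating vectors
    ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // Recommender: item-item collaborative filtering
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);
    // Top 5 items for user 943
    List<RecommendedItem> items = recommender.recommend(943, 5);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " " + item.getValue());
    }
  }
}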
Intro to Recommenders
What is a recommender? 
• Wikipedia [3]: "a subclass of [an] information filtering system that seeks to predict the 'rating' or 'preference' that a user would give to an item"
• My addition: a subclass of machine learning.
• Recommender model [2]:
  • Users
  • Items
  • Ratings
  • Community
What is a recommender? [2]
Recommender types 
• Non-personalized [2]
• Content-based filtering (user-item) [2]
• Hybrid [3]
• Collaborative filtering (user-user, item-item) [2]
Collaborative Filtering 
• We will now look at item-item collaborative filtering as the recommendation algorithm.
• Answers the question: what items are similar to the ones you like?
• Popularized by Amazon, which found that item-item scales better, can be done in real time, and generates high-quality results. [8]
• Specifically, we will look at the Pearson correlation coefficient.
Collaborative Filtering 
• Pearson's correlation coefficient is defined as the covariance of the two variables divided by the product of their standard deviations.
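Written out for the rating vectors x and y of two items over the users who rated both:

r_{xy} = \frac{\mathrm{cov}(x,y)}{\sigma_x \sigma_y}
       = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}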
Collaborative Filtering 
• The idea is to examine a log file of users' movie ratings. The data looks like this:
  109.170.148.120 - - [06/Jan/1998:01:48:18 -0500] "GET /rate?movie=268&rating=4 HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286"
  109.170.148.120 - - [05/Jan/1998:22:48:57 -0800] "GET /rate?movie=345&rating=4 HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286"
  109.170.148.120 - - [05/Jan/1998:22:50:15 -0800] "GET /rate?movie=312&rating=4 HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286"
Collaborative Filtering 
• Steps used for the analysis:
  • Run a Hive script to extract the user data from the log file
  • Run Mahout from the command line (this could also be done programmatically); a typical invocation is shown after the Hive output below
  • Examine the contents
Collaborative filtering 
<hive-runner id="hiveRunner">
  <script>
    CREATE TABLE MAHOUT_INPUT_A
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    AS
    SELECT cookie AS user,
      regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 1) AS movie,
      CAST(regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 2) AS double) AS rating
    FROM ACCESS_LOGS
    WHERE regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 2) != "";
  </script>
</hive-runner>
Collaborative filtering 
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.context.support.AbstractApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;
import org.springframework.data.hadoop.hive.HiveRunner;

public class HiveApp {

  private static final Log log = LogFactory.getLog(HiveApp.class);

  public static void main(String[] args) throws Exception {
    AbstractApplicationContext context = new ClassPathXmlApplicationContext(
        "/META-INF/spring/hive-context.xml", HiveApp.class);
    context.registerShutdownHook();
    HiveRunner runner = context.getBean(HiveRunner.class);
    runner.call();
  }
}
Collaborative Filtering 
• Hive output looks like this (this is the format Mahout requires): userID, movieID, rating (relationship strength)
943,373,3.0
943,391,2.0
943,796,3.0
943,237,4.0
943,840,4.0
943,230,1.0
943,229,2.0
943,449,1.0
943,450,1.0
943,228,3.0
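The deck runs Mahout from the command line without reproducing the command; a typical invocation of the item-based RecommenderJob over this output would look like the following (the HDFS paths are illustrative):

$ mahout recommenditembased \
    --input /user/hive/warehouse/mahout_input_a \
    --output /user/joe/recommendations \
    --similarityClassname SIMILARITY_PEARSON_CORRELATION \
    --numRecommendations 10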
Collaborative filtering 
• Rerun Mahout with a different similarity measure, say SIMILARITY_EUCLIDEAN_DISTANCE
• Do an A/B comparison in production
• Gather statistics over time
• See if one algorithm is better than the others
Spring XD 
• Spring XD is a Spring.io project that extends the work the Spring Data team did on the Spring for Apache Hadoop project. It provides:
  • High-throughput distributed data ingestion into HDFS from a variety of input sources
  • Real-time analytics at ingestion time, e.g. gathering metrics and counting values
  • Hadoop workflow management via batch jobs that combine interactions with standard enterprise systems (e.g. RDBMS) as well as Hadoop operations (e.g. MapReduce, HDFS, Pig, Hive or Cascading)
  • High-throughput data export, e.g. from HDFS to an RDBMS or NoSQL database
Spring XD 
• Configure a stream using XD. Simple case:
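The slide shows a screenshot; an equivalent definition from the XD shell is the classic "ticktock" stream, which pipes a time source into a log sink:

xd:> stream create --name ticktock --definition "time | log" --deploy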
Spring XD 
• A more typical corporate use-case stream:
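This slide is also a screenshot; as a sketch, a corporate ingestion stream of the kind shown might receive events over HTTP, filter them, and land them in HDFS, assuming the standard http source, filter processor, and hdfs sink modules (the port and filter expression are illustrative):

xd:> stream create --name ratings --definition "http --port=9000 | filter --expression=payload.contains('rate') | hdfs" --deploy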
Spring XD 
• Admin UI
?
Thanks!
References 
• [1] Introduction to Recommender Systems. Joseph Konstan.
• [2] Intro to Recommendations. Coursera.
• [3] Recommender system. Wikipedia.
• [4] An Algorithmic Framework for Performing Collaborative Filtering.
• [5] Hybrid Web Recommender Systems.
• [6] Hadoop web site.
• [7] Apache Hadoop. Wikipedia.
• [8] Amazon.com Recommendations paper. cs.umd.edu.
• [9] Cloudera Data Science Training. Cloudera.
