Applied Recommender Systems 
Bob Brehm 
5/20/2014
Presentation Topics 
• Hadoop MapReduce Overview
• Mahout Overview
• Hive Overview
• Review recommender systems
• Introduction to Spring XD
• Demonstrations as we go
Hadoop Overview 
• History [7]
  • 2003: Apache Nutch (an open-source web search engine) was created by Doug Cutting and Mike Cafarella.
  • 2004: The Google File System and MapReduce papers are published.
  • 2005: Hadoop was created within Nutch as an open-source implementation of GFS and MapReduce.
Hadoop Overview 
• Today Hadoop is an independent Apache project consisting of 4 modules: [6]
  • Hadoop Common
  • HDFS – distributed, scalable file system
  • YARN (v2) – job scheduling and cluster resource management
  • MapReduce – system for parallel processing of large data sets
• The Hadoop market size is over $3 billion!
Hadoop Overview 
• Other Hadoop-related projects include:
  • Hive – data warehouse infrastructure
  • Mahout – machine learning library
• While there are many more projects, the rest of the talk will focus on these two, as well as MapReduce and HDFS.
Hadoop Overview 
• NameNode – keeps track of the file system metadata and all DataNodes
• JobTracker – main scheduler; farms work out to the TaskTrackers
• DataNode – stores HDFS data blocks on an individual cluster node
• TaskTracker – runs map and reduce tasks on each DataNode
Hadoop Overview 
• HDFS basic command examples:
  • put – copies from local to HDFS
    hadoop fs -put localfile /user/hadoop/hadoopfile
  • mkdir – makes a directory
    hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
  • tail – displays the last kilobyte of a file
    hadoop fs -tail pathname
• Very similar to Linux commands
Hadoop Overview 
• Input data – wrangling can be difficult
• Mapper – splits data into key/value pairs
• Sort – sorts values by key
• Reducer – combines values by key
Hadoop Overview 
• WordCount (the "Hello World" of MapReduce) – counts occurrences of each word in a document
• The term-frequency half of TF-IDF
Hadoop Overview 
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
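The driver on the next slide references a Reduce class that the deck never shows; for completeness, here is the canonical old-API WordCount reducer from the Hadoop tutorial, which sums the 1s emitted by the mapper:

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Sum the per-word counts emitted by the mapper (and combiner)
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}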
Hadoop Overview 
public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);

  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);

  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));

  JobClient.runJob(conf);
}
Hadoop Overview 
• Set up the data:
  • /usr/joe/wordcount/input – input directory in HDFS
  • /usr/joe/wordcount/output – output directory in HDFS
  • $ hadoop fs -ls /usr/joe/wordcount/input/
    /usr/joe/wordcount/input/file01
    /usr/joe/wordcount/input/file02
  • $ hadoop fs -cat /usr/joe/wordcount/input/file01
    Hello World Bye World
  • $ hadoop fs -cat /usr/joe/wordcount/input/file02
    Hello Hadoop Goodbye Hadoop
Hadoop Overview 
• Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the installed Hadoop version, compile WordCount.java and create a jar:
  $ mkdir wordcount_classes
  $ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java
  $ jar -cvf /usr/joe/wordcount.jar -C wordcount_classes/ .
• Run the application:
  $ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
• Output:
  $ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
  Bye 1
  Goodbye 1
  Hadoop 2
  Hello 2
  World 2
Hadoop Overview 
• Interesting facts about MapReduce:
  • MapReduce can run on any type of file, including images.
  • Hadoop Streaming allows MapReduce jobs to be written in other languages: Python, R, Ruby.
  • A Combiner can be included to reduce network traffic between the map and reduce phases.
  • A Reducer is not required (image processing, ETL).
  • Hadoop includes a JobTracker WebUI.
  • MRUnit – a JUnit-based test framework (see the sketch below)
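A minimal sketch of an MRUnit test for the mapper shown earlier; the test class and package layout are illustrative, and it assumes the mapper lives in a WordCount outer class as in the tutorial:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.MapDriver;
import org.junit.Test;

public class WordCountMapTest {
  @Test
  public void mapperEmitsOnePerToken() throws Exception {
    // Feed one input record to the old-API mapper and assert its key/value output
    MapDriver<LongWritable, Text, Text, IntWritable> driver =
        MapDriver.newMapDriver(new WordCount.Map());
    driver.withInput(new LongWritable(0), new Text("Hello World"))
          .withOutput(new Text("Hello"), new IntWritable(1))
          .withOutput(new Text("World"), new IntWritable(1))
          .runTest();
  }
}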
Hadoop Overview 
• Spring for Apache Hadoop project:
  • Configure and run MapReduce jobs as container-managed objects
  • Provides template helper classes for HDFS, HBase, Pig and Hive
  • Use the standard Spring approach for Hadoop!
  • Access all the Spring goodies – Messaging, Persistence, Security, Web Services, etc.
Hive 
• Hive is an alternative to writing MapReduce jobs; Hive compiles to MapReduce.
• Hive programs are written in HiveQL, which is similar to SQL.
• Examples:
  • Create a table: hive> CREATE TABLE pokes (foo INT, bar STRING);
  • Load data: hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Hive 
• Examples (cont.):
  • Getting data out of Hive: INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
  • Join: FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
• Hive may reduce the amount of code you have to write when you are doing data wrangling (see the word-count comparison below).
• It's a tool that has its place and is useful to know.
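To make the code-reduction point concrete: the entire Java WordCount job shown earlier collapses to a few lines of HiveQL. This is a sketch assuming a table docs with a single STRING column named line:

SELECT word, count(1) AS cnt
FROM (SELECT explode(split(line, ' ')) AS word FROM docs) words
GROUP BY word;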
Mahout 
• Started as a subproject of Lucene in 2008.
• The idea behind Mahout is that it provides a framework for the development and deployment of machine learning algorithms.
• Currently it has three distinct capabilities:
  • Classification
  • Clustering
  • Recommenders
Mahout 
• Support for recommenders includes:
  • DataModel – provides connections to data
  • UserSimilarity – computes the similarity between users
  • ItemSimilarity – computes the similarity between items
  • UserNeighborhood – finds a neighborhood (mini-cluster) of like-minded users
  • Recommender – the producer of recommendations
  • Algorithms! (see the sketch below for how these compose)
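A minimal sketch of how these pieces compose, using Mahout's Taste API; the ratings file name and user ID are illustrative, and the input is the userID,itemID,rating CSV shown later in this talk:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class ItemRecommenderDemo {
  public static void main(String[] args) throws Exception {
    // DataModel: one userID,itemID,rating triple per line
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // ItemSimilarity: Pearson correlation between item rating vectors
    ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // Recommender: item-item collaborative filtering
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);
    // Top 5 items for user 943
    List<RecommendedItem> items = recommender.recommend(943, 5);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " " + item.getValue());
    }
  }
}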
Intro to Recommenders
What is a recommender? 
• Wikipedia [3]: "a subclass of [an] information filtering system that seeks to predict the 'rating' or 'preference' that a user would give to an item"
• My addition: a subclass of machine learning.
• Recommender model [2]:
  • Users
  • Items
  • Ratings
  • Community
What is a recommender? [2]
Recommender types 
• Non-personalized [2]
• Content-based filtering (user-item) [2]
• Hybrid [3]
• Collaborative filtering (user-user, item-item) [2]
Collaborative Filtering 
• We will now look at item-item collaborative filtering as the recommendation algorithm.
• Answers the question: what items are similar to the ones you like?
• Popularized by Amazon, which found that item-item scales better, can be done in real time, and generates high-quality results. [8]
• Specifically, we will look at the Pearson correlation coefficient.
Collaborative Filtering 
• Pearson's correlation coefficient is defined as the covariance of the two variables divided by the product of their standard deviations.
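Written out for the rating vectors x and y of two items over the users who rated both:

r_{xy} = \frac{\mathrm{cov}(x,y)}{\sigma_x \sigma_y}
       = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}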
Collaborative Filtering 
• The idea is to examine a log file of users' movie ratings. The data looks like this:
  109.170.148.120 - - [06/Jan/1998:01:48:18 -0500] "GET /rate?movie=268&rating=4 HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286"
  109.170.148.120 - - [05/Jan/1998:22:48:57 -0800] "GET /rate?movie=345&rating=4 HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286"
  109.170.148.120 - - [05/Jan/1998:22:50:15 -0800] "GET /rate?movie=312&rating=4 HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286"
Collaborative Filtering 
• Steps used for the analysis:
  • Run a Hive script to extract the user data from the log file
  • Run Mahout from the command line (this could also be done programmatically); a typical invocation is shown after the Hive output below
  • Examine the contents
Collaborative filtering 
<hive-runner id="hiveRunner">
  <script>
    CREATE TABLE MAHOUT_INPUT_A
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    AS
    SELECT cookie AS user,
      regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 1) AS movie,
      CAST(regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 2) AS double) AS rating
    FROM ACCESS_LOGS
    WHERE regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 2) != "";
  </script>
</hive-runner>
Collaborative filtering 
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.context.support.AbstractApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;
import org.springframework.data.hadoop.hive.HiveRunner;

public class HiveApp {

  private static final Log log = LogFactory.getLog(HiveApp.class);

  public static void main(String[] args) throws Exception {
    AbstractApplicationContext context = new ClassPathXmlApplicationContext(
        "/META-INF/spring/hive-context.xml", HiveApp.class);
    context.registerShutdownHook();
    HiveRunner runner = context.getBean(HiveRunner.class);
    runner.call();
  }
}
Collaborative Filtering 
• Hive output looks like this (this is the format Mahout requires): userID, movieID, rating (relationship strength)
943,373,3.0
943,391,2.0
943,796,3.0
943,237,4.0
943,840,4.0
943,230,1.0
943,229,2.0
943,449,1.0
943,450,1.0
943,228,3.0
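The deck runs Mahout from the command line without reproducing the command; a typical invocation of the item-based RecommenderJob over this output would look like the following (the HDFS paths are illustrative):

$ mahout recommenditembased \
    --input /user/hive/warehouse/mahout_input_a \
    --output /user/joe/recommendations \
    --similarityClassname SIMILARITY_PEARSON_CORRELATION \
    --numRecommendations 10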
Collaborative filtering 
• Rerun Mahout with a different similarity measure, say SIMILARITY_EUCLIDEAN_DISTANCE
• Do an A/B comparison in production
• Gather statistics over time
• See if one algorithm is better than the others
Spring XD 
• Spring XD is a Spring.io project that extends the work the Spring Data team did on the Spring for Apache Hadoop project. It provides:
  • High-throughput distributed data ingestion into HDFS from a variety of input sources
  • Real-time analytics at ingestion time, e.g. gathering metrics and counting values
  • Hadoop workflow management via batch jobs that combine interactions with standard enterprise systems (e.g. RDBMS) as well as Hadoop operations (e.g. MapReduce, HDFS, Pig, Hive or Cascading)
  • High-throughput data export, e.g. from HDFS to an RDBMS or NoSQL database
Spring XD 
• Configure a stream using XD. Simple case:
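The slide shows a screenshot; an equivalent definition from the XD shell is the classic "ticktock" stream, which pipes a time source into a log sink:

xd:> stream create --name ticktock --definition "time | log" --deploy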
Spring XD 
• A more typical corporate use-case stream:
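This slide is also a screenshot; as a sketch, a corporate ingestion stream of the kind shown might receive events over HTTP, filter them, and land them in HDFS, assuming the standard http source, filter processor, and hdfs sink modules (the port and filter expression are illustrative):

xd:> stream create --name ratings --definition "http --port=9000 | filter --expression=payload.contains('rate') | hdfs" --deploy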
Spring XD 
• Admin UI
?
Thanks!
References 
• [1] Introduction to Recommender Systems. Joseph Konstan.
• [2] Intro to Recommendations. Coursera.
• [3] Recommender system. Wikipedia.
• [4] An Algorithmic Framework for Performing Collaborative Filtering.
• [5] Hybrid Web Recommender Systems.
• [6] Hadoop web site.
• [7] Apache Hadoop. Wikipedia.
• [8] Amazon.com Recommendations paper. cs.umd.edu.
• [9] Cloudera Data Science Training. Cloudera.
