SlideShare ist ein Scribd-Unternehmen logo
1 von 75
Chu-Cheng Hsieh
Modern technologies
in data science
www.linkedin.com/in/chucheng
3
Task: Find “Error” in /var/log/system.log
4
/*Java*/
BufferedReader br = new BufferedReader(
new FileReader(”system.log"));
try {
StringBuilder sb = new StringBuilder();
String line = br.readLine();
while (line != null) {
if (line.contains(“Error”) {
System.out.println(line);
}
String everything = sb.toString();
} finally {
br.close();
}
5
# Python
with open(”system.log", "r") as ins:
for line in ins:
if “Error” in line:
print line[:-1]
6
# Bash
grep "Error" system.log
# !! Best !!
grep –B 3 –A 2 "Error" system.log|less
7
Task: Find “Error” in /var/log/system.log
What if,
system.log > 1TB? 10TB? … 1PB?
8
More CPUsMore Data
9credit: http://www.cse.wustl.edu/~jain/
10
(1) Multi-thread coding is really really
painful.
(2) Max # of cores in a machines in
limited
11
12
$US 390 billion
Map Reduce
13 13
14
An easy way of “multi-core”
programming.
Map
15 15
16
Q. How to eat a pizza in 1 minute?
17
Ans.
18
Map = “things can be divide-and-conquer”
Reduce
19 19
20
Reduce =
“merge pieces to make result meaningful”
21
“Word Count”
4 Nodes
22
“map”
23
“reduce”
24
Map-reduce - an easy way to
“collaborate”.
25
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {
//////////// MAPPER function
//////////////
public static class Map extends
MapReduceBase implements
Mapper<LongWritable, Text, Text,
IntWritable> {
public void map(LongWritable key, Text
value, OutputCollector<Text, IntWritable>
output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new
StringTokenizer(line);
Text word = new Text();
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, new IntWritable(1));
}
}
}
//////////// REDUCER function /////////////
public static class Reduce extends MapReduceBase
implements Reducer<Text, IntWritable, Text,
IntWritable> {
public void reduce(Text key, Iterator<IntWritable>
values, OutputCollector<Text,
IntWritable> output, Reporter reporter) throws
IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws
Exception {
/////////// JOB description ///////////
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new
Path(args[0]));
FileOutputFormat.setOutputPath(conf, new
Path(args[1]));
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);
}
}
26
a = load ’input.txt';
#map
b = foreach a generate
flatten(TOKENIZE((chararray)$0)) as word;
#reduce
c = group b by word;
d = foreach c generate COUNT(b), group;
store d into ‘wordcount.txt';
27
What’s wrong with PIG?
28
Word count + “case insensitive”
29
REGISTER lib.jar;
a = load ’input.txt';
#map
b = foreach a generate
flatten(TOKENIZE((chararray)$0)) as word;
b2 = foreach b generate lib.UPPER(word)
#reduce
c = group b2 by word;
d = foreach c generate COUNT(b), group;
store d into ‘wordcount.txt';
Case In sensitive
30
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class UPPER extends EvalFunc<String>
{
public String exec(Tuple input) throws IOException {
if (input == null || input.size() == 0 || input.get(0) == null)
return null;
try{
String str = (String)input.get(0);
return str.toUpperCase();
}catch(Exception e){
throw new IOException(”Error in input row ", e);
}
}
}
31
32 https://spark.apache.org/
33
chsieh@dev-sfear01:~$ spark-shell --master local[4]
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/03/26 14:32:12 WARN NativeCodeLoader: Unable to load native-hadoop library for
your platform... using builtin-java classes where applicable
Welcome to
____ __
/ __/__ ___ _____/ /__
_ / _ / _ `/ __/ '_/
/___/ .__/_,_/_/ /_/_ version 1.3.0
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_76)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.
scala>
34
chsieh@dev-sfear01:~/SparkTutorial$ cat wc.txt
one
two two
three three three
four four four four
five five five five five
35
// RDD[String] = MapPartitionsRDD
val file = sc.textFile("wc.txt")
val counts = file.flatMap(line => line.split(" "))
// RDD[String] = MapPartitionsRDD
.map(word => (word, 1))
// RDD[(String, Int)] = MapPartitionsRDD
.reduceByKey( (l, r) => l + r)
// RDD[(String, Int)] = ShuffledRDD
counts.collect()
// Array[(String, Int)] =
// Array((two,2), (one,1), (three,2), (five,5), (four,4))
counts.saveAsTextFile(”file:///path/to/wc_result")
// Use hdfs:// for hadoop file system
36
mapFlat( … )
map( … ) and then flatten( )
scala> List(List(1, 2), Set(3, 4)).flatten
res0: List[Int] = List(1, 2, 3, 4)
// create an empty list
// travel each item once
val l = List(List(1,2), List(3,4,List(4, 5))).flatten
l: List[Any] = List(1, 2, 3, 4, List(4, 5))
37
// ReduceByKey
(”two”, 1) (”two”, 1)
^^^^^ ^^^^^
key key
l r
(”two”, 2)
38
Why Spark?
39
It’s super fast !!
40
How can it so fast?
41
RDD (Resilient Distributed Datasets)
42
How to create an RDD?
43
Method #1 (Parallelized)
scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at
parallelize
44
Method #2 (External Dataset)
scala> val distFile = sc.textFile("data.txt")
distFile: RDD[String] = MappedRDD@1d4cee08
 Local Path
 Amazon S3 => s3n://
 Hadoop => hdfs://
 etc. URI
Desired Properties for map-reduce
 Distributed
 Lazy
(Optimize as much as you can)
 Persistence
(Caching)
45
46
Input
(disk)
Tuples
(disk)
Tuples
(disk)
Tuples
(disk)
Output
(disk)
MR
1
MR
2
MR
3
MR
4
Input
(disk)
RDD1
(in memory)
RDD2
(in disk)
RDD3
(in memory)
Output
(disk)
t1 t2 t3 t4
47
We have RDD, than how to “operate” it?
48
# Transformation (can wait)
RDD1 => “map” => RDD2
# Action (cannot wait)
RDD1 => “reduce” => ??
49
scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at
<console>:23
scala> val addone = distData.map(x => x + 1) // Transformation
addone: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[6] at map at
<console>:25
scala> val back = addone.map(x => x - 1) // Transformation
back: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[7] at map at <console>:27
scala> val sum = back.reduce( (l, r) => l + r) // Action
sum: Int = 15
50
Passing functions
Object Util {
def addOne(x:Int) = {
x + 1
}
}
val addone_v1 = distData.map(x => x + 1)
// or
val addone_v2 = distData.map(Util.addOne)
51
Popular Transformations
map(func):
run func(x)
filter(func)
return x if func(x) is true
sample(withReplacement, fraction, seed)
union(otherDataset)
intersection(otherDataset)
52
Popular Transformations (cont.)
Assuming RDD[(K, V)]
groupByKey([numTasks])
return a dataset of (K, Iterable<V>) pairs.
reduceByKey(func, [numTasks])
groupByKey and then reduce by “func”
join(otherDataset, [numTasks])
(K, V) join (K, W) => (K, (V, W))
53
Popular Actions
reduce(func):
(left, right) => func(left, right)
collect()
force computing transformations
count()
first()
take(n)
persist()
54
More about “persist” -- reuse
Input
(disk)
RDD1
(in memory)
RDD2
(in disk)
RDD3
(in memory)
Output
(disk)
t1 t2 t3
t4
RDD4
(in memory)
persist()
t5
55
You are master now.
Journey Continue
56 56
57
58
val file = sc.textFile("wc.txt")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey( (l, r) => l + r)
// Using SQL syntax is possible
import sqlContext.implicits._
val sq = new org.apache.spark.sql.SQLContext(sc)
case class WC(word: String, count: Int)
val wordcount = counts.map(col => WC(col._1, col._2))
val df = wordcount.toDF()
df.registerTempTable("tbl")
val avg = sq.sql("SELECT AVG(count) FROM tbl")
59
Why Spark SQL?
60
Guess what I’m doing here?
// row: (country, city, profit)
data .filter( _._1 == “us”)
.map ( (a, b, c) => (b, c))
.groupBy( _._1 )
.mapValues( v =>
v.reduce(
(a, b) => (a._1, a._2+b._2)
)
)
.values
.sortBy (x => x._2, false)
.take(3)
61
sq.sql(”
SELECT city, SUM(profit) as p
FROM data
WHERE country=‘us’
GROUP BY city
ORDER BY p DESC
LIMIT 3
")
//(country, city, profit)
data .filter( _._1 == “us”)
.map ( (a, b, c) => (b, c))
.groupBy( _._1 )
.mapValues( v =>
v.reduce(
(a, b) => (a._1,
a._2+b._2)
)
)
.values
.sortBy (x => x._2, false)
.take(3)
Find top 3 cities in US with highest profit
62
63
64
chsieh@dev-sfear01:~/SparkTutorial$ cat kmeans_data.txt
0.0 0.0 0.0
0.1 0.1 0.1
0.5 0.5 0.8
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2
65
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ')
.map(_.toDouble))).cache()
// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(
parsedData, numClusters, numIterations
)
// Show results
scala> clusters.clusterCenters
res0: Array[org.apache.spark.mllib.linalg.Vector] =
Array([9.099999999999998,9.099999999999998,9.099999999999998],
[0.19999999999999998,0.19999999999999998,0.3])
66
67
68
PageRank: Random Surfer Model
The probability of a Web surfer to reach a page
after many clicks, following random links
Random Click
70
Credit: http://en.wikipedia.org/wiki/PageRank
PageRank
 PR(p) = PR(p1)/c1 + … + PR(pk)/ck
pi : page pointing to p,
ci : number of links in pi
 One equation for every page
 N equations, N unknown variables
Credit: Prof. John Cho / CS144 (UCLA)
Users.txt
1,BarackObama,Barack Obama
2,ladygaga,Goddess of Love
3,jeresig,John Resig
4,justinbieber,Justin Bieber
6,matei_zaharia,Matei Zaharia
7,odersky,Martin Odersky
8,anonsys
72
followers.txt
2 1
4 1
1 2
6 3
7 3
7 6
6 7
3 7
73
74
// Load the edges as a graph
val graph = GraphLoader.edgeListFile(sc, "followers.txt")
// Run PageRank
val ranks = graph.pageRank(0.0001).vertices // id, rank
// Join the ranks with the usernames
val users = sc.textFile("users.txt")
.map { line =>
val fields = line.split(",")
(fields(0).toLong, fields(1)) // id, username
}
val ranksByUsername = users.join(ranks).map {
case (id, (username, rank)) => (username, rank)
}
// Print the result
println(ranksByUsername.collect().mkString("n"))
75

Weitere ähnliche Inhalte

Was ist angesagt?

Poor Man's Functional Programming
Poor Man's Functional ProgrammingPoor Man's Functional Programming
Poor Man's Functional ProgrammingDmitry Buzdin
 
ITT 2015 - Saul Mora - Object Oriented Function Programming
ITT 2015 - Saul Mora - Object Oriented Function ProgrammingITT 2015 - Saul Mora - Object Oriented Function Programming
ITT 2015 - Saul Mora - Object Oriented Function ProgrammingIstanbul Tech Talks
 
Psycopg2 - Connect to PostgreSQL using Python Script
Psycopg2 - Connect to PostgreSQL using Python ScriptPsycopg2 - Connect to PostgreSQL using Python Script
Psycopg2 - Connect to PostgreSQL using Python ScriptSurvey Department
 
"PostgreSQL and Python" Lightning Talk @EuroPython2014
"PostgreSQL and Python" Lightning Talk @EuroPython2014"PostgreSQL and Python" Lightning Talk @EuroPython2014
"PostgreSQL and Python" Lightning Talk @EuroPython2014Henning Jacobs
 
Python postgre sql a wonderful wedding
Python postgre sql   a wonderful weddingPython postgre sql   a wonderful wedding
Python postgre sql a wonderful weddingStéphane Wirtel
 
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash courseCodepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash courseSages
 
JAVA 8 : Migration et enjeux stratégiques en entreprise
JAVA 8 : Migration et enjeux stratégiques en entrepriseJAVA 8 : Migration et enjeux stratégiques en entreprise
JAVA 8 : Migration et enjeux stratégiques en entrepriseSOAT
 
John Melesky - Federating Queries Using Postgres FDW @ Postgres Open
John Melesky - Federating Queries Using Postgres FDW @ Postgres OpenJohn Melesky - Federating Queries Using Postgres FDW @ Postgres Open
John Melesky - Federating Queries Using Postgres FDW @ Postgres OpenPostgresOpen
 
From Zero to Application Delivery with NixOS
From Zero to Application Delivery with NixOSFrom Zero to Application Delivery with NixOS
From Zero to Application Delivery with NixOSSusan Potter
 
Haskell in the Real World
Haskell in the Real WorldHaskell in the Real World
Haskell in the Real Worldosfameron
 
Writing Domain-Specific Languages for BeepBeep
Writing Domain-Specific Languages for BeepBeepWriting Domain-Specific Languages for BeepBeep
Writing Domain-Specific Languages for BeepBeepSylvain Hallé
 
Java 8 Puzzlers [as presented at OSCON 2016]
Java 8 Puzzlers [as presented at  OSCON 2016]Java 8 Puzzlers [as presented at  OSCON 2016]
Java 8 Puzzlers [as presented at OSCON 2016]Baruch Sadogursky
 
2017 02-07 - elastic & spark. building a search geo locator
2017 02-07 - elastic & spark. building a search geo locator2017 02-07 - elastic & spark. building a search geo locator
2017 02-07 - elastic & spark. building a search geo locatorAlberto Paro
 
The Ring programming language version 1.5.3 book - Part 25 of 184
The Ring programming language version 1.5.3 book - Part 25 of 184The Ring programming language version 1.5.3 book - Part 25 of 184
The Ring programming language version 1.5.3 book - Part 25 of 184Mahmoud Samir Fayed
 
Clojure: The Art of Abstraction
Clojure: The Art of AbstractionClojure: The Art of Abstraction
Clojure: The Art of AbstractionAlex Miller
 
Joker 2015 - Валеев Тагир - Что же мы измеряем?
Joker 2015 - Валеев Тагир - Что же мы измеряем?Joker 2015 - Валеев Тагир - Что же мы измеряем?
Joker 2015 - Валеев Тагир - Что же мы измеряем?tvaleev
 

Was ist angesagt? (20)

Poor Man's Functional Programming
Poor Man's Functional ProgrammingPoor Man's Functional Programming
Poor Man's Functional Programming
 
ITT 2015 - Saul Mora - Object Oriented Function Programming
ITT 2015 - Saul Mora - Object Oriented Function ProgrammingITT 2015 - Saul Mora - Object Oriented Function Programming
ITT 2015 - Saul Mora - Object Oriented Function Programming
 
Psycopg2 - Connect to PostgreSQL using Python Script
Psycopg2 - Connect to PostgreSQL using Python ScriptPsycopg2 - Connect to PostgreSQL using Python Script
Psycopg2 - Connect to PostgreSQL using Python Script
 
"PostgreSQL and Python" Lightning Talk @EuroPython2014
"PostgreSQL and Python" Lightning Talk @EuroPython2014"PostgreSQL and Python" Lightning Talk @EuroPython2014
"PostgreSQL and Python" Lightning Talk @EuroPython2014
 
Python postgre sql a wonderful wedding
Python postgre sql   a wonderful weddingPython postgre sql   a wonderful wedding
Python postgre sql a wonderful wedding
 
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash courseCodepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
 
JAVA 8 : Migration et enjeux stratégiques en entreprise
JAVA 8 : Migration et enjeux stratégiques en entrepriseJAVA 8 : Migration et enjeux stratégiques en entreprise
JAVA 8 : Migration et enjeux stratégiques en entreprise
 
Dotnet 18
Dotnet 18Dotnet 18
Dotnet 18
 
John Melesky - Federating Queries Using Postgres FDW @ Postgres Open
John Melesky - Federating Queries Using Postgres FDW @ Postgres OpenJohn Melesky - Federating Queries Using Postgres FDW @ Postgres Open
John Melesky - Federating Queries Using Postgres FDW @ Postgres Open
 
From Zero to Application Delivery with NixOS
From Zero to Application Delivery with NixOSFrom Zero to Application Delivery with NixOS
From Zero to Application Delivery with NixOS
 
Pune Clojure Course Outline
Pune Clojure Course OutlinePune Clojure Course Outline
Pune Clojure Course Outline
 
Haskell in the Real World
Haskell in the Real WorldHaskell in the Real World
Haskell in the Real World
 
Writing Domain-Specific Languages for BeepBeep
Writing Domain-Specific Languages for BeepBeepWriting Domain-Specific Languages for BeepBeep
Writing Domain-Specific Languages for BeepBeep
 
Java 8 Puzzlers [as presented at OSCON 2016]
Java 8 Puzzlers [as presented at  OSCON 2016]Java 8 Puzzlers [as presented at  OSCON 2016]
Java 8 Puzzlers [as presented at OSCON 2016]
 
2017 02-07 - elastic & spark. building a search geo locator
2017 02-07 - elastic & spark. building a search geo locator2017 02-07 - elastic & spark. building a search geo locator
2017 02-07 - elastic & spark. building a search geo locator
 
The Ring programming language version 1.5.3 book - Part 25 of 184
The Ring programming language version 1.5.3 book - Part 25 of 184The Ring programming language version 1.5.3 book - Part 25 of 184
The Ring programming language version 1.5.3 book - Part 25 of 184
 
Clojure: The Art of Abstraction
Clojure: The Art of AbstractionClojure: The Art of Abstraction
Clojure: The Art of Abstraction
 
ES6 in Real Life
ES6 in Real LifeES6 in Real Life
ES6 in Real Life
 
Oop lecture9 13
Oop lecture9 13Oop lecture9 13
Oop lecture9 13
 
Joker 2015 - Валеев Тагир - Что же мы измеряем?
Joker 2015 - Валеев Тагир - Что же мы измеряем?Joker 2015 - Валеев Тагир - Что же мы измеряем?
Joker 2015 - Валеев Тагир - Что же мы измеряем?
 

Ähnlich wie Modern technologies in data science

Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and MonoidsHugo Gävert
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryDatabricks
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryDatabricks
 
Python 101 language features and functional programming
Python 101 language features and functional programmingPython 101 language features and functional programming
Python 101 language features and functional programmingLukasz Dynowski
 
Byterun, a Python bytecode interpreter - Allison Kaptur at NYCPython
Byterun, a Python bytecode interpreter - Allison Kaptur at NYCPythonByterun, a Python bytecode interpreter - Allison Kaptur at NYCPython
Byterun, a Python bytecode interpreter - Allison Kaptur at NYCPythonakaptur
 
GE8151 Problem Solving and Python Programming
GE8151 Problem Solving and Python ProgrammingGE8151 Problem Solving and Python Programming
GE8151 Problem Solving and Python ProgrammingMuthu Vinayagam
 
Clojure Intro
Clojure IntroClojure Intro
Clojure Introthnetos
 
Apache Commons - Don\'t re-invent the wheel
Apache Commons - Don\'t re-invent the wheelApache Commons - Don\'t re-invent the wheel
Apache Commons - Don\'t re-invent the wheeltcurdt
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data ManagementAlbert Bifet
 
The Curious Clojurist - Neal Ford (Thoughtworks)
The Curious Clojurist - Neal Ford (Thoughtworks)The Curious Clojurist - Neal Ford (Thoughtworks)
The Curious Clojurist - Neal Ford (Thoughtworks)jaxLondonConference
 
Java 7 Launch Event at LyonJUG, Lyon France. Fork / Join framework and Projec...
Java 7 Launch Event at LyonJUG, Lyon France. Fork / Join framework and Projec...Java 7 Launch Event at LyonJUG, Lyon France. Fork / Join framework and Projec...
Java 7 Launch Event at LyonJUG, Lyon France. Fork / Join framework and Projec...julien.ponge
 
A gentle introduction to functional programming through music and clojure
A gentle introduction to functional programming through music and clojureA gentle introduction to functional programming through music and clojure
A gentle introduction to functional programming through music and clojurePaul Lam
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Samir Bessalah
 
sonam Kumari python.ppt
sonam Kumari python.pptsonam Kumari python.ppt
sonam Kumari python.pptssuserd64918
 

Ähnlich wie Modern technologies in data science (20)

Spark workshop
Spark workshopSpark workshop
Spark workshop
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and Monoids
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
 
Python 101 language features and functional programming
Python 101 language features and functional programmingPython 101 language features and functional programming
Python 101 language features and functional programming
 
Byterun, a Python bytecode interpreter - Allison Kaptur at NYCPython
Byterun, a Python bytecode interpreter - Allison Kaptur at NYCPythonByterun, a Python bytecode interpreter - Allison Kaptur at NYCPython
Byterun, a Python bytecode interpreter - Allison Kaptur at NYCPython
 
Spark_Documentation_Template1
Spark_Documentation_Template1Spark_Documentation_Template1
Spark_Documentation_Template1
 
GE8151 Problem Solving and Python Programming
GE8151 Problem Solving and Python ProgrammingGE8151 Problem Solving and Python Programming
GE8151 Problem Solving and Python Programming
 
Full Stack Clojure
Full Stack ClojureFull Stack Clojure
Full Stack Clojure
 
Clojure Intro
Clojure IntroClojure Intro
Clojure Intro
 
Apache Commons - Don\'t re-invent the wheel
Apache Commons - Don\'t re-invent the wheelApache Commons - Don\'t re-invent the wheel
Apache Commons - Don\'t re-invent the wheel
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
 
Java VS Python
Java VS PythonJava VS Python
Java VS Python
 
The Curious Clojurist - Neal Ford (Thoughtworks)
The Curious Clojurist - Neal Ford (Thoughtworks)The Curious Clojurist - Neal Ford (Thoughtworks)
The Curious Clojurist - Neal Ford (Thoughtworks)
 
A bit about Scala
A bit about ScalaA bit about Scala
A bit about Scala
 
Java 7 Launch Event at LyonJUG, Lyon France. Fork / Join framework and Projec...
Java 7 Launch Event at LyonJUG, Lyon France. Fork / Join framework and Projec...Java 7 Launch Event at LyonJUG, Lyon France. Fork / Join framework and Projec...
Java 7 Launch Event at LyonJUG, Lyon France. Fork / Join framework and Projec...
 
A gentle introduction to functional programming through music and clojure
A gentle introduction to functional programming through music and clojureA gentle introduction to functional programming through music and clojure
A gentle introduction to functional programming through music and clojure
 
Scala @ TomTom
Scala @ TomTomScala @ TomTom
Scala @ TomTom
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
 
sonam Kumari python.ppt
sonam Kumari python.pptsonam Kumari python.ppt
sonam Kumari python.ppt
 

Kürzlich hochgeladen

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 

Kürzlich hochgeladen (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 

Modern technologies in data science

  • 3. 3 Task: Find “Error” in /var/log/system.log
  • 4. 4 /*Java*/ BufferedReader br = new BufferedReader( new FileReader(”system.log")); try { StringBuilder sb = new StringBuilder(); String line = br.readLine(); while (line != null) { if (line.contains(“Error”) { System.out.println(line); } String everything = sb.toString(); } finally { br.close(); }
  • 5. 5 # Python with open(”system.log", "r") as ins: for line in ins: if “Error” in line: print line[:-1]
  • 6. 6 # Bash grep "Error" system.log # !! Best !! grep –B 3 –A 2 "Error" system.log|less
  • 7. 7 Task: Find “Error” in /var/log/system.log What if, system.log > 1TB? 10TB? … 1PB?
  • 10. 10 (1) Multi-thread coding is really really painful. (2) Max # of cores in a machines in limited
  • 11. 11
  • 14. 14 An easy way of “multi-core” programming.
  • 16. 16 Q. How to eat a pizza in 1 minute?
  • 18. 18 Map = “things can be divide-and-conquer”
  • 20. 20 Reduce = “merge pieces to make result meaningful”
  • 24. 24 Map-reduce - an easy way to “collaborate”.
  • 25. 25 import java.io.IOException; import java.util.*; import org.apache.hadoop.fs.Path; import org.apache.hadoop.conf.*; import org.apache.hadoop.io.*; import org.apache.hadoop.mapred.*; import org.apache.hadoop.util.*; public class WordCount { //////////// MAPPER function ////////////// public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); Text word = new Text(); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); output.collect(word, new IntWritable(1)); } } } //////////// REDUCER function ///////////// public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } } public static void main(String[] args) throws Exception { /////////// JOB description /////////// JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setMapperClass(Map.class); conf.setReducerClass(Reduce.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(TextOutputFormat.class); FileInputFormat.setInputPaths(conf, new Path(args[0])); FileOutputFormat.setOutputPath(conf, new Path(args[1])); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); JobClient.runJob(conf); } }
  • 26. 26 a = load ’input.txt'; #map b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word; #reduce c = group b by word; d = foreach c generate COUNT(b), group; store d into ‘wordcount.txt';
  • 28. 28 Word count + “case insensitive”
  • 29. 29 REGISTER lib.jar; a = load ’input.txt'; #map b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word; b2 = foreach b generate lib.UPPER(word) #reduce c = group b2 by word; d = foreach c generate COUNT(b), group; store d into ‘wordcount.txt'; Case In sensitive
  • 30. 30 package myudfs; import java.io.IOException; import org.apache.pig.EvalFunc; import org.apache.pig.data.Tuple; public class UPPER extends EvalFunc<String> { public String exec(Tuple input) throws IOException { if (input == null || input.size() == 0 || input.get(0) == null) return null; try{ String str = (String)input.get(0); return str.toUpperCase(); }catch(Exception e){ throw new IOException(”Error in input row ", e); } } }
  • 31. 31
  • 33. 33 chsieh@dev-sfear01:~$ spark-shell --master local[4] Spark assembly has been built with Hive, including Datanucleus jars on classpath 15/03/26 14:32:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Welcome to ____ __ / __/__ ___ _____/ /__ _ / _ / _ `/ __/ '_/ /___/ .__/_,_/_/ /_/_ version 1.3.0 /_/ Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_76) Type in expressions to have them evaluated. Type :help for more information. Spark context available as sc. SQL context available as sqlContext. scala>
  • 34. 34 chsieh@dev-sfear01:~/SparkTutorial$ cat wc.txt one two two three three three four four four four five five five five five
  • 35. 35 // RDD[String] = MapPartitionsRDD val file = sc.textFile("wc.txt") val counts = file.flatMap(line => line.split(" ")) // RDD[String] = MapPartitionsRDD .map(word => (word, 1)) // RDD[(String, Int)] = MapPartitionsRDD .reduceByKey( (l, r) => l + r) // RDD[(String, Int)] = ShuffledRDD counts.collect() // Array[(String, Int)] = // Array((two,2), (one,1), (three,2), (five,5), (four,4)) counts.saveAsTextFile(”file:///path/to/wc_result") // Use hdfs:// for hadoop file system
  • 36. 36 mapFlat( … ) map( … ) and then flatten( ) scala> List(List(1, 2), Set(3, 4)).flatten res0: List[Int] = List(1, 2, 3, 4) // create an empty list // travel each item once val l = List(List(1,2), List(3,4,List(4, 5))).flatten l: List[Any] = List(1, 2, 3, 4, List(4, 5))
  • 37. 37 // ReduceByKey (”two”, 1) (”two”, 1) ^^^^^ ^^^^^ key key l r (”two”, 2)
  • 40. 40 How can it so fast?
  • 42. 42 How to create an RDD?
  • 43. 43 Method #1 (Parallelized) scala> val data = Array(1, 2, 3, 4, 5) data: Array[Int] = Array(1, 2, 3, 4, 5) scala> val distData = sc.parallelize(data) distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize
  • 44. 44 Method #2 (External Dataset) scala> val distFile = sc.textFile("data.txt") distFile: RDD[String] = MappedRDD@1d4cee08  Local Path  Amazon S3 => s3n://  Hadoop => hdfs://  etc. URI
  • 45. Desired Properties for map-reduce  Distributed  Lazy (Optimize as much as you can)  Persistence (Caching) 45
  • 47. 47 We have RDD, than how to “operate” it?
  • 48. 48 # Transformation (can wait) RDD1 => “map” => RDD2 # Action (cannot wait) RDD1 => “reduce” => ??
  • 49. 49 scala> val data = Array(1, 2, 3, 4, 5) data: Array[Int] = Array(1, 2, 3, 4, 5) scala> val distData = sc.parallelize(data) distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:23 scala> val addone = distData.map(x => x + 1) // Transformation addone: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[6] at map at <console>:25 scala> val back = addone.map(x => x - 1) // Transformation back: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[7] at map at <console>:27 scala> val sum = back.reduce( (l, r) => l + r) // Action sum: Int = 15
  • 50. 50 Passing functions Object Util { def addOne(x:Int) = { x + 1 } } val addone_v1 = distData.map(x => x + 1) // or val addone_v2 = distData.map(Util.addOne)
  • 51. 51 Popular Transformations map(func): run func(x) filter(func) return x if func(x) is true sample(withReplacement, fraction, seed) union(otherDataset) intersection(otherDataset)
  • 52. 52 Popular Transformations (cont.) Assuming RDD[(K, V)] groupByKey([numTasks]) return a dataset of (K, Iterable<V>) pairs. reduceByKey(func, [numTasks]) groupByKey and then reduce by “func” join(otherDataset, [numTasks]) (K, V) join (K, W) => (K, (V, W))
  • 53. 53 Popular Actions reduce(func): (left, right) => func(left, right) collect() force computing transformations count() first() take(n) persist()
  • 54. 54 More about “persist” -- reuse Input (disk) RDD1 (in memory) RDD2 (in disk) RDD3 (in memory) Output (disk) t1 t2 t3 t4 RDD4 (in memory) persist() t5
  • 57. 57
  • 58. 58 val file = sc.textFile("wc.txt") val counts = file.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey( (l, r) => l + r) // Using SQL syntax is possible import sqlContext.implicits._ val sq = new org.apache.spark.sql.SQLContext(sc) case class WC(word: String, count: Int) val wordcount = counts.map(col => WC(col._1, col._2)) val df = wordcount.toDF() df.registerTempTable("tbl") val avg = sq.sql("SELECT AVG(count) FROM tbl")
  • 60. 60 Guess what I’m doing here? // row: (country, city, profit) data .filter( _._1 == “us”) .map ( (a, b, c) => (b, c)) .groupBy( _._1 ) .mapValues( v => v.reduce( (a, b) => (a._1, a._2+b._2) ) ) .values .sortBy (x => x._2, false) .take(3)
  • 61. 61 sq.sql(” SELECT city, SUM(profit) as p FROM data WHERE country=‘us’ GROUP BY city ORDER BY p DESC LIMIT 3 ") //(country, city, profit) data .filter( _._1 == “us”) .map ( (a, b, c) => (b, c)) .groupBy( _._1 ) .mapValues( v => v.reduce( (a, b) => (a._1, a._2+b._2) ) ) .values .sortBy (x => x._2, false) .take(3) Find top 3 cities in US with highest profit
  • 62. 62
  • 63. 63
  • 64. 64 chsieh@dev-sfear01:~/SparkTutorial$ cat kmeans_data.txt 0.0 0.0 0.0 0.1 0.1 0.1 0.5 0.5 0.8 9.0 9.0 9.0 9.1 9.1 9.1 9.2 9.2 9.2
  • 65. 65 import org.apache.spark.mllib.clustering.KMeans import org.apache.spark.mllib.linalg.Vectors // Load and parse the data val data = sc.textFile("kmeans_data.txt") val parsedData = data.map(s => Vectors.dense(s.split(' ') .map(_.toDouble))).cache() // Cluster the data into two classes using KMeans val numClusters = 2 val numIterations = 20 val clusters = KMeans.train( parsedData, numClusters, numIterations ) // Show results scala> clusters.clusterCenters res0: Array[org.apache.spark.mllib.linalg.Vector] = Array([9.099999999999998,9.099999999999998,9.099999999999998], [0.19999999999999998,0.19999999999999998,0.3])
  • 66. 66
  • 67. 67
  • 68. 68
  • 69. PageRank: Random Surfer Model The probability of a Web surfer to reach a page after many clicks, following random links Random Click
  • 71. PageRank  PR(p) = PR(p1)/c1 + … + PR(pk)/ck pi : page pointing to p, ci : number of links in pi  One equation for every page  N equations, N unknown variables Credit: Prof. John Cho / CS144 (UCLA)
  • 72. Users.txt 1,BarackObama,Barack Obama 2,ladygaga,Goddess of Love 3,jeresig,John Resig 4,justinbieber,Justin Bieber 6,matei_zaharia,Matei Zaharia 7,odersky,Martin Odersky 8,anonsys 72
  • 73. followers.txt 2 1 4 1 1 2 6 3 7 3 7 6 6 7 3 7 73
  • 74. 74 // Load the edges as a graph val graph = GraphLoader.edgeListFile(sc, "followers.txt") // Run PageRank val ranks = graph.pageRank(0.0001).vertices // id, rank // Join the ranks with the usernames val users = sc.textFile("users.txt") .map { line => val fields = line.split(",") (fields(0).toLong, fields(1)) // id, username } val ranksByUsername = users.join(ranks).map { case (id, (username, rank)) => (username, rank) } // Print the result println(ranksByUsername.collect().mkString("n"))
  • 75. 75