Modern technologies in data science

Chu-Cheng Hsieh
Modern technologies
in data science

3
Task: Find “Error” in /var/log/system.log

4
/*Java*/
BufferedReader br = new BufferedReader(
new FileReader(”system.log"));
try {
StringBuilder sb = new StringBuilder();
String line = br.readLine();
while (line != null) {
if (line.contains(“Error”) {
System.out.println(line);
}
String everything = sb.toString();
} finally {
br.close();
}

5
# Python
with open(”system.log", "r") as ins:
for line in ins:
if “Error” in line:
print line[:-1]

6
# Bash
grep "Error" system.log
# !! Best !!
grep –B 3 –A 2 "Error" system.log|less

7
Task: Find “Error” in /var/log/system.log
What if,
system.log > 1TB? 10TB? … 1PB?

9credit: http://www.cse.wustl.edu/~jain/

10
(1) Multi-thread coding is really really
painful.
(2) Max # of cores in a machines in
limited

14
An easy way of “multi-core”
programming.

16
Q. How to eat a pizza in 1 minute?

18
Map = “things can be divide-and-conquer”

20
Reduce =
“merge pieces to make result meaningful”

24
Map-reduce - an easy way to
“collaborate”.

25
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {
//////////// MAPPER function
//////////////
public static class Map extends
MapReduceBase implements
Mapper<LongWritable, Text, Text,
IntWritable> {
public void map(LongWritable key, Text
value, OutputCollector<Text, IntWritable>
output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new
StringTokenizer(line);
Text word = new Text();
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, new IntWritable(1));
}
}
}
//////////// REDUCER function /////////////
public static class Reduce extends MapReduceBase
implements Reducer<Text, IntWritable, Text,
IntWritable> {
public void reduce(Text key, Iterator<IntWritable>
values, OutputCollector<Text,
IntWritable> output, Reporter reporter) throws
IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws
Exception {
/////////// JOB description ///////////
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new
Path(args[0]));
FileOutputFormat.setOutputPath(conf, new
Path(args[1]));
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);
}
}

26
a = load ’input.txt';
#map
b = foreach a generate
flatten(TOKENIZE((chararray)$0)) as word;
#reduce
c = group b by word;
d = foreach c generate COUNT(b), group;
store d into ‘wordcount.txt';

28
Word count + “case insensitive”

29
REGISTER lib.jar;
a = load ’input.txt';
#map
b = foreach a generate
flatten(TOKENIZE((chararray)$0)) as word;
b2 = foreach b generate lib.UPPER(word)
#reduce
c = group b2 by word;
d = foreach c generate COUNT(b), group;
store d into ‘wordcount.txt';
Case In sensitive

30
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class UPPER extends EvalFunc<String>
{
public String exec(Tuple input) throws IOException {
if (input == null || input.size() == 0 || input.get(0) == null)
return null;
try{
String str = (String)input.get(0);
return str.toUpperCase();
}catch(Exception e){
throw new IOException(”Error in input row ", e);
}
}
}

33
chsieh@dev-sfear01:~$ spark-shell --master local[4]
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/03/26 14:32:12 WARN NativeCodeLoader: Unable to load native-hadoop library for
your platform... using builtin-java classes where applicable
Welcome to
____ __
/ __/__ ___ _____/ /__
_ / _ / _ `/ __/ '_/
/___/ .__/_,_/_/ /_/_ version 1.3.0
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_76)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.
scala>

34
chsieh@dev-sfear01:~/SparkTutorial$ cat wc.txt
one
two two
three three three
four four four four
five five five five five

35
// RDD[String] = MapPartitionsRDD
val file = sc.textFile("wc.txt")
val counts = file.flatMap(line => line.split(" "))
// RDD[String] = MapPartitionsRDD
.map(word => (word, 1))
// RDD[(String, Int)] = MapPartitionsRDD
.reduceByKey( (l, r) => l + r)
// RDD[(String, Int)] = ShuffledRDD
counts.collect()
// Array[(String, Int)] =
// Array((two,2), (one,1), (three,2), (five,5), (four,4))
counts.saveAsTextFile(”file:///path/to/wc_result")
// Use hdfs:// for hadoop file system

36
mapFlat( … )
map( … ) and then flatten( )
scala> List(List(1, 2), Set(3, 4)).flatten
res0: List[Int] = List(1, 2, 3, 4)
// create an empty list
// travel each item once
val l = List(List(1,2), List(3,4,List(4, 5))).flatten
l: List[Any] = List(1, 2, 3, 4, List(4, 5))

37
// ReduceByKey
(”two”, 1) (”two”, 1)
^^^^^ ^^^^^
key key
l r
(”two”, 2)

41
RDD (Resilient Distributed Datasets)

43
Method #1 (Parallelized)
scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at
parallelize

44
Method #2 (External Dataset)
scala> val distFile = sc.textFile("data.txt")
distFile: RDD[String] = MappedRDD@1d4cee08
 Local Path
 Amazon S3 => s3n://
 Hadoop => hdfs://
 etc. URI

Desired Properties for map-reduce
 Distributed
 Lazy
(Optimize as much as you can)
 Persistence
(Caching)
45

46
Input
(disk)
Tuples
(disk)
Tuples
(disk)
Tuples
(disk)
Output
(disk)
MR
1
MR
2
MR
3
MR
4
Input
(disk)
RDD1
(in memory)
RDD2
(in disk)
RDD3
(in memory)
Output
(disk)
t1 t2 t3 t4

47
We have RDD, than how to “operate” it?

48
# Transformation (can wait)
RDD1 => “map” => RDD2
# Action (cannot wait)
RDD1 => “reduce” => ??

49
scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at
<console>:23
scala> val addone = distData.map(x => x + 1) // Transformation
addone: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[6] at map at
<console>:25
scala> val back = addone.map(x => x - 1) // Transformation
back: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[7] at map at <console>:27
scala> val sum = back.reduce( (l, r) => l + r) // Action
sum: Int = 15

50
Passing functions
Object Util {
def addOne(x:Int) = {
x + 1
}
}
val addone_v1 = distData.map(x => x + 1)
// or
val addone_v2 = distData.map(Util.addOne)

51
Popular Transformations
map(func):
run func(x)
filter(func)
return x if func(x) is true
sample(withReplacement, fraction, seed)
union(otherDataset)
intersection(otherDataset)

52
Popular Transformations (cont.)
Assuming RDD[(K, V)]
groupByKey([numTasks])
return a dataset of (K, Iterable<V>) pairs.
reduceByKey(func, [numTasks])
groupByKey and then reduce by “func”
join(otherDataset, [numTasks])
(K, V) join (K, W) => (K, (V, W))

53
Popular Actions
reduce(func):
(left, right) => func(left, right)
collect()
force computing transformations
count()
first()
take(n)
persist()

54
More about “persist” -- reuse
Input
(disk)
RDD1
(in memory)
RDD2
(in disk)
RDD3
(in memory)
Output
(disk)
t1 t2 t3
t4
RDD4
(in memory)
persist()
t5

58
val file = sc.textFile("wc.txt")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey( (l, r) => l + r)
// Using SQL syntax is possible
import sqlContext.implicits._
val sq = new org.apache.spark.sql.SQLContext(sc)
case class WC(word: String, count: Int)
val wordcount = counts.map(col => WC(col._1, col._2))
val df = wordcount.toDF()
df.registerTempTable("tbl")
val avg = sq.sql("SELECT AVG(count) FROM tbl")

60
Guess what I’m doing here?
// row: (country, city, profit)
data .filter( _._1 == “us”)
.map ( (a, b, c) => (b, c))
.groupBy( _._1 )
.mapValues( v =>
v.reduce(
(a, b) => (a._1, a._2+b._2)
)
)
.values
.sortBy (x => x._2, false)
.take(3)

61
sq.sql(”
SELECT city, SUM(profit) as p
FROM data
WHERE country=‘us’
GROUP BY city
ORDER BY p DESC
LIMIT 3
")
//(country, city, profit)
data .filter( _._1 == “us”)
.map ( (a, b, c) => (b, c))
.groupBy( _._1 )
.mapValues( v =>
v.reduce(
(a, b) => (a._1,
a._2+b._2)
)
)
.values
.sortBy (x => x._2, false)
.take(3)
Find top 3 cities in US with highest profit

64
chsieh@dev-sfear01:~/SparkTutorial$ cat kmeans_data.txt
0.0 0.0 0.0
0.1 0.1 0.1
0.5 0.5 0.8
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2

65
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ')
.map(_.toDouble))).cache()
// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(
parsedData, numClusters, numIterations
)
// Show results
scala> clusters.clusterCenters
res0: Array[org.apache.spark.mllib.linalg.Vector] =
Array([9.099999999999998,9.099999999999998,9.099999999999998],
[0.19999999999999998,0.19999999999999998,0.3])

PageRank: Random Surfer Model
The probability of a Web surfer to reach a page
after many clicks, following random links
Random Click

70
Credit: http://en.wikipedia.org/wiki/PageRank

PageRank
 PR(p) = PR(p1)/c1 + … + PR(pk)/ck
pi : page pointing to p,
ci : number of links in pi
 One equation for every page
 N equations, N unknown variables
Credit: Prof. John Cho / CS144 (UCLA)

Users.txt
1,BarackObama,Barack Obama
2,ladygaga,Goddess of Love
3,jeresig,John Resig
4,justinbieber,Justin Bieber
6,matei_zaharia,Matei Zaharia
7,odersky,Martin Odersky
8,anonsys
72

followers.txt
2 1
4 1
1 2
6 3
7 3
7 6
6 7
3 7
73

74
// Load the edges as a graph
val graph = GraphLoader.edgeListFile(sc, "followers.txt")
// Run PageRank
val ranks = graph.pageRank(0.0001).vertices // id, rank
// Join the ranks with the usernames
val users = sc.textFile("users.txt")
.map { line =>
val fields = line.split(",")
(fields(0).toLong, fields(1)) // id, username
}
val ranksByUsername = users.join(ranks).map {
case (id, (username, rank)) => (username, rank)
}
// Print the result
println(ranksByUsername.collect().mkString("n"))

Modern technologies in data science

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Modern technologies in data science

Ähnlich wie Modern technologies in data science (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Modern technologies in data science