2. About Me
• Experience
- Data Engineer at Vpon
- TWM, Keywear, Nielsen
• Bryan's notes for data analysis
http://bryannotes.blogspot.tw
• Spark.TW
• LinkedIn
https://tw.linkedin.com/pub/bryan-yang/7b/763/a79
6. Terminology
• Observation
- The basic unit of data.
• Features
- Attributes that can be used to describe each observation
in a quantitative manner.
• Feature vector
- An n-dimensional vector of numerical features that
represents some object.
• Training/ Testing/ Evaluation set
- Sets of data used to discover and validate potential predictive relationships.
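To make the terminology concrete, here is a minimal sketch (plain NumPy; the feature names and values are invented for illustration):
import numpy as np
# One observation described in a quantitative manner.
# Features: weight in grams, diameter in cm, redness score in [0, 1].
feature_vector = np.array([150.0, 7.5, 0.9])  # a 3-dimensional feature vector
print feature_vector.shape  # (3,)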
8. Learning (Training)
ID  Color  Type   Shape          is apple (Label)
1   Red    Fruit  Circle         Y
2   Red    Fruit  Circle         Y
3   Black  Logo   nearly Circle  N
4   Blue   N/A    Circle         N
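As a minimal sketch (the 0/1 encoding is my own, not from the slides), the table above could be fed to MLlib by dummy-coding the categorical features and using "is apple" as the label:
from pyspark.mllib.regression import LabeledPoint
# Features per observation: [is_red, is_fruit, is_circle]; label: is apple (Y=1, N=0)
training = [
    LabeledPoint(1.0, [1.0, 1.0, 1.0]),  # ID 1: Red, Fruit, Circle -> Y
    LabeledPoint(1.0, [1.0, 1.0, 1.0]),  # ID 2: Red, Fruit, Circle -> Y
    LabeledPoint(0.0, [0.0, 0.0, 0.0]),  # ID 3: Black, Logo, nearly Circle -> N
    LabeledPoint(0.0, [0.0, 0.0, 1.0]),  # ID 4: Blue, N/A, Circle -> N
]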
9. Categories of Machine Learning
• Classification: predict a class from observations.
• Clustering: group observations into meaningful
groups.
• Regression: predict a value from observations.
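For orientation, each category maps onto an MLlib entry point (a sketch; these are the classic pyspark.mllib classes, parameters omitted):
# Classification: predict a class from observations
from pyspark.mllib.classification import LogisticRegressionWithSGD
# Clustering: group observations into meaningful groups
from pyspark.mllib.clustering import KMeans
# Regression: predict a value from observations
from pyspark.mllib.regression import LinearRegressionWithSGD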
15. What is MLlib
• MLlib is an Apache Spark component focusing on
machine learning:
• MLlib is Spark's core ML library
• Developed by the MLbase team in AMPLab
• 80+ contributors from various organizations
• Supports Scala, Python, and Java APIs
20. Dense & Sparse
• Raw Data:
ID  A  B  C  D  E  F
1   1  0  0  0  0  3
2   0  1  0  1  0  2
3   1  1  1  0  1  1
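Row ID 1 above can be stored either way; a minimal sketch:
from pyspark.mllib.linalg import Vectors
# Row 1 is non-zero only in columns A (index 0) and F (index 5).
dense = Vectors.dense([1.0, 0.0, 0.0, 0.0, 0.0, 3.0])
sparse = Vectors.sparse(6, [0, 5], [1.0, 3.0])  # size, non-zero indices, values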
21. Dense vs Sparse
• Training Set:
- number of examples: 12 million
- number of features: 500
- sparsity: 10%
• Storing the data as sparse vectors not only saved
storage but also gave a 4x speedup:
        Dense  Sparse
Storage 47GB   7GB
Time    240s   58s
22. Labeled Point
• Dummy variable (1, 0)
• Categorical variable (0, 1, 2, …)
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
24. Descriptive Statistics
• Supported functions:
- count
- max
- min
- mean
- variance
- …
• Supported data types:
- Dense
- Sparse
- Labeled Point
25. Example
from pyspark.mllib.stat import Statistics
from math import sqrt
import numpy as np
## example data (at least a 2 x 2 matrix)
data = np.array([[1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0, 4.0, 5.0]])
## convert to an RDD
distData = sc.parallelize(data)
## compute column summary statistics
summary = Statistics.colStats(distData)
print "Summary Statistics:"
print " Mean: {}".format(round(summary.mean()[0], 3))
print " St. deviation: {}".format(round(sqrt(summary.variance()[0]), 3))
print " Max value: {}".format(round(summary.max()[0], 3))
print " Min value: {}".format(round(summary.min()[0], 3))
print " Total value count: {}".format(summary.count())
print " Number of non-zero values: {}".format(summary.numNonzeros()[0])
27. Quiz 1
• Which statement is NOT a purpose of descriptive statistics?
1. Find the extreme values of the data.
2. Get a preliminary understanding of the data's properties.
3. Describe the full properties of the data.
29. Example
• Data Set:
      A  B  C  D  E  F  G  H
Case1 1  3  4  3  7  5  6  8
Case2 1  2  3  4  5  6  7  8
from pyspark.mllib.stat import Statistics
## choose the correlation method
corrType = 'pearson'
## create the test data sets
group1 = sc.parallelize([1, 3, 4, 3, 7, 5, 6, 8])
group2 = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8])
corr = Statistics.corr(group1, group2, corrType)
print corr
# 0.87
32. Quiz 2
• In which situation below can Pearson correlation NOT
be used?
1. When two variables have no correlation
2. When two variables have an exponential relationship
3. When the data has more than 30 rows
4. When the variables are normally distributed
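For situation 2, a rank-based method such as Spearman correlation is the usual alternative, since it only assumes a monotonic relationship; a sketch with invented data:
from pyspark.mllib.stat import Statistics
x = sc.parallelize([1.0, 2.0, 3.0, 4.0, 5.0])
y = sc.parallelize([2.0, 4.0, 8.0, 16.0, 32.0])  # y = 2**x, an exponential relationship
print Statistics.corr(x, y, 'spearman')  # 1.0: perfectly monotonic
print Statistics.corr(x, y, 'pearson')   # < 1.0: Pearson understates the nonlinear relationship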
34. K-means
⢠K-means clustering aims to partition n observations
into k clusters in which each observation belongs to
the cluster with the nearest mean, serving as a
prototype of the cluster.
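Formally, K-means searches for cluster assignments S = {S_1, …, S_k} minimizing the within-cluster sum of squares (the same quantity the code below reports as WSSSE):
\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2
where \mu_i is the mean (center) of the points in cluster S_i.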
42. K-Means: Python
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt
# Load and parse the data
data = sc.textFile("data/mllib/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10,
                        runs=10, initializationMode="random")
# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))
WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
50. Sample Code
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
from numpy import array
# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])
data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)
# Build the model
model = LogisticRegressionWithSGD.train(parsedData)
# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))
51. Quiz 4
• Which situation is NOT suitable for logistic regression?
1. data with multiple label classes
2. data with no labels
3. data with more than 1,000 features
4. data with dummy features
54. Definition
• Collaborative Filtering (CF) is a subset of algorithms
that exploit other users' and items' ratings (selection and
purchase information can also be used) together with the
target user's history to recommend an item that the
target user has not rated yet.
• The fundamental assumption behind this approach is
that other users' preferences over the items can be used
to recommend an item to a user who has not seen or
purchased the item before.
55. Definition
• CF differs from content-based methods in that the
content of the user or the item itself does not play a
role in the recommendation, but rather how (ratings) and
which users (users) rated a particular item.
(Users' preferences are shared through a group of
users who selected, purchased, or rated items
similarly.)
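In MLlib, collaborative filtering is implemented with ALS (alternating least squares); a minimal sketch with invented (user, product, rating) triples:
from pyspark.mllib.recommendation import ALS, Rating
ratings = sc.parallelize([
    Rating(1, 1, 5.0), Rating(1, 2, 1.0),
    Rating(2, 1, 4.0), Rating(2, 3, 5.0),
])
model = ALS.train(ratings, rank=10, iterations=10)
# Predict how user 2 would rate product 2, which user 2 has not rated yet
print model.predict(2, 2)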