SlideShare ist ein Scribd-Unternehmen logo
1 von 50
Downloaden Sie, um offline zu lesen
Clustering with Spark 
Sandy Ryza / Data Science / Cloudera
Me 
● Data scientist at Cloudera 
● Recently lead Apache Spark development at 
Cloudera 
● Before that, committing on Apache Hadoop 
● Before that, studying combinatorial 
optimization and distributed systems at 
Brown
Sometimes you find yourself 
with lots of stuff
Large Scale Learning
Network Packets
Detect Network Intrusions
Credit Card Transactions
Detect Fraud
Movie Viewings
Recommend Movies
Unsupervised Learning 
● Learn hidden structure of your data 
● Interpret new data as it relates to this 
structure
Two Main Problems 
● Designing a system for processing huge 
data in parallel 
● Taking advantage of it with algorithms that 
work well in parallel
MapReduce 
Map Map Map Map Map Map Map Map Map Map Map Map 
Reduce Reduce Reduce Reduce 
Key advances by MapReduce: 
•Data Locality: Automatic split computation and launch of mappers appropriately 
•Fault tolerance: Write out of intermediate results and restartable mappers meant 
ability to run on commodity hardware 
•Linear scalability: Combination of locality + programming model that forces developers 
to write generally scalable solutions to problems 
* CONFIDENTIAL - RESTRICTED
MapReduce 
Map Map Map Map Map Map Map Map Map Map Map Map 
Reduce Reduce Reduce Reduce 
Limitations of MapReduce 
•Each job reads data from HDFS 
•No concept of a session 
•Jobs are rigin map-then-reduce 
* CONFIDENTIAL - RESTRICTED
Spark is a general purpose computation framework geared towards massive 
data - more flexible than MapReduce 
Extra properties: 
•Leverages distributed memory 
•Full Directed Graph expressions for data parallel computations 
•Improved developer experience 
Yet retains: 
Linear scalability, Fault-tolerance and Data-Locality 
* CONFIDENTIAL - RESTRICTED
RDDs 
val lines = sc.textFile 
(“bigfile.txt”) 
bigfile.txt lines 
val numbers = lines.map 
((x) => x.toDouble) numbers.sum() 
numbers 
Partition 
Partition 
Partition 
Partition 
Partition 
Partition 
HDFS 
sum 
Driver
RDDs 
val lines = sc.textFile 
(“bigfile.txt”) 
bigfile.txt lines 
val numbers = lines.map 
((x) => x.toInt) numbers.cache() 
numbers 
Partition 
Partition 
Partition 
Partition 
Partition 
Partition 
HDFS 
.sum() 
sum 
Driver
numbers.sum() 
bigfile.txt lines numbers 
Partition 
Partition 
Partition 
sum 
Driver
Spark MLlib 
Discrete Continuous 
Supervised Classification 
● Logistic regression (and 
regularized variants) 
● Linear SVM 
● Naive Bayes 
● Random Decision Forests 
(soon) 
Regression 
● Linear regression (and 
regularized variants) 
Unsupervised Clustering 
● K-means 
Dimensionality reduction, matrix 
factorization 
● Principal component analysis / 
singular value decomposition 
● Alternating least squares
Spark MLlib 
Discrete Continuous 
Supervised Classification 
● Logistic regression (and 
regularized variants) 
● Linear SVM 
● Naive Bayes 
● Random Decision Forests 
(soon) 
Regression 
● Linear regression (and 
regularized variants) 
Unsupervised Clustering 
● K-means 
Dimensionality reduction, matrix 
factorization 
● Principal component analysis / 
singular value decomposition 
● Alternating least squares
Using it 
val data = sc.textFile("kmeans_data.txt") 
val parsedData = data.map( _.split(' ').map(_.toDouble)) 
// Cluster the data into two classes using KMeans 
val numIterations = 20 
val numClusters = 2 
val clusters = KMeans.train(parsedData, numClusters, 
numIterations)
K-Means 
● Choose some initial centers 
● Then alternate between two steps: 
○ Assign each point to a cluster based on 
existing centers 
○ Recompute cluster centers from the 
points in each cluster
K-Means - very parallelizable 
● Alternate between two steps: 
○ Assign each point to a cluster based on 
existing centers 
■ Process each data point independently 
○ Recompute cluster centers from the 
points in each cluster 
■ Average across partitions
// Find the sum and count of points mapping to each center 
val totalContribs = data.mapPartitions { points => 
val k = centers.length 
val dims = centers(0).vector.length 
val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]]) 
val counts = Array.fill(k)(0L) 
points.foreach { point => 
val (bestCenter, cost) = KMeans.findClosest(centers, point) 
costAccum += cost 
sums(bestCenter) += point.vector 
counts(bestCenter) += 1 
} 
val contribs = for (j <- 0 until k) yield { 
(j, (sums(j), counts(j))) 
} 
contribs.iterator 
}.reduceByKey(mergeContribs).collectAsMap()
// Update the cluster centers and costs 
var changed = false 
var j = 0 
while (j < k) { 
val (sum, count) = totalContribs(j) 
if (count != 0) { 
sum /= count.toDouble 
val newCenter = new BreezeVectorWithNorm(sum) 
if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) { 
changed = true 
} 
centers(j) = newCenter 
} 
j += 1 
} 
if (!changed) { 
logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations") 
} 
cost = costAccum.value
The Problem 
● K-Means is very sensitive to initial set of 
center points chosen. 
● Best existing algorithm for choosing centers 
is highly sequential.
K-Means++ 
● Start with random point from dataset 
● Pick another one randomly, with probability 
proportional to distance from the closest 
already chosen 
● Repeat until initial centers chosen
K-Means++ 
● Initial cluster has expected bound of O(log k) 
of optimum cost
K-Means++ 
● Requires k passes over the data
K-Means|| 
● Do only a few (~5) passes 
● Sample m points on each pass 
● Oversample 
● Run K-Means++ on sampled points to find 
initial centers
Then on the full data...
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL

Weitere ähnliche Inhalte

Was ist angesagt?

Hadoop combiner and partitioner
Hadoop combiner and partitionerHadoop combiner and partitioner
Hadoop combiner and partitionerSubhas Kumar Ghosh
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorSubhas Kumar Ghosh
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Jen Aman
 
Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices
Presto: Distributed Machine Learning and Graph Processing with Sparse MatricesPresto: Distributed Machine Learning and Graph Processing with Sparse Matrices
Presto: Distributed Machine Learning and Graph Processing with Sparse MatricesQian Lin
 
Start From A MapReduce Graph Pattern-recognize Algorithm
Start From A MapReduce Graph Pattern-recognize AlgorithmStart From A MapReduce Graph Pattern-recognize Algorithm
Start From A MapReduce Graph Pattern-recognize AlgorithmYu Liu
 
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...Menlo Systems GmbH
 
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Databricks
 
Dremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasetsDremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasetsCarl Lu
 
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ..."MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...Adrian Florea
 
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...Accumulo Summit
 
MapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large ClustersMapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large Clusterskazuma_sato
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduceM Baddar
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...Spark Summit
 
Mapreduce total order sorting technique
Mapreduce total order sorting techniqueMapreduce total order sorting technique
Mapreduce total order sorting techniqueUday Vakalapudi
 
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...Accumulo Summit
 

Was ist angesagt? (20)

Hadoop combiner and partitioner
Hadoop combiner and partitionerHadoop combiner and partitioner
Hadoop combiner and partitioner
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparator
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 
Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices
Presto: Distributed Machine Learning and Graph Processing with Sparse MatricesPresto: Distributed Machine Learning and Graph Processing with Sparse Matrices
Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices
 
Start From A MapReduce Graph Pattern-recognize Algorithm
Start From A MapReduce Graph Pattern-recognize AlgorithmStart From A MapReduce Graph Pattern-recognize Algorithm
Start From A MapReduce Graph Pattern-recognize Algorithm
 
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...
 
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
 
Dremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasetsDremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasets
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ..."MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
 
Google's Dremel
Google's DremelGoogle's Dremel
Google's Dremel
 
[ppt]
[ppt][ppt]
[ppt]
 
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
 
MapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large ClustersMapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large Clusters
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
 
Mapreduce total order sorting technique
Mapreduce total order sorting techniqueMapreduce total order sorting technique
Mapreduce total order sorting technique
 
Hadoop job chaining
Hadoop job chainingHadoop job chaining
Hadoop job chaining
 
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
 

Andere mochten auch

Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science ChallengeMark Nichols, P.E.
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringIJRES Journal
 
Optimization for iterative queries on Mapreduce
Optimization for iterative queries on MapreduceOptimization for iterative queries on Mapreduce
Optimization for iterative queries on Mapreducemakoto onizuka
 
Spark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, alticSpark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, alticALTIC Altic
 
Learning Apache Spark by examples
Learning Apache Spark by examplesLearning Apache Spark by examples
Learning Apache Spark by examplesSamuel Yee
 
Lec4 Clustering
Lec4 ClusteringLec4 Clustering
Lec4 Clusteringmobius.cn
 
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Varad Meru
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduceVarad Meru
 
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...Victor Giannakouris
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Milind Bhandarkar
 
Anomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark StreamingAnomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark StreamingKeira Zhou
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopPrasanna Rajaperumal
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache SparkCloudera, Inc.
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnDatabricks
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkDB Tsai
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 

Andere mochten auch (20)

Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science Challenge
 
MachineLearning_MPI_vs_Spark
MachineLearning_MPI_vs_SparkMachineLearning_MPI_vs_Spark
MachineLearning_MPI_vs_Spark
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text Clustering
 
Optimization for iterative queries on Mapreduce
Optimization for iterative queries on MapreduceOptimization for iterative queries on Mapreduce
Optimization for iterative queries on Mapreduce
 
Spark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, alticSpark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, altic
 
Learning Apache Spark by examples
Learning Apache Spark by examplesLearning Apache Spark by examples
Learning Apache Spark by examples
 
Lec4 Clustering
Lec4 ClusteringLec4 Clustering
Lec4 Clustering
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
 
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
 
Parallel-kmeans
Parallel-kmeansParallel-kmeans
Parallel-kmeans
 
Anomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark StreamingAnomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark Streaming
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoop
 
Predictive Model
Predictive ModelPredictive Model
Predictive Model
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 

Ähnlich wie Sandy Ryza – Software Engineer, Cloudera at MLconf ATL

Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1Stefanie Zhao
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkDatio Big Data
 
svm-proyekt.pptx
svm-proyekt.pptxsvm-proyekt.pptx
svm-proyekt.pptxElinEliyev
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkAlpine Data
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkSpark Summit
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015cdmaxime
 
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014cdmaxime
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Intel® Software
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionChetan Khatri
 
A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelA Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelJenny Liu
 
Distributed approximate spectral clustering for large scale datasets
Distributed approximate spectral clustering for large scale datasetsDistributed approximate spectral clustering for large scale datasets
Distributed approximate spectral clustering for large scale datasetsBita Kazemi
 

Ähnlich wie Sandy Ryza – Software Engineer, Cloudera at MLconf ATL (20)

Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
svm-proyekt.pptx
svm-proyekt.pptxsvm-proyekt.pptx
svm-proyekt.pptx
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Support Vector Machines ( SVM )
Support Vector Machines ( SVM ) Support Vector Machines ( SVM )
Support Vector Machines ( SVM )
 
Hadoop classes in mumbai
Hadoop classes in mumbaiHadoop classes in mumbai
Hadoop classes in mumbai
 
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
 
Planet
PlanetPlanet
Planet
 
Big data analytics_beyond_hadoop_public_18_july_2013
Big data analytics_beyond_hadoop_public_18_july_2013Big data analytics_beyond_hadoop_public_18_july_2013
Big data analytics_beyond_hadoop_public_18_july_2013
 
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
 
Presentation
PresentationPresentation
Presentation
 
A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelA Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in Parallel
 
Distributed approximate spectral clustering for large scale datasets
Distributed approximate spectral clustering for large scale datasetsDistributed approximate spectral clustering for large scale datasets
Distributed approximate spectral clustering for large scale datasets
 

Mehr von MLconf

Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...MLconf
 
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingTed Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingMLconf
 
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...MLconf
 
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushIgor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushMLconf
 
Josh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceJosh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceMLconf
 
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...MLconf
 
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...MLconf
 
Meghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMeghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMLconf
 
Noam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data CollectionNoam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data CollectionMLconf
 
June Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of MLJune Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of MLMLconf
 
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksSneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksMLconf
 
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...MLconf
 
Vito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldVito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldMLconf
 
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...MLconf
 
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...MLconf
 
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...MLconf
 
Neel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeNeel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeMLconf
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...MLconf
 
Soumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareSoumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareMLconf
 
Roy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesRoy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesMLconf
 

Mehr von MLconf (20)

Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
 
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingTed Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
 
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
 
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushIgor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
 
Josh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceJosh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious Experience
 
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
 
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
 
Meghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMeghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the Cheap
 
Noam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data CollectionNoam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data Collection
 
June Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of MLJune Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of ML
 
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksSneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
 
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
 
Vito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldVito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI World
 
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
 
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
 
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
 
Neel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeNeel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to code
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
 
Soumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareSoumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better Software
 
Roy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesRoy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime Changes
 

Kürzlich hochgeladen

Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 

Kürzlich hochgeladen (20)

Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

Sandy Ryza – Software Engineer, Cloudera at MLconf ATL

  • 1. Clustering with Spark Sandy Ryza / Data Science / Cloudera
  • 2. Me ● Data scientist at Cloudera ● Recently lead Apache Spark development at Cloudera ● Before that, committing on Apache Hadoop ● Before that, studying combinatorial optimization and distributed systems at Brown
  • 3. Sometimes you find yourself with lots of stuff
  • 11. Unsupervised Learning ● Learn hidden structure of your data ● Interpret new data as it relates to this structure
  • 12.
  • 13.
  • 14. Two Main Problems ● Designing a system for processing huge data in parallel ● Taking advantage of it with algorithms that work well in parallel
  • 15. MapReduce Map Map Map Map Map Map Map Map Map Map Map Map Reduce Reduce Reduce Reduce Key advances by MapReduce: •Data Locality: Automatic split computation and launch of mappers appropriately •Fault tolerance: Write out of intermediate results and restartable mappers meant ability to run on commodity hardware •Linear scalability: Combination of locality + programming model that forces developers to write generally scalable solutions to problems * CONFIDENTIAL - RESTRICTED
  • 16. MapReduce Map Map Map Map Map Map Map Map Map Map Map Map Reduce Reduce Reduce Reduce Limitations of MapReduce •Each job reads data from HDFS •No concept of a session •Jobs are rigin map-then-reduce * CONFIDENTIAL - RESTRICTED
  • 17. Spark is a general purpose computation framework geared towards massive data - more flexible than MapReduce Extra properties: •Leverages distributed memory •Full Directed Graph expressions for data parallel computations •Improved developer experience Yet retains: Linear scalability, Fault-tolerance and Data-Locality * CONFIDENTIAL - RESTRICTED
  • 18. RDDs val lines = sc.textFile (“bigfile.txt”) bigfile.txt lines val numbers = lines.map ((x) => x.toDouble) numbers.sum() numbers Partition Partition Partition Partition Partition Partition HDFS sum Driver
  • 19. RDDs val lines = sc.textFile (“bigfile.txt”) bigfile.txt lines val numbers = lines.map ((x) => x.toInt) numbers.cache() numbers Partition Partition Partition Partition Partition Partition HDFS .sum() sum Driver
  • 20. numbers.sum() bigfile.txt lines numbers Partition Partition Partition sum Driver
  • 21. Spark MLlib Discrete Continuous Supervised Classification ● Logistic regression (and regularized variants) ● Linear SVM ● Naive Bayes ● Random Decision Forests (soon) Regression ● Linear regression (and regularized variants) Unsupervised Clustering ● K-means Dimensionality reduction, matrix factorization ● Principal component analysis / singular value decomposition ● Alternating least squares
  • 22. Spark MLlib Discrete Continuous Supervised Classification ● Logistic regression (and regularized variants) ● Linear SVM ● Naive Bayes ● Random Decision Forests (soon) Regression ● Linear regression (and regularized variants) Unsupervised Clustering ● K-means Dimensionality reduction, matrix factorization ● Principal component analysis / singular value decomposition ● Alternating least squares
  • 23.
  • 24.
  • 25. Using it val data = sc.textFile("kmeans_data.txt") val parsedData = data.map( _.split(' ').map(_.toDouble)) // Cluster the data into two classes using KMeans val numIterations = 20 val numClusters = 2 val clusters = KMeans.train(parsedData, numClusters, numIterations)
  • 26. K-Means ● Choose some initial centers ● Then alternate between two steps: ○ Assign each point to a cluster based on existing centers ○ Recompute cluster centers from the points in each cluster
  • 27.
  • 28.
  • 29.
  • 30.
  • 31.
  • 32. K-Means - very parallelizable ● Alternate between two steps: ○ Assign each point to a cluster based on existing centers ■ Process each data point independently ○ Recompute cluster centers from the points in each cluster ■ Average across partitions
  • 33. // Find the sum and count of points mapping to each center val totalContribs = data.mapPartitions { points => val k = centers.length val dims = centers(0).vector.length val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]]) val counts = Array.fill(k)(0L) points.foreach { point => val (bestCenter, cost) = KMeans.findClosest(centers, point) costAccum += cost sums(bestCenter) += point.vector counts(bestCenter) += 1 } val contribs = for (j <- 0 until k) yield { (j, (sums(j), counts(j))) } contribs.iterator }.reduceByKey(mergeContribs).collectAsMap()
  • 34. // Update the cluster centers and costs var changed = false var j = 0 while (j < k) { val (sum, count) = totalContribs(j) if (count != 0) { sum /= count.toDouble val newCenter = new BreezeVectorWithNorm(sum) if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) { changed = true } centers(j) = newCenter } j += 1 } if (!changed) { logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations") } cost = costAccum.value
  • 35.
  • 36. The Problem ● K-Means is very sensitive to initial set of center points chosen. ● Best existing algorithm for choosing centers is highly sequential.
  • 37.
  • 38. K-Means++ ● Start with random point from dataset ● Pick another one randomly, with probability proportional to distance from the closest already chosen ● Repeat until initial centers chosen
  • 39. K-Means++ ● Initial cluster has expected bound of O(log k) of optimum cost
  • 40. K-Means++ ● Requires k passes over the data
  • 41. K-Means|| ● Do only a few (~5) passes ● Sample m points on each pass ● Oversample ● Run K-Means++ on sampled points to find initial centers
  • 42.
  • 43.
  • 44.
  • 45.
  • 46.
  • 47.
  • 48.
  • 49. Then on the full data...