SlideShare ist ein Scribd-Unternehmen logo
1 von 41
Downloaden Sie, um offline zu lesen
Presented by
Date
Event
Data Analytics and
Machine Learning: From
Node to Cluster
Understanding use cases to optimize on
ARM Ecosystem
Viswanath Puttagunta
Ganesh Raju
BKK16-404B
March 10th, 2016
Linaro Connect BKK16
Version 0.1
Vish Puttagunta
● Technical Program
Manager, Linaro
● DSP (TI C64x) optimizations
(Image / Signal Processing)
● ARM®
NEON™
optimizations upstreamed to
opus audio codec
● Data Analysis and Machine
Learning: Why ARM
Ganesh Raju
● Tech Lead, Big Data, Linaro
● Brings in Enterprise
experience in implementing
Big Data solutions
Data Science: Big Picture
Data
Analytics
Machine
Learning
Deep
Learning
High
Performance
Computing
(HPC)
Big Data
Statistics
Lin/Log Reg
KNN, Decision
Trees,PCA...
Software Tools
Python,R..
Pandas, scikit-learn, nltk..
Caffe, TensorFlow..
Spark, MLlib, Hadoop..
Deep
Learning/NN
CNN
RNN..
Data
Science
● Value Prediction
● Classification
● Transformation
● Correlations
● Causalities
● Wrangling...
Overview
● Basic Data Science use cases
● Pandas Library(Python)
○ Time Series, Correlations, Risk Analysis
● Supervised Learning
○ scikit-learn Library (Python)
● Understand operations done repeatedly
● Open discussion for next steps:
○ Profile, Optimize (ARM Neon, OpenCL, OpenMP..) on
a single machine.
○ Scale beyond a single machine
● Ref: https://github.com/viswanath-puttagunta/bkk16_MLPrimer
Goal
● Make the tools and operations work out of the
box on ARM
○ ARM Neon, OpenCL, OpenMP.
○ ‘pip install’ should just work :)
● Why Now? (Hint: Spark)
Pandas Data Analysis (Pandas)
● Origins in Finance for Data Manipulation and
Analysis (Python Library)
● DataFrame object
● Repeat Computations:
○ Means, Min, Max, Percentage Changes
○ Standard Deviation
○ Differentiation (Shift and subtract)
○ ...
Op: Rolling Mean (Pandas)
Op: Percentage Change (Pandas)
Operations: Shift, Subtract, Divide
Op: Percentage Change Vs Risk (Variance)
Op: Correlation (Pandas)
Operations: Mean, Variance, Square Root
Pearson Correlation Coef
Source: https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
Correlation Heat Maps
Linear Regression (Predictor)
Objective: Fit line/curve to minimize a cost function
Linear Regression (Predictor)...
Operations: Gradient Descent: Matrix Transforms/Inverse/Multiplications.
For Lnr Reg, directly reduces to:
Θ = (XT
X)-1
XT
Y
Source: https://www.youtube.com/watch?v=SqA6TujbmWw&list=PLE6Wd9FR--Ecf_5nCbnSQMHqORpiChfJf&index=16
https://youtu.be/WnqQrPNYz5Q?list=PLaXDtXvwY-oDvedS3f4HW0b4KxqpJ_imw&t=284
Logistic Regression (Classification)
Operations: Matrix Transforms/Inverse/Multiplications.
Source: https://www.youtube.com/watch?v=Zc7ouSD0DEQ&index=27&list=PLE6Wd9FR--Ecf_5nCbnSQMHqORpiChfJf
Artificial Neural Networks (Eg: Predictor)
Objective: Compute parameters to minimize a cost function
Operations: Matrix Transforms/Inverse/Multiplications, Dot Products...
Source: https://www.yout.ube.com/watch?v=bxe2T-V8XRs
Bayesian Classifiers
Operations (Training): Mean, Standard Deviations, Logs, binning/compare(/histograms)
Note: Some variations based on assumptions on variance (LDA, QDA)
Source: An Introduction to Statistical Learning with Applications in R (Springer Texts)
K-Nearest Neighbors (Classifier)
Operations: Euclidean Distance Computations
Note: K is tunable. Typical to repeat computations for lot of k values
Decision Tree (Predictors & Classifiers)
Operations: Each split done so that
Predictor: Minimize Root Mean Square Error
Classifier: Minimize Gini Index (Note: depth tunable)
Bagging / Random Forests
Operations: Each tree similar to Decision Tree
Note: Typical to see about 200 to 300 trees to stabilize.
Source: An Introduction to Statistical Learning with Applications in R (Springer Texts)
Segway to Spark
● So far, on a single machine with lots of RAM and
compute.
○ How to scale to a cluster
● So far, data from simple csv files
○ How to acquire/clean data on large scale
Discussion
● OpenCL (Deep Learning)
○ Acceleration using GPU, DSPs
○ Outside of Deep Learning: GPU? (scikit-learn faq)
○ But what about Shamrock? OpenCL backend on CPU.
○ Port CUDA kernels into OpenCL kernels in various projects?
● ARM NEON
○ Highly scalable
● OpenMP?
● Tools to profile end-to-end use cases?
○ Perf..
Backup Slides
● Text Analytics
○ Sparse Matrices and operations
○ Word counts (Naive Bayes)
● Libraries (Python)
○ sklearn (No GPU Support!! FAQ)
○ gensim (Text Analytics)
○ TensorFlowTM
(GPU backend available)
○ Caffe (GPU backend available)
○ Theano (GPU backend available)
Presented by
Date
Event
Data science in Distributed
Environment
Ganesh Raju
Tech Lead, Big Data
Linaro
Thursday 10 March 2016
BKK16
Overview
1. Review of Data Science in Single Node
2. Data Science in Distributed Environment
a. Hadoop and its limitations
3. Data Science using Apache Spark
a. Spark Ecosystem
b. Spark ML
c. Spark Streaming
d. Spark 2.0 and Roadmap
4. Q & A
Hadoop Vs Spark
80+ operators compared to 2 operators in Hadoop
Hadoop and its limitations
Spark - Unified Platform
source: www.databricks.com
Machine Learning Data Pipeline
Why in Spark:
● Machine learning algorithms are
○ Complex, multi-stage
○ Iterative
● MapReduce/Hadoop unsuitable
source: http://www.slideshare.net/databricks/2015-0317-scala-days
Spark is In-Memory and Fast
Apache Spark Stack
Unified Engine across diverse workloads and Environments
R Python Scala Java
Spark Core API
Spark SQL +
DataFrames
Streaming MLib / ML
Machine Learning
GraphX
Graph Computation
Spark Core
Internals
SparkContext connects to a cluster manager Obtains executors on cluster nodes Sends app code to
them Sends task to the executors
Driver Program
Cluster Manager
Worker Node
Executor
Worker Node
Cluster
Spark Context
Scheduler
Scheduler
Task
Cache
Task
Executor
Task
Cache
Task
Application Code Distribution
● PySpark enables developers to write driver programs in Python
● Application code is serialized and sent to the worker nodes
● Execution happens in native language (Python/R)
SparkContext
Driver Py4J
Java
Spark
Context
File system
Worker
Node
Worker
Node
Cluster
Python
Python
Python
Python
Python
Python
Pipe
Socket
Apache Spark MLlib / ML Algorithms Supported:
● Classification: Logistic Regression, Linear SVM, Naive Bayes, classification tree
● Regression: Generalized Linear Models (GLMs), Regression Tree
● Collaborative Filtering: Alternating Least Squares (ALS), Non-Negative Matrix Factorization (NMF)
● Clustering: K-Means
● Decomposition: SVD, PCA
● Optimization: Stochastic Gradient Descent (SGD), L-BFGS
Spark Streaming
● Run a streaming computation as a series of very small, deterministic batch jobs
○ Live data streaming is converted into micro batches of input. Batch can be of ½ sec latency.
○ Each batch is processed in Spark
○ Output is returned as micro batches
○ Potential for combining batch and streaming processing.
● Linear models can be trained in streaming fashion
● Model weights can be updated via SGD, thus amenable to streaming
● Consistent API
● Scalable
Apache Spark
General-purpose cluster computing system
● Unified End-to-End data pipeline platform with support for Machine Learning, Streaming, SQL and
Graph Pipelines
● Fast and Expressive Cluster Computing Engine. No Intermediate storage. In-memory Processing. Run
programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
● Rich higher level APIs in Scala, Python, R, Java. Typically less code (2-5x)
● REPL
● Interoperability with other ecosystem components
○ Mesos, YARN
○ EC2, EMR
○ HDFS, S3,
○ HBase, Cassandra, Parquet, Hive, JSON, ElasticSearch
Source: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
Tungsten Phase 2
speedups of 5-10x
Structured Streaming
real-time engine on
SQL / DataFrames
Unifying Datasets
and DataFrames
Major Features in 2.0 - Project Tungsten
Focus on CPU and Memory Optimization
Code Generation
● Runtime Code Generation by dynamically generating bytecode during evaluation and eliminating Java boxing of primitive data types
● Faster serializer and compression on dataformat like Parquet.
Manual Memory Management and Binary Processing
● Off heap memory management. Avoiding non-transient Java objects (store them in binary format), which reduces GC overhead.
● Minimizing memory usage through denser in-memory data format, with less spill (I/O)
● Better memory accounting (size of bytes) rather than relying on heuristics
● For operators that understand data types (in the case of DataFrames and SQL), work directly against binary format in memory, i.e. have
no serialization/deserialization
Cache-aware Computation
● Exploiting cache locality. Faster sorting and hashing for aggregations, joins, and shuffle
Spark 2.0 - Project Tungsten features
Unified API, One Engine, Automatically Optimized
PythonSQL Java/Scala R ...
DataFrame
Logical Plan
Language
Frontend
LLVMJVM GPU NVRAM SSE/SIMD
Tungsten
Backend
...OpenCL
Apache Spark Roadmap
.
● Use LLVM for compilation
● Re-implement parallelizable code using OpenCL/SSE/SIMD to utilize underlying CPU / GPU advancements
towards machine learning.
Spark slave nodes can achieve better performance and energy efficiency if with GPU acceleration by doing further data parallelization and
algorithm acceleration.
De facto Analytics Platform
source: www.databricks.com
Q & A

Weitere ähnliche Inhalte

Was ist angesagt?

A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersIntel® Software
 
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...Spark Summit
 
CUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce ClusterCUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce Clusterairbots
 
Easy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java ProgrammersEasy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java ProgrammersKazuaki Ishizaki
 
Advanced spark deep learning
Advanced spark deep learningAdvanced spark deep learning
Advanced spark deep learningAdam Gibson
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Yahoo Developer Network
 
How to Introduce Telemetry Streaming (gNMI) in Your Network with SNMP with Te...
How to Introduce Telemetry Streaming (gNMI) in Your Network with SNMP with Te...How to Introduce Telemetry Streaming (gNMI) in Your Network with SNMP with Te...
How to Introduce Telemetry Streaming (gNMI) in Your Network with SNMP with Te...InfluxData
 
Parallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNParallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNDataWorks Summit
 
Bringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingBringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingDataWorks Summit
 
Making Hardware Accelerator Easier to Use
Making Hardware Accelerator Easier to UseMaking Hardware Accelerator Easier to Use
Making Hardware Accelerator Easier to UseKazuaki Ishizaki
 
Deep Learning on ARM Platforms - SFO17-509
Deep Learning on ARM Platforms - SFO17-509Deep Learning on ARM Platforms - SFO17-509
Deep Learning on ARM Platforms - SFO17-509Linaro
 
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...inside-BigData.com
 
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUsScaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUsJim Dowling
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Intel® Software
 
Unleashing Data Intelligence with Intel and Apache Spark with Michael Greene
Unleashing Data Intelligence with Intel and Apache Spark with Michael GreeneUnleashing Data Intelligence with Intel and Apache Spark with Michael Greene
Unleashing Data Intelligence with Intel and Apache Spark with Michael GreeneDatabricks
 
Migrating to Apache Spark at Netflix
Migrating to Apache Spark at NetflixMigrating to Apache Spark at Netflix
Migrating to Apache Spark at NetflixDatabricks
 
Scalable Algorithm Design with MapReduce
Scalable Algorithm Design with MapReduceScalable Algorithm Design with MapReduce
Scalable Algorithm Design with MapReducePietro Michiardi
 

Was ist angesagt? (20)

A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing Clusters
 
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
 
CUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce ClusterCUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce Cluster
 
Easy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java ProgrammersEasy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java Programmers
 
Advanced spark deep learning
Advanced spark deep learningAdvanced spark deep learning
Advanced spark deep learning
 
myHadoop 0.30
myHadoop 0.30myHadoop 0.30
myHadoop 0.30
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
 
How to Introduce Telemetry Streaming (gNMI) in Your Network with SNMP with Te...
How to Introduce Telemetry Streaming (gNMI) in Your Network with SNMP with Te...How to Introduce Telemetry Streaming (gNMI) in Your Network with SNMP with Te...
How to Introduce Telemetry Streaming (gNMI) in Your Network with SNMP with Te...
 
Parallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNParallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARN
 
Bringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingBringing complex event processing to Spark streaming
Bringing complex event processing to Spark streaming
 
Making Hardware Accelerator Easier to Use
Making Hardware Accelerator Easier to UseMaking Hardware Accelerator Easier to Use
Making Hardware Accelerator Easier to Use
 
Mmap failure analysis
Mmap failure analysisMmap failure analysis
Mmap failure analysis
 
20140708hcj
20140708hcj20140708hcj
20140708hcj
 
Deep Learning on ARM Platforms - SFO17-509
Deep Learning on ARM Platforms - SFO17-509Deep Learning on ARM Platforms - SFO17-509
Deep Learning on ARM Platforms - SFO17-509
 
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
 
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUsScaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
 
Unleashing Data Intelligence with Intel and Apache Spark with Michael Greene
Unleashing Data Intelligence with Intel and Apache Spark with Michael GreeneUnleashing Data Intelligence with Intel and Apache Spark with Michael Greene
Unleashing Data Intelligence with Intel and Apache Spark with Michael Greene
 
Migrating to Apache Spark at Netflix
Migrating to Apache Spark at NetflixMigrating to Apache Spark at Netflix
Migrating to Apache Spark at Netflix
 
Scalable Algorithm Design with MapReduce
Scalable Algorithm Design with MapReduceScalable Algorithm Design with MapReduce
Scalable Algorithm Design with MapReduce
 

Andere mochten auch

Top 3 Challenges to Profitable Mortgage Lending
Top 3 Challenges to Profitable Mortgage LendingTop 3 Challenges to Profitable Mortgage Lending
Top 3 Challenges to Profitable Mortgage LendingEquifax
 
Graph analytic and machine learning
Graph analytic and machine learningGraph analytic and machine learning
Graph analytic and machine learningStanley Wang
 
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...npinto
 
Top 3 Considerations for Machine Learning on Big Data
Top 3 Considerations for Machine Learning on Big DataTop 3 Considerations for Machine Learning on Big Data
Top 3 Considerations for Machine Learning on Big DataDatameer
 
Predictive Analytics and Machine Learning 101
Predictive Analytics and Machine Learning 101Predictive Analytics and Machine Learning 101
Predictive Analytics and Machine Learning 101Poya Manouchehri
 
Intro au Big Data & Machine Learning
Intro au Big Data & Machine LearningIntro au Big Data & Machine Learning
Intro au Big Data & Machine LearningEric Daoud
 
"From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",...
"From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",..."From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",...
"From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",...Dataconomy Media
 
Best Practices for Big Data Analytics with Machine Learning by Datameer
Best Practices for Big Data Analytics with Machine Learning by DatameerBest Practices for Big Data Analytics with Machine Learning by Datameer
Best Practices for Big Data Analytics with Machine Learning by DatameerDatameer
 
Cursos de Big Data y Machine Learning
Cursos de Big Data y Machine LearningCursos de Big Data y Machine Learning
Cursos de Big Data y Machine LearningStratebi
 
Machine Learning for Actuaries
Machine Learning for ActuariesMachine Learning for Actuaries
Machine Learning for ActuariesArthur Charpentier
 
How to Apply Big Data Analytics and Machine Learning to Real Time Processing ...
How to Apply Big Data Analytics and Machine Learning to Real Time Processing ...How to Apply Big Data Analytics and Machine Learning to Real Time Processing ...
How to Apply Big Data Analytics and Machine Learning to Real Time Processing ...Codemotion
 
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Cynthia Saracco
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningLars Marius Garshol
 

Andere mochten auch (15)

Learning Analytics
Learning AnalyticsLearning Analytics
Learning Analytics
 
Top 3 Challenges to Profitable Mortgage Lending
Top 3 Challenges to Profitable Mortgage LendingTop 3 Challenges to Profitable Mortgage Lending
Top 3 Challenges to Profitable Mortgage Lending
 
Graph analytic and machine learning
Graph analytic and machine learningGraph analytic and machine learning
Graph analytic and machine learning
 
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
 
Top 3 Considerations for Machine Learning on Big Data
Top 3 Considerations for Machine Learning on Big DataTop 3 Considerations for Machine Learning on Big Data
Top 3 Considerations for Machine Learning on Big Data
 
Predictive Analytics and Machine Learning 101
Predictive Analytics and Machine Learning 101Predictive Analytics and Machine Learning 101
Predictive Analytics and Machine Learning 101
 
Intro au Big Data & Machine Learning
Intro au Big Data & Machine LearningIntro au Big Data & Machine Learning
Intro au Big Data & Machine Learning
 
"From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",...
"From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",..."From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",...
"From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",...
 
Best Practices for Big Data Analytics with Machine Learning by Datameer
Best Practices for Big Data Analytics with Machine Learning by DatameerBest Practices for Big Data Analytics with Machine Learning by Datameer
Best Practices for Big Data Analytics with Machine Learning by Datameer
 
Cursos de Big Data y Machine Learning
Cursos de Big Data y Machine LearningCursos de Big Data y Machine Learning
Cursos de Big Data y Machine Learning
 
Machine Learning for Actuaries
Machine Learning for ActuariesMachine Learning for Actuaries
Machine Learning for Actuaries
 
How to Apply Big Data Analytics and Machine Learning to Real Time Processing ...
How to Apply Big Data Analytics and Machine Learning to Real Time Processing ...How to Apply Big Data Analytics and Machine Learning to Real Time Processing ...
How to Apply Big Data Analytics and Machine Learning to Real Time Processing ...
 
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
 
Big Data and Advanced Analytics
Big Data and Advanced AnalyticsBig Data and Advanced Analytics
Big Data and Advanced Analytics
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
 

Ähnlich wie BKK16-404B Data Analytics and Machine Learning- from Node to Cluster

Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkNicola Ferraro
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Ahsan Javed Awan
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDataWorks Summit
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkManish Gupta
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriDemi Ben-Ari
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingAnimesh Chaturvedi
 
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...Databricks
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 
Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Datio Big Data
 

Ähnlich wie BKK16-404B Data Analytics and Machine Learning- from Node to Cluster (20)

Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache Spark
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUs
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
 
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)
 
Bds session 13 14
Bds session 13 14Bds session 13 14
Bds session 13 14
 

Mehr von Linaro

Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloDeep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloLinaro
 
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta VekariaArm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta VekariaLinaro
 
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua MoraHuawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua MoraLinaro
 
Bud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qaBud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qaLinaro
 
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018Linaro
 
HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018Linaro
 
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...Linaro
 
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...Linaro
 
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Linaro
 
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Linaro
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineLinaro
 
HKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteHKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteLinaro
 
HKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopHKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopLinaro
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineLinaro
 
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allHKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allLinaro
 
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorHKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorLinaro
 
HKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMUHKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMULinaro
 
HKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MHKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MLinaro
 
HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation Linaro
 
HKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootHKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootLinaro
 

Mehr von Linaro (20)

Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloDeep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
 
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta VekariaArm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
 
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua MoraHuawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
 
Bud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qaBud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qa
 
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
 
HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018
 
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
 
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
 
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
 
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
 
HKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteHKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening Keynote
 
HKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopHKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP Workshop
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
 
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allHKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
 
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorHKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
 
HKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMUHKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMU
 
HKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MHKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8M
 
HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation
 
HKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootHKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted boot
 

Kürzlich hochgeladen

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 

Kürzlich hochgeladen (20)

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

BKK16-404B Data Analytics and Machine Learning- from Node to Cluster

  • 1. Presented by Date Event Data Analytics and Machine Learning: From Node to Cluster Understanding use cases to optimize on ARM Ecosystem Viswanath Puttagunta Ganesh Raju BKK16-404B March 10th, 2016 Linaro Connect BKK16 Version 0.1
  • 2. Vish Puttagunta ● Technical Program Manager, Linaro ● DSP (TI C64x) optimizations (Image / Signal Processing) ● ARM® NEON™ optimizations upstreamed to opus audio codec ● Data Analysis and Machine Learning: Why ARM Ganesh Raju ● Tech Lead, Big Data, Linaro ● Brings in Enterprise experience in implementing Big Data solutions
  • 3. Data Science: Big Picture Data Analytics Machine Learning Deep Learning High Performance Computing (HPC) Big Data Statistics Lin/Log Reg KNN, Decision Trees,PCA... Software Tools Python,R.. Pandas, scikit-learn, nltk.. Caffe, TensorFlow.. Spark, MLlib, Hadoop.. Deep Learning/NN CNN RNN.. Data Science ● Value Prediction ● Classification ● Transformation ● Correlations ● Causalities ● Wrangling...
  • 4. Overview ● Basic Data Science use cases ● Pandas Library(Python) ○ Time Series, Correlations, Risk Analysis ● Supervised Learning ○ scikit-learn Library (Python) ● Understand operations done repeatedly ● Open discussion for next steps: ○ Profile, Optimize (ARM Neon, OpenCL, OpenMP..) on a single machine. ○ Scale beyond a single machine ● Ref: https://github.com/viswanath-puttagunta/bkk16_MLPrimer
  • 5. Goal ● Make the tools and operations work out of the box on ARM ○ ARM Neon, OpenCL, OpenMP. ○ ‘pip install’ should just work :) ● Why Now? (Hint: Spark)
  • 6. Pandas Data Analysis (Pandas) ● Origins in Finance for Data Manipulation and Analysis (Python Library) ● DataFrame object ● Repeat Computations: ○ Means, Min, Max, Percentage Changes ○ Standard Deviation ○ Differentiation (Shift and subtract) ○ ...
  • 7. Op: Rolling Mean (Pandas)
  • 8. Op: Percentage Change (Pandas) Operations: Shift, Subtract, Divide
  • 9. Op: Percentage Change Vs Risk (Variance)
  • 10. Op: Correlation (Pandas) Operations: Mean, Variance, Square Root Pearson Correlation Coef Source: https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
  • 12. Linear Regression (Predictor) Objective: Fit line/curve to minimize a cost function
  • 13. Linear Regression (Predictor)... Operations: Gradient Descent: Matrix Transforms/Inverse/Multiplications. For Lnr Reg, directly reduces to: Θ = (XT X)-1 XT Y Source: https://www.youtube.com/watch?v=SqA6TujbmWw&list=PLE6Wd9FR--Ecf_5nCbnSQMHqORpiChfJf&index=16 https://youtu.be/WnqQrPNYz5Q?list=PLaXDtXvwY-oDvedS3f4HW0b4KxqpJ_imw&t=284
  • 14. Logistic Regression (Classification) Operations: Matrix Transforms/Inverse/Multiplications. Source: https://www.youtube.com/watch?v=Zc7ouSD0DEQ&index=27&list=PLE6Wd9FR--Ecf_5nCbnSQMHqORpiChfJf
  • 15. Artificial Neural Networks (Eg: Predictor) Objective: Compute parameters to minimize a cost function Operations: Matrix Transforms/Inverse/Multiplications, Dot Products... Source: https://www.yout.ube.com/watch?v=bxe2T-V8XRs
  • 16. Bayesian Classifiers Operations (Training): Mean, Standard Deviations, Logs, binning/compare(/histograms) Note: Some variations based on assumptions on variance (LDA, QDA) Source: An Introduction to Statistical Learning with Applications in R (Springer Texts)
  • 17. K-Nearest Neighbors (Classifier) Operations: Euclidean Distance Computations Note: K is tunable. Typical to repeat computations for lot of k values
  • 18. Decision Tree (Predictors & Classifiers) Operations: Each split done so that Predictor: Minimize Root Mean Square Error Classifier: Minimize Gini Index (Note: depth tunable)
  • 19. Bagging / Random Forests Operations: Each tree similar to Decision Tree Note: Typical to see about 200 to 300 trees to stabilize. Source: An Introduction to Statistical Learning with Applications in R (Springer Texts)
  • 20. Segway to Spark ● So far, on a single machine with lots of RAM and compute. ○ How to scale to a cluster ● So far, data from simple csv files ○ How to acquire/clean data on large scale
  • 21. Discussion ● OpenCL (Deep Learning) ○ Acceleration using GPU, DSPs ○ Outside of Deep Learning: GPU? (scikit-learn faq) ○ But what about Shamrock? OpenCL backend on CPU. ○ Port CUDA kernels into OpenCL kernels in various projects? ● ARM NEON ○ Highly scalable ● OpenMP? ● Tools to profile end-to-end use cases? ○ Perf..
  • 22. Backup Slides ● Text Analytics ○ Sparse Matrices and operations ○ Word counts (Naive Bayes) ● Libraries (Python) ○ sklearn (No GPU Support!! FAQ) ○ gensim (Text Analytics) ○ TensorFlowTM (GPU backend available) ○ Caffe (GPU backend available) ○ Theano (GPU backend available)
  • 23. Presented by Date Event Data science in Distributed Environment Ganesh Raju Tech Lead, Big Data Linaro Thursday 10 March 2016 BKK16
  • 24. Overview 1. Review of Data Science in Single Node 2. Data Science in Distributed Environment a. Hadoop and its limitations 3. Data Science using Apache Spark a. Spark Ecosystem b. Spark ML c. Spark Streaming d. Spark 2.0 and Roadmap 4. Q & A
  • 25. Hadoop Vs Spark 80+ operators compared to 2 operators in Hadoop
  • 26. Hadoop and its limitations
  • 27. Spark - Unified Platform source: www.databricks.com
  • 28. Machine Learning Data Pipeline Why in Spark: ● Machine learning algorithms are ○ Complex, multi-stage ○ Iterative ● MapReduce/Hadoop unsuitable
  • 30. Apache Spark Stack Unified Engine across diverse workloads and Environments R Python Scala Java Spark Core API Spark SQL + DataFrames Streaming MLib / ML Machine Learning GraphX Graph Computation
  • 31. Spark Core Internals SparkContext connects to a cluster manager Obtains executors on cluster nodes Sends app code to them Sends task to the executors Driver Program Cluster Manager Worker Node Executor Worker Node Cluster Spark Context Scheduler Scheduler Task Cache Task Executor Task Cache Task
  • 32. Application Code Distribution ● PySpark enables developers to write driver programs in Python ● Application code is serialized and sent to the worker nodes ● Execution happens in native language (Python/R) SparkContext Driver Py4J Java Spark Context File system Worker Node Worker Node Cluster Python Python Python Python Python Python Pipe Socket
  • 33. Apache Spark MLlib / ML Algorithms Supported: ● Classification: Logistic Regression, Linear SVM, Naive Bayes, classification tree ● Regression: Generalized Linear Models (GLMs), Regression Tree ● Collaborative Filtering: Alternating Least Squares (ALS), Non-Negative Matrix Factorization (NMF) ● Clustering: K-Means ● Decomposition: SVD, PCA ● Optimization: Stochastic Gradient Descent (SGD), L-BFGS
  • 34. Spark Streaming ● Run a streaming computation as a series of very small, deterministic batch jobs ○ Live data streaming is converted into micro batches of input. Batch can be of ½ sec latency. ○ Each batch is processed in Spark ○ Output is returned as micro batches ○ Potential for combining batch and streaming processing. ● Linear models can be trained in streaming fashion ● Model weights can be updated via SGD, thus amenable to streaming ● Consistent API ● Scalable
  • 35. Apache Spark General-purpose cluster computing system ● Unified End-to-End data pipeline platform with support for Machine Learning, Streaming, SQL and Graph Pipelines ● Fast and Expressive Cluster Computing Engine. No Intermediate storage. In-memory Processing. Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. ● Rich higher level APIs in Scala, Python, R, Java. Typically less code (2-5x) ● REPL ● Interoperability with other ecosystem components ○ Mesos, YARN ○ EC2, EMR ○ HDFS, S3, ○ HBase, Cassandra, Parquet, Hive, JSON, ElasticSearch
  • 36. Source: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html Tungsten Phase 2 speedups of 5-10x Structured Streaming real-time engine on SQL / DataFrames Unifying Datasets and DataFrames Major Features in 2.0 - Project Tungsten
  • 37. Focus on CPU and Memory Optimization Code Generation ● Runtime Code Generation by dynamically generating bytecode during evaluation and eliminating Java boxing of primitive data types ● Faster serializer and compression on dataformat like Parquet. Manual Memory Management and Binary Processing ● Off heap memory management. Avoiding non-transient Java objects (store them in binary format), which reduces GC overhead. ● Minimizing memory usage through denser in-memory data format, with less spill (I/O) ● Better memory accounting (size of bytes) rather than relying on heuristics ● For operators that understand data types (in the case of DataFrames and SQL), work directly against binary format in memory, i.e. have no serialization/deserialization Cache-aware Computation ● Exploiting cache locality. Faster sorting and hashing for aggregations, joins, and shuffle Spark 2.0 - Project Tungsten features
  • 38. Unified API, One Engine, Automatically Optimized PythonSQL Java/Scala R ... DataFrame Logical Plan Language Frontend LLVMJVM GPU NVRAM SSE/SIMD Tungsten Backend ...OpenCL
  • 39. Apache Spark Roadmap . ● Use LLVM for compilation ● Re-implement parallelizable code using OpenCL/SSE/SIMD to utilize underlying CPU / GPU advancements towards machine learning. Spark slave nodes can achieve better performance and energy efficiency if with GPU acceleration by doing further data parallelization and algorithm acceleration.
  • 40. De facto Analytics Platform source: www.databricks.com
  • 41. Q & A