Modern Big Data Systems for Machine Learning
Antonio Roldao, Ph.D. CQF.
10/July/2015, Thomson Reuters, London, UK
About Me
http://anton.io
@roldao
This Talk on Big Data Systems
 Data
 Big Data as a Buzzword and the useless 4V’s
 Basic Aspects of Data
 Advanced Aspects of Data
 Small Data Innovations
 Algorithms for Machine Learning
 ML Overview
 Optimization Problems
 Solving Systems of Linear Equations
 Accelerating ML Using Different Technologies
 Distributed Computing
 Computing at Scale
 Platform Examples
Big Data
 4Vs of BD?!
 Volume
 Variety
 Velocity
 Veracity
 Too simplistic and technically
useless!
“Any amount of data that is too big
for Excel to process.”
1956 Hard-drive with 5 MB
Mostly a marketing buzzword that means different things to different people.
Understanding Data – Basic
 Storage formats
 Uncompressed <-> Compressed
 Unencrypted <-> Encrypted
 Human-readable <-> Binary
 Rigid <-> Templated <-> Self-describing
 Mainly regular <-> Irregular
 Different types and encodings
…
 Generation (write) modes
 parallel <-> sequential
 append-only
 in-place updates
 random inserts
…
 Consumption (read) modes
 parallel <-> sequential
 random <-> well defined access
…
Understanding Data – Advanced
 Represents:
 How concepts are connected (graph)
 How connections evolve with time (time series)
 Bitemporal (e.g. value depends on time frame)
 Time value of data (e.g. Useful today, but not tomorrow)
 Sensitivity (e.g. Medical, Economical, Political, Privacy…)
 Interdependency (e.g. one wrong bit destroys everything)
 Cleanliness (e.g. how Noisy it is)
 Truthfulness (e.g. how Accurate it is)
 Redundancy (e.g. how safe does it need to be)
 Density (e.g. how Redundant it is)
 Accessibility (e.g. Local <-> Global)
 Cost / Budget
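One common reading of "bitemporal" is that every value carries two time axes: when it applies and when it was recorded. The sketch below assumes that reading; the Fact type, the as_of helper and the sample prices are hypothetical names and data, not taken from the talk.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    valid_from: date   # when the value starts to apply in the real world
    recorded_on: date  # when the system learned about it
    value: float


# Hypothetical history: a price for 1 July that is corrected on 5 July.
history = [
    Fact(date(2015, 7, 1), date(2015, 7, 1), 100.0),
    Fact(date(2015, 7, 1), date(2015, 7, 5), 101.5),  # later correction
    Fact(date(2015, 7, 3), date(2015, 7, 3), 102.0),
]


def as_of(facts: List[Fact], valid_on: date, known_on: date) -> Optional[float]:
    """Value applying on `valid_on`, using only what was recorded by `known_on`."""
    visible = [f for f in facts
               if f.valid_from <= valid_on and f.recorded_on <= known_on]
    if not visible:
        return None
    # Prefer the latest validity period, then its most recent correction.
    return max(visible, key=lambda f: (f.valid_from, f.recorded_on)).value
```

Asking for the 2 July value as known on 2 July returns 100.0, while the same question asked on 6 July returns the corrected 101.5.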
Myriad of Data-stores/bases
 File-Systems
 local, distributed, p2p,…
 rom, tape, spindle, flash, ram,…
 Key-Value Stores
 Relational
 Object
 Geo-location
 Row-based
 Column-based
 Time-Series
 Graph-based
 ACID compliant or not
 Sharding Support
 Replication Support
 HA Support
 Blockchain
 LayerFS
 …
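As a minimal sketch of what sharding and replication support involve, the routing below maps a key to a primary shard and a set of replicas. The function names and the modulo scheme are assumptions made for illustration; real stores differ in how they place data.

```python
import hashlib
from typing import List


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a stable hash (Python's hash() is salted per process)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def replicas_for(key: str, num_shards: int, replication: int) -> List[int]:
    """Primary shard followed by the next `replication - 1` shards, wrapping around."""
    primary = shard_for(key, num_shards)
    return [(primary + i) % num_shards for i in range(replication)]
```

Plain modulo placement moves most keys when the number of shards changes; consistent hashing is the usual way to limit that movement.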
Recent Innovations in “Small-data”
 XML (1996)
 YAML (2001)
 JSON
 BSON
 Google Protocol Buffers (initial release 2008)
 Cap’n Proto
 Thrift
 Avro
 FAST
 FIX/BFIX
 Flat Buffers
 Simple Binary Encoding (2014)
 Dynamically Adaptive Encoding (Future)
http://www.quora.com/What-are-the-pros-and-cons-of-different-serialization-formats-for-Hadoop
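Part of what makes binary formats such as Protocol Buffers compact is variable-length integer encoding. The sketch below implements the base-128 varint scheme for non-negative integers in plain Python; it is an illustration of the idea, not the library's own code.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low-order group first."""
    if n < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode a single varint from the start of `data`."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")
```

Small values take one byte and 300 takes two, whereas a fixed-width 64-bit field always takes eight.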
Processing Data
Machine Learning
ML / AI – Boils down to…
 Given an input (X) and/or state (S), produce an output (Y)
 X may include
 Index or Time element (e.g. time series)
 S may include:
 a feedback-loop (e.g. reinforcement learning)
 a previously trained dataset (e.g. supervised learning)
 Y divides into two types:
 predictions (e.g. weather, trading, ...)
 categorizations
 known categories (e.g. object/speech recognition, …)
 unknown categories (e.g. insight generation, …)
Dimensionality Reduction
 Principal Component Analysis
 First component
 Subsequent components
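In the standard formulation (assumed here, since the slide only names the components), the first principal component is the unit vector $w_1 = \arg\max_{\lVert w \rVert = 1} \lVert X w \rVert^2$ for mean-centred data $X$, and each subsequent component maximizes the same quantity while staying orthogonal to the earlier ones. A minimal NumPy sketch via the singular value decomposition follows; the pca function and the synthetic data are illustrative.

```python
import numpy as np


def pca(X: np.ndarray, n_components: int):
    """Return (components, explained variance, projected data) for the rows of X."""
    Xc = X - X.mean(axis=0)                       # centre each column
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                # rows are unit-length directions
    variance = s[:n_components] ** 2 / (X.shape[0] - 1)
    return components, variance, Xc @ components.T


# Synthetic data lying close to the line y = 2x.
rng = np.random.default_rng(0)
t = rng.normal(size=500)
X = np.column_stack([t, 2.0 * t + 0.1 * rng.normal(size=500)])

components, variance, projected = pca(X, n_components=2)
print(components[0], variance)
```

The first row of components should point (up to sign) along the direction (1, 2), and keeping only the first column of projected reduces the data to one dimension.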
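Clustering
 k-Means
 Partition n observations x into k sets S = {S_1, …, S_k} so as to minimize the within-cluster sum of squares
$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
where $\mu_i$ is the mean of the points in $S_i$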
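The k-means partitioning described above is commonly computed with Lloyd's algorithm, which alternates between assigning points to the nearest mean and recomputing the means. The sketch below assumes that algorithm; the talk does not say which variant it had in mind, and the initialization and test data are illustrative.

```python
import numpy as np


def k_means(X: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    """Lloyd's algorithm: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise the means with k distinct observations picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each mean moves to the centre of its assigned points.
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Two well-separated synthetic blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
centroids, labels = k_means(X, k=2)
```

Lloyd's algorithm only finds a local minimum of the objective, so results can depend on the initial means.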
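General AI
 Genetic Algorithms
 For n mutations, select the m_i that minimizes the difference between its output y_i and a given reference (r):
$$m^{*} = \underset{m_i,\ i = 1, \dots, n}{\arg\min}\ \lvert y_i - r \rvert$$
where y_i is the output produced by mutation m_i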
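A minimal sketch of the selection rule on this slide: generate n mutations of the current candidate and keep whichever one brings the output closest to the reference. It uses mutation and selection only (no crossover or population), so it is a simplification of a full genetic algorithm; the evolve function, its parameters and the target function are hypothetical.

```python
import random


def evolve(f, reference, start, n_mutations=20, generations=200, sigma=0.5, seed=0):
    """Repeatedly mutate the best candidate and keep the closest match to `reference`."""
    rng = random.Random(seed)
    best = start
    for _ in range(generations):
        # The current best stays in the pool, so the error never increases.
        candidates = [best] + [best + rng.gauss(0.0, sigma)
                               for _ in range(n_mutations)]
        best = min(candidates, key=lambda m: abs(f(m) - reference))
    return best


# Illustrative target: find m such that m * m is close to 2.
best = evolve(lambda m: m * m, reference=2.0, start=0.0)
print(best, best * best)
```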
Artificial Neural Networks
 Deep Convolutional Neural Network (d-CNN)
Optimization involving Stochastic Gradient Descent + Back-propagation
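Stochastic gradient descent with back-propagation can be sketched on a network far smaller than a deep CNN. The example below trains a single fully connected hidden layer (not a convolutional network) on toy data; the layer size, learning rate and data are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = sin(x) on [-pi, pi].
X = rng.uniform(-np.pi, np.pi, size=(256, 1))
Y = np.sin(X)

# One hidden layer with tanh activation and a linear output.
hidden = 16
W1 = rng.normal(scale=0.5, size=(1, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.5, size=(hidden, 1))
b2 = np.zeros(1)


def forward(x):
    h = np.tanh(x @ W1 + b1)
    return h, h @ W2 + b2


def mse(x, y):
    return float(np.mean((forward(x)[1] - y) ** 2))


initial_loss = mse(X, Y)
lr, batch = 0.05, 32
for epoch in range(200):
    order = rng.permutation(len(X))            # new mini-batch order each epoch
    for start in range(0, len(X), batch):
        idx = order[start:start + batch]
        x, y = X[idx], Y[idx]
        h, pred = forward(x)
        # Back-propagation: apply the chain rule from the loss to each parameter.
        grad_pred = 2.0 * (pred - y) / len(x)  # d(mean squared error)/d(pred)
        grad_W2 = h.T @ grad_pred
        grad_b2 = grad_pred.sum(axis=0)
        grad_h = grad_pred @ W2.T
        grad_z = grad_h * (1.0 - h ** 2)       # derivative of tanh
        grad_W1 = x.T @ grad_z
        grad_b1 = grad_z.sum(axis=0)
        # Stochastic gradient descent step on this mini-batch.
        W1 -= lr * grad_W1
        b1 -= lr * grad_b1
        W2 -= lr * grad_W2
        b2 -= lr * grad_b2

final_loss = mse(X, Y)
print(initial_loss, final_loss)
```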
All About Optimization
 All these schemes involve solving for some constants that
Minimize or Maximize some Cost function
 Require fundamental Optimization algorithms such as:
 Direct Methods
 Combinatorial Algorithms
 Greedy Algorithm
 Minimax Algorithm with alpha-beta pruning
 …
 Iterative Methods
 Gradient Methods
 Karmarkar’s Algorithm
 …
At the Core of Optimization…
…there is the solution of a system of linear equations of the form
$$A x = b$$
with x subject to some constraints.
 These call for algorithms that fall into two categories:
 Direct Methods
 Gaussian elimination, LU, QR, Cholesky, LDL, …
 Iterative Methods
 MINRES, CG, BiCGSTAB, GMRES, ORTHOMIN, …
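The two categories can be compared on a small symmetric positive-definite system: a Cholesky factorization as the direct method and conjugate gradient as the iterative one. The matrix below is synthetic, the constraints on x are ignored, and the triangular solves use NumPy's general solver for brevity, so this is a sketch rather than a tuned implementation.

```python
import numpy as np


def solve_cholesky(A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Direct method: factor A = L L^T, then solve two triangular systems."""
    L = np.linalg.cholesky(A)
    y = np.linalg.solve(L, b)        # forward substitution
    return np.linalg.solve(L.T, y)   # back substitution


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Iterative method for symmetric positive-definite A."""
    x = np.zeros_like(b)
    r = b - A @ x                    # residual
    p = r.copy()                     # search direction
    rs_old = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs_old) < tol:
            break
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x


# Synthetic, well-conditioned symmetric positive-definite system.
rng = np.random.default_rng(0)
M = rng.normal(size=(50, 50))
A = M @ M.T + 50.0 * np.eye(50)
b = rng.normal(size=50)

x_direct = solve_cholesky(A, b)
x_iterative = conjugate_gradient(A, b)
```

The iterative method only needs matrix-vector products, which is why it suits large sparse systems and parallel hardware.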
Accelerating Machine Learning
CPU, GPGPU, FPGA
Sequential Processing <-> Parallel Processing
High Flexibility, High Abstractions, Many Libraries, …
Direct Methods
Ultra-Low-Latency, High Bandwidth, Fine-grain optimization, ...
Iterative Methods, Neural Networks, Markov Chains, Monte Carlo
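Monte Carlo methods parallelize well because every sample is independent. The sketch below expresses a Monte Carlo estimate of pi as whole-array operations in NumPy on the CPU; the assumption is that this data-parallel formulation is the kind of workload that maps onto parallel hardware, and no GPU or FPGA code is shown.

```python
import numpy as np


def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Fraction of uniform points in the unit square that fall inside the quarter circle."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_samples, 2))           # all samples drawn at once
    inside = (points ** 2).sum(axis=1) <= 1.0     # one independent test per sample
    return 4.0 * float(inside.mean())


print(estimate_pi(1_000_000))
```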
Networked Computing Systems
 Mainframe Computing
 Cluster Computing
 Distributed Computing
 Grid Computing
 Orbital Computing
 Interstellar Computing
 Galactic Computing
 Inter-Universe Computing
 Cloud Computing
Modern Big Data Systems – Basic Components
 Dynamic (abstraction) + Statically-Typed (speed) Languages
 Need to rethink and re-engineer main systems:
 Data & Code Stores
 Logging
 Code Revision and Deployment
 Compute Nodes and Brokers Management
 Graceful Failure and Recovery
 Credentials and Access Controls
 Task Schedulers
 Messaging Bus
 Web/Mobile Interfaces
 Regression Testing
…
 Containerize and Standardize Services
Examples – Modern Big Data Systems
 Finance
 Athena/Hydra @ JP Morgan
 Quartz/Sandra @ Bank of America
 Slang/SecDB @ Goldman Sachs
 Optimus/DAL @ Morgan Stanley
 WSQ Tech @ n-prop shops &
 datapark.io @ quants / prop-shops
 Machine Learning
 Alpha/DL @ Muse.Ai
Thank you
http://anton.io
@roldao
