Modern Big Data Systems for Machine Learning
Antonio Roldao, Ph.D. CQF.
10/July/2015, Thomson Reuters, London, UK
About Me
http://anton.io
@roldao
This Talk on Big Data Systems
• Data
  • Big Data as a Buzzword and the useless 4V’s
  • Basic Aspects of Data
  • Advanced Aspects of Data
  • Small Data Innovations
• Algorithms for Machine Learning
  • ML Overview
  • Optimization Problems
  • Solving Systems of Linear Equations
  • Accelerating ML Using Different Technologies
• Distributed Computing
  • Computing at Scale
  • Platform Examples
Big Data
• 4Vs of BD?!
  • Volume
  • Variety
  • Velocity
  • Veracity
• Too simplistic and technically useless!
“Any amount of data that is too big for Excel to process.”
1956 hard drive with 5 MB
Mostly a marketing buzzword that means different things to different people.
Understanding Data – Basic
• Storage formats
  • Uncompressed <-> Compressed
  • Unencrypted <-> Encrypted
  • Human-readable <-> Binary
  • Rigid <-> Templated <-> Self-describing
  • Mainly regular <-> Irregular
  • Different types and encodings
  • …
• Generation (write) modes
  • parallel <-> sequential
  • append-only
  • in-place updates
  • random inserts
  • …
• Consumption (read) modes
  • parallel <-> sequential
  • random <-> well-defined access
  • …
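The "Uncompressed <-> Compressed" and "Mainly regular <-> Irregular" axes interact: regular data tends to compress far better than irregular data. A minimal sketch using Python's standard `zlib` module; the sample records and the helper name `compressed_ratio` are assumptions made up for illustration.

```python
import os
import zlib

# Hypothetical sample data: a highly regular stream and an irregular one.
regular = b"2015-07-10,LON,100.00\n" * 1000
irregular = os.urandom(len(regular))  # random bytes, essentially incompressible

def compressed_ratio(raw: bytes) -> float:
    """Return compressed size divided by raw size (lower means better compression)."""
    return len(zlib.compress(raw, 9)) / len(raw)

regular_ratio = compressed_ratio(regular)
irregular_ratio = compressed_ratio(irregular)

# Compression is lossless: decompressing returns the original bytes.
roundtrip_ok = zlib.decompress(zlib.compress(regular)) == regular

print(f"regular:   {regular_ratio:.3f}")
print(f"irregular: {irregular_ratio:.3f}")
```

The exact ratios depend on the data and the compression level, so none are quoted here.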
Understanding Data – Advanced
• Represents:
  • How concepts are connected (graph)
  • How connections evolve with time (time series)
• Bitemporal (e.g. value depends on time frame)
• Time value of data (e.g. useful today, but not tomorrow)
• Sensitivity (e.g. medical, economic, political, privacy…)
• Interdependency (e.g. one wrong bit destroys everything)
• Cleanliness (e.g. how noisy it is)
• Truthfulness (e.g. how accurate it is)
• Redundancy (e.g. how safe it needs to be)
• Density (e.g. how redundant it is)
• Accessibility (e.g. Local <-> Global)
• Cost / Budget
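One common reading of "bitemporal" is that every value carries two timestamps: when it applies in the real world, and when the system recorded it. The sketch below assumes that reading; the `Fact` class, the `as_of` function and the price history are hypothetical names and data, not taken from the talk.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Fact:
    value: float
    valid_from: int   # when the value applies in the real world
    recorded_at: int  # when the system learned about it

def as_of(facts: list[Fact], valid_time: int, known_time: int) -> Optional[float]:
    """Value that applied at valid_time, using only what was recorded by known_time."""
    visible = [f for f in facts
               if f.valid_from <= valid_time and f.recorded_at <= known_time]
    if not visible:
        return None
    # Latest applicable validity wins; ties are broken by the latest recording.
    best = max(visible, key=lambda f: (f.valid_from, f.recorded_at))
    return best.value

# Hypothetical price history: the day-1 price is corrected on day 3.
history = [
    Fact(value=100.0, valid_from=1, recorded_at=1),
    Fact(value=101.0, valid_from=1, recorded_at=3),  # late correction for day 1
    Fact(value=105.0, valid_from=2, recorded_at=2),
]

before_correction = as_of(history, valid_time=1, known_time=1)
after_correction = as_of(history, valid_time=1, known_time=3)
```

Asking the same question about day 1 gives a different answer depending on when it is asked, which is the sense in which the value depends on the time frame.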
Myriad of Data-stores/bases
• File-Systems
  • local, distributed, p2p, …
  • ROM, tape, spindle, flash, RAM, …
• Key-Value Stores
• Relational
• Object
• Geo-location
• Row-based
• Column-based
• Time-Series
• Graph-based
• ACID compliant or not
• Sharding Support
• Replication Support
• HA Support
• Blockchain
• LayerFS
• …
Recent Innovations in “Small-data”
• XML (1996)
• YAML (2001)
• JSON
• BSON
• Google Protocol Buffers (initial release 2008)
• Cap’n Proto
• Thrift
• Avro
• FAST
• FIX/BFIX
• FlatBuffers
• Simple Binary Encoding (2014)
• Dynamically Adaptive Encoding (Future)
http://www.quora.com/What-are-the-pros-and-cons-of-different-serialization-formats-for-Hadoop
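The formats listed above differ mainly in whether the schema travels with the data. As a rough illustration using only the Python standard library (none of the listed libraries), the sketch below encodes one record as self-describing JSON and as a rigid fixed-layout binary message. The `trade` record, the `LAYOUT` format string and `decode_binary` are assumptions for the example.

```python
import json
import struct

# Hypothetical trade record used only for illustration.
trade = {"id": 42, "price": 101.25, "qty": 300}

# Self-describing: field names travel with every message.
as_json = json.dumps(trade).encode("utf-8")

# Rigid: both sides must agree on this layout out of band.
# "<" little-endian without padding, "I" uint32 id, "d" float64 price, "I" uint32 qty.
LAYOUT = struct.Struct("<IdI")
as_binary = LAYOUT.pack(trade["id"], trade["price"], trade["qty"])

def decode_binary(payload: bytes) -> dict:
    """Rebuild the record; field names come from the shared schema, not the payload."""
    trade_id, price, qty = LAYOUT.unpack(payload)
    return {"id": trade_id, "price": price, "qty": qty}

print(len(as_json), "bytes as JSON;", len(as_binary), "bytes as fixed-layout binary")
```

The binary form is smaller and cheaper to parse, but it cannot be read without the layout, which is the trade-off between the rigid and self-describing ends of the spectrum.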
Processing Data
Machine Learning
ML / AI – Boils down to…
• Given an input (X) and/or state (S), produce an output (Y)
• X may include:
  • Index or time element (e.g. time series)
• S may include:
  • a feedback loop (e.g. reinforcement learning)
  • a model previously trained on a dataset (e.g. supervised learning)
• Y divides into two types:
  • predictions (e.g. weather, trading, ...)
  • categorizations
    • known categories (e.g. object/speech recognition, …)
    • unknown categories (e.g. insight generation, …)
Dimensionality Reduction
• Principal Component Analysis
  • First component
  • Subsequent components
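The slide's formulas did not survive extraction. In the standard formulation, the first principal component is the unit direction w that maximizes the variance of the projected data, and each subsequent component does the same after the variance along earlier components has been removed. The sketch below assumes that formulation and uses power iteration with deflation in pure Python; the function names and the sample points are illustrative, and a numerical library would normally be used instead.

```python
import math

def mean_center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(rows):
    """Sample covariance matrix of mean-centred rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenpair(matrix, iterations=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    d = len(matrix)
    v = [1.0 / (i + 1) for i in range(d)]  # arbitrary non-symmetric start vector
    start_norm = math.sqrt(sum(x * x for x in v))
    v = [x / start_norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigenvalue, v

def pca(rows, n_components):
    """Return [(variance, direction), ...], strongest component first."""
    cov = covariance(mean_center(rows))
    d = len(cov)
    components = []
    for _ in range(n_components):
        eigenvalue, v = top_eigenpair(cov)
        components.append((eigenvalue, v))
        # Deflation: remove the variance already explained by v.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)] for i in range(d)]
    return components

# Hypothetical 2-D points scattered around the line y = x.
points = [[1.0, 1.2], [2.0, 1.9], [3.0, 3.1], [4.0, 3.8], [5.0, 5.1]]
components = pca(points, 2)
```

For these points the first direction lies close to the diagonal, and the two variances add up to the total variance of the data.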
Clustering
• k-Means
  • Partition the observations x into k sets S = {S1, …, Sk} so as to minimize the within-cluster sum of squared distances, where μi represents the mean of the points in Si
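A common way to approach the k-means objective is Lloyd's algorithm, which alternates between assigning points to the nearest mean and recomputing the means. This is a minimal sketch of that algorithm, not necessarily the method used in the talk; the initial centroids are passed in explicitly so the example is deterministic, and the sample points are made up.

```python
def kmeans(points, centroids, iterations=100):
    """Lloyd's algorithm: returns (centroids, assignments)."""
    centroids = [list(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins the set S_i of its nearest centroid.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))
            for point in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments

# Hypothetical data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
```

Lloyd's algorithm only finds a local optimum, so in practice the result depends on the starting centroids.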
General AI
• Genetic Algorithms
  • For n mutations, select the mutation mi that minimizes the difference between its output yi and a given reference (r)
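The selection rule above can be sketched as a mutation-plus-selection loop. This is a deliberately reduced form of a genetic algorithm (no crossover, a population of one survivor per generation); the function `evolve`, its parameters and the cubic target are assumptions chosen for illustration.

```python
import random

def evolve(f, reference, start, n_mutations=20, generations=200, step=0.5, seed=0):
    """Keep the candidate whose output f(m) is closest to the reference."""
    rng = random.Random(seed)
    best = start
    best_error = abs(f(best) - reference)
    for _ in range(generations):
        # Generate n mutations m_i of the current best candidate.
        mutations = [best + rng.gauss(0.0, step) for _ in range(n_mutations)]
        # Select the m_i whose output y_i = f(m_i) is closest to the reference r.
        candidate = min(mutations, key=lambda m: abs(f(m) - reference))
        candidate_error = abs(f(candidate) - reference)
        if candidate_error < best_error:  # elitism: never accept a worse candidate
            best, best_error = candidate, candidate_error
    return best, best_error

# Hypothetical target: find m such that m**3 is close to 27 (so m is near 3).
best, error = evolve(lambda m: m ** 3, reference=27.0, start=0.0)
print(best, error)
```

Because a candidate is replaced only when it improves, the error never increases from one generation to the next.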
Artificial Neural Networks
• Deep Convolutional Neural Network (d-CNN)
Optimization involving Stochastic Gradient Descent + Back-propagation
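To show how back-propagation and stochastic gradient descent fit together, the sketch below trains a network with a single hidden tanh unit. It is far smaller than a deep convolutional network and is only meant to expose the mechanics; the parameter layout `(w1, b1, w2, b2)`, the training data and the hyperparameters are assumptions.

```python
import math
import random

def forward(params, x):
    """One hidden tanh unit: returns (prediction, hidden activation)."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)
    return w2 * h + b2, h

def loss(params, x, y):
    """Squared error for a single sample."""
    prediction, _ = forward(params, x)
    return 0.5 * (prediction - y) ** 2

def backprop(params, x, y):
    """Gradient of the loss w.r.t. (w1, b1, w2, b2) via the chain rule."""
    w1, b1, w2, b2 = params
    prediction, h = forward(params, x)
    d_pred = prediction - y                # dL/d(prediction)
    d_w2, d_b2 = d_pred * h, d_pred        # output layer
    d_pre = d_pred * w2 * (1.0 - h * h)    # back through tanh: tanh' = 1 - tanh^2
    d_w1, d_b1 = d_pre * x, d_pre          # hidden layer
    return [d_w1, d_b1, d_w2, d_b2]

def sgd(params, data, learning_rate=0.05, epochs=200, seed=0):
    """Stochastic gradient descent: one parameter update per shuffled sample."""
    rng = random.Random(seed)
    params = list(params)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            grads = backprop(params, x, y)
            params = [p - learning_rate * g for p, g in zip(params, grads)]
    return params

def mean_loss(params, data):
    return sum(loss(params, x, y) for x, y in data) / len(data)

# Hypothetical training set sampled from y = tanh(2x).
data = [(x / 10.0, math.tanh(2.0 * x / 10.0)) for x in range(-10, 11)]
initial = [0.5, 0.0, 0.5, 0.0]
trained = sgd(initial, data)
```

A useful sanity check for any hand-written `backprop` is to compare its output with finite-difference estimates of the same gradient.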
All About Optimization
• All these schemes involve solving for some constants that minimize or maximize some cost function
• They require fundamental optimization algorithms such as:
  • Direct Methods
    • Combinatorial Algorithms
    • Greedy Algorithm
    • Minimax Algorithm with alpha-beta pruning
    • …
  • Iterative Methods
    • Gradient Methods
    • Karmarkar’s Algorithm
    • …
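As one concrete instance of the gradient methods listed above, the sketch below runs plain gradient descent on a small quadratic cost. The cost function, the step size and the stopping rule are assumptions picked so that the minimum is known in advance.

```python
def gradient_descent(gradient, start, learning_rate=0.1, tolerance=1e-10, max_steps=10_000):
    """Iterate x <- x - learning_rate * gradient(x) until the update is tiny."""
    x = list(start)
    for step in range(max_steps):
        g = gradient(x)
        x_new = [xi - learning_rate * gi for xi, gi in zip(x, g)]
        if max(abs(a - b) for a, b in zip(x_new, x)) < tolerance:
            return x_new, step + 1
        x = x_new
    return x, max_steps

# Hypothetical cost: f(a, b) = (a - 3)^2 + 2 * (b + 1)^2, minimized at (3, -1).
def cost(x):
    return (x[0] - 3.0) ** 2 + 2.0 * (x[1] + 1.0) ** 2

def cost_gradient(x):
    return [2.0 * (x[0] - 3.0), 4.0 * (x[1] + 1.0)]

minimum, steps = gradient_descent(cost_gradient, start=[0.0, 0.0])
print(minimum, steps)
```

The "constants" being solved for are the entries of `x`; maximizing instead of minimizing only requires flipping the sign of the update.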
At the Core of Optimization…
…there is the solution of a system of linear equations of the form Ax = b, with x subject to some constraints.
• These need algorithms that can be subdivided into two categories:
  • Direct Methods
    • Gaussian elimination, LU, QR, Cholesky, LDL, …
  • Iterative Methods
    • MINRES, CG, BiCGSTAB, GMRES, ORTHOMIN, …
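The two categories can be compared on a small system. The sketch below solves the same Ax = b with one direct method (Gaussian elimination with partial pivoting) and one iterative method (conjugate gradient, which assumes A is symmetric positive-definite). The matrix and right-hand side are made up, constraints on x are ignored, and singular matrices are not handled.

```python
def gaussian_solve(a, b):
    """Direct method: Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def conjugate_gradient(a, b, tolerance=1e-20, max_steps=1000):
    """Iterative method: conjugate gradient for symmetric positive-definite A."""
    n = len(b)

    def matvec(v):
        return [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    x = [0.0] * n
    r = b[:]          # residual b - A x for the starting guess x = 0
    p = r[:]          # first search direction
    rs = dot(r, r)
    for _ in range(max_steps):
        if rs < tolerance:  # squared residual norm is small enough
            break
        ap = matvec(p)
        alpha = rs / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Hypothetical symmetric positive-definite system A x = b.
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x_direct = gaussian_solve(A, b)
x_iterative = conjugate_gradient(A, b)
```

Direct methods finish in a fixed number of operations, while iterative methods refine an estimate and can stop early, which matters for large sparse systems.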
Accelerating Machine Learning
CPU <-> GPGPU <-> FPGA
Sequential Processing <-> Parallel Processing
High Flexibility
High Abstractions
Many Libraries
…
Direct Methods
Ultra-Low-Latency
High Bandwidth
Fine-grained optimization
...
Iterative Methods
Neural Networks
Markov Chains
Monte Carlo
Networked Computing Systems
• Mainframe Computing
• Cluster Computing
• Distributed Computing
• Grid Computing
• Orbital Computing
• Interstellar Computing
• Galactic Computing
• Inter-Universe Computing
• Cloud Computing
Modern Big Data Systems – Basic Components
• Dynamic (abstraction) + Statically-Typed (speed) Languages
• Need to rethink and re-engineer main systems:
  • Data & Code Stores
  • Logging
  • Code Revision and Deployment
  • Compute Nodes and Brokers Management
  • Graceful Failure and Recovery
  • Credentials and Access Controls
  • Task Schedulers
  • Messaging Bus
  • Web/Mobile Interfaces
  • Regression Testing
  • …
• Containerize and Standardize Services
Examples – Modern Big Data Systems
• Finance
  • Athena/Hydra @ JP Morgan
  • Quartz/Sandra @ Bank of America
  • Slang/SecDB @ Goldman Sachs
  • Optimus/DAL @ Morgan Stanley
  • WSQ Tech @ n-prop shops &
  • datapark.io @ quants / prop-shops
• Machine Learning
  • Alpha/DL @ Muse.Ai
Thank you
http://anton.io
@roldao
