Modern Big Data Systems for Machine Learning

Antonio Roldao, Ph.D. CQF.
10/July/2015, Thomson Reuters, London, UK

About Me
http://anton.io
@roldao

This Talk on Big Data Systems
 Data
 Big Data as a Buzzword and the useless 4V’s
 Basic Aspects of Data
 Advanced Aspects of Data
 Small Data Innovations
 Algorithms for Machine Learning
 ML Overview
 Optimization Problems
 Solving Systems of Linear Equations
 Accelerating ML Using Different Technologies
 Distributed Computing
 Computing at Scale
 Platform Examples

Big Data
 4Vs of BD?!
 Volume
 Variety
 Velocity
 Veracity
 Too simplistic and technically useless!
“Any amount of data that is too big for Excel to process.”
[Image: 1956 hard drive with 5 MB]
Mostly a marketing buzzword that means different things to different people.

Understanding Data – Basic
 Storage formats
 Uncompressed <-> Compressed
 Unencrypted <-> Encrypted
 Human-readable <-> Binary
 Rigid <-> Templated <-> Self-describing
 Mainly regular <-> Irregular
 Different types and encodings
…
 Generation (write) modes
 parallel <-> sequential
 append-only
 in-place updates
 random inserts
…
 Consumption (read) modes
 parallel <-> sequential
 random <-> well defined access
…

Understanding Data – Advanced
 Represents:
 How concepts are connected (graph)
 How connections evolve with time (time series)
 Bitemporal (e.g. value depends on time frame)
 Time value of data (e.g. Useful today, but not tomorrow)
 Sensitivity (e.g. Medical, Economic, Political, Privacy…)
 Interdependency (e.g. one wrong bit destroys everything)
 Cleanliness (e.g. how Noisy it is)
 Truthfulness (e.g. how Accurate it is)
 Redundancy (e.g. how safe does it need to be)
 Density (e.g. how Redundant it is)
 Accessibility (e.g. Local <-> Global)
 Cost / Budget
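As an illustration of the bitemporal aspect listed above, here is a minimal sketch under one common reading of the term: each fact carries both the time it takes effect and the time it was recorded, so the answer to a query depends on both. The `Fact` class, the `as_of` helper, and the day numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    valid_from: int   # day the value takes effect in the real world
    recorded_at: int  # day the system learned about it
    value: float


# Hypothetical history of one quantity; day numbers and values are made up.
history = [
    Fact(valid_from=1, recorded_at=1, value=10.0),
    Fact(valid_from=5, recorded_at=5, value=12.0),
    Fact(valid_from=5, recorded_at=8, value=11.5),  # later correction of day 5
]


def as_of(facts, valid_day, known_day):
    """Value in effect on valid_day, using only what was recorded by known_day."""
    visible = [f for f in facts
               if f.valid_from <= valid_day and f.recorded_at <= known_day]
    if not visible:
        return None
    # Prefer the latest effective date, then the latest recording (corrections win).
    return max(visible, key=lambda f: (f.valid_from, f.recorded_at)).value


print(as_of(history, valid_day=6, known_day=6))  # before the correction arrived
print(as_of(history, valid_day=6, known_day=9))  # after the correction arrived
```

The same question ("what was the value on day 6?") yields different answers depending on when it is asked. This is why append-only storage suits bitemporal data: corrections are added, never overwritten.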

Myriad of Data-stores/bases
 File-Systems
 local, distributed, p2p,…
 rom, tape, spindle, flash, ram,…
 Key-Value Stores
 Relational
 Object
 Geo-location
 Row-based
 Column-based
 Time-Series
 Graph-based
 ACID compliant or not
 Sharding Support
 Replication Support
 HA Support
 Blockchain
 LayerFS
 …

Recent Innovations in “Small-data”
 XML (1996)
 YAML (2001)
 JSON
 BSON
 Google Protocol Buffers (initial release 2008)
 Cap’n Proto
 Thrift
 Avro
 FAST
 FIX/BFIX
 FlatBuffers
 Simple Binary Encoding (2014)
 Dynamically Adaptive Encoding (Future)
http://www.quora.com/What-are-the-pros-and-cons-of-different-serialization-formats-for-Hadoop

Processing Data

Machine Learning

ML / AI – Boils down to…
 Given an input (X) and/or state (S) produce an output (Y)
 X may include
 Index or Time element (e.g. time series)
 S may include:
 a feedback-loop (e.g. reinforcement learning)
 a previously trained dataset (e.g. supervised learning)
 Y divides into two types:
 predictions (e.g. weather, trading, ...)
 categorizations
 known categories (e.g. object/speech recognition, …)
 unknown categories (e.g. insight generation, …)

Dimensionality Reduction
 Principal Component Analysis
 First component
 Subsequent components
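A minimal sketch of how the first and subsequent components can be computed, assuming the standard formulation: the first component is the unit direction of maximum variance, and each subsequent one is found the same way after removing the variance already explained. Power iteration with deflation is used here only because it is short and dependency-free; it is not necessarily the method the slide had in mind. All function names are illustrative.

```python
import math
import random


def center(rows):
    """Subtract the column means so each feature has zero mean."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_eigenvector(matrix, iterations=500, seed=0):
    """Power iteration: repeatedly apply the matrix and renormalize."""
    rng = random.Random(seed)
    d = len(matrix)
    w = [rng.uniform(-1, 1) for _ in range(d)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    return w


def pca(rows, k):
    """Return the first k principal directions of rows (a list of lists)."""
    cov = covariance(center(rows))
    d = len(cov)
    components = []
    for _ in range(k):
        w = leading_eigenvector(cov)
        lam = sum(w[i] * cov[i][j] * w[j] for i in range(d) for j in range(d))
        components.append(w)
        # Deflation: remove the variance already explained by w.
        cov = [[cov[i][j] - lam * w[i] * w[j] for j in range(d)] for i in range(d)]
    return components


# Made-up points lying close to the line y = 2x.
points = [[x, 2 * x + 0.1 * (-1) ** x] for x in range(10)]
first, second = pca(points, 2)
print(first, second)
```

For these points the first direction is roughly parallel to (1, 2) and the second is orthogonal to it. The sign of each direction is arbitrary.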

Clustering
 k-Means
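 Partition n observations x into k sets S = {S1, …, Sk} so as to minimize the within-cluster sum of squares,
$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
where μi represents the mean of the points in Si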
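A minimal sketch of k-means using Lloyd's algorithm, the usual heuristic for this problem: alternate between assigning each point to its nearest mean and recomputing each mean. The sample points, the seed, and the helper names are made up for illustration.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial means drawn from the data
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each mean moves to the centroid of its cluster.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # assignments can no longer change
        centroids = updated
    return centroids, clusters


# Two made-up, well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
```

Lloyd's algorithm only finds a local optimum, so in practice it is typically restarted from several random initializations.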

 General AI
 Genetic Algorithms
 For n mutations, select the mutation mi whose output yi differs least from a given reference (r)
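A minimal sketch of the mutate-and-select loop described above. It is deliberately reduced to mutation plus selection, with no crossover, so it is closer to a simple evolution strategy than to a full genetic algorithm. The `evolve` function, its parameters, and the sum-of-squares model are assumptions chosen for illustration.

```python
import random


def evolve(evaluate, reference, parent, n_mutations=20, generations=200,
           step=0.1, seed=0):
    """Each generation, keep the candidate whose output is closest to reference."""
    rng = random.Random(seed)
    best = list(parent)
    best_error = abs(evaluate(best) - reference)
    for _ in range(generations):
        # Produce n mutated copies m_i of the current best candidate.
        mutants = [[gene + rng.gauss(0.0, step) for gene in best]
                   for _ in range(n_mutations)]
        # Select the m_i whose output y_i is closest to the reference r.
        challenger = min(mutants, key=lambda m: abs(evaluate(m) - reference))
        challenger_error = abs(evaluate(challenger) - reference)
        if challenger_error < best_error:  # elitism: never accept a worse candidate
            best, best_error = challenger, challenger_error
    return best, best_error


# Hypothetical model: the output is the sum of squares of the genes.
def sum_of_squares(genes):
    return sum(g * g for g in genes)


solution, error = evolve(sum_of_squares, reference=2.0, parent=[0.0, 0.0])
print(solution, error)
```

No gradient of the model is needed, only the ability to evaluate it. That makes the scheme applicable to black-box cost functions.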

Artificial Neural Networks
 Deep Convolutional Neural Network (d-CNN)
Optimization involving Stochastic Gradient Descent + Back-propagation
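Stochastic gradient descent with back-propagation can be sketched on a far smaller network than a d-CNN. The example below assumes a tiny fully connected network (2 inputs, 3 tanh hidden units, 1 linear output) learning XOR. The architecture, learning rate, and initialization are illustrative choices, not taken from the talk.

```python
import math
import random


def train_xor(epochs=2000, hidden=3, lr=0.1, seed=0):
    """Train a 2-3-1 network on XOR; return (initial loss, final loss)."""
    rng = random.Random(seed)
    data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
            ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
    # Random hidden weights break symmetry; the output layer starts at zero.
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [0.0] * hidden
    b2 = 0.0

    def forward(x):
        h = [math.tanh(w1[j][0] * x[0] + w1[j][1] * x[1] + b1[j])
             for j in range(hidden)]
        return h, sum(w2[j] * h[j] for j in range(hidden)) + b2

    def mean_loss():
        return sum((forward(x)[1] - t) ** 2 for x, t in data) / len(data)

    initial = mean_loss()
    for _ in range(epochs):
        rng.shuffle(data)  # "stochastic": samples are visited in random order
        for x, t in data:
            h, y = forward(x)
            # Backward pass: chain rule from the loss (y - t)^2 to each weight.
            dy = 2.0 * (y - t)
            for j in range(hidden):
                dh = dy * w2[j] * (1.0 - h[j] ** 2)  # gradient through tanh
                w2[j] -= lr * dy * h[j]
                w1[j][0] -= lr * dh * x[0]
                w1[j][1] -= lr * dh * x[1]
                b1[j] -= lr * dh
            b2 -= lr * dy
    return initial, mean_loss()


initial_loss, final_loss = train_xor()
print(initial_loss, final_loss)
```

Each update uses a single sample, and the weights move a small step against the gradient of that sample's loss. A convolutional network applies the same recipe with shared weights and many more layers.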

All About Optimization
 All these schemes involve solving for some constants that minimize or maximize some cost function
 Require fundamental Optimization algorithms such as:
 Direct Methods
 Combinatorial Algorithms
 Greedy Algorithm
 Minimax Algorithm with alpha-beta pruning
 …
 Iterative Methods
 Gradient Methods
 Karmarkar’s Algorithm
 …

At the Core of Optimization…
…there is the solution of a system of linear equations of the form Ax = b, with x subject to some constraints.
 These need algorithms that can be subdivided into two categories:
 Direct Methods
 Gaussian, LU, QR, Cholesky, LDL, …
 Iterative Methods
 MINRES, CG, BiCGSTAB, GMRES, ORTHOMIN, …
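The two categories can be contrasted on the same small system. The sketch below assumes a symmetric positive-definite matrix, which both Cholesky (direct) and conjugate gradient (iterative) require. The matrix and right-hand side are made up so that the exact solution is (1, 2, 3).

```python
import math


def cholesky_solve(A, b):
    """Direct method: factor A = L L^T, then forward and back substitution."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    y = [0.0] * n
    for i in range(n):  # solve L y = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # solve L^T x = y
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def conjugate_gradient(A, b, tolerance=1e-12, max_iterations=1000):
    """Iterative method: refine a guess until the residual is small."""
    n = len(b)
    x = [0.0] * n
    r = list(b)  # residual b - A x for the initial guess x = 0
    p = list(r)  # search direction
    rs = sum(v * v for v in r)
    for _ in range(max_iterations):
        if math.sqrt(rs) < tolerance:
            break
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        step = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + step * p[i] for i in range(n)]
        r = [r[i] - step * Ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x


# Made-up symmetric positive-definite system with exact solution x = (1, 2, 3).
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [6.0, 10.0, 8.0]
print(cholesky_solve(A, b))
print(conjugate_gradient(A, b))
```

The direct method does a fixed amount of work and needs the factor L in memory. The iterative method only needs matrix-vector products and can stop early at a chosen accuracy, which is why it tends to suit large sparse systems and parallel hardware.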

Accelerating Machine Learning
CPU <-> GPGPU <-> FPGA
Sequential Processing <-> Parallel Processing
High Flexibility
High Abstractions
Many Libraries
…
Direct Methods
Ultra-Low-Latency
High Bandwidth
Fine grain optimization
...
Iterative Methods
Neural Networks
Markov Chains
Monte Carlo

Networked Computing Systems
 Mainframe Computing
 Cluster Computing
 Distributed Computing
 Grid Computing
 Orbital Computing
 Interstellar Computing
 Galactic Computing
 Inter-Universe Computing
 Cloud Computing

Modern Big Data Systems – Basic Components
 Dynamic (abstraction) + Statically-Typed (speed) Languages
 Need to rethink and re-engineer main systems:
 Data & Code Stores
 Logging
 Code Revision and Deployment
 Compute Nodes and Brokers Management
 Graceful Failure and Recovery
 Credentials and Access Controls
 Task Schedulers
 Messaging Bus
 Web/Mobile Interfaces
 Regression Testing
…
 Containerize and Standardize Services

Examples – Modern Big Data Systems
 Finance
 Athena/Hydra @ JP Morgan
 Quartz/Sandra @ Bank of America
 Slang/SecDB @ Goldman Sachs
 Optimus/DAL @ Morgan Stanley
 WSQ Tech @ n-prop shops &
 datapark.io @ quants / prop-shops
 Machine Learning
 Alpha/DL @ Muse.Ai

Thank you
http://anton.io
@roldao
