This document discusses modern big data systems for machine learning. It opens with an introduction to the speaker and the agenda, then covers aspects of data such as formats, generation, and consumption. It gives an overview of machine learning algorithms, including neural networks, clustering, and dimensionality reduction, and emphasizes that optimization problems are at the core of machine learning. It describes how technologies such as GPUs accelerate machine learning, discusses distributed and networked computing systems, and closes with examples of modern big data systems in finance and machine learning.
3. This Talk on Big Data Systems
Data
Big Data as a Buzzword and the useless 4V’s
Basic Aspects of Data
Advanced Aspects of Data
Small Data Innovations
Algorithms for Machine Learning
ML Overview
Optimization Problems
Solving Systems of Linear Equations
Accelerating ML Using Different Technologies
Distributed Computing
Computing at Scale
Platform Examples
Antonio Roldao, Ph.D. CQF. 3
4. Big Data
4Vs of BD?!
Volume
Variety
Velocity
Veracity
Too simplistic and technically useless!
“Any amount of data that is too big for Excel to process.”
1956 Hard-drive with 5 MB
Mostly a marketing Buzzword which means different things to different people.
5. Understanding Data – Basic
Storage formats
Uncompressed <-> Compressed
Unencrypted <-> Encrypted
Human-readable <-> Binary
Rigid <-> Templated <-> Self-describing
Mainly regular <-> Irregular
Different types and encodings
…
Generation (write) modes
parallel <-> sequential
append-only
in-place updates
random inserts
…
Consumption (read) modes
parallel <-> sequential
random <-> well defined access
…
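The storage-format trade-offs above (human-readable vs binary, self-describing vs rigid, uncompressed vs compressed) can be sketched with the Python standard library. The record layout and values are made up for illustration; `records`, `layout`, and the field names are assumptions, not part of the talk.

```python
import json
import struct
import zlib

# Hypothetical tick records: (timestamp, price, volume). Values are made up.
records = [(1_600_000_000 + i, 100.0 + 0.25 * i, 10 * i) for i in range(1000)]

# Human-readable and self-describing: field names are repeated in every record.
as_json = json.dumps(
    [{"ts": ts, "price": p, "volume": v} for ts, p, v in records]
).encode("utf-8")

# Binary and rigid: fixed 20-byte little-endian layout (int64, float64, int32),
# with no field names stored in the data.
layout = struct.Struct("<qdi")
as_binary = b"".join(layout.pack(ts, p, v) for ts, p, v in records)

# Compressed variants of both encodings.
json_z = zlib.compress(as_json)
binary_z = zlib.compress(as_binary)

# A rigid layout allows random access: record k sits at a known byte offset.
k = 500
ts_k, price_k, volume_k = layout.unpack_from(as_binary, k * layout.size)

for name, blob in [
    ("json", as_json),
    ("binary", as_binary),
    ("json+zlib", json_z),
    ("binary+zlib", binary_z),
]:
    print(f"{name:12s} {len(blob):7d} bytes")
```

The JSON form needs no external schema to read but pays for it in size; the fixed layout is compact and seekable but useless without knowing the format string.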
6. Understanding Data – Advanced
Represents:
How concepts are connected (graph)
How connections evolve with time (time series)
Bitemporal (e.g. value depends on time frame)
Time value of data (e.g. Useful today, but not tomorrow)
Sensitivity (e.g. Medical, Economical, Political, Privacy…)
Interdependency (e.g. one wrong bit destroys everything)
Cleanliness (e.g. how Noisy it is)
Truthfulness (e.g. how Accurate it is)
Redundancy (e.g. how safe does it need to be)
Density (e.g. how Redundant it is)
Accessibility (e.g. Local <-> Global)
Cost / Budget
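One possible reading of the bitemporal aspect is that every value carries two dates: when it applies and when it was recorded, so a query must fix both. The sketch below assumes that reading; the `Fact` class, the `as_of` function, and the figures are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    valid_from: date   # when the value starts to apply in the real world
    recorded_on: date  # when the system learned about it
    value: float


def as_of(facts: List[Fact], valid_on: date, known_on: date) -> Optional[float]:
    """Value applying on `valid_on`, using only what was recorded by `known_on`."""
    visible = [
        f for f in facts
        if f.recorded_on <= known_on and f.valid_from <= valid_on
    ]
    if not visible:
        return None
    # The latest validity period wins; within it, the latest correction wins.
    best = max(visible, key=lambda f: (f.valid_from, f.recorded_on))
    return best.value


# Hypothetical figures: a quarterly estimate that is later revised.
facts = [
    Fact(date(2020, 1, 1), date(2020, 4, 15), 2.0),  # first estimate
    Fact(date(2020, 1, 1), date(2020, 7, 15), 1.5),  # revision of the same period
]

first_view = as_of(facts, date(2020, 2, 1), date(2020, 5, 1))    # before the revision
revised_view = as_of(facts, date(2020, 2, 1), date(2020, 8, 1))  # after the revision
unknown_yet = as_of(facts, date(2020, 2, 1), date(2020, 3, 1))   # nothing recorded yet
```

The same question about February returns a different answer depending on when it is asked, which is what makes back-testing on revised data misleading.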
7. Myriad of Data-stores/bases
File-Systems
local, distributed, p2p,…
rom, tape, spindle, flash, ram,…
Key-Value Stores
Relational
Object
Geo-location
Row-based
Column-based
Time-Series
Graph-based
ACID compliant or not
Sharding Support
Replication Support
HA Support
Blockchain
LayerFS
…
8. Recent Innovations in “Small-data”
XML (1996)
YAML (2001)
JSON
BSON
Google Protocol Buffers (initial release 2008)
Cap’n Proto
Thrift
Avro
FAST
FIX/BFIX
FlatBuffers
Simple Binary Encoding (2014)
Dynamically Adaptive Encoding (Future)
http://www.quora.com/What-are-the-pros-and-cons-of-different-serialization-formats-for-Hadoop
11. ML / AI – Boils down to…
Given an input (X) and/or state (S) produce an output (Y)
X may include:
Index or Time element (e.g. time series)
S may include:
a feedback-loop (e.g. reinforcement learning)
a previously trained dataset (e.g. supervised learning)
Y divides into two types:
predictions (e.g. weather, trading, ...)
categorizations
known categories (e.g. object/speech recognition, …)
unknown categories (e.g. insight generation, …)
13. Clustering
k-Means
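Partition the observations x into k sets S = {S_1, …, S_k} so as to minimize the within-cluster sum of squares:
$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
where $\mu_i$ represents the mean of the points in $S_i$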
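A minimal sketch of the usual k-means iteration (Lloyd's algorithm): assign each point to its nearest mean, recompute the means, repeat until nothing changes. It assumes points are plain tuples and that the caller supplies the starting means; the function names and sample points are illustrative, not from the talk.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def mean(points: Sequence[Point]) -> Point:
    """Component-wise mean of a non-empty set of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(
    points: Sequence[Point], centroids: Sequence[Point], max_iter: int = 100
) -> Tuple[List[Point], List[List[Point]]]:
    """Lloyd's algorithm; `max_iter` must be at least 1."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: each point joins the partition of its nearest mean.
        clusters: List[List[Point]] = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda j: math.dist(p, centroids[j])
            )
            clusters[nearest].append(p)
        # Update step: recompute each mean; an empty partition keeps its old mean.
        new_centroids = [
            mean(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: the assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Two visually obvious groups; the starting means are picked by hand.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, [points[0], points[3]])
```

The result depends on the starting means, so practical implementations typically run several random initializations and keep the partition with the lowest cost.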
14. General AI
Genetic Algorithms
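For n mutations, select the m_i that minimizes the difference between its output y_i and a given reference (r):
$$m^{*} = \underset{m_i}{\arg\min}\, \lvert y_i - r \rvert$$
where y_i is the output produced by mutation m_i.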
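A minimal sketch of the selection rule, under several assumptions: candidates are single floats, mutation is Gaussian noise, there is no crossover, and the parent survives unless a mutation does better. The function `evolve`, its parameters, and the target function are hypothetical.

```python
import random
from typing import Callable, Tuple


def evolve(
    f: Callable[[float], float],
    r: float,
    start: float,
    n: int = 20,
    generations: int = 200,
    sigma: float = 0.5,
    seed: int = 0,
) -> Tuple[float, float]:
    """Mutation-and-selection search for a candidate whose output is close to r."""
    rng = random.Random(seed)
    best = start
    best_err = abs(f(best) - r)
    for _ in range(generations):
        # Generate n mutations m_i of the current best candidate.
        mutations = [best + rng.gauss(0.0, sigma) for _ in range(n)]
        # Select the m_i whose output y_i = f(m_i) is closest to the reference r.
        m = min(mutations, key=lambda c: abs(f(c) - r))
        err = abs(f(m) - r)
        if err < best_err:  # keep the parent unless a mutation improves on it
            best, best_err = m, err
    return best, best_err


def square(m: float) -> float:
    return m * m


# Search for a number whose square is close to the reference r = 2.
best, best_err = evolve(square, r=2.0, start=1.0)
print(best, best_err)
```

No gradient of `f` is needed, which is why this family of methods suits problems where the output can be evaluated but not differentiated.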
16. All About Optimization
All these schemes involve solving for some constants that
Minimize or Maximize some Cost function
Require fundamental Optimization algorithms such as:
Direct Methods
Combinatorial Algorithms
Greedy Algorithm
Minimax Algorithm with alpha-beta pruning
…
Iterative Methods
Gradient Methods
Karmarkar’s Algorithm
…
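As an instance of an iterative gradient method, the sketch below solves for the two constants of a straight line by minimizing a mean-squared-error cost. The data, learning rate, and step count are assumptions chosen so the example converges; they are not from the talk.

```python
from typing import Sequence, Tuple


def gradient_descent(
    xs: Sequence[float], ys: Sequence[float], lr: float = 0.05, steps: int = 2000
) -> Tuple[float, float]:
    """Fit y = a*x + b by minimizing the mean squared error with gradient steps."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current fit.
        res = [a * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost (1/n) * sum(res^2) w.r.t. a and b.
        grad_a = 2.0 / n * sum(r * x for r, x in zip(res, xs))
        grad_b = 2.0 / n * sum(res)
        # Move against the gradient.
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


# Points that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a, b = gradient_descent(xs, ys)
print(a, b)
```

The learning rate has to stay below a bound set by the curvature of the cost; too large a step makes the iteration diverge instead of converge.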
17. At the Core of Optimization…
…there is the solution of a system of linear equations of the form:
$$Ax = b$$
with x subject to some constraints.
These need algorithms that can be subdivided into two categories:
Direct Methods
Gaussian, LU, QR, Cholesky, LDL, …
Iterative Methods
MINRES, CG, BiCGSTAB, GMRES, ORTHOMIN, …
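To contrast the two categories, here is a sketch that solves the same small system once with a direct method (Gaussian elimination with partial pivoting) and once with an iterative one (conjugate gradient, which assumes a symmetric positive-definite matrix). It uses plain Python lists; the matrix, right-hand side, and function names are illustrative, and constraints on x are not handled.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def matvec(A: Matrix, v: Vector) -> Vector:
    return [dot(row, v) for row in A]


def gaussian_solve(A: Matrix, b: Vector) -> Vector:
    """Direct method: forward elimination with partial pivoting, then back-substitution."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def conjugate_gradient(
    A: Matrix, b: Vector, tol: float = 1e-12, max_iter: int = 1000
) -> Vector:
    """Iterative method: refine x until the residual norm drops below tol."""
    x = [0.0] * len(b)
    r = b[:]  # residual b - A x for the starting guess x = 0
    p = r[:]  # first search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)  # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


# Small symmetric positive-definite system; the exact solution is (1/11, 7/11).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_direct = gaussian_solve(A, b)
x_iterative = conjugate_gradient(A, b)
```

The direct solver does a fixed amount of work and touches every matrix entry; the iterative solver only needs matrix-vector products, which is what makes it attractive for large sparse systems and parallel hardware.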
18. Accelerating Machine Learning
CPU <-> GPGPU <-> FPGA
Sequential Processing <-> Parallel Processing
Sequential end (CPU):
High Flexibility
High Abstractions
Many Libraries
…
Direct Methods
Parallel end (GPGPU, FPGA):
Ultra-Low-Latency
High Bandwidth
Fine grain optimization
…
Iterative Methods
Neural Networks
Markov Chains
Monte Carlo
20. Modern Big Data Systems – Basic Components
Dynamic (abstraction) + Statically-Typed (speed) Languages
Need to rethink and re-engineer main systems:
Data & Code Stores
Logging
Code Revision and Deployment
Compute Nodes and Brokers Management
Graceful Failure and Recovery
Credentials and Access Controls
Task Schedulers
Messaging Bus
Web/Mobile Interfaces
Regression Testing
…
Containerize and Standardize Services
21. Examples – Modern Big Data Systems
Finance
Athena/Hydra @ JP Morgan
Quartz/Sandra @ Bank of America
Slang/SecDB @ Goldman Sachs
Optimus/DAL @ Morgan Stanley
WSQ Tech @ n-prop shops &
datapark.io @ quants / prop-shops
Machine Learning
Alpha/DL @ Muse.Ai