Big Data for
Big Questions
Cliff Click, CTO 0xdata
cliffc@0xdata.com
http://0xdata.com
http://cliffc.org/blog
● Motivation: What & Why Big Math?
● Better Mousetrap
● Demo
● Fork: Deep Dive into
Math Hacking ...or...
K/V Store
Source: https://github.com/0xdata/h2o
0xdata.com 3
42!
What was the question again?
Oh yeah, it was:
● How do I place ads based on a clickstream?
● Detect fraud in a credit-card swipe stream?
● Detect cancer from sensor data?
● Predict equipment failure ahead of time?
● Find people (un)like me?
● ... or ... or ... or... ????
How do I figure it all out?
● Well... what are my tools?
● Domain Knowledge,
● (me! The Expert)
● Math & Science! Data Science, and
● Data – lots and lots and lots of it
● Old logs, new logs, databases, historical
records, click-streams, CSV files, dumps
● Often TBs, sometimes PBs of it
Data: The Main Player
● Data: I got lots of it
● But it's a messy mixed-up lot
● Stored in HDFS, S3, DB2 or scattered about
● Incompatible formats, older & newer bits
● Missing stuff, or "known broken" fields
● And it's Big
● Too big for my laptop, or even one server
Data: Cleaning it Up
● Just the parts I want:
● SQL, Hive, HBase, grep
● Data is Big, so this is slow
● Wrong format:
● Awk, shell scripts, files, disk-to-disk
● Inspection (do I got it right yet?)
● Grep/awk, histograms, plots/prints
● Visualization tools
From Facts to Knowledge
● Data cleaned up: lots of neat rows of facts
● Lots of rows: millions and billions ...
● But facts is not knowledge
● Too much to "get it" by looking
● Time for a mathematical Model!
● Here again, Big limits my tools
● Either can't deal, or deal very very slowly
Modeling: math(data)
● Modeling gives a simpler view
● A way to understand
● And predict in real time
● Modeling is Math!
● Generalized Linear Modeling
– Oldest, most well known & used
● Random Forest
● K-Means Clustering
Big Data vs Modeling
● Model: a concise description of my data
● A more accurate model predicts better
● Generally More Data builds a better Model
● But only if the tool can handle it
● (some datasets are not helped but it rarely hurts)
● When tools can't handle Big: down-sample,
and use a better (more complex) algorithm
Big Data vs Better Algorithm
● Don't want to choose Big vs Better
● Down sampling loses information
● Want a way to manipulate Big Data like it's
small: interactive & fast. Subtle when I
need it and brute force when I don't
● Build the Better Algorithm and use Big Data
● Seeing 10x more data yield better predictions,
e.g. improving from 75% to 85%
Building The Better
Big Data Mousetrap
● Want fast: means dram instead of disk
● Fall back to disk, if data >>> dram
● Want fast: use all cpus
● Problems are mostly data-parallel anyways
● Want ease-of-programming:
● “parallelism without effort”
● Well understood programming model
Building The Better
Big Data Mousetrap
● Want ease-of-use:
● python, json, REST/HTML interfaces
● Full R semantics (via fastr project)
● Data ingest:
● where: HDFS, S3, NFS, URL, URI, browser
● what: csv, hive, rdata
Building The Better
Big Data Mousetrap
● Want ease-of-admin:
● e.g. java -jar h2o.jar
● auto-cluster (no config at all) or hadoop Job
● Want ease-of-upgrade:
adding more servers gives
● More CPU (faster exec)
● More DRAM (larger data in dram)
● More network/disk bandwidth (faster ingest)
H2O: An Engine for Big Math
● Built in layers – pick your abstraction level
● Analysts, starters: REST, browser
– "clicky clicky" load data, build model, score
● Scientists: R, JSON, python to drive engine
– Complex math
● Math hackers: building new algos
– Full (distributed) Java Memory Model
– "codes like Java, runs distributed"
● Core Engineering: call us, we're hiring
Core Engineering: K/V Store
● Classic distributed Key/Value store
● get/put/atomic-transaction
● Full JMM semantics, exact consistency
● Full caching as-needed
– Cached keys "get" in 150 nanoseconds
– Misses limited by network speed
● Hardware-like cache coherency protocol
● Distributed fork/join (thanks Doug Lea)
Core Engineering: D/F/J
● Distributed fork/join (jsr 166y)
● Recursive-descent for data-parallel
● Distribution handled by the core
– Log-tree scatter/gather across cluster
● Supports map/reduce-style directly
● But also "do this on all nodes" style
● Or random graph hacking
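The recursive-descent style can be sketched on a single JVM with the standard java.util.concurrent fork/join classes. This is only a local analogue: it does not distribute anything across a cluster, and the class name ChunkSum and the leaf threshold are made up for illustration.

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical single-JVM sketch of recursive-descent data parallelism.
// The distributed version would also spread the tree across nodes; this does not.
class ChunkSum extends RecursiveTask<Double> {
    static final int THRESHOLD = 1_000; // assumed leaf size, chosen arbitrarily
    private final double[] data;
    private final int lo, hi;           // half-open range [lo, hi)

    ChunkSum(double[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Double compute() {
        if (hi - lo <= THRESHOLD) {      // small enough: plain for-loop
            double sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;       // otherwise split in half and recurse
        ChunkSum left = new ChunkSum(data, lo, mid);
        ChunkSum right = new ChunkSum(data, mid, hi);
        left.fork();                     // run the left half asynchronously
        double r = right.compute();      // run the right half in this thread
        return left.join() + r;          // gather both halves
    }

    static double sum(double[] data) {
        return ForkJoinPool.commonPool().invoke(new ChunkSum(data, 0, data.length));
    }

    public static void main(String[] args) {
        double[] data = new double[10_000];
        Arrays.fill(data, 1.0);
        System.out.println(sum(data));
    }
}
```

The split/compute/join shape is the same log-tree scatter/gather idea, just confined to the CPUs of one machine.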
Math Hacking
● “Tastes like (distributed) java”
(actual inner loop, auto-parallel, auto-distributed)
● Big “vector math” is easy
● The obvious for-loop "just works"
  for( int i=0; i<rows; i++ ) {
    double X = ary.datad(bits,i,A);
    double Y = ary.datad(bits,i,B);
    _sumX += X;
    _sumY += Y;
    _sumX2+= X*X;
  }
Math Hacking
● Dense-vector algorithms are easy
● Generalized Linear Modeling: 2 weeks
● K-means: 2 days
● Histogram: 2 hours
● Random Forest: not dense vectors
● Still makes good use of D/F/J
● All-CPUs, all-nodes still light up
– Very fast tree building
Science: dancing with the data
● Like the belle of the ball, the main algos
(GLM, k-means, RF) only arrive when the
data is properly dressed
● Munging data: dropping junk columns,
replacing missing bits, adding features
● H2O provides a tool-kit
● Big vector calculator: "d := a+b*c"
● dram speeds: "msec per Gbyte"
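As a rough illustration of what a "d := a+b*c" vector expression means element by element, here is a minimal plain-Java sketch using parallel streams on in-memory arrays. The class and method names (VectorCalc, addMul) are hypothetical and this is not the H2O calculator itself, which works on distributed data.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical local sketch of an element-wise vector expression.
class VectorCalc {
    // Computes d[i] = a[i] + b[i] * c[i] for every index, using all local CPUs.
    static double[] addMul(double[] a, double[] b, double[] c) {
        if (a.length != b.length || a.length != c.length)
            throw new IllegalArgumentException("vectors must have equal length");
        double[] d = new double[a.length];
        // Each index is written by exactly one task, so no synchronization is needed.
        IntStream.range(0, a.length).parallel().forEach(i -> d[i] = a[i] + b[i] * c[i]);
        return d;
    }

    public static void main(String[] args) {
        double[] d = addMul(new double[]{1, 2}, new double[]{3, 4}, new double[]{5, 6});
        System.out.println(Arrays.toString(d));
    }
}
```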
Science: APIs
● Need to script, automate repetitive tasks
● R via fastr and bigmemory package
● Full R semantics, 5x R speed single-thread
● But your vectors can be very very big...
● https://github.com/allr/fastr
● REST / URL / JSON
● Drive from e.g. python, scripts, curl, wget
– e.g. h2o testing harness is all python
Demos & Quick Starts
● Full browser interface
● Tutorials
● Handful of clicks to run e.g. RF or GLM
on gigabytes of data
● Auto-cluster in seconds
● On EC2 (or your laptops right now)
● Good enough for serious work
● (and have customers using this interface!)
Demo Time!
H2O: An Engine for Big Math
● Focus on Big Math
● Easy to extend via M/R or K/V programming
● Auto-cluster
● Data-parallel exec across all CPUs
● dram caching across all servers
● Parallel ingest across all servers
● Open source: https://github.com/0xdata/h2o
0xdata.com
Math Hacking: The M/R API
● Make a 'golden object'
● Will be endlessly replicated across cluster
● Set 'input' fields:
– Auto-serialized, distributed
– Shallow-copy on nodes: e.g. arrays share state
● golden.map(key_1mb)
● map() called on clone for each 1 MB
● Set 'output' fields now
Math Hacking: The M/R API
● gold.reduce(gold)
● Combine pairs of 'golden' objects
● Both locally and remotely (distributed)
● Log-tree roll-up
● 'output' fields will be shipped over the wire
● null-out 'input' fields
● transient marker available
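The golden-object lifecycle (set inputs, clone per chunk, map, pairwise reduce) can be simulated locally in plain Java. This is a sketch under stated assumptions, not the MRTask API: GoldenSum, its fields and its invoke method are invented for illustration, chunks are ordinary arrays rather than 1 MB keys, and nothing is serialized or shipped over the wire.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical local simulation of the golden-object pattern.
class GoldenSum implements Cloneable {
    int col;      // 'input' field: column to sum (set before invoke)
    double sum;   // 'output' field: filled in by map(), combined by reduce()
    long rows;    // 'output' field: number of rows seen

    // Called on a clone, once per chunk of rows.
    void map(double[][] chunk) {
        for (double[] row : chunk) { sum += row[col]; rows++; }
    }

    // Combines the right-hand object into 'this'.
    void reduce(GoldenSum other) { sum += other.sum; rows += other.rows; }

    @Override
    public GoldenSum clone() {
        try { return (GoldenSum) super.clone(); } // shallow copy, like the slide describes
        catch (CloneNotSupportedException e) { throw new AssertionError(e); }
    }

    // Clone per chunk, map each clone, then reduce pairs level by level (log-depth tree).
    GoldenSum invoke(List<double[][]> chunks) {
        sum = 0; rows = 0;                        // clear outputs so clones start clean
        List<GoldenSum> level = new ArrayList<>();
        for (double[][] chunk : chunks) {
            GoldenSum t = clone();
            t.map(chunk);
            level.add(t);
        }
        while (level.size() > 1) {
            List<GoldenSum> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                GoldenSum left = level.get(i);
                if (i + 1 < level.size()) left.reduce(level.get(i + 1));
                next.add(left);
            }
            level = next;
        }
        if (!level.isEmpty()) reduce(level.get(0)); // results land in the golden object
        return this;
    }

    public static void main(String[] args) {
        GoldenSum gold = new GoldenSum();
        gold.col = 0;
        gold.invoke(Arrays.asList(new double[][]{{1, 10}, {2, 20}}, new double[][]{{3, 30}}));
        System.out.println(gold.sum + " over " + gold.rows + " rows");
    }
}
```

In a distributed setting the reduce step is where only the small 'output' fields need to travel, which is why large 'input' fields are worth nulling out or marking transient.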
Math Hacking: Example
CalcSumsTask cst = new CalcSumsTask();
cst._arykey = ary._key; // BigData Table key
cst._colA = colA; // integer indices to columns
cst._colB = colB;
cst.invoke(ary._key); // Do It!
// Results returned directly in 'cst' object
...cst._sumX... // use results
Math Hacking: Example
public static class CalcSumsTask extends MRTask {
  Key _arykey;                // BigData Table key
  int _colA, _colB;           // Column indices to work on
  double _sumX,_sumY,_sumX2;  // Sum of X's, Y's, X^2's

  // map called for every 1 MB of data, or so
  public void map( Key key1Mb ) {
    … boiler plate...         // lots of unimportant details
    // Standard for-loop over the data
    for( int i=0; i<rows; i++ ) {
      double X = ary.datad(bits,i,A);
      double Y = ary.datad(bits,i,B);
      _sumX += X;
      _sumY += Y;
      _sumX2+= X*X;
    }
  }

  // reduce called between pairs of golden objects
  // always reduce right-side into 'this' object
  public void reduce( DRemoteTask rt ) {
    CalcSumsTask cst = (CalcSumsTask)rt;
    _sumX += cst._sumX ;
    _sumY += cst._sumY ;
    _sumX2+= cst._sumX2;
  }
}
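One plausible use of the reduced sums is to turn them into summary statistics. The sketch below assumes a row count n is also available, which the task on the slide does not track; SumsStats and its method names are hypothetical.

```java
// Hypothetical helper: derives statistics from the reduced sums.
class SumsStats {
    // Mean of X from its sum and the (assumed known) row count.
    static double mean(double sumX, long n) {
        return sumX / n;
    }

    // Population variance via E[X^2] - (E[X])^2.
    // Note: this one-pass formula can lose precision when the mean is
    // large compared with the spread of the data.
    static double variance(double sumX, double sumX2, long n) {
        double m = sumX / n;
        return sumX2 / n - m * m;
    }

    public static void main(String[] args) {
        // Sums for X = {1, 2, 3, 4}: sumX = 10, sumX2 = 30
        System.out.println(mean(10, 4) + " " + variance(10, 30, 4));
    }
}
```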
A Fast K/V Store
● Distributed in-memory K/V Store
● Peer-to-peer, no master
● Full JMM semantics, get/put/atomic/remove
● Hardware-style cache-coherency protocol
● Fast: 150 nanoseconds for cache-hitting 'get'
● Fast: 50 microseconds for cache-missing 'put'
● No persistence (see above for 'fast')
● No locks: use 'atomic' instead
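The "atomic instead of locks" idea can be illustrated on a single JVM with java.util.concurrent.ConcurrentHashMap and a compare-and-swap retry loop. This is a local sketch only: LocalKV and its method names are invented here, and it says nothing about how the distributed store implements its transactions.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;
import java.util.stream.IntStream;

// Hypothetical single-JVM stand-in for a K/V store with an 'atomic' operation.
class LocalKV<K, V> {
    private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<>();

    V get(K key) { return map.get(key); }
    void put(K key, V val) { map.put(key, val); }
    void remove(K key) { map.remove(key); }

    // Atomic read-modify-write on one key without the caller taking a lock.
    // 'update' may run more than once, so it must be side-effect free,
    // and it must not return null (ConcurrentHashMap rejects null values).
    V atomic(K key, UnaryOperator<V> update) {
        while (true) {
            V old = map.get(key);
            V next = update.apply(old);
            boolean won = (old == null)
                ? map.putIfAbsent(key, next) == null   // first writer wins
                : map.replace(key, old, next);         // CAS: only if still 'old'
            if (won) return next;                      // otherwise retry with fresh value
        }
    }

    public static void main(String[] args) {
        LocalKV<String, Integer> kv = new LocalKV<>();
        // Many concurrent increments of the same key; none should be lost.
        IntStream.range(0, 1000).parallel()
                 .forEach(i -> kv.atomic("hits", v -> v == null ? 1 : v + 1));
        System.out.println(kv.get("hits"));
    }
}
```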
K/V Design Goals
● JMM semantics on all get/put
● Cache-hitting 'gets' as fast as possible
● Local hashtable lookup + few tests
● 'puts' as lazy as possible (still JMM)
● Typically do not block for remote put
● Arbitrary transactions on single Keys
K/V Coherency Protocol
● Many are possible
● Picked a {fast-enough,easy} one
● Faster is possible
● Every Key has 1 master node
● And everybody knows it from Key hash
● Master orders racing writes
● Winner of NBHM insert
K/V Coherency Protocol
● Master tracks replicas
● Single CAS update
● Invalidate replicas on update
● Single CAS required, plus the invalidates
● Cache miss on replica will reload
● Interlocking get/put races solved with
finite state machine
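The master/replica/invalidate flow can be sketched as a tiny single-threaded simulation. Everything here is an assumption made for illustration: MiniCluster, its fields and its methods are hypothetical, nodes are just in-process maps, and the racing writes, CAS updates and finite state machine from the slides are deliberately left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, single-threaded model of a home-node invalidation protocol.
class MiniCluster {
    private final List<Map<String, String>> caches = new ArrayList<>(); // one local store per node
    private final Map<String, Set<Integer>> replicas = new HashMap<>(); // master's replica tracking
    int remoteFetches = 0;                                               // counts cache misses

    MiniCluster(int nodes) {
        for (int i = 0; i < nodes; i++) caches.add(new HashMap<>());
    }

    // Every node can compute a key's master from the key hash alone.
    int masterOf(String key) {
        return Math.floorMod(key.hashCode(), caches.size());
    }

    // A put is applied at the master, which then invalidates all tracked replicas.
    void put(String key, String val) {
        caches.get(masterOf(key)).put(key, val);
        Set<Integer> reps = replicas.remove(key);
        if (reps != null) for (int r : reps) caches.get(r).remove(key);
    }

    // A get is a local lookup on a hit; a miss reloads from the master.
    String get(int node, String key) {
        String hit = caches.get(node).get(key);
        if (hit != null) return hit;
        int m = masterOf(key);
        if (m == node) return null;            // the master itself has no value
        remoteFetches++;
        String val = caches.get(m).get(key);
        if (val != null) {
            caches.get(node).put(key, val);    // cache locally
            replicas.computeIfAbsent(key, k -> new HashSet<>()).add(node);
        }
        return val;
    }

    public static void main(String[] args) {
        MiniCluster c = new MiniCluster(3);
        c.put("k", "v1");
        int other = (c.masterOf("k") + 1) % 3;
        System.out.println(c.get(other, "k") + " after " + c.remoteFetches + " remote fetch(es)");
    }
}
```

Repeated gets from the same node stay local until the next put invalidates the replica, which is the behavior the "cache miss on replica will reload" bullet describes.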
Backup Slides
The Expert
● Domain Expert:
● What data is useful, which is trash
● What needs help to become useful
● Missing elements? Toss outliers?
● Build new features from old?
● All through this process Big Data is, well,
Big, hence Slow to cp / awk / grep
● And Big limits my tools

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 

Kürzlich hochgeladen (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 

Big Data for Big Questions: An Engine for Fast Math Hacking

  • 1. Big Data for Big Questions
      Cliff Click, CTO 0xdata
      cliffc@0xdata.com
      http://0xdata.com
      http://cliffc.org/blog
  • 2. Outline
      ● Motivation: What & Why Big Math?
      ● Better Mousetrap
      ● Demo
      ● Fork: Deep Dive into Math Hacking ...or... K/V Store
      Source: https://github.com/0xdata/h2o
  • 9. 42! What was the question again? Oh yeah, it was:
      ● How do I place ads based on a clickstream?
      ● Detect fraud in a credit-card swipe stream?
      ● Detect cancer from sensor data?
      ● Predict equipment failure ahead of time?
      ● Find people (un)like me?
      ● ... or ... or ... or... ????
  • 10. How do I figure it all out?
      ● Well... what are my tools?
      ● Domain Knowledge,
      ● (me! The Expert)
      ● Math & Science! Data Science, and
      ● Data – lots and lots and lots of it
      ● Old logs, new logs, databases, historical records, click-streams, CSV files, dumps
      ● Often TB's, sometimes PB's of it
  • 11. Data: The Main Player
      ● Data: I got lots of it
      ● But it's a messy mixed-up lot
      ● Stored in HDFS, S3, DB2 or scattered about
      ● Incompatible formats, older & newer bits
      ● Missing stuff, or "known broken" fields
      ● And it's Big
      ● Too big for my laptop, or even one server
  • 12. Data: Cleaning it Up
      ● Just the parts I want:
      ● SQL, Hive, HBase, grep
      ● Data is Big, so this is slow
      ● Wrong format:
      ● Awk, shell scripts, files, disk-to-disk
      ● Inspection (do I got it right yet?)
      ● Grep/awk, histograms, plots/prints
      ● Visualization tools
  • 13. From Facts to Knowledge
      ● Data cleaned up: lots of neat rows of facts
      ● Lots of rows: millions and billions ...
      ● But facts is not knowledge
      ● Too much to "get it" by looking
      ● Time for a mathematical Model!
      ● Here again, Big limits my tools
      ● Either can't deal, or deal very very slowly
  • 14. Modeling: math(data)
      ● Modeling gives a simpler view
      ● A way to understand
      ● And predict in real time
      ● Modeling is Math!
      ● Generalized Linear Modeling
        – Oldest, most well known & used
      ● Random Forest
      ● K-Means Clustering
  • 15. Big Data vs Modeling
      ● Model: a concise description of my data
      ● A more accurate model predicts better
      ● Generally More Data builds a better Model
      ● But only if the tool can handle it
      ● (some datasets are not helped but it rarely hurts)
      ● Tools can't handle Big: so down sample, and use better (more complex) algorithm
  • 16. Big Data vs Better Algorithm
      ● Don't want to choose Big vs Better
      ● Down sampling loses information
      ● Want a way to manipulate Big Data like it's small: interactive & fast. Subtle when I need it and brute force when I don't
      ● Build the Better Algorithm and use Big Data
      ● Seeing 10x more data yield prediction increases, e.g. from 75% to 85%
  • 17. Building The Better Big Data Mousetrap
      ● Want fast: means dram instead of disk
      ● Fall back to disk, if data >>> dram
      ● Want fast: use all cpus
      ● Problems are mostly data-parallel anyways
      ● Want ease-of-programming:
      ● “parallelism without effort”
      ● Well understood programming model
  • 18. Building The Better Big Data Mousetrap
      ● Want ease-of-use:
      ● python, json, REST/HTML interfaces
      ● Full R semantics (via fastr project)
      ● Data ingest:
      ● where: HDFS, S3, NFS, URL, URI, browser
      ● what: csv, hive, rdata
  • 19. Building The Better Big Data Mousetrap
      ● Want ease-of-admin:
      ● e.g. java -jar h2o.jar
      ● auto-cluster (no config at all) or hadoop Job
      ● Want ease-of-upgrade: adding more servers gives
      ● More CPU (faster exec)
      ● More DRAM (larger data in dram)
      ● More network/disk bandwidth (faster ingest)
  • 20. H2O: An Engine for Big Math
      ● Built in layers – pick your abstraction level
      ● Analysts, starters: REST, browser
        – "clicky clicky" load data, build model, score
      ● Scientists: R, JSON, python to drive engine
        – Complex math
      ● Math hackers: building new algos
        – Full (distributed) Java Memory Model
        – "codes like Java, runs distributed"
      ● Core Engineering: call us, we're hiring
  • 21. Core Engineering: K/V Store
      ● Classic distributed Key/Value store
      ● get/put/atomic-transaction
      ● Full JMM semantics, exact consistency
      ● Full caching as-needed
        – Cached keys "get" in 150 nano's
        – Misses limited by network speed
      ● Hardware-like cache coherency protocol
      ● Distributed fork/join (thanks Doug Lea)
  • 22. Core Engineering: D/F/J
      ● Distributed fork/join (jsr 166y)
      ● Recursive-descent for data-parallel
      ● Distribution handled by the core
        – Log-tree scatter/gather across cluster
      ● Supports map/reduce-style directly
      ● But also "do this on all nodes" style
      ● Or random graph hacking
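The recursive-descent, data-parallel style on the D/F/J slide can be illustrated on a single JVM with the standard `java.util.concurrent` fork/join classes. This is only a sketch under assumptions: it does not use or model H2O's distributed layer, and the names `ForkJoinSumSketch`, `SumTask`, and `LEAF_SIZE` are hypothetical.

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Single-JVM illustration only; the cross-node scatter/gather is not modeled.
class ForkJoinSumSketch {
    // Hypothetical cutoff: ranges at or below this size are summed sequentially.
    static final int LEAF_SIZE = 1_000;

    static final class SumTask extends RecursiveTask<Double> {
        private final double[] data;
        private final int lo, hi; // half-open range [lo, hi)

        SumTask(double[] data, int lo, int hi) {
            this.data = data;
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        protected Double compute() {
            if (hi - lo <= LEAF_SIZE) {
                double sum = 0;
                for (int i = lo; i < hi; i++) sum += data[i];
                return sum;
            }
            // Recursive descent: split the range, fork one half, compute the other here.
            int mid = (lo + hi) >>> 1;
            SumTask left = new SumTask(data, lo, mid);
            SumTask right = new SumTask(data, mid, hi);
            left.fork();
            double rightSum = right.compute();
            return left.join() + rightSum;
        }
    }

    static double parallelSum(double[] data) {
        return ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
    }

    public static void main(String[] args) {
        double[] data = new double[10_000];
        Arrays.fill(data, 1.0);
        System.out.println(parallelSum(data));
    }
}
```

The split-then-combine shape is the same one a log-tree scatter/gather would follow across nodes; here every task simply stays inside one process.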
  • 23. Math Hacking
      ● “Tastes like (distributed) java” (actual inner loop, auto-parallel, auto-distributed)
      ● Big “vector math” is easy
      ● The obvious for-loop "just works"
          for( int i=0; i<rows; i++ ) {
            double X = ary.datad(bits,i,A);
            double Y = ary.datad(bits,i,B);
            _sumX += X;
            _sumY += Y;
            _sumX2+= X*X;
          }
  • 24. Math Hacking
      ● Dense-vector algorithms are easy
      ● Generalized Linear Modeling: 2 weeks
      ● K-means: 2 days
      ● Histogram: 2 hours
      ● Random Forest: not dense vectors
      ● Still makes good use of D/F/J
      ● All-CPUs, all-nodes still light up
        – Very fast tree building
  • 25. Science: dancing with the data
      ● Like the belle of the ball, the main algos (GLM, k-means, RF) only arrive when the data is properly dressed
      ● Munging data: dropping junk columns, replacing missing bits, adding features
      ● H2O provides a tool-kit
      ● Big vector calculator: "d := a+b*c"
      ● dram speeds: "msec per Gbyte"
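As a rough picture of what a "big vector calculator" expression such as "d := a+b*c" means element-wise, here is a plain-Java sketch over in-memory arrays. It is not H2O's implementation; the class and method names are hypothetical, and it assumes the vectors fit in one JVM.

```java
import java.util.Arrays;

// Hypothetical helper: element-wise d[i] = a[i] + b[i] * c[i] using all local CPUs.
class VectorCalcSketch {
    static double[] addMul(double[] a, double[] b, double[] c) {
        if (a.length != b.length || a.length != c.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
        double[] d = new double[a.length];
        // Each index is independent, so the work is trivially data-parallel.
        Arrays.parallelSetAll(d, i -> a[i] + b[i] * c[i]);
        return d;
    }

    public static void main(String[] args) {
        double[] d = addMul(new double[]{1, 2}, new double[]{3, 4}, new double[]{5, 6});
        System.out.println(Arrays.toString(d));
    }
}
```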
  • 26. Science: APIs
      ● Need to script, automate repetitive tasks
      ● R via fastr and bigmemory package
      ● Full R semantics, 5x R speed single-thread
      ● But your vectors can be very very big...
      ● https://github.com/allr/fastr
      ● REST / URL / JSON
      ● Drive from e.g. python, scripts, curl, wget
        – e.g. h2o testing harness is all python
  • 27. Demos & Quick Starts
      ● Full browser interface
      ● Tutorials
      ● Handful of clicks to run e.g. RF or GLM on gigabytes of data
      ● Auto-cluster in seconds
      ● On EC2 (or your laptops right now)
      ● Good enough for serious work
      ● (and have customers using this interface!)
  • 29. H2O: An Engine for Big Math
      ● Focus on Big Math
      ● Easy to extend via M/R or K/V programming
      ● Auto-cluster
      ● Data-parallel exec across all CPUs
      ● dram caching across all servers
      ● Parallel ingest across all servers
      ● Open source: https://github.com/0xdata/h2o
  • 30. Math Hacking: The M/R API
      ● Make a 'golden object'
      ● Will be endlessly replicated across cluster
      ● Set 'input' fields:
        – Auto-serialized, distributed
        – Shallow-copy on nodes: e.g. arrays share state
      ● golden.map(key_1mb)
      ● map() called on clone for each 1mb
      ● Set 'output' fields now
  • 31. Math Hacking: The M/R API
      ● gold.reduce(gold)
      ● Combine pairs of 'golden' objects
      ● Both locally and remotely (distributed)
      ● Log-tree roll-up
      ● 'output' fields will be shipped over the wire
      ● null-out 'input' fields
      ● transient marker available
  • 32. Math Hacking: Example
          CalcSumsTask cst = new CalcSumsTask();
          cst._arykey = ary._key;  // BigData Table key
          cst._colA = colA;        // integer indices to columns
          cst._colB = colB;
          cst.invoke(ary._key);    // Do It!
          // Results returned directly in 'cst' object
          ...cst._sumX...          // use results

          public static class CalcSumsTask extends MRTask {
            Key _arykey;               // BigData Table key
            int _colA, _colB;          // Column indices to work on
            double _sumX,_sumY,_sumX2; // Sum of X's, Y's, X^2's
  • 33. Math Hacking: Example
          public static class CalcSumsTask extends MRTask {
            Key _arykey;               // BigData Table key
            int _colA, _colB;          // Column indices to work on
            double _sumX,_sumY,_sumX2; // Sum of X's, Y's, X^2's
            // map called for every 1Mb of data, or so
            public void map( Key key1Mb ) {
              … boiler plate...        // lots of unimportant details
              // Standard for-loop over the data
              for( int i=0; i<rows; i++ ) {
                double X = ary.datad(bits,i,A);
                double Y = ary.datad(bits,i,B);
                _sumX += X;
                _sumY += Y;
                _sumX2+= X*X;
              }
            }
  • 34. Math Hacking: Example
          public static class CalcSumsTask extends MRTask {
            Key _arykey;               // BigData Table key
            int _colA, _colB;          // Column indices to work on
            double _sumX,_sumY,_sumX2; // Sum of X's, Y's, X^2's
            // reduce called between pairs of golden objects
            // always reduce right-side into 'this' object
            public void reduce( DRemoteTask rt ) {
              CalcSumsTask cst = (CalcSumsTask)rt;
              _sumX += cst._sumX ;
              _sumY += cst._sumY ;
              _sumX2+= cst._sumX2;
            }
          }
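The golden-object pattern from the M/R slides (clone per chunk, map() on each clone, pairwise reduce() in a log-tree) can be mimicked in one JVM without any H2O classes. The sketch below is an assumption-laden stand-in: `GoldenSumsSketch` and its `invoke` helper are hypothetical, chunks are plain `double[][]` row arrays rather than 1 MB keys, and nothing is serialized or shipped over a wire.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-JVM imitation of the golden-object map/reduce flow.
class GoldenSumsSketch implements Cloneable {
    // 'Input' fields: set once on the golden object, copied into every clone.
    int colA, colB;
    // 'Output' fields: filled by map(), combined by reduce().
    double sumX, sumY, sumX2;

    // Called on a fresh clone for each chunk of rows.
    void map(double[][] chunk) {
        for (double[] row : chunk) {
            double x = row[colA];
            double y = row[colB];
            sumX += x;
            sumY += y;
            sumX2 += x * x;
        }
    }

    // Always folds the right-hand object into 'this'.
    void reduce(GoldenSumsSketch other) {
        sumX += other.sumX;
        sumY += other.sumY;
        sumX2 += other.sumX2;
    }

    @Override
    protected GoldenSumsSketch clone() {
        try {
            return (GoldenSumsSketch) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);
        }
    }

    // Runs map on one clone per chunk, then rolls results up pairwise (log-tree).
    static void invoke(GoldenSumsSketch golden, List<double[][]> chunks) {
        List<GoldenSumsSketch> level = new ArrayList<>();
        for (double[][] chunk : chunks) {
            GoldenSumsSketch clone = golden.clone();
            clone.map(chunk);
            level.add(clone);
        }
        while (level.size() > 1) {
            List<GoldenSumsSketch> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                GoldenSumsSketch left = level.get(i);
                if (i + 1 < level.size()) left.reduce(level.get(i + 1));
                next.add(left);
            }
            level = next;
        }
        // Results land back in the golden object, as on the example slide.
        if (!level.isEmpty()) golden.reduce(level.get(0));
    }

    public static void main(String[] args) {
        GoldenSumsSketch cst = new GoldenSumsSketch();
        cst.colA = 0;
        cst.colB = 1;
        List<double[][]> chunks = new ArrayList<>();
        chunks.add(new double[][]{{1, 2}, {3, 4}});
        chunks.add(new double[][]{{5, 6}});
        invoke(cst, chunks);
        System.out.println(cst.sumX + " " + cst.sumY + " " + cst.sumX2);
    }
}
```

Because the reduce step only adds partial sums, the order in which pairs are combined does not change which values are merged, which is what makes a log-tree roll-up a natural fit.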
  • 35. A Fast K/V Store
      ● Distributed in-memory K/V Store
      ● Peer-to-peer, no master
      ● Full JMM semantics, get/put/atomic/remove
      ● Hardware-style cache-coherency protocol
      ● Fast: 150nanos for cache-hitting 'get'
      ● Fast: 50micros for cache-missing 'put'
      ● No persistence (see above for 'fast')
      ● No locks: use 'atomic' instead
  • 36. K/V Design Goals
      ● JMM semantics on all get/put
      ● Cache-hitting 'gets' as fast as possible
      ● Local hashtable lookup + few tests
      ● 'puts' as lazy as possible (still JMM)
      ● Typically do not block for remote put
      ● Arbitrary transactions on single Keys
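One way to picture "no locks: use 'atomic' instead" for a transaction on a single key is a compare-and-swap retry loop. The sketch below is a local stand-in built on `java.util.concurrent.ConcurrentHashMap`, not H2O's store or its API; `AtomicKeySketch`, `atomic`, and `countConcurrently` are hypothetical names.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// Hypothetical lock-free single-key transaction via a CAS retry loop.
class AtomicKeySketch {
    private final ConcurrentHashMap<String, Long> store = new ConcurrentHashMap<>();

    Long get(String key) {
        return store.get(key);
    }

    // Applies 'update' to the current value (null if absent) and retries on a lost race.
    // 'update' may run several times, so it should be side-effect free and return non-null.
    Long atomic(String key, UnaryOperator<Long> update) {
        while (true) {
            Long old = store.get(key);
            Long next = update.apply(old);
            boolean won = (old == null)
                    ? store.putIfAbsent(key, next) == null
                    : store.replace(key, old, next);
            if (won) return next;
        }
    }

    // Demo helper: several threads increment one key without any lock.
    // Assumes threads * perThread > 0, otherwise the key is never created.
    static long countConcurrently(int threads, int perThread) {
        AtomicKeySketch kv = new AtomicKeySketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    kv.atomic("hits", old -> old == null ? 1L : old + 1L);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return kv.get("hits");
    }

    public static void main(String[] args) {
        System.out.println(countConcurrently(4, 1_000));
    }
}
```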
  • 37. K/V Coherency Protocol
      ● Many are possible
      ● Picked a {fast-enough,easy} one
      ● Faster is possible
      ● Every Key has 1 master node
      ● And everybody knows it from Key hash
      ● Master orders racing writes
      ● Winner of NBHM insert
  • 38. K/V Coherency Protocol
      ● Master tracks replicas
      ● Single CAS update
      ● Invalidate replicas on update
      ● Single CAS required, plus the invalidates
      ● Cache miss on replica will reload
      ● Interlocking get/put races solved with finite state machine
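The coherency bullets (master chosen from the Key hash, master tracks replicas, invalidate on update, reload on miss) can be walked through with a toy in-process model. This is a simplified sketch, not H2O's protocol: nodes are plain objects in one JVM, `ConcurrentHashMap` stands in for the NBHM, all names are hypothetical, and the interlocking get/put races that the slide assigns to a finite state machine are deliberately ignored.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model: call it from one thread; it does not handle racing gets and puts.
class CoherencySketch {
    static final class Node {
        // Holds master copies for keys this node owns, plus cached replicas of others.
        final Map<String, String> store = new ConcurrentHashMap<>();
        // Used only in the master role: which nodes hold a replica of each key.
        final Map<String, Set<Integer>> replicas = new ConcurrentHashMap<>();
    }

    final Node[] nodes;

    CoherencySketch(int nodeCount) {
        nodes = new Node[nodeCount];
        for (int i = 0; i < nodeCount; i++) nodes[i] = new Node();
    }

    // Every node can compute a key's master from the key hash alone.
    int masterOf(String key) {
        return Math.floorMod(key.hashCode(), nodes.length);
    }

    String get(int nodeId, String key) {
        Node local = nodes[nodeId];
        String cached = local.store.get(key);
        if (cached != null) return cached;          // cache hit: local lookup only
        int masterId = masterOf(key);
        if (masterId == nodeId) return null;        // we are the master and have no value
        Node master = nodes[masterId];
        String value = master.store.get(key);       // cache miss: reload from master
        if (value != null) {
            master.replicas.computeIfAbsent(key, k -> ConcurrentHashMap.newKeySet()).add(nodeId);
            local.store.put(key, value);
        }
        return value;
    }

    void put(int nodeId, String key, String value) {
        Node master = nodes[masterOf(key)];
        master.store.put(key, value);               // master holds the ordering copy
        Set<Integer> holders = master.replicas.remove(key);
        if (holders != null) {
            for (int holder : holders) nodes[holder].store.remove(key); // invalidate replicas
        }
    }

    public static void main(String[] args) {
        CoherencySketch cluster = new CoherencySketch(3);
        cluster.put(0, "k", "v1");
        System.out.println(cluster.get(1, "k"));
        cluster.put(2, "k", "v2");
        System.out.println(cluster.get(1, "k"));
    }
}
```

In this model a stale replica cannot outlive a write, because the write removes every tracked replica before returning; a real protocol has to achieve the same effect while gets and puts overlap in time.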
  • 41. The Expert
      ● Domain Expert:
      ● What data is useful, which is trash
      ● What needs help to become useful
      ● Missing elements? Toss outliers?
      ● Build new features from old?
      ● All through this process Big Data is, well, Big, hence Slow to cp / awk / grep
      ● And Big limits my tools