http://purdygoodengineering.com http://anant.us
Accumulo and Spark
With MLlib and GraphX
Introduction
● Section 1: Understanding the Technology
○ Big Picture
○ Accumulo
○ Spark
○ Example Code
● Section 2: Use Cases
○ Multi-Tenant Data Processing
○ Machine Learning / Graph Processing in Spark
○ Example ML + Graph on Business Data
● Questions and Answers
● Contact Information
Section 1: Understanding the Technology
● Section 1: Understanding the Technology
○ Big Picture
■ Why Accumulo
■ Why Spark
○ Accumulo
■ Key/Value Structure
■ Table Structure
■ Cell Level Security
■ Splits
■ Reads (scans)
■ Writes (upserts)
■ Deletes
○ Spark
■ Batch/Streaming
■ Machine Learning
■ Graph Processing
○ Example Code
■ Writing to Accumulo
■ Reading from Accumulo
■ Shell
Section 1: Big Picture
● Accumulo
○ Scalable, sorted, distributed key/value store with cell level security
● Spark
○ General compute engine for large-scale data processing
■ Batch Processing
■ Streaming
■ Machine Learning Library
■ Graph Processing
● Use Spark for compute and Accumulo for storage to get a secure, distributed,
scalable solution
Section 1: Accumulo: Key Structure
(image from accumulo.apache.org)
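As a rough illustration of the key layout in that image, the sketch below models an Accumulo key as row ID, column family, column qualifier, column visibility, and timestamp, and sorts entries the way Accumulo orders them (text fields ascending, newest timestamp first). This is plain Python with made-up sample data, not the Accumulo client API; the class only borrows the name `Key`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Key:
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int

    def sort_key(self):
        # Text fields ascend; the timestamp is negated so the newest version sorts first.
        return (self.row, self.family, self.qualifier, self.visibility, -self.timestamp)


# Hypothetical sample entries (key -> value); none of this data comes from the talk.
entries = {
    Key("user2", "contact", "email", "public", 10): "b@example.com",
    Key("user1", "contact", "email", "public", 10): "old@example.com",
    Key("user1", "contact", "email", "public", 20): "new@example.com",
    Key("user1", "account", "plan", "private", 5): "gold",
}

sorted_keys = sorted(entries, key=Key.sort_key)
for k in sorted_keys:
    print(k.row, k.family, k.qualifier, k.visibility, k.timestamp, "->", entries[k])
```

All cells of `user1` sort next to each other, which is what makes row-oriented scans cheap.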
Section 1: Accumulo: Key Structure
(figure: Accumulo table design compared with RDBMS table design)
Section 1: Accumulo: Table Structure
● Each table has many tablets (distributed across nodes)
● Tablet data is stored in HDFS and replicated there (default replication factor is 3)
● All data for a given row resides on a single tablet
○ A Row ID design strategy needs to ensure binning is
evenly distributed
○ Each table has “splits” which determine binning
○ If rows are still too large, a sharding strategy is
required
Section 1: Accumulo: Cell Level Security
● Each cell (or field) has its own access control determined
by visibility
● Each user has authorizations which correspond to
visibilities
● Only fields with visibilities which a user has authorization
to access can be retrieved by that user
● Visibilities support limited logic such as AND (&) and OR (|)
○ e.g. private|system, public&dna_partner
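The visibility check described above can be sketched in a few lines of plain Python. This is a simplified stand-in, not Accumulo's own evaluator: the function name `is_visible`, the field names, and the user's authorizations are hypothetical. It follows the Accumulo conventions that an empty label is readable by everyone and that mixing `&` and `|` requires parentheses.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expr):
    tokens, pos = [], 0
    expr = expr.strip()
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if not match:
            raise ValueError(f"bad visibility expression at {pos}: {expr!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expr, authorizations):
    """Return True if a user holding `authorizations` may read a cell labeled `expr`."""
    tokens = tokenize(expr)
    if not tokens:
        return True  # an empty label is readable by everyone
    pos = 0

    def term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = expression()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if tok in ("&", "|", ")"):
            raise ValueError(f"unexpected token {tok!r}")
        return tok in authorizations

    def expression():
        nonlocal pos
        value = term()
        op = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if op is not None and tokens[pos] != op:
                raise ValueError("mixing & and | requires parentheses")
            op = tokens[pos]
            pos += 1
            rhs = term()
            value = (value and rhs) if op == "&" else (value or rhs)
        return value

    result = expression()
    if pos != len(tokens):
        raise ValueError("unexpected trailing token")
    return result


# Hypothetical cells labeled with the two example visibilities from the slide.
cell_labels = {"salary": "private|system", "genome": "public&dna_partner"}
user_auths = {"public", "system"}
readable = [field for field, label in cell_labels.items() if is_visible(label, user_auths)]
print(readable)  # ['salary']
```

The user can read `salary` because one of the OR-ed labels matches, but not `genome`, because the AND requires both authorizations.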
Section 1: Accumulo: Splits
● Each table starts as a single tablet by default
● Splits can be added to tables
● Accumulo auto-splits tablets when they get too large
● Table splits and maximum tablet size are configurable
● Row IDs are generally hashed to support distribution
● Example splits based on hashing
○ 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f
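A minimal sketch of the hashing idea, assuming a one-character hex prefix taken from a SHA-256 digest (the choice of hash, the `_` separator, and the `customerN` IDs are illustrative assumptions, not the authors' scheme). A tablet holds the rows that sort after the previous split point up to and including its own split point, so a binary search over the split list finds the tablet for a row.

```python
import bisect
import hashlib
from collections import Counter

# Split points from the slide. 16 split points produce 17 tablets.
SPLITS = list("0123456789abcdef")


def hashed_row_id(natural_id):
    # Prefix the natural ID with one hex character of its hash so that
    # sequential IDs spread across tablets instead of piling into one.
    digest = hashlib.sha256(natural_id.encode("utf-8")).hexdigest()
    return f"{digest[0]}_{natural_id}"


def tablet_index(row_id, splits=SPLITS):
    # Index of the first split point >= row_id; len(splits) means the last tablet.
    return bisect.bisect_left(splits, row_id)


counts = Counter(tablet_index(hashed_row_id(f"customer{i}")) for i in range(10000))
for index in sorted(counts):
    print(index, counts[index])
```

Because every prefixed row ID sorts after its bare prefix character, the tablet ending at `0` stays empty here and rows starting with `f` land in the final tablet; the other rows spread roughly evenly.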
Section 1: Accumulo: Reads
● Reads (are scans)
○ Scanner
○ BatchScanner (parallelizes over ranges)
● MapReduce/Spark
○ AccumuloInputFormat (one field at a time)
○ AccumuloRowInputFormat (one row at a time)
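To make the difference between the two scanners concrete, here is a toy model in plain Python: a sorted in-memory list stands in for a table, `scan` plays the role of a Scanner over one range, and `batch_scan` plays the role of a BatchScanner over several ranges. The function names and data are hypothetical and none of this is the Accumulo client API.

```python
import bisect
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table: (row ID, value) pairs kept sorted by row ID, as Accumulo keeps keys sorted.
TABLE = sorted((f"row{i:03d}", f"value{i}") for i in range(100))
ROWS = [row for row, _ in TABLE]


def scan(start_row, end_row):
    """Return entries with start_row <= row <= end_row, in sorted order."""
    lo = bisect.bisect_left(ROWS, start_row)
    hi = bisect.bisect_right(ROWS, end_row)
    return TABLE[lo:hi]


def batch_scan(ranges, threads=4):
    """Scan several ranges in parallel and concatenate the results.

    This sketch happens to return chunks in the order the ranges were given;
    a real BatchScanner makes no promise about result order.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        chunks = pool.map(lambda r: scan(*r), ranges)
        return [entry for chunk in chunks for entry in chunk]


print(scan("row010", "row012"))
print(len(batch_scan([("row000", "row004"), ("row090", "row099")])))
```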
Section 1: Accumulo: Writes
● Writes
○ Writer
○ BatchWriter (parallelizes over tablets)
● MapReduce/Spark
○ AccumuloOutputFormat
○ AccumuloFileOutputFormat (bulk ingest)
● BatchWriter and AccumuloOutputFormat use Mutations to write to Accumulo
Section 1: Accumulo: Mutations (write and delete)
● Mutations are used to write and delete
● Mutation.put (to write)
● Mutation.putDelete (to delete)
● Writes are upserts (inserts or updates)
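The upsert and delete behavior can be illustrated with a toy model. The `Mutation` class below is a plain-Python stand-in whose `put` and `put_delete` methods only mimic `Mutation.put` and `Mutation.putDelete`; the dictionary "table" and the bookkeeping values are made up. Real Accumulo also keeps timestamps and writes delete markers, which this sketch leaves out.

```python
class Mutation:
    """Toy stand-in for an Accumulo Mutation: a batch of changes to one row."""

    def __init__(self, row):
        self.row = row
        self.updates = []  # (family, qualifier, value), where value None means delete

    def put(self, family, qualifier, value):
        self.updates.append((family, qualifier, value))

    def put_delete(self, family, qualifier):
        self.updates.append((family, qualifier, None))


def apply_mutation(table, mutation):
    for family, qualifier, value in mutation.updates:
        key = (mutation.row, family, qualifier)
        if value is None:
            table.pop(key, None)  # delete the cell if it exists
        else:
            table[key] = value  # upsert: insert a new cell or overwrite an existing one


table = {}

first = Mutation("invoice-1")
first.put("amount", "usd", "100")
first.put("status", "state", "open")
apply_mutation(table, first)

second = Mutation("invoice-1")
second.put("amount", "usd", "120")  # same key, so this acts as an update
second.put_delete("status", "state")  # remove the status cell
apply_mutation(table, second)

print(table)  # {('invoice-1', 'amount', 'usd'): '120'}
```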
Section 1: Accumulo
● accumulo.apache.org
● Download accumulo
● Examples
● Documentation
Concerned about scaling? How about 4T nodes and 70T edges
in a graph? See:
http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
Section 1: Spark: MapReduce first
● Hadoop MapReduce (batch processing)
○ Mapping
○ Reducing
○ Chain jobs
○ 95% IO (each job must read/write to disk)
○ scalable
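The map, reduce, and chaining steps can be sketched in plain Python. This is a single-process illustration with hypothetical input lines, not Hadoop code: `run_job` maps each record, groups values by key (the shuffle), and reduces each group. The second job consumes the first job's output, which is the hand-off that Hadoop MapReduce performs through disk.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """One MapReduce job: map each record, group by key (shuffle), reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


lines = ["accumulo stores data", "spark processes data", "spark reads accumulo"]

# Job 1: word count.
word_counts = run_job(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: (word, sum(ones)),
)

# Job 2 (chained): group words by how often they occur.
# In Hadoop MapReduce the output of job 1 would be written to disk and read back here.
words_by_count = run_job(
    word_counts,
    mapper=lambda pair: [(pair[1], pair[0])],
    reducer=lambda count, words: (count, sorted(words)),
)

print(words_by_count)
# [(1, ['processes', 'reads', 'stores']), (2, ['accumulo', 'data', 'spark'])]
```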
Section 1: Spark
● Batch Processing - MapReduce (many more functions)
● Streaming - micro-batch processing
● Machine Learning - MLlib
● Graph Processing - GraphX
● Many Languages - (Java, Scala, Python, R)
Section 1: Spark
● spark.apache.org
● Download spark
● Example code
● Documentation
Section 1: Example Code
Simple examples for bookkeeping with Spark and Accumulo
https://github.com/matthewpurdy/purdy-good/tree/master/purdy-good-spark/purdy-good-spark-accumulo
Section 2: Use Case(s): Machine Learning and Graph Processing
● Multi-Tenant Data Processing
● Machine Learning / Graph Processing in Spark
● Example Use Case of ML + Graph on Business Data
Section 2: Multi-Tenant Data Processing Needs
Customer (C) | (P) & (C) | Provider (P)
Team | Customer Private | Customer Data shared w/ Provider | Private Provider Data for Economy of Scale
Sales, Marketing | IBM Indicators | Relationships, Classification | Classification Model, Relationship Graph
Marketing, Finance | Apple Indicators | Correlation, Prediction | Correlation Model, Prediction Model
Sales, Marketing, Finance | Microsoft Indicators | Relationships, Correlation, Prediction | Correlation Model, Prediction Model, Relationship Graph
Finance | Google Indicators | Correlation, Prediction | Correlation Model, Prediction Model
Section 2: Multi-Tenant Data Processing Needs
Customer (C) | (P) & (C) | Provider (P)
C User, C Team, C Management | C Management, P Analytics | P Analytics, P Support
CU Manager, CU Employee, CT Sales, CM Executive | CM Executive, CU Manager, PA * / PS * | PA * / PS *
CU Manager, CU Employee, CT Marketing, CM Executive | CM Executive, CU Manager, PA * / PS * | PA * / PS *
CU Employee, CT Research, CM Executive | CM Executive, CU Manager, PA * / PS * | PA * / PS *
CU Employee, CT Finance, CM Executive | CM Executive, CU Manager, PA * / PS * | PA * / PS *
Section 2: Multi-Tenant Data Processing Needs
● Analyze Sales team successes (closed accounts) to recommend companies
to target for Marketing campaigns.
● Analyze Sales team users' social accounts against social network users at the
recommended companies to create call lists.
● Correlate historic Marketing data (traffic & conversions) with historic Sales data
(leads & closed accounts) and historic Finance data (revenue & profit) to predict
sales from current Marketing & Sales activities.
Section 2: Out of the Box : MLlib in Spark
● Classification
● Regression
● Decision Trees
● Recommendation
● Clustering
● Topic Modeling
● Feature Transformations
● ML Pipelining / Persistence
● “Based on past performance in the companies in the CRM, the most successful
sales have come from these categories, so go after these companies.”
Section 2: Out of the Box : MLlib in Spark
● Load Data
● Extract Features
● Train Model
● Find Best Model
● Use Model to Predict
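The five steps above can be walked through end to end in plain Python. This is a deliberately tiny sketch, not MLlib: the "model" is a one-feature threshold rule (a decision stump), and the CRM rows, feature names, and the new company are all made up for illustration.

```python
# Load Data: hypothetical CRM rows of (employees, past_orders, deal_won).
train = [(50, 0, 0), (200, 1, 0), (120, 5, 1), (300, 7, 1), (80, 6, 1), (400, 2, 0)]
validation = [(60, 8, 1), (350, 1, 0), (90, 0, 0), (150, 9, 1)]
FEATURES = ["employees", "past_orders"]


def extract_features(row):
    # Extract Features: split a row into a feature vector and a label.
    *features, label = row
    return features, label


def predict(model, features):
    # A model is (feature index, threshold); predict "won" at or above the threshold.
    feature_index, threshold = model
    return features[feature_index] >= threshold


def accuracy(model, rows):
    pairs = [extract_features(row) for row in rows]
    return sum(predict(model, f) == bool(y) for f, y in pairs) / len(pairs)


def train_stump(rows, feature_index):
    # Train Model: try each observed value as a threshold, keep the most accurate one.
    candidates = sorted({extract_features(row)[0][feature_index] for row in rows})
    return max(
        ((feature_index, t) for t in candidates),
        key=lambda model: accuracy(model, rows),
    )


# Find Best Model: one stump per feature, compared on held-out validation rows.
models = [train_stump(train, i) for i in range(len(FEATURES))]
best = max(models, key=lambda model: accuracy(model, validation))

# Use Model to Predict: score a new, hypothetical company.
new_company = [500, 6]
print(FEATURES[best[0]], best[1], predict(best, new_company))  # past_orders 5 True
```

In MLlib the same shape appears with distributed data and real estimators; only the step order is meant to carry over from this sketch.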
Section 2: KeystoneML - End to End ML
http://keystone-ml.org/
Section 2: Out of the Box : GraphX in Spark
● PageRank
● Connected components
● Label propagation
● SVD++
● Strongly connected components
● Triangle count
● “Based on the social graph of sales team members and the companies in your
CRM, talk to the companies you are “closest” to.”
Section 2: Out of the Box : GraphX in Spark
● Load Vertices (nodes) RDD
● Load Edges RDD
● Create Graph from
Vertices & Edges RDDs
● Run Graph Process /
Query
● Get Data
http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
Section 2: Out of the Box : GraphX in Spark
● Load Edges into Graph
● Run PageRank
● Load Nodes into RDD
● Join Users RDD with Rank
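The same four steps can be traced on a toy graph in plain Python. This is the textbook power-iteration form of PageRank with ranks normalized to sum to 1, not GraphX's implementation, and the users and "follows" edges are hypothetical.

```python
# Hypothetical data: users (vertex ID -> name) and "follows" edges (source -> destination).
users = {1: "alice", 2: "bob", 3: "carol", 4: "dave"}
edges = [(1, 2), (3, 2), (4, 2), (2, 1), (4, 1)]


def page_rank(vertices, edges, damping=0.85, iterations=100):
    n = len(vertices)
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices without outgoing edges is spread evenly over all vertices.
        dangling = sum(rank[v] for v in vertices if out_degree[v] == 0)
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in vertices}
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Run PageRank on the edges, then join the ranks with the user names.
ranks = page_rank(users, edges)
ranked_users = sorted(((users[v], r) for v, r in ranks.items()), key=lambda pair: -pair[1])
for name, score in ranked_users:
    print(f"{name}: {score:.4f}")
```

Here `bob` ends up ranked highest because three of the four users point at him, with `alice` second.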
Questions and Answers
?
Contact Information
Matthew Purdy
● matthew.purdy@purdygoodengineering.com
● http://www.purdygoodengineering.com
● https://www.linkedin.com/in/matthewpurdy
● https://github.com/matthewpurdy
Rahul Singh
● rahul.singh@anant.us
● http://www.anant.us
● http://www.linkedin.com/in/xingh
● https://github.com/xingh

Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 

Kürzlich hochgeladen (20)

Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 

Machine Learning & Graph Processing w/ Spark and Accumulo

  • 2. Introduction ● Section 1: Understanding the Technology ○ Big Picture ○ Accumulo ○ Spark ○ Example Code ● Section 2: Use Cases ○ Multi-Tenant Data Processing ○ Machine Learning / Graph Processing in Spark ○ Example ML + Graph on Business Data ● Questions and Answers ● Contact Information
  • 3. Section 1: Understanding the Technology ○ Big Picture ■ Why Accumulo ■ Why Spark ○ Accumulo ■ Key/Value Structure ■ Table Structure ■ Cell Level Security ■ Splits ■ Reads (scans) ■ Writes (upserts) ■ Deletes ○ Spark ■ Batch/Streaming ■ Machine Learning ■ Graph Processing ○ Example Code ■ Writing to Accumulo ■ Reading from Accumulo ■ Shell
  • 4. Section 1: Big Picture ● Accumulo ○ Scalable, sorted, distributed key/value store with cell-level security ● Spark ○ General compute engine for large-scale data processing ■ Batch Processing ■ Streaming ■ Machine Learning Library ■ Graph Processing ● Use Spark for compute and Accumulo for storage to get a secure, distributed, scalable solution
  • 6. Section 1: Accumulo: Key Structure (image from accumulo.apache.org)
  • 7. Section 1: Accumulo: Key Structure ● Accumulo Table Design vs. RDBMS Table Design
  • 8. Section 1: Accumulo: Table Structure ● Each table has many tablets (distributed across nodes) ● Tablet data is stored in HDFS, which replicates it (default replication factor is 3) ● All of a row's data resides on a single tablet ○ A Row Id design strategy needs to ensure rows are binned evenly across tablets ○ Each table has “splits” which determine binning ○ If individual rows are still too large, a sharding strategy is required
  • 9. Section 1: Accumulo: Cell Level Security ● Each cell (or field) has its own access control, determined by its visibility ● Each user has authorizations which correspond to visibilities ● A user can retrieve only the fields whose visibilities are satisfied by that user's authorizations ● Visibility expressions support limited logic: AND (&), OR (|) and parentheses ○ e.g. private | system ○ e.g. public & dna_partner
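The visibility rules on this slide can be illustrated with a small stand-alone evaluator. This is a minimal sketch in plain Python, not Accumulo's own implementation: the function name `is_visible` is hypothetical, labels are assumed to be simple alphanumeric/underscore tokens, and mixing `&` and `|` at the same level without parentheses is rejected.

```python
import re


def is_visible(expression, authorizations):
    """Return True if the set of authorizations satisfies the visibility expression.

    Illustrative only: a toy recursive-descent evaluator, not the Accumulo API.
    """
    tokens = re.findall(r"[A-Za-z0-9_]+|[&|()]", expression)
    pos = 0

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_expression()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if token in ("&", "|", ")"):
            raise ValueError("unexpected token: " + token)
        # A plain label is satisfied when the user holds that authorization.
        return token in authorizations

    def parse_expression():
        nonlocal pos
        result = parse_term()
        operator = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if operator is None:
                operator = tokens[pos]
            elif tokens[pos] != operator:
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            term = parse_term()
            result = (result and term) if operator == "&" else (result or term)
        return result

    if not tokens:
        # Assumption: an empty visibility is readable by everyone.
        return True
    result = parse_expression()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return result


if __name__ == "__main__":
    print(is_visible("private | system", {"system"}))
    print(is_visible("public & dna_partner", {"public"}))
```

With the slide's examples, a user holding only `system` satisfies `private | system`, while a user holding only `public` does not satisfy `public & dna_partner`.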
  • 10. Section 1: Splits ● Each table starts as a single tablet (no splits by default) ● Splits can be added to tables ● Accumulo auto-splits tablets when they get too large ● Table splits and the maximum tablet size are configurable ● Row ids are generally hashed to support distribution ● Example splits based on hashing ○ 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f
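The effect of hashing row ids against the example splits can be sketched as follows. This is an illustration in plain Python rather than Accumulo client code: `hashed_row_id` and `tablet_index` are hypothetical helpers, SHA-256 with a one-character hex prefix is an assumed hashing scheme, and each split point is assumed to be the inclusive last row of a tablet.

```python
import hashlib
from bisect import bisect_left
from collections import Counter

# The example split points from the slide: one per hex character.
SPLITS = list("0123456789abcdef")


def hashed_row_id(natural_key, prefix_len=1):
    """Prefix the natural key with the first hex characters of its hash."""
    digest = hashlib.sha256(natural_key.encode("utf-8")).hexdigest()
    return digest[:prefix_len] + "_" + natural_key


def tablet_index(row_id, splits=SPLITS):
    """Index of the tablet whose range contains row_id.

    Tablet i holds rows greater than splits[i - 1] and up to splits[i];
    the last tablet holds everything after the final split point.
    """
    return bisect_left(splits, row_id)


if __name__ == "__main__":
    keys = ["customer%05d" % n for n in range(10000)]

    # Sequential natural keys all sort into the same tablet.
    unhashed = Counter(tablet_index(key) for key in keys)
    print("tablets used without hashing:", len(unhashed))

    # Hash-prefixed keys spread across the tablets defined by the splits.
    hashed = Counter(tablet_index(hashed_row_id(key)) for key in keys)
    for tablet, count in sorted(hashed.items()):
        print("tablet", tablet, "rows", count)
```

The trade-off is that a hashed prefix gives up range scans over the natural key order, so this design suits lookups by exact key and full-table processing better than ordered scans.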
  • 11. Section 1: Accumulo: Reads ● Reads (are scans) ○ Scanner ○ BatchScanner (parallelizes over ranges) ● MapReduce/Spark ○ AccumuloInputFormat (one field, i.e. one key/value pair, at a time) ○ AccumuloRowInputFormat (one row at a time)
  • 12. Section 1: Accumulo: Writes ● Writes ○ Writer ○ BatchWriter (parallelizes over tablets) ● MapReduce/Spark ○ AccumuloOutputFormat ○ AccumuloFileOutputFormat (bulk ingest) ● Both use Mutations to write to Accumulo
  • 13. Section 1: Accumulo: Mutations (write and delete) ● Mutations are used to write and delete ● Mutation.put (to write) ● Mutation.putDelete (to delete) ● Writes are upserts (inserts or updates)
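The upsert and delete behaviour described here can be modelled with a toy in-memory table. This sketch is plain Python and does not use the Accumulo API: the class `TinyTable` and its method names are hypothetical, it keeps only the newest version of each cell, and a delete is recorded as a marker that hides the cell from scans.

```python
class TinyTable:
    """Toy model of upsert and delete semantics in a sorted key/value table."""

    _DELETED = object()  # marker standing in for a delete entry

    def __init__(self):
        # (row, family, qualifier) -> value or delete marker
        self._cells = {}

    def put(self, row, family, qualifier, value):
        """Insert the cell if it is new, otherwise overwrite it (an upsert)."""
        self._cells[(row, family, qualifier)] = value

    def put_delete(self, row, family, qualifier):
        """Record a delete marker for the cell."""
        self._cells[(row, family, qualifier)] = self._DELETED

    def scan(self):
        """Yield live cells in sorted key order, skipping deleted ones."""
        for key in sorted(self._cells):
            value = self._cells[key]
            if value is not self._DELETED:
                yield key, value


if __name__ == "__main__":
    table = TinyTable()
    table.put("row1", "contact", "email", "old@example.com")
    table.put("row1", "contact", "email", "new@example.com")  # update in place
    table.put("row1", "contact", "phone", "555-0100")
    table.put_delete("row1", "contact", "phone")
    for key, value in table.scan():
        print(key, value)
```

There is no separate insert and update call in this model: writing the same key twice simply replaces the visible value, which is the sense in which writes are upserts.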
  • 14. Section 1: Accumulo ● accumulo.apache.org ● Download Accumulo ● Examples ● Documentation ● Concerned about scaling? How about a graph with 4T nodes and 70T edges => see http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
  • 16. Section 1: Spark: MapReduce first ● Hadoop MapReduce (batch processing) ○ Mapping ○ Reducing ○ Chained jobs ○ 95% IO (each job must read from and write to disk) ○ Scalable
  • 17. Section 1: Spark ● Batch Processing - MapReduce-style (with many more functions) ● Streaming - micro-batch processing ● Machine Learning - MLlib ● Graph Processing - GraphX ● Many Languages - (Java, Scala, Python, R)
  • 18. Section 1: Spark ● spark.apache.org ● Download Spark ● Example code ● Documentation
  • 20. Section 1: Example Code ● Simple examples for bookkeeping with Spark and Accumulo: https://github.com/matthewpurdy/purdy-good/tree/master/purdy-good-spark/purdy-good-spark-accumulo
  • 21. Section 2: Use Case(s): Machine Learning and Graph Processing ● Multi-Tenant Data Processing ● Machine Learning / Graph Processing in Spark ● Example Use Case of ML + Graph on Business Data
  • 22. Section 2: Multi-Tenant Data Processing Needs Customer (C) (P) & (C) Provider (P) Team Customer Private Customer Data shared w/ Provider Private Provider Data for Economy of Scale Sales Marketing IBM Indicators Relationships Classification Classification Model Relationship Graph Marketing Finance Apple Indicators Correlation Prediction Correlation Model Prediction Model Sales Marketing Finance Microsoft Indicators Relationships Correlation Prediction Correlation Model Prediction Model Relationship Graph Finance Google Indicators Correlation Prediction Correlation Model Prediction Model
  • 23. Section 2: Multi-Tenant Data Processing Needs Customer (C) (P) & (C) Provider (P) C User C Team C Management C Management P Analytics P Analytics P Support CU Manager CU Employee CT Sales CM Executive CM Executive CU Manager PA * / PS * PA * / PS * CU Manager CU Employee CT Marketing CM Executive CM Executive CU Manager PA * / PS * PA * / PS * CU Employee CT Research CM Executive CM Executive CU Manager PA * / PS * PA * / PS * CU Employee CT Finance CM Executive CM Executive CU Manager PA * / PS * PA * / PS *
  • 24. Section 2: Multi-Tenant Data Processing Needs ● Analyze Sales Team successes (Closed Accounts) to recommend companies to target for Marketing campaigns ● Analyze Sales Team users' social accounts against social network users and the recommended companies to create Call Lists ● Correlate historic Marketing (Traffic & Conversions), historic Sales (Leads & Closed Accounts) and historic Finance (Revenue & Profit) data to predict Sales from current Marketing & Sales activities
  • 25. Section 2: Out of the Box: MLlib in Spark ● Classification ● Regression ● Decision Trees ● Recommendation ● Clustering ● Topic Modeling ● Feature Transformations ● ML Pipelining / Persistence ● “Based on past performance of the companies in the CRM, the most successful sales have come from these categories, so go after these companies.”
  • 26. Section 2: Out of the Box: MLlib in Spark ● Load Data ● Extract Features ● Train Model ● Find Best Model ● Use Model to Predict
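The five steps on this slide can be walked through end to end on a toy data set. This is a minimal sketch in plain Python, not MLlib: the CRM rows are made up for illustration, a k-nearest-neighbour classifier stands in for whatever model a real pipeline would train, and every name (`RECORDS`, `extract_features`, `knn_predict`, `select_best_k`) is hypothetical.

```python
from collections import Counter

# 1. Load data: made-up CRM rows of (industry, employee_count, deal_closed).
RECORDS = [
    ("software", 120, 1), ("software", 300, 1), ("software", 80, 1),
    ("retail", 40, 0), ("retail", 900, 0), ("retail", 60, 0),
    ("finance", 500, 1), ("finance", 700, 1), ("finance", 30, 0),
    ("software", 20, 0), ("retail", 450, 0), ("finance", 650, 1),
]
INDUSTRIES = ["software", "retail", "finance"]


def extract_features(industry, employees):
    """2. Extract features: one-hot industry plus a scaled company size."""
    one_hot = [1.0 if industry == name else 0.0 for name in INDUSTRIES]
    return one_hot + [min(employees, 1000) / 1000.0]


def knn_predict(train, features, k):
    """3. 'Train' and apply a k-nearest-neighbour model by majority vote."""
    def squared_distance(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], features))

    nearest = sorted(train, key=squared_distance)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, held_out, k):
    """Fraction of held-out rows whose label is predicted correctly."""
    hits = sum(1 for feats, label in held_out if knn_predict(train, feats, k) == label)
    return hits / len(held_out)


def select_best_k(train, validation, candidates=(1, 3, 5)):
    """4. Find the best model: pick the k with the highest validation accuracy."""
    return max(candidates, key=lambda k: accuracy(train, validation, k))


if __name__ == "__main__":
    rows = [(extract_features(ind, emp), label) for ind, emp, label in RECORDS]
    train, validation = rows[:9], rows[9:]
    best_k = select_best_k(train, validation)
    # 5. Use the model to predict whether a new prospect is worth pursuing.
    prospect = extract_features("software", 250)
    print("best k:", best_k, "prediction:", knn_predict(train, prospect, best_k))
```

In Spark the same shape would typically be expressed with distributed data sets and library-provided estimators, with model selection driven by a validation split or cross-validation.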
  • 27. Section 2: KeystoneML - End to End ML http://keystone-ml.org/
  • 28. Section 2: Out of the Box: GraphX in Spark ● PageRank ● Connected components ● Label propagation ● SVD++ ● Strongly connected components ● Triangle count ● “Based on the social graph of sales team members and the companies in your CRM, talk to the companies you are closest to.”
  • 29. Section 2: Out of the Box: GraphX in Spark ● Load Vertices (Nodes) RDD ● Load Edges RDD ● Create Graph from Vertices & Edges RDDs ● Run Graph Process / Query ● Get Data ● http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
  • 30. Section 2: Out of the Box: GraphX in Spark ● Load Edges into Graph ● Run PageRank ● Load Nodes into RDD ● Join Users RDD with Rank
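The four steps on this slide can be sketched on a tiny graph. This is a plain-Python illustration, not GraphX: `page_rank` is a hypothetical helper implementing the textbook power-iteration form of PageRank (damping factor 0.85, ranks normalized to sum to 1), and the edge list and user names are made up.

```python
def page_rank(edges, damping=0.85, iterations=20):
    """Compute PageRank by power iteration over a list of (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


def join_users_with_rank(users, rank):
    """Join user names with ranks on node id, highest rank first."""
    joined = [(users[node], score) for node, score in rank.items() if node in users]
    return sorted(joined, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Load edges: (follower, followed) pairs between hypothetical user ids.
    edges = [(1, 3), (2, 3), (4, 3), (3, 1), (2, 1)]
    # Load nodes: user id -> display name.
    users = {1: "alice", 2: "bob", 3: "carol", 4: "dave"}
    for name, score in join_users_with_rank(users, page_rank(edges)):
        print(name, round(score, 4))
```

The join at the end mirrors the last bullet: ranks are keyed by node id, so attaching names is a lookup against the separately loaded node data.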
  • 32. Contact Information ● Matthew Purdy ○ matthew.purdy@purdygoodengineering.com ○ http://www.purdygoodengineering.com ○ https://www.linkedin.com/in/matthewpurdy ○ https://github.com/matthewpurdy ● Rahul Singh ○ rahul.singh@anant.us ○ http://www.anant.us ○ http://www.linkedin.com/in/xingh ○ https://github.com/xingh