SlideShare a Scribd company logo
1 of 32
HBase for Dealing with
Large Matrices
Who am I?
Leads data team at Dilisim
Researcher at Anadolu University
Machine Learning
Some big problems
Classifying huge text collections
Recommending to millions of users
Predicting links in a social network
Recommender Systems
Recommenders input large sparse
matrices
How would you input a millions X
millions matrix?
Recommender Systems
m users
3.00 0.00 2.00 0.00 4.00 2.00 3.00 2.00 1.00 3.00 0.00 3.00 0.00 2.00 0.00 4.00 2.00 3.00 2.00 1.00 3.00 0.00 …
2.00 3.00 2.00 1.00 1.00 4.00 2.00 3.00 0.00 2.00 3.00 2.00 3.00 2.00 1.00 1.00 4.00 2.00 3.00 0.00 2.00 3.00…
3.00 4.00 3.00 2.00 3.00 4.00 1.00 1.00 1.00 1.00 3.00 3.00 4.00 3.00 2.00 3.00 4.00 1.00 1.00 1.00 1.00 3.00 …
0.00 0.00 0.00 0.00 4.00 1.00 3.00 4.00 2.00 1.00 0.00 0.00 0.00 0.00 0.00 4.00 1.00 3.00 4.00 2.00 1.00 0.00 …
0.00 2.00 3.00 2.00 2.00 4.00 3.00 0.00 3.00 4.00 2.00 0.00 2.00 3.00 2.00 2.00 4.00 3.00 0.00 3.00 4.00 2.00 …
4.00 2.00 1.00 4.00 2.00 3.00 3.00 0.00 3.00 1.00 2.00 4.00 2.00 1.00 4.00 2.00 3.00 3.00 0.00 3.00 1.00 2.00…
4.00 0.00 4.00 2.00 2.00 3.00 3.00 3.00 2.00 0.00 0.00 4.00 0.00 4.00 2.00 2.00 3.00 3.00 3.00 2.00 0.00 0.00…
1.00 3.00 2.00 4.00 2.00 3.00 0.00 0.00 3.00 0.00 3.00 1.00 3.00 2.00 4.00 2.00 3.00 0.00 0.00 3.00 0.00 3.00 …
1.00 4.00 1.00 1.00 3.00 4.00 2.00 2.00 0.00 1.00 4.00 1.00 4.00 1.00 1.00 3.00 4.00 2.00 2.00 0.00 1.00 4.00…
3.00 3.00 2.00 3.00 4.00 1.00 4.00 0.00 1.00 0.00 1.00 3.00 3.00 2.00 3.00 4.00 1.00 4.00 0.00 1.00 0.00 1.00…
0.00 1.00 2.00 4.00 2.00 2.00 3.00 4.00 4.00 4.00 1.00 0.00 1.00 2.00 4.00 2.00 2.00 3.00 4.00 4.00 4.00 1.00…
4.00 0.00 4.00 2.00 2.00 3.00 3.00 3.00 2.00 0.00 0.00 4.00 0.00 4.00 2.00 2.00 3.00 3.00 3.00 2.00 0.00 0.00…
1.00 3.00 2.00 4.00 2.00 3.00 0.00 0.00 3.00 0.00 3.00 1.00 3.00 2.00 4.00 2.00 3.00 0.00 0.00 3.00 0.00 3.00 …
3.00 4.00 3.00 2.00 3.00 4.00 1.00 1.00 1.00 1.00 3.00 3.00 4.00 3.00 2.00 3.00 4.00 1.00 1.00 1.00 1.00 3.00 …
0.00 0.00 0.00 0.00 4.00 1.00 3.00 4.00 2.00 1.00 0.00 0.00 0.00 0.00 0.00 4.00 1.00 3.00 4.00 2.00 1.00 0.00 …
…………………………………………………………………………………………………………………… .
…………………………………………………………………………………………………………………… .
………………………………………………………………………………………………………………… … .
n items
Input
Recommender Systems
State-of-the-art recommender systems learn
large models
One factor vector per each user and item
One parameter vector (on side info) per
each user and item
Recommender Systems
m users
3.00 0.00 2.00 0.00 4.00 2.00 3.00 2.00 1.00 3.00 0.00 …
2.00 3.00 2.00 1.00 1.00 4.00 2.00 3.00 0.00 2.00 3.00 …
3.00 4.00 3.00 2.00 3.00 4.00 1.00 1.00 1.00 1.00 3.00 …
0.00 0.00 0.00 0.00 4.00 1.00 3.00 4.00 2.00 1.00 0.00 …
0.00 2.00 3.00 2.00 2.00 4.00 3.00 0.00 3.00 4.00 2.00 …
4.00 2.00 1.00 4.00 2.00 3.00 3.00 0.00 3.00 1.00 2.00 …
4.00 0.00 4.00 2.00 2.00 3.00 3.00 3.00 2.00 0.00 0.00 …
1.00 3.00 2.00 4.00 2.00 3.00 0.00 0.00 3.00 0.00 3.00 …
1.00 4.00 1.00 1.00 3.00 4.00 2.00 2.00 0.00 1.00 4.00 …
3.00 3.00 2.00 3.00 4.00 1.00 4.00 0.00 1.00 0.00 1.00 …
0.00 1.00 2.00 4.00 2.00 2.00 3.00 4.00 4.00 4.00 1.00 …
………………………………………………………… .
………………………………………………………… .
………………………………………………………… .
n items
Input
User Model
Item Model
m x k
n x k
0.54 0.48 0.83 0.75 0.28 …
0.02 0.29 0.99 0.85 0.68 …
0.05 0.53 0.60 0.98 0.19 …
0.52 0.47 0.50 0.12 0.98 …
0.26 0.39 0.29 0.91 0.50 …
0.15 0.43 0.66 0.07 0.51 …
0.52 0.36 0.01 0.87 0.53 …
…………………………. .
………………………….. .
…………………………... .
0.93 0.78 0.56 0.77 0.75 …
0.21 0.44 0.99 0.01 0.00 …
0.04 0.42 0.36 0.72 0.19 …
0.77 0.07 0.24 0.67 0.87 …
0.42 0.79 0.62 0.80 0.79 …
0.42 0.32 0.26 0.50 0.85 …
0.94 0.76 0.93 0.34 0.46 …
…………………………. .
………………………….. .
…………………………... .
Learning Process
What does a machine learning algorithm
require to do with that matrix?
Machine Learning - Techniques
Batch Learning
All parameters are updated once per
iteration
Machine Learning - Techniques
Batch Learning
Updates can be calculated in parallel
using MapReduce
(SequenceFile might be enough)
Machine Learning - Techniques
Batch Learning
Output model should provide random
access to rows
Machine Learning - Techniques
Online Learning
Parameters are updated per training
example
Machine Learning - Techniques
Online Learning
Each update results in updates in
a row
Needs random access while learning
Machine Learning - Techniques
Online Learning
Output model should provide random
access to rows
Deployment Process
How do you decide to deploy a machine
learning model in production?
Machine Learning - Deployment
Usual process
Works
good?
Deploy in
production
Experiment
on prototype
Y
N
Machine Learning - Deployment
How would you turn your prototype into
production easily?
Common matrix interface for in-
memory and persistent versions
HBase Backed Matrix
Implements Mahout matrix
Dense or sparse
HBase Backed Matrix
Random access to cells
Random access to rows
Iteration over rows
Lazy loading while iterating
HBase Backed Matrix
Common interface for prototype and
product
Easy to deploy (Model already persisted)
HBase Backed Matrix
Matrix operations with existing mahout-
math library
Logical Schema
Composite row keys:
12_0:
12_9:
12_22000:
data:value:0.41
data:value:0.41
data:value:0.41
Logical Schema
Composite row keys:
Row access by scan
Cell access by get
Atomic row update should be
handled in application
Logical Schema
Row indices as row keys
12:
data:0:0.41 data:22000:0.41data:9:0.41
Logical Schema
Row indices as row keys
Atomic updates are handled
automatically
Speed – Cell access/write
GET SET
row index as row key
composite row key
Speed – Row access/write
GET SET
row index as row key
composite row key
Code
github.com/gcapan/mahout/tree/hbase-matrix
Future Work
MatrixInputFormat
Might replace SequenceFile based
MapReduce inputs
Future Work – A little digression
Recommender Systems
Calculating score for a user-item
pair is easy with HBaseMatrix
Future Work – A little digression
Recommender Systems
top-N recommendation?
All candidate items for a user in
the user row as a nested entity
(See Ian Varley's HBase Schema
Design)
Thank you!

More Related Content

Similar to HBase for Dealing with Large Matrices

report.doc
report.docreport.doc
report.doc
butest
 
Anti-MOOCs: The design of MACROSIMs
Anti-MOOCs: The design of MACROSIMsAnti-MOOCs: The design of MACROSIMs
Anti-MOOCs: The design of MACROSIMs
dws1d
 
Stacy Fileccia TW Portfolio 2016
Stacy Fileccia TW Portfolio 2016Stacy Fileccia TW Portfolio 2016
Stacy Fileccia TW Portfolio 2016
Stacy Fileccia
 

Similar to HBase for Dealing with Large Matrices (20)

IoT Analytics Workshop (IOT314-R1) - AWS re:Invent 2018
IoT Analytics Workshop (IOT314-R1) - AWS re:Invent 2018IoT Analytics Workshop (IOT314-R1) - AWS re:Invent 2018
IoT Analytics Workshop (IOT314-R1) - AWS re:Invent 2018
 
Big data technologies : A survey
Big data technologies : A survey Big data technologies : A survey
Big data technologies : A survey
 
Assessing the consistency, quality, and completeness of the Reviewed Event Bu...
Assessing the consistency, quality, and completeness of the Reviewed Event Bu...Assessing the consistency, quality, and completeness of the Reviewed Event Bu...
Assessing the consistency, quality, and completeness of the Reviewed Event Bu...
 
Is observability good for your brain?
Is observability good for your brain?Is observability good for your brain?
Is observability good for your brain?
 
Advanced online search through the web
Advanced online search through the webAdvanced online search through the web
Advanced online search through the web
 
03 GlobalAIBootcamp2020Lisboa-Rock, Paper, Scissors.pptx
03 GlobalAIBootcamp2020Lisboa-Rock, Paper, Scissors.pptx03 GlobalAIBootcamp2020Lisboa-Rock, Paper, Scissors.pptx
03 GlobalAIBootcamp2020Lisboa-Rock, Paper, Scissors.pptx
 
Crab: A Python Framework for Building Recommender Systems
Crab: A Python Framework for Building Recommender Systems Crab: A Python Framework for Building Recommender Systems
Crab: A Python Framework for Building Recommender Systems
 
An overview of text mining and sentiment analysis for Decision Support System
An overview of text mining and sentiment analysis for Decision Support SystemAn overview of text mining and sentiment analysis for Decision Support System
An overview of text mining and sentiment analysis for Decision Support System
 
Lazard network correlation_architecture
Lazard network correlation_architectureLazard network correlation_architecture
Lazard network correlation_architecture
 
Chatbot mohinh sinh
Chatbot mohinh sinhChatbot mohinh sinh
Chatbot mohinh sinh
 
CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...
CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...
CS8075 - Data Warehousing and Data Mining (Ripped from Amazon Kindle eBooks b...
 
Recommendation Subsystem - Museum Radar
Recommendation Subsystem - Museum RadarRecommendation Subsystem - Museum Radar
Recommendation Subsystem - Museum Radar
 
Cloud Lunch and Learn ML.NET MACHINE LEARNING (AND DEEP LEARNING) FOR THE CSh...
Cloud Lunch and Learn ML.NET MACHINE LEARNING (AND DEEP LEARNING) FOR THE CSh...Cloud Lunch and Learn ML.NET MACHINE LEARNING (AND DEEP LEARNING) FOR THE CSh...
Cloud Lunch and Learn ML.NET MACHINE LEARNING (AND DEEP LEARNING) FOR THE CSh...
 
There and Back Again - A Tale of Programming Languages
There and Back Again - A Tale of Programming LanguagesThere and Back Again - A Tale of Programming Languages
There and Back Again - A Tale of Programming Languages
 
report.doc
report.docreport.doc
report.doc
 
Rock Paper Scissors with MLNET.pptx
Rock Paper Scissors with MLNET.pptxRock Paper Scissors with MLNET.pptx
Rock Paper Scissors with MLNET.pptx
 
Machine Learning for Designers - UX Camp Switzerland
Machine Learning for Designers - UX Camp SwitzerlandMachine Learning for Designers - UX Camp Switzerland
Machine Learning for Designers - UX Camp Switzerland
 
Search Engine Risk Dependency by Ronan Chardennau
Search Engine Risk Dependency by Ronan ChardennauSearch Engine Risk Dependency by Ronan Chardennau
Search Engine Risk Dependency by Ronan Chardennau
 
Anti-MOOCs: The design of MACROSIMs
Anti-MOOCs: The design of MACROSIMsAnti-MOOCs: The design of MACROSIMs
Anti-MOOCs: The design of MACROSIMs
 
Stacy Fileccia TW Portfolio 2016
Stacy Fileccia TW Portfolio 2016Stacy Fileccia TW Portfolio 2016
Stacy Fileccia TW Portfolio 2016
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

HBase for Dealing with Large Matrices

  • 1. HBase for Dealing with Large Matrices
  • 2. Who am I? Leads data team at Dilisim Researcher at Anadolu University
  • 3. Machine Learning Some big problems Classifying huge text collections Recommending to millions of users Predicting links in a social network
  • 4. Recommender Systems Recommenders input large sparse matrices How would you input a millions X millions matrix?
  • 5. Recommender Systems m users 3.00 0.00 2.00 0.00 4.00 2.00 3.00 2.00 1.00 3.00 0.00 3.00 0.00 2.00 0.00 4.00 2.00 3.00 2.00 1.00 3.00 0.00 … 2.00 3.00 2.00 1.00 1.00 4.00 2.00 3.00 0.00 2.00 3.00 2.00 3.00 2.00 1.00 1.00 4.00 2.00 3.00 0.00 2.00 3.00… 3.00 4.00 3.00 2.00 3.00 4.00 1.00 1.00 1.00 1.00 3.00 3.00 4.00 3.00 2.00 3.00 4.00 1.00 1.00 1.00 1.00 3.00 … 0.00 0.00 0.00 0.00 4.00 1.00 3.00 4.00 2.00 1.00 0.00 0.00 0.00 0.00 0.00 4.00 1.00 3.00 4.00 2.00 1.00 0.00 … 0.00 2.00 3.00 2.00 2.00 4.00 3.00 0.00 3.00 4.00 2.00 0.00 2.00 3.00 2.00 2.00 4.00 3.00 0.00 3.00 4.00 2.00 … 4.00 2.00 1.00 4.00 2.00 3.00 3.00 0.00 3.00 1.00 2.00 4.00 2.00 1.00 4.00 2.00 3.00 3.00 0.00 3.00 1.00 2.00… 4.00 0.00 4.00 2.00 2.00 3.00 3.00 3.00 2.00 0.00 0.00 4.00 0.00 4.00 2.00 2.00 3.00 3.00 3.00 2.00 0.00 0.00… 1.00 3.00 2.00 4.00 2.00 3.00 0.00 0.00 3.00 0.00 3.00 1.00 3.00 2.00 4.00 2.00 3.00 0.00 0.00 3.00 0.00 3.00 … 1.00 4.00 1.00 1.00 3.00 4.00 2.00 2.00 0.00 1.00 4.00 1.00 4.00 1.00 1.00 3.00 4.00 2.00 2.00 0.00 1.00 4.00… 3.00 3.00 2.00 3.00 4.00 1.00 4.00 0.00 1.00 0.00 1.00 3.00 3.00 2.00 3.00 4.00 1.00 4.00 0.00 1.00 0.00 1.00… 0.00 1.00 2.00 4.00 2.00 2.00 3.00 4.00 4.00 4.00 1.00 0.00 1.00 2.00 4.00 2.00 2.00 3.00 4.00 4.00 4.00 1.00… 4.00 0.00 4.00 2.00 2.00 3.00 3.00 3.00 2.00 0.00 0.00 4.00 0.00 4.00 2.00 2.00 3.00 3.00 3.00 2.00 0.00 0.00… 1.00 3.00 2.00 4.00 2.00 3.00 0.00 0.00 3.00 0.00 3.00 1.00 3.00 2.00 4.00 2.00 3.00 0.00 0.00 3.00 0.00 3.00 … 3.00 4.00 3.00 2.00 3.00 4.00 1.00 1.00 1.00 1.00 3.00 3.00 4.00 3.00 2.00 3.00 4.00 1.00 1.00 1.00 1.00 3.00 … 0.00 0.00 0.00 0.00 4.00 1.00 3.00 4.00 2.00 1.00 0.00 0.00 0.00 0.00 0.00 4.00 1.00 3.00 4.00 2.00 1.00 0.00 … …………………………………………………………………………………………………………………… . …………………………………………………………………………………………………………………… . ………………………………………………………………………………………………………………… … . n items Input
  • 6. Recommender Systems State-of-the-art recommender systems learn large models One factor vector per each user and item One parameter vector (on side info) per each user and item
  • 7. Recommender Systems m users 3.00 0.00 2.00 0.00 4.00 2.00 3.00 2.00 1.00 3.00 0.00 … 2.00 3.00 2.00 1.00 1.00 4.00 2.00 3.00 0.00 2.00 3.00 … 3.00 4.00 3.00 2.00 3.00 4.00 1.00 1.00 1.00 1.00 3.00 … 0.00 0.00 0.00 0.00 4.00 1.00 3.00 4.00 2.00 1.00 0.00 … 0.00 2.00 3.00 2.00 2.00 4.00 3.00 0.00 3.00 4.00 2.00 … 4.00 2.00 1.00 4.00 2.00 3.00 3.00 0.00 3.00 1.00 2.00 … 4.00 0.00 4.00 2.00 2.00 3.00 3.00 3.00 2.00 0.00 0.00 … 1.00 3.00 2.00 4.00 2.00 3.00 0.00 0.00 3.00 0.00 3.00 … 1.00 4.00 1.00 1.00 3.00 4.00 2.00 2.00 0.00 1.00 4.00 … 3.00 3.00 2.00 3.00 4.00 1.00 4.00 0.00 1.00 0.00 1.00 … 0.00 1.00 2.00 4.00 2.00 2.00 3.00 4.00 4.00 4.00 1.00 … ………………………………………………………… . ………………………………………………………… . ………………………………………………………… . n items Input User Model Item Model m x k n x k 0.54 0.48 0.83 0.75 0.28 … 0.02 0.29 0.99 0.85 0.68 … 0.05 0.53 0.60 0.98 0.19 … 0.52 0.47 0.50 0.12 0.98 … 0.26 0.39 0.29 0.91 0.50 … 0.15 0.43 0.66 0.07 0.51 … 0.52 0.36 0.01 0.87 0.53 … …………………………. . ………………………….. . …………………………... . 0.93 0.78 0.56 0.77 0.75 … 0.21 0.44 0.99 0.01 0.00 … 0.04 0.42 0.36 0.72 0.19 … 0.77 0.07 0.24 0.67 0.87 … 0.42 0.79 0.62 0.80 0.79 … 0.42 0.32 0.26 0.50 0.85 … 0.94 0.76 0.93 0.34 0.46 … …………………………. . ………………………….. . …………………………... .
  • 8. Learning Process What does a machine learning algorithm require to do with that matrix?
  • 9. Machine Learning - Techniques Batch Learning All parameters are updated once per iteration
  • 10. Machine Learning - Techniques Batch Learning Updates can be calculated in parallel using MapReduce (SequenceFile might be enough)
  • 11. Machine Learning - Techniques Batch Learning Output model should provide random access to rows
  • 12. Machine Learning - Techniques Online Learning Parameters are updated per training example
  • 13. Machine Learning - Techniques Online Learning Each update results in updates in a row Needs random access while learning
  • 14. Machine Learning - Techniques Online Learning Output model should provide random access to rows
  • 15. Deployment Process How do you decide to deploy a machine learning model in production?
  • 16. Machine Learning - Deployment Usual process Works good? Deploy in production Experiment on prototype Y N
  • 17. Machine Learning - Deployment How would you turn your prototype into production easily? Common matrix interface for in- memory and persistent versions
  • 18. HBase Backed Matrix Implements Mahout matrix Dense or sparse
  • 19. HBase Backed Matrix Random access to cells Random access to rows Iteration over rows Lazy loading while iterating
  • 20. HBase Backed Matrix Common interface for prototype and product Easy to deploy (Model already persisted)
  • 21. HBase Backed Matrix Matrix operations with existing mahout- math library
  • 22. Logical Schema Composite row keys: 12_0: 12_9: 12_22000: data:value:0.41 data:value:0.41 data:value:0.41
  • 23. Logical Schema Composite row keys: Row access by scan Cell access by get Atomic row update should be handled in application
  • 24. Logical Schema Row indices as row keys 12: data:0:0.41 data:22000:0.41data:9:0.41
  • 25. Logical Schema Row indices as row keys Atomic updates are handled automatically
  • 26. Speed – Cell access/write GET SET row index as row key composite row key
  • 27. Speed – Row access/write GET SET row index as row key composite row key
  • 29. Future Work MatrixInputFormat Might replace SequenceFile based MapReduce inputs
  • 30. Future Work – A little digression Recommender Systems Calculating score for a user-item pair is easy with HBaseMatrix
  • 31. Future Work – A little digression Recommender Systems top-N recommendation? All candidate items for a user in the user row as a nested entity (See Ian Varley's HBase Schema Design)