Chicago HUG Presentation Oct 2011
Abe Taha
Chicago Hadoop User Group presentation at Orbitz, October 2011
1.
GENTLE STROLL DOWN THE ANALYTICS MEMORY LANE
Abe Taha, VP Engineering, Karmasphere
Oct 19th, 2011
© Karmasphere 2011 All rights reserved
2.
What is this talk about
• This talk is a story about building an analytics services team at Ning, and the experiences and lessons learned
• There is also a bit about how I’d do things differently
• And, like a good story, an ending
3.
Caveat Lector
• The story has no pictures or conversations
• “And what is the use of a book,” thought Alice, “without pictures or conversations?” (Alice’s Adventures in Wonderland, Lewis Carroll)
4.
Your storyteller
• Mostly scalable distributed systems background
• At Yahoo: Search and Social Search
• At Google: App infrastructure
• At Ning: Hadoop for Analytics and System Management services
• At Ask: Dictionary/Reference properties
• Now at Karmasphere building analytics applications on Hadoop
5.
Prologue
• The story begins at Ning
• Starting an analytics and systems management team
• In 2008
• When Hadoop was gaining popularity
• v0.16 was out
6.
A bit about Ning
• Hot company at the time, co-founded by Andreessen
• Allowed users to build websites that look like Facebook
• Websites called networks
• Networks had social features
  • Blogs
  • Photos
  • Videos
  • Chat
  • Social graph
• Each network had a major topic/category
• Most networks were free, few for pay
• Free networks monetized through contextual ads
• The theory was that people produce good content that you can monetize
7.
Raison d’être for the analytics team
• Figure out what ads to display on the network
  • Look at user generated content (UGC)
    • Posts
    • Comments and discussions
    • Tags on photos and videos
  • Come up with categories for networks and ads
• Model network trends and business metrics
• Predict serving machine growth (poor man’s EC2)
• Model machine and application data (poor man’s EC2)
  • Memory, disk, CPU, network
  • Application logs, counters, etc.
8.
First: building the team
• Data scientist title not common then; second best: engineers
  • Distributed systems engineers (3) for the infrastructure
  • Statistics and ML engineers (2) for modeling and trending
  • Data visualization engineers (1) for building dashboards to interact with the data
  • Systems management engineers (2) for building the machine monitoring systems
9.
Second: figuring out where the data is
• Typical company scenario
• Data resides in log files
  • Machine or application logs
  • Stored locally
  • Purged after 30 days
10.
Third: where to keep the data
• Wanted to keep all the historical data
  • In a centralized place
  • Without paying too much money
  • Or using specialized hardware
• Ruled out a data warehouse (DW)
• Had experience with systems that looked like Hadoop (or Hadoop looked like them)
• Team wanted to experiment with newer technology
• -> Data in Hadoop
• V1: proof of concept (POC)
11.
V1: getting data in
• Minor changes to store all machine and application logs on an NFS drive
  • A couple of retired NetApp filers
• Log files copied into HDFS using the Hadoop client
• Data organized by source in a directory hierarchy
  • Grouped by date
  • No preprocessing
  • 3x replication
• Some latency in moving the data
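The slide only says that data was organized by source and grouped by date. As a rough sketch of what such a layout could look like, the helper below builds a destination path; the `/logs` root, the year/month/day nesting, and the function name are assumptions for illustration, not details from the talk.

```python
import posixpath
from datetime import date


def hdfs_log_path(source: str, day: date, filename: str, root: str = "/logs") -> str:
    """Build a destination path grouped by source, then by date.

    The root directory and the year/month/day nesting are hypothetical.
    """
    return posixpath.join(root, source, f"{day:%Y}", f"{day:%m}", f"{day:%d}", filename)


if __name__ == "__main__":
    print(hdfs_log_path("apache", date(2008, 5, 3), "access.log"))
```

The copy itself would still go through the Hadoop client, for instance with `hadoop fs -put`, using a path like this as the destination.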
12.
V1: now what
• Custom Java map-reduce programs to process the data
• Support libraries to parse different log file formats
• Jobs did simple analytics
  • Averages
    • Network response times
    • User engagement
  • Trends per network
    • Active users
    • Pageviews
  • Most common/popular
    • Browsers, pages, queries
  • Indexing
  • Machine utilization
• Simple scheduler to run jobs
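To make the "averages" style of job concrete, here is a minimal in-process sketch of the map and reduce steps for average response time per network. It is plain Python rather than the custom Java jobs the slide describes, and the log line format (`<network> <response_ms>`) is an assumption.

```python
from collections import defaultdict


def map_line(line):
    """Map step: parse one log line into a (network, response_ms) pair."""
    network, response_ms = line.split()
    return network, float(response_ms)


def reduce_mean(pairs):
    """Reduce step: group values by key and average each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


def average_response_times(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    return reduce_mean(map_line(line) for line in lines)


if __name__ == "__main__":
    print(average_response_times(["knitting 100", "cars 50", "knitting 300"]))
```

In a real Hadoop job the grouping by key happens in the shuffle phase between mappers and reducers; here it is folded into `reduce_mean` to keep the sketch self-contained.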
13.
V1: dashboarding
• Results stored in flat files in HDFS
• Grouped daily/weekly/monthly
• Used gnuplot to rebuild dashboards every hour
14.
What did we learn from V1
• POC proved viability of Hadoop
• Latency of pulling files was an issue
• Most of the metrics computations are of the same nature
• People need flexibility in defining what is measured
• Once you put data in front of people, they ask more questions
• POC shows which areas are a pain, and where to invest to fix
15.
V2: changing data ingestion
• Use event records instead of log files
  • Pushed through HTTP
  • Built using Thrift
• Events have
  • Names
  • Timestamps
  • Host
  • Version
  • Payloads
• Published catalog
  • All available events
  • Event parsers
• Load ~50 million external page views (~10 events per page)
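The event fields listed above (name, timestamp, host, version, payload) and the idea of a published catalog can be sketched as follows. This uses a Python dataclass instead of a Thrift definition, and the catalog contents and the "required payload fields" rule are assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One event record with the fields named on the slide."""
    name: str
    timestamp: float
    host: str
    version: int
    payload: dict = field(default_factory=dict)


# Hypothetical catalog: event name -> payload fields that must be present.
CATALOG = {
    "page_view": {"network", "url"},
    "photo_upload": {"network", "bytes"},
}


def is_valid(event, catalog=CATALOG):
    """An event is valid if its name is cataloged and its payload is complete."""
    required = catalog.get(event.name)
    return required is not None and required.issubset(event.payload)


if __name__ == "__main__":
    ok = Event("page_view", 1319000000.0, "web01", 1, {"network": "knitting", "url": "/"})
    print(is_valid(ok))
```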
16.
V2: collectors
• Receive events
• Put in a memory queue
• Background processes store to local disk
  • Check events for validity against catalog
  • Separate into valid/invalid queues
• Another process pulls data into HDFS and organizes it in a directory hierarchy
  • Events
  • Grouped by date
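The collector flow above (receive, queue in memory, validate, split into valid and invalid) might look roughly like this. The class name and the injected `is_valid` callback are illustrative assumptions; local disk and HDFS writes are left out, and the two output lists stand in for the valid/invalid queues.

```python
from collections import deque


class Collector:
    """Toy collector: queue events in memory, then split them by validity."""

    def __init__(self, is_valid):
        self._is_valid = is_valid      # callback that checks an event against the catalog
        self._incoming = deque()       # in-memory queue of received events
        self.valid = []
        self.invalid = []

    def receive(self, event):
        """Accept an event and return immediately; validation happens later."""
        self._incoming.append(event)

    def drain(self):
        """Background step: validate everything queued so far."""
        while self._incoming:
            event = self._incoming.popleft()
            target = self.valid if self._is_valid(event) else self.invalid
            target.append(event)


if __name__ == "__main__":
    known = {"page_view"}
    collector = Collector(lambda e: e.get("name") in known)
    collector.receive({"name": "page_view"})
    collector.receive({"name": "bogus"})
    collector.drain()
    print(len(collector.valid), len(collector.invalid))
```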
17.
V2: computation abstraction
• Common tasks
  • Projection: what fields am I interested in
  • Filtering: what records am I interested in
  • Aggregations: what do I want to do with the metrics
• Common readers and writers for data types
• Captured in libraries that can be composed for complex analytics
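One way to picture projection, filtering, and aggregation as composable library pieces is a set of small step functions chained into a pipeline. The function names and the dict-based record format below are assumptions for illustration and do not reflect the actual Ning libraries.

```python
from collections import defaultdict


def project(*fields):
    """Projection: keep only the named fields of each record."""
    def step(records):
        return ({f: r[f] for f in fields} for r in records)
    return step


def where(predicate):
    """Filtering: keep only the records that satisfy the predicate."""
    def step(records):
        return (r for r in records if predicate(r))
    return step


def aggregate(key, value, fn):
    """Aggregation: group `value` by `key` and apply `fn` to each group."""
    def step(records):
        groups = defaultdict(list)
        for r in records:
            groups[r[key]].append(r[value])
        return {k: fn(v) for k, v in groups.items()}
    return step


def compose(*steps):
    """Chain steps so that the output of one feeds the next."""
    def run(records):
        for s in steps:
            records = s(records)
        return records
    return run


if __name__ == "__main__":
    events = [
        {"event": "page_view", "network": "knitting", "ms": 120, "host": "web01"},
        {"event": "page_view", "network": "knitting", "ms": 80, "host": "web02"},
        {"event": "chat", "network": "cars", "ms": 15, "host": "web01"},
    ]
    slowest = compose(
        where(lambda r: r["event"] == "page_view"),
        project("network", "ms"),
        aggregate("network", "ms", max),
    )
    print(slowest(events))
```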
18.
V2: better dashboards
• Metrics summarized in MySQL databases
• Interactive dashboards using Ruby/Sinatra
  • Select metrics
  • Time range
  • Aggregation method
• Plot results using FusionCharts
  • OpenCharts was a close second, but no combined charts (histograms, line charts)
19.
What did we learn from V2
• HDFS I/O is better than the local disk
  • No need for the process that saves locally then to HDFS
• People loved events
  • Led to event abuse
  • Each feature on the page had an associated event
  • Events were used for performance tuning: how much time did a feature take
  • Events were used for monitoring backend features: record errors with services
• A large number of files causes problems for the namenode
  • Need to coalesce events to reduce the number of files
• With flexible event types and interactive dashboards, people have more questions
  • We couldn’t keep up with developing custom metrics and charts
  • Needed a self-serve query mechanism
20.
V3: ingestion
• Minor modifications
  • Collectors now write to HDFS
  • Collectors accumulate events to reduce the number of files
• Self-serve UI for defining new events outside of the metrics team
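Accumulating events before writing is what keeps the file count, and so the load on the namenode, down. A minimal sketch of that batching, assuming a simple count-based threshold (the talk does not say how batches were sized) and using an in-memory list as a stand-in for files written to HDFS:

```python
class EventBatcher:
    """Accumulate events and emit one 'file' per batch instead of one per event."""

    def __init__(self, max_events):
        if max_events < 1:
            raise ValueError("max_events must be at least 1")
        self.max_events = max_events
        self._pending = []
        self.files = []  # stand-in for files written to HDFS

    def add(self, event):
        """Buffer an event; write a batch once the threshold is reached."""
        self._pending.append(event)
        if len(self._pending) >= self.max_events:
            self.flush()

    def flush(self):
        """Write out whatever is buffered, if anything."""
        if self._pending:
            self.files.append(list(self._pending))
            self._pending.clear()


if __name__ == "__main__":
    batcher = EventBatcher(max_events=2)
    for i in range(5):
        batcher.add(i)
    batcher.flush()
    print(len(batcher.files))
```

A production collector would likely also flush on a timer so that slow event streams do not sit in memory indefinitely; that is omitted here.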
21.
V3: computation
• Need a higher level language for query
• JSON API exposing a search-like query syntax
  • {from: 'date', to: 'date', metric: 'x', computation}
• Computations are encapsulated into libraries and exposed through JSON
• Users can add metrics and computations and build frontends for the query language
• Custom code for ML tasks
  • Cascading for algorithms
  • R for visualization
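A query of the shape shown on the slide could be evaluated along these lines. The ISO date format, the `computation` key holding a name, the registry of computations, and the in-memory sample tuples are all assumptions for this sketch; the slide only gives the four field names.

```python
import json
from datetime import date

# Hypothetical registry of named computations exposed through the JSON API.
COMPUTATIONS = {
    "sum": sum,
    "max": max,
    "avg": lambda xs: sum(xs) / len(xs),
}


def run_query(query_json, samples):
    """Evaluate a query against (day, metric, value) samples.

    Returns None when no sample matches the metric and date range.
    """
    q = json.loads(query_json)
    start = date.fromisoformat(q["from"])
    end = date.fromisoformat(q["to"])
    values = [
        value
        for day, metric, value in samples
        if metric == q["metric"] and start <= day <= end
    ]
    if not values:
        return None
    return COMPUTATIONS[q["computation"]](values)


if __name__ == "__main__":
    samples = [
        (date(2011, 10, 1), "pageviews", 100),
        (date(2011, 10, 2), "pageviews", 150),
        (date(2011, 10, 3), "pageviews", 50),
    ]
    query = '{"from": "2011-10-01", "to": "2011-10-02", "metric": "pageviews", "computation": "sum"}'
    print(run_query(query, samples))
```

Keeping computations in a registry mirrors the slide's point that users can add their own metrics and computations without changing the query syntax.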
22.
V3: dashboards
• More intermediate data precomputed
• Data stored in HBase
• Dashboards go against HBase
• Templates for users to build custom dashboards
23.
V3: What did we learn
• Self serve is the way to go
  • Give people the infrastructure and the support libraries and they’ll go to town
• Some tasks still can’t be done in a framework and need custom code
  • Machine learning, with analysis in R
• ML is hard, even with experience
  • Data is not clean
  • Some content is very small
    • Comments on pictures and videos (workarounds for aggregation)
• Even then you can build products around the results
  • People and network recommenders
  • Network categories for ads
24.
How would we do it differently today
• Open source obviates custom code
  • Scribe for data ingestion
  • Hive for self serve analytics and business intelligence
  • Pig scripts subsume most of the Java code
  • Cascading for Java map-reduce
• Dashboards still stay the same
25.
Epilogue
• ML analysis showed most usage is spam
  • Shut down a lot of porn networks and video hosting networks in Far East Asia
• Team moved to different companies
  • Still in analytics at LinkedIn, Facebook, and Twitter
• Company changed business model to for pay only and laid off half the staff 6 months later
• Company acquired recently
26.
Takeaway
• The problems and solutions are mostly the same everywhere
  • Getting data into Hadoop
  • How do you compute over the data
  • Getting meaningful data out of Hadoop
• Lots of software components exist to help you with these
• It is about the balance of what you develop vs what you acquire
27.
Q&A
28.
The Leader in Big Data Intelligence on Hadoop
www.karmasphere.com