Chicago HUG Presentation Oct 2011
- 1. GENTLE STROLL DOWN THE ANALYTICS MEMORY LANE
Abe Taha
VP Engineering, Karmasphere
Oct 19th, 2011
- 2. What is this talk about
• This talk is a story about building an analytics services team at
Ning and the experiences and lessons learned
• There is also a bit about how I’d do things differently
• And like a good story, an ending
- 3. Caveat Lector
• The story has no pictures or conversations
• “And what is the use of a book,” thought Alice, “without pictures or conversations?”
Alice’s Adventures in Wonderland, Lewis Carroll
- 4. Your storyteller
• Mostly scalable distributed systems background
• At Yahoo: Search and Social Search
• At Google: App infrastructure
• At Ning: Hadoop for Analytics and System Management services
• At Ask: Dictionary/Reference properties
• Now at Karmasphere building analytics applications on Hadoop
- 5. Prologue
• The story begins at Ning
• Starting the analytics and systems management teams
• In 2008
• When Hadoop was gaining popularity
• v0.16 was out
- 6. A bit about Ning
• Hot company at the time, co-founded by Andreessen
• Allowed users to build websites that looked like Facebook
• Websites called networks
• Networks had social features
• Blogs
• Photos
• Videos
• Chat
• Social graph
• Each network had a major topic/category
• Most networks were free, a few were paid
• Free networks monetized through contextual ads
• The theory was that people produce good content that you can
monetize
- 7. Raison d’être for the analytics team
• Figure out what ads to display on the network
• Look at user generated content (UGC)
• Posts
• Comments and discussions
• Tags on photos and videos
• Come up with categories for networks and ads
• Model network trends and business metrics
• Predict serving machine growth (poor man’s EC2)
• Model machine and application data (poor man’s EC2)
• Memory, disk, CPU, network
• Application logs, counters, etc.
- 8. First: building the team
• The “data scientist” title was not common then, so the second-best option was engineers
• Distributed systems engineers (3) for the infrastructure
• Statistics and ML engineers (2) for modeling and trending
• Data visualization engineers (1) for building dashboards to interact
with the data
• Systems management engineers (2) for building the machine
monitoring systems
- 9. Second: figuring out where the data is
• Typical company scenario
• Data resides in log files
• Machine or application logs
• Stored locally
• Purged after 30 days
- 10. Third: where to keep the data
• Wanted to keep all the historical data
• In a centralized place
• Without paying too much money
• Or using specialized hardware
• Ruled out a data warehouse (DW)
• Had experience with systems that looked like Hadoop (or
Hadoop looked like them)
• Team wanted to experiment with newer technology
• → Data in Hadoop
• V1: a proof of concept (POC)
- 11. V1: getting data in
• Minor changes to store all machine and application logs on NFS
drive
• A couple of retired NetApp filers
• Log files copied into HDFS using the Hadoop client (sketch below)
• Data organized by source in a directory hierarchy
• Grouped by date
• No preprocessing
• 3x replication
• Some latency in moving the data
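
A minimal sketch of this V1 loader using the Hadoop FileSystem client; the /logs/<source>/<date> layout mirrors the hierarchy described above, and the class and path names are illustrative assumptions:

import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical loader: copies one local log file into HDFS under a
// source/date hierarchy, e.g. /logs/webapp/2011-10-19/host1.log
public class LogLoader {
    public static void main(String[] args) throws Exception {
        String source = args[0];    // e.g. "webapp"
        String localFile = args[1]; // e.g. "/var/log/webapp/host1.log"

        Configuration conf = new Configuration(); // reads core-site.xml
        FileSystem fs = FileSystem.get(conf);

        String day = new SimpleDateFormat("yyyy-MM-dd").format(new Date());
        Path dest = new Path("/logs/" + source + "/" + day + "/");
        fs.mkdirs(dest);
        // The 3x replication mentioned above comes from dfs.replication.
        fs.copyFromLocalFile(new Path(localFile), dest);
    }
}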
- 12. V1: now what
• Custom Java map-reduce programs to process the data (example below)
• Support libraries to parse different log file formats
• Jobs did simple analytics
• Averages
• Network response times
• User engagement
• Trends per network
• Active users
• Pageviews
• Most common/popular
• Browsers, pages, queries
• Indexing
• Machine utilization
• Simple scheduler to run jobs
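
A sketch of what one of these jobs might have looked like, written against today's org.apache.hadoop.mapreduce API: pageviews per network per day. The log format, tab-separated with the date in field 0 and the network id in field 1, is an assumption for illustration.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageViewsPerNetwork {

    // Emits (date + network id, 1) for each well-formed log line.
    public static class ParseMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            if (fields.length < 2) return;            // skip malformed lines
            outKey.set(fields[0] + "\t" + fields[1]); // date, network id
            ctx.write(outKey, ONE);
        }
    }

    // Sums the counts; also usable as a combiner.
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable c : counts) total += c.get();
            ctx.write(key, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pageviews-per-network");
        job.setJarByClass(PageViewsPerNetwork.class);
        job.setMapperClass(ParseMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}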
- 13. V1: dashboarding
• Results stored in flat files in HDFS
• Grouped daily/weekly/monthly
• Use gnuplot to build dashboards every hour
- 14. What did we learn from V1
• POC proved viability of Hadoop
• Latency of pulling files was an issue
• Most of the metrics computations are of the same nature
• People need flexibility in defining what is measured
• Once you put data in front of people, they ask more questions
• A POC shows which areas are painful, and where to invest to fix them
- 15. V2: changing data ingestion
• Use event records instead of log files
• Pushed through HTTP
• Built using Thrift (event fields sketched below)
• Events have
• Names
• Timestamps
• Host
• Version
• Payloads
• Published catalog
• All available events
• Event parsers
• Load ~50 million external page views (~10 events per page)
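
The talk doesn't show the Thrift definition, but a plain-Java stand-in for the generated event struct, with fields taken from the list above (field names are assumptions), might look like:

// Hypothetical stand-in for the Thrift-generated event struct.
public class AnalyticsEvent {
    public final String name;    // event type, looked up in the catalog
    public final long timestamp; // epoch millis when the event fired
    public final String host;    // machine that emitted the event
    public final int version;    // schema version of the payload
    public final byte[] payload; // serialized event-specific fields

    public AnalyticsEvent(String name, long timestamp, String host,
                          int version, byte[] payload) {
        this.name = name;
        this.timestamp = timestamp;
        this.host = host;
        this.version = version;
        this.payload = payload;
    }
}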
- 16. V2: collectors
• Receive events
• Put in a memory queue
• Background processes store to local disk
• Check events for validity against the catalog
• Separate into valid/invalid queues (collector sketch below)
• Another process sucks the data into HDFS and organizes it in a directory hierarchy
• Events
• Grouped by date
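
A sketch of the collector core under these assumptions, reusing the AnalyticsEvent stand-in from the previous sketch; the catalog is modeled as a simple set of known event names, and separate writer processes are assumed to drain the valid/invalid queues to disk:

import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical collector: events land in an in-memory queue, and a
// background thread validates each against the published catalog and
// routes it to a valid or invalid queue for the disk writers.
public class Collector implements Runnable {
    private final BlockingQueue<AnalyticsEvent> incoming = new LinkedBlockingQueue<>();
    private final BlockingQueue<AnalyticsEvent> valid = new LinkedBlockingQueue<>();
    private final BlockingQueue<AnalyticsEvent> invalid = new LinkedBlockingQueue<>();
    private final Set<String> catalog; // known event names

    public Collector(Set<String> catalog) { this.catalog = catalog; }

    // Called by the HTTP endpoint for each received event.
    public void receive(AnalyticsEvent e) throws InterruptedException {
        incoming.put(e);
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                AnalyticsEvent e = incoming.take();
                (catalog.contains(e.name) ? valid : invalid).put(e);
            }
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // shut down cleanly
        }
    }
}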
- 17. V2: computation abstraction
• Common tasks
• Projection
• What fields am I interested in
• Filtering
• What records am I interested in
• Aggregations
• What do I want to do with the metrics
• Common readers and writers for data types
• Captured in libraries that can be composed for complex analytics (sketch below)
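
An in-memory sketch of the composition idea (the real versions ran as map-reduce jobs, and all names here are assumptions):

import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical composable primitives: a filter (which records), a
// projection (which fields), and a count aggregation (what to do with
// the metric), chained into one computation.
public class MetricPipeline {
    public static <R, K> Map<K, Long> countBy(List<R> records,
                                              Predicate<R> filter,
                                              Function<R, K> projection) {
        return records.stream()
                      .filter(filter)                 // filtering
                      .collect(Collectors.groupingBy( // aggregation
                          projection,                 // projection
                          Collectors.counting()));
    }
}

For example, countBy(pageviews, r -> r.network.equals("x"), r -> r.browser) would count pageviews per browser for one hypothetical network.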
- 18. V2: better dashboards
• Metrics summarized in MySQL databases
• Interactive dashboards using Ruby/Sinatra (query sketch below)
• Select metrics
• Time range
• Aggregation method
• Plot results using FusionCharts
• OpenCharts was a close second, but had no combined charts (histograms with line charts)
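
The dashboards themselves were Ruby/Sinatra; rendered in Java for consistency with the other sketches here, the query behind one panel might look like this (table, column, and credential values are assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Hypothetical dashboard query: daily aggregation of one metric over a
// user-selected time range, against the MySQL summary tables.
public class MetricQuery {
    public static void main(String[] args) throws Exception {
        Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost/metrics", "dash", "secret");
        PreparedStatement ps = db.prepareStatement(
                "SELECT day, SUM(value) FROM metric_summaries " +
                "WHERE metric = ? AND day BETWEEN ? AND ? GROUP BY day");
        ps.setString(1, "pageviews");
        ps.setString(2, "2011-10-01");
        ps.setString(3, "2011-10-19");
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        db.close();
    }
}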
- 19. What did we learn from V2
• Writing directly to HDFS beats staging on local disk first
• No need for the process that saves locally and then copies to HDFS
• People loved events
• Led to event abuse
• Each feature on the page had an associated event
• Events were used for performance tuning: how much time did a feature
take
• Events were used for monitoring backend features: record errors with
services
• A large number of small files causes problems for the namenode
• Need to coalesce events to reduce the file count
• With flexible event types and interactive dashboards, people have more questions
• We couldn’t keep up with developing custom metrics and charts
• Needed a self-serve query mechanism
- 20. V3: ingestion
• Minor modifications
• Collectors now write to HDFS
• Collectors accumulate events to reduce the file count (sketch below)
• Self-serve UI for defining new events outside of the metrics team
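
A sketch of the accumulation idea: buffer events in memory and write one sizable HDFS file per batch rather than many small ones, which eases pressure on the namenode. The batch threshold and file naming are assumptions:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical batching writer used by the V3 collectors.
public class BatchingHdfsWriter {
    private static final int BATCH_SIZE = 10_000; // assumed threshold
    private final List<String> buffer = new ArrayList<>();
    private final FileSystem fs;
    private final Path dir;

    public BatchingHdfsWriter(Path dir) throws IOException {
        this.fs = FileSystem.get(new Configuration());
        this.dir = dir;
    }

    public synchronized void append(String serializedEvent) throws IOException {
        buffer.add(serializedEvent);
        if (buffer.size() >= BATCH_SIZE) flush();
    }

    public synchronized void flush() throws IOException {
        if (buffer.isEmpty()) return;
        // One file per batch instead of one file per event.
        Path file = new Path(dir, "events-" + System.currentTimeMillis());
        try (FSDataOutputStream out = fs.create(file)) {
            for (String e : buffer) {
                out.writeBytes(e);
                out.writeBytes("\n");
            }
        }
        buffer.clear();
    }
}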
- 21. V3: computation
• Needed a higher-level query language
• JSON API exposing a search-like query syntax
• {from: 'date', to: 'date', metric: 'x', computation}
• Computations are encapsulated into libraries and exposed through JSON (dispatch sketch below)
• Users can add metrics and computations and build frontends for
the query language
• Custom code for ML tasks
• Cascading for algorithms
• R for visualization
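
A sketch of the server-side dispatch for such a query API; the registry shape and names are assumptions, and the Map stands in for the parsed JSON body:

import java.util.HashMap;
import java.util.Map;

// Hypothetical dispatcher: the parsed query names a metric and a
// computation, which is looked up in a registry that users can extend
// with their own computations.
public class QueryDispatcher {
    public interface Computation {
        double run(String metric, String from, String to);
    }

    private final Map<String, Computation> registry = new HashMap<>();

    public void register(String name, Computation c) { registry.put(name, c); }

    // query stands in for the parsed JSON body, e.g.
    // {from: '2011-10-01', to: '2011-10-19', metric: 'pageviews', computation: 'sum'}
    public double handle(Map<String, String> query) {
        Computation c = registry.get(query.get("computation"));
        if (c == null) throw new IllegalArgumentException("unknown computation");
        return c.run(query.get("metric"), query.get("from"), query.get("to"));
    }
}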
- 22. V3: dashboards
• More intermediate data precomputed
• Data stored in HBase
• Dashboards query HBase (read sketch below)
• Templates for users to build custom dashboards
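
A sketch of the dashboard read path against the 2011-era HBase client API; the metric:day row-key layout and the column family are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical read path for one dashboard panel: scan precomputed rows
// keyed by metric and day from the summary table.
public class DashboardReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "metrics"); // precomputed summaries

        Scan scan = new Scan(Bytes.toBytes("pageviews:2011-10-01"),
                             Bytes.toBytes("pageviews:2011-10-19"));
        scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("value"));

        ResultScanner scanner = table.getScanner(scan);
        for (Result row : scanner) {
            byte[] v = row.getValue(Bytes.toBytes("d"), Bytes.toBytes("value"));
            System.out.println(Bytes.toString(row.getRow()) + " = " + Bytes.toLong(v));
        }
        scanner.close();
        table.close();
    }
}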
- 23. V3: What did we learn
• Self-serve is the way to go
• Give people the infrastructure and the support libraries and
they’ll go to town
• Some tasks still can’t be done in a framework and need custom code
• Machine learning, with analysis in R
• ML is hard, even with experience
• Data is not clean
• Some content is very small
• Comments on pictures and videos (workarounds for aggregation)
• Even then you can build products around the results
• People and network recommenders
• Network categories for ads
- 24. How would we do it differently today
• Open source obviates custom code
• Scribe for data ingestion
• Hive for self-serve analytics and business intelligence (query sketch below)
• Pig scripts subsume most of the Java code
• Cascading for Java map-reduce
• Dashboards still stay the same
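
To make the Hive point concrete: the V1 pageviews-per-network job collapses into a single query, sketched here over the Hive JDBC driver of that era (the table layout is assumed):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical self-serve query replacing the custom Java job.
public class HiveSelfServe {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection hive = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement st = hive.createStatement();
        ResultSet rs = st.executeQuery(
                "SELECT day, network_id, COUNT(*) FROM pageview_events " +
                "GROUP BY day, network_id");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getString(2)
                    + "\t" + rs.getLong(3));
        }
        hive.close();
    }
}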
- 25. Epilogue
• ML analysis showed that most usage was spam
• Shut down a lot of pr0n networks and video-hosting networks in East Asia
• Team moved on to different companies
• Still in analytics at LinkedIn, Facebook, and Twitter
• Company changed its business model to pay-only and laid off half the staff six months later
• Company acquired recently
- 26. Takeaway
• The problems and solutions are mostly the same everywhere
• Getting data into Hadoop
• Computing over the data
• Getting meaningful data out of Hadoop
• Lots of software components exist to help you with these
• It is about the balance of what you develop vs. what you acquire
- 28. The Leader in Big Data Intelligence on Hadoop
www.karmasphere.com