1.
eric@nextbigsound.com © 2009 - 2014 Next Big Sound, Inc.
Building a Data “Development” Platform
Data Evolution In HBase
Eric Czech & Alec Zopf
Next Big Sound
HBaseCon - Case Studies Track
May 5, 2014
2.
Intro
• Eric Czech - Chief Architect
Previously worked on the infrastructure team at a quantitative hedge fund
• Alec Zopf - Senior Data Engineer
Previously worked on an algorithmic futures and options trading platform
3.
Agenda
• Data & Architecture
• Data Aggregation
- Why existing tools don’t help us
• Data Development (HBlocks)
- Our platform for making it happen
• A Practical Example
4.
Data Sources
Next Big Sound marries billions of public social data points with customers’ internal transactional data. Public sources include up to 3+ years of historical and competitive data for hundreds of thousands of artists and millions of songs.
Source categories: Sales, Streaming & Social, Misc
Sources: iTunes, Physical Sales, Amazon, Sitecatalyst, Facebook, Facebook Insights, Last.fm, Pandora, Rdio, ReverbNation, SoundCloud, Tumblr, Google Analytics, Wikipedia, Tunesat, Mediabase, Spotify, Twitter, Vevo, Vimeo, YouTube, YouTube Analytics, Deezer, Instagram
7.
Charts Licensed to Billboard
In Billboard’s 118-year history they’ve licensed data from two providers: Nielsen in 1991 and Next Big Sound in 2010.
8.
Architecture & Stats
• Data collected from 60+ sources
• 1M artists, 10M tracks
• Tens of billions of records
• CDH 4.3.0
• 48-node Hadoop cluster for 35TB dataset
• No licensing costs
• Giant counting machine!
9.
Data Aggregation
• HDFS: stores raw fact tables and copies of dimension tables from MySQL
• Oozie/Pig: runs incremental joins of fact and dimension tables
• HBase: stores timeseries aggregations for random access (NOT using counters)
10.
Raw Fact Data (HDFS)
Aggregate Tables (HBase)
Cube/Rollup Operations (Pig)
(and many more...)
11.
Other Solutions
Are there better ways to just count things?
Yes! Lots:
• OpenTSDB
• Summingbird (Twitter)
• DataFu Hourglass (LinkedIn)
• Blueflood (Rackspace)
• Oozie Coordinators
• Apache Accumulo
• Hadoop + Voldemort
• MongoDB Incremental MapReduce
• TempoDB & InfluxDB (hosted services)
• KairosDB (originally built on Cassandra)
• Amazon EMR/Redshift
• Cassandra/Redis/Riak/HBase Counters
12.
Considerations
• Scalability
• Cost
• Performance
• Client Libraries
• I/O Characteristics
• Optimal Hardware
• Config Overhead
• Language
• Community
• Data Model
• Monitoring/Alerting
• Documentation
• Support
• Learning Curve
13.
One More Thing...
What about mistakes?!
Data “bugs” are nearly impossible to predict and can screw you in unimaginable ways...
14.
Data Bugs
Why are fan counts in Schenectady, NY 1000% higher than everywhere else?
Data source uses ZIP code 12345 as the default location for new users
Why are radio station play numbers recently all multiples of 2 or 3?
Data delivered several times and we had no idea
Why is the number of songs sold 3% too high?
We didn't account for returns
Why are all the page view spikes 8 hours after they should be?
We assumed UTC timestamps instead of PST
Hundreds of these! ... that we know of
16.
Or maybe not...
Can we just fix the code and re-aggregate?
NO, there’s no guarantee that the bad data is overwritten.
Can we do the aggregations “on-the-fly”?
NO, we’re not using a relational model for good reason.
Can we rebuild everything in new tables?
NO, we’d need 2x storage to fix < 0.0001% of the data.
17.
Learning the Hard Way
Fixing data bugs online is terrifying.
“Ad-hoc” updates to production datasets are:
• Dangerous and complicated
• Difficult to generalize
• Time-consuming to test
• A huge database I/O burden
18.
Back To Solutions
What if each dataset had multiple versions?
... and we can focus on small pieces
... with alpha/beta/stable tags
... where users only see what they should
Feels familiar
19.
HBlocks
Our solution for large-scale revision control
• Spans HDFS, Hive, Pig, and HBase
• Arbitrary versioning of data subsets
• Incremental processing, full-scale re-processing, and everything in between
• Append-only model (deletes in background)
20.
The Basics
Each raw file has an ID
* e.g. “block_1”
Each ID has versions
* ID & version stored in HBase
Version state used to filter results
21.
Data Development
Version “States” control data lifecycle, from birth (PENDING) to death (DELETED):
PENDING     New data for ETL pipeline
PROCESSING  Data currently being processed
ALPHA       Developers only
BETA        Privileged users
STABLE      Everybody
HIDDEN      Ignored (but still in HBase)
DELETED     Removed permanently
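The states listed on this slide can be read as a visibility filter over versions. The following is a minimal Java sketch of that idea, not the HBlocks implementation (which the slides do not show). The names `VersionState`, `Audience`, `visibleTo`, and `isVisibleTo` are hypothetical, and it assumes visibility is cumulative, i.e. developers can also read BETA and STABLE data.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical audience levels; the slide only names developers,
// privileged users, and everybody.
enum Audience { DEVELOPER, PRIVILEGED_USER, EVERYBODY }

// State names are taken from the slide; the methods are illustrative.
enum VersionState {
    PENDING, PROCESSING, ALPHA, BETA, STABLE, HIDDEN, DELETED;

    /** Audiences allowed to read data in this state (assumed cumulative). */
    Set<Audience> visibleTo() {
        switch (this) {
            case ALPHA:
                return EnumSet.of(Audience.DEVELOPER);
            case BETA:
                return EnumSet.of(Audience.DEVELOPER, Audience.PRIVILEGED_USER);
            case STABLE:
                return EnumSet.allOf(Audience.class);
            default:
                // PENDING, PROCESSING, HIDDEN, and DELETED are served to nobody.
                return EnumSet.noneOf(Audience.class);
        }
    }

    /** True if a reader at the given audience level should see this version. */
    boolean isVisibleTo(Audience audience) {
        return visibleTo().contains(audience);
    }
}
```

With a filter like this, hiding a bad version becomes a metadata change rather than a rewrite of the aggregated rows.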
22.
A Practical Example
Tracking the number of English-language Wikipedia page views for Hadoop
So we’ll track this site:
http://en.wikipedia.org/wiki/Apache_Hadoop
Using this data:
http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-01/
23.
The Dataset
http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-01/
Contains ~100MB compressed files for each hour
All pageviews for Jan 1, 2014:
pagecounts-20140101-*.gz
24.
File Uploads
Files downloaded anywhere ...
user@host001> ls wikipedia
pagecounts-20140101.gz
pagecounts-20140102.gz
...
pagecounts-20140131.gz
... and uploaded to HDFS
user@host001> for file in `ls wikipedia`
do
  hblocks upload -file $file -source wikipedia
done
25.
File Metadata
HDFS files registered in HBlocks metadata:
user@host001> hblocks list -source wikipedia
+---------------------------------------------------------+
| hblock_id | hblock_name         | source    | version:1 |
+---------------------------------------------------------+
| 2935      | pagecounts-20140101 | wikipedia | PENDING   |
| 2936      | pagecounts-20140102 | wikipedia | PENDING   |
...
| 3678      | pagecounts-20140131 | wikipedia | PENDING   |
+---------------------------------------------------------+
Table contains 31 row(s)
“PENDING” state indicates availability for Pig scripts
26.
Run It!
Now, let’s do some aggregating:
user@host001> hblocks aggregate -source wikipedia
Pig script writes results to HBase:
user@host001> hblocks query -table page_views
+-------------------------------------------------------------------+
| hblock_id | version | language | page          | date     | value |
+-------------------------------------------------------------------+
| 2935      | 1       | en       | Apache_Hadoop | 20140101 | 283   |
...
| 2935      | 1       | En       | Apache_Hadoop | 20140131 | 2     |
| 2935      | 1       | en.mw    | Apache_Hadoop | 20140131 | 3     |
+-------------------------------------------------------------------+
Wtf is this!?
27.
What Happened?
On January 20th:
• “Sub” languages (e.g. ‘en.mw’) introduced
• Capitalized languages (e.g. ‘En’) also added
• Aggregation script starts ignoring small % of records
* fictitious problems - these language values are real but were not introduced in January
29.
Fix It!
Create new versions for each file affected:
user@host001> hblocks rebuild -source wikipedia -regex '.*201401(2|3).*'
Old versions “STABLE”, new versions “PENDING”:
user@host001> hblocks list -source wikipedia
+---------------------------------------------------------------------+
| hblock_id | hblock_name         | source    | version:1 | version:2 |
+---------------------------------------------------------------------+
| 2935      | pagecounts-20140101 | wikipedia | STABLE    |           |
| 2935      | pagecounts-20140102 | wikipedia | STABLE    |           |
...
| 2936      | pagecounts-20140120 | wikipedia | STABLE    | PENDING   |
| 2936      | pagecounts-20140121 | wikipedia | STABLE    | PENDING   |
...
| 3678      | pagecounts-20140131 | wikipedia | STABLE    | PENDING   |
+---------------------------------------------------------------------+
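To see which files the `-regex` argument selects, the pattern can be tried against a few file names. This is a small Java sketch under the assumption that HBlocks matches the pattern against the whole block name. The slides do not say how the match is performed, and the class name `RebuildRegexCheck` is hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class RebuildRegexCheck {
    // Pattern copied from the rebuild command on the slide.
    static final Pattern AFFECTED = Pattern.compile(".*201401(2|3).*");

    /** Returns the names the pattern matches in full (full matching is an assumption). */
    static List<String> affected(List<String> names) {
        return names.stream()
                .filter(name -> AFFECTED.matcher(name).matches())
                .collect(Collectors.toList());
    }
}
```

Because the day must start with 2 or 3, the pattern selects January 20th through 31st and leaves the 1st through the 19th untouched, which lines up with the bug appearing on January 20th.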
30.
Fix It!
Change the current aggregation code:
String language = line.get("language");
To handle case-sensitivity and use first part before a “.”:
String language = line.get("language")
    .split("\\.")[1]
    .toLowerCase();
31.
Run It Again
Run the same aggregation for new versions:
user@host001> hblocks aggregate -source wikipedia
New results:
We made it even worse!
32.
Revert
Hurry, hide the bad data:
user@host001> hblocks update_versions -source wikipedia \
                -regex '.*201401(2|3).*' -state 'HIDDEN'
Phew, back to where we started ... but what happened?
.split("\\.")[1]
Wrong! Should have been:
.split("\\.")[0]
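The corrected extraction can be checked in isolation before re-running the aggregation. Below is a minimal Java sketch, with a hypothetical helper class `LanguageNormalizer` that is not part of the original code. One detail worth noting is that `String.split` treats its argument as a regular expression, so the sketch escapes the dot to match a literal “.”. It also passes a limit of 2 so that index 0 exists for every input.

```java
import java.util.Locale;

class LanguageNormalizer {
    /** Maps values such as "En" or "en.mw" to "en". */
    static String normalize(String rawLanguage) {
        // "\\." matches a literal dot; an unescaped "." would match any character.
        // Index 0 is the part before the first dot and is also present when
        // the value contains no dot at all.
        return rawLanguage.split("\\.", 2)[0].toLowerCase(Locale.ROOT);
    }
}
```

Index 1, by contrast, is the part after the first dot, and it does not exist for plain values such as "en".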
33.
Fix It Again (carefully)
Rebuild aggregations in ‘beta’ state this time:
user@host001> hblocks rebuild -source wikipedia \
                -regex '.*201401(2|3).*' -state 'beta'
After another hblocks aggregate, only developers see:
Looks good!
34.
Finishing Up
Make the new data available for ALL users:
user@host001> hblocks update_versions -source wikipedia \
                -regex '.*201401(2|3).*' -state 'STABLE'
Final state:
user@host001> hblocks list -source wikipedia
+---------------------------------------------------------------------------------+
| hblock_id | hblock_name         | source    | version:1 | version:2 | version:3 |
+---------------------------------------------------------------------------------+
| 2935      | pagecounts-20140101 | wikipedia | STABLE    |           |           |
| 2935      | pagecounts-20140102 | wikipedia | STABLE    |           |           |
...
| 2936      | pagecounts-20140120 | wikipedia | HIDDEN    | HIDDEN    | STABLE    |
| 2936      | pagecounts-20140121 | wikipedia | HIDDEN    | HIDDEN    | STABLE    |
...
| 3678      | pagecounts-20140131 | wikipedia | HIDDEN    | HIDDEN    | STABLE    |
+---------------------------------------------------------------------------------+
35.
HBase Schema
Keys: Primary Dimensions, HBlock Id, Time 0
Columns: Secondary Dimensions, Time 1, HBlockVersion Id
Values: Time 2.0, Value 0, ..., Time 2.N, Value N
Timestamps: Schema #, Insertion Time (secs), Value Data Type
36.
HBase Keys/Columns
Keys: Primary Dimensions, HBlock Id, Time 0
Columns: Secondary Dimensions, Time 1, HBlockVersion Id
• Concatenated string ids: artists, tracks & metrics
• Times split into offsets: limits row width
• Queried in bulk: demographics & zip codes
• HBlocks metadata: determines record “state”
37.
HBase Values
Values: Time 2.0, Value 0, ..., Time 2.N, Value N
• Time offsets in values too: fixed width (single byte)
• Values stored as VarInts: can be any width
• Many values per cell keeps key count lower, reducing MemStore size *
* difficult without an append-only model like ours
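To make the cell layout more concrete, here is a rough Java sketch of packing several (time offset, value) pairs into one cell value. The slides give only the outline (single-byte offsets, VarInt values, many values per cell), so the byte order, the unsigned LEB128-style varint, the restriction to non-negative values, and the class name `CellCodec` are all assumptions rather than the HBlocks format.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

class CellCodec {
    /** Appends one pair: a single offset byte, then the value as an unsigned varint. */
    static void append(ByteArrayOutputStream out, int offset, long value) {
        if (offset < 0 || offset > 255) {
            throw new IllegalArgumentException("offset must fit in one byte");
        }
        if (value < 0) {
            throw new IllegalArgumentException("this sketch handles non-negative values only");
        }
        out.write(offset);
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // low 7 bits, continuation bit set
            value >>>= 7;
        }
        out.write((int) value); // final byte, continuation bit clear
    }

    /** Decodes a cell back into {offset, value} pairs, in the order they were appended. */
    static List<long[]> decode(byte[] cell) {
        List<long[]> pairs = new ArrayList<>();
        int i = 0;
        while (i < cell.length) {
            long offset = cell[i++] & 0xFF;
            long value = 0;
            int shift = 0;
            while (true) {
                byte b = cell[i++];
                value |= (long) (b & 0x7F) << shift;
                if ((b & 0x80) == 0) {
                    break;
                }
                shift += 7;
            }
            pairs.add(new long[] {offset, value});
        }
        return pairs;
    }
}
```

Appending to an existing cell only adds bytes at the end, which fits the append-only model. Small counts take two bytes per pair (one for the offset, one for the value), so many observations can share a single row key and column.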
38.
Links
Architecture @ NBS - highscalability.com
HBlocks White Paper
Jobs @ NBS
Alec Zopf
alec@nextbigsound.com
Eric Czech
eric@nextbigsound.com