Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)

At LongTail Video, we use Hadoop and HBase for our real-time analytics engine. This was presented at HBaseCon 2013 in San Francisco.

Published in: Technology

  1. Hadoop and HBase for Real-Time Video Analytics
     Suman Srinivasan
  2. About LongTail Video
     • Home of JW Player
       – JW Player is embedded on over 2 million sites
     • Founded in 2007
     • 32 employees
     • $5M investment
     • Headquartered in New York
     disney.co.uk, chevrolet.com
  3. JW Player - Key Features
     • Works on all mobile devices and desktops: Chrome, IE, Firefox, iOS, Android, etc.
     • Easy to customize, extend and embed: scripting API, PNG skinning, management dashboard
     • HD-quality, secure, adaptive streaming: utilizing Apple HTTP Live Streaming
     • Cross-platform advertising & analytics: VAST/VPAID, SiteCatalyst, Google
  4. JW Analytics: Numbers and Tech Stack
     JW Player numbers (version 6.0 and above), May 2013:
     • 156 million unique viewers (international)
     • 24 million unique viewers (USA)
     • 1.04 billion video streams (plays)
     • 29.94 million hours of video watched
     • 134,000 live domains
     • 16 billion analytics events
     • 20,000 simultaneous pings per second (peak)
     • 3 TB (gzip compressed) per month
     • 12-15 TB (uncompressed) per month
     Technology stack:
     • Runs completely in Amazon AWS
     • Master node & ping nodes in EC2; Hadoop and HBase clusters run in EMR
     • We upload data to and process from S3
     • Full-stack Python: boto (AWS S3, EMR), happybase (HBase)
     • Look ma, no Java!
  5. JW Analytics: Demo
     • Available to the public
     • Must be a registered user of JW Player (free included!)
     http://account.longtailvideo.com/
  6. Real-Time Analytics: The Holy Grail
     [Diagram: raw logs with player data → crunch data → insert into a DB → database → real-time querying]
  7. Why We Chose HBase
     • Goal: Build “Google Analytics for video”!
     • Requirements:
       – Fast queries across data sets
       – Support date-range queries
       – Store huge amounts of aggregate data
       – Flexibility in dimensions used for rollup tables
     • HBase! But why?
       – Open source! And good community!
         • Based on & closely integrated with Hadoop
       – Facebook uses it (as do other large companies)
       – Amazon AWS released a “hosted” HBase solution on EMR
  8. JW Analytics Architecture
  9. Schema: HBase Row-Key Design
     • Allows us to do date range queries
     • If we need new metrics, we just create a new table
       – Specify this in a JSON config file used by our Hadoop mapper
     • We don’t use column filters, secondary indexes, etc.
     • We do need to know the “prefix” ahead of time
     Row-key format: QueryString_yyyymmdd
     • QueryString: row prefix for a specific table
       – We need to know this ahead of time
       – Like the “WHERE” clause in SQL
     • yyyymmdd: date in yyyymmdd format
       – ISO 8601 dates sort lexicographically in date order, so date range scans are plain key-range scans (perfect for HBase)
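The row-key scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' code: the helper names `make_row_key` and `date_range_keys` are hypothetical, and the stop key is set to the day after the last requested date on the assumption that the scan treats its stop key as exclusive, as HBase scans do.

```python
from datetime import date, timedelta
from typing import Tuple


def make_row_key(prefix: str, day: date) -> str:
    """Build a key of the form <prefix>_<yyyymmdd> (hypothetical helper)."""
    return f"{prefix}_{day.strftime('%Y%m%d')}"


def date_range_keys(prefix: str, first: date, last: date) -> Tuple[str, str]:
    """Return (start, stop) keys covering first..last inclusive.

    The stop key is the day after `last`, because the stop row of a
    scan is assumed to be exclusive.
    """
    if last < first:
        raise ValueError("last must not be before first")
    return make_row_key(prefix, first), make_row_key(prefix, last + timedelta(days=1))


# One key per day from May 1 to June 30, 2013.
keys = [make_row_key("User1", date(2013, 5, 1) + timedelta(days=i)) for i in range(61)]

# Selecting May is a plain string comparison against the two boundary keys.
start, stop = date_range_keys("User1", date(2013, 5, 1), date(2013, 5, 31))
in_range = [k for k in keys if start <= k < stop]
```

Because the date part is zero-padded and fixed-width, sorting the keys as strings gives the same order as sorting by date, which is what lets a single contiguous scan answer a date-range query.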
  10. E.g.: A Tale of Two Tables (Domains, URLs)
      import happybase

      # SERVER: hostname of the HBase Thrift server
      conn = happybase.Connection(SERVER)

      # User1: "I want my list of domains from May 1 to May 31, 2013"
      t = conn.table("user_domains")
      # row_stop is exclusive, so stop at June 1 to include May 31
      t.scan(row_start="User1_20130501", row_stop="User1_20130601")
      # 'User1_20130501': {'cf:D1.com': '100', ...}

      # User1: "Oooh, D1.com looks interesting. Wonder which URLs
      # were popular for 2 months." <Click>
      t = conn.table("user_domain_urls")
      t.scan(row_start="User1_D1.com_20130501", row_stop="User1_D1.com_20130701")
      # 'User1_D1.com_20130501': {'cf:D1.com/url': '80'}
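A scan like the ones above yields one row per day, so a "domains for May" view still has to add the daily counters together. The sketch below shows one way to do that on the client side. It is an assumption about the post-processing step, which the slides do not show; `total_counts` and `top_n` are hypothetical names, and `sample_rows` stands in for real scan output.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple, Union

Cell = Union[str, bytes]


def total_counts(rows: Iterable[Tuple[Cell, Dict[Cell, Cell]]]) -> Counter:
    """Sum per-day counters across all (row_key, columns) pairs from a scan."""
    totals: Counter = Counter()
    for _row_key, columns in rows:
        for column, value in columns.items():
            # Counts are assumed to be stored as decimal strings, e.g. '100'.
            totals[column] += int(value)
    return totals


def top_n(totals: Counter, n: int) -> List[Tuple[Cell, int]]:
    """Return the n columns with the highest summed count."""
    return totals.most_common(n)


# Stand-in for the output of t.scan(...): two daily rows for User1.
sample_rows = [
    ("User1_20130501", {"cf:D1.com": "100", "cf:D2.com": "40"}),
    ("User1_20130502", {"cf:D1.com": "25", "cf:D3.com": "70"}),
]
totals = total_counts(sample_rows)
```

With a live connection, the same function could be fed the scan generator directly, for example `total_counts(t.scan(row_start=..., row_stop=...))`.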
  11. HBase + Thrift Setup
      [Diagram: one Master node and two Data nodes, each running Thrift (T); the API connects to the Master, Hadoop connects to the Data nodes]
      • Used for HBase RPC with non-Java languages (e.g.: Python!)
      • Thrift runs on all nodes in our HBase clusters
        – Thrift on Master is read-only: used by API
        – Thrift on Data Nodes is write-only: data inserts from Hadoop
      • We use batch puts/inserts to improve write speed
        – Our analytics is VERY write-intensive
      Thrift is …?
      • RPC framework developed at Facebook, now in wide use
      • NOT the Macklemore & Ryan Lewis music video (that’s Thrift Shop!)
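The batch puts mentioned above might look roughly like this with happybase, whose `Table.batch()` returns a context manager that buffers `put()` calls and sends them in groups. This is a sketch, not the presenters' code: `write_rows` is a hypothetical helper and the default batch size of 1000 is an arbitrary assumption.

```python
from typing import Dict, Iterable, Tuple


def write_rows(table, rows: Iterable[Tuple[str, Dict[str, str]]], batch_size: int = 1000) -> int:
    """Queue rows through a happybase-style batch; returns the number of rows queued.

    `table` is expected to behave like a happybase Table, i.e. to offer
    batch(batch_size=...) returning a context manager with put(row, data).
    """
    queued = 0
    # The batch flushes automatically once batch_size mutations are buffered;
    # leaving the with-block sends whatever is still pending.
    with table.batch(batch_size=batch_size) as batch:
        for row_key, columns in rows:
            batch.put(row_key, columns)
            queued += 1
    return queued
```

Against a real cluster this would be called as `write_rows(conn.table("user_domains"), rows)`, with `conn` pointing at the Thrift server on a data node. Grouping puts this way trades a little latency for far fewer Thrift round trips, which matters for a write-heavy workload.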
  12. What We Like About HBase
      • Giant, sorted key-value store
        – Hadoop output (also key-value!) can have a 1-to-1 correspondence to HBase
      • FAST lookups over large data set
        – O(1) lookup time to find key; lookups complete in ms across billion-plus rows
      • Usually retrieval is fast as well
        – But slow if data sets are large!
        – O(n). No simple way to solve this.
        – Most times you only need top N => can be solved through optimization of key
      [Diagram: “Data we want” inside “All HBase data”; O(1) lookup = fast! O(n) read = could be slow]
      Got good row-key design? HBase excels at finding needles in haystacks!
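The slides do not say how the key was optimized for top-N reads. One common approach, shown here purely as an assumption, is to embed an inverted, zero-padded count in the row key so that the largest counts sort first and a scan can stop after N rows. The names `top_n_key`, `parse_top_n_key` and the bound `MAX_COUNT` are hypothetical.

```python
from typing import Tuple

MAX_COUNT = 10**12 - 1       # assumed upper bound on any single count
WIDTH = len(str(MAX_COUNT))  # fixed width so string order matches numeric order


def top_n_key(prefix: str, yyyymmdd: str, count: int, item: str) -> str:
    """Build <prefix>_<yyyymmdd>_<MAX_COUNT - count>_<item>.

    A larger count gives a smaller inverted value, so it sorts earlier.
    """
    if not 0 <= count <= MAX_COUNT:
        raise ValueError("count out of range")
    inverted = MAX_COUNT - count
    return f"{prefix}_{yyyymmdd}_{inverted:0{WIDTH}d}_{item}"


def parse_top_n_key(key: str) -> Tuple[str, int]:
    """Recover (item, count); assumes the prefix itself contains no underscore."""
    _prefix, _day, inverted, item = key.split("_", 3)
    return item, MAX_COUNT - int(inverted)


counts = {"D1.com": 100, "D2.com": 40, "D3.com": 70}
# sorted() stands in for HBase's own lexicographic ordering of row keys.
keys = sorted(top_n_key("User1", "20130501", c, d) for d, c in counts.items())
top_two = [parse_top_n_key(k) for k in keys[:2]]
```

With happybase, reading the top N would then be a prefix scan with a row limit, for example `table.scan(row_prefix="User1_20130501_", limit=10)`. This only suits aggregates that are written once, such as batch output from Hadoop, because a changed count means a different row key.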
  13. Challenges With HBase
      • Most programmers prefer SQL queries, not list iteration
        – “Why can’t I do a SELECT * FROM domains WHERE …???”
      • Thrift server goes down under load
        – We wrote our own HBase Thrift watchdog script
      • We deal with pretty exotic bugs at scale…
        – … with sometimes one blog post documenting a fix.
        – When was the last time Google showed you one useful result?
      • Some things we dealt with (we are on HBase 0.92)
        – org.apache.hadoop.hbase.NotServingRegionException
          • SSH into master, clean out ZooKeeper meta-data, restart master.
          • Kinda scary the first time you actually do this?
        – java.util.concurrent.RejectedExecutionException (hbck)
          • Ticket #6018 (“hbck fails … when >50 regions present”); fixed in 0.94.1
        – org.apache.hadoop.hbase.MasterNotRunningException
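The watchdog script itself is not shown in the slides. A minimal sketch of the idea, assuming the check is a plain TCP connection attempt and the restart is an external command, could look like this. All function names are hypothetical, and the restart command is left as a parameter because the real one is unknown.

```python
import socket
import subprocess
import time
from typing import Callable, List


def thrift_is_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_once(is_up: Callable[[], bool], restart: Callable[[], object]) -> bool:
    """Run one check; call restart() if the server is down.

    Returns True if a restart was triggered.
    """
    if is_up():
        return False
    restart()
    return True


def run_watchdog(host: str, port: int, restart_cmd: List[str], interval: float = 30.0) -> None:
    """Poll forever, running restart_cmd whenever the port stops answering."""
    while True:
        check_once(
            lambda: thrift_is_up(host, port),
            lambda: subprocess.call(restart_cmd),
        )
        time.sleep(interval)
```

A call might look like `run_watchdog("localhost", 9090, ["/path/to/restart-thrift.sh"])`, where 9090 is the default HBase Thrift port and the script path is a placeholder. A TCP check only catches a server that is no longer listening; a hung server that still accepts connections would need a real Thrift request as the health check.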
  14. Conclusion
      • Real-time analytics on Hadoop and HBase
        – Handling 16 billion events a month (~15 TB data)
        – Inserting ~80 million data points into HBase daily
        – Running in production for 7 months!
        – Did I mention we built it on Python (& bash)?
      • Important lessons
        – Design your row key well (with room to iterate)
        – Give HBase as much memory/CPU as it needs
          • HBase is resource-hungry; better to over-provision
        – Back up frequently!
      Questions?