Complex realtime event analytics using BigQuery @Crunch Warmup

3,775 views

Published on

Complex event analytics solutions require massive architecture and the know-how to build a fast real-time computing system. Google BigQuery solves this problem by enabling super-fast, SQL-like queries against append-only tables, using the processing power of Google's infrastructure. In this presentation we will see how BigQuery meets our ultimate goal: store everything, accessible by SQL immediately, at petabyte scale. We will discuss some common use cases: funnels, user retention, affiliate metrics.

Published in: Software

  1. Complex Realtime Event Analytics using BigQuery
     Márton Kodok, Senior Software Engineer at REEA
     twitter: martonkodok, stackoverflow: pentium10, github: pentium10
     Crunch Warm Up - October 2015 - Budapest
  2. Agenda
     1. Big Data movement
     2. Analytics Project - Background
     3. Challenges - Why is it so hard?
     4. Approach - Strategy - Application
     5. Use Cases - Implementations
     6. Exploring Big Data (GDELT, Hackernews, Reddit)
  3. Big data analyses movement
     Every scientist who needs big data analytics to save millions of lives should have that power.
  4. Challenging experience
     The simple fact is that you are brilliant but your brilliant ideas require complex big data analytics.
  5. Project: One-size-fits-all problem
     Need a backend to store, query, extract for deep analytics:
     ● Events (product, app, site, email events)
     ● Achievements ("tag" users on the go, retention)
     ● Entities (split tests, user profiles, business entities)
     ● Metrics (app profiler data, custom)
     ● Email activity (click-map, engagement, ISP, Spam)
     ● 3rd party Analytics (good to have: Google Analytics)
     ● System-generated data (log file entries, unstructured)
  6. Desired system/platform
     ● Terabyte scalable storage
     ● Real-time event ingestion
     ● Ask sophisticated queries (optional: without Dev)
     ● Query-performance
     ● Low-maintenance
     ● Cost effective
     ● Wire them up easily
     Goal: Store everything accessible by SQL immediately.
  7. Equipment strategy
     ● In-House
     ● Hosted
     ● Managed
     * people still required
     Services:
     ❏ ELK Stack (Elasticsearch-Logstash-Kibana)...
     ❏ Cassandra, Hive, Hadoop...
     ❏ Amazon Redshift, Google BigQuery...
  8. Google BigQuery
  9. What is BigQuery?
     ● Analytics-as-a-Service - Data Warehouse in the Cloud
     ● Fully-Managed
     ● Scales into Petabytes
     ● Ridiculously fast
     ● Decent pricing (queries: $5/TB, storage: $20/TB)
     ● 100,000 rows/sec Streaming API
     * October 2015 pricing
  10. BigQuery: Big Data Analytics in the Cloud
      ● Convenience of SQL
      ● Familiar DB Structure (table, column, views, JSON)
      ● Open Interfaces (REST, Web UI, ODBC)
      ● Fast atomic imports JSON/CSV (file size up to 5TB)
      ● Simple data ingest from GCS or Hadoop
      ● Web UI + bq CLI
      ● Connectors: Hadoop, Tableau, R, Talend, Logstash
      ● US or EU zone
  11. BigQuery: Convenience of SQL/JSON/JS
      ● Append-only tables
      ● Batch load file size limits: 5TB (CSV or JSON)
      ● ACL - row level locking (individual or group based)
      ● Columnar storage (max 10,000 columns in table)
      ● Rich SQL: JSON, IP, Math, RegExp, Window functions
      ● Datatypes: String 2MB, Record, Nested …
      ● UDF (User defined functions): Javascript
      Note: Store what you can in columns, the rest in JSON.
  12. BigQuery Costs - October 2015
      * 1 Petabyte storage, 100 TB rows insert, 100 TB queries => 26,000 USD
      Queries
      ➔ 1 TB per month free
      ➔ 5 USD per TB
      ➔ only pay for the columns you use in your query
      Storage
      ➔ 20 USD per TB
      Ingestion
      ➔ Batch load free (CSV/JSON)
      ➔ Exporting free
      ➔ Table copy free
      ➔ 1 USD per 20TB data
      Estimate 1 - Storage 5 TB - Streaming Inserts 5TB - Queries 3 TB
      Monthly total: 110 USD
      Estimate 2 - Storage 20 TB - Streaming Inserts 10TB - Queries 10 TB
      Monthly total: 455 USD
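The arithmetic behind Estimate 1 can be sketched in a few lines of Python. This is only an illustration: the rates are copied as they appear on the slide (October 2015), and the function and constant names are hypothetical, not part of any BigQuery tooling.

```python
# Rates as listed on the slide (October 2015); treat them as assumptions.
QUERY_USD_PER_TB = 5.0
QUERY_FREE_TB = 1.0            # first TB of queries per month is free
STORAGE_USD_PER_TB = 20.0
STREAMING_USD_PER_20TB = 1.0   # streaming insert rate as written on the slide


def monthly_cost_usd(storage_tb, streaming_tb, query_tb):
    """Return the monthly bill in USD for the given volumes in TB."""
    storage = storage_tb * STORAGE_USD_PER_TB
    billable_query_tb = max(query_tb - QUERY_FREE_TB, 0.0)
    queries = billable_query_tb * QUERY_USD_PER_TB
    streaming = streaming_tb / 20.0 * STREAMING_USD_PER_20TB
    return storage + queries + streaming


# Estimate 1 from the slide: 5 TB storage, 5 TB streaming inserts, 3 TB queries.
estimate_1 = monthly_cost_usd(storage_tb=5, streaming_tb=5, query_tb=3)
print(estimate_1)  # 110.25
```

Storage dominates the bill in this estimate, and batch loads, exports and table copies add nothing because the slide lists them as free.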
  13. UDF - Power of Javascript
      ● impossible to express in SQL: Loops, complex conditionals, string parsing or transformations
      ● UDFs are similar to map functions in MapReduce
      ● inline JS or from GCS (gs://some-bucket/js/lib.js)
      Some UDF use cases:
      ● take one row and emit zero or more rows
      ● decoding URL-encoded strings
      ● text readability
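The map-style behaviour described above (one row in, zero or more rows out) can be sketched as follows. BigQuery UDFs are written in JavaScript; this is a Python stand-in for the idea only, and the row layout (`user_id`, a comma-separated `urls` field) is a hypothetical example, not taken from the slides.

```python
from urllib.parse import unquote


def decode_urls(row):
    """Map-style function: takes one row (a dict) and yields zero or more rows."""
    raw = row.get("urls")
    if not raw:
        return  # nothing to emit for this input row
    for encoded in raw.split(","):
        # Emit one output row per URL, with the URL-encoding removed.
        yield {"user_id": row["user_id"], "url": unquote(encoded)}


rows = [
    {"user_id": 1, "urls": "http%3A%2F%2Fexample.com%2Fa%20b,http%3A%2F%2Fexample.com%2Fc"},
    {"user_id": 2, "urls": ""},  # produces no output rows
]
decoded = [out for row in rows for out in decode_urls(row)]
print(decoded)
```

The first input row fans out into two output rows and the second produces none, which is the "zero or more rows" case from the use-case list.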
  14. Append only tables - Get last value
      1. Use aggregation MIN/MAX on timestamp to find first/last and join back to the same table.
      2. Use analytic functions FIRST_VALUE and LAST_VALUE.
           SELECT LAST_VALUE(email) OVER (PARTITION BY user_id ORDER BY timestamp ASC) AS email_last ...
      3. Use window functions.
           SELECT email, firstname, lastname
           FROM (
             SELECT email, firstname, lastname,
                    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp DESC) seqnum
             FROM [profile_event]
           )
           WHERE seqnum = 1
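What the queries above compute, namely the most recent row per user in an append-only table, can be illustrated in plain Python. This is a minimal sketch on in-memory data; the sample events and the helper name are hypothetical, and only the `user_id`, `timestamp` and `email` fields follow the slide.

```python
def latest_per_user(events):
    """Keep, for every user_id, the event with the largest timestamp."""
    latest = {}
    for event in events:
        current = latest.get(event["user_id"])
        if current is None or event["timestamp"] > current["timestamp"]:
            latest[event["user_id"]] = event
    return latest


# Append-only log: a profile change is a new row, never an update in place.
profile_events = [
    {"user_id": 1, "timestamp": 100, "email": "old@example.com"},
    {"user_id": 2, "timestamp": 110, "email": "b@example.com"},
    {"user_id": 1, "timestamp": 200, "email": "new@example.com"},
]
latest = latest_per_user(profile_events)
print(latest[1]["email"])  # new@example.com
```

This corresponds to option 3: ordering each user's rows by timestamp descending and keeping the row numbered 1.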
  15. Table wildcard functions
      This example assumes the following tables exist:
      ● mydata.people20140323
      ● mydata.people20140324
      ● mydata.people20140325
        SELECT name
        FROM (TABLE_DATE_RANGE(mydata.people,
                               DATE_ADD(CURRENT_TIMESTAMP(), -2, 'DAY'),
                               CURRENT_TIMESTAMP()))
        WHERE age >= 35
      # ... another example with RegExp
        ... FROM (TABLE_QUERY(mydata, 'REGEXP_MATCH(table_id, r"^boo[\d]{3,5}")'))
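To make the date-range wildcard concrete, the sketch below lists the daily table names that such a range would cover, assuming tables are named as a prefix followed by a YYYYMMDD suffix as in the example above. The helper is hypothetical and only mimics the naming logic; it does not call BigQuery.

```python
from datetime import date, timedelta


def tables_in_date_range(prefix, start, end):
    """List daily table names prefix + YYYYMMDD for every day from start to end inclusive."""
    days = (end - start).days
    return [
        prefix + (start + timedelta(days=offset)).strftime("%Y%m%d")
        for offset in range(days + 1)
    ]


tables = tables_in_date_range("mydata.people", date(2014, 3, 23), date(2014, 3, 25))
print(tables)
# ['mydata.people20140323', 'mydata.people20140324', 'mydata.people20140325']
```

Sharding by day this way lets a query touch only the tables inside the requested window instead of scanning the full history.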
  16. Infrastructure
  17. Schema modelling
      +--------------------------+-----------+----------+
      | order_id                 | INTEGER   | REQUIRED |
      | ...                      |           |          |
      | products                 | RECORD    | REPEATED |
      | products.product_id      | INTEGER   | NULLABLE |
      | products.attributes      | STRING    | REPEATED |
      | products.price           | FLOAT     | NULLABLE |
      | products.name            | STRING    | NULLABLE |
      | ...                      |           |          |
      | common                   | RECORD    | NULLABLE |
      | common.insert_id         | INTEGER   | REQUIRED |
      | common.tenant            | INTEGER   | REQUIRED |
      | common.event             | INTEGER   | REQUIRED |
      | common.user_id           | INTEGER   | REQUIRED |
      | common.timestamp         | TIMESTAMP | REQUIRED |
      | ....                     |           |          |
      | common.utm               | RECORD    | NULLABLE |
      | common.utm.source        | STRING    | NULLABLE |
      | common.utm.medium        | STRING    | NULLABLE |
      | common.utm.campaign      | STRING    | NULLABLE |
      | common.utm.content       | STRING    | NULLABLE |
      | common.utm.term          | STRING    | NULLABLE |
      | meta                     | STRING    | NULLABLE |
      +--------------------------+-----------+----------+
  18. Streaming insert time (ms) - last 6M
  19. Achievements
      ● Funnel Analysis
  20. Attribute orders to first article visited
      Example:
      ● article1 -> page2 -> page3 -> page4 -> orderpage1 -> thankyoupage1
      ● page1 -> article2 -> page3 -> orderpage2 -> ...
      Problem: When an order is made, attribute a credit to the first article visited by that user!
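The attribution problem above can be sketched in plain Python before expressing it in SQL. This is an illustration under stated assumptions: page names starting with `article` and `orderpage` identify article views and orders (following the slide's example paths), and the user ids and function name are hypothetical.

```python
from collections import Counter


def credit_first_article(sessions):
    """For every order, credit the first article that user visited earlier."""
    credits = Counter()
    for pages in sessions.values():
        first_article = None
        for page in pages:
            if first_article is None and page.startswith("article"):
                first_article = page  # remember only the first article seen
            elif page.startswith("orderpage") and first_article is not None:
                credits[first_article] += 1
    return credits


# Paths modelled on the slide's example, in visit order per user.
visits = {
    "user_a": ["article1", "page2", "page3", "page4", "orderpage1", "thankyoupage1"],
    "user_b": ["page1", "article2", "page3", "orderpage2"],
}
credits = credit_first_article(visits)
print(credits)
```

Each order credits exactly one article, and orders placed before any article visit are left uncredited in this sketch.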
  21. Achievements
      ● Funnel Analysis
      ● Email URL click heatmap
  22. Email URL clicks map (79 GB in 2.4 sec)
  23. Achievements Continued
      ● Funnel Analysis
      ● Email URL click heatmap
      ● Email Dashboard (Trends, SPAM, ISP deferral)
      ● Split tests (by content, region, device, during the day)
      ● Ability for advanced segmentation as all raw data is stored
      ● Behavioral analytics (engaged users, recommendations)
  24. Our benefits
      ● no provisioning/deploy
      ● no running out of resources
      ● no more focus on large scale execution plan
      ● no need to re-implement tricky concepts (time windows / join streams)
      ● pay only for the columns used in our queries
      ● run raw ad-hoc queries (either by analysts/sales or Devs)
      ● no more throwing away, expiring, or aggregating old data
  25. BigQuery: Sample projects to try out
      1. githubarchive.org: 20+ event types available since 2012
         a. pull request latency
         b. expressions, emotions in commit messages
      2. httparchive.org: Trends in web technology
         a. popular scripts
         b. website performance
      3. raw Google Analytics data (*only Premium Customers)
      4. GDELT - Global Database of Events, Language, and Tone
         GKG - Global Knowledge Graph
      5. GSOD - samples of weather (rainfall, temp…)
      6. 1.6 billion Reddit comments
      7. Hackernews data
      8. Wikipedia edits
  26. HttpArchive - .HU Javascript frameworks
  27. GDELT - News Coverage: Orbán Viktor
  28. GDELT - News Coverage: Beata Szydlo
  29. Reddit - books community talks about
  30. Questions? Thank you.
