IV. IT&C Innovation Conference - October 2016 - Sovata, Romania
Google BigQuery for Everyday Developer
1. Google BigQuery
for the Everyday Developer
Márton Kodok
Senior Software Engineer at REEA
twitter: martonkodok stackoverflow: pentium10 github: @pentium10
IV. IT&C Innovation Conference - October 2016 - Sovata, Romania
2. Agenda
1. Big Data Challenges
2. Infrastructure overview
3. Choosing a Strategy
4. BigQuery - Use Cases
5. Q & A
BigQuery for the Everyday Developer @martonkodok
3. The 5 V’s of Big Data
4. Big Data movement
Every scientist who needs big data analytics to save millions of lives should have that power.
Legacy systems don’t provide the power.
The simple fact is that you are brilliant but your brilliant ideas require complex analytics.
Traditional solutions are not applicable.
5. Fast computing systems
With more and more people realizing the value of realtime analytics, the race to build realtime middleware has begun.
The Plan: have oversight over developments as they happen.
* More slides about Beanstalkd at slideshare.net/martonkodok
6. Simple legacy Website stack/infrastructure
7. Infrastructure with Middleware
8. Requirements
Task: Need a backend/database to STORE, QUERY, and EXTRACT data for deep analytics:
● Events (website events, app events, email events)
● Entities (user data, business entities, A/B split test data)
● Metrics (sales data, code commit history, profiler data)
● Streaming (sensor data, IoT data burst)
● 3rd party Analytics (Google Analytics, Mixpanel)
● Locally generated data (log file entries, trace output, unstructured data...)
❏ Run ad-hoc queries (without a developer)
❏ Be realtime
9. Desired system/platform
● Terabyte scalable storage
● Real-time row ingestion
● Ask sophisticated queries
● Query-performance
● Low-maintenance
● Cost effective
● Wire them up easily
Goal: Store everything accessible by SQL immediately.
Equipment strategy: In-House / Hosted / Managed
* people still required
Engines:
❏ MongoDB, Riak, Redis
❏ ELK Stack (Elasticsearch-Logstash-Kibana)...
❏ Cassandra, Hive, Hadoop...
❏ Amazon RedShift, Google BigQuery...
10. Google BigQuery
11. What is BigQuery?
● Analytics-as-a-Service - Data Warehouse in the Cloud
● Fully managed by Google (US or EU zone)
● Scales into petabytes
● Ridiculously fast
● Decent pricing (queries: $5/TB, storage: $20/TB) *October 2016 pricing
● 100,000 rows/sec Streaming API
● Open interfaces (Web UI, BQ command line tool, REST, ODBC)
● Familiar DB structure (table, views, record, nested, JSON)
● Convenience of SQL + JavaScript UDF (User Defined Functions)
● Integrates with Google Sheets + Google Cloud Storage + Pub/Sub connectors
● Client libraries available in YFL (your favorite languages)
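As a rough illustration of the Streaming API and REST interface listed on this slide, the sketch below builds the JSON body for a `tabledata.insertAll` REST request. The helper name `build_insert_all_body` and the sample event rows are hypothetical; only the body fields (`kind`, `rows`, `insertId`, `json`) follow the REST API. Nothing is sent over the network here.

```python
import json
import uuid


def build_insert_all_body(rows, insert_ids=None):
    """Build the request body for a BigQuery tabledata.insertAll call.

    rows: list of dicts, one per table row (hypothetical sample schema).
    insert_ids: optional ids that BigQuery uses for best-effort
    de-duplication; random UUIDs are generated when omitted.
    """
    if insert_ids is None:
        insert_ids = [str(uuid.uuid4()) for _ in rows]
    if len(insert_ids) != len(rows):
        raise ValueError("need exactly one insert id per row")
    return {
        "kind": "bigquery#tableDataInsertAllRequest",
        "rows": [
            {"insertId": insert_id, "json": row}
            for insert_id, row in zip(insert_ids, rows)
        ],
    }


# Hypothetical website events, as they might be streamed row by row.
events = [
    {"user_id": 1, "event": "pageview", "url": "/article1"},
    {"user_id": 1, "event": "click", "url": "/orderpage1"},
]
body = build_insert_all_body(events, insert_ids=["evt-1", "evt-2"])
print(json.dumps(body, indent=2))
```

In practice a client library would usually assemble and send this request; the sketch only shows the shape of one streamed batch.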
12. BigQuery: Convenience of SQL
● ACL - row level locking (individual or group based)
● Columnar storage (max 10 000 columns in table)
● Rich SQL: JSON, IP, Math, RegExp, Window functions
● Data types: String, Integer, Float, Boolean, Timestamp, Record, Nested, Struct, Array.
● Append-only tables preferred (DML syntax available)
● Batch load file size limits: 1TB (CSV or JSON)
● User Defined Functions in SQL or JavaScript
● Date partitioned tables
13. BigQuery Costs - October 2016
Queries
➔ 1 TB per month free
➔ 5 USD per TB
➔ only pay for the columns you use in your query
Storage
➔ 20 USD per TB frequently accessed data
➔ 10 USD per TB long-term storage (after 90 days)
Ingestion
➔ Batch load free (CSV/JSON)
➔ Exporting free
➔ Table copy free
➔ 50 USD per TB (streaming inserts)
Estimate 1
- Storage 5 TB
- Streaming Inserts 1 TB
- Queries 3 TB
Monthly total: $165
Estimate 2
- Storage 24 TB
- Streaming Inserts 1 TB
- Queries 50 TB
Monthly total: $788
* 1 Petabyte storage, 10 TB inserts, 100 TB queries => $22000
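The estimates on this slide can be approximated with simple arithmetic. The sketch below is a minimal model that assumes only the flat October 2016 rates quoted in the deck ($20/TB storage, $50/TB streaming inserts, $5/TB queries) and ignores the free query tier and the long-term storage discount; the function name `monthly_cost` is illustrative. Under these assumptions Estimate 1 comes to $165, matching the slide, while Estimate 2 comes to $780 rather than the slide's $788, presumably because the pricing calculator applies details (unit conversion, free tier) that are not modeled here.

```python
# Flat per-TB rates quoted in the deck (October 2016 pricing, USD).
QUERY_USD_PER_TB = 5
STORAGE_USD_PER_TB = 20
STREAMING_USD_PER_TB = 50


def monthly_cost(storage_tb, streaming_tb, queries_tb):
    """Estimate a monthly bill in USD using the flat rates only.

    Assumptions: no free query tier, no long-term storage discount,
    and each TB billed as exactly one unit.
    """
    return (storage_tb * STORAGE_USD_PER_TB
            + streaming_tb * STREAMING_USD_PER_TB
            + queries_tb * QUERY_USD_PER_TB)


print(monthly_cost(5, 1, 3))    # Estimate 1 -> 165
print(monthly_cost(24, 1, 50))  # Estimate 2 -> 780
```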
15. Append only tables - Get last value
1. Use aggregation MIN/MAX on timestamp to find first/last and join back to the same table.
2. Use analytic functions FIRST_VALUE and LAST_VALUE when only one column is needed
    SELECT LAST_VALUE(email) OVER(
        PARTITION BY user_id
        ORDER BY timestamp ASC) AS email_last ...
3. Use window functions for full row selection
    SELECT email, firstname, lastname
    FROM
      (SELECT email, firstname, lastname,
              ROW_NUMBER() OVER (PARTITION BY user_id
                                 ORDER BY timestamp DESC) seqnum
       FROM [profile_event]
      )
    WHERE seqnum=1
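The first approach (aggregate MAX on the timestamp, then join back to the same table) has no query on the slide, so here is a minimal sketch of it. It runs against an in-memory SQLite database rather than BigQuery, so the SQL dialect differs from the legacy BigQuery syntax used on the slide; the column name `ts` and the sample rows are made up for illustration.

```python
import sqlite3

# In-memory stand-in for an append-only profile_event table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE profile_event (user_id INTEGER, email TEXT, ts INTEGER)"
)
conn.executemany(
    "INSERT INTO profile_event VALUES (?, ?, ?)",
    [
        (1, "old@example.com", 100),   # earlier version of user 1
        (1, "new@example.com", 200),   # latest version of user 1
        (2, "only@example.com", 150),  # user 2 has a single row
    ],
)

# Find the latest timestamp per user, then join back to fetch that row.
LATEST_PER_USER = """
SELECT p.user_id, p.email
FROM profile_event AS p
JOIN (
    SELECT user_id, MAX(ts) AS max_ts
    FROM profile_event
    GROUP BY user_id
) AS latest
  ON p.user_id = latest.user_id AND p.ts = latest.max_ts
ORDER BY p.user_id
"""

latest_rows = conn.execute(LATEST_PER_USER).fetchall()
print(latest_rows)  # [(1, 'new@example.com'), (2, 'only@example.com')]
```

If two rows for the same user share the maximum timestamp, this join returns both of them, which is one reason the window-function variant with a row number can be preferable.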
16. Streaming insert time (ms) - last year
18. Funnel analysis: Time on upsell pages
19. Attribute credit to first article visited on purchase
Example HITS chain:
● article1 -> page2 -> page3 -> page4 -> orderpage1 -> thankyoupage1
● page1 -> article2 -> page3 -> orderpage2 -> ...
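The attribution rule on this slide can be sketched in a few lines: walk a visitor's hit chain in order, remember the first article seen, and credit it once an order page is reached. This is only an illustration in plain Python, not the query used in the talk; the function name and the assumption that page types can be recognised by their name prefix (`article`, `orderpage`) are hypothetical.

```python
def first_article_before_order(hits):
    """Return the first article hit that precedes an order page, or None.

    Assumption: page types are recognised by name prefix
    ("article", "orderpage"), as in the example chains.
    """
    first_article = None
    for page in hits:
        if first_article is None and page.startswith("article"):
            first_article = page
        if page.startswith("orderpage"):
            # Purchase reached: credit whatever article came first.
            return first_article
    return None  # no purchase in this chain


chain_a = ["article1", "page2", "page3", "page4", "orderpage1", "thankyoupage1"]
chain_b = ["page1", "article2", "page3", "orderpage2"]
print(first_article_before_order(chain_a))  # article1
print(first_article_before_order(chain_b))  # article2
```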
20. Achievements
● Funnel Analysis
● Email URL click heatmap
21. Email URL clicks heat-map
22. Achievements Continued
● Funnel Analysis
● Email URL click heatmap
● Email Health Dashboard (SPAM, ISP deferral, content A/B split tests, trends or low open rate campaigns)
● Advanced segmentation (all raw data stored)
● Behavioral analytics - engaged users etc...
23. Our benefits
● no provisioning/deploy
● no running out of resources
● no more focus on large scale execution plan
● no need to re-implement tricky concepts (time windows / join streams)
● pay only for the columns we use in our queries
● run raw ad-hoc queries (either by analysts/sales or devs)
● no more throwing away, expiring, or aggregating old data
24. BigQuery: Sample projects to try out
1. githubarchive.org: 20+ event types available since 2012
a. pull request latency
b. expressions, emotions in commit messages
c. etc...
2. httparchive.org
a. sites that use multiple popular JS frameworks
b. sites that use multiple jQuery versions
3. Export for Google Analytics
4. MLab - broadband connection performance (26B rows)
5. GSOD - samples of weather (rainfall, temp…)
6. US Natality - 1969 to 2008
7. Wikipedia edits
25. HttpArchive - multiple JS frameworks
26. HttpArchive - multiple jQuery versions