HANDS ON WITH BIGQUERY JAVASCRIPT UDFS
THOMAS PARK
SOFTWARE ENGINEER - GOOGLE
Hands-on with BigQuery JavaScript
User-Defined Functions
Thomas Park
Software Engineer - Google
Felipe Hoffa
@felipehoffa
Developer Advocate - Google
Agenda
I. Background
II. Example: Cross-row intervals
III. Under the hood
IV. Example: Codebreaking
What is BigQuery?
BigQuery: Big Data Analytics in the Cloud
Unrivaled Performance and Scale
● Scan multiple TBs in seconds
● Interactive query performance
● No limits on amount of data
Ease of Use and Adoption
● No administration / provisioning
● Convenience of SQL
● Open interfaces (REST, WebUI, ODBC)
● First 1 TB of data processed per month is free
Advanced “Big Data” Storage
● Familiar database structure
● Easy data management and ACLs
● Fast, atomic imports
Google confidential │ Do not distribute
How many pageviews does Wikipedia have in a month?
SELECT COUNT(*) FROM
[fh-bigquery:wikipedia.wikipedia_views_201308]
https://bigquery.cloud.google.com/table/fh-bigquery:wikipedia.pagecounts_20140602_18
$500 in Cloud Platform credit to launch your idea!
Build. Store. Analyze. On the same infrastructure that powers Google
Starter Pack Offer Description
1. Go to cloud.google.com/developers/starterpack
2. Click ‘Apply Now’ and complete the application with promo code: bigdata-spain
3. Start building
II. Example: Cross-row intervals
Images by Connie Zhou
Image by Tod Kurt
Scenario: Door access records from a Very Well-Secured Lab where users must badge in to enter or leave
Example: Time-series analysis from discrete user action data
event                                      user_id  timestamp
Beep!! 9h: arrive @ lab                    thomas   2014.07.15 09:00
Beep!! 10h: leave to pick up prototype     thomas   2014.07.15 10:00
Beep!! 10h15: return with prototype        thomas   2014.07.15 10:15
Beep!! 12h: out for lunch                  thomas   2014.07.15 12:00
How can we find out how much time
each user spent in the lab?
...where each scan of the user’s access card is
represented as a discrete row?
rownum  user_id  timestamp
1       thomas   2014.07.15 09:00
2       thomas   2014.07.15 10:00
3       thomas   2014.07.15 10:15
4       thomas   2014.07.15 12:00
Rows 1 to 2: 60 minutes
Rows 3 to 4: 105 minutes
Analyzing data in this format with SQL alone is horrid and painful
A BigQuery + JS friendly format: one row per user, holding all of that user's timestamps
user_id timestamps
thomas [ 09:00, 10:00, 10:15, 12:00, ... ]
hoffa [ 08:10, 11:30, 12:00, 12:15, ... ]
SELECT user_id, NEST(timestamp) AS timestamps
FROM T
GROUP BY user_id;
Producing this format
is trivial in BigQuery...
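To make the reshaping concrete, here is a minimal local sketch of what the NEST + GROUP BY query above produces: one object per user carrying an array of that user's timestamps. The helper name nestByUser and the sample rows are hypothetical, and this is plain JavaScript run outside BigQuery, not how BigQuery implements NEST.

```javascript
// Hypothetical local emulation of: SELECT user_id, NEST(timestamp) ... GROUP BY user_id
function nestByUser(rows) {
  var byUser = {};
  var order = [];  // Remember first-seen order of users for a stable result.
  rows.forEach(function (row) {
    if (!byUser.hasOwnProperty(row.user_id)) {
      byUser[row.user_id] = [];
      order.push(row.user_id);
    }
    byUser[row.user_id].push(row.timestamp);
  });
  return order.map(function (user_id) {
    return {user_id: user_id, timestamps: byUser[user_id]};
  });
}

// Sample scans (illustrative values taken from the tables above).
var scans = [
  {user_id: 'thomas', timestamp: '09:00'},
  {user_id: 'hoffa', timestamp: '08:10'},
  {user_id: 'thomas', timestamp: '10:00'},
  {user_id: 'hoffa', timestamp: '11:30'}
];
var nested = nestByUser(scans);
console.log(JSON.stringify(nested));
```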
// This function will be called once for each user,
// and receive an array of timestamps.
function(record, emit) {
  var total_time = 0;
  // The order of values built by NEST is not guaranteed!
  // Sort with a numeric comparator to guarantee ascending timestamps.
  var ts = record.timestamps.sort(
      function (a, b) {return a - b;});
  // Loop over (enter, leave) timestamp pairs, calculate each interval.
  for (var i = 0; i < ts.length - 1; i += 2) {
    total_time += (ts[i+1] - ts[i]);
  }
  // Emit total time for this user, using the output schema's field names.
  emit({user_id: record.user_id,
        tot_time: total_time});
}
JS: Total time for each user
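One way to sanity-check a function like this before pasting it into a query is to run it locally with a stub emit. The sketch below is an assumption-laden harness, not part of BigQuery: the name totalTimeUdf is hypothetical, the field names follow the input columns and output schema of the js() call, and timestamps are represented as minutes since midnight purely for illustration.

```javascript
// Named copy of the UDF so it can be called locally (name is hypothetical).
function totalTimeUdf(record, emit) {
  var total_time = 0;
  // Copy before sorting so the caller's array is left untouched.
  var ts = record.timestamps.slice().sort(function (a, b) { return a - b; });
  // Each (enter, leave) pair contributes one interval.
  for (var i = 0; i < ts.length - 1; i += 2) {
    total_time += ts[i + 1] - ts[i];
  }
  emit({user_id: record.user_id, tot_time: total_time});
}

// Assumption: timestamps are minutes since midnight, deliberately unsorted.
// 09:00 = 540, 10:00 = 600, 10:15 = 615, 12:00 = 720.
var rows = [{user_id: 'thomas', timestamps: [600, 540, 720, 615]}];

// Stub emit: collect whatever the function emits.
var emitted = [];
rows.forEach(function (row) {
  totalTimeUdf(row, function (out) { emitted.push(out); });
});
console.log(emitted);  // one row for thomas: 60 + 105 = 165 minutes
```

A comparator that returns a boolean, such as a > b, never returns a negative number, so engines are free to leave the array in the wrong order; returning a - b avoids that.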
SELECT * FROM
js(
  // Input table or query.
  [secret-lab:door_scans.201411],
  // Input columns.
  user_id, timestamps,
  // Output schema.
  "[{name: 'user_id', type:'string'},
    {name: 'tot_time', type:'integer'}]",
  // The function.
  "function(r, emit) {
     ...
     emit(...);
   }")
The parts of the js() call, in argument order:
● Input table or subquery
● Input schema (column names only!)
● Output schema (full declaration)
● The JS function
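The four parts of the js() call can be mimicked locally to see how they fit together. The sketch below is a hypothetical stand-in named runJs, not a BigQuery API: it assumes that only the listed input columns reach the function and that emitted rows are shaped by the declared output field names. How the real service treats undeclared or missing fields is not covered in the slides, so that behavior here is a simplification.

```javascript
// Hypothetical local stand-in for js(input, columns..., schema, function).
function runJs(inputRows, inputColumns, outputSchema, fn) {
  var outputNames = outputSchema.map(function (field) { return field.name; });
  var results = [];
  inputRows.forEach(function (row) {
    // Input schema: column names only, so project just those columns.
    var record = {};
    inputColumns.forEach(function (name) { record[name] = row[name]; });
    fn(record, function emit(out) {
      // Output schema: keep declared fields; missing ones become null (assumption).
      var shaped = {};
      outputNames.forEach(function (name) {
        shaped[name] = out.hasOwnProperty(name) ? out[name] : null;
      });
      results.push(shaped);
    });
  });
  return results;
}

// Usage with illustrative data (08:10 = 490 and 11:30 = 690 minutes since midnight).
var out = runJs(
  [{user_id: 'hoffa', timestamps: [490, 690], badge_color: 'red'}],
  ['user_id', 'timestamps'],
  [{name: 'user_id', type: 'string'}, {name: 'tot_time', type: 'integer'}],
  function (r, emit) {
    emit({user_id: r.user_id,
          tot_time: r.timestamps[1] - r.timestamps[0],
          columns_seen: Object.keys(r).length});
  });
console.log(out);
```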
III. Under the hood
How BigQuery works
Get data from lower levels,
filter / join / transform,
send rows up
Tree Structured Query Dispatch and Aggregation
● Mixer 0: SUM(requests) GROUP BY title, ORDER BY c DESC, LIMIT 10
● Mixer 1 (x2): SUM(requests) GROUP BY title
● Leaf (x4): SELECT title, requests; WHERE REGEX_MATCH(title, 'pat.*rn'); SUM(requests) GROUP BY title
● Distributed Storage
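The dispatch tree can be pictured as partial aggregation at each level. The following is a conceptual sketch only, with hypothetical function names (leaf, mix, root) and made-up shard data; it illustrates the idea of leaves filtering and pre-summing, mixers merging partial sums, and the root sorting and limiting, and makes no claim about BigQuery's actual implementation.

```javascript
// Leaf: filter rows, then compute a partial SUM(requests) GROUP BY title.
function leaf(rows, pattern) {
  var partial = {};
  rows.forEach(function (row) {
    if (pattern.test(row.title)) {
      partial[row.title] = (partial[row.title] || 0) + row.requests;
    }
  });
  return partial;
}

// Mixer: merge partial sums coming up from the level below.
function mix(partials) {
  var merged = {};
  partials.forEach(function (partial) {
    Object.keys(partial).forEach(function (title) {
      merged[title] = (merged[title] || 0) + partial[title];
    });
  });
  return merged;
}

// Root mixer: final merge, then ORDER BY c DESC and LIMIT.
function root(partials, limit) {
  var merged = mix(partials);
  return Object.keys(merged)
    .map(function (title) { return {title: title, c: merged[title]}; })
    .sort(function (a, b) { return b.c - a.c; })
    .slice(0, limit);
}

// Four illustrative shards, one per leaf.
var pattern = /pat.*rn/;
var shards = [
  [{title: 'pattern', requests: 3}, {title: 'other', requests: 9}],
  [{title: 'pattern', requests: 4}, {title: 'pat-return', requests: 5}],
  [{title: 'pat-return', requests: 1}],
  [{title: 'other', requests: 2}, {title: 'pattern', requests: 1}]
];
var leaves = shards.map(function (rows) { return leaf(rows, pattern); });
var mixer1a = mix([leaves[0], leaves[1]]);
var mixer1b = mix([leaves[2], leaves[3]]);
var top = root([mixer1a, mixer1b], 10);
console.log(top);
```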
Data for each row is calculated and
streamed through a “Row Iterator”
[Diagram: Subquery0 (Row Iterator 0) and Subquery1 (Row Iterator 1) feed a JOIN, whose output is Row Iterator 2]
Can insert JavaScript Functions
wherever we have a Row Iterator
[Diagram: the same query tree (Subquery0, Subquery1, JOIN, Row Iterators 0 to 2) with UDF0 and UDF1 attached to row iterators]
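The row-iterator idea can be sketched with JavaScript generators: anything that yields rows can be wrapped by a function that calls a UDF per row and yields whatever it emits. This is a conceptual model under stated assumptions; tableIterator and udfIterator are hypothetical names, and BigQuery's internal iterators are not exposed like this.

```javascript
// A "row iterator" modeled as a generator of row objects.
function* tableIterator(rows) {
  for (const row of rows) yield row;
}

// Wrap any row iterator with a UDF; each emit() becomes one downstream row.
// A UDF may emit zero, one, or many rows per input row.
function* udfIterator(source, udf) {
  for (const row of source) {
    const emitted = [];
    udf(row, function emit(out) { emitted.push(out); });
    yield* emitted;
  }
}

// Illustrative pipeline: keep odd values and double them.
const source = tableIterator([{n: 1}, {n: 2}, {n: 3}]);
const oddsDoubled = udfIterator(source, function (row, emit) {
  if (row.n % 2 === 1) emit({n: row.n * 2});
});
const result = Array.from(oddsDoubled);
console.log(result);
```

Because the wrapper is itself a row iterator, wrapped iterators can be stacked or fed into further steps, which is the property the slide relies on.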
Join order item info with web hits info
[Diagram: "SELECT item FROM orders" (Row Iterator 0) and "SELECT query string FROM hits" (Row Iterator 1) feed a JOIN (Row Iterator 2), with UDF0 and UDF1 attached]
http://www.store.com/?q=7%2e1+Speakers
Extract and decode query term => “7.1 Speakers”
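A minimal sketch of that extraction step in plain JavaScript follows. The helper name extractQueryTerm, the UDF wrapper decodeHitUdf, and the column names url and term are hypothetical; the slides do not show the actual UDF. It assumes the search term lives in a q parameter and that + encodes a space, as in the example URL.

```javascript
// Pull the q= parameter out of a URL and decode it.
function extractQueryTerm(url) {
  var match = /[?&]q=([^&#]*)/.exec(url);
  if (!match) return null;
  // In query strings '+' stands for a space; decodeURIComponent handles %xx escapes.
  return decodeURIComponent(match[1].replace(/\+/g, ' '));
}

// UDF-shaped wrapper (hypothetical column names: url in, term out).
function decodeHitUdf(row, emit) {
  emit({term: extractQueryTerm(row.url)});
}

var decoded = [];
decodeHitUdf({url: 'http://www.store.com/?q=7%2e1+Speakers'},
             function (out) { decoded.push(out); });
console.log(decoded[0].term);  // 7.1 Speakers
```

Once decoded, the term is in a form that can be compared with the item names from the orders side of the join.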
UDF execution
[Diagram: Subquery0, Subquery1, JOIN, UDF0, and UDF1, with a process boundary separating the query engine from user code]
IV. Example: Codebreaking
Demos:
Ñ
Image: El Hormiguero (Flickr CC)
http://jsfiddle.net/fhoffa/y4pt9s23/
Image: TheVanCats (Flickr CC)
Questions?
News: reddit.com/r/bigquery
Ask: stackoverflow.com
Share: bigqueri.es
Thomas Park
Felipe Hoffa @felipehoffa +FelipeHoffa
Rate us?
http://goo.gl/k3bzdw
Backup slides / screenshots
17th ~ 18th Nov 2014, Madrid (Spain)

BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at Big Data Spain 2014

  • 1. HANDS ON WITH BIGQUERY JAVASCRIPT UDFS THOMAS PARK SOFTWARE ENGINEER - GOOGLE
  • 2. Hands-on with BigQuery JavaScript User-Defined Functions Thomas Park Software Engineer - Google Felipe Hoffa @felipehoffa Developer Advocate - Google
  • 3. Agenda Background Example: Cross-row intervals Under the hood Example: Codebreaking I. II. III. IV.
  • 4. Agenda Background Example: Cross-row intervals Under the hood Example: Codebreaking I. II. III. IV.
  • 6. BigQuery: Big Data Analytics in the Cloud
      Unrivaled Performance and Scale
        ● Scan multiple TBs in seconds
        ● Interactive query performance
        ● No limits on amount of data
      Ease of Use and Adoption
        ● No administration / provisioning
        ● Convenience of SQL
        ● Open interfaces (REST, WebUI, ODBC)
        ● First 1 TB of data processed per month is free
      Advanced “Big Data” Storage
        ● Familiar database structure
        ● Easy data management and ACLs
        ● Fast, atomic imports
  • 7. Google confidential │ Do not distribute How many pageviews does Wikipedia have in a month? SELECT COUNT(*) FROM [fh-bigquery:wikipedia.wikipedia_views_201308] https://bigquery.cloud.google.com/table/fh-bigquery:wikipedia.pagecounts_20140602_18
  • 8. $500 in Cloud Platform credit to launch your idea! Build. Store. Analyze. On the same infrastructure that powers Google. Start building with the Starter Pack offer: go to cloud.google.com/developers/starterpack, click ‘Apply Now’, and complete the application with promo code: bigdata-spain
  • 9. Agenda: I. Background; II. Example: Cross-row intervals; III. Under the hood; IV. Example: Codebreaking
  • 10. Scenario: Door access records from a Very Well-Secured Lab where users must badge in to enter or leave (Images by Connie Zhou; image by Tod Kurt)
  • 11. Example: Time-series analysis from discrete user action data (Images by Connie Zhou; image by Tod Kurt)
  • 12. (badge scan)                                 user_id  timestamp
        Beep!! 9h: arrive @ lab                      thomas   2014.07.15 09:00
        Beep!! 10h: leave to pick up prototype       thomas   2014.07.15 10:00
        Beep!! 10h15: return with prototype          thomas   2014.07.15 10:15
        Beep!! 12h: out for lunch                    thomas   2014.07.15 12:00
  • 13. How can we find out how much time each user spent in the lab? ...where each scan of the user’s access card is represented as a discrete row?
  • 14. rownum  user_id  timestamp
        1       thomas   2014.07.15 09:00
        2       thomas   2014.07.15 10:00
        3       thomas   2014.07.15 10:15
        4       thomas   2014.07.15 12:00
      Rows 1 to 2 span 60 minutes; rows 3 to 4 span 105 minutes.
      Analyzing data in this format with SQL alone is horrid and painful.
      A BigQuery + JS friendly format: all timestamps for each user in that user's own row
        user_id  timestamps
        thomas   [ 09:00, 10:00, 10:15, 12:00, ... ]
        hoffa    [ 08:10, 11:30, 12:00, 12:15, ... ]
      Producing this format is trivial in BigQuery:
        SELECT user_id, NEST(timestamp) AS timestamps FROM T GROUP BY user_id;
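To make the reshaping on slide 14 concrete, here is a minimal local sketch of what the NEST ... GROUP BY query produces. It is plain JavaScript, not BigQuery code: the helper name nestByUser and the sample rows are illustrative assumptions, and timestamps are kept as simple "HH:MM" strings.

```javascript
// Local illustration only: mimics SELECT user_id, NEST(timestamp) ... GROUP BY user_id.
// nestByUser is a hypothetical helper, not part of BigQuery.
function nestByUser(rows) {
  var byUser = {};
  rows.forEach(function (row) {
    if (!Object.prototype.hasOwnProperty.call(byUser, row.user_id)) {
      byUser[row.user_id] = [];
    }
    byUser[row.user_id].push(row.timestamp);
  });
  // One output row per user, carrying every timestamp seen for that user.
  return Object.keys(byUser).map(function (id) {
    return {user_id: id, timestamps: byUser[id]};
  });
}

// One input row per badge scan, as on the slide.
var scans = [
  {user_id: 'thomas', timestamp: '09:00'},
  {user_id: 'hoffa', timestamp: '08:10'},
  {user_id: 'thomas', timestamp: '10:00'},
  {user_id: 'thomas', timestamp: '10:15'},
  {user_id: 'hoffa', timestamp: '11:30'},
  {user_id: 'thomas', timestamp: '12:00'}
];
var nested = nestByUser(scans);
console.log(JSON.stringify(nested));
```

Each element of nested has the shape that the per-user JavaScript function on the next slide expects to receive.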
  • 15. JS: Total time for each user
      // This function will be called once for each user,
      // and receive an array of timestamps.
      function(record, emit) {
        var total_time = 0;
        // The order of values built by NEST is not guaranteed!
        // Sort numerically to guarantee ascending timestamps.
        // (The comparator must return a number, not a boolean.)
        var ts = record.timestamps.sort(
            function (a, b) { return a - b; });
        // Loop over timestamp pairs, calculate interval.
        for (var i = 0; i < ts.length - 1; i += 2) {
          total_time += (ts[i+1] - ts[i]);
        }
        // Emit total time for this user, using the field names
        // declared in the output schema.
        emit({user_id: record.user_id, tot_time: total_time});
      }
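The per-user calculation can be tried outside BigQuery with a stand-in for emit. The sketch below assumes timestamps are numeric (minutes since midnight here, purely for readability); the function name totalTime and the array-collecting emit callback are illustrative, not part of the BigQuery API.

```javascript
// Sketch of the per-user interval calculation, runnable outside BigQuery.
// Assumption: timestamps are numbers (minutes since midnight in this example).
function totalTime(record, emit) {
  var total = 0;
  // Copy before sorting so the caller's array is left untouched.
  var ts = record.timestamps.slice().sort(function (a, b) { return a - b; });
  // Pair up consecutive scans: (enter, leave), (enter, leave), ...
  // An unmatched trailing scan is ignored.
  for (var i = 0; i < ts.length - 1; i += 2) {
    total += ts[i + 1] - ts[i];
  }
  emit({user_id: record.user_id, tot_time: total});
}

// Hypothetical stand-in for the emit callback: collect rows in an array.
var emitted = [];
totalTime(
  // Deliberately unsorted: 10:00, 09:00, 12:00, 10:15.
  {user_id: 'thomas', timestamps: [600, 540, 720, 615]},
  function (row) { emitted.push(row); });
console.log(emitted[0].tot_time); // 165 (60 + 105 minutes)
```

The 165 minutes match the two intervals annotated on slide 14 (60 and 105 minutes).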
  • 16. SELECT * FROM js(
        // Input table or query.
        [secret-lab:door_scans.201411],
        // Input columns.
        user_id, timestamps,
        // Output schema.
        "[{name: 'user_id', type:'string'},
          {name: 'tot_time', type:'integer'}]",
        // The function.
        "function(r, emit) { ... emit(...); }")
  • 17. (Same query as slide 16.) Callout on the last argument: The JS function
  • 18. (Same query as slide 16.) Callout on user_id, timestamps: Input schema (column names only!)
  • 19. (Same query as slide 16.) Callout on the schema string: Output schema (full declaration)
  • 20. (Same query as slide 16.) Callout on the first argument: Input table or subquery
  • 21. (Same query as slide 16, without callouts.)
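The four parts of the js() call (input, input columns, output schema, function) can be mimicked with a small local harness for experimenting with a function before running it in BigQuery. Everything here is a hypothetical sketch: runJs is not a BigQuery API, and its behavior (projecting only the listed columns, keeping only declared output fields) is an assumption of this harness rather than a description of how BigQuery executes UDFs.

```javascript
// Hypothetical local harness; NOT how BigQuery executes js().
function runJs(inputRows, inputColumns, outputSchema, fn) {
  var out = [];
  inputRows.forEach(function (row) {
    // Harness assumption: the function sees only the listed input columns.
    var record = {};
    inputColumns.forEach(function (c) { record[c] = row[c]; });
    fn(record, function emit(result) {
      // Harness assumption: keep only fields declared in the output schema.
      var shaped = {};
      outputSchema.forEach(function (f) { shaped[f.name] = result[f.name]; });
      out.push(shaped);
    });
  });
  return out;
}

// Usage with made-up data: count scans per user.
var harnessOutput = runJs(
  [{user_id: 'thomas', timestamps: [540, 600, 615, 720], badge_color: 'blue'}],
  ['user_id', 'timestamps'],
  [{name: 'user_id', type: 'string'}, {name: 'scans', type: 'integer'}],
  function (r, emit) {
    emit({user_id: r.user_id, scans: r.timestamps.length, ignored: true});
  });
console.log(JSON.stringify(harnessOutput));
```

In this harness, a mismatch between emitted field names and the schema shows up as undefined values, which is a cheap way to catch naming slips such as emitting total_time when the schema declares tot_time.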
  • 22. Agenda: I. Background; II. Example: Cross-row intervals; III. Under the hood; IV. Example: Codebreaking
  • 23. How BigQuery works: Tree Structured Query Dispatch and Aggregation. Each level gets data from lower levels, filters / joins / transforms, and sends rows up.
      [Diagram, top to bottom]
        Mixer 0: LIMIT 10 ORDER BY c DESC; SUM(requests) GROUP BY title
        Mixer 1, Mixer 1: SUM(requests) GROUP BY title
        Leaf, Leaf, Leaf, Leaf: SUM(requests) GROUP BY title WHERE REGEX_MATCH(title, 'pat.*rn')
        Distributed Storage: SELECT title, requests
  • 24. Data for each row is calculated and streamed through a “Row Iterator” [Diagram labels: Subquery0, Subquery1, JOIN; Row Iterator 0, Row Iterator 1, Row Iterator 2]
  • 25. Can insert JavaScript Functions wherever we have a Row Iterator [Diagram labels: Subquery0, Subquery1, JOIN; Row Iterator 0, Row Iterator 1, Row Iterator 2; UDF0, UDF1]
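One way to picture a UDF sitting on a row iterator is with JavaScript generators. This is a conceptual sketch only, not BigQuery's implementation; the names scan and applyUdf and the sample rows are illustrative assumptions.

```javascript
// Conceptual sketch: row iterators modeled as generators.
// scan and applyUdf are hypothetical names, not BigQuery internals.
function* scan(rows) {
  for (const row of rows) yield row;
}

// Wrap any row iterator with a UDF. The UDF may emit zero, one,
// or many output rows for each input row.
function* applyUdf(iterator, udf) {
  for (const row of iterator) {
    const buffer = [];
    udf(row, function emit(out) { buffer.push(out); });
    yield* buffer;
  }
}

// Usage: drop rows with no requests, upper-case the rest.
const udfRows = Array.from(applyUdf(
  scan([{title: 'a', requests: 3}, {title: 'b', requests: 0}]),
  function (row, emit) {
    if (row.requests > 0) emit({title: row.title.toUpperCase()});
  }));
console.log(JSON.stringify(udfRows));
```

Because applyUdf both consumes and produces an iterator, it can be placed after a subquery, after a JOIN, or chained with another UDF, which is the point the slide makes.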
  • 26. Join order item info with web hits info [Diagram labels: SELECT item FROM orders; SELECT query string FROM hits; JOIN; Row Iterator 0, Row Iterator 1, Row Iterator 2; UDF0, UDF1]
  • 27. http://www.store.com/?q=7%2e1+Speakers [Diagram labels: SELECT item FROM orders; SELECT query string FROM hits; JOIN; Row Iterator 0, Row Iterator 1, Row Iterator 2; UDF0, UDF1]
  • 28. http://www.store.com/?q=7%2e1+Speakers Extract and decode query term => “7.1 Speakers” [Diagram labels: SELECT item FROM orders; SELECT query string FROM hits; JOIN; Row Iterator 0, Row Iterator 1, Row Iterator 2; UDF0, UDF1]
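The "extract and decode" step on slide 28 could look roughly like the following. This is a sketch, not the code from the talk: the function name extractQueryTerm and the choice of the q parameter are assumptions based on the example URL.

```javascript
// Sketch: pull the q= parameter out of a URL and decode it.
// extractQueryTerm is a hypothetical helper name.
function extractQueryTerm(url) {
  var match = /[?&]q=([^&#]*)/.exec(url);
  if (!match) return null;
  // '+' encodes a space in query strings; decodeURIComponent
  // handles the %XX escapes (here %2e => '.').
  return decodeURIComponent(match[1].replace(/\+/g, ' '));
}

var term = extractQueryTerm('http://www.store.com/?q=7%2e1+Speakers');
console.log(term); // 7.1 Speakers
```

Inside a UDF, the decoded term would be emitted as a column so that the hits side can be joined against order items by name.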
  • 30. Agenda: I. Background; II. Example: Cross-row intervals; III. Under the hood; IV. Example: Codebreaking
  • 32. Image: El Hormiguero (Flickr CC)
  • 34. Questions? News: reddit.com/r/bigquery Ask: stackoverflow.com Share: bigqueri.es Thomas Park Felipe Hoffa @felipehoffa +FelipeHoffa Rate us? http://goo.gl/k3bzdw
  • 35. Backup slides / screenshots
  • 42. 17TH ~ 18th NOV 2014 MADRID (SPAIN)