"Strategies for supporting near real time analytics, OLAP, and interactive data exploration" - Dr. Jeremy Engle, Engineering Manager Data Team at Jellyvision
2. Fancy Title…WTF are you talking about?
• How Jellyvision does the data
• What we have done/learned
• Strategies for effective design
• What technologies helped us
• Best outcomes
• You learned something
• My pics/gifs amused you
3. A bit about Jellyvision
• Interactive software that talks people through important, complex, and
potentially snooze-inducing life decisions.
• Our recipe: behavioral science, purposeful humor, mighty tech, and oregano.
• 850 companies with more than 15 million employees in total – including 91 of
the Fortune 500 and 15 of the country's 50 largest companies.
• A B2B2C company
• Our customers are corporations
• Our users are their employees
4. What does the data team do?
• Pipeline = Collecting/Processing/Abstracting
• Visualization = Applications
• Stewards =
• Data validation/Governance
• Ad hoc reporting
• Internal dashboards
• 100s of millions of events per year
• Will eclipse 1TB of collective data this year
• Peak traffic is 70x low-traffic volume
5. What does our use of data empower?
• Customer KPIs
• Usage
• Satisfaction
• Feedback
• Customers can pick any time range at day granularity
• Support 8+ products/versions
• Analysis of user behavior to determine effectiveness of new and legacy
content (more on this later)
• Monitoring for customer success
• One off data requests
• Specific customer data points
• Customers that are configured like XX
6. What is our software like?
• Custom built pipeline
• Near real time KPI aggregator
• CouchDB and “Data Movers”
• Pipeline for each of 8+ products/variations
• Migrate to data warehouse with Kinesis Firehose + Lambda (Snowflake)
• Ruby and Python Web apps
• Pipeline populates relational DB with aggregated metrics
• Based on the Command Query Responsibility Segregation (CQRS) pattern
• https://martinfowler.com/bliki/CQRS.html
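The CQRS idea above can be sketched in a few lines of Python. This is a minimal illustration, not the actual Jellyvision pipeline, and every name in it (product ids, event shapes, class names) is invented: commands append immutable events on the write side, and a projection keeps an aggregated read model that queries hit directly.

```python
from collections import defaultdict

class EventStore:
    """Write side: commands append immutable events and notify projections."""
    def __init__(self):
        self.events = []
        self.subscribers = []

    def append(self, event):
        self.events.append(event)
        for handler in self.subscribers:
            handler(event)

class UsageReadModel:
    """Read side: an aggregated projection kept current by incoming events."""
    def __init__(self, store):
        self.views_per_product = defaultdict(int)
        store.subscribers.append(self.apply)

    def apply(self, event):
        if event["type"] == "page_view":
            self.views_per_product[event["product"]] += 1

    def views(self, product):
        return self.views_per_product[product]

store = EventStore()
read_model = UsageReadModel(store)
store.append({"type": "page_view", "product": "product_a"})
store.append({"type": "page_view", "product": "product_a"})
print(read_model.views("product_a"))  # 2
```

The payoff is the same as in the slide: the web app only ever reads the pre-aggregated model, so query latency is decoupled from ingest volume.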
7. What time granularities we support
• Processing
• Near real time (destination in minutes)
• Batch (destination in hours to days)
• Data Access
• Web application (response in < 500 ms)
• https://research.googleblog.com/2009/06/speed-matters.html
• Interactive OLAP (response in ~2 sec)
• OLAP querying/reporting (response in seconds to minutes)
9. Modularize your stack
• Collec.on
• When/What data fires
• APIs to receive data
• Processing/Movement
• EL/ETL/Streaming
• Aggregation
• Storage/Access
• Usage/Visualization/Analysis
10. Strategy: Buy versus Build
• If you can buy it… do that
• Massive change in the last 3 years
• Turnkey analytics
• Reduce/eliminate need for engineers
• Infrastructure
• Engineers go faster but need to be able to integrate multiple 3rd parties
11. Technologies for Data
• Movement
• Kinesis, SQS, Database Migration Service, Kafka
• Processing
• EMR, Kinesis Analytics, Spark/Flink/Storm
• Workflow/Orchestration
• Glue, Data Pipeline, Azkaban, Airflow, Luigi
• Storage/Access (many include some form of processing)
• DynamoDB, S3, Redshift/Spectrum, Snowflake, the whole NoSQL ecosystem
13. Data Warehouse versus Data Lake
http://www.kdnuggets.com/2015/09/data-lake-vs-data-warehouse-key-differences.html
14. Data Lakehouse (Snowflake)
• Support schemaless AND structured data at the physical layer
• Create structure where desirable
• Make code/queries simpler
• Make columnar queries more performant
• Creating structure does not require a massive data maintenance event
• Flexibility!!!!!!!
15. How we are using a Data Lakehouse
• Custom BI Tool for exploring user behavior
• Example Query: Find users that answered question A with 1 and question B with 2
• Two approaches:
• Each additional constraint requires a self-join with questions
• For each session, build a map of Question/Answer pairs
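The two approaches can be contrasted on a toy dataset in Python (the rows and values are invented for illustration): the first mirrors one self-join per constraint, while the map-per-session approach answers any number of constraints with a single pass.

```python
from collections import defaultdict

# Toy event rows: (session_id, question, answer)
rows = [
    ("s1", "A", 1), ("s1", "B", 2),
    ("s2", "A", 1), ("s2", "B", 3),
    ("s3", "A", 2), ("s3", "B", 2),
]

# Approach 1: one scan per constraint, then intersect.
# This is the shape a SQL self-join per constraint takes.
def sessions_matching(rows, question, answer):
    return {s for s, q, a in rows if q == question and a == answer}

joined = sessions_matching(rows, "A", 1) & sessions_matching(rows, "B", 2)

# Approach 2: build one question -> answer map per session, filter once.
by_session = defaultdict(dict)
for s, q, a in rows:
    by_session[s][q] = a

mapped = {s for s, qa in by_session.items()
          if qa.get("A") == 1 and qa.get("B") == 2}

print(joined, mapped)  # both {'s1'}
```

Adding a third constraint costs another join in the first approach but only another `qa.get(...)` check in the second, which is why the map shape scales better for interactive exploration.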
16. Data Lakehouse: Iterative Development
• Views create abstractions
• Encapsulate common business logic
• Abstract away table/database structure
• Structure(s) under the view can iteratively change
• Performance can be improved iteratively
• Make a smaller lake
• Convert to columnar physical model
• Leverage cluster/sort keys
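A tiny sqlite3 sketch (standing in for the warehouse; the schema and the "completed session" rule are invented) shows the abstraction at work: callers query the view, so the physical table underneath can be restructured iteratively without touching them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Raw landing table (illustrative schema).
    CREATE TABLE raw_events (session_id TEXT, product TEXT, event_type TEXT);
    INSERT INTO raw_events VALUES
        ('s1', 'benefits', 'start'),
        ('s1', 'benefits', 'finish'),
        ('s2', 'benefits', 'start');

    -- The view encapsulates the business definition of a completed session;
    -- the table underneath can be swapped for a columnar or clustered layout
    -- later, as long as the view keeps producing the same columns.
    CREATE VIEW completed_sessions AS
        SELECT session_id, product
        FROM raw_events
        WHERE event_type = 'finish';
""")

rows = conn.execute("SELECT session_id FROM completed_sessions").fetchall()
print(rows)  # [('s1',)]
```

The same pattern applies in Snowflake: queries target the view, and performance work (smaller lakes, columnar conversion, cluster keys) happens beneath it.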
18. Define your processing
• Immutable data simplifies everything
• Define for each unit of data:
• Aggregations
• Roll ups
• Required latency
• Define how aggregation and usage logic intersect with movement and
storage systems
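One lightweight way to make those per-unit definitions explicit, sketched here with invented units and numbers, is a frozen (immutable) spec object per unit of data; downstream tooling can then route units to near real time or batch paths by their required latency.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen mirrors the "immutable data" principle
class ProcessingSpec:
    unit: str                 # the unit of data, e.g. one user session
    aggregations: tuple       # metrics computed over the unit
    rollups: tuple            # coarser grains derived from the unit
    max_latency_seconds: int  # required freshness at the destination

# Illustrative specs, not Jellyvision's actual definitions.
specs = [
    ProcessingSpec("session", ("duration", "questions_answered"),
                   ("daily", "monthly"), max_latency_seconds=300),
    ProcessingSpec("page_view", ("count",),
                   ("daily",), max_latency_seconds=86400),
]

# Route: anything that must land within 10 minutes goes near real time.
near_real_time = [s.unit for s in specs if s.max_latency_seconds <= 600]
print(near_real_time)  # ['session']
```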
19. Event driven data processing
• Use non-temporal triggers to facilitate the flow of data
• Differences in polling time (think Kinesis Firehose) create the distinction
between near real time and batch processing
• Can also be designed to support real time analytics
• Especially useful for systems that have varied traffic levels
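A non-temporal trigger can be as simple as flushing on a byte threshold rather than a clock tick. This sketch (thresholds and record shapes invented) mirrors only the size half of Firehose's size-or-time buffering rule, and it shows why such a trigger suits varied traffic: busy periods flush often, quiet periods simply wait.

```python
class SizeTriggeredBuffer:
    """Flush a batch downstream when a byte threshold is crossed,
    not on a timer (the non-temporal trigger)."""
    def __init__(self, flush, max_bytes=1024):
        self.flush = flush          # callback receiving a list of records
        self.max_bytes = max_bytes
        self.pending = []
        self.pending_bytes = 0

    def add(self, record: bytes):
        self.pending.append(record)
        self.pending_bytes += len(record)
        if self.pending_bytes >= self.max_bytes:
            self.flush(self.pending)
            self.pending, self.pending_bytes = [], 0

batches = []
buf = SizeTriggeredBuffer(batches.append, max_bytes=10)
for rec in (b"aaaa", b"bbbb", b"cccc"):
    buf.add(rec)  # third record crosses 10 bytes and triggers the flush
print(len(batches))  # 1
```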
20. Mutable data processing
• Data is dirty, sometimes to the point that it must be fixed
• Applying logic to multiple systems is time consuming and fragile; with
streaming systems it is especially problematic
• Big data systems make it feasible to store full historical data
• Source of truth should be upstream of all transformation
• Changing the source of truth cascades to all downstream storage
• Downstream aggregations should be done AND triggered in a way that
facilitates updates, i.e. store at session level AND daily rollup
• Handling deletes means aggregations can be triggered independent of
create/update
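The session-plus-rollup idea can be sketched as a rollup that is a pure function of the session-level source of truth (all data values here are invented). Because the daily grain is derived, an upstream correction or delete just re-derives the affected day instead of requiring a separate fix in every downstream system.

```python
from collections import defaultdict
from datetime import date

# Source of truth: session-level facts, keyed by session id so that
# corrections and deletes replace or remove a whole session.
sessions = {
    "s1": {"day": date(2018, 5, 1), "events": 10},
    "s2": {"day": date(2018, 5, 1), "events": 4},
    "s3": {"day": date(2018, 5, 2), "events": 7},
}

def daily_rollup(sessions):
    """Derive the daily grain from the session grain. A pure function of
    the session store, so any upstream change cascades by recomputing."""
    totals = defaultdict(int)
    for s in sessions.values():
        totals[s["day"]] += s["events"]
    return dict(totals)

before = daily_rollup(sessions)[date(2018, 5, 1)]  # 14

# Upstream correction: session s2 turns out to be dirty and is deleted.
del sessions["s2"]
after = daily_rollup(sessions)[date(2018, 5, 1)]   # 10
print(before, after)
```

In production the recompute would of course be scoped (only the touched days) and triggered by the delete event itself, which is exactly the "triggered independent of create/update" point above.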
21. All Done!
Questions on how we formed the data team, what we do, technologies we use,
HIPAA, or anything else? Just come find me or contact me.