"Strategies for supporting near real time analytics, OLAP, and interactive data exploration" - Dr. Jeremy Engle, Engineering Manager Data Team at Jellyvision
2. Fancy Title…WTF are you talking about?
• How Jellyvision does the data
• What we have done/learned
• Strategies for effective design
• What technologies helped us
• Best outcomes
• You learned something
• My pics/gifs amused you
3. A bit about Jellyvision
• Interactive software that talks people through important, complex, and
potentially snooze-inducing life decisions.
• Our recipe: behavioral science, purposeful humor, mighty tech, and oregano.
• 850 companies with more than 15 million employees in total – including 91 of
the Fortune 500 and 15 of the country's 50 largest companies.
• A B2B2C company
• Our customers are corporations
• Our users are their employees
4. What does the data team do?
• Pipeline = Collecting/Processing/Abstracting
• Visualization = Applications
• Stewards =
• Data validation/Governance
• Ad hoc reporting
• Internal dashboards
• 100s of millions of events per year
• Will eclipse 1TB of collective data this year
• Peak traffic is 70x low-traffic volume
5. What does our use of data empower?
• Customer KPIs
• Usage
• Satisfaction
• Feedback
• Customers can pick any time range at day granularity
• Support 8+ products/versions
• Analysis of user behavior to determine effectiveness of new and legacy
content (more on this later)
• Monitoring for customer success
• One off data requests
• Specific customer data points
• Customers that are configured like XX
6. What is our software like?
• Custom built pipeline
• Near real time KPI aggregator
• CouchDB and “Data Movers”
• Pipeline for each of 8+ products/variations
• Migrate to data warehouse with Kinesis Firehose + Lambda (Snowflake)
• Ruby and Python Web apps
• Pipeline populates relational DB with aggregated metrics
• Based on the Command Query Responsibility Segregation (CQRS) pattern
• https://martinfowler.com/bliki/CQRS.html
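The CQRS idea above can be sketched in a few lines of Python. This is a minimal illustration, not the actual Jellyvision pipeline, and every name in it (product ids, event shapes, class names) is invented: commands append immutable events on the write side, and a projection keeps an aggregated read model that queries hit directly.

```python
from collections import defaultdict

class EventStore:
    """Write side: commands append immutable events and notify projections."""
    def __init__(self):
        self.events = []
        self.subscribers = []

    def append(self, event):
        self.events.append(event)
        for handler in self.subscribers:
            handler(event)

class UsageReadModel:
    """Read side: an aggregated projection kept current by incoming events."""
    def __init__(self, store):
        self.views_per_product = defaultdict(int)
        store.subscribers.append(self.apply)

    def apply(self, event):
        if event["type"] == "page_view":
            self.views_per_product[event["product"]] += 1

    def views(self, product):
        return self.views_per_product[product]

store = EventStore()
read_model = UsageReadModel(store)
store.append({"type": "page_view", "product": "product_a"})
store.append({"type": "page_view", "product": "product_a"})
print(read_model.views("product_a"))  # 2
```

The payoff is the same as in the slide: the web app only ever reads the pre-aggregated model, so query latency is decoupled from ingest volume.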
7. What time granularities we support
• Processing
• Near real time (destination in minutes)
• Batch (destination in hours to days)
• Data Access
• Web application (response in < 500 ms)
• https://research.googleblog.com/2009/06/speed-matters.html
• Interactive OLAP (response in ~2 sec)
• OLAP querying/reporting (response in seconds to minutes)
9. Modularize your stack
• Collec.on
• When/What data fires
• APIs to receive data
• Processing/Movement
• EL/ETL/Streaming
• Aggregation
• Storage/Access
• Usage/Visualization/Analysis
10. Strategy: Buy versus Build
• If you can buy it… do that
• Massive change in the last 3 years
• Turnkey analytics
• Reduce/eliminate need for engineers
• Infrastructure
• Engineers go faster but need to be able to integrate multiple 3rd parties
11. Technologies for Data
• Movement
• Kinesis, SQS, Database Migration Service, Kafka
• Processing
• EMR, Kinesis Analytics, Spark/Flink/Storm
• Workflow/Orchestration
• Glue, Data Pipeline, Azkaban, Airflow, Luigi
• Storage/Access (many include some form of processing)
• DynamoDB, S3, Redshift/Spectrum, Snowflake, the whole NoSQL ecosystem
13. Data Warehouse versus Data Lake
http://www.kdnuggets.com/2015/09/data-lake-vs-data-warehouse-key-differences.html
14. Data Lakehouse (Snowflake)
• Support schemaless AND structured data at the physical layer
• Create structure where desirable
• Make code/queries simpler
• Make columnar queries more performant
• Creating structure does not require a massive data maintenance event
• Flexibility!!!!!!!
15. How we are using a Data Lakehouse
• Custom BI Tool for exploring user behavior
• Example Query: Find users that answered question A with 1 and question B with 2
• Two approaches:
• Each additional constraint requires a self-join with questions
• For each session, build a map of Question/Answer pairs
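The two approaches can be contrasted on a toy dataset in Python (the rows and values are invented for illustration): the first mirrors one self-join per constraint, while the map-per-session approach answers any number of constraints with a single pass.

```python
from collections import defaultdict

# Toy event rows: (session_id, question, answer)
rows = [
    ("s1", "A", 1), ("s1", "B", 2),
    ("s2", "A", 1), ("s2", "B", 3),
    ("s3", "A", 2), ("s3", "B", 2),
]

# Approach 1: one scan per constraint, then intersect.
# This is the shape a SQL self-join per constraint takes.
def sessions_matching(rows, question, answer):
    return {s for s, q, a in rows if q == question and a == answer}

joined = sessions_matching(rows, "A", 1) & sessions_matching(rows, "B", 2)

# Approach 2: build one question -> answer map per session, filter once.
by_session = defaultdict(dict)
for s, q, a in rows:
    by_session[s][q] = a

mapped = {s for s, qa in by_session.items()
          if qa.get("A") == 1 and qa.get("B") == 2}

print(joined, mapped)  # both {'s1'}
```

Adding a third constraint costs another join in the first approach but only another `qa.get(...)` check in the second, which is why the map shape scales better for interactive exploration.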
16. Data Lakehouse: Iterative Development
• Views create abstractions
• Encapsulate common business logic
• Abstract away table/database structure
• Structure(s) under the view can iteratively change
• Performance can be improved iteratively
• Make a smaller lake
• Convert to columnar physical model
• Leverage cluster/sort keys
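A tiny sqlite3 sketch (standing in for the warehouse; the schema and the "completed session" rule are invented) shows the abstraction at work: callers query the view, so the physical table underneath can be restructured iteratively without touching them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Raw landing table (illustrative schema).
    CREATE TABLE raw_events (session_id TEXT, product TEXT, event_type TEXT);
    INSERT INTO raw_events VALUES
        ('s1', 'benefits', 'start'),
        ('s1', 'benefits', 'finish'),
        ('s2', 'benefits', 'start');

    -- The view encapsulates the business definition of a completed session;
    -- the table underneath can be swapped for a columnar or clustered layout
    -- later, as long as the view keeps producing the same columns.
    CREATE VIEW completed_sessions AS
        SELECT session_id, product
        FROM raw_events
        WHERE event_type = 'finish';
""")

rows = conn.execute("SELECT session_id FROM completed_sessions").fetchall()
print(rows)  # [('s1',)]
```

The same pattern applies in Snowflake: queries target the view, and performance work (smaller lakes, columnar conversion, cluster keys) happens beneath it.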
18. Define your processing
• Immutable data simplifies everything
• Define for each unit of data:
• Aggregations
• Roll ups
• Required latency
• Define how aggregation and usage logic intersect with movement and
storage systems
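One lightweight way to make those per-unit definitions explicit, sketched here with invented units and numbers, is a frozen (immutable) spec object per unit of data; downstream tooling can then route units to near real time or batch paths by their required latency.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen mirrors the "immutable data" principle
class ProcessingSpec:
    unit: str                 # the unit of data, e.g. one user session
    aggregations: tuple       # metrics computed over the unit
    rollups: tuple            # coarser grains derived from the unit
    max_latency_seconds: int  # required freshness at the destination

# Illustrative specs, not Jellyvision's actual definitions.
specs = [
    ProcessingSpec("session", ("duration", "questions_answered"),
                   ("daily", "monthly"), max_latency_seconds=300),
    ProcessingSpec("page_view", ("count",),
                   ("daily",), max_latency_seconds=86400),
]

# Route: anything that must land within 10 minutes goes near real time.
near_real_time = [s.unit for s in specs if s.max_latency_seconds <= 600]
print(near_real_time)  # ['session']
```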
19. Event driven data processing
• Use non-temporal triggers to facilitate the flow of data
• Differences in polling time (think Kinesis Firehose) create the distinction
between near real time and batch processing
• Can also be designed to support real time analytics
• Especially useful for systems that have varied traffic levels
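A non-temporal trigger can be as simple as flushing on a byte threshold rather than a clock tick. This sketch (thresholds and record shapes invented) mirrors only the size half of Firehose's size-or-time buffering rule, and it shows why such a trigger suits varied traffic: busy periods flush often, quiet periods simply wait.

```python
class SizeTriggeredBuffer:
    """Flush a batch downstream when a byte threshold is crossed,
    not on a timer (the non-temporal trigger)."""
    def __init__(self, flush, max_bytes=1024):
        self.flush = flush          # callback receiving a list of records
        self.max_bytes = max_bytes
        self.pending = []
        self.pending_bytes = 0

    def add(self, record: bytes):
        self.pending.append(record)
        self.pending_bytes += len(record)
        if self.pending_bytes >= self.max_bytes:
            self.flush(self.pending)
            self.pending, self.pending_bytes = [], 0

batches = []
buf = SizeTriggeredBuffer(batches.append, max_bytes=10)
for rec in (b"aaaa", b"bbbb", b"cccc"):
    buf.add(rec)  # third record crosses 10 bytes and triggers the flush
print(len(batches))  # 1
```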
20. Mutable data processing
• Data is dirty, sometimes to the point that it must be fixed
• Applying logic to multiple systems is time consuming and fragile; with
streaming systems it is especially problematic
• Big data systems make it feasible to store full historical data
• Source of truth should be upstream of all transformation
• Changing the source of truth cascades to all downstream storage
• Downstream aggregations should be done AND triggered in a way that
facilitates updates, i.e. store at session level AND daily rollup
• Handling deletes means aggregations can be triggered independent of
create/update
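The session-plus-rollup idea can be sketched as a rollup that is a pure function of the session-level source of truth (all data values here are invented). Because the daily grain is derived, an upstream correction or delete just re-derives the affected day instead of requiring a separate fix in every downstream system.

```python
from collections import defaultdict
from datetime import date

# Source of truth: session-level facts, keyed by session id so that
# corrections and deletes replace or remove a whole session.
sessions = {
    "s1": {"day": date(2018, 5, 1), "events": 10},
    "s2": {"day": date(2018, 5, 1), "events": 4},
    "s3": {"day": date(2018, 5, 2), "events": 7},
}

def daily_rollup(sessions):
    """Derive the daily grain from the session grain. A pure function of
    the session store, so any upstream change cascades by recomputing."""
    totals = defaultdict(int)
    for s in sessions.values():
        totals[s["day"]] += s["events"]
    return dict(totals)

before = daily_rollup(sessions)[date(2018, 5, 1)]  # 14

# Upstream correction: session s2 turns out to be dirty and is deleted.
del sessions["s2"]
after = daily_rollup(sessions)[date(2018, 5, 1)]   # 10
print(before, after)
```

In production the recompute would of course be scoped (only the touched days) and triggered by the delete event itself, which is exactly the "triggered independent of create/update" point above.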
21. All Done!
Questions on how we formed the data team, what we do, technologies we use,
HIPAA, or anything else? Just come find me or contact me.