Big Data LA NoSQL Big Data Track. User Behavior analytics engine. Concepts around events, sessions, funnels using Interana Analytics Engine. Operational considerations.
3. Who Am I
Big Data Engineer at Interana
SQL, Cassandra, Redis, Mongo, SOLR and now Interana
jag@interana.com/Github/LinkedIn
If you want to create a big problem, build a database;
if you want to create a huge problem, never delete anything.
4. Data > Opinion Journey
● Three engineers: Lior Abraham, Bobby Johnson and Ann Johnson
● Scuba at Facebook
● Take it to the masses
● Full Stack Solution - UI/API, Ingest and Storage Tier
8. Concepts Event Stream
Event - Actor, behavior and time: at time T, Jack makes a purchase
Session - Events between login and logout, or bounded by inactive time
Cohort - Male, 25, California, In-N-Out Burger
Funnel - Clicked on Ad => Viewed Item => Added to Cart => Made Purchase => Satisfaction
Metric - Column defined by an equation over existing columns; storageless
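A minimal sketch of the "storageless" metric idea: the metric is a function over existing columns, evaluated at query time, and nothing is written back to the events. The field names (`price`, `quantity`) and the helpers `revenue` and `evaluate` are hypothetical, chosen for illustration; this is not the Interana API.

```python
from typing import Any, Callable, Dict, Iterable, List

Event = Dict[str, Any]

# Hypothetical events; the column names are illustrative only.
events: List[Event] = [
    {"user": "Jack", "price": 4.0, "quantity": 3},
    {"user": "Jill", "price": 2.5, "quantity": 2},
]


def revenue(event: Event) -> float:
    """A metric: an equation over columns that already exist."""
    return float(event["price"]) * int(event["quantity"])


def evaluate(metric: Callable[[Event], float], rows: Iterable[Event]) -> List[float]:
    """Compute the metric at query time; the rows themselves are not modified."""
    return [metric(row) for row in rows]


values = evaluate(revenue, events)
```

Because the metric is only a definition, changing the equation needs no backfill of stored data.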
9. Time ordered Event Data
Timestamp  User (SK)  Ad_id (SK)  Behaviour      Is_Alcoholic (DM)
July 1st   Jack       Beer        Clicked on Ad  True
July 1st   Jill       Juice       Clicked on Ad  False
July 2nd   Jack                   Added To Cart
July 3rd   Jack                   Purchase
July 5th   Jill                   Added To Cart
July 10th  Jill                   Logged Out
12. Sampling
● Lies, Damn Lies and Sampling
● Sampling takes advantage of the SK <-> Actor relationship
● Confidence depends on the shape of the distributions
● Sample rate is key. Sparse data gets tricky across hundreds of shards
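One way to read the SK <-> Actor bullet is that sampling is done per actor rather than per event, so a sampled actor keeps all of its events. The sketch below assumes that reading; the hash-based rule and the names `keep_actor` and `estimate_actor_count` are illustrative assumptions, not the engine's actual sampling code.

```python
import hashlib
from typing import Dict, Iterable


def keep_actor(shard_key: str, sample_rate: float) -> bool:
    """Deterministically keep or drop an actor based on a hash of its shard key."""
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


def estimate_actor_count(events: Iterable[Dict[str, str]], sample_rate: float) -> float:
    """Count distinct sampled actors, then scale up by 1 / sample_rate."""
    sampled = {e["user"] for e in events if keep_actor(e["user"], sample_rate)}
    return len(sampled) / sample_rate


# Hypothetical data: 1,000 actors with 3 events each.
events = [{"user": f"user-{i}", "behavior": "click"} for i in range(1000) for _ in range(3)]
exact = estimate_actor_count(events, 1.0)
approx = estimate_actor_count(events, 0.1)
```

With a uniform population the scaled estimate lands close to the exact count; with sparse or heavy-tailed data the same scaling can be far off, which is why the sample rate matters.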
17. Operational Approach
● Managed Service - Hosted in AWS/Azure environment
● Coming soon - Container-based solution (AMI), self-serve import as well
● Performance is critical - SATA vs SSD, tiered disks, RAM
● Redundancy - currently no live redundancy, use backup and playback
21. Music Data Set - 4B Rows
ts 1412210780000
userId(sk) 130065
sessionId(sk) 999FAFD51ASD
artist Audioslave
auth Logged In
lastName Brown
level free
page NextSong
song Gasoline
Started using event data 15 years ago to represent satellite ground events in a finite state machine. It could fit into a single database (10M rows). We now generate 10B rows a day, so 10M rows is 1/1000 of a day, about 86 seconds.
High dimensional. Know the question. Naming things and cache invalidation. How many people have worked in an environment where people use your software, or want to? How many people have made feature decisions based on a manager's decision? And how many have used data to drive their product decisions?
Came out of stealth. Billions to trillions. Startups to enterprise. Consumer to B2B.
We do both! Unauthorized application: took a funnel of users by device. Sonos funnel, 3 steps: first step is the attempted action, then error, then action. Funnel drop-off. Drill down by cohort, matrix. User ID and event counts, different types of errors.
Circle - follow link. Is sparse. Is event based (time-varying vs. time series). Questions:
The actor is the thing that you are following. Most systems have only one or two of these. The lingua franca of click stream/event stream data.
Sessions are dynamically generated depending on the state.
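As a hedged illustration of sessions being generated dynamically, the sketch below derives them at query time from each actor's ordered events, closing a session on a logout or after a gap of inactivity. The 30-minute timeout, the tuple layout and the function name `sessionize` are assumptions made for this example, and the timestamps are made up.

```python
from datetime import datetime, timedelta
from itertools import groupby


def sessionize(events, timeout=timedelta(minutes=30), end_behavior="Logged Out"):
    """Split (actor, timestamp, behavior) tuples into per-actor sessions."""
    sessions = []
    ordered = sorted(events, key=lambda e: (e[0], e[1]))
    for _actor, actor_events in groupby(ordered, key=lambda e: e[0]):
        current = []
        for event in actor_events:
            # A long gap since the previous event closes the open session.
            if current and event[1] - current[-1][1] > timeout:
                sessions.append(current)
                current = []
            current.append(event)
            # An explicit logout also closes the session.
            if event[2] == end_behavior:
                sessions.append(current)
                current = []
        if current:
            sessions.append(current)
    return sessions


# Hypothetical events for two actors.
events = [
    ("Jack", datetime(2015, 7, 1, 9, 0), "Clicked on Ad"),
    ("Jack", datetime(2015, 7, 1, 9, 10), "Added To Cart"),
    ("Jack", datetime(2015, 7, 1, 11, 0), "Purchase"),
    ("Jill", datetime(2015, 7, 1, 9, 5), "Clicked on Ad"),
    ("Jill", datetime(2015, 7, 1, 9, 15), "Logged Out"),
    ("Jill", datetime(2015, 7, 1, 9, 20), "Clicked on Ad"),
]
sessions = sessionize(events)
session_lengths = [len(s) for s in sessions]
```

Since nothing about a session is stored, changing the timeout simply produces a different set of sessions on the next query.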
Highlighted are sessions
They turn a set of events into a shard key.
Funnels go across sessions. They talk about the population as a whole
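A rough sketch of a funnel computed over the whole population, ignoring session boundaries: each actor's events are walked in time order and the actor advances one step whenever the next expected behavior appears. The step names reuse the behaviors from the event table slide, with the day of July as an integer timestamp; `funnel_counts` is a hypothetical helper, not an Interana function.

```python
from collections import defaultdict

FUNNEL = ["Clicked on Ad", "Added To Cart", "Purchase"]


def funnel_counts(events, steps):
    """Return how many actors reached each funnel step, in order.

    events: iterable of (timestamp, actor, behavior) with sortable timestamps.
    """
    by_actor = defaultdict(list)
    for _ts, actor, behavior in sorted(events):
        by_actor[actor].append(behavior)

    reached = [0] * len(steps)
    for behaviors in by_actor.values():
        step = 0
        for behavior in behaviors:
            if step < len(steps) and behavior == steps[step]:
                reached[step] += 1
                step += 1
    return reached


events = [
    (1, "Jack", "Clicked on Ad"),
    (1, "Jill", "Clicked on Ad"),
    (2, "Jack", "Added To Cart"),
    (3, "Jack", "Purchase"),
    (5, "Jill", "Added To Cart"),
    (10, "Jill", "Logged Out"),
]
counts = funnel_counts(events, FUNNEL)
```

Here both actors click and add to cart, but only Jack purchases, so the drop-off happens at the last step.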
Like Cassandra.
MMAP/MALLOC
Pipelining - using caches. Keep data and instructions local.
Compression - focus on streaming data compression vs. block-level compression.
Realtime, neartime and batch time
Why does it matter? It impacts the answer, but you save 10x the time producing the answer. Google example. Directionally. Power law vs. uniform distribution. Distribution of counts (clicks). 100 shards, 100 users. Uniform across shards (i.e. the data distribution is normal). Outliers. Unsampled. Number of actors in the result. Density: actors per shard.
Spend: summing when the distribution is too wide. Tail distribution/Pareto distribution. Heavy ended. In those cases we alert, fully unsampled amount of time. Quality of distribution. Confidence measure. Shard key. Effectively high availability. Dirac delta: one user; scaling, or delta distribution. Count * event. Delta on check engine light. One user. Hacker behaviour, get 0 back. Check engine light. In the Dirac delta case, scaling has an impact.
Ensure all distributions are similar. Uniformly distributed across shards. Sharding.
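To make the uniformity point concrete, the sketch below hashes shard keys onto a fixed number of shards and measures how far the busiest shard sits above the mean. The hash function, the 100-shard count and the names `shard_for` and `shard_loads` are assumptions for illustration; the notes do not say how the engine actually assigns shards.

```python
import hashlib
from collections import Counter


def shard_for(shard_key: str, num_shards: int) -> int:
    """Map a shard key to a shard index using a stable hash."""
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_loads(shard_keys, num_shards):
    """Return the number of keys assigned to each shard, indexed by shard."""
    counts = Counter(shard_for(key, num_shards) for key in shard_keys)
    return [counts.get(shard, 0) for shard in range(num_shards)]


# Hypothetical population: 10,000 actors over 100 shards.
loads = shard_loads((f"user-{i}" for i in range(10_000)), 100)
mean_load = sum(loads) / len(loads)
skew = max(loads) / mean_load  # 1.0 would be perfectly even
```

A skew close to 1.0 means every shard sees a similar slice of the population, which is the condition under which a per-shard sample can be scaled up with reasonable confidence.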
Default: continuous window, temporal with time. Time append write. Ask during the demo how many have datasets on these backings. How much data: is it gigabytes, terabytes, petabytes?
Question: Where is your data housed? Import/export vs. ingest. Formatting. Other things that people struggle with. File/stream/custom protocol (Kafka, et al.). More comfortable with Unix box command-line stuff.
How many people use Amazon Marketplace? How many use a virtual solution (.hvi)? How many use just a plain old tarball?
https://demo.interana.com/?dashboard=dashboard-test-2&name=Music%20Dashboard. 1440 minutes a day
Demographic, flexible behavior-driven cohorts. The music data.
Demographic, old school metal, flexible behavior driven cohorts