Activity feeds
and more
at Mate1
Big Data Montreal
Tuesday April 8th 2014
Hisham Mardam-Bey
Overview
● Who is this guy?
● Mate1
○ quick intro
○ some of the features
○ technology stack
● Activity feed
○ take 1
○ take 2
● What’s next?
Who is this guy?
● Linux user and developer since 1996
● Started out hacking on Enlightenment
○ X11 window manager
● Worked with OpenBSD
○ building embedded network gear
● Did a whole lot of C followed by Ruby
● Working with the JVM since 2007
github: mardambey
twitter: codewarrior
Mate1: quick intro
● Online dating, since 2003, based in Montreal
● Initially team of 3, around 40 now
● Engineering team has 13 geeks / geekettes
● We own and run our own hardware
○ fun!
○ mostly…
○ LXC is a life (hardware resource?) saver (=
https://github.com/mate1
Some of our features...
● Lots of communication, chatting, push notifs
● Search, matching, ranking, geo-location
● Lists, friends, blocks, people interested, more
● News & activity feeds, counters, contacts
And what we use for them...
● Lots of communication, chatting, push notifs
● Search, matching, ranking, geo-location
● Lists, friends, blocks, people interested, more
● News & activity feeds, counters, contacts
… all glued together by…
Programming languages…
Scala, Java -> back-end services, business logic, “controllers”
● What makes us =(
○ Struts2 -> XML, painful, want to dump it
○ Hibernate -> not as an ORM, mainly to map
● What makes us (=
○ Play! -> in prod, migrating to it… need “non-blocking” db layer
○ Akka -> simplifies concurrency, network transparency
Programming languages…
● JavaScript -> front end, mobile and desktop
○ Sencha Touch + Apache Cordova -> cross-platform
● PHP -> quick / temporary work
○ registration funnels
○ transient marketing pages
● Perl -> seriously? yes!
○ pre 2007, entire system was in Perl
○ now, customer service system, marketing tools, etc.
○ and! new email delivery service
At some point… activity feed
● Gather user activity and events
● Sometimes inject system events
● As low latency as possible
● Grouped into “tiers” (or types)
● Supports “roll-ups”
● Maintain counters for different event types
Take 1: fan-out on read (pull)
● Activity occurs...
○ A views B -> insert into views uid=B viewer_uid=A
○ C likes B -> insert into likes uid=C likee_uid=B
○ D emails B -> insert into emails uid=B sender_uid=D
■ refer to these as channels
■ based on legacy features and legacy data
● B asks for their activity feed
Take 1: fan-out on read (pull)
[Diagram: app servers read from four MySQL databases (messages, lists, images, users) alongside memcached instances. Read flow: cached? -> query all channels -> aggregate -> cache -> all done!]
so far so good! … or is it?
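The read flow above (cached? -> query all channels -> aggregate -> cache) can be sketched roughly as follows. This is a minimal in-memory illustration, not Mate1's code: plain lists stand in for the MySQL channel tables, a dict stands in for memcached, and the names `CHANNELS`, `CHANNEL_COLUMNS`, `read_feed` and the `ts` column are all assumptions made for the example.

```python
from operator import itemgetter

# Hypothetical per-channel tables; each legacy channel names its columns differently.
CHANNELS = {
    "views":  [{"uid": "B", "viewer_uid": "A", "ts": 100}],
    "likes":  [{"uid": "C", "likee_uid": "B", "ts": 105}],
    "emails": [{"uid": "B", "sender_uid": "D", "ts": 103}],
}

# For each channel: (column holding the feed owner, column holding the actor).
CHANNEL_COLUMNS = {
    "views":  ("uid", "viewer_uid"),
    "likes":  ("likee_uid", "uid"),
    "emails": ("uid", "sender_uid"),
}

cache = {}  # stands in for memcached


def read_feed(user):
    """Fan-out on read: query every channel, aggregate, then cache."""
    if user in cache:                                   # cached?
        return cache[user]
    feed = []
    for channel, rows in CHANNELS.items():              # query all channels
        owner_col, actor_col = CHANNEL_COLUMNS[channel]
        for row in rows:
            if row[owner_col] == user:
                feed.append({"type": channel, "actor": row[actor_col], "ts": row["ts"]})
    feed.sort(key=itemgetter("ts"), reverse=True)       # aggregate, newest first
    cache[user] = feed                                  # cache
    return feed
```

On a cache miss, every read touches every channel, so the work per request grows with the number of channels.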
Take 1: fan-out on read (pull)
● Several channels piggybacked off existing features
○ no uniformity in data structure, not always optimal
● Time constraints on queries
○ can’t go back in time, databases suffered
● Activity feeds slowed down…
● Temporary solutions?
○ slash a bunch of channels -> sucks!
○ aggregate multiple channels to a single table -> hack!
○ had to rethink how we’re doing this
Take 2: fan-out on write (push)
● Had to change approach entirely
● Needed to store data more efficiently
● Writes can be queued up
● Roll-ups should be persistent
● More channels != slower performance
● Built as a scalable service end-to-end
More efficient storage
● Don’t piggyback on old features
● Pre-aggregated user activity feeds
● Ideally store roll-ups in the same store
● Always sorted by time
● Minimal updates or deletes required
● Avoid counting
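A rough sketch of the push model, assuming an in-memory dict in place of the real store: each event is placed into the owner's pre-aggregated feed when it happens, kept in time order, and counted at write time so reads never need to count. The names `feeds`, `counters`, `record_event` and `read_feed` are hypothetical.

```python
import bisect
from collections import defaultdict

feeds = defaultdict(list)     # user -> list of (ts, type, actor), kept sorted by ts
counters = defaultdict(int)   # (user, type) -> running count


def record_event(owner, event_type, actor, ts):
    """Fan-out on write: place the event in the owner's feed when it occurs."""
    bisect.insort(feeds[owner], (ts, event_type, actor))  # stays sorted on insert
    counters[(owner, event_type)] += 1                    # count at write time


def read_feed(owner, limit=10):
    """Reading is a slice of one pre-aggregated list; no per-channel queries."""
    return feeds[owner][::-1][:limit]                     # newest first
```

With this shape, adding a channel adds writes but leaves the read path unchanged.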
Writes can be queued up
● Push all activity / events into a message queue
● Process and persist as soon as possible
○ don’t cause back-pressure
○ needs to be durable
○ must be able to easily scale consumption
● Lots of message queue technologies
○ experience with RabbitMQ (web server logs)
○ and Redis (pubsub, monitoring)
○ tested out Flume (non-ng), buggy at the time
Architecture
[Diagram: App servers, Web servers, Other services -> Event Manager (×3)]
Event manager
● Wanted to be able to publish events
● Started off needing minimal information
○ event type, timestamp, user ids, etc.
○ soon after, needed much more data per event
● Was one of our first Scala libraries!
○ admittedly, needs clean-up now (=
● Provides sync and async publishing
● Also provides callback based consumer API
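The slides do not show the library's API, so the following is only a guess at its general shape: a publisher with sync and async variants and a callback-based consumer, built on an in-process queue. The class and method names (`EventManager`, `publish_sync`, `publish_async`, `subscribe`) are hypothetical, and the queue here is neither durable nor distributed.

```python
import queue
import threading
import time


class EventManager:
    """Hypothetical publish/consume facade over an in-process queue."""

    def __init__(self):
        self._queue = queue.Queue()
        self._callbacks = []
        self.errors = []   # exceptions raised by callbacks, kept for inspection
        threading.Thread(target=self._run, daemon=True).start()

    def subscribe(self, callback):
        """Register a callback that is invoked once per event."""
        self._callbacks.append(callback)

    def publish_async(self, event_type, user_ids, **data):
        """Enqueue the event and return immediately."""
        event = {"type": event_type, "ts": time.time(), "user_ids": user_ids, **data}
        self._queue.put(event)
        return event

    def publish_sync(self, event_type, user_ids, **data):
        """Enqueue the event, then block until the queue has been drained."""
        event = self.publish_async(event_type, user_ids, **data)
        self._queue.join()
        return event

    def _run(self):
        while True:
            event = self._queue.get()
            try:
                for callback in self._callbacks:
                    callback(event)
            except Exception as exc:   # keep the worker alive if a callback fails
                self.errors.append(exc)
            finally:
                self._queue.task_done()
```

Extra keyword arguments stand in for the "much more data per event" that was needed later on.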
Architecture
[Diagram: App servers, Web servers, Other services -> Event Manager (×3) -> Kafka (×3) with ZK (×2)]
Why Kafka?
● Durable by design
● Very high throughput! and scalable
● Supports consumer groups
● Native Scala API
● Supports consumer “replay”
● Per topic data retention and partitioning
● Integrates with Hadoop
○ Kafka <-> Hadoop via Camus and Camus2Kafka
● Grabbed it from LinkedIn’s SVN
○ never looked back! we love it!
○ moving to 0.8.1 at the time of this writing
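To make consumer groups, replay and partitioning concrete, here is a toy model of a partitioned, append-only log. It is not the Kafka API: `Topic`, `publish`, `poll` and `replay` are names invented for this sketch, and keys are assumed to be integer user ids.

```python
from collections import defaultdict


class Topic:
    """Toy partitioned, append-only log with per-group read offsets."""

    def __init__(self, partitions=2):
        self.partitions = [[] for _ in range(partitions)]
        # group name -> next offset to read in each partition
        self.offsets = defaultdict(lambda: [0] * partitions)

    def publish(self, user_id, message):
        """Append to the partition chosen by key, so one user's events stay ordered."""
        self.partitions[user_id % len(self.partitions)].append(message)

    def poll(self, group):
        """Return messages this group has not read yet and advance its offsets."""
        out = []
        for i, log in enumerate(self.partitions):
            out.extend(log[self.offsets[group][i]:])
            self.offsets[group][i] = len(log)
        return out

    def replay(self, group):
        """Rewind a group to the start of the retained log."""
        self.offsets[group] = [0] * len(self.partitions)
```

Because messages stay in the log after being read, each group keeps its own position and a new or rewound group can read the same events again.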
Architecture
[Diagram: App servers, Web servers, Other services -> Event Manager (×3) -> Kafka (×3) with ZK (×2) -> Consumers (×2)]
Consumers
● Implement a simple API
● Subscribe to Kafka topics
● Process events, can do “anything”
○ For activity feeds
■ All interesting events are published
■ and consumed (views, liked, emails, uploads…)
■ then stored into the data store
○ We can also
■ maintain counts & stats, send notifications
● Can fail, with certain tolerance
○ otherwise they stop and alarms are raised
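One way to read "can fail, with certain tolerance" is a consumer loop that counts handler errors and stops with an alarm once a threshold is crossed. The sketch below assumes exactly that; `run_consumer`, `ToleranceExceeded`, `max_failures` and the `alarm` hook are hypothetical names, and the real tolerance policy is not described in the slides.

```python
class ToleranceExceeded(Exception):
    """Raised when a consumer has failed too many times and must stop."""


def run_consumer(events, handler, max_failures=3, alarm=print):
    """Process events with a callback, tolerating up to max_failures errors."""
    processed = 0
    failures = 0
    for event in events:
        try:
            handler(event)
            processed += 1
        except Exception as exc:
            failures += 1
            if failures > max_failures:
                # Past the tolerance: raise the alarm and stop consuming.
                alarm(f"consumer stopped after {failures} failures: {exc}")
                raise ToleranceExceeded(str(exc)) from exc
    return processed, failures
```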
Architecture
[Diagram: App servers, Web servers, Other services -> Event Manager (×3) -> Kafka (×3) with ZK (×2) -> Consumers (×2) -> Store (×3)]
Cassandra
● Activity feeds fit well into C*’s data model
● TimeUUID ordering means no sorting
● Having lots of writes is not a problem
○ works to our advantage
○ we want to push data to users’ activity feeds
● Supports counters
● Can add nodes as needed
● Gave each user multiple rows
○ each row is feed type
○ one row is the “roll-up” row
○ roll-ups done in background, or on demand
○ each user has multiple counters and a few “lists”
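The row layout can be modelled in memory as below. Several things here are assumptions: an integer timestamp stands in for the TimeUUID column name, dicts stand in for Cassandra rows and counters, and since the slides do not define a roll-up, it is assumed to collapse repeated events of the same type from the same actor into one entry. All names (`rows`, `counters`, `write_event`, `roll_up`, `ROLLUP`) are hypothetical.

```python
import bisect
from collections import defaultdict

ROLLUP = "rollup"   # hypothetical key for the roll-up row

# (user, feed_type) -> sorted list of (ts, actor); models one wide row per feed type.
rows = defaultdict(list)
# (user, feed_type) -> count; models a counter column.
counters = defaultdict(int)


def write_event(user, feed_type, actor, ts):
    """Add a column to the user's row for this feed type and bump its counter."""
    bisect.insort(rows[(user, feed_type)], (ts, actor))   # columns stay time-ordered
    counters[(user, feed_type)] += 1


def roll_up(user, feed_types):
    """Rebuild the roll-up row: one entry per (type, actor), keeping the latest time."""
    latest = {}
    for feed_type in feed_types:
        for ts, actor in rows[(user, feed_type)]:
            key = (feed_type, actor)
            latest[key] = max(ts, latest.get(key, ts))
    rows[(user, ROLLUP)] = sorted((ts, t, a) for (t, a), ts in latest.items())
    return rows[(user, ROLLUP)]
```

Calling `roll_up` from a background job or at read time corresponds to the "in background, or on demand" options on the slide.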
How are the feeds read?
● Cassandra nodes don’t get read from directly
Architecture
[Diagram: App servers, Web servers, Other services -> Event Manager (×3) -> Kafka (×3) with ZK (×2) -> Consumers (×2) -> C* (×3), read by ??? (×2)]
How are the feeds read?
● Cassandra nodes don’t get read from directly
● HTTP layer in front of C*
○ provides specific access points
○ mainly for reading, almost no writes
■ except for some counters
○ supports caching requests
■ and busting the cache
○ returns everything as JSON
● Netty!
○ We have 3 readers with Varnish
○ we want to port this to Play!
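The read layer's behaviour (specific access points, cached responses, cache busting, JSON output) might look something like this in miniature. It is a plain-Python sketch with no HTTP server, Netty or Varnish involved; `FeedReader`, `get`, `bust` and the `load_feed` callable are hypothetical names.

```python
import json


class FeedReader:
    """Hypothetical read endpoint: a cache in front of a store lookup, JSON out."""

    def __init__(self, load_feed):
        self._load_feed = load_feed   # callable: user -> list of events
        self._cache = {}
        self.store_reads = 0          # how often the backing store was hit

    def get(self, user):
        """Return the user's feed as a JSON string, from cache when possible."""
        if user not in self._cache:
            self.store_reads += 1
            body = {"user": user, "feed": self._load_feed(user)}
            self._cache[user] = json.dumps(body)
        return self._cache[user]

    def bust(self, user):
        """Drop the cached response, for example after a new event is stored."""
        self._cache.pop(user, None)
```

Repeated requests for the same feed are answered from the cache until `bust` is called for that user.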
Architecture
[Diagram: App servers, Web servers, Other services -> Event Manager (×3) -> Kafka (×3) with ZK (×2) -> Consumers (×2) -> C* (×3), read by Netty (×2)]
What else?
[Diagram: the same architecture (App servers, Web servers, Other services -> Event Manager -> Kafka with ZK -> Consumers -> C*, read by Netty), extended with SOLR, Redis, Ejabberd (×3), and APNS. Labels: NRT search, geo-location, TTL flags, transient data]
So how did all of this work out?
● Pretty fantastic!
● Kafka, top notch high performance queue
● Netty, fast, uses CPU very efficiently
● Cassandra, data model fit well
○ wide rows rock!
○ can’t live without counters anymore
○ eventual consistency, or why C* owns for feeds
● Issues?
○ C* consistency matters, must tune
○ Big batch reads from C* live cluster can be painful
○ not much else really (=
What else are you up to?
● Want to push more lists into C*
● Want to push our on-site inbox into C*
● Experimenting with Spark and C*
● Need to get data from MySQL -> C*
○ working on a tool to feed MySQL’s replication stream
into Kafka via Avro binary serialization
■ can use it to keep MySQL and C* in sync for
some table, or to maintain basic counts
■ or as a data source for Spark
■ or pump into Hadoop
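As an illustration of the "maintain basic counts" idea, the sketch below applies a stream of row-change events to per-table counts. The event shape (`op`, `table`) is invented for the example; the actual tool, its Avro schema and the Kafka wiring are not shown in the slides and are not modelled here.

```python
from collections import Counter


def apply_change(counts, change):
    """Update per-table row counts from one hypothetical change event."""
    if change["op"] == "insert":
        counts[change["table"]] += 1
    elif change["op"] == "delete":
        counts[change["table"]] -= 1
    # Updates leave the number of rows unchanged.
    return counts


def count_rows(changes):
    """Fold a sequence of change events into a Counter of rows per table."""
    counts = Counter()
    for change in changes:
        apply_change(counts, change)
    return counts
```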
Fin!
That’s all folks (=
Thanks!
Questions?
Oh, we’re hiring!
http://mate1inc.com/careers/

SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 

Activity feeds (and more) at mate1

  • 1. Activity feeds and more at Mate1 Big Data Montreal Tuesday April 8th 2014 Hisham Mardam-Bey
  • 2. Overview ● Who is this guy? ● Mate1 ○ quick intro ○ some of the features ○ technology stack ● Activity feed ○ take 1 ○ take 2 ● What’s next?
  • 3. Who is this guy? ● Linux user and developer since 1996 ● Started out hacking on Enlightenment ○ X11 window manager ● Worked with OpenBSD ○ building embedded network gear ● Did a whole lot of C followed by Ruby ● Working with the JVM since 2007 github: mardambey twitter: codewarrior
  • 4. Mate1: quick intro ● Online dating, since 2003, based in Montreal ● Initially team of 3, around 40 now ● Engineering team has 13 geeks / geekettes ● We own and run our own hardware ○ fun! ○ mostly… ○ LXC is a life (hardware resource?) saver (= https://github.com/mate1
  • 5. Some of our features... ● Lots of communication, chatting, push notifs ● Search, matching, ranking, geo-location ● Lists, friends, blocks, people interested, more ● News & activity feeds, counters, contacts
  • 6. And what we use for them... ● Lots of communication, chatting, push notifs ● Search, matching, ranking, geo-location ● Lists, friends, blocks, people interested, more ● News & activity feeds, counters, contacts … all glued together by…
  • 7. Programming languages… Scala, Java -> back-end services, business logic, “controllers” ● What makes us =( ○ Struts2 -> XML, painful, want to dump it ○ Hibernate -> not as an ORM, mainly to map ● What makes us (= ○ Play! -> in prod, migrating to it… need “non-blocking” db layer ○ Akka -> simplifies concurrency, network transparency
  • 8. Programming languages… ● JavaScript -> front end, mobile and desktop ○ Sencha Touch + Apache Cordova -> cross-platform ● PHP -> quick / temporary work ○ registration funnels ○ transient marketing pages ● Perl -> seriously? yes! ○ pre 2007, entire system was in Perl ○ now, customer service system, marketing tools, etc. ○ and! new email delivery service
  • 9. At some point… activity feed ● Gather user activity and events ● Sometimes inject system events ● As low latency as possible ● Grouped into “tiers” (or types) ● Supports “roll-ups” ● Maintain counters for different event types
  • 10. Take 1: fan-out on read (pull) ● Activity occurs... ○ A views B -> insert into views uid=B viewer_uid=A ○ C likes B -> insert into likes uid=C likee_uid=B ○ D emails B -> insert into emails uid=B sender_uid=D ■ refer to these as channels ■ based on legacy features and legacy data ● B asks for their activity feed
  • 11. Take 1: fan-out on read (pull) [Diagram: app servers, MySQL databases (messages, lists, images, users) and memcached instances] Read path: cached? -> query all channels -> aggregate -> cache -> all done! So far so good! … or is it?
  • 12. Take 1: fan-out on read (pull) ● Several channels piggybacked off existing features ○ no uniformity in data structure, not always optimal ● Time constraints on queries ○ can’t go back in time, databases suffered ● Activity feeds slowed down… ● Temporary solutions? ○ slash a bunch of channels -> sucks! ○ aggregate multiple channels to a single table -> hack! ○ had to rethink how we’re doing this
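The pull model of Take 1 can be sketched roughly as follows. This is a minimal illustration, not Mate1's code: the `channels` dict stands in for the per-feature MySQL tables, the `cache` dict stands in for memcached, and the row layout is normalized here even though the slides note the real tables were not uniform. All names are hypothetical.

```python
import heapq

# Hypothetical stand-ins for the per-feature MySQL tables ("channels").
# Each row is (timestamp, owner_uid, actor_uid); the real tables differed.
channels = {
    "views":  [(100, "B", "A")],
    "likes":  [(105, "B", "C")],
    "emails": [(110, "B", "D")],
}
cache = {}  # stands in for memcached


def read_feed(uid, limit=10):
    """Fan-out on read: check the cache, else query every channel and merge."""
    if uid in cache:
        return cache[uid]
    events = []
    for name, rows in channels.items():  # one query per channel
        events += [(ts, name, actor) for ts, owner, actor in rows if owner == uid]
    feed = heapq.nlargest(limit, events)  # newest first
    cache[uid] = feed
    return feed
```

In this sketch every cache miss touches every channel, so the read cost grows with the number of channels, which matches the slowdown the slide describes.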
  • 13. Take 2: fan out on write (push) ● Had to change approach entirely ● Needed to store data more efficiently ● Writes can be queued up ● Roll-ups should be persistent ● More channels != slower performance ● Built as a scalable service end-to-end
  • 14. More efficient storage ● Don’t piggyback on old features ● Pre-aggregated user activity feeds ● Ideally store roll-ups in the same store ● Always sorted by time ● Minimal updates or deletes required ● Avoid counting
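A minimal sketch of the push model, assuming events arrive in timestamp order and using in-memory dicts where the talk later uses Cassandra. The names `feeds`, `counters`, `publish` and `read_feed` are hypothetical.

```python
from collections import Counter, defaultdict

feeds = defaultdict(list)        # uid -> events, appended in time order
counters = defaultdict(Counter)  # uid -> per-type counts, maintained at write time


def publish(ts, event_type, actor, target):
    """Fan-out on write: store the event in the target's feed right away."""
    feeds[target].append((ts, event_type, actor))
    counters[target][event_type] += 1  # no counting needed at read time


def read_feed(uid, limit=10):
    """A read is one slice of a pre-aggregated feed, newest first."""
    return feeds[uid][-limit:][::-1]
```

Here adding a new event type changes nothing on the read path, which is the "more channels != slower performance" goal from the previous slide.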
  • 15. Writes can be queued up ● Push all activity / events into message queue ● Process and persist as soon as possible ○ don’t cause back-pressure ○ needs to be durable ○ must be able to easily scale consumption ● Lots of message queue technologies ○ experience with RabbitMQ (web server logs) ○ and Redis (pubsub, monitoring) ○ tested out Flume (non-ng), buggy at the time
  • 17. Event manager ● Wanted to be able to publish events ● Started off needing minimal information ○ event type, timestamp, user ids, etc. ○ soon after, needed much more data per event ● Was one of our first Scala libraries! ○ admittedly, needs clean-up now (= ● Provides sync and async publishing ● Also provides callback based consumer API
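The shape of such a library, with sync and async publishing and a callback based consumer API, might look like the sketch below. It is an in-process toy in Python, not the Scala library from the talk: the class name, method names and the background thread are all assumptions, and nothing here talks to a real message queue.

```python
import queue
import threading


class EventManager:
    """Toy event manager: callbacks consume, publishers choose sync or async."""

    def __init__(self):
        self._callbacks = []
        self._queue = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def subscribe(self, callback):
        self._callbacks.append(callback)

    def publish_sync(self, event):
        # Deliver to every consumer before returning.
        for callback in self._callbacks:
            callback(event)

    def publish_async(self, event):
        # Queue the event and return immediately.
        self._queue.put(event)

    def _drain(self):
        while True:
            event = self._queue.get()
            try:
                self.publish_sync(event)
            finally:
                self._queue.task_done()

    def flush(self):
        # Block until every queued event has been delivered.
        self._queue.join()
```

A caller would register a callback with `subscribe`, publish with either method, and call `flush` before shutdown so queued events are not lost.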
  • 19. Why Kafka? ● Durable by design ● Very high throughput! and scalable ● Supports consumer groups ● Native Scala API ● Supports consumer “replay” ● Per topic data retention and partitioning ● Integrates with Hadoop ○ Kafka <-> Hadoop via Camus and Camus2Kafka ● Grabbed it from LinkedIn’s SVN ○ never looked back! we love it! ○ moving to 0.8.1 at the time of this writing
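Consumer groups and replay both follow from treating a topic as an append-only log with a per-group read position. The sketch below is a toy model of that idea only; it is not Kafka's API, and `Topic`, `poll` and `seek` are hypothetical names.

```python
class Topic:
    """Toy append-only log with one read offset per consumer group."""

    def __init__(self):
        self._log = []
        self._offsets = {}

    def append(self, message):
        self._log.append(message)

    def poll(self, group):
        # Return everything this group has not seen yet and advance its offset.
        start = self._offsets.get(group, 0)
        batch = self._log[start:]
        self._offsets[group] = len(self._log)
        return batch

    def seek(self, group, offset):
        # "Replay": move a group's offset back so it re-reads old messages.
        self._offsets[group] = offset
```

Because messages stay in the log after being read, a feed consumer and a stats consumer can each read the full stream independently, and either can be rewound without affecting the other.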
  • 21. Consumers ● Implement a simple API ● Subscribe to Kafka topics ● Process events, can do “anything” ○ For activity feeds ■ All interesting events are published ■ and consumed (views, liked, emails, uploads…) ■ then stored into the data store ○ We can also ■ maintain counts & stats, send notifications ● Can fail, with certain tolerance ○ otherwise they stop and alarms are raised
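One way to read "can fail, with certain tolerance" is a consumer loop that counts handler failures and stops once a threshold is crossed. The sketch below assumes exactly that; the threshold, the exception type and the function name are hypothetical, and the talk does not say how tolerance was actually measured.

```python
class ToleranceExceeded(Exception):
    """Raised when a consumer has failed more often than allowed."""


def consume(events, handler, max_failures=3):
    """Process events, tolerating up to max_failures handler errors."""
    processed = 0
    failures = 0
    for event in events:
        try:
            handler(event)
            processed += 1
        except Exception:
            failures += 1
            if failures > max_failures:
                # In production this is where the consumer would stop
                # and an alarm would be raised.
                raise ToleranceExceeded(f"{failures} failures")
    return processed, failures
```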
  • 23. Cassandra ● Activity feeds fit well into C*’s data model ● TimeUUID ordering means no sorting ● Having lots of writes is not a problem ○ works to our advantage ○ we want to push data to users’ activity feeds ● Supports counters ● Can add nodes as needed ● Gave each user multiple rows ○ each row is feed type ○ one row is the “roll-up” row ○ roll-ups done in background, or on demand ○ each user has multiple counters and a few “lists”
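The per-user row layout can be approximated with a dict keyed by `(uid, feed_type)`. The slides do not define what a roll-up contains, so this sketch assumes it collapses each feed type into a single entry holding the latest timestamp and the distinct actors. All names are hypothetical and no Cassandra driver is involved.

```python
from collections import defaultdict

# (uid, feed_type) -> [(ts, actor)]; one "row" per feed type per user.
rows = defaultdict(list)


def store(uid, feed_type, ts, actor):
    rows[(uid, feed_type)].append((ts, actor))


def roll_up(uid):
    """Rebuild the user's roll-up row: one entry per feed type, newest first."""
    entries = []
    for (owner, feed_type), cols in rows.items():
        if owner != uid or feed_type == "rollup" or not cols:
            continue
        latest = max(ts for ts, _ in cols)
        actors = sorted({actor for _, actor in cols})
        entries.append((latest, feed_type, actors))
    rows[(uid, "rollup")] = sorted(entries, reverse=True)
    return rows[(uid, "rollup")]
```

Storing the result under its own key mirrors the "one row is the roll-up row" bullet: it can be refreshed in the background or on demand, and reads never have to recompute it.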
  • 27. How are the feeds read? ● Cassandra nodes don’t get read from directly ● HTTP layer in front of C* ○ provides specific access points ○ mainly for reading, almost no writes ■ except for some counters ○ supports caching requests ■ and busting the cache ○ returns everything as JSON ● Netty! ○ We have 3 readers with Varnish ○ we want to port this to Play!
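The caching and cache busting behaviour of such a read layer can be sketched as a small read-through cache that returns JSON. This is an assumption-laden illustration: `FeedReader`, `get` and `bust` are hypothetical names, the backend is a plain function rather than Cassandra, and the talk does not say whether caching lived in the service itself or in Varnish.

```python
import json


class FeedReader:
    """Toy read layer: serves JSON, caches responses, supports busting."""

    def __init__(self, backend):
        self._backend = backend  # callable: uid -> list of events
        self._cache = {}
        self.backend_reads = 0

    def get(self, uid):
        if uid not in self._cache:
            self.backend_reads += 1
            self._cache[uid] = json.dumps(self._backend(uid))
        return self._cache[uid]

    def bust(self, uid):
        # Drop the cached response so the next read hits the backend again.
        self._cache.pop(uid, None)
```

A writer that adds an event to a user's feed would call `bust` for that user, so the next request sees fresh data while repeated reads in between stay cheap.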
  • 30. So how did all of this work out? ● Pretty fantastic! ● Kafka, top notch high performance queue ● Netty, fast, uses CPU very efficiently ● Cassandra, data model fit well ○ wide rows rock! ○ can’t live without counters anymore ○ eventual consistency, or why C* owns for feeds ● Issues? ○ C* consistency matters, must tune ○ Big batch reads from C* live cluster can be painful ○ not much else really (=
  • 31. What else are you up to? ● Want to push more lists into C* ● Want to push our on-site inbox into C* ● Experimenting with Spark and C* ● Need to get data from MySQL -> C* ○ working on a tool to feed MySQL’s replication stream into Kafka via Avro binary serialization ■ can use it to keep MySQL and C* in sync for some table, or to maintain basic counts ■ or as a data source for Spark ■ or pump into Hadoop
  • 32. Fin! That’s all folks (= Thanks! Questions? Oh, we’re hiring! http://mate1inc.com/careers/