Follow a Firefox crash from its genesis in a collapsing browser process through the dizzying array of collection, storage, and reporting systems that make up Socorro, our open-source crash collector. Enjoy war stories of weird, interlocking failures, and see how we nevertheless continue to fulfill our mandate: “Never lose a crash.” Observe some patterns that emerged from this system which can be useful in yours.
What happens when Firefox crashes?
1. What Happens When Firefox Crashes?
or: It's Not My Fault Tolerance
by Erik Rose
Welcome!
[Erik Rose (if not introduced)]
I write server-side code @ Mozilla,
and I'm here to tell you about the Big Data systems behind FF crash reporting.
• A browser is a complex piece of software.
• It's challenging to test.
• It interacts with a lot of other software: JS add-ons, compiled plugins, OSes, different hardware.
  • Even the unique timings of your setup can trigger bugs.
  • Also, 50 billion to 1 trillion web pages, which do unpredictable, creative things.
  • ***Any of which could make FF explode.***
• That's why, in addition to an extensive test suite and manual testing, we invest a lot in crash reporting.
So today, I want to show you what happens when Firefox crashes and what the systems that receive and process the crash reports look like.
2. (Repeat of slide 1.)
3. • If you've crashed FF, you've seen this dialog.
• If you choose to send us a crash report, we use it to…
  • find new bugs
  • decide where to concentrate our time
4. Socorro
• The thing that receives FF crash reports is called Socorro.
• ***Open source.***
• You can use it if you want. Very flexible.
• Used by Valve, Yandex.
• Socorro gets its name from the Very Large Array in Socorro, NM because…
5. Socorro
https://github.com/mozilla/socorro
6. Very Large Array
Socorro, New Mexico
Like that array, it receives signals from out in the universe and tries to filter out patterns from the noise.
• 27 dish antennas, which can move to follow objects across the sky.
• Socorro is a Very Large Array of slightly less expensive systems which tracks crashes across the userbase.
7. The Big Picture
Let's take a peek behind the curtain.
You'll recognize some things you're doing yourself,
and some other things might surprise you.
So let's embark on our tour of Socorro!
8. • On its front end, it looks like this.
• Public. We don't hide our failures.
• Unusual.
9. You can drill into this, to see e.g. top crashers:
• ***% of all crashes***
• signature (stack trace)
• breakdown by platform
• ticket correlations
10. (Repeat of slide 9.)
11. • Another example: explosive crashes.
  • Music charts have "bullets":
    • a song which rises quickly up the charts to suddenly become extremely popular.
  • Something we expect to see as 5% of all crashes, but then you wake up one morning and they're 85% of all crashes.
  • Generally what this means is that one of the major sites shipped a new piece of JS which crashes us.
  • The most recent example of this was during the last Olympics, when Google released a new Doodle every day.
12. • I think it was this one that crashed us.
• On the one hand, we knew the problem was going away tomorrow. So that's nice.
• OTOH, a lot of people have Google set as their startup page. So that's bad. ;-)
13. • You can also find…
  • Most common crashes for a version, platform, etc.
  • New crashes
  • Correlations
    • ferret out interactions between plugins, for example
• Pretty straightforward, right?
The backend is less straightforward…
14. [Architecture diagram: Crash Reporter & Breakpad → Zeus load balancer → Collectors → Local FS → Crash Movers → HBase & RabbitMQ → Processors (debug symbols on NFS) → PostgreSQL (via pgbouncer) & elasticsearch → Middleware → Web Front-end (memcached, LDAP), with Zeus load balancers throughout; cron jobs: Duplicate Finder, Bugzilla Associator, Automatic Emailer (Bugzilla), Materialized View Builders (Active Daily Users, Signatures, Versions, Explosiveness), ADU Count Loader, Version Scraper, FTP, Vertica]
• Over 120 boxes, all physical.
• Why physical?
  • Organizational momentum.
  • HBase doesn't do so well virtualized. It's very talky between nodes, so low latency is important.
• How much data? "The smallest big-data project": it used to be considered big. Not anymore.
• Numbers:
  • ***500M FF users***
  • ***150M ADUs.*** Probably more.
  • ***3000 crashes/minute.*** 3M/day.
  • ***A FF crash*** is 150KB-20MB (hard ceiling: anything over 20MB is just an out-of-memory crash anyway and just full of corrupt garbage).
  • ***800GB*** in PG.
  • ***110TB*** in HDFS. That's replicated; 40TB of actual data.
• Dictum: "Never lose a crash." We have all Firefox crashes from the very beginning.
  • One reason for this is so a developer can go into the UI and request that a crash be processed, and it will be.
15-20. [Same architecture diagram, with one headline number added per build:]
500M Firefox users
150M daily users
3000 crashes per minute
150KB-20MB per crash
800GB in PostgreSQL
40TB in HDFS, 110TB replicated
21-22. [Same architecture diagram]
It all starts ***down here***, with FF.
But even that's made up of multiple moving parts.
23. [Diagram detail: Crash Reporter & Breakpad → Zeus load balancer → Collectors]
These ***first 3*** pieces are all on the client side.
The ***first 2*** run in the FF process.
• Breakpad
  • Used by Firefox, Chrome, Google Earth, Camino, Picasa.
  • Takes a stack dump of all threads.
    • Opaque; it doesn't even know the frame boundaries.
    • Plus a little other processor state.
  • Throws it to another process: ***Crash Reporter***.
Why? Remember, FF has crashed. Its state is unknown.
Crash Reporter, which is responsible for ***this little dialog***, sends the
binary crash dump + JSON metadata
→ POST → collectors… (a sketch of that POST follows)
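To make that concrete, here is a minimal sketch of roughly what that submission looks like: a multipart POST carrying the binary minidump plus the metadata fields. The real Crash Reporter is native code, and the URL and field names here are illustrative assumptions, not Socorro's actual contract.

```python
import requests

# Metadata the client already knows about itself; keys are illustrative.
metadata = {
    "ProductName": "Firefox",
    "Version": "27.0",
    "ReleaseChannel": "release",
}

with open("crash.dmp", "rb") as dump:
    resp = requests.post(
        "https://crash-reports.example.com/submit",   # hypothetical collector URL
        data=metadata,
        files={"upload_file_minidump": dump},          # the binary minidump
        timeout=30,
    )

# The collector answers with a crash ID the client can keep for later lookup.
print(resp.status_code, resp.text)
```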
24-26. (Repeats of slide 23.)
29. [Diagram detail: Collectors, Local FS, Crash Movers, HBase, RabbitMQ, Processors]
Collectors: super simple
Writes crashes to ***local disk…***
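A minimal sketch of the collector idea, with a spool directory and function name I've made up: accept the crash, stamp an ID, and write everything to the local filesystem so nothing downstream has to be healthy for collection to succeed.

```python
import json
import os
import uuid

SPOOL_DIR = "/var/spool/crashes"   # hypothetical spool directory

def store_crash(metadata: dict, minidump: bytes) -> str:
    """Write one crash to local disk and return its new ID."""
    crash_id = str(uuid.uuid4())
    os.makedirs(SPOOL_DIR, exist_ok=True)
    with open(os.path.join(SPOOL_DIR, crash_id + ".dump"), "wb") as f:
        f.write(minidump)
    with open(os.path.join(SPOOL_DIR, crash_id + ".json"), "w") as f:
        json.dump(metadata, f)
    return crash_id   # echoed back to the client in the HTTP response
```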
30. [Same diagram detail]
Then, another process
on same box
31. [Same diagram detail]
Crash Movers
picks up crashes off local disk
→ 2 places
32. [Same diagram detail]
1st: → HBase.
HBase is primary store for crashes.
70 nodes
At the same time***…***
33. [Same diagram detail]
IDs → Rabbit
• Two queues: soft realtime (priority) and normal
  • Priority: process within 60 secs
(A sketch of a crash mover follows.)
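Here is a rough sketch of a crash mover, assuming the happybase and pika client libraries; the hostnames, table, column family, and queue names are invented for illustration.

```python
import glob
import os

import happybase   # HBase client (via the Thrift gateway)
import pika        # RabbitMQ client

SPOOL_DIR = "/var/spool/crashes"

hbase = happybase.Connection("hbase-thrift.example.com")
crashes = hbase.table("crash_reports")

rabbit = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq.example.com"))
channel = rabbit.channel()
channel.queue_declare(queue="crashes.normal", durable=True)

for dump_path in glob.glob(os.path.join(SPOOL_DIR, "*.dump")):
    crash_id = os.path.basename(dump_path)[:-len(".dump")]
    meta_path = os.path.join(SPOOL_DIR, crash_id + ".json")

    with open(dump_path, "rb") as f:
        dump = f.read()
    with open(meta_path, "rb") as f:
        meta = f.read()

    # 1st place: HBase, the permanent record ("never lose a crash").
    crashes.put(crash_id.encode(), {b"raw:dump": dump, b"raw:meta": meta})

    # 2nd place: just the ID, onto the queue the processors consume.
    channel.basic_publish(exchange="", routing_key="crashes.normal",
                          body=crash_id.encode())

    # Only now is it safe to clear the local buffer.
    os.remove(dump_path)
    os.remove(meta_path)
```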
34. [Same diagram detail]
• Processors
  • Where the real action happens.
  • To process a crash means to do what's necessary to make it visible in the web UI.
  • Take an ID from Rabbit.
  • binary → debug
  • signature generation
  • Then it puts the crash into buckets and adds it to PG and ES. (A rough sketch of the loop follows.)
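A sketch of a processor's outer loop under the same assumptions as the previous sketch. In the real system the raw dump comes back out of HBase, and the processed result goes on to PG and ES; both are elided here. minidump_stackwalk is Breakpad's stack-walking tool, while the paths and queue name are illustrative.

```python
import os
import subprocess

import pika

SPOOL_DIR = "/var/spool/crashes"   # stand-in; really the raw dump is fetched from HBase
SYMBOL_PATH = "/mnt/symbols"       # debug symbols shared over NFS

def handle(channel, method, properties, body):
    crash_id = body.decode()
    dump_path = os.path.join(SPOOL_DIR, crash_id + ".dump")

    # binary -> debug: symbolize every thread's stack with Breakpad's tool.
    result = subprocess.run(
        ["minidump_stackwalk", dump_path, SYMBOL_PATH],
        capture_output=True, text=True,
    )
    # Signature generation and the writes to PG and ES would happen here.
    print(crash_id, "processed:", len(result.stdout.splitlines()), "lines of stack")
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq.example.com"))
channel = connection.channel()
channel.queue_declare(queue="crashes.normal", durable=True)
channel.basic_consume(queue="crashes.normal", on_message_callback=handle)
channel.start_consuming()
```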
First, PG.
35. [Diagram detail: Processors, PostgreSQL (pgbouncer), elasticsearch, Middleware, Web Front-end, cron jobs]
• Postgres
  • Our main interactive datastore.
  • It's what the web app and most batch jobs talk to.
  • Stores (cut?)
    • unique crash signatures
    • numbers of crashes, bucketed by signature
    • other aggregations of crash counts on various facets
      • to make reporting fast
      • (see slide 32 of breakpad.socorro.master.key)
  • In there for a couple of reasons:
    • Prompt, reliable answers to queries
    • Referential integrity
      • Stores unique crash signatures
      • and their relationships to versions, tickets, & so on
    • PHP & Django are easy to query from.
(A sketch of the kind of rollup the matview builders do follows.)
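For flavor, here is a sketch of the sort of nightly rollup a materialized-view builder might do, assuming psycopg2 and invented table and column names: collapse one day's individual reports into per-signature counts so the reporting pages never scan raw crashes.

```python
import datetime

import psycopg2

# Connect through pgbouncer in production; this DSN is illustrative.
conn = psycopg2.connect("dbname=crashstats")

day = datetime.date(2014, 3, 1)
with conn, conn.cursor() as cur:
    # Collapse one day's individual crash reports into per-signature counts.
    cur.execute(
        """
        INSERT INTO signature_counts_daily (signature_id, report_date, crash_count)
        SELECT signature_id, %(day)s::date, count(*)
        FROM reports
        WHERE date_processed >= %(day)s
          AND date_processed < %(day)s::date + interval '1 day'
        GROUP BY signature_id
        """,
        {"day": day},
    )
```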
Now, let’s turn around & talk about ES, which operates in parallel.
36. [Same diagram detail]
• Elasticsearch
  • 90-day rolling window
  • Faceting
  • The new kid on the block
  • Extremely flexible text analysis.
    • Though geared toward natural language, we may be able to persuade it to take apart C++ call signatures & let us mine those in meaningful ways.
  • May someday eat some of HBase or Postgres's lunch.
  • It scales out like HBase & can even execute arbitrary scripts near the data, collating & returning data through a master node.
    • Maybe not the flexibility of full map-reduce
    • Filter caching
    • Supports indices itself
(A sketch of the rolling-window-plus-faceting idea follows.)
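A sketch of how a rolling window plus faceting might look, written against the classic elasticsearch-py API; the host, index naming scheme, and field names are illustrative assumptions rather than Socorro's actual schema.

```python
import datetime

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es.example.com:9200"])   # illustrative host

def index_crash(processed_crash: dict) -> None:
    # One index per day: keeping a 90-day window just means deleting old indices.
    index = "crashes-" + datetime.date.today().isoformat()
    es.index(index=index, body=processed_crash)

def top_signatures_last_90_days():
    # Facet across the whole window with a terms aggregation on the signature.
    result = es.search(
        index="crashes-*",
        body={
            "size": 0,
            "query": {"range": {"date_processed": {"gte": "now-90d"}}},
            "aggs": {"by_signature": {"terms": {"field": "signature", "size": 50}}},
        },
    )
    return result["aggregations"]["by_signature"]["buckets"]
```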
37. [Diagram detail: Middleware between the datastores and the Web Front-end]
• Web services ("middleware")
  • At the end of this story is the web application.
  • But between it and the data sits a REST middleware layer.
  • Why?
    • The front-end was in PHP and we didn't want to reimplement model logic in 2 languages.
    • We change datastores.
    • We move data around.
(A toy sketch of the idea follows.)
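A toy sketch of the middleware idea using Flask; the endpoint, parameters, and the stubbed datastore call are all invented for illustration. The point is the shape: the front-end only ever sees JSON over HTTP, so the storage behind it can change freely.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def query_some_datastore(product: str, days: int) -> list:
    # Stub: in real life this hits PostgreSQL (or, someday, elasticsearch).
    return []

@app.route("/crashes/signature-summary")
def signature_summary():
    product = request.args.get("product", "Firefox")
    days = int(request.args.get("days", 7))
    # The web front-end only ever sees this JSON shape, so the datastore
    # behind it can change without touching front-end code.
    return jsonify({"product": product, "days": days,
                    "signatures": query_some_datastore(product, days)})
```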
38. [Same diagram detail]
• Web App
  • Django
  • Each runs memcached
39. [Full architecture diagram]
And that concludes our big-picture tour of Socorro!
Now, as the years have gone by and the system has grown in scope and size,
interesting patterns have emerged.
40. Big Patterns
Tooling was clearly missing.
Standard practices weren't good enough.
I'm going to call out some of these emergent needs and
show you our solutions.
Maybe you'll even find some of our tools useful.
The first…
41. Big Storage
Every Big Data system has to put everything somewhere.
The solutions are well-established,
and the amount of data you can deal with in a commoditized fashion rises every year.
But sharding and replication are expensive.
We realized that,
by application of statistics,
we could ***shrink the amount of data***.
42. Big Storage
***sampling***
  per product
  all FFOS crashes
  we don't want to lose interesting rare events (due to sampling)
***targeting***
  take anything with a comment
  • Our statisticians have told us all kinds of useful things about the shape of our data. For instance, the rules that select interesting events don't throw off our OS or version statistics.
***rarification***
  throw away uninteresting parts of stack frames
  • Skiplist rules get uninteresting parts of the stack out of the data, to reduce noise. 2 kinds:
    • Sentinel frames to jump TO
    • Frames that should be ignored
  An important part of making our hash buckets wider,
  reducing the # of unique crash signatures.
With these 3 techniques (sketched below), we cut down the amount of data we need to handle in the later stages of our pipeline.
Sure, we still have to keep everything in HBase, but we don't run live queries against that, so it just means buying more HDs.
But the processors, Rabbit, PG, ES, memcache, and crons all have a lighter load.
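A sketch of the three techniques with made-up rules and rates: accept_crash() does the sampling and targeting at collection time, and generate_signature() does the rarification with skiplist and sentinel frames. The specific rates, frame names, and signature format are invented for illustration, not Socorro's real rules.

```python
import random

# Made-up rules, purely for illustration.
SAMPLE_RATES = {"Firefox": 0.10, "FirefoxOS": 1.0}        # e.g. keep 10% of Firefox, all FxOS
SKIP_FRAMES = {"__libc_start_main", "malloc", "memcpy"}    # noise; never part of a signature
SENTINEL_FRAMES = {"js::RunScript", "mozilla::ipc::MessageChannel::Call"}

def accept_crash(metadata: dict) -> bool:
    # Targeting: always keep the rare, interesting stuff.
    if metadata.get("Comments"):
        return True
    # Sampling: per-product rates, chosen so the statistics stay representative.
    rate = SAMPLE_RATES.get(metadata.get("ProductName", ""), 1.0)
    return random.random() < rate

def generate_signature(frames: list[str]) -> str:
    # Rarification: if a sentinel frame appears, start the signature there...
    for i, frame in enumerate(frames):
        if frame in SENTINEL_FRAMES:
            frames = frames[i:]
            break
    # ...and drop the frames the skiplist says to ignore.
    interesting = [f for f in frames if f not in SKIP_FRAMES]
    return " | ".join(interesting[:5])   # the first few interesting frames become the bucket key
```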
43-45. Big Storage [same slide, adding the labels Sampling, Targeting, Rarification one per build]
46. Big Systems
• Big Data systems tend to be complicated systems.
• Diverse parts: not just one big 500-node HBase cluster and done.
• Example: 6 data stores:
  • FS
  • PG
  • ES
  • HBase
  • memcache
  • RabbitMQ
• This is typical of architectures now. Gone are the days of 1 datastore, 1 representation.
  • 18 months ago, I was hearing jokes about the data mullet: relational in the front, NoSQL in the back.
  • Now it's data dreadlocks. It's all over the place.
The kinds of problems you can have in these systems are really tough to track down.
47. Hadoops!
A tale of Big Failure
A crash every 50 hours:
***Hadoop's cleverness*** with TCP connections,
TCP stack bugs in Linux,
lying NICs.
The OS buffers fill up with unclosed connections & crash.
• So we're very, very cautious about ***the equipment*** we use.
Remember that hardware is a nontrivial part of your system.
• When you have a problem, it can be hard to work out exactly what's gone wrong.
  • It can take time to get everybody together.
We must keep receiving crashes.
***Boxes & springs***
48-50. Hadoops! [same slide, adding the takeaways one per build:]
Complex interactions
Hardware matters.
Design for failure.
51-52. [Full architecture diagram]
The most important: ***this Local FS***
53. [Architecture diagram detail: Local FS]
Everything else can fail.
3 days of runway.
It has saved us several times.
Yours may not look like this, but:
• You could imagine a system being able to serve just out of cache if the datastore went away.
• Or operate in read-only mode if writes became unavailable.
  (SUMO)
One thing from this diagram we didn't talk about much yet was ***cron jobs***.
54. Big Batching
• Mozilla is a large project with a long legacy, and Socorro interfaces with a lot of other systems. ***A lot of this occurs via batch jobs.***
55. [Full architecture diagram]
57. In fact, you can look at a lot of our periodic tasks as a dependency tree.
One thing upstream fails***…***
58. …and everything downstream fails too.
We replaced cron with crontabber.
Instead of blindly running jobs whose prerequisites aren't filled,
it runs the ***parent*** until it succeeds, then runs the ***children*** (a toy sketch follows below).
We wanted diagrams to visualize the state of the system;
that's too error-prone by hand.
***Then*** we thought: why not have crontabber draw them for us?
59-60. (Repeats of slide 58.)
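Here is a toy sketch of crontabber's core idea (crontabber itself is open source and far more capable): jobs declare their parents, and a child only runs once every parent has succeeded. The job names and the dependency between them are invented for illustration.

```python
# Job bodies are trivial stand-ins; names and dependencies are invented.
def build_adu_counts():
    print("loading ADU counts")

def build_signature_matviews():
    print("building signature matviews")

def build_explosiveness():
    print("computing explosiveness")   # pretend this needs both jobs above

JOBS = {
    "adu": (build_adu_counts, []),
    "signatures": (build_signature_matviews, []),
    "explosiveness": (build_explosiveness, ["adu", "signatures"]),
}

def run_all():
    succeeded = set()
    pending = dict(JOBS)
    while pending:
        progress = False
        for name, (func, parents) in list(pending.items()):
            if all(p in succeeded for p in parents):
                try:
                    func()
                    succeeded.add(name)
                except Exception as exc:
                    # A failed parent never enters `succeeded`, so its
                    # children stay blocked instead of running blindly.
                    print(name, "failed:", exc)
                del pending[name]
                progress = True
        if not progress:
            print("blocked, prerequisites unmet:", ", ".join(pending))
            break

run_all()
```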
61-63. [Crontabber-generated dependency diagrams]
64. SVGs are really neat.
You can wiggle things if the layout is unclear.
And then break down the specifics into a ***table…***
65. One job runs at a time at the moment ("eek, matview perf"), but a great contribution would be some kind of shared locks or thresholds to allow running multiple.
But you know, right now, it's ***good enough…***
66. Big Deal
And it's surprising how often that happens. Oftentimes, your makeshift solutions end up being good enough to do the job.
67. [Full architecture diagram]
***Slapdash, hacky queue (PG):***
polls HBase
→ PG;
polls PG
→ processors.
***Local FS buffer*** was a temporary fix when we had reliability problems with HBase.
***I could tell*** you "don't be afraid of temporary hacks". But I think that's a healthy fear to have.
Or perhaps my message should be: do a good job on your temporary solutions, because they'll probably be around awhile.
68-70. (Repeats of slide 67.)
71. Definition: can you hook it up to one computer, or fit it on one desk?
That changes every year.
The fact…wearing nearly 100GB:
unimaginable to the operator of a punch-card duplicator from only 50 years ago.
But the patterns that come out of large systems remain.
Duplicate cards: why? To facet 2 ways in parallel.
While you may need to generalize a bit,
I have no doubt that
the techniques you learn today and tomorrow
will serve you well into the future.