What Happens When Firefox Crashes?
or: It’s Not My Fault Tolerance
by Erik Rose

Welcome!
[Erik Rose (if not introduced)]
I write server-side code at Mozilla, and I’m here to tell you about the Big Data systems behind Firefox crash reporting.

- A browser is a complex piece of software, and it’s challenging to test.
- It interacts with a lot of other software: JS add-ons, compiled plugins, OSes, different hardware.
  - Even the unique timing of your setup can trigger bugs.
  - Also, there are somewhere between 50 billion and 1 trillion web pages, and they do unpredictable, creative things.
  - ***Any of which could make Firefox explode.***
- That’s why, in addition to an extensive test suite and manual testing, we invest a lot in crash reporting.

So today, I want to show you what happens when Firefox crashes and what the systems that receive and process the crash reports look like.
If you’ve crashed Firefox, you’ve seen this dialog. If you choose to send us a crash report, we use it to…
- find new bugs
- decide where to concentrate our time
Socorro
https://github.com/mozilla/socorro

- The thing that receives Firefox crash reports is called Socorro.
- ***Open source.*** You can use it if you want. Very flexible.
- Used by Valve and Yandex.
- Socorro gets its name from the Very Large Array in Socorro, NM, because…
Very Large Array
Socorro, New Mexico

Like that array, it receives signals from out in the universe and tries to filter patterns out of the noise.
- The VLA is 27 dish antennas, which can move to follow objects across the sky.
- Socorro is a very large array of slightly less expensive systems, which tracks crashes across the userbase.
The Big Picture

Let’s take a peek behind the curtain. You’ll recognize some things you’re doing yourself, and some other things might surprise you. So let’s embark on our tour of Socorro!
On its front end, it looks like this. It’s public: we don’t hide our failures. That’s unusual.

You can drill into this to see, for example, the top crashers:
- ***% of all crashes***
- signature (stack trace)
- breakdown by platform
- ticket correlations
- Another example: explosive crashes.
  - Music charts have "bullets": songs that rise quickly up the charts to suddenly become extremely popular.
  - Here, it’s a signature we expect to be 5% of all crashes, but then you wake up one morning and it’s 85% of all crashes.
  - Generally what this means is that one of the major sites shipped a new piece of JS which crashes us.
  - The most recent example was during the last Olympics, when Google released a new Doodle every day. I think it was this one that crashed us.
  - On the one hand, we knew the problem was going away tomorrow. So that’s nice.
  - On the other hand, a lot of people have Google set as their startup page. So that’s bad. ;-)

You can also find…
- the most common crashes for a version, platform, etc.
- new crashes
- correlations, to ferret out interactions between plugins, for example

Pretty straightforward, right?
The backend is less straightforward…
[Architecture diagram: Breakpad and the Crash Reporter on the client send crashes through a Zeus load balancer to the Collectors, which write to local FS. Crash Movers copy crashes into HBase and queue IDs in RabbitMQ for the Processors, which read debug symbols from NFS and write to PostgreSQL (behind pgbouncer) and elasticsearch. A Middleware layer and the memcached-backed Web Front-end sit behind Zeus, with LDAP for authentication. Cron jobs round it out: the Materialized View Builders (Active Daily Users, Signatures, Versions, Explosiveness), the ADU Count Loader (from Vertica), the Version Scraper (from FTP), the Bugzilla Associator, the Automatic Emailer, and the Duplicate Finder.]
- Over 120 boxes, all physical.
- Why physical?
  - Organizational momentum.
  - HBase doesn’t do so well virtualized. It’s very talky between nodes, so low latency is important.
- How much data? "The smallest big-data project." It used to be considered big; not anymore.
- Numbers:
  - ***500M Firefox users***
  - ***150M ADUs (active daily users). Probably more.***
  - ***3000 crashes/minute.*** 3M/day.
  - ***A Firefox crash*** is 150KB-20MB. The 20MB is a hard ceiling: anything over that is just an out-of-memory crash anyway, full of corrupt garbage.
  - ***800GB*** in PostgreSQL
  - ***110TB*** in HDFS. That’s replicated; 40TB of actual data.
- Dictum: “Never lose a crash.” We have all Firefox crashes from the very beginning.
  - One reason for this is so a developer can go into the UI and request that a crash be processed, and it will be.
It all starts ***down here***, with FF.
But even that’s made up of multiple moving parts.
These ***first three*** pieces all live on the client side; the ***first two*** run in the Firefox process.

- Breakpad
  - Used by Firefox, Chrome, Google Earth, Camino, Picasa.
  - Takes a stack dump of all threads. It’s opaque; Breakpad doesn’t even know the frame boundaries.
  - Grabs a little other processor state.
  - Throws it all to another process: the ***Crash Reporter***. Why another process? Remember, Firefox has crashed; its state is unknown.

The Crash Reporter, which is responsible for ***this little dialog***, packages the binary crash dump plus JSON metadata and POSTs it to the collectors…
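As a rough sketch of what that submission step amounts to (the real Crash Reporter is native code; the endpoint URL and field names here are placeholders, not Socorro’s actual ones):

```python
# Hypothetical sketch of the Crash Reporter's upload: a multipart POST
# carrying the binary minidump plus the metadata fields from the dialog.
import requests

def submit_crash(minidump_path, metadata):
    with open(minidump_path, "rb") as dump:
        response = requests.post(
            "https://crash-reports.example.com/submit",   # placeholder URL
            files={"upload_file_minidump": dump},          # the binary dump
            data=metadata,                                 # product, version, comments…
            timeout=30,
        )
    response.raise_for_status()
    return response.text                                   # collector replies with a crash ID

# Example of the kind of metadata sent alongside the dump:
# submit_crash("/tmp/abcd1234.dmp",
#              {"ProductName": "Firefox", "Version": "27.0", "Comments": "it died"})
```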
…which is where the crash really enters Socorro.
Collectors: super simple. They write crashes to ***local disk…***
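To give an idea of just how little a collector does, here is a minimal sketch; the spool directory, file layout, and ID scheme are invented for illustration:

```python
# Minimal collector sketch: accept the POSTed crash, assign an ID, and write
# the raw dump and its metadata straight to local disk. Nothing clever.
import json
import os
import uuid

CRASH_DIR = "/var/crashes"   # hypothetical local spool directory

def collect(form_fields, minidump_bytes):
    crash_id = uuid.uuid4().hex
    os.makedirs(CRASH_DIR, exist_ok=True)
    with open(os.path.join(CRASH_DIR, crash_id + ".dump"), "wb") as f:
        f.write(minidump_bytes)
    with open(os.path.join(CRASH_DIR, crash_id + ".json"), "w") as f:
        json.dump(form_fields, f)
    return crash_id          # echoed back to the Crash Reporter
```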
Then another process, on the same box…
Crash Movers pick up crashes off the local disk and send them to two places.
First: into HBase. HBase is the primary store for crashes; it runs on 70 nodes. At the same time…
Second: the crash IDs go to RabbitMQ, into soft-realtime (priority) and normal queues. Priority crashes are processed within 60 seconds.
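A sketch of that mover loop follows. The happybase and pika calls are generic stand-ins (Socorro has its own storage classes), and the table, queue, and path names are made up:

```python
# Crash mover sketch: scan the local spool, copy each crash into long-term
# storage (HBase), then queue its ID for processing (RabbitMQ).
import glob
import os
import happybase
import pika

def move_crashes(spool_dir="/var/crashes"):
    hbase = happybase.Connection("hbase-gateway.example.com")
    crashes = hbase.table("crash_reports")
    rabbit = pika.BlockingConnection(pika.ConnectionParameters("rabbit.example.com"))
    channel = rabbit.channel()
    channel.queue_declare(queue="socorro.normal", durable=True)

    for dump_path in glob.glob(os.path.join(spool_dir, "*.dump")):
        crash_id = os.path.basename(dump_path)[:-len(".dump")]
        with open(dump_path, "rb") as f:
            crashes.put(crash_id, {b"raw:dump": f.read()})        # 1st: into HBase
        channel.basic_publish(exchange="",
                              routing_key="socorro.normal",       # 2nd: ID into Rabbit
                              body=crash_id.encode())
        os.remove(dump_path)   # only after both writes succeed: never lose a crash
```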
- Processors
  - This is where the real action happens.
  - To process a crash means to do whatever is necessary to make it visible in the web UI.
  - Take an ID from RabbitMQ.
  - Turn the binary dump into debug (symbolized) stack frames.
  - Generate a signature.
  - Then put the crash into buckets and add it to PostgreSQL and elasticsearch (a sketch follows this list).
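Roughly, each processor does something like the following. minidump_stackwalk really is Breakpad’s tool for symbolizing a dump, but the storage callables, the frame parsing, and the signature rule are simplified stand-ins, not Socorro’s actual code:

```python
# Simplified processor sketch: take a crash ID off the queue, fetch the raw
# dump, symbolize it with Breakpad's minidump_stackwalk, derive a signature,
# and save the result where the web app can see it.
import subprocess

def process_crash(crash_id, fetch_raw_dump, save_processed):
    dump_path = fetch_raw_dump(crash_id)              # e.g. pulled back out of HBase

    # binary → debug: walk the stack using the debug symbols kept on NFS
    result = subprocess.run(
        ["minidump_stackwalk", "-m", dump_path, "/mnt/symbols"],
        capture_output=True, text=True, check=True)

    # Machine-readable output is pipe-separated; frame lines begin with the
    # thread number and carry the function name. This crude parse (thread 0
    # only) is purely for illustration.
    frames = [line.split("|")[3] for line in result.stdout.splitlines()
              if line.startswith("0|")]

    signature = " | ".join(frames[:3]) or "EMPTY"     # crude stand-in for signature rules
    processed = {"crash_id": crash_id, "signature": signature, "frames": frames}
    save_processed(processed)                         # → PostgreSQL and elasticsearch
    return processed
```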
First, PG.
- Postgres
  - Our main interactive datastore: it’s what the web app and most batch jobs talk to.
  - Stores unique crash signatures and the numbers of crashes, bucketed by signature.
  - Also stores other aggregations of crash counts on various facets, to make reporting fast.
  - It’s in there for a couple of reasons:
    - Prompt, reliable answers to queries.
    - Referential integrity: it keeps the unique crash signatures and their relationships to versions, tickets, and so on.
    - PHP and Django are easy to query from.
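For instance, the kind of bucketed question Postgres is there to answer quickly looks something like this; the table and column names are invented for the sketch, and real Socorro keeps pre-built aggregate tables so queries like this stay cheap:

```python
# Sketch of a "top crashers" query against a hypothetical reports table.
import psycopg2

def top_crashers(conn, product, version, limit=20):
    with conn.cursor() as cur:
        cur.execute("""
            SELECT signature, count(*) AS crash_count
            FROM reports
            WHERE product = %s AND version = %s
            GROUP BY signature
            ORDER BY crash_count DESC
            LIMIT %s
        """, (product, version, limit))
        return cur.fetchall()
```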
Now, let’s turn around and talk about elasticsearch, which operates in parallel.
- Elasticsearch
  - Holds a 90-day rolling window of crashes.
  - Great at faceting.
  - The new kid on the block.
  - Extremely flexible text analysis. Though it’s geared toward natural language, we may be able to persuade it to take apart C++ call signatures and let us mine those in meaningful ways.
  - May someday eat some of HBase’s or Postgres’s lunch.
  - It scales out like HBase and can even execute arbitrary scripts near the data, collating and returning results through a master node. Maybe not the flexibility of full map-reduce, but it has filter caching and supports indices itself.
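Faceting is the kind of thing it makes easy. A sketch using the elasticsearch-py client’s older body-style API, with invented index and field names:

```python
# Facet crashes for one signature by platform over the rolling window.
from elasticsearch import Elasticsearch

def crashes_by_platform(signature):
    es = Elasticsearch(["http://es.example.com:9200"])
    return es.search(index="crashes", body={
        "query": {"bool": {"filter": [
            {"term": {"signature": signature}},
            {"range": {"date_processed": {"gte": "now-90d"}}},  # 90-day window
        ]}},
        "aggs": {"by_platform": {"terms": {"field": "platform"}}},
        "size": 0,   # we only want the aggregation, not the documents
    })
```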
- Web services (“middleware”)
  - At the end of this story is the web application, but between it and the data sits a REST middleware layer.
  - Why?
    - The front end was in PHP, and we didn’t want to reimplement model logic in two languages.
    - We change datastores.
    - We move data around.
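So the front end only ever speaks to the middleware, and the middleware decides which store actually answers. A toy sketch of the idea (Flask, the endpoint shape, and the stub are illustrative, not Socorro’s real service):

```python
# Toy middleware sketch: one small REST contract in front of whatever
# datastore happens to hold the data this year.
from flask import Flask, jsonify

app = Flask(__name__)

def query_datastore(signature):
    # Stand-in: today this might ask Postgres, tomorrow elasticsearch.
    return {"signature": signature, "counts": {"Windows": 123, "Linux": 4}}

@app.route("/crashes/<signature>/by_platform")
def by_platform(signature):
    return jsonify(query_datastore(signature))
```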
- Web App
  - Django.
  - Each one runs memcached.
And that concludes our big-picture tour of Socorro!

Big Patterns

Now, as the years have gone by and the system has grown in scope and size, interesting patterns have emerged: tooling was clearly missing, and standard practices weren’t good enough. I’m going to call out some of these emergent needs and show you our solutions. Maybe you’ll even find some of our tools useful. The first…
Big Storage

Every Big Data system has to put everything somewhere. The solutions are well established, and the amount of data you can deal with in a commoditized fashion rises every year, but sharding and replication are expensive. We realized that, by application of statistics, we could ***shrink the amount of data*** itself.

***Sampling***
- Sampling rates are set per product; for example, we keep all Firefox OS crashes.
- We don’t want to lose interesting rare events to sampling, which is why we also do…

***Targeting***
- Take anything with a comment.
- Our statisticians have told us all kinds of useful things about the shape of our data. For instance, the rules that select interesting events don’t throw off our OS or version statistics.

***Rarification***
- Throw away uninteresting parts of stack frames.
- Skiplist rules get uninteresting parts of the stack out of the data, to reduce noise. There are two kinds:
  - sentinel frames to jump TO
  - frames that should be ignored
- This is an important part of making our hash buckets wider, reducing the number of unique crash signatures.

With these three techniques (sketched below), we cut down the amount of data we need to handle in the later stages of our pipeline. Sure, we still have to keep everything in HBase, but we don’t run live queries against that, so it just means buying more hard drives. But the processors, RabbitMQ, PostgreSQL, elasticsearch, memcached, and cron jobs all have a lighter load.
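Here is how those three ideas might look as code. The rule syntax, sample rates, and frame names are invented, and the generate_signature helper stands in for Socorro’s much richer signature rules:

```python
# Sketch of the three data-shrinking ideas: sampling, targeting, rarification.
import random

SAMPLE_RATES = {"Firefox": 0.10, "FirefoxOS": 1.0}    # sampling: keep 10% / keep all

def accept_crash(metadata):
    if metadata.get("Comments"):                       # targeting: comments are
        return True                                    # always interesting
    rate = SAMPLE_RATES.get(metadata.get("ProductName"), 1.0)
    return random.random() < rate

SENTINEL_FRAMES = {"mozilla::dom::Worker::Run"}        # frames to jump TO (invented)
SKIP_FRAMES = {"__libc_start_main", "malloc", "free"}  # frames to ignore (invented)

def generate_signature(frames):
    """Rarification: drop boring frames so more crashes share a bucket."""
    for frame in frames:
        if frame in SENTINEL_FRAMES:
            return frame
    kept = [f for f in frames if f not in SKIP_FRAMES]
    return " | ".join(kept[:5])                        # first few interesting frames
```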
Big Systems

- Big Data systems tend to be complicated systems, with diverse parts: it’s not just one big 500-node HBase cluster and done.
- Example: we have six data stores:
  - the filesystem
  - PostgreSQL
  - elasticsearch
  - HBase
  - memcached
  - RabbitMQ
- This is typical of architectures now. Gone are the days of one datastore, one representation.
- 18 months ago, we were hearing jokes about the "data mullet": relational in the front, NoSQL in the back. Now it’s more like data dreadlocks; it’s all over the place.

The kinds of problems you can have in these systems are really tough to track down.
A tale of Big Failure
crash every 50 hours
***Hadoop’s cleverness*** with TCP connections
TCP stack bugs in Linux
lying NICs
OS buffers fill up with unclosed connections & crash
•!❑! So we're very very cautious about ***the equipment*** we use.
Remember that hardware is a nontrivial part of your system
! ❑! When you have a problem, it can be hard to work out exactly what's gone wrong.
! •!❑! Can take time to get everybody together
must keep receiving crashes.
***Boxes & springs***
The most important: ***this Local FS***
Everything else can fail; that local filesystem gives us three days of runway. It has saved us several times.

Your system may not look like this, but:
- You could imagine a system being able to serve just out of cache if the datastore went away.
- Or operate in read-only mode if writes became unavailable. (SUMO)

One thing from this diagram we didn’t talk about much yet is ***cron jobs***.
Big Batching

Mozilla is a large project with a long legacy, and Socorro interfaces with a lot of other systems. ***A lot of this occurs via batch jobs.***
- Materialized views
- The version scraper, once a day
- Bugzilla association
- Automatic emails: we send advice back to users, for example when we see they have malware
- ADUs (active daily users): the denominator for every metric. This job fails a lot, because the metrics systems it pulls from are unreliable, and then everything that depends on it fails.

In fact, you can look at a lot of our periodic tasks as a dependency tree. One thing upstream fails***…*** and everything downstream of it fails too.

So we replaced cron with crontabber. Instead of blindly running jobs whose prerequisites aren’t fulfilled, it runs the ***parent*** until it succeeds, then runs its ***children***.

We wanted diagrams to visualize the state of the system, but drawing them by hand was too error-prone. ***Then*** we thought: why not have crontabber draw them for us?
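The central idea is small. A sketch of dependency-aware scheduling; the job names and the decorator style are loosely modeled on crontabber but heavily simplified:

```python
# Crontabber's core idea: a job only runs once the jobs it depends on have
# succeeded, and a failure blocks its children instead of letting them run
# against missing data. Call run_pending() periodically, as cron would.
JOBS = {}

def job(name, depends_on=()):
    def register(fn):
        JOBS[name] = {"run": fn, "depends_on": depends_on, "ok": False}
        return fn
    return register

@job("adu-count-loader")
def load_adu_counts():
    ...   # pull active-daily-user counts from the metrics system

@job("explosiveness", depends_on=("adu-count-loader",))
def compute_explosiveness():
    ...   # needs ADUs as the denominator

def run_pending():
    for name, meta in JOBS.items():
        if meta["ok"]:
            continue
        if all(JOBS[dep]["ok"] for dep in meta["depends_on"]):
            try:
                meta["run"]()
                meta["ok"] = True
            except Exception:
                pass   # leave it pending; children stay blocked until it succeeds
```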
SVGs are really neat: they can wiggle if something is unclear, and then we break the specifics down into a ***table…***

Crontabber runs one job at a time at the moment, because of materialized-view performance worries, but a great contribution would be some kind of shared locks or thresholds so several can run at once. But you know, right now, it’s ***good enough…***
Big Deal

And it’s surprising how often that happens. Oftentimes, your makeshift solutions end up being good enough to do the job.
***A slapdash, hacky queue (in Postgres)***: something polls HBase and writes into PG, and the processors poll PG. ***The Local FS buffer*** was a temporary fix when we had reliability problems with HBase.

***I could tell*** you “don’t be afraid of temporary hacks”, but I think that’s a healthy fear to have. Or perhaps my message should be: do a good job on your temporary solutions, because they’ll probably be around awhile.
The definition of Big Data is something like “more than you can hook up to one computer, or fit on one desk”, and it changes every year. The fact that I can walk around wearing nearly 100GB would be unimaginable to the operator of a punch-card duplicator from only 50 years ago. But the patterns that come out of large systems remain. Why duplicate cards? To facet two ways in parallel. While you may need to generalize a bit, I have no doubt the techniques you learn today and tomorrow will serve you well into the future.
Big Thanks
twitter: ErikRose
www.grinchcentral.com
erik@mozilla.com
  • 8. ! •!❑! On its front end, it looks like this. Public. Don’t hide our failures Unusual.
  • 9. You can drill into this, to see e.g. top crashers: ! •!❑! ***% of all crashes*** ! •!❑! signature (stack trace) ! •!❑! breakdown by platform ! •!❑! ticket correllations
  • 10. You can drill into this, to see e.g. top crashers: ! •!❑! ***% of all crashes*** ! •!❑! signature (stack trace) ! •!❑! breakdown by platform ! •!❑! ticket correllations
  • 11. !–! Another example: explosive crashes ! !–! Music charts: "bullets" ! •!❑! song which rises quickly up the charts to suddenly become extremely popular ! •!❑! Something we expect to see as 5% of all crashes, but then you wake up one morning, and they're 85% of all crashes. ! •!❑! Generally what this means is that one of the major sites shipped a new piece of JS which crashes us. ! !✓! The most recent example of this is during the last Olympics, when Google released a new Doodle every day.
  • 12. ! •!❑! I think it was this one that crashed us. ! •!❑! On the one hand, we knew the problem was going away tomorrow. So that’s nice. ! •!❑! OTOH, a lot of people have Google set as their startup page. So that's bad. ;-)
  • 13. !❑! You can also find… ! •!❑! Most common crashes for a version, platform, etc. ! •!❑! New crashes ! !❑! Correlations ! •!❑! ferret out interactions between plugins, for example •!❑! Pretty straightforward, right? Backend is less straightforward…
  • 14. Duplicate Finder Zeus Zeus Collectors Local FS Crash Movers HBase RabbitMQ Processors PostgreSQL elasticsearch Web Front-end memcached Debug symbols on NFS pgbouncer LDAP Middleware Zeus Zeus Bugzilla Associator Automatic Emailer Bugzilla Materialized View Builders Active Daily Users Signatures Versions Explosiveness ADU Count Loader Version Scraper FTP Vertica Zeus cron jobs Zeus load balancer Crash Reporter Breakpad •!❑! Over 120 boxes, all physical. !❑! Why physical? ! •!❑! Organizational momentum ! •!❑! HBase doesn't do so well virtualized. It's very talky between nodes, so low latency is important. !–! ! •!❑! How much data? ! •!❑! "The smallest big-data project" ! •!❑! Used to be considered big. Not anymore. ! !✓! Numbers ! •!✓! ***500M FF users*** ! •!✓! ***150M ADUs. Probably more.*** ! •!✓! ***3000 crashes/minute.*** 3M/day. ! •!✓! ***A FF crash*** is 150K-20MB (hard ceiling—anything over 20MB is just an out-of-mem crash anyway and just full of corrupt garbage) ! •!✓! ***800GB*** in PG ! •!✓! ***110TB*** in HDFS. That's replicated. 40TB actual data. ! !✓! Dictum: “Never lose a crash.” We have all Firefox crashes from the very beginning. ! •!✓! One reason for this is so a developer can go into the UI and request a crash be processed, and it will be.
  • 15. Duplicate Finder Zeus Zeus Collectors Local FS Crash Movers HBase RabbitMQ Processors PostgreSQL elasticsearch Web Front-end memcached Debug symbols on NFS pgbouncer LDAP Middleware Zeus Zeus Bugzilla Associator Automatic Emailer Bugzilla Materialized View Builders Active Daily Users Signatures Versions Explosiveness ADU Count Loader Version Scraper FTP Vertica Zeus cron jobs Zeus load balancer Crash Reporter Breakpad 500M Firefox users •!❑! Over 120 boxes, all physical. !❑! Why physical? ! •!❑! Organizational momentum ! •!❑! HBase doesn't do so well virtualized. It's very talky between nodes, so low latency is important. !–! ! •!❑! How much data? ! •!❑! "The smallest big-data project" ! •!❑! Used to be considered big. Not anymore. ! !✓! Numbers ! •!✓! ***500M FF users*** ! •!✓! ***150M ADUs. Probably more.*** ! •!✓! ***3000 crashes/minute.*** 3M/day. ! •!✓! ***A FF crash*** is 150K-20MB (hard ceiling—anything over 20MB is just an out-of-mem crash anyway and just full of corrupt garbage) ! •!✓! ***800GB*** in PG ! •!✓! ***110TB*** in HDFS. That's replicated. 40TB actual data. ! !✓! Dictum: “Never lose a crash.” We have all Firefox crashes from the very beginning. ! •!✓! One reason for this is so a developer can go into the UI and request a crash be processed, and it will be.
  • 16. Duplicate Finder Zeus Zeus Collectors Local FS Crash Movers HBase RabbitMQ Processors PostgreSQL elasticsearch Web Front-end memcached Debug symbols on NFS pgbouncer LDAP Middleware Zeus Zeus Bugzilla Associator Automatic Emailer Bugzilla Materialized View Builders Active Daily Users Signatures Versions Explosiveness ADU Count Loader Version Scraper FTP Vertica Zeus cron jobs Zeus load balancer Crash Reporter Breakpad 500M Firefox users 150M daily users •!❑! Over 120 boxes, all physical. !❑! Why physical? ! •!❑! Organizational momentum ! •!❑! HBase doesn't do so well virtualized. It's very talky between nodes, so low latency is important. !–! ! •!❑! How much data? ! •!❑! "The smallest big-data project" ! •!❑! Used to be considered big. Not anymore. ! !✓! Numbers ! •!✓! ***500M FF users*** ! •!✓! ***150M ADUs. Probably more.*** ! •!✓! ***3000 crashes/minute.*** 3M/day. ! •!✓! ***A FF crash*** is 150K-20MB (hard ceiling—anything over 20MB is just an out-of-mem crash anyway and just full of corrupt garbage) ! •!✓! ***800GB*** in PG ! •!✓! ***110TB*** in HDFS. That's replicated. 40TB actual data. ! !✓! Dictum: “Never lose a crash.” We have all Firefox crashes from the very beginning. ! •!✓! One reason for this is so a developer can go into the UI and request a crash be processed, and it will be.
  • 17. Duplicate Finder Zeus Zeus Collectors Local FS Crash Movers HBase RabbitMQ Processors PostgreSQL elasticsearch Web Front-end memcached Debug symbols on NFS pgbouncer LDAP Middleware Zeus Zeus Bugzilla Associator Automatic Emailer Bugzilla Materialized View Builders Active Daily Users Signatures Versions Explosiveness ADU Count Loader Version Scraper FTP Vertica Zeus cron jobs Zeus load balancer Crash Reporter Breakpad 500M Firefox users 150M daily users 3000 crashes per minute •!❑! Over 120 boxes, all physical. !❑! Why physical? ! •!❑! Organizational momentum ! •!❑! HBase doesn't do so well virtualized. It's very talky between nodes, so low latency is important. !–! ! •!❑! How much data? ! •!❑! "The smallest big-data project" ! •!❑! Used to be considered big. Not anymore. ! !✓! Numbers ! •!✓! ***500M FF users*** ! •!✓! ***150M ADUs. Probably more.*** ! •!✓! ***3000 crashes/minute.*** 3M/day. ! •!✓! ***A FF crash*** is 150K-20MB (hard ceiling—anything over 20MB is just an out-of-mem crash anyway and just full of corrupt garbage) ! •!✓! ***800GB*** in PG ! •!✓! ***110TB*** in HDFS. That's replicated. 40TB actual data. ! !✓! Dictum: “Never lose a crash.” We have all Firefox crashes from the very beginning. ! •!✓! One reason for this is so a developer can go into the UI and request a crash be processed, and it will be.
  • 18. Duplicate Finder Zeus Zeus Collectors Local FS Crash Movers HBase RabbitMQ Processors PostgreSQL elasticsearch Web Front-end memcached Debug symbols on NFS pgbouncer LDAP Middleware Zeus Zeus Bugzilla Associator Automatic Emailer Bugzilla Materialized View Builders Active Daily Users Signatures Versions Explosiveness ADU Count Loader Version Scraper FTP Vertica Zeus cron jobs Zeus load balancer Crash Reporter Breakpad 500M Firefox users 150M daily users 3000 crashes per minute 150KB-20MB per crash •!❑! Over 120 boxes, all physical. !❑! Why physical? ! •!❑! Organizational momentum ! •!❑! HBase doesn't do so well virtualized. It's very talky between nodes, so low latency is important. !–! ! •!❑! How much data? ! •!❑! "The smallest big-data project" ! •!❑! Used to be considered big. Not anymore. ! !✓! Numbers ! •!✓! ***500M FF users*** ! •!✓! ***150M ADUs. Probably more.*** ! •!✓! ***3000 crashes/minute.*** 3M/day. ! •!✓! ***A FF crash*** is 150K-20MB (hard ceiling—anything over 20MB is just an out-of-mem crash anyway and just full of corrupt garbage) ! •!✓! ***800GB*** in PG ! •!✓! ***110TB*** in HDFS. That's replicated. 40TB actual data. ! !✓! Dictum: “Never lose a crash.” We have all Firefox crashes from the very beginning. ! •!✓! One reason for this is so a developer can go into the UI and request a crash be processed, and it will be.
  • 19. Duplicate Finder Zeus Zeus Collectors Local FS Crash Movers HBase RabbitMQ Processors PostgreSQL elasticsearch Web Front-end memcached Debug symbols on NFS pgbouncer LDAP Middleware Zeus Zeus Bugzilla Associator Automatic Emailer Bugzilla Materialized View Builders Active Daily Users Signatures Versions Explosiveness ADU Count Loader Version Scraper FTP Vertica Zeus cron jobs Zeus load balancer Crash Reporter Breakpad 500M Firefox users 150M daily users 3000 crashes per minute 150KB-20MB per crash 800GB in PostgreSQL •!❑! Over 120 boxes, all physical. !❑! Why physical? ! •!❑! Organizational momentum ! •!❑! HBase doesn't do so well virtualized. It's very talky between nodes, so low latency is important. !–! ! •!❑! How much data? ! •!❑! "The smallest big-data project" ! •!❑! Used to be considered big. Not anymore. ! !✓! Numbers ! •!✓! ***500M FF users*** ! •!✓! ***150M ADUs. Probably more.*** ! •!✓! ***3000 crashes/minute.*** 3M/day. ! •!✓! ***A FF crash*** is 150K-20MB (hard ceiling—anything over 20MB is just an out-of-mem crash anyway and just full of corrupt garbage) ! •!✓! ***800GB*** in PG ! •!✓! ***110TB*** in HDFS. That's replicated. 40TB actual data. ! !✓! Dictum: “Never lose a crash.” We have all Firefox crashes from the very beginning. ! •!✓! One reason for this is so a developer can go into the UI and request a crash be processed, and it will be.
  • 20. Duplicate Finder Zeus Zeus Collectors Local FS Crash Movers HBase RabbitMQ Processors PostgreSQL elasticsearch Web Front-end memcached Debug symbols on NFS pgbouncer LDAP Middleware Zeus Zeus Bugzilla Associator Automatic Emailer Bugzilla Materialized View Builders Active Daily Users Signatures Versions Explosiveness ADU Count Loader Version Scraper FTP Vertica Zeus cron jobs Zeus load balancer Crash Reporter Breakpad 500M Firefox users 150M daily users 3000 crashes per minute 150KB-20MB per crash 800GB in PostgreSQL 40TB in HDFS, 110TB replicated •!❑! Over 120 boxes, all physical. !❑! Why physical? ! •!❑! Organizational momentum ! •!❑! HBase doesn't do so well virtualized. It's very talky between nodes, so low latency is important. !–! ! •!❑! How much data? ! •!❑! "The smallest big-data project" ! •!❑! Used to be considered big. Not anymore. ! !✓! Numbers ! •!✓! ***500M FF users*** ! •!✓! ***150M ADUs. Probably more.*** ! •!✓! ***3000 crashes/minute.*** 3M/day. ! •!✓! ***A FF crash*** is 150K-20MB (hard ceiling—anything over 20MB is just an out-of-mem crash anyway and just full of corrupt garbage) ! •!✓! ***800GB*** in PG ! •!✓! ***110TB*** in HDFS. That's replicated. 40TB actual data. ! !✓! Dictum: “Never lose a crash.” We have all Firefox crashes from the very beginning. ! •!✓! One reason for this is so a developer can go into the UI and request a crash be processed, and it will be.
  • 21. Duplicate Finder Zeus Zeus Collectors Local FS Crash Movers HBase RabbitMQ Processors PostgreSQL elasticsearch Web Front-end memcached Debug symbols on NFS pgbouncer LDAP Middleware Zeus Zeus Bugzilla Associator Automatic Emailer Bugzilla Materialized View Builders Active Daily Users Signatures Versions Explosiveness ADU Count Loader Version Scraper FTP Vertica Zeus cron jobs Zeus load balancer Crash Reporter Breakpad It all starts ***down here***, with FF. But even that’s made up of multiple moving parts.
  • 22. Duplicate Finder Zeus Zeus Collectors Local FS Crash Movers HBase RabbitMQ Processors PostgreSQL elasticsearch Web Front-end memcached Debug symbols on NFS pgbouncer LDAP Middleware Zeus Zeus Bugzilla Associator Automatic Emailer Bugzilla Materialized View Builders Active Daily Users Signatures Versions Explosiveness ADU Count Loader Version Scraper FTP Vertica Zeus cron jobs Zeus load balancer Crash Reporter Breakpad It all starts ***down here***, with FF. But even that’s made up of multiple moving parts.
  • 23. Collectors Materialized View Builders Active Daily Users Signatures Versions Explosiveness cron jobs Zeus load balancer Crash Reporter Breakpad These ***first 3*** pieces all on client side ***First 2*** in FF process ! ❑! Breakpad ! •!❑! Used by Firefox, Chrome, Google Earth, Camino, Picasa ! ! ❑! stack dump of all threads ! •!❑! opaque; doesn't even know the frame boundaries ! •!❑! a little other processor state ! •!❑! throws it to another process: ***Crash Reporter*** Why? Remember, FF has crashed. State unknown. “Crash Reporter, which is responsible for ***this little dialog***,” binary crash dump + JSON metadata → POST → collectors…
  • 24. Collectors Materialized View Builders Active Daily Users Signatures Versions Explosiveness cron jobs Zeus load balancer Crash Reporter Breakpad These ***first 3*** pieces all on client side ***First 2*** in FF process ! ❑! Breakpad ! •!❑! Used by Firefox, Chrome, Google Earth, Camino, Picasa ! ! ❑! stack dump of all threads ! •!❑! opaque; doesn't even know the frame boundaries ! •!❑! a little other processor state ! •!❑! throws it to another process: ***Crash Reporter*** Why? Remember, FF has crashed. State unknown. “Crash Reporter, which is responsible for ***this little dialog***,” binary crash dump + JSON metadata → POST → collectors…
  • 25. Collectors Materialized View Builders Active Daily Users Signatures Versions Explosiveness cron jobs Zeus load balancer Crash Reporter Breakpad These ***first 3*** pieces all on client side ***First 2*** in FF process ! ❑! Breakpad ! •!❑! Used by Firefox, Chrome, Google Earth, Camino, Picasa ! ! ❑! stack dump of all threads ! •!❑! opaque; doesn't even know the frame boundaries ! •!❑! a little other processor state ! •!❑! throws it to another process: ***Crash Reporter*** Why? Remember, FF has crashed. State unknown. “Crash Reporter, which is responsible for ***this little dialog***,” binary crash dump + JSON metadata → POST → collectors…
  • 26. Collectors Materialized View Builders Active Daily Users Signatures Versions Explosiveness cron jobs Zeus load balancer Crash Reporter Breakpad These ***first 3*** pieces all on client side ***First 2*** in FF process ! ❑! Breakpad ! •!❑! Used by Firefox, Chrome, Google Earth, Camino, Picasa ! ! ❑! stack dump of all threads ! •!❑! opaque; doesn't even know the frame boundaries ! •!❑! a little other processor state ! •!❑! throws it to another process: ***Crash Reporter*** Why? Remember, FF has crashed. State unknown. “Crash Reporter, which is responsible for ***this little dialog***,” binary crash dump + JSON metadata → POST → collectors…
  • 27. Collectors Materialized View Builders Active Daily Users Signatures Versions Explosiveness cron jobs Zeus load balancer Crash Reporter Breakpad These ***first 3*** pieces all on client side ***First 2*** in FF process ! ❑! Breakpad ! •!❑! Used by Firefox, Chrome, Google Earth, Camino, Picasa ! ! ❑! stack dump of all threads ! •!❑! opaque; doesn't even know the frame boundaries ! •!❑! a little other processor state ! •!❑! throws it to another process: ***Crash Reporter*** Why? Remember, FF has crashed. State unknown. “Crash Reporter, which is responsible for ***this little dialog***,” binary crash dump + JSON metadata → POST → collectors…
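To make that hand-off concrete, here is a minimal sketch of what the submission step amounts to on the wire: a multipart POST carrying the binary minidump plus the metadata as ordinary form fields. The URL and field names below are assumptions for illustration, not necessarily the Crash Reporter's actual ones.

    # Hypothetical crash submission: binary dump + metadata in one multipart POST.
    # The endpoint URL and field names are assumptions, not the real ones.
    import requests

    def submit_crash(minidump_path, metadata,
                     collector_url="https://crash-reports.example.com/submit"):
        with open(minidump_path, "rb") as dump:
            files = {"upload_file_minidump": ("minidump.dmp", dump, "application/octet-stream")}
            response = requests.post(collector_url, data=metadata, files=files, timeout=30)
        response.raise_for_status()
        return response.text  # the collector's reply typically contains a crash ID

    if __name__ == "__main__":
        print(submit_crash("minidump.dmp", {"ProductName": "Firefox", "Version": "25.0"}))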
  • 28. Collectors Materialized View Builders Active Daily Users Signatures Versions Explosiveness cron jobs Zeus load balancer Crash Reporter Breakpad where it really enters Socorro***…***
  • 29. Duplicate Finder Collectors Local FS Crash Movers HBase RabbitMQ Processors Postgre elasticse Debug symbols on NFS pgbou Zeus Materialized View Builders Active Daily Users Signatures Versions Explosiveness Version Scraper FTP Zeu cron jobs Zeus load balancer Crash Reporter Breakpad Collectors: super simple Writes crashes to ***local disk…***
  • 30. Duplicate Finder Collectors Local FS Crash Movers HBase RabbitMQ Processors Postgre elasticse Debug symbols on NFS pgbou Zeus Materialized View Builders Active Daily Users Signatures Versions Explosiveness Version Scraper FTP Zeu cron jobs Zeus load balancer Crash Reporter Breakpad Then, another process on same box
  • 31. Duplicate Finder Collectors Local FS Crash Movers HBase RabbitMQ Processors Postgre elasticse Debug symbols on NFS pgbou Zeus Materialized View Builders Active Daily Users Signatures Versions Explosiveness Version Scraper FTP Zeu cron jobs Zeus load balancer Crash Reporter Breakpad Crash Movers picks up crashes off local disk → 2 places
  • 32. Duplicate Finder Collectors Local FS Crash Movers HBase RabbitMQ Processors Postgre elasticse Debug symbols on NFS pgbou Zeus Materialized View Builders Active Daily Users Signatures Versions Explosiveness Version Scraper FTP Zeu cron jobs Zeus load balancer Crash Reporter Breakpad 1st: → HBase. HBase is primary store for crashes. 70 nodes At the same time***…***
  • 33. Duplicate Finder Collectors Local FS Crash Movers HBase RabbitMQ Processors Postgre elasticse Debug symbols on NFS pgbou Zeus Materialized View Builders Active Daily Users Signatures Versions Explosiveness Version Scraper FTP Zeu cron jobs Zeus load balancer Crash Reporter Breakpad IDs → Rabbit ! ❑! Soft realtime: priority and normal queues ! •!❑! Priority: process within 60 secs
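In rough strokes, a crash mover is just a loop: read each dump/metadata pair off the local spool, write it into the primary store, and publish the crash ID to a queue so a processor will notice it. The sketch below is an illustration under assumed names; the spool layout, queue names, and the save_to_hbase stand-in are invented, and pika is used here for RabbitMQ.

    # Simplified crash-mover loop: local spool -> primary store + message queue.
    # Spool layout, queue names, and save_to_hbase are illustrative assumptions.
    import json, os, pika

    SPOOL = "/var/spool/socorro"

    def save_to_hbase(crash_id, raw_dump, metadata):
        """Stand-in for the real HBase write (e.g. through a Thrift gateway)."""
        pass

    def move_pending_crashes(channel):
        for name in os.listdir(SPOOL):
            if not name.endswith(".json"):
                continue
            crash_id = name[:-len(".json")]
            with open(os.path.join(SPOOL, crash_id + ".json")) as f:
                metadata = json.load(f)
            with open(os.path.join(SPOOL, crash_id + ".dump"), "rb") as f:
                raw_dump = f.read()
            save_to_hbase(crash_id, raw_dump, metadata)            # 1st: primary store
            queue = "priority" if metadata.get("urgent") else "normal"
            channel.basic_publish(exchange="", routing_key=queue,  # 2nd: tell a processor
                                  body=crash_id)
            os.remove(os.path.join(SPOOL, crash_id + ".json"))     # clear the spool entry
            os.remove(os.path.join(SPOOL, crash_id + ".dump"))

    if __name__ == "__main__":
        connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = connection.channel()
        for q in ("priority", "normal"):
            channel.queue_declare(queue=q, durable=True)
        move_pending_crashes(channel)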
  • 34. Duplicate Finder Collectors Local FS Crash Movers HBase RabbitMQ Processors Postgre elasticse Debug symbols on NFS pgbou Zeus Materialized View Builders Active Daily Users Signatures Versions Explosiveness Version Scraper FTP Zeu cron jobs Zeus load balancer Crash Reporter Breakpad !❑! Processors ! •!❑! Where the real action happens ! •!❑! To process a crash means to do what's necessary to make it visible in the web UI. ! •!❑! ID from Rabbit ! •!❑! binary → debug ! •!❑! signature generation ! •!❑! Then it puts it into buckets and adds it to PG and ES. First, PG.
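Boiled down, a processor does four things: take an ID off the queue, turn the opaque binary dump into symbolized frames (Breakpad ships a minidump_stackwalk tool for this), derive a signature from the interesting frames, and write the result where the web UI can see it. The sketch below is an illustration only; the helper names, table layout, and the Elasticsearch URL are assumptions rather than Socorro's real internals.

    # Sketch of processing one crash: symbolicate -> signature -> PostgreSQL + ES.
    # Helper names, schema, and URLs are assumptions for illustration.
    import subprocess, requests

    def symbolicate(dump_path, symbols_dir="/mnt/symbols"):
        """Run Breakpad's minidump_stackwalk to turn the raw dump into readable frames."""
        result = subprocess.run(["minidump_stackwalk", dump_path, symbols_dir],
                                capture_output=True, text=True, check=True)
        return result.stdout.splitlines()

    def generate_signature(frames, skip_prefixes=("libc", "kernel32")):
        """Keep only 'interesting' frames; skipping noisy ones widens the hash buckets."""
        interesting = [f for f in frames if not f.startswith(skip_prefixes)]
        return " | ".join(interesting[:5])

    def process_crash(crash_id, dump_path, pg_cursor):
        frames = symbolicate(dump_path)
        signature = generate_signature(frames)
        pg_cursor.execute("INSERT INTO reports (uuid, signature) VALUES (%s, %s)",
                          (crash_id, signature))
        requests.put("http://localhost:9200/crashes/_doc/" + crash_id,
                     json={"signature": signature, "frames": frames}, timeout=10)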
  • 35. Zeus Ze Processors PostgreSQL elasticsearch Web Front-end memcached Debug symbols on NFS pgbouncer LDAP Middleware Zeus Bugzilla Associator Automatic Emailer Bugzilla alized ew ders Users res ns ness ADU Count Loader Version Scraper FTP Vertica Zeus !❑! Postgres ! !❑! Our main interactive datastore ! •!❑! It's what the web app and most batch jobs talk to. ! !❑! Stores (cut?) ! •!❑! unique crash signatures ! •!❑! numbers of crashes, bucketed by signature ! !❑! other aggregations of crash counts on various facets ! •!❑! to make reporting fast ! •!❑! (see slide 32 of breakpad.socorro.master.key.) ! !❑! In there for a couple reasons ! •!❑! Prompt, reliable answers to queries ! !❑! Ref integ ! •!❑! Stores unique crash signatures ! •!❑! And their relationships to versions, tickets, & so on ! •!❑! PHP & Django easy to query from Now, let’s turn around & talk about ES, which operates in parallel.
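Concretely, the aggregations live in rollup tables so the web app never has to scan raw reports at query time. A hypothetical nightly rollup, with invented table and column names, might look like this:

    # Hypothetical daily rollup: raw reports -> per-signature, per-version crash counts.
    # Table and column names are invented for this illustration.
    import psycopg2

    ROLLUP_SQL = """
        INSERT INTO signature_counts (signature, product, version, report_date, crash_count)
        SELECT signature, product, version, date_processed::date, count(*)
          FROM reports
         WHERE date_processed >= %s AND date_processed < %s
         GROUP BY signature, product, version, date_processed::date
    """

    def build_daily_rollup(dsn, day_start, day_end):
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(ROLLUP_SQL, (day_start, day_end))
        # The web app then reads signature_counts instead of scanning raw reports.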
  • 36. Zeus Ze Processors PostgreSQL elasticsearch Web Front-end memcached Debug symbols on NFS pgbouncer LDAP Middleware Zeus Bugzilla Associator Automatic Emailer Bugzilla alized ew ders Users res ns ness ADU Count Loader Version Scraper FTP Vertica Zeus !❑! Elasticsearch ! •!❑! 90-day rolling window ! •!❑! Faceting ! !❑! NKOTB (new kid on the block) •! ❑!Extremely flexible text analysis. ! ! ! •! ❑! Though geared toward natural language, we may be able to persuade it to take apart C++ call signatures & let us mine those in meaningful ways. ! !❑! May someday eat some of HBase or Postgres's lunch ! !❑! It scales out like HBase & can even execute arbitrary scripts near the data, collating & returning data through a master node. ! •!❑! Maybe not the flexibility of full map-reduce ! •!❑! Filter caching ! •!❑! Supports indices itself
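A faceted question such as "top signatures over the last week, broken down by platform" maps naturally onto an Elasticsearch aggregation. The index and field names below are assumptions; only the shape of the query is the point.

    # Sketch of a faceted query: top signatures in the last 7 days, split by platform.
    # Index and field names are assumptions for illustration.
    import requests

    query = {
        "size": 0,
        "query": {"range": {"date_processed": {"gte": "now-7d/d"}}},
        "aggs": {
            "top_signatures": {
                "terms": {"field": "signature", "size": 20},
                "aggs": {"by_platform": {"terms": {"field": "platform"}}},
            }
        },
    }

    resp = requests.post("http://localhost:9200/crashes/_search", json=query, timeout=10)
    for bucket in resp.json()["aggregations"]["top_signatures"]["buckets"]:
        print(bucket["key"], bucket["doc_count"])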
  • 37. Duplicate Finder Zeus Zeus HBase RabbitMQ Processors PostgreSQL elasticsearch Web Front-end memcached Debug symbols on NFS pgbouncer LDAP Middleware Zeus Zeus Bugzilla Associator Automatic Emailer Bugzilla Materialized View Builders Active Daily Users Signatures Versions Explosiveness ADU Count Loader Version Scraper FTP Vertica Zeus ron obs !❑! Web services (“middleware”) ! •!❑! At end of this story: web application ! •!❑! But between it and data is REST middleware ! !❑! Why? ! •!❑! was in PHP and we didn't want to reimplement model logic in 2 languages ! •!❑! We change datastores. ! •!❑! We move data around.
  • 38. Duplicate Finder Zeus Zeus HBase RabbitMQ Processors PostgreSQL elasticsearch Web Front-end memcached Debug symbols on NFS pgbouncer LDAP Middleware Zeus Zeus Bugzilla Associator Automatic Emailer Bugzilla Materialized View Builders Active Daily Users Signatures Versions Explosiveness ADU Count Loader Version Scraper FTP Vertica Zeus ron obs !✓! Web App ! •!✓! Django ! •!✓! Each runs memcached
  • 39. Duplicate Finder Zeus Zeus Collectors Local FS Crash Movers HBase RabbitMQ Processors PostgreSQL elasticsearch Web Front-end memcached Debug symbols on NFS pgbouncer LDAP Middleware Zeus Zeus Bugzilla Associator Automatic Emailer Bugzilla Materialized View Builders Active Daily Users Signatures Versions Explosiveness ADU Count Loader Version Scraper FTP Vertica Zeus cron jobs Zeus load balancer Crash Reporter Breakpad And that concludes our big-picture tour of Socorro! Now, as years have gone by and the system has grown in scope and size, interesting patterns have emerged.
  • 40. ! Big Patterns tooling was clearly missing. standard practices weren’t good enough. I’m going to call out some of these emergent needs and show you our solutions. Maybe you’ll even find some of our tools useful. The first…
  • 41. ! Big Storage Every Big Data system has to put everything somewhere. The solutions are well-established, and the amount of data you can deal with in a commoditized fashion rises every year, but sharding and replication are expensive. We realized that, by applying some statistics, we could ***shrink the amount of data***.
  • 42. ! Big Storage ***sampling*** per product all FFOS crashes don’t wanna lose interesting rare events (due to sampling) ***targeting*** take anything with a comment •!❑! Our statisticians have told us all kinds of useful things about the shape of our data. For instance, the rules that select interesting events don't throw off our OS or version statistics. ***rarification*** throw away uninteresting parts of stack frames !❑! Skiplist rules get uninteresting parts of the stack out of the data, to reduce noise. 2 kinds. ! •!❑! Sentinel frames to jump TO ! •!❑! Frames that should be ignored An important part of making our hash buckets wider reducing # of unique crash signatures With these 3 techniques, we cut down the amount of data we need to handle in the later stages of our pipeline. Sure, we still have to keep everything in HBase, but we don’t run live queries against that, so it just means buying more HDs. But processors, rabbit, PG, ES, memcache, crons—all have lighter load
  • 43. ! Big Storage Sampling ***sampling*** per product all FFOS crashes don’t wanna lose interesting rare events (due to sampling) ***targeting*** take anything with a comment •!❑! Our statisticians have told us all kinds of useful things about the shape of our data. For instance, the rules that select interesting events don't throw off our OS or version statistics. ***rarification*** throw away uninteresting parts of stack frames !❑! Skiplist rules get uninteresting parts of the stack out of the data, to reduce noise. 2 kinds. ! •!❑! Sentinel frames to jump TO ! •!❑! Frames that should be ignored An important part of making our hash buckets wider reducing # of unique crash signatures With these 3 techniques, we cut down the amount of data we need to handle in the later stages of our pipeline. Sure, we still have to keep everything in HBase, but we don’t run live queries against that, so it just means buying more HDs. But processors, rabbit, PG, ES, memcache, crons—all have lighter load
  • 44. ! Big Storage Sampling Targeting ***sampling*** per product all FFOS crashes don’t wanna lose interesting rare events (due to sampling) ***targeting*** take anything with a comment •!❑! Our statisticians have told us all kinds of useful things about the shape of our data. For instance, the rules that select interesting events don't throw off our OS or version statistics. ***rarification*** throw away uninteresting parts of stack frames !❑! Skiplist rules get uninteresting parts of the stack out of the data, to reduce noise. 2 kinds. ! •!❑! Sentinel frames to jump TO ! •!❑! Frames that should be ignored An important part of making our hash buckets wider reducing # of unique crash signatures With these 3 techniques, we cut down the amount of data we need to handle in the later stages of our pipeline. Sure, we still have to keep everything in HBase, but we don’t run live queries against that, so it just means buying more HDs. But processors, rabbit, PG, ES, memcache, crons—all have lighter load
  • 45. ! Big Storage Sampling Targeting Rarification ***sampling*** per product all FFOS crashes don’t wanna lose interesting rare events (due to sampling) ***targeting*** take anything with a comment •!❑! Our statisticians have told us all kinds of useful things about the shape of our data. For instance, the rules that select interesting events don't throw off our OS or version statistics. ***rarification*** throw away uninteresting parts of stack frames !❑! Skiplist rules get uninteresting parts of the stack out of the data, to reduce noise. 2 kinds. ! •!❑! Sentinel frames to jump TO ! •!❑! Frames that should be ignored An important part of making our hash buckets wider reducing # of unique crash signatures With these 3 techniques, we cut down the amount of data we need to handle in the later stages of our pipeline. Sure, we still have to keep everything in HBase, but we don’t run live queries against that, so it just means buying more HDs. But processors, rabbit, PG, ES, memcache, crons—all have lighter load
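Putting the three techniques together, the collection-time decision can be pictured as a short rule chain: targeted crashes always get through, and everything else is sampled at some rate. This is only a minimal sketch; the product names, rates, and field names are assumptions, and the real throttling rules are richer.

    # Minimal sketch of collection-time throttling: targeting rules first, then sampling.
    # Products, rates, and field names are assumptions for illustration.
    import random

    THROTTLE_RULES = [
        # (predicate, probability of accepting the crash for processing)
        (lambda crash: crash.get("ProductName") == "FirefoxOS", 1.00),  # keep every FxOS crash
        (lambda crash: bool(crash.get("Comments")), 1.00),              # keep anything with a comment
        (lambda crash: True, 0.10),                                     # sample the remainder at 10%
    ]

    def should_process(crash_metadata):
        for predicate, probability in THROTTLE_RULES:
            if predicate(crash_metadata):
                return random.random() < probability
        return False

    # Rarification happens later, at signature time: skiplist rules drop the
    # uninteresting frames so more crashes hash into the same, wider bucket.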
  • 46. ! Big Systems •!❑! Big Data systems tend to be complicated systems. •!❑! Diverse parts: not just one big 500-node HBase cluster and done !❑! Example: 6 data stores: ! •!❑! FS ! •!❑! PG ! •!❑! ES ! •!❑! HBase ! •!❑! memcache ! •!❑! RabbitMQ ! •!❑! This is typical of architectures now. Gone are the days of 1 datastore, 1 representation. ! •!❑! 18 months ago, was hearing jokes about data mullet: relational in the front, NoSQL in the back. ! •!❑! data dreadlocks. It's all over the place. The kinds of problems you can have in these systems really tough to track down
  • 47. Hadoops! A tale of Big Failure crash every 50 hours ***Hadoop’s cleverness*** with TCP connections TCP stack bugs in Linux lying NICs OS buffers fill up with unclosed connections & crash •!❑! So we're very very cautious about ***the equipment*** we use. Remember that hardware is a nontrivial part of your system ! ❑! When you have a problem, it can be hard to work out exactly what's gone wrong. ! •!❑! Can take time to get everybody together must keep receiving crashes. ***Boxes & springs***
  • 48. Hadoops! A tale of Big Failure Complex interactions crash every 50 hours ***Hadoop’s cleverness*** with TCP connections TCP stack bugs in Linux lying NICs OS buffers fill up with unclosed connections & crash •!❑! So we're very very cautious about ***the equipment*** we use. Remember that hardware is a nontrivial part of your system ! ❑! When you have a problem, it can be hard to work out exactly what's gone wrong. ! •!❑! Can take time to get everybody together must keep receiving crashes. ***Boxes & springs***
  • 49. Hadoops! A tale of Big Failure Complex interactions Hardware matters. crash every 50 hours ***Hadoop’s cleverness*** with TCP connections TCP stack bugs in Linux lying NICs OS buffers fill up with unclosed connections & crash •!❑! So we're very very cautious about ***the equipment*** we use. Remember that hardware is a nontrivial part of your system ! ❑! When you have a problem, it can be hard to work out exactly what's gone wrong. ! •!❑! Can take time to get everybody together must keep receiving crashes. ***Boxes & springs***
  • 50. Hadoops! A tale of Big Failure Complex interactions Hardware matters. Design for failure. crash every 50 hours ***Hadoop’s cleverness*** with TCP connections TCP stack bugs in Linux lying NICs OS buffers fill up with unclosed connections & crash •!❑! So we're very very cautious about ***the equipment*** we use. Remember that hardware is a nontrivial part of your system ! ❑! When you have a problem, it can be hard to work out exactly what's gone wrong. ! •!❑! Can take time to get everybody together must keep receiving crashes. ***Boxes & springs***
  • 51. Duplicate Finder Zeus Zeus Collectors Local FS Crash Movers HBase RabbitMQ Processors PostgreSQL elasticsearch Web Front-end memcached Debug symbols on NFS pgbouncer LDAP Middleware Zeus Zeus Bugzilla Associator Automatic Emailer Bugzilla Materialized View Builders Active Daily Users Signatures Versions Explosiveness ADU Count Loader Version Scraper FTP Vertica Zeus cron jobs Zeus load balancer Crash Reporter Breakpad The most important: ***this Local FS***
  • 52. Duplicate Finder Zeus Zeus Collectors Local FS Crash Movers HBase RabbitMQ Processors PostgreSQL elasticsearch Web Front-end memcached Debug symbols on NFS pgbouncer LDAP Middleware Zeus Zeus Bugzilla Associator Automatic Emailer Bugzilla Materialized View Builders Active Daily Users Signatures Versions Explosiveness ADU Count Loader Version Scraper FTP Vertica Zeus cron jobs Zeus load balancer Crash Reporter Breakpad The most important: ***this Local FS***
  • 53. Duplicate Finder Collectors Local FS Crash Movers HBase RabbitMQ Processors PostgreSQL elasticsearch Debug symbols on NFS pgbouncer Materialized View Builders Active Daily Users Signatures Versions Explosiveness ADU Count Loader Version Scraper FTP Vertica Zeus cron jobs Zeus load balancer Crash Reporter Breakpad Everything else can fail; the local FS gives us 3 days of runway, which has saved us several times. Yours may not look like this, but •!❑! You could imagine a system being able to serve just out of cache if the datastore went away. •!❑! Or operate in read-only mode if writes became unavailable. ! ! ! ! SUMO One thing from this diagram we didn’t talk about much yet was ***cron jobs***.
  • 54. ! Big Batching •!❑! Mozilla is a large project with a long legacy, and Socorro interfaces with a lot of other systems. ***A lot of this occurs via batch jobs.***
  • 55. Duplicate Finder Zeus Zeus Collectors Local FS Crash Movers HBase RabbitMQ Processors PostgreSQL elasticsearch Web Front-end memcached Debug symbols on NFS pgbouncer LDAP Middleware Zeus Zeus Bugzilla Associator Automatic Emailer Bugzilla Materialized View Builders Active Daily Users Signatures Versions Explosiveness ADU Count Loader Version Scraper FTP Vertica Zeus cron jobs Zeus load balancer Crash Reporter Breakpad
  • 56. Duplicate Finder MQ Processors PostgreSQL pgbouncer Middleware Bugzilla Associator Automatic Emailer Bugzilla Materialized View Builders Active Daily Users Signatures Versions Explosiveness ADU Count Loader Version Scraper FTP Vertica Zeus matviews version scraper, 1x/day bugzilla •!❑! Send advice back to users, like in the case where we see they have malware. ADUs are the denominator for every metric, and the ADU import fails a lot; the Metrics team’s systems are unreliable, and everything that depends on the import fails with it.
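To see why a broken ADU import ripples so far, note that nearly every downstream number is a rate with ADUs in the denominator; a trivial illustration:

    # Crashes per 100 active daily users: the shape of most downstream metrics.
    # Without the day's ADU figure, the rate (and everything built on it) is blocked.
    def crash_rate_per_100_adu(crash_count, adu_count):
        if not adu_count:
            raise ValueError("no ADU figure for this day; downstream reports must wait")
        return 100.0 * crash_count / adu_count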
  • 57. In fact, you can look at a lot of our periodic tasks as a dependency tree. One thing upstream fails***…***
  • 58. …and downstream everything else fails. replaced cron w/crontabber Instead of blindly running jobs whose prerequisites aren’t filled, runs the ***parent*** until it succeeds, then runs ***children***. Diagrams to visualize state of sys Too error-prone by hand. ***Then*** we thought: why not have crontabber draw them for us?
  • 59. …and downstream everything else fails. replaced cron w/crontabber Instead of blindly running jobs whose prerequisites aren’t filled, runs the ***parent*** until it succeeds, then runs ***children***. Diagrams to visualize state of sys Too error-prone by hand. ***Then*** we thought: why not have crontabber draw them for us?
  • 60. …and downstream everything else fails. replaced cron w/crontabber Instead of blindly running jobs whose prerequisites aren’t filled, runs the ***parent*** until it succeeds, then runs ***children***. Diagrams to visualize state of sys Too error-prone by hand. ***Then*** we thought: why not have crontabber draw them for us?
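The core idea, reduced to a toy: each job names its parents, a job only runs once its parents have succeeded, and a failure simply leaves the subtree untouched until the next tick instead of running children against missing data. The job names and in-memory state below are invented; the real crontabber adds persistence, retry frequencies, and backfill.

    # Toy dependency-aware scheduler in the spirit of crontabber.
    # Job names and the in-memory state are illustrative only.
    def load_adu_counts():       print("loading ADU counts")
    def compute_explosiveness(): print("computing explosiveness")
    def send_report():           print("sending report")

    succeeded = set()

    JOBS = {
        "adu-count-loader": {"depends_on": [], "run": load_adu_counts},
        "explosiveness":    {"depends_on": ["adu-count-loader"], "run": compute_explosiveness},
        "email-report":     {"depends_on": ["explosiveness"], "run": send_report},
    }

    def run_due_jobs():
        """Call this on every tick (e.g. once a minute from a single real cron entry)."""
        for name, job in JOBS.items():
            if name in succeeded:
                continue
            if any(dep not in succeeded for dep in job["depends_on"]):
                continue  # a parent hasn't succeeded yet; try again next tick
            try:
                job["run"]()
                succeeded.add(name)
            except Exception:
                pass      # leave this job (and its children) for a later tick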
  • 61.
  • 62.
  • 63.
  • 64. SVGs are really neat. can wiggle if unclear And then break down specifics into a ***table…***
  • 65. One job at a time atm cuz “eek matviews perf”, but a great contribution would be some kind of shared locks or thresholds for multiple. But you know, right now, it’s ***good enough…***
  • 66. ! Big Deal And it’s surprising how often that happens. Oftentimes, your makeshift solutions end up being good enough to do the job.
  • 67. Duplicate Finder Zeus Zeus Collectors Local FS Crash Movers HBase RabbitMQ Processors PostgreSQL elasticsearch Web Front-end memcached Debug symbols on NFS pgbouncer LDAP Middleware Zeus Zeus Bugzilla Associator Automatic Emailer Bugzilla Materialized View Builders Active Daily Users Signatures Versions Explosiveness ADU Count Loader Version Scraper FTP Vertica Zeus cron jobs Zeus load balancer Crash Reporter Breakpad ***Slapdash, hacky queue (PG)*** polls HBase → PG polls PG → processors ***Local FS buffer*** was a temporary fix when we had reliability problems with HBase. ***I could tell*** you “don’t be afraid of temporary hacks”. But I think that’s a healthy fear to have. Or perhaps my message should be: do a good job on your temporary solutions, because they’ll probably be around awhile.
  • 68. Duplicate Finder Zeus Zeus Collectors Local FS Crash Movers HBase RabbitMQ Processors PostgreSQL elasticsearch Web Front-end memcached Debug symbols on NFS pgbouncer LDAP Middleware Zeus Zeus Bugzilla Associator Automatic Emailer Bugzilla Materialized View Builders Active Daily Users Signatures Versions Explosiveness ADU Count Loader Version Scraper FTP Vertica Zeus cron jobs Zeus load balancer Crash Reporter Breakpad ***Slapdash, hacky queue (PG)*** polls HBase → PG polls PG → processors ***Local FS buffer*** was a temporary fix when we had reliability problems with HBase. ***I could tell*** you “don’t be afraid of temporary hacks”. But I think that’s a healthy fear to have. Or perhaps my message should be: do a good job on your temporary solutions, because they’ll probably be around awhile.
  • 69. Duplicate Finder Zeus Zeus Collectors Local FS Crash Movers HBase RabbitMQ Processors PostgreSQL elasticsearch Web Front-end memcached Debug symbols on NFS pgbouncer LDAP Middleware Zeus Zeus Bugzilla Associator Automatic Emailer Bugzilla Materialized View Builders Active Daily Users Signatures Versions Explosiveness ADU Count Loader Version Scraper FTP Vertica Zeus cron jobs Zeus load balancer Crash Reporter Breakpad ***Slapdash, hacky queue (PG)*** polls HBase → PG polls PG → processors ***Local FS buffer*** was a temporary fix when we had reliability problems with HBase. ***I could tell*** you “don’t be afraid of temporary hacks”. But I think that’s a healthy fear to have. Or perhaps my message should be: do a good job on your temporary solutions, because they’ll probably be around awhile.
  • 70. Duplicate Finder Zeus Zeus Collectors Local FS Crash Movers HBase RabbitMQ Processors PostgreSQL elasticsearch Web Front-end memcached Debug symbols on NFS pgbouncer LDAP Middleware Zeus Zeus Bugzilla Associator Automatic Emailer Bugzilla Materialized View Builders Active Daily Users Signatures Versions Explosiveness ADU Count Loader Version Scraper FTP Vertica Zeus cron jobs Zeus load balancer Crash Reporter Breakpad ***Slapdash, hacky queue (PG)*** polls HBase → PG polls PG → processors ***Local FS buffer*** was a temporary fix when we had reliability problems with HBase. ***I could tell*** you “don’t be afraid of temporary hacks”. But I think that’s a healthy fear to have. Or perhaps my message should be: do a good job on your temporary solutions, because they’ll probably be around awhile.
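For flavor, a Postgres table doing duty as a work queue needs very little machinery: one process inserts IDs, another polls and claims them. The schema and claim strategy here are assumptions; the point is how small a "good enough" queue can be.

    # Sketch of a Postgres table used as a makeshift work queue. Schema is illustrative.
    import time
    import psycopg2

    CLAIM_SQL = """
        UPDATE crash_queue
           SET status = 'processing'
         WHERE uuid = (SELECT uuid FROM crash_queue
                        WHERE status = 'pending'
                        ORDER BY queued_at
                        LIMIT 1
                        FOR UPDATE)
     RETURNING uuid
    """

    def poll_forever(dsn, handle_crash, idle_sleep=1.0):
        conn = psycopg2.connect(dsn)
        while True:
            with conn, conn.cursor() as cur:   # each claim is its own transaction
                cur.execute(CLAIM_SQL)
                row = cur.fetchone()
            if row:
                handle_crash(row[0])
            else:
                time.sleep(idle_sleep)         # nothing pending; back off briefly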
  • 71. The definition of “big” (what you can hook up to one computer, or fit on one desk) changes every year. The fact that I’m wearing nearly 100GB would be unimaginable to the operator of a punch card duplicator from only 50 years ago. But the patterns that come out of large systems remain. Duplicate cards: why? To facet 2 ways in parallel. While you may need to generalize a bit, I have no doubt the techniques you learn today and tomorrow will serve you well into the future.