4. Socorro
Very Large Array at Socorro, New Mexico, USA. Photo taken by Hajor, 08.Aug.2004. Released under CC BY-SA
and/or GFDL. Source: http://en.wikipedia.org/wiki/File:USA.NM.VeryLargeArray.02.jpg
9. "Socorro has a lot of moving parts"
...
"I prefer to think of them as dancing parts"
11. Lifetime of a crash
• Raw crash submitted by user via POST (metadata JSON + minidump)
• Collected to disk by the collector (a web.py WSGI app)
• Moved to HBase by the crashmover
• Noticed in the queue by the monitor and assigned for processing
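The first two steps above (POST submission, collection to disk) can be sketched as a small helper. This is only an illustration: the real collector is a web.py WSGI app, and the crash-ID scheme and file layout below are invented.

```python
import json
import os
import uuid


def save_raw_crash(metadata, minidump_bytes, storage_dir):
    """Persist one submitted crash (metadata JSON + minidump) to disk.

    Illustrative sketch only; the real Socorro collector uses its own
    crash-ID scheme and directory layout.
    """
    crash_id = uuid.uuid4().hex  # hypothetical ID scheme
    # The metadata and the minidump are stored side by side, keyed by ID.
    with open(os.path.join(storage_dir, crash_id + ".json"), "w") as f:
        json.dump(metadata, f)
    with open(os.path.join(storage_dir, crash_id + ".dump"), "wb") as f:
        f.write(minidump_bytes)
    return crash_id
```

The crashmover would then pick these pairs up from disk and move them into HBase.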
12. Processing
• Processor spins off minidump_stackwalk (MDSW)
• MDSW reunites the raw crash with symbols to generate a stack
• Processor generates a signature and pulls out other data
• Processor writes the processed crash back to HBase and to PostgreSQL
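The signature step can be sketched as below, assuming MDSW's pipe-separated machine output (`thread|frame|module|function|file|line|offset`). Real Socorro additionally normalizes frames and applies skip lists and prefix rules, and treating thread 0 as the crashing thread is a simplification made here.

```python
def generate_signature(stackwalk_output, crashing_thread="0", max_frames=5):
    """Build a crash signature from minidump_stackwalk machine output.

    Simplified sketch: the real processor applies skip-list and prefix
    rules before joining frames.
    """
    frames = []
    for line in stackwalk_output.splitlines():
        parts = line.split("|")
        # Frame lines look like: thread|frame|module|function|file|line|offset
        if len(parts) >= 7 and parts[0] == crashing_thread:
            frames.append(parts[3] or parts[2])  # fall back to module name
    return " | ".join(frames[:max_frames])
```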
13. Back end processing
Large number of cron jobs, e.g.:
• Calculate aggregates: top crashers by signature, URL, domain
• Process incoming builds from the FTP server
• Match known crashes to Bugzilla bugs
• Duplicate detection
• Match up pairs of dumps (OOPP, content crashes, etc.)
• Generate extracts (CSV) for engineers to analyze
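The top-crashers aggregate can be illustrated in a few lines. Note this is a sketch only: the real cron jobs compute aggregates in PostgreSQL, not in Python, and the dict shape is assumed for illustration.

```python
from collections import Counter


def top_crashers(processed_crashes, key="signature", n=10):
    """Rank processed crashes by how often a field value occurs.

    Sketch of one back-end job's logic; Socorro's real aggregates run
    as SQL against PostgreSQL.
    """
    # Skip crashes missing the field (e.g. null-signature crashes).
    counts = Counter(c[key] for c in processed_crashes if c.get(key))
    return counts.most_common(n)
```

The same shape works for the URL and domain aggregates by changing `key`.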
14. Middleware
• Moving all data access behind a REST API (by end of year)
• (Still some queries in the webapp)
• Enables other front ends to the data, and lets us rewrite the webapp in Django in 2012
• In upcoming versions (late 2011, 2012), each component will have its own API for status and health checks
15. Webapp
• Hard part here: how to visualize some of this data
• Example: nightly builds, moving to reporting in build time rather than clock time
• Code is a bit crufty: rewrite in 2012
• Currently KohanaPHP; will likely be Django
16. Implementation details
• Python 2.6 mostly (except PHP for the webapp)
• PostgreSQL 9.1, with some stored procedures in PL/pgSQL
• memcached for the webapp
• Thrift for HBase access
• For HBase we use CDH3
18. A different type of scaling:
• Typical webapp: scale to millions of users without degradation of response time
• Socorro: fewer than a hundred users, terabytes of data
19. Basic law of scale still applies:
The bigger you get, the more spectacularly you fail
20. Some numbers
• At peak we receive 2,300 crashes per minute
• 2.5 million per day
• Median crash size 150 KB; max size 20 MB (larger submissions are rejected)
• ~110 TB stored in HDFS (3x replication; ~40 TB of HBase data)
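A quick back-of-envelope check, using only the figures quoted above:

```python
# Back-of-envelope arithmetic on the stated figures (no new data).
peak_per_minute = 2300
daily_crashes = 2.5e6        # ~2.5 million crashes/day
median_size_kb = 150

# Average rate implied by the daily total: well below the stated peak.
avg_per_minute = daily_crashes / (24 * 60)           # ~1,736 crashes/minute
# Raw intake per day at the median crash size (before 3x replication).
daily_raw_gb = daily_crashes * median_size_kb / 1e6  # ~375 GB/day
```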
21. What can we do?
• Does betaN have more (null signature) crashes than other betas?
• Analyze differences between crashes on Flash versions x and y
• Detect duplicate crashes
• Detect explosive crashes
• Find "frankeninstalls"
• Email victims of a malware-related crash
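Explosive-crash detection could work roughly as in this naive sketch, flagging signatures whose latest daily count jumps well above their trailing average. The data shape and threshold are invented; automatic detection was still listed as upcoming work at the time.

```python
def find_explosive(signature_history, threshold=3.0):
    """Flag signatures whose latest daily count jumps above trend.

    signature_history maps signature -> list of daily counts, oldest
    first. The threshold is an arbitrary illustrative choice, not
    Socorro's actual heuristic.
    """
    flagged = []
    for signature, days in signature_history.items():
        history, today = days[:-1], days[-1]
        baseline = sum(history) / len(history) if history else 0.0
        # max(..., 1.0) avoids flagging signatures with a near-zero baseline
        # on tiny absolute counts.
        if today > threshold * max(baseline, 1.0):
            flagged.append(signature)
    return flagged
```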
22. Implementation scale
• > 115 physical boxes (not cloud)
• Now up to 8 developers + sysadmins + QA + Hadoop ops/analysts
• (Yes, we're hiring: mozilla.org/careers)
• Deploy approximately weekly, but could do continuous if needed
24. Development process
• Fork
• Hard to install: use a VM (more in a moment)
• Pull request with bugfix/feature
• Code review
• Lands on master
25. Development process -2
• Jenkins polls GitHub master, picks up changes
• Jenkins runs tests, builds a package
• Package automatically picked up and pushed to dev
• Wanted changes merged to the release branch
• Jenkins builds the release branch; manual push to stage
• QA runs acceptance on stage (Selenium/Jenkins + manual)
• Manual push of the same build to production
26. Note: "manual" deployment =
• Run a single script with the build as parameter
• Pushes it out to all machines and restarts where needed
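Such a deploy script might look roughly like the sketch below. Host names, the install command, and the service name are all invented for illustration; the real script differs.

```python
import subprocess


def deploy(build, hosts, dry_run=True):
    """Push one build to every host and restart services as needed.

    Hypothetical sketch of a single-script deploy; everything after
    "ssh <host>" is an invented placeholder command.
    """
    commands = [
        ["ssh", host,
         "socorro-install {0} && sudo service socorro restart".format(build)]
        for host in hosts
    ]
    if not dry_run:
        for cmd in commands:
            subprocess.check_call(cmd)  # stop on the first failing host
    return commands
```

Keeping the per-host work in one parameterized script is what makes the push "a single script with the build as parameter."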
27. ABSOLUTELY CRITICAL
All the machinery for continuous deployment,
even if you don't want to deploy continuously
28. Configuration management
• Some releases involve a configuration change
• These are controlled and managed through Puppet
• Again, a single line to change config the same way every time
• Config controlled the same way on dev and stage; tested the same way; deployed the same way
29. Virtualization
• You don't want to install HBase
• Use Vagrant to set up a virtual machine
• Use Jenkins to build a new Vagrant VM with each code build
• Use Puppet to configure the VM the same way as production
• Tricky part is still getting the right data
30. Virtualization - 2
• VM work at github.com/rhelmer/socorro-vagrant
• Also pulled in as a submodule of socorro
31. Upcoming
• ElasticSearch implemented for better search, including faceted search; waiting on hardware to switch it on
• More analytics: automatic detection of explosive crashes, malware, etc.
• Better queueing: looking at the Sagrada queue
• Grand Unified Configuration System
32. Everything is open (source)
Fork: https://github.com/mozilla/socorro
Read/file/fix bugs: https://bugzilla.mozilla.org/
Docs: http://www.readthedocs.org/docs/socorro
Mailing list: https://lists.mozilla.org/listinfo/tools-socorro
Join us in IRC: irc.mozilla.org #breakpad