1. Large Scale Web Analytics with Accumulo
(and Nutch/Gora, Pig, and Storm)
Jason Trost
jtrost@endgames.us
@jason_trost
2. Introductions
• Jason Trost (jtrost@endgames.us)
• Senior Software Engineer at Endgame Systems
• Former Accumulo Trainer
• Apache Accumulo Committer
– Apache Pig integration with Accumulo
– some minor bug fixes
3. Agenda
• Technologies Introduction
– Apache Accumulo
– Apache Gora
– Apache Nutch/Gora
– Storm
• Accumulo at Endgame
– Web Crawl Analytics
– Real-time DNS Processing
– Operations
4. Apache Accumulo
• Accumulo is a BigTable implementation with cell
level security
• It is conceptually very similar to HBase, but it has
some nice features that HBase is currently
lacking.
• Some of these features are:
– Cell level security
– No fat row problem
– No limitation on col fams or when col fams can be created
– Server side, data local, programming abstraction called Iterators
– Iterators enable fast aggregation, searching, filtering, and streaming Reduce (sketched below)
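As a rough illustration of the Iterator abstraction, here is a minimal Java sketch that attaches Accumulo's built-in SummingCombiner to a table so values written under the same key are summed server side; the Connector, the "events" table, and the "count" column family are assumptions made for the example.

import java.util.Collections;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.iterators.LongCombiner;
import org.apache.accumulo.core.iterators.user.SummingCombiner;

public class AttachSummingCombiner {
  // conn is assumed to be an already-connected Accumulo Connector;
  // "events" and "count" are hypothetical table / column family names
  public static void attach(Connector conn) throws Exception {
    IteratorSetting setting = new IteratorSetting(10, "sum", SummingCombiner.class);
    // Values are longs encoded as strings
    SummingCombiner.setEncodingType(setting, LongCombiner.Type.STRING);
    // Only combine entries in the "count" column family
    SummingCombiner.setColumns(setting,
        Collections.singletonList(new IteratorSetting.Column("count")));
    // Applied at scan, minor compaction, and major compaction time
    conn.tableOperations().attachIterator("events", setting);
  }
}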
5. Apache Gora
• Gora is an object-to-datastore mapping framework for arbitrary data stores, including both relational stores (MySQL) and non-relational stores (HBase, Cassandra, Accumulo, Redis, Voldemort, etc.).
• It was designed for Big Data
applications and has support
(interfaces) for Apache Pig, Apache
Hive, Cascading, and generic
MapReduce.
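A minimal sketch of the Gora DataStore API in Java, assuming the Avro/Gora-generated WebPage bean that ships with Nutch 2.x; the key shown is illustrative.

import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.storage.WebPage;  // generated persistent bean from Nutch 2.x

public class GoraSketch {
  public static void main(String[] args) throws Exception {
    // The concrete backend (Accumulo, HBase, Cassandra, ...) is selected in
    // gora.properties, so the same code runs against any supported store
    DataStore<String, WebPage> store =
        DataStoreFactory.getDataStore(String.class, WebPage.class, new Configuration());

    // Nutch keys pages by reversed URL, e.g. "com.example:http/"
    WebPage page = store.get("com.example:http/");
    if (page != null) {
      System.out.println(page.getTitle());  // accessors are generated from the Avro schema
    }
    store.close();
  }
}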
6. Apache Nutch/Gora
• Nutch is a highly scalable web crawler built
over Hadoop MapReduce.
• It was designed from the ground up to be an
Internet scale web crawler and to enable large
scale search applications
• Gora enables storing the web crawl data
and metadata in Accumulo
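For reference, pointing Nutch 2.x at Accumulo is a configuration change rather than code. A minimal sketch, assuming the gora-accumulo module is on the classpath (Accumulo connection settings also go in gora.properties, but their exact property names vary by Gora version, so they are omitted here):

nutch-site.xml:
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.accumulo.store.AccumuloStore</value>
  </property>

gora.properties:
  gora.datastore.default=org.apache.gora.accumulo.store.AccumuloStore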
7. Storm
• Highly scalable streaming event processing system
• Conceptually similar to MapReduce, but operates on
streaming data in real-time
• Released by Twitter after they acquired BackType
• Development led by Nathan Marz
• At-least-once processing of events
• Spouts and Bolts are wired
together to form computation
Topologies
• Topologies run until killed
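A minimal Java sketch of wiring spouts and bolts into a topology with the backtype.storm-era API; TestWordSpout is Storm's bundled test spout and MarkerBolt is a trivial illustrative bolt, so neither reflects a real Endgame component.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class ExampleTopology {

  // A trivial bolt that appends a marker to each incoming word
  public static class MarkerBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
      collector.emit(new Values(input.getString(0) + "!"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    // Spouts emit streams of tuples; bolts consume, transform, and re-emit them
    builder.setSpout("words", new TestWordSpout(), 2);
    builder.setBolt("marker", new MarkerBolt(), 4).shuffleGrouping("words");

    // LocalCluster is for demonstration; StormSubmitter.submitTopology deploys to a
    // real cluster, where the topology runs until it is explicitly killed
    new LocalCluster().submitTopology("example", new Config(), builder.createTopology());
  }
}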
9. Web Crawl Analytics
• Formerly used Heritrix with a Cassandra backend
for collection and storage
• We now use Nutch/GORA to perform Large-scale
web crawling
• All pages and HTTP headers are stored in
Accumulo
• Run Pig scripts to pull data out of Accumulo,
perform rollups, perform pattern matching
(using regular expressions), and process the
pages using Python scripts
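The rollups themselves are done in Pig; as a rough Java sketch of the server-side pattern matching Accumulo can apply to the stored pages, the example below pushes a regular expression into a scan with the built-in RegExFilter iterator (the Connector, the "webpages" table, and the regex are assumptions).

import java.util.Map;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.user.RegExFilter;
import org.apache.accumulo.core.security.Authorizations;

public class PageRegexScan {
  // conn is assumed to be an existing Connector; "webpages" is a hypothetical table name
  public static void scan(Connector conn) throws Exception {
    Scanner scanner = conn.createScanner("webpages", new Authorizations());
    IteratorSetting regex = new IteratorSetting(20, "regex", RegExFilter.class);
    // Keep only entries whose value (page content) contains an email-like pattern;
    // the filtering happens server side, data local, before results cross the network
    RegExFilter.setRegexs(regex, null, null, null, ".*[\\w.]+@[\\w.]+.*", false);
    scanner.addScanIterator(regex);

    for (Map.Entry<Key, Value> entry : scanner) {
      System.out.println(entry.getKey().getRow());
    }
  }
}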
10. Real-time DNS Processing
• We used to use MapReduce/Pig to generate daily reports on all
DNS event data from files in HDFS; this took several hours
• Now, we use an internally developed framework called Velocity
that was built over Storm
• In real time, we enrich DNS and security events with IP geo data
(country, city, company, vertical) and correlate them with internally
developed/maintained DNS blacklists
• Store the events in Accumulo & use custom
Accumulo iterators to perform rollups
• At report generation time, Accumulo
aggregates records server side
• This process now takes minutes, not hours,
and we can query for partial results instead
of having to wait until the end of the day
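Velocity itself is internal, so as a rough sketch of only the Storm side, an enrichment bolt in Java might look like the following; the tuple field names and the lookupGeo helper are hypothetical.

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class GeoEnrichBolt extends BaseBasicBolt {
  @Override
  public void execute(Tuple input, BasicOutputCollector collector) {
    String domain = input.getStringByField("domain");
    String clientIp = input.getStringByField("client_ip");
    // lookupGeo stands in for an IP geo / blacklist lookup against a local database
    String country = lookupGeo(clientIp);
    collector.emit(new Values(domain, clientIp, country));
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("domain", "client_ip", "country"));
  }

  private String lookupGeo(String ip) {
    return "US";  // placeholder; a real implementation would consult a geo database
  }
}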
11. Custom Iterators & Aggregation
At Ingest
• RowID contains a CSV record that represents the fields used to basically perform a GROUP BY
• Col Qual contains the event UUID
Ingest Format
– Row: GROUP BY FIELDS
– Col Fam: Constant String
– Col Qual: Event UUID
– Val: -
At Scan time
• Basically strip off the event UUID
• Set the value to be “1”
• Prepares Key/Value for input into SummingCombiner
• Output from SummingCombiner is an accurate count of aggregated records
• This is, in essence, a streaming Reduce
Format After Custom Iterator
– Row: GROUP BY FIELDS
– Col Fam: Constant String
– Col Qual: “”
– Val: “1”
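A Java sketch of the ingest side of this scheme: each event writes one entry whose row is the CSV of the GROUP BY fields and whose column qualifier is the event UUID, so a replayed event lands on the same key rather than inflating the count. The Connector, the "dns_rollups" table, and the particular group-by fields are assumptions.

import java.util.UUID;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class RollupIngest {
  // conn is assumed to be an existing Connector; "dns_rollups" is a hypothetical table
  public static void write(Connector conn, String day, String country, String domain)
      throws Exception {
    BatchWriter writer = conn.createBatchWriter("dns_rollups", new BatchWriterConfig());
    // Row = GROUP BY fields as CSV; Col Fam = constant; Col Qual = event UUID; Val = empty
    Mutation m = new Mutation(new Text(day + "," + country + "," + domain));
    m.put(new Text("stat"), new Text(UUID.randomUUID().toString()), new Value("".getBytes()));
    writer.addMutation(m);
    writer.close();
  }
}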
12. Operations with Accumulo
• Hadoop Streaming jobs tend to kill tablet servers
– Streaming jobs use more memory than Hadoop allows
– This can make service memory allocations challenging
– Reducing number of Map tasks helped
• Running tablet servers under supervision is critical
– Tablet servers fail fast
– Supervisord or daemontools restart failed processes
– Has improved our cluster’s stability dramatically
• Pre-splitting tables is very important for throughput (see the sketch below)
– Our rows lead with a date, e.g. “20120101”
• Locality Groups are your friend for Nutch/Gora
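Since rows lead with a date, splits can be added up front through the table operations API. A minimal Java sketch, assuming an existing Connector and an illustrative table name and date range.

import java.util.TreeSet;
import org.apache.accumulo.core.client.Connector;
import org.apache.hadoop.io.Text;

public class PreSplit {
  // Creates one split per day so each day's data maps to its own tablet from the start
  public static void split(Connector conn) throws Exception {
    TreeSet<Text> splits = new TreeSet<Text>();
    for (int day = 1; day <= 31; day++) {
      splits.add(new Text(String.format("201201%02d", day)));
    }
    conn.tableOperations().addSplits("dns_rollups", splits);
  }
}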
13. We’re Hiring
• Like to work on hard problems with Big Data?
• Are you familiar with or interested in these
technologies?
– Hadoop, Storm, Django, Nutch/GORA
– Accumulo, Solr/ElasticSearch, Redis
– Python, Java, Pig, Node.js, GitHub
• Want to contribute to Open Source?
• We have offices in Atlanta, Washington DC,
Baltimore, and San Antonio
• www.linkedin.com/jobs/at-Endgame-Systems