Neustar is a fast-growing provider of enterprise services in telecommunications, online advertising, Internet infrastructure, and advanced technology. Neustar engaged Think Big Analytics to leverage Hadoop to expand its data analysis capacity. This session describes how Hadoop has expanded Neustar's data warehouse capacity, increased agility for data analysis, reduced costs, and enabled new data products. We look at the challenges and opportunities in capturing hundreds of terabytes of compact binary network data, performing ad hoc analysis, integrating with a scale-out relational database, making data development more agile, and building new products that integrate multiple big data sets.
14. Massive Binary Format Data

Query:

SELECT * FROM datafile
WHERE dt='2012-06-15';

Input: a large partitioned binary file (100's of TBs) holding compact compressed binary records (compressed binary record 1, compressed binary record 2, ...). Processing is CPU bound due to parsing the compact structure.

1. Parse on the fly (Binary InputFormat): parse into records; don't duplicate or lose the original.
» Reused an open source parser with custom extensions.
2. Parse into fields lazily (Binary SerDe).
» Optimized with profiling.
» Lazy parsing minimizes object creation.
3. Fields determined by Java bean methods (Bean ObjectInspector).
6/19/12 14
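The lazy-parsing idea in step 2 can be sketched outside Hive: keep the original record bytes intact and decode a field only when it is first requested, so a query that never touches a field never pays for parsing it or allocating its object. The record layout below (a 4-byte int id followed by a UTF-8 name) is a hypothetical stand-in for Neustar's actual binary format, and this plain-Java sketch omits the real Hive InputFormat/SerDe/ObjectInspector plumbing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class LazyDemo {

    /** A record wrapper that defers field decoding until first access. */
    static final class LazyRecord {
        private final byte[] raw;  // original bytes kept as-is: never duplicated or lost
        private Integer id;        // decoded on demand, cached after first access
        private String name;

        LazyRecord(byte[] raw) { this.raw = raw; }

        int id() {
            if (id == null) {  // parse only once, and only if actually asked for
                id = ByteBuffer.wrap(raw, 0, 4).getInt();
            }
            return id;
        }

        String name() {
            if (name == null) {
                name = new String(raw, 4, raw.length - 4, StandardCharsets.UTF_8);
            }
            return name;
        }

        /** Helper to build a sample record in the hypothetical layout. */
        static byte[] encode(int id, String name) {
            byte[] n = name.getBytes(StandardCharsets.UTF_8);
            return ByteBuffer.allocate(4 + n.length).putInt(id).put(n).array();
        }
    }

    public static void main(String[] args) {
        LazyRecord r = new LazyRecord(LazyRecord.encode(42, "neustar"));
        // A query projecting only `id` never pays the cost of decoding `name`.
        System.out.println(r.id());   // 42
        System.out.println(r.name()); // neustar
    }
}
```

The null-check-then-cache pattern is what keeps the hot path cheap: over hundreds of terabytes of records, skipping object creation for unreferenced fields is where the profiling-driven wins come from.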
23. Use of Capability

Technology selection criteria:
» Structure: basic, only structured data types, or complex structure
» Compute scale (# calculations): under 10 PetaFLOP vs. at least 10 PetaFLOP
» Data volume: under 10 TB, 10-100 TB, or 100 TB or more
» Latency: batch, a minute or more, or under a minute
» Analysis type: e.g. simple, parallel, complex structural, production, tightly integrated with existing data, or other

Use cases: Reporting, Data Ingestion, Data Processing, Fast Analytics, Data Enrichment.

[Flowchart: a decision tree applies the criteria above to each use case and routes it either to the EDW or to the existing platform.]
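The criteria can be read as a rough routing heuristic. The sketch below is an illustrative assumption, not the slide's actual decision tree: it keeps small, low-latency, tightly integrated workloads on the EDW and sends very large or batch-latency workloads to the scale-out platform. Only the thresholds (10 TB, 100 TB, under a minute) come from the slide; the routing logic itself is invented for illustration.

```java
public class PlatformChooser {
    enum Target { EDW, SCALE_OUT_PLATFORM }

    /**
     * Illustrative routing heuristic (an assumption, not the slide's exact tree).
     * Thresholds taken from the slide: 10 TB / 100 TB data volume, one-minute latency.
     */
    static Target choose(double dataTb, boolean underAMinute, boolean tightlyIntegrated) {
        if (dataTb >= 100) {
            return Target.SCALE_OUT_PLATFORM;  // 100 TB or more: scale out
        }
        if (dataTb < 10 && underAMinute && tightlyIntegrated) {
            return Target.EDW;                 // small, fast, integrated: stay on EDW
        }
        if (!underAMinute) {
            return Target.SCALE_OUT_PLATFORM;  // batch latency tolerates scale-out processing
        }
        return Target.EDW;                     // 10-100 TB, interactive: default to EDW
    }

    public static void main(String[] args) {
        System.out.println(choose(500, false, false)); // SCALE_OUT_PLATFORM
        System.out.println(choose(5, true, true));     // EDW
    }
}
```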
Data Science Trends
• Compute model scores faster
• Analyze full data sets
• Incorporate new data
• Build new services from data