Hadoop vs. RDBMS for Advanced Analytics
1. Hadoop vs. RDBMS for Advanced Analytics
Josh Wills
April 26th, 2012
2. About Me
• jwills@cloudera.com
• Formerly of Google (2008 – 2011)
• Worked on the ad auction
• Led the team that built the data infrastructure for Google+
• Before that: a bunch of startups
• Sometimes as a software engineer, sometimes as a statistician
• Math degree from Duke and a half-finished PhD from The University of Texas at Austin
• Now: Director of Data Science at Cloudera
Copyright 2012 Cloudera Inc. All rights reserved
3. Getting Started with Hadoop: Apache Hive
• Stick with the relational models that you are used to working with
• Great for the common starter use cases
• Log processing
• Online data archival
• ETL/ELT
4. Hadoop for Advanced Analytics
When Should I Use Hadoop instead of an RDBMS?
7. Third Symptom: ALTER TABLE OF_DOOM
8. The Unit of Analysis Problem
• Data warehouses are optimized to analyze transactions
• Awesome for finance and ERP
• Not ideal for product and marketing
• A function of what databases are good at
9. What Are You Trying to Analyze?
Simple Entities
• Static attributes
• Flat data structure
• Transient
• Examples: SKUs, line items from an invoice, log messages
Complex Entities
• Evolving attributes
• Hierarchical data structure
• Persistent
• Examples: customers, suppliers, website visitors
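To make the contrast concrete, here is a minimal Python sketch of the two kinds of entity. The class and field names (LogMessage, Customer, set_attribute, current) are hypothetical and chosen only for illustration; they are not from the talk. A simple entity is a flat, immutable record, while a complex entity keeps a history of its attributes and nests related records inside it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LogMessage:
    """Simple entity: static attributes, flat structure, transient."""
    timestamp: int
    level: str
    text: str


@dataclass
class Customer:
    """Complex entity: evolving attributes, hierarchical structure, persistent."""
    customer_id: str
    # attribute name -> list of (as_of, value) pairs, so changes are kept
    attributes: dict = field(default_factory=dict)
    # nested child records (for example orders, each with its own line items)
    orders: list = field(default_factory=list)

    def set_attribute(self, name, value, as_of):
        """Record a new value without discarding the earlier ones."""
        self.attributes.setdefault(name, []).append((as_of, value))

    def current(self, name):
        """Return the most recent value of an attribute, or None if unset."""
        history = self.attributes.get(name, [])
        if not history:
            return None
        return max(history, key=lambda entry: entry[0])[1]


customer = Customer("c1")
customer.set_attribute("plan", "free", as_of=1)
customer.set_attribute("plan", "pro", as_of=5)
customer.orders.append({"order_id": "o1", "items": ["sku-1", "sku-2"]})
```

In a flat relational table the same customer would typically be spread over several tables, and an attribute that had not been anticipated would need a schema change.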
10. Rods and Cones vs. Facial Recognition
11. Structure the Data to Fit the Problem
• HDFS Lets Us Store Our Data However We Want
• We can choose storage schemas that are:
• Flexible
• Evolvable
• Compact
• Fast serialization/deserialization
How do you know you have a unit of analysis problem? You’re doing a bunch of COUNT DISTINCT queries. You’re doing LAG/LEAD-style queries, or using a cursor.
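As a rough illustration of why those query patterns are a symptom, the Python sketch below regroups a flat event log into one nested record per user. The sample data and the names build_user_records and summarize are assumptions made for this example, not anything shown in the talk. Once the record matches the unit of analysis, the COUNT DISTINCT and LAG/LEAD questions become small local computations over a single record.

```python
from collections import defaultdict

# Hypothetical flat event log: (user_id, timestamp_seconds, page)
events = [
    ("u1", 100, "home"),
    ("u2", 105, "home"),
    ("u1", 130, "search"),
    ("u1", 190, "home"),
    ("u2", 400, "checkout"),
]


def build_user_records(rows):
    """Group flat rows into one time-ordered list of visits per user."""
    records = defaultdict(list)
    for user_id, timestamp, page in rows:
        records[user_id].append((timestamp, page))
    for visits in records.values():
        visits.sort()  # order by timestamp so consecutive visits are adjacent
    return dict(records)


def summarize(visits):
    """Answer the per-user questions from a single nested record."""
    times = [timestamp for timestamp, _ in visits]
    pages = [page for _, page in visits]
    return {
        # what COUNT(DISTINCT page) ... GROUP BY user_id would compute
        "distinct_pages": len(set(pages)),
        # what a LAG-style query would compute: time since the previous visit
        "gaps": [later - earlier for earlier, later in zip(times, times[1:])],
    }


user_records = build_user_records(events)
summaries = {user: summarize(visits) for user, visits in user_records.items()}
```

The grouping step is the part a MapReduce-style job would do once; after that, each per-user summary needs no join, window function, or cursor.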