Explains Big Data in 10 slides. Separates the hype from the reality, and shows what is genuinely new and what already existed. No big numbers, no big claims: just the plain, simple truth.
The "red pill"
1. Big Data in 10
What’s real and what’s fluff
Abhishek Pamecha
Mar-2013
2. What is Big Data
• It is all about data
– Not about “how much”
– But about correlations and increased reach
3. BigData Architecture
It influences or changes your
• Data source choices
• Data storing choices
• Data analyzing/mining approaches
It helps
• Address highly focused use cases
• Correlate more data sources
• Address scale and fault tolerance issues
4. Caution!
BigData is not a “substitute” for existing warehousing practices.
It complements existing practices.
5. Architectures – Data sources
• Traditional DW
– Production DB
– Dictionaries
– ETL/ELT pipelines
– External Data marts
• BigData adds
– Log files
– Social graphs
– Streaming data
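To make the idea of adding log files to the traditional sources concrete, here is a minimal sketch that correlates web-server log lines with an extract from a production DB. The log format, the `customers` lookup, and the function name `errors_by_segment` are all assumptions made for illustration; they do not come from the slides.

```python
import re

# Hypothetical production-DB extract: customer id -> segment.
customers = {"42": "premium", "77": "basic"}

# Hypothetical web-server log lines; the format is illustrative only.
log_lines = [
    "2013-03-01T10:00:00 customer=42 path=/checkout status=500",
    "2013-03-01T10:00:05 customer=77 path=/home status=200",
    "2013-03-01T10:00:09 customer=42 path=/checkout status=500",
]

LINE = re.compile(
    r"^(?P<ts>\S+) customer=(?P<customer>\d+) "
    r"path=(?P<path>\S+) status=(?P<status>\d{3})$"
)

def errors_by_segment(lines, customers):
    """Count 5xx responses per customer segment by correlating both sources."""
    counts = {}
    for line in lines:
        match = LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        if not match["status"].startswith("5"):
            continue  # only server errors are of interest here
        segment = customers.get(match["customer"], "unknown")
        counts[segment] = counts.get(segment, 0) + 1
    return counts

result = errors_by_segment(log_lines, customers)
```

With this sample data, `result` holds a count of 2 for the "premium" segment: both failing requests came from the same customer, a fact that neither source shows on its own.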
6. Architectures – Data Storage
• Traditional DW
– Production DB
  • Flatten hierarchies
  • Resolved references
– ROLAP or MOLAP databases
  • Star schema
  • Materialized views
  • Virtual data marts
  • Partitioned tables
– Still relational
• BigData adds
– Distributed file storage
– Distributed hash maps
– Columnar representations
– Graph databases
– Document collections
– Other NoSQL variants
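As one illustration of the storage options above, the following sketch contrasts a row-oriented layout with a columnar representation of the same records. The sample data and the helper `to_columns` are hypothetical, and real columnar stores add compression and on-disk layout details that this in-memory sketch ignores.

```python
# Hypothetical sample records; the field names are illustrative only.
rows = [
    {"user": "a", "country": "IN", "spend": 10.0},
    {"user": "b", "country": "US", "spend": 25.0},
    {"user": "c", "country": "IN", "spend": 5.0},
]

def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# Row layout: every whole record is visited to total a single field.
total_from_rows = sum(record["spend"] for record in rows)

# Columnar layout: only the one column that is needed is touched.
total_from_columns = sum(columns["spend"])
```

Both totals are the same; the difference is how much data has to be read to get there, which is why columnar layouts tend to suit queries over a few columns of many rows.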
7. Architectures – Analytic approaches
• Traditional DW
– Production DB
  • Flatten hierarchies
  • Resolved references
    – Pre-generate results
– ROLAP databases
  • Star schema
    – Multidimensional queries
  • Materialized views
    – Ad hoc explorations on subsets
  • Still relational
  • Virtual data marts
    – Ad hoc explorations on subsets
  • Partitioned tables
• BigData adds
– Distributed file storage
  • Map reduce frameworks and chaining
– Distributed hash maps
  • Single key predominant
– Columnar representations
  • Extracts select columns per row
– Graph databases
  • Navigate links
– Document collections
  • Simplified schemas
– Other NoSQL approaches
  • Stream pattern matching and pipelining
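The "map reduce frameworks and chaining" item can be sketched in a few lines. This is a single-process toy under stated assumptions, not how any particular framework works: `run_job`, the mapper and reducer names, and the log format are all hypothetical, and a real framework would distribute the map, shuffle, and reduce phases across machines.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Run one map-reduce pass in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]

# Hypothetical log lines in the form "<user> <page>".
log_lines = ["u1 /home", "u2 /home", "u1 /cart", "u3 /home", "u2 /cart", "u1 /help"]

# Job 1: count hits per page.
def map_hits(line):
    _user, page = line.split()
    yield page, 1

def reduce_hits(page, ones):
    return page, sum(ones)

# Job 2 (chained): group pages by their hit count.
def map_by_count(pair):
    page, hits = pair
    yield hits, page

def reduce_by_count(hits, pages):
    return hits, sorted(pages)

# Chaining: the output of job 1 is the input of job 2.
hits_per_page = run_job(log_lines, map_hits, reduce_hits)
pages_by_hits = run_job(hits_per_page, map_by_count, reduce_by_count)
```

The second job consumes exactly what the first one produced, which is the essence of chaining: each stage stays simple, and more complex analyses are built by composing stages.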
8. Big Data Architectures
Pros and Cons
• Pros
– Incorporate low value and social data in analysis
– Increase analysis reach to non-structured data
– Correlate across data sources on the same platform
– Very strong in their sweet spots.
– Efficiency in terms of
• data movement volume,
• scale
• fault tolerance and
• responsiveness.
• Cons
– Not relational. Gives up on some of the relational advantages.
• Joins
• Aggregations etc.
– Few standards; non-portable solutions
– Less support from end-user tools and applications (though growing)
– Not a replacement for the DW, just an extension of it
– Incompatible with some classes of use cases; each approach has its sweet spots
– Heterogeneous setup in Development and Operations.
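The cost of giving up joins and aggregations is easier to see in code. The sketch below assumes two key-value collections, modelled as plain Python dicts with made-up data, and shows the application doing by hand what a relational engine would do with a JOIN and a GROUP BY. The collection names and `total_spend_by_name` are hypothetical.

```python
# Two hypothetical key-value collections, modelled as plain dicts.
users = {"u1": {"name": "Asha"}, "u2": {"name": "Ben"}}
orders = {
    "o1": {"user_id": "u1", "amount": 30},
    "o2": {"user_id": "u2", "amount": 15},
    "o3": {"user_id": "u1", "amount": 20},
}

def total_spend_by_name(users, orders):
    """Application-side join plus aggregation: total order amount per user name."""
    totals = {}
    for order in orders.values():
        user = users.get(order["user_id"])  # one key lookup per order
        if user is None:
            continue  # the store enforces no referential integrity
        name = user["name"]
        totals[name] = totals.get(name, 0) + order["amount"]
    return totals

spend = total_spend_by_name(users, orders)
```

Every such query becomes custom code that has to be written, tested, and maintained, which is one reason these stores complement a warehouse rather than replace it.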
9. Challenges
• Architectural
– “Big” data management
– Data consistency
– Read heavy or write heavy
– Scaling
– Distributed deployment
• Functional
– Data quality
– Problem set choice
• Organizational
– Data backed decisions
– Going overboard
– SLAs and operations management
– Data Privacy