8. What RDBMS are for
● Operational data
● Normalized models
● Static typed data
9. What RDBMS are NOT for
● "Full Scan" Aggregated Computations
● Multi-dimensional queries (think pivot)
● Unstructured data
10. OK, so how's that different from Big Data platforms?
11. Big Data - More than a buzzword
(although sometimes it's hard to tell...)
Big Data is not a product.
It is an architecture.
12. Big Data - More than a buzzword
(although sometimes it's hard to tell...)
A schema-less distributed storage and
processing model for data.
13. Big Data
● Schema-less
○ Programmatic queries
○ "Map" of MapReduce
● High Redundancy
○ Distributed processing
○ "Reduce" of MapReduce
14. Big Data
● No referential integrity
● Non-transactional
● High latency
15. Classic Big Data internals
● "Shared nothing" paradigm
● Push the processing closer to the data
● The query defines the schema
[Diagram: one SCHEDULER dispatching work to three PROCESSOR nodes]
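A minimal sketch of the model on slides 13 to 15, in plain Python rather than any real MapReduce framework. The log lines, their field layout and the names `map_record`, `reduce_counts` and `run_job` are hypothetical. It only illustrates that the map step imposes a schema at query time and the reduce step aggregates per key.

```python
from collections import defaultdict

# Hypothetical raw, schema-less log lines: nothing is parsed at write time.
RAW_LOGS = [
    "2012-06-01 click country=USA gender=M",
    "2012-06-01 view country=CANADA gender=F",
    "2012-06-02 click country=USA gender=F",
]

def map_record(line):
    """'Map': the query decides how to read each record (schema on read)."""
    date, event, *pairs = line.split()
    fields = dict(pair.split("=", 1) for pair in pairs)
    yield (fields["country"], event), 1

def reduce_counts(key, values):
    """'Reduce': combine every value emitted for one key."""
    return key, sum(values)

def run_job(lines):
    # Shuffle: group mapped values by key, as a scheduler would across nodes.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_record(line):
            groups[key].append(value)
    return dict(reduce_counts(k, v) for k, v in groups.items())

result = run_job(RAW_LOGS)
```

Changing the question (say, counting per gender) means changing `map_record`, not the stored data.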
16. What Big Data is for
● Unstructured data
keep everything
● Distributed file system
great for archiving
● Data is fixed
only the process evolves
17. What Big Data is for
● Ludicrous amounts of data
keep everything, remember?
● Made on the cheap
each processing unit is commodity hardware
18. What Big Data is NOT for
● Low latency applications
arbitrary exploration of the data is close to impossible
● End-users
writing code is easy. writing good code is hard.
● Replacing your operational DB
19. Some more limitations
● No structured query language
exploration is tedious
● Accuracy & Exactitude
the burden is put on the end user / query designer
● No query optimizer
cannot optimize at runtime.
does exactly what you tell it to.
21. First, defining NoSQL...
● NoSQL: The thing named after what it lacks, which has as many definitions as there are products.
(which usually turns out to be some sort of key-value store)
22. Why "NoSQL"? Why all the hate?!
● Historical reasons
○ Wrong technological choices
○ Blind faith in RDBMS scalability
○ General wishful thinking and voodoo magic
23. Why "NoSQL"? Why all the hate?!
● "SQL" itself was never the issue
● NoSQL projects are implementing SQL-like query languages
25. Current efforts
● Straight SQL implementations
Greenplum: Straight SQL on top of Big Data
Hive JDBC: A hybrid of DSL & SQL
● The Splunk approach
SQL with missing columns
● Runtime query optimizers
Optiq framework: SQL with Big Data federated sources
28. Widely used. Little known.
● Your favorite corporate dashboards
● Google Analytics
& other ad-hoc tools
29. Analytics centric language
● Multidimensional Expressions (MDX)
a powerful query language for analytics
● Forget about rows and columns
as many axes as you need
● Slice & dice
start from everything - progressively focus only on relevant data
31. An example
What are my total sales for the current year, per month, for male customers?
with
member [Measures].[Accumulated Sales]
as 'Sum(YTD(), [Measures].[Store Sales])'
select
{[Measures].[Accumulated Sales]} on columns,
{Descendants([Time].[1997], [Time].[Month])} on rows
from
[Sales]
where
([Customer].[Gender].[M])
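To make the query's meaning concrete, here is a rough Python equivalent of what it computes. The fact rows and the function name `accumulated_sales` are made up for illustration; a real OLAP engine would not evaluate the query this way.

```python
from itertools import accumulate

# Hypothetical fact rows: (year, month, gender, store_sales).
FACTS = [
    (1997, 1, "M", 10.0),
    (1997, 1, "F", 4.0),
    (1997, 2, "M", 5.0),
    (1997, 3, "M", 2.5),
    (1997, 3, "F", 1.0),
    (1998, 1, "M", 9.0),
]

def accumulated_sales(facts, year, gender):
    """Year-to-date sales per month for one gender (the slicer)."""
    monthly = [0.0] * 12
    for y, month, g, sales in facts:
        if y == year and g == gender:      # where ([Customer].[Gender].[M])
            monthly[month - 1] += sales
    # Sum(YTD(), [Store Sales]): running total from January to each month.
    return list(accumulate(monthly))

ytd = accumulated_sales(FACTS, 1997, "M")
```

Each entry of `ytd` corresponds to one row of the MDX result (one month), holding the running total up to that month.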
33. Analytics data modeling
● A denormalized model for performance
the data is modeled for read operations, not writes
● High redundancy
because sometimes more is better
37. Relational OLAP (ROLAP)
● Backed by a relational database
think of an MDX-to-SQL bridge.
the aggregated data can be cached in-memory or on-disk.
● Relies heavily on the RDBMS performance
figures out at runtime the proper optimizations
39. Other OLAP
● On-disk aggregated data files
Think SAS. Cubes are compiled into data files on disk.
● Simple Bridges
Converts MDX straight to SQL, with limited support of MDX syntax.
42. Where the data lives matters
Operation                                  Latency (ns)
L1 cache reference                                  0.5
Branch mispredict                                     5
L2 cache reference                                    7
Mutex lock/unlock                                    25
Main memory reference                               100
Compress 1K bytes w/ cheap algorithm              3 000
Send 2K bytes over 1 Gbps network                20 000
Read 1 MB sequentially from memory              250 000
Round trip within same datacenter               500 000
Disk seek                                    10 000 000
Read 1 MB sequentially from disk             20 000 000
Send packet CA -> Netherlands -> CA         150 000 000
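A quick bit of arithmetic on the figures above shows why an in-memory cache matters. The dictionary keys and the helper name `slowdown` are illustrative only; the numbers are copied from the table.

```python
# Latencies in nanoseconds, copied from the table above.
LATENCY_NS = {
    "main_memory_reference": 100,
    "read_1mb_memory": 250_000,
    "datacenter_round_trip": 500_000,
    "disk_seek": 10_000_000,
    "read_1mb_disk": 20_000_000,
}

def slowdown(slow, fast):
    """How many times slower one operation is than another."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]

# Sequential 1 MB read: disk versus memory.
disk_vs_memory = slowdown("read_1mb_disk", "read_1mb_memory")
# One disk seek versus one main memory reference.
seek_vs_reference = slowdown("disk_seek", "main_memory_reference")
```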
43. Optimizing for CPU
● Java NIO blocks
use extremely compact chunks of 64 bits.
● Primitive types
use "int" instead of "Integer"
● BitKeys
because they are naturally CPU friendly
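A minimal sketch of the bitkey idea, using a Python integer as the bit set. This is not Mondrian's implementation; the column names, their bit positions and the helper names are assumptions made for the example.

```python
# Hypothetical bit positions for the columns of a cube.
COLUMNS = {"gender": 0, "country": 1, "city": 2, "age": 3}

def bitkey(*names):
    """Encode a set of columns as one integer, one bit per column."""
    key = 0
    for name in names:
        key |= 1 << COLUMNS[name]
    return key

def is_superset(key, other):
    """True if `key` contains every column of `other` (one AND, one compare)."""
    return (key & other) == other

cached = bitkey("gender", "country")   # columns held by a cached segment
wanted = bitkey("country")             # columns a new query asks for
can_roll_up = is_superset(cached, wanted)
```

Set comparisons collapse to a couple of machine instructions, which is what makes this representation CPU friendly.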
44. Optimizing for memory
● Hard limits on the heap space
must pay attention to the total memory usage.
● Inherent limitations
there can only be so many individual pointers on heap.
49. Cache indexing
● Linear performance is not good enough
a full scan takes O(N) time as N grows
● The rollup combinatorial problem
as the cache grows, reuse becomes tedious
50. The rollup combinatorial problem
Gender   Country   Sales
M        USA       7
M        CANADA    8
F        USA       4
F        CANADA    2

Country  Sales
USA      11
CANADA   10
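The rollup on this slide can be sketched in a few lines of Python. The data comes from the slide; the function name `roll_up` and its `keep` parameter are hypothetical.

```python
from collections import defaultdict

# The (Gender, Country) -> Sales segment from the slide.
SEGMENT = {
    ("M", "USA"): 7,
    ("M", "CANADA"): 8,
    ("F", "USA"): 4,
    ("F", "CANADA"): 2,
}

def roll_up(segment, keep):
    """Sum out every axis whose position is not listed in `keep`."""
    totals = defaultdict(int)
    for key, value in segment.items():
        totals[tuple(key[i] for i in keep)] += value
    return dict(totals)

# Keep only the Country axis (position 1); Gender is summed out.
by_country = roll_up(SEGMENT, keep=[1])
```

This only works because the cached segment covers every gender; with partial segments the engine first has to find a combination that covers the request.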
51. The rollup combinatorial problem
Gender   Country   Sales
M        USA       7
M        CANADA    8

Gender   Country   Sales
F        USA       5

City        Sales
Montreal    6
Quebec      1
Ottawa      8
Vancouver   2
Toronto     5

Age       Country   Sales
16 - 25   USA       2
26 - 40   CANADA    3

Age       Country   Cost
41 - 56   USA       5
26 - 40   USA       5

Country  Sales
?        ?
?        ?
52. PoSet & BitKeys
● Represent the levels / values as bitkeys
because bitkeys are fast, remember?
● The PartiallyOrderedSet
a hierarchical hash set where elements might
or might not be related to one another.
53. PoSet & BitKeys
● An example application
finding all primes in a set of integers
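The slide does not spell the example out. One plausible reading: order integers by divisibility, and the primes are the elements that nothing else in the set divides. The sketch below is a naive quadratic scan, not the hierarchical structure described on the previous slide, and the function names are hypothetical.

```python
def divides(a, b):
    """Partial order: a precedes b when a divides b and they differ."""
    return a != b and b % a == 0

def minimal_elements(values, precedes):
    """Elements that no other element of the set precedes."""
    return sorted(
        v for v in values
        if not any(precedes(other, v) for other in values)
    )

# With every integer from 2 to 30 present, the minimal elements are the primes.
numbers = set(range(2, 31))
primes = minimal_elements(numbers, divides)
```

Note that 4 and 9 are unrelated under this ordering (neither divides the other), which is what makes the set only partially ordered.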
61. Shared Caches
● OLAP and key-value stores don't like each other
OLAP requires a complex key. a hash is insufficient.
● Remember the "deltas" strategy?
partially invalidating a block of data would break the hash
62. Data grids & OLAP
● Well suited for OLAP caches
supports "rich" keys
● Distributed and redundant
if a node goes offline, the cache data is not lost
● In-memory grids are fast
multiplies the available heap space
65. Advertising data analysis
● Low latency
the end users don't want to wait for MapReduce jobs
● Scalability is a huge factor
we're talking petabytes of data here
66. Advertising data analysis
● Queries are not static
we can't tell upfront what will be computed
● Deployed in datacenters worldwide
the hashing strategy must allow "smart" data distribution
● Almost all open source
67. [Architecture diagram. Components shown: Monitoring & Management, ETL Designer, Client App (olap4j), Load Balancer, XML/A, OLAP (olap4j), OLAP Cache, Analytical DB, ETL processes, Logs, Big Data Store, Message Queue]
68. ● A query
- UI sends MDX to a SOAP service.
- load balancer dispatches the query.
- OLAP layer uses its data sources and aggregates.
- query is answered.
[Diagram: Client App (olap4j), Load Balancer, XML/A, OLAP (olap4j), OLAP Cache, Analytical DB]
69. ● An update - Strategy #1
- the ETL process updates the analytical DB.
- a cache delta is sent to a message queue.
- OLAP processes the message.
- OLAP uses its index to spot the regions to invalidate.
- aggregated cache is updated incrementally.
[Diagram: Logs, ETL, Big Data Store, Message Queue, Analytical DB, OLAP, OLAP Cache]
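A toy sketch of Strategy #1, assuming a cache of aggregated cells and an index from each member to the cells that use it. The class `AggregateCache`, its methods and the shape of the delta message are all hypothetical; the real system's message format and index are not shown in the slides.

```python
from collections import defaultdict

class AggregateCache:
    """Toy aggregate cache with an index from member to cached cells."""

    def __init__(self):
        self.cells = {}                    # (gender, country) -> sales
        self.index = defaultdict(set)      # member -> keys of cells using it

    def put(self, key, value):
        self.cells[key] = value
        for member in key:
            self.index[member].add(key)

    def apply_delta(self, delta):
        """Apply a message such as {'key': (...), 'amount': n} in place."""
        key = delta["key"]
        if key in self.cells:
            self.cells[key] += delta["amount"]   # incremental update
        return key in self.cells

    def invalidate(self, member):
        """Use the index to drop every cached cell that involves `member`."""
        dropped = self.index.pop(member, set())
        for key in dropped:
            self.cells.pop(key, None)
            for other in key:
                if other != member:
                    self.index[other].discard(key)
        return dropped

cache = AggregateCache()
cache.put(("M", "USA"), 7)
cache.put(("F", "USA"), 4)
cache.put(("M", "CANADA"), 8)
# A delta taken off the message queue updates one cell incrementally.
cache.apply_delta({"key": ("M", "USA"), "amount": 3})
# A region that can no longer be trusted is located through the index.
dropped = cache.invalidate("CANADA")
```

Without the index, finding the cells touched by an update would require scanning the whole cache.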
70. ● An update - Strategy #2
- ETL updates the analytical DB.
- ETL acts directly on the OLAP cache.
- OLAP processes events from its cache.
- OLAP updates its index
[Diagram: Logs, ETL, Big Data Store, Analytical DB, OLAP, OLAP Cache]
71. a stack built on open standards
(get ready, the next slide will hurt your brains)
72. [Stack diagram, top to bottom: Java Client Apps (olap4j-xmla) and a load balancer; HTTP (XMLA); three Java servers, each stacking olap4j server, olap4j, JDBC connection pool, olap4j impl, Mondrian server manager, Mondrian cache manager and infinispan; UDP (Hot Rod); infinispan data grid]
76. olap4j-xmla / olap4j-server
● JDBC for OLAP
extension to JDBC. became the de facto standard.
● A Java toolkit for OLAP
- MDX parser / validator
- a rich type system / MDX object model
- driver specification
- programmatic query models
- olap4j to XMLA bridge
78. Mondrian
● Developed by Pentaho Corp.
used worldwide. pure java. open source.
● Highly extensible
exposes many APIs & SPIs for enterprise integration.
● ROLAP / MOLAP hybrid
uses the best of what's available.
● Extensible MDX parser
new MDX functions can be created for specific business domains.
80. Stuff that didn't work
● memcached
○ doesn't have an index.
○ enforces random TTLs.
○ a hash key is not enough
● simple Java collections
81. Infinispan
● Developed for JBoss AS
well tested.
● UDP Multicast
nodes can join and leave the cluster as needed.
● Can distribute the processing
jobs can be distributed and run on the nodes.
● Serializes rich objects
the contents can be read from APIs.
83. Oracle
● Cluster of instances
partitioned Oracle nodes
● Why Oracle?
because their DBAs are good enough with Oracle to get it to run properly under such a load
84. Other options
● An analytics-oriented DB
Vectorwise, Vertica, MonetDB, Greenplum, ...
● Column stores
column stores scale marvelously and are well suited for analytics
86. Big Data Layer
● Homebrew Java MapReduce
● 42 000 nodes
● ETL processes managed with Pig
● A keynote in itself
(see the resources at the end for a keynote from Scott Burke, Senior VP of Yahoo!)
88. Final processing capacity
● Big Data layer
○ 140 petabytes
○ 500 users
○ 42 000 nodes
○ 10 000 000 hours of CPU time usage per day
○ 100 000 000 000 records per day
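To put these figures in perspective, a little arithmetic on the numbers from this slide. The derived quantities are back-of-the-envelope averages, assuming load is spread evenly over the day and over the nodes, which the slides do not claim.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures from the slide.
records_per_day = 100_000_000_000
cpu_hours_per_day = 10_000_000
nodes = 42_000

# Average ingest rate, assuming an even spread over the day.
records_per_second = records_per_day / SECONDS_PER_DAY
# Average number of CPU cores kept busy on each node around the clock.
busy_cores_per_node = cpu_hours_per_day / nodes / 24
```

That works out to a bit over a million records per second, and close to ten cores busy per node on average.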
89. Final processing capacity
● Analytical DB layer
○ 50 terabytes
○ 100s of tables
(heavy use of the snowflake schema)
○ 1 000 000 000 new rows per day
90. Final processing capacity
● OLAP layer
○ 10s of Mondrian instances
○ 10s of cubes
○ 100s of dimensions
○ 1 000s of levels
○ 1 000 000s of members per level
○ 1 000 000 000s of facts per day
92. Mondrian over Google's BigQuery
● Big Data as a service
upload CSVs & other formats to an ad-hoc cluster
● No code required
MapReduce jobs usually require you to code them
93. Pentaho Instaview
● Interactive data discovery for Big Data
fully integrated ETL / OLAP.
all you need is a URL and a user / password.
● A rich UI environment for data
drag & drop.
full OLAP support.
mobile.
● Open source
94. resources
Mondrian - The open source analytics engine
mondrian.pentaho.org
olap4j - The open standard for OLAP in Java
olap4j.org
Infinispan - The distributed data grid platform
jboss.org/infinispan
Scott Burke, SVP Advertising & Data @ Yahoo!
Keynote of Hadoop Summit 2012
youtube.com/watch?v=mR30psmuIPo