Online Analytical
  Processing of Large
Distributed Databases



            Luc Boudreau
            Lead Engineer, Pentaho Corporation
OLAP scalability
"It's all about data movement and operating on
                         that data on the fly"
a relational database
Relational Databases

●   Static schema
●   Minimized redundancy
●   Referential integrity
●   Transactional
Classic RDBMS internals

● "Shared Everything" paradigm
● Private Planner
● Multiple private processors
● Multiple private data stores

[Diagram: one PLANNER / SCHEDULER above three PROCESSOR boxes]
What RDBMS are for

● Operational data
● Normalized models
● Statically typed data
What RDBMS are NOT for

● "Full Scan" Aggregated Computations
● Multi-dimensional queries (think pivot)
● Unstructured data
OK so how's that
different from Big Data
             platforms?
Big Data - More than a buzzword
(although sometimes it's hard to tell...)



                Big Data is not a product.
                  It is an architecture.
Big Data - More than a buzzword
(although sometimes it's hard to tell...)




      A schema-less distributed storage and
           processing model for data.
Big Data

● Schema less
  ○ Programmatic queries
  ○ "Map" of MapReduce

● High Redundancy
  ○ Distributed processing
  ○ "Reduce" of MapReduce
Big Data

● No referential integrity

● Non transactional

● High latency
Classic Big Data internals

● "Share nothing" paradigm
● Push the processing closer to the data
● The query defines the schema

[Diagram: one SCHEDULER above three PROCESSOR boxes]
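
A minimal sketch of "the query defines the schema", written in plain Python rather than against a real MapReduce cluster. The log format, the field names, and the helpers `map_line`, `reduce_totals` and `run_job` are all assumptions made up for illustration. The point is that the store keeps raw lines, and only the job decides which fields exist.

```python
from collections import defaultdict

# Raw, schema-less input: the store keeps every line as-is (hypothetical format).
RAW_LOGS = [
    "2012-06-01 country=USA gender=M sales=7",
    "2012-06-01 country=CANADA gender=M sales=8",
    "garbage line that the job must tolerate",
    "2012-06-02 country=USA gender=F sales=4",
    "2012-06-02 country=CANADA gender=F sales=2",
]

def map_line(line):
    """Map step: impose a schema at read time and emit (key, value) pairs."""
    fields = dict(part.split("=", 1) for part in line.split() if "=" in part)
    if "country" in fields and "sales" in fields:
        yield fields["country"], int(fields["sales"])

def reduce_totals(pairs):
    """Reduce step: aggregate all values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def run_job(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = (pair for line in lines for pair in map_line(line))
    return reduce_totals(pairs)

if __name__ == "__main__":
    print(run_job(RAW_LOGS))
```

Changing the question (say, sales per gender) means changing `map_line`, not the stored data.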
What Big Data is for

● Unstructured data
  keep everything


● Distributed file system
  great for archiving


● Data is fixed
  only the process evolves
What Big Data is for

● Ludicrous amounts of data
  keep everything, remember?


● Made on the cheap
  each processing unit is commodity hardware
What Big Data is NOT for

● Low latency applications
  arbitrary exploration of the data is close to impossible



● End-users
  writing code is easy. writing good code is hard.


● Replacing your operational DB
Some more limitations

● No structured query language
  exploration is tedious


● Accuracy & Exactitude
  the burden is put on the end user / query designer


● No query optimizer
  cannot optimize at runtime.
  does exactly what you tell it to.
why is this so similar to
                 NoSQL?
First, defining NoSQL...

● NoSQL: The thing named after what it
  lacks which has as many definitions as
  there are products.
  (which usually turns out to be some sort of key-value store)
Why "NoSQL"? Why all the hate?!

● Historical reasons
  ○ Wrong technological choices
  ○ Blind faith in RDBMS scalability
  ○ General wishful thinking and voodoo magic
Why "NoSQL"? Why all the hate?!

● "SQL" itself was never the issue

● NoSQL projects are implementing SQL-like
  query languages
bringing structured
queries to Big Data
Current efforts

● Straight SQL implementations
  Greenplum: Straight SQL on top of Big Data
  Hive JDBC: A hybrid of DSL & SQL


● The Splunk approach
  SQL with missing columns


● Runtime query optimizers
  Optiq framework: SQL with Big Data federated sources
isn't there something
  better than SQL for
             analytics?
Online Analytical
Processing (OLAP)
Widely used. Little known.

● Your favorite corporate dashboards

● Google Analytics
  & other ad-hoc tools
Analytics centric language

● Multidimensional Expressions (MDX)
  a powerful query language for analytics


● Forget about rows and columns
  as many axes as you need


● Slice & dice
  start from everything - progressively focus only on relevant data
Business domain driven

● Hierarchical view of
  a multidimensional
  universe
An example

What are my total sales for the current year, per month, for male customers?


    with
       member [Measures].[Accumulated Sales]
       as 'Sum(YTD(), [Measures].[Store Sales])'
    select
       {[Measures].[Accumulated Sales]} on columns,
       {Descendants([Time].[1997], [Time].[Month])} on rows
    from
       [Sales]
    where
       ([Customer].[Gender].[M])
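
To make the semantics of the query above concrete, here is a rough Python equivalent: filter on one year and one gender, total the sales per month, then keep a running year-to-date sum. The fact rows and the function name `accumulated_sales` are invented for this sketch, and months without any fact are simply omitted here, which an MDX engine would not do.

```python
from itertools import accumulate

# Hypothetical fact rows: (year, month, gender, store_sales).
FACTS = [
    (1997, 1, "M", 10.0),
    (1997, 1, "F", 4.0),
    (1997, 2, "M", 5.0),
    (1997, 3, "M", 2.5),
    (1997, 3, "F", 1.0),
    (1998, 1, "M", 99.0),
]

def accumulated_sales(facts, year, gender):
    """Return {month: year-to-date sales} for one year and one gender."""
    monthly = {}
    for fact_year, month, fact_gender, sales in facts:
        # Plays the role of the [Time].[1997] member and the WHERE slicer.
        if fact_year == year and fact_gender == gender:
            monthly[month] = monthly.get(month, 0.0) + sales
    months = sorted(monthly)
    # Plays the role of Sum(YTD(), [Measures].[Store Sales]).
    running = accumulate(monthly[m] for m in months)
    return dict(zip(months, running))

if __name__ == "__main__":
    print(accumulated_sales(FACTS, 1997, "M"))
```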
how does that work?
Analytics data modeling

● A denormalized model for performance
  the data is modeled for read operations, not writes


● High redundancy
  because sometimes more is better
The Star model
The Snowflake model
different OLAP servers.
       Different beasts.
Relational OLAP (ROLAP)

● Backed by a relational database
  think of an MDX-to-SQL bridge.
  the aggregated data can be cached in-memory or on-disk.


● Relies heavily on the RDBMS performance
  figures out at runtime the proper optimizations
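
A small sketch of the ROLAP idea, using Python's built-in `sqlite3` as a stand-in for the relational database. The star schema (one fact table, two dimension tables), the table and column names, and the SQL text are assumptions for illustration. This is the kind of statement an MDX-to-SQL bridge could generate, not the SQL any particular engine emits.

```python
import sqlite3

def build_star_schema():
    """Create a tiny in-memory star schema with made-up data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY,
                               the_year INTEGER, the_month INTEGER);
        CREATE TABLE customer_dim (customer_id INTEGER PRIMARY KEY, gender TEXT);
        CREATE TABLE sales_fact (time_id INTEGER, customer_id INTEGER,
                                 store_sales REAL);
        INSERT INTO time_dim VALUES (1, 1997, 1), (2, 1997, 2), (3, 1998, 1);
        INSERT INTO customer_dim VALUES (1, 'M'), (2, 'F');
        INSERT INTO sales_fact VALUES
            (1, 1, 10.0), (1, 2, 4.0), (2, 1, 5.0), (3, 1, 99.0);
    """)
    return conn

# Join the fact table to its dimensions, filter, and aggregate per month.
GENERATED_SQL = """
    SELECT t.the_month, SUM(f.store_sales)
    FROM sales_fact f
    JOIN time_dim t ON t.time_id = f.time_id
    JOIN customer_dim c ON c.customer_id = f.customer_id
    WHERE t.the_year = ? AND c.gender = ?
    GROUP BY t.the_month
    ORDER BY t.the_month
"""

def monthly_sales(conn, year, gender):
    """Run the generated aggregate query and return (month, total) rows."""
    return conn.execute(GENERATED_SQL, (year, gender)).fetchall()

if __name__ == "__main__":
    print(monthly_sales(build_star_schema(), 1997, "M"))
```

The aggregated rows returned here are what a ROLAP engine would then keep in its cache.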
Multidimensional OLAP (MOLAP)

● Loads everything in RAM

● Relies on an efficient ETL platform
Other OLAP

● On-disk aggregated data files
  Think SAS. Cubes are compiled into data files on disk.


● Simple Bridges
  Converts MDX straight to SQL, with limited support of MDX syntax.
how do they compare?
(there are no straight
      answers, sorry)
Where the data lives matters
       Operation                              Latency (ns)

       L1 cache reference                              0.5
       Branch mispredict                                 5
       L2 cache reference                                7
       Mutex lock/unlock                                25
       Main memory reference                           100
       Compress 1K bytes w/ cheap algorithm          3 000
       Send 2K bytes over 1 Gbps network            20 000
       Read 1 MB sequentially from memory          250 000
       Round trip within same datacenter           500 000
       Disk seek                                10 000 000
       Read 1 MB sequentially from disk         20 000 000
       Send packet CA -> Netherlands -> CA     150 000 000
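
A back-of-the-envelope calculation with the figures in the table above: how long a sequential scan of 1 GB takes from memory versus from disk. The helper `scan_seconds` and the choice of a 1 GB scan with a single seek are assumptions for this sketch.

```python
# Latencies in nanoseconds, copied from the table above.
READ_1MB_MEMORY_NS = 250_000
READ_1MB_DISK_NS = 20_000_000
DISK_SEEK_NS = 10_000_000

def scan_seconds(megabytes, ns_per_mb, seeks=0):
    """Estimated time to read `megabytes` sequentially, plus optional seeks."""
    return (megabytes * ns_per_mb + seeks * DISK_SEEK_NS) / 1e9

memory_s = scan_seconds(1024, READ_1MB_MEMORY_NS)        # about 0.256 s
disk_s = scan_seconds(1024, READ_1MB_DISK_NS, seeks=1)   # about 20.49 s

if __name__ == "__main__":
    print(f"memory: {memory_s:.3f} s, disk: {disk_s:.2f} s, "
          f"ratio: {disk_s / memory_s:.0f}x")
```

With these numbers the disk scan is roughly 80 times slower, which is why where the aggregates live matters more than how they are computed.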
Optimizing for CPU

● Java NIO blocks
  use extremely compact chunks of 64 bits.


● Primitive types
  use "int" instead of "Integer"


● BitKeys
  because they are naturally CPU friendly
Optimizing for memory

● Hard limits on the heap space
  must pay attention to the total memory usage.


● Inherent limitations
  there can only be so many individual pointers on heap.
Optimizing for networking

● Payload optimization
  batching. deltas.


● Manageability
  turning nodes on & off.
Optimizing for disk

● Concurrent access
  must carefully manage disk IO.


● Inherently slooooow
how to deal with
   these issues?
a scalable indexing
            strategy
Cache indexing

● Linear performance is not good enough
  as the cache grows to n entries, a full scan takes O(n)


● The rollup combinatorial problem
  as the cache grows, reuse becomes tedious
The rollup combinatorial problem
     Gender   Country   Sales
       M       USA       7
       M      CANADA     8
       F       USA       4
       F      CANADA     2

              Country   Sales
               USA       11
              CANADA     10
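
The simple case on this slide can be sketched in a few lines of Python: a cached segment at the (Gender, Country) grain is summed down to the Country grain. The dictionary layout and the `rollup` helper are assumptions for illustration. The shortcut is only valid for additive measures, and only when the segment holds every value of the column being dropped.

```python
from collections import defaultdict

# Cached segment at the (Gender, Country) grain, as on the slide.
COLUMNS = ("Gender", "Country")
SEGMENT = {
    ("M", "USA"): 7,
    ("M", "CANADA"): 8,
    ("F", "USA"): 4,
    ("F", "CANADA"): 2,
}

def rollup(segment, columns, keep):
    """Sum away every column that is not listed in `keep`."""
    positions = [columns.index(name) for name in keep]
    totals = defaultdict(int)
    for key, value in segment.items():
        totals[tuple(key[i] for i in positions)] += value
    return dict(totals)

if __name__ == "__main__":
    print(rollup(SEGMENT, COLUMNS, ("Country",)))
```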
The rollup combinatorial problem
  Cached segments:

  Gender    Country   Sales
    M        USA       7
    M       CANADA     8

  Gender    Country   Sales
    F        USA       5

  Age       Country   Sales
  16 - 25    USA       2
  26 - 40   CANADA     3
  26 - 40    USA       5

  Age       Country   Cost
  41 - 56    USA       5

  City        Sales
  Montreal     6
  Quebec       1
  Ottawa       8
  Vancouver    2
  Toronto      5

  Requested rollup:

  Country   Sales
     ?        ?
     ?        ?
PoSet & BitKeys

● Represent the levels / values as bitkeys
  because bitkeys are fast, remember?


● The PartiallyOrderedSet
  a hierarchical hash set where elements might
  or might not be related to one another.
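
A minimal sketch of why bitkeys help, assuming each column is given one bit position. A cached segment can only be rolled up to answer a request if it is constrained on at least the requested columns, and that superset test becomes a single bitwise AND. The column numbering, the segment names and the helpers below are hypothetical. This is not Mondrian's BitKey API.

```python
# Hypothetical assignment of one bit per column.
COLUMN_BITS = {"gender": 0, "country": 1, "age": 2, "city": 3}

def bitkey(columns):
    """Encode a set of column names as an integer bit mask."""
    key = 0
    for name in columns:
        key |= 1 << COLUMN_BITS[name]
    return key

def is_superset(candidate, requested):
    """True when `candidate` has every bit that `requested` has."""
    return (candidate & requested) == requested

def rollup_candidates(cached, requested_columns):
    """Names of cached segments whose columns include all requested ones."""
    wanted = bitkey(requested_columns)
    return [name for name, key in cached.items() if is_superset(key, wanted)]

CACHED = {
    "by_gender_country": bitkey(["gender", "country"]),
    "by_age_country": bitkey(["age", "country"]),
    "by_city": bitkey(["city"]),
}

if __name__ == "__main__":
    print(rollup_candidates(CACHED, ["country"]))
```

The list comprehension still scans every cached key. Ordering the keys in a partially ordered set is what lets a lookup skip unrelated segments.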
PoSet & BitKeys

● An example application
  finding all primes in a set of integers
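
One way to read this example, offered as an assumption rather than the talk's actual code: order the integers by divisibility, so that a is "below" b when a divides b. In a contiguous range starting at 2, the elements with nothing below them are exactly the primes. The naive `minimal_elements` helper is quadratic. A real partially ordered set structure exists precisely to avoid comparing every pair.

```python
def divides(a, b):
    """Partial order used here: a is below b when a divides b."""
    return b % a == 0

def minimal_elements(values, is_below):
    """Elements that have no other element below them in the partial order."""
    values = list(values)
    return [
        v for v in values
        if not any(other != v and is_below(other, v) for other in values)
    ]

if __name__ == "__main__":
    print(minimal_elements(range(2, 31), divides))
```

Note that 9 and 15 are unrelated in this order (neither divides the other), which is the "might or might not be related" property of a partially ordered set.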
a scalable threading
              model
Concurrent cache access

● Usage of phases
  peek -> load -> rinse & repeat


● A scalable threading model
  thread safety without locks and blocks
A scalable threading model

● Do things once. Do them right.
  the actor pattern
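
A minimal sketch of the actor pattern in Python, under the assumption that one dedicated thread owns the cache and every other thread talks to it through a mailbox. The class name `CacheActor` and the `loader` callback are hypothetical. The mailbox queue synchronises internally, but the cache dictionary itself is only ever touched by the actor thread, so it needs no locks of its own.

```python
import queue
import threading
from concurrent.futures import Future

class CacheActor:
    """Owns the cache; every read and write runs on its single thread."""

    def __init__(self, loader):
        self._loader = loader          # hypothetical function: key -> value
        self._cache = {}
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Messages are handled strictly one at a time, in arrival order.
        while True:
            message = self._mailbox.get()
            if message is None:        # shutdown signal
                break
            key, reply = message
            try:
                if key not in self._cache:     # each key is loaded only once
                    self._cache[key] = self._loader(key)
                reply.set_result(self._cache[key])
            except Exception as exc:
                reply.set_exception(exc)

    def get(self, key):
        """Callable from any thread; returns a Future holding the value."""
        reply = Future()
        self._mailbox.put((key, reply))
        return reply

    def stop(self):
        """Ask the actor thread to finish and wait for it."""
        self._mailbox.put(None)
        self._thread.join()
```

A caller would use it as `actor.get("some key").result()`. Concurrent requests for the same key trigger a single load, which is the "do things once" part of the slide.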
a scalable cache
management strategy
Operating by deltas

● All part of a whole
  implicit relation between the dimensions


● Why deltas are necessary
  reducing IO
Cache management

● A data block is a complex object
    Schema:[FoodMart]
    Checksum:[9cca66327439577753dd5c3144ab59b5]
    Cube:[Sales]
    Measure:[Unit Sales]
    Axes:[
         {time_by_day.the_year=(*)}
         {time_by_day.quarter=('Q1', 'Q2')}
         {product_class.product_family=('Bread', 'Soft Drinks')}]
    Excluded Regions:[
         {time_by_day.quarter=('Q1')}
         {time_by_day.the_year=('1997')}]
    Compound Predicates:[]
    ID:[9c8ba4ec39678526f4100506994c384183cd205d19dd142eae76a9fb1d74cab7]
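
A rough model of such a block header in Python, to show how an excluded region lets part of a block be invalidated while the rest stays usable. The class `SegmentHeader`, its fields and the `covers` method are assumptions for this sketch, and the excluded region is read here as a single region that applies when all of its column predicates match (year 1997 and quarter Q1).

```python
from dataclasses import dataclass, field

ANY = None  # stands for the (*) wildcard in the listing above

@dataclass
class SegmentHeader:
    cube: str
    measure: str
    axes: dict                                     # column -> set of values, or ANY
    excluded: dict = field(default_factory=dict)   # one region: column -> set of values

    def covers(self, cell):
        """True when the block can answer for this cell (column -> value)."""
        for column, allowed in self.axes.items():
            if allowed is not ANY and cell.get(column) not in allowed:
                return False
        # A cell inside the excluded region was invalidated by a delta.
        if self.excluded and all(
            cell.get(column) in values for column, values in self.excluded.items()
        ):
            return False
        return True

HEADER = SegmentHeader(
    cube="Sales",
    measure="Unit Sales",
    axes={
        "the_year": ANY,
        "quarter": {"Q1", "Q2"},
        "product_family": {"Bread", "Soft Drinks"},
    },
    excluded={"quarter": {"Q1"}, "the_year": {"1997"}},
)

if __name__ == "__main__":
    cell = {"the_year": "1997", "quarter": "Q2", "product_family": "Bread"}
    print(HEADER.covers(cell))
```

Answering "does this block contain that cell?" takes a predicate over the header, which a plain hash of the header cannot provide.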
a scalable sharing
          strategy
Shared Caches

● OLAP and key-value stores
  don't like each other
  OLAP requires a complex key. a hash is insufficient.


● Remember the "deltas" strategy?
  partially invalidating a block of data would break the hash
Data grids & OLAP

● Well suited for OLAP caches
  supports "rich" keys


● Distributed and redundant
  if a node goes offline, the cache data is not lost


● In-memory grids are fast
  multiplies the available heap space
a case study
Advertising data analysis

   Interactive behavioral targeting of
         advertising in real time
Advertising data analysis

● Low latency
  the end users don't want to wait for MapReduce jobs


● Scalability a huge factor
  we're talking petabytes of data here
Advertising data analysis

● Queries are not static
  we can't tell upfront what will be computed


● Deployed in datacenters worldwide
  the hashing strategy must allow "smart" data distribution


● Almost all open source
[Architecture diagram. Components: Big Data Store; four Logs / ETL pipelines; Message Queue; Analytical DB; OLAP with its OLAP Cache; XML/A endpoint (olap4j); Load Balancer; Client App (olap4j); ETL Designer; Monitoring & Management]
● A query
  -   UI sends MDX to a SOAP service.
  -   load balancer dispatches the query.
  -   OLAP layer uses its data sources and aggregates.
  -   query is answered

[Diagram highlight: Client App (olap4j), Load Balancer, XML/A (olap4j), OLAP, OLAP Cache, Analytical DB]
● An update - Strategy #1
  -   the ETL process updates the analytical DB.
  -   a cache delta is sent to a message queue.
  -   OLAP processes the message.
  -   OLAP uses its index to spot the regions to invalidate.
  -   aggregated cache is updated incrementally.

[Diagram highlight: Big Data Store, Logs / ETL, Analytical DB, Message Queue, OLAP, OLAP Cache]
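
The message-driven update can be sketched as follows, with a plain `queue.Queue` standing in for the message queue. The `SegmentIndex` class, the shape of a delta message (a mapping of column to changed value) and the segment ids are assumptions for illustration. For brevity the sketch evicts the affected segments instead of patching their aggregates in place.

```python
import queue
from collections import defaultdict

class SegmentIndex:
    """Maps (column, value) pairs to the ids of cached segments holding them."""

    def __init__(self):
        self._by_value = defaultdict(set)
        self.segments = {}

    def add(self, segment_id, axes, data):
        """Register a segment; `axes` maps each column to its set of values."""
        self.segments[segment_id] = data
        for column, values in axes.items():
            for value in values:
                self._by_value[(column, value)].add(segment_id)

    def affected_by(self, delta):
        """Ids of live segments that hold every (column, value) in the delta."""
        hits = [self._by_value.get(item, set()) for item in delta.items()]
        if not hits:
            return set()
        return set.intersection(*hits) & set(self.segments)

    def invalidate(self, segment_ids):
        """Drop the given segments from the cache."""
        for segment_id in segment_ids:
            self.segments.pop(segment_id, None)

def process_messages(mailbox, index):
    """Drain the queue, evicting whatever each delta touches."""
    evicted = set()
    while True:
        try:
            delta = mailbox.get_nowait()
        except queue.Empty:
            return evicted
        hit = index.affected_by(delta)
        index.invalidate(hit)
        evicted |= hit
```

The index lookup replaces a full scan of the cache: only segments sharing every changed value are inspected.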
● An update - Strategy #2
  -   ETL updates the analytical DB.
  -   ETL acts directly on the OLAP cache.
  -   OLAP processes events from its cache.
  -   OLAP updates its index

[Diagram highlight: Big Data Store, Logs / ETL, Analytical DB, OLAP, OLAP Cache]
a stack built on open
            standards



   (get ready, the next slide will hurt your brains)
[Stack diagram: two Client Apps and a load balancer in front of three identical server columns. Layers, top to bottom:]

Java
  Client App
  olap4j-xmla
HTTP (XMLA)
  load balancer
  olap4j server
  olap4j
JDBC
  jdbc connection pool
  olap4j impl
  Mondrian server manager
Java
  Mondrian cache manager
  infinispan
UDP (Hot Rod)
  infinispan data grid
the UI
Yahoo! Cocktails

● A Node.js implementation
  runs on Manhattan
  JS hosted execution


● Mojito
  client application framework


● Works both online / offline
the OLAP service
olap4j-xmla / olap4j-server

● JDBC for OLAP
  extension to JDBC. became the de facto standard.

● A Java toolkit for OLAP
  -   MDX parser / validator
  -   a rich type system / MDX object model
  -   driver specification
  -   programmatic query models
  -   olap4j to XMLA bridge
the OLAP layer
Mondrian

● Developed by Pentaho Corp.
  used worldwide. pure java. open source.

● Highly extensible
  exposes many APIs & SPIs for enterprise integration.


● ROLAP / MOLAP hybrid
  uses the best of what's available.


● Extensible MDX parser
  new MDX functions can be created for specific business domains.
the OLAP cache
Stuff that didn't work

● memcached
  ○ doesn't have an index.
  ○ enforces random TTLs.
  ○ a hash key is not enough

● simple Java collections
Infinispan

● Developed for JBoss AS
  well tested.
● UDP Multicast
  nodes can join and leave the cluster as needed.


● Can distribute the processing
  jobs can be distributed and run on the nodes.


● Serializes rich objects
  the contents can be read from APIs.
the analytical DB layer
Oracle

● Cluster of instances
  partitioned Oracle nodes

● Why Oracle?
  because their DBAs are good enough with Oracle
  to get it to run properly under such a load
Other options

● An analytics-oriented DB
  use of Vectorwise, Vertica, MonetDB, Greenplum, ...

● Column stores
  Column stores scale marvelously and are well
  suited for analytics
the Big Data layer
Big Data Layer

● Homebrew Java MapReduce

● 42 000 nodes

● ETL processes managed with Pig

● A keynote in itself
  (see the resources at the end for a keynote
  from Scott Burke, Senior VP of Yahoo!)
some numbers
Final processing capacity

● Big Data layer

  ○   140 petabytes
  ○   500 users
  ○   42 000 nodes
  ○   10 000 000 hours of CPU time usage per day
  ○   100 000 000 000 records per day
Final processing capacity

● Analytical DB layer

  ○ 50 terabytes
  ○ 100s of tables
     (heavy use of the snowflake schema)
  ○ 1 000 000 000 new rows per day
Final processing capacity

● OLAP layer

  ○   10s of Mondrian instances
  ○   10s of cubes
  ○   100s of dimensions
  ○   1 000s of levels
  ○   1 000 000s of members per level
  ○   1 000 000 000s of facts per day
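
To put these daily figures into per-second terms, a short calculation in Python. It only divides the numbers quoted on the slides and assumes the load is spread evenly over the day and over the nodes, which real traffic will not be.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Big Data layer: 100 000 000 000 records per day.
records_per_second = 100_000_000_000 / SECONDS_PER_DAY

# Analytical DB layer: 1 000 000 000 new rows per day.
rows_per_second = 1_000_000_000 / SECONDS_PER_DAY

# Big Data layer: 10 000 000 CPU hours per day across 42 000 nodes.
cpu_hours_per_node = 10_000_000 / 42_000
busy_cores_per_node = cpu_hours_per_node / 24

if __name__ == "__main__":
    print(f"{records_per_second:,.0f} records/s into the Big Data layer")
    print(f"{rows_per_second:,.0f} rows/s into the analytical DB")
    print(f"{cpu_hours_per_node:.1f} CPU hours per node per day, "
          f"about {busy_cores_per_node:.1f} cores busy around the clock")
```

On average that is over a million records per second entering the Big Data layer, reduced to roughly one hundredth of that rate by the time rows reach the analytical DB.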
skunkworks



(future stuff you might care about)
Mondrian over Google's BigQuery

● Big Data as a service
  upload CSVs & other formats to an ad-hoc cluster


● No code required
  MapReduce jobs usually require you to code them
Pentaho Instaview

● Interactive data discovery for Big Data
  fully integrated ETL / OLAP.
  all you need is a URL and a user / password.


● A rich UI environment for data
  drag & drop.
  full OLAP support.
  mobile.


● Open source
resources
 Mondrian - The open source analytics engine
                      mondrian.pentaho.org

  olap4j - The open standard for OLAP in Java
                                   olap4j.org

Infinispan - The distributed data grid platform
                           jboss.org/infinispan

Scott Burke, SVP Advertising & Data @ Yahoo!
             Keynote of Hadoop Summit 2012
         youtube.com/watch?v=mR30psmuIPo
resources
                                Pentaho Instaview
pentahobigdata.com/ecosystem/capabilities/instaview
big thanks

            On Twitter: @luclemagnifique

On the blogosphere: devdonkey.blogspot.ca

Big data & frameworks: no book for you anymore
 
Raft Engine Meetup 220702.pdf
Raft Engine Meetup 220702.pdfRaft Engine Meetup 220702.pdf
Raft Engine Meetup 220702.pdf
 

Recently uploaded

Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 

Recently uploaded (20)

Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 

Olap scalability

  • 1. Online Analytical Processing of Large Distributed Databases Luc Boudreau Lead Engineer, Pentaho Corporation
  • 3. "It's all about data movement and operating on that data on the fly"
  • 6. Relational Databases ● Static schema ● Minimized redundancy ● Referential integrity ● Transactional
  • 7. Classic RDBMS internals ● "Shared Everything" paradigm ● Private planner ● Multiple private processors ● Multiple private data stores [Diagram: a PLANNER / SCHEDULER box above three PROCESSOR boxes]
  • 8. What RDBMS are for ● Operational data ● Normalized models ● Statically typed data
  • 9. What RDBMS are NOT for ● "Full Scan" Aggregated Computations ● Multi-dimensional queries (think pivot) ● Unstructured data
  • 10. OK so how's that different from Big Data platforms?
  • 11. Big Data - More than a buzzword (although sometimes it's hard to tell...) Big Data is not a product. It is an architecture.
  • 12. Big Data - More than a buzzword (although sometimes it's hard to tell...) A schema-less distributed storage and processing model for data.
  • 13. Big Data ● Schema-less ○ Programmatic queries ○ "Map" of MapReduce ● High redundancy ○ Distributed processing ○ "Reduce" of MapReduce
  • 14. Big Data ● No referential integrity ● Non-transactional ● High latency
  • 15. Classic Big Data internals ● "Share nothing" paradigm ● Push the processing closer to the data ● The query defines the schema [Diagram: a SCHEDULER box above three PROCESSOR boxes]
  • 16. What Big Data is for ● Unstructured data: keep everything ● Distributed file system: great for archiving ● Data is fixed: only the process evolves
  • 17. What Big Data is for ● Ludicrous amounts of data: keep everything, remember? ● Made on the cheap: each processing unit is commodity hardware
  • 18. What Big Data is NOT for ● Low latency applications: arbitrary exploration of the data is close to impossible ● End-users: writing code is easy. Writing good code is hard. ● Replacing your operational DB
  • 19. Some more limitations ● No structured query language: exploration is tedious ● Accuracy & exactitude: the burden is put on the end user / query designer ● No query optimizer: cannot optimize at runtime. Does exactly what you tell it to.
  • 20. why is this so similar to NoSQL?
  • 21. First, defining NoSQL... ● NoSQL: the thing named after what it lacks, which has as many definitions as there are products (and which usually turns out to be some sort of key-value store)
  • 22. Why "NoSQL"? Why all the hate?! ● Historical reasons ○ Wrong technological choices ○ Blind faith in RDBMS scalability ○ General wishful thinking and voodoo magic
  • 23. Why "NoSQL"? Why all the hate?! ● "SQL" itself was never the issue ● NoSQL projects are implementing SQL-like query languages
  • 25. Current efforts ● Straight SQL implementations: Greenplum (straight SQL on top of Big Data); Hive JDBC (a hybrid of DSL & SQL) ● The Splunk approach: SQL with missing columns ● Runtime query optimizers: Optiq framework (SQL with Big Data federated sources)
  • 26. isn't there something better than SQL for analytics?
  • 28. Widely used. Little known. ● Your favorite corporate dashboards ● Google Analytics & other ad-hoc tools
  • 29. Analytics centric language ● Multidimensional Expressions (MDX): a powerful query language for analytics ● Forget about rows and columns: as many axes as you need ● Slice & dice: start from everything, progressively focus only on relevant data
  • 30. Business domain driven ● Hierarchical view of a multidimensional universe
  • 31. An example: What are my total sales for the current year, per month, for male customers?
      with member [Measures].[Accumulated Sales] as 'Sum(YTD(), [Measures].[Store Sales])'
      select
        {[Measures].[Accumulated Sales]} on columns,
        {Descendants([Time].[1997], [Time].[Month])} on rows
      from [Sales]
      where ([Customer].[Gender].[M])
  • 32. how does that work?
  • 33. Analytics data modeling ● A denormalized model for performance: the data is modeled for read operations, not write ● High redundancy: because sometimes more is better
  • 36. Different OLAP servers. Different beasts.
  • 37. Relational OLAP (ROLAP) ● Backed by a relational database: think of an MDX-to-SQL bridge. The aggregated data can be cached in-memory or on-disk. ● Relies heavily on the RDBMS performance: figures out at runtime the proper optimizations
  • 38. Memory OLAP (MOLAP) ● Loads everything in RAM ● Relies on an efficient ETL platform
  • 39. Other OLAP ● On-disk aggregated data files: think SAS. Cubes are compiled into data files on disk. ● Simple bridges: convert MDX straight to SQL, with limited support of MDX syntax.
  • 40. how do they compare?
  • 41. (there are no straight answers, sorry)
  • 42. Where the data lives matters
      Operation                                 Latency (ns)
      L1 Cache Reference                                 0.5
      Branch Mispredict                                    5
      L2 Cache Reference                                   7
      Mutex lock/unlock                                   25
      Main memory reference                              100
      Compress 1K bytes w/ cheap algorithm             3 000
      Send 2K bytes over 1 Gbps network               20 000
      Read 1 MB sequentially from memory             250 000
      Round trip within same datacenter              500 000
      Disk seek                                   10 000 000
      Read 1 MB sequentially from disk            20 000 000
      Send packet CA -> Netherlands -> CA        150 000 000
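One way to read the table is to add up the steps of two ways of fetching the same megabyte. The sketch below uses only figures from the slide; the two cost models (one round trip plus a memory read, one seek plus a disk read) are simplifying assumptions, not measurements.

```java
class LatencySketch {
    // Figures copied from the slide, all in nanoseconds.
    static final long READ_1MB_FROM_MEMORY_NS = 250_000L;
    static final long DATACENTER_ROUND_TRIP_NS = 500_000L;
    static final long DISK_SEEK_NS = 10_000_000L;
    static final long READ_1MB_FROM_DISK_NS = 20_000_000L;

    /** Assumed model: one network round trip plus a sequential read from the remote node's RAM. */
    static long remoteMemoryReadNs() {
        return DATACENTER_ROUND_TRIP_NS + READ_1MB_FROM_MEMORY_NS;
    }

    /** Assumed model: one disk seek plus a sequential read from the local disk. */
    static long localDiskReadNs() {
        return DISK_SEEK_NS + READ_1MB_FROM_DISK_NS;
    }

    public static void main(String[] args) {
        System.out.println("1 MB from a remote node's memory: " + remoteMemoryReadNs() + " ns");
        System.out.println("1 MB from the local disk:         " + localDiskReadNs() + " ns");
        System.out.println("ratio: " + (localDiskReadNs() / remoteMemoryReadNs()) + "x");
    }
}
```

Under these assumptions, memory on another machine in the same datacenter is still much closer than the local disk, which is the intuition behind keeping aggregates in an in-memory grid.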
  • 43. Optimizing for CPU ● Java NIO blocks: use extremely compact chunks of 64 bits. ● Primitive types: use "int" instead of "Integer" ● BitKeys: because they are naturally CPU friendly
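A minimal sketch of why a bit key is CPU friendly, assuming each column of a cube is given one bit position. The column names and the `isSuperSetOf` helper are hypothetical and are not taken from Mondrian's own classes; a single `long` also caps the sketch at 64 columns.

```java
class BitKeySketch {
    // Hypothetical bit positions, one per column of a cube.
    static final long GENDER = 1L << 0;
    static final long COUNTRY = 1L << 1;
    static final long YEAR = 1L << 2;

    /** True when key contains every column of other: one AND and one compare. */
    static boolean isSuperSetOf(long key, long other) {
        return (key & other) == other;
    }

    /** Number of columns set in the key. */
    static int columnCount(long key) {
        return Long.bitCount(key);
    }

    public static void main(String[] args) {
        long cached = GENDER | COUNTRY; // a cached block keyed on gender and country
        long wanted = COUNTRY;          // a query that only needs country
        System.out.println(isSuperSetOf(cached, wanted)); // true
        System.out.println(isSuperSetOf(wanted, cached)); // false
        System.out.println(columnCount(cached));          // 2
    }
}
```

The set comparison costs a couple of machine instructions on primitive values, with no boxing and no pointer chasing.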
  • 44. Optimizing for memory ● Hard limits on the heap space: must pay attention to the total memory usage. ● Inherent limitations: there can only be so many individual pointers on heap.
  • 45. Optimizing for networking ● Payload optimization: batching. deltas. ● Manageability: turning nodes on & off.
  • 46. Optimizing for disk ● Concurrent access: must carefully manage disk IO. ● Inherently slooooow
  • 47. how to deal with these issues?
  • 49. Cache indexing ● Linear performance is not good enough: as n grows, a full scan takes O(n) ● The rollup combinatorial problem: as the cache grows, reuse becomes tedious
  • 50. The rollup combinatorial problem
      Cached segment:
      Gender  Country  Sales
      M       USA          7
      M       CANADA       8
      F       USA          4
      F       CANADA       2
      Rolled up by country:
      Country  Sales
      USA         11
      CANADA      10
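The rollup on this slide can be sketched in a few lines: the cached Gender x Country cells are summed over gender to answer a Country-only query without going back to the database. The `Cell` class and `rollupByCountry` method are illustrative names, not part of any library.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RollupSketch {
    /** One cached cell: a (gender, country) coordinate and its sales value. */
    static final class Cell {
        final String gender;
        final String country;
        final int sales;

        Cell(String gender, String country, int sales) {
            this.gender = gender;
            this.country = country;
            this.sales = sales;
        }
    }

    /** Drops the gender dimension by summing sales per country. */
    static Map<String, Integer> rollupByCountry(List<Cell> cells) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Cell cell : cells) {
            totals.merge(cell.country, cell.sales, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // The four cells from the slide.
        List<Cell> cached = Arrays.asList(
            new Cell("M", "USA", 7),
            new Cell("M", "CANADA", 8),
            new Cell("F", "USA", 4),
            new Cell("F", "CANADA", 2));
        System.out.println(rollupByCountry(cached)); // {USA=11, CANADA=10}
    }
}
```

The hard part is not the sum itself but deciding which cached blocks can be combined to cover the request, which is where the combinatorics come from.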
  • 51. The rollup combinatorial problem
      Cached segments:
      Gender  Country  Sales
      M       USA          7
      M       CANADA       8

      Gender  Country  Sales
      F       USA          5

      City       Sales
      Montreal       6
      Quebec         1
      Ottawa         8
      Vancouver      2
      Toronto        5

      Age      Country  Sales
      16 - 25  USA          2
      26 - 40  CANADA       3

      Age      Country  Cost
      41 - 56  USA         5
      26 - 40  USA         5

      Requested:
      Country  Sales
      ?        ?
      ?        ?
  • 52. PoSet & BitKeys ● Represent the levels / values as bitkeys: because bitkeys are fast, remember? ● The PartiallyOrderedSet: a hierarchical hash set where elements might or might not be related to one another.
  • 53. PoSet & BitKeys ● An example application: finding all primes in a set of integers
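The primes example can be illustrated with a naive sketch of a partial order. It assumes the set is the contiguous range 2..30 and orders integers by divisibility, so the primes are exactly the elements with nothing below them. This is a quadratic scan written for clarity; it is not the PartiallyOrderedSet class itself, which avoids the full scan by keeping parent/child links between elements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

class PosetSketch {
    /** Returns the elements that have no other element below them in the given partial order. */
    static <E> List<E> minimalElements(List<E> elements, BiPredicate<E, E> lessOrEqual) {
        List<E> result = new ArrayList<>();
        for (E candidate : elements) {
            boolean hasSmaller = false;
            for (E other : elements) {
                if (!other.equals(candidate) && lessOrEqual.test(other, candidate)) {
                    hasSmaller = true;
                    break;
                }
            }
            if (!hasSmaller) {
                result.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> numbers = new ArrayList<>();
        for (int i = 2; i <= 30; i++) {
            numbers.add(i);
        }
        // a is "below" b when a divides b; many pairs (such as 4 and 6) are unrelated.
        BiPredicate<Integer, Integer> divides = (a, b) -> b % a == 0;
        System.out.println(minimalElements(numbers, divides));
        // [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
    }
}
```

Swapping the divisibility test for a bit-key subset test gives the same kind of ordering over cached blocks, where a block keyed on more columns can serve requests for fewer.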
  • 55. Concurrent cache access ● Usage of phases: peek -> load -> rinse & repeat ● A scalable threading model: thread safety without locks and blocks
  • 56. A scalable threading model ● Do things once. Do them right: the actor pattern
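A minimal sketch of the actor idea, assuming a single-threaded executor as the actor's mailbox: only that one thread ever touches the map, so callers need no locks and simply receive a `Future`. The class and method names are hypothetical and do not reflect how Mondrian implements its cache manager.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class CacheActorSketch {
    // State owned by the actor thread; never read or written from any other thread.
    private final Map<String, Integer> cache = new HashMap<>();
    // The mailbox: messages are processed one at a time, in submission order.
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor();

    /** Sends a "store this value" message to the actor. */
    Future<?> put(String key, int value) {
        Runnable message = () -> cache.put(key, value);
        return mailbox.submit(message);
    }

    /** Sends a "look this up" message; the Future completes with the value or null. */
    Future<Integer> get(String key) {
        Callable<Integer> message = () -> cache.get(key);
        return mailbox.submit(message);
    }

    void shutdown() {
        mailbox.shutdown();
    }

    public static void main(String[] args) throws Exception {
        CacheActorSketch actor = new CacheActorSketch();
        actor.put("USA", 11);
        actor.put("CANADA", 10);
        System.out.println(actor.get("USA").get()); // 11
        actor.shutdown();
    }
}
```

Because the mailbox is a FIFO queue drained by one thread, the `get` above is guaranteed to observe the earlier `put` without any synchronization in the calling code.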
  • 58. Operating by deltas ● All part of a whole: implicit relation between the dimensions ● Why deltas are necessary: reducing IO
  • 59. Cache management ● A data block is a complex object
      Schema:[FoodMart]
      Checksum:[9cca66327439577753dd5c3144ab59b5]
      Cube:[Sales]
      Measure:[Unit Sales]
      Axes:[
        {time_by_day.the_year=(*)}
        {time_by_day.quarter=('Q1', 'Q2')}
        {product_class.product_family=('Bread', 'Soft Drinks')}]
      Excluded Regions:[
        {time_by_day.quarter=('Q1')}
        {time_by_day.the_year=('1997')}]
      Compound Predicates:[]
      ID:[9c8ba4ec39678526f4100506994c384183cd205d19dd142eae76a9fb1d74cab7]
  • 60. a scalable sharing strategy
  • 61. Shared Caches ● OLAP and key-value stores don't like each other: OLAP requires a complex key. A hash is insufficient. ● Remember the "deltas" strategy? Partially invalidating a block of data would break the hash
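To illustrate why a hash alone is insufficient, here is a sketch of a "rich" key modeled loosely on the data block from slide 59. A hash can only answer "do I have exactly this block?", while a structured key can answer "does this block cover the cell I need?". The class, its fields and the `covers` method are hypothetical, and the sketch deliberately ignores excluded regions and compound predicates.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class SegmentKeySketch {
    final String cube;
    final String measure;
    // Column name -> accepted values; an empty set stands for the wildcard (*).
    final Map<String, Set<String>> axes = new LinkedHashMap<>();

    SegmentKeySketch(String cube, String measure) {
        this.cube = cube;
        this.measure = measure;
    }

    SegmentKeySketch axis(String column, String... values) {
        axes.put(column, new HashSet<>(Arrays.asList(values)));
        return this;
    }

    /** True when the cell pins every axis of this block to a value the block accepts. */
    boolean covers(Map<String, String> cell) {
        for (Map.Entry<String, Set<String>> axis : axes.entrySet()) {
            String requested = cell.get(axis.getKey());
            if (requested == null) {
                return false;
            }
            Set<String> accepted = axis.getValue();
            if (!accepted.isEmpty() && !accepted.contains(requested)) {
                return false;
            }
        }
        return true;
    }

    static Map<String, String> cell(String year, String quarter, String family) {
        Map<String, String> cell = new HashMap<>();
        cell.put("time_by_day.the_year", year);
        cell.put("time_by_day.quarter", quarter);
        cell.put("product_class.product_family", family);
        return cell;
    }

    public static void main(String[] args) {
        SegmentKeySketch key = new SegmentKeySketch("Sales", "Unit Sales")
            .axis("time_by_day.the_year")
            .axis("time_by_day.quarter", "Q1", "Q2")
            .axis("product_class.product_family", "Bread", "Soft Drinks");
        System.out.println(key.covers(cell("1998", "Q2", "Bread"))); // true
        System.out.println(key.covers(cell("1998", "Q3", "Bread"))); // false
    }
}
```

A store that only sees an opaque hash of this object cannot run the `covers` test, and shrinking the block (for example, excluding Q1 after an update) would change the hash and orphan the entry.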
  • 62. Data grids & OLAP ● Well suited for OLAP caches: supports "rich" keys ● Distributed and redundant: if a node goes offline, the cache data is not lost ● In-memory grids are fast: multiplies the available heap space
  • 64. Advertising data analysis: interactive behavioral targeting of advertising in real time
  • 65. Advertising data analysis ● Low latency: the end users don't want to wait for MapReduce jobs ● Scalability is a huge factor: we're talking petabytes of data here
  • 66. Advertising data analysis ● Queries are not static: we can't tell upfront what will be computed ● Deployed in datacenters worldwide: the hashing strategy must allow "smart" data distribution ● Almost all open source
  • 67. [Architecture diagram. Components: Client App (olap4j), Load Balancer (XML/A), OLAP nodes (olap4j), OLAP Cache, Analytical DB, ETL, Message Queue, Big Data Store, Logs, Monitoring & Management, ETL Designer]
  • 68. ● A query - UI sends MDX to a SOAP service. - load balancer dispatches the query. - OLAP layer uses its data sources and aggregates. - query is answered. [Diagram components: Client App (olap4j), Load Balancer (XML/A), OLAP (olap4j), OLAP Cache, Analytical DB]
  • 69. ● An update - Strategy #1 - the ETL process updates the analytical DB. - a cache delta is sent to a message queue. - OLAP processes the message. - OLAP uses its index to spot the regions to invalidate. - aggregated cache is updated incrementally. [Diagram components: Logs, Big Data Store, ETL, Analytical DB, Message Queue, OLAP, OLAP Cache]
  • 70. ● An update - Strategy #2 - ETL updates the analytical DB. - ETL acts directly on the OLAP cache. - OLAP processes events from its cache. - OLAP updates its index. [Diagram components: Logs, Big Data Store, ETL, Analytical DB, OLAP, OLAP Cache]
  • 71. a stack built on open standards (get ready, the next slide will hurt your brains)
  • 72. [Stack diagram. Layers as labeled: Java client apps with olap4j-xmla; load balancer; HTTP (XMLA); three olap4j servers, each with olap4j and a JDBC connection pool; olap4j impl; Mondrian server manager; Mondrian cache manager; infinispan on each node; UDP (Hot Rod); infinispan data grid]
  • 74. Yahoo! Cocktails (Client App) ● A Node.js implementation: runs on Manhattan JS hosted execution ● Mojito: client application framework ● Works both online / offline
  • 76. olap4j-xmla / olap4j-server ● JDBC for OLAP: extension to JDBC. Became the de facto standard. ● A Java toolkit for OLAP - MDX parser / validator - a rich type system / MDX object model - driver specification - programmatic query models - olap4j to XMLA bridge
  • 78. Mondrian (OLAP) ● Developed by Pentaho Corp.: used worldwide. Pure Java. Open source. ● Highly extensible: exposes many APIs & SPIs for enterprise integration. ● ROLAP / MOLAP hybrid: uses the best of what's available. ● Extensible MDX parser: new MDX functions can be created for specific business domains.
  • 80. Stuff that didn't work (OLAP Cache) ● memcached ○ doesn't have an index. ○ enforces random TTLs. ○ a hash key is not enough ● simple Java collections
  • 81. Infinispan (OLAP Cache) ● Developed for JBoss AS: well tested. ● UDP Multicast: nodes can join and leave the cluster as needed. ● Can distribute the processing: jobs can be distributed and run on the nodes. ● Serializes rich objects: the contents can be read from APIs.
  • 83. Oracle (Analytical DB) ● Cluster of instances: partitioned Oracle nodes ● Why Oracle? Because their DBAs are good enough with Oracle to get it to run properly under such a load
  • 84. Other options (Analytical DB) ● An analytical oriented DB: use of Vectorwise, Vertica, MonetDB, Greenplum, ... ● Column stores: column stores scale marvelously and are well suited for analytics
  • 85. the Big Data layer
  • 86. Big Data Layer ● Homebrew Java MapReduce ● 42 000 nodes ● ETL processes managed with Pig ● A keynote in itself (see the resources at the end for a keynote from Scott Burke, Senior VP of Yahoo!) [Diagram components: Logs, Big Data Store, ETL]
  • 88. Final processing capacity ● Big Data layer ○ 140 petabytes ○ 500 users ○ 42 000 nodes ○ 10 000 000 hours of CPU time usage per day ○ 100 000 000 000 records per day
  • 89. Final processing capacity ● Analytical DB layer ○ 50 terabytes ○ 100s of tables (heavy use of the snowflake schema) ○ 1 000 000 000 new rows per day
  • 90. Final processing capacity ● OLAP layer ○ 10s of Mondrian instances ○ 10s of cubes ○ 100s of dimensions ○ 1 000s of levels ○ 1 000 000s of members per level ○ 1 000 000 000s of facts per day
  • 91. skunkworks (future stuff you might care about)
  • 92. Mondrian over Google's BigQuery ● Big Data as a service: upload CSVs & other formats to an ad-hoc cluster ● No code required: MapReduce jobs usually require you to code them
  • 93. Pentaho Instaview ● Interactive data discovery for Big Data: fully integrated ETL / OLAP. All you need is a URL and a user / password. ● A rich UI environment for data: drag & drop. Full OLAP support. Mobile. ● Open source
  • 94. resources ● Mondrian - The open source analytics engine: mondrian.pentaho.org ● olap4j - The open standard for OLAP in Java: olap4j.org ● Infinispan - The distributed data grid platform: jboss.org/infinispan ● Scott Burke, SVP Advertising & Data @ Yahoo!, Keynote of Hadoop Summit 2012: youtube.com/watch?v=mR30psmuIPo
  • 95. resources ● Pentaho Instaview: pentahobigdata.com/ecosystem/capabilities/instaview
  • 96. big thanks ● On Twitter: @luclemagnifique ● On the blogosphere: devdonkey.blogspot.ca