Business Intelligence
    for Big Data
    James Dixon, Chief Geek
         August, 2010




     © 2010, Pentaho. All Rights Reserved. www.pentaho.com.
Business Intelligence =
                 reports, dashboards, analysis,
                   visualization, alerts, auditing



Hadoop and BI
Example Hadoop Cases Today
           Transactional
           • Fraud detection
           • Financial services/stock markets
           Sub-Transactional
           • Weblogs
           • Social/online media
           • Telecoms events



* Not many companies have transactional data that qualifies as Big Data. Credit card companies and financial services companies are about it.

* With stock market data we are talking about every stock trade and the bid and ask prices between the transactions - for every stock on multiple markets over a significant time period.
For many other companies the Big Data is sub-transactional - it is the events that lead up to transactions.

* Weblogs are semi/badly structured. Consider the number of weblog entries created as you look for a book online - researching 5-10 books, reading reviews and comments. You might generate 1,000 entries and may or may not buy a book - potentially lots of entries for no transaction. We also want to enrich this data with metadata about the URLs and information about the location of the user.

* In an online game or world, every interaction between participants and the system, and between the participants themselves, is logged. An individual participant might generate more than 1 million events for their single monthly transaction.

* A single phone call or text message generates many events within a telecoms company
Example Hadoop Cases Today
           Non-Transactional
           • Web pages, blogs etc
           • Documents
           • Physical events
           • Application events
           • Machine events

           In most cases structured or semi-structured


* In addition to transactional and sub-transactional data there is also non-transactional data. Some of this data is machine-generated and some of it is people-generated.

* People generate lots of content that companies are interested in - web pages, blogs, and comments

* Physical events include data such as weather data. If you take the combined output of the weather-sensing instruments deployed today you get
Big Data

* Many software applications log events as they execute, as do machines such as production line machinery

TRANSITION
In the majority of these cases the data is structured or semi-structured.

LEAD-IN
What do we have in common between these use cases? How can we describe these Big Data scenarios?
Data Lake
           •      Single source
           •      Large volume
           •      Not distilled
           •      Can be treated





In most of these cases we are dealing with a single source of data.

We are, we know, dealing with a large volume of data.

We are also dealing with data that is not aggregated or summarized. It's not 'distilled' in any way.

It is a large body of data. The data can be raw, or it might be treated in some way - within the lake or on its way into the lake. For example, weblog entries might be geocoded and enriched with metadata.

So we are calling these things Data Lakes.
Data Lakes
           •      0-2 lakes per company
           •      Known and unknown questions
           •      Multiple user communities
           •      $1-10k questions, not $1m ones
           •      Don’t fit in traditional RDBMS with a
                  reasonable cost





There are some other interesting attributes of these data lakes

* If there is a data lake at all in a company, there is usually only one. Some domains, such as financial services, might have two, but any more than that is very rare.

* In most cases we have some questions of this data that are known ahead of time. But we also have questions of the data that cannot be
anticipated.

* We also frequently have different user communities that want access to the data. In the example of weblogs we have sales and marketing
departments that want to know about the behavior of visitors and the volume of traffic on the site, maybe for different geographies. We also have the
IT department that wants information about throughput and load on the server for capacity planning.

* In general most of the questions about the data are not million-dollar questions; they are $1k to $10k questions. Because no one user or group has a million-dollar question, no one has a million-dollar budget to solve the problem.

* Additionally, this amount of data does not fit into a database, either because it physically will not fit or because the cost of doing so is economically out of reach.
Data Lake Requirements
           •      Store all the data
           •      Satisfy routine reporting and analysis
           •      Satisfy ad-hoc query / analysis / reporting
           •      Balance performance and cost





If we look at the requirements of these data lakes we also see common ground:
* We want to store all of the data because we don't know all the questions we have of the data. If we did know, we'd only have to keep a subset of the data.
* We still want to satisfy all of the traditional BI reporting and analysis needs.
* We need to provide the ability to dip into the lake at any time to ask any question of the data:
- In some cases we want to extract a slice of data from the lake for detailed analysis. Let's say I'm in charge of pricing and promotions for a company, and this week I'm looking at a particular region or a particular product. I want to select a subset of the data from the lake, summarized to some level, with the attributes that I want to analyze. I want to slice and dice this data for a few hours or days, and then move on to my next region or product. In this case we are creating a short-lived data mart from the data lake.
- In other cases we know exactly the data we are looking for and don't need to explore it. In this case we define the attributes of the data that we want and get query results back.
* We also want to balance cost and performance. Big Data solutions are cheaper per terabyte than other solutions, but do not have the same level of performance. We want a system where we can selectively improve the performance of the data we care most about, and still have access to the entire data set any time we need it.
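
As an illustration of the short-lived data mart idea, here is a sketch of what such an extract could look like if the lake is queryable through Hive - the table, columns, and region are hypothetical, and CREATE TABLE ... AS SELECT support (added in Hive 0.5) is assumed:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Carve a temporary, summarized slice out of the lake for this week's analysis.
public class TemporaryMartSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    stmt.executeQuery(
        "CREATE TABLE pricing_mart AS " +            // short-lived data mart
        "SELECT product, week, SUM(revenue) AS revenue " +
        "FROM sales " +
        "WHERE region = 'EMEA' " +                   // this week's slice
        "GROUP BY product, week");
    con.close();
  }
}

When the analysis is done, the mart is simply dropped; the full-resolution data stays in the lake.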

LEAD-IN
Since we are introducing a new term, 'Data Lake', we need to explain how it differs from a traditional BI system.
Traditional BI
(Diagram: a Data Source feeds selective Data Mart(s); the rest of the data goes to Tape/Trash, and the unknown questions (???) in it are lost.)

In a traditional BI system where we have not been able to store all of the raw data, we have solved the problem by being selective.

Firstly, we selected the attributes of the data that we knew we had questions about. Then we cleansed it, aggregated it to transaction level or higher, and packaged it up in a form that is easy to consume. Then we put it into an expensive system that we could not scale, whether technically or financially. The rest of the data was thrown away or archived on tape, which, for the purposes of analysis, is the same as throwing it away.

TRANSITION
The problem is we don't know what is in the data that we are throwing away or archiving. We can only answer the questions that we could predict ahead of time.
What if...
(Diagram: the Data Source pours into Data Lake(s) instead of Tape/Trash; Data Mart(s), Ad-Hoc extracts, and a Data Warehouse are all fed from the lake.)

But what if, instead of sampling the data and throwing the rest away
TRANSITION
We pour all of the data into a Data Lake
TRANSITION
And then create whatever data marts we need from the Data Lake
TRANSITION
And also provide the ability to extract data from the Data Lake on an ad-hoc basis
TRANSITION
And also provide the ability to extract data from the Data Lake to feed into a data warehouse
Big Data Architecture
(Diagram: the Data Source feeds the Data Lake(s); Data Mart(s), Ad-Hoc extracts, and a Data Warehouse are fed from the lake.)

This, then, is our Big Data architecture.

As well as pouring data from the source into the Data Lake, we can take our archive tapes and pour them into the lake too, giving us a huge amount of historical data.

Does this meet our requirements?

TRANSITION
We are storing all of the data, so we can answer both known and unknown questions
TRANSITION
We are satisfying our standard reporting and analysis requirements by putting the most commonly requested data into data marts
TRANSITION
We are satisfying ad-hoc needs by providing the ability to dip into the lake at any time to extract data. This extracted data might be used to populate a temporary data mart, as the input for a specialized visualization tool, or by an analytical application.
TRANSITION
We are meeting the need to balance performance and cost by allowing you to choose how much data is staged in high-performance databases for
fast access, and how much data is available from the Data Lake only.
Does Big Data Replace Data Marts?
           • If it is a database
           • If it has low latency

           Hadoop (to date)
           • Databases are immature
           • Databases are no-SQL




Why Hadoop and BI?
           •      Distributed processing
           •      Distributed file system
           •      Commodity hardware
           •      Platform independent (in theory)
           •      Scales out beyond technology and/or
                  economy of a RDBMS

           In many cases it’s the only viable solution


* For the purposes of BI the parallel processing and distributed storage of Hadoop, along with its scale-out architecture using commodity hardware
is attractive.

* Since Hadoop is written in Java it is, theoretically, platform-independent. At this point, due to some dependencies, it is only recommended for Linux/Unix.

* And because these factors allow it to scale with better price/performance characteristics than databases...

TRANSITION

... in many cases it's the only viable solution

LEAD-IN
So are there any downsides to Hadoop for BI use cases?
Hadoop and BI?



              90% of new Hadoop use cases
                  are transformation of
                  semi/structured data*


           * of those companies we’ve talked to...


It might be a self-selecting audience since we are a Business Intelligence company, but upwards of 90% of the companies we talk to are using, or plan to use, Hadoop to transform structured or semi-structured data - with the aim of then analyzing, investigating and reporting on the data.
Hadoop and BI?




                                     “The working conditions
                                     within Hadoop are shocking”


                                  ETL Developer





Unfortunately for developers who are used to working with data transformation tools, the productivity within the Hadoop environment is not what
they are used to.
Hadoop and BI?
           Instead of this...





Instead of a graphical UI with palettes of data transformation operations to string together in a way that is easy to understand, easy to trace, and
easy to explain...
Hadoop and BI?
           You have to do this...
           public void map(
               Text key,
               Text value,
               OutputCollector<Text, Text> output,
               Reporter reporter) throws IOException

           public void reduce(
               Text key,
               Iterator<Text> values,
               OutputCollector<Text, Text> output,
               Reporter reporter) throws IOException




In Hadoop we have two Java functions - Map and Reduce - that need to be implemented. These functions are part of the MapReduce processing
engine mentioned earlier.
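
To make this concrete, here is a minimal sketch of what implementing those two functions involves, against the old (0.20-era) org.apache.hadoop.mapred API - a classic word count, with class names of our own choosing:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map phase: emit (word, 1) for every token in the input line.
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        output.collect(word, ONE);
      }
    }
  }
}

// Reduce phase: sum the 1s emitted for each word.
class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Even this simple counting job takes two hand-written classes; anything resembling a lookup, join, or dedupe has to be rolled by hand on top of these two calls.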

Mapping and reducing are important functions in a data transformation engine; unfortunately, there are many other operations that we need to perform on our data.

Hadoop does not include a comprehensive suite of data transformation operations.

To understand how we ended up in this situation we need to take a brief look at the history of Hadoop.
MapReduce Limitations



           Doing everything with MapReduce is like
           doing everything with recursion.

           You can, but that doesn’t mean it’s the best
           solution



MapReduce Limitations



           Not a scalable name...


           What’s next?
           MapReduceLookupJoinDenormalize
           UpdateDedupeFilterCalcMergeAppend



Google’s Use Case
           •      Needed to index the internet
           •      Huge set of unstructured data
           •      Predetermined input
           •      Predetermined output (the index)
           •      Predetermined questions
           •      Single user community
           •      Needed parallel processing and storage

           Their answer was MapReduce (MR)

The trail starts with Google. Google wanted to index the internet.

* This is clearly a big data set, and also an unstructured data set.

* Before they set out, Google knew what their data set was

* They knew how they wanted to process the data - to create an index

* They knew the questions they wanted to ask of the data - given some key words, what are the most relevant web pages

* They had a single user community - the set of people trying to search the internet

* In order to solve this problem they needed a scalable architecture with distributed storage and parallel processing

TRANSITION
Their answer was to use MapReduce
Yahoo’s Use Case
           •      Needed to index the internet
           •      Huge set of unstructured data
           •      Predetermined input
           •      Predetermined output (the index)
           •      Predetermined questions
           •      Single user community
           •      Needed parallel processing and storage

           Their answer was Hadoop (w/ MapReduce)

Next along the trail is Yahoo. Yahoo's requirements were very similar, in fact almost identical, to Google's.

* The exact same data set
* The same input format
* The same output
* The same questions
* From the same population
* With the same scalability requirements

TRANSITION
Yahoo's answer was Hadoop, which includes a MapReduce engine

LEAD-IN
So how do these requirements compare with the current, BI-specific use cases?
Current Use Cases
          ✗ Not indexing the internet
          ✗ Huge set of semi/structured data
          ✗ Different input source and format
          ✗ Different outputs
          ✗ Different questions
          ✗ Multiple user communities
          ✓ Need parallel processing and storage



* No-one is indexing the internet - that is not a BI use case
* In most cases we have structured or semi-structured data, not unstructured

* In each use case the data source is different, so the format of the data is different

* In each case the output is not an index; it is a variety of data sets, data feeds, and reports

* In each case the questions of the data are different, and the questions cannot all be predicted

* In most cases we have multiple user communities with different needs and questions

* In each case the volume of the data is such that we need a scalable architecture with distributed storage and parallel processing

When we compare these scenarios with the purpose for which Hadoop was created we see that
TRANSITION
There is not much overlap between the Big Data needs of BI, and the original intent of Hadoop
LEAD-IN
The realization here is that...
Unfortunately Hadoop
                                     wasn’t designed
                                for most BI requirements



Hadoop’s Strengths and Weaknesses
           • Distributed processing
           • Distributed file system
           • Commodity hardware
           • Platform independent (in theory)
           • Scales out beyond technology and/or
             economy of a RDBMS
           But...
           • Not designed for BI

No-SQL and BI




BI Tools Need...




                       Structured Query Language




BI Tools Don’t Need
           •      CREATE / INSERT
           •      UPDATE
           •      DELETE
           •      (only Read needed)
           •      No ACID transactions




Mondrian (OLAP) Needs
           Required:
           • SELECT
           • FROM
           • WHERE
           • GROUP BY
           • ORDER BY

           Nice to have:
           • HAVING
           • ORDER BY ... NULLS COLLATE
           • COUNT(DISTINCT x,y)
           • COUNT(DISTINCT x), COUNT(DISTINCT y)
           • VALUES (1,'a'), (2,'b')




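To make the required subset concrete, the sketch below issues one query that uses all five required clauses through Hive's JDBC driver (the driver class and URL are those of early Hive releases; the weblogs table and its columns are hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Issue a Mondrian-style aggregate query over Hive's JDBC driver.
public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    // SELECT / FROM / WHERE / GROUP BY / ORDER BY: the subset Mondrian requires.
    ResultSet rs = stmt.executeQuery(
        "SELECT country, COUNT(1) AS visits " +
        "FROM weblogs " +
        "WHERE year = 2010 " +
        "GROUP BY country " +
        "ORDER BY country");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    con.close();
  }
}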
Why not add to Hadoop
                                   the things it’s missing...



... until it can do
                                               what we need it to?



If only we had a
                          Java, embeddable,
                     data transformation engine...



Hadoop Architecture
(Diagram, top to bottom:
• Clients: Java / Python
• Map/Reduce: a Job Tracker dispatching work to Task Trackers, on top of Hadoop Common
• Filesystem (HDFS, S3, ...): a Name Node coordinating Data Nodes)
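
Tying the diagram to code: a client builds a JobConf and hands it to the JobTracker, which farms map and reduce tasks out to the Task Trackers. A minimal driver for the word-count classes sketched earlier might look like this (a sketch against the old mapred API, not Pentaho's code):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Configure the job and submit it to the JobTracker.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);           // types emitted by map/reduce
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input dir in HDFS
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output dir in HDFS
    JobClient.runJob(conf);   // blocks until the JobTracker reports completion
  }
}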
Pentaho Data Integration
(Diagram: Pentaho Data Integration engines run both inside Hadoop and alongside it, feeding Data Marts, Data Warehouses, and Analytical Applications, with Design, Deploy, and Orchestrate capabilities.)

Fortunately we have an embeddable data integration engine, written in Java

We have taken our Data Integration engine, PDI, and integrated it with Hadoop in a number of different areas:

* We have the ability to move files between Hadoop and external locations

* We have the ability to read and write HDFS files during data transformations (see the sketch after this list)

* We have the ability to execute data transformations within the MapReduce engine

* We have the ability to extract information from Hadoop and load it into external databases and applications

* And we have the ability to orchestrate all of this, so you can integrate Hadoop into the rest of your data architecture with scheduling, monitoring, logging, etc.
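
For a sense of what the HDFS read/write capability looks like beneath the graphical tools, here is a sketch using the plain Hadoop FileSystem API - the paths and the toy transformation are ours, not PDI's:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Stream a file out of HDFS, transform each record, write the result back.
public class HdfsTransformSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path in = new Path("/data/lake/weblog.txt");          // hypothetical paths
    Path out = new Path("/data/lake/weblog-enriched.txt");

    BufferedReader reader =
        new BufferedReader(new InputStreamReader(fs.open(in)));
    FSDataOutputStream writer = fs.create(out, true);     // overwrite if present
    String line;
    while ((line = reader.readLine()) != null) {
      // Stand-in for a real enrichment step (geocoding, URL metadata, ...).
      writer.writeBytes(line.toUpperCase() + "\n");
    }
    writer.close();
    reader.close();
  }
}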
(Diagram - the Big Data pyramid, bottom to top:
• Load: Applications & Systems
• Hadoop: Files / HDFS, with Hive on top
• Optimize: DM & DW in an RDBMS
• Visualize: Reporting / Dashboards / Analysis in the Web Tier)

Putting this into diagram form, so we can indicate the different layers in the architecture and also show the scale of the data, we get this Big Data pyramid.

* At the bottom of the pyramid we have Hadoop, containing our complete set of data.

* Higher up we have our data mart layer. This layer has less data in it, but has better performance.

* At the top we have application-level data caches.

* Looking down from the top, from the perspective of our users, they can see the whole pyramid - they have access to the whole structure. The only
thing that varies is the query time, depending on what data they want.

* Here we see that the RDBMS layer lets us optimize access to the data. We can decide how much data we want to stage in this layer. If we add more storage in this layer, we can increase the performance of a larger subset of the data lake, but it costs more money.
(Diagram - the same pyramid with PDI and Metadata spanning the layers: PDI moves data from Applications & Systems into Files / HDFS and Hive within Hadoop, and on into the DM & DW in the RDBMS; Reporting / Dashboards / Analysis sit in the Web Tier, and Metadata describes every layer.)

We are able to provide this data architecture because we have metadata about every layer in the architecture.

We used Pentaho Data Integration to move data into Hadoop, and to process data within Hadoop, and as a result we have metadata about the data within Hadoop.

We also use PDI to create the data marts and extracts from Hadoop, so we have metadata about those as well
(Diagram: the Data Lake sits within Hadoop, fed from Applications & Systems; the RDBMS layer sits above it, and Reporting / Dashboards / Analysis sit in the Web Tier at the top.)

If we compare this diagram to our other Big Data diagram we see how it fits together.

TRANSITION

Our Data Lake sits within Hadoop

TRANSITION

Our neatly packaged data mart and DW extracts feed into the database layer. Data from here can get to users very quickly.

TRANSITION

Our ad-hoc queries and ad-hoc data-marts come directly from the Data Lake
(Diagram - the Big Data pyramid again: Load: Applications & Systems; Hadoop: Files / HDFS with Hive; Optimize: DM & DW in an RDBMS; Visualize: Reporting / Dashboards / Analysis in the Web Tier.)

This, then, is our big data architecture.

It's a hybrid architecture that enables you to blend Hadoop with other elements of your data architecture, and with whatever amount of database storage you think necessary.

The blend of Hadoop and other technologies is flexible and easy to tweak over time
(Demo diagram: HDFS and Hive within Hadoop; a DM in an RDBMS above; Reporting / Dashboards / Analysis in the Web Tier.)

In this demo we will show how easy it is to execute a series of Hadoop and non-Hadoop tasks. We are going to

TRANSITION 1 Get a weblog file from an FTP server
TRANSITION 2 Make sure the source file does not already exist within the Hadoop file system
TRANSITION 3 Copy the weblog file into Hadoop
TRANSITION 4 Read the weblog and process it - add metadata about the URLs, add geocoding, and enrich the operating system and browser
attributes
TRANSITION 5 Write the results of the data transformation to a new, improved, data file
TRANSITION 6 Load the data into Hive
TRANSITION 7 Read an aggregated data set from Hadoop
TRANSITION 8 And write it into a database
TRANSITION 9 Slice and dice the data with the database
TRANSITION 10 And execute an ad-hoc query into Hadoop
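
As a rough sketch of what steps 2, 3, and 6 amount to in code outside of PDI (the paths, table name, and connection details are hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Demo steps 2, 3 and 6: HDFS existence check, copy-in, and Hive load.
public class DemoStepsSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    Path dst = new Path("/staging/weblog.txt");
    if (fs.exists(dst)) {                  // step 2: remove any stale copy
      fs.delete(dst, true);
    }
    fs.copyFromLocalFile(new Path("/tmp/weblog.txt"), dst);  // step 3

    // Step 6: register the file with Hive (Hive moves it into its warehouse).
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    stmt.executeQuery("LOAD DATA INPATH '/staging/weblog.txt' INTO TABLE weblogs");
    con.close();
  }
}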
Demo




FAQ
           1. Will Pentaho contribute to Apache’s
           Hadoop projects? Yes
           2. Will Pentaho distribute Hadoop as part of
           their product? Unlikely
           3. What version of Hadoop will be
           supported? Initially 0.20.2
           4. Will Pentaho’s APIs allow existing open
           source APIs to be used in parallel? Yes


1. Any changes Pentaho makes to the Apache code will be contributed to Apache.

2. Pentaho does not plan to provide its own distribution of Hadoop, or to provide anyone else's distribution as part of our products. If we need to provide binary patches while we wait for our contributions to be accepted by the Hive developers, we will do so, but this will be a temporary situation only.

3. We are looking into support for version 0.20.0 as well.

4. We are not modifying or disabling any Hadoop APIs so any existing MapReduce tasks will work as they did before
FAQ
           5. Will Pentaho provide support or services
           to help setup Hadoop? Yes, no, maybe
           6. What are the requirements to be in the
           Pentaho Hadoop beta program?
           Requirements: be serious, have started
           already, etc.





5. Hadoop is a data source for Pentaho, just as any filesystem, FTP, web service or database is. We don't directly provide support for these third-party services. We recognize that companies want support and services for Hadoop, so we will work with partners to provide these.

6. For the ongoing beta program we are looking for Hadoop sites that have data, have Hadoop installed, and have requirements
Can I Use ‘Big Data’
                                                     as a Data Warehouse?


                                                         Yes, probably




Should I Use ‘Big Data’
                                                   as a Data Warehouse?


                                                         No, probably not




What is a Data Warehouse?
           Data Mart
           • Data structured for query and reporting
           Data Warehouse
           • What you get if you create data marts for
             every system, then combine them together




Data Warehouse
           • Multiple sources
           • Cleansed and
             processed
           • Organized
           • Summarized





By definition a data warehouse has content from many different sources - every operational system within your organization. This data has been
cleansed, processed, structured and aggregated to the transaction level

TRANSITION
If we compare the data warehouse to the Data Lake the differences between them become obvious
Big Data Architecture
(Diagram: the Data Source feeds the Data Lake(s); Data Mart(s), Ad-Hoc extracts, and the Data Warehouse are fed from the lake.)

So our recommendation is the Data Lake architecture, where data marts and a data warehouse are fed from a data lake.
But what if I really,
                                                  really want to . . .



Data Water-Garden
           • Lake(s)
           • Pools and ponds
              • Organized
              • Cleansed
           • Linkages





Instead of a single Data Lake, create a series of data pools. Each pool will be populated from a different data source. The data in the pools should
be cleansed and structured.

Create links between the pools using attributes that exist in both.
Water-Garden Architecture
(Diagram: multiple Data Sources feed the Water-Garden; Data Mart(s) and Ad-Hoc extracts are fed from it.)

Then optimize your system by creating data marts for different domains or user populations
More information
www.pentaho.com/hadoop
contact: hadoop@pentaho.com





 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, Oath
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI Applications
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step Beyond
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 

Kürzlich hochgeladen

Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 

Kürzlich hochgeladen (20)

Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 

Nov 2010 HUG: Business Intelligence for Big Data

  • 1. Business Intelligence for Big Data. James Dixon, Chief Geek. August 2010.
  • 2. Business Intelligence = reports, dashboards, analysis, visualization, alerts, auditing.
Notes: It might be a self-selecting audience since we are a Business Intelligence company, but upwards of 90% of the companies we talk to are using, or plan to use, Hadoop to transform structured or semi-structured data, with the aim of then analyzing, investigating, and reporting on the data.
  • 3. Hadoop and BI.
  • 4. Example Hadoop Cases Today. Transactional: • Fraud detection • Financial services/stock markets. Sub-Transactional: • Weblogs • Social/online media • Telecoms events.
Notes: Not many companies have transactional data that qualifies as Big Data; credit card companies and financial services companies are about it. With stock market data we are talking about every stock trade and the bid and ask prices between the transactions, for every stock on multiple markets over a significant time period. For many other companies the Big Data is sub-transactional: the events that lead up to transactions. Weblogs are semi/badly structured. Consider the number of weblog entries created as you look for a book online, researching 5-10 books and reading reviews and comments. You might generate 1,000 entries and may or may not buy a book: potentially lots of entries for no transaction. We also want to enrich this data with metadata about the URLs and information about the location of the user. In an online game or world, every interaction between participants and the system, and between each other, is logged; an individual participant might generate more than 1 million events for their one monthly transaction. A single phone call or text message generates many events within a telecoms company.
  • 5. Example Hadoop Cases Today. Non-Transactional: • Web pages, blogs, etc. • Documents • Physical events • Application events • Machine events. In most cases structured or semi-structured.
Notes: In addition to transactional and sub-transactional data there is also non-transactional data; some of it is human-generated and some is machine-generated. People generate lots of content that companies are interested in: web pages, blogs, and comments. Physical events include data such as weather data: take the combined output of the weather-sensing instruments deployed today and you get Big Data. Many software applications log events as they execute, as do machines such as production-line machinery. In the majority of these cases the data is structured or semi-structured. What do these use cases have in common? How can we describe these Big Data scenarios?
  • 6. Data Lake: • Single source • Large volume • Not distilled • Can be treated.
Notes: In most of these cases we are dealing with a single source of data, and we know we are dealing with a large volume of it. We are also dealing with data that is not aggregated or summarized; it's not 'distilled' in any way. It is a large body of data. The data can be raw, or treated in some way, either within the lake or on its way in; for example, weblog entries might be geocoded and enriched with metadata. So we are calling these things Data Lakes.
  • 7. Data Lakes: • 0-2 lakes per company • Known and unknown questions • Multiple user communities • $1-10k questions, not $1m ones • Don't fit in a traditional RDBMS at a reasonable cost.
Notes: There are some other interesting attributes of these data lakes. If a company has a data lake at all, it usually has only one; some domains such as financial services might have two, but any more than that is very rare. In most cases some questions of the data are known ahead of time, but there are also questions that cannot be anticipated. We also frequently have different user communities that want access to the data: for weblogs, sales and marketing want to know about visitor behavior and traffic volume, maybe by geography, while IT wants throughput and server-load information for capacity planning. In general most questions about the data are not million-dollar questions; they are $1k to $10k questions. Because no one user or group has a million-dollar question, no one has a million-dollar budget to solve the problem. Additionally, this amount of data does not fit into a database, either because it physically will not fit or because the cost of doing so is economically out of reach.
  • 8. Data Lake Requirements: • Store all the data • Satisfy routine reporting and analysis • Satisfy ad-hoc query/analysis/reporting • Balance performance and cost.
Notes: The requirements of these data lakes also share common ground. We want to store all of the data because we don't know all the questions we have of it; if we did, we'd only have to keep a subset. We still want to satisfy all of the traditional BI reporting and analysis needs. We need the ability to dip into the lake at any time to ask any question of the data. In some cases we want to extract a slice of data for detailed analysis: say I'm in charge of pricing and promotions and this week I'm looking at a particular region or product. I want to select a subset of the data from the lake, summarized to some level, with the attributes I want to analyze; I slice and dice it for a few hours or days, then move on to the next region or product. In this case we are creating a short-lived data mart from the data lake. In other cases we know exactly the data we are looking for and don't need to explore it: we define the attributes we want and get a query result back. We also want to balance cost and performance. Big Data solutions are cheaper per terabyte than other solutions but do not have the same level of performance; we want a system where we can selectively improve the performance of the data we care most about, while keeping access to the entire data set at any time. Since we are introducing a new term, 'Data Lake', we need to explain how it differs from a traditional BI system.
  • 9. Traditional BI. [Diagram: a Data Source feeds selective Data Mart(s); the rest of the data goes to Tape/Trash; unanticipated questions go unanswered.]
Notes: In a traditional BI system, where we have not been able to store all of the raw data, we have solved the problem by being selective. First we selected the attributes of the data we knew we had questions about; then we cleansed it, aggregated it to transaction level or higher, and packaged it in a form that is easy to consume. Then we put it into an expensive system that we could not scale, whether technically or financially. The rest of the data was thrown away or archived on tape, which for the purposes of analysis is the same as throwing it away. The problem is that we don't know what is in the data we are throwing away or archiving; we can only answer the questions we could predict ahead of time.
  • 10. What if... [Diagram: the Data Source pours into Data Lake(s) instead of Tape/Trash; the lake feeds Data Mart(s), ad-hoc extracts, and a Data Warehouse.]
Notes: What if, instead of sampling the data and throwing the rest away, we pour all of the data into a Data Lake, create whatever data marts we need from it, and also provide the ability to extract data from the lake on an ad-hoc basis or to feed a data warehouse?
  • 11. Big Data Architecture. [Diagram: the Data Source feeds Data Lake(s), which feed Data Mart(s), ad-hoc extracts, and the Data Warehouse.]
Notes: This, then, is our Big Data architecture. As well as pouring data from the source into the Data Lake, we can also pour in our archive tapes, giving us a huge amount of historical data. Does this meet our requirements? We are storing all of the data, so we can answer both known and unknown questions. We satisfy standard reporting and analysis by putting the most commonly requested data into data marts. We satisfy ad-hoc needs by dipping into the lake at any time to extract data, which might populate a temporary data mart, feed a specialized visualization tool, or drive an analytical application. And we balance performance and cost by letting you choose how much data is staged in high-performance databases for fast access, and how much is available from the Data Lake only.
  • 12. Does Big Data Replace Data Marts? • If it is a database • If it has low latency. Hadoop (to date): • Databases are immature • Databases are no-SQL.
  • 13. Why Hadoop and BI? • Distributed processing • Distributed file system • Commodity hardware • Platform independent (in theory) • Scales out beyond the technology and/or economy of an RDBMS. In many cases it's the only viable solution.
Notes: For the purposes of BI, Hadoop's parallel processing and distributed storage, along with its scale-out architecture on commodity hardware, are attractive. Since Hadoop is written in Java it is theoretically platform-independent, although due to some dependencies it is currently only recommended for Linux/Unix. And because these factors allow it to scale with better price/performance characteristics than databases, in many cases it's the only viable solution. So are there any downsides to Hadoop for BI use cases?
  • 14. Hadoop and BI? 90% of new Hadoop use cases are transformation of semi/structured data* (*of those companies we've talked to).
  • 15. Hadoop and BI? "The working conditions within Hadoop are shocking." ETL Developer.
Notes: Unfortunately, for developers who are used to working with data transformation tools, productivity within the Hadoop environment is not what they are used to.
  • 16. Hadoop and BI? Instead of this...
Notes: Instead of a graphical UI with palettes of data transformation operations to string together in a way that is easy to understand, easy to trace, and easy to explain...
  • 17. Hadoop and BI? You have to do this:
  public void map(Text key, Text value, OutputCollector output, Reporter reporter)
  public void reduce(Text key, Iterator values, OutputCollector output, Reporter reporter)
Notes: In Hadoop we have two Java functions, map and reduce, that need to be implemented; they are part of the MapReduce processing engine mentioned earlier. Mapping and reducing are important functions in a data transformation engine, but there are many other operations we need to perform on our data, and Hadoop does not include a comprehensive suite of data transformation operations. (A minimal example of implementing these two functions follows below.) To understand how we ended up in this situation we need a brief look at the history of Hadoop.
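To make the contrast with a graphical ETL tool concrete, here is a minimal sketch of a complete job against the classic org.apache.hadoop.mapred API of that era, counting hits per URL in a weblog. The slide's signatures are schematic; with the default TextInputFormat the map key is actually a LongWritable byte offset. The class names, log layout, and field position are hypothetical.

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class PageHitCount {

    // Mapper: one weblog line in, (url, 1) out. Assumes the URL is the
    // seventh whitespace-separated field (a hypothetical log layout).
    public static class HitMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text url = new Text();

      public void map(LongWritable key, Text value,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        String[] fields = value.toString().split(" ");
        if (fields.length > 6) {
          url.set(fields[6]);
          output.collect(url, ONE);
        }
      }
    }

    // Reducer: sum the counts for each URL.
    public static class HitReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }

    public static void main(String[] args) throws IOException {
      JobConf conf = new JobConf(PageHitCount.class);
      conf.setJobName("page-hit-count");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(HitMapper.class);
      conf.setReducerClass(HitReducer.class);
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }

Even this trivial count requires a compiled Java class per transformation, which is the productivity gap the quoted ETL developer is complaining about.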
  • 18. MapReduce Limitations. Doing everything with MapReduce is like doing everything with recursion: you can, but that doesn't mean it's the best solution.
  • 19. MapReduce Limitations. Not a scalable name... What's next? MapReduceLookupJoinDenormalizeUpdateDedupeFilterCalcMergeAppend?
  • 20. Google's Use Case: • Needed to index the internet • Huge set of unstructured data • Predetermined input • Predetermined output (the index) • Predetermined questions • Single user community • Needed parallel processing and storage. Their answer was MapReduce (MR).
Notes: The trail starts with Google, which wanted to index the internet. This is clearly a big, unstructured data set. Before they set out, Google knew what their data set was, how they wanted to process it (to create an index), and the questions they wanted to ask of it: given some key words, what are the most relevant web pages? They had a single user community: the set of people trying to search the internet. To solve this problem they needed a scalable architecture with distributed storage and parallel processing, and their answer was MapReduce.
  • 21. Yahoo's Use Case: • Needed to index the internet • Huge set of unstructured data • Predetermined input • Predetermined output (the index) • Predetermined questions • Single user community • Needed parallel processing and storage. Their answer was Hadoop (with MapReduce).
Notes: Next along the trail is Yahoo, whose requirements were almost identical to Google's: the same data set, the same input format, the same output, the same questions, from the same population, with the same scalability requirements. Yahoo's answer was Hadoop, which includes a MapReduce engine. So how do these requirements compare with the current, BI-specific use cases?
  • 22. Current Use Cases: ✗ Not indexing the internet ✗ Huge set of semi/structured data ✗ Different input source and format ✗ Different outputs ✗ Different questions ✗ Multiple user communities ✓ Need parallel processing and storage.
Notes: No one is indexing the internet; that is not a BI use case. In most cases we have structured or semi-structured data, not unstructured. In each use case the data source, and hence the format, is different; the output is not an index but a variety of data sets, data feeds, and reports; the questions are different and cannot all be predicted; and there are usually multiple user communities with different needs. In each case, though, the data volume demands a scalable architecture with distributed storage and parallel processing. When we compare these scenarios with the purpose for which Hadoop was created, there is not much overlap between the Big Data needs of BI and the original intent of Hadoop. The realization here is that...
  • 23. Unfortunately, Hadoop wasn't designed for most BI requirements.
  • 24. Hadoop's Strengths and Weaknesses: • Distributed processing • Distributed file system • Commodity hardware • Platform independent (in theory) • Scales out beyond the technology and/or economy of an RDBMS. But... • Not designed for BI.
  • 25. No-SQL and BI.
  • 26. BI Tools Need... Structured Query Language.
  • 27. BI Tools Don't Need: • CREATE / INSERT • UPDATE • DELETE • (only Read needed) • No ACID transactions.
  • 28. Mondrian (OLAP) Needs. Required: • SELECT • FROM • WHERE • GROUP BY • ORDER BY. Nice to have: • HAVING • ORDER BY ... NULLS COLLATE • COUNT(DISTINCT x, y) • COUNT(DISTINCT x), COUNT(DISTINCT y) • VALUES (1,'a'), (2,'b'). (A sketch of this SQL subset in action follows below.)
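As an illustration of why that read-only SQL subset matters, here is a minimal sketch of running a Mondrian-style aggregate query against Hive over JDBC. The driver class and port are the 2010-era Hive conventions and may differ per installation, and the weblogs table and its columns are hypothetical.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class MondrianStyleQuery {
    public static void main(String[] args) throws Exception {
      // Early Hive JDBC driver; class name and URL are assumptions
      // based on the Hive releases of the time.
      Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
      Connection conn = DriverManager.getConnection(
          "jdbc:hive://localhost:10000/default", "", "");
      Statement stmt = conn.createStatement();

      // Only SELECT / FROM / WHERE / GROUP BY / ORDER BY: the
      // "Required" subset from the slide. Table and columns are
      // hypothetical.
      ResultSet rs = stmt.executeQuery(
          "SELECT country, COUNT(1) AS hits"
          + " FROM weblogs"
          + " WHERE year = 2010"
          + " GROUP BY country"
          + " ORDER BY hits DESC");
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
      rs.close();
      stmt.close();
      conn.close();
    }
  }

Note that nothing here creates, updates, or deletes data: if a no-SQL store can answer this shape of read-only query, a BI tool can sit on top of it.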
  • 29. Why not add to Hadoop the things it's missing...
  • 30. ...until it can do what we need it to?
  • 31. If only we had a Java, embeddable, data transformation engine...
  • 32. Hadoop Architecture. [Diagram: Java and Python clients call into the MapReduce layer, where a JobTracker coordinates TaskTrackers; this sits on Hadoop Common and a pluggable filesystem layer, where a NameNode coordinates DataNodes (HDFS, S3, ...).]
  • 33. Pentaho Data Integration. [Diagram: Pentaho Data Integration designs, deploys, and orchestrates work inside Hadoop, and feeds data marts, the data warehouse, and analytical applications.]
Notes: Fortunately we have an embeddable data integration engine, written in Java. We have taken our data integration engine, PDI, and integrated it with Hadoop in a number of areas: we can move files between Hadoop and external locations; read and write HDFS files during data transformations; execute data transformations within the MapReduce engine; extract information from Hadoop and load it into external databases and applications; and orchestrate all of this so you can integrate Hadoop into the rest of your data architecture with scheduling, monitoring, logging, and so on. (A sketch of the first of these, file movement, follows below.)
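For a sense of what "moving files between Hadoop and external locations" means at the API level, here is a minimal sketch using Hadoop's own FileSystem API rather than PDI itself. The configuration key is the 0.20-era name, and the host and paths are placeholders.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class LoadIntoLake {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // 0.20-era configuration key; the NameNode host is a placeholder.
      conf.set("fs.default.name", "hdfs://namenode:9000");
      FileSystem fs = FileSystem.get(conf);

      Path src = new Path("/tmp/weblog-20100801.log"); // local file (hypothetical)
      Path destDir = new Path("/data/lake/weblogs");   // directory in the lake

      // Guard against overwriting a file that is already in the lake.
      Path target = new Path(destDir, src.getName());
      if (fs.exists(target)) {
        System.err.println(target + " already exists in HDFS");
        return;
      }
      fs.copyFromLocalFile(src, destDir);
      System.out.println("Copied " + src + " into " + destDir);
    }
  }

In PDI these same operations appear as drag-and-drop job steps; the point of the integration is that no one has to write this class by hand.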
  • 34. [Diagram, bottom to top: Applications & Systems load into Hadoop (Files/HDFS, Hive); an RDBMS layer (DM & DW) optimizes access; a Web Tier (Reporting / Dashboards / Analysis) visualizes.]
Notes: Put into diagram form, indicating the layers of the architecture and the scale of the data at each, we get this Big Data pyramid. At the bottom of the pyramid we have Hadoop, containing our complete set of data. Higher up we have the data mart layer, which has less data but better performance. At the top we have application-level data caches. Looking down from the top, from the perspective of our users, they can see the whole pyramid; they have access to the whole structure, and the only thing that varies is query time, depending on what data they want. The RDBMS layer lets us optimize access to the data: we decide how much data to stage in it. Adding more storage in this layer increases the performance of a larger subset of the data lake, but costs more money.
  • 35. [Diagram: the same stack, with PDI moving data between the layers and a shared metadata layer spanning them.]
Notes: We are able to provide this data architecture because we have metadata about every layer in it. We used Pentaho Data Integration to move data into Hadoop and to process data within Hadoop, so we have metadata about the data within Hadoop; we also use PDI to create the data marts and extracts from Hadoop, so we have metadata about those as well.
  • 36. [Diagram: the stack overlaid on the earlier Big Data architecture diagram.]
Notes: If we compare this diagram to our other Big Data diagram we see how it fits together. Our Data Lake sits within Hadoop; our neatly packaged data mart and DW extracts feed into the database layer, from which data reaches users very quickly; and our ad-hoc queries and ad-hoc data marts come directly from the Data Lake.
  • 37. [Diagram: the full stack again: load into Hadoop, optimize in the RDBMS, visualize in the Web Tier.]
Notes: This, then, is our big data architecture. It's a hybrid architecture that enables you to blend Hadoop with the other elements of your data architecture, and with whatever amount of database storage you think necessary. The blend of Hadoop and other technologies is flexible and easy to tweak over time.
  • 38. [Diagram: the demo setup spanning HDFS, Hive, an RDBMS data mart, and the Web Tier.]
Notes: In this demo we will show how easy it is to execute a series of Hadoop and non-Hadoop tasks. We are going to: (1) get a weblog file from an FTP server; (2) make sure the source file does not already exist within the Hadoop file system; (3) copy the weblog file into Hadoop; (4) read the weblog and process it, adding metadata about the URLs, geocoding, and enriched operating system and browser attributes; (5) write the results of the data transformation to a new, improved data file; (6) load the data into Hive; (7) read an aggregated data set from Hadoop; (8) write it into a database; (9) slice and dice the data within the database; (10) execute an ad-hoc query into Hadoop. (Steps 6-8 are sketched in code below.)
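For orientation only, here is a rough sketch of what steps 6-8 might look like as plain JDBC glue rather than as PDI job steps. The driver classes, connection URLs, table names, and column layouts are all assumptions, not the demo's actual implementation.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class LakeToMart {
    public static void main(String[] args) throws Exception {
      // Step 6: point Hive at the enriched file written by the
      // transformation. Driver, URL, path, and table are assumptions.
      Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
      Connection hive = DriverManager.getConnection(
          "jdbc:hive://localhost:10000/default", "", "");
      Statement hs = hive.createStatement();
      hs.execute(
          "LOAD DATA INPATH '/data/lake/weblogs-enriched' INTO TABLE weblogs");

      // Step 7: read an aggregated data set out of the lake.
      ResultSet rs = hs.executeQuery(
          "SELECT page, COUNT(1) FROM weblogs GROUP BY page");

      // Step 8: stage the aggregate in a conventional database
      // (a hypothetical PostgreSQL mart) for fast slicing and dicing.
      Class.forName("org.postgresql.Driver");
      Connection mart = DriverManager.getConnection(
          "jdbc:postgresql://marthost/analytics", "user", "password");
      PreparedStatement ins = mart.prepareStatement(
          "INSERT INTO page_hits (page, hits) VALUES (?, ?)");
      while (rs.next()) {
        ins.setString(1, rs.getString(1));
        ins.setLong(2, rs.getLong(2));
        ins.executeUpdate();
      }
      ins.close();
      mart.close();
      hive.close();
    }
  }

The architectural point stands regardless of the glue: the heavy aggregation happens in the lake, and only the distilled result is staged in the RDBMS.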
  • 39. Demo.
  • 40. FAQ. 1. Will Pentaho contribute to Apache's Hadoop projects? Yes. 2. Will Pentaho distribute Hadoop as part of their product? Unlikely. 3. What version of Hadoop will be supported? Initially 0.20.2. 4. Will Pentaho's APIs allow existing open source APIs to be used in parallel? Yes.
Notes: Any changes Pentaho makes to the Apache code will be contributed to Apache. Pentaho does not plan to provide its own distribution of Hadoop, or to ship anyone else's distribution as part of our products; if we need to provide binary patches while we wait for our contributions to be accepted by the Hive developers, we will do so, but only temporarily. We are looking into support for version 0.20.0 as well. We are not modifying or disabling any Hadoop APIs, so any existing MapReduce tasks will work as they did before.
  • 41. FAQ. 5. Will Pentaho provide support or services to help set up Hadoop? Yes, no, maybe. 6. What are the requirements to be in the Pentaho Hadoop beta program? Be serious, have started already, etc.
Notes: Hadoop is a data source for Pentaho, just as any filesystem, FTP server, web service, or database is, and we don't directly provide support for such third-party services. We recognize that companies want support and services for Hadoop, so we will work with partners to provide them. For the ongoing beta program we are looking for Hadoop sites that have data, have Hadoop installed, and have requirements.
  • 42. Can I Use 'Big Data' as a Data Warehouse? Yes, probably.
  • 43. Should I Use 'Big Data' as a Data Warehouse? No, probably not.
  • 44. What is a Data Warehouse? Data Mart: data structured for query and reporting. Data Warehouse: what you get if you create data marts for every system, then combine them together.
  • 45. Data Warehouse: • Multiple sources • Cleansed and processed • Organized • Summarized.
Notes: By definition a data warehouse has content from many different sources: every operational system within your organization. This data has been cleansed, processed, structured, and aggregated to the transaction level. If we compare the data warehouse to the Data Lake, the differences between them become obvious.
  • 46. Big Data Architecture. [Diagram: the Data Source feeds Data Lake(s), which feed Data Mart(s), ad-hoc extracts, and the Data Warehouse.]
Notes: So our recommendation is the Data Lake architecture, where data marts and a data warehouse are fed from a data lake.
  • 47. But what if I really, really want to...?
  • 48. Data Water-Garden: • Lake(s) • Pools and ponds • Organized • Cleansed • Linkages.
Notes: Instead of a single Data Lake, create a series of data pools, each populated from a different data source. The data in the pools should be cleansed and structured, and you create links between the pools using attributes that exist in both.
  • 49. Water-Garden Architecture. [Diagram: Data Sources feed the Water-Garden, which feeds Data Mart(s) and ad-hoc extracts.]
Notes: Then optimize your system by creating data marts for different domains or user populations.
  • 50. More information: www.pentaho.com/hadoop. Contact: hadoop@pentaho.com.