Big data analytics beyond beer and diapers
1. Big Data Analytics:
Beyond Beer and Diapers
2012/2/22
Kai Zhao @Teradata
kingaim@gmail.com
by Kai Zhao 2011.12
Disclaimer:
Any views or opinions presented in this article are solely those of the author and do NOT necessarily represent those of Teradata or other companies.
2. Content
Background:
Traditional Business Intelligence (BI)
What is Big Data
What is Big Data Analytics
Big Data Analytics: State of the Art
Big Data Analytics Technology Stack
ETL/ELT/ETLT(Demo)
MPP Data Warehouse
Map Reduce
NoSQL
Web Service
Data Analytics
Data Visualization
BI Tools(Demo)
Big Data Analytics Platform Architecture
5. What is Big Data
Volume: The increase in data volumes within
enterprise systems is caused by transaction volumes
and other traditional data types, as well as by new
types of data. Too much volume is a storage issue,
but too much data is also a massive analysis issue.
Variety: IT leaders have always had an issue
translating large volumes of transactional
information into decisions — now there are more
types of information to analyze — mainly coming
from social media and mobile (context-aware).
Variety includes tabular data (databases),
hierarchical data, documents, e-mail, metering data,
video, still images, audio, stock ticker data, financial
transactions and more.
Velocity: This involves streams of data, structured
record creation, and availability for access and
delivery. Velocity means both how fast data is being
produced and how fast the data must be processed
to meet demand.
6. What is Big Data (cont.)
Broadly speaking, Big Data is generated by a number of sources, including:
Social Networking and Media: There are currently over 700 million Facebook users, 250 million Twitter users and 156
million public blogs. Each Facebook update, Tweet, blog post and comment creates multiple new data points
(structured, semi-structured and unstructured), sometimes called Data Exhaust.
Mobile Devices: There are over 5 billion mobile phones in use worldwide. Each call, text and instant message is
logged as data. Mobile devices, particularly smart phones and tablets, also make it easier to use social media and use
other data-generating applications. Mobile devices also collect and transmit location data.
Internet Transactions: Billions of online purchases, stock trades and other transactions happen every day, including
countless automated transactions. Each creates a number of data points collected by retailers, banks, credit cards,
credit agencies and others.
Networked Devices and Sensors: Electronic devices of all sorts, including servers and other IT hardware, smart
energy meters and temperature sensors, create semi-structured log data that record every action.
7. What is Big Data Analytics
See Video
Big Data
Visualization
8. Big Data Analytics: State of the Art
Acquisitions and Investments
Big Data Vendors and Their Products
Forrester Report
Gartner Report
9. Acquisitions and Investments
Acquirer    Acquiree (Est. date)    Date of Acq.    Deal
Teradata    AsterData (2005)        2011.3.3        $0.263 billion
HP          Vertica (2005)          2011.2.14       $1.2 billion
IBM         Netezza (2000)          2010.11.11      $1.7 billion
EMC         Greenplum (2003)        2010.7.6        $0.1~0.15 billion
SAP         Sybase                  2010.5.12       $0.58 billion
Summary: Traditional Data Warehouse vendors need Big Data Analytics technology.

Investee       Investment
Cloudera       $76 million
MapR           $29 million
Hortonworks    $50 million
Datameer       $10 million
Summary: New Big Data Analytics startups
Source: http://www.leiphone.com/why-2012-the-year-of-hadoop.html
10. Big Data Vendors and Their Products
Source: http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond
14. Big Data Analytics Technology Stack
Data Import
Data Storage
Data Computing
Data Analytics
XXX as a Service
15. ETL/ELT/ETLT
Extract – The process by which data is extracted from the data source
Transform – The transformation of the source data into a format relevant to the solution
Load – The loading of data into the warehouse
This approach to data warehouse development is the traditional and widely accepted approach.
The following diagram illustrates each of the individual stages in the process.
16. ETL
Source: Robert J Davenport ETL vs ELT A Subjective View
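The extract, transform and load stages described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the sample CSV, the `store_sales` table and the function names are hypothetical, and an in-memory SQLite database stands in for the warehouse. Note that the transformation runs outside the database, and only the data needed for the output reaches the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical export from a source system (assumption for this sketch).
SOURCE_CSV = """order_id,store,amount
1,north,10.50
2,south,7.25
3,north,3.00
"""


def extract(text):
    """Extract: read the raw rows from the data source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: aggregate to the per-store totals the report needs."""
    totals = {}
    for row in rows:
        totals[row["store"]] = totals.get(row["store"], 0.0) + float(row["amount"])
    return sorted(totals.items())


def load(conn, records):
    """Load: write the already-transformed records into the warehouse."""
    conn.execute("CREATE TABLE store_sales (store TEXT PRIMARY KEY, total REAL)")
    conn.executemany("INSERT INTO store_sales VALUES (?, ?)", records)
    conn.commit()


def run_etl(conn, text=SOURCE_CSV):
    load(conn, transform(extract(text)))
    return conn.execute(
        "SELECT store, total FROM store_sales ORDER BY store"
    ).fetchall()


etl_warehouse = sqlite3.connect(":memory:")
etl_result = run_etl(etl_warehouse)
print(etl_result)
```

Because only the aggregated totals are loaded, a later question about individual orders would require changing the extract and transform routines, which is the flexibility weakness usually attributed to ETL.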
17. ETL
Strengths
Development Time
Designing from the output backwards ensures that only data relevant to the solution is extracted and processed,
potentially reducing development, extract, and processing overhead; and therefore time.
Targeted data
Due to the targeted nature of the load process, the warehouse contains only data relevant to the presentation.
Administration Overhead
Reduced warehouse content simplifies the security regime implemented and hence the administration overhead.
Tools Availability
The large number of tools available that implement ETL provides flexibility of approach and the opportunity to
identify the most appropriate tool. The proliferation of tools has led to a competitive functionality war, which
often results in loss of maintainability.
Weaknesses
Flexibility
Targeting only relevant data for output means that any future requirements that need data not
included in the original design will have to be added to the ETL routines. Due to the tight dependencies
between the routines developed, this often leads to a need for fundamental re-design and development. As a
result this increases the time and costs involved.
Hardware
Most third party tools utilize their own engine to implement the ETL process. Regardless of the size of the
solution this can necessitate the investment in additional hardware to implement the tool’s ETL engine.
Skills Investment
The use of third party tools to implement ETL processes compels the learning of new scripting languages.
Learning Curve
Implementing a third party tool that uses unfamiliar processes and languages brings the learning curve that is
implicit in any technology new to an organization, and lack of experience can often lead to blind alleys in
its use.
18. ELT
Whilst this approach to the implementation of a warehouse appears on the surface to be
similar to ETL, it differs in a number of significant ways.
The following diagram illustrates the process.
19. ELT
Strengths
Project Management
Being able to split the warehouse process into specific and isolated tasks enables a project to be designed on a smaller
task basis, so the project can be broken down into manageable chunks.
Flexible & Future Proof
In general, in an ELT implementation all data from the sources are loaded into the warehouse as part of the extract and
load process. This, combined with the isolation of the transformation process, means that future requirements can easily
be incorporated into the warehouse structure.
Risk minimization
Removing the close interdependencies between each stage of the warehouse build process enables the development
process to be isolated, and the individual process design can thus also be isolated. This provides an excellent platform for
change, maintenance and management.
Utilize Existing Hardware
In implementing ELT as a warehouse build process, the inherent tools provided with the database engine can be used.
Alternatively, the vast majority of the third party ELT tools available employ the use of the database engine’s capability
and hence the ELT process is run on the same hardware as the database engine underpinning the data warehouse, using
the existing hardware deployed.
Utilize Existing Skill sets
By using the functionality provided by the database engine, the existing investments in database skills are re-used to
develop the warehouse.
Weaknesses
Against the Norm
ELT is an emergent approach to data warehouse design and development. Whilst it has proven itself many times over
through its abundant use in implementations throughout the world, it does require a change in mentality and design
approach against traditional methods. To get the best from an ELT approach requires an open mind.
Tools Availability
Being an emergent technology approach, ELT suffers from a limited availability of tools.
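To make the contrast with ETL concrete, the ELT flow can be sketched as follows. This is a hedged illustration, not a reference implementation: the `raw_orders` and `store_sales` tables and the sample rows are hypothetical, and an in-memory SQLite database stands in for the warehouse engine. The source rows are loaded unchanged, and the transformation is plain SQL executed by the database itself.

```python
import sqlite3

# Hypothetical raw source rows, loaded as-is (assumption for this sketch).
RAW_ORDERS = [
    (1, "north", "10.50"),
    (2, "south", "7.25"),
    (3, "north", "3.00"),
]


def run_elt(conn, rows=RAW_ORDERS):
    # Extract + Load: copy every source row into a staging table unchanged.
    conn.execute(
        "CREATE TABLE raw_orders (order_id INTEGER, store TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)

    # Transform: performed inside the database engine with SQL.
    conn.execute(
        """
        CREATE TABLE store_sales AS
        SELECT store, SUM(CAST(amount AS REAL)) AS total
        FROM raw_orders
        GROUP BY store
        """
    )
    conn.commit()
    return conn.execute(
        "SELECT store, total FROM store_sales ORDER BY store"
    ).fetchall()


elt_warehouse = sqlite3.connect(":memory:")
elt_result = run_elt(elt_warehouse)

# A later requirement (orders per store) is answered from the raw table
# already in the warehouse, without touching the extract and load step.
elt_order_counts = elt_warehouse.execute(
    "SELECT store, COUNT(*) FROM raw_orders GROUP BY store ORDER BY store"
).fetchall()
print(elt_result, elt_order_counts)
```

The second query illustrates the "Flexible & Future Proof" point: because all source data is already in the warehouse, a new requirement only needs a new transformation.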
21. Map Reduce: Hadoop
Comparing with MPP Data Warehouse.
Source: http://www.capgemini.com/technology-blog/2012/01/what-is-hadoop/
22. Map Reduce: Hadoop
[Figure: Hadoop ecosystem diagram. Recoverable labels: Professional Service; Enterprise-grade Distribution; Hadoop Subscription Service; Hadoop Cluster Management; Data Integration with Hadoop; Database, OLTP replacements: Teradata Aster/MongoDB; EDW; BI]
23. MPP Data Warehouse
Comparing MPP Data Warehouse with Hadoop stack.
Draw a picture.
31. BI Tools
BI Tools fall into three categories:
Query Tools
A query tool is software set up for users to ask questions about the data. The user can
search for patterns or details.
Multidimensional Analysis Tools
A multidimensional analysis tool, also called Online Analytical Processing (OLAP),
is software that allows the user to view the same data from different aspects.
E.g.: Business Objects, Hyperion, Cognos, MicroStrategy, Pentaho, Microsoft Analysis Services
and Palo OLAP Server.
Data Mining Tools
A data mining tool is software that is automated to search data, seeking out ways that
the data correlates to other data.
E.g.: SPSS Clementine, Weka3, R and Apache Mahout.
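The difference between the last two categories can be illustrated with a small Python sketch. Everything here is hypothetical and chosen only for illustration: the sample sales facts, the market baskets and the function names `rollup` and `confidence`. The first function views the same facts along different dimensions, as a multidimensional analysis tool would; the second computes a simple association measure of the kind a data mining tool searches for automatically.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, units).
SALES = [
    ("north", "beer", 4),
    ("north", "diapers", 2),
    ("south", "beer", 1),
    ("south", "diapers", 5),
]


def rollup(facts, dim):
    """Total units along one dimension (0 = region, 1 = product)."""
    totals = defaultdict(int)
    for fact in facts:
        totals[fact[dim]] += fact[2]
    return dict(totals)


# Hypothetical market baskets, one set of items per purchase.
BASKETS = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "milk"},
    {"diapers", "milk"},
]


def confidence(baskets, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


by_region = rollup(SALES, 0)    # same data viewed by region
by_product = rollup(SALES, 1)   # same data viewed by product
beer_to_diapers = confidence(BASKETS, "beer", "diapers")
print(by_region, by_product, beer_to_diapers)
```

Real OLAP servers and mining tools do far more (hierarchies, drill-down, pruning of candidate rules); the sketch only shows the kind of question each category answers.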
32. BI Tools List
Source: BI Tool Survey 2012 http://www.businessintelligencetoolbox.com/list-of-business-intelligence-bi-tools/
33. BI Tools: Gartner Evaluation
Business intelligence (BI) platforms
enable all types of users – from IT staff to
consultants to business users – to build
applications that help organizations learn
about and understand their business
34. BI Demo – JasperSoft iReport
Demo Session: JasperSoft iReport