This session offers a pragmatic tour of the Big Data landscape. We first put the Big Data question back into its business and technology contexts, then zoom in on Hadoop technologies and their different implementation options.
Big Data: Manage, Refine, Analyze
1. Big Data: Manage, Refine, Analyze
orenault@hortonworks.com
blaisev@microsoft.com
2. Sign up for the trial offer or activate
your Azure MSDN access
Visit the Azure booth
(Services & Tools zone)
Enter the prize draw
at 6:30 PM on February 12 or 13
3. Hadoop: a use-case study
Introduction:
motivation and the Hadoop environment
Microsoft scenarios
15. HORTONWORKS DATA PLATFORM (HDP): Enterprise Hadoop
HADOOP CORE: HDFS, Map Reduce, YARN (in 2.0)
DATA SERVICES: HCatalog, Pig, Hive, HBase, WebHDFS, Flume, Sqoop
OPERATIONAL SERVICES: Ambari, Oozie
PLATFORM SERVICES: High Availability, Disaster Recovery, Snapshots, Security, etc.
Runs on: OS, Cloud, VM, Appliance
Enterprise readiness: the ONLY 100% open source and complete distribution; enterprise grade, proven and tested at scale; ecosystem endorsed to ensure interoperability
17. Business Cases
Refine (Batch), Explore (Interactive), Enrich (Online)
HORTONWORKS DATA PLATFORM
Big Data: Transactions, Interactions, Observations
18. APPLICATIONS (Business Analytics, Custom Applications, Enterprise Applications): Refine
Refine: collect data and apply a known algorithm to it in a trusted operational process.
1 Capture: capture all data from traditional sources (RDBMS, OLTP, OLAP) and new sources (web logs, email, sensor data, social media).
2 Process: parse, cleanse, apply structure & transform in the HORTONWORKS DATA PLATFORM.
3 Exchange: push to the existing data warehouse (data systems: RDBMS, EDW, MPP, traditional repos) for use with existing analytic tools.
19. APPLICATIONS (Business Analytics): Explore
Explore: collect data and perform iterative investigation for value.
1 Capture: capture all data from traditional sources (RDBMS, OLTP, OLAP) and new sources (web logs, email, sensor data, social media).
2 Process: parse, cleanse, apply structure & transform in the HORTONWORKS DATA PLATFORM.
3 Exchange: explore and visualize with analytics tools supporting Hadoop.
20. APPLICATIONS (Custom Applications, Enterprise Applications): Enrich
Enrich: collect data, analyze and present salient results for online apps.
1 Capture: capture all data from traditional sources (RDBMS, OLTP, OLAP) and new sources (web logs, email, sensor data, social media).
2 Process: parse, cleanse, apply structure & transform in the HORTONWORKS DATA PLATFORM.
3 Exchange: incorporate data directly into applications (data systems now include NOSQL alongside RDBMS, EDW, MPP and traditional repos).
21. Vertical use cases: Refine / Explore / Enrich
Retail & Web: Refine: Log Analysis / Site Optimization, Loyalty Program Optimization; Explore: Brand and Sentiment Analysis, Market Basket Analysis; Enrich: Dynamic Pricing, Session & Content Optimization, Product Recommendation
Telco: Refine: Customer Profiling; Explore: Equipment Failure Prediction; Enrich: Location-Based Advertising
Government: Refine: Threat Identification; Explore: Person of Interest Discovery; Enrich: Cross-Jurisdiction Queries
Finance: Refine: Risk Modeling & Fraud Identification, Trade Performance Analytics; Explore: Surveillance and Fraud Detection, Customer Risk Analysis; Enrich: Real-Time Upsell, Cross-Sell Marketing Offers
Energy: Refine: Smart Grid: Production Optimization, Smart Meters; Explore: Grid Failure Prevention; Enrich: Individual Power Grid
Manufacturing: Refine: Supply Chain Optimization; Explore: Customer Churn Analysis; Enrich: Dynamic Delivery, Replacement Parts
Healthcare: Refine: Electronic Medical Records (EMPI); Explore: Clinical Decision Support, Clinical Trials Analysis; Enrich: Insurance Premium Determination
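To make one of these entries concrete, here is a minimal HiveQL sketch of a market basket analysis; the transactions table and its columns are hypothetical. It counts how often pairs of products appear in the same basket:

    -- Hypothetical table: one row per (basket, product) purchase
    CREATE TABLE IF NOT EXISTS transactions (
      basket_id  STRING,
      product_id STRING
    );

    -- Co-occurrence counts for product pairs bought together
    SELECT a.product_id AS product_a,
           b.product_id AS product_b,
           COUNT(*)     AS times_together
    FROM transactions a
    JOIN transactions b ON a.basket_id = b.basket_id
    WHERE a.product_id < b.product_id
    GROUP BY a.product_id, b.product_id
    ORDER BY times_together DESC
    LIMIT 50;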
29. Loading data from ASV into HDFS, running queries, aggregating results
AZURE HDINSIGHT SERVER
30. Klout architecture
Front ends and APIs: Registrations DB (MySQL); Klout.com (Node.js); Mobile (Objective-C); Klout API; Partner API (Mashery)
Processing: Collectors (Java/Scala); Signal Data (HBase); Enhancement Data; Engine (Pig/Hive); Warehouse (Hive)
Serving stores: Profile DB (Scala); Search Index (Elastic Search); Streams (MongoDB)
Analytics: Dashboards (Tableau); Analytics Cubes (SSAS); Perks Analytics (Scala); Event Tracker (Scala)
Monitoring: Nagios
Case Study: Data Services Firm Uses Microsoft BI and Hadoop to Boost Insight into Big Data
31. Data sources | Data acquisition, storage and processing | Business Intelligence | Supervision
Data sources: RDBMS (bulk load); file system (connector)
Acquisition, storage and processing: PIG, HIVE, MAHOUT, Pegasus over Map/Reduce; a Name Node and Data Nodes; file system: ASV, HDFS
Business Intelligence: Reporting, CEP, OLAP
Supervision: System Center; Application Server
32. Cloud Services / Virtual Machine / On-premise
Data sources: file system (via Flume); SQL Database (via SQOOP); streams (via StreamInsight)
Acquisition, storage and processing: HDInsight Services, running PIG, HIVE, MAHOUT, Pegasus over Map/Reduce; a Name Node and Data Nodes; file system: ASV, HDFS; hosted on Microsoft Windows Azure
Business Intelligence: SQL Reporting / SSRS, SSAS, SharePoint
Supervision: System Center
33. Aggregating data from multiple sources
AZURE HDINSIGHT SERVER,
SQL 2012, POWERPIVOT,
POWERVIEW
35. • Submit changes back to the Apache Foundation
• ‘Just works’ on Windows Azure and Windows Server
• Integration with Visual Studio, JavaScript, Excel, etc.
• Performance, scale, high availability
• Management, ease of use
• Security, data governance
• Integration with AD and System Center
• Integrate as part of our overall data platform
As the volume of data has exploded, we increasingly see organizations acknowledge that not all data belongs in a traditional database. The drivers are both cost (as volumes grow, database licensing costs can become prohibitive) and technology (databases are not optimized for very large datasets). Instead, we increasingly see Hadoop, and HDP in particular, being introduced as a complement to the traditional approaches. It is not replacing the database; it is a complement, and as such it must integrate easily with existing tools and approaches. This means it must interoperate with: existing applications such as Tableau, SAS, Business Objects, etc.; existing databases and data warehouses, for loading data to and from the data warehouse; development tools used for building custom applications; and operational tools for managing and monitoring.
Across all of our user base, we have identified just three distinct usage patterns; sometimes more than one is used in concert during a complex project, but the patterns are distinct nonetheless. These are Refine, Explore and Enrich. The first of these, the Refine case, is probably the most common today. It is about taking very large quantities of data and using Hadoop to distill them down into a more manageable data set that can then be loaded into a traditional data warehouse for use with existing tools. This is relatively straightforward and allows an organization to harness a much larger data set for its analytics applications while leveraging its existing data warehousing and analytics tools. Using the graphic here: in step 1 data is pulled from a variety of sources into the Hadoop platform, in step 2 it is processed, and in step 3 it is loaded into a data warehouse for analysis by existing BI tools.
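As a sketch of the Refine pattern in HiveQL (the log layout and all names here are hypothetical): raw web logs are exposed as an external table, distilled into a small aggregate, and only the aggregate is pushed on to the warehouse:

    -- Steps 1-2: expose raw logs captured into the cluster
    CREATE EXTERNAL TABLE raw_weblogs (
      ts      STRING,
      user_id STRING,
      url     STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/weblogs/';

    -- Step 2: distill the raw events into a compact daily summary
    CREATE TABLE daily_traffic AS
    SELECT to_date(ts) AS day, url, COUNT(*) AS hits
    FROM raw_weblogs
    GROUP BY to_date(ts), url;

    -- Step 3: daily_traffic is now small enough to export to the
    -- EDW with an existing loader (Sqoop, for instance).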
A second use case is what we would refer to as Data Exploration; this is the use case most commonly in question when people talk about “data science”. In simplest terms, it is about using Hadoop as the primary data store rather than performing the secondary step of moving data into a data warehouse. To support this use case, BI tool vendors have rallied to add support for Hadoop, and most commonly HDP, as a peer to the database, allowing rich analytics on extremely large datasets that would be both unwieldy and costly in a traditional data warehouse. Hadoop allows for interaction with a much richer dataset and has spawned a whole new generation of analytics tools that rely on Hadoop (HDP) as the data store. To use the graphic: in step 1 data is pulled into HDP, in step 2 it is stored and processed, and in step 3 it is surfaced directly into the analytics tools for the end user.
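In HiveQL terms, Explore is simply querying the full raw dataset in place, for example from a BI tool connected through the Hive ODBC driver. A hedged sketch, reusing the hypothetical raw_weblogs table above:

    -- Iterative, ad-hoc investigation directly over all raw events
    SELECT user_id, COUNT(DISTINCT url) AS distinct_pages
    FROM raw_weblogs
    WHERE ts >= '2013-01-01'
    GROUP BY user_id
    HAVING COUNT(DISTINCT url) > 100;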
The final use case is called Application Enrichment. This is about incorporating data stored in HDP to enrich an existing application, for example an online application in which we want to surface custom information to a user based on their particular profile. If a user has been searching the web for information on home renovations, your application may want to use that knowledge to surface a custom offer for a related product that you sell. Large web companies such as Facebook are very sophisticated in their use of this approach. In the diagram, this is about pulling data from disparate sources into HDP in step 1, storing and processing it in step 2, and then interacting with it directly from your applications in step 3, typically in a bi-directional manner (e.g. request data, return data, store the response).
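A hedged HiveQL sketch of Enrich (the page_views table and all names are hypothetical): a periodic job condenses behavioral data into a compact per-user table, which the online application then reads back, in practice often via a low-latency serving store such as HBase:

    -- Periodically recompute each user's interest profile
    CREATE TABLE user_category_counts AS
    SELECT user_id, category, COUNT(*) AS views
    FROM page_views
    GROUP BY user_id, category;
    -- The serving tier looks up a user's top categories in this
    -- table to pick which custom offer to surface.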
In the current developer preview on www.hadooponazure.com, data stored in ASV can be accessed directly from the Interactive JavaScript Console by prefixing the protocol scheme of the URI for the assets you are accessing with asv://. To use this feature in the current release, you will need HDInsight and Windows Azure Blob Storage accounts. To access your storage account from HDInsight, go to the Cluster and click on the Manage Cluster tile.
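One way to exercise this from Hive (a sketch; the container and path are placeholders) is an external table whose LOCATION uses the asv:// scheme described above, so queries read the blobs in place:

    -- External table over blobs in Azure Storage; no copy into HDFS
    CREATE EXTERNAL TABLE weblogs_asv (
      ts  STRING,
      url STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 'asv://mycontainer/weblogs/';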
Azure Vault Storage (ASV) and the Hadoop Distributed File System (HDFS) implemented by HDInsight on Azure are distinct file systems that are optimized, respectively, for the storage of data and for computations on that data. ASV provides a highly scalable and available, low-cost, long-term, and shareable storage option for data that is to be processed using HDInsight. With ASV, you still process across all nodes in the cluster. The use case for using Azure Blob Storage as the backing store for your data is that you can scale compute independently of data (e.g. you can spin up a Hadoop cluster only when you need it, and keep your data in blob storage). When data is stored in ASV, your map/reduce jobs will run across multiple nodes. The Hadoop clusters deployed by HDInsight on HDFS are optimized for running Map/Reduce (M/R) computational tasks on the data. HDInsight clusters are deployed in Azure on compute nodes to execute M/R tasks and are dropped once these tasks have been completed. Keeping the data in the HDFS clusters after computations have been completed would be an expensive way to store this data. ASV provides a full-featured HDFS file system over Azure Blob Storage (ABS). ABS is a robust, general-purpose Azure storage solution, so storing data in ABS enables the clusters used for computation to be safely deleted without losing user data. ASV is not only low-cost. It has been designed as an HDFS extension to provide a seamless experience to customers by enabling the full set of components in the Hadoop ecosystem to operate directly on the data it manages. Storage is located remotely from the worker nodes (there is no data-locality optimization). We have re-architected the networking infrastructure in our datacenters to accommodate the Hadoop scenario. All up, we have an incredibly low oversubscription ratio for networking, which means we can have a lot of throughput between Hadoop and Blob storage. With the right storage account placement and settings, a Medium VM can read from an Azure blob just as fast as it can read from the local disk. However, a single storage account is limited in size and overall transfer rate, so in order to scale out beyond these limitations, you will have to add storage accounts to your cluster. We are working to improve these numbers all the time. Regarding cluster VM placement, you decide at which data center the cluster will be deployed; as long as your storage account is placed at the same data center, you will get good throughput. Regarding copying data between ASV and HDFS, you can use 'hadoop fs -cp hdfs://... asv://...' to copy files from HDFS to ASV (and vice versa). In the upcoming release of HDInsight on Azure, ASV will be the default file system.
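A short sketch of the consequence described above, reusing the hypothetical weblogs_asv table: dropping an ASV-backed external table (or deleting the whole cluster) removes only metadata, and a later cluster can re-attach to the same blobs:

    -- Removes Hive metadata only; the blobs in ASV are untouched
    DROP TABLE weblogs_asv;

    -- A new cluster can re-create the table over the same data
    CREATE EXTERNAL TABLE weblogs_asv (ts STRING, url STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 'asv://mycontainer/weblogs/';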
Storage: HDFS is the distributed file system; ASV is Azure Storage Vault.
Task scheduling and execution: Map Reduce is the batch job framework.
ETL: PIG is a high-level language that describes job execution and flow.
SQL-like: HIVE provides HiveQL, a SQL-like language on top of Map Reduce. SQOOP enables data exchange between relational databases & Hadoop.
BI: Hive ODBC is used to move data out of Hadoop from a HIVE table.
Programmability: .NET HDInsight SDK; LINQ to Hive.