Microsoft on Big Data

Microsoft on
Big Data
Donnerstag, 28.05.2015

Vorweg:
 Wir sind heute live auf Meerkat

Agenda
 Was ist Big Data?
 Funktionsweise und Ansätze
 Microsoft Architektur
 Hadoop und Map Reduce
 Pig

Die 3 Vs
Quelle: http://www.datasciencecentral.com/forum/topics/the-3vs-that-define-big-data

Why Big Data?
 2008: Google processes 20 PB a day
 2009: Facebook has 2.5 PB user data + 15 TB/day
 2009: eBay has 6.5 PB user data + 50 TB/day
 2011: Yahoo! has 180-200 PB of data
 2012: Facebook ingests 500 TB/day

Nächster Großer Datenlieferant

How to store data?
 Data storage is not trivial
 Data volumes are massive
 Reliably storing PBs of data is challenging
 Disk/hardware/network failures
 Probability of failure event increases with number of machines
 For example:
 1000 hosts, each with 10 disks
 a disk lasts 3 year
 how many failures per day?

Historical basics
 Hadoop is an open-source implementation based on GFS and MapReduce from Google Sanjay
Ghemawat, Howard Gobioff, and Shun-Tak Leung. (2003)
 The Google File System Jeffrey Dean and Sanjay Ghemawat. (2004)
 MapReduce: Simplified Data Processing on Large Clusters OSDI 2004

Klassische Big Data Architektur  Hadop

Characteristics and Features
 Distributed file system
 Redundant storage
 Designed to reliably store data using commodity hardware
 Designed to expect hardware failures
 Intended for large files
 Designed for batch inserts
 The Hadoop Distributed File System

HDFS - files and blocks
 Files are stored as a collection of blocks
 Blocks are 64 MB chunks of a file (configurable)
 Blocks are replicated on 3 nodes (configurable)
 The NameNode (NN) manages metadata about files and blocks
 The SecondaryNameNode (SNN) holds a backup of the NN data
 DataNodes (DN) store and serve blocks

Replication
 Multiple copies of a block are stored
 Replication strategy:
 Copy #1 on another node on same rack
 Copy #2 on another node on different rack

Failure DataNode
 DNs check in with the NN to report health
 Upon failure NN orders DNs to replicate under-replicated blocks

Distributed Storage
(HDFS)
Query
(Hive)
Distributed Processing
(MapReduce)
ODBC
Legend
Red = Core
Hadoop
Blue = Data
processing
Purple =
Microsoft
integration
points and
value adds
Orange = Data
Movement
Green =
Packages

Programming Models
Pig
Data scripting language
Hive
SQL-like set-oriented language
Pegasus, Giraph
Graph processing

Vorgehen
 Ziel Verteilung von Streams über Tag und Nutzer
 C# Dienst  Daten sammeln
 Persistierung in Azure
 Aufbereitung und Analyse mit Hive
 Analyse in Excel

Beispiel: Social Media Analyse

Analyse der Ergebnisse mit Excel

Beispiel: Analyse von Freitext

Quelle: Plenarprotokolle Bundestag

Verarbeitung der Daten mit Hadoop

What is Azure DocumentDB?
It is a fully managed, highly scalable, queryable, schema-free document
database, delivered as a service, for modern applications.
Query against Schema-Free JSON
Multi-Document transactions
Tunable, High Performance
Designed for cloud first
40

Azure DocumentDB Resources
41
Source: http://azure.microsoft.com/en-us/documentation/articles/documentdb-introduction/

Traditional RDBMS vs. MapReduce

Do I really need Hadoop?
Velocity
Variety
Highly
Structured
Poly
Structured
Batch Realtime

Ausblick: Data Management Prozesse
 Ziel: Big Data Pipeline kombinieren
 Steuern und Administrieren von Diensten
 Produkt: Azure Data Factory

Call Log Files
Customer Table
Call Log
Files
Customer Table
Customer
Churn Table
Data Factory
Concepts
Data
Sources
Ingest Transform & Analyze Publish
Customer
Call Details
Customers
Likely to
Churn

Zusammenfassung
 Datenanalyse verändert sich
 Technologien abwägen (JSON in Integration Services)
 Daten Analysten sind nicht überflüssig
 Das Toolset muss sich erweitern
 Coole Vorlesung zum Weiter machen http://blogs.ischool.berkeley.edu/i290-abdt-s12/

Microsoft on Big Data

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Andere mochten auch

Andere mochten auch (20)

Ähnlich wie Microsoft on Big Data

Ähnlich wie Microsoft on Big Data (20)

Mehr von Yvette Teiken

Mehr von Yvette Teiken (7)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Microsoft on Big Data

Hinweis der Redaktion