This document discusses how Hadoop can be used in a Microsoft-based enterprise to make sense of fragmented data. It proposes starting simply, using Hadoop (HDInsight) for log storage, and then reusing existing ETL processes while adding Hadoop incrementally. This lets organizations understand their data in a more complete, contextual way and take a customer-centric view, such as seeing all of a customer's interactions across systems. Starting small and involving different departments can encourage experimentation with Hadoop and analytics.
4. The reality is slightly more complicated
[Diagram labels: Client OS, Office, Active Directory, System Center, Applications, Server OS, Virtualization, NTFS, ETL, OLTP, OLAP, SAN / NAS]
15. Applied to Customer
[Diagram labels. Data sources: Transactional, Social Media, Web Logs, IVR Logs, Email & Chat. Hadoop components: Sqoop, Flume, Hive, Pig, HDFS]
16. Transactional Data
ID  Sector  Rev  Code  Acct   Amount
01  Q       X    QW    10000  1200
02  P       X    AB    20000  1020
03  O       X    CD    20000  11221
04  N       XX   QW    50000  2323
05  M       X    CD    30000  33
06  L       XX   AB    10000  323231
07  K       X    ER    20000  12
08  J       XXX  CD    20000  3233
09  I       X    QW    40000  5468
10  H       X    ER    50000  234
11  G       O    AB    55000  765
12  F       X    ER    70000  34538
13  E       XX   ER    25000  3456476
14  D       XXX  AB    10000  4564
15  C       X    QW    10000  456
16  B       XX   YZ    11000  44
17  A       X    AB    15000  456
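As a rough illustration of the kind of question this transactional table can answer, the sketch below totals Amount per Code in plain Python. The slide does not define the columns, so treating Amount as an additive number and Code as a category is an assumption, and the helper names `parse` and `totals_by` are hypothetical. In Hive the same aggregation would typically be written as a GROUP BY query.

```python
from collections import defaultdict

# Rows copied from the slide. Column meanings are not given there;
# this sketch ASSUMES Amount is an additive numeric value.
COLUMNS = ["ID", "Sector", "Rev", "Code", "Acct", "Amount"]
TABLE = """\
01 Q X QW 10000 1200
02 P X AB 20000 1020
03 O X CD 20000 11221
04 N XX QW 50000 2323
05 M X CD 30000 33
06 L XX AB 10000 323231
07 K X ER 20000 12
08 J XXX CD 20000 3233
09 I X QW 40000 5468
10 H X ER 50000 234
11 G O AB 55000 765
12 F X ER 70000 34538
13 E XX ER 25000 3456476
14 D XXX AB 10000 4564
15 C X QW 10000 456
16 B XX YZ 11000 44
17 A X AB 15000 456
"""


def parse(table):
    """Turn each whitespace-separated line into a dict keyed by column name."""
    return [dict(zip(COLUMNS, line.split()))
            for line in table.splitlines() if line.strip()]


def totals_by(rows, key):
    """Sum Amount for each distinct value of the given column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += int(row["Amount"])
    return dict(totals)


rows = parse(TABLE)
by_code = totals_by(rows, "Code")
for code, total in sorted(by_code.items()):
    print(code, total)
```

Passing a different column name, for example `totals_by(rows, "Rev")`, groups the same rows another way without changing the parsing step.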
25. Incorporate log analysis
• Give users access to HDInsight / Hadoop with tools they know
• Encourage departmental experimentation
• Development and IT departments are good early adopters
• So is marketing
26. Tap into existing ETL
• You already have a TON of ETL
• Reuse as much as you can
• At first aim for replication
• Incrementally add value
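One way to read "aim for replication, then incrementally add value" is sketched below: first copy rows unchanged and verify that the copy matches the source, then add a derived field in a later step while leaving the replicated fields intact. This is only an illustration under stated assumptions. The function names, the lowercase field names, the sample rows, and the 10000 threshold are all hypothetical and do not come from the slides or from any Hadoop API.

```python
import hashlib


def fingerprint(rows):
    """Order-independent fingerprint: hash each row, then hash the sorted hashes."""
    row_hashes = sorted(
        hashlib.sha256(repr(sorted(row.items())).encode("utf-8")).hexdigest()
        for row in rows
    )
    return hashlib.sha256("".join(row_hashes).encode("utf-8")).hexdigest()


def replicate(source_rows):
    """Step 1: copy rows unchanged, mirroring what the existing ETL delivers."""
    return [dict(row) for row in source_rows]


def add_value(rows):
    """Step 2 (later): add a derived field; replicated fields stay untouched.

    The 'large' flag and its 10000 threshold are hypothetical.
    """
    return [{**row, "large": row["amount"] >= 10000} for row in rows]


# Hypothetical sample rows standing in for an existing ETL feed.
source = [
    {"id": "01", "code": "QW", "amount": 1200},
    {"id": "03", "code": "CD", "amount": 11221},
]

# Replication first: the copy must match the source exactly.
replica = replicate(source)
assert fingerprint(replica) == fingerprint(source)

# Added value later: original fields must still match after enrichment.
enriched = add_value(replica)
original_fields = [{k: row[k] for k in source[0]} for row in enriched]
assert fingerprint(original_fields) == fingerprint(source)
```

Checking the fingerprint before and after enrichment keeps the new system comparable with the existing one, so each added step can be introduced without disturbing what is already trusted.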