SQL SERVER 2012 AND BIG DATA
Hadoop Connectors for SQL Server
TECHNICALLY – WHAT IS HADOOP
• Hadoop consists of two key services:
  • Data storage using the Hadoop Distributed File System (HDFS)
  • High-performance parallel data processing using a technique called
    MapReduce.
HADOOP IS AN ENTIRE ECOSYSTEM
•   HBase as the database
•   Hive as the data warehouse
•   Pig as the query language
•   All built on top of Hadoop and the MapReduce framework.
HDFS
• HDFS is designed to scale out seamlessly
  • That's its strength!
• Scaling horizontally is non-trivial in most systems.
• HDFS scales by throwing more hardware at it.
  • A lot of it!
  • HDFS I/O is largely asynchronous
  • That is what links Hadoop to cloud computing.
DIFFERENCES
• How does this differ from SQL Server and Windows 2008 R2's NTFS?
  •   Data is not stored in the traditional table/column format.
  •   HDFS supports only forward-only parsing
  •   Databases built on HDFS don't guarantee ACID properties
  •   The code is taken to the data, rather than the data to the code
  •   SQL Server scales better vertically
UNSTRUCTURED DATA
• Doesn't know/care about column names, column data types, column
  sizes or even the number of columns.
• Data is stored in delimited flat files
• You're on your own with respect to data cleansing
• Data input in Hadoop is as simple as loading your data file into HDFS
  • It's very close to copying files on an OS.
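As a sketch of how close this is to plain file copying, loading data with the Hadoop command-line client looks like this (the file name and HDFS paths are illustrative):

```shell
# Copy a local delimited file into HDFS -- the cluster handles
# splitting and replicating it across data nodes.
hadoop fs -put sales.csv /user/demo/input/

# Verify it arrived, much like 'ls' on a local file system.
hadoop fs -ls /user/demo/input/
```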
NO SQL, NO TABLES, NO COLUMNS – NO DATA?
• Write code to do MapReduce
  • You have to write code to get at the data
• The best way to get data
  • is to write code that calls the MapReduce framework to slice and dice the
    stored data
• Step 1 is Map and Step 2 is Reduce.
MAP (REDUCE)
• Mapping
  •   Pick your selection of keys from each record (linefeed-delimited)
  •   Tell the framework what your key is and what values that key will hold
  •   MR will deal with the actual creation of the map
  •   Control which keys to include or which values to filter out
  •   You end up with a giant hashtable
(MAP) REDUCE
• Reducing data: once the map phase is complete, the code moves on to
  the reduce phase. The reduce phase works on the mapped data and can
  potentially do all the aggregation and summation activities.
• Finally you get a blob of the mapped and reduced data.
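The two phases can be sketched in plain Python as a conceptual word count. This is not the Hadoop API itself: in Hadoop the framework distributes the map tasks and performs the grouping ("shuffle") across the cluster.

```python
# Conceptual word-count sketch of the two MapReduce phases.
from collections import defaultdict

def map_phase(records):
    """Map: emit a (key, value) pair for every word in every record."""
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: aggregate all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:          # the shuffle step, done by hand here
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase(["Hadoop stores data", "Hadoop maps data"]))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'maps': 1}
```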
JAVA… VS. PIG…
• Pig is a querying engine
  •   Has a 'business-friendly' syntax
  •   Spits out MapReduce code
  •   The syntax for Pig is called Pig Latin (don't ask)
  •   Pig Latin is syntactically very similar to LINQ.
• Pig converts queries into MapReduce, sends them off to Hadoop, then
  retrieves the results
• Roughly half the performance of hand-written MapReduce
• But 10 times faster to write
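As a sketch of that 'business-friendly' syntax, here is a word count in Pig Latin (file and relation names are illustrative); compare it with writing the equivalent mapper and reducer in Java:

```pig
-- Pig compiles this script into MapReduce jobs behind the scenes.
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
STORE counts INTO 'wordcounts';
```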
HBASE
• HBase is a key-value store on top of HDFS
• This is the NoSQL database
• A very thin layer over raw HDFS
  •   Data is grouped in a table that has rows of data.
  •   Each row can have multiple 'column families'
  •   Each 'column family' contains multiple columns.
  •   Each column name is the key, and it has its corresponding column value.
  •   Rows don't all need to have the same number of columns
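The layout above can be pictured as nested dictionaries. This is a hypothetical in-memory model of HBase's addressing scheme, with invented row, family and column names:

```python
# table -> row key -> column family -> column qualifier -> value.
users = {
    "row-001": {
        "info":    {"name": "Ann", "city": "Ghent"},
        "metrics": {"logins": "42"},
    },
    # Rows need not share the same columns (or column count):
    "row-002": {
        "info": {"name": "Bob"},
    },
}

# A cell lookup works the way HBase addresses data:
# row key first, then family and qualifier.
city = users["row-001"]["info"]["city"]
print(city)  # Ghent
```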
HIVE
• Hive is a little closer to RDBMS systems
• It is a data warehouse (DWH) system on top of HDFS and HBase
  • Performs join operations between HBase tables
• Maintains a meta layer
  • data summation, ad-hoc queries and analysis of large data stores in HDFS
• High-level language
  • Hive Query Language (HiveQL) looks like SQL, but is restricted
  • No updates or deletes are allowed
  • partitioning can be used to update information
    o Essentially re-writing a chunk of data.
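A sketch of what that looks like in HiveQL (table, column and partition names are illustrative): instead of updating rows, you overwrite the whole partition that contains them.

```sql
-- HiveQL looks like SQL but compiles down to MapReduce.
CREATE TABLE logs (ts STRING, level STRING, msg STRING)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- No UPDATE/DELETE: to change one day's data, re-write its partition.
INSERT OVERWRITE TABLE logs PARTITION (dt = '2012-01-15')
SELECT ts, level, msg FROM staging_logs WHERE dt = '2012-01-15';
```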
WINDOWS HADOOP – PROJECT ISOTOPE
• 2 flavours
  • Cloud
    o Azure CTP
  • On premise
    o Integration of the Hadoop File System with Active Directory
    o Integration of System Center Operations Manager with Hadoop
    o BI integration
• The distributions are not all that interesting in and of themselves, but the
  data and tools are
    o Sqoop
      – Integration with SQL Server
    o Flume
      – Access to lots of data
SQOOP
• Sqoop is a framework that facilitates transfer between relational
  databases (RDBMS) and HDFS.
• Uses MapReduce programs to import and export data.
• Imports and exports are performed in parallel, with fault tolerance.

• Source / Target files being used by Sqoop can be:
  • delimited text files
  • binary SequenceFiles containing serialized record data.
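A hypothetical Sqoop import of a SQL Server table into delimited text files on HDFS; the server name, database, credentials and paths are all placeholders:

```shell
# Import runs as parallel map tasks (-m 4 = four mappers).
sqoop import \
  --connect "jdbc:sqlserver://dbserver;database=Sales" \
  --username hadoop -P \
  --table Orders \
  --target-dir /user/demo/orders \
  --as-textfile \
  -m 4
```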
SQL SERVER – HORTONWORKS - HADOOP
• Spin-off from Yahoo
• Bridge the technological gaps between Hadoop and Windows Server
• CTP of the Hadoop-based distribution for Windows Server
  ( somewhere in 2012)
• Will work with Microsoft‟s business-intelligence tools
  • including
    o Excel
    o PowerPivot
    o PowerView
HADOOP CONNECTORS
• SQL Server versions
  •   Azure
  •   PDW
  •   SQL 2012
  •   SQL 2008 R2
  • http://www.microsoft.com/download/en/details.aspx?id=27584
WITH SQL SERVER-HADOOP CONNECTOR, YOU CAN:
• A Sqoop-based connector
• Import
  •   tables in SQL Server to delimited text files on HDFS
  •   tables in SQL Server to SequenceFiles on HDFS
  •   tables in SQL Server to tables in Hive
  •   results of queries executed on SQL Server to delimited text files on HDFS
  •   results of queries executed on SQL Server to SequenceFiles on HDFS
  •   results of queries executed on SQL Server to tables in Hive
• Export
  • delimited text files on HDFS to tables in SQL Server
  • SequenceFiles on HDFS to tables in SQL Server
  • Hive tables to tables in SQL Server
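And a hypothetical export in the other direction, pushing delimited files on HDFS back into a SQL Server table (again, all names and paths are placeholders):

```shell
# Exports also run as parallel MapReduce jobs.
sqoop export \
  --connect "jdbc:sqlserver://dbserver;database=Sales" \
  --username hadoop -P \
  --table OrdersArchive \
  --export-dir /user/demo/orders \
  --input-fields-terminated-by ','
```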
SQL SERVER 2012 ALONGSIDE THE ELEPHANT
• PowerView utilizes its own class of apps, if you will, that Microsoft is
  calling insights.
• SQL Server will extend insights to Hadoop data sets
• Interesting insights can be
  • brought into a SQL Server environment using the connectors
  • used to drive analysis with the BI tools.
WHY USE HADOOP WITH SQL SERVER
• Don't just think about big data as large volumes
  • Analyze both structured and unstructured datasets
  • Think about workload, growth, accessibility and even location
  • Can the amount of data stored every day reliably be written to a
    traditional HDD?

• MapReduce is more complex than T-SQL
  • Many companies try to avoid writing Java for queries
  • Front ends are immature relative to the tooling available in the relational
    database world
  • It's not going to replace your database, but your database isn't likely to
    replace Hadoop either.
MICROSOFT AND HADOOP
• Broader access of Hadoop to:
  • End users
  • IT professionals
  • Developers
• Enterprise-ready Hadoop distribution with greater security,
  performance and ease of management.
• Breakthrough insights through the use of familiar tools such as Excel,
  PowerPivot, SQL Server Analysis Services and Reporting Services.
ENTERPRISE HADOOP
• Installation wizard (IsotopeClusterDeployment)
• Health-check and monitoring pages
• Interactive JavaScript console
MICROSOFT ENTERPRISE HADOOP
• Machines in the Hadoop cluster must be running Windows Server 2008
  or higher
• IPv4 networking enabled on all nodes
  • Deployment does not work on an IPv6-only network.
• The ability to create a new user account called "Isotope".
  • Will be created on all nodes of the cluster.
  • Used for running the Hadoop daemons and running jobs.
  • Must be able to copy and install the deployment binaries on each machine
• Windows File Sharing services must be enabled on each machine that will
  be joined to the Hadoop cluster.
• .NET Framework 4 installed on all nodes.
• Minimum of 10 GB free space on the C: drive (a JBOD HDFS configuration
  is supported)
© 2011 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 

SQL Server 2012 and Big Data

  • 1. SQL SERVER 2012 AND BIG DATA
    Hadoop Connectors for SQL Server
  • 2. TECHNICALLY – WHAT IS HADOOP
    • Hadoop consists of two key services:
      • Data storage using the Hadoop Distributed File System (HDFS)
      • High-performance parallel data processing using a technique called MapReduce
  • 3. HADOOP IS AN ENTIRE ECOSYSTEM
    • HBase as the database
    • Hive as a data warehouse
    • Pig as the query language
    • All built on top of Hadoop and the MapReduce framework
  • 4. HDFS
    • HDFS is designed to scale seamlessly
      • That's its strength!
    • Scaling horizontally is non-trivial in most cases
    • HDFS scales by throwing more hardware at it
      • A lot of it!
    • HDFS is asynchronous
    • This is what links Hadoop to cloud computing
  • 5. DIFFERENCES
    • Compared with SQL Server and Windows Server 2008 R2's NTFS:
      • Data is not stored in the traditional table/column format
      • HDFS supports only forward-only parsing
      • Databases built on HDFS don't guarantee ACID properties
      • You take the code to the data, not the data to the code
      • SQL Server scales better vertically
  • 6. UNSTRUCTURED DATA
    • Hadoop doesn't know or care about column names, column data types, column sizes, or even the number of columns
    • Data is stored in delimited flat files
    • You're on your own with respect to data cleansing
    • Data input in Hadoop is as simple as loading your data file into HDFS
      • It's very close to copying files on an OS
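The schema-less nature of these flat files can be sketched in a few lines of plain Python (this is illustrative, not SQL Server or Hadoop API code): nothing enforces field names, types, or even a consistent column count, so the meaning of each position is decided at read time.

```python
# Sketch: reading schema-less, delimited flat-file records the way
# HDFS-stored data is typically consumed. The sample records are
# hypothetical; note the rows don't even agree on how many fields exist.
records = [
    "1\tAlice\t2012-03-01",
    "2\tBob\t2012-03-02\textra-column",  # rows may disagree on width
]

parsed = [line.split("\t") for line in records]
# The consumer, not the store, decides what each position means.
print([len(row) for row in parsed])  # [3, 4]
```

This is why the slide says you are on your own for cleansing: the storage layer happily accepts both rows.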
  • 7. NO SQL, NO TABLES, NO COLUMNS – NO DATA?
    • You write code to do MapReduce
      • You have to write code to get data
    • The best way to get data:
      • Write code that calls the MapReduce framework to slice and dice the stored data
    • Step 1 is Map and Step 2 is Reduce
  • 8. MAP (REDUCE)
    • Mapping
      • Pick your selection of keys from each record (records are linefeed-separated)
      • Tell the framework what your key is and what values that key will hold
      • MapReduce deals with the actual creation of the map
      • You control which keys to include and which values to filter out
      • You end up with a giant hashtable
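The map phase described above can be sketched in plain Python (not the Hadoop API): each linefeed-separated record is turned into key-value pairs by your code, while the framework handles building the actual map. The word-count key choice here is a hypothetical example.

```python
# Illustrative map phase: our code only decides what the key and value
# are; emitting one (word, 1) pair per word is the classic choice.
def map_phase(record):
    for word in record.split():
        yield (word, 1)  # key = word, value = a count of 1

lines = ["big data", "big iron"]  # two linefeed-separated records
pairs = [kv for line in lines for kv in map_phase(line)]
print(pairs)  # [('big', 1), ('data', 1), ('big', 1), ('iron', 1)]
```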
  • 9. (MAP) REDUCE
    • Reducing data: once the map phase is complete, the code moves on to the reduce phase
    • The reduce phase works on mapped data and can potentially do all the aggregation and summation activities
    • Finally you get a blob of the mapped and reduced data
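Continuing the sketch, the "giant hashtable" and the reduce step look like this in plain Python (the grouping that Hadoop performs between the two phases is usually called the shuffle):

```python
# Illustrative shuffle + reduce: mapped (key, value) pairs are grouped
# by key, then each group is aggregated (summation, in this example).
from collections import defaultdict

mapped = [("big", 1), ("data", 1), ("big", 1), ("iron", 1)]

# Shuffle: group values by key -- the "giant hashtable".
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate each key's values.
reduced = {key: sum(values) for key, values in groups.items()}
print(reduced)  # {'big': 2, 'data': 1, 'iron': 1}
```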
  • 10. JAVA… VS. PIG…
    • Pig is a querying engine
      • It has a 'business-friendly' syntax
      • It spits out MapReduce code
    • The syntax for Pig is called Pig Latin (don't ask)
      • Pig Latin is syntactically very similar to LINQ
    • Pig converts queries into MapReduce, sends them off to Hadoop, then retrieves the results
    • About half the performance of hand-written MapReduce, but roughly 10 times faster to write
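As a rough analogy (in Python, not actual Pig Latin), the declarative style the slide compares to LINQ collapses the hand-written map and reduce phases above into a single grouping expression:

```python
# Hypothetical analogy only: the same word count that needed explicit
# map, shuffle, and reduce code expressed as one declarative grouping.
from collections import Counter

lines = ["big data", "big iron"]
word_counts = Counter(word for line in lines for word in line.split())
print(dict(word_counts))  # {'big': 2, 'data': 1, 'iron': 1}
```

Pig performs the equivalent translation for you: the query is compiled into MapReduce jobs behind the scenes.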
  • 11. HBASE
    • HBase is a key-value store on top of HDFS
      • This is the NoSQL database
      • A very thin layer over raw HDFS
    • Data is grouped into a table that has rows of data
      • Each row can have multiple 'column families'
      • Each column family contains multiple columns
      • Each column name is the key, and it has its corresponding column value
      • Rows don't all need to have the same number of columns
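HBase's logical model, as described above, can be sketched as nested dictionaries: row, then column family, then column name (the key) mapping to its value. The row and column names here are placeholders, not real HBase API calls.

```python
# Sketch of the HBase logical data model:
# table[row][family][column] = value. Rows need not share columns.
table = {
    "row1": {
        "family1": {"col A": "val A"},
        "family2": {"col B": "val B"},
    },
    "row2": {
        "family1": {"col A": "val X", "col C": "val Y"},  # different shape
    },
}

print(table["row1"]["family2"]["col B"])  # val B
```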
  • 12. HIVE
    • Hive is a little closer to RDBMS systems
    • It is a data warehouse system on top of HDFS and HBase
      • Performs join operations between HBase tables
      • Maintains a meta layer
      • Supports data summation, ad-hoc queries, and analysis of large data stores in HDFS
    • High-level language: Hive Query Language, which looks like SQL but is restricted
      • No updates or deletes are allowed
      • Partitioning can be used to update information
        • Essentially re-writing a chunk of data
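The "update by partition rewrite" idea can be sketched with a dictionary standing in for a partitioned warehouse (the partition names and records are hypothetical): since there is no row-level UPDATE or DELETE, changing one record means regenerating the whole partition that contains it.

```python
# Sketch: a Hive-style partitioned table as {partition: rows}.
warehouse = {
    "sales/dt=2012-01-01": [("A", 10), ("B", 20)],
    "sales/dt=2012-01-02": [("A", 5)],
}

# To "update" B's amount in the first partition, rewrite the entire
# partition's contents -- the granularity of change is the partition.
warehouse["sales/dt=2012-01-01"] = [("A", 10), ("B", 25)]
print(warehouse["sales/dt=2012-01-01"])  # [('A', 10), ('B', 25)]
```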
  • 13. WINDOWS HADOOP – PROJECT ISOTOPE
    • Two flavours:
      • Cloud
        • Azure CTP
      • On-premise
        • Integration of the Hadoop file system with Active Directory
        • Integration of System Center Operations Manager with Hadoop
        • BI integration
    • The distributions are not all that interesting in and of themselves, but the data and tools are:
      • Sqoop – integration with SQL Server
      • Flume – access to lots of data
  • 14. (image-only slide)
  • 15. SQOOP
    • A framework that facilitates transfer between relational database management systems (RDBMS) and HDFS
    • Uses MapReduce programs to import and export data
      • Imports and exports are performed in parallel with fault tolerance
    • Source/target files used by Sqoop can be:
      • Delimited text files
      • Binary SequenceFiles containing serialized record data
  • 16. SQL SERVER – HORTONWORKS – HADOOP
    • Hortonworks is a spin-off from Yahoo
    • Goal: bridge the technological gaps between Hadoop and Windows Server
    • CTP of the Hadoop-based distribution for Windows Server (sometime in 2012)
    • Will work with Microsoft's business-intelligence tools, including:
      • Excel
      • PowerPivot
      • Power View
  • 17. HADOOP CONNECTORS
    • SQL Server versions:
      • Azure
      • PDW
      • SQL 2012
      • SQL 2008 R2
    • http://www.microsoft.com/download/en/details.aspx?id=27584
  • 18. WITH THE SQL SERVER-HADOOP CONNECTOR, YOU CAN:
    • It is a Sqoop-based connector
    • Import:
      • Tables in SQL Server to delimited text files on HDFS
      • Tables in SQL Server to SequenceFiles on HDFS
      • Tables in SQL Server to tables in Hive
      • Results of queries executed on SQL Server to delimited text files on HDFS
      • Results of queries executed on SQL Server to SequenceFiles on HDFS
      • Results of queries executed on SQL Server to tables in Hive
    • Export:
      • Delimited text files on HDFS to SQL Server
      • SequenceFiles on HDFS to SQL Server
      • Hive tables to tables in SQL Server
  • 19. SQL SERVER 2012 ALONGSIDE THE ELEPHANT
    • Power View utilizes its own class of apps, if you will, that Microsoft is calling insights
    • SQL Server will extend insights to Hadoop data sets
    • Interesting insights can be:
      • Brought into a SQL Server environment using connectors
      • Used to drive analysis across it using BI tools
  • 20. WHY USE HADOOP WITH SQL SERVER
    • Don't just think about big data as large volumes
      • Analyze both structured and unstructured datasets
      • Think about workload, growth, accessibility, and even location
      • Can the amount of data stored every day be reliably written to a traditional HDD?
    • MapReduce is more complex than T-SQL
      • Many companies try to avoid writing Java for queries
      • Front ends are immature relative to the tooling available in the relational database world
    • It's not going to replace your database, but your database isn't likely to replace Hadoop either
  • 21. MICROSOFT AND HADOOP
    • Broader access to Hadoop for:
      • End users
      • IT professionals
      • Developers
    • An enterprise-ready Hadoop distribution with greater security, performance, and ease of management
    • Breakthrough insights through the use of familiar tools such as Excel, PowerPivot, SQL Server Analysis Services, and Reporting Services
  • 22. ENTERPRISE HADOOP
    • Installation wizard (IsotopeClusterDeployment)
    • Health-check and monitoring pages
    • Interactive JavaScript console
  • 23. MICROSOFT ENTERPRISE HADOOP
    • Machines in the Hadoop cluster must be running Windows Server 2008 or higher
    • IPv4 networking enabled on all nodes
      • Deployment does not work on an IPv6-only network
    • The ability to create a new user account called "Isotope"
      • Will be created on all nodes of the cluster
      • Used for running Hadoop daemons and running jobs
    • Must be able to copy and install the deployment binaries to each machine
      • Windows File Sharing services must be enabled on each machine that will be joined to the Hadoop cluster
    • .NET Framework 4 installed on all nodes
    • Minimum of 10 GB free space on the C: drive (a JBOD HDFS configuration is supported)
  • 24. © 2011 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Editor's Notes

  1. 1. Data is not stored in the traditional table/column format. At best some of the database layers mimic this, but deep in the bowels of HDFS there are no tables, no primary keys, no indexes. Everything is a flat file with predetermined delimiters. HDFS is optimized to recognize a <Key, Value> mode of storage; everything maps down to <Key, Value> pairs.
     2. HDFS supports only forward-only parsing. So you are either reading ahead or appending to the end. There is no concept of 'Update' or 'Insert'.
     3. Databases built on HDFS don't guarantee ACID properties, especially 'Consistency'. They offer what is called 'eventual consistency', meaning data will be saved eventually, but because of the highly asynchronous nature of the file system you are not guaranteed when it will finish. So HDFS-based systems are NOT ideal for OLTP systems; RDBMSs still rock there.
     4. Taking code to the data. In traditional systems you fire a query to get data and then write code to manipulate it. In MapReduce, you write code and send it to Hadoop's data store and get back the manipulated data. Essentially you are sending code to the data.
     5. Traditional databases like SQL Server scale better vertically, so more cores, more memory, and faster cores are the way to scale. Hadoop, by design, scales horizontally: keep throwing hardware at it and it will scale.
  2. Mapping data: if it is plain delimited text data, you have the freedom to pick your selection of keys and values from the record (remember, records are typically linefeed-separated) and tell the framework what your key is and what values that key will hold. MapReduce will deal with the actual creation of the map. When the map is being created you can control which keys to include and which values to filter out. In the end you end up with a giant hashtable of filtered key-value pairs. Now what?
  3. Well, if you are that scared of Java, then you have Pig. No, I am not calling names here. Pig is a querying engine that has a more 'business-friendly' syntax but spits out MapReduce code in the backend and does all the dirty work for you. The syntax for Pig is called, of course, Pig Latin. When you write queries in Pig Latin, Pig converts them into MapReduce and sends them off to Hadoop, then retrieves the results and hands them back to you. Analysis shows you get about half the performance of raw, optimally hand-written MapReduce Java code, but that hand-written code takes more than 10 times as long to write as the equivalent Pig query. If you are in the mood for a start-up idea, generating optimal MapReduce code from Pig Latin is a topic to consider. For those in the .NET world, Pig Latin is syntactically very similar to LINQ.
  4. HBase is a key-value store that sits on top of HDFS. It is a NoSQL database. It has a very thin veneer over raw HDFS, wherein it mandates that data is grouped into a table that has rows of data. Each row can have multiple 'column families', and each column family can contain multiple columns. Each column name is the key, and it has its corresponding column value. So a column of data can be represented as row[family][column] = value. Each row need not have the same number of columns. Think of each row as a horizontal linked list that links to column families, where each column family in turn links to multiple columns as <Key, Value> pairs: row1 -> family1 -> col A = val A; row1 -> family2 -> col B = val B; and so on.
  5. Hive is a little closer to traditional RDBMS systems. In fact it is a data warehousing system that sits on top of HDFS but maintains a meta layer that helps with data summation, ad-hoc queries, and analysis of large data stores in HDFS. Hive supports a high-level language called Hive Query Language that looks like SQL but is restricted in a few ways; for example, no updates or deletes are allowed. However, Hive has a concept of partitioning that can be used to update information, which is essentially re-writing a chunk of data whose granularity depends on the schema design. Hive can actually sit on top of HBase and perform join operations between HBase tables.
  6. Isotope is more than the distributions that the Softies are building with Hortonworks. Isotope also refers to the whole "tool chain" of supporting big-data analytics offerings that Microsoft is packaging up around the distributions. Microsoft's big-picture concept is that Isotope will give all kinds of users, from technical to "ordinary" productivity workers, access from inside data-analysis tools they know — like Microsoft's own SQL Server Analysis Services, PowerPivot and Excel on their PCs — to data stored in Windows Server and/or Windows Azure. (The Windows Azure Marketplace fits in here, as this is the place where third-party providers can publish free or paid collections of data which users will be able to download or buy.) To accelerate its adoption in the enterprise, Microsoft will make Hadoop enterprise-ready with:
     - Active Directory integration: providing enterprise-class security through integration of Hadoop with Active Directory
     - High performance: boosting Hadoop performance to offer consistently high data throughput
     - System Center integration: simplifying management of the Hadoop infrastructure through integration with Microsoft's management tools such as System Center
     - BI integration: enabling integration of relational and Hadoop data into enterprise BI solutions with Hadoop connectors
     - Flexibility and choice, with deployment options for Windows Server and Windows Azure, which offers customers:
       o Freedom to choose: more control, as they can choose which data to keep in-house instead of in the cloud
       o Lower TCO: cost savings, as fewer resources are required to run their Hadoop deployment in the cloud
       o Elasticity to meet demand: elasticity reduces costs, since more nodes can be added to the Windows Azure deployment for more demanding workloads. In addition, the Azure deployment of Hadoop can be used to extend the on-premise solution in periods of high demand
       o Increased performance: bringing computing closer to the data – the solution enables customers to process data closer to where data is born, whether on premise or in the cloud
     All this while maintaining compatibility with existing Hadoop tools such as Pig, Hive, and Java. The goal is to ensure that applications built on Apache Hadoop can be easily migrated to this distribution to run on Windows Azure or Windows Server.
  7. For developers, Microsoft is investing to make JavaScript a first-class language within big data by making it possible to write high-performance Map/Reduce jobs using JavaScript. In addition, the JavaScript console will allow users to write JavaScript Map/Reduce jobs, Pig Latin, and Hive queries from the browser to execute their Hadoop jobs.
     - Analyze Hadoop data with familiar tools such as Excel, thanks to a Hive add-in for Excel
     - Reduce time to solution through integration of Hive and Microsoft BI tools such as PowerPivot and Power View
     - Build corporate BI solutions that include Hadoop data, through integration of Hive and leading BI tools such as SQL Server Analysis Services and Reporting Services
     Customers can use this connector (on an already deployed Hadoop cluster) to analyze unstructured or semi-structured data from various sources and then load the processed data into PDW, efficiently transferring terabytes of data between Hadoop and PDW. This gives users the best of both worlds: Hadoop for processing large volumes of unstructured data, and PDW for analyzing structured data with easy integration to BI tools, using MapReduce and the PDW bulk load/extract tool for fast import/export.
  8. Sqoop is an open-source connectivity framework that facilitates transfer between multiple relational database management systems (RDBMS) and HDFS. Sqoop uses MapReduce programs to import and export data; the imports and exports are performed in parallel with fault tolerance. The source/target files used by Sqoop can be delimited text files (for example, with commas or tabs separating each field) or binary SequenceFiles containing serialized record data. Please refer to section 7.2.7 in the Sqoop User Guide for more details on supported file types. For information on the SequenceFile format, please refer to the Hadoop API page.
  9. Broader access to Hadoop through simplified deployment and programmability. Microsoft has simplified setup and deployment of Hadoop, making it possible to set up and configure Hadoop on Windows Azure in a few hours instead of days. Since the service is hosted on Windows Azure, customers only download a package that includes the Hive add-in and Hive ODBC driver. In addition, Microsoft has introduced new JavaScript libraries to make JavaScript a first-class programming language in Hadoop. Through this library, JavaScript programmers can easily write MapReduce programs in JavaScript and run these jobs from simple web browsers. These improvements reduce the barrier to entry by enabling customers to easily deploy and explore Hadoop on Windows. Breakthrough insights through integration with Microsoft Excel and BI tools: this preview ships with a new Hive add-in for Excel that enables users to interact with data in Hadoop from Excel. With the Hive add-in, customers can issue Hive queries to pull and analyze unstructured data from Hadoop in familiar Excel. Second, the preview includes a Hive ODBC driver that integrates Hadoop with Microsoft BI tools. This driver enables customers to integrate and analyze unstructured data from Hadoop using award-winning Microsoft BI tools such as PowerPivot and Power View. As a result, customers can gain insight into all their data, including unstructured data stored in Hadoop. Elasticity, thanks to Windows Azure: this preview of the Hadoop-based service runs on Windows Azure, offering an elastic and scalable platform for distributed storage and compute.
  10. Companies do not have to be at Google scale to have data issues; scalability issues occur with less than a terabyte of data. If a company works with relational databases and SQL, they can drown in complex data transformations and calculations that do not fit naturally into sequences of set operations. In that sense, the "big data" mantra is misguided at times. The big issue is not that everyone will suddenly operate at petabyte scale; a lot of folks do not have that much data. The more important topics are the specifics of the storage and processing infrastructure and what approaches best suit each problem. You can attack unstructured and semi-structured datasets without the overhead of an ETL step to insert them into a traditional relational database. From CSV to XML, you can load in a single step and begin querying.
  11. Through easy installation and configuration, and simplified programming with JavaScript. The CTP of Microsoft's Hadoop-based service for Windows Azure is now available. Complete the online form with details of your big-data scenario to download the preview. Microsoft will issue a code that will be used by the selected customers to access the Hadoop-based service.
  12. Gain new insights from your data. Have you ever had trouble finding data you needed? Or combining data from different, incompatible sources? How about sharing the results with others in a web-friendly way? If so, try the Microsoft Codename "Data Explorer" cloud service. With "Data Explorer" you can:
      - Identify the data you care about from the sources you work with (e.g. Excel spreadsheets, files, SQL Server databases)
      - Discover relevant data and services via automatic recommendations from the Windows Azure Marketplace
      - Enrich your data by combining it and visualizing the results
      - Collaborate with your colleagues to refine the data
      - Publish the results to share them with others or power solutions
      In short, it helps you harness the richness of data on the web to generate new insights.
  13. Blue - Use for Cloud on Your Terms specific content
  14. Green - Use for Mission Critical Confidence specific content
  15. Orange - Use for Breakthrough Insight specific content