Big Data-BI Fusion:
Microsoft HDInsight & MS BI
Level: Intermediate
March 28, 2013
Andrew Brust
CEO and Founder
Blue Badge Insights
Meet Andrew
• CEO and Founder, Blue Badge Insights
• Big Data blogger for ZDNet
• Microsoft Regional Director, MVP
• Co-chair VSLive! and 18 years as a speaker
• Founder, MS BI and Big Data User Group of NYC
– http://www.msbigdatanyc.com
• Co-moderator, NYC .NET Developers Group
– http://www.nycdotnetdev.com
• “Redmond Review” columnist for
Visual Studio Magazine and Redmond Developer
News
• brustblog.com, Twitter: @andrewbrust
Andrew’s New Blog (bit.ly/bigondata)
Read all about it!
What is Big Data?
• 100s of TB into PB and higher
• Involving data from: financial data,
sensors, web logs, social media, etc.
• Parallel processing often involved
– Hadoop is emblematic, but other technologies are Big
Data too
• Processing of data sets too large for
transactional databases
– Analyzing interactions, rather than transactions
– The three V’s: Volume, Velocity, Variety
• Big Data tech sometimes imposed on
small data problems
The Hadoop Stack
MapReduce, HDFS
Database
RDBMS Import/Export
Query: HiveQL and Pig Latin
Machine Learning/Data Mining
Log file integration
What’s MapReduce?
• Divide and conquer approach to “Big”
data processing
• Partition the data and send to mappers
(nodes in cluster)
• Mappers pre-process into key-value pairs,
then all output for a given key goes to
a single reducer
• Reducer performs aggregations; one
output per key, with value
• Map and Reduce code natively written as
Java functions
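The flow above can be sketched in a few lines of Python. This is a minimal single-process illustration, not Hadoop code: the names mapper, shuffle, reducer and run_job are hypothetical, the word-count task is an assumed example, and a real job would run the mappers and reducers on separate nodes.

```python
from collections import defaultdict

def mapper(chunk):
    """Emit a (word, 1) pair for every word in one partition of the input."""
    for line in chunk:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs):
    """Group values by key so that each key is handled by exactly one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Aggregate all values for one key into a single output pair."""
    return key, sum(values)

def run_job(partitions):
    """Run every mapper, group the output by key, then reduce each group."""
    pairs = [pair for chunk in partitions for pair in mapper(chunk)]
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())

# Two partitions stand in for data split across two mapper nodes.
partitions = [["big data big"], ["data big"]]
counts = run_job(partitions)
```

Here counts holds one entry per key, which mirrors the "one output per key, with value" rule on the slide.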
MapReduce, in a Diagram
(Diagram: Input splits feed the mappers; mapper Output is grouped by key (K1, K2, K3) and becomes Input to the reducers; each reducer writes its Output.)
HDFS
• File system whose data gets distributed
over commodity disks on commodity
servers
• Data is replicated
• If one box goes down, no data lost
– “Shared Nothing”
– Except the name node
• BUT: Immutable
– Files can only be written to once
– So updates require drop + re-write (slow)
– You can append though
– Like a DVD/CD-ROM
HBase
• A Wide-Column Store, NoSQL database
• Modeled after Google BigTable
• HBase tables are HDFS files
– Therefore, Hadoop-compatible
• Hadoop often used with HBase
– But you can use either without the other
• HDInsight (more on next slide) does not
(yet) include HBase
Microsoft HDInsight
• Developed with Hortonworks and
incorporates Hortonworks Data Platform
(HDP) for Windows
• Windows Azure HDInsight and Microsoft
HDInsight Server
– Single node preview runs on Windows client
• Includes ODBC Driver for Hive
– And Excel add-in that uses it
• JavaScript MapReduce framework
• Microsoft contributes it all back to the open
source Apache project
Azure HDInsight Provisioning
• HDInsight preview now public, so…
• Go to Windows Azure portal
• Sign up for the public preview
• Select HDInsight from left navbar
• Click “+ NEW” button @ lower-left
• Specify cluster name, number of nodes, admin
password, storage account
– Credentials used for browser login, RDP and ODBC
– During preview, you will be billed 50% of Azure compute rates
for nodes in cluster. Will be 100% at GA.
• Click “CREATE HDINSIGHT CLUSTER”
• Wait for provisioning to complete
• Navigate to http://clustername.azurehdinsight.net
Submitting, Running and
Monitoring Jobs
• Upload a JAR
• Use Streaming
– Use languages other than Java to write
MapReduce code
– Python is a popular option
– Any executable works, even C# console apps
– On HDInsight, JavaScript works too
– Still uses a JAR file: streaming.jar
• Run at command line (passing JAR name
and params) or use GUI
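As a rough illustration of the Streaming model, here is a Python sketch of a word-count mapper and reducer. Hadoop Streaming passes records to the executable as lines on standard input and reads tab-separated key/value lines from standard output, with reducer input arriving sorted by key. The function names and sample input are hypothetical, and in a real job each function would live in its own script that loops over sys.stdin and prints its results.

```python
import itertools

def map_lines(lines):
    """Mapper side: turn raw text lines into 'key<TAB>value' lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reduce_lines(sorted_lines):
    """Reducer side: input is sorted by key, so equal keys are adjacent."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in itertools.groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"

# Local stand-in for the framework: map, sort by key, then reduce.
mapped = sorted(map_lines(["Hive Pig", "hive"]))
result = list(reduce_lines(mapped))
```

Because the contract is only lines in and lines out, the same shape carries over to any executable, which is why console apps in other languages can serve as mappers and reducers.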
Amenities for
Visual Studio/.NET
• Hortonworks Data Platform for Windows
• MRLib (NuGet Package)
• LINQ to Hive
• OdbcClient + Hive ODBC Driver
• Deployment
• Debugging
• MR code in C#: HadoopJob, MapperBase, ReducerBase
Running MapReduce
Jobs
The “Data-Refinery” Idea
• Use Hadoop to “on-board” unstructured
data, then extract manageable subsets
• Load the subsets into conventional DW/BI
servers and use familiar analytics tools to
examine them
• This is the current rationalization of
Hadoop + BI tools’ coexistence
• Will it stay this way?
Hive
• Used by most BI products which connect
to Hadoop
• Provides a SQL-like abstraction over
Hadoop
– Officially HiveQL, or HQL
• Works on own tables, but also on HBase
• Query generates MapReduce job, output of
which becomes result set
• Microsoft has Hive ODBC driver
– Connects Excel, Reporting Services, PowerPivot,
Analysis Services Tabular Mode (only)
HDInsight Data Sources
• Files in HDFS
• Azure Blob Storage (Azure HDInsight only)
– Use asv:// URLs (“Azure Storage Vault”)
• Hive tables
• HBase?
Just-in-time Schema
• When looking at unstructured data,
schema is imposed at query time
• Schema is context specific
– If scanning a book, are the values words, lines, or
pages?
– Are notes a single field, or is each word a value?
– Are date and time two fields or one?
– Are street, city, state, zip separate or one value?
– Pig and Hive let you determine this at query time
– So does the Map function in MapReduce code
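To make the idea concrete, the sketch below reads the same raw record under two different schemas, both chosen at read time. The record layout, the pipe delimiter, and the function names are assumptions made up for this illustration; nothing here is specific to Pig, Hive, or HDInsight.

```python
# A hypothetical raw record: timestamp | address | free-text notes.
raw = "2013-03-28 09:15:00|12 Main St, Springfield, IL, 62701|call back soon"

def as_coarse(record):
    """Schema A: timestamp, address and notes are each a single field."""
    timestamp, address, notes = record.split("|")
    return {"timestamp": timestamp, "address": address, "notes": notes}

def as_fine(record):
    """Schema B: split date/time, address parts, and individual words."""
    timestamp, address, notes = record.split("|")
    date, time = timestamp.split(" ")
    street, city, state, zip_code = [part.strip() for part in address.split(",")]
    return {
        "date": date,
        "time": time,
        "street": street,
        "city": city,
        "state": state,
        "zip": zip_code,
        "words": notes.split(),
    }

coarse = as_coarse(raw)
fine = as_fine(raw)
```

The stored data never changes; only the function applied when reading it decides whether the record has three fields or seven.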
How Does MS BI Fit In?
• Excel, PowerPivot: can query via Hive
ODBC driver
• Analysis Services (SSAS) Tabular Mode
– Also compatible with Hive ODBC Driver
– Multidimensional mode is not
• Power View
– Works against PowerPivot and SSAS Tabular
• RDBMS + Parallel Data Warehouse (PDW)
– Sqoop connectors
– Columnstore Indexes
(Enterprise Edition and PDW only)
• PDW: PolyBase
Excel, PowerPivot
• Excel and PowerPivot use the BI Semantic
Model (BISM), which can query Hadoop via
Hive and its ODBC driver
• Excel also features “Data Explorer”
(currently in Beta) which can query HDFS
directly and insert the results into a BISM
repository
• Excel BISM accommodates millions of
rows through compression. Not petabyte
scale, but sufficient to store and analyze
output of Hadoop queries.
PowerPivot, SSAS Tabular
• SQL Server Analysis Services Tabular
mode is the enterprise server
implementation of BISM
• Features partitioning and role-based
security
• Can store billions of rows. So even better
for Hadoop output analysis.
• Excel-based BISM repositories can be
upsized to SSAS Tabular
Querying Hadoop from
Microsoft BI
Sqoop
• Acronym for “SQL to Hadoop”
• Essentially a technology for moving data
between data warehouses and Hadoop
• Command line utility; allows specification
of source/target HDFS file and relational
server, database and table
• Sqoop connectors available for SQL
Server and PDW
• Sqoop generates MapReduce job to
extract data from, or insert data into, HDFS
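As an illustration of the command line form, the Python sketch below assembles the arguments for a Sqoop import from a relational table into an HDFS directory. It only builds the command and does not run it. The host, database, table, directory and user names are hypothetical, the helper name sqoop_import_args is made up for this example, and password handling is deliberately left out.

```python
import shlex

def sqoop_import_args(jdbc_url, table, target_dir, username, mappers=4):
    """Build the argument list for a Sqoop import of one table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,          # relational server and database
        "--username", username,
        "--table", table,               # source table
        "--target-dir", target_dir,     # destination directory in HDFS
        "--num-mappers", str(mappers),  # parallelism of the generated job
    ]

# Hypothetical SQL Server source; every value below is a placeholder.
args = sqoop_import_args(
    "jdbc:sqlserver://dwhost:1433;databaseName=Sales",
    "Orders",
    "/data/orders",
    "etl_user",
)

# Shell-safe rendering for logging or copy/paste.
command_line = " ".join(shlex.quote(arg) for arg in args)
```

Keeping the arguments as a list rather than one string avoids quoting problems with the semicolon in the JDBC URL if the command is later handed to a process launcher.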
PDW, PolyBase
• SQL Server Parallel Data Warehouse
(PDW) is a Massively Parallel Processing
(MPP) data warehouse appliance version
of SQL Server
• MPP manages a grid of relational database
servers for divide-and-conquer processing
of large data sets.
• PDW v2 includes “PolyBase,” a
component which allows PDW to query
data in Hadoop directly.
– Bypasses MapReduce; addresses data nodes directly
and orchestrates parallelism itself
PolyBase Versus Hive, Sqoop
• Hive and Sqoop generate MapReduce
jobs, and work in batch mode
• PolyBase addresses HDFS data itself
• This is true SQL over Hadoop.
• Competitors:
– Cloudera Impala
– Teradata Aster SQL-H
– EMC/Greenplum Pivotal HD
– Hadapt
Usability Impact
• PowerPivot makes analysis much easier,
self-service
• Power View is great for discovery and
visualization; also self-service
• Combine with the Hive ODBC driver and
suddenly Hadoop is accessible to
business users
• Caveats
– Someone has to write the HiveQL
– Can query Big Data, but the result set must be smaller
Resources
• Big On Data blog
– http://www.zdnet.com/blog/big-data
• Apache Hadoop home page
– http://hadoop.apache.org/
• Hive & Pig home pages
– http://hive.apache.org/
– http://pig.apache.org/
• Hadoop on Azure home page
– https://www.hadooponazure.com/
• SQL Server 2012 Big Data
– http://bit.ly/sql2012bigdata
Thank You!
• Email
• andrew.brust@bluebadgeinsights.com
• Blog:
• http://www.zdnet.com/blog/big-data
• Twitter
• @andrewbrust on twitter

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 

Big Data and NoSQL for Database and BI Pros

  • 1. Big Data-BI Fusion: Microsoft HDInsight & MS BI Level: Intermediate March 28, 2013 Andrew Brust CEO and Founder Blue Badge Insights
  • 2. • CEO and Founder, Blue Badge Insights • Big Data blogger for ZDNet • Microsoft Regional Director, MVP • Co-chair VSLive! and 18 years as a speaker • Founder, MS BI and Big Data User Group of NYC – http://www.msbigdatanyc.com • Co-moderator, NYC .NET Developers Group – http://www.nycdotnetdev.com • “Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News • brustblog.com, Twitter: @andrewbrust Meet Andrew
  • 3. Andrew’s New Blog (bit.ly/bigondata)
  • 5. What is Big Data? • 100s of TB into PB and higher • Involving data from: financial data, sensors, web logs, social media, etc. • Parallel processing often involved – Hadoop is emblematic, but other technologies are Big Data too • Processing of data sets too large for transactional databases – Analyzing interactions, rather than transactions – The three V’s: Volume, Velocity, Variety • Big Data tech sometimes imposed on small data problems
  • 6. The Hadoop Stack MapReduce, HDFS Database RDBMS Import/Export Query: HiveQL and Pig Latin Machine Learning/Data Mining Log file integration
  • 7. What’s MapReduce? • Divide and conquer approach to “Big” data processing • Partition the data and send to mappers (nodes in cluster) • Mappers pre-process into key-value pairs, then all output for (a) given key(s) goes to a reducer • Reducer performs aggregations; one output per key, with value • Map and Reduce code natively written as Java functions
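The flow on this slide can be sketched in plain Python. This is a single-process illustration only, not Hadoop code: the function names (`mapper`, `shuffle`, `reducer`, `run_job`) and the word-count task are assumptions chosen for the example, and the two input lists stand in for data partitions sent to separate mappers.

```python
from collections import defaultdict


def mapper(line):
    """Pre-process one input line into (key, value) pairs: here, (word, 1)."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key so that all output for a given key reaches one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Aggregate the values for one key into a single output."""
    return key, sum(values)


def run_job(partitions):
    """Run map, shuffle and reduce over a list of partitions (lists of lines)."""
    mapped = [pair
              for partition in partitions
              for line in partition
              for pair in mapper(line)]
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


# Two partitions, as if the input had been split across two mapper nodes.
partitions = [["big data big"], ["data is big"]]
counts = run_job(partitions)
```

On a real cluster the mappers and reducers run on different nodes and the framework performs the shuffle; the sketch only shows how the three stages hand data to one another.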
  • 8. MapReduce, in a Diagram mapper mapper mapper mapper mapper mapper Input reducer reducer reducer Input Input Input Input Input Input Output Output Output Output Output Output Output Input Input Input K1 K2 K3 Output Output Output
  • 9. HDFS • File system whose data gets distributed over commodity disks on commodity servers • Data is replicated • If one box goes down, no data lost – “Shared Nothing” – Except the name node • BUT: Immutable – Files can only be written to once – So updates require drop + re-write (slow) – You can append though – Like a DVD/CD-ROM
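The write-once rule can be modeled with a small in-memory class. This is a toy, not the HDFS API: `WriteOnceStore` and its method names are hypothetical, and it ignores distribution and replication entirely. It only shows why an update means a drop followed by a full re-write, while an append does not.

```python
class WriteOnceStore:
    """Toy model of write-once files: create, append, delete, but no in-place edit."""

    def __init__(self):
        self._files = {}

    def create(self, path, data):
        """Write a new file; writing to an existing path is refused."""
        if path in self._files:
            raise FileExistsError(path)
        self._files[path] = data

    def append(self, path, data):
        """Appending to an existing file is allowed."""
        self._files[path] += data

    def delete(self, path):
        """Drop the file entirely."""
        del self._files[path]

    def read(self, path):
        return self._files[path]

    def rewrite(self, path, transform):
        """An 'update': read everything, drop the file, write it all again."""
        new_data = transform(self.read(path))
        self.delete(path)
        self.create(path, new_data)


store = WriteOnceStore()
store.create("/logs/day1", "alpha")
store.append("/logs/day1", "beta")
store.rewrite("/logs/day1", str.upper)
```

The cost of `rewrite` grows with the size of the whole file rather than the size of the change, which is the "slow" part noted on the slide.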
  • 10. HBase • A Wide-Column Store, NoSQL database • Modeled after Google BigTable • HBase tables are HDFS files – Therefore, Hadoop-compatible • Hadoop often used with HBase – But you can use either without the other • HDInsight (more on next slide) does not (yet) include HBase
  • 11. Microsoft HDInsight • Developed with Hortonworks and incorporates Hortonworks Data Platform (HDP) for Windows • Windows Azure HDInsight and Microsoft HDInsight Server – Single node preview runs on Windows client • Includes ODBC Driver for Hive – And Excel add-in that uses it • JavaScript MapReduce framework • Contribute it all back to open source Apache Project
  • 12. Azure HDInsight Provisioning • HDInsight preview now public, so… • Go to Windows Azure portal • Sign up for the public preview • Select HDInsight from left navbar • Click “+ NEW” button @ lower-left • Specify cluster name, number of nodes, admin password, storage account – Credentials used for browser login, RDP and ODBC – During preview, you will be billed 50% of Azure compute rates for nodes in cluster. Will be 100% at GA. • Click “CREATE HDINSIGHT CLUSTER” • Wait for provisioning to complete • Navigate to http://clustername.azurehdinsight.net New!
  • 14. Submitting, Running and Monitoring Jobs • Upload a JAR • Use Streaming – Use other languages (i.e. other than Java) to write MapReduce code – Python is popular option – Any executable works, even C# console apps – On HDInsight, JavaScript works too – Still uses a JAR file: streaming.jar • Run at command line (passing JAR name and params) or use GUI
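As an illustration of the Streaming option in Python: a streaming mapper and reducer are ordinary programs that read lines on standard input and write tab-separated key/value lines on standard output, with the reducer receiving its input sorted by key. The sketch below counts occurrences of the first field of each line. The input layout (first field is a page path) and the "map"/"reduce" command-line argument are assumptions made for the example, not something the deck specifies.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'key<TAB>1' for the first field of each non-empty line."""
    for line in lines:
        fields = line.split()
        if fields:
            yield f"{fields[0]}\t1"


def reduce_lines(sorted_lines):
    """Reducer: input arrives sorted by key; sum the counts for each key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention: run as "script.py map" or "script.py reduce".
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out_line in step(sys.stdin):
        print(out_line)
```

Locally, the pipeline can be approximated by piping a file through the map step, `sort`, and then the reduce step; on a cluster the streaming JAR takes the place of the pipes and the sort.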
  • 15. Hortonworks Data Platform for Windows MRLib (NuGet Package) LINQ to Hive OdbcClient + Hive ODBC Driver Deployment Debugging MR code in C#, HadoopJob, MapperBase, ReducerBase Amenities for Visual Studio/.NET
  • 17. The “Data-Refinery” Idea • Use Hadoop to “on-board” unstructured data, then extract manageable subsets • Load the subsets into conventional DW/BI servers and use familiar analytics tool to examine • This is the current rationalization of Hadoop + BI tools’ coexistence • Will it stay this way?
  • 18. Hive • Used by most BI products which connect to Hadoop • Provides a SQL-like abstraction over Hadoop – Officially HiveQL, or HQL • Works on own tables, but also on HBase • Query generates MapReduce job, output of which becomes result set • Microsoft has Hive ODBC driver – Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular Mode (only)
  • 19. Hive
  • 20. HDInsight Data Sources • Files in HDFS • Azure Blob Storage (Azure HDInsight only) – Use asv:// URLs (“Azure Storage Vault”) • Hive tables • HBase?
  • 21. Just-in-time Schema • When looking at unstructured data, schema is imposed at query time • Schema is context specific – If scanning a book, are the values words, lines, or pages? – Are notes a single field, or is each word a value? – Are date and time two fields or one? – Are street, city, state, zip separate or one value? – Pig and Hive let you determine this at query time – So does the Map function in MapReduce code
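A small Python sketch of the same idea: one raw record, two schemas, each imposed only when the record is read. The pipe-delimited record layout and the function names are invented for the example.

```python
# One raw, unstructured record (hypothetical layout: timestamp|address|notes).
raw = "2013-03-28 09:15:00|123 Main St, Springfield, IL, 62701|late delivery"


def schema_coarse(record):
    """Date and time as one field, address as one value, notes as a single field."""
    when, address, notes = record.split("|")
    return {"timestamp": when, "address": address, "notes": notes}


def schema_fine(record):
    """Date and time as two fields, address in parts, each word of notes a value."""
    when, address, notes = record.split("|")
    date, time = when.split(" ")
    street, city, state, zip_code = [part.strip() for part in address.split(",")]
    return {
        "date": date,
        "time": time,
        "street": street,
        "city": city,
        "state": state,
        "zip": zip_code,
        "words": notes.split(),
    }


coarse = schema_coarse(raw)
fine = schema_fine(raw)
```

The stored data never changes; only the reading function does, which is the role a Hive table definition, a Pig load statement, or a Map function plays at query time.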
  • 22. How Does MS BI Fit In? • Excel, PowerPivot: can query via Hive ODBC driver • Analysis Services (SSAS) Tabular Mode – Also compatible with Hive ODBC Driver; Multidimensional mode is not • Power View – Works against PowerPivot and SSAS Tabular • RDBMS + Parallel Data Warehouse (PDW) – Sqoop connectors – Columnstore Indexes (Enterprise Edition and PDW only) • PDW: PolyBase
  • 23. Excel, PowerPivot • Excel and PowerPivot use the BI Semantic Model (BISM), which can query Hadoop via Hive and its ODBC driver • Excel also features “Data Explorer” (currently in Beta) which can query HDFS directly and insert the results into a BISM repository • Excel BISM accommodates millions of rows through compression. Not petabyte scale, but sufficient to store and analyze output of Hadoop queries.
  • 24. PowerPivot, SSAS Tabular • SQL Server Analysis Services Tabular mode is the enterprise server implementation of BISM • Features partitioning and role-based security • Can store billions of rows. So even better for Hadoop output analysis. • Excel-based BISM repositories can be upsized to SSAS Tabular
  • 26. Sqoop • Acronym for “SQL to Hadoop” • Essentially a technology for moving data between data warehouses and Hadoop • Command line utility; allows specification of source/target HDFS file and relational server, database and table • Sqoop connectors available for SQL Server and PDW • Sqoop generates MapReduce job to extract data from, or insert data into, HDFS
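To make the command-line parameters concrete, here is a hedged Python helper that assembles a Sqoop invocation as an argument list. The helper name `sqoop_args` and all sample values are hypothetical. The options used (`--connect`, `--username`, `-P`, `--table`, `--target-dir`, `--export-dir`) are taken from the Sqoop 1 command line and should be checked against the installed version and connector before use.

```python
def sqoop_args(direction, server, database, table, hdfs_dir, username):
    """Build a Sqoop command line for moving one table to or from HDFS.

    direction is "import" (relational table -> HDFS) or
    "export" (HDFS directory -> relational table).
    """
    if direction not in ("import", "export"):
        raise ValueError("direction must be 'import' or 'export'")
    # Imports write to a target directory; exports read from an export directory.
    dir_flag = "--target-dir" if direction == "import" else "--export-dir"
    return [
        "sqoop", direction,
        "--connect", f"jdbc:sqlserver://{server};databaseName={database}",
        "--username", username,
        "-P",  # prompt for the password instead of placing it on the command line
        "--table", table,
        dir_flag, hdfs_dir,
    ]


# Sample values are placeholders, not real servers or tables.
import_cmd = sqoop_args("import", "dbhost", "Sales", "Orders", "/data/orders", "etl_user")
export_cmd = sqoop_args("export", "dbhost", "Sales", "OrderTotals", "/out/totals", "etl_user")
```

The list form is suitable for passing to `subprocess.run` on a machine where Sqoop and the SQL Server connector are installed; the sketch itself only builds the arguments and runs nothing.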
  • 27. PDW, PolyBase • SQL Server Parallel Data Warehouse (PDW) is a Massively Parallel Processing (MPP) data warehouse appliance version of SQL Server • MPP manages a grid of relational database servers for divide-and-conquer processing of large data sets. • PDW v2 includes “PolyBase,” a component which allows PDW to query data in Hadoop directly. – Bypasses MapReduce; addresses data nodes directly and orchestrates parallelism itself
  • 28. PolyBase Versus Hive, Sqoop • Hive and Sqoop generate MapReduce jobs, and work in batch mode • PolyBase addresses HDFS data itself • This is true SQL over Hadoop. • Competitors: – Cloudera Impala – Teradata Aster SQL-H – EMC/Greenplum Pivotal HD – Hadapt
  • 29. Usability Impact • PowerPivot makes analysis much easier, self-service • Power View is great for discovery and visualization; also self-service • Combine with the Hive ODBC driver and suddenly Hadoop is accessible to business users • Caveats – Someone has to write the HiveQL – Can query Big Data, but must have smaller result
  • 30. Resources • Big On Data blog – http://www.zdnet.com/blog/big-data • Apache Hadoop home page – http://hadoop.apache.org/ • Hive & Pig home pages – http://hive.apache.org/ – http://pig.apache.org/ • Hadoop on Azure home page – https://www.hadooponazure.com/ • SQL Server 2012 Big Data – http://bit.ly/sql2012bigdata
  • 31. Thank You! • Email • andrew.brust@bluebadgeinsights.com • Blog: • http://www.zdnet.com/blog/big-data • Twitter • @andrewbrust on twitter