The Elephant in the Room
A DBA’s Guide to Hadoop & Big Data
Purpose
Rosetta Stone presentation
High level overview of Hadoop & Big Data
NOT a deep dive
NOT a demo session
Mostly theory & vocabulary
Where to learn more
About Me
Manage DBAs for a financial services company
Former Data Architect, DBA, developer
Linchpin People TeamMate
AtlantaMDF Chapter Leader
Infrequent blogger: http://codegumbo.com
About You
Assume that
● mostly developers
● SQL experience
● exposure to database admin & architecture
● little to no experience with Big Data
“Big” Data
Big Data is like teenage sex...
Everyone talks about it,
Nobody really knows how to do it,
Everyone thinks everyone else is doing it,
So everyone claims they are doing it…
-Dan Ariely
The Four V’s of Big Data
Volume - data is too big to scale up
Velocity - decision window is small
Variety - multiple formats challenge integration
Variability - same data, different interpretations
http://goo.gl/6icouZ
RDBMS versus Big Data
RDBMS
Primarily Scale-Up
Strong Typing
Normalization
Default Mutable
Mature
Big Data
Primarily Scale-Out
Schemaless
Default Immutable
Evolving
Big Data Use Cases
Massive Size
PB of info
Data Warehouse
Large clusters
High Cost
Complex Analytics
Schemaless
Investigational
Single-node
Low Cost
Foundations
“Gentlemen, this is a
football…”
- Vince Lombardi
Hadoop Ecosystem (Hortonworks)
Hortonworks
Hadoop
Scalable, distributed processing framework
open-source
Hortonworks*
Cloudera
proprietary components
Facebook
Yahoo
HDFS
Hadoop Distributed File System
Inspired by the Google File System (2002-2003)
Cluster storage of large files across servers
Yahoo - 10,000 core Hadoop cluster(s)
Facebook - 100 PB+ (June, 2012)
http://goo.gl/SpSN
HDFS
HDFS
File permissions and authentication.
Rack aware
fsck: find missing files or blocks.
Scheduled Rebalancing
Redundancy & Replication
Built around MapReduce
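The storage arithmetic behind "large files across servers" plus replication can be sketched in Python. This is only an illustration: the 128 MB block size and replication factor of 3 are assumed here as commonly used defaults (both are configurable), and hdfs_footprint is a hypothetical helper, not part of any Hadoop API.

```python
import math

# Assumed values for illustration; both are configurable per cluster or per file.
BLOCK_SIZE_MB = 128
REPLICATION = 3


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (block count, total replicas, raw MB stored) for one file.

    Hypothetical helper: models only the arithmetic, not real HDFS behaviour.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The last block only occupies the bytes it actually holds, so raw
    # storage is the file size times the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


if __name__ == "__main__":
    print(hdfs_footprint(1000))
```

Under these assumptions a 1,000 MB file becomes 8 blocks, stored as 24 block replicas spread over the cluster, which is why raw capacity planning has to account for the replication factor.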
MapReduce
“Developed” by Google; paper published in 2004 (patent filed in 2004, granted in 2010)
Map - filtering and sorting
Reduce - summarization
Inherently distributed
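The map, sort and reduce steps listed above can be sketched in a single Python process. This is a toy model under stated assumptions, not Hadoop's API: the input lines, the "symbol,volume" layout and the function names are made up, and a real cluster would run many map and reduce tasks in parallel on different nodes.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input: "symbol,volume" lines, one of them malformed.
LINES = ["IBM,100", "MSFT,50", "bad-row", "IBM,25", "MSFT,10"]


def map_phase(line):
    """Map: filter out malformed rows and emit (key, value) pairs."""
    parts = line.split(",")
    if len(parts) == 2 and parts[1].isdigit():
        yield parts[0], int(parts[1])


def shuffle(pairs):
    """Shuffle: sort by key so equal keys become adjacent, then group them."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_phase(key, values):
    """Reduce: summarize all values collected for one key."""
    return key, sum(values)


def run_job(lines):
    """Chain the three phases over an iterable of input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped))


if __name__ == "__main__":
    print(run_job(LINES))
```

Because each map call looks at one line only and each reduce call looks at one key only, the work can be split across machines without the functions changing, which is the sense in which the model is inherently distributed.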
MapReduce
Hive
HiveQL - SQL like syntax
DDL scripts define tables
Query transformed into MapReduce jobs
Performance improves as the cluster scales out
Stinger initiative - Microsoft / Hortonworks
Hive
Hive
create external table price_data (
    stock_exchange string,
    symbol string,
    trade_date string,
    open float,
    high float,
    low float,
    close float,
    volume int,
    adj_close float
)
row format delimited fields terminated by ','
stored as textfile
location '/user/hue/nyse/nyse_prices';

select * from price_data where symbol = 'IBM';
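For readers coming from an RDBMS, the external table is worth a second look: the DDL does not load anything, it describes how to interpret delimited text files that already sit in HDFS, and the column types are applied when the data is read. A rough Python sketch of that idea follows; the sample rows are made up, and read_table and select_symbol are hypothetical names, not Hive functions.

```python
import csv
import io

# Column names and types taken from the price_data DDL above.
SCHEMA = [
    ("stock_exchange", str), ("symbol", str), ("trade_date", str),
    ("open", float), ("high", float), ("low", float),
    ("close", float), ("volume", int), ("adj_close", float),
]

# Made-up sample rows standing in for the comma-delimited text files.
SAMPLE = (
    "NYSE,IBM,2010-02-08,123.15,123.22,121.74,121.88,5718500,121.88\n"
    "NYSE,GE,2010-02-08,16.10,16.20,15.90,16.00,1000000,16.00\n"
)


def read_table(text):
    """Apply the schema to each delimited line as it is read."""
    for fields in csv.reader(io.StringIO(text)):
        yield {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}


def select_symbol(text, symbol):
    """Rough equivalent of: select * from price_data where symbol = '...'"""
    return [row for row in read_table(text) if row["symbol"] == symbol]


if __name__ == "__main__":
    print(select_symbol(SAMPLE, "IBM"))
```

The files stay plain text throughout; only the reader knows about columns and types, so several tools can apply their own view to the same data.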
Hive
HCatalog
Tight integration with Hive, but supports all
Hadoop data access protocols
Define relational view into data (DDL)
“Tables” can be reused by Hive, Pig, Storm...
Tutorial
Pig
Data abstraction language; Yahoo (2006)
Based on Java; supports Python & Ruby
Procedural (SQL is declarative)
Allows for ETL
Lazy evaluation
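Lazy evaluation means that Pig statements describe a data flow, and nothing is executed until an output is requested (for example by a store or dump statement). Python generators behave similarly in miniature; the sketch below uses hypothetical load, scrub and store functions purely to show when the work happens.

```python
calls = []  # records when work actually happens


def load(records):
    """Yield records one at a time, logging each read."""
    for record in records:
        calls.append(f"load {record}")
        yield record


def scrub(records):
    """Clean each record, logging each transformation."""
    for record in records:
        calls.append(f"scrub {record}")
        yield record.strip().lower()


def store(records):
    """Only this step pulls data through the pipeline."""
    return list(records)


pipeline = scrub(load(["  IBM ", " Msft"]))  # defines the flow, runs nothing
work_before_store = list(calls)              # still empty at this point
result = store(pipeline)                     # now load and scrub execute
```

Defining the pipeline costs nothing; the log stays empty until store is called, and then each record flows through load and scrub in turn.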
Pig
Pig
Pig
ETL service; useful as “duct tape”
Typical scenario:
Load data into HDFS
Use Pig to scrub data, and
Pump to another “db” (e.g., MongoDB)
Web service reads from destination
Hadoop Ecosystem (Hortonworks)
Hortonworks
Hadoop | SQL Server
HDFS | Windows Cluster, Database
MapReduce | Query Optimizer
Master Web Interface | SQL Server Management Studio
Hive | SQL
HCatalog | Views
Pig | PowerShell, SSIS
Big Data Administration
The possession of facts is knowledge, the use of them is wisdom. – Thomas Jefferson
Big Data Use Cases
Massive Size
PB of info
Data Warehouse
Large clusters
High Cost
Complex Analytics
Schemaless
Investigational
Single-node
Low Cost
[Figure: performance versus application growth, shown for RDBMS and for Big Data]
Scale-Up Costs (SQL Server)
Single Server
Maximum RAM
SAN
Licenses
Windows
SQL Server
Microsoft Support
Personnel
Developers
DBA
SAN Admin
Network Admin
Facilities
Minimum Footprint
Scale-Out Costs (Hortonworks HDP)
Multiple Servers
Commodity
Licenses
Windows ($$$)
Linux ($)
HDP Support
Personnel
Developer
HDP Admin
Network Admin
Facilities
Power
Space
Air
Performance Tuning
[Figure: system versus code tuning, with panels for RDBMS and Hadoop]
Performance Tuning Tips
Hadoop Ecosystem (Hortonworks)
Hortonworks
Performance Architecture
Nathan Marz - Twitter, Storm
Lambda Architecture
Performance Architecture
Getting Started (Massive Size)
1. Lab Environment (Virtualized)
2. Set up OS (Windows or Linux)
3. Download (MSI or RPM)
4. Deploy Prereqs (Python, Java, C++)
5. Set up Master Node(s)
6. Set up Data Node(s)
Windows Installation Tutorial
Big Data Use Cases
Massive Size
PB of info
Data Warehouse
Large clusters
High Cost
Complex Analytics
Schemaless
Investigational
Single-node
Low Cost
Word Count
Problem: count the number of times a word appears in a specific record.
e.g. “Lorem ipsum dolor sit amet, consectetur
adipiscing elit.”...
Word Count
SQL Server
Create UDF to
parse strings
Hadoop
Pig script to parse
strings
Word Count - SQL Server
CREATE FUNCTION WordRepeatedNumTimes
    (@SourceString varchar(max), @TargetWord varchar(8000))
RETURNS int
AS
BEGIN
    DECLARE @NumTimesRepeated int
        ,@CurrentStringPosition int
        ,@LengthOfString int
        ,@PatternStartsAtPosition int
        ,@LengthOfTargetWord int
        ,@NewSourceString varchar(max)

    SET @LengthOfTargetWord = len(@TargetWord)
    SET @LengthOfString = len(@SourceString)
    SET @NumTimesRepeated = 0
    SET @CurrentStringPosition = 0
    SET @PatternStartsAtPosition = 0
    SET @NewSourceString = @SourceString

    WHILE len(@NewSourceString) >= @LengthOfTargetWord
    BEGIN
        SET @PatternStartsAtPosition = CHARINDEX(@TargetWord, @NewSourceString)
        IF @PatternStartsAtPosition <> 0
        BEGIN
            SET @NumTimesRepeated = @NumTimesRepeated + 1
            SET @CurrentStringPosition = @CurrentStringPosition +
                @PatternStartsAtPosition + @LengthOfTargetWord
            SET @NewSourceString = substring(@NewSourceString,
                @PatternStartsAtPosition + @LengthOfTargetWord, @LengthOfString)
        END
        ELSE
        BEGIN
            SET @NewSourceString = ''
        END
    END
    RETURN @NumTimesRepeated
END
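The loop in this function is easier to follow in a few lines of Python. The sketch below mirrors the same idea (find the next match, count it, continue searching just past it); the function name and the empty-target guard are additions for illustration. Note that, like the CHARINDEX approach, it counts substring matches rather than whole words, and Python's comparison is case-sensitive whereas SQL Server's depends on collation.

```python
def word_repeated_num_times(source, target):
    """Count non-overlapping occurrences of target in source.

    Mirrors the CHARINDEX loop: find the next match, count it,
    then continue searching just past the end of that match.
    """
    if not target:
        return 0  # guard added in this sketch; an empty target would never advance
    count = 0
    position = source.find(target)
    while position != -1:
        count += 1
        position = source.find(target, position + len(target))
    return count


if __name__ == "__main__":
    text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
    print(word_repeated_num_times(text, "sit"))
```

Searching "visit the site" for "sit" returns 2 with this approach, which shows why a substring count is only an approximation of a word count.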
Word Count (Hadoop)
a = load '/user/hue/word_count_text.txt';
b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
c = group b by word;
d = foreach c generate COUNT(b), group;
store d into '/user/hue/pig_wordcount';
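Each Pig relation (a, b, c, d) corresponds to one step of a small pipeline, sketched here in Python with the matching Pig statement noted in comments. This is an approximation under stated assumptions: the sample text is made up, word_count is a hypothetical name, and str.split breaks on whitespace only, whereas Pig's TOKENIZE also splits on some punctuation.

```python
from collections import Counter

# Made-up sample standing in for word_count_text.txt.
TEXT = "Lorem ipsum dolor sit amet\nlorem ipsum"


def word_count(text):
    """Return (count, word) pairs, mirroring the relations in the Pig script."""
    # a = load ...: one record per line
    records = text.splitlines()
    # b = foreach a generate flatten(TOKENIZE(...)) as word: one word per row
    words = [word for record in records for word in record.split()]
    # c = group b by word; d = foreach c generate COUNT(b), group
    counts = Counter(words)
    # Same column order as relation d: (count, word); sorted only for stable output
    return sorted((n, w) for w, n in counts.items())


if __name__ == "__main__":
    for count, word in word_count(TEXT):
        print(count, word)
```

As in the Pig script, "Lorem" and "lorem" are counted separately because nothing lowercases the tokens.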
Getting Started (Complex Analysis)
Option 1 (local):
1. Lab Environment (Virtualized)
2. Install Hortonworks Sandbox
Option 2 (cloud):
1. Set up Azure account
2. HDInsight
Theoretically, it can scale to PB, but I have no idea what that will cost you.
Note that the interface highlights Hive (with Stinger); Pig commands are run through PowerShell.
In Conclusion
Lots of vocabulary
HDFS, Pig, Hive, MapReduce
Map to SQL Server (RDBMS) vocabulary
Different Use Cases
Massive Data
Complex Analysis
Questions & Feedback
Contact Me
Stuart R. Ainsworth
Twitter: @codegumbo
Email: stuart@codegumbo.com
SpeakerRate: http://spkr8.com/t/33521
Big Data - Dangerous
http://www.thefacehawk.com/
