Big Data for Oracle Professionals
Arup Nanda
Big Data Explorer
Time
Growth
Tweet @ArupNanda
Hadoop
Map/Reduce
YARN
NoSQL
Spark
Flume.
fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html HTTP/1.0" 200 4595 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
fcrawler.looksmart.com - - [26/Apr/2000:00:17:19 -0400] "GET /news/news.html HTTP/1.0" 200 16716 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
ppp931.on.bellglobal.com - - [26/Apr/2000:00:16:12 -0400] "GET /download/windows/asctab31.zip HTTP/1.0" 200 1540096 "http://www.htmlgoodies.com/downloads/freeware/webdevelopment/15.html" "Mozilla/4.7 [en]C-SYMPA (Win95; U)"
123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:47 -0400] "GET /asctortf/ HTTP/1.0" 200 8130 "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - -
petabytes
unpredictable format
transient.
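As a rough illustration of what it takes to pull structure out of such log lines, here is a minimal Python sketch. It assumes the lines follow the common Apache "combined" layout (host, identity, user, timestamp, request, status, size, referrer, agent); the pattern and the function name `parse_log_line` are illustrative choices, not something taken from the slides.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "request" status size "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields, or None when the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    return fields

sample = (
    'fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] '
    '"GET /contacts.html HTTP/1.0" 200 4595 "-" '
    '"FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"'
)
record = parse_log_line(sample)
print(record["host"], record["status"], record["size"])
```

A truncated line such as `123.123.123.123 - -` simply returns `None`, which hints at the "unpredictable format" problem: every parser of this kind has to decide what to do with lines that do not fit.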
Metadata Repository
Volume
Variety
Velocity
CUSTOMERS (CUST_ID, NAME, ADDRESS)

CUSTOMERS (CUST_ID, NAME, ADDRESS, SPOUSE)

CUSTOMERS (CUST_ID, NAME, ADDRESS)
SPOUSES (CUST_ID, NAME, CURRENT)

CUSTOMERS (CUST_ID, NAME, ADDRESS)
SPOUSES (CUST_ID, NAME, CURRENT)
EMPLOYERS (CUST_ID, NAME, CURRENT)
Name = Data
Relationship status = Data
Married to = Data
In a relationship with = Data
Friends = Data, Data, Data
Likes = Data, Data
Mutually Exclusive, Maybe not?
Multiple Data Points
First Name John
Spouse Jane
Child Jill
Goes to Acme School
First Name Martha
Child goes to Acme School
First Name John
Spouse Jane
Child Jill
Goes to Acme School
First Name Martha
Child goes to Acme School
First Name Martha
Child goes to Acme School
Teacher Mrs Gillen
Teacher Mrs Gillen
Jill
First Name John
Spouse Jane
Child Jill
Goes to Acme School
Teacher Mr Fullmeister
First Name Irene
Boyfriend Henry
Works at Starwood
Hobby Photography
Ex-Spouse Jane
First Name Irene
Key Value
Key-Value Pair
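The profiles above can be sketched as plain key-value pairs. The following Python snippet is only an illustration: how each slide line is split into a key and a value (for example "Goes to" and "Acme School") is an assumption, and the helper `lookup` is a hypothetical name.

```python
# Each profile is a set of key-value pairs; profiles need not share keys.
john = {
    "First Name": "John",
    "Spouse": "Jane",
    "Child": "Jill",
    "Goes to": "Acme School",
}
irene = {
    "First Name": "Irene",
    "Boyfriend": "Henry",
    "Works at": "Starwood",
    "Hobby": "Photography",
    "Ex-Spouse": "Jane",
}

# A new attribute is just another pair; no table needs to be altered.
john["Teacher"] = "Mr Fullmeister"

def lookup(profile, key, default=None):
    """Return the value stored under key, or default when the key is absent."""
    return profile.get(key, default)

print(lookup(john, "Teacher"))
print(lookup(irene, "Spouse", "not recorded"))
```

Contrast this with the CUSTOMERS, SPOUSES and EMPLOYERS tables earlier, where each new kind of fact required a new column or table.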
John Smith and his wife Jane,
along with their daughter Jill,
were strolling on the beach
when they heard a crash. John
ran towards …
Map
begin
  like_count := 0
  no_comment := 0
  while (there_are_remaining_posts) loop
    get next post
    extract status of "like" for the specific post
    if status = "like" then
      like_count := like_count + 1
    else
      no_comment := no_comment + 1
    end if
  end loop
end
Counter()
Counter() Counter() Counter()
Counter() Counter() Counter()
Likes=100, No Comments=300
Likes=50, No Comments=350
Likes=150, No Comments=250
Reduce
Likes=300, No Comments=900
Map/Reduce
Map: dividing the work among different nodes
Reduce: collating the results to get the final answer
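The Counter() example can be sketched in Python as a map step run over three slices of posts and a reduce step that adds up the partial counts. This is a single-machine illustration under stated assumptions: threads stand in for nodes, the slice contents are made up to reproduce the slide's figures, and the names `counter` and `reduce_counts` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def counter(posts):
    """Map step: count likes and non-likes in one slice of posts."""
    counts = Counter(likes=0, no_comments=0)
    for status in posts:
        if status == "like":
            counts["likes"] += 1
        else:
            counts["no_comments"] += 1
    return counts

def reduce_counts(partials):
    """Reduce step: add the per-slice counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Three slices standing in for three nodes, sized to match the slide.
slices = [
    ["like"] * 100 + ["none"] * 300,
    ["like"] * 50 + ["none"] * 350,
    ["like"] * 150 + ["none"] * 250,
]

# Threads are only a stand-in for separate machines.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(counter, slices))

totals = reduce_counts(partials)
print(totals["likes"], totals["no_comments"])
```

Each call to `counter` sees only its own slice, which is what lets the work be spread out; only the small partial counts travel to the reduce step.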
Counter() Counter() Counter()
Likes=100, No Comments=300
Likes=50, No Comments=350
Likes=150, No Comments=250
Likes=300, No Comments=900
• Divide the workload
• Submit and track the jobs
• If a job fails, restart it on another node
• …
Hadoop
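The "restart it on another node" idea in the list above can be sketched as a small retry loop. This is not how Hadoop's scheduler is implemented; it is a toy illustration in which `run_with_failover` and `fake_run` are hypothetical names and the node "n1" is deliberately made to fail.

```python
def run_with_failover(task, nodes, run_on_node):
    """Try the task on each node in turn until one succeeds."""
    errors = []
    for node in nodes:
        try:
            return node, run_on_node(node, task)
        except RuntimeError as exc:
            # Track the failure and move on to the next node.
            errors.append((node, str(exc)))
    raise RuntimeError(f"task {task!r} failed on all nodes: {errors}")

# Hypothetical stand-in for remote execution: node "n1" is broken.
def fake_run(node, task):
    if node == "n1":
        raise RuntimeError("node down")
    return sum(task)

node, result = run_with_failover([1, 2, 3], ["n1", "n2", "n3"], fake_run)
print(node, result)
```

The caller never has to know that the first attempt failed, which is the point of having the framework track jobs rather than the program.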
Resource Management
Applications
YARN
Yet Another Resource Negotiator
Map Reduce v2.
Counter() Counter() Counter()
Filesystem Filesystem Filesystem
1 2 3   2 3 1   3 1 2
Hadoop Distributed Filesystem (HDFS)
Counter() Counter() Counter()
Filesystem Filesystem Filesystem
1 2 3   2 3 1   3 1 2
Comparison with RAC
• Not shared storage
• Data is discrete
• Version control not required
• Concurrency not required
• Transactional integrity across nodes not required.
Advantages of Hadoop
• Processors need not be super-fast
• Immensely scalable
• Storage is redundant by design
• No RAID level required.
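To see why replicated blocks make RAID unnecessary, here is a small Python sketch that spreads copies of each block across nodes and checks what survives a node failure. The round-robin placement and the names `place_blocks` and `available_blocks` are assumptions for illustration; the real HDFS placement policy is more involved.

```python
def place_blocks(blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes) so that the copies land on
    different nodes.
    """
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        for r in range(replication):
            node = nodes[(i + r) % len(nodes)]
            placement[node].append(block)
    return placement

def available_blocks(placement, failed_nodes):
    """Blocks still readable when the given nodes are lost."""
    return {
        block
        for node, held in placement.items()
        if node not in failed_nodes
        for block in held
    }

nodes = ["node1", "node2", "node3"]
placement = place_blocks([1, 2, 3], nodes, replication=2)
print(placement)
print(available_blocks(placement, failed_nodes={"node2"}))
```

With two copies of every block, losing any single node still leaves all three blocks readable from the remaining nodes.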
Counter() Counter() Counter()
Filesystem Filesystem Filesystem
1 2 3   2 3 1   3 1 2
Scalable?
ACID Properties
Reliability at a cost
Large overhead in data processing
Website logs
Combine with structured data
SOAP Messages
Twitter, Facebook …
Data Access: through programs
NoSQL Databases
Key Value DB
Key   Value
Document DB
Key   Document
{
empID:1,
empName:Larry,
salary:infinity
}
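The difference between the two store types can be sketched with Python dictionaries. This is a toy model, not any particular NoSQL product: the key format `emp:1`, the packed byte string, and the helper `find_by_field` are all hypothetical.

```python
# Key-value store: the value is an opaque blob as far as the store is concerned.
kv_store = {"emp:1": b"Larry|infinity"}

# Document store: the value is a structured document whose fields can be queried.
doc_store = {
    1: {"empID": 1, "empName": "Larry", "salary": float("inf")},
}

def find_by_field(store, field, value):
    """Return the keys of documents whose field equals value."""
    return [key for key, doc in store.items() if doc.get(field) == value]

print(kv_store["emp:1"])
print(find_by_field(doc_store, "empName", "Larry"))
```

In the key-value case the application must know how to unpack the blob; in the document case the store itself can look inside the value.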
SQL-interface required
Hive
HiveQL
Creating a Hive Table
create table accounts (
accno int,
accname string,
balance float
)
row format delimited
fields terminated by ','
stored as textfile
location '/user/hive/db1.db/accounts';
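What the table definition describes is, in effect, a schema laid over comma-delimited text files. The Python sketch below mimics that idea on a made-up two-line sample; the sample rows and the name `read_accounts` are assumptions, and Python's `csv` module has quoting rules that may differ from how Hive reads delimited text.

```python
import csv
import io

# Column names and types mirror the accounts table definition.
SCHEMA = [("accno", int), ("accname", str), ("balance", float)]

def read_accounts(text_file):
    """Yield one dict per comma-delimited line, cast to the schema's types."""
    for row in csv.reader(text_file, delimiter=","):
        if not row:
            continue  # skip blank lines
        yield {name: cast(value) for (name, cast), value in zip(SCHEMA, row)}

# Made-up sample rows standing in for a file under the table's location.
sample = io.StringIO("101,Savings,2500.50\n102,Checking,310.00\n")
accounts = list(read_accounts(sample))
print(accounts[0])
```

The data file itself carries no column names or types; those live only in the table definition, which is applied when the file is read.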
select count(*) as cnt
from store_sales ss
join household_demographics hd on (ss.ss_hdemo_sk = hd.hd_demo_sk)
join time_dim t on (ss.ss_sold_time_sk = t.t_time_sk)
join store s on (s.s_store_sk = ss.ss_store_sk)
where
t.t_hour = 8
and t.t_minute >= 30
and hd.hd_dep_count = 2
order by cnt;
HiveQL
Map/Reduce
Divide the work and collate the results
Needs development in Java, Python, Ruby, etc.
Pig
A framework to work on the dataset in parallel
Pig Latin
Scripting language for Pig
SQL
select category, avg(pagerank)
from urls
where pagerank > 0.2
group by category
having count(*) > 1000000
Pig Latin
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls)>1000000;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
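The step-by-step flow of the Pig Latin version can be mirrored in plain Python, one statement per step. This is a single-machine sketch on made-up data; the function name `avg_pagerank_by_category` and its parameters are hypothetical, and the group-size threshold is lowered in the example so the tiny data set produces a result.

```python
from collections import defaultdict
from statistics import mean

def avg_pagerank_by_category(urls, min_pagerank=0.2, min_group_size=1_000_000):
    # FILTER urls BY pagerank > 0.2
    good_urls = [u for u in urls if u["pagerank"] > min_pagerank]
    # GROUP good_urls BY category
    groups = defaultdict(list)
    for u in good_urls:
        groups[u["category"]].append(u["pagerank"])
    # FILTER groups BY COUNT(good_urls) > threshold, then AVG per group
    return {
        category: mean(ranks)
        for category, ranks in groups.items()
        if len(ranks) > min_group_size
    }

# Tiny made-up data set; the threshold is lowered to fit it.
urls = [
    {"category": "news", "pagerank": 0.5},
    {"category": "news", "pagerank": 0.7},
    {"category": "news", "pagerank": 0.1},
    {"category": "blog", "pagerank": 0.9},
]
result = avg_pagerank_by_category(urls, min_group_size=1)
print(result)
```

Like the Pig Latin script, each intermediate result has a name and can be inspected on its own, whereas the SQL statement describes only the final answer.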
HBase
A database built on Hadoop
HiveQL
An SQL-like (but not the same) query language
Pig
Procedural logic without M/R code.
Normal programming languages, e.g. Python
YARN
Map/Reduce code in Java
Spark
Counter() Counter() Counter()
Filesystem Filesystem Filesystem
1 2 3   2 3 1   3 1 2
Hadoop processing in files
Memory is cheaper
Interactive processing needs faster access.
Spark Core
SparkShell, SparkSQL, MLlib, SparkR, PySpark
Can use Java, Python or Scala
Divide and conquer is the key
Non-shared division of data is important
Local access
Redundancy
Hadoop is a framework
You have to write the programs
Big data is batch-oriented
Hive is SQL-like
Pig Latin is a 4GL-like scripting language
Spark uses memory
Oh, I so want to Learn!
Cloudera – prebuilt VMs
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/cloudera_quickstart_vm.html
Hortonworks – prebuilt VMs
https://hortonworks.com/downloads/#sandbox
Thanks!
arup.blogspot.com @ArupNanda