The Future of Data Science
SARITH DIVAKAR M | LBS COLLEGE OF ENGINEERING, KASARAGOD
www.sarithdivakar.info sarith@cusat.ac.in
Agenda
• DATA SCIENCE
• BIG DATA
• TECHNOLOGIES
Data Scientist
“A data scientist is someone who is better
at statistics than any software engineer
and better at software engineering than
any statistician”
An Interview with Lisa Qian, Airbnb
 WHICH SKILLS OR PROGRAMMING LANGUAGES DO YOU
MOST FREQUENTLY USE IN YOUR WORK, AND WHY?
“At Airbnb, we all use Hive to query data and build derived
tables. I use R to do analysis and build models. I use Hive and R
every day of the job. A lot of data scientists use Python instead
of R – it’s just a matter of what we were familiar with when we
came in. There have also been recent efforts to use Spark to
build large-scale machine learning models.”
Reference: Mathrubhumi, “http://digitalpaper.mathrubhumi.com/943320/kochi/21-Sept-2016#page/6/2”
Data Scientist Salaries
Average Salary (2015): $118,709 per year
Minimum: $76,000
Maximum: $148,000
Median Salary (2015): $93,991 per year
Total Pay Range: $63,524 – $138,123
Data Scientist Qualifications
Master’s degree 80%
PhD 46%
Math and statistics 32%
Computer Science 19%
Engineering 16%
Reference: The Burtch Works Study, “http://www.burtchworks.com/big-data-analyst-salary/big-data-career-tips/”
Data Scientist Job Outlook
 McKinsey reported that by 2018 the U.S. could face a
shortage of 140,000 to 190,000 “people with deep
analytic skills”
Reference: Report of McKinsey Global Institute, “http://www.mckinsey.com/business-functions/business-technology/our-insights/big-data-the-next-frontier-for-innovation”
Big Data Initiative
http://www.dst.gov.in/big-data-initiative-1
What Kind of Skills Will I Need?
Past and Future of Data Science
 Descriptive analytics
 Describing what has already taken place
 Predictive analytics and real-time
analytics in pursuit of business goals
 Improving the customer experience
 Improving products and services
 Reducing costs
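The shift from describing what has already taken place to predicting what comes next can be sketched in a few lines of Python. The sales figures and function names below are made up for illustration, and the straight-line least-squares fit is only one of many possible predictive models.

```python
from statistics import mean

# Hypothetical monthly sales figures (illustrative data only).
months = [1, 2, 3, 4]
sales = [10.0, 12.0, 14.0, 16.0]


def describe(values):
    """Descriptive analytics: summarize what has already happened."""
    return {"mean": mean(values), "min": min(values), "max": max(values)}


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def predict(xs, ys, next_x):
    """Predictive analytics: extrapolate the fitted trend to a new point."""
    slope, intercept = fit_line(xs, ys)
    return slope * next_x + intercept


summary = describe(sales)             # what happened in months 1 to 4
forecast = predict(months, sales, 5)  # what the trend suggests for month 5
```

The descriptive summary only looks backwards, while the forecast is only as good as the assumption that the trend continues.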
Where should data scientists prioritize their focus?
 Amazon, Google and Netflix.
 Python
 Variety of tools, perspectives and approaches
 Identify methods and models most appropriate
in a particular use case.
Reference: Devavrat Shah, Professor, Department of Electrical Engineering and Computer Science, MIT, “http://blog.edx.org/future-data-science-qa-mit-professional-educations-devavrat-shah”
Popular Applications
 Internet Search
 Digital Advertisements
 Gaming
Data Science to refine the “Crude Oil”
 Volume
 Variety
 Velocity
 Veracity
 Value
 (add your own V here…..)
Where does big data come from?
 A huge amount of data is created every day!
 It comes from us!
 Non-digitized processes become digitized
 Digital India
 Programme to transform India into a digitally
empowered society and knowledge economy
Excavating Hidden Treasures from Big Data
 Insights into data can provide business advantage
 Some key early indications can mean fortunes to
business
 More precise analysis with more data
 Integrate Big Data with traditional data: Enhance
business intelligence analysis
Challenges in big data
 Heterogeneity and
incompleteness
 Scale
 Timeliness
 Privacy
 Human collaboration
RDBMS : Why not for Big Data?
 Limitations in RDBMS
 RDBMS cannot handle petabytes of data
 Seek time of disk drives is improving more slowly than transfer
rate of data
 RDBMS are not built to handle unstructured or semi-structured
data
 Normalization makes large data sets difficult to handle
 Example: web logs
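The point about seek time versus transfer rate can be made concrete with some back-of-the-envelope arithmetic. Every figure below (10 ms per seek, 100 MB/s transfer, a 1 TB data set of 100-byte records) is an assumption chosen for illustration, not a measurement of any real system.

```python
# All figures are illustrative assumptions, not measurements.
SEEK_MS = 10                           # assumed average seek time per record
TRANSFER_BYTES_PER_SEC = 100 * 10**6   # assumed sequential rate: 100 MB/s
DATASET_BYTES = 10**12                 # assumed data set size: 1 TB
RECORD_BYTES = 100                     # assumed record size

TOTAL_RECORDS = DATASET_BYTES // RECORD_BYTES


def seek_bound_seconds(updated_records):
    """Time if every updated record costs one disk seek (index-style access)."""
    return updated_records * SEEK_MS / 1000


def streaming_scan_seconds():
    """Time to read the whole data set once, sequentially."""
    return DATASET_BYTES / TRANSFER_BYTES_PER_SEC


# Touch just 1% of the records, one seek each, versus scanning everything.
seek_time = seek_bound_seconds(TOTAL_RECORDS // 100)
scan_time = streaming_scan_seconds()
```

Under these assumptions, seeking to 1% of the records takes about a million seconds (over eleven days), while streaming through the entire data set takes about ten thousand seconds (under three hours). This is why systems built for big data favour sequential scans over seek-heavy access.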
Distributed computing
 Dividing large problems into smaller ones that are solved
concurrently ("in parallel")
 Connecting multiple machines together for
 Storing big files
 Parallel processing
 Data locality
 Redundancy
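The idea of dividing a large problem into smaller ones and solving them concurrently can be sketched on a single machine. In this minimal sketch, worker threads stand in for the separate machines of a real cluster, and the function names are hypothetical. Because of Python's global interpreter lock, threads here illustrate the structure of the approach rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Divide a large list into roughly equal smaller pieces."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def partial_sum(chunk):
    """Solve one small sub-problem."""
    return sum(chunk)


def parallel_sum(items, n_workers=4):
    """Split the work, solve the pieces concurrently, combine the results."""
    chunks = split_into_chunks(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(partial_sum, chunks))


total = parallel_sum(list(range(1, 101)))
```

The same split, solve and combine shape carries over to a cluster, where each chunk would ideally be processed on the machine that already stores it (data locality).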
Challenges in distributed computing
Distributed computing had some challenges that kept
organizations from depending on it:
 Concurrency control
 Data synchronization
 Atomic commit
 Transaction split into small tasks
 Leader election
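To show what one of these challenges involves, here is a minimal sketch of atomic commit using the classic two-phase commit idea: a change is applied on every node or on none. The `Participant` class and `two_phase_commit` function are hypothetical names for this illustration. The sketch ignores crashes, timeouts and coordinator failure, which are exactly what makes the problem hard in practice.

```python
class Participant:
    """A hypothetical node that votes on, then applies or discards, a change."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote yes only if the change can be applied locally.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        # Phase 2 (success): make the change permanent.
        self.state = "committed"

    def abort(self):
        # Phase 2 (failure): discard the change.
        self.state = "aborted"


def two_phase_commit(participants):
    """Commit on all nodes or on none. Returns True if committed."""
    votes = [p.prepare() for p in participants]  # ask every node to vote
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


healthy = [Participant("a"), Participant("b")]
mixed = [Participant("a"), Participant("b", can_commit=False)]
committed = two_phase_commit(healthy)
rejected = two_phase_commit(mixed)
```

A single "no" vote aborts the change everywhere, so no node ends up with a partial update.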
Big data and cloud: converging
technologies
 Big data: Extracting value out of “variety,
velocity and volume” from unstructured
information available
 Cloud: On-demand, elastic, scalable, pay-per-use,
self-service model
Answer these before moving to big data
analysis
 Do you actually have a big data problem?
 Can the business benefit from using Big Data?
 Do your data volumes really require these distributed
mechanisms?
Technology to handle big data
 Google was among the first companies to use big data effectively
 Engineers at Google created massively distributed
systems
 They collected and analyzed massive collections of web pages
and the relationships between them, and created the “Google
Search Engine”, capable of querying billions of pages
First generation of Distributed systems
 Proprietary
 Custom Hardware and software
 Centralized data
 Hardware based fault recovery
 E.g. Teradata, Netezza
Second generation of Distributed systems
 Open source
 Commodity hardware
 Distributed data
 Software based fault recovery
 E.g. Hadoop, HPCC
Why do we need a new generation?
 A lot has changed since 2000
 Both hardware and software have gone through changes
 Big data has become a necessity
 Let’s look at what changed over the decade
Changes in Hardware
State of hardware in 2000 | State of hardware now
Disk was cheap, so disk was the primary source of data | RAM is king
Network was costly, so data locality mattered | RAM is the primary source of data, with disk as fallback
RAM was very costly | Network is faster
Single-core machines were dominant | Multi-core machines are commonplace
Shortcomings of Second generation
 Batch processing is primary objective
 Not designed to change depending upon use cases
 Tight coupling between API and run time
 Do not exploit new hardware capabilities
 Too complex
Third generation distributed systems
 Handle both batch processing and real time
 Exploit RAM as much as disk
 Multi-core aware
 Do not reinvent the wheel
 They use
 HDFS for storage
 Apache Mesos / YARN for distribution
 Plays well with Hadoop
Hadoop vs Spark
Hadoop | Spark
Stores data on disk | Stores data in memory (RAM)
Commodity hardware can be utilized | Needs higher-end systems with more RAM
Uses replication to achieve fault tolerance | Uses different data storage models to achieve fault tolerance (e.g. RDD)
Processing is slower because of disk reads and writes | Up to 100x faster than Hadoop MapReduce for in-memory workloads
Native MapReduce API is Java; other languages through Hadoop Streaming | Supports Java, Python, R, Scala etc. Ease of programming is high.
Everything is just Map and Reduce | Supports Map, Reduce, SQL, Streaming etc.
Data should be in HDFS | Data can be in HDFS, Cassandra, HBase or S3. Runs on Hadoop, cloud, Mesos or standalone
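The "Map and Reduce" model mentioned in the comparison can be illustrated with the usual word-count example. This is a single-process sketch in plain Python, not the Hadoop or Spark API, and the function names are chosen only for this illustration. A real framework would run the map and reduce steps on many machines and handle the grouping step across the network.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "Big Data analysis"])
```

Each map call looks at one line only and each reduce call at one key only, which is what lets a framework spread the work across machines.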
Spark Open Source Ecosystem
Who is using Spark
Get Your Hands Dirty With Data
References
1. The Burtch Works Study, http://www.burtchworks.com/big-data-analyst-salary/big-data-career-tips/
2. Mathrubhumi, http://digitalpaper.mathrubhumi.com/943320/kochi/21-Sept-2016#page/6/2
3. Report of McKinsey Global Institute, http://www.mckinsey.com/business-functions/business-technology/our-insights/big-data-the-next-frontier-for-innovation
4. Devavrat Shah, Professor, Department of Electrical Engineering and Computer Science, MIT, http://blog.edx.org/future-data-science-qa-mit-professional-educations-devavrat-shah
5. “Data Mining and Data Warehousing”, M.Sudheep Elayidom, SOE, CUSAT
6. “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”. Matei Zaharia,
Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott
Shenker, Ion Stoica. NSDI 2012. April 2012. Best Paper Award.
7. “What is Big Data”, https://www-01.ibm.com/software/in/data/bigdata/
8. “Apache Hadoop”, https://hadoop.apache.org/
9. “Apache Spark”, http://spark.apache.org/
  • 2. Agenda • DATA SCIENCE • BIG DATA • TECHNOLOGIES
  • 3. Data Scientist “A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician”
  • 4. An Interview with Lisa Qian, Airbnb • WHICH SKILLS OR PROGRAMMING LANGUAGES DO YOU MOST FREQUENTLY USE IN YOUR WORK, AND WHY? “At Airbnb, we all use Hive to query data and build derived tables. I use R to do analysis and build models. I use Hive and R every day of the job. A lot of data scientists use Python instead of R – it’s just a matter of what we were familiar with when we came in. There have also been recent efforts to use Spark to build large-scale machine learning models.”
  • 6. Data Scientist Salaries Average Salary (2015): $118,709 per year Minimum: $76,000 Maximum: $148,000 Median Salary (2015): $93,991 per year Total Pay Range: $63,524 – $138,123
  • 7. Data Scientist Qualifications Master’s degree 80% PhD 46% Math and statistics 32% Computer Science 19% Engineering 16% Reference: The Burtch Works Study, “http://www.burtchworks.com/big-data-analyst-salary/big-data-career-tips/”
  • 8. Data Scientist Job Outlook • McKinsey reported that by 2018 the U.S. could face a shortage of 140,000 to 190,000 “people with deep analytic skills” Reference: Report of McKinsey Global Institute, “http://www.mckinsey.com/business-functions/business-technology/our-insights/big-data-the-next-frontier-for-innovation”
  • 10. What Kind of Skills Will I Need?
  • 11.
  • 12. Past and Future of Data Science • Descriptive analytics • Describing what has already taken place • Predictive analytics and real-time analytics in pursuit of business goals • Improving the customer experience • Improving products and services • Reducing costs
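
The contrast on slide 12 between describing the past and predicting the future can be sketched in a few lines. The monthly sales figures below are invented purely for illustration; the descriptive step summarises what already happened, and the predictive step fits an ordinary least-squares line and extrapolates it. This is a minimal sketch, not a recommendation of a particular model.

```python
from statistics import mean

# Hypothetical monthly sales figures (illustrative data, not from the slides).
months = [1, 2, 3, 4, 5, 6]
sales = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]

# Descriptive analytics: summarise what has already taken place.
average_sales = mean(sales)
best_month = months[sales.index(max(sales))]

# Predictive analytics: fit sales = slope * month + intercept
# by ordinary least squares over the observed months.
x_bar, y_bar = mean(months), mean(sales)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(months, sales)) / sum(
    (x - x_bar) ** 2 for x in months
)
intercept = y_bar - slope * x_bar


def predict(month: int) -> float:
    """Extrapolate the fitted line to a month that has not happened yet."""
    return slope * month + intercept
```

Calling `predict(7)` extrapolates the fitted trend one month past the observed data, whereas `average_sales` and `best_month` only describe months that have already been recorded.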
  • 13. Where Should Data Scientists Prioritize Their Focus? • Amazon, Google and Netflix. • Python • Variety of tools, perspectives and approaches • Identify methods and models most appropriate in a particular use case. Reference: Devavrat Shah, Professor, Department of Electrical Engineering and Computer Science, MIT, “http://blog.edx.org/future-data-science-qa-mit-professional-educations-devavrat-shah”
  • 14. Popular Applications • Internet Search • Digital Advertisements • Gaming
  • 15. Data Science to refine the “Crude Oil” • Volume • Variety • Velocity • Veracity • Value • (add your own V here...)
  • 16. Where does big data come from? • A huge amount of data is created every day! • It comes from us! • Non-digitized processes become digitized • Digital India • Programme to transform India into a digitally empowered society and knowledge economy
  • 17. Excavating Hidden Treasures from Big Data • Insights into data can provide business advantage • Some key early indications can mean fortunes to business • More precise analysis with more data • Integrate Big Data with traditional data: Enhance business intelligence analysis
  • 18.
  • 19. Challenges in big data • Heterogeneity and incompleteness • Scale • Timeliness • Privacy • Human collaboration
  • 20. RDBMS: Why not for Big Data? • Limitations of RDBMS • RDBMS cannot handle petabytes of data • Seek time of disk drives is improving more slowly than the data transfer rate • RDBMS are not built to handle unstructured or semi-structured data • Normalization of data makes it difficult to handle large data sets • Example: web logs
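
The seek-time point on slide 20 is easier to appreciate with rough arithmetic. The sketch below compares streaming through a dataset sequentially with touching it in small random blocks, where every block costs one disk seek. All figures (1 TB dataset, 100 MB/s transfer, 10 ms seek, 4 KB blocks) are illustrative assumptions, not measurements from the slides.

```python
# All figures are illustrative assumptions, not measurements from the slides.
DATASET_BYTES = 10**12        # 1 TB of data
TRANSFER_RATE = 100 * 10**6   # 100 MB/s sustained sequential transfer
SEEK_TIME = 0.010             # 10 ms per random seek
BLOCK_BYTES = 4096            # size of each randomly accessed block


def sequential_scan_seconds(total_bytes: int) -> float:
    """Time to stream the whole dataset, dominated by the transfer rate."""
    return total_bytes / TRANSFER_RATE


def random_access_seconds(total_bytes: int, block_bytes: int) -> float:
    """Time to read the same data block by block, paying one seek per block."""
    blocks = total_bytes / block_bytes
    return blocks * (SEEK_TIME + block_bytes / TRANSFER_RATE)


sequential = sequential_scan_seconds(DATASET_BYTES)
random_io = random_access_seconds(DATASET_BYTES, BLOCK_BYTES)
```

Under these assumptions the sequential scan takes 10,000 seconds (under three hours), while the seek-bound access pattern takes a few hundred times longer. That gap is why batch systems built for large data sets favour streaming reads over the seek-heavy access patterns of index-based updates.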
  • 21. Distributed computing • Dividing large problems into smaller ones, which are then solved concurrently ("in parallel") • Connecting multiple machines together for • Storing big files • Parallel processing • Data locality • Redundancy
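
The divide-and-solve-concurrently idea on slide 21 can be sketched on a single machine. In the sketch below, threads stand in for separate machines and a sum of squares stands in for the "large problem"; both choices are assumptions made for illustration. Because of Python's global interpreter lock, this shows the structure of the approach rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Divide the input into at most `parts` contiguous chunks."""
    if not data:
        return []
    size = -(-len(data) // parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum_of_squares(chunk):
    """The small problem each worker solves independently."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=4):
    """Split the problem, solve the pieces concurrently, combine the results."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)
```

The same split, solve and combine shape carries over to a real cluster, where each chunk would ideally be processed on the machine that already stores it (data locality).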
  • 22. Challenges in distributed computing Distributed computing came with challenges that made organizations reluctant to depend on it: • Concurrency control • Data synchronization • Atomic commit • Transaction split into small tasks • Leader election
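
Of the challenges listed on slide 22, atomic commit is the easiest to illustrate compactly. The sketch below follows the general shape of a two-phase commit: a coordinator asks every participant to prepare and commits only if all of them vote yes. The `Participant` class and its `can_commit` flag are hypothetical names introduced for this example, and the sketch deliberately ignores timeouts and coordinator failure, which are what make the real problem hard.

```python
class Participant:
    """A hypothetical node taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # assumption: the vote is known up front
        self.state = "init"

    def prepare(self):
        """Phase 1: vote on whether this node is able to commit."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Commit on every node or on none of them."""
    # Phase 1: collect votes from all participants.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if the vote was unanimous, otherwise roll back.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

A single "no" vote aborts the transaction everywhere, which is the all-or-nothing property the slide refers to as atomic commit.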
  • 23. Big data and cloud: converging technologies • Big data: Extracting value out of “variety, velocity and volume” from unstructured information available • Cloud: On-demand, elastic, scalable, pay-per-use, self-service model
  • 24. Answer these before moving to big data analysis • Do you have an effective big data problem? • Can the business benefit from using Big Data? • Do your data volumes really require these distributed mechanisms?
  • 25. Technology to handle big data • Google was the first company to effectively use big data • Engineers at Google created massively distributed systems • Collected and analyzed massive collections of web pages and the relationships between them, and created the “Google Search Engine” capable of querying billions of pages
  • 26. First generation of Distributed systems • Proprietary • Custom hardware and software • Centralized data • Hardware-based fault recovery • E.g. Teradata, Netezza etc.
  • 27. Second generation of Distributed systems • Open source • Commodity hardware • Distributed data • Software-based fault recovery • E.g. Hadoop, HPCC
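
Software-based fault recovery on commodity hardware, as listed on slide 27, usually means keeping several copies of each piece of data on different machines. The sketch below uses a replication factor of 3 (the HDFS default) with a simple round-robin placement; the placement rule and the node and block names are simplifying assumptions and do not reproduce HDFS's actual rack-aware policy.

```python
def place_blocks(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for i, block in enumerate(block_ids):
        # Consecutive nodes (wrapping around) hold the replicas of one block.
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a healthy node."""
    return {
        block
        for block, replicas in placement.items()
        if any(node not in failed_nodes for node in replicas)
    }


# Hypothetical four-node cluster holding five blocks.
nodes = ["n0", "n1", "n2", "n3"]
blocks = ["b0", "b1", "b2", "b3", "b4"]
placement = place_blocks(blocks, nodes)
```

With three replicas per block, any two nodes can fail without losing data; only when all three holders of a block fail does that block become unreadable.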
  • 28. Why do we need a new generation? • A lot has changed since 2000 • Both hardware and software have gone through changes • Big data has become a necessity • Let’s look at what changed over the decade
  • 29. Changes in Hardware
      State of hardware in 2000 | State of hardware now
      Disk was cheap, so disk was the primary source of data | RAM is king
      Network was costly, so data locality was important | RAM is the primary source of data, with disk as fallback
      RAM was very costly | Network is faster
      Single-core machines were dominant | Multi-core machines are commonplace
  • 30. Shortcomings of Second generation • Batch processing is the primary objective • Not designed to adapt to different use cases • Tight coupling between API and runtime • Do not exploit new hardware capabilities • Too complex
  • 31. Third generation distributed systems • Handle both batch processing and real time • Exploit RAM as much as disk • Multi-core aware • Do not reinvent the wheel • They use • HDFS for storage • Apache Mesos / YARN for distribution • Play well with Hadoop
  • 32. Hadoop vs Spark
      Hadoop | Spark
      Stores data on disk | Stores data in memory (RAM)
      Commodity hardware can be utilized | Needs high-end systems with more RAM
      Uses replication to achieve fault tolerance | Uses different data storage models to achieve fault tolerance (e.g. RDD)
      Processing is slower because of disk reads and writes | Up to 100x faster than Hadoop
      Supports only Java & R | Supports Java, Python, R, Scala etc. Ease of programming is high.
      Everything is just Map and Reduce | Supports Map, Reduce, SQL, Streaming etc.
      Data should be in HDFS | Data can be in HDFS, Cassandra, HBase or S3. Runs on Hadoop, Cloud, Mesos or standalone
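
The "everything is just Map and Reduce" row on slide 32 refers to a programming model that can be mimicked in plain Python. The word-count sketch below walks through the three stages (map, shuffle, reduce) in a single process; the function names are chosen for this example and are not Hadoop's or Spark's API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

On a cluster, the map calls would run in parallel on the machines holding each input split, and the shuffle would move pairs over the network. In Hadoop MapReduce the intermediate results are written to disk, whereas Spark can keep them in memory, which is the main source of the speed difference shown in the table.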
  • 33. Spark Open Source Ecosystem
  • 34. Who are using Spark
  • 35. Get Your Hands Dirty With Data
  • 36. References
      1. The Burtch Works Study, http://www.burtchworks.com/big-data-analyst-salary/big-data-career-tips/
      2. Mathrubhumi, http://digitalpaper.mathrubhumi.com/943320/kochi/21-Sept-2016#page/6/2
      3. Report of McKinsey Global Institute, http://www.mckinsey.com/business-functions/business-technology/our-insights/big-data-the-next-frontier-for-innovation
      4. Devavrat Shah, Professor, Department of Electrical Engineering and Computer Science, MIT, http://blog.edx.org/future-data-science-qa-mit-professional-educations-devavrat-shah
      5. “Data Mining and Data Warehousing”, M. Sudheep Elayidom, SOE, CUSAT
      6. “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”, Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012, April 2012. Best Paper Award.
      7. “What is Big Data”, https://www-01.ibm.com/software/in/data/bigdata/
      8. “Apache Hadoop”, https://hadoop.apache.org/
      9. “Apache Spark”, http://spark.apache.org/

Editor's notes

  1. Glassdoor helps you find a job and company you love. Reviews, salaries and benefits from employees. Interview questions from candidates. Millions of jobs. PayScale, Inc. or payscale.com is an online salary, benefits and compensation information company, which launched its service on January 1, 2002. It was founded by Joe Giordano, a former Microsoft and drugstore.com manager, and John Gaffney
  2. Math (e.g. linear algebra, calculus and probability); statistics (e.g. hypothesis testing and summary statistics); machine learning tools and techniques (e.g. k-nearest neighbors, random forests, ensemble methods, etc.); software engineering skills (e.g. distributed computing, algorithms and data structures); data mining; data visualization (e.g. ggplot and d3.js) and reporting techniques; unstructured data techniques; R and/or SAS languages; SQL databases and database querying languages; Python (most common), C/C++, Java, Perl; big data platforms like Hadoop, Hive & Pig; cloud tools like Amazon S3
  3. Devavrat Shah received his Bachelor of Technology in Computer Science and Engineering from the Indian Institute of Technology, Bombay in 1999 with the President of India Gold Medal, awarded to the best graduating student across all engineering disciplines. He received his PhD in Computer Science from Stanford University in 2004.