My recent presentation on Big Data: what it is, why there is so much hype now, startling facts, the opportunity, its history, important research papers such as GFS and MapReduce, technology platforms and organizations (Hadoop, Cassandra), an introduction to Hadoop, and the contributions of Indians working on various Big Data technologies at Google, Cloudera, Hortonworks, Yahoo, Facebook, and Aadhar. "All your answers lie in data" - @Sameer Sawhney
We all live in the Data Age. While data storage capacity has increased, the speed at which data can be read is still very slow, and the amount of data that is publicly available is growing at a very fast pace. Big data[1][2] is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
A colossal amount of data is being generated, and this has changed things.
In the good old days, we used an RDBMS to store and process this data: we brought the data to the processing units. But now the data is too huge to move, so the computation has to go to the data instead. Two technologies, described in the GFS and MapReduce papers, have made this possible.
Characteristics of Big Data (Gartner defined the 3 Vs: Volume, Velocity, and Variety):
- Volume: measured in petabytes and exabytes, not terabytes. Twitter alone generates around 7 TB of data every day, Facebook 10 TB, and Google 20 PB every day.
  - In 2013: 200 million active users creating over 400 million tweets each day.
  - In 2011: 200 million tweets every day, the equivalent of a 10-million-page book; reading this text would take 31 years.
  - In 2010: 65 million tweets a day.
  - In 2009: 2 million tweets a day.
- Velocity: the speed at which the data arrives.
- Variety: data comes from many different sources.
- Veracity: can this data be trusted?
- Value: is the data meaningful?
Questions to ask of any Big Data technology:
- What is the problem that the solution solves?
- Technology overview
- Specific solution
- Challenges in the current implementation/solution, if any
- Advantages and disadvantages
- Any alternatives to the specific solution
- Way forward for the technology/solution (optional)
In defining big data, it's also important to understand the mix of unstructured and multi-structured data that comprises the volume of information.

Unstructured data comes from information that is not organized or easily interpreted by traditional databases or data models, and typically, it's text-heavy. Metadata, Twitter tweets, and other social media posts are good examples of unstructured data.

Multi-structured data refers to a variety of data formats and types and can be derived from interactions between people and machines, such as web applications or social networks. A great example is web log data, which includes a combination of text and visual images along with structured data like form or transactional information.
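As a minimal sketch of how the structured part of such a log can be pulled out, here is a small Java example that parses one line in the Apache common log format. The sample line, class name, and field names are made up for illustration; this is not tied to any particular log-processing product.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParser {
  // Common log format: host, identity, user, [timestamp], "request", status, bytes.
  private static final Pattern COMMON_LOG = Pattern.compile(
      "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) [^\"]*\" (\\d{3}) (\\S+)");

  public static void main(String[] args) {
    // A made-up sample line in common log format.
    String line = "203.0.113.9 - - [12/Mar/2013:10:15:32 +0000] "
        + "\"GET /index.html HTTP/1.1\" 200 5120";
    Matcher m = COMMON_LOG.matcher(line);
    if (m.find()) {
      // The free-text log line becomes structured, queryable fields.
      System.out.printf("host=%s time=%s method=%s path=%s status=%s bytes=%s%n",
          m.group(1), m.group(2), m.group(3), m.group(4), m.group(5), m.group(6));
    }
  }
}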
The Big Data pipeline: Data Source → Data Repository (where data persists) → Filter and Transform → Compute (a distributed, scale-out system). MapReduce is inevitable.

A short history:
- 1980s: the impedance mismatch problem; the rows and columns of relational databases served as the integration mechanism (relational dominance lasted into the 2000s).
- 1990s: object databases.
- 2000s: big Internet sites such as Amazon and Google faced enormous traffic. Bigger boxes hit real limits and real costs; the alternative was lots of little boxes, but SQL was designed for single-node systems. Google built BigTable; Amazon built Dynamo.
- The NoSQL movement: the term comes from Johan Oskarsson, who, from London, proposed a meetup in San Francisco (late 2000s) and needed a short, unique Twitter hashtag to advertise that single meeting: #nosql.

Data models:
1. Key-value: a key maps to an opaque value.
2. Document: JSON documents (no schema); queries can address portions of documents.
3. Column family: a single row key holds multiple column families, where each column family is an aggregate of columns that fit together.

An aggregate is about storing all related items in one cluster; a sketch of the three models follows.
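A rough sketch of the three data models, using plain Java maps to stand in for the stores. The record ("order:1001") and its fields are hypothetical, and no particular product's API is shown; the point is only how much structure each model exposes.

import java.util.List;
import java.util.Map;

public class DataModels {
  public static void main(String[] args) {
    // 1. Key-value: the store sees only an opaque blob per key;
    //    the application is responsible for (de)serializing it.
    Map<String, byte[]> keyValue = Map.of(
        "order:1001", "{\"customer\":\"asha\",\"total\":42.50}".getBytes());

    // 2. Document: the store understands the (JSON-like) structure,
    //    so queries can reach into portions of the document.
    Map<String, Object> document = Map.of(
        "_id", "order:1001",
        "customer", "asha",
        "items", List.of(Map.of("sku", "B-7", "qty", 2)));

    // 3. Column family: one row key, with related columns grouped into
    //    families that are read and written together.
    Map<String, Map<String, Map<String, String>>> columnFamily = Map.of(
        "order:1001", Map.of(
            "summary", Map.of("customer", "asha", "total", "42.50"),
            "items",   Map.of("B-7", "2")));

    System.out.println(keyValue.keySet() + " " + document + " " + columnFamily);
  }
}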
Hadoop

MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.
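For concreteness, here is the classic WordCount example from the Hadoop MapReduce tutorial, in its new-API (org.apache.hadoop.mapreduce) form; the walkthrough below refers to this code.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: receives one line at a time from the (default) TextInputFormat,
  // splits it on whitespace, and emits a <word, 1> pair for every token.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: receives <word, [1, 1, ...]> and sums the counts for each word.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // The reducer doubles as a combiner: partial sums are computed map-side.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, it runs as: hadoop jar wordcount.jar WordCount <input-dir> <output-dir> (the jar name and paths here are just examples).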
The Mapper implementation, via its map method, processes one line at a time, as provided by the specified TextInputFormat. It then splits the line into tokens separated by whitespace, via the StringTokenizer, and emits a key-value pair of <<word>, 1>. The Reducer implementation, via its reduce method, just sums up the values, which are the occurrence counts for each key (i.e., words in this example).