BIG DATA
Need of Converged Data Platform
UV Saradhi.
Presentation Overview
1. Big data. Why is it really big?
2. Technologies that are available today.
3. Need of Converged Data Platform.
4. Innovation!! What does it take?
Video Surveillance
A 704 x 576 resolution CCTV camera generates roughly 1 GB of data per hour.
Video surveillance data is estimated at 6000 PB in 2017.
Surge in biometric applications.
Who stole my jacket? I forgot it on the desk. The office has CCTV!!
Autonomous Cars
A driverless car generates roughly 1 GB/sec.
The expectation is 2 PB per car.
The car goes on a trip and comes back safely. “Was the drive good?”
What if someone files a lawsuit after six
months?
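A quick back-of-the-envelope check of the figures above, as a minimal sketch. It assumes decimal units (1 PB = 1,000,000 GB) and a constant 1 GB/sec rate; both are simplifying assumptions, not figures from the talk.

```python
# How many hours of driving fill 2 PB at 1 GB/sec?
GB_PER_SEC = 1           # rate quoted on the slide
PB_PER_CAR = 2           # per-car expectation quoted on the slide
GB_PER_PB = 1_000_000    # assumption: decimal units

seconds = PB_PER_CAR * GB_PER_PB / GB_PER_SEC
hours = seconds / 3600
print(f"about {hours:.0f} hours of driving")
```

Under these assumptions the per-car data corresponds to several hundred hours of recording, all of which may need to stay retrievable if a dispute arises months later.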
Aadhaar
Biometric identity for Indian citizens.
~5 MB per citizen.
Amounts to around 15 PB of raw data.
100 million authentications per day. Each authentication is roughly 4 KB or more of data.
Sub-second response needed.
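The authentication figures above translate into an average request rate and a daily data volume. The sketch below assumes 4 KB means 4,000 bytes and that load is spread evenly over the day; real traffic will peak well above the average.

```python
AUTHS_PER_DAY = 100_000_000      # from the slide
BYTES_PER_AUTH = 4_000           # assumption: 4 KB taken as 4,000 bytes
SECONDS_PER_DAY = 24 * 60 * 60

# Average rate if requests were spread evenly across the day (an assumption).
avg_auths_per_sec = AUTHS_PER_DAY / SECONDS_PER_DAY
# Daily authentication payload in decimal gigabytes.
daily_gb = AUTHS_PER_DAY * BYTES_PER_AUTH / 1e9

print(f"average rate: {avg_auths_per_sec:.0f} authentications/sec")
print(f"daily volume: {daily_gb:.0f} GB")
```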
Aadhaar Continued ...
Enrolled data moves from hot to cold. Data temperature varies.
Data analytics is needed.
https://uidai.gov.in/images/FrontPageUpdates/uid_doc_30012012.pdf
Data is stored on MapR Technologies' platform.
https://uidai.gov.in/images/AadhaarTechnologyArchitecture_March2014.pdf
Retailers View
Walmart needs to process 2.4 PB per hour.
Gain insights from the data within 30 to 40 minutes.
Errors in insights caused by bugs or miscalculation burn money.
Need to model 40 PB of recent transactional data.
Retailers View Continued ...
Data insights revealed that two particular stores were not selling popular cookies. That is not easy to find!!
Alert when a particular metric threshold is violated. This helps reduce the turnaround time.
200 billion rows of transaction data have to be processed.
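The threshold alert mentioned above can be sketched in a few lines. The store names, sales numbers and the threshold of 10 units are all hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical daily unit sales per store for one popular product.
daily_units = {"store-101": 340, "store-102": 0, "store-103": 295, "store-104": 2}

def stores_below(units_by_store, min_units):
    """Return the stores whose sales fall below the alert threshold, sorted by name."""
    return sorted(s for s, units in units_by_store.items() if units < min_units)

for store in stores_below(daily_units, min_units=10):
    print(f"ALERT: {store} sold fewer than 10 units")
```

At the scale on the slide the same comparison would run inside a distributed engine rather than over an in-memory dict, but the rule itself stays this simple.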
Retailer Needs ...
Building a 360-degree view of the customer. Measuring brand sentiment.
Creating customized promotions.
Improving store layout. Layout matters in making you purchase more!!
Click streams.
Inventory management.
Selling baby lotion to pregnant women; noticing that the weather is bad and selling pizzas.
BIG DATA: Technologies Primer
Search for “GOD” on a laptop with a 1 terabyte drive. Assume 100 MB/sec throughput.
How to speed up the search for “GOD”?
Add more CPUs. Okay, how many? 128 or 256 or 512?
Add more memory. How much? How many DIMMs? 16 or 64 or 128?
Tired!! Ah, I now realize a single machine cannot solve the problem.
Do it with multiple machines. Maybe commodity machines, but scale out in a huge way.
How to distribute storage?
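The search example above comes down to simple arithmetic. The sketch below assumes decimal units (1 TB = 1,000,000 MB), a full sequential scan, and perfect parallel speed-up when the data is split evenly across machines; the machine count of 100 is an illustrative choice.

```python
def scan_seconds(total_mb, mb_per_sec, machines=1):
    """Time to scan the data when it is split evenly across machines."""
    return total_mb / (mb_per_sec * machines)

ONE_TB_IN_MB = 1_000_000  # assumption: decimal units

single = scan_seconds(ONE_TB_IN_MB, 100)
cluster = scan_seconds(ONE_TB_IN_MB, 100, machines=100)
print(f"1 machine:    {single / 3600:.1f} hours")
print(f"100 machines: {cluster:.0f} seconds")
```

The drive throughput, not the CPU count or memory size, sets the single-machine limit, which is why the slide moves on to distributing storage.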
Technologies : Compute, Storage and Network
Scale by moving compute close to data.
Store data efficiently on multiple nodes.
No compromise on reliability.
No compromise on availability.
Automatically take care of the addition and removal of nodes.
Help extract the performance of the underlying devices.
Network:
Do not let compute operate on data across the network.
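Moving compute close to data can be illustrated with a toy scheduler that prefers a node already holding the block. The block map and node names are hypothetical, and this is a sketch of the idea rather than how any particular scheduler is implemented.

```python
# Hypothetical block map: block id -> nodes holding a replica of that block.
BLOCK_LOCATIONS = {
    "block-1": ["node-a", "node-b", "node-c"],
    "block-2": ["node-b", "node-d", "node-e"],
}

def pick_node(block_id, idle_nodes, block_locations=BLOCK_LOCATIONS):
    """Prefer an idle node that stores the block; otherwise fall back to any idle node."""
    for node in block_locations.get(block_id, []):
        if node in idle_nodes:
            return node, "local"
    # No idle replica holder: the block would have to be read over the network.
    return sorted(idle_nodes)[0], "remote"

print(pick_node("block-1", {"node-c", "node-d"}))
print(pick_node("block-2", {"node-a", "node-c"}))
```

The sketch assumes at least one idle node is available; a real scheduler would also queue or wait when none is.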
Technologies Available Today
Hadoop! What exactly is Hadoop?
MapReduce! When is it the right choice?
YARN? Is it a refined MapReduce? Tighter control over resource management and job scheduling/monitoring?
It looks like the Hadoop core is distributed storage and MapReduce is the compute engine. Is the processing real time? Are we good to go??
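For readers new to the MapReduce model, here is a minimal single-process sketch of its three steps (map, shuffle, reduce) applied to word counting. It only illustrates the programming model; it does not use Hadoop, and the function names are illustrative.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(word_count(["big data", "Big platform"]))
```

In a real cluster the map calls run on the nodes that hold the input blocks and the shuffle moves data between nodes. The job processes a whole batch at once, which is why the slide asks whether the processing is real time.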
Technologies Continued ...
How to push data into Hadoop storage? Use Flume?
How to push data from an existing application that writes to a legacy file system? Does it have to be rebuilt?
Can the entire big data store (aka Hadoop) be accessed over NFS?
Okay, we somehow get the data into Hadoop. Does that solve all needs? Is there a way to address data as key-value pairs?
Unstructured Data as Key-Value Pairs
Why do we need unstructured data as key-value pairs?
Aadhaar needs to store biometric signatures, addresses, fingerprints, etc.
Retailers need to show various attributes of their products: images, technical specifications, tables, columns, reviews, etc.
IoT (Internet of Things) devices generate a lot of unstructured data.
How to store and process it all? Need for more technologies ...
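The appeal of key-value storage for such data is that each row can carry its own set of attributes. The sketch below uses a plain in-memory dict to stand in for a key-value table; the row keys, column names and values are hypothetical.

```python
# A plain dict stands in for a key-value table: row key -> {column: value}.
table = {}

def put(row_key, column, value):
    """Store one value under a row key and column; rows need not share columns."""
    table.setdefault(row_key, {})[column] = value

def get(row_key, column, default=None):
    """Fetch one value, or the default if the row or column is absent."""
    return table.get(row_key, {}).get(column, default)

put("citizen:0001", "address", "Hyderabad")
put("citizen:0001", "fingerprint", b"\x01\x02\x03")
put("product:42", "review", "Works well")

print(get("citizen:0001", "address"))
print(get("product:42", "address"))
```

No schema is declared up front, so a product row and a citizen row coexist without empty columns.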
Big Table
HBase. Tries to address the key-value pair problem.
Cassandra. Tries to address the key-value pair problem.
MapR-DB. Addresses the key-value pair problem.
Is there a JOIN operation on these tables? Can there be atomic operations across different rows? How about calling these NoSQL DBs?
How can one decide on the right technology?
NoSQL DB
MongoDB.
CouchDB.
MapR-DB JSON.
Why are there still more databases? What more do these provide?
Is querying data still a challenge?
Data Query Engines
Hive
Impala
Drill
Presto
Pig
Spark SQL
Real Time Analysis of Data
Hadoop, connectors to Hadoop, unstructured key-value pairs, Big Table, SQL engines. Ready to go?
Is there a need to process data as soon as it arrives?
Maybe streams are needed. Streams are like pipes!!
Apache Kafka
Apache Storm
Apache Flink
MapR Streams
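The "streams are like pipes" idea can be sketched with Python generators, where each stage handles an event as soon as it arrives instead of waiting for a full batch. This is only an analogy in plain Python, not the API of any of the systems listed above; the event shape and the threshold are hypothetical.

```python
def source(events):
    """Emit events one at a time, as a stream would deliver them."""
    for event in events:
        yield event

def only_large(stream, min_amount):
    """Pass through only the events at or above the given amount."""
    for event in stream:
        if event["amount"] >= min_amount:
            yield event

def running_total(stream):
    """Emit an updated total after every event, without waiting for the end."""
    total = 0
    for event in stream:
        total += event["amount"]
        yield total

events = [{"amount": 5}, {"amount": 120}, {"amount": 40}, {"amount": 300}]
totals = list(running_total(only_large(source(events), min_amount=100)))
print(totals)
```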
AI, GRAPH, ...
Need to represent data as a graph.
Apache Giraph
Machine learning.
Apache Mahout
Platform
Purchased 1000 nodes.
Have to connect several pieces of software to make sense of the data.
IT needs a standard platform that runs day after day.
Development and business need a continuous flow of new tools and new software.
Security and fraud detection keep changing day by day.
What to do? Do I need virtualization software?
Virtualization
Go for existing virtualization techniques? Are they expensive?
How about Linux Containers?
How about scheduling Containers? Do we need scheduling software?
Apache Mesos
Kubernetes
How do I provision storage for containers?
Carve out a disk independently for each container?
Is there a way to plug in storage from any node in the cluster to a container running on any node?
Performance and Security Problems
A 1000-node cluster is not performing well.
Back to the big data problem again.
Wade through the logs of 1000 nodes to identify the issue?
Security.
Is data access kept confidential?
Authentication and authorization are a must. Are they the same across all the software?
Is data encrypted on the wire?
DoS problems.
Multi Tenancy
Have 1000 tenants working on a 1000-node cluster.
How to provision storage, compute and network?
Is this going to be like the Amazon cloud? Does each enterprise have the scale and capacity to develop Amazon-like cloud software?
Is there a way for tenants to share data?
Hot and Cold Data
As time moves forward, data can become cold.
A need may arise to keep hot data on solid state drives.
How to retain cold data?
Move it to the cloud.
Does this need yet another piece of software?
Is there a way to see the attributes of data that has moved to the cloud? Let’s say the file is /A/B/C. Can one see when C was last modified while the data stays in the cloud?
Is there a way to dynamically move data between solid state drives and hard disk drives?
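A data-temperature policy can be as simple as mapping the time since last access to a storage tier. The sketch below is one possible rule; the 7-day and 90-day cut-offs and the tier names are hypothetical and would be tuned per workload.

```python
def choose_tier(days_since_access, hot_days=7, warm_days=90):
    """Map data age to a storage tier: SSD for hot, HDD for warm, cloud for cold."""
    if days_since_access <= hot_days:
        return "ssd"
    if days_since_access <= warm_days:
        return "hdd"
    return "cloud"

for age in (1, 30, 365):
    print(age, "days ->", choose_tier(age))
```

The harder questions raised on the slide are operational: moving the data automatically, and keeping its path and attributes visible after the move.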
Reliability: Does it mean 3-way replication?
By and large, data reliability means 3-way replication.
3-way replication of petabytes of data wastes storage.
How to eliminate the waste?
A platform should try to store data in erasure-coded format (probably 1.5x).
Yet even when data is stored in erasure-coded format, the platform should allow it to be modified if the need arises.
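The storage saving can be checked with a short calculation. For erasure coding with k data fragments and m parity fragments, the raw-to-logical ratio is (k + m) / k. The 4 + 2 layout below is an assumption, chosen because it gives the 1.5x figure mentioned above; the 10 PB data size is also illustrative.

```python
def replication_overhead(copies):
    """Raw bytes stored per logical byte with full replication."""
    return float(copies)

def erasure_overhead(data_fragments, parity_fragments):
    """Raw bytes stored per logical byte with erasure coding."""
    return (data_fragments + parity_fragments) / data_fragments

logical_pb = 10  # hypothetical amount of user data
raw_replicated = logical_pb * replication_overhead(3)
raw_erasure = logical_pb * erasure_overhead(4, 2)  # assumed 4 + 2 layout

print(f"3-way replication: {raw_replicated:.0f} PB raw")
print(f"4+2 erasure coding: {raw_erasure:.0f} PB raw")
```

Under these assumptions erasure coding halves the raw capacity needed, while a 4 + 2 layout still tolerates the loss of any two fragments.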
IoT Devices : Edge Clusters
IoT devices generate a lot of data.
Each IoT device's data has to be processed and stored with high reliability to meet government regulations.
IoT devices have to process data.
We know a single machine is limited in how much data it can process, by virtue of its CPUs, memory and hard disks.
A single machine also poses data reliability problems if the drive or CPU goes bad.
Does this call for a cluster near the IoT devices? How can we do that? A NUC (Next Unit of Computing) cluster may be the answer!!
IoT Edge Clusters
Process data and push it to the centralized cluster.
Access data in the centralized cluster and the local cluster when the need arises.
Unified global namespace access is a must.
Ability to stream data from the edge cluster to the centralized cluster.
Edge cluster applications may not be sophisticated. They may have to write data with standard file system calls.
Can the software platform we choose provide edge cluster processing?
Application Data Access Model
Table Format.
Big data files. Hadoop files (write once, read many) or MapR files (readable and writable).
Object Store.
Flat namespace.
Data is accessed as objects with strict SLAs.
Used to store videos, images, etc.
Converged Data Platform
Needed as a big data store.
Ability to support unstructured key-value pairs.
Ability to query data with SQL engines like Drill, Hive, etc.
Ability to support real-time streaming of data.
Ability to support container virtualization.
Ability to support applications accessing data through objects.
Ability to support global namespace for IoT Edge Clusters.
Converged Data Platform Continued ...
Ability to support multi-tenancy.
Ability to ensure security across several users and tenants.
Ability to provision CPU, storage and network across tenants or users.
Ability to support different temperatures of the data.
Ability to move data between cloud and the cluster.
Innovation
Is innovation a function of knowledge?
Isn’t knowledge a function of time?
What promotes innovation?
Salary?
Stock?
Recognition?
Peer competition?
Innovation Continued ...
Innovation needs an innocent mind.
How can one be innocent in this world?
Is there a way the mind can be made innocent?
Recognizing innovation is innovation.
Questions
I may not be able to answer all your questions!!
We can investigate the questions together!! Not alone.
THANK YOU
You can reach me
Email: uvsaradhi at gmail dot com.
