By: Khalid Imran
Big Data :
Technology Stack
Agenda
▪ Big Data Stack : In a Nutshell
▪ Data Layer
▪ Data Processing Layer
▪ Data Ingestion Layer
▪ Data Presentation Layer
▪ Operations & Scheduling Layer
▪ Security & Governance
Big Data Technology Stack : In a nutshell
Data Layer
Hadoop Distributed File System (HDFS)
HDFS is a scalable, fault-tolerant, Java-based distributed file system used for storing
large volumes of data on inexpensive commodity hardware.
Amazon Simple Storage Service (S3)
S3 is a scalable, cloud-based object storage service from Amazon. It can serve as the
data layer in big data applications when coupled with the other required components.
IBM General Parallel File System (GPFS) / Spectrum Scale
GPFS is a high-performance clustered file system developed by IBM.
Data Processing Layer
Hadoop MapReduce
Hadoop MapReduce is a software framework for distributed processing of large data sets on
compute clusters of commodity hardware. It is a sub-project of the Apache Hadoop project. The
framework takes care of scheduling tasks, monitoring them and re-executing any failed tasks. A
MapReduce job usually splits the input data set into independent chunks, which the map tasks
process in parallel. The framework sorts the outputs of the maps, which then become the input
to the reduce tasks. Typically both the input and the output of the job are stored in a
file system.
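The split, map, sort and reduce flow described above can be sketched in plain Python. This is only a single-process illustration of the idea, not Hadoop's API: the function names (`map_phase`, `shuffle_and_sort`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_and_sort(mapped_pairs):
    """Group values by key and sort by key, mimicking the step between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce task: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk is mapped independently; a real cluster would run these in parallel.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return [reduce_phase(key, values) for key, values in shuffle_and_sort(mapped)]


if __name__ == "__main__":
    print(word_count(["big data stack", "data layer", "big data"]))
```

Because each chunk is mapped without reference to any other chunk, the map calls could be distributed across machines; only the grouping step needs to see all intermediate pairs.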
Apache Pig
Pig is a high-level platform for creating MapReduce programs used with Hadoop. It lets
Hadoop users write complex MapReduce transformations in a simple scripting language called
Pig Latin. Pig translates the Pig Latin script into MapReduce jobs so that it can be executed
on the data. Pig Latin can be extended with user-defined functions (UDFs), which the user can
write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.
Apache Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for data
summarization, query, and analysis. Hive is used to explore, structure and analyze data, then turn
it into actionable business insight. It supports analysis of large datasets stored in
Hadoop's HDFS and in compatible storage systems such as Amazon S3. It provides an SQL-like
language called HiveQL with schema on read, and transparently converts queries into MapReduce,
Apache Tez or Spark jobs.
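Schema on read means the stored data stays as raw text and a schema is applied only when a query reads it. The sketch below illustrates that idea in plain Python; it does not use Hive. The schema, the sample data and the choice to read unparseable values as `None` are all assumptions made for the example.

```python
import csv
import io

# Hypothetical schema: (column name, Python type) pairs applied at read time.
SCHEMA = [("user_id", int), ("country", str), ("amount", float)]

# Raw delimited text as it might sit in storage; note the malformed third row.
RAW = "1,IN,10.5\n2,DE,4.0\n3,IN,oops\n4,IN,2.5\n"


def read_with_schema(raw_text, schema):
    """Interpret raw text through the schema; the stored text is never modified."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {name: None for name, _ in schema}
        for (name, cast), value in zip(schema, fields):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None  # assumption: a value that does not fit reads as None
        rows.append(row)
    return rows


def total_by_country(rows):
    """Rough equivalent of a query that sums amount grouped by country."""
    totals = {}
    for row in rows:
        if row["amount"] is not None:
            totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals


if __name__ == "__main__":
    print(total_by_country(read_with_schema(RAW, SCHEMA)))
```

The same raw text could be read again with a different schema without rewriting it, which is the practical difference from a schema-on-write warehouse.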
Apache HBase
HBase is an open-source NoSQL database that provides real-time read/write access to large
datasets with low latency and fault tolerance. HBase runs on top of HDFS. It provides a
strong consistency model and range-based partitioning. Reads, including range-based
reads, tend to scale much better on HBase, whereas writes do not scale as well as they do on
Cassandra.
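Range-based partitioning can be illustrated with a small in-memory table that routes each row key to a partition by comparing it with sorted split points. This is a conceptual sketch only, not the HBase client API; the class `RangePartitionedTable` and its methods are hypothetical names.

```python
from bisect import bisect_right


class RangePartitionedTable:
    """Toy table whose rows are spread over key ranges (one dict per range)."""

    def __init__(self, split_keys):
        # N sorted split keys define N + 1 contiguous key ranges.
        self.split_keys = sorted(split_keys)
        self.regions = [dict() for _ in range(len(self.split_keys) + 1)]

    def region_for(self, row_key):
        """Index of the range holding row_key (each range includes its start key)."""
        return bisect_right(self.split_keys, row_key)

    def put(self, row_key, value):
        self.regions[self.region_for(row_key)][row_key] = value

    def scan(self, start_key, stop_key):
        """Return rows with start_key <= key < stop_key, in key order."""
        first = self.region_for(start_key)
        last = self.region_for(stop_key)
        result = []
        # Only the ranges that can overlap [start_key, stop_key) are visited.
        for region in self.regions[first:last + 1]:
            for key in sorted(region):
                if start_key <= key < stop_key:
                    result.append((key, region[key]))
        return result


if __name__ == "__main__":
    table = RangePartitionedTable(["g", "n"])
    for fruit in ["pear", "apple", "melon", "kiwi"]:
        table.put(fruit, len(fruit))
    print(table.scan("k", "n"))
```

Because neighbouring keys live in the same range, a range read touches only a few partitions, which is one reason range scans suit this layout.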
Apache Cassandra
Cassandra is another open-source distributed NoSQL database. It is highly scalable and
fault-tolerant, and can be used to manage huge volumes of data. Cassandra's consistency model
is based on Amazon's Dynamo: it provides eventual consistency, with the consistency level
tunable per operation. This is appealing for applications that need writes to stay available.
For the same reason, Cassandra tends to provide very good write scaling.
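Eventual consistency can be pictured as replicas that accept writes independently and later reconcile. The sketch below is a deliberately simplified model, not Cassandra's implementation: it assumes two in-memory replicas and a last-write-wins rule keyed on a caller-supplied timestamp, and the names `Replica` and `sync` are hypothetical.

```python
class Replica:
    """Toy replica that stores key -> (timestamp, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, timestamp):
        # Last-write-wins: keep the entry only if it is newer than what we hold.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(a, b):
    """Exchange entries in both directions so each replica keeps the newest version."""
    for source, target in ((a, b), (b, a)):
        for key, (timestamp, value) in list(source.data.items()):
            target.write(key, value, timestamp)


if __name__ == "__main__":
    r1, r2 = Replica(), Replica()
    r1.write("cart", ["book"], timestamp=1)         # accepted by r1 only
    r2.write("cart", ["book", "pen"], timestamp=2)  # accepted by r2 only
    print(r1.read("cart"), r2.read("cart"))         # replicas disagree for now
    sync(r1, r2)
    print(r1.read("cart"), r2.read("cart"))         # replicas have converged
```

Both writes succeed even though the replicas have not spoken to each other, which is the availability benefit; the cost is that a read between the write and the sync may return stale data.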
Apache Storm
Storm is a distributed real-time computation system for processing large volumes of high-velocity
data. Storm makes it easy to reliably process unbounded streams of data, doing for real-time
processing what Hadoop did for batch processing.
Apache Solr
Apache Solr is an open-source search platform that can index and search data stored in HDFS.
Solr powers the search and navigation features of many of the world’s largest Internet sites,
enabling powerful full-text search and near real-time indexing. It can be used for rapidly
finding tabular, text, geo-location or sensor data that is stored in Hadoop.
Apache Spark
Apache Spark is an open-source cluster computing framework for large-scale data processing.
Figures published by the Spark project claim that programs can run up to 100x faster than
Hadoop MapReduce in memory, or 10x faster on disk. The speed-up comes largely from keeping
intermediate data in memory rather than writing it to disk between steps. Spark can run on top
of an existing Hadoop cluster and access the Hadoop data store (HDFS). It can also process
structured data from Hive and streaming data from HDFS, Flume, Kafka, Twitter and other sources.
Apache Mahout
Apache Mahout is a library of scalable machine-learning algorithms that runs on top of
Apache Hadoop and uses the MapReduce paradigm. Machine learning is a discipline of
artificial intelligence focused on enabling machines to learn without being explicitly programmed,
and it is commonly used to improve future performance based on previous outcomes. Mahout
provides the tools and algorithms to automatically find meaningful patterns in big data sets
stored in HDFS.
Data Ingestion Layer
Apache Flume
Apache Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of streaming data (e.g. application logs, sensor and
machine data, geo-location data and social media) into HDFS. It has a simple and flexible
architecture based on streaming data flows, and it is robust and fault-tolerant, with
configurable reliability mechanisms for failover and recovery.
Apache Kafka
Kafka is a high-throughput distributed messaging system. Kafka maintains feeds of messages
in categories called topics. Producers are processes that publish messages to a Kafka topic.
Consumers are processes that subscribe to topics and process the feed of published
messages. Kafka runs as a cluster of one or more servers, each of which is called
a broker.
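The relationship between topics, producers and consumers can be sketched with a single in-memory broker. This is a conceptual model only and does not use any Kafka client library: the `Broker` class, its `publish` and `poll` methods, and the per-consumer offset bookkeeping are assumptions made for illustration.

```python
from collections import defaultdict


class Broker:
    """Toy broker: each topic is an append-only list of messages."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> ordered message log
        self.offsets = defaultdict(int)   # (consumer, topic) -> next position to read

    def publish(self, topic, message):
        """Producer side: append a message and return its position in the log."""
        self.topics[topic].append(message)
        return len(self.topics[topic]) - 1

    def poll(self, consumer, topic):
        """Consumer side: return messages this consumer has not yet seen."""
        log = self.topics[topic]
        start = self.offsets[(consumer, topic)]
        self.offsets[(consumer, topic)] = len(log)
        return log[start:]


if __name__ == "__main__":
    broker = Broker()
    broker.publish("clicks", "home")
    broker.publish("clicks", "cart")
    print(broker.poll("analytics", "clicks"))  # both messages
    broker.publish("clicks", "checkout")
    print(broker.poll("analytics", "clicks"))  # only the new message
    print(broker.poll("audit", "clicks"))      # a new consumer reads from the start
```

Messages are not removed when read; each consumer simply tracks how far it has progressed, so several consumers can process the same feed independently.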
Apache Sqoop
Apache Sqoop is a tool designed to transfer data between Hadoop and relational databases or
mainframes. Sqoop can be used to import data from an RDBMS or a mainframe into HDFS,
transform the data using Hadoop MapReduce, and then export the data back into an RDBMS.
Data Presentation Layer
Kibana
Kibana is an analytics and visualization plugin that works with Elasticsearch. It provides real-time
summary and charting of streaming data. Its visualization capabilities allow users to create
different charts, plots and maps of large volumes of data.
Operations & Scheduling Layer
Ambari
Ambari is an open framework for provisioning, managing and monitoring Apache
Hadoop clusters. It simplifies the deployment and maintenance of hosts. Ambari also includes an
intuitive web interface for provisioning, configuring and testing all the Hadoop
services and core components. It also comes with the Ambari Blueprints API, which can be
used to automate cluster installations without any user intervention.
Apache Oozie
Apache Oozie provides operational services for a Hadoop cluster, specifically
job scheduling within the cluster. Oozie is a Java-based web application that is primarily used to
schedule Apache Hadoop jobs. Oozie can combine multiple jobs sequentially into one logical unit
of work. It integrates with the Hadoop stack and supports Hadoop jobs for various Apache
tools such as MapReduce, Apache Pig, Apache Hive, and Apache Sqoop.
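Combining jobs sequentially into one logical unit of work can be sketched as a runner that executes named steps in order and stops at the first failure. This is a plain-Python illustration, not an Oozie workflow definition; `run_workflow`, the step names and the status strings are hypothetical choices for the example.

```python
def run_workflow(actions):
    """Run (name, callable) steps in order; stop and report at the first failure."""
    completed = []
    for name, action in actions:
        try:
            action()
        except Exception as exc:
            # Later steps are skipped, so the workflow fails as a single unit.
            return {"status": "FAILED", "completed": completed,
                    "failed": name, "error": str(exc)}
        completed.append(name)
    return {"status": "SUCCEEDED", "completed": completed,
            "failed": None, "error": None}


if __name__ == "__main__":
    log = []

    def broken_step():
        raise RuntimeError("query failed")

    ok = run_workflow([
        ("import", lambda: log.append("imported")),
        ("transform", lambda: log.append("transformed")),
        ("export", lambda: log.append("exported")),
    ])
    print(ok["status"], ok["completed"])

    bad = run_workflow([
        ("import", lambda: log.append("imported")),
        ("transform", broken_step),
        ("export", lambda: log.append("exported")),
    ])
    print(bad["status"], bad["failed"], bad["completed"])
```

Treating the chain as one unit means a caller only has to check a single status rather than track each job separately.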
Apache ZooKeeper
Apache ZooKeeper provides operational services for a Hadoop cluster. It offers a distributed
configuration service, a synchronization service and a naming registry. Distributed systems
can use ZooKeeper to store important configuration information and mediate updates to it.
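Mediating updates to shared configuration can be illustrated with a versioned store in which a write succeeds only if the writer saw the latest version. The sketch below is an in-memory model of that idea, not the ZooKeeper client API; the `ConfigStore` class, its method names and the example path are assumptions.

```python
class ConfigStore:
    """Toy configuration registry: path -> (value, version)."""

    def __init__(self):
        self._nodes = {}

    def create(self, path, value):
        if path in self._nodes:
            raise KeyError(f"{path} already exists")
        self._nodes[path] = (value, 0)

    def get(self, path):
        """Return (value, version) so the caller can attempt a conditional update."""
        return self._nodes[path]

    def set(self, path, value, expected_version):
        """Update only if nobody else has changed the node since it was read."""
        _, version = self._nodes[path]
        if version != expected_version:
            return False  # stale writer must re-read and retry
        self._nodes[path] = (value, version + 1)
        return True


if __name__ == "__main__":
    store = ConfigStore()
    store.create("/app/batch_size", 100)

    # Two clients read the same version, then both try to update.
    _, seen_by_a = store.get("/app/batch_size")
    _, seen_by_b = store.get("/app/batch_size")
    print(store.set("/app/batch_size", 200, seen_by_a))  # first writer wins
    print(store.set("/app/batch_size", 300, seen_by_b))  # second writer is rejected
    print(store.get("/app/batch_size"))
```

The version check prevents two clients from silently overwriting each other's changes, which is the kind of mediation a shared configuration service needs.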
About Me – Khalid Imran
A tester by passion, I’ve spent the past 16+ years testing disparate systems, learning new domains, developing
innovative solutions, designing test strategies, challenging conventional methods, proving new techniques and
embracing emerging tools & technologies. The breadth and depth of my experience cuts across functional testing,
non-functional testing, manual, automation, test and project execution methodologies, licensed and open stack
tools, platforms, devices, programming languages, custom-built test harnesses and utilities, delivery management,
client engagement, on-site, off-shore team dynamics and more.
I am currently heading the 1400+ strong QA testing practice at Cybage as a QA Evangelist. I manage the Testing
Centre of Excellence (TCoE), lead a team of architects and specialists and assist in deliveries across the organization,
pre-sales and business development, solutioning and consultancy, training and process improvement group. I hold
multiple certifications namely: CSQA, CSM and CPISI.
I welcome any questions or feedback you may have on this presentation.
