Foxvalley bigdata

Tom Rogers, Systems Analyst/Programmer at Northwestern University
BIG DATA
AND THE
HADOOP ECOSYSTEM
TOM ROGERS
NORTHWESTERN UNIVERSITY
FEINBERG SCHOOL OF MEDICINE
DEPARTMENT OF ANESTHESIOLOGY
WHAT IS BIG DATA?
The 3 V’s:
• Volume – Terabytes, Petabytes, Exabytes
• Velocity – System Logs, Medical Monitors, Machinery Controls
• Variety – RDBMS, Social Media, XML, JSON, Documents, IoT
Plus: Veracity, Variability and Value
How do we collect, store and process all this data?
• Open Source Apache Software.
• Distributed processing across clusters of computers.
• Designed to scale to thousands of computers.
• Local computation and storage.
• Expects hardware failure which is handled at the application layer.
A cute yellow elephant
HADOOP ECOSYSTEM OVERVIEW
• Distributed storage and processing.
• Runs on commodity server hardware.
• Scales horizontally for seamless failover.
• Hadoop is open source software.
TRADITIONAL DATA REPOSITORIES
• Very structured in 3NF or Star topologies.
• Is the enterprise “Single Source of Truth”.
• Optimized for operations reporting requirements.
• Scales vertically.
• Limited interaction with external or unstructured data sources.
• Complex management schemes and protocols.
TRADITIONAL DATA SOURCES IN HEALTHCARE
• Data for the Healthcare EDW originates from the functional clinical and administrative responsibilities.
• Sources can be as sophisticated as highly complex on-line systems or as simple as Excel spreadsheets.
• Complex validation and transformation processes occur before inclusion into the EDW.
• Staging of the data transformation requires separate storage and processing space, but is oftentimes done on the same physical hardware as the EDW.
INTEGRATION OF HADOOP AND TRADITIONAL IT
• Hadoop does not replace traditional storage or processing technologies.
• Hadoop can include data from traditional IT sources to discover new value.
• Compared to traditional IT, setting up and operating a Hadoop platform can be very inexpensive.
• Can be seen as very expensive when adding to existing traditional IT environments.
EMERGING AND NON-TRADITIONAL DATA
• New knowledge is discovered by applying known experience in context with unknown or new experience.
• New sources of data are being created in a seemingly unending manner.
• Social media and mobile computing provide sources of new data unavailable in the past.
• Monitors, system logs, and document corpora all provide new ways of capturing and expressing the human experience that cannot be captured or analyzed by traditional IT methodologies.
INTEGRATION OF HADOOP AND NON-TRADITIONAL DATA
• Hadoop is designed to store and process non-traditional data sets.
• Optimized for unstructured, file-based data sources.
• Core applications developed specifically for different storage, processing, analysis and display activities.
• Development of metadata definitions and rules, combined with data from disparate data sources, can be used for deeper analytic discovery.
DATA ANALYSIS
• Inspecting, transforming and modeling data to discover knowledge, make predictions and suggest conclusions.
• 3rd-party data analysis can be integrated into traditional IT environments or big data solutions.
• Traditionally conducted by working on discrete data sets in isolation from the decision-making process.
• Data scientists are integrated into core business processes to create solutions for critical business problems using big data platforms.
COMPLETE HADOOP ECOSYSTEM
• Integration between traditional and non-traditional data is facilitated by the Hadoop ecosystem.
• Data is stored on a fault-tolerant distributed file system in the Hadoop cluster.
• Data is processed close to where the data is located to reduce latency and time-consuming transfer processes.
• The Hadoop Master controller or “NameNode” monitors the processes of the Hadoop cluster and automatically executes actions to continue processing when failure is detected.
HADOOP CORE COMPONENTS
• Storage: HDFS, Hive, HBase
• Management: ZooKeeper, Avro, Oozie, Whirr
• Processing: MapReduce, Spark
• Integration: Sqoop, Flume
• Programming: Pig, HiveQL, Jaql
• Insight: Mahout, Hue, Beeswax
CORE COMPONENT - STORAGE
• HDFS – A distributed file system designed to run on commodity-grade hardware in the Hadoop computing ecosystem. This file system is highly fault tolerant, provides very high throughput access to data, and is suitable for very large data sets. Fault tolerance is enabled by making redundant copies of data blocks and distributing them throughout the Hadoop cluster.
• Key Characteristics Include:
• Streaming data access – Designed for batch processing instead of interactive use.
• Large data sets – Typically in gigabytes to terabytes in size.
• Single Coherency Model - To enable high throughput access.
• Moving computational process is cheaper than moving data.
• Designed to be easily portable.
• Hive – A data warehouse implementation in Hadoop that facilitates the query and management of large datasets kept in distributed storage.
• Key Features:
• Tools for ETL
• A methodology for providing structure for multiple data formats.
• Access to files stored in HDFS or HBase.
• Executes queries via the MapReduce application.
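The redundant-copy scheme behind HDFS’s fault tolerance can be sketched in a few lines of Python. This is a toy round-robin placement over invented node names; real HDFS uses a rack-aware placement policy.

```python
import math

def place_blocks(file_size_mb, block_size_mb=128, replication=3,
                 datanodes=("dn1", "dn2", "dn3", "dn4")):
    """Split a file into fixed-size blocks and assign each block's
    replicas to distinct DataNodes, round-robin style."""
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for b in range(n_blocks):
        # Each replica of block b lands on a different node.
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

# A 300 MB file becomes 3 blocks of up to 128 MB,
# each stored on 3 of the 4 nodes.
layout = place_blocks(300)
```

Losing any single node still leaves two replicas of every block, which is why the application layer, not the hardware, can absorb failure.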
CORE COMPONENT – STORAGE …
• HBase – A distributed, scalable big data database providing random, real-time read/write access to big data.
• Key Features:
• Modular scalability.
• Strictly consistent reads and writes.
• Automatic sharding of tables (partitioning tables into smaller, more manageable parts).
• Automatic failover.
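Automatic sharding can be illustrated with a toy sketch: HBase keeps rows sorted by key and splits a table into contiguous regions. The splitting rule below (a fixed maximum region size) and the row keys are simplifications invented for illustration; real HBase splits regions by byte size, not row count.

```python
def shard_table(row_keys, max_region_size=3):
    """Sort row keys and split them into contiguous 'regions',
    mimicking how HBase splits a table once a region grows too large."""
    ordered = sorted(row_keys)
    regions = [ordered[i:i + max_region_size]
               for i in range(0, len(ordered), max_region_size)]
    # Each region is described by its start and end row key.
    return [(r[0], r[-1]) for r in regions]

boundaries = shard_table(["row7", "row1", "row4", "row2",
                          "row9", "row5", "row3"])
```

Because regions cover disjoint key ranges, each can be served by a different server, which is where the modular scalability comes from.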
CORE COMPONENT - MANAGEMENT
• ZooKeeper – A centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services.
• Avro – A data serialization system.
• Oozie – A Hadoop workflow scheduler.
• Whirr – A cloud-neutral library for running cloud services.
CORE COMPONENT - PROCESSING
• MapReduce – An implementation for processing and generating large data sets with a parallel, distributed
algorithm on a Hadoop cluster.
• Key Features:
• Automatic parallelization and distribution
• Fault-tolerance
• I/O Scheduling
• Status Monitoring
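The MapReduce model is often illustrated with word counting. Below is a minimal in-memory Python sketch of the map, shuffle, and reduce phases; the real framework runs each phase distributed across the cluster, with the parallelization, fault tolerance and scheduling listed above handled for you.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does
    # between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate each key's values into a final count.
    return {key: sum(values) for key, values in grouped.items()}

splits = ["big data and the hadoop ecosystem", "hadoop stores big data"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(s)
                                                  for s in splits)))
```

Each input split can be mapped on the node that stores it, which is the “move computation, not data” idea from the HDFS section.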
CORE COMPONENT - INTEGRATION
• Sqoop – a utility designed to efficiently transfer bulk data between Hadoop and relational databases.
• Flume – A service, based on streaming data flows, for collecting, aggregating and moving large amounts of system log data.
CORE COMPONENT – PROGRAMMING
• Pig – A high-level language for analyzing very large data sets, designed to efficiently utilize parallel processing to achieve its results.
• Key Properties:
• Ease of programming – Complex tasks are explicitly encoded as data flow sequences, making them easy to understand and implement.
• Significant optimization opportunities – The system optimizes execution automatically.
• Extensibility – Users can encode their own functions.
• HiveQL – A SQL-like query language for data stored in Hive tables, which converts queries into MapReduce jobs.
• Jaql – A data processing and query language used to process JSON on Hadoop.
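To see how a declarative HiveQL query becomes batch aggregation: a query such as `SELECT dept, COUNT(*) FROM staff GROUP BY dept` compiles to a MapReduce job that maps each row to its department and reduces by counting. The equivalent logic in plain Python, over a made-up `staff` table:

```python
from collections import Counter

# Hypothetical rows of a Hive table named `staff`.
staff = [
    {"name": "Ada",  "dept": "anesthesiology"},
    {"name": "Ben",  "dept": "radiology"},
    {"name": "Cara", "dept": "anesthesiology"},
]

# What the GROUP BY / COUNT(*) boils down to once the
# generated MapReduce job has grouped rows by key.
dept_counts = Counter(row["dept"] for row in staff)
```

The point of HiveQL is that the analyst writes only the SQL-like statement; Hive generates and schedules the underlying job.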
CORE COMPONENT - INSIGHT
• Mahout – A library of callable machine learning algorithms which uses the MapReduce paradigm.
• Supports four main data use cases:
• Collaborative filtering – analyzes behavior and makes recommendations.
• Clustering – organizes data into naturally occurring groups.
• Classification – learns from known characteristics of existing categorizations and assigns unclassified items to a category.
• Frequent item or market basket mining – analyzes data items in transactions and identifies items which typically occur together.
• Hue – A set of web applications that enables a user to interact with a Hadoop cluster. Also lets the user browse and interact with Hive, Impala, MapReduce jobs and Oozie workflows.
• Beeswax – An application which allows the user to perform queries on the Hive data warehousing application. You can create Hive tables, load data, run queries and download results in Excel or CSV format.
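Frequent-itemset (“market basket”) mining, one of the Mahout use cases above, reduces to counting item co-occurrences. A minimal in-memory sketch with invented baskets; Mahout itself runs this kind of computation at scale via MapReduce.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support=2):
    """Count how often each pair of items appears together and keep
    the pairs seen at least `min_support` times."""
    pair_counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

baskets = [
    ["gauze", "gloves", "syringe"],
    ["gloves", "syringe"],
    ["gauze", "gloves"],
]
pairs = frequent_pairs(baskets)
```

Pairs that clear the support threshold are the “items which typically occur together” that the slide describes.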
HADOOP DISTRIBUTIONS
Amazon Web Services Elastic MapReduce
•One of the first commercial Hadoop offerings.
•Has the largest commercial Hadoop market share.
•Includes strong integration with other AWS cloud products.
•Auto scaling and support for NoSQL and BI integration.
Cloudera
•2nd largest commercial market share.
•Experience with very large deployments.
•Revenue model based on software subscriptions.
•Aggressive innovation to meet customer demands.
HortonWorks
•Strong engineering partnerships with flagship companies.
•Innovation driven through the open source community.
•Is a key contributor to the Hadoop core project.
•Commits corporate resources to jump start Hadoop community projects.
HADOOP DISTRIBUTIONS …
International Business Machines
•Vast experience in distributed computing and data management.
•Experience with very large deployments.
•Has advanced analytic tools, and global recognition.
•Integration with vast array of IBM management and productivity software.
MapR Technologies
•Early adopter with a heavy focus on enterprise features.
•Supports some legacy file systems such as NFS.
•Adding performance enhancements for HBase, high-availability and disaster recovery.
Pivotal
•Spin off from EMC and VMWare.
•Strong cadre of technical consultants and data scientists.
•Focus on MPP SQL engine and EDW with very high performance.
•Has an appliance with integrated Hadoop, EDW and data management in a single rack.
HADOOP DISTRIBUTIONS …
Teradata
• Specialist and strong background in EDW.
• Has a strong technical partnership with HortonWorks.
• Has very strong integration between Hadoop and Teradata’s management and EDW tools.
• Extensive financial and technical resources allow creation of unique and powerful appliances.
Microsoft Windows Azure HDInsight
• A product designed specifically for the cloud in partnership with HortonWorks.
• The only Hadoop distribution that runs in the Windows environment.
• Allows SQL Server users to also execute queries that include data stored in Hadoop.
• Unique marketing advantage for offering the Hadoop stack to traditional Windows customers.
RECOMMENDATION
• Commitment and leadership in the open source community.
• Strong engineering partnerships.
• Innovation driven from the community.
• Innovative and secure.
• Big Data/Health research collaboration.
CLUSTER DIAGRAM
• NameNode is a single master server which manages the file system and file system operations.
• DataNodes are slave servers that manage the data and the storage attached to the data.
• NameNode is a single point of failure for the HDFS cluster.
• A SecondaryNameNode can be configured on a separate server in the cluster, which creates checkpoints for the namespace.
• SecondaryNameNode is not a failover NameNode.
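The SecondaryNameNode’s checkpointing can be sketched as merging the stored namespace image with the accumulated edit log. This toy model represents the namespace as a dict of paths and uses made-up operations; the real fsimage and edit-log formats are binary and far richer.

```python
def checkpoint(fsimage, edit_log):
    """Merge the on-disk namespace image with the edit log to produce
    a new checkpoint image, roughly what the SecondaryNameNode does."""
    image = dict(fsimage)
    for op, path in edit_log:
        if op == "create":
            image[path] = {}          # new empty file entry (metadata only)
        elif op == "delete":
            image.pop(path, None)
    # In real HDFS the merged image is shipped back to the NameNode
    # and the edit log is truncated.
    return image, []

image, log = checkpoint({"/data/a.csv": {}},
                        [("create", "/data/b.csv"),
                         ("delete", "/data/a.csv")])
```

This is why the SecondaryNameNode shortens NameNode restart time but cannot serve requests itself: it holds a merged image, not the live namespace.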
CLUSTER HARDWARE CONFIGURATION AND COST
Factor/Specification               | Option 1                          | Option 2
Replication Factor                 | 3                                 | 3
Size of Data to Move               | 500 TB                            | 500 TB
Workspace Factor                   | 1.25                              | 1.25
Compression                        | 1 (no compression)                | 3
Hadoop Storage Requirement         | 1875 TB                           | 625 TB
Storage Per Node                   | 16 TB                             | 16 TB
Rack Size                          | 42U                               | 42U
Node Unit Cost                     | $4,000                            | $4,000
Rack Unit Cost                     | $1,500                            | $1,500
Node Cost (1 NameNode & DataNodes) | (119 nodes * $4,000) = $476,000   | (41 nodes * $4,000) = $164,000
Rack Cost                          | (3 racks * $1,500) = $4,500       | (1 rack * $1,500) = $1,500
Total Cost                         | $480,500                          | $165,500
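The table’s figures follow from a simple formula: storage = raw data × replication × workspace factor ÷ compression, then nodes = ⌈storage ÷ storage per node⌉ DataNodes plus one NameNode. A sketch that reproduces both options (rack counts taken as given from the table):

```python
import math

def size_cluster(raw_tb, replication=3, workspace=1.25, compression=1,
                 storage_per_node_tb=16, node_cost=4000,
                 racks=1, rack_cost=1500):
    """Reproduce the sizing arithmetic behind the table above."""
    hadoop_tb = raw_tb * replication * workspace / compression
    data_nodes = math.ceil(hadoop_tb / storage_per_node_tb)
    nodes = data_nodes + 1                      # one NameNode
    total = nodes * node_cost + racks * rack_cost
    return hadoop_tb, nodes, total

# Option 1: no compression, 3 racks -> (1875.0, 119, 480500)
opt1 = size_cluster(500, compression=1, racks=3)
# Option 2: 3x compression, 1 rack
opt2 = size_cluster(500, compression=3, racks=1)
```

Note how 3x compression cuts the storage footprint, and therefore the hardware bill, by roughly two thirds.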
HADOOP SANDBOX IN ORACLE VIRTUALBOX
Host Specification
• Windows 10
• Intel® Core™ i7-4770 CPU @ 3.40GHz
• 16GB Installed RAM
• 64-bit OS, x64
• 1.65 TB Storage
VM Specification
• Cloudera Quickstart Sandbox
• Red Hat
• Intel® Core™ i7-4770 CPU @ 3.40GHz
• 10GB Allocated RAM
• 32MB Video Memory
• 64-bit OS
• 64GB Storage
• Shared Clipboard: Bidirectional
• Drag’n’Drop: Bidirectional
CLOUDERA HADOOP DESKTOP & INTERFACE
Opening the Cloudera interface and a view of the CDC “Healthy People 2010” data set that was uploaded to the Red Hat OS.
HUE FILE BROWSER
• Folder List
• File Contents
• Displayed file content is from the Vulnerable Population and Environmental Health data of the “Healthy People 2010” data set.
ADDING DATA TO HIVE
ADDING DATA TO HIVE …
Choosing a delimiter type
Defining columns
ADDING DATA TO HIVE …
• Hive Table List
• Table properties
HIVE QUERY EDITOR
1 von 35

Recomendados

Big Data Architecture Workshop - Vahid Amiri von
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiridatastack
520 views111 Folien
Hadoop - Architectural road map for Hadoop Ecosystem von
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystemnallagangus
2.7K views20 Folien
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit... von
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Innovative Management Services
1.3K views43 Folien
Big data analytics with hadoop volume 2 von
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Imviplav
381 views18 Folien
Hadoop jon von
Hadoop jonHadoop jon
Hadoop jonHumoyun Ahmedov
354 views30 Folien
Hadoop data-lake-white-paper von
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paperSupratim Ray
803 views18 Folien

Más contenido relacionado

Was ist angesagt?

Big data hadoop rdbms von
Big data hadoop rdbmsBig data hadoop rdbms
Big data hadoop rdbmsArjen de Vries
4.5K views45 Folien
Big data & hadoop von
Big data & hadoopBig data & hadoop
Big data & hadoopTejashBansal2
100 views30 Folien
Big Data on the Microsoft Platform von
Big Data on the Microsoft PlatformBig Data on the Microsoft Platform
Big Data on the Microsoft PlatformAndrew Brust
1.4K views28 Folien
SAP HORTONWORKS von
SAP HORTONWORKSSAP HORTONWORKS
SAP HORTONWORKSDouglas Bernardini
1.2K views9 Folien
Big data concepts von
Big data conceptsBig data concepts
Big data conceptsSerkan Özal
14.5K views34 Folien
Introduction to BIg Data and Hadoop von
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopAmir Shaikh
870 views42 Folien

Was ist angesagt?(20)

Big Data on the Microsoft Platform von Andrew Brust
Big Data on the Microsoft PlatformBig Data on the Microsoft Platform
Big Data on the Microsoft Platform
Andrew Brust1.4K views
Big data concepts von Serkan Özal
Big data conceptsBig data concepts
Big data concepts
Serkan Özal14.5K views
Introduction to BIg Data and Hadoop von Amir Shaikh
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
Amir Shaikh870 views
Planing and optimizing data lake architecture von Milos Milovanovic
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architecture
Milos Milovanovic865 views
Oncrawl elasticsearch meetup france #12 von Tanguy MOAL
Oncrawl elasticsearch meetup france #12Oncrawl elasticsearch meetup france #12
Oncrawl elasticsearch meetup france #12
Tanguy MOAL1.2K views
Productionizing Hadoop: 7 Architectural Best Practices von MapR Technologies
Productionizing Hadoop: 7 Architectural Best PracticesProductionizing Hadoop: 7 Architectural Best Practices
Productionizing Hadoop: 7 Architectural Best Practices
MapR Technologies2.4K views
Enrich a 360-degree Customer View with Splunk and Apache Hadoop von Hortonworks
Enrich a 360-degree Customer View with Splunk and Apache HadoopEnrich a 360-degree Customer View with Splunk and Apache Hadoop
Enrich a 360-degree Customer View with Splunk and Apache Hadoop
Hortonworks6.6K views
Big data technologies and Hadoop infrastructure von Roman Nikitchenko
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
Roman Nikitchenko7.4K views
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017 von Lviv Startup Club
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
Lviv Startup Club463 views
Big Data Analytics with Hadoop, MongoDB and SQL Server von Mark Kromer
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
Mark Kromer6.9K views
Hadoop: An Industry Perspective von Cloudera, Inc.
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
Cloudera, Inc.32.4K views
Building a Big Data platform with the Hadoop ecosystem von Gregg Barrett
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
Gregg Barrett4.6K views
Demystify Big Data Breakfast Briefing: Herb Cunitz, Hortonworks von Hortonworks
Demystify Big Data Breakfast Briefing:  Herb Cunitz, HortonworksDemystify Big Data Breakfast Briefing:  Herb Cunitz, Hortonworks
Demystify Big Data Breakfast Briefing: Herb Cunitz, Hortonworks
Hortonworks2K views
Big Data and Hadoop Basics von Sonal Tiwari
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
Sonal Tiwari126 views

Destacado

kenkä portfolio von
kenkä portfoliokenkä portfolio
kenkä portfoliosirpa tuomaala
174 views16 Folien
BA 15 Chapter 16 von
BA 15 Chapter 16BA 15 Chapter 16
BA 15 Chapter 16dpd
881 views42 Folien
Slideshare von
SlideshareSlideshare
SlideshareMsuquilanda12
171 views8 Folien
Catalog 80 99 von
Catalog 80 99Catalog 80 99
Catalog 80 99Lillian Lee
73 views10 Folien
“Disponibilidad de información para el cálculo de los indicadores ODS 4–Educa... von
“Disponibilidad de información para el cálculo de los indicadores ODS 4–Educa...“Disponibilidad de información para el cálculo de los indicadores ODS 4–Educa...
“Disponibilidad de información para el cálculo de los indicadores ODS 4–Educa...Julio Alexander Parra Maldonado
484 views40 Folien
EvansErica_WorkSample3 von
EvansErica_WorkSample3EvansErica_WorkSample3
EvansErica_WorkSample3Erica Evans
133 views3 Folien

Destacado(11)

BA 15 Chapter 16 von dpd
BA 15 Chapter 16BA 15 Chapter 16
BA 15 Chapter 16
dpd881 views
EvansErica_WorkSample3 von Erica Evans
EvansErica_WorkSample3EvansErica_WorkSample3
EvansErica_WorkSample3
Erica Evans133 views
BA 65 Hour 6 ~ Business Operations and Practice von dpd
BA 65 Hour 6 ~ Business Operations and PracticeBA 65 Hour 6 ~ Business Operations and Practice
BA 65 Hour 6 ~ Business Operations and Practice
dpd1.2K views
Chapter 14 - Operations, Quality, and Productivity von dpd
Chapter 14 - Operations, Quality, and ProductivityChapter 14 - Operations, Quality, and Productivity
Chapter 14 - Operations, Quality, and Productivity
dpd6K views
Evangelization[1] von amason04
Evangelization[1]Evangelization[1]
Evangelization[1]
amason04278 views

Similar a Foxvalley bigdata

Simple, Modular and Extensible Big Data Platform Concept von
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSatish Mohan
2K views17 Folien
Fundamentals of big data analytics and Hadoop von
Fundamentals of big data analytics and HadoopFundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and HadoopArchana Gopinath
86 views15 Folien
Big Data and Cloud Computing von
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud ComputingFarzad Nozarian
2.5K views56 Folien
Big data and hadoop von
Big data and hadoopBig data and hadoop
Big data and hadoopPrashanth Yennampelli
460 views18 Folien
Big Data von
Big DataBig Data
Big DataNeha Mehta
7.8K views43 Folien
Big data analysis using hadoop cluster von
Big data analysis using hadoop clusterBig data analysis using hadoop cluster
Big data analysis using hadoop clusterFurqan Haider
240 views21 Folien

Similar a Foxvalley bigdata(20)

Simple, Modular and Extensible Big Data Platform Concept von Satish Mohan
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
Satish Mohan2K views
Fundamentals of big data analytics and Hadoop von Archana Gopinath
Fundamentals of big data analytics and HadoopFundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and Hadoop
Archana Gopinath86 views
Big data analysis using hadoop cluster von Furqan Haider
Big data analysis using hadoop clusterBig data analysis using hadoop cluster
Big data analysis using hadoop cluster
Furqan Haider240 views
VTU 6th Sem Elective CSE - Module 4 cloud computing von Sachin Gowda
VTU 6th Sem Elective CSE - Module 4  cloud computingVTU 6th Sem Elective CSE - Module 4  cloud computing
VTU 6th Sem Elective CSE - Module 4 cloud computing
Sachin Gowda2.4K views
Big Data Open Source Technologies von neeraj rathore
Big Data Open Source TechnologiesBig Data Open Source Technologies
Big Data Open Source Technologies
neeraj rathore210 views
Hadoop and the Data Warehouse: When to Use Which von DataWorks Summit
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which
DataWorks Summit8.4K views
Big Data Analytics with Hadoop von Philippe Julio
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
Philippe Julio441.9K views
What is Hadoop & its Use cases-PromtpCloud von PromptCloud
What is Hadoop & its Use cases-PromtpCloudWhat is Hadoop & its Use cases-PromtpCloud
What is Hadoop & its Use cases-PromtpCloud
PromptCloud968 views
Big Data Practice_Planning_steps_RK von Rajesh Jayarman
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
Rajesh Jayarman137 views
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence... von Perficient, Inc.
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Perficient, Inc.3.6K views
Big data and hadoop overvew von Kunal Khanna
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvew
Kunal Khanna773 views
The Hadoop Ecosystem for Developers von Zohar Elkayam
The Hadoop Ecosystem for DevelopersThe Hadoop Ecosystem for Developers
The Hadoop Ecosystem for Developers
Zohar Elkayam1K views

Último

Employees attrition von
Employees attritionEmployees attrition
Employees attritionMaryAlejandraDiaz
5 views5 Folien
[DSC Europe 23] Aleksandar Tomcic - Adversarial Attacks von
[DSC Europe 23] Aleksandar Tomcic - Adversarial Attacks[DSC Europe 23] Aleksandar Tomcic - Adversarial Attacks
[DSC Europe 23] Aleksandar Tomcic - Adversarial AttacksDataScienceConferenc1
5 views20 Folien
[DSC Europe 23] Ales Gros - Quantum and Today s security with Quantum.pdf von
[DSC Europe 23] Ales Gros - Quantum and Today s security with Quantum.pdf[DSC Europe 23] Ales Gros - Quantum and Today s security with Quantum.pdf
[DSC Europe 23] Ales Gros - Quantum and Today s security with Quantum.pdfDataScienceConferenc1
5 views54 Folien
Data about the sector workshop von
Data about the sector workshopData about the sector workshop
Data about the sector workshopinfo828217
29 views27 Folien
apple.pptx von
apple.pptxapple.pptx
apple.pptxhoneybeeqwe
6 views15 Folien
Dr. Ousmane Badiane-2023 ReSAKSS Conference von
Dr. Ousmane Badiane-2023 ReSAKSS ConferenceDr. Ousmane Badiane-2023 ReSAKSS Conference
Dr. Ousmane Badiane-2023 ReSAKSS ConferenceAKADEMIYA2063
5 views34 Folien

Último(20)

[DSC Europe 23] Ales Gros - Quantum and Today s security with Quantum.pdf von DataScienceConferenc1
[DSC Europe 23] Ales Gros - Quantum and Today s security with Quantum.pdf[DSC Europe 23] Ales Gros - Quantum and Today s security with Quantum.pdf
[DSC Europe 23] Ales Gros - Quantum and Today s security with Quantum.pdf
Data about the sector workshop von info828217
Data about the sector workshopData about the sector workshop
Data about the sector workshop
info82821729 views
Dr. Ousmane Badiane-2023 ReSAKSS Conference von AKADEMIYA2063
Dr. Ousmane Badiane-2023 ReSAKSS ConferenceDr. Ousmane Badiane-2023 ReSAKSS Conference
Dr. Ousmane Badiane-2023 ReSAKSS Conference
AKADEMIYA20635 views
Data Journeys Hard Talk workshop final.pptx von info828217
Data Journeys Hard Talk workshop final.pptxData Journeys Hard Talk workshop final.pptx
Data Journeys Hard Talk workshop final.pptx
info82821711 views
Lack of communication among family.pptx von ahmed164023
Lack of communication among family.pptxLack of communication among family.pptx
Lack of communication among family.pptx
ahmed16402314 views
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation von DataScienceConferenc1
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
Chapter 3b- Process Communication (1) (1)(1) (1).pptx von ayeshabaig2004
Chapter 3b- Process Communication (1) (1)(1) (1).pptxChapter 3b- Process Communication (1) (1)(1) (1).pptx
Chapter 3b- Process Communication (1) (1)(1) (1).pptx
ayeshabaig20048 views
Best Home Security Systems.pptx von mogalang
Best Home Security Systems.pptxBest Home Security Systems.pptx
Best Home Security Systems.pptx
mogalang9 views
[DSC Europe 23] Luca Morena - From Psychohistory to Curious Machines von DataScienceConferenc1
[DSC Europe 23] Luca Morena - From Psychohistory to Curious Machines[DSC Europe 23] Luca Morena - From Psychohistory to Curious Machines
[DSC Europe 23] Luca Morena - From Psychohistory to Curious Machines
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx von DataScienceConferenc1
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx
4_4_WP_4_06_ND_Model.pptx von d6fmc6kwd4
4_4_WP_4_06_ND_Model.pptx4_4_WP_4_06_ND_Model.pptx
4_4_WP_4_06_ND_Model.pptx
d6fmc6kwd47 views
OPPOTUS - Malaysians on Malaysia 3Q2023.pdf von Oppotus
OPPOTUS - Malaysians on Malaysia 3Q2023.pdfOPPOTUS - Malaysians on Malaysia 3Q2023.pdf
OPPOTUS - Malaysians on Malaysia 3Q2023.pdf
Oppotus27 views
CRIJ4385_Death Penalty_F23.pptx von yvettemm100
CRIJ4385_Death Penalty_F23.pptxCRIJ4385_Death Penalty_F23.pptx
CRIJ4385_Death Penalty_F23.pptx
yvettemm1007 views
CRM stick or twist workshop von info828217
CRM stick or twist workshopCRM stick or twist workshop
CRM stick or twist workshop
info82821714 views
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ... von DataScienceConferenc1
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...

Foxvalley bigdata

  • 1. BIG DATA AND THE HADOOP ECOSYSTEM TOM ROGERS NORTHWESTERN UNIVERSITY FEINBERG SCHOOL OF MEDICINE DEPARTMENT OF ANESTHESIOLOGY
  • 2. WHAT IS BIG DATA? The 3 V’s
  • 3. WHAT IS BIG DATA? Volume Terabytes Petabytes Exabytes
  • 4. WHAT IS BIG DATA? Volume Velocity System Logs Medical Monitors Machinery Controls
  • 5. WHAT IS BIG DATA? Volume Velocity Variety Varacity Variability RDBMSSocial MediaXML JSON Documents IoT Value
  • 6. How do we collect, store and process all this data?
  • 7. • Open Source Apache Software. • Distributed processing across clusters of computers. • Designed to scale to thousands of computers. • Local computation and storage. • Expects hardware failure which is handled at the application layer. A cute yellow elephant
  • 8. HADOOP ECOSYSTEM OVERVIEW • Distributed storage and processing. • Runs on commodity server hardware. • Scales horizontally for seamless failover. • Hadoop is open source software.
  • 9. TRADITIONAL DATA REPOSITORIES • Highly structured in 3NF or star schemas. • Serves as the enterprise “Single Source of Truth”. • Optimized for operational reporting requirements. • Scales vertically. • Limited interaction with external or unstructured data sources. • Complex management schemes and protocols.
  • 10. TRADITIONAL DATA SOURCES IN HEALTHCARE • Data for the healthcare EDW originates from clinical and administrative functions. • Sources range from highly complex on-line systems to simple Excel spreadsheets. • Data passes through complex validation and transformation processes before inclusion in the EDW. • Staging the data transformation requires separate storage and processing space, but is often done on the same physical hardware as the EDW.
  • 11. INTEGRATION OF HADOOP AND TRADITIONAL IT • Hadoop does not replace traditional storage or processing technologies. • Hadoop can incorporate data from traditional IT sources to discover new value. • Compared to traditional IT, setting up and operating a Hadoop platform can be very inexpensive. • It can be seen as very expensive when added to existing traditional IT environments.
  • 12. EMERGING AND NON-TRADITIONAL DATA • New knowledge is discovered by applying known experience in context with unknown or new experience. • New sources of data are being created in a seemingly unending manner. • Social media and mobile computing provide sources of new data unavailable in the past. • Monitors, system logs, and document corpora all provide new ways of capturing and expressing the human experience that cannot be captured or analyzed by traditional IT methodologies.
  • 13. INTEGRATION OF HADOOP AND NON-TRADITIONAL DATA • Hadoop is designed to store and process non-traditional data sets. • Optimized for unstructured, file-based data sources. • Core applications developed specifically for different storage, processing, analysis and display activities. • Metadata definitions and rules, combined with data from disparate data sources, can be used for deeper analytic discovery.
  • 14. DATA ANALYSIS • Inspecting, transforming and modeling data to discover knowledge, make predictions and suggest conclusions. • 3rd party data analysis can be integrated into traditional IT environments or big data solutions. • Traditionally conducted by working on discrete data sets in isolation from the decision making process. • Data scientists are integrated into core business processes to create solutions for critical business problems using big data platforms.
  • 15. COMPLETE HADOOP ECOSYSTEM • Integration between traditional and non-traditional data is facilitated by the Hadoop ecosystem. • Data is stored on a fault-tolerant distributed file system in the Hadoop cluster. • Data is processed close to where it is located to reduce latency and time-consuming transfer processes. • The Hadoop master controller, or “NameNode”, monitors the processes of the Hadoop cluster and automatically executes actions to continue processing when a failure is detected.
  • 16. HADOOP CORE COMPONENTS • Storage: HDFS, Hive, HBase • Management: ZooKeeper, Avro, Oozie, Whirr • Processing: MapReduce, Spark • Integration: Sqoop, Flume • Programming: Pig, HiveQL, Jaql • Insight: Mahout, Hue, Beeswax
  • 17. CORE COMPONENT - STORAGE • HDFS – A distributed file system designed to run on commodity-grade hardware in the Hadoop computing ecosystem. This file system is highly fault tolerant, provides very high throughput to data, and is suitable for very large data sets. Fault tolerance is achieved by making redundant copies of data blocks and distributing them throughout the Hadoop cluster. • Key Characteristics Include: • Streaming data access – Designed for batch processing instead of interactive use. • Large data sets – Typically gigabytes to terabytes in size. • Single coherency model – To enable high-throughput access. • Moving the computational process is cheaper than moving data. • Designed to be easily portable. • Hive – A data warehouse implementation in Hadoop that facilitates the query and management of large datasets kept in distributed storage. • Key Features: • Tools for ETL • A methodology for providing structure for multiple data formats. • Access to files stored in HDFS or HBase • Executes queries via the MapReduce application.
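The fault-tolerance idea on this slide — redundant copies of each block spread across the cluster — can be sketched in a few lines. This is a toy placement policy (real HDFS is rack-aware); the block and node names are hypothetical.

```python
import itertools

def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.
    (Real HDFS placement is rack-aware; this is a toy policy.)"""
    placement = {}
    cycle = itertools.cycle(nodes)
    for block in blocks:
        placement[block] = {next(cycle) for _ in range(replication)}
    return placement

def readable_after_failure(placement, failed_node):
    """A block survives a node failure if any replica lives elsewhere."""
    return all(replicas - {failed_node} for replicas in placement.values())

placement = place_replicas(["blk_1", "blk_2", "blk_3"],
                           ["n1", "n2", "n3", "n4"], replication=3)
print(readable_after_failure(placement, "n1"))  # True: 2 replicas of each block survive
```

With replication factor 3, any single node can fail and every block is still readable from the surviving copies — which is why the sizing slide later multiplies raw data volume by 3.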
  • 18. CORE COMPONENT – STORAGE ….. • HBase – A distributed, scalable big data database, providing random, real-time read/write access to big data. • Key Features: • Modular scalability. • Strictly consistent reads and writes. • Automatic sharding of tables (partitioning tables into smaller, more manageable parts). • Automatic failover.
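The “automatic sharding” bullet can be illustrated with a toy split policy: when a region of sorted row keys grows past a size threshold, it splits at the median key, so each region holds a contiguous key range. This is a sketch of the idea, not HBase’s actual split code.

```python
def split_region(rows, max_region_size):
    """Toy version of HBase auto-sharding: a region that grows past
    max_region_size rows is split at its median row key."""
    regions, done = [sorted(rows)], []
    while regions:
        region = regions.pop()
        if len(region) <= max_region_size:
            done.append(region)
        else:
            mid = len(region) // 2
            regions += [region[:mid], region[mid:]]
    return sorted(done)

regions = split_region([f"row{i:03d}" for i in range(10)], max_region_size=4)
# each resulting region is a contiguous, sorted key range of at most 4 rows
```

In HBase the resulting regions are then reassigned across RegionServers, which is what lets a single logical table scale horizontally.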
  • 19. CORE COMPONENT - MANAGEMENT • ZooKeeper – A centralized service for maintaining configurations and naming, and for providing distributed synchronization and group services. • Avro – A data serialization system. • Oozie – A Hadoop workflow scheduler. • Whirr – A cloud-neutral library for running cloud services. CORE COMPONENT - PROCESSING • MapReduce – An implementation for processing and generating large data sets with a parallel, distributed algorithm on a Hadoop cluster. • Key Features: • Automatic parallelization and distribution • Fault tolerance • I/O scheduling • Status monitoring
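The MapReduce model above is easiest to see with the canonical word-count example. The sketch below runs the two phases serially in plain Python; in a real Hadoop job the map calls run in parallel across input splits and the framework performs the shuffle.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs. In Hadoop, one mapper runs per input split."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by key, then sum the values per word."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return {word: sum(counts) for word, counts in grouped.items()}

counts = reduce_phase(map_phase(["big data and the hadoop ecosystem",
                                 "hadoop stores big data"]))
print(counts["big"], counts["hadoop"])  # 2 2
```

Because each reduce key is independent, the framework can distribute reducers across the cluster and rerun only the failed tasks — the “automatic parallelization” and “fault tolerance” bullets above.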
  • 20. CORE COMPONENT - INTEGRATION • Sqoop – A utility designed to efficiently transfer bulk data between Hadoop and relational databases. • Flume – A service, based on streaming data flows, for collecting, aggregating and moving large amounts of system log data. CORE COMPONENT – PROGRAMMING • Pig – A high-level language for analyzing very large data sets, designed to efficiently utilize parallel processes to achieve its results. • Key Properties: • Ease of programming – Complex tasks are explicitly encoded as data flow sequences, making them easy to understand and implement. • Significant optimization opportunities – The system optimizes execution automatically. • Extensibility – Users can encode their own functions. • HiveQL – A SQL-like query language for data stored in Hive tables, which converts queries into MapReduce jobs. • Jaql – A data processing and query language used to process JSON on Hadoop.
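What “converts queries into MapReduce jobs” means in practice: a HiveQL `GROUP BY` becomes a map (emit key/value), shuffle (group by key), and reduce (aggregate) pipeline. The sketch below shows the equivalent aggregation in plain Python over a hypothetical visits table; the table and column names are made up for illustration.

```python
from collections import Counter

# Hypothetical rows from a Hive table: (department, visit_count)
visits = [("cardiology", 3), ("anesthesiology", 5), ("cardiology", 2)]

# Equivalent of:
#   SELECT department, SUM(visit_count) FROM visits GROUP BY department;
totals = Counter()
for department, visit_count in visits:
    totals[department] += visit_count
print(totals["cardiology"])  # 5
```

Hive plans the same aggregation as a MapReduce job, so analysts get SQL-style ergonomics while the cluster does the distributed work.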
  • 21. CORE COMPONENT - INSIGHT • Mahout – A library of callable machine learning algorithms which uses the MapReduce paradigm. • Supports four main data use cases: • Collaborative filtering – analyzes behavior and makes recommendations. • Clustering – organizes data into naturally occurring groups. • Classification – learns from known characteristics of existing categorizations and assigns unclassified items to a category. • Frequent item or market basket mining – analyzes data items in transactions and identifies items which typically occur together. • Hue – A set of web applications that enables a user to interact with a Hadoop cluster, and to browse and interact with Hive, Impala, MapReduce jobs and Oozie workflows. • Beeswax – An application which allows the user to perform queries on the Hive data warehousing application. You can create Hive tables, load data, run queries and download results in Excel spreadsheet or CSV format.
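The market-basket use case reduces to counting how often pairs of items appear together in the same transaction. A minimal sketch (the baskets are hypothetical supply-room examples, not the deck’s data):

```python
from itertools import combinations
from collections import Counter

def cooccurrences(transactions):
    """Count how often each unordered item pair appears in one transaction —
    the core counting step of frequent-itemset / market-basket mining."""
    pairs = Counter()
    for basket in transactions:
        for pair in combinations(sorted(set(basket)), 2):
            pairs[pair] += 1
    return pairs

baskets = [["gauze", "saline", "gloves"],
           ["saline", "gloves"],
           ["gauze", "gloves"]]
pairs = cooccurrences(baskets)
print(pairs[("gloves", "saline")])  # 2
```

Mahout runs this counting as MapReduce jobs so it scales to transaction logs far too large for one machine; pairs with high counts become candidate “frequently bought together” rules.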
  • 22. HADOOP DISTRIBUTIONS Amazon Web Services Elastic MapReduce •One of the first commercial Hadoop offerings •Has the largest commercial Hadoop market share •Includes strong integration with other AWS cloud products •Auto scaling and support for NoSQL and BI integration Cloudera •2nd largest commercial market share •Experience with very large deployments •Revenue model based on software subscriptions •Aggressive innovation to meet customer demands HortonWorks •Strong engineering partnerships with flagship companies. •Innovation driven through the open source community. •Is a key contributor to the Hadoop core project. •Commits corporate resources to jump start Hadoop community projects.
  • 23. HADOOP DISTRIBUTIONS … International Business Machines •Vast experience in distributed computing and data management. •Experience with very large deployments. •Has advanced analytic tools, and global recognition. •Integration with vast array of IBM management and productivity software. MapR Technologies •Heavy focus and early adopter of enterprise features. •Supports some legacy file systems such as NFS. •Adding performance enhancements for HBase, high-availability and disaster recovery. Pivotal •Spin off from EMC and VMWare. •Strong cadre of technical consultants and data scientists. •Focus on MPP SQL engine and EDW with very high performance. •Has an appliance with integrated Hadoop, EDW and data management in a single rack.
  • 24. HADOOP DISTRIBUTIONS … Teradata • Specialist and strong background in EDW. • Has a strong technical partnership with HortonWorks. • Has very strong integration between Hadoop and Teradata’s management and EDW tools. • Extensive financial and technical resources allow creation of unique and powerful appliances. Microsoft Windows Azure HDInsight • A product designed specifically for the cloud in partnership with HortonWorks. • The only Hadoop distribution that runs in the Windows environment. • Allows SQL Server users to also execute queries that include data stored in Hadoop. • Unique marketing advantage for offering the Hadoop stack to traditional Windows customers.
  • 25. RECOMMENDATION Commitment and Leadership in the Open Source Community Strong Engineering Partnerships Innovation driven from the community Innovative Secure Big Data/Health Research Collaboration
  • 26. CLUSTER DIAGRAM • The NameNode is a single master server which manages the file system and file system operations. • DataNodes are slave servers that manage the data and the storage attached to the nodes they run on. • The NameNode is a single point of failure for the HDFS cluster. • A SecondaryNameNode can be configured on a separate server in the cluster, which creates checkpoints for the namespace. • The SecondaryNameNode is not a failover NameNode.
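The SecondaryNameNode’s checkpointing role — folding the namespace edit log into the file-system image so the NameNode restarts from a compact snapshot rather than replaying a long log — can be sketched as follows. The operation names and paths are hypothetical simplifications of the real fsimage/edits mechanism.

```python
def checkpoint(fsimage, edit_log):
    """SecondaryNameNode-style checkpoint: apply the edit log to the
    fsimage to produce an up-to-date namespace snapshot."""
    image = dict(fsimage)  # don't mutate the caller's image
    for op, path, value in edit_log:
        if op == "create":
            image[path] = value
        elif op == "delete":
            image.pop(path, None)
    return image

fsimage = {"/data/a.csv": 1}
edits = [("create", "/data/b.csv", 2), ("delete", "/data/a.csv", None)]
print(checkpoint(fsimage, edits))  # {'/data/b.csv': 2}
```

This is exactly why the slide warns the SecondaryNameNode is not a failover node: it merges state for the NameNode, but does not serve client requests if the NameNode dies.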
  • 27. CLUSTER HARDWARE CONFIGURATION AND COST
     Factor/Specification                 Option 1                          Option 2
     Replication Factor                   3                                 3
     Size of Data to Move                 500 TB                            500 TB
     Workspace Factor                     1.25                              1.25
     Compression                          1 (no compression)                3
     Hadoop Storage Requirement           1875 TB                           625 TB
     Storage Per Node                     16 TB                             16 TB
     Rack Size                            42U                               42U
     Node Unit Cost                       $4,000                            $4,000
     Rack Unit Cost                       $1,500                            $1,500
     Node Cost (1 NameNode & DataNodes)   (119 nodes * $4,000) = $476,000   (41 nodes * $4,000) = $164,000
     Rack Cost                            (3 racks * $1,500) = $4,500       (1 rack * $1,500) = $1,500
     Total Cost                           $480,500                          $165,500
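The sizing arithmetic behind this table can be sketched as a small calculator: required storage is raw data × replication × workspace overhead ÷ compression ratio, and node count is that storage divided by per-node capacity, plus one NameNode. The 42-nodes-per-rack figure is an assumption inferred from the 42U rack size.

```python
import math

def hadoop_storage_tb(raw_tb, replication=3, workspace_factor=1.25, compression=1):
    """Raw data * replication * workspace overhead, reduced by compression."""
    return raw_tb * replication * workspace_factor / compression

def cluster_cost(raw_tb, storage_per_node_tb=16, node_cost=4000,
                 rack_cost=1500, nodes_per_rack=42, **kwargs):
    """Return (total nodes, racks, total cost) for a given raw data volume."""
    storage = hadoop_storage_tb(raw_tb, **kwargs)
    data_nodes = math.ceil(storage / storage_per_node_tb)
    nodes = data_nodes + 1  # one NameNode
    racks = math.ceil(nodes / nodes_per_rack)
    return nodes, racks, nodes * node_cost + racks * rack_cost

print(cluster_cost(500, compression=1))  # (119, 3, 480500)
print(cluster_cost(500, compression=3))  # (41, 1, 165500)
```

The calculator reproduces both columns of the table, and makes the main lever obvious: a 3:1 compression ratio cuts the cluster from 119 nodes to 41.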
  • 28. HADOOP SANDBOX IN ORACLE VIRTUALBOX Host Specification • Windows 10 • Intel® Core™ i7-4770 CPU @ 3.40GHz • 16GB Installed RAM • 64-bit OS, x64 • 1.65 TB Storage VM Specification • Cloudera Quickstart Sandbox • Red Hat • Intel® Core™ i7-4770 CPU @ 3.40GHz • 10GB Allocated RAM • 32MB Video Memory • 64-bit OS • 64GB Storage • Shared Clipboard: Bidirectional • Drag’n’Drop: Bidirectional
  • 29. CLOUDERA HADOOP DESKTOP & INTERFACE Opening the Cloudera interface and viewing the CDC “Healthy People 2010” data set that was uploaded to the Red Hat OS
  • 30. HUE FILE BROWSER • Folder List • File Contents • Displayed file content is from the Vulnerable Population and Environmental Health data of the “Healthy People 2010” data set.
  • 31. ADDING DATA TO HIVE • Folder List • File Contents • Displayed file content is from the Vulnerable Population and Environmental Health data of the “Healthy People 2010” data set.
  • 32. ADDING DATA TO HIVE … Choosing a delimiter type Defining columns
  • 33. ADDING DATA TO HIVE … • Hive Table List • Table properties
  • 35. BIBLIOGRAPHY
     • "2015/03/18 - Apache Whirr Has Been Retired." Accessed January 24, 2016. https://whirr.apache.org/.
     • "Apache Avro™ 1.7.7 Documentation." Accessed January 24, 2016. https://avro.apache.org/docs/current/.
     • "Apache HBase – Apache HBase™ Home." Accessed January 24, 2016. https://hbase.apache.org/.
     • "Apache Mahout." Hortonworks. Accessed January 24, 2016. http://hortonworks.com/hadoop/mahout/.
     • "Best Practices for Selecting Apache Hadoop Hardware." Hortonworks. September 01, 2011. Accessed January 26, 2016. http://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/.
     • "CDH3 Documentation: Beeswax." Accessed January 24, 2016. http://www.cloudera.com/documentation/archive/cdh/3-x/3u6/Hue-1.2-User-Guide/hue1.html.
     • "CDH4 Documentation: Introducing Hue." Accessed January 24, 2016. http://www.cloudera.com/documentation/archive/cdh/4-x/4-2-0/Hue-2-User-Guide/hue2.html.
     • "Data Analysis." Wikipedia. Accessed January 24, 2016. https://en.wikipedia.org/wiki/Data_analysis.
     • "Hadoop Is Transforming Healthcare." Hortonworks. Accessed January 24, 2016. http://hortonworks.com/industry/healthcare/.
     • "HDFS Architecture Guide." Accessed January 24, 2016. https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Introduction.
     • "Healthy People." Centers for Disease Control and Prevention. January 22, 2013. Accessed January 30, 2016. http://www.cdc.gov/nchs/healthy_people.htm.
     • "Home - Apache Hive - Apache Software Foundation." Accessed January 24, 2016. https://cwiki.apache.org/confluence/display/Hive/Home.
     • "How the 9 Leading Commercial Hadoop Distributions Stack Up." CIO. Accessed January 24, 2016. http://www.cio.com/article/2368763/big-data/146238-How-the-9-Leading-Commercial-Hadoop-Distributions-Stack-Up.html.
     • "Map Reduce (MR) Framework." Gerardnico. Accessed January 24, 2016. http://gerardnico.com/wiki/algorithm/map_reduce.
     • "Oozie - Apache Oozie Workflow Scheduler for Hadoop." Accessed January 24, 2016. http://oozie.apache.org/.
     • "Sizing Your Hadoop Cluster." For Dummies. Accessed January 26, 2016. http://www.dummies.com/how-to/content/sizing-your-hadoop-cluster.html.
     • "Sqoop." Accessed January 24, 2016. http://sqoop.apache.org/.
     • "Preparing Data for Analytics: Making It Easier and Faster." TDWI. Accessed January 24, 2016. https://tdwi.org/articles/2015/04/14/preparing-data-for-analytics.aspx.
     • "UC Irvine Health Does Hadoop. With Hortonworks Data Platform." Hortonworks. Accessed January 26, 2016. http://hortonworks.com/customer/uc-irvine-health/.
     • "Welcome to Apache Flume." Accessed January 24, 2016. https://flume.apache.org/.
     • "Welcome to Apache Pig!" Accessed January 24, 2016. https://pig.apache.org/.
     • "Welcome to Apache ZooKeeper™." Accessed January 24, 2016. https://zookeeper.apache.org/.
     • "Jaql." Wikipedia. Accessed January 24, 2016. https://en.wikipedia.org/wiki/Jaql.