2. Contents
• Introduction
• Why RADOOP
• Architecture
• Radoop sub-parts
• Benefits
• Conclusion and future work
3. Introduction
• Radoop is a tool used for data analysis.
• It is devloped by Gábor Makrai .
• Radoop closely integrates the highly optimized
data analytics capabilities of Hadoop clusters,
the distributed data warehouse Hive, and
Mahout into the user-friendly interface of
RapidMiner. This results in a powerful and
easy-to-use data analytics solution for
Hadoop.
4. Why Radoop ?
Data is growing:It’s growing. Quickly. And it’s everywhere.
9000
7910
8000
7000
6000
5000
4000
3000
2000
1227
1000
130
0
2005 2010 2015
5. New kinds of data
Structured data vs. Unstructured data growth
Complex, Unstructured
Analysis gap
Analysis gap
Relational
Our
ability to
analyze
“The sexy job in the next 10 years will be statisticians”
– Hal Varian, Chief Economist at Google
•Digital universe grew by 62% last year to 800K petabytes and will grow to
1.2 “zettabytes” this year.
6. Architechture
Hive: A data warehouse
infrastructure for data
summarization & ad hoc querying.
Mahout: A Scalable machine
learning and datamining library.
•HDFS is a distributed file system
designed to hold very large amounts of
data and provide high-throughput
access to this information.
•The Map-Reduce programming
model is a Framework for distributed
processing of large data sets.
• RapidMiner is a toolkit for
datamining.
7. RapidMiner
RapidMiner, formerly YALE (Yet Another Learning
Environment), is an environment for machine learning , data
mining, text mining, predictive analytics, and business
analytics.
It is used for research, education, training, application
development, and industrial applications.
RapidMiner provides data mining and machine learning
procedures including: data loading and transformation (ETL),
data preprocessing and visualization, modeling, evaluation,
and deployment. It is able to generate graphs like MS Excel.
It is also used for analyzing data generated by high-
throughput instruments used in processes such as
genotyping, proteomics, and mass spectrometry.
8. Hive and Mahout
• Hive : is a data warehouse infrastructure built on top of
Hadoop, i.e. it uses the distributed file system of Hadoop and
the efficient access technologies. Hive was initially developed
by Facebook and is now used and developed by many other
companies for their distributed data warehouse.
• Mahout : is a machine learning library already offering many
scalable machine learning libraries implemented as well on
top of Hadoop and its map & reduce paradigm. Hence,
Mahout is one of the first distributed data analytics
framework making use of the power of Hadoop.
9. Map Reduce
The Map-Reduce programming model
-Framework for distributed processing
of large data sets
Natural for:
– Log processing
– Web search indexing
– Ad-hoc queries
10. HDFS
HDFS is a distributed file system designed to hold very large amounts of data and
provide high-throughput access to this information.
Very Large Distributed File System
– 10K nodes, 100 million files, 10 PB
Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detect failures and recovers from them
Optimized for Batch Processing
– Data locations exposed so that computations can
move to where data resides
– Provides very high aggregate bandwidth
User Space, runs on heterogeneous OS
No RAID required.
11. Advantages
Scalability: Even data volumes in the
terabyte and petabyte range can be
analyzed.
Radoop provides a user-friendly
interface for editing and running ETL,
analytics, and machine learning
processes on Hadoop.
Radoop provides easy-to-use graphical
interface.
It eliminates the ETL bottlenecks.
12. Conclusion and future work
It has experimentally proved that within a time up to 1-8 gb
of data analyzed with 4-16 nodes in radoop where in
rapidminor up to 1gb of data can be analyzed.
“we believe more than half of the world’s data will be stored
in Apache Hadoop within 5 years” Hortonworks.
Radoop is opening the doors for people who are less
comfortable with Hadoop but want to use Hadoop for Big
Data analysis.