This presentation includes a step-by-step tutorial with screen recordings for learning RapidMiner. It also covers the step-by-step procedure for using two of its most interesting features: Turbo Prep and Auto Model.
2. ABOUT RAPIDMINER
Developer (company): RapidMiner
First Release Year: 2006
License: Basic version under AGPL; Professional version is a paid license ($2,500 each)
Written in: Java
OS: Cross-platform
Web site: rapidminer.com
3. ABOUT RAPIDMINER
• Integrated environment for data preparation, machine learning, deep learning, text mining, and predictive analytics
• Wide range of applications
• Supports all steps of data mining
• Template-based workflows
4. MARKET OF RAPIDMINER
RapidMiner received one of the strongest satisfaction ratings in the 2011 Rexer Analytics Data Miner Survey.
RapidMiner has received over 3 million total downloads and has over 250,000 users, including eBay, Intel, PepsiCo, and Kraft Foods as paying customers.
RapidMiner claims to be the market leader in software for predictive data analytics services against competitors such as Revolution Analytics, SAS, Predixion Software, SQL Server, StatSoft, and IBM.
6. WHY RAPIDMINER
• GUI or batch processing
• Data from file, database, web, and cloud services
• Integrates with in-house databases
• Data filtering, merging, joining, and aggregating
• Build, train, and validate predictive models
• Runs on every major platform and operating system
….. And much more
7. • The term attribute is RapidMiner's term for "column."
• In machine learning, each row of a data set is an example of a specific situation, and the attributes (columns) are the properties that describe the situation.
• The attribute to be predicted, the label, is sometimes also known as the target, class, or predicted label.
8. Attributes
• Join, Aggregate, Filter, Sort, Generate Attributes, and Select Attributes.
• The role of an attribute describes how the column is used by machine learning operators.
• Attributes without any role (also called "regular" attributes) are used as input for training, while id attributes are usually ignored by modeling algorithms because they are only used as unique identifiers of data observations.
11. There are two general groups of data handling:
• Blending
• Cleansing
Blending is about transforming a data set from one state to another or combining multiple data sets.
Cleansing is about improving the data so that modeling will deliver better results.
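As a rough illustration of blending, the sketch below joins two example sets on a shared id attribute, similar in spirit to what a Join operator does. The data and attribute names here are made up for illustration, not taken from the tutorial.

```python
# Minimal sketch of blending: an inner join of two example sets on "id".
passengers = [
    {"id": 1, "name": "Alice", "age": 29},
    {"id": 2, "name": "Bob", "age": 41},
]
tickets = [
    {"id": 1, "fare": 72.5},
    {"id": 2, "fare": 13.0},
]

def inner_join(left, right, key):
    """Combine rows from both sets whose key values match."""
    index = {row[key]: row for row in right}  # look up right rows by key
    joined = []
    for row in left:
        match = index.get(row[key])
        if match is not None:
            merged = dict(row)   # copy so the originals stay untouched
            merged.update(match)
            joined.append(merged)
    return joined

blended = inner_join(passengers, tickets, "id")
print(blended[0])  # {'id': 1, 'name': 'Alice', 'age': 29, 'fare': 72.5}
```

Each row of the result carries the attributes of both input sets, which is the "combining multiple data sets" case of blending described above.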
14. Data Cleansing
• Identify unusual cases and remove them from the data set.
• In some cases, outliers are the most interesting cases, but most often they are simply the result of an incorrect measurement and should be removed from the data set.
• A distance-based outlier detection algorithm is used, which calculates the Euclidean distance between the data points and marks the points farthest away from the other data points as outliers. The Euclidean distance combines the differences between two data points across all individual attributes.
• It creates a new column named outlier, with true as the value for outliers and false for all other examples.
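The distance-based idea above can be sketched in a few lines: score each point by its total Euclidean distance to all other points and flag the farthest ones. This is a hypothetical illustration with made-up data, not RapidMiner's actual operator.

```python
import math

# Toy data: four points clustered near (1, 1) and one far-away point.
data = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (8.0, 9.0)]

def euclidean(a, b):
    """Euclidean distance: combine per-attribute differences of two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mark_outliers(points, n_outliers=1):
    """Return an 'outlier' column: True for the n points farthest from the rest."""
    scores = [sum(euclidean(p, q) for q in points) for p in points]
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_outliers])
    return [i in flagged for i in range(len(points))]

outlier_column = mark_outliers(data, n_outliers=1)
print(outlier_column)  # [False, False, False, False, True]
```

The returned list plays the role of the new outlier column: true for outliers, false for all other examples.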
17. Predictive Modeling
• Predictive modeling is a set of machine learning techniques that search for patterns in big data sets and use those patterns to create predictions for new situations.
• Those predictions can be categorical (this is called classification learning) or numerical (regression learning).
• Training data with known labels is needed as input for this kind of machine learning, so this type of method is called supervised learning.
21. Scoring
• Using a model to generate predictions for new data points is called scoring.
• Here, the Naïve Bayes method is used to predict the "Survived" class (yes / no) of each passenger and find the respective confidences.
• The Apply Model operator is used to create predictions for a new, unlabeled data set.
23. Split labeled data into two partitions.
• Split Data takes an example set and divides it into the partitions we have defined.
• In this case, we get two partitions, with 70% of the data in one and 30% in the other.
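A minimal sketch of such a 70/30 split, assuming a shuffle before partitioning (the seed here is an arbitrary choice for reproducibility):

```python
import random

def split_data(examples, ratio=0.7, seed=42):
    """Shuffle the examples and divide them into two partitions."""
    rows = list(examples)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    cut = int(len(rows) * ratio)
    return rows[:cut], rows[cut:]

data = list(range(10))
train_part, test_part = split_data(data, ratio=0.7)
print(len(train_part), len(test_part))  # 7 3
```

The first partition is typically used for training and the second, held-out partition for testing the model.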
25. Cross Validation
• Cross validation makes sure each data point is used as often for training as it is for testing, which avoids the problem of a single, possibly unlucky split.
• Cross validation divides the example set into equal parts and rotates through all parts, always using one for testing and all others for training the model. At the end, the average of all testing accuracies is delivered as the result.
• By default the data is split into 10 different parts, so we call this a 10-fold cross validation.
• The number of folds can be changed in the Parameters panel.
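The rotation described above can be sketched as follows. The `train_fn` and `evaluate_fn` callbacks are placeholders for a real model and accuracy measure; the dummy versions below only demonstrate the fold rotation and averaging.

```python
def cross_validate(examples, k, train_fn, evaluate_fn):
    """k-fold cross validation: average test accuracy over k rotations."""
    folds = [examples[i::k] for i in range(k)]  # k roughly equal parts
    accuracies = []
    for i in range(k):
        test_fold = folds[i]  # one part for testing ...
        train_rows = [row for j, fold in enumerate(folds) if j != i
                      for row in fold]  # ... all others for training
        model = train_fn(train_rows)
        accuracies.append(evaluate_fn(model, test_fold))
    return sum(accuracies) / k  # average of all testing accuracies

def dummy_train(rows):
    return None  # a real learner would fit a model here

def dummy_eval(model, rows):
    return 1.0  # pretend perfect accuracy, just to show the mechanics

avg_accuracy = cross_validate(list(range(50)), 10, dummy_train, dummy_eval)
print(avg_accuracy)  # 1.0
```

With `k=10` this mirrors the default 10-fold setup; changing `k` corresponds to changing the number of folds in the Parameters panel.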