BAS 250
Lesson 2: Data Preparation
This Week’s Learning Objectives
• Explain the concepts and purpose of data preparation
• Understand solutions for handling missing and inconsistent data
• Utilize data and attribute reduction techniques
• Work effectively in RapidMiner to prepare your data
The Data Mining Process: CRISP-DM
3. Data Preparation
o Join the data sets needed for your analysis
o Reduce data sets to include only pertinent variables
o Scrub data to remove anomalies such as outliers or missing data
o Reformat for consistency and effective use
Data Preparation
• Ensure robustness of data
o Combine two or more data sets to create a “mini-database” with all variables needed for analysis in one place.
o Merge by a unique identifier common to both data sets
▪ “Key Identifier”, “Common ID”, “ID Number”, etc.
▪ Example: Social Security Number (links medical and insurance data)
Data Preparation
Example: Sources of Data
• Customer Purchases: “Point of Sale data”, CSV file format
• Cost of Products Sold: “Accounting department”, Excel file format
• Inventory of Products: “IT Data Warehouse”, XML file format
Merge by Product ID or SKU
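The merge described above can be sketched outside RapidMiner in a few lines of Python. This is only an illustration using the standard library, not the course's RapidMiner workflow. The column names (product_id, units_sold, unit_cost, in_stock) and all values are hypothetical, and the accounting and inventory sources are shown as plain dictionaries rather than real Excel or XML files.

```python
import csv
import io

# Hypothetical sample data standing in for the three sources on the slide.
purchases_csv = """product_id,units_sold
A100,3
B200,5
A100,2
"""
costs = {"A100": 4.50, "B200": 12.00}   # stands in for the Accounting (Excel) data
inventory = {"A100": 40, "B200": 7}     # stands in for the IT data warehouse (XML) data

def merge_by_product_id(purchase_rows, costs, inventory):
    """Attach cost and inventory to each purchase row, using product_id as the key."""
    merged = []
    for row in purchase_rows:
        key = row["product_id"]
        merged.append({
            "product_id": key,
            "units_sold": int(row["units_sold"]),
            "unit_cost": costs.get(key),      # None if the key has no match
            "in_stock": inventory.get(key),   # None if the key has no match
        })
    return merged

purchase_rows = list(csv.DictReader(io.StringIO(purchases_csv)))
merged = merge_by_product_id(purchase_rows, costs, inventory)
for row in merged:
    print(row)
```

Each merged row now holds every variable needed for analysis in one place. A product ID that is absent from one source shows up as None instead of being silently dropped.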
Data Preparation
• Data reduction has two parts:
o Observations (rows, records, instances, etc.)
o Attributes (variables, columns, etc.)
Data Preparation
• Attribute reduction filters out irrelevant or uninteresting attributes without removing them from the original data set.
• Even if a variable isn’t interesting for answering some questions, it may still be useful for others.
It is recommended to import all attributes first, then filter as necessary.
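As a rough illustration of attribute reduction that leaves the original data untouched, here is a minimal Python sketch. The attribute names and values are hypothetical, and the helper select_attributes is introduced only for this example.

```python
# Hypothetical data set with four attributes.
rows = [
    {"customer_id": 1, "age": 34, "favorite_color": "blue", "total_spent": 120.0},
    {"customer_id": 2, "age": 51, "favorite_color": "red", "total_spent": 75.5},
]

def select_attributes(rows, keep):
    """Return new rows holding only the attributes in `keep`; the input is not modified."""
    return [{name: row[name] for name in keep} for row in rows]

reduced = select_attributes(rows, ["customer_id", "total_spent"])
print(reduced)
print(sorted(rows[0]))  # the original rows still carry all four attributes
```

Because the filter builds new rows, the attributes that were set aside remain available for a later question.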
Data Preparation
• Observation reduction lowers the number of observations to create a smaller data set.
• Some reasons to do so:
o Create a sample set for training data, proof-of-concept analysis, testing theories, or sharing data
o Improve analysis speed or processing time
o Scrub data for outliers, missing values, etc.
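Two of the reasons above, sampling and scrubbing, can be sketched in Python as follows. The data set is made up for the example, and the fixed seed is an assumption chosen only so the sample can be reproduced.

```python
import random

# Hypothetical data set: 100 observations, every tenth one missing its value.
rows = [{"id": i, "value": None if i % 10 == 0 else i * 1.5} for i in range(1, 101)]

def sample_rows(rows, n, seed=42):
    """Draw a reproducible random sample of n observations without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, n)

def drop_missing(rows, attribute):
    """Keep only observations where the attribute has a value."""
    return [row for row in rows if row[attribute] is not None]

sample = sample_rows(rows, 20)
complete = drop_missing(rows, "value")
print(len(sample), len(complete))  # prints: 20 90
```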
Data Preparation
• Ensure consistency of data
o Missing information
o Spelling errors, typos
o Multiple responses for an attribute
o Characters in numeric fields and vice versa
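Spelling errors and typos are often handled by normalizing text values before analysis. The Python sketch below shows one possible approach. The CORRECTIONS mapping and the sample values are hypothetical; in practice such a list is built by inspecting the distinct values in the attribute.

```python
# Hypothetical mapping of known misspellings to the canonical value.
CORRECTIONS = {"calfornia": "california", "n. carolina": "north carolina"}

def normalize(value):
    """Collapse whitespace, lowercase, and apply known spelling corrections."""
    cleaned = " ".join(value.split()).lower()
    return CORRECTIONS.get(cleaned, cleaned)

states = ["California", " california ", "Calfornia", "N.  Carolina"]
normalized = [normalize(s) for s in states]
print(normalized)
```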
Data Preparation
• Ensure consistency of data
KEY: Missing data is data that does not exist in a data set
• Not the same as zero or some other value
• In a data set, it appears as a blank and the value is unknown
• Sometimes referred to as null values
• Depending on your objective and the circumstances, you may choose to leave missing data as is or replace it with some other value
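The difference between a missing value and a zero can be made concrete with a short Python sketch. The CSV content is hypothetical, and None is used here to represent a null value.

```python
import csv
import io

# Hypothetical CSV: Bob's score is blank (missing); Cy's score is a true zero.
raw = """name,score
Ann,88
Bob,
Cy,0
"""

def parse_score(text):
    """Return None for a blank cell so that missing is never confused with zero."""
    text = text.strip()
    return None if text == "" else float(text)

rows = [{"name": r["name"], "score": parse_score(r["score"])}
        for r in csv.DictReader(io.StringIO(raw))]

missing = [r["name"] for r in rows if r["score"] is None]
zeros = [r["name"] for r in rows if r["score"] == 0]
print(missing, zeros)  # prints: ['Bob'] ['Cy']
```

Treating the blank as 0 would pull down any average computed on the score attribute, which is why the two cases are kept apart.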
Data Preparation
• Ensure consistency of data
KEY: Inconsistent data is different from missing data
• Occurs when a value exists but is not valid or meaningful
• Common examples: “.” or “zero”
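One way to flag inconsistent values is sketched below in Python. The height_cm attribute, the sample values, and the zero_is_valid switch are assumptions made for the example. Whether a zero is meaningful depends on the attribute, so that decision is left to the caller.

```python
def clean_numeric(text, zero_is_valid=True):
    """Return a float, or None when the value is blank, invalid, or not meaningful."""
    text = text.strip()
    if text in ("", "."):          # blank cell or "." placeholder
        return None
    try:
        number = float(text)
    except ValueError:             # characters in a numeric field
        return None
    if number == 0 and not zero_is_valid:
        return None                # a zero that cannot be a real measurement
    return number

# Hypothetical "height_cm" values, where zero cannot be a real measurement.
raw_heights = ["172", ".", "0", "abc", "165.5"]
heights = [clean_numeric(v, zero_is_valid=False) for v in raw_heights]
print(heights)  # prints: [172.0, None, None, None, 165.5]
```

Once flagged, inconsistent values can be replaced or removed in the same way as missing values.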
Data Preparation
• Ensure consistency of data
Replace or remove missing or inconsistent data
• For numeric data:
• Values can be replaced using measures of central tendency:
• Mean: the average value
• Median: the middle value
• Mode: the most frequent value
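The three replacement strategies can be sketched with Python's statistics module. The ages list is hypothetical, and the helper name impute is introduced only for this illustration.

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the known values."""
    known = [v for v in values if v is not None]
    fill = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[strategy](known)
    return [fill if v is None else v for v in values]

ages = [20, 22, 22, None, 36]   # hypothetical attribute with one missing value
print(impute(ages, "mean"))
print(impute(ages, "median"))
print(impute(ages, "mode"))
```

For the known values 20, 22, 22, and 36, the mean is 25 while the median and mode are both 22. The choice of measure changes the filled-in value, which matters most when the data contain outliers.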
Data Preparation
• Ensure consistency of data
Replace or remove missing or inconsistent data
• For character data:
• Values can be replaced using a best estimated value:
• “Like Others”
• Example: All males in the data like bass fishing. If the attribute “Fish Type” is blank and the attribute “Gender” equals male, then fill in “Bass”.
• “Clustering Techniques”
• “Best Guess”
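The “Like Others” rule can be sketched in Python as filling a blank with the most common value among observations in the same group. The attribute names fish_type and gender and the sample rows are hypothetical and simply mirror the example above. This is one possible reading of the rule, not a prescribed method.

```python
from collections import Counter

def fill_like_others(rows, target, group_by):
    """Fill blank `target` values with the most common value among rows sharing `group_by`."""
    counts = {}
    for row in rows:
        if row[target]:
            counts.setdefault(row[group_by], Counter())[row[target]] += 1
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the original rows stay unchanged
        if not new_row[target] and new_row[group_by] in counts:
            new_row[target] = counts[new_row[group_by]].most_common(1)[0][0]
        filled.append(new_row)
    return filled

# Hypothetical survey data mirroring the example.
rows = [
    {"gender": "male", "fish_type": "Bass"},
    {"gender": "male", "fish_type": "Bass"},
    {"gender": "female", "fish_type": "Trout"},
    {"gender": "male", "fish_type": ""},
]
result = fill_like_others(rows, "fish_type", "gender")
print(result[3])  # prints: {'gender': 'male', 'fish_type': 'Bass'}
```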
Data Preparation
• Ensure consistency of data
• Replacing missing or inconsistent values found in data should be done:
• With intention, not haphazardly
• With common sense
• Transparently
It is recommended to always document your missing or inconsistent data processes.
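One lightweight way to document replacements is to record each change as it is made. The Python sketch below assumes a hypothetical income attribute and a fill value chosen by the analyst. The log format is an illustration, not a standard.

```python
def impute_with_log(rows, attribute, fill_value, reason):
    """Fill missing values and return (new_rows, log) so every change is documented."""
    new_rows, log = [], []
    for index, row in enumerate(rows):
        new_row = dict(row)
        if new_row[attribute] is None:
            new_row[attribute] = fill_value
            log.append({"row": index, "attribute": attribute,
                        "new_value": fill_value, "reason": reason})
        new_rows.append(new_row)
    return new_rows, log

rows = [{"income": 52000}, {"income": None}, {"income": 61000}]  # hypothetical data
cleaned, log = impute_with_log(rows, "income", 56500, "mean of known incomes")
print(log)
```

The log can be saved alongside the cleaned data so that anyone reviewing the analysis can see which values were replaced and why.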
Basics of RapidMiner
• This is a practical application course in data mining. Learning to use RapidMiner is required.
• If you have not done so yet, please plan to walk through the tutorial examples in RapidMiner.
• To help you understand RapidMiner, I will include screenshots of the steps I take to get the results we are looking for.
• RapidMiner is fairly intuitive, and you will pick it up quickly.
Basics of RapidMiner
• Types of files that can be imported into RapidMiner:
o CSV file
o Excel file
o XML file
o Access database table
o … and many more
• We mainly use CSV files, which contain comma-separated values. Be mindful if your data itself contains commas.
o Alternative delimiters can be selected in this case:
▪ Tab
▪ Semicolon
▪ Pipe ( | ), etc.
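The delimiter issue is not specific to RapidMiner. As an illustration, the Python sketch below reads a value that contains a comma in two safe ways, with a pipe delimiter and with a quoted field, and shows what a naive comma split does to the same record. The file contents are hypothetical.

```python
import csv
import io

# Hypothetical file contents: the product name itself contains a comma.
pipe_text = "sku|name|price\n1001|Widget, large|9.99\n"
comma_text = 'sku,name,price\n1001,"Widget, large",9.99\n'

pipe_rows = list(csv.reader(io.StringIO(pipe_text), delimiter="|"))
comma_rows = list(csv.reader(io.StringIO(comma_text)))   # quotes protect the comma
naive = "1001,Widget, large,9.99".split(",")             # unquoted: the name is split in two

print(pipe_rows[1])   # prints: ['1001', 'Widget, large', '9.99']
print(comma_rows[1])  # prints: ['1001', 'Widget, large', '9.99']
print(len(naive))     # prints: 4
```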
Basics of RapidMiner
• Three main areas contain useful tools in RapidMiner:
o Operators: every possible task you can think of
o Repositories: where you store your data
o Parameters: task setup details
Summary
• Explain the concepts and purpose of data preparation
• Understand solutions for handling missing and inconsistent data
• Utilize data and attribute reduction techniques
• Work effectively in RapidMiner to prepare your data
Copyright Information
“This workforce solution was funded by a grant awarded by the U.S. Department of Labor’s Employment and Training Administration. The solution was created by the grantee and does not necessarily reflect the official position of the U.S. Department of Labor. The Department of Labor makes no guarantees, warranties, or assurances of any kind, express or implied, with respect to such information, including any information on linked sites and including, but not limited to, accuracy of the information or its completeness, timeliness, usefulness, adequacy, continued availability, or ownership.”
Except where otherwise stated, this work by Wake Technical Community College Building Capacity in Business Analytics, a Department of Labor TAACCCT-funded project, is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
