Cleaning and sorting data
YOU CAN DO A LOT WITH TOOLS YOU ALREADY HAVE
A word about data
Data is information that comes from people. Typically, they did not make it for you.
That means it is not perfect for your needs.
This is true whether they are people who built a sophisticated system to generate the
data, or just someone who sent you some lists as an email attachment.

Now you have to deal with it. Often the humbler the data, the bigger the headache.
Knowing some easy ways to wrangle the simple stuff will give you a foundation for
doing fancier work with bigger, more sophisticated data sets, if you are so inclined.
But also, it will help you get the job done today, and sometimes that’s good enough.
You can do a lot with what you’ve got
• Word (yes, Microsoft Word)
• Excel
• The command line
• A simple text editor
First, you don’t have to cut and paste to sort
Here’s a list of info about states that you wish was in some kind of order…
Open it in Word. Select your text, then click “A to Z” in the Paragraph menu. Sorting
alphabetically by paragraph is the default, so just hit OK.
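If Word is not at hand, the same A to Z sort can be sketched in a few lines of Python. This is only an illustration: the function name `sort_lines` and the sample list are made up for the example, and the sort ignores case on the assumption that this is what you want from a plain alphabetical sort.

```python
def sort_lines(text: str) -> str:
    """Return the non-empty lines of `text` sorted A to Z, ignoring case."""
    lines = [line for line in text.splitlines() if line.strip()]
    return "\n".join(sorted(lines, key=str.casefold))


# Hypothetical sample data, not the list shown on the slide.
sample = "Ohio\nalabama\nIllinois\n"
print(sort_lines(sample))
```

Each line is treated as one "paragraph", which mirrors sorting by paragraphs in Word.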
You can also sort by “fields” if your lines have recurring separators.
Or find a pattern you can turn into a separator: change the “ – ” to a pipe (“|”) or a tab.
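Here is a rough sketch of the same idea in Python: split each line on a recurring separator, sort by one of the resulting fields, and write the fields back out with tabs. The function name `sort_by_field`, the default separator, and the sample lines are assumptions made for the example.

```python
def sort_by_field(lines, separator=" – ", field=1):
    """Split each line on `separator` and sort by the chosen field (0-based)."""
    rows = [line.split(separator) for line in lines]

    def key(row):
        # Lines that lack the field sort first instead of raising an error.
        return row[field].casefold() if field < len(row) else ""

    rows.sort(key=key)
    # Join with tabs so the result pastes cleanly into a spreadsheet.
    return ["\t".join(row) for row in rows]


# Hypothetical sample data: state, separator, capital.
sample = ["Illinois – Springfield", "Ohio – Columbus", "Alabama – Montgomery"]
for line in sort_by_field(sample, field=1):
    print(line)
```

Changing `field=1` to `field=0` sorts by the first field instead.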
Word lets you find (or insert) paragraph marks and tabs
This means you can turn a nasty old text file into a spreadsheet in two jiffs
Change returns (^p) to tabs (^t), then double-tabs back to ^p. Copy and paste into Excel!
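The two replacements can be sketched in Python as well. This assumes the text file has one field per line and exactly one blank line between records, which is what the double-tab step implies; the function name `records_to_rows` and the sample records are made up for the example.

```python
def records_to_rows(text: str) -> str:
    """Turn blank-line-separated records into tab-separated rows."""
    # Normalize Windows line endings so only "\n" remains.
    text = text.replace("\r\n", "\n").strip()
    # Step 1: every line break becomes a tab (in Word: replace ^p with ^t).
    flattened = text.replace("\n", "\t")
    # Step 2: a double tab marks where a blank line used to be, so turn it
    # back into a line break (in Word: replace ^t^t with ^p).
    return flattened.replace("\t\t", "\n")


# Hypothetical sample data: name, address, phone, one field per line.
sample = (
    "Alpha Clinic\n12 Main St\n555-0100\n"
    "\n"
    "Beta Clinic\n34 Oak Ave\n555-0101\n"
)
print(records_to_rows(sample))
```

The result has one record per line with tabs between fields, so it pastes into Excel as rows and columns.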
Ah, Excel – so nice, so clean…
But wait – I needed the state and zip code in separate columns…
Select your column, then pick “Text to Columns” in the Data Tools group on the Data tab, and choose Delimited.
Check the character to split on (space in this example). Voila!
Actually you have to move the phone numbers to the right first. And yes, you’ll have to fix East Moline.
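One way to avoid the East Moline problem is to split from the right instead of on every space. The sketch below assumes each cell looks like "City ST ZIP" (optionally with a comma after the city); that layout, the function name `split_city_state_zip`, and the sample values are assumptions for illustration, not the data on the slide.

```python
def split_city_state_zip(cell: str):
    """Split 'City ST ZIP' from the right so multi-word cities stay whole."""
    # rsplit with maxsplit=2 peels off only the last two space-separated parts.
    city, state, zip_code = cell.rsplit(maxsplit=2)
    return city.rstrip(","), state, zip_code


# Hypothetical sample cells.
for cell in ["East Moline, IL 61244", "Chicago IL 60605"]:
    print(split_city_state_zip(cell))
```

A cell with fewer than three parts raises a `ValueError`, which is a useful signal that the row needs a look by hand.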
Sometimes you need to combine columns – for example, first name and last name, or
genus and species. (I’ve moved the amounts to the right to give us space.)
Select the cell where the first result should go, and type “=CONCATENATE(_," ",_)” with
the cells that the info is coming from (here it’s B2 and C2), and put what should go in
between them (here it’s a space) inside the quotes.
When you see the correct result, select the cell and copy it to all the cells below.
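For comparison, here is a minimal Python sketch of the same combine step. The function name `combine` and the sample rows are made up for the example; the `between` argument plays the role of the text inside the quotes in the formula.

```python
def combine(first: str, second: str, between: str = " ") -> str:
    """Join two values with `between` in the middle, like CONCATENATE."""
    return first + between + second


# Hypothetical sample rows: genus and species in separate columns.
rows = [("Quercus", "alba"), ("Acer", "rubrum")]
combined = [combine(genus, species) for genus, species in rows]
print(combined)
```

The list comprehension does the job of copying the formula down to all the cells below.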
And guess what else is still useful – the command prompt!
Yes, there are better things out there. But it may still be the fastest way to
compare two files, especially if you haven’t installed the other things.
Plain old “fc” (file compare) will list the differences between two files. Just put them in
the same folder and type: fc filename1 filename2
Seriously, this is so darn handy!
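If you are on a machine without “fc”, Python’s standard `difflib` module can do a similar comparison. This is a sketch, not a replacement for fc: the output format is a unified diff, which looks different from what fc prints, and the function name `compare_texts` and the sample text are made up for the example.

```python
import difflib


def compare_texts(old: str, new: str, old_name="filename1", new_name="filename2"):
    """Return unified-diff lines describing how `new` differs from `old`."""
    diff = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",
    )
    return list(diff)


# Hypothetical sample text. For real files, read each one first, e.g.
# pathlib.Path("filename1").read_text().
old_text = "Alabama\nIllinois\nOhio"
new_text = "Alabama\nIllinois (IL)\nOhio"
for line in compare_texts(old_text, new_text):
    print(line)
```

Lines starting with "-" appear only in the first text and lines starting with "+" only in the second. An empty result means the two texts match.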
Sometimes the problem is inconsistent filenaming
Maybe you got some data where each record is in a separate file, or pics from
different cameras. A tool I use all the time to deal with this is “Bulk Rename Utility”.
Like most tools, Bulk Rename Utility can do lots of fancy stuff. But a simple “Replace”
will quickly standardize most inconsistent filenames for you.
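The simple “Replace” idea can also be sketched in Python. This is not how Bulk Rename Utility works internally, just the same find-and-replace applied to filenames; the function names `plan_renames` and `apply_renames` and the sample names are assumptions for the example.

```python
from pathlib import Path


def plan_renames(names, old, new):
    """Map each filename containing `old` to the name it would get."""
    if not old:
        return {}
    return {name: name.replace(old, new) for name in names if old in name}


def apply_renames(folder, old, new):
    """Rename files in `folder`; return the renames that were carried out."""
    folder = Path(folder)
    names = sorted(p.name for p in folder.iterdir() if p.is_file())
    done = {}
    for src, dst in plan_renames(names, old, new).items():
        target = folder / dst
        if target.exists():
            continue  # never overwrite a file that is already there
        (folder / src).rename(target)
        done[src] = dst
    return done


# Preview only: nothing on disk is touched by plan_renames.
print(plan_renames(["IMG_001.jpg", "photo-002.jpg"], "IMG_", "photo-"))
```

Printing the plan first, and only then calling `apply_renames` on a real folder, gives you a chance to catch a bad pattern before any file changes name.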
An underrated problem with data is finding the bits you need
Fortunately, some free text editors like “Notepad++” will search across all the files in
a folder and all its subfolders – even your whole C: drive.
Can’t face opening file after file to try to find the data you’re looking for?
Well, you don’t need to. Besides, as a human, you might miss some of it.
Open Notepad++ (or a similar text editor) and select “Find in Files” …
Tell it what to look for and where to start.
To use or save the results, select and copy to a file.
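A search across a folder and all its subfolders can be sketched in Python too. This is an illustration of the idea rather than of Notepad++ itself; the function name `find_in_files`, the default `*.txt` filter, and the case-insensitive match are assumptions for the example.

```python
from pathlib import Path


def find_in_files(root, needle, pattern="*.txt"):
    """Search matching files under `root`; return (path, line number, line)."""
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # skip files that cannot be opened
        for number, line in enumerate(text.splitlines(), start=1):
            if needle.casefold() in line.casefold():
                hits.append((str(path), number, line.strip()))
    return hits


# Hypothetical usage: search the current folder for the word "Moline".
for path, number, line in find_in_files(".", "Moline"):
    print(f"{path}:{number}: {line}")
```

To save the results, write the same formatted lines to a file instead of printing them.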
Download these tools for free
• Bulk Rename Utility: http://www.bulkrenameutility.co.uk/download.php
• Notepad++: http://download.cnet.com/notepad/3000-2352_4-10327521.html
Some other data tools you already have
Actually these are some of the best ones…
Because let’s say you have a nice clean data set from Socrata.
Maybe it’s County procurement data. Or whatever… You still have to make sense of it.
Thinking about the data
• Is it complete? (right number of records)
• Is it consistent? (records entered the same way)
• Are there typos or variant punctuation? Stray spaces?
• Are there values that don’t seem to make sense?
• Does it jibe with what you expected to be there?
• For what purpose, or under what mandate, was it
compiled? This can affect the meaning of terms.
• What do the values actually mean?
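A few of these questions can be given a first pass by a short script; the rest need a person. The sketch below checks one column for record count, blanks, stray spaces, and values that differ only in case, spacing, or punctuation. The function name `check_column`, the report keys, and the sample values are made up for the example.

```python
def check_column(values, expected_count=None):
    """Run basic completeness and consistency checks on a list of strings."""
    report = {"count": len(values)}
    if expected_count is not None:
        # Is it complete? Compare against the number of records you expected.
        report["count_matches"] = len(values) == expected_count
    report["blank"] = sum(1 for v in values if not v.strip())
    # Stray spaces: leading, trailing, or doubled.
    report["stray_spaces"] = [v for v in values if v != v.strip() or "  " in v]
    # Variants: group values that match once case, spaces, and punctuation
    # are ignored, and report groups with more than one spelling.
    groups = {}
    for v in values:
        key = "".join(ch for ch in v.casefold() if ch.isalnum())
        if key:
            groups.setdefault(key, set()).add(v)
    report["variants"] = [sorted(g) for g in groups.values() if len(g) > 1]
    return report


# Hypothetical sample column.
sample = ["Cook County", "Cook County ", "cook county", "Lake County", ""]
print(check_column(sample, expected_count=4))
```

A script like this only flags candidates. Whether "Cook County" and "cook county" really mean the same thing is still a question for whoever compiled the data.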
Getting to the bottom of it
• How is this data generated, actually?
• What staff are responsible for it?
• If it’s automated, what triggers an entry?
• If there are “multiple choice” values, what is the selection based on?
• Is anyone checking it?
• How often is it updated?
• What do these codes / terms / values actually mean?
Some more things to look at
There are of course plenty of more-sophisticated ways to clean and test the potential ok-ness of
data. Many of them are way beyond me. But they are based on this kind of thinking. Here is some
more of it at its best.
• Some thoughts from the IRE blog
http://ire.org/blog/ire-news/2013/10/25/ten-irrefutable-and-nonnegotiable-rules-responsibl/
• Some thoughts from Drew Skau, visualization architect at Visual.ly
http://blog.visual.ly/cleaning-data-sets/
• The School of Data Handbook
http://schoolofdata.org/handbook/
Data used in my examples
• State publications - http://www.library.illinois.edu/doc/researchtools/guides/state/statelist.html
• Community health centers - http://getcoveredillinois.gov/
• Scott Walker campaign contributors - http://boycottwalker.bsharp.org/walker-bycontributor.html
• Photo files - downloads from personal mobile devices
Thank you
Nina Sandlin
nina.sandlin@gmail.com
www.linkedin.com/in/nsandlin
@nsandlin
fieldmuseum.org/users/nina-sandlin
