SlideShare ist ein Scribd-Unternehmen logo
1 von 7
Downloaden Sie, um offline zu lesen
Files
Hellen Gakuruh
February 8, 2016
Table of Contents
Definition and file names.....................................................................................................................................2
File formats.................................................................................................................................................................2
File Paths......................................................................................................................................................................2
Encoding.......................................................................................................................................................................4
ASCII..........................................................................................................................................................................4
UTF-8........................................................................................................................................................................4
Types of files ..............................................................................................................................................................5
Text files and binary files................................................................................................................................5
Text files.............................................................................................................................................................5
Binary files ........................................................................................................................................................6
Free format............................................................................................................................................................6
Delimited Files ................................................................................................................................................6
Other file types supported by base R ........................................................................................................6
Summary......................................................................................................................................................................7
References...................................................................................................................................................................7
Definition and file names
A file can be viewed as a computer storage device for information. Infomation can be
anything from data to program specification. Each file has a name used to refer to it in a
filing system. A filing system is the organization of files in a local disk, external device or
network. Filing systems are responsible for how files are stored and how they are
retrieved. Some of the recognisable filing systems are the legacy filling system called "File
Allocation Table" (FAT) and a newer one called "New Technology File System" (NTFS).
Most filing systems have restrictions on length and characters used in a file's name. It's
therefore important to know your filing system and it's restrictions when naming files.
File names usually end with an extension, these extensions indicate type of file and its
content. For example, some of microsoft office documents end in ".docx", "xlsx", and ".pptx"
extensions. Other notable extensions are; ".zip" for compressed folders, "csv" for comma
separated files, ".txt" for text files, and of course R's ".R", ".Rmd", ".RData" and ".Rhistory"
extensions. Since data files come in different types (as indicated by their extension), it's
imperative to know the program that can open it.
File formats
File format is used to imply file storage in terms of it being human-readable (like ASCII) or
machine-readable (binary) format and its organization.
File Paths
To access a file in any directory or folder, you must know its path. A path is a specific
location on your disk, device or network. Paths are constructed by different components
separated by different characters like "" or "/.
Components of a path comprise of different folders, file name and its extension. Paths are
hierarchical in nature; hierarchical in the sense that a file can be in a directory within
several other directories. The current directory containing a required file is referred to as a
child directory and its immediate directory as a parent director.
To demostrate this, take an example of a file named "files" with an extension ".Rmd" (a R
markdown file). This file is in a sub-folder called "Programming for Non-programmers" in
the tutorial's folder "Data_Mgt_Analysis_and_Graphics_R" which is in my organization's
folder called "Data Mania Inc" in "my documents folder", to source this file, I will need to
specify all the parent directory that form the file's path (components of the path).
Depending on the program you are using, a path can be specified in "reference to your
current working directory" or its "full path". Full path would often begin with a drive like
"c:" or "D:" on (Windows). When path is in reference to your current director, as is the case
with R, then the path begins from your working directory followed by components leading
to the file. Sometimes this would mean moving some directories above the working
directory. For example, my current working directory is a folder called "Level One" in the
"Class_Notes" folder stored in the tutorial's folder. To construct a path leading to the
files.Rmd document, we need to move up the directories until we get to the tutorials folder
(Data Mania Inc/Data_Mgt_Analysis_and_Graphics_R). From there we can specify the
directories leading to the file.
* Current working directory
My Documents > Data Mania Inc > Data_Mgt_Analysis_and_Graphics_R >
Class_Notes > Level One
* Directory with needed file
My Documents > Data Mania Inc > Data_Mgt_Analysis_and_Graphics_R >
Programming for Non-programmers
* Therefore we need to move up two positions (parent directories) and then
move into one other directory
In order for us to move up the directory hierachy, we will use special specifiers to denote
parent directories. These specifiers are denoted with two dots or periods (one dot indicates
current directory).
Components of a path are separated with either a forward or a backslash depending on the
operating system. Windows seems to be the only one using the backslash as all other
systems use the forward slash as a directory separator. You can determine the path
separator used in your platform with .Platform$file.sep
* Here we are literary saying move two directories up then get into the
programming folder and get an .Rmd file. Paths in R are enclosed in
double/single quotation marks (character/string delimiters).
"../../Programming for Non-programmers/Files.Rmd"
Exercise on file paths
Create full and relative paths to "tmp.csv".
* File location
My Documents > Data Mania Inc > Trainings > Online Class > Data set
* Working Directory
C:/Users/Hellen Gakuruh/Documents > Data Mania Inc >
Data_Mgt_Analysis_and_Graphics_R > Programming for Non-programmers
Encoding
Encoding is a set of rules on how a file's content is converted from human readable format
(like characters) to machine readable format (0s and 1s). There are numerous encoding of
which R has provision for 374 encodings.
If a wrong encoding is used, the content of a file can become unreadable to either you the
user (human) or the computer (machine). Below we discuss some common encodings.
ASCII
ASCII (American Standard Code for Information Interchange) is a type of encoding that
converts characters into binary form (machine format). This encoding was developed in
1963 and was quite common internet files like HTML until 2007 when it was superseded
by UTF-8 encoding. Since ASCII was originally based on American English characters, there
have been other non-English character encoding schemes, some based on ACSII.
To best understand this, take for instance when you start a script on R's text editor (or any
other text editor) and run it, the codes you type would need to be converted into a language
the computer understand which is 1's (on) and 0's (off), this is done through an encoding
scheme.
ASCII encodes 128 characters (0-9, lower and upper case letters, punctuation characters,
and control characters/non-printing characters e.g. tab) into 7 bits integers. These 128
characters are known as "character set", that is, a set of characters that can be encoded.
Characters can be encoded using binary (base 2), octal (base 8), decimal (base 10) or
hexadecimal (base 16) number system. Number system refers to the number of symbols
used to represent numbers and is used to describe the quantity of something.
Binary has two (binary) symbols 1 and 0 which form bit strings (bunch of bits) like
01000001 representing capital A, octal has eight symbols from 0 to 7, decimal (the system
that we commonly use) has 10 symbols from 0 to 9, and hexadecimal has sixteen from 0 to
9 plus A to F. Hexadecimal are shorter and easier to represent than binary. This is what is
returned by R when you convert to raw i.e. as.raw(10) becomes 0a
Out of the 128 characters that ASCII can encode, only 95 are human readable characters,
these characters are certainly not sufficient for non-English languages like Japanees,
Chinese and most European countries. This fact led to other character encoding schemes
like UTF-8.
UTF-8
Universal Coded Character Set + Transformation Format - 8-bit (UTF-8) is a character
encoding scheme that superseded ASCII. It's characters begin with the 128 ASCII characters
(enabling backward compartibility) which are one byte, and includes other multi-byte
characters to handle other characters like European and Asian languages as well as
currency symbols.
UTF-8 is now commonly used to encode characters in webbased files like html.
Other common encoding schemes include "ANSI", "OEM", and "Unicode". Unicode is really
not an encoding scheme but a collection of encoding schemes. It brings together other
encoding schemes like ISO-8859 variants in Europe, Shift-JIS in Japan, or BIG-5 in China
thereby allowing more characters that can work together; it's good for documents with
multiple languages and/or characters.
Types of files
It's important to know the type of files that R can easily work with. Here we will discuss
text and binary files as well as free format and other file types.
Text files and binary files
Text files
A text file, also known as a flat file is an object that contains text organized as lines of text.
The contents of a text file are known as plain text as they have little or no formatting (e.g.
bold and italics).
There are different types of text files indicated by their extensions. Windows has ".txt"
extension created by a program called notepad. Other programs have their own text files
with extensions denoting the programming language used.
Encoding text files
For English text files, the most common file encoding is ASCII, for other languages or
characters, it is best to change the encoding to suit the locale and its character set.
Newline/End of Line (EOL)
When writing multi-line of text, there are specific characters/markers that are used to
indicate end of a line and thereby the begining of a subsequent line. The most commonly
used markers are "carriage return (CR)" and "Line feed (LF)".
To understand these two concepts, think of the days of typewriters. When you finish typing
a line of text, you would go back to the begining of that line using a lever device and then
move to the next line. In the same way, when you type a text document you need to include
markers that tell the program to move to the start of the line and then to the next line. In
this regard, carriage return moves you to the begining of the line and line feed begins a new
line.
Different programs use different markers causing problems when moving from one
program to another. Example of markers used by different programs include; CR, LF, and
different combinations CR and LF. Programming languages provided some escape
sequences to deal with this issue. Some of these are "r" denoting carriage return and "n"
denoting new line.
Binary files
Binary files on the other hand contain binary characters 1 and 0 (computer readable
format) as eight bit strings representing human readable characters like aphabetical
letters.
Good example of binary files are executable programs.
Free format
Free format are file in the public domain (open source) and therefore do not have any
monetary implication as they are not restricted by copyrights, patents or trademarks.
These files are also platform independent meaning they can be opened from any operating
system as well as being machine readable. Some of the commonly used free format file
types are:
• Media files - JPEG and PNG
• Text - Plain text, HTML, XHTML, LaTex
• Archiving and compression - bzip2, gzip, tar
• Others - Custom style sheets (CSS), comma separated values (CSV), JSON, RSS, YALM
We will look at two commonly used free format files in R; text files and comma separated
files also referred to as delimited files.
Delimited Files
Unlike text data that contain continuous lines of information, datasets are usually
composed of lines with values separated by some characters like commas, tab, semi colon
and the like. Each line corresponds to a row or a case and each value separated into a
column or variable value.
The characters used to separate each value in a line are known as "delimiters" and the file
is therefore called a delimited file. Two of the commonly used delimited files in R are
comma separate values (CSV) and Tab-delimited files. CSV files are separated by commas
while tab-delimited files are separated by tabs.
Delimited files are usually the easiest files to import or export data in R.
Other file types supported by base R
R supports other file types, these include:
• Debian control file (DCF): DCF are simple file format for storing databases in plain text
files
• Binary data files (Bin): Used to transfer data to binary or binary to text
• Fixed width file format: These are text file with column and character alignment
formating. They are manily used to organize data in a certain way.
All proprietary type of files like excel, SPSS, Stata and SAS can only be handled by
contributed or add-on packages.
Summary
In summary, the key pointers that you should grasp from this write-up is how to specify a
path to a document, understand encoding used in a file and the files types used in R.
With that you should be able to read in and export data and output without difficulty. But in
the event that you experience some challenges, you are able to identify its cause (which is
half of the problem).
Solution
Full and relative paths to the file are:
* Full path
"C:/Users/Hellen Gakuruh/Documents/Data Mania Inc/Trainings/Online Class/Data
set/tmp.csv"
* Relative path
"../../Trainings/Online Class/Data set/tmp.csv"
References
1. http://kunststube.net/encoding/
2. https://en.wikipedia.org/wiki/ASCII
3. https://en.wikipedia.org/wiki/Text_file
4. https://www.cs.umd.edu/class/sum2003/cmsc311/Notes/BitOp/asciiBin.html
5. http://tibasicdev.wikidot.com/binandhex
6. http://www.oualline.com/practical.programmer/eol.html

Weitere ähnliche Inhalte

Was ist angesagt?

File management
File managementFile management
File managementsumathiv9
 
Data file handling in python introduction,opening & closing files
Data file handling in python introduction,opening & closing filesData file handling in python introduction,opening & closing files
Data file handling in python introduction,opening & closing fileskeeeerty
 
File Management in Operating Systems
File Management in Operating SystemsFile Management in Operating Systems
File Management in Operating Systemsvampugani
 
File types atul namdeo
File types  atul namdeoFile types  atul namdeo
File types atul namdeoAtul Namdeo
 
File management ppt
File management pptFile management ppt
File management pptmarotti
 
Advances in File Carving
Advances in File CarvingAdvances in File Carving
Advances in File CarvingRob Zirnstein
 
Switching & Multiplexing
Switching & MultiplexingSwitching & Multiplexing
Switching & MultiplexingMelkamuEndale1
 
Concept of computer files
Concept of computer filesConcept of computer files
Concept of computer filesSamuel Igbanogu
 

Was ist angesagt? (19)

File and fat
File and fatFile and fat
File and fat
 
File Management
File ManagementFile Management
File Management
 
File management
File managementFile management
File management
 
File management
File managementFile management
File management
 
Data file handling in python introduction,opening & closing files
Data file handling in python introduction,opening & closing filesData file handling in python introduction,opening & closing files
Data file handling in python introduction,opening & closing files
 
File handling
File handlingFile handling
File handling
 
OSCh12
OSCh12OSCh12
OSCh12
 
File Management
File ManagementFile Management
File Management
 
File Management in Operating Systems
File Management in Operating SystemsFile Management in Operating Systems
File Management in Operating Systems
 
File types atul namdeo
File types  atul namdeoFile types  atul namdeo
File types atul namdeo
 
Files
FilesFiles
Files
 
File Management
File ManagementFile Management
File Management
 
File management
File management File management
File management
 
File management ppt
File management pptFile management ppt
File management ppt
 
File Management
File ManagementFile Management
File Management
 
File management
File managementFile management
File management
 
Advances in File Carving
Advances in File CarvingAdvances in File Carving
Advances in File Carving
 
Switching & Multiplexing
Switching & MultiplexingSwitching & Multiplexing
Switching & Multiplexing
 
Concept of computer files
Concept of computer filesConcept of computer files
Concept of computer files
 

Andere mochten auch

MSc thesis Bram Roefs 518968
MSc thesis Bram Roefs 518968MSc thesis Bram Roefs 518968
MSc thesis Bram Roefs 518968Bram Roefs
 
SessionFour_DataTypesandObjects
SessionFour_DataTypesandObjectsSessionFour_DataTypesandObjects
SessionFour_DataTypesandObjectsHellen Gakuruh
 
модель унівського самоврядування
модель унівського самоврядуваннямодель унівського самоврядування
модель унівського самоврядуванняDevid kotter
 
4_INFRARED REMOTE USED FOR 8
4_INFRARED REMOTE USED FOR 84_INFRARED REMOTE USED FOR 8
4_INFRARED REMOTE USED FOR 8SURAJ MAHAPATRA
 
16 днів проти насильства
16 днів проти насильства16 днів проти насильства
16 днів проти насильстваАнна Пугач
 
Proyecto participativo 3°
Proyecto participativo 3°Proyecto participativo 3°
Proyecto participativo 3°KAtiRojChu
 
день цивільного захисту 08.04.2016
день цивільного захисту 08.04.2016день цивільного захисту 08.04.2016
день цивільного захисту 08.04.2016Анна Пугач
 
Gerencia y administracion de salud, catedra
Gerencia y administracion de salud, catedraGerencia y administracion de salud, catedra
Gerencia y administracion de salud, catedraJorge Amarante
 
Integration course development plan zervou
Integration course development plan zervouIntegration course development plan zervou
Integration course development plan zervouzervoum
 

Andere mochten auch (15)

MSc thesis Bram Roefs 518968
MSc thesis Bram Roefs 518968MSc thesis Bram Roefs 518968
MSc thesis Bram Roefs 518968
 
SessionFour_DataTypesandObjects
SessionFour_DataTypesandObjectsSessionFour_DataTypesandObjects
SessionFour_DataTypesandObjects
 
webScrapingFunctions
webScrapingFunctionswebScrapingFunctions
webScrapingFunctions
 
модель унівського самоврядування
модель унівського самоврядуваннямодель унівського самоврядування
модель унівського самоврядування
 
Question 5
Question 5 Question 5
Question 5
 
4_INFRARED REMOTE USED FOR 8
4_INFRARED REMOTE USED FOR 84_INFRARED REMOTE USED FOR 8
4_INFRARED REMOTE USED FOR 8
 
16 днів проти насильства
16 днів проти насильства16 днів проти насильства
16 днів проти насильства
 
Proyecto participativo 3°
Proyecto participativo 3°Proyecto participativo 3°
Proyecto participativo 3°
 
¿ACASO EL TALENTO BASTA?
¿ACASO EL TALENTO BASTA?¿ACASO EL TALENTO BASTA?
¿ACASO EL TALENTO BASTA?
 
¿INCENTIVOS INDIVIDUALES O COLECTIVOS?
¿INCENTIVOS INDIVIDUALES O COLECTIVOS?¿INCENTIVOS INDIVIDUALES O COLECTIVOS?
¿INCENTIVOS INDIVIDUALES O COLECTIVOS?
 
день цивільного захисту 08.04.2016
день цивільного захисту 08.04.2016день цивільного захисту 08.04.2016
день цивільного захисту 08.04.2016
 
Gerencia y administracion de salud, catedra
Gerencia y administracion de salud, catedraGerencia y administracion de salud, catedra
Gerencia y administracion de salud, catedra
 
Integration course development plan zervou
Integration course development plan zervouIntegration course development plan zervou
Integration course development plan zervou
 
AR Final Research Paper
AR Final Research PaperAR Final Research Paper
AR Final Research Paper
 
Diseño cuasiexperimental
Diseño cuasiexperimentalDiseño cuasiexperimental
Diseño cuasiexperimental
 

Ähnlich wie Files

File Handling In C++(OOPs))
File Handling In C++(OOPs))File Handling In C++(OOPs))
File Handling In C++(OOPs))Papu Kumar
 
SessionFive_ImportingandExportingData
SessionFive_ImportingandExportingDataSessionFive_ImportingandExportingData
SessionFive_ImportingandExportingDataHellen Gakuruh
 
Data file handling in python introduction,opening & closing files
Data file handling in python introduction,opening & closing filesData file handling in python introduction,opening & closing files
Data file handling in python introduction,opening & closing filesKeerty Smile
 
SessionTen_CaseStudies
SessionTen_CaseStudiesSessionTen_CaseStudies
SessionTen_CaseStudiesHellen Gakuruh
 
Data file handling in c++
Data file handling in c++Data file handling in c++
Data file handling in c++Vineeta Garg
 
Bba203 unit 2data processing concepts
Bba203   unit 2data processing conceptsBba203   unit 2data processing concepts
Bba203 unit 2data processing conceptskinjal patel
 
Understanding EDP (Electronic Data Processing) Environment
Understanding EDP (Electronic Data Processing) EnvironmentUnderstanding EDP (Electronic Data Processing) Environment
Understanding EDP (Electronic Data Processing) EnvironmentAdetula Bunmi
 
File handling3 (1).pdf uhgipughserigrfiogrehpiuhnfi;reuge
File handling3 (1).pdf uhgipughserigrfiogrehpiuhnfi;reugeFile handling3 (1).pdf uhgipughserigrfiogrehpiuhnfi;reuge
File handling3 (1).pdf uhgipughserigrfiogrehpiuhnfi;reugevsol7206
 
File Input/output, Database Access, Data Analysis with Pandas
File Input/output, Database Access, Data Analysis with PandasFile Input/output, Database Access, Data Analysis with Pandas
File Input/output, Database Access, Data Analysis with PandasPrabu U
 
File system interface
File system interfaceFile system interface
File system interfaceDayan Ahmed
 
File Management – File Concept, access methods, File types and File Operation
File Management – File Concept, access methods,  File types and File OperationFile Management – File Concept, access methods,  File types and File Operation
File Management – File Concept, access methods, File types and File OperationDhrumil Panchal
 

Ähnlich wie Files (20)

File Handling In C++(OOPs))
File Handling In C++(OOPs))File Handling In C++(OOPs))
File Handling In C++(OOPs))
 
SessionFive_ImportingandExportingData
SessionFive_ImportingandExportingDataSessionFive_ImportingandExportingData
SessionFive_ImportingandExportingData
 
Data file handling in python introduction,opening & closing files
Data file handling in python introduction,opening & closing filesData file handling in python introduction,opening & closing files
Data file handling in python introduction,opening & closing files
 
File Handling in C++
File Handling in C++File Handling in C++
File Handling in C++
 
Filepointers1 1215104829397318-9
Filepointers1 1215104829397318-9Filepointers1 1215104829397318-9
Filepointers1 1215104829397318-9
 
SessionTen_CaseStudies
SessionTen_CaseStudiesSessionTen_CaseStudies
SessionTen_CaseStudies
 
Files
FilesFiles
Files
 
Filehandlinging cp2
Filehandlinging cp2Filehandlinging cp2
Filehandlinging cp2
 
Data file handling in c++
Data file handling in c++Data file handling in c++
Data file handling in c++
 
File Handling In C++
File Handling In C++File Handling In C++
File Handling In C++
 
Files in Python.pptx
Files in Python.pptxFiles in Python.pptx
Files in Python.pptx
 
Files in Python.pptx
Files in Python.pptxFiles in Python.pptx
Files in Python.pptx
 
Bba203 unit 2data processing concepts
Bba203   unit 2data processing conceptsBba203   unit 2data processing concepts
Bba203 unit 2data processing concepts
 
Understanding EDP (Electronic Data Processing) Environment
Understanding EDP (Electronic Data Processing) EnvironmentUnderstanding EDP (Electronic Data Processing) Environment
Understanding EDP (Electronic Data Processing) Environment
 
File handling3 (1).pdf uhgipughserigrfiogrehpiuhnfi;reuge
File handling3 (1).pdf uhgipughserigrfiogrehpiuhnfi;reugeFile handling3 (1).pdf uhgipughserigrfiogrehpiuhnfi;reuge
File handling3 (1).pdf uhgipughserigrfiogrehpiuhnfi;reuge
 
Chapter 5
Chapter 5Chapter 5
Chapter 5
 
File Input/output, Database Access, Data Analysis with Pandas
File Input/output, Database Access, Data Analysis with PandasFile Input/output, Database Access, Data Analysis with Pandas
File Input/output, Database Access, Data Analysis with Pandas
 
File system interface
File system interfaceFile system interface
File system interface
 
File Management – File Concept, access methods, File types and File Operation
File Management – File Concept, access methods,  File types and File OperationFile Management – File Concept, access methods,  File types and File Operation
File Management – File Concept, access methods, File types and File Operation
 
File Organization
File OrganizationFile Organization
File Organization
 

Mehr von Hellen Gakuruh

Prelude to level_three
Prelude to level_threePrelude to level_three
Prelude to level_threeHellen Gakuruh
 
SessionThree_IntroductionToVersionControlSystems
SessionThree_IntroductionToVersionControlSystemsSessionThree_IntroductionToVersionControlSystems
SessionThree_IntroductionToVersionControlSystemsHellen Gakuruh
 
Introduction_to_Regular_Expressions_in_R
Introduction_to_Regular_Expressions_in_RIntroduction_to_Regular_Expressions_in_R
Introduction_to_Regular_Expressions_in_RHellen Gakuruh
 
SessionNine_HowandWheretoGetHelp
SessionNine_HowandWheretoGetHelpSessionNine_HowandWheretoGetHelp
SessionNine_HowandWheretoGetHelpHellen Gakuruh
 
SessionEight_PlottingInBaseR
SessionEight_PlottingInBaseRSessionEight_PlottingInBaseR
SessionEight_PlottingInBaseRHellen Gakuruh
 
SessionSeven_WorkingWithDatesandTime
SessionSeven_WorkingWithDatesandTimeSessionSeven_WorkingWithDatesandTime
SessionSeven_WorkingWithDatesandTimeHellen Gakuruh
 
SessionSix_TransformingManipulatingDataObjects
SessionSix_TransformingManipulatingDataObjectsSessionSix_TransformingManipulatingDataObjects
SessionSix_TransformingManipulatingDataObjectsHellen Gakuruh
 
SessionTwo_MakingFunctionCalls
SessionTwo_MakingFunctionCallsSessionTwo_MakingFunctionCalls
SessionTwo_MakingFunctionCallsHellen Gakuruh
 
SessionOne_KnowingRandRStudio
SessionOne_KnowingRandRStudioSessionOne_KnowingRandRStudio
SessionOne_KnowingRandRStudioHellen Gakuruh
 
Understanding_Markdowns_Pandoc_and_YALM
Understanding_Markdowns_Pandoc_and_YALMUnderstanding_Markdowns_Pandoc_and_YALM
Understanding_Markdowns_Pandoc_and_YALMHellen Gakuruh
 

Mehr von Hellen Gakuruh (19)

R training2
R training2R training2
R training2
 
R training6
R training6R training6
R training6
 
R training5
R training5R training5
R training5
 
R training4
R training4R training4
R training4
 
R training3
R training3R training3
R training3
 
R training
R trainingR training
R training
 
Prelude to level_three
Prelude to level_threePrelude to level_three
Prelude to level_three
 
Prelude to level_two
Prelude to level_twoPrelude to level_two
Prelude to level_two
 
SessionThree_IntroductionToVersionControlSystems
SessionThree_IntroductionToVersionControlSystemsSessionThree_IntroductionToVersionControlSystems
SessionThree_IntroductionToVersionControlSystems
 
Day 2
Day 2Day 2
Day 2
 
Day 1
Day 1Day 1
Day 1
 
Introduction_to_Regular_Expressions_in_R
Introduction_to_Regular_Expressions_in_RIntroduction_to_Regular_Expressions_in_R
Introduction_to_Regular_Expressions_in_R
 
SessionNine_HowandWheretoGetHelp
SessionNine_HowandWheretoGetHelpSessionNine_HowandWheretoGetHelp
SessionNine_HowandWheretoGetHelp
 
SessionEight_PlottingInBaseR
SessionEight_PlottingInBaseRSessionEight_PlottingInBaseR
SessionEight_PlottingInBaseR
 
SessionSeven_WorkingWithDatesandTime
SessionSeven_WorkingWithDatesandTimeSessionSeven_WorkingWithDatesandTime
SessionSeven_WorkingWithDatesandTime
 
SessionSix_TransformingManipulatingDataObjects
SessionSix_TransformingManipulatingDataObjectsSessionSix_TransformingManipulatingDataObjects
SessionSix_TransformingManipulatingDataObjects
 
SessionTwo_MakingFunctionCalls
SessionTwo_MakingFunctionCallsSessionTwo_MakingFunctionCalls
SessionTwo_MakingFunctionCalls
 
SessionOne_KnowingRandRStudio
SessionOne_KnowingRandRStudioSessionOne_KnowingRandRStudio
SessionOne_KnowingRandRStudio
 
Understanding_Markdowns_Pandoc_and_YALM
Understanding_Markdowns_Pandoc_and_YALMUnderstanding_Markdowns_Pandoc_and_YALM
Understanding_Markdowns_Pandoc_and_YALM
 

Files

  • 1. Files Hellen Gakuruh February 8, 2016 Table of Contents Definition and file names.....................................................................................................................................2 File formats.................................................................................................................................................................2 File Paths......................................................................................................................................................................2 Encoding.......................................................................................................................................................................4 ASCII..........................................................................................................................................................................4 UTF-8........................................................................................................................................................................4 Types of files ..............................................................................................................................................................5 Text files and binary files................................................................................................................................5 Text files.............................................................................................................................................................5 Binary files ........................................................................................................................................................6 Free format............................................................................................................................................................6 Delimited Files ................................................................................................................................................6 Other file types supported by base R ........................................................................................................6 Summary......................................................................................................................................................................7 References...................................................................................................................................................................7
  • 2. Definition and file names A file can be viewed as a computer storage device for information. Infomation can be anything from data to program specification. Each file has a name used to refer to it in a filing system. A filing system is the organization of files in a local disk, external device or network. Filing systems are responsible for how files are stored and how they are retrieved. Some of the recognisable filing systems are the legacy filling system called "File Allocation Table" (FAT) and a newer one called "New Technology File System" (NTFS). Most filing systems have restrictions on length and characters used in a file's name. It's therefore important to know your filing system and it's restrictions when naming files. File names usually end with an extension, these extensions indicate type of file and its content. For example, some of microsoft office documents end in ".docx", "xlsx", and ".pptx" extensions. Other notable extensions are; ".zip" for compressed folders, "csv" for comma separated files, ".txt" for text files, and of course R's ".R", ".Rmd", ".RData" and ".Rhistory" extensions. Since data files come in different types (as indicated by their extension), it's imperative to know the program that can open it. File formats File format is used to imply file storage in terms of it being human-readable (like ASCII) or machine-readable (binary) format and its organization. File Paths To access a file in any directory or folder, you must know its path. A path is a specific location on your disk, device or network. Paths are constructed by different components separated by different characters like "" or "/. Components of a path comprise of different folders, file name and its extension. Paths are hierarchical in nature; hierarchical in the sense that a file can be in a directory within several other directories. The current directory containing a required file is referred to as a child directory and its immediate directory as a parent director. To demostrate this, take an example of a file named "files" with an extension ".Rmd" (a R markdown file). This file is in a sub-folder called "Programming for Non-programmers" in the tutorial's folder "Data_Mgt_Analysis_and_Graphics_R" which is in my organization's folder called "Data Mania Inc" in "my documents folder", to source this file, I will need to specify all the parent directory that form the file's path (components of the path).
  • 3. Depending on the program you are using, a path can be specified in "reference to your current working directory" or its "full path". Full path would often begin with a drive like "c:" or "D:" on (Windows). When path is in reference to your current director, as is the case with R, then the path begins from your working directory followed by components leading to the file. Sometimes this would mean moving some directories above the working directory. For example, my current working directory is a folder called "Level One" in the "Class_Notes" folder stored in the tutorial's folder. To construct a path leading to the files.Rmd document, we need to move up the directories until we get to the tutorials folder (Data Mania Inc/Data_Mgt_Analysis_and_Graphics_R). From there we can specify the directories leading to the file. * Current working directory My Documents > Data Mania Inc > Data_Mgt_Analysis_and_Graphics_R > Class_Notes > Level One * Directory with needed file My Documents > Data Mania Inc > Data_Mgt_Analysis_and_Graphics_R > Programming for Non-programmers * Therefore we need to move up two positions (parent directories) and then move into one other directory In order for us to move up the directory hierachy, we will use special specifiers to denote parent directories. These specifiers are denoted with two dots or periods (one dot indicates current directory). Components of a path are separated with either a forward or a backslash depending on the operating system. Windows seems to be the only one using the backslash as all other systems use the forward slash as a directory separator. You can determine the path separator used in your platform with .Platform$file.sep * Here we are literary saying move two directories up then get into the programming folder and get an .Rmd file. Paths in R are enclosed in double/single quotation marks (character/string delimiters). "../../Programming for Non-programmers/Files.Rmd" Exercise on file paths Create full and relative paths to "tmp.csv". * File location My Documents > Data Mania Inc > Trainings > Online Class > Data set * Working Directory
  • 4. C:/Users/Hellen Gakuruh/Documents > Data Mania Inc > Data_Mgt_Analysis_and_Graphics_R > Programming for Non-programmers Encoding Encoding is a set of rules on how a file's content is converted from human readable format (like characters) to machine readable format (0s and 1s). There are numerous encoding of which R has provision for 374 encodings. If a wrong encoding is used, the content of a file can become unreadable to either you the user (human) or the computer (machine). Below we discuss some common encodings. ASCII ASCII (American Standard Code for Information Interchange) is a type of encoding that converts characters into binary form (machine format). This encoding was developed in 1963 and was quite common internet files like HTML until 2007 when it was superseded by UTF-8 encoding. Since ASCII was originally based on American English characters, there have been other non-English character encoding schemes, some based on ACSII. To best understand this, take for instance when you start a script on R's text editor (or any other text editor) and run it, the codes you type would need to be converted into a language the computer understand which is 1's (on) and 0's (off), this is done through an encoding scheme. ASCII encodes 128 characters (0-9, lower and upper case letters, punctuation characters, and control characters/non-printing characters e.g. tab) into 7 bits integers. These 128 characters are known as "character set", that is, a set of characters that can be encoded. Characters can be encoded using binary (base 2), octal (base 8), decimal (base 10) or hexadecimal (base 16) number system. Number system refers to the number of symbols used to represent numbers and is used to describe the quantity of something. Binary has two (binary) symbols 1 and 0 which form bit strings (bunch of bits) like 01000001 representing capital A, octal has eight symbols from 0 to 7, decimal (the system that we commonly use) has 10 symbols from 0 to 9, and hexadecimal has sixteen from 0 to 9 plus A to F. Hexadecimal are shorter and easier to represent than binary. This is what is returned by R when you convert to raw i.e. as.raw(10) becomes 0a Out of the 128 characters that ASCII can encode, only 95 are human readable characters, these characters are certainly not sufficient for non-English languages like Japanees, Chinese and most European countries. This fact led to other character encoding schemes like UTF-8. UTF-8 Universal Coded Character Set + Transformation Format - 8-bit (UTF-8) is a character encoding scheme that superseded ASCII. It's characters begin with the 128 ASCII characters (enabling backward compartibility) which are one byte, and includes other multi-byte
  • 5. characters to handle other characters like European and Asian languages as well as currency symbols. UTF-8 is now commonly used to encode characters in webbased files like html. Other common encoding schemes include "ANSI", "OEM", and "Unicode". Unicode is really not an encoding scheme but a collection of encoding schemes. It brings together other encoding schemes like ISO-8859 variants in Europe, Shift-JIS in Japan, or BIG-5 in China thereby allowing more characters that can work together; it's good for documents with multiple languages and/or characters. Types of files It's important to know the type of files that R can easily work with. Here we will discuss text and binary files as well as free format and other file types. Text files and binary files Text files A text file, also known as a flat file is an object that contains text organized as lines of text. The contents of a text file are known as plain text as they have little or no formatting (e.g. bold and italics). There are different types of text files indicated by their extensions. Windows has ".txt" extension created by a program called notepad. Other programs have their own text files with extensions denoting the programming language used. Encoding text files For English text files, the most common file encoding is ASCII, for other languages or characters, it is best to change the encoding to suit the locale and its character set. Newline/End of Line (EOL) When writing multi-line of text, there are specific characters/markers that are used to indicate end of a line and thereby the begining of a subsequent line. The most commonly used markers are "carriage return (CR)" and "Line feed (LF)". To understand these two concepts, think of the days of typewriters. When you finish typing a line of text, you would go back to the begining of that line using a lever device and then move to the next line. In the same way, when you type a text document you need to include markers that tell the program to move to the start of the line and then to the next line. In this regard, carriage return moves you to the begining of the line and line feed begins a new line. Different programs use different markers causing problems when moving from one program to another. Example of markers used by different programs include; CR, LF, and different combinations CR and LF. Programming languages provided some escape
  • 6. sequences to deal with this issue. Some of these are "r" denoting carriage return and "n" denoting new line. Binary files Binary files on the other hand contain binary characters 1 and 0 (computer readable format) as eight bit strings representing human readable characters like aphabetical letters. Good example of binary files are executable programs. Free format Free format are file in the public domain (open source) and therefore do not have any monetary implication as they are not restricted by copyrights, patents or trademarks. These files are also platform independent meaning they can be opened from any operating system as well as being machine readable. Some of the commonly used free format file types are: • Media files - JPEG and PNG • Text - Plain text, HTML, XHTML, LaTex • Archiving and compression - bzip2, gzip, tar • Others - Custom style sheets (CSS), comma separated values (CSV), JSON, RSS, YALM We will look at two commonly used free format files in R; text files and comma separated files also referred to as delimited files. Delimited Files Unlike text data that contain continuous lines of information, datasets are usually composed of lines with values separated by some characters like commas, tab, semi colon and the like. Each line corresponds to a row or a case and each value separated into a column or variable value. The characters used to separate each value in a line are known as "delimiters" and the file is therefore called a delimited file. Two of the commonly used delimited files in R are comma separate values (CSV) and Tab-delimited files. CSV files are separated by commas while tab-delimited files are separated by tabs. Delimited files are usually the easiest files to import or export data in R. Other file types supported by base R R supports other file types, these include: • Debian control file (DCF): DCF are simple file format for storing databases in plain text files • Binary data files (Bin): Used to transfer data to binary or binary to text
  • 7. • Fixed width file format: These are text file with column and character alignment formating. They are manily used to organize data in a certain way. All proprietary type of files like excel, SPSS, Stata and SAS can only be handled by contributed or add-on packages. Summary In summary, the key pointers that you should grasp from this write-up is how to specify a path to a document, understand encoding used in a file and the files types used in R. With that you should be able to read in and export data and output without difficulty. But in the event that you experience some challenges, you are able to identify its cause (which is half of the problem). Solution Full and relative paths to the file are: * Full path "C:/Users/Hellen Gakuruh/Documents/Data Mania Inc/Trainings/Online Class/Data set/tmp.csv" * Relative path "../../Trainings/Online Class/Data set/tmp.csv" References 1. http://kunststube.net/encoding/ 2. https://en.wikipedia.org/wiki/ASCII 3. https://en.wikipedia.org/wiki/Text_file 4. https://www.cs.umd.edu/class/sum2003/cmsc311/Notes/BitOp/asciiBin.html 5. http://tibasicdev.wikidot.com/binandhex 6. http://www.oualline.com/practical.programmer/eol.html