Data Types

  1. Dr. Carlos Rodríguez Contreras, UNAM
  2. Statistical Science
     - Descriptive statistics: collecting, presenting, and describing data.
     - Inferential statistics: drawing conclusions and/or making decisions concerning a population based only on sample data.
  3. Descriptive Statistics
     - Collect data, e.g., survey, observation, experiments.
     - Present data, e.g., charts and graphs.
     - Characterize data, e.g., the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
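The sample mean mentioned on this slide can be computed in R as a minimal sketch. The vector `x` is made-up illustration data, not taken from the slides.

```r
# Hypothetical sample of n = 5 observations (illustration only)
x <- c(2, 4, 4, 6, 9)

# Sample mean from its definition: sum of the observations divided by n
manual_mean <- sum(x) / length(x)

# The built-in mean() function gives the same result
builtin_mean <- mean(x)

print(manual_mean)
print(builtin_mean)
```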
  4. Data Sources
     - Primary (data collection): observation, experimentation, survey.
     - Secondary (data compilation): print or electronic.
  5. Data Types
     - Qualitative (categorical): defined categories. Examples: marital status, political party, eye color.
     - Quantitative (numerical)
       - Discrete: counted items. Examples: number of children, defects per hour.
       - Continuous: measured characteristics. Examples: weight, voltage.
  6. Data Types
     - Time series data: ordered data values observed over time.
     - Cross section data: data values observed at a fixed point in time.
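  7. Data Types: Sales (in £1000’s)

     |         | 2013 | 2014 | 2015 | 2016 |
     |---------|------|------|------|------|
     | London  | 435  | 460  | 475  | 490  |
     | York    | 320  | 345  | 375  | 395  |
     | Bristol | 405  | 390  | 410  | 395  |
     | Kent    | 260  | 270  | 285  | 280  |

     Reading across a row (one city over the years) gives time series data; reading down a column (all cities in one year) gives cross section data.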
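As a rough sketch, the sales figures from this slide can be stored in an R matrix, where one row is a time series and one column is a cross section. The object names (`sales`, `london_series`, `cross_2015`) are chosen here for illustration.

```r
# Sales (in £1000's) from the slide, one row per city and one column per year
sales <- matrix(
  c(435, 460, 475, 490,
    320, 345, 375, 395,
    405, 390, 410, 395,
    260, 270, 285, 280),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("London", "York", "Bristol", "Kent"),
                  c("2013", "2014", "2015", "2016"))
)

# Time series data: one city observed over time (a row)
london_series <- sales["London", ]

# Cross section data: all cities at a fixed point in time (a column)
cross_2015 <- sales[, "2015"]

print(london_series)
print(cross_2015)
```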
  8. Data Measurement Levels
     - Ratio/interval data: highest level, complete analysis. Example: measurements.
     - Ordinal data: higher level, mid-level analysis. Examples: rankings, ordered categories.
     - Nominal data: lowest level, basic analysis. Examples: categorical codes, ID numbers, category names.
  9. Data Measurement Levels
  10. Attributes of NOIR Data Types
  11. Nominal scales
     - A nominal scale of measurement only indicates the category of a variable that a case falls into: it expresses qualitative differences but not quantitative differences, and as such data at this level are often referred to as qualitative data.
     - A nominal scale only allows us to say that one case may be different from another.
     - There is no ‘natural’ order to the arrangement of categories.
     - Often identified by an ‘Other’ category.
  12. Ordinal scales
     - Consider that we operationalise age so that we measure its variation by recording whether someone is young (18 years or less), middle aged (19-60 years), or old (over 60 years).
     - We can say one case may be different from another in terms of age, and
     - we can say one case may have more or less age than another, but
     - we cannot say how much more age one case may have as compared to another.
  13. Ordinal scales (cont.)
     - An ordinal level of measurement, in addition to the function of classification, allows cases to be ordered by degree according to measurements of the variable.
     - But we cannot quantify the amount of difference: there is no unit of measurement like years or dollars.
     - Ordinal scales are particularly common when measuring attitude or satisfaction in opinion surveys.
     - Yes/No responses are often ordinal, e.g. “Do you enjoy statistics (Yes/No)?”
       - We can say that someone who answers ‘Yes’ has more enjoyment of statistics than someone who responds ‘No’, but
       - we can’t say how much more enjoyment of statistics they have.
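One way to mirror an ordinal scale in R is an ordered factor, sketched below with the age categories from the ordinal scales slides. The four cases are made up for illustration. Order comparisons work, while arithmetic on the categories is rejected, which matches the point that differences cannot be quantified.

```r
# Hypothetical cases recorded on the ordinal age scale from the slides
age_group <- factor(
  c("young", "old", "middle aged", "young"),
  levels = c("young", "middle aged", "old"),
  ordered = TRUE
)

# Order comparisons are meaningful: is case 2 older than case 1?
second_is_older <- age_group[2] > age_group[1]

# Which cases are in a higher category than "young"?
is_older <- age_group > "young"

# "How much older" is not defined: R returns NA (with a warning, suppressed here)
no_difference <- suppressWarnings(age_group[2] - age_group[1])

print(second_is_older)
print(is_older)
print(no_difference)
```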
  14. Interval/ratio scales
     - The key characteristic of an interval/ratio scale is that it has units measuring intervals of equal distance between values on the scale.
     - Consider the variable ‘age’. This can be defined operationally as ‘age in whole years at last birthday’.
     - Having defined age this way, our measurements of people’s age will allow us to say:
       - one case may be different from another in terms of age, and
       - one case may have more or less age than another, and
       - how much more age one case may have as compared to another.
  15. Types of Data: In all scientific disciplines, we are obliged to understand Stevens’ data classification...
  16. Types of Data
  17. Although Stevens’ taxonomy has permeated all scientific disciplines, we still need to characterize data to match the way digital computers work.
  18. Factor data
     - When we look at many variables, some may simply record categories used to group the data.
     - In R we will use factors to store these variables.
     - An example might be the browser a user has used to view a web site, as gleaned from a web site log.
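A minimal sketch of the browser example follows. The browser names and page-load times are invented for illustration, not taken from a real log.

```r
# Hypothetical browsers read from a web site log
browser <- factor(c("Firefox", "Chrome", "Chrome", "Safari", "Chrome"))

# Hypothetical page-load times (seconds) for the same five visits
load_time <- c(1.2, 0.8, 1.0, 1.5, 0.9)

# A factor groups the data: counts per category ...
counts <- table(browser)

# ... and summaries of another variable within each category
mean_by_browser <- tapply(load_time, browser, mean)

print(counts)
print(mean_by_browser)
```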
  19. Character data
     - Some categorical data are factors, but others are really just identifiers, and are not used for grouping.
     - An example might be a user’s IP address. This is basically a unique code identifying a computer, like an address.
     - While both factor and character data are “nominal”, we keep the distinction as we will interact with such data in R differently.
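The distinction can be sketched as follows. The IP addresses are placeholders from the ranges reserved for documentation. Converting such identifiers to a factor is possible, but it yields one level per case, so nothing is actually grouped.

```r
# Hypothetical identifiers (addresses from documentation-only ranges)
ip <- c("192.0.2.10", "192.0.2.25", "198.51.100.7")

# Quoted values are stored as character data
ip_class <- class(ip)

# Identifiers are unique, so there are no repeated values to group by
has_duplicates <- anyDuplicated(ip) > 0

# As a factor, every case would get its own level
n_levels <- nlevels(factor(ip))

print(ip_class)
print(has_duplicates)
print(n_levels)
```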
  20. Discrete data
     - Discrete data comes from measurements where there are essentially only distinct and separate possible values that can be counted.
     - For example, the number of visits a person makes to our web site will always be integer data, as will other counting data.
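A small sketch with made-up visit counts shows counting data stored as integers, and that cases typically share values.

```r
# Hypothetical number of visits for six users (the L suffix makes integers)
visits <- c(3L, 1L, 4L, 1L, 5L, 1L)

# Counting data is stored with the integer class
visits_class <- class(visits)

# Several cases share the same value, as expected for discrete data
value_counts <- table(visits)

print(visits_class)
print(value_counts)
```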
  21. Continuous data
     - Continuous data is that which could conceivably come from a continuum of values.
     - The recording of the time in milliseconds of a visit to a web site might be such data.
     - A useful distinction is that for discrete data we expect that cases will share values, whereas for continuous data this will be impossible, or at least very unlikely.
     - There is no fine line though.
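The shared-values distinction, and the lack of a fine line, can be illustrated with invented visit durations: at millisecond resolution no two cases coincide, but after rounding to whole seconds some do.

```r
# Hypothetical visit durations in milliseconds
visit_ms <- c(1532.25, 847.5, 2210.75, 1532.5)

# Stored as double-precision (floating point) numbers
storage <- typeof(visit_ms)

# At this resolution no two cases share a value
ties_ms <- anyDuplicated(visit_ms) > 0

# Coarsening the resolution to whole seconds produces shared values
visit_s <- round(visit_ms / 1000)
ties_s <- anyDuplicated(visit_s) > 0

print(storage)
print(ties_ms)
print(ties_s)
```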
  22. Date and time data
     - Time data can be considered continuous or discrete depending on resolution. For computers there are often entirely separate ways to handle date and time data.
     - People in finance want millisecond data, but over long time ranges this recording can literally run out of numbers on a computer.
     - Astronomers need precise measurements for durations down to leap seconds.
     - R has several ways to work with such data that go beyond just storing the values as simple numbers.
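As a brief sketch of what base R offers beyond plain numbers, the `Date` and `POSIXct` classes support calendar-aware arithmetic. The dates and times below are arbitrary examples.

```r
# Dates: differences respect the calendar (2016 is a leap year)
d1 <- as.Date("2016-02-01")
d2 <- as.Date("2016-03-01")
days_between <- as.numeric(d2 - d1)

# Date-times: POSIXct stores seconds, so adding 90 means 90 seconds later
t1 <- as.POSIXct("2016-03-14 09:30:00", tz = "UTC")
t2 <- t1 + 90
elapsed_secs <- as.numeric(difftime(t2, t1, units = "secs"))

print(class(d1))
print(days_between)
print(format(t2, "%H:%M:%S"))
```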
  23. Data types in R
  24. To organise data, R assigns a class attribute to most R objects and otherwise creates an implicit class for an object. The class of an object is used to determine how it should be printed. The class function will return the class of an object.
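A short sketch of the `class` function: a plain integer vector has only an implicit class (no class attribute), while a factor carries an explicit class attribute that changes how it is printed. The objects are illustrative.

```r
# A plain vector: class() reports an implicit class, but no attribute is set
n <- 1:3
n_class <- class(n)
n_has_attr <- !is.null(attr(n, "class"))

# A factor: the class is stored as an attribute on the object
f <- factor(c("a", "b", "a"))
f_attr <- attr(f, "class")

# Printing depends on the class: with and without the factor class
print(f)
print(unclass(f))
```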
  25. Numeric data types
     - The two main classes for numeric data are numeric and integer, though there are others, e.g. complex. Most of the time numbers are numeric.
     - To make an integer value, we need to work a bit: we can preallocate space for an integer data set of length n with integer(n); we can use the suffix L to force a number to be treated as an integer (e.g., 1L); we can coerce numeric values to integer type through the as.integer function.
     - Numeric values are stored using floating point representation.
     - This format can store much larger integer values than the integer class and has a much wider range of numbers it can represent.
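The three ways of making integers, and the difference in range between the two classes, can be sketched as follows. Variable names are illustrative.

```r
x <- 1     # numeric by default
y <- 1L    # the L suffix forces an integer

# Preallocate an integer vector of length 3 (filled with zeros)
counts <- integer(3)

# Coerce a numeric value to integer (the fractional part is dropped)
truncated <- as.integer(3.9)

# The integer class has a limited range ...
largest_int <- .Machine$integer.max
overflow <- suppressWarnings(largest_int + 1L)   # NA: out of integer range

# ... while numeric (floating point) values go far beyond it
big <- 2^31

print(c(class(x), class(y)))
print(largest_int)
print(overflow)
print(big)
```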
  26. Categorical data types
     - Character data. Character data is created just by quoting values.
     - Quotes can be matching pairs of single or double quotes, though double quotes are preferred and used to display character values.
     - Within a quoted value a quote symbol can be used, but it must be escaped by prefixing it with a backslash.
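The quoting rules can be tried out with a few illustrative strings:

```r
a <- "double quotes"
b <- 'single quotes'   # equivalent, but R displays it with double quotes

# A quote symbol inside a quoted value is escaped with a backslash
quote_inside <- "He said \"hello\""
apostrophe <- 'It\'s fine'

# print() shows the escapes; cat() shows the raw characters
print(b)
print(quote_inside)
cat(quote_inside, "\n")

# The backslashes are not part of the stored value
n_chars <- nchar(quote_inside)
print(n_chars)
```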
  27. Categorical data types
     - Factors. A factor can be made from a character vector with the factor function.
     - The levels of a factor are a list of all possible categories for the data in the factor.
     - They need not all be represented in a particular factor, but when we create a factor through factor the default choice is simply the collection of unique values.
     - The current levels of a factor are returned by the levels function.
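A small sketch of default versus explicit levels, using invented eye-color data. The extra level "hazel" is a possible category that happens not to occur in the data.

```r
# Hypothetical eye colors for four cases
eye <- c("brown", "blue", "brown", "green")

# Default: the levels are the (sorted) unique values
eye_default <- factor(eye)
default_levels <- levels(eye_default)

# Explicit levels may include categories not represented in the data
eye_full <- factor(eye, levels = c("blue", "brown", "green", "hazel"))
full_counts <- table(eye_full)

print(default_levels)
print(levels(eye_full))
print(full_counts)
```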
  28. Date and time types
     - Working with dates and times is made more convenient using a special data type.
     - While R has some built-in features to work with dates and times, the lubridate package simplifies the usage.
     - This package introduces the notion of “instants,” “durations,” and “intervals” of time.
     - We concern ourselves with some basics, learning how to make and manipulate instants of time.
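A minimal sketch of instants, durations, and intervals follows, assuming the lubridate package is installed (for example via `install.packages("lubridate")`). The time stamp itself is an arbitrary example.

```r
# Assumption: lubridate is installed; it is not part of base R
library(lubridate)

# An instant: a specific moment in time, parsed from text
start <- ymd_hms("2016-03-14 09:30:00", tz = "UTC")

# A duration: an exact length of time (90 minutes), added to an instant
later <- start + dminutes(90)

# An interval: the span between two instants; int_length() gives seconds
span <- interval(start, later)
span_seconds <- int_length(span)

# Accessor functions pull components out of an instant
print(year(start))
print(month(start))
print(hour(later))
print(span_seconds)
```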
  29. Logical data
     - R uses TRUE and FALSE to represent Boolean or logical data.
     - Logical data is produced by many R functions, for example the “is” functions.
     - Most common is the use of the comparison operators (<, <=, ==, !=, >=, >) to produce logical values.
     - The operators ! (for not), & (for and), and | (for or) can be used to combine values.
     - The functions any, all, and which, and the operator %in%, are useful for working with logical vectors. The any and all functions answer whether any of the values are TRUE or whether all the values are TRUE.
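These operators and functions can be sketched on a small made-up vector of visit counts:

```r
# Hypothetical visit counts for five users
visits <- c(3, 0, 7, 1, 0)

# An "is" function and a comparison both produce logical values
numeric_check <- is.numeric(visits)
is_active <- visits > 0

# Combining logical vectors with &, | and !
light_user <- is_active & visits < 5
idle_or_heavy <- !is_active | visits > 5

# Summaries and lookups on logical vectors
any_heavy <- any(visits > 5)
all_valid <- all(visits >= 0)
idle_positions <- which(visits == 0)
has_three <- 3 %in% visits

print(is_active)
print(light_user)
print(idle_or_heavy)
print(idle_positions)
```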