2. DATA ANALYSIS
Data analysis is defined as the process of
systematically searching and arranging interview
transcripts, observation notes, or other materials
that the researcher accumulates to increase the
understanding of the phenomenon.
3. DATA ANALYSIS
Qualitative research yields mainly unstructured text-based
data in the form of:
Interview transcripts
Observation notes
Diary entries
Records
4. DATA ANALYSIS
Data analysis in qualitative research is more of a
dynamic, intuitive, and creative process of inductive
reasoning, thinking, and theorizing.
In contrast to quantitative research, which uses statistical
methods, qualitative research focuses on the exploration
of values, meanings, beliefs, thoughts, experiences, and
feelings characteristic of the phenomenon under
investigation.
5. DATA ANALYSIS
The process of analyzing qualitative data predominantly involves
coding or categorizing the data. Basically, it involves making
sense of huge amounts of data by reducing the volume of raw
information, then identifying significant patterns, drawing
meaning from the data, and subsequently building a logical
chain of evidence.
6. SCALES OF MEASUREMENT
Data can be classified as being on one of four (4)
scales:
1. Nominal
2. Ordinal
3. Interval
4. Ratio
7. SCALES OF MEASUREMENT
Nominal Scale
Nominal variables (also called categorical variables) can
be placed into categories. They don’t have a numeric
value, so they cannot be added, subtracted, divided, or
multiplied. They also have no order; if they appear to
have an order, then they are ordinal variables instead.
8. SCALES OF MEASUREMENT
Nominal Scale
The nominal scale of measurement only satisfies the
identity property of measurement. Values assigned to
variables represent a descriptive category, but have no
inherent numerical value with respect to magnitude.
9. SCALES OF MEASUREMENT
Nominal Scale
Gender is an example of a variable that is measured on a
nominal scale. Individuals may be classified as "male" or
"female", but neither value represents more or less
"gender" than the other. Religion and political affiliation
are other examples of variables that are normally
measured on a nominal scale
10. SCALES OF MEASUREMENT
Ordinal Scale
The ordinal scale contains things that you can place in
order. For example, hottest to coldest, lightest to
heaviest, richest to poorest. Basically, if you can rank
data by 1st, 2nd, 3rd place (and so on), then you have data
that’s on an ordinal scale
11. SCALES OF MEASUREMENT
Ordinal Scale
The ordinal scale has the property of both identity and
magnitude. Each value on the ordinal scale has a unique
meaning, and it has an ordered relationship to every
other value on the scale
12. SCALES OF MEASUREMENT
Ordinal Scale
An example of an ordinal scale in action would be the results
of a horse race, reported as “win”, “place”, and “show”. The
rank order in which horses finished the race is known. The
horse that won finished ahead of the horse that placed, and
the horse that placed finished ahead of the horse that
showed. However, we cannot tell from this ordinal scale
whether it was a close race or whether the winning horse won
by a mile
13. SCALES OF MEASUREMENT
Interval Scale
An interval scale has ordered numbers with meaningful
divisions.
Temperature is on the interval scale: a difference of 10
degrees between 90 and 100 means the same as 10 degrees
between 150 and 160. Compare that to high school ranking
(which is ordinal), where the difference between 1st and 2nd
might be .01 and between 10th and 11th .5. If you have
meaningful divisions, you have something on the interval
scale
14. SCALES OF MEASUREMENT
Interval Scale
The interval scale of measurement has the properties of identity, magnitude, and
equal intervals.
A perfect example of an interval scale is the Fahrenheit scale to measure
temperature. The scale is made up of equal temperature units, so that the
difference between 40 and 50 degrees Fahrenheit is equal to the difference
between 50 and 60 degrees Fahrenheit.
With an interval scale, you know not only whether different values are bigger or
smaller, you also know how much bigger or smaller they are. For example,
suppose it is 60 degrees Fahrenheit on Monday and 70 degrees on Tuesday. You
know not only that it was hotter on Tuesday, you also know that it was 10 degrees
hotter
15. SCALES OF MEASUREMENT
Ratio Scale
The ratio scale is exactly the same as the interval scale
with one major difference: zero is meaningful. For
example, a height of zero is meaningful (it means you
don’t exist). Compare that to a temperature of zero, which
while it exists, it doesn’t mean anything in particular
(although admittedly, in the Celsius scale it’s the freezing
point for water)
16. SCALES OF MEASUREMENT
Ratio Scale
The ratio scale of measurement satisfies all four of the
properties of measurement: identity, magnitude, equal
intervals, and a minimum value of zero.
The weight of an object would be an example of a ratio
scale. Each value on the weight scale has a unique
meaning, weights can be rank ordered, units along the
weight scale are equal to one another, and the scale has a
minimum value of zero.
17. TYPES OF DATA ANALYSIS
1. Content analysis
2. Narrative analysis
3. Discourse analysis
4. Framework analysis
5. Grounded theory
18. TYPES OF DATA ANALYSIS
Content analysis
This refers to the process of categorizing verbal or
behavioural data to classify, summarize, and tabulate the
data.
19. TYPES OF DATA ANALYSIS
Narrative analysis.
This method involves the reformulation of stories
presented by respondents, taking into account the
context of each case and different experiences of each
respondent. In other words, narrative analysis is the
revision of primary qualitative data by the researcher.
20. TYPES OF DATA ANALYSIS
Discourse analysis
A method of analysis of naturally occurring talk and all
types of written text.
21. TYPES OF DATA ANALYSIS
Framework analysis
This is a more advanced method that consists of several
stages such as familiarization, identifying a thematic
framework, coding, charting, mapping, and interpretation.
22. TYPES OF DATA ANALYSIS
Grounded theory
This method of qualitative data analysis starts with an
analysis of a single case to formulate a theory. Then,
additional cases are examined to see if they contribute to
the theory.
23. OTHER TYPES OF DATA ANALYSIS
1. Phenomenological Method
2. Ethnographic Model
3. Grounded Theory Method
4. Case Study Model
5. Historical Model
6. Narrative Model
24. TYPES OF DATA ANALYSIS
Phenomenological Method
Describing how any one participant experiences a specific event is the goal
of the phenomenological method of research.
This method utilizes interviews, observation and surveys to gather
information from subjects. Phenomenology is highly concerned with how
participants feel about things during an event or activity. Businesses use this
method to develop processes to help sales representatives effectively close
sales using styles that fit their personality.
25. TYPES OF DATA ANALYSIS
Ethnographic Model
The ethnographic model is one of the most popular and widely recognized
methods of qualitative research; it immerses the researcher in a culture
that may be unfamiliar to them.
The goal is to learn and describe the culture's characteristics much the same
way anthropologists observe the cultural challenges and motivations that
drive a group. This method often immerses the researcher as a subject for
extended periods of time. In a business model, ethnography is central to
understanding customers. Testing products personally or in beta groups
before releasing them to the public is an example of ethnographic research.
26. TYPES OF DATA ANALYSIS
Grounded Theory Method
The grounded theory method tries to explain why a course of action evolved
the way it did.
Grounded theory looks at large subject numbers. Theoretical models are
developed based on existing data in existing modes of genetic, biological, or
psychological science. Businesses use grounded theory when conducting
user or satisfaction surveys that target why consumers use company
products or services. This data helps companies maintain customer
satisfaction and loyalty.
27. TYPES OF DATA ANALYSIS
Case Study Model
Unlike grounded theory, the case study model provides an in-depth look at
one test subject. The subject can be a person or family, business or
organization, or a town or city.
Data is collected from various sources and compiled using the details to
create a bigger conclusion. Businesses often use case studies when
marketing to new clients to show how their business solutions solve a
problem for the subject
28. TYPES OF DATA ANALYSIS
Historical Model
The historical method of qualitative research describes past events in order
to understand present patterns and anticipate future choices.
This model answers questions based on a hypothetical idea and then uses
resources to test the idea for any potential deviations. Businesses can use
historical data of previous ad campaigns and the targeted demographic and
split-test it with new campaigns to determine the most effective campaign.
29. TYPES OF DATA ANALYSIS
Narrative Model
The narrative model occurs over extended periods of time and compiles
information as it happens.
Like a story narrative, it takes subjects at a starting point and reviews
situations as obstacles or opportunities occur, although the final narrative
doesn't always remain in chronological order. Businesses use the narrative
method to define buyer personas and use them to identify innovations that
appeal to a target market
31. APPROACHES TO DATA ANALYSIS
Deductive Approach
The deductive approach involves analyzing qualitative data based
on a structure that is predetermined by the researcher.
In this case, a researcher can use the questions as a guide for
analyzing the data. This approach is quick and easy and can be
used when a researcher has a fair idea about the likely responses
that will be received from the sample population
32. APPROACHES TO DATA ANALYSIS
Inductive Approach
The inductive approach, on the contrary, is not based on a
predetermined structure or set of ground rules/framework.
This is a more time-consuming but thorough approach to
qualitative data analysis. The inductive approach is often used when a
researcher has very little or no idea of the research phenomenon
33. VARIABLE DESCRIPTIVE ANALYSIS
1. Univariate Analysis – contains a single variable
2. Bivariate Analysis – contains two variables
3. Multivariate Analysis – contains multiple variables
34. VARIABLE DESCRIPTIVE ANALYSIS
Univariate Analysis
the simplest form of data analysis where the data being
analyzed contains only one variable.
Because there is only a single variable, it does not deal
with causes or relationships. The main purpose of
univariate analysis is to describe the data and find
patterns that exist within it
35. VARIABLE DESCRIPTIVE ANALYSIS
Univariate Analysis
Some ways to describe patterns in univariate data are by
looking at the mean, mode, median, range, variance,
maximum, minimum, quartiles, and standard deviation.
Additionally, some ways you may display univariate data
include frequency distribution tables, bar charts, histograms,
frequency polygons, and pie charts
37. VARIABLE DESCRIPTIVE ANALYSIS
Bivariate Analysis
Something as simple as creating a scatterplot by plotting
one variable against another on a Cartesian plane (think
X and Y axis) can sometimes give a picture of what the
data is trying to indicate.
If the data seems to fit a line or curve, then there is a
relationship or correlation between the two variables
40. CHARACTERISTICS OF DATA
Frequency Distribution - a tabular representation of a survey data set used
to organize and summarize the data. Specifically, it is a list of either
qualitative or quantitative values that a variable takes in a data set and the
associated number of times each value occurs (frequencies)
Measures of Central Tendency - a summary statistic that represents the
center point or typical value of a dataset. In statistics, the three most
common measures of central tendency are the mean, median, and mode.
Each of these measures calculates the location of the central point using a
different method.
41. FREQUENCY DISTRIBUTION
There are four important characteristics of frequency
distribution:
1. Measures of central tendency and location (mean, median,
mode)
2. Measures of dispersion (range, variance, standard deviation)
3. The extent of symmetry/asymmetry (skewness)
4. The flatness or peakedness (kurtosis)
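As an illustration of the first point, a frequency distribution for a categorical variable can be built by simply counting how often each value occurs. This is a minimal sketch using hypothetical survey responses:

```python
from collections import Counter

# Hypothetical survey responses (a nominal/categorical variable)
responses = ["agree", "neutral", "agree", "disagree", "agree", "neutral"]

# Frequency distribution: each value the variable takes and its frequency
freq = Counter(responses)

for value, count in freq.most_common():
    print(f"{value}: {count}")
# agree: 3
# neutral: 2
# disagree: 1
```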
43. MEASURES OF CENTRAL TENDENCY
There are three main measures of central tendency:
1. Mode
2. Median
3. Mean
Each of these measures describes a different indication of the
typical or central value in the distribution. The mode is the most
commonly occurring value in a distribution
44. MEASURES OF CENTRAL TENDENCY
Mode
- the value in a list of numbers that occurs most
frequently. Unlike the median and mean, the mode
is about the frequency of occurrence. There can be more
than one mode or no mode at all; it all depends on the
data set itself
45. MEASURES OF CENTRAL TENDENCY
Median
- the middle number when the data are listed in order from
least to greatest (with an even number of values, the mean
of the two middle numbers)
46. MEASURES OF CENTRAL TENDENCY
Mean
– refers to the average. To calculate the mean, add
together all of the numbers in your data set. Then divide
that sum by the number of addends
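The three measures above are easy to compute with Python's standard statistics module; the data set below is hypothetical:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # hypothetical data set

mean = statistics.mean(data)      # (2 + 3 + 3 + 5 + 7 + 10) / 6 = 5
median = statistics.median(data)  # middle two values averaged: (3 + 5) / 2 = 4
mode = statistics.mode(data)      # most frequent value: 3
```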
47. VARIANCE IN DATA
The variance (σ2) is a measure of how far each value in the
data set is from the mean.
Variance is a measure of how spread out a data set is. It's
useful when creating statistical models since low variance
can be a sign that you are over-fitting your data.
Here is how it is computed: subtract the mean from each
value in the data. This gives you a measure of the distance
of each value from the mean; the variance is the average of
the squared distances.
48. VARIANCE IN DATA
Calculating Variance of a Sample
Write down your sample data set
Write down the sample variance formula
Calculate the mean of the sample
Subtract the mean from each data point
Square each result
Find the sum of the squared values
Divide by n - 1, where n is the number of data points.
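The steps above can be sketched directly in Python, using a small hypothetical sample:

```python
# Sample variance: s² = Σ(xᵢ − x̄)² / (n − 1)
data = [4, 8, 6, 2]                      # hypothetical sample

n = len(data)
mean = sum(data) / n                     # x̄ = 20 / 4 = 5
deviations = [x - mean for x in data]    # [-1, 3, 1, -3]; these sum to zero
squared = [d ** 2 for d in deviations]   # [1, 9, 1, 9]
variance = sum(squared) / (n - 1)        # 20 / 3 ≈ 6.67
```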
50. VARIANCE IN DATA
Write down the sample
variance formula
The variance of a data set tells
you how spread out the data
points are. The closer the
variance is to zero, the more
closely the data points are
clustered together.
51. VARIANCE IN DATA
Calculate the mean of the
sample
The symbol x̅ or "x-bar" refers to
the mean of a sample. Calculate
this as you would any mean: add
all the data points together, then
divide by the number of data
points.
52. VARIANCE IN DATA
Subtract the mean from each
data point
Your answers should add up to
zero. This is due to the definition
of the mean, since the negative
answers (distances from the mean to
smaller numbers) exactly cancel
out the positive answers
(distances from the mean to larger
numbers)
53. VARIANCE IN DATA
Square each result
This means the "average deviation"
will always be zero as well, so that
doesn't tell anything about how
spread out the data is. To solve this
problem, find the square of each
deviation. This will make them all
positive numbers, so the negative
and positive values no longer cancel
out to zero
54. VARIANCE IN DATA
Find the sum of the squared
values
∑ tells you to sum the value of
the following term for each
value of xᵢ.
Because (xᵢ − x̄)² is already
calculated for each data point,
all you need to do is add the results.
56. VARIANCE VS. STANDARD DEVIATION
Variance is a numerical value that describes the variability of
observations from its arithmetic mean
Standard deviation is a measure of dispersion of observations
within a data set
Variance is nothing but an average of squared deviations. On the
other hand, the standard deviation is the root mean square
deviation
57. VARIANCE VS. STANDARD DEVIATION
A variance of zero indicates that all of the data values are identical. A
high variance indicates that the data points are very spread out from
the mean, and from one another. Variance is the average of the
squared distances from each point to the mean.
Standard deviation is a number used to tell how measurements for a
group are spread out from the average (mean), or expected value. A
low standard deviation means that most of the numbers are close to
the average. A high standard deviation means that the numbers are
more spread out.
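The relationship between the two is simply that the standard deviation is the square root of the variance, which a few lines of Python confirm (hypothetical data):

```python
import math
import statistics

data = [4, 8, 6, 2]                # hypothetical data set
var = statistics.variance(data)    # sample variance
sd = statistics.stdev(data)        # sample standard deviation

# Standard deviation is the square root of the variance
print(math.isclose(sd, math.sqrt(var)))  # True
```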
58. RELATIONSHIPS BETWEEN
VARIABLES
The statistical relationship between two variables is
referred to as their correlation.
A correlation could be positive, meaning
both variables move in the same direction, or negative,
meaning that when the value of one variable increases,
the value of the other variable decreases
59. ASPECTS OF ASSOCIATION
BETWEEN VARIABLES
Association between two variables means the values of
one variable relate in some way to the values of the
other.
Association is usually measured by correlation for two
continuous variables and by cross tabulation and a
Chi-square test for two categorical variables.
60. ASPECTS OF ASSOCIATION
BETWEEN VARIABLES
Chi Square Test
relating to or denoting a statistical method assessing the
goodness of fit between observed values and those
expected theoretically.
Commonly used for testing relationships between
categorical variables. The null hypothesis of the Chi-Square
test is that no relationship exists between the categorical
variables in the population; they are independent.
61. ASPECTS OF ASSOCIATION
BETWEEN VARIABLES
Chi Square Test
The statistic is χ²c = ∑ (O − E)² / E, where the
subscript “c” is the degrees of freedom,
“O” is your observed value and E is your expected value
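The statistic itself is straightforward to compute; this sketch uses hypothetical observed and expected counts for two categories:

```python
# χ² = Σ (O − E)² / E, summed over every category
observed = [30, 20]   # hypothetical observed counts
expected = [25, 25]   # counts expected under the null hypothesis

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_square)  # (30-25)²/25 + (20-25)²/25 = 2.0
```

Looking the result up in a chi-square table (with the appropriate degrees of freedom) then tells you whether to reject the null hypothesis of independence.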
62. MEASURES OF ASSOCIATION
BETWEEN VARIABLES
The measures of association refer to a wide variety of
coefficients (including bivariate correlation and
regression coefficients) that measure the strength and
direction of the relationship between variables;
these measures of strength, or association, can be
described in several ways, depending on the analysis.
63. MEASURES OF ASSOCIATION
BETWEEN VARIABLES
For measures of association, a value of zero signifies that no
relationship exists.
In a correlation analysis, if the coefficient (r) has a value of
one, it signifies a perfect relationship between the variables of
interest.
In regression analyses, if the standardized beta weight (β)
has a value of one, it also signifies a perfect relationship between
the variables of interest.
65. STATISTICAL MEASURES OF
RELATIONSHIPS
Correlation Coefficient
a measure of the relationship between two or more variables or
sets of data. It is expressed in the form of a coefficient, with +1.00
indicating a perfect positive correlation, -1.00 indicating a
perfect inverse correlation, and 0.00 indicating a complete
lack of a relationship.
66. STATISTICAL MEASURES OF
RELATIONSHIPS
Correlation Coefficient
Pearson's Product Moment Coefficient (r) is the most often
used and most precise coefficient, and is generally used with
continuous variables
Spearman Rank Order Coefficient (ρ, "rho") is a form of the Pearson's
Product Moment Coefficient that can be used with ordinal or
ranked data
Phi Correlation Coefficient is a form of the Pearson's Product
Moment Coefficient that can be used with dichotomous variables
(e.g. pass/fail, male/female)
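Pearson's r can be computed directly from its definition; this is a minimal sketch (the helper name pearson_r is ours, and the data are hypothetical):

```python
import math

def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A perfectly linear positive relationship gives r = +1
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # ≈ 1.0
```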
67. STATISTICAL MEASURES OF
RELATIONSHIPS
Linear Regression
the use of correlation coefficients to plot a line illustrating the linear relationship of
two variables X and Y. It is based on the slope of the line which is represented by
the formula :
Y = a + bX
where
• Y = dependent variable
• X = independent variable
• b = slope of the line
• a = constant or Y intercept
Regression is used extensively in making predictions based on finding unknown Y
values from known X values
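Given data, the least-squares estimates of b and a follow directly from these definitions; a minimal sketch with hypothetical values (here Y = 1 + 2X exactly):

```python
# Fit Y = a + bX by least squares, then predict an unknown Y from a known X
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]          # hypothetical data: Y = 1 + 2X

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx                # slope b = 2.0, intercept a = 1.0

y_hat = a + b * 6              # prediction for X = 6: 13.0
```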
68. STATISTICAL MEASURES OF
RELATIONSHIPS
Multiple Regression
the same as regression except that it attempts to predict Y
from two or more independent X variables. The formula for
multiple regression is an extension of the linear regression
formula:
Y = a + b1 X1 + b2 X2 + ....
Multiple regression is used extensively in making predictions
based on finding unknown Y values from known X values
70. STATISTICAL MEASURES OF
RELATIONSHIPS
Factor Analysis
often used when a large number of correlations have
been explored in a given study; it is a means of grouping
certain variables into clusters or factors that are
moderately to highly correlated with each other
72. ANALYZING DIFFERENCES WITHIN
THE DATA
T-Test
A t-test is used to determine if the scores of two groups
differ on a single variable. A t-test is designed to test for
the differences in mean scores
Note: A t-test compares only two means at a time. It is useful in
analyzing scores of two groups of participants on a particular variable or in
analyzing scores of a single group of participants on two variables.
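As a sketch of the computation (the statistic only, not a full hypothesis test; the helper name t_statistic is ours and the data are hypothetical), the independent-samples t statistic with pooled variance is:

```python
import math
import statistics

def t_statistic(group1, group2):
    """Pooled-variance t statistic for the difference of two group means."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    return (m1 - m2) / se

t = t_statistic([5, 6, 7, 8], [1, 2, 3, 4])   # positive: first group scores higher
```

The resulting t would then be compared against a critical value for the chosen significance level and degrees of freedom.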
73. ANALYZING DIFFERENCES WITHIN
THE DATA
Matched Pairs T-Test
This type of t-test could be used to determine if the
scores of the same participants in a study differ under
different conditions
74. ANALYZING DIFFERENCES WITHIN
THE DATA
Analysis of Variance (ANOVA)
The ANOVA (analysis of variance) is a statistical test which
makes a single, overall decision as to whether a significant
difference is present among three or more sample means (Levin
484). An ANOVA is similar to a t-test. However, the ANOVA can
also test multiple groups to see if they differ on one or more
variables. The ANOVA can be used to test between-groups and
within-groups differences.
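The core of a one-way ANOVA is the F ratio of between-groups to within-groups variability; this is a minimal sketch with three hypothetical samples:

```python
# One-way ANOVA F statistic: F = MS_between / MS_within
groups = [[3, 4, 5], [6, 7, 8], [9, 10, 11]]   # hypothetical samples

k = len(groups)                                # number of groups
n = sum(len(g) for g in groups)                # total observations
grand = sum(sum(g) for g in groups) / n        # grand mean = 7.0

ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

f = (ss_between / (k - 1)) / (ss_within / (n - k))   # large F → means differ
```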
75. ANALYZING DIFFERENCES WITHIN
THE DATA
Analysis of Variance (ANOVA)
One-Way ANOVA: This tests a group or groups to
determine if there are differences on a single set of scores
Multivariate ANOVA (MANOVA): This tests a group or groups
to determine if there are differences on two or
more variables
76. MULTIVARIATE ANALYSIS
Multivariate analysis is used to study more complex sets of data
than what univariate analysis methods can handle. This type of
analysis is almost always performed with software
(e.g. SPSS or SAS), as working with even the smallest of data sets
can be overwhelming by hand.
Multivariate analysis can reduce the likelihood of Type I errors.
Sometimes, univariate analysis is preferred as multivariate techniques
can result in difficulty interpreting the results of the test. For example,
group differences on a linear combination of dependent variables in
MANOVA can be unclear. In addition, multivariate analysis is usually
unsuitable for small sets of data.
77. MULTIVARIATE ANALYSIS
There are more than 20 different ways to perform multivariate analysis, depending
on the type of data and the objectives of the research. For single data sets there are
several choices:
1. Additive Trees, Multidimensional Scaling, and Cluster Analysis are
appropriate when the rows and columns in a data table represent the same
units and the measure is either a similarity or a distance
2. Principal Component Analysis (PCA) decomposes a data table with
correlated measures into a new set of uncorrelated measures
3. Correspondence Analysis is similar to PCA. However, it applies to
contingency tables
79. MULTIVARIATE ANALYSIS
Additive Tree
a general way to represent clusters of data in a graph. It
is used when the data table is composed of rows and
columns that represent the same units; the measure must
be a distance or a similarity.
80. MULTIVARIATE ANALYSIS
Additive Tree
A “tree” is a finite, connected graph where any two nodes are connected by
one path. The additive tree is a similar technique to cluster analysis. Both
techniques have the “leaves” of the tree representing units. Where the
additive tree differs is that the distance is graphically represented by the
distance of those units on the tree
81. MULTIVARIATE ANALYSIS
Additive Tree
Cluster Analysis creates the clusters but does not create
a graph that represents the results. An additional
limitation of hierarchical cluster analysis is that objects in
the same cluster must be exactly the same distance from
each other, and the distances between clusters must be
larger than the “within clusters” distance. Additive trees
do not have these limitations
82. MULTIVARIATE ANALYSIS
Canonical Correlation Analysis
one way to find associations between two data sets. Like the
Correlation Coefficient, CCA measures the relationship between
variables. Where Canonical Correlation Analysis differs is that it is
specifically used to find the relationships between two
sets of variables
83. MULTIVARIATE ANALYSIS
Canonical Correlation Analysis
appropriate to use in the same situations as you might
use multiple regression analysis, but when you have multiple
intercorrelated outcome variables.
CCA is not recommended for small data sets.
84. MULTIVARIATE ANALYSIS
Canonical Correlation Analysis
The purpose of Canonical Correlation Analysis is to explain the
variability within and between sets through identification of
several sets of canonical variates. Canonical variates are new
variables formed by making a linear combination of two or more
variables from the data sets. When running CCA, you choose
weights that maximize the correlation between these sets of
variates.
86. MULTIVARIATE ANALYSIS
Cluster Analysis
Clusters can be based on factors like:
Distance-based clustering. Items are sorted based on their
proximity (or distance). For example, cancer cases might be
clustered together if they are in the same geographic location
Conceptual clustering. Items are grouped by factors that
items have in common. For example, cancer clusters could be
grouped by "people who work in manufacturing".
87. MULTIVARIATE ANALYSIS
Cluster Analysis
Clustering Types (continued):
Hierarchical Clustering. This is a more complex approach to
clustering used in data mining. Basically, each item is given its
own cluster. A pair of clusters is joined based on similarities,
giving one less cluster. This process is repeated until all items
are clustered. A dendrogram is a graph that shows the
hierarchical clusters
Probabilistic Clustering. Data is clustered using algorithms
which connect items using distances or densities. This is
usually performed by a computer
89. MULTIVARIATE ANALYSIS
Cluster Analysis
Clustering Types:
Exclusive Clustering. Each item can only belong in a single
cluster. It cannot belong in another cluster
Fuzzy Clustering: Data points are assigned a probability of
belonging to one or more clusters
Overlapping Clustering. Each item can belong to more than
one cluster
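As a sketch of the hierarchical (agglomerative) idea described above, hypothetical one-dimensional points are merged by single linkage until three clusters remain:

```python
# Agglomerative clustering sketch: start with each item in its own cluster,
# then repeatedly merge the two closest clusters (single linkage).
points = [1.0, 1.5, 5.0, 5.2, 9.0]     # hypothetical 1-D data
clusters = [[p] for p in points]

def single_link(c1, c2):
    """Single-linkage distance: smallest gap between any two members."""
    return min(abs(a - b) for a in c1 for b in c2)

while len(clusters) > 3:               # stop at a chosen number of clusters
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
    )
    clusters[i] += clusters.pop(j)

print(clusters)  # [[1.0, 1.5], [5.0, 5.2], [9.0]]
```

Recording the order and distance of the merges, rather than stopping at a fixed count, is what produces a dendrogram.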
90. MULTIVARIATE ANALYSIS
Correspondence Analysis/Multiple Correspondence
Analysis
a descriptive/exploratory technique designed
to analyze simple two-way and multi-way tables
containing some measure of correspondence between
the rows and columns
91. MULTIVARIATE ANALYSIS
Factor Analysis
a way to take a mass of data and shrinking it to a smaller
data set that is more manageable and more
understandable. It is a way to find hidden patterns, show
how those patterns overlap, and show what
characteristics are seen in multiple patterns. It is also
used to create a set of variables for similar items in the
set called dimensions
92. MULTIVARIATE ANALYSIS
Factor Analysis
It can be a very useful tool for complex sets of data involving
psychological studies, socioeconomic status, and other involved
concepts. A "factor" is a set of observed variables that have
similar response patterns; they are associated with a hidden
variable (called a latent variable) that is not directly
measured. Factors are listed according to factor loadings, or how
much variation in the data they can explain
93. MULTIVARIATE ANALYSIS
Factor Analysis
Types:
1. Exploratory factor analysis is used when the researcher does not
have any idea about the structure of the data or how many
dimensions exist in a set of variables
2. Confirmatory factor analysis is used for verification when the
researcher has a specific idea about the data structure or the
number of dimensions in a set of variables
94. MULTIVARIATE ANALYSIS
Generalized Procrustean Analysis
a way to compare two sets of configurations, or shapes. Originally
developed to match two solutions from Factor Analysis, the
technique was extended to Generalized Procrustes Analysis so
that more than two shapes could be compared. The shapes are
aligned to a target shape or to each other.
GPA uses geometric transformations (e.g. isotropic rescaling,
reflection, rotation, or translation) of matrices to compare the sets
of data
95. MULTIVARIATE ANALYSIS
Independent Component Analysis
used in statistics and signal processing to express a
multivariate function by its hidden factors or
subcomponents. These component signals are
independent non-Gaussian signals, and the intention is
that these independent subcomponents accurately
represent the composite signal
96. MULTIVARIATE ANALYSIS
Multivariate Analysis of Variance (MANOVA)
Analysis of variance (ANOVA) tests for differences
between means. MANOVA is just an ANOVA with several
dependent variables.
Similar to many other tests and experiments in that its
purpose is to find out if the response variable (i.e. your
dependent variable) is changed by manipulating the
independent variable
97. MULTIVARIATE ANALYSIS
Multivariate Analysis of Variance (MANOVA)
The test helps to answer many research questions, including:
Do changes to the independent variables have statistically
significant effects on dependent variables?
What are the interactions among dependent variables?
What are the interactions among independent variables?
98. MULTIVARIATE ANALYSIS
Multidimensional Scaling a visual representation of distances or
dissimilarities between sets of objects.
“Objects” can be colors, faces, map coordinates, political
persuasion, or any kind of real or conceptual stimuli. Objects that
are more similar (or have shorter distances) are closer together
on the graph than objects that are less similar (or have longer
distances). As well as interpreting dissimilarities as distances on a
graph, MDS can also serve as a dimension reduction technique
for high-dimensional data
99. MULTIVARIATE ANALYSIS
Multiple Regression Analysis used to see if there is
a statistically significant relationship between sets of
variables. It’s used to find trends in those sets of data.
Multiple regression analysis is almost the same as simple
linear regression. The only difference between simple
linear regression and multiple regression is in the number
of predictors (“x” variables) used in the regression
100. MULTIVARIATE ANALYSIS
Partial Least Squares Regression - if the data show a linear
relationship between the X and Y variables, the line that best fits
this linear relationship needs to be found.
That line is called a Regression Line and has the equation
ŷ= a + b x
The Least Squares Regression Line is the line that makes the
vertical distances from the data points to the regression line as
small as possible. It's called "least squares" because the best
line of fit is the one that minimizes the sum of the squared
vertical distances (residuals)
101. MULTIVARIATE ANALYSIS
Parallel Factor Analysis (PARAFAC)
is a generalization of Principal Component Analysis to
higher-order arrays. It is useful for exploratory data
analysis on very particular sets of data, for example if you
have three-way data. Where PARAFAC differs from
Principal Component Analysis is that PARAFAC produces
unique components
102. MULTIVARIATE ANALYSIS
Principal Component Analysis
is a tool that has two main purposes:
1. To find variability in a data set and
2. To reduce the dimensions of the data set
Reducing dimensions means that redundancy in the data is
eliminated; this can make patterns in the data set more clear.
Therefore, Principal Component Analysis is a good tool to use
when redundancies are suspected in a data set. Redundancy doesn't
mean that the variables are identical; it means that there is a
strong correlation between them
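For two variables, the idea can be sketched with a 2×2 covariance matrix: its larger eigenvalue is the variance captured by the first principal component. The data below are hypothetical and strongly correlated, so one component explains most of the variance:

```python
import math

# Two strongly correlated (redundant) variables
xs = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3]
ys = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
cyy = sum((y - my) ** 2 for y in ys) / (n - 1)
cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# Eigenvalues of the covariance matrix [[cxx, cxy], [cxy, cyy]]
tr, det = cxx + cyy, cxx * cyy - cxy ** 2
lam1 = (tr + math.sqrt(tr ** 2 - 4 * det)) / 2   # variance along PC1
lam2 = (tr - math.sqrt(tr ** 2 - 4 * det)) / 2   # variance along PC2

explained = lam1 / (lam1 + lam2)   # PC1 explains most of the variance here
```

Keeping only PC1 would reduce these two redundant variables to a single dimension with little loss of information.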
103. MULTIVARIATE ANALYSIS
Principal Component Regression
based on Principal Component Analysis. It is used when the data
set exhibits multicollinearity, meaning that although least squares
estimates are biased, variances may be too far away from the
actual value. PCA adds some bias to the regression model and
reduces standard error.
The first step in PCA is the same as in Principal Component
Analysis: identify the principal components. Regression is then
performed on those components
104. MULTIVARIATE ANALYSIS
Redundancy Analysis the constrained version of Principal Components
Analysis. Constrained basically means reduction of dimensions. This
reduction is what leads to more understandable results.
Redundancy Analysis is a way to summarize linear relationships in a set of
dependent variables that are influenced by a set of independent variables.
Linear Regression is first applied to represent Y as a function of X.
PCA is then applied to a matrix of the results to provide a visual
representation.