International journal of Agronomy and Plant Production. Vol., 4 (1), 127-141, 2013
Available online at http://www.ijappjournal.com
ISSN 2051-1914 ©2013 VictorQuest Publications
A Review on Applied Multivariate Statistical Techniques in Agriculture
and Plant Science
Armin Saed-Moucheshi (1*), Elham Fasihfar (1), Hojat Hasheminasab (2), Amir Rahmani (1) and Alli Ahmadi (3)
1- Dept. Crop Production and Plant Breeding, Shiraz University, Shiraz (Iran)
2- Dept. Crop Production and Plant Breeding, Razi University, Kermanshah (Iran)
3- Dept. Plant Protection, Tabriz University, Tabriz (Iran)
*Corresponding Author Email: saedmoocheshi@gmail.com
Abstract
Most scientists make decisions based on the analysis of data obtained from research. Data in science are typically abundant and, by themselves, of little help unless they are summarized by appropriate methods and suitably interpreted. A data set may also contain observations that stand out and whose presence cannot be justified by any simple explanation. Multivariate statistics is the branch of statistics encompassing the simultaneous observation and analysis of more than one statistical variable. In this review we clarify how multivariate statistical methods such as multiple regression analysis, principal component analysis (PCA), factor analysis (FA), cluster analysis, and canonical correlation (CC) can be used to explain relationships among different variables and to support decisions for future work, with examples from agriculture and plant science.
Keywords: Canonical correlation; Factor analysis; Principal component analysis; Cluster
analysis; Multiple regression.
Introduction
Crucial decisions in science, sociology, politics, economics, business, biology and botany are made based on the analysis of data obtained from research. Data in science are typically abundant and, by themselves, of little help unless they are summarized by appropriate methods and suitably interpreted. Since such a summary and the corresponding interpretation can rarely be made just by looking at the raw data, careful scientific scrutiny and analysis of these data can usually provide an enormous amount of valuable information. Admittedly, the more complex the data and their structure, the more involved the data analysis (Steel and Torrie, 1960). The complexity in a data set may exist for a variety of reasons. The data set may contain observations that stand out and whose presence cannot be justified by any simple explanation. Another situation in which a simple analysis alone may not suffice occurs when the data on some of the variables are correlated or when there is a trend present in the data. Often, data are collected on a number of units, and on each unit not just one but many variables are measured. Furthermore, when many variables exist, scientists need more elaborate analyses in order to extract the most definite and most easily comprehensible information the data can provide (Everitt and Dunn, 1992).
For univariate data, where there is only one variable under consideration, the data are usually summarized by the (population or sample) mean, variance, skewness, kurtosis, etc. (Anderson, 1984). These are the basic quantities used for data description. Multivariate statistics, on the other hand, is a form of statistics encompassing the simultaneous observation and analysis of more than one statistical variable. Methods of bivariate statistics, for example simple linear regression and correlation, are special cases of multivariate statistics in which two variables are involved (Steel and Torrie, 1960). Multivariate statistics concerns understanding the different aims and backgrounds of the analyses, and it can explain how different variables are related to one another. The practical implementation of multivariate statistics for a particular problem may involve several types of univariate and multivariate analysis in order to understand the relationships among variables and their relevance to the actual problem being studied (Johnson and Wicheren, 1996). Many different multivariate techniques, such as multivariate analysis of variance (MANOVA), multiple regression analysis, principal component analysis (PCA), factor analysis (FA), canonical correlation analysis (CC), and cluster analysis, are available. In this review we explain applicable techniques of multivariate statistics in agriculture and plant science, with related examples, in order to provide a practical manual for plant scientists in their research work.
Multiple Linear Regression Analysis
Linear regression is an approach to modeling the relationship between a dependent variable, called Y, and one or more explanatory variables, denoted X. The case of one explanatory variable is called simple regression; for example, we use simple linear regression to determine how much a 1 cm increase in the height of a plant changes its yield (Draper and Smith, 1966). The prediction equation for simple linear regression is:
Y = b0 + b1X + ε
b0: the intercept, which geometrically represents the value of the dependent variable (Y) where the regression line crosses the Y axis. Substantively, it is the expected value of Y when the independent variable equals zero.
b1: the slope coefficient (regression coefficient). It represents the change in Y associated with a one-unit increase in X.
ε: the error of the prediction. In most situations we are not in a position to determine the population parameters directly; instead, we must estimate their values from a finite sample drawn from the population.
Multiple regression considers more than one explanatory variable (X); for example, how much the plant yield changes when stem height, stem diameter, root length or leaf area change by one unit. The prediction model for multiple regression is an expanded form of the simple linear regression model, written as follows:
Y = b0 + b1X1 + b2X2 + ... + biXi + ε
bi: the partial slope coefficient (also called partial regression coefficient or metric coefficient). It represents the change in Y associated with a one-unit increase in Xi when all other independent variables are held constant.
Here b0 is the sample estimate of β0 and bi is the sample estimate of βi, where the β's are the parameters of the whole population from which the sample is drawn.
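In matrix notation (a standard least-squares result, added here for reference rather than taken from the original text), the estimates of all the coefficients are obtained at once as

$$\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y$$

where $X$ is the $n \times (p+1)$ design matrix whose first column is all ones (carrying the intercept b0), $y$ is the vector of observed responses, and ε is estimated by the residuals $e = y - X\hat{\beta}$.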
After determining the intercept and regression coefficients, we have to test them for significance through the analysis of variance (ANOVA). ANOVA determines whether the regression coefficients that the candidate model calculates should be present in the final model as predictors. Statistical software calculates a p-value (sig-value) for each coefficient's significance test. If the p-value for a coefficient is less than 0.05 (P < 0.05), the coefficient is statistically significant and the related variable should be present in the model as a predictor; if it is higher than 0.05 (P > 0.05), the coefficient is not statistically significant and the related variable should not be present as a predictor (Draper and Smith, 1981). The coefficient of determination, or R-square (R2), shows how well the model of predictors fits the dependent variable or variables: the higher the R2, the better the fit and the goodness of the model. The significance test for the intercept (b0) is similar to that for the regression coefficients (Kleinbaum et al., 1988).
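As a concrete illustration of these tests, the following minimal Python sketch fits a multiple regression and reads off the coefficient p-values and R2. The file name and the column labels Y, X1, ..., X11 are hypothetical stand-ins for data laid out like Example 1 below, and the statsmodels package is assumed to be available.

```python
# A minimal sketch, assuming a CSV laid out like Example 1 below with a
# grain-yield column "Y" and trait columns "X1".."X11" (hypothetical names).
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("wheat_traits.csv")       # hypothetical input file
X = sm.add_constant(df.drop(columns="Y"))  # predictors plus intercept b0
y = df["Y"]

fit = sm.OLS(y, X).fit()
print(fit.summary())                       # coefficients, t-tests, p-values, R-squared

pv = fit.pvalues.drop("const")             # per-coefficient significance tests
print("Significant predictors (P < 0.05):", list(pv[pv < 0.05].index))
```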
The significance tests of the coefficients and R2 help researchers decide which predictors are more important and must be present in the model. Besides these methods, some other techniques have been developed for determining the best model of predictors. Moreover, when the number of predictors increases, many of the variables are usually strongly correlated with one another; it is then not necessary for all of these correlated variables to be present in the model, since they can be used in place of each other (Manly, 2001).
Backward elimination: in this technique, unlike forward selection, all variables initially exist in the model and the less important variables are removed step by step. In the first step, all possible models obtained by removing each one of the variables are considered, and the variable whose removal yields the smallest mean square is dropped from the model. In the following steps this procedure is repeated, and whenever the p-value rises above the standard threshold the analysis stops; the model with the remaining variables is taken as the best predicting model (Burnham and Anderson, 2002).
Forward selection: in this method, in the first step of the analysis all possible simple regressions, one for each independent variable, are calculated, and the variable with the highest mean square (or F-value) enters the regression model as the first and most important predictor. In the second step, the variable entered in the first step is kept in the model, all other possible two-variable models containing it are constructed, and the one with the highest mean square is preferred. This procedure continues until the p-value of the model exceeds the standard p-value; the remaining variables are then not presented in the prediction model (Harrell, 2001).
Stepwise regression: this variable selection method has proved to be an extremely useful computational technique in data analysis problems (Dong et al., 2008). As in forward selection, all possible univariate models are first worked out and the variable with the highest mean square enters the model. In the second step, all other possible models containing the first variable are investigated and the variable with the highest mean square is entered; however, when the second variable enters, the first variable must be re-tested for significance in the presence of the second. If the first variable is still significant, both variables stay in the model; if not, it is removed. In subsequent steps this procedure is repeated, and any variable entered in a previous step whose p-value no longer meets the standard is removed. Thus this technique uses both forward selection and backward elimination, and it is more suitable than either alone (Miller, 2002); a code sketch follows.
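A rough sketch of the stepwise idea is given below, using per-coefficient p-values as the enter/remove criterion. Statistical packages use F-to-enter/F-to-remove thresholds and other refinements, so treat this as illustrative only; the DataFrame layout follows the earlier sketch and the thresholds are assumptions.

```python
# Stepwise selection sketch: forward entry plus backward re-testing,
# with assumed (illustrative) p-value thresholds.
import pandas as pd
import statsmodels.api as sm

def stepwise_select(df, target, candidates, alpha_in=0.05, alpha_out=0.10):
    # alpha_out > alpha_in guards against endlessly adding and dropping
    # the same variable.
    selected = []
    while True:
        changed = False
        # Forward step: add the remaining candidate with the smallest p-value.
        remaining = [c for c in candidates if c not in selected]
        best_var, best_p = None, 1.0
        for var in remaining:
            X = sm.add_constant(df[selected + [var]])
            p = sm.OLS(df[target], X).fit().pvalues[var]
            if p < best_p:
                best_var, best_p = var, p
        if best_var is not None and best_p < alpha_in:
            selected.append(best_var)
            changed = True
        # Backward step: re-test earlier entries in the presence of the
        # newly entered variable and drop the worst non-significant one.
        if selected:
            X = sm.add_constant(df[selected])
            pvals = sm.OLS(df[target], X).fit().pvalues.drop("const")
            worst = pvals.idxmax()
            if pvals[worst] > alpha_out:
                selected.remove(worst)
                changed = True
        if not changed:
            return selected
```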
Path analysis: regression coefficients depend strongly on the units of the variables; the magnitude of a coefficient reflects the measurement scale of its variable as much as its importance. In order to compare coefficients, the solution is to transform the variables' data to standardized data by subtracting each variable's mean and dividing by its standard deviation. After standardization, the variable with the larger coefficient has the larger effect on the dependent variable. When independent variables are correlated with one another, the variables can affect each other; in this situation, the correlation between each independent variable and the dependent variable can be divided into the direct effect of that independent variable and its indirect effects via the other correlated variables (Fig. 1). Using standardized data in the regression model, the calculated regression coefficients give the direct effects of the variables. The indirect effect of a variable can be estimated by multiplying each related direct effect by the correlation coefficient between the independent variables involved (Shipley, 1997). Therefore, path analysis can be explained as an extension of the regression model, used to test the fit of the correlation matrix against two or more causal models which are being compared by the researcher (Dong et al., 2008).
Figure 1. Diagram of path analysis: the direct effect of each variable (X1, X2, X3) on Y and its effect via the other correlated variables, combining into the final effect.
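The decomposition in Fig. 1 is straightforward to compute directly. The sketch below standardizes the data, takes the resulting regression coefficients as direct effects, and forms each indirect effect as the correlation between two predictors times the direct effect of the second; the function and column names are hypothetical, not from the paper.

```python
# Path analysis sketch: direct effects are standardized regression
# coefficients; indirect effects are correlation x direct effect.
import numpy as np
import pandas as pd

def path_analysis(df, target, predictors):
    z = (df - df.mean()) / df.std()                # standardize every column
    X, y = z[predictors].to_numpy(), z[target].to_numpy()
    direct, *_ = np.linalg.lstsq(X, y, rcond=None) # standardized (path) coefficients
    corr = z[predictors].corr().to_numpy()
    # indirect[i, j]: effect of predictor i on Y routed through predictor j
    indirect = corr * direct[np.newaxis, :]
    np.fill_diagonal(indirect, 0.0)
    total = direct + indirect.sum(axis=1)          # reproduces r(Xi, Y)
    return pd.DataFrame({"direct": direct,
                         "indirect": indirect.sum(axis=1),
                         "total": total}, index=predictors)
```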
For a better understanding of the regression techniques mentioned above, we present an example here.
Example 1: we measured several morphological traits of three wheat cultivars, consisting of tiller numbers/plant, spike length, spikelets/spike, spike weight/plant, grains/spike, grain weight/spike, 100-grain weight, total chlorophyll content of the flag leaf, biological yield/plant, root weight, leaf area and grain yield, under four water regimes (Moucheshi et al., 2012). Here we want to evaluate the relationship between grain yield and the related measured morphological traits using the techniques mentioned above.
Multiple regression
Table 1 shows the regression coefficient values, their standard errors, t-values and p-values. The full regression equation based on the results is:
Y = 0.5394 - 0.12X1 - 0.02X2 - 0.01X3 + 0.96X4 + 0.01X5 - 0.78X6 - 0.01X7 - 0.004X8 + 0.01X9 + 0.08X10 + 0.001X11
X1 = Tiller numbers/plant, X2 = Spike length, X3 = Spikelets/spike, X4 = Spike weight/plant, X5 = Grains/spike, X6 = Grain weight/spike, X7 = 100-Grain weight, X8 = Total chlorophyll content of flag leaf, X9 = Biologic yield/plant, X10 = Root weight, X11 = Leaf area and Y = Grain yield.
The coefficient of determination (R2) equals 99.2%, which is very high; however, it is not a realistic measure here, because R2 keeps increasing as the number of variables increases. Statisticians introduced the adjusted R2 to address this problem, but it is also not a completely accepted index. Furthermore, when the variables are numerous, explaining the relation between the dependent variable and many independent variables is complex; on the other hand, some coefficient values are very small and can be removed from the model. Based on the p-values, most of the variables are not statistically significant. The p-value shows which variables must be present in the model as predictors and which must not. As Table 1 shows, X4 and X6 are the variables with p-values below 0.05, and we must select them as the most effective variables on yield. The predicting model based on the regression analysis will be as follows:
Y = 0.96X4 - 0.78X6
Selection procedures
Backward elimination: in the four steps of backward elimination, four variables, X1, X3, X2 and X7, were removed from the model and the other variables remained. Based on this result, these four variables are the least important for predicting yield. The predicting model from this procedure is formulated as follows (Tables 2 and 3):
Y = -0.19 + 0.98X4 + 0.01X5 - 1.54X6 - 0.004X8 - 0.01X9 + 0.1X10 + 0.005X11
X4 = Spike weight/plant, X5 = Grains/spike, X6 = Grain weight/spike, X8 = Total chlorophyll content of flag leaf, X9 = Biologic yield/plant, X10 = Root weight, X11 = Leaf area, and Y = Yield.
Forward selection: similar to backward elimination, seven variables are included in the forward selection model, although the coefficient values differ slightly (Table 4):
Y = -0.003 + 0.98X4 - 0.004X5 + 0.01X6 - 0.01X8 - 1.54X9 + 0.11X10 - 0.003X11
X4 = Spike weight/plant, X5 = Grains/spike, X6 = Grain weight/spike, X8 = Total chlorophyll content of flag leaf, X9 = Biologic yield/plant, X10 = Root weight, X11 = Leaf area, and Y = Yield.
Stepwise selection: Table 5 shows the variables entered into, or removed from, the stepwise regression model. Similar to the backward and forward results, stepwise selection screened seven variables:
Y = -0.195 + 0.98X4 - 0.01X5 - 1.54X6 - 0.004X8 - 0.01X9 + 0.1X10 - 0.005X11
X4 = Spike weight/plant, X5 = Grains/spike, X6 = Grain weight/spike, X8 = Total chlorophyll content of flag leaf, X9 = Biologic yield/plant, X10 = Root weight, X11 = Leaf area, and Y = Yield.
Which model should serve as the predicting model is the researcher's choice, and one can use the model that best explains the idea of the research, but stepwise selection is usually best. In any case, the significance t-test for the variables in the full multiple regression analysis is not, by itself, a sufficient technique.
Path analysis: to carry out path coefficient analysis properly and understand the relationship between yield and the other morphological traits, the researcher can use the results of the selection procedures in the path analysis; here, however, we considered all variables. In this technique, the correlation coefficient between yield and each measured morphological trait is partitioned into its direct effect and its indirect effects via the other variables. The highest direct effect on yield was obtained for spike weight/plant (1.013), while the other variables had very low direct effects (Table 6). The sum of the indirect effects of spike weight/plant was negative. Except for spike weight/plant, the other variables had high indirect effects on grain yield. Spikelets/spike showed the lowest contribution to grain yield through its direct effect but the highest contribution through other traits.
Table 1. The regression coefficient (B), standard error (SE), T-value and probability of the estimated
variables in predicting wheat grain yield by the multiple linear regression analysis under inoculation (In)
and non-inoculation (Non-In) conditions and different water levels
Predictor DF B SE T P
Constant 1 0.5394 0.49180 1.10 0.284
X1 1 -0.1164 0.08245 -1.41 0.171
X2 1 -0.0202 0.05014 -0.40 0.691
X3 1 -0.0082 0.02037 -0.40 0.693
X4 1 0.9617 0.01927 49.90 0.001
X5 1 0.0110 0.00699 1.56 0.131
X6 1 -0.7802 0.34490 -2.26 0.033
X7 1 -0.0070 0.00979 -0.71 0.483
X8 1 -0.0042 0.00318 -1.33 0.196
X9 1 0.0131 0.01165 1.12 0.273
X10 1 0.0840 0.09246 0.91 0.373
X11 1 -0.0008 0.00318 -0.25 0.803
X1= Tiller numbers/plant, X2=Spike length, X3=Spikelets/spike, X4=Spike weight/plant, X5=
Grains/spike, X6= Grain weight/spike, X7= 100-Grain weight, X8= Total chlorophyll content of flag leaf,
X9= Biologic yield/plant, X10= Root weight and X11= Leaves area.
Table 3. Backward elimination and the variables remaining in the model
Step  Variable  Parameter estimate  Standard error  Sum of squares  F-value  Pr > F
Intercept -0.19463 0.08673 0.03923 5.040 0.0329
1 x4 0.97670 0.00947 82.8773 640.1 <.0001
2 x5 0.01208 0.00342 0.09736 12.50 0.0014
3 x6 -1.54441 0.21063 0.41875 53.76 <.0001
4 x8 -0.00407 0.00138 0.06753 8.670 0.0064
5 x9 -0.01094 0.00460 0.04402 5.650 0.0245
6 x10 0.09707 0.04682 0.03347 4.300 0.0475
7 x11 0.00505 0.00160 0.07755 9.960 0.0038
X4=Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X8= Total chlorophyll content of
flag leaf, X9= Biologic yield/plant, X10= Root weight and X11= Leaves area.
Table 4. Summary of forward selection
Step  Variable entered  Partial R-square  Model R-square  Parameter estimate  Standard error  F-value  Pr > F
1 x4 0.9963 0.9963 0.97859 0.00963 83.85 <.0001
2 x6 0.0013 0.9975 0.01198 0.00341 17.04 0.0002
3 x9 0.0005 0.998 -1.54065 0.21043 7.79 0.0088
4 x5 0.0004 0.9985 -0.00443 0.0043 8.69 0.006
5 x11 0.0002 0.9987 -0.0034 0.00152 5.48 0.0261
6 x8 0.0002 0.9989 -0.01166 0.00465 6.42 0.0169
7 x10 0.0001 0.9991 0.11336 0.04937 4.3 0.0475
Intercept -0.12314 0.11097 0.0034
Table 2. Summary of backward elimination
Step  Variable removed  Variables remaining in model  Partial R-square  Model R-square  F-value  Pr > F
1 x1 10 0 0.9991 0.02 0.8836
2 x3 9 0 0.9991 0.03 0.8558
3 x2 8 0 0.9991 0.28 0.6028
4 x7 7 0 0.9991 1.06 0.3117
X1= Tiller numbers/plant, X2= Spike length, X3= Spikelets/spike and X7= 100-Grain weight.
Table 5. Relative contribution (partial and model R2), F-value and probability in predicting wheat grain yield by the stepwise procedure analysis under non-inoculation condition and different water levels
Step  Variable entered  Variable removed  Partial R-square  Model R-square  P-value ER  Parameter estimate  Standard error  P-value M
1 x4 - 0.9963 0.9963 <.0001 0.9767 0.00947 <.0001
2 x6 - 0.0013 0.9975 0.0002 -1.54441 0.21063 <.0001
3 x9 - 0.0005 0.998 0.0088 -0.01094 0.00460 0.0245
4 x5 - 0.0004 0.9985 0.0060 0.01208 0.00342 0.0014
5 x11 - 0.0002 0.9987 0.0261 0.00505 0.0016 0.0038
6 x8 - 0.0002 0.9989 0.0169 -0.00407 0.00138 0.0064
7 x10 - 0.0001 0.9991 0.0475 0.09707 0.04682 0.0475
Intercept -0.195 0.0867 0.0329
X4=Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X8= Total chlorophyll content of flag
leaf, X9= Biologic yield/plant, X10= Root weight and X11= Leaves area.
R-square = coefficient of determination, P-value ER = p-value for entering or removing a variable, and P-value M = p-value for the final model.
Principal Component Analysis
Principal component analysis (PCA) is a variable reduction procedure, useful when you have obtained data on a large number of variables and believe that there is some redundancy among them (Fig. 2). PCA can be explained as a method that reduces data dimensionality by analyzing the covariances between variables; its main advantage is reducing the number of dimensions without much loss of information (Everitt and Dunn, 1992). In this case, redundancy means that some of the variables are correlated with one another, possibly because they are measuring the same construct. PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (PCs). The number of principal components is less than or equal to the number of original variables (Dunetman, 1989). The transformation is defined in such a way that the first PC has the largest possible variance, accounting for as much of the variability in the data as possible, and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to (uncorrelated with) the preceding components (Jackson, 1991). The PCs are independent when the data set is jointly normally distributed. The PCs may then be used as predictor or criterion variables in subsequent analyses. PCA is sensitive to the relative scaling of the original variables, and it is mostly used as a tool in exploratory data analysis and for making predictive models (Anderson, 1984). PCA can be carried out by eigenvalue decomposition of a data covariance (or correlation) matrix, or by singular value decomposition of the data matrix, usually after mean-centering (and standardizing, or using Z-scores) each attribute. The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score). The PCA operation can thus be thought of as revealing the internal structure of the data in the way that best explains the variance in the data (Jackson, 1991). If a multivariate dataset is visualized as a set of coordinates in a high-dimensional data space (one axis per variable), PCA can supply the user with a lower-dimensional picture, using only the first few principal components so that the dimensionality of the transformed data is reduced (Steel and Torrie, 1960).
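As a concrete illustration, the following Python sketch runs PCA on standardized data (i.e., on the correlation matrix) and applies the eigenvalue-greater-than-one retention rule used in Example 2 below. The input DataFrame `df` of traits and the file name are assumptions.

```python
# A brief PCA sketch on standardized data, with the eigenvalue > 1 rule.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# df = pd.read_csv("strawberry_traits.csv")   # hypothetical observations x traits
Z = StandardScaler().fit_transform(df)        # mean-centre and scale each trait
pca = PCA().fit(Z)
scores = pca.transform(Z)                     # component scores per observation

eigenvalues = pca.explained_variance_
loadings = pd.DataFrame(pca.components_.T, index=df.columns,
                        columns=[f"PC{i + 1}" for i in range(Z.shape[1])])
print("Eigenvalues:", np.round(eigenvalues, 3))
print("Variance explained (%):",
      np.round(100 * pca.explained_variance_ratio_, 1))
print("Components retained (eigenvalue > 1):", int((eigenvalues > 1).sum()))
```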
Figure 2. Diagram for principal component analysis
Table 6. Path coefficients (direct and indirect effects) of the measured variables attributed to grain yield variation of wheat under non-inoculation condition and different water levels. Columns X1-X11 give the effect of the row variable via the column variable; the last three columns give the direct effect, the total indirect effect, and the correlation with yield (Y).
Variable  X1  X2  X3  X4  X5  X6  X7  X8  X9  X10  X11  Direct effect  Indirect effect  Y
X1 -0.004 -0.003 0.000 0.503 0.018 -0.030 -0.006 -0.017 -0.019 0.01 0.01 -0.004 0.471 0.467
X2 -0.002 -0.004 0.000 0.495 0.018 -0.026 -0.004 -0.013 -0.019 0.009 0.007 -0.004 0.469 0.465
X3 -0.002 -0.002 0.001 0.641 0.019 -0.028 -0.006 -0.012 -0.018 0.009 0.013 0.001 0.620 0.621
X4 -0.002 -0.002 0.000 1.013 0.018 -0.037 -0.005 -0.012 -0.019 0.012 0.018 1.013 -0.02 0.993
X5 -0.002 -0.003 0.000 0.576 0.032 -0.041 -0.004 -0.015 -0.02 0.01 0.016 0.032 0.522 0.554
X6 -0.002 -0.002 0.000 0.567 0.02 -0.067 -0.003 -0.008 -0.015 0.01 0.02 -0.067 0.591 0.524
X7 -0.003 -0.002 0.000 0.579 0.016 -0.026 -0.008 -0.015 -0.015 0.012 0.013 -0.008 0.563 0.555
X8 -0.003 -0.002 0.000 0.54 0.021 -0.023 -0.006 -0.023 -0.021 0.009 0.011 -0.023 0.532 0.509
X9 -0.003 -0.003 0.000 0.727 0.023 -0.036 -0.005 -0.017 -0.027 0.011 0.013 -0.027 0.716 0.689
X10 -0.002 -0.002 0.000 0.616 0.017 -0.034 -0.005 -0.011 -0.015 0.020 0.013 0.020 0.581 0.601
X11 -0.002 -0.001 0.000 0.673 0.018 -0.048 -0.004 -0.01 -0.014 0.009 0.028 0.028 0.626 0.654
X1= Tiller numbers/plant, X2=Spike length, X3=Spikelets/spike, X4=Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X7=
100-Grain weight, X8= Total chlorophyll content of flag leaf, X9= Biologic yield/plant, X10= Root weight and X11= Leaves area.
Example 2 explains how PCA can be used to examine relationships in an agricultural dataset.
Example 2: fourteen strawberry cultivars were cultivated in two consecutive years (2009-2010) at the research center of agriculture and natural resources of Sanandaj, Iran. Ten variables comprising two data sets (a first set of morphological traits and a second set of biochemical traits) were measured (Saed-Moucheshi et al., in press).
Given the ten parameters used in this research, ten components were calculated by PCA. As expected, PC1 showed the highest eigenvalue (3.51), and thus most of the variation among the data can be explained by this PC. After component 1, PC2, PC3 and PC4 explain more variation among the data than the other components; the first four components together explain 85% of the total variation (Table 7). Moreover, these components have eigenvalues greater than unity (Fig. 3), so they were used to explain the whole variation among the data. Fig. 3 also shows that the eigenvalues decrease as the component number increases, which is an important indicator in general genetics as well as an efficient indicator for screening the genotypes. Flowering period had the highest coefficient in PC1; in components 2, 3 and 4, yield, anthocyanin and berry size respectively showed the maximum coefficients among the traits.
The first component clearly separated the two groups of variables containing the chemical and morphological parameters. Yield, berry size, berry weight, and the flowering and fruiting periods had high and negative correlations with PC1, so based on this component these traits have the larger effects in contributing to yield. In PC2, petiole length, TSS and also yield showed the highest (negative) correlations with the component; PC2 suggests that petiole length had a high effect on yield and, on the other hand, that higher yield can provide a higher amount of total soluble solids (TSS). Berry size, berry weight and also yield have very low coefficients in PC3, and based on this component these traits can be important contributors to yield. The flowering and fruiting periods and the anthocyanin content showed the highest negative contributions in PC3, so these two periods can change the anthocyanin content. Titratable acidity (TA) had the highest positive coefficient in PC3, and this trait is independent of the other variables. PC4 also showed that higher yield provides higher TSS content, and direct selection for yield results in more TSS content as well.
Table 7. Principal component analysis of traits measured during two years of strawberry cultivation
Trait                          PC1      PC2      PC3      PC4
Anthocyanin                    0.229   -0.191   -0.529   -0.174
Berry size                    -0.379   -0.285    0.009   -0.482
Berry weight                  -0.395   -0.258    0.007   -0.463
Flowering period              -0.424   -0.016   -0.342    0.326
Fruiting period               -0.383    0.012   -0.351    0.454
Petiole length                 0.084   -0.480   -0.018    0.348
Stolons/plant                  0.352   -0.196   -0.404   -0.107
Titratable acidity             0.025   -0.376    0.559    0.268
Total soluble solids           0.370   -0.430   -0.059    0.017
Berry yield                   -0.233   -0.469    0.002    0.077
Eigenvalue                     3.510    2.330    1.430    1.251
Proportion of variance (%)     35.1     23.3     14.3     12.5
Cumulative variance (%)        35.1     58.4     72.7     85.2
Factor Analysis
Factor analysis (FA), like principal component analysis, is a statistical method used to describe the variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. The purpose of FA is to discover simple patterns in the relationships among the variables (Spearman, 1904). In other words, it is possible, for example, that variations in three or four observed variables mainly reflect variations in fewer unobserved variables. FA searches for such joint variations in response to unobserved latent variables (Anderson, 1984). The observed variables are modeled as linear combinations of the potential factors, plus error terms. The information gained about the interdependencies between observed variables can be used later to reduce the set of variables in a dataset (Manly, 2001). Computationally, this technique is equivalent to a low-rank approximation of the matrix of observed variables. FA is related to principal component analysis (PCA), but the two are not identical: latent variable models, including factor analysis, use regression modeling techniques to test hypotheses and produce error terms, while PCA is a descriptive statistical technique (Dunetman, 1989).
FA is used to study the patterns of relationship among many dependent variables, with the goal of discovering something about the nature of the independent variables that affect them, even though those independent variables were not measured directly. The different methods of FA first extract a set of factors from a data set. These factors are almost always orthogonal and are ordered according to the proportion of the variance of the original data that they explain. In general, only a (small) subset of factors is kept for further consideration, and the remaining factors are considered either irrelevant or nonexistent (i.e., they are assumed to reflect measurement error or noise). To aid the interpretation of the factors considered relevant, the extraction step is generally followed by a rotation of the retained factors. Two main types of rotation are used: orthogonal, when the new axes are also orthogonal to each other, and oblique, when the new axes are not required to be orthogonal to each other. Because the rotations are always performed in a subspace (the so-called factor space), the new axes will always explain less variance than the original factors (which are computed to be optimal), but the part of variance explained by the total subspace after rotation is obviously the same as before rotation; only the partition of the variance has changed (Kaiser, 1958).
The factor model proposes that each observed response (measure 1 through measure 5) is influenced partially by underlying common factors (factor 1 and factor 2) and partially by underlying unique factors (E1 through E5; Fig. 4). The strength of the link between each factor and each measure varies, such that a given factor influences some measures more than others. FA is performed by examining the pattern of correlations (or covariances) between the observed measures. Measures that are highly correlated (either positively or negatively) are likely influenced by the same factors, while those that are relatively uncorrelated are likely influenced by different factors (Manly, 1986).
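The extraction-plus-rotation workflow can be sketched as follows. FactorAnalysis(rotation="varimax") requires scikit-learn 0.24 or newer, the two-factor choice mirrors the wheat example below, and the input DataFrame is an assumption.

```python
# A hedged sketch of factor extraction with varimax rotation.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

Z = StandardScaler().fit_transform(df)        # df: assumed traits DataFrame
fa = FactorAnalysis(n_components=2, rotation="varimax").fit(Z)
scores = fa.transform(Z)                      # factor scores per observation

loadings = fa.components_.T                   # variables x factors
communalities = (loadings ** 2).sum(axis=1)   # variance shared with the factors
for name, load, comm in zip(df.columns, loadings, communalities):
    print(f"{name}: loadings = {np.round(load, 3)}, communality = {comm:.3f}")
```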
Figure 3. Component numbers and their eigenvalues in the principal component analysis
Figure 4. Diagram of factor analysis
For factor analysis we use the data of Example 1. Here, the first two of the twelve factors in the factor analysis accounted for 60.1% of the total variation in the data structure (Table 8). The first factor loaded on yield and spike weight/plant and could explain 49.5% of the total variation in the dependence structure, and its suggested name is "yield". The second factor accounted for 10.6% of the total variability and consisted of the total chlorophyll content of the flag leaf, so it was named "total chlorophyll". The first two factors have eigenvalues greater than unity and are graphically shown in Fig. 5 (a).
In this example, factor analysis showed that spike weight/plant and the total chlorophyll content of the flag leaf had the highest relative contributions to wheat grain yield. Such results can be recognized by means of Fig. 5 (b).
Table 8. Rotated (Varimax rotation) factor loadings and communalities for the estimated variables of
wheat based on factor analysis technique for inoculation and non-inoculation conditions and different
water levels
Variable Factor1 Factor2 Communality
X1 0.159 0.384 0.543
X2 0.194 0.196 0.390
X3 0.324 0.127 0.451
X4 0.875 0.157 1.032
X5 0.230 0.280 0.510
X6 0.246 0.056 0.302
X7 0.250 0.247 0.497
X8 0.220 0.817 1.037
X9 0.411 0.421 0.832
X10 0.299 0.138 0.437
X11 0.374 0.126 0.500
Y 0.885 0.140 1.025
Latent roots 2.338 1.268 3.606
Factor variance (%) 49.50 10.60 60.10
X1= Tiller numbers/plant, X2=Spike length, X3=Spikelets/spike, X4=Spike weight/plant, X5=
Grains/spike, X6= Grain weight/spike, X7= 100-Grain weight, X8= Total chlorophyll content of flag leaf,
X9= Biologic yield/plant, X10= Root weight, X11= Leaves area and Y= Grain yield.
Figure 5 (a). Scree plot showing eigenvalues in response to the number of factors for the estimated variables
of wheat.
Figure 5 (b). Variables loading by factor analysis and varimax rotation with first two factors.
X1= Tiller numbers/plant, X2=Spike length, X3=Spikelets/spike, X4=Spike weight/plant, X5= Grain
number/spike, X6= Grain weight/spike, X7= 100-Grain weight, X8= Total chlorophyll content of flag leaf, X9=
Biologic yield/plant, X10= Root weight, X11= Leaves area and Y= Grain yield.
Clustering Analysis
Cluster analysis, or clustering, is the task of assigning a set of objects to groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters. Clustering is a main task of exploratory data mining and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. Cluster analysis itself is not one specific algorithm but a general task to be solved (Romesburg, 1984). It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals, or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold, or the number of expected clusters) depend on the individual data set and the intended use of the results (Richard, 2007). Indeed, cluster analysis is an exploratory data analysis tool for organizing observed data into meaningful taxonomies, groups, or clusters based on combinations of variables; it maximizes the similarity of cases within each cluster while maximizing the dissimilarity between groups that are initially unknown. In this sense, cluster analysis creates new groupings without any preconceived notion of what clusters may arise (Singh and Chowdhury, 1985). Cluster analysis, like factor analysis, makes no distinction between dependent and independent variables; the entire set of interdependent relationships is examined. Cluster analysis is the obverse of factor analysis: whereas factor analysis reduces the number of variables by grouping them into a smaller set of factors, cluster analysis reduces the number of observations or cases by grouping them into a smaller set of clusters (Johnson and Wicheren, 1996). On the other hand, Everitt and Dunn (1992) and Nouri et al. (2011) stated that the main advantage of using PCA over cluster analysis is that each variable can be allocated to one group only.
The first choice that must be made in carrying out a clustering analysis is how similarity (or, alternatively, distance) between data is to be defined. There are many ways to compute how similar two series of data are, such as the Pearson correlation, Spearman rank correlation (for non-numeric data), Euclidean distance, etc. (Romesburg, 1984). After choosing a distance measure for similarity, a related clustering method, hierarchical or non-hierarchical, must be used. The hierarchical method is the most popular; in this procedure we construct a hierarchy or tree-like structure to see the relationships among cases. The clusters can be arrived at either by weeding out dissimilar observations (divisive methods) or by joining together similar observations (agglomerative methods). Most common statistical packages use agglomerative methods, and the most popular agglomerative methods are (1) single linkage (nearest neighbor), (2) complete linkage (furthest neighbor), (3) average linkage, (4) Ward's method, and (5) the centroid method (Everitt, 1993).
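A minimal sketch of agglomerative clustering with Ward's method and a dendrogram is given below, using SciPy; the trait DataFrame (genotypes as rows) and the four-cluster cut matching Example 3 are assumptions.

```python
# Hierarchical (agglomerative) clustering sketch with Ward's method.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import pdist

dist = pdist(df.to_numpy(), metric="euclidean")  # pairwise genotype distances
tree = linkage(dist, method="ward")              # agglomerative, Ward's criterion

groups = fcluster(tree, t=4, criterion="maxclust")  # cut the tree into 4 clusters
print(dict(zip(df.index, groups)))

dendrogram(tree, labels=list(df.index))          # tree-like structure of cases
plt.ylabel("Dissimilarity")
plt.show()
```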
Example 3: twenty chickpea cultivars were cultivated in 2005 at the research center of Razi University, Kermanshah, Iran, under rainfed conditions (Moucheshi et al., 2009-2010). Yield and its components were measured, and the cultivars were grouped by cluster analysis based on the measured traits.
Cluster analysis of the chickpea genotypes based on grain yield and its components (Fig. 6) classified the genotypes into four groups with 5, 4, 2 and 9 genotypes, respectively. The highest distance, or dissimilarity, was observed between genotypes 1 and 17, and the highest similarity between genotypes 18 and 20. Based on the results, the cultivars grouped within a cluster may have a common origin; on the other hand, crossing between genotypes in distant clusters, such as the first and fourth, can provide much variation for plant breeding aims.
Figure 6. Results of cluster analysis for 20 chickpea genotypes under rainfed condition
Canonical Correlation
In statistical terms, dependence refers to any statistical relationship between two random variables or two sets of data, and correlation refers to any of a broad class of statistical relationships involving dependence. Familiar dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a product and its price. Formally, dependence refers to any situation in which random variables do not satisfy a mathematical condition of probabilistic independence (Steel and Torrie, 1960).
Correlation can refer to any departure of two or more random variables from independence, but technically it refers to any of several more specialized types of relationship between mean values. There are several correlation coefficients, often denoted ρ or r, measuring the degree of correlation. The most common of these is the Pearson correlation coefficient, which is sensitive only to a linear relationship between two variables (a relationship which may exist even if one variable is a nonlinear function of the other). Other correlation coefficients have been developed to be more robust than the Pearson correlation, that is, more sensitive to nonlinear relationships (Johnson and Wicheren, 1996).
In canonical correlation (a "multiple-multiple" correlation), the data are divided into two sets of related variables: one set of independent variables, containing two or more X variables, for example, and another set of dependent variables, containing two or more Y variables. The goal is to describe the relationships between the two sets of variables. One finds the canonical weights (coefficients) a1, a2, a3, ..., ap to be applied to the p X variables and b1, b2, b3, ..., bm to be applied to the m Y variables in such a way that the correlation between CVX1 and CVY1 is maximized (Bratlet, 1974):
CVX1 = a1X1 + a2X2 + ... + apXp
CVY1 = b1Y1 + b2Y2 + ... + bmYm
CVX1 and CVY1 are the first canonical variates, and their correlation is the sample canonical correlation coefficient for the first pair of canonical variates (Fig. 7). The residuals are then analyzed in the same fashion to find a second pair of canonical variates, CVX2 and CVY2, whose weights are chosen to maximize the correlation between CVX2 and CVY2, using only the variance remaining after the variance due to the first pair of canonical variates has been removed from the original variables. This continues until a "significance" cutoff is reached or the maximum number of pairs (which equals the smaller of m and p) has been found (Giffins, 1985).
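The variate-by-variate construction above is available in scikit-learn's CCA; the sketch below extracts all possible canonical pairs and their correlations. The two DataFrames are hypothetical stand-ins for the two trait sets of Example 4.

```python
# Canonical correlation sketch between two assumed trait sets.
import numpy as np
from sklearn.cross_decomposition import CCA

X = X_set.to_numpy()                    # first set: e.g. 5 yield-related traits
Y = Y_set.to_numpy()                    # second set: e.g. 4 chlorophyll traits

n_pairs = min(X.shape[1], Y.shape[1])   # number of canonical roots
cca = CCA(n_components=n_pairs)
U, V = cca.fit_transform(X, Y)          # canonical variates CVX_k, CVY_k

canon_r = np.array([np.corrcoef(U[:, k], V[:, k])[0, 1] for k in range(n_pairs)])
print("Canonical correlations:", np.round(canon_r, 3))
print("Eigenvalues (squared correlations):", np.round(canon_r ** 2, 3))
```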
Figure 7. Diagram for canonical correlation
Example 4: nine variables comprising two sets (yield and its components, and photosynthesis-related traits) were measured in 20 chickpea genotypes under rainfed conditions at Razi University, Kermanshah, Iran, in 2005. We want to consider the relationship between these two sets of variables (unpublished data).
The number of roots (eigenvalues, or squared canonical correlations) equals the number of variables in the smaller set of data; therefore, the number of roots in this example is 4 (Fig. 8). In this example, none of the canonical correlations between the sets of variables is significant, so there is no relationship between the two sets (Table 9). For a better understanding of this correlation, assume that the first canonical correlation (0.428) were significant. Yield has the highest (negative) contribution to the first root among the first set of variables, while 100-seed weight has a high positive contribution. The highest negative contribution in the second set belongs to chlorophyll fluorescence, and the highest positive one to chlorophyll b. These results show that yield and chlorophyll fluorescence have a direct relationship, with chlorophyll a and also the number of pods per plant contributing somewhat to this relationship; these variables are negatively correlated with the first root. On the other hand, 100-seed weight, seed weight, number of seeds per plant, chlorophyll b and total chlorophyll (ab) are directly related to one another, although the contributions of SW and NSP in the first set and Ch ab in the second set are low; these variables are positively correlated with the first root.
The redundancy index is the amount of variance in a canonical variate (dependent or independent) explained by the other canonical variate in the canonical function. It can be computed for both the dependent and the independent canonical variates in each canonical function. In this example, the variability of each set explained by the other is very low (3.00% for the first set and 6.83% for the second set).
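In symbols (the standard Stewart-Love formulation, stated here as an assumption since the text gives no formula), the redundancy of one set for canonical pair k is the average squared loading of that set's variables on its own variate, multiplied by the squared canonical correlation:

$$Rd_k = \left(\frac{1}{m}\sum_{j=1}^{m} L_{jk}^{2}\right) r_{c,k}^{2}$$

where $L_{jk}$ is the loading of variable j on canonical variate k of its own set, m is the number of variables in that set, and $r_{c,k}$ is the k-th canonical correlation.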
Table 9. Summary of canonical correlation
Variable     Root 1    Root 2    Root 3    Root 4
First set
100SW         2.694    -0.746     1.906     0.890
SW            0.588     0.100    -1.716     0.245
NSP           0.677     0.907     0.107    -0.060
NPP          -0.231    -0.830    -0.231    -0.954
Y            -2.897     1.042     0.038    -1.069
Variance extracted: 70.22%; total redundancy: 3.00%
Second set
Ch a         -0.288     0.999    -0.118     0.147
Ch b          0.529     0.510    -0.076    -0.836
Ch ab         0.402    -0.375    -0.778     0.520
Ch f         -0.903    -0.202    -0.452    -0.253
Variance extracted: 100%; total redundancy: 6.83%
Eigenvalue    0.1835    0.0805    0.0525    0.001
Can. corr.    0.428     0.284     0.229     0.032
P-value       0.55817   >0.56     >0.56     >0.56
100SW: 100-seed weight; SW: seed weight per plant; NSP: number of seeds per plant; NPP: number of pods per plant; Y: yield; Ch a: chlorophyll a; Ch b: chlorophyll b; Ch ab: total chlorophyll content; and Ch f: chlorophyll fluorescence.
Figure 8. Plot of eigenvalues by root number and their contribution to the canonical correlation
It seems that research in agriculture and plant science is somewhat weak in its statistical discussion and interpretation. This review has explained the most widely applied multivariate statistical methods that researchers in agriculture and plant science can use in their investigations to give more authority to their work and results.
References
Anderson TW, 1984. An introduction to multivariate statistical analysis. John Wiley, New York.
Bratlet MS, 1974. The general canonical correlation distribution. Annals of Mathematical Statistics 18: 1-17.
Burnham KP, Anderson DR. 2002. Model selection and multimodel inference. Springer, New York.
Dong B, Liu M, Shao HB, Li Q, Shi L, Du F, Zhang Z, 2008. Investigation on the relationship between leaf
water use efficiency and physio-biochemical traits of winter wheat under rained condition. Colloids and
Surfaces B: Biointerfaces 62: 280-287.
Draper NR, Smith H, 1966. Applied Regression Analysis. John Wiley, New York.
Draper NR, Smith H, 1981. Applied regression analysis. John Wiley, New York.
Dunetman GH, 1989. Principal component analysis. Sage Publication, Newbury Park.
Everitt BS, 1993. Cluster Analysis. Wiley, New York.
Everitt BS, Dunn G, 1992. Applied Multivariate Data Analysis, Oxford University Press, New York, NY.
Giffins R, 1985. Canonical analysis: a review with application in ecology. Springer-Verlag, Berlin.
Harrell FE, 2001. Regression modeling strategies: With applications to linear models, logistic regression, and
survival analysis. Springer-Verlag New York.
Jackson JE, 1991. A user's guide to principal component. John Wiley New York.
Johnson RA, Wicheren DW, 1996. Applied multivariate statistical analysis. Prentice Hall of India, New Delhi.
Kaiser HF, 1958. The varimax criterion for analytic rotation in factor analysis. Psychometrika 23: 187-200.
Kleinbaum DG, Kupper LL, Muller KE, 1988. Applied Regression Analysis and Other Multivariable Methods. PWS-Kent Publishing Co, Boston.
Manly BFJ, 1986. Multivariate statistical methods: a primer. Chapman and Hall, London - New York.
Manly BFJ, 2001. Statistics for environmental science and management. Chapman and Hall/CRC, Boca Raton.
Miller AJ, 2002. Subset selection in regression. Chapman and Hall London.
Moucheshi AS, Heidari B, Dadkhodaie A, 2009-2010. Genetic Variation and Agronomic Evaluation of
Chickpea Cultivars for Grain Yield and Its Components Under Irrigated and Rainfed Growing
Conditions. Iran Agricultural Research 28-29: 39-50.
Moucheshi A, Heidari B, Assad MT, 2012. Alleviation of drought stress effects on wheat using arbuscular mycorrhizal symbiosis. International Journal of AgriScience 2: 35-47.
Nouri A, Etminan A, Dasilva D, Mohammad R, 2011. Assessment of yield, yield-related traits and drought tolerance of durum wheat genotypes (Triticum turgidum var. durum Desf.). Australian Journal of Crop Science 5: 8-16.
Richard AJ, 2007. Applied multivariate statistical analysis. Prentice Hall.
Romesburg HC, 1984. Cluster analysis for researchers. Lifetime Learning Publications, Belmont.
Saed-Moucheshi A, Karami F, Nadafi S, Khan AA, (in press). Heritability, genetic variability and interrelationship among some morphological and chemical parameters of strawberry cultivars. Pakistan Journal of Botany.
Shipley B, 1997. Exploratory path analysis with applications in ecology and evolution. The American
Naturalist 149: 1113-1138.
Singh RK, Chowdhury BD, 1985. Biometrical method in quantitative genetic analysis. Kalyani Publishers, Ludhiana, New Delhi.
Spearman C, 1904. General intelligence, objectively determined and measured. American Journal of Psychology 15: 201-293.
Steel RGD, Torrie JH, 1960. Principles and Procedures of Statistics. McGraw Hill Book Co. Inc., New York.
Multivariate Approaches in Nursing Research Assignment.pdfMultivariate Approaches in Nursing Research Assignment.pdf
Multivariate Approaches in Nursing Research Assignment.pdfbkbk37
 
Biostatistics
BiostatisticsBiostatistics
Biostatisticspriyarokz
 
Multinomial Logistic Regression.pdf
Multinomial Logistic Regression.pdfMultinomial Logistic Regression.pdf
Multinomial Logistic Regression.pdfAlemAyahu
 
SPSS statistics - get help using SPSS
SPSS statistics - get help using SPSSSPSS statistics - get help using SPSS
SPSS statistics - get help using SPSScsula its training
 
Dimensionality Reduction Techniques In Response Surface Designs
Dimensionality Reduction Techniques In Response Surface DesignsDimensionality Reduction Techniques In Response Surface Designs
Dimensionality Reduction Techniques In Response Surface Designsinventionjournals
 
25_Anderson_Biostatistics_and_Epidemiology.ppt
25_Anderson_Biostatistics_and_Epidemiology.ppt25_Anderson_Biostatistics_and_Epidemiology.ppt
25_Anderson_Biostatistics_and_Epidemiology.pptPriyankaSharma89719
 
Role of Modern Geographical Knowledge in National Development
Role  of Modern Geographical Knowledge in National DevelopmentRole  of Modern Geographical Knowledge in National Development
Role of Modern Geographical Knowledge in National DevelopmentProf Ashis Sarkar
 
Normalized Citation Indexes: a theoretical methodological study applied to sc...
Normalized Citation Indexes: a theoretical methodological study applied to sc...Normalized Citation Indexes: a theoretical methodological study applied to sc...
Normalized Citation Indexes: a theoretical methodological study applied to sc...Ely Francina Tannuri Oliveira
 

Ähnlich wie applied multivariate statistical techniques in agriculture and plant science 2 (20)

Advice On Statistical Analysis For Circulation Research
Advice On Statistical Analysis For Circulation ResearchAdvice On Statistical Analysis For Circulation Research
Advice On Statistical Analysis For Circulation Research
 
Data structure
Data   structureData   structure
Data structure
 
Lessons learnt in statistics essay
Lessons learnt in statistics essayLessons learnt in statistics essay
Lessons learnt in statistics essay
 
Statistics and types of statistics .docx
Statistics and types of statistics .docxStatistics and types of statistics .docx
Statistics and types of statistics .docx
 
artigo correlação policorica x correlaçãoperson pdf
artigo correlação policorica x correlaçãoperson pdfartigo correlação policorica x correlaçãoperson pdf
artigo correlação policorica x correlaçãoperson pdf
 
Medical Statistics.pptx
Medical Statistics.pptxMedical Statistics.pptx
Medical Statistics.pptx
 
Review of Basic Statistics and Terminology
Review of Basic Statistics and TerminologyReview of Basic Statistics and Terminology
Review of Basic Statistics and Terminology
 
MELJUN CORTES research lectures_evaluating_data_statistical_treatment
MELJUN CORTES research lectures_evaluating_data_statistical_treatmentMELJUN CORTES research lectures_evaluating_data_statistical_treatment
MELJUN CORTES research lectures_evaluating_data_statistical_treatment
 
Statistics for Geography and Environmental Science: an introductory lecture c...
Statistics for Geography and Environmental Science:an introductory lecture c...Statistics for Geography and Environmental Science:an introductory lecture c...
Statistics for Geography and Environmental Science: an introductory lecture c...
 
Level of Measurement, Frequency Distribution,Stem & Leaf
Level of Measurement, Frequency Distribution,Stem & Leaf   Level of Measurement, Frequency Distribution,Stem & Leaf
Level of Measurement, Frequency Distribution,Stem & Leaf
 
An Overview and Application of Discriminant Analysis in Data Analysis
An Overview and Application of Discriminant Analysis in Data AnalysisAn Overview and Application of Discriminant Analysis in Data Analysis
An Overview and Application of Discriminant Analysis in Data Analysis
 
Multivariate Approaches in Nursing Research Assignment.pdf
Multivariate Approaches in Nursing Research Assignment.pdfMultivariate Approaches in Nursing Research Assignment.pdf
Multivariate Approaches in Nursing Research Assignment.pdf
 
Statistics
StatisticsStatistics
Statistics
 
Biostatistics
BiostatisticsBiostatistics
Biostatistics
 
Multinomial Logistic Regression.pdf
Multinomial Logistic Regression.pdfMultinomial Logistic Regression.pdf
Multinomial Logistic Regression.pdf
 
SPSS statistics - get help using SPSS
SPSS statistics - get help using SPSSSPSS statistics - get help using SPSS
SPSS statistics - get help using SPSS
 
Dimensionality Reduction Techniques In Response Surface Designs
Dimensionality Reduction Techniques In Response Surface DesignsDimensionality Reduction Techniques In Response Surface Designs
Dimensionality Reduction Techniques In Response Surface Designs
 
25_Anderson_Biostatistics_and_Epidemiology.ppt
25_Anderson_Biostatistics_and_Epidemiology.ppt25_Anderson_Biostatistics_and_Epidemiology.ppt
25_Anderson_Biostatistics_and_Epidemiology.ppt
 
Role of Modern Geographical Knowledge in National Development
Role  of Modern Geographical Knowledge in National DevelopmentRole  of Modern Geographical Knowledge in National Development
Role of Modern Geographical Knowledge in National Development
 
Normalized Citation Indexes: a theoretical methodological study applied to sc...
Normalized Citation Indexes: a theoretical methodological study applied to sc...Normalized Citation Indexes: a theoretical methodological study applied to sc...
Normalized Citation Indexes: a theoretical methodological study applied to sc...
 

applied multivariate statistical techniques in agriculture and plant science 2

  • 1. International journal of Agronomy and Plant Production. Vol., 4 (1), 127-141, 2013 Available online at http:// www.ijappjournal.com ISSN 2051-1914 ©2013 VictorQuest Publications A Review on Applied Multivariate Statistical Techniques in Agriculture and Plant Science Armin Saed-Moucheshi 1* , Elham Fasihfar 1 , Hojat Hasheminasab 2 , Amir Rahmani 1 and Alli Ahmadi 3 1- Dept. Crop Production and Plant Breeding, Shiraz University, Shiraz (Iran) 2- Dept. Crop Production and Plant Breeding, Razi University, Kermanshah (Iran) 3- Dept. Plant Protection, Tabriz University, Tabriz (Iran) *Corresponding Author Email: saedmoocheshi@gmail.com Abstract Most scientists make decisions based on analyzing of the obtained data from researches works. Almost all data in science are abundance and by themselves they are of little help unless they are summarized by some methods and appropriate interpretations have been made. The data set may contain so many observations that stand out and whose presence in the data cannot be justified by any simple explanation. Multivariate statistical technique is a form of statistics encompassing the simultaneous observations and analysis of more than one statistical variable. In this review we are trying to clarify how multivariate statistical methods such as multiple regression analysis, principal component analysis (PCA), factor analysis (FA), clustering analysis, and canonical correlation (CC) can be used as methods to explain relationships among different variables and making decisions for future works with examples relating to the agriculture and plant science. Keywords: Canonical correlation; Factor analysis; Principal component analysis; Cluster analysis; Multiple regression. Introduction Most crucial scientific, sociological, political, economic, business, biology and botany make decisions based on analyzing of obtained data from research's works. Almost all data in science are abundance and by themselves they are of little help unless they are summarized by some methods and appropriate interpretations have been made. Since such a summary and corresponding interpretation can rarely be made just by looking at the raw data, a careful scientific scrutiny and analysis of these data can usually provide enormous amount of valuable information. Admittedly, the more complex the data and their structure, the more involved the data analysis (Steel and Torrie, 1960). The complexity in a data set may exist for a variety of reasons. The data set may contain too many observations that stand out and what presence in the data cannot be justified by any simple explanation. Another situation in which a simple analysis alone may not suffice occurs when the data on some of the variables are correlated or when there is a trend present in the data. Many times, data are collected on a number of units, and on each unit not just one, but many variables are measured. Further, when many variables exist, in order to obtain more definite and more easily comprehensible information, scientist need to used further complex analyses in order to get highest information that can be obtained from data (Everitt and Dunn, 1992). For univariate data, when there is only one variable under consideration, these are usually summarized by the (at the either population or sample) mean, variance, skewness, kurtosis and etc (Anderson, 1984). These are the basic quantities used for data description. 
statistics is a form of statistics encompassing the simultaneous observation and analysis of more than one statistical variable. Methods of bivariate statistics, for example simple linear regression and correlation, are special cases of multivariate statistics in which two variables are involved (Steel and Torrie, 1960). Multivariate statistics concerns understanding the different aims and backgrounds of a study, and it can explain how different variables are related to one another. The practical implementation of multivariate statistics to a particular problem may involve several types of univariate and multivariate analysis in order to
  • 2. Intl. J. Agron. Plant. Prod. Vol., 4 (1), 127-141, 2013 128 understand the relationships among variables and their relevance to the actual problems being studied (Johnson and Wicheren, 1996). Many different multivariate analyses techniques such as multivariate analysis of variance (MANOVA), multiple regression analysis, principal components analysis (PCA), factor analysis (FA), canonical correlation analysis (CC), and clustering analysis are available. In this review we are going to explain applying and usable techniques of multivariate statistics in the agriculture and plant science with related examples in order to provide a practical manual in scientific research works for plant scientist. Multiple Linear Regression Analysis Linear regression is an approach to modeling the relationship between a dependent variable called Y and one or more explanatory variables denoted X. The case of one explanatory variable is called simple regression. For example we want to determine 1 cm increasing the height of a plant makes how much change in its yield, in which situation we use simple linear regression (Draper and Smith, 1966). The prediction model equation for simple linear regression is: Y=b0 + b1X + ε b0: It is the intercept that geometrically represents the value of dependent variable (Y) where the regression line crosses the Y axis. Substantively, it is the expected value of Y when independent variable is equal zero. b1: Slope coefficient (regression coefficient). It represents the change in Y associated with a one-unit increase in X. ε: In most situations, we are not in a position to determine the population parameters directly. Instead, we must estimate their values from a finite sample from the population and this parameter is the error of the prediction. Multiple regression considers more than one explanatory variable (X). For example changing one unit in the stem height, stem diameter, root length and leaf area caused how many changes in the plant yield. Prediction model for multiple regression is expanded model of simple linear regression which is showed as follow: Y=b0 + b1X1 + b2X2 +…..+ biXi + ε bi= Partial slope coefficient (also called partial regression coefficient, metric coefficient). It represents the change in Y associated with a one-unit increase in Xi when all other independent variables are held constant. Where b0 is the sample estimate of β0 and bi is the sample estimate of βi, and β's are the parameters from the whole population in which sampling is conducted. After determining the intercept and regression coefficients, we have to test them for significance by doing the analysis of variance (ANOVA). ANOVA determine if regression coefficients that the probable model calculates should be present in the final model as a predictor or not. Statistical software calculates a P-value or sig-value for coefficients significance test. If P-value for a coefficient was less than 0.05 (P<0.05), the coefficient is statistically significant and the related variable should be present in the model as a predictor but if it was higher than 0.05 (P>0.05), the coefficient is not statistically significant and the related variable should not to be present as a predictor (Draper and Smith, 1981). Coefficient of determination or R-square (R 2 ) shows that how the model of predictors fits dependent variable or variables. Higher R 2 , higher fit of the model and higher model goodness. Moreover, significant test for intercept (b0) is similar to regression coefficients (Kleinbaum et al., 1998). 
Significance tests of the coefficients and R2 help researchers decide which predictors are more important and must be present in the model. Besides these methods, some other techniques have been devised for determining the best model of predictors. Moreover, when the number of predictors increases, many of the variables are usually strongly correlated with each other; it is then not necessary for all of these correlated variables to be present in the model, since they can be used instead of each other (Manly, 2001).

Backward elimination: in this technique, unlike forward selection, all variables initially exist in the model and the less important variables are removed from the model step by step. In the first step, all possible models obtained by removing each one of the variables are considered, and the variable with the least mean square is removed from the model. In the next steps this procedure is repeated, and whenever the P-value is higher
than the standard, the analysis is stopped and the model with the remaining variables is taken as the best predicting model (Burnham and Anderson, 2002).

Forward selection: in this method, in the first step of the analysis, all possible simple regressions, one for each of the independent variables, are calculated, and the variable with the highest mean square (or F-value) enters the regression model as the first and most important predictor. In the second step, the variable entered in the first step is kept in the model, all other possible models containing that variable are built, and the one with the highest mean square is the preferred prediction model. This procedure continues until the P-value of the model becomes higher than the standard P-value; the remaining variables are then not presented in the prediction model (Harrell, 2001).

Stepwise regression: this variable selection method has proved to be an extremely useful computational technique in data analysis problems (Dong et al., 2008). Similar to forward selection, in stepwise regression all possible univariate models are worked out and the variable with the highest mean square enters the model. In the second step, all other possible models containing the first variable are investigated and the variable with the highest mean square is entered into the model; but when the second variable enters, the first variable must be tested for significance in the presence of the second. If the first entered variable is still significant, both variables are kept in the model, but if it is not significant, it is removed from the model. In the following steps this procedure is repeated, and any variable that was entered into the prediction model in previous steps but whose P-value no longer meets the standard is removed. Indeed, this technique uses both the forward selection and backward elimination approaches and is more suitable than either alone (Miller, 2002). A sketch of the selection loop is given below.
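As a hedged illustration of the forward step just described (an assumed sketch, not the paper's own implementation), this function greedily adds the candidate predictor with the best P-value and stops when no remaining candidate is significant; a stepwise variant would additionally re-test the already-entered variables after each addition:

```python
import statsmodels.api as sm

def forward_selection(X, y, names, alpha=0.05):
    """Greedy forward selection on a NumPy predictor matrix X:
    add the candidate with the lowest p-value at each step,
    stop when none passes the significance standard alpha."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        pvals = {}
        for j in remaining:
            cols = selected + [j]
            fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
            pvals[j] = fit.pvalues[-1]      # p-value of the newly added candidate
        best = min(pvals, key=pvals.get)
        if pvals[best] > alpha:
            break                           # no remaining variable is significant
        selected.append(best)
        remaining.remove(best)
    return [names[j] for j in selected]
```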
Path analysis: regression coefficients depend strongly on the units of the variables. Depending on the unit of a variable, its coefficient may be high or low: a variable with a large unit has a high coefficient and vice versa. In order to compare coefficients, the solution is to transform the variables' data to standard data by subtracting the mean and dividing by the standard deviation. After standardizing the data, the variable with the higher coefficient has the higher effect on the dependent variable. When independent variables are correlated with each other, the variables can affect each other. In this situation, the correlation between each independent variable and the dependent variable can be divided into the direct effect of that independent variable and its indirect effects via the other correlated variables (Fig. 1). Using standardized data in the regression model to calculate the regression coefficients gives the direct effects of the variables. The indirect effect of a variable can be estimated by multiplying each related direct effect by the correlation coefficient between the independent variables involved (Shipley, 1997). Therefore, path analysis can be explained as an extension of the regression model, used to test the fit of the correlation matrix against two or more causal models which are being compared by the researcher (Dong et al., 2008).

Figure 1. Diagram of path analysis
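Following the decomposition just described, a minimal sketch (an assumption about the computation, not the paper's code) obtains direct effects as standardized partial regression coefficients and indirect effects through the correlation matrix:

```python
import numpy as np

def path_analysis(X, y):
    """Direct effects = standardized partial regression coefficients;
    indirect effect of Xi = sum over j != i of r(Xi, Xj) * direct_j.
    In the linear model, direct + indirect equals r(Xi, y)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    ys = (y - y.mean()) / y.std(ddof=1)
    direct, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    R = np.corrcoef(Xs, rowvar=False)       # correlations among predictors
    indirect = R @ direct - direct          # off-diagonal contributions only
    total = direct + indirect               # correlation of each Xi with y
    return direct, indirect, total
```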
For a better understanding of the regression techniques mentioned above, we present an example here.

Example 1: we had measured some morphological traits of three wheat cultivars, consisting of tiller numbers/plant, spike length, spikelets/spike, spike weight/plant, grains/spike, grain weight/spike, 100-grain weight, total chlorophyll content of flag leaf, biologic yield/plant, root weight, leaves area and grain yield, under four water regimes (Moucheshi et al., 2012). Here we want to evaluate the relationship between grain yield and the measured morphological traits using the techniques mentioned above.

Multivariate regression

Table 1 shows the regression coefficient values, their standard errors, t-student values and P-values. The total regression equation based on these results is:

Y = 0.5394 - 0.12X1 - 0.02X2 - 0.01X3 + 0.96X4 + 0.01X5 - 0.78X6 - 0.01X7 - 0.004X8 + 0.01X9 + 0.08X10 - 0.001X11

X1= Tiller numbers/plant, X2= Spike length, X3= Spikelets/spike, X4= Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X7= 100-Grain weight, X8= Total chlorophyll content of flag leaf, X9= Biologic yield/plant, X10= Root weight, X11= Leaves area and Y= Grain yield.

The coefficient of determination (R2) is equal to 99.2%, which is very high, but it is not a real coefficient of determination because R2 increases as the number of variables increases. Scientists have introduced the adjusted R2 instead of R2 to address this problem, but it is not a completely accepted index either. Also, in this situation, where the variables are numerous, explaining the relation between the dependent variable and the many independent variables is complex; on the other hand, some coefficient values are very small, and the corresponding variables can be removed from the model. Based on the P-values, most of the variables are not statistically significant. The P-value shows which variables must be present in the model as predictors and which must not. As can be seen in Table 1, X4 and X6 are the variables with P-values lower than 0.05, and we must select them as the most effective variables on yield. The predicting model based on regression analysis will be as follows:

Y = 0.96X4 - 0.78X6

Selection procedures

Backward elimination: in four steps of backward elimination, four variables (X1, X3, X2 and X7) are removed from the model and the other variables remain. Based on this result, the four mentioned variables are the least important variables for predicting yield. By this procedure the predicting model is formulated as follows (Tables 2 and 3):

Y = -0.19 + 0.98X4 + 0.01X5 - 1.54X6 - 0.004X8 - 0.01X9 + 0.1X10 + 0.005X11

X4= Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X8= Total chlorophyll content of flag leaf, X9= Biologic yield/plant, X10= Root weight, X11= Leaves area, and Y= Yield.

Forward selection: similar to backward elimination, seven variables are included in the forward selection model, although the values of the coefficients differ slightly (Table 4):

Y = -0.003 + 0.98X4 - 0.004X5 + 0.01X6 - 0.01X8 - 1.54X9 + 0.11X10 - 0.003X11

X4= Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X8= Total chlorophyll content of flag leaf, X9= Biologic yield/plant, X10= Root weight, X11= Leaves area, and Y= Yield.

Stepwise selection: Table 5 shows the variables entered into, or removed from, the stepwise regression model.
Similar to the results of backward and forward selection, stepwise selection screened seven variables:

Y = -0.195 + 0.98X4 + 0.01X5 - 1.54X6 - 0.004X8 - 0.01X9 + 0.1X10 + 0.005X11

X4= Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X8= Total chlorophyll content of flag leaf, X9= Biologic yield/plant, X10= Root weight, X11= Leaves area, and Y= Yield.

Which model should serve as the predicting model is the choice of the researcher, who can use the model that best explains the idea of the research, but usually stepwise selection is the best. On the other hand, the significance t-test for variables in multivariate regression analysis alone is not a sufficient technique.

Path analysis: to carry out path coefficient analysis properly and understand the relationship between yield and the other morphological traits, a researcher can use the results of the selection procedures in the path analysis, but here we considered all variables. In this technique, the correlation coefficient between yield and each of the measured morphological traits is partitioned into its direct effect and its indirect effects via the other variables on yield. The highest direct effect on yield was obtained for spike weight/plant (1.013), while the other variables had very low direct effects on yield (Table 6). The sum of the indirect effects of spike weight/plant was negative. Except for spike weight/plant, the other variables had high indirect effects on grain yield. Spikelets/spike showed the lowest contribution to grain yield through its direct effect but the highest contribution through other traits.
Table 1. The regression coefficients (B), standard errors (SE), T-values and probabilities of the estimated variables in predicting wheat grain yield by multiple linear regression analysis under inoculation (In) and non-inoculation (Non-In) conditions and different water levels

Predictor  DF  B        SE       T      P
Constant   1   0.5394   0.49180  1.10   0.284
X1         1   -0.1164  0.08245  -1.41  0.171
X2         1   -0.0202  0.05014  -0.40  0.691
X3         1   -0.0082  0.02037  -0.40  0.693
X4         1   0.9617   0.01927  49.90  0.001
X5         1   0.0110   0.00699  1.56   0.131
X6         1   -0.7802  0.34490  -2.26  0.033
X7         1   -0.0070  0.00979  -0.71  0.483
X8         1   -0.0042  0.00318  -1.33  0.196
X9         1   0.0131   0.01165  1.12   0.273
X10        1   0.0840   0.09246  0.91   0.373
X11        1   -0.0008  0.00318  -0.25  0.803

X1= Tiller numbers/plant, X2= Spike length, X3= Spikelets/spike, X4= Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X7= 100-Grain weight, X8= Total chlorophyll content of flag leaf, X9= Biologic yield/plant, X10= Root weight and X11= Leaves area.

Table 2. Summary of backward elimination

Step  Variable removed  Variables remaining  Partial R-Square  Model R-Square  F-Value  Pr > F
1     x1                10                   0                 0.9991          0.02     0.8836
2     x3                9                    0                 0.9991          0.03     0.8558
3     x2                8                    0                 0.9991          0.28     0.6028
4     x7                7                    0                 0.9991          1.06     0.3117

X1= Tiller numbers/plant, X2= Spike length, X3= Spikelets/spike and X7= 100-Grain weight.

Table 3. Backward elimination: variables remaining in the model

Step  Variable   Parameter estimate  Standard error  Sum of squares  F-Value  Pr > F
-     Intercept  -0.19463            0.08673         0.03923         5.040    0.0329
1     x4         0.97670             0.00947         82.8773         640.1    <.0001
2     x5         0.01208             0.00342         0.09736         12.50    0.0014
3     x6         -1.54441            0.21063         0.41875         53.76    <.0001
4     x8         -0.00407            0.00138         0.06753         8.670    0.0064
5     x9         -0.01094            0.00460         0.04402         5.650    0.0245
6     x10        0.09707             0.04682         0.03347         4.300    0.0475
7     x11        0.00505             0.00160         0.07755         9.960    0.0038

X4= Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X8= Total chlorophyll content of flag leaf, X9= Biologic yield/plant, X10= Root weight and X11= Leaves area.

Table 4. Summary of forward selection

Step  Variable entered  Partial R-Square  Model R-Square  Parameter estimate  Standard error  F-Value  Pr > F
1     x4                0.9963            0.9963          0.97859             0.00963         83.85    <.0001
2     x6                0.0013            0.9975          0.01198             0.00341         17.04    0.0002
3     x9                0.0005            0.998           -1.54065            0.21043         7.79     0.0088
4     x5                0.0004            0.9985          -0.00443            0.0043          8.69     0.006
5     x11               0.0002            0.9987          -0.0034             0.00152         5.48     0.0261
6     x8                0.0002            0.9989          -0.01166            0.00465         6.42     0.0169
7     x10               0.0001            0.9991          0.11336             0.04937         4.3      0.0475
-     Intercept         -                 -               -0.12314            0.11097         -        0.0034
Table 5. Relative contribution (partial and model R2), F-value and probability in predicting wheat grain yield by the stepwise procedure analysis under non-inoculation condition and different water levels

Step  Variable entered  Variable removed  Partial R-Square  Model R-Square  P-Value ER  Parameter estimate  Standard error  P-Value M
1     x4                -                 0.9963            0.9963          <.0001      0.9767              0.00947         <.0001
2     x6                -                 0.0013            0.9975          0.0002      -1.54441            0.21063         <.0001
3     x9                -                 0.0005            0.998           0.0088      -0.01094            0.00460         0.0245
4     x5                -                 0.0004            0.9985          0.0060      0.01208             0.00342         0.0014
5     x11               -                 0.0002            0.9987          0.0261      0.00505             0.0016          0.0038
6     x8                -                 0.0002            0.9989          0.0169      -0.00407            0.00138         0.0064
7     x10               -                 0.0001            0.9991          0.0475      0.09707             0.04682         0.0475
-     Intercept         -                 -                 -               -           -0.195              0.0867          0.0329

X4= Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X8= Total chlorophyll content of flag leaf, X9= Biologic yield/plant, X10= Root weight and X11= Leaves area. R-Square= coefficient of determination, P-Value ER= P-value for entering or removing variables and P-Value M= P-value for the final model.

Principal Component Analysis

Principal component analysis (PCA) is a variable reduction procedure that is useful when you have obtained data on a large number of variables and believe that there is some redundancy in those variables (Fig. 2). PCA can be explained as a method that reduces data dimensionality by performing a covariance analysis between variables. Its main advantage is reducing the number of dimensions without much loss of information (Everitt and Dunn, 1992). In this case, redundancy means that some of the variables are correlated with one another, possibly because they are measuring the same construct. PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (PC). The number of principal components is less than or equal to the number of original variables (Dunetman, 1989). This transformation is defined in such a way that the first PC has the largest possible variance, accounting for as much of the variability in the data as possible, and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to (uncorrelated with) the preceding components (Jackson, 1991). The PCs are independent when the data set is jointly normally distributed. The PCs may then be used as predictor or criterion variables in subsequent analyses. PCA is sensitive to the relative scaling of the original variables, and it is mostly used as a tool in exploratory data analysis and for making predictive models (Anderson, 1984). PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or by singular value decomposition of a data matrix, usually after mean-centering (and standardizing, or using Z-scores) the data matrix for each attribute. The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score). Often, the PCA operation can be thought of as revealing the internal structure of the data in a way which best explains the variance in the data (Jackson, 1991).
If a multivariate dataset is visualized as a set of coordinates in a high-dimensional data space (1 axis per variable), PCA can supply the user with a lower-dimensional picture. This is done by using only the first few principal components so that the dimensionality of the transformed data is reduced (Steel and Torrie, 1960).
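As a hedged sketch of the mechanics described above (an illustrative implementation, not code from the paper), the following computes principal components by eigendecomposition of the correlation matrix of standardized data:

```python
import numpy as np

def pca(X, n_components=2):
    # Standardize each variable (Z-scores), as suggested above.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(Xs, rowvar=False)            # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]            # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xs @ eigvecs[:, :n_components]      # component scores
    return eigvals, eigvecs[:, :n_components], scores
```

The columns of the returned eigenvector matrix are the loadings, and the scores give each observation's position in the lower-dimensional picture mentioned above.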
Figure 2. Diagram for principal component analysis

Table 6. Path coefficients (direct and indirect effects) of the measured variables attributed to grain yield variation of wheat under non-inoculation condition and different water levels

Effects via:
Variable  X1      X2      X3     X4     X5     X6      X7      X8      X9      X10    X11    Direct effect  Indirect effect  Y
X1        -0.004  -0.003  0.000  0.503  0.018  -0.030  -0.006  -0.017  -0.019  0.01   0.01   -0.004         0.471            0.467
X2        -0.002  -0.004  0.000  0.495  0.018  -0.026  -0.004  -0.013  -0.019  0.009  0.007  -0.004         0.469            0.465
X3        -0.002  -0.002  0.001  0.641  0.019  -0.028  -0.006  -0.012  -0.018  0.009  0.013  0.001          0.620            0.621
X4        -0.002  -0.002  0.000  1.013  0.018  -0.037  -0.005  -0.012  -0.019  0.012  0.018  1.013          -0.02            0.993
X5        -0.002  -0.003  0.000  0.576  0.032  -0.041  -0.004  -0.015  -0.02   0.01   0.016  0.032          0.522            0.554
X6        -0.002  -0.002  0.000  0.567  0.02   -0.067  -0.003  -0.008  -0.015  0.01   0.02   -0.067         0.591            0.524
X7        -0.003  -0.002  0.000  0.579  0.016  -0.026  -0.008  -0.015  -0.015  0.012  0.013  -0.008         0.563            0.555
X8        -0.003  -0.002  0.000  0.54   0.021  -0.023  -0.006  -0.023  -0.021  0.009  0.011  -0.023         0.532            0.509
X9        -0.003  -0.003  0.000  0.727  0.023  -0.036  -0.005  -0.017  -0.027  0.011  0.013  -0.027         0.716            0.689
X10       -0.002  -0.002  0.000  0.616  0.017  -0.034  -0.005  -0.011  -0.015  0.020  0.013  0.020          0.581            0.601
X11       -0.002  -0.001  0.000  0.673  0.018  -0.048  -0.004  -0.01   -0.014  0.009  0.028  0.028          0.626            0.654

X1= Tiller numbers/plant, X2= Spike length, X3= Spikelets/spike, X4= Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X7= 100-Grain weight, X8= Total chlorophyll content of flag leaf, X9= Biologic yield/plant, X10= Root weight and X11= Leaves area.
Example 2 explains how PCA can be used to explore relationships in a dataset related to agriculture.

Example 2: Fourteen strawberry cultivars were cultivated in two consecutive years (2009-2010) at the research center of agriculture and natural resources of Sanandaj, Iran. Ten variables comprising two sets of data (a first set of morphological traits and a second set of biochemical traits) were measured (Saed-Moucheshi et al., in press). Given the ten parameters used in this research, ten components were calculated by PCA. As expected, PC1 showed the highest eigenvalue (3.51), and thus most of the variation among the data can be explained by this PC. After component 1, PC2, PC3 and PC4 explain more of the variation among the data than the other components. The first four components together explain 85% of the total variation among the data (Table 7). Moreover, these components have eigenvalues higher than unity (1) (Fig. 3), and so these components were used to explain the whole variation among the data. Also, from Fig. 3 it can be observed that an increase in the number of components was associated with a decrease in eigenvalues, which is an important indicator in general genetics as well as an efficient indicator for screening genotypes.

Flowering period had the highest coefficient in PC1. In components 2, 3 and 4, yield, anthocyanin and berry size, respectively, showed the maximum coefficients among the traits. The first component clearly separated two groups of variables containing the chemical and the morphological parameters. Yield, berry size, berry weight, and flowering and fruiting periods had high and negative correlations with PC1, and thus, based upon this component, these traits have the greater effects in contributing to yield. In PC2, petiole long and TSS and also yield showed the highest, negative, correlations with this component. PC2 indicates that petiole long had a high effect on yield; on the other hand, higher yield can provide a higher amount of total soluble solids (TSS). Berry size and berry weight, as well as yield, have very low coefficients in PC3, and based on this component these traits can be important distributors of yield. Flowering and fruiting periods and anthocyanin content showed the highest negative contributions in PC3, and so these two periods can change the anthocyanin content. Titratable acidity (TA) had the highest positive coefficient in PC3, and this trait is independent of the other variables. PC4 also showed that higher yield provides higher TSS content, and direct selection for yield results in more TSS content as well.

Table 7. Principal component analysis of traits measured during two years of strawberry cultivation (loadings on components 1-4)

Trait                       PC1     PC2     PC3     PC4
Anthocyanin                 0.229   -0.191  -0.529  -0.174
Berry size                  -0.379  -0.285  0.009   -0.482
Berry weight                -0.395  -0.258  0.007   -0.463
Flowering period            -0.424  -0.016  -0.342  0.326
Fruiting period             -0.383  0.012   -0.351  0.454
Petiole long                0.084   -0.48   -0.018  0.348
Stolons/plant               0.352   -0.196  -0.404  -0.107
Titratable acidity          0.025   -0.376  0.559   0.268
Total soluble solids        0.37    -0.43   -0.059  0.017
Berry yield                 -0.233  -0.469  0.002   0.077
Eigenvalue                  3.510   2.330   1.430   1.251
Proportion of variance (%)  35.1    23.3    14.3    12.5
Cumulative variance (%)     35.1    58.4    72.7    85.2

Figure 3. Component numbers and their eigenvalues in principal component analysis
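To connect the table to the retention rule used above, the following small sketch reproduces the variance proportions from the eigenvalues in Table 7; the computation (dividing by ten standardized traits) is our assumption about how the reported percentages were derived:

```python
import numpy as np

# Eigenvalues of the first four components as reported in Table 7.
eigvals = np.array([3.510, 2.330, 1.430, 1.251])
proportion = 100 * eigvals / 10       # ten standardized traits -> total variance of 10
cumulative = np.cumsum(proportion)    # 35.1, 58.4, 72.7, 85.2 (%)
keep = eigvals > 1.0                  # Kaiser-style rule: retain eigenvalues above 1
print(proportion, cumulative, keep)
```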
Factor Analysis

Factor analysis (FA), similar to principal component analysis, is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. The purpose of FA is to discover simple patterns in the pattern of relationships among the variables (Spearman, 1904). In other words, it is possible, for example, that variations in three or four observed variables mainly reflect variations in fewer unobserved variables. FA searches for such joint variations in response to unobserved latent variables (Anderson, 1984). The observed variables are modeled as linear combinations of the potential factors, plus error terms. The information gained about the interdependencies between observed variables can be used later to reduce the set of variables in a dataset (Manly, 2001). Computationally, this technique is equivalent to a low-rank approximation of the matrix of observed variables. FA is related to principal component analysis (PCA), but the two are not identical. Latent variable models, including factor analysis, use regression modeling techniques to test hypotheses producing error terms, while PCA is a descriptive statistical technique (Dunetman, 1989). FA is used to study the patterns of relationship among many dependent variables, with the goal of discovering something about the nature of the independent variables that affect them, even though those independent variables were not measured directly.

The different methods of FA first extract a set of factors from a data set. These factors are almost always orthogonal and are ordered according to the proportion of the variance of the original data that they explain. In general, only a (small) subset of factors is kept for further consideration, and the remaining factors are considered either irrelevant or nonexistent (i.e., they are assumed to reflect measurement error or noise). To ease the interpretation of the factors that are considered relevant, the first selection step is generally followed by a rotation of the retained factors. Two main types of rotation are used: orthogonal, when the new axes are also orthogonal to each other, and oblique, when the new axes are not required to be orthogonal to each other. Because the rotations are always performed in a subspace (the so-called factor space), the new axes will always explain less variance than the original factors (which are computed to be optimal), but obviously the part of variance explained by the total subspace after rotation is the same as it was before rotation; only the partition of the variance has changed (Kaiser, 1958).

This model proposes that each observed response (measure 1 through measure 5) is influenced partially by underlying common factors (factor 1 and factor 2) and partially by underlying unique factors (E1 through E5; Fig. 4). The strength of the link between each factor and each measure varies, such that a given factor influences some measures more than others. FA is performed by examining the pattern of correlations (or covariances) between the observed measures. Measures that are highly correlated (either positively or negatively) are likely influenced by the same factors, while those that are relatively uncorrelated are likely influenced by different factors (Manly, 1986).
Figure 4. Diagram of factor analysis

For factor analysis we use the data of Example 1. In these data, the first two factors of the twelve factors in the factor analysis accounted for 60.1% of the total variation in the data structure (Table 8). The first factor was loaded by yield and spike weight/plant, it could explain 49.5% of the total variation in the dependence structure, and its suggested name is yield. The second factor accounted for 10.6% of the total variability and consisted of the total chlorophyll content of the flag leaf, so it was named total chlorophyll. The first two factors have eigenvalues higher than unity (1) and are graphically shown in Fig. 5 (a). In this example, factor analysis showed that spike weight/plant and total chlorophyll content of the flag leaf had the highest relative contributions to wheat grain yield. Such results can also be recognized by means of diagram 5 (b).

Table 8. Rotated (varimax rotation) factor loadings and communalities for the estimated variables of wheat based on the factor analysis technique for inoculation and non-inoculation conditions and different water levels

Variable             Factor1  Factor2  Communality
X1                   0.159    0.384    0.543
X2                   0.194    0.196    0.390
X3                   0.324    0.127    0.451
X4                   0.875    0.157    1.032
X5                   0.230    0.280    0.510
X6                   0.246    0.056    0.302
X7                   0.250    0.247    0.497
X8                   0.220    0.817    1.037
X9                   0.411    0.421    0.832
X10                  0.299    0.138    0.437
X11                  0.374    0.126    0.500
Y                    0.885    0.140    1.025
Latent roots         2.338    1.268    3.606
Factor variance (%)  49.50    10.60    60.10

X1= Tiller numbers/plant, X2= Spike length, X3= Spikelets/spike, X4= Spike weight/plant, X5= Grains/spike, X6= Grain weight/spike, X7= 100-Grain weight, X8= Total chlorophyll content of flag leaf, X9= Biologic yield/plant, X10= Root weight, X11= Leaves area and Y= Grain yield.

Figure 5 (a). Scree plot showing eigenvalues in response to the number of factors for the estimated variables of wheat.

Figure 5 (b). Variables loading by factor analysis and varimax rotation with the first two factors. X1= Tiller numbers/plant, X2= Spike length, X3= Spikelets/spike, X4= Spike weight/plant, X5= Grain number/spike, X6= Grain weight/spike, X7= 100-Grain weight, X8= Total chlorophyll content of flag leaf, X9= Biologic yield/plant, X10= Root weight, X11= Leaves area and Y= Grain yield.
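A brief, hypothetical sketch of how varimax-rotated loadings and communalities like those in Table 8 can be obtained (the file name and the two-factor choice are assumptions, not the paper's computation):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

X = np.loadtxt("wheat_traits.csv", delimiter=",")   # hypothetical observations x variables
Xs = StandardScaler().fit_transform(X)              # Z-scores before factoring

fa = FactorAnalysis(n_components=2, rotation="varimax")
scores = fa.fit_transform(Xs)              # factor scores for each observation
loadings = fa.components_.T                # variables x factors loading matrix
communality = (loadings ** 2).sum(axis=1)  # variance of each variable carried by the factors
```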
Clustering Analysis

Cluster analysis, or clustering, is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters. Clustering is a main task of exploratory data mining and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval and bioinformatics. Cluster analysis itself is not one specific algorithm but a general task to be solved (Romesburg, 1984). It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. Popular notions of clusters include groups with low distances among the cluster members, dense areas of the data space, intervals, or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and the intended use of the results (Richard, 2007). Indeed, cluster analysis is an exploratory data analysis tool for organizing observed data into meaningful taxonomies, groups or clusters, based on combinations of variables, which maximizes the similarity of cases within each cluster while maximizing the dissimilarity between groups that
are initially unknown. In this sense, cluster analysis creates new groupings without any preconceived notion of what clusters may arise (Singh and Chowdhury, 1985). Cluster analysis, like factor analysis, makes no distinction between dependent and independent variables; the entire set of interdependent relationships is examined. Cluster analysis is the obverse of factor analysis: whereas factor analysis reduces the number of variables by grouping them into a smaller set of factors, cluster analysis reduces the number of observations or cases by grouping them into a smaller set of clusters (Johnson and Wicheren, 1996). On the other hand, Everitt and Dunn (1992) and Nouri et al. (2011) stated that the main advantage of using PCA over cluster analysis is that each variable can be allocated to one group only.

The first choice that must be made in carrying out a clustering analysis is how similarity (or, alternatively, distance) between data is to be defined. There are many ways to compute the similarity of series of data, such as Pearson correlation, Spearman rank correlation (for non-numeric data), Euclidean distance, etc. (Romesburg, 1984). After choosing a distance method for measuring similarity, a related clustering method, such as a hierarchical or non-hierarchical algorithm, must be used. The hierarchical method is the most popular; in this procedure we construct a hierarchy or tree-like structure to see the relationships among cases. The clusters can be arrived at either by weeding out dissimilar observations (divisive method) or by joining together similar observations (agglomerative method). Most common statistical packages use the agglomerative method, and the most popular agglomerative methods are (1) single linkage (nearest neighbor approach), (2) complete linkage (furthest neighbor), (3) average linkage, (4) Ward's method, and (5) the centroid method (Everitt, 1993).

Example 3: Twenty chickpea cultivars were cultivated in 2005 at the research center of Razi University, Kermanshah, Iran, under rainfed conditions (Moucheshi et al., 2009-2010). Yield and its components were measured, and the cultivars were grouped using cluster analysis based on the measured traits. Cluster analysis of the chickpea genotypes based on grain yield and its components (Fig. 6) classified the genotypes into four groups with 5, 4, 2 and 9 genotypes, respectively. The highest distance, or dissimilarity, between genotypes was observed for genotypes 1 and 17, and the highest similarity was obtained for genotypes 18 and 20. Based on these results, the genotypes grouped within a cluster may have a common origin; on the other hand, crossing between genotypes in distant clusters, such as the first and fourth clusters, can provide much variation for plant breeding aims.

Figure 6. Results of cluster analysis for 20 chickpea genotypes under rainfed condition
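A hedged sketch of an agglomerative analysis in the spirit of Example 3, using Ward's method; the input file and the four-cluster cut are illustrative assumptions, not the study's actual pipeline:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.loadtxt("chickpea_traits.csv", delimiter=",")  # hypothetical genotypes x traits matrix
Z = linkage(X, method="ward")                         # agglomerative, Ward's method (Euclidean)
groups = fcluster(Z, t=4, criterion="maxclust")       # cut the tree into four clusters
dendrogram(Z)                                         # tree-like diagram, as in Figure 6
plt.show()
```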
Canonical Correlation

In statistical techniques, dependence refers to any statistical relationship between two random variables or two sets of data, and correlation refers to any of a broad class of statistical relationships involving dependence; familiar dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a product and its price. Formally, dependence refers to any situation in which random variables do not satisfy a mathematical condition of probabilistic independence (Steel and Torrie, 1960). Correlation can refer to any departure of two or more random variables from independence, but technically it refers to any of several more specialized types of relationship between mean values. There are several correlation coefficients, often denoted ρ or r, measuring the degree of correlation. The most common of these is the Pearson correlation coefficient, which is sensitive only to a linear relationship between two variables (which may exist even if one is a nonlinear function of the other). Other correlation coefficients have been developed to be more robust than the Pearson correlation, that is, more sensitive to nonlinear relationships (Johnson and Wicheren, 1996).

In a canonical correlation (a multiple-multiple correlation), the data can be divided into two sets of related variables, one set containing two or more X variables and another containing two or more Y variables, where the goal is to describe the relationships between the two sets. One finds the canonical weights (coefficients) a1, a2, a3, ..., ap to be applied to the X variables and b1, b2, b3, ..., bm to be applied to the Y variables in such a way that the correlation between CVX1 and CVY1 is maximized (Bratlet, 1974):

CVX1 = a1X1 + a2X2 + ... + apXp
CVY1 = b1Y1 + b2Y2 + ... + bmYm

CVX1 and CVY1 are the first canonical variates, and their correlation is the sample canonical correlation coefficient for the first pair of canonical variates (Fig. 7). The residuals are then analyzed in the same fashion to find a second pair of canonical variates, CVX2 and CVY2, whose weights are chosen to maximize the correlation between CVX2 and CVY2, using only the variance remaining after the variance due to the first pair of canonical variates has been removed from the original variables. This continues until a "significance" cutoff is reached or the maximum number of pairs (which equals the smaller of m and p) has been found (Giffins, 1985).

Figure 7. Diagram for canonical correlation
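As a brief, assumed illustration of these equations (not the paper's computation), canonical variates and the first canonical correlation can be obtained as follows; the file names are hypothetical:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

X = np.loadtxt("yield_set.csv", delimiter=",")   # first set, e.g. yield and its components
Y = np.loadtxt("photo_set.csv", delimiter=",")   # second set, e.g. photosynthesis traits

cca = CCA(n_components=2)
U, V = cca.fit_transform(X, Y)                   # canonical variates CVX and CVY
r1 = np.corrcoef(U[:, 0], V[:, 0])[0, 1]         # first canonical correlation coefficient
print(r1)
```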
Example 4: nine variables comprising two sets (yield and its components, and photosynthesis-related traits) were measured in 20 chickpea genotypes under rainfed conditions at Razi University, Kermanshah, Iran, in 2005. We want to consider the relationship between these two sets of variables (unpublished data). The number of roots (eigenvalues, or squared canonical correlations) is equal to the number of variables in the smaller set of data; therefore, the number of roots in this example is 4 (Fig. 8). In this example none of the canonical correlations between the sets of variables is significant, and so there is no relationship between these two sets (Table 9). For a better understanding of this correlation, we assume that the first canonical correlation (0.428) is significant. Yield has the highest, negative, contribution to the first root within the first set of the data, while 100-seed weight has a high positive contribution. The highest negative contribution within the second set of the data belonged to chlorophyll fluorescence, and the highest positive one was observed for chlorophyll b. These results show that yield and chlorophyll fluorescence have a direct relationship, with chlorophyll a and also number of pods per plant contributing somewhat to this relationship; these variables are negatively correlated with the first root. On the other hand, 100-seed weight, seed weight, number of seeds per plant, chlorophyll b and chlorophyll ab have a direct relationship with one another, although the contributions of SW and NSP in the first set and Ch ab in the second set are low; these variables are positively correlated with the first root.

The redundancy index is the amount of variance in a canonical variate (dependent or independent) explained by the other canonical variate in the canonical function. It can be computed for both the dependent
and the independent canonical variates in each canonical function. The variability of each set of the data explained by the other in this example is very low (3.00% for the first set and 6.83% for the second set).

Table 9. Summary of the canonical correlation analysis

Variable    Root1    Root2    Root3    Root4    Variance extracted  Total redundancy
100SW       2.694    -0.746   1.906    0.89
SW          0.588    0.1      -1.716   0.245
NSP         0.677    0.907    0.107    -0.06
NPP         -0.231   -0.83    -0.231   -0.954
Y           -2.897   1.042    0.038    -1.069   70.22%              3.00%
Ch a        -0.288   0.999    -0.118   0.147
Ch b        0.529    0.51     -0.076   -0.836
Ch ab       0.402    -0.375   -0.778   0.52
Ch f        -0.903   -0.202   -0.452   -0.253   100%                6.83%
Eigenvalue  0.1835   0.0805   0.0525   0.001
Can Corr    0.428    0.284    0.229    0.032
P-value     0.55817  >0.56    >0.56    >0.56

100SW: 100-seed weight; SW: seed weight per plant; NSP: number of seeds per plant; NPP: number of pods per plant; Y: yield; Ch a: chlorophyll a; Ch b: chlorophyll b; Ch ab: total chlorophyll content; and Ch f: chlorophyll fluorescence. Variance extracted and total redundancy are reported once per set, on the last row of the first set (Y) and of the second set (Ch f).

Figure 8. Plot of eigenvalues by root number and their contribution to the canonical correlation

Research work in agriculture and plant science is often somewhat weak in its statistical discussion and interpretation. This review has explained the most widely applied multivariate statistical methods that researchers in agriculture and plant science can use in their investigations to give more authority to their work and results.
References

Anderson TW, 1984. An introduction to multivariate statistical analysis. John Wiley, New York.
Bratlet MS, 1974. The general canonical correlation distribution. Annals of Mathematical Statistics 18: 1-17.
Burnham KP, Anderson DR, 2002. Model selection and multimodel inference. Springer, New York.
Dong B, Liu M, Shao HB, Li Q, Shi L, Du F, Zhang Z, 2008. Investigation on the relationship between leaf water use efficiency and physio-biochemical traits of winter wheat under rainfed condition. Colloids and Surfaces B: Biointerfaces 62: 280-287.
Draper NR, Smith H, 1966. Applied regression analysis. John Wiley, New York.
Draper NR, Smith H, 1981. Applied regression analysis. John Wiley, New York.
Dunetman GH, 1989. Principal component analysis. Sage Publications, Newbury Park.
Everitt BS, 1993. Cluster analysis. Wiley, New York.
Everitt BS, Dunn G, 1992. Applied multivariate data analysis. Oxford University Press, New York.
Giffins R, 1985. Canonical analysis: a review with application in ecology. Springer-Verlag, Berlin.
Harrell FE, 2001. Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. Springer-Verlag, New York.
Jackson JE, 1991. A user's guide to principal components. John Wiley, New York.
Johnson RA, Wicheren DW, 1996. Applied multivariate statistical analysis. Prentice Hall of India, New Delhi.
Kaiser HF, 1958. The varimax criterion for analytic rotation in factor analysis. Psychometrika 23: 187-200.
Kleinbaum DG, Kupper LL, Muller KE, 1988. Applied regression analysis and other multivariable methods. PWS-Kent Publishing Co, Boston.
Manly BFJ, 1986. Multivariate statistical methods: a primer. Chapman and Hall, London - New York.
Manly BFJ, 2001. Statistics for environmental science and management. Chapman and Hall/CRC, Boca Raton.
Miller AJ, 2002. Subset selection in regression. Chapman and Hall, London.
Moucheshi AS, Heidari B, Dadkhodaie A, 2009-2010. Genetic variation and agronomic evaluation of chickpea cultivars for grain yield and its components under irrigated and rainfed growing conditions. Iran Agricultural Research 28-29: 39-50.
Moucheshi A, Heidari B, Assad MT, 2012. Alleviation of drought stress effects on wheat using arbuscular mycorrhizal symbiosis. International Journal of AgriScience 2: 35-47.
Nouri A, Etminan A, Dasilva D, Mohammad R, 2011. Assessment of yield, yield-related traits and drought tolerance of durum wheat genotypes (Triticum turgidum var. durum Desf.). Australian Journal of Crop Science 5: 8-16.
Richard AJ, 2007. Applied multivariate statistical analysis. Prentice Hall.
Romesburg HC, 1984. Cluster analysis for researchers. Lifetime Learning Publications, Belmont.
Saed-Moucheshi A, Karami F, Nadafi S, Khan AA, in press. Heritability, genetic variability and interrelationship among some morphological and chemical parameters of strawberry cultivars. Pakistan Journal of Botany.
Shipley B, 1997. Exploratory path analysis with applications in ecology and evolution. The American Naturalist 149: 1113-1138.
Singh RK, Chowdhury BD, 1985. Biometrical methods in quantitative genetic analysis. Kalyani Publishers, Ludhiana, New Delhi.
Spearman C, 1904. General intelligence, objectively determined and measured. American Journal of Psychology 15: 201-293.
Steel RGD, Torrie JH, 1960. Principles and procedures of statistics. McGraw Hill Book Co. Inc., New York.