SlideShare ist ein Scribd-Unternehmen logo
1 von 6
Downloaden Sie, um offline zu lesen
International Journal of Research in Computer Science
eISSN 2249-8265 Volume 2 Issue 2 (2012) pp. 23-28
© White Globe Publications
www.ijorcs.org


          MULTIPLE LINEAR REGRESSION MODELS IN
                   OUTLIER DETECTION
                     S.M.A.Khaleelur Rahman1, M.Mohamed Sathik2, K.Senthamarai Kannan3
                            12
                               Sadakathullah Appa College, Tirunelveli, Tamilnadu, India
                        3
                          Manonmaniam Sundaranar University, Tirunelveli, Tamilnadu, India

  Abstract: Identifying anomalous values in the real-        expected (and not due to any anomalous condition).
world database is important both for improving the           Outliers are most extreme observations with maximum
quality of original data and for reducing the impact of      or minimum sample. However, the sample maximum
anomalous values in the process of knowledge                 and minimum are not always outliers because they
discovery in databases. Such anomalous values give           may not be unusually far from other observations.
useful information to the data analyst in discovering        Thus, the term “outliers” to values “that lies very far
useful patterns. Through isolation, these data may be        from the middle of the distribution in either direction”.
separated and analyzed. The analysis of outliers and         This definition is applicable for continuously valued
influential points is an important step of the regression    variables having a smooth pdf values. Sometimes,
diagnostics. In this paper, our aim is to detect the         numeric distance is not the only consideration in
points which are very different from the others points.      detecting continuous outliers. “An outlier is a single,
They do not seem to belong to a particular population        or very low frequency, occurrence of the value of a
and behave differently. If these influential points are to   variable that is far away from the bulk of the values of
be removed it will lead to a different model.                the variable”. The frequency of occurrence should be
Distinction between these points is not always obvious       an important criterion for detecting outliers in
and clear. Hence several indicators are used for             categorical (nominal) data, which is quite common in
identifying and analyzing outliers. Existing methods of      the real-world databases. A more general definition of
outlier detection are based on manual inspection of          an outlier is given in: an observation (or subset of
graphically represented data. In this paper, we present      observations) which appears to be inconsistent with the
a new approach in automating the process of detecting        remainder of that set of data [1]. The real cause of
and isolating outliers. Impact of anomalous values on        outlier occurrence is usually unknown to data users
the dataset has been established by using two                and/or analysts. Sometimes, this is a flawed value,
indicators DFFITS and Cook’sD. The process is based          resulting from the poor quality of a data set, i.e., a
on modeling the human perception of exceptional              data entry or a data conversion error.
values by using multiple linear regression analysis.
                                                                Outlier detection methods have been used to detect
Keywords: Cut-value, Cook’s D, DFFITS, multiple              and remove anomalous values from data [7]. There are
regression analysis, outlier detection.                      three approaches in the outlier detection process.
                                                                Determine the outliers through learning approach.
                 I. INTRODUCTION                             No prior knowledge of the data is similar to
   IN statistics, outlier is an observation that is          unsupervised clustering. This approach processes the
numerically distant from the rest of the data [1].           data as a static distribution, pinpoints the most remote
Grubbs defined outlier as an observation that appears        points and identify them as potential outliers.
to deviate markedly from other members of the sample            Pre-labeled data are marked as both normal and
in which it occurs [3]. Occurrence of outliers may be        abnormal. This approach is similar to supervised
by chance in any distribution, but they are often            classification .This is a semi-supervised recognition or
indicative either of measurement error or of one             detection task. This algorithm learns to recognize.
population that has a heavy-tailed distribution. If the
occurrence is by chance, robust statistical methods are                      II. RELATED WORK
being used to discard them. In case they indicate that
the distribution has high kurtosis then one should be           Mathematical calculations are used to find out
very cautious in using tools or intuitions that assume a     whether the outlier came from the same or different
normal distribution. A mixture model is frequently           population. Many statistical methods have been
used for these two distinct sub-populations. Outlier         devised for detecting outliers by measuring how far the
points can indicate faulty data or erroneous procedures      outlier is away from the other values. This can be the
where a certain theory might not be valid. However, in       difference between the outlier and the mean of all
large samples, a small number of outliers is to be           points or the difference between the outlier and the


                                                                              www.ijorcs.org
24                                                S.M.A.Khaleelur Rahman, M.Mohamed Sathik, K.Senthamarai Kannan

mean of the remaining values, or the difference             influences upon a single dependent variable. Further,
between the outlier and the next closest value.             because of omitted variables bias with simple
                                                            regression, multiple regression is necessitated, because
   Different computer-based approaches have been            we are interested in the effects of one of the
proposed for detecting outlying data and it cannot be       independent variables. Outliers are subject to masking
claimed that this is the generic or universally             and swamping [2][1]. One outlier masks a second
acceptable method. Therefore , these approaches were        outlier. In the presence of the first outlier the second is
classified into four major categories based on the          not considered as an outlier but can be considered an
techniques used , which are: distribution-based,            outlier only by itself. Thus, after the deletion of the
distance-based, density-based and deviation-based           first outlier the second appears as an outlier. In
approaches [7]. Distribution-based approaches develop       masking the resulting distance of the outlying point
statistical models from the given data and then             from the mean is short.
identify outliers with respect to the model using
discordance test. Data sets may follow normal or                          III. PROPOSED WORK
Poison distribution. Objects that have low probability
belong to the statistical model are declared as outliers.      In this paper, a new approach for outliers detection
However, Distribution-based approaches cannot be            to detect the values that are in an abnormal distance
applied on multidimensional data because they are           from other values is used. A number of suspicious
univariate in nature. In addition, a prior knowledge of     observations are identified according to the indicators.
the data distribution is required, making the               To highlight the suspicious individuals the following
distribution-based approaches difficult to be used in       indicators are used with a set of cut values. Various
practical applications. In the distance-based approach,     indicators and their cut values are
outliers are detected by measuring distance. Main
limitations of statistical methods are countered by this
                                                                       Indicator                  Cut value
approach. Rather than working on statistical tests ,
objects that do not have enough neighbors are defined                  DFFITS                 2 * SQRT (p / n)
based on the distance from the given object. Density-                  Cook’s D                  4 / (n - p)
based approaches compute the density of regions in the
data and declare the objects in low dense regions as
outliers. An outlier score to any given data point,            Here n is the dataset size, and p is the number of
known as Local Outlier Factor (LOF), depending on its       estimated parameters (number of descriptors + 1 for a
distance from its local neighborhood, is assigned.          regression with intercept). This method evaluates the
Deviation-based approaches do not use statistical tests     overall influence of each observation.
or distance-based measures to identify outliers. The           DFFITS measures how much an observation has
objects that deviate from the description are treated as    affected its fitted value from the regression model. It
outliers.                                                   also measures values larger than in absolute value.
    Outlier detection methods can be divided as             These measures are highly influential. In DFFITS , the
univariate methods, and multivariate methods.               differences in individual fitted values with and without
Currently many researches are worked upon                   particular cases are calculated.
multivariate methods. Another fundamental taxonomy          The following rules[3] are followed to consider the
of outlier detection methods is between parametric          case influential if
(statistical) methods and non-parametric methods that
are model-free. Statistical parametric methods work on        • DFFITS > 1 for small data sets, or if
the assumption that underlying observations are known         • DFFITS > 2 (p/n)1/2 for large data sets
or, at least, they are based on statistical estimates of
unknown distribution parameters [3] [4]. These                 Cook’sD measures aggregate impact of each
methods flag those observations that deviate from the       observation on the group of regression coefficients, as
model assumptions as outliers. They are often               well as the group of fitted values. By this method,
unsuitable for high-dimensional data sets and for           differences in an aggregate of all fitted values with
arbitrary data sets without prior knowledge of the          and without particular case values are calculated. It
underlying data distribution .                              measures the influence of cases on all n fitted values .
                                                            Cook’sD does not require that n different regressions
   Regression analysis is a statistical tool for the        be run. According to Cook’sD, a particular case can
investigation of relationships between variables [2].       be influential by [5]
Multiple regression is a technique that allows
additional factors to enter the analysis separately so        • Having a large residual ei
that the effect of each can be estimated. It is valuable      • Having a high leverage hii
for quantifying the impact of various simultaneous            • Having both


                                                                             www.ijorcs.org
Multiple Linear Regression Models in Outlier Detection                                                              25

   Find the percentile value corresponding to D in the           CrRate     Male14-                Unemp14-      FInc
                                                                                      Education
F(p, n-p) distribution. If the percentile is less than 0.10                   24                      24
or 0.20(not p value), then the particular case is                 79.1       151         91          108         394
relatively little influential. If the percentile is 0.5 or        163.5      143         113          96         557
more, the case has a major influence on the fit.                  57.8       142         89           94         318
                                                                  196.9      136         121         102         673
    Most assumptions of multiple regression cannot be
tested explicitly but gross deviations and violations can         123.4      141         121          91         578
be detected and dealt with appropriately. Outliers bias           68.2       121         110          84         689
the results and skew the regression line in a particular          96.3       127         111          97         620
direction , thereby leading to biased regression                  155.5      131         109          79         472
coefficients. Exclusion of a single deviated value can            85.6       157         90           81         421
give a completely different set of results and model.             70.5       140         118         100         526
The regression line demonstrates the prediction of the            167.4      124         105          77         657
dependent variable (Y), by giving the independent                 84.9       134         108          83         580
variables (X). Observed points are varied around the              51.1       128         113          77         507
fitted regression line. The deviation of a particular             66.4       135         117          77         529
point from the regression line (predicted value) is               79.8       152         87           92         405
called the residual value. Multiple regression analysis           94.6       142         88          116         427
is capable of dealing with an arbitrarily large number            53.9       143         110         114         487
of explanatory variables. With n explanatory variables,           92.9       135         104          89         631
multiple regression analysis will estimate the sum of              75        130         116          78         627
squared errors. Its intercept implies the constant term,          122.5      125         108         130         626
and its slope in each dimension implies one of the                74.2       126         108         102         557
regression coefficients.                                          43.9       157         89           97         288
   In regression analysis , the term "leverage" is used           121.6      132         96           83         513
for an undesirable effect . High leverage values do not           52.3       130         116          70         486
make considered an "outlier" has an over proportional             199.3      131         121         102         674
effect on the resulting regression curve. This effect             34.2       135         109          80         564
may completely corrupt great impact on coefficients.              121.6      152         112         103         537
It means that a single data point which is located well           104.3      119         107          92         637
outside the bulk of the data and may be considered an             69.6       166         89           72         396
"outlier" has an over proportional effect on the                  37.3       140         93          135         453
resulting regression curve. This effect may completely            75.4       125         109         105         617
corrupt a regression model depending on the number                107.2      147         104          76         462
of the samples and the distance of the outlier from the           92.3       126         118         102         589
rest of the data. This regression model would be                  65.3       123         102         124         572
good in the absence of an outlier.                                127.2      150         100          87         559
                                                                  83.1       177         87           76         382
        IV. RESULTS AND DISCUSSION
                                                                  56.6       133         104          99         425
    Now, we investigate our proposed method by using              82.6       149         88           86         395
a synthetic data set. This is an artificial data set with         115.1      145         104          88         488
two dimensions and values are entered to demonstrate               88        148         122          84         590
how regression methods are used for detecting outliers            54.2       141         109         107         489
and their influence. This data set has 14 attributes with         82.3       162         99           73         496
47 examples. All         14 attributes are continuous              103       136         121         111         622
attributes.                                                       45.5       139         88          135         457
    This dataset has 47 instances. We want to explain             50.8       126         104          78         593
the continuous attribute CrRate(CrimeRate) from four              84.9       130         121         113         588
attributes such as Male14-24, Education, Unemp14-24                                   Table 1
(Unemployed) and FInc(FamilyIncome). All the four
                                                                 Table 1 displays the number of outliers detected for
attributes are also continuous in nature.
                                                              two statistical procedures DFFITS and Cook’D.
                                                              Cook’s distance follows an F distribution so that
                                                              sample size is an important factor. Large residuals tend
                                                              to have large Cook’s distance[5]. Influential
                                                              observations are measured on the coefficient estimate.


                                                                              www.ijorcs.org
26                                                 S.M.A.Khaleelur Rahman, M.Mohamed Sathik, K.Senthamarai Kannan

   Outliers are very different to the others i.e. they              8           2.95457339       1.08224845
look, do not belong to the analysed population and if               9           0.10356779       0.03245340
we remove these influential values, they lead us to a
different model. The distinction between these kinds of             10          -0.69109583      -0.20852734
points is not always obvious. One hypothesis is                     11          1.89708304       0.81378603
formulated about the relationship between the                       12          -0.44522813      -0.09383719
variables of interest, here, Education and CrimeRate,
                                                                    13          -0.69840181      -0.26410747
Male14-24 and CrimeRate, Unemp14-24 and Crime
Rate, Family Income and CrimeRate. Common                           14          -0.66027892      -0.21406586
experience suggests that less family income is an                   15          0.23852123       0.07178046
important factor for inducing unemployed youth to                   16          0.86122960       0.29989931
commit more crimes. In our database only four records
are discriminated which implies normal assumption is                17          -0.96075284      -0.27731615
tend to make more money, people commit crime so                     18          -0.66088492      -0.21784022
that crime rate will be high for the group with less                19          -1.01672661      -0.27869084
family income. But this assumption proved wrong here
                                                                    20          0.62035054       0.24444196
that male12-24 hailed from high family income
commit more crimes. Thus, the hypothesis is that                    21          -0.30894044      -0.06945060
higher level of unemployment causes higher level of                 22          -0.07215249      -0.03140858
Crime but other factors also affect it. To investigate              23          1.31708193       0.41059065
this hypothesis, we analyzed the data on Education and
                                                                    24          0.31558868       0.15823852
FamilyIncome (earnings) for the individuals at the age
group of 14-24. Here CrimeRate is the dependent                     25          -0.58615857      -0.28383282
variable and remaining four variables male-14-24,                   26          2.43565369       0.75912845
Education, Unemployed 14-24 and FamilyIncome are                    27          -1.91721714      -0.40625769
independent variables. Let E denote education in years
of schooling for each individual, and let I denote that             28          0.35539761       0.12204570
individual’s earnings in dollars per year. We can plot              29          0.18945691       0.06703783
this information for all of the individuals in the sample           30          -0.50027841      -0.20941073
using a two-dimensional diagram, conventionally
                                                                    31          -1.08754754      -0.46675214
termed a “scatter” diagram.
                                                                    32          -0.72746843      -0.18305531
   Effect of each variable can be estimated separately              33          0.72281331       0.18214980
by using Multiple regression. Impact and influence
of different variables upon a single dependent variable             34          -0.05855739      -0.01478066
can also be quantified. As an investigator, we are                  35          -0.60047084      -0.22071476
interested only in the effects of different independent             36          0.45041770       0.14851908
variables. Four independent variables Male14-24,
                                                                    37          -0.37726417      -0.21679878
Unemployed14-24, Family Income, Education and one
explanatory variable CrimeRate are used to depict the               38          -0.00519032      -0.00182978
effect in different scatter diagrams. More influential              39          0.50739807       0.15381441
points in Cook’D have the higher values. In table2                  40          0.80370468       0.14202599
these suspicious values are highlighted in records with
                                                                    41          -1.01047850      -0.40952945
number 4, 8,11 and 26.
                                                                    42          -0.87981868      -0.20346297
                                                                    43          -0.83315825      -0.34374225
     Statistic     DFFITS             Cook's D
                                                                    44          -0.36672667      -0.12044833
     Lower bound   0.6523             -
                                                                    45          -0.82375938      -0.39888713
     Upper bound   0.6523             0.0952
                                                                    46          -1.32948601      -0.44754258
            1         0.32646009          0.09616140
                                                                    47          -0.42401546      -0.13260822
            2         1.78480983          0.39986444
                                                                                   Table 2
            3         0.66405863          0.31023771
            4         2.18146706          0.77293313
            5         0.43511620          0.13569903
            6        -1.47110713          -0.59439152
            7        -0.20111063          -0.04528905


                                                                            www.ijorcs.org
Multiple Linear Regression Models in Outlier Detection                                            27

                      Attribute         Coef.               std         t(42)           p-value

                      Intercept        -222.752          128.140538   -1.738341        0.089478

                     Male14-24         1.160999           0.567406    2.046153         0.047035

                      Education        0.091014           0.674681    0.134899         0.893337

                    Unemp14-24         0.007371           0.293173    0.025141         0.980062

                     FamIncome         0.270389           0.089632    3.016646         0.004327
                                                          Table 3




                       Figure 1                                                      Figure 3




                       Figure 2                                                      Figure 4




                                                                                www.ijorcs.org
28                                                     S.M.A.Khaleelur Rahman, M.Mohamed Sathik, K.Senthamarai Kannan

Global results
Endogenous attribute                              CrimeRate
Examples                                                  47
R²                                                  0.271953
Adjusted-R²                                         0.238860
Sigma error                                        33.742476
F-Test (2,44)                                (8.2178,000928)

Coefficients
Attribute        Coef.          std        t(44)     p-value
                        -
Intercept                   102.132327   -2.103677   0.041157
               214.853376
FamInco
                 0.277415     0.069458    3.993974   0.000243
me
Male14-
                 1.151818     0.533284    2.159859   0.036273
24

                    V. CONCLUSION
   In this paper multiple linear regression with n
explanatory variables are used. We proved through
examples that Multiple regression allows analysis of
additional factors to estimate effect of each variable
and quantifying the impact of simultaneous influence
upon a single dependant variable. Regression models
also explain the variation in the dependent variable
well. Here we use this for predictive purposes. Above
two statistics Cook’sD and DFFITS after careful
consideration, omit influential points and regressions
are refitted. Effect of the case can be studied by
deleting the particular case from the data and
analyzing the rest of the population. Multiple outliers
are detected in multiple linear regression model.
Distributions are used to get suitable cutoff points.
Procedures are proposed and implemented with
examples. Regression is not robust if the data set is
small and can lead to inaccurate estimates.

                   VI. REFERENCES
[1] Hawkins. D. Identification of Outliers , Chapman and
      and Hall , London, 1980
[2] Barnett, V. and Lewis, T.: 1994, Outliers in Statistical
      Data. John Wiley & Sons., 3rd edition.
[3]   Grubbs, F. E.: 1969, Procedures for detecting outlying
      observations in samples. Technometrics 11, 1–21.
[4]   Rousseeuw, P. and Leroy, A.: 1996, Robust Regression
      and Outlier Detection. John Wiley & Sons., 3rd edition..
[5]   Cook R. D and Weisberg S.T. (1982), Residuals and
      influence in New York Chapman and Hall.
[6]   Abraham , B., and A.Chuang. “Outlier Detection and
      Time Series Modelling.” Technometrics(1989)
[7]   Jiawei Han and Micheline Kamber “Data Mining
      concepts and Techniques” Elsever, Second Edition




                                                                                www.ijorcs.org

Weitere ähnliche Inhalte

Was ist angesagt?

Selection of appropriate data analysis technique
Selection of appropriate data analysis techniqueSelection of appropriate data analysis technique
Selection of appropriate data analysis techniqueRajaKrishnan M
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis緯鈞 沈
 
Sampling distribution
Sampling distributionSampling distribution
Sampling distributionswarna dey
 
Inferential Statistics
Inferential StatisticsInferential Statistics
Inferential Statisticsewhite00
 
JSM2013,Proceedings,paper307699_79238,DSweitzer
JSM2013,Proceedings,paper307699_79238,DSweitzerJSM2013,Proceedings,paper307699_79238,DSweitzer
JSM2013,Proceedings,paper307699_79238,DSweitzerDennis Sweitzer
 
A Decision Tree Based Classifier for Classification & Prediction of Diseases
A Decision Tree Based Classifier for Classification & Prediction of DiseasesA Decision Tree Based Classifier for Classification & Prediction of Diseases
A Decision Tree Based Classifier for Classification & Prediction of Diseasesijsrd.com
 
Poor man's missing value imputation
Poor man's missing value imputationPoor man's missing value imputation
Poor man's missing value imputationLeonardo Auslender
 
Reverse Nearest Neighbors in Unsupervised Distance-Based Outlier Detection
Reverse Nearest Neighbors in Unsupervised Distance-Based Outlier DetectionReverse Nearest Neighbors in Unsupervised Distance-Based Outlier Detection
Reverse Nearest Neighbors in Unsupervised Distance-Based Outlier Detection1crore projects
 
2013jsm,Proceedings,DSweitzer,26sep
2013jsm,Proceedings,DSweitzer,26sep2013jsm,Proceedings,DSweitzer,26sep
2013jsm,Proceedings,DSweitzer,26sepDennis Sweitzer
 
Multivariate Analysis Techniques
Multivariate Analysis TechniquesMultivariate Analysis Techniques
Multivariate Analysis TechniquesMehul Gondaliya
 
Sampling and statistical inference
Sampling and statistical inferenceSampling and statistical inference
Sampling and statistical inferenceBhavik A Shah
 
Business Research Methods PPT - III
Business Research Methods PPT - IIIBusiness Research Methods PPT - III
Business Research Methods PPT - IIIRavinder Singh
 
PREDICTING CLASS-IMBALANCED BUSINESS RISK USING RESAMPLING, REGULARIZATION, A...
PREDICTING CLASS-IMBALANCED BUSINESS RISK USING RESAMPLING, REGULARIZATION, A...PREDICTING CLASS-IMBALANCED BUSINESS RISK USING RESAMPLING, REGULARIZATION, A...
PREDICTING CLASS-IMBALANCED BUSINESS RISK USING RESAMPLING, REGULARIZATION, A...IJMIT JOURNAL
 

Was ist angesagt? (19)

Selection of appropriate data analysis technique
Selection of appropriate data analysis techniqueSelection of appropriate data analysis technique
Selection of appropriate data analysis technique
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Advanced statistics
Advanced statisticsAdvanced statistics
Advanced statistics
 
Sampling distribution
Sampling distributionSampling distribution
Sampling distribution
 
Inferential Statistics
Inferential StatisticsInferential Statistics
Inferential Statistics
 
JSM2013,Proceedings,paper307699_79238,DSweitzer
JSM2013,Proceedings,paper307699_79238,DSweitzerJSM2013,Proceedings,paper307699_79238,DSweitzer
JSM2013,Proceedings,paper307699_79238,DSweitzer
 
A Decision Tree Based Classifier for Classification & Prediction of Diseases
A Decision Tree Based Classifier for Classification & Prediction of DiseasesA Decision Tree Based Classifier for Classification & Prediction of Diseases
A Decision Tree Based Classifier for Classification & Prediction of Diseases
 
Poor man's missing value imputation
Poor man's missing value imputationPoor man's missing value imputation
Poor man's missing value imputation
 
Reverse Nearest Neighbors in Unsupervised Distance-Based Outlier Detection
Reverse Nearest Neighbors in Unsupervised Distance-Based Outlier DetectionReverse Nearest Neighbors in Unsupervised Distance-Based Outlier Detection
Reverse Nearest Neighbors in Unsupervised Distance-Based Outlier Detection
 
Sec 1.3 collecting sample data
Sec 1.3 collecting sample data  Sec 1.3 collecting sample data
Sec 1.3 collecting sample data
 
Data interpretation
Data interpretationData interpretation
Data interpretation
 
Displaying your results
Displaying your resultsDisplaying your results
Displaying your results
 
2013jsm,Proceedings,DSweitzer,26sep
2013jsm,Proceedings,DSweitzer,26sep2013jsm,Proceedings,DSweitzer,26sep
2013jsm,Proceedings,DSweitzer,26sep
 
Multivariate Analysis Techniques
Multivariate Analysis TechniquesMultivariate Analysis Techniques
Multivariate Analysis Techniques
 
Sampling and statistical inference
Sampling and statistical inferenceSampling and statistical inference
Sampling and statistical inference
 
Sampling
SamplingSampling
Sampling
 
Business Research Methods PPT - III
Business Research Methods PPT - IIIBusiness Research Methods PPT - III
Business Research Methods PPT - III
 
Statistics in orthodontics
Statistics in orthodonticsStatistics in orthodontics
Statistics in orthodontics
 
PREDICTING CLASS-IMBALANCED BUSINESS RISK USING RESAMPLING, REGULARIZATION, A...
PREDICTING CLASS-IMBALANCED BUSINESS RISK USING RESAMPLING, REGULARIZATION, A...PREDICTING CLASS-IMBALANCED BUSINESS RISK USING RESAMPLING, REGULARIZATION, A...
PREDICTING CLASS-IMBALANCED BUSINESS RISK USING RESAMPLING, REGULARIZATION, A...
 

Ähnlich wie Multiple Linear Regression Models in Outlier Detection

IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...IJDKP
 
Outlier Detection Using Unsupervised Learning on High Dimensional Data
Outlier Detection Using Unsupervised Learning on High Dimensional DataOutlier Detection Using Unsupervised Learning on High Dimensional Data
Outlier Detection Using Unsupervised Learning on High Dimensional DataIJERA Editor
 
A Survey on Cluster Based Outlier Detection Techniques in Data Stream
A Survey on Cluster Based Outlier Detection Techniques in Data StreamA Survey on Cluster Based Outlier Detection Techniques in Data Stream
A Survey on Cluster Based Outlier Detection Techniques in Data StreamIIRindia
 
Chapter 12. Outlier Detection.ppt
Chapter 12. Outlier Detection.pptChapter 12. Outlier Detection.ppt
Chapter 12. Outlier Detection.pptSubrata Kumer Paul
 
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERSA MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERSZac Darcy
 
A Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected OutliersA Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected OutliersZac Darcy
 
Outlier analysis and anomaly detection
Outlier analysis and anomaly detectionOutlier analysis and anomaly detection
Outlier analysis and anomaly detectionShantanuDeosthale
 
Data mining: Concepts and Techniques, Chapter12 outlier Analysis
Data mining: Concepts and Techniques, Chapter12 outlier Analysis Data mining: Concepts and Techniques, Chapter12 outlier Analysis
Data mining: Concepts and Techniques, Chapter12 outlier Analysis Salah Amean
 
An Overview and Application of Discriminant Analysis in Data Analysis
An Overview and Application of Discriminant Analysis in Data AnalysisAn Overview and Application of Discriminant Analysis in Data Analysis
An Overview and Application of Discriminant Analysis in Data AnalysisIOSR Journals
 
Outlier Detection using Reverse Neares Neighbor for Unsupervised Data
Outlier Detection using Reverse Neares Neighbor for Unsupervised DataOutlier Detection using Reverse Neares Neighbor for Unsupervised Data
Outlier Detection using Reverse Neares Neighbor for Unsupervised Dataijtsrd
 
Data Analytics on Solar Energy Using Hadoop
Data Analytics on Solar Energy Using HadoopData Analytics on Solar Energy Using Hadoop
Data Analytics on Solar Energy Using HadoopIJMERJOURNAL
 
Chapter 10 Anomaly Detection
Chapter 10 Anomaly DetectionChapter 10 Anomaly Detection
Chapter 10 Anomaly DetectionKhalid Elshafie
 
STATISTICAL PARAMETERS
STATISTICAL  PARAMETERSSTATISTICAL  PARAMETERS
STATISTICAL PARAMETERSHasiful Arabi
 

Ähnlich wie Multiple Linear Regression Models in Outlier Detection (20)

Kdd08 abod
Kdd08 abodKdd08 abod
Kdd08 abod
 
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...
 
Outlier Detection Using Unsupervised Learning on High Dimensional Data
Outlier Detection Using Unsupervised Learning on High Dimensional DataOutlier Detection Using Unsupervised Learning on High Dimensional Data
Outlier Detection Using Unsupervised Learning on High Dimensional Data
 
A Survey on Cluster Based Outlier Detection Techniques in Data Stream
A Survey on Cluster Based Outlier Detection Techniques in Data StreamA Survey on Cluster Based Outlier Detection Techniques in Data Stream
A Survey on Cluster Based Outlier Detection Techniques in Data Stream
 
Chapter 12. Outlier Detection.ppt
Chapter 12. Outlier Detection.pptChapter 12. Outlier Detection.ppt
Chapter 12. Outlier Detection.ppt
 
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERSA MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
 
A Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected OutliersA Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected Outliers
 
12 outlier
12 outlier12 outlier
12 outlier
 
Data cleaning-outlier-detection
Data cleaning-outlier-detectionData cleaning-outlier-detection
Data cleaning-outlier-detection
 
Outlier analysis and anomaly detection
Outlier analysis and anomaly detectionOutlier analysis and anomaly detection
Outlier analysis and anomaly detection
 
Data mining: Concepts and Techniques, Chapter12 outlier Analysis
Data mining: Concepts and Techniques, Chapter12 outlier Analysis Data mining: Concepts and Techniques, Chapter12 outlier Analysis
Data mining: Concepts and Techniques, Chapter12 outlier Analysis
 
An Overview and Application of Discriminant Analysis in Data Analysis
An Overview and Application of Discriminant Analysis in Data AnalysisAn Overview and Application of Discriminant Analysis in Data Analysis
An Overview and Application of Discriminant Analysis in Data Analysis
 
Chapter 12 outlier
Chapter 12 outlierChapter 12 outlier
Chapter 12 outlier
 
Multiple imputation of missing data
Multiple imputation of missing dataMultiple imputation of missing data
Multiple imputation of missing data
 
Outlier Detection using Reverse Neares Neighbor for Unsupervised Data
Outlier Detection using Reverse Neares Neighbor for Unsupervised DataOutlier Detection using Reverse Neares Neighbor for Unsupervised Data
Outlier Detection using Reverse Neares Neighbor for Unsupervised Data
 
Data Analytics on Solar Energy Using Hadoop
Data Analytics on Solar Energy Using HadoopData Analytics on Solar Energy Using Hadoop
Data Analytics on Solar Energy Using Hadoop
 
Chapter 10 Anomaly Detection
Chapter 10 Anomaly DetectionChapter 10 Anomaly Detection
Chapter 10 Anomaly Detection
 
STATISTICAL PARAMETERS
STATISTICAL  PARAMETERSSTATISTICAL  PARAMETERS
STATISTICAL PARAMETERS
 
multivitrate.pptx
multivitrate.pptxmultivitrate.pptx
multivitrate.pptx
 
poster_Reza
poster_Rezaposter_Reza
poster_Reza
 

Mehr von IJORCS

Help the Genetic Algorithm to Minimize the Urban Traffic on Intersections
Help the Genetic Algorithm to Minimize the Urban Traffic on IntersectionsHelp the Genetic Algorithm to Minimize the Urban Traffic on Intersections
Help the Genetic Algorithm to Minimize the Urban Traffic on IntersectionsIJORCS
 
Call for Papers - IJORCS, Volume 4 Issue 4
Call for Papers - IJORCS, Volume 4 Issue 4Call for Papers - IJORCS, Volume 4 Issue 4
Call for Papers - IJORCS, Volume 4 Issue 4IJORCS
 
Real-Time Multiple License Plate Recognition System
Real-Time Multiple License Plate Recognition SystemReal-Time Multiple License Plate Recognition System
Real-Time Multiple License Plate Recognition SystemIJORCS
 
FPGA Implementation of FIR Filter using Various Algorithms: A Retrospective
FPGA Implementation of FIR Filter using Various Algorithms: A RetrospectiveFPGA Implementation of FIR Filter using Various Algorithms: A Retrospective
FPGA Implementation of FIR Filter using Various Algorithms: A RetrospectiveIJORCS
 
Using Virtualization Technique to Increase Security and Reduce Energy Consump...
Using Virtualization Technique to Increase Security and Reduce Energy Consump...Using Virtualization Technique to Increase Security and Reduce Energy Consump...
Using Virtualization Technique to Increase Security and Reduce Energy Consump...IJORCS
 
Algebraic Fault Attack on the SHA-256 Compression Function
Algebraic Fault Attack on the SHA-256 Compression FunctionAlgebraic Fault Attack on the SHA-256 Compression Function
Algebraic Fault Attack on the SHA-256 Compression FunctionIJORCS
 
Enhancement of DES Algorithm with Multi State Logic
Enhancement of DES Algorithm with Multi State LogicEnhancement of DES Algorithm with Multi State Logic
Enhancement of DES Algorithm with Multi State LogicIJORCS
 
Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ...
Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ...Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ...
Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ...IJORCS
 
CFP. IJORCS, Volume 4 - Issue2
CFP. IJORCS, Volume 4 - Issue2CFP. IJORCS, Volume 4 - Issue2
CFP. IJORCS, Volume 4 - Issue2IJORCS
 
Call for Papers - IJORCS - Vol 4, Issue 1
Call for Papers - IJORCS - Vol 4, Issue 1Call for Papers - IJORCS - Vol 4, Issue 1
Call for Papers - IJORCS - Vol 4, Issue 1IJORCS
 
Voice Recognition System using Template Matching
Voice Recognition System using Template MatchingVoice Recognition System using Template Matching
Voice Recognition System using Template MatchingIJORCS
 
Channel Aware Mac Protocol for Maximizing Throughput and Fairness
Channel Aware Mac Protocol for Maximizing Throughput and FairnessChannel Aware Mac Protocol for Maximizing Throughput and Fairness
Channel Aware Mac Protocol for Maximizing Throughput and FairnessIJORCS
 
A Review and Analysis on Mobile Application Development Processes using Agile...
A Review and Analysis on Mobile Application Development Processes using Agile...A Review and Analysis on Mobile Application Development Processes using Agile...
A Review and Analysis on Mobile Application Development Processes using Agile...IJORCS
 
Congestion Prediction and Adaptive Rate Adjustment Technique for Wireless Sen...
Congestion Prediction and Adaptive Rate Adjustment Technique for Wireless Sen...Congestion Prediction and Adaptive Rate Adjustment Technique for Wireless Sen...
Congestion Prediction and Adaptive Rate Adjustment Technique for Wireless Sen...IJORCS
 
A Study of Routing Techniques in Intermittently Connected MANETs
A Study of Routing Techniques in Intermittently Connected MANETsA Study of Routing Techniques in Intermittently Connected MANETs
A Study of Routing Techniques in Intermittently Connected MANETsIJORCS
 
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...Improving the Efficiency of Spectral Subtraction Method by Combining it with ...
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...IJORCS
 
An Adaptive Load Sharing Algorithm for Heterogeneous Distributed System
An Adaptive Load Sharing Algorithm for Heterogeneous Distributed SystemAn Adaptive Load Sharing Algorithm for Heterogeneous Distributed System
An Adaptive Load Sharing Algorithm for Heterogeneous Distributed SystemIJORCS
 
The Design of Cognitive Social Simulation Framework using Statistical Methodo...
The Design of Cognitive Social Simulation Framework using Statistical Methodo...The Design of Cognitive Social Simulation Framework using Statistical Methodo...
The Design of Cognitive Social Simulation Framework using Statistical Methodo...IJORCS
 
An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi...
An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi...An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi...
An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi...IJORCS
 
A PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering AlgorithmA PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering AlgorithmIJORCS
 

Mehr von IJORCS (20)

Help the Genetic Algorithm to Minimize the Urban Traffic on Intersections
Help the Genetic Algorithm to Minimize the Urban Traffic on IntersectionsHelp the Genetic Algorithm to Minimize the Urban Traffic on Intersections
Help the Genetic Algorithm to Minimize the Urban Traffic on Intersections
 
Call for Papers - IJORCS, Volume 4 Issue 4
Call for Papers - IJORCS, Volume 4 Issue 4Call for Papers - IJORCS, Volume 4 Issue 4
Call for Papers - IJORCS, Volume 4 Issue 4
 
Real-Time Multiple License Plate Recognition System
Real-Time Multiple License Plate Recognition SystemReal-Time Multiple License Plate Recognition System
Real-Time Multiple License Plate Recognition System
 
FPGA Implementation of FIR Filter using Various Algorithms: A Retrospective
FPGA Implementation of FIR Filter using Various Algorithms: A RetrospectiveFPGA Implementation of FIR Filter using Various Algorithms: A Retrospective
FPGA Implementation of FIR Filter using Various Algorithms: A Retrospective
 
Using Virtualization Technique to Increase Security and Reduce Energy Consump...
Using Virtualization Technique to Increase Security and Reduce Energy Consump...Using Virtualization Technique to Increase Security and Reduce Energy Consump...
Using Virtualization Technique to Increase Security and Reduce Energy Consump...
 
Algebraic Fault Attack on the SHA-256 Compression Function
Algebraic Fault Attack on the SHA-256 Compression FunctionAlgebraic Fault Attack on the SHA-256 Compression Function
Algebraic Fault Attack on the SHA-256 Compression Function
 
Enhancement of DES Algorithm with Multi State Logic
Enhancement of DES Algorithm with Multi State LogicEnhancement of DES Algorithm with Multi State Logic
Enhancement of DES Algorithm with Multi State Logic
 
Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ...
Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ...Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ...
Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ...
 
CFP. IJORCS, Volume 4 - Issue2
CFP. IJORCS, Volume 4 - Issue2CFP. IJORCS, Volume 4 - Issue2
CFP. IJORCS, Volume 4 - Issue2
 
Call for Papers - IJORCS - Vol 4, Issue 1
Call for Papers - IJORCS - Vol 4, Issue 1Call for Papers - IJORCS - Vol 4, Issue 1
Call for Papers - IJORCS - Vol 4, Issue 1
 
Voice Recognition System using Template Matching
Voice Recognition System using Template MatchingVoice Recognition System using Template Matching
Voice Recognition System using Template Matching
 
Channel Aware Mac Protocol for Maximizing Throughput and Fairness
Channel Aware Mac Protocol for Maximizing Throughput and FairnessChannel Aware Mac Protocol for Maximizing Throughput and Fairness
Channel Aware Mac Protocol for Maximizing Throughput and Fairness
 
A Review and Analysis on Mobile Application Development Processes using Agile...
A Review and Analysis on Mobile Application Development Processes using Agile...A Review and Analysis on Mobile Application Development Processes using Agile...
A Review and Analysis on Mobile Application Development Processes using Agile...
 
Congestion Prediction and Adaptive Rate Adjustment Technique for Wireless Sen...
Congestion Prediction and Adaptive Rate Adjustment Technique for Wireless Sen...Congestion Prediction and Adaptive Rate Adjustment Technique for Wireless Sen...
Congestion Prediction and Adaptive Rate Adjustment Technique for Wireless Sen...
 
A Study of Routing Techniques in Intermittently Connected MANETs
A Study of Routing Techniques in Intermittently Connected MANETsA Study of Routing Techniques in Intermittently Connected MANETs
A Study of Routing Techniques in Intermittently Connected MANETs
 
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...Improving the Efficiency of Spectral Subtraction Method by Combining it with ...
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...
 
An Adaptive Load Sharing Algorithm for Heterogeneous Distributed System
An Adaptive Load Sharing Algorithm for Heterogeneous Distributed SystemAn Adaptive Load Sharing Algorithm for Heterogeneous Distributed System
An Adaptive Load Sharing Algorithm for Heterogeneous Distributed System
 
The Design of Cognitive Social Simulation Framework using Statistical Methodo...
The Design of Cognitive Social Simulation Framework using Statistical Methodo...The Design of Cognitive Social Simulation Framework using Statistical Methodo...
The Design of Cognitive Social Simulation Framework using Statistical Methodo...
 
An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi...
An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi...An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi...
An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi...
 
A PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering AlgorithmA PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering Algorithm
 

Kürzlich hochgeladen

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 

Kürzlich hochgeladen (20)

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 

Multiple Linear Regression Models in Outlier Detection

  • 1. International Journal of Research in Computer Science eISSN 2249-8265 Volume 2 Issue 2 (2012) pp. 23-28 © White Globe Publications www.ijorcs.org MULTIPLE LINEAR REGRESSION MODELS IN OUTLIER DETECTION S.M.A.Khaleelur Rahman1, M.Mohamed Sathik2, K.Senthamarai Kannan3 12 Sadakathullah Appa College, Tirunelveli, Tamilnadu, India 3 Manonmaniam Sundaranar University, Tirunelveli, Tamilnadu, India Abstract: Identifying anomalous values in the real- expected (and not due to any anomalous condition). world database is important both for improving the Outliers are most extreme observations with maximum quality of original data and for reducing the impact of or minimum sample. However, the sample maximum anomalous values in the process of knowledge and minimum are not always outliers because they discovery in databases. Such anomalous values give may not be unusually far from other observations. useful information to the data analyst in discovering Thus, the term “outliers” to values “that lies very far useful patterns. Through isolation, these data may be from the middle of the distribution in either direction”. separated and analyzed. The analysis of outliers and This definition is applicable for continuously valued influential points is an important step of the regression variables having a smooth pdf values. Sometimes, diagnostics. In this paper, our aim is to detect the numeric distance is not the only consideration in points which are very different from the others points. detecting continuous outliers. “An outlier is a single, They do not seem to belong to a particular population or very low frequency, occurrence of the value of a and behave differently. If these influential points are to variable that is far away from the bulk of the values of be removed it will lead to a different model. the variable”. The frequency of occurrence should be Distinction between these points is not always obvious an important criterion for detecting outliers in and clear. Hence several indicators are used for categorical (nominal) data, which is quite common in identifying and analyzing outliers. Existing methods of the real-world databases. A more general definition of outlier detection are based on manual inspection of an outlier is given in: an observation (or subset of graphically represented data. In this paper, we present observations) which appears to be inconsistent with the a new approach in automating the process of detecting remainder of that set of data [1]. The real cause of and isolating outliers. Impact of anomalous values on outlier occurrence is usually unknown to data users the dataset has been established by using two and/or analysts. Sometimes, this is a flawed value, indicators DFFITS and Cook’sD. The process is based resulting from the poor quality of a data set, i.e., a on modeling the human perception of exceptional data entry or a data conversion error. values by using multiple linear regression analysis. Outlier detection methods have been used to detect Keywords: Cut-value, Cook’s D, DFFITS, multiple and remove anomalous values from data [7]. There are regression analysis, outlier detection. three approaches in the outlier detection process. Determine the outliers through learning approach. I. INTRODUCTION No prior knowledge of the data is similar to IN statistics, outlier is an observation that is unsupervised clustering. This approach processes the numerically distant from the rest of the data [1]. data as a static distribution, pinpoints the most remote Grubbs defined outlier as an observation that appears points and identify them as potential outliers. to deviate markedly from other members of the sample Pre-labeled data are marked as both normal and in which it occurs [3]. Occurrence of outliers may be abnormal. This approach is similar to supervised by chance in any distribution, but they are often classification .This is a semi-supervised recognition or indicative either of measurement error or of one detection task. This algorithm learns to recognize. population that has a heavy-tailed distribution. If the occurrence is by chance, robust statistical methods are II. RELATED WORK being used to discard them. In case they indicate that the distribution has high kurtosis then one should be Mathematical calculations are used to find out very cautious in using tools or intuitions that assume a whether the outlier came from the same or different normal distribution. A mixture model is frequently population. Many statistical methods have been used for these two distinct sub-populations. Outlier devised for detecting outliers by measuring how far the points can indicate faulty data or erroneous procedures outlier is away from the other values. This can be the where a certain theory might not be valid. However, in difference between the outlier and the mean of all large samples, a small number of outliers is to be points or the difference between the outlier and the www.ijorcs.org
  • 2. 24 S.M.A.Khaleelur Rahman, M.Mohamed Sathik, K.Senthamarai Kannan mean of the remaining values, or the difference influences upon a single dependent variable. Further, between the outlier and the next closest value. because of omitted variables bias with simple regression, multiple regression is necessitated, because Different computer-based approaches have been we are interested in the effects of one of the proposed for detecting outlying data and it cannot be independent variables. Outliers are subject to masking claimed that this is the generic or universally and swamping [2][1]. One outlier masks a second acceptable method. Therefore , these approaches were outlier. In the presence of the first outlier the second is classified into four major categories based on the not considered as an outlier but can be considered an techniques used , which are: distribution-based, outlier only by itself. Thus, after the deletion of the distance-based, density-based and deviation-based first outlier the second appears as an outlier. In approaches [7]. Distribution-based approaches develop masking the resulting distance of the outlying point statistical models from the given data and then from the mean is short. identify outliers with respect to the model using discordance test. Data sets may follow normal or III. PROPOSED WORK Poison distribution. Objects that have low probability belong to the statistical model are declared as outliers. In this paper, a new approach for outliers detection However, Distribution-based approaches cannot be to detect the values that are in an abnormal distance applied on multidimensional data because they are from other values is used. A number of suspicious univariate in nature. In addition, a prior knowledge of observations are identified according to the indicators. the data distribution is required, making the To highlight the suspicious individuals the following distribution-based approaches difficult to be used in indicators are used with a set of cut values. Various practical applications. In the distance-based approach, indicators and their cut values are outliers are detected by measuring distance. Main limitations of statistical methods are countered by this Indicator Cut value approach. Rather than working on statistical tests , objects that do not have enough neighbors are defined DFFITS 2 * SQRT (p / n) based on the distance from the given object. Density- Cook’s D 4 / (n - p) based approaches compute the density of regions in the data and declare the objects in low dense regions as outliers. An outlier score to any given data point, Here n is the dataset size, and p is the number of known as Local Outlier Factor (LOF), depending on its estimated parameters (number of descriptors + 1 for a distance from its local neighborhood, is assigned. regression with intercept). This method evaluates the Deviation-based approaches do not use statistical tests overall influence of each observation. or distance-based measures to identify outliers. The DFFITS measures how much an observation has objects that deviate from the description are treated as affected its fitted value from the regression model. It outliers. also measures values larger than in absolute value. Outlier detection methods can be divided as These measures are highly influential. In DFFITS , the univariate methods, and multivariate methods. differences in individual fitted values with and without Currently many researches are worked upon particular cases are calculated. multivariate methods. Another fundamental taxonomy The following rules[3] are followed to consider the of outlier detection methods is between parametric case influential if (statistical) methods and non-parametric methods that are model-free. Statistical parametric methods work on • DFFITS > 1 for small data sets, or if the assumption that underlying observations are known • DFFITS > 2 (p/n)1/2 for large data sets or, at least, they are based on statistical estimates of unknown distribution parameters [3] [4]. These Cook’sD measures aggregate impact of each methods flag those observations that deviate from the observation on the group of regression coefficients, as model assumptions as outliers. They are often well as the group of fitted values. By this method, unsuitable for high-dimensional data sets and for differences in an aggregate of all fitted values with arbitrary data sets without prior knowledge of the and without particular case values are calculated. It underlying data distribution . measures the influence of cases on all n fitted values . Cook’sD does not require that n different regressions Regression analysis is a statistical tool for the be run. According to Cook’sD, a particular case can investigation of relationships between variables [2]. be influential by [5] Multiple regression is a technique that allows additional factors to enter the analysis separately so • Having a large residual ei that the effect of each can be estimated. It is valuable • Having a high leverage hii for quantifying the impact of various simultaneous • Having both www.ijorcs.org
  • 3. Multiple Linear Regression Models in Outlier Detection 25 Find the percentile value corresponding to D in the CrRate Male14- Unemp14- FInc Education F(p, n-p) distribution. If the percentile is less than 0.10 24 24 or 0.20(not p value), then the particular case is 79.1 151 91 108 394 relatively little influential. If the percentile is 0.5 or 163.5 143 113 96 557 more, the case has a major influence on the fit. 57.8 142 89 94 318 196.9 136 121 102 673 Most assumptions of multiple regression cannot be tested explicitly but gross deviations and violations can 123.4 141 121 91 578 be detected and dealt with appropriately. Outliers bias 68.2 121 110 84 689 the results and skew the regression line in a particular 96.3 127 111 97 620 direction , thereby leading to biased regression 155.5 131 109 79 472 coefficients. Exclusion of a single deviated value can 85.6 157 90 81 421 give a completely different set of results and model. 70.5 140 118 100 526 The regression line demonstrates the prediction of the 167.4 124 105 77 657 dependent variable (Y), by giving the independent 84.9 134 108 83 580 variables (X). Observed points are varied around the 51.1 128 113 77 507 fitted regression line. The deviation of a particular 66.4 135 117 77 529 point from the regression line (predicted value) is 79.8 152 87 92 405 called the residual value. Multiple regression analysis 94.6 142 88 116 427 is capable of dealing with an arbitrarily large number 53.9 143 110 114 487 of explanatory variables. With n explanatory variables, 92.9 135 104 89 631 multiple regression analysis will estimate the sum of 75 130 116 78 627 squared errors. Its intercept implies the constant term, 122.5 125 108 130 626 and its slope in each dimension implies one of the 74.2 126 108 102 557 regression coefficients. 43.9 157 89 97 288 In regression analysis , the term "leverage" is used 121.6 132 96 83 513 for an undesirable effect . High leverage values do not 52.3 130 116 70 486 make considered an "outlier" has an over proportional 199.3 131 121 102 674 effect on the resulting regression curve. This effect 34.2 135 109 80 564 may completely corrupt great impact on coefficients. 121.6 152 112 103 537 It means that a single data point which is located well 104.3 119 107 92 637 outside the bulk of the data and may be considered an 69.6 166 89 72 396 "outlier" has an over proportional effect on the 37.3 140 93 135 453 resulting regression curve. This effect may completely 75.4 125 109 105 617 corrupt a regression model depending on the number 107.2 147 104 76 462 of the samples and the distance of the outlier from the 92.3 126 118 102 589 rest of the data. This regression model would be 65.3 123 102 124 572 good in the absence of an outlier. 127.2 150 100 87 559 83.1 177 87 76 382 IV. RESULTS AND DISCUSSION 56.6 133 104 99 425 Now, we investigate our proposed method by using 82.6 149 88 86 395 a synthetic data set. This is an artificial data set with 115.1 145 104 88 488 two dimensions and values are entered to demonstrate 88 148 122 84 590 how regression methods are used for detecting outliers 54.2 141 109 107 489 and their influence. This data set has 14 attributes with 82.3 162 99 73 496 47 examples. All 14 attributes are continuous 103 136 121 111 622 attributes. 45.5 139 88 135 457 This dataset has 47 instances. We want to explain 50.8 126 104 78 593 the continuous attribute CrRate(CrimeRate) from four 84.9 130 121 113 588 attributes such as Male14-24, Education, Unemp14-24 Table 1 (Unemployed) and FInc(FamilyIncome). All the four Table 1 displays the number of outliers detected for attributes are also continuous in nature. two statistical procedures DFFITS and Cook’D. Cook’s distance follows an F distribution so that sample size is an important factor. Large residuals tend to have large Cook’s distance[5]. Influential observations are measured on the coefficient estimate. www.ijorcs.org
  • 4. 26 S.M.A.Khaleelur Rahman, M.Mohamed Sathik, K.Senthamarai Kannan Outliers are very different to the others i.e. they 8 2.95457339 1.08224845 look, do not belong to the analysed population and if 9 0.10356779 0.03245340 we remove these influential values, they lead us to a different model. The distinction between these kinds of 10 -0.69109583 -0.20852734 points is not always obvious. One hypothesis is 11 1.89708304 0.81378603 formulated about the relationship between the 12 -0.44522813 -0.09383719 variables of interest, here, Education and CrimeRate, 13 -0.69840181 -0.26410747 Male14-24 and CrimeRate, Unemp14-24 and Crime Rate, Family Income and CrimeRate. Common 14 -0.66027892 -0.21406586 experience suggests that less family income is an 15 0.23852123 0.07178046 important factor for inducing unemployed youth to 16 0.86122960 0.29989931 commit more crimes. In our database only four records are discriminated which implies normal assumption is 17 -0.96075284 -0.27731615 tend to make more money, people commit crime so 18 -0.66088492 -0.21784022 that crime rate will be high for the group with less 19 -1.01672661 -0.27869084 family income. But this assumption proved wrong here 20 0.62035054 0.24444196 that male12-24 hailed from high family income commit more crimes. Thus, the hypothesis is that 21 -0.30894044 -0.06945060 higher level of unemployment causes higher level of 22 -0.07215249 -0.03140858 Crime but other factors also affect it. To investigate 23 1.31708193 0.41059065 this hypothesis, we analyzed the data on Education and 24 0.31558868 0.15823852 FamilyIncome (earnings) for the individuals at the age group of 14-24. Here CrimeRate is the dependent 25 -0.58615857 -0.28383282 variable and remaining four variables male-14-24, 26 2.43565369 0.75912845 Education, Unemployed 14-24 and FamilyIncome are 27 -1.91721714 -0.40625769 independent variables. Let E denote education in years of schooling for each individual, and let I denote that 28 0.35539761 0.12204570 individual’s earnings in dollars per year. We can plot 29 0.18945691 0.06703783 this information for all of the individuals in the sample 30 -0.50027841 -0.20941073 using a two-dimensional diagram, conventionally 31 -1.08754754 -0.46675214 termed a “scatter” diagram. 32 -0.72746843 -0.18305531 Effect of each variable can be estimated separately 33 0.72281331 0.18214980 by using Multiple regression. Impact and influence of different variables upon a single dependent variable 34 -0.05855739 -0.01478066 can also be quantified. As an investigator, we are 35 -0.60047084 -0.22071476 interested only in the effects of different independent 36 0.45041770 0.14851908 variables. Four independent variables Male14-24, 37 -0.37726417 -0.21679878 Unemployed14-24, Family Income, Education and one explanatory variable CrimeRate are used to depict the 38 -0.00519032 -0.00182978 effect in different scatter diagrams. More influential 39 0.50739807 0.15381441 points in Cook’D have the higher values. In table2 40 0.80370468 0.14202599 these suspicious values are highlighted in records with 41 -1.01047850 -0.40952945 number 4, 8,11 and 26. 42 -0.87981868 -0.20346297 43 -0.83315825 -0.34374225 Statistic DFFITS Cook's D 44 -0.36672667 -0.12044833 Lower bound 0.6523 - 45 -0.82375938 -0.39888713 Upper bound 0.6523 0.0952 46 -1.32948601 -0.44754258 1 0.32646009 0.09616140 47 -0.42401546 -0.13260822 2 1.78480983 0.39986444 Table 2 3 0.66405863 0.31023771 4 2.18146706 0.77293313 5 0.43511620 0.13569903 6 -1.47110713 -0.59439152 7 -0.20111063 -0.04528905 www.ijorcs.org
  • 5. Multiple Linear Regression Models in Outlier Detection 27 Attribute Coef. std t(42) p-value Intercept -222.752 128.140538 -1.738341 0.089478 Male14-24 1.160999 0.567406 2.046153 0.047035 Education 0.091014 0.674681 0.134899 0.893337 Unemp14-24 0.007371 0.293173 0.025141 0.980062 FamIncome 0.270389 0.089632 3.016646 0.004327 Table 3 Figure 1 Figure 3 Figure 2 Figure 4 www.ijorcs.org
  • 6. 28 S.M.A.Khaleelur Rahman, M.Mohamed Sathik, K.Senthamarai Kannan Global results Endogenous attribute CrimeRate Examples 47 R² 0.271953 Adjusted-R² 0.238860 Sigma error 33.742476 F-Test (2,44) (8.2178,000928) Coefficients Attribute Coef. std t(44) p-value - Intercept 102.132327 -2.103677 0.041157 214.853376 FamInco 0.277415 0.069458 3.993974 0.000243 me Male14- 1.151818 0.533284 2.159859 0.036273 24 V. CONCLUSION In this paper multiple linear regression with n explanatory variables are used. We proved through examples that Multiple regression allows analysis of additional factors to estimate effect of each variable and quantifying the impact of simultaneous influence upon a single dependant variable. Regression models also explain the variation in the dependent variable well. Here we use this for predictive purposes. Above two statistics Cook’sD and DFFITS after careful consideration, omit influential points and regressions are refitted. Effect of the case can be studied by deleting the particular case from the data and analyzing the rest of the population. Multiple outliers are detected in multiple linear regression model. Distributions are used to get suitable cutoff points. Procedures are proposed and implemented with examples. Regression is not robust if the data set is small and can lead to inaccurate estimates. VI. REFERENCES [1] Hawkins. D. Identification of Outliers , Chapman and and Hall , London, 1980 [2] Barnett, V. and Lewis, T.: 1994, Outliers in Statistical Data. John Wiley & Sons., 3rd edition. [3] Grubbs, F. E.: 1969, Procedures for detecting outlying observations in samples. Technometrics 11, 1–21. [4] Rousseeuw, P. and Leroy, A.: 1996, Robust Regression and Outlier Detection. John Wiley & Sons., 3rd edition.. [5] Cook R. D and Weisberg S.T. (1982), Residuals and influence in New York Chapman and Hall. [6] Abraham , B., and A.Chuang. “Outlier Detection and Time Series Modelling.” Technometrics(1989) [7] Jiawei Han and Micheline Kamber “Data Mining concepts and Techniques” Elsever, Second Edition www.ijorcs.org