Principal Components Analysis, Calculation and Visualization

The article explains dimension reduction principles, the PCA algorithm and the mathematics behind it. The PCA calculation and data projection are demonstrated in R, Python and Apache Spark. Finally, the results are visualized with D3.js.

Author: Marjan Sterjev

Every day humans and machines produce a tremendous amount of data that is collected by the systems they interact with. The value of that data is enormous if we can analyze it and distill conclusions that will help humans improve their lives, increase business revenues and so on. However, that information is very often hidden deep in the data, and data analysts must dig for valuable clues.

Usually the data is provided in a columnar format with M samples, each sample having N dimensions (example dimensions could be color, weight, height etc.). The spreadsheet representation of the Data matrix is:

              Dimension 1   Dimension 2   Dimension 3   ...   Dimension N
Sample 1      value11       value12       value13       ...   value1N
Sample 2      value21       value22       value23       ...   value2N
Sample 3      value31       value32       value33       ...   value3N
...           ...           ...           ...           ...   ...
Sample M      valueM1       valueM2       valueM3       ...   valueMN

The first step in any data science related work is data exploration. During the exploration process the data analyst shall get a quick first impression about the data: which dimensions are useful and which are not, whether there is any grouping (clustering) among the samples in the data, etc.

A particular dimension is more useful than some other dimension if the values in that dimension vary more than the values in the other dimension. For example, if all of the values in Dimension 1 are the same, then that dimension is useless because it gives us no information that will help us distinguish one data sample from another. We can safely drop that dimension (column) and proceed with the data analysis without it.

Data visualization can help the data analyst detect clustering among the data samples. Humans can understand 2-D or 3-D plots. However, the data usually has tens or hundreds of dimensions. How can we plot that data? In order to produce a 2-D plot from high-dimensional data we need to:

1. Transform the existing N dimensions of the data into a new set of N dimensions. The data value variance in some of the new dimensions will be large compared with the variance in the others. Simply speaking, some of the new dimensions will be more important and the others will be less important.

2. Order the new dimensions by the data value variance therein (order by importance). The dimension with the highest value variance shall come first (it shall become the leftmost dimension column in the tabular view above).

3. Keep the 2 leftmost dimensions that have the maximum value variance and drop the remaining N-2 dimensions.
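As a quick illustration of the variance argument above (a dimension whose values do not vary carries no information), here is a minimal NumPy sketch. The toy matrix and the zero-variance threshold are hypothetical choices made for this sketch, not part of the original article:

import numpy as np

# Hypothetical toy data: 4 samples x 3 dimensions; the second column is constant
Data = np.array([[2.5, 7.0, 2.4],
                 [0.5, 7.0, 0.7],
                 [2.2, 7.0, 2.9],
                 [1.9, 7.0, 2.2]])

# Sample variance of every dimension (column)
variances = Data.var(axis=0, ddof=1)
print(variances)               # the constant column has variance 0.0

# Keep only the dimensions that actually vary
ReducedData = Data[:, variances > 0.0]
print(ReducedData.shape)       # (4, 2)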
Principal Components Analysis (PCA) is a well known and established algorithm for dimension reduction. The output of the algorithm is a small set of principal components that help us project the data into a new, low-dimensional space. In order to define PCA, let's start with some statistics definitions.

Each sample with N dimensions can be represented as a vector of the following form:

X = [X_1, X_2, X_3, X_4, \ldots, X_N]   (1)

Ex: X = [1, 2, 3]   (2)

The mean of the values in the vector X is defined as:

\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i   (3)

Ex: \bar{X} = \frac{1}{3}(1 + 2 + 3) = 6/3 = 2   (4)

The standard deviation of the values in the vector X is defined as:

\sigma = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})^2}   (5)

Ex: \sigma = \sqrt{\frac{1}{2}\left((1-2)^2 + (2-2)^2 + (3-2)^2\right)} = 1   (6)

The variance of the values in the vector X is defined as:

\mathrm{var}(X) = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})^2 = \sigma^2   (7)

The covariance between two vectors X and Y is defined as:

\mathrm{cov}(X, Y) = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})   (8)
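The worked examples above are easy to verify numerically. Here is a minimal NumPy sketch (my own verification; the second vector Y is a hypothetical example introduced only to exercise the covariance formula):

import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([2.0, 4.0, 5.0])    # hypothetical vector, only for the covariance example

print(X.mean())                  # 2.0, as in equation (4)
print(X.std(ddof=1))             # 1.0, as in equation (6); ddof=1 gives the N-1 denominator
print(X.var(ddof=1))             # 1.0, the variance from equation (7)
print(np.cov(X, Y)[0, 1])        # covariance of X and Y as defined in equation (8)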
For a square matrix A, an eigenvalue/eigenvector pair is defined by:

A x = \lambda x   (9)

The matrix eigenvalues are calculated by solving the following determinant equation:

\det(A - \lambda I) = 0   (10)

Ex:

A = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}, \quad
\det(A - \lambda I) = \det \begin{pmatrix} 2-\lambda & 0 \\ 0 & 3-\lambda \end{pmatrix} = 0, \quad
(2-\lambda)(3-\lambda) = 0, \quad \lambda_1 = 2, \; \lambda_2 = 3   (11)

The eigenvectors can be calculated by substituting the eigenvalue solutions back into the eigenvector equation:

A x_1 = \lambda_1 x_1, \quad A x_2 = \lambda_2 x_2   (12)

For \lambda_1 = 2:

\begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix} \begin{pmatrix} x_{11} \\ x_{12} \end{pmatrix} = 2 \begin{pmatrix} x_{11} \\ x_{12} \end{pmatrix}
\;\Rightarrow\; 2 x_{11} = 2 x_{11}, \; 3 x_{12} = 2 x_{12}
\;\Rightarrow\; x_{11} \text{ arbitrary}, \; x_{12} = 0, \; x_1 = [x_{11}, 0]   (13)

For \lambda_2 = 3:

\begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix} \begin{pmatrix} x_{21} \\ x_{22} \end{pmatrix} = 3 \begin{pmatrix} x_{21} \\ x_{22} \end{pmatrix}
\;\Rightarrow\; 2 x_{21} = 3 x_{21}, \; 3 x_{22} = 3 x_{22}
\;\Rightarrow\; x_{21} = 0, \; x_{22} \text{ arbitrary}, \; x_2 = [0, x_{22}]   (14)

The eigenvectors x_1 and x_2 shall be orthonormal. Orthonormal vectors are vectors with unit length and zero dot product:

|x_1| = \sqrt{x_{11}^2 + x_{12}^2} = \sqrt{x_{11}^2} = 1, \quad
|x_2| = \sqrt{x_{21}^2 + x_{22}^2} = \sqrt{x_{22}^2} = 1, \quad
x_1 \cdot x_2 = x_{11} x_{21} + x_{12} x_{22} = 0   (15)
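The same toy decomposition can be checked numerically. This is a minimal sketch of my own; note that numerical libraries return eigenvectors only up to sign and in no guaranteed order:

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)               # [2. 3.]
print(eigenvectors)              # columns are unit-length eigenvectors, e.g. [1, 0] and [0, 1]

# Orthonormality check: V^T V is (numerically) the identity matrix
print(eigenvectors.T.dot(eigenvectors))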
Finally, we can choose the following values that satisfy the orthonormality condition:

x_{11} = -1, \; x_{22} = 1, \quad x_1 = [-1, 0], \; x_2 = [0, 1]   (16)

The data samples matrix Data defined above has N dimensions:

Dim_1, Dim_2, Dim_3, \ldots, Dim_N   (17)

Each dimension can be represented as a vector of M elements:

Dim_i = [value_{1i}, value_{2i}, \ldots, value_{Mi}]   (18)

Each dimension vector has a mean value of:

\overline{Dim_i} = \frac{1}{M} \sum_{k=1}^{M} value_{ki}   (19)

The covariance between two dimension vectors Dim_i and Dim_j is:

\mathrm{cov}(Dim_i, Dim_j) = \frac{1}{M-1} \sum_{k=1}^{M} (value_{ki} - \overline{Dim_i})(value_{kj} - \overline{Dim_j})   (20)

The covariance matrix for N vectors is an N x N square matrix where the element at position (i, j) equals the covariance of Dim_i and Dim_j:

C = \left( c_{ij} = \mathrm{cov}(Dim_i, Dim_j) \right)   (21)

C = \begin{pmatrix}
\mathrm{cov}(Dim_1, Dim_1) & \mathrm{cov}(Dim_1, Dim_2) & \ldots & \mathrm{cov}(Dim_1, Dim_N) \\
\mathrm{cov}(Dim_2, Dim_1) & \mathrm{cov}(Dim_2, Dim_2) & \ldots & \mathrm{cov}(Dim_2, Dim_N) \\
\ldots & \ldots & \ldots & \ldots \\
\mathrm{cov}(Dim_N, Dim_1) & \mathrm{cov}(Dim_N, Dim_2) & \ldots & \mathrm{cov}(Dim_N, Dim_N)
\end{pmatrix}   (22)

Note that the covariance matrix can also be calculated as:

C = \frac{1}{M-1} AdjData^T \, AdjData   (23)

where AdjData is obtained by subtracting the column's mean from each data value:

AdjData = \begin{pmatrix}
value_{11} - \overline{Dim_1} & value_{12} - \overline{Dim_2} & \ldots & value_{1N} - \overline{Dim_N} \\
value_{21} - \overline{Dim_1} & value_{22} - \overline{Dim_2} & \ldots & value_{2N} - \overline{Dim_N} \\
\ldots & \ldots & \ldots & \ldots \\
value_{M1} - \overline{Dim_1} & value_{M2} - \overline{Dim_2} & \ldots & value_{MN} - \overline{Dim_N}
\end{pmatrix}   (24)
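Equation (23) can be sanity checked numerically. Here is a minimal sketch of my own, using the same small 10 x 2 data set that appears in the R and Python examples below:

import numpy as np

Data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

# Center every column (dimension) around its mean, as in equation (24)
AdjData = Data - Data.mean(axis=0)

# Covariance matrix via the matrix product in equation (23)
M = Data.shape[0]
C = AdjData.T.dot(AdjData) / (M - 1)

# NumPy's covariance routine (columns as variables) gives the same matrix
print(np.allclose(C, np.cov(Data, rowvar=False)))   # True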
Principal components of the matrix Data can be obtained by eigen decomposition of the covariance matrix C. The eigenvalues shall be sorted in descending order, the largest first. The associated eigenvectors shall be ordered the same way from left to right. We can choose the first K eigenvectors (principal components) and discard the others, which are considered the least significant. The principal components matrix PC with N rows and K columns is a transformation matrix where each column is a principal component. Finally, the projection of the original data Data into the space with K dimensions is calculated as the dot product of AdjData and PC:

PData = AdjData \cdot PC   (25)

Example in R

We can directly apply the data manipulations defined above:

#Load Data
Data<-matrix(c(2.5,2.4,0.5,0.7,2.2,2.9,1.9,2.2,3.1,3.0,2.3,2.7,2,1.6,1,1.1,1.5,1.6,1.1,0.9),ncol=2,byrow=T)

#Generate Adjusted Data
AdjData<-Data
AdjData[,1]<-AdjData[,1]-mean(AdjData[,1])
AdjData[,2]<-AdjData[,2]-mean(AdjData[,2])

#Calculate co-variance matrix
C<-t(AdjData)%*%AdjData/(length(AdjData[,1])-1)

#Calculate eigen decomposition
PC<-eigen(C)$vectors

#Calculate projection
PData<-AdjData%*%PC

#Check the prior and posterior dimension standard deviations
sd(Data[,1])
sd(Data[,2])
sd(PData[,1])
sd(PData[,2])

The prior dimension standard deviations were 0.78 and 0.84. They are very close, so we cannot throw one of the original dimensions away as less significant. The posterior dimension standard deviations are 1.13 and 0.22. The difference is obvious: the first dimension is more important than the second one, and we can discard the second if we need one-dimensional data for exploration.
In R there are ready-to-use functions for PCA that encapsulate all of the above calculations. For example:

pca<-princomp(Data)
PC<-pca$loadings
PData<-predict(pca,Data)

Example in Python

import numpy as np

#Load Data
Data=np.array([[2.5,2.4],[0.5,0.7],[2.2,2.9],[1.9,2.2],[3.1,3.0],[2.3,2.7],[2,1.6],[1,1.1],[1.5,1.6],[1.1,0.9]])

#Generate Adjusted Data
DimMeans=Data.mean(axis=0)
AdjData=Data-DimMeans

#Calculate co-variance matrix
C=AdjData.transpose().dot(AdjData)/(AdjData.shape[0]-1)

#Calculate eigen decomposition and sort the components by descending eigenvalue
EigenValues,PC=np.linalg.eig(C)
idx=EigenValues.argsort()[::-1]
PC=PC[:,idx]

#Calculate projection
PData=AdjData.dot(PC)

#Compare the prior and posterior dimension standard deviations
Data.std(axis=0,ddof=1)
PData.std(axis=0,ddof=1)

The Python machine learning library scikit-learn contains ready-to-use PCA functionality:

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(Data)
PData=pca.transform(Data)

Example in Apache Spark

Apache Spark is a framework for lightning-fast cluster computing (http://spark.apache.org/). Spark can handle extremely large data sets in almost real time, which is a huge improvement over Hadoop's batch processing nature. Machine learning in Spark is supported through the MLlib library. One of the distributed algorithms implemented there is the PCA algorithm for dimensionality reduction. For illustration purposes I will use the Iris Flower Data Set:

https://en.wikipedia.org/wiki/Iris_flower_data_set
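As a quick cross-check (a sketch of my own, not part of the original article), the manual eigen decomposition route and scikit-learn produce the same projection up to the sign of each component, since eigenvectors are only defined up to sign:

import numpy as np
from sklearn.decomposition import PCA

Data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

# Manual route: eigen decomposition of the covariance matrix
AdjData = Data - Data.mean(axis=0)
C = AdjData.T.dot(AdjData) / (AdjData.shape[0] - 1)
EigenValues, PC = np.linalg.eig(C)
PData_manual = AdjData.dot(PC[:, EigenValues.argsort()[::-1]])

# scikit-learn route
pca = PCA(n_components=2)
PData_sklearn = pca.fit_transform(Data)

# The columns agree up to a possible sign flip of each component
print(np.allclose(np.abs(PData_manual), np.abs(PData_sklearn)))   # True
print(pca.explained_variance_ratio_)   # fraction of the total variance captured per component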
This is not a large data set. It contains only 150 samples, 50 samples for each flower category: setosa, versicolor and virginica. The data has 4 dimensions: Sepal length, Sepal width, Petal length and Petal width. Our task is to transform the original data set into a 2-dimensional data set suitable for 2-D plotting. The following Scala Spark code demonstrates how to perform that transformation. It also presents some other Spark features like Spark SQL querying and JSON marshalling of the results.

import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.sql._

val irisBase = sc.textFile("C:/ml/iris.data").zipWithIndex()

val irisRows = irisBase.map({ case (line, index) => {
  val parts = line.split(",")
  new IndexedRow(index, Vectors.dense(parts.reverse.tail.map(_.toDouble).reverse))
}})

val irisMatrix = new IndexedRowMatrix(irisRows)
val pc = irisMatrix.toRowMatrix().computePrincipalComponents(2)
val irisPCProjection = irisMatrix.multiply(pc).rows.map(x => (x.index, x.vector.toArray))

val irisLabels = irisBase.map({ case (line, index) => {
  val parts = line.split(",")
  (index, parts.reverse.head)
}})

case class Iris(id: Long, x: Double, y: Double, label: String)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val irisProjection = irisPCProjection.join(irisLabels)
  .map(x => Iris(x._1 + 1, x._2._1(0), x._2._1(1), x._2._2)).toDF()

irisProjection.registerTempTable("iris")

sqlContext.sql("SELECT * FROM iris ORDER BY id")
  .repartition(1).toJSON.saveAsTextFile("C:/ml/iris-results")

After successful execution of the above commands in the Spark shell, the projected samples are located in the file "iris-results/part-00000".
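For readers without a Spark installation at hand, the same two-dimensional Iris projection can be reproduced locally with scikit-learn. This is a sketch of my own, not part of the original article; note that the Spark snippet above multiplies the uncentered data by the principal components, so its coordinates may differ from the centered scikit-learn projection by a constant offset and by the sign of each axis:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()                      # 150 samples x 4 dimensions
pca = PCA(n_components=2)
projected = pca.fit_transform(iris.data)

# One (x, y, label) row per flower, analogous to the Spark JSON output
for (x, y), label in list(zip(projected, iris.target_names[iris.target]))[:3]:
    print(round(x, 3), round(y, 3), label)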
  8. 8. {"id":1,"x":-2.8271359726790117,"y":-5.641331045573361,"label":"Iris-setosa"} {"id":2,"x":-2.7959524821488304,"y":-5.14516688325295,"label":"Iris-setosa"} {"id":3,"x":-2.621523558165045,"y":-5.177378121203946,"label":"Iris-setosa"} Note that each row in the file is valid JSON, however the whole content is not valid JSON. We can manually produce valid JSON if we replace each new line with comma plus new line, opening square bracket at the beginning of the file and closing square bracket at the end of the file. This manual transformation step generates JavaScript array of JSON projected samples. Data Visualization with D3.js Data visualization is the final step when exploring data. There is an excellent JavaScript data visualization library called D3.js: http://d3js.org/ This plotting library is an excellent tool for generating online, interactive data driven dashboards. In our case we can visualize the projected Iris flower data set as presented below. <!DOCTYPE html> <html> <head> <style> body { font: 11px sans-serif; } .axis path, .axis line { fill: none; stroke: #000; shape-rendering: crispEdges; } .dot { stroke: #000; } </style> <script src ="https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.6/d3.min.js" charset="utf-8"></script> <script type="text/javascript"> var data=[ {"id":1,"x":-2.8271359726790117,"y":-5.641331045573361,"label":"Iris-setosa"}, {"id":2,"x":-2.7959524821488304,"y":-5.14516688325295,"label":"Iris-setosa"}, {"id":3,"x":-2.621523558165045,"y":-5.177378121203946,"label":"Iris-setosa"}, . . . ]; function drawScatterPlot(){ var margin = {top: 20, right: 20, bottom: 30, left: 40}; 8
  var width = 600 - margin.left - margin.right;
  var height = 600 - margin.top - margin.bottom;

  //Setup x
  var xValue = function(d) { return d.x;},
      xScale = d3.scale.linear().range([0, width]),
      xMap = function(d) { return xScale(xValue(d));},
      xAxis = d3.svg.axis().scale(xScale).orient("bottom");

  //Setup y
  var yValue = function(d) { return d.y;},
      yScale = d3.scale.linear().range([height, 0]),
      yMap = function(d) { return yScale(yValue(d));},
      yAxis = d3.svg.axis().scale(yScale).orient("left");

  //Setup colors
  var cValue = function(d) { return d.label;},
      color = d3.scale.category10();

  var svg = d3.select("body").append("svg")
    .attr("width", width + margin.left + margin.right)
    .attr("height", height + margin.top + margin.bottom)
    .append("g")
    .attr("transform", "translate(" + margin.left + "," + margin.top + ")");

  xScale.domain([d3.min(data, xValue)-1, d3.max(data, xValue)+1]);
  yScale.domain([d3.min(data, yValue)-1, d3.max(data, yValue)+1]);

  // x-axis
  svg.append("g")
    .attr("class", "x axis")
    .attr("transform", "translate(0," + height + ")")
    .call(xAxis)
    .append("text")
    .attr("class", "label")
    .attr("x", width)
    .attr("y", -6)
    .style("text-anchor", "end")
    .text("X");

  // y-axis
  svg.append("g")
    .attr("class", "y axis")
    .call(yAxis)
    .append("text")
    .attr("class", "label")
    .attr("transform", "rotate(-90)")
    .attr("y", 6)
    .attr("dy", ".71em")
    .style("text-anchor", "end")
    .text("Y");

  //Plot Iris data
  svg.selectAll(".dot").data(data)
    .enter().append("circle")
    .attr("class", "dot")
    .attr("r", 4)
    .attr("cx", xMap)
    .attr("cy", yMap)
    .style("fill", function(d) { return color(cValue(d));})
    .on("mouseover", function(d) {
      var circle = d3.select(this);
      circle.transition().duration(100)
        .attr("r", 8);
    })
    .on("mouseout", function(d) {
      var circle = d3.select(this);
      circle.transition().duration(100)
        .attr("r", 4);
    });

  // draw legend
  var legend = svg.selectAll(".legend")
    .data(color.domain())
    .enter().append("g")
    .attr("class", "legend")
    .attr("transform", function(d, i) { return "translate(0," + i * 20 + ")"; });

  //Draw legend colored rectangles
  legend.append("rect")
    .attr("x", width - 18)
    .attr("width", 18)
    .attr("height", 18)
    .style("fill", color);

  //Draw legend text
  legend.append("text")
    .attr("x", width - 24)
    .attr("y", 9)
    .attr("dy", ".35em")
    .style("text-anchor", "end")
    .text(function(d) { return d;});
}
</script>
</head>
<body onload="drawScatterPlot()">
</body>
</html>
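As mentioned earlier, the data array embedded in the HTML page can be generated from the Spark output file with a few lines of code instead of manual editing. This is a minimal Python sketch of my own; the input path matches the Spark example above, while the output file name iris-data.js is a hypothetical choice:

# Convert Spark's JSON-lines output into a JavaScript array literal for the D3.js page
with open("C:/ml/iris-results/part-00000") as f:
    rows = [line.strip() for line in f if line.strip()]

with open("iris-data.js", "w") as out:
    out.write("var data=[\n" + ",\n".join(rows) + "\n];\n")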
The result is shown in Figure 1. We can see that the data analyst can observe the flower group boundaries with only 2 dimensions involved.

All of the principles explained in this article (dimensionality reduction, principal component analysis etc.), realized with processing tools and frameworks like R, scikit-learn and Spark, as well as front-end JavaScript libraries for visualization, plotting and interaction like the famous D3.js, provide us a foundation for building extremely useful online dashboards and services for solving problems in different domains.

Figure 1. PCA projected Iris flower data set
