Intro to Machine Learning with Scikit-learn

1. Yossi Cohen, Machine Learning with Scikit-learn
2. INTRO TO ML PROGRAMMING
3. ML Programming
   1. Get data (and labels, for supervised learning)
   2. Create a classifier
   3. Train the classifier
   4. Predict test data
   5. Evaluate predictor accuracy
   * Configure and improve by repeating steps 2-5
4. The ML Process (diagram labels: Filter, Outliers, Regression, Classify, Validate, Configure, Model, Partition)
5. Get Data & Labels
   • Sources
     – Open data sources
     – Collect on your own
   • Verify data validity and correctness
   • Wrangle data
     – Make it readable by a computer
     – Filter it
   • Remove outliers
   The pandas Python library can assist with pre-processing and data manipulation before ML: http://pandas.pydata.org/
6. Pre-Processing
   • Change formatting
   • Remove redundant data
   • Filter data (take partial data)
   • Remove outliers
   • Label
   • Split for testing (10/90, 20/80)
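The pre-processing steps above could be sketched with pandas roughly as follows. The table, its column names (`age`, `loves_nutella`) and the 1.5 × IQR outlier rule are illustrative assumptions, not something the slides prescribe.

```python
import pandas as pd

# Hypothetical raw survey data; the column names are invented for illustration.
raw = pd.DataFrame({
    "age": [3, 7, 76, 11, 22, 37, 56, 2, 2, 400],
    "loves_nutella": ["T", "T", "F", "T", "F", "F", "F", "T", "T", "F"],
})

# Remove redundant data: drop exact duplicate rows.
clean = raw.drop_duplicates()

# Change formatting: turn the "T"/"F" strings into booleans.
clean = clean.assign(loves_nutella=clean["loves_nutella"].eq("T"))

# Remove outliers with the common 1.5 * IQR rule (one possible choice).
q1, q3 = clean["age"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = clean["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = clean[in_range].reset_index(drop=True)

# Split the table into the data and label arrays a classifier expects.
data = clean[["age"]].to_numpy()
labels = clean["loves_nutella"].to_numpy()
```

Here the duplicated row and the implausible age of 400 are dropped, leaving eight samples. Which rule counts as an "outlier" depends on the data, so the threshold is something to tune rather than a fixed recipe.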
7. Data Partitioning
   • Data and labels
     – {[data], [labels]}
     – {[3, 7, 76, 11, 22, 37, 56, 2], [T, T, F, T, F, F, F, T]}
     – Data: [Age, Do you love Nutella?]
   • Partitioning will create
     – {[train data], [train labels], [test data], [test labels]}
     – We usually split the data at a ratio of 9:1 (train:test)
     – There is a tradeoff: more test data makes the evaluation more reliable, but leaves less data for training the classifier
   • We will look at a partitioning function later
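As a preview, one way to do this partitioning is scikit-learn's `train_test_split` (found in `sklearn.model_selection` in current releases). This is a minimal sketch on the toy age data from the slide; the `random_state` value is an arbitrary choice to make the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data from the slide: one feature (age) and a True/False label.
data = np.array([[3], [7], [76], [11], [22], [37], [56], [2]])
labels = np.array([True, True, False, True, False, False, False, True])

# test_size=0.1 asks for a 9:1 train/test split; random_state fixes the shuffle.
train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, test_size=0.1, random_state=0
)

print(len(train_data), len(test_data))
```

With only eight samples, a 9:1 split leaves a single test sample, which illustrates the tradeoff: one sample says very little about how good the classifier is.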
8. Learn (The “Smart Part”)
   • Classification: the output is discrete, limited to a fixed number of classes (groups)
   • Regression: the output is continuous
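To make the distinction concrete, here is a small sketch on the Iris data set that ships with scikit-learn. The choice of `DecisionTreeClassifier` and `LinearRegression`, and of petal width as the regression target, is only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Classification: predict the species (one of three discrete classes)
# from the four flower measurements.
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
class_pred = clf.predict(iris.data[:5])

# Regression: predict petal width (a continuous value, in cm)
# from the other three measurements.
features, petal_width = iris.data[:, :3], iris.data[:, 3]
reg = LinearRegression().fit(features, petal_width)
value_pred = reg.predict(features[:5])

print(class_pred)   # integer class labels
print(value_pred)   # real-valued estimates
```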
9. Learn Programming
10. Create Classifier
    For most SUPERVISED LEARNING algorithms this would be
        C = ClassifyAlg(Params)
    It's up to us (ML guys) to set the best params. How?
    1. We could develop a hunch for it
    2. Perform an exhaustive search
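The exhaustive-search option can be sketched with scikit-learn's `GridSearchCV`, which tries every combination in a parameter grid. The classifier (`KNeighborsClassifier`) and the grid values below are arbitrary assumptions chosen to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Candidate values to try; this grid is an arbitrary illustration.
param_grid = {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}

# Every combination is scored with 5-fold cross-validation.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(iris.data, iris.target)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The search cost grows with the product of the grid sizes, so a hunch is still useful for deciding which parameters and ranges are worth searching.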
11. Train the classifier
    We assigned C = ClassifyAlg(Params)
    This is a general algorithm with some initializer and configuration.
    In this stage we train it using:
        C.fit(Data, Labels)
12. Predict
    After we have a trained classifier C:
        Predicted_Labels = C.predict(Data)
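Put together, the create / train / predict steps might look like the sketch below. `SVC` with a linear kernel stands in for the generic `ClassifyAlg(Params)`; that choice, the Iris data, and the split parameters are assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
train_data, test_data, train_labels, test_labels = train_test_split(
    iris.data, iris.target, test_size=0.1, random_state=0
)

C = SVC(kernel="linear")                 # C = ClassifyAlg(Params)
C.fit(train_data, train_labels)          # train on the training partition only
predicted_labels = C.predict(test_data)  # predict the unseen test partition

print(predicted_labels)
```

Predicting on data the classifier was not trained on is what makes the later accuracy evaluation meaningful.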
13. Predictor Evaluation
    We are not done yet.
    We need to evaluate the predictor's accuracy against other predictors and against the system requirements.
    We will learn several methods for this.
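As a rough preview, three common evaluation tools in scikit-learn are `accuracy_score`, `confusion_matrix` and `cross_val_score`. The sketch below applies them to a k-nearest-neighbours classifier on the Iris data; the classifier and its parameters are arbitrary choices, not the method the slides go on to teach.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
train_data, test_data, train_labels, test_labels = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)

clf = KNeighborsClassifier(n_neighbors=5).fit(train_data, train_labels)
predicted = clf.predict(test_data)

# Fraction of test samples labelled correctly.
accuracy = accuracy_score(test_labels, predicted)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(test_labels, predicted, labels=[0, 1, 2])

# Cross-validation repeats the split/train/score cycle on 5 different folds.
cv_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=5), iris.data, iris.target, cv=5
)

print(accuracy)
print(matrix)
print(cv_scores.mean())
```

A single accuracy number depends on which samples landed in the test set; the cross-validated mean is usually a steadier basis for comparing predictors.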
14. ENVIRONMENT
15. The Environment
    • There are many existing environments and tools we could use
      – Matlab with Machine learning toolbox
      – Apache Mahout
      – Python with Scikit-learn
    • Additional tools
      – Hadoop / Map-Reduce to accelerate and parallelize large data set processing
      – Amazon ML tools
      – NVIDIA tools
16. Scikit-learn
    • Installation instructions: http://scikit-learn.org/stable/install.html#install-official-release
    • Depends on two other libraries: numpy and scipy
    • Easiest way to install on Windows:
      – Install WinPython: http://sourceforge.net/projects/winpython/files/WinPython_2.7/2.7.9.4/
      – Let's install this together
    • For Linux / Mac computers, just install the 3 libs separately using pip
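Once the three libraries are installed (the pip package names are `numpy`, `scipy` and `scikit-learn`), a quick sanity check is to import them and print their versions. This is just a suggested check, not part of the official installation instructions.

```python
import numpy
import scipy
import sklearn

# Each library exposes its version string as __version__.
versions = {
    "numpy": numpy.__version__,
    "scipy": scipy.__version__,
    "scikit-learn": sklearn.__version__,
}
for name, version in versions.items():
    print(name, version)
```

If any of the imports fails, that library is missing from the Python environment being used.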
17. THE DATA
18. Data sets
    There are many data sets to work on.
    One of them is the Iris data set: classification of iris flowers into three groups.
    It has an interesting story you could google later.
    We'll work on the Iris data.
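The Iris data ships with scikit-learn, so a first look at it needs no download. A minimal sketch of inspecting its size, features and the three groups:

```python
from sklearn.datasets import load_iris

iris = load_iris()

# 150 flowers, each described by 4 measurements (in cm).
print(iris.data.shape)
print(iris.feature_names)

# The three groups (species) and the integer label stored for each flower.
print(iris.target_names)
print(iris.target[:5])
```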
19. Lab A – Plot the Iris data
    Plot sepal length vs. sepal width, with labels ONLY.
    How? Google "Iris data" and the scikit-learn environment.
    Try to understand the second part of the program, with the PCA.
20. Iris Data
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D
    from sklearn import datasets
    iris = datasets.load_iris()
    X = iris.data[:, :2]  # we only take the first two features.
    Y = iris.target
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
21. Plot Iris Data
    plt.figure(2, figsize=(8, 6))
    plt.clf()
    # One point per flower, coloured by its class label
    plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    # Pad the axes by the 0.5 margin computed earlier and hide the ticks
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())
22. Add PCA for better classification
    from sklearn.decomposition import PCA
    fig = plt.figure(1, figsize=(8, 6))
    # add_subplot creates the 3D axes and attaches them to the figure
    ax = fig.add_subplot(111, projection="3d", elev=-150, azim=110)
    # Project the four iris features onto the first three principal components
    X_reduced = PCA(n_components=3).fit_transform(iris.data)
    ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=Y, cmap=plt.cm.Paired)
    ax.set_title("First three PCA directions")
    ax.set_xlabel("1st eigenvector")
    ax.xaxis.set_ticklabels([])  # hide tick labels on each axis
    ax.set_ylabel("2nd eigenvector")
    ax.yaxis.set_ticklabels([])
    ax.set_zlabel("3rd eigenvector")
    ax.zaxis.set_ticklabels([])
    plt.show()
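To get a feel for what the PCA step does without plotting, one can look at how much of the data's variance each principal component keeps. This is a small supplementary sketch using the fitted estimator's `explained_variance_ratio_` attribute; it is not part of the original lab program.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(iris.data)

# Share of the total variance captured by each principal component,
# listed from the most to the least informative direction.
print(pca.explained_variance_ratio_)
print(X_reduced.shape)
```

If the first one or two ratios dominate, most of the structure in the four measurements survives the projection to fewer dimensions.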
23. Iris Data Classified
