# Intro to machine learning with scikit-learn

1. Yossi Cohen: Machine Learning with Scikit-learn
2. INTRO TO ML PROGRAMMING
3. ML Programming
    1. Get data (and labels, for supervised learning)
    2. Create a classifier
    3. Train the classifier
    4. Predict test data
    5. Evaluate predictor accuracy

    Configure and improve by repeating steps 2-5.
4. The ML Process (diagram; stages shown: Filter, Outliers, Regression, Classify, Validate, Configure, Model, Partition)
5. Get Data & Labels
    • Sources
        – Open data sources
        – Collect on your own
    • Verify data validity and correctness
    • Wrangle data
        – Make it readable by computer
        – Filter it
    • Remove outliers
    • The pandas Python library can assist with pre-processing and data manipulation before ML: http://pandas.pydata.org/
6. Pre-Processing
    • Change formatting
    • Remove redundant data
    • Filter data (take partial data)
    • Remove outliers
    • Label
    • Split for testing (10/90, 20/80)
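
As a rough illustration of the "remove outliers" step, the sketch below uses pandas to drop rows that fall more than 1.5 times the interquartile range outside the quartiles. The slides do not say which outlier rule to use, so the IQR rule, the `age` column name, and the extra value 420 are all assumptions made up for this example.

```python
import pandas as pd

# Toy data: the ages listed on slide 7 plus one made-up bad reading (420).
df = pd.DataFrame({"age": [3, 7, 76, 11, 22, 37, 56, 2, 420]})

# IQR rule (an assumed choice): keep values within 1.5 * IQR of the quartiles.
q1 = df["age"].quantile(0.25)
q3 = df["age"].quantile(0.75)
iqr = q3 - q1
in_range = df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Boolean indexing keeps only the rows flagged as in range.
filtered = df[in_range]
print(filtered["age"].tolist())
```

Other rules (for example a fixed valid range per column) may suit real data better; the point is only that filtering is a boolean mask over the rows.
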
7. Data Partitioning
    • Data and labels
        – {[data], [labels]}
        – {[3, 7, 76, 11, 22, 37, 56, 2], [T, T, F, T, F, F, F, T]}
        – Data: [Age, Do you love Nutella?]
    • Partitioning will create
        – {[train data], [train labels], [test data], [test labels]}
        – We usually split the data at a ratio of 9:1 (train:test)
        – There is a tradeoff between the effectiveness of the test and the amount of training data we can give the classifier
    • We will look at a partitioning function later
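
One partitioning function that fits this description is scikit-learn's `train_test_split`. The sketch below applies it to the toy age/Nutella data from the slide. With only eight samples a 9:1 split would leave a single test sample, so `test_size=0.25` is used here as an assumption; `random_state=0` is likewise an arbitrary choice that makes the shuffle repeatable.

```python
from sklearn.model_selection import train_test_split

# Toy data from the slide: one feature (age) per sample, and a True/False label.
data = [[3], [7], [76], [11], [22], [37], [56], [2]]
labels = [True, True, False, True, False, False, False, True]

# test_size=0.25 is an assumption for this tiny set (2 of 8 samples held out);
# random_state fixes the shuffle so the split is repeatable.
train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, test_size=0.25, random_state=0
)

print(len(train_data), len(test_data))
```

On a larger data set, `test_size=0.1` would give the 9:1 ratio mentioned above.
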
8. Learn (The "Smart Part")
    • Classification: the output is discrete, limited to a small number of classes (groups)
    • Regression: the output is continuous
9. Learn Programming
10. Create Classifier
    For most SUPERVISED LEARNING algorithms this would be C = ClassifyAlg(Params)
    It's up to us (ML practitioners) to set the best params. How?
    1. We could develop a hunch for it
    2. Perform an exhaustive search
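
The "exhaustive search" option can be sketched with scikit-learn's `GridSearchCV`, which tries every parameter combination in a grid and scores each with cross-validation. The choice of `KNeighborsClassifier`, the candidate values in `param_grid`, and the use of the Iris data (introduced later in the deck) are assumptions for illustration, not something the slides prescribe.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Candidate parameter values to try (an arbitrary illustrative grid).
param_grid = {"n_neighbors": [1, 3, 5, 7]}

# Every candidate is scored with 5-fold cross-validation.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(iris.data, iris.target)

# best_params_ holds the winning combination, best_score_ its mean CV accuracy.
print(search.best_params_, search.best_score_)
```

The cost grows with the product of the grid sizes, which is why a good hunch for the plausible range still helps.
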
11. Train the classifier
    We assigned C = ClassifyAlg(Params). This is a general algorithm with some initializer and configuration. In this stage we train it using: C.fit(Data, Labels)
12. Predict
    After we have a trained classifier C: Predicted_Labels = C.predict(Data)
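
The create / train / predict pattern from the last three slides looks like this with a concrete scikit-learn estimator. `KNeighborsClassifier` with `n_neighbors=5` stands in for the generic `ClassifyAlg(Params)`; that choice, the Iris data, and the 10% test split are assumptions for the sake of a runnable sketch.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
train_data, test_data, train_labels, test_labels = train_test_split(
    iris.data, iris.target, test_size=0.1, random_state=0
)

C = KNeighborsClassifier(n_neighbors=5)   # C = ClassifyAlg(Params)
C.fit(train_data, train_labels)           # train on the training partition
predicted_labels = C.predict(test_data)   # predict labels for unseen samples

print(predicted_labels)
```

Most scikit-learn classifiers expose the same `fit` / `predict` methods, so swapping in another algorithm usually only changes the line that constructs `C`.
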
13. Predictor Evaluation
    We are not done yet. We need to evaluate the predictor's accuracy in comparison to other predictors and to the system requirements. We will learn several methods for this.
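
The simplest such measure is plain accuracy: the fraction of predicted labels that match the true labels. The helper below is a hypothetical hand-written version for illustration (scikit-learn's `sklearn.metrics.accuracy_score` computes the same quantity), and the label lists are made up.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Made-up labels for illustration: 3 of 4 predictions are right.
true_labels = [True, True, False, True]
predicted_labels = [True, False, False, True]
print(accuracy(true_labels, predicted_labels))  # 0.75
```

Accuracy should be computed on the test partition, not on the data the classifier was trained on, otherwise it overstates how well the predictor generalizes.
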
14. ENVIRONMENT
15. The Environment
    • There are many existing environments and tools we could use
        – Matlab with Machine Learning toolbox
        – Apache Mahout
        – Python with Scikit-learn
    • Additional tools
        – Hadoop / MapReduce to accelerate and parallelize large data set processing
        – Amazon ML tools
        – NVIDIA tools
16. Scikit-learn
    • Installation instructions: http://scikit-learn.org/stable/install.html#install-official-release
    • Depends on two other libraries: numpy and scipy
    • Easiest way to install on Windows: install WinPython http://sourceforge.net/projects/winpython/files/WinPython_2.7/2.7.9.4/
        – Let's install this together
    • For Linux / Mac computers, just install the 3 libs separately using pip
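
Once the three libraries are installed (for example with `pip install numpy scipy scikit-learn`), a quick way to confirm the environment works is to import them and print their versions. This is only a sanity check, not part of the original slides.

```python
import numpy
import scipy
import sklearn

# Each package exposes its version string as __version__.
for module in (numpy, scipy, sklearn):
    print(module.__name__, module.__version__)
```

An `ImportError` at this point means the corresponding package is missing from the active Python environment.
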
17. THE DATA
18. Data sets
    There are many data sets to work on. One of them is the Iris data set, a classification into three groups. It has an interesting story you could google later. We'll work on the Iris data.
19. Lab A – Plot the Iris data
    Plot sepal length vs. sepal width with labels ONLY. How? Google the Iris data and the scikit-learn environment. Try to understand the second part of the program with the PCA.
20. Iris Data

    ```python
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib
    from sklearn import datasets

    iris = datasets.load_iris()
    X = iris.data[:, :2]  # we only take the first two features.
    Y = iris.target

    # Axis limits with a 0.5 margin around the data
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    ```
21. Plot Iris Data

    ```python
    plt.figure(2, figsize=(8, 6))
    plt.clf()

    # Scatter the first two features, colored by class label
    plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')

    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())  # hide the tick marks
    plt.yticks(())
    ```
22. Add PCA for better classification

    ```python
    from sklearn.decomposition import PCA  # PCA lives in sklearn.decomposition

    fig = plt.figure(1, figsize=(8, 6))
    # add_subplot attaches the 3D axes to the figure on current matplotlib
    ax = fig.add_subplot(111, projection="3d", elev=-150, azim=110)

    # Project all four features onto the first three principal components
    X_reduced = PCA(n_components=3).fit_transform(iris.data)
    ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=Y,
               cmap=plt.cm.Paired)

    ax.set_title("First three PCA directions")
    ax.set_xlabel("1st eigenvector")
    ax.xaxis.set_ticklabels([])
    ax.set_ylabel("2nd eigenvector")
    ax.yaxis.set_ticklabels([])
    ax.set_zlabel("3rd eigenvector")
    ax.zaxis.set_ticklabels([])
    plt.show()
    ```
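
To see what the PCA projection keeps, one can inspect `explained_variance_ratio_` on the fitted `PCA` object, which reports the fraction of the total variance captured by each component. This sketch is an addition for illustration and is separate from the plotting code on the slide.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(iris.data)  # shape: (150 samples, 3 components)

# Fraction of the total variance captured by each principal component.
print(pca.explained_variance_ratio_)
```

On the Iris data the first component carries most of the variance, which is why the classes already separate visibly along the first axis of the 3D plot.
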
23. Iris Data Classified
25. Thank you! More about me: Yossi Cohen, yossicohen19@gmail.com, +972-545-313092
    • Video compression and computer vision enthusiast & lecturer
    • Surfer