Nyc open-data-2015-andvanced-sklearn-expanded

Vivian  S. Zhang
Vivian S. ZhangCTO/Chief Data Scientist um NYC Data Science Academy
1
Advanced Scikit-Learn
Andreas Mueller (NYU Center for Data Science, scikit-learn)
2
Me
3
Classification
Regression
Clustering
Semi-Supervised Learning
Feature Selection
Feature Extraction
Manifold Learning
Dimensionality Reduction
Kernel Approximation
Hyperparameter Optimization
Evaluation Metrics
Out-of-core learning
…...
4
5
Overview
● Reminder: Basic sklearn concepts
●
Model building and evaluation:
– Pipelines and Feature Unions
– Randomized Parameter Search
– Scoring Interface
● Out of Core learning
– Feature Hashing
– Kernel Approximation
● New stuff in 0.16.0
– Overview
– Calibration
6
Training Data
Training Labels
Model
Supervised Machine Learning
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
7
Training Data
Test Data
Training Labels
Model
Prediction
Supervised Machine Learning
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
8
clf.score(X_test, y_test)
Training Data
Test Data
Training Labels
Model
Prediction
Test Labels Evaluation
Supervised Machine Learning
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
9
pca = PCA(n_components=3)
pca.fit(X_train) Training Data Model
Unsupervised Transformations
10
pca = PCA(n_components=3)
pca.fit(X_train)
X_new = pca.transform(X_test)
Training Data
Test Data
Model
Transformation
Unsupervised Transformations
11
Basic API
estimator.fit(X, [y])
estimator.predict estimator.transform
Classification Preprocessing
Regression Dimensionality reduction
Clustering Feature selection
Feature extraction
12
Cross-Validation
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(SVC(), X, y, cv=5)
print(scores)
>> [ 0.92 1. 1. 1. 1. ]
13
Cross-Validation
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(SVC(), X, y, cv=5)
print(scores)
>> [ 0.92 1. 1. 1. 1. ]
cv_ss = ShuffleSplit(len(X_train), test_size=.3,
n_iter=10)
scores_shuffle_split = cross_val_score(SVC(), X, y,
cv=cv_ss)
14
Cross-Validation
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(SVC(), X, y, cv=5)
print(scores)
>> [ 0.92 1. 1. 1. 1. ]
cv_ss = ShuffleSplit(len(X_train), test_size=.3,
n_iter=10)
scores_shuffle_split = cross_val_score(SVC(), X, y,
cv=cv_ss)
cv_labels = LeaveOneLabelOut(labels)
scores_pout = cross_val_score(SVC(), X, y, cv=cv_labels)
15
Cross -Validated Grid Search
16
Cross -Validated Grid Search
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
param_grid = {'C': 10. ** np.arange(-3, 3),
'gamma': 10. ** np.arange(-3, 3)}
grid = GridSearchCV(SVC(), param_grid=param_grid)
grid.fit(X_train, y_train)
grid.predict(X_test)
grid.score(X_test, y_test)
17
Training DataTraining Labels
Model
18
Training DataTraining Labels
Model
19
Training DataTraining Labels
Model
Feature
Extraction
Scaling
Feature
Selection
20
Training DataTraining Labels
Model
Feature
Extraction
Scaling
Feature
Selection
Cross Validation
21
Training DataTraining Labels
Model
Feature
Extraction
Scaling
Feature
Selection
Cross Validation
22
Pipelines
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_train, y_train)
pipe.predict(X_test)
23
Combining Pipelines and
Grid Search
Proper cross-validation
param_grid = {'svc__C': 10. ** np.arange(-3, 3),
'svc__gamma': 10. ** np.arange(-3, 3)}
scaler_pipe = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(scaler_pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
24
Combining Pipelines and
Grid Search II
Searching over parameters of the preprocessing step
param_grid = {'selectkbest__k': [1, 2, 3, 4],
'svc__C': 10. ** np.arange(-3, 3),
'svc__gamma': 10. ** np.arange(-3, 3)}
scaler_pipe = make_pipeline(SelectKBest(), SVC())
grid = GridSearchCV(scaler_pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
25
Feature Union
Training DataTraining Labels
Model
Feature
Extraction I
Feature
Extraction II
26
Feature Union
char_and_word = make_union(CountVectorizer(analyzer="char"),
CountVectorizer(analyzer="word"))
text_pipe = make_pipeline(char_and_word, LinearSVC(dual=False))
param_grid = {'linearsvc__C': 10. ** np.arange(-3, 3)}
grid = GridSearchCV(text_pipe, param_grid=param_grid, cv=5)
27
Feature Union
char_and_word = make_union(CountVectorizer(analyzer="char"),
CountVectorizer(analyzer="word"))
text_pipe = make_pipeline(char_and_word, LinearSVC(dual=False))
param_grid = {'linearsvc__C': 10. ** np.arange(-3, 3)}
grid = GridSearchCV(text_pipe, param_grid=param_grid, cv=5)
param_grid2 = {'featureunion__countvectorizer-1__ngram_range': [(1, 3), (1, 5), (2, 5)],
'featureunion__countvectorizer-2__ngram_range': [(1, 1), (1, 2), (2, 2)],
'linearsvc__C': 10. ** np.arange(-3, 3)}
28
Randomized Parameter Search
29
Randomized Parameter Search
Source: Bergstra and Bengio
30
Randomized Parameter Search
Source: Bergstra and Bengio
Step-size free for continuous parameters
Decouples runtime from search-space size
Robust against irrelevant parameters
31
Randomized Parameter Search
params = {'featureunion__countvectorizer-1__ngram_range':
[(1, 3), (1, 5), (2, 5)],
'featureunion__countvectorizer-2__ngram_range':
[(1, 1), (1, 2), (2, 2)],
'linearsvc__C': 10. ** np.arange(-3, 3)}
32
Randomized Parameter Search
params = {'featureunion__countvectorizer-1__ngram_range':
[(1, 3), (1, 5), (2, 5)],
'featureunion__countvectorizer-2__ngram_range':
[(1, 1), (1, 2), (2, 2)],
'linearsvc__C': expon()}
33
Randomized Parameter Search
rs = RandomizedSearchCV(text_pipe,
param_distributions=param_distributins, n_iter=50)
params = {'featureunion__countvectorizer-1__ngram_range':
[(1, 3), (1, 5), (2, 5)],
'featureunion__countvectorizer-2__ngram_range':
[(1, 1), (1, 2), (2, 2)],
'linearsvc__C': expon()}
34
Randomized Parameter Search
● Always use distributions for continuous
variables.
● Don't use for low dimensional spaces.
● Future: Bayesian optimization based search.
35
Generalized Cross-Validation and Path Algorithms
36
rfe = RFE(LogisticRegression())
37
rfe = RFE(LogisticRegression())
param_grid = {'n_features_to_select': range(1, n_features)}
gridsearch = GridSearchCV(rfe, param_grid)
grid.fit(X, y)
38
rfe = RFE(LogisticRegression())
param_grid = {'n_features_to_select': range(1, n_features)}
gridsearch = GridSearchCV(rfe, param_grid)
grid.fit(X, y)
39
rfe = RFE(LogisticRegression())
param_grid = {'n_features_to_select': range(1, n_features)}
gridsearch = GridSearchCV(rfe, param_grid)
grid.fit(X, y)
rfecv = RFECV(LogisticRegression())
40
rfe = RFE(LogisticRegression())
param_grid = {'n_features_to_select': range(1, n_features)}
gridsearch = GridSearchCV(rfe, param_grid)
grid.fit(X, y)
rfecv = RFECV(LogisticRegression())
rfecv.fit(X, y)
41
42
Linear Models Feature Selection Tree-Based models [possible]
LogisticRegressionCV [new] RFECV [DecisionTreeCV]
RidgeCV [RandomForestClassifierCV]
RidgeClassifierCV [GradientBoostingClassifierCV]
LarsCV
ElasticNetCV
...
43
Scoring Functions
44
Default:
Accuracy (classification)
R2 (regression)
GridSeachCV
RandomizedSearchCV
cross_val_score
...CV
45
Scoring with imbalanced data
cross_val_score(SVC(), X_train, y_train)
>>> array([ 0.9, 0.9, 0.9])
46
Scoring with imbalanced data
cross_val_score(SVC(), X_train, y_train)
>>> array([ 0.9, 0.9, 0.9])
cross_val_score(DummyClassifier("most_frequent"), X_train, y_train)
>>> array([ 0.9, 0.9, 0.9])
47
Scoring with imbalanced data
cross_val_score(SVC(), X_train, y_train)
>>> array([ 0.9, 0.9, 0.9])
cross_val_score(DummyClassifier("most_frequent"), X_train, y_train)
>>> array([ 0.9, 0.9, 0.9])
cross_val_score(SVC(), X_train, y_train, scoring="roc_auc")
array([ 0.99961591, 0.99983498, 0.99966247])
48
Scoring with imbalanced data
cross_val_score(SVC(), X_train, y_train)
>>> array([ 0.9, 0.9, 0.9])
cross_val_score(DummyClassifier("most_frequent"), X_train, y_train)
>>> array([ 0.9, 0.9, 0.9])
cross_val_score(SVC(), X_train, y_train, scoring="roc_auc")
array([ 0.99961591, 0.99983498, 0.99966247])
49
Available metrics
print(SCORERS.keys())
>> ['adjusted_rand_score',
'f1',
'mean_absolute_error',
'r2',
'recall',
'median_absolute_error',
'precision',
'log_loss',
'mean_squared_error',
'roc_auc',
'average_precision',
'accuracy']
50
Defining your own scoring
def my_super_scoring(est, X, y):
return accuracy_scorer(est, X, y) - np.sum(est.coef_ != 0)
51
Out of Core Learning
52
Or: save ourself the effort
53
Think twice!
● Old laptop: 4GB Ram
● 1073741824 float32
● Or 1mio data points with 1000 features
● EC2 : 256 GB Ram
● 68719476736 float32
● Or 68mio data points with 1000 features
54
Supported Algorithms
● All SGDClassifier derivatives
● Naive Bayes
● MinibatchKMeans
● IncrementalPCA
● MiniBatchDictionaryLearning
55
Out of Core Learning
sgd = SGDClassifier()
for i in range(9):
X_batch, y_batch = cPickle.load(open("batch_%02d" % i))
sgd.partial_fit(X_batch, y_batch, classes=range(10))
Possibly go over the data multiple times.
56
Stateless Transformers
● Normalizer
● HashingVectorizer
● RBFSampler (and other kernel approx)
57
Text data and the hashing trick
58
Bag Of Word Representations
“You better call Kenny Loggins”
CountVectorizer / TfidfVectorizer
59
Bag Of Word Representations
“You better call Kenny Loggins”
['you', 'better', 'call', 'kenny', 'loggins']
CountVectorizer / TfidfVectorizer
tokenizer
60
Bag Of Word Representations
“You better call Kenny Loggins”
[0, …, 0, 1, 0, … , 0, 1 , 0, …, 0, 1, 0, …., 0 ]
better call youaardvak zyxst
['you', 'better', 'call', 'kenny', 'loggins']
CountVectorizer / TfidfVectorizer
tokenizer
Sparse matrix encoding
61
Hashing Trick
“You better call Kenny Loggins”
['you', 'better', 'call', 'kenny', 'loggins']
HashingVectorizer
tokenizer
hashing
[hash('you'), hash('better'), hash('call'), hash('kenny'), hash('loggins')]
= [832412, 223788, 366226, 81185, 835749]
[0, …, 0, 1, 0, … , 0, 1 , 0, …, 0, 1, 0, ... 0 ]
Sparse matrix encoding
62
Out of Core Text Classification
sgd = SGDClassifier()
hashing_vectorizer = HashingVectorizer()
for i in range(9):
text_batch, y_batch = cPickle.load(open("text_%02d" % I))
X_batch = hashing_vectorizer.transform(text_batch)
sgd.partial_fit(X_batch, y_batch, classes=range(10))
63
Kernel Approximations
64
Reminder: Kernel Trick
x
65
Reminder: Kernel Trick
66
Reminder: Kernel Trick
Classifier linear → need only
67
Reminder: Kernel Trick
Classifier linear → need only
Linear:
Polynomial:
RBF:
Sigmoid:
68
Complexity
● Solving kernelized SVM:
~O(n_samples ** 3)
● Solving linear (primal) SVM:
~O(n_samples * n_features)
n_samples large? Go primal!
69
Undoing the Kernel Trick
● Kernel approximation:
● k =
= RBFSampler
70
Usage
sgd = SGDClassifier()
kernel_approximation = RBFSampler(gamma=.001, n_components=400)
for i in range(9):
X_batch, y_batch = cPickle.load(open("batch_%02d" % i))
if i == 0:
kernel_approximation.fit(X_batch)
X_transformed = kernel_approximation.transform(X_batch)
sgd.partial_fit(X_transformed, y_batch, classes=range(10))
71
Highlights from 0.16.0
72
Highlights from 0.16.0
● Multinomial Logistic Regression,
LogisticRegressionCV.
● IncrementalPCA.
● Probability callibration of classifiers.
● Birch clustering.
● LSHForest.
● More robust integration with pandas.
73
Probability Calibration
SVC().decision_function()
→ CalibratedClassifierCV(SVC()).predict_proba()
RandomForestClassifier().predict_proba()
→ CalibratedClassifierCV(RandomForestClassifier()).predict_proba()
74
75
CDS is hiring Research Engineers
76
Thank you for your attention.
@t3kcit
@amueller
t3kcit@gmail.comx
77
Bias Variance Tradeoff
(why we do cross validation and grid searches)
78
Overfitting and Underfitting
Model complexity
Accuracy
Training
79
Overfitting and Underfitting
Model complexity
Accuracy
Training
Generalization
80
Overfitting and Underfitting
Model complexity
Accuracy
Training
Generalization
Underfitting Overfitting
Sweet spot
81
Linear SVM
82
Linear SVM
83
(RBF) Kernel SVM
84
(RBF) Kernel SVM
85
(RBF) Kernel SVM
86
(RBF) Kernel SVM
87
Decision Trees
88
Decision Trees
89
Decision Trees
90
Decision Trees
91
Decision Trees
92
Decision Trees
93
Random Forests
94
Random Forests
95
Random Forests
96
Know where you are on the bias-variance tradeoff
97
Validation Curves
train_scores, test_scores = validation_curve(SVC(), X, y,
param_name="gamma", param_range=param_range)
98
train_sizes, train_scores, test_scores = learning_curve(
estimator, X, y,train_sizes=train_sizes)
Learning Curves
1 von 98

Recomendados

Kaggle talk series top 0.2% kaggler on amazon employee access challenge von
Kaggle talk series  top 0.2% kaggler on amazon employee access challengeKaggle talk series  top 0.2% kaggler on amazon employee access challenge
Kaggle talk series top 0.2% kaggler on amazon employee access challengeVivian S. Zhang
2.5K views65 Folien
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author von
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorVivian S. Zhang
89.2K views128 Folien
Data mining with caret package von
Data mining with caret packageData mining with caret package
Data mining with caret packageVivian S. Zhang
4.4K views36 Folien
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati... von
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...
Data Science Academy Student Demo day--Peggy sobolewski,analyzing transporati...Vivian S. Zhang
1.3K views11 Folien
Gradient boosting in practice: a deep dive into xgboost von
Gradient boosting in practice: a deep dive into xgboostGradient boosting in practice: a deep dive into xgboost
Gradient boosting in practice: a deep dive into xgboostJaroslaw Szymczak
3.4K views79 Folien
Introduction of Xgboost von
Introduction of XgboostIntroduction of Xgboost
Introduction of Xgboostmichiaki ito
3.4K views29 Folien

Más contenido relacionado

Was ist angesagt?

Gradient Boosted Regression Trees in scikit-learn von
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnDataRobot
46.4K views39 Folien
Streaming Python on Hadoop von
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on HadoopVivian S. Zhang
2.3K views54 Folien
XGBoost @ Fyber von
XGBoost @ FyberXGBoost @ Fyber
XGBoost @ FyberDaniel Hen
359 views51 Folien
Demystifying Xgboost von
Demystifying XgboostDemystifying Xgboost
Demystifying Xgboosthalifaxchester
361 views21 Folien
How to use SVM for data classification von
How to use SVM for data classificationHow to use SVM for data classification
How to use SVM for data classificationYiwei Chen
738 views63 Folien
Ensembling & Boosting 概念介紹 von
Ensembling & Boosting  概念介紹Ensembling & Boosting  概念介紹
Ensembling & Boosting 概念介紹Wayne Chen
5.3K views26 Folien

Was ist angesagt?(20)

Gradient Boosted Regression Trees in scikit-learn von DataRobot
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learn
DataRobot46.4K views
XGBoost @ Fyber von Daniel Hen
XGBoost @ FyberXGBoost @ Fyber
XGBoost @ Fyber
Daniel Hen359 views
How to use SVM for data classification von Yiwei Chen
How to use SVM for data classificationHow to use SVM for data classification
How to use SVM for data classification
Yiwei Chen738 views
Ensembling & Boosting 概念介紹 von Wayne Chen
Ensembling & Boosting  概念介紹Ensembling & Boosting  概念介紹
Ensembling & Boosting 概念介紹
Wayne Chen5.3K views
Introduction to XGBoost von Joonyoung Yi
Introduction to XGBoostIntroduction to XGBoost
Introduction to XGBoost
Joonyoung Yi5.2K views
XGBoost: the algorithm that wins every competition von Jaroslaw Szymczak
XGBoost: the algorithm that wins every competitionXGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competition
Jaroslaw Szymczak16.7K views
Natural Language Processing (NLP) von hichem felouat
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
hichem felouat143 views
maXbox starter65 machinelearning3 von Max Kleiner
maXbox starter65 machinelearning3maXbox starter65 machinelearning3
maXbox starter65 machinelearning3
Max Kleiner168 views
Converting Scikit-Learn to PMML von Villu Ruusmann
Converting Scikit-Learn to PMMLConverting Scikit-Learn to PMML
Converting Scikit-Learn to PMML
Villu Ruusmann27.8K views
Introduction To Generative Adversarial Networks GANs von hichem felouat
Introduction To Generative Adversarial Networks GANsIntroduction To Generative Adversarial Networks GANs
Introduction To Generative Adversarial Networks GANs
hichem felouat625 views
Support Vector Machines (SVM) von FAO
Support Vector Machines (SVM)Support Vector Machines (SVM)
Support Vector Machines (SVM)
FAO2.5K views
Reinforcement learning Research experiments OpenAI von Raouf KESKES
Reinforcement learning Research experiments OpenAIReinforcement learning Research experiments OpenAI
Reinforcement learning Research experiments OpenAI
Raouf KESKES57 views
Data Wrangling For Kaggle Data Science Competitions von Krishna Sankar
Data Wrangling For Kaggle Data Science CompetitionsData Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science Competitions
Krishna Sankar7.7K views
Build your own Convolutional Neural Network CNN von hichem felouat
Build your own Convolutional Neural Network CNNBuild your own Convolutional Neural Network CNN
Build your own Convolutional Neural Network CNN
hichem felouat194 views
Machine Learning: Classification Concepts (Part 1) von Daniel Chan
Machine Learning: Classification Concepts (Part 1)Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)
Daniel Chan168 views

Destacado

Winning data science competitions, presented by Owen Zhang von
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangVivian S. Zhang
40K views23 Folien
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c... von
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...Vivian S. Zhang
1.3K views14 Folien
Natural Language Processing(SupStat Inc) von
Natural Language Processing(SupStat Inc)Natural Language Processing(SupStat Inc)
Natural Language Processing(SupStat Inc)Vivian S. Zhang
1.9K views33 Folien
Using Machine Learning to aid Journalism at the New York Times von
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesVivian S. Zhang
3.7K views20 Folien
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny) von
 Hack session for NYTimes Dialect Map Visualization( developed by R Shiny) Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)Vivian S. Zhang
1.9K views113 Folien
Bayesian models in r von
Bayesian models in rBayesian models in r
Bayesian models in rVivian S. Zhang
3.7K views53 Folien

Destacado(20)

Winning data science competitions, presented by Owen Zhang von Vivian S. Zhang
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen Zhang
Vivian S. Zhang40K views
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c... von Vivian S. Zhang
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Vivian S. Zhang1.3K views
Natural Language Processing(SupStat Inc) von Vivian S. Zhang
Natural Language Processing(SupStat Inc)Natural Language Processing(SupStat Inc)
Natural Language Processing(SupStat Inc)
Vivian S. Zhang1.9K views
Using Machine Learning to aid Journalism at the New York Times von Vivian S. Zhang
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York Times
Vivian S. Zhang3.7K views
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny) von Vivian S. Zhang
 Hack session for NYTimes Dialect Map Visualization( developed by R Shiny) Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
Vivian S. Zhang1.9K views
Introducing natural language processing(NLP) with r von Vivian S. Zhang
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with r
Vivian S. Zhang5.4K views
Max Kuhn's talk on R machine learning von Vivian S. Zhang
Max Kuhn's talk on R machine learningMax Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learning
Vivian S. Zhang5.5K views
A Beginner's Guide to Machine Learning with Scikit-Learn von Sarah Guido
A Beginner's Guide to Machine Learning with Scikit-LearnA Beginner's Guide to Machine Learning with Scikit-Learn
A Beginner's Guide to Machine Learning with Scikit-Learn
Sarah Guido19.6K views
Introduction to Machine Learning with SciKit-Learn von Benjamin Bengfort
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
Benjamin Bengfort17.5K views
Machine learning on Hadoop data lakes von DataWorks Summit
Machine learning on Hadoop data lakesMachine learning on Hadoop data lakes
Machine learning on Hadoop data lakes
DataWorks Summit2.9K views
H2O Random Grid Search - PyData Amsterdam von Sri Ambati
H2O Random Grid Search - PyData AmsterdamH2O Random Grid Search - PyData Amsterdam
H2O Random Grid Search - PyData Amsterdam
Sri Ambati2.5K views
Nycdsa ml conference slides march 2015 von Vivian S. Zhang
Nycdsa ml conference slides march 2015 Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015
Vivian S. Zhang1K views
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data von Vivian S. Zhang
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public dataTHE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
Vivian S. Zhang2K views
Spatial query tutorial for nyc subway income level along subway von Vivian S. Zhang
Spatial query tutorial  for nyc subway income level along subwaySpatial query tutorial  for nyc subway income level along subway
Spatial query tutorial for nyc subway income level along subway
Vivian S. Zhang3K views
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend... von Vivian S. Zhang
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Vivian S. Zhang1.1K views
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc von Vivian S. Zhang
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nycData Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Vivian S. Zhang1.5K views
primal and dual problem von Yash Lad
primal and dual problemprimal and dual problem
primal and dual problem
Yash Lad56.8K views
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand... von PyData
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
PyData6.3K views

Similar a Nyc open-data-2015-andvanced-sklearn-expanded

Scikit-learn Cheatsheet-Python von
Scikit-learn Cheatsheet-PythonScikit-learn Cheatsheet-Python
Scikit-learn Cheatsheet-PythonDr. Volkan OBAN
1.4K views1 Folie
Cheat Sheet for Machine Learning in Python: Scikit-learn von
Cheat Sheet for Machine Learning in Python: Scikit-learnCheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learnKarlijn Willems
4.3K views1 Folie
Scikit learn cheat_sheet_python von
Scikit learn cheat_sheet_pythonScikit learn cheat_sheet_python
Scikit learn cheat_sheet_pythonZahid Hasan
48 views1 Folie
Building Machine Learning Pipelines von
Building Machine Learning PipelinesBuilding Machine Learning Pipelines
Building Machine Learning PipelinesInMobi Technology
938 views24 Folien
Building ML Pipelines von
Building ML PipelinesBuilding ML Pipelines
Building ML PipelinesDebidatta Dwibedi
519 views25 Folien
X_train y_trainX_test y_testX_valid y_validtrainin.docx von
X_train y_trainX_test y_testX_valid y_validtrainin.docxX_train y_trainX_test y_testX_valid y_validtrainin.docx
X_train y_trainX_test y_testX_valid y_validtrainin.docxtroutmanboris
2 views12 Folien

Similar a Nyc open-data-2015-andvanced-sklearn-expanded(20)

Cheat Sheet for Machine Learning in Python: Scikit-learn von Karlijn Willems
Cheat Sheet for Machine Learning in Python: Scikit-learnCheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learn
Karlijn Willems4.3K views
Scikit learn cheat_sheet_python von Zahid Hasan
Scikit learn cheat_sheet_pythonScikit learn cheat_sheet_python
Scikit learn cheat_sheet_python
Zahid Hasan48 views
X_train y_trainX_test y_testX_valid y_validtrainin.docx von troutmanboris
X_train y_trainX_test y_testX_valid y_validtrainin.docxX_train y_trainX_test y_testX_valid y_validtrainin.docx
X_train y_trainX_test y_testX_valid y_validtrainin.docx
troutmanboris2 views
Advanced pg_stat_statements: Filtering, Regression Testing & more von Lukas Fittl
Advanced pg_stat_statements: Filtering, Regression Testing & moreAdvanced pg_stat_statements: Filtering, Regression Testing & more
Advanced pg_stat_statements: Filtering, Regression Testing & more
Lukas Fittl2.7K views
Grid search.pptx von AbithaSam
Grid search.pptxGrid search.pptx
Grid search.pptx
AbithaSam26 views
maxbox starter60 machine learning von Max Kleiner
maxbox starter60 machine learningmaxbox starter60 machine learning
maxbox starter60 machine learning
Max Kleiner96 views
Benchy, python framework for performance benchmarking of Python Scripts von Marcel Caraciolo
Benchy, python framework for performance benchmarking  of Python ScriptsBenchy, python framework for performance benchmarking  of Python Scripts
Benchy, python framework for performance benchmarking of Python Scripts
Marcel Caraciolo1.8K views
EVERYTHING ABOUT STATIC CODE ANALYSIS FOR A JAVA PROGRAMMER von Andrey Karpov
EVERYTHING ABOUT STATIC CODE ANALYSIS FOR A JAVA PROGRAMMEREVERYTHING ABOUT STATIC CODE ANALYSIS FOR A JAVA PROGRAMMER
EVERYTHING ABOUT STATIC CODE ANALYSIS FOR A JAVA PROGRAMMER
Andrey Karpov212 views
Assignment 5.2.pdf von dash41
Assignment 5.2.pdfAssignment 5.2.pdf
Assignment 5.2.pdf
dash414 views
Machinelearning Spark Hadoop User Group Munich Meetup 2016 von Comsysto Reply GmbH
Machinelearning Spark Hadoop User Group Munich Meetup 2016Machinelearning Spark Hadoop User Group Munich Meetup 2016
Machinelearning Spark Hadoop User Group Munich Meetup 2016
Evaluating classifierperformance ml-cs6923 von Raman Kannan
Evaluating classifierperformance ml-cs6923Evaluating classifierperformance ml-cs6923
Evaluating classifierperformance ml-cs6923
Raman Kannan1K views
# Produce the features of a testing data instance X_new = np. arr.pdf von info893569
# Produce the features of a testing data instance X_new = np. arr.pdf# Produce the features of a testing data instance X_new = np. arr.pdf
# Produce the features of a testing data instance X_new = np. arr.pdf
info8935692 views

Más de Vivian S. Zhang

Why NYC DSA.pdf von
Why NYC DSA.pdfWhy NYC DSA.pdf
Why NYC DSA.pdfVivian S. Zhang
126 views21 Folien
Career services workshop- Roger Ren von
Career services workshop- Roger RenCareer services workshop- Roger Ren
Career services workshop- Roger RenVivian S. Zhang
784 views12 Folien
Nycdsa wordpress guide book von
Nycdsa wordpress guide bookNycdsa wordpress guide book
Nycdsa wordpress guide bookVivian S. Zhang
1.3K views20 Folien
We're so skewed_presentation von
We're so skewed_presentationWe're so skewed_presentation
We're so skewed_presentationVivian S. Zhang
1.2K views29 Folien
Wikipedia: Tuned Predictions on Big Data von
Wikipedia: Tuned Predictions on Big DataWikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataVivian S. Zhang
1K views20 Folien
A Hybrid Recommender with Yelp Challenge Data von
A Hybrid Recommender with Yelp Challenge Data A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data Vivian S. Zhang
981 views35 Folien

Más de Vivian S. Zhang(12)

Career services workshop- Roger Ren von Vivian S. Zhang
Career services workshop- Roger RenCareer services workshop- Roger Ren
Career services workshop- Roger Ren
Vivian S. Zhang784 views
Wikipedia: Tuned Predictions on Big Data von Vivian S. Zhang
Wikipedia: Tuned Predictions on Big DataWikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big Data
Vivian S. Zhang1K views
A Hybrid Recommender with Yelp Challenge Data von Vivian S. Zhang
A Hybrid Recommender with Yelp Challenge Data A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data
Vivian S. Zhang981 views
Kaggle Top1% Solution: Predicting Housing Prices in Moscow von Vivian S. Zhang
Kaggle Top1% Solution: Predicting Housing Prices in Moscow Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Vivian S. Zhang2.5K views
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc von Vivian S. Zhang
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nycData Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Vivian S. Zhang1.2K views
Data Science Academy Student Demo day--Michael blecher,the importance of clea... von Vivian S. Zhang
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Vivian S. Zhang1.5K views
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi... von Vivian S. Zhang
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
Vivian S. Zhang1.1K views
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc... von Vivian S. Zhang
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
Vivian S. Zhang779 views
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D... von Vivian S. Zhang
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
Vivian S. Zhang563 views

Último

OPPOTUS - Malaysians on Malaysia 3Q2023.pdf von
OPPOTUS - Malaysians on Malaysia 3Q2023.pdfOPPOTUS - Malaysians on Malaysia 3Q2023.pdf
OPPOTUS - Malaysians on Malaysia 3Q2023.pdfOppotus
18 views19 Folien
VoxelNet von
VoxelNetVoxelNet
VoxelNettaeseon ryu
13 views21 Folien
LIVE OAK MEMORIAL PARK.pptx von
LIVE OAK MEMORIAL PARK.pptxLIVE OAK MEMORIAL PARK.pptx
LIVE OAK MEMORIAL PARK.pptxms2332always
7 views6 Folien
[DSC Europe 23] Luca Morena - From Psychohistory to Curious Machines von
[DSC Europe 23] Luca Morena - From Psychohistory to Curious Machines[DSC Europe 23] Luca Morena - From Psychohistory to Curious Machines
[DSC Europe 23] Luca Morena - From Psychohistory to Curious MachinesDataScienceConferenc1
5 views20 Folien
CRIJ4385_Death Penalty_F23.pptx von
CRIJ4385_Death Penalty_F23.pptxCRIJ4385_Death Penalty_F23.pptx
CRIJ4385_Death Penalty_F23.pptxyvettemm100
7 views24 Folien
CRM stick or twist.pptx von
CRM stick or twist.pptxCRM stick or twist.pptx
CRM stick or twist.pptxinfo828217
11 views16 Folien

Último(20)

OPPOTUS - Malaysians on Malaysia 3Q2023.pdf von Oppotus
OPPOTUS - Malaysians on Malaysia 3Q2023.pdfOPPOTUS - Malaysians on Malaysia 3Q2023.pdf
OPPOTUS - Malaysians on Malaysia 3Q2023.pdf
Oppotus18 views
[DSC Europe 23] Luca Morena - From Psychohistory to Curious Machines von DataScienceConferenc1
[DSC Europe 23] Luca Morena - From Psychohistory to Curious Machines[DSC Europe 23] Luca Morena - From Psychohistory to Curious Machines
[DSC Europe 23] Luca Morena - From Psychohistory to Curious Machines
CRIJ4385_Death Penalty_F23.pptx von yvettemm100
CRIJ4385_Death Penalty_F23.pptxCRIJ4385_Death Penalty_F23.pptx
CRIJ4385_Death Penalty_F23.pptx
yvettemm1007 views
CRM stick or twist.pptx von info828217
CRM stick or twist.pptxCRM stick or twist.pptx
CRM stick or twist.pptx
info82821711 views
Short Story Assignment by Kelly Nguyen von kellynguyen01
Short Story Assignment by Kelly NguyenShort Story Assignment by Kelly Nguyen
Short Story Assignment by Kelly Nguyen
kellynguyen0119 views
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ... von DataScienceConferenc1
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...
[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ... von DataScienceConferenc1
[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ...[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ...
[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ...
[DSC Europe 23][DigiHealth] Muthu Ramachandran AI and Blockchain Framework fo... von DataScienceConferenc1
[DSC Europe 23][DigiHealth] Muthu Ramachandran AI and Blockchain Framework fo...[DSC Europe 23][DigiHealth] Muthu Ramachandran AI and Blockchain Framework fo...
[DSC Europe 23][DigiHealth] Muthu Ramachandran AI and Blockchain Framework fo...
UNEP FI CRS Climate Risk Results.pptx von pekka28
UNEP FI CRS Climate Risk Results.pptxUNEP FI CRS Climate Risk Results.pptx
UNEP FI CRS Climate Risk Results.pptx
pekka2811 views
Advanced_Recommendation_Systems_Presentation.pptx von neeharikasingh29
Advanced_Recommendation_Systems_Presentation.pptxAdvanced_Recommendation_Systems_Presentation.pptx
Advanced_Recommendation_Systems_Presentation.pptx
Organic Shopping in Google Analytics 4.pdf von GA4 Tutorials
Organic Shopping in Google Analytics 4.pdfOrganic Shopping in Google Analytics 4.pdf
Organic Shopping in Google Analytics 4.pdf
GA4 Tutorials16 views
Data about the sector workshop von info828217
Data about the sector workshopData about the sector workshop
Data about the sector workshop
info82821715 views
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx von DataScienceConferenc1
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx

Nyc open-data-2015-andvanced-sklearn-expanded