1. CSC Proprietary and Confidential
Prediction of the bike rental demand in Washington, D.C.
Zilong Zhao
Associate Professional: Data Scientist
March 4, 2016
CSC Proprietary and Confidential, December 14, 2016
Table of Contents
I. Motivation
II. Data Exploration
III. Predictive Analysis
IV. Summary and Outlook
What is a city bikeshare system?
Source: http://regionalbraunschweig.de/fahrradparkhaeuser-fuer-die-loewenstadt/, Photo: Sina Rühland
What is a city bikeshare system?
An automatic bike station powered by solar energy
Source: https://draufabfahren.de/unterwegs/der-perfekte-staedtetrip-mit-call-a-bike-36061
Why is predictive analytics useful for a bike-sharing company?
• Bikes available anytime and anywhere vs. avoiding overcapacity
• Bike repositioning: how and when
• Reduction of bottlenecks caused by regular bike maintenance
• High availability of bikes: customer satisfaction increases
Introduction of a Kaggle project
Forecast use of the bikeshare system in Washington, D.C.
About Kaggle
• A platform for predictive modelling and analytics competitions
• Partnered with NASA, Wikipedia, Deloitte etc.
• Milestones:
– Gesture recognition for Microsoft Kinect
– Data analysis for the Higgs boson project at CERN, Geneva, Switzerland
– Netflix US$1,000,000 prize for predicting user ratings of films
Source: https://www.kaggle.com/c/bike-sharing-demand
What does the data set look like?
Data Fields:
• Datetime: hourly date + timestamp
• Season: 1 = spring, 2 = summer, 3 = fall, 4 = winter
• Holiday: whether the day is considered a holiday
• Workingday: whether the day is neither a weekend nor holiday
• Weather:
1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
• Temp: temperature in Celsius
• Atemp: "feels like" temperature in Celsius
• Humidity: relative humidity
• Windspeed: wind speed
• Casual: number of non-registered user rentals initiated
• Registered: number of registered user rentals initiated
• Count: number of total rentals, prediction target
• Size of the training data: 10886 rows, 12 columns
• 2 years, hourly
Data preparation
• With target (training data): the first 19 days of every month, hourly
• Without target (test data): the remaining days of every month, hourly
• The data with target is split into 85% training set and 15% validation set
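The split described above can be sketched in a few lines of Python. This is a minimal illustration on hypothetical hourly timestamps for a single month; the function name `split_train_validation`, the random shuffling, and the seed are assumptions, since the slides do not say how the 85%/15% split was drawn.

```python
import random
from datetime import datetime, timedelta


def split_train_validation(rows, train_fraction=0.85, seed=0):
    """Shuffle the labelled rows and split them into training and validation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical hourly timestamps for one month (January 2011 has 31 days).
start = datetime(2011, 1, 1)
timestamps = [start + timedelta(hours=h) for h in range(31 * 24)]

# Days 1-19 carry the target; the remaining days form the test data.
labelled = [t for t in timestamps if t.day <= 19]
test = [t for t in timestamps if t.day > 19]

train, validation = split_train_validation(labelled)
print(len(train), len(validation), len(test))
```

A random split is only one option; for time-ordered data, holding out the last days of each labelled period would be a reasonable alternative.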
Evaluation of prediction
Following Kaggle's evaluation rules, predictions are scored by the Root Mean Squared Logarithmic Error (RMSLE), defined as
$$\mathrm{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log(y_{pred,i} + 1) - \log(y_{real,i} + 1) \right)^2}$$
• $n$ is the number of predictions
• $y_{pred,i}$ is the $i$-th predicted count
• $y_{real,i}$ is the $i$-th actual count
• $\log(x)$ is the natural logarithm
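As a minimal sketch, the RMSLE definition translates directly into Python. The function name `rmsle` is chosen here for illustration; `math.log1p(x)` computes the natural logarithm of x + 1.

```python
import math


def rmsle(y_pred, y_real):
    """Root Mean Squared Logarithmic Error of two equally long sequences."""
    if len(y_pred) != len(y_real) or len(y_pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(r)) ** 2 for p, r in zip(y_pred, y_real)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Hypothetical predicted and actual hourly counts.
print(rmsle([11, 110, 1100], [10, 100, 1000]))
```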
Why RMSLE?
The data spans a wide range of values. Suppose:
Real value    10    10000
Prediction    11    11000
Error on each data point:
• RMSE: $11000 - 10000 \gg 11 - 10$
• RMSLE: $\log 11000 - \log 10000 = \log \frac{11000}{10000} = \log 11 - \log 10$
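The comparison can be checked numerically. The sketch below uses plain logarithms, as in the simplified example, rather than the log(x + 1) form of the full RMSLE definition; the variable names are illustrative.

```python
import math

# (real value, prediction) pairs from the example
pairs = [(10, 11), (10000, 11000)]

abs_errors = [abs(pred - real) for real, pred in pairs]
log_errors = [abs(math.log(pred) - math.log(real)) for real, pred in pairs]

for (real, pred), abs_err, log_err in zip(pairs, abs_errors, log_errors):
    print(real, pred, abs_err, round(log_err, 4))
```

The absolute errors differ by a factor of 1000, while both logarithmic errors equal log 1.1, so a 10% miss is penalised equally at either scale.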
The weather effect
Weather:
1. Clear, Few clouds, Partly cloudy
2. Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3. Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4. Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
Generally, good weather increases bike demand.
[Figure: box plot of rental count per weather category, marking the median and the 25% and 75% quartiles]
Q: What happened in category 4?
“The overachieving snowfall of January 9, 2012”
Source: Washington Post blog
The Julian dates and times
Calendar date                         Julian date
January 1, 4713 B.C.E., 12:00 noon    0
January 2, 4713 B.C.E., 12:00 noon    1
March 4, 2016 C.E., 14:00             2457452.0833333334885537624359130859375
• Continuous count of days
• Representation of date/time within a single variable
• Used primarily by astronomers
• Precision: 1 millisecond (0.001 seconds)
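A Julian date can be derived from a Python `datetime` with a single offset. The sketch below assumes the timestamp is given in Universal Time and relies on the proleptic Gregorian calendar of Python's `datetime` module; the helper name `julian_date` and the constant name are hypothetical.

```python
from datetime import datetime

# Offset between Python's date ordinal (0001-01-01 is day 1)
# and the Julian date at midnight of the same day.
ORDINAL_TO_JD = 1721424.5


def julian_date(dt):
    """Return the Julian date of a naive datetime interpreted as Universal Time."""
    seconds = dt.hour * 3600 + dt.minute * 60 + dt.second + dt.microsecond / 1e6
    return dt.toordinal() + ORDINAL_TO_JD + seconds / 86400.0


print(julian_date(datetime(2016, 3, 4, 14, 0)))
```

Because the whole timestamp becomes one continuous number, it can be fed to a regression model as a single feature.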
The scikit-learn library
• A machine learning library in Python
• Includes various classification, regression and clustering algorithms
• Simple and efficient
• BSD license – open source, commercially usable
How does linear regression work?
x    10    13    16    20    5
y    10    6     4     0     ?
y ≈ ax + b
How does linear regression work?
x    10    13    16    20    5
y    10    6     4     0     14
y ≈ ax + b
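The value predicted for x = 5 can be reproduced with an ordinary least-squares fit of the four known points. This is a minimal sketch using the closed-form solution for a single feature; the function name `fit_line` is hypothetical, and the presentation itself may have used scikit-learn for this step.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope a and intercept b for y ≈ a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


xs = [10, 13, 16, 20]
ys = [10, 6, 4, 0]
a, b = fit_line(xs, ys)
prediction = a * 5 + b
print(round(a, 3), round(b, 3), round(prediction, 1))
```

The fitted line has a negative slope of roughly -0.97, and its prediction at x = 5 rounds to 14.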
Visualizations of the forecast with linear regression
A decision tree for cycling
• Attributes and their values:
– Weather: Sunny, Cloudy, Rain
– Humidity: High, Normal
– Wind: Strong, Weak
• Target concept cycling: Yes, No
Weather (root node)
  Sunny → Humidity (branch node)
    High → No (leaf node)
    Normal → Yes
  Cloudy → Yes
  Rain → Wind
    Strong → No
    Weak → Yes
Target    P(X)
Yes       2/3
No        1/3
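A decision tree of this kind is just a sequence of nested tests. The sketch below hard-codes the cycling tree as a Python function; the assignment of Humidity to the Sunny branch and Wind to the Rain branch is an assumption based on the diagram layout, and the function name `will_cycle` is hypothetical.

```python
def will_cycle(weather, humidity, wind):
    """Walk the tree from the root node down to a leaf and return 'Yes' or 'No'."""
    if weather == "Cloudy":
        return "Yes"  # leaf directly below the root
    if weather == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    if weather == "Rain":
        return "Yes" if wind == "Weak" else "No"
    raise ValueError("unknown weather value: %r" % (weather,))


print(will_cycle("Sunny", "High", "Weak"))
print(will_cycle("Rain", "Normal", "Weak"))
```

In practice the tests and their order are learned from data rather than written by hand.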
Advantages and disadvantages of decision trees
• Easy to explain
• Mirrors human decision-making
• Graphical interpretation possible
• Handles qualitative predictors naturally
• Problems:
– Predictive accuracy is generally not the best
– Can be very non-robust
Solution: aggregate many decision trees, using a method like random forest
The random forest algorithm
• Collection of unpruned decision trees
• Combination of individual tree decisions
• Improves prediction accuracy
• Encourages diversity among the trees
• Bagging, random decision trees
• Automatic feature selection
• Outputs variable importance
source: http://www.analyticsvidhya.com/blog/2015/09/random-forest-algorithm-multiple-challenges/
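The two sources of diversity listed above, bootstrap samples (bagging) and random feature choice, can be illustrated with a deliberately simplified ensemble. In this sketch each "tree" is a one-split regression stump rather than a full unpruned tree, and the toy data (hour, temperature) is hypothetical; all function names are assumptions for illustration.

```python
import random


def fit_stump(rows, targets, feature):
    """Fit a one-split regression stump on a single feature."""
    values = sorted(set(r[feature] for r in rows))
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
        right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((t - left_mean) ** 2 for t in left)
        error += sum((t - right_mean) ** 2 for t in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # only one distinct value: always predict the mean
        mean = sum(targets) / len(targets)
        return (feature, float("inf"), mean, mean)
    return (feature, best[1], best[2], best[3])


def predict_stump(stump, row):
    feature, threshold, left_mean, right_mean = stump
    return left_mean if row[feature] <= threshold else right_mean


def fit_forest(rows, targets, n_trees=50, seed=0):
    """Train each stump on a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))           # random feature choice
        sample_rows = [rows[i] for i in idx]
        sample_targets = [targets[i] for i in idx]
        forest.append(fit_stump(sample_rows, sample_targets, feature))
    return forest


def predict_forest(forest, row):
    """Combine the individual decisions by averaging."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)


# Hypothetical toy data: (hour, temperature) -> rental count
rows = [(hour, 15.0 + hour % 6) for hour in range(24)]
targets = [100.0 if 7 <= hour <= 19 else 10.0 for hour in range(24)]
forest = fit_forest(rows, targets)
print(predict_forest(forest, (12, 18.0)), predict_forest(forest, (2, 18.0)))
```

A production model would use a library implementation such as scikit-learn's `RandomForestRegressor`, which grows full trees and also reports feature importances.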
Visualizations of the forecast with random forest regressor
The feature ranking from random forest
The hourly trend
Three categories of bike demand:
• Peak: hours 7–9 and 16–19
• Average: hours 10–15
• Low: hours 0–6 and 20–24
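These three categories can be turned into a categorical feature. A minimal sketch, assuming hours are given as integers from 0 to 23 (so the low range ends at hour 23) and using a hypothetical function name `hour_category`:

```python
def hour_category(hour):
    """Map an hour of the day (0-23) to 'peak', 'average' or 'low' demand."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if 7 <= hour <= 9 or 16 <= hour <= 19:
        return "peak"
    if 10 <= hour <= 15:
        return "average"
    return "low"


print([hour_category(h) for h in (3, 8, 12, 17, 22)])
```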
About TensorFlow
• A Google software library for machine intelligence
• Developed by the Google Brain team, open source since Nov. 9, 2015
• Runs across platforms: CPUs or GPUs in servers and desktops, and also on mobile devices
• Currently used for both research and production
The artificial neural network
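An ANN with one hidden layer:
• Input: $x_1, x_2, x_3$
• Hidden layer: $h_1, \dots, h_5$
• Output: $y_1, y_2$
• Weights: $w_{ij}$ from input $x_i$ to hidden node $h_j$, $v_{ij}$ from hidden node $h_i$ to output $y_j$
$$h_1 = \varphi\left(\sum_{i=1}^{3} w_{i1} x_i\right)$$
…
$$y_1 = \sum_{i=1}^{5} v_{i1} h_i$$
$\varphi$: activation function, e.g. $\varphi(x) = \tanh x$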
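The forward pass of such a network is short enough to write out by hand. The sketch below follows the equations for a 3-5-2 network with tanh activation in the hidden layer and a linear output layer; the random weights and the function name `forward` are placeholders for illustration, not the TensorFlow model from the talk.

```python
import math
import random


def forward(x, w, v, phi=math.tanh):
    """Forward pass of a network with one hidden layer.

    w[i][j] is the weight from input i to hidden node j,
    v[i][k] is the weight from hidden node i to output k (zero-based indices).
    """
    hidden = [
        phi(sum(w[i][j] * x[i] for i in range(len(x))))
        for j in range(len(w[0]))
    ]
    outputs = [
        sum(v[i][k] * hidden[i] for i in range(len(hidden)))
        for k in range(len(v[0]))
    ]
    return hidden, outputs


rng = random.Random(0)
w = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(3)]  # 3 inputs -> 5 hidden
v = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(5)]  # 5 hidden -> 2 outputs

hidden, outputs = forward([0.5, -1.0, 2.0], w, v)
print(hidden, outputs)
```

Training then consists of adjusting `w` and `v` to reduce the prediction error, which is the part a library such as TensorFlow automates.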
Visualizations of the forecast from ANN built with TensorFlow
Machine learning benchmarks on the bike rental data
ML method              RMSLE on validation set    Training time (s)
Linear regression      0.9913                     0.3210065
Random forest          0.3525                     0.692971
ANN with TensorFlow    0.3369                     348.807
The RMSLE score with the ANN on Kaggle's test data: 0.40173.
Source: https://www.kaggle.com/c/bike-sharing-demand/leaderboard (last checked on Feb. 25, 2016)
Outlook
• Combination with time series analysis
• Feature engineering
– Categorization of hours: peak, average, low
• Separate models for registered and casual users
Source: http://brandchannel.com/2015/11/16/google-tensorflow-ai-111615/
Credits
• Kaggle
• Wikipedia, https://en.wikipedia.org
• Python Software Foundation, https://www.python.org/
• Stack Overflow, http://stackoverflow.com
• scikit-learn, http://scikit-learn.org
• TensorFlow, https://www.tensorflow.org/
• Financial Time Series Prediction Using Machine Learning Algorithms, Master Thesis, Leslie Tiong Ching Ow, Aug. 2012
• Date/Time Plotting, IDL Online Help, http://www.physics.nyu.edu/grierlab/idl_html_help/plotting14.html
• Dr. Florian Wilhelm
• The Big Data Analytics Team
Thank You
Zilong Zhao
Big Data & Analytics
BCRM, Wiesbaden
zzhao3@csc.com
Converting Gregorian calendar date to Julian day number
• First, compute the number of years ($y$) and months ($m$) since March 1, 4801 B.C.E.:
• Then compute:
• Finally, for the full Julian Date with time:
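One widely used integer-arithmetic formulation of these three steps is sketched below. It is an assumption that this matches the formulas shown in the talk; the helper variable `a` and both function names are illustrative. Times are taken to be Universal Time.

```python
def julian_day_number(year, month, day):
    """Julian day number of a Gregorian calendar date (integer arithmetic)."""
    a = (14 - month) // 12      # 1 for January and February, else 0
    y = year + 4800 - a         # years since March 1, 4801 B.C.E.
    m = month + 12 * a - 3      # months since March (March = 0)
    return (day + (153 * m + 2) // 5 + 365 * y
            + y // 4 - y // 100 + y // 400 - 32045)


def julian_date(year, month, day, hour=12, minute=0, second=0):
    """Full Julian date; the day number refers to 12:00 noon."""
    jdn = julian_day_number(year, month, day)
    return jdn + (hour - 12) / 24 + minute / 1440 + second / 86400


print(julian_day_number(2016, 3, 4), julian_date(2016, 3, 4, 14))
```

Shifting the start of the year to March places the leap day at the end of the counting year, which keeps the month-length term `(153 * m + 2) // 5` free of special cases.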
Activation function
Definition. The activation function of a node in a neural network defines
the output of the node, given a set of predetermined inputs.
With the activation function, non-linearity is introduced into the neural
network.
Examples: $\varphi(x) = \tanh x$ and $\varphi(x) = e^{-x^2}$