I gave this presentation at Code Camp. As a data scientist and backcountry skier, I was interested in looking at fatal avalanche data. This covers scraping the data and analyzing it with Python, pandas, and IPython Notebook. The final result is an infographic.
21. requests
# Download the Utah Avalanche Center fatality listing page.
import requests as r
# Landing page listing all recorded avalanche fatalities.
url = 'https://utahavalanchecenter.org/avalanches/fatalities'
req = r.get(url)  # HTTP GET; no error handling — assumes a 200 response
data = req.text  # decoded HTML body, later fed to BeautifulSoup
22. Scraping items
● Find <div class="content">
● Find <tr>'s
● Find <td>'s
– Field names: the end of a class attribute such as views-field-field-killed
– Field values: the string content of the <td>
● Also get the details URL from <td class='views-field-view-node'>
23. Code to Scrape
def get_info(data):
    """Parse the fatalities listing HTML into a list of row dicts.

    data: HTML text of the listing page.
    Returns a list of dicts mapping field name -> string value, one dict
    per table row that yielded at least one recognized field.
    """
    soup = BeautifulSoup(data)
    content = soup.find(id="content")
    trs = content.find_all('tr')
    res = []
    for tr in trs:
        tds = tr.find_all('td')
        # Fresh name instead of rebinding the `data` parameter, which the
        # original did — shadowing made the function harder to follow.
        row = {}
        for td in tds:
            name, value = get_field_name_value(td)
            if not name:
                continue  # <td> carried no recognized views-field class
            row[name] = value
        if row:  # skip header/empty rows that produced no fields
            res.append(row)
    return res
24. Code to Scrape
def get_field_name_value(elem):
    """Extract (field_name, value) from a listing-table <td> element.

    elem: a BeautifulSoup Tag for one <td>.
    Returns ('killed', '3')-style pairs for class names beginning with
    'views-field-field-', ('url', href) for the details-link cell, and
    (None, None) when no recognized class is present.
    """
    tags = elem.get('class')
    start = 'views-field-field-'
    for t in tags:
        if t.startswith(start):
            # The field name is whatever follows the common class prefix;
            # the value is all text inside the cell, whitespace-stripped.
            # (The slide wrapped this return across two lines, which would
            # be a SyntaxError as transcribed — rejoined here.)
            return t[len(start):], ''.join(elem.stripped_strings)
        elif t == 'views-field-view-node':
            # Details cell: the value is the <a>'s href, not its text.
            return 'url', elem.a['href']
    return None, None
28. Scraping Details
def get_avalanche_details(url, rows):
    """Fetch each accident's detail page and merge its fields into the row.

    url: site base URL; each row's relative 'url' value is appended to it.
    rows: list of dicts from the listing page (each must carry a 'url' key).
    Returns the same dicts, augmented with label -> value fields scraped
    from the detail page.  Fields already present from the listing win on
    key collisions.
    """
    res = []
    for item in rows:
        req = r.get(url + item['url'])
        soup = BeautifulSoup(req.text)
        content = soup.find(id='content')
        for div in content.find_all(class_='field'):
            key_elem = div.find(class_='field-label')
            if key_elem is None:
                # Field div without a label — nothing to key on; skip it.
                print("NONE!!! %s" % div)
                continue
            key = ''.join(key_elem.stripped_strings)
            try:
                value_elem = div.find(class_='field-item')
                # Strip non-breaking spaces (U+00A0) Drupal puts in values.
                # BUG FIX: the original replaced the literal 'xa0' (lost
                # backslash) rather than the NBSP character.
                value = ''.join(value_elem.stripped_strings).replace(u'\xa0', u' ')
            except AttributeError as e:
                # BUG FIX: the original fell through after printing, storing
                # the previous iteration's `value` (or raising NameError on
                # the first field).  Skip the field instead.
                print("%s %s" % (e, div))
                continue
            if key in item:
                continue  # don't clobber a field scraped from the listing
            item[key] = value
        res.append(item)
    return res
29. BS Notes
Can be annoying to find strings:
>>> from bs4 import BeautifulSoup
>>> s = BeautifulSoup('<div>foo<div>bar</div></div>')
>>> s
<html><body><div>foo<div>bar</div></div></body></html>
>>> s.string # This bothers me! None!
>>> s.strings
<generator object _all_strings at 0x...>
>>> list(s.strings)
[u'foo', u'bar']
33. Unicode bytes!
Traceback (most recent call last):
File "crawl.py", line 73, in <module>
crawl('/tmp/ava.csv', 2)
File "crawl.py", line 69, in crawl
df.to_csv(outname)
...
lib.write_csv_rows(self.data, ix, self.nlevels,
self.cols, self.writer)
File "pandas/lib.pyx", line 978, in
pandas.lib.write_csv_rows (pandas/lib.c:16858)
UnicodeEncodeError: 'ascii' codec can't encode character
u'\u200b' in position 70: ordinal not in range(128)
81. Enter gmaps
$ pip install gmaps
notebook code:
import gmaps
# Build (lat, lon) pairs, dropping rows whose latitude is NaN.
# str(x) == 'nan' is used because NaN != NaN makes direct comparison fail.
# Assumes df has numeric lat/lon columns — TODO confirm against the crawl.
d2 = [x for x in zip(df.lat, df.lon) if
str(x[0]) != 'nan']
gmaps.heatmap(d2)  # render fatality locations as a heatmap layer
82.
83. Enter Folium
Wraps leaflet.js
from IPython.display import HTML
import folium
def inline_map(map):
    """
    Embeds the HTML source of the map directly into the IPython notebook.

    This method will not work if the map depends on any files (json data).
    Also this uses the HTML5 srcdoc attribute, which may not be supported
    in all browsers.
    """
    map._build_map()  # renders the map's HTML into map.HTML
    # srcdoc is itself a double-quoted attribute, so double quotes inside
    # the embedded document must be entity-escaped.  BUG FIX: the slide
    # export decoded the entity, leaving the no-op replace('"', '"'), which
    # would truncate the iframe content at the first embedded quote.
    srcdoc = map.HTML.replace('"', '&quot;')
    return HTML('<iframe srcdoc="{srcdoc}" style="width: 100%; height: '
                '510px; border: none"></iframe>'.format(srcdoc=srcdoc))
def summary(i, row):
    """Build the HTML popup snippet for one accident row.

    i: row index used as the marker label prefix.
    row: mapping with 'year', 'Trigger', 'Location Name or Route' and
    'Accident and Rescue Summary' keys.
    """
    header = "{} {} {} {}".format(i, row['year'], row['Trigger'],
                                  row['Location Name or Route'])
    body = row['Accident and Rescue Summary']
    return "<b>{}</b> <p>{}</p>".format(header, body)
# Center the terrain map on a known accident location, then drop one
# popup marker per accident row that has usable coordinates.
accident_map = folium.Map(location=d2[4], zoom_start=10,
                          tiles='Stamen Terrain', height=700)
for idx, rec in df2.iterrows():
    # Skip rows with missing (NaN) or zero coordinates.
    if str(rec.lat) == 'nan' or rec.lat == 0:
        continue
    accident_map.simple_marker([rec.lat, rec.lon], popup=summary(idx, rec))
inline_map(accident_map)