6. Where are we going?
• open data everywhere
• a data swiss army knife
• finding network patterns
• finding spatial patterns
• which stories to pursue? moving beyond
data analysis
Saturday, March 10, 2012
7. • Data.gov
• OpenDataPhilly
• DC Data Catalog
• DataSF
• Chicago Data Portal
• NYC Open Data
• London Datastore
8. assembly member expenses
bicycle lanes
city purchase orders
dialysis centers
elevation data
filming locations
Google Transit Feed Specification (GTFS)
historical photos
influenza rates
judicial districts
Key Stage 2 test results by free school meal eligibility
land cover
monthly calls to Human Services Agency switchboard operators
neighborhood health clinics
Oyster ticket stop locations
political districts
quality of life indicators
restaurant inspections
sewer lines
traffic counts
utility excavation and paving five-year plan
violent crime incidents
ward offices
youth centers
zoning
**real-time parking availability and pricing**
12. • What are DC agencies spending money on?
• How much are they spending?
• What are the relationships between
businesses and agencies?
• Where are these businesses located?
14. swiss army knife
• csvkit: http://csvkit.readthedocs.org/
• a set of Python utilities for working with csv
• meant to replace csv module
• pip install csvkit (no issues!)
16. $ csvcut -c 2,6 purchase2011_cleaned.csv | csvstat
1. AGENCY_NAME
    <type 'unicode'>
    Nulls: False
    Unique values: 85
    5 most frequent values:
        DISTRICT OF COLUMBIA PUBLIC SCHOOLS: 2410
        STATE SUPERINTENDENT OF EDUCATION (OSSE): 1340
        DEPARTMENT OF HEALTH: 895
        OFFICE OF CHIEF TECHNOLOGY OFFICER: 786
        OFF PUBLIC ED FACILITIES MODERNIZATION: 722
    Max length: 40
2. SUPPLIER
    <type 'unicode'>
    Nulls: False
    Unique values: 4357
    5 most frequent values:
        OST, INC.: 841
        DELL COMPUTER CORP.: 366
        AMERICAN EXPRESS COMPANY: 282
        MVS, INC.: 176
        CAPITAL SERVICES AND SUPPLIES: 167
    Max length: 52
Row count: 16075
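The same kind of column summary can be approximated in plain Python. This is a rough sketch using only the standard library, not a description of how csvstat works internally. The function name `summarize_column` and the three sample rows are made up for illustration; they stand in for purchase2011_cleaned.csv.

```python
import csv
import io
from collections import Counter


def summarize_column(rows, column, top=5):
    """Return a small summary dict for one column of a list of dict rows."""
    values = [row[column] for row in rows]
    counts = Counter(values)
    return {
        "nulls": any(v == "" for v in values),      # any empty cells?
        "unique": len(counts),                      # distinct values
        "most_frequent": counts.most_common(top),   # (value, count) pairs
        "max_length": max((len(v) for v in values), default=0),
        "row_count": len(values),
    }


# Made-up sample data (hypothetical agencies and suppliers)
SAMPLE = """AGENCY_NAME,SUPPLIER
AGENCY A,SUPPLIER X
AGENCY A,SUPPLIER Y
AGENCY B,SUPPLIER X
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
summary = summarize_column(rows, "SUPPLIER")
print(summary)
```

To run it on a real file, replace `io.StringIO(SAMPLE)` with an opened file handle.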
17. $ csvgrep -c 6 -r ^MAYA purchase2011_cleaned.csv
PO_NUMBER,AGENCY_NAME,NIGP_DESCRIPTION,PO_TOTAL_AMOUNT,ORDER_DATE,SUPPLIER,SUPPLIER_FULL_ADDRESS
PO352244,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,408644.73,01/04/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO352652,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,111679.16,01/07/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO352920,PUBLIC CHARTER SCHOOLS,SCHOOL OPERATION AND MANAGEMENT SERVICES 71,2205630.13,01/11/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO355150,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,391092.49,02/07/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO356426,STATE SUPERINTENDENT OF EDUCATION (OSSE),FINANCIAL SERVICES (NOT OTHERWISE CLASSIFIED) 49,999891,02/23/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO356632,STATE SUPERINTENDENT OF EDUCATION (OSSE),PROFESSIONAL SERVICES (NOT OTHERWISE CLASSIFIED) 58,187200,02/25/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO359961,PUBLIC CHARTER SCHOOLS,SCHOOL OPERATION AND MANAGEMENT SERVICES 71,1753238,04/12/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO360284,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,110729.88,04/14/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO361203,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,92617.32,04/28/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO351462-V2,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATIONAL RESEARCH SERVICES 19,152229.95,05/05/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO364208,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,118825.51,06/09/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO366839,PUBLIC CHARTER SCHOOLS,SCHOOL OPERATION AND MANAGEMENT SERVICES 71,2767027,07/12/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO365094-V2,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,98092.35,08/15/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO370948,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,45736.58,08/25/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO361027-V5,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,29424.86,09/06/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO374132,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,9000,09/28/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO377919,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,491663.6,10/25/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO381219,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,120188.81,11/29/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO383965,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,294690.57,12/22/2011,MAYA ANGELOU PCS,"1436 U STREET, NW SUITE 203, WASHINGTON, DC, 20009"
18. $ csvcut -c 4,2,6,5 purchase2011_cleaned.csv | csvsort -r | head -n 20 | csvlook
------------------------------------------------------------------------------------------------------------
| PO_TOTAL_AMOUNT | AGENCY_NAME | SUPPLIER | ORDER_DATE |
------------------------------------------------------------------------------------------------------------
| 154133337.02 | DEPARTMENT OF TRANSPORTATION | SKANSKA-FACCHINA JV | 2011-11-10 |
| 62677473.88 | DEPARTMENT OF REAL ESTATE SERVICES | EEC OF DC INC-FORRESTER CONSTR | 2011-09-22 |
| 31809425.48 | DEPARTMENT OF HEALTH | DEFENSE LOGISTIC AGENCY | 2011-09-08 |
| 23600580.0 | DEPARTMENT OF CORRECTIONS | UNITY HEALTH CARE, INC. | 2011-10-24 |
| 23538552.0 | DEPARTMENT OF REAL ESTATE SERVICES | EEC-FORRESTER ANACOSTIA | 2011-11-08 |
| 22375314.45 | DEPARTMENT OF CORRECTIONS | CORRECTIONS CORPORATION OF | 2011-05-25 |
| 21450000.04 | DEPARTMENT OF HUMAN SERVICES | THE COMMUNITY PARTNERSHIPHOME | 2011-08-18 |
| 20813348.99 | DEPARTMENT OF REAL ESTATE SERVICES | THE JOHN AKRIDGE CO | 2011-06-28 |
| 20622000.0 | DEPARTMENT OF TRANSPORTATION | W M SCHLOSSER CO INC | 2011-08-29 |
| 19824914.0 | DEPARTMENT OF CORRECTIONS | CORRECTIONS CORPORATION OF | 2011-10-24 |
| 18300956.56 | DEPARTMENT OF HUMAN SERVICES | THE COMMUNITY PARTNERSHIPHOME | 2011-11-29 |
| 18104339.98 | DEPARTMENT OF HUMAN SERVICES | THE COMMUNITY PARTNERSHIPHOME | 2011-05-17 |
| 18000000.0 | DEPARTMENT OF HEALTH | DC PRIMARY CARE ASSOCIATION | 2011-03-10 |
| 17000000.0 | DEPARTMENT OF HEALTH | CHILDRENS NATIONAL MEDICAL CTR | 2011-11-25 |
| 16850000.0 | DEPUTY MAYOR FOR ECONOMIC DEVELOPMENT | 2 M STREET REDEVELOPMENT LLC | 2011-09-29 |
| 16333257.33 | DEPARTMENT OF HUMAN SERVICES | THE COMMUNITY PARTNERSHIPHOME | 2011-06-02 |
| 14206937.0 | PUBLIC CHARTER SCHOOLS | FRIENDSHIP PCS | 2011-07-12 |
| 13862557.44 | MUNICIPAL FACILITIES: NON-CAPITAL | US SECURITY ASSOCIATES, INC. | 2011-10-07 |
| 13800000.0 | DISTRICT DEPARTMENT OF THE ENVIRONMENT | VERMONT ENERGY INVESTMENT CORP | 2011-10-04 |
------------------------------------------------------------------------------------------------------------
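A Python take on the same question ("how much are they spending?") might look like the sketch below. It sorts rows numerically by amount and also totals the amount per agency, which the pipeline above does not do. The helper names `largest` and `total_by` and the three sample rows are hypothetical; only the column names come from the file header shown earlier.

```python
import csv
import io
from collections import defaultdict


def largest(rows, amount_column, n):
    """Return the n rows with the largest amount, biggest first."""
    return sorted(rows, key=lambda r: float(r[amount_column]), reverse=True)[:n]


def total_by(rows, key_column, amount_column):
    """Sum amount_column for each distinct value of key_column."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_column]] += float(row[amount_column])
    return dict(totals)


# Made-up sample data (hypothetical agencies and amounts)
SAMPLE = """PO_TOTAL_AMOUNT,AGENCY_NAME
100.50,AGENCY A
2500,AGENCY B
40,AGENCY A
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
top_two = largest(rows, "PO_TOTAL_AMOUNT", 2)
totals = total_by(rows, "AGENCY_NAME", "PO_TOTAL_AMOUNT")

for row in top_two:
    print(row["PO_TOTAL_AMOUNT"], row["AGENCY_NAME"])
print(totals)
```

Floats are good enough for a first look; for reporting exact dollar figures, `decimal.Decimal` would be the safer choice.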
19. Social Network Analysis
“Social network analysis is focused on
uncovering the patterning of people's
interaction.”
- http://www.insna.org/sna/what.html
20. 99th House
President: Reagan
House majority: Democrats
Years: 1985, 1986
21. 107th House
President: Bush
House majority: Republicans
Years: 2001, 2002
22. 108th House
President: Bush
House majority: Republicans
Years: 2003, 2004
23. 109th House
President: Bush
House majority: Republicans
Years: 2005, 2006
24. 110th House
President: Bush
House majority: Democrats
Years: 2007, 2008
25. 111th House
President: Obama
House majority: Democrats
Years: 2009, 2010
26. CSV to network
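import networkx as nx

G = nx.Graph()
node_edgelist = []

# grab edges: n is the node, e is the value that links nodes together
# (how n and e are read from each row is not shown on the slide)
for row in csv_file:
    node_edgelist.append((n, e))

# create edges; add_edge_or_weight is a helper that is not shown on the slide
for f in node_edgelist:
    for t in node_edgelist:
        if t != f:
            add_edge_or_weight(G, f[0], t[0])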
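A self-contained version of this idea could look like the sketch below. It rests on several assumptions that the slide does not confirm: that nodes are suppliers, that two suppliers get linked when they sell to the same agency, and that `add_edge_or_weight` adds an edge with weight 1 or increments an existing weight. It also groups nodes by their linking value instead of looping over every pair of rows, which is a swapped-in technique, and `build_network` is a hypothetical name. The sample rows are made up.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations

import networkx as nx


def add_edge_or_weight(G, a, b):
    """Add edge a-b with weight 1, or increment its weight if it exists."""
    if G.has_edge(a, b):
        G[a][b]["weight"] += 1
    else:
        G.add_edge(a, b, weight=1)


def build_network(rows, node_column, link_column):
    """Link every pair of nodes that share a value in link_column."""
    members = defaultdict(set)
    for row in rows:
        members[row[link_column]].add(row[node_column])

    G = nx.Graph()
    for nodes in members.values():
        G.add_nodes_from(nodes)
        # one edge (or weight increment) per shared linking value
        for a, b in combinations(sorted(nodes), 2):
            add_edge_or_weight(G, a, b)
    return G


# Made-up sample data (hypothetical agencies and suppliers)
SAMPLE = """AGENCY_NAME,SUPPLIER
AGENCY A,SUPPLIER X
AGENCY A,SUPPLIER Y
AGENCY B,SUPPLIER X
AGENCY B,SUPPLIER Y
AGENCY B,SUPPLIER Z
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
G = build_network(rows, node_column="SUPPLIER", link_column="AGENCY_NAME")
print(sorted(G.edges(data="weight")))
```

Here the edge weight counts how many agencies two suppliers have in common, so heavier edges mean more overlap.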
27. Centrality Analysis (networkx)
Degree
nx.degree(G)
Number of connections; more connections = more important
Closeness centrality
nx.closeness_centrality(G)
Distance to all other nodes; closer = more important
Betweenness centrality
nx.betweenness_centrality(G)
How often a node sits on the shortest paths between other nodes; more often = more control over the flow of information
PageRank
nx.pagerank(G)
A node gains importance from the importance of the nodes linked to it
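To see how these measures behave, here is a small sketch on a toy graph. The graph and node names are made up. `nx.pagerank(G)` follows the same call pattern but is left out here, because recent NetworkX versions may need SciPy installed for it.

```python
import networkx as nx

# Toy graph: a - b - c - d, with e hanging off b
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("b", "e")])

degree = dict(nx.degree(G))                  # number of connections per node
closeness = nx.closeness_centrality(G)       # inverse of average distance
betweenness = nx.betweenness_centrality(G)   # share of shortest paths through a node

measures = [
    ("degree", degree),
    ("closeness", closeness),
    ("betweenness", betweenness),
]
for name, scores in measures:
    top = max(scores, key=scores.get)
    print(name, top, round(scores[top], 3))
```

Node b comes out on top for all three measures, since it has the most connections and sits between most other pairs. On real data the measures often disagree, and those disagreements are where the interesting nodes are.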
29. Centrality Analysis (networkx)
Digi Docs Inc Document Mangers (Dallas)
“Offers software that generates loan documents for electronic delivery.”
Iron Mountain (Mountain View)
“Iron Mountain provides information management services that help organizations
lower the costs, risks and inefficiencies of managing their physical and digital data.”
MVS, Inc. (Washington, DC)
“MVS Consulting is an 8(a) STARS II, HUBZone, LSDBE, CBE, and MBE IT
Solutions company that provides IT solutions to Federal, State and Local
Government Agencies.”
MDM OFFICE SYSTEMS INC (Washington, DC)
"Standard Office Supply - Office Supplies, Furniture Dealer, Educational Products,
Breakroom Supplies, Imaging Supplies, and Coffee Services"
Capital Services and Supplies (Washington, DC)
“CSSI is an office solutions firm located in Washington, DC since 1980. CSSI’s
goods and services are available to commercial, government, and educational
institutions throughout the continental United States.”
30. Centrality Analysis (networkx)
Not included in previous slide...
United States Postal Service
&
Dell Computer Corp
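31. Visualize the network
import matplotlib.pyplot as plt
import networkx as nx

# G is the graph built from the CSV
pos = nx.spring_layout(G, iterations=100)  # force-directed layout
plt.figure(1, figsize=(15, 15))
plt.axis('off')
nx.draw_networkx_nodes(
    G,
    pos,
    node_size=100,
    alpha=1,
    node_color='g'
)
nx.draw_networkx_edges(G, pos, alpha=0.2)
plt.savefig('graph.png')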
41. Spatial is special
• spatial data = attributes, location, time
• mappable!
• spatial data must be referenced in space
• Tobler’s First Law of Geography: everything is related to everything else, but near things are more related than distant things
42. Spatial analysis
• reduce large data sets to a smaller amount of
meaningful information
• exploratory (ESDA)
• spatial statistics
• mathematical modeling and prediction of
spatial processes
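One common spatial statistic is Moran's I, which measures whether neighboring locations tend to have similar values. The sketch below is a minimal pure-Python version with binary neighbor weights, meant only to show the arithmetic. The function name `morans_i`, the neighbor dictionary layout, and the four-cell example are assumptions for illustration, and no significance testing is included.

```python
def morans_i(values, neighbors):
    """Moran's I with binary weights.

    values: list of numbers, one per location
    neighbors: dict mapping a location index to a list of neighbor indices
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]

    # S0 is the sum of all weights (each neighbor link counts as 1)
    s0 = sum(len(js) for js in neighbors.values())
    # cross-products of deviations over every neighboring pair
    cross = sum(dev[i] * dev[j] for i, js in neighbors.items() for j in js)
    denom = sum(d * d for d in dev)
    return (n / s0) * (cross / denom)


# Four cells in a row: 0 - 1 - 2 - 3
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

i_smooth = morans_i([1, 2, 3, 4], chain)       # similar values sit next to each other
i_alternating = morans_i([1, 4, 1, 4], chain)  # neighbors are as different as possible

print(round(i_smooth, 3), round(i_alternating, 3))
```

Positive values point to clustering of similar values, negative values to a checkerboard-like pattern, and values near zero to no spatial pattern.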
43. Techniques
• point pattern analysis -- hot spots, k
density, nearest neighbor
• spatial interpolation -- kriging
• spatial regression -- ordinary least squares,
geographically weighted regression
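As one concrete example of point pattern analysis, the nearest neighbor technique can be sketched as the Clark-Evans ratio: the observed mean distance to the nearest neighbor divided by the distance expected for a random pattern of the same density. This is a minimal pure-Python sketch that ignores edge effects. The function name `nearest_neighbor_index`, the coordinates, and the study area are made up for illustration.

```python
import math


def nearest_neighbor_index(points, area):
    """Clark-Evans ratio: observed mean nearest-neighbor distance
    divided by the expected distance under complete spatial randomness."""
    n = len(points)
    if n < 2 or area <= 0:
        raise ValueError("need at least two points and a positive area")

    nearest = []
    for i, (x1, y1) in enumerate(points):
        nearest.append(min(
            math.hypot(x1 - x2, y1 - y2)
            for j, (x2, y2) in enumerate(points)
            if j != i
        ))

    observed = sum(nearest) / n
    expected = 0.5 / math.sqrt(n / area)  # 1 / (2 * sqrt(density))
    return observed / expected


# Two made-up patterns in a 10 x 10 study area
clustered = [(1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1)]
spread_out = [(2.5, 2.5), (7.5, 2.5), (2.5, 7.5), (7.5, 7.5)]

r_clustered = nearest_neighbor_index(clustered, area=100.0)
r_spread_out = nearest_neighbor_index(spread_out, area=100.0)

print(round(r_clustered, 3), round(r_spread_out, 3))
```

A ratio below 1 suggests clustering, a ratio near 1 suggests a random pattern, and a ratio above 1 suggests points that are more evenly spaced than chance would give.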
52. PySAL
• GeoDa Center at ASU
• Python library for spatial analysis, with modules for
exploratory spatial data analysis, spatial
econometrics, and location modeling
• http://code.google.com/p/pysal/
• requires NumPy, SciPy
53. PySAL
• developers looking for spatial analytical methods
to incorporate in application development
• analysts working on projects that require custom
scripting
• looking for a user-friendly GUI? Try STARS,
GeoDa, GeoDaSpace.
• want to integrate into a powerful GIS? Look for
plug-ins for ArcGIS & QGIS.
55. Next steps
• quantify clusters in city, region, nation
• examine clusters along networks, business
corridors
• create beautiful, interactive maps and charts to
allow users to explore spending patterns on their
own
57. Which stories would we go after?
• construction contracts
• funding to charter schools
• health care costs in prisons
• local vs. regional vs. national purchases
• technology services -- look for overlap
58. Want to learn more?
The SAGE Handbook of
Spatial Analysis
eds. A. Stewart Fotheringham and
Peter A. Rogerson
Interactive Spatial Data
Analysis
Trevor Bailey and Tony Gatrell
Geographic Information
Analysis
David O’Sullivan and David Unwin
PySAL
Luc Anselin, GeoDa Center
Arizona State University
Mia, age 3, geographer in training
59. And even more?
NetworkX tutorial
http://networkx.lanl.gov/networkx_tutorial.pdf
UCD Dublin summer course
http://mlg.ucd.ie/summer
Social Network Analysis for
Startups (O'Reilly Media)
http://shop.oreilly.com/product/0636920020424.do