26. Hong Kong District Council (Disco)
Final output:
● Power of each Camp:
○ https://theinitium.com/project/20151012-hk-district-council-elections/
● Power of major parties:
○ https://theinitium.com/project/20151019-hk-district-council-elections-2/
● Guide for 2015 election:
○ https://theinitium.com/project/20151029-hk-district-council-elections-3/desktop.html
27. Meet the data
Data:
● From 1999 to 2015
● # of Candidates: 4392
○ Name, occupation, party, camp, votes
● # of Constituencies: 2039
○ Total votes, voting rate, count of voters, population
28. Methodology:
● Automatic
○ Scraping
● Semi-automatic:
○ Copy-and-paste a few tables from the website
○ Data cleaning by human
● Manual input from books
○ Labour intensive
● Investigation
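A minimal sketch of the automatic (scraping) step, assuming the source page publishes candidates in a plain HTML table. The sample markup, column names and figures below are hypothetical and are not taken from the real election website; only the Python standard library is used.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample page; not real election data.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Party</th><th>Votes</th></tr>
  <tr><td>Candidate A</td><td>Party X</td><td>1,234</td></tr>
  <tr><td>Candidate B</td><td>Independent</td><td>987</td></tr>
</table>
"""

header, *records = parse_table(SAMPLE_HTML)
candidates = [dict(zip(header, record)) for record in records]
for candidate in candidates:
    # Strip thousands separators so votes can be summed later.
    candidate["Votes"] = int(candidate["Votes"].replace(",", ""))
print(candidates)
```

The same parser also covers the semi-automatic case: a table pasted from the website can be fed in as a string and then cleaned by hand.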
Meet the sources
29. Manpower overview
Metric | Value
# of unique participants | 8
Data collection/cleaning | 720 man-hours (3 months)
Data validation | 24 man-hours (3 days)
Data analysis | 50 man-hours (6 days)
Project span | 5 months
Manpower overview of the large data collection campaign
31. Challenge: Hard to Collect
Database open sourced:
http://initiumlab.com/#database
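Field | 1999 | 2003 | 2007 | 2011 | 2015
Personal info (age) | Manual, from books (3) | Manual, from books (3) | Manual, from books (3) | Manual, from books (3) | Automatic scraping, 睇嘢 (0.5)
Personal info (gender, occupation) | Manual, from books (6) | Manual, from books (6) | Manual, from books (6) | Election website, manual (2) | Election website, automatic (1)
Political affiliation (party) | Manual, from books (3) | Manual, from books (3) | Manual, from books (3) | Election website, manual (1) | Election website, automatic (0.5)
Political affiliation (pan-democratic / pro-establishment / other) | Investigation + labelling (130) | Investigation + labelling (130) | Investigation + labelling (130) | Investigation + labelling (130) | Investigation + labelling (130)
Constituency info (population, voter count, turnout) | Election website, manual (2.1) | Election website, manual (2.1) | Election website, manual (2.1) | Election website, manual (2.1) | Election website, automatic (1.1)
Election results (vote share) | Manual, from books (3) | Manual, from books (3) | Manual, from books (3) | Manual, from books (3) | Missing (0)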
Research/ investigation consumes significantly more time
Online accessible/ (semi-) formatted data saves time
Importance of open data and knowledge sharing
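The three takeaways can be checked against the figures in parentheses in the table. The sketch below copies those figures and totals them per field and per election year. The slide does not state their unit, so the code treats them as unitless effort figures; the short field names are labels chosen here for illustration.

```python
YEARS = [1999, 2003, 2007, 2011, 2015]

# Figures in parentheses from the table, one per election year.
# The unit is not stated on the slide; treated as unitless effort here.
EFFORT = {
    "age": [3, 3, 3, 3, 0.5],
    "gender, occupation": [6, 6, 6, 2, 1],
    "party": [3, 3, 3, 1, 0.5],
    "camp": [130, 130, 130, 130, 130],
    "constituency info": [2.1, 2.1, 2.1, 2.1, 1.1],
    "vote share": [3, 3, 3, 3, 0],
}

# Effort per field, summed over all five elections.
row_totals = {field: sum(values) for field, values in EFFORT.items()}
grand_total = sum(row_totals.values())

# Effort per election year, summed over all fields.
year_totals = {
    year: sum(values[i] for values in EFFORT.values())
    for i, year in enumerate(YEARS)
}

# Largest field first, with its share of the overall effort.
for field, total in sorted(row_totals.items(), key=lambda kv: -kv[1]):
    print(f"{field:20s} {total:7.1f} {total / grand_total:6.1%}")
for year, total in year_totals.items():
    print(year, round(total, 1))
```

On these figures the camp labelling, which needed investigation, accounts for roughly nine tenths of the total, and 2015, where the election website could be scraped automatically, is the cheapest year for every other field that has data.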
32. Challenge: Efficiency & Quality
Misconception:
“Manual input is only a problem of labour; not a problem of science”
- How to use semi-automatic tools to improve efficiency?
- How to track data pipeline/ dependency graph?
- How many points should you sample for data validation?
- How to maximize the performance of a group of data collectors?
- In terms of project span?
- In terms of throughput?
- How to set up an incentive mechanism to ensure quality?
- …
All of these are active research directions.
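As one illustration of the sampling question, a textbook starting point is Cochran's sample-size formula with a finite population correction. This is an assumption made here for illustration, not the method the project used. With z the normal quantile for the chosen confidence level, p the assumed error rate (0.5 is the most conservative choice), e the margin of error and N the number of records: n0 = z² p (1 − p) / e², then n = n0 / (1 + (n0 − 1) / N).

```python
import math
from statistics import NormalDist


def validation_sample_size(population, margin=0.05, confidence=0.95, p=0.5):
    """Records to spot-check so the estimated error rate is within `margin`.

    Cochran's formula with finite population correction. All defaults are
    illustrative assumptions, not values from the project.
    """
    # Two-sided normal quantile, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an infinitely large population.
    n0 = z * z * p * (1 - p) / (margin * margin)
    # Shrink it for a finite population of `population` records.
    n = n0 / (1 + (n0 - 1) / population)
    return min(population, math.ceil(n))


# 4392 is the number of candidate records quoted in the data overview.
print(validation_sample_size(4392))
print(validation_sample_size(4392, margin=0.02))
```

Tightening the margin raises the sample size quickly, which is why the validation budget has to be weighed against the collection budget.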
33. HK Legislative Council Voting
SOPA 2016 Excellence Award Winner
More: https://theinitium.com/article/20160615-sopa-awards-2016/
34. Hong Kong Legislative Council (Legco)
Hong Kong Legislative Council
● Current term: 17/10/2012 ~ 18/06/2015
● 70 members
● 12 government departments
● 2921 motions
Structured data set
Focus on mining
35. Video: Legco Voting on Youtube (English)
https://www.youtube.com/watch?v=0evK3PtLaUo
36. Video: Legco Voting on Youtube (Cantonese)
https://www.youtube.com/watch?v=KYa-ygjqaV4
38. Other output of Legco Analysis project
● Chinese report + Cantonese animation:
https://theinitium.com/article/20150812-hongkong-legcoanalysis/
● Interactive Web:
http://legco.initiumlab.com
● English animation: https://www.youtube.com/watch?v=CExoTvKuXSw
42. Challenge: Insights? Impact? Value?
● “Value”: High, Difficulty: Low
● “Value”: Med, Difficulty: Med
● “Value”: Low, Difficulty: High
43. Challenge: “Value” of Data
● Unique source (e.g. Disco: time series)
● Interesting data point (e.g. Legco: Starry Lee against herself)
● Good interpretation/ visualisation (e.g. Legco: heatmap)
● Technically deep analysis (e.g. Legco: member ordering)
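A heatmap of members and an ordering of members both need some measure of how alike two members vote. The slides do not say which measure the project used, so the sketch below assumes the simplest one: the share of motions on which two members cast the same vote. The member names and votes are hypothetical toy data.

```python
# Hypothetical toy data: one vote per motion, None when the member was absent.
VOTES = {
    "Member A": ["Y", "Y", "N", "Y"],
    "Member B": ["Y", "Y", "N", "N"],
    "Member C": ["N", "N", "Y", None],
}


def agreement(a, b):
    """Share of motions both members voted on where they voted the same way."""
    both = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not both:
        return float("nan")  # no motion in common, agreement undefined
    return sum(x == y for x, y in both) / len(both)


members = sorted(VOTES)

# Symmetric member-by-member matrix; these are the cells a heatmap would colour.
matrix = {
    m: {n: agreement(VOTES[m], VOTES[n]) for n in members}
    for m in members
}

# A crude ordering: rank everyone by agreement with one reference member.
reference = "Member A"
ordering = sorted(members, key=lambda m: -matrix[reference][m])

for m in members:
    print(m, [round(matrix[m][n], 2) for n in members])
print(ordering)
```

Ranking against a single reference member is only a stand-in; a deeper analysis would typically order members with a clustering or dimensionality-reduction method.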
45. Challenge: Data Pipelining
● Integration:
○ Google Analytics
○ UMeng
○ Fabric/ Crashlytics
○ Database
○ Server log
○ … Many third party stats
● Processing:
○ Extraction
○ Transformation
○ Aggregation
○ Visualisation
● Presentation:
○ Visualisation
○ Formatting
○ Articulation
A combination of manual, semi-auto and auto integration
Lots of room for improvement
Usually deferred until it becomes a must
Only useful after successful articulation of your findings
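One way to keep such a pipeline tractable is to write down which step depends on which and let the computer derive the run order. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+). The step names follow the stages above, but the exact dependencies between them are an assumption made for illustration.

```python
from graphlib import TopologicalSorter

# step -> set of steps it depends on (illustrative wiring, not the real one).
DEPENDS_ON = {
    "extraction": {"google_analytics", "database", "server_log"},
    "transformation": {"extraction"},
    "aggregation": {"transformation"},
    "visualisation": {"aggregation"},
    "articulation": {"visualisation"},
}

# A valid run order: every step appears after everything it depends on.
order = list(TopologicalSorter(DEPENDS_ON).static_order())


def downstream(step):
    """Every step that has to be re-run when `step` changes."""
    hit = set()
    frontier = [step]
    while frontier:
        current = frontier.pop()
        for node, deps in DEPENDS_ON.items():
            if current in deps and node not in hit:
                hit.add(node)
                frontier.append(node)
    return hit


print(order)
print(sorted(downstream("server_log")))
```

Keeping the graph as data means a change in one source, such as the server log, immediately shows which later steps are stale.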