2. Who am I ?
Yen ( 顏嘉儀 )
Second-year master's student, NTU Economics
Python, R user
3. Why I am here ?
1
"I heard.. during your summer internship... you wrote some whatchamacallit in Python... right...."
2
"What's going on? (on guard = =+)"
3
"Then come share it at PyLadies (gazing into the distance)"
4. Take my internship experience
as an example
- Crawler: What would happen
if you were a Python programmer
in the financial community?
20. Simply put, it's just scraping data~
Windows platform: C# ?
[ C# ] Arachnode.net ? (not free)
R ? (slower)
Try the Python solution: Scrapy
21. Scrapy
(official description)
Scrapy is a web crawling framework,
used to crawl websites and extract
structured data from their pages.
A package that lets you develop web crawlers quickly.
How fast? And why is it fast?
24. XPath Parser
Scrapy is a web crawling framework,
used to crawl websites and extract
structured data from their pages.
# Regular expressions:
every character is treated the same
# Alternative: XPath
an HTML doc can be treated as
structured data
25. XPath is like an "address"
# C://Python27
# html/body/div[@class="wrapper"]/div[@class="header.clearfix"]/h1[@class="logo"]/a
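The "address" idea can be tried without Scrapy: Python's standard-library `xml.etree.ElementTree` understands a subset of XPath. A minimal sketch (the HTML snippet below is made up to mirror the path above):

```python
# XPath as an "address": walk from the document root down to one element.
# Uses only the standard library; the snippet is a toy example.
import xml.etree.ElementTree as ET

html = """
<html><body>
  <div class="wrapper">
    <div class="header clearfix">
      <h1 class="logo"><a href="/">PyLadies</a></h1>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(html)  # root is the <html> element
# Follow the address: body / div[@class='wrapper'] / div / h1 / a
link = root.find("./body/div[@class='wrapper']/div/h1/a")
print(link.text)         # PyLadies
print(link.get("href"))  # /
```

Real pages are rarely valid XML, which is why Scrapy ships its own selector objects, but the addressing idea is the same.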
28. import scrapy  # the tools Scrapy provides
Pseudo Code
class MySpider( the Spider class Scrapy already wrote ):
    name = "spider name"
    start_urls = [ initial request (URL) ]  # i.e. the top-level root
    def parse(self, first_response):
        # the name of this first parse function is fixed!!
        hxs = HtmlXPathSelector( first_response )
        # html response => XPath-structured object
        Xpath = "rule for which URLs to crawl"
        extracted_data = hxs.select(Xpath).extract()
        yield Request(url=a crawled url, callback=self.next_parser)
    def next_parser(self, second_response):
        # this function name is whatever you like
        hxs = HtmlXPathSelector(second_response)
        Xpath = "rule for which data to scrape"
        extracted_data = hxs.select(Xpath).extract()
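The Request/callback flow in the pseudocode can be simulated in plain Python to see how Scrapy chains parse functions. Everything below (the fake pages, the mini scheduler) is a toy stand-in, not Scrapy's real engine:

```python
# Simulate Scrapy's callback chaining: each parse step yields "requests"
# naming the next callback, and a small loop plays the scheduler.
from collections import deque

class Request:
    def __init__(self, url, callback):
        self.url, self.callback = url, callback

# Fake "responses": url -> (links found on the page, data on the page)
PAGES = {
    "root":  (["page1", "page2"], []),
    "page1": ([], ["row-a"]),
    "page2": ([], ["row-b"]),
}

def parse(url):          # first callback: extract URLs to follow
    links, _ = PAGES[url]
    for link in links:
        yield Request(link, next_parser)

def next_parser(url):    # second callback: extract the actual data
    _, data = PAGES[url]
    yield from data

def crawl(start):
    queue, items = deque([Request(start, parse)]), []
    while queue:
        req = queue.popleft()
        for out in req.callback(req.url):
            (queue if isinstance(out, Request) else items).append(out)
    return items

print(crawl("root"))  # ['row-a', 'row-b']
```

The point: `parse` never fetches anything itself; it only declares what to visit next and who handles it, which is what `yield Request(url, callback=...)` expresses in a real spider.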
29. cmd commands
# create a Scrapy project
[cmd] scrapy startproject PyLadiesDemo
# run the spider with name = "TaiFex"
[cmd] scrapy crawl TaiFex
See the official tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html
30. Demo
# sample code
# Good tool (for error detection: try/except)
[cmd] scrapy shell url_that_u_want_2_crawl
31. Exception handling + setting pdb breakpoints
# exception handling
def parse(self, response):
    try:
        ...
    except:
        # websites inevitably have exceptions
        ...
# setting a breakpoint
import pdb
pdb.set_trace()
# or, if you got the XPath wrong,
# you can fix it on the spot.
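A stdlib-only sketch of the same pattern: wrap the fragile parsing step in try/except and keep a pdb breakpoint handy for inspecting bad input. The field names (`strike`, `price`) are hypothetical examples, not the talk's actual schema:

```python
# Defensive parsing: malformed rows return None instead of crashing
# the whole crawl; pdb can be enabled to inspect them interactively.
import pdb

def parse_row(cells):
    """Parse one table row; websites inevitably have odd rows."""
    try:
        strike = int(cells[0])
        price = float(cells[1])
        return {"strike": strike, "price": price}
    except (ValueError, IndexError):
        # Uncomment to drop into the debugger on a bad row:
        # pdb.set_trace()
        return None

rows = [["9200", "103.5"], ["--", "n/a"]]
parsed = [parse_row(r) for r in rows]
print(parsed)  # [{'strike': 9200, 'price': 103.5}, None]
```

Catching specific exceptions (`ValueError`, `IndexError`) rather than a bare `except:` keeps genuine bugs visible while tolerating dirty data.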
33. #3 Daily options strike/price distribution chart
Find the "url pattern"      → Google Developer Tools
Crawl pages and scrape data → Python, Scrapy, XPath Helper
Store into DB               → Django -> MSSQL
Visualization: histogram    → Excel
Platform: Windows
35. Step 1
Find the "url pattern"
query string: http://www.taifex.com.tw/chinese/3/fcm_opt_rep.asp?commodity_id=TXO&commodity_idt=TXO&settlemon=201310W1&pccode=P&curpage=1&pagenum=1
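Once the pattern is known, the page URLs don't need to be discovered by following links: they can be generated by varying the query parameters. A sketch with the standard library, using the parameter values from the query string above (the helper name is made up):

```python
# Generate page URLs from the known "url pattern" by varying curpage.
from urllib.parse import urlencode

BASE = "http://www.taifex.com.tw/chinese/3/fcm_opt_rep.asp"

def page_url(page, settlemon="201310W1", pccode="P"):
    params = {
        "commodity_id": "TXO",
        "commodity_idt": "TXO",
        "settlemon": settlemon,
        "pccode": pccode,
        "curpage": page,
        "pagenum": 1,
    }
    return BASE + "?" + urlencode(params)

# One URL per result page:
urls = [page_url(p) for p in range(1, 4)]
print(urls[0])
```

These generated URLs are exactly what would go into a spider's `start_urls` or be yielded as new Requests.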
38. Step 2
Crawl pages and scrape data
What!! pip doesn't work !!!!!
The company blocks the site !!!
◢ ▆ ▅ ▄ ▃ melt ╰ (〒皿〒)╯ down ▃ ▄ ▅ ▇ ◣
(installer using .bat file)
39. Python is beautiful ...
Python crawler: 49 lines (ver. 1),
150 lines (final ver.)
v.s.
C# crawler: 700 lines
(and it could only crawl one page!!)
A decisive win!
51. Reference
# official website tutorial
www.scrapy.org
# Taipei.py: Thoen's slides
http://files.meetup.com/6816242/%28Pycon%20Taipei%29%20Scrapy-20130328.pdf
# Thanks Tim & c3h3 !