2. Who am I ?
Yen ( 顏嘉儀 )
Second-year master's student, NTU Economics
Python, R user
3. Why I am here ?
1
"I heard.. during your summer internship... you wrote some whatchamacallit in Python... right...."
2
"What's going on? (on guard = =+)"
3
"Then come share it at PyLadies (gazing into the distance)"
4. Take my internship experience
as an example
- Crawler: What would happen
if you were a Python programmer
in the financial community?
20. Simply put, it's just scraping data~
Windows platform: C# ?
[ C# ] Arachnode.net ? (not free)
R ? (slower)
Try the Python solution: Scrapy
21. Scrapy
(official description)
Scrapy is a web crawling framework,
used to crawl websites and extract
structured data from their pages.
A package that lets you develop web crawlers quickly.
How fast? And why is it fast?
24. XPath Parser
Scrapy is a web crawling framework,
used to crawl websites and extract
structured data from their pages.
# Regular expressions:
every character is treated the same
# Alternative: XPath
an HTML doc can be treated as
structured data
25. XPath is like an "address"
# C://Python27
# html/body/div[@class="wrapper"]/div[@class="header.clearfix"]/h1[@class="logo"]/a
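The "address" idea can be tried without Scrapy: Python's standard-library `xml.etree.ElementTree` understands a subset of XPath. A minimal sketch (the HTML snippet below is made up to mirror the path above):

```python
# XPath as an "address": walk from the document root down to one element.
# Uses only the standard library; the snippet is a toy example.
import xml.etree.ElementTree as ET

html = """
<html><body>
  <div class="wrapper">
    <div class="header clearfix">
      <h1 class="logo"><a href="/">PyLadies</a></h1>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(html)  # root is the <html> element
# Follow the address: body / div[@class='wrapper'] / div / h1 / a
link = root.find("./body/div[@class='wrapper']/div/h1/a")
print(link.text)         # PyLadies
print(link.get("href"))  # /
```

Real pages are rarely valid XML, which is why Scrapy ships its own selector objects, but the addressing idea is the same.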
28. import scrapy  # the tools Scrapy provides
Pseudo Code
class MySpider( the Spider class Scrapy already wrote ):
    name = "spider name"
    start_urls = [ initial request (URL) ]  # i.e. the top-level root
    def parse(self, first_response):
        # the name of this first parse function is fixed!!
        hxs = HtmlXPathSelector( first_response )
        # html response => XPath-structured object
        Xpath = "rule for which URLs to crawl"
        extracted_data = hxs.select(Xpath).extract()
        yield Request(url=a crawled url, callback=self.next_parser)
    def next_parser(self, second_response):
        # this function name is whatever you like
        hxs = HtmlXPathSelector(second_response)
        Xpath = "rule for which data to scrape"
        extracted_data = hxs.select(Xpath).extract()
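The Request/callback flow in the pseudocode can be simulated in plain Python to see how Scrapy chains parse functions. Everything below (the fake pages, the mini scheduler) is a toy stand-in, not Scrapy's real engine:

```python
# Simulate Scrapy's callback chaining: each parse step yields "requests"
# naming the next callback, and a small loop plays the scheduler.
from collections import deque

class Request:
    def __init__(self, url, callback):
        self.url, self.callback = url, callback

# Fake "responses": url -> (links found on the page, data on the page)
PAGES = {
    "root":  (["page1", "page2"], []),
    "page1": ([], ["row-a"]),
    "page2": ([], ["row-b"]),
}

def parse(url):          # first callback: extract URLs to follow
    links, _ = PAGES[url]
    for link in links:
        yield Request(link, next_parser)

def next_parser(url):    # second callback: extract the actual data
    _, data = PAGES[url]
    yield from data

def crawl(start):
    queue, items = deque([Request(start, parse)]), []
    while queue:
        req = queue.popleft()
        for out in req.callback(req.url):
            (queue if isinstance(out, Request) else items).append(out)
    return items

print(crawl("root"))  # ['row-a', 'row-b']
```

The point: `parse` never fetches anything itself; it only declares what to visit next and who handles it, which is what `yield Request(url, callback=...)` expresses in a real spider.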
29. cmd commands
# create a Scrapy project
[cmd] scrapy startproject PyLadiesDemo
# run the spider with name = "TaiFex"
[cmd] scrapy crawl TaiFex
See the official tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html
30. Demo
# sample code
# Good tool (for error detection: try/except)
[cmd] scrapy shell url_that_u_want_2_crawl
31. Exception handling + setting pdb breakpoints
# exception handling
def parse(self, response):
    try:
        ...
    except:
        # websites inevitably have exceptions
        ...
# setting a breakpoint
import pdb
pdb.set_trace()
# or, if you got the XPath wrong,
# you can fix it on the spot.
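A stdlib-only sketch of the same pattern: wrap the fragile parsing step in try/except and keep a pdb breakpoint handy for inspecting bad input. The field names (`strike`, `price`) are hypothetical examples, not the talk's actual schema:

```python
# Defensive parsing: malformed rows return None instead of crashing
# the whole crawl; pdb can be enabled to inspect them interactively.
import pdb

def parse_row(cells):
    """Parse one table row; websites inevitably have odd rows."""
    try:
        strike = int(cells[0])
        price = float(cells[1])
        return {"strike": strike, "price": price}
    except (ValueError, IndexError):
        # Uncomment to drop into the debugger on a bad row:
        # pdb.set_trace()
        return None

rows = [["9200", "103.5"], ["--", "n/a"]]
parsed = [parse_row(r) for r in rows]
print(parsed)  # [{'strike': 9200, 'price': 103.5}, None]
```

Catching specific exceptions (`ValueError`, `IndexError`) rather than a bare `except:` keeps genuine bugs visible while tolerating dirty data.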
33. #3 Daily options strike/price distribution chart
Find the "url pattern"      → Google Developer Tools
Crawl pages and scrape data → Python, Scrapy, XPath Helper
Store into DB               → Django -> MSSQL
Visualization: histogram    → Excel
Platform: Windows
35. Step 1
Find the "url pattern"
query string: http://www.taifex.com.tw/chinese/3/fcm_opt_rep.asp?commodity_id=TXO&commodity_idt=TXO&settlemon=201310W1&pccode=P&curpage=1&pagenum=1
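Once the pattern is known, the page URLs don't need to be discovered by following links: they can be generated by varying the query parameters. A sketch with the standard library, using the parameter values from the query string above (the helper name is made up):

```python
# Generate page URLs from the known "url pattern" by varying curpage.
from urllib.parse import urlencode

BASE = "http://www.taifex.com.tw/chinese/3/fcm_opt_rep.asp"

def page_url(page, settlemon="201310W1", pccode="P"):
    params = {
        "commodity_id": "TXO",
        "commodity_idt": "TXO",
        "settlemon": settlemon,
        "pccode": pccode,
        "curpage": page,
        "pagenum": 1,
    }
    return BASE + "?" + urlencode(params)

# One URL per result page:
urls = [page_url(p) for p in range(1, 4)]
print(urls[0])
```

These generated URLs are exactly what would go into a spider's `start_urls` or be yielded as new Requests.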
38. Step 2
Crawl pages and scrape data
What!! pip doesn't work !!!!!
The company blocks the site !!!
◢ ▆ ▅ ▄ ▃ melt ╰ (〒皿〒)╯ down ▃ ▄ ▅ ▇ ◣
(installer using .bat file)
39. Python is beautiful ...
Python crawler: 49 lines (ver. 1),
150 lines (final ver.)
v.s.
C# crawler: 700 lines
(and it could only crawl one page!!)
A decisive win!
51. Reference
# official website tutorial
www.scrapy.org
# Taipei.py: Thoen's slides
http://files.meetup.com/6816242/%28Pycon%20Taipei%29%20Scrapy-20130328.pdf
# Thanks Tim & c3h3 !