2. hi!
I’m a data scientist in the Nordstrom Data Lab. I’ve built
scrapers to monitor the product catalogs of various
sports retailers.
3. Getting data can be hard
Despite the open-data movement and popularity
of APIs, volumes of data are locked up in DOMs all
over the internet.
4. Monitoring
competitor prices
• As a retailer, I want to strategically set prices in
relation to my competitors.
• But they aren’t interested in sharing their prices and
mark-down strategies with me. 😭
5. • “Scrapy is an application framework for crawling web
sites and extracting structured data which can be
used for a wide range of useful applications, like data
mining, information processing or historical archival.”
• scrapin’ on rails!
13. Spider design
Spiders have two primary components:
1. Crawling (navigation) instructions
2. Parsing instructions
14. Define the crawl behavior
in spiders/backcountry.py
After spending some time on backcountry.com, I decided the all brands
landing page was the best starting URL.
20. # Paginate!
    for page in more_pages:
        next_page = str(self.base_url + page)
        yield scrapy.Request(url=next_page,
                             callback=self.parse_product_pages)

    for product in product_pages:
        product_url = str(self.base_url + product)

        yield scrapy.Request(url=product_url,
                             callback=self.parse_item)
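The `self.base_url + page` concatenation above works when every extracted link is a path starting at the site root. A minimal sketch of a more forgiving way to build the same URLs uses the standard library's `urllib.parse.urljoin`, which also handles relative and already-absolute links. The `base_url` value, the example hrefs, and the `absolutize` helper are hypothetical and not taken from the spider.

```python
from urllib.parse import urljoin

# Hypothetical base URL and extracted hrefs, for illustration only.
base_url = "http://www.example.com"
hrefs = ["/brand-a?page=2", "brand-b", "http://www.example.com/brand-c"]


def absolutize(base, links):
    """Resolve each extracted href against the base URL."""
    # The trailing slash makes relative links resolve under the site root.
    return [urljoin(base + "/", link) for link in links]


absolute_urls = absolutize(base_url, hrefs)
```

Inside a spider, each resulting URL could be passed to `scrapy.Request` exactly as in the loop above. Recent Scrapy versions also provide `response.urljoin`, which resolves against the current page's URL.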
21. def parse_item(self, response):

        item = Product()
        dirty_data = {}

        # Each XPath query returns a list of matching text nodes.
        dirty_data['product_title'] = response.xpath("//*[@id='product-buy-box']/div/div[1]/h1/text()").extract()
        dirty_data['description'] = response.xpath("//div[@class='product-description']/text()").extract()
        dirty_data['price'] = response.xpath("//span[@itemprop='price']/text()").extract()

        # Clean only the fields that matched something.
        for variable in dirty_data.keys():
            if dirty_data[variable]:
                if variable == 'price':
                    item[variable] = float(''.join(dirty_data[variable]).strip().replace('$', '').replace(',', ''))
                else:
                    item[variable] = ''.join(dirty_data[variable]).strip()

        yield item
Part II: Parsing
22. for variable in dirty_data.keys():
        if dirty_data[variable]:
            if variable == 'price':
                item[variable] = float(''.join(dirty_data[variable]).strip().replace('$', '').replace(',', ''))
            else:
                item[variable] = ''.join(dirty_data[variable]).strip()
Part II: Clean it now!
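The cleaning step can also be tried outside of Scrapy. The sketch below mirrors the loop above on a plain dictionary: it joins the extracted fragments, strips whitespace, and converts the price to a float. The `clean_field` helper and the sample values are hypothetical, chosen only to resemble the lists that `.extract()` returns.

```python
def clean_field(name, values):
    """Join extracted text fragments and strip whitespace; parse prices as floats."""
    text = "".join(values).strip()
    if name == "price":
        # Drop the currency symbol and thousands separators before converting.
        return float(text.replace("$", "").replace(",", ""))
    return text


# Hypothetical raw values shaped like the lists .extract() returns.
dirty_data = {
    "product_title": ["\n  Example Jacket  "],
    "description": [],
    "price": [" $1,299.95 "],
}

# Skip fields whose XPath matched nothing (empty lists are falsy).
clean = {
    name: clean_field(name, values)
    for name, values in dirty_data.items()
    if values
}
```

As in the spider, a field with no match is simply left out rather than stored as an empty string.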
27. –Monica Rogati, VP of Data at Jawbone
“Data wrangling is a huge — and
surprisingly so — part of the job. It’s
something that is not appreciated by data
civilians. At times, it feels like everything
we do.”
29. Bring your projects to hacknight!
http://www.meetup.com/Seattle-PyLadies
Ladies!!
30. Thursday, January 29th, 6PM
Intro to iPython and Matplotlib
Ada Developers Academy
1301 5th Avenue #1350