We help you get web data hassle-free. This deck introduces the use cases most beneficial to finance companies and to anyone looking to scale revenue using web data.
3. About Scrapinghub
Scrapinghub specializes in data extraction. Our platform is
used to scrape over 4 billion web pages a month.
We offer:
● Professional Services to handle the web scraping for you
● Off-the-shelf datasets so you can get data hassle-free
● A cloud-based platform that makes scraping a breeze
4. Founded in 2010, largest 100% remote company based outside of the US
We’re 134 teammates in 48 countries
5. “Getting information off the
Internet is like taking a drink
from a fire hydrant.”
– Mitchell Kapor
6. Scrapy
Scrapy is a web scraping framework that
gets the dirty work related to web crawling
out of your way.
Benefits
● No platform lock-in: Open Source
● Very popular (13k+ ★)
● Battle tested
● Highly extensible
● Great documentation
7. Portia
Portia is a Visual Scraping tool that lets you
get data without needing to write code.
Benefits
● No platform lock-in: Open Source
● Handles dynamic content generated by JavaScript
● Ideal for non-developers
● Extensible
● It’s as easy as annotating a page
9. Large Scale Infrastructure
Meet Scrapy Cloud, our PaaS for web crawlers:
● Scalable: Crawlers run on EC2 instances or dedicated servers
● Crawlera add-on
● Control your spiders: Command line, API or web UI
● Machine learning integration: BigML, MonkeyLearn
● No lock-in: scrapyd to run Scrapy spiders on your own infrastructure
10. Broad Crawls
Frontera allows us to build large scale web crawlers in Python:
● Scrapy support out of the box
● Distribute and scale custom web crawlers across servers
● Crawl Frontier Framework: large scale URL prioritization logic
● Aduana to prioritize URLs based on link analysis (PageRank, HITS)
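The URL prioritization idea behind a crawl frontier can be illustrated with a minimal sketch. This is not Frontera's or Aduana's API: the `Frontier` class, the example URLs, and the scores are all hypothetical, and each score simply stands in for a link-analysis value such as PageRank.

```python
import heapq
import itertools


class Frontier:
    """Toy crawl frontier: hands out the highest-scored unseen URL first."""

    def __init__(self):
        self._heap = []                   # entries: (-score, insertion order, url)
        self._seen = set()                # URLs already scheduled
        self._order = itertools.count()   # tie-breaker keeps insertion order

    def add(self, url, score):
        """Schedule a URL once; repeated URLs are ignored."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so negate the score to pop the largest first.
        heapq.heappush(self._heap, (-score, next(self._order), url))
        return True

    def next_url(self):
        """Return the next URL to crawl, or None when the frontier is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


frontier = Frontier()
frontier.add("https://example.com/", 0.9)
frontier.add("https://example.com/about", 0.2)
frontier.add("https://example.com/products", 0.7)
frontier.add("https://example.com/", 0.1)   # duplicate, ignored

# Drain the frontier to see the order in which pages would be crawled.
crawl_order = []
while (url := frontier.next_url()) is not None:
    crawl_order.append(url)
```

A single in-memory heap like this only works on one machine; distributing the frontier across servers needs a shared queue and a shared record of seen URLs.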
12. Competitive Pricing
Companies use web scraping to monitor the
pricing and the ratings of competitors:
● Scrape online retailers
● Structure the data in a search engine or DB
● Create an interface to search for products
● Sentiment analysis for product rankings
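The middle two steps above (structuring scraped data in a DB and searching it) might look like the following sketch. It assumes the scraping has already happened: the retailer names, products, prices, and ratings are made up, the `offers` table layout is hypothetical, and an in-memory SQLite database stands in for whatever store is actually used. Sentiment analysis is left out.

```python
import sqlite3

# Hypothetical scraped records: (retailer, product, price, rating)
scraped = [
    ("shop-a", "USB-C cable 1m", 7.99, 4.5),
    ("shop-b", "USB-C cable 1m", 6.49, 4.1),
    ("shop-a", "Wireless mouse", 19.90, 4.7),
    ("shop-c", "Wireless mouse", 21.50, 4.4),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE offers (retailer TEXT, product TEXT, price REAL, rating REAL)"
)
conn.executemany("INSERT INTO offers VALUES (?, ?, ?, ?)", scraped)


def search(conn, term):
    """Return offers whose product name contains `term`, cheapest first."""
    rows = conn.execute(
        "SELECT retailer, product, price, rating FROM offers "
        "WHERE product LIKE ? ORDER BY price ASC",
        (f"%{term}%",),
    )
    return rows.fetchall()


def cheapest(conn, term):
    """Return the lowest-priced matching offer, or None if nothing matches."""
    results = search(conn, term)
    return results[0] if results else None


mouse_offers = search(conn, "mouse")
best_cable = cheapest(conn, "cable")
```

Sorting by price in the query keeps the search interface simple: the first row is always the cheapest competitor for that product.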
13. Monitor Resellers
We help a leading IT manufacturer monitor the activities of their
resellers:
● Tracking and watching out for stolen goods
● Pricing agreement violations
● Customer support responses on complaints
● Product line quality checks
14. Lead Generation
Mine scraped data to identify who to target in a company for your
outbound sales campaigns:
● Locate possible leads in your target market
● Identify the right contacts within each one
● Augment the information you already have on them
15. Real Estate
Crawl property websites and use the data obtained in order to:
● Estimate house prices
● Estimate rental values
● Track housing stock movements
● Give insight into real estate agents and homeowners
16. Fraud Detection
Monitor for sellers that offer products violating the ToS of credit card
companies including:
● Drugs
● Weapons
● Gambling
Identify stolen cards and IDs on the Dark Web:
● Forums where hackers share ID numbers and PINs
17. Company Reputation
Sentiment analysis of a company or product through newsletters, social
networks and other natural language data sources.
● NLP to create an associated sentiment indicator.
● Tracking the relevant news that supports the indicator can lead to
market insights into long-term trends.
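As a rough illustration of a sentiment indicator, the sketch below scores texts against a tiny word list and averages the results. Everything here is an assumption for illustration: the `POSITIVE` and `NEGATIVE` lexicons, the "Acme" headlines, and the scoring formula. A real pipeline would likely use a trained NLP model rather than keyword counts.

```python
import re

# Hypothetical toy lexicon; a production system would use a trained model.
POSITIVE = {"growth", "profit", "strong", "record", "upgrade"}
NEGATIVE = {"lawsuit", "recall", "loss", "fraud", "downgrade"}


def score(text):
    """Score one text in [-1, 1]: (positives - negatives) / matched words."""
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    # Texts with no lexicon words are treated as neutral.
    return 0.0 if total == 0 else (pos - neg) / total


def indicator(texts):
    """Average the per-text scores into a single sentiment indicator."""
    scores = [score(t) for t in texts]
    return sum(scores) / len(scores) if scores else 0.0


headlines = [
    "Acme posts record profit on strong demand",
    "Acme faces lawsuit despite sales growth",
    "Acme opens new office",
]
daily_indicator = indicator(headlines)
```

Computing the indicator per day and keeping the headlines that drove each value makes it possible to trace a shift in the indicator back to the news behind it.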
18. Consumer Behavior
Extract data from forums and websites like Reddit to evaluate consumer
reviews and commentary:
● Volume of comments across brands
● Topics of discussion
● Comparisons with other brands and products
● Evaluate product launches and marketing tactics
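Two of the measures above, comment volume per brand and brand-to-brand comparisons, can be sketched with simple counting. The brand names and comments below are invented, and the helper functions are hypothetical. Matching is a plain case-insensitive substring test, which is an assumption that would need refining for real brand names.

```python
from collections import Counter

# Hypothetical brand names and forum comments, for illustration only.
BRANDS = ("acme", "globex", "initech")

comments = [
    "Switched from Acme to Globex and never looked back",
    "Acme support was great this week",
    "Is the new Initech launch worth it?",
    "Globex battery life beats Acme",
]


def brand_volume(comments, brands):
    """Count how many comments mention each brand (case-insensitive)."""
    counts = Counter({b: 0 for b in brands})
    for text in comments:
        lowered = text.lower()
        for brand in brands:
            if brand in lowered:
                counts[brand] += 1
    return counts


def co_mentions(comments, brands):
    """Count comments that mention two brands together (a comparison signal)."""
    pairs = Counter()
    for text in comments:
        lowered = text.lower()
        present = sorted(b for b in brands if b in lowered)
        for i, first in enumerate(present):
            for second in present[i + 1:]:
                pairs[(first, second)] += 1
    return pairs


volume = brand_volume(comments, BRANDS)
comparisons = co_mentions(comments, BRANDS)
```

Substring matching will over-count brands whose names appear inside other words, so a tokenizer or entity recognizer is a likely next step.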
19. Tracking Legislation
Monitor bills and regulations that are being discussed in Congress. Access
court judgments and opinions in order to:
● Follow discussions
● Try to forecast legislative outcomes
● Track regulations that impact different economic sectors
20. Hiring
Crawl and extract data from job boards and other
sources in order to:
● Understand hiring trends in different sectors or regions
● Find candidates for jobs, or future leaders
● Spot and retain employees who are shopping
for a new job
21. Monitoring Corruption
Journalists and analysts can create Open Data by extracting information
from difficult-to-access government websites:
● Track the activities of lobbyists
● Identify patterns in the behavior of government officials
● Detect disruptions in the economy due to corruption allegations