Web Scraping
         For Code-ophobes




@AnnieCushing
What I’m not




What I am




THE WIND BENEATH MY
          WEB-SCRAPING WINGS
       @djchrisle        @ethanlyon




3 WAYS TO SCRAPE IN GOOGLE DOCS




         • ImportFeed
         • ImportHTML
         • ImportXML



=ImportFeed


ImportFeed


 =ImportFeed(URL, query, headers, numItems)

=ImportFeed("http://feeds.searchengineland.com/searchengineland")

  OR

=ImportFeed(C4)   <- My preference
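
If you're curious what a function like ImportFeed is doing for you, here is a rough Python sketch. It is only an illustration, not how Google implements it: the sample feed, the `import_feed` name, and its `num_items` parameter are all made up for this example, and the feed is embedded in the code instead of being fetched from a URL.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real feed would be fetched from a URL.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>Post one</title><link>http://example.com/1</link></item>
    <item><title>Post two</title><link>http://example.com/2</link></item>
    <item><title>Post three</title><link>http://example.com/3</link></item>
  </channel>
</rss>"""


def import_feed(rss_text, num_items=20):
    """Return up to num_items (title, link) rows from an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    rows = []
    # Each <item> under <channel> is one post in the feed.
    for item in root.findall("./channel/item")[:num_items]:
        rows.append((item.findtext("title"), item.findtext("link")))
    return rows


rows = import_feed(SAMPLE_RSS, num_items=2)
for title, link in rows:
    print(title, link)
```

The `num_items` argument plays the same role as numItems in the formula: it caps how many posts come back.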




http://bit.ly/importfeed
STALKING FOR LINKS



                BY @WILREYNOLDS
http://slidesha.re/stalker-wil
=ImportHTML


ImportHTML


                TWO OPTIONS

         • Table
         • List



=ImportHTML(URL, query, index)

   URL: "http://www.domain.com/whatever" OR cell reference
   query: "table" or "list" OR cell reference
   index: If multiple lists or tables, which one (3 = 3rd table)
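
To make the index argument concrete, here is a small Python sketch that collects every table on a page and hands back the one you ask for, counting from 1. It is a stand-in for the idea, not Google's implementation: the sample HTML, the `TableCollector` class, and the `import_html_table` function are invented for this example, and it ignores nested tables.

```python
from html.parser import HTMLParser

# Hypothetical page with two tables.
SAMPLE_HTML = """
<table><tr><td>first</td><td>table</td></tr></table>
<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>France</td><td>Paris</td></tr>
</table>
"""


class TableCollector(HTMLParser):
    """Collect every (non-nested) table on a page as a list of rows."""

    def __init__(self):
        super().__init__()
        self.tables = []       # one list of rows per table
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self.tables[-1][-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.tables[-1][-1][-1] += data.strip()


def import_html_table(html_text, index):
    """Return the index-th table on the page (1 = first table)."""
    parser = TableCollector()
    parser.feed(html_text)
    return parser.tables[index - 1]


second_table = import_html_table(SAMPLE_HTML, 2)
print(second_table)
```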

Table Example of ImportHTML




List Example of ImportHTML




=ImportXML


ImportXML


            =ImportXML(URL, query)




http://bit.ly/xpath-tutorial
Simple Explanation of XPath




    XPath uses path expressions to select
    nodes or node-sets in an XML
    document.




7 Types of Nodes




Simple Explanation of XPath


                ELEMENTS
    <div>
    <p>
    <blockquote>
    <price>
    <ul>

PARENT-CHILD NODES
    • As you drill down, you separate nodes
      with /
    • Ex: /html/body/div/ul/li/a
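
Here is the drill-down idea as a small Python sketch, using the XPath subset built into Python's standard library. The page is a made-up sample, and it includes a `body` step because in a real page the div lives inside body.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, kept well-formed so ElementTree can parse it.
PAGE = """<html>
  <body>
    <div>
      <ul>
        <li><a href="/one">One</a></li>
        <li><a href="/two">Two</a></li>
      </ul>
    </div>
  </body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# ElementTree paths are relative to the element you start from, so
# "./body/div/ul/li/a" here plays the role of /html/body/div/ul/li/a.
links = root.findall("./body/div/ul/li/a")
link_texts = [a.text for a in links]
print(link_texts)
```

Each / moves one level down, from parent to child, until you land on the links.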



ATTRIBUTES
    class
    id
    size

          Look for the = sign

Simple Explanation of XPath


         KEY CHARACTERS
      /: Starts at the root
     //: Starts wherever (matches at any depth)
      @: Selects attributes
     []: Answers the question “Which one?”
      *: All (wildcard; matches any element)

Let’s Start Simple




Magic!




Grab the URLs




Because it’s an @tribute!
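
The two queries from these slides are, roughly, `//a` for the anchors and `//a/@href` for the URLs. Here they are sketched in Python on a made-up page. One assumption to flag: Python's built-in ElementTree understands `.//a` but not the `/@href` step, so the sketch reads the attribute with `.get("href")` instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with two links.
PAGE = """<html><body>
  <p>Read <a href="http://example.com/a">the first post</a> and
     <a href="http://example.com/b">the second post</a>.</p>
</body></html>"""

root = ET.fromstring(PAGE)

# Equivalent of the XPath //a : every <a> element, at any depth.
anchors = [a.text for a in root.findall(".//a")]

# Equivalent of //a/@href : the href attribute of each of those elements.
urls = [a.get("href") for a in root.findall(".//a")]

print(anchors)
print(urls)
```

The link text lives between the tags, but the URL sits inside the tag as an attribute, which is why it needs the @.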




Let’s dial it up




http://bit.ly/distilled-xml
Let’s dial it up




Could do it this way




At your own risk




Better plan




The world according to Annie




    // = blah blah yada yada



Can even be in the middle of the XPath




  //div[@class='main']//blockquote[2]
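
A quick way to see what that expression picks out is to run it against a tiny made-up page. The sketch below uses Python's built-in ElementTree, which needs a leading dot (`.//`) because its paths are relative to the starting element; the rest of the expression is unchanged.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: one sidebar div, one main div with two blockquotes.
PAGE = """<html><body>
  <div class="sidebar"><blockquote>Ignore me</blockquote></div>
  <div class="main">
    <section>
      <blockquote>First quote</blockquote>
      <blockquote>Second quote</blockquote>
    </section>
  </div>
</body></html>"""

root = ET.fromstring(PAGE)

# Find the div with class "main" anywhere, then the 2nd blockquote
# anywhere beneath it (2nd among its sibling blockquotes).
matches = root.findall(".//div[@class='main']//blockquote[2]")
quotes = [bq.text for bq in matches]
print(quotes)
```

The second `//` skips over the section wrapper, so you don't have to spell out every level between the div and the blockquote.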




Other ways to tell “which one” in XPath

                STARTS-WITH




Other ways to tell “which one” in XPath

                CONTAINS
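
In XPath, these two tests would look something like `//a[starts-with(@href, 'http://example.com/')]` and `//a[contains(text(), 'iPad')]`. Those expressions and the sample page below are illustrative, not taken from the original slides. Python's built-in ElementTree doesn't implement starts-with() or contains(), so this sketch applies the same tests with ordinary string methods.

```python
import xml.etree.ElementTree as ET

# Hypothetical listings page.
PAGE = """<html><body><ul>
  <li><a href="http://example.com/ipad-2">iPad 2 for sale</a></li>
  <li><a href="http://example.com/couch">Couch, barely used</a></li>
  <li><a href="http://other.example.org/ipad">Cheap iPad case</a></li>
</ul></body></html>"""

root = ET.fromstring(PAGE)
links = root.findall(".//a")

# Like //a[starts-with(@href, 'http://example.com/')]
on_site = [a.get("href") for a in links
           if a.get("href", "").startswith("http://example.com/")]

# Like //a[contains(text(), 'iPad')]
ipad_links = [a.get("href") for a in links if "iPad" in (a.text or "")]

print(on_site)
print(ipad_links)
```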




Other ways to tell “which one” in XPath




Other ways to tell “which one” in XPath

                INDEX VALUE




Other ways to tell “which one” in XPath

                    LAST()
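
Index values and last() are easy to try out in Python, since the built-in ElementTree supports both. The list below is a made-up sample.

```python
import xml.etree.ElementTree as ET

# Hypothetical ordered list.
PAGE = """<html><body><ol>
  <li>Gold</li>
  <li>Silver</li>
  <li>Bronze</li>
</ol></body></html>"""

root = ET.fromstring(PAGE)

# Index value: XPath counts from 1, so [2] is the second item.
second = root.find(".//li[2]").text

# last(): whichever item comes last, however long the list is.
last = root.find(".//li[last()]").text

print(second, last)
```

Use an index when you know the position up front, and last() when the list length can change.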




Become a scraping FOOL

   •   Pull queries from Topsy
   •   Pull product feeds
   •   Pull specific elements from a sitemap
   •   Scrape Twitter followers
   •   Pull GA metrics
   •   Scrape HTML tables (e.g., list of countries from Wikipedia)
   •   Scrape lists (e.g., scraped lists of consumer review sites to create a custom
       search engine, top sports blogs, etc.)
   •   Scrape rankings
   •   Scrape GA codes / Adsense IDs / IPs / IP Country Codes
   •   Find de-indexed sites
   •   Scrape directories
   •   Scrape Yahoo / Google for relevant pages from directory listings
   •   Scrape titles / h1s / meta descriptions
   •   Scrape page URLs to find if someone is linking to you
   •   Scrape Google to find snippets of text on a list of domains (for link networks)
   •   Scrape Quora


@AnnieCushing                                                               @NicoMiceli
SEE IMPORT FUNCTIONS IN
    THEIR NATURAL HABITAT!
http://bit.ly/annies-gdoc
AWWW YEAHHH!




TO PLAY …

   1. Log in
   2. File > Make a copy…
   3. Poke around and test




RESOURCES

XPath Tutorial: http://bit.ly/xpath-tutorial
Annie’s Gdoc: http://bit.ly/annies-gdoc
Distilled Guide: http://bit.ly/distilled-guide
SEER Cookbook: http://bit.ly/seer-cookbook



Editor's Notes

  1. I’m a data wrangler. I collect and drill through data like it’s my job. Because it kind of is. But I found that since coming to SEER my need for data collection at times surpassed what I could get in tools. So I turned to Gdocs and its ability to scrape.
  2. In order of complexity
  3. I always prefer to chop my Import functions into cells. Easier to troubleshoot and modify. And you don’t have to worry about quotation marks b/c you don’t need them. When you get your web feet you can start getting tricky w/ the optional arguments.
  4. To learn more about scraping feeds all over the place, check out Wil’s preso. That wasn’t the original graphic. But you’ll see why it’s fitting by the time you get to the end. Point out URL.
  5. Every once in a while the index is 0-based. Honestly, if there are multiple tables (like Wikipedia pages), I just guess and change the number until it pulls the data I need.
  6. Basically, anything that’s in a table or bulleted list you can scrape. I recently pulled together a CSE (custom search engine) of review sites. And I used ImportHTML quite a bit – to scrape both lists and tables.
  7. We’re entering the deep end of the scraping pool.
  8. Okay, so ImportXML uses XPath. And here’s everything you need to know about XPath …
  9. Yeah, I have no idea what that really means, and I suffer from a deplorable lack of curiosity.
  10. I’ll be showing one example of the text node that I actually used when scraping Craigslist once. (Don’t judge.)
  11. If it’s inside angle brackets, it’s an element.
  12. If it has an = sign inside brackets, that’s an attribute.
  13. @ … Attribute. Square brackets: which one? Ryan O and F.
  14. We have this page of content from Barry Schwartz’s blog. Let’s say we want to scrape all of the anchors (the text part of a link). We would write something like this in Google Docs …
  15. This basically means scrape all the anchors!
  16. Now if you want to also scrape the URLs, you add /@href. And why do you need the @ before href? …
  17. Don’t believe me? Check it …
  18. Okay, it’s rare that your XPath is going to be that simple. I stole this from Distilled’s Import XML Guide for Google Docs. Point out the link.
  19. When I first started scraping I’d look at the code and try to figure out the hierarchy judging by the indentation. But sometimes your child nodes can look like this …
  20. And then it gets tricky!
  21. Eventually I figured out that I could just use the bar at the bottom b/c it shows the actual hierarchy.
  22. So you could be precise and write out the XPath from the root on down the food chain. This says, “Start at the HTML element, then drill down …”
  23. But you’ll look like a dork.
  24. So instead what the cool kids do is just use the double slash and grab the div you want. You just need as much detail as it takes to get that list.
  25. You can even use it in the middle of your XPath.
  26. The more complex your scraping requirement is, the more complex your XPath becomes. So another way to tell “which one” is with starts-with in the predicate.
  27. Here I wanted to see if I could scrape all the iPad links, then use that scraped URL as a reference point to scrape the email address on that page. You’ve heard of Will It Blend? I’ve been playing my own game of Will It Scrape?
  28. This is where I used the text b/c I only wanted links that had iPad in the anchor.
  29. This is a compiled list from Nico, Ethan, Chris, and Wil. Give Nico a shout-out!
  30. Point out link.