Don't scrape, Glean!
tommorris
Lacks the demo part, alas, but it's the slides I used
1.
2.
Scraping sucks.
3.
def lastlogin
  # Take the tenth <br />-separated chunk of the cell and keep its last 10 characters (the date).
  date = (@hmodel / "//td[@class='text'][@width='193']").first.innerHTML.split("<br />")[9].strip[-10..-1]
  return date[-4..-1] + "-" + date[-7..-6] + "-" + date[-10..-9]
end
4.
Hpricot for ‘Last login’ date on MySpace.
5.
try :
lastlogin = self.soup.findAll( True , { "width" : "193" })[ 0 ].br.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.string loginregex = re.compile( r " [0-9] / [0-9] +/ [0-9]* ") loginregex_inst = loginregex.search(lastlogin) if loginregex_inst is not None : self.lastlogin = loginregex_inst.group() except : pass pass pass pass pass pass pass pass
6.
Taken from a Python/BeautifulSoup library.
7.
(The Ruby is prettier, but who’s counting?)
8.
getElementsByClassName("foo")[0].children
9.
It’s an edge case. MySpace’s HTML is worse than average.
10.
But it is an ugly recipe for mental turmoil.
11.
The alternative?
12.
flickr.getPhotos()
13.
And you get back nice XML or JSON (or even SOAP!)
14.
But ‘D.R.Y.’! APIs break that principle.
15.
This is the data equivalent of the ‘accessible version’.
16.
Enter GRDDL.
17.
GRDDL defines a transformation process for XHTML » RDF.
18.
XHTML? That’s what the spec says.
19.
HTML 4 works too. Tidy!
20.
RDF? Yes. Trust me. It’s not evil.
21.
GRDDL can work like a data stylesheet on top of your HTML.
22.
You simply use HTML (or XML) in the normal way...
23.
...and define how the data transformation works.
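A minimal sketch of how a client might discover which transformation a page declares, assuming the page is well-formed XHTML. GRDDL marks a page with the profile URI http://www.w3.org/2003/g/data-view on its head element and points at a transformation with rel="transformation"; the function name, the sample page, and the example.org stylesheet URL are made up for illustration. The sketch only finds the transformation, it does not run it.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


def find_transformations(xhtml_source):
    """Return the href of each rel="transformation" link in a GRDDL-enabled page."""
    root = ET.fromstring(xhtml_source)
    head = root.find(f"{{{XHTML_NS}}}head")
    if head is None:
        return []
    # The profile attribute is a space-separated list of URIs.
    if GRDDL_PROFILE not in head.get("profile", "").split():
        return []
    wanted_tags = {f"{{{XHTML_NS}}}link", f"{{{XHTML_NS}}}a"}
    hrefs = []
    for element in root.iter():
        if element.tag not in wanted_tags:
            continue
        # rel is also a space-separated list, so check for the token.
        if "transformation" in element.get("rel", "").split() and element.get("href"):
            hrefs.append(element.get("href"))
    return hrefs


# Hypothetical page: the stylesheet URL is a placeholder, not a real resource.
SAMPLE_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Links</title>
    <link rel="transformation" href="http://example.org/nsfw2rdf.xsl" />
  </head>
  <body><p><a href="http://example.org/" class="nsfw">a link</a></p></body>
</html>"""

if __name__ == "__main__":
    print(find_transformations(SAMPLE_PAGE))
```

A real client would then fetch each returned stylesheet and apply it to the page to get RDF.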
24.
You can even use it as a bridge for existing APIs and services.
25.
Could even be used for formats other than RDF. Atom?
26.
Simple example: ‘Not Safe For Work’
27.
<a href="http://tubgirl.com" class="nsfw">
28.
I can write that. I can’t write xFolk by hand.
29.
Is ‘nsfw’ a good class name? No.
30.
Do I care? No.
31.
The data layer becomes separated, like CSS is from HTML.
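To make the ‘nsfw’ idea concrete, here is a rough sketch of the gleaning step in Python. GRDDL itself would normally do this with an XSLT transformation; this stand-in uses the standard-library HTML parser instead. The class name NsfwGleaner, the function glean_nsfw, and the example.org vocabulary URI are all invented for illustration. It turns each link marked class="nsfw" into one N-Triples statement.

```python
from html.parser import HTMLParser

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Hypothetical vocabulary term, invented for this sketch.
NSFW_CLASS = "http://example.org/vocab#NotSafeForWork"


class NsfwGleaner(HTMLParser):
    """Collect one N-Triples statement per <a class="nsfw" href="...">."""

    def __init__(self):
        super().__init__()
        self.triples = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # class is a space-separated token list; href may carry stray whitespace.
        classes = (attributes.get("class") or "").split()
        href = (attributes.get("href") or "").strip()
        if "nsfw" in classes and href:
            self.triples.append(f"<{href}> <{RDF_TYPE}> <{NSFW_CLASS}> .")


def glean_nsfw(html):
    """Return the triples gleaned from an HTML string."""
    gleaner = NsfwGleaner()
    gleaner.feed(html)
    gleaner.close()
    return gleaner.triples


if __name__ == "__main__":
    page = (
        '<p><a href="http://example.org/x" class="nsfw">careful</a> '
        '<a href="http://example.org/y">fine</a></p>'
    )
    for triple in glean_nsfw(page):
        print(triple)
```

The HTML author only writes the class name; the mapping to RDF lives in the gleaner, which is the separation the slide describes.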
32.
That’s the theory. Now for the demo.
33.
irc.freenode.net #swig #swhack
34.
getsemantic.com [email_address]
35.
[email_address] http://tommorris.org