Don't scrape, Glean!
tommorris
Lacks the demo part, alas, but it's the slides I used
Don't scrape, Glean!
1.
2.
Scraping sucks.
3.
def lastlogin
  date = (@hmodel / "//td[@class='text'][@width='193']").first.innerHTML.split("<br />")[9].strip[-10..-1]
  return date[-4..-1] + "-" + date[-7..-6] + "-" + date[-10..-9]
end
4.
Hpricot for ‘Last login’ date on MySpace.
5.
try :
lastlogin = self.soup.findAll( True , { "width" : "193" })[ 0 ].br.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.string loginregex = re.compile( r " [0-9] / [0-9] +/ [0-9]* ") loginregex_inst = loginregex.search(lastlogin) if loginregex_inst is not None : self.lastlogin = loginregex_inst.group() except : pass pass pass pass pass pass pass pass
6.
Taken from a Python/BeautifulSoup library.
7.
(The Ruby is prettier, but who’s counting?)
8.
getElementsByClassName("foo")[0].children
9.
It’s an edge case. MySpace’s HTML is worse than average.
10.
But it is an ugly recipe for mental turmoil.
11.
The alternative?
12.
flickr.getPhotos()
13.
And you get back nice XML or JSON (or even SOAP!)
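For contrast with the scraping snippets above, here is a minimal sketch of what consuming a structured response looks like. The payload and its field names (`photos`, `id`, `title`) are invented for illustration and are not Flickr's actual response format.

```python
import json

# Hypothetical payload: the field names are illustrative only,
# not the real shape of a Flickr API response.
SAMPLE_RESPONSE = """
{"photos": [
  {"id": "1", "title": "Sunset"},
  {"id": "2", "title": "Harbour"}
]}
"""


def photo_titles(raw_json):
    """Return the photo titles from a JSON document shaped like SAMPLE_RESPONSE."""
    data = json.loads(raw_json)
    # No sibling-hopping or string slicing: the structure names what we want.
    return [photo["title"] for photo in data["photos"]]


if __name__ == "__main__":
    print(photo_titles(SAMPLE_RESPONSE))
```

The lookup is by key rather than by position in the markup, so a cosmetic redesign of the site would not break it.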
14.
But ‘D.R.Y.’! APIs break that principle.
15.
This is the data equivalent of the ‘accessible version’.
16.
Enter GRDDL.
17.
GRDDL defines a transformation process for XHTML » RDF.
18.
XHTML? That’s what the spec says.
19.
HTML 4 works too. Tidy!
20.
RDF? Yes. Trust me. It’s not evil.
21.
GRDDL can work like a data stylesheet on top of your HTML.
22.
You simply use HTML (or XML) in the normal way...
23.
...and define how the data is transformed.
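As a rough sketch of how a page declares its transformation: in GRDDL, an XHTML document opts in with the `http://www.w3.org/2003/g/data-view` profile on `head` and points at its transformation with `rel="transformation"`. The code below only discovers those declarations in `head` (it does not run any XSLT), and the stylesheet URL in the sample page is a made-up placeholder.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"

# Example page; the stylesheet URL is a hypothetical placeholder.
SAMPLE_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Links</title>
    <link rel="transformation" href="http://example.org/nsfw2rdf.xsl"/>
  </head>
  <body><p>Hello</p></body>
</html>"""


def grddl_transformations(xhtml_text):
    """Return the transformation URLs declared in the head of an XHTML page."""
    root = ET.fromstring(xhtml_text)
    head = root.find("{%s}head" % XHTML_NS)
    if head is None:
        return []
    # The profile attribute is a space-separated list of URIs.
    if GRDDL_PROFILE not in head.get("profile", "").split():
        return []
    hrefs = []
    for link in head.findall("{%s}link" % XHTML_NS):
        # rel is also space-separated, so check membership, not equality.
        if "transformation" in link.get("rel", "").split() and link.get("href"):
            hrefs.append(link.get("href"))
    return hrefs


if __name__ == "__main__":
    print(grddl_transformations(SAMPLE_PAGE))
```

A GRDDL-aware agent would then fetch each listed transformation and apply it to the page to obtain RDF.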
24.
You can even use it as a bridge for existing APIs and services.
25.
Could even be used for other formats than RDF. Atom?
26.
Simple example: ‘Not Safe For Work’
27.
<a href="http://tubgirl.com" class="nsfw">
28.
I can write that. I can’t write xFolk by hand.
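To make the ‘nsfw’ idea concrete, here is a minimal sketch of the kind of data such a transformation could produce. GRDDL transformations are normally written in XSLT; this swaps in plain Python purely for illustration. The predicate URI and literal value are hypothetical, and the sample links use example.com rather than a real site.

```python
from html.parser import HTMLParser

# Hypothetical vocabulary: a real transformation would use a published one.
NSFW_PREDICATE = "http://example.org/vocab#rating"
NSFW_VALUE = "nsfw"

SAMPLE_HTML = (
    '<p><a href="http://example.com/a" class="nsfw">A</a> '
    '<a href="http://example.com/b">B</a></p>'
)


class NsfwLinkCollector(HTMLParser):
    """Collect the href of every <a> whose class list contains 'nsfw'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # Attribute values can be None (e.g. a bare `class`), hence `or ""`.
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href")
        if href and "nsfw" in classes:
            self.links.append(href.strip())


def nsfw_triples(html_text):
    """Return one N-Triples line per link marked class="nsfw"."""
    collector = NsfwLinkCollector()
    collector.feed(html_text)
    collector.close()
    return [
        '<%s> <%s> "%s" .' % (href, NSFW_PREDICATE, NSFW_VALUE)
        for href in collector.links
    ]


if __name__ == "__main__":
    for line in nsfw_triples(SAMPLE_HTML):
        print(line)
```

The author of the page only writes the class attribute; the mapping from class name to vocabulary lives in the transformation.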
29.
Is ‘nsfw’ a good class name? No.
30.
Do I care? No.
31.
The data layer becomes separated from the HTML, just as CSS is.
32.
That’s the theory. Now for the demo.
33.
irc.freenode.net #swig #swhack
34.
getsemantic.com [email_address]
35.
[email_address] http://tommorris.org