The document discusses using the Ruby programming language and tools like Hpricot and XPath to parse HTML documents, highlighting how Hpricot can be used to easily extract information like country names and URLs from a weather website that has a poorly structured table layout. Steps provided include inspecting the site with Firebug to get element XPaths and then parsing the HTML using Hpricot to retrieve the desired data.
3. So … Let’s See !
• Dynamic
• Easy to Learn
• Easy to maintain and grow
• Convenient Short‐Cuts
Ex: str = "Linux Creative Group"
str_join = str.split(" ").join("+")
• Transparent, code faster
• Few Syntax Errors, Fewer Bugs
• It’s Fun
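As a minimal sketch of the short-cut style mentioned above, the same split/join chain can be wrapped in a small method and varied. The helper name `plus_join` is hypothetical and only used for illustration; it is not from the slides.

```ruby
# Hypothetical helper (not from the slides): join the words of a phrase
# with "+", exactly like the one-liner in the example above.
def plus_join(phrase)
  # split(" ") splits on whitespace and ignores repeated spaces
  phrase.split(" ").join("+")
end

str = "Linux Creative Group"

puts plus_join(str)                           # Linux+Creative+Group
puts str.split(" ").map(&:upcase).join("-")   # LINUX-CREATIVE-GROUP
puts str.split(" ").reverse.join(" ")         # Group Creative Linux
```

Each line reads left to right as a pipeline, which is a large part of why such chains are quick to write and easy to check.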
4. Ruby Gems
• Package Management System for Ruby Applications and Libraries
• Resolves Dependencies.
• Provides a Central Repository of Software.
• One Command Rules:
- gem install <gem_name>
• Can Have Your Own Local Gem Server
- gem install <gem_name> --source <gem_server_ip_and_port>
6. Hpricot
• Pull information from virtually any website.
• Search by Element ID, Tags, CSS Selectors.
• Parse HTML including broken HTML
• Update HTML
• Use this data anywhere and any way you want!
• Use XPath to target an element directly.
• Let’s see … how it works.
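The XPath idea from the list above can be sketched as follows. To keep the sketch runnable without installing any gem, it uses REXML from Ruby's standard distribution in place of Hpricot, so the calls shown are REXML's and not Hpricot's API. The markup is a made-up sample (not the real page), and the method name `country_links` is hypothetical. Note that REXML, unlike Hpricot, only accepts well-formed markup.

```ruby
require "rexml/document"

# Made-up sample markup (an assumption, not the real site): country links
# buried inside a nested table, with no helpful IDs to search by.
HTML_SAMPLE = <<~HTML
  <html><body>
    <table><tr><td>
      <table>
        <tr><td><a href="/sample/india.htm">India</a></td></tr>
        <tr><td><a href="/sample/japan.htm">Japan</a></td></tr>
      </table>
    </td></tr></table>
  </body></html>
HTML

# Hypothetical helper: collect the text and href of every link that sits
# directly inside a table cell.
def country_links(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, "//table//td/a").map do |a|
    { name: a.text, url: a.attributes["href"] }
  end
end

country_links(HTML_SAMPLE).each do |country|
  puts "#{country[:name]} -> #{country[:url]}"
end
```

The XPath expression does the navigation through the nested tables, so the Ruby code only has to read the text and the `href` of each matched element.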
7. Let’s Parse a Badly Designed Site!
• http://www.worldweather.org
• It’s a site that provides weather information for different locations across the globe.
• On the main page, the content sits in a badly nested table structure!
• An ideal web developer would have put the entries in divs with meaningful IDs.
• But let’s face the truth and parse the country names and their URLs.