Product Information Extraction
Sai Ganesh K(11PT22)
OpinioZ Technologies Pvt. Ltd.
1
Special Thanks To
• Dr. R. Nadarajan, Professor and Head
• Dr. R. Anitha, Programme Coordinator, Associate Professor
• N. Mohanraj, Associate Professor
• Mrs. S. Anandhi, Assistant Professor (Senior Grade)
Department of Applied Mathematics and Computational Sciences
2
External Guides
Mr. N. Kishore Kumar
Software Engineer
OpinioZ Technologies
Principal Software Architect
ClariTrics Technologies
Mr. G. Venkatesh Prabhu
Managing Director,
OpinioZ Technologies
Chief Technology Officer,
ClariTrics Technologies
3
Agenda
• OpinioZ Technologies - Introduction
• EtailOne.com – A Brief Overview
• System Environment
• Product Classification
• Information Extraction System
• EtailOne In-House Administration System
4
OpinioZ Technologies
• OpinioZ Technologies Pvt. Ltd. is a start-up company funded by
ClariTrics Technologies, founded on 12.02.2014.
• OpinioZ Technologies focuses on building a price comparison website,
or price comparison engine, called “EtailOne”.
• EtailOne offers a next-generation product discovery platform.
5
      EtailOne.com
 Enables the customer to create a social profile with an inventory of
products which they have owned, own now, or wish to own.
 Gives the customers the advantage of knowing the product in and
out, along with expert reviews and comments on what to buy, when to
buy and where to buy it.
 Empowers its customers by integrating the fragmented market with
its unique social ecommerce platform, thereby enabling them to stay
on top of their favourite brands and products.
6
System Environment
Hardware Specifications
• Processor: Intel® Core™ i5
• RAM: 4 GB
• System: 64-bit Operating System
Software Specifications
• Operating System: Windows 8
• Languages: Java, XML 1.0
• Database: MySQL
• Technologies: Apache Tomcat, Maven, MySQL Workbench 6.1 CE
• Tools: Eclipse 3.7, Apache Tomcat, Microsoft Excel 2007
7
PRODUCT TITLE FILTERING AND 
CLASSIFICATION
8
System Overview
• Most product titles contain enough information about a product to
classify it.
• Hundreds of new products are introduced by various online stores.
When these new products enter EtailOne, it is very difficult to classify
them manually. Hence, an automatic procedure is required.
• The objective of this project is to create a product classification
module that extracts all the attributes and assigns the product to the
appropriate breadcrumb.
9
System Requirement
• Creation of Regex Files
The regex files are attribute extraction rules that are written
manually to classify product titles from different categories. To create
these files, all the attributes that characterize a product category
should be listed along with all their possible values.
• Input File
The input file consists of the product titles and the basic
category to which each title belongs.
10
Example of Sunglasses Regex File
11
Continued…
• Creation of XML Category Hierarchy Files
These files define the hierarchy of the categories and their
descending subcategories.
For each category, the file contains the regular expressions that a
title should match in order to belong to the category and the regular
expressions that a title should not match if it belongs to the category.
This eliminates the confusion that may arise when a large number of
vague product titles need to be classified.
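As a rough illustration of such a hierarchy file, the sketch below parses a small XML document and lists each category with its positive and negative rule. The element name `category` and the attributes `name`, `match` and `exclude` are assumptions made for this example; the project's actual file layout is not shown in the slides.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CategoryXmlSketch {

    // Hypothetical layout: element and attribute names are assumptions, not the project's schema.
    static final String XML =
        "<category name=\"Footwear\" match=\".*\">"
      + "<category name=\"Sandals\" match=\"sandals?\" exclude=\"socks?\"/>"
      + "<category name=\"Shoes\" match=\"shoes?\" exclude=\"laces?|polish\">"
      + "<category name=\"Running Shoes\" match=\"running\"/>"
      + "</category>"
      + "</category>";

    // Adds one "breadcrumb | match | exclude" line per category, depth first.
    static void walk(Element category, String parentPath, List<String> out) {
        String name = category.getAttribute("name");
        String path = parentPath.isEmpty() ? name : parentPath + " > " + name;
        // getAttribute returns an empty string when the attribute is absent.
        out.add(path + " | " + category.getAttribute("match")
                + " | " + category.getAttribute("exclude"));
        NodeList children = category.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                walk((Element) child, path, out);
            }
        }
    }

    // Parses the XML text and returns the flattened category rules.
    static List<String> load(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> out = new ArrayList<>();
            walk(doc.getDocumentElement(), "", out);
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse category file", e);
        }
    }

    public static void main(String[] args) {
        load(XML).forEach(System.out::println);
    }
}
```

Nesting the subcategories inside their parent element lets the breadcrumb fall out of the tree walk, so no separate parent reference is needed.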
12
Example of Footwear XML File
13
Product Title Filtering Module
• Product titles are extracted from different shopping websites for
product title filtering.
• This module takes as input a flat file containing a list of product
titles from different shopping websites and their breadcrumbs.
• For each breadcrumb and title, the module parses the title and checks
whether the parsed values match the values in the regex file.
• If they match, the corresponding attributes and their values are
added to the database for the respective breadcrumb.
14
15
Attribute Extractor Module
• The product titles and the basic category to which each belongs are
given as input.
• The attribute extractor uses the product's basic category to decide
which regex file is matched against the title.
• The attribute headings and values are extracted and stored in the
database.
• The values should be normalized to ensure consistency.
• These values are also used to dynamically list the attribute value
options that customers can use in EtailOne to filter products.
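The extraction and normalization steps above can be sketched as follows. This is a minimal illustration: the rule set stands in for one category's regex file, and the attribute headings, value lists and the `gray` to `grey` normalization are all assumptions for the example, not the project's real rules.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: SUNGLASSES_RULES stands in for one category's regex file.
class AttributeExtractorSketch {

    // Attribute heading -> pattern listing its known values (illustrative values only).
    static final Map<String, Pattern> SUNGLASSES_RULES = new LinkedHashMap<>();
    static {
        SUNGLASSES_RULES.put("Frame Shape", Pattern.compile(
                "\\b(aviator|wayfarer|round|rectangular)\\b", Pattern.CASE_INSENSITIVE));
        SUNGLASSES_RULES.put("Colour", Pattern.compile(
                "\\b(black|brown|blue|green|grey|gray)\\b", Pattern.CASE_INSENSITIVE));
        SUNGLASSES_RULES.put("Gender", Pattern.compile(
                "\\b(men|women|unisex)\\b", Pattern.CASE_INSENSITIVE));
    }

    // Normalizes a raw match so that spelling variants map to one stored value.
    static String normalize(String raw) {
        String value = raw.toLowerCase(Locale.ROOT);
        return value.equals("gray") ? "grey" : value;
    }

    // Returns the first matching value for every attribute found in the title.
    static Map<String, String> extract(String title, Map<String, Pattern> rules) {
        Map<String, String> attributes = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(title);
            if (m.find()) {
                attributes.put(rule.getKey(), normalize(m.group(1)));
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        System.out.println(extract("Acme Aviator Gray Sunglasses for Men", SUNGLASSES_RULES));
    }
}
```

In a full system the basic category would select which rule map to load, and the resulting heading and value pairs would be written to the database rather than printed.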
16
Product Classifier Module
• The product classifier is a recursive procedure that descends into
deeper subcategories for as long as the regular expression rules are
satisfied.
• It stops when a negative regular expression is satisfied or when
no regular expression matches the title’s attribute values.
• The breadcrumb to which the product belongs is returned. It is
stored with a category id, and a separate table is maintained
consisting of the category id and its filters.
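The recursion described above might look like the following sketch. The `Category` class, the sample tree and its patterns are hypothetical; the sketch matches rules against the raw title for brevity, whereas the slides describe matching against extracted attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of a category tree with a positive and a negative rule per node.
class ProductClassifierSketch {

    static class Category {
        final String name;
        final Pattern mustMatch;     // title must contain this to belong here
        final Pattern mustNotMatch;  // title must not contain this (null = no negative rule)
        final List<Category> children = new ArrayList<>();

        Category(String name, String mustMatch, String mustNotMatch) {
            this.name = name;
            this.mustMatch = Pattern.compile(mustMatch, Pattern.CASE_INSENSITIVE);
            this.mustNotMatch = mustNotMatch == null ? null
                    : Pattern.compile(mustNotMatch, Pattern.CASE_INSENSITIVE);
        }

        Category add(Category child) {
            children.add(child);
            return this;
        }

        boolean accepts(String title) {
            return mustMatch.matcher(title).find()
                    && (mustNotMatch == null || !mustNotMatch.matcher(title).find());
        }
    }

    // Descends into the first accepting child; stops when no child accepts the title.
    static String classify(Category node, String title, String breadcrumb) {
        String path = breadcrumb.isEmpty() ? node.name : breadcrumb + " > " + node.name;
        for (Category child : node.children) {
            if (child.accepts(title)) {
                return classify(child, title, path);
            }
        }
        return path;
    }

    // Illustrative tree only; real categories would come from the XML hierarchy files.
    static Category sampleTree() {
        Category men = new Category("Men", "\\bmen'?s?\\b", null)
                .add(new Category("Sports Shoes", "\\b(running|sports?)\\b", "\\bsocks?\\b"))
                .add(new Category("Sandals", "\\bsandals?\\b", null));
        return new Category("Footwear", ".*", null).add(men);
    }

    public static void main(String[] args) {
        System.out.println(classify(sampleTree(), "Acme Men's Running Shoes Blue", ""));
    }
}
```

A title that trips a negative rule, such as one containing "socks", stays at the parent breadcrumb instead of being pushed into the wrong leaf.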
17
Output for Attribute Extraction
18
Output for Product Classification
19
CRAWLERS AND THEIR BEHAVIOUR
20
About Crawlers
• A crawler, or web spider, is a program or automated script that
browses the World Wide Web in a methodical, automated manner.
• Crawlers can be written in several environments, such as PHP, Java
and .NET.
• A crawler must connect to the web page, send a request, and wait for
the response status to return.
• It must then interpret the status code and convert the source code
into text using the correct character encoding.
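The last step, interpreting the status code and decoding the body with the right character encoding, can be sketched without any network access. The class and method names are hypothetical, and falling back to UTF-8 when no usable charset is declared is an assumption of this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResponseDecodingSketch {

    // Finds the charset parameter inside a Content-Type header value.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Picks the charset named in the header, falling back to UTF-8 (an assumption).
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name: use the fallback below.
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Only 2xx responses are turned into text; anything else yields null in this sketch.
    static String decode(int statusCode, String contentType, byte[] body) {
        if (statusCode < 200 || statusCode >= 300) {
            return null;
        }
        return new String(body, charsetOf(contentType));
    }

    public static void main(String[] args) {
        byte[] body = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(200, "text/html; charset=ISO-8859-1", body));
    }
}
```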
21
Behavior of a Crawler
• Selection policy: states which pages to download. It is highly desirable
that the downloaded fraction contains the most relevant pages and not just
a random sample of the Web.
• Re-visit policy: states when to check for changes to the pages. The Web
is dynamic, and crawling a fraction of it can take weeks or months. By the
time a crawler has finished its crawl, many events could have happened,
including creations, updates and deletions.
• Politeness policy: states how to avoid overloading websites, since a
server would have a hard time keeping up with requests from multiple
crawlers.
• Parallelization policy: states how to coordinate distributed web crawlers.
This increases the efficiency of a crawler by reducing the time taken to
crawl all the required details.
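One common way to realize a politeness policy is a minimum delay between requests to the same host. The sketch below illustrates that idea only; the class name, the per-host delay and the explicit clock parameter are assumptions for the example, and the slides do not say which mechanism the project used.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: enforces a minimum gap between requests to the same host.
class PolitenessSketch {

    private final long minDelayMillis;
    private final Map<String, Long> nextAllowedByHost = new HashMap<>();

    PolitenessSketch(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // Returns how long the caller should wait before requesting the URL,
    // and reserves that time slot for the URL's host.
    synchronized long reserve(String url, long nowMillis) {
        String host = URI.create(url).getHost();
        long allowedAt = Math.max(nowMillis, nextAllowedByHost.getOrDefault(host, 0L));
        nextAllowedByHost.put(host, allowedAt + minDelayMillis);
        return allowedAt - nowMillis;
    }

    public static void main(String[] args) {
        PolitenessSketch politeness = new PolitenessSketch(2000);
        long now = System.currentTimeMillis();
        System.out.println(politeness.reserve("http://shop.example/a", now));
        System.out.println(politeness.reserve("http://shop.example/b", now));
        System.out.println(politeness.reserve("http://other.example/x", now));
    }
}
```

Passing the current time in as a parameter keeps the scheduler deterministic, and several crawler threads can share one instance because `reserve` is synchronized.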
22
HTML DOM Hierarchy
• A crawler extracts the necessary information from a web page by
looking into its source code.
• The downloaded source code forms a tree, or hierarchy, which is
described by the Document Object Model (DOM). Every link, form, image,
etc. is a node in this hierarchy.
23
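A small sketch of reading links out of the DOM tree follows. It uses the JDK's XML parser, so it assumes the page is well-formed XHTML; real-world HTML usually is not and would need a tolerant HTML parser instead. The sample markup and names are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomLinksSketch {

    // Invented, well-formed sample of a product listing page.
    static final String PAGE =
        "<html><body><div class=\"listing\">"
      + "<a href=\"/p/1\"><img src=\"/img/1.jpg\"/>Product One</a>"
      + "<a href=\"/p/2\"><img src=\"/img/2.jpg\"/>Product Two</a>"
      + "</div></body></html>";

    // Parses the markup into a DOM tree and returns the href of every <a> element.
    static List<String> links(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList anchors = doc.getElementsByTagName("a");
            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                hrefs.add(((Element) anchors.item(i)).getAttribute("href"));
            }
            return hrefs;
        } catch (Exception e) {
            throw new IllegalStateException("Page is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(links(PAGE));
    }
}
```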
System Design
Seeder
• Modules that pick up the links of the required products belonging to
a particular category.
• The module depends on the design technique used to load the products.
• These links are fed into the Listers.
• The database keeps the list of all the seed URLs; when data has to be
refreshed, the particular source is called and the crawler revisits the
sites.
Lister
• Modules that pick up separate product URLs, titles, images, and other
features available on the product listing pages.
• The images are retrieved by downloading them as bytes.
Parser
• Modules that go deep into the product URL to get very specific
information such as specifications, key features, delivery details and
offers.
• This information is not present on the product listing page.
24
Seeder
• The crawler maintains a list of unvisited URLs called the frontier.
• This set of URLs is stored in the database with the status "ready to seed".
• Once a URL is crawled and the necessary details are obtained, the status
changes to "seeded".
• Before URLs are added to the frontier, they may be assigned a score that
represents the estimated benefit of visiting the corresponding page. When
the crawler has no new page to fetch, it stops.
• To fetch a web page, an HTTP client sends an HTTP request for the page
and reads the response.
• The client needs timeouts so that an unnecessary amount of time is not
spent on slow servers or on reading large pages.
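The frontier and its status changes can be sketched with a priority queue. This is an in-memory illustration only: the slides keep the URLs and statuses in the database, and the class name, score values and method names here are hypothetical. Fetching and timeouts are left out.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical in-memory stand-in for the database-backed frontier.
class FrontierSketch {

    enum Status { READY_TO_SEED, SEEDED }

    private static final class Entry {
        final String url;
        final double score;

        Entry(String url, double score) {
            this.url = url;
            this.score = score;
        }
    }

    // Highest score first, so the most promising URL is fetched next.
    private final PriorityQueue<Entry> queue = new PriorityQueue<>(
            Comparator.comparingDouble((Entry e) -> e.score).reversed());
    private final Map<String, Status> statusByUrl = new HashMap<>();

    // Adds a URL once; a URL already known to the frontier is ignored.
    void add(String url, double score) {
        if (statusByUrl.putIfAbsent(url, Status.READY_TO_SEED) == null) {
            queue.add(new Entry(url, score));
        }
    }

    // Returns the next URL to crawl, or null when the frontier is empty.
    String next() {
        Entry e = queue.poll();
        return e == null ? null : e.url;
    }

    // Called after the page was crawled and its details were stored.
    void markSeeded(String url) {
        statusByUrl.put(url, Status.SEEDED);
    }

    Status statusOf(String url) {
        return statusByUrl.get(url);
    }

    public static void main(String[] args) {
        FrontierSketch frontier = new FrontierSketch();
        frontier.add("http://shop.example/shoes", 0.4);
        frontier.add("http://shop.example/sunglasses", 0.9);
        String url;
        while ((url = frontier.next()) != null) {
            System.out.println("crawling " + url);
            frontier.markSeeded(url);
        }
    }
}
```

The loop in `main` ends when `next()` returns null, which mirrors the rule that the crawler stops once it has no new page to fetch.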
25
Continued…
• There are two ways of extracting the listing URLs from a website,
depending on how it loads its products: pagination and lazy loading.
• The crawl history is also maintained. It is a time-stamped list of URLs
that were fetched. This history may be used for post-crawl analysis
and evaluation.
26
List Page Crawler
• The list page crawler picks up separate product URLs, titles, images
and other features available on the product listing pages.
• The images are retrieved by downloading them as bytes.
• Visiting every product URL individually takes more time. Hence,
using a list page crawler is more advantageous.
27
Parser
• The parser is a module that goes deep into the product URL to get
specific information such as specifications, key features, delivery
details and offers.
• These details are not present on the product listing page. Some users
may require all the specification details and key features of a
product.
• Going through each specific product URL is a time-consuming
process.
28
System Workflow
• The list of URLs is stored in the database with its breadcrumb and
"ready to seed" status.
• SEEDER MODULE: the seeder module stores all the pagination URLs and
other links with the status "ready to list".
• Decision: list page crawler or parser?
• LIST PAGE CRAWLER MODULE
• PARSER MODULE
• Information such as product URLs and images in the web page is stored.
Other textual information is stored separately.
29
Output
30
31
Thank you
