2. Special Thanks To
Dr. R. Nadarajan
Professor and Head
Dr. R. Anitha
Programme Coordinator,
Associate Professor
N. Mohanraj
Associate Professor
Mrs. S. Anandhi
Assistant Professor Senior Grade
Department of Applied Mathematics and Computational Sciences
3. External Guides
Mr. N. Kishore Kumar
Software Engineer
OpinioZ Technologies
Principal Software Architect
ClariTrics Technologies
Mr. G. Venkatesh Prabhu
Managing Director,
OpinioZ Technologies
Chief Technology Officer,
ClariTrics Technologies
4. Agenda
OpinioZ Technologies - Introduction
EtailOne.com – A Brief Overview
System Environment
Product Classification
Information Extraction System
EtailOne In-House Administration System
5. OpinioZ Technologies
OpinioZ Technologies Pvt. Ltd. is a start-up company funded by
ClariTrics Technologies, founded on 12.02.2014.
OpinioZ Technologies has kept its major focus on creating a price
comparison website, or price comparison engine, called “EtailOne”.
EtailOne offers the next generation product discovery platform.
6. EtailOne.com
Enables the customer to create a social profile with an inventory of
products which they had, have, and wish to have.
Gives customers the advantage of knowing the product inside and out,
along with expert reviews and comments on what to buy, when to buy,
and where to buy the product.
Empowers its customers by integrating the fragmented market with
its unique social ecommerce platform, thereby enabling them to stay
on top of their favourite brands and products.
7. System Environment
Hardware Specifications
• Processor: Intel® Core™ i5
• RAM: 4 GB
• System: 64-bit Operating System
Software Specifications
• Operating System: Windows 8
• Languages: Java, XML 1.0
• Database: MySQL
• Technologies: Apache Tomcat, Maven, MySQL Workbench 6.1 CE
• Tools: Eclipse 3.7, Apache Tomcat, Microsoft Excel 2007
9. System Overview
Most product titles contain sufficient information to let the user
classify the product accordingly.
Hundreds of new products are introduced by various online stores.
When these new products enter EtailOne, it is very difficult to classify
them manually. Hence, an automatic procedure is required.
The objective of this project is to create a product classification
module that extracts all the attributes and assigns the product to a
breadcrumb appropriately.
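For instance (a hypothetical example), the title "Samsung Galaxy S4 16GB Black" could yield the attributes brand = Samsung, storage = 16 GB and colour = Black, and be assigned the breadcrumb Electronics > Mobiles > Smartphones.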
10. System Requirement
Creation of Regex Files
The regex files are attribute extraction rules that are written
manually to classify product titles from different categories. In order
to create these files, all the attributes that characterize a product
category should be listed along with all their possible values.
Input File
The input file consists of the product titles and the basic
category to which each belongs.
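As an illustration only (the attribute names, patterns, and file layout below are assumptions, not the project's actual rules), a regex file for a Mobiles category might pair each attribute with the pattern of its possible values:

    brand=(Samsung|Nokia|Sony|LG)
    storage=(\d{1,3})\s?GB
    colour=(Black|White|Blue|Gold)

and the input file might hold one tab-separated record per line, for example:

    Samsung Galaxy S4 16GB Black    Mobiles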
12. Continued…
Creation of XML Category Hierarchy Files
These files are created to design the hierarchy of the categories
and their descending subcategories.
Each file contains the regular expressions that a title should
match in order to belong to the category, and regular expressions
that the title should not match if it belongs to the category. Hence, this
eliminates the confusion that may arise when a large number of vague
product titles needs to be classified.
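A minimal sketch of such a hierarchy file (the element and attribute names here are assumed for illustration; the actual schema may differ):

    <category name="Mobiles">
      <match>phone|smartphone|galaxy</match>
      <exclude>case|cover|charger</exclude>
      <category name="Smartphones">
        <match>android|ios</match>
      </category>
    </category>

A title is placed in a category only when it matches the positive patterns and avoids the negative ones; the same test is then applied to each subcategory.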
14. Product Title Filtering Module
Product titles from the different shopping websites are extracted to
do product title filtering.
This module takes as input a flat file containing a list of product titles
from different shopping websites together with their breadcrumbs.
For each given breadcrumb and title, the module parses the title
and checks whether the values in the regex file are the same as the
parsed values.
If they are the same, the corresponding attributes and their values are
added into the database for the respective breadcrumb.
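A minimal Java sketch of this check, assuming the regex file has already been loaded into a map from attribute name to pattern (all names and patterns are illustrative, and the all-attributes-must-match rule is an assumption):

    import java.util.*;
    import java.util.regex.*;

    // Illustrative filter: a title passes for a breadcrumb only if every
    // attribute pattern loaded from the regex file finds a value in it.
    public class TitleFilter {
        public static boolean matchesAll(String title, Map<String, Pattern> rules) {
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                if (!rule.getValue().matcher(title).find()) {
                    return false;   // one failed attribute rejects the title
                }
            }
            return true;            // all attribute patterns found a value
        }

        public static void main(String[] args) {
            Map<String, Pattern> rules = new LinkedHashMap<>();
            rules.put("storage", Pattern.compile("(\\d{1,3})\\s?GB"));
            rules.put("colour", Pattern.compile("Black|White|Blue|Gold"));
            System.out.println(matchesAll("Samsung Galaxy S4 16GB Black", rules)); // true
        }
    }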
16. Attribute Extractor Module
The product titles and the basic category to which each belongs are
given as input.
The attribute extractor uses the basic category in which the product
belongs to decide the regex file to be matched against the title.
The attribute heading and values are extracted and stored in the
database.
The values are normalized to ensure consistency.
These values are also used to dynamically list the various attribute value
options available in EtailOne for customers to filter products.
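A small Java sketch of extraction with named capture groups, followed by a simple normalization step (the attribute and its normal form are assumptions for illustration):

    import java.util.*;
    import java.util.regex.*;

    // Illustrative extractor: pulls an attribute value out of a title with a
    // named capture group, then normalizes it before storage.
    public class AttributeExtractor {
        private static final Pattern STORAGE =
                Pattern.compile("(?<storage>\\d{1,3})\\s?(?i:GB)");

        public static Map<String, String> extract(String title) {
            Map<String, String> attrs = new HashMap<>();
            Matcher m = STORAGE.matcher(title);
            if (m.find()) {
                // normalize so "16 gb" and "16GB" are both stored as "16 GB"
                attrs.put("storage", m.group("storage") + " GB");
            }
            return attrs;
        }

        public static void main(String[] args) {
            System.out.println(extract("Samsung Galaxy S4 16 gb Black")); // {storage=16 GB}
        }
    }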
17. Product Classifier Module
The product classifier is a recursive procedure that keeps descending
into deeper subcategories as long as the regular-expression rules are
satisfied.
It stops when a negative regular expression is satisfied or when
no regular expression matches the title's attribute values.
The breadcrumb to which the product category belongs is returned.
It is stored with a category ID, and a separate table is maintained
mapping each category ID to its filters.
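A simplified recursive sketch of this procedure (the Category class, its fields, and the sample data are assumptions made for illustration):

    import java.util.*;
    import java.util.regex.*;

    // Illustrative category node: positive and negative patterns plus children.
    class Category {
        String name;
        Pattern match;    // title must match this to belong here
        Pattern exclude;  // title must NOT match this (may be null)
        List<Category> children = new ArrayList<>();
    }

    public class ProductClassifier {
        // Descends as long as a child's rules are satisfied; returns the
        // deepest matching breadcrumb, e.g. "Electronics > Mobiles".
        static String classify(Category node, String title, String crumb) {
            String here = crumb.isEmpty() ? node.name : crumb + " > " + node.name;
            for (Category child : node.children) {
                boolean positive = child.match.matcher(title).find();
                boolean negative = child.exclude != null
                                && child.exclude.matcher(title).find();
                if (positive && !negative) {
                    return classify(child, title, here);  // go one level deeper
                }
            }
            return here;  // no child rule satisfied: stop at this breadcrumb
        }

        public static void main(String[] args) {
            Category root = new Category();
            root.name = "Electronics";
            Category mobiles = new Category();
            mobiles.name = "Mobiles";
            mobiles.match = Pattern.compile("(?i)phone|galaxy");
            mobiles.exclude = Pattern.compile("(?i)case|cover");
            root.children.add(mobiles);
            System.out.println(classify(root, "Samsung Galaxy S4 16GB Black", ""));
            // prints: Electronics > Mobiles
        }
    }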
21. About Crawlers
A crawler, or web spider, is a program or automated script that
browses the World Wide Web in a methodical, automated
manner.
Crawlers can be created in several environments, such as
PHP, Java, and .NET.
A crawler must connect to the web page, send requests, and wait for
the status of the connection to return.
It must then interpret the status code and convert the source code into
textual information, with the character encoding set properly.
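A minimal Java sketch of these steps using the standard library (the URL and timeout values are assumptions):

    import java.io.*;
    import java.net.*;
    import java.nio.charset.StandardCharsets;

    // Illustrative fetch: connect, check the status code, and read the body
    // with an explicit character encoding (UTF-8 is assumed here).
    public class PageFetcher {
        public static String fetch(String url) throws IOException {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(5000);   // do not hang on slow servers
            conn.setReadTimeout(10000);
            if (conn.getResponseCode() != 200) {
                throw new IOException("HTTP status " + conn.getResponseCode());
            }
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                StringBuilder page = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) {
                    page.append(line).append('\n');
                }
                return page.toString();
            }
        }
    }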
22. Behavior of a Crawler
Selection policy: states which pages to download. It is highly desirable that
the downloaded fraction contains the most relevant pages and not just a
random sample of the Web.
Re-visit policy: states when to check for changes to the pages. The Web
has a dynamic nature, and crawling a fraction of the Web can take weeks or
months. By the time a Web crawler has finished its crawl, many events could
have happened, including creations, updates and deletions.
Politeness policy: states how to avoid overloading websites, since a server
would have a hard time keeping up with requests from multiple crawlers
(a sketch follows this list).
Parallelization policy: states how to coordinate distributed web crawlers.
This increases the efficiency of a crawler by improving the time taken to
crawl all the required details.
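As one way to implement the politeness policy (the two-second delay is an assumption, not a value from the project):

    import java.util.*;

    // Illustrative politeness policy: enforce a minimum delay between
    // consecutive requests to the same host.
    public class PolitenessPolicy {
        private static final long DELAY_MS = 2000;
        private final Map<String, Long> lastRequest = new HashMap<>();

        public synchronized void waitForHost(String host) throws InterruptedException {
            long now = System.currentTimeMillis();
            Long last = lastRequest.get(host);
            if (last != null && now - last < DELAY_MS) {
                Thread.sleep(DELAY_MS - (now - last));  // back off before the next hit
            }
            lastRequest.put(host, System.currentTimeMillis());
        }
    }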
23. HTML DOM Hierarchy
A crawler extracts the necessary information from a web page by
looking into the source code.
The downloaded source code forms a tree, or hierarchy, which can be
understood using the Document Object Model (DOM). Every link,
form, image, etc. is described by this hierarchical system.
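A short Java sketch of walking this hierarchy, assuming the jsoup library as one possible choice (the document does not name the parser actually used):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    // Illustrative DOM traversal: every link in the page is a node in the
    // tree, reachable with a CSS-style selector.
    public class DomExample {
        public static void main(String[] args) throws Exception {
            Document doc = Jsoup.connect("https://example.com").get();
            for (Element link : doc.select("a[href]")) {
                System.out.println(link.attr("abs:href") + " -> " + link.text());
            }
        }
    }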
24. System Design
Seeder
• Modules used to pick up the different links of the required products
belonging to a particular category.
• The module depends on the design technique used to load the products.
• These links are fed into the Listers.
• The database keeps the list of all the seed URLs; when data has to be
refreshed, the particular source is called and the crawler revisits the sites.
Lister
• Modules that pick up separate product URLs, titles, images, and other
features they can get from the product listing pages.
• The images are retrieved by downloading them as bytes.
Parser
• Modules that go deep into the product URL to get very particular
information like specifications, key features, delivery details, offers, etc.
• This information is not present on the product listing page.
25. Seeder
The crawler maintains a list of unvisited URLs called the frontier.
This set of URLs is stored in the database with the status "ready to seed".
Once a URL is crawled and the necessary details are obtained, the status
shifts to "seeded".
Before the URLs are added to the frontier, they may be assigned a score that
represents the estimated benefit of visiting the page corresponding to the
URL. When the crawler has no new page to fetch, it stops.
In order to fetch a web page, an HTTP client sends an HTTP request for a
page and reads the response.
The client needs timeouts to make sure that an unnecessary amount
of time is not spent on slow servers or in reading large pages.
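A minimal in-memory sketch of the frontier and its statuses (in the real system this state lives in the database; the class shape here is an assumption):

    import java.util.*;

    // Illustrative frontier: URLs move from "ready to seed" to "seeded".
    public class Frontier {
        private final Map<String, String> status = new LinkedHashMap<>();

        public void add(String url) {
            status.putIfAbsent(url, "ready to seed");
        }

        // Returns the next unvisited URL, or null when the crawl is done.
        public String next() {
            for (Map.Entry<String, String> e : status.entrySet()) {
                if (e.getValue().equals("ready to seed")) return e.getKey();
            }
            return null;  // no new page to fetch: the crawler stops
        }

        public void markSeeded(String url) {
            status.put(url, "seeded");
        }
    }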
26. Continued…
There are two different ways of extracting further page URLs from a
website: the pagination technique and the lazy loading technique.
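For the pagination case, successive listing URLs can often be generated from a page parameter (the URL template below is hypothetical); lazy-loading sites instead require calling the endpoint the page loads its data from:

    // Illustrative pagination: generate listing-page URLs until pages run out.
    public class Paginator {
        public static void main(String[] args) {
            String template = "https://example-shop.com/mobiles?page=%d";
            for (int page = 1; page <= 3; page++) {
                System.out.println(String.format(template, page));
            }
        }
    }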
The crawl history is also maintained. It is a time-stamped list of URLs
that were fetched. This history may be used for post-crawl analysis
and evaluations.
27. List Page Crawler
The list page crawler picks up separate product URLs, titles, images,
and other features it can get from the product listing pages.
The images are retrieved by downloading them as bytes.
Visiting every product URL individually takes more time; hence,
using a list page crawler is more advantageous.
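A small Java sketch of retrieving an image as bytes (the URL and file name are hypothetical):

    import java.io.*;
    import java.net.URL;

    // Illustrative image download: read the image as raw bytes and write
    // them to disk.
    public class ImageDownloader {
        public static void main(String[] args) throws IOException {
            URL url = new URL("https://example.com/product.jpg");
            try (InputStream in = url.openStream();
                 FileOutputStream out = new FileOutputStream("product.jpg")) {
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);  // copy the image in byte chunks
                }
            }
        }
    }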
28. Parser
The parser is a module that goes deep into the product URL to get
specific information like specifications, key features, delivery details,
offers, etc.
These details are not present on the product listing page. Some users
may require all the specification details and key features of a
product.
Going through each specific product URL is a time-consuming
process.
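Continuing the jsoup assumption from the DOM slide, a parser step might read a product page's specification table (the URL, selector, and table layout are hypothetical):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // Illustrative parser: pull key/value rows out of a specification table
    // on a single product page.
    public class SpecParser {
        public static void main(String[] args) throws Exception {
            Document doc = Jsoup.connect("https://example-shop.com/product/123").get();
            for (Element row : doc.select("table.specs tr")) {
                Elements cells = row.select("td");
                if (cells.size() >= 2) {
                    System.out.println(cells.get(0).text() + " = " + cells.get(1).text());
                }
            }
        }
    }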
29. System Workflow
A list of URLs is stored in the database with its breadcrumb
and a "ready to seed" status.
The SEEDER MODULE then stores all the pagination URLs and
other links with the status "ready to list".
Depending on the page, each link is handed to either the
LIST PAGE CRAWLER MODULE or the PARSER MODULE.
Information such as product URLs and images in the web page
is stored; other textual information is stored separately.