Classification of E-commerce Websites by
Product Categories
Case Study
Moiseev George
Higher School of Economics
Faculty of Computer Science
Higher School of Economics, Moscow, 2016
www.hse.ru
Outline
• Introduction
• Preprocessing
• Feature extraction
• Classification and evaluation
• Experimental results
Problem Statement
• Retrieve e-commerce websites (e-shops)
• Classify e-shops by the type of products they sell
* Customer-to-customer websites are not counted as e-commerce shops.
Applications
• Market research
• Statistics gathering
• Organizing a knowledge base
• Goods search
Dataset
The dataset was received from datainsight.ru.
It contains two training subsets labeled by experts:
1. 1312 e-commerce and 1077 non-e-commerce websites
2. 1448 websites across 15 product categories
Preprocessing
Downloading a website:
• Start from the main page
• Download every internal hyperlink found on a page that has not been downloaded before
• Check whether an identical page was already downloaded through a different hyperlink
What information should be saved from the other pages? Three options are compared:
1. Nothing
2. Only metadata
3. Everything
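The download procedure above can be sketched roughly as follows. This is a minimal illustration, not the crawler used in the study: the `fetch` callable, the `max_pages` limit, the SHA-256 duplicate check and the toy in-memory `SITE` are all assumptions introduced here so the example runs without network access.

```python
import hashlib
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_site(main_url, fetch, max_pages=100):
    """Breadth-first crawl of internal links, skipping duplicate pages.

    `fetch` is a hypothetical callable: url -> HTML string, or None on failure.
    """
    host = urlparse(main_url).netloc
    seen_urls = {main_url}
    seen_hashes = set()
    pages = {}
    queue = deque([main_url])
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # identical page already reached through another hyperlink
        seen_hashes.add(digest)
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url
            # keep only internal links that were not queued before
            if urlparse(link).netloc == host and link not in seen_urls:
                seen_urls.add(link)
                queue.append(link)
    return pages


# Toy in-memory "website" standing in for real HTTP requests (assumption).
SITE = {
    "http://shop.example/": '<a href="/a">A</a> <a href="/b">B</a> '
                            '<a href="http://other.example/">x</a>',
    "http://shop.example/a": '<p>Phones</p> <a href="/">home</a>',
    "http://shop.example/b": '<p>Phones</p> <a href="/">home</a>',
}
pages = crawl_site("http://shop.example/", SITE.get)
print(sorted(pages))
```

In this toy site, `/b` has the same content as `/a`, so it is dropped by the duplicate check, and the external link is never followed.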
Preprocessing
Each webpage is stored in two versions:
• Raw page:
– Remove only JavaScript and obvious advertisements
• Cleaned page:
– Extract only the text content of markup tags
– Tokenization – retrieving sentences and words
– Stemming – reducing words to their root or base form
– Lowercase conversion
– Filter out stopwords
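A rough sketch of the "cleaned page" pipeline is given below, under several assumptions: the stopword list is a tiny illustrative one, `crude_stem` is a toy suffix stripper standing in for a real stemmer (it is not Porter's algorithm), and the order of the steps is one reasonable choice rather than the order used in the study.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stopword list (assumption); a real system would use a full one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "our"}


class TextExtractor(HTMLParser):
    """Keeps text found inside markup tags, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def crude_stem(word):
    """Toy suffix stripper (hypothetical), NOT a real stemming algorithm."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_page(html):
    """Extract text, lowercase, tokenize, drop stopwords, then stem."""
    extractor = TextExtractor()
    extractor.feed(html)
    text = " ".join(extractor.chunks).lower()
    tokens = re.findall(r"\w+", text)
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


html = ("<html><script>var x = 1;</script>"
        "<h1>Phones and Tablets</h1><p>Buying the best phones</p></html>")
tokens = clean_page(html)
print(tokens)
```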
Feature Extraction
There are many methods and models for automatic text feature
extraction:
• Bag of words
• n-grams
• word2vec
• TF-IDF (pictured on the original slide)
• Mutual information
• Chi-square
• …
Feature extraction
Proposed approach:
The term weighting formula for the i-th term in the k-th website is
derived from TF-IDF as follows:

$$W_{ik} = \frac{tf_{ik}\,\log\frac{N}{n_i}}{\sqrt{\sum_{j=1}^{N}\left(tf_{ij}\,\log\frac{N}{n_j}\right)^{2}}}$$

where n_i is the number of websites in which the i-th term appears, N is the
total number of websites in the sample, and tf_{ik} is computed as:

$$tf_{ik} = \sum_{t}^{T} w(t)\, f(i, k, t)$$

where w(t) is a weight inversely proportional to the frequency of tag t, and
f(i, k, t) is the frequency of the i-th term in the t-th tag.
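The weighting scheme can be illustrated with a small Python sketch. The tag weights in `TAG_WEIGHTS` are made-up values, not those of the study, and the denominator is assumed here to be the usual cosine normalization over the terms of one website; both are assumptions of this example.

```python
import math

# Hypothetical tag weights w(t): rarer, more prominent tags weigh more.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "p": 1.0}


def weighted_tf(tag_counts):
    """tf_ik = sum over tags t of w(t) * f(i, k, t) for one website k.

    tag_counts maps tag -> {term: count inside that tag}.
    """
    tf = {}
    for tag, counts in tag_counts.items():
        w = TAG_WEIGHTS.get(tag, 1.0)
        for term, count in counts.items():
            tf[term] = tf.get(term, 0.0) + w * count
    return tf


def tfidf_vectors(sites):
    """Return one normalized {term: weight} dict per website."""
    tfs = [weighted_tf(site) for site in sites]
    n_sites = len(tfs)
    # n_i: number of websites in which term i appears
    doc_freq = {}
    for tf in tfs:
        for term in tf:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    vectors = []
    for tf in tfs:
        raw = {t: v * math.log(n_sites / doc_freq[t]) for t, v in tf.items()}
        # assumed cosine normalization over the terms of this website
        norm = math.sqrt(sum(x * x for x in raw.values()))
        vectors.append({t: (x / norm if norm else 0.0) for t, x in raw.items()})
    return vectors


sites = [
    {"title": {"phone": 1}, "p": {"phone": 2, "delivery": 1}},
    {"title": {"book": 1}, "p": {"book": 3, "delivery": 1}},
]
vectors = tfidf_vectors(sites)
for vec in vectors:
    print(vec)
```

In this toy input, "delivery" appears on both websites, so log(N / n_i) = log(2 / 2) = 0 and its weight vanishes, while a term in the title counts three times as much as the same term in a paragraph.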
Classification and evaluation
• A Support Vector Machine (SVM) is used as the classifier.
• Multiclass classification is performed in a "one-vs-all" manner.
• Precision, recall and F-score are used for evaluation.
• Overall performance of the product type classification is measured
by the F-score averaged over all categories.
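The evaluation metrics can be sketched in plain Python as follows. The example assumes the balanced F-score (F1) and an unweighted average over categories, which the slides do not state explicitly; the labels and predictions are invented for illustration.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1, treating `label` as positive (one-vs-all)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return precision, recall, f_score


def average_f_score(y_true, y_pred):
    """Unweighted mean of the per-category F1 scores."""
    labels = sorted(set(y_true))
    total = sum(per_class_scores(y_true, y_pred, c)[2] for c in labels)
    return total / len(labels)


# Invented labels, for illustration only.
y_true = ["books", "books", "phones", "phones", "toys", "toys"]
y_pred = ["books", "phones", "phones", "phones", "toys", "books"]

for category in sorted(set(y_true)):
    print(category, per_class_scores(y_true, y_pred, category))
avg_f = average_f_score(y_true, y_pred)
print("average F-score:", round(avg_f, 3))
```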
Results
F-score of the e-commerce class in binary classification:

Website information used                       Pure TF-IDF   TF-IDF with tag weighting
Main page only                                 0.85          0.89
Main page + meta and title from other pages    0.89          0.94
Main page + whole other pages                  0.86          0.92
Results
Average F-score of e-commerce categorization by sold product type:

Website information used                       Pure TF-IDF   TF-IDF with tag weighting
Main page only                                 0.67          0.72
Main page + meta and title from other pages    0.74          0.79
Main page + whole other pages                  0.73          0.81
Moiseev George
gvmoiseev@edu.hse.ru

