Identifying Auxiliary Web Images
Using a Combination of Analyses
                                         Tewson Seeoun
             Sirindhorn International Institute of Technology


                                         With Guidance From

                        Asst. Prof. Dr. Toshiaki Kondo
             Sirindhorn International Institute of Technology
                       Dr. Choochart Haruechaiyasak
  Human Language Technology Laboratory, NECTEC, NSTDA
Agenda
    ●   Introduction
    ●   Background
         ●   Document Object Model (DOM) in HTML
         ●   Support Vector Machine (SVM)
    ●   Objective
    ●   Methodology
    ●   Results
    ●   Discussion / Future Work
    ●   Conclusion
    ●   Acknowledgement
Introduction


       ●   Websites contain images.
       ●   Some images are not necessary for tasks such as:
           ●   Search engine indexing
           ●   Printing
       ●   Ignoring them can be economical and green.

Background - DOM


●   Web browsers / layout engines parse HTML, apply CSS, and run
    JavaScript to build the DOM.
●   The DOM represents the elements of a Web page.
●   Each element has properties (position, size, etc.).
●   JavaScript can read these properties through the DOM.
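
As a rough illustration of elements carrying properties, the sketch below uses Python's standard html.parser module to collect the attributes of every img element in a small HTML fragment. It does not run a layout engine, so it only sees declared width and height attributes, not computed positions. The class name ImgCollector and the sample markup are assumptions made for illustration and are not taken from the slides.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the declared attributes of every <img> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lower case.
        if tag == "img":
            self.images.append(dict(attrs))


# Hypothetical page fragment, not taken from the study's data set.
SAMPLE_HTML = """
<div id="header"><img src="/logo.png" width="120" height="40"></div>
<div id="content"><img src="/photo.jpg" width="640" height="480"></div>
"""

collector = ImgCollector()
collector.feed(SAMPLE_HTML)
for img in collector.images:
    print(img["src"], img.get("width"), img.get("height"))
```

A real layout engine would additionally expose rendered position and size, which static attribute parsing cannot provide.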



Background - SVM




 ●   SVM is a supervised machine learning algorithm.
 ●   SVM is used for statistical pattern recognition.
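
To make "supervised" concrete, here is a minimal sketch of a linear SVM trained by batch sub-gradient descent on the regularised hinge loss. The slides do not say which SVM library or kernel was used, so the linear variant, the function names, the hyperparameter values, and the four toy samples (two standardized features per image, label +1 for auxiliary and -1 for content) are all assumptions for illustration.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.05, epochs=500):
    """Minimise lam/2 * |w|^2 + mean hinge loss; labels must be +1 or -1."""
    dim = len(samples[0])
    n = len(samples)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for x, y in zip(samples, labels):
            score = sum(wj * xj for wj, xj in zip(w, x)) + b
            if y * score < 1.0:  # inside the margin: hinge loss is active
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical standardized features (e.g. width, height); not the study's data.
samples = [[-1.0, -1.0], [-0.8, -0.9], [1.0, 1.0], [0.9, 0.8]]
labels = [1, 1, -1, -1]

w, b = train_linear_svm(samples, labels)
print([predict(w, b, x) for x in samples])
```

The labelled samples are what make the method supervised: the model only learns the boundary because each training vector comes with a known class.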




Objective (for now)




To recognize patterns of auxiliary Web images quickly
  using DOM analysis and basic image processing




Methodology
 [Diagram: feature-extraction pipeline]
    ●   HTML / CSS / JS → PyQtWebKit → DOM → jQuery →
        Page Level Features, Domain Level Features
    ●   IMG Files → Python (PIL, OpenCV, Tesseract) →
        Image Level Features
    ●   Labels and all features → MySQL
Methodology (continued)
        ●   Image Level Features
             ●   No. of Colors
             ●   No. of Human Faces
             ●   No. of Alphabetic Characters
        ●   Page Level Features
             ●   Position
             ●   Dimension
             ●   No. of Images with Similar Dimension
        ●   Domain Level Features
             ●   External / Internal Links
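
The sketch below shows one way a few of these features could be computed with the Python standard library only. It is not the authors' implementation: the function names, the pixel-tolerance rule for "similar dimension", and the reading of "External / Internal Links" as a host-name comparison between a link and its page are assumptions. Face and character counts are left out because they need OpenCV and Tesseract.

```python
from urllib.parse import urlparse


def count_colors(pixels):
    """Image level: number of distinct colors in an iterable of (R, G, B) tuples."""
    return len(set(pixels))


def count_similar_dimensions(target, others, tolerance=2):
    """Page level: how many (width, height) pairs in `others` lie within
    `tolerance` pixels of `target` in both dimensions."""
    tw, th = target
    return sum(
        1
        for w, h in others
        if abs(w - tw) <= tolerance and abs(h - th) <= tolerance
    )


def is_external(link_url, page_url):
    """Domain level: True if the link points to a different host than the page.
    Relative URLs have an empty host and count as internal."""
    link_host = urlparse(link_url).netloc
    return link_host not in ("", urlparse(page_url).netloc)


# Hypothetical inputs for illustration only.
page = "http://www.example.com/index.html"
print(count_colors([(255, 255, 255), (255, 255, 255), (0, 0, 0)]))
print(count_similar_dimensions((16, 16), [(16, 16), (17, 15), (640, 480)]))
print(is_external("http://ads.example.net/banner.gif", page))
```

Many same-sized small images on one page often indicate icons or bullets, which is presumably why the similar-dimension count is useful as a feature.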
Methodology (continued)

 [Diagram: training and evaluation flow]
    ●   MySQL → 80% (500/626), randomly selected → SVM (Train) → Model
    ●   MySQL → remaining 20% → SVM (Predict, using the Model) → Results
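
The random split in this flow can be sketched as follows. The function name, the fixed seed, and the use of integer IDs in place of database rows are assumptions for illustration; only the 500-of-626 proportion comes from the slide.

```python
import random


def split_train_test(records, train_size, seed=0):
    """Shuffle a copy of `records` and cut it into training and test parts."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


# 626 labelled images, represented here by hypothetical integer IDs.
records = list(range(626))
train, test = split_train_test(records, train_size=500)
print(len(train), len(test))  # 500 126
```

Shuffling before cutting keeps the training and test sets from inheriting any ordering present in the database, such as images grouped by website.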

Results



   10-fold Cross-Validation (10 Experiments)
          Average Accuracy = 84.92%
     After Applying Grid-Search Technique
          Average Accuracy = 93.17%
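
For readers unfamiliar with the two techniques named here, the sketch below shows how 10-fold index partitions and an exhaustive grid search can be organised. The slides do not say which hyperparameters were searched, so the names C and gamma, the candidate values, and the toy scoring function are hypothetical; in practice `evaluate` would train an SVM on each fold and return the mean cross-validation accuracy.

```python
from itertools import product


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def grid_search(param_grid, evaluate):
    """Try every combination in `param_grid`; return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


folds = list(k_fold_indices(626, k=10))
print([len(test_idx) for _, test_idx in folds])


# Toy stand-in for "mean cross-validation accuracy" (an assumption).
def toy_score(params):
    return -((params["C"] - 10) ** 2) - (params["gamma"] - 0.1) ** 2


best, score = grid_search({"C": [1, 10, 100], "gamma": [0.01, 0.1, 1]}, toy_score)
print(best)
```

The indices are partitioned in order, so the samples should be shuffled beforehand if the stored order is not random.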



Discussion
   ●   Some pages cannot be parsed.
        ●   Frames and redirections
   ●   Positions can be miscalculated.
        ●   JavaScript used in displaying images
        ●   CSS sprites
   ●   Tesseract is not well-tuned.
   ●   Small images have to be magnified, but how much?
   ●   Downloading images for processing is a bottleneck.
   ●   Features are not weighted.
   ●   The definition of “auxiliary image” is subjective.
Future Work



 ●   Context Analysis
 ●   Weighted Features
 ●   Adaptive Page Analysis (Website Categorization)
 ●   Techniques Evaluation / Optimization



Conclusion




Layout analysis and basic image processing techniques alone
  perform well (93.17% average accuracy after grid search),
  but the system can still be improved.




Acknowledgement


    ●   NSTDA, NECTEC, and YSTP program
    ●   Dr. Choochart Haruechaiyasak
    ●   Dr. Toshiaki Kondo
    ●   Mr. Krikamol Muendet
    ●   And Many Others...


