SlideShare ist ein Scribd-Unternehmen logo
1 von 14
Downloaden Sie, um offline zu lesen
Identifying Auxiliary Web Images
Using Combination of Analyses
                                         Tewson Seeoun
             Sirindhorn International Institute of Technology


                                         With Guidance From

                        Asst. Prof. Dr. Toshiaki Kondo
             Sirindhorn International Institute of Technology
                       Dr. Choochart Haruechaiyasak
  Human Language Technology Laboratory, NECTEC, NSTDA
Agenda
    ●   Introduction
    ●   Background
         ●   Document Object Model (DOM) in HTML
         ●   Support Vector Machine (SVM)
    ●   Objective
    ●   Methodology
    ●   Results
    ●   Discussion / Future Work
    ●   Conclusion
        Acknowledgement
                                                   2
    ●
Introduction


       ●   Websites contain images.
       ●   Some images are not necessary.
           ●   Search Engine Indexing
           ●   Printing
       ●   Ignoring them is sometimes
           economical and green.

                                            3
Background - DOM


●   Web browsers / layout engines parse
    HTML / CSS / JavaScript into DOM.
●   DOM represents things (elements) in a Web page.
●   An element has properties (position, size, etc.).
●   JavaScript sees DOM.



                                                        4
Background - SVM




 ●   SVM is a supervised machine learning algorithm
 ●   SVM is used for statistical pattern recognition.




                                                        5
Objective (for now)




To recognize patterns of auxiliary Web images quickly
  using DOM analysis and basic image processing




                                                  6
Methodology
 HTML                                          IMG
          PyQtWebKit             Python
 CSS                    DOM                    Files
  JS


                            jQuery
                                                   PIL
                 Page Level Features
                                                   OpenCV
                Domain Level Features
                                                   Tesseract



       Labels          MySQL           Image Level Features
                                                          7
Methodology (continued)
        ●   Image Level Features
             ●   No. of Colors
             ●   No. of Human Faces
             ●   No. of Alphabets
        ●   Page Level Features
             ●   Position
             ●   Dimension
             ●   No. of Images with Similar Dimension
        ●   Domain Level Features
                 External / Internal Links
                                                        8
             ●
Methodology (continued)

     MySQL     80% (500/626) Randomly-Selected

                              SVM (Train)
        20%
                           Model

         SVM (Predict)



    Results
    Results
     Results

                                                 9
Results



   10-fold Cross-Validation (10 Experiments)
          Average Accuracy = 84.92%
     After Applying Grid-Search Technique
          Average Accuracy = 93.17%



                                               10
Discussion
   ●   Some pages cannot be parsed.
        ●   Frames and redirections
   ●   Positions can be miscalculated.
        ●   JavaScript used in displaying images
        ●   CSS sprites
   ●   Tesseract is not well-tuned.
   ●   Small images have to be magnified, but how much?
   ●   Downloading images for processing is a bottleneck.
   ●   Features are not weighted.
       The definition of “auxiliary image” is subjective.
                                                            11
   ●
Future Work



 ●   Context Analysis
 ●   Weighed Features
 ●   Adaptive Page Analysis (Website Categorization)
 ●   Techniques Evaluation / Optimization



                                                  12
Conclusion




Layout analysis and basic image processing techniques
  alone perform well, but the system could be better.




                                                   13
Acknowledgement


    ●   NSTDA, NECTEC, and YSTP program
    ●   Dr. Choochart Haruechaiyasak
    ●   Dr. Toshiaki Kondo
    ●   Mr. Krikamol Muendet
    ●   And Many Others...


                                          14

Weitere ähnliche Inhalte

Ähnlich wie Identifying Auxiliary Web Images Using Combinations of Analyses

Iasi code camp 12 october 2013 responsive images in the wild-vlad zelinschi
Iasi code camp 12 october 2013 responsive images in the wild-vlad zelinschiIasi code camp 12 october 2013 responsive images in the wild-vlad zelinschi
Iasi code camp 12 october 2013 responsive images in the wild-vlad zelinschi
Codecamp Romania
 
CIKB - Software Architecture Analysis Design
CIKB - Software Architecture Analysis DesignCIKB - Software Architecture Analysis Design
CIKB - Software Architecture Analysis Design
Antonio Castellon
 
Talk Paris Infovis 091207132953 Phpapp01(2)
Talk Paris Infovis 091207132953 Phpapp01(2)Talk Paris Infovis 091207132953 Phpapp01(2)
Talk Paris Infovis 091207132953 Phpapp01(2)
johnnybiz
 

Ähnlich wie Identifying Auxiliary Web Images Using Combinations of Analyses (20)

Architectures For Scaling Ajax
Architectures For Scaling AjaxArchitectures For Scaling Ajax
Architectures For Scaling Ajax
 
Dconrails Gecco Presentation
Dconrails Gecco PresentationDconrails Gecco Presentation
Dconrails Gecco Presentation
 
JS Single-Page Web App Essentials
JS Single-Page Web App EssentialsJS Single-Page Web App Essentials
JS Single-Page Web App Essentials
 
Building assets on the fly with Node.js
Building assets on the fly with Node.jsBuilding assets on the fly with Node.js
Building assets on the fly with Node.js
 
Super Sizing Youtube with Python
Super Sizing Youtube with PythonSuper Sizing Youtube with Python
Super Sizing Youtube with Python
 
Os Solomon
Os SolomonOs Solomon
Os Solomon
 
Iasi code camp 12 october 2013 responsive images in the wild-vlad zelinschi
Iasi code camp 12 october 2013 responsive images in the wild-vlad zelinschiIasi code camp 12 october 2013 responsive images in the wild-vlad zelinschi
Iasi code camp 12 october 2013 responsive images in the wild-vlad zelinschi
 
InfoEducatie - Face Recognition Architecture
InfoEducatie - Face Recognition ArchitectureInfoEducatie - Face Recognition Architecture
InfoEducatie - Face Recognition Architecture
 
Scaling Face Recognition with Big Data - Key Notes at DevTalks Bucharest 2017
Scaling Face Recognition with Big Data - Key Notes at DevTalks Bucharest 2017Scaling Face Recognition with Big Data - Key Notes at DevTalks Bucharest 2017
Scaling Face Recognition with Big Data - Key Notes at DevTalks Bucharest 2017
 
Coding the UI
Coding the UICoding the UI
Coding the UI
 
Coding Ui
Coding UiCoding Ui
Coding Ui
 
Everything is Awesome - Cutting the Corners off the Web
Everything is Awesome - Cutting the Corners off the WebEverything is Awesome - Cutting the Corners off the Web
Everything is Awesome - Cutting the Corners off the Web
 
Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015
 
Yurii Pashchenko: Unlocking the potential of Segment Anything Model (UA)
Yurii Pashchenko: Unlocking the potential of Segment Anything Model (UA)Yurii Pashchenko: Unlocking the potential of Segment Anything Model (UA)
Yurii Pashchenko: Unlocking the potential of Segment Anything Model (UA)
 
CIKB - Software Architecture Analysis Design
CIKB - Software Architecture Analysis DesignCIKB - Software Architecture Analysis Design
CIKB - Software Architecture Analysis Design
 
Using Web Standards to create Interactive Data Visualizations for the Web
Using Web Standards to create Interactive Data Visualizations for the WebUsing Web Standards to create Interactive Data Visualizations for the Web
Using Web Standards to create Interactive Data Visualizations for the Web
 
Talk Paris Infovis 091207132953 Phpapp01(2)
Talk Paris Infovis 091207132953 Phpapp01(2)Talk Paris Infovis 091207132953 Phpapp01(2)
Talk Paris Infovis 091207132953 Phpapp01(2)
 
Performance on a budget
Performance on a budgetPerformance on a budget
Performance on a budget
 
20080611accel
20080611accel20080611accel
20080611accel
 
Asp.Net MVC3 - Basics
Asp.Net MVC3 - BasicsAsp.Net MVC3 - Basics
Asp.Net MVC3 - Basics
 

Kürzlich hochgeladen

Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
KarakKing
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 

Kürzlich hochgeladen (20)

HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Plant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxPlant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptx
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
latest AZ-104 Exam Questions and Answers
latest AZ-104 Exam Questions and Answerslatest AZ-104 Exam Questions and Answers
latest AZ-104 Exam Questions and Answers
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptx
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 

Identifying Auxiliary Web Images Using Combinations of Analyses

  • 1. Identifying Auxiliary Web Images Using Combination of Analyses Tewson Seeoun Sirindhorn International Institute of Technology With Guidance From Asst. Prof. Dr. Toshiaki Kondo Sirindhorn International Institute of Technology Dr. Choochart Haruechaiyasak Human Language Technology Laboratory, NECTEC, NSTDA
  • 2. Agenda ● Introduction ● Background ● Document Object Model (DOM) in HTML ● Support Vector Machine (SVM) ● Objective ● Methodology ● Results ● Discussion / Future Work ● Conclusion Acknowledgement 2 ●
  • 3. Introduction ● Websites contain images. ● Some images are not necessary. ● Search Engine Indexing ● Printing ● Ignoring them is sometimes economical and green. 3
  • 4. Background - DOM ● Web browsers / layout engines parse HTML / CSS / JavaScript into DOM. ● DOM represents things (elements) in a Web page. ● An element has properties (position, size, etc.). ● JavaScript sees DOM. 4
  • 5. Background - SVM ● SVM is a supervised machine learning algorithm ● SVM is used for statistical pattern recognition. 5
  • 6. Objective (for now) To recognize patterns of auxiliary Web images quickly using DOM analysis and basic image processing 6
  • 7. Methodology HTML IMG PyQtWebKit Python CSS DOM Files JS jQuery PIL Page Level Features OpenCV Domain Level Features Tesseract Labels MySQL Image Level Features 7
  • 8. Methodology (continued) ● Image Level Features ● No. of Colors ● No. of Human Faces ● No. of Alphabets ● Page Level Features ● Position ● Dimension ● No. of Images with Similar Dimension ● Domain Level Features External / Internal Links 8 ●
  • 9. Methodology (continued) MySQL 80% (500/626) Randomly-Selected SVM (Train) 20% Model SVM (Predict) Results Results Results 9
  • 10. Results 10-fold Cross-Validation (10 Experiments) Average Accuracy = 84.92% After Applying Grid-Search Technique Average Accuracy = 93.17% 10
  • 11. Discussion ● Some pages cannot be parsed. ● Frames and redirections ● Positions can be miscalculated. ● JavaScript used in displaying images ● CSS sprites ● Tesseract is not well-tuned. ● Small images have to be magnified, but how much? ● Downloading images for processing is a bottleneck. ● Features are not weighted. The definition of “auxiliary image” is subjective. 11 ●
  • 12. Future Work ● Context Analysis ● Weighed Features ● Adaptive Page Analysis (Website Categorization) ● Techniques Evaluation / Optimization 12
  • 13. Conclusion Layout analysis and basic image processing techniques alone perform well, but the system could be better. 13
  • 14. Acknowledgement ● NSTDA, NECTEC, and YSTP program ● Dr. Choochart Haruechaiyasak ● Dr. Toshiaki Kondo ● Mr. Krikamol Muendet ● And Many Others... 14