1 Introduction
Task: WYSIWYG editor
Team
Live example
2 HTML parser choice
3 HTML5::Sanitizer interna
4 HTML5::Sanitizer usage
5 Conclusion
Uwe Voelker HTML5::Sanitizer
Task: WYSIWYG editor
integrate WYSIWYG editor in XING
frontend architect researched open source solutions
none was suitable, mostly for security reasons
decision was made to build it in-house
goals: secure, share profiles (allowed tags) between frontend and backend
Team
Christopher Blum: JavaScript
Ingo Chao: QA (HTML5/CSS)
Uwe Voelker: Perl
Live example
1 Introduction
2 HTML parser choice
CPAN modules
Evaluation
Final decision
3 HTML5::Sanitizer interna
4 HTML5::Sanitizer usage
5 Conclusion
HTML parser on CPAN
HTML::Parser
HTML::TreeBuilder
HTML::TreeBuilder::LibXML
XML::LibXML
HTML::HTML5::Parser
Marpa::HTML
...
started with HTML::HTML5::Parser (HH5P)
because it understands the semantics of HTML5 tags
Processing phases
preprocessing (e.g. migration)
parsing (HTML → DOM tree)
converting (rebuild tree according to profile)
writing (DOM tree → HTML)
Parsing HTML with XML::LibXML
use XML::LibXML;
my $parser = XML::LibXML->new(
    encoding          => 'UTF-8',
    recover           => 2,
    keep_blanks       => 1,
    no_cdata          => 1,
    expand_entities   => 1,
    no_network        => 1,
    suppress_errors   => 1,
    suppress_warnings => 1,
);
my $doc = $parser->parse_html_string(
    $html,
    {
        no_cdata          => 1,
        suppress_errors   => 1,
        suppress_warnings => 1,
    },
);
Converting - rebuilding DOM tree
loop through every node (only ELEMENT and TEXT)
drop unwanted elements completely (e. g. <script>)
change unknown elements to <span>
possibly change the tag name (profile)
transform (or copy) attributes
proceed recursively with child nodes
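The conversion steps above can be sketched on a toy tree. This is a minimal sketch, not the module's code: nodes are plain hashrefs instead of XML::LibXML nodes, `convert_node` and `%PROFILE` are hypothetical names, and copying or checking of existing attributes is left out for brevity.

```perl
use strict;
use warnings;

# Hypothetical profile: tag name => rule. Unknown tags have no entry.
my %PROFILE = (
    script => { remove => 1 },
    p      => {},
    big    => { rename_tag => 'span', set_class => 'big' },
    a      => { set_attributes => { rel => 'nofollow' } },
);

# Rebuild one node according to the profile; returns a new node or undef.
sub convert_node {
    my ($node) = @_;

    # Text nodes are copied as they are.
    return { text => $node->{text} } if exists $node->{text};

    # Anything that is neither text nor element is skipped.
    return unless exists $node->{tag};

    my $rule = $PROFILE{ $node->{tag} };

    # Unknown element: keep the content, but turn the element into a <span>.
    $rule = { rename_tag => 'span' } unless defined $rule;

    # Unwanted element: drop it together with all of its children.
    return if $rule->{remove};

    my %new = (
        tag      => $rule->{rename_tag} // $node->{tag},
        attr     => { %{ $rule->{set_attributes} // {} } },
        children => [],
    );
    $new{attr}{class} = $rule->{set_class} if defined $rule->{set_class};

    # Proceed recursively with the child nodes.
    for my $child (@{ $node->{children} // [] }) {
        my $converted = convert_node($child);
        push @{ $new{children} }, $converted if defined $converted;
    }
    return \%new;
}

my $tree = {
    tag      => 'p',
    children => [
        { text => 'Hello ' },
        { tag => 'blink',  children => [ { text => 'world' } ] },
        { tag => 'script', children => [ { text => 'alert(1)' } ] },
    ],
};

my $clean = convert_node($tree);
print scalar @{ $clean->{children} }, "\n";   # number of surviving children
print $clean->{children}[1]{tag}, "\n";       # tag of the former <blink>
```

Building a new tree, rather than editing the parsed one in place, means that only content explicitly allowed by the profile can reach the output.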
Writing HTML
mainly for additional escapes
could not find a nice way to integrate this in XML::LibXML
$text =~ s/&/&amp;/g;
$text =~ s/'/&#39;/g;   # '
$text =~ s/"/&quot;/g;  # "
$text =~ s/</&lt;/g;
$text =~ s/>/&gt;/g;
$text =~ s/`/&#96;/g;
$text =~ s/\{/&#123;/g;
$text =~ s/\}/&#125;/g;
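Wrapped in a function, the substitutions above might look like the following sketch. `escape_text` is a hypothetical name, not part of the module. The ampersand is replaced first so that the entities produced by the later substitutions are not escaped a second time.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the content of a text node for HTML output.
sub escape_text {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;      # must run first, or "&lt;" would become "&amp;lt;"
    $text =~ s/'/&#39;/g;
    $text =~ s/"/&quot;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/`/&#96;/g;
    $text =~ s/\{/&#123;/g;
    $text =~ s/\}/&#125;/g;
    return $text;
}

print escape_text(q{<a href="x">{{ it's }}</a>}), "\n";
```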
Usage
# construct object
my $sanitizer = HTML5::Sanitizer->new(
    profile => 'My::Profile',
);
# call process()
my $clean = $sanitizer->process($html);
Profile
you have to build your own
class with just one method: element($tag)
return undef or a hashref with:
remove            remove complete subtree (boolean)
rename_tag        rename tag (string)
set_attributes    set these attributes (hashref)
check_attributes  check/transform these attributes (hashref)
set_class         set class (string)
add_class         add class from other attributes (hashref)
Examples - script
completely remove <script> (including all children)
{
    remove => 1,
}
otherwise it would be converted to <span>
and all children processed recursively
Examples - big
<big> → <span class="big">
{
    rename_tag => 'span',
    set_class  => 'big',
}
Examples - a
add rel="nofollow" and target="_blank" to every link
{
    set_attributes => {
        rel    => 'nofollow',
        target => '_blank',
    },
}
Examples - font
rename_tag => 'span',
add_class  => { size => 'size_font' },
sub class_size_font {
    my ($self, $val) = @_;
    return unless $val;
    return 'size-xx-large' if $val eq '7';
    # ...
    return 'size-xx-small' if $val eq '1';
    return 'size-larger'   if $val =~ /^\+/;
    return 'size-smaller'  if $val =~ /^-/;
    return;
}
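Collected into one class, the example entries above could form a small profile. This is a sketch under assumptions: `My::Profile` matches the class name from the usage slide, but the constructor, the `%RULES` table and the idea that an empty hashref means "keep the tag as is" are illustrative guesses, not documented behaviour of HTML5::Sanitizer.

```perl
package My::Profile;
use strict;
use warnings;

# Illustrative rule table built from the slide examples.
my %RULES = (
    script => { remove => 1 },
    big    => { rename_tag => 'span', set_class => 'big' },
    a      => { set_attributes => { rel => 'nofollow', target => '_blank' } },
    font   => { rename_tag => 'span', add_class => { size => 'size_font' } },
    p      => {},    # assumption: empty hashref means "keep as is"
);

sub new { return bless {}, shift }

# The one required method: undef for unknown tags, a hashref otherwise.
sub element {
    my ($self, $tag) = @_;
    return $RULES{$tag};
}

# Maps <font size="..."> values to class names (abridged, as on the slide).
sub class_size_font {
    my ($self, $val) = @_;
    return unless $val;
    return 'size-xx-large' if $val eq '7';
    return 'size-xx-small' if $val eq '1';
    return 'size-larger'   if $val =~ /^\+/;
    return 'size-smaller'  if $val =~ /^-/;
    return;
}

package main;

my $profile = My::Profile->new;
print $profile->element('big')->{rename_tag}, "\n";
print defined( $profile->element('marquee') ) ? "known" : "unknown", "\n";
print $profile->class_size_font('+2'), "\n";
```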
Debugging
if the result is not as expected, you can access intermediate
results:
my $res = $sanitizer->process($html, { return_result
# see HTML5::Sanitizer::Result
say $res->input;
say $res->preprocessed;
say $res->parsed_doc->toString;
say $res->converted_doc->toString;
say $res->output;
print $res->debug_output;