Weitere ähnliche Inhalte Ăhnlich wie Out with Regex, In with Tokens (20) KĂźrzlich hochgeladen (20) Out with Regex, In with Tokens2. Who is this Sean guy?
⢠Web Architect at OmniTI (http://omniti.com/)
⢠Former Editor-in-Chief of php|architect and former
organizer of php|tek
⢠PHP Community, Habari, Phergie
⢠Other conferences (PHP Quebec earlier this year)
⢠the Twitter: @coates
⢠Beer Lover (and brewer)
⢠(I speak too quickly)
3. âA token is a
categorized block of
text. It can look like
anything; it just needs
to be a useful part of
the structured text.â
-Wikipedia
7. $a = 5 + 7 ;
Whitespace
Variable
8. $a = 5 + 7 ;
Assign
Whitespace
Variable
9. $a = 5 + 7 ;
Number
Assign
Whitespace
Variable
10. $a = 5 + 7 ;
Add
Number
Assign
Whitespace
Variable
11. $a = 5 + 7 ;
Add
Number
Assign Number
Whitespace
Variable
12. $a = 5 + 7 ;
Add
Number
Assign Number
Whitespace
Terminator
Variable
17. PHP Example
T_OPEN_TAG <?php
T_VARIABLE $a
T_WHITESPACE
=
T_WHITESPACE
T_LNUMBER 5
T_WHITESPACE
+
T_WHITESPACE
T_LNUMBER 7
;
T_WHITESPACE
T_COMMENT // $b
18. âLexingâ
⢠a Lexer converts a sequence of characters
into tokens
⢠âLexical Analysisâ
⢠Lex, Flex, re2c (lexer generators)
19. Static vs. Dynamic
Analysis
⢠Dynamic: actual execution, practical
implementations such as pen. testing.
⢠Static: analysis of code, tokens, opcodes,
etc. to determine if a particular action will
take place
⢠(not the only use for Tokens, though)
21. Out with Regex
⢠Find all variables
⢠Regex:
/($[a-z_][a-z0-9_]*)/i
⢠context matters:
$str = '$a = 5 + 7; // $b';
22. Regex Fail
<?php
$str = '$a = 5 + 7; // $b';
preg_match_all(
'/($[a-z_][a-z0-9_]*)/i', $str, $m
);
var_dump($m[0]);
24. Out with Regex
⢠Find all variables RONG!
⢠Regex:
/($[a-z_][a-z0-9_]*)/i
⢠context matters:
$str = '$a = 5 + 7; // $b';
26. Token Approach
<?php
// look ma, no regex!
$str = '<?php $a = 5 + 7; // $b';
foreach (token_get_all($str) as $t) {
if (is_array($t) && $t[0] == T_VARIABLE) {
echo $t[1] . quot;nquot;;
}
}
// outputs: $a
27. PHP Example (again)
T_OPEN_TAG <?php
T_VARIABLE $a
T_WHITESPACE
=
T_WHITESPACE
T_LNUMBER 5
T_WHITESPACE
+
T_WHITESPACE
T_LNUMBER 7
;
T_WHITESPACE
T_COMMENT // $b
28. Regex can be complicated
(email validation from MRE)
[040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: (?:
[^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [040t]* (?: ( [^x80-xffn
015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: . [040t]* (?: ( [^x80-xffn015()] *
(?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?!
[^(040)<>@,;:quot;.[]000-037x80-xff]) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xff
n015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* @ [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()]
* (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^
x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn
015()] * )* ) [040t]* )* (?: . [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )*
) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn
015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* | (?: [^(040)<>@,;:quot;.[]000-037x80-
xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [^()<>@,;:quot;.[]x80-xff000-010012-037] * (?: (?: ( [^x80-
xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-
xffn015quot;] * )* quot; ) [^()<>@,;:quot;.[]x80-xff000-010012-037] * )* < [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xff
n015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: @ [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()]
* )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )*
] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: .
[040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?:
[^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^
x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* (?: , [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff]
| ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* @ [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-
xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-
xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) )
[^x80-xffn015()] * )* ) [040t]* )* (?: . [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-
xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?:
( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* )* : [040t]* (?: ( [^
x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )? (?: [^(040)<>@,;:quot;.[]000-
037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-
xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: . [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^
x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-
037x80-xff]) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff]
[^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* @ [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-
xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^
x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )*
(?: . [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?:
[^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^
x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* > )
29. DifďŹcult validation
made simpler
⢠Email validation is haaaard!
⢠Validate logical units separately:
s e a n @ p h p. n e t
30. DifďŹcult validation
made simpler
⢠Email validation is haaaard!
⢠Validate logical units separately:
s e a n @ p h p. n e t
Domain
Localpart Separator
31. DifďŹcult validation
made simpler
⢠Email validation is haaaard!
⢠Validate logical units separately:
s e a n @ p h p. n e t
⢠Still hard, but validation is restricted to
different types of data
⢠BTW, donât bother (-:
32. Regex can be complicated
(email validation from MRE)
[040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: (?:
[^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [040t]* (?: ( [^x80-xffn
015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: . [040t]* (?: ( [^x80-xffn015()] *
(?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?!
[^(040)<>@,;:quot;.[]000-037x80-xff]) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xff
n015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* @ [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()]
* (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^
x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn
015()] * )* ) [040t]* )* (?: . [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )*
) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn
015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* | (?: [^(040)<>@,;:quot;.[]000-037x80-
xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [^()<>@,;:quot;.[]x80-xff000-010012-037] * (?: (?: ( [^x80-
xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-
xffn015quot;] * )* quot; ) [^()<>@,;:quot;.[]x80-xff000-010012-037] * )* < [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xff
n015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: @ [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()]
* )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )*
strpos($email, â@â) !== false
] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: .
[040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?:
[^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^
x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* (?: , [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff]
| ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* @ [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-
xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-
xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) )
[^x80-xffn015()] * )* ) [040t]* )* (?: . [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-
xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?:
( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* )* : [040t]* (?: ( [^
x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )? (?: [^(040)<>@,;:quot;.[]000-
037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-
xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: . [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^
x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-
037x80-xff]) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff]
[^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* @ [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-
xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^
x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )*
(?: . [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?:
[^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^
x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* > )
33. Dirty Little Secret
⢠Most tokenizers (lexers) use regular
expressions to separate tokens
⢠re2c
⢠Multiple ways to represent separators,
whitespace, etc.. simpliďŹed with regex
34. Practical Uses
⢠Compile source code
⢠Simple, contextual replacement (e.g. BBCode)
⢠Friendly line breaks
⢠âCurlyâ quotes, special punctuation
⢠Input validation/stripping
⢠Refactoring
39. Tokenizer in Userspace
⢠token_get_all() returns an array of scalars
and arrays
⢠A bit hard to work with
⢠Needs opening tag (<?php or <? depending
on conďŹg)
40. Tokenizer in Userspace
(Example)
print_r(token_get_all('<?php $a = 5 + 7; // $b'));
Array [2] => Array [5] => Array [8] => Array [11] => Array
( ( ( ( (
[0] => Array [0] => 370 [0] => 305 [0] => 370 [0] => 370
( [1] => [1] => 5 [1] => [1] =>
[0] => 367 [2] => 1 [2] => 1 [2] => 1 [2] => 1
[1] => <?php ) ) ) )
[2] => 1
) [3] => = [6] => Array [9] => Array [12] => Array
[4] => Array ( ( (
[1] => Array ( [0] => 370 [0] => 305 [0] => 365
( [0] => 370 [1] => [1] => 7 [1] => // $b
[0] => 309 [1] => [2] => 1 [2] => 1 [2] => 1
[1] => $a [2] => 1 ) ) )
[2] => 1 )
) [7] => + [10] => ; )
41. Tokenizer in Userspace
(Example)
[0] => Array
(
[0] => 367 Token Number
[1] => <?php Token Text
[2] => 1 Line Number
)
[1] => Array
(
[0] => 309 token_name(309)
[1] => $a == âT_VARIABLEâ
[2] => 1
)
(...)
[3] => = Scalar (not array)
42. Practical Example:
<pre>
Simple Highlighter
<?php
$c = array(
T_VARIABLE => 'red',
T_LNUMBER => 'blue',
);
foreach (token_get_all(fread(STDIN, 9999999)) as $t) {
if (!is_array($t)) {
echo htmlentities($t);
} elseif (!isset($c[$t[0]])) {
echo htmlentities($t[1]);
continue;
} else {
echo '<span style=quot;color: ' . $c[$t[0]] . 'quot;>'
. htmlentities($t[1]) . '</span>';
}
}
?>
</pre>
43. Highlighter Output
<?php
$a = 5 + 7; // $b
<pre>
<?php
<span style=quot;color: redquot;>$a</span> =
<span style=quot;color: bluequot;>5</span> +
<span style=quot;color: bluequot;>7</span>; // $b
</pre>
45. Entities
⢠Hi... I'm Sean
⢠Hi… I’m Sean
⢠Hi⌠Iâm Sean
46. Entities
⢠Here's some code <code>$foo = 'bar';</code>
⢠Here’some code
47. Entities
⢠Here's some code <code>$foo = 'bar';</code>
⢠Here’some code <code>$foo = 'bar';</code>
⢠Hereâs some code <code>$foo = 'bar';</code>
48. Entities
⢠Here's some code <code>$foo = 'bar';</code>
⢠Here’some code <code>$foo = 'bar';</code>
⢠Hereâs some code <code>$foo = 'bar';</code>
49. Tokalizer
⢠PHP token analysis wrapper
⢠Object-oriented
⢠Normalized
⢠Includes a partial parser (in PHP, so itâs
slow). Doesnât work with new 5.3
constructs... yet.
⢠http://github.com/scoates/tokalizer
53. Habariâs HTML
Tokenizer
⢠Filter user input (can strip tags intelligently)
⢠Allow plugins to inject/replace whole
blocks of HTML without (developer-facing)
regex
⢠Facilitate autop, introspection