How to check valid Email? Find using regex.

How to check valid email?
Find using RegEx(p?)
Not only in Ruby
brought to DRUG by Piotr Wasiak, 20.02.2023

Agenda
1. RegEx overview
2. Recommendations
3. Ruby quirks / amenities
4. Tools / Resources
5. Advanced RE(2)
6. Ruby 3.2 RE changes

Who am I?
Piotr Wasiak
Ruby, Rails developer
Current PRUG organiser
Interests:
● climbing, hiking, squash
● contract bridge, chess
● ruby, programming, crypto

Regular Expression
is a character sequence that defines a search pattern.
The purpose is:
● validate the string by the pattern
● get parts of the content (e.g. find or find_and_replace in text editors)
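
Both purposes can be sketched in a few lines of Ruby. The date pattern and the sample text below are illustrative assumptions, not taken from the talk.

```ruby
# Hypothetical pattern for dates written as DD.MM.YYYY.
DATE = /\d{2}\.\d{2}\.\d{4}/

# Purpose 1: validate. Anchors force the whole string to match.
def valid_date?(string)
  string.match?(/\A#{DATE}\z/)
end

# Purpose 2: get (or replace) parts of a larger text.
text = 'Meetups: 20.02.2023 and 20.03.2023'
dates = text.scan(DATE)
masked = text.gsub(DATE, '<date>')

puts valid_date?('20.02.2023') # true
puts dates.inspect             # ["20.02.2023", "20.03.2023"]
puts masked                    # Meetups: <date> and <date>
```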

RegEx history
● The concept of regular languages arose in the 1950s
● Different syntaxes (1980+):
○ POSIX (Basic or Extended Regular Expressions)
○ Perl (influenced/imported to other languages as PCRE 1997, PCRE2 2015)

RegEx as a state machine
Statement validation: /(?<name>ADAM|PIOTR)\s?[=><]{1,2}\s*"(?:PIENIĄDZ|KUKU)"/g
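
A small sketch of how this statement pattern could be used in Ruby. It assumes that the `s?` and `s*` in the extracted slide were originally `\s?` and `\s*`. The `/g` flag comes from online testers and has no Ruby equivalent, so it is dropped here. The sample inputs are made up.

```ruby
# Assumed reconstruction of the slide's pattern (backslashes restored).
STATEMENT = /(?<name>ADAM|PIOTR)\s?[=><]{1,2}\s*"(?:PIENIĄDZ|KUKU)"/

match = STATEMENT.match('PIOTR >= "KUKU"')
puts match[:name]                          # PIOTR (the named group)
puts STATEMENT.match?('ADAM="PIENIĄDZ"')   # true, whitespace is optional
puts STATEMENT.match?('EWA = "KUKU"')      # false, name is not in the alternation
```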

Find RegEx
In replace we can use the whole matched phrase or its groups.
Group numbers are ordered by the index of the opening bracket and are limited to 1-9.
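
The same idea in Ruby, as a minimal sketch with made-up strings: `\0` is the whole match, `\1` to `\9` are groups counted by their opening bracket, and named references avoid the numbering altogether.

```ruby
# Swap two words using numbered groups.
swapped = 'Piotr Wasiak'.sub(/(\w+) (\w+)/, '\2, \1')

# Nested groups: the outer group opens first, so it is group 1.
nested = '2023-02'.sub(/((\d+)-(\d+))/, '[\1] year=\2 month=\3')

# \0 refers to the whole matched phrase.
wrapped = 'regex'.sub(/\w+/, '<\0>')

# Named groups can be referenced with \k<name>.
named = 'Piotr Wasiak'.sub(/(?<first>\w+) (?<last>\w+)/, '\k<last>, \k<first>')

puts swapped # Wasiak, Piotr
puts nested  # [2023-02] year=2023 month=02
puts wrapped # <regex>
puts named   # Wasiak, Piotr
```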

Valid email (1/3)
Rails popular gem solution:

Valid email (2/3)
Email validation:
/(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])/g
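
For comparison, Ruby's standard library already ships a much shorter email pattern, `URI::MailTo::EMAIL_REGEXP`. The sketch below is not from the slides: the helper name `plausible_email?` and the sample addresses are hypothetical. It only checks the shape of the address, not whether the mailbox exists.

```ruby
require 'uri'

# Hypothetical helper wrapping the stdlib pattern.
def plausible_email?(address)
  address.match?(URI::MailTo::EMAIL_REGEXP)
end

puts plausible_email?('piotr@example.com')      # true
puts plausible_email?('no-at-sign.example.com') # false
puts plausible_email?('a@b@c.com')              # false
```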

Simplify valid email

original_regexp = %r{(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f!#-\x5b\]-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[[:alnum:]](?:[a-z0-9-]*[[:alnum:]])?\.)+[[:alnum:]](?:[a-z0-9-]*[[:alnum:]])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[[:alnum:]]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f!-Z\]-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])}
alnum_with_hypen = /[a-z0-9-]/.source # POSIX alternative: /[-[:alnum:]]/
ip_number_type = /25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?/.source
common_parts = /[\x01-\x08\x0b\x0c\x0e-\x1f\]-\x7f]/.source
username_without_backslash_prepended_set = /[#{common_parts}!#-\x5b]/.source
domain_port_unescaped_set = /[#{common_parts}!-Z]/.source
domain_port_escaped_chars_set = /[#{common_parts}\x0e-\x7f]/.source
non_ending_chars = %r{[a-z0-9!#$%&'*+/=?^_`{|}~-]+}.source
final_with_variables = /(?:#{non_ending_chars}(?:\.#{non_ending_chars})*|"(?:#{username_without_backslash_prepended_set}|\\#{domain_port_escaped_chars_set})*")@(?:(?:[[:alnum:]](?:#{alnum_with_hypen}*[[:alnum:]])?\.)+[[:alnum:]](?:#{alnum_with_hypen}*[[:alnum:]])?|\[(?:(?:#{ip_number_type})\.){3}(?:#{ip_number_type}|#{alnum_with_hypen}*[[:alnum:]]:(?:#{domain_port_unescaped_set}|\\#{domain_port_escaped_chars_set})+)\])/

Simplify valid email (more ruby version)

original_regexp = %r{(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f!#-\x5b\]-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[[:alnum:]](?:[a-z0-9-]*[[:alnum:]])?\.)+[[:alnum:]](?:[a-z0-9-]*[[:alnum:]])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[[:alnum:]]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f!-Z\]-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])}
alnum_with_hypen = /[a-z0-9-]/.source # POSIX alternative: /[-[:alnum:]]/
ip_number_type = /25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?/.source
ascii_wo_tabs_cr_nl = /[[:ascii:]&&[^\x09-\x0a\x0d]]/.source
domain_port_escaped_chars_set = /[#{ascii_wo_tabs_cr_nl}\x09\x20"]/.source
domain_port_unescaped_set = /[#{ascii_wo_tabs_cr_nl}&&[^\x20]]/.source
username = /[#{domain_port_unescaped_set}&&[^"]]/.source
non_ending_chars = %r{[a-z0-9!#$%&'*+/=?^_`{|}~-]+}.source
final_with_variables = /(?:#{non_ending_chars}(?:\.#{non_ending_chars})*|"(?:#{username}|\\#{domain_port_escaped_chars_set})*")@(?:(?:[[:alnum:]](?:#{alnum_with_hypen}*[[:alnum:]])?\.)+[[:alnum:]](?:#{alnum_with_hypen}*[[:alnum:]])?|\[(?:(?:#{ip_number_type})\.){3}(?:#{ip_number_type}|#{alnum_with_hypen}*[[:alnum:]]:(?:#{domain_port_unescaped_set}|\\#{domain_port_escaped_chars_set})+)\])/

/x comments mode

original_regexp = %r{ # there is no heredoc for regexp
  (?: # strings with some special chars, but not ending with .
    [a-z0-9!#$%&'*+/=?^_`{|}~-]+
    (?:
      \.[a-z0-9!#$%&'*+/=?^_`{|}~-]+
    )*
  |
    "
    (?: # special chars enquoted
      [\x01-\x08\x0b\x0c\x0e-\x1f!#-\x5b\]-\x7f]
    |
      \\ # prepended with backslash, here escaped
      [\x01-\x09\x0b\x0c\x0e-\x7f] # more special chars
    )*
    " # closing quote
  )
  @ # the most crucial at sign
  (?: # domain regexp
    (?: # at least one subdomain joined and finished with .
      [[:alnum:]]
      (?:
        [a-z0-9-]* # subdomain can have many alphanumeric or - inside
        [[:alnum:]] # subdomain has to finish with alphanumeric char
      )?
      \. # dot separator
    )+
    [[:alnum:]] # domain has to start with alphanumeric char
    (?:
      [a-z0-9-]* # domain can have many alphanumeric or - inside
      [[:alnum:]] # domain has to finish with alphanumeric char
    )?
  | # or a direct IP: 3 numbers with . suffix and some special use cases
    \[ # enclosed in square brackets
    (?:
      (?: # numbers are quite complex in RegEx
        25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]? # 0-255
      )\. # . suffix
    ){3} # 3 times
    (?:
      25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]? # 0-255
    | # or some special use cases
      [a-z0-9-]* # alnums also starting with -
      [[:alnum:]] # finishing without -
      :
      (?:
        [\x01-\x08\x0b\x0c\x0e-\x1f!-Z\]-\x7f] # many chars
      |
        \\ # more ASCII chars prefixed with backslash
        [\x01-\x09\x0b\x0c\x0e-\x7f]
      )+
    )
    \] # closing square bracket
  )
}x # the x switch treats spaces/new lines and `# ` suffixes as comments

Do not overuse regular expression (1/2)
Ruby's simple string methods are faster and more meaningful:
● .start_with? / .end_with?
● .include?('some substring')
● .chomp
● .strip
● .lines
● .split(' ') # without regexp
● .tr(' !?', '1-9')
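
A short sketch of these string methods next to the regex they replace. The sample strings are made up for illustration.

```ruby
filename = 'report_2023.csv'

is_report = filename.start_with?('report')   # instead of /\Areport/
is_csv    = filename.end_with?('.csv')       # instead of /\.csv\z/
has_year  = filename.include?('2023')        # instead of /2023/

chomped = "line\n".chomp                     # instead of sub(/\n\z/, '')
stripped = '  padded  '.strip                # instead of gsub(/\A\s+|\s+\z/, '')
lines = "a\nb\n".lines                       # keeps the line endings
words = 'a b  c'.split(' ')                  # a plain ' ' collapses runs of whitespace
translated = 'hi there!?'.tr(' !?', '1-9')   # maps ' ' to 1, '!' to 2, '?' to 3

puts is_report, is_csv, has_year
puts chomped, stripped, lines.inspect, words.inspect, translated
```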

Do not overuse regular expression (2/2)
Libraries and gems for common concepts:
● URI(url)
+ .host / .path / .query / .fragment
● File(path_to_file)
+ .dirname / .basename / .extname
● Nokogiri::HTML(
URI.open('https://nokogiri.org/')
)
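
A minimal sketch of the standard-library part of this slide. The URL and the file path are made-up examples. Nokogiri is left out because it is an external gem.

```ruby
require 'uri'

url = URI('https://example.com/docs/regexp.html?lang=ruby#flags')
puts url.host     # example.com
puts url.path     # /docs/regexp.html
puts url.query    # lang=ruby
puts url.fragment # flags

path = '/home/piotr/slides/regex.tar.gz'
puts File.dirname(path)  # /home/piotr/slides
puts File.basename(path) # regex.tar.gz
puts File.extname(path)  # .gz
```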

Do not use REGEX as language parser
Programming languages depend more on language nodes and the syntax tree.
There will always be problems with exceptions and different coding styles.
In Ruby we need to use Ripper or other tools to decompose Ruby code into pieces.
Markup and data formats can be parsed more easily and more securely with gems such as Nokogiri, Ox or Oj.
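
A small illustration of why a tokenizer beats a regex here. It is a sketch only: the sample source is invented, and the token scan handles just the simplest `def name` form. `Ripper.lex` returns tokens shaped like `[[line, column], type, text, state]`.

```ruby
require 'ripper'

source = <<~RUBY
  def real_method; end
  text = "def fake_method; end"
RUBY

# The regex cannot tell code from the contents of a string literal.
regex_names = source.scan(/def (\w+)/).flatten

# Ripper tokenizes the source, so the string literal is a single token.
ripper_names = Ripper.lex(source).each_cons(3).filter_map do |kw, _space, ident|
  ident[2] if kw[1] == :on_kw && kw[2] == 'def' && ident[1] == :on_ident
end

puts regex_names.inspect  # ["real_method", "fake_method"]
puts ripper_names.inspect # ["real_method"]
```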

Clear RegEx
● extract common parts in alternation
● put words that are more likely to appear at the front of an alternation
● use comments and whitespace with the /x modifier
● give names to captured groups, and use non-capturing groups as well
● split code into smaller logical pieces
● lint code with ruby -w for warnings
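
Several of these recommendations applied to one made-up pattern (a weekday followed by a time). Both constant names are hypothetical.

```ruby
# Terse version: repeated suffix, anonymous groups, no comments.
TERSE = /(Monday|Tuesday|Wednesday) (\d{2}):(\d{2})/

# Clearer version of the same pattern.
CLEAR = /
  (?<weekday> (?:Mon|Tues|Wednes) day )  # common suffix extracted, inner group non-capturing
  \s                                     # whitespace must be explicit in x mode
  (?<hour> \d{2} ) : (?<minute> \d{2} )  # named groups instead of numbers
/x

match = CLEAR.match('Tuesday 18:30')
puts match[:weekday] # Tuesday
puts match[:hour]    # 18
puts match[:minute]  # 30
```

Running the file with `ruby -w` prints any warnings the parser has about the patterns.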

Interpolation of RegEx: (?mix)
MULTILINE (m)
IGNORECASE (i)
EXTENDED (x)
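
The slide's image is not preserved, so the following is only a sketch of what interpolation does with these flags: when one Regexp is interpolated into another, its flags travel with it as a `(?mix-mix:...)` group. The patterns are made up.

```ruby
inner = /ruby/i
outer = /#{inner} rocks/ # the outer pattern itself is case-sensitive

puts inner.to_s                  # (?i-mx:ruby)
puts outer.source                # (?i-mx:ruby) rocks
puts outer.match?('RUBY rocks')  # true, the inner part ignores case
puts outer.match?('RUBY ROCKS')  # false, the outer part does not

# The three flags are bit constants on Regexp.
all_flags = Regexp::MULTILINE | Regexp::IGNORECASE | Regexp::EXTENDED
puts /x/mix.options == all_flags # true
```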

Joke
Scrabble: what is the longest word that can be built from the combined RE switch letters?
I M N O X

Quirks in Ruby RegEx engine (1/3)
- In general, "dot matches line breaks" mode is turned on with the s flag; Ruby uses the m flag instead.
- In Ruby, ^ and $ always match on every line.
If you want to specify the beginning of the string, use \A.
For the very end of the string, use \z (or \Z to also match just before a final line break).
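
A minimal demonstration of both quirks, with made-up inputs. The second input shows why `^...$` is risky for validation: one matching line is enough.

```ruby
# The dot does not cross line breaks unless the m flag is given.
puts(/a.b/.match?("a\nb"))  # false
puts(/a.b/m.match?("a\nb")) # true

# ^ and $ match at every line, \A and \z only at the string boundaries.
input = "evil();\nsafe"
puts(/^safe$/.match?(input))   # true
puts(/\Asafe\z/.match?(input)) # false

# \Z tolerates one trailing line break, \z does not.
puts(/\Asafe\Z/.match?("safe\n")) # true
puts(/\Asafe\z/.match?("safe\n")) # false
```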

Ruby does not allow
● look-ahead
● negative look-behind
inside a look-behind, such as:
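
The slide's own example is not preserved. As a hedged stand-in, the sketch below asks the engine whether a few made-up patterns compile. `compiles?` is a hypothetical helper, and the result for the nested cases may depend on the Ruby version, so no output is claimed for them.

```ruby
# Hypothetical helper: true if the engine accepts the pattern source.
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

candidates = {
  'plain look-behind'                     => '(?<=a)b',
  'look-ahead inside look-behind'         => '(?<=(?=a))b',
  'negative look-behind inside look-behind' => '(?<=(?<!a))b'
}

candidates.each do |label, source|
  puts "#{label}: #{source} compiles? #{compiles?(source)}"
end
```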

Character class operators
- Intersection […&&[…]]
- Subtraction […&&[^…]]
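
Both operators in a short sketch. The sample strings are illustrative.

```ruby
# Subtraction: lowercase letters except vowels, i.e. consonants.
consonants = 'regular expression'.scan(/[a-z&&[^aeiou]]/).join

# Intersection: characters that are alphanumeric AND in the range 0-5.
low_digits = 'a1b7c3'.scan(/[[:alnum:]&&[0-5]]/)

# Subtraction on a shorthand class: word characters that are not digits.
words = 'abc123def'.scan(/[\w&&[^\d]]+/)

puts consonants         # rglrxprssn
puts low_digits.inspect # ["1", "3"]
puts words.inspect      # ["abc", "def"]
```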

Tools / Websites
● regex101.com/
nicest editor, explanation on hover, cheat sheet, performance analysis
● www.debuggex.com/ visualized graphs with a cheat sheet
● Visualization plugins for Visual Studio Code
● rubocop and rubocop-performance have some rules for regex
● rubular.com/ check if RegEx works in Ruby 2.5. Other with 2.1
● rubyapi.org/3.1/o/regexp good Ruby docs

Backtracking problem
/\d-\d+$/g

Catastrophic backtracking case: /a?ⁿaⁿ/ =~ aⁿ
(n optional a's followed by n required a's, matched against a string of n a's)
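
This case can be reproduced with a small benchmark. It is a sketch: `pathological` is a hypothetical helper name, and the timings depend on the machine and the Ruby version, so none are quoted. A backtracking engine without memoization needs roughly 2ⁿ attempts.

```ruby
require 'benchmark'

# Builds /a?a?...aaa.../ with n optional and n required a's.
def pathological(n)
  Regexp.new('a?' * n + 'a' * n)
end

[5, 10, 15, 20].each do |n|
  input = 'a' * n
  seconds = Benchmark.realtime { pathological(n).match?(input) }
  puts format('n=%2d  %.5fs', n, seconds)
end
```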

PCRE like solutions
“Most modern engines are regex-directed because this is the only way to implement useful features such as lazy quantifiers and backreferences; and atomic grouping and possessive quantifiers that give extra control to backtracking.”

Back to Finite Automaton - (D/N) FA
/abb*a/
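
The slide's diagram is not preserved. A deterministic automaton for /abb*a/ can be sketched as a transition table; the state names are made up. It accepts whole strings, so it corresponds to the anchored pattern /\Aabb*a\z/.

```ruby
# state => { input character => next state }
TRANSITIONS = {
  start:   { 'a' => :seen_a },
  seen_a:  { 'b' => :seen_ab },
  seen_ab: { 'b' => :seen_ab, 'a' => :accept },
  accept:  {}
}.freeze

# Each character is read exactly once: no backtracking is possible.
def dfa_accepts?(input)
  state = :start
  input.each_char do |char|
    state = TRANSITIONS.fetch(state)[char]
    return false if state.nil? # no transition for this character
  end
  state == :accept
end

puts dfa_accepts?('abbba') # true
puts dfa_accepts?('aa')    # false
```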

RegEx to Deterministic Finite Automaton
What RegEx is it?

/(100?)*1/ matches: [ 1010101, 1, 10101, 1001001]
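
The listed strings can be checked quickly in Ruby. Anchors are added here on the assumption that the automaton accepts whole strings; the rejected samples are made up.

```ruby
BINARY_PATTERN = /\A(100?)*1\z/

accepted = %w[1010101 1 10101 1001001]
rejected = %w[0 10 11 1000]

puts accepted.all? { |s| BINARY_PATTERN.match?(s) }  # true
puts rejected.none? { |s| BINARY_PATTERN.match?(s) } # true
```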

6. Ruby 3.2 RE changes
Regexp improvements against ReDoS
It is known that Regexp matching may take unexpectedly long.
If your code attempts to match a possibly inefficient Regexp against an untrusted input, an attacker may exploit it for efficient Denial of Service (so-called Regular expression DoS, or ReDoS).
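
Ruby 3.2 added a match timeout as one defence: `Regexp.timeout=` globally, or the `timeout:` keyword of `Regexp.new` per pattern. The extracted slide text does not show it, so treat the following as a sketch rather than the talk's example. `match_with_timeout?` is a hypothetical helper that falls back to a plain Regexp on older Rubies.

```ruby
# Hypothetical helper: applies a per-pattern timeout where supported.
def match_with_timeout?(source, input, seconds)
  regexp =
    if Regexp.respond_to?(:timeout=) # Ruby 3.2 and later
      Regexp.new(source, timeout: seconds)
    else
      Regexp.new(source)
    end
  regexp.match?(input)
end

puts match_with_timeout?('\Aa+\z', 'aaa', 1.0) # true
puts match_with_timeout?('\Aa+\z', 'aab', 1.0) # false
```

On Ruby 3.2 and later, a match that exceeds the limit raises Regexp::TimeoutError, which the caller can rescue.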

ReDoS improvements (2/2)
Improved Regexp matching algorithm using a memoization technique

Sources
● devopedia.org/regex-engines
● patshaughnessy.net/2012/4/3/ (...) rubys-regular-expression-algorithm
● github.com/google/re2/wiki/Syntax
● optimized re2 called hyperscan
● wiki/Determinizacja_automatu_skonczonego
● regular-expressions.info/refrepeat.html
● rexegg.com/regex-optimizations.html
● bugs.ruby-lang.org/issues/19104 selective memoization

Thanks for listening
What’s your question?
