2. 2
Strings and Languages
• Regular Expressions are an important notation for specifying patterns.
• Alphabet – any finite set of symbols
e.g. ASCII, the binary alphabet {0,1}, Unicode, EBCDIC, Latin-1
• String – A finite sequence of symbols drawn from an alphabet
– Banana (ASCII Alphabet)
– Length of a string => |s|
– Empty String => ε
• Other terms relating to strings: prefix; suffix; substring; proper prefix,
suffix, or substring (non-empty, not entire string); subsequence
• Language – A set of strings over a fixed alphabet
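The string terms above can be made concrete with a minimal sketch. It assumes Python `str` values stand in for strings over an alphabet; the helper names (`prefixes`, `suffixes`, `substrings`, `is_subsequence`) are illustrative and do not come from the slides.

```python
def prefixes(s):
    """All prefixes of s, from the empty string up to s itself."""
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    """All suffixes of s, from s itself down to the empty string."""
    return [s[i:] for i in range(len(s) + 1)]

def proper_prefixes(s):
    """Prefixes that are non-empty and not the entire string."""
    return [p for p in prefixes(s) if p and p != s]

def substrings(s):
    """All substrings of s: what remains after deleting a prefix and a suffix."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def is_subsequence(t, s):
    """True if t can be formed by deleting zero or more symbols of s."""
    remaining = iter(s)
    # 'ch in remaining' advances the iterator, so order is respected.
    return all(ch in remaining for ch in t)

s = "banana"
print(len(s))                      # the length |s|
print(prefixes(s))
print(proper_prefixes(s))
print("nan" in substrings(s))
print(is_subsequence("baaa", s))   # a subsequence need not be contiguous
```

Note that a subsequence such as "baaa" is not a substring of "banana", because its symbols are not adjacent in the original string.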
3. 3
Languages
• A language, L, is simply any set of strings over a
fixed alphabet.
Alphabet                           Languages
{0,1}                              {0,10,100,1000,10000,…}
                                   {0,1,00,11,000,111,…}
{a,b,c}                            {abc,aabbcc,aaabbbccc,…}
{A,…,Z}                            {FOR,WHILE,GOTO,…}
{A,…,Z,a,…,z,0,…,9,+,-,…,<,>,…}    {all legal PASCAL programs}
Special Languages: ∅ - the EMPTY LANGUAGE (contains no strings)
                   {ε} - contains the empty string ε only
6. 6
Operations on Languages
OPERATION                                DEFINITION
union of L and M, written L ∪ M          L ∪ M = {s | s is in L or s is in M}
concatenation of L and M, written LM     LM = {st | s is in L and t is in M}
Kleene closure of L, written L*          L* = ∪_{i=0}^{∞} L^i
                                         L* denotes "zero or more concatenations of" L
positive closure of L, written L+        L+ = ∪_{i=1}^{∞} L^i
                                         L+ denotes "one or more concatenations of" L
Exponentiation: L^0 = {ε}, L^1 = L, L^2 = LL, and in general L^i = L^(i-1) L
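The operations above can be sketched for finite languages. This is a minimal illustration, assuming a language is represented as a Python `set` of `str`; because L* and L+ are infinite whenever L contains a non-empty string, the closures are truncated at a bound `n`. All function names are hypothetical.

```python
def union(L, M):
    """L ∪ M: strings that are in L or in M."""
    return L | M

def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}

def power(L, i):
    """L^i, with L^0 = {ε} (ε is the empty Python string)."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def kleene_upto(L, n):
    """Finite approximation of L*: the union of L^0, L^1, ..., L^n."""
    out = set()
    for i in range(n + 1):
        out |= power(L, i)
    return out

def positive_upto(L, n):
    """Finite approximation of L+: the union of L^1, ..., L^n."""
    out = set()
    for i in range(1, n + 1):
        out |= power(L, i)
    return out

L = {"a", "b"}
M = {"0"}
print(sorted(union(L, M)))
print(sorted(concat(L, M)))
print(sorted(power(L, 2)))
print(sorted(kleene_upto(L, 2)))
```

The only difference between the two closures is the starting index of the union: the Kleene closure starts at L^0 and therefore always contains ε.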
7. 7
Operations on Languages
• Let L be the set of letters and D the set of digits
• L ∪ D is the set of letters and digits
• LD is the set of strings consisting of a
letter followed by a digit
• L^4 is the set of all four-letter strings
• L* is the set of all strings of letters, including ε
• D+ is the set of strings of one or more
digits.
8. 8
Say What?
L = {A, B, C, D } D = {1, 2, 3}
• L ∪ D
{A, B, C, D, 1, 2, 3 }
• LD
{A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3 }
• L^2
{ AA, AB, AC, AD, BA, BB, BC, BD, CA, … DD}
• L*
{ all possible strings over L, plus ε }
• L+
L* − {ε}
• L (L ∪ D)
Valid: { A1, AB, C3, DD } Invalid: { 321, 4A2, AA2 }
• L (L ∪ D)*
Valid: { A, A1, A23, D3, DA3, … } Invalid: { 31 }
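These small cases can be checked mechanically. The sketch below assumes the same L = {A, B, C, D} and D = {1, 2, 3}, represents languages as Python sets, and uses the standard `re` module for the infinite language L (L ∪ D)*; the helper names are illustrative only.

```python
import re

L = {"A", "B", "C", "D"}
D = {"1", "2", "3"}

def concat(X, Y):
    """XY: every string of X followed by every string of Y."""
    return {x + y for x in X for y in Y}

print(sorted(L | D))               # L ∪ D
print(sorted(concat(L, D)))        # LD
print(len(concat(L, L)))           # L^2 has 4 * 4 strings

# L (L ∪ D): exactly one letter followed by exactly one letter or digit.
L_LD = concat(L, L | D)
print(all(len(s) == 2 for s in L_LD))

# L (L ∪ D)*: a letter followed by any number of letters or digits.
# The character classes below encode L and D for this example only.
pattern = re.compile(r"[A-D][A-D1-3]*")

def in_L_LD_star(s):
    """Membership test for L (L ∪ D)* over this example's alphabets."""
    return pattern.fullmatch(s) is not None

for s in ["A", "A1", "A23", "D3", "31"]:
    print(s, in_L_LD_star(s))
```

`fullmatch` is used rather than `search` because membership in a language concerns the entire string, not some substring of it.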
9. 9
Regular Expressions
• A regular expression is a set of rules for
constructing sequences of symbols (strings)
from an alphabet.
• Let Σ be an alphabet and r a regular expression
over Σ. Then L(r) is the language characterized
by the rules of r.
10. 10
Regular Expressions
• Defined over an alphabet Σ
• ε is a regular expression denoting {ε}, the set containing the empty string
• If a is a symbol in Σ, then a is a regular expression
denoting {a}, the set containing the string a
• If r and s are regular expressions denoting the
languages L(r) and L(s), then:
– (r)|(s) is a regular expression denoting L(r) ∪ L(s)
– (r)(s) is a regular expression denoting L(r)L(s)
– (r)* is a regular expression denoting (L(r))*
– (r) is a regular expression denoting L(r)
• Precedence: * (left associative), then concatenation (left
associative), then | (left associative)
11. 11
Regular Expressions
Alphabet = {a, b}
1. a|b denotes {a, b}
2. (a|b)(a|b) denotes {ab, aa, ba, bb}
3. a* denotes {ε, a, aa, …}
4. (a|b)* denotes all strings of a's and b's, including the empty string ε
5. a|a*b denotes the string a, or zero or more a's followed by b
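The examples above, and the effect of precedence on the last one, can be explored by listing the short strings each expression denotes. This is a sketch that assumes Python's `re` syntax coincides with the textbook notation for these simple expressions; `language_upto` is a hypothetical helper, and the bound `n` is needed because most of these languages are infinite.

```python
import re
from itertools import product

def language_upto(regex, alphabet, n):
    """Strings over `alphabet` of length <= n that the whole regex matches."""
    pat = re.compile(regex)
    out = []
    for k in range(n + 1):
        for symbols in product(alphabet, repeat=k):
            s = "".join(symbols)
            if pat.fullmatch(s):
                out.append(s)
    return out

print(language_upto("a|b", "ab", 2))
print(language_upto("(a|b)(a|b)", "ab", 2))
print(language_upto("a*", "ab", 3))
print(language_upto("(a|b)*", "ab", 2))

# * binds tighter than concatenation, which binds tighter than |,
# so a|a*b groups as a | ((a*) b), not as (a|a*) b.
print(language_upto("a|a*b", "ab", 3))
print(language_upto("(a|a*)b", "ab", 3))
```

Comparing the last two lines shows that the string a belongs to the language of a|a*b but not to that of (a|a*)b.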
12. 12
Algebraic Properties of Regular
Expressions
AXIOM                        DESCRIPTION
r | s = s | r                | is commutative
r | (s | t) = (r | s) | t    | is associative
(r s) t = r (s t)            concatenation is associative
r (s | t) = r s | r t        concatenation distributes over |
(s | t) r = s r | t r
εr = r                       ε is the identity element for concatenation
rε = r
r* = (r | ε)*                relation between * and ε
r** = r*                     * is idempotent
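These laws can be spot-checked, though not proved, by comparing the languages of both sides on all short strings. The sketch below assumes Python's `re` module, writes ε as an empty group `()` or an empty alternative, and picks three arbitrary sample expressions for r, s and t; the helper `lang` is hypothetical.

```python
import re
from itertools import product

def lang(regex, alphabet="ab", n=4):
    """Set of strings over `alphabet` of length <= n matched in full by regex."""
    pat = re.compile(regex)
    words = ("".join(p) for k in range(n + 1) for p in product(alphabet, repeat=k))
    return {w for w in words if pat.fullmatch(w)}

# Arbitrary sample expressions, parenthesized so they can be combined safely.
r, s, t = "(ab)", "(b*)", "(a|bb)"

laws = {
    "| is commutative":             (f"{r}|{s}", f"{s}|{r}"),
    "| is associative":             (f"{r}|({s}|{t})", f"({r}|{s})|{t}"),
    "concatenation is associative": (f"({r}{s}){t}", f"{r}({s}{t})"),
    "left distributivity":          (f"{r}({s}|{t})", f"{r}{s}|{r}{t}"),
    "right distributivity":         (f"({s}|{t}){r}", f"{s}{r}|{t}{r}"),
    "ε is a left identity":         (f"(){r}", r),
    "ε is a right identity":        (f"{r}()", r),
    "r* = (r|ε)*":                  (f"{r}*", f"({r}|)*"),
    "* is idempotent":              (f"({r}*)*", f"{r}*"),
}

for name, (lhs, rhs) in laws.items():
    print(name, lang(lhs) == lang(rhs))
```

Agreement on strings up to length 4 is only evidence for one choice of r, s and t; the laws themselves hold for all regular expressions.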
13. 13
Regular Definitions
• Names may be given to regular expressions; these
names can be used like symbols
• Let Σ be an alphabet of basic symbols. A regular
definition is a sequence of definitions of the form
d1 → r1
d2 → r2
. . .
dn → rn
where each di is a distinct name, and each ri is a
regular expression over the symbols in Σ ∪ {d1, d2,
…, di-1}
14. 14
Regular Definitions
• Example 1:
– letter → A|B|…|Z|a|b|…|z
– digit → 0|1|…|9
– id → letter (letter | digit)*
• Example 2
– digit → 0 | 1 | 2 | … | 9
– digits → digit digit*
– optional_fraction → . digits | ε
– optional_exponent → ( E ( + | - | ε ) digits ) | ε
– num → digits optional_fraction optional_exponent
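Because each name is defined only in terms of earlier names, a regular definition can be expanded by plain substitution. The sketch below does this with Python string formatting and the `re` module; the variable names mirror the definitions above, `id_` has a trailing underscore only to avoid shadowing Python's built-in `id`, and ε is written as an empty alternative.

```python
import re

# Example 1: identifiers
letter = r"[A-Za-z]"
digit = r"[0-9]"
id_ = rf"{letter}({letter}|{digit})*"

# Example 2: unsigned numbers
digits = rf"{digit}{digit}*"
optional_fraction = rf"(\.{digits}|)"            # . digits | ε
optional_exponent = rf"(E(\+|-|){digits}|)"      # ( E ( + | - | ε ) digits ) | ε
num = rf"{digits}{optional_fraction}{optional_exponent}"

def matches(regex, s):
    """True if the whole string s belongs to the language of regex."""
    return re.fullmatch(regex, s) is not None

for s in ["42", "6.02E+23", "1E5", ".5", "1.", "1E"]:
    print(s, matches(num, s))

for s in ["x1", "count", "1x"]:
    print(s, matches(id_, s))
```

The order of the assignments matters for the same reason the order of the definitions does: `num` can only be built once `digits`, `optional_fraction` and `optional_exponent` exist.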
15. 15
Regular Definitions
• Shorthand
– One or more instances: r+ denotes rr*
– Zero or one Instance: r? denotes r|ε
– Character classes: [a-z] denotes
a|b|…|z
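Each shorthand is only an abbreviation, so it should denote the same language as its expansion. The sketch below checks this on all short strings, assuming Python's `re` syntax for `+`, `?` and character classes; `lang` is a hypothetical helper and the class [a-c] is used instead of [a-z] to keep the enumeration small.

```python
import re
from itertools import product

def lang(regex, alphabet="abc", n=4):
    """Set of strings over `alphabet` of length <= n matched in full by regex."""
    pat = re.compile(regex)
    words = ("".join(p) for k in range(n + 1) for p in product(alphabet, repeat=k))
    return {w for w in words if pat.fullmatch(w)}

shorthand_pairs = [
    ("(ab)+", "(ab)(ab)*"),   # r+ denotes rr*
    ("(ab)?", "(ab)|"),       # r? denotes r|ε (ε as an empty alternative)
    ("[a-c]", "a|b|c"),       # a character class denotes an alternation
]

for short, expanded in shorthand_pairs:
    print(short, expanded, lang(short) == lang(expanded))
```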
17. 17
Limitations of Regular Expressions
• Some languages cannot be described by any regular
expression
• Cannot describe balanced or nested constructs
– Example, all valid strings of balanced parentheses
– This can be done with a context-free grammar (CFG)
• Cannot describe repeated strings
– Example: {wcw | w is a string of a's and b's}
– This cannot be done with a CFG either
• Regular expressions can denote only a fixed number of
repetitions, or an unspecified number of repetitions,
of a given construct.
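Both example languages are easy to recognize once the recognizer is allowed memory that grows with the input, which is exactly what a regular expression lacks. The sketch below is an illustration under that reading, not a construction from the slides: `balanced` uses an unbounded counter, and `is_wcw` stores the whole of w for comparison. Both function names are hypothetical.

```python
def balanced(s):
    """True if s consists only of parentheses and they are balanced."""
    depth = 0                      # unbounded counter: current nesting depth
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # a ')' with no matching '('
                return False
        else:
            return False           # only parentheses are allowed here
    return depth == 0

def is_wcw(s):
    """True if s has the form w c w, where w is a string of a's and b's."""
    w, sep, rest = s.partition("c")
    # Remembering all of w needs memory proportional to its length.
    return sep == "c" and w == rest and set(w) <= {"a", "b"}

for s in ["", "()", "(()())", "(()", ")("]:
    print(repr(s), balanced(s))

for s in ["c", "abcab", "abcba", "abab"]:
    print(repr(s), is_wcw(s))
```

A regular expression can still describe nesting up to any fixed depth, or w c w for any fixed length of w; it is the unbounded case that falls outside its power.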