Unicode Regular Expressions

Nick Patch
23 January 2013
Unicode Refresher

Unicode attempts to support the characters of the world: a massive task!

It's hard to attach a single meaning to the word “character” but most folks think of characters as the smallest stand-alone components of a writing system.

In Unicode, this sense of characters is represented by one or more code points, which are each stored in one or more bytes.

However, programmers and programming languages tend to think of characters as individual code points, or worse, individual bytes. We need to modernize our habits!

Unicode is not just a big set of characters. It also defines standard properties for each character and standard algorithms for operations such as collation, normalization, and segmentation.
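To make the three levels concrete, here is a minimal JavaScript sketch that counts grapheme clusters, code points, and UTF-8 bytes for a single user-perceived character. It assumes a runtime that provides Intl.Segmenter and TextEncoder (recent Node.js or browsers); the helper name measure is made up for this illustration.

```javascript
// Count grapheme clusters, code points, and UTF-8 bytes in a string.
// `measure` is a hypothetical helper name, not part of any library.
function measure(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return {
    graphemes: [...segmenter.segment(str)].length,
    codePoints: [...str].length,                 // string iteration is by code point
    bytes: new TextEncoder().encode(str).length, // TextEncoder always emits UTF-8
  };
}

// Alpha + psili + varia + ypogegrammeni: one character to a reader.
const sample = '\u03B1\u0313\u0300\u0345';
console.log(measure(sample));
```

The three counts differ for this input, which is why "length" is ambiguous until the unit is named.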
Normalization

NFD(ᾀ◌̀) = α◌̓◌̀◌ͅ
NFC(ᾀ◌̀) = ᾂ

NFD(Чю◌́рлёнис) = Чю◌́рле◌̈нис
NFC(Чю◌́рлёнис) = Чю◌́рлёнис

  ᾂ ≡ ἂ◌ͅ ≡ ᾀ◌̀ ≡ ᾳ◌̓◌̀ ≡
 α◌̓◌̀◌ͅ ≡ α◌̓◌ͅ◌̀ ≡ α◌ͅ◌̓◌̀
             ≠
ᾲ◌̓ ≡ ὰ◌̓◌ͅ ≡ ὰ◌ͅ◌̓ ≡ ᾳ◌̀◌̓ ≡
 α◌̀◌̓◌ͅ ≡ α◌̀◌ͅ◌̓ ≡ α◌ͅ◌̀◌̓
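The equivalence classes above can be checked mechanically. The sketch below uses the built-in String.prototype.normalize (an alternative to a third-party normalization module) and writes each sequence with escapes so the code point order is unambiguous. The array and function names are illustrative only.

```javascript
// The two groups from the slide, written with escapes to avoid ambiguity.
const psiliThenVaria = [
  '\u1F82', // fully precomposed
  '\u1F02\u0345',
  '\u1F80\u0300',
  '\u1FB3\u0313\u0300',
  '\u03B1\u0313\u0300\u0345',
  '\u03B1\u0313\u0345\u0300',
  '\u03B1\u0345\u0313\u0300',
];
const variaThenPsili = [
  '\u1FB2\u0313',
  '\u1F70\u0313\u0345',
  '\u1F70\u0345\u0313',
  '\u1FB3\u0300\u0313',
  '\u03B1\u0300\u0313\u0345',
  '\u03B1\u0300\u0345\u0313',
  '\u03B1\u0345\u0300\u0313',
];

// Canonically equivalent strings collapse to a single normalized form.
const distinctForms = (strings, form) =>
  new Set(strings.map((s) => s.normalize(form)));

console.log(distinctForms(psiliThenVaria, 'NFC').size);
console.log(distinctForms(variaThenPsili, 'NFC').size);
// The two groups stay different after normalization.
console.log(
  psiliThenVaria[0].normalize('NFD') === variaThenPsili[0].normalize('NFD')
);
```

Comparing strings only after normalizing both sides to the same form is the usual way to make equality checks respect canonical equivalence.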
Perl Normalization

use Unicode::Normalize;

say $str;          # ᾀ◌̀
say NFD($str);     # α◌̓◌̀◌ͅ
say NFC($str);     # ᾂ
JavaScript Normalization

var unorm = require('unorm');

console.log($str);              // ᾀ◌̀
console.log(unorm.nfd($str));   // α◌̓◌̀◌ͅ
console.log(unorm.nfc($str));   // ᾂ
PHP Normalization

echo $str;            # ᾀ◌̀

echo Normalizer::normalize($str,
Normalizer::FORM_D); # α◌̓◌̀◌ͅ

echo Normalizer::normalize($str,
Normalizer::FORM_C); # ᾂ
Grapheme Clusters

regex:      /^.$/

string 1:   ᾂ
string 2:   α◌̓◌̀◌ͅ

1. anchor beginning of string
2. match code point (excl. \n)
3. anchor at end of string
4. string 1 succeeds but string 2 fails: mixed results
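The mixed result is easy to reproduce. This is a small JavaScript sketch, assuming a runtime with the u (Unicode) regex flag; the variable names are illustrative.

```javascript
const precomposed = '\u1F82';                  // one code point
const decomposed = '\u03B1\u0313\u0300\u0345'; // four code points, same character

// With the `u` flag, `.` matches exactly one code point (line terminators excluded).
const oneCodePoint = /^.$/u;

console.log(oneCodePoint.test(precomposed));
console.log(oneCodePoint.test(decomposed));

// Normalizing to NFC rescues this particular input, but only because a
// precomposed code point happens to exist for it.
console.log(oneCodePoint.test(decomposed.normalize('NFC')));
```

Normalization helps here by coincidence; many character sequences have no single-code-point form, so the dot still cannot be relied on to match "one character".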
Grapheme Clusters

regex:      /^\X$/

string 1:   ᾂ
string 2:   α◌̓◌̀◌ͅ

1. anchor beginning of string
2. match grapheme cluster
3. anchor at end of string
4. success!
Perl

use v5.12;  # better yet: v5.14
use utf8;
use charnames qw( :full );  # not needed as of v5.16
use open qw( :encoding(UTF-8) :std );

$str =~ /^\X$/;

$str =~ s/^(\X)$/->$1<-/;
PHP

preg_match('/^\X$/u', $str);

preg_replace('/^(\X)$/u', '->$1<-', $str);
JavaScript
[This slide intentionally left blank.]
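JavaScript regular expressions have no \X escape. As a stand-in rather than an equivalent, newer runtimes expose grapheme segmentation through Intl.Segmenter, so a check similar to /^\X$/ can be sketched outside the regex engine. The helper name isSingleGrapheme is hypothetical, and the availability of Intl.Segmenter is an assumption about the runtime.

```javascript
// Hypothetical helper: true when `str` is exactly one extended grapheme
// cluster, roughly what /^\X$/ checks in Perl or PCRE.
function isSingleGrapheme(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  const clusters = [...segmenter.segment(str)];
  return clusters.length === 1;
}

console.log(isSingleGrapheme('\u1F82'));                   // precomposed
console.log(isSingleGrapheme('\u03B1\u0313\u0300\u0345')); // decomposed
console.log(isSingleGrapheme('\u03B1\u03B2'));             // two letters
```

Because segmentation happens outside the pattern, this cannot be combined with other regex atoms the way \X can.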
Match Any Character

two bytes (if byte mode):      е..и
code point (excl. \n):         е.и
code point (incl. \n):         е\p{Any}и
grapheme cluster (incl. \n):   е\Xи
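The difference between the dot and \p{Any} shows up as soon as a newline is involved. A minimal JavaScript sketch, with the Cyrillic letters written as escapes so they cannot be confused with Latin look-alikes:

```javascript
const sameLine = '\u0435\u0436\u0438'; // е ж и
const acrossLines = '\u0435\n\u0438';  // е, newline, и

const dot = /\u0435.\u0438/u;       // any code point except line terminators
const any = /\u0435\p{Any}\u0438/u; // any code point at all

console.log(dot.test(sameLine));
console.log(dot.test(acrossLines));
console.log(any.test(acrossLines));
```

Neither pattern matches a whole grapheme cluster; both still consume exactly one code point between the two letters.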
Match Any Letter

letter code point:        е\p{General_Category=Letter}и
letter code point:        е\pLи
Cyrillic code point:      е\p{Script=Cyrillic}и
Cyrillic code point:      е\p{Cyrillic}и

letter grapheme cluster:  е(?=\pL)\Xи
regex:      / о \p{Cyrillic} т /x

string 1:   който
string 2:   кои◌̆то

1. match letter о
2. match Cyrillic letter (1 code point)
3. match letter т
4. string 1 succeeds but string 2 fails: mixed results
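The same failure can be reproduced in JavaScript. This sketch assumes a runtime with Unicode property escapes (the u flag); JavaScript requires the explicit Script= name, so the short \p{Cyrillic} form from the slide is spelled out.

```javascript
const precomposed = '\u043A\u043E\u0439\u0442\u043E';      // който, й as U+0439
const decomposed = '\u043A\u043E\u0438\u0306\u0442\u043E'; // който, й as и + U+0306

// о, then exactly one Cyrillic code point, then т.
const pattern = /\u043E\p{Script=Cyrillic}\u0442/u;

console.log(pattern.test(precomposed));
console.log(pattern.test(decomposed));
```

In the decomposed string the property escape consumes и, and the combining breve that follows is not т, so the match fails.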
regex:      / о (?= \p{Cyrillic} ) \X т /x

string 1:   който
string 2:   кои◌̆то

1. match letter о
2. positive lookahead Cyrillic letter (1 code point)
3. match grapheme cluster (1+ code points)
4. match letter т
5. success!
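JavaScript has no \X, so the sketch below swaps in a common approximation: one code point followed by any number of combining marks (\p{M}*). That is not the full extended grapheme cluster definition, only a simplification that is enough for this example.

```javascript
const precomposed = '\u043A\u043E\u0439\u0442\u043E';      // който, й as U+0439
const decomposed = '\u043A\u043E\u0438\u0306\u0442\u043E'; // който, й as и + U+0306

// о, lookahead for a Cyrillic code point, then "base + combining marks"
// as an approximation of \X, then т.
const pattern = /\u043E(?=\p{Script=Cyrillic}).\p{M}*\u0442/u;

console.log(pattern.test(precomposed));
console.log(pattern.test(decomposed));
console.log(pattern.test('\u043A\u043E\u0442')); // кот: nothing between о and т
```

The lookahead checks the script of the base code point, while the part after it consumes the whole cluster, which mirrors the structure of the Perl pattern.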
Character Literals

[يی]

(?:ي|ی)

[\x{064A}\x{06CC}]

[\N{ARABIC LETTER YEH}\N{ARABIC LETTER FARSI YEH}]
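For comparison, a JavaScript sketch of the escaped form. The \u{...} syntax plays the role of \x{...} here and requires the u flag. To my knowledge JavaScript has no named-character escape like \N{...}, so the names appear only in comments.

```javascript
const yeh = '\u064A';      // ARABIC LETTER YEH
const farsiYeh = '\u06CC'; // ARABIC LETTER FARSI YEH

// Code point escapes keep the pattern readable in editors that reorder
// right-to-left text around the brackets.
const eitherYeh = /^[\u{064A}\u{06CC}]$/u;

console.log(eitherYeh.test(yeh));
console.log(eitherYeh.test(farsiYeh));
console.log(eitherYeh.test('\u0627')); // ARABIC LETTER ALEF
```

Escapes trade readability of the letters themselves for certainty about which of two nearly identical code points is meant.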
Properties

\p{Script=Latin}

Name: Script
Value: Latin

Match any code point with the value “Latin” for the Script property.

\P{Script=Latin}

Name: Script
Value: not Latin

Negated form: match any code point without the value “Latin” for the Script property.

\p{Latin}

Name: Script (implicit)
Value: Latin

The Script and General Category properties don't require the name because they're so common and their values don't conflict.

\p{General_Category=Letter}

Name: General Category
Value: Letter

Match any code point with the value “Letter” for the General Category property.

\p{gc=Letter}

Name: General Category (gc)
Value: Letter

Property names may be abbreviated.

\p{gc=L}

Name: General Category (gc)
Value: Letter (L)

The General Category property is so commonly used that its values all have standard abbreviations.

\p{L}

Name: General Category (implicit)
Value: Letter (L)

And the General Category values may even be used on their own, like the Script values. These two properties have distinct values.

\pL

Name: General Category (implicit)
Value: Letter (L)

Single-character General Category values don't require curly braces.

\PL

Name: General Category (implicit)
Value: not Letter (L)

Don't forget negation!
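Most of these spellings carry over to JavaScript with the u flag, with two differences worth noting: Script values need an explicit Script= (or sc=) name, and the brace-less \pL form is not accepted. A short sketch of the forms that do work, with illustrative variable names:

```javascript
const latin = /^\p{Script=Latin}$/u;
const notLatin = /^\P{Script=Latin}$/u;
const letterLong = /^\p{General_Category=Letter}$/u;
const letterAbbr = /^\p{gc=L}$/u;
const letterBare = /^\p{L}$/u;
const notLetter = /^\P{L}$/u;

const cyrillicYa = '\u044F'; // я

console.log(latin.test('a'), latin.test(cyrillicYa));
console.log(notLatin.test(cyrillicYa));
console.log(
  letterLong.test(cyrillicYa),
  letterAbbr.test(cyrillicYa),
  letterBare.test(cyrillicYa)
);
console.log(notLetter.test('1'), notLetter.test(cyrillicYa));
```

The three letter patterns are interchangeable; the choice between them is about how explicit the pattern should be for the next reader.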
