Copyright © 2019 Oracle and/or its affiliates. All rights
Regular Expressions with full Unicode support
Martin Hansson
Software Development
MySQL Optimizer Team
The ins and outs of the new regular expression functions and the ICU library
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for
information purposes only, and may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality, and should not be relied upon
in making purchasing decisions. The development, release, and timing of any features or
functionality described for Oracle’s products remains at the sole discretion of Oracle.
What Happened?
Old regexp library (Henry Spencer)
• Does not support Unicode
• Limited Features
• No resource control
• Only Boolean Search
https://mysqlserverteam.com/new-regular-expression-functions-in-mysql-8-0/
Not some niche feature
Feature Requests for Extracting Substring:
Bug#79428 No way to extract a substring matching a regex
Bug#29781 Adding in Pattern Replace (RegExp) for MySQL Engine
Bug#16357 add in functions to do regular expression replacements in a select query
Bug#9105 Regular expression support for Search & Replace
51 “affects me” total
CTE had 59 “affects me”
New Regular Expression Functions
REGEXP_INSTR
REGEXP_LIKE
REGEXP_REPLACE
REGEXP_SUBSTR
Program Agenda
Security
ICU library
Unicode
Working with Unicode in Regular Expressions
Two Security Concerns
Memory Runtime
Security
Cap on runtime
mysql> SELECT regexp_instr(
'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC',
'(A+)+B');
ERROR 3699 (HY000): Timeout exceeded in regular expression match.
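The runaway behaviour behind this error can be reproduced outside the server. What follows is a minimal Python sketch, not MySQL's or ICU's implementation: a hand-written backtracking matcher for the single pattern (A+)+B that counts its steps and gives up once a budget is spent. The names match_nested_plus, StepLimitExceeded and max_steps are hypothetical. In MySQL the corresponding setting is the regexp_time_limit system variable, which is expressed in match-engine steps rather than in seconds.

```python
class StepLimitExceeded(Exception):
    """Raised when the matcher has used up its step budget."""


def match_nested_plus(subject, max_steps):
    """Match (A+)+B at the start of subject by naive backtracking.

    Returns (matched, steps_used) and raises StepLimitExceeded once
    max_steps is exhausted. Hypothetical helper, for illustration only.
    """
    steps = 0

    def after_group(pos):
        # One step: a group A+ has just ended at pos.
        nonlocal steps
        steps += 1
        if steps > max_steps:
            raise StepLimitExceeded(f"gave up after {max_steps} steps")
        # Greedy: try to start another group first, then fall back to B.
        if try_group(pos):
            return True
        return pos < len(subject) and subject[pos] == "B"

    def try_group(pos):
        # Find the longest run of A starting at pos ...
        end = pos
        while end < len(subject) and subject[end] == "A":
            end += 1
        # ... then backtrack through every shorter, non-empty run.
        for stop in range(end, pos, -1):
            if after_group(stop):
                return True
        return False

    return try_group(0), steps


print(match_nested_plus("AAAB", 10_000))          # (True, 1)
print(match_nested_plus("A" * 10 + "C", 10_000))  # (False, 1023)
try:
    match_nested_plus("A" * 39 + "C", 10_000)     # the subject from the slide
except StepLimitExceeded as exc:
    print("aborted:", exc)
```

A failing subject with n leading A characters costs 2^n - 1 steps in this sketch, so ten characters are harmless while the 39 on the slide would need roughly 5.5 * 10^11 steps without a cap.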
Security
Cap on Memory
mysql> SELECT regexp_instr(
'', '(((((((){120}){11}){11}){11}){80}){11}){4}' );
ERROR 3699 (HY000): Timeout exceeded in regular expression match.
mysql> SET GLOBAL regexp_stack_limit = 239;
mysql> SELECT regexp_instr(
'', '(((((((){120}){11}){11}){11}){80}){11}){4}' );
ERROR 3698 (HY000): Overflow in the regular expression backtrack stack.
ICU library
Building ICU
Need three libraries
• i18n library
  – Regular expressions
  – Character sets
• Common library
• Data library
Unicode
UTF-32
ab🍺d
0x00000061 0x00000062 0x0001f37a 0x00000064
UTF-8
ab🍺d
0x61 0x62 0xF09F8DBA 0x64
UTF-16
ab🍺d
0x0061 0x0062 0x3CD87ADF 0x0064
(0x3CD87ADF is the surrogate pair 0xD83C 0xDF7A written in little-endian byte order)
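The three encodings can be checked with a few lines of Python. This is a sketch that assumes the example string is a, b, U+1F37A, d, which is what the code point 0x0001f37a and the byte sequence 0xF09F8DBA on these slides suggest. The helper hex_units is hypothetical.

```python
text = "ab\U0001F37Ad"  # 'a', 'b', U+1F37A BEER MUG, 'd'


def hex_units(data, width):
    """Split bytes into fixed-width groups and render each group as hex."""
    return [data[i:i + width].hex() for i in range(0, len(data), width)]


# UTF-32: always four bytes per code point.
print(hex_units(text.encode("utf-32-be"), 4))
# ['00000061', '00000062', '0001f37a', '00000064']

# UTF-16: two bytes per code unit, two units (a surrogate pair) for U+1F37A.
print(hex_units(text.encode("utf-16-be"), 2))
# ['0061', '0062', 'd83c', 'df7a', '0064']

# UTF-8: one to four bytes per code point.
print([ch.encode("utf-8").hex() for ch in text])
# ['61', '62', 'f09f8dba', '64']

# The same surrogate pair in little-endian byte order.
print("\U0001F37A".encode("utf-16-le").hex())
# 3cd87adf
```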
Under the Hood
• Count code points
• Convert to UTF-16
• Use the C API
• Convert back if needed
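Why counting code points matters can be sketched in Python. The assumption here, not stated on the slide, is that positions reported to the SQL user are in code points while the regex engine works on UTF-16 code units, so any offset coming back from the engine has to be translated. The helpers utf16_length and codepoint_index are hypothetical and do not reflect the server's actual code.

```python
def utf16_length(text):
    """Number of UTF-16 code units needed to represent text."""
    return len(text.encode("utf-16-le")) // 2


def codepoint_index(text, unit_offset):
    """Translate a UTF-16 code-unit offset into a code point index."""
    units = 0
    for index, ch in enumerate(text):
        if units >= unit_offset:
            return index
        # Code points above U+FFFF occupy two code units (a surrogate pair).
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)


text = "ab\U0001F37Ad"
print(len(text))                 # 4 code points
print(utf16_length(text))        # 5 UTF-16 code units
print(codepoint_index(text, 4))  # 3: 'd' is the fourth code point
```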
Working with Unicode in Regular Expressions
Case folding
Simple case sensitivity
mysql> SELECT regexp_like( 'a', '(?i)A' ); # mode modifier
1
mysql> SELECT regexp_like( 'a', 'A', 'i' ); # match_parameter
1
mysql> SELECT regexp_like(
'a' COLLATE utf8mb4_0900_as_cs, 'A' ); # collation
0
Case folding
Simple case sensitivity
mysql> SELECT regexp_like( 'Abc', 'abC', 'c' );
→ 0
mysql> SELECT regexp_like( 'Abc', 'abC', 'i' );
→ 1
Case folding
Case-mapping process
A → a
B → b
C → c
Case folding
Full Case Folding
ß → ss
mysql> SELECT regexp_like( 'ß', '^ss$', 'c' );
→ 0
mysql> SELECT regexp_like( 'ß', '^ss$', 'i' );
→ 1
Case folding
Full Case Folding
ᾛ ⇒ ἣι
U+1F9B GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI
U+1F23 U+03B9 GREEK SMALL LETTER ETA WITH DASIA AND VARIA, GREEK SMALL LETTER IOTA
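Both full case foldings can be reproduced with Python's str.casefold(), which applies Unicode full case folding. This is only an illustration of the mapping itself. Python's own re module performs simple case matching, so unlike the ICU-based functions it does not match ß against ss.

```python
import re

sharp_s = "\u00df"  # ß
eta = "\u1f9b"      # GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI

# Simple lowercasing leaves ß alone; full case folding expands it.
print(sharp_s.lower() == sharp_s)  # True
print(sharp_s.casefold())          # ss

# One code point folds to two.
print([f"U+{ord(c):04X}" for c in eta.casefold()])  # ['U+1F23', 'U+03B9']

# Python's re does not apply full case folding.
print(re.fullmatch("ss", sharp_s, re.IGNORECASE))   # None
```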
Case folding
Has to Look Like a String in order to Match
mysql> SELECT regexp_like( 'ß', '^ss$' );
→ 1
mysql> SELECT regexp_like( 'ß', '^s+$' );
→ 0
mysql> SELECT regexp_like( 'ß', '^s{2}$' );
→ 0
Case folding
Can’t Start Match Within Expanded Character
mysql> SELECT regexp_like( 'ß', 's$' );
→ 0
mysql> SELECT regexp_like( 'ß', '^s' );
→ 0
Case folding
Collations
mysql> select 'ß' collate utf8mb4_de_pb_0900_ai_ci = 'ss'\G
*************************** 1. row ***************************
'ß' collate utf8mb4_de_pb_0900_ai_ci = 'ss': 1
mysql> select 'ß' collate utf8mb4_de_pb_0900_as_cs = 'ss'\G
*************************** 1. row ***************************
'ß' collate utf8mb4_de_pb_0900_as_cs = 'ss': 0
Case folding
Language Dependent Case Folding
mysql> SELECT regexp_like( 'I', 'i' );
→ 1
mysql> SELECT regexp_like( 'İ', 'i' );
→ 0
mysql> SELECT regexp_like( 'I', 'ı' );
→ 0
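These results follow from the language-independent default folding, which maps I to i and knows nothing of the Turkish pairs I/ı and İ/i. Python's str.casefold() applies the same default Unicode folding, so a short sketch can show the mapping behind each result. It illustrates the folding only and is not the server's matching logic.

```python
dotted_capital_i = "\u0130"  # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_small_i = "\u0131"   # ı, LATIN SMALL LETTER DOTLESS I

# Default folding: I and i are equivalent.
print("I".casefold() == "i")  # True

# İ folds to i plus COMBINING DOT ABOVE, which is not a plain i.
print([f"U+{ord(c):04X}" for c in dotted_capital_i.casefold()])
# ['U+0069', 'U+0307']

# ı stays as it is, so it never meets the folded form of I.
print(dotless_small_i.casefold() == "I".casefold())  # False
```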
Beware of Conversion!
mysql> set names latin1;
mysql> create table t1 ( a char ( 10 ) );
mysql> insert into t1 values ( 'å' );
mysql> select a from t1\G
*************************** 1. row ***************************
a: å
mysql> select regexp_like( a, 'å' ) from t1\G
*************************** 1. row ***************************
regexp_like( a, 'å' ): 1
Beware of Conversion!
Use Hex Codes!
mysql> select hex( a ) from t1;
+----------+
| hex( a ) |
+----------+
| C383C2A5 |
+----------+
å is encoded as:
Latin-1: 0xe5
UTF-8: 0xc3a5
Conversion flow
Terminal (UTF-8) sends c3a5 = å
Server converts Latin-1 → UTF-8 on the way in
Table (UTF-8) stores C383C2A5 = Ã¥
Server converts UTF-8 → Latin-1 on the way out
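The round trip can be replayed byte for byte in Python. This sketch assumes the situation shown in the session above: the terminal really sends UTF-8, the connection is declared as latin1, and the table stores UTF-8. The variable names are illustrative.

```python
original = "\u00e5"  # å

sent = original.encode("utf-8")    # the terminal sends c3 a5
misread = sent.decode("latin-1")   # the server takes the bytes for Latin-1: 'Ã¥'
stored = misread.encode("utf-8")   # and converts them to UTF-8 for the table
print(stored.hex().upper())        # C383C2A5

# On the way out the conversion is reversed, so the damage stays hidden.
returned = stored.decode("utf-8").encode("latin-1")
print(returned == sent)                      # True
print(returned.decode("utf-8") == original)  # True: the terminal shows å again
```

Because the two conversions cancel out, the client sees the expected character and only the stored bytes reveal the double encoding, which is why inspecting hex( a ) is the reliable check.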
Power Tip
Use Hex Codes and Character Set Introducers!
mysql> set global character_set_client = utf8mb4;
mysql> select _utf8mb4 0xc3a5, _latin1 0xe5;
+-----------------+--------------+
| _utf8mb4 0xc3a5 | _latin1 0xe5 |
+-----------------+--------------+
| å               | å            |
+-----------------+--------------+
mysql> set global character_set_client = latin1;
mysql> select _utf8mb4 0xc3a5, _latin1 0xe5;
+-----------------+--------------+
| _utf8mb4 0xc3a5 | _latin1 0xe5 |
+-----------------+--------------+
| å               | å            |
+-----------------+--------------+
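The idea behind an introducer, stating the character set together with the raw bytes instead of relying on what the connection assumes, has a direct Python analogue. The sketch below is an analogy only and does not involve MySQL.

```python
# Bytes plus an explicit character set leave nothing to guess.
from_utf8 = bytes.fromhex("c3a5").decode("utf-8")
from_latin1 = bytes.fromhex("e5").decode("latin-1")

print(from_utf8 == from_latin1 == "\u00e5")  # True: both are å
```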
Questions?
Regular Expressions with full Unicode support

Weitere Àhnliche Inhalte

Was ist angesagt?

MySQL partitioning
MySQL partitioning MySQL partitioning
MySQL partitioning OracleMySQL
 
OpenWorld 2018 - 20 years of hints and tips
OpenWorld 2018 - 20 years of hints and tipsOpenWorld 2018 - 20 years of hints and tips
OpenWorld 2018 - 20 years of hints and tipsConnor McDonald
 
Pattern Matching with SQL - APEX World Rotterdam 2019
Pattern Matching with SQL - APEX World Rotterdam 2019Pattern Matching with SQL - APEX World Rotterdam 2019
Pattern Matching with SQL - APEX World Rotterdam 2019Connor McDonald
 
Python and the MySQL Document Store
Python and the MySQL Document StorePython and the MySQL Document Store
Python and the MySQL Document StoreJesper Wisborg Krogh
 
18c and 19c features for DBAs
18c and 19c features for DBAs18c and 19c features for DBAs
18c and 19c features for DBAsConnor McDonald
 
Second Level Cache in JPA Explained
Second Level Cache in JPA ExplainedSecond Level Cache in JPA Explained
Second Level Cache in JPA ExplainedPatrycja Wegrzynowicz
 
DI Frameworks - hidden pearls
DI Frameworks - hidden pearlsDI Frameworks - hidden pearls
DI Frameworks - hidden pearlsSven Ruppert
 
Proxy deep-dive java-one_20151027_001
Proxy deep-dive java-one_20151027_001Proxy deep-dive java-one_20151027_001
Proxy deep-dive java-one_20151027_001Sven Ruppert
 
HOW TO SCALE FROM ZERO TO BILLIONS!
HOW TO SCALE FROM ZERO TO BILLIONS!HOW TO SCALE FROM ZERO TO BILLIONS!
HOW TO SCALE FROM ZERO TO BILLIONS!Maziyar PANAHI
 
Agile Database Development with JSON
Agile Database Development with JSONAgile Database Development with JSON
Agile Database Development with JSONChris Saxon
 
Lazy vs. Eager Loading Strategies in JPA 2.1
Lazy vs. Eager Loading Strategies in JPA 2.1Lazy vs. Eager Loading Strategies in JPA 2.1
Lazy vs. Eager Loading Strategies in JPA 2.1Patrycja Wegrzynowicz
 
Latin America tour 2019 - Flashback
Latin America tour 2019 -  FlashbackLatin America tour 2019 -  Flashback
Latin America tour 2019 - FlashbackConnor McDonald
 
Locking and Concurrency Control
Locking and Concurrency ControlLocking and Concurrency Control
Locking and Concurrency ControlMorgan Tocker
 
MySQL Goes to 8! FOSDEM 2020 Database Track, January 2nd, 2020
MySQL Goes to 8!  FOSDEM 2020 Database Track, January 2nd, 2020MySQL Goes to 8!  FOSDEM 2020 Database Track, January 2nd, 2020
MySQL Goes to 8! FOSDEM 2020 Database Track, January 2nd, 2020Geir HĂžydalsvik
 

Was ist angesagt? (15)

MySQL partitioning
MySQL partitioning MySQL partitioning
MySQL partitioning
 
OpenWorld 2018 - 20 years of hints and tips
OpenWorld 2018 - 20 years of hints and tipsOpenWorld 2018 - 20 years of hints and tips
OpenWorld 2018 - 20 years of hints and tips
 
Pattern Matching with SQL - APEX World Rotterdam 2019
Pattern Matching with SQL - APEX World Rotterdam 2019Pattern Matching with SQL - APEX World Rotterdam 2019
Pattern Matching with SQL - APEX World Rotterdam 2019
 
Python and the MySQL Document Store
Python and the MySQL Document StorePython and the MySQL Document Store
Python and the MySQL Document Store
 
18c and 19c features for DBAs
18c and 19c features for DBAs18c and 19c features for DBAs
18c and 19c features for DBAs
 
Second Level Cache in JPA Explained
Second Level Cache in JPA ExplainedSecond Level Cache in JPA Explained
Second Level Cache in JPA Explained
 
Les02
Les02Les02
Les02
 
DI Frameworks - hidden pearls
DI Frameworks - hidden pearlsDI Frameworks - hidden pearls
DI Frameworks - hidden pearls
 
Proxy deep-dive java-one_20151027_001
Proxy deep-dive java-one_20151027_001Proxy deep-dive java-one_20151027_001
Proxy deep-dive java-one_20151027_001
 
HOW TO SCALE FROM ZERO TO BILLIONS!
HOW TO SCALE FROM ZERO TO BILLIONS!HOW TO SCALE FROM ZERO TO BILLIONS!
HOW TO SCALE FROM ZERO TO BILLIONS!
 
Agile Database Development with JSON
Agile Database Development with JSONAgile Database Development with JSON
Agile Database Development with JSON
 
Lazy vs. Eager Loading Strategies in JPA 2.1
Lazy vs. Eager Loading Strategies in JPA 2.1Lazy vs. Eager Loading Strategies in JPA 2.1
Lazy vs. Eager Loading Strategies in JPA 2.1
 
Latin America tour 2019 - Flashback
Latin America tour 2019 -  FlashbackLatin America tour 2019 -  Flashback
Latin America tour 2019 - Flashback
 
Locking and Concurrency Control
Locking and Concurrency ControlLocking and Concurrency Control
Locking and Concurrency Control
 
MySQL Goes to 8! FOSDEM 2020 Database Track, January 2nd, 2020
MySQL Goes to 8!  FOSDEM 2020 Database Track, January 2nd, 2020MySQL Goes to 8!  FOSDEM 2020 Database Track, January 2nd, 2020
MySQL Goes to 8! FOSDEM 2020 Database Track, January 2nd, 2020
 

Ähnlich wie Regular Expressions with full Unicode support

MySQL 8.0 InnoDB Cluster demo
MySQL 8.0 InnoDB Cluster demoMySQL 8.0 InnoDB Cluster demo
MySQL 8.0 InnoDB Cluster demoKeith Hollman
 
MySQL NoSQL JSON JS Python "Document Store" demo
MySQL NoSQL JSON JS Python "Document Store" demoMySQL NoSQL JSON JS Python "Document Store" demo
MySQL NoSQL JSON JS Python "Document Store" demoKeith Hollman
 
Top 10 SQL Performance tips & tricks for Java Developers
Top 10 SQL Performance tips & tricks for Java DevelopersTop 10 SQL Performance tips & tricks for Java Developers
Top 10 SQL Performance tips & tricks for Java Developersgvenzl
 
20160821 coscup-my sql57docstorelab01
20160821 coscup-my sql57docstorelab0120160821 coscup-my sql57docstorelab01
20160821 coscup-my sql57docstorelab01Ivan Ma
 
MySQL Replication
MySQL ReplicationMySQL Replication
MySQL ReplicationMark Swarbrick
 
Performance Schema and Sys Schema in MySQL 5.7
Performance Schema and Sys Schema in MySQL 5.7Performance Schema and Sys Schema in MySQL 5.7
Performance Schema and Sys Schema in MySQL 5.7Mark Leith
 
Mysql tech day_paris_ps_and_sys
Mysql tech day_paris_ps_and_sysMysql tech day_paris_ps_and_sys
Mysql tech day_paris_ps_and_sysMark Leith
 
MySQL NoSQL APIs
MySQL NoSQL APIsMySQL NoSQL APIs
MySQL NoSQL APIsMorgan Tocker
 
MySQL Troubleshooting with the Performance Schema
MySQL Troubleshooting with the Performance SchemaMySQL Troubleshooting with the Performance Schema
MySQL Troubleshooting with the Performance SchemaSveta Smirnova
 
MySQL8.0 in COSCUP2017
MySQL8.0 in COSCUP2017MySQL8.0 in COSCUP2017
MySQL8.0 in COSCUP2017Shinya Sugiyama
 
APEX Connect 2019 - SQL Tuning 101
APEX Connect 2019 - SQL Tuning 101APEX Connect 2019 - SQL Tuning 101
APEX Connect 2019 - SQL Tuning 101Connor McDonald
 
12 Things Developers Will Love About Oracle Database 12c Release 2
12 Things Developers Will Love About Oracle Database 12c Release 212 Things Developers Will Love About Oracle Database 12c Release 2
12 Things Developers Will Love About Oracle Database 12c Release 2Chris Saxon
 
APEX Connect 2019 - successful application development
APEX Connect 2019 - successful application developmentAPEX Connect 2019 - successful application development
APEX Connect 2019 - successful application developmentConnor McDonald
 
20161029 py con-mysq-lv3
20161029 py con-mysq-lv320161029 py con-mysq-lv3
20161029 py con-mysq-lv3Ivan Ma
 
20190615 hkos-mysql-troubleshootingandperformancev2
20190615 hkos-mysql-troubleshootingandperformancev220190615 hkos-mysql-troubleshootingandperformancev2
20190615 hkos-mysql-troubleshootingandperformancev2Ivan Ma
 
What’s New in Oracle Database 12c for PHP
What’s New in Oracle Database 12c for PHPWhat’s New in Oracle Database 12c for PHP
What’s New in Oracle Database 12c for PHPChristopher Jones
 
Graal and Truffle: One VM to Rule Them All
Graal and Truffle: One VM to Rule Them AllGraal and Truffle: One VM to Rule Them All
Graal and Truffle: One VM to Rule Them AllThomas Wuerthinger
 
Ruby on Rails Oracle adaptera izstrāde
Ruby on Rails Oracle adaptera izstrādeRuby on Rails Oracle adaptera izstrāde
Ruby on Rails Oracle adaptera izstrādeRaimonds Simanovskis
 
MySQL 8 High Availability with InnoDB Clusters
MySQL 8 High Availability with InnoDB ClustersMySQL 8 High Availability with InnoDB Clusters
MySQL 8 High Availability with InnoDB ClustersMiguel AraĂșjo
 

Ähnlich wie Regular Expressions with full Unicode support (20)

MySQL 8.0 InnoDB Cluster demo
MySQL 8.0 InnoDB Cluster demoMySQL 8.0 InnoDB Cluster demo
MySQL 8.0 InnoDB Cluster demo
 
MySQL NoSQL JSON JS Python "Document Store" demo
MySQL NoSQL JSON JS Python "Document Store" demoMySQL NoSQL JSON JS Python "Document Store" demo
MySQL NoSQL JSON JS Python "Document Store" demo
 
Top 10 SQL Performance tips & tricks for Java Developers
Top 10 SQL Performance tips & tricks for Java DevelopersTop 10 SQL Performance tips & tricks for Java Developers
Top 10 SQL Performance tips & tricks for Java Developers
 
20160821 coscup-my sql57docstorelab01
20160821 coscup-my sql57docstorelab0120160821 coscup-my sql57docstorelab01
20160821 coscup-my sql57docstorelab01
 
MySQL Replication
MySQL ReplicationMySQL Replication
MySQL Replication
 
Performance Schema and Sys Schema in MySQL 5.7
Performance Schema and Sys Schema in MySQL 5.7Performance Schema and Sys Schema in MySQL 5.7
Performance Schema and Sys Schema in MySQL 5.7
 
Mysql tech day_paris_ps_and_sys
Mysql tech day_paris_ps_and_sysMysql tech day_paris_ps_and_sys
Mysql tech day_paris_ps_and_sys
 
MySQL NoSQL APIs
MySQL NoSQL APIsMySQL NoSQL APIs
MySQL NoSQL APIs
 
MySQL Troubleshooting with the Performance Schema
MySQL Troubleshooting with the Performance SchemaMySQL Troubleshooting with the Performance Schema
MySQL Troubleshooting with the Performance Schema
 
MySQL8.0 in COSCUP2017
MySQL8.0 in COSCUP2017MySQL8.0 in COSCUP2017
MySQL8.0 in COSCUP2017
 
APEX Connect 2019 - SQL Tuning 101
APEX Connect 2019 - SQL Tuning 101APEX Connect 2019 - SQL Tuning 101
APEX Connect 2019 - SQL Tuning 101
 
12 Things Developers Will Love About Oracle Database 12c Release 2
12 Things Developers Will Love About Oracle Database 12c Release 212 Things Developers Will Love About Oracle Database 12c Release 2
12 Things Developers Will Love About Oracle Database 12c Release 2
 
APEX Connect 2019 - successful application development
APEX Connect 2019 - successful application developmentAPEX Connect 2019 - successful application development
APEX Connect 2019 - successful application development
 
20161029 py con-mysq-lv3
20161029 py con-mysq-lv320161029 py con-mysq-lv3
20161029 py con-mysq-lv3
 
20190615 hkos-mysql-troubleshootingandperformancev2
20190615 hkos-mysql-troubleshootingandperformancev220190615 hkos-mysql-troubleshootingandperformancev2
20190615 hkos-mysql-troubleshootingandperformancev2
 
What’s New in Oracle Database 12c for PHP
What’s New in Oracle Database 12c for PHPWhat’s New in Oracle Database 12c for PHP
What’s New in Oracle Database 12c for PHP
 
Graal and Truffle: One VM to Rule Them All
Graal and Truffle: One VM to Rule Them AllGraal and Truffle: One VM to Rule Them All
Graal and Truffle: One VM to Rule Them All
 
Ruby on Rails Oracle adaptera izstrāde
Ruby on Rails Oracle adaptera izstrādeRuby on Rails Oracle adaptera izstrāde
Ruby on Rails Oracle adaptera izstrāde
 
Rootconf admin101
Rootconf admin101Rootconf admin101
Rootconf admin101
 
MySQL 8 High Availability with InnoDB Clusters
MySQL 8 High Availability with InnoDB ClustersMySQL 8 High Availability with InnoDB Clusters
MySQL 8 High Availability with InnoDB Clusters
 

KĂŒrzlich hochgeladen

Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachBoston Institute of Analytics
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 JustđŸ“Č Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 JustđŸ“Č Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 JustđŸ“Č Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 JustđŸ“Č Call Ruhi Call Girl Phone No Amri...karishmasinghjnh
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
âž„đŸ” 7737669865 đŸ”â–» Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
âž„đŸ” 7737669865 đŸ”â–» Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...âž„đŸ” 7737669865 đŸ”â–» Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
âž„đŸ” 7737669865 đŸ”â–» Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...amitlee9823
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls In Nandini Layout ☎ 7737669865 đŸ„” Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 đŸ„” Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 đŸ„” Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 đŸ„” Book Your One night Standamitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Just Call Vip call girls kakinada Escorts ☎9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎9352988975 Two shot with one girl...gajnagarg
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectBoston Institute of Analytics
 
Just Call Vip call girls Bellary Escorts ☎9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎9352988975 Two shot with one girl ...gajnagarg
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...only4webmaster01
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 đŸ„” Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 đŸ„” Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 đŸ„” Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 đŸ„” Book Your One night Standamitlee9823
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 

KĂŒrzlich hochgeladen (20)

Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 JustđŸ“Č Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 JustđŸ“Č Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 JustđŸ“Č Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 JustđŸ“Č Call Ruhi Call Girl Phone No Amri...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
âž„đŸ” 7737669865 đŸ”â–» Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
âž„đŸ” 7737669865 đŸ”â–» Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...âž„đŸ” 7737669865 đŸ”â–» Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
âž„đŸ” 7737669865 đŸ”â–» Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls In Nandini Layout ☎ 7737669865 đŸ„” Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 đŸ„” Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 đŸ„” Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 đŸ„” Book Your One night Stand
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Just Call Vip call girls kakinada Escorts ☎9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎9352988975 Two shot with one girl...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Just Call Vip call girls Bellary Escorts ☎9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎9352988975 Two shot with one girl ...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Call Girls In Doddaballapur Road ☎ 7737669865 đŸ„” Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 đŸ„” Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 đŸ„” Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 đŸ„” Book Your One night Stand
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 

Regular Expressions with full Unicode support

  • 1. Copyright © 2019 Oracle and/or its affiliates. All rights
  • 2. Regular Expressions with full Unicode support. Martin Hansson, Software Development, MySQL Optimizer Team. The ins and outs of the new regular expression functions and the ICU library
  • 3. Safe Harbor Statement: The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
  • 4. What Happened? Old regexp library (Henry Spencer): • Does not support Unicode • Limited features • No resource control • Only Boolean search. https://mysqlserverteam.com/new-regular-expression-functions-in-mysql-8-0/
  • 5. Not some niche feature. Feature requests for extracting a substring: Bug#79428 No way to extract a substring matching a regex; Bug#29781 Adding in Pattern Replace (RegExp) for MySQL Engine; Bug#16357 add in functions to do regular expression replacements in a select query; Bug#9105 Regular expression support for Search & Replace. 51 “affects me” total; CTE had 59 “affects me”
  • 6. New Regular Expression Functions: REGEXP_INSTR, REGEXP_LIKE, REGEXP_REPLACE, REGEXP_SUBSTR
  • 7. Program Agenda: Security, ICU library, Unicode, Working with Unicode in Regular Expressions
  • 8. Two Security Concerns: Memory, Runtime
  • 9. Security: Cap on runtime
      mysql> SELECT regexp_instr( 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC', '(A+)+B');
      ERROR 3699 (HY000): Timeout exceeded in regular expression match.
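Why a pattern like `(A+)+B` needs a runtime cap can be seen with a toy matcher. The following Python sketch is not ICU's engine: `match_nested_plus` and `StepLimitExceeded` are hypothetical names, and the "steps" counted here are only an illustration of the idea, not the unit that MySQL's time limit uses. It matches `(A+)+B` by naive backtracking and gives up once a step budget is exhausted.

```python
class StepLimitExceeded(Exception):
    """Raised when the toy matcher uses more steps than allowed."""


def match_nested_plus(text, step_limit):
    """Match the pattern (A+)+B at the start of text by naive backtracking.

    Returns (matched, steps). Raises StepLimitExceeded past step_limit.
    Hypothetical helper for illustration only; this is not how ICU works.
    """
    steps = 0

    def groups(pos):
        nonlocal steps
        # Find how far a single greedy A+ could reach from pos.
        end = pos
        while end < len(text) and text[end] == "A":
            end += 1
        # Try the longest A+ first, then backtrack to shorter ones.
        for stop in range(end, pos, -1):
            steps += 1
            if steps > step_limit:
                raise StepLimitExceeded(f"gave up after {step_limit} steps")
            # Either another (A+) group follows, or the final B does.
            if groups(stop):
                return True
            if stop < len(text) and text[stop] == "B":
                return True
        return False

    matched = groups(0)
    return matched, steps


if __name__ == "__main__":
    print(match_nested_plus("AAAB", 100))              # succeeds immediately
    print(match_nested_plus("A" * 10 + "C", 10**6))    # fails after 2**10 - 1 steps
    try:
        match_nested_plus("A" * 20 + "C", 1000)
    except StepLimitExceeded as exc:
        print("aborted:", exc)
```

In this toy model the number of steps doubles with every extra `A` in a non-matching subject, which is why a fixed budget is the practical defence.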
  • 10. Security: Cap on memory
      mysql> SELECT regexp_instr( '', '(((((((){120}){11}){11}){11}){80}){11}){4}' );
      ERROR 3699 (HY000): Timeout exceeded in regular expression match.
      mysql> SET GLOBAL regexp_stack_limit = 239;
      mysql> SELECT regexp_instr( '', '(((((((){120}){11}){11}){11}){80}){11}){4}' );
      ERROR 3698 (HY000): Overflow in the regular expression backtrack stack.
  • 11. Program Agenda: Security, ICU library, Unicode, Working with Unicode in Regular Expressions
  • 12. ICU library
  • 13. Building ICU. Need three libraries: • i18n library (regular expressions, character sets) • Common library • Data library
  • 14. Program Agenda: Security, ICU library, Unicode, Working with Unicode in Regular Expressions
  • 15. UTF-32: “ab🍺d” is encoded as 0x00000061 0x00000062 0x0001f37a 0x00000064
  • 18. Under the Hood: • Count codepoints • Convert to UTF-16 • Use the C API • Convert back if needed
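The "count codepoints, convert to UTF-16" steps can be illustrated in Python. This is only a sketch of the encoding arithmetic, not the server's code: `utf16_offset` is a hypothetical helper, and the sample string assumes the beer-mug character U+1F37A from the UTF-32 slide.

```python
def utf16_offset(text: str, codepoint_index: int) -> int:
    """Number of UTF-16 code units that precede the given code point index.

    Hypothetical helper: a regex engine working on UTF-16 needs this kind of
    mapping to translate a character position into its own buffer offset.
    """
    prefix = text[:codepoint_index]
    return len(prefix.encode("utf-16-be")) // 2


sample = "ab\U0001F37Ad"  # a, b, BEER MUG, d

# The same four code points in the three Unicode encodings.
utf32_hex = sample.encode("utf-32-be").hex(" ", 4)
utf16_hex = sample.encode("utf-16-be").hex(" ", 2)
utf8_hex = sample.encode("utf-8").hex(" ")

if __name__ == "__main__":
    print("UTF-32:", utf32_hex)
    print("UTF-16:", utf16_hex)  # the emoji becomes a surrogate pair
    print("UTF-8: ", utf8_hex)
    # The character 'd' is code point 3, but UTF-16 code unit 4.
    print(utf16_offset(sample, 3))
```

Characters outside the Basic Multilingual Plane take two UTF-16 code units, so code point positions and UTF-16 offsets diverge after the first such character.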
  • 19. Program Agenda: Security, ICU library, Unicode, Working with Unicode in Regular Expressions
  • 20. Case folding: Simple case sensitivity
      mysql> SELECT regexp_like( 'a', '(?i)A' ); # mode modifier → 1
      mysql> SELECT regexp_like( 'a', 'A', 'i' ); # match_parameter → 1
      mysql> SELECT regexp_like( 'a' COLLATE utf8mb4_0900_as_cs, 'A' ); # collation → 0
  • 21. Case folding: Simple case sensitivity
      mysql> SELECT regexp_like( 'Abc', 'abC', 'c' ); → 0
      mysql> SELECT regexp_like( 'Abc', 'abC', 'i' ); → 1
  • 22. Case folding: Case-mapping process. A → a, B → b, C → c
  • 23. Case folding: Full Case Folding. ß → ss
      mysql> SELECT regexp_like( 'ß', '^ss$', 'c' ); → 0
      mysql> SELECT regexp_like( 'ß', '^ss$', 'i' ); → 1
  • 24. Case folding: Full Case Folding. ᾛ ⇒ ἣι: U+1F9B GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI folds to U+1F23 GREEK SMALL LETTER ETA WITH DASIA AND VARIA followed by U+03B9 GREEK SMALL LETTER IOTA
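Full case folding can be tried outside the server as well. Python's `str.casefold()` applies Unicode full case folding, so it serves here as a stand-in for what the slides describe; it says nothing about how ICU or MySQL implement the matching itself.

```python
import unicodedata

eszett = "\u00df"  # LATIN SMALL LETTER SHARP S
eta = "\u1f9b"     # GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI

# Full case folding may expand one character into several.
folded_eszett = eszett.casefold()
folded_eta = eta.casefold()

# Simple lowercasing maps one character to one character and leaves ß alone.
lowered_eszett = eszett.lower()

if __name__ == "__main__":
    print(folded_eszett, len(folded_eszett))
    for ch in folded_eta:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
    print(lowered_eszett == eszett)
```

The expansion from one code point to two is what makes "full" folding different from the one-to-one mapping on the previous slide.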
  • 25. Case folding: Has to look like a string in order to match
      mysql> SELECT regexp_like( 'ß', '^ss$' ); → 1
      mysql> SELECT regexp_like( 'ß', '^s+$' ); → 0
      mysql> SELECT regexp_like( 'ß', '^s{2}$' ); → 0
  • 26. Case folding: Can’t start a match within an expanded character
      mysql> SELECT regexp_like( 'ß', 's$' ); → 0
      mysql> SELECT regexp_like( 'ß', '^s' ); → 0
  • 27. Case folding: Collations
      mysql> select 'ß' collate utf8mb4_de_pb_0900_ai_ci = 'ss'\G
      *************************** 1. row ***************************
      'ß' collate utf8mb4_de_pb_0900_ai_ci = 'ss': 1
      mysql> select 'ß' collate utf8mb4_de_pb_0900_as_cs = 'ss'\G
      *************************** 1. row ***************************
      'ß' collate utf8mb4_de_pb_0900_as_cs = 'ss': 0
  • 28. Case folding: Language-dependent case folding
      mysql> SELECT regexp_like( 'I', 'i' ); → 1
      mysql> SELECT regexp_like( 'İ', 'i' ); → 0
      mysql> SELECT regexp_like( 'I', 'ı' ); → 0
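The three results on the language-dependent slide follow from the default (non-Turkish) folding rules, which can be inspected with Python's `str.casefold()`. This is a sketch of the Unicode default folding only; Python has no Turkish-locale folding built in, and the comparison helper `folds_equal` is a hypothetical name.

```python
def folds_equal(a: str, b: str) -> bool:
    """Compare two strings under Unicode default full case folding."""
    return a.casefold() == b.casefold()


dotted_capital_i = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_small_i = "\u0131"   # LATIN SMALL LETTER DOTLESS I

if __name__ == "__main__":
    # Plain I and i fold together.
    print(folds_equal("I", "i"))
    # Dotted capital I folds to 'i' plus COMBINING DOT ABOVE, not to plain 'i'.
    print([f"U+{ord(c):04X}" for c in dotted_capital_i.casefold()])
    print(folds_equal(dotted_capital_i, "i"))
    # Dotless small i has no default folding, so it never equals I.
    print(folds_equal("I", dotless_small_i))
```

Under Turkish rules the pairs would be I/ı and İ/i instead, which is exactly what the default folding does not give you.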
  • 29. Beware of Conversion!
      mysql> set names latin1;
      mysql> create table t1 ( a char ( 10 ) );
      mysql> insert into t1 values ( 'å' );
      mysql> select a from t1\G
      *************************** 1. row ***************************
      a: å
      mysql> select regexp_like( a, 'å' ) from t1\G
      *************************** 1. row ***************************
      regexp_like( a, 'å' ): 1
  • 30. Beware of Conversion! Use Hex Codes!
      mysql> select hex( a ) from t1;
      +----------+
      | hex( a ) |
      +----------+
      | C383C2A5 |
      +----------+
  • 31. Beware of Conversion! Use Hex Codes!
      mysql> select hex( a ) from t1;
      +----------+
      | hex( a ) |
      +----------+
      | C383C2A5 |
      +----------+
      å is encoded as: Latin-1: 0xe5, UTF-8: 0xc3a5
  • 32. Conversion flow (diagram): the terminal (UTF-8) sends å as c3a5 → the server converts Latin-1 → UTF-8 → the table (UTF-8) stores C383C2A5 (= Ã¥) → the server converts UTF-8 → Latin-1 on the way back to the terminal
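The conversion flow can be replayed byte for byte in Python. This sketch only models the character set conversions described on the slide (a UTF-8 terminal talking to a server that was told the client speaks Latin-1); the variable names are illustrative, and no MySQL connection is involved.

```python
typed = "\u00e5"  # the Swedish letter a-with-ring typed on the terminal

# 1. The UTF-8 terminal sends the character as two bytes.
sent = typed.encode("utf-8")

# 2. The server was told the client speaks Latin-1, so it reads two characters.
misread = sent.decode("latin-1")

# 3. The table is UTF-8, so those two characters are converted and stored.
stored = misread.encode("utf-8")

# 4. On SELECT the server converts back to Latin-1 for the client ...
returned = stored.decode("utf-8").encode("latin-1")

# 5. ... and the UTF-8 terminal renders those bytes as the original letter.
displayed = returned.decode("utf-8")

if __name__ == "__main__":
    print(sent.hex())            # what left the terminal
    print(stored.hex().upper())  # what hex( a ) reports
    print(displayed == typed)    # the round trip hides the damage
```

Because every byte sequence is valid Latin-1, none of these steps raises an error, which is why only the stored hex value reveals the double encoding.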
  • 33. Power Tip: Use Hex Codes and Character Set Introducers!
      mysql> set global character_set_client = utf8mb4;
      mysql> select _utf8mb4 0xc3a5, _latin1 0xe5;
      +-----------------+--------------+
      | _utf8mb4 0xc3a5 | _latin1 0xe5 |
      +-----------------+--------------+
      | å               | å            |
      +-----------------+--------------+
      mysql> set global character_set_client = latin1;
      mysql> select _utf8mb4 0xc3a5, _latin1 0xe5;
      +-----------------+--------------+
      | _utf8mb4 0xc3a5 | _latin1 0xe5 |
      +-----------------+--------------+
      | å               | å            |
      +-----------------+--------------+
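The same "hex code plus explicit character set" idea works when inspecting values outside the server. In this Python sketch, `decode_as` plays the role of a character set introducer and `undo_double_encoding` is a hypothetical repair helper that assumes the value was UTF-8 misread as Latin-1 exactly once; neither corresponds to a MySQL function.

```python
def decode_as(hex_bytes: str, charset: str) -> str:
    """Interpret a hex string under an explicitly named character set."""
    return bytes.fromhex(hex_bytes).decode(charset)


def undo_double_encoding(hex_bytes: str) -> str:
    """Recover text that was UTF-8, misread as Latin-1, and re-encoded as UTF-8.

    Assumption: exactly one such misreading happened. Raises UnicodeError
    if the bytes do not fit that pattern.
    """
    once = bytes.fromhex(hex_bytes).decode("utf-8")
    return once.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    # Both spellings name the same character, regardless of client settings.
    print(decode_as("c3a5", "utf-8") == decode_as("e5", "latin-1"))
    # The value found in the table decodes to two characters ...
    print(len(decode_as("C383C2A5", "utf-8")))
    # ... and collapses back to one once the misreading is undone.
    print(undo_double_encoding("C383C2A5") == decode_as("e5", "latin-1"))
```

Naming the character set next to the bytes removes any dependence on terminal or connection settings, which is the point of the tip.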
  • 34. Questions?

Editor's notes

  1. I have worked with MySQL since time immemorial, back in the MySQL AB days. I work from Uppsala, Sweden, the former head office of MySQL. Swedish is my native tongue, which makes a difference, as you will see.
  3. So what’s all this about? We switched our regex library in 8.0.4. At the time I blogged about it here. The old one was written by Henry Spencer in 1986. It is simply called regexp, and it is a very good regex library. It has been used widely in the Unix realm and is part of the POSIX standard. It is also called “the book regexp library” because he updated it for the book Software Solutions in C in 1994. It made its way to Tcl, Postgres and even early Perl. Apparently Postgres still uses it. Really good. Great performance. But it is ASCII only: it works byte by byte. It lacks many features. It is not safe: it is easy to put it into an infinite loop. You can only do Boolean search, not matching, because it doesn’t have a pattern buffer out of the box. Hence it doesn’t support search-replace.
  4. And this was quite a popular request. There were four feature-request bugs asking for the matched substring alone. We had 51 “Affects me” in total. CTE had 59, but that’s a really popular feature.
  5. Now we have four functions. Instr → position, before or after the match. Like → Boolean. Replace → replaces a match, with captures. Substr → the matched substring.
  6. So here’s the agenda. On top we have security, which is why we chose ICU. Perhaps not the obvious choice given the candidates. It also has close ties to Unicode. What I won’t cover here is all the features of regular expressions. These are documented in our manual, and you can always head to the ICU documentation. My ambition is to teach you how to work efficiently and securely with Unicode and to give some insight into where common wisdom breaks down. I presented here 3 years ago and had a really good time, so I wanted to go again. I asked my boss what I should talk about, since I haven’t really added anything new since last time. All I could think of was the new regular expressions. “Tell ’em about that,” he said. “They’ll love that.” So I submitted this talk as a 20-minute presentation. Not only did it get accepted, it got upgraded to 30 minutes. I couldn’t think of much to say, so I asked around: “What do YOU want to know about regular expressions with Unicode?” Nobody had a clue. So that’s why I just picked some common pitfalls that I consider tricky.
  7. A malicious user can exploit regex matching by exhausting the memory or by creating an infinite loop that consumes all the CPU time.
  8. Out of the box there’s always a cap on runtime. Runtime is specified in “steps” of the match engine, which is a bit vacuous. Correspondence with actual processor time will depend on the speed of the processor and the details of the specific pattern, but will typically be on the order of milliseconds. In the example: match the first A, capture, then repeat that match. Backtrack, match the second, repeat that, and so on. Eventually it fails because of the C. The limit is set conservatively to 32 (secure by default).
  9. Here I’m trying to run out of memory. I really have to provoke it, because I reach the time limit first. Match the empty string 120 times, repeat that 11 times, repeat that 11 times, etc. The limit applies to the backtracking stack used by the engine, in bytes. Here I’m choking it to 239 bytes. The default size is 8 MB (8000000 bytes). I never managed to DoS the server.
  10. So, about the ICU library.
  11. What is the ICU library? It is a set of i18n libraries. What they provide is globalization support and Unicode for software applications. They have an open source license. From what I gather it is compatible with GNU, but IANAL. Used by Java, Apple, Amazon, IBM.
 The Unicode Consortium is mostly known for emoji nowadays. New releases of Unicode typically contain new emojis, and so you have to be able to search for them. Haha-papa a.k.a. the Sushi-beer bug. And so regexps have to support them. 💬 5 billion emojis are sent daily on Facebook Messenger. 📸 By mid-2015, half of all comments on Instagram included an emoji. 🍑 Only 7% of people use the peach emoji as a fruit; the rest mostly use it as a butt or for other non-fruit uses. According to Emojipedia. In a sense ICU is Unicode: support for all of Unicode.
  12. We ship ICU with MySQL, and optionally build it bundled. We ship 59.1. I notice Ubuntu 18.04 ships 60. There’s the internationalization library, which contains regexp and charsets. That is all we use right now, and all we bundle. The common library contains things like the break iterator, which helps to work with grapheme clusters. I won’t go into grapheme clusters in this presentation. We don’t handle those yet. The data library is not used currently, and we don’t ship it. It is fairly big and not needed for regexp.
  13. Let me tell you a bit about Unicode. It specifies three encodings.
  14. UTF-32: + constant size; + maps 1-to-1 to Unicode code points; - space consuming.
  15. UTF-8: + optimized for Western ASCII; + small (for Western text); + self-synchronizing (what isn’t???); - variable size. De-facto standard for the web: 92.9%.
  16. UTF-16: generally regarded as the worst of both worlds. - Bigger than UTF-8; - not fixed like UTF-32; + more of it is constant size (what? which planes?); + also self-synchronizing. Surrogate pairs. Broken in Java. How? Alas, used by ICU.
  17. So the way we use ICU is: unless you start on the first character, we count the code points before the start position. Then we convert the rest to UTF-16 and search with ICU. We use ICU’s C API. There is also a C++ API.
  18. So, I have two examples of how to work with Unicode.
  19. You can specify case sensitivity in three ways. Mode modifiers inside the regexp have the highest priority. If there are no mode modifiers, match_parameter is used. It is a string of modifiers: 'c' means case-sensitive, 'i' means case-insensitive. If there are none of those, we look at the collation. There are rules for computing which collation should be used in any comparison, and they apply here.
  20. Case insensitivity seems simple at first. Text is normalized by transforming it to the same case, and then compared. On the next slide we see how such a case mapping could look.
  21. Totally obvious, right? One character maps to exactly one character. This is called simple case-insensitive matching. Well, there are some trickier cases.
  22. The German Eszett is generally understood to be equivalent to two s’es. So in full case-insensitive matching they should be equal. Since there is no Eszett in any other language, this folding is part of the default. I could go on all day about case mapping; it’s a 61-page document in the Unicode standard. But these are the essentials.
  23. This example is a little more complicated for me. Here one letter obviously maps to two letters. Actual letters, not just code points. If you paste them and press backspace, the little iota (ι) goes away. In this case they’re different. It works the same way. It’s all Greek to me.
  24. Full case folding is used when the pattern contains anything that looks like a character string, even just one character.
  25. A match can never start within an expanded character. The patterns here would force a match that would 1) start in the middle or 2) end in the middle.
  26. This is consistent with how collations work with the equals predicate. The collation name is hard to read: charset, language code, pb (German phone-book order), accent sensitivity, case sensitivity.
  27. Case folding can also be language dependent. In the default case folding, capital I folds to small i with dot. However, in the Turkish case folding, a dotted capital I is case folded to dotted lowercase i, and dotless capital I folds to dotless lowercase ı. In a Turkish locale, the default folding is actually wrong.
  28. Another problem with full Unicode and regexp: you need to be careful when you send non-ASCII data from a client. Here is a cautionary tale. Here I changed the variables character_set_connection, character_set_client and character_set_results, which is what SET NAMES does. So, I create a table. I populate it with the Swedish letter å. I read it back. I check with a regular expression match. So, everything is fine, right? Let’s do a “trust but verify” here. I want to see what’s actually in the table. The problem is that it will always be converted to my character set. I want to apply a function to it on the server side. The problem is, all functions will also convert their arguments. What to do? All functions save one: the hex() function. It will tell the truth.
  29. So here we have... what? Is this really an a with ring? Let’s check.
  30. This is not å in any encoding. What is going on?
  31. My terminal is UTF-8. So, when I type å on my Swedish keyboard, it sends c3a5 to the server. Now, when I set character_set_client, what I really said was “interpret as latin1”. Fine: c3a5, that’s A-tilde and yen (Ã¥). It stores that. But the table stores utf8, so let’s convert. And that becomes C383C2A5. Now, when I do a select, it reads character_set_results: oh yeah, you speak Latin-1, let me translate for ya. And so we’re back full circle. This is especially tricky with Latin-1 since any byte sequence is valid Latin-1. No check fails.
  32. So here’s a power tip for troubleshooting your multilingual regexps. If you use hex codes and character set introducers, it’s totally unambiguous. As you see here.