Scheiß encoding
Encodings and the like

Andreas Heigl
@heiglandreas

History

Beginnings

ASCII

ISO

Unicode

Beginnings

Punched cards

No need for "script"

ASCII

American Standard Code for Information Interchange

Created in 1963

7-bit character set (128 characters)

ASCII

     0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
0    (control characters)
1    (control characters)
2    SP !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /
3    0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?
4    @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
5    P  Q  R  S  T  U  V  W  X  Y  Z  [  \  ]  ^  _
6    `  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o
7    p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~  DEL
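
The table is read as row = high nibble, column = low nibble. A minimal sketch of that reading in PHP, using only the built-in `chr()` and `ord()`; the variable names are chosen for illustration:

```php
<?php
// Row 4, column 1 of the table is the byte 0x41.
$byte = (0x4 << 4) | 0x1;
$upper = chr($byte);
echo $upper, PHP_EOL;                    // A

// Upper- and lower-case letters sit two rows apart, so they differ by 0x20.
$lower = chr(ord($upper) | 0x20);
echo $lower, PHP_EOL;                    // a

// Every ASCII value fits into 7 bits; the highest one is 0x7F (DEL).
$fitsInSevenBits = ord('~') <= 0x7F;
var_dump($fitsInSevenBits);              // bool(true)
```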

ISO

International Organization for Standardization

ASCII-compatible

8-bit character set (256 characters)

ISO-8859

-1   Latin-1, Western European
-2   Latin-2, Central European
-3   Latin-3, South European
-4   Latin-4, North European
-5   Cyrillic
-6   Arabic
-7   Greek
-8   Hebrew
-9   Latin-5, Turkish
-10  Latin-6, Nordic
-11  Thai
-12  does not exist
-13  Latin-7, Baltic
-14  Latin-8, Celtic
-15  Latin-9, Western European
-16  Latin-10, South-Eastern European
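
Because all parts share the same 256 byte values, one byte means different characters depending on the part. The following sketch decodes the byte 0xE4 with three parts; it assumes the mbstring extension is loaded and that it knows the three encoding names used here:

```php
<?php
// One and the same byte ...
$byte = "\xE4";

// ... decoded with three different ISO-8859 parts and converted to UTF-8.
$latin1   = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1'); // ä (U+00E4)
$cyrillic = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-5'); // ф (U+0444)
$greek    = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-7'); // δ (U+03B4)

echo $latin1, ' ', $cyrillic, ' ', $greek, PHP_EOL;
```

Without knowing which part was used when the text was written, the byte alone cannot be interpreted reliably.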

ISO-8859-1

    0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
8   (no printable characters)
9   (no printable characters)
A   NBSP ¡    ¢    £    ¤    ¥    ¦    §    ¨    ©    ª    «    ¬    SHY  ®    ¯
B   °    ±    ²    ³    ´    µ    ¶    ·    ¸    ¹    º    »    ¼    ½    ¾    ¿
C   À    Á    Â    Ã    Ä    Å    Æ    Ç    È    É    Ê    Ë    Ì    Í    Î    Ï
D   Ð    Ñ    Ò    Ó    Ô    Õ    Ö    ×    Ø    Ù    Ú    Û    Ü    Ý    Þ    ß
E   à    á    â    ã    ä    å    æ    ç    è    é    ê    ë    ì    í    î    ï
F   ð    ñ    ò    ó    ô    õ    ö    ÷    ø    ù    ú    û    ü    ý    þ    ÿ

Other encodings

MacRoman

WindowsLatin

...

Unicode

Created in 1991 (1988)

Multi-byte character set (currently 1,114,112 code points)

ASCII/ISO-8859-1 compatible

Different encodings (UTF-8, UTF-16, UTF-32, UTF-EBCDIC, .... )

Unicode

U+0000 - U+10FFFF

U+0000 - U+007F -> ASCII

U+0080 - U+00FF -> ISO-8859-1

Several encodings are possible
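
The compatibility with ASCII and ISO-8859-1 can be checked directly on the code points. A small sketch, assuming PHP 7.2 or later with mbstring (for `mb_ord()`); the characters are written as `\u{...}` escapes so the encoding of the source file does not matter:

```php
<?php
// mb_ord() returns the Unicode code point of the first character.
$codePointA  = mb_ord('A', 'UTF-8');          // 0x41, the same value as in ASCII
$codePointAe = mb_ord("\u{00E4}", 'UTF-8');   // ä: 0xE4, the same value as the ISO-8859-1 byte

printf("U+%04X U+%04X\n", $codePointA, $codePointAe);   // U+0041 U+00E4

// U+0000 to U+10FFFF gives 0x110000 code points in total.
$total = 0x10FFFF + 1;
echo $total, PHP_EOL;                         // 1114112
```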

UTF-32

Every character is 4 bytes long

Simple

Memory-intensive
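
Both properties, simple and memory-intensive, show up in a few lines. This sketch assumes mbstring and uses the big-endian variant `UTF-32BE`; the sample text is made up for illustration:

```php
<?php
$text  = "y\u{00E4}\u{20AC}";    // y, ä and the euro sign
$utf32 = mb_convert_encoding($text, 'UTF-32BE', 'UTF-8');

echo strlen($text), PHP_EOL;     // 6  bytes in UTF-8 (1 + 2 + 3)
echo strlen($utf32), PHP_EOL;    // 12 bytes in UTF-32 (3 characters * 4 bytes)
echo bin2hex($utf32), PHP_EOL;   // 00000079000000e4000020ac

// Simple: character n always starts at byte offset 4 * n.
$second = mb_convert_encoding(substr($utf32, 4, 4), 'UTF-8', 'UTF-32BE');
echo $second, PHP_EOL;           // ä
```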

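UTF-16
Every character is 2 or 4 bytes long

Simple for the most frequently used Unicode characters: U+0000 to U+FFFF fit into a single 2-byte unit

Characters above U+FFFF need a surrogate pair (2 x 2 bytes); only the older UCS-2 is limited to U+0000 to U+FFFF

Big- or little-endian?

Mac OS X, Windows, Java, .NET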
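
The byte-order question and the 4-byte case can be made visible with mbstring. A minimal sketch; the two sample characters are chosen only for illustration:

```php
<?php
$bmp    = "\u{00E4}";    // ä, inside U+0000 to U+FFFF
$astral = "\u{1F600}";   // an emoji above U+FFFF

$bigEndian    = bin2hex(mb_convert_encoding($bmp, 'UTF-16BE', 'UTF-8'));
$littleEndian = bin2hex(mb_convert_encoding($bmp, 'UTF-16LE', 'UTF-8'));
$surrogates   = bin2hex(mb_convert_encoding($astral, 'UTF-16BE', 'UTF-8'));

echo $bigEndian, PHP_EOL;      // 00e4
echo $littleEndian, PHP_EOL;   // e400
echo $surrogates, PHP_EOL;     // d83dde00 (high surrogate D83D, low surrogate DE00)
```

The same character yields two different byte sequences depending on the byte order, which is why UTF-16 data needs either a declared endianness or a byte order mark.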

UTF-8
Every character is between 1 and 4 bytes long

Stream-safe, because lead bytes and continuation bytes can be told apart

All Unicode characters can be encoded; the bit scheme itself would allow far more (up to 2^42 = 4,398,046,511,104 values), although the standard stops at U+10FFFF

Space-saving, since often only 1 byte has to be stored (ASCII characters, which dominate in Latin-script text)

Linux, IETF

UTF-8

Lead byte: 0xxxxxxx or 11xxxxxx

Continuation byte: 10xxxxxx

0xxxxxxx for 1-byte characters

11xxxxxx for the lead byte of multi-byte characters

The number of leading 1 bits in the lead byte gives the total number of bytes in the sequence
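
To illustrate the bit patterns, here is a sketch that encodes a single code point by hand. `encodeUtf8()` is a hypothetical helper written for this example, not a PHP built-in, and it deliberately skips validation (surrogates, values above U+10FFFF):

```php
<?php
/** Hypothetical helper: encode one code point following the bit patterns above. */
function encodeUtf8(int $cp): string
{
    if ($cp < 0x80) {        // 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {       // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {     // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(encodeUtf8(0x79)), PHP_EOL;     // 79
echo bin2hex(encodeUtf8(0xE4)), PHP_EOL;     // c3a4
echo bin2hex(encodeUtf8(0x20AC)), PHP_EOL;   // e282ac
```

In real code, `mb_chr()` or `mb_convert_encoding()` would be the better choice; the manual version only shows where the bits go.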

UTF-8

00-7F: a character that is one byte long (ASCII)

80-BF: 2nd, 3rd or 4th byte of a multi-byte sequence

C2-DF: start of a 2-byte sequence

E0-EF: start of a 3-byte sequence

F0-F4: start of a 4-byte sequence
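
These ranges translate almost literally into a classifier. `utf8SequenceLength()` is a hypothetical helper for illustration; the return values 0 (continuation byte) and -1 (byte that never occurs in valid UTF-8) are conventions invented for this sketch:

```php
<?php
/**
 * Hypothetical helper: classify one byte (0-255) according to the ranges above.
 * Returns the sequence length for a lead byte, 0 for a continuation byte
 * and -1 for bytes outside all listed ranges.
 */
function utf8SequenceLength(int $byte): int
{
    if ($byte <= 0x7F) {
        return 1;                  // 00-7F: single-byte character
    }
    if ($byte <= 0xBF) {
        return 0;                  // 80-BF: continuation byte
    }
    if ($byte >= 0xC2 && $byte <= 0xDF) {
        return 2;
    }
    if ($byte >= 0xE0 && $byte <= 0xEF) {
        return 3;
    }
    if ($byte >= 0xF0 && $byte <= 0xF4) {
        return 4;
    }
    return -1;                     // C0, C1 and F5-FF
}

foreach ([0x79, 0xC3, 0xA4, 0xE2, 0xF0, 0xFF] as $b) {
    printf("0x%02X => %d\n", $b, utf8SequenceLength($b));
}
```

A stream reader that lands on a byte classified as 0 knows it is in the middle of a character and can skip forward to the next lead byte.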

UTF-8

Character  UTF-8      Binary             Unicode  Binary    Error (bytes read as ISO-8859-1)
y          0x79       01111001           U+0079   01111001  none
ä          0xC3 0xA4  11000011 10100100  U+00E4   11100100  Ã¤
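
The row for ä can be reproduced in PHP. This is a sketch that assumes PHP 7.4 or later (arrow function) with mbstring:

```php
<?php
$char = "\u{00E4}";   // ä

// UTF-8 bytes in binary.
$bits = array_map(
    fn (int $b): string => sprintf('%08b', $b),
    array_values(unpack('C*', $char))
);
$utf8Binary = implode(' ', $bits);
echo $utf8Binary, PHP_EOL;        // 11000011 10100100

// Code point in binary.
$codePointBinary = sprintf('%08b', mb_ord($char, 'UTF-8'));
echo $codePointBinary, PHP_EOL;   // 11100100

// The error case: the same two bytes interpreted as ISO-8859-1.
$misread = mb_convert_encoding($char, 'UTF-8', 'ISO-8859-1');
echo $misread, PHP_EOL;           // Ã¤
```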

Encoding and PHP

PHP is NOT Unicode-safe: its string functions operate on bytes, not on characters

Why is that a problem?

No problem as long as ONLY ONE ISO-8859 variant is in use.

International sites

Multilingual texts (a single Greek word is enough ...)

Multi-byte characters and strlen?

The individual bytes of a multi-byte character are displayed as separate characters (e.g. Ã¤ instead of ä)
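
The strlen question can be answered with a short experiment. A sketch assuming mbstring; the sample word is arbitrary:

```php
<?php
$word = "M\u{00FC}nchen";   // München

$byteCount = strlen($word);
$charCount = mb_strlen($word, 'UTF-8');
echo $byteCount, PHP_EOL;   // 8 (bytes)
echo $charCount, PHP_EOL;   // 7 (characters)

// Byte-based substr() can cut a character in half ...
$broken = substr($word, 0, 2);                    // "M" plus the first byte of "ü"
$brokenIsValid = mb_check_encoding($broken, 'UTF-8');
var_dump($brokenIsValid);                         // bool(false)

// ... while mb_substr() counts characters.
$intact = mb_substr($word, 0, 2, 'UTF-8');
echo $intact, PHP_EOL;                            // Mü
```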

Encoding and PHP

PHP: mb_internal_encoding('UTF-8'); mb_http_output('UTF-8'); (mb_http_input() only reports the detected input encoding, it does not set one)
HTTP header: Content-type: text/html; charset=utf-8
HTML: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> and <form accept-charset="utf-8">
Database: CREATE DATABASE ... DEFAULT CHARACTER SET utf8 COLLATE utf8 ...
Table options: ... ENGINE ... CHARSET=utf8 COLLATE=utf8_unicode_ci
Connection: SET NAMES 'utf8' COLLATE 'utf8_unicode_ci'
Note: MySQL's utf8 stores at most 3 bytes per character; utf8mb4 is needed for the full Unicode range
php.ini: default_charset = UTF-8
mb_string functions
mb regular expressions
u modifier for preg regular expressions
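
The PHP side of this checklist could look roughly like the following sketch. It assumes mbstring is available; the `headers_sent()` guard is only there so the snippet also runs where output has already started:

```php
<?php
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

$text = "\u{00E4}\u{00F6}\u{00FC}";   // äöü

// Without the u modifier "." matches single bytes, with it whole characters.
$byteMatches = preg_match_all('/./', $text);
$charMatches = preg_match_all('/./u', $text);
echo $byteMatches, PHP_EOL;           // 6
echo $charMatches, PHP_EOL;           // 3

// mb_strlen() now uses the internal encoding set above.
$length = mb_strlen($text);
echo $length, PHP_EOL;                // 3
```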

mb_string

Multi-byte functions

Replace the "normal" string functions

Make it possible to work with UTF-8

Overloading of the standard functions for mail (1), string (2) and regex functions (4) via mbstring.overload = n (deprecated in PHP 7.2, removed in PHP 8.0)
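
Calling the mb_ functions explicitly works without any overloading. A short sketch contrasting byte-based and multi-byte variants; the sample word is arbitrary, and the result of the plain `strtoupper()` can depend on PHP version and locale, so it is only printed here:

```php
<?php
$text = "\u{00E4}rger";   // ärger

// Byte-based function: does not know that the first two bytes form one letter.
echo strtoupper($text), PHP_EOL;

// Multi-byte function: upper-cases the ä as well.
$upper = mb_strtoupper($text, 'UTF-8');
echo $upper, PHP_EOL;                          // ÄRGER

// Offsets differ too: bytes versus characters.
$byteOffset = strpos($text, 'r');
$charOffset = mb_strpos($text, 'r', 0, 'UTF-8');
echo $byteOffset, PHP_EOL;                     // 2
echo $charOffset, PHP_EOL;                     // 1
```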

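iconv
Conversion from an arbitrary character set to e.g. UTF-8

<?php
// Detection is a heuristic: mb_detect_encoding() may guess wrong or return false.
$sourceEnc = mb_detect_encoding($string);
// Transliterate where possible and drop characters that cannot be converted.
$targetEnc = 'UTF-8//TRANSLIT//IGNORE';
echo iconv($sourceEnc, $targetEnc, $string);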
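
When the source encoding is known, it is more reliable to name it than to detect it. A self-contained sketch, assuming the iconv and mbstring extensions; the input bytes are a made-up example:

```php
<?php
// "Grüße" as ISO-8859-1 bytes (hypothetical input, e.g. from a legacy file).
$latin1 = "Gr\xFC\xDFe";

// Source encoding stated explicitly instead of guessed.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

echo $utf8, PHP_EOL;                    // Grüße
echo bin2hex($utf8), PHP_EOL;           // 4772c3bcc39f65

$isValid = mb_check_encoding($utf8, 'UTF-8');
var_dump($isValid);                     // bool(true)
```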

MySQL

The character set can be set for: server, client, connection, database, table

[mysqld]
default-collation=utf8_bin
character-set-server=utf8
collation-server=utf8_bin
default-character-set=utf8

[client]
default-character-set=utf8

MySQL

Stage                    Displayed  Bytes
Client (UTF-8)           Ü          C3 9C
Connection (ISO-8859-1)  Ãœ         C3 9C
Server (UTF-8)           Ãœ         C3 83 C2 9C
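
The same double encoding can be reproduced without a database. This sketch uses mbstring to play the role of the misconfigured connection and then reverses the damage; it only works as a repair when exactly one wrong conversion happened:

```php
<?php
$original = "\xC3\x9C";   // Ü as UTF-8, as sent by the client

// The connection claims ISO-8859-1, so the server converts the two bytes "to UTF-8" again.
$double = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');
$doubleHex = strtoupper(bin2hex($double));
echo $doubleHex, PHP_EOL;                 // C383C29C

// Repair: undo exactly one conversion step.
$repaired = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($repaired === $original);        // bool(true)
```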

Resources

http://php.net/manual/de/mbstring.overload.php

http://www.ibm.com/developerworks/library/os-php-unicode/index.html

http://unicode.org

Google, Bing, Yahoo, .....
