Manipulating Text in PHP

Since the strlen() function only counts bytes, it reports incorrect results when a character requires more than one byte. To count the characters in a string, independent of how many bytes each character requires, use mb_strlen(), as shown in Example 20-1.

Example 20-1. Measuring string length

$english = "cheese";
$greek = "τυρί";

print "strlen() says " . strlen($english) . " for $english and " .
    strlen($greek) . " for $greek.\n";

print "mb_strlen() says " . mb_strlen($english) . " for $english and " .
    mb_strlen($greek) . " for $greek.\n";

Since each of the Greek characters requires two bytes, the output of Example 20-1 is:

strlen() says 6 for cheese and 8 for τυρί.
mb_strlen() says 6 for cheese and 4 for τυρί.
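To see where the extra bytes come from, the following minimal sketch lists each character with its byte width. The helper name char_byte_widths() is hypothetical (it is not from the book), and the sketch assumes UTF-8 input, the mbstring extension, and PHP 7.4 or later for mb_str_split().

```php
<?php
// Hypothetical helper, not from the book: pair each character of a
// UTF-8 string with the number of bytes it occupies.
function char_byte_widths(string $text): array
{
    $widths = [];
    // mb_str_split() (PHP 7.4+) splits on character boundaries, not bytes.
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        // strlen() on a single character gives its width in bytes.
        $widths[] = [$char, strlen($char)];
    }
    return $widths;
}

foreach (char_byte_widths("τυρί") as [$char, $bytes]) {
    print "$char occupies $bytes byte(s)\n";
}
```

Summing the byte widths gives the strlen() result, while counting the pairs gives the mb_strlen() result.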

Operations that depend on string positions, such as finding substrings, must also be character-aware rather than byte-aware when multibyte characters are involved. Example 2-12 used substr() to extract the first 30 bytes of a user-submitted message. To extract the first 30 characters, use mb_substr() instead, as shown in Example 20-2.

Example 20-2. Extracting a substring

$message = "In Russia, I like to eat каша and drink квас.";

print "substr() says: " . substr($message, 0, 30) . "\n";
print "mb_substr() says: " . mb_substr($message, 0, 30) . "\n";

Example 20-2 prints:

substr() says: In Russia, I like to eat Ka

mb_substr() says: In Russia, I like to eat Kawa

The line of output from substr() is totally bungled! Each Cyrillic character requires more than one byte, and 30 bytes into the string is midway through the byte sequence for a particular character. The output from mb_substr() stops properly on the correct character boundary.
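The same byte-versus-character distinction applies when searching for a substring. The sketch below is an illustration rather than one of the book's examples; it assumes UTF-8 text and the mbstring extension, and compares the offset reported by strpos() with the one reported by mb_strpos().

```php
<?php
$message = "In Russia, I like to eat каша and drink квас.";

// strpos() reports where "and" starts as a byte offset.
$byteOffset = strpos($message, "and");
// mb_strpos() reports the same position as a character offset.
$charOffset = mb_strpos($message, "and", 0, 'UTF-8');

print "strpos() says $byteOffset, mb_strpos() says $charOffset\n";

// Pair each offset with the function family that produced it.
print substr($message, $byteOffset, 3) . "\n";
print mb_substr($message, $charOffset, 3, 'UTF-8') . "\n";
```

The two offsets differ because the Cyrillic characters ahead of "and" each take more than one byte. Passing a byte offset to mb_substr(), or a character offset to substr(), would start the extraction at the wrong place.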

What “uppercase” and “lowercase” mean is also different in different character sets. The mb_strtolower() and mb_strtoupper() functions provide character-aware versions of strtolower() and strtoupper(). Example 20-3 shows these functions at work.

Example 20-3. Changing case

$english = "Please stop shouting.";
$danish = "Venligst stoppe råben.";
$vietnamese = "Hãy dừng la hét.";

print "strtolower() says: \n";
print " " . strtolower($english) . "\n";
print " " . strtolower($danish) . "\n";
print " " . strtolower($vietnamese) . "\n";

print "mb_strtolower() says: \n";
print " " . mb_strtolower($english) . "\n";
print " " . mb_strtolower($danish) . "\n";
print " " . mb_strtolower($vietnamese) . "\n";

print "strtoupper() says: \n";
print " " . strtoupper($english) . "\n";
print " " . strtoupper($danish) . "\n";
print " " . strtoupper($vietnamese) . "\n";

print "mb_strtoupper() says: \n";
print " " . mb_strtoupper($english) . "\n";
print " " . mb_strtoupper($danish) . "\n";
print " " . mb_strtoupper($vietnamese) . "\n";

Example 20-3 prints:

Because strtoupper() and strtolower() work on individual bytes, they don’t replace whole multibyte characters with the correct equivalents, as mb_strtoupper() and mb_strtolower() do.
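One place this matters in practice is comparing strings without regard to case. The following is a minimal sketch, assuming UTF-8 input and the mbstring extension; the helper name equals_ignoring_case() is hypothetical, and it simply lowercases both sides with mb_strtolower() before comparing.

```php
<?php
// Hypothetical helper, not from the book: compare two UTF-8 strings
// after lowercasing both in a character-aware way.
function equals_ignoring_case(string $a, string $b): bool
{
    return mb_strtolower($a, 'UTF-8') === mb_strtolower($b, 'UTF-8');
}

// Same word in different case: Å and å are treated as the same letter.
print equals_ignoring_case("RÅBEN", "råben") ? "same\n" : "different\n";
// A plain "a" is a different letter from "å", so these do not match.
print equals_ignoring_case("RÅBEN", "raben") ? "same\n" : "different\n";
```

Passing the encoding explicitly keeps the result independent of the configured default encoding. This sketch does not normalize Unicode, so it assumes both strings use the same composed form of accented characters.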

Source: Sklar David (2016), Learning PHP: A Gentle Introduction to the Web’s Most Popular Language, O’Reilly Media; 1st edition.
