Collation and Normalization in Java

Most programmers know how to compare strings with the compareTo method of the String class. Unfortunately, when interacting with human users, this method is not very useful. The compareTo method uses the values of the UTF-16 encoding of the string, which leads to absurd results, even in English. For example, the following five strings are ordered according to the compareTo method:

America

Zulu

able

zebra

Ångström
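The ordering above can be reproduced with a few lines of code. The following is a minimal sketch; the class name CodeUnitOrderDemo and the helper sortByCodeUnits are made up for illustration, and the last word is assumed to be spelled with its accented letters (Ångström).

```java
import java.util.ArrayList;
import java.util.List;

class CodeUnitOrderDemo {
    // Returns a copy of the words sorted with String.compareTo,
    // that is, by the numeric values of their UTF-16 code units.
    static List<String> sortByCodeUnits(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(String::compareTo);
        return sorted;
    }

    public static void main(String[] args) {
        // U+00C5 (A with ring above) and U+00F6 (o with diaeresis) are written
        // as escapes so that the file compiles in any source encoding.
        List<String> words = List.of("zebra", "\u00C5ngstr\u00F6m", "able", "Zulu", "America");
        System.out.println(sortByCodeUnits(words));
    }
}
```

All uppercase ASCII letters (U+0041 to U+005A) precede all lowercase ones (U+0061 to U+007A), and U+00C5 comes after both ranges, which is why the accented word lands at the end.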

For dictionary ordering, you would want to consider upper case and lower case to be equivalent. To an English speaker, the sample list of words would be ordered as

able

America

Ångström

zebra

Zulu

However, that order would not be acceptable to a Swedish user. In Swedish, the letter Å is different from the letter A, and it is collated after the letter Z! That is, a Swedish user would want the words to be sorted as

able

America

zebra

Zulu

Ångström

To obtain a locale-sensitive comparator, call the static Collator.getInstance method:

Collator coll = Collator.getInstance(locale);

words.sort(coll); // Collator implements Comparator<Object>

Since the Collator class implements the Comparator interface, you can pass a Collator object to the List.sort(Comparator) method to sort a list of strings.
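As a rough illustration, the sketch below sorts the same five words with a US English collator and with a Swedish one. The class name LocaleSortDemo and the helper sortFor are hypothetical, and the Swedish result assumes that the Java runtime ships collation data for that locale (a standard JDK does).

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LocaleSortDemo {
    // Returns a copy of the words sorted by the collation rules of the given locale.
    static List<String> sortFor(Locale locale, List<String> words) {
        Collator coll = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(coll); // a Comparator<Object> is accepted for a List<String>
        return sorted;
    }

    public static void main(String[] args) {
        // U+00C5 and U+00F6 are escaped so the file compiles in any source encoding.
        List<String> words = List.of("zebra", "\u00C5ngstr\u00F6m", "able", "Zulu", "America");
        System.out.println("en-US: " + sortFor(Locale.US, words));
        System.out.println("sv-SE: " + sortFor(Locale.forLanguageTag("sv-SE"), words));
    }
}
```

With the US English collator, the accented word should sort among the words starting with A. With the Swedish collator, it is expected to move to the end of the list.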

There are a couple of advanced settings for collators. You can set a collator’s strength to control how selective it should be. Character differences are classified as primary, secondary, or tertiary. For example, in English, the difference between “A” and “Z” is considered primary, the difference between “A” and “Å” is secondary, and between “A” and “a” is tertiary.

By setting the strength of the collator to Collator.PRIMARY, you tell it to pay attention only to primary differences. By setting the strength to Collator.SECONDARY, you instruct the collator to take secondary differences into account. That is, two strings will be more likely to be considered different when the strength is set to “secondary” or “tertiary,” as shown in Table 7.6.
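The effect of the strength setting can be tried out with a small experiment. This is only a sketch: the class StrengthDemo and the helper sameAt are invented names, and the results are for the US English collator.

```java
import java.text.Collator;
import java.util.Locale;

class StrengthDemo {
    // Reports whether two strings collate as equal in US English
    // when the collator is set to the given strength.
    static boolean sameAt(int strength, String a, String b) {
        Collator coll = Collator.getInstance(Locale.US);
        coll.setStrength(strength);
        return coll.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String aRing = "\u00C5"; // A with ring above, escaped for portability
        System.out.println("PRIMARY   A vs a:      " + sameAt(Collator.PRIMARY, "A", "a"));
        System.out.println("PRIMARY   A vs U+00C5: " + sameAt(Collator.PRIMARY, "A", aRing));
        System.out.println("SECONDARY A vs a:      " + sameAt(Collator.SECONDARY, "A", "a"));
        System.out.println("SECONDARY A vs U+00C5: " + sameAt(Collator.SECONDARY, "A", aRing));
        System.out.println("TERTIARY  A vs a:      " + sameAt(Collator.TERTIARY, "A", "a"));
    }
}
```

At primary strength, both pairs should compare as equal. At secondary strength, only the case difference is still ignored, and at tertiary strength neither pair is equal.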

When the strength has been set to Collator.IDENTICAL, no differences are allowed. This setting is mainly useful in conjunction with a rather technical collator setting, the decomposition mode, which we take up next.

Occasionally, a character or sequence of characters can be described in more than one way in Unicode. For example, an “Å” can be Unicode character U+00C5, or it can be expressed as a plain A (U+0041) followed by a ° (“combining ring above”; U+030A). Perhaps more surprisingly, the letter sequence “ffi” can be described with a single character “Latin small ligature ffi” with code U+FB03. (One could argue that this is a presentation issue that should not have resulted in different Unicode characters, but we don’t make the rules.)

The Unicode standard defines four normalization forms (D, KD, C, and KC) for strings. See www.unicode.org/unicode/reports/tr15/tr15-23.html for the details. In the normalization form C, accented characters are always composed. For example, a sequence of A and a combining ring above ° is combined into a single character Å. In form D, accented characters are always decomposed into their base letters and combining accents: Å is turned into A followed by °. Forms KC and KD also decompose characters such as ligatures or the trademark symbol.

You can choose the degree of normalization that you want a collator to use. The value Collator.NO_DECOMPOSITION does not normalize strings at all. This option is faster, but it might not be appropriate for texts that express characters in multiple forms. The default, Collator.CANONICAL_DECOMPOSITION, uses the normalization form D. This is useful for texts that contain accents but not ligatures. Finally, “full decomposition” uses normalization form KD. See Table 7.7 for examples.
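The decomposition modes can be compared with a small experiment along the following lines. This is a sketch under assumptions: DecompositionDemo and sameWith are made-up names, the collator is the US English one at its default strength, and the mode is always set explicitly instead of relying on a default.

```java
import java.text.Collator;
import java.util.Locale;

class DecompositionDemo {
    // Reports whether two strings collate as equal in US English
    // when the collator uses the given decomposition mode.
    static boolean sameWith(int decomposition, String a, String b) {
        Collator coll = Collator.getInstance(Locale.US);
        coll.setDecomposition(decomposition);
        return coll.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String composed = "\u00C5";    // precomposed A with ring above
        String decomposed = "A\u030A"; // plain A followed by combining ring above
        String ligature = "\uFB03";    // Latin small ligature ffi

        int[] modes = { Collator.NO_DECOMPOSITION,
                        Collator.CANONICAL_DECOMPOSITION,
                        Collator.FULL_DECOMPOSITION };
        for (int mode : modes) {
            System.out.println("mode " + mode
                + ": ring forms equal = " + sameWith(mode, composed, decomposed)
                + ", ligature equals ffi = " + sameWith(mode, ligature, "ffi"));
        }
    }
}
```

With canonical decomposition, the two spellings of the accented letter should compare as equal. The ligature should match the plain letters “ffi” once full decomposition is selected.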

It is wasteful to have the collator decompose a string many times. If one string is compared against other strings many times, you can save the decomposition in a collation key object. The getCollationKey method returns a CollationKey object that you can use for further, faster comparisons. Here is an example:

String a = . . .;

CollationKey aKey = coll.getCollationKey(a);

if (aKey.compareTo(coll.getCollationKey(b)) == 0) // fast comparison

…
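One place where collation keys can pay off is sorting, where every string takes part in many comparisons. The sketch below computes each key once and then sorts the keys. The names CollationKeyDemo and sortWithKeys are illustrative only.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class CollationKeyDemo {
    // Sorts the words by precomputed collation keys, so that each word
    // is processed by the collator only once.
    static List<String> sortWithKeys(Collator coll, List<String> words) {
        List<CollationKey> keys = new ArrayList<>();
        for (String word : words) {
            keys.add(coll.getCollationKey(word));
        }
        // CollationKey implements Comparable<CollationKey>
        keys.sort(Comparator.naturalOrder());
        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString()); // recover the original string
        }
        return sorted;
    }

    public static void main(String[] args) {
        Collator coll = Collator.getInstance(Locale.US);
        // U+00C5 and U+00F6 are escaped so the file compiles in any source encoding.
        List<String> words = List.of("zebra", "\u00C5ngstr\u00F6m", "able", "Zulu", "America");
        System.out.println(sortWithKeys(coll, words));
    }
}
```

Keys are only comparable with each other when they come from the same collator, so the sketch takes the collator as a parameter and uses it for every word.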

Finally, you might want to convert strings into their normalized forms even if you don’t do collation, for example, to store strings in a database or communicate with another program. The java.text.Normalizer class carries out the normalization process. For example,

String name = "Ångström";

String normalized = Normalizer.normalize(name, Normalizer.Form.NFD); // uses normalization form D

The normalized string contains ten characters. The “Å” and “ö” are replaced by “A°” and “o¨” sequences.

However, that is not usually the best form for storage and transmission. Normalization form C first applies decomposition and then combines the accents back in a standardized order. According to the W3C, this is the recommended mode for transferring data over the Internet.
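The round trip between the forms can be checked with a short program. This is a minimal sketch; NormalizerDemo, toNfd, and toNfc are made-up names wrapped around the java.text.Normalizer API.

```java
import java.text.Normalizer;

class NormalizerDemo {
    // Decomposes accented characters into base letters and combining marks (form D).
    static String toNfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Decomposes and then recomposes accents in a standardized order (form C).
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // U+00C5 and U+00F6 are escaped so the file compiles in any source encoding.
        String name = "\u00C5ngstr\u00F6m";
        String decomposed = toNfd(name);
        System.out.println(name.length());                   // 8
        System.out.println(decomposed.length());             // 10
        System.out.println(toNfc(decomposed).equals(name));  // true
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFC)); // false
        // Form KC also replaces compatibility characters such as the ffi ligature.
        System.out.println(Normalizer.normalize("\uFB03", Normalizer.Form.NFKC));     // ffi
    }
}
```

Normalizing to one agreed form before storing or comparing strings means that the two spellings of the same accented word end up as the same sequence of characters.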

The program in Listing 7.4 lets you experiment with collation order. Type a word into the text field and click the Add button to add it to the list of words. Each time you add another word, or change the locale, strength, or decomposition mode, the list of words is sorted again. An = sign indicates words that are considered identical (see Figure 7.3).

The locale names in the combo box are displayed in sorted order, using the collator of the default locale. If you run this program with the US English locale, note that “Norwegian (Norway,Nynorsk)” comes before “Norwegian (Norway)”, even though the Unicode value of the comma character is greater than the Unicode value of the closing parenthesis.

Source: Horstmann Cay S. (2019), Core Java. Volume II – Advanced Features, Pearson; 11th edition.
