As you know, the Java programming language itself is fully Unicode-based. However, Windows and Mac OS X still support legacy character encodings such as Windows-1252 or Mac Roman in Western European countries, or Big5 in Taiwan. Therefore, communicating with your users through text is not as simple as it should be. The following sections discuss the complications that you may encounter.
1. Text Files
Nowadays, it is best to use UTF-8 for saving and loading text files. But you may need to work with legacy files. If you know the expected character encoding, you can specify it when writing or reading text files:
var out = new PrintWriter(filename, "Windows-1252");
For a guess of the best encoding to use, get the “platform encoding” by calling
Charset platformEncoding = Charset.defaultCharset();
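As a concrete illustration, here is a minimal sketch that writes a file in a legacy encoding, reads it back with the same encoding, and reports the platform default. The file name legacy.txt is made up for this example:

import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class LegacyFileDemo {
    public static void main(String[] args) throws IOException {
        // Write a file in a legacy encoding (hypothetical file name)
        try (var out = new PrintWriter("legacy.txt", "Windows-1252")) {
            out.println("100 €");
        }
        // Read it back, specifying the same encoding explicitly
        String contents = Files.readString(Path.of("legacy.txt"), Charset.forName("Windows-1252"));
        System.out.println(contents);
        // For files of unknown origin, the platform default is only a guess
        System.out.println("Platform encoding: " + Charset.defaultCharset());
    }
}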
2. Line Endings
This isn’t an issue of locales but of platforms. On Windows, text files are expected to use \r\n at the end of each line, whereas UNIX-based systems only require a \n character. Nowadays, most Windows programs can deal with just a \n. The notable exception is Notepad. If it is important to you that users can double-click on a text file that your application produces and view it in Notepad, make sure that the text file has proper line endings.
Any line written with the println method will be properly terminated. The only problem is if you print strings that contain \n characters. They are not automatically modified to the platform line ending.
Instead of using \n in strings, you can use printf and the %n format specifier to produce platform-dependent line endings. For example,
out.printf("Hello%nWorld%n");
produces
Hello\r\nWorld\r\n
on Windows and
Hello\nWorld\n
everywhere else.
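If you are stuck with plain strings, you can also splice in the platform-specific separator yourself with System.lineSeparator(). A minimal sketch, assuming a hypothetical file greeting.txt:

try (var out = new PrintWriter("greeting.txt", "UTF-8")) {
    out.printf("Hello%nWorld%n"); // %n expands to the platform line ending
    out.print("Hello" + System.lineSeparator()); // equivalent, built by hand
}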
3. The Console
If you write programs that communicate with the user through System.in/System.out or System.console(), you have to face the possibility that the console may use a character encoding that is different from the platform encoding reported by Charset.defaultCharset(). This is particularly noticeable when working with the cmd shell on Windows. In the US version of Windows 10, the command shell still uses the archaic IBM437 encoding that originated with IBM personal computers in 1982. There is no official API for revealing that information. The Charset.defaultCharset() method will return the Windows-1252 character set, which is quite different. For example, the euro symbol € is present in Windows-1252 but not in IBM437. When you call
System.out.println("100 €");
the console will display
100 ?
You can advise your users to switch the character encoding of the console. In Windows, that is achieved with the chcp command. For example,
chcp 1252
changes the console to the Windows-1252 code page.
Ideally, of course, your users should switch the console to UTF-8. In Windows, the command is
chcp 65001
Unfortunately, that is not enough to make Java use UTF-8 in the console. It is also necessary to set the platform encoding with the unofficial file.encoding system property:
java -Dfile.encoding=UTF-8 MyProg
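If you cannot rely on your users to reconfigure the console, another option is to wrap the standard output stream with an explicit encoding. This is a sketch of a workaround, not an official API, and it only helps if the console actually expects UTF-8 (for example, after chcp 65001):

import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ConsoleDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // A PrintStream over standard output that encodes characters as UTF-8
        var utf8Out = new PrintStream(new FileOutputStream(FileDescriptor.out), true, "UTF-8");
        utf8Out.println("100 €");
    }
}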
4. Log Files
When log messages from the java.util.logging library are sent to the console, they are written in the console encoding. You saw how to control that in the preceding section. However, log messages in a file use a FileHandler which, by default, uses the platform encoding.
To change the encoding to UTF-8, you need to change the log manager settings. In the logging configuration file, set
java.util.logging.FileHandler.encoding=UTF-8
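Alternatively, you can configure the handler in code. Here is a minimal sketch; the log file name app.log and the logger name are made up for illustration:

import java.io.IOException;
import java.util.logging.FileHandler;
import java.util.logging.Logger;

public class LogDemo {
    public static void main(String[] args) throws IOException {
        var handler = new FileHandler("app.log"); // hypothetical log file name
        handler.setEncoding("UTF-8"); // log records are now written as UTF-8
        Logger logger = Logger.getLogger("com.example.app"); // hypothetical logger name
        logger.addHandler(handler);
        logger.info("100 €");
    }
}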
5. The UTF-8 Byte Order Mark
As already mentioned, it is a good idea to use UTF-8 for text files when you can. However, if your application has to read UTF-8 text files created by other programs, you run into another potential problem. It is perfectly legal to add a “byte order mark” character U+FEFF as the first character of a file.
In the UTF-16 encoding, where each code unit is a two-byte quantity, the byte order mark tells a reader whether the file uses “big-endian” or “little-endian” byte ordering. UTF-8 uses individual bytes as its code units, so there is no need to specify a byte ordering. But if a file starts with the bytes 0xEF 0xBB 0xBF (the UTF-8 encoding of U+FEFF), that is a strong indication that it uses UTF-8. For that reason, the Unicode standard encourages this practice. Any reader is supposed to discard an initial byte order mark.
There is just one fly in the ointment. The Oracle Java implementation stubbornly refuses to follow the Unicode standard, citing potential compatibility issues. That means that you, the programmer, must do what the platform won’t do. When you read a text file and encounter a U+FEFF at the beginning, ignore it.
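A minimal sketch of such a check, reading the entire file as UTF-8 and discarding a leading U+FEFF:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomDemo {
    public static void main(String[] args) throws IOException {
        String contents = Files.readString(Path.of(args[0]), StandardCharsets.UTF_8);
        // Discard a byte order mark that another program may have placed at the start
        if (!contents.isEmpty() && contents.charAt(0) == '\uFEFF')
            contents = contents.substring(1);
        System.out.print(contents);
    }
}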
6. Character Encoding of Source Files
You, the programmer, will need to communicate with the Java compiler—and you do that with tools on your local system. For example, you can use the Chinese version of Notepad to write your Java source code files. The resulting source code files are not portable because they use the local character encoding (GB or Big5, depending on which Chinese operating system you use). Only the compiled class files are portable—they automatically use the “modified UTF-8” encoding for identifiers and strings. That means that three character encodings are involved while a program is compiled and run:
- Source files: platform encoding
- Class files: modified UTF-8
- Virtual machine: UTF-16
(See Chapter 1 for a definition of the modified UTF-8 and UTF-16 formats.)
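Incidentally, if your source files do not use the platform encoding, you can tell the compiler explicitly. For example,

javac -encoding UTF-8 MyProg.java

compiles a UTF-8 source file regardless of the platform default.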
Source: Horstmann, Cay S. (2019), Core Java, Volume II: Advanced Features, 11th edition, Pearson.