In the Java API, an object from which we can read a sequence of bytes is called an input stream. An object to which we can write a sequence of bytes is called an output stream. These sources and destinations of byte sequences can be—and often are—files, but they can also be network connections and even blocks of memory. The abstract classes InputStream and OutputStream are the basis for a hierarchy of input/output (I/O) classes.
Byte-oriented input/output streams are inconvenient for processing information stored in Unicode (recall that Unicode uses multiple bytes per character). Therefore, a separate hierarchy provides classes, inheriting from the abstract Reader and Writer classes, for processing Unicode characters. These classes have read and write operations that are based on two-byte char values (that is, UTF-16 code units) rather than byte values.
1. Reading and Writing Bytes
The InputStream class has an abstract method:
abstract int read()
This method reads one byte and returns the byte that was read, or -1 if it encounters the end of the input source. The designer of a concrete input stream class overrides this method to provide useful functionality. For example, in the FileInputStream class, this method reads one byte from a file. System.in is a predefined object of a subclass of InputStream that allows you to read information from “standard input,” that is, the console or a redirected file.
The InputStream class also has nonabstract methods to read an array of bytes or to skip a number of bytes. Since Java 9, there is a very useful method to read all bytes of a stream:
byte[] bytes = in.readAllBytes();
There are also methods to read a given number of bytes—see the API notes.
These methods call the abstract read method, so subclasses need to override only one method.
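To see why overriding the single abstract method is enough, here is a minimal sketch. The class names CountdownInputStream and CountdownDemo are hypothetical and exist only for this illustration; the start value is assumed to fit in a byte.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical stream that yields the bytes start, start - 1, ..., 1.
class CountdownInputStream extends InputStream
{
   private int next;

   CountdownInputStream(int start) { next = start; }

   // The only method that a concrete subclass must supply
   @Override
   public int read()
   {
      return next > 0 ? next-- : -1; // -1 signals the end of input
   }
}

class CountdownDemo
{
   public static void main(String[] args) throws IOException
   {
      try (InputStream in = new CountdownInputStream(3))
      {
         // readAllBytes is inherited; it ends up calling read() repeatedly
         byte[] bytes = in.readAllBytes();
         System.out.println(Arrays.toString(bytes)); // [3, 2, 1]
      }
   }
}
```

A production stream would normally also override the array-based read method for speed, since the inherited version fetches one byte at a time.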
Similarly, the OutputStream class defines the abstract method
abstract void write(int b)
which writes one byte to an output location.
If you have an array of bytes, you can write them all at once:
byte[] values = …;
out.write(values);
The transferTo method transfers all bytes from an input stream to an output stream:
in.transferTo(out);
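As a small illustration of transferTo, the following sketch drains an input stream into a byte array. The class TransferDemo and its drain method are hypothetical names; in-memory streams stand in for a file or network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class TransferDemo
{
   // Copies everything that remains in the input stream into a byte array.
   static byte[] drain(InputStream in) throws IOException
   {
      var out = new ByteArrayOutputStream();
      long count = in.transferTo(out); // returns the number of bytes transferred
      System.out.println("transferred " + count + " bytes");
      return out.toByteArray();
   }

   public static void main(String[] args) throws IOException
   {
      byte[] values = { 1, 2, 3, 4 };
      byte[] copy = drain(new ByteArrayInputStream(values));
      System.out.println(copy.length); // 4
   }
}
```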
Both the read and write methods block until the byte is actually read or written. This means that if the input stream cannot immediately be accessed (usually because of a busy network connection), the current thread blocks. This gives other threads the chance to do useful work while the method is waiting for the input stream to become available again.
The available method lets you check the number of bytes that are currently available for reading. This means a fragment like the following is unlikely to block:
int bytesAvailable = in.available();
if (bytesAvailable > 0)
{
   var data = new byte[bytesAvailable];
   in.read(data);
}
When you have finished reading or writing to an input/output stream, close it by calling the close method. This call frees up the operating system resources that are in limited supply. If an application opens too many input/output streams without closing them, system resources can become depleted. Closing an output stream also flushes the buffer used for the output stream: Any bytes that were temporarily placed in a buffer so that they could be delivered as a larger packet are sent off. In particular, if you do not close a file, the last packet of bytes might never be delivered. You can also manually flush the output with the flush method.
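The effect of closing on buffered output can be observed with in-memory streams. In this sketch (FlushDemo and sizes are hypothetical names), a try-with-resources statement calls close automatically, and the bytes reach their destination only at that point.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

class FlushDemo
{
   // Returns { bytes delivered before close, bytes delivered after close }.
   static int[] sizes() throws IOException
   {
      var sink = new ByteArrayOutputStream();
      int before;
      try (var out = new BufferedOutputStream(sink))
      {
         out.write(new byte[] { 1, 2, 3 });
         before = sink.size(); // still 0: the bytes sit in the buffer
      } // close() is called here and flushes the buffer
      return new int[] { before, sink.size() };
   }

   public static void main(String[] args) throws IOException
   {
      int[] result = sizes();
      System.out.println(result[0] + " " + result[1]); // 0 3
   }
}
```

Calling out.flush() inside the try block would deliver the bytes earlier without closing the stream.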
Even if an input/output stream class provides concrete methods to work with the raw read and write functions, application programmers rarely use them. The data that you are interested in probably contain numbers, strings, and objects, not raw bytes.
Instead of working with bytes, you can use one of many input/output classes that build upon the basic InputStream and OutputStream classes.
2. The Complete Stream Zoo
Unlike C, which gets by just fine with a single type FILE*, Java has a whole zoo of more than 60 (!) different input/output stream types (see Figures 2.1 and 2.2).
Let’s divide the animals in the input/output stream zoo by how they are used. There are separate hierarchies for classes that process bytes and characters. As you saw, the InputStream and OutputStream classes let you read and write individual bytes and arrays of bytes. These classes form the basis of the hierarchy shown in Figure 2.1. To read and write strings and numbers, you need more capable subclasses. For example, DataInputStream and DataOutputStream let you read and write all the primitive Java types in binary format. Finally, there are input/output streams that do useful stuff; for example, the ZipInputStream and ZipOutputStream let you read and write files in the familiar ZIP compression format.
For Unicode text, on the other hand, you can use subclasses of the abstract classes Reader and Writer (see Figure 2.2). The basic methods of the Reader and Writer classes are similar to those of InputStream and OutputStream.
abstract int read()
abstract void write(int c)
The read method returns either a UTF-16 code unit (as an integer between 0 and 65535) or -1 when you have reached the end of the file. The write method is called with a Unicode code unit. (See Volume I, Chapter 3 for a discussion of Unicode code units.)
There are four additional interfaces: Closeable, Flushable, Readable, and Appendable (see Figure 2.3). The first two interfaces are very simple, with methods
void close() throws IOException
and
void flush() throws IOException
respectively. The classes InputStream, OutputStream, Reader, and Writer all implement the Closeable interface.
OutputStream and Writer implement the Flushable interface.
The Readable interface has a single method
int read(CharBuffer cb)
The CharBuffer class has methods for sequential and random read/write access. It represents an in-memory buffer or a memory-mapped file. (See Section 2.5.2, “The Buffer Data Structure,” on p. 132 for details.)
The Appendable interface has two methods for appending single characters and character sequences:
Appendable append(char c)
Appendable append(CharSequence s)
The CharSequence interface describes basic properties of a sequence of char values. It is implemented by String, CharBuffer, StringBuilder, and StringBuffer.
Of the input/output stream classes, only Writer implements Appendable.
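One practical benefit of Appendable is that a method can accept any text destination. The sketch below uses hypothetical names (AppendableDemo, appendGreeting) and passes both a StringBuilder and a StringWriter to the same method.

```java
import java.io.IOException;
import java.io.StringWriter;

class AppendableDemo
{
   // Works with any Appendable, such as a StringBuilder or a Writer.
   static void appendGreeting(Appendable target, CharSequence name) throws IOException
   {
      // Each append call returns the Appendable, so the calls can be chained
      target.append("Hello, ").append(name).append('!');
   }

   public static void main(String[] args) throws IOException
   {
      var builder = new StringBuilder();
      appendGreeting(builder, "Harry");
      var writer = new StringWriter();
      appendGreeting(writer, "Carl");
      System.out.println(builder + " " + writer); // Hello, Harry! Hello, Carl!
   }
}
```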
3. Combining Input/Output Stream Filters
FileInputStream and FileOutputStream give you input and output streams attached to a disk file. You need to pass the file name or full path name of the file to the constructor. For example,
var fin = new FileInputStream("employee.dat");
looks in the user directory for a file named employee.dat.
Like the abstract InputStream and OutputStream classes, these classes only support reading and writing at the byte level. That is, we can only read bytes and byte arrays from the object fin.
byte b = (byte) fin.read();
As you will see in the next section, if we just had a DataInputStream, we could read numeric types:
DataInputStream din = . . .;
double x = din.readDouble();
But just as the FileInputStream has no methods to read numeric types, the DataInputStream has no method to get data from a file.
Java uses a clever mechanism to separate two kinds of responsibilities. Some input streams (such as the FileInputStream and the input stream returned by the openStream method of the URL class) can retrieve bytes from files and other more exotic locations. Other input streams (such as the DataInputStream) can assemble bytes into more useful data types. The Java programmer has to combine the two. For example, to be able to read numbers from a file, first create a FileInputStream and then pass it to the constructor of a DataInputStream.
var fin = new FileInputStream("employee.dat");
var din = new DataInputStream(fin);
double x = din.readDouble();
If you look at Figure 2.1 again, you can see the classes FilterInputStream and FilterOutputStream. The subclasses of these classes are used to add capabilities to input/output streams that process bytes.
You can add multiple capabilities by nesting the filters. For example, by default, input streams are not buffered. That is, every call to read asks the operating system to dole out yet another byte. It is more efficient to request blocks of data instead and store them in a buffer. If you want buffering and the data input methods for a file, use the following rather monstrous sequence of constructors:
var din = new DataInputStream(
   new BufferedInputStream(
      new FileInputStream("employee.dat")));
Notice that we put the DataInputStream last in the chain of constructors because we want to use the DataInputStream methods, and we want them to use the buffered read method.
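Here is a minimal round trip through such a filter chain. It is a sketch under stated assumptions: the class FilterChainDemo and its roundTrip method are hypothetical, and a temporary file takes the place of employee.dat.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class FilterChainDemo
{
   // Writes a double to a temporary file and reads it back.
   static double roundTrip(double value) throws IOException
   {
      Path file = Files.createTempFile("employee", ".dat");
      try
      {
         // The output chain mirrors the input chain: data -> buffer -> file
         try (var dout = new DataOutputStream(
               new BufferedOutputStream(
                  new FileOutputStream(file.toFile()))))
         {
            dout.writeDouble(value);
         }
         try (var din = new DataInputStream(
               new BufferedInputStream(
                  new FileInputStream(file.toFile()))))
         {
            return din.readDouble();
         }
      }
      finally
      {
         Files.deleteIfExists(file);
      }
   }

   public static void main(String[] args) throws IOException
   {
      System.out.println(roundTrip(75000)); // 75000.0
   }
}
```

Closing the outermost stream closes the whole chain, which is why a single resource per try statement suffices.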
Sometimes you’ll need to keep track of the intermediate input streams when chaining them together. For example, when reading input, you often need to peek at the next byte to see if it is the value that you expect. Java provides the PushbackInputStream for this purpose.
var pbin = new PushbackInputStream(
   new BufferedInputStream(
      new FileInputStream("employee.dat")));
Now you can speculatively read the next byte
int b = pbin.read();
and throw it back if it isn’t what you wanted.
if (b != '<') pbin.unread(b);
However, reading and unreading are the only methods that apply to a pushback input stream. If you want to look ahead and also read numbers, then you need both a pushback input stream and a data input stream reference.
var pbin = new PushbackInputStream(
   new BufferedInputStream(
      new FileInputStream("employee.dat")));
var din = new DataInputStream(pbin);
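The following sketch shows why both references are useful. It assumes a made-up input format in which a four-byte integer may be preceded by an optional '<' byte; the class and method names are hypothetical, and a byte array replaces the file.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

class PushbackDemo
{
   // Skips an optional leading '<', then reads a four-byte integer.
   static int readIntAfterOptionalBracket(byte[] input) throws IOException
   {
      var pbin = new PushbackInputStream(new ByteArrayInputStream(input));
      var din = new DataInputStream(pbin);

      int b = pbin.read(); // look ahead through the pushback reference
      // Guard against -1 so that the end-of-input marker is never pushed back
      if (b != -1 && b != '<') pbin.unread(b);

      return din.readInt(); // read the number through the data reference
   }

   public static void main(String[] args) throws IOException
   {
      byte[] withBracket = { '<', 0, 0, 4, (byte) 0xD2 };
      byte[] withoutBracket = { 0, 0, 4, (byte) 0xD2 };
      System.out.println(readIntAfterOptionalBracket(withBracket));    // 1234
      System.out.println(readIntAfterOptionalBracket(withoutBracket)); // 1234
   }
}
```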
Of course, in the input/output libraries of other programming languages, niceties such as buffering and lookahead are automatically taken care of—so it is a bit of a hassle to resort, in Java, to combining stream filters. However, the ability to mix and match filter classes to construct useful sequences of input/output streams does give you an immense amount of flexibility. For example, you can read numbers from a compressed ZIP file by using the following sequence of input streams (see Figure 2.4):
var zin = new ZipInputStream(new FileInputStream("employee.zip"));
var din = new DataInputStream(zin);
(See Section 2.2.3, “ZIP Archives,” on p. 85 for more on Java’s handling of ZIP files.)
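A self-contained sketch of this combination follows. It builds a small archive in memory instead of opening employee.zip, and the names ZipDataDemo, zipDouble, and unzipDouble are hypothetical. Note that the ZIP stream must be positioned at an entry with getNextEntry before any data can be read.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipDataDemo
{
   // Creates an in-memory ZIP archive with one entry holding a double.
   static byte[] zipDouble(String entryName, double value) throws IOException
   {
      var bytes = new ByteArrayOutputStream();
      try (var zout = new ZipOutputStream(bytes))
      {
         zout.putNextEntry(new ZipEntry(entryName));
         var dout = new DataOutputStream(zout);
         dout.writeDouble(value);
         dout.flush();
         zout.closeEntry();
      }
      return bytes.toByteArray();
   }

   // Reads the double stored in the first entry of the archive.
   static double unzipDouble(byte[] archive) throws IOException
   {
      try (var zin = new ZipInputStream(new ByteArrayInputStream(archive)))
      {
         zin.getNextEntry(); // position the stream at the first entry
         var din = new DataInputStream(zin);
         return din.readDouble();
      }
   }

   public static void main(String[] args) throws IOException
   {
      byte[] archive = zipDouble("employee.dat", 75000);
      System.out.println(unzipDouble(archive)); // 75000.0
   }
}
```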
4. Text Input and Output
When saving data, you have the choice between binary and text formats. For example, if the integer 1234 is saved in binary, it is written as the sequence of bytes 00 00 04 D2 (in hexadecimal notation). In text format, it is saved as the string “1234”. Although binary I/O is fast and efficient, it is not easily readable by humans. We first discuss text I/O and cover binary I/O in Section 2.2, “Reading and Writing Binary Data,” on p. 78.
When saving text strings, you need to consider the character encoding. In the UTF-16 encoding that Java uses internally, the string “José” is encoded as 00 4A 00 6F 00 73 00 E9 (in hex). However, many programs expect that text files use a different encoding. In UTF-8, the encoding most commonly used on the Internet, the string would be written as 4A 6F 73 C3 A9, without the zero bytes for the first three letters and with two bytes for the é character.
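You can check these byte sequences yourself. The sketch below uses a hypothetical helper class EncodingDemo; it picks UTF_16BE so that no byte order mark is added, and it writes the accented letter as a Unicode escape so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo
{
   // Encodes the string and formats the resulting bytes as hex pairs.
   static String hex(String s, Charset charset)
   {
      var result = new StringBuilder();
      for (byte b : s.getBytes(charset))
      {
         if (result.length() > 0) result.append(' ');
         result.append(String.format("%02X", b & 0xFF)); // treat byte as unsigned
      }
      return result.toString();
   }

   public static void main(String[] args)
   {
      String name = "Jos\u00E9";
      System.out.println(hex(name, StandardCharsets.UTF_16BE)); // 00 4A 00 6F 00 73 00 E9
      System.out.println(hex(name, StandardCharsets.UTF_8));    // 4A 6F 73 C3 A9
   }
}
```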
The OutputStreamWriter class turns an output stream of Unicode code units into a stream of bytes, using a chosen character encoding. Conversely, the InputStreamReader class turns an input stream that contains bytes (specifying characters in some character encoding) into a reader that emits Unicode code units.
For example, here is how you make an input reader that reads keystrokes from the console and converts them to Unicode:
var in = new InputStreamReader(System.in);
This input stream reader assumes the default character encoding used by the host system. On desktop operating systems, that can be an archaic encoding such as Windows 1252 or MacRoman. You should always choose a specific encoding in the constructor for the InputStreamReader, for example:
var in = new InputStreamReader(new FileInputStream("data.txt"), StandardCharsets.UTF_8);
See Section 2.1.8, “Character Encodings,” on p. 75 for more information on character encodings.
The Reader and Writer classes have only basic methods to read and write individual characters. As with streams, you use subclasses for processing strings and numbers.
5. How to Write Text Output
For text output, use a PrintWriter. That class has methods to print strings and numbers in text format. To print to a file, construct a PrintWriter from a file name and a character encoding:
var out = new PrintWriter("employee.txt", StandardCharsets.UTF_8);
To write to a print writer, use the same print, println, and printf methods that you used with System.out. You can use these methods to print numbers (int, short, long, float, double), characters, boolean values, strings, and objects.
For example, consider this code:
String name = "Harry Hacker";
double salary = 75000;
out.print(name);
out.print(' ');
out.println(salary);
This writes the characters
Harry Hacker 75000.0
to the writer out. The characters are then converted to bytes and end up in the file employee.txt.
The println method adds the correct end-of-line character for the target system (“\r\n” on Windows, “\n” on UNIX) to the line. This is the string obtained by the call System.getProperty("line.separator").
If the writer is set to autoflush mode, all characters in the buffer are sent to their destination whenever println is called. (Print writers are always buffered.) By default, autoflushing is not enabled. You can enable or disable autoflushing by using the PrintWriter(Writer writer, boolean autoFlush) constructor:
var out = new PrintWriter(
   new OutputStreamWriter(
      new FileOutputStream("employee.txt"), StandardCharsets.UTF_8),
   true); // autoflush
The print methods don’t throw exceptions. You can call the checkError method to see if something went wrong with the output stream.
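Because failures are silent, it is worth seeing checkError in action. In this sketch, the class PrintWriterDemo, its methods, and the deliberately broken output stream are all hypothetical; the stream simulates a destination that rejects every write.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintWriter;
import java.io.StringWriter;

class PrintWriterDemo
{
   // Formats a record into a string by printing to an in-memory writer.
   static String format(String name, double salary)
   {
      var text = new StringWriter();
      try (var out = new PrintWriter(text))
      {
         out.print(name);
         out.print(' ');
         out.print(salary);
      }
      return text.toString();
   }

   // Prints to a stream that always fails and reports whether an error was recorded.
   static boolean failsSilently()
   {
      OutputStream broken = new OutputStream()
      {
         @Override
         public void write(int b) throws IOException
         {
            throw new IOException("disk full"); // simulated failure
         }
      };
      var out = new PrintWriter(broken);
      out.println("Harry Hacker"); // no exception reaches the caller
      return out.checkError();     // flushes, then reports the failure
   }

   public static void main(String[] args)
   {
      System.out.println(format("Harry Hacker", 75000)); // Harry Hacker 75000.0
      System.out.println(failsSilently());               // true
   }
}
```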
NOTE: Java veterans might wonder whatever happened to the PrintStream class and to System.out. In Java 1.0, the PrintStream class simply truncated all Unicode characters to ASCII characters by dropping the top byte. (At the time, Unicode was still a 16-bit encoding.) Clearly, that was not a clean or portable approach, and it was fixed with the introduction of readers and writers in Java 1.1. For compatibility with existing code, System.in, System.out, and System.err are still input/output streams, not readers and writers. But now the PrintStream class internally converts Unicode characters to the default host encoding in the same way the PrintWriter does. Objects of type PrintStream act exactly like print writers when you use the print and println methods, but unlike print writers they allow you to output raw bytes with the write(int) and write(byte[]) methods.
6. How to Read Text Input
The easiest way to process arbitrary text is the Scanner class that we used extensively in Volume I. You can construct a Scanner from any input stream.
Alternatively, you can read a short text file into a string like this:
var content = Files.readString(path, charset);
But if you want the file as a sequence of lines, call
List<String> lines = Files.readAllLines(path, charset);
If the file is large, process the lines lazily as a Stream<String>:
try (Stream<String> lines = Files.lines(path, charset))
{
   …
}
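The three approaches can be compared side by side. The following sketch writes a small temporary file first; the class FilesDemo and the method longestLine are hypothetical names, and Files.writeString is used only to set up the sample data.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

class FilesDemo
{
   // Finds the length of the longest line without loading the whole file.
   static int longestLine(Path path) throws IOException
   {
      try (Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8))
      {
         return lines.mapToInt(String::length).max().orElse(0);
      }
   }

   public static void main(String[] args) throws IOException
   {
      Path path = Files.createTempFile("names", ".txt");
      try
      {
         Files.writeString(path, "Harry\nCarl\nTony\n", StandardCharsets.UTF_8);

         String content = Files.readString(path, StandardCharsets.UTF_8);
         List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);

         System.out.println(content.length());  // 16
         System.out.println(lines);             // [Harry, Carl, Tony]
         System.out.println(longestLine(path)); // 5
      }
      finally
      {
         Files.deleteIfExists(path);
      }
   }
}
```

The try-with-resources statement around Files.lines matters: the stream holds the file open until it is closed.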
You can also use a scanner to read tokens—strings that are separated by a delimiter. The default delimiter is white space. You can change the delimiter to any regular expression. For example,
Scanner in = . . .;
in.useDelimiter("\\PL+");
treats any run of characters that are not Unicode letters as a delimiter. The scanner then yields tokens consisting only of Unicode letters.
Calling the next method yields the next token:
while (in.hasNext())
{
   String word = in.next();
   …
}
Alternatively, you can obtain a stream of all tokens as
Stream<String> words = in.tokens();
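Putting the delimiter and the token stream together, a word extractor might look like the sketch below. The class TokenDemo and its words method are hypothetical, and a string serves as the input source.

```java
import java.util.List;
import java.util.Scanner;
import java.util.stream.Collectors;

class TokenDemo
{
   // Splits the text into words, using runs of non-letters as delimiters.
   static List<String> words(String text)
   {
      try (var in = new Scanner(text))
      {
         in.useDelimiter("\\PL+");
         return in.tokens().collect(Collectors.toList());
      }
   }

   public static void main(String[] args)
   {
      System.out.println(words("Hello, world! 42 times")); // [Hello, world, times]
   }
}
```

Digits count as non-letters here, so "42" is swallowed by the delimiter along with the surrounding punctuation and spaces.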
In early versions of Java, the only game in town for processing text input was the BufferedReader class. Its readLine method yields a line of text, or null when no more input is available. A typical input loop looks like this:
InputStream inputStream = . . .;
try (var in = new BufferedReader(new InputStreamReader(inputStream, charset)))
{
   String line;
   while ((line = in.readLine()) != null)
   {
      // do something with line
   }
}
Nowadays, the BufferedReader class also has a lines method that yields a Stream<String>. However, unlike a Scanner, a BufferedReader has no methods for reading numbers.
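Since a BufferedReader hands you only strings, numbers must be parsed explicitly. Here is a minimal sketch, assuming input with one integer per line; the names LineSumDemo and sumLines are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LineSumDemo
{
   // Adds up the integers in the input, one integer per line.
   static int sumLines(InputStream inputStream, Charset charset) throws IOException
   {
      try (var in = new BufferedReader(new InputStreamReader(inputStream, charset)))
      {
         // Each line arrives as a string and is converted by hand
         return in.lines().mapToInt(line -> Integer.parseInt(line.strip())).sum();
      }
   }

   public static void main(String[] args) throws IOException
   {
      byte[] bytes = "10\n20\n12\n".getBytes(StandardCharsets.UTF_8);
      System.out.println(sumLines(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)); // 42
   }
}
```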
7. Saving Objects in Text Format
In this section, we walk you through an example program that stores an array of Employee records in a text file. Each record is stored in a separate line. Instance fields are separated from each other by delimiters. We use a vertical bar (|) as our delimiter. (A colon (:) is another popular choice. Part of the fun is that everyone uses a different delimiter.) Naturally, we punt on the issue of what might happen if a | actually occurs in one of the strings we save.
Here is a sample set of records:
Harry Hacker|35500|1989-10-01
Carl Cracker|75000|1987-12-15
Tony Tester|38000|1990-03-15
Writing records is simple. Since we write to a text file, we use the PrintWriter class. We simply write all fields, followed by either a | or, for the last field, a newline character. This work is done in the following static writeEmployee method:
public static void writeEmployee(PrintWriter out, Employee e)
{
   out.println(e.getName() + "|" + e.getSalary() + "|" + e.getHireDay());
}
To read records, we read in a line at a time and separate the fields. We use a scanner to read each line and then split the line into tokens with the String.split method.
public static Employee readEmployee(Scanner in)
{
   String line = in.nextLine();
   String[] tokens = line.split("\\|");
   String name = tokens[0];
   double salary = Double.parseDouble(tokens[1]);
   LocalDate hireDate = LocalDate.parse(tokens[2]);
   int year = hireDate.getYear();
   int month = hireDate.getMonthValue();
   int day = hireDate.getDayOfMonth();
   return new Employee(name, salary, year, month, day);
}
The parameter of the split method is a regular expression describing the separator. We discuss regular expressions in more detail at the end of this chapter. As it happens, the vertical bar character has a special meaning in regular expressions, so it needs to be escaped with a \ character. That character needs to be escaped by another \, yielding the “\\|” expression.
The complete program is in Listing 2.1. The static method
void writeData(Employee[] e, PrintWriter out)
first writes the length of the array, then writes each record. The static method
Employee[] readData(Scanner in)
first reads in the length of the array, then reads in each record. This turns out to be a bit tricky:
int n = in.nextInt();
in.nextLine(); // consume newline
var employees = new Employee[n];
for (int i = 0; i < n; i++)
{
   employees[i] = readEmployee(in);
}
The call to nextInt reads the array length but not the trailing newline character. We must consume the newline so that the readEmployee method can get the next input line when it calls the nextLine method.
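The newline issue can be reproduced without the Employee class. This sketch reads a count followed by that many raw lines; the class RecordCountDemo and its readLines method are hypothetical stand-ins for the reading logic described above.

```java
import java.util.Scanner;

class RecordCountDemo
{
   // Reads a count line followed by that many lines of text.
   static String[] readLines(Scanner in)
   {
      int n = in.nextInt();
      // Without this call, the first nextLine below would return the
      // empty remainder of the count line instead of the first record
      in.nextLine();
      var lines = new String[n];
      for (int i = 0; i < n; i++) lines[i] = in.nextLine();
      return lines;
   }

   public static void main(String[] args)
   {
      String input = "2\nHarry Hacker|35500|1989-10-01\nCarl Cracker|75000|1987-12-15\n";
      String[] records = readLines(new Scanner(input));
      System.out.println(records[0]); // Harry Hacker|35500|1989-10-01
   }
}
```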
8. Character Encodings
Input and output streams are for sequences of bytes, but in many cases you will work with texts—that is, sequences of characters. It then matters how characters are encoded into bytes.
Java uses the Unicode standard for characters. Each character, or “code point,” has a 21-bit integer number. There are different character encodings—methods for packaging those 21-bit numbers into bytes.
The most common encoding is UTF-8, which encodes each Unicode code point into a sequence of one to four bytes (see Table 2.1). UTF-8 has the advantage that the characters of the traditional ASCII character set, which contains all characters used in English, only take up one byte each.
Another common encoding is UTF-16, which encodes each Unicode code point into one or two 16-bit values (see Table 2.2). This is the encoding used in Java strings. Actually, there are two forms of UTF-16, called “big-endian” and “little-endian.” Consider the 16-bit value 0x2122. In the big-endian format, the more significant byte comes first: 0x21 followed by 0x22. In the little-endian format, it is the other way around: 0x22 0x21. To indicate which of the two is used, a file can start with the “byte order mark,” the 16-bit quantity 0xFEFF. A reader can use this value to determine the byte order and then discard it.
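The two byte orders and the byte order mark are easy to observe with the standard charsets. The sketch below encodes the single code unit 0x2122; the class ByteOrderDemo and its hex helper are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class ByteOrderDemo
{
   // Formats bytes as space-separated hex pairs.
   static String hex(byte[] bytes)
   {
      var result = new StringBuilder();
      for (byte b : bytes) result.append(String.format("%02X ", b & 0xFF));
      return result.toString().trim();
   }

   public static void main(String[] args)
   {
      String tm = "\u2122"; // a string holding the code unit 0x2122
      System.out.println(hex(tm.getBytes(StandardCharsets.UTF_16BE))); // 21 22
      System.out.println(hex(tm.getBytes(StandardCharsets.UTF_16LE))); // 22 21
      // Java's UTF_16 encoder writes a big-endian byte order mark first
      System.out.println(hex(tm.getBytes(StandardCharsets.UTF_16)));   // FE FF 21 22

      // When decoding with UTF_16, the mark selects the byte order and is dropped
      byte[] littleEndian = { (byte) 0xFF, (byte) 0xFE, 0x22, 0x21 };
      System.out.println(new String(littleEndian, StandardCharsets.UTF_16).equals(tm)); // true
   }
}
```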
In addition to the UTF encodings, there are partial encodings that cover a character range suitable for a given user population. For example, ISO 8859-1 is a one-byte code that includes accented characters used in Western European languages. Shift-JIS is a variable-length code for Japanese characters. A large number of these encodings are still in widespread use.
There is no reliable way to automatically detect the character encoding from a stream of bytes. Some API methods let you use the “default charset”—the character encoding preferred by the operating system of the computer. Is that the same encoding that is used by your source of bytes? These bytes may well originate from a different part of the world. Therefore, you should always explicitly specify the encoding. For example, when reading a web page, check the Content-Type header.
The StandardCharsets class has static variables of type Charset for the character encodings that every Java virtual machine must support:
StandardCharsets.UTF_8
StandardCharsets.UTF_16
StandardCharsets.UTF_16BE
StandardCharsets.UTF_16LE
StandardCharsets.ISO_8859_1
StandardCharsets.US_ASCII
To obtain the Charset for another encoding, use the static forName method:
Charset shiftJIS = Charset.forName("Shift-JIS");
Use the Charset object when reading or writing text. For example, you can turn an array of bytes into a string as
var str = new String(bytes, StandardCharsets.UTF_8);
Source: Horstmann Cay S. (2019), Core Java. Volume II – Advanced Features, Pearson; 11th edition.