Getting Web Data in Java Networking

To access web servers in a Java program, you will want to work at a higher level than socket connections and HTTP requests. In the following sections, we discuss the classes that the Java library provides for this purpose.

1. URLs and URIs

The URL and URLConnection classes encapsulate much of the complexity of retrieving information from a remote site. You can construct a URL object from a string:

var url = new URL(urlString);

If you simply want to fetch the contents of the resource, use the openStream method of the URL class. This method yields an InputStream object. Use it in the usual way—for example, to construct a Scanner:

InputStream inStream = url.openStream();

var in = new Scanner(inStream, StandardCharsets.UTF_8);
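As a minimal sketch of this pattern, the following helper reads the first few lines of a resource through openStream. The class name OpenStreamDemo and the method firstLines are hypothetical, chosen only for illustration; the try-with-resources statement is our addition so that the stream is closed when reading ends.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class OpenStreamDemo {
    // Hypothetical helper: returns at most maxLines lines of the resource.
    static List<String> firstLines(URL url, int maxLines) throws IOException {
        var lines = new ArrayList<String>();
        // Closing the Scanner also closes the underlying stream.
        try (InputStream inStream = url.openStream();
             var in = new Scanner(inStream, StandardCharsets.UTF_8)) {
            while (lines.size() < maxLines && in.hasNextLine()) {
                lines.add(in.nextLine());
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Assumes an absolute URL is passed as the first command-line argument.
        URL url = URI.create(args[0]).toURL();
        firstLines(url, 10).forEach(System.out::println);
    }
}
```

The helper works with any scheme that the URL class can handle, including file: URLs, which is convenient for trying it out without a network connection.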

The java.net package makes a useful distinction between URLs (uniform resource locators) and URIs (uniform resource identifiers).

A URI is a purely syntactical construct that contains the various parts of the string specifying a web resource. A URL is a special kind of URI, namely, one with sufficient information to locate a resource. Other URIs, such as

mailto:cay@horstmann.com

are not locators—there is no data to locate from this identifier. Such a URI is called a URN (uniform resource name).

In the Java library, the URI class has no methods for accessing the resource that the identifier specifies; its sole purpose is parsing. In contrast, the URL class can open a stream to the resource. For that reason, the URL class only works with schemes that the Java library knows how to handle, such as http:, https:, ftp:, the local file system (file:), and JAR files (jar:).

To see why parsing is not trivial, consider how complex URIs can be. For example,

http:/google.com?q=Beach+Chalet

ftp://username:password@ftp.yourserver.com/pub/file.txt

The URI specification gives the rules for the makeup of these identifiers. A URI has the syntax

[scheme:]schemeSpecificPart[#fragment]

Here, the [...] denotes an optional part, and the : and # are included literally in the identifier.

If the scheme: part is present, the URI is called absolute. Otherwise, it is called relative.

An absolute URI is opaque if the schemeSpecificPart does not begin with a /, such as

mailto:cay@horstmann.com

All absolute nonopaque URIs and all relative URIs are hierarchical. Examples are

http://horstmann.com/index.html

../../java/net/Socket.html#Socket()
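The URI class can report these categories directly through its isAbsolute and isOpaque methods. The following sketch classifies the three example identifiers; the class name UriKinds and the method kind are hypothetical.

```java
import java.net.URI;

class UriKinds {
    // Classifies a URI using the terms defined in the text.
    static String kind(String s) {
        URI uri = URI.create(s);
        if (!uri.isAbsolute()) return "relative, hierarchical";
        return uri.isOpaque() ? "absolute, opaque" : "absolute, hierarchical";
    }

    public static void main(String[] args) {
        String[] examples = {
            "mailto:cay@horstmann.com",
            "http://horstmann.com/index.html",
            "../../java/net/Socket.html#Socket()"
        };
        for (String s : examples) {
            System.out.println(s + " -> " + kind(s));
        }
    }
}
```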

The schemeSpecificPart of a hierarchical URI has the structure

[//authority][path][?query]

where, again, [...] denotes optional parts.

For server-based URIs, the authority part has the form

[user-info@]host[:port]

The port must be an integer.

RFC 2396, which standardizes URIs, also supports a registry-based mechanism in which the authority has a different format, but this is not in common use.

One of the purposes of the URI class is to parse an identifier and break it up into its components. You can retrieve them with the methods

getScheme

getSchemeSpecificPart

getAuthority

getUserInfo

getHost

getPort

getPath

getQuery

getFragment
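To see what these methods return, here is a small sketch that breaks up the FTP identifier shown earlier. The class name UriParts and the method parts are hypothetical. Components that are absent are reported as null, or as -1 in the case of the port.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

class UriParts {
    // Collects the components of a URI in the order listed above.
    static Map<String, Object> parts(URI uri) {
        var result = new LinkedHashMap<String, Object>();
        result.put("scheme", uri.getScheme());
        result.put("schemeSpecificPart", uri.getSchemeSpecificPart());
        result.put("authority", uri.getAuthority());
        result.put("userInfo", uri.getUserInfo());
        result.put("host", uri.getHost());
        result.put("port", uri.getPort()); // -1 if no port is given
        result.put("path", uri.getPath());
        result.put("query", uri.getQuery());
        result.put("fragment", uri.getFragment());
        return result;
    }

    public static void main(String[] args) {
        URI uri = URI.create("ftp://username:password@ftp.yourserver.com/pub/file.txt");
        parts(uri).forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```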

The other purpose of the URI class is the handling of absolute and relative identifiers. If you have an absolute URI such as

http://docs.mycompany.com/api/java/net/ServerSocket.html

and a relative URI such as

../../java/net/Socket.html#Socket()

then you can combine the two into an absolute URI.

http://docs.mycompany.com/api/java/net/Socket.html#Socket()

This process is called resolving a relative URL.

The opposite process is called relativization. For example, suppose you have a base URI

http://docs.mycompany.com/api

and a URI

http://docs.mycompany.com/api/java/lang/String.html

Then the relativized URI is

java/lang/String.html

The URI class supports both of these operations:

relative = base.relativize(combined);

combined = base.resolve(relative);
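The following sketch runs both operations on the example identifiers from this section. The class name ResolveDemo and its two string-based wrapper methods are hypothetical conveniences; the actual work is done by the resolve and relativize methods of the URI class.

```java
import java.net.URI;

class ResolveDemo {
    // Combines a base URI with a relative one.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(URI.create(relative)).toString();
    }

    // Expresses a URI relative to a base URI.
    static String relativize(String base, String full) {
        return URI.create(base).relativize(URI.create(full)).toString();
    }

    public static void main(String[] args) {
        System.out.println(resolve(
            "http://docs.mycompany.com/api/java/net/ServerSocket.html",
            "../../java/net/Socket.html#Socket()"));
        System.out.println(relativize(
            "http://docs.mycompany.com/api",
            "http://docs.mycompany.com/api/java/lang/String.html"));
    }
}
```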

2. Using a URLConnection to Retrieve Information

If you want additional information about a web resource, you should use the URLConnection class, which gives you much more control than the basic URL class.

When working with a URLConnection object, you must carefully schedule your steps.

Call the openConnection method of the URL class to obtain the URLConnection object:

URLConnection connection = url.openConnection();

Set any request properties, using the methods

setDoInput

setDoOutput

setIfModifiedSince

setUseCaches

setAllowUserInteraction

setRequestProperty

setConnectTimeout

setReadTimeout

We discuss these methods later in this section and in the API notes.

Connect to the remote resource by calling the connect method:

connection.connect();

Besides making a socket connection to the server, this method also queries the server for header information.

After connecting to the server, you can query the header information. Two methods, getHeaderFieldKey and getHeaderField, enumerate all fields of the header. The method getHeaderFields gets a standard Map object containing the header fields. For your convenience, the following methods query standard fields:

getContentType

getContentLength

getContentEncoding

getDate

getExpiration

getLastModified

Finally, you can access the resource data. Use the getInputStream method to obtain an input stream for reading the information. (This is the same input stream that the openStream method of the URL class returns.) The other method, getContent, isn’t very useful in practice. The objects that are returned by standard content types such as text/plain and image/gif require classes in the sun hierarchy for processing. You could register your own content handlers, but we do not discuss this technique in our book.
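The sequence of steps can be sketched as follows. The class name ConnectionSteps, the method firstLine, and the timeout values are assumptions made for illustration; only a few of the available request properties are set.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

class ConnectionSteps {
    // Hypothetical helper: walks through the steps in order and returns
    // the first line of the resource, or "" if the resource is empty.
    static String firstLine(String urlString) throws IOException {
        // Obtain the connection object from the URL.
        URLConnection connection = URI.create(urlString).toURL().openConnection();

        // Set request properties before connecting.
        connection.setConnectTimeout(5000); // milliseconds; arbitrary value
        connection.setReadTimeout(5000);    // milliseconds; arbitrary value
        connection.setUseCaches(false);

        // Connect to the remote resource.
        connection.connect();

        // Query header information.
        System.out.println("Content type: " + connection.getContentType());
        System.out.println("Content length: " + connection.getContentLength());

        // Access the resource data.
        try (InputStream inStream = connection.getInputStream();
             var in = new Scanner(inStream, StandardCharsets.UTF_8)) {
            return in.hasNextLine() ? in.nextLine() : "";
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumes an absolute URL is passed as the first command-line argument.
        System.out.println(firstLine(args[0]));
    }
}
```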

Let us now look at some of the URLConnection methods in detail. Several methods set properties of the connection before connecting to the server. The most important ones are setDoInput and setDoOutput. By default, the connection yields an input stream for reading from the server but no output stream for writing. If you want an output stream (for example, for posting data to a web server), you need to call

connection.setDoOutput(true);

Next, you may want to set some of the request headers. The request headers are sent together with the request command to the server. Here is an example:

GET www.server.com/index.html HTTP/1.0

Referer: http://www.somewhere.com/links.html

Proxy-Connection: Keep-Alive

User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.4)

Host: www.server.com

Accept: text/html, image/gif, image/jpeg, image/png, */*

Accept-Language: en

Accept-Charset: iso-8859-1,*,utf-8

Cookie: orangemilano=192218887821987

The setIfModifiedSince method tells the connection that you are only interested in data modified since a certain date.

Finally, you can use the catch-all setRequestProperty method to set any name/value pair that is meaningful for the particular protocol. For the format of the HTTP request headers, see RFC 2616. Some of these parameters are not well documented and are passed around by word of mouth from one programmer to the next. For example, if you want to access a password-protected web page, you must do the following:

Concatenate the user name, a colon, and the password.

String input = username + ":" + password;

Compute the Base64 encoding of the resulting string. (The Base64 encoding encodes a sequence of bytes into a sequence of printable ASCII characters.)

Base64.Encoder encoder = Base64.getEncoder();

String encoding = encoder.encodeToString(input.getBytes(StandardCharsets.UTF_8));

Call the setRequestProperty method with a name of "Authorization" and the value "Basic " + encoding.

connection.setRequestProperty("Authorization", "Basic " + encoding);
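The three steps fit into one small helper method. This is a sketch only; the class name BasicAuth and the method headerValue are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class BasicAuth {
    // Builds the value of the Authorization header for basic authentication.
    static String headerValue(String username, String password) {
        // Concatenate the user name, a colon, and the password.
        String input = username + ":" + password;
        // Compute the Base64 encoding of the resulting bytes.
        String encoding = Base64.getEncoder()
            .encodeToString(input.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoding;
    }

    public static void main(String[] args) {
        // Sample credentials for demonstration only.
        System.out.println(headerValue("user", "password"));
    }
}
```

You would then call connection.setRequestProperty("Authorization", BasicAuth.headerValue(username, password)) before connecting. Keep in mind that Base64 is an encoding, not encryption, so anyone who can read the request can recover the password.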

Once you call the connect method, you can query the response header information. First, let’s see how to enumerate all response header fields. The implementors of this class felt a need to express their individuality by introducing yet another iteration protocol. The call

String key = connection.getHeaderFieldKey(n);

gets the nth key from the response header, where n starts from 1! It returns null if n is zero or greater than the total number of header fields. There is no method to return the number of fields; you simply keep calling getHeaderFieldKey until you get null. Similarly, the call

String value = connection.getHeaderField(n);

returns the nth value.

The method getHeaderFields returns a Map of response header fields.

Map<String,List<String>> headerFields = connection.getHeaderFields();
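Both ways of enumerating the header fields can be sketched as follows. The class name HeaderDump and its methods are hypothetical. The index-based loop follows the protocol described above: it starts at 1 and stops at the first null key.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HeaderDump {
    // Collects the header fields with the index-based protocol.
    // If a key occurs more than once, only the last value is kept.
    static Map<String, String> byIndex(URLConnection connection) {
        var result = new LinkedHashMap<String, String>();
        String key;
        for (int n = 1; (key = connection.getHeaderFieldKey(n)) != null; n++) {
            result.put(key, connection.getHeaderField(n));
        }
        return result;
    }

    // Prints the same information from the map. A field can have several
    // values; for HTTP, the status line typically appears under a null key.
    static void printAll(URLConnection connection) {
        for (Map.Entry<String, List<String>> entry
                : connection.getHeaderFields().entrySet()) {
            System.out.println(entry.getKey() + ": "
                + String.join(", ", entry.getValue()));
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumes an absolute URL is passed as the first command-line argument.
        URLConnection connection = URI.create(args[0]).toURL().openConnection();
        byIndex(connection).forEach((k, v) -> System.out.println(k + ": " + v));
        printAll(connection);
    }
}
```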

Here is a set of response header fields from a typical HTTP request:

Date: Wed, 27 Aug 2008 00:15:48 GMT

Server: Apache/2.2.2 (Unix)

Last-Modified: Sun, 22 Jun 2008 20:53:38 GMT

Accept-Ranges: bytes

Content-Length: 4813

Connection: close

Content-Type: text/html

As a convenience, six methods query the values of the most common header types and convert them to numeric types when appropriate. Table 4.1 shows these convenience methods. The methods with return type long return the number of milliseconds since January 1, 1970 GMT.
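Here is a sketch of how you might call the convenience methods and turn the long values into java.time.Instant objects. It relies on the URLConnection API documentation, which describes getDate, getExpiration, and getLastModified as returning milliseconds since the epoch, or 0 if the value is not known. The class name HeaderDates and its methods are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;
import java.time.Instant;
import java.util.Optional;

class HeaderDates {
    // Converts a header timestamp to an Instant; 0 means "not known".
    static Optional<Instant> toInstant(long millis) {
        return millis == 0
            ? Optional.empty()
            : Optional.of(Instant.ofEpochMilli(millis));
    }

    static void printConvenienceFields(URLConnection connection) {
        System.out.println("getContentType: " + connection.getContentType());
        System.out.println("getContentLength: " + connection.getContentLength());
        System.out.println("getContentEncoding: " + connection.getContentEncoding());
        System.out.println("getDate: " + toInstant(connection.getDate()));
        System.out.println("getExpiration: " + toInstant(connection.getExpiration()));
        System.out.println("getLastModified: " + toInstant(connection.getLastModified()));
    }

    public static void main(String[] args) throws IOException {
        // Assumes an absolute URL is passed as the first command-line argument.
        printConvenienceFields(URI.create(args[0]).toURL().openConnection());
    }
}
```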

The program in Listing 4.6 lets you experiment with URL connections. Supply a URL and an optional user name and password on the command line when running the program, for example:

java urlConnection.URLConnectionTest http://www.yourserver.com user password

The program prints

All keys and values of the header
The return values of the six convenience methods in Table 4.1
The first ten lines of the requested resource

3. Posting Form Data

In the preceding section, you saw how to read data from a web server. Now we will show you how your programs can send data back to a web server and to programs that the web server invokes.

To send information from a web browser to the web server, a user fills out a form, like the one in Figure 4.7.

When the user clicks the Submit button, the text in the text fields and the settings of any checkboxes, radio buttons, and other input elements are sent back to the web server. The web server invokes a program that processes the user input.

Many technologies enable web servers to invoke programs. Among the best known ones are Java servlets, JavaServer Faces, Microsoft Active Server Pages (ASP), and Common Gateway Interface (CGI) scripts.

The server-side program processes the form data and produces another HTML page that the web server sends back to the browser. This sequence is illustrated in Figure 4.8. The response page can contain new information (for example, in an information-search program) or just an acknowledgment. The web browser then displays the response page.

We do not discuss the implementation of server-side programs in this book. Our interest is merely in writing client programs that interact with existing server-side programs.

When form data are sent to a web server, it does not matter whether the data are interpreted by a servlet, a CGI script, or some other server-side technology. The client sends the data to the web server in a standard format, and the web server takes care of passing it on to the program that generates the response.

Two commands, called GET and POST, are commonly used to send information to a web server.

In the GET command, you simply attach query parameters to the end of the URL. The URL has the form

http://host/path?query

Each parameter has the form name=value. Parameters are separated by & characters. Parameter values are encoded using the URL encoding scheme, following these rules:

Leave the characters A through Z, a through z, 0 through 9, and . - ~ _ unchanged.
Replace all spaces with + characters.
Encode all other characters into UTF-8 and encode each byte by a %, followed by a two-digit hexadecimal number.

For example, to transmit San Francisco, CA, you use San+Francisco%2c+CA, as the hexadecimal number 2c is the UTF-8 code of the ‘,’ character.

This encoding keeps any intermediate programs from messing with spaces and other special characters.
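In Java, the URLEncoder class carries out this encoding. The sketch below encodes the example value and assembles a query string from name/value pairs; the class name QueryStrings, the method names, and the parameter names city and state are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class QueryStrings {
    // URL-encodes a single name or value, using UTF-8 for the bytes.
    static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Joins name=value pairs with & characters.
    static String toQuery(Map<String, String> params) {
        return params.entrySet().stream()
            .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
            .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        System.out.println(encode("San Francisco, CA"));

        var params = new LinkedHashMap<String, String>();
        params.put("city", "San Francisco");
        params.put("state", "CA");
        System.out.println("http://host/path?" + toQuery(params));
    }
}
```

URLEncoder writes the hexadecimal digits in uppercase (%2C rather than %2c); both spellings denote the same byte. According to its API documentation, URLEncoder also differs slightly from the rules above in that it leaves * unchanged and escapes ~.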

At the time of this writing, the Google Maps site (www.google.com/maps) accepts query parameters with names q and hl whose values are the location query and the human language of the response. To get a map of 1 Market Street in San Francisco, with a response in German, use the following URL:

http://www.google.com/maps?q=1+Market+Street+San+Francisco&hl=de

Very long query strings can look unattractive in browsers, and older browsers and proxies have a limit on the number of characters that you can include in a GET request. For that reason, a POST request is often used for forms with a lot of data. In a POST request, you do not attach parameters to a URL; instead, you get an output stream from the URLConnection and write name/value pairs to the output stream. You still have to URL-encode the values and separate them with & characters.

Let us look at this process in detail. To post data to a server-side program, first establish a URLConnection:

var url = new URL("http://host/path");

URLConnection connection = url.openConnection();

Then, call the setDoOutput method to set up the connection for output:

connection.setDoOutput(true);

Next, call getOutputStream to get a stream through which you can send data to the server. If you are sending text to the server, it is convenient to wrap that stream into a PrintWriter.

var out = new PrintWriter(connection.getOutputStream(), false, StandardCharsets.UTF_8);

Now you are ready to send data to the server:

out.print(name1 + "=" + URLEncoder.encode(value1, StandardCharsets.UTF_8) + "&");

out.print(name2 + "=" + URLEncoder.encode(value2, StandardCharsets.UTF_8));

Close the output stream:

out.close();

Finally, call getInputStream and read the server response.
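Put together, the steps might look like the following sketch. The class name PostSketch and its methods are hypothetical. Two choices here are our own assumptions rather than part of the steps above: the parameter names are URL-encoded along with the values, and the response is decoded as UTF-8.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.net.URI;
import java.net.URLConnection;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

class PostSketch {
    // Builds the request body: name=value pairs separated by & characters.
    static String formBody(Map<String, String> pairs) {
        var body = new StringBuilder();
        for (Map.Entry<String, String> pair : pairs.entrySet()) {
            if (body.length() > 0) body.append('&');
            body.append(URLEncoder.encode(pair.getKey(), StandardCharsets.UTF_8));
            body.append('=');
            body.append(URLEncoder.encode(pair.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    // Posts the pairs and returns the response body as a string.
    static String post(String urlString, Map<String, String> pairs)
            throws IOException {
        URLConnection connection = URI.create(urlString).toURL().openConnection();
        connection.setDoOutput(true); // we intend to write a request body

        // Closing the writer completes the request body.
        try (var out = new PrintWriter(connection.getOutputStream(),
                false, StandardCharsets.UTF_8)) {
            out.print(formBody(pairs));
        }

        // Reading the response triggers the actual exchange with the server.
        try (InputStream in = connection.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumes the URL of a server-side program as the first argument;
        // the parameter name and value are placeholders.
        System.out.println(post(args[0], Map.of("name1", "value 1")));
    }
}
```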

Let’s run through a practical example. The web site at https://tools.usps.com/zip-code-lookup.htm?byaddress contains a form to find the zip code for a street address (see Figure 4.7). To use this form in a Java program, you need to know the URL and the parameters of the POST request.

You could get that information by looking at the HTML code of the form, but it is usually easier to “spy” on a request with a network monitor. Most browsers have a network monitor as part of their development toolkit. For example, Figure 4.9 shows a screen capture of the Firefox network monitor when submitting data to our example web site. You can find out the submission URL as well as the parameter names and values.

When posting form data, the HTTP header includes the content type:

Content-Type: application/x-www-form-urlencoded

You can also post data in other formats. For example, when sending data in JavaScript Object Notation (JSON), set the content type to application/json.

The header for a POST must also include the content length, for example

Content-Length: 124

The program in Listing 4.7 sends POST form data to any server-side program. Place the data into a .properties file such as the following:

url=https://tools.usps.com/tools/app/ziplookup/zipByAddress

User-Agent=HTTPie/0.9.2

address1=1 Market Street

address2=

city=San Francisco

state=CA

companyName=

…

The program removes the url and User-Agent entries and sends all others to the doPost method.

In the doPost method, we first open the connection and set the user agent. (The zip code service does not work with the default User-Agent request parameter, which contains the string Java, perhaps because the postal service doesn’t want to serve programmatic requests.)

Then we call setDoOutput(true), and open the output stream. We then enumerate all keys and values. For each of them, we send the key, = character, value, and & separator character:

out.print(key);

out.print('=');

out.print(URLEncoder.encode(value, StandardCharsets.UTF_8));

if (more pairs) out.print('&');

When switching from writing to reading any part of the response, the actual interaction with the server happens. The Content-Length header is set to the size of the output. The Content-Type header is set to application/x-www-form-urlencoded unless a different content type was specified. The headers and data are sent to the server. Then the response headers and server response are read and can be queried. In our example program, this switch happens in the call to connection.getContentEncoding().

There is one twist with reading the response. If a server-side error occurs, the call to connection.getInputStream() throws a FileNotFoundException. However, the server still sends an error page back to the browser (such as the ubiquitous “Error 404—page not found”). To capture this error page, call the getErrorStream method:

InputStream err = connection.getErrorStream();
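One way to combine the two streams is sketched below: try the regular input stream first, and fall back to the error stream if the server reported an error. The class name ErrorBody and the method readBody are hypothetical, and the response is assumed to be UTF-8 text.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class ErrorBody {
    // Returns the response body, or the error page if the request failed.
    static String readBody(HttpURLConnection connection) throws IOException {
        InputStream stream;
        try {
            stream = connection.getInputStream();
        } catch (IOException e) {
            stream = connection.getErrorStream();
            // getErrorStream returns null if the server sent no error page.
            if (stream == null) throw e;
        }
        try (InputStream in = stream) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumes an http: or https: URL as the first command-line argument.
        var connection = (HttpURLConnection) URI.create(args[0]).toURL().openConnection();
        System.out.println(readBody(connection));
    }
}
```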

When you send POST data to a server, it can happen that the server-side program responds with a redirect: a different URL that should be called to get the actual information. The server could do that because the information is available elsewhere, or to provide a bookmarkable URL. The HttpURLConnection class can handle redirects in most cases.

Even though redirects are usually automatically handled, there are some situations where you need to do them yourself. Automatic redirects between HTTP and HTTPS are not supported for security reasons. Redirects can also fail for more subtle reasons. For example, an earlier version of the zip code service used a redirect. Recall that we set the User-Agent request parameter so that the post office didn’t think we made a request via the Java API. While it is possible to set the user agent to a different string in the initial request, that setting is not used in automatic redirects. Instead, automatic redirects always send a generic user agent string that contains the word Java.

In such situations, you can manually carry out the redirects. Before connecting to the server, turn off automatic redirects:

connection.setInstanceFollowRedirects(false);

After making the request, get the response code:

int responseCode = connection.getResponseCode();

Check if it is one of

HttpURLConnection.HTTP_MOVED_PERM

HttpURLConnection.HTTP_MOVED_TEMP

HttpURLConnection.HTTP_SEE_OTHER

In that case, get the Location response header to obtain the URL for the redirect. Then disconnect and make another connection to the new URL:

String location = connection.getHeaderField("Location");

if (location != null)

{

URL base = connection.getURL();

connection.disconnect();

connection = (HttpURLConnection) new URL(base, location).openConnection();

…

}
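The manual redirect logic can be collected into a loop, as in the following sketch. The class name ManualRedirect, its methods, and the limit on the number of hops are our own assumptions. For simplicity, the sketch issues plain GET requests and resolves the Location header with the URI class; it sends the same User-Agent string on every hop.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

class ManualRedirect {
    // Checks for the three redirect codes named in the text.
    static boolean isRedirect(int responseCode) {
        return responseCode == HttpURLConnection.HTTP_MOVED_PERM
            || responseCode == HttpURLConnection.HTTP_MOVED_TEMP
            || responseCode == HttpURLConnection.HTTP_SEE_OTHER;
    }

    // Resolves a Location value, which may be relative, against the request URI.
    static URI target(URI base, String location) {
        return base.resolve(location);
    }

    // Follows at most maxHops redirects. Assumes an http: or https: URI.
    static HttpURLConnection open(URI uri, String userAgent, int maxHops)
            throws IOException {
        for (int hop = 0; hop <= maxHops; hop++) {
            var connection = (HttpURLConnection) uri.toURL().openConnection();
            connection.setInstanceFollowRedirects(false);
            connection.setRequestProperty("User-Agent", userAgent);

            int responseCode = connection.getResponseCode();
            String location = connection.getHeaderField("Location");
            if (!isRedirect(responseCode) || location == null) {
                return connection; // the caller reads the response
            }
            connection.disconnect();
            uri = target(uri, location);
        }
        throw new IOException("Too many redirects");
    }

    public static void main(String[] args) throws IOException {
        // Assumes a URL as the first argument; the user agent is a placeholder.
        HttpURLConnection connection = open(URI.create(args[0]), "MyClient/1.0", 5);
        System.out.println(connection.getURL() + " -> " + connection.getResponseCode());
    }
}
```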

The techniques that this program illustrates can be useful whenever you need to query information from an existing web site. Simply find out the parameters that you need to send, and then strip out the HTML tags and other unnecessary information from the reply.