CHAPTER 18 Strings in C#

All strings in C# are instances of the System.String type in the CLR. Because of this, many built-in operations are available that work with strings. For example, the String class defines an indexer function that can be used to iterate over the characters of the string:

using System;
class Test
{
    public static void Main()
    {
        string s = "Test String";
        for (int index = 0; index < s.Length; index++)
            Console.WriteLine("Char: {0}", s[index]);
    }
}

1. Operations

The string class is an example of an immutable type, which means the characters contained in the string can’t be modified by users of the string. All modifying operations on a string return a new string instance rather than modifying the instance on which the method is called.

Immutable types are used to make reference types that have value semantics (in other words, act somewhat like value types).
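As a minimal sketch of what immutability means in practice, the following example calls two modifying methods and shows that the original string is unchanged. The class name and sample values are illustrative only.

```csharp
using System;

class ImmutabilitySketch
{
    public static void Main()
    {
        string original = "Test String";

        // Each "modifying" call returns a new string instance.
        string upper = original.ToUpperInvariant();
        string replaced = original.Replace("Test", "Demo");

        Console.WriteLine(original);  // Test String (unchanged)
        Console.WriteLine(upper);     // TEST STRING
        Console.WriteLine(replaced);  // Demo String

        // original[0] = 't';  // would not compile: the indexer is read-only
    }
}
```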

The string class supports the comparison and searching methods listed in Table 18-1.

The string class supports the modification methods described in Table 18-2, which all return a new string instance.

2. String Encodings and Conversions

From the C# perspective, strings are always Unicode strings. When dealing only in the .NET world, this greatly simplifies working with strings.

Unfortunately, it’s sometimes necessary to deal with the messy details of other kinds of strings, especially when dealing with text files produced by older applications. The System.Text namespace contains classes that can be used to convert between an array of bytes and a character encoding such as ASCII, Unicode, UTF7, or UTF8. Each encoding is encapsulated in a class such as ASCIIEncoding.

To convert from a string to a block of bytes, the GetEncoder() method on the encoding class is called to obtain an Encoder, which is then used to do the encoding. Similarly, to convert from a block of bytes in a specific encoding back to characters, GetDecoder() is called to obtain a Decoder.
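The following sketch shows one way such a round trip might look, using UTF-8 as an arbitrary choice of encoding. It first uses the one-step GetBytes() and GetString() methods on the Encoding class, and then repeats the conversion through an explicit Encoder and Decoder. The class name and sample text are illustrative only.

```csharp
using System;
using System.Text;

class EncodingSketch
{
    public static void Main()
    {
        string text = "Test String";

        // One-step conversion using the Encoding class directly.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        string roundTrip = Encoding.UTF8.GetString(utf8Bytes);
        Console.WriteLine("{0} bytes, round trip: {1}", utf8Bytes.Length, roundTrip);

        // The same conversion through an explicit Encoder and Decoder.
        char[] chars = text.ToCharArray();
        Encoder encoder = Encoding.UTF8.GetEncoder();
        byte[] buffer = new byte[encoder.GetByteCount(chars, 0, chars.Length, true)];
        encoder.GetBytes(chars, 0, chars.Length, buffer, 0, true);

        Decoder decoder = Encoding.UTF8.GetDecoder();
        char[] decoded = new char[decoder.GetCharCount(buffer, 0, buffer.Length)];
        decoder.GetChars(buffer, 0, buffer.Length, decoded, 0);
        Console.WriteLine(new string(decoded));
    }
}
```

The Encoder and Decoder objects keep state between calls, which matters mainly when the data arrives in several chunks; for a single in-memory string, the one-step methods are usually enough.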

3. Converting Objects to Strings

The function object.ToString() is overridden by the built-in types to provide an easy way of converting from a value to a string representation of that value. Calling ToString() produces the default representation of a value; you can obtain a different representation by calling String.Format(). See Chapter 34 for more information.
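As a small illustration of the difference, the sketch below prints the default representation of a value and then two alternative representations produced by String.Format(). The invariant culture is passed explicitly (an assumption made here so that the output doesn’t depend on the machine’s regional settings), and the format strings are just examples.

```csharp
using System;
using System.Globalization;

class FormatSketch
{
    public static void Main()
    {
        double value = 1234.5;
        CultureInfo inv = CultureInfo.InvariantCulture;

        // Default representation.
        Console.WriteLine(value.ToString(inv));                 // 1234.5

        // Different representations of values via String.Format().
        Console.WriteLine(String.Format(inv, "{0:F2}", value)); // 1234.50
        Console.WriteLine(String.Format(inv, "{0:X}", 255));    // FF
    }
}
```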

4. An Example

You can use the Split function to break a string into substrings at separators:

using System;
class Test
{
    public static void Main()
    {
        string s = "Oh, I hadn't thought of that";
        char[] separators = new char[] {' ', ','};
        foreach (string sub in s.Split(separators))
        {
            Console.WriteLine("Word: {0}", sub);
        }
    }
}

This example produces the following output:

Word: Oh
Word:
Word: I
Word: hadn’t
Word: thought
Word: of
Word: that

The separators character array defines what characters the string will be broken on. The Split() function returns an array of strings, and the foreach statement iterates over the array and prints it out.

In this case, the output isn’t particularly useful because the “, ” sequence causes the string to be split twice, once at the comma and once at the space, which leaves an empty entry between them. You can fix this by using the regular expression classes.
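If a regular expression feels like more than the job needs, one alternative is the Split() overload that takes a StringSplitOptions value, assuming .NET 2.0 or later is available. The sketch below is not the approach the chapter goes on to use; it simply discards the empty entries.

```csharp
using System;

class SplitOptionsSketch
{
    public static void Main()
    {
        string s = "Oh, I hadn't thought of that";
        char[] separators = { ' ', ',' };

        // RemoveEmptyEntries drops the empty string found between
        // the comma and the space that follows it.
        string[] words = s.Split(separators, StringSplitOptions.RemoveEmptyEntries);

        foreach (string word in words)
        {
            Console.WriteLine("Word: {0}", word);
        }
        // Prints six lines: Oh, I, hadn't, thought, of, that
    }
}
```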

5. StringBuilder

Though you can use the String.Format() function to create a string based on the values of other strings, it isn’t necessarily the most efficient way to assemble strings. The runtime provides the StringBuilder class to make this process easier.

The StringBuilder class supports the properties and methods described in Table 18-3 and Table 18-4.

The following example demonstrates how you can use the StringBuilder class to create a string from separate strings:

using System;
using System.Text;
class Test
{
    public static void Main()
    {
        string s = "I will not buy this record, it is scratched";
        char[] separators = new char[] {' ', ','};
        StringBuilder sb = new StringBuilder();
        int number = 1;
        foreach (string sub in s.Split(separators))
        {
            sb.AppendFormat("{0}: {1} ", number++, sub);
        }
        Console.WriteLine("{0}", sb);
    }
}

This code will create a string with numbered words and will produce the following output:

1: I 2: will 3: not 4: buy 5: this 6: record 7:  8: it 9: is 10: scratched

Because the call to Split() specified both the space and the comma as separators, it considers there to be a word between the comma and the following space, which results in an empty entry.

6. Regular Expressions

If the searching functions found in the string class aren’t powerful enough, the System.Text.RegularExpressions namespace contains a regular expression class named Regex. Regular expressions provide a powerful method for doing search and/or replace functions.

Although this section contains a few examples of using regular expressions, a detailed description of them is beyond the scope of the book. Several regular expression books are available, and the subject is also covered in most books about Perl. Mastering Regular Expressions, Second Edition (O’Reilly, 2002), by Jeffrey E. F. Friedl, and Regular Expression Recipes: A Problem-Solution Approach (Apress, 2005), by Nathan A. Good, are two great references.

The regular expression class can use a rather interesting technique to get maximum performance. When the Compiled option is specified, rather than interpret the regular expression for each match, it writes a short program on the fly to implement the regular expression match, and that code is then run.

You can revise the previous example of Split() to use a regular expression, rather than single characters, to specify how the split should occur. This will remove the blank word that was found in the preceding example.

// file: regex.cs
using System;
using System.Text.RegularExpressions;
class Test
{
    public static void Main()
    {
        string s = "Oh, I hadn't thought of that";
        // split on a single space, or on a comma followed by a space
        Regex regex = new Regex(@" |, ");
        foreach (string sub in regex.Split(s))
        {
            Console.WriteLine("Word: {0}", sub);
        }
    }
}

This example produces the following output:

Word: Oh
Word: I
Word: hadn’t
Word: thought
Word: of
Word: that

In the regular expression, the string is split either on a space or on a comma followed by a space.

7. Regular Expression Options

When creating a regular expression, you can specify several options to control how the matches are performed. Compiled is especially useful to speed up searches that use the same regular expression multiple times. Table 18-5 lists the regular expression options.
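Options are passed as the second argument to the Regex constructor and can be combined with the | operator. The sketch below combines Compiled with IgnoreCase; the pattern and the sample inputs are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class RegexOptionsSketch
{
    public static void Main()
    {
        // Compiled pays a one-time setup cost to speed up repeated matching;
        // IgnoreCase makes the match case-insensitive.
        Regex regex = new Regex(@"^get$",
            RegexOptions.Compiled | RegexOptions.IgnoreCase);

        string[] inputs = { "GET", "get", "POST" };  // illustrative inputs
        foreach (string input in inputs)
        {
            Console.WriteLine("{0}: {1}", input, regex.IsMatch(input));
        }
        // Prints: GET: True, get: True, POST: False
    }
}
```

Whether Compiled is worth its setup cost depends on how often the same expression is reused, so it’s best treated as something to measure rather than a default.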

8. More Complex Parsing

Using regular expressions to improve the function of Split() doesn’t really demonstrate their power. The following example uses regular expressions to parse an IIS log file. That log file looks something like this:

#Software: Microsoft Internet Information Server 4.0
#Version: 1.0
#Date: 1999-12-31 00:01:22
#Fields: time c-ip cs-method cs-uri-stem sc-status
00:01:31 157.56.214.169 GET /Default.htm 304
00:02:55 157.56.214.169 GET /docs/project/overview.htm 200

The following code will parse this into a more useful form:

// file=logparse.cs
// compile with: csc logparse.cs
using System;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;
using System.Collections;

class Test
{
    public static void Main(string[] args)
    {
        if (args.Length == 0) // we need a file to parse
        {
            Console.WriteLine("No log file specified.");
        }
        else
            ParseLogFile(args[0]);
    }

    public static void ParseLogFile(string filename)
    {
        if (!System.IO.File.Exists(filename))
        {
            Console.WriteLine("The file specified does not exist.");
        }
        else
        {
            FileStream f = new FileStream(filename, FileMode.Open);
            StreamReader stream = new StreamReader(f);
            string line;
            line = stream.ReadLine(); // header line
            line = stream.ReadLine(); // version line
            line = stream.ReadLine(); // Date line
            Regex regexDate = new Regex(@"\:\s(?<date>[^\s]+)\s");
            Match match = regexDate.Match(line);
            string date = "";
            if (match.Length != 0)
                date = match.Groups["date"].ToString();
            line = stream.ReadLine(); // Fields line
            Regex regexLine =
                new Regex(
                    // match digit or :
                    @"(?<time>(\d|\:)+)\s" +
                    // match digit or .
                    @"(?<ip>(\d|\.)+)\s" +
                    // match any nonwhite
                    @"(?<method>\S+)\s" +
                    // match any nonwhite
                    @"(?<uri>\S+)\s" +
                    // match one or more digits
                    @"(?<status>\d+)");
            // read through the lines and print
            // the fields of each line that matches
            while ((line = stream.ReadLine()) != null)
            {
                //Console.WriteLine(line);
                match = regexLine.Match(line);
                if (match.Length != 0)
                {
                    Console.WriteLine("date: {0} {1}", date, match.Groups["time"]);
                    Console.WriteLine("IP Address: {0}", match.Groups["ip"]);
                    Console.WriteLine("Method: {0}", match.Groups["method"]);
                    Console.WriteLine("Status: {0}", match.Groups["status"]);
                    Console.WriteLine("URI: {0}\n", match.Groups["uri"]);
                }
            }
            f.Close();
        }
    }
}

The general structure of this code should be familiar. This example has two regular expressions. The date string and the regular expression used to match it are as follows:

#Date: 1999-12-31 00:01:22
\:\s(?<date>[^\s]+)\s

In the code, regular expressions are usually written using the verbatim string syntax, since the regular expression syntax also uses the backslash character. Regular expressions are most easily read if they’re broken down into separate elements. The following code matches the colon (:):

\:

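The backslash (\) isn’t strictly required here, because a colon on its own is an ordinary character, but escaping it is harmless and makes it clear that a literal colon is intended; the colon has a special meaning only inside constructs such as (?: ). The following code matches a single character of whitespace (such as a tab or space):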

\s

In this next part, the ?<date> names the value that will be matched so it can be extracted later:

(?<date>[^\s]+)

The [^\s] is called a character group, with the ^ character meaning “none of the following characters.” This group therefore matches any nonwhitespace character. Finally, the + character means to match one or more occurrences of the previous description (nonwhitespace). The parentheses delimit the part of the match that is captured under the name date. In the preceding example, this part of the expression matches 1999-12-31.

To match more carefully, you could use the \d (digit) specifier, with the whole expression written as follows:

\:\s(?<date>\d\d\d\d-\d\d-\d\d)\s
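To see the difference between the two date expressions, the sketch below runs both against the date line from the log file and against a malformed line. The malformed line is invented for illustration, as are the class and variable names.

```csharp
using System;
using System.Text.RegularExpressions;

class DateRegexSketch
{
    public static void Main()
    {
        Regex loose = new Regex(@"\:\s(?<date>[^\s]+)\s");
        Regex strict = new Regex(@"\:\s(?<date>\d\d\d\d-\d\d-\d\d)\s");

        string good = "#Date: 1999-12-31 00:01:22";
        string bad = "#Date: yesterday 00:01:22";  // invented malformed input

        // Both expressions extract the date from a well-formed line.
        Console.WriteLine(loose.Match(good).Groups["date"].Value);   // 1999-12-31
        Console.WriteLine(strict.Match(good).Groups["date"].Value);  // 1999-12-31

        // Only the loose expression accepts the malformed line.
        Console.WriteLine(loose.Match(bad).Success);   // True
        Console.WriteLine(strict.Match(bad).Success);  // False
    }
}
```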

That covers the simple regular expression. You can use a more complex regular expression to match each line of the log file. Because of the regularity of the line, we could have used Split(), but that wouldn’t have been as illustrative. The clauses of the regular expression are as follows:

(?<time>(\d|\:)+)\s     // match digit or : to extract time
(?<ip>(\d|\.)+)\s       // match digit or . to get IP address
(?<method>\S+)\s        // any nonwhitespace for method
(?<uri>\S+)\s           // any nonwhitespace for uri
(?<status>\d+)          // one or more digits for status
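Putting the clauses together, the following sketch applies the combined expression to a single line taken from the sample log shown earlier and prints each named group. It’s a cut-down version of the parsing loop, with illustrative class and variable names.

```csharp
using System;
using System.Text.RegularExpressions;

class LogLineSketch
{
    public static void Main()
    {
        Regex regexLine = new Regex(
            @"(?<time>(\d|\:)+)\s" +
            @"(?<ip>(\d|\.)+)\s" +
            @"(?<method>\S+)\s" +
            @"(?<uri>\S+)\s" +
            @"(?<status>\d+)");

        string line = "00:01:31 157.56.214.169 GET /Default.htm 304";
        Match match = regexLine.Match(line);

        if (match.Success)
        {
            Console.WriteLine("Time: {0}", match.Groups["time"].Value);      // 00:01:31
            Console.WriteLine("IP: {0}", match.Groups["ip"].Value);          // 157.56.214.169
            Console.WriteLine("Method: {0}", match.Groups["method"].Value);  // GET
            Console.WriteLine("URI: {0}", match.Groups["uri"].Value);        // /Default.htm
            Console.WriteLine("Status: {0}", match.Groups["status"].Value);  // 304
        }
    }
}
```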

9. Secure String

Confidential data is typically stored in two main data types—numeric types such as floats and integers will hold keys and initialization vectors used in encryption processes, and strings will hold data such as passwords, credit card numbers, and confidential document fragments. Securing integral types is typically quite easy; they can be zeroed out as soon as they aren’t needed, and even when they hold confidential data, identifying them using the memory window of a debugger or using a crash dump analysis tool is quite hard because they don’t look any different from the millions of other bytes that live in a process’s address space.

Strings are a little different; the immutability of System.String means it’s impossible to clear the contents once a value has been stored in it, and when a string is modified (causing a new allocation) or moved about during garbage collection, multiple copies of the string’s characters are left lying around the address space of the process. This immutability is a significant risk if an attacker can manage to exploit a vulnerability to begin reading the process’s memory, attach a debugger to the process, or cause the process to crash and capture the resulting memory dump file. Recognizing a group of characters that form words is an easy process when scanning a large binary file (and can easily be automated), and this makes finding the confidential data that was stored using String a quick and simple process.

These risks have been deemed sufficient to warrant the introduction of a new type in .NET 2.0 that addresses these issues: SecureString. SecureString is a string type that’s pinned and encrypted in memory, and it’s mutable so that its contents can be cleared when it’s no longer needed. To provide a simple, standard way to clear its contents, SecureString implements the IDisposable interface (see Chapter 8 for details of IDisposable), so it can be cleared at the end of a using block:

static System.Security.SecureString ReadSecretData()
{
    System.Security.SecureString s = new System.Security.SecureString();
    //read in secret data
    return s;
}

static void Main(string[] args)
{
    using (System.Security.SecureString s = ReadSecretData())
    {
        // do required processing of data
    }
    // SecureString cleared here
    // SecureString is now empty
}

The number of methods supported by SecureString is quite limited; there are a few methods for populating the string (AppendChar(), InsertAt(), and SetAt()) and a few more for clearing the string (Clear() and RemoveAt()). There’s also a MakeReadOnly() method to prevent the contents from changing.
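A minimal sketch of these methods in use follows. The secret characters are a placeholder; real code would read them from a source such as the console one character at a time, so that they never exist as a String.

```csharp
using System;
using System.Security;

class SecureStringSketch
{
    public static void Main()
    {
        // Placeholder data for illustration only.
        char[] secret = { 's', 'e', 'c', 'r', 'e', 't' };

        using (SecureString ss = new SecureString())
        {
            // Populate the string one character at a time.
            foreach (char c in secret)
            {
                ss.AppendChar(c);
            }
            // The char array is mutable, so it can be wiped once copied.
            Array.Clear(secret, 0, secret.Length);

            ss.InsertAt(0, '!');   // insert a character at the front
            ss.RemoveAt(0);        // and remove it again
            ss.MakeReadOnly();     // no further modification allowed

            Console.WriteLine("Length: {0}, read-only: {1}",
                ss.Length, ss.IsReadOnly());   // Length: 6, read-only: True
        }
        // The SecureString has been disposed (and cleared) here.
    }
}
```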

There are no direct methods for creating a SecureString from a String object or for going the other way. This is a deliberate omission, as converting from or to a String negates the security benefit of SecureString. The indirect conversion route from a String is through an overloaded constructor of SecureString that takes a char* parameter, and String conversion is possible through the Marshal.SecureStringToGlobalAllocUnicode() method:

//don't do this at home (or work)!!
//requires: using System; using System.Security;
//          using System.Runtime.InteropServices;
static unsafe void ToAndFromString()
{
    string s = "Some data";
    fixed (char* pS = s)
    {
        using (SecureString ss = new SecureString(pS, s.Length))
        {
            //a few random modifications
            ss.AppendChar('!');
            ss.InsertAt(1, '_');
            ss.SetAt(0, ' ');
            ss.RemoveAt(3);
            //make read-only
            ss.MakeReadOnly();
            //convert back to a string
            IntPtr ssData = Marshal.SecureStringToGlobalAllocUnicode(ss);
            String newString = Marshal.PtrToStringUni(ssData);
            //zero and free the unmanaged copy
            Marshal.ZeroFreeGlobalAllocUnicode(ssData);
        }
    }
}

The contents of SecureString are protected by the Windows Data Protection API (DPAPI), which is available only on Windows 2000 (SP3 and newer), Windows XP, and Windows Server 2003 and newer. Attempting to create a SecureString on older platforms will result in a security exception being raised.

The support for SecureString in other types of the .NET Framework libraries is poor, but there are plans to provide overloads for security-centric methods that allow a SecureString to be used instead of a String in future releases. The main use of SecureString in the 2.0 release of .NET will simply be the in-memory processing of confidential data. As data access and UI libraries are modified to support it, end-to-end string security will be possible.

Source: Gunnerson Eric, Wienholt Nick (2005), A Programmer’s Introduction to C# 2.0, Apress; 3rd edition.
