Handling data in ColdFusion



Many of the issues involved with globalizing applications deal with processing data from the various sources supported by ColdFusion, including the following:

  • General character encoding issues

  • Locale-specific content

  • Input data from URLs and HTML forms

  • File data

  • Databases

  • E-mail

  • HTTP

  • LDAP

  • WDDX

  • COM

  • CORBA

  • Searching and indexing

General character encoding issues

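Applications developed for earlier versions of ColdFusion that assume that the character length of a string is the same as its byte length might produce errors in current versions of ColdFusion. The byte length of a string depends on the character encoding; for example, a three-character Japanese string occupies six bytes in Shift-JIS but nine bytes in UTF-8.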
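The difference between character length and byte length can be checked directly. The following is a minimal sketch, assuming a ColdFusion version that provides the CharsetDecode function; the sample string and variable names are illustrative only. It builds the string with Chr so that the result does not depend on the encoding of the page source.

```cfml
<cfscript>
    // Illustrative sample: three Japanese characters (U+65E5, U+672C, U+8A9E).
    sample = chr(26085) & chr(26412) & chr(35486);

    // len() on a string counts characters.
    charCount = len(sample);

    // charsetDecode() converts the string to a binary object in the named
    // encoding; len() on a binary object counts bytes.
    utf8Bytes = len(charsetDecode(sample, "utf-8"));
    sjisBytes = len(charsetDecode(sample, "shift_jis"));

    writeOutput("Characters: #charCount#, UTF-8 bytes: #utf8Bytes#, Shift-JIS bytes: #sjisBytes#");
</cfscript>
```

Code that derives buffer sizes, column widths, or file offsets from Len should decide explicitly whether it needs a count of characters or a count of bytes.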

Locale-specific content

Generating multilocale content

In an application that supports users in multiple locales and produces output that is specific to each locale, call the SetLocale function in every request to set the locale for that request. When processing completes, set the locale back to its previous value. One useful technique is to save the user’s desired locale in a Session variable once the user has selected it, and to use that Session variable to set the locale for each request the user makes during the session.
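The per-request pattern might look like the following sketch. It assumes that session management is enabled and that the application stores the user's choice in Session.userLocale, a hypothetical variable name used here only for illustration.

```cfml
<cfscript>
    // Session.userLocale is a hypothetical variable that the application sets
    // when the user chooses a locale, for example "French (Standard)".
    previousLocale = getLocale();
    if (structKeyExists(session, "userLocale")) {
        setLocale(session.userLocale);
    }

    // Locale-sensitive output for this request.
    writeOutput(lsDateFormat(now(), "long"));

    // Restore the locale that was in effect before this request changed it.
    setLocale(previousLocale);
</cfscript>
```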

Supporting the euro

The euro is the currency of many European countries, and ColdFusion supports the reading and writing of correctly formatted euro values. Unlike other supported currencies, the euro is not tied to any single country (or locale). The LSCurrencyFormat and LSParseCurrency functions rely on the underlying JVM for their operations, and the rules used for currencies depend on the JVM. For Sun JVMs, the 1.3 releases did not support euros and used the older country-specific currencies. The 1.4 releases use euros for all currencies that are in the euro zone as of 2002. If you are using a JVM that does not support the euro, use the LSEuroCurrencyFormat and LSParseEuroCurrency functions to format and parse euro values in locales that use euros as their currency.
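As a rough illustration, the euro-specific functions can be called as follows. The amount and the choice of the German (Standard) locale are assumptions made for the example, and the exact formatted text depends on the locale and the JVM.

```cfml
<cfscript>
    previousLocale = getLocale();
    setLocale("German (Standard)");

    amount = 1234.56;  // illustrative value

    // "local" requests the locale's own currency format, using the euro.
    formatted = lsEuroCurrencyFormat(amount, "local");

    // Parse the formatted text back into a number.
    parsed = lsParseEuroCurrency(formatted);

    writeOutput("Formatted: #formatted#, parsed: #parsed#");

    // Restore the previous locale when done.
    setLocale(previousLocale);
</cfscript>
```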

Input data from URLs and HTML forms

A web application server receives character data from request URL parameters or as form data.

The HTTP 1.1 standard only allows US-ASCII characters (0-127) for the URL specification and for message headers. This requires a browser to encode the non-ASCII characters in the URL, both address and parameters, by escaping (URL encoding) the characters using the “%xx” hexadecimal format. URL encoding, however, does not determine how the URL is used in a web document. It only specifies how to encode the URL.
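The escaped form of a character depends on which encoding supplies its bytes. The sketch below uses an illustrative value and escapes it twice: the accented letter becomes two escaped bytes in UTF-8 and one escaped byte in ISO-8859-1, so the receiver must know which encoding was used.

```cfml
<cfscript>
    // Illustrative value: "cafe" with an accented final letter (U+00E9).
    word = "caf" & chr(233);

    // The optional second argument names the encoding whose bytes are escaped.
    asUtf8 = urlEncodedFormat(word, "utf-8");
    asLatin1 = urlEncodedFormat(word, "iso-8859-1");

    writeOutput("UTF-8: #asUtf8#<br>ISO-8859-1: #asLatin1#<br>");

    // Decoding needs the same encoding that was used for escaping.
    writeOutput("Round trip: " & urlDecode(asUtf8, "utf-8"));
</cfscript>
```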

Form data uses the message headers to specify the encoding used by the request (Content headers) and the encoding used in the response (Accept headers). Content negotiation between the client and server uses this information.

There are several techniques for handling both URL and form data entered in different character encodings.

Handling URL strings

URL requests to a server often contain name-value pairs as part of the request. For example, the following URL contains name-value pairs as part of the URL:

http://company.com/prod_page.cfm?name=Stephen&ID=7645

As discussed previously, URL characters entered using any character encoding other than US-ASCII are URL-encoded in a hexadecimal format. However, by default, a web server assumes that the characters of a URL string are single-byte characters.

One common method used to support non-ASCII characters within a URL is to include a name-value pair within the URL that defines the character encoding of the URL. For example, the following URL uses a parameter called encoding to define the character encoding of the URL parameters:

http://company.com/prod_page.cfm?name=Stephen&ID=7645&encoding=Latin-1

Within the prod_page.cfm page, you can check the value of the encoding parameter before processing any of the other name-value pairs, so that the remaining parameters are decoded with the correct character encoding.
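One possible way to perform this check is sketched below. It reads the encoding parameter from CGI.QUERY_STRING rather than from the URL scope, on the assumption that the encoding should be declared before any URL variable is read. The allowlist, the fallback encoding, and the variable names are illustrative, not part of ColdFusion.

```cfml
<cfscript>
    // Encodings this page is willing to accept (illustrative allowlist).
    allowedEncodings = "ISO-8859-1,UTF-8,Shift_JIS,EUC-KR";
    urlEncoding = "ISO-8859-1";  // assumed fallback

    // Find encoding=<name> in the raw query string, after "&" or ";".
    found = reFindNoCase("(^|[&;])encoding=([A-Za-z0-9_-]+)", cgi.query_string, 1, true);
    if (found.pos[1] gt 0) {
        // pos[3] and len[3] describe the second parenthesized group.
        requested = mid(cgi.query_string, found.pos[3], found.len[3]);

        // Map the informal name used in the example URL to a charset name.
        if (requested eq "Latin-1") {
            requested = "ISO-8859-1";
        }
        if (listFindNoCase(allowedEncodings, requested)) {
            urlEncoding = requested;
        }
    }

    // Declare the encoding before reading URL.name, URL.ID, and so on.
    setEncoding("URL", urlEncoding);
</cfscript>
```

Restricting the value to a known list avoids passing an arbitrary, user-supplied encoding name to SetEncoding.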

You can also use the SetEncoding function to specify the character encoding of URL parameters. The SetEncoding function takes two parameters: the first specifies a variable scope and the second specifies the character encoding used by the scope. Since ColdFusion writes URL parameters to the URL scope, you specify "URL" as the scope parameter to the function.

For example, if the URL parameters are passed using Shift-JIS, you could access them as follows:

<cfscript>
    // Declare the encoding of the URL scope, then read the URL variables.
    setEncoding("URL", "Shift_JIS");
    writeOutput(URL.name);
    writeOutput(URL.ID);
</cfscript>

Note: To specify the Shift-JIS character encoding, use the value Shift_JIS, with an underscore (_), not a hyphen (-).

Handling form data

The HTML form tag and the ColdFusion cfform tag let users enter text on a page, then submit that text to the server. The form tags are designed to work only with single-byte character data. Since ColdFusion uses 2 bytes per character when it stores strings, ColdFusion converts each byte of the form input into a two-byte representation.

However, if a user enters double-byte text into the form, the form interprets each byte as a single character, rather than recognizing that each character is 2 bytes. This corrupts the input text, as the following example shows:

  1. A customer enters three double-byte characters in a form, represented by 6 bytes.

  2. The form returns the six bytes to ColdFusion as six characters. ColdFusion converts them to a representation using 2 bytes per input byte for a total of 12 bytes.

  3. Outputting these characters results in corrupt information displayed to the user.
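The steps above can be reproduced with the following sketch, which assumes a ColdFusion version that provides the CharsetDecode and CharsetEncode functions. The three Korean characters and the variable names are illustrative. The 6 bytes become 6 characters when read as single-byte text and 3 characters when read as EUC-KR.

```cfml
<cfscript>
    // Three Korean characters (U+D55C, U+AD6D, U+C5B4); illustrative sample.
    original = chr(54620) & chr(44397) & chr(50612);

    // Step 1: in EUC-KR the three characters occupy 6 bytes.
    rawBytes = charsetDecode(original, "euc-kr");

    // Step 2: reading those bytes as single-byte text yields 6 characters.
    misread = charsetEncode(rawBytes, "iso-8859-1");

    // Reading the same bytes as EUC-KR yields the original 3 characters.
    correct = charsetEncode(rawBytes, "euc-kr");

    writeOutput("Bytes: #len(rawBytes)#, misread length: #len(misread)#, correct length: #len(correct)#");
</cfscript>
```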

To work around this issue, use the SetEncoding function to specify the character encoding of input form text. The SetEncoding function takes two parameters: the first specifies the variable scope and the second specifies the character encoding used by the scope. Since ColdFusion writes form parameters to the Form scope, you specify "Form" as the scope parameter to the function. If the input text is double-byte, ColdFusion preserves the two-byte representation of the text.

The following example specifies that the form data contains Korean characters:

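<cfscript>
    // Interpret the submitted form fields as EUC-KR (Korean) text.
    setEncoding("FORM", "EUC-KR");
</cfscript>
<h1>Form Test Result</h1>
<strong>Form Values:</strong>

<!--- Show the field value and its length in characters. --->
<cfset text = "String = #form.input1# , Length = #len(Trim(form.input1))#">
<cfoutput>#text#</cfoutput>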

File data

You use the cffile tag to write to and read from text files. By default, the cffile tag assumes that the text that you are reading, writing, copying, moving, or appending is in the JVM default file character encoding, which is typically the system default character encoding. For cffile action="Read", ColdFusion also checks for a byte order mark (BOM) at the start of the file; if there is one, it uses the character encoding that the BOM specifies.

Problems can arise if the character encoding of the file does not match the JVM default file character encoding, particularly if the number of bytes used for characters in one encoding does not match the number of bytes used for characters in the other encoding.

For example, assume that the JVM default file character encoding is ISO 8859-1, which uses a single byte for each character, and the file uses Shift-JIS, which uses a two-byte representation for many characters. When reading the file, the cffile tag treats each byte as an ISO 8859-1 character, and converts it into its corresponding two-byte Unicode representation. Because the characters are in Shift-JIS, the conversion corrupts the data, converting each two-byte Shift-JIS character into two Unicode characters.

To enable the cffile tag to correctly read and write text that is not encoded in the JVM default character encoding, you can pass the charset attribute to it. Specify as a value the character encoding of the data to read or write, as the following example shows:

<cffile action="read"  
    charset="EUC-KR"  
    file = "c:\web\message.txt"  
    variable = "Message" > 
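The charset attribute applies to writing as well as reading. The following sketch writes a short string and reads it back with the same encoding; the sample text, the file name, and the use of the temporary directory are assumptions made for illustration.

```cfml
<!--- Illustrative text and file name; the temporary directory avoids assuming a fixed path. --->
<cfset sample = chr(54620) & chr(44397) & chr(50612)>
<cfset path = getTempDirectory() & "message_euckr.txt">

<!--- Write the text as EUC-KR bytes. --->
<cffile action="write" file="#path#" output="#sample#" charset="EUC-KR" addnewline="no">

<!--- Read it back, naming the same encoding. --->
<cffile action="read" file="#path#" variable="roundTrip" charset="EUC-KR">

<cfset matches = compare(sample, roundTrip) eq 0>
<cfoutput>Round trip preserved the text: #matches#</cfoutput>
```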

Databases

ColdFusion applications access databases using drivers for each of the supported database types. The conversion of client native language data types to SQL data types is transparent and is done by the driver managers, database client, or server. For example, the character data (SQL CHAR, VARCHAR) that you use with the JDBC API is represented using Unicode-encoded strings.

Database administrators configure data sources and usually are required to specify the character encodings for character column data. Many of the major vendors, such as Oracle, Sybase, and Informix, support storing character data in many character encodings, including Unicode UTF-8 and UTF-16.

The database drivers supplied with ColdFusion correctly handle data conversions from the database native format to the ColdFusion Unicode format. You do not have to perform any additional processing to access databases. However, always check with your database administrator to determine how your database supports different character encodings.

E-mail

ColdFusion sends e-mail messages using the cfmail, cfmailparam, and cfmailpart tags.

By default, ColdFusion sends mail in UTF-8 encoding. You can specify a different default encoding on the Mail page in the ColdFusion Administrator, and you can use the charset attribute of the cfmail and cfmailpart tags to specify the character encoding for a specific mail message or part of a multipart mail message.
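For example, a multipart message could override the default encoding as in the following sketch. The addresses, subject, body text, and the choice of ISO-2022-JP are placeholders, and the example assumes that a mail server is configured in the ColdFusion Administrator.

```cfml
<!--- Addresses, subject, and body text are placeholders. --->
<cfmail to="recipient@example.com"
    from="sender@example.com"
    subject="Order confirmation"
    charset="ISO-2022-JP">
    <!--- Each part can name its own encoding. --->
    <cfmailpart type="text" charset="ISO-2022-JP">
Thank you for your order.
    </cfmailpart>
    <cfmailpart type="html" charset="ISO-2022-JP">
        <p>Thank you for your order.</p>
    </cfmailpart>
</cfmail>
```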

HTTP

ColdFusion supports HTTP communication using the cfhttp and cfhttpparam tags and the GetHttpRequestData function.

The cfhttp tag supports making HTTP requests. The cfhttp tag uses the Unicode UTF-8 encoding for passing data by default, and you can use the charset attribute to specify the character encoding. You can also use the cfhttpparam tag mimeType attribute to specify the MIME type and character set of a file.
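The following sketch posts a form field to a server that expects Shift-JIS. The URL, the field name, and the search term are placeholders chosen for illustration.

```cfml
<!--- The URL and form field name are placeholders. --->
<cfset searchTerm = chr(26085) & chr(26412) & chr(35486)>

<!--- charset names the encoding used for the request data. --->
<cfhttp url="http://www.example.com/search.cfm"
    method="post"
    charset="Shift_JIS"
    result="response">
    <cfhttpparam type="formfield" name="query" value="#searchTerm#">
</cfhttp>

<cfoutput>#response.statusCode#</cfoutput>
```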

LDAP

ColdFusion supports LDAP (Lightweight Directory Access Protocol) through the cfldap tag. LDAP uses the UTF-8 encoding format, so you can mix all retrieved data with other data and safely manipulate it. No extra processing is required to support LDAP.

WDDX

ColdFusion supports WDDX (Web Distributed Data Exchange) through the cfwddx tag. ColdFusion stores WDDX data in UTF-8 encoding, so it automatically supports double-byte character encodings. You do not have to perform any special processing to handle double-byte characters with WDDX.
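A round trip through WDDX can be used to confirm this behavior. In the sketch below, the structure, its key, and the Korean sample text are illustrative.

```cfml
<!--- Illustrative data: a structure holding three Korean characters. --->
<cfset data = structNew()>
<cfset data.greeting = chr(54620) & chr(44397) & chr(50612)>

<!--- Serialize to a WDDX packet, then deserialize it again. --->
<cfwddx action="cfml2wddx" input="#data#" output="packet">
<cfwddx action="wddx2cfml" input="#packet#" output="restored">

<cfset same = compare(data.greeting, restored.greeting) eq 0>
<cfoutput>Text survived the round trip: #same#</cfoutput>
```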

COM

ColdFusion supports COM through the cfobject type="com" tag. All string data used in COM interfaces is constructed using wide characters (wchars), which support double-byte characters. You do not have to perform any special processing to interface with COM objects.

CORBA

ColdFusion supports CORBA through the cfobject type="corba" tag. The CORBA 2.0 interface definition language (IDL) basic type “String” used the Latin-1 character encoding, which uses all 8 bits of each byte (256 values) to represent characters.

If you are using a version of CORBA later than 2.0, you do not have to do anything to support double-byte characters. These versions include the IDL types wchar and wstring, which map to the Java types char and String, respectively.

However, if you are using a version of CORBA that does not support wchar and wstring, the server uses char and string data types, which assume a single-byte representation of text.

Searching and indexing

ColdFusion supports Verity search through the cfindex, cfcollection, and cfsearch tags. To support multilingual searching, the ColdFusion product CD-ROM includes the Verity language packs that you install to support different languages.