About character encodings



A character encodingmaps each character in a character set to a numeric value that a computer can represent. These numbers can be represented by a single byte or multiple bytes. For example, the ASCII encoding uses 7 bits to represent the Latin alphabet, punctuation, and control characters.

You use Japanese encodings, such as Shift-JIS, EUC-JP, and ISO-2022-JP, to represent Japanese text. These encodings can vary slightly, but they include a common set of approximately 10,000 characters used in Japanese.

The following terms apply to character encodings:

SBCS
Single-byte character set; a character set encoded in one byte per character, such as ASCII or ISO 8859-1.

DBCS
Double-byte character set; a method of encoding a character set in no more than 2 bytes, such as Shift-JIS. Many character encoding schemes that are referred to as double-byte, including Shift-JIS, allow mixing of single-byte and double-byte encoded characters. Others, such as UCS-2, use 2 bytes for all characters.

MBCS
Multiple-byte character set; a character set encoded with a variable number of bytes per character, such as UTF-8.

The following table lists some common character encodings; however, there are many additional character encodings that browsers and web servers support:

Encoding

Type

Description

ASCII

SBCS

7-bit encoding used by English and Indonesian Bahasa languages

Latin-1

(ISO 8859-1)

SBCS

8-bit encoding used for many Western European languages

Shift_JIS

DBCS

16-bit Japanese encoding

Note: Use an underscore character (_), not a hyphen (-) in the name in CFML attributes.

EUC-KR

DBCS

16-bit Korean encoding

UCS-2

DBCS

Two-byte Unicode encoding

UTF-8

MBCS

Multibyte Unicode encoding. ASCII is 7-bit; non-ASCII characters used in European and many Middle Eastern languages are two-byte; and most Asian characters are three-byte

The World Wide Web Consortium maintains a list of all character encodings supported by the Internet. You can find this information at www.w3.org/International/O-charset.html.

Computers must often convert between character encodings. In particular, the character encodings most commonly used on the Internet are not used by Java or Windows. Character sets used on the Internet are typically single-byte or multiple-byte (including DBCS character sets that allow single-byte characters). These character sets are most efficient for transmitting data, because each character takes up the minimum necessary number of bytes. Currently, Latin characters are most frequently used on the web, and most character encodings used on the web represent those characters in a single byte.

Computers, however, process data most efficiently if each character occupies the same number of bytes. Therefore, Windows and Java both use double-byte encoding for internal processing.

The Java Unicode character encoding

ColdFusion uses the Java Unicode Standard for representing character data internally. This standard corresponds to UCS-2 encoding of the Unicode character set. The Unicode character set can represent many languages, including all major European and Asian character sets. Therefore, ColdFusion can receive, store, process, and present text from all languages supported by Unicode.

The Java Virtual Machine (JVM) that is used to processes ColdFusion pages converts between the character encoding used on a ColdFusion page or other source of information to UCS-2. The page or data encodings that ColdFusion supports depend on the specific JVM, but include most encodings used on the web. Similarly, the JVM converts between its internal UCS-2 representation and the character encoding used to send the response to the client.

By default, ColdFusion uses UTF-8 to represent text data sent to a browser. UTF-8 represents the Unicode character set using a variable-length encoding. ASCII characters are sent using a single byte. Most European and Middle Eastern characters are sent as 2 bytes, and Japanese, Korean, and Chinese characters are sent as 3 bytes. One advantage of UTF-8 is that it sends ASCII character set data in a form that is recognized by systems designed to process only single-byte ASCII characters, while it is flexible enough to handle multiple-byte character representations.

While the default format of text data returned by ColdFusion is UTF-8, you can have ColdFusion return a page to any character set supported by Java. For example, you can return text using the Japanese language Shift-JIS character set. Similarly, ColdFusion can handle data that is in many different character sets. For more information, see Determining the page encoding of server output.

Character encoding conversion issues

Because different character encodings support different character sets, you can encounter errors if your application gets text in one encoding and presents it in another encoding. For example, the Windows Latin-1 character encoding, Windows-1252, includes characters with hexadecimal representations in the range 80-9F, while ISO 8859-1 does not include characters in that range. As a result, under the following circumstances, characters in the range 80-9F, such as the euro symbol (Ä), are not displayed properly:

  • A file encoded in Windows-1252 includes characters in the range 80-9F.

  • ColdFusion reads the file, specifying the Windows-1252 encoding in the cffile tag.

  • ColdFusion displays the file contents, specifying ISO-8859 in the cfcontent tag.

Similar issues can arise if you convert between other character encodings; for example, if you read files encoded in the Japanese Windows default encoding and display them using Shift-JIS. To prevent these problems, ensure that the display encoding is the same as the input encoding.