Regular expression syntax
Regular expression syntax has several basic rules and methods.
Using character sets
The pattern within the brackets of a regular expression defines a character
set that is used to match a single character. For example, the regular
expression " [A-Za-z] " specifies to match any single uppercase
or lowercase letter enclosed by spaces. In the character set, a
hyphen indicates a range of characters.
The regular expression " B[IAU]G " matches the strings “ BIG ”,
“ BAG ”, and “ BUG ”, but does not match the string “ BOG ”.
If you specify the regular expression as " B[IA][GN] ", the
concatenation of character sets creates a regular expression that
matches the corresponding concatenation of characters in the search
string. This regular expression matches a space, followed by “B”,
followed by an “I” or “A”, followed by a “G” or “N”, followed by
a trailing space. The regular expression matches “ BIG ”, “ BAG ”,
“ BIN ”, and “ BAN ”.
The regular expression [A-Z][a-z]* matches any word that starts
with an uppercase letter and is followed by zero or more lowercase
letters. The special character * after the closing square bracket
specifies to match zero or more occurrences of the character set.
Note: The * only applies to the character set that
immediately precedes it, not to the entire regular expression.
A + after the closing square bracket specifies to find one or
more occurrences of the character set. You interpret the regular
expression " [A-Z]+ " as matching one or more uppercase
letters enclosed by spaces. Therefore, this regular expression matches
“ BIG ” and also matches “ LARGE ”, “ HUGE ”, “ ENORMOUS ”, and
any other string of uppercase letters surrounded by spaces.
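The character set rules above can be tried with REFind. The following is a minimal sketch; the variable names and sample strings are illustrative and do not come from the original text.

```cfml
<!--- The spaces inside the quotes are part of the regular expression --->
<cfset SearchString = "A BIG BUG in a BOG">

<!--- " B[IAU]G " matches " BIG ", starting at the space in position 2 --->
<cfset FirstMatch = REFind(" B[IAU]G ", SearchString)>
<!--- The value of FirstMatch is 2 --->

<!--- "O" is not in the set [IAU], so nothing matches and REFind returns 0 --->
<cfset NoMatch = REFind(" B[IAU]G ", "A BOG ")>
<!--- The value of NoMatch is 0 --->
```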
Considerations when using special characters
Since a regular expression followed by an * can match zero instances of
the regular expression, it can also match the empty string. For
example,
<cfoutput>
REReplace("Hello","[T]*","7","ALL") - #REReplace("Hello","[T]*","7","ALL")#<BR>
</cfoutput>
results in the following output:
REReplace("Hello","[T]*","7","ALL") - 7H7e7l7l7o
The regular expression [T]* can match empty strings. It first
matches the empty string before “H” in “Hello”. The “ALL” argument
tells REReplace to
replace all instances of an expression. The empty string before
“e” is matched, and so on, until the empty string before “o” is
matched.
This result might be unexpected. The workarounds for these types
of problems are specific to each case. In some cases you can use
[T]+, which requires at least one “T”, instead of [T]*. Alternatively,
you can specify an additional pattern after [T]*.
In the following example, the regular expression has a “W” at
the end:
<cfoutput>
REReplace("Hello World","[T]*W","7","ALL") - #REReplace("Hello World","[T]*W","7","ALL")#<BR>
</cfoutput>
This expression results in the following more predictable output:
REReplace("Hello World","[T]*W","7","ALL") - Hello 7orld
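The other workaround mentioned above, replacing * with +, can be sketched the same way. The sample strings are illustrative assumptions; because [T]+ requires at least one “T”, it never matches the empty string.

```cfml
<!--- No "T" in the string, so nothing is replaced --->
<cfset Unchanged = REReplace("Hello","[T]+","7","ALL")>
<!--- The value of Unchanged is Hello --->

<!--- The run "TT" is one match, so it is replaced by a single 7 --->
<cfset Replaced = REReplace("TTest","[T]+","7","ALL")>
<!--- The value of Replaced is 7est --->
```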
Finding repeating characters
In some cases, you might want to find
a repeating pattern of characters in a search string. For example,
the regular expression "a{2,4}" specifies to match two to four occurrences
of “a”. Therefore, it matches "aa", "aaa", and "aaaa", but not a single "a".
In a longer run such as "aaaaa", it matches only the first four characters.
In the following example, the REFind function
returns an index of 6:
<cfset IndexOfOccurrence=REFind("a{2,4}", "hahahaaahaaaahaaaaahhh")>
<!--- The value of IndexOfOccurrence is 6--->
The regular expression "[0-9]{3,}" specifies to match any integer
number containing three or more digits: “123”, “45678”, and so on.
However, this regular expression does not match a one-digit or two-digit
number.
You use the following syntax to find repeating characters:
{m,n}
Where m is 0 or greater and n is greater than or equal to m.
Match m through n (inclusive) occurrences.
The expression {0,1} is equivalent to the special character ?.
{m,}
Where m is 0 or greater. Match at least m occurrences. The syntax {,n} is
not allowed.
The expression {1,} is equivalent to the special
character +, and {0,} is equivalent to *.
{m}
Where m is 0 or greater. Match exactly m occurrences.
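The repetition syntax above can be sketched as follows. The sample strings and variable names are assumptions made for illustration only.

```cfml
<!--- {3,} skips the two-digit "42" and finds "12345" at position 23 --->
<cfset LongNumber = REFind("[0-9]{3,}", "Order 42 shipped with 12345 items")>
<!--- The value of LongNumber is 23 --->

<!--- {0,1} and ? are equivalent: the "u" is optional in both patterns --->
<cfset WithBraces = REFind("colou{0,1}r", "color")>
<cfset WithQuestionMark = REFind("colou?r", "colour")>
<!--- Both values are 1 --->

<!--- {3} needs exactly three occurrences, so two are not enough --->
<cfset TooShort = REFind("a{3}", "aa")>
<!--- The value of TooShort is 0 --->
```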
Case sensitivity in regular expressions
ColdFusion supplies case-sensitive and case-insensitive functions for working with
regular expressions. REFind and REReplace perform
case-sensitive matching and REFindNoCase and REReplaceNoCase perform
case-insensitive matching.
You can build a regular expression that models case-insensitive
behavior, even when used with a case-sensitive function. To make
a regular expression case insensitive, substitute individual characters
with character sets. For example, the regular expression [Jj][Aa][Vv][Aa],
when used with the case-sensitive functions REFind or REReplace,
matches every combination of uppercase and lowercase letters that
spells “java”, such as “JAVA”, “java”, “Java”, and “jAva”.
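As a rough comparison of the three approaches described above, the following sketch searches the same string each way. The search string and variable names are illustrative assumptions.

```cfml
<cfset SearchString = "I use JAVA">

<!--- Case-sensitive function, lowercase pattern: no match --->
<cfset Sensitive = REFind("java", SearchString)>
<!--- The value of Sensitive is 0 --->

<!--- Case-insensitive function: matches "JAVA" at position 7 --->
<cfset Insensitive = REFindNoCase("java", SearchString)>
<!--- The value of Insensitive is 7 --->

<!--- Case-sensitive function with character sets: also matches at 7 --->
<cfset WithSets = REFind("[Jj][Aa][Vv][Aa]", SearchString)>
<!--- The value of WithSets is 7 --->
```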
Using subexpressions
Parentheses group parts of regular expressions into subexpressions that
you can treat as a single unit. For example, the regular expression
"ha" specifies to match a single occurrence of the string. The regular
expression "(ha)+" matches one or more instances of “ha”.
In the following example, you use the regular expression "B(ha)+"
to match the letter "B" followed by one or more occurrences of the
string "ha":
<cfset IndexOfOccurrence=REFind("B(ha)+", "hahaBhahahaha")>
<!--- The value of IndexOfOccurrence is 5 --->
You can use the special character | in a subexpression to create
a logical "OR". You can use the following regular expression to
search for the word "jelly" or "jellies":
<cfset IndexOfOccurrence=REFind("jell(y|ies)", "I like peanut butter and jelly")>
<!--- The value of IndexOfOccurrence is 26--->
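To see which alternative of a subexpression matched, you can ask REFind to return subexpressions. The sketch below assumes that the first element of the pos and len arrays describes the whole match and the second element describes the first parenthesized subexpression; the sample sentence and variable names are illustrative.

```cfml
<cfset Sentence = "Two jellies and one jelly">

<!--- The fourth argument asks REFind for a structure with pos and len arrays --->
<cfset Result = REFind("jell(y|ies)", Sentence, 1, "yes")>

<!--- Element 1 is the whole match --->
<cfset WholeMatch = Mid(Sentence, Result.pos[1], Result.len[1])>
<!--- The value of WholeMatch is jellies --->

<!--- Element 2 is the text matched by (y|ies) --->
<cfset Ending = Mid(Sentence, Result.pos[2], Result.len[2])>
<!--- The value of Ending is ies --->
```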
Using special characters
Regular expressions define the following list of special
characters:
+ * ? . [ ^ $ ( ) { | \
In some cases, you use a special character as a literal character.
For example, if you want to search for the plus sign in a string,
you have to escape the plus sign by preceding it with a backslash:
"\+"
The following table describes the special characters for regular
expressions:
Special Character
|
Description
|
\
|
A backslash followed by any special character
matches the literal character itself, that is, the backslash escapes
the special character.
For example, "\+" matches the plus
sign, and "\\" matches a backslash.
|
.
|
A period matches any character, including
newline.
To match any character except a newline, use [^#chr(13)##chr(10)#],
which excludes the ASCII carriage return and line feed codes. The
corresponding escape codes are \r and \n.
|
[ ]
|
A one-character character set that matches
any of the characters in that set.
For example, "[akm]" matches
an “a”, “k”, or “m”. A hyphen in a character set indicates a range
of characters; for example, [a-z] matches any single lowercase letter.
If
the first character of a character set is the caret (^), the regular
expression matches any character except those in the set. It does
not match the empty string.
For example, [^akm] matches any
character except “a”, “k”, or “m”. The caret loses its special meaning
if it is not the first character of the set.
|
^
|
If the caret is at the beginning of a regular
expression, the matched string must be at the beginning of the string being
searched.
For example, the regular expression "^ColdFusion"
matches the string "ColdFusion lets you use regular expressions"
but not the string "In ColdFusion, you can use regular expressions."
|
$
|
If the dollar sign is at the end of a regular
expression, the matched string must be at the end of the string
being searched.
For example, the regular expression "ColdFusion$"
matches the string "I like ColdFusion" but not the string "ColdFusion
is fun."
|
?
|
A character set or subexpression followed
by a question mark matches zero or one occurrence of the character
set or subexpression.
For example, xy?z matches either “xyz”
or “xz”.
|
|
|
The OR character allows a choice between
two regular expressions.
For example, jell(y|ies) matches
either “jelly” or “jellies”.
|
+
|
A character set or subexpression followed
by a plus sign matches one or more occurrences of the character
set or subexpression.
For example, [a-z]+ matches one or more
lowercase characters.
|
*
|
A character set or subexpression followed
by an asterisk matches zero or more occurrences of the character
set or subexpression.
For example, [a-z]* matches zero or
more lowercase characters.
|
()
|
Parentheses group parts of a regular expression
into subexpressions that you can treat as a single unit.
For
example, (ha)+ matches one or more instances of “ha”.
|
(?x)
|
If at the beginning of a regular expression,
it specifies to ignore whitespace in the regular expression and
lets you use ## for end-of-line comments. You can match a space
by escaping it with a backslash.
For example, the following
regular expression includes comments, preceded by ##, that are ignored
by ColdFusion:
reFind("(?x)
one ##first option
|two ##second option
|three\ point\ five ## note escaped spaces
", "three point five")
|
(?m)
|
If at the beginning of a regular expression,
it specifies the multiline mode for the special characters ^ and
$.
When used with ^, the matched string can be at the start
of the entire search string or at the start of new lines, denoted
by a linefeed character or chr(10), within the search string. For
$, the matched string can be at the end of the search string or at
the end of new lines.
Multiline mode does not recognize a
carriage return, or chr(13), as a new line character.
The
following example searches for the string “two” across multiple
lines:
#reFind("(?m)^two", "one#chr(10)#two")#
This
example returns 5 to indicate that it matched “two” immediately after
the chr(10) linefeed, which is the fourth character of the search string.
Without (?m), the regular expression would not match anything,
because ^ only matches the start of the string.
The character
(?m) does not affect \A or \Z, which always match the start or end
of the string, respectively. For information on \A and \Z, see Using escape sequences.
|
(?i)
|
If at the beginning of a regular expression
for REFind(), it specifies to perform a case-insensitive
compare.
For example, the following line would return an
index of 1:
#reFind("(?i)hi", "HI")#
If
you omit the (?i), the line would return an index of zero to signify
that it did not find the regular expression.
|
(?=...)
|
It specifies to use positive lookahead when searching for the regular
expression. The lookahead group can appear anywhere in the regular
expression, as the example in this entry shows.
Positive
lookahead tests for the parenthesized subexpression like regular
parentheses, but does not include the contents in the match. It
merely tests whether the subexpression is present immediately after
the text matched so far.
For example, consider the expression to extract
the protocol from a URL:
<cfset regex = "http(?=://)">
<cfset string = "http://">
<cfset result = reFind(regex, string, 1, "yes")>
<cfoutput>#mid(string, result.pos[1], result.len[1])#</cfoutput>
This
example results in the string "http". The lookahead parentheses
ensure that the "://" is there, but do not include it in the result.
If you did not use lookahead, the result would include the extraneous
"://".
Lookahead parentheses do not capture text, so backreference
numbering will skip over these groups. For more information on backreferencing,
see Using backreferences.
|
(?!...)
|
It specifies to use negative lookahead. Negative lookahead is just like
positive lookahead, as specified by (?=...), except that it tests for the
absence of a match.
Lookahead parentheses do not capture text,
so backreference numbering will skip over these groups. For more information
on backreferencing, see Using backreferences.
|
(?:...)
|
If you prefix a subexpression with "?:",
ColdFusion performs all operations on the subexpression except that
it will not capture the corresponding text for use with a back reference.
|
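Two of the table entries above, negative lookahead and the (?i) flag, can be sketched as follows. The sample strings and variable names are illustrative assumptions.

```cfml
<!--- Negative lookahead: find "http" that is not followed by "s" --->
<cfset Links = "https://a http://b">
<cfset PlainHttp = REFind("http(?!s)", Links)>
<!--- The value of PlainHttp is 11, the start of "http://b" --->

<!--- (?i) makes the case-sensitive REFind function ignore case --->
<cfset Found = REFind("(?i)coldfusion", "I like ColdFusion")>
<!--- The value of Found is 8 --->
```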
You must be aware of the following considerations when using special characters
in character sets, such as [a-z]:
To include a hyphen (-) in the brackets of a character
set as a literal character, you cannot escape it as you can other
special characters because ColdFusion always interprets a hyphen
as a range indicator. Therefore, if you use a literal hyphen in
a character set, make it the last character in the set.
To include a closing square bracket (]) in the character
set, escape it with a backslash, as in [1-3\]A-z]. You do not have
to escape the ] character outside the character set designator.
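The two considerations above can be sketched as follows, assuming sample strings that are not part of the original text.

```cfml
<!--- Literal hyphen placed last in the set: keep only digits and hyphens --->
<cfset DigitsAndHyphens = REReplace("555-1234 x12", "[^0-9-]", "", "ALL")>
<!--- The value of DigitsAndHyphens is 555-123412 --->

<!--- Closing square bracket escaped inside the set: remove "]" and the digits 1 to 3 --->
<cfset Stripped = REReplace("a]2b", "[1-3\]]", "", "ALL")>
<!--- The value of Stripped is ab --->
```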
Using escape sequences
Escape sequences are special characters in regular expressions preceded
by a backslash (\). You typically use escape sequences to represent
special characters within a regular expression. For example, the
escape sequence \t represents a tab character within the regular
expression, and the \d escape sequence specifies any digit, as [0-9]
does. ColdFusion escape sequences are case sensitive.
The following table lists the escape sequences that ColdFusion
supports:
Escape sequence | Description
\b | Specifies a boundary defined by a transition from an alphanumeric character to a nonalphanumeric character, or from a nonalphanumeric character to an alphanumeric character. For example, the string " Big" contains a boundary defined by the space (nonalphanumeric character) and the "B" (alphanumeric character). The following example uses the \b escape sequence in a regular expression to locate the string "Big" at the end of the search string and not the fragment "big" inside the word "ambiguous":
<cfset IndexOfOccurrence=reFindNoCase("\bBig\b", "Don’t be ambiguous about Big.")>
<!--- The value of IndexOfOccurrence is 26 --->
When used inside a character set (for example [\b]), it specifies a backspace.
\B | Specifies a boundary defined by no transition of character type, for example, two alphanumeric characters in a row or two nonalphanumeric characters in a row. It is the opposite of \b.
\A | Specifies a beginning of string anchor, much like the ^ special character. However, unlike ^, you cannot combine \A with (?m) to specify the start of newlines in the search string.
\Z | Specifies an end of string anchor, much like the $ special character. However, unlike $, you cannot combine \Z with (?m) to specify the end of newlines in the search string.
\n | Newline character
\r | Carriage return
\t | Tab
\f | Form feed
\d | Any digit, similar to [0-9]
\D | Any nondigit character, similar to [^0-9]
\w | Any alphanumeric character, or the underscore (_), similar to [[:word:]]
\W | Any character that is neither alphanumeric nor the underscore (_), similar to [^[:word:]]
\s | Any whitespace character including tab, space, newline, carriage return, and form feed. Similar to [ \t\n\r\f].
\S | Any nonwhitespace character, similar to [^ \t\n\r\f]
\xdd | A hexadecimal representation of a character, where d is a hexadecimal digit
\ddd | An octal representation of a character, where d is an octal digit, in the form \000 to \377
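A few of the escape sequences in the table can be sketched as follows. The sample strings and variable names are assumptions made for illustration.

```cfml
<!--- \s+ collapses each run of whitespace into a single space --->
<cfset Collapsed = REReplace("Call  555   1234", "\s+", " ", "ALL")>
<!--- The value of Collapsed is Call 555 1234 --->

<!--- \d+ finds the first run of digits --->
<cfset FirstDigits = REFind("\d+", "Call 555")>
<!--- The value of FirstDigits is 6 --->

<!--- \b skips the "cat" inside "concatenate" and finds the separate word --->
<cfset WholeWord = REFind("\bcat\b", "concatenate a cat")>
<!--- The value of WholeWord is 15 --->
```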
Using character classes
In character sets within regular expressions,
you can include a character class. You enclose the character class
inside brackets, as the following example shows:
REReplace ("Adobe Web Site","[[:space:]]","*","ALL")
This code replaces all the spaces with *, producing this string:
Adobe*Web*Site
You can combine character classes with other expressions within
a character set. For example, the regular expression [[:space:]123]
searches for a space, 1, 2, or 3. The following example also uses
a character class in a regular expression:
<cfset IndexOfOccurrence=REFind("[[:space:]][A-Z]+[[:space:]]",
"Some BIG string")>
<!--- The value of IndexOfOccurrence is 5 --->
The following table shows the character classes that ColdFusion
supports. Regular expressions using these classes match any Unicode
character in the class, not just ASCII or ISO-8859 characters.
Character class | Matches
:alpha: | Any alphabetic character.
:upper: | Any uppercase alphabetic character.
:lower: | Any lowercase alphabetic character.
:digit: | Any digit. Same as \d.
:alnum: | Any alphabetic or numeric character.
:xdigit: | Any hexadecimal digit. Same as [0-9A-Fa-f].
:blank: | Space or a tab.
:space: | Any whitespace character. Same as \s.
:print: | Any alphanumeric, punctuation, or space character.
:punct: | Any punctuation character.
:graph: | Any alphanumeric or punctuation character.
:cntrl: | Any character not part of the character classes [:upper:], [:lower:], [:alpha:], [:digit:], [:punct:], [:graph:], [:print:], or [:xdigit:].
:word: | Any alphabetic or numeric character, plus the underscore (_). Same as \w.
:ascii: | The ASCII characters, in the hexadecimal range 0 - 7F.
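The following sketch applies a few of the character classes from the table. The sample strings and variable names are illustrative assumptions.

```cfml
<!--- [[:digit:]] removes every digit --->
<cfset Letters = REReplace("a1b2c3", "[[:digit:]]", "", "ALL")>
<!--- The value of Letters is abc --->

<!--- [[:upper:]] finds the first uppercase letter --->
<cfset FirstUpper = REFind("[[:upper:]]", "some Text")>
<!--- The value of FirstUpper is 6 --->

<!--- A class combined with literal characters in one set --->
<cfset Masked = REReplace("Room 1, floor 2", "[[:space:]123]", "_", "ALL")>
<!--- The value of Masked is Room__,_floor__ --->
```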