You compose regular expressions from operators. In the following sections, we describe the regular expression operators specified by POSIX; GNU also uses these. Most operators have more than one representation as characters. See section 2 Regular Expression Syntax, for what characters represent what operators under what circumstances.
For most operators that can be represented in two ways, one representation is a single character and the other is that character preceded by `\'. For example, either `(' or `\(' represents the open-group operator. Which one does depends on the setting of a syntax bit, in this case RE_NO_BK_PARENS. Why is this so? Historical reasons dictate some of the varying representations, while POSIX dictates others.
Finally, almost all characters lose any special meaning inside a list (see section 3.6 List Operators ([ ... ] and [^ ... ])).
3.1 The Match-self Operator (ordinary character)
This operator matches the character itself. All ordinary characters (see section 2 Regular Expression Syntax) represent this operator. For example, `f' is always an ordinary character, so the regular expression `f' matches only the string `f'. In particular, it does not match the string `ff'.
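3.2 The Match-any-character Operator (.)
This operator matches any single printing or nonprinting character, except that it won't match a newline if the syntax bit RE_DOT_NEWLINE isn't set, and it won't match a null character if the syntax bit RE_DOT_NOT_NULL is set.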
The `.' (period) character represents this operator. For example, `a.b' matches any three-character string beginning with `a' and ending with `b'.
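As a rough illustration of the match-self and match-any-character operators, the following C sketch uses the POSIX interface (regcomp and regexec from <regex.h>) in its basic syntax rather than the GNU-specific entry points. The helper name matches_whole is hypothetical. Treating the POSIX flag REG_NEWLINE as the counterpart of leaving RE_DOT_NEWLINE unset is an assumption about how the POSIX interface maps onto the syntax bits.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: return 1 if `pattern' matches all of `text', else 0.
   `cflags' is passed to regcomp (0 selects the POSIX basic syntax). */
int matches_whole(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    regmatch_t m;
    int ok;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return 0;                       /* invalid pattern: report no match */
    /* POSIX reports the leftmost-longest match, so a match of the whole
       string shows up as the offsets [0, strlen(text)). */
    ok = regexec(&re, text, 1, &m, 0) == 0
         && m.rm_so == 0
         && (size_t)m.rm_eo == strlen(text);
    regfree(&re);
    return ok;
}
```

Under these assumptions, matches_whole("a.b", "axb", 0) succeeds and matches_whole("f", "ff", 0) fails, since `f' matches only one character. With REG_NEWLINE passed as cflags, `a.b' no longer matches a string whose middle character is a newline.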
3.3 The Concatenation Operator
This operator concatenates two regular expressions a and b. No character represents this operator; you simply put b after a. The result is a regular expression that will match a string if a matches its first part and b matches the rest. For example, `xy' (two match-self operators) matches `xy'.
3.4 Repetition Operators
Repetition operators repeat the preceding regular expression a specified number of times.
3.4.1 The Match-zero-or-more Operator (*)
This operator repeats the smallest possible preceding regular expression as many times as necessary (including zero) to match the pattern. `*' represents this operator. For example, `o*' matches any string made up of zero or more `o's. Since this operator operates on the smallest preceding regular expression, `fo*' has a repeating `o', not a repeating `fo'. So, `fo*' matches `f', `fo', `foo', and so on.
Since the match-zero-or-more operator is a suffix operator, it may be useless as such when no regular expression precedes it. This is the case when it is first in a regular expression, or when it follows a match-beginning-of-line, open-group, or alternation operator.
Three different things can happen in these cases:
If the syntax bit RE_CONTEXT_INVALID_OPS is set, then the regular expression is invalid.
If RE_CONTEXT_INVALID_OPS isn't set, but RE_CONTEXT_INDEP_OPS is, then `*' represents the match-zero-or-more operator (which then operates on the empty string).
Otherwise, `*' is ordinary.
The matcher processes a match-zero-or-more operator by first matching as many repetitions of the smallest preceding regular expression as it can. Then it continues to match the rest of the pattern.
If it can't match the rest of the pattern, it backtracks (as many times as necessary), each time discarding one of the matches until it can either match the entire pattern or be certain that it cannot get a match. For example, when matching `ca*ar' against `caaar', the matcher first matches all three `a's of the string with the `a*' of the regular expression. However, it cannot then match the final `ar' of the regular expression against the final `r' of the string. So it backtracks, discarding the match of the last `a' in the string. It can then match the remaining `ar'.
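The greedy-then-backtrack procedure described above can be sketched with a toy matcher. This is not Regex's implementation. It is a minimal illustration that supports only ordinary characters and `*' applied to a single preceding character, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static int match_here(const char *re, const char *s);

/* Match `c*' followed by the rest of the pattern `re' against `s'. */
static int match_star(char c, const char *re, const char *s)
{
    const char *t = s;

    while (*t != '\0' && *t == c)
        t++;                        /* first take as many repetitions as possible */
    for (;;) {
        if (match_here(re, t))      /* then try to match the rest of the pattern */
            return 1;
        if (t == s)
            return 0;               /* no repetition left to discard */
        t--;                        /* backtrack: discard one repetition */
    }
}

/* Return 1 if `re' matches all of `s'. */
static int match_here(const char *re, const char *s)
{
    if (re[0] == '\0')
        return *s == '\0';
    if (re[1] == '*')
        return match_star(re[0], re + 2, s);
    if (*s != '\0' && re[0] == *s)
        return match_here(re + 1, s + 1);
    return 0;
}

/* Hypothetical entry point for the toy matcher. */
int toy_match(const char *re, const char *s)
{
    assert(re != NULL && s != NULL);
    return match_here(re, s);
}
```

When toy_match is given `ca*ar' and `caaar', match_star first consumes all three `a's, fails to match `ar' against `r', gives back one `a', and then succeeds, which mirrors the walk-through above.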
3.4.2 The Match-one-or-more Operator (+ or \+)
If the syntax bit RE_LIMITED_OPS is set, then Regex doesn't recognize this operator. Otherwise, if the syntax bit RE_BK_PLUS_QM isn't set, then `+' represents this operator; if it is, then `\+' does.
This operator is similar to the match-zero-or-more operator except that it repeats the preceding regular expression at least once; see section 3.4.1 The Match-zero-or-more Operator (*), for what it operates on, how some syntax bits affect it, and how Regex backtracks to match it.
For example, supposing that `+' represents the match-one-or-more operator, `ca+r' matches, e.g., `car' and `caaaar', but not `cr'.
3.4.3 The Match-zero-or-one Operator (? or \?)
If the syntax bit RE_LIMITED_OPS is set, then Regex doesn't recognize this operator. Otherwise, if the syntax bit RE_BK_PLUS_QM isn't set, then `?' represents this operator; if it is, then `\?' does.
This operator is similar to the match-zero-or-more operator except that it repeats the preceding regular expression once or not at all; see section 3.4.1 The Match-zero-or-more Operator (*), for what it operates on, how some syntax bits affect it, and how Regex backtracks to match it.
For example, supposing that `?' represents the match-zero-or-one operator, `ca?r' matches both `car' and `cr', but nothing else.
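The `+' and `?' examples can be tried with the POSIX interface, as a sketch under two assumptions: that the POSIX extended syntax (REG_EXTENDED) behaves like a syntax in which `+' and `?' are operators, and that the POSIX basic syntax (cflags of 0) treats an unescaped `+' as ordinary. The helper name matches_whole is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: return 1 if `pattern' matches all of `text', else 0. */
int matches_whole(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    regmatch_t m;
    int ok;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return 0;                       /* invalid pattern: report no match */
    /* A leftmost-longest match covering [0, strlen(text)) is a whole match. */
    ok = regexec(&re, text, 1, &m, 0) == 0
         && m.rm_so == 0
         && (size_t)m.rm_eo == strlen(text);
    regfree(&re);
    return ok;
}
```

With REG_EXTENDED, `ca+r' matches `car' and `caaaar' but not `cr', and `ca?r' matches only `car' and `cr'. With the basic syntax, the same pattern text `ca+r' matches only the literal string `ca+r', which shows how the representation of an operator depends on the syntax in effect.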
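3.4.4 Interval Operators ({ ... } or \{ ... \})
If the syntax bit RE_INTERVALS is set, then Regex recognizes interval expressions. They repeat the smallest possible preceding regular expression a specified number of times.
If the syntax bit RE_NO_BK_BRACES is set, `{' represents the open-interval operator and `}' represents the close-interval operator; otherwise, `\{' and `\}' do.
Specifically, supposing that `{' and `}' represent the open-interval and close-interval operators, then:
{count} matches exactly count occurrences of the preceding regular expression.
{min,} matches min or more occurrences of the preceding regular expression.
{min, max} matches at least min but no more than max occurrences of the preceding regular expression.
The interval expression (but not necessarily the regular expression that contains it) is invalid if min is greater than max, or if any of count, min, or max is outside the range zero to RE_DUP_MAX (which symbol `regex.h' defines).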
If the interval expression is invalid and the syntax bit RE_NO_BK_BRACES is set, then Regex considers all the characters in the would-be interval to be ordinary. If that bit isn't set, then the regular expression is invalid.
If the interval expression is valid but there is no preceding regular expression on which to operate, then if the syntax bit RE_CONTEXT_INVALID_OPS is set, the regular expression is invalid. If that bit isn't set, then Regex considers all the characters--other than backslashes, which it ignores--in the would-be interval to be ordinary.
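A short sketch of interval expressions follows, assuming the POSIX interface: the extended syntax writes intervals as `{' ... `}', and the basic syntax writes them as `\{' ... `\}'. Both helper names are hypothetical, and the claim that a reversed interval fails to compile in the basic syntax rests on the rule above for syntaxes without RE_NO_BK_BRACES.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: return 1 if `pattern' compiles under `cflags'. */
int compiles(const char *pattern, int cflags)
{
    regex_t re;

    assert(pattern != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return 0;
    regfree(&re);
    return 1;
}

/* Hypothetical helper: return 1 if the extended-syntax `pattern'
   matches all of `text', else 0. */
int matches_whole(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    int ok;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;
    ok = regexec(&re, text, 1, &m, 0) == 0
         && m.rm_so == 0
         && (size_t)m.rm_eo == strlen(text);
    regfree(&re);
    return ok;
}
```

For instance, `a{2,3}' matches `aa' and `aaa' but not `a' or `aaaa', and compiles("a\\{3,2\\}", 0) reports failure because min is greater than max.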
3.5 The Alternation Operator (| or \|)
If the syntax bit RE_LIMITED_OPS is set, then Regex doesn't recognize this operator. Otherwise, if the syntax bit RE_NO_BK_VBAR is set, then `|' represents this operator; otherwise, `\|' does.
Alternatives match one of a choice of regular expressions: if you put the character(s) representing the alternation operator between any two regular expressions a and b, the result matches the union of the strings that a and b match. For example, supposing that `|' is the alternation operator, then `foo|bar|quux' would match any of `foo', `bar' or `quux'.
The alternation operator operates on the largest possible surrounding regular expressions. (Put another way, it has the lowest precedence of any regular expression operator.) Thus, the only way you can delimit its arguments is to use grouping. For example, if `(' and `)' are the open and close-group operators, then `fo(o|b)ar' would match either `fooar' or `fobar'. (`foo|bar' would match `foo' or `bar'.)
The matcher usually tries all combinations of alternatives so as to match the longest possible string. For example, when matching `(fooq|foo)*(qbarquux|bar)' against `fooqbarquux', it cannot take, say, the first ("depth-first") combination it could match, since then it would be content to match just `fooqbar'.
3.6 List Operators ([ ... ] and [^ ... ])
A list, also called a bracket expression, is a set of one or more items. An item is a character, a character class expression, or a range expression. The syntax bits affect which kinds of items you can put in a list. We explain the last two items in subsections below. Empty lists are invalid.
A matching list matches a single character represented by one of the list items. You form a matching list by enclosing one or more items within an open-matching-list operator (represented by `[') and a close-list operator (represented by `]').
For example, `[ab]' matches either `a' or `b'. `[ad]*' matches the empty string and any string composed of just `a's and `d's in any order. Regex considers invalid a regular expression with a `[' but no matching `]'.
Nonmatching lists are similar to matching lists except that they match a single character not represented by one of the list items. You use an open-nonmatching-list operator (represented by `[^'(2)) instead of an open-matching-list operator to start a nonmatching list.
For example, `[^ab]' matches any character except `a' or `b'.
If the posix_newline field in the pattern buffer (see section 7.1.1 GNU Pattern Buffers) is set, then nonmatching lists do not match a newline.
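Most characters lose any special meaning inside a list. The special characters inside a list follow.
`]' ends the list if it's not first in the list. So, if you want to make the `]' character a list item, you must put it first.
`\' quotes the next character if the syntax bit RE_BACKSLASH_ESCAPE_IN_LISTS is set.
`[:' represents the open-character-class operator (see section 3.6.1 Character Class Operators ([: ... :])) if the syntax bit RE_CHAR_CLASSES is set and what follows is a valid character class expression.
`:]' represents the close-character-class operator if the syntax bit RE_CHAR_CLASSES is set and what precedes it is an open-character-class operator followed by a valid character class name.
`-' represents the range operator (see section 3.6.2 The Range Operator (-)) if it's not first or last in a list or the ending point of a range.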
All other characters are ordinary. For example, `[.*]' matches `.' and `*'.
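The list examples above can be checked with a small sketch that assumes the POSIX interface in its basic syntax. The helper name matches_whole is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: return 1 if the basic-syntax `pattern'
   matches all of `text', else 0. */
int matches_whole(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    int ok;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, 0) != 0)
        return 0;                       /* e.g. a `[' with no matching `]' */
    /* A leftmost-longest match covering [0, strlen(text)) is a whole match. */
    ok = regexec(&re, text, 1, &m, 0) == 0
         && m.rm_so == 0
         && (size_t)m.rm_eo == strlen(text);
    regfree(&re);
    return ok;
}
```

Here `[ad]*' matches both the empty string and `adda', `[^ab]' matches `c' but not `a', and `[.*]' matches a literal `.' or `*' because both characters are ordinary inside a list.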
3.6.1 Character Class Operators ([: ... :])
If the syntax bit RE_CHAR_CLASSES is set, then Regex recognizes character class expressions inside lists. A character class expression matches one character from a given class. You form a character class expression by putting a character class name between an open-character-class operator (represented by `[:') and a close-character-class operator (represented by `:]'). The character class names and their meanings are:
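alnum   letters and digits
alpha   letters
blank   a space or tab
cntrl   control characters
digit   digits
graph   same as print except omits space
lower   lowercase letters
print   printable characters, including space
punct   printing characters that are neither alphanumeric nor space
space   space, form feed, newline, carriage return, horizontal tab, and vertical tab
upper   uppercase letters
xdigit  hexadecimal digits: 0--9, a--f, A--F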
These correspond to the definitions in the C library's `<ctype.h>' facility. For example, `[:alpha:]' corresponds to the standard facility isalpha. Regex recognizes character class expressions only inside of lists; so `[[:alpha:]]' matches any letter, but `[:alpha:]' outside of a bracket expression and not followed by a repetition operator matches just itself.
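The correspondence between `[[:alpha:]]' and isalpha can be checked directly. The sketch below assumes the POSIX interface and the default "C" locale (no call to setlocale), and it tests only the ASCII codes 1 through 127. The function name alpha_disagreements is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <regex.h>
#include <stddef.h>

/* Count the ASCII codes 1..127 for which `[[:alpha:]]' and isalpha()
   give different answers. */
int alpha_disagreements(void)
{
    regex_t re;
    int c, bad = 0;
    int rc = regcomp(&re, "[[:alpha:]]", REG_NOSUB);

    assert(rc == 0);                    /* the pattern is expected to compile */
    for (c = 1; c < 128; c++) {
        char s[2];
        int in_class, in_ctype;

        s[0] = (char)c;                 /* one-character string to test */
        s[1] = '\0';
        in_class = regexec(&re, s, 0, NULL, 0) == 0;
        in_ctype = isalpha(c) != 0;
        if (in_class != in_ctype)
            bad++;
    }
    regfree(&re);
    return bad;
}
```

A result of zero means that, for the codes tested, the class expression and the `<ctype.h>' facility agree.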
3.6.2 The Range Operator (-)
Regex recognizes range expressions inside a list. They represent those characters that fall between two elements in the current collating sequence. You form a range expression by putting a range operator between two characters.(3) `-' represents the range operator. For example, `a-f' within a list represents all the characters from `a' through `f' inclusively.
If the syntax bit RE_NO_EMPTY_RANGES is set, then if the range's ending point collates less than its starting point, the range (and the regular expression containing it) is invalid. For example, the regular expression `[z-a]' would be invalid. If this bit isn't set, then Regex considers such a range to be empty.
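Since `-' represents the range operator, if you want to make a `-' character itself a list item, you must do one of the following:
Put the `-' either first or last in the list.
Include a range whose starting point collates strictly lower than `-' and whose ending point collates equal or higher.
Put a range whose starting point is `-' first in the list.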
For example, `[-a-z]' matches a lowercase letter or a hyphen (in English, in ASCII).
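Range expressions can be exercised with the sketch below, which assumes the POSIX interface in its basic syntax and the default "C" locale. Both helper names are hypothetical. The expectation that `[z-a]' fails to compile assumes an implementation whose POSIX syntaxes set RE_NO_EMPTY_RANGES, as the GNU C library's do.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: return 1 if the basic-syntax `pattern' compiles. */
int compiles(const char *pattern)
{
    regex_t re;

    assert(pattern != NULL);
    if (regcomp(&re, pattern, 0) != 0)
        return 0;
    regfree(&re);
    return 1;
}

/* Hypothetical helper: return 1 if `pattern' matches all of `text'. */
int matches_whole(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    int ok;

    if (regcomp(&re, pattern, 0) != 0)
        return 0;
    ok = regexec(&re, text, 1, &m, 0) == 0
         && m.rm_so == 0
         && (size_t)m.rm_eo == strlen(text);
    regfree(&re);
    return ok;
}
```

With these helpers, `[-a-z]' matches `-' and `q' but not `Q', because the leading `-' is an ordinary list item and `a-z' is a range.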
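3.7 Grouping Operators (( ... ) or \( ... \))
A group, also known as a subexpression, consists of an open-group operator, any number of other operators, and a close-group operator. Regex treats this sequence as a unit, just as mathematics and programming languages treat a parenthesized expression as a unit.
Therefore, using groups, you can:
delimit the argument(s) to an alternation operator (see section 3.5 The Alternation Operator (| or \|)) or a repetition operator (see section 3.4 Repetition Operators).
keep track of the indices of the substring that matched a given group, which lets you use the back-reference operator and registers.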
If the syntax bit RE_NO_BK_PARENS is set, then `(' represents the open-group operator and `)' represents the close-group operator; otherwise, `\(' and `\)' do.
If the syntax bit RE_UNMATCHED_RIGHT_PAREN_ORD is set and a close-group operator has no matching open-group operator, then Regex considers it to match `)'.
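Besides delimiting operands, a group lets the matcher report where its subexpression matched. The sketch below shows this through the POSIX interface, assuming the extended syntax in which `(' and `)' are the group operators. The helper name group_span is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: on a match in which group `n' took part, store the
   group's [start, end) offsets and return 1; otherwise return 0.
   Group 0 is the whole match. */
int group_span(const char *pattern, const char *text, size_t n,
               int *start, int *end)
{
    regex_t re;
    regmatch_t m[10];
    int ok;

    assert(n < 10 && start != NULL && end != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;                       /* invalid pattern */
    /* regexec sets the offsets of groups that did not take part to -1. */
    ok = regexec(&re, text, 10, m, 0) == 0 && m[n].rm_so != -1;
    if (ok) {
        *start = (int)m[n].rm_so;
        *end = (int)m[n].rm_eo;
    }
    regfree(&re);
    return ok;
}
```

For `fo(o|b)ar' searched in `xfobar', group 0 spans offsets 1 to 6 and group 1 spans offsets 3 to 4, the single `b' chosen by the alternation.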
3.8 The Back-reference Operator (\digit)
If the syntax bit RE_NO_BK_REF isn't set, then Regex recognizes back references. A back reference matches a specified preceding group. The back reference operator is represented by `\digit' anywhere after the end of a regular expression's digit-th group (see section 3.7 Grouping Operators (( ... ) or \( ... \))).
digit must be between `1' and `9'. The matcher assigns numbers 1 through 9 to the first nine groups it encounters. By using one of `\1' through `\9' after the corresponding group's close-group operator, you can match a substring identical to the one that the group does.
Back references match according to the following (in all examples below, `(' represents the open-group, `)' the close-group, `{' the open-interval and `}' the close-interval operator):
If the group matches a substring, the back reference matches an identical substring. For example, `(.*)\1' matches any (newline-free if the syntax bit RE_DOT_NEWLINE isn't set) string that is composed of two identical halves; the `(.*)' matches the first half and the `\1' matches the second half.
You can use a back reference as an argument to a repetition operator. For example, `(a(b))\2*' matches `a' followed by one or more `b's, because the group itself matches `ab' and `\2*' then matches zero or more further `b's. Similarly, `(a(b))\2{3}' matches `abbbb'.
If there is no preceding digit-th subexpression, the regular expression is invalid.
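The back-reference examples can be tried with the POSIX basic syntax, in which groups are written `\(' ... `\)' and intervals `\{' ... `\}'. This is a sketch under that assumption; the helper name bre_search is hypothetical, and the patterns are anchored with `^' and `$' so that a match means the whole string matched.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if the basic-syntax `pattern' matches
   somewhere in `text', 0 if it does not, and -1 if it fails to compile. */
int bre_search(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[10];
    int ok;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, 0) != 0)
        return -1;                      /* e.g. a reference to a missing group */
    ok = regexec(&re, text, 10, m, 0) == 0;
    regfree(&re);
    return ok;
}
```

With this helper, `^\(.*\)\1$' accepts `abcabc' but not `abcab', and `^\(a\(b\)\)\2*$' accepts `ab' as well as `abbb'. A pattern such as `\(a\)\2', which refers to a group that does not exist, is rejected at compile time.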
3.9 Anchoring Operators
These operators can constrain a pattern to match only at the beginning or end of the entire string or at the beginning or end of a line.
3.9.1 The Match-beginning-of-line Operator (^)
This operator can match the empty string either at the beginning of the string or after a newline character. Thus, it is said to anchor the pattern to the beginning of a line.
In the cases following, `^' represents this operator. (Otherwise, `^' is ordinary.)
It (the `^') is first in the pattern, as in `^foo'.
The syntax bit RE_CONTEXT_INDEP_ANCHORS is set, and it is outside a bracket expression.
It follows an open-group or alternation operator. See section 3.7 Grouping Operators (( ... ) or \( ... \)), and section 3.5 The Alternation Operator (| or \|).
These rules imply that some valid patterns containing `^' cannot be matched; for example, `foo^bar' if RE_CONTEXT_INDEP_ANCHORS is set.
If the not_bol field is set in the pattern buffer (see section 7.1.1 GNU Pattern Buffers), then `^' fails to match at the beginning of the string. See section 7.2.3 POSIX Matching, for when you might find this useful.
If the newline_anchor field isn't set in the pattern buffer, then `^' fails to match after a newline. This is useful when you do not regard the string to be matched as broken into lines.
3.9.2 The Match-end-of-line Operator ($)
This operator can match the empty string either at the end of the string or before a newline character in the string. Thus, it is said to anchor the pattern to the end of a line.
It is always represented by `$'. For example, `foo$' usually matches `foo' and also the first three characters of `foo\nbar'.
Its interaction with the syntax bits and pattern buffer fields is exactly the dual of `^''s; see the previous section. (That is, "beginning" becomes "end", "next" becomes "previous", and "after" becomes "before".)
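The anchoring rules can be observed through the POSIX interface, which exposes comparable switches as flags. The sketch below treats REG_NOTBOL and REG_NOTEOL as counterparts of the not_bol field and its end-of-line dual, and REG_NEWLINE as the switch that lets anchors match next to a newline; that mapping is an assumption. The helper name found is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if `pattern' matches somewhere in `text'.
   `cflags' goes to regcomp and `eflags' goes to regexec. */
int found(const char *pattern, const char *text, int cflags, int eflags)
{
    regex_t re;
    int ok;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return 0;                       /* invalid pattern: report no match */
    ok = regexec(&re, text, 0, NULL, eflags) == 0;
    regfree(&re);
    return ok;
}
```

For example, `^bar' is found in `foo\nbar' only when REG_NEWLINE is given, and `^foo' is not found in `foo' when REG_NOTBOL is given. In the basic syntax `foo^bar' matches the literal text `foo^bar' because the `^' is ordinary there, whereas in the extended syntax, where `^' is an anchor everywhere, the same pattern cannot match that text.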