

3 Common Operators

You compose regular expressions from operators. In the following sections, we describe the regular expression operators specified by POSIX; GNU also uses these. Most operators have more than one representation as characters. See section 2 Regular Expression Syntax, for what characters represent what operators under what circumstances.

For most operators that can be represented in two ways, one representation is a single character and the other is that character preceded by `\'. For example, either `(' or `\(' represents the open-group operator. Which one does depends on the setting of a syntax bit, in this case RE_NO_BK_PARENS. Why is this so? Historical reasons dictate some of the varying representations, while POSIX dictates others.
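As a rough illustration of the two representations, the sketch below uses the POSIX `regcomp' interface rather than setting syntax bits directly. It assumes that compiling without `REG_EXTENDED' selects a syntax in which `\(' is the open-group operator, and that compiling with it selects one in which `(' is. The names `posix_matches' and `group_syntax_demo' are made up for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if `pattern' matches somewhere in `text',
   0 if it does not, -1 if `pattern' fails to compile. */
int posix_matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Returns 1 if every expectation below holds. */
int group_syntax_demo(void)
{
    /* Basic syntax: backslashed parentheses group, bare ones are ordinary. */
    return posix_matches("^\\(ab\\)$", 0, "ab") == 1
        && posix_matches("^(ab)$", 0, "(ab)") == 1
        && posix_matches("^(ab)$", 0, "ab") == 0
        /* Extended syntax: the roles are swapped. */
        && posix_matches("^(ab)$", REG_EXTENDED, "ab") == 1
        && posix_matches("^\\(ab\\)$", REG_EXTENDED, "(ab)") == 1;
}
```

The same characters thus represent different operators depending only on the syntax selected at compile time.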

Finally, almost all characters lose any special meaning inside a list (see section 3.6 List Operators ([ ... ] and [^ ... ])).

3.1 The Match-self Operator (ordinary character)

This operator matches the character itself. All ordinary characters (see section 2 Regular Expression Syntax) represent this operator. For example, `f' is always an ordinary character, so the regular expression `f' matches only the string `f'. In particular, it does not match the string `ff'.

3.2 The Match-any-character Operator (.)

This operator matches any single printing or nonprinting character, except that it won't match a:

newline
if the syntax bit RE_DOT_NEWLINE isn't set.
null
if the syntax bit RE_DOT_NOT_NULL is set.

The `.' (period) character represents this operator. For example, `a.b' matches any three-character string beginning with `a' and ending with `b'.
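The newline exception can be observed through the POSIX `regcomp' interface. This is only a sketch: it assumes that the POSIX flag `REG_NEWLINE' stands in for leaving RE_DOT_NEWLINE unset, and the helper names are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if `pattern' matches somewhere in `text',
   0 if it does not, -1 if `pattern' fails to compile. */
int posix_matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Returns 1 if every expectation below holds. */
int dot_demo(void)
{
    /* `a.b' needs exactly one character between `a' and `b'. */
    return posix_matches("^a.b$", REG_EXTENDED, "axb") == 1
        && posix_matches("^a.b$", REG_EXTENDED, "ab") == 0
        /* Without REG_NEWLINE, `.' is allowed to match a newline. */
        && posix_matches("^a.b$", REG_EXTENDED, "a\nb") == 1
        /* With REG_NEWLINE, it is not. */
        && posix_matches("^a.b$", REG_EXTENDED | REG_NEWLINE, "a\nb") == 0;
}
```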

3.3 The Concatenation Operator

This operator concatenates two regular expressions a and b. No character represents this operator; you simply put b after a. The result is a regular expression that will match a string if a matches its first part and b matches the rest. For example, `xy' (two match-self operators) matches `xy'.

3.4 Repetition Operators

Repetition operators repeat the preceding regular expression a specified number of times.

3.4.1 The Match-zero-or-more Operator (*)

This operator repeats the smallest possible preceding regular expression as many times as necessary (including zero) to match the pattern. `*' represents this operator. For example, `o*' matches any string made up of zero or more `o's. Since this operator operates on the smallest preceding regular expression, `fo*' has a repeating `o', not a repeating `fo'. So, `fo*' matches `f', `fo', `foo', and so on.

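Since the match-zero-or-more operator is a suffix operator, it may be useless as such when no regular expression precedes it. This is the case when it is first in a regular expression, or follows a match-beginning-of-line, open-group, or alternation operator.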

Three different things can happen in these cases:

  1. If the syntax bit RE_CONTEXT_INVALID_OPS is set, then the regular expression is invalid.
  2. If RE_CONTEXT_INVALID_OPS isn't set, but RE_CONTEXT_INDEP_OPS is, then `*' represents the match-zero-or-more operator (which then operates on the empty string).
  3. Otherwise, `*' is ordinary.
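The third case can be seen with the POSIX basic syntax, which is assumed here to be what `regcomp' selects when called without `REG_EXTENDED'. In this sketch (helper names are hypothetical) a leading `*' only matches a literal asterisk.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if `pattern' matches somewhere in `text',
   0 if it does not, -1 if `pattern' fails to compile. */
int posix_matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Returns 1 if a leading `*' behaves as an ordinary character. */
int leading_star_demo(void)
{
    /* The pattern `*a' finds the literal text "*a" ... */
    return posix_matches("*a", 0, "x*a") == 1
        /* ... and does not match a string that lacks an asterisk, which
           it would if `*' operated on the empty string. */
        && posix_matches("*a", 0, "aaa") == 0;
}
```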

The matcher processes a match-zero-or-more operator by first matching as many repetitions of the smallest preceding regular expression as it can. Then it continues to match the rest of the pattern.

If it can't match the rest of the pattern, it backtracks (as many times as necessary), each time discarding one of the matches until it can either match the entire pattern or be certain that it cannot get a match. For example, when matching `ca*ar' against `caaar', the matcher first matches all three `a's of the string with the `a*' of the regular expression. However, it cannot then match the final `ar' of the regular expression against the final `r' of the string. So it backtracks, discarding the match of the last `a' in the string. It can then match the remaining `ar'.
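The take-everything-then-give-back strategy just described can be sketched with a toy matcher. This is not Regex's implementation: it is a deliberately small stand-in that supports only ordinary characters and `x*', requires the pattern to match the whole string, and counts how many repetitions it discards. The name `toy_match' is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Toy matcher, not Regex's implementation.  Returns 1 if `pat' matches
   all of `text', 0 otherwise.  `*discarded' is incremented each time a
   repetition is given back while backtracking. */
int toy_match(const char *pat, const char *text, int *discarded)
{
    assert(pat != NULL && text != NULL && discarded != NULL);

    if (pat[0] == '\0')
        return text[0] == '\0';

    if (pat[1] == '*') {
        const char *t = text;

        /* First take as many repetitions as possible. */
        while (*t == pat[0])
            t++;
        /* Then try the rest of the pattern, giving back one
           repetition after each failure. */
        for (;;) {
            if (toy_match(pat + 2, t, discarded))
                return 1;
            if (t == text)
                return 0;
            t--;
            (*discarded)++;
        }
    }

    if (text[0] != '\0' && pat[0] == text[0])
        return toy_match(pat + 1, text + 1, discarded);
    return 0;
}
```

Called with the pattern `ca*ar' and the string `caaar', this sketch returns 1 after discarding exactly one `a', which mirrors the walk-through above.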

3.4.2 The Match-one-or-more Operator (+ or \+)

If the syntax bit RE_LIMITED_OPS is set, then Regex doesn't recognize this operator. Otherwise, if the syntax bit RE_BK_PLUS_QM isn't set, then `+' represents this operator; if it is, then `\+' does.

This operator is similar to the match-zero-or-more operator except that it repeats the preceding regular expression at least once; see section 3.4.1 The Match-zero-or-more Operator (*), for what it operates on, how some syntax bits affect it, and how Regex backtracks to match it.

For example, supposing that `+' represents the match-one-or-more operator, `ca+r' matches, e.g., `car' and `caaaar', but not `cr'.

3.4.3 The Match-zero-or-one Operator (? or \?)

If the syntax bit RE_LIMITED_OPS is set, then Regex doesn't recognize this operator. Otherwise, if the syntax bit RE_BK_PLUS_QM isn't set, then `?' represents this operator; if it is, then `\?' does.

This operator is similar to the match-zero-or-more operator except that it repeats the preceding regular expression once or not at all; see section 3.4.1 The Match-zero-or-more Operator (*), to see what it operates on, how some syntax bits affect it, and how Regex backtracks to match it.

For example, supposing that `?' represents the match-zero-or-one operator, `ca?r' matches both `car' and `cr', but nothing else.
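The examples for the match-one-or-more and match-zero-or-one operators can be checked with the POSIX extended syntax, in which `+' and `?' are assumed to represent these operators. The helper names are hypothetical, and the `^' and `$' anchors are added only to force a whole-string match.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if `pattern' matches somewhere in `text',
   0 if it does not, -1 if `pattern' fails to compile. */
int posix_matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Returns 1 if every expectation below holds. */
int plus_qm_demo(void)
{
    /* `+' needs at least one `a'. */
    return posix_matches("^ca+r$", REG_EXTENDED, "car") == 1
        && posix_matches("^ca+r$", REG_EXTENDED, "caaaar") == 1
        && posix_matches("^ca+r$", REG_EXTENDED, "cr") == 0
        /* `?' allows at most one `a'. */
        && posix_matches("^ca?r$", REG_EXTENDED, "car") == 1
        && posix_matches("^ca?r$", REG_EXTENDED, "cr") == 1
        && posix_matches("^ca?r$", REG_EXTENDED, "caar") == 0;
}
```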

3.4.4 Interval Operators ({ ... } or \{ ... \})

If the syntax bit RE_INTERVALS is set, then Regex recognizes interval expressions. They repeat the smallest possible preceding regular expression a specified number of times.

If the syntax bit RE_NO_BK_BRACES is set, `{' represents the open-interval operator and `}' represents the close-interval operator; otherwise, `\{' and `\}' do.

Specifically, supposing that `{' and `}' represent the open-interval and close-interval operators, then:

{count}
matches exactly count occurrences of the preceding regular expression.
{min,}
matches min or more occurrences of the preceding regular expression.
{min,max}
matches at least min but no more than max occurrences of the preceding regular expression.
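The three interval forms can be tried through POSIX `regcomp'. The sketch assumes that the extended syntax uses `{' and `}' and the basic syntax uses `\{' and `\}'; the helper names are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if `pattern' matches somewhere in `text',
   0 if it does not, -1 if `pattern' fails to compile. */
int posix_matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Returns 1 if every expectation below holds. */
int interval_demo(void)
{
    /* {count}: exactly two. */
    return posix_matches("^a{2}$", REG_EXTENDED, "aa") == 1
        && posix_matches("^a{2}$", REG_EXTENDED, "aaa") == 0
        /* {min,}: two or more. */
        && posix_matches("^a{2,}$", REG_EXTENDED, "aaaa") == 1
        && posix_matches("^a{2,}$", REG_EXTENDED, "a") == 0
        /* {min,max}: two or three. */
        && posix_matches("^a{2,3}$", REG_EXTENDED, "aaa") == 1
        && posix_matches("^a{2,3}$", REG_EXTENDED, "aaaa") == 0
        /* Basic syntax spells the same interval with backslashes. */
        && posix_matches("^a\\{2,3\\}$", 0, "aaa") == 1;
}
```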

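The interval expression (but not necessarily the regular expression that contains it) is invalid if min is greater than max, or if any of count, min, or max is outside the range zero to RE_DUP_MAX (which `regex.h' defines).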

If the interval expression is invalid and the syntax bit RE_NO_BK_BRACES is set, then Regex considers all the characters in the would-be interval to be ordinary. If that bit isn't set, then the regular expression is invalid.

If the interval expression is valid but there is no preceding regular expression on which to operate, then if the syntax bit RE_CONTEXT_INVALID_OPS is set, the regular expression is invalid. If that bit isn't set, then Regex considers all the characters--other than backslashes, which it ignores--in the would-be interval to be ordinary.

3.5 The Alternation Operator (| or \|)

If the syntax bit RE_LIMITED_OPS is set, then Regex doesn't recognize this operator. Otherwise, if the syntax bit RE_NO_BK_VBAR is set, then `|' represents this operator; otherwise, `\|' does.

Alternatives match one of a choice of regular expressions: if you put the character(s) representing the alternation operator between any two regular expressions a and b, the result matches the union of the strings that a and b match. For example, supposing that `|' is the alternation operator, then `foo|bar|quux' would match any of `foo', `bar' or `quux'.

The alternation operator operates on the largest possible surrounding regular expressions. (Put another way, it has the lowest precedence of any regular expression operator.) Thus, the only way you can delimit its arguments is to use grouping. For example, if `(' and `)' are the open and close-group operators, then `fo(o|b)ar' would match either `fooar' or `fobar'. (`foo|bar' would match `foo' or `bar'.)
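The effect of the low precedence is easiest to see when anchors are involved. The sketch below uses the POSIX extended syntax, assuming `|', `(' and `)' are the alternation and grouping operators there; the helper names are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if `pattern' matches somewhere in `text',
   0 if it does not, -1 if `pattern' fails to compile. */
int posix_matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Returns 1 if every expectation below holds. */
int alternation_demo(void)
{
    /* Grouping limits the alternatives to `o' and `b'. */
    return posix_matches("^fo(o|b)ar$", REG_EXTENDED, "fooar") == 1
        && posix_matches("^fo(o|b)ar$", REG_EXTENDED, "fobar") == 1
        && posix_matches("^fo(o|b)ar$", REG_EXTENDED, "foobar") == 0
        /* Without a group, the alternatives are `^foo' and `bar$'. */
        && posix_matches("^foo|bar$", REG_EXTENDED, "foox") == 1
        /* With a group, both anchors apply to each alternative. */
        && posix_matches("^(foo|bar)$", REG_EXTENDED, "foox") == 0
        && posix_matches("^(foo|bar)$", REG_EXTENDED, "bar") == 1;
}
```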

The matcher usually tries all combinations of alternatives so as to match the longest possible string. For example, when matching `(fooq|foo)*(qbarquux|bar)' against `fooqbarquux', it cannot take, say, the first ("depth-first") combination it could match, since then it would be content to match just `fooqbar'.
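To observe the longest-match behavior from the outside, one can ask POSIX `regexec' for the offsets of the overall match. This is a sketch under the assumption that the `regexec' in use follows the POSIX leftmost-longest rule; `match_span' is a hypothetical helper.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: stores the offsets of the overall match in
   `*start' and `*end' and returns 1; returns 0 if there is no match
   and -1 if `pattern' fails to compile. */
int match_span(const char *pattern, int cflags, const char *text,
               int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(pattern != NULL && text != NULL);
    assert(start != NULL && end != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;
    *start = (int) m[0].rm_so;
    *end = (int) m[0].rm_eo;
    return 1;
}

/* Returns 1 if the whole of `fooqbarquux' (11 characters) is matched,
   rather than just the 7 characters of `fooqbar'. */
int longest_match_demo(void)
{
    int s = -1, e = -1;

    return match_span("(fooq|foo)*(qbarquux|bar)", REG_EXTENDED,
                      "fooqbarquux", &s, &e) == 1
        && s == 0 && e == 11;
}
```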

3.6 List Operators ([ ... ] and [^ ... ])

Lists, also called bracket expressions, are a set of one or more items. An item is a character, a character class expression, or a range expression. The syntax bits affect which kinds of items you can put in a list. We explain the last two items in subsections below. Empty lists are invalid.

A matching list matches a single character represented by one of the list items. You form a matching list by enclosing one or more items within an open-matching-list operator (represented by `[') and a close-list operator (represented by `]').

For example, `[ab]' matches either `a' or `b'. `[ad]*' matches the empty string and any string composed of just `a's and `d's in any order. Regex considers a regular expression with a `[' but no matching `]' to be invalid.

Nonmatching lists are similar to matching lists except that they match a single character not represented by one of the list items. You use an open-nonmatching-list operator (represented by `[^') instead of an open-matching-list operator to start a nonmatching list.

For example, `[^ab]' matches any character except `a' or `b'.

If the posix_newline field in the pattern buffer (see section 7.1.1 GNU Pattern Buffers) is set, then nonmatching lists do not match a newline.

Most characters lose any special meaning inside a list. The special characters inside a list follow.

`]'
ends the list if it's not the first list item. So, if you want to make the `]' character a list item, you must put it first.
`\'
quotes the next character if the syntax bit RE_BACKSLASH_ESCAPE_IN_LISTS is set.
`[:'
represents the open-character-class operator (see section 3.6.1 Character Class Operators ([: ... :])) if the syntax bit RE_CHAR_CLASSES is set and what follows is a valid character class expression.
`:]'
represents the close-character-class operator if the syntax bit RE_CHAR_CLASSES is set and what precedes it is an open-character-class operator followed by a valid character class name.
`-'
represents the range operator (see section 3.6.2 The Range Operator (-)) if it's not first or last in a list or the ending point of a range.

All other characters are ordinary. For example, `[.*]' matches `.' and `*'.
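The list examples in this section can be tried with POSIX `regcomp'. The sketch assumes the POSIX extended syntax, where backslash has no special meaning inside a list; the helper names are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if `pattern' matches somewhere in `text',
   0 if it does not, -1 if `pattern' fails to compile. */
int posix_matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Returns 1 if every expectation below holds. */
int list_demo(void)
{
    /* Matching and nonmatching lists. */
    return posix_matches("^[ab]$", REG_EXTENDED, "a") == 1
        && posix_matches("^[ab]$", REG_EXTENDED, "c") == 0
        && posix_matches("^[^ab]$", REG_EXTENDED, "c") == 1
        && posix_matches("^[^ab]$", REG_EXTENDED, "a") == 0
        /* `[ad]*' also matches the empty string. */
        && posix_matches("^[ad]*$", REG_EXTENDED, "") == 1
        && posix_matches("^[ad]*$", REG_EXTENDED, "adda") == 1
        /* A `]' placed first is a list item. */
        && posix_matches("^[]a]$", REG_EXTENDED, "]") == 1
        /* `.' and `*' are ordinary inside a list. */
        && posix_matches("^[.*]$", REG_EXTENDED, "*") == 1
        && posix_matches("^[.*]$", REG_EXTENDED, "x") == 0;
}
```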

3.6.1 Character Class Operators ([: ... :])

If the syntax bit RE_CHAR_CLASSES is set, then Regex recognizes character class expressions inside lists. A character class expression matches one character from a given class. You form a character class expression by putting a character class name between an open-character-class operator (represented by `[:') and a close-character-class operator (represented by `:]'). The character class names and their meanings are:

alnum
letters and digits
alpha
letters
blank
system-dependent; for GNU, a space or tab
cntrl
control characters (in the ASCII encoding, code 0177 and codes less than 040)
digit
digits
graph
same as print except omits space
lower
lowercase letters
print
printable characters (in the ASCII encoding, space through tilde, codes 040 through 0176)
punct
neither control nor alphanumeric characters
space
space, horizontal tab, carriage return, newline, vertical tab, and form feed
upper
uppercase letters
xdigit
hexadecimal digits: 0--9, a--f, A--F

These correspond to the definitions in the C library's `<ctype.h>' facility. For example, `[:alpha:]' corresponds to the standard facility isalpha. Regex recognizes character class expressions only inside of lists; so `[[:alpha:]]' matches any letter, but `[:alpha:]' outside of a bracket expression is not a character class expression: it is an ordinary list whose items are the characters `:', `a', `l', `p', and `h'.
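A few character class expressions, written inside lists as required, can be exercised with POSIX `regcomp'. The sketch assumes the default "C" locale, so that the classes agree with the ASCII descriptions above; the helper names are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if `pattern' matches somewhere in `text',
   0 if it does not, -1 if `pattern' fails to compile. */
int posix_matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Returns 1 if every expectation below holds. */
int char_class_demo(void)
{
    /* A class expression is one item of the enclosing list. */
    return posix_matches("^[[:alpha:]]$", REG_EXTENDED, "q") == 1
        && posix_matches("^[[:alpha:]]$", REG_EXTENDED, "7") == 0
        && posix_matches("^[[:digit:]]+$", REG_EXTENDED, "2024") == 1
        /* Several classes and ordinary characters can share a list. */
        && posix_matches("^[[:alpha:][:digit:]_]+$", REG_EXTENDED, "x_1") == 1
        && posix_matches("^[[:xdigit:]]+$", REG_EXTENDED, "fF09") == 1
        && posix_matches("^[[:xdigit:]]+$", REG_EXTENDED, "g") == 0;
}
```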

3.6.2 The Range Operator (-)

Regex recognizes range expressions inside a list. They represent those characters that fall between two elements in the current collating sequence. You form a range expression by putting a range operator between two characters. `-' represents the range operator. For example, `a-f' within a list represents all the characters from `a' through `f' inclusively.

If the syntax bit RE_NO_EMPTY_RANGES is set, then if the range's ending point collates less than its starting point, the range (and the regular expression containing it) is invalid. For example, the regular expression `[z-a]' would be invalid. If this bit isn't set, then Regex considers such a range to be empty.

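Since `-' represents the range operator, if you want to make a `-' character itself a list item, you must do one of the following: put the `-' either first or last in the list; include a range whose starting point collates strictly lower than `-' and whose ending point collates equal or higher; or put a range whose starting point is `-' first in the list.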

For example, `[-a-z]' matches a lowercase letter or a hyphen (in English, in ASCII).
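Ranges and a literal hyphen can be tried with POSIX `regcomp'. The sketch assumes the default "C" locale, where the collating sequence follows ASCII, and assumes that the POSIX syntaxes reject a reversed range, as glibc's do. The helper names are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if `pattern' matches somewhere in `text',
   0 if it does not, -1 if `pattern' fails to compile. */
int posix_matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Returns 1 if every expectation below holds. */
int range_demo(void)
{
    /* `a-f' covers `a' through `f' inclusive. */
    return posix_matches("^[a-f]$", REG_EXTENDED, "c") == 1
        && posix_matches("^[a-f]$", REG_EXTENDED, "g") == 0
        /* A `-' placed first or last is a list item. */
        && posix_matches("^[-a-z]$", REG_EXTENDED, "-") == 1
        && posix_matches("^[-a-z]$", REG_EXTENDED, "q") == 1
        && posix_matches("^[-a-z]$", REG_EXTENDED, "Q") == 0
        && posix_matches("^[a-z-]$", REG_EXTENDED, "-") == 1
        /* A reversed range is assumed to make the pattern invalid. */
        && posix_matches("[z-a]", REG_EXTENDED, "m") == -1;
}
```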

3.7 Grouping Operators (( ... ) or \( ... \))

A group, also known as a subexpression, consists of an open-group operator, any number of other operators, and a close-group operator. Regex treats this sequence as a unit, just as mathematics and programming languages treat a parenthesized expression as a unit.

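Therefore, using groups, you can delimit the argument(s) to an alternation operator or a repetition operator, and keep track of the substring that matched a given group, which lets you use the back-reference operator (see section 3.8 The Back-reference Operator (\digit)).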

If the syntax bit RE_NO_BK_PARENS is set, then `(' represents the open-group operator and `)' represents the close-group operator; otherwise, `\(' and `\)' do.

If the syntax bit RE_UNMATCHED_RIGHT_PAREN_ORD is set and a close-group operator has no matching open-group operator, then Regex considers it to match `)'.

3.8 The Back-reference Operator (\digit)

If the syntax bit RE_NO_BK_REF isn't set, then Regex recognizes back references. A back reference matches a specified preceding group. The back reference operator is represented by `\digit' anywhere after the end of a regular expression's digit-th group (see section 3.7 Grouping Operators (( ... ) or \( ... \))).

digit must be between `1' and `9'. The matcher assigns numbers 1 through 9 to the first nine groups it encounters. By using one of `\1' through `\9' after the corresponding group's close-group operator, you can match a substring identical to the one that the group does.

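Back references match according to the following rules (in all examples below, `(' represents the open-group, `)' the close-group, `{' the open-interval and `}' the close-interval operator):

  1. If the group matches a substring, the back reference matches an identical substring. For example, `(a)\1' matches `aa'.
  2. If the group matches more than once (as it might if followed by, e.g., a repetition operator), then the back reference matches the substring the group last matched.
  3. If the group doesn't participate in a match, i.e., it is part of an alternative not taken or a repetition operator allows zero repetitions of it, then the back reference makes the whole match fail.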

You can use a back reference as an argument to a repetition operator. For example, `(a(b))\2*' matches `a' followed by one or more `b's: the group accounts for `ab' and `\2*' for any further `b's. Similarly, `(a(b))\2{3}' matches `abbbb'.

If there is no preceding digit-th subexpression, the regular expression is invalid.
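Back references can be tried with the POSIX basic syntax, where groups are written `\(' and `\)' and intervals `\{' and `\}'. This is only a sketch: the helper names are hypothetical, and the last check assumes that `regcomp' rejects a reference to a missing group, as glibc's does.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if `pattern' matches somewhere in `text',
   0 if it does not, -1 if `pattern' fails to compile. */
int posix_matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Returns 1 if every expectation below holds. */
int backref_demo(void)
{
    /* The back reference repeats exactly what the group matched. */
    return posix_matches("^\\(ab*\\)-\\1$", 0, "abb-abb") == 1
        && posix_matches("^\\(ab*\\)-\\1$", 0, "abb-ab") == 0
        /* A back reference as the argument of an interval. */
        && posix_matches("^\\(a\\(b\\)\\)\\2\\{3\\}$", 0, "abbbb") == 1
        && posix_matches("^\\(a\\(b\\)\\)\\2\\{3\\}$", 0, "abbb") == 0
        /* There is no second group here, so compilation should fail. */
        && posix_matches("\\(a\\)\\2", 0, "aa") == -1;
}
```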

3.9 Anchoring Operators

These operators can constrain a pattern to match only at the beginning or end of the entire string or at the beginning or end of a line.

3.9.1 The Match-beginning-of-line Operator (^)

This operator can match the empty string either at the beginning of the string or after a newline character. Thus, it is said to anchor the pattern to the beginning of a line.

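In the following cases, `^' represents this operator. (Otherwise, `^' is ordinary.)

  1. It (the `^') is first in the pattern, as in `^foo'.
  2. The syntax bit RE_CONTEXT_INDEP_ANCHORS is set, and it is outside a bracket expression.
  3. It follows an open-group or alternation operator, as in `a\(^b\)' and `a\|^b'.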

These rules imply that some valid patterns containing `^' cannot be matched; for example, `foo^bar' if RE_CONTEXT_INDEP_ANCHORS is set.

If the not_bol field is set in the pattern buffer (see section 7.1.1 GNU Pattern Buffers), then `^' fails to match at the beginning of the string. See section 7.2.3 POSIX Matching, for when you might find this useful.

If the newline_anchor field isn't set in the pattern buffer, then `^' fails to match after a newline. This is useful when you do not regard the string to be matched as broken into lines.
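The POSIX interface exposes comparable controls as flags: `REG_NEWLINE' at compile time and `REG_NOTBOL' at match time. The sketch below shows their effect on `^'; treating them as stand-ins for the pattern buffer fields discussed above is an assumption, and the helper names are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if `pattern' matches somewhere in `text',
   0 if it does not, -1 if `pattern' fails to compile. */
int posix_matches_with(const char *pattern, int cflags, int eflags,
                       const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, eflags);
    regfree(&re);
    return rc == 0;
}

/* Returns 1 if every expectation below holds. */
int bol_demo(void)
{
    /* `^' matches after a newline only when lines are recognized. */
    return posix_matches_with("^bar", REG_EXTENDED, 0, "foo\nbar") == 0
        && posix_matches_with("^bar", REG_EXTENDED | REG_NEWLINE, 0,
                              "foo\nbar") == 1
        /* REG_NOTBOL stops `^' from matching at the start of the string. */
        && posix_matches_with("^foo", REG_EXTENDED, 0, "foo") == 1
        && posix_matches_with("^foo", REG_EXTENDED, REG_NOTBOL, "foo") == 0
        /* It does not affect matches after a newline. */
        && posix_matches_with("^foo", REG_EXTENDED | REG_NEWLINE, REG_NOTBOL,
                              "x\nfoo") == 1;
}
```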

3.9.2 The Match-end-of-line Operator ($)

This operator can match the empty string either at the end of the string or before a newline character in the string. Thus, it is said to anchor the pattern to the end of a line.

It is always represented by `$'. For example, `foo$' usually matches `foo' and also the first three characters of `foo\nbar'.

Its interaction with the syntax bits and pattern buffer fields is exactly the dual of `^''s; see the previous section. (That is, "beginning" becomes "end", "next" becomes "previous", and "after" becomes "before".)
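The end-of-line case can be sketched the same way through the POSIX interface, this time reporting where the match falls. `REG_NEWLINE' and `REG_NOTEOL' are POSIX flags, assumed here to play the role of the pattern buffer fields; `match_span_with' is a hypothetical helper.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: stores the offsets of the overall match in
   `*start' and `*end' and returns 1; returns 0 if there is no match
   and -1 if `pattern' fails to compile. */
int match_span_with(const char *pattern, int cflags, int eflags,
                    const char *text, int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(pattern != NULL && text != NULL);
    assert(start != NULL && end != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 1, m, eflags);
    regfree(&re);
    if (rc != 0)
        return 0;
    *start = (int) m[0].rm_so;
    *end = (int) m[0].rm_eo;
    return 1;
}

/* Returns 1 if every expectation below holds. */
int eol_demo(void)
{
    int s = -1, e = -1;

    /* With lines recognized, `foo$' matches the first three characters. */
    if (match_span_with("foo$", REG_EXTENDED | REG_NEWLINE, 0,
                        "foo\nbar", &s, &e) != 1 || s != 0 || e != 3)
        return 0;
    /* Without REG_NEWLINE, `$' matches only at the end of the string. */
    if (match_span_with("foo$", REG_EXTENDED, 0, "foo\nbar", &s, &e) != 0)
        return 0;
    /* REG_NOTEOL stops `$' from matching at the end of the string. */
    return match_span_with("foo$", REG_EXTENDED, REG_NOTEOL,
                           "foo", &s, &e) == 0;
}
```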

