

2 Regular Expression Syntax

Characters are things you can type. Operators are things in a regular expression that match one or more characters. You compose regular expressions from operators, which in turn you specify using one or more characters.

Most characters represent what we call the match-self operator, i.e., they match themselves; we call these characters ordinary. Other characters represent either all or parts of fancier operators; e.g., `.' represents what we call the match-any-character operator (which, no surprise, matches (almost) any character); we call these characters special. Two different things determine what characters represent what operators:

  1. the regular expression syntax your program has told the Regex library to recognize, and
  2. the context of the character in the regular expression.

In the following sections, we describe these things in more detail.
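
As a rough illustration of how both the syntax and the context decide whether a character is ordinary or special, the sketch below uses the POSIX interface (`regcomp' and `regexec') with its basic syntax. The helper name `bre_matches' is made up for this example and is not part of Regex.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of Regex): returns 1 if PATTERN, compiled
   as a POSIX basic regular expression, matches somewhere in TEXT, 0 if it
   does not, and -1 if the pattern fails to compile. */
static int bre_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: we only want a yes/no answer, not match positions. */
    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this helper, `bre_matches("a.c", "abc")` returns 1 because `.' is special outside a list, while `bre_matches("a[.]c", "abc")` returns 0 because the same character is ordinary inside a list. Likewise `bre_matches("*a", "a")` returns 0: in the POSIX basic syntax a `*' that comes first in the regular expression is ordinary, so the pattern only matches text containing a literal `*a'.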

2.1 Syntax Bits

In any particular syntax for regular expressions, some characters are always special, others are sometimes special, and others are never special. The particular syntax that Regex recognizes for a given regular expression depends on the value in the syntax field of the pattern buffer of that regular expression.

You get a pattern buffer by compiling a regular expression. See section 7.1.1 GNU Pattern Buffers, and section 7.2.1 POSIX Pattern Buffers, for more information on pattern buffers. See section 7.1.2 GNU Regular Expression Compiling, section 7.2.2 POSIX Regular Expression Compiling, and section 7.3.1 BSD Regular Expression Compiling, for more information on compiling.

Regex considers the value of the syntax field to be a collection of bits; we refer to these bits as syntax bits. In most cases, they affect what characters represent what operators. We describe the meanings of the operators to which we refer in section 3 Common Operators, section 4 GNU Operators, and section 5 GNU Emacs Operators.

For reference, here is the complete list of syntax bits, in alphabetical order:

RE_BACKSLASH_ESCAPE_IN_LISTS
If this bit is set, then `\' inside a list (see section 3.6 List Operators ([ ... ] and [^ ... ])) quotes (makes ordinary, if it's special) the following character; if this bit isn't set, then `\' is an ordinary character inside lists. (See section 2.4 The Backslash Character, for what `\' does outside of lists.)
RE_BK_PLUS_QM
If this bit is set, then `\+' represents the match-one-or-more operator and `\?' represents the match-zero-or-one operator; if this bit isn't set, then `+' represents the match-one-or-more operator and `?' represents the match-zero-or-one operator. This bit is irrelevant if RE_LIMITED_OPS is set.
RE_CHAR_CLASSES
If this bit is set, then you can use character classes in lists; if this bit isn't set, then you can't.
RE_CONTEXT_INDEP_ANCHORS
If this bit is set, then `^' and `$' are special anywhere outside a list; if this bit isn't set, then these characters are special only in certain contexts. See section 3.9.1 The Match-beginning-of-line Operator (^), and section 3.9.2 The Match-end-of-line Operator ($).
RE_CONTEXT_INDEP_OPS
If this bit is set, then certain characters are special anywhere outside a list; if this bit isn't set, then those characters are special only in some contexts and are ordinary elsewhere. Specifically, if this bit isn't set then `*', and (if the syntax bit RE_LIMITED_OPS isn't set) `+' and `?' (or `\+' and `\?', depending on the syntax bit RE_BK_PLUS_QM) represent repetition operators only if they're not first in a regular expression or just after an open-group or alternation operator. The same holds for `{' (or `\{', depending on the syntax bit RE_NO_BK_BRACES) if it is the beginning of a valid interval and the syntax bit RE_INTERVALS is set.
RE_CONTEXT_INVALID_OPS
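If this bit is set, then repetition and alternation operators can't be in certain positions within a regular expression. Specifically, the regular expression is invalid if it has (a) a repetition operator first in the regular expression or just after a match-beginning-of-line, open-group, or alternation operator; or (b) an alternation operator first or last in the regular expression, just before a match-end-of-line operator, or just after an alternation or open-group operator. If this bit isn't set, then you can put the characters representing the repetition and alternation operators anywhere in a regular expression. Whether or not they will in fact be operators in certain positions depends on other syntax bits.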
RE_DOT_NEWLINE
If this bit is set, then the match-any-character operator matches a newline; if this bit isn't set, then it doesn't.
RE_DOT_NOT_NULL
If this bit is set, then the match-any-character operator doesn't match a null character; if this bit isn't set, then it does.
RE_INTERVALS
If this bit is set, then Regex recognizes interval operators; if this bit isn't set, then it doesn't.
RE_LIMITED_OPS
If this bit is set, then Regex doesn't recognize the match-one-or-more, match-zero-or-one or alternation operators; if this bit isn't set, then it does.
RE_NEWLINE_ALT
If this bit is set, then newline represents the alternation operator; if this bit isn't set, then newline is ordinary.
RE_NO_BK_BRACES
If this bit is set, then `{' represents the open-interval operator and `}' represents the close-interval operator; if this bit isn't set, then `\{' represents the open-interval operator and `\}' represents the close-interval operator. This bit is relevant only if RE_INTERVALS is set.
RE_NO_BK_PARENS
If this bit is set, then `(' represents the open-group operator and `)' represents the close-group operator; if this bit isn't set, then `\(' represents the open-group operator and `\)' represents the close-group operator.
RE_NO_BK_REFS
If this bit is set, then Regex doesn't recognize `\'digit as the back reference operator; if this bit isn't set, then it does.
RE_NO_BK_VBAR
If this bit is set, then `|' represents the alternation operator; if this bit isn't set, then `\|' represents the alternation operator. This bit is irrelevant if RE_LIMITED_OPS is set.
RE_NO_EMPTY_RANGES
If this bit is set, then a regular expression with a range whose ending point collates lower than its starting point is invalid; if this bit isn't set, then Regex considers such a range to be empty.
RE_UNMATCHED_RIGHT_PAREN_ORD
If this bit is set and the regular expression has no matching open-group operator, then Regex considers what would otherwise be a close-group operator (based on how RE_NO_BK_PARENS is set) to match `)'.
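
To make the effect of a few of these bits concrete, here is a minimal sketch that compiles the same pattern under different bit combinations. It assumes the GNU C library, where the GNU interface (`re_set_syntax', `re_compile_pattern', `re_match') is only visible when `_GNU_SOURCE' is defined before the first include; the helper name `full_match' is hypothetical.

```c
#define _GNU_SOURCE   /* assumption: glibc exposes the GNU interface only with this */
#include <regex.h>
#include <string.h>

/* Hypothetical helper: compile PATTERN with the syntax bits in SYNTAX and
   report whether it matches all of TEXT.  Returns 1 for a full match,
   0 otherwise, and -1 if the pattern is invalid under that syntax. */
static int full_match(reg_syntax_t syntax, const char *pattern, const char *text)
{
    struct re_pattern_buffer buf;
    reg_syntax_t saved;
    const char *err;
    regoff_t matched;

    /* A zeroed pattern buffer lets Regex allocate everything itself. */
    memset(&buf, 0, sizeof buf);

    /* The syntax is a global setting, so restore the old value afterwards. */
    saved = re_set_syntax(syntax);
    err = re_compile_pattern(pattern, strlen(pattern), &buf);
    re_set_syntax(saved);
    if (err != NULL)
        return -1;

    /* re_match returns the number of characters matched starting at
       position 0, or a negative value if there is no match. */
    matched = re_match(&buf, text, (regoff_t) strlen(text), 0, NULL);
    regfree(&buf);
    return matched == (regoff_t) strlen(text);
}
```

Under these assumptions, `full_match(0, "a|b", "b")` returns 0 because `|' is ordinary when RE_NO_BK_VBAR isn't set, whereas `full_match(RE_NO_BK_VBAR, "a|b", "b")` and `full_match(0, "a\\|b", "b")` both return 1. Similarly, `full_match(RE_INTERVALS | RE_NO_BK_BRACES, "a{2}", "aa")` returns 1, but with no bits set the same pattern only matches the literal text `a{2}'.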

2.2 Predefined Syntaxes

If you're programming with Regex, you can set the syntax field of a pattern buffer (see section 7.1.1 GNU Pattern Buffers, and section 7.2.1 POSIX Pattern Buffers) either to an arbitrary combination of syntax bits (see section 2.1 Syntax Bits) or to one of the configurations defined by Regex. These configurations define the syntaxes used by certain programs--GNU Emacs, POSIX Awk, traditional Awk, Grep, Egrep--in addition to syntaxes for POSIX basic and extended regular expressions.

The predefined syntaxes--taken directly from `regex.h'--are:

[[[ syntaxes ]]]
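
Two of these predefined syntaxes, POSIX basic and POSIX extended, can be selected through the POSIX interface without touching syntax bits directly: passing `REG_EXTENDED' to `regcomp' asks for the extended syntax, and omitting it asks for the basic one. The following sketch contrasts the two; the helper name `matches' is invented for the example.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if PATTERN, compiled with CFLAGS
   (0 for POSIX basic, REG_EXTENDED for POSIX extended), matches somewhere
   in TEXT, 0 if it does not, and -1 if the pattern fails to compile. */
static int matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, `matches("^(ab){2}$", REG_EXTENDED, "abab")` returns 1, and so does the basic-syntax spelling `matches("^\\(ab\\)\\{2\\}$", 0, "abab")`. Compiled as a basic regular expression, however, `^(ab){2}$' treats the parentheses and braces as ordinary characters and matches only the literal text `(ab){2}'.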

2.3 Collating Elements vs. Characters

POSIX generalizes the notion of a character to that of a collating element. It defines a collating element to be "a sequence of one or more bytes defined in the current collating sequence as a unit of collation."

This generalizes the notion of a character in two ways. First, a single character can map into two or more collating elements. For example, the German `ß' (es-zet) collates as the collating element `s' followed by another collating element `s'. Second, two or more characters can map into one collating element. For example, the Spanish `ll' collates after `l' and before `m'.

Since POSIX's "collating element" preserves the essential idea of a "character," we use the latter, more familiar, term in this document.

2.4 The Backslash Character

The `\' character has one of four different meanings, depending on the context in which you use it and what syntax bits are set (see section 2.1 Syntax Bits). It can: 1) stand for itself, 2) quote the next character, 3) introduce an operator, or 4) do nothing.

  1. It stands for itself inside a list (see section 3.6 List Operators ([ ... ] and [^ ... ])) if the syntax bit RE_BACKSLASH_ESCAPE_IN_LISTS is not set. For example, `[\]' would match `\'.
  2. It quotes (makes ordinary, if it's special) the next character when you use it either outside a list, or inside a list when the syntax bit RE_BACKSLASH_ESCAPE_IN_LISTS is set.
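  3. It introduces an operator when followed by certain ordinary characters--sometimes only when certain syntax bits are set. See the cases RE_BK_PLUS_QM, RE_NO_BK_BRACES, RE_NO_BK_VBAR, RE_NO_BK_PARENS, RE_NO_BK_REFS in section 2.1 Syntax Bits. Also, `\b', `\B', `\<', `\>', `\w', `\W', `\`' and `\'' represent the word and buffer operators (see section 4 GNU Operators), and, if Regex was compiled with the C preprocessor symbol emacs defined, `\sclass' and `\Sclass' represent the syntactic class operators (see section 5 GNU Emacs Operators).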
  4. In all other cases, Regex ignores `\'. For example, `\n' matches `n'.
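
The first three meanings can be observed through the POSIX interface, as in the sketch below. It assumes the POSIX basic syntax, in which RE_BACKSLASH_ESCAPE_IN_LISTS is not set; the helper name `bre_matches' is hypothetical. The fourth meaning is left out on purpose, since POSIX leaves the effect of `\' before most ordinary characters undefined and implementations may differ.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if PATTERN, compiled as a POSIX basic
   regular expression, matches somewhere in TEXT, 0 if it does not, and
   -1 if the pattern fails to compile. */
static int bre_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Note that each backslash is doubled in the C string literals below. */

/* Meaning 1: inside a list, `\' stands for itself. */
static int backslash_stands_for_itself(void)
{
    return bre_matches("[\\]", "a\\b");
}

/* Meaning 2: outside a list, `\' quotes the special character `.'. */
static int backslash_quotes_dot(void)
{
    return bre_matches("a\\.c", "a.c") && !bre_matches("a\\.c", "abc");
}

/* Meaning 3: `\(' and `\)' introduce the group operators. */
static int backslash_introduces_group(void)
{
    return bre_matches("^\\(ab\\)*$", "abab");
}
```

Each of the three functions returns 1 when the implementation behaves as described in the corresponding list item.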

