

7 Programming with Regex

Here we describe how you use the Regex data structures and functions in C programs. Regex has three interfaces: one designed for GNU, one compatible with POSIX and one compatible with Berkeley UNIX.

7.1 GNU Regex Functions

If you're writing code that doesn't need to be compatible with either POSIX or Berkeley UNIX, you can use these functions. They provide more options than the other interfaces.

7.1.1 GNU Pattern Buffers

To compile, match, or search for a given regular expression, you must supply a pattern buffer. A pattern buffer holds one compiled regular expression.(4)

You can have several different pattern buffers simultaneously, each holding a compiled pattern for a different regular expression.

`regex.h' defines the pattern buffer struct as follows:

[[[ pattern_buffer ]]]

7.1.2 GNU Regular Expression Compiling

In GNU, you can both match and search for a given regular expression. To do either, you must first compile it in a pattern buffer (see section 7.1.1 GNU Pattern Buffers).

Regular expressions match according to the syntax with which they were compiled; with GNU, you indicate what syntax you want by setting the variable re_syntax_options (declared in `regex.h' and defined in `regex.c') before calling the compiling function, re_compile_pattern (see below). See section 2.1 Syntax Bits, and section 2.2 Predefined Syntaxes.

You can change the value of re_syntax_options at any time. Usually, however, you set its value once and then never change it.

re_compile_pattern takes a pattern buffer as an argument. You must initialize the following fields:

translate
Initialize this to point to a translate table if you want one, or to zero if you don't. We explain translate tables in section 7.1.7 GNU Translate Tables.
fastmap
Initialize this to nonzero if you want a fastmap, or to zero if you don't.
buffer
allocated
If you want re_compile_pattern to allocate memory for the compiled pattern, set both of these to zero. If you have an existing block of memory (allocated with malloc) you want Regex to use, set buffer to its address and allocated to its size (in bytes). re_compile_pattern uses realloc to extend the space for the compiled pattern as necessary.

To compile a pattern buffer, use:

char * 
re_compile_pattern (const char *regex, const int regex_size, 
                    struct re_pattern_buffer *pattern_buffer)

regex is the regular expression's address, regex_size is its length, and pattern_buffer is the pattern buffer's address.

If re_compile_pattern successfully compiles the regular expression, it returns zero and sets *pattern_buffer to the compiled pattern. It sets the pattern buffer's fields as follows:

buffer
to the compiled pattern.
used
to the number of bytes the compiled pattern in buffer occupies.
syntax
to the current value of re_syntax_options.
re_nsub
to the number of subexpressions in regex.
fastmap_accurate
to zero on the theory that the pattern you're compiling is different from the one previously compiled into buffer; in that case (since you can't make a fastmap without a compiled pattern), fastmap would either contain an incompatible fastmap, or nothing at all.

If re_compile_pattern can't compile regex, it returns an error string corresponding to one of the errors listed in section 7.2.2 POSIX Regular Expression Compiling.
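
The initialization and compilation steps above can be sketched as follows. This is only an illustration, assuming the GNU C Library's `regex.h' with _GNU_SOURCE defined; the helper name count_groups is hypothetical and not part of Regex.

```c
#define _GNU_SOURCE   /* assumption: glibc, which needs this for the GNU interface */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: compile PATTERN with POSIX basic syntax and return
   the number of subexpressions it contains, or -1 if it does not compile.  */
int
count_groups (const char *pattern)
{
  struct re_pattern_buffer buf;
  int nsub;

  assert (pattern != NULL);

  /* Zeroing the struct sets translate, fastmap, buffer and allocated to
     zero: no translate table, no fastmap, and Regex allocates the memory.  */
  memset (&buf, 0, sizeof buf);

  /* Choose the syntax before calling the compiling function.  */
  re_syntax_options = RE_SYNTAX_POSIX_BASIC;

  /* A nonzero return value is an error string.  */
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -1;

  nsub = (int) buf.re_nsub;
  regfree (&buf);
  return nsub;
}
```

Under these assumptions, count_groups ("\\(a\\)b*") reports one subexpression, and a pattern with an unmatched open-group operator such as "\\(a" yields -1.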

7.1.3 GNU Matching

Matching the GNU way means trying to match as much of a string as possible starting at a position within it you specify. Once you've compiled a pattern into a pattern buffer (see section 7.1.2 GNU Regular Expression Compiling), you can ask the matcher to match that pattern against a string using:

int
re_match (struct re_pattern_buffer *pattern_buffer, 
          const char *string, const int size, 
          const int start, struct re_registers *regs)

pattern_buffer is the address of a pattern buffer containing a compiled pattern. string is the string you want to match; it can contain newline and null characters. size is the length of that string. start is the string index at which you want to begin matching; the first character of string is at index zero. See section 7.1.8 Using Registers, for an explanation of regs; you can safely pass zero.

re_match matches the regular expression in pattern_buffer against the string string according to the syntax in pattern_buffer's syntax field. (See section 7.1.2 GNU Regular Expression Compiling, for how to set it.) The function returns -1 if the compiled pattern does not match any part of string and -2 if an internal error happens; otherwise, it returns how many (possibly zero) characters of string the pattern matched.

An example: suppose pattern_buffer points to a pattern buffer containing the compiled pattern for `a*', and string points to `aaaaab' (whereupon size should be 6). Then if start is 2, re_match returns 3, i.e., `a*' would have matched the last three `a's in string. If start is 0, re_match returns 5, i.e., `a*' would have matched all the `a's in string. If start is either 5 or 6, it returns zero.

If start is not between zero and size, then re_match returns -1.
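
The `a*' example above can be reproduced with a short sketch like the one below. It assumes the GNU C Library's `regex.h' with _GNU_SOURCE defined; match_at is a hypothetical helper, and its use of -2 for a compile failure is this sketch's own convention.

```c
#define _GNU_SOURCE   /* assumption: glibc, which needs this for the GNU interface */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: compile PATTERN (POSIX basic syntax) and try to
   match it against STRING beginning at index START.  Returns whatever
   re_match returns, or -2 if PATTERN does not compile.  */
int
match_at (const char *pattern, const char *string, int start)
{
  struct re_pattern_buffer buf;
  int result;

  assert (pattern != NULL && string != NULL);

  memset (&buf, 0, sizeof buf);
  re_syntax_options = RE_SYNTAX_POSIX_BASIC;
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -2;

  /* Passing zero for regs: we only want the match length.  */
  result = (int) re_match (&buf, string, (int) strlen (string), start, NULL);

  regfree (&buf);
  return result;
}
```

With this helper, match_at ("a*", "aaaaab", 2) gives 3 and match_at ("a*", "aaaaab", 0) gives 5, as described above.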

7.1.4 GNU Searching

Searching means trying to match starting at successive positions within a string. The function re_search does this.

Before calling re_search, you must compile your regular expression. See section 7.1.2 GNU Regular Expression Compiling.

Here is the function declaration:

int 
re_search (struct re_pattern_buffer *pattern_buffer, 
           const char *string, const int size, 
           const int start, const int range, 
           struct re_registers *regs)

whose arguments are the same as those to re_match (see section 7.1.3 GNU Matching) except that the two arguments start and range replace re_match's argument start.

If range is positive, then re_search attempts a match starting first at index start, then at start + 1 if that fails, and so on, up to start + range; if range is negative, then it attempts a match starting first at index start, then at start - 1 if that fails, and so on.

If start is not between zero and size, then re_search returns -1. When range is positive, re_search adjusts range so that start + range - 1 is between zero and size, if necessary; that way it won't search outside of string. Similarly, when range is negative, re_search adjusts range so that start + range + 1 is between zero and size, if necessary.

If the fastmap field of pattern_buffer is zero, re_search matches starting at consecutive positions; otherwise, it uses fastmap to make the search more efficient. See section 7.1.6 Searching with Fastmaps.

If no match is found, re_search returns -1. If a match is found, it returns the index where the match began. If an internal error happens, it returns -2.
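
Here is a minimal sketch of forward and backward searching, assuming the GNU C Library's `regex.h' with _GNU_SOURCE defined. The helper search_from is hypothetical; it returns -2 if the pattern does not compile, which is this sketch's own convention.

```c
#define _GNU_SOURCE   /* assumption: glibc, which needs this for the GNU interface */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: compile PATTERN (POSIX basic syntax) and search
   STRING for it, trying start positions START through START + RANGE.
   A negative RANGE searches backward.  Returns what re_search returns.  */
int
search_from (const char *pattern, const char *string, int start, int range)
{
  struct re_pattern_buffer buf;
  int result;

  assert (pattern != NULL && string != NULL);

  memset (&buf, 0, sizeof buf);     /* no fastmap, no translate table */
  re_syntax_options = RE_SYNTAX_POSIX_BASIC;
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -2;

  result = (int) re_search (&buf, string, (int) strlen (string),
                            start, range, NULL);

  regfree (&buf);
  return result;
}
```

For the string `aaaaab', searching forward for `b' from index 0 with range 6 finds it at index 5, while the same search with range 3 never reaches it. Searching backward for `a' from index 5 with range -5 stops at index 4.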

7.1.5 Matching and Searching with Split Data

Using the functions re_match_2 and re_search_2, you can match or search in data that is divided into two strings.

The function:

int
re_match_2 (struct re_pattern_buffer *buffer, 
            const char *string1, const int size1, 
            const char *string2, const int size2, 
            const int start, 
            struct re_registers *regs, 
            const int stop)

is similar to re_match (see section 7.1.3 GNU Matching) except that you pass two data strings and sizes, and an index stop beyond which you don't want the matcher to try matching. As with re_match, if it succeeds, re_match_2 returns how many characters of the concatenated strings it matched. Regard string1 and string2 as concatenated when you set the arguments start and stop and use the contents of regs; re_match_2 never returns a value larger than size1 + size2.

The function:

int
re_search_2 (struct re_pattern_buffer *buffer, 
             const char *string1, const int size1, 
             const char *string2, const int size2, 
             const int start, const int range, 
             struct re_registers *regs, 
             const int stop)

is similarly related to re_search.
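
As an illustration of searching split data, the sketch below looks for a pattern that straddles the boundary between two strings. It assumes the GNU C Library's `regex.h' with _GNU_SOURCE defined; search_split is a hypothetical helper that returns -2 if the pattern does not compile.

```c
#define _GNU_SOURCE   /* assumption: glibc, which needs this for the GNU interface */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: search for PATTERN (POSIX basic syntax) in the
   data formed by PART1 followed by PART2, without joining them first.
   The result is an index into the concatenation, or -1 for no match.  */
int
search_split (const char *pattern, const char *part1, const char *part2)
{
  struct re_pattern_buffer buf;
  int size1, size2, result;

  assert (pattern != NULL && part1 != NULL && part2 != NULL);
  size1 = (int) strlen (part1);
  size2 = (int) strlen (part2);

  memset (&buf, 0, sizeof buf);
  re_syntax_options = RE_SYNTAX_POSIX_BASIC;
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -2;

  /* start, range and stop all count positions in the concatenation.  */
  result = (int) re_search_2 (&buf, part1, size1, part2, size2,
                              0, size1 + size2, NULL, size1 + size2);

  regfree (&buf);
  return result;
}
```

For example, search_split ("lo wo", "hel", "lo world") reports index 3, even though the matched text begins in the first string and ends in the second.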

7.1.6 Searching with Fastmaps

If you're searching through a long string, you should use a fastmap. Without one, the searcher tries to match at consecutive positions in the string. Generally, most of the characters in the string could not start a match. It takes much longer to try matching at a given position in the string than it does to check in a table whether or not the character at that position could start a match. A fastmap is such a table.

More specifically, a fastmap is an array indexed by the characters in your character set. Under the ASCII encoding, therefore, a fastmap has 256 elements. If you want the searcher to use a fastmap with a given pattern buffer, you must allocate the array and assign the array's address to the pattern buffer's fastmap field. You can either compile the fastmap yourself or have re_search do it for you; when fastmap is nonzero, re_search automatically compiles a fastmap the first time you search using a particular compiled pattern.

To compile a fastmap yourself, use:

int
re_compile_fastmap (struct re_pattern_buffer *pattern_buffer)

pattern_buffer is the address of a pattern buffer. If the character c could start a match for the pattern, re_compile_fastmap makes pattern_buffer->fastmap[c] nonzero. It returns 0 if it can compile a fastmap and -2 if there is an internal error. For example, if `|' is the alternation operator and pattern_buffer holds the compiled pattern for `a|b', then re_compile_fastmap sets fastmap['a'] and fastmap['b'] (and no others).

re_search uses a fastmap as it moves along in the string: it checks the string's characters until it finds one that's in the fastmap. Then it tries matching at that character. If the match fails, it repeats the process. So, by using a fastmap, re_search doesn't waste time trying to match at positions in the string that couldn't start a match.

If you don't want re_search to use a fastmap, store zero in the fastmap field of the pattern buffer before calling re_search.

Once you've initialized a pattern buffer's fastmap field, you need never do so again--even if you compile a new pattern in it--provided the way the field is set still reflects whether or not you want a fastmap. re_search will still either do nothing if fastmap is null or, if it isn't, compile a new fastmap for the new pattern.
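
The `a|b' case described above can be checked with a sketch along these lines. It assumes the GNU C Library's `regex.h' with _GNU_SOURCE defined and the default C locale; can_start_match is a hypothetical helper. The fastmap is allocated with malloc because, in the GNU C Library, regfree also frees the fastmap field.

```c
#define _GNU_SOURCE   /* assumption: glibc, which needs this for the GNU interface */
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return 1 if character C could start a match for
   PATTERN (POSIX extended syntax), 0 if it could not, -1 on any error.  */
int
can_start_match (const char *pattern, unsigned char c)
{
  struct re_pattern_buffer buf;
  int result;

  assert (pattern != NULL);

  memset (&buf, 0, sizeof buf);
  buf.fastmap = malloc (256);          /* one entry per character */
  if (buf.fastmap == NULL)
    return -1;

  re_syntax_options = RE_SYNTAX_POSIX_EXTENDED;   /* `|' is alternation */
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    {
      free (buf.fastmap);
      return -1;
    }

  if (re_compile_fastmap (&buf) != 0)
    result = -1;
  else
    result = buf.fastmap[c] != 0;

  regfree (&buf);    /* assumption: as in glibc, this also frees fastmap */
  return result;
}
```

Under these assumptions, can_start_match ("a|b", 'a') and can_start_match ("a|b", 'b') are both 1, while can_start_match ("a|b", 'c') is 0.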

7.1.7 GNU Translate Tables

If you set the translate field of a pattern buffer to a translate table, then the GNU Regex functions to which you've passed that pattern buffer use it to apply a simple transformation to all the regular expression and string characters at which they look.

A translate table is an array indexed by the characters in your character set. Under the ASCII encoding, therefore, a translate table has 256 elements. The array's elements are also characters in your character set. When the Regex functions see a character c, they use translate[c] in its place, with one exception: the character after a `\' is not translated. (This ensures that the operators, e.g., `\B' and `\b', are always distinguishable.)

For example, a table that maps all lowercase letters to the corresponding uppercase ones would cause the matcher to ignore differences in case.(5) Such a table would map all characters except lowercase letters to themselves, and lowercase letters to the corresponding uppercase ones. Under the ASCII encoding, here's how you could initialize such a table (we'll call it case_fold):

for (i = 0; i < 256; i++)
  case_fold[i] = i;
for (i = 'a'; i <= 'z'; i++)
  case_fold[i] = i - ('a' - 'A');

You tell Regex to use a translate table on a given pattern buffer by assigning that table's address to the translate field of that buffer. If you don't want Regex to do any translation, put zero into this field. You'll get weird results if you change the table's contents anytime between compiling the pattern buffer, compiling its fastmap, and matching or searching with the pattern buffer.
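
Putting the case_fold table to work might look like the sketch below. It assumes the GNU C Library's `regex.h' with _GNU_SOURCE defined, where the translate field has type unsigned char * and regfree also frees the table (hence the malloc). The helper match_ignoring_case is hypothetical and returns -2 on a compile or allocation failure.

```c
#define _GNU_SOURCE   /* assumption: glibc, which needs this for the GNU interface */
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: match PATTERN (POSIX basic syntax) at the start
   of STRING, ignoring case by way of a translate table.  Returns what
   re_match returns.  */
int
match_ignoring_case (const char *pattern, const char *string)
{
  struct re_pattern_buffer buf;
  unsigned char *case_fold;
  int i, result;

  assert (pattern != NULL && string != NULL);

  case_fold = malloc (256);
  if (case_fold == NULL)
    return -2;
  for (i = 0; i < 256; i++)            /* identity for everything...  */
    case_fold[i] = (unsigned char) i;
  for (i = 'a'; i <= 'z'; i++)         /* ...except lowercase letters */
    case_fold[i] = (unsigned char) (i - ('a' - 'A'));

  memset (&buf, 0, sizeof buf);
  buf.translate = case_fold;           /* must be set before compiling */
  re_syntax_options = RE_SYNTAX_POSIX_BASIC;
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    {
      free (case_fold);
      return -2;
    }

  result = (int) re_match (&buf, string, (int) strlen (string), 0, NULL);

  regfree (&buf);    /* assumption: as in glibc, this also frees translate */
  return result;
}
```

With the table in place, the pattern `hello' matches all five characters of `HeLLo'.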

7.1.8 Using Registers

A group in a regular expression can match a (possibly empty) substring of the string that regular expression as a whole matched. The matcher remembers the beginning and end of the substring matched by each group.

To find out what they matched, pass a nonzero regs argument to a GNU matching or searching function (see section 7.1.3 GNU Matching and section 7.1.4 GNU Searching), i.e., the address of a structure of this type, as defined in `regex.h':

struct re_registers
{
  unsigned num_regs;
  regoff_t *start;
  regoff_t *end;
};

Except for (possibly) the num_regs'th element (see below), the ith element of the start and end arrays records information about the ith group in the pattern. (They're declared as C pointers, but this is only because not all C compilers accept zero-length arrays; conceptually, it is simplest to think of them as arrays.)

The start and end arrays are allocated in various ways, depending on the value of the regs_allocated field in the pattern buffer passed to the matcher.

The simplest and perhaps most useful is to let the matcher (re)allocate enough space to record information for all the groups in the regular expression. If regs_allocated is REGS_UNALLOCATED, the matcher allocates 1 + re_nsub elements (re_nsub is another field in the pattern buffer; see section 7.1.1 GNU Pattern Buffers), sets the extra element to -1, and sets regs_allocated to REGS_REALLOCATE. Then on subsequent calls with the same pattern buffer and regs arguments, the matcher reallocates more space if necessary.

It would perhaps be more logical to make the regs_allocated field part of the re_registers structure, instead of part of the pattern buffer. But in that case the caller would be forced to initialize the structure before passing it. Much existing code doesn't do this initialization, and it's arguably better to avoid it anyway.

re_compile_pattern sets regs_allocated to REGS_UNALLOCATED, so if you use the GNU regular expression functions, you get this behavior by default.


POSIX, on the other hand, requires a different interface: the caller is supposed to pass in a fixed-length array which the matcher fills. Therefore, if regs_allocated is REGS_FIXED the matcher simply fills that array.

The following examples illustrate the information recorded in the re_registers structure. (In all of them, `(' represents the open-group and `)' the close-group operator. The first character in the string string is at index 0.)
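
As one concrete illustration (a sketch of our own, assuming the GNU C Library's `regex.h' with _GNU_SOURCE defined), the code below lets the matcher allocate the registers and then reads the span of one group. The struct span type and the helper group_span are hypothetical. Note that under the basic syntax used here the group operators are written `\(' and `\)'.

```c
#define _GNU_SOURCE   /* assumption: glibc, which needs this for the GNU interface */
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

struct span { int start; int end; };   /* hypothetical result type */

/* Hypothetical helper: search STRING for PATTERN (POSIX basic syntax)
   and return where group number GROUP matched; group 0 is the whole
   match.  Returns { -1, -1 } if nothing matched or on error.  */
struct span
group_span (const char *pattern, const char *string, unsigned group)
{
  struct re_pattern_buffer buf;
  struct re_registers regs;
  struct span result = { -1, -1 };
  int length;

  assert (pattern != NULL && string != NULL);
  length = (int) strlen (string);

  memset (&buf, 0, sizeof buf);
  memset (&regs, 0, sizeof regs);
  re_syntax_options = RE_SYNTAX_POSIX_BASIC;
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return result;

  /* regs_allocated is REGS_UNALLOCATED after compiling, so the matcher
     allocates the start and end arrays for us.  */
  if (re_search (&buf, string, length, 0, length, &regs) >= 0
      && group <= buf.re_nsub)
    {
      result.start = (int) regs.start[group];
      result.end = (int) regs.end[group];
    }

  free (regs.start);    /* the arrays, if allocated, are ours to free */
  free (regs.end);
  regfree (&buf);
  return result;
}
```

Searching `xaab' for `\(a*\)b' finds the whole match at indices 1 to 4 and the group at indices 1 to 3; each end index is one past the last character matched.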

7.1.9 Freeing GNU Pattern Buffers

To free any allocated fields of a pattern buffer, you can use the POSIX function described in section 7.2.6 Freeing POSIX Pattern Buffers, since the type regex_t--the type for POSIX pattern buffers--is equivalent to the type re_pattern_buffer. After freeing a pattern buffer, you need to again compile a regular expression in it (see section 7.1.2 GNU Regular Expression Compiling) before passing it to a matching or searching function.

7.2 POSIX Regex Functions

If you're writing code that has to be POSIX compatible, you'll need to use these functions. Their interfaces are as specified by POSIX, draft 1003.2/D11.2.

7.2.1 POSIX Pattern Buffers

To compile or match a given regular expression the POSIX way, you must supply a pattern buffer exactly the way you do for GNU (see section 7.1.1 GNU Pattern Buffers). POSIX pattern buffers have type regex_t, which is equivalent to the GNU pattern buffer type re_pattern_buffer.

7.2.2 POSIX Regular Expression Compiling

With POSIX, you can only search for a given regular expression; you can't match it. To do this, you must first compile it in a pattern buffer, using regcomp.

To compile a pattern buffer, use:

int
regcomp (regex_t *preg, const char *regex, int cflags)

preg is the initialized pattern buffer's address, regex is the regular expression's address, and cflags is the compilation flags, which Regex considers as a collection of bits. Here are the valid bits, as defined in `regex.h':

REG_EXTENDED
says to use POSIX Extended Regular Expression syntax; if this bit isn't set, regcomp uses POSIX Basic Regular Expression syntax. regcomp sets preg's syntax field accordingly.
REG_ICASE
says to ignore case; regcomp sets preg's translate field to a translate table which ignores case, replacing anything you've put there before.
REG_NOSUB
says to set preg's no_sub field; see section 7.2.3 POSIX Matching, for what this means.
REG_NEWLINE
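says to treat newlines specially: the match-any-character operator doesn't match a newline; a nonmatching list that doesn't contain a newline doesn't match a newline; the match-beginning-of-line operator matches the empty string immediately after a newline, regardless of REG_NOTBOL; and the match-end-of-line operator matches the empty string immediately before a newline, regardless of REG_NOTEOL.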

If regcomp successfully compiles the regular expression, it returns zero and sets *preg to the compiled pattern. Except for syntax (which it sets as explained above), it also sets the same fields the same way as does the GNU compiling function (see section 7.1.2 GNU Regular Expression Compiling).

If regcomp can't compile the regular expression, it returns one of the error codes listed here. (Except when noted differently, the syntax in all examples below is basic regular expression syntax.)

REG_BADRPT
For example, the consecutive repetition operators `**' in `a**' are invalid. As another example, if the syntax is extended regular expression syntax, then the repetition operator `*' with nothing on which to operate in `*' is invalid.
REG_BADBR
For example, the count `-1' in `a\{-1' is invalid.
REG_EBRACE
For example, `a\{1' is missing a close-interval operator.
REG_EBRACK
For example, `[a' is missing a close-list operator.
REG_ERANGE
For example, the range ending point `a' that collates lower than does its starting point `z' in `[z-a]' is invalid. The range with the character class `[:alpha:]' as its starting point in `[[:alpha:]-|]' is also invalid.
REG_ECTYPE
For example, the character class name `foo' in `[[:foo:]' is invalid.
REG_EPAREN
For example, `a\)' is missing an open-group operator and `\(a' is missing a close-group operator.
REG_ESUBREG
For example, the back reference `\2' that refers to a nonexistent subexpression in `\(a\)\2' is invalid.
REG_EEND
Returned when a regular expression causes no other more specific error.
REG_EESCAPE
For example, the trailing backslash `\' in `a\' is invalid, as is the one in `\'.
REG_BADPAT
For example, in the extended regular expression syntax, the empty group `()' in `a()b' is invalid.
REG_ESIZE
Returned when a regular expression needs a pattern buffer larger than 65536 bytes.
REG_ESPACE
Returned when a regular expression causes Regex to run out of memory.

7.2.3 POSIX Matching

Matching the POSIX way means trying to match a null-terminated string starting at its first character. Once you've compiled a pattern into a pattern buffer (see section 7.2.2 POSIX Regular Expression Compiling), you can ask the matcher to match that pattern against a string using:

int
regexec (const regex_t *preg, const char *string, 
         size_t nmatch, regmatch_t pmatch[], int eflags)

preg is the address of a pattern buffer for a compiled pattern. string is the string you want to match.

See section 7.2.5 Using Byte Offsets, for an explanation of pmatch. If you pass zero for nmatch or you compiled preg with the compilation flag REG_NOSUB set, then regexec will ignore pmatch; otherwise, you must allocate it to have at least nmatch elements. regexec will record nmatch byte offsets in pmatch, and set to -1 any unused elements up to pmatch[nmatch - 1].

eflags specifies execution flags--namely, the two bits REG_NOTBOL and REG_NOTEOL (defined in `regex.h'). If you set REG_NOTBOL, then the match-beginning-of-line operator (see section 3.9.1 The Match-beginning-of-line Operator (^)) always fails to match. This lets you match against pieces of a line, as you would need to if, say, searching for repeated instances of a given pattern in a line; it would work correctly for patterns both with and without match-beginning-of-line operators. REG_NOTEOL works analogously for the match-end-of-line operator (see section 3.9.2 The Match-end-of-line Operator ($)); it exists for symmetry.

regexec tries to find a match for preg in string according to the syntax in preg's syntax field. (See section 7.2.2 POSIX Regular Expression Compiling, for how to set it.) The function returns zero if the compiled pattern matches string and REG_NOMATCH (defined in `regex.h') if it doesn't.
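
A compile-then-search round trip with the POSIX functions can be sketched as follows. Only standard POSIX calls are used; the helper name posix_search is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile PATTERN with the given compilation flags
   and search STRING for it.  Returns zero if a match is found,
   REG_NOMATCH if not, or the regcomp error code if PATTERN is invalid.  */
int
posix_search (const char *pattern, const char *string, int cflags)
{
  regex_t preg;
  int status;

  assert (pattern != NULL && string != NULL);

  /* REG_NOSUB: we only want a yes/no answer, so pmatch is ignored.  */
  status = regcomp (&preg, pattern, cflags | REG_NOSUB);
  if (status != 0)
    return status;

  status = regexec (&preg, string, 0, NULL, 0);

  regfree (&preg);
  return status;
}
```

For instance, posix_search ("^a+b$", "aaab", REG_EXTENDED) returns zero, while the same pattern against `xaab' returns REG_NOMATCH. Passing REG_ICASE lets `hello' match inside `say HELLO'.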

7.2.4 Reporting Errors

If either regcomp or regexec fails, it returns a nonzero error code, the possibilities for which are defined in `regex.h'. See section 7.2.2 POSIX Regular Expression Compiling, and section 7.2.3 POSIX Matching, for what these codes mean. To get an error string corresponding to these codes, you can use:

size_t
regerror (int errcode,
          const regex_t *preg,
          char *errbuf,
          size_t errbuf_size)

errcode is an error code, preg is the address of the pattern buffer which provoked the error, errbuf is the error buffer, and errbuf_size is errbuf's size.

regerror returns the size in bytes of the error string corresponding to errcode (including its terminating null). If errbuf and errbuf_size are nonzero, it also returns in errbuf the first errbuf_size - 1 characters of the error string, followed by a null. errbuf_size must be a nonnegative number less than or equal to the size in bytes of errbuf.

You can call regerror with a null errbuf and a zero errbuf_size to determine how large errbuf need be to accommodate regerror's error string.
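
The two-call technique just described might be sketched like this. Only standard POSIX calls are used; compile_error_message is a hypothetical helper, and the caller is expected to free the string it returns.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: try to compile PATTERN.  Returns NULL if it
   compiles, otherwise a newly allocated copy of the error string
   (or NULL if memory runs out).  */
char *
compile_error_message (const char *pattern, int cflags)
{
  regex_t preg;
  int errcode;
  size_t needed;
  char *message;

  assert (pattern != NULL);

  errcode = regcomp (&preg, pattern, cflags);
  if (errcode == 0)
    {
      regfree (&preg);
      return NULL;
    }

  /* First call: null buffer and zero size, just to learn the length
     (which includes the terminating null).  */
  needed = regerror (errcode, &preg, NULL, 0);

  message = malloc (needed);
  if (message != NULL)
    regerror (errcode, &preg, message, needed);   /* second call fills it */
  return message;
}
```

The exact wording of the message depends on the implementation, so the sketch makes no assumption about it beyond its being a null-terminated string.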

7.2.5 Using Byte Offsets

In POSIX, variables of type regmatch_t hold information that is analogous, but not identical, to that in GNU's registers (see section 7.1.8 Using Registers). To get information about registers in POSIX, pass to regexec a nonzero pmatch of type regmatch_t, i.e., the address of a structure of this type, defined in `regex.h':

typedef struct
{
  regoff_t rm_so;
  regoff_t rm_eo;
} regmatch_t;

When reading in section 7.1.8 Using Registers, about how the matching function stores the information into the registers, substitute pmatch for regs, pmatch[i].rm_so for regs->start[i] and pmatch[i].rm_eo for regs->end[i].
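
Reading byte offsets from pmatch can be sketched as follows, using only standard POSIX calls. The helper posix_group and its limit of ten groups are choices made for this illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: search STRING for PATTERN (POSIX extended
   syntax) and return the byte offsets of group number GROUP; group 0
   is the whole match.  Both offsets are -1 if the group did not
   participate, nothing matched, or PATTERN is invalid.  */
regmatch_t
posix_group (const char *pattern, const char *string, size_t group)
{
  enum { MAX_GROUPS = 10 };            /* arbitrary limit for the sketch */
  regex_t preg;
  regmatch_t pmatch[MAX_GROUPS];
  regmatch_t none = { -1, -1 };
  regmatch_t result = none;

  assert (pattern != NULL && string != NULL);

  if (group >= MAX_GROUPS || regcomp (&preg, pattern, REG_EXTENDED) != 0)
    return none;

  /* regexec fills pmatch[0..MAX_GROUPS-1], setting unused entries to -1.  */
  if (regexec (&preg, string, MAX_GROUPS, pmatch, 0) == 0)
    result = pmatch[group];

  regfree (&preg);
  return result;
}
```

Searching `xaab' for `(a*)b' yields offsets 1 and 4 for the whole match and 1 and 3 for the group; rm_eo is the offset just past the last matched byte.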

7.2.6 Freeing POSIX Pattern Buffers

To free any allocated fields of a pattern buffer, use:

void 
regfree (regex_t *preg)

preg is the pattern buffer whose allocated fields you want freed. regfree also sets preg's allocated and used fields to zero. After freeing a pattern buffer, you need to again compile a regular expression in it (see section 7.2.2 POSIX Regular Expression Compiling) before passing it to the matching function (see section 7.2.3 POSIX Matching).

7.3 BSD Regex Functions

If you're writing code that has to be Berkeley UNIX compatible, you'll need to use these functions, whose interfaces are the same as those in Berkeley UNIX.

7.3.1 BSD Regular Expression Compiling

With Berkeley UNIX, you can only search for a given regular expression; you can't match one. To search for it, you must first compile it. Before you compile it, you must indicate the regular expression syntax you want it compiled according to by setting the variable re_syntax_options (declared in `regex.h') to some syntax (see section 2 Regular Expression Syntax).

To compile a regular expression use:

char *
re_comp (char *regex)

regex is the address of a null-terminated regular expression. re_comp uses an internal pattern buffer, so you can use only the most recently compiled pattern buffer. This means that if you want to use a given regular expression that you've already compiled--but it isn't the latest one you've compiled--you'll have to recompile it. If you call re_comp with a null pointer (not the empty string) as the argument, it doesn't change the contents of the pattern buffer.

If re_comp successfully compiles the regular expression, it returns zero. If it can't compile the regular expression, it returns an error string. re_comp's error messages are identical to those of re_compile_pattern (see section 7.1.2 GNU Regular Expression Compiling).

7.3.2 BSD Searching

Searching the Berkeley UNIX way means searching in a string starting at its first character and trying successive positions within it to find a match. Once you've compiled a pattern using re_comp (see section 7.3.1 BSD Regular Expression Compiling), you can ask Regex to search for that pattern in a string using:

int
re_exec (char *string)

string is the address of the null-terminated string in which you want to search.

re_exec returns either 1 for success or 0 for failure. It automatically uses a GNU fastmap (see section 7.1.6 Searching with Fastmaps).

