Regex - Regular Expressions in ANS Forth

(Version 0.8, 18 November 2010)

  1. Introduction
  2. Features
  3. Portability
  4. Download
  5. Install and Test
  6. Using Regex
  7. String Builder
  8. Features for Future Versions
  9. Features Excluded
  10. Implementation Notes
  11. ANS Forth Compliancy
  12. Contact
  13. Version History
  14. Home

Important change in versions 0.5 onwards

Change to the specification of match.

1. Introduction

Regex is an ANS Forth program to compile regular expressions written in the conventional format and to match ASCII text against them. The regular expression flavour is Perl-like. This release is rich in features but is still in development with features to be added. The package includes a String Builder module to make use of the results from a successful regular expression match. Regex is intended for use on 32 bit or larger desktop computers rather than 16 bit or embedded processors and so little attention has been paid to code size - ease of development and maintenance have been the main drivers. To this end the tools LexGen and Grace have been used for the regular expression scanner and parsers and an object oriented approach has been used in the design. The availability of this program has not yet been publicly announced, if you stumble upon it you are welcome to use it but be warned that it is still under development so various things may change until version 1.0 is released.

2. Features

The features currently provided are:

For String Builder features see String Builder.

3. Portability

Regex 0.8 has been tested successfully on 32 bit versions of GForth, SwiftForth, VFX Forth, Win32 Forth and BigForth, all running under MS Windows XP and Windows 7 (32 bit version) and is, therefore, likely to be ANS Forth compliant. Version 0.2 was reported as working on GForth under Mac OSX and Linux, and on PFE; there is no reason, other than pessimism, to suppose the latest version won't work on these systems. Versions before 0.8 would not work with 64 bit Forths, version 0.8 should work on 64 bit Forths but this remains untested. Feedback on the use of Regex with other systems will be welcomed.

Use of Regex using Forths with cells smaller than 32 bits is not supported.

4. Download

Regex version 0.8 released 18 November 2010. This contains the following files:

Regex files:

Testing:

Documentation:

Note that, from version 0.3 onwards, Regex and String Builder files have been concatenated until one large file - regex.fth. Similarly regextest.fth. The state transition tables and parsers were generated by LexGen and Grace in an unformatted mode and hence are unreadable (being computer generated, they were never intended to be readable anyway).

5. Install and Test

This is configured to work with GForth 0.7.0. To test Regex with GForth, unzip the files into a suitable directory under GForth called, say, regex. Then type:

s" regex/regextest.fth" included

or similar, which should give the following output:


Gforth 0.7.0, Copyright (C) 1995-2008 Free Software Foundation, Inc.
Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license'
Type `bye' to exit
s" regex/regextest.fth" included Loading extended mini oof ...
Extended mini oof loaded. <0> Loading Sets.fth ...
redefined cell  redefined decode  Sets.fth loaded. <0> redefined 1Cell with 1cel
l  redefined ?name-found  redefined noop  redefined ~  redefined testsym?  redef
ined this-parser  redefined ~  redefined testsym?  redefined >string  redefined 
Format  redefined <metachar>  redefined <forth>  redefined <group>  redefined <e
scapedchar>  redefined <reference>  redefined <conditional>  redefined <modemod>
  redefined <casefolder>  redefined <modifier>  redefined this-parser  redefined
 ~

Regex version 0.8
Copyright (C) Gerry Jackson 2010

redefined TESTING with Testing
Running Regex tests

*  1 *  2 *  3 *  4 *  5 *  6 *  7 *  8 *  9 * 10 * 11 * 12 * 13 * 14 * 15
* 16 * 17 * 18 * 19 * 20 * 21 * 22 * 23 * 24 * 25 * 26 * 27 * 28 * 29 * 30
* 31 * 32 * 33 * 34 * 35 * 36 * 37 * 38 * 39 * 40 * 41 * 42 * 43 * 44 * 45
* 46 * 47 * 48 * 49 * 50 * 51 * 52 * 53 * 54 * 55 * 56 * 57 * 58 * 59 * 60
* 61 * 62 * 63 * 64
Regex & String Builder tests completed

 ok

With other systems, directory paths may need to be added to the lines that include files e.g.
      s" filename" included
in files regex.fth and regextest.fth. This is system dependent as there is no ANS Forth standard way to specify directory paths.

The output from other systems will vary, ignore any redefinition messages (unless they cause an error). The last six lines are the important ones that prove the software works.

If a system fails with an error message during the tests, please send details (the regular expression, any output and the system used) to the contact email address below.

6. Using Regex

This section describes features relevant to Regex; it is not a primer or tutorial on regular expressions - there are many such resources available on the internet or books available for purchase. Knowledge of regular expressions is assumed.

Loading Regex

Assuming the directory paths in regex.fth are correct for your Forth system, simply type:

s" regex.fth" included

Compiling and matching regular expressions

These words are provided to compile regular expressions and to match strings against the compiled regular expression:

regex ( "ccc..." -- rgx ) compiles the following text as a regular expression until one of (?e), (?end) or the end of the line is reached. The multi-line free-format modifiers (?x) or (?x: can be used in the first line to compile a multi-line regular expression with # comments (note that this free-format mode is ignored inside a character class specification where white space and # are treated as characters in the usual way). It returns rgx which is the adddress of the compiled regular expression and which should be used as an input parameter in a subsequent match

regex$ ( caddr u -- rgx ) compiles the regular expression held as a string at (caddr u) returning the address of the compiled regular expression

parse-regex ( char "ccc<char>" -- rgx ) parse ccc in the input buffer delimited by the character char. If the delimiter is not present, it parses to the end of the line. The parsed text is treated as a regular expression and compiled to address rgx. This word enables a delimiter of choice to be used, for example a user may define:
   : parse/ [char] / parse-regex ;
to delimit a regular expression with the character /

match ( caddr1 u1 rgx -- caddr u -1 | caddr1 u1 0 ) attempts to match the text at (caddr1 u1) against the compiled regular expression rgx. If a match is found (caddr u -1) is returned where (caddr u) is the rest of the input string following the leftmost matching string, otherwise the input string and 0 (caddr1 u1 0) are returned to indicate failure to match. (Note that the returned string may have length u = 0 if, say, the match was at the end of the input string. The matching string may be obtained using get-match. The specification of match was changed in version 0.5 because experience using Regex showed that returning the rest of the string being searched is far more useful than returning the matching string.

Accessing Regex results

get-match ( -- caddr u -1 | 0 0 0 ) returns a matching string from the previous execution of match if a match was found, otherwise (0 0 0).

get-subex ( n -- caddr u ) returns the contents of the nth pair of capturing parentheses, where the count starts at 1 from the start of the regular expression. 0 get-subex is the same as get-match

(Note that substitute and head&tail were removed in version 0.2 as they have been superseded by the String Builder.)

Notes on various features

No description is provided for most of the features listed above as they behave in a standard manner - examples of usage can be found in the test program provided. This section notes a few items that may not be standard.

  1. Regex finds the longest leftmost match in the string being processed and is therefore does not have a traditional NFA engine using terms described in "Mastering Regular Expression" by Jeffrey Friedl (p 177). Therefore alternation is not ordered.
  2. Capturing parentheses. The default configuration allows up to 9 pairs of capturing parentheses in any one regular expression. This number may be varied by changing the value of constant subex-limit in file regexmatch.fth
  3. Embedded code. In (?{forth code}), the embedded forth code is a sequence of one or more Forth words. The embedded code can make free use of the data stack i.e. data present on the stack before match can be accessed and additional data added or removed from the stack (introduced in version 0.8). Restrictions on embedded Forth code are: To overcome the restrictions in the last two bullet points, put the offending sequence of Forth code words into a colon definition and call that in the embedded Forth. With such a single call Regex saves the xt without a :noname definition and executes that xt during a match.
  4. End of line anchors. $ \z \Z treat both CRLF and LF as the end of the line.
  5. Mode modifiers. Such as (?i) and friends operate until one of: the corresponding (?-i); the end of the regular expression; or until the end of the immediate enclosing parentheses.
    The default modifiers (initially all off) are used at the start of a regular expression. These defaults may be changed prior to compilation of a regular expression by the use of: save-idefault save-mdefault save-sdefault save-xdefault, all with stack effect ( flag -- ). For example:
       true save-idefault
    sets the i modifier default value to on, making following regular expressions case insensitive
  6. Named References. (Note this has been changed in version 0.4 and later versions)The name used in named capture (?<name>...) and named references \g{name} must have been previously declared using the defining word refname. e.g.
    refname alice
    Following a successful match, the contents of a named reference may be retrieved in Forth code (including embedded Forth code) by executing the name e.g. alice which will return the matched string as ( caddr u ) if matched or ( 0 0 ) otherwise.
    The name must not be used as a 2variable as was the case in previous versions. If an invalid name is used the system will report an error.
  7. The look behind feature can contain any valid regular expression items such as \w* which is of indeterminate length. Hence the NFA matching engine does not know how far back to start matching the look behind expression, hence it starts at the beginning of the subject string. With large strings this may be slow and so should be used carefully e.g
    (?<=(\d+)\s+abc\s+def.*)\1\s+abc\s+\?(?=xyz)
    is potentially slow as the look behind is applied first. On the other hand
    (\d+)\s+abc\s+\?(?=xyz)(?<=\1\s+abc\s+def.*)
    is fast as the look behind follows the match and look ahead.
  8. Conditional expressions. The syntax is: (?(?<test><then-part>|<else-part>) The back and named reference tests return true if the reference has participated in the match, otherwise false. The value of the reference is not used. Conditional expressions may be nested indefinitely.
  9. The embedded code in conditional expressions must leave true (i.e. non-zero) or false on the stack to indicate whether the then-part or else-part of the expression is to be matched.

Character set

This version of Regex has been tested with the 7 bit ASCII character set. Limited tests have been carried out with 8 bit characters and there should be no problems. The character size can be changed to 8 bit by setting the constant maxchar to 255 in file regex.fth. Do not set it larger, e.g. for 16 bit characters, as character classes will then use large amounts of memory.

7. String Builder

This has been introduced to the Regex package to simplify using the results from matching a regular expression. The String Builder has syntax in this form (alternatives use stringer and parse-stringer - see below):

            pattern-string stringer$

where the pattern-string is a string containing a sequence of:

The word stringer$ parses the pattern-string and acts on the contents to build up a string in the concatenation buffer. The typical use is to replace the matching text from a regular expression match with some other text. Note that the necessary code to read/write strings from/to files is outside the scope of the Regex package at present - these operations can clearly be written in standard Forth. A possible future extension could handle some of this in the String Builder command language.

For examples of usage refer to the test program included in the download.

String Builder words

stringer ( "ccc..." -- caddr u ) Accepts the pattern-string from the input source until terminated by the end of line or (?e) in free-layout mode. The concatenation buffer is cleared, unless the command (?-c) is first in pattern-string. The pattern-string is then parsed and acted upon according to the contents. The string built up in the concatenation buffer is returned as (caddr u) on completion, unless the command (?-g) has been used.

stringer$ ( caddr u -- caddr2 u2 ) As for stringer except that the pattern-string is already available as (caddr u).

parse-stringer ( char -- caddr u ) Parse the input until the delimiting character. The parsed string is then used as the pattern-string and stringer$ is then executed.

Related words

stringify ( -- caddr u) Accepts text from the input source in free-layout mode until an (?end) or (?e) command is encountered. White space, including new line characters, are stripped out and the remaining characters concatenated into a single string. The (caddr u) pair of the resulting string is returned to the caller. The aim of this word is to create a pattern string from a readable free-layout representation that may be used inside a colon definition; alternatively to easily create strings containing 'difficult' characters such as ". Use the word constant$ to save the string (see below).

The following points apply to stringify:

constant$ ( caddr u "spaces<name>" -- ) copies the string specified by ( caddr u ) to dataspace at caddr2, creates a constant called <name> that, when executed, places (caddr2 u) on the stack. This word is included for use with stringify.

Metacharacters and Escaped Characters

These are preceded by the \ character and append the following character or value to the concatenation buffer. The list of these is:

            \\  appends character \     \(  appends character (
            \)          character )     \$          character $
            \.          character .     \{          character {
            \}          character }     \#          character #
            \|          character |     \%          character %

            \a          value 7         \e          value 27
            \f          value 12        \n          value 10
            \r          value 13        \s          value 32
            \t          value 9         \v          value 11
            \0          value 0
            \cX         value (following character mod 32)
                        e.g. \cH appends CTRL H value 8 as does \ch and \c(
    

References

The pattern string can include various references to the results from the last regular expression match; these append the associated value to the concatenation buffer. The types of reference are:

Commands

The commands that can be included in the pattern string are:

Future extensions and changes

There is plenty of scope to add to the set of commands in the String Builder and this will probably happen as experience is gained in its use. Examples are the addition of other control structures e.g. loops but it is unclear where the most cost-effective boundary lies between doing this in the pattern string or in Forth.

The floating point options of the format command are outstanding.

For commands the String Builder has kept mostly to the regular expression syntax of Perl. This may change in future versions.

Concatenation buffer

A simple concatenation buffer is provided to build up replacement strings for text matched with regular expressions. Words to use this are:

clear-concat ( -- ) Empties the buffer.

concat ( caddr u -- ) Appends the string (caddr u) to the contents of the buffer

concat-char ( char -- ) Appends a character to the buffer.

get-concat ( -- caddr u )Returns the contents of the buffer

For an example of use, see the test program.

The size of the concatenation buffer is set by constant cb-size in file regex.fth and therefore may be changed by the user.

The concatenation buffer may be deleted with: cb delete

Further String Processing

For more complex string processing a package such as David Williams' Dynamic Strings Package can be used, where embedded code can be inserted into the pattern string to call words in the strings package.

8. Features for Future Versions

These features may be added in future versions, but not necessarily in this order:

9. Features excluded

There are no plans to include the following features:

10. Implementation Notes

The regular expression engine is based on the ideas given in the paper Regular Expression Matching Can Be Simple And Fast by Russ Cox.

The regular expression and string builder parsers were generated by Grace from BNF specifications. In all there are three separate parsers in the package.

The state transition tables were generated by LexGen. There are four tables compacted into a single table.

11. ANS Forth compliancy

Regex requires a case-insensitive ANS Forth system. The following words from the various ANS Forth word sets are used.

12. Contact

Any comments, feedback (good or bad), error reports etc will be gratefully received, email Gerry Jackson.

13. Version history

v0.8   18 November 2010

v0.7   11 August 2010

v0.6   31 July 2010

v0.5   27 July 2010

v0.4   10 June 2010

v0.3   24 May 2010

v0.2   7 May 2010

v0.1   23 April 2010

v0.0   2 April 2010

14. Some ANS Forth Software

Home