http://cpan.uwinnipeg.ca/htdocs/perl/perlreref.html
http://cpan.uwinnipeg.ca/htdocs/perl/perlretut.html
http://cpan.uwinnipeg.ca/htdocs/perl/perlunicode.html

Guiguts is implemented in Perl, so the regular expression (regex) syntax it supports is Perl's regex syntax, with a few twists. In general, any tutorial or essay that describes Perl regexes is accurate for Guiguts. For links to some, see this page.

The following tables divide regex syntax into related groups. Scroll, or click in the navigation bar, to go to one of the tables.

Forms that Match Specific Characters

These forms match a single specific character.

Form | Meaning | Example | Comments
Any character except \ . ^ $ * + ? / ( ) [ ] { } and the vertical bar | That character itself | Illustration: | Find "Illustration:"
\x | The literal character x | \[Foot.*] | Find [Foot...], not a character class.
\x | The literal character x | \\n | Find "\n", not a newline.
\a | ASCII bell (\x07) | |
\e | ASCII Escape (\x1B) | |
\f | ASCII Formfeed (\x0C) | |
\n | The host system's newline code; also enables multiline search (see below) | <b>[\n\w\W]+</b> | Find bold markup that may span lines.
\r | ASCII Return (\x0D) | Search: \r Repl: \n | Convert Mac text document to UNIX.
\t | ASCII Tab (\x09) | [\t ]+ | One or more spaces or tabs (see \s)
\0dd | Match a byte with octal value dd | \011+ | Find one or more tabs.
\cx | Control-x | \ci+ | Find one or more tabs.
\xdd | Match a byte with hex value dd | \x09+ | Find one or more tabs.
\x{dddd} | Match a multi-byte character with hex value dddd | \x{03A3} | Find the Unicode character Σ

When the literal \n appears anywhere in the regex, Guiguts causes Perl to treat the entire document as a single string. The anchors ^ and $ (see below) then refer to the beginning and end of the document. The dot can match \n, so an expression like .* can match the entire document. This feature lets you search for multi-line phrases but requires extra care in using greedy quantifiers.
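The single-string behavior can be sketched outside Guiguts. The example below uses Python's re module as a stand-in for Perl (an assumption; Guiguts itself runs Perl), with the DOTALL flag approximating the mode Guiguts enters when \n appears in the regex. The sample document is invented for illustration.

```python
import re

# Hypothetical document held as one string, as Guiguts does when the
# regex contains \n.
doc = "<b>bold text\nspans lines</b> and <b>more</b>\n"

# With DOTALL the dot also matches newline, so a greedy .* runs on to
# the last </b> in the whole string.
greedy = re.search(r"<b>.*</b>", doc, re.DOTALL).group(0)

# A nongreedy .*? stops at the first </b> it can reach.
lazy = re.search(r"<b>.*?</b>", doc, re.DOTALL).group(0)

# Without the MULTILINE flag, ^ matches only at the start of the string.
starts = [m.start() for m in re.finditer(r"^", doc)]

print(repr(greedy))
print(repr(lazy))
print(starts)
```

The greedy search swallows both bold spans, which is the kind of surprise the warning about greedy quantifiers refers to.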

Forms that Match Classes of Characters

These forms match, not a specific character, but any one of a class of related characters.

Form | Meaning | Example | Comments
. | Any character but newline (but see \n) | ^.$ | Find a line with exactly one character.
[xyz] | Any of x, y, or z | [Cc]hapter | Find "Chapter" or "chapter".
[a-z] | Anything in the range a to z inclusive | [\x80-\xff] | Find a Latin-1 extended character.
[^xyz] | Anything but x, y, or z | src="[^"]+" | Find src=", a string of anything-but-quote, and a closing quote.
[^a-z] | Anything outside the range a-z | |
[[:posix:]] | Test POSIX named class (see table below) | [[:cntrl:] ] | A control character or a space; roughly [\x00-\x20].
[[:^posix:]] | Test negated POSIX named class | |
\p{unicode} | Test Unicode named class (see table below) | |
\P{unicode} | Test negated Unicode named class | \P{IsASCII} | Any non-ASCII (Latin-1 or Unicode) character.
\d | Any digit ([0-9]) | 1\d\d\d | A number 1000-1999.
\d | Any digit ([0-9]) | (\d+),? | Page number in index entry.
\D | A non-digit ([^0-9]) | (need example!) |
\s | Whitespace ([ \t\n\r\f]) | \s\s+ | Find a string of 2 or more spaces.
\S | Non-whitespace ([^ \t\n\r\f]) | \S\s\s+\S | Words separated by 2 or more spaces.
\w | A word char ([a-zA-Z0-9_]) | <\w+> | Simple HTML markup, <b>, <div> etc.
\W | Non-word ([^a-zA-Z0-9_]) | (need example!) |
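A few of the class forms above can be tried on a sample line. This is a minimal sketch using Python's re module as a stand-in for Perl (an assumption; the forms shown behave the same way in both for ASCII text). The sample line and variable names are hypothetical.

```python
import re

# Hypothetical line mixing markup, numbers, and a double space.
line = 'See <img src="images/fig1.png"> on page 1066,  or pages 12, 14.'

# [^"]+ : a run of anything except a double quote.
src = re.search(r'src="([^"]+)"', line).group(1)

# 1\d\d\d : a "1" followed by exactly three digits.
year = re.search(r"1\d\d\d", line).group(0)

# \s\s+ : two or more consecutive whitespace characters.
gap = re.search(r"\s\s+", line).group(0)

# <\w+> : simple markup made only of word characters, so <img ...> with
# attributes does not qualify.
simple_tags = re.findall(r"<\w+>", "<b>bold</b> and <div>")

print(src, year, repr(gap), simple_tags)
```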

Named Character Classes

POSIX (a standards body) and Unicode (another) have defined names for character classes. POSIX names are delimited this way: [:alnum:] and must appear inside class brackets, thus: [[:alpha:][:digit:]] is the class of alphabetics plus digits. To negate a POSIX class, insert a caret after the first colon: [[:^alpha:]] is the class of non-alphabetics.

Unicode names are used so: \p{IsAlnum}, and negated by use of a capital P, so \P{IsAlpha} is the class of non-alphabetics.

The important feature of these named classes is that they are Unicode-aware, so for example \p{IsLower} includes all lowercase letters from all languages, and \p{IsPunct} matches punctuation in Greek and Farsi as well as Latin-1.

POSIX | Unicode | Meaning
alnum | IsAlnum | Alphanumeric
alpha | IsAlpha | Alphabetic
ascii | IsASCII | Any ASCII
blank | IsSpace | Space or tab
cntrl | IsCntrl | Control characters
digit | IsDigit | Digits
graph | IsGraph | Alphanumeric and punctuation
lower | IsLower | Lowercase letters
print | IsPrint | Alphanumeric, punctuation, space
punct | IsPunct | Punctuation
space | IsSpace | Whitespace ([\s\ck], i.e. \s plus ASCII vertical tab)
(none) | IsSpacePerl | Perl's whitespace: \s
upper | IsUpper | Uppercase letters
word | IsWord | IsAlnum plus underscore
xdigit | IsXDigit | Hexadecimal ([a-fA-F0-9])

Forms that Quantify

These forms always follow an expression and modify how many times the expression should repeat.

Form | Meaning | Example | Comments
* | Zero or more. | ^.*$ | Find a line of any length, from 0 characters up.
*? | Zero or more (nongreedy). | |
+ | One or more. | (very,\s)+very | Find "very, very,... very" in "very, very,... very good."
+? | One or more (nongreedy). | |
? | Zero or one. | (very, )?very | "very, very" and "very"
?? | Zero or one (nongreedy: preferably zero). | |
{n} | Exactly n of them. | (very, ){2} | Exactly "very, very, "
{n,} | At least n times but as many as possible. | (example?) |
{n,}? | At least n, more if necessary, but as few as possible. | (example?) |
{n,m} | At least n times but not more than m times. | (example?) |
{n,m}? | At least n, not more than m, and as few as possible. | (example?) |
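The counted quantifiers can be illustrated with a short sketch. It uses Python's re module as a stand-in for Perl (an assumption; the {n}, {n,}, and {n,m} forms are written the same way in both). The sample strings are invented.

```python
import re

text = "very, very, very good"

# {2} : exactly two repeats of the group.
exactly_two = re.match(r"(?:very, ){2}", text).group(0)

# {1,} : at least one repeat, as many as possible (greedy).
greedy = re.match(r"(?:very,\s){1,}", text).group(0)

# {1,}? : at least one repeat, but as few as possible (nongreedy).
lazy = re.match(r"(?:very,\s){1,}?", text).group(0)

# {2,4} : between two and four digits; \b keeps longer or shorter
# numbers from matching.
numbers = re.findall(r"\b\d{2,4}\b", "7 42 1999 123456")

print(repr(exactly_two), repr(greedy), repr(lazy), numbers)
```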

Greediness

The basic quantifiers are "greedy"; that is, they always match as many repeats as they can, while still allowing the entire regex to match. Greediness can lead to unexpected results. For example, this search for italic markup:
<i>.+</i>
when applied to a line like this:
<i>What?</i> What do you <i>mean?</i>
will match the entire line: the greedy .+ matches as much as it can while still allowing the entire regex to match, which means it matches everything through the second "?".

To tame the greed of a quantifier, append a question mark to it. Then it matches as few repetitions as it can, while still allowing the regex to match. The test
<i>.+?</i>
when applied to:
<i>What?</i> What do you <i>mean?</i>
matches only the first marked-up word.
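The two searches above can be reproduced in a few lines. This sketch uses Python's re module as a stand-in for Perl (an assumption; greedy and nongreedy quantifiers work the same way in both).

```python
import re

line = "<i>What?</i> What do you <i>mean?</i>"

# Greedy .+ takes as much as it can while still letting </i> match.
greedy = re.search(r"<i>.+</i>", line).group(0)

# Nongreedy .+? takes as little as it can.
lazy = re.search(r"<i>.+?</i>", line).group(0)

# Repeating the nongreedy search finds each marked-up phrase separately.
each = re.findall(r"<i>.+?</i>", line)

print(greedy)
print(lazy)
print(each)
```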

Forms that Match Positions

These forms match particular positions such as "end of line," or particular transitions such as "start of word." They have zero width, but serve to "anchor" the rest of the expression to a fixed spot.

Form | Meaning | Example | Comments
^ | Beginning of line (but see \n) | ^\s\s+ | Find a line that begins with two or more spaces.
$ | End of line (but see \n) | \s$ | Find a line ending in a space.
\b | Word boundary: between \w\W or between \W\w. | \b\w+ful\b | Whole words ending in -ful (soulful) but not in -fully.
\B | Non-word boundary: between \w\w or \W\W. | (need example!) |
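The position forms can be tried on a small sample. This sketch uses Python's re module as a stand-in for Perl (an assumption), and splits the text into lines first to mimic a line-at-a-time search, which is also an assumption about how the ordinary Guiguts search treats ^ and $. The sample text is invented.

```python
import re

text = "  A soulful tune,\nplayed carefully \nand joyful"
lines = text.split("\n")

# ^\s\s+ : lines that begin with two or more whitespace characters.
indented = [ln for ln in lines if re.search(r"^\s\s+", ln)]

# \s$ : lines that end in whitespace.
trailing = [ln for ln in lines if re.search(r"\s$", ln)]

# \b...\b : whole words ending in -ful; "carefully" is excluded because
# no word boundary follows its "ful".
whole = re.findall(r"\b\w+ful\b", text)

# \B...\B : "ful" with word characters on both sides, as in "carefully".
inner = re.findall(r"\Bful\B", text)

print(indented, trailing, whole, inner)
```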

Forms that Control Matching

These forms control how the search is carried out, including grouping and alternates.

Form | Meaning | Example | Comments
(regex) | Group a regex for reference or as an alternate | src="([^"]+)" | Find src="something" and group the something for reference.
\n | The text matched by the nth set of parens from the left (n is a digit: \1, \2, ...). | (\b\w+\b)\s+\1 | Find a duplicated word word.
regex|regex | The vertical bar delimits alternate choices | house(cat|keeper) | Find housecat or housekeeper.
regex|regex | The vertical bar delimits alternate choices | (19|20)\d\d | Find a year in either century.
regex|regex | The vertical bar delimits alternate choices | \cM\cJ|\cM|\cJ | Find CR-LF, or CR, or LF.
(?:regex) | Group and do not count for reference. | (?:src|href)="(.+?)" | Find either src="x" or href="x"; $1 is x.
(?=regex) | "Lookahead": test but do not include in the match | ful(?=\p{IsPunct}) | Find suffix ful followed by punctuation; punct. is not part of match.
(?!regex) | Negative lookahead | \[Sidenote(?!:) | Find sidenote markup missing colon.
(?<=regex) | "Lookbehind": test but do not include in the match | (?<="#Page_)(\d+)" | Find reference to page anchor; $1 is page #.
(?<!regex) | Negative lookbehind | (?<!#)Page_ | Find page anchor not used as target.
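The grouping and lookaround forms can be checked with a short sketch. It uses Python's re module as a stand-in for Perl (an assumption). Python's re does not support \p{IsPunct}, so a hand-written bracket class replaces it here, and the sample strings are invented.

```python
import re

# Backreference: \1 repeats whatever group 1 matched.
dup = re.search(r"(\b\w+\b)\s+\1", "Find a duplicated word word.").group(0)

# Non-capturing group: (?:...) is not numbered, so group 1 is the URL.
link = re.search(r'(?:src|href)="(.+?)"', '<a href="ch1.html">').group(1)

# Lookahead: "ful" only when punctuation follows; the punctuation itself
# stays outside the match.
sentence = "A careful reader is joyful."
ahead = re.search(r"ful(?=[.,;:!?])", sentence)

# Negative lookahead: sidenote markup that is NOT followed by a colon.
notes = "[Sidenote: ok] [Sidenote missing colon]"
bad = re.search(r"\[Sidenote(?!:)", notes)

# Lookbehind: digits preceded by "#Page_ ; group 1 is the page number.
page = re.search(r'(?<="#Page_)(\d+)"', '<a href="#Page_42">').group(1)

# Negative lookbehind: Page_ not preceded by #, i.e. the anchor itself.
html = '<a id="Page_42"></a> <a href="#Page_42">'
anchors = [m.start() for m in re.finditer(r"(?<!#)Page_", html)]

print(dup, link, ahead.group(0), notes[bad.start():], page, anchors)
```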

Replacement Patterns

These forms are used in the replacement text. They control what text will replace the text that is found. Their use is discussed on this page.

Form | Meaning | Example | Comments
Anything except \ | Literal replacement | Find: <I> Repl: <i> | Replace one string with another.
$n | The text matched by the nth set of parens from the left. | ((19|20)\d\d) | $1 is the whole matched year; $2 the first two digits, 19 or 20.
\L...\E | Lowercase | Find: <(I|B)> Repl: <\L$1\E> | Lowercase bold or italic markup.
\U...\E | Uppercase | Find: (Chapter) Repl: \U$1\E | Change Chapter to CHAPTER.
\T...\E | Title-case | | See discussion.
\A...\E | Anchor | | See discussion.
\C...\E | Replace with Perl expression | | See discussion.
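The effect of $n, \L...\E, and \U...\E can be approximated outside Guiguts. This sketch uses Python's re module as a stand-in (an assumption). Python writes group references as \1 rather than $1 and has no \L or \U escapes, so a small replacement function does the case change instead. The \T, \A, and \C forms are Guiguts-specific and are not imitated here.

```python
import re

# $1 in a Guiguts replacement corresponds to \1 in a Python one.
# Group 1 is the whole year; group 2 is the century digits.
years = re.sub(r"((19|20)\d\d)", r"[\1]", "In 1984 and 2001")

# Equivalent of Find: <(I|B)>  Repl: <\L$1\E>
lowered = re.sub(r"<(I|B)>",
                 lambda m: "<" + m.group(1).lower() + ">",
                 "<I>word</I> <B>x</B>")

# Equivalent of Find: (Chapter)  Repl: \U$1\E
upper = re.sub(r"(Chapter)", lambda m: m.group(1).upper(), "Chapter 1")

print(years)
print(lowered)
print(upper)
```

Note that the lowercase example changes only the opening tags, because the find pattern does not match </I> or </B>.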