The GUIGUTS Tool

Guiguts is implemented in Perl, and so the regular expression (regex) syntax it supports is Perl's regex syntax, with a few twists. In general any tutorial or essay that describes Perl regexes is accurate for Guiguts. For links to some, see this page.

The following tables divide regex syntax into related groups.

Forms that Match Specific Characters

These forms match a single specific character.

Form	Meaning	Example	Comments
Any char except...	\ . ^ $ * + ? / ( { [ | ) ] }	Illustration:	Find "Illustration:"
\x	The character x.	\[Foot.*] \\n	Find [Foot...], not a character class. Find "\n" not newline.
\a	ASCII bell (\x07)
\e	ASCII Escape (\x1B)
\f	ASCII Formfeed (\x0C)
\n	The host system's newline code and multiline search enabled (see below)	<b>[\n\w\W]+</b>	Find bold markup that may span lines.
\r	ASCII Return (\x0D)	search: \r repl: \n	Convert Mac text document to UNIX.
\t	ASCII Tab (\x09)	[\t ]+	One or more spaces or tabs (see \s)
\0dd	Match a byte with octal value dd	\011+	Find one or more tabs.
\cx	Control-x.	\ci+	Find one or more tabs.
\xdd	Match a byte with hex value dd	\x09+	Find one or more tabs.
\x{dddd}	Match a multi-byte hex value.	\x{03A3}	Find the Unicode character Σ

When the literal \n appears anywhere in the regex, Guiguts causes Perl to treat the entire document as a single string. The anchors ^ and $ (see below) then refer to the beginning and end of the document, and the dot can match \n, so an expression like .* can match the entire document. This feature lets you search for multi-line phrases, but it requires extra care with greedy quantifiers.
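The difference between line-by-line and whole-document matching can be sketched outside Guiguts. The example below uses Python's `re` module as a stand-in, on the assumption that its `MULTILINE` and `DOTALL` flags approximate the two modes described above; Guiguts itself sets no such flags by name, and the sample text is invented.

```python
import re

doc = "first line\nsecond line\nthird line"

# Line-oriented search: ^ and $ anchor at each line, and dot stops at newline.
per_line = re.findall(r"^.*$", doc, flags=re.MULTILINE)

# Whole-document search: ^ and $ anchor at the ends of the document,
# and dot matches newline, so the greedy .* swallows everything.
whole = re.findall(r"^.*$", doc, flags=re.DOTALL)

# A nongreedy quantifier keeps a multi-line match from running away.
first_two = re.search(r"first.*?second", doc, flags=re.DOTALL).group(0)

print(per_line)
print(whole)
print(repr(first_two))
```

The nongreedy form stops at the first "second" it can reach, which is usually what a multi-line phrase search wants.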

Forms that Match Classes of Characters

These forms match, not a specific character, but any one of a class of related characters.

Form	Meaning	Example	Comments
.	Any but newline (but see \n)	^.$	Find a line with exactly one character
[xyz]	Any of x, y, or z	[Cc]hapter	Find "Chapter" or "chapter"
[a-z]	Anything in the range a to z inclusive.	[\x80-\xff]	Find a Latin-1 extended character.
[^xyz] [^a-z]	Anything but x, y, or z Anything outside the range a-z	src="[^"]+"	Find src=", a string of anything-but-quote, and a closing quote.
[[:posix:]] [[:^posix:]]	Test POSIX named class (see table below)	[[:cntrl:] ]	A control character or a space; for ASCII text, [\x00-\x20\x7F]
\p{unicode} \P{unicode}	Test Unicode named class (see table below)	\P{IsASCII}	Any non-ASCII (Latin1 or Unicode) character.
\d	Any digit ([0-9])	1\d\d\d (\d+),?	A number 1000-1999. Page number in index entry.
\D	A non-digit ([^0-9])	\d\D\d	Two digits separated by one non-digit, as in 3-4.
\s	Whitespace ([ \t\n\r\f])	\s\s+	Find a string of 2 or more spaces.
\S	Non-whitespace ([^ \t\n\r\f])	\S\s\s+\S	Words separated by 2 or more spaces.
\w	A word char ([a-zA-Z0-9_])	<\w+>	Simple HTML markup, <b>, <div> etc.
\W	Non-word ([^a-zA-Z0-9_])	\W+$	Punctuation or spaces at the end of a line.
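A few of the class forms in the table can be tried on sample text. This is a minimal sketch in Python, whose `re` module is assumed to treat these particular forms (`[^...]`, `\d`, `\s`, `\S`) the same way Perl does; the sample strings are invented for illustration.

```python
import re

html = '<img src="images/i001.jpg" alt="Frontispiece">'
text = "The year 1895 was wet.  It rained often."

# src=", a run of anything-but-quote, and the closing quote.
src = re.search(r'src="[^"]+"', html).group(0)

# A number from 1000 to 1999.
year = re.search(r"1\d\d\d", text).group(0)

# Two non-space characters separated by 2 or more whitespace characters.
gap = re.search(r"\S\s\s+\S", text).group(0)

print(src, year, repr(gap))
```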

Named Character Classes

POSIX (a family of operating-system standards) and Unicode (a character-encoding standard) both define names for character classes. POSIX names are delimited this way: [:alnum:] and must appear inside class brackets, thus: [[:alpha:][:digit:]] is the class of alphabetics plus digits. To negate a POSIX class, insert a caret after the first colon: [[:^alpha:]] is the class of non-alphabetics.

Unicode names are used so: \p{IsAlnum}, and negated by use of a capital P, so \P{IsAlpha} is the class of non-alphabetics.

The important feature of these named classes is that they are Unicode-aware, so for example \p{IsLower} includes all lowercase letters from all languages, and \p{IsPunct} matches punctuation in Greek and Farsi as well as Latin-1.

POSIX	Unicode	Meaning
alnum	IsAlnum	Alphanumeric
alpha	IsAlpha	Alphabetic
ascii	IsASCII	Any ASCII
blank	IsSpace	Space or tab
cntrl	IsCntrl	Control characters
digit	IsDigit	Digits
graph	IsGraph	Alphanumeric and punctuation
lower	IsLower	Lowercase letters
print	IsPrint	Alphanumeric, punctuation, space
punct	IsPunct	Punctuation
space	IsSpace	Whitespace ([\s\ck], i.e. \s plus ASCII vertical tab)
	IsSpacePerl	Perl's whitespace: \s
upper	IsUpper	Uppercase letters
word	IsWord	IsAlnum plus underscore
xdigit	IsXDigit	Hexadecimal ([a-fA-F0-9])

Forms that Quantify

These forms always follow an expression and modify how many times the expression should repeat.

Form	Meaning	Example	Comments
* *?	Zero or more. Zero or more (nongreedy).	^.*$	Find a line of any length, from 0 characters up.
+ +?	One or more. One or more (nongreedy).	(very,\s)+ very	Find "very, very,... very good."
? ??	Zero or one. Preferably, zero.	(very, )?very	"very, very" and "very"
{n}	Exactly n of them.	(very, ){2}	Exactly "very, very, "
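{n,}	At least n times but as many as possible.	\d{5,}	A number of five or more digits.
{n,}?	At least n, more if necessary, but as few as possible.	<.{3,}?>	Text in angle brackets, at least 3 characters, ending at the first possible >.
{n,m}	At least n times but not more than m times.	\b\d{1,3}\b	A whole number of one to three digits.
{n,m}?	At least n, not more than m, and as few as possible.	"(.{1,20}?)"	A quoted string of 1 to 20 characters, ending at the first possible closing quote.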
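The counted quantifiers can be checked against a sample string. The sketch below uses Python's `re` module, assuming it handles `+`, `{n}`, `{n,}` and `{n,m}` as Perl does; the sample text is made up.

```python
import re

text = "very, very, very good; pages 7, 42, 1234 and 98765"

# One or more repeats of "very, " followed by a final "very".
run = re.search(r"(very,\s)+very", text).group(0)

# Exactly two repeats of "very, ".
twice = re.search(r"(very, ){2}", text).group(0)

# Whole numbers of one to three digits.
short_numbers = re.findall(r"\b\d{1,3}\b", text)

# Numbers of four or more digits.
long_numbers = re.findall(r"\d{4,}", text)

print(run, repr(twice), short_numbers, long_numbers)
```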

Greediness

The basic quantifiers are "greedy"; that is, they always match as many repeats as they can, while still allowing the entire regex to match. Greediness can lead to unexpected results. For example, this search for italic markup:
<i>.+</i>
when applied to a line like this:
<i>What?</i> What do you <i>mean?</i>
will match the entire line: the greedy .+ matches as much as it can while still allowing the entire regex to match, which means it matches everything through the second "?".

To tame the greed of a quantifier, append a question-mark to it. Then it matches as few repetitions as it can, while still allowing the regex to match. The test
<i>.+?</i>
when applied to:
<i>What?</i> What do you <i>mean?</i>
matches only the first marked-up word.
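The greedy and nongreedy behavior can be reproduced with a short sketch. It uses Python's `re` module, which is assumed to agree with Perl on `+` and `+?`, and it assumes the sample line carries `<i>` italic markup around "What?" and "mean?".

```python
import re

line = "<i>What?</i> What do you <i>mean?</i>"

# Greedy: .+ runs to the last </i> on the line.
greedy = re.search(r"<i>.+</i>", line).group(0)

# Nongreedy: .+? stops at the first </i> it can reach.
lazy = re.search(r"<i>.+?</i>", line).group(0)

# With the nongreedy form, each marked-up phrase is found separately.
phrases = re.findall(r"<i>(.+?)</i>", line)

print(greedy)
print(lazy)
print(phrases)
```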

Forms that Match Positions

These forms match particular positions such as "end of line," or particular transitions such as "start of word." They have zero width, but serve to "anchor" the rest of the expression to a fixed spot.

Form	Meaning	Example	Comments
^	Beginning of line (but see \n)	^\s\s+	Find a line that begins with two or more spaces.
$	End of line (but see \n)	\s$	Find a line ending in a space.
\b	Word boundary: between \w\W or between \W\w.	\b\w+ful\b	Whole words ending in -ful (soulful) but not in -fully.
\B	Non-word boundary: between \w\w or \W\W.	\Bful\B	"ful" inside a word (carefully) but not at its start or end.
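The anchors can be tried line by line. In this sketch Python's `re` module stands in for Perl (an assumption; the two agree on `^`, `$`, `\b` and `\B` for plain text), and the text is split into lines first to mimic a line-oriented search. The sample text is invented.

```python
import re

text = "A soulful tune,  \nplayed carefully \n  and fully heard"
lines = text.split("\n")

# Lines that begin with two or more spaces.
indented = [ln for ln in lines if re.search(r"^\s\s+", ln)]

# Lines that end in a space.
trailing = [ln for ln in lines if re.search(r"\s$", ln)]

# Whole words ending in -ful, but not -fully.
ful_words = re.findall(r"\b\w+ful\b", text)

# Words with "ful" strictly inside them.
inner_ful = re.findall(r"\w*\Bful\B\w*", text)

print(indented, trailing, ful_words, inner_ful)
```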

Forms that Control Matching

These forms control how the search is carried out, including grouping and alternates.

Form	Meaning	Example	Comments
(regex)	Group a regex for reference or as an alternate	src="([^"]+)"	Find src="something" and group the something for reference.
\n	The text matched by the nth set of parens from the left, where n is a digit (\1, \2, ...); not the newline code.	(\b\w+\b)\s+\1	Find a duplicated word word.
|	Delimits alternate choices	house(cat|keeper) (19|20)\d\d \cM\cJ|\cM|\cJ	Find housecat or housekeeper. Find a year in either century. Find CR-LF, or CR, or LF.
(?:regex)	Group and do not count for reference.	(?:src|href)="(.+?)"	Find either src="x" or href="x"; $1 is x.
(?=regex)	"Lookahead": test but do not match	ful(?=\p{IsPunct})	Find suffix ful followed by punctuation; punct. is not part of match.
(?!regex)	Negative Lookahead	\[Sidenote(?!:)	Find sidenote markup missing colon.
(?<=regex)	"Lookbehind": test but do not match.	(?<="#Page_)(\d+)"	Find reference to page anchor; $1 is page #.
(?<!regex)	Negative Lookbehind.	(?<!#)Page_	Find page anchor not used as target.
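Grouping, backreferences and the lookaround forms can be exercised together on one sample string. The sketch below is in Python, whose `re` module is assumed to match Perl for these constructs; the sample text and variable names are hypothetical.

```python
import re

text = ('See <a href="#Page_12">page 12</a> of the the book. '
        "[Sidenote Lost] [Sidenote: Kept]")

# Backreference: a word followed by whitespace and the same word again.
duplicate = re.search(r"(\b\w+\b)\s+\1", text).group(0)

# Alternation inside a non-capturing group; the only numbered group is the value.
link = re.search(r'(?:src|href)="(.+?)"', text).group(1)

# Lookbehind: digits preceded by "#Page_ ; the prefix is not part of the match.
pages = re.findall(r'(?<="#Page_)(\d+)"', text)

# Negative lookahead: sidenote markup that is not followed by a colon.
bad_sidenotes = re.findall(r"\[Sidenote(?!:)", text)

# Plain alternation between two suffixes.
houses = re.findall(r"house(?:cat|keeper)", "housecat, housekeeper, household")

print(duplicate, link, pages, bad_sidenotes, houses)
```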

Replacement Patterns

These forms are used in the replacement text. They control what text will replace the text that is found. Their use is discussed on this page.

Form	Meaning	Example	Comments
Anything except \	Literal replacement	Find: <I> Repl:<i>	Replace one string with another.
$n	The text matched by the nth set of parens from the left.	((19|20)\d\d)	$1 is the whole matched year; $2 the first two digits, 19 or 20.
\L...\E	Lowercase	Find: <(I|B)> Repl:<\L$1\E>	Lowercase bold or italic markup.
\U...\E	Uppercase	Find: (Chapter) Repl: \U$1\E	Change Chapter to CHAPTER.
\T...\E	Title-case		See discussion.
\A...\E	Anchor		See discussion.
\C...\E	Replace with Perl expression		See discussion.
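The effect of `$n`, `\L...\E` and `\U...\E` can be approximated outside Guiguts. This is only a sketch in Python, which differs from the Guiguts replacement syntax: Python writes a group reference as `\1` rather than `$1`, and it has no `\L` or `\U`, so a small function does the case change. The sample strings are invented.

```python
import re

# Perl-style $1 and $2 are written \1 and \2 in a Python replacement string.
years = re.sub(r"((19|20)\d\d)", r"[\1/\2]", "Printed in 1905.")

# Stand-in for Repl: <\L$1\E>, which lowercases the captured tag name.
lowered = re.sub(r"<(I|B)>", lambda m: "<" + m.group(1).lower() + ">", "<I>word<B>")

# Stand-in for Repl: \U$1\E, which uppercases the captured word.
shouted = re.sub(r"(Chapter)", lambda m: m.group(1).upper(), "Chapter 1")

print(years)
print(lowered)
print(shouted)
```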

GUIGUTS

Regular Expression Reference Card