The GUIGUTS Tool

The Search Dialog

Use control-f or Search>Search & Replace to open the search dialog:

You can resize this dialog. Make it wider if you need to search for very long phrases. The search dialog remains open until you close it.

Basic Searching

To find a certain text,

type or paste the text into the Search Text box.
to find Foot as well as foot, set Case Insensitive on.
to find both foot and footnote, set Whole Word off.
to avoid having punctuation like [Foot misinterpreted, set Regex off (regular expressions are discussed below).
to search from the bottom of the document up, set Reverse.

Click Search. Guiguts searches for the text. When it is found, Guiguts scrolls the document to display the found text, sets the insertion point before its first character, and highlights the text in orange. The orange highlight shows what text would be replaced; it does not mean the text is selected for editing. To cut, copy or replace the found text, you must drag over it to select it.
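
The effect of the Case Insensitive, Whole Word and Regex switches can be approximated outside Guiguts. The sketch below uses Python's re module as a stand-in (Guiguts itself is a Perl program, and the helper name build_pattern is hypothetical); it only illustrates how each switch changes what is matched.

```python
import re

def build_pattern(text, case_insensitive=False, whole_word=False, regex=False):
    """Hypothetical helper: compile a pattern that mimics the dialog switches."""
    # With Regex off, punctuation such as "[" is taken literally.
    body = text if regex else re.escape(text)
    # With Whole Word on, require a word boundary on each side.
    if whole_word:
        body = r"\b" + body + r"\b"
    flags = re.IGNORECASE if case_insensitive else 0
    return re.compile(body, flags)

sample = "A footnote at the Foot of the page [Footnote 1]"

whole = build_pattern("foot", case_insensitive=True, whole_word=True)
loose = build_pattern("foot", case_insensitive=True)
literal = build_pattern("[Foot")  # Regex off: "[" does not open a character class

whole_hits = [m.group() for m in whole.finditer(sample)]
loose_hits = [m.group() for m in loose.finditer(sample)]
literal_hits = [m.group() for m in literal.finditer(sample)]
```

With Whole Word on, only the standalone Foot is found; with it off, the foot inside footnote and Footnote is found as well.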

Each time you edit the Search Text field you begin a new search, and when you click Search, Guiguts begins the search from the top of the document (from the bottom if Reverse is set). When you click Search again without editing the search text, Guiguts continues searching from the current insertion point—from the last-found text if you have not moved the insertion point.

Change the setting of Reverse to continue a search backward in the direction from which it came. Set Start at Beginning to restart a search at the top of the document (or at the bottom if Reverse is set) without editing the search text.

When the text is not found, Guiguts sounds the bell and scrolls to the top of the document. Clicking Search then starts a new search with the same search text.

Replacing

To replace the found text, enter the new text in the Replacement Text field. Click Search to find the first or next target. To test a replacement, click Replace and observe the results; if they are not satisfactory, use Undo. To replace some targets and not others, click Search until you reach a target that needs replacement. Then click Replace & Search.

To perform a global replacement, click Replace All. Guiguts repeats the search and replace operation starting from the top of the document (or from the bottom if Reverse is set) and continuing to the other end of the document.

Each individual replacement is an action that can be undone. If Replace All makes 50 changes, you must apply Undo 50 times to undo them all.

Search and Selections

If a selection is active before you open the Search dialog, up to one line of the selection is copied into the Search Text field for you. Thus if you want to look for a particular word or phrase from the text, just select it and key control-f; the search is ready to use. (This clears the selection.)

If you make a selection after entering the search text, when you click Search the search is confined to that selection. Instead of starting at the top or bottom of the document, it starts at the top or bottom of the selection. If a selection is active when you click Replace All, the replacements are confined to the selection.

Caution: when a search within a selection succeeds, the found text is highlighted and the selection is cleared. If you want to continue searching or replacing within that selection, you must reestablish the selection first.

Hot Keys

The following hot-keys are available when the Search dialog is open:

Keyboard focus in document window
control-f	Search (focus moves to search dialog)
control-g	Search again (focus remains in document window)
Keyboard focus in search dialog
Enter	Search
shift-Enter	Replace
control-Enter	Replace & Search
shift-control-Enter	Replace All

Word Counts

If you are searching for a whole word (Whole Word is checked), and if you have run the Word Frequency routine since loading the document, then when you click Search, the count of matching words is displayed beside the search text field. The count of whole words matching the Replacement Text is also shown.

These counts are taken from the Word Frequency report and so reflect the document when the Word Frequency routine was run. The counts are case-insensitive.
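
Why the displayed count can be out of date is easy to see in a small model. The following Python sketch is an illustration only: the function word_frequency and its definition of a word are assumptions, not the Guiguts routine.

```python
import re
from collections import Counter

def word_frequency(text):
    """Count words case-insensitively (a stand-in for the Word Frequency report)."""
    return Counter(w.lower() for w in re.findall(r"[A-Za-z']+", text))

document = "The foot of the hill. Foot and FOOT, but not footnote."
report = word_frequency(document)   # snapshot taken when the report is run

document += " One more foot."       # a later edit makes the snapshot stale
count_shown = report["foot"]        # still the old, case-insensitive count
fresh_count = word_frequency(document)["foot"]
```

Here count_shown counts foot, Foot and FOOT together but misses the word added after the report was made; rerunning the report brings the count up to date.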

Using Predefined Searches

The Search menu offers predefined searches that speed common post-proofing tasks.

Stepping Through Blocks

The Search menu has ten choices for stepping through the markup blocks of the document: a Next and a Previous for each of five kinds of blocks. Use of these is discussed here.

Orphaned Markup

Use Search>Find Orphaned Brackets & Markup to open a small palette with choices of every type of balanced markup. For each of these nine choices, the presence of one marker without its balancing marker is probably an error.

Click a type of markup, for example /* */, and click Search. Guiguts scans the document for all opening and closing markups of this type (a process that can take some time, for a common markup in a large document). It finds the first instance of an opening mark that is missing its close, or a closing mark missing its opening. The unbalanced markup is highlighted with search-orange. Click Next to find another of the same type.
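
The idea behind this scan can be sketched in a few lines of Python. This is not the Guiguts implementation; it assumes that each marker sits alone on its own line and that blocks of one type do not nest, and the function name find_orphan is hypothetical.

```python
def find_orphan(lines, open_mark="/*", close_mark="*/"):
    """Return the 1-based line number of the first unbalanced marker, or None.

    Assumes markers occupy their own lines and do not nest.
    """
    open_line = None  # line number of an opener still waiting for its close
    for number, line in enumerate(lines, start=1):
        mark = line.strip()
        if mark == open_mark:
            if open_line is not None:
                return open_line      # the earlier opener was never closed
            open_line = number
        elif mark == close_mark:
            if open_line is None:
                return number         # a closer with no opener before it
            open_line = None
    return open_line                  # a dangling opener at the end, if any

extra_close = ["/*", "a poem", "*/", "plain text", "*/"]
missing_close = ["/*", "first block", "/*", "second block", "*/"]
balanced = ["/*", "a poem", "*/"]
```

Calling find_orphan on extra_close reports the stray closing marker, on missing_close the opener that was never closed, and on balanced nothing at all.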

Highlighting Quotes

The Search menu contains three commands that help you locate unbalanced quotes. (Guiguts cannot find unbalanced quotes automatically, as it can find unbalanced parens, because there is no simple way to tell an open-quote from a close-quote, or a single-quote from an apostrophe.)

These commands operate on a selection. Select a paragraph or a passage in which you have confused or unbalanced quotes. Choose Search>Highlight double quotes in selection. The double quotes in the passage are revealed in search-orange.

You can highlight single quotes (apostrophes) with the next menu item. The final menu item clears all search-highlights.

Using Automatic Word Highlighting

Guiguts can highlight many words of interest at one time. Right-click in the status bar. A normal file-open dialog appears; use it to find the file containing a list of words to highlight. A sample file is wordlist/en-common.txt in the Guiguts directory.

After a brief delay, all words listed in the file are highlighted, wherever they appear in the document. Page through the document and each word of interest will stand out for you to inspect. Left-click to turn highlighting off and on.

You can make your own file of words to highlight. The file format is simply text with one word per line. Words may not contain any punctuation except the apostrophe. Words may use any Unicode character below ordinal FE00. The highlighting is case-sensitive; if a word can appear with and without an initial cap, include both versions in the list.
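
The case-sensitive matching described above can be modeled as follows. This Python sketch is only an illustration; the function name highlight_spans and the rule for what counts as a word (letters joined by apostrophes) are assumptions.

```python
import re

def highlight_spans(text, words):
    """Return (start, end) spans of listed words; matching is case-sensitive."""
    wanted = set(words)
    # Assumed definition of a word: runs of letters, optionally joined by apostrophes.
    token = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")
    return [m.span() for m in token.finditer(text) if m.group() in wanted]

word_list = ["arid", "Arid", "he"]   # one word per line in the word file
text = "Arid plains; the arid land he saw. The end."
found = [text[a:b] for a, b in highlight_spans(text, word_list)]
```

Both Arid and arid are found only because both spellings are in the list; The is not highlighted, and neither is the he inside the.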

Using Scanno Searches

Scanno searching is automated searching for common OCR errors. Use Search>Stealth Scannos or click in the toolbar to start the process. Guiguts presents a standard file-open dialog headed "Scannos list?" Use this dialog to navigate to one of the three files distributed with Guiguts, which are:

en-common.rc	Several dozen scannos often found in English text, such as "arid" for "and."
misspelled.rc	A file of about 3,400 literal scan errors that have been seen in DP projects.
regex.rc	A file with a few dozen sophisticated regular expressions designed to find common errors.

Select the file to use and click Open. Guiguts opens the Search dialog with additional controls visible:

The first scanno from the file is put in the search text and Guiguts searches for it. Examine the highlighted word or phrase to see if it is an OCR error. Correct it if necessary. Some of the scanno files set replacement text that will correct the error automatically.

Click Search to find the next instance of this scanno. Continue clicking Search until Guiguts can find no more of that scanno and scrolls to the top of the document. If you click too quickly past a likely error, set Reverse to back up. If the search is too inclusive (for example, the search for "ail" finds many words that include those letters) you can click Whole Word to restrict the search.

Click Next Stealtho to begin a search for the next item from the file. If you click Next Stealtho in error, use Prev Stealtho to return to a previous item. Set the Auto Advance button to speed processing of a large file. Then Guiguts tests each scanno in sequence and does not stop until it finds one that actually appears in your document.
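
The behavior of Auto Advance amounts to skipping list entries until one occurs in the text. A minimal Python sketch of that loop, with a hypothetical helper name and plain substring matching standing in for the real search:

```python
def next_present_scanno(scannos, document, start=0):
    """Return the index of the first scanno at or after `start` that occurs
    in the document, or None when the list is exhausted (hypothetical helper)."""
    for index in range(start, len(scannos)):
        if scannos[index] in document:
            return index
    return None

scannos = ["arid", "tbe", "ail", "modem"]
document = "He sat in tbe garden until the modem day ended."

first = next_present_scanno(scannos, document)              # skips "arid"
second = next_present_scanno(scannos, document, first + 1)  # skips "ail"
third = next_present_scanno(scannos, document, second + 1)  # list exhausted
```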

Note: The Word Frequency window offers a different way to search for these same scannos which might be more useful for files such as misspelled.rc with many entries.

Scanno Hints

The scannos in some files have explanatory hints. Click the Hint button to see the explanation of the current scanno, if it has one:

You may if you wish edit existing hints or add hints to scannos that do not have them. Click the Edit button to open a hint-editing dialog:

Use the arrow buttons to scroll through the scannos of the current file. If you modify the hint text, click Add to add the changes to the scanno file in memory. If you modify the search or replacement text, clicking Add creates a new entry; to replace an entry, back up to it and use Del to delete it.

These changes affect the loaded scanno file in memory. Only when you click Save is the scanno file on disk permanently updated.

Using Regular Expressions

A regular expression is a formal way of describing a pattern of text. You use regular expressions (regexes for short) when you need to search, not for a specific string like Foot, but for any string that fits a certain pattern. To search for a pattern of text, type the pattern into the Search Text field and set the Regex switch on.

While you are composing a regex, use Help>Regex Quick Reference to open a formal summary of regex syntax elements in a window that is small enough to keep open for convenient reference. You can also open this Regular-Expression Reference which has examples, but needs a larger window.

Regular Expression Resources

Regular expressions are amazingly powerful and flexible tools, if you understand their terse and technical syntax. Try the following resources for help in mastering regular expressions:

Regex questions are asked and answered in this DP forum thread, which begins with a tutorial.
Jan Goyvaerts's tutorial is detailed and wordy.
Miloslav Nic provides a regex tutorial that is built around examples, and is available in English, Czech, German and Spanish.
For many more pointers, see the directory pages at Google and The Open Directory.

The remainder of this topic covers special features that are supported by Guiguts and not always covered in tutorials.

Finding Multiline Patterns

A normal regex will only find a pattern that is contained in a single line. The reason is that a search for "any characters" (like .*) or "anything but" (like [^>]+) will not match to the newline character that marks the end of every line. (This is an artificial restriction, a relic of the days when computers could not load the whole file into memory at once.)

In post-proofing we often need to find patterns that extend across multiple lines; for example, to find every use of bold the pattern would be <b>[^<]+?</b>. This will indeed find bold markups that are contained in a single line, but the "anything but <" test will not match a newline, so this pattern will not match a bold phrase that begins on one line and ends on another.

However, if your pattern includes an explicit use of the newline (written \n), Guiguts changes the regex rules so that "anything" and "anything but" do find newlines. You can use <b>[^<]+?</b>\n? to find any bold phrase. The \n? at the end means "a newline, or not" and serves only to get a newline into the pattern so as to trigger multiline mode. Another example: to[\s\n]+he\b finds the phrase "to he" (a likely OCR misread of "to be") even if it is split by a line-end.

Searches of this type are both memory- and CPU-intensive, and as a result noticeably slower than normal pattern searches, so use them only when you need them.
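
The difference between the two modes can be illustrated in Python, assuming bold is marked up as <b>...</b>. This is a model of the idea, not of how Guiguts implements it: Python's re module always lets [^<] match a newline, so the line-at-a-time behavior is simulated here by searching each line separately.

```python
import re

text = "Some <b>bold on one line</b> and <b>bold split\nacross two lines</b> here."
pattern = re.compile(r"<b>[^<]+?</b>")

# Line-at-a-time search: each line is tested on its own, so no match
# can extend across a line break.
per_line = [m.group() for line in text.split("\n") for m in pattern.finditer(line)]

# Whole-buffer search: the newline is just another character that
# "anything but <" is allowed to match.
whole = [m.group() for m in pattern.finditer(text)]

# The "to he" scanno, split by a line end.
split_scanno = re.search(r"to[\s\n]+he\b", "he wished to\nhe alone")
```

The per-line search finds only the first bold phrase; the whole-buffer search finds both.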

Regex Replacements

The regex syntax for replacements lets you replace what you've found with a mix of new text, text quoted exactly from the found text, and quoted text that you modify, for example by forcing it to uppercase.

Replacing with New Text

You find the OCR has consistently misread CHAPTER as CHAETER, CHATTER, or CHARTER. You set a regex search for CHA[ETR]TER, with the fixed replacement text of CHAPTER, and click Replace All. The found text, whatever it may be, is replaced by the new text. Similar examples can be found in the "scanno" source files described here.
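
For readers who want to try the pattern outside Guiguts, here is the same fixed-text replacement in Python (the sample text is invented for illustration):

```python
import re

ocr_text = "CHAETER I\nCHATTER II\nCHARTER III\nCHAPTER IV"

# Every variant matched by the character class is replaced by the same fixed text.
fixed, count = re.subn(r"CHA[ETR]TER", "CHAPTER", ocr_text)
```

Note that the pattern would also rewrite a genuine CHATTER or CHARTER in running text, so it is worth stepping through the matches before using Replace All.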

Replacing by Quoting the Found Text

You use parentheses within the search pattern to isolate the parts of the found text you want to quote in the replacement. Left-parenthesis characters in the pattern are numbered 1-9, left to right. In the Replacement Text, $1 means "here insert the text found by the first parenthesized part of the pattern." $2 quotes the second parenthesized bit, and so on.

Often italic markup starts on the wrong side of punctuation, for example <i>"Eh?"</i> or <i>(ibid.)</i>. The following pattern looks for italic markup preceding punctuation: <i>(['"(]+). The parens isolate the part of the pattern that finds the punctuation. The replacement pattern $1<i> fixes the error by quoting the found punctuation followed by italic markup, reversing their order. A search pattern for trailing italics could be ([.!;'")]+)</i> and its replacement would be </i>$1.
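
The two italic fixes can be tried in Python, assuming the italic markup is written as <i> and </i>. Python writes the quoted group as \1 where the Guiguts replacement field uses $1; the sample line is invented.

```python
import re

line = '<i>"Eh?"</i> he said, citing <i>(ibid.)</i>'

# Leading case: move opening punctuation in front of the opening tag.
step1 = re.sub(r"""<i>(['"(]+)""", r"\1<i>", line)

# Trailing case: move closing punctuation behind the closing tag.
step2 = re.sub(r"""([.!;'")]+)</i>""", r"</i>\1", step1)
```

The question mark is not in the trailing character class, so it stays inside the italics, while the period and parenthesis after ibid are both moved out.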

Replacing by Modifying Quoted Text

Guiguts provides five ways to modify quoted text while replacing it:

\L...\E	Force all text between \L and \E to lowercase. For example, \L$1\E, quote $1 in lowercase.
\U...\E	Force all text between \U and \E to uppercase. For example, \U$1\E, quote $1 in uppercase.
\T...\E	Force all text between \T and \E to title case (initial cap). For example, \T$1\E, quote $1 with initial caps.
\A...\E	Format the text between \A and \E as an anchor, for example \Afoot\E produces <a name="foot" id="foot" />
\C...\E	Process the text between \C and \E as a Perl executable expression, and replace with the result of the expression.
\Carabic(...)\E	Roman numerals within the arabic() expression are converted to arabic.
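
The case modifiers have rough equivalents in Python, where a function computes each replacement. The sketch below is an approximation for illustration: in particular, Python's str.title() may not follow exactly the same rules as \T.

```python
import re

found = "chapter the FIRST"

# Like \L$1\E: quote the group in lowercase.
lower = re.sub(r"(FIRST)", lambda m: m.group(1).lower(), found)

# Like \U$1\E: quote the group in uppercase.
upper = re.sub(r"(chapter)", lambda m: m.group(1).upper(), found)

# Like \T$1\E: quote the group with initial caps (approximated by str.title).
title = re.sub(r"(chapter the)", lambda m: m.group(1).title(), found)
```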

As an example of \A (anchor) replacement, consider adding anchors at every chapter. The search text
(CHAPTER )\s*([IVXLC]+)
finds CHAPTER followed by a roman numeral, and sets each part for quoting, omitting any extra spaces. The replacement
\A$1$2\E\T$1 $2\E
puts the text back something like
<a name="CHAPTER_II" id="CHAPTER_II" />Chapter II
The found text is replaced twice, first in the form of an anchor; then as the chapter title in title case with a single, non-breaking space.

The \C...\E replacement, although sophisticated, can take the place of hours of hand-labor. For example, suppose you auto-generate HTML before adjusting the page markers. You look at the HTML and find the page anchors are all too high by 7: <a name="Page_117" id="Page_117" /> should really name Page_110, etc.

This can be fixed using \C...\E. The following pattern finds a page number anchor and sets up to quote the first instance of the page number:
<a name="Page_([0-9]+)"[^/]+?/>
The following replacement rewrites the anchor, subtracting 7:
<a name="Page_\C$1-7\E" id="Page_\C$1-7\E" />
The expression $1-7 becomes 117-7, which becomes 110 when executed by \C...\E. Thus each replaced anchor has a number 7 less than before.
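
The same arithmetic can be checked in Python, where a function plays the role of the \C...\E expression. The helper name shift and the sample HTML are illustrative only.

```python
import re

html = '<a name="Page_117" id="Page_117" /> text <a name="Page_118" id="Page_118" />'

def shift(match):
    """Rebuild the anchor with the captured page number reduced by 7."""
    page = int(match.group(1)) - 7
    return f'<a name="Page_{page}" id="Page_{page}" />'

fixed = re.sub(r'<a name="Page_([0-9]+)"[^/]+?/>', shift, html)
```

Each anchor is matched as a whole and rewritten, so both the name and the id attribute receive the corrected number.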

Guiguts provides the arabic() function to convert numbers in Roman numerals to arabic form. (No, there's no function to go the other way.) For example, the pattern
\b([IVXLCDM]+)\b
finds one or more Roman digits bounded on both sides by word-breaks (\b) and quotes the numerals. The replacement
\Carabic("$1")\E
replaces the numeral with its arabic equivalent. You could apply this to an entire table by selecting the table and clicking Replace All.
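
To show what such a conversion involves, here is a minimal Python sketch. The function below is written for this illustration and is not the arabic() function supplied by Guiguts; it assumes well-formed numerals and does no validation.

```python
import re

VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def arabic(roman):
    """Convert a well-formed Roman numeral to an integer."""
    total = 0
    for i, ch in enumerate(roman):
        value = VALUES[ch]
        # A smaller digit before a larger one is subtracted (IV, IX, XL, ...).
        if i + 1 < len(roman) and value < VALUES[roman[i + 1]]:
            total -= value
        else:
            total += value
    return total

table = "Chapter XIV ... page IX\nChapter XL ... page MCMXC"
converted = re.sub(r"\b([IVXLCDM]+)\b", lambda m: str(arabic(m.group(1))), table)
```

The pattern also matches ordinary words made only of these letters, such as the pronoun I, which is one reason to confine the replacement to a selected table.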
