The GUIGUTS Tool

Guiguts provides three ways to enter special characters: the Latin-1 palette, the Unicode menu, and the Greek tool, each described in the following sections. Before you use any of these tools, however, you should understand which character codes are permitted in PG etexts; see Character Codes and Compatibility, below. In particular, you should know to avoid the special symbols available in the Windows and Mac system fonts, which are discussed in the same section.

The Latin-1 Palette

Click the palette's button in the toolbar or use Help> Latin-1 Chart to open a small palette displaying the special characters of the Latin-1 set. Click on a character in the palette to insert either the character or its HTML entity (described below) at the insertion point.

The Unicode Menu

Single-click Unicode in the menu bar to open a long menu listing a selection of blocks of Unicode symbols. Click on one of the block names to open a palette displaying the characters in that block. If the font you are using does not contain certain characters, those characters display as blanks or empty boxes. Hover the mouse over a character; a "tool tip" pops up listing the decimal and hex values of that character, and its official name. Click the character to insert it in the document. At the top of a palette you can select whether the palette will insert the Unicode character itself, or the HTML entity (described below) for it.

When searching the menu for a character by its number, notice that the menu lists blocks by their hexadecimal, not decimal, value. The code for a left double quotation mark, “, is found at 201C, not at 8220.
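
The two notations name the same number, which you can check outside Guiguts. The following is only an illustration in Python, not part of the tool; the variable names are invented for the example.

```python
# The left double quotation mark, written in both notations.
hex_value = 0x201C   # hexadecimal, the way the Unicode menu lists blocks
dec_value = 8220     # decimal, as also shown in the tool tip

same_number = hex_value == dec_value   # both spell the same code
left_quote = chr(hex_value)            # the character itself
as_hex = format(dec_value, "04X")      # decimal -> four-digit hex string
```

To look up a character whose decimal code you know, convert it to hex first (as format() does here) and then find the block that covers that value.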

Guiguts does not provide the entire gamut of Unicode symbols. For a table of all Unicode available through the Guiguts menu, open this page in a separate window. It may take some time to load. Characters that are not in the serif font used by your browser display as empty boxes. If you cannot find a character, you can find out if it exists in Unicode using this search page.

When the document contains even one Unicode character outside of Latin-1 (that is, any multi-byte character), Guiguts will save the file as a UTF-8 file. Any other text editor that is compatible with UTF-8 should open the file correctly. If you remove all multi-byte characters from a document, it is next saved as Latin-1.
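
The rule can be pictured with a short Python sketch. It models only the behavior described above, under the assumption that "outside Latin-1" means any character that cannot be encoded as a single Latin-1 byte; it is not the Guiguts code, and the function name choose_encoding is hypothetical.

```python
def choose_encoding(text: str) -> str:
    """Return 'latin-1' if every character fits in one Latin-1 byte, else 'utf-8'."""
    try:
        text.encode("latin-1")   # fails on any character above U+00FF
        return "latin-1"
    except UnicodeEncodeError:
        return "utf-8"


accented = choose_encoding("caf\u00e9")           # e-acute is in Latin-1
curly = choose_encoding("\u201cquoted\u201d")     # curly quotes are not
```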

Guiguts at this time requires the Aspell spelling checker version 0.5, which does not handle Unicode. Spell-checking may not work correctly, at least for words containing multi-byte characters.

The Gutcheck tool does not handle UTF-8 data well. If the document contains more than a very few multi-byte Unicode characters, running Gutcheck may produce useless output.

The pop-up tool tips can make each Unicode palette slow to load the first time it is used. You can disable the tool tips by changing the name of the file Unicode in the Guiguts folder to any other name.

The Greek Transliteration Tool

Knowledge of Greek was basic to advanced education throughout the 18th and 19th centuries; naturally, scholars and poets of those periods toss words, phrases, even paragraphs of Greek into their books. PG wants these transliterated into ASCII equivalents; the method is summarized in the PG FAQ page.

The DP Guidelines tell the proofer to transliterate Greek text and enclose it in [Greek:] markup. The standard proofing interface has a pop-up tool to assist this. However, you need to recheck and possibly re-do all Greek, for two reasons. First, transliteration is difficult, and errors are likely. Second, the pop-up tool does not support all accents and obsolete characters, so if you understand Greek orthography, you may be able to do a better or more complete job.

Greek in ASCII, Beta, Unicode and HTML

The PG method used by proofers is a simple conversion from Greek symbols to 7-bit ASCII. Beta coding is a more complex transliteration scheme that lets you preserve more of the Greek orthography in ASCII form. The Beta code is summarized on this page.

All the Greek symbols are available in two blocks of Unicode. They can be found in the middle of the Guiguts Unicode menu. These characters require multi-byte codes, so if you put them in an etext it will be saved in UTF-8 form.

All the Greek alphabet symbols have HTML entity codes (described below). Thus the HTML version of an etext can display the original Greek text while remaining an ASCII document.

The Greek Tool

Use Help> Greek Transliteration or click the tool's button in the toolbar to open the Greek Transliteration tool.

To enter transliterated Greek text, you click on the images of the characters in sequence. The transliteration is built up in the text window based on your selection of the four switches at the top of the window:

The Latin-1 switch produces PG/Beta ASCII codes.
The Greek Name switch produces the English names of the characters.
The HTML code switch produces HTML Entity codes.
The UTF-8 switch produces Unicode characters.

Click Space to enter a space. You can also edit the text in the text window manually, and cut, copy and paste into it.

When the text in the window is correct, click Transfer to insert the contents of the text window at the insertion point in the document.

To build a character with accents and/or breathing marks, type the base ASCII letter in the small field at the bottom of the window. The corresponding Greek character is shown. Click on the Beta-code accent marks to the right; Guiguts displays the resulting composite character. Press Enter to move the composite character into the text window. Additional capabilities of the Character Builder field:

Empty field	Press Enter to enter a linebreak
Empty field	Press Backspace to backspace in the text field
Space	Enter a space in the text field
s then space	Enter a terminating lowercase sigma
o^ or O^	Enter lowercase or uppercase Omega
e^ or E^	Enter lowercase or uppercase Eta

Four buttons in the second row let you convert the contents of the text window between encodings. For example, you could copy a proofer's transliteration, paste it into the text window, click ASCII->Greek, then compare the Greek to the original page image.

You can enter HTML entity codes directly by setting that switch and clicking on Greek letters. There is no direct method for converting built-up characters to HTML. You can do it indirectly as follows: When the desired text is visible as ASCII codes, use either ASCII->Greek or Beta Code->Unicode to get Greek symbols. Click Transfer to put the symbols in the document. Highlight the symbols in the document and use Selection> Convert to Named/Numeric Entities.

Character Codes and Compatibility

The characters in a text file are encoded as small binary numbers. That means there must be agreement on how the numbers are to be decoded: agreement, for example, that the number 32 will be decoded as a space, and 33 as an exclamation mark.
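
You can see this agreement at work in any programming language. The lines below are a small Python illustration (not part of Guiguts) that converts characters to their code numbers and a run of numbers back to text.

```python
# Character -> number, using the agreed code.
space_code = ord(" ")   # 32
bang_code = ord("!")    # 33

# Numbers -> characters: three stored bytes decoded as ASCII text.
stored = bytes([72, 105, 33])
decoded = stored.decode("ascii")
```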

Seven-Bit ASCII

There is one standard encoding on which all common operating systems, web browsers, and text editors agree: the 7-bit ASCII code. It is called that because it is an agreement on the use of the numbers that can be represented in 7 binary bits, 0-127.

Seven-bit ASCII defines 96 codes (32-127), providing the English alphabetics and common punctuation. (Numbers 0-31 were avoided because they have historic functions for controlling the flow of data in transmission.) Open this page in another window to see the full list of 7-bit ASCII codes.

The PG FAQ says "You should use plain ASCII for straight English texts." Seven-bit ASCII is favored by Project Gutenberg because an etext that uses only 7-bit ASCII can be read on any equipment and software, anywhere.

Latin-1 or ISO-8859-1

The next step is to use eight bits, giving an additional set of code numbers 128-255. The ISO (International Organization for Standardization, a body that coordinates the work of the national standards organizations of many countries) has standardized ISO-8859-1, Latin alphabet No. 1, also referred to as Latin-1.

The Latin-1 standard assigns character codes to the 96 numbers 160-255 (numbers 128-159 are skipped for historic reasons), providing most accented characters needed for European languages plus a variety of special symbols. The characters in pop-up menus on the PGDP Proofing Interface are the Latin-1 characters. Open this page in a separate window to see the Latin-1 character set.

The PG FAQ says you should use ISO-8859 when you must, but "also provide a 7-bit plain ASCII version with the accents stripped ... we make a point of always supplying an ASCII version where possible, even if the ASCII version is degraded when compared to the 8-bit original."

You can use regex search to visit the Latin-1 special characters in the document. Search for [\x7f-\xff]. In this way you can find all the accented characters that should be changed to make a 7-bit ASCII version of an etext. When converting a Latin-1 etext to 7-bit ASCII, it is helpful to understand the PG scheme for diacritical markup, described in this topic of the PGDP guidelines. Many accented characters can be preserved in ASCII form using this markup.
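
The same character class works outside Guiguts. As a rough sketch, assuming you want a list of positions rather than an interactive search, the Python below applies the identical expression to a sample string; the function name find_latin1_specials and the sample text are made up for the example.

```python
import re

# Same character class as the Guiguts search: codes 127 through 255.
LATIN1_SPECIAL = re.compile(r"[\x7f-\xff]")


def find_latin1_specials(text):
    """Return (offset, character) pairs for every match in text."""
    return [(m.start(), m.group()) for m in LATIN1_SPECIAL.finditer(text)]


sample = "The caf\u00e9 served cr\u00e8me br\u00fbl\u00e9e."
hits = find_latin1_specials(sample)
```

Each hit marks a character that needs attention when you prepare the 7-bit ASCII version.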

Windows Special Characters

The Windows operating system (for US and UK keyboards) also defines an 8-bit code. The numbers 0-127 are the same as 7-bit ASCII but the numbers 128-159 have additional symbols like trademark, bullet, and endash. (For a detailed discussion of the Windows character set, see this page.)

If you are using Windows, you can easily enter a symbol like a curved-double-quote that is not in Latin-1. Doing so makes your document incompatible with Latin-1 and with Unicode. You can get rid of all the Windows characters you may have put in the document accidentally by applying the menu command Fixup> Convert Windows CP 1252 characters to Unicode. Guiguts sweeps the document and changes Windows-unique codes to their Unicode equivalents. This leaves the document with Unicode characters in it. To find them, see the Unicode topic, below.
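
To see what such a conversion involves, here is a minimal Python sketch. It is not the Guiguts routine; it only shows that bytes in the 128-159 range, read with Python's standard cp1252 codec, become the Unicode characters they stand for in Windows. The sample bytes are invented.

```python
# Bytes as a Windows editor might have saved them:
# 0x93 and 0x94 are curved double quotes, 0x96 an en dash, 0x99 the trademark sign.
raw = b"\x93Hello\x94 \x96 Acme\x99"

# Read as Windows CP 1252, the bytes become proper Unicode characters.
text = raw.decode("cp1252")

# Read as Latin-1, the same bytes land in the skipped 128-159 range
# and show up as invisible control codes instead.
as_latin1 = raw.decode("latin-1")
```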

MacRoman

Early in the history of the Mac, Apple defined an 8-bit set of 223 characters. The "MacRoman" code includes 7-bit ASCII but uses the codes 128-255 for a different selection of symbols than ISO and ANSI did. (This explains why special characters are jumbled in email between Windows and Mac users.) If you are using a Mac and you edit the document in some tool other than Guiguts, you must be careful not to pollute the document with special characters that appear correct but are coded in the MacRoman set.
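
The jumbling is easy to reproduce. In this Python illustration (an example only, using Python's standard mac_roman and cp1252 codecs), a word saved on a Mac in MacRoman is read back as if it were Windows text.

```python
original = "caf\u00e9"                      # "cafe" with an acute accent

mac_bytes = original.encode("mac_roman")    # how MacRoman stores it
misread = mac_bytes.decode("cp1252")        # a Windows reader's view
reread = mac_bytes.decode("mac_roman")      # correct when the code set is known
```

The accented letter is stored as the byte 0x8E, which Windows displays as a capital Z with caron. The text looked correct on the Mac, which is why this kind of pollution is hard to notice.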

TextEdit is the default application if you double-click a file of type "txt." It allows MacRoman by default, but you can make it safe. Open TextEdit> Preferences. In Panther, look in the upper right. In Tiger, select the "Open and Save" button. Set the "Plain Text Encoding" for both Open and Save to the choice, "Western (ISO Latin 1)." (If this choice is not at first available in the pop-up menu, select "Customize Encodings List" from the end of the menu and enable the Latin-1 choice in the list of all encodings.) For TextEdit in Tiger, you should also set "Western (ISO Latin 1)" as the encoding for saved HTML files.

BBEdit can use any code set, but you must tell it which to use. Open BBEdit> Preferences and select the page named "Text Files: Opening." Set the preference "If File's Encoding Can't Be Guessed, use: Western (ISO Latin 1)." Go to the page "Text Files: Saving" and set "Default Text Encoding: Western (ISO Latin 1)." Then, before you save any PG document, pull down the File Options menu in the document header (the icon is a tiny page symbol) and make sure that "Encoding: Western (ISO Latin 1)" is set.

Unicode

There is no way to get all the characters of the world's languages into the 256 numbers that a single byte can hold. The only solution is to use more bits per character. Unicode is a standard that has assigned nearly 100,000 letter symbols to numbers in the range of zero to about one million.

Obviously such code numbers won't fit in a single byte. In the most common encoding, called UTF-8, each Unicode character is encoded in a sequence of one to four bytes. The 128 codes of 7-bit ASCII stand for themselves, so a 7-bit ASCII etext is in fact a UTF-8-encoded Unicode text as well.

Unicode UTF-8 uses the numbers 128-255 as markers to introduce 2-byte, 3-byte and 4-byte groups that represent other characters. As a result, a Unicode text file is not compatible with a Latin-1 text file. (If software treats one under the belief that it is the other, wrong special characters are displayed.)
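
Both points can be checked with a few lines of Python. This is an illustration only; the sample characters are arbitrary choices, one for each possible byte length.

```python
# One sample character for each UTF-8 length: 1, 2, 3 and 4 bytes.
samples = ["A", "\u00e9", "\u201c", "\U0001D11E"]
lengths = [len(ch.encode("utf-8")) for ch in samples]

# Pure ASCII text is byte-for-byte identical in both encodings.
ascii_is_utf8 = "plain text".encode("ascii") == "plain text".encode("utf-8")

# A UTF-8 file read as Latin-1: the two bytes of e-acute
# are shown as two wrong characters.
misread = "\u00e9".encode("utf-8").decode("latin-1")
```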

PG accepts etexts in UTF-8 coding when the additional characters are necessary to the book. To find out whether the document contains any multi-byte characters, search with this regular expression: \P{IsAscii} (note the uppercase P). This finds all multi-byte characters, including punctuation.
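
If you need the same check outside Guiguts, note that not every regex engine accepts that property syntax. Python's standard re module, for instance, does not, so the sketch below substitutes a negated ASCII range, which matches the same characters. The function name non_ascii_chars is hypothetical.

```python
import re

# Stand-in for \P{IsAscii}: any character outside the range 0-127.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def non_ascii_chars(text):
    """Return the distinct non-ASCII characters in text, sorted by code."""
    return sorted(set(NON_ASCII.findall(text)))


found = non_ascii_chars("\u201cna\u00efve\u201d, he said.")
```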

You can find all words containing multi-byte characters a different way. Use Fixup> Run Word Frequency Routine. In the report window, click the Unicode>FF button. Words containing a multi-byte character are listed. You can jump to the words by double-clicking them.
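
A rough equivalent of that report, for use outside Guiguts, might look like the Python sketch below. It assumes that "Unicode>FF" means words containing at least one character whose code is above hex FF; the function name words_beyond_latin1 is invented, and the word-splitting rule is a simplification of whatever Guiguts actually uses.

```python
import re
from collections import Counter


def words_beyond_latin1(text):
    """Count the words that contain any character with a code above 0xFF."""
    words = re.findall(r"\w+", text)
    return Counter(w for w in words if any(ord(c) > 0xFF for c in w))


greek = "\u03bb\u03cc\u03b3\u03bf\u03c2"   # a Greek word, entered as Unicode
report = words_beyond_latin1(greek + " and logos and " + greek + " caf\u00e9")
```

The Latin-1 word with an accented letter is not listed, because its accent has a code below hex FF.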

HTML Character Entities

The HTML character entities are special codes that let you use a sequence of ASCII letters to command the browser to display a special character. Entities always start with the ampersand (&) and end with the semicolon. For example, &frac14; is the entity for the character ¼. Open this page to see the list of all available entities.

You cannot use HTML entities in the text file; the reader would not understand &frac14;. You use entities in an HTML file so that the file itself, bookname.html, is a 7-bit ASCII file, yet the browser can display accented, Greek or mathematical symbols.
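
Python's standard library carries the table of HTML 4 named entities, which makes it a convenient way to look one up or to see what a browser would display. The lines below are an illustration only and have nothing to do with how Guiguts works internally.

```python
import html
from html.entities import codepoint2name

# Character -> named entity, built from the standard HTML 4 table.
quarter = "\u00bc"                                   # the one-quarter fraction
entity = "&" + codepoint2name[ord(quarter)] + ";"

# Entity -> character, as a browser would decode it.
shown = html.unescape("&frac14; cup")

# The entity itself uses only 7-bit ASCII characters.
entity_is_ascii = entity.isascii()
```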

The heading of an HTML or XHTML document is supposed to specify its character encoding. This is usually done with the following statement in the head section:

<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

Specifying charset=ISO-8859-1 tells the browser that the document might contain the full Latin-1 set, which most browsers support. However, Guiguts automatically converts all Latin-1 and Unicode characters to HTML entities during automatic HTML Conversion. Any HTML generated by Guiguts consists solely of 7-bit ASCII text.
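
The idea behind such a conversion can be sketched in Python. This is not the Guiguts converter; it assumes a simple rule (a named entity where HTML 4 defines one, a numeric entity otherwise), the function name to_entities is hypothetical, and it does not escape markup characters such as & and <.

```python
from html.entities import codepoint2name


def to_entities(text):
    """Replace every non-ASCII character with a named or numeric HTML entity."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)                               # ASCII stands for itself
        elif code in codepoint2name:
            out.append("&" + codepoint2name[code] + ";") # named entity
        else:
            out.append("&#%d;" % code)                   # numeric entity
    return "".join(out)


converted = to_entities("caf\u00e9 \u03b1")   # e-acute, then Greek alpha
numeric = to_entities("\u0101")               # a-macron has no HTML 4 name
```

The result contains only 7-bit ASCII, so it can be saved and transmitted safely whatever charset the page declares.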
