In this chapter we will develop a program that can compress a document file to about 70% of its original size, and another to return it to normal. With such utilities we can store more document text on a diskette than would otherwise be possible.
The file handling in these programs is quite a bit more complicated, and requires more interaction with the CP/M file system, than any we've done so far. With it we establish a general pattern that can be used for all such utilities.
There are a number of methods for compressing data. All of them depend on predicting that there is some kind of redundancy in the data, and then encoding the data in a way that eliminates the redundancy. If the predicted redundancy does not occur, then some algorithms will actually expand the data, while others simply make no change in its size.
One common method is called run-length encoding. That method locates runs (sequences) of identical characters, and encodes each run as a single, representative character and a count of the number of times it is to be duplicated. For instance, run-length encoding would reduce each of the lines of dashes that appear in the assembly programs in this book to a few bytes. Run-length encoding has no effect on a file that doesn't contain any runs. Unfortunately, that is the case with most files of general English text, such as the CP/M files in which I keep the chapters of this book.
Another compression method depends on knowing the frequency distribution of the characters in a file, and on there being a large difference in frequency between the most common character and the least common. Then each character can be assigned a bit-code whose length depends on its frequency of use. The most common character is given a code consisting of a single bit, and the least common a code with as many bits as there are unique characters in the file. This method is the most effective one possible when the frequency distribution is known perfectly, but it fails disastrously when the prediction is wrong.
Here we will use an algorithm that depends on a prediction about English text that has been verified by the experience of generations of typesetters. It is the prediction that the most common characters are the space, followed by the letters "etaoinshrdlu" in that order. It follows that pairs of characters taken from that list will also be common. We can then compress text by locating pairs of letters from the list, and encoding each such pair in a single byte. I've applied an earlier version of this method to more than a megabyte of text, and have achieved compression of 25 percent to 35 percent in every case.
We can add two heuristic frills to the basic algorithm. In CP/M, ASCII text is stored as lines, with each line delimited from the next by the pair of characters CR and LF. In some files, most lines begin with a TAB character. We will encode the pair CR, LF as a single byte, and encode the triple CR, LF, TAB, in a single byte. After looking at the finished program, we will consider how run-length encoding could be added as well.
PACK, and its stablemate UNPACK, are to be utilities (or "filters"); that is, programs that read an input file and write an output file containing the same data, modified. A natural syntax for such a command is
PACK inputfile outputfile
Our work as programmers would be easier if the user were required to specify a complete fileref for each of inputfile and outputfile. But utilities are much more useful when we specify a reasonable set of defaults for the file references. Each file reference consists of three parts, a drivecode, a filename, and a filetype. We can represent them so:
x:filename.typ
(MP/M 2 adds a fourth component, the file password.) The program has to specify all three parts in the FCB when it requests a CP/M service. The command's user, however, should not have to give all the parts when the program can provide a reasonable default. The user will appreciate us even more if we use the default rules consistently in every utility.
There is a CP/M convention for the default drivecode. When the user omits a drivecode from a CP/M command, CP/M supplies the default drive letter (the letter used in the command prompt, as "A>"). The programs in this book will use that convention. The default drive number can be gotten via a BDOS service request.
CP/M allows a file to have a filename that is completely blank. Unfortunately, it is hard for a program to distinguish between a blank filename and the complete absence of the inputfile operand. The programs will be much simpler if a fileref "a:.typ" is not allowed. Since there isn't much practical use for such a fileref, we will prohibit it. The user will be required to specify an input filename.
A filetype of three blanks is allowed by CP/M. While that doesn't have much use either, we gain no programming advantage by prohibiting it, so we won't. In the inputfile operand, an omitted filetype is actually a specification of a filetype of three blanks.
The outputfile operand has the same three parts. If a part of the output fileref is omitted, what shall a utility do? The output file is related to the input file. It would be convenient if its default file reference were related as well. A simple rule is this: if a part of the output fileref is omitted, a utility will use the same part of the input reference. Let's consider the effect of that rule for several variations of the PACK command.
A>pack chap01.doc b:
The input file is A:CHAP01.DOC. The output drivecode has been specified, but the output filename and filetype have been omitted. Copying these parts of the input fileref gives an output file of B:CHAP01.DOC.
A>pack chap01.doc .pak
The input file is A:CHAP01.DOC. The output drivecode and filename are missing; taking them from the input reference gives an output file of A:CHAP01.PAK.
A>pack input c:output
The input file is A:INPUT (blank filetype), the output is C:OUTPUT (with a blank filetype copied from the input).
A>pack b:replaced.doc
The input file is B:REPLACED.DOC. Copying all three missing parts of the output reference from the input reference gives us an output file of—B:REPLACED.DOC. The default convention calls for the output file to replace the input file! When I realized this, my first thought was to find some way to ban it. But the facility is a useful one. It turned out to be easy to do, and the resulting program is better for it.
Figure 3-1 shows the high-level plan for the logic of PACK. Each of the initialization steps requires a lot of elaboration. In fact, the initialization and termination code came out to be four times as long as the main loop.
Figure 3-1. Sketch of program PACK.
Program Pack(infile,outfile) prepare the input file for reading prepare the output fileref prepare a work file for output get First byte get Second byte while First byte is not CpmEof... if compression is possible, then First byte := First, Second bytes merged get another Second byte end if. write First byte First byte := Second byte get another Second byte end while. write any buffered output close the workfile erase the output fileref rename the workfile to the output fileref end Pack.
Let us consider input first. There will be a subroutine GetChar. It will be similar to GetChar in the preceding chapter, in that it will return the next byte of input on each call. Unlike the preceding programs, PACK deals only with ASCII files. By CP/M convention, the ASCII character named SUB (control-z, 1Ah) marks the end of a file of ASCII text. Therefore, GetChar need not return a special end-of-file flag. When the next byte is SUB, or CpmEof as it is called in the pseudo-code, end of file has been found.
GetChar will require a File Control Block (an FCB) to use for input. It is set up during initialization. Figure 3-2 shows the steps needed to do this. If the input filename is blank the program will abort with a message. (The alternative is to see if the drivecode and filetype have also been omitted; even then the program can't be sure that the user didn't mean "the file with a blank name and blank type on the default drive.")
I chose to provide FCBs for both files, rather than using the default FCB in low storage. It's important that two particular bytes of an FCB be set to zero before a file is opened (the extent number and current record number). The preceding programs didn't have to do so because those bytes of the default FCB are set to zero before the program is called. It's no harder to zero the whole FCB, since we have a library routine FillZero for just that purpose.
It would be possible to assemble an FCB with zero values in it, just as it's possible to assemble an initial value for any variable. But such static initialization is unreliable. It works as long as the program is loaded afresh from disk every time it is to be run. However, it is possible to restart a CP/M command without reloading it. If that happens, the second time the program is entered its variables will be in a "used" condition. The programs in this book always initialize all variables at execution time.
It will be simple to move the different parts of the input fileref from the default FCB, where the Console Command Processor (CCP) put them while parsing the command, to the input FCB in the program. Another included routine, MoveHtoD, will come in handy here. The last step is to set up an index for the use of GetChar.
Figure 3-2. Design for the preparation of the input file mechanism for PACK and UNPACK.
prepare the input file for reading: if "infile" was omitted, Abort "An input filename is required." fill the input FCB with zeroes get an input drivecode for the FCB if a drivecode was supplied, use it, else use the system default drivecode. move the input filename and filetype to the FCB open the input FCB if no file exists, Abort "Input file not found." InIndex := 255 { force read on first GetChar }
The version of GetChar in the last chapter wasn't very satisfactory, because it relied on its buffer's being on a particular storage boundary. However, there is a convenient input buffer that is always on a known boundary. CP/M sets aside the area from 0080h through 00FFh for use as a buffer. It would be a shame to waste this area. (Actually, a few CP/M systems are forced by unusual hardware to place their low-storage fields, including this buffer, at some other address xx80h. That's why the addresses named in CPMEQU.LIB are defined in terms of "CpmBase." In a standard system, CpmBase is 0000h; in a variant system it can be redefined and these programs reassembled.)
The version of GetChar that we'll use from now on is shown in Figure 3-3. InIndex isn't a proper index, but a one-byte storage address of the last byte taken from CpmBuffer. This is the reverse of what we've done before, when the index pointed to the next byte to be used. InIndex must be incremented before it is used, not after. When incrementing it makes it overflow to 00h, the buffer is empty.
Figure 3-3. The byte-input routine GetChar, as used in PACK and ohter utilities.
GetChar: returns a byte T is a temporary byte CpmBuffer is an array[128..255] in low storage global InIndex is an index over it InIndex := InIndex + 1 if ( InIndex = 00h ) then { input buffer is empty } CpmBuffer[128] := CpmEof read to CpmBuffer, ignoring return code InIndex := 128 endif. T := CpmBuffer[InIndex] if ( T = CpmEof ) then { end of file, never read further } InIndex := InIndex-1 return T end GetChar
CP/M has two different conventions for marking the end of an ASCII file. If the file ends in the middle of a 128-byte record, the end is marked by a CpmEof (SUB) character. There may be one such byte, or the record may be filled to its end with CpmEof bytes (the programs in this book fill the record, but some programs just stick in one of them and leave garbage in the rest of the record). If the last byte of data in the file happened to fall in the last byte of a 128-byte record, end of file will be marked by a nonzero return code from a sequential read request.
It's a bit complicated to check for two different end of file signals. GetChar demonstrates one way to handle the problem. It puts CpmEof at the head of the buffer before reading. Then, if there are no more records, the end of file mark will still be there after the read. The read request's return code can be ignored.
Once GetChar has returned CpmEof, it should not be called again. However, GetChar has been designed in the anticipation that it might be called again after end of file. It sets InIndex back when CpmEof appears, so that once it has delivered CpmEof, it will never deliver anything else.
VDUMP was an example of a large class of programs that can do their work while looking at only one byte of input at a time. PACK is a member of a different class of programs. It must be able to see two bytes of input at once. Read the central loop in Figure 3-1 carefully. It takes input one byte at a time (expressed as "get another Second byte"), and writes output one byte at a time ("write First byte"). However, at all times it has access to two bytes, First and Second. That is required because the test, "if compression is possible," can only be made by considering both bytes at once. The test really is, "if the First byte is in the list of common characters, and if the Second byte is in the list, too."
What the loop really calls for is a window, just two bytes wide, that can be slid along the text. As one byte passes out of view on the left, another enters the window from the right. Sometimes the program will find that the two bytes in the window can be compressed into one. Then it must pluck both bytes out of the window, put a compression code in the window, and slide the window right.
Read Figure 3-4 with the window image in mind. The "window" is realized as two variables, B and C. The subroutine Shift slides the window right over the text by putting C into B, and a new byte into C. The program initializes the window by calling Shift twice.
The subroutine Shout ("Shift and output") does more. It sends the character at the left edge of the window to the output file, and then shifts the window right.
The algorithm of Figure 3-4 looks for pairs of characters such that the first is one of "(blank)etaoinshrdlu" and the second is one of "(blank)etaoins." The first is reduced to an index from 0 to 12, the second to an index 0 to 7. The two indexes are packed into bits 6 through 0 of a byte. Bit 7 of the byte is set to 1 to signal a compression byte.
What if the file already contains bytes whose most significant bits are 1? Such bytes would look like compression bytes, and presumably UNPACK would try to decompress them. If such a byte turns up, the code of Figure 3-2 flags it in the output file by writing the constant CopyThis before it.
Because the index that replaces byte B can never be greater than 12, or 0Ch, a compression byte can never be greater than E7h. That leaves compression byte values E8h through FFh uncommitted. One of those values is needed to stand for CopyThis, and two more are used as EndLine (for CR, LF) and EndLineTab (CR, LF, TAB).
Figure 3-4. Pseudo-code of the main loop of PACK, showing the compression algorithm.
Program Pack(infile, outfile): B, C are bytes CopyThis, EndLine, and EndLineTab are byte values, such that compression won't produce them. LimitB is the largest 4-bit number, i.e. 15 LimitC is the largest 3-bit number, i.e. 7 ...initialization of input, output, and work files Shift(B,C) {initialize the two-byte window to..} Shift(B,C) {..the first two bytes of the file } While ( B <> CpmEof ) if ( Lookup(B) <= LimitB ) and ( Lookup(C) <= LimitC ) then C:= 80h + (Lookup(B) shiftleft 3) + Lookup(C) Shift(B,C) else if ( B>127 ) then PutChar(CopyThis) else if ( B=CR ) and ( C=LF ) then C := EndLine Shift(B,C) if ( C=TAB ) then C := EndLineTab Shift(B,C) endif. endif. endif. endif. Shout(B,C) end while. Shift( var X, Y: bytes ) X := Y Y := GetChar end Shift. Shout( var X, Y: bytes ) PutChar(X) Shift(X,Y) end Shout. Lookup( X: a byte ): returns a small number. if X is a member of " etaoinshrdlu" then return its index 0...12 in the list else return a value greater than 127 endif. end Lookup.
When we promise to transform a file without losing any information, it behooves us to take some pains to make sure we will do it correctly. One principle of program structure is that the programmer should be able to make assertions at certain critical points in a program structure. An assertion is a statement of a condition that has to be true at the point in order for the program to be correct. The point of making an assertion is that, presumably, the truth of the assertion, and hence the program's correctness, can be verified by inspecting the code. Alternatively, inspection might reveal a path through the code that results in the assertion's being false, in which case a design error has been found.
Three useful assertions can be made of the loop in Figure 3-2. At the top of the loop we should assert, "B contains a different byte from the last one tested at this point." If that weren't true in every possible case, there would be a possibility of an endless loop. Since Shift loads a different value into B, the assertion can be reduced to "Shift has been called at least once since the last time this test was made." It's easy to verify that this last statement is true; inspect Figure 3-2 to see.
An assertion that is harder to verify can be made of the statement "Shout(B,C)" at the end of the loop: "Whenever Shout is called, the value in B is one that should be written." Make it more specific: "Whenever Shout is called, B contains either a byte that can't be compressed, or a byte that is the result of compression." If the assertion isn't true, then the algorithm isn't working right. In order for it to be false, B would have to be a byte that could be compressed and wasn't, or else garbage.
The third assertion is that no data will be lost: "whenever Shift is called, B is either a byte that has just been written, or a byte whose value has been compressed into C." If that assertion isn't true, the program might shift some byte out of sight without writing it.
One aim of this design style is to produce routines that are short enough that such assertions can be verified with good confidence, by eye. I can't see any path through Figure 3-2 that would invalidate these assertions. Can you?
Because of the default rules for the output fileref, it is possible that PACK's output file may have the same name as its input file. The first step in making an output file is to erase any file of the same name. That is obviously not the right thing to do when that same file might be the program's input! PACK will have to create a work file with an artificial name for its output. At the end of the program it can rename the work file to have the correct name.
The CCP places the outputfile command operand in the second half of the default FCB. It is fairly easy to get the parts of that fileref from there to the FCB the program will use, as sketched in Figure 3-5. If the user omitted the drivecode, filename, or filetype, it's easy to copy them to the output FCB.
The output filetype, however, is not put into the FCB. It is set aside for use at the end of the program, and a filetype of "$$$" is put in the FCB. It is a CP/M convention that temporary work files have type "$$$." A program can erase any file of that type at any time. PACK will create a workfile whose filename is that specified by the user, but whose filetype is "$$$."
Figure 3-5 shows the steps needed to prepare that work file for use. Once the FCB has been set up, a single service request erases any file of the same name. Before that is done, we had better be sure that there are no question marks in the output filename. The reason will be clearer when we look at finishing the output file.
Another service request "makes" the file. To "make" a file is to place an initial entry for that file in the disk directory. If the directory is full, the request will fail, and the program will abort with a message.
Figure 3-5. Design for the preparation of the output and work files for PACK and other utilities.
OutLimit is a large number (at least 2048) and a 128-multiple OutBuffer is an array[0..OutLimit] of bytes, OutIndex is an index over it. prepare the output fileref.. fill the output FCB with zeroes get an output drivecode for the FCB if a drivcode was supplied, use it, else use the input drivecode. get an output filename for the FCB if a filename was given, use it, else use the input filename. get an output filetype and save it (away from the FCB) if a filetype was given, use it, else use the input filetype. if the output filename or filetype is ambiguous, Abort "The output fileref may not be ambiguous." prepare a workfile for output.. move "$$$" to the output FCB filetype erase any ".$$$" file with the same filename make a file, using the output FCB if the make fails, Abort "Can't create the output file." initialize the the OutBuffer address and OutLimit OutIndex := 0 { output buffer is empty }
Up to this point we have done file I/O in 128-byte units. That is adequate when a program works on only one file at a time. When the program reads and writes by turns, however, it is important that it operate on at least one of the files in larger units. One of PACK's two files must be buffered in order to get adequate performance. CP/M's primitive I/O methods are the cause of this.
Most CP/M systems have at most 64 KB of storage, and the resident system code (the BDOS and the BIOS) is designed to be as small as possible in order to give the maximum space to command programs. The BIOS, which performs disk I/O, almost never contains buffer space for more than a single disk sector. A disk sector may be as large as one kilobyte, or as small as 128 bytes. Regardless of its size, a sector contains data for only one file. Sectors are allocated to a file in groups, with all the sectors of one group adjacent on the disk.
When a program works on a single file, its I/O requests will cause the BIOS to progress smoothly through adjacent disk sectors. There will be a minimum of disk arm movement and, if the program processes quickly enough, sectors will be transferred at a rate near the hardware maximum (disable the subroutine Output in VDUMP, so that the program is no longer limited by the speed of the terminal, to see just how fast a file can be read).
The situation changes when a program uses two files alternately. Suppose that the PutChar routine of PACK worked in 128-byte units as GetChar does. The BIOS would be first asked to read the input file. It would read the disk sector containing the first input record. Then it would be asked to write in the output file. It would have to move the disk arm to a different place on the disk, or select a different drive, and write a sector of that file. The following requests would send it back to the first file, and then back to the other, and so on. The system would end up spending more time in sawing the disk arm in and out, or in clattering back and forth between drives, than it did in processing.
When disk sectors are larger than 128 bytes, the situation is even worse. In order to write one 128-byte record into a 512-byte disk sector, the BIOS must read the sector, copy the record into it, and then write the sector. If the next service request is a write that falls in the same sector, the read can be skipped. But when reads and writes alternate, the BIOS will end up doing three I/O operations for every two service requests. It will have to read an input sector for a read service, then for the write service it will read an output sector and write it back.
In our utilities we will minimize the problem by buffering a large quantity of output in storage. GetChar will request many input records while PutChar is filling a large output buffer. When the buffer is full, it will be dumped to the output file in one long burst of output requests. This will minimize the number of times that CP/M is asked to switch between files.
Read the pseudo-code in Figure 3-6 to see how output buffering can work. PutChar is similar to GetChar; on each call it handles a single byte and advances an index. When the index reaches the limit of the buffer, PutChar calls DumpOutput to write the buffered data. I made DumpOutput a separate routine for two reasons. First, it is complicated enough to deserve separate consideration. But there is a more important reason.
At the end of the program, there will almost certainly be some unwritten output data remaining in the buffer. It might be as little as one byte, or as much as one byte less than the size of the buffer. The last thing before closing the output file, PACK must write that buffered data. It is convenient to write DumpOut as a separate routine, so that it can be called in this case as well as from PutChar.
The output buffer will be a multiple of 128 bytes in size. When DumpOut is called from PutChar, the buffer will be completely full. When DumpOut is called the last time, at the end of the run, that won't necessarily be true. That will be DumpOut's cue to fill the last buffered record with CpmEof bytes. (Asserted at the entry to DumpOut: "If OutIndex is not an integral multiple of 128, DumpOut will not be called again." If that were false, the program could put end-of-file marks in the middle of the file.)
The process of writing the buffer as a series of 128-byte records is simple in principle. As we'll see, it is a bit more complicated at the assembly language level.
Figure 3-6. Pseudo-code of the input and output routines for utilities such as PACK and UNPACK.
PutChar( X: a byte ) if ( OutIndex = OutLimit ) then { buffer is full } DumpOutput OutIndex := 0 end if. OutBuffer[OutIndex] := X OutIndex := OutIndex + 1 end PutChar. DumpOutput: Q is a temporary word While ( OutIndex mod 128 <> 0 ) .. OutBuffer[OutIndex] := CpmEof OutIndex := OutIndex + 1 end while. For Q := 0 to OutIndex, step by 128, do.. write record at OutBuffer[Q] if error, Abort "Error writing work file." end for. end DumpOut.
There are several things still to be done after the main loop has ended. They are noted at the end of Figure 3-1. The data remaining in the output buffer have to be written to disk. A call to DumpOutput takes care of that.
Then the work file has to be closed. Calling the BDOS for a Close service request causes it to update the disk directory to reflect the final space allocation to the file (we will go into file space allocation in Chapter 5). The file can be closed with a single SERVICE macro. The output data will be safe on disk then, under its temporary "$$$" filetype.
With the output secure, it is safe to erase any existing file designated by the real output fileref. If the output file is replacing the input file, this is the point at which the input file will disappear. This is also the point at which, if the output fileref were ambiguous, we could do a lot of damage. The Erase service request works just like the ERA command. If the user gave an output fileref of, say, "*.OUT," the program would now erase every file with a filetype of "OUT." PACK did exactly that to me, before the test for an ambiguous output reference was added to the included code!
Having erased any conflicting file, the program is free to rename the work file so that it has the desired filetype. Renaming is again a matter of a single SERVICE macro call.
The assembly language version of PACK that developed from the foregoing design is shown in Listing 3-1. For the most part it is a straightforward translation of the pseudo-code. There are a few assembly language techniques that might be unfamiliar to beginners.
Consider first the main loop, from Main to MainEnd (lines 100-171). The logic here is exactly that of Figure 3-2. I tinkered with a number of variations, and finally returned to this simple structure. It has a performance flaw. When byte B is in the compression list and byte C is one of "hrdlu," compression is not possible. As the code stands, it will write byte B and loop back to test the former byte C again as the new byte B. It ought to be possible to avoid the redundant test of the C byte once it is known to be one of "hrdlu," but all my attempts at this ended up with a complicated structure that wasted as much time in extra branches as it saved. Perhaps a reader can do better.
An earlier version of PACK served me well for months. In that program, the Lookup function did a sequential search of the list of compressible values. That had the disadvantage that Lookup had a variable execution time which was longest when the input byte did not appear in the list. This implementation uses a much faster table lookup. I used the facilities of MAC to construct the table (lines 43-67). The REPT directive builds the basic table, an array of 128 bytes containing FFh. The "$-PRINT" statement eliminates those 128 uninteresting lines from the assembly listing. The macro ENTER (defined in lines 36-41) makes it simple to set the few non-FFh bytes in the table.
The I/O routines turned out to be pleasingly compact. I designed them for general use and made them included units. Listing 3-1 shows PACK after the INCLUDE command has been run, so that you can study the I/O code in context. A newcomer to assembly language should study the method used for testing OutIndex against OutLimit (lines 275-277). The 8080 lacks a 16-bit subtract instruction, but in this case the same result is obtained by making OutLimit a negative number, the twos complement of the size of the buffer. Addition of a negative number is the same as subtraction of a positive number. You might want to verify that a negative OutLimit will produce a carry when added to an OutIndex that contains the size of the buffer.
The comparable test in GetChar (line 225) led me into a bug. As I first coded it, it read
inr a jnc GetChar2
I didn't realize that the increment instruction does not set the carry flag. Of course, GetChar always took the branch, and the program ran, and ran, and ran.
It also took several tries to get the register manipulations in DumpOut correct. A FOR-loop is simple in principle. Implementing one, given the 8080's limited abilities for 16-bit arithmetic, is not simple at all.
The initialization and termination code (line 379 and onward) was simple by comparison. It was just a dull slog through a series of string moves and BDOS service requests. There is no point in expending brain-sweat to optimize such code. For files of any practical size the main loop and I/O code will dominate the program's execution time.
Once PACK was complete, it was simple to create UNPACK. It took only a few moments' thought to arrive at the main loop logic shown in Figure 3-7. A Case statement was obviously the way to express the selection of a different handling procedure for each class of bytes.
The assembly code that developed appears in Listing 3-2. Here I chose a form of table lookup to select the different entries of the Case statement. A 256-byte table (lines 34-61) contains one entry for each possible input byte value. Each entry contains the displacement from the top of the Case structure to the start of the code that handles that byte value. Again, MAC's REPT directive was invaluable in creating the repetitive parts of the table. Imagine hand-coding all 256 entries!
The bulk of UNPACK's code is identical to that of PACK, and was included from the library. Note how much shorter the listing is when the included routines are omitted.
Figure 3-7. The pseudo-code logic of the main loop of UNPACK.
Program UNPACK (inputfile, outputfile) ...constants, initialization as for PACK while ( B <> CpmEof ) Case of B: 00h...7Fh: do nothing. 80h...E7h: PutChar( List[(B shiftright 3) and 0Fh] ) B := List[B and 07h] EndLine: PutChar(CR) B := LF EndLineTab: PutChar(CR) PutChar(LF) B := TAB CopyThis: Shift(B,C) any other: Abort "Impossible input byte." end case. Shout(B,C) end while. ...remainder as for PACK, omitting Lookup
As given, PACK is not very effective when applied to data other than English text. Applied to its own source code, for example, it can compress only the leading TABs and a byte or two in each comment. The addition of run-length encoding would increase its effect on some files without impairing its use for documents.
There are 21 unused compression byte values. The byte values EBh through EFh might best be reserved for other special-case compressions. The sequence TAB, TAB occurs in some files, for example. Other files contain pairs or triplets of zero characters with some frequency. A BASIC data file often contains many repetitions of comma, zero and comma, quote. A sequence-numbered file will contain the sequence CR, LF, zero at the head of most lines.
The 16 values from F0h to FFh could serve to encode runs of identical characters. CP/M is, I believe, unique among operating systems in its convention for the use of TAB characters in data. By applying a system-wide convention for tab stops, CP/M makes it possible for a TAB to have consistent interpretation no matter what file it appears in. In other operating systems this is not true, and so TABs almost never appear in their files. Long runs of spaces appear instead. In such systems, the best use of F0h to FFh would be to encode runs of 2 to 17 spaces.
In CP/M runs of spaces are uncommon. The Fxh bytes, then, should be used to encode any run of 3 to 18 identical characters. Such a run could be encoded as two bytes Fnh, xxh, where xxh is the character to be duplicated 3+n times when unpacked. That is, F0h followed by "*" would stand for three asterisks; F1h and "*" would stand for four asterisks, and so on.
You might enjoy adding run-length encoding to PACK. If you do so, of course, you must also add run-length decoding to UNPACK. One coding trick may help in PACK. The program must stop encoding a run in two cases: when the run breaks (i.e. when, after a Shift, B doesn't equal C), and when the maximum run code FFh has been reached. If you begin the count with F1h rather than with F0h, then the machine's Zero flag will be set after the fifteenth increment of the run code. That makes it easy to detect the maximum run-count. Be very careful not to violate any of the assertions we checked earlier!