Word Processor Data STR
Abstract
The data structure used to maintain the sequence of characters is an important part of a text editor. This paper investigates and evaluates the range of possible data structures for text sequences. The ADT interface to the text sequence component of a text editor is examined. Six common sequence data structures (array, gap, list, line pointers, fixed size buffers and piece tables) are examined and then a general model of sequence data structures that encompasses all six structures is presented. The piece table method is explained in detail and its advantages are presented. The design space of sequence data structures is examined and several variations on the ones listed above are presented. These sequence data structures are compared experimentally and evaluated based on a number of criteria. The experimental comparison is done by implementing each data structure in an editing simulator and testing it using a synthetic load of many thousands of edits. We also report on experiments on the sensitivity of the results to variations in the parameters used to generate the synthetic editing load.
1 Introduction
The central data structure in a text editor is the one that manages the sequence of characters that represents the current state of the file that is being edited. Every text editor requires such a data structure but books on data structures do not cover data structures for text sequences. Articles on the design of text editors often discuss the data structure they use [1, 3, 6, 8, 11, 12] but they do not cover the area in a general way. This article is concerned with such data structures.

Figure 1 shows where sequence data structures fit in with other data structures. Some ordered sets are ordered by something intrinsic in the items in the sets (e.g., the value of an integer, the lexicographic position of a string) and the position of an inserted item depends on its value and the values of the items already in the set. Such data structures are mainly concerned with fast searching. Data structures for this type of ordered set have been studied extensively. The other possibility is for the order to be determined by where the items are placed when they are inserted into the set. If inserts and deletes are restricted to the two ends of the ordering then you have a deque. For deques, the two basic data structures are an array (used circularly) and a linked list. Nothing beyond this is necessary due to the simplicity of the ADT interface to deques. If you can insert and delete items from anywhere in the ordering you have a sequence. An important subclass is sequences where reading an item in the sequence (by position number) is extremely localized. This is the case for text editors and it is this subclass that is examined in this paper.

[Figure 1 diagram not reproduced; labels in the figure: Tree, Heap, ..., Hash table, Linked List, Array, Gap, Line spans, Piece tables]
Figure 1: Ordered sets

Author's address: Computer Science Department, University of New Mexico, Albuquerque, New Mexico 87131, office: 505-277-5446, messages: 505-277-3112, fax: 505-277-0813, email: crowley@unmvax.cs.unm.edu

A linked list and an array are the two obvious data structures for a sequence. Neither is suitable for a general purpose text editor (a linked list takes up too much memory and an array is too slow because it requires too much data movement) but they provide useful base cases on which to build more complex sequence data structures. The gap method is a simple extension of an array: it is simply an array with a gap in the middle where characters are inserted and deleted. Many text editors use the gap method since it is simple and quite efficient but the demands on a modern text editor (multiple files, very large files, structured text, sophisticated undo, virtual memory, etc.) encourage the investigation of more complicated data structures which might handle these things more effectively.

The more sophisticated sequence data structures keep the sequence as a recursive sequence of spans of text. The line span method keeps each line together and keeps an array or a linked list of line pointers. The fixed buffer method keeps a linked list of fixed size buffers each of which is partially full of text from the sequence. Both the line span method and the fixed buffer method have been used for many text editors.

A less commonly used method is the piece table method which keeps the text as a sequence of "pieces" of text from either the original file or an "added text" file. This method has many advantages and these will become clear as the methods are presented in detail and analyzed. A major purpose of this paper is to describe the piece table method and explain why it is a good data structure for text sequences.
Looking at the methods in detail suggests a general model of sequence data structures that subsumes them all. Based on examination of this general model I will propose several new sequence data structures that do not appear to have been tried before. It is difficult to analyze these algorithms mathematically so an experimental approach was taken. I have implemented a text editor simulator and a number of sequence data structures. Using this as an experimental test bed I have compared the performance of each of these sequence data structures under a variety of conditions. Based on these experiments and other considerations I conclude with recommendations on which sequence data structures are best in various situations. In almost all cases, either the gap method or the piece table method is the best data structure.
2 Sequence Interfaces
It is useful to start out with a definition of a sequence and the interface to it. Since the more sophisticated text sequence data structures are recursive in that they require a component data structure that maintains a sequence of pointers, I will formulate the sequence as a general sequence of "items" rather than as a sequence of characters. This supports discussion of the recursive sequence data structures better.
Syntax:
    Empty  :                               → Sequence
    Insert : Sequence × Position × Item    → Sequence
    Delete : Sequence × Position           → Sequence
    ItemAt : Sequence × Position           → Item ∪ {EndOfFile}

Types:
    s : Sequence
    i : Item
    p, p1, p2 : Position

Axioms:
    1. Delete(Empty, p) = Empty
    2. Delete(Insert(s, p1, i), p2) =
           if p1 < p2 then Insert(Delete(s, p2 - 1), p1, i)
           if p1 = p2 then s
           if p1 > p2 then Insert(Delete(s, p2), p1 - 1, i)
    3. ItemAt(Empty, p) = EndOfFile
    4. ItemAt(Insert(s, p1, i), p2) =
           if p1 < p2 then ItemAt(s, p2 - 1)
           if p1 = p2 then i
           if p1 > p2 then ItemAt(s, p2)

The definition of a Sequence is relatively simple. Axiom 1 says that deleting from an Empty Sequence is a no-op. This could be considered an error. Axiom 2 allows the reduction of a Sequence of Inserts and Deletes to a Sequence containing only Inserts. This defines a canonical form of a Sequence which is a Sequence of Inserts on an initial Empty Sequence. Axiom 3 implies that reading outside the Sequence returns a special EndOfFile item. This also could have been an error. Axiom 4 defines the semantics of a Sequence by defining what is at each position of a canonical Sequence (footnote 1).
    typedef int ReturnCode;       /* 1 for success, zero or negative for failure */
    typedef int Position;         /* a position in the sequence */
                                  /* the first item in the sequence is at position 0 */
    typedef unsigned char Item;   /* they are sequences of eight bit bytes */
    typedef struct {
        /* To be determined */
        /* Whatever information we need for the data structures we choose */
    } Sequence;

In this interface the only operations that change the Sequence are Insert and Delete.

    Sequence Empty( );
    ReturnCode Insert( Sequence *sequence, Position position, Item ch );
    ReturnCode Delete( Sequence *sequence, Position position );
    Item ItemAt( Sequence *sequence, Position position );
        ItemAt does not actually require a pointer to a Sequence, since no change to the sequence is being made, but we expect Sequences to be large structures that should not be passed around by value. I am ignoring error returns (e.g., position out of range) for the purposes of this discussion. These are easily added if desired.
    ReturnCode Close( Sequence *sequence );

Footnote 1: I am ignoring the error of inserting beyond the end of the existing sequence.
Many variations are possible. The next few paragraphs discuss some of them.

Any practical interface would allow the sequence to be initialized with the contents of a file. In theory this is just the Empty operation followed by an Insert operation for each character in the initializing file. Of course, this is too inefficient for a real text editor (footnote 2). Instead we would have a NewSequence operation:

    Sequence NewSequence( char *file_name );
        The sequence is initialized with the contents of the file whose name is contained in `file_name'.

Usually the Delete operation will delete any logically contiguous subsequence:

    ReturnCode Delete( Sequence *sequence, Position beginPosition, Position endPosition );

Sometimes the Insert operation will insert a subsequence instead of just a single character:

    ReturnCode Insert( Sequence *sequence, Position position, Sequence sequenceToInsert );

Sometimes Copy and Move are separate operations (instead of being composed of Inserts and Deletes):

    ReturnCode Copy( Sequence *sequence, Position fromBegin, Position fromEnd, Position toPosition );
    ReturnCode Move( Sequence *sequence, Position fromBegin, Position fromEnd, Position toPosition );

    ReturnCode SequenceAt( Sequence *sequence, Position fromBegin, Position fromEnd, Sequence *returnedSequence );

Footnote 2: Although this is the method I use in my text editor simulator described later.

These variations will not affect the basic character of the data structure used to implement the sequence or the comparisons between them that follow. Therefore I will assume the first interface (Empty, Insert, Delete, ItemAt, and Close).
number of ItemAts, often just a few characters around the edit but possibly the whole rest of the window (if a newline is inserted or deleted).

The criteria used for comparing sequence data structures are:
- The time taken by each operation
- The paging behavior of each operation
- The amount of space used by the sequence data structure
- How easily it fits in with typical file and IO systems
- The complexity of (and space taken by) the implementation

Later I will present timings comparing the basic operations for a range of sequence data structures. These timings will be taken from example implementations of the data structures and a text editor simulator that calls these implementations.
4 De nitions
An item is the basic element. Usually it will be a character. A sequence is an ordered set of items. Sequential items in a sequence are said to be logically contiguous.

The sequence data structure will keep the items of the sequence in buffers. A buffer is a set of sequentially addressed memory locations. A buffer contains items from the sequence but not necessarily in the same order as they appear logically in the sequence. Sequentially addressed items in a buffer are physically contiguous. When a string of items is physically contiguous in a buffer and is also logically contiguous in the sequence we call them a span.

A descriptor is a pointer to a span. In some cases the buffer is actually part of the descriptor and so no pointer is necessary. This variation is not important to the design of the data structures but is more a memory management issue. Sequence data structures keep spans in buffers and keep enough information (in terms of descriptors and sequences of descriptors) to piece together the spans to form the sequence.

Buffers can be kept in memory but most sequence data structures allow buffers to get as large as necessary or allow an unlimited number of buffers. Thus it is necessary to keep the buffers on disk in disk files. Many sequence data structures use buffers of unlimited size, that is, their size is determined by the file contents. This requires the buffer to be a disk file. With enough disk block caching this can be made as fast as necessary.

The concepts of buffers, spans and descriptors can be found in almost every sequence data structure. Sequence data structures vary in terms of how these concepts are used. If a sequence data structure uses a variable number of descriptors it requires a recursive sequence data structure to keep track of the sequence of descriptors.

In section 5 we will look at three sequence data structures that use a fixed number of descriptors and in section 6 we will look at three sequence data structures that use a variable number of descriptors. Section 7 will present a general model of a sequence data structure that encompasses all these data structures.
Figure 2: The array method

The array method keeps the items of the sequence in physically contiguous order. Deletes are handled by moving all the items following the deleted item to fill in the hole left by the deleted item. Inserts are handled by moving all the items that will follow the item to be inserted in order to make room for the new item. ItemAt is an array reference. The buffer can be extended as much as necessary to hold the data.

Clearly this would not be an efficient data structure if a lot of editing was to be done on large files. It is a useful base case and is a reasonable choice in situations where few inserts and deletes are made (e.g., a read-only sequence) or the sequences are relatively small (e.g., a one-line text editor). This data structure is sometimes used to hold the sequence of descriptors in the more complex sequence data structures, for example, an array of line pointers (see section 6).
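The data movement described above can be sketched in C. This is only an illustrative sketch, not the implementation used in the experiments: the names ArraySeq, arr_insert, arr_delete and arr_item_at are hypothetical, positions are size_t rather than the Position type, the buffer is grown by doubling (an assumption), and EndOfFile is represented by -1.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned char Item;

typedef struct {
    Item  *buf;   /* the single buffer holding the whole sequence */
    size_t len;   /* number of items currently in the sequence */
    size_t cap;   /* allocated size of buf */
} ArraySeq;

/* Insert ch at pos by moving every following item up one place. */
int arr_insert(ArraySeq *s, size_t pos, Item ch) {
    if (pos > s->len) return 0;              /* beyond the end: failure */
    if (s->len == s->cap) {                  /* extend the buffer (assumed policy: doubling) */
        size_t ncap = s->cap ? 2 * s->cap : 16;
        Item *nb = realloc(s->buf, ncap);
        if (nb == NULL) return 0;
        s->buf = nb;
        s->cap = ncap;
    }
    assert(s->len < s->cap);
    memmove(s->buf + pos + 1, s->buf + pos, s->len - pos);
    s->buf[pos] = ch;
    s->len++;
    return 1;
}

/* Delete the item at pos by moving every following item down one place. */
int arr_delete(ArraySeq *s, size_t pos) {
    if (pos >= s->len) return 0;             /* nothing there: sequence unchanged */
    memmove(s->buf + pos, s->buf + pos + 1, s->len - pos - 1);
    s->len--;
    return 1;
}

/* ItemAt is a bounds check and an array reference; -1 stands for EndOfFile. */
int arr_item_at(const ArraySeq *s, size_t pos) {
    return pos < s->len ? s->buf[pos] : -1;
}
```

Both memmove calls touch every item after the edit position, which is why the cost of an edit grows with the length of the sequence.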
Figure 3: The gap (or two span) method

The gap method keeps the sequence in a single buffer as two spans separated by a gap, and the gap is kept at the point where editing operations take place. Inserts are handled by using up one of the places in the gap and incrementing the length of the first descriptor (or decrementing the begin pointer of the second descriptor). Deletes are handled by decrementing the length of the first descriptor (or incrementing the begin pointer of the second descriptor). ItemAt is a test (first or second span?) and an array reference.

When the cursor moves the gap is also moved so if the cursor moves 100 items forward then 100 items have to be moved from the second span to the first span (and the descriptors adjusted). Since most cursor moves are local, not that many items have to be moved in most cases. Actually the gap does not need to move every time the cursor is moved. When an editing operation is requested then the gap is moved. This way moving the cursor around the file while paging or searching will not cause unnecessary gap moves.

If the gap fills up, the second span is moved in the buffer to create a new gap. There must be an algorithm to determine the new gap size. As in the array case, the buffer can be extended to any length. In practice, it is usually increased in size by some fixed increment or by some fixed percentage of the current buffer size and this becomes the size of the new gap. With virtual memory we can make the buffer so large that it is unlikely to fill up. And with some help from the operating system, we can expand the gap without actually moving any data. This method does use up large quantities of virtual address space, however.

This method is simple and surprisingly efficient. The gap method is also called the split buffer method and is discussed in [9].
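The following C sketch illustrates the operations just described, under stated assumptions: the names GapSeq, gap_move, gap_grow and so on are hypothetical, the gap is moved lazily (only when an edit is requested), the buffer is doubled when the gap fills up, and -1 stands for EndOfFile. It is not the code used in the experiments.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned char Item;

typedef struct {
    Item  *buf;        /* one buffer laid out as [first span][gap][second span] */
    size_t cap;        /* total size of buf */
    size_t gap_start;  /* length of the first span */
    size_t gap_end;    /* index in buf where the second span begins */
} GapSeq;

size_t gap_length(const GapSeq *s) { return s->cap - (s->gap_end - s->gap_start); }

/* Move the gap so that it begins at logical position pos. */
void gap_move(GapSeq *s, size_t pos) {
    assert(pos <= gap_length(s));
    if (pos < s->gap_start) {            /* move items from the first span to the second */
        size_t n = s->gap_start - pos;
        memmove(s->buf + s->gap_end - n, s->buf + pos, n);
        s->gap_start -= n;
        s->gap_end -= n;
    } else if (pos > s->gap_start) {     /* move items from the second span to the first */
        size_t n = pos - s->gap_start;
        memmove(s->buf + s->gap_start, s->buf + s->gap_end, n);
        s->gap_start += n;
        s->gap_end += n;
    }
}

/* The gap is full: extend the buffer and move the second span to its end. */
int gap_grow(GapSeq *s) {
    size_t ncap = s->cap ? 2 * s->cap : 16;   /* assumed growth policy */
    size_t tail = s->cap - s->gap_end;        /* length of the second span */
    Item *nb = realloc(s->buf, ncap);
    if (nb == NULL) return 0;
    memmove(nb + ncap - tail, nb + s->gap_end, tail);
    s->buf = nb;
    s->gap_end = ncap - tail;
    s->cap = ncap;
    return 1;
}

int gap_insert(GapSeq *s, size_t pos, Item ch) {
    if (pos > gap_length(s)) return 0;
    if (s->gap_start == s->gap_end && !gap_grow(s)) return 0;
    gap_move(s, pos);
    s->buf[s->gap_start++] = ch;         /* use up one place in the gap */
    return 1;
}

int gap_delete(GapSeq *s, size_t pos) {
    if (pos >= gap_length(s)) return 0;
    gap_move(s, pos + 1);                /* the item is now last in the first span */
    s->gap_start--;                      /* shrinking the first span deletes it */
    return 1;
}

/* ItemAt: a test (first or second span?) and an array reference. */
int gap_item_at(const GapSeq *s, size_t pos) {
    if (pos >= gap_length(s)) return -1;
    return pos < s->gap_start ? s->buf[pos]
                              : s->buf[s->gap_end + (pos - s->gap_start)];
}
```

Repeated edits at the same place never call memmove with a nonzero count, which is the source of the method's efficiency for localized editing.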
Figure 4: The linked list method

The linked list method uses a lot of extra space and so is not appropriate for a large sequence but is frequently used as a way of implementing a sequence of descriptors required in the more complex sequence data structures. In fact, it is the most common method for that.

One could think of the array method as a special case of the linked list method. An array is really a linked list with the links implicit, that is, the link is computed to be the next physically sequential address after the current item. In this view, the linked list method is the only primitive sequence data type. The array method is a special case of the linked list method and the gap method is a variation on the array method.
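A minimal C sketch of a linked list of items makes the space overhead concrete: every one-byte item carries a pointer with it. The names Node, ListSeq, list_link_at and the rest are hypothetical, a singly linked list is assumed, and -1 stands for EndOfFile.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned char Item;

typedef struct Node {
    struct Node *next;   /* explicit link: the per-item space overhead */
    Item item;
} Node;

typedef struct { Node *head; } ListSeq;

/* Return the address of the link leading to position pos,
   or NULL if pos lies beyond the end of the sequence. */
Node **list_link_at(ListSeq *s, size_t pos) {
    assert(s != NULL);
    Node **link = &s->head;
    for (size_t i = 0; i < pos; i++) {
        if (*link == NULL) return NULL;
        link = &(*link)->next;
    }
    return link;
}

/* Insert: no items move, only one link is rewritten. */
int list_insert(ListSeq *s, size_t pos, Item ch) {
    Node **link = list_link_at(s, pos);
    if (link == NULL) return 0;
    Node *n = malloc(sizeof *n);
    if (n == NULL) return 0;
    n->item = ch;
    n->next = *link;
    *link = n;
    return 1;
}

/* Delete: unlink the node and free it. */
int list_delete(ListSeq *s, size_t pos) {
    Node **link = list_link_at(s, pos);
    if (link == NULL || *link == NULL) return 0;
    Node *dead = *link;
    *link = dead->next;
    free(dead);
    return 1;
}

/* ItemAt has to follow pos links from the head; -1 stands for EndOfFile. */
int list_item_at(ListSeq *s, size_t pos) {
    Node **link = list_link_at(s, pos);
    return (link != NULL && *link != NULL) ? (*link)->item : -1;
}
```

In this sketch ItemAt walks from the head of the list on every call; an editor would normally remember the node nearest the cursor, but that refinement is left out here.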
                            Fixed size buffers        Variable size buffers
Content determined spans    (Line size is limited)    Line spans
Editor determined spans     Fixed size buffers        Piece tables

These are the three methods that will be examined in this section.
Figure 5: The line spans method

Line deletes are handled by deleting the line descriptor. Deleting characters within a line involves moving the rest of the characters in the line up to fill the gap. Since any one line is probably not that long this is reasonably efficient. Line inserts are handled by adding a line descriptor. Inserting characters in a line involves copying the initial part (before the insert) of the line buffer to new space allocated for the line, adding the characters to be inserted, copying the rest of the line and pointing the descriptor at the new line. Multiple line inserts and deletes are combinations of these operations. Caching can make this all fairly efficient.

Usually new space is allocated at the end of the buffer and the space occupied by deleted or changed lines is not reused since the effort of dynamic memory allocation is not worth the trouble. A disk file that continues to grow at the end can be handled quite efficiently by most file systems.

This method uses as many descriptors as there are lines in the file, that is, a variable number of descriptors, hence there is a recursive problem of keeping these descriptors in a sequence. Typically one of the basic methods described in section 5 is used to maintain the sequence of line descriptors. The linked list method can be used (as in Ved [6]) or the array method (as in Godot [11], Gina [1] and ed [3]).
NOTE: reference to SW Tools and SW Tools Sampler here. For linked lists of lines.
These simpler methods are acceptable since the number of line descriptors is much smaller than the number of characters in the file being edited. The linked list method allows efficient insertions and deletions but requires the management of list nodes.

The line span method is acceptable for a line oriented text editor but is not as common these days since strict line orientation is seen as too restrictive. It does require preprocessing of the entire buffer before you can begin editing since the line descriptors have to be set up.
Figure 6: Fixed size buffers

The disk block size (or some multiple of the disk block size) is usually the most efficient choice for the fixed size buffers since then the editor can do its own disk management more easily and not depend on the virtual memory system or the file system for efficient use of the disk. Usually a lower bound on the number of items in a buffer is set (half the buffer size is a common choice). This requires moving items between buffers and occasionally merging two buffers to prevent the accumulation of large numbers of buffers. There are four problems with letting too many buffers accumulate: wasted space in the buffers, the recursive sequence of descriptors gets too large, the probability that an edit will be confined to one buffer is reduced, and [...]

As an example of fixed size buffers, suppose disk blocks are 4K bytes long. Each buffer will be 4K bytes long and will contain a span of length from 2K to 4K bytes. Each buffer is handled using the array method, that is, inserts and deletes are done by moving the items up or down in the buffer. Typically an edit will only affect one buffer but if a buffer fills up it is split into two buffers and if the span in a buffer falls below 2K bytes then items are moved into it from an adjacent buffer or it is coalesced with an adjacent buffer. Each descriptor points to a buffer and contains the length of the span in the buffer.

The fixed size buffer method also requires a recursive sequence for the descriptors. This could be any of the basic methods but most examples from the literature use a linked list of descriptors. The loose packing allows small changes to be made within the buffers and the fact that the buffers are linked makes it easy to add and delete buffers. This method is used in the text editors Gina [1] and sam [12] and is described by Kyle [9].
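The 4K example above can be sketched in C as follows. This is a partial sketch under stated assumptions: the names Buf, FsbSeq, fsb_insert and fsb_item_at are hypothetical, the buffer is embedded in its descriptor, descriptors form a singly linked list, and only Insert (with splitting of a full buffer) and ItemAt are shown. Delete, with the redistribution and coalescing needed to keep each span at 2K or more, is omitted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned char Item;
#define BUF_SIZE ((size_t)4096)      /* one 4K disk block, as in the example */

typedef struct Buf {
    struct Buf *next;                /* descriptors are kept as a linked list */
    size_t len;                      /* length of the span held in data */
    Item data[BUF_SIZE];             /* the buffer is part of the descriptor */
} Buf;

typedef struct { Buf *head; } FsbSeq;

int fsb_insert(FsbSeq *s, size_t pos, Item ch) {
    if (s->head == NULL) {
        s->head = calloc(1, sizeof(Buf));
        if (s->head == NULL) return 0;
    }
    /* Find the buffer holding pos; a position at the end of a span stays there. */
    Buf *b = s->head;
    while (pos > b->len && b->next != NULL) { pos -= b->len; b = b->next; }
    if (pos > b->len) return 0;                  /* beyond the end of the sequence */

    if (b->len == BUF_SIZE) {                    /* full: split into two half-full buffers */
        Buf *n = calloc(1, sizeof(Buf));
        if (n == NULL) return 0;
        size_t half = BUF_SIZE / 2;
        memcpy(n->data, b->data + half, BUF_SIZE - half);
        n->len = BUF_SIZE - half;
        b->len = half;
        n->next = b->next;
        b->next = n;
        if (pos > half) { pos -= half; b = n; }  /* the edit lands in the new buffer */
    }
    assert(b->len < BUF_SIZE);
    /* Inside one buffer this is the array method: move the items up. */
    memmove(b->data + pos + 1, b->data + pos, b->len - pos);
    b->data[pos] = ch;
    b->len++;
    return 1;
}

/* ItemAt walks the descriptors, subtracting span lengths; -1 stands for EndOfFile. */
int fsb_item_at(const FsbSeq *s, size_t pos) {
    for (const Buf *b = s->head; b != NULL; b = b->next) {
        if (pos < b->len) return b->data[pos];
        pos -= b->len;
    }
    return -1;
}
```

An insert moves at most one buffer's worth of items, regardless of how long the whole sequence is.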
Figure 7: The piece table method

Figure 8 shows the piece table after the file is read in initially. This is a very short file containing only 20 characters. Figure 9 shows the piece table after the word "large " has been deleted. Figure 10 shows the piece table after the word "English " has been added. Notice that, in general, a delete increases the number of pieces in the piece table by one and an insert increases the number of pieces in the piece table by two.
Figure 8 (the initial piece table)
    Original file (positions 0 to 19): "A large span of text"
    Add file: (empty)
    Piece table:
        File        Start   Length
        Original    0       20
    Sequence: "A large span of text"

Figure 9: A piece table after a delete
    Original file (positions 0 to 19): "A large span of text"
    Add file: (empty)
    Piece table:
        File        Start   Length
        Original    0       2
        Original    8       12
    Sequence: "A span of text"

Figure 10: A piece table after a delete and an insert
    Original file (positions 0 to 19): "A large span of text"
    Add file: "English "
    Piece table:
        File        Start   Length
        Original    0       2
        Original    8       8
        Add         0       8
        Original    16      4
    Sequence: "A span of English text"

Let us look at another example. Suppose we start with a new file that is 1000 bytes long and make the following edits.

1. Six characters inserted (typed in) after character 900.
2. Character 600 deleted.
3. Five characters inserted (typed in) after character 500.

The piece table after these edits will look like this:

    file    start   length   logical offset
    orig    0       500      0
    add     6       5        500
    orig    500     100      505
    orig    601     300      605
    add     0       6        905
    orig    901     100      911

The "logical offset" column does not actually exist in the piece table but can be computed from it (it is the running total of the lengths). These logical offsets are not kept in the piece table because they would all have to be updated after each edit.

The piece table method has several advantages.
- The original file is never changed so it can be a read-only file. This is advantageous for caching systems since the data never changes.
- The add file is append-only and so, once written, it never changes either.
- Items never move once they have been written into a buffer so they can be pointed to by other data structures working together with the piece table.
- Undo is made much easier by the fact that items are never written over. It is never necessary to save deleted or changed items. Undo is just a matter of keeping the right piece descriptors around. Unlimited undoes can be easily supported.
- No file preprocessing is required. The initial piece can be set up only knowing the length of the original file, information that can be quickly and easily obtained from the file system. Thus the size of the file does not affect the startup time.
- The amount of memory used is a function of the number of edits, not the size of the file. Thus edits on very large files will be quite efficient.

The above description implies that a sequence must start out as one big piece and only inserts and deletes can add pieces. Following this rule keeps the number of pieces at a minimum and the fewer pieces there are the more efficient the ItemAt operations are. But the text editor is free to split pieces at other times to suit its purposes.

For example, a word processor needs to keep formatting information as well as the text sequence itself. This formatting information can be kept as a tree where the leaves of the tree are pieces. A word in bold face would be kept as a separate piece so it could be pointed to by a "bold" format node in the format tree. The text editor Lara [8] uses piece tables this way.

As another example, suppose the text editor implements hypertext links between any two spans of text in the sequence. The span at each end of the link can be isolated in a separate piece and the link data structure would point to these two pieces (footnote 6). This technique is used in the Pastiche text editor [5] for fine-grained hypertext.

These techniques work because piece table descriptors and items do not move when edits occur and so these tree structures will be maintained with little extra work even if the sequence is edited heavily.

Overall, the piece table is an excellent data structure for sequences and is normally the data structure of choice. Caching can be used to speed up this data structure so it is competitive with other data structures for sequences. Piece tables are used in the text editors Bravo [10], Lara [8], Point [4] and Pastiche [5]. Fraser and Krishnamurthy [7] suggest the use of piece tables as a way to implement their idea of "live text".

Footnote 6: There are some details to deal with to make this all work but they are easy to handle.
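The piece table edits on the 20-character example can be sketched in C. This is a sketch under stated assumptions, not the implementation measured in this paper: the names PieceTable, pt_split, pt_insert, pt_delete and pt_item_at are hypothetical, the sequence of pieces is kept with the array method in a small fixed-capacity array, both "files" are in-memory strings, Delete takes a range (one of the interface variations mentioned in section 2), and -1 stands for EndOfFile.

```c
#include <assert.h>
#include <string.h>

typedef enum { ORIGINAL, ADD } Source;
typedef struct { Source file; size_t start, length; } Piece;

enum { MAX_PIECES = 64, ADD_CAP = 256 };      /* small fixed limits for the sketch */

typedef struct {
    const char *orig;                  /* original file: never modified */
    char   add[ADD_CAP];               /* add file: append-only */
    size_t add_len;
    Piece  pieces[MAX_PIECES];         /* sequence of descriptors (array method) */
    size_t npieces;
} PieceTable;

void pt_init(PieceTable *t, const char *orig) {
    t->orig = orig;
    t->add_len = 0;
    t->npieces = 1;                    /* one piece covering the whole original file */
    t->pieces[0] = (Piece){ ORIGINAL, 0, strlen(orig) };
}

size_t pt_length(const PieceTable *t) {
    size_t total = 0;
    for (size_t i = 0; i < t->npieces; i++) total += t->pieces[i].length;
    return total;
}

/* Ensure a piece boundary at logical position pos (pos <= length) and return
   the index of the piece that begins there (npieces if pos is the end). */
size_t pt_split(PieceTable *t, size_t pos) {
    size_t i = 0, off = 0;             /* off is the running total of lengths */
    while (i < t->npieces && off + t->pieces[i].length <= pos) {
        off += t->pieces[i].length;
        i++;
    }
    if (i == t->npieces || off == pos) return i;
    assert(t->npieces < MAX_PIECES);
    size_t left = pos - off;           /* pos falls inside piece i: split it in two */
    memmove(&t->pieces[i + 2], &t->pieces[i + 1], (t->npieces - i - 1) * sizeof(Piece));
    t->pieces[i + 1] = (Piece){ t->pieces[i].file, t->pieces[i].start + left,
                                t->pieces[i].length - left };
    t->pieces[i].length = left;
    t->npieces++;
    return i + 1;
}

int pt_insert(PieceTable *t, size_t pos, const char *text) {
    size_t n = strlen(text);
    if (n == 0) return 1;
    if (pos > pt_length(t) || t->npieces + 2 > MAX_PIECES || t->add_len + n > ADD_CAP)
        return 0;
    size_t i = pt_split(t, pos);
    memmove(&t->pieces[i + 1], &t->pieces[i], (t->npieces - i) * sizeof(Piece));
    t->pieces[i] = (Piece){ ADD, t->add_len, n };
    memcpy(t->add + t->add_len, text, n);      /* append to the add file */
    t->add_len += n;
    t->npieces++;
    return 1;
}

/* Delete n items starting at pos: only descriptors change, no items move. */
int pt_delete(PieceTable *t, size_t pos, size_t n) {
    if (pos + n > pt_length(t) || t->npieces + 2 > MAX_PIECES) return 0;
    size_t first = pt_split(t, pos);
    size_t last  = pt_split(t, pos + n);       /* pieces [first, last) are deleted */
    memmove(&t->pieces[first], &t->pieces[last], (t->npieces - last) * sizeof(Piece));
    t->npieces -= last - first;
    return 1;
}

/* ItemAt walks the pieces, subtracting lengths; -1 stands for EndOfFile. */
int pt_item_at(const PieceTable *t, size_t pos) {
    for (size_t i = 0; i < t->npieces; i++) {
        const Piece *p = &t->pieces[i];
        if (pos < p->length) {
            const char *buf = (p->file == ORIGINAL) ? t->orig : t->add;
            return (unsigned char)buf[p->start + pos];
        }
        pos -= p->length;
    }
    return -1;
}
```

Starting from "A large span of text", pt_delete(&t, 2, 6) followed by pt_insert(&t, 10, "English ") produces the four pieces shown in Figure 10. The unoptimized pt_item_at also shows why ItemAt is slow for this method without caching: each call recomputes the running total of piece lengths.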
4. The main memory is of limited size and hence cannot hold all the items in large sequences (footnote 7).
5. The environment provides dynamic memory allocation (although some sequence data structures will do their own and not use the environment's dynamic memory allocation).
6. The environment provides a reasonably efficient file system for item storage that provides files of (for all practical purposes) unlimited size.

The following concepts are used in the model.
- An item is the basic component of our sequences. In most cases it will be a character or byte (but it might be a descriptor in a recursive sequence data structure).
- A sequence is an ordered set of items. During editing, items will be inserted into and deleted from the sequence. The items in the sequence are logically contiguous.
- A buffer is some contiguous space in main memory or on disk that can contain items (Assumption 4). All items in the sequence are kept in buffers (Assumption 3). Consecutive items in a buffer are physically contiguous.
- A span is one or more items that are logically contiguous in the sequence and are also physically contiguous in the buffer (Assumption 1).
- A descriptor is a data structure that represents a span. Usually the descriptor contains a pointer to the span but it is also possible for the descriptor to contain the buffer that contains the span.
- A sequence data structure is either
  - A basic sequence data structure, which is one of:
    - An array.
    - An array with a gap.
    - A linked list of items.
    - A more complex linked structure of items.
  - A recursive sequence data structure, which comprises:
    - Zero or more buffers each of which contains zero or more spans.
    - A (recursive) sequence data structure of zero or more descriptors to spans in the buffers.

This model is recursive in that to implement a sequence of items it is necessary to implement a sequence of descriptors. This recursion is usually only one step, that is, the sequence of descriptors is implemented with a basic sequence data structure. The deficiencies of the basic sequence data structures for implementing character sequences are less critical for sequences of descriptors since there are usually far fewer descriptors and so sophisticated methods are not required.

Footnote 7: Even if virtual memory is provided there will be an upper bound on it in any actual system configuration. In addition, most sophisticated sequence data structures do not rely on virtual memory to efficiently shuttle sequence data between main memory and the disk. Usually the program can do better since it understands exactly how the data is accessed.
sequence data structures to implement the sequence of descriptors. The recursive methods are beneficial when the sequences are quite large so we might use a two-level recursive method if the number of descriptors was quite large. As we mentioned above, this might be the case with a piece table.

So there are four new variations that we have uncovered in this analysis.
- The fixed size buffers method using the gap method inside each buffer.
- The fixed size buffers method using the gap method for descriptors. This might be better if virtual memory performance was a consideration.
- The piece table method using the gap method for descriptors. This might be better if virtual memory performance was a consideration.
- A two-level recursive method that uses a recursive method to maintain the sequence of descriptors. This would be suitable if there is a very large number of descriptors.
- Null: The null method that does nothing. This is for comparison since it measures the overhead of the procedure calls.
- Arr: The array method.
- List: The list method.
- Gap: The gap method.
- FsbA: The fixed size buffer method with the array method used to maintain the sequence inside each buffer.
- FsbAOpt: The fixed size buffer method with the array method used to maintain the sequence inside each buffer and with ItemAt optimized.
- FsbG: The fixed size buffer method with the gap method used to maintain the sequence inside each buffer.
- Piece: The piece table method.
- PieceOpt: The piece table method with ItemAt optimized.

I will present graphs for Insert and ItemAt operations. The Delete operation takes about the same time as the Insert operation for all these sequence data structures.

Figure 11 shows how the speed of the ItemAt operation is affected by the size of the sequence. It basically has no effect except for the interesting result that the ItemAt operation for the PieceOpt method gets faster for longer sequences. The reason for this is that for longer sequences the caching used in the optimization becomes more effective. Each ItemAt is faster although (since the sequences are longer) ItemAt is called many more times.

[Figure 11 plot not reproduced: ItemAt time per call against sequence size (10000 to 90000) for the Gap, Piece, FsbG, List, FsbA, Arr, PieceOpt and FsbAOpt methods]
Figure 11: ItemAt times as the length of the sequence varies

The Arr method is the fastest and is nearly as fast as the Null method. The Gap method is only a little slower. The FsbA method is much slower and is about the same as the List method, the FsbG method and the Piece method. The optimized FsbA method is nearly as fast as the Gap method and the PieceOpt method gets close. Even so, the PieceOpt ItemAt is half the speed of the Arr ItemAt. Since ItemAt is such a frequent operation it is necessary to optimize (with caching) all the methods except for the Arr and the Gap method.

Figure 12 shows how the speed of the Insert operation is affected by the size of the sequence. It has no effect except for shorter sequences. The List method is the fastest and the FsbG, Gap and Piece methods are all about half its speed. The Arr and FsbA methods are not shown on this graph because they are so much slower that they would distort the graph (as the next two graphs show).

[Figure 12 plot not reproduced: "Insert: Size of Sequence", Insert time per call against sequence size (10000 to 90000) for the Null, Gap, Piece, FsbG and List methods]
Figure 12: Insert times as the size of the sequence varies

Figure 13 includes the FsbA method which is an order of magnitude slower than the other methods (for the Insert operation). Figure 14 includes the Arr method which is two orders of magnitude slower than the other methods. Note that it gets slower linearly with the size of the sequence, as one would expect.
[Plot not reproduced: "Insert: Size of Sequence", Insert time per call against sequence size (10000 to 90000) for the Null, Gap, Piece, FsbG, List, FsbA and Arr methods]

[Plot not reproduced: "Insert: Standard deviation of normal distribution", Insert time per call for the Null, Gap, Piece, FsbG and List methods]
Figure 15: Insert times as the standard deviation varies

[...] standard deviation of the normal distribution. The ItemAt operation is unaffected by increases in the standard deviation of the normal distribution.

Figure 16 shows how the Insert operation is affected by changes in the percent of edit locations that are taken from a uniform distribution over the entire sequence (that is, where the next edit is randomly located in the sequence instead of being normally distributed around the location of the previous edit). Only the Gap method is affected.

Figure 17 shows how the ItemAt operation is affected by changes in the percent of edit locations that are taken from a uniform distribution over the entire sequence (that is, where the next edit is randomly located in the sequence). Only the Piece and List methods are affected but only in ranges that one would not expect to find in normal text editing.

Figure 18 shows how the buffer size affects the time taken by the Insert operation in the FsbA and FsbG methods. The FsbG method is unaffected by the buffer size while the FsbA method goes up linearly (and sharply) as the buffer size increases. The increase levels off at 8000 where the buffer size is equal to the sequence size and so the entire sequence is in one buffer and the method has degenerated into the Arr method.

The following table gives the general trends of the results. The units vary from machine to machine but the ratios were reasonably steady. Some of the results have wide ranges. This means that the figure depends on one or more of: the size of the file being edited, the distribution of the position of edits in the sequence, and the size of the buffers (for the Fsb method).
[Plot: Insert | Percent Uniform Jumps. Series: Null, Gap, Piece, FsbG, List, FsbA. x-axis: percent uniform; y-axis: time per call.]

[Plot: x-axis: percent uniform; y-axis: time per call. Title and series labels not recoverable.]

[Plot: x-axis: Size (0 to 18000). Title and series labels not recoverable.]
The FsbA method is slow for Inserts and Deletes but its ItemAt can be made quite fast with some simple caching. It is possible to reduce the ItemAt time even further by making it an inline operation. The FsbG method reduces the Insert and Delete times radically but the ItemAt time is a bit higher. The equivalent ItemAt caching would be a little more complicated and a little slower. The Piece method has excellent Insert and Delete times (only slightly slower than the linked list method) but its ItemAt time is quite slow even with simple caching. More complex ItemAt caching that avoids procedure calls is necessary when using the Piece method. The idea of the caching is simple. Instead of requesting an item you request the address and length of the longest span starting at a particular position. Then all the items in the span can be accessed with a pointer dereference (and increment). This will bring the ItemAt time down to the level of the array and gap methods.
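The span idea described above might look roughly like the following sketch. It is written under several assumptions: the names `Piece`, `PieceTable`, `span_at` and `count_char` are hypothetical, the table is read-only, and a plain array stands in for whatever piece list a real implementation would use.

```c
#include <assert.h>
#include <stddef.h>

/* A piece: a run of characters stored contiguously in some buffer. */
typedef struct {
    const char *text;  /* first character of the piece         */
    size_t      len;   /* number of characters in the piece    */
} Piece;

/* Minimal read-only piece table. A fixed array of pieces is an
 * assumption made to keep the sketch short. */
typedef struct {
    const Piece *pieces;
    size_t       count;
} PieceTable;

/* Return the address of the item at `pos` and store in *span_len the
 * number of items stored contiguously starting there. Returns NULL
 * (and a length of 0) when `pos` is past the end of the sequence. */
const char *span_at(const PieceTable *t, size_t pos, size_t *span_len) {
    assert(t != NULL && span_len != NULL);
    for (size_t i = 0; i < t->count; i++) {
        if (pos < t->pieces[i].len) {
            *span_len = t->pieces[i].len - pos;
            return t->pieces[i].text + pos;
        }
        pos -= t->pieces[i].len;
    }
    *span_len = 0;
    return NULL;
}

/* Example client: count occurrences of `c` using one span_at call per
 * piece instead of one ItemAt call per character. Inside a span each
 * item costs only a pointer dereference and an increment. */
size_t count_char(const PieceTable *t, char c) {
    size_t pos = 0, n = 0, len;
    const char *p;
    while ((p = span_at(t, pos, &len)) != NULL) {
        pos += len;
        for (const char *end = p + len; p < end; p++) {
            if (*p == c) n++;
        }
    }
    return n;
}
```

The linear search in `span_at` is paid once per span rather than once per character, which is where the saving over a per-item procedure call comes from.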
                      [?]              FSB-Gap               Piece
Time                  [?]              Fast (with caching)   Fairly fast (with caching)
Space                 Low              Low                   Low
Ease of programming   Hard             Hard                  Medium
Size of code          Low (39 lines)   Low (59 lines)        Medium (79 lines)

The lines of code measure was taken from the sample implementations.
References
[1] C. C. Charlton and P. H. Leng. Editors: two for the price of one. Software: Practice and Experience, 11:195-202, 1981.
[2] Computer System Research Group, EECS, University of California, Berkeley, CA 94720. UNIX User's Reference Manual (4.3 Berkeley Software Distribution), April 1986.
[3] Computer System Research Group, EECS, University of California, Berkeley, CA 94720. UNIX User's Supplementary Documents (4.3 Berkeley Software Distribution), April 1986.
[4] C. Crowley. The Point text editor for X. Technical Report CS91-3, University of New Mexico, 1991.
[5] C. Crowley. Using fine-grained hypertext for recording and viewing program structures. Technical Report CS91-2, University of New Mexico, 1991.
[6] B. Elliot. Design of a simple screen editor. Software: Practice and Experience, 12:375-384, 1982.
[7] C. W. Fraser and B. Krishnamurthy. Live text. Software: Practice and Experience, 20(8):851-858, August 1990.
[8] J. Gutknecht. Concepts of the text editor Lara. Communications of the ACM, 28(9):942-960, September 1985.
[9] J. Kyle. Split buffers, patched links, and half-transpositions. Computer Language, pages 67-70, December 1989.
[10] B. W. Lampson. Bravo Manual in the Alto User's Handbook. Xerox Palo Alto Research Center, Palo Alto, CA, 1976.
[11] I. A. MacLeod. Design and implementation of a display oriented text editor. Software: Practice and Experience, 7:771-778, 1977.
[12] R. Pike. The text editor sam. Software: Practice and Experience, 17(11):813-845, November 1987.