Extensible language-aware merging

Walter Tichy

Extensible language-aware merging

2002, International Conference on Software Maintenance, 2002. Proceedings.

Extensible Language-Aware Merging James J. Hunt Forschungszentrum Informatik Karlsruhe, Germany jjh@fzi.de Abstract Parallel development has become standard practice in software development and maintenance. Though most every revision control and configuration management system provides some form of merging for combining changes made in parallel, these mechanisms often yield unsatisfactory results. The authors present a new merging algorithm, that uses a fast differencing algorithm and renaming analysis to provide better merge results. The system is language aware, but not language dependent and does not require a special editor, so it can be easily integrated in current development environments. 1. Introduction Modern software development requires the coordination of new development with ongoing system maintenance, so that corrections to released code are incorporated in following releases. Revision control and configuration management systems provide facilities managing parallel work on the current development line and maintenance of previous releases. Current systems do provide mechanisms for combining changes made in parallel, but these mechanisms for merging variants of a software document often yield unsatisfactory results. Merging may seem quite straight forward. In fact, even early revision control systems, like RCS[14], incorporate a feature for merging revisions from different branches of development. Unfortunately, this and similar tools do not perform as well as one would hope. The traditional solution uses a line based comparison because it can be used with most text formats and it is relatively fast, but the quality of the results is quite low. Other solutions have been proposed, but they have failed to gain acceptance. The standard practice has remained with the line based tools. The main difficulty for differencing and merging is making a good trade-off between accuracy, applicability, and performance. To this end, the authors have developed Proceedings of the International Conference on Software Maintenance (ICSM’02) 0-7695-1819-2/02 $17.00 © 2002 IEEE Walter F. Tichy University of Karlsruhe Karlsruhe, Germany tichy@ira.uka.de Extensible, Language-Aware Merging, or ELAM a new algorithm for producing an integrated revision by combining two variants of a common base revision. 2. The Problem Merging two variants of a software document means producing a new version of the document which incorporates all changes made in both variants of the document. Three way merging uses a common base version to ascertain exactly what changes were made. The chief problem of merging is to determine which changes can be integrated automatically and which changes can not. The notion of conflict is used to make this determination. Most systems use the notion of collocation to define conflicts. If two changes occur at the same point in a program, they conflict. A more exact definition says that two changes conflict when the code in one change can influence the result of the code in the other change, i.e., they exhibit semantic interference. These two definitions are not equivalent. Many collocated changes have no influence on one another. Also changes resulting in non-local conflicts, like those resulting from changes to a method or procedure signature and renaming, are not collocated. Though ELAM, as will be seen, uses collocation for identifying most conflicts, the semantic interference criteria is use to define conflicts. A collocation conflict that do not exhibit semantic interference is termed an apparent conflict. Since general conflict resolution is an intractable problem, the design of ELAM places the emphasis on improving conflict detection and isolation. Semantic knowledge about the programs in question provides the basis for distinguishing between apparent conflicts, which result from non-interfering changes made at the same position in both variants, and actual semantic conflicts. The resolution of apparent conflicts is heuristic, so the system can not guarantee that all apparent conflicts will be identified. Nevertheless, simple renaming conflicts are actually resolved as well. 3. Related Work A merge can be computed directly from two variants and a common parent or base revision by extending the matrix used by the Longest Common Subsequence (LCS) algorithm[6] into three dimension to accommodate both variants along with the base. In practice, this is not done because the resulting algorithm is rather slow. Three dimensional LCS[7] has a worst case execution of O(n3 ) in the length of the base. Furthermore, only local conflicts could be detected. Instead, current research and practice concentrates on three way merging based on pair differencing. There are two variations to three way merging independent of other consideration: symmetric and asymmetric merging. In the symmetric case, differences are calculated between the base version and each of the variants; however, in the asymmetric case, the second difference is calculated between the two variants instead of the base and the second variant. The advantage of the symmetric method is that the results are independent of the order of the variants; whereas the advantage of the asymmetric method is that like changes, made in both variants, are easier to find. Aside from the issue of symmetry, the remaining differences in strategy revolve around how the difference information is obtained, what form it has, and how it is processed. Line based algorithms are the simplest and semantic based analysis are the most complex. Between the extremes lie process and structural based approaches. 3.1. Line Based Algorithms The line based approach for merging was the first to be devised and is still the most widely used. The best example is Unix diff3[11], which is still in widespread use today. The algorithm is asymmetric and uses Unix diff for differencing. There are two main difficulties with this algorithm: changes that are irrelevant to the meaning of the program, like reformatting, can make the output unusable, and some kinds of conflicts are missed. For example, when a variable is renamed in one variant and a line is added in the other that references that variable. In this case, though the variable name from the second reference is wrong, no conflict will be flagged. 3.2. Process Based Algorithms As an improvement, Mr. Lippe[8] took process based approach. It is an attempt to produce better merging by requiring the editor to be “smart.” The editor must maintain a list of all changes made to each file, where a change is a well defined operation. The algorithm uses change lists to compute the merge. Conflicts are determined on a strictly local basis. This approach has the advantage that some conflicts resulting from syntactic changes which have no semantic Proceedings of the International Conference on Software Maintenance (ICSM’02) 0-7695-1819-2/02 $17.00 © 2002 IEEE effect, e.g. reformatting and rudimentary identifier renaming, can be resolved. However, the cost of this approach is rather high. One is restricted to using a particular editor when doing development. Furthermore, knowledge of the program structure is buried in the editor, via special operators like “rename all occurrences of variable x to y ” or by promoting simple operations like insert, add, and delete to more complex operations like “move block to subroutine.” Still, process based methods are not powerful enough to resolve more than simple renaming conflicts. 3.3. Structure Based Algorithms The structure based approach falls midway between the semantic and the line and character based approaches. In order to calculate the difference between two revisions, each is parsed into a tree describing the syntactic structure of the program code. Difference and merge are then computed on the parse trees. This method is not as complex as the semantic methods, but its results can not be disturbed by reformatting, as is the case with the line and character methods. Messrs Buffenbarger[2] and Westfechtel[15] have investigated the structural approach to merging. Mr. Westfechtel’s work is more interesting, since he uses contextsensitive edges added during editing to capture binding information for resolving non-local name conflicts. Though both techniques solve many of the problems of the line and character based algorithms, there are still significant deficiencies. The binding model for identifier analysis is primitive. Special language features like identifier overloading, structure referencing, and identifier importation can not be handled. Both are unable to resolve conflicts that go beyond simple renaming because no language specific information is available to the merge algorithm. 3.4. Semantic Based Algorithms The semantic based approach is the holy grail of merging. A true semantic algorithm is concerned with what a program computes, not how it is written. In general, semantic based differencing and merging are undecidable; therefore an approximation must be made. Most of the work in this area has been done by the team of Ms. Horwitz and Messrs Reps, Binkley, Yang, and Prins[3, 13, 16, 1]. They have adapted slicing and compiler techniques for basic block optimization to provide semantic analysis for merging, but the technique has only been demonstrated on toy languages. Though they have developed heuristics for analyzing procedures, no solution has yet been found for handling languages with arrays, pointers, and methods. Semantic based algorithms are beginning to show some promise, but they still appear to be a long way from eclipsing other techniques. 3.5. Summary Despite the promise of new technology, LCS, line base methods remain standard practice in the software industry for merging. All other methods are too costly in terms of limited applicability or require the use of a special development tools. All methods are limited to single file merging. 4. Design and Implementation Differencer Variant 1 Parser Symbol Analysis Base Parser Symbol Analysis Symbol Analysis Variant 2 Renaming Analysis Parser Merging New Revision Renaming Analysis Differencer Data Process Data Flow Figure 1. ELAM Data Flow ELAM improves on structural based merging by using a more sophisticated differencing algorithm and including some semantic information in a language independent form. It is a symmetric, three way merge based on the linearized tree differencer. LTDIFF is a fast, token based differencing algorithm to identify changes between a base version of a software document and each of variant. Renaming analysis assists the merge with conflict detection and resolution via the Renaming Detector [9]. The system is language aware because the core algorithm operates with abstract language constructs instead of language specific constructs. Parser generation[12], modularized symbol analysis, and rule based name analysis generate the underlying abstract representation. This division between input and internal representation makes the system easy to extend and refine to handle new languages. Supporting a new language requires only a new grammar for the parser generator, one new Java class, and a hand full of semantic rules. It took approximately 60 hours to extend the original system to include support for Scheme. LTDIFF [4] combines the core string comparison techniques of a fast delta compression algorithm[5] with syntactic information from the program versions being compared. It simulates tree comparison with linear token matching. After parsing into a language neutral form, token strings are compared to find the maximal match coverage between two revisions. Then structural information from the parse is used to find the best break point at all places where matches overlap. Thus lexical differencing is used to speed up syntactic comparison. The Renaming Detector inherits the language-aware features of LTDIFF: it is insensitive to formatting, matches Proceedings of the International Conference on Software Maintenance (ICSM’02) 0-7695-1819-2/02 $17.00 © 2002 IEEE moved code properly, and treats comments and programming language statements differently. The Renaming Detector finds name changes that span multiple files—an important consideration for practical use. Its adaptation to multiple languages is achieved by using LTDIFF’s parser, an implementation for an abstract symbol analysis module, and some language specific rules for renaming analysis. The actual merging phase of ELAM sits at the end of the analysis process undertaken by LTDIFF and the Renaming Detector. First, each of the three input revisions—a common ancestor revision, base, and two variants thereof, left variant and right variant—are parsed and analyzed to determine which tokens are symbols and what attributes they have. The result is two token strings and two symbol tables. Then the token strings and symbol tables for both base/variant pairs are processed to produce two change lists—left difference and right difference—which describe left variant and right variant in terms of base, respectively. Finally the actual merging process uses the symbol tables, along with left difference and right difference annotated by renaming analysis to generate a new revision which represents the combination of base with all changes made in left variant and right variant. Figure 1 illustrates how the various parts work together to produce the desired result. Though not illustrated in figure 1, there is some sharing between the two difference calculations. A suffix tree[10] is used by LTDIFF to provide a fast substring index into base. Since the same suffix tree can be used for both differencing steps, the suffix tree need be generated only once. This saves considerable time in the overall merge process. There are not just savings in this process. Conflict resolution requires information from symbol analysis that is not needed for renaming detection but generated there. In order to determine if an apparent conflict between two program statements actually conflict or not, the conflict analysis phase needs to know which references are pure, i.e. do not change the value of the referent or its content, and which are not. The assumption is that all references are impure until shown to be otherwise. The symbol analysis phase determines which references should be marked pure, i.e., references that do not change the value of the referent. For instance, a variable appearing as the destination of a set routine, as in (set a 5) in Scheme, is impure, but a variable appearing as an argument to a function in a purely functional language is pure. This part is language specific and is implemented conservatively. Only references that can be shown to be pure are marked so. It is up to each of the language implementations to decide how much processing is worth investing to mark pure references, since the resolution of apparent conflicts works even when no references are marked pure, just not as well. The output of the merging process is a list of merge blocks or token subsequences. The list can contain in- Merge List Contains Block print() print() writeXML() getStart() writeXML() getLength() Common Block Move Block Insert Block getBaseStart() Conflict Block getLeftVariantBlocks() getRightVariantBlocks() Rename Block Rename Insert Delete Conflict Figure 2. Merge Output stances of seven block types according to from whence the sequence derives and whether or not there is a conflict: common block, move block, insert block, rename block, rename insert block, conflict block, and delete conflict block. Of these, all but common blocks and conflict blocks have left and right instances (for left variant and right variant, respectively). The class structure, as depicted in figure 2, is more complicated than what would be needed to textually present the merged revision. Each block class behaves differently during some phase of the merging process. Common blocks, move blocks, and insert blocks represent the unchanged blocks, move blocks, and insert blocks from the difference structures of the input where no conflicts exist. Rename blocks and rename insert blocks reflect the additional blocks added to a difference as a result renaming analysis. Finally, conflict and delete conflict blocks represent conflicts. The structure is designed for further automatic processing and both graphic and textual display with the possibility of manual intervention; therefore it is necessary to record additional information about the blocks found. The merge algorithm itself has three phases: 1. integrating renamings from each variant into the other; 2. building a rough merge from the variants; and then 3. resolving apparent conflicts with semantic information. Renaming integration is carried out directly on the change lists. The second phase generates the first merge list. Finally, the resolution phase removes conflicts by combining the changes from both left difference and right difference contained in the conflict blocks of the merge list. 4.1. Integrating Renamings The first phase of merging prepares the change lists for the next two phases. At this stage, each change list contains records for the name changes found by Renaming Detector in their respective variants. These name changes now need to be projected into the opposite change list. Handling name changes before the main merge phase helps reduce the Proceedings of the International Conference on Software Maintenance (ICSM’02) 0-7695-1819-2/02 $17.00 © 2002 IEEE number of apparent conflicts in the second phase, because marked name changes no longer appear as differences. Furthermore, non local conflicts that the next phase can not find are either removed or marked. The only place where name changes are difficult for merging is when one variant introduces a name change and the other adds a new reference of the identifier whose name has changed. Projecting the name change into the other variant removes what would otherwise be a non local merge conflict. There are two kinds of changes that are of interest: pure name changes and signature changes. As long as only the name of a given identifier is changed in one of the variants, there is no conflict. If there is a change in both, both may be rejected without changing the semantics of the final program. Care must only be exercised with name changes where an identifier’s new name already existed in the base. In that case, a name change conflict could cause more than one name change to be rejected. Signature changes pose a more complex problem. As with name changing, if no new reference to the changed identifier is added, then local conflict resolution is sufficient to handle the change. However, if a new reference is added in the other variant, the system does not try to automatically resolve the conflict; rather it is simply marked in the change list as a conflict. 4.2. A Rough Merge The core of the merge algorithm is quite simple. It relies on locality to determine which sequences to include in the resultant merge. This phase of the algorithm is not dissimilar to how the Unix command diff3 works, though diff3 is an asymmetric merge using the results of a line based differencing algorithm as input. The algorithm calculates the rough merge in a single pass through the two differences. The algorithm relies on finding corresponding unchanged blocks in both differences and using these as reference points for the merge. Starting with zero as the last alignment point, ELAM loops through the following steps, while there are still blocks in both change lists: 1. find next alignment point at an unchanged block or a match rename block; 2. fill the gap between the new alignment point and the last one; and then 3. add a common block or rename block and set the next alignment point to the end of the common or rename block. Alignment is defined as two unchanged blocks, one from each difference, that contain a token located at a common position in base. A block alignment has a length defined as the number of tokens in common with the base between Table 1. Base Decision Table for Filling Gaps left difference common — move — move common move common insert insert — insert — insert & common combination — combination right difference — common — move move move common insert common — insert insert insert & common — — combination combination output none none move block move block move block for both move block move block insert block insert block insert block insert block conflict block delete conflict block delete conflict block combination of blocks combination of blocks conflict block Left Alternate 1111111 0000000 0000000 1111111 0000000 1111111 Left 0000000 1111111 0000000 1111111 Insert 0000000 1111111 0000000 1111111 0000000 1111111 Right Alternate Unchanged A Unchanged A 1111111 0000000 0000000 1111111 0000000 1111111 Right 0000000 1111111 0000000 1111111 Insert 0000000 1111111 0000000 1111111 0000000 1111111 Unchanged B Unchanged B 0000000 1111111 1111111 0000000 0000000 1111111 Left 0000000 1111111 0000000 1111111 Insert 0000000 1111111 0000000 1111111 0000000 1111111 Common A 1111111 0000000 0000000 1111111 0000000 1111111 Right 0000000 1111111 0000000 1111111 Insert 0000000 1111111 0000000 1111111 0000000 1111111 Common B Unchanged D UnChanged C Common D 1111111 0000000 0000000 1111111 0000000 1111111 Left 0000000 1111111 0000000 1111111 0000000 1111111 Conflict 0000000 1111111 0000000 1111111 0000000 1111111 0000000 1111111 Move Unchanged D 0000000 1111111 1111111 0000000 Right 0000000 1111111 0000000 1111111 Conflict 0000000 1111111 0000000 1111111 111111 000000 000000111111 111111 000000 000000 111111 000000 111111 000000 111111 Left Right 000000 111111 000000 111111 000000 000000111111 111111 000000 111111 000000 111111 Conflict Conflict 000000 111111 000000 111111 000000 111111 000000 111111 000000 111111 000000 111111 Move Unchanged F Unchanged E Common F Unchanged F the blocks. There can only be one, possibly empty, string of tokens between any two common blocks, though one unchanged block may have aligned token strings in more than one unchanged block in the other difference. The start position in base of the aligned token string is used as the next alignment point in the current iteration, and the last position of the aligned token string is used as the last alignment point for the next iteration. Once an alignment is established, blocks are added to the merge to reflect the changes. Table 1 illustrated the possible combinations and the resultant actions. In general, if a change has been made to only one of the variants, then there is no conflict. Otherwise a conflict block is inserted. The exception is that, when deletes and moves coincide, the move is simply accepted. Actually, some semantic information is used here as well. Whenever a delete contains a definition of an identifier and a new reference is added in the other variant, a delete conflict block is added to the merge. Each iteration ends with the addition of a common block to represent the aligned token substring. Figure 3 shows a simple example of the alignment process and figure 4 shows the resultant merge. In this example, there are two insertions, one move, one conflict, and one deletion in the results. The first insertions in the left and the right variant occur at distinct points relative to the base, so both are accepted. The second insertions, marked left conflict and right conflict respectively, occur at the same location relative to the base, thus they conflict. Unchanged C is moved and Unchanged E is deleted. Aligning common sequences between the two variants is necessary for any three way merge algorithm; however, the conflict detection in this phase of the algorithm is rudimentary and conflict resolution is essentially nonexistent. Though the previous phase insures that all non local con- Proceedings of the International Conference on Software Maintenance (ICSM’02) 0-7695-1819-2/02 $17.00 © 2002 IEEE Figure 3. Aligning Blocks Figure 4. Resultant Merge flicts are detected, many apparent conflicts remain. For instance, two different methods inserted in each of the two variants at the same position relative to base will be flagged as a conflict, when in fact, both can usually be added without causing any difficulty. All apparent conflicts are then examined in the last phase. 4.3. A Little Semantics The goal of the last phase is to reduce the number of apparent conflicts included in the merge. The strategy is to segregate code sequences into classes according to the difficulty of determining whether or not the two sequences interfere with one another. The strategy has three parts, given a token string from each of the two variants in a conflict block. First, reduce each token string to the smallest set of parse tree nodes that span the token substring exactly. Then, determine a compatibility class for each string by combining the compatibility class of the spanning nodes. Finally, the algorithm uses the two compatibilities to choose the appropriate action. The system relies on five node compatibility classes. Listed from least restrictive to most restrictive, they are compatible, signature compatible, name compatible, reference compatible, and incompatible. Furthermore, for reference compatibility, references that definitely do not change the state of the referent—pure references—are distinguished from references that might modify the referent— impure references. Table 2 defines the compatibility types An Internal Node in Spanning Set Last Token First Token Figure 5. Token String Spanning Nodes by the addition requirements needed to demonstrate compatibility between token strings of a given compatibility. Table 2. Compatibility Types Compatible Signature Compatible Name Compatible Reference Compatible Incompatible two nodes with this attribute are always compatible with other nodes. two nodes with this attribute must differ in name, argument types, or number to be compatible. two nodes of this type must differ in name to be compatible. these nodes are compatible when the set of impure reference, in their subnodes, are disjoint. nodes of this type can never be combined. The first three classes are easy to handle. Compatibility is the simplest class. It is used mostly for complete comments. Comments can not interfere with program code. A method in Java is an example of signature compatibility. Two methods may be both included in the final merge as long as their name, number of arguments, or arguments types differ. The system need only compare the names and argument lists. Name compatibility is a bit more restrictive. A Scheme function is an example of this call. The name alone must be unique. Reference compatibility is the most difficult. A statement in Java is an example of this compatibility. In this case, the program must examine all identifiers in both token sequences. If no identifier references are common to the nodes or if only pure references are common, then there is no conflict. Otherwise the conflict is unresolvable. If two code segments do not share identifier references or if all the references are pure then no information can flow between the two code segments and no collision is possible. Reference comparison is accomplished by inserting all identifier references from one side of a conflict into a hash table and then looking up all references from the other side. Finally incompatibility is used for most other token sequences. This is appropriate, since only larger program units can be analyzed reasonably. This class inhibits all compatibility checking. Conflicts involving this node type are unresolvable. Proceedings of the International Conference on Software Maintenance (ICSM’02) 0-7695-1819-2/02 $17.00 © 2002 IEEE In general, the algorithm needs to consider several nodes from each side of the conflict, not just one. The dominate node, or nodes, of a segment of code is determined, first by finding the nearest common ancestor, and then finding the highest sequence of nodes under that ancestor that do not have subnodes outside the range of the token string. As can be seen in figure 5, the nodes need not all be at the same depth in the tree. Both of these operations are done by a linear search up the parent chains of the first and last token in the range. If there is no single node that spans the code segment exactly, then the compatibility type is that of the most restrictive compatibility. The compatibility class of each node in the parse tree is given by the grammar for each language. Once the compatibility class for a code segment is determined, conflict resolution can be performed without regard to the underlying language. Only the compatibility types need be considered. Except for incompatible nodes, which can not lead to conflict resolution, nodes of different compatibility types do not conflict and both segments are included in the output. Nodes of the same compatibility class that are compatible according to table 2 do not conflict as well. Again, both segments are included in the output. Otherwise the segments being considered are marked as a conflict. At the end of this phase, the merge is finished. The results can either be reduced to a text file or used to drive an interactive tool for final conflict resolution. An Emacs extension would be a convenient means of providing such a tool. 5. Evaluation A lack of real project data makes evaluating ELAM difficult. Only very few non trivial merge examples were at hand. During the course of development, only one student worked with the authors on the project, therefore, given the modularity involved only two real merge examples were found. The rest of the examples are constructed to illustrate various features of the merge process. There are several aspects of the system to evaluate: format independence, name change propagation, and collocated change resolution. Format independence is exhibited when disjoint sets of adds, removes, or changes of code segments can be integrated even when one developer changes the formatting of the entire software document. ELAM demonstrates name change propagation by actively changing the name of an identifier when that identifier references a definition whose name was changed in the other variant. Collocation resolution refers to the process of determining whether or not changes made in both variants of a software document are compatible, and hence both allowable, or if they do, in fact, conflict. Eight examples have been constructed to cover each of these cases. Base Base class Rename class Test f f Left Variant g g int _foo; public Rename(int foo) f g Left Variant class Rename class Test g g int _value; int getFoo() f _foo = foo; g int getFoo() f g return _foo; g Right Variant f int _foo; int getFoo() f return _value; class Rename f f int _value; int getFoo() f g void setFoo(int value) f _value = value; g Merge return _foo; Right Variant class Test f public float test2; int _foo; public int getFoo() f return _foo; public int test1; int _foo; int getFoo() f return _value; g g return _foo; g void setFoo(int value) f g g _foo = value; Merge class Rename f class Test f int _foo; public Rename(int foo) f g _foo = foo; int getFoo() f g g f return _foo; g f g Figure 6. Renamed Identifier Example Proceedings of the International Conference on Software Maintenance (ICSM’02) 0-7695-1819-2/02 $17.00 © 2002 IEEE return _foo; void setFoo(int value) void setFoo(int value) f _foo = value; g f _foo = value; Figure 7. Collocated Change Example sequential in terms of performance. In fact, it appears to be uncorrelated to the size of the input in tokens. Even when renaming analysis is only undertaken within each file singularly, renaming still accounts for the largest portion of the total execution time. With the two preliminary paths that are required for external reference analysis, renaming analysis would dwarf all other costs. Though overall performance is acceptable, renaming analysis provides a ample opportunity for performance improvements. Figure 8. Runtime Performance of ELAM 7000 ltdiff renaming merge ltdiff renaming merge 6000 5000 Time (milliseconds) Figures 6 and 7 provide examples of these change types with the merge that ELAM produces. In both figures, changes from the left variant are shaded and changes from the right are highlighted with halftone bold type. Figure 6 illustrates how a name change in one variant is propagated into an addition in the other. The left variant, depicts not only an added constructor, but also a variable name change. A new reference to the same variable exists in the right variant. ELAM integrates both variants, while correctly changing the name of the variable reference added in the right variant. For clarity, the effected token is boxed in the figure. Figure 7 illustrates both format independence and collocated change. Both variants introduce new public variables at the beginning of the class. Each contain additional changes as well. In particular, the right variant has been reformatted. The resulting merge correctly incorporates all changes. The two new variable definitions do not conflict. The test cases were also evaluated for runtime performance. The execution environment is an IBM Thinkpad with 300MHz Intel Mobile Pentium II under Linux 2.4 with the Java JDK 1.3.1 from Sun Microsystems. In order to minimize fluctuations due to just-in-time compilation, a trail was run before the actual data was captured. Ten examples are insufficient to quantify the performance of the algorithm, however the graph in figure 8 does suggest that the actual merge part of the algorithm is incon- public int test1; public float test2; int _foo; public int getFoo() 4000 3000 2000 1000 0 0 500 1000 1500 Size of Base Version (Tokens) 2000 2500 Correlations: ltdiff 0.998; renaming 0.765; merge 0.153 Figure 9. TokenAST.java Base Version public class TokenAST extends Token implements Atom, AST f public int getPosition() f int pos = super.getPosition(); if (pos >= 0) return pos; if (getFirstChild() != null) return ((TokenAST)getFirstChild()).getPosition(); else return pos; g /** Get the first child of this node; null if not children */ public AST getFirstChild() f return _down; g /** Get the next sibling in line after this one */ public AST getNextSibling() f return _right; g public void setFirstChild(AST child) f _down = (TokenAST)child; g public void setNextSibling(AST next) f _right = (TokenAST)next; g /** Print out a child-sibling tree in LISP notation */ public String toStringList() f TokenAST token = this; String token_string = ""; if (token.getFirstChild() != null) token_string += " ("; token_string += " " + this.toString(); if (token.getFirstChild() != null) token_string += ((TokenAST)token.getFirstChild()).toStringList(); if (token.getFirstChild() != null) token_string += ")"; if (token.getNextSibling() != null) token_string += ((TokenAST)token.getNextSibling()).toStringList(); return token_string; g changes to the same file in parallel. Merge results are given both for diff3 and for ELAM. Though some of the original code has been replaced by ellipsis, so as to shorten the example for improved readability, the example clearly illustrates the advantages of ELAM over diff3. Figure 9 shows the basis version of the TokenAST.java file. Figure 10 shows the first variant of TokenAST.java, where the developer added three new methods to the class. Figure 11 gives the second variant. Here, a method named equals is added and two methods are rewritten with new names given to the old methods: getFirstChild becomes getFirstChildOrComment and getNextSibling becomes getNextSibFigure 11. TokenAST.java Variant 2 /** NEW: !!!!! 21.08.2001 * getNextSibling() and getFirstChild() do not return comment-nodes anymore. * If you want to get comments, you need to use getNextSiblingOrComment() and * getFirstChildOrComment() */ public class TokenAST extends Token implements Atom, AST f g public int getPosition() f int pos = super.getPosition(); if (pos >= 0) return pos; if (getFirstChildOrComment() != null) return ((TokenAST)getFirstChildOrComment()).getPosition(); else return pos; 5.1. An Example from the Field Figures 9 through 13 depict a merge case encountered in actual code development. Merging was necessitated by g /** * Get the first child of this node; null if not children * getNextSibling() and getFirstChild() do not return comment-nodes anymore. * If you want to get comments, you need to use getNextSiblingOrComment() and * getFirstChildOrComment() */ public AST getFirstChild() f TokenAST tmp = _down; while (tmp != null && tmp.isComment()) tmp = tmp._right; return tmp; Figure 10. TokenAST.java Variant 1 public class TokenAST extends Token implements Atom, AST f public int getPosition() f int pos = super.getPosition(); if (pos >= 0) return pos; if (getFirstChild() != null) return ((TokenAST)getFirstChild()).getPosition(); else return pos; g public AST getFirstChildOrComment() f return _down; g /** Get the next sibling in line after this one * getNextSibling() and getFirstChild() do not return comment-nodes anymore. * If you want to get comments, you need to use getNextSiblingOrComment() and * getFirstChildOrComment() */ public AST getNextSibling() f TokenAST tmp = _right; while (tmp != null && tmp.isComment()) tmp = tmp._right; return tmp; g public boolean isFirst() f return (_parent == null) || (_parent.getFirstChild() == this); g public boolean isLast() f return this.getNextSibling() == null; g g public AST getNextSiblingOrComment() f return _right; g /** Get the first child of this node; null if not children */ public AST getFirstChild() f return _down; g /** Get the next sibling in line after this one */ public AST getNextSibling() f return _right; g public void setFirstChild(AST child) f _down = (TokenAST)child; g public void setNextSibling(AST next) f _right = (TokenAST)next; g public boolean equals(Atom other) f public void setFirstChild(AST child) f _down = (TokenAST)child; g public void setNextSibling(AST next) f _right = (TokenAST)next; g public Atom findBalanceAtom() Atom other_parent = ((TokenAST)other).getParent(); if (getParent() == null) return ((((Token)other).getType() == getType()) && (((Token)other)._text == _text) && (other_parent != null)); else return ((((Token)other).getType() == getType()) && (((Token)other)._text == _text) && (other_parent != null) && (((Token)other_parent).getType() == getParent().getType())); f int end_type = getBalanceType(); if (end_type == InputState.BALANCE_DEFAULT) return null; else f TokenAST result; for (result = this; (result != null) && (result.getType() != end_type); result = (TokenAST)result.getNextSibling()); return result; g /** Print out a child-sibling tree in LISP notation */ public String toStringList() g g f TokenAST token = this; String token_string = ""; if (token.getFirstChildOrComment() != null) token_string += " ("; token_string += " " + this.toString(); if (token.getFirstChildOrComment() != null) token_string += ((TokenAST)token.getFirstChildOrComment()).toStringList(); if (token.getFirstChildOrComment() != null) token_string += ")"; if (token.getNextSiblingOrComment() != null) token_string += ((TokenAST)token.getNextSiblingOrComment()).toStringList(); return token_string; /** Print out a child-sibling tree in LISP notation */ public String toStringList() f TokenAST token = this; String token_string = ""; if (token.getFirstChild() != null) token_string += " ("; token_string += " " + this.toString(); if (token.getFirstChild() != null) token_string += ((TokenAST)token.getFirstChild()).toStringList(); if (token.getFirstChild() != null) token_string += ")"; if (token.getNextSibling() != null) token_string += ((TokenAST)token.getNextSibling()).toStringList(); return token_string; g g Proceedings of the International Conference on Software Maintenance (ICSM’02) 0-7695-1819-2/02 $17.00 © 2002 IEEE g g Figure 12. TokenAST.java Merge with diff3 Figure 13. TokenAST.java Merge with ELAM /** NEW: !!!!! 21.08.2001 * getNextSibling() and getFirstChild() do not return comment-nodes anymore. * If you want to get comments, you need to use getNextSiblingOrComment() and * getFirstChildOrComment() */ public class TokenAST extends Token implements Atom, AST /** NEW: !!!!! 21.08.2001 * getNextSibling() and getFirstChild() do not return comment-nodes anymore. * If you want to get comments, you need to use getNextSiblingOrComment() and * getFirstChildOrComment() */ public class TokenAST extends Token implements Atom, AST f f public int getPosition() f int pos = super.getPosition(); if (pos >= 0) return pos; if (getFirstChildOrComment() != null) return ((TokenAST)getFirstChildOrComment()).getPosition(); else return pos; public int getPosition() f int pos = super.getPosition(); if (pos >= 0) return pos; if (getFirstChildOrComment() != null) return ((TokenAST)getFirstChildOrComment()).getPosition(); else return pos; g g public boolean isFirst() public boolean isFirst() f f return (_parent == null) || (_parent.getFirstChildOrComment() == this); return (_parent == null) || (_parent. getFirstChild () == this); g g public boolean isLast() f return this.getNextSiblingOrComment() == null; g public boolean isLast() f return this. getNextSibling () == null; g /** * Get the first child of this node; null if not children * getNextSibling() and getFirstChild() do not return comment-nodes anymore. * If you want to get comments, you need to use getNextSiblingOrComment() and * getFirstChildOrComment() */ public AST getFirstChild() f TokenAST tmp = _down; while (tmp != null && tmp.isComment()) tmp = tmp._right; return tmp; /** * Get the first child of this node; null if not children * getNextSibling() and getFirstChild() do not return comment-nodes anymore. * If you want to get comments, you need to use getNextSiblingOrComment() and * getFirstChildOrComment() */ public AST getFirstChild() f TokenAST tmp = _down; while (tmp != null && tmp.isComment()) tmp = tmp._right; return tmp; g public AST getFirstChildOrComment() f return _down; g /** Get the next sibling in line after this one * getNextSibling() and getFirstChild() do not return comment-nodes anymore. * If you want to get comments, you need to use getNextSiblingOrComment() and * getFirstChildOrComment() */ public AST getNextSibling() f TokenAST tmp = _right; while (tmp != null && tmp.isComment()) tmp = tmp._right; return tmp; g public AST getFirstChildOrComment() f return _down; g /** Get the next sibling in line after this one * getNextSibling() and getFirstChild() do not return comment-nodes anymore. * If you want to get comments, you need to use getNextSiblingOrComment() and * getFirstChildOrComment() */ public AST getNextSibling() f TokenAST tmp = _right; while (tmp != null && tmp.isComment()) tmp = tmp._right; return tmp; g public AST getNextSiblingOrComment() f return _right; g g public AST getNextSiblingOrComment() f return _right; g public void setFirstChild(AST child) f _down = (TokenAST)child; g public void setNextSibling(AST next) f _right = (TokenAST)next; g public Atom findBalanceAtom() public void setFirstChild(AST child) f _down = (TokenAST)child; g public void setNextSibling(AST next) f _right = (TokenAST)next; g <<<<<<< cases/TokenAST.4.java public Atom findBalanceAtom() f int end_type = getBalanceType(); if (end_type == InputState.BALANCE_DEFAULT) return null; else f int end_type = getBalanceType(); if (end_type == InputState.BALANCE_DEFAULT) return null; else f TokenAST result; for (result = this; (result != null) && (result.getType() != end_type); result = (TokenAST)result.getNextSiblingOrComment()); return result; f TokenAST result; for (result = this; (result != null) && (result.getType() != end_type); result = (TokenAST)result.getNextSibling()); return result; g g public boolean equals(Atom other) g g f Atom other_parent = ((TokenAST)other).getParent(); if (getParent() == null) return ((((Token)other).getType() == getType()) && (((Token)other)._text == _text) && (other_parent != null)); else return ((((Token)other).getType() == getType()) && (((Token)other)._text == _text) && (other_parent != null) && (((Token)other_parent).getType() == getParent().getType())); ||||||| cases/TokenAST.0.java ======= public boolean equals(Atom other) f Atom other_parent = ((TokenAST)other).getParent(); if (getParent() == null) return ((((Token)other).getType() == getType()) && (((Token)other)._text == _text) && (other_parent != null)); else return ((((Token)other).getType() == getType()) && (((Token)other)._text == _text) && (other_parent != null) && (((Token)other_parent).getType() == getParent().getType())); g /** Print out a child-sibling tree in LISP notation */ public String toStringList() f TokenAST token = this; String token_string = ""; if (token.getFirstChildOrComment() != null) token_string += " ("; token_string += " " + this.toString(); if (token.getFirstChildOrComment() != null) token_string += ((TokenAST)token.getFirstChildOrComment()).toStringList(); if (token.getFirstChildOrComment() != null) token_string += ")"; if (token.getNextSiblingOrComment() != null) token_string += ((TokenAST)token.getNextSiblingOrComment()).toStringList(); return token_string; g >>>>>>> cases/TokenAST.5.java /** Print out a child-sibling tree in LISP notation */ public String toStringList() f TokenAST token = this; String token_string = ""; if (token.getFirstChildOrComment() != null) token_string += " ("; token_string += " " + this.toString(); if (token.getFirstChildOrComment() != null) token_string += ((TokenAST)token.getFirstChildOrComment()).toStringList(); if (token.getFirstChildOrComment() != null) token_string += ")"; if (token.getNextSiblingOrComment() != null) token_string += ((TokenAST)token.getNextSiblingOrComment()).toStringList(); return token_string; g g Proceedings of the International Conference on Software Maintenance (ICSM’02) 0-7695-1819-2/02 $17.00 © 2002 IEEE g g lingOrComment. All existing references to the old methods are changed to the new names. For convenience, the changes of variant 1 are highlighted with a gray background and the changes of variant 2 are set in halftone bold type. The reader should note that the methods added in variant 1 reference the methods whose names have been changed in variant 2. As can be seen in figure 12, diff3 is not able to merge the first change correctly, thus resulting in two errors. One of the errors is clearly visible from the result. The method named findBalanceAtom added in variant 1 and the method named equals where both added at the same place relative to the base revision. Since diff3 does not have any semantic knowledge about the code in question, it is not able to determine whether or not there is a semantic collision. It must flag this textual collision as a conflict. The other error slips by diff3 undetected. The diff algorithm has no way of recognizing that the methods added in variant 1 reference the methods whose names are changed in variant 2. The references in the methods added in variant 1 to the methods renamed in variant 2 should be changed to use the new names. In figure 12, the places where this should happen are marked with a surrounding rectangle. The merge produced by ELAM, given in figure 13, does not exhibit these errors. First, it recognizes that the references to getFirstChild and getNextSibling in the methods added in variant 1 must be changed to getFirstChildOrComment and getNextSiblingOrComment respectively. Second, the algorithm gathers enough semantic information to know that the new method named findBalanceAtom from variant 1 and the new methods named equals from variant 2 do not conflict. ELAM includes both methods in the new version resulting in a semantically correct merge. 6. Future Work There are two areas that could be improved upon: language support and conflict resolution. The current implementation works well with languages like Java and Scheme, where no preprocessor is used, but languages that rely on preprocessors, like C, and C++, are more difficult to process. More sophisticated parsing techniques are needed to fully support these languages. The current conflict resolution strategy is rather simple. More experience should yield useful information about improving the strategy. 7. Conclusions The ELAM algorithm is a new algorithm that provides a better trade off among performance, applicability, and accuracy. On the one hand, it can find non-local conflicts that line based merging misses and exclude apparent conflicts that line based merging generates. On the other hand, ELAM has wider applicability than current full semantic approaches. An implementation in Java demonstrated that the algorithm is effective. Parsers are available for Java and Proceedings of the International Conference on Software Maintenance (ICSM’02) 0-7695-1819-2/02 $17.00 © 2002 IEEE Scheme. The ELAM algorithm achieves the main goal of providing full syntactic merging with some semantic analysis in acceptable time. References [1] D. Binkley, S. Horwitz, and T. Reps. Program integration for languages with procedure calls. ACM Trans. on Software Eng. and Methodology, 4(1):3–35, Jan. 1995. [2] J. Buffenbarger. Syntactic software merging. Lecture Notes in Comp. Sci., 1005: ICSE SCM-4 and SCM-5 Workshop Sel. Papers:53–67, 1995. [3] S. Horwitz, J. Prins, and T. Reps. Integrating non-interfering versions of programs. ACM Trans. on Programming Languages and Systems, 11(3):345–387, July 1989. [4] J. J. Hunt. Extensible, Language-Aware Differencing and Merging. PhD thesis, Universitt Karlsruhe, Nov. 2001. [5] J. J. Hunt, K.-P. Vo, and W. F. Tichy. An empirical study of delta algorithms. ACM Trans. on Software Eng. and Methodology, 7(2):49–66, 1998. [6] J. W. Hunt and T. G. Szymanski. A fast algorithm for computing longest common subsequences. Communications of the ACM, 20(5):350–353, May 1977. [7] R. Irving and C. Fraser. Two algorithms for the longest common subsequence of three (or more) strings. In U. Manber, Z. Galil, M. Crochemore, and A. Apostolico, editors, Proceedings of the 3rd Annual Symposium on Combinatorial Pattern Matching, number 644, pages 214–229. SpringerVerlag, 1992. [8] E. Lippe. Operation-based merging. In H. Weber, editor, Software Eng. Notes: Proc. of the 5th ACM SIGSOFT Symp. on Software Development Environments, pages 78– 87. ACM, Dec. 1992. [9] G. Malpohl, J. J. Hunt, and W. F. Tichy. Renaming detection. In Proc. of the 15th Inter. Conf. on Automated Software Eng. IEEE Computer Society Press, Sept. 2000. [10] E. M. McCreight. A space economical suffix tree construction algorithm. Journal of the ACM, 32:262–272, 1976. [11] E. W. Meyers. An o(nd) difference algorithm and its variations. Algorithmica, 1(2):251–266, 1986. [12] T. Parr and R. Quong. A Predicated-LL(k) parser generator. Software Practice and Experience, 25(7):789–810, July 1995. [13] T. Reps, S. Horwitz, and J. Prins. Support for integrating program variants in an environment for programming in the large. In Proc. of the Inter. Workshop on Software Version and Config. Control, Stuttgart, FRG, Jan. 1988. Teubner Verlag. [14] W. F. Tichy. RCS: A revision control system. In Integrated Interactive Computing Systems. North-Holland Publishing Co, 1983. [15] B. Westfechtel. Structure-oriented merging of revisions of software documents. In Proc. of the 3rd Inter. Workshop on SCM, pages 68–79. ACM Press, 1991. [16] W. Yang, S. Horwitz, and T. Reps. A program integration algorithm that accommodates semantic-preserving transformations. ACM Trans. on Software Eng. and Methodology, 1(3):310–354, July 1992.

Log In

Extensible language-aware merging

Related papers

Related papers

Related topics