
End-user software engineering

Communications of the ACM 47, 9 (September 2004), 53–58

By Margaret Burnett, Curtis Cook, and Gregg Rothermel

A strategy that gives end users the ability to apply quality control methods to their own software, and that inspires them to enhance its dependability themselves.

End-user programming has become the most common form of programming in use today [2], but there has been little investigation into the dependability of the programs end users create. This is problematic because the dependability of these programs can be very important; in some cases, errors in end-user programs, such as formula errors in spreadsheets, have cost millions of dollars. (For example, see www.theregister.co.uk/content/67/31298.html or panko.cba.hawaii.edu/ssr/Mypapers/whatknow.htm.)

We have been investigating ways to address this problem by developing a software engineering paradigm viable for end-user programming, an approach we call end-user software engineering. End-user software engineering does not mimic the traditional approach of segregated support for each element of the software engineering life cycle, nor does it ask the user to think in those terms. Instead, it employs a feedback loop supported by behind-the-scenes reasoning, with which the system and user collaborate to monitor dependability as the end user's program evolves. This approach helps guard against the introduction of faults¹ in the user's program and, if faults have already been introduced, helps the user detect and locate them.

Because spreadsheet languages are the most widely used end-user programming languages to date (in fact, they may be the most widely used of all programming languages), we have prototyped our approach in the spreadsheet paradigm. Our prototype includes the following end-user software engineering devices:

• An interactive testing methodology to help end-user programmers test;
• Fault localization capabilities to help users find the faults that testing may have revealed;
• Interactive assertions to continually monitor values the program produces, and alert users to potential discrepancies; and
• Motivational devices that gently attempt to interest end users in appropriate software engineering behaviors at suitable moments.

In this article, we describe how these devices can be used by end-user programmers. We also summarize the results of our empirical investigations into the usefulness and effectiveness of these devices for promoting dependability in end-user programming.

¹We follow the standard terminology for discussing program errors. A "failure" is an incorrect output, and a "fault" is the incorrect element(s) of source code causing the failure. For example, an answer of "D" in a spreadsheet cell when the student's grade should actually be a "B" is a failure; the incorrect formula, such as omission of one of the student's test grades from the sum upon which his/her letter grade is based, is the fault.
[Figure 1. The teacher's sequence of interactions with WYSIWYT testing: (a) the Student Grades spreadsheet (NAME, ID, HWAVG, MIDTERM, FINAL, COURSE, LETTER); (b) dataflow arrows ending at row 4's Letter cell, whose formula is if (courseR4 >= 90) then "A" else (if (courseR4 >= 80) then "B" else (if (courseR4 >= 70) then "C" else (if (courseR4 >= 60) then "D" else "F"))); (c) the spreadsheet after the teacher places X marks on row 5's incorrect Course and Letter values.]

WYSIWYT Testing

In our What You See Is What You Test (WYSIWYT) methodology, a user can test a spreadsheet incrementally as he or she develops it by simply validating any value as correct at any point in the process. Behind the scenes, these validations are used to measure the quality of testing in terms of a test-adequacy criterion. These measurements are then projected to the user via several different visual devices, to help them direct their testing activities.

For example, suppose a teacher is creating a student grades spreadsheet, as in Figure 1. During this process, whenever the teacher notices that a value in a cell is correct, she can check it off ("validate" it). The checkmark provides feedback, and later reminds the teacher that the cell's value has been validated under the current inputs. (Empty boxes and question marks in boxes are also possible; both indicate that the cell's value has not been validated under the current inputs. In addition, the question mark indicates that validating the cell would increase testedness.)

A second, more important, result of the teacher's validation action is that the colors of the validated cell's borders become more blue, indicating that data dependencies between the validated cell and the cells it references have been exercised in producing the validated values. From the border colors, the teacher is kept informed of which areas of the spreadsheet are tested and to what extent. Thus, in the figure, row 4's Letter cell's border is partially blue (purple), because some of the dependencies ending at that cell have now been tested. Testing results also flow upstream against dataflow to other cells whose formulas have been used in producing a validated value. In our example, all dependencies ending in row 4's Course cell have now been exercised, so that cell's border is now blue.

If the teacher chooses, she can also view dependencies by displaying dataflow arrows between cells or between subexpressions in formulas. In Figure 1(b), she has chosen to view dependencies ending at row 4's Letter cell. These arrows follow the same color scheme as the cell borders. A third visual device, a "percent tested" bar at the top of the spreadsheet, displays the percentage of dependencies that have been tested, providing the teacher with an overview of her testing progress.

Although the teacher need not realize it, the colors that result from placing checkmarks reflect the use of a definition-use test adequacy criterion [6] that tracks the data dependencies between cell formulas caused by references to other cells. Testing a program "perfectly" (well enough to guarantee detecting all faults) generally requires too many inputs; a test adequacy criterion provides a way to distribute a testing effort across elements of the program. In the spreadsheet paradigm, we say that a cell is fully tested if all its data dependencies have been covered by tests; those cells have their borders painted blue. Cells for which dependencies have not been fully covered have borders ranging from red to various shades of purple. The overall testing process is similar to the process used by professional programmers in "white box" unit testing, in which inputs are applied until some level of code coverage has been achieved. In the spreadsheet environment, however, the process is truly incremental, bearing some similarity to test-driven development approaches. These considerations, along with the testing theory underlying this methodology, are described in detail in [9].
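To make the bookkeeping behind these colors concrete, the following is a minimal sketch, not the implementation described in [9], of how a WYSIWYT-style environment might track coverage. It simplifies the definition-use criterion to one association per cell reference (the real criterion tracks associations at the granularity of a formula's subexpressions, which is why row 4's Letter cell is only purple in Figure 1 after one validation), and it treats a checkmark as covering every association that contributed to the validated value. All cell names are illustrative.

    # Minimal sketch of WYSIWYT-style coverage bookkeeping (illustrative only).
    # Simplification: one du-association per (referenced cell -> referencing formula) edge.

    class Sheet:
        def __init__(self):
            self.refs = {}      # cell -> set of cells its formula references
            self.covered = {}   # cell -> referenced cells covered by validations

        def define_formula(self, cell, referenced_cells):
            """(Re)defining a formula makes its du-associations (re)start as untested."""
            self.refs[cell] = set(referenced_cells)
            self.covered[cell] = set()

        def validate(self, cell):
            """A checkmark on `cell` covers the associations on its backward slice."""
            for c in self._backward_slice(cell):
                self.covered[c] = set(self.refs[c])

        def _backward_slice(self, cell):
            """Cells contributing, directly or transitively, to `cell`'s value."""
            seen, stack = set(), [cell]
            while stack:
                c = stack.pop()
                if c not in seen:
                    seen.add(c)
                    stack.extend(self.refs.get(c, ()))
            return seen

        def percent_tested(self):
            """The 'percent tested' bar: covered associations over all associations."""
            total = sum(len(r) for r in self.refs.values())
            done = sum(len(c) for c in self.covered.values())
            return 100.0 if total == 0 else 100.0 * done / total

        def border_color(self, cell):
            """Blue when fully tested, red when untested, purple in between."""
            total, done = len(self.refs[cell]), len(self.covered[cell])
            if done == total:
                return "blue"
            return "red" if done == 0 else "purple"


    # Loosely following Figure 1: row 4's Course cell references the grade cells,
    # and its Letter cell references Course.
    sheet = Sheet()
    sheet.define_formula("HWAVG_4", [])
    sheet.define_formula("Midterm_4", [])
    sheet.define_formula("Final_4", [])
    sheet.define_formula("Course_4", ["HWAVG_4", "Midterm_4", "Final_4"])
    sheet.define_formula("Letter_4", ["Course_4"])
    sheet.validate("Letter_4")             # one checkmark exercises the whole chain
    print(sheet.border_color("Course_4"))  # blue: all of Course's dependencies covered
    print(sheet.percent_tested())          # 100.0 here; the real, finer-grained
                                           # criterion leaves Letter purple in Fig. 1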
Our teacher may eventually need to try different input values in certain cells in the spreadsheet, to cause other dependencies between formulas to come into play so their results can be checked. This process of conjuring up suitable inputs can be difficult, even for professional programmers, but help is available.

Help-Me-Test. To get help finding inputs to further test a cell, the teacher selects that cell and pushes the Help-Me-Test button in the spreadsheet toolbar. The system responds by attempting to generate inputs [5]. The system first constructs representations of the chains of dependencies that control the execution of particular data dependencies; then it iteratively explores portions of these chains, applying constrained linear searches over the spreadsheet's input space and data gathered through iterative executions. If the system succeeds, suitable input values appear in the cells, providing the teacher with new opportunities to validate. Our empirical results show that Help-Me-Test is typically highly effective and provides fast response [5].
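The following sketch conveys the flavor of such input generation under heavy simplification: random re-sampling of one input cell at a time stands in for the constrained, dependence-guided searches described in [5], and the cell names, weights, and coverage target are hypothetical.

    # Illustrative sketch only; not the algorithm of [5]. Re-sample input cells
    # until the recomputed sheet exercises a not-yet-covered dependency.
    import random

    def help_me_test(inputs, recompute, exercises_target, tries=1000, lo=0, hi=100):
        """inputs: dict of input-cell values; recompute(inputs) -> computed values;
        exercises_target(values) -> True once the target dependency executes."""
        candidate = dict(inputs)
        for _ in range(tries):
            values = recompute(candidate)
            if exercises_target(values):
                return candidate                      # show these values in the cells
            cell = random.choice(list(candidate))     # pick one input cell...
            candidate[cell] = random.uniform(lo, hi)  # ...and try a new value for it
        return None                                   # give up; keep the old inputs

    # Example: find inputs that push row 5's Course average below 60, so the "F"
    # branch of the Letter formula can finally be exercised and validated.
    # (The 40/30/30 weights are hypothetical, not the article's actual formula.)
    def recompute(vals):
        course = 0.4 * vals["HWAVG_5"] + 0.3 * vals["Midterm_5"] + 0.3 * vals["Final_5"]
        return {"Course_5": course}

    found = help_me_test({"HWAVG_5": 89, "Midterm_5": 89, "Final_5": 89},
                         recompute, lambda v: v["Course_5"] < 60)
    print(found)   # e.g. {'HWAVG_5': 5.3, 'Midterm_5': 89, 'Final_5': 89} or similar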
Finding faults. Suppose that in the process of testing, the teacher notices that row 5's Letter grade ("A") is incorrect. There must be some error in our teacher's formulas, but how shall she find it? This is a thorny problem even for professional programmers, and various technologies have been proposed to assist them. Some of these technologies build on information available to the system about successful and failed tests and about dependencies [11]. We are experimenting with approaches that draw from these roots [10]; here, we describe one of them.

Our teacher indicates that row 5's Letter grade is erroneous by placing an X mark in it. Row 5's Course average is obviously also erroneous, so she X's that one, too. As Figure 1(c) shows, both cells now have pink interiors, but Course is darker than Letter because Course contributed to two incorrect values (its own and Letter's), whereas Letter contributed only to its own. These colorings reflect the likelihood that the cell formulas contain faults, with darker shades reflecting greater likelihood. The goal is to help the teacher prioritize which potentially suspicious formulas to investigate first, in terms of their likelihood of contributing to a fault. Although this example is too small for the shadings to contribute a great deal, users in our empirical work who used the technique on larger examples did tend to follow the darkest cells. When they did so, they were automatically guided into dataflow debugging, which paid off in their debugging effectiveness.
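A minimal sketch of the kind of bookkeeping that can produce such shadings follows; it is illustrative only, not the specific heuristics studied in [10]. A cell's fault likelihood rises with the number of X-marked values it contributed to, and contributions to validated values can be used to lower it, which is one common ingredient of such heuristics.

    # Illustrative fault-localization shading; not the exact heuristics of [10].

    def backward_slice(cell, refs):
        """All cells contributing, directly or transitively, to `cell`'s value."""
        seen, stack = set(), [cell]
        while stack:
            c = stack.pop()
            if c not in seen:
                seen.add(c)
                stack.extend(refs.get(c, ()))
        return seen

    def fault_likelihood(refs, x_marked, validated):
        """Return cell -> [failures it feeds, passes it feeds]."""
        counts = {c: [0, 0] for c in refs}
        for marks, idx in ((x_marked, 0), (validated, 1)):
            for m in marks:
                for c in backward_slice(m, refs):
                    counts[c][idx] += 1
        return counts

    def shade(failures, passes):
        """Darker pink for more failures; passes lighten the shade (arbitrary scale)."""
        return max(0, 2 * failures - passes)

    # Figure 1(c): the teacher X-marks row 5's Letter and Course values.
    refs = {"Letter_5": ["Course_5"],
            "Course_5": ["HWAVG_5", "Midterm_5", "Final_5"],
            "HWAVG_5": [], "Midterm_5": [], "Final_5": []}
    counts = fault_likelihood(refs, x_marked={"Letter_5", "Course_5"}, validated=set())
    print(counts["Course_5"])   # [2, 0]: contributed to two wrong values -> darker pink
    print(counts["Letter_5"])   # [1, 0]: contributed only to its own -> lighter pink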
Suppose that, with the help of the colorings, our teacher fixes the fault in the Course cell. (The weights in the weighted average did not add up to exactly 100%.) When she completes her edit, the underlying algorithms partner with the spreadsheet evaluation engine, visiting affected cells to recalculate the dependencies between formulas that might be affected by the change. These dependencies are marked untested, and the refreshed display shows the resulting colors, directing the teacher's attention to the portions of the spreadsheet that should now be retested.

Assertions

Testing can reveal faults, but it may not reveal them all. Recent empirical work on human programming errors [7] categorized the types of errors participants made in introducing or attempting to remove faults. In that study, most errors were due to poor strategies and to attention problems, such as paying attention to the wrong part of the program or working-memory overload interfering with efforts to track down the fault. For professional programmers, assertions in the form of preconditions, postconditions, and invariants help with these issues, because assertions can continuously attend to the entire program, reasoning about the properties the programmers expect of their program logic and about interactions between different sections of the program.

Our approach to assertions [3] attempts to provide these same advantages to end-user programmers such as the teacher. These assertions are composed of Boolean expressions about cells' values. They look like enumerations of values and/or ranges of valid values, and these enumerations and ranges can also be composed ("and"ed and "or"ed together).

For example, suppose the teacher had not noticed the fault in row 5's Course cell after all; we will show how assertions can be used to detect that fault. Suppose she creates assertions to continually monitor whether all numeric cells in row 5 will be between 0 and 100. To do so, she can either type ranges, as in Figure 2, or use a graphical syntax. The assertions she enters (next to the stick figures) provide a cross-check that can automatically alert the teacher to even subtle faults, such as getting the weights slightly wrong in the Course grade calculation.

[Figure 2. When the teacher enters assertions (shown next to the stick figures), the system propagates them to deduce more assertions, such as the system-generated "0.0 to 105.0" shown next to the computer icon. In this case, a conflict was detected (circled in red), revealing a fault.]

That power goes far beyond simply checking cell values against the user-entered assertions, and derives mainly from two sources: aggressive participation by Help-Me-Test, and propagation of some of the user-entered assertions to new system-generated assertions on downstream cells.

When assertions are present, Help-Me-Test's behavior is slightly different than we've described. For cells with only constants for formulas, it politely stays within the ranges specified by the assertions. But when cells with non-constant formulas have assertions, Help-Me-Test aggressively tries to derive input cell values that will violate those assertions on the downstream cells. Thus, the presence of assertions turns Help-Me-Test into an aggressive seeker of faults.

The propagation to system-generated assertions (for example, the "0.0 to 105.0" next to the computer icon in Figure 2) produces three other ways assertions can semiautomatically identify faults. First, the system automatically monitors all values as they change, to see if they violate any of the assertions. Whenever a cell's value does violate an assertion, the system circles the value in red; for example, whenever a student's Course average does not fall between 0 and 100, the system will circle it. Second, assertions might conflict with each other, as in Figure 2, in which case the system circles the conflict in red. Conflicts indicate that either there is a fault in the cell's formula or there are erroneous user-entered assertions. Third, the system-generated assertions might simply look wrong to the user, again indicating the presence of formula faults or user-entered assertion errors. All three ways of identifying faults have been used successfully by end users. For example, in an empirical study [3], participants using assertions were significantly more effective at debugging spreadsheet formulas than participants without access to assertions.
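To give a feel for how a user-entered range can become a system-generated assertion such as the "0.0 to 105.0" in Figure 2, here is a minimal sketch using interval arithmetic over a weighted-sum Course formula. The weights, the conflict rule, and the sample value are assumptions chosen to be consistent with the figure; the actual propagation and conflict-detection rules of [3] are richer.

    # Illustrative assertion propagation through a weighted-sum Course formula.
    # The 0.40/0.30/0.35 weights are hypothetical: the article says only that the
    # weights failed to sum to 100%, and this choice reproduces Figure 2's
    # "0.0 to 105.0" system-generated assertion.

    def propagate_weighted_sum(weights, input_ranges):
        """System-generated (low, high) range for a cell computed as sum(w * input),
        assuming non-negative weights."""
        low = sum(w * lo for w, (lo, hi) in zip(weights, input_ranges))
        high = sum(w * hi for w, (lo, hi) in zip(weights, input_ranges))
        return (low, high)

    def violates(value, assertion):
        """First way to spot a fault: a value falls outside an assertion (red circle)."""
        lo, hi = assertion
        return not (lo <= value <= hi)

    def conflict(user_range, system_range):
        """Second way (one plausible rule): the formula can produce values the
        user's assertion says should never occur."""
        return system_range[0] < user_range[0] or system_range[1] > user_range[1]

    user_range = (0.0, 100.0)              # teacher's "0 to 100" on row 5's cells
    grade_ranges = [(0.0, 100.0)] * 3      # assertions on HWAVG, Midterm, Final
    faulty_weights = [0.40, 0.30, 0.35]    # hypothetical weights summing to 1.05

    system_range = propagate_weighted_sum(faulty_weights, grade_ranges)
    print(system_range)                        # (0.0, 105.0), as in Figure 2
    print(conflict(user_range, system_range))  # True: the conflict is circled in red
    print(violates(93.45, user_range))         # False: row 5's value itself looks fine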
The Surprise-Explain-Reward Strategy

A key to the power of assertions is the propagation aspect, which can happen only if there is an initial source of assertions from which to propagate. In some cases, initial sources of assertions might themselves be derivable (such as through statistical monitoring of input data [8] or based on nearby labels and annotations [4]). However, in other cases, the only possible source is the teacher herself. Still, it does not seem reasonable to expect the teacher to seek out an assertions feature in a spreadsheet environment. Given end users' unfamiliarity with quality control methods for software, strategies must be devised by which end-user software engineering approaches capture the interest of end-user programmers and motivate them to take appropriate steps that will enhance their software's correctness.

We have devised a strategy that aims to motivate end users to make use of software engineering devices, and to provide the just-in-time support needed to follow up effectively on this interest. Our strategy is termed Surprise-Explain-Reward [12]. It aims to choose timely moments to inform end users of the benefits, costs, and risks [1] of the software engineering devices available and of potential faults in the spreadsheet, so they can make informed choices about what actions to take next. It uses the element of surprise to attempt to arouse the curiosity of end users, and if they become interested, the system follows up with explanations and, potentially, rewards.

For example, Help-Me-Test uses the element of surprise as a springboard in the Surprise-Explain-Reward strategy to introduce users to assertions. Whenever our teacher invokes Help-Me-Test, the system not only generates values for input cells, but also creates "guessed" assertions to place on these cells (usually blatantly incorrect ones, so as to surprise). For example, in Figure 3, when the teacher selected row 5's Letter cell and pushed Help-Me-Test, while generating new values (indicated by thickened borders), Help-Me-Test also guessed some assertions. These guessed assertions, which we refer to as HMT assertions (because they are generated by Help-Me-Test), are intended to surprise the teacher into becoming curious about assertions. She can satisfy her curiosity using tool tips, as in Figure 3, which explain to her the meaning and rewards of assertions.

[Figure 3. While generating new values that will help increase the testedness of row 5's Letter cell (indicated by thickened borders), Help-Me-Test also guessed some assertions, such as "-7 to 176". Its tool tip explains: "The computer's testing caused it to wonder if this would be a good guard. Fix the guard to protect against bad values, by typing a range or double-clicking."]
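The article does not describe how these guesses are computed; the following is a purely hypothetical sketch of one way deliberately eye-catching guesses could be derived from the values a Help-Me-Test run explored, included only to make the "surprise" step concrete. The function name, the slack parameter, and the sample values are all illustrative.

    # Purely hypothetical: widen the span of explored values by an arbitrary margin
    # so the guessed assertion is usually blatantly too wide and invites correction.

    def guess_assertion(values_tried, slack=0.5):
        """Return a (low, high) guess meant to look wrong (the "surprise" step)."""
        lo, hi = min(values_tried), max(values_tried)
        margin = slack * (hi - lo)
        return (lo - margin, hi + margin)

    # Values a Help-Me-Test run might have tried for one of row 5's grade cells.
    print(guess_assertion([7, 57, 32, 110]))   # (-44.5, 161.5): surprising for a 0-100 cell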
If she follows up by accepting an HMT assertion (either as guessed or after editing it), the resulting assertion will be propagated as seen earlier in Figure 2. As a result, the system may detect some problems; if so, red circles will appear as in Figure 2. If the red circles identify faults, the circles (and assertions) also serve as rewards.

It is important to note that, although our strategy rests on surprise, it does not attempt to rearrange the teacher's work priorities by requiring her to do anything about the surprises. No dialog boxes pop up and there are no modes. HMT assertions are a passive feedback system; they try to win the user's attention but do not require it. If the teacher chooses to follow up, she can mouse over the assertions to receive an explanation, which explicitly mentions the rewards for pursuing assertions. In a behavior study [12], users did not always attend to HMT assertions for the first several minutes of their task; thus it appears that the amount of visual activity is reasonable for requesting, but not demanding, attention. However, almost all of them did eventually turn their attention to the assertions, and when they did, they used them effectively.

We have conducted more than a dozen empirical studies related to end-user software engineering. Some of the results of these studies have been discussed here; the main results are summarized in Table 1. Some of our studies were conducted early in the development of our end-user software engineering devices, so as to influence their design at early stages; these are labeled "formative." Other studies evaluated the effectiveness of the devices at much later stages; these are labeled "summative."

Table 1. Empirical work to date on end-user software engineering devices. (More details about the studies are in [3, 5, 9, 10, 12] and at www.engr.oregonstate.edu/~burnett/ITR2000/empirical.html.)

WYSIWYT: 5 summative studies; end users and computer science students; spreadsheets.
• WYSIWYT was associated with more effective and efficient testing and debugging.
• End users with WYSIWYT tested more than those without WYSIWYT.
• WYSIWYT helped reduce overconfidence in spreadsheet correctness, but did not completely resolve this issue.

Help-Me-Test: 2 formative and 2 summative studies; end users; spreadsheets.
• End users tended to test as much as they could without help initially, but when they eventually turned to Help-Me-Test, they commented favorably about it and continued to use it.
• Users were willing to wait a long time for Help-Me-Test to try to find a value, and in the circumstances when it could not, they did not tend to lose confidence in the system.
• Users did not always make correct decisions about which values were right and which were wrong.
• Help-Me-Test's algorithms were usually able to generate new test values quickly enough to maintain responsiveness.

Fault Localization: 3 formative studies; end users.
• Different fault localization heuristics had very different advantages early in users' testing processes. Although some of the techniques tended to converge given a lot of tests, users did not tend to run enough tests to reach this point.
• When users made mistaken decisions about value correctness, their mistakes almost always assumed too much correctness (not too little).
• Early computations, before the system has collected much information, may be the most important for shaping users' attitudes about the usefulness of fault localization devices.
• Those who used the technique tended to follow dataflow strategies about twice as much as the other participants, and the dataflow strategy was the only one tied to identification of "non-local" faults.

Assertions: 2 formative and 1 summative study; end users.
• End users using assertions were more effective and faster at debugging.
• Assertions were usable by end users.

Surprise-Explain-Reward: 2 formative and 1 summative study; end users.
• Comfort level and experience with the spreadsheet paradigm were important factors in determining whether "surprises" were motivating (interesting, arousing curiosity) or demotivating (perceived as too costly or risky).
• Surprise-Explain-Reward was effective in encouraging end users to use assertions, without forcing use of assertions before the users were ready.
• The type of communication used to convey "surprises" may critically affect users' problem-solving strategies and productivity.
Conclusion

Giving end-user programmers ways to easily create their own programs is important, but it is not enough. Like their counterparts in the world of professional software development, end-user programmers need support for other aspects of the software life cycle. However, because end users differ from professional programmers in background, motivation, and interest, the end-user community cannot be served by simply repackaging techniques and tools developed for professional software engineers. Directly supporting these users in software development activities beyond the programming stage, while at the same time taking their differences in background, motivation, and interests into account, is the essence of the end-user software engineering vision. As our empirical results show, an end-user programming environment that employs the approaches we describe here can significantly improve the ability of end-user programmers to safeguard the dependability of their software.

References

1. Blackwell, A. First steps in programming: A rationale for attention investment models. In Proceedings of the IEEE Symposium on Human-Centric Computing Languages and Environments (Arlington, VA, Sept. 3–6, 2002), 2–10.
2. Boehm, B., Abts, C., Brown, A., Chulani, S., Clark, B., Horowitz, E., Madachy, R., Reifer, D., and Steece, B. Software Cost Estimation with COCOMO II. Prentice Hall PTR, Upper Saddle River, NJ, 2000.
3. Burnett, M., Cook, C., Pendse, O., Rothermel, G., Summet, J., and Wallace, C. End-user software engineering with assertions in the spreadsheet paradigm. In Proceedings of the International Conference on Software Engineering (Portland, OR, May 3–10, 2003), 93–103.
4. Burnett, M. and Erwig, M. Visually customizing inference rules about apples and oranges. In Proceedings of the IEEE Symposium on Human-Centric Computing Languages and Environments (Arlington, VA, Sept. 3–6, 2002), 140–148.
5. Fisher, M., Cao, M., Rothermel, G., Cook, C., and Burnett, M. Automated test generation for spreadsheets. In Proceedings of the International Conference on Software Engineering (Orlando, FL, May 19–25, 2002), 141–151.
6. Frankl, P. and Weyuker, E. An applicable family of data flow testing criteria. IEEE Trans. Software Engineering 14, 10 (Oct. 1988), 1483–1498.
7. Ko, A. and Myers, B. Development and evaluation of a model of programming errors. In Proceedings of the IEEE Symposium on Human-Centric Computing Languages and Environments (Auckland, New Zealand, Oct. 28–31, 2003), 7–14.
8. Raz, O., Koopman, P., and Shaw, M. Semantic anomaly detection in online data sources. In Proceedings of the International Conference on Software Engineering (Orlando, FL, May 19–25, 2002), 302–312.
9. Rothermel, G., Burnett, M., Li, L., DuPuis, C., and Sheretov, A. A methodology for testing spreadsheets. ACM Trans. Software Engineering and Methodology 10, 1 (Jan. 2001), 110–147.
10. Ruthruff, J., Creswick, E., Burnett, M., Cook, C., Prabhakararao, S., Fisher II, M., and Main, M. End-user software visualizations for fault localization. In Proceedings of the ACM Symposium on Software Visualization (San Diego, CA, June 11–13, 2003), 123–132.
11. Tip, F. A survey of program slicing techniques. J. Programming Languages 3, 3 (1995), 121–189.
12. Wilson, A., Burnett, M., Beckwith, L., Granatir, O., Casburn, L., Cook, C., Durham, M., and Rothermel, G. Harnessing curiosity to increase correctness in end-user programming. In Proceedings of the ACM Conference on Human Factors in Computing Systems (Ft. Lauderdale, FL, Apr. 3–10, 2003), 305–312.

Margaret Burnett (burnett@eecs.orst.edu) is a professor in the School of Electrical Engineering and Computer Science at Oregon State University, Corvallis, OR.
Curtis Cook (cook@eecs.orst.edu) is Professor Emeritus in the School of Electrical Engineering and Computer Science at Oregon State University, Corvallis, OR.
Gregg Rothermel (grother@eecs.orst.edu) is an associate professor in the School of Electrical Engineering and Computer Science at Oregon State University, Corvallis, OR.

This work was supported in part by NSF under ITR-0082265 and in part by the EUSES Consortium via NSF's ITR-0325273. Collaborators on this project are listed at www.engr.oregonstate.edu/~burnett/ITR2000.