Archive for the 'Paper Notes' Category

Peer Evaluation Dysfunctions

July 30, 2009

Every business is a system, and the performance of individuals is largely the result of the way the system operates… the system causes 80% of the problems in a business, and the system is management’s responsibility.

Ranking people for merit raises pits individual employees against each other and strongly discourages collaboration.

There is no greater demotivator than a reward system that is perceived to be unfair. It doesn’t matter whether the system is fair or not. If there is a perception of unfairness, then those who think that they have been treated unfairly will rapidly lose their motivation.

When we optimize a part of a chain, we invariably suboptimize overall performance.

—Mary Poppendieck, The Best Software Writing I, 2005.


Notes on “An experimental investigation of the impact of individual, program, and organizational characteristics on software maintenance effort”

July 2, 2009

An experimental investigation of the impact of individual, program, and organizational characteristics on software maintenance effort: Ramanujan, S. and Scamell, R.W. and Shah, J.R.

Maintenance effort is expensive.

The Human Information Processing model is a model of how cognition works. Information travels back and forth from our extremely short-term sensory buffers, to short-term memory, to buffer memory, to long-term memory. Short-term memory has a small capacity. Information moves from short-term to long-term memory via rehearsal. Long-term memory is subdivided into episodic and non-episodic memory. Episodic memory includes time and place information (what we think of as “memories”). Non-episodic memory is what we would typically call “knowledge.” We subdivide non-episodic memory into semantic and syntactic knowledge. Semantic knowledge is abstract knowledge independent of context. Syntactic knowledge is specific, contextual, rote-memorization knowledge. The buffer is essentially the focus of our attention, including the active subset of long-term memory.

Maintenance tasks are subdivided into program comprehension, program modification and program composition.

Program comprehension -> Creation in the buffer of a semantic understanding of the program using syntactic and semantic knowledge from LTM.

Program modification -> Manipulations on the buffer structure. Correction of the buffer structure based on discrepancies from output.

Program composition -> Problem analyzed in buffer into ‘given state’ and ‘desired state.’ To achieve desired state, knowledge is transferred from LTM into buffer. Problem solution conceived as plan for program. Stepwise refinement on this plan.

43 predictions from HIP supported in Ramanujan and Cooper 1994.

There is little agreement in the literature about how to operationalize software maintenance effort.

Some options: Number of repair requests, number of repairs per production run. Assumption: All repairs of equal magnitude. Total cost of maintenance in terms of labor hours. Assumption: Constant quality of labor.

Proposal: Labor hours weighted by quality of labor. (Unclear how to operationalize quality of labor)

For this study: Time required to successfully implement changes.

The study tested seven hypotheses and found support for most. Conclusions include: Keep control-flow complexity low, name variables well, don’t use novices for maintenance on large software.

Constructs in the hypotheses include: Program control-flow complexity, program size, time to maintain, variable-name mnemonicity, programmer semantic knowledge, and time pressure.

For my purposes, control-flow complexity and variable-name mnemonicity may not be relevant for many DSLs.

Notes on “An empirical study of fault localization for end-user programmers”

June 30, 2009

An empirical study of fault localization for end-user programmers: Ruthruff, J.R. and Burnett, M. and Rothermel, G.

End-user programming is the majority of all programming. End-user programmers do not have adequate tool support for the tasks they perform, which include spreadsheet programming, macro creation and rule-based programs. This paper focuses on fault localization.

A fault localization technique has two factors that are important to its performance. The first is the information base: the type of information maintained by the technique. The second is the mapping: the way the technique maps that information onto feedback.

Four contributions: Empirical investigation of the importance of the two factors. Empirical data on three information bases and three mappings. Insights into how to measure fault localization technique effectiveness. Finally, evidence that users make mistakes when testing and debugging end-user programs.

Notes on “Intelligent Analysis and Off-Line Debugging of VLSI Device Test Programs”

June 30, 2009

Intelligent Analysis and Off-Line Debugging of VLSI Device Test Programs: Ma, Y. and Shi, W.

I thought, based on the title, that this paper would give some indication of a debugging experience on a VLSI Device Test Program (which seemed like a likely candidate for a DSL). It didn’t.

Notes on “Program understanding behavior during {debugging, corrective maintenance} of large scale software”

June 29, 2009

Program understanding behavior during debugging of large scale software: Von Mayrhauser, A. and Vans, A.M.

Program understanding behavior during corrective maintenance of large-scale software: Vans, A.M. and von Mayrhauser, A. and Somlo, G.

Very similar papers.


  1. What do programmers do when debugging code?
  2. Do programmers follow the Integrated Comprehension Model?
  3. Is there a common comprehension process? What is the role of hypotheses?
  4. What types of information do programmers look for during corrective maintenance?

An observational study of four programmers.

Integrated Comprehension Model has four components: Top down model (domain/application level), situation model (domain-independent, algorithmic level (e.g. a sort)), program model (bottom-up, control flow), knowledge base.

Programmer actions:

  • “Use of knowledge”, generating hypotheses
  • Chunking and learning new things at lower levels of abstraction


  • Comprehension occurs at low levels of abstraction while domain knowledge is insufficient
  • Domain knowledge allows rapid assimilation of software knowledge
  • Situation model is a bridge between program and top-down models.

Information needs:

  • Programmers need to build up their models at all levels of abstraction.

Programmers make hypotheses at all levels of abstraction. Most hypotheses are confirmed or abandoned; a few fail.

Editorial comment 1: Two near-identical papers. I wonder what happened.

Editorial comment 2: The conclusion:

“We consider these conclusions to be working hypotheses rather than generally validated behavior. The sample of subjects was simply too small. These conclusions should be validated through further observations. Unfortunately, such observational studies are costly.”

It’s just wrong. It confuses quantitative and qualitative research AND it makes excuses.

Editorial comment 3: For some reason, these papers were really hard to read. I guess I shouldn’t expect program comprehension researchers to know anything about paper comprehension.

Notes on “A framework and methodology for studying the causes of software errors in programming systems”

June 27, 2009

“A framework and methodology for studying the causes of software errors in programming systems”: Andrew Ko and Brad Myers.

Six-part paper. Expanded version of Ko 2003. 40+ pages.

Section 1: Software errors are common and expensive. Existing tools for describing programming system error-proneness are inadequate.

Section 2: Classifications of programming difficulties

Four salient aspects of software errors:

  1. Surface Qualities: Notational issues
  2. Cognitive Cause: Forgetting, misunderstanding, not knowing, etc.
  3. Programming Activity: Specification, algorithm design, etc.
  4. Type of Action leading to the error: Creating code, reusing code, modifying specifications or code, designing, etc.

James Reason’s Human Error gives a systemic view of failures and a catalog of common cognition failures.

A system has several layers, each with its own defenses. An error in one can lie dormant until it interacts with errors in other layers.

In software engineering, the layers could be viewed as:

  • Specifications (Ambiguous, incomplete, incorrect)
  • Programmer (Knowledge, attention, expertise)
  • Programming System (Compiler errors, static checkers, syntax highlighting, assertions, etc.)
  • Program (Software errors, possibly predisposed by one of the previous layers.)

An error at the programmer layer is called a cognitive breakdown.

Three types of cognitive activity:

  1. Skill-based activity: Routine actions that don’t really require much attention, e.g. opening a file. Fails due to inattention breakdowns (e.g. a strong but wrong habit kicks in, or a routine action is interrupted but not resumed) or overattention.
  2. Rule-based activity: Learned expertise, e.g. in C, to map an action over an array, use a for loop. Fails because of broken rules or wrong rule choice.
  3. Knowledge-based activity: Conscious, deliberate problem-solving. Fails due to cognitive limitations and biases inherent in cognition. e.g. satisficing incorrectly, confirmation bias + overconfidence.

Section 3: Framework

Programmers have three types of activities: Specification, Implementation, Runtime

Six actions: Design, create, reuse, modify, understand, explore

Three breakdowns: Skill, Rule, Knowledge

Cognitive breakdown has four parts:

(What kind of breakdown) during (what action) through (what interface) on (what information).

Chains of cognitive breakdowns lead to software errors.
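The four-part template above can be written down as a small record type, which makes a chain of breakdowns easy to store and compare across observers. This is only a sketch of how I would encode it: the class and field names are my own, and the example chain is invented for illustration, not taken from the paper’s data.

```python
from dataclasses import dataclass
from enum import Enum


class Breakdown(Enum):
    SKILL = "skill"
    RULE = "rule"
    KNOWLEDGE = "knowledge"


class Action(Enum):
    DESIGN = "design"
    CREATE = "create"
    REUSE = "reuse"
    MODIFY = "modify"
    UNDERSTAND = "understand"
    EXPLORE = "explore"


@dataclass(frozen=True)
class CognitiveBreakdown:
    """(What kind of breakdown) during (what action) through (what interface) on (what information)."""
    kind: Breakdown
    action: Action
    interface: str    # free text; hypothetical field name
    information: str  # free text; hypothetical field name

    def describe(self) -> str:
        return (f"{self.kind.value} breakdown during {self.action.value} "
                f"through {self.interface} on {self.information}")


# An invented chain: each breakdown sets up the next, ending in a software error.
chain = [
    CognitiveBreakdown(Breakdown.SKILL, Action.REUSE,
                       "code editor", "pasted object reference"),
    CognitiveBreakdown(Breakdown.KNOWLEDGE, Action.UNDERSTAND,
                       "program output", "runtime behaviour"),
]

for step in chain:
    print(step.describe())
```

Using enums for the breakdown kind and the action keeps observers to the framework’s fixed vocabulary, while the interface and information stay free text because the notes do not enumerate them.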

Section 4: Empirical methodology for using the framework to study a programming system’s error-proneness

  1. Design a programming task
  2. Observe using think-aloud
  3. Work backward from software error to causes
  4. Analyze chains of breakdowns for patterns and relationships

Section 5: Case study of using framework on Alice

40 hours of analysis on 15 hours of video.

Most errors caused by incorrect boolean conditions, copy-paste with references offscreen. It seems fairly straightforward to map errors through the framework.

Section 6: Discussion

The framework allows empirical comparisons. The methodology is limited in that it requires a lot of work, it may be difficult to learn, and its replicability has not been tested.

Notes on “Statistical debugging: A hypothesis testing-based approach”

June 26, 2009

Statistical debugging: A hypothesis testing-based approach: C. Liu. et al.

Gather correct and incorrect program executions, collecting data about when predicates throughout the program evaluate to true or false. Some predicates statistically discriminate correct and incorrect executions, and are excellent starting points for a search for faults.

The tool is called SOBER. Its results are incrementally better than other similar tools.
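To make the idea concrete, here is a minimal sketch of predicate ranking. It is not SOBER’s actual statistic: I simply compare the average fraction of times each predicate evaluates to true in passing runs against failing runs, and rank by the absolute difference. The function names and the sample data are invented for illustration.

```python
from statistics import mean


def true_fraction(true_count, false_count):
    """Fraction of evaluations in one run where the predicate was true."""
    total = true_count + false_count
    return true_count / total if total else None


def fractions_for(name, runs):
    """True-fractions of one predicate across the runs that evaluated it."""
    values = [true_fraction(*run[name]) for run in runs if name in run]
    return [v for v in values if v is not None]


def rank_predicates(passing_runs, failing_runs):
    """Rank predicates by how differently they behave in passing and failing runs.

    Each run is a dict mapping predicate name -> (times true, times false).
    """
    names = set()
    for run in passing_runs + failing_runs:
        names.update(run)
    scores = {}
    for name in names:
        passing = fractions_for(name, passing_runs)
        failing = fractions_for(name, failing_runs)
        if passing and failing:
            scores[name] = abs(mean(failing) - mean(passing))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented data: two passing runs and two failing runs.
passing_runs = [
    {"x > 0": (10, 0), "len(buf) == 0": (1, 9)},
    {"x > 0": (8, 2), "len(buf) == 0": (0, 10)},
]
failing_runs = [
    {"x > 0": (9, 1), "len(buf) == 0": (7, 3)},
    {"x > 0": (10, 0), "len(buf) == 0": (9, 1)},
]

ranking = rank_predicates(passing_runs, failing_runs)
for name, score in ranking:
    print(f"{score:.2f}  {name}")
```

In this data the `len(buf) == 0` predicate is mostly true in failing runs and mostly false in passing runs, so it ranks first and would be where a search for the fault starts.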

This approach is mostly nullified in several DSL implementation strategies.

Notes on “Debugging by asking questions about program output”

June 25, 2009

Debugging by asking questions about program output: Andrew Ko

A key to the difficulty of debugging is the difficulty of mapping conceptual-level questions about program behaviour to concrete questions about code.

This paper introduces the Whyline.

Key motivators: Developers form multiple false hypotheses of failure causes. Unchecked false hypotheses eventually cause problems. Developers break good code while ‘fixing’ false hypothesis causes.

Therefore: Allow developers to specify incorrect output. Correct incorrect assumptions early. (Although it isn’t clear that the assumption-correction is implemented in Whyline or implementable.)

To build a whyline:

  • Outputs are defined at a conceptual level and must be recorded. (e.g. Paint commands)
  • Developers restrict the domain of their questions in several ways:
    • Type of causality (Why or why didn’t)
    • Scope of question’s objects
    • Relative to events
    • Features of output types
  • Analyze causality of processing leading to the incorrect output
  • Aggregate chains of causality that executed in similar contexts

Notes on “Development and evaluation of a model of programming errors”

June 24, 2009

Development and evaluation of a model of programming errors: Ko and Myers.

The authors created a model of programming errors based on J. Reason’s general model of human errors.

In the model, breakdowns can occur in specification activities, implementation activities, or debugging activities. These breakdowns cascade into further breakdowns resulting in the introduction of errors in the program. A breakdown is caused by a knowledge, attentional or strategic problem when performing one of the subactivities.

Thus, we can categorize an error by the chain of breakdowns that led to it. Each breakdown is composed of a broken artifact, the activity that was being performed on the artifact at the time of error introduction, and whether the underlying problem was knowledge, attentional or strategic.

The authors validate the model using two observational studies using the Contextual Inquiry method. This means the experimenters tracked the subjects’ goals as they worked in addition to recording their activities. They fit the data using the model and show that the model has good descriptive and explanatory power.

This is among the nicest theory-driven ESE papers I’ve read. I think it’s a great approach. I suspect the authors of the “theory use in software engineering” paper would approve.

Now that they’ve planted this model:

  • We can ask of future studies of errors in software engineering, “Why didn’t you relate this work to the model?”
  • We can measure the reliability of the model by checking whether different observers of programming activity fit errors and breakdowns to the model in the same way.
  • We can merge in future refinements of the underlying cognitive psychology model.
  • We can develop baselines of prevalence of the various breakdown possibilities, and show how these change as programmers gain experience.
  • We can begin to measure quantitatively how the various constructs interact, then design experiments that control for these interactions.
  • Etc.

All this considered, it’s too bad Ko has moved on.

Notes on “An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks”

June 24, 2009

An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks: Ko et al.

This paper redescribes the same experiment as Ko 2005 [my summary]. It seems that one experiment generated at least four papers. This paper doesn’t seem to add much to the previous paper.

Notes on “Eliciting design requirements for maintenance-oriented IDEs: a detailed study of corrective and perfective maintenance tasks”

June 24, 2009

Eliciting design requirements for maintenance-oriented IDEs: a detailed study of corrective and perfective maintenance tasks: Ko, A.J. and Aung, H.H. and Myers, B.A.

A study of expert Java programmers performing maintenance tasks using Eclipse.

Maintenance interleaves three activities: Collecting code fragments (building a working set), navigating the dependencies of the fragments, actually editing code.

Time breakdown observation: 35% navigation, 46% inspecting irrelevant code. 88% of hypotheses are wrong. Programmers copied and pasted code. In 10% of the copies, a necessary change was forgotten, which took up to 18 minutes (3 minutes on average) to debug. A further 12% of the copies missed a piece, costing 4 minutes on average to debug.

Bigger screens/workspaces might help (though the experimental workspace was unusually tiny: 1024×768)

Other IDE proposals:

  • Support creating a working set of code fragments
  • Support saving and restoring working sets
  • Automatically add dependencies when a fragment is added
  • When pasting code, mark unchanged references as suspect
  • When pasting code, check if any dependencies were missed
  • Integrate slicing for program output

Notes on “Debugging: The Good, The Bad, and the Quirky–A qualitative analysis of novices’ strategies”

June 24, 2009

Debugging: the good, the bad, and the quirky–a qualitative analysis of novices’ strategies: L. Murphy et al.

Programming novices warm up by solving a programming problem. They are then given a semantically incorrect but syntactically correct solution to the same problem and asked to debug it. The researchers logged the activities during the debugging tasks, interviewed the subjects, and surveyed the subjects on debugging matters.

The researchers categorized the strategies/behaviours they witnessed. The 13* strategy categories match with at least one other study:

  1. Gain domain knowledge e.g. reread problem specification, examine sample output
  2. Tracing the program i.e. mentally or on paper, print statements
  3. Testing i.e. running the program
  4. Understanding code i.e. reading the code
  5. Using resources i.e. JavaDoc, Java tutorial
  6. Using tools i.e. a debugger
  7. Isolating the problem e.g. commenting out code, forcing certain control flow
  8. Pattern matching e.g. recognizing a cliched error
  9. Consider alternatives i.e. devising hypotheses
  10. Environmental e.g. documenting their work, using editor features such as undo
  11. Work around problem e.g. rewriting code, or introducing new code to work around errors in other code.
  12. Just in case e.g. Making unnecessary cleanup changes
  13. Tinkering i.e. random, unproductive changes

*The paper says 12 strategies; I think this was a typo since they list 13.

Many of the strategies were used ineffectively: Useless tracing, failing to note a failing test, shallow code reading, using unfamiliar tools, commenting out or modifying correct code, completely rewriting code. Some of the strategies are inherently ineffective, especially tinkering.

The paper also notes several amusing novice behaviours.

Notes on “Programmers use slices when debugging”

June 24, 2009

Programmers use slices when debugging: Mark Weiser

I’ve mentioned slicing before. This is one of the original papers.

When programmers use a “working backward” debugging strategy, they are in fact constructing slices. Weiser tested this with an experiment: Programmers debug three programs. They are then quizzed on the parts of the program they remember. The hypothesis is supported if the programmers remember slices relevant to the errors better than other parts of the program.
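For reference, a backward slice is the set of statements that can affect a chosen variable at a chosen point. The sketch below computes one for a straight-line program only; it ignores control flow, which a real slicer must handle, and the tuple representation of statements is my own invention for illustration.

```python
def backward_slice(statements, criterion_line, variable):
    """Lines of a straight-line program that may affect `variable` after `criterion_line`.

    statements: list of (line, defined_variable, used_variables) in execution order.
    Control flow is deliberately ignored in this sketch.
    """
    relevant = {variable}
    in_slice = []
    # Walk backward from the criterion, following data dependencies.
    for line, defined, used in reversed(statements):
        if line > criterion_line:
            continue
        if defined in relevant:
            in_slice.append(line)
            relevant.discard(defined)  # this definition kills earlier ones
            relevant.update(used)      # its inputs now matter instead
    return sorted(in_slice)


# A made-up six-line program:
#   1: price = read()
#   2: qty = read()
#   3: label = "order"
#   4: subtotal = price * qty
#   5: banner = label + "!"
#   6: total = subtotal * 1.1
program = [
    (1, "price", []),
    (2, "qty", []),
    (3, "label", []),
    (4, "subtotal", ["price", "qty"]),
    (5, "banner", ["label"]),
    (6, "total", ["subtotal"]),
]

print(backward_slice(program, 6, "total"))   # [1, 2, 4, 6]
print(backward_slice(program, 5, "banner"))  # [3, 5]
```

A programmer working backward from a wrong `total` would look at lines 6, 4, 2 and 1 and never need lines 3 and 5, which is the behaviour the experiment tries to detect through recall.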

Notes on “A systematic review of theory use in software engineering experiments”

June 23, 2009

A systematic review of theory use in software engineering experiments: Hannay, J.E. et al.

Theories are for the generalization of knowledge. They are the engine of our deeper understanding of the world. Experiments provide descriptions of causality, but theories explain causality.

The article is an overview of theory for the purposes of software engineering.

Challenges for a software engineering researcher: Finding relevant theories, determining what counts as a theory (called ‘theory identity’), using theories explicitly.

There are five types of theory:

  1. Analysis: Descriptions of the world. It is controversial whether these count as theories.
  2. Explanation: Explanations of the world lacking predictive power. A UML model can be viewed as a type 2 theory of the design of a software system
  3. Prediction: Providing predictions without explanations e.g. statistical models
  4. Explanation and prediction: Most commonly accepted kind of theories.
  5. Design and action: How-to instructions including design principles, usually with an implicit prediction of benefit.

The article contains a brief discussion of the epistemology of theory, which I don’t care about.

Theories are built of representations, constructs and relationships in a scope. These are conceptual-level concerns. Experiments operationalize these, e.g. experimental variables operationalize constructs, treatments operationalize causes, hypotheses operationalize predictions.

Theories can be used in several roles within experimental studies. A theory can inform the design of an experiment, it can be used to explain the results of an experiment post hoc, it can be tested by (and perhaps modified subsequent to) an experiment, it can be proposed given the results of an experiment, or it can form the basis of other theories that take the other roles.

The authors filtered much of current ESE research to identify 103 articles that mention theory use in one of the roles above. They performed some sophisticated analyses of theory use in these articles. Theories are rarely used more than once; they are most commonly used to inform research questions and the design of experiments; theories are more frequently proposed than tested. The take-home message seems to be that theory use in ESE research is weak, although it is acknowledged that better theory use would be a good thing.

I note that this paper’s research method did not operate using any theory of its own.

Despite the fact that Greg sent this one for us to read (5 weeks after it went into my reading queue, thankyouverymuch), this was a good paper. It would be fun to brainstorm some grand unified type-3 theories of software engineering.

Notes on “A review of automated debugging systems: knowledge, strategies and techniques”

June 23, 2009

I previously read and summarized McCauley 2008, which reviewed this paper among others. It included the seven types of debugging knowledge detailed in this paper:

  1. Knowledge of intended program
  2. Knowledge of actual program
  3. Understanding of programming language
  4. General programming expertise
  5. Knowledge of application domain
  6. Knowledge of bugs
  7. Knowledge of debugging methods

Each of 1 and 2 is subdivided into:

  1. Program I/O
  2. Behaviour
  3. Implementation

McCauley also covered the four general debugging strategies from this paper: Filtering, checking equivalence of intended and actual program, checking well-formedness of actual program and recognizing bug cliches.

What I did not realize is that this categorization was itself derived from a review of automated tutoring systems, general purpose debugging systems and program tracers. Thus, these should not be viewed as empirically-verified knowledge and strategies of human debuggers, but rather as tool-supported knowledge and tool-supported strategies. (Although the tools themselves may be informed by empirical studies.)