Archive for March, 2009

Planning a Thesis

March 30, 2009

The second semester is winding down. In a few weeks, I’m meant to hit full speed on thesis work. 

What’s going on right now?

  • An illness halved my productivity today. Hope to be better tomorrow.
  • I suspected my thesis idea had already been done. I just confirmed that news.
  • In compilers, I have to finish off a final assignment, do a chapter of reading, and prepare for the final examination.
  • In the project course, I have to get my share of the project to a proud state.
  • My teaching assistantship boss has effectively saved most of my hours until now. Now he’s spending them.
  • I’m a minor contributor to an ITICSE paper. Reviewer comments are back and I need to help revise. Mostly that means digging up and reviewing more literature.
  • I’m traveling for family reasons this weekend. The week following I’m volunteering for Mesh and then I’m away again for Easter.
  • It’s tax season and it looks like I’m going to have to work hard to get the forms I need.

What’s next?

  • I want to read Jorge’s thesis as well as a few of the top ACM theses.  
  • I want to get caught up on the last few years of the top SE and ESE papers.
  • I want to get through more of the ESP proceedings and ideally catch up on PPIG too.

I hope by completing the above I’ll get a sense of what’s going on in the field and how I can join in.

Of course, all of that excludes my non-academic interests, like finishing off jQuery in Action, Beautiful Code, SICP, Real World Haskell, or building things, or going to the gym, or seeing human beings…

Notes on “Knowledge Organization and Skill Differences in Computer Programmers”

March 30, 2009

Knowledge Organization and Skill Differences in Computer Programmers: Katherine B. McKeithen and Judith S. Reitman.

Weeks ago, Greg asked, “What paper do you wish you had written?”

This is mine.

I tracked down this paper because I found a reference to it described as “Chase and Simon applied to computer programmers” — exactly the pitch I’ve been making for my thesis topic for the past couple of weeks.

Here’s the experiment: Show some programmers a computer program. Take it away. Ask them to recreate it. Show them the program again. Ask them to recreate it. Repeat until they get it completely right. Do this with a computer program that makes sense and one that is scrambled.

The experts do better than novices, but not on the scrambled programs.

There was one key difference from Chase and Simon: over successive trials, expert programmers leave the novices further and further behind. The authors hypothesize this is because chunks of computer programs are less easily perceived visually than chess or Go chunks.

There was another experiment in the paper: Give the subjects a list of language keywords and have them memorize them. The novices organized the list fairly haphazardly. The experts organized the list according to semantic relation. It’s not that the experts organize their information more. The experts organize better.

This was a nicely written paper: There was a good number of subjects, a clear experimental design, a nice exposition of the relevant statistics. It was even published in a psychology journal, which would satisfy my psychology-envy. 

This was published in 1981. I am nearly 30 years behind.

Will it fly?

March 27, 2009

Or, How to Determine How Hard It Will Be To Make Your Idea Real: A Framework for Evaluation of Stakeholder Support of Initiatives.

Make a list of everyone affected by your proposal.

Now give each person a letter-grade:


A: Will fight for and support your idea.

B: Supports your idea but won’t actively help.

C: Doesn’t support your idea but won’t actively fight it.

D: Will fight against your idea.


To succeed, you need more powerful A’s than C’s and no powerful D’s.

It’s a simple system but it works in many scenarios.

A Better Peer Evaluation System

March 26, 2009

I’ve ranted against a popular peer evaluation system. An anonymous commenter challenged me: “Make something better.” I think I did. It’s simple to understand and eliminates undesirable incentives.

This isn’t a perfect system. Any compensation system has faults because of the inescapable tension between individual and group performance incentives. I only claim that this is an improvement over the Oakley system.

Overview of the problem

A team of m members completes a school assignment. The overall grade given to the group is G. We wish to adjust each individual team member’s grade g1, g2, …, gm by an adjustment a1, a2, …, am based on peer evaluations, where each ai can be positive or negative.

We assume each team member evaluates each other team member. As we’ll see, we can ignore self-evaluations without loss of generality. Thus, the input to our function f is an overall grade G and a matrix E:

X ? ? ?
? X ? ?
? ? X ?
? ? ? X

Eij represents the ith member’s evaluation of the jth. The output of f is the vector a1, a2, …, am.

The Oakley paper’s peer evaluations use ratings on a nine-point scale from “No Show” to “Excellent.” Without loss of generality we’ll just use numbers and let the user substitute the appropriate strings.

The problem is to find an acceptable f.

Properties of f

We constrain f with a few desirable properties.

  1. Zero-sum: ∑ai = 0. It should not be possible for the group members to collude and raise their average above G. We want the adjusted marks to preserve G as their arithmetic mean.
  2. Truthful: Ei does not affect ai. A team member’s evaluations of their teammates should not affect their own adjustment. Otherwise, the team member has an incentive that works against honest feedback. In general we want to eliminate all incentives that interfere with honesty. For this reason we ignore self-evaluations: they may contain useful information for a human judge, but they contain nothing usable by an algorithm.
  3. Range-limited: -r ≤ ai ≤ r. The administrator may wish to set a bound r on the adjustments assigned, such that a high-performing team member has an adjustment of +r and a deadweight gets -r.
  4. Determinism: For fairness, the algorithm should be deterministic.

Explicit non-problems

If ≥m/2 members collude and are malicious, there is nothing we can do. These situations will require human judgement.

We’re not trying to build the perfect peer evaluation system—just one better than Oakley’s. All systems will have flaws and require human judgement.

Special cases

Some administrators may wish to reward a functional group or penalize a dysfunctional group. They may wish to violate the zero-sum property to do this. This can be applied as a post-processing step.

Algorithm Example

Let’s say a team has 4 members and receives a collective grade of 75% on an assignment. The professor is willing to allow each individual mark to shift 9% according to peer evaluations, thus, the lowest possible mark is 66% and the highest is 84%.

m = 4 (Ahmed, Bob, Charlie, Denise)

G = 75

r = 9

First, we give each member a temporary score of G – r: 66%.

Now, each group member i submits their vector Ei with m elements. We ignore Eii. Each element Eij is a number from 0 to (2 × r) ÷ (m – 1), i.e. from 0 to 6. Also, ∑Ei = r, i.e. the sum of the elements is 9. That is, each group member allocates 9 points among the other three, with a cap of 6 per teammate to prevent any one member from receiving an adjustment higher than 9.
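As a concrete sketch of this allocation step (the matrix entries below are invented for illustration, not from the example), the adjustments work out like this:

```python
# Sketch of the points-allocation step. Each row i holds member i's
# allocation of r = 9 points among the other m - 1 = 3 members, with
# each entry capped at 2r/(m - 1) = 6. The diagonal (self) is ignored.
m, G, r = 4, 75, 9
members = ["Ahmed", "Bob", "Charlie", "Denise"]

# E[i][j] = member i's points for member j (illustrative values).
E = [
    [0, 2, 3, 4],   # Ahmed's allocations
    [3, 0, 3, 3],   # Bob's
    [6, 2, 0, 1],   # Charlie's
    [4, 4, 1, 0],   # Denise's
]

assert all(sum(row) == r for row in E), "each member allocates exactly r points"
assert all(e <= 2 * r / (m - 1) for row in E for e in row), "cap exceeded"

# Adjustment = points received minus r; final mark = G + adjustment.
adjustments = [sum(E[i][j] for i in range(m)) - r for j in range(m)]
grades = [G + a for a in adjustments]

assert sum(adjustments) == 0   # zero-sum property holds
print(list(zip(members, grades)))
# -> [('Ahmed', 79), ('Bob', 74), ('Charlie', 73), ('Denise', 74)]
```

Note that even though Charlie gave Ahmed the capped maximum of 6, Ahmed’s adjustment could not exceed +9 even if every teammate did the same.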

Subjective Evaluations

This system of points allocation satisfies our properties but it feels crass somehow. There is an elegance to independent evaluations from “No Show”, “Superficial”, “Unsatisfactory”, “Deficient”, “Marginal” through “Ordinary”, “Satisfactory”, “Very Good” and “Excellent.”

We therefore want an adapter from the Oakley system. We will translate evaluations on this 9-point scale into a points allocation†. We’ll call the minimum Oakley score “No Show” Kmin, and set it to 1. We’ll call the maximum Oakley score “Excellent” Kmax, and set it to 9.

To allocate each student’s points, we have a vector s, 1 ≤ s ≤ 9, of their subjective evaluations of their teammates. Say Ahmed gave Bob a “Superficial”, Charlie a “Satisfactory” and Denise an “Excellent.” Ahmed’s s is therefore {◊, 2, 7, 9}. First, we multiply each entry by 6/9 so that no element exceeds the cap of 6: t = {◊, 1.3, 4.7, 6}. Then we multiply by r/∑t to make the sum r: u = {◊, 1, 3.5, 4.5}.
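As a quick sketch, Ahmed’s conversion can be reproduced like this (all values from the running example):

```python
# Convert Ahmed's subjective 9-point ratings of his three teammates
# into a points allocation (running example: m = 4, r = 9).
m, r = 4, 9
cap = 2 * r / (m - 1)              # no teammate may receive more than 6 points
K_max = 9                          # "Excellent"

s = [2, 7, 9]                      # Superficial, Satisfactory, Excellent

t = [x * cap / K_max for x in s]   # compress so no entry exceeds the cap
u = [x * r / sum(t) for x in t]    # rescale so the allocation sums to r

print([round(x, 1) for x in u])    # -> [1.0, 3.5, 4.5]
```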

Will this always work? The range of s is {◊, 1, 1, 1} to {◊, 9, 9, 9}. With (2 × r) ÷ (m – 1) = 6, as in the example, the range of t is {◊, 6/9, 6/9, 6/9} to {◊, 6, 6, 6}. The range of ∑t is 2 to 18. Thus the range of r/∑t is 1/2 to 4.5.

The multiplier exceeds 1 whenever r > ∑t, but in this situation no t will be scaled above (2 × r) ÷ (m – 1), so we maintain our desirable properties†.


To summarize the algorithm: choose m, G, and r.

If you are not using a subjective scale, ask each group member i for a points allocation Eij with a maximum of (2 × r) ÷ (m – 1) that sums to r. Then ai = ∑j Eji – r.

If you are using a subjective scale, ensure that r ≥ Kmax when you serialize the scale into numeric form.

Collect the matrix Sij with scores from 1 to Kmax. Let Tij = Sij × (2 × r) ÷ ((m – 1) × Kmax). For each row Ti, let Eij = Tij × r ÷ ∑Ti. Then ai = ∑j Eji – r.
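Putting both branches together, here is a minimal end-to-end sketch of the subjective-scale recipe (the function name and structure are mine, not from the Oakley paper):

```python
def peer_adjustments(r, S, K_max=9):
    """Compute each member's grade adjustment from subjective ratings.

    S[i][j] is member i's rating (1..K_max) of member j; the diagonal
    is ignored, since self-evaluations carry no algorithmic weight.
    """
    m = len(S)
    cap = 2 * r / (m - 1)
    E = []
    for i, row in enumerate(S):
        # Compress ratings below the cap, then rescale the row to sum to r.
        t = [row[j] * cap / K_max if j != i else 0.0 for j in range(m)]
        E.append([x * r / sum(t) for x in t])
    # Adjustment = points received minus r.
    return [sum(E[i][j] for i in range(m)) - r for j in range(m)]

# If everyone rates everyone "Marginal" (5), no mark should move.
flat = peer_adjustments(9, [[0, 5, 5, 5],
                            [5, 0, 5, 5],
                            [5, 5, 0, 5],
                            [5, 5, 5, 0]])
print([abs(a) < 1e-9 for a in flat])   # -> [True, True, True, True]
```

The zero-sum and truthful properties fall out of the construction: every row of E sums to r, and row i never influences ai.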


A.G. helped with the main formula. Lila Fontes helped with the subjective evaluation formula. 

†We need r ≥ Kmax. We multiply t by r/∑t. The lower bound on ∑t is (m-1)×Kmin×2r÷((m-1)×Kmax). For simplicity we assume Kmin = 1.  We cancel out the (m-1) to find min(∑t) = 2r/Kmax. Thus, at worst, we multiply t by Kmax/r. To make sure this number is ≤ 1, we need to make sure r ≥ Kmax.

Aside: Extended Algorithm

A variant of this idea could be used to allocate bonuses in a company. Let B be the total bonus to be allocated in a team. Here, each team member allocates n × B × w dollars among n colleagues, where the n colleagues are those the individual is qualified to evaluate, w is a weighting factor for the individual’s importance (∑w = 1 for a bonus pool, w = 1/(m – 1) in the student example), and m is the total number of people in the bonus pool.

How to Succeed at Anything

March 10, 2009

(Or, Notes on “The Role of Deliberate Practice in the Acquisition of Expert Performance”)

The good news: Talent is meaningless. You can be world-class at anything.

The bad news: The formula requires absolute commitment: 3-4 hours daily of pleasureless, intense practice, and fifty to eighty hours a week of total domain-related time… for a decade. You must always get a full night’s sleep. And you can’t catch up to anyone who’s got a head start.

Annotated selections from the paper follow.

Experts’ memory for representative stimuli from their domain is vastly superior to that of lesser experts.

Novel programmer interview technique: Show the programmer a program. Give them a moment or two to study it. Take it away. Take a short break. Ask them to reproduce it.

Adults perform at a level far from their maximal level even for tasks they frequently carry out… “It is that we have too many other improvements to make, or do not know how to direct our practice, or do not really care enough about improving, or some mixture of these three conditions.”

“Do your best” is hard advice to follow.

When Tchaikovsky asked two of the greatest violinists of his day to play his violin concerto, they refused, deeming the score unplayable. Today, elite violinists consider this concerto part of the standard repertory.

“World-class” is a moving target. World records are broken by performers with more resources available for more practice.

To make an eminent achievement, one must first achieve the level of an expert and then in addition surpass the achievements of already recognized eminent people and make innovative contributions to the domain.

The 10-year rule

It takes normal individuals approximately a decade to acquire [their adult vocabulary.]

The time between chess players’ first learning the rules of chess and attaining international chess master status was 11.7 years for those who learned the rules late (after age 11).

J.R. Hayes confirmed that 10 years’ experience is necessary in another domain, musical composition.

Simon and Chase’s “10-year rule” is supported by data from a wide range of domains: music, mathematics, tennis, swimming, and long-distance running. Long periods of necessary preparation can also be inferred for writers and scientists.

In many other domains, the highest level of expert performance is displayed by individuals with more than 10 years of experience: evaluation of livestock, diagnosis of X-rays, and medical diagnosis.


Motivation and perseverance are necessary for attainment of eminent performance.

On Practice

The  relation between acquired performance and the amount of practice and experience was found to be weak to moderate in the earlier review. We propose that the reason for this comparatively weak relation is that the current definition of practice is vague… we must analyze the types of activities commonly called practice.

We want to distinguish activities invented with the primary purpose of attaining and improving skills from other types of everyday activities, in which learning may be an indirect result. On the basis of several thousand years of education, along with more recent laboratory research on learning and skill acquisition, a number of conditions for optimal learning and improvement of performance have been uncovered. The most cited condition concerns the subjects’ motivation to attend to the task and exert effort to improve their performance. In addition, the design of the task should take into account the preexisting knowledge of the learners so that the task can be correctly understood after a brief period of instruction. The subjects should receive immediate informative feedback and knowledge of results of their performance. The subjects should repeatedly perform the same or similar tasks.

There is no activity in the common education of programmers that matches the description above.

In the absence of adequate feedback, efficient learning is impossible and improvement only minimal even for highly motivated subjects. Hence mere repetition of an activity will not automatically lead to improvement in, especially, accuracy of performance.

Studies show that providing a motivated individual with repeated exposure to a task does not ensure that the highest levels of performance will be attained. Assessment of subjects’ methods shows that inadequate strategies often account for the lack of improvement.

How many times do programmers receive feedback on their programs? A few times per semester for a student, then once per code review in industry?

As the complexity of a desired skill increases beyond the simple structure of most laboratory tasks, the logically possible methods to correctly and incorrectly perform the task by subjects increase as well. To assure effective learning, subjects ideally should be given explicit instructions about the best method and be supervised by a teacher to allow individualized diagnosis of errors, informative feedback, and remedial part training. The instructor has to organize the sequence of appropriate training tasks and monitor improvement to decide when transitions to more complex and challenging tasks are appropriate. Although it is possible to generate curricula and use group instruction, it is generally recognized that individualized supervision by a teacher is superior.

Tutoring yields better performance by two standard deviations—the average tutored student performed at the 98th percentile of students taught with the conventional method.

Close supervision is rare. Most feedback for students comes from overworked TAs who take a quick read through a program.

How much would the software industry earn and save if its programmers were two standard deviations better? Is there a positive ROI on tutoring?

Given the cost of individualized instruction, the teacher designs practice activities that the individual can engage in between meetings with the teacher. We call these practice activities deliberate practice and distinguish them from other activities, such as playful interaction, paid work, and observation of others, that individuals can pursue in the domain.

What is deliberate practice for programmers?

You can’t practice on the job

Individuals given a new job are often given some period of apprenticeship or supervised activity during which they are supposed to acquire an acceptable level of reliable performance. Thereafter individuals are expected to give their best performance in work activities and hence individuals rely on previously well-entrenched methods rather than exploring alternative methods with unknown reliability. The costs of mistakes or failures to meet deadlines are generally great, which discourages learning and acquisition of new and possibly better methods during the time of work.

The workplace is unsuitable for learning advanced programming skills. That leaves undergraduate programs. Yet, a low GPA is the (severe) consequence of failure on most undergraduate programming projects. Small wonder undergraduates are reluctant to try new techniques and tools such as unit testing, formal specification, debuggers, and IDEs.

Hard Work

Recent analyses… reveal an enjoyable state of “flow”… an enjoyable state of effortless mastery and execution of an activity. This state of diffused attention is almost antithetical to focused attention required by deliberate practice to maximize feedback and information about corrective action.

We claim that deliberate practice requires effort and is not inherently enjoyable.

A necessary precondition for practice… is that the individual be fully attentive to his playing so that he or she will notice areas of potential improvement and avoid errors… Practice without such concentration is even detrimental to improvement of performance.

Practice Time is Everything

There is a complete correspondence between the skill level of the groups and their accumulation of practice time.

At the age of 18 the expert pianists had accumulated 7,606 hr of practice, which is reliably different from the 1,606 hr of practice accumulated by the amateurs.

During the diary week experts were fully engaged in music and spent close to 60 hr on music-related activities.

We’re a very immature field compared to others like music. How many CS graduates had even started programming at 18? And yet we’re the ones programming the rockets, robots and reactors.

Talent is Practice in Disguise

Given that acquired skill resulting from prior accumulated practice cannot be observed, it could easily be incorrectly attributed to native talent.

Years of intensive preparation under the supervision of a master invariably precede the attainment of international recognition.

The real innate differences are in practice precursors

In fact, within our framework we would expect that several “personality” factors, such as individual differences in activity levels and emotionality may differentially predispose individuals toward deliberate practice as well as allow these individuals to sustain very high levels of it for extended periods.

Intelligence necessary but not sufficient

“high but not the highest intelligence, combined with the greatest degree of persistence, will achieve greater eminence than the highest degree of intelligence with somewhat less persistence.”

How to succeed wildly at anything

We view elite performance as the product of a decade or more of maximal efforts to improve performance in a domain through an optimal distribution of deliberate practice.


Assorted Notes on Personality

March 9, 2009

Do programmers’ personalities affect how they make code?

To get me started in the area of personality, Jordan Peterson kindly pointed me to three papers.

The first, “Predicting creativity and academic success with a ‘Fake-Proof’ measure of the Big Five,” explains a method of measuring personality that is resistant to bias. This is necessary because people are liable to inflate their Agreeableness, Conscientiousness and Emotional Stability. Native English speakers are able to bias their responses better than others. The key insight is to pit these traits against each other. Instead of asking questions designed to measure one trait, each question forces the subject to weigh traits against each other.

Some highlighted notes from the paper:

  • Quantifiable measures of personality can be used to predict real-world outcomes.
  • The most predictive of these, Conscientiousness, accounts for 12-25% of the variance in academic performance.
  • Composite measures of self-discipline are twice as effective as IQ at predicting academic performance.
  • Conscientiousness is also the best single personality predictor of workplace performance.
  • After Conscientiousness, Emotional Stability is the best Big Five predictor of workplace performance. It is also a good predictor of job satisfaction and organizational commitment.
  • Extraverts are good candidates for team-based activities. They are effective performers in leadership positions. They are happy and enthusiastic.
  • Agreeableness is a good predictor of teamwork performance.
  • A combination of Extraversion and Agreeableness predicts a transformational leadership style.
  • Openness is a significant predictor of an individual’s creativity.

A quick primer on the psychology of personality: In the early days, theories of personality were mostly opinions. The measures were unreliable and not terribly predictive. Eventually, researchers went through a dictionary, classified all the words related to personality, created questionnaires based on these words, and then performed factor analysis on the results to get five main dimensions of personality. Some analyses find fewer or more factors, but a consensus has emerged around five. These are called the Big Five: Conscientiousness, Agreeableness, Neuroticism, Openness/Intellect, and Extraversion. These five are well-studied, valid, reliable, and predictive.

The second paper, “Between Facets and Domains: 10 Aspects of the Big Five,” digs into the constituent components of the Big Five. Each of the Big Five domains can be broken into two aspects. Extraversion can be broken into assertiveness and enthusiasm. Neuroticism can be broken into volatility and withdrawal. Agreeableness can be broken into compassion and politeness. Openness/Intellect is unsurprisingly broken into openness and intellect. Conscientiousness can be broken into industriousness and orderliness. 

There are significant intercorrelations. For example, while conscientiousness is good overall, high orderliness is correlated with neuroticism, which is maladaptive.

The third paper was “Prefrontal Cognitive Ability, Intelligence, Big Five Personality, and the Prediction of Advanced Academic and Workplace Performance.” The main thrust of this paper was to rip into psychologists’ measurement of intelligence.

First and foremost is the formal nature of the psychometric conception of intelligence… [and its] limited grounding in other domains of psychology and neuroscience.

More interestingly for me, it turns out that Dorsolateral Prefrontal Cortex function can be measured and it is significantly correlated with conscientiousness and academic success. Intriguingly, it is strongly correlated with working memory capacity, which suggests we can measure the size of an individual’s working memory. This might prove useful for e.g. a study of program comprehension. 

Another tidbit had to do with self-reports of job performance and personality:

Supervisor-rated job performance was not significantly correlated with any of the personality variables. Self-rated job performance, on the other hand, was significantly correlated with Extraversion, Openness, and Conscientiousness and had near-significant correlations with Emotional Stability and Agreeableness.

Intelligence has a predictive validity of .58 for professional-managerial jobs. The relationship between cognitive ability and performance increases, not decreases, with experience. There is a linear decline in cognitive function over our lives.

One piece of discussion jumped out at me: Those with damaged frontal lobes are susceptible to goal neglect, and this is relatively more likely to occur under conditions of novelty or ambiguity.

Goal neglect = misprioritization of tasks

Novelty or ambiguity = when you don’t really know what you’re doing, e.g. many programming tasks

Could a measure of D-PFCA predict a programmer’s likelihood to shave the yak?

Along the way, Hassam pointed me to “Exploring the underlying aspects of pair programming: The impact of personality,” which purports to study the impact of personality on pair programmers. Alas, the study is quite badly done (e.g. it uses the widely discredited Myers-Briggs Type Indicator to measure personality and suffers from a general lack of rigour) and gives no useful information.

As an aside, I’m astounded by the vast difference in quality between the psychology papers and most of the empirical software engineering papers I’ve read so far. The psychology papers are of a far higher quality—I hope to learn from them.

Academic Hacker News hits Reddit frontpage, top of Google

March 9, 2009

Academic Hacker News hit the front page of reddit today and is the top Google result for “Academic Hacker News.”

A redditor pointed out that the lack of comments is a threat to its life. I’m not sure how to get people commenting; suggestions welcome.

The site is up to 162 registered users. I want to encourage more people here to use it, somehow, but Zak believes that the growth will make it less valuable as a research tool.

March 7, 2009

The launch

The other day I put up Academic Hacker News. People at school were too busy to check it out so I submitted it on the real Hacker News yesterday. You can see the discussion.

The response has been rewarding. People seem enthused with the idea. I got a few suggestions about my terrible taste in colours and the difficulty in finding the RSS feed. Over 100 people have registered user accounts and there have been over twenty submissions so far. Not bad for < 24 hours.

I want this to succeed and become a good tool for practitioners in our area. Now I need ideas on how to build momentum and get the word out. Help!

A lesson and thanks

This wouldn’t have happened if the department didn’t have infrastructure that let us easily set up servers. It also wouldn’t have worked if I didn’t have Alan’s help and approval along the way. Thanks, Alan.

An open question: What other infrastructure could we have that would enable lightweight independent student projects?

Now, off to check out all the submitted links…

Academic Hacker News: A social news site for academic papers in CS

March 5, 2009

Check out Academic Hacker News. Sign up. Participate. Let’s make something good.

Hack Session 4

March 1, 2009

Andrew Louis, Andrew Trusty and I hacked together a social news site for academic papers. Code on hyfen’s github; application on Google App Engine.

There were three of us, so our 4-hour hack session means this application has 12-person hours of work behind it. I think our next iteration will turn this thing into something really worth using—we’ve planned some neat features. Right now it’s quite rough around the edges.

Give ’er a spin! But we warn you… Louis calls this an “extreme beta.”