automated_semantic_grading_of_programs
@TechReport{automated_semantic_grading,
author =       {R. Singh and S. Gulwani and A. Solar-Lezama},
title =        {Automated semantic grading of programs},
institution =  {MIT/Microsoft Research},
year =         2012,
annote =       {For short programs with precise specs and an instructor-supplied formal model of possible errors, use program synthesis to find the least-cost set of mods that make the student solution have the same semantics as the error-model-expanded reference solution.}
}

Idea: use techniques from constraint-based program synthesis to suggest minimal changes that turn a nonworking program in an imperative language into a working one.

Overall assumption: the semantics of a short piece of code can be completely captured by a reference solution and a set of “likely errors”, so that any student program that is “on the right track” can be transformed into one whose semantics match those of the reference solution.

Practical assumptions: short programs, a complete specification of correct behavior, and a smallish set of predictable errors (since everyone is doing the same HW).

Background: in SAT-based program synthesis, a program contains one or more *holes* and comes with a set of correctness constraints. A synthesizer explores the space of possible hole values and fills the holes so as to meet the spec. This was Armando Solar-Lezama's PhD thesis with Ras Bodik here.
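
A minimal sketch of the hole-filling idea (my own Python illustration with made-up names; the paper works on C# and does the search symbolically with a SAT-based synthesizer, not by the brute-force enumeration used here):

  # Illustrative sketch, not the paper's implementation: a candidate program with one
  # hole, a black-box spec, and a "synthesizer" that enumerates hole values until the
  # correctness constraints hold on all test inputs.

  def make_reverse(hole):
      """Candidate array-reverse in which the index offset is the hole ??."""
      def reverse(a):
          n = len(a)
          return [a[n - 1 - i + hole] for i in range(n)]  # correct only when hole == 0
      return reverse

  def meets_spec(prog, tests):
      """Correctness constraints: output must equal the reference behavior."""
      try:
          return all(prog(a) == list(reversed(a)) for a in tests)
      except IndexError:
          return False

  tests = [[1, 2, 3], [], [7, 7, 9, 2]]
  # Explore the space of hole values; keep the first one that meets the spec.
  hole_value = next(h for h in (-2, -1, 0, 1, 2) if meets_spec(make_reverse(h), tests))
  print(hole_value)  # -> 0

In the workflow below, the holes instead encode which candidate sub-expression gets chosen (step 3), and the feedback reports which non-zero-cost choices were needed (step 4).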

  1. Instructor provides a reference solution, plus a set of rules in “EML” (error modeling language) specifying corrections for likely errors students might make. Example: a rule to catch off-by-one errors in loop bounds or array references.
  2. A rewriter combines the reference solution and the EML rules into a single program in ~IMP, an imperative language where the usual constructs (while loop, statement list, method call, assignment, expression…) are replaced by ones in which multiple candidates can occur, so one ~IMP program compactly represents a set of candidate programs. Eg: x[y] becomes ~x[~y], where ~x refers both to x and to other things the student might erroneously put there instead (maybe another identifier), and similarly ~y (eg, might be y-1, y+1, etc).
  3. This program is translated into a program whose “holes” represent which candidate gets bound at each site. Eg, the set {~y0, ~y1, …, ~yn} translates as T({~y0, ~y1, …, ~yn}) = if (??) then ~y0 else T({~y1, …, ~yn}) end. (The synthesizer then has to fill each ?? hole with 'true' or 'false'; see the sketch after this list.)
  4. The synthesizer explores the space by first looking for solutions with modification cost = 0 (ie, the student program is already semantically correct), then cost = 1, etc., and gives feedback to the student explaining which rules had to be applied to fix the program.
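
To make steps 2–4 concrete, here is a small hand-written stand-in (Python; the names and the palindrome example are mine, not the paper's EML/Roslyn/C# machinery, and a real synthesizer searches with a constraint solver rather than a loop). One index expression plays the role of ~y, its candidate list comes from an off-by-one-style EML rule, the chain-of-holes comment mirrors the T(...) translation, and the final loop tries candidates in order of modification cost:

  # Illustrative stand-in for steps 2-4, specialized to a single rewrite site:
  # the mirror index inside a palindrome check.

  def is_palindrome(s, mirror_index):
      """Student program with one ~y site: the mirror-index expression."""
      for i in range(len(s) // 2):
          j = mirror_index(s, i)
          if s[i] != s[j]:
              return False
      return True

  # Step 2: candidate set {~y0, ~y1, ~y2}. Candidate 0 is what the student wrote
  # (modification cost 0); the others come from an off-by-one EML-style rule (cost 1).
  CANDIDATES = [
      ("keep student's len(s) - i",  lambda s, i: len(s) - i),
      ("rewrite to len(s) - 1 - i",  lambda s, i: len(s) - 1 - i),
      ("rewrite to len(s) + 1 - i",  lambda s, i: len(s) + 1 - i),
  ]

  # Step 3 analogue: T({~y0, ~y1, ~y2}) = if (??) then ~y0 else if (??) then ~y1 else ~y2.
  # Enumerating the candidates below corresponds to enumerating assignments to the ?? holes.

  def meets_spec(mirror_index):
      tests = {"abba": True, "abc": False, "aba": True, "": True}
      try:
          return all(is_palindrome(s, mirror_index) == want for s, want in tests.items())
      except IndexError:
          return False

  # Step 4: try modification cost 0 first (student's own expression), then cost-1 rewrites.
  for idx, (label, mirror_index) in enumerate(CANDIDATES):
      cost = 0 if idx == 0 else 1
      if meets_spec(mirror_index):
          print(f"accepted at modification cost {cost}: {label}")
          break

Here the student's own expression fails the spec and the cost-1 rewrite len(s) - 1 - i is accepted, which is the kind of “apply the off-by-one rule here” feedback that step 4 describes.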

Implemented for C#, using the Microsoft Roslyn compiler framework to do the AST manipulations, and tested on 10K submissions from the pex4fun website on a selection of problems (array reverse, string palindrome, factorial, etc., plus a couple they added). Initially they (hand-?)sorted solutions to eliminate the “trivially wrong” ones, which interestingly includes solutions of the form “transform the array into some other class and use a built-in method of that class”.

Results:

  • Effectiveness = (fraction of programs corrected) / (fraction of programs containing “small” errors); a tiny worked example follows this list. Ranges from 57% to 83% on the test set, with ArraySort a low outlier at 47%. Most problems take a few CPU-seconds per student submission, the outliers being ArraySort and MaxContiguousSubsequence at ~30 and ~20 CPU-seconds per student respectively. They extrapolate that, for the non-outliers, you could grade 100k submissions for a few dollars on EC2.
  • Fixable = of the programs corrected, what fraction are fixable with small local changes (vs. so many transformations that it would be hard for the student to see the correlation to their original code). All over the map, from 28% to 100%.
  • Examples of non-fixable small errors: an incorrect swap statement (a=b; b=a instead of temp=a; a=b; b=temp); using i instead of a[i]; writing a[i] > C rather than (a[i] - a[i-1]) > C; etc.
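
Tiny worked example of the effectiveness metric (the numbers here are made up purely for illustration, not taken from the paper):

  # Hypothetical numbers, only to show how the ratio is read: if 80% of submissions
  # contain "small" (error-model-expressible) errors and the tool corrects 60% of all
  # submissions, then effectiveness = 0.60 / 0.80 = 75%.
  corrected_fraction   = 0.60
  small_error_fraction = 0.80
  print(f"effectiveness = {corrected_fraction / small_error_fraction:.0%}")  # 75%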

What's cool:

  • The fraction of fixable problems levels off after an error model with just 5 EML rules, suggesting that many students do make the “same” mistakes.
  • Can point out complete sets of errors, eg, that 4 corrections are needed to make a program work (although the majority of fixable submissions required only 1 or 2 fixes).

Limitations:

  • The instructor must be able to specify the error transformation rules completely. If a type of student error isn't expressed in the rules, the synthesizer can't map that student solution to a correct one.
  • The spec must be completely provided, and (apparently) as a black box. Eg, to rule out solutions of the form “transform the problem into another problem solvable by built-in methods”, the spec would need to be glass-box.