autograding-programming-assignments

Paper summaries

  • Emergent, Crowd-scale Programming Practice in the IDE. The authors propose Codex, which uses statistical techniques to detect patterns that break language conventions and are therefore more likely to contain bugs. For example, if a programmer writes lower_case_name = name.downcase!, Codex warns that the code is likely to have a bug (in Ruby, downcase! returns nil when the string needs no change, so assigning its result is usually unintended). Codex has observed downcase! 57 times and the abstraction var = var.any_method more than 100,000 times, but has encountered this particular combination only once; by contrast, it has seen the correct version, lower_case_name = name.downcase, more than 200 times.
  • Codewebs: Scalable Homework Search for Massive Open Online Programming Courses. This work has two major technical innovations. First, an efficient method for indexing code by “code phrases,” defined as subgraphs of an AST. The index is implemented as a hash table; to query it efficiently, they compute a hash code for each code phrase by hashing the list of nodes obtained from a postorder traversal of the subtree (a small Ruby sketch of this idea appears after this list). Second, they introduce a new notion, probabilistic semantic equivalence, which can be used to measure the similarity between two sub-ASTs.
  • Transformation-Based Diagnosis of Student Programs for Programming Tutoring Systems. I read this paper because I think it might help us improve the notion of 'control flow trees'. It focuses on automatically diagnosing students' programming errors by comparing the student program with a model program after both have been standardized by program transformations. We could apply similar transformations to improve our 'control flow trees'.
  • Automatic Generation of Programming Feedback: A Data Driven Approach. This paper is written from the perspective of the solution space. The authors reduce the solution space by applying semantics-preserving transformations to student programs, then 'connect' the solution space (i.e., compute the similarity between two solutions) by calculating the string edit distance between two ASTs. Surprisingly, the authors report that tree edit distance, which they also tried, does not work particularly well in practice. Another valuable idea in this paper: the feedback given for a student's program is another student's program that is slightly better. This is similar to our climbing-ladder idea.
  • Singh, Gulwani, Solar-Lezama, Automated semantic grading of programs. For short programs with precise specs and an instructor-supplied formal model of possible errors, use program synthesis to find the least-cost set of modifications that makes the student solution semantically equivalent to the reference solution expanded with the error model.
  • Lancaster et al., A comparison of source code plagiarism detection engines. A 2010 survey reveals that most plagiarism-detection engines operate by looking at n-grams at some level, though they may apply tokenization and other preprocessing to thwart naive attempts to disguise plagiarized code. MOSS and JPlag are cited as examples of highly regarded, widely used systems that do pairwise checking (vs. computing an independent “fingerprint vector” for each submission) and can thwart various simple cheating attempts.
  • Huang, J., Piech, C., Nguyen, A., & Guibas, L. (2013) (Stanford Lytics Lab). Syntactic and Functional Variability of a Million Code Submissions in a Machine Learning MOOC. Clustering of ASTs (represented as language-neutral JSON data structures) and visualization reveals clumps of student code submissions that embody similar solution strategies. Should be easily retargetable to other languages. Scaling up to all pairs requires more work, since the matching algorithm's running time is quadratic in AST size.
  • Srikant and Aggarwal: Automatic grading of computer programs: a machine learning approach. SVMs and ridge regression are applied to program features constructed by annotating the data dependency graph with control-flow information; e.g., a node corresponding to a variable that is a loop-carried dependency is annotated with which loop it is in. This yields coarse-grained grades (1-5) in reasonable agreement with human graders.
  • Glassman, Singh, Miller: Feature engineering for clustering student solutions, work-in-progress at Learning@Scale 2014. Variation theory suggests that when people learn from examples, the examples can be compared along 3 axes. Contrast shows both examples and non-examples of the concept. Generalization shows two ways of addressing the concept that are basically the same but differ in low-level details. Separation shows two different ways of addressing the concept. For code, one interpretation is that generalization shows same general algorithm/approach to a problem but with different low-level implementation details (like array slicing vs. array indexing), and separation shows different algorithms for same problem. Authors use 2-level hierarchical clustering to identify both categories, with the goal of showing students alternative correct approaches for the problem they solved. Not sure if they have a full paper on this or not but they have a TOCHI submission on *visualizing* differences in student code submissions.

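A minimal Ruby sketch of the Codewebs indexing idea summarized above: hash each subtree (“code phrase”) by the postorder listing of its node labels and use that hash as the key in a hash-table index. The s-expression representation and the helper names are illustrative assumptions, not the paper's implementation.

  # Postorder listing of node labels for an s-expression subtree (nested arrays
  # whose first element is the node type).
  def postorder_labels(node, acc = [])
    if node.is_a?(Array)
      type, *children = node
      children.each { |c| postorder_labels(c, acc) }
      acc << type
    else
      acc << node
    end
    acc
  end

  # The hash of the postorder listing serves as the code-phrase key.
  def code_phrase_key(subtree)
    postorder_labels(subtree).hash
  end

  # Index: code-phrase key => list of [submission id, subtree] pairs.
  index = Hash.new { |h, k| h[k] = [] }
  subtree = [:call, [:lvar, :xs], :inject, [:lit, 0]]
  index[code_phrase_key(subtree)] << [:submission_42, subtree]
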
Gaudencio et al. ask 11 teachers and 7 automated comparison algorithms the same question: given a reference solution R, which of two other student solutions, C1 or C2, is more similar to it?

  • Red flag: their input dataset was NOT real student submissions, but rather mutations of a reference solution generated by applying variations drawn from the plagiarism-detection literature! Would they do as well on real submissions?
  • Score(C1,C2) = some measure of the difference between any pair of pieces of code; T = some threshold value of the score. In the experiments, T is tuned to maximize agreement for each <teacher, algorithm> pair, i.e., there is an 11×7 matrix of T values.
  • SIMILAR(C1,R,C2) is a 3-valued function: -1 if C1 is more similar to R (that is, if Score(C2,R) - Score(C1,R) > T), +1 if C2 is more similar, and 0 if they are equally similar (i.e., if abs(Score(C1,R) - Score(C2,R)) ≤ T). Note that in the “equally similar” case, C1 and C2 could be the same distance from R but in opposite directions. (A small Ruby sketch of Score and SIMILAR appears after this list.)
  • Agreement between two algorithms (or between a teacher and an algorithm): the percentage of triples on which SIMILAR has the same value.
  • Strategies: Jaccard distance (1 minus the size of the intersection divided by the size of the union of the token sets), and various term-frequency distances.
  • Teachers are capricious: pairwise inter-teacher agreement varied from 62% to 95%, whereas by tuning T, they could get 77-87% agreement between any teacher and the “best fit” algorithm for that teacher.
  • Something to try: They didn't try combining algorithm results, so maybe it would be interesting to use the algorithm scores as part of a feature vector.
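To make the Score/SIMILAR definitions concrete, here is a minimal Ruby sketch (not the paper's code) that uses Jaccard distance over token sets as the Score; the tokenization and the threshold value are illustrative assumptions.

  require 'set'

  # Jaccard distance between the token sets of two code strings.
  def score(code_a, code_b)
    a = code_a.scan(/\w+/).to_set
    b = code_b.scan(/\w+/).to_set
    union = (a | b).size
    return 0.0 if union.zero?
    1.0 - (a & b).size.to_f / union
  end

  # SIMILAR(C1, R, C2): -1 if C1 is closer to the reference R, +1 if C2 is,
  # 0 if the two distances differ by no more than the threshold t.
  def similar(c1, r, c2, t)
    diff = score(c2, r) - score(c1, r)
    return 0 if diff.abs <= t
    diff > 0 ? -1 : 1
  end

  similar("xs.inject(0) { |s, x| s + x }", "xs.sum", "xs.reduce(:+)", 0.1)
  # => 1 (C2 is closer to the reference by this metric)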

Work on program comprehension using ASTs and other structural features of code

Many papers use one or more software metrics as attributes for computing similarity between blocks of code, so it's good to be familiar with them.

Evaluation Experiments on the Detection of Programming Patterns Using Software Metrics, K. Kontogiannis, 1997

The author looks at five software metrics as a way of determining the similarity of two AST subtrees:

  • “S-complexity”: the number of function calls.
  • “D-complexity”: the number of global variables defined or updated within the block, divided by the number of function calls.
  • Cyclomatic complexity.
  • Albrecht complexity: a combination of globals, files opened, function parameters, and other factors; it appears to be related to function points, which were invented by Albrecht at IBM as a way of classifying the complexity of each function.
  • Kafura complexity: from Wikipedia, Henry and Kafura's (1981) complexity value is defined as “the procedure length multiplied by the square of fan-in multiplied by fan-out” (length × (fan-in × fan-out)²).
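As a quick worked example of the Henry-Kafura formula, here is a minimal Ruby sketch with invented numbers (not taken from the paper):

  # Henry-Kafura complexity: procedure length * (fan_in * fan_out)^2
  def henry_kafura(length, fan_in, fan_out)
    length * (fan_in * fan_out)**2
  end

  henry_kafura(20, 3, 2)   # => 20 * (3 * 2)^2 = 720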

The author obtains different precision and recall results using different subsets of these metrics; the basic message for us is that we should look seriously at the metrics reported by the various Ruby tools.

Maletic and Marcus, Supporting Program Comprehension Using Semantic and Structural Information (ICSE 2001)
@InProceedings{program_comprehension_2001,
author =       {Jonathan Maletic and Andrian Marcus},
title =        {Supporting Program Comprehension Using Semantic and Structural Information},
booktitle = {23rd Intl. Conf. on Software Engineering (ICSE)},
year =      2001,
month =     {May},
address =   {Toronto, Canada}}

This work focuses more on analyzing a large piece of software and using clustering to find the files/documents/modules that are closely related.

Uses Latent Semantic Indexing (LSI), a technique for clustering text documents by concept, as a way to get at the underlying semantics of source code. The insight is that while LSI is sometimes criticized in the information-retrieval world because it does not consider word order or morphology (endings and other mutations of words), this is fine for source code, because naming is often suggestive of semantic concepts; i.e., developers choose names for variables, functions, modules, etc. that are likely to be meaningful to other developers.

LSI constructs a sparse terms × documents matrix, computes its singular value decomposition, and truncates the decomposition to some chosen rank k that is much smaller than either dimension of the terms × documents matrix.
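In standard LSI notation (a sketch of the usual formulation, not anything specific to this paper), with A the m × n terms × documents matrix:

  A = U \Sigma V^{T}, \qquad A_k = U_k \Sigma_k V_k^{T}, \qquad k \ll \min(m, n)

Each document is then represented by its k-dimensional column of \Sigma_k V_k^{T}, and documents are compared in that reduced space.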

Documents (files) in a source tree are connected in an undirected multigraph where each edge specifies either a semantic similarity or structural similarity between two files. (Two files can be connected by both a semantic and a structural similarity edge.)

Two documents are placed in the same semantic cluster (have a semantic edge between them) if the cosine similarity of their vector descriptions is > 0.7 (that is, the angle between them is at most roughly 45 degrees, since cos 45° ≈ 0.707). Two documents have a structural edge between them if the two documents are part of the same developer-specified subcollection (which they misleadingly call a “file”).
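A minimal Ruby sketch of the semantic-edge test, assuming each document has already been reduced to a k-dimensional LSI vector (the vectors below are made up):

  # Cosine similarity of two equal-length vectors (plain arrays of floats).
  def cosine_similarity(u, v)
    dot    = u.zip(v).sum { |a, b| a * b }
    norm_u = Math.sqrt(u.sum { |a| a * a })
    norm_v = Math.sqrt(v.sum { |a| a * a })
    dot / (norm_u * norm_v)
  end

  doc1 = [0.12, 0.80, 0.05]
  doc2 = [0.10, 0.75, 0.20]
  semantic_edge = cosine_similarity(doc1, doc2) > 0.7   # => true for these vectors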

They ran the technique on the source code for the old NCSA Mosaic web browser, and found two separate implementations of the same abstract datatype (linked list) and various other clusters that could be labeled by humans with labels such as “drawing area”, “growable string buffer”, etc.

Other work that cites this or is cited by it:

  • Later work by these authors, “Identification of high-level concept clones in source code”, is almost identical to this paper.
  • Mancoridis et al, “Using Automatic Clustering to Produce High-Level System Organizations of Source Code”, 1998
  • Cites Girard and Koschke as earlier work on automatically detecting abstract datatypes in source code

Baxter et al., Clone Detection Using ASTs

Cites the classic common-subexpression detection algorithm from the Aho/Sethi/Ullman compiler text (the “dragon book”, 1986). The basic idea is to hash subtrees of the AST into a fixed number of buckets B, and then, within each bucket, compare every subtree to every other subtree (which takes O(n^3)). Their hash function ignores the names of identifiers and also tries changing the order of arguments in commutative operations like '+', to avoid being fooled by simple “copy, paste, make a small change” scenarios.
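A minimal Ruby sketch of the bucket-hashing step, assuming ASTs are nested arrays (s-expressions) whose first element is the node type; the identifier-blind hash, the bucket count, and the helper names are illustrative assumptions rather than Baxter et al.'s actual implementation:

  NUM_BUCKETS = 1024

  # Hash a subtree by its node types only, ignoring identifier names and other
  # leaf values, so renamed copies land in the same bucket.
  def shape_hash(node)
    return 0 unless node.is_a?(Array)
    type, *children = node
    ([type.hash] + children.map { |c| shape_hash(c) }).hash % NUM_BUCKETS
  end

  # Collect every subtree of an s-expression AST.
  def subtrees(node, acc = [])
    return acc unless node.is_a?(Array)
    acc << node
    node.drop(1).each { |c| subtrees(c, acc) }
    acc
  end

  # Group subtrees by bucket; a real detector would then compare each candidate
  # pair within a bucket in detail.
  def clone_candidates(ast)
    buckets = subtrees(ast).group_by { |t| shape_hash(t) }
    buckets.values.flat_map { |group| group.combination(2).to_a }
  end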

Rather than detecting only “clones” in this way, they also check whether “sequences” (left-leaning or right-leaning trees representing a sequence of subtrees, such as the body of a function) contain common subsequences.

They also have a heuristic for checking other “almost-clones”: they check whether the parents of already-detected clones or near-clones are themselves near-clones. (Though it seems like using edit distance on ASTs, as Huang et al. did, would be much cleaner than what they do here.)

Krinke, Identifying Similar Code with Program Dependence Graphs (PDG)

A PDG has vertices representing assignment statements and control predicates (if, switch(), etc.) in a program. A “control dependence edge” from v1→v2 means that whether v2 executes depends on the outcome of the predicate at v1. A “data dependency edge” from v1→v2 means a variable assigned in v1 is consumed by v2.

This work constructs this graph, then looks for “similar paths” v0..vN in two different subgraphs. Essentially, it looks like they're using the approach of Baxter et al. but annotating each subgraph with how many data-dependence edges it has, as another measure of similarity. They bound the maximum length of “similar” paths because their approach relies on identifying a pair of starting vertices (they don't say how) and then “growing” the path to include its neighbors.

Naude, Assessing Program Code through Static Structural Similarity

This work describes a novel measure of similarity. Perhaps more interestingly, it surveys the history of other measures (such as tree distance) and criticizes them. It also describes a technique for mapping variable names between different programs: essentially, try all the possible mappings, see which one makes the similarity metric highest, and use that one.
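A brute-force Ruby sketch of that variable-mapping idea; the token-overlap similarity function is a placeholder assumption (standing in for whatever structural metric is actually used), and trying every permutation is only feasible for small numbers of variables:

  # Placeholder similarity: fraction of shared tokens between two code strings.
  def similarity(a, b)
    ta = a.scan(/\w+/)
    tb = b.scan(/\w+/)
    (ta & tb).size.to_f / [(ta | tb).size, 1].max
  end

  # Try every one-to-one renaming of prog_a's variables onto prog_b's variables
  # and keep the mapping that makes the similarity score highest.
  def best_variable_mapping(vars_a, vars_b, prog_a, prog_b)
    pattern = /\b(#{vars_a.map { |v| Regexp.escape(v) }.join('|')})\b/
    best_score, best_mapping = -1.0, nil
    vars_b.permutation(vars_a.size) do |perm|
      mapping = vars_a.zip(perm).to_h
      renamed = prog_a.gsub(pattern) { |name| mapping[name] }
      s = similarity(renamed, prog_b)
      best_score, best_mapping = s, mapping if s > best_score
    end
    [best_mapping, best_score]
  end

  best_variable_mapping(%w[i total], %w[n sum], "total += i", "sum += n")
  # => [{"i"=>"n", "total"=>"sum"}, 1.0]  (renames i to n and total to sum)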

Some ideas to try

  • Create a feature vector that includes the output of 'flog' or other complexity-metric reporters, the number of each type of 'reek' message, etc.
  • Try clustering on ASTs of various depths, i.e., pruning the AST at some fixed depth in order to compare 'gross structure' (see the pruning sketch after this list).
  • Identify the “best” solution or solutions using a simple heuristic, like minimum flog score or lowest number of lines of code, and then have some of the features in the feature vector be pairwise comparisons between a given submission and that “reference” submission. For examples of other “pairwise” features, see the summary of Gaudencio et al. above.
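A minimal Ruby sketch of the depth-pruning idea, assuming nested-array s-expressions similar to what the ruby_parser gem produces; the depth limit and the :pruned marker are arbitrary illustrative choices:

  # Keep only the top max_depth levels of an s-expression AST, replacing the
  # removed subtrees with a :pruned marker so gross structure can be compared.
  def prune_ast(node, max_depth)
    return node unless node.is_a?(Array)
    return :pruned if max_depth.zero?
    type, *children = node
    [type] + children.map { |c| prune_ast(c, max_depth - 1) }
  end

  ast = [:defn, :sum, [:args, :xs],
          [:call, [:lvar, :xs], :inject, [:lit, 0]]]
  prune_ast(ast, 2)
  # => [:defn, :sum, [:args, :xs], [:call, :pruned, :inject, :pruned]]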

Using Ruby metrics tools for autograding

Two of my students who helped with CS169 TA duties very early on investigated some of these tools (all of them are supplied as gems), and here is what they found:

  • flay - looks for structural code duplication. This is probably useless for small exercise autograding but may be useful for evaluating the final project. For our short programming assignments the score is 0, but running flay on ResearchMatch returned dozens of places in the code that could use refactoring. For example, it found three methods in user.rb that could easily be combined into one method.
  • flog - looks at assignments, branches, and calls (ABC complexity). Accurately reflects the algorithmic complexity of some of the smaller exercises. Potentially a good candidate for short assignment autograding. If used for larger projects, students should be instructed to mainly focus on extreme score outliers which should be broken up into multiple methods. A good tool for finding portions of code that are hairy.
  • reek - makes stylistic suggestions about method length, syntax duplication, method cohesion, uninformative names and more. This is our favorite metric for the short assignments as it provides simple, easy to understand feedback in text form. We found that low cohesion warnings can often be disregarded. For long assignments such as ResearchMatch, reek complains about too many things, which gets overwhelming. flog and flay do a better job of targeting higher level issues. reek should be reserved for targeted refactorings of individual methods or files. (Idea: use flog or flay to decide which reek warnings are most important?)
  • rails_best_practices - mostly complains about whitespace issues. Least useful of the metrics we experimented with. ('NOTE:' this was over a year ago, and this gem may have improved since then.)
  • rcov - measures test coverage. Not useful for short assignments. Definitely interesting to run on larger projects to illustrate which portions of the application are barely touched by the test suite. Perhaps even worth a mention in the TDD chapter?