two_papers

These papers propose two different approaches to using machine learning to scale human grading. One combines ML with peer grading to see whether ML can reduce the amount of effort per response while improving, or at least not decreasing, grading accuracy (answer: ranges from “just barely” to “no”). The other proposes a UI plus automatic clustering for exploring sets of similar answers and applying grades & feedback to whole clusters at a time. Philosophically they differ: peer grading tries to reduce the level of sophistication needed to grade (not every peer needs to be an experienced grader if the rubric is good and multiple peers are used to identify discrepancies; in this paper the peer task is further subdivided into 'feature identification' and 'feature verification' steps), whereas the clustering paper tries to improve the productivity of a single sophisticated human grader by amplifying their leverage.

TL;DR:
  • Splitting the peer task into identify+verify, and using ML to predict how many human peers are needed, can result in a *modest* decrease in effort for a *modest* loss of accuracy on simple (binary) questions; don't do it on enumerative questions, as it's strictly worse than peer-median. Peer-median seems to still be the way to go if you have enough peers.
  • With a reasonable UI, clustering applied to human grading dramatically improves productivity (the effect increases nonlinearly with the number of submissions, since it takes a while for the clustering leverage to kick in) with no appreciable loss of grading accuracy. So answers to (these) short-answer questions cluster well.
USING ML TO SCALE PEER GRADING (Kulkarni et al + Scott Klemmer)

The first paper augments a baseline Calibrated Peer Grading scheme (grade = median of N peers) with simple machine learning to answer the question: “How many peer graders does this response need?” The intuition is that more ambiguous answers will require more human graders than unambiguous ones. The paper is a bit confusing to read because two separate ideas appear to be tested:

  1. Using ML to estimate the human effort required for grading: A staff-trained, off-the-shelf, lexical-feature-based classifier predicts a response's grade and the probability of that grade; the probability is used as a confidence score, and the number of peer graders is chosen based on that confidence (see the first sketch after this list). The authors stress that choosing the best classifier was a non-goal.
  2. A different workflow for peer grading: Rather than just assigning a grade from 1-3 points, grading is divided into an 'identify' step (peers identify which features of a rubric are reflected by a response) and a 'verify' step (separate peers verify whether the feature assignments are correct); see the second sketch below.
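
Roughly, idea 1 amounts to the following (a minimal Python sketch of my reading of it; the classifier choice, lexical features, confidence thresholds, and peer counts are stand-ins, not the paper's actual configuration):

```python
# Sketch of idea 1: use a classifier's confidence to decide how many peer
# graders each response gets, then take the median of those peers' grades
# (the baseline aggregation). The classifier, lexical features, confidence
# thresholds, and peer counts are illustrative assumptions, not the paper's.
from statistics import median

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def train_grade_classifier(staff_graded_texts, staff_grades):
    """Train on staff-graded responses (grades 1-3) using simple lexical features."""
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(staff_graded_texts, staff_grades)
    return clf


def peers_needed(clf, response_text, min_peers=1, max_peers=5):
    """Lower classifier confidence -> more human peer graders assigned."""
    confidence = max(clf.predict_proba([response_text])[0])
    if confidence > 0.9:
        return min_peers
    if confidence > 0.7:
        return 3
    return max_peers


def peer_median_grade(peer_grades):
    """Baseline aggregation: the final grade is the median of the peers' grades."""
    return median(peer_grades)
```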

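Idea 2, the identify-verify split, looks something like this (again a toy sketch; the rubric features, the "any identifier marks it" rule, and the majority-vote verification rule are my assumptions, not necessarily the paper's exact rules):

```python
# Toy sketch of idea 2 (identify + verify): one set of peers marks which rubric
# features a response exhibits, a second set confirms or rejects each marked
# feature, and the grade is derived from the verified features. The rubric,
# the "any identifier" rule, and the majority-vote rule are my assumptions.

RUBRIC_POINTS = {
    "states_main_claim": 1,
    "gives_evidence": 1,
    "addresses_counterargument": 1,
}


def identify_step(identifier_markings):
    """identifier_markings: one set of marked features per 'identify' peer.
    A feature moves on to verification if any identifier marked it."""
    return set().union(*identifier_markings)


def verify_step(candidate_features, verifier_votes):
    """verifier_votes: {feature: [True/False, ...]} from the 'verify' peers.
    Keep a feature only if a majority of its verifiers confirm it."""
    verified = set()
    for feature in candidate_features:
        votes = verifier_votes.get(feature, [])
        if votes and sum(votes) > len(votes) / 2:
            verified.add(feature)
    return verified


def grade_from_features(verified_features):
    return sum(RUBRIC_POINTS[f] for f in verified_features)


# Example: two identify peers, then two verify peers per candidate feature.
markings = [{"states_main_claim", "gives_evidence"}, {"states_main_claim"}]
votes = {"states_main_claim": [True, True], "gives_evidence": [True, False]}
print(grade_from_features(verify_step(identify_step(markings), votes)))  # -> 1
```
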
Overall, the baseline (median-of-peers) requires the most effort but produces the most accurate results (though at most 85% for 'binary' questions and 49% for 'enumerative' questions). In comparison, identify-verify gets 72% on binary questions with 84% of the effort - not a huge savings in efficiency and a measurable loss in accuracy. On enumerative questions, identify-verify does worse on both effort and accuracy. The real problem is that the maximum reported accuracy of peer grading vs. staff is 80% with peer-median, and that's the best case; most cases (e.g., enumerative rather than binary questions) are substantially lower. This may be OK for pass/fail grading, or for providing some feedback to the student even if the assessment is “incorrect”. Student acceptance of identify-verify (did they think they would get a fair grade?) was lower than for median-of-peers. There were also some possible usability challenges with the workflow's UI.

CAVEATS:
  • I don't quite understand why one wouldn't simply use the classifier prediction to select the number of raters and then use median-of-peers.
USING CLUSTERING TO SCALE HUMAN GRADING (Mike Brooks, UW; Sumit Basu et al, MSR)

An earlier paper from these authors proposed a hierarchical clustering algorithm whose distance metric uses word stemming, tf-idf similarity, string match, and Wikipedia-based LSA similarity. It creates a two-level hierarchy (≤ 10 clusters, ≤ 5 subclusters per cluster, then leaves) of responses to a question (see the first sketch after the list below). This paper is more of an HCI contribution:

  • “word cloud” visualization in which color lightness rather than font size indicates word prevalence, and order of words matches their (normalized) order in answers within a cluster, as a way to help graders explore answer space;
  • grading actions UI, in which individual responses inherit subcluster labels, and subclusters inherit cluster labels, but either can be manually overridden (see the second sketch below);
  • feedback as well as a grade can be applied at the cluster level, and feedback can be reused in the UI.
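
To make the clustering concrete, here's a rough sketch of a two-level grouping using only TF-IDF cosine distance; the paper's actual distance metric also folds in stemming, string match, and Wikipedia-based LSA similarity, and its cluster/subcluster caps may differ:

```python
# Rough sketch of a two-level grouping: at most 10 top-level clusters, each
# split into at most 5 subclusters. Only TF-IDF cosine distance is used here;
# the paper's metric also combines stemming, string match, and Wikipedia-based
# LSA similarity. Requires scikit-learn >= 1.2 (for the `metric` argument).
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer


def two_level_clusters(responses, max_clusters=10, max_subclusters=5):
    vectors = TfidfVectorizer(stop_words="english").fit_transform(responses).toarray()
    top = AgglomerativeClustering(
        n_clusters=min(max_clusters, len(responses)),
        metric="cosine", linkage="average").fit_predict(vectors)

    hierarchy = {}  # {cluster_id: {subcluster_id: [response indices]}}
    for cluster_id in set(top):
        members = [i for i, c in enumerate(top) if c == cluster_id]
        if len(members) <= max_subclusters:
            # Too few members to split further: each response is its own leaf.
            hierarchy[cluster_id] = {i: [i] for i in members}
            continue
        sub = AgglomerativeClustering(
            n_clusters=max_subclusters,
            metric="cosine", linkage="average").fit_predict(vectors[members])
        hierarchy[cluster_id] = {
            s: [members[j] for j, lab in enumerate(sub) if lab == s] for s in set(sub)
        }
    return hierarchy
```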

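The label-inheritance rule in the grading-actions UI boils down to "most specific grade wins"; a small sketch, assuming a simple three-level store (the data layout here is mine, not the paper's):

```python
# Sketch of the grading-actions inheritance rule: a grade applied to a cluster
# is inherited by its subclusters and responses, a subcluster grade overrides
# the cluster grade, and a per-response override beats both. The three-level
# store below is an illustrative data layout, not the paper's implementation.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GradeStore:
    cluster_grades: dict = field(default_factory=dict)     # cluster_id -> grade
    subcluster_grades: dict = field(default_factory=dict)  # (cluster_id, sub_id) -> grade
    response_grades: dict = field(default_factory=dict)    # response_id -> grade

    def effective_grade(self, response_id, cluster_id, sub_id) -> Optional[int]:
        """Most specific grade wins: response override > subcluster > cluster."""
        if response_id in self.response_grades:
            return self.response_grades[response_id]
        if (cluster_id, sub_id) in self.subcluster_grades:
            return self.subcluster_grades[(cluster_id, sub_id)]
        return self.cluster_grades.get(cluster_id)


# Example: grade a whole cluster at once, then manually override one response.
store = GradeStore()
store.cluster_grades[3] = 2        # every response under cluster 3 gets 2 points
store.response_grades["r17"] = 0   # except this one, regraded by hand
print(store.effective_grade("r17", cluster_id=3, sub_id=1))  # -> 0
print(store.effective_grade("r42", cluster_id=3, sub_id=1))  # -> 2
```
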
The grading corpus was ~700 MTurk answers to two questions from the US Citizenship exam. Graders needed to have more than one year of experience, i.e. this work does not attempt to lower the level of sophistication/expertise needed for grading, unlike peer grading. All subjects graded both sets of answers - one with the clustered UI and the other with the “flat” (one answer at a time) UI - for 20 minutes.

  • Nice visualizations show how the gain of clustering vs. flat changes over time: relatively more advantage as you grade more responses.
  • Participants liked the clustering interface much better
  • Participants were more productive by up to a factor of 8, with no statistically significant loss of grading accuracy vs gold standard
  • 3x as many answers received feedback (since applied at cluster level)
CAVEATS:
  • In the clustering algo, why the magic constants 2, 10, 5? (see Basu S et al, “Powergrading”, TACL 1 (2013) for the answer)
  • “Gold standard” accuracy was based on the subset of answers where 3 independent human graders had unanimous agreement, which was about 82% of answers for each question. So there's some small possibility that the questions chosen happen to be easy to grade, i.e. we don't know the effect of “poorly designed” questions that admit a wide range of open-ended answers that would make them hard to grade.
FOLLOWUP PAPERS TO READ:

Zhang M, Contrasting automated & human scoring of essays, R&D Connections (2013)

Hearst M, The debate on automated essay grading, Intell. Sys & their Apps, 2000

Jordan S et al, E-assessment for learning? The potential of short-answer free-text questions with tailored feedback, Brit J Edu Tech 40(2) 2009, p 371
