These papers propose two different approaches to using machine learning to scale human grading. One combines ML with peer grading to see whether ML can reduce the amount of effort per response while improving, or at least not decreasing, grading accuracy (answer: ranges from “just barely” to “no”). The other proposes a UI and automatic clustering for exploring sets of similar answers and applying grades and feedback to whole clusters at a time. Philosophically they differ: peer grading tries to reduce the level of sophistication needed to grade (not every peer needs to be an experienced grader if the rubric is good and multiple peers are used to identify discrepancies; in this paper the peer task is further subdivided into 'feature identification' and 'feature verification' steps), whereas the clustering paper tries to improve the productivity of a sophisticated human grader by amplifying their leverage.
The first augments a baseline Calibrated Peer Grading scheme (grade = median of N peers) with simple machine learning to answer the question: “How many peer graders does this answer need?” The intuition is that more ambiguous answers will require more human graders than unambiguous ones. The paper is a bit confusing to read because two separate ideas appear to be tested: using ML to decide how many graders each response gets, and the identify-verify workflow that splits the peer task into feature identification and feature verification.
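To make the baseline and the ML add-on concrete, here is a minimal Python sketch; the spread-based ambiguity check, the grader counts, and the function names are my own illustration under those assumptions, not the paper's actual classifier.

```python
import statistics

def needs_more_graders(grades, spread_threshold=1.0):
    """Hypothetical stand-in for the paper's ML step: treat a wide spread
    among the peer grades collected so far as a sign the answer is ambiguous
    and should get additional graders."""
    return (max(grades) - min(grades)) > spread_threshold

def peer_grade(response, request_grade, initial_peers=3, max_peers=7):
    """Baseline calibrated-peer-grading sketch: the grade is the median of
    the peer grades, with extra graders recruited only for ambiguous answers.
    `request_grade` is a caller-supplied function that returns one peer's grade."""
    grades = [request_grade(response) for _ in range(initial_peers)]
    while needs_more_graders(grades) and len(grades) < max_peers:
        grades.append(request_grade(response))
    return statistics.median(grades)
```

In the real system the ambiguity estimate would come from the paper's learned model rather than a fixed spread threshold; the point of the design is that extra grading effort is spent only where the first few peers disagree.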
Overall, the baseline (median-of-peers) requires the most effort but produces the most accurate results (though at most 85% for 'binary' questions and 49% for 'enumerative' questions). In comparison, identify-verify gets 72% on binary questions with 84% of the effort - not a huge savings in efficiency and a measurable loss in accuracy. On enumerative questions, identify-verify does worse on both effort and accuracy. The real problem is that the maximum reported accuracy of peer grading vs. staff grading is 80% with peer-median, and that's the best case; most cases (e.g. enumerative rather than binary questions) are substantially lower. This may be OK for pass/fail grading or for providing some feedback to the student even if the assessment is “incorrect”. Student acceptance of identify-verify (did they think they would get a fair grade?) was lower than for median-of-peers. There were also some possible usability challenges with the workflow's UI.
An earlier paper from these authors proposed a hierarchical clustering algorithm whose distance metric combines word stemming, tf-idf similarity, string matching, and Wikipedia-based LSA similarity. It creates a two-level hierarchy of responses to a question (≤10 clusters, ≤5 subclusters per cluster, with individual responses as leaves); a sketch of the clustering step follows below. This paper is more of an HCI contribution:
The grading corpus was ~700 MTurk answers to two questions on the US Citizenship exam. Graders needed to have more than one year of grading experience, i.e. this work does not attempt to lower the level of sophistication/expertise needed for grading, unlike peer grading. All subjects graded both sets of answers - one with the clustered UI and the other with a “flat” (one answer at a time) UI - for 20 minutes each.
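To make the clustering step from the earlier paper concrete, here is a minimal sketch assuming a recent scikit-learn; it models only the tf-idf cosine component of the distance metric (the stemming, string-match, and Wikipedia-LSA terms are omitted), and the function names and cluster caps are my own illustration.

```python
from collections import defaultdict
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

def two_level_clusters(responses, max_top=10, max_sub=5):
    """Group answers into <=max_top clusters, each split into <=max_sub
    subclusters, using hierarchical clustering on tf-idf cosine distance."""
    dense = TfidfVectorizer(stop_words="english").fit_transform(responses).toarray()

    def cluster(rows, k):
        # Agglomerative (hierarchical) clustering over the selected rows.
        k = min(k, len(rows))
        if k <= 1:
            return [0] * len(rows)
        model = AgglomerativeClustering(n_clusters=k, metric="cosine", linkage="average")
        return model.fit_predict(dense[rows])

    # Top level: assign every response to one of <=max_top clusters.
    by_top = defaultdict(list)
    for i, label in enumerate(cluster(list(range(len(responses))), max_top)):
        by_top[label].append(i)

    # Second level: split each top-level cluster into <=max_sub subclusters.
    hierarchy = {}
    for label, idxs in by_top.items():
        hierarchy[label] = defaultdict(list)
        for i, sub in zip(idxs, cluster(idxs, max_sub)):
            hierarchy[label][sub].append(responses[i])
    return hierarchy
```

Each leaf list could then be shown in the grading UI so that a grade or feedback comment can be applied to a whole subcluster of similar answers at once.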