Here are my notes, summaries, and questions on a group of papers I read together. Numbers are the order in which I read them (so I can cite them in my summary), not the order in which they appeared. There is some LPU aspect to these, as each is pretty short (~6 pages with big margins).

  1. Sujith Ravi and Jihie Kim, Profiling Student Interactions in Threaded Discussions with Speech Act Classifiers, Proceedings of the AI in Education Conference, 2007.
  2. Donghui Feng, Jihie Kim, Erin Shaw, and Ed Hovy, Towards Modeling Threaded Discussions through Ontology-based Analysis, In Proceedings of National Conference on Artificial Intelligence (AAAI-2006).
  3. Donghui Feng, Erin Shaw, Jihie Kim, and Ed Hovy, Learning to Detect Conversation Focus of Threaded Discussions, In Proc. of the Joint Human Language Technology Conference/Annual Meeting of the North American Chapter of the Assoc. for Computational Linguistics (HLT-NAACL 2006), 2006.

All of these papers have the following in common:

  • analyze the content of threaded discussion forums for a medium-sized undergrad OS course at USC
  • point out that undergrad forum posts are much noisier/more entropic than “well behaved” training corpora, requiring more filtering and preprocessing than nontechnical prose before throwing them at a classifier. E.g.: ignore threads with <2 messages (since many analyses rely on thread structure) or >5 messages (they tend to get off-topic or ramble); replace domain-specific jargon such as code snippets with a single “CODE” token; detect distinct phrases with exactly the same semantics (“When I do this” vs “When you do this” actually mean the same thing); and take more care than usual with stemming, singularizing, etc.
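
A minimal sketch of that kind of filtering and normalization; the 2–5 message thresholds are from the papers, but the regexes for detecting code snippets and person-equivalent phrasings are my own guesses:

```python
import re

def preprocess_thread(messages):
    """Filter and normalize one thread of forum posts.

    Returns None for threads outside the 2..5 message range the papers
    keep; otherwise returns normalized message texts.
    """
    if not (2 <= len(messages) <= 5):
        return None
    cleaned = []
    for text in messages:
        text = text.lower()
        # Collapse anything that looks like a code snippet into one token.
        text = re.sub(r"`[^`]+`|\b\w+\([^)]*\)\s*;?", " CODE ", text)
        # Normalize person so "when i do this" == "when you do this".
        text = re.sub(r"\b(i|you|we)\b", "PRONOUN", text)
        cleaned.append(text)
    return cleaned
```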

A key concept is a *Speech Act or SA*: a labeling of the type of post something is – pose a question, answer a question, elaborate on a previous answer, offer a correction to a previous answer. This is based on Speech Act Theory, which dates from the 1960s (Austin 1962, Searle 1969 – yes, pretty sure it's the Chinese Room Searle). A single message can be labeled with more than one SA, and its SA's can be with respect to more than one previous message.
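
A message with its SA's might be represented like this (the label names and fields are my own illustration; the papers use a richer hand-built SA taxonomy):

```python
from dataclasses import dataclass, field

# Hypothetical SA labels; the papers define a larger taxonomy.
SA_TYPES = {"QUESTION", "ANSWER", "ELABORATION", "CORRECTION"}

@dataclass
class SpeechAct:
    sa_type: str        # one of SA_TYPES
    target_msg_id: int  # the earlier message this act responds to

@dataclass
class Message:
    msg_id: int
    text: str
    acts: list = field(default_factory=list)  # a message can carry several SAs

# One post can both answer msg 3 and correct msg 5:
m = Message(7, "Actually you need to lock the mutex first...",
            acts=[SpeechAct("ANSWER", 3), SpeechAct("CORRECTION", 5)])
```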

Key conclusions include:

  • Semantic analysis of forum text, to predict either student behavior/outcomes or the necessity of intervention, gives better results than prediction using metadata alone (level of message activity, message timestamp series, etc.)
  • If you can build classifiers to determine what kind of SA a message is, you can answer queries such as “Which popular questions are still unanswered”

[1] builds two linear SVM classifiers for SAs: “QC” emits the SA of question-posts with 88% accuracy, and “AC” emits the SA of answer-posts with 73% accuracy. (Accuracy means agreement with human annotators, though the authors note that a group of human annotators agreed with each other only 95% and 86% of the time respectively – “a similar pattern was observed among human annotators” – so both humans and classifiers do a better job of detecting questions than answers.) For training, each message in the training corpus (~1k messages) is hand-labeled as a 'question' or 'answer' and hand-labeled as to its SA type. Classifiers are trained on the top 200 features (selected using information gain) drawn from n-grams for 1 ≤ n ≤ 4.
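
A sketch of the feature-selection step – ranking binary “this n-gram is present” features by information gain – in plain Python (my own illustration; the paper presumably used an off-the-shelf toolkit, and the SVM itself is omitted here):

```python
from collections import Counter
from math import log2

def ngrams(tokens, max_n=4):
    """The set of all 1..max_n-grams of a token list."""
    return {" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(feats, labels, f):
    """Information gain of the binary feature 'n-gram f is present'."""
    with_f = [l for fs, l in zip(feats, labels) if f in fs]
    without = [l for fs, l in zip(feats, labels) if f not in fs]
    cond = (len(with_f) * entropy(with_f)
            + len(without) * entropy(without)) / len(labels)
    return entropy(labels) - cond

def top_features_by_info_gain(docs, labels, k=200):
    """Keep the k n-gram features that best separate the label classes."""
    feats = [ngrams(d.split()) for d in docs]
    vocab = set().union(*feats)
    return sorted(vocab, key=lambda f: info_gain(feats, labels, f),
                  reverse=True)[:k]
```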

The classifiers are then used to characterize “unanswered questions” in the forums, i.e., where TA intervention may be useful. They use the classifiers to answer 4 questions as true/false for each *thread*, and compare the classifier-based answers to a human's answers (agreement rate in parens). Not surprisingly, since QC does better than AC, query Q1 is better supported than the other queries.

  • Q1 - does thread contain at least 1 question? (92.7%)
  • Q2 - was first question in thread answered? (74.5%)
  • Q3 - were all questions in thread answered? (70.0%)
  • Q4 - were some but not all of the questions in thread answered? (75.8%)
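
One plausible (entirely my own) reading of how per-message classifier output could map to Y/N answers for Q1–Q4, treating a question as answered if any later message in the thread is labeled an answer – a deliberate simplification that ignores reply targets:

```python
def answer_q1_q4(sa_labels):
    """sa_labels: ordered per-message SA labels for one thread,
    e.g. ["QUESTION", "ANSWER", "QUESTION"]."""
    q_idx = [i for i, l in enumerate(sa_labels) if l == "QUESTION"]
    # A question counts as answered if ANY later message is an answer.
    answered = [any(l == "ANSWER" for l in sa_labels[i + 1:]) for i in q_idx]
    return {
        "Q1": bool(q_idx),                             # any question at all?
        "Q2": bool(answered) and answered[0],          # first question answered?
        "Q3": bool(answered) and all(answered),        # all questions answered?
        "Q4": any(answered) and not all(answered),     # some but not all?
    }
```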

These questions, and the rules mapping classifier output to a Y/N answer for each, are poorly explained, and it's not clear whether Q1–Q4 are intended to constitute a partition of the threads. For evaluation they chose 55 threads at random. They also measure agreement among different humans giving the same answer, and those rates are 15–20% higher than machine/human agreement rates, except for Q1, where they're about the same.

[2] tries to improve the SA classifiers' accuracy by adding features from Linguistic Inquiry and Word Count (LIWC) – a bag of features like number of words, words per sentence, verb tense, swear words, positive/negative emotion words, etc. They used four standard feature-selection algorithms (chi-square, info gain, gain ratio, and ReliefF – an instance-based feature-weighting method, Kononenko's extension of Kira and Rendell's Relief algorithm) to narrow down the features and retrain their classifiers. Classifier accuracy was 93% for sources [questions] and 90% for sinks [answers].

They then looked at whether there were positive correlations between earned grades and any of the variables they measured for threads - including standard metadata like number of posts and duration between post and project deadline, the number and types of SA's according to the automatic classifier, etc. However, the lackluster results suggest that LIWC features don't help much, and most other features are not great predictors of student outcomes. For example, they hypothesized that poor performers would be characterized by asking more questions than they answer, but the data didn't bear this out. OTOH their sample was tiny, so who knows. At any rate the main interesting contribution is arguably the use of LIWC to slightly improve SA classification.

[3] tries to automate the labeling of both the topic of a message thread and “how different messages in a thread are related”. The second contribution wasn't clear to me because papers [1] and [2] suggest that SA's *are* the way to express “relationships” among thread messages, yet this paper doesn't discuss SA's at all.

The idea is to use an apparently fairly conventional Rocchio-style relevance-feedback classifier on the individual messages, but rather than relying on a labeled training set, they automatically generate an ontology from a “canonical text” – in their case, the table of contents and top-level index of the course textbook (using the page number of each index entry to associate that entry with a chapter and subsection). The ontology in this case contained about 1550 topics, based on extracting about 3000 unique words from the TOC + index. What's confusing is that in their tiny corpus (206 messages, 50 threads), (coarse) labels for messages *are* provided, because students posting in the forum are required to select the topic of their question from a dropdown menu created by the instructor. Presumably these 6 choices were present in their auto-generated tree, since they go on to report classification accuracy according to six different classification heuristics.
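
The page-number-to-chapter mapping can be sketched as follows; the chapter titles and start pages here are made up for illustration, not taken from the paper or the actual textbook:

```python
import bisect

# Hypothetical chapter start pages (book-specific; not from the paper).
CHAPTER_STARTS = [(1, "Processes"), (55, "Scheduling"),
                  (120, "Memory"), (200, "File Systems")]

def topic_for_page(page):
    """Map an index entry's page number to the enclosing chapter topic."""
    starts = [p for p, _ in CHAPTER_STARTS]
    i = bisect.bisect_right(starts, page) - 1
    return CHAPTER_STARTS[i][1]

def build_ontology(index_entries):
    """index_entries: [("semaphore", 73), ...] -> {chapter: {terms}}."""
    onto = {}
    for term, page in index_entries:
        onto.setdefault(topic_for_page(page), set()).add(term)
    return onto
```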

They give six different ways to “classify the topic of” a thread. The simplest is to run Rocchio on the bag-of-words of all messages in the thread. The other five have to do with classifying individual messages and then classifying the thread based on the topic of the OP (original poster), based on the topic most frequently found across all the thread's messages, etc. They don't argue for the benefits of any particular one of these, offering them only as different examples of how to classify. In fact, all six methods (including the bag-of-words-over-all-messages) achieve classification accuracies between 62% and 68%, with the lone standout being 72% when the classification is “Nearest topic based on votes consisting of individual messages' classification scores”.
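
Two of the aggregation heuristics are easy to sketch (topic names and scores here are hypothetical; I'm assuming the per-message scores are something like Rocchio cosine similarities per topic):

```python
from collections import Counter

def thread_topic_majority(message_topics):
    """Heuristic: thread topic = most frequent per-message topic."""
    return Counter(message_topics).most_common(1)[0][0]

def thread_topic_score_vote(message_scores):
    """Heuristic: sum each message's per-topic similarity scores
    and take the topic with the best total."""
    totals = Counter()
    for scores in message_scores:
        totals.update(scores)  # adds per-topic scores across messages
    return totals.most_common(1)[0][0]
```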

So the idea is interesting, and I like the part about automating the ontology generation, but no lessons are derived from the results. It might be interesting to combine this work with [1].

[4] tries to use the HITS algorithm, a link-analysis algorithm in the same family as PageRank. In HITS, each page's “authority score” is computed as the sum of the “hub scores” of the pages that point to it, and its hub score is computed as the sum of the authority scores of the pages it points to; repeat until convergence. To apply it, the threads are organized into a graph in which threads are linked based on lexical similarity (cosine similarity using TF*IDF) and SA strength scores. (This work was done before [3] or [2], so SA labeling was done manually.) Messages also have a self-link associated with the “trustworthiness” of the message's author; they measured “trustworthiness” as “percentage of positive responses to this message” (where “positive” encompasses a subset of the 13 SA's with which threads were manually labeled), but one can imagine measuring this directly using a reputation/ranking system instead.
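
A minimal pure-Python HITS iteration over an adjacency structure (my own sketch of the standard algorithm, with L2 normalization each round; the paper's graph additionally weights links by lexical similarity and SA strength):

```python
def hits(adj, iters=50):
    """adj: {node: set of nodes it points to}.
    Returns (hub, auth) score dicts, L2-normalized each iteration."""
    nodes = list(adj)
    hub = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # Authority of n = sum of hub scores of nodes pointing to n.
        auth = {n: sum(hub[m] for m in nodes if n in adj[m]) for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub of n = sum of authority scores of nodes n points to.
        hub = {n: sum(auth[m] for m in adj[n]) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth
```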

After pruning excessively short and excessively long threads, they were left with ~1300 messages comprising ~300 threads of average length 4 messages each. The goal is to detect the focus of each conversation (thread); they consider both the hub score and the authority score emitted by HITS, and measure precision and mean reciprocal rank over the possible candidates for conversation focus. They compare to a “baseline” that guesses randomly, which gives 28% precision and an MRR of 0.539 (i.e., about what you'd expect by chance, so I don't know why this was chosen as the baseline). Interestingly, the best results were obtained when SA features and poster reputation were considered *but* lexical similarity was *excluded*.
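
For concreteness, here is how MRR is computed over a set of threads (the standard definition, not code from the paper). Sanity check on the chance-level claim: for a random ranking of 4 candidates – the average thread length here – the expected MRR is (1 + 1/2 + 1/3 + 1/4)/4 ≈ 0.52, consistent with the reported 0.539.

```python
def mean_reciprocal_rank(ranked_candidates, gold):
    """ranked_candidates: per-thread candidate lists, ordered best-first;
    gold: the true conversation-focus message for each thread."""
    total = 0.0
    for cands, g in zip(ranked_candidates, gold):
        if g in cands:
            total += 1.0 / (cands.index(g) + 1)  # reciprocal of 1-indexed rank
    return total / len(gold)
```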

Overall observations and open questions:

  • Methodology seems weak, with a lot of magic constants and no discussion of cross-validation. The explanations weren't particularly easy to follow, the datasets were small, and the baselines didn't always make sense.
  • Authors say their analysis “is driven by requirements from instructors and students, rather than the need of general online information-seeking communities”, but they don't say how those requirements differ. (It's not clear to me that they do.)
  • A key question to keep in mind for this kind of work is whether the goal is exploratory (give the instructor insight) or action-oriented (automatically attempt interventions). In Recovery-Oriented Computing, we had the philosophy that “false positive” triggers that needlessly initiated recovery were sometimes OK, as long as the cost incurred by the “needless” recovery was small and didn't adversely impact availability. I wonder if there is a similar principle we could use here: if the intervention does no harm, it might be OK to trigger it even when not “necessary”.
  • Some of the papers cite other work that tries to predict student outcome or intervention need based on forum metadata without analyzing the text (frequency of posts by a given student, time gap between posts or between a post and an assignment deadline, etc), claiming that you get better results with text analysis; but none of the papers seems to try to combine the two.
  • The Q1-Q4 characterizations in [1] are a nice candidate for visualization (although they'd have to be refined, and/or give instructor the ability to “define” the questions on the fly).
  • Their training set seems tiny. With MOOCs, would training on previous offerings of the same course be effective?
  • Results of [3] could perhaps be used to suggest “These threads might answer your question…” in real time as a question is being typed/posted.
kim-forum-analysis.txt · Last modified: 2018/02/28 17:02 (external edit)