Here are my notes, summaries, and questions on a group of papers I read together. Numbers are the order in which I read them, so I can cite them in my summary, but not the order in which they appeared. There is some LPU aspect to these, as each is pretty short (~6 pages with big margins).
All of these papers have the following in common:
A key concept is a *Speech Act or SA*: a labeling of the type of post something is – pose a question, answer a question, elaborate on a previous answer, offer a correction to a previous answer. This is based on Speech Act Theory, which dates from the 1960s (Austin 1962, Searle 1969 - yes, pretty sure it's the Chinese Room Searle). A single message can be labeled with more than one SA, and its SA's can be with respect to more than one previous message.
Key conclusions include:
 builds two linear SVM classifiers for SAs: “QC” emits the SA of question-posts and “AC” emits the SA of answer-posts. For training, each message in the training corpus (~1k messages) is hand-labeled as a 'question' or 'answer' and hand-labeled with its SA type. Classifiers are trained on the top 200 features (selected by information gain) based on n-grams for 1 ≤ n ≤ 4. QC achieves 88% accuracy and AC 73% (accuracy meaning agreement with human annotators), and “a similar pattern was observed among human annotators”: a group of human annotators agreed with each other only 95% of the time on questions and 86% on answers, so both humans and classifiers do a better job of detecting questions than answers.
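The training setup above can be sketched roughly as follows. This is my reconstruction, not the authors' code: it assumes scikit-learn, uses mutual information as a stand-in for information gain, and the toy corpus and labels are invented for illustration.

```python
# Sketch of the QC/AC pipeline: a linear SVM over 1..4-gram counts,
# keeping the top-k features by information gain (approximated here by
# scikit-learn's mutual_info_classif). Toy data; not the authors' corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

messages = [
    "how do I submit the project",        # question
    "what does this error mean",          # question
    "is the deadline friday",             # question
    "you submit it through the portal",   # answer
    "that error means a missing import",  # answer
    "yes the deadline is friday",         # answer
]
labels = ["question"] * 3 + ["answer"] * 3

vec = CountVectorizer(ngram_range=(1, 4))   # n-grams for 1 <= n <= 4
X = vec.fit_transform(messages)
k = min(200, X.shape[1])  # paper keeps top 200; the toy corpus has fewer

clf = Pipeline([
    ("select", SelectKBest(mutual_info_classif, k=k)),
    ("svm", LinearSVC()),
])
clf.fit(X, labels)
preds = clf.predict(vec.transform(["when is the project due"]))
print(preds[0])
```

In the real pipeline the same skeleton would presumably be instantiated twice, once over question-posts (QC) and once over answer-posts (AC).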
The classifiers are then used to characterize “unanswered questions” in the forums, ie, where TA intervention may be useful. They use the classifiers to answer 4 questions as true/false for each *thread*, and compare the classifier-based answer to a human's answer (agreement rate in parens). Not surprisingly, since QC does better than AC, query Q1 is better supported than the other queries.
These four questions, and the rules mapping classifier output to a Y/N answer for each, are poorly explained, and it's not clear whether Q1-Q4 are intended to constitute a partition of the threads. For evaluation they chose 55 threads at random. They also measure agreement among different humans giving the same answer; those rates are 15-20% higher than the machine/human agreement rates, except for Q1, where they're about the same.
 tries to improve the SA classifiers' accuracy by adding features from Linguistic Inquiry and Word Count (LIWC) - there's a bag of features like number of words, words per sentence, verb tense, swear words, positive/negative emotion words, etc. They used four standard feature-selection algorithms (chi-square, info gain, gain ratio, and something called ReliefF, which I couldn't find a reference for and don't know what it is) to narrow down the features and retrain their classifiers. Classifier accuracy was 93% for sources [questions] and 90% for sinks [answers].
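A toy illustration of what the LIWC-style features plus chi-square selection might look like. The feature definitions and word lists below are invented stand-ins for the real LIWC lexicon, and this is only one of the four selection methods the paper tries:

```python
# Compute a few LIWC-style counts per message and rank them with the
# chi-square statistic. Word lists and labels are invented toy data.
import re

import numpy as np
from sklearn.feature_selection import chi2

POSITIVE = {"great", "thanks", "good"}   # stand-in for LIWC positive-emotion words
NEGATIVE = {"wrong", "bad", "stuck"}     # stand-in for LIWC negative-emotion words

def liwc_style(message):
    words = re.findall(r"[a-z]+", message.lower())
    sents = max(1, message.count(".") + message.count("?"))
    return [
        len(words),                         # word count
        len(words) / sents,                 # words per sentence
        sum(w in POSITIVE for w in words),  # positive-emotion words
        sum(w in NEGATIVE for w in words),  # negative-emotion words
    ]

messages = [
    "I am stuck. What is wrong with my loop?",
    "Why is this bad? It looks wrong.",
    "Great question. Thanks, that works now.",
    "Good catch, thanks.",
]
labels = [0, 0, 1, 1]  # 0 = question-like, 1 = answer-like (toy labels)

X = np.array([liwc_style(m) for m in messages])
scores, pvalues = chi2(X, labels)
ranked = sorted(zip(scores, ["wc", "wps", "pos", "neg"]), reverse=True)
print([name for _, name in ranked])
```

On this toy data the emotion-word counts separate the classes perfectly, so they rank above the length features.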
They then looked at whether there were positive correlations between earned grades and any of the variables they measured for threads - including standard metadata like number of posts and duration between post and project deadline, the number and types of SA's according to the automatic classifier, etc. However, the lackluster results suggest that LIWC features don't help much, and most other features are not great predictors of student outcomes. For example, they hypothesized that poor performers would be characterized by asking more questions than they answer, but the data didn't bear this out. OTOH their sample was tiny, so who knows. At any rate the main interesting contribution is arguably the use of LIWC to slightly improve SA classification.
 tries to automate the labeling of both the topic of a message thread and “how different messages in a thread are related”. The second contribution wasn't clear to me because papers  and  suggest that SA's *are* the way to express “relationships” among thread messages, yet this paper doesn't discuss SA's at all.
The idea is to use an apparently fairly conventional Rocchio-style relevance-feedback classifier on the individual messages, but rather than relying on a labeled training set, they automatically generate an ontology from a “canonical text”, in their case the table of contents and top-level index of the course textbook (using the page number of each index entry to associate that entry with a chapter and subsection). The ontology in this case contained about 1550 topics, based on extracting about 3000 unique words from the TOC + index. What's confusing is that in their tiny corpus (206 messages, 50 threads), (coarse) labels for messages *are* provided, because students posting in the forum are required to select the topic of their question from a dropdown menu created by the instructor. Presumably these 6 choices were present in their auto-generated tree, since they go on to report classification accuracy according to six different classification heuristics.
They give six different ways to “classify the topic of” a thread. The simplest is to run Rocchio on the bag-of-words of all messages in the thread. The other five have to do with classifying individual messages and then classifying the thread based on the topic of the OP (original poster), based on the topic most frequently found across all the thread's messages, etc. They don't argue for the benefits of any particular one of these, offering them only as different examples of how to classify. In fact, all six methods (including the bag-of-words-over-all-messages) achieve classification accuracies between 62% and 68%, with the lone standout being 72% when the classification is “Nearest topic based on votes consisting of individual messages' classification scores”.
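A minimal sketch of the Rocchio-style per-message classification plus two of the thread-level heuristics (OP's topic vs. majority vote). The topic "prototypes" here are invented keyword lists standing in for the auto-generated TOC/index ontology:

```python
# Rocchio-style topic classification: one TF-IDF prototype vector per
# topic, assign each message to its nearest (cosine) prototype, then
# label the thread by OP topic or by majority vote. Toy data throughout.
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

topics = {  # invented stand-ins for ontology entries
    "recursion": "recursion base case recursive call stack",
    "sorting":   "sort quicksort mergesort comparison pivot",
}
names = list(topics)
vec = TfidfVectorizer()
proto = vec.fit_transform(topics[t] for t in names)  # one prototype per topic

def classify(message):
    sims = cosine_similarity(vec.transform([message]), proto)[0]
    return names[int(np.argmax(sims))]

thread = [
    "my quicksort pivot choice seems wrong",           # OP
    "check the base case of your recursive helper",
    "no, the comparison in the sort loop is off",
]
per_message = [classify(m) for m in thread]
op_topic = per_message[0]                             # heuristic: topic of the OP
majority = Counter(per_message).most_common(1)[0][0]  # heuristic: majority vote
print(op_topic, majority)
```

The bag-of-words-over-all-messages variant would just call `classify(" ".join(thread))` instead of voting.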
So the idea is interesting, and I like the part about automating the ontology generation, but no lessons are derived from the results. It might be interesting to combine this work with .
 tries to use the HITS algorithm, a precursor to PageRank. In HITS, each page's “authority score” is computed as the sum of the “hub scores” of other pages that point to it, and its hub score is computed as the sum of the authority scores of the nodes it points to; repeat until convergence. To apply it, the threads are organized into a graph whose links are based on lexical similarity (cosine similarity using TF*IDF) and SA strength scores. (This work was done before  or , so manual SA labeling was done.) Messages also have a self-link associated with the “trustworthiness” of the message's author; they measured “trustworthiness” as “percentage of positive responses to this message” (where “positive” encompasses a subset of the 13 SA's with which threads were manually labeled), but one can imagine measuring this directly using a reputation/ranking system instead.
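The HITS update described above can be sketched on a toy graph. The graph here is invented and unweighted; the real system's links carry lexical-similarity and SA weights, and the trustworthiness self-links are omitted for brevity:

```python
# Minimal HITS iteration: authority = sum of in-neighbors' hub scores,
# hub = sum of out-neighbors' authority scores, normalize, repeat.
# Toy adjacency dict (node -> nodes it points to); not the paper's data.
import math

graph = {
    "m1": ["m2", "m3"],
    "m2": ["m3"],
    "m3": ["m1"],
    "m4": ["m3"],
}
nodes = list(graph)
auth = {n: 1.0 for n in nodes}
hub = {n: 1.0 for n in nodes}

for _ in range(50):  # iterate to (approximate) convergence
    auth = {n: sum(hub[m] for m in nodes if n in graph[m]) for n in nodes}
    hub = {n: sum(auth[m] for m in graph[n]) for n in nodes}
    # normalize so the scores don't blow up across iterations
    a_norm = math.sqrt(sum(v * v for v in auth.values()))
    h_norm = math.sqrt(sum(v * v for v in hub.values()))
    auth = {n: v / a_norm for n, v in auth.items()}
    hub = {n: v / h_norm for n, v in hub.items()}

focus = max(nodes, key=auth.get)  # candidate conversation focus
print(focus, round(auth[focus], 3))
```

Here `m3`, which everything points at, ends up with the top authority score; the paper's question is whether the top-scored message is the thread's actual focus.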
After pruning excessively short and excessively long threads, they were left with ~1300 messages comprising ~300 threads of average length 4 messages each. The goal is to detect the focus of each conversation (thread); they consider both the hub score and authority score emitted by HITS, and measure precision and mean reciprocal rank of the possible candidates for conversation focus. They compare to a “baseline” that guesses randomly, which gives 28% precision and MRR of 0.539 (ie, about what you'd expect by chance, so I don't know why this was chosen as the baseline). Interestingly, the best results were obtained when SA features and poster-reputation were considered *but* lexical similarity was *excluded*.
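For reference, mean reciprocal rank as used here: for each thread, take the reciprocal of the rank at which the true focus appears in the ranked candidate list, then average over threads. The data below is invented:

```python
# Mean reciprocal rank over a set of ranked candidate lists. Toy data.
def mrr(ranked_lists, truths):
    total = 0.0
    for ranked, truth in zip(ranked_lists, truths):
        rank = ranked.index(truth) + 1  # 1-based rank of the true focus
        total += 1.0 / rank
    return total / len(truths)

ranked_lists = [["m3", "m1", "m2"], ["a2", "a1"], ["b1", "b2", "b3"]]
truths = ["m1", "a2", "b3"]
print(mrr(ranked_lists, truths))  # (1/2 + 1 + 1/3) / 3
```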