Progress Report: Crowdsourced Comment Aggregation

Overview of Problem
We would like to implement an online comment aggregation system that combines computational linguistics methods with crowdsourced tasks to present meaningful, representative comments to users. We intend to use TF-IDF (term frequency-inverse document frequency) to compare comments, cluster similar comments into sub-topics, and determine representative comments from the entire input set. In addition to TF-IDF clustering, users who wish to post will first be required to perform a task that contributes to the selection of representative comments. We believe that an element of human subjectivity will help produce higher-quality aggregation results than automated methods alone can provide.
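To make the comparison step concrete, here is a minimal sketch of TF-IDF weighting followed by cosine similarity between comments. It uses only the Python standard library rather than NLTK, and the tokenizer, the function names (tokenize, tfidf_vectors, cosine) and the sample comments are assumptions made for illustration, not part of our application.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplification)."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def tfidf_vectors(comments):
    """Return one sparse {term: weight} dict per comment."""
    docs = [Counter(tokenize(c)) for c in comments]
    n_docs = len(docs)
    # Document frequency: the number of comments in which each term appears.
    doc_freq = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in doc.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical sample comments.
    comments = [
        "The new bus route is great",
        "I love the new bus route",
        "Parking downtown is too expensive",
    ]
    vecs = tfidf_vectors(comments)
    print("comment 0 vs 1:", cosine(vecs[0], vecs[1]))
    print("comment 0 vs 2:", cosine(vecs[0], vecs[2]))
```

With this weighting, a term that appears in every comment gets a weight of zero, so two comments only look similar when they share comparatively rare terms.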

We do not yet have TF-IDF operational on our web application, so we will focus our discussion below on related research as well as a description of progress towards implementing our crowdsourced comment aggregation web application.


Background/Related Research
Across the Internet, information is aggregated and summarized so that users can absorb it quickly. Topic posting and commenting are a prime example of where it is useful to sort and prioritize what is shown to a user. For example, forums organize and display comments on posts using a variety of techniques such as user ranking, comment ranking, moderation, and comment summarization. The following are case studies of two popular systems:

1. Slashdot asks moderators, who are selected site users, to rate comments. The top-ranked comments are highlighted when displayed in either a threaded or nested structure. Users choose a score threshold, and only comments ranked at or above it are shown. Additionally, reputation is built into the system: registered users’ ranking history establishes their karma, a factor that influences ranking. While a moderation system like Slashdot's is labor-intensive and restrictive in nature, a benefit is that trolling and generally “useless” comments become invisible to most users.

2. Reddit is a site of user-generated posts where comments are ranked and can be sorted by categories such as best, hot, controversial, new, top, and old. Comments are displayed in a nested structure, with the highest-ranking comments under the user’s chosen sort order shown up to a threshold. Communities on Reddit are moderated by users who have taken the initiative to start or help run them, and posts are ranked based on users’ “upvotes” versus “downvotes.” Moderators can remove “bad” posts and add new moderators. As on Slashdot, Reddit users build karma as they comment on posts.

As far as we are aware, our approach is new: it combines automated text comparison with a required user task whose result feeds back into the selection of representative comments. While users may participate in the comment ranking systems of Slashdot and Reddit, having users choose a representative comment from an automatically pre-selected set is a new task.


Status of Web Application Implementation
Currently the web application is not in an operational state. We instead present the following sketches showing two aspects of the forum interface. A user who wishes to submit a comment is shown the current top representative comment as well as the list of all comments (first image). After the comment is submitted, the user is required to answer a subjective question about previous comments (second image).
[Image 1: project_shot1v2.jpg] [Image 2: project_shot2v2.jpg]

Once TF-IDF clustering is successfully implemented, it will power the selection of the top representative comment displayed to the user. As users answer subjective questions about the comments, their responses will be incorporated into further iterations of the selection process.
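One possible way to fold user responses into the selection is to blend an automated score with a smoothed vote rate. The sketch below is only an illustration under stated assumptions: the field names (auto_score, votes_for, votes_shown), the equal default weighting and the additive smoothing are all hypothetical choices, and we have not yet decided how the two signals will actually be combined.

```python
def blended_score(auto_score, votes_for, votes_shown, weight=0.5, prior=1.0):
    """Mix an automated score in [0, 1] with a smoothed user vote rate.

    weight is the share given to the automated score (hypothetical default).
    prior adds pseudo-votes so rarely shown comments get a moderate rate.
    """
    vote_rate = (votes_for + prior) / (votes_shown + 2 * prior)
    return weight * auto_score + (1 - weight) * vote_rate


def top_representative(candidates, weight=0.5):
    """Return the candidate with the highest blended score.

    candidates is a list of dicts with the hypothetical keys
    'text', 'auto_score', 'votes_for' and 'votes_shown'.
    """
    return max(
        candidates,
        key=lambda c: blended_score(
            c["auto_score"], c["votes_for"], c["votes_shown"], weight
        ),
    )


if __name__ == "__main__":
    # Made-up numbers: A scores higher automatically, B is preferred by users.
    candidates = [
        {"text": "A", "auto_score": 0.80, "votes_for": 2, "votes_shown": 10},
        {"text": "B", "auto_score": 0.70, "votes_for": 9, "votes_shown": 10},
    ]
    print(top_representative(candidates)["text"])
    print(top_representative(candidates, weight=1.0)["text"])
```

Setting weight to 1.0 ignores the votes entirely, which gives a simple way to compare the purely automated selection against the crowdsourced one.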

Status of TF-IDF and K-Means Clustering Algorithm

We are currently facing challenges in implementing the algorithm behind the web interface. We are communicating with faculty/staff in the NLP department (Hal Daumé III, Kristy Hollingshead) to find out what tools are appropriate for this kind of task. We have decided to use the Python-based NLTK toolkit to perform the TF-IDF text comparison and k-means clustering of comments, which for NLP purposes we will call documents.

Producing the TF-IDF values for each document is not the difficult part. The more challenging part is performing the clustering with NLTK. We will concentrate on this over the next week or two, with the goal of producing k clusters of similar documents.
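As a reference for what the clustering step has to do, the following is a small sketch of k-means (Lloyd's algorithm) over sparse TF-IDF vectors. NLTK ships a k-means implementation in its nltk.cluster module, but this sketch deliberately uses plain Python so that each step is visible; the function names, the naive "first k vectors" initialization and the toy weights are assumptions for illustration only.

```python
def sq_distance(a, b):
    """Squared Euclidean distance between two sparse {term: weight} vectors."""
    terms = set(a) | set(b)
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def kmeans(vectors, k, max_iter=20):
    """Return (labels, centroids), where labels[i] is the cluster of vectors[i]."""
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    # Naive initialization (an assumption): use the first k vectors as centroids.
    centroids = [dict(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # converged: no vector changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels, centroids


if __name__ == "__main__":
    # Toy TF-IDF-like weights for four hypothetical comments.
    vectors = [
        {"bus": 0.5, "route": 0.4},
        {"parking": 0.6, "price": 0.3},
        {"bus": 0.4, "route": 0.5, "late": 0.1},
        {"parking": 0.5, "downtown": 0.2},
    ]
    labels, _ = kmeans(vectors, k=2)
    print(labels)
```

The result depends on the initial centroids, so a real implementation would likely use random restarts and keep the best run.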

Another challenge is that the length of comments may affect the accuracy of our results. TF-IDF works best when a document contains many words. Comments are typically short, so the results may be less accurate when assigning comments to sub-topics. However, since TF-IDF has been used with Twitter before, we will still try it and evaluate the results. Moreover, we are not counting solely on TF-IDF to produce quality results; we are complementing it with user input (crowdsourcing), which should significantly help determine which comments are most representative.


Sources

http://slashdot.org/faq

http://www.reddit.com/help/faq

Jagan Sankaranarayanan, Hanan Samet, Benjamin E. Teitler, Michael D. Lieberman, and Jon Sperling. 2009. TwitterStand: news in tweets. In Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (GIS '09). ACM, New York, NY, USA, 42-51.

NLTK resources:

http://www.nltk.org/book

http://www.desilinguist.org/pdf/crossroads.pdf

http://www.cloudera.com/blog/2010/03/natural-language-processing-with-hadoop-and-python/




11/23/11: Comment (Ben)
Sounds like you are making good progress. Looking forward to seeing how this turns out. One other issue that I think is likely to come up (if it hasn't already) is that longer comments are likely to contain multiple distinct topics. Trying to aggregate them will be problematic. For example:
Comment 1 contains topics A & B
Comment 2 contains topics A & C
Comment 3 contains topics B & C
So, the best way to aggregate these would be to break them up into their components. Then, the better version of A could be taken from comments 1 & 2, the better version of B could be taken from comments 1 & 3, and the better version of C could be taken from comments 2 & 3.
This is probably beyond what you will be able to do at this point, but something to consider.