Final Report


Who are your team members? (You can work in groups of up to 3 people.)
  • Gregory Sanders
  • Christoph Schulze
  • Kotaro Hara

What is the problem you are solving?
On online shopping websites like Amazon, products are reviewed and commented on by customers who bought them. Reliable reviews help potential customers decide whether or not to buy a product. To help evaluate the reliability of reviews, Amazon also lets customers rate the quality of the reviews themselves (i.e., mark them as useful or not useful). Some products’ reviews are rated many times, but many other products have reviews with no ratings at all, forcing potential customers to make purchase decisions solely from the review texts. One solution is to automatically compute a reliability score for such reviews from their textual content.

In this project, we will pursue two goals. First, we will create a program that learns to automatically classify whether product reviews are reliable or not. Second, we will investigate whether a classifier trained on one domain (e.g., reviews on Amazon) can be adapted to other domains (e.g., reviews on Yelp). This would benefit anyone who wants to build a social review-oriented system but lacks enough participation to get good feedback on which reviews are helpful. It could also let us identify features that are commonly shared among good reviews versus bad ones.

What data will you use (how will you get it)?
We will use customer reviews from product pages on Amazon and another website (not yet decided). We are still looking for data that is cleaned and ready to use; if we cannot find such data, we will write a web crawler to scrape it from the web.

How will you do the project?
Important dates:
    • Oct 2: Submit a proposal
    • Oct 9: Finish collecting datasets and a list of related research
    • Oct 16: Finish the first prototype classifier for classifying the usefulness of reviews in one domain.
    • Oct 23: Improve the single-domain classifier. Apply it to the other domain and determine what needs to be improved.
    • Oct 30: ...
    • Nov 7: Progress report
    • Dec 12: Presentation

Which algorithms/techniques/models you plan to use/develop?
We will use regression to predict a real-valued reliability score for each review, using the proportion of positive to negative secondary reviews (ratings of reviews) as the target value.
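As a minimal sketch of this idea (function names and the single length feature are illustrative placeholders, not our final design), the target can be derived from the secondary-review votes and fed to a simple regression:

```python
# Sketch: deriving the regression target from "reviews of reviews".
# A review rated helpful by 7 of 10 voters gets the target 0.7;
# reviews with no votes have no target and would be skipped.

def helpfulness_target(positive_votes, total_votes):
    """Proportion of positive secondary reviews, or None if unrated."""
    if total_votes == 0:
        return None
    return positive_votes / total_votes

# A minimal one-feature linear regression (closed-form least squares),
# standing in for whatever regression model we finally choose.
def fit_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x
```

In practice the input features would come from the review text itself rather than a single scalar.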

What will you evaluate, and will you know if you are successful?
We will evaluate
  1. Accuracy of the classifier on one domain
  2. Accuracy of the classifier on other domains with varying degrees of dissimilarity

What do you expect to submit/accomplish by the end of the semester?
Possible conferences for submission:
  • ICWSM-12 (Abstract submission January 13th 2012)
    • There is a call for: Subjectivity in textual data; sentiment analysis; polarity/opinion identification and extraction

Progress Report


Besides describing the progress of our class project, this report describes our system in more detail and explains which parts still need to be implemented. In response to the comments on the proposal, we have added a related work section and an algorithm section. During our work we also came up with an idea for future work, explained at the end of this report.

Related Work

Mansour et al. studied the use of domain adaptation for reviews of multiple kinds of products sold on Amazon’s online store. The authors studied methods of combining different models to predict sentiment in a target domain, and found that instead of finding a convex linear combination of models, finding weighted combinations of the source distributions in each source domain fared much better. We are also interested in using domain adaptation to improve the predictive power of our model, detailed below.


For the machine learning portion of our project we are using the model of Zaidan et al., which has both a generative and a discriminative component.
The discriminative component entails a log-likelihood model computing the probability of the label (sentiment of the review, quality of the review, etc.) given the observed data and the probabilistic model, as well as a Conditional Random Field (CRF) that captures the probability of the rationales for a review given the observed data, the class label, and the log-likelihood model.
In our case, the class labels will be given a priori, from users who have already rated reviews, and the rationales will be given by users of our system, who highlight sections of the review that they feel are important for the class label.
The generative portion comes into play through a joint prior across the log-likelihood model and the CRF, which attempts to explain both the class labels and the rationales. This lets us get by with only a small number of rationales across the data set while still using them to inform the rest of the data points according to the current CRF parameters. The more users of our system highlight portions of reviews as rationales, the better the model becomes.
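Under our reading of the model, the factorization could be written as follows (the notation is ours and may differ from the original paper's exact formulation):

```latex
% x: observed review text, y: class label, r: rationale annotations
% \theta: classifier parameters, \phi: CRF parameters
p_{\theta,\phi}(y, r \mid x) \;=\;
  \underbrace{p_{\theta}(y \mid x)}_{\text{discriminative classifier}}
  \;\cdot\;
  \underbrace{p_{\phi}(r \mid x, y)}_{\text{CRF over rationales}},
\qquad (\theta, \phi) \sim p(\theta, \phi)
```

The joint prior $p(\theta, \phi)$ ties the two components together, so rationales observed for only a few reviews can still inform the classifier on reviews without any.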


We identified a good dataset that has already been used in several studies. The dataset was created for the publication:
John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. Association for Computational Linguistics (ACL), 2007.
It contains Amazon review data with the ratings of reviews (reviews of reviews) that we need to train our machine learning algorithm. The data is stored in an easily parsable XML format and is divided into different domains (e.g., Apparel, Toys, Electronics).
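A rough sketch of how such review records could be extracted (the tag names `<rating>`, `<helpful>`, and `<review_text>`, and the sample snippet, are assumptions about the dataset's layout and should be checked against the actual files; a regex pass is used because raw review dumps sometimes contain characters that break strict XML parsers):

```python
import re

# Hypothetical sample in the assumed pseudo-XML layout.
SAMPLE = """<review>
<rating>4.0</rating>
<helpful>7 of 10</helpful>
<review_text>Works as advertised, battery life is great.</review_text>
</review>"""

def extract_field(review, tag):
    """Return the text between <tag> and </tag>, or None if absent."""
    m = re.search(r"<%s>(.*?)</%s>" % (tag, tag), review, re.S)
    return m.group(1).strip() if m else None

def parse_reviews(text):
    """Split a file's contents into per-review dictionaries."""
    out = []
    for block in re.findall(r"<review>(.*?)</review>", text, re.S):
        out.append({
            "rating": float(extract_field(block, "rating")),
            "helpful": extract_field(block, "helpful"),
            "text": extract_field(block, "review_text"),
        })
    return out
```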

System Architecture

We decided to use a client-server model for our system. We chose this architecture model for two reasons:
  • Training and applying a machine learning model can be time-consuming. In a client-server system we can offload this work to the server.
  • In addition to the static data set we need to collect user generated data for the training of the machine learning model. This data is stored in a central SQL database on our web server.

Figure: Model of the system architecture

Client/User Interface

The client/user interface for our system will be a Firefox plugin that wraps the Amazon homepage. It allows us to show the reliability score calculated by our learning model next to existing product reviews. To do that, it parses the Amazon page, extracts the reviews, sends them to the server, and retrieves the results of the learning model from the server. The user can also help us collect training data for our machine learning model by annotating Amazon review texts: they can highlight the parts of a review text that they believe are the reason why the review is reliable or unreliable. This information is also sent to the server, which stores it in the database.

The UI will also have a non-Amazon part that simply shows existing reviews from our database to users and allows them to annotate the reviews in the same fashion. In addition, users can supply any review to the system to obtain its reliability score.
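The client-server exchange could use a small JSON message format; the field names below are our own suggestion, not a finalized protocol:

```python
import json

def make_score_request(product_name, review_text):
    """Client -> server: ask for the reliability score of one review."""
    return json.dumps({"type": "score", "product": product_name,
                       "review": review_text})

def make_annotation_request(review_id, reliable, highlights):
    """Client -> server: submit a user annotation.
    highlights: list of [start, end] character offsets in the review text."""
    return json.dumps({"type": "annotate", "review_id": review_id,
                       "reliable": reliable, "highlights": highlights})
```

Keeping the payload as plain JSON makes it easy to produce from the JavaScript client and to parse on the server.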

We built a prototype of our client system in JavaScript. The prototype application allows the user to provide a review and send it to the server to calculate its reliability score.
The screenshot shows the prototype view for the annotation task. The interface allows users to see:
  • Product name
  • Product type
  • Rating of the product
  • Product review text

The user then judges whether the review text sounds reliable or not. We then ask them to highlight the parts of the review text that made them believe the review is reliable or unreliable. This information is sent to the server, which stores it in its database. In the next step we will extend the prototype so that it works on the actual Amazon web page.

Server tasks

The server has three main tasks to perform:
  • Collect user data: The server receives the user-generated data described in the previous section, parses it, and saves it in the database.
  • Apply reviews to the model: The server receives reviews for classification, parses them, and applies them to the model. The results, which contain the reliability judgement, are then parsed and sent back to the client.
  • Retrain the model: The server retrains the algorithm with the new user-generated input data and creates an updated version of the machine learning model. This will be triggered either on a timer or based on the amount of new user-generated data.
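The first two tasks could be sketched as follows against an in-memory SQLite database; the table layout, column names, and the word-set "model" are illustrative stand-ins, not our actual schema or learning algorithm:

```python
import sqlite3

def init_db():
    """Create the (illustrative) annotation table."""
    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE annotations
                  (review_id TEXT, reliable INTEGER, highlight TEXT)""")
    return db

def store_annotation(db, review_id, reliable, highlight):
    """Task 1: persist one user annotation."""
    db.execute("INSERT INTO annotations VALUES (?, ?, ?)",
               (review_id, int(reliable), highlight))
    db.commit()

def score_review(model, review_text):
    """Task 2: apply a review to the model. Here 'model' is a stub set of
    words seen in reliable reviews; the score is the fraction of words
    the stub recognizes."""
    words = review_text.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in model) / len(words)
```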

Our pre-processing in Rapid Miner

We used Rapid Miner to implement a first version of the machine learning model. Rapid Miner is an open-source tool for data analysis and data mining with a good user interface that lets us design our models in a Simulink-like fashion. We created two Rapid Miner processes (a process is a workflow in Rapid Miner). The first reads training data from the database, pre-processes it (see the screenshot of the Rapid Miner process), trains a machine learning model, and saves the model to a file. The second reads the reviews and the stored model file and applies the review data to the model. The machine learning algorithm used in the initial version is just a placeholder for evaluating the rest of the architecture; in the final version it will be replaced with the one described in the algorithm section.
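The two processes follow a train-save / load-apply split that can be sketched as follows (the word-set "model" is a placeholder for the final algorithm, and the function names are ours):

```python
import pickle

def train_and_save(labeled_reviews, path):
    """Process 1: train a placeholder model and save it to a file.
    labeled_reviews: list of (text, reliable_flag) pairs."""
    reliable_words = set()
    for text, reliable in labeled_reviews:
        if reliable:
            reliable_words.update(text.lower().split())
    with open(path, "wb") as f:
        pickle.dump(reliable_words, f)

def load_and_apply(path, review_text):
    """Process 2: load the stored model file and score a new review."""
    with open(path, "rb") as f:
        model = pickle.load(f)
    words = review_text.lower().split()
    return sum(1 for w in words if w in model) / max(len(words), 1)
```

Separating training from application this way lets the server retrain on its own schedule while scoring requests keep using the last saved model file.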

We wrote a Java program that can execute our models as a standalone program without the Rapid Miner GUI. We can already retrain the model and apply data to it for classification using our Java code. In addition to finalizing the model, we still have to finish implementing the client-server communication architecture.

Future Work


We believe that we can turn the annotation task into an output-agreement game. The tasks that users would perform are the same as those described in the user interface section.

A brief idea is as follows: we ask users to give their opinion (reliable/unreliable) and highlight the parts of the review that they believe are the reasons for their judgement. Multiple players (> 2) work asynchronously on annotation tasks for the same review text. Two players earn points if their opinions match, and additional points if the parts they highlighted are the same or at least overlap. On the other hand, if two players give opposite opinions yet highlight the same area, they lose points. We believe this could motivate users to highlight the informative parts of reviews.
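The scoring rules above could be sketched as a small function; the point values and the overlap test are placeholders to be tuned:

```python
def spans_overlap(a, b):
    """a, b: (start, end) character offsets; True if they intersect."""
    return a[0] < b[1] and b[0] < a[1]

def score_pair(opinion1, span1, opinion2, span2):
    """Score one pair of players under the output-agreement rules."""
    if opinion1 == opinion2:
        points = 1                      # opinions agree
        if spans_overlap(span1, span2):
            points += 1                 # bonus: highlights overlap too
        return points
    if spans_overlap(span1, span2):
        return -1                       # opposite opinions, same evidence
    return 0
```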

This could be implemented as a separate web page. All it would need is access to our SQL database.

Comment (Ben): I don't think I understand your approach. By "secondary reviews", do you mean ratings of reviews? And do you mean you are going to use a machine learning technique and use reviews with ratings as training data? I'd also like you to start thinking about *which* ML techniques you will use. Also, you need to think about how you will evaluate the quality of your evaluation system. Also, you should be clear that you are using review ratings as ground truth, but they *are not* ground truth - they are a proxy. We don't know whether review ratings are reliable themselves. This work is going to be questioned unless you have some actual ground truth. More generally, your English here is not very clear, and not always grammatical. Finally, you should start with a literature review on what others are doing to look at review quality.

11/23/11: Comment (Ben):
Thank you for the clarifications - this all seems reasonable now, and you are making fine progress. However, there is still not sufficient related work. Are you saying that only one other group has ever considered the idea of distinguishing real from fake product reviews? This is surely not true. Among others, here is some recent work: