\documentclass[letter,12pt]{article}
\usepackage{acl2010}
\usepackage{times}
\usepackage{latexsym,color,amsmath,amssymb,graphicx}
\usepackage{multirow}
\usepackage{rotating}
\usepackage{stmaryrd} % For double brackets
\input std-macros
\title{Project Proposal for cs228}
\author{Chinmay Kulkarni and Gabor Angeli}
\begin{document}
\maketitle
\begin{abstract}
This report presents an approach to predicting the helpfulness of a
movie review, one which considers new topics introduced over time and
which attempts to separate extrinsic {\em noise} factors from
intrinsic factors.
\end{abstract}
\section{Problem Statement}
Online reviews have become popular in a number of domains, such as products, restaurants, and movies. Crowdsourced reviews are useful in presenting a diverse range of opinions as well as catering to a wide range of information needs.

However, with the increasing number of online reviews comes the related problem of gauging the quality and helpfulness of a review. Several websites now allow their users to mark reviews as helpful or not. Such manual ratings, however, may be sparse. For instance, recently created reviews may have few ratings. Furthermore, a review of a product that already has many reviews is seen by fewer visitors, and so is less likely to have a manual rating.

In this project, we focus on the problem of predicting the (manual) helpfulness of a review, based on features extracted from the review text and on helpfulness ratings provided by human annotators for other reviews.
\subsection{Approach}
Instead of modeling the problem as a straightforward regression of rated helpfulness given review features, we imagine a model that treats the manual (observed) rating as a random variable influenced by factors both {\em intrinsic} and {\em extrinsic} to the review.

We define a factor as intrinsic if it can be predicted from the text of the review alone; by extension, an extrinsic factor cannot. Our approach is informed by the following two hypotheses.
{\bf Hyp 1:} The actual helpfulness of a review (the helpfulness dependent solely on its intrinsic factors) is strongly influenced by the number of new topics the review introduces.
Intuitively, a review that introduces several new topics contributes more information than one that does not, and so should be more helpful.
{\bf Hyp 2:} Human ratings are noisy, and are influenced by both {\em intrinsic} and {\em extrinsic} factors.
This hypothesis is motivated by the observation that reviews written early on are more likely to be rated as helpful, since a reader has fewer other helpful reviews to compare them against (informally, a review posted later must clear a higher bar to be marked as helpful).
The helpfulness rating provided by human annotators is considered the {\em extrinsic} helpfulness.
A more detailed justification for both hypotheses is provided in \refsec{preliminary}.
\subsection{Applications}
Automatically assigning helpfulness to an online review has many applications. One is predicting the helpfulness of reviews which do not yet have a well-established human-annotated rating (such as newly created reviews, or reviews of popular products that already have many other reviews). Another is text summarization: reviews that are more helpful should receive a higher weight in the summarization task, whereas current summarization systems tend to consider all reviews equally informative {\em a priori}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% IMPLEMENTATION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\Section{implementation}{Implementation}
\subsection{Model}
\FigStar{model}{0.43}{model}{
The graphical model proposed for this project.
Grey nodes denote observed features; the blue node denotes the
variable that is being tested; the red node denotes a feature
derived from a topic model.
}
To solve the problem, we intend to employ a directed graphical model.
The model is shown in \reffig{model}.
Observed variables are colored grey.
The feature variables (i.e.\ {\em Lexical Features}, {\em Structural Features},
etc.) are compressed into single nodes for illustrative purposes;
these nodes would either be split into individual features or modeled
in some structured way.
The variable we are interested in ({\em Extrinsic Helpfulness \%}) is marked
in blue.
A node denoting the number of new topics introduced in the review is marked
in red, as it is a placeholder for an embedded topic model.
The essence of the model is to represent the two hypotheses presented in the
introduction.
The decoupling of observations into {\em extrinsic} and
{\em intrinsic} influences is done by placing the nodes at the top
of the graph (influencing {\em Extrinsic Helpfulness \%}), or at the bottom
(a result of {\em Intrinsic Helpfulness \%}).
Extrinsic helpfulness is in turn affected by a notion of intrinsic helpfulness
in conjunction with the noise factors.
The features used for this part of the model would likely align roughly with
those of \newcite{2006kim-helpfulness} and other similar work.
The first hypothesis is modeled by the red node {\em Number of New Topics}.
This will be an implementation of LDA, trained either separately from or
jointly with the rest of the model.
The paper by \newcite{2008titov-summarization} attempts a similar technique
for review summarization.
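
As a concrete illustration of the embedded topic component, the following
Python fragment (a minimal sketch; the toy corpus and the choice of two
topics are illustrative assumptions, not part of the proposed model) trains
a standard LDA model with {\tt gensim} and reads off the topic mixture of a
new review:
\begin{verbatim}
from gensim import corpora
from gensim.models import LdaModel

# Toy tokenized reviews standing in for the real corpus.
tokenized_reviews = [
    ["great", "acting", "plot"],
    ["terrible", "pacing", "plot"],
    ["beautiful", "soundtrack", "acting"],
]

dictionary = corpora.Dictionary(tokenized_reviews)
bow = [dictionary.doc2bow(doc) for doc in tokenized_reviews]

# num_topics=2 is an arbitrary illustrative choice.
lda = LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)

# Topic mixture of a new review.
new_review = dictionary.doc2bow(["plot", "soundtrack"])
print(lda.get_document_topics(new_review))
\end{verbatim}
In the proposed model, the quantity of interest would be how many of a
review's topics are new relative to the reviews that precede it, rather
than the raw mixture itself.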
A directed model was chosen since the intuitions behind the problem appear to
be causal in nature.
% align well with a directed model paradigm.
For instance, {\em Intrinsic Helpfulness} is one cause of the value of
{\em Extrinsic Helpfulness}, in conjunction with the other extrinsic
factors.
Similarly, a larger number of {\em New Topics} causes higher helpfulness; and so forth.
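
To make the causal reading concrete, one possible encoding of the model's
structure as a plain-Python edge list is sketched below; the node names
follow our description of \reffig{model}, but the exact edge set shown is
an illustrative assumption rather than the final model:
\begin{verbatim}
# One possible reading of the causal structure (illustrative only).
model_edges = {
    # Extrinsic noise factors, at the top, influence the observed rating.
    "PostIndex":            ["ExtrinsicHelpfulness"],
    "OtherReviewsAverage":  ["ExtrinsicHelpfulness"],
    # Hyp 1: the number of new topics drives intrinsic helpfulness.
    "NumberOfNewTopics":    ["IntrinsicHelpfulness"],
    # Hyp 2: intrinsic helpfulness is one cause of the observed rating,
    # and generates the observed review features at the bottom.
    "IntrinsicHelpfulness": ["ExtrinsicHelpfulness",
                             "LexicalFeatures",
                             "StructuralFeatures"],
}

# Parents of the variable we ultimately predict.
parents = [u for u, vs in model_edges.items()
           if "ExtrinsicHelpfulness" in vs]
print(parents)
\end{verbatim}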
\subsection{Training}
The problem as presented is a supervised learning problem.
Pairs of reviews and their associated extrinsic helpfulness are provided at
training time.
At test time, the task is to predict the extrinsic helpfulness of a review
given its text and the other reviews of the same item.
The data for these experiments is taken from an IMDB corpus to which we
have access, consisting of a total of 45,772 movies and 1,808,564 reviews;
in all likelihood only a small subset of the corpus will be used.
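
As a sketch of this supervised setup (not of the graphical model itself),
the following trains a simple text-based regressor on toy data with
scikit-learn; the feature representation and the regressor are placeholder
assumptions:
\begin{verbatim}
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Toy stand-ins for (review text, extrinsic helpfulness) pairs.
reviews = ["Great movie, wonderful acting.",
           "Bad.",
           "Loved the plot twists.",
           "Dull pacing and a weak script."]
helpfulness = [0.9, 0.2, 0.7, 0.4]

X = TfidfVectorizer().fit_transform(reviews)
X_train, X_test, y_train, y_test = train_test_split(
    X, helpfulness, test_size=0.25, random_state=0)

model = Ridge().fit(X_train, y_train)
print(model.predict(X_test))
\end{verbatim}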
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% PRELIMINARY RESEARCH
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\Section{preliminary}{Preliminary Research}
This project extends a project done for cs424 last quarter.
That project consisted of training a simple classifier with features
defined over both {\em extrinsic} and {\em intrinsic} factors.
The results were reasonable (a Pearson correlation of 0.486),
though by no means good.
Furthermore, the most useful features in the model were the ones that
effectively modeled the noise in the rating process, suggesting the
feasibility of {\bf Hyp 2}.
Statistics were collected on the dataset to examine the two hypotheses.
The first hypothesis is roughly approximated by counting the number of
new unique words in a review and correlating it with the average helpfulness
of that review.
This trend is presented in \reffig{newwords}.
However, this graph conflates the percentage of new words with the number of
reviews already seen, and hence may not be indicative of the validity
of our hypothesis.
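
The statistic itself is straightforward to compute; a minimal sketch over
toy data (using {\tt scipy} for the Pearson correlation) is:
\begin{verbatim}
from scipy.stats import pearsonr

# Toy reviews in posting order, with helpfulness ratings.
reviews = [["great", "plot"],
           ["great", "acting"],
           ["plot", "twist", "acting"]]
helpfulness = [0.9, 0.6, 0.4]

seen, pct_new = set(), []
for tokens in reviews:
    unique = set(tokens)
    # Fraction of this review's unique words unseen in earlier reviews.
    pct_new.append(len(unique - seen) / len(unique))
    seen |= unique

r, p = pearsonr(pct_new, helpfulness)
print(f"Pearson r = {r:.3f}")
\end{verbatim}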
The second hypothesis is supported by correlations between extrinsic
factors and extrinsic helpfulness.
For instance, \reffig{postindex} shows a negative correlation between the
post index of a review and its helpfulness.
Similarly, \reffig{otherreviews} shows a strong correlation between the
per-movie normalization (the average rating of the other reviews)
and the review's helpfulness.
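
The per-movie normalization is a simple leave-one-out average; the
computation, mirroring the worked example in the caption of
\reffig{otherreviews}, is:
\begin{verbatim}
# Average helpfulness of the *other* reviews of the same movie.
def other_reviews_average(ratings):
    total, n = sum(ratings), len(ratings)
    return [(total - r) / (n - 1) for r in ratings]

# The worked example from the figure caption.
print(other_reviews_average([0.0, 0.5, 1.0]))  # -> [0.75, 0.5, 0.25]
\end{verbatim}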
\Fig{fig/plot/helpfulGivenReviewPostIndexAve.png}{0.30}{postindex}{
The correlation between the post index of the review
and the helpfulness of the review.
To reduce clutter, this graph averages the Helpfulness ($y$) values
for each of 1000 buckets ($x$).
}
\Fig{fig/plot/helpfulGivenOtherReviewsAve.png}{0.30}{otherreviews}{
The correlation between the average helpfulness of the other reviews
of the same movie and the helpfulness of the review in question.
For example, in the raw graph, a movie with three reviews
rated at 0.0, 0.5, and 1.0 would
be represented as three points \{(0.75,0.0),(0.5,0.5),(0.25,1.0)\}.
To reduce clutter, this graph averages the Helpfulness ($y$) values
for 1000 buckets of Other Review Helpfulness ($x$).
}
\Fig{fig/plot/helpfulGivenNewWordsAve.png}{0.30}{newwords}{
The correlation between the percentage of unique words in the review
which do not appear in any previous review,
and the helpfulness of the review.
To reduce clutter, this graph averages the Helpfulness ($y$) values
for each of 1000 buckets ($x$).
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% PREVIOUS WORK
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\Section{previous}{Previous Work}
This section details some of the previous work in the field.
The first subsection briefly describes one of the initial characteristic
papers on assessing review helpfulness;
the second presents an approach to jointly performing text summarization
and topic modeling, which appears relevant to our task.
\subsection{Automatically Assessing Review Helpfulness \cite{2006kim-helpfulness}}
This paper presents some of the earliest work in the literature on assessing review helpfulness.
The main idea of the paper was to apply SVM regression to a large number of features,
analyzing the results and the impact of each subset of features on the performance
of the system.
The paper concluded that a relatively simple feature set (length of the review,
unigram counts, and star rating) performed the best, with the first two
(length and unigram counts) alone being almost as effective as all three
combined.
The dataset used by the paper was a corpus of Amazon reviews, in the categories
of {\em MP3 Players} and {\em Digital Cameras}.
The dataset consisted of 821 products and 33,016 reviews for {\em MP3 Players},
and 1,104 products and 26,189 reviews for {\em Digital Cameras}.
This data was then filtered for duplicate or near-duplicate entries ($>80\%$ bigram overlap),
resulting in 85 products and 12,07 reviews being discarded for {\em MP3 Players},
and 38 products and 3,692 reviews being discarded for {\em Digital Cameras}.
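
A filter of this kind can be sketched in a few lines; the $80\%$ threshold
is from the paper, while the use of Jaccard similarity over bigram sets is
our assumption, since the paper's exact overlap measure is not reproduced
here:
\begin{verbatim}
# Near-duplicate test via bigram-set overlap (Jaccard similarity is
# an assumption; the 0.8 threshold is the paper's).
def bigrams(tokens):
    return set(zip(tokens, tokens[1:]))

def near_duplicate(tokens_a, tokens_b, threshold=0.8):
    a, b = bigrams(tokens_a), bigrams(tokens_b)
    if not a or not b:
        return False
    return len(a & b) / len(a | b) > threshold

print(near_duplicate("great mp3 player with great sound".split(),
                     "great mp3 player with great sound quality".split()))
\end{verbatim}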
Helpfulness ratings were obtained by keeping only reviews with more than five
helpfulness responses;
this discarded approximately a third to a half of the reviews.
Training was done using SVM regression.
The authors tested a variety of kernels, but found a radial basis function (RBF)
kernel to perform the best; all results are reported using this kernel.
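
For reference, this setup corresponds roughly to the following scikit-learn
sketch over toy feature vectors; the library, hyperparameters, and data here
are our assumptions, with Spearman's $\rho$ (the paper's evaluation metric,
discussed below) used for scoring:
\begin{verbatim}
from scipy.stats import spearmanr
from sklearn.svm import SVR

# Toy rows: [review length, mean tf-idf, star rating].
X_train = [[120, 0.8, 5], [30, 0.2, 2], [400, 0.9, 4], [60, 0.4, 3]]
y_train = [0.9, 0.1, 0.8, 0.4]
X_test = [[200, 0.7, 5], [25, 0.1, 1], [90, 0.5, 3]]
y_test = [0.85, 0.05, 0.5]

svr = SVR(kernel="rbf").fit(X_train, y_train)
rho, _ = spearmanr(svr.predict(X_test), y_test)
print(f"Spearman rho = {rho:.3f}")
\end{verbatim}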
To assist in feature creation, a {\em Product-Feature} set was automatically extracted.
This was done by mining references to product features from {\tt Epinions.com},
where users are allowed to describe the pros and cons of each product.
Frequent words were pruned from this list, resulting in around 10,000 unique features
for both domains.
Features were extracted over each review.
These features fell into the following categories:
\begin{enumerate}
\item {\bf Structural}: review length ({\tt LEN}); average sentence length,
number of sentences, etc. ({\tt SEN});
HTML formatting ({\tt HTM})
\item {\bf Lexical}: {\it tf-idf} statistic on each unigram ({\tt UGR});
{\it tf-idf} on each bigram ({\tt BGR})
\item {\bf Syntactic} ({\tt SYN}): features on POS information
\item {\bf Semantic}: product features ({\tt PRF});
{\it General Inquirer} sentiment words ({\tt GIW})
\item {\bf Meta Info}: star rating (average/deviation) ({\tt STR})
\end{enumerate}
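
A minimal extractor for a few of these feature classes ({\tt LEN},
{\tt SEN}, and {\tt STR}; the input format is an illustrative assumption)
might look like:
\begin{verbatim}
# Extract a handful of the feature classes above from raw text.
def extract_features(review_text, star_rating):
    sentences = [s for s in review_text.split(".") if s.strip()]
    tokens = review_text.split()
    return {
        "LEN": len(tokens),                               # review length
        "SEN_count": len(sentences),                      # sentence count
        "SEN_avg_len": len(tokens) / max(len(sentences), 1),
        "STR": star_rating,                               # star rating
    }

print(extract_features("Great player. Sound is crisp. Battery is weak.", 4))
\end{verbatim}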
Evaluation was done using the Spearman correlation coefficient.
The best results came from the three features ({\tt LEN+UGR+STR}),
resulting in a Spearman coefficient of 0.656 on {\em MP3 Players}
and 0.595 on {\em Digital Cameras}.
Adding more features tended to hurt performance: adding every feature
dropped the {\em MP3 Player} score to 0.601,
though it mildly improved the {\em Digital Camera} score to 0.604.
\subsection{A Joint Model of Text and Aspect Ratings for Sentiment Summarization \cite{2008titov-summarization}}
This paper presents an approach to summarization, jointly learning the topics to summarize
and the text that describes them.
The paper proposes a model ({\em Multi-Aspect Sentiment Model}) consisting of two parts:
an unsupervised topic model, and a classifier from words to sentiment ratings.
The paper evaluates on a hotel review dataset taken from {\tt TripAdvisor.com}.
The paper presents summarization as a two-fold task. The first part
is {\em aspect identification and mention extraction}: determining which aspects of the
reviews are relevant to describe, and which text fragments describe them.
The second is {\em sentiment classification}: determining the sentiment of the
relevant extracted text.
The paper attempts to incorporate both of these tasks into a single model which extracts text fragments
and their associated rating, given the review and a per-aspect rating.
The dataset used in the paper consists of 10,000 reviews from {\tt TripAdvisor.com}, where each
review is rated on at least the aspects {\em service}, {\em location}, and {\em rooms}.
The approach taken is to build a model, coined the Multi-Aspect Sentiment model (MAS),
which effectively combines a multi-grain LDA topic model with a series of MaxEnt
classifiers, one per topic.
In the model, each word in a document is sampled from either a local or a global topic;
the intent is that global topics capture non-sentiment phenomena
(e.g.\ {\em MP3 players} versus {\em hotels}), while local topics capture sentiment-laden
words.
The paper raises the issue that the aspects of a review often correlate strongly with each other:
a reviewer who dislikes a hotel is unlikely to rate any of its aspects highly.
To address this, the model classifies not over absolute ratings, but over the difference between
the aspect rating and the overall rating.
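
Concretely, the classifier's target for each aspect is the difference
between the aspect rating and the overall rating, as in this toy
computation:
\begin{verbatim}
# Classify over rating differences rather than absolute ratings,
# factoring out the reviewer's overall sentiment (toy 1-5 ratings).
overall = 2
aspect_ratings = {"service": 2, "location": 4, "rooms": 1}
deltas = {a: r - overall for a, r in aspect_ratings.items()}
print(deltas)  # {'service': 0, 'location': 2, 'rooms': -1}
\end{verbatim}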
Inference over the model was done using Gibbs sampling, as exact inference is intractable.
The model achieves a precision of between 75\% and 85\%, depending on the aspect.
\bibliographystyle{acl}
\bibliography{main}
\end{document}