\documentclass[letter,12pt]{article}
\usepackage{acl2010}
\usepackage{times}
\usepackage{latexsym,color,amsmath,amssymb,graphicx}
\usepackage{multirow}
\usepackage{rotating}
\usepackage{stmaryrd} % For double brackets
\input std-macros
\title{Project Proposal for cs228}
\author{Chinmay Kulkarni and Gabor Angeli}
\begin{document}
\maketitle
\begin{abstract}
This report presents an approach to predicting the helpfulness of a
movie review, one which considers new topics introduced over time and
which attempts to separate extrinsic {\em noise} factors from
intrinsic factors.
\end{abstract}
\section{Problem Statement}
Online reviews have become popular in a number of domains, such as products, restaurants, and movies. Crowdsourced reviews are useful in presenting a diverse range of opinions as well as catering to a wide range of information needs.

However, with the increasing number of online reviews comes the related problem of gauging the quality and helpfulness of a review. Several websites now allow their users to mark reviews as helpful or not. Such manual ratings, however, may be sparse. For instance, recently created reviews may have few ratings. Furthermore, a review of a product that already has many reviews is seen by fewer visitors, and so is less likely to have a manual rating.

In this project, we focus on the problem of predicting the (manual) helpfulness of a review, based on features extracted from the review text and on helpfulness ratings provided by human annotators for other reviews.
\subsection{Approach}
Instead of modeling the problem as a straightforward regression of rated helpfulness given review features, we imagine a model that treats the manual (observed) rating as a random variable influenced by factors both {\em intrinsic} and {\em extrinsic} to the review.

We define a factor as intrinsic if it can be predicted from the text of the review alone; by extension, an extrinsic factor cannot. Our approach is informed by the following two hypotheses.
{\bf Hyp 1:} The actual helpfulness of a review (the helpfulness dependent solely on its intrinsic factors) is strongly influenced by the number of new topics the review introduces.
Intuitively, a review that introduces several new topics contributes more information than one that does not, and so should be more helpful.
{\bf Hyp 2:} Human ratings are noisy, and are influenced by both {\em intrinsic} and {\em extrinsic} factors.
This hypothesis is motivated by the observation that reviews written early on are more likely to be rated as helpful, since a reader has fewer other helpful reviews to compare them against (informally, a review posted later must clear a higher bar to be marked as helpful).
The helpfulness rating provided by human annotators is considered the {\em extrinsic} helpfulness.
A more detailed justification for both hypotheses is provided in \refsec{preliminary}.
\subsection{Applications}
Automatically assigning helpfulness to an online review has many applications. One is predicting the helpfulness of reviews which do not yet have a well-established human-annotated rating (such as newly created reviews, or reviews of popular products that already have many other reviews). Another is text summarization: reviews that are more helpful should receive a higher weight in the summarization task, whereas current summarization systems tend to consider all reviews equally informative {\em a priori}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% IMPLEMENTATION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\Section{implementation}{Implementation}
\subsection{Model}
\FigStar{model}{0.43}{model}{
The graphical model proposed for this project.
Grey nodes denote observed features; the blue node denotes the
variable that is being tested; the red node denotes a feature
derived from a topic model.
}
To solve the problem, we intend to employ a directed graphical model.
The model is shown in \reffig{model}.
Observed variables are colored grey.
The feature variables (i.e.\ {\em Lexical Features}, {\em Structural Features},
etc.) are compressed into single nodes for illustrative purposes;
these nodes would either be split into individual features or modeled
in some structured way.
The variable we are interested in ({\em Extrinsic Helpfulness \%}) is marked
in blue.
A node denoting the number of new topics introduced in the review is marked
in red, as it is a placeholder for an embedded topic model.
The essence of the model is to represent the two hypotheses presented in the
introduction.
The decoupling of observations into {\em extrinsic} and
{\em intrinsic} influences is done by placing the nodes at the top
of the graph (influencing {\em Extrinsic Helpfulness \%}), or at the bottom
(a result of {\em Intrinsic Helpfulness \%}).
Extrinsic helpfulness is in turn affected by a notion of intrinsic helpfulness
in conjunction with the noise factors.
The features used for this part of the model would likely align roughly with
those of \newcite{2006kim-helpfulness} and other similar work.
The first hypothesis is modeled by the red node {\em Number of New Topics}.
This will be an implementation of LDA, trained either separately from or
jointly with the rest of the model.
The paper by \newcite{2008titov-summarization} attempts a similar technique
for review summarization.
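
As a concrete illustration of the embedded topic component, the following
Python fragment (a minimal sketch; the toy corpus and the choice of two
topics are illustrative assumptions, not part of the proposed model) trains
a standard LDA model with {\tt gensim} and reads off the topic mixture of a
new review:
\begin{verbatim}
from gensim import corpora
from gensim.models import LdaModel

# Toy tokenized reviews standing in for the real corpus.
tokenized_reviews = [
    ["great", "acting", "plot"],
    ["terrible", "pacing", "plot"],
    ["beautiful", "soundtrack", "acting"],
]

dictionary = corpora.Dictionary(tokenized_reviews)
bow = [dictionary.doc2bow(doc) for doc in tokenized_reviews]

# num_topics=2 is an arbitrary illustrative choice.
lda = LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)

# Topic mixture of a new review.
new_review = dictionary.doc2bow(["plot", "soundtrack"])
print(lda.get_document_topics(new_review))
\end{verbatim}
In the proposed model, the quantity of interest would be how many of a
review's topics are new relative to the reviews that precede it, rather
than the raw mixture itself.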
A directed model was chosen since the intuitions behind the problem appear to
be causal in nature.
% align well with a directed model paradigm.
For instance, {\em Intrinsic Helpfulness} is one cause of the value of
{\em Extrinsic Helpfulness}, in conjunction with the other extrinsic
factors.
Similarly, a larger number of {\em New Topics} causes higher helpfulness; and so forth.
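
To make the causal reading concrete, one possible encoding of the model's
structure as a plain-Python edge list is sketched below; the node names
follow our description of \reffig{model}, but the exact edge set shown is
an illustrative assumption rather than the final model:
\begin{verbatim}
# One possible reading of the causal structure (illustrative only).
model_edges = {
    # Extrinsic noise factors, at the top, influence the observed rating.
    "PostIndex":            ["ExtrinsicHelpfulness"],
    "OtherReviewsAverage":  ["ExtrinsicHelpfulness"],
    # Hyp 1: the number of new topics drives intrinsic helpfulness.
    "NumberOfNewTopics":    ["IntrinsicHelpfulness"],
    # Hyp 2: intrinsic helpfulness is one cause of the observed rating,
    # and generates the observed review features at the bottom.
    "IntrinsicHelpfulness": ["ExtrinsicHelpfulness",
                             "LexicalFeatures",
                             "StructuralFeatures"],
}

# Parents of the variable we ultimately predict.
parents = [u for u, vs in model_edges.items()
           if "ExtrinsicHelpfulness" in vs]
print(parents)
\end{verbatim}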
\subsection{Training}
The problem as presented is a supervised learning problem.
Pairs of reviews and their associated extrinsic helpfulness are provided at
training time.
At test time, the task is to predict the extrinsic helpfulness of a review
given its text and the other reviews of the same item.
The data for these experiments is taken from an IMDB corpus to which we
have access, consisting of a total of 45,772 movies and 1,808,564 reviews;
in all likelihood only a small subset of the corpus will be used.
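
As a sketch of this supervised setup (not of the graphical model itself),
the following trains a simple text-based regressor on toy data with
scikit-learn; the feature representation and the regressor are placeholder
assumptions:
\begin{verbatim}
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Toy stand-ins for (review text, extrinsic helpfulness) pairs.
reviews = ["Great movie, wonderful acting.",
           "Bad.",
           "Loved the plot twists.",
           "Dull pacing and a weak script."]
helpfulness = [0.9, 0.2, 0.7, 0.4]

X = TfidfVectorizer().fit_transform(reviews)
X_train, X_test, y_train, y_test = train_test_split(
    X, helpfulness, test_size=0.25, random_state=0)

model = Ridge().fit(X_train, y_train)
print(model.predict(X_test))
\end{verbatim}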
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% PRELIMINARY RESEARCH
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\Section{preliminary}{Preliminary Research}
This project extends a project done for cs424 last quarter.
That project consisted of training a simple classifier with features
defined over both {\em extrinsic} and {\em intrinsic} factors.
The results were reasonable (a Pearson correlation of 0.486),
though by no means good.
Furthermore, the most useful features in the model were the ones that
effectively modeled the noise in the rating process, suggesting the
feasibility of {\bf Hyp 2}.
Statistics were collected on the dataset to examine the two hypotheses.
The first hypothesis is roughly approximated by counting the number of
new unique words in a review and correlating it with the average helpfulness
of that review.
This trend is presented in \reffig{newwords}.
However, this graph conflates the percentage of new words with the number of
reviews already seen, and hence may not be indicative of the validity
of our hypothesis.
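
The statistic itself is straightforward to compute; a minimal sketch over
toy data (using {\tt scipy} for the Pearson correlation) is:
\begin{verbatim}
from scipy.stats import pearsonr

# Toy reviews in posting order, with helpfulness ratings.
reviews = [["great", "plot"],
           ["great", "acting"],
           ["plot", "twist", "acting"]]
helpfulness = [0.9, 0.6, 0.4]

seen, pct_new = set(), []
for tokens in reviews:
    unique = set(tokens)
    # Fraction of this review's unique words unseen in earlier reviews.
    pct_new.append(len(unique - seen) / len(unique))
    seen |= unique

r, p = pearsonr(pct_new, helpfulness)
print(f"Pearson r = {r:.3f}")
\end{verbatim}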
The second hypothesis is supported by correlations between extrinsic
factors and extrinsic helpfulness.
For instance, \reffig{postindex} shows a negative correlation between the
post index of a review and its helpfulness.
Similarly, \reffig{otherreviews} shows a strong correlation between the
per-movie normalization (the average rating of the other reviews)
and the review's helpfulness.
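
The per-movie normalization is a simple leave-one-out average; the
computation, mirroring the worked example in the caption of
\reffig{otherreviews}, is:
\begin{verbatim}
# Average helpfulness of the *other* reviews of the same movie.
def other_reviews_average(ratings):
    total, n = sum(ratings), len(ratings)
    return [(total - r) / (n - 1) for r in ratings]

# The worked example from the figure caption.
print(other_reviews_average([0.0, 0.5, 1.0]))  # -> [0.75, 0.5, 0.25]
\end{verbatim}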
\Fig{fig/plot/helpfulGivenReviewPostIndexAve.png}{0.30}{postindex}{
The correlation between the post index of the review
and the helpfulness of the review.
To reduce clutter, this graph averages the Helpfulness ($y$) values
for each of 1000 buckets ($x$).
}
\Fig{fig/plot/helpfulGivenOtherReviewsAve.png}{0.30}{otherreviews}{
The correlation between the average helpfulness of the other reviews
of the same movie and the helpfulness of the review in question.
For example, in the raw graph, a movie with three reviews
rated at 0.0, 0.5, and 1.0 would
be represented as three points \{(0.75,0.0),(0.5,0.5),(0.25,1.0)\}.
To reduce clutter, this graph averages the Helpfulness ($y$) values
for 1000 buckets of Other Review Helpfulness ($x$).
}
\Fig{fig/plot/helpfulGivenNewWordsAve.png}{0.30}{newwords}{
The correlation between the percentage of unique words in the review
which do not appear in any previous review,
and the helpfulness of the review.
To reduce clutter, this graph averages the Helpfulness ($y$) values
for each of 1000 buckets ($x$).
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% PREVIOUS WORK
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\Section{previous}{Previous Work}
This section details some of the previous work in the field.
The first subsection briefly describes one of the initial characteristic
papers on assessing review helpfulness;
the second presents an approach to jointly performing text summarization
and topic modeling, which appears relevant to our task.
\subsection{Automatically Assessing Review Helpfulness \cite{2006kim-helpfulness}}
This paper presents some of the earliest work in the literature on assessing review helpfulness.
The main idea of the paper was to apply SVM regression to a large number of features,
analyzing the results and the impact of each subset of features on the performance
of the system.
The paper concluded that a relatively simple feature set (length of the review,
unigram counts, and star rating) performed the best, with the first two
(length and unigram counts) alone being almost as effective as all three
combined.
The dataset used by the paper was a corpus of Amazon reviews, in the categories
of {\em MP3 Players} and {\em Digital Cameras}.
The dataset consisted of 821 products and 33,016 reviews for {\em MP3 Players},
and 1,104 products and 26,189 reviews for {\em Digital Cameras}.
This data was then filtered for duplicate or near-duplicate entries ($>80\%$ bigram overlap),
resulting in 85 products and 12,07 reviews being discarded for {\em MP3 Players},
and 38 products and 3,692 reviews being discarded for {\em Digital Cameras}.
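
A filter of this kind can be sketched in a few lines; the $80\%$ threshold
is from the paper, while the use of Jaccard similarity over bigram sets is
our assumption, since the paper's exact overlap measure is not reproduced
here:
\begin{verbatim}
# Near-duplicate test via bigram-set overlap (Jaccard similarity is
# an assumption; the 0.8 threshold is the paper's).
def bigrams(tokens):
    return set(zip(tokens, tokens[1:]))

def near_duplicate(tokens_a, tokens_b, threshold=0.8):
    a, b = bigrams(tokens_a), bigrams(tokens_b)
    if not a or not b:
        return False
    return len(a & b) / len(a | b) > threshold

print(near_duplicate("great mp3 player with great sound".split(),
                     "great mp3 player with great sound quality".split()))
\end{verbatim}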
Helpfulness ratings were obtained by keeping only reviews with more than five
helpfulness responses;
this discarded approximately a third to a half of the reviews.
Training was done using SVM regression.
The authors tested a variety of kernels, but found a radial basis function (RBF)
kernel to perform the best; all results are reported using this kernel.
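
For reference, this setup corresponds roughly to the following scikit-learn
sketch over toy feature vectors; the library, hyperparameters, and data here
are our assumptions, with Spearman's $\rho$ (the paper's evaluation metric,
discussed below) used for scoring:
\begin{verbatim}
from scipy.stats import spearmanr
from sklearn.svm import SVR

# Toy rows: [review length, mean tf-idf, star rating].
X_train = [[120, 0.8, 5], [30, 0.2, 2], [400, 0.9, 4], [60, 0.4, 3]]
y_train = [0.9, 0.1, 0.8, 0.4]
X_test = [[200, 0.7, 5], [25, 0.1, 1], [90, 0.5, 3]]
y_test = [0.85, 0.05, 0.5]

svr = SVR(kernel="rbf").fit(X_train, y_train)
rho, _ = spearmanr(svr.predict(X_test), y_test)
print(f"Spearman rho = {rho:.3f}")
\end{verbatim}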
To assist in feature creation, a {\em Product-Feature} set was automatically extracted.
This was done by mining references to product features from {\tt Epinions.com},
where users are allowed to describe the pros and cons of each product.
Frequent words were pruned from this list, resulting in around 10,000 unique features
for both domains.
Features were extracted over each review.
These features fell into the following categories:
\begin{enumerate}
\item {\bf Structural}: review length ({\tt LEN}); average sentence length,
number of sentences, etc. ({\tt SEN});
HTML formatting ({\tt HTM})
\item {\bf Lexical}: {\it tf-idf} statistic on each unigram ({\tt UGR});
{\it tf-idf} on each bigram ({\tt BGR})
\item {\bf Syntactic} ({\tt SYN}): features on POS information
\item {\bf Semantic}: product features ({\tt PRF});
{\it General Inquirer} sentiment words ({\tt GIW})
\item {\bf Meta Info}: star rating (average/deviation) ({\tt STR})
\end{enumerate}
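
A minimal extractor for a few of these feature classes ({\tt LEN},
{\tt SEN}, and {\tt STR}; the input format is an illustrative assumption)
might look like:
\begin{verbatim}
# Extract a handful of the feature classes above from raw text.
def extract_features(review_text, star_rating):
    sentences = [s for s in review_text.split(".") if s.strip()]
    tokens = review_text.split()
    return {
        "LEN": len(tokens),                               # review length
        "SEN_count": len(sentences),                      # sentence count
        "SEN_avg_len": len(tokens) / max(len(sentences), 1),
        "STR": star_rating,                               # star rating
    }

print(extract_features("Great player. Sound is crisp. Battery is weak.", 4))
\end{verbatim}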
Evaluation was done using the Spearman correlation coefficient.
The best results came from the three features ({\tt LEN+UGR+STR}),
resulting in a Spearman coefficient of 0.656 on {\em MP3 Players}
and 0.595 on {\em Digital Cameras}.
Adding more features tended to hurt performance: adding every feature
dropped the {\em MP3 Player} score to 0.601,
though it mildly improved the {\em Digital Camera} score to 0.604.
\subsection{A Joint Model of Text and Aspect Ratings for Sentiment Summarization \cite{2008titov-summarization}}
This paper presents an approach to summarization, jointly learning the topics to summarize
and the text that describes them.
The paper proposes a model ({\em Multi-Aspect Sentiment Model}) consisting of two parts:
an unsupervised topic model, and a classifier from words to sentiment ratings.
The paper evaluates on a hotel review dataset taken from {\tt TripAdvisor.com}.
The paper presents summarization as a two-fold task. The first part
is {\em aspect identification and mention extraction}: determining which aspects of the
reviews are relevant to describe, and which text fragments describe them.
The second is {\em sentiment classification}: determining the sentiment of the
relevant extracted text.
The paper attempts to incorporate both of these tasks into a single model which extracts text fragments
and their associated rating, given the review and a per-aspect rating.
The dataset used in the paper consists of 10,000 reviews from {\tt TripAdvisor.com}, where each
review is rated on at least the aspects {\em service}, {\em location}, and {\em rooms}.
The approach taken is to build a model, coined the Multi-Aspect Sentiment model (MAS),
which effectively combines a multi-grain LDA topic model with a series of MaxEnt
classifiers, one per topic.
In the model, each word in a document is sampled from either a local or a global topic;
the intent is that global topics capture non-sentiment phenomena
(e.g.\ {\em MP3 players} versus {\em hotels}), while local topics capture sentiment-laden
words.
The paper raises the issue that the aspects of a review often correlate strongly with each other:
a reviewer who dislikes a hotel is unlikely to rate any of its aspects highly.
To address this, the model classifies not over absolute ratings, but over the difference between
the aspect rating and the overall rating.
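
Concretely, the classifier's target for each aspect is the difference
between the aspect rating and the overall rating, as in this toy
computation:
\begin{verbatim}
# Classify over rating differences rather than absolute ratings,
# factoring out the reviewer's overall sentiment (toy 1-5 ratings).
overall = 2
aspect_ratings = {"service": 2, "location": 4, "rooms": 1}
deltas = {a: r - overall for a, r in aspect_ratings.items()}
print(deltas)  # {'service': 0, 'location': 2, 'rooms': -1}
\end{verbatim}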
Inference over the model was done using Gibbs sampling, as exact inference is intractable.
The model achieves a precision of between 75\% and 85\%, depending on the aspect.
\bibliographystyle{acl}
\bibliography{main}
\end{document}