LING 539

This page describes the structure of the task.

Fictional background

Marvin works at the last Blockbusta Videoz (a fictional video rental shop) where his task is to classify movie and TV show reviews to help curate a special section of the store.

One day while perusing his favorite website, Marvin came upon a post about someone who secretly automated their job and then quietly took off on a long (paid) vacation. Inspired, Marvin set aside a portion of his salary to hire a developer to write a few scripts to scrape movie reviews from various sources. The scraped data is a bit noisy, but aggregating many reviews has cut the time he previously spent on his job in half.

Now, after reading an article about AI, Marvin wants to take things a step further: he is searching for a program that can determine a) whether or not a piece of text is a movie/TV show review and b) whether or not each review is positive (the movie/TV show is recommended) or negative (the movie/TV show should be avoided).

Marvin has put together a competition and advertised it on Fraggle (a fictional platform for competitive data science).

Task

Classify the documents as one of the following categories:

  1. Not a (movie/TV show) review
  2. Positive (movie/TV show) review
  3. Negative (movie/TV show) review

Instructions

You are encouraged to engineer features, and you may use any classification algorithm and any supplemental data you like.
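As a starting point, the three-way classification above can be sketched with a simple bag-of-words multinomial Naive Bayes classifier. This is only a minimal illustration, not a required (or recommended) approach: the label names and the toy training documents below are invented for the example, and a real submission would train on the competition data instead.

```python
import math
from collections import Counter

# Hypothetical label names for the three categories in the task.
LABELS = ["not_review", "positive_review", "negative_review"]

def tokenize(text):
    """Crude whitespace tokenizer; real systems would do better."""
    return text.lower().split()

def train(docs):
    """Count label frequencies and per-label word frequencies.

    docs: list of (text, label) pairs.
    """
    label_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in LABELS}
    vocab = set()
    for text, label in docs:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict(text, label_counts, word_counts, vocab):
    """Return the label with the highest log-probability,
    using add-one (Laplace) smoothing for unseen words."""
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in LABELS:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, invented purely for illustration.
docs = [
    ("a wonderful heartfelt movie with a great cast", "positive_review"),
    ("i loved this film great acting and a great story", "positive_review"),
    ("a terrible boring movie i hated the acting", "negative_review"),
    ("awful film boring plot i hated it", "negative_review"),
    ("the stock market fell sharply on tuesday", "not_review"),
    ("recipe for chocolate cake with two eggs", "not_review"),
]

label_counts, word_counts, vocab = train(docs)
print(predict("a great movie i loved the cast", label_counts, word_counts, vocab))
```

In practice you would swap the toy tokenizer and counts for proper feature engineering, and could just as well use an off-the-shelf classifier; this sketch only shows the shape of the problem.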

Dataset

The data is a variation of that released by Pang and Lee (2004):

@inproceedings{pang-lee-2004-sentimental,
    title = "A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts",
    author = "Pang, Bo and Lee, Lillian",
    booktitle = "Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics ({ACL}-04)",
    month = jul,
    year = "2004",
    address = "Barcelona, Spain",
    url = "https://aclanthology.org/P04-1035",
    doi = "10.3115/1218955.1218990",
    pages = "271--278",
}

... supplemented with data aggregated and preprocessed from a variety of other sources 😈

See this page for more information.