RNN-Based Sentiment Analysis
Author: aaronradlicki (course project)

| Code Repository URL | https://github.com/uazhlt-ms-program/ling-582-fall-2025-course-project-code-feelingsentimental |
|---|---|
| Demo URL (optional) | |
| Team name | FeelingSentimental |
Mid-Semester Proposal
For my course project, I am planning to train an RNN-based PyTorch project to evaluate the valence of a set of Tweets using the popular Sentiment140 dataset.
There are several reasons why I am interested in this type of project:
- For LING 539, we did a project comparing movie reviews, and I took a strictly non-neural-network approach. I would like to try a neural approach on a similar task this time, and this dataset is the perfect opportunity.
- As someone who wants to work with under-resourced language groups, where datasets are very small, I plan to experiment with how small a training set I can use while still achieving relatively solid results. (This will require multiple runs and a program that can variably slice the training data.)
- Given my research paper analysis's discussion of the complexities of code-mixing, slang usage, and dialectal differences, I think it would be intriguing to work with this kind of dataset and see how those factors affect the process.
As a result, my official proposal is as follows: I plan to design a configurable, retrainable RNN-based PyTorch project that makes predictions on the Sentiment140 dataset. I will make at least five sets of predictions, with the goal of using the smallest amount of training data possible while maintaining a relatively high F1 score. Cursory analysis will be provided for the initial runs, with in-depth analysis and replication instructions for the run that best balances a high F1 score with a low amount of training data. This project will be downloadable from the course GitHub repository (see above), with instructions provided below, by the end of the semester. Through this, I will explore how dataset scope relates to successful NLP analysis.
Dates:
- Dataset downloaded and scoped out by 11/23
- Working code for dataset analysis by 11/30
- Optimal configuration found by 12/7
- Writeup completed by course deadline
--- Final Writeup Below ---
Project description
The purpose of this project was to evaluate the effectiveness of a GRU trained on an extremely limited training set. To support that, I wrote the program so that the amount of training data can be configured by the user via a variable defined at the top of the file. After setting it, users can run the program and see the resulting accuracy (measured between 0.0 and 1.0) printed to the terminal. Unlike in my proposal, I determined that accuracy was the more important metric for analysis, as I will explain below.
For this analysis, I decided to use the Sentiment140 dataset, as it is an easily available and well-documented dataset that I knew could work well for a configurable project like this. I could also easily convert the data to a simple binary output, which makes the program simple to train and analyze and allows it to be run on relatively modest machines (even CPU-bound ones with enough patience, as was my case while initially working on the project). This relatively speedy and configurable codebase also allows users (and myself) to experiment further and dive deeper into the topic going forward.
As for the code itself, it uses Keras to construct a GRU-based system sandwiched between an embedding layer and a dense layer in order to achieve the necessary outputs. All necessary components of the program are contained within the SentimentAnalysisMaster.py file, with all required packages listed in requirements.txt.
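The embedding/GRU/dense stack described above is simple enough to sketch. Here is a minimal, hypothetical version of that architecture; the vocabulary size, sequence length, layer widths, and dropout rate are illustrative, not the exact values in SentimentAnalysisMaster.py:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, GRU, Dense

VOCAB_SIZE = 10_000  # illustrative vocabulary size
MAX_LEN = 50         # illustrative padded tweet length

model = Sequential([
    Input(shape=(MAX_LEN,)),                         # padded sequences of token IDs
    Embedding(input_dim=VOCAB_SIZE, output_dim=64),  # token IDs -> dense vectors
    GRU(64, dropout=0.2),                            # dropout inside the GRU, not as a separate layer
    Dense(1, activation="sigmoid"),                  # single unit for the binary positive/negative output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

The sigmoid output maps naturally onto the binary labels produced by collapsing Sentiment140's 0/4 polarity scale.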
What I think makes the project unique is connected to its inspiration: I want to work with under-resourced language groups in Indonesia, so I wanted to see the limitations of working with extremely limited datasets. Thus, the code you have access to now has a single, precise configurable variable, training_datapoints, which you can use to set the exact number of datapoints fed into the GRU. Whether you want to give it 1,000,000 or only 10, that is possible! And since the results change based on how many datapoints you provide, the program prints its accuracy assessment to the terminal at the very end, so you can see how much the limitation affected it.
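The slicing mechanism itself only takes a few lines. Here is a hedged sketch of how a training_datapoints-style variable can cut down a dataset before training; a tiny in-memory stand-in is used instead of the real CSV, and the column names are illustrative:

```python
import pandas as pd

# --- configuration section (illustrative) ---
training_datapoints = 3  # number of rows fed to the model

# In the real script, this frame would come from reading
# training.1600000.processed.noemoticon.csv, where column 0 is the
# 0/4 polarity label and column 5 is the tweet text.
df = pd.DataFrame({
    "polarity": [0, 4, 0, 4, 4],
    "text": ["so sad today", "love this!", "worst day ever", "great vibes", "happy :)"],
})

# Shuffle with a fixed seed so every run sees the same subset, then slice
subset = df.sample(frac=1, random_state=42).head(training_datapoints)

# Collapse the 0/4 polarity scale to a simple binary label
labels = (subset["polarity"] == 4).astype(int)
print(len(subset))  # → 3
```

Shuffling before slicing matters: Sentiment140 stores all negative tweets before all positive ones, so taking the first N rows without shuffling would produce a single-class training set.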
I think it's pretty neat. :)
How to Run
For this program, I originally intended to provide an easily accessible Docker container. However, given that I have struggled more than most with these kinds of tools, the current version of the program needs its dependencies installed manually and must be run from the terminal.
As of 12/4/2025, all components used were their latest versions. Installing the latest version of each component (numpy, scikit-learn, etc.) should not cause issues, but if you do run into problems, make sure to install the exact versions listed in requirements.txt.
In order to make sure that you are able to install each component, I will provide an example of installation utilizing pip. Please note that this set of directions presumes a valid installation of Python and a Linux-based operating system.
- If you do not have `pip` installed, open your Terminal and type the following commands:

```shell
sudo apt update
sudo apt install python3-pip
```

- Once pip is installed, you can begin installing each component from `requirements.txt`. For example, here is the command for the first item listed, `pandas==2.3.3`. You can ignore the version number at the end and enter the following into your terminal:

```shell
pip install pandas
```

  Repeat this process for each item listed in `requirements.txt`.

- Once all dependencies are installed, download the Sentiment140 dataset and place it in the same folder as the `SentimentAnalysisMaster.py` script. You can download it from here. Ensure that once you have downloaded it, it has the exact name `training.1600000.processed.noemoticon.csv`.

- It is now time to run the script! Navigate to the directory containing the script and the Sentiment140 dataset using `cd path/to/directory`, then run the script with the version of Python installed on your system; for example, `python3 SentimentAnalysisMaster.py`. Note that depending on your system, the process may take quite some time. (The smaller the `training_datapoints` variable in the configuration section, the faster it will run.)
Challenges Faced
From a personal standpoint, this was all new to me. I have never attempted anything close to this level of computer science. Even understanding the right number of epochs and GRU layers was a challenge (and clearly still has a lot of room to go, given the lack of differentiation you can see in the results document). I faced immense difficulties configuring and even getting my code to run. While I am sure there are many students whose challenges were much more dramatic (i.e., getting the right chain of algorithms and processors chained together in the optimal way), I am going to cherish my legendary accomplishments, such as:
- Figuring out how to properly convert dataframes from one format to another
- How to transfer the labels column properly through all the necessary steps of the process (without accidentally deleting it...)
- Successfully achieving dropout integration within GRU layers and not as a separate layer
Throughout every step of the process, I was faced with a new challenge that required me to learn something new. While at times it felt like a deluge, I not only learned new things, but I also feel that I learned how to learn more effectively. For a long time, I have been pretty averse to using "AI" like ChatGPT for bug hunting, but thanks to this course, I have seen that it is immensely helpful for parsing tracebacks that would otherwise overwhelm me. I am thankful to finally understand how great a tool it can be in my tool belt!
I also feel like I have developed a lot in my ability to research a topic online when I am very unfamiliar with it. For whatever reason, I have really struggled to grasp this course material - and when I searched for answers to problems I faced they were usually explained with terminology that was above my head, links to extensive documentation that I could not understand, or something else indecipherable to me at this stage. Yet I learned how to effectively sift through internet results and find beginner tutorials, leapfrog from those tutorials to lessons on my issues, and oftentimes even ended up finding some really helpful YouTube tutorials.
As a result, the challenges that I faced on this project helped me to grow as a student. I did not merely face project-specific challenges like getting aspects of my code to run (or run in a relatively timely manner), but faced challenges that caused me to achieve some solid personal development!
Summary of individual contributions
| Team member | Role/contributions |
|---|---|
| Aaron Radlicki | Program Writer/Project Summary Author |
| The Internet | Helpful Resource/Stunt Double #4 |
| ChatGPT 5.1 | Traceback Untangler (see ai_citation.md) |
Results
While I was really excited to explore the differences between training with wildly differing amounts of training data, the actual accuracy did not vary nearly as much as I expected among the tests run with lower amounts of data. More data did help, and there were measurable jumps in accuracy from 1,000 to 10,000 to 100,000, for example, but those jumps were each less than a hundredth of a percent. Since I wanted to see whether any meaningful information could be deduced from truly low numbers, the reality that the low-input runs were all barely above the 50% baseline was frustrating.
I decided to compare all of these low-data runs to a "control" of sorts: 1,000,000 training inputs out of the 1.6 million, a much more typical ratio. This is the only place where performance finally became reasonable, at around 82%. The jump from 100,000 to 1,000,000, unlike all prior jumps, was massive. While still not perfect, this revealed that the current implementation of my GRU model depends on massive amounts of data to make its predictions, and it highlighted the limitations of using these standard models in under-resourced data environments.
That being said, I think there is plenty of room for improvement if I fine-tune my model. I have read about many exciting innovations (see my research essay summary for notes on that) that can help under-resourced groups. While those are not silver-bullet solutions, it is exciting to know that work is being done in that field.
As for more traditional methods of training models, it is simply not easy to achieve good results on extremely low amounts of data. While this was expected, the reality is more drastic and disappointing than I hoped it would be.
Please see Results_Summary.md in the project directory for the exact outputs of each result.
Error analysis
The model made incorrect predictions in many, many areas. I was hoping to point out specific, notable recurring errors, as in my previous courses, but in this case the model barely beats the baseline for all the lower-data runs. I do, however, have some hypotheses.
First, I think I need to do more cleanup prior to tokenizing the data. When I printed out example tweets, many contained web links and the like, which I imagine could have confused the system. On a website like Twitter, where people constantly share links, that could be a major factor. Twitter handles (like @ExampleTwitterHandle) could be a factor too, among other things.
Second, since the model is barely above the baseline, it needs a tighter fit. Going forward, I would like to design a version of this program that can run much more intensive training (with a much more user-friendly and configurable level of customization). See Future Improvements for more on that.
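To test that hypothesis, a light preprocessing pass could strip links and handles before tokenization. Here is a hedged sketch; the regexes are illustrative, not something the current script already does:

```python
import re

def clean_tweet(text: str) -> str:
    """Remove URLs and @handles that may confuse the model."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # strip web links
    text = re.sub(r"@\w+", "", text)                   # strip Twitter handles
    return re.sub(r"\s+", " ", text).strip()           # collapse leftover whitespace

print(clean_tweet("loving it! http://example.com via @ExampleTwitterHandle"))
# → "loving it! via"
```

A pass like this would run over the text column after slicing and before tokenization, so the tokenizer's vocabulary is not cluttered with one-off URLs and usernames.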
Finally, when I was pulling samples, I noticed that my decision to include Tweets that were classified as "a little positive" (3) or "a little negative" (1) on the original scale of 0-4 could definitely sour the training pool. These were sometimes confusing to the program, although it also struggled with some that seemed like they should be relatively obvious too. (Shoutouts to the Tweet praising Battle Creek, Michigan for being awesome! I think that is a positive Tweet no matter what a random iteration of this program might say!)
Of course, looking through the "control" results, where the model was given 1,000,000 Tweets to train on, it still was not perfect there either. Oddly enough, I think similar improvements to the ones I suggested could help there too, although brute-forcing things with a larger amount of training data made the control run do drastically better.
Reproducibility
This code contains a built-in random seed, so as long as you choose the same number of training datapoints, you should get the exact same results. Anyone looking to replicate these results should not need to modify anything outside the configuration section at the beginning. Note, however, that changing the code itself, such as the configuration of layers in the model, will alter the results.
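For reference, seeding everything at once in a Keras project usually looks something like the following; the seed value and the exact calls in SentimentAnalysisMaster.py may differ:

```python
import os
import random

import numpy as np
import tensorflow as tf

SEED = 42  # illustrative value

# Only affects string hashing if set before the interpreter starts
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)        # Python's own RNG
np.random.seed(SEED)     # NumPy (also used by pandas sampling)
tf.random.set_seed(SEED) # TensorFlow/Keras weight init and dropout
```

Even with all seeds pinned, small numerical differences can creep in across hardware and library versions, so "exact same results" is best interpreted as "same results on the same setup".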
For instructions on how to install the necessary dependencies and run the code, see above.
For a quick and simple reproducible example, I would recommend running the code with just 1,000 or 10,000 training datapoints.
Future improvements
For future improvements to the project, I would like to make the other variables within the program accessible to the user from the configuration section at the top. Why shouldn't users be able to choose the number of GRU units, the number of training epochs, and so on? I think that would allow for a very solid amount of flexibility, while also letting me (and others) further explore the challenges that limited datasets cause.
I would also like the program to not only print its results to the terminal, but also save them automatically. All of the training data in Results_Summary.md was manually typed by me, and it would be amazing if the program could save it automatically instead!
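One way to do that is to gather every tunable into the existing configuration section at the top of the file; the names and default values below are hypothetical:

```python
# --- configuration section (hypothetical expanded version) ---
training_datapoints = 10_000  # rows fed to the model (already configurable today)
gru_units = 64                # width of the GRU layer
epochs = 5                    # number of training epochs
batch_size = 128              # minibatch size
dropout_rate = 0.2            # dropout applied inside the GRU layer

# Collecting the values in one dict makes it easy to log them alongside results
config = {
    "training_datapoints": training_datapoints,
    "gru_units": gru_units,
    "epochs": epochs,
    "batch_size": batch_size,
    "dropout_rate": dropout_rate,
}
print(config["gru_units"])  # → 64
```

Keeping the settings in a single dict also pairs naturally with saving each run's configuration next to its accuracy.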
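Appending each run's settings and score to a file would only take a few lines. Here is a hedged sketch; the file name and fields are hypothetical:

```python
import csv
from pathlib import Path

def save_result(training_datapoints: int, accuracy: float,
                path: str = "results_log.csv") -> None:
    """Append one run's configuration and accuracy to a CSV log."""
    log = Path(path)
    is_new = not log.exists()
    with log.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            # Write the header row only when creating the file
            writer.writerow(["training_datapoints", "accuracy"])
        writer.writerow([training_datapoints, f"{accuracy:.4f}"])

save_result(1000, 0.5123)
```

Calling this at the end of each run would build up a results file automatically, replacing the hand-typed entries in Results_Summary.md over time.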