LING 582 (FA 2025)

Class code competition

Author: jakemns


Class Competition Info
Leaderboard score: 0.54513
Leaderboard team name: Jake Mains
Kaggle username: Jake Mains
Code repository URL: https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-jakemns

Task summary

For this competition, we are given a training file and a testing file, each a .csv containing chunks of text. The training file has 1600 lines, and each chunk is labeled 1 or 0 to indicate authorship by the target author or a different author. We are tasked with training a model on this file and using it to predict authorship of the texts in the testing file. We must return the file with a prediction of 0 or 1 for each chunk of text, representing different authorship or same authorship.

In order to complete this task, we must first process the .csv files for analysis. From there, we pick our own method to analyze the texts and determine authorship. I chose a TF-IDF model. One challenge of this task was maximizing efficiency when training and applying the chosen model, to reduce processing time and resource use. Another challenge was the blind submission setup, which made error identification more difficult.

Exploratory data analysis

The training data set had 1600 lines, each with an ID, a chunk of text, and a label of 1 or 0 to identify authorship. The testing data had 899 lines, each with an ID and a chunk of text, but no label; the task was to generate that label. The submission .csv was formatted the same as the testing data, but with a label for each line predicting authorship.
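A first-pass exploration might look like the sketch below, which loads a CSV and counts the label distribution. The column names (`ID`, `TEXT`, `LABEL`) and the in-memory sample are assumptions for illustration; the real competition files may use different headers.

```python
import csv
import io
from collections import Counter

# Toy stand-in for the 1600-line training file; real column names may differ.
sample = io.StringIO(
    "ID,TEXT,LABEL\n"
    "0,first chunk,1\n"
    "1,second chunk,0\n"
    "2,third chunk,1\n"
)

# Read each row into a dict keyed by column name.
rows = list(csv.DictReader(sample))

# Count how many chunks carry each authorship label.
label_counts = Counter(r["LABEL"] for r in rows)
```

Checking the label balance this way tells you how far a class imbalance might skew a simple threshold-based classifier.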

Approach

For this project, I chose a TF-IDF model to analyze the data. I first calculated the TF-IDF values for each word in the text chunks labeled 1 in the training data, storing them in a dictionary, and repeated this for the chunks labeled 0. From there, I compared the TF values of the target author's texts to the TF values of each word in a given text of the test data. I did not create IDF values for the test texts because there was no body of texts/docs to compare them against. I computed the difference for each word across each text and summed these differences per text. I then chose a cutoff point at which the sum of the differences was too high: texts above that point were marked 0, while texts below were marked 1. I refined this cutoff with a series of submissions, using the score received to help determine the most effective cutoff point.
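The pipeline described above can be sketched roughly as follows. This is a simplified reconstruction, not the actual competition code: the tokenization (lowercased whitespace split), the absolute-difference distance, and the function names are all assumptions.

```python
import math
from collections import Counter

def tf(text):
    """Term frequency of each whitespace token in one chunk."""
    toks = text.lower().split()
    total = len(toks)
    return {w: c / total for w, c in Counter(toks).items()}

def tfidf_profile(docs):
    """Aggregate TF-IDF weight per word over a collection of chunks.

    Sketch of the write-up's per-label dictionary; the original code's
    exact weighting scheme may differ."""
    n = len(docs)
    tfs = [tf(d) for d in docs]
    df = Counter(w for t in tfs for w in t)  # document frequency per word
    profile = {}
    for t in tfs:
        for w, v in t.items():
            profile[w] = profile.get(w, 0.0) + v * math.log(n / df[w])
    return profile

def distance(profile, test_text):
    """Sum of per-word differences between the target-author profile
    and the test chunk's raw term frequencies (no IDF for test texts)."""
    return sum(abs(profile.get(w, 0.0) - v) for w, v in tf(test_text).items())

def predict(profile, test_text, cutoff):
    # At or below the cutoff -> same author (1); above -> different (0).
    return 1 if distance(profile, test_text) <= cutoff else 0
```

A usage pass would build the profile from the label-1 training chunks, then call `predict` on every test chunk with the tuned cutoff.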

Results

My best submission to the leaderboard yielded a score of 0.54513, 0.043 above the weighted random baseline. At the time of writing, my score puts me in 11th place out of 24 competitors.

Error analysis

This model was tested against the competition's held-out text chunks. Each submission yielded a score representing the percentage of text chunks with correctly identified authorship. Because we could not see which authorship predictions were correct, we needed to get creative with identifying errors in our approach. For my code, this mostly involved adjusting the threshold at which a text is given a 0 instead of a 1.
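Under blind scoring, the only feedback signal is the leaderboard score for a full set of predictions, so threshold tuning amounts to relabeling all chunks at different cutoffs and comparing the resulting scores. A minimal sketch, using invented toy distance sums rather than real data:

```python
def label_all(distances, cutoff):
    """Apply one cutoff to every chunk's distance sum: <= cutoff -> 1, else 0."""
    return [1 if d <= cutoff else 0 for d in distances]

# Toy per-chunk distance sums, not values from the actual competition.
distances = [0.2, 0.5, 0.9, 1.4]

low_cutoff = label_all(distances, 0.4)   # stricter: fewer chunks labeled 1
high_cutoff = label_all(distances, 1.0)  # looser: more chunks labeled 1
```

Each candidate labeling would be submitted separately, and the cutoff whose submission scores highest is kept.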

Reproducibility

The code for this project is found in a single Python file in the repository linked above.

Future Improvements

If given the chance to improve this further, I would explore different approaches to authorship profiling, such as a Naive Bayes classifier. I could also supplement my approach with simpler identifiers like average word length and highly used words.