[gquinones] approach to the class competition
Author: gquinones
— class competition

| Leaderboard score | 0.41097 |
|---|---|
| Leaderboard team name | Gabriela Quinones |
| Kaggle username | gquinones019 |
| Code Repository URL | https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-gquinones019.git |
Task summary
For this task, I took part in an author identification competition: I had to build code that would correctly identify which text samples shared the same author. The texts were divided between two CSV files. The first, train.csv, had three columns. The ID column identified the row where each text was located in the data frame; the TEXT column stored the samples of text used for training (1,601 samples in total); and the LABEL column held a binary classification, 0 or 1. Texts labeled 1 shared the same author, while those labeled 0 did not. There were 1,245 samples with label 0 and 356 with label 1. The second file, test.csv, was the file I would test my code against. It contained 899 rows and two columns: an ID column serving the same purpose as in train.csv, and a TEXT column holding the samples to evaluate the code on. There was no LABEL column, since predicting that label was the purpose of the code I would create. I have used stylometric techniques for a related author identification task before: I examined which writing habits a specific novelist used consistently across their works, such as whether the author kept the same average sentence length, how much they used first versus third person, and what the dialogue-to-text ratio was. In that case I did not use binary classification; instead, I used graphs to present the findings.
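The file layout described above can be sketched with a tiny stand-in data frame (the texts below are invented examples, not the real competition data):

```python
import pandas as pd

# Tiny stand-in for train.csv with the same schema (ID, TEXT, LABEL);
# the real file has 1,601 rows.
train = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "TEXT": [
        "The results are twice as delicate as those with solid dielectrics.",
        "He fell upon his berth. I bent over him.",
        "It must be remembered that the gases occupy all the space.",
        "Not what you seem to think, he answered gravely.",
    ],
    "LABEL": [0, 1, 0, 1],  # 1 = same author, 0 = different authors
})

# In the real training set this distribution is 1,245 zeros and 356 ones.
print(train["LABEL"].value_counts().to_dict())
```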
Exploratory data analysis
The dataset consists of text passages paired with a label indicating whether the text belongs to class 0 or class 1. The dataset used for training contains:
- Total samples: 1601
- Class 0 samples: 1245
- Class 1 samples: 356
The distribution shows a clear class imbalance, with Class 0 being significantly more frequent. This imbalance becomes an important factor when evaluating and interpreting model performance, particularly recall on the minority class. When observing the data, I noticed that some of the samples were written in the format of a research article or textbook, using several technical terms and jargon.
Ex. It must be remembered, however, that in the gaseous experiments the gases occupy all the space o, o, (fig. 104.) between the inner and the outer ball, except the small portion filled by the stem; and the results, therefore, are twice as delicate as those with solid dielectrics.
Other samples were written as narratives, including dialogue between characters and a first-person perspective.
Ex. He fell upon his berth. I bent over him. [SNIPPET] "Not what you seem to think, Dr. Goodwin," he answered at last, gravely.
If we analyze the TTR of the samples:
| class | types | tokens | TTR |
|---|---|---|---|
| 0 | 22,068 | 236,822 | 0.093184 |
| 1 | 10,694 | 67,862 | 0.157585 |
Here we see that class 0 has a TTR of 0.093: although it has a larger vocabulary in absolute terms, its words repeat more often and its texts are more uniform. Class 1 has a TTR of 0.158, meaning its texts may be shorter, with less repetition and more unique words.
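A type-token ratio of this kind can be computed with a simple word tokenizer (a minimal sketch on toy texts; the exact tokenization used for the table above is not specified in this report):

```python
import re

def ttr(texts):
    """Type-token ratio: unique word forms divided by total word tokens."""
    tokens = [tok.lower() for t in texts for tok in re.findall(r"[A-Za-z']+", t)]
    return len(set(tokens)), len(tokens), len(set(tokens)) / len(tokens)

# Toy texts; on the real data this kind of count gave TTR ~0.093 for
# class 0 and ~0.158 for class 1.
types_, tokens_, ratio = ttr(["the cat sat on the mat", "the dog sat on the mat"])
print(types_, tokens_, round(ratio, 3))  # 6 12 0.5
```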
Approach
To identify which texts were written by the same author and which were not, I used CountVectorizer to turn the sample texts into a bag-of-words representation, so the model could focus on the words in the data frame. I used only unigrams; when I tried incorporating bigrams, my accuracy score decreased, so I kept unigrams for this task. I also restricted the vocabulary to the most common terms: any word appearing in less than 20% of the documents was eliminated (min_df=.2). These features were extracted from the TEXT column of train.csv, which I stored in the variable dataset. For the classifier I used Multinomial Naive Bayes, selected because it gave me a higher accuracy score than the alternatives I tried, and I tuned its alpha smoothing parameter over a set of candidate values using GridSearchCV() to find the best one. With this pipeline in place, I trained the model and then applied it to test.csv.
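The vectorizer-plus-classifier setup above can be sketched as an sklearn pipeline. The texts, labels, and alpha grid below are invented for illustration (the report does not list the exact candidate values), and min_df is relaxed to 1 so the toy vocabulary is not emptied out:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Invented stand-in data for train.csv's TEXT and LABEL columns.
texts = [
    "the dielectric results are delicate and uniform",
    "the gases occupy the space between the balls",
    "the stem fills a small portion of the space",
    "the experiments used solid dielectrics throughout",
    "he bent over him and spoke gravely",
    "she answered at last with a quiet smile",
    "I fell upon the berth in the dark cabin",
    "you seem to think I am someone else",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([
    # Unigrams only; the report uses a document-frequency cutoff (min_df=.2),
    # relaxed here so the toy example keeps a usable vocabulary.
    ("vect", CountVectorizer(ngram_range=(1, 1), min_df=1)),
    ("clf", MultinomialNB()),
])

# Grid-search the alpha smoothing parameter (candidate values are assumptions).
grid = GridSearchCV(pipe, {"clf__alpha": [0.1, 0.5, 1.0]}, cv=2)
grid.fit(texts, labels)
print(grid.best_params_)
```

GridSearchCV refits the best pipeline on the full training data, so `grid.predict(...)` can then be applied directly to the test texts.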
Results
The accuracy score of my model was 0.778816199376947.
To analyze my results, I created a Stratified 5-Fold Cross-Validation to evaluate the robustness of the Multinomial Naive Bayes model trained with a CountVectorizer (min_df=2). Stratification preserved the original label distribution within each fold, ensuring a balanced and fair evaluation despite class differences in token counts, type-token ratio, and lexical diversity. The cross-validation accuracies across the five folds were:
[0.7788, 0.7656, 0.7781, 0.7719, 0.7906]

This produced:

- Mean accuracy: 0.7770
- Standard deviation: 0.0083

The very low standard deviation shows that the model performs consistently across the data. Comparing my accuracy score (0.778816199376947) to the five fold scores also suggests that no overfitting occurred.
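A stratified 5-fold evaluation like the one described above can be sketched as follows (toy texts and labels invented for illustration; the real run used the competition data and the tuned vectorizer settings):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Invented toy data: 5 samples per class, so each of the 5 folds
# holds one example of each label.
texts = (
    [f"technical passage about dielectrics number {i}" for i in range(5)]
    + [f"narrative passage with dialogue number {i}" for i in range(5)]
)
labels = [0] * 5 + [1] * 5

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", MultinomialNB()),
])

# Stratification preserves the label distribution within each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, texts, labels, cv=cv)
print(scores.round(4), scores.mean().round(4), scores.std().round(4))
```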
Error analysis
The classifier that I have created shows strong overall performance but consistently struggles with classifying texts as Label 1. Most mistakes were false negatives, indicating the model defaults to Label 0 when uncertain. This occurs because Label 1 exhibits higher lexical diversity and contains many low-frequency or unique terms, which Multinomial Naive Bayes does not weight heavily. Misclassified examples often included long narrative text, implicit cues instead of explicit keywords, or domain-specific vocabulary. These characteristics reduce the model’s confidence, causing underprediction of Label 1.
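The false-negative pattern described above can be surfaced with a confusion matrix. The labels and predictions below are illustrative only (not the real model output), showing the case where uncertain label-1 texts default to label 0:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels/predictions mimicking the observed behavior:
# the model predicts 0 for two of the three true label-1 texts.
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"false negatives: {fn}, false positives: {fp}")  # false negatives: 2, false positives: 0
```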
Reproducibility
The steps to reproduce my results can be found in the README.md file of my repository (URL listed in the table at the top of this post).
Future Improvements
A future improvement would be to incorporate more traditional stylometric features, such as average sentence length, average word length, and the ratio of first- to third-person usage, to name a few. I believe this could improve my accuracy score further. For this competition, I tried several different classification models: Naïve Bayes, Logistic Regression, and a Support Vector Classifier. In a future study, I would like to try a neural network, such as BERT, to see if it could raise my score.