Did they write it or did they not?
Author: mgatto
— class competition — 4 min read

For the initial leaderboard submission, I used unigram and bigram features with logistic regression. I used spaCy for tokenizing, opting not to rely on Scikit-Learn's built-in tokenizer. At first I stripped punctuation and stop words. In hindsight, though, for authorship attribution the patterns of punctuation may themselves be a useful feature, as may the frequency of stop-word usage.
They performed rather poorly, coming in under the weighted baseline.
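That initial pipeline might look something like the sketch below. This is an illustration, not my submitted code: the toy tokenizer here stands in for spaCy's, and the toy texts and labels are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stand-in tokenizer; the real pipeline used spaCy's tokenizer here.
def tokenize(text):
    return text.lower().split()

# Unigram + bigram counts feeding a logistic-regression classifier.
model = Pipeline([
    ("vec", CountVectorizer(tokenizer=tokenize, ngram_range=(1, 2),
                            token_pattern=None)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Toy data only; the real training set has ~1600 labeled snippets.
texts = ["the spice must flow", "captain's log stardate",
         "the sleeper has awakened", "engage warp drive"]
labels = [1, 0, 1, 0]
model.fit(texts, labels)
preds = model.predict(texts)
```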
I believe that word order and sentence structure are in fact important for author attribution, as the experiment above implies. Thus, I hypothesized that an RNN with a long memory would perform better. And because I believed long memories matter here, CNNs seemed inappropriate for this particular classification task.
| Leaderboard score | 0.50677 |
|---|---|
| Leaderboard team name | MichaelOmarGatto |
| Kaggle username | MichaelOmarGatto |
| Code Repository URL | https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-mgatto |
Task summary
I was tasked with evaluating snippets of authored text to determine whether each was written by the single, unknown author or not. In effect, we asked: are these texts so similar to one another that they must therefore come from the same author?
The training set is a few thousand snippets of text, provided with a label column and nothing else.
Exploratory data analysis
Clearly, some of the texts feel different from others. The proper names in one set really made it feel like the snippets came from Dune. If you've read it, you know what I mean: the faux-Middle-Eastern-sounding names really stood out to me. Character dialog was copious. Other snippets read as if lifted straight from a starship's logbook (was I right?). My poor eyes could not discern much else.
Good thing we have Python! Right away it was easy to check whether the training data had roughly equal numbers of author and non-author samples. It does not; there is a class imbalance:
| Class 0 | Class 1 |
|---|---|
| 1245 | 356 |
| 77.76% | 22.24% |
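The imbalance check is a few lines of Python. The label list below is a stand-in built from the counts in the table, not the actual data loading:

```python
from collections import Counter

# Stand-in for the training data's label column (counts from the table above).
labels = [0] * 1245 + [1] * 356

counts = Counter(labels)
total = sum(counts.values())
for cls in sorted(counts):
    share = 100 * counts[cls] / total
    print(f"Class {cls}: {counts[cls]} ({share:.2f}%)")
```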
Some snippets contained the tag "[SNIPPET]"; I promptly erased it from the vocabulary. Frankly, I was rather unsure of its significance.
Approach
I first tried something similar to, but different from, the Arabic Roots project: also a character-level model, but I discarded the LSTM and built it as a CNN for binary classification. I assumed that learned embeddings, as in the roots project, could be effective here, so despite hints I did no feature engineering at first. But I did spice it up with some attention and over-engineered the heck out of it by pairing it with a GRU, which resulted in abysmal performance.
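For the curious, a character-level CNN classifier of this flavor can be sketched in a few lines of PyTorch. The layer sizes below are illustrative assumptions, not my actual configuration, and the attention/GRU extras are omitted:

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Minimal character-level CNN for binary classification (sketch only;
    sizes are illustrative, not the submitted configuration)."""
    def __init__(self, vocab_size=128, emb_dim=32, n_filters=64, kernel=5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)       # learned char embeddings
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel)  # 1-D conv over characters
        self.pool = nn.AdaptiveMaxPool1d(1)                # max-over-time pooling
        self.fc = nn.Linear(n_filters, 1)                  # single logit out

    def forward(self, x):                 # x: (batch, seq_len) of char ids
        e = self.emb(x).transpose(1, 2)   # -> (batch, emb_dim, seq_len)
        h = torch.relu(self.conv(e))
        return self.fc(self.pool(h).squeeze(-1))  # (batch, 1) logits

# One forward pass on dummy character ids.
x = torch.randint(0, 128, (4, 200))
logits = CharCNN()(x)
```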
I then regrouped by scrapping it and returning to my first love, the LSTM, promoting it to a word-level model. This performed much better, but still not above baseline. Then I removed attention, seeking simplicity in life, but I stood at the ice wall of the baseline at 0.50251, unable to overcome it.
Eventually my approach devolved into constant hyperparameter tuning and trying numerous combinations of models, with and without attention, and of n-gram categories, to see which one best stuck to the wall. I reasoned that if all my model attempts overtrained so similarly, with similar F1 scores, then it was probably the data's fault: each snippet is just too short. So I concatenated class-1 snippets two at a time, which yielded wonderful figures like "95%" on the validation data, but the validation data had been concatenated too... You see where this is going? I submitted anyway, and the score was horrible. Only then did I realize it was because the test data could not be concatenated, and the model could not generalize from long, concatenated training snippets to short test snippets. But that training graph was heartwarming for a while.
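The concatenation trick itself is trivial, which is part of why the trap was so easy to fall into. A sketch with made-up snippets:

```python
# Pair up class-1 snippets two at a time, as described above. This inflates
# validation scores only because the validation snippets get concatenated the
# same way; the real test snippets stay short, so the gain is illusory.
def concat_pairs(snippets):
    return [a + " " + b for a, b in zip(snippets[0::2], snippets[1::2])]

short = ["alpha one", "beta two", "gamma three", "delta four"]
longer = concat_pairs(short)  # -> ["alpha one beta two", "gamma three delta four"]
```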
I then reasoned that, given all these models I had trained, why not join them together in an ensemble? This was the only thing that at least got me to the baseline. To pass it, I then saw feature engineering as my only hope.
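The ensemble idea boils down to soft voting: average each model's predicted probability of class 1, then threshold. A sketch of the idea (not my exact code), with hypothetical per-model probabilities:

```python
# Soft-voting ensemble: average class-1 probabilities across models, threshold.
def ensemble_predict(prob_lists, threshold=0.5):
    n_models = len(prob_lists)
    avg = [sum(ps) / n_models for ps in zip(*prob_lists)]
    return [1 if p >= threshold else 0 for p in avg]

# Three hypothetical models' class-1 probabilities for four snippets.
probs = [
    [0.9, 0.2, 0.6, 0.4],
    [0.8, 0.1, 0.4, 0.5],
    [0.7, 0.3, 0.6, 0.3],
]
preds = ensemble_predict(probs)  # -> [1, 0, 1, 0]
```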
Feature Engineering
My intuition said that I should include words which occur only once, since the data exploration showed them to be rather distinctive per author. Strangely, this did not help at all. Excluding stop words and requiring a minimum frequency of two occurrences in the training data noticeably improved F1, which left me puzzled as to why my hypothesis seemed incorrect.
I also tried part-of-speech tagging features and some stylometric features which I thought would be very impactful; they were not. What happened? I suspect I may have encoded the features incorrectly.
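For flavor, here is the kind of hand-rolled stylometric feature I mean. This is illustrative only; my actual feature set and encoding differed (and, apparently, erred), and the tiny stop-word set here is a stand-in:

```python
import string

# A few stylometric features of the kind attempted above (sketch only).
def stylometric_features(text):
    words = text.split()
    n_words = max(len(words), 1)
    stop = {"the", "a", "of", "and", "to"}  # stand-in stop-word list
    return {
        # mean word length, punctuation stripped
        "avg_word_len": sum(len(w.strip(string.punctuation)) for w in words) / n_words,
        # share of characters that are punctuation
        "punct_rate": sum(c in string.punctuation for c in text) / max(len(text), 1),
        # share of tokens that are stop words
        "stopword_rate": sum(w.lower() in stop for w in words) / n_words,
    }

feats = stylometric_features("The spice, of course, must flow.")
```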
Results
No matter which combination of LSTM, CNN, and GRU I used, with or without attention, I was not able to control overfitting. Curiously, the maximum F1 for most combinations seemed to hover in the mid-30% range.
Early attempts (with submission scores):
- "Bidirectional GRU RNN with attention" = 0.262
- "LSTM with attention, word level not char-level like the one before" = 0.467
Error analysis
Because of the class imbalance, many of my runs favored predicting class 0. This was difficult to overcome despite considerable coding effort. Class weights were calculated: class 0 = 0.6432, class 1 = 2.2456. Despite these, and a positive-class weight applied in the loss function (3.4912, sometimes tuned, unsuccessfully), I did not get above the baseline. The git repo history tells the full, tangled story.
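Those weights follow the usual "balanced" formula, n / (n_classes * count), with the positive-class weight being the negative-to-positive ratio. Computed from the full-dataset counts above, the values land slightly off the figures I quoted, presumably because mine were computed on the training split after the validation holdout:

```python
# Balanced class weights (sklearn-style: n / (n_classes * count)) and the
# positive-class weight for a BCE-style loss, from the full-dataset counts.
n0, n1 = 1245, 356
n = n0 + n1

w0 = n / (2 * n0)     # weight for class 0, ~0.643
w1 = n / (2 * n1)     # weight for class 1, ~2.249
pos_weight = n0 / n1  # ratio handed to the loss, ~3.497
```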
Validation data was separated out, but I unaccountably did not debug incorrect classifications in that set. Which brings us to the leaderboard!
Leaderboard Result
My best results were with the ensemble models, resulting in my two highest scores:
- "Second attempt at an ensemble model." = 0.49579
- "Truely ensemble." = 0.50677 which passes the baseline by a hair, a hair I tell you! :-)
Reproducibility
Please see the README.md file in the code repository.
Future Improvements
I honestly do not know. The failure of my attempted feature engineering was my last hope. I have much more learning to do and will remember this project fondly one day after I arrive at my AHA! moment.
I intend to study feature engineering in great detail over the winter break just to see where I went wrong and how it could be corrected.