LING 582 (FA 2025)

Sequence Modeling for the Class Competition

Author: sydneybess

class competition · 7 min read

Class Competition Info
Leaderboard score: 68.282 (for the model I'll be analyzing; I wasn't sure how to select this score on Kaggle because it kept defaulting to my highest)
Leaderboard team name: Sydney Bess
Kaggle username: sydneybess
Code repository URL: https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-bess-days

Task summary

See the rubric

The goal of this task is to determine whether two texts are by the same author or different authors. This task has been studied in depth in linguistics; it even has its own field, stylometry. For this Kaggle competition, we were given training and test datasets in CSV format. The train dataset, with 1,601 entries, was manually labeled: each entry (a combination of two snippets from the same or different authors) was assigned a 1 (same author) or 0 (different authors; more on this later). The test set was unlabeled and was what we used for inference.

While Stylometry can be automated (i.e., feature engineering based on sentence length, punctuation, etc.), I wanted to know whether a sequence-classification pre-trained model could make the process more efficient.

There has been some research on using LLMs to identify authors; although this is a different task than the one I'm undertaking, I think it is worth discussing (Huan & Chen, 2024). Huan and Chen specifically discuss why something like a pretrained model (which I am using) isn't the best choice. The main reason to use LLMs instead of something like a BERT transformer is that a transformer needs a lot of data -- and, as you will see in my analysis, the imbalanced data for this task is reflected in the results. They argue LLMs are a good fit because they can reason their way to linguistic conclusions, versus a pretrained model's hidden feature engineering. However, I am hesitant to accept Huan and Chen's claim based on something I tried for this assignment.

As I'll discuss later, the data is imbalanced, so I tried to generate more data by asking JanAI with this prompt:

```
I want more data for a project I'm doing. The project is to tell if two text snippets are by the same author or a different author. 0 is for different author and 1 is for same author.
Right now my train data looks like this:
ID,TEXT,LABEL
0,"When his speaker remained silent Dirrul .... [I inputted more examples, but leaving it at that]
the two texts are a single string separated by [SNIPPET]
Create a new CSV file with more training data from various authors.
```

When it spat back text, I asked it who the two authors were, because I recognized some of the quotes from literature but not others. I tried to verify the authors and texts it gave me, but to no avail. So it seems it isn't very reliable at generating factual text attributed to different authors. Can it do the reverse? Who knows? One day I'll experiment, but for now we're using an old-fashioned model.

In Huan and Chen's article, they provided a results table covering TF-IDF (which I used as my baseline), BERT (which I used for my main model), and GPT. Interestingly, their results comparing TF-IDF and BERT on the author-identification task resembled my proportions (around a 35% improvement from TF-IDF to BERT).

In my approach section, I explain why I chose sequence classification with BERT and how I handled the data.

Exploratory data analysis

See the rubric

The data consisted of an ID, a TEXT field (actually two texts by the same or different authors, joined with the string [SNIPPET]), and a LABEL: 1 if the two snippets are by the same author, 0 if not. To investigate how the data was distributed, I checked several things. First, the labels. In the full dataset:

| Dataset | Entries | Label 0 | Label 1 |
| --- | --- | --- | --- |
| Full Dataset | 1,601 | 1,245 | 356 |
| Train Dataset | 1,280 | 999 | 281 |
| Validation Set | 160 | 123 | 37 |
| Test Dataset | 161 | 123 | 38 |
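A label count like the one above is quick to compute with pandas. This is a minimal sketch: the column names match the competition CSV, but the toy frame and the helper name `label_distribution` are my own, not from the write-up.

```python
import pandas as pd

def label_distribution(df: pd.DataFrame) -> dict:
    """Return {label: (count, fraction)} for the LABEL column."""
    counts = df["LABEL"].value_counts().to_dict()
    total = len(df)
    return {label: (n, n / total) for label, n in counts.items()}

# Toy frame standing in for the competition CSV (same column names).
toy = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "TEXT": ["a [SNIPPET] b", "c [SNIPPET] d", "e [SNIPPET] f", "g [SNIPPET] h"],
    "LABEL": [0, 0, 1, 0],
})
print(label_distribution(toy))  # label 0 dominates, as in the real data
```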

The data was imbalanced: most entries were not by the same author. My theory was that the model would have lower recall because it would learn more from the different-author class than from the same-author class. Hence, I tried adding more data via AI augmentation, but the AI made things up. So next, I researched other datasets and found the Spooky Authorship Dataset on Kaggle. It has text samples from three authors, with columns for ID, text, and author. While I didn't care who the author was, I set up a random selection of two rows, combined the two text columns with [SNIPPET], and set the label to 1 if the two authors matched. I ended up not using this data because it actually made my model worse (discussed more in Future Improvements), even though it balanced the two categories. Here was the breakdown:

| Dataset | Entries | Label 0 | Label 1 |
| --- | --- | --- | --- |
| Full Dataset | 2,997 | 1,725 | 1,273 |
| Train Dataset | 2,398 | 1,381 | 1,017 |
| Test Dataset | 600 | 344 | 256 |
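The pairing procedure described above can be sketched as follows. Assumptions: the Spooky CSV has `text` and `author` columns, pairs are joined with `[SNIPPET]` as in the competition data, and `make_pairs` is my own name for the helper.

```python
import random
import pandas as pd

def make_pairs(spooky: pd.DataFrame, n_pairs: int, seed: int = 42) -> pd.DataFrame:
    """Randomly pair rows into [SNIPPET]-joined examples.

    LABEL is 1 when both snippets come from the same author, else 0.
    """
    rng = random.Random(seed)
    rows = []
    for i in range(n_pairs):
        a, b = rng.sample(range(len(spooky)), 2)  # two distinct rows
        rows.append({
            "ID": i,
            "TEXT": spooky.iloc[a]["text"] + " [SNIPPET] " + spooky.iloc[b]["text"],
            "LABEL": int(spooky.iloc[a]["author"] == spooky.iloc[b]["author"]),
        })
    return pd.DataFrame(rows)
```

Note that uniform random pairing over three authors naturally yields more different-author pairs than same-author pairs, which matches the 1,725 / 1,273 split in the table.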

While this 'balanced' the data, I also found in my research that BERT does not need much data for fine-tuning. To make the data easier to visualize, I added text1 and text2 columns, which split the single TEXT string on the [SNIPPET] separator into two texts.

| ID | TEXT | LABEL | text1 | text2 |
| --- | --- | --- | --- | --- |
| 1596 | Kinton realized to his surprise that the effor... | 0 | Kinton realized to his surprise that the effor... | Their erect posture gave them a weirdly half-h... |
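Splitting TEXT into the two columns can be done with a literal (non-regex) split on the separator. A sketch, assuming pandas >= 1.4 for the `regex=False` keyword; `split_snippets` is my own helper name.

```python
import pandas as pd

def split_snippets(df: pd.DataFrame) -> pd.DataFrame:
    """Add text1/text2 columns by splitting TEXT on the first [SNIPPET]."""
    parts = df["TEXT"].str.split("[SNIPPET]", n=1, regex=False, expand=True)
    out = df.copy()
    out["text1"] = parts[0].str.strip()
    out["text2"] = parts[1].str.strip()
    return out
```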

Approach

See the rubric

My approach was to fine-tune a BERT-based sequence classifier on the paired texts.

  1. Performed exploratory analysis and added two new columns, one for each text snippet. This is also where I attempted to supplement with more data.
  2. Split the dataframe into train, validation, and test sets and loaded them into a Dataset for easier manipulation.
  3. Set up the tokenizer. I used bert-base-uncased, as suggested by the Text classifier guide, though I didn't choose a sequence-classification model just because that post told me to. As I've learned in classes, sequence models are great for NLP because they learn features from a sequence of words by transforming it into a sequence of numerical representations. With these features, the model can build dependencies, which is useful for this task because you want to learn patterns and similarities among authors.
  • For this, I mapped the dataset and made some alterations so the model could use it, such as renaming 'LABEL' and dropping the raw text columns, keeping the IDs, input_ids, token_type_ids, attention_mask, and labels.
  • Next, I put it in the Torch format, based on this example: Dealing with the tokenizer.
  4. Set up a data_collator with dynamic padding to handle padding efficiently.
  5. Created a metric to measure the model's success using accuracy, a function taken from Hugging Face, along with other functions from class lectures on evaluating binary classification models.
  6. Set up the training arguments as suggested by Hugging Face. My initial run used 3 epochs, but I thought running longer would be better -- that was not the case. I was trying to be novel and try new things, but Hugging Face knows what it is talking about. At least I learned what didn't work as well (all my experiments helped me learn more about my model, and scored relatively close to each other, i.e., 65%-67%, compared to the top model at 70%). In the end, I went with 5 epochs.
  7. Ran the model on withheld data to check accuracy there.
  8. Applied the model to the unlabeled test CSV file.
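The evaluation metric from step 5 can be sketched as a `compute_metrics` function in the shape the Hugging Face Trainer expects, a `(logits, labels)` pair. This version uses scikit-learn rather than the exact functions from class, so treat it as an equivalent sketch rather than the code I ran.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """eval_pred is a (logits, labels) pair, as the HF Trainer passes it."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # class with the highest logit
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", zero_division=0
    )
    return {"accuracy": accuracy_score(labels, preds),
            "precision": precision, "recall": recall, "f1": f1}
```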

Results

See the rubric

| Metric | Baseline | New Model | Delta (New − Base) |
| --- | --- | --- | --- |
| Accuracy | 0.781 | 0.844 | +0.064 |
| Precision | 0.500 | 0.875 | +0.375 |
| Recall | 0.085 | 0.378 | +0.293 |
| F1 Score | 0.146 | 0.520 | +0.374 |
| AUC ROC | | 0.727 | |
| Average Precision (AP) | | 0.600 | |

First, I set up a baseline that turned each text into a TF-IDF vector and predicted the label from a similarity score and a threshold. For this, I used the same training data later used for the model.
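A minimal version of such a baseline might look like this. The threshold of 0.3 and the function name are illustrative, not the values from my actual runs.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_baseline(text1, text2, threshold=0.3):
    """Predict 1 (same author) when the cosine similarity of the two
    snippets' TF-IDF vectors exceeds the threshold."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(list(text1) + list(text2))
    n = len(text1)
    a, b = X[:n], X[n:]
    # TfidfVectorizer L2-normalises rows by default, so the row-wise
    # dot product is exactly the cosine similarity.
    sims = np.asarray(a.multiply(b).sum(axis=1)).ravel()
    return (sims > threshold).astype(int)
```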

While the baseline's accuracy is relatively high compared to the sequence model's, it mislabels the positive (same-author) class as negative (different-author) with high frequency. More specifically, among all the gold class-1 labels, it gets only 8% correct. This is a major issue. The reason accuracy is still high is that the data is imbalanced toward negative examples.

For my model, I saved the best accuracy at epoch 3, though in retrospect I should have tested with the model that achieved the best precision and recall.

Looking at the 'New Model' column, there is a high accuracy, which can be misleading because, as I predicted in the beginning, the data heavily favours the negative class.

The high precision means the predicted positives are almost always truly positive. On the flip side, recall is very low: the model catches only 38% of the actual positives. In other words, of the positives it predicts, a high percentage are correct, but it only flags a positive in about a third of the cases where one exists. The F1 score better reflects this imbalance.

That is why I also calculated the area under the ROC curve and, especially, the AP. The ROC AUC says the model does an average job separating positives and negatives. But because the data is imbalanced, I wanted to focus on AUC-PR, which measures how well positive cases are ranked above negative ones. The 60% AP means it is better than random, which is good; it suggests the reason recall is so low is not that the model can't identify positives, but that the decision threshold is too high.
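If the threshold really is the bottleneck, one option is to tune it on held-out data instead of taking the argmax over the logits. A sketch using scikit-learn's `precision_recall_curve`; `best_f1_threshold` is my own helper name, not part of the pipeline described above.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_score):
    """Pick the score threshold that maximises F1 on held-out data."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    # precision/recall carry one extra final point (P=1, R=0) with no
    # corresponding threshold, so drop it before taking the argmax.
    return thresholds[np.argmax(f1[:-1])]
```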

All that said, the delta shows about a 35-point average improvement across precision, recall, and F1, which is good. This shows that using a transformer-based model improves performance compared to a standard vectorisation baseline. However, it follows the same proportional trend of high precision and low recall (though to a greater extent). This suggests that, even though the model improves matters, the data imbalance remains an issue (though again, when I tried to add data, it did worse).

Error analysis

See the rubric

| Metric | Value |
| --- | --- |
| Eval Accuracy | 0.8509 |
| Eval Precision | 0.7917 |
| Eval Recall | 0.5000 |
| Eval F1 | 0.6129 |
| Eval AUC ROC | 0.7668 |
| Eval AP | 0.6793 |

This trend of high accuracy, high precision, and low recall continues when evaluating the model on test data (separate from the validation and training data). The F1 score is higher because precision and recall are more balanced. This suggests the model did a better job of identifying true positives rather than categorizing them as negative (though 50% recall is still low). This remains the main type of error. The higher, but similar, accuracy supports the idea that the datasets were relatively similar (as shown in the EDA).

Reproducibility

See the rubric

README.md - I covered reproducibility there, including the seeds I set for the files and how to run the model.

Future Improvements

I'd like to look at the specifics of the errors, such as whether the text pairs the model gets wrong or right share particular features, like including dialogue, flowery (i.e., distinctive) language, or plain description.

The second thing I'd do is balance the data by augmenting it. When I added more data from the Kaggle dataset I mentioned, I think I added too much, so the training set no longer represented what the test data would look like. That submission scored 65% on Kaggle (at the time, I wasn't checking precision, recall, or F1; all I knew was that it had an accuracy of around 75%, below both the baseline and my model, and 65% on Kaggle). I would go back and redo the data addition more carefully, because I never pursued it after it got my lowest results. If that still didn't work (or even alongside it), I'd use class weighting, a machine-learning technique that gives more weight to the minority class (label 1).
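The class-weighting idea can be sketched by computing "balanced" weights from the training-split counts in the EDA table (999 vs. 281). In a Hugging Face setup these weights would typically be passed to a weighted torch.nn.CrossEntropyLoss inside an overridden Trainer.compute_loss, which I omit here; this is a sketch of the weight computation only.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Counts from my train split: 999 different-author (0), 281 same-author (1).
y_train = np.array([0] * 999 + [1] * 281)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_train)
# "balanced" uses n_samples / (n_classes * count_c), so the minority
# class (label 1) gets roughly 3.6x the weight of label 0.
print(dict(zip([0, 1], weights)))
```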