Detecting Shared Authorship with Classical Stylometry
Author: pamelaangulo164
— class competition, authorship identification, stylometry — 3 min read

| Field | Value |
|---|---|
| Leaderboard score (macro F1) | 0.46238 |
| Leaderboard team name | Pamela Angulo Martinez (individual) |
| Kaggle username | pamelaangulomartinez |
| Code repository URL | https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-pamelaangulo164 |
Task summary
This shared task is a binary classification problem in author profiling / digital stylometry.
Each data point consists of two text spans in English, concatenated with the special delimiter [SNIPPET]. The goal is to predict whether both spans were written by the same author (LABEL = 1) or by different authors (LABEL = 0).
I treat the shared task as a binary text classification problem over the full concatenated TEXT field. Each example is represented using character 3–5-gram TF–IDF features, and I train a linear Support Vector Machine (LinearSVC) classifier on these features. Character n-grams are intended to capture orthographic and stylistic patterns that are more stable across topics and help deal with the relatively high OOV rate in the test set.
My full pipeline is implemented in the repository:
- Repo name: ling-582-fall-2025-class-competition-code-pamelaangulo164
- URL: https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-pamelaangulo164
Exploratory data analysis
Dataset size and label balance
The training set has 1,601 examples and the test set has 899 examples.
Label distribution in the training data:
- Label 0 (not same author): 1,245 examples (77.76 percent)
- Label 1 (same author): 356 examples (22.24 percent)
This is a moderately imbalanced dataset skewed toward label 0, which explains why a naive majority-class baseline performs reasonably well and why macro F1 is a more informative metric than accuracy alone.
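The macro F1 of that majority-class baseline can be computed directly from the label counts above (a plain-Python sketch; class 1 is never predicted, so its F1 is 0 by convention):

```python
# Macro F1 for a baseline that always predicts label 0,
# using the training label counts reported above.
n0, n1 = 1245, 356          # label counts in train.csv
total = n0 + n1

# Class 0: precision = n0/total, recall = 1.0
p0, r0 = n0 / total, 1.0
f1_0 = 2 * p0 * r0 / (p0 + r0)

# Class 1: never predicted, so precision, recall, and F1 are all 0
f1_1 = 0.0

macro_f1 = (f1_0 + f1_1) / 2
print(f"majority-baseline macro F1 = {macro_f1:.3f}")  # ≈ 0.437
```

Any useful model therefore needs a macro F1 comfortably above roughly 0.44.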
Span length
Train set
- Characters: mean = 1,047.69, standard deviation = 546.45, minimum = 164, maximum = 5,491
- Words: mean = 186.26, standard deviation = 97.16, minimum = 25, maximum = 1,036
Test set
- Characters: mean = 1,001.44, standard deviation = 545.32, minimum = 204, maximum = 4,321
- Words: mean = 178.87, standard deviation = 97.42, minimum = 34, maximum = 773
The length distributions in train and test are very similar, with a wide range from a few dozen words to more than a thousand. Models therefore need to handle both short and long text pairs.
The [SNIPPET] delimiter appears exactly once per example in both splits (minimum = 1, maximum = 1, mean = 1.0).
Interestingly, average lengths by label in the training set are nearly identical:
- Label 0: mean characters = 1,047.71; mean words = 186.10
- Label 1: mean characters = 1,047.62; mean words = 186.80
Vocabulary and diversity
- Train vocabulary size: 41,269 unique tokens
- Test vocabulary size: 27,132 unique tokens
- Shared vocabulary: 14,567 types
- Test-only vocabulary: 12,565 types
Token counts:
- Train: 298,196 tokens
- Test: 160,800 tokens
- Test tokens that do not appear in the training vocabulary (OOV relative to train): 14,481
- OOV rate in test: 9.01 percent
Type–token ratios (TTR):
- Train TTR ≈ 0.14 (41,269 / 298,196)
- Test TTR ≈ 0.17 (27,132 / 160,800)
The slightly higher TTR in the test set suggests that the test examples are, on average, a bit more lexically diverse.
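These vocabulary statistics can be reproduced with simple set arithmetic over whitespace tokens (a sketch; the tokenization actually used in the repository may differ, which would shift the exact counts):

```python
def vocab_stats(train_texts, test_texts):
    """Vocabulary overlap, OOV rate, and type-token ratios (TTR)."""
    train_tokens = [t for text in train_texts for t in text.split()]
    test_tokens = [t for text in test_texts for t in text.split()]
    train_vocab, test_vocab = set(train_tokens), set(test_tokens)

    # OOV rate: fraction of test tokens never seen in training
    oov_tokens = sum(1 for t in test_tokens if t not in train_vocab)
    return {
        "shared_types": len(train_vocab & test_vocab),
        "test_only_types": len(test_vocab - train_vocab),
        "oov_rate": oov_tokens / len(test_tokens),
        "train_ttr": len(train_vocab) / len(train_tokens),
        "test_ttr": len(test_vocab) / len(test_tokens),
    }
```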
Approach
I treated the combined TEXT field as the unit of classification. I also used character 3–5-gram TF–IDF features because these capture orthographic and stylistic patterns such as punctuation use, word endings, and local character sequences. Finally, I used a linear SVM classifier on top of this representation, which is a standard, strong baseline for high-dimensional sparse features.
The main design decision is to rely on character-level features rather than word-level features or deep neural models. This choice is motivated by the OOV analysis, the literary nature of the texts, and the goal of capturing stable stylistic patterns across topics.
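A minimal version of this pipeline in scikit-learn might look as follows. This is a sketch, not the exact configuration in the repository: the `char_wb` analyzer, `min_df`, `sublinear_tf`, and `C` values are illustrative choices.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Character 3-5-gram TF-IDF features feeding a linear SVM.
# `char_wb` keeps n-grams within word boundaries, which often
# works well for stylometric features.
model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5),
                              sublinear_tf=True, min_df=2)),
    ("svm", LinearSVC(C=1.0)),
])

# Typical usage on the competition data:
# model.fit(train_df["TEXT"], train_df["LABEL"])
# preds = model.predict(test_df["TEXT"])
```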
Results
Using a stratified 80/20 train–validation split on train.csv, this baseline model reaches a macro F1 score of 0.4717 on the validation set, with overall accuracy of 0.7695. The classifier performs very well on the majority class (label 0, “not same author”) but has low recall and F1 for the minority class (label 1, “same author”), reflecting the underlying label imbalance in the data.
Evaluation scores:
- Macro F1: 0.4717
- Accuracy: 0.7695
Error analysis
The confusion matrix shows that the model is extremely confident about label 0 but rarely predicts label 1 on held-out data:
For label 0, it correctly classifies 244 out of 250 examples and mislabels only 6 as label 1.
For label 1, it correctly classifies only 3 out of 71 examples and mislabels 68 as label 0.
This leads to very high recall and F1 for label 0 and very low recall and F1 for label 1. The model is effectively biased toward predicting that most pairs are written by different authors, which mirrors the class imbalance in the training data.
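The per-class recall and the reported macro F1 follow directly from these confusion-matrix counts (plain-Python arithmetic using only the numbers above):

```python
# Validation confusion matrix: rows = true label, columns = predicted label
tn, fp = 244, 6    # true label 0
fn, tp = 68, 3     # true label 1

recall_0 = tn / (tn + fp)                # 0.976
recall_1 = tp / (tp + fn)                # ≈ 0.042
precision_0 = tn / (tn + fn)             # 244 / 312
precision_1 = tp / (tp + fp)             # 3 / 9

f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

macro_f1 = (f1_0 + f1_1) / 2
print(f"macro F1 = {macro_f1:.4f}")      # 0.4717, matching the reported score
```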
Qualitatively, many false negatives (true label 1, predicted 0) occur when:

- Both spans are fairly short, so there is limited stylistic evidence.
- The author’s style changes between spans (for example, narration versus dialogue).
- The spans are topically very similar to other authors in the corpus, which may blur distinctive stylistic cues.
False positives (true label 0, predicted 1) are less common. They often appear when the two spans share unusually specific vocabulary or expressions, leading the model to treat lexical overlap as a strong signal of shared authorship, or when the spans have similar sentence rhythm and punctuation patterns despite coming from different authors.
These observations suggest that balancing the classes during training and incorporating more explicit pairwise similarity features could help the model better distinguish genuine stylistic similarity from topic overlap.
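One low-cost version of the class-balancing idea, assuming the scikit-learn setup sketched above, is to reweight the hinge loss by inverse class frequency:

```python
from sklearn.svm import LinearSVC

# class_weight="balanced" scales each class's penalty by
# n_samples / (n_classes * n_class_samples). With 1,245 vs 356
# training examples, minority-class errors cost roughly 3.5x more.
svm = LinearSVC(C=1.0, class_weight="balanced")
```

This changes only the classifier's objective, so it can be dropped into the existing pipeline without touching the feature extraction.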
Reproducibility
Steps to reproduce these results:
- Clone the repository:
```shell
git clone https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-pamelaangulo164.git
cd ling-582-fall-2025-class-competition-code-pamelaangulo164
```

- Create and activate a Python virtual environment, then install dependencies:
```shell
python -m venv .venv
source .venv/bin/activate    # on macOS/Linux
# .venv\Scripts\activate     # on Windows
pip install -r requirements.txt
```

- Place train.csv, test.csv, and sample_submission.csv in the expected data directory or project root, as documented in the repository.
- Run the main script:
```shell
python code.py
```