Detecting Shared Authorship with Classical Stylometry
Author: pamelaangulo164
— class competition, authorship identification, stylometry — 3 min read

| Field | Value |
|---|---|
| Leaderboard score (macro F1) | 0.46238 |
| Leaderboard team name | Pamela Angulo Martinez (individual) |
| Kaggle username | pamelaangulomartinez |
| Code repository URL | https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-pamelaangulo164 |
Task summary
This shared task is a binary classification problem in author profiling / digital stylometry.
Each data point consists of two text spans in English, concatenated with the special delimiter [SNIPPET]. The goal is to predict whether both spans were written by the same author (LABEL = 1) or by different authors (LABEL = 0).
I treat the shared task as a binary text classification problem over the full concatenated TEXT field. Each example is represented using character 3–5-gram TF–IDF features, and I train a linear Support Vector Machine (LinearSVC) classifier on these features. Character n-grams are intended to capture orthographic and stylistic patterns that are more stable across topics and help deal with the relatively high OOV rate in the test set.
My full pipeline is implemented in the repository:
- Repo name: ling-582-fall-2025-class-competition-code-pamelaangulo164
- URL: https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-pamelaangulo164
Exploratory data analysis
Dataset size and label balance
The training set has 1,601 examples and the test set has 899 examples.
Label distribution in the training data:
- Label 0 (not same author): 1,245 examples (77.76 percent)
- Label 1 (same author): 356 examples (22.24 percent)
This is a moderately imbalanced dataset skewed toward label 0, which explains why a naive majority-class baseline performs reasonably well and why macro F1 is a more informative metric than accuracy alone.
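The macro F1 of that majority-class baseline can be computed directly from the label counts above (a plain-Python sketch; class 1 is never predicted, so its F1 is 0 by convention):

```python
# Macro F1 for a baseline that always predicts label 0,
# using the training label counts reported above.
n0, n1 = 1245, 356          # label counts in train.csv
total = n0 + n1

# Class 0: precision = n0/total, recall = 1.0
p0, r0 = n0 / total, 1.0
f1_0 = 2 * p0 * r0 / (p0 + r0)

# Class 1: never predicted, so precision, recall, and F1 are all 0
f1_1 = 0.0

macro_f1 = (f1_0 + f1_1) / 2
print(f"majority-baseline macro F1 = {macro_f1:.3f}")  # ≈ 0.437
```

Any useful model therefore needs a macro F1 comfortably above roughly 0.44.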
Span length
Train set
- Characters: mean = 1,047.69, standard deviation = 546.45, minimum = 164, maximum = 5,491
- Words: mean = 186.26, standard deviation = 97.16, minimum = 25, maximum = 1,036
Test set
- Characters: mean = 1,001.44, standard deviation = 545.32, minimum = 204, maximum = 4,321
- Words: mean = 178.87, standard deviation = 97.42, minimum = 34, maximum = 773
The length distributions in train and test are very similar, with a wide range from a few dozen words to more than a thousand. Models therefore need to handle both short and long text pairs.
The [SNIPPET] delimiter appears exactly once per example in both splits (minimum = 1, maximum = 1, mean = 1.0).
Interestingly, average lengths by label in the training set are nearly identical:
- Label 0: mean characters = 1,047.71; mean words = 186.10
- Label 1: mean characters = 1,047.62; mean words = 186.80
Vocabulary and diversity
- Train vocabulary size: 41,269 unique tokens
- Test vocabulary size: 27,132 unique tokens
- Shared vocabulary: 14,567 types
- Test-only vocabulary: 12,565 types
Token counts:
- Train: 298,196 tokens
- Test: 160,800 tokens
- Test tokens that do not appear in the training vocabulary (OOV relative to train): 14,481
- OOV rate in test: 9.01 percent
Type–token ratios (TTR):
- Train TTR ≈ 0.14 (41,269 / 298,196)
- Test TTR ≈ 0.17 (27,132 / 160,800)
The slightly higher TTR in the test set suggests that the test examples are, on average, a bit more lexically diverse.
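These vocabulary statistics can be reproduced with simple set arithmetic over whitespace tokens (a sketch; the tokenization actually used in the repository may differ, which would shift the exact counts):

```python
def vocab_stats(train_texts, test_texts):
    """Vocabulary overlap, OOV rate, and type-token ratios (TTR)."""
    train_tokens = [t for text in train_texts for t in text.split()]
    test_tokens = [t for text in test_texts for t in text.split()]
    train_vocab, test_vocab = set(train_tokens), set(test_tokens)

    # OOV rate: fraction of test tokens never seen in training
    oov_tokens = sum(1 for t in test_tokens if t not in train_vocab)
    return {
        "shared_types": len(train_vocab & test_vocab),
        "test_only_types": len(test_vocab - train_vocab),
        "oov_rate": oov_tokens / len(test_tokens),
        "train_ttr": len(train_vocab) / len(train_tokens),
        "test_ttr": len(test_vocab) / len(test_tokens),
    }
```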
Approach
I treated the combined TEXT field as the unit of classification. I also used character 3–5-gram TF–IDF features because these capture orthographic and stylistic patterns such as punctuation use, word endings, and local character sequences. Finally, I used a linear SVM classifier on top of this representation, which is a standard, strong baseline for high-dimensional sparse features.
The main design decision is to rely on character-level features rather than word-level features or deep neural models. This choice is motivated by the OOV analysis, the literary nature of the texts, and the goal of capturing stable stylistic patterns across topics.
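A minimal version of this pipeline in scikit-learn might look as follows. This is a sketch, not the exact configuration in the repository: the `char_wb` analyzer, `min_df`, `sublinear_tf`, and `C` values are illustrative choices.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Character 3-5-gram TF-IDF features feeding a linear SVM.
# `char_wb` keeps n-grams within word boundaries, which often
# works well for stylometric features.
model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5),
                              sublinear_tf=True, min_df=2)),
    ("svm", LinearSVC(C=1.0)),
])

# Typical usage on the competition data:
# model.fit(train_df["TEXT"], train_df["LABEL"])
# preds = model.predict(test_df["TEXT"])
```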
Results
Using a stratified 80/20 train–validation split on train.csv, this baseline model reaches a macro F1 score of 0.4717 on the validation set, with overall accuracy of 0.7695. The classifier performs very well on the majority class (label 0, “not same author”) but has low recall and F1 for the minority class (label 1, “same author”), reflecting the underlying label imbalance in the data.
Evaluation scores:
- Macro F1: 0.4717
- Accuracy: 0.7695
Error analysis
The confusion matrix shows that the model is extremely confident about label 0 but rarely predicts label 1 on held-out data:
For label 0, it correctly classifies 244 out of 250 examples and mislabels only 6 as label 1.
For label 1, it correctly classifies only 3 out of 71 examples and mislabels 68 as label 0.
This leads to very high recall and F1 for label 0 and very low recall and F1 for label 1. The model is effectively biased toward predicting that most pairs are written by different authors, which mirrors the class imbalance in the training data.
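The per-class recall and the reported macro F1 follow directly from these confusion-matrix counts (plain-Python arithmetic using only the numbers above):

```python
# Validation confusion matrix: rows = true label, columns = predicted label
tn, fp = 244, 6    # true label 0
fn, tp = 68, 3     # true label 1

recall_0 = tn / (tn + fp)                # 0.976
recall_1 = tp / (tp + fn)                # ≈ 0.042
precision_0 = tn / (tn + fn)             # 244 / 312
precision_1 = tp / (tp + fp)             # 3 / 9

f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

macro_f1 = (f1_0 + f1_1) / 2
print(f"macro F1 = {macro_f1:.4f}")      # 0.4717, matching the reported score
```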
Qualitatively, many false negatives (true label 1, predicted 0) occur when:

- Both spans are fairly short, so there is limited stylistic evidence.
- The author’s style changes between spans (for example, narration versus dialogue).
- The spans are topically very similar to other authors in the corpus, which may blur distinctive stylistic cues.
False positives (true label 0, predicted 1) are less common. They often appear when the two spans share unusually specific vocabulary or expressions, leading the model to treat lexical overlap as a strong signal of shared authorship, or when the spans have similar sentence rhythm and punctuation patterns despite coming from different authors.
These observations suggest that balancing the classes during training and incorporating more explicit pairwise similarity features could help the model better distinguish genuine stylistic similarity from topic overlap.
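One low-cost version of the class-balancing idea, assuming the scikit-learn setup sketched above, is to reweight the hinge loss by inverse class frequency:

```python
from sklearn.svm import LinearSVC

# class_weight="balanced" scales each class's penalty by
# n_samples / (n_classes * n_class_samples). With 1,245 vs 356
# training examples, minority-class errors cost roughly 3.5x more.
svm = LinearSVC(C=1.0, class_weight="balanced")
```

This changes only the classifier's objective, so it can be dropped into the existing pipeline without touching the feature extraction.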
Reproducibility
Steps to reproduce these results:
- Clone the repository:
```shell
git clone https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-pamelaangulo164.git
cd ling-582-fall-2025-class-competition-code-pamelaangulo164
```

- Create and activate a Python virtual environment, then install dependencies:
```shell
python -m venv .venv
source .venv/bin/activate    # on macOS/Linux
# .venv\Scripts\activate     # on Windows
pip install -r requirements.txt
```

- Place train.csv, test.csv, and sample_submission.csv in the expected data directory or project root, as documented in the repository.
- Run the main script:
```shell
python code.py
```