A Hybrid Approach to Author Verification
Author: oaikumariegbe
class competition · 6 min read

| Leaderboard score | 0.65 |
|---|---|
| Leaderboard team name | Oghenevovwe Ikumariegbe |
| Kaggle username | abbyogv |
| Code Repository URL | https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-Abby-OGV |
Task summary
Introduction
The task, commonly known as author verification in Natural Language Processing, is to determine whether two (or more) snippets of text were written by the same author. Unlike authorship attribution, where the goal is to select the most likely author from a predefined set of candidates, author verification operates in an open-set scenario: deciding whether two documents share authorship without relying on a closed list of known writers. A closely related task, author profiling, extends this line of work to infer latent characteristics of an author (such as age, gender, native language, or personality traits) purely from linguistic patterns.
These tasks share the broad aim of uncovering stylistic signals in text, and have practical uses in digital forensics, cybersecurity, and plagiarism detection, among others. However, they are challenging for a number of reasons:
i. An author's style can be subtle and vary across topics and genres, even for the same author.
ii. Stylistic signals extend beyond semantic similarity, which adds to the task's complexity.
iii. Data is often limited and noisy, making it hard to generalize to unknown authors' styles.
Author verification, the focus of this blog, is a binary classification task with 0 corresponding to 'Not the same author' and 1 to 'Same author'. The provided dataset consists of 1601 training instances and 899 (80%) test instances. I go into more detail on the dataset in the Exploratory Data Analysis section.
Related Work and State of the Art Approaches
Classical approaches to author verification relied heavily on stylometry, focusing on manually crafted linguistic features: distributions of function words, character n-grams, POS tags, and other style markers, combined with distance measures or supervised classifiers. With the rise of deep learning, neural architectures (first CNNs and RNNs, and later transformers) have become more prominent in the task.
Pretrained language models such as BERT, RoBERTa, and ELECTRA have been fine-tuned for author verification [Nguyen et al., van Leeuwen et al.]. In parallel, contrastive and pairwise training objectives have been specialized for this task: they model authorship similarity more explicitly, encouraging embeddings of texts by the same author to cluster together while pushing apart those from different authors.
Benchmark datasets and standardized evaluations have also played a crucial role. In particular, the PAN shared tasks have established widely used benchmarks for author verification, enabling the comparison of methods across domains and levels of stylistic difficulty.
Exploratory data analysis
The initial provided dataset consisted of 1601 examples, which I split into train and validation sets using a test_size of 0.08 and a fixed seed for reproducibility. In addition to this dataset, I generated my own data from Gutenberg books and created a balanced set by combining it with the original dataset.
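The split can be sketched with scikit-learn's `train_test_split`; the DataFrame and its column names here are illustrative, not the competition's actual schema:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the competition data; "text_pair" and "label" are
# hypothetical column names used purely for illustration.
df = pd.DataFrame({
    "text_pair": [f"snippet A{i} [SNIPPET] snippet B{i}" for i in range(100)],
    "label": [i % 2 for i in range(100)],
})

train_df, val_df = train_test_split(
    df,
    test_size=0.08,        # 8% held out for validation
    random_state=42,       # fixed seed for reproducibility
    stratify=df["label"],  # preserve the class distribution in both splits
)

print(len(train_df), len(val_df))  # 92 8
```

Stratifying on the label keeps the heavy class imbalance of the original data consistent between train and validation.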
The table below shows the basic statistics of the dataset.
| Data Split | Number of examples | Class distribution |
|---|---|---|
| Train | 1472 | [0] 1145; [1] 327 |
| Val | 129 | [0] 100; [1] 29 |
| Augmented Train | 35,425 | [0] 17,713; [1] 17,712 |
| Augmented Val | 3,081 | [0] 1,540; [1] 1,541 |
I looked into the mean token lengths of the snippets, which informed both the data augmentation process and the model architecture. The mean length of the tokenized snippets in the original test set was ~266 and the max length was 1252. Hence, given the choice to use BERT and BERT-style models (for embeddings) with a model max length of 512, I decided on a model architecture that generates the embeddings for each snippet independently. I chose BERT-style models because, as encoder-only models, they build good representations of the snippets.
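The length analysis behind this decision can be sketched as follows; plain whitespace tokenization stands in for the BERT tokenizer so the sketch needs no model download, and the snippets are made up:

```python
# Rough sketch of the token-length analysis that motivated encoding each
# snippet independently. Whitespace tokenization is a stand-in for the
# actual BERT tokenizer; the snippets below are illustrative only.
snippets = [
    "A short snippet.",
    "A somewhat longer snippet with a few more tokens in it.",
    "An even longer snippet " + "with repeated filler tokens " * 50,
]

lengths = [len(s.split()) for s in snippets]
mean_len = sum(lengths) / len(lengths)
max_len = max(lengths)

# If the longest snippet alone approaches the encoder's 512-token budget,
# packing a *pair* of snippets into one sequence would truncate heavily;
# encoding each snippet independently gives each its own 512-token budget.
print(mean_len, max_len)
```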
Furthermore, some features, such as the type-token ratio and dependency parse features, were inspired by looking at a few samples of the data.
Data augmentation
To improve robustness and reduce overfitting, I augmented the training data with additional samples drawn from the Gutenberg library. I began by retrieving up to fifteen books for each of the top 100 authors in a 30-day popularity window. I then preprocessed the books to remove headers and footers, and sampled spans under specific constraints: minimum/maximum length, capitalization patterns (to filter uninformative lines such as chapter headings or titles), the absence of certain keywords (for example, "Gutenberg"), and the length statistics (min, mean, and max) of the original train set. These constraints helped ensure that the augmented examples closely resemble the original train set.
However, one key difference is that my dataset contains more spans that do not start at a sentence boundary than the original train set does. I left it this way so that the model generalizes better and does not rely on snippet-length based features.
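A hypothetical version of the span filter is sketched below; the thresholds, keyword list, and function names are illustrative, not the exact values used in the project:

```python
import random
from typing import Optional

# Illustrative constraints, not the exact values used in the project.
MIN_TOKENS, MAX_TOKENS = 50, 400
BANNED_KEYWORDS = ("gutenberg", "chapter", "contents")

def span_is_valid(span: str) -> bool:
    tokens = span.split()
    if not (MIN_TOKENS <= len(tokens) <= MAX_TOKENS):
        return False  # respect the length statistics of the original train set
    lowered = span.lower()
    if any(kw in lowered for kw in BANNED_KEYWORDS):
        return False  # drop boilerplate/metadata spans
    letters = [c for c in span if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > 0.5:
        return False  # mostly-uppercase spans are titles or chapter headings
    return True

def sample_span(text: str, rng: random.Random) -> Optional[str]:
    """Sample a random span of words; may start mid-sentence by design."""
    words = text.split()
    if len(words) < MIN_TOKENS:
        return None
    start = rng.randrange(len(words) - MIN_TOKENS + 1)
    length = rng.randint(MIN_TOKENS, min(MAX_TOKENS, len(words) - start))
    span = " ".join(words[start:start + length])
    return span if span_is_valid(span) else None
```

Note that `sample_span` deliberately picks a random word offset rather than a sentence boundary, matching the mid-sentence spans described above.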
Approach
In this project, I explore a hybrid approach to author verification by combining dense neural embeddings with linguistically motivated sparse features. The goal is to capture both high-level semantic information and fine-grained stylistic patterns that may distinguish one author's writing style from another's. This is largely motivated by the fact that authorship verification relies on signals beyond semantics.
The dense component is based on a BERT-style transformer; I also experimented with Longformer (4096 max length). I additionally trained some layers of the transformer, so the embeddings are adapted to the task. Two text spans, separated in the data by a special token ([SNIPPET]), are encoded as two independent sequences whose pooled representations are passed to a classifier.
The classifier model incorporates a set of hand-engineered features designed to reflect stylistic tendencies at the lexical, syntactic, and structural levels. These features include:
- Type-to-token ratio
- Stopword ratio
- Average token length
- Counts of POS tags
- Counts of dependency tags
- Frequencies of the top-k POS bigrams, capturing local transitions in part-of-speech sequences (k = 100)
- Cosine similarity over character n-grams of lengths 3-5
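A minimal sketch of a few of the lexical features above; the stopword list and whitespace tokenization are simplified stand-ins, and the real pipeline would use a parser for the POS and dependency counts:

```python
# Illustrative stopword list; a real pipeline would use e.g. spaCy or NLTK.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "that", "it", "is"}

def stylistic_features(text: str) -> dict:
    tokens = text.lower().split()
    n = len(tokens)
    if n == 0:
        return {"ttr": 0.0, "stopword_ratio": 0.0, "avg_token_len": 0.0}
    return {
        # Type-to-token ratio: vocabulary richness of the snippet
        "ttr": len(set(tokens)) / n,
        # Share of (illustrative) stopwords: a classic function-word signal
        "stopword_ratio": sum(t in STOPWORDS for t in tokens) / n,
        # Average token length in characters
        "avg_token_len": sum(len(t) for t in tokens) / n,
    }

feats = stylistic_features("the cat sat on the mat")
```

Each snippet gets its own feature vector, which is later combined with its pair's vector before classification.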
Furthermore, to reduce dimensionality and aid learning, I used PCA from scikit-learn to project the sparse features down to 50 dimensions.
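The reduction step can be sketched with scikit-learn's PCA; the feature matrix here is random noise purely for illustration, and in practice PCA is fit on the training features only and then applied to validation/test:

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the concatenated sparse feature matrix:
# 1000 snippet pairs, 300 raw stylistic features (sizes are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))

pca = PCA(n_components=50)       # project down to 50 components
X_reduced = pca.fit_transform(X)  # fit on train; use pca.transform() on val/test

print(X_reduced.shape)  # (1000, 50)
```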
Another core decision was to train the model to be invariant to the order of the snippets. Therefore, I used the absolute difference of the embeddings/features, their element-wise product, and their average, and never a simple concatenation of one embedding followed by the other.
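A minimal sketch of this symmetric combination (variable names are illustrative):

```python
import numpy as np

# Order-invariant combination of the two snippet representations:
# |a - b|, a * b, and (a + b) / 2 are each symmetric in (a, b), so
# swapping the snippets yields exactly the same classifier input.
def combine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    return np.concatenate([np.abs(a - b), a * b, (a + b) / 2.0])

a = np.array([1.0, -2.0, 0.5])
b = np.array([0.0, 3.0, 0.5])

assert np.allclose(combine(a, b), combine(b, a))  # invariant to snippet order
```

A plain concatenation `[a, b]` would break this property, since `[a, b]` and `[b, a]` are different inputs to the classifier.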
Lastly, I also experimented with two types of pooling to turn the token-level embeddings into a snippet embedding: mean pooling and CLS pooling.
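The two pooling variants can be sketched as follows; NumPy arrays stand in for torch tensors to keep the sketch self-contained, and the shapes are illustrative:

```python
import numpy as np

# Pooling over the encoder's last hidden states of shape (batch, seq_len, hidden).
# `attention_mask` is 1 for real tokens and 0 for padding.
def mean_pool(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    mask = attention_mask[..., None].astype(float)  # (B, T, 1)
    summed = (hidden * mask).sum(axis=1)            # padding contributes zero
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # real-token counts per row
    return summed / counts                          # (B, H)

def cls_pool(hidden: np.ndarray) -> np.ndarray:
    return hidden[:, 0]  # embedding of the first ([CLS]) token, (B, H)

hidden = np.random.default_rng(0).normal(size=(2, 4, 8))
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])

print(mean_pool(hidden, mask).shape, cls_pool(hidden).shape)
```

Mean pooling averages only over non-padding tokens, while CLS pooling trusts the encoder to summarize the sequence into the first token's embedding.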
By jointly leveraging dense semantic information and interpretable syntactic/lexical indicators, my approach aims to produce a more reliable model for distinguishing authors based on their stylistic signatures.
In addition, to reduce the risk of overfitting, I used early stopping on the best validation F1. This turned out to be useful, as the BERT model overfit the training data in later epochs: 1.0 F1 on train versus 0.84 F1 on validation.
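The early-stopping logic can be sketched as follows; the score sequence is made up, and the function name is hypothetical (the actual run used a patience of 4, as noted in the Results section):

```python
# Early stopping on the best validation F1: stop once `patience` epochs
# pass without improvement, and keep the best-scoring epoch's model.
def early_stop_index(val_f1_per_epoch, patience=4):
    """Return the index of the epoch whose checkpoint would be kept."""
    best_idx = 0
    for i, score in enumerate(val_f1_per_epoch):
        if score > val_f1_per_epoch[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_idx

scores = [0.70, 0.78, 0.84, 0.83, 0.82, 0.81, 0.80, 0.79]
print(early_stop_index(scores))  # 2
```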
Results
On the test set, my best result was a macro-average F1 of 0.68. The results reported here are based on the validation set, for which I have access to the gold labels. I obtained the following results on the validation set.
BERT (Mean pooling):

```
Classification Report
              precision    recall  f1-score   support

           0       0.84      0.86      0.85      1540
           1       0.85      0.83      0.84      1541

    accuracy                           0.85      3081
   macro avg       0.85      0.85      0.85      3081
weighted avg       0.85      0.85      0.85      3081
```

BERT (CLS pooling):
```
Classification Report
              precision    recall  f1-score   support

           0       0.88      0.83      0.85      1540
           1       0.84      0.89      0.86      1541

    accuracy                           0.86      3081
   macro avg       0.86      0.86      0.86      3081
weighted avg       0.86      0.86      0.86      3081
```

DeBERTa (Mean pooling) - last running epoch:
```
Classification Report
              precision    recall  f1-score   support

           0       0.90      0.71      0.79      1540
           1       0.76      0.92      0.83      1541

    accuracy                           0.81      3081
   macro avg       0.83      0.81      0.81      3081
weighted avg       0.83      0.81      0.81      3081
```

RoBERTa (Mean pooling) - last running epoch:
```
Classification Report
              precision    recall  f1-score   support

           0       0.83      0.85      0.84      1540
           1       0.85      0.82      0.84      1541

    accuracy                           0.84      3081
   macro avg       0.84      0.84      0.84      3081
weighted avg       0.84      0.84      0.84      3081
```

Longformer (Mean pooling) - last running epoch:
```
Classification Report
              precision    recall  f1-score   support

           0       0.84      0.80      0.82      1540
           1       0.81      0.85      0.83      1541

    accuracy                           0.82      3081
   macro avg       0.83      0.82      0.82      3081
weighted avg       0.83      0.82      0.82      3081
```

The results are promising, showing the model is learning to distinguish stylistic cues between texts by the same author and those by different authors.
However, the models are prone to overfitting on the training data. With the exception of the Longformer, all models reach 0.99-1.0 F1 on the training data. I used early stopping with a patience of 4 to return the model with the best validation F1.
Error analysis
I did a manual analysis of the validation set and found the following patterns:
- Two short snippets, especially of similar emotional tone, are likely to be classified as being by the same author. For example:

```
Snippet 1: to keep this a secret between us two. Not a word of it to your
Snippet 2: threatened to engulf the fragile vessel; but she clung to the hope that the stranger's
```

- When an author writes in a different genre or topic space, the model struggles to recognize the snippets as being by the same person. For example:

```
Snippet 1: anything were a part of many, being itself one of them, it will surely be a part of itself, which is impossible, and it will be a part of each one of the other parts, if of all; for if not a part of some one, it will be a part of all the others but this one, and thus will not be a part of each one; and if not a part of each, one it will not be a part of any one of the many; and not being a part of any one, it cannot be a part or anything else of all those things of none of which it is
Snippet 2: the air. Borne hither and thither, 'they speedily fall into beliefs' the opposite of those in which they were brought up. They hardly retain the distinction of right and wrong; they seem to think one thing as good as another. They suppose themselves to be searching after truth when they are playing the game of 'follow my leader.' They fall in love 'at first sight' with paradoxes respecting morality, some fancy about art, some novelty or eccentricity in religion, and like lovers
```

- Some snippets were too short to be informative, especially when compared with a much longer snippet:

```
Snippet 1: narrow neck, about midway the lake on the east side. The celebrated precipice is on the east or land side of this, and is so high and perpendicular that you can jump from the top,
Snippet 2: [OMITTED FOR SPACE, BUT IT'S AT LEAST 4-5 TIMES THE LENGTH OF SNIPPET 1]
```

- Some snippet pairs were in a similar writing style, leading the model to predict same author even when they were by different authors:

```
Snippet 1: narrow neck, about midway the lake on the east side. The celebrated precipice is on the east or land side of this, and is so high and perpendicular that you can jump from the top,
Snippet 2: to see the heroine and express their wonder, thanks, and admiration. All agreed that partial drowning seemed to suit the girl, for a new Ruth had risen like Venus from the sea. A softer beauty was in her fresh face now, a gentler sort of pride possessed her, and a still more modest shrinking from praise and publicity
```

Reproducibility
See the README in the code repository.
Future Improvements
There are a number of limitations and future improvements for this approach. They include:
- Lack of ablation studies: Given more time, I would like to measure the effect of the different components on the overall architecture (embeddings, handcrafted features, and dimensionality reduction).
- Objective function: Prior work has shown that contrastive or pairwise losses can work better for this task. That would be a future extension of the hybrid architecture.
- Cross-encoders: Some attention-based information is likely lost by generating the embeddings for the snippets independently.
- Overfitting: The BERT-style transformers overfit after some epochs despite early stopping. Methods like regularization, increasing the dropout probability, and cross-validation might help with this.
- Data augmentation: The augmented dataset comes from a small number of authors, limiting the model's generalizability.
References
- Nguyen et al. (2023). Improving Long-Text Authorship Verification via Model Selection and Data Tuning.
- van Leeuwen et al. Combining style and semantics for robust authorship verification.