Authorship Verification Using TF-IDF and Digital Stylometry
Author: 880417603
Class competition writeup

| Leaderboard score | 0.55026 |
|---|---|
| Leaderboard team name | conniecameron |
| Kaggle username | conniecameron |
| Code Repository URL | https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-conniecameron |
Task summary
Authorship verification determines whether two text spans belong to the same author (label = 1) or different authors (label = 0).
Each training example contains:
- Two spans concatenated as SPAN1_TEXT [SNIPPET] SPAN2_TEXT, with [SNIPPET] serving as a fixed separator marker.
- The official evaluation metric is Macro F1.
- Training and test sets contain 1601 and 899 examples, respectively.
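Since Macro F1 is the official metric, it is worth noting that it averages the per-class F1 scores, so the minority "same author" class counts as much as the majority class. A minimal sketch with scikit-learn (the toy labels are illustrative):

```python
from sklearn.metrics import f1_score

# Macro F1 averages the per-class F1 scores, so the minority
# "same author" class (1) counts as much as the majority class (0).
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]
macro = f1_score(y_true, y_pred, average="macro")
print(round(macro, 4))  # 0.5833
```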
This problem relates to:
a) Author Profiling – Inferring stylistic traits from text.
b) Authorship Identification – Assigning text to known authors, unlike verification.
c) Digital Stylometry – Measuring style via linguistic, structural, and lexical features.
d) Text Classification – TF-IDF + linear models (logistic regression) are strong statistical NLP baselines.
Key challenges include:
a) Short spans offering limited stylistic signal.
b) Topic mismatch between same-author spans.
c) Topical similarity misleading the model for different authors.
d) Dialogue-heavy text skewing uppercase and punctuation ratios.
e) Uneven span length reducing reliability of stylometric differences.
Example records examined during EDA confirm all of these issues.
Exploratory data analysis
Initial Observations:
- [SNIPPET] always appears exactly once.
- Span lengths vary dramatically (short < 50 chars; long > 1000).
- Authors display consistent tendencies in punctuation, function-word usage, and sentence length.
- Some label 0 pairs show topical overlap that can mislead lexical models.
Dataset Size per Class:
| Label | Meaning | Count | % |
|---|---|---|---|
| 0 | Different Authors | 1,245 | 77.76% |
| 1 | Same Author | 356 | 22.24% |
Metrics examined include:
- Distribution of average word length
- Sentence length (words per sentence)
- Type–token ratio
- Punctuation ratio
- Uppercase ratio
Same-author pairs tend to have smaller stylometric differences, motivating pairwise-difference features.
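A minimal sketch of how such per-span metrics and their pairwise absolute differences might be computed; the function names and the subset of metrics shown are illustrative, not the exact project code:

```python
import string

def style_features(text: str) -> dict:
    """A subset of the per-span stylometric metrics listed above."""
    words = text.split()
    chars = max(len(text), 1)
    n = max(len(words), 1)
    return {
        "avg_word_len": sum(len(w) for w in words) / n,
        "type_token_ratio": len({w.lower() for w in words}) / n,
        "punct_ratio": sum(c in string.punctuation for c in text) / chars,
        "upper_ratio": sum(c.isupper() for c in text) / chars,
    }

def pairwise_diffs(span1: str, span2: str) -> dict:
    """Absolute per-feature differences: smaller gaps hint at same author."""
    f1, f2 = style_features(span1), style_features(span2)
    return {f"diff_{k}": abs(f1[k] - f2[k]) for k in f1}
```

Identical spans yield all-zero differences, while stylistically distant spans produce large ones, which is exactly the signal the pairwise-difference features exploit.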
Approach
Approach 1: Baseline Model — TF-IDF + Logistic Regression
TF-IDF: ngram_range=(1,2), max_features=50,000
Classifier: Logistic Regression with class_weight="balanced"
Reason: Provide a strong statistical NLP baseline using lexical similarity.
Macro F1: 0.5139
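The baseline can be sketched as a two-step scikit-learn pipeline with the hyperparameters above; the toy texts and labels below are placeholders for the real training data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Word uni/bi-gram TF-IDF feeding a class-balanced logistic regression.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=50_000),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Toy pairs in the SPAN1 [SNIPPET] SPAN2 format (labels are illustrative).
texts = [
    "she smiled quietly [SNIPPET] she laughed quietly",
    "the storm raged on [SNIPPET] a calm bright morning",
]
labels = [1, 0]
baseline.fit(texts, labels)
pred = baseline.predict(texts)
```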
Approach 2: Stylometry-Only Model
Computed per-span features:
- Avg word length
- Avg sentence length
- Type–token ratio
- Punctuation ratio
- Uppercase ratio
- Digit ratio
- Function-word ratio

Plus pairwise absolute differences between SPAN1 and SPAN2 for each feature.
Classifier: StandardScaler → Logistic Regression.
Reason: Stylometry captures writing style independent of topic.
Macro F1: 0.5647
Outperforms TF-IDF baseline (+0.0508).
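The stylometry-only classifier can be sketched as follows, assuming the style features and pairwise differences have already been computed into a numeric matrix (the values below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X_style stands in for the per-span metrics plus their pairwise |diffs|;
# columns here are illustrative (avg word length, TTR, punctuation ratio).
X_style = np.array([
    [4.2, 0.81, 0.03],
    [5.9, 0.64, 0.11],
    [4.3, 0.80, 0.04],
    [6.1, 0.60, 0.12],
])
y = [1, 0, 1, 0]

# StandardScaler -> Logistic Regression, as in Approach 2.
stylo = make_pipeline(StandardScaler(), LogisticRegression(class_weight="balanced"))
stylo.fit(X_style, y)
```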
Approach 3: Combined Model — TF-IDF + Stylometry (Final System)
Implemented via ColumnTransformer
Processing Pipeline:
Step 1: Two feature groups (TEXT and STYLE) are transformed separately and then fed into a single Logistic Regression classifier.
Step 2: Each group is handled by a different transformer inside one ColumnTransformer
TEXT → TfidfVectorizer
- Converts text into a sparse vector of TF-IDF weights.
- Output: a matrix of size (n_samples × vocab_size)
STYLE → StandardScaler
- Standardizes numeric features i.e. mean = 0, std = 1
- Ensures all features contribute equally to the classifier.
Step 3: The ColumnTransformer merges them
The ColumnTransformer runs TF-IDF on column "TEXT" and runs StandardScaler on all "style" columns, then concatenates the results horizontally.
This results in one unified feature vector per row.
Reason: Combining lexical and stylistic signals provides complementary information.
Macro F1: 0.6346
Best-performing system. +0.1207 over TF-IDF baseline.
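Steps 1–3 can be sketched with scikit-learn as below; the style column names and toy data are illustrative stand-ins for the full feature set:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Column names below are illustrative; the real system uses the full style set.
style_cols = ["diff_avg_word_len", "diff_punct_ratio"]

features = ColumnTransformer([
    # "TEXT" as a plain string selects one column -> 1-D input for TF-IDF.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=50_000), "TEXT"),
    # A list of names selects a 2-D block -> standardized to mean 0, std 1.
    ("style", StandardScaler(), style_cols),
])

model = Pipeline([
    ("features", features),  # concatenates both outputs horizontally
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

df = pd.DataFrame({
    "TEXT": [
        "she smiled [SNIPPET] she laughed",
        "dark storm [SNIPPET] bright morning",
        "he nodded [SNIPPET] he waved",
        "old castle [SNIPPET] new tower",
    ],
    "diff_avg_word_len": [0.1, 1.8, 0.2, 1.7],
    "diff_punct_ratio": [0.01, 0.09, 0.02, 0.08],
})
y = [1, 0, 1, 0]
model.fit(df, y)
```

Passing the text column as a bare string (rather than a list) matters: TfidfVectorizer expects a 1-D iterable of documents, while StandardScaler expects a 2-D block.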
Results
Quantitative Summary
| Model | Features | Macro F1 | Δ vs Baseline |
|---|---|---|---|
| Baseline (TF-IDF) | word uni/bi-grams | 0.5139 | — |
| Stylometry-only | style + pairwise diffs | 0.5647 | +0.0508 |
| Combined Model | TF-IDF + stylometry | 0.6346 | +0.1207 |
Observations:
- Stylometric signals are strong and meaningful.
- Combined model significantly outperforms either component alone.
- Improvements validate that both content and style contribute to authorship cues.
Robustness Analysis
Running multiple random train/dev splits (80/20):
- Combined model variance ≈ ±0.01 Macro F1
- Stylometry-only variance slightly higher due to sensitivity to small spans
- TF-IDF baseline stable but weaker overall
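The variance estimate above can be reproduced with a simple repeated-split loop; the synthetic data below stands in for the real feature matrix and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and labels.
X = np.random.RandomState(0).randn(200, 5)
y = (X[:, 0] > 0).astype(int)

scores = []
for seed in range(5):
    # A fresh 80/20 split per seed approximates the run-to-run variance.
    X_tr, X_dev, y_tr, y_dev = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)
    scores.append(f1_score(y_dev, clf.predict(X_dev), average="macro"))

print(f"mean={np.mean(scores):.3f} std={np.std(scores):.3f}")
```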
Error analysis
a) Little stylistic or lexical signal for short spans.
b) High uppercase or punctuation ratios (e.g., in dialogue-heavy text) disrupting expected style patterns.
c) Same-author but semantically distant texts.
d) Different authors writing similar content.
Reproducibility
Future Improvements
This project demonstrates that combining TF-IDF lexical modeling with digital stylometry yields strong, interpretable solutions to authorship verification. The final model achieves a Macro F1 of 0.6346, substantially outperforming either TF-IDF alone or stylometric features alone.
The workflow remains simple, reproducible, and aligned with classic methods in authorship analysis.
Other options to explore:
1) SBERT sentence embeddings for content-invariant semantic comparison.
While TF-IDF captures lexical similarity, it fails when the same author discusses different topics or when different authors use similar vocabulary (e.g., genre conventions).
This adds a semantic similarity dimension that TF-IDF lacks:
- If the same author writes differently but with similar semantic tendencies (e.g., similar phrasing or rhetorical structure), SBERT will detect it.
- If two authors write about the same topic but in different styles, stylometry helps differentiate them.
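One way to wire this in would be a single cosine-similarity feature between span embeddings, appended to the existing feature vector. The sketch below uses placeholder vectors; in practice they would come from a sentence-transformers model (named in the comment as an assumption, not part of the current system):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two span embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice the vectors would come from an SBERT model, e.g.:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   emb1, emb2 = model.encode([span1_text, span2_text])
emb1 = np.array([0.20, 0.70, 0.10])  # placeholder embeddings
emb2 = np.array([0.25, 0.65, 0.05])

semantic_feature = cosine_sim(emb1, emb2)  # appended to the feature vector
```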
2) POS-Based Stylometry (Part-of-Speech Patterns)
- Syntactic preferences, not just surface word choices, are strong markers of author identity. Authors differ in how frequently they use various parts of speech.
- A possible implementation: tag each span with a lightweight POS tagger such as spaCy, compute POS distribution features, and add pairwise POS differences.
- The same ColumnTransformer pipeline (Approach #3) can be reused. This enhances the model's ability to detect deeper syntactic habits and grammatical style.
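The POS-distribution features could be sketched as below. To keep the sketch self-contained, it takes pre-tagged token lists; the tag set and function names are illustrative, and a real implementation would obtain the tags from spaCy as noted in the docstring:

```python
from collections import Counter

# Illustrative subset of the Universal POS tag set.
POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "ADP", "DET", "CCONJ"]

def pos_distribution(tags):
    """Normalized POS-tag frequencies for one span.

    In practice `tags` would come from a tagger such as spaCy:
        tags = [tok.pos_ for tok in nlp(span_text)]
    """
    counts = Counter(tags)
    total = max(sum(counts.values()), 1)
    return [counts.get(t, 0) / total for t in POS_TAGS]

def pos_diffs(tags1, tags2):
    """Pairwise absolute POS-frequency differences, mirroring Approach 2/3."""
    d1, d2 = pos_distribution(tags1), pos_distribution(tags2)
    return [abs(a - b) for a, b in zip(d1, d2)]
```

These difference vectors would slot into the STYLE branch of the existing ColumnTransformer alongside the other numeric features.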