Character Classification: A Simple but Effective Linear Model Pipeline
Author: abhiramn
— class competition — 3 min read

| Leaderboard score | 0.56304 |
|---|---|
| Leaderboard team name | Abhiram Varma Nandimandalam |
| Kaggle username | isjustabhi1 |
| Code Repository URL | https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-isjustabhi |
Task summary
This competition poses an authorship verification task: given two spans of English text joined by the [SNIPPET] delimiter, determine whether they were written by the same author (label = 1) or by different authors (label = 0). This connects to well-established problems in authorship attribution, digital stylometry, and author profiling.
The dataset contains 1601 examples, with a label distribution of 1245 (not same author) and 356 (same author), creating a moderately imbalanced binary classification setting.
This task presents several challenges:
- Topic vs. style conflict: Texts may discuss similar topics despite being from different authors, or may vary in register while being written by the same author.
- Span length variability: Spans range from 4 to 860+ words, meaning some pairs contain limited stylistic evidence.
- Historical diversity: The data includes authors from different eras and genres (Project Gutenberg + other corpora), resulting in diverse punctuation usage and vocabulary.
Since our assignment requires open-weight and reproducible tools, I implement a scalable linear-baseline pipeline and supplement it with an experimental MiniLM embedding approach.
See the rubric
Exploratory data analysis
To understand the dataset structure and stylistic variety, I conducted a detailed exploratory analysis of train.csv.
Dataset size and label distribution
- Total examples: 1601
- Label 0 (not same author): 1245
- Label 1 (same author): 356
Because label 0 dominates (~78%), I used class_weight="balanced" in training and macro-F1 for evaluation.
Span length statistics
Each row contains two spans separated by [SNIPPET]. Word count statistics:
| Span | Min | Median | Max |
|---|---|---|---|
| span_one | 8 | 75 | 481 |
| span_two | 4 | 80 | 860 |
Short spans (< 15 words) tend to produce uncertain predictions since they contain minimal stylistic information.
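The span-length statistics above come from splitting each row on the delimiter and counting whitespace-separated words. A minimal sketch (the two example rows are illustrative, not from train.csv):

```python
import statistics

def span_word_counts(texts):
    """Split each row on [SNIPPET] and count words in each span."""
    counts_one, counts_two = [], []
    for text in texts:
        span_one, _, span_two = text.partition("[SNIPPET]")
        counts_one.append(len(span_one.split()))
        counts_two.append(len(span_two.split()))
    return counts_one, counts_two

rows = [
    "A short first span. [SNIPPET] And a second span here.",
    "One more example pair. [SNIPPET] With its counterpart.",
]
c1, c2 = span_word_counts(rows)
print(min(c2), statistics.median(c2), max(c2))
```

On the real data, the same function over all rows of train.csv yields the min/median/max figures in the table.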
Example Data Points
Same-author example (excerpt): “…gasped Robbins, and without a word he turned and fled, leaving… the narrow pavements toward the ‘Golden Dragon.’ The propri…”
Different-author example (excerpt): “When his speaker remained silent Dirrul assumed he had been un… The battered ship plunged out of control toward the planet. Fo…” Even these excerpts show differences in rhythm, vocabulary, and punctuation.
Measures of Diversity (Superior requirement)
Type–Token Ratio (TTR)
A measure of lexical diversity:
| Span | Min | Median | Max |
|---|---|---|---|
| span_one | 0.25 | 0.81 | 1.00 |
| span_two | ~0.46 | 0.81 | 1.00 |
Higher TTR values correspond to shorter, more varied spans.
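The TTR itself is just unique tokens over total tokens; a minimal sketch, assuming a simple lowercased whitespace tokenization:

```python
def type_token_ratio(text: str) -> float:
    """Lexical diversity: unique tokens / total tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# "the" repeats, so 4 unique tokens over 5 total.
print(type_token_ratio("the cat saw the dog"))
```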
Punctuation diversity (mean per 100 characters)
| Character | span_one | span_two |
|---|---|---|
| , | 1.160 | 1.171 |
| ; | 0.070 | 0.060 |
| " | 0.891 | 0.860 |
| ! | 0.126 | 0.098 |
| ? | 0.169 | 0.160 |
Authors differ widely in punctuation frequency, which supports the use of n-gram features for stylistic discrimination.
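The per-100-characters rates above can be computed with a small helper; this is a sketch of that computation (the function name and sample string are illustrative):

```python
def punct_per_100_chars(text: str, marks: str = ',;"!?') -> dict:
    """Occurrences of each punctuation mark per 100 characters."""
    n = len(text)
    return {m: 100 * text.count(m) / n for m in marks} if n else {}

rates = punct_per_100_chars('He paused, then said, "Go!"')
print(rates[","], rates["!"])
```

Averaging these per-span rates over the whole training set gives the table values.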
A full auto-generated report is included in my repo as eda_report.md.
See the rubric
Approach
My model pipeline consists of two components:
1. TF–IDF + LinearSVC (Primary Model / Leaderboard Submission)
- Split the text on `[SNIPPET]` into `span_one` and `span_two`
- Fit a word-level TF-IDF vectorizer with `ngram_range=(1, 2)` and `max_features=5000`
- Transform the spans into `X1` and `X2`
- Compute the feature representation as the absolute difference `|X1 - X2|`
- Train a LinearSVC with `class_weight="balanced"`, a tuned `C` (via cross-validation), and `random_state=42`
This approach is lightweight, interpretable, and surprisingly competitive.
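The steps above can be sketched end to end as follows; the two toy rows stand in for the competition data, and `C=1.0` is a placeholder for the cross-validated value:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = [
    "the old ship sailed on [SNIPPET] the old ship sailed home",
    "a quiet morning walk [SNIPPET] quantum flux capacitors hummed",
]
labels = [1, 0]

# Split each row on the delimiter into the two spans.
pairs = [t.split("[SNIPPET]") for t in texts]
span_one = [p[0] for p in pairs]
span_two = [p[1] for p in pairs]

# One shared vocabulary for both spans.
vec = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
vec.fit(span_one + span_two)
X1 = vec.transform(span_one)
X2 = vec.transform(span_two)

# Feature representation: |X1 - X2|.
X = np.abs((X1 - X2).toarray())

clf = LinearSVC(class_weight="balanced", C=1.0, random_state=42)
clf.fit(X, labels)
preds = clf.predict(X)
```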
2. MiniLM Embeddings (Experimental Component)
To explore more stylistic depth, I added an embedding pipeline:
- Use `sentence-transformers/all-MiniLM-L6-v2` to encode each span
- Compute similarity features plus the absolute difference of the two embeddings
- Train a Logistic Regression classifier on these features
This model is not my leaderboard submission but serves as a more neural, style-focused alternative.
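The embedding feature construction can be sketched as below. The real pipeline calls `model.encode(...)` on a `SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")`; here a random stand-in encoder (same 384-dimensional output) keeps the sketch self-contained and download-free:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

def encode(spans):
    # Stand-in for SentenceTransformer.encode(spans); MiniLM dim = 384.
    return rng.normal(size=(len(spans), 384))

E1 = encode(["first span of pair one", "first span of pair two"])
E2 = encode(["second span of pair one", "second span of pair two"])

# Cosine similarity between the two span embeddings, per pair.
cos = np.sum(E1 * E2, axis=1) / (
    np.linalg.norm(E1, axis=1) * np.linalg.norm(E2, axis=1)
)

# Similarity feature + absolute difference of embeddings.
features = np.hstack([np.abs(E1 - E2), cos[:, None]])
clf = LogisticRegression(max_iter=1000).fit(features, [1, 0])
```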
All code is fully reproducible and available in the GitHub repository.
See the rubric
Results
Leaderboard Submission
- Score (Macro F1): 0.56304
- Submission name: Abhiram Varma Nandimandalam
- This score exceeds the baseline, meeting the Superior criterion for leaderboard results.
Cross-Validation (Robustness Analysis)
I performed Stratified 5-fold CV to check model stability. Approximate F1 values:
- Fold 1: ~0.60
- Fold 2: ~0.48
- Fold 3: ~0.53
- Fold 4: ~0.50
- Fold 5: ~0.55
Mean macro F1 ≈ 0.53, consistent with the public leaderboard.
This satisfies the rubric’s requirement for statistical robustness.
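The CV loop follows the standard stratified pattern; a sketch with synthetic features in place of the TF-IDF difference matrix (the ~78/22 label split mimics the real data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))          # stand-in for |X1 - X2|
y = np.array([0] * 78 + [1] * 22)       # mimic the ~78/22 imbalance

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in skf.split(X, y):
    clf = LinearSVC(class_weight="balanced", random_state=42)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(
        f1_score(y[val_idx], clf.predict(X[val_idx]), average="macro")
    )

print(round(float(np.mean(scores)), 3))
```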
See the rubric
Error analysis
I manually inspected mispredictions from validation sets.
Common Errors
1. Very short spans
Short excerpts (< 15 words) contain minimal stylistic cues → model often guesses.
2. Topic-driven false positives
Two different authors writing about similar topics (e.g., ships, storms) may look similar lexically.
3. Register-switch false negatives
Same author writing:
- narrative → long sentences, commas
- dialogue → short utterances, quotation marks
Model incorrectly predicts 0 because the stylistic register shifts significantly.
Example false positive: Two descriptive nature passages from different authors misclassified as same-author due to vocabulary overlap.
Example false negative: Narrative + dialogue from the same novel misclassified as different authors.
This analysis suggests embedding-based approaches or deeper stylometric modeling could help.
See the rubric
Reproducibility
All steps are fully reproducible using open-weight models.
To reproduce my results:
```bash
git clone https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-isjustabhi
cd ling-582-fall-2025-class-competition-code-isjustabhi

python -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt

python src/preprocess.py
python src/train.py
python src/predict.py
```
See the rubric

Future Improvements
If I had more time, I would explore:
- Siamese or contrastive transformer models: use open-weight encoders trained to focus on style rather than content.
- Richer stylometric feature sets: character n-grams, function-word ratios, punctuation rhythms, and sentence-length and word-length distributions.
- Hybrid ensemble: combine TF-IDF, embeddings, and stylometric features.
- Short-span handling: build a separate classifier for spans under 20 words, or apply confidence calibration.
These directions would likely push performance beyond 0.60 F1 in a reproducible, open-source setup.