Character Classification: A Simple but Effective Linear Model Pipeline
Author: abhiramn
— class competition — 3 min read

| Leaderboard score | 0.56304 |
|---|---|
| Leaderboard team name | Abhiram Varma Nandimandalam |
| Kaggle username | isjustabhi1 |
| Code Repository URL | https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-isjustabhi |
Task summary
This competition poses an authorship verification task: given two spans of English text joined by the [SNIPPET] delimiter, determine whether they were written by the same author (label = 1) or by different authors (label = 0). This connects to well-established problems in authorship attribution, digital stylometry, and author profiling.
The dataset contains 1601 examples, with a label distribution of 1245 (not same author) and 356 (same author), creating a moderately imbalanced binary classification setting.
This task presents several challenges:
- Topic vs. style conflict: Texts may discuss similar topics despite being from different authors, or may vary in register while being written by the same author.
- Span length variability: Spans range from 4 to 860+ words, meaning some pairs contain limited stylistic evidence.
- Historical diversity: The data includes authors from different eras and genres (Project Gutenberg + other corpora), resulting in diverse punctuation usage and vocabulary.
Since our assignment requires open-weight and reproducible tools, I implement a scalable linear-baseline pipeline and supplement it with an experimental MiniLM embedding approach.
See the rubric
Exploratory data analysis
To understand the dataset structure and stylistic variety, I conducted a detailed exploratory analysis of train.csv.
Dataset size and label distribution
- Total examples: 1601
- Label 0 (not same author): 1245
- Label 1 (same author): 356
Because label 0 dominates (~78%), I used class_weight="balanced" in training and macro-F1 for evaluation.
Span length statistics
Each row contains two spans separated by [SNIPPET]. Word count statistics:
| Span | Min | Median | Max |
|---|---|---|---|
| span_one | 8 | 75 | 481 |
| span_two | 4 | 80 | 860 |
Short spans (< 15 words) tend to produce uncertain predictions since they contain minimal stylistic information.
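The span-length statistics above come from splitting each row on the delimiter and counting whitespace-separated words. A minimal sketch (the two example rows are illustrative, not from train.csv):

```python
import statistics

def span_word_counts(texts):
    """Split each row on [SNIPPET] and count words in each span."""
    counts_one, counts_two = [], []
    for text in texts:
        span_one, _, span_two = text.partition("[SNIPPET]")
        counts_one.append(len(span_one.split()))
        counts_two.append(len(span_two.split()))
    return counts_one, counts_two

rows = [
    "A short first span. [SNIPPET] And a second span here.",
    "One more example pair. [SNIPPET] With its counterpart.",
]
c1, c2 = span_word_counts(rows)
print(min(c2), statistics.median(c2), max(c2))
```

On the real data, the same function over all rows of train.csv yields the min/median/max figures in the table.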
Example Data Points
Same-author example (excerpt): “…gasped Robbins, and without a word he turned and fled, leaving… the narrow pavements toward the ‘Golden Dragon.’ The propri…”
Different-author example (excerpt): “When his speaker remained silent Dirrul assumed he had been un… The battered ship plunged out of control toward the planet. Fo…” Even these excerpts show differences in rhythm, vocabulary, and punctuation.
Measures of Diversity (Superior requirement)
Type–Token Ratio (TTR)
A measure of lexical diversity:
| Span | Min | Median | Max |
|---|---|---|---|
| span_one | 0.25 | 0.81 | 1.00 |
| span_two | ~0.46 | 0.81 | 1.00 |
Higher TTR values correspond to shorter, more varied spans.
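The TTR itself is just unique tokens over total tokens; a minimal sketch, assuming a simple lowercased whitespace tokenization:

```python
def type_token_ratio(text: str) -> float:
    """Lexical diversity: unique tokens / total tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# "the" repeats, so 4 unique tokens over 5 total.
print(type_token_ratio("the cat saw the dog"))
```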
Punctuation diversity (mean per 100 characters)
| Character | span_one | span_two |
|---|---|---|
| , | 1.160 | 1.171 |
| ; | 0.070 | 0.060 |
| " | 0.891 | 0.860 |
| ! | 0.126 | 0.098 |
| ? | 0.169 | 0.160 |
Authors differ widely in punctuation frequency, which supports the use of n-gram features for stylistic discrimination.
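The per-100-characters rates above can be computed with a small helper; this is a sketch of that computation (the function name and sample string are illustrative):

```python
def punct_per_100_chars(text: str, marks: str = ',;"!?') -> dict:
    """Occurrences of each punctuation mark per 100 characters."""
    n = len(text)
    return {m: 100 * text.count(m) / n for m in marks} if n else {}

rates = punct_per_100_chars('He paused, then said, "Go!"')
print(rates[","], rates["!"])
```

Averaging these per-span rates over the whole training set gives the table values.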
A full auto-generated report is included in my repo as eda_report.md.
See the rubric
Approach
My model pipeline consists of two components:
1. TF–IDF + LinearSVC (Primary Model / Leaderboard Submission)
- Split the text on `[SNIPPET]` into `span_one` and `span_two`
- Fit a word-level TF-IDF vectorizer with `ngram_range=(1, 2)` and `max_features=5000`
- Transform the spans into `X1` and `X2`
- Compute the feature representation as the absolute difference `|X1 - X2|`
- Train a LinearSVC with `class_weight="balanced"`, a tuned `C` (via cross-validation), and `random_state=42`
This approach is lightweight, interpretable, and surprisingly competitive.
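The steps above can be sketched end to end as follows; the two toy rows stand in for the competition data, and `C=1.0` is a placeholder for the cross-validated value:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = [
    "the old ship sailed on [SNIPPET] the old ship sailed home",
    "a quiet morning walk [SNIPPET] quantum flux capacitors hummed",
]
labels = [1, 0]

# Split each row on the delimiter into the two spans.
pairs = [t.split("[SNIPPET]") for t in texts]
span_one = [p[0] for p in pairs]
span_two = [p[1] for p in pairs]

# One shared vocabulary for both spans.
vec = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
vec.fit(span_one + span_two)
X1 = vec.transform(span_one)
X2 = vec.transform(span_two)

# Feature representation: |X1 - X2|.
X = np.abs((X1 - X2).toarray())

clf = LinearSVC(class_weight="balanced", C=1.0, random_state=42)
clf.fit(X, labels)
preds = clf.predict(X)
```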
2. MiniLM Embeddings (Experimental Component)
To explore more stylistic depth, I added an embedding pipeline:
- Use `sentence-transformers/all-MiniLM-L6-v2` to encode each span
- Compute similarity features plus the absolute difference of the two embeddings
- Train a Logistic Regression classifier on these features
This model is not my leaderboard submission but serves as a more neural, style-focused alternative.
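The embedding feature construction can be sketched as below. The real pipeline calls `model.encode(...)` on a `SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")`; here a random stand-in encoder (same 384-dimensional output) keeps the sketch self-contained and download-free:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

def encode(spans):
    # Stand-in for SentenceTransformer.encode(spans); MiniLM dim = 384.
    return rng.normal(size=(len(spans), 384))

E1 = encode(["first span of pair one", "first span of pair two"])
E2 = encode(["second span of pair one", "second span of pair two"])

# Cosine similarity between the two span embeddings, per pair.
cos = np.sum(E1 * E2, axis=1) / (
    np.linalg.norm(E1, axis=1) * np.linalg.norm(E2, axis=1)
)

# Similarity feature + absolute difference of embeddings.
features = np.hstack([np.abs(E1 - E2), cos[:, None]])
clf = LogisticRegression(max_iter=1000).fit(features, [1, 0])
```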
All code is fully reproducible and available in the GitHub repository.
See the rubric
Results
Leaderboard Submission
- Score (Macro F1): 0.56304
- Submission name: Abhiram Varma Nandimandalam
- This score exceeds the baseline, meeting the Superior criterion for leaderboard results.
Cross-Validation (Robustness Analysis)
I performed Stratified 5-fold CV to check model stability. Approximate F1 values:
- Fold 1: ~0.60
- Fold 2: ~0.48
- Fold 3: ~0.53
- Fold 4: ~0.50
- Fold 5: ~0.55
Mean macro F1 ≈ 0.53, consistent with the public leaderboard.
This satisfies the rubric’s requirement for statistical robustness.
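The CV loop follows the standard stratified pattern; a sketch with synthetic features in place of the TF-IDF difference matrix (the ~78/22 label split mimics the real data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))          # stand-in for |X1 - X2|
y = np.array([0] * 78 + [1] * 22)       # mimic the ~78/22 imbalance

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in skf.split(X, y):
    clf = LinearSVC(class_weight="balanced", random_state=42)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(
        f1_score(y[val_idx], clf.predict(X[val_idx]), average="macro")
    )

print(round(float(np.mean(scores)), 3))
```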
See the rubric
Error analysis
I manually inspected mispredictions from validation sets.
Common Errors
1. Very short spans
Short excerpts (< 15 words) contain minimal stylistic cues → model often guesses.
2. Topic-driven false positives
Two different authors writing about similar topics (e.g., ships, storms) may look similar lexically.
3. Register-switch false negatives
Same author writing:
- narrative → long sentences, commas
- dialogue → short utterances, quotation marks
Model incorrectly predicts 0 because the stylistic register shifts significantly.
Example false positive: Two descriptive nature passages from different authors misclassified as same-author due to vocabulary overlap.
Example false negative: Narrative + dialogue from the same novel misclassified as different authors.
This analysis suggests embedding-based approaches or deeper stylometric modeling could help.
See the rubric
Reproducibility
All steps are fully reproducible using open-weight models.
To reproduce my results:
```bash
git clone https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-isjustabhi
cd ling-582-fall-2025-class-competition-code-isjustabhi

python -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt

python src/preprocess.py
python src/train.py
python src/predict.py
```
See the rubric

Future Improvements
If I had more time, I would explore:
- Siamese or contrastive transformer models: use open-weight encoders trained to focus on style rather than content.
- Richer stylometric feature sets: character n-grams, function-word ratios, punctuation rhythms, and sentence-length and word-length distributions.
- Hybrid ensemble: combine TF-IDF, embeddings, and stylometric features.
- Short-span handling: build a separate classifier for spans under 20 words, or apply confidence calibration.
These directions would likely push performance beyond 0.60 F1 in a reproducible, open-source setup.