LING 582 (FA 2025)

Character Classification: A Simple but Effective Linear Model Pipeline

Author: abhiramn

class competition · 3 min read

Class Competition Info
  • Leaderboard score: 0.56304
  • Leaderboard team name: Abhiram Varma Nandimandalam
  • Kaggle username: isjustabhi1
  • Code repository URL: https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-isjustabhi

Task summary

The goal of this competition is an authorship verification task: given two spans of English text joined by the [SNIPPET] delimiter, determine whether they were written by the same author (label = 1) or by different authors (label = 0). This connects to well-established problems in authorship attribution, digital stylometry, and author profiling.

The dataset contains 1601 examples: 1245 labeled 0 (different authors) and 356 labeled 1 (same author), making this a moderately imbalanced binary classification setting.

This task presents several challenges:

  • Topic vs. style conflict: Texts may discuss similar topics despite being from different authors, or may vary in register while being written by the same author.
  • Span length variability: Spans range from 4 to 860+ words, meaning some pairs contain limited stylistic evidence.
  • Historical diversity: The data includes authors from different eras and genres (Project Gutenberg + other corpora), resulting in diverse punctuation usage and vocabulary.

Since our assignment requires open-weight and reproducible tools, I implemented a scalable linear-baseline pipeline and supplemented it with an experimental MiniLM embedding approach.


See the rubric

Exploratory data analysis

To understand the dataset structure and stylistic variety, I conducted a detailed exploratory analysis of train.csv.

Dataset size and label distribution

  • Total examples: 1601
  • Label 0 (not same author): 1245
  • Label 1 (same author): 356

Because label 0 dominates (~78%), I used class_weight="balanced" in training and macro-F1 for evaluation.

Span length statistics

Each row contains two spans separated by [SNIPPET]. Word count statistics:

| Span     | Min | Median | Max |
| -------- | --- | ------ | --- |
| span_one | 8   | 75     | 481 |
| span_two | 4   | 80     | 860 |

Short spans (< 15 words) tend to produce uncertain predictions since they contain minimal stylistic information.
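The split-and-count step behind these statistics can be sketched as follows; the two sample rows are invented stand-ins for the real data, used only so the snippet runs standalone:

```python
import statistics

# Invented sample rows in the competition's format:
# two spans joined by the [SNIPPET] delimiter.
rows = [
    "A short first span. [SNIPPET] And a somewhat longer second span here.",
    "Only four words here. [SNIPPET] Four words again here.",
]

span_one_lens, span_two_lens = [], []
for row in rows:
    s1, s2 = row.split("[SNIPPET]")
    span_one_lens.append(len(s1.split()))
    span_two_lens.append(len(s2.split()))

print(min(span_one_lens), statistics.median(span_one_lens), max(span_one_lens))
print(min(span_two_lens), statistics.median(span_two_lens), max(span_two_lens))
```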

Example Data Points

Same-author example (excerpt): “…gasped Robbins, and without a word he turned and fled, leaving… the narrow pavements toward the ‘Golden Dragon.’ The propri…”

Different-author example (excerpt): “When his speaker remained silent Dirrul assumed he had been un… The battered ship plunged out of control toward the planet. Fo…” Even these excerpts show differences in rhythm, vocabulary, and punctuation.

Measures of Diversity (Superior requirement)

Type–Token Ratio (TTR)

A measure of lexical diversity:

| Span     | Min   | Median | Max  |
| -------- | ----- | ------ | ---- |
| span_one | 0.25  | 0.81   | 1.00 |
| span_two | ~0.46 | 0.81   | 1.00 |

Higher TTR values correspond to shorter, more varied spans.
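For reference, TTR can be computed in a few lines (whitespace tokenization is an assumption here; the actual report may tokenize differently):

```python
def type_token_ratio(text: str) -> float:
    """Lexical diversity: unique word types divided by total word tokens."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# "the" repeats once, so 5 types over 6 tokens.
print(type_token_ratio("the cat sat on the mat"))
```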

Punctuation diversity (mean per 100 characters)

| Character | span_one | span_two |
| --------- | -------- | -------- |
| ,         | 1.160    | 1.171    |
| ;         | 0.070    | 0.060    |
| "         | 0.891    | 0.860    |
| !         | 0.126    | 0.098    |
| ?         | 0.169    | 0.160    |

Authors differ widely in punctuation frequency, which supports the use of n-gram features for stylistic discrimination.
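A minimal sketch of the per-100-characters metric (the exact normalization used in the report may differ):

```python
def punct_per_100_chars(text: str, marks: str = ',;"!?') -> dict:
    """Frequency of each punctuation mark per 100 characters of text."""
    n = len(text)
    if n == 0:
        return {m: 0.0 for m in marks}
    return {m: 100 * text.count(m) / n for m in marks}

print(punct_per_100_chars('Hello, world! "Quite," he said.'))
```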

A full auto-generated report is included in my repo as eda_report.md.


See the rubric

Approach

My model pipeline consists of two components:

1. TF–IDF + LinearSVC (Primary Model / Leaderboard Submission)

  • Split text on [SNIPPET] into span_one and span_two
  • Fit a word-level TF-IDF vectorizer
    • ngram_range=(1, 2)
    • max_features=5000
  • Transform spans: X1 and X2
  • Compute the feature representation as the element-wise absolute difference: |X1 - X2|
  • Train a LinearSVC with:
    • class_weight="balanced"
    • tuned C (via cross-validation)
    • random_state=42

This approach is lightweight, interpretable, and surprisingly competitive.
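The steps above can be sketched as follows. The two training rows are toy stand-ins for train.csv, and C=1.0 stands in for the cross-validated value:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-ins for rows of train.csv (label 1 = same author).
texts = [
    "the old ship creaked in the storm [SNIPPET] waves battered the old ship all night",
    "a quiet garden in spring [SNIPPET] the engine roared as the rocket lifted off",
]
labels = [1, 0]

span_one = [t.split("[SNIPPET]")[0] for t in texts]
span_two = [t.split("[SNIPPET]")[1] for t in texts]

# Fit a single vectorizer on both spans so they share one vocabulary.
vec = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
vec.fit(span_one + span_two)
X1 = vec.transform(span_one)
X2 = vec.transform(span_two)

# Pair representation: element-wise absolute difference |X1 - X2|.
X = np.abs((X1 - X2).toarray())

clf = LinearSVC(class_weight="balanced", C=1.0, random_state=42)
clf.fit(X, labels)
preds = clf.predict(X)
```

Fitting one vectorizer on both spans matters: separate vocabularies would make the difference vector meaningless.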

2. MiniLM Embeddings (Experimental Component)

To explore more stylistic depth, I added an embedding pipeline:

  • Use sentence-transformers/all-MiniLM-L6-v2
  • Compute similarity features + absolute difference of embeddings
  • Train Logistic Regression

This model is not my leaderboard submission but serves as a more neural, style-focused alternative.
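A minimal sketch of the feature construction; random vectors stand in for the 384-dimensional all-MiniLM-L6-v2 embeddings so the snippet runs offline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random stand-ins for span embeddings (the real pipeline encodes
# each span with sentence-transformers/all-MiniLM-L6-v2, 384 dims).
rng = np.random.default_rng(42)
e1 = rng.normal(size=(6, 384))
e2 = rng.normal(size=(6, 384))
labels = np.array([1, 0, 1, 0, 1, 0])

# Similarity feature: cosine between the two span embeddings.
cos = np.sum(e1 * e2, axis=1) / (
    np.linalg.norm(e1, axis=1) * np.linalg.norm(e2, axis=1)
)
# Concatenate |e1 - e2| with the cosine-similarity column.
features = np.hstack([np.abs(e1 - e2), cos[:, None]])

clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(features, labels)
```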

All code is fully reproducible and available in the GitHub repository.


See the rubric

Results

Leaderboard Submission

  • Score (Macro F1): 0.56304
  • Submission name: Abhiram Varma Nandimandalam
  • This score exceeds the baseline, meeting the Superior criterion for leaderboard results.

Cross-Validation (Robustness Analysis)

I performed stratified 5-fold cross-validation to check model stability. Approximate macro-F1 per fold:

  • Fold 1: ~0.60
  • Fold 2: ~0.48
  • Fold 3: ~0.53
  • Fold 4: ~0.50
  • Fold 5: ~0.55

Mean macro F1 ≈ 0.53, consistent with the public leaderboard.

This satisfies the rubric’s requirement for statistical robustness.
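The CV protocol can be sketched like this; X and y are random stand-ins for the |X1 - X2| feature matrix and labels:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Random stand-ins for the real feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.integers(0, 2, size=100)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LinearSVC(class_weight="balanced", max_iter=5000, random_state=42),
    X, y, cv=cv, scoring="f1_macro",
)
print(scores, scores.mean())
```

Stratification keeps the ~78/22 label split roughly constant across folds, which matters for a stable macro-F1 estimate.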


See the rubric

Error analysis

I manually inspected mispredictions from validation sets.

Common Errors

1. Very short spans

Short excerpts (< 15 words) contain minimal stylistic cues, so the model is often reduced to guessing.

2. Topic-driven false positives

Two different authors writing about similar topics (e.g., ships, storms) may look similar lexically.

3. Register-switch false negatives

Same author writing:

  • narrative → long sentences, commas
  • dialogue → short utterances, quotation marks

Model incorrectly predicts 0 because the stylistic register shifts significantly.

Example false positive: Two descriptive nature passages from different authors misclassified as same-author due to vocabulary overlap.

Example false negative: Narrative + dialogue from the same novel misclassified as different authors.

This analysis suggests embedding-based approaches or deeper stylometric modeling could help.


See the rubric

Reproducibility

All steps are fully reproducible using open-weight models.

To reproduce my results:

```shell
git clone https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-isjustabhi
cd ling-582-fall-2025-class-competition-code-isjustabhi

python -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt

python src/preprocess.py
python src/train.py
python src/predict.py
```

See the rubric

Future Improvements


If I had more time, I would explore:

  • Siamese or contrastive transformer models: use open-weight encoders trained to focus on style rather than content.
  • Richer stylometric feature sets: character n-grams, function-word ratios, punctuation rhythms, and sentence-length and word-length distributions.
  • Hybrid ensemble: combine TF-IDF, embedding, and stylometric features.
  • Short-span handling: build a separate classifier for spans under 20 words, or apply confidence calibration.

These directions would likely push performance beyond 0.60 F1 in a reproducible, open-source setup.
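As one concrete starting point, character n-grams are a one-line change to the existing vectorizer (the analyzer and range shown are plausible defaults, not tuned values):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# char_wb restricts n-grams to word boundaries, which tends to
# capture punctuation habits and sub-word spelling preferences.
char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), max_features=5000)
X = char_vec.fit_transform(["Well, well!", "Hmm... quite so."])
print(X.shape)
```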
