LING 582 (FA 2025)

Class Competition – Final Summary (Data Minds Team)

Author: pranithachilvari


Task Summary

The class competition focuses on authorship verification, where the goal is to determine whether two text spans, separated by the token [SNIPPET], were written by the same author or by different authors. This is a binary classification task:

  • 0: Not the same author
  • 1: Same author

The dataset includes:

  • train.csv — labeled examples
  • test.csv — unlabeled examples for leaderboard submission
  • sample_submission.csv — required output format

Each row contains an ID, the combined TEXT field, and (in the training set) the LABEL.
Span lengths vary significantly, and only raw text is provided: no author metadata, genres, or author IDs.
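Since the two spans arrive concatenated in a single TEXT field, the first preprocessing step is to split them on the separator token. A minimal sketch (the sample row below is invented for illustration):

```python
def split_pair(text: str, sep: str = "[SNIPPET]"):
    """Split a combined TEXT field into its two spans."""
    left, _, right = text.partition(sep)
    return left.strip(), right.strip()

row = "I walked home slowly. [SNIPPET] The committee convened at noon."
a, b = split_pair(row)
print(a)  # -> I walked home slowly.
```

`str.partition` is used rather than `split` so that any later occurrence of the separator inside a span does not produce extra fragments.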

This task relates to:

  • authorship attribution
  • stylometry
  • authorship verification
  • digital forensics

Challenges

Several factors make the task non-trivial:

  • Very short spans provide little stylistic evidence
  • Topic differences overshadow writing style
  • Same-author samples may differ in vocabulary or tone
  • Different authors may share similar writing style

State of the Art

Modern approaches often use:

  • Transformer models (RoBERTa, DeBERTa)
  • Character-level CNNs
  • SVMs with stylometric features
  • Siamese networks for similarity learning

Our team developed a clean, reproducible baseline using TF–IDF + LinearSVC.


Exploratory Data Analysis (EDA)

We performed light EDA to understand dataset structure.

Dataset Size

  • Total training examples: TODO: insert count
  • Label distribution:
    • Label 0: TODO
    • Label 1: TODO

Observations

  • Label 0 was slightly more common.
  • Text lengths varied from very short snippets to long paragraphs.
  • Some same-author pairs had similar phrasing and punctuation.
  • Some different-author pairs contained drastically different tones and vocabulary.

Vocabulary Diversity

The approximate vocabulary size across the training set was:

TODO: insert value

This indicates a wide stylistic range across authors and topics.
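One simple way to estimate vocabulary size is to count unique lowercase word tokens across all training texts; a hedged sketch of that idea, using a tiny invented corpus in place of train.csv:

```python
import re

def vocab_size(texts):
    """Count unique lowercase word tokens across a collection of texts."""
    vocab = set()
    for t in texts:
        vocab.update(re.findall(r"[a-z']+", t.lower()))
    return len(vocab)

texts = ["The cat sat.", "The dog sat and the cat ran."]
print(vocab_size(texts))  # -> 6
```

The exact figure depends on the tokenizer (punctuation handling, casing), so any reported vocabulary size should note how tokens were defined.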

Example Cases

  • Same-author: Often shared function words, punctuation rhythm, and similar sentence structure.
  • Different-author: Sometimes differed dramatically in tone and syntactic style.

Approach

Our model uses a clean, effective feature-based pipeline.

Text Representation

We applied:

  • TF–IDF vectorization
  • Word 1–2 n-grams
  • min_df=2, max_df=0.9
  • sublinear_tf=True

These capture stylistic, lexical, and surface-level patterns.

Classifier

We used LinearSVC, which performs well with high-dimensional sparse data and is widely used for text classification baselines.

Training Strategy

  1. Split the dataset into train/validation sets with stratification.
  2. Train the TF–IDF + LinearSVC model.
  3. Compute validation Macro-F1.
  4. Retrain on full training data for final predictions.
  5. Submit predictions to the leaderboard (submission.csv).
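Steps 1 through 4 can be sketched end-to-end as follows; the toy texts and labels are invented stand-ins for the real train.csv rows:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "i think we should go, honestly.",
    "honestly, i think we should just go.",
    "the results indicate a significant effect.",
    "we observe a significant effect in the results.",
    "i honestly think that is fine.",
    "the analysis indicates the effect is robust.",
    "we should honestly just think about it.",
    "the effect observed in the analysis is significant.",
    "i think honestly we go now.",
    "the significant results indicate robustness.",
]
labels = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]

# Step 1: stratified train/validation split
X_tr, X_val, y_tr, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Step 2: TF-IDF + LinearSVC pipeline
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LinearSVC(),
)
model.fit(X_tr, y_tr)

# Step 3: validation Macro-F1
macro_f1 = f1_score(y_val, model.predict(X_val), average="macro")

# Step 4: retrain on the full training data for final predictions
model.fit(texts, labels)
```

On the real data the final `model.predict` output over test.csv would be written to submission.csv in the sample_submission.csv format.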

Novelty

  • Combined n-gram TF–IDF with SVM for a simple yet strong baseline
  • Lightweight, reproducible, competition-friendly pipeline

Results

Validation Performance

Using a stratified 80/20 split, validation Macro-F1 was approximately:

TODO: insert your validation score from Kaggle notebook

Leaderboard Submission

Our final team submission:

  • Submission name: submission.csv
  • Public Leaderboard Score (Macro-F1): 0.40638

Baseline Comparison

The baseline score listed on the competition page was:

TODO: baseline Macro-F1

Our team’s score was TODO: above / below / comparable, depending on baseline value.

Robustness

We additionally performed 5-fold cross-validation (in our Kaggle notebook), and the Macro-F1 remained stable across folds.
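The 5-fold check can be sketched with `cross_val_score` and a stratified splitter; the toy rows below are invented stand-ins for the real training pairs:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "i honestly think we should go. [SNIPPET] honestly, i think so too.",
    "the results indicate an effect. [SNIPPET] we observe the effect.",
] * 10  # 20 rows, 10 per class
labels = [1, 0] * 10

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LinearSVC(),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="f1_macro")
print(scores.mean(), scores.std())
```

A small standard deviation across the five fold scores is what "stable across folds" means here.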


Error Analysis

Key error patterns observed:

1. Very Short Spans

Very short spans offer minimal stylistic cues, so the model frequently misclassifies them.

2. Topic Mismatch

Same-author but different-topic pairs appeared stylistically different.

3. Generic Writing Style

Some authors used extremely generic sentence structures, causing false positives.

Confusion Matrix Notes

Our analysis showed slightly more false negatives (predicting 0 when label = 1), suggesting the model is conservative about classifying “same author.”
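With label 1 = "same author", a false negative is predicting 0 when the true label is 1. A sketch of how the counts can be read off a scikit-learn confusion matrix, using invented validation labels and predictions:

```python
from sklearn.metrics import confusion_matrix

# Invented example: true labels vs. model predictions
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [0, 1, 0, 0, 0, 0, 1, 1]

# For binary labels [0, 1], ravel() yields (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fn, fp)  # -> 2 1
```

In this toy example fn > fp, the same pattern we observed: the model leans toward predicting "different author" when evidence is weak.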


Reproducibility

The entire workflow is reproducible from the following resources:

GitHub Repo: https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-PranithaChilvari

Kaggle Link: https://www.kaggle.com/code/harichandanavadige/data-minds

Local Reproduction

Install requirements:

pip install -r requirements.txt