LING 582 (FA 2025)

Class Competition – Final Summary (Data Minds Team)

Author: pranithachilvari


Task Summary

The class competition focuses on authorship verification, where the goal is to determine whether two text spans, separated by the token [SNIPPET], were written by the same author or by different authors. This is a binary classification task:

  • 0: Not the same author
  • 1: Same author

The dataset includes:

  • train.csv — labeled examples
  • test.csv — unlabeled examples for leaderboard submission
  • sample_submission.csv — required output format

Each row contains an ID, the combined TEXT field, and (in the training set) the LABEL.
Span lengths vary significantly, and only raw text is provided: no author metadata, genres, or author IDs.
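Since the two spans arrive concatenated in a single TEXT field, the first preprocessing step is to split them on the separator token. A minimal sketch (the sample row below is invented for illustration):

```python
def split_pair(text: str, sep: str = "[SNIPPET]"):
    """Split a combined TEXT field into its two spans."""
    left, _, right = text.partition(sep)
    return left.strip(), right.strip()

row = "I walked home slowly. [SNIPPET] The committee convened at noon."
a, b = split_pair(row)
print(a)  # -> I walked home slowly.
```

`str.partition` is used rather than `split` so that any later occurrence of the separator inside a span does not produce extra fragments.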

This task relates to:

  • authorship attribution
  • stylometry
  • authorship verification
  • digital forensics

Challenges

Several factors make the task non-trivial:

  • Very short spans provide little stylistic evidence
  • Topic differences overshadow writing style
  • Same-author samples may differ in vocabulary or tone
  • Different authors may share similar writing style

State of the Art

Modern approaches often use:

  • Transformer models (RoBERTa, DeBERTa)
  • Character-level CNNs
  • SVMs with stylometric features
  • Siamese networks for similarity learning

Our team developed a clean, reproducible baseline using TF–IDF + LinearSVC.


Exploratory Data Analysis (EDA)

We performed light EDA to understand dataset structure.

Dataset Size

  • Total training examples: TODO: insert count
  • Label distribution:
    • Label 0: TODO
    • Label 1: TODO

Observations

  • Label 0 was slightly more common.
  • Text lengths varied from very short snippets to long paragraphs.
  • Some same-author pairs had similar phrasing and punctuation.
  • Some different-author pairs contained drastically different tones and vocabulary.

Vocabulary Diversity

The approximate vocabulary size across the training set was:

TODO: insert value

This indicates a wide stylistic range across authors and topics.
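One simple way to estimate vocabulary size is to count unique lowercase word tokens across all training texts; a hedged sketch of that idea, using a tiny invented corpus in place of train.csv:

```python
import re

def vocab_size(texts):
    """Count unique lowercase word tokens across a collection of texts."""
    vocab = set()
    for t in texts:
        vocab.update(re.findall(r"[a-z']+", t.lower()))
    return len(vocab)

texts = ["The cat sat.", "The dog sat and the cat ran."]
print(vocab_size(texts))  # -> 6
```

The exact figure depends on the tokenizer (punctuation handling, casing), so any reported vocabulary size should note how tokens were defined.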

Example Cases

  • Same-author: Often shared function words, punctuation rhythm, and similar sentence structure.
  • Different-author: Sometimes differed dramatically in tone and syntactic style.

Approach

Our model uses a clean, effective feature-based pipeline.

Text Representation

We applied:

  • TF–IDF vectorization
  • Word 1–2 n-grams
  • min_df=2, max_df=0.9
  • sublinear_tf=True

These capture stylistic, lexical, and surface-level patterns.

Classifier

We used LinearSVC, which performs well with high-dimensional sparse data and is widely used for text classification baselines.

Training Strategy

  1. Split the dataset into train/validation sets with stratification.
  2. Train the TF–IDF + LinearSVC model.
  3. Compute validation Macro-F1.
  4. Retrain on full training data for final predictions.
  5. Submit predictions to the leaderboard (submission.csv).
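Steps 1 through 4 can be sketched end-to-end as follows; the toy texts and labels are invented stand-ins for the real train.csv rows:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "i think we should go, honestly.",
    "honestly, i think we should just go.",
    "the results indicate a significant effect.",
    "we observe a significant effect in the results.",
    "i honestly think that is fine.",
    "the analysis indicates the effect is robust.",
    "we should honestly just think about it.",
    "the effect observed in the analysis is significant.",
    "i think honestly we go now.",
    "the significant results indicate robustness.",
]
labels = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]

# Step 1: stratified train/validation split
X_tr, X_val, y_tr, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Step 2: TF-IDF + LinearSVC pipeline
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LinearSVC(),
)
model.fit(X_tr, y_tr)

# Step 3: validation Macro-F1
macro_f1 = f1_score(y_val, model.predict(X_val), average="macro")

# Step 4: retrain on the full training data for final predictions
model.fit(texts, labels)
```

On the real data the final `model.predict` output over test.csv would be written to submission.csv in the sample_submission.csv format.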

Novelty

  • Combined n-gram TF–IDF with SVM for a simple yet strong baseline
  • Lightweight, reproducible, competition-friendly pipeline

Results

Validation Performance

Using a stratified 80/20 split, validation Macro-F1 was approximately:

TODO: insert your validation score from Kaggle notebook

Leaderboard Submission

Our final team submission:

  • Submission name: submission.csv
  • Public Leaderboard Score (Macro-F1): 0.40638

Baseline Comparison

The baseline score listed on the competition page was:

TODO: baseline Macro-F1

Our team’s score was TODO: above / below / comparable, depending on baseline value.

Robustness

We additionally performed 5-fold cross-validation (in our Kaggle notebook), and the Macro-F1 remained stable across folds.
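The 5-fold check can be sketched with `cross_val_score` and a stratified splitter; the toy rows below are invented stand-ins for the real training pairs:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "i honestly think we should go. [SNIPPET] honestly, i think so too.",
    "the results indicate an effect. [SNIPPET] we observe the effect.",
] * 10  # 20 rows, 10 per class
labels = [1, 0] * 10

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LinearSVC(),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="f1_macro")
print(scores.mean(), scores.std())
```

A small standard deviation across the five fold scores is what "stable across folds" means here.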


Error Analysis

Key error patterns observed:

1. Very Short Spans

Very short spans offer minimal stylistic cues, so the model frequently misclassifies them.

2. Topic Mismatch

Same-author but different-topic pairs appeared stylistically different.

3. Generic Writing Style

Some authors used extremely generic sentence structures, causing false positives.

Confusion Matrix Notes

Our analysis showed slightly more false negatives (predicting 0 when label = 1), suggesting the model is conservative about classifying “same author.”
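With label 1 = "same author", a false negative is predicting 0 when the true label is 1. A sketch of how the counts can be read off a scikit-learn confusion matrix, using invented validation labels and predictions:

```python
from sklearn.metrics import confusion_matrix

# Invented example: true labels vs. model predictions
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [0, 1, 0, 0, 0, 0, 1, 1]

# For binary labels [0, 1], ravel() yields (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fn, fp)  # -> 2 1
```

In this toy example fn > fp, the same pattern we observed: the model leans toward predicting "different author" when evidence is weak.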


Reproducibility

The entire workflow is reproducible from the following resources:

GitHub Repo: https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-PranithaChilvari

Kaggle Link: https://www.kaggle.com/code/harichandanavadige/data-minds

Local Reproduction

Install requirements:

pip install -r requirements.txt