Class Competition – Final Summary (Data Minds Team)
Author: pranithachilvari
Task Summary
The class competition focuses on authorship verification, where the goal is to determine whether two text spans, separated by the token [SNIPPET], were written by the same author or by different authors. This is a binary classification task:
- 0: Not the same author
- 1: Same author
The dataset includes:
- train.csv — labeled examples
- test.csv — unlabeled examples for leaderboard submission
- sample_submission.csv — required output format
Each row contains an ID, the combined TEXT field, and the LABEL (in the training set).
Span lengths vary significantly, and only raw text is provided—no author metadata, genres, or IDs.
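The row format described above can be illustrated with a short pandas sketch. The toy rows below are invented for demonstration; real data comes from the TEXT and LABEL columns of train.csv:

```python
import pandas as pd

# Toy rows in the competition's format: an ID, a combined TEXT field with
# two spans joined by the [SNIPPET] separator, and a binary LABEL
# (training set only).
train = pd.DataFrame({
    "ID": [0, 1],
    "TEXT": [
        "I went to the store. [SNIPPET] Then I walked home slowly.",
        "The results were clear. [SNIPPET] Quantum entanglement is weird.",
    ],
    "LABEL": [1, 0],
})

# Recover the two spans from the combined field.
spans = train["TEXT"].str.split("[SNIPPET]", n=1, regex=False)
train["SPAN_A"] = spans.str[0].str.strip()
train["SPAN_B"] = spans.str[1].str.strip()

print(train[["SPAN_A", "SPAN_B", "LABEL"]])
```

Splitting on the literal `[SNIPPET]` token with `n=1` ensures any later bracketed text in a span is left intact.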
Related Tasks
This task relates to:
- authorship attribution
- stylometry
- authorship verification
- digital forensics
Challenges
Several factors make the task non-trivial:
- Very short spans provide little stylistic evidence
- Topic differences overshadow writing style
- Same-author samples may differ in vocabulary or tone
- Different authors may share similar writing style
State of the Art
Modern approaches often use:
- Transformer models (RoBERTa, DeBERTa)
- Character-level CNNs
- SVMs with stylometric features
- Siamese networks for similarity learning
Our team developed a clean, reproducible baseline using TF–IDF + LinearSVC.
Exploratory Data Analysis (EDA)
We performed light EDA to understand dataset structure.
Dataset Size
- Total training examples: TODO: insert count
- Label distribution:
- Label 0: TODO
- Label 1: TODO
Observations
- Label 0 was slightly more common.
- Text lengths varied from very short snippets to long paragraphs.
- Some same-author pairs had similar phrasing and punctuation.
- Some different-author pairs contained drastically different tones and vocabulary.
Vocabulary Diversity
The approximate vocabulary size across the training set was:
TODO: insert value
This indicates a wide stylistic range across authors and topics.
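One simple way to approximate vocabulary size, as a hedged sketch: fit a `CountVectorizer` on the training texts and count distinct word tokens. The three sentences below are placeholders for the real TEXT column:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative corpus; in practice this would be the TEXT column of train.csv.
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "A quick survey of stylometric features follows.",
    "Function words carry surprisingly strong authorship signal.",
]

# Count distinct word tokens (default tokenizer: lowercased words of 2+ chars).
vectorizer = CountVectorizer()
vectorizer.fit(texts)
vocab_size = len(vectorizer.vocabulary_)
print(f"Approximate vocabulary size: {vocab_size}")
```

Note that the estimate depends on the tokenizer: single-character tokens are dropped by the default pattern, and case folding merges variants.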
Example Cases
- Same-author: Often shared function words, punctuation rhythm, and similar sentence structure.
- Different-author: Sometimes differed dramatically in tone and syntactic style.
Approach
Our model uses a clean, effective feature-based pipeline.
Text Representation
We applied:
- TF–IDF vectorization
- Word 1–2 n-grams
- min_df=2, max_df=0.9
- sublinear_tf=True
These capture stylistic, lexical, and surface-level patterns.
Classifier
We used LinearSVC, which performs well with high-dimensional sparse data and is widely used for text classification baselines.
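The representation and classifier described above can be wired together as a scikit-learn pipeline. This is a sketch using the stated hyperparameters; the toy texts and labels are stand-ins for the real training data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# TF-IDF with word 1-2 n-grams and the settings listed above,
# feeding a LinearSVC classifier.
model = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 2),   # word uni- and bigrams
        min_df=2,             # drop terms seen in fewer than 2 documents
        max_df=0.9,           # drop terms seen in more than 90% of documents
        sublinear_tf=True,    # 1 + log(tf) term-frequency scaling
    )),
    ("svm", LinearSVC()),
])

# Toy fit/predict to show the interface; real training uses train.csv.
texts = ["alpha beta [SNIPPET] alpha beta",
         "gamma delta [SNIPPET] epsilon zeta"] * 4
labels = [1, 0] * 4
model.fit(texts, labels)
print(model.predict(["alpha beta [SNIPPET] alpha beta"]))
```

Keeping vectorizer and classifier in one `Pipeline` avoids leaking test-set statistics into the TF-IDF fit during validation.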
Training Strategy
- Split the dataset into train/validation sets with stratification.
- Train the TF–IDF + LinearSVC model.
- Compute validation Macro-F1.
- Retrain on full training data for final predictions.
- Submit predictions to the leaderboard (submission.csv).
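The steps above can be sketched as follows. The synthetic texts and labels are placeholders; real runs read them from train.csv:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in data; in practice texts/labels come from train.csv.
texts = (["same same [SNIPPET] same same"] * 20
         + ["one two [SNIPPET] three four"] * 20)
labels = [1] * 20 + [0] * 20

# 1. Stratified 80/20 split preserves the label ratio in both halves.
X_tr, X_val, y_tr, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# 2-3. Train the TF-IDF + LinearSVC model and compute validation Macro-F1.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ("svm", LinearSVC()),
])
model.fit(X_tr, y_tr)
macro_f1 = f1_score(y_val, model.predict(X_val), average="macro")
print(f"Validation Macro-F1: {macro_f1:.3f}")

# 4. Retrain on all labeled data before predicting the test set.
model.fit(texts, labels)
```

Macro-F1 averages the per-class F1 scores, so both classes count equally regardless of the label imbalance noted in the EDA.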
Novelty
- Combined n-gram TF–IDF with SVM for a simple yet strong baseline
- Lightweight, reproducible, competition-friendly pipeline
Results
Validation Performance
Using a stratified 80/20 split, validation Macro-F1 was approximately:
TODO: insert your validation score from Kaggle notebook
Leaderboard Submission
Our final team submission:
- Submission name: submission.csv
- Public Leaderboard Score (Macro-F1): 0.40638
Baseline Comparison
The baseline score listed on the competition page was:
TODO: baseline Macro-F1
Our team’s score was TODO: above / below / comparable, depending on baseline value.
Robustness
We additionally performed 5-fold cross-validation (in our Kaggle notebook), and the Macro-F1 remained stable across folds.
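A cross-validation check like the one described can be reproduced with `cross_val_score`. This sketch again uses synthetic stand-in data in place of train.csv:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in data; real runs use the TEXT/LABEL columns of train.csv.
texts = (["same same [SNIPPET] same same"] * 25
         + ["one two [SNIPPET] three four"] * 25)
labels = [1] * 25 + [0] * 25

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("svm", LinearSVC()),
])

# Stratified 5-fold cross-validation, scored with Macro-F1 per fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="f1_macro")
print(f"Macro-F1 per fold: {scores}, mean={scores.mean():.3f}")
```

A low standard deviation across folds is what "stable" means here: the score does not hinge on one lucky split.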
Error Analysis
Key error patterns observed:
1. Very Short Spans
Minimal stylistic cues → model frequently misclassifies.
2. Topic Mismatch
Same-author but different-topic pairs appeared stylistically different.
3. Generic Writing Style
Some authors used extremely generic sentence structures, causing false positives.
Confusion Matrix Notes
Our analysis showed slightly more false negatives (predicting 0 when label = 1), suggesting the model is conservative about classifying “same author.”
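The false-negative/false-positive breakdown can be read directly off a scikit-learn confusion matrix. The labels and predictions below are illustrative; the real ones come from the validation split:

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels/predictions mirroring the observed error pattern:
# more false negatives (missed same-author pairs) than false positives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"FN (predicted 0, label 1): {fn}, FP (predicted 1, label 0): {fp}")
```

With more FN than FP, the decision threshold is effectively conservative about the "same author" class, matching the note above.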
Reproducibility
The entire workflow is reproducible using the following scripts:
GitHub Repo: https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-PranithaChilvari
Kaggle Link: https://www.kaggle.com/code/harichandanavadige/data-minds
Local Reproduction
Install requirements:
pip install -r requirements.txt