Class Competition – Initial Draft Summary (Data Minds)
Author: vadige
Student: Hari Chandana Vadige
Team: Data Minds
Task Summary
The purpose of this competition is to determine whether two text spans—joined using the special token [SNIPPET]—were written by the same author (label = 1) or by different authors (label = 0).
This is a binary classification task related to authorship attribution and stylometry.
The provided dataset includes:
- train.csv — contains ID, TEXT, and LABEL
- test.csv — contains ID and TEXT
- sample_submission.csv — example output format
The TEXT field holds both spans merged into a single string, separated by the [SNIPPET] token. Each training example also carries the correct binary label in the LABEL column.
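Because both spans arrive in one string, a useful first step is to separate them again. The following is a minimal sketch, assuming the token appears exactly as the literal string [SNIPPET]; the helper name split_spans and the example sentence are illustrative, not part of the competition files.

```python
from typing import Tuple

SEPARATOR = "[SNIPPET]"


def split_spans(text: str) -> Tuple[str, str]:
    """Split one TEXT value into its two spans at the first [SNIPPET] token."""
    first, sep, second = text.partition(SEPARATOR)
    if not sep:
        # No separator found: treat this as malformed rather than guessing.
        raise ValueError("TEXT does not contain the [SNIPPET] separator")
    return first.strip(), second.strip()


# Made-up example string, not taken from train.csv.
example = "It was a dark night. [SNIPPET] The rain fell hard."
span_a, span_b = split_spans(example)
print(span_a)
print(span_b)
```

The baseline described below does not depend on this split, but having the two spans separately makes it easier to inspect examples or to build per-span features later.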
Approach (Baseline Model)
For the initial draft stage, our team created a baseline system to generate a valid leaderboard submission.
I completed the implementation using a Kaggle Notebook, which allowed easy dataset upload, training, and reproducible inference.
Steps performed:
- Uploaded train.csv and test.csv to Kaggle.
- Loaded the dataset using pandas.
- Used TF–IDF vectorization with:
  - word n-grams: (1, 2)
  - character n-grams: (3, 5)
- Trained a Logistic Regression model with class_weight="balanced".
- Used a train/validation split to estimate macro-F1 performance.
- Retrained on the full dataset and generated predictions for the test set.
- Created submission.csv directly in Kaggle and downloaded it for leaderboard upload.
This baseline served as our team's first valid submission.
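The steps above can be sketched roughly as follows with pandas and scikit-learn. This is an illustration under stated assumptions, not the team's exact notebook: the 80/20 stratified split, max_iter=1000, the random seed, applying TF–IDF to the whole TEXT string, and the ID/LABEL submission columns are all guesses that the summary does not confirm. The function names are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline


def build_baseline() -> Pipeline:
    """Word (1, 2) and character (3, 5) TF-IDF features feeding Logistic Regression."""
    features = FeatureUnion([
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
    ])
    # max_iter is an assumption; class_weight="balanced" comes from the summary.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    return Pipeline([("features", features), ("clf", clf)])


def validate(train: pd.DataFrame, seed: int = 0) -> float:
    """Estimate macro-F1 on a held-out split (80/20 stratified is an assumption)."""
    x_tr, x_val, y_tr, y_val = train_test_split(
        train["TEXT"], train["LABEL"],
        test_size=0.2, stratify=train["LABEL"], random_state=seed,
    )
    model = build_baseline().fit(x_tr, y_tr)
    return f1_score(y_val, model.predict(x_val), average="macro")


def make_submission(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Retrain on all training rows and predict labels for the test set."""
    model = build_baseline().fit(train["TEXT"], train["LABEL"])
    # Column names assumed to match sample_submission.csv.
    return pd.DataFrame({"ID": test["ID"], "LABEL": model.predict(test["TEXT"])})
```

In a notebook this might be used as `train = pd.read_csv("train.csv")`, `print(validate(train))`, and then `make_submission(train, pd.read_csv("test.csv")).to_csv("submission.csv", index=False)`, with the file paths adjusted to wherever Kaggle mounts the uploaded data.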
Leaderboard Submission
- Submission name: DataMinds
- File generated through our Kaggle notebook workflow
- Evaluated using the Macro-F1 metric on the hidden test labels
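For reference, macro-F1 is the unweighted mean of the per-class F1 scores, so the "different authors" class counts as much as the "same author" class regardless of how many examples each has. The sketch below computes it by hand on made-up labels; the function names are illustrative, and the leaderboard's own scoring code is not shown in this summary.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, treating that class as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # Return 0.0 when the class never appears in either the labels or predictions.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Made-up labels: class 1 has F1 = 0.8, class 0 has F1 = 2/3.
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
print(round(macro_f1(y_true, y_pred), 4))  # 0.7333
```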
Code Repository
All code for this project is maintained in our shared GitHub Classroom repository:
https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-Chandana3940
This repository uses the standard course template and contains our baseline methods.