Bag of n-grams & TFIDF scores of n-grams approach for Text Classification
Author: josefgarcia
— class competition —

| Leaderboard score | 0.52563 |
|---|---|
| Leaderboard team name | Joey Garcia |
| Kaggle username | youknowjoey |
| Code Repository URL | https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-YouKnowJoey |
Task summary
See the rubric
For this class competition, the task is to classify text snippets into two categories: the goal is to determine whether a snippet is similar or not based on its content. The dataset contains an ID column, the text snippet, and a label indicating whether the snippet is similar (1) or not similar (0). The training set contains 1,601 samples, while the test set contains 900 samples. The evaluation metric for this task is accuracy, which measures the proportion of correctly classified snippets.
Exploratory data analysis
Initial analysis revealed a notable class imbalance, with the majority class (label 0) significantly outnumbering the minority class (label 1). Specifically, the training data contains 1,245 samples labeled as 0, compared to 356 samples labeled as 1. To address this imbalance, the model was trained using class weighting in logistic regression, and evaluation focused on metrics sensitive to the minority class, such as F1-score and recall.
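As a sanity check, the weights that scikit-learn's `class_weight="balanced"` option applies can be reproduced from the label counts above. The `labels` array here is synthetic, constructed only to match the reported counts, not the actual training labels:

```python
import numpy as np
from collections import Counter
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels matching the reported counts: 1,245 zeros and 356 ones
labels = np.array([0] * 1245 + [1] * 356)
print(Counter(labels))  # Counter({0: 1245, 1: 356})

# class_weight="balanced" assigns w_c = n_samples / (n_classes * count_c),
# so the rare class 1 receives a proportionally larger weight
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=labels)
print(dict(zip([0, 1], weights.round(3))))
```

The minority class ends up weighted roughly 3.5 times more heavily than the majority class, which is what lets logistic regression pay attention to label 1 despite the imbalance.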
Additional visualizations can be found in the nlp_pipeline.ipynb notebook.
Approach
For input design, the text snippets are first preprocessed through lowercasing, punctuation removal, and tokenization using NLTK. To convert the text into numerical features, I use a bag-of-n-grams representation combined with TF-IDF weighting. This approach captures both term frequency and the contextual relevance of n-gram patterns, enabling the classifier to better distinguish meaningful signals across classes.
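A minimal sketch of this pipeline using scikit-learn: `TfidfVectorizer` builds the TF-IDF-weighted bag of n-grams (uni- and bigrams here; the `(1, 2)` range, the toy snippets, and the hyperparameters are illustrative assumptions, and the actual notebook may also apply NLTK-based preprocessing first), feeding a class-weighted logistic regression:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Bag-of-n-grams with TF-IDF weighting, then a class-weighted classifier
clf = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("logreg", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Toy snippets standing in for the real training data
texts = ["the cats sat on the mat", "the cat sat on a mat",
         "stock prices fell sharply", "markets dropped at the open"]
labels = [1, 1, 0, 0]

clf.fit(texts, labels)
print(clf.predict(["a cat on the mat"]))  # [1]
```

Wrapping both steps in a `Pipeline` ensures the vectorizer is fit only on training data, avoiding vocabulary leakage from the development or test sets.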
Results
These scores come from the classification report on the development set, which contains 250 majority-class samples (label 0) and 71 minority-class samples (label 1).
- Validation Accuracy: 0.589
- F1-Score: 0.70
- Recall: 0.48 (minority class identification)
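Metrics like these can be computed with scikit-learn's metrics utilities; the `y_true`/`y_pred` arrays below are toy stand-ins, not the actual development-set predictions:

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Toy labels and predictions (the real report used the 321-sample dev set)
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0]

print(accuracy_score(y_true, y_pred))             # fraction correct: 4/6
print(recall_score(y_true, y_pred, pos_label=1))  # minority-class recall
print(f1_score(y_true, y_pred, pos_label=1))      # harmonic mean of P and R
```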
The model shows moderate accuracy of about 59%, but the minority-class recall (0.48) is substantially higher than without class balancing: the model now captures roughly half of the rare-class instances. This reflects the model's improved ability to identify the minority class.
Error analysis
Many instances of the minority class (label 1) were misclassified as the majority class (label 0). These false negatives (FN) reduce recall, highlighting that certain patterns in the minority class are too subtle for the current features.
Some majority class instances were incorrectly predicted as the minority class. The precision for label 1 is only 0.26, showing that a large portion of predicted minority labels actually belong to the majority class. This is likely due to overlapping vocabulary or phrases that appear in both classes, hence the necessity for a more complex model or additional feature engineering.
Reproducibility
Follow the instructions in the README.md
Future Improvements
First and foremost, to improve this approach, I would explore additional features that provide richer contextual information. Incorporating word embeddings, such as GloVe or FastText, would enable the model to capture semantic relationships beyond exact n-grams. To fully model the complexity of this dataset, I would combine embeddings with sequence-based models, such as RNNs, LSTMs, or transformers.