Bag of n-grams & TFIDF scores of n-grams approach for Text Classification
Author: josefgarcia
— class competition —

| Leaderboard score | 0.52563 |
|---|---|
| Leaderboard team name | Joey Garcia |
| Kaggle username | youknowjoey |
| Code Repository URL | https://github.com/uazhlt-ms-program/ling-582-fall-2025-class-competition-code-YouKnowJoey |
Task summary
See the rubric
For this class competition, the task is to classify text snippets into two categories: the goal is to determine whether a snippet is similar or not based on its content. The dataset contains an ID column, the text snippet, and a label indicating whether the snippet is similar (1) or not similar (0). The training set contains 1,601 samples, while the test set contains 900 samples. The evaluation metric for this task is accuracy, which measures the proportion of correctly classified snippets.
Exploratory data analysis
Initial analysis revealed a notable class imbalance, with the majority class (label 0) significantly outnumbering the minority class (label 1). Specifically, the training data contains 1,245 samples labeled as 0, compared to 356 samples labeled as 1. To address this imbalance, the model was trained using class weighting in logistic regression, and evaluation focused on metrics sensitive to the minority class, such as F1-score and recall.
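As a sanity check, the weights that scikit-learn's `class_weight="balanced"` option applies can be reproduced from the label counts above. The `labels` array here is synthetic, constructed only to match the reported counts, not the actual training labels:

```python
import numpy as np
from collections import Counter
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels matching the reported counts: 1,245 zeros and 356 ones
labels = np.array([0] * 1245 + [1] * 356)
print(Counter(labels))  # Counter({0: 1245, 1: 356})

# class_weight="balanced" assigns w_c = n_samples / (n_classes * count_c),
# so the rare class 1 receives a proportionally larger weight
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=labels)
print(dict(zip([0, 1], weights.round(3))))
```

The minority class ends up weighted roughly 3.5 times more heavily than the majority class, which is what lets logistic regression pay attention to label 1 despite the imbalance.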
Additional visualizations can be found in the nlp_pipeline.ipynb notebook.
Approach
For input design, the text snippets are first preprocessed through lowercasing, punctuation removal, and tokenization using NLTK. To convert the text into numerical features, I use a bag-of-n-grams representation combined with TF-IDF weighting. This approach captures both term frequency and the contextual relevance of n-gram patterns, enabling the classifier to better distinguish meaningful signals across classes.
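A minimal sketch of this pipeline using scikit-learn: `TfidfVectorizer` builds the TF-IDF-weighted bag of n-grams (uni- and bigrams here; the `(1, 2)` range, the toy snippets, and the hyperparameters are illustrative assumptions, and the actual notebook may also apply NLTK-based preprocessing first), feeding a class-weighted logistic regression:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Bag-of-n-grams with TF-IDF weighting, then a class-weighted classifier
clf = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("logreg", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Toy snippets standing in for the real training data
texts = ["the cats sat on the mat", "the cat sat on a mat",
         "stock prices fell sharply", "markets dropped at the open"]
labels = [1, 1, 0, 0]

clf.fit(texts, labels)
print(clf.predict(["a cat on the mat"]))  # [1]
```

Wrapping both steps in a `Pipeline` ensures the vectorizer is fit only on training data, avoiding vocabulary leakage from the development or test sets.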
Results
These scores come from the classification report on the development set, which contains 250 majority-class samples (label 0) and 71 minority-class samples (label 1).
- Validation Accuracy: 0.589
- F1-Score: 0.70
- Recall: 0.48 (minority class identification)
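Metrics like these can be computed with scikit-learn's metrics utilities; the `y_true`/`y_pred` arrays below are toy stand-ins, not the actual development-set predictions:

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Toy labels and predictions (the real report used the 321-sample dev set)
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0]

print(accuracy_score(y_true, y_pred))             # fraction correct: 4/6
print(recall_score(y_true, y_pred, pos_label=1))  # minority-class recall
print(f1_score(y_true, y_pred, pos_label=1))      # harmonic mean of P and R
```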
The model shows moderate accuracy of about 59%, but the minority-class recall (0.48) is substantially higher than without class balancing: the model now captures roughly half of the rare-class instances. This reflects the model's improved ability to identify the minority class.
Error analysis
Many instances of the minority class (label 1) were misclassified as the majority class (label 0). These false negatives (FN) reduce recall, highlighting that certain patterns in the minority class are too subtle for the current features.
Some majority class instances were incorrectly predicted as the minority class. The precision for label 1 is only 0.26, showing that a large portion of predicted minority labels actually belong to the majority class. This is likely due to overlapping vocabulary or phrases that appear in both classes, hence the necessity for a more complex model or additional feature engineering.
Reproducibility
Follow the instructions in the README.md
Future Improvements
First and foremost, to improve this approach, I would explore additional features that provide richer contextual information. Incorporating word embeddings, such as GloVe or FastText, would enable the model to capture semantic relationships beyond exact n-grams. To fully model the complexity of this dataset, I would combine embeddings with sequence-based models, such as RNNs, LSTMs, or transformers.