LING 539

Evaluation

The evaluation metric for this competition is the macro F1 score (i.e., the unweighted mean F1). The F1 score, commonly used in information retrieval, measures accuracy using the statistics precision $$\text{p}$$ and recall $$\text{r}$$.

Precision is the ratio of true positives $$\text{tp}$$ to all predicted positives $$\text{tp} + \text{fp}$$. Recall is the ratio of true positives $$\text{tp}$$ to all actual positives $$\text{tp} + \text{fn}$$. The F1 score is given by:

$\text{F1} = 2\frac{\text{p} \cdot \text{r}}{\text{p}+\text{r}}\ \ \mathrm{where}\ \ \text{p} = \frac{\text{tp}}{\text{tp}+\text{fp}},\ \ \text{r} = \frac{\text{tp}}{\text{tp}+\text{fn}}$

The F1 metric weighs recall and precision equally: moderately good performance on both is favored over extremely good performance on one and poor performance on the other. The macro-averaged F1 is the unweighted mean of the per-class F1 scores, so it balances performance across all three classes in the data regardless of how frequent each class is.
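In practice you would probably use a library implementation such as scikit-learn's `f1_score(..., average='macro')`, but as a sketch of what the metric computes, here is a minimal hand-rolled version (the labels 0, 1, 2 are assumed from the submission format below):

```python
def macro_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Unweighted mean of the per-class F1 scores."""
    f1s = []
    for c in labels:
        # Treat class c as "positive" and everything else as "negative".
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        f1s.append(f1)
    # Macro averaging: every class contributes equally, however rare it is.
    return sum(f1s) / len(f1s)
```

Note that a class with no true positives contributes an F1 of 0 to the average, so ignoring a rare class is heavily penalized.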

Submission Format

You must produce a single submission file based on test.csv containing exactly two columns: ID and LABEL.

The file should contain a header and have the following format:

ID,LABEL
18742,0
14108,1
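One easy way to produce a file in this format is with the standard-library `csv` module (the function name and the default path here are just illustrative):

```python
import csv

def write_submission(ids, labels, path="submission.csv"):
    """Write predictions as a two-column ID,LABEL file with a header."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ID", "LABEL"])  # header row required by the format
        for i, lab in zip(ids, labels):
            writer.writerow([i, lab])
```

The `ids` should come from `test.csv` in their original order, paired with your model's predicted label for each one.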

The Kaggle competition is configured to allow a fixed number of submissions from each student per day.

Kaggle Leaderboard

Your submission on Kaggle will be scored automatically on the test set, and Kaggle will keep track of the performance of everyone's best submitted models on a Leaderboard. There is also a "benchmark" score on the leaderboard, which is the score that the sample submission would receive. (The sample submission was created by randomly assigning one of the three labels to each object in the test set, with equal probability. It's not a very clever approach, and it doesn't score very well.)
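For reference, the benchmark described above can be reproduced with a few lines of Python; this is only a sketch of the random-assignment idea, not the exact code used to build the sample submission:

```python
import random

def random_baseline(ids, labels=(0, 1, 2), seed=0):
    """Assign each test ID one of the labels uniformly at random,
    mimicking the sample-submission benchmark."""
    rng = random.Random(seed)  # fixed seed so the baseline is reproducible
    return [(i, rng.choice(labels)) for i in ids]
```

With three equally likely labels, a baseline like this hovers around chance-level macro F1, which is why beating the benchmark is the bare minimum to aim for.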

However, as we've learned in class, a machine-learning-based model's performance can differ substantially from one dataset to another (i.e., high variance). Our goal is to reduce this variance as much as is practical while maintaining the generality of the model (i.e., keeping its overall bias low as well). For this reason, the Kaggle leaderboard will be generated in two ways:

  • Before the competition cut-off date, a random sample of half of the testing data will be used for scores on the leaderboard, to assign interim rankings.
  • After the competition cut-off date, the other half of the testing data will be used for scores on the leaderboard, to assign final rankings.

Your goal is to create a model that scores well regardless of which set of data is used for evaluation. Ideally, you'd like to maximize the performance of your model on both sets of test data: the set whose score you can see daily prior to the end of the competition, as well as the set whose score you can't see until the competition is over.

Good luck, have fun, and don't crash your computer!