Statistical Detection of LLM-Generated Text Using Linguistic/Probabilistic Features
Author: dmihaylov
| Code Repository URL | https://github.com/uazhlt-ms-program/ling-582-fall-2025-course-project-code-syntax-error-404 |
|---|---|
| Demo URL (optional) | Not Applicable |
| Team name | Syntax Error 404 |
Project description
For my project, I wanted to explore whether simple statistical features are still useful for distinguishing human text from text generated by an LLM. Instead of relying on model-dependent signals, I focused entirely on lightweight, interpretable features such as lexical diversity, hapax ratios, character-level entropy, punctuation patterns, and basic sentence-length statistics. The genre I chose is science fiction, since it lets me control the style and makes the human-versus-LLM distinction more interesting than shorter or more formulaic text. Human data comes from public-domain sci-fi authors, and all LLM-generated data was produced with Llama-3-8B-Instruct. In terms of novelty, this approach intentionally avoids any neural scoring or perplexity-based signals. Much of the related work focuses on token-level probabilities and curvature metrics, but even those approaches struggle to generalize across models and domains. Pitting simple features against increasingly human-like outputs is what makes the approach in this project novel.
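To make the feature set concrete, here is a minimal sketch of how these statistics could be computed from raw text. The function name and exact regexes are illustrative, not the project's actual implementation:

```python
import math
import re
from collections import Counter

def extract_features(text: str) -> dict:
    """Illustrative statistical features: lexical diversity, hapax ratio,
    character entropy, punctuation density, sentence-length stats."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(tokens)
    n = len(tokens) or 1

    # Lexical diversity: unique word types over total tokens
    type_token_ratio = len(counts) / n
    # Hapax ratio: share of tokens whose word appears exactly once
    hapax_ratio = sum(1 for c in counts.values() if c == 1) / n

    # Character-level Shannon entropy (bits per character)
    char_counts = Counter(text)
    total = sum(char_counts.values()) or 1
    char_entropy = -sum((c / total) * math.log2(c / total)
                        for c in char_counts.values())

    # Punctuation density: punctuation marks per character
    punct_density = len(re.findall(r"[.,;:!?\"'()-]", text)) / max(len(text), 1)

    # Sentence-length statistics: mean and std of words per sentence
    sents = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lens = [len(s.split()) for s in sents] or [0]
    mean_len = sum(lens) / len(lens)
    var_len = sum((x - mean_len) ** 2 for x in lens) / len(lens)

    return {
        "type_token_ratio": type_token_ratio,
        "hapax_ratio": hapax_ratio,
        "char_entropy": char_entropy,
        "punct_density": punct_density,
        "mean_sent_len": mean_len,
        "sent_len_std": var_len ** 0.5,
    }
```

Every feature here is a plain count or ratio over the surface text, which is what keeps the detector model-independent and interpretable.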
Timeline:
11/3–11/10: Collect human and LLM-generated text
11/11–11/17: Extract statistical features and build baseline classifier
11/18–11/30: Run experiments and evaluate feature importance/performance
12/1–12/10: Final analysis and write-up
Summary of individual contributions
| Team member | Role/contributions |
|---|---|
| Dimitri Mihaylov | Everything |
Results
I trained a logistic regression classifier with standardization on nine statistical features extracted from the text. On a stratified 70/30 train–test split, the model achieved the following:
- ROC–AUC: 1.00 on the small test set
- Confusion matrix:
  - Human: 3/3 correctly classified
  - LLM: 2/3 correctly classified
Because the dataset is intentionally small and controlled, I also evaluated robustness using stratified 5-fold cross-validation, which produced:
- Mean accuracy: 0.70 versus the 0.50 baseline
- Mean ROC–AUC: 0.90
So, even though these features are extremely simple, and the LLM outputs are similar to the human texts in terms of style, there is still enough statistical signal to do noticeably better than chance. The feature coefficients show that higher type/token ratios and higher hapax ratios tend to align with LLM-generated text, while higher character entropy and more varied sentence lengths are more aligned with human writing in this case.
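The evaluation setup above can be sketched as follows. This assumes a feature matrix `X` of shape (n_samples, 9) and binary labels `y` (0 = human, 1 = LLM); the random toy data stands in for the real features and is purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 9))   # stand-in for the 9 statistical features
y = np.array([0, 1] * 10)      # stand-in for human (0) / LLM (1) labels

# Standardization followed by logistic regression, as described above
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified 70/30 train-test split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
clf.fit(X_tr, y_tr)
test_acc = clf.score(X_te, y_te)

# Stratified 5-fold cross-validation for a robustness check
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")

# Coefficients on standardized features show which direction each
# feature pushes the prediction (positive -> toward the LLM class)
coefs = clf.named_steps["logisticregression"].coef_[0]
```

Because the features are standardized before fitting, the signs and relative magnitudes of `coefs` can be read directly as the feature-direction claims made above.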
Error analysis
I spent some time digging into the mistakes the classifier made, and a couple of patterns stood out. One LLM-generated paragraph, which described the structure of an academic-style report, was consistently misclassified as human. The model's reasoning makes sense given the feature values: it was a single long sentence with moderate lexical diversity, low punctuation density, and nothing especially template-like about its structure. In other words, it didn't match the cleaner, more modular sentences that the LLM tended to produce elsewhere in the dataset. Conversely, a few human-written passages that were short, polished, and more formal than the rest were sometimes predicted as LLM. These tended to have lower sentence-length variance and more uniform phrasing, which accidentally mimicked the outputs of the model I used. Overall, the classifier seems most confident on informal human paragraphs with noisy character distributions and mixed sentence lengths, and on shorter, highly structured LLM paragraphs. It struggles on boundary cases where the human writing is cleaner and more consistent, or where the LLM produces text that is surprisingly varied. These examples highlight exactly where surface-level statistics start to break down as human and machine styles converge.
Reproducibility
This is covered in my README, but I wanted to add it here as well. For Python setup, run the following commands:
```
pip install -r requirements.txt
python make_dataset.py
python train_and_evaluate.py
```
For Docker setup, run the following commands:
```
docker build -t sf-detector .
docker run --rm sf-detector
```
Future improvements
There are a couple of limitations worth pointing out. The dataset is intentionally small and heavily constrained to a single genre, which makes it great for controlled analysis but not reflective of the full range of human or LLM writing. The feature set is also deliberately minimal for the sake of local build performance; it ignores higher-level linguistic cues like POS distributions, discourse markers, and n-gram burstiness that may have improved model performance while staying interpretable. Finally, because everything was trained and tested on a single LLM, generalization across different model families remains an open question. If I were to improve this project, the first step would be scaling up the dataset and adding multiple open-weight LLMs to see whether these simple signals generalize past a single generator. I'd also like to incorporate features like POS entropy or clause-level complexity that preserve interpretability while capturing more structure than raw token/character counts. Finally, prompting the LLM to intentionally mimic human statistical signatures would be a natural way to probe the limits of simple detectors.