Stylometric–Transformer Hybrid Model for Authorship Verification
Author: binduvelpula
— paper summary —

Citation
Faumi, Nabeelah; Gunathilake, Adeepa; Wickramanayake, Benura; Dias, Deelaka; Sumanathilaka, T. G. D. K. (2025). Stylomech: Unveiling Authorship via Computational Stylometry in English and Romanized Sinhala. https://doi.org/10.48550/arXiv.2501.09561
| Paper | Stylomech: Unveiling Authorship via Computational Stylometry in English and Romanized Sinhala |
|---|---|
| Authors | Nabeelah Faumi, Adeepa Gunathilake, Benura Wickramanayake, Deelaka Dias, T. G. D. K. Sumanathilaka |
| Year published | 2025 |
| Venue | arXiv (cs.CL) |
| Paper URL | https://arxiv.org/abs/2501.09561 |
| Code URL | None |
Description
Stylomech extracts stylometric features from text pairs to verify authorship in English and Romanized Sinhala, enabling accurate author comparison with limited data.
Motivation
This paper demonstrates a hybrid approach that can be extended with transformer embeddings for robust authorship verification in low-resource or multilingual settings.
LLM
| LLM model | GPT-4o |
|---|---|
| LLM model version | May 2024 Version |
| Model/service URL | https://chatgpt.com |
| Why this model? | GPT-4o demonstrates superior reasoning when handling academic PDF inputs. It excels at synthesizing technical methodologies and produces structured, consistent output for comparison tasks. Its ability to understand context within Romanized non-English scripts (like Romanized Sinhala) makes it well suited to this paper. |
Description (LLM)
System Prompt
You are a Natural Language Processing (NLP) researcher specializing in forensic linguistics and authorship verification. Read the provided research paper "Stylomech" and answer the following questions clearly and concisely. Compare the results with current state-of-the-art methods.

In the LLM's words, what is this paper about?
Prompt
what is this paper about?

This paper introduces "Stylomech," a framework designed for Authorship Verification (AV)—the task of determining whether two texts were written by the same person. The study focuses specifically on two linguistic contexts:
- English
- Romanized Sinhala (Sinhala written using English characters, often used in social media).

The paper addresses the challenge of identifying authorship in short text samples (like tweets or SMS) where traditional deep learning models might struggle due to data sparsity. It utilizes Computational Stylometry—the measurement of style markers like sentence length, vocabulary richness, and punctuation usage—to build a robust verification system.
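The style markers mentioned above (sentence length, vocabulary richness, punctuation usage) can be sketched as a toy extractor. This is purely illustrative Python, not the paper's 55-feature implementation; the sentence-splitting heuristic and feature names are my own:

```python
import string

def style_markers(text: str) -> dict:
    """Toy stylometric profile: a few illustrative markers, not Stylomech's feature set."""
    words = text.split()
    # Crude sentence split on terminal punctuation; real pipelines use a proper tokenizer.
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    return {
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "avg_sent_len": len(words) / max(len(sentences), 1),
        # Type-token ratio as a simple vocabulary-richness proxy.
        "vocab_richness": len(set(w.lower() for w in words)) / max(len(words), 1),
        "punct_rate": sum(c in string.punctuation for c in text) / max(len(text), 1),
    }

profile = style_markers("Short texts are hard. Style still leaks through, though!")
```

Each marker is a single number, so a text becomes a fixed-length numeric vector regardless of its length—which is what makes short-text comparison tractable.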
What are the authors proposing?
Prompt
What are the authors proposing?

The authors propose a comprehensive pipeline for authorship verification that relies on explicit feature engineering rather than "black-box" embeddings alone. Their proposal includes:
- A Stylometric Feature Set: They propose extracting 55 distinct features categorized into:
  - Lexical features (e.g., average word length, vocabulary richness).
  - Syntactic features (e.g., punctuation frequency, function words).
  - Structural features (e.g., paragraph length).
- Distance Metric Evaluation: They propose using absolute differences and cosine similarity between feature vectors of text pairs to quantify "style distance."
- Romanized Sinhala Dataset: They introduce a novel dataset approach for Romanized Sinhala, addressing a gap in low-resource language forensics.
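The distance-metric idea above—absolute differences plus cosine similarity between two feature vectors—can be sketched in a few lines. The feature values below are made up for illustration:

```python
import numpy as np

def style_distance(vec_a, vec_b):
    """Quantify the style gap between two texts' stylometric feature vectors."""
    a, b = np.asarray(vec_a, float), np.asarray(vec_b, float)
    abs_diff = np.abs(a - b)  # per-feature gap, usable as classifier input
    cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return abs_diff, cos_sim

# Hypothetical 3-feature vectors (e.g., avg word length, richness, punct rate).
diff, sim = style_distance([4.2, 0.61, 0.05], [4.0, 0.58, 0.07])
```

Cosine similarity summarizes overall closeness in one scalar, while the absolute-difference vector preserves which individual markers diverge—useful for the interpretability argument made later in the paper.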
What is the motivation for the work?
Prompt
What is the motivation for the work?

The primary motivation is the rise of cybercrimes, such as phishing, harassment, and the spread of fake news, where the identity of the perpetrator is often hidden behind anonymous text. Specific motivating factors include:
- Data Scarcity: Deep learning models (like BERT) require massive datasets. In forensic scenarios, investigators often possess only short, limited writing samples.
- Language Gaps: There is a significant lack of tools for "code-mixed" or transliterated languages like Romanized Sinhala, which is prevalent in Sri Lankan online communication.
- Interpretability: In legal contexts, explainable features (e.g., "the suspect uses commas frequently") are often preferred over opaque neural network probabilities.

What is the approach or innovation?
Prompt
What is the approach or innovation?

The core innovation is the Stylomech Framework, which functions as a comparator.
The Approach:
- Preprocessing: Cleaning text (removing URLs, emojis) while preserving stylometric markers like capitalization and punctuation.
- Feature Extraction: The system extracts a vector of numerical values representing the writing style for Text A and Text B.
- Vector Combination: Instead of feeding raw text to a classifier, they feed the difference between the two feature vectors (Feature Vector A - Feature Vector B).
- Classification: They employ machine learning classifiers (Random Forest, XGBoost, SVM) to predict "Same Author" or "Different Author" based on these differences.

Innovation: The application of these techniques specifically to Romanized Sinhala, creating a precedent for analyzing transliterated South Asian languages in forensic contexts.
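Steps 2–4 can be sketched end to end. This is not the authors' code: the difference vectors below are synthetic stand-ins (same-author pairs get small per-feature gaps, different-author pairs larger ones), with scikit-learn's RandomForestClassifier playing the role of the paper's classifier:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_features = 55  # mirrors the paper's feature count; all data here is synthetic

# Stand-in difference vectors |feature vector A - feature vector B| per text pair.
same_gaps = np.abs(rng.normal(0.0, 0.1, size=(200, n_features)))  # same author: small gaps
diff_gaps = np.abs(rng.normal(0.0, 1.0, size=(200, n_features)))  # different author: large gaps
X_train = np.vstack([same_gaps, diff_gaps])
y_train = np.array([1] * 200 + [0] * 200)  # 1 = same author, 0 = different author

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Fresh pairs from the same two regimes, to sanity-check the learned decision rule.
X_test = np.vstack([np.abs(rng.normal(0.0, 0.1, size=(50, n_features))),
                    np.abs(rng.normal(0.0, 1.0, size=(50, n_features)))])
y_test = np.array([1] * 50 + [0] * 50)
acc = clf.score(X_test, y_test)
```

The key design choice it demonstrates is that the classifier never sees raw text—only the gap between two style profiles—so the same model applies to any language the feature extractor covers.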
What are the results and how do they compare with competing approaches?
Prompt
What are the results and how do they compare with competing approaches?

The authors achieved high performance metrics, demonstrating that stylometry remains competitive even against complex deep learning models for this specific task. Key Results:
- English Dataset: The model achieved an accuracy of approximately 88.2% using the Random Forest classifier.
- Romanized Sinhala: The model achieved an accuracy of 94.7%, significantly outperforming baselines.

Comparison: Compared to traditional Bag-of-Words (BoW) or simple TF-IDF approaches, Stylomech showed superior performance in handling short texts. While the authors did not extensively benchmark against Large Language Models (LLMs) in this paper, their results suggest that for limited data, stylometric engineering is more computationally efficient and comparable in accuracy to Transformer-based methods.
Is the comparison fair?
Prompt
Is the comparison fair?

The comparison is largely fair within the scope of "classical" machine learning.
- Strengths: They compared multiple classifiers (SVM, KNN, XGBoost, Random Forest) against each other using the same feature set, ensuring a fair internal benchmark.
- Limitations: The study could have been strengthened by comparing their results against a modern Transformer baseline (like mBERT or XLM-R) to definitively prove that stylometry is better or equal to Deep Learning for Romanized Sinhala. Without this, it is hard to say if it beats the current state-of-the-art (SOTA).
What are the takeaways according to the authors?
Prompt
What are the takeaways according to the authors?

The authors conclude that:
- Stylometry is Robust: Explicit linguistic features are highly effective for authorship verification, even in short texts.
- Language Independence: The same set of stylometric features works surprisingly well for both English and Romanized Sinhala, suggesting that "style" is a universal cognitive fingerprint.
- Classifier Importance: Tree-based models (Random Forest, XGBoost) generally outperformed linear models (SVM) and distance-based models (KNN) for this tabular feature data.
What are the takeaways according to you?
Prompt
What are the takeaways according to you?

From an NLP research perspective:
- Feature Engineering isn't dead: Despite the hype around LLMs, hand-crafted features (like counting punctuation) provide a high-signal, low-noise input that is crucial for forensics.
- Hybrid Potential: The best path forward (and relevant to your "Hybrid Stylometry–Transformer" project) is likely combining these stylometric feature vectors with dense semantic embeddings from Transformers.
- Code-Mixing is Hard: The success in Romanized Sinhala is impressive because transliterated text lacks standardized spelling. The fact that stylometry worked here implies that authors are consistent in how they misspell or transliterate words.
Would you use this? If so, how/where would you use this?
Prompt
Would you use this?

Yes, I would use this framework as a baseline and feature extractor for a larger system.
If so, how/where would you use this?
Prompt
If so, how/where would you use this?

I would use the Stylomech feature extraction module to generate the "Style" input for your specific "Hybrid Stylometry–Transformer Model."
- Preprocessing Pipeline: Use their code to clean the data.
- Feature Injection: Extract the 55 stylistic features they listed.
- Hybridization: Concatenate these 55 numbers with the 768-dimensional embedding vector from a model like RoBERTa.
- Forensic Tool: I would deploy this in scenarios involving anonymous threat detection or academic plagiarism where the text is too short for standard plagiarism checkers to catch.
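A minimal sketch of the hybridization step, assuming a 55-feature stylometric vector and a 768-dimensional RoBERTa-base sentence embedding (both filled with placeholder random values here):

```python
import numpy as np

rng = np.random.default_rng(0)
stylo_vec = rng.random(55)      # stand-in for the 55 Stylomech features
semantic_vec = rng.random(768)  # stand-in for a RoBERTa-base sentence embedding

# Standardize the stylometric block so neither representation dominates downstream.
stylo_norm = (stylo_vec - stylo_vec.mean()) / (stylo_vec.std() + 1e-9)

hybrid = np.concatenate([stylo_norm, semantic_vec])  # 823-dimensional fused input
```

Simple concatenation is only one fusion strategy; the scaling step matters because raw stylometric counts and embedding activations live on very different numeric ranges.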
What problems remain and what are the next steps?
Prompt
What problems remain and what are the next steps?

Remaining Problems:
- Topic Bias: Stylometric models risk learning topics (e.g., Sports vs. Cooking) instead of invariant style markers.
- Scalability: The quadratic growth of candidate pairs (N^2) poses computational challenges for large-scale verification.
- Baseline Gap: The lack of strong transformer comparisons (like mBERT) for Romanized Sinhala leaves the neural vs. stylometric question open.
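The scalability concern is easy to quantify: exhaustive pairwise verification over N documents requires N(N-1)/2 comparisons, which grows quadratically:

```python
from math import comb

# Exhaustive pairwise verification compares every document with every other one.
pairs = {n: comb(n, 2) for n in (100, 1_000, 10_000)}
# 100 docs -> 4,950 pairs; 10,000 docs -> 49,995,000 pairs.
```

This is why large-scale systems typically prune candidates (e.g., by blocking or nearest-neighbor search) before running the full pairwise comparator.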
Experience using the LLM
Describe your process for using the LLM. How did the LLM perform?
I used GPT-4o with a technical researcher persona. It successfully identified key feature categories and accuracy metrics (88.2% and 94.7%). However, it initially hallucinated a BERT comparison, which required a follow-up prompt to correct.
Errors and limitations of the LLM
Where did it fall short or make mistakes?
- Feature Counts: Incorrectly estimated "over 100" features initially; corrected to 55 after specific prompting.
- Context: Misidentified "Romanized Sinhala" as translation rather than transliteration, requiring clarification on the alphabet used.