LING 582 (FA 2025)

Comparing Decoder-Only Language Models for Creative Text Generation

Author: pranithachilvari

course project, language models, decoder-only transformers · 2 min read

Course Project Info
  • Code Repository URL: https://github.com/uazhlt-ms-program/ling-582-fall-2025-course-project-code-pranitha-chandana
  • Demo URL (optional): none
  • Team Name: Pranitha–Chandana

Project Proposal

Our project investigates how different open-source decoder-only language models behave when generating creative text. We focus on three models — GPT-2, GPT-2-Medium, and GPT-Neo 1.3B — and analyze how each one performs across a consistent set of short prompts, ranging from story beginnings to conversational turns.

This work fits within Statistical NLP because it examines generative model behavior, coherence patterns, and the qualitative characteristics of autoregressive text generation.


Project Description

The core objective is to compare how the three selected models respond to 15 creative and dialogue-based prompts. These prompts include:

  • Opening lines for fictional stories
  • User–assistant dialogue continuations
  • Basic knowledge questions
  • Instruction-following scenarios

Novelty

Rather than evaluating the models on standardized benchmarks, we assess their creative writing quality, narrative flow, and generative reliability. This allows us to observe subtle differences in hallucination, repetition, and stylistic behavior.

Motivation

Decoder-only transformer models vary widely in fluency and factual grounding. Understanding how they differ on short-form generative tasks provides insight into their limitations, especially for creative or user-facing applications.


Summary of Individual Contributions

| Team Member | Contributions |
| --- | --- |
| Pranitha | Executed GPT-2 and GPT-Neo generations, compiled error examples, wrote parts of methods and results. |
| Chandana | Designed prompt set, organized outputs, supported evaluation, contributed to interpretation and write-up. |

Both members participated fully in discussion, model comparison, and editing.


Methodology

Prompt Set

We constructed a set of 15 prompts that cover several categories:

  • Creative writing
  • Dialogue-style Q&A
  • Factual recall
  • Instruction following

Models Studied

  • GPT-2 (117M parameters)
  • GPT-2-Medium (345M)
  • GPT-Neo 1.3B

All experiments were run locally on CPU using Hugging Face `transformers` pipelines.

Generation Parameters

  • max_length = 200
  • num_return_sequences = 1
  • set_seed(42) for consistent outputs
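The setup above can be sketched as follows. Only the parameters listed (seed, `max_length`, `num_return_sequences`) are taken from the project; the exact script may differ, and the prompt shown is one of those quoted later in the error analysis:

```python
from transformers import pipeline, set_seed

set_seed(42)  # fixed seed, as in the project, for reproducible sampling

# Build a text-generation pipeline; swap in "gpt2-medium" or
# "EleutherAI/gpt-neo-1.3B" to reproduce the other two models.
generator = pipeline("text-generation", model="gpt2")

outputs = generator(
    "User: I've been feeling overwhelmed with college lately. Assistant:",
    max_length=200,
    num_return_sequences=1,
)
print(outputs[0]["generated_text"])
```

By default the pipeline returns the prompt plus its continuation in `generated_text`; passing `return_full_text=False` keeps only the continuation.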

Evaluation

All model responses were judged along:

  • Creativity
  • Fluency
  • Logical flow
  • Factual correctness
  • Degree of hallucination
  • Consistency in dialogue
  • Repetitive or degenerate patterns
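All of these criteria were judged manually. As an illustration only (not part of the original evaluation), the last criterion can be made quantitative with a simple word n-gram repetition score; the helper below is hypothetical:

```python
# Hypothetical helper (our illustration, not used in the project):
# fraction of repeated word n-grams in a generation.
# Values near 1.0 indicate degenerate, looping output.
def repetition_rate(text: str, n: int = 3) -> float:
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

# A degenerate output like GPT-2's repeated "I don't understand that"
# scores much higher than varied text.
print(repetition_rate("I don't understand that. " * 10))  # ≈ 0.89
```

A varied sentence with no repeated trigrams scores 0.0, so the metric separates normal text from the looping failures shown in the error analysis.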

Results

1. Creativity & Narrative Quality

  • GPT-Neo 1.3B produced the most coherent and engaging long-form outputs.
  • GPT-2-Medium showed moderate improvement over GPT-2.
  • GPT-2 struggled to stay on topic and often introduced unrelated ideas.

2. Factual Reliability

  • All models occasionally produced incorrect answers.
  • GPT-2 had the lowest factual accuracy.

3. Conversational Behavior

  • GPT-Neo handled conversational cues more effectively.
  • GPT-2 frequently mixed roles, repeated dialogue, or invented speakers.

Summary Table

| Model | Creativity | Coherence | Factuality | Dialogue Stability | Repetition |
| --- | --- | --- | --- | --- | --- |
| GPT-2 | Low | Low | Poor | Poor | High |
| GPT-2-Medium | Medium | Medium | Poor | Medium | Medium |
| GPT-Neo 1.3B | High | High | Moderate | High | Low |

Error Analysis

Three examples illustrate the common failure modes.

Error 1 — GPT-2 Creates Imaginary Dialogue

Prompt:
“User: I’ve been feeling overwhelmed with college lately. Assistant:”

GPT-2 responded with unrelated chatter and repeated statements, showing difficulty maintaining a stable assistant persona.


Error 2 — Repetitive Degeneration

Prompt:
“User: Imagine you are a time traveler from 2200. Introduce yourself.”

GPT-2 generated the phrase “I don’t understand that” repeatedly, a common failure mode for smaller autoregressive models.


Error 3 — Incorrect Factual Answer

Prompt:
“Who wrote Pride and Prejudice?”

GPT-2 never mentioned Jane Austen and instead produced irrelevant narrative text.

These examples highlight limitations in grounding, memory, and repetition control.


Reproducibility

All resources, code, and data used in this project are available in our repository:

Repository:
https://github.com/uazhlt-ms-program/ling-582-fall-2025-course-project-code-pranitha-chandana

Requirements

  • Python 3.9 or 3.11
  • transformers, torch, huggingface_hub

Install dependencies:

```
pip install transformers torch huggingface_hub
```

Future Improvements

  • Running larger models (e.g., LLaMA-2-Chat, Mistral-7B) in GPU environments
  • Adding automatic evaluation metrics such as MAUVE or perplexity
  • Collecting human ratings for narrative creativity
  • Exploring small-scale fine-tuning on story corpora
  • Expanding prompt diversity to include multi-turn dialogue
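As a starting point for the perplexity idea, here is a minimal sketch that scores a text under a causal language model (the scorer model and the helper function are our illustration, not part of the current project):

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def perplexity(text: str, model_name: str = "gpt2") -> float:
    """Perplexity of `text` under a causal LM (lower = more fluent)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids gives the mean next-token cross-entropy.
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return math.exp(loss.item())
```

This would let each model's outputs be scored automatically alongside the manual ratings, though perplexity alone says nothing about creativity or factuality.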