LING 582 (FA 2025)

Comparing Decoder-Only Language Models for Creative Text Generation

Author: pranithachilvari

course project, language models, decoder-only transformers · 2 min read

Course Project Info
  • Code Repository URL: https://github.com/uazhlt-ms-program/ling-582-fall-2025-course-project-code-pranitha-chandana
  • Demo URL (optional): none
  • Team Name: Pranitha–Chandana

Project Proposal

Our project investigates how different open-source decoder-only language models behave when generating creative text. We focus on three models — GPT-2, GPT-2-Medium, and GPT-Neo 1.3B — and analyze how each one performs across a consistent set of short prompts, ranging from story beginnings to conversational turns.

This work fits within Statistical NLP because it examines generative model behavior, coherence patterns, and the qualitative characteristics of autoregressive text generation.


Project Description

The core objective is to compare how the three selected models respond to 15 creative and dialogue-based prompts. These prompts include:

  • Opening lines for fictional stories
  • User–assistant dialogue continuations
  • Basic knowledge questions
  • Instruction-following scenarios

Novelty

Rather than evaluating the models on standardized benchmarks, we assess their creative writing quality, narrative flow, and generative reliability. This allows us to observe subtle differences in hallucination, repetition, and stylistic behavior.

Motivation

Decoder-only transformer models vary widely in fluency and factual grounding. Understanding how they differ on short-form generative tasks provides insight into their limitations, especially for creative or user-facing applications.


Summary of Individual Contributions

| Team Member | Contributions |
| --- | --- |
| Pranitha | Executed GPT-2 and GPT-Neo generations, compiled error examples, wrote parts of methods and results. |
| Chandana | Designed prompt set, organized outputs, supported evaluation, contributed to interpretation and write-up. |

Both members participated fully in discussion, model comparison, and editing.


Methodology

Prompt Set

We constructed a set of 15 prompts that cover several categories:

  • Creative writing
  • Dialogue-style Q&A
  • Factual recall
  • Instruction following

Models Studied

  • GPT-2 (117M parameters)
  • GPT-2-Medium (345M)
  • GPT-Neo 1.3B

All experiments were run locally on CPU using Hugging Face `transformers` pipelines.

Generation Parameters

  • max_length = 200
  • num_return_sequences = 1
  • set_seed(42) for consistent outputs
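The setup above can be sketched as follows. Only the parameters listed (seed, `max_length`, `num_return_sequences`) are taken from the project; the exact script may differ, and the prompt shown is one of those quoted later in the error analysis:

```python
from transformers import pipeline, set_seed

set_seed(42)  # fixed seed, as in the project, for reproducible sampling

# Build a text-generation pipeline; swap in "gpt2-medium" or
# "EleutherAI/gpt-neo-1.3B" to reproduce the other two models.
generator = pipeline("text-generation", model="gpt2")

outputs = generator(
    "User: I've been feeling overwhelmed with college lately. Assistant:",
    max_length=200,
    num_return_sequences=1,
)
print(outputs[0]["generated_text"])
```

By default the pipeline returns the prompt plus its continuation in `generated_text`; passing `return_full_text=False` keeps only the continuation.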

Evaluation

All model responses were judged along:

  • Creativity
  • Fluency
  • Logical flow
  • Factual correctness
  • Degree of hallucination
  • Consistency in dialogue
  • Repetitive or degenerate patterns
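All of these criteria were judged manually. As an illustration only (not part of the original evaluation), the last criterion can be made quantitative with a simple word n-gram repetition score; the helper below is hypothetical:

```python
# Hypothetical helper (our illustration, not used in the project):
# fraction of repeated word n-grams in a generation.
# Values near 1.0 indicate degenerate, looping output.
def repetition_rate(text: str, n: int = 3) -> float:
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

# A degenerate output like GPT-2's repeated "I don't understand that"
# scores much higher than varied text.
print(repetition_rate("I don't understand that. " * 10))  # ≈ 0.89
```

A varied sentence with no repeated trigrams scores 0.0, so the metric separates normal text from the looping failures shown in the error analysis.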

Results

1. Creativity & Narrative Quality

  • GPT-Neo 1.3B produced the most coherent and engaging long-form outputs.
  • GPT-2-Medium showed moderate improvement over GPT-2.
  • GPT-2 struggled to stay on topic and often introduced unrelated ideas.

2. Factual Reliability

  • All models occasionally produced incorrect answers.
  • GPT-2 had the lowest factual accuracy.

3. Conversational Behavior

  • GPT-Neo handled conversational cues more effectively.
  • GPT-2 frequently mixed roles, repeated dialogue, or invented speakers.

Summary Table

| Model | Creativity | Coherence | Factuality | Dialogue Stability | Repetition |
| --- | --- | --- | --- | --- | --- |
| GPT-2 | Low | Low | Poor | Poor | High |
| GPT-2-Medium | Medium | Medium | Poor | Medium | Medium |
| GPT-Neo 1.3B | High | High | Moderate | High | Low |

Error Analysis

Three examples illustrate the common failure modes.

Error 1 — GPT-2 Creates Imaginary Dialogue

Prompt:
“User: I’ve been feeling overwhelmed with college lately. Assistant:”

GPT-2 responded with unrelated chatter and repeated statements, showing difficulty maintaining a stable assistant persona.


Error 2 — Repetitive Degeneration

Prompt:
“User: Imagine you are a time traveler from 2200. Introduce yourself.”

GPT-2 generated the phrase “I don’t understand that” repeatedly, a common failure mode for smaller autoregressive models.


Error 3 — Incorrect Factual Answer

Prompt:
“Who wrote Pride and Prejudice?”

GPT-2 never mentioned Jane Austen and instead produced irrelevant narrative text.

These examples highlight limitations in grounding, memory, and repetition control.


Reproducibility

All resources, code, and data used in this project are available in our repository:

Repository:
https://github.com/uazhlt-ms-program/ling-582-fall-2025-course-project-code-pranitha-chandana

Requirements

  • Python 3.9 or 3.11
  • transformers, torch, huggingface_hub

Install dependencies:

```
pip install transformers torch huggingface_hub
```

Future Improvements

  • Running larger models (e.g., LLaMA-2-Chat, Mistral-7B) in GPU environments
  • Adding automatic evaluation metrics such as MAUVE or perplexity
  • Collecting human ratings for narrative creativity
  • Exploring small-scale fine-tuning on story corpora
  • Expanding prompt diversity to include multi-turn dialogue
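As a starting point for the perplexity idea, here is a minimal sketch that scores a text under a causal language model (the scorer model and the helper function are our illustration, not part of the current project):

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def perplexity(text: str, model_name: str = "gpt2") -> float:
    """Perplexity of `text` under a causal LM (lower = more fluent)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids gives the mean next-token cross-entropy.
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return math.exp(loss.item())
```

This would let each model's outputs be scored automatically alongside the manual ratings, though perplexity alone says nothing about creativity or factuality.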