Course project: Comparing Retrieval Strategies for Multi-hop Question Answering on 2WikiMultihopQA
Author: 23721598
| Code Repository URL | https://github.com/uazhlt-ms-program/ling-582-fall-2025-course-project-code-yanyan |
|---|---|
| Demo URL (optional) | |
| Team name | yanyan |
Project description
Overview
This project compares retrieval strategies for multi-hop question answering using the 2WikiMultihopQA dataset. Multi-hop QA requires synthesizing information from multiple documents to answer a single question, making retrieval quality critical to system performance.
I compare three retrieval methods:
1. TF-IDF retrieval (statistical baseline). A classic bag-of-words approach using cosine similarity over TF-IDF representations.
2. Dense retrieval using Sentence Transformers. A neural embedding-based retriever using a small pre-trained encoder (e.g., multi-qa-MiniLM-L6-cos-v1) with vector similarity search.
3. Decomposition-based retrieval (simple multi-hop). This method first uses an LLM (Llama 3.2 3B) to decompose a complex multi-hop question into simpler sub-questions. Each sub-question is then used to retrieve relevant documents independently, and the results are merged. The idea is that simpler questions may yield more targeted retrieval than the original complex question.

Llama 3.2 (3B) via Ollama also serves as the QA reader for all three methods.
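As a sketch of what the TF-IDF baseline does (illustrative only; the repository's actual implementation may use a library vectorizer such as scikit-learn's), documents and the question are mapped to TF-IDF vectors and ranked by cosine similarity:

```python
import math
from collections import Counter

def tfidf(tokens, idf):
    """TF-IDF vector (term -> weight) for one tokenized text."""
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf.get(t, 0.0) for t, c in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_tfidf(question, docs, k=2):
    """docs: list of (title, text). Return top-k titles by cosine similarity."""
    tokenized = [text.lower().split() for _, text in docs]
    n = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log(n / c) + 1.0 for t, c in df.items()}  # smoothed IDF
    doc_vecs = [tfidf(toks, idf) for toks in tokenized]
    q_vec = tfidf(question.lower().split(), idf)
    ranked = sorted(range(n), key=lambda i: cosine(q_vec, doc_vecs[i]), reverse=True)
    return [docs[i][0] for i in ranked[:k]]
```

The dense and decomposition retrievers slot into the same interface: only the scoring function changes.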
Challenges of Multi-hop QA
Multi-hop QA is harder than single-hop QA because it requires finding and combining information from multiple documents. Here are the main challenges:
- Retrieval requires multiple documents. For example, to answer "Who is the mother of the director of Polish-Russian War (Film)?", the system must first find a document about the film to learn the director, then find a document about that director to learn his mother's name.
- Entity disambiguation. Many entities have similar names. For example, when searching for "John V, Prince of Anhalt-Zerbst", the retriever might return documents about John II, John VI, or other princes with similar titles.
- Implicit reasoning chains. The question doesn't tell you what intermediate information you need. For "When did John V's father die?", you must infer that you first need to find who the father is.
- Answer format variability. The same answer can appear in different forms ("Galați" vs. "Galați, Romania"), making exact-match evaluation difficult.
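Answer-format variability is commonly handled with SQuAD-style answer normalization before computing exact match. A minimal sketch (lowercasing, stripping punctuation and English articles; whether this project applies such normalization is an assumption):

```python
import re
import string

def normalize_answer(s):
    """SQuAD-style normalization: lowercase, drop punctuation,
    drop English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())
```

Note that this helps with casing and articles but not with genuinely extra tokens: "Galați" and "Galați, Romania" still normalize to different strings.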
Motivation
Retrieval quality is one of the main bottlenecks in multi-hop QA. While TF-IDF remains a strong baseline, dense retrieval and question decomposition may provide more relevant context for multi-step reasoning and question answering. This comparison will help clarify which retrieval strategies most effectively support multi-hop reasoning.
Timeline
11/17 - 11/21
- Preprocess the dataset
- Implement baseline retrieval (TF-IDF + dense embeddings)
11/22 - 11/28
- Develop multi-hop question decomposition and retrieval pipeline
- Integrate retrieval with a lightweight QA reader model
11/29 - 12/5
- Run full experiments and evaluate all retrieval settings
- Collect metrics and retrieval statistics
12/6 - 12/9
- Perform error analysis and finalize results
- Write reproducibility instructions and prepare the final summary
12/10
- Submit final pull request
Summary of individual contributions
| Team member | Role/contributions |
|---|---|
| Yanyan Dong | All |
Results
I evaluated three retrieval methods on 50 samples from the 2WikiMultihopQA dev set. The QA reader used Llama 3.2 (3B) via Ollama.
| Method | Retrieval Recall | Exact Match | F1 Score |
|---|---|---|---|
| TF-IDF | 0.475 | 0.180 | 0.284 |
| Dense | 0.550 | 0.260 | 0.373 |
| Decomposition | 0.490 | 0.100 | 0.202 |
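The EM and F1 columns follow the standard token-overlap scoring used for extractive QA. A minimal F1 sketch (whitespace tokenization, no answer normalization; the project's actual scorer may add normalization):

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    common = Counter(pred) & Counter(ref)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under this scoring, a prediction with one extra token still gets high partial credit even though EM is 0.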
Key findings:
- Dense retrieval outperforms TF-IDF across all metrics, achieving the highest retrieval recall (0.550) and the best QA performance (EM=0.260, F1=0.373). Semantic embeddings match questions to relevant passages better than keyword matching does.
- Decomposition underperforms expectations. Although decomposing questions into sub-questions should in principle help multi-hop reasoning, the small 3B model sometimes generates poor sub-questions, leading to worse retrieval than the simpler methods.
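To illustrate the decomposition step, a pipeline like this one typically prompts the LLM for numbered sub-questions and parses them out of the raw reply. The prompt wording and parsing below are hypothetical sketches, not the repository's exact code:

```python
import re

def decomposition_prompt(question):
    """Build a prompt asking the LLM to split a multi-hop question into
    numbered single-hop sub-questions. (Hypothetical wording.)"""
    return (
        "Break the following question into simpler sub-questions, "
        "one per line, numbered 1., 2., ...\n"
        f"Question: {question}\nSub-questions:"
    )

def parse_subquestions(response):
    """Extract numbered sub-questions from the model's raw text reply."""
    return [m.group(1).strip()
            for m in re.finditer(r"^\s*\d+\.\s*(.+)$", response, re.MULTILINE)]
```

When a small model numbers its output inconsistently or merges hops into one sub-question, parsing and retrieval both degrade, which is consistent with the 90% retrieval-error rate observed for this method.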
Error analysis
I performed detailed error analysis on 10 samples to understand failure cases.
Error type distribution
| Method | Correct | Retrieval Error | Reader Error |
|---|---|---|---|
| TF-IDF | 2/10 | 7/10 | 1/10 |
| Dense | 3/10 | 7/10 | 0/10 |
| Decomposition | 1/10 | 9/10 | 0/10 |
Example errors
Example 1: Retrieval failure
- Question: "Who is the mother of the director of film Polish-Russian War (Film)?"
- Gold docs: Xawery Żuławski, Polish-Russian War (film)
- TF-IDF retrieved: Polish-Russian War (film), Minamoto no Chikako
- Gold answer: "Małgorzata Braunek"
- Prediction: "Dorota Masłowska"
- The retriever found the film document but failed to retrieve the director's biography, leading to an incorrect answer.
Example 2: Reader failure
- Question: "Which film came out first, Blind Shaft or The Mask Of Fu Manchu?"
- TF-IDF retrieved the correct docs: Blind Shaft, The Mask of Fu Manchu
- Gold answer: "The Mask Of Fu Manchu"
- Prediction: "The Mask of Fu Manchu (1932)"
- The answer is semantically correct but includes extra year information, causing EM=0.
Example 3: Retrieval failure (entity disambiguation)
- Question: "When did John V, Prince Of Anhalt-Zerbst's father die?"
- Gold docs: John V, Prince of Anhalt-Zerbst; Ernest I, Prince of Anhalt-Dessau
- TF-IDF retrieved: Waldemar III, Prince of Anhalt-Zerbst; John VI, Prince of Anhalt-Zerbst
- The retriever confused similar entity names (John V vs. John VI, Waldemar III), failing to find the correct documents.
Findings
- Retrieval is the main bottleneck. 70-90% of errors come from not finding the right documents. Entity disambiguation is particularly challenging: searching for "John V" returns documents about John II, John VI, etc.
- Dense retrieval has zero reader errors. When it retrieves the correct documents, the LLM answers correctly in every case in this 10-sample analysis.
- Decomposition struggles with small models. The 3B model often generates poor sub-questions, leading to 90% retrieval errors. Larger models with better instruction following might decompose questions more reliably.
- Answer format affects EM scores. Many predictions are semantically correct but fail exact match due to minor formatting differences (e.g., added year or country information).
Reproducibility
See the README.md in the code repository for setup and running instructions.
```shell
# Setup
conda create -n multihop python=3.10 -y
conda activate multihop
pip install -r requirements.txt

# Install Ollama from https://ollama.com
ollama pull llama3.2:3b

# Download data from https://github.com/Alab-NII/2wikimultihop
# Unzip to data/ folder

# Run
python main.py  # Main experiments + error analysis
```
Future improvements
Based on the error analysis, here are potential improvements:
- Retrieve more documents. Currently we retrieve the top-2 documents, but multi-hop questions often need exactly 2 specific documents. Retrieving the top-3 or top-4 might improve recall.
- Iterative retrieval. Instead of retrieving all documents at once, retrieve one document, extract information from it, then retrieve the next document based on what was learned.
- Better answer extraction. Add post-processing to extract short answers from verbose LLM responses, improving exact match scores.
- Hybrid retrieval. Combine TF-IDF and dense scores to get the benefits of both keyword matching and semantic understanding.
- Fine-tune the dense encoder. The current model (multi-qa-MiniLM-L6-cos-v1) is general-purpose; fine-tuning on multi-hop QA data could improve retrieval for this task.
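The hybrid-retrieval idea can be sketched as simple score fusion. Min-max normalizing each score list keeps one retriever's scale from dominating the other; the equal weighting (alpha=0.5) and this normalization scheme are assumptions for illustration, not a tested design:

```python
def hybrid_scores(tfidf_scores, dense_scores, alpha=0.5):
    """Fuse per-document TF-IDF and dense scores (parallel lists) into
    one ranking score per document after min-max normalizing each list."""
    def minmax(xs):
        lo, hi = min(xs), max(xs)
        # Constant lists carry no ranking signal; map them to all zeros.
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    t, d = minmax(tfidf_scores), minmax(dense_scores)
    return [alpha * a + (1 - alpha) * b for a, b in zip(t, d)]
```

Documents would then be ranked by the fused score, with alpha tuned on a held-out split.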