Course project: Comparing Retrieval Strategies for Multi-hop Question Answering on 2WikiMultihopQA
Author: 23721598
| Code Repository URL | https://github.com/uazhlt-ms-program/ling-582-fall-2025-course-project-code-yanyan |
|---|---|
| Demo URL (optional) | |
| Team name | yanyan |
Project description
Overview
This project compares retrieval strategies for multi-hop question answering using the 2WikiMultihopQA dataset. Multi-hop QA requires synthesizing information from multiple documents to answer a single question, making retrieval quality critical to system performance.
I compare three retrieval methods:
1. TF-IDF retrieval (statistical baseline). A classic bag-of-words approach using cosine similarity over TF-IDF representations.
2. Dense retrieval using Sentence Transformers. A neural embedding-based retriever using a small pre-trained encoder (e.g., multi-qa-MiniLM-L6-cos-v1) with vector similarity search.
3. Decomposition-based retrieval (simple multi-hop). This method first uses an LLM (Llama 3.2 3B) to decompose a complex multi-hop question into simpler sub-questions. Each sub-question is then used to retrieve relevant documents independently, and the results are merged. The idea is that simpler questions may yield more targeted retrieval than the original complex question.

Llama 3.2 (3B) via Ollama also serves as the QA reader for all three methods.
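As a sketch of what the TF-IDF baseline does (illustrative only; the repository's actual implementation may use a library vectorizer such as scikit-learn's), documents and the question are mapped to TF-IDF vectors and ranked by cosine similarity:

```python
import math
from collections import Counter

def tfidf(tokens, idf):
    """TF-IDF vector (term -> weight) for one tokenized text."""
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf.get(t, 0.0) for t, c in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_tfidf(question, docs, k=2):
    """docs: list of (title, text). Return top-k titles by cosine similarity."""
    tokenized = [text.lower().split() for _, text in docs]
    n = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log(n / c) + 1.0 for t, c in df.items()}  # smoothed IDF
    doc_vecs = [tfidf(toks, idf) for toks in tokenized]
    q_vec = tfidf(question.lower().split(), idf)
    ranked = sorted(range(n), key=lambda i: cosine(q_vec, doc_vecs[i]), reverse=True)
    return [docs[i][0] for i in ranked[:k]]
```

The dense and decomposition retrievers slot into the same interface: only the scoring function changes.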
Challenges of Multi-hop QA
Multi-hop QA is harder than single-hop QA because it requires finding and combining information from multiple documents. Here are the main challenges:
- Retrieval requires multiple documents. For example, to answer "Who is the mother of the director of Polish-Russian War (Film)?", the system must first find a document about the film to learn the director, then find a document about that director to learn his mother's name.
- Entity disambiguation. Many entities have similar names. For example, when searching for "John V, Prince of Anhalt-Zerbst", the retriever might return documents about John II, John VI, or other princes with similar titles.
- Implicit reasoning chains. The question doesn't tell you what intermediate information you need. For "When did John V's father die?", you must infer that you first need to find who the father is.
- Answer format variability. The same answer can appear in different forms ("Galați" vs. "Galați, Romania"), making exact-match evaluation difficult.
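Answer-format variability is commonly handled with SQuAD-style answer normalization before computing exact match. A minimal sketch (lowercasing, stripping punctuation and English articles; whether this project applies such normalization is an assumption):

```python
import re
import string

def normalize_answer(s):
    """SQuAD-style normalization: lowercase, drop punctuation,
    drop English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())
```

Note that this helps with casing and articles but not with genuinely extra tokens: "Galați" and "Galați, Romania" still normalize to different strings.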
Motivation
Retrieval quality is one of the main bottlenecks in multi-hop QA. While TF-IDF remains a strong baseline, dense retrieval and question decomposition may provide more relevant context for multi-step reasoning and question answering. This comparison will help clarify which retrieval strategies most effectively support multi-hop reasoning.
Timeline
11/17 - 11/21
- Preprocess the dataset
- Implement baseline retrieval (TF-IDF + dense embeddings)
11/22 - 11/28
- Develop multi-hop question decomposition and retrieval pipeline
- Integrate retrieval with a lightweight QA reader model
11/29 - 12/5
- Run full experiments and evaluate all retrieval settings
- Collect metrics and retrieval statistics
12/6 - 12/9
- Perform error analysis and finalize results
- Write reproducibility instructions and prepare the final summary
12/10
- Submit final pull request
Summary of individual contributions
| Team member | Role/contributions |
|---|---|
| Yanyan Dong | All |
Results
I evaluated three retrieval methods on 50 samples from the 2WikiMultihopQA dev set. The QA reader used Llama 3.2 (3B) via Ollama.
| Method | Retrieval Recall | Exact Match | F1 Score |
|---|---|---|---|
| TF-IDF | 0.475 | 0.180 | 0.284 |
| Dense | 0.550 | 0.260 | 0.373 |
| Decomposition | 0.490 | 0.100 | 0.202 |
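The EM and F1 columns follow the standard token-overlap scoring used for extractive QA. A minimal F1 sketch (whitespace tokenization, no answer normalization; the project's actual scorer may add normalization):

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    common = Counter(pred) & Counter(ref)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under this scoring, a prediction with one extra token still gets high partial credit even though EM is 0.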
Key findings:
- Dense retrieval outperforms TF-IDF across all metrics, achieving the highest retrieval recall (0.550) and the best QA performance (EM=0.260, F1=0.373). Semantic embeddings match questions to relevant passages better than keyword matching does.
- Decomposition underperforms expectations. Although decomposing questions into sub-questions should in principle help multi-hop reasoning, the small 3B model sometimes generates poor sub-questions, leading to worse retrieval than the simpler methods.
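To illustrate the decomposition step, a pipeline like this one typically prompts the LLM for numbered sub-questions and parses them out of the raw reply. The prompt wording and parsing below are hypothetical sketches, not the repository's exact code:

```python
import re

def decomposition_prompt(question):
    """Build a prompt asking the LLM to split a multi-hop question into
    numbered single-hop sub-questions. (Hypothetical wording.)"""
    return (
        "Break the following question into simpler sub-questions, "
        "one per line, numbered 1., 2., ...\n"
        f"Question: {question}\nSub-questions:"
    )

def parse_subquestions(response):
    """Extract numbered sub-questions from the model's raw text reply."""
    return [m.group(1).strip()
            for m in re.finditer(r"^\s*\d+\.\s*(.+)$", response, re.MULTILINE)]
```

When a small model numbers its output inconsistently or merges hops into one sub-question, parsing and retrieval both degrade, which is consistent with the 90% retrieval-error rate observed for this method.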
Error analysis
I performed detailed error analysis on 10 samples to understand failure cases.
Error type distribution
| Method | Correct | Retrieval Error | Reader Error |
|---|---|---|---|
| TF-IDF | 2/10 | 7/10 | 1/10 |
| Dense | 3/10 | 7/10 | 0/10 |
| Decomposition | 1/10 | 9/10 | 0/10 |
Example errors
Example 1: Retrieval failure
- Question: "Who is the mother of the director of film Polish-Russian War (Film)?"
- Gold docs: Xawery Żuławski, Polish-Russian War (film)
- TF-IDF retrieved: Polish-Russian War (film), Minamoto no Chikako
- Gold answer: "Małgorzata Braunek"
- Prediction: "Dorota Masłowska"
- The retriever found the film document but failed to retrieve the director's biography, leading to an incorrect answer.
Example 2: Reader failure
- Question: "Which film came out first, Blind Shaft or The Mask Of Fu Manchu?"
- TF-IDF retrieved the correct docs: Blind Shaft, The Mask of Fu Manchu
- Gold answer: "The Mask Of Fu Manchu"
- Prediction: "The Mask of Fu Manchu (1932)"
- The answer is semantically correct but includes extra year information, causing EM=0.
Example 3: Retrieval failure (entity disambiguation)
- Question: "When did John V, Prince Of Anhalt-Zerbst's father die?"
- Gold docs: John V, Prince of Anhalt-Zerbst; Ernest I, Prince of Anhalt-Dessau
- TF-IDF retrieved: Waldemar III, Prince of Anhalt-Zerbst; John VI, Prince of Anhalt-Zerbst
- The retriever confused similar entity names (John V vs. John VI, Waldemar III), failing to find the correct documents.
Findings
- Retrieval is the main bottleneck. 70-90% of errors come from not finding the right documents. Entity disambiguation is particularly challenging: searching for "John V" returns documents about John II, John VI, etc.
- Dense retrieval has zero reader errors. When it retrieves the correct documents, the LLM answers correctly in every case in this 10-sample analysis.
- Decomposition struggles with small models. The 3B model often generates poor sub-questions, leading to 90% retrieval errors. Larger models with better instruction following might decompose questions more reliably.
- Answer format affects EM scores. Many predictions are semantically correct but fail exact match due to minor formatting differences (e.g., added year or country information).
Reproducibility
See the README.md in the code repository for setup and running instructions.
```shell
# Setup
conda create -n multihop python=3.10 -y
conda activate multihop
pip install -r requirements.txt

# Install Ollama from https://ollama.com
ollama pull llama3.2:3b

# Download data from https://github.com/Alab-NII/2wikimultihop
# Unzip to data/ folder

# Run
python main.py  # Main experiments + error analysis
```
Future improvements
Based on the error analysis, here are potential improvements:
- Retrieve more documents. Currently we retrieve the top-2 documents, but multi-hop questions often need exactly 2 specific documents. Retrieving the top-3 or top-4 might improve recall.
- Iterative retrieval. Instead of retrieving all documents at once, retrieve one document, extract information from it, then retrieve the next document based on what was learned.
- Better answer extraction. Add post-processing to extract short answers from verbose LLM responses, improving exact match scores.
- Hybrid retrieval. Combine TF-IDF and dense scores to get the benefits of both keyword matching and semantic understanding.
- Fine-tune the dense encoder. The current model (multi-qa-MiniLM-L6-cos-v1) is general-purpose; fine-tuning on multi-hop QA data could improve retrieval for this task.
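The hybrid-retrieval idea can be sketched as simple score fusion. Min-max normalizing each score list keeps one retriever's scale from dominating the other; the equal weighting (alpha=0.5) and this normalization scheme are assumptions for illustration, not a tested design:

```python
def hybrid_scores(tfidf_scores, dense_scores, alpha=0.5):
    """Fuse per-document TF-IDF and dense scores (parallel lists) into
    one ranking score per document after min-max normalizing each list."""
    def minmax(xs):
        lo, hi = min(xs), max(xs)
        # Constant lists carry no ranking signal; map them to all zeros.
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    t, d = minmax(tfidf_scores), minmax(dense_scores)
    return [alpha * a + (1 - alpha) * b for a, b in zip(t, d)]
```

Documents would then be ranked by the fused score, with alpha tuned on a held-out split.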