Exploring Effectiveness of Minimal-CoT in Low-Resource Settings
Author: snshakya
| Code Repository URL | https://github.com/uazhlt-ms-program/ling-582-fall-2025-course-project-code-shreya-vovwe |
|---|---|
| Demo URL (optional) | |
| Team name | Shreya-Vovwe |
Project description
This project investigates whether Minimal Chain-of-Thought (Minimal-CoT) prompts can achieve comparable reasoning performance to the standard (“Let’s think step by step”) CoT format in low-resource language settings (e.g., Nepali). Chain-of-Thought (CoT) prompting is a technique where models generate intermediate reasoning steps before producing a final answer, often improving performance on complex tasks such as arithmetic, commonsense reasoning, and logical inference. In this project, we examine three CoT styles: No-CoT, where the model is prompted to answer directly with no reasoning steps; Minimal-CoT, where the model is nudged with a short trigger such as “Let’s think briefly.”; and Standard-CoT, which uses the full reasoning cue “Let’s think step-by-step.” We evaluate these prompting strategies on tasks like math-reasoning (e.g., GSM8K) and commonsense reasoning (e.g., CommonsenseQA) to analyze how much reasoning signal is needed in low-resource languages. We also analyze differences in accuracy, token cost, efficiency, and inference time across the three styles and languages.
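The three prompting styles can be sketched as simple prompt templates. The trigger phrases below come from the description above; the surrounding framing (question prefix, answer-format instruction) is an illustrative assumption, not the project's verbatim prompt.

```python
# Sketch of the three CoT prompting styles compared in this project.
# Trigger phrases are from the project description; everything else
# (framing, answer-format line) is an illustrative assumption.

COT_TRIGGERS = {
    "no-cot": "Answer directly with only the final number.",
    "minimal-cot": "Let's think briefly.",
    "standard-cot": "Let's think step-by-step.",
}

def build_prompt(question: str, style: str) -> str:
    """Compose a prompt for one of the three CoT styles."""
    trigger = COT_TRIGGERS[style]
    return (
        f"Question: {question}\n"
        f"{trigger}\n"
        "Write the final numeric answer on its own line as '#### <number>'."
    )
```

Swapping only the trigger while holding the rest of the prompt fixed is what lets the three conditions be compared on accuracy and token cost alone.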
Related Work
Chain-of-Thought (CoT) prompting has been shown to significantly improve the ability of large language models to perform complex reasoning by generating explicit intermediate steps before producing an answer [Wei et al., 2022]. However, CoT introduces two major types of costs that are especially problematic in multilingual and low-resource settings. First, manual CoT (few-shot, human-written reasoning demonstrations) requires tedious, task-specific annotations that do not easily generalize across new domains or languages [Wei et al., 2022; Zhang et al., 2022]. Second, verbose CoT rationales dramatically increase inference cost: outputs can be an order of magnitude longer than direct answers, inflating token usage and latency and, in multilingual cases, amplifying error propagation through translation and script mixing [Xu et al., 2025; Han et al., 2025]. To reduce these costs, prior work has explored several directions. (a) Automatic demonstration generation, including Auto-CoT [Zhang et al., 2022], Consistency-based Self-Adaptive Prompting (COSP) [Wan et al., 2023], and Universal Self-Adaptive Prompting (USP) [Wan et al., 2023], eliminates the need for human-authored CoT exemplars by having models generate or refine their own demonstrations. (b) Concise reasoning methods, such as Chain-of-Draft (CoD) [Xu et al., 2025], Zero-shot CoT [Kojima et al., 2022], and Token-Budget-Aware LLM Reasoning (TALE) [Han et al., 2025], explicitly aim to compress or constrain rationales to reduce inference cost while preserving accuracy. (c) Adaptive computation and uncertainty-aware decoding, including RouteLLM [Ong et al., 2024], Mixture-of-Experts gating strategies [Li et al., 2023], and Contrastive Decoding variants [Li et al., 2023], seek to minimize compute by selectively allocating reasoning effort or stabilizing token choices through contrastive reweighting between an expert and a base model.
Despite this progress, very little of this literature examines CoT in multilingual or low-resource languages, and even less addresses the unique challenges these languages pose. Moreover, existing efficient CoT methods assume high-resource training conditions and well-calibrated reasoning behavior, which do not hold for low-resource languages like Nepali.
This project directly addresses this gap by evaluating whether a substantially shorter reasoning cue, Minimal-CoT (“Let’s think briefly.”), can recover the benefits of standard step-by-step CoT while avoiding these costs. By comparing No-CoT, Minimal-CoT, and Standard-CoT across Nepali and English reasoning benchmarks, we test whether concise reasoning triggers can offer a more efficient, more robust, and more cross-lingually stable alternative to verbose CoT in low-resource settings.
Goals
- Evaluate the effectiveness of Minimal CoT prompting in a low-resource language (Nepali).
- Generate synthetic dataset for Nepali and validate via back translation.
- Compare the three CoT prompting strategies (No CoT, Minimal CoT, Standard CoT) across English and Nepali.
- Analyze efficiency differences across the different prompting styles in terms of accuracy, token usage, inference time.
- Perform an error analysis on the Minimal-CoT strategy.
- Provide a reproducible experimental setup for studying the effect of CoT strategies in low-resource languages.
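The back-translation validation goal can be sketched as follows. Here `translate` is a hypothetical placeholder for whatever MT system is used, and the overlap threshold is an illustrative assumption, not a value tuned by this project:

```python
# Minimal sketch of back-translation validation for a synthetic Nepali
# dataset: keep an item only if the back-translated English shares
# enough content words with the original English source.

def token_overlap(a: str, b: str) -> float:
    """Fraction of tokens in `a` that also appear in `b` (case-insensitive)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta) if ta else 0.0

def validate_item(english_src: str, nepali: str, translate, threshold: float = 0.6) -> bool:
    """Back-translate Nepali -> English and accept if overlap >= threshold.

    `translate` is a placeholder MT callable: translate(text, src, tgt) -> str.
    """
    back = translate(nepali, src="ne", tgt="en")
    return token_overlap(english_src, back) >= threshold
```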
Timeline
| Task | Date Range |
|---|---|
| Data generation and Project Description write-up | 11/10/25 – 11/18/25 |
| Running experiments and Results section write-up | 11/19/25 – 11/28/25 |
| Error Analysis | 11/29/25 – 12/02/25 |
| Remaining sections write-up and Project wrap-up | 12/03/25 – 12/06/25 |
Summary of individual contributions
| Team member | Role/contributions |
|---|---|
| Shreya Nupur Shakya | Experiments for Llama-3-8B-Instruct and Qwen2.5-7B-Instruct on GSM8K in English and Nepali; Error analysis; Write-up |
| Oghenevovwe Ikumariegbe | Data translation, backtranslation and validation of GSM8K and CommonsenseQA; Reproducibility and container setup; Write-up |
Results
We evaluate two open-source models, Llama-3-8B-Instruct and Qwen2.5-7B-Instruct, on GSM8K in English and Nepali under three prompting configurations: No-CoT, Minimal-CoT, and Standard-CoT. We further conduct a comprehensive Minimal-CoT reasoning-length study, varying the maximum reasoning budget (cap) from 192 to 1024 tokens across both models in Nepali. Our results show clear differences across languages and models in accuracy, efficiency, and error patterns, helping us understand how much reasoning Minimal-CoT needs to work effectively in a low-resource language.
| Model | Language | Style | Accuracy (%) | Avg Gen Tokens (Input+Output) | Avg Time (s) |
|---|---|---|---|---|---|
| Llama | English | Minimal-CoT | 73.62 | 332.32 | 0.6708 |
| Llama | English | No-CoT | 11.90 | 84.42 | 0.0826 |
| Llama | English | Standard-CoT | 77.10 | 488.79 | 1.0937 |
| Llama | Nepali | Minimal-CoT | 23.73 | 532.30 | 1.3314 |
| Llama | Nepali | No-CoT | 4.93 | 173.07 | 0.1512 |
| Llama | Nepali | Standard-CoT | 20.55 | 660.27 | 1.8836 |
| Qwen | English | Minimal-CoT | 82.03 | 279.62 | 0.4997 |
| Qwen | English | No-CoT | 17.51 | 73.76 | 0.0615 |
| Qwen | English | Standard-CoT | 86.73 | 397.21 | 0.7613 |
| Qwen | Nepali | Minimal-CoT | 29.80 | 525.52 | 1.3551 |
| Qwen | Nepali | No-CoT | 8.49 | 269.52 | 0.3510 |
| Qwen | Nepali | Standard-CoT | 30.02 | 717.52 | 2.2289 |
Table 1: Performance of No-CoT, Minimal-CoT, and Standard-CoT Across Qwen and LLaMA (English & Nepali GSM8K)

Table 1 summarizes the performance of Qwen2.5-7B-Instruct and Llama-3-8B-Instruct on English and Nepali GSM8K under the three prompting strategies: No-CoT, Minimal-CoT, and Standard-CoT. This table provides the clearest cross-sectional view of how reasoning style interacts with model architecture and language.
Across both models and both languages, Minimal-CoT consistently outperforms No-CoT and comes close to, or even exceeds, Standard-CoT, despite requiring fewer tokens and shorter inference times.
Some key observations:

English:
- Qwen improves from 17.51% to 82.03% when moving from No-CoT to Minimal-CoT.
- LLaMA improves from 11.90% (No-CoT) to 73.62% (Minimal-CoT)
- Standard-CoT offers only a small accuracy gain over Minimal-CoT in English, despite requiring substantially more tokens on average (≈397 for Qwen and ≈489 for LLaMA).
Nepali:
- Qwen improves from 8.49% to 29.80% when moving from No-CoT to Minimal-CoT.
- LLaMA improves from 4.93% to 23.73% when moving from No-CoT to Minimal-CoT.
- In both models, Standard-CoT provides no meaningful advantage over Minimal-CoT—and in LLaMA’s case, Minimal-CoT actually performs better.
These improvements are especially notable in Nepali, where Standard-CoT tends to produce longer, less structured explanations and is more susceptible to hallucination or drift in reasoning. Minimal-CoT appears to anchor the model to a concise reasoning path that is easier to maintain in low-resource languages.
Across all four conditions (English/Nepali × Qwen/LLaMA):

- Minimal-CoT achieves 95–115% of Standard-CoT’s accuracy while using only 68–81% of the token budget. In other words, Minimal-CoT is almost as accurate, or sometimes more accurate, while being consistently more efficient.
- In Nepali, Minimal-CoT is the most balanced and reliable prompting strategy: it clearly beats Standard-CoT for LLaMA and essentially matches it for Qwen, while requiring fewer tokens in both cases.

These findings directly support the project’s central research question: a small amount of reasoning is enough, and in multilingual settings, more reasoning is often harmful.
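The ratio ranges quoted above can be re-derived directly from the Table 1 numbers. The sketch below hard-codes those values to show the computation:

```python
# Re-derives the Minimal-CoT vs. Standard-CoT accuracy- and token-ratio
# ranges from the Table 1 numbers.

TABLE1 = {
    # (model, language): (minimal_acc, standard_acc, minimal_tokens, standard_tokens)
    ("llama", "en"): (73.62, 77.10, 332.32, 488.79),
    ("llama", "ne"): (23.73, 20.55, 532.30, 660.27),
    ("qwen", "en"): (82.03, 86.73, 279.62, 397.21),
    ("qwen", "ne"): (29.80, 30.02, 525.52, 717.52),
}

def ratios():
    """Return {condition: (accuracy_ratio_pct, token_ratio_pct)}."""
    return {
        key: (100 * ma / sa, 100 * mt / st)
        for key, (ma, sa, mt, st) in TABLE1.items()
    }
```

Running this yields accuracy ratios from roughly 95% (Qwen, English) up to about 115% (LLaMA, Nepali), and token ratios between roughly 68% and 81%, matching the ranges stated above.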
Cross-model insights

- LLaMA is steadier and more reliable, especially in Nepali. Minimal-CoT reaches 23.73%, which is higher than Standard-CoT at 20.55%, and the reasoning is shorter and more focused.
- Qwen shows similar accuracy for Minimal-CoT and Standard-CoT in Nepali (29.80% vs. 30.02%), but Standard-CoT often produces long or messy explanations, while Minimal-CoT stays cleaner.
- In English, both models perform slightly better with Standard-CoT, but Minimal-CoT stays close: LLaMA 73.62% (Minimal) vs. 77.10% (Standard); Qwen 82.03% (Minimal) vs. 86.73% (Standard).
Overall, Minimal-CoT stays competitive with Standard-CoT while generating much shorter outputs, making it more efficient and more stable across languages.
Cross-lingual insights

- Both models score much higher in English than in Nepali (for example, LLaMA: 73.62% → 23.73%, Qwen: 82.03% → 29.80%), which is expected given lower training coverage and more complex number expressions in Nepali.
- Minimal-CoT reduces this drop by keeping the reasoning shorter and clearer, helping both models stay more stable across languages.
- The strong improvements from Minimal-CoT in both English and Nepali show that it is not dependent on English-specific patterns and generalizes well across languages, including low-resource ones.
Overall, the results in Table 1 demonstrate that:
- Minimal-CoT is the most effective and efficient reasoning strategy across all tested models and languages.
- Standard-CoT is not reliably better in low-resource settings and often introduces unnecessary verbosity and instability.
- No-CoT is insufficient, especially for Nepali, where minimal reasoning cues are crucial.
This provides strong empirical support for the broader hypothesis that small, focused reasoning signals outperform long chain-of-thought in multilingual contexts.
How Much Reasoning Is Enough for Minimal-CoT in Nepali?

To determine the optimal reasoning budget for Minimal-CoT in low-resource settings, we performed a study on Nepali GSM8K by varying the maximum reasoning length (cap) from 192 to 1024 tokens for both Llama-3-8B-Instruct and Qwen2.5-7B-Instruct. Tables 2 and 3 summarize accuracy trends at each cap.
| Cap | Accuracy (%) | Avg. tokens (Input+Output) | Time (s) |
|---|---|---|---|
| 192 | 15.92 | 397.52 | 0.8435 |
| 256 | 25.02 | 461.52 | 1.0914 |
| 320 | 29.80 | 525.52 | 1.3484 |
| 384 | 33.28 | 589.52 | 1.6189 |
| 448 | 34.04 | 653.52 | 1.8855 |
| 512 | 34.19 | 717.52 | 2.1661 |
| 640 | 34.50 | 845.52 | 2.7552 |
| 768 | 33.66 | 973.52 | 3.3858 |
| 896 | 32.60 | 1101.52 | 4.0226 |
Table 2: Minimal-CoT Results for Nepali GSM8K on Qwen2.5-7B-Instruct

| Cap | Accuracy (%) | Avg. tokens (Input+Output) | Time (s) |
|---|---|---|---|
| 192 | 11.83 | 340.88 | 0.6578 |
| 256 | 18.50 | 404.88 | 0.8731 |
| 320 | 21.76 | 468.88 | 1.0972 |
| 384 | 23.73 | 532.30 | 1.3327 |
| 448 | 24.18 | 581.84 | 1.5263 |
| 512 | 24.34 | 612.24 | 1.6494 |
| 640 | 24.34 | 641.24 | 1.7756 |
| 768 | 24.34 | 660.47 | 1.8661 |
| 896 | 24.11 | 646.90 | 2.0312 |
| 1024 | 24.11 | 663.20 | 2.1248 |
| 1280 | 24.11 | 695.81 | 2.3340 |
Table 3: Minimal-CoT Results for Nepali GSM8K on Llama-3-8B-Instruct

Across both models, accuracy improves quickly as the reasoning cap increases from 192 to 512 tokens, after which gains become very small.
- Llama-3-8B-Instruct reaches its highest accuracy at 512–640 tokens (24.34%), with no further improvement up to 1280 tokens.
- Qwen2.5-7B-Instruct reaches its peak at 640 tokens (34.50%), and accuracy begins to decline slightly at 768 tokens and beyond.
Beyond 768–896 tokens, neither model benefits from the longer reasoning chains, even though the outputs become much longer and slower to generate.
Overall, the results show that Minimal-CoT needs only about 500–700 tokens to reach its best performance in Nepali. Longer chains add cost without adding accuracy, and can even reduce stability, reinforcing that short, focused reasoning is both efficient and effective for low-resource languages.
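The cap study above amounts to re-scoring the same Minimal-CoT prompts under increasing generation budgets. A minimal harness might look like the following, where `generate` stands in for the actual model call (e.g., decoding with the cap as the maximum number of new tokens) and `is_correct` for the answer check; both callables are assumptions, not the project's exact code:

```python
# Sketch of the reasoning-budget sweep behind Tables 2 and 3: run the
# same prompts under increasing generation caps and record accuracy.
# The cap grid matches the one reported in the tables.

CAPS = [192, 256, 320, 384, 448, 512, 640, 768, 896, 1024, 1280]

def sweep(examples, generate, is_correct):
    """Score generate(prompt, cap) against gold answers for each cap.

    `examples` is a list of {"prompt": str, "gold": str} dicts;
    `generate(prompt, cap)` returns the model output under that cap;
    `is_correct(output, gold)` returns True/False.
    """
    results = {}
    for cap in CAPS:
        hits = sum(
            is_correct(generate(ex["prompt"], cap), ex["gold"])
            for ex in examples
        )
        results[cap] = 100 * hits / len(examples)
    return results
```

Because only the cap varies, any accuracy change across the grid is attributable to the reasoning budget itself.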
Error analysis
To better understand how Minimal-CoT behaves across languages and models, we performed a detailed error analysis over the incorrect outputs extracted from the JSON result files for Qwen2.5-7B-Instruct and Llama-3-8B-Instruct on English and Nepali GSM8K. Errors were manually inspected and categorized as follows:
- Truncation and Incomplete Reasoning
Across both models, the smallest Minimal-CoT caps (192–320 tokens) consistently lead to truncated chains: reasoning that stops mid-sentence or never reaches a final answer. This pattern is especially common in Llama-3-8B-Instruct, whose JSON outputs show early termination such as:
Example (Llama-3-8B-Instruct, 192 cap):
"…तेस्रो चरण: उनी बाँकीलाई किसानको बजारमा…" ("…Third step: she [sells] the rest at the farmer's market…")

The model attempts sound reasoning but cannot finish under the 192-token cap.
With a higher cap of 256 tokens, the model produced the correct final output:
"...तेस्रो चरण: उनी बाँकीलाई किसानको बजारमा दैनिक ताजा हाँसको अण्डा २ डलरमा बेच्छन्, त्यस्तो मात्र ९ अण्डा x २ डलर = १८ डलर कमाउँछिन्।\n\nउत्तर: #### १८" ("…Third step: she sells the remaining fresh duck eggs daily at the farmer's market for $2 each, so 9 eggs x $2 = $18. Answer: #### 18")

These errors diminish significantly once caps exceed 384 to 448 tokens.
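A simple way to count this failure mode automatically is to flag outputs that never produce the required '#### <number>' line. The check below is our illustration of the idea, not necessarily the project's exact analysis script:

```python
# Heuristic truncation check: an output that never reaches a
# '#### <number>' answer line (in ASCII or Devanagari digits) is
# counted as truncated / incomplete.
import re

ANSWER_RE = re.compile(r"####\s*([०-९0-9.,]+)")

def is_truncated(output: str) -> bool:
    """True if the output lacks a final '#### <number>' answer line."""
    return ANSWER_RE.search(output) is None
```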
- Semantic Drift with Nepali Numeric Expressions
A failure seen across both models involves misinterpretation of Nepali number words, often leading to incorrect intermediate steps. Common patterns observed in the JSON examples include:
- treating आधा (“half”) as a literal 0.5 in contexts where a unit transformation was required
- confusing डेढ (“one and a half”) with 1.5, 2, or even 0.5
- errors in mixed Devanagari and Arabic numerals (e.g., “१ 150०%”)
- incorrect expansion of Nepali positional number words
Example (LLaMA, 192 cap):
"...दोस्रो चरण: आधा सेतो फाइबर चाहिन्छ, त्यसैले सेतो फाइबरको संख्या ½ हुन्छ।\n\nअन्तिम संख्यात्मक उत्तर: २.५" ("…Second step: half as much white fiber is needed, so the amount of white fiber is ½. Final numeric answer: 2.5")

These mistakes reflect true semantic grounding gaps, not a lack of reasoning length: the reasoning treats आधा (“half”) as 0.5, yet the final answer is 2.5.
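These grounding failures suggest a lightweight normalization layer for Nepali numerals, as discussed later in the limitations. The sketch below maps common Nepali number words to values and transliterates Devanagari digits; the word list is illustrative, not exhaustive:

```python
# Sketch of Nepali numeric normalization: resolve common number words
# and transliterate Devanagari digits before comparing answers.
# The word list is an illustrative assumption, not a full lexicon.

NEPALI_NUMBER_WORDS = {
    "आधा": 0.5,   # "half"
    "डेढ": 1.5,   # "one and a half"
    "अढाई": 2.5,  # "two and a half"
}

DEVANAGARI_DIGITS = str.maketrans("०१२३४५६७८९", "0123456789")

def normalize_number(token: str):
    """Return a float for a Nepali number word or digit string, else None."""
    if token in NEPALI_NUMBER_WORDS:
        return NEPALI_NUMBER_WORDS[token]
    ascii_tok = token.translate(DEVANAGARI_DIGITS)
    try:
        return float(ascii_tok)
    except ValueError:
        return None
```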
- Numerical/Reasoning Drift or Misleading Rationales
Models often produce CoT explanations that look correct but are unreliable. Two patterns appear frequently:
Numerical drift: The model decomposes the problem correctly and sets up the right operations, but makes a wrong arithmetic step (e.g., incorrect multiplication or addition), causing the final answer to deviate despite a structurally valid chain.
Reasoning drift and misleading rationales: The model solves the wrong problem or introduces unstated assumptions. Sometimes the final answer is correct, but the rationale is incoherent or unrelated to the actual question, giving a misleading impression of correctness.
Both patterns show that CoT rationales can appear fluent and logical while being mathematically or conceptually unfaithful.
Example:
```json
"style": "standard-cot",
"question": "Sarah has a rope that is 20 meters long. Her friend wants to buy the rope for $2 a meter. Sarah plans to use the profit to buy a new rope, which at the store costs $1.5 a meter. How much money will she have left over after she buys the new rope?",
"gold_full_answer": "Her friend pays her $40 for the rope because 20 x 2 = <<20*2=40>>40\nShe can buy the rope at the store for $30 because 20 x 1.5 = <<20*1.5=30>>30\nShe has $10 left over because 40 - 30 = <<40-30=10>>10\n#### 10",
"gold_numeric": "10",
"output": "To solve this problem, I will first calculate the total cost of the rope Sarah sells to her friend. Since the rope is 20 meters long and her friend wants to buy it for $2 a meter, the total cost will be:\n\n20 meters x $2 per meter = $40\n\nSarah will earn $40 from selling the rope. She plans to use this money to buy a new rope that costs $1.5 a meter. To find the length of the new rope she can buy, I will divide the $40 she earned by the cost per meter of the new rope:\n\n$40 ÷ $1.5 per meter = 26.67 meters\n\nSince Sarah can't buy a fraction of a meter, I will round down to the nearest whole number to find the actual length of the new rope she can buy:\n\n26.67 meters → 26 meters\n\nThe cost of the new rope will be:\n\n26 meters x $1.5 per meter = $39\n\nSarah will spend $39 on the new rope, leaving her with:\n\n$40 (earned) - $39 (spent) = $1\n\nSo, Sarah will have $1 left over after buying the new rope.\n\n#### 1",
```

The model does the math correctly but solves the wrong problem. Instead of calculating the cost of replacing the same 20-meter rope, it tries to maximize how much new rope Sarah can buy, then subtracts that cost. This reasoning drift leads to the wrong final answer.
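Scoring outputs like this reduces to extracting the last '#### <number>' span and comparing it numerically to `gold_numeric`, so a fluent-but-drifted chain ending in "#### 1" is still marked wrong against gold "10". The sketch below is a plausible reconstruction, not the project's exact evaluation code:

```python
# Answer extraction and scoring sketch: parse the LAST '#### <number>'
# occurrence and compare it numerically to the gold answer.
import re

def extract_final(output: str):
    """Parse the last '#### <number>' span in the output as a float."""
    matches = re.findall(r"####\s*([-0-9.,]+)", output)
    if not matches:
        return None
    try:
        return float(matches[-1].replace(",", ""))
    except ValueError:
        return None

def is_correct(output: str, gold_numeric: str) -> bool:
    """Numeric comparison so '10', '10.0', and '10,000' formats agree."""
    pred = extract_final(output)
    return pred is not None and abs(pred - float(gold_numeric)) < 1e-6
```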
- Paraphrase Loops
The models often get stuck repeating or rephrasing the question instead of solving it. At medium caps (320 to 640 tokens), the model produces long reflective restatements such as “You are a math solver…”, or repeats the problem in Nepali without doing any calculations. This paraphrasing loop disappears once the reasoning cap is large enough (≈768 tokens).
Example:
```json
"style": "minimal-cot",
"question_nepali": "ग्रेगसँग एक अलार्म सेट छ जुन दिनको तीन पटक रिमाइन्डरको रूपमा बज्छ। जब अलार्म बन्द हुन्छ, तबसम्म यो बजिरहन्छ जबसम्म ग्रेगले यसलाई बन्द गर्दैन। आज पहिलो पटक यो बन्द भयो, यो चार पटक बज्यो। दोस्रो पटक यो बन्द भयो, यो पहिलो पटक भन्दा तीन गुणा लामो समयसम्म बज्यो। तेस्रो पटक, यो दोस्रो पटक भन्दा आधा लामो समयसम्म बज्यो। अलार्म कुल कति पटक बज्यो?",
"gold_full_nepali": "दोस्रो पटक ग्रेगको अलार्म बज्यो, यो पहिलो भन्दा तीन गुणा लामो भयो, त्यसैले यो 3 * 4 = <<3*4=12>>12 पटक दोस्रो पटक बज्यो। तेस्रो पटक यो गयो, यो दोस्रो भन्दा आधा लामो भयो, त्यसैले यो 12 / 2 = <<12/2=6>>6 पटक तेस्रो पटक बज्यो। कुलमा, ग्रेगको अलार्म 4 + 12 + 6 = <<4+12+6=22>>22 पटक बज्यो।",
"gold_numeric": "22",
"output": "user\n\nतपाईं एक गणित समाधानकर्ता हुनुहुन्छ। तलको शब्द समस्यालाई नेपालीमा समाधान गर्नुहोस्।\nसमाधान गर्दा नेपालीमा छोटो रूपमा चरणबद्ध सोच्नुहोस् र बाहिर पनि धेरै लामो व्याख्या नगरी संक्षिप्त रूपमा १–२ वाक्यमा मुख्य कदमहरू लेख्नुहोस्। अन्त्यमा, एक छुट्टै नयाँ लाइनमा मात्र अन्तिम संख्यात्मक उत्तर '#### <संख्या>' को ढाँचामा लेख्नुहोस्।\n\nप्रश्न: ग्रेगसँग एक अलार्म सेट छ जुन दिनको तीन पटक रिमाइन्डरको रूपमा बज्छ। जब अलार्म बन्द हुन्छ, तबसम्म यो बजिरहन्छ जबसम्म ग्रेगले यसलाई बन्द गर्दैन। आज पहिलो पटक यो बन्द भयो, यो चार पटक बज्यो। दोस्रो पटक यो बन्द भयो, यो पहिलो पटक भन्दा तीन गुणा लामो समयसम्म बज्यो। तेस्रो पटक, यो दोस्रो पटक भन्दा आधा लामो समयसम्म बज्यो। अलार्म कुल कति पटक बज्यो?\nउत्तर:assistant\n\nassistant\n\nसमाधान:\n\nपहिलो पटक अलार्म बन्द भयो र यो चार पटक बज्यो।\n\nदोस्रो पटक अलार्म बन्द भयो र यो पहिलो पटक भन्दा तीन गुणा लामो समयसम्म बज्यो।\n\nतेस्रो पटक अलार्म बन्द भयो र यो दोस्रो पटक भन्दा आधा लामो समयसम्म बज्यो।\n\nअलार्म कुल कति पटक बज्यो?\n\n#### 6",
```

The model repeats the prompt (“तपाईं एक गणित समाधानकर्ता हुनुहुन्छ…”, i.e., “You are a math solver…”) and restates each sentence of the question instead of computing the three alarm counts.
It never performs the calculations 4, 12, or 6, and ends with an unsupported final answer ‘#### 6’, illustrating a paraphrase loop with no actual reasoning.
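A rough automatic detector for this failure mode flags outputs that mostly restate the question and contain no arithmetic. The overlap threshold and operator list below are illustrative assumptions, not tuned values from the project:

```python
# Heuristic paraphrase-loop detector: high token overlap with the
# question plus no arithmetic operators suggests the model restated
# the problem instead of solving it.

def is_paraphrase_loop(question: str, output: str, overlap_threshold: float = 0.7) -> bool:
    """Flag outputs that mostly echo the question and do no math."""
    q_tokens = set(question.split())
    o_tokens = set(output.split())
    if not q_tokens:
        return False
    overlap = len(q_tokens & o_tokens) / len(q_tokens)
    has_arithmetic = any(op in output for op in ("=", "+", "*", "/", "×"))
    return overlap >= overlap_threshold and not has_arithmetic
```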
Summary

Across these four categories, we find that:
- low caps → truncation
- Nepali reasoning → semantic grounding failures
- all caps → arithmetic drift
- mid caps → paraphrase loops
- large caps (≥512) → more stable chains but still vulnerable to numeric interpretation issues
These patterns explain why Minimal-CoT reaches peak performance around 500–700 tokens: long enough to avoid truncation and paraphrasing, but short enough to reduce drift and hallucination.
Reproducibility
Covered in our code repository's README.
https://github.com/uazhlt-ms-program/ling-582-fall-2025-course-project-code-shreya-vovwe
Limitations and future improvements
Although Minimal-CoT proves both effective and efficient for Nepali GSM8K, several focused improvements can strengthen multilingual reasoning performance.
- Better handling of Nepali numerical expressions
Errors often stem from misinterpreting Nepali number words (e.g., आधा, डेढ) and mixed-script percentages. Future work could add lightweight numeric-normalization modules, targeted fine-tuning, or curated examples that teach models how these number words correspond to mathematical operations.
- Adaptive reasoning budgets
Our results show that ~500–700 tokens are sufficient for peak Minimal-CoT performance. Instead of using a fixed cap, future systems could dynamically adjust reasoning length based on predicted uncertainty or question difficulty, allocating more reasoning only when needed.
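As a toy illustration of such an adaptive budget, one could map a crude difficulty proxy (here, question length) into the useful cap range found in the sweep. The mapping below is a hypothetical policy for illustration, not something this project evaluated:

```python
# Toy adaptive reasoning budget: longer questions get a larger cap,
# clamped to the range the sweep found useful (~384-640 tokens).
# The per-word increment (4) is an arbitrary illustrative choice.

def choose_cap(question: str, floor: int = 384, ceiling: int = 640) -> int:
    """Pick a reasoning cap from question length, clamped to [floor, ceiling]."""
    n_words = len(question.split())
    cap = floor + 4 * n_words  # ~4 extra reasoning tokens per question word
    return max(floor, min(ceiling, cap))
```

A real system would replace the length proxy with a learned difficulty or uncertainty estimate, but the clamping to the empirically useful range is the key idea.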
- Higher-quality Nepali (low resource) datasets
While back-translation helps filter major issues, the machine-translated Nepali GSM8K still contains phrasing irregularities. Future work could incorporate a small set of human-authored Nepali reasoning problems or refined translations produced in collaboration with native speakers.
- Broader evaluation across models and tasks
Minimal-CoT should be tested on additional reasoning tasks (ARC, StrategyQA, MAWPS) and across more open-source models to determine whether the observed gains generalize beyond GSM8K-style arithmetic. We planned to test it across more models and datasets such as CommonsenseQA, but due to time constraints we were unable to run those experiments. We have shared the translated version of CommonsenseQA in our GitHub repo.
Overall, these improvements aim to make Minimal-CoT even more reliable, interpretable, and broadly applicable across multilingual reasoning settings.
References

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., arXiv 2022)
Automatic Chain of Thought Prompting in Large Language Models (Zhang et al., arXiv 2022)
Better Zero-Shot Reasoning with Self-Adaptive Prompting (Wan et al., arXiv 2023)
Universal Self-Adaptive Prompting (Wan et al., arXiv 2023)
Chain of Draft: Thinking Faster by Writing Less (Xu et al., arXiv 2025)
Token-Budget-Aware LLM Reasoning (Han et al., arXiv 2025)
RouteLLM: Learning to Route LLMs with Preference Data (Ong et al., arXiv 2024)
Adaptive Gating in Mixture-of-Experts based Language Models (Li et al., EMNLP 2023)
Contrastive Decoding: Open-ended Text Generation as Optimization (Li et al., ACL 2023)
Large Language Models are Zero-Shot Reasoners (Kojima et al., arXiv 2022)