This repository contains the official implementation of the paper "On the Thinking-Language Modeling Gap in Large Language Models" (ICLR 2026).
This work identifies a critical gap in Large Language Models (LLMs): despite their success in imitating human reasoning through Chain-of-Thought (CoT), LLMs trained on massive human language corpora struggle with tasks requiring careful handling of implicit expressions.
Human language is primarily a tool for communication rather than thinking (Fedorenko et al., 2024). Consequently, written language contains implicit expressions—information patterns that occur less frequently during training due to human preferences in language organization. This creates a fundamental mismatch:
- Language Modeling Bias: Next-token prediction on non-topological language order (e.g., conclusion before all premises) causes models to learn marginal distributions rather than joint dependencies (Proposition 2.3).
- Implicit Expression Blindness: LLMs overlook critical information when it is expressed implicitly rather than explicitly (Theorem 2.4).
- Shortcut Reasoning: Models tend to use incomplete context when expressions do not align with the underlying causal structure of the reasoning task.
Figure 1: Structural Causal Models of next-token prediction. When language is not in topological order (right), LLMs learn biased dependencies.
When encountering natural language in anti-topological order (e.g., a conclusion stated before its premises), the model marginalizes over the unseen premises and learns marginal rather than joint dependencies. Section 2 of the paper formalizes this by defining random vectors for the premises and their linguistic expressions and deriving the resulting bias term (see Proposition 2.3 and Theorem 2.4).
We propose LoT (Language-of-Thought) prompting to mitigate the gap by explicitly instructing models to process all information before reasoning. The prompt-level intervention (Section 3.2):
"Please observe, expand, and echo all the relevant information based on the question"
- Observe: Identify all relevant information from the question (improves context $q$).
- Expand: Elaborate on each piece to make implicit content explicit (improves expression $L$).
- Echo: Restate the elaborated information before reasoning (refreshes the context).
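As a minimal illustration, the intervention amounts to appending the LoT instruction to the question before querying the model. The helper below is a hypothetical sketch; the actual prompt templates used by this repo are configured in main/main.yaml.

```python
# Hypothetical sketch of the LoT prompt-level intervention.
# The real templates live in main/main.yaml; this only shows the idea.
LOT_HINT = ("Please observe, expand, and echo all the relevant "
            "information based on the question")

def build_lot_prompt(question: str) -> str:
    """Wrap a question with the observe-expand-echo instruction."""
    return f"{question}\n\n{LOT_HINT}."

prompt = build_lot_prompt("Who does 'she' refer to in the sentence above?")
```

The resulting string is then sent to the model as a single user turn, in place of the usual "Let's think step by step" CoT suffix.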
Theoretical Motivation: LoT aims to minimize the language-thought gap by jointly improving the expression ($L$) and context ($q$) components identified in Section 2.
Figure 2: Accuracy patterns on WinoControl under different levels of L-implicitness and q-implicitness. CoT shows significant performance drops as implicitness increases (left), while LoT interventions effectively mitigate these biases (center and right).
LoT vs. Chain-of-Thought (CoT) on bias and reasoning benchmarks (Table 2, 3, 4):
| Benchmark | Metric | CoT | LoT | Improvement |
|---|---|---|---|---|
| WinoBias (Gender Bias) | Consistency (DeepSeek-V3) | 86.9% | 90.7% | +3.8% |
| WinoBias (Gender Bias) | Consistency (GPT-4o-mini) | 71.5% | 72.5% | +1.0% |
| Alice (Simple Math) | Accuracy (Avg. across 4 models) | 31.8% | 40.1% | +8.3% |
| BBQ (Social Bias) | Avg. Accuracy (DeepSeek-V3) | 87.1% | 89.7% | +2.6% |
| BBQ (Social Bias) | Avg. Accuracy (GPT-4o-mini) | 67.9% | 84.9% | +17.0% |
Note: WinoBias consistency measures agreement between pro-stereotype and anti-stereotype examples. Alice benchmark uses simple "N brothers and M sisters" questions where CoT often fails.
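The consistency metric can be sketched as follows, assuming the model's pronoun resolutions are collected as paired lists for the pro- and anti-stereotype variants of each example (the pairing scheme here is an assumption for illustration, not the repo's exact evaluation code):

```python
def winobias_consistency(pro_answers, anti_answers):
    """Fraction of example pairs where the model resolves the pronoun
    to the same entity in both the pro- and anti-stereotype variants."""
    assert len(pro_answers) == len(anti_answers)
    agree = sum(p == a for p, a in zip(pro_answers, anti_answers))
    return agree / len(pro_answers)

# 3 of the 4 pairs below agree, so consistency is 0.75.
score = winobias_consistency(["doctor", "nurse", "doctor", "nurse"],
                             ["doctor", "nurse", "doctor", "doctor"])
```

A perfectly unbiased model answers identically regardless of stereotype direction, so higher consistency indicates lower reliance on gender shortcuts.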
We introduce WinoControl, a controlled evaluation framework based on WinoBias that manipulates two types of implicitness:
Table 1: Construction of WinoControl datasets for controlling L-implicitness (expression level) and q-implicitness (context level).
For a premise expressed in natural language, we define two types of implicitness:
- L-implicitness (Expression Implicitness): how explicitly the answer-relevant information is expressed in the sentence itself, ranging from a determinative hint to no hint at all.
- q-implicitness (Context Implicitness): how much irrelevant, distracting content surrounds the relevant premise in the context.
Finding: As shown in Figure 2, as L-implicitness and q-implicitness increase (moving toward hard), CoT accuracy drops significantly, while LoT maintains robust performance. Echo intervention is more effective for high q-implicitness (upper right in Figure 2b), while Expand intervention is more effective for high L-implicitness (bottom left in Figure 2c), confirming their theoretical roles in minimizing different components of the language-thought gap.
LoT can be combined with complex reasoning frameworks and shows consistent improvements:
Table 5: Results on HotpotQA with advanced reasoning protocols (ToT, GoT, ReAct). LoT presents improvement in 9 out of 12 cases compared to CoT, with consistent gains in Tree-of-Thought (ToT) settings.
- LoT + ToT/GoT/ReAct: Outperforms CoT variants on HotpotQA (Table 5 above). LoT shows consistent improvements in Tree-of-Thought (ToT), which is the state-of-the-art method for this benchmark.
- Self-Consistency: LoT achieves higher consistency with fewer samples than CoT.
Figure 4: Self-consistency results on WinoBias. LoT+SC achieves higher consistency with fewer tokens compared to CoT+SC (e.g., LoT with R=4 outperforms CoT with R=16 while costing less than half the tokens).
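Self-consistency aggregates R sampled completions by majority vote over their final answers. A minimal sketch of the aggregation step (independent of the repo's sc_generator.py implementation):

```python
from collections import Counter

def majority_vote(samples):
    """Return the most frequent final answer among R sampled completions."""
    return Counter(samples).most_common(1)[0][0]

# With R=4 samples, the modal answer wins even if individual chains disagree.
answer = majority_vote(["A", "B", "A", "A"])  # -> "A"
```

Because LoT chains agree with each other more often, the vote converges with a smaller R, which is where the token savings in Figure 4 come from.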
This repo provides a unified interface for evaluating different reasoning methods across multiple datasets and language models.
- LoT: Language-of-Thought reasoning with observe-expand-echo paradigm.
- Echo: Ablation using only the echo component (targets q-implicitness).
- Expand: Ablation using only the expand component (targets L-implicitness).
- CoT: Chain-of-Thought reasoning ("Let's think step by step").
- Direct: Direct prompting without explicit reasoning.
- RaR: Rephrase and Respond.
- LtM: Least-to-Most prompting
- None: Standard single-pass reasoning.
- ToT: Tree-of-Thought (Yao et al., 2023).
- GoT: Graph-of-Thought (Besta et al., 2024).
- ReACT: ReAct-style reasoning with tool use (Wikipedia API).
- WinoBias: Gender bias evaluation (Zhao et al., 2018).
- WinoControl: Controlled WinoBias with manipulated implicitness levels.
- Alice: Simple math problems testing basic reasoning (Nezhurina et al., 2024).
- BBQ: Bias Benchmark for QA (Age, Nationality, Religion subcategories).
- HotpotQA: Multi-hop question answering.
- CSQA: CommonsenseQA.
- FOLIO: First-Order Logic reasoning.
- GPQA: Graduate-Level Google-Proof Q&A.
- MUSR: Multi-step soft reasoning.
- Clone this repository:

```
git clone https://github.com/tmlr-group/LoT-2026.git
cd LoT-2026
```

- Install dependencies:

```
pip install -r requirements.txt
```

This project requires API keys for various LLM providers. Copy the example configuration file and add your credentials:

```
cp secret.yaml.example secret.yaml
```

Edit secret.yaml and replace the placeholder API keys with your actual credentials. The file supports multiple providers:
- OpenAI
- DeepSeek
- Qwen (Alibaba/Dashscope)
- AIMLAPI
- OpenRouter
- BigModels
- SiliconFlow
- TogetherAI
Important: Never commit secret.yaml to version control. It is already included in .gitignore.
The main configuration is in main/main.yaml, which includes:
- Dataset paths and settings
- Model parameters (temperature, max_tokens, concurrency)
- Prompt templates for different reasoning methods
- LoT hint configuration
Run an experiment with a specific model, dataset, and method:
```
python main/run.py \
  --model-name deepseek-chat \
  --dataset winobias \
  --method lot \
  --seed 0 \
  --version 1 \
  --maxrepeat 1 \
  --concurrency 8
```

For self-consistency evaluation (requires `--temperature 1.0`):

```
python selfconsistency/basic_selfconsistency.py \
  --model-name deepseek-chat \
  --dataset winobias \
  --method lot \
  --seed 0 \
  --version 1 \
  --concurrency 8
```

WinoControl is an advanced WinoBias experiment runner that studies the effect of implicit information on model reasoning. It controls the two types of implicitness defined in Table 1 of the paper:
Key Parameters:
- `--issue_1`: q-implicitness (Context Implicitness), controls irrelevant distracting sentences:
  - `0`: No irrelevant sentences (baseline, easy)
  - `1`: Low (2 random irrelevant sentences)
  - `2`: High (4 random irrelevant sentences, hard)
- `--issue_2`: L-implicitness (Expression Implicitness), controls how the answer is expressed:
  - `0`: Determinative hint added (correct answer explicitly indicated, easy)
  - `1`: Partially informative hint (medium)
  - `2`: No hint (original sentence, hard)
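To sweep the full 3x3 grid of implicitness levels, the corresponding winoControl/run.py invocations can be generated programmatically. A sketch (the model, method, and log prefix below are placeholders; adapt them to your setup):

```python
from itertools import product

# Base invocation; adjust model/method/prefix to your experiment.
BASE = ("python winoControl/run.py --model_name deepseek-chat "
        "--method_name lot --log_prefix grid_sweep")

# One command per (q-implicitness, L-implicitness) combination, 9 in total.
commands = [f"{BASE} --issue_1 {q} --issue_2 {l}"
            for q, l in product(range(3), range(3))]

for cmd in commands:
    print(cmd)  # or launch via subprocess.run(cmd.split())
```

This reproduces the grid of conditions behind Figure 2, where each cell is one (issue_1, issue_2) pair.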
Basic Usage:
```
python winoControl/run.py \
  --model_name deepseek-chat \
  --method_name lot \
  --method_prompt "Please **observe**, **expand**, and **echo** all the relevant information based on the question." \
  --log_prefix my_experiment \
  --issue_1 0 --issue_2 2
```

Additional Features:
- Few-shot prompting with demonstration selection
- Self-consistency evaluation (set `--self_consistent N` and `--temperature 1.0`)
- Cache-based reproducibility for random hints (`--cache_path`)
- W&B logging integration
Required:
- `--model-name`: Model identifier (e.g., `deepseek-chat`)
- `--dataset`: Dataset name (`winobias`, `alice`, `bbq`, `hotpotqa`, `csqa`, `folio`, `gpqa`, `musr`)
- `--method`: Reasoning method (`lot`, `lot'`, `echo`, `expand`, `cot`, `direct`, `rar`, `ltm`)
- `--seed`: Random seed for reproducibility
- `--version`: Experiment version identifier
- `--maxrepeat`: Number of repetitions (`main/run.py` only)
Optional:
- `--reasoner`: Reasoning framework (`none`, `tot`, `got`, `react`; default: `none`)
- `--concurrency`: Number of parallel API calls (default: `32`)
- `--temperature`: Sampling temperature (default: `0.0`; must be `1.0` for self-consistency)
- `--model-provider`: API provider to use (default: `AIMLAPI`)
- `--lot-hint`: Enable LoT hint for reasoning frameworks
- `--post-tag`: Additional tag for output directory naming
- `--debug`: Enable debug mode
WinoBias with LoT:

```
python main/run.py --model-name deepseek-chat --dataset winobias --method lot --seed 0 --version 1 --maxrepeat 1
```

HotpotQA with CoT + Tree-of-Thought:

```
python main/run.py --model-name deepseek-chat --dataset hotpotqa --method cot --reasoner tot --seed 0 --version 2 --maxrepeat 1 --concurrency 8
```

BBQ (Nationality) with LoT + Graph-of-Thought:

```
python main/run.py --model-name deepseek-chat --dataset bbq-nationality --method lot --reasoner got --seed 0 --version 2 --maxrepeat 1
```

Alice with ReACT reasoning:

```
python main/run.py --model-name deepseek-chat --dataset alice --method cot --reasoner react --seed 0 --version 1 --maxrepeat 1
```

```
LoT-2026/
├── lot/                          # Core LoT framework
│   ├── dataloader/               # Dataset adapters
│   │   ├── base.py               # Base dataset adapter
│   │   ├── winobias.py           # WinoBias dataset
│   │   ├── alice.py              # Alice dataset
│   │   ├── bbq.py                # BBQ dataset
│   │   ├── hotpot.py             # HotpotQA dataset
│   │   └── ...
│   ├── fm.py                     # Foundation Model API wrapper
│   ├── models.py                 # Additional model implementations
│   ├── io_utils.py               # I/O utilities
│   ├── main_generator.py         # Main generation logic
│   ├── sc_generator.py           # Self-consistency generator
│   ├── reasoning_framework.py    # ToT, GoT, ReACT implementations
│   ├── tot_general.py            # Tree-of-Thought utilities
│   └── Batch_test.py             # Batch testing utilities
├── main/                         # Main experiment scripts
│   ├── run.py                    # Entry point for experiments
│   ├── cache/                    # Experiment cache (gitignored)
│   └── main.yaml                 # Configuration file
├── selfconsistency/              # Self-consistency experiments
│   ├── basic_selfconsistency.py  # Self-consistency runner
│   ├── cache/                    # Self-consistency cache
│   └── sc.yaml                   # Self-consistency config
├── winoControl/                  # Advanced WinoBias experiments
│   └── run.py                    # WinoBias control runner
├── requirements.txt              # Python dependencies
├── secret.yaml.example           # API key template
└── README.md                     # This file
```
Results are saved to:

```
main/<dataset>/<timestamp>_<dataset>_<model>_<reasoner>_<method>_seed<seed>_v<version>/
├── task_config.json  # Experiment configuration
└── repeat_*.csv      # Results for each repetition
```

Cached intermediate results are stored in:

```
main/cache/<dataset>_<model>_<reasoner>_<method>_seed<seed>_v<version>*/
└── repeat_*.csv
```

For self-consistency experiments, results are saved to:

```
selfconsistency/<dataset>/<timestamp>_<dataset>_<model>_<method>_seed<seed>_v<version>/
├── task_config.json  # Experiment configuration
└── repeat_*.csv      # Multiple samples for aggregation
```
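The per-repetition CSVs can be aggregated with a few lines of Python. This is a sketch: the column name `correct` is an assumption for illustration; check the headers of your actual `repeat_*.csv` files.

```python
import csv
import glob

def mean_accuracy(run_dir: str, col: str = "correct") -> float:
    """Average a 0/1 correctness column over all repeat_*.csv files
    in a run directory. The column name is an assumed placeholder."""
    values = []
    for path in sorted(glob.glob(f"{run_dir}/repeat_*.csv")):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                values.append(float(row[col]))
    return sum(values) / len(values) if values else 0.0
```

Pointing this at a run directory such as `main/winobias/...` yields the accuracy averaged over repetitions.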
For advanced WinoBias experiments with controlled implicitness:

```
python winoControl/run.py \
  --model_name deepseek-chat \
  --method_name lot \
  --method_prompt "Please **observe**, **expand**, and **echo** all the relevant information based on the question." \
  --log_prefix my_experiment \
  --issue_1 0 --issue_2 2
```

Advanced Usage:

```
# With few-shot demonstrations (selecting demos with correct echo/expand behaviors)
python winoControl/run.py \
  --model_name deepseek-chat \
  --method_name lot \
  --fewshot_k 5 \
  --fewshot_demon demonstrations.csv \
  --fewshot_selection_criteria echo_expand_correct \
  --fewshot_demon_type anti \
  --log_prefix lot_fewshot

# With self-consistency
python winoControl/run.py \
  --model_name deepseek-chat \
  --method_name cot \
  --temperature 1.0 \
  --self_consistent 5 \
  --log_prefix cot_self_consistency
```

If you find this repo helpful, please consider citing our paper:
```
@inproceedings{lot2026,
  title={On the Thinking-Language Modeling Gap in Large Language Models},
  author={Liu, Chenxi and Chen, Yongqiang and Liu, Tongliang and Cheng, James and Han, Bo and Zhang, Kun},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}
```

This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
For questions or issues, please contact cscxliu@comp.hkbu.edu.hk.