This repository contains the official implementation of the paper "On the Thinking-Language Modeling Gap in Large Language Models" (ICLR 2026).
This work identifies a critical gap in Large Language Models (LLMs): despite their success in imitating human reasoning through Chain-of-Thought (CoT), LLMs trained on massive human language corpora struggle with tasks requiring careful handling of implicit expressions.
Human language is primarily a tool for communication rather than thinking (Fedorenko et al., 2024). Consequently, written language contains implicit expressions—information patterns that occur less frequently during training due to human preferences in language organization. This creates a fundamental mismatch:
- Language Modeling Bias: Next-token prediction on non-topological language order (e.g., conclusion before all premises) causes models to learn marginal distributions rather than joint dependencies (Proposition 2.3).
- Implicit Expression Blindness: LLMs overlook critical information when it is expressed implicitly rather than explicitly (Theorem 2.4).
- Shortcut Reasoning: Models tend to use incomplete context when expressions do not align with the underlying causal structure of the reasoning task.
Figure 1: Structural Causal Models of next-token prediction. When language is not in topological order (right), LLMs learn biased dependencies.
When encountering natural language in anti-topological order (e.g., a conclusion stated before its premises), the model marginalizes over the unseen premises and learns marginal rather than joint dependencies. Section 2 of the paper formalizes this by defining random vectors for the premises and their linguistic expressions and deriving the resulting bias term (see Proposition 2.3 and Theorem 2.4).
We propose LoT (Language-of-Thought) prompting to mitigate the gap by explicitly instructing models to process all information before reasoning. The prompt-level intervention (Section 3.2):
"Please observe, expand, and echo all the relevant information based on the question"
- Observe: Identify all relevant information from the question (improves context $q$).
- Expand: Elaborate on each piece to make implicit content explicit (improves expression $L$).
- Echo: Restate the elaborated information before reasoning (refreshes the context).
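As a minimal illustration, the intervention amounts to appending the LoT instruction to the question before querying the model. The helper below is a hypothetical sketch; the actual prompt templates used by this repo are configured in main/main.yaml.

```python
# Hypothetical sketch of the LoT prompt-level intervention.
# The real templates live in main/main.yaml; this only shows the idea.
LOT_HINT = ("Please observe, expand, and echo all the relevant "
            "information based on the question")

def build_lot_prompt(question: str) -> str:
    """Wrap a question with the observe-expand-echo instruction."""
    return f"{question}\n\n{LOT_HINT}."

prompt = build_lot_prompt("Who does 'she' refer to in the sentence above?")
```

The resulting string is then sent to the model as a single user turn, in place of the usual "Let's think step by step" CoT suffix.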
Theoretical Motivation: LoT aims to minimize the language-thought gap by jointly improving the expression ($L$) and context ($q$) components identified in Section 2.
Figure 2: Accuracy patterns on WinoControl under different levels of L-implicitness and q-implicitness. CoT shows significant performance drops as implicitness increases (left), while LoT interventions effectively mitigate these biases (center and right).
LoT vs. Chain-of-Thought (CoT) on bias and reasoning benchmarks (Table 2, 3, 4):
| Benchmark | Metric | CoT | LoT | Improvement |
|---|---|---|---|---|
| WinoBias (Gender Bias) | Consistency (DeepSeek-V3) | 86.9% | 90.7% | +3.8% |
| WinoBias (Gender Bias) | Consistency (GPT-4o-mini) | 71.5% | 72.5% | +1.0% |
| Alice (Simple Math) | Accuracy (Avg. across 4 models) | 31.8% | 40.1% | +8.3% |
| BBQ (Social Bias) | Avg. Accuracy (DeepSeek-V3) | 87.1% | 89.7% | +2.6% |
| BBQ (Social Bias) | Avg. Accuracy (GPT-4o-mini) | 67.9% | 84.9% | +17.0% |
Note: WinoBias consistency measures agreement between pro-stereotype and anti-stereotype examples. Alice benchmark uses simple "N brothers and M sisters" questions where CoT often fails.
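The consistency metric can be sketched as follows, assuming the model's pronoun resolutions are collected as paired lists for the pro- and anti-stereotype variants of each example (the pairing scheme here is an assumption for illustration, not the repo's exact evaluation code):

```python
def winobias_consistency(pro_answers, anti_answers):
    """Fraction of example pairs where the model resolves the pronoun
    to the same entity in both the pro- and anti-stereotype variants."""
    assert len(pro_answers) == len(anti_answers)
    agree = sum(p == a for p, a in zip(pro_answers, anti_answers))
    return agree / len(pro_answers)

# 3 of the 4 pairs below agree, so consistency is 0.75.
score = winobias_consistency(["doctor", "nurse", "doctor", "nurse"],
                             ["doctor", "nurse", "doctor", "doctor"])
```

A perfectly unbiased model answers identically regardless of stereotype direction, so higher consistency indicates lower reliance on gender shortcuts.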
We introduce WinoControl, a controlled evaluation framework based on WinoBias that manipulates two types of implicitness:
Table 1: Construction of WinoControl datasets for controlling L-implicitness (expression level) and q-implicitness (context level).
For a premise expressed in natural language, we define two types of implicitness:
- L-implicitness (Expression Implicitness): how explicitly the answer-relevant information is expressed in the sentence itself, ranging from a determinative hint to no hint at all.
- q-implicitness (Context Implicitness): how much irrelevant, distracting content surrounds the relevant premise in the context.
Finding: As shown in Figure 2, as L-implicitness and q-implicitness increase (moving toward hard), CoT accuracy drops significantly, while LoT maintains robust performance. Echo intervention is more effective for high q-implicitness (upper right in Figure 2b), while Expand intervention is more effective for high L-implicitness (bottom left in Figure 2c), confirming their theoretical roles in minimizing different components of the language-thought gap.
LoT can be combined with complex reasoning frameworks and shows consistent improvements:
Table 5: Results on HotpotQA with advanced reasoning protocols (ToT, GoT, ReAct). LoT presents improvement in 9 out of 12 cases compared to CoT, with consistent gains in Tree-of-Thought (ToT) settings.
- LoT + ToT/GoT/ReAct: Outperforms CoT variants on HotpotQA (Table 5 above). LoT shows consistent improvements in Tree-of-Thought (ToT), which is the state-of-the-art method for this benchmark.
- Self-Consistency: LoT achieves higher consistency with fewer samples than CoT.
Figure 4: Self-consistency results on WinoBias. LoT+SC achieves higher consistency with fewer tokens compared to CoT+SC (e.g., LoT with R=4 outperforms CoT with R=16 while costing less than half the tokens).
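Self-consistency aggregates R sampled completions by majority vote over their final answers. A minimal sketch of the aggregation step (independent of the repo's sc_generator.py implementation):

```python
from collections import Counter

def majority_vote(samples):
    """Return the most frequent final answer among R sampled completions."""
    return Counter(samples).most_common(1)[0][0]

# With R=4 samples, the modal answer wins even if individual chains disagree.
answer = majority_vote(["A", "B", "A", "A"])  # -> "A"
```

Because LoT chains agree with each other more often, the vote converges with a smaller R, which is where the token savings in Figure 4 come from.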
This repo provides a unified interface for evaluating different reasoning methods across multiple datasets and language models.
- LoT: Language-of-Thought reasoning with observe-expand-echo paradigm.
- Echo: Ablation using only the echo component (targets q-implicitness).
- Expand: Ablation using only the expand component (targets L-implicitness).
- CoT: Chain-of-Thought reasoning ("Let's think step by step").
- Direct: Direct prompting without explicit reasoning.
- RaR: Rephrase and Respond.
- LtM: Least-to-Most prompting
- None: Standard single-pass reasoning.
- ToT: Tree-of-Thought (Yao et al., 2023).
- GoT: Graph-of-Thought (Besta et al., 2024).
- ReACT: ReAct-style reasoning with tool use (Wikipedia API).
- WinoBias: Gender bias evaluation (Zhao et al., 2018).
- WinoControl: Controlled WinoBias with manipulated implicitness levels.
- Alice: Simple math problems testing basic reasoning (Nezhurina et al., 2024).
- BBQ: Bias Benchmark for QA (Age, Nationality, Religion subcategories).
- HotpotQA: Multi-hop question answering.
- CSQA: CommonsenseQA.
- FOLIO: First-Order Logic reasoning.
- GPQA: Graduate-Level Google-Proof Q&A.
- MUSR: Multi-step soft reasoning.
- Clone this repository:

```
git clone https://github.com/tmlr-group/LoT-2026.git
cd LoT-2026
```

- Install dependencies:

```
pip install -r requirements.txt
```

This project requires API keys for various LLM providers. Copy the example configuration file and add your credentials:

```
cp secret.yaml.example secret.yaml
```

Edit secret.yaml and replace the placeholder API keys with your actual credentials. The file supports multiple providers:
- OpenAI
- DeepSeek
- Qwen (Alibaba/Dashscope)
- AIMLAPI
- OpenRouter
- BigModels
- SiliconFlow
- TogetherAI
Important: Never commit secret.yaml to version control. It is already included in .gitignore.
The main configuration is in main/main.yaml, which includes:
- Dataset paths and settings
- Model parameters (temperature, max_tokens, concurrency)
- Prompt templates for different reasoning methods
- LoT hint configuration
Run an experiment with a specific model, dataset, and method:
```
python main/run.py \
  --model-name deepseek-chat \
  --dataset winobias \
  --method lot \
  --seed 0 \
  --version 1 \
  --maxrepeat 1 \
  --concurrency 8
```

For self-consistency evaluation (requires `--temperature 1.0`):

```
python selfconsistency/basic_selfconsistency.py \
  --model-name deepseek-chat \
  --dataset winobias \
  --method lot \
  --seed 0 \
  --version 1 \
  --concurrency 8
```

WinoControl is an advanced WinoBias experiment runner that studies the effect of implicit information on model reasoning. It controls the two types of implicitness defined in Table 1 of the paper:
Key Parameters:
- `--issue_1`: q-implicitness (Context Implicitness), controls irrelevant distracting sentences:
  - `0`: No irrelevant sentences (baseline, easy)
  - `1`: Low (2 random irrelevant sentences)
  - `2`: High (4 random irrelevant sentences, hard)
- `--issue_2`: L-implicitness (Expression Implicitness), controls how the answer is expressed:
  - `0`: Determinative hint added (correct answer explicitly indicated, easy)
  - `1`: Partially informative hint (medium)
  - `2`: No hint (original sentence, hard)
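To sweep the full 3x3 grid of implicitness levels, the corresponding winoControl/run.py invocations can be generated programmatically. A sketch (the model, method, and log prefix below are placeholders; adapt them to your setup):

```python
from itertools import product

# Base invocation; adjust model/method/prefix to your experiment.
BASE = ("python winoControl/run.py --model_name deepseek-chat "
        "--method_name lot --log_prefix grid_sweep")

# One command per (q-implicitness, L-implicitness) combination, 9 in total.
commands = [f"{BASE} --issue_1 {q} --issue_2 {l}"
            for q, l in product(range(3), range(3))]

for cmd in commands:
    print(cmd)  # or launch via subprocess.run(cmd.split())
```

This reproduces the grid of conditions behind Figure 2, where each cell is one (issue_1, issue_2) pair.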
Basic Usage:
```
python winoControl/run.py \
  --model_name deepseek-chat \
  --method_name lot \
  --method_prompt "Please **observe**, **expand**, and **echo** all the relevant information based on the question." \
  --log_prefix my_experiment \
  --issue_1 0 --issue_2 2
```

Additional Features:
- Few-shot prompting with demonstration selection
- Self-consistency evaluation (set `--self_consistent N` and `--temperature 1.0`)
- Cache-based reproducibility for random hints (`--cache_path`)
- W&B logging integration
Required:
- `--model-name`: Model identifier (e.g., `deepseek-chat`)
- `--dataset`: Dataset name (`winobias`, `alice`, `bbq`, `hotpotqa`, `csqa`, `folio`, `gpqa`, `musr`)
- `--method`: Reasoning method (`lot`, `lot'`, `echo`, `expand`, `cot`, `direct`, `rar`, `ltm`)
- `--seed`: Random seed for reproducibility
- `--version`: Experiment version identifier
- `--maxrepeat`: Number of repetitions (`main/run.py` only)
Optional:
- `--reasoner`: Reasoning framework (`none`, `tot`, `got`, `react`; default: `none`)
- `--concurrency`: Number of parallel API calls (default: `32`)
- `--temperature`: Sampling temperature (default: `0.0`; must be `1.0` for self-consistency)
- `--model-provider`: API provider to use (default: `AIMLAPI`)
- `--lot-hint`: Enable LoT hint for reasoning frameworks
- `--post-tag`: Additional tag for output directory naming
- `--debug`: Enable debug mode
WinoBias with LoT:

```
python main/run.py --model-name deepseek-chat --dataset winobias --method lot --seed 0 --version 1 --maxrepeat 1
```

HotpotQA with CoT + Tree-of-Thought:

```
python main/run.py --model-name deepseek-chat --dataset hotpotqa --method cot --reasoner tot --seed 0 --version 2 --maxrepeat 1 --concurrency 8
```

BBQ (Nationality) with LoT + Graph-of-Thought:

```
python main/run.py --model-name deepseek-chat --dataset bbq-nationality --method lot --reasoner got --seed 0 --version 2 --maxrepeat 1
```

Alice with ReACT reasoning:

```
python main/run.py --model-name deepseek-chat --dataset alice --method cot --reasoner react --seed 0 --version 1 --maxrepeat 1
```

```
LoT-2026/
├── lot/                          # Core LoT framework
│   ├── dataloader/               # Dataset adapters
│   │   ├── base.py               # Base dataset adapter
│   │   ├── winobias.py           # WinoBias dataset
│   │   ├── alice.py              # Alice dataset
│   │   ├── bbq.py                # BBQ dataset
│   │   ├── hotpot.py             # HotpotQA dataset
│   │   └── ...
│   ├── fm.py                     # Foundation Model API wrapper
│   ├── models.py                 # Additional model implementations
│   ├── io_utils.py               # I/O utilities
│   ├── main_generator.py         # Main generation logic
│   ├── sc_generator.py           # Self-consistency generator
│   ├── reasoning_framework.py    # ToT, GoT, ReACT implementations
│   ├── tot_general.py            # Tree-of-Thought utilities
│   └── Batch_test.py             # Batch testing utilities
├── main/                         # Main experiment scripts
│   ├── run.py                    # Entry point for experiments
│   ├── cache/                    # Experiment cache (gitignored)
│   └── main.yaml                 # Configuration file
├── selfconsistency/              # Self-consistency experiments
│   ├── basic_selfconsistency.py  # Self-consistency runner
│   ├── cache/                    # Self-consistency cache
│   └── sc.yaml                   # Self-consistency config
├── winoControl/                  # Advanced WinoBias experiments
│   └── run.py                    # WinoBias control runner
├── requirements.txt              # Python dependencies
├── secret.yaml.example           # API key template
└── README.md                     # This file
```
Results are saved to:

```
main/<dataset>/<timestamp>_<dataset>_<model>_<reasoner>_<method>_seed<seed>_v<version>/
├── task_config.json  # Experiment configuration
└── repeat_*.csv      # Results for each repetition
```

Cached intermediate results are stored in:

```
main/cache/<dataset>_<model>_<reasoner>_<method>_seed<seed>_v<version>*/
└── repeat_*.csv
```

For self-consistency experiments, results are saved to:

```
selfconsistency/<dataset>/<timestamp>_<dataset>_<model>_<method>_seed<seed>_v<version>/
├── task_config.json  # Experiment configuration
└── repeat_*.csv      # Multiple samples for aggregation
```
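The per-repetition CSVs can be aggregated with a few lines of Python. This is a sketch: the column name `correct` is an assumption for illustration; check the headers of your actual `repeat_*.csv` files.

```python
import csv
import glob

def mean_accuracy(run_dir: str, col: str = "correct") -> float:
    """Average a 0/1 correctness column over all repeat_*.csv files
    in a run directory. The column name is an assumed placeholder."""
    values = []
    for path in sorted(glob.glob(f"{run_dir}/repeat_*.csv")):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                values.append(float(row[col]))
    return sum(values) / len(values) if values else 0.0
```

Pointing this at a run directory such as `main/winobias/...` yields the accuracy averaged over repetitions.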
For advanced WinoBias experiments with controlled implicitness:

```
python winoControl/run.py \
  --model_name deepseek-chat \
  --method_name lot \
  --method_prompt "Please **observe**, **expand**, and **echo** all the relevant information based on the question." \
  --log_prefix my_experiment \
  --issue_1 0 --issue_2 2
```

Advanced Usage:

```
# With few-shot demonstrations (selecting demos with correct echo/expand behaviors)
python winoControl/run.py \
  --model_name deepseek-chat \
  --method_name lot \
  --fewshot_k 5 \
  --fewshot_demon demonstrations.csv \
  --fewshot_selection_criteria echo_expand_correct \
  --fewshot_demon_type anti \
  --log_prefix lot_fewshot

# With self-consistency
python winoControl/run.py \
  --model_name deepseek-chat \
  --method_name cot \
  --temperature 1.0 \
  --self_consistent 5 \
  --log_prefix cot_self_consistency
```

If you find this repo helpful, please consider citing our paper:
```
@inproceedings{lot2026,
  title={On the Thinking-Language Modeling Gap in Large Language Models},
  author={Liu, Chenxi and Chen, Yongqiang and Liu, Tongliang and Cheng, James and Han, Bo and Zhang, Kun},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}
```

This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
For questions or issues, please contact cscxliu@comp.hkbu.edu.hk.