An implementation of structured reasoning based on "The Molecular Structure of Thought" (Chen et al., 2026).
Standard chain-of-thought prompting tells a model to "think step by step" — but it doesn't control how the model thinks. The result is often a single linear argument that anchors on its first framing, never questions its own assumptions, and produces a hedged conclusion full of bullet points.
The paper by Chen et al. studied the reasoning traces of strong models (R1-class) and found that their thinking isn't free-form — it follows a molecular structure with distinct, recurring reasoning behaviors. These behaviors form predictable transition patterns that can be modeled as a Markov chain.
This agent makes that structure explicit. Instead of hoping the model reasons well, it tells the model what type of reasoning to perform at each step, guided by transition probabilities extracted from real reasoning traces.
The paper identifies four fundamental reasoning behaviors ("bonds") that strong reasoners alternate between:
| Bond | What It Does | Why It Matters |
|---|---|---|
| EXPLORE | Generates 2-3 structurally different framings of the problem | Prevents anchoring on the first interpretation |
| DEEP | Extends the current argument beyond the obvious — surfaces hidden assumptions and causal links | Prevents shallow reasoning that stops at the first plausible answer |
| REFLECT | Returns to a specific prior step and surgically interrogates it | Catches flawed assumptions before they propagate into the conclusion |
| NORMAL | Applies established logic directly — calculates, executes, moves forward | Not every step needs exploration; sometimes you just need to do the math |
Standard CoT: "Think step by step" --> Linear argument --> Hedged answer
Molecular CoT: EXPLORE (frame the problem 3 ways)
--> DEEP (push the strongest framing further)
--> REFLECT (go back and stress-test step 2)
--> DEEP (extend with the corrected reasoning)
--> NORMAL (apply and calculate)
--> Conclusion (synthesize the trajectory)
The key insight from the paper: the sequence of bond types matters as much as the content. Models that explore before committing, go deep before reflecting, and reflect before concluding produce structurally stronger arguments — even when the raw token content is similar.
The agent uses a Markov transition matrix (from paper Figure 5) to probabilistically select the next bond type based on the current one:
| From \ To | NORMAL | DEEP | REFLECT | EXPLORE |
|---|---|---|---|---|
| NORMAL | 0.74 | 0.10 | 0.05 | 0.11 |
| DEEP | 0.32 | 0.21 | 0.10 | 0.37 |
| REFLECT | 0.35 | 0.10 | 0.17 | 0.38 |
| EXPLORE | 0.31 | 0.11 | 0.10 | 0.48 |
This means, for example, after an EXPLORE step there's a 48% chance of another EXPLORE, 31% chance of NORMAL, 11% of DEEP, and 10% of REFLECT — matching the patterns observed in strong reasoning models. You can replace this with your own matrix estimated from your own traces.
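As a rough illustration of how the next bond could be drawn from that matrix, here is a minimal sketch. The names `BONDS`, `TRANSITIONS`, and `next_bond` are hypothetical and are not taken from `agent.py`; only the probabilities come from the matrix above.

```python
import random

# Column order used for every row below.
BONDS = ["NORMAL", "DEEP", "REFLECT", "EXPLORE"]

# Rows are keyed by the current bond; values are copied from the matrix above.
TRANSITIONS = {
    "NORMAL":  [0.74, 0.10, 0.05, 0.11],
    "DEEP":    [0.32, 0.21, 0.10, 0.37],
    "REFLECT": [0.35, 0.10, 0.17, 0.38],
    "EXPLORE": [0.31, 0.11, 0.10, 0.48],
}


def next_bond(current, rng=random):
    """Sample the next bond type, weighted by the row for the current bond."""
    return rng.choices(BONDS, weights=TRANSITIONS[current], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so repeated runs give the same walk
    bond = "EXPLORE"
    walk = [bond]
    for _ in range(5):
        bond = next_bond(bond, rng)
        walk.append(bond)
    print(" --> ".join(walk))
```

Passing a seeded `random.Random` instance keeps a run reproducible, which may help when comparing trajectories across providers.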
After reasoning completes, the agent checks the bond distribution and warns about imbalances:
- Low Self-Reflection (< 10%) — conclusions may rest on unchecked assumptions
- Low Self-Exploration (< 10%) — may have anchored on first framing
- Low Deep Reasoning (< 15%) — argument may be shallow
These thresholds come from the paper's analysis of what separates strong reasoning traces from weak ones.
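The check itself is simple to sketch. The version below is an assumption about how such a validation might look, not the code in `agent.py`; the helper names `bond_distribution` and `structural_warnings` are hypothetical, while the thresholds and messages mirror the list above.

```python
from collections import Counter

# Minimum share of steps per bond type, as listed above.
THRESHOLDS = {"REFLECT": 0.10, "EXPLORE": 0.10, "DEEP": 0.15}

MESSAGES = {
    "REFLECT": "Low Self-Reflection: conclusions may rest on unchecked assumptions",
    "EXPLORE": "Low Self-Exploration: may have anchored on first framing",
    "DEEP": "Low Deep Reasoning: argument may be shallow",
}

ALL_BONDS = ("EXPLORE", "DEEP", "REFLECT", "NORMAL")


def bond_distribution(bonds):
    """Return the fraction of steps spent on each bond type."""
    counts = Counter(bonds)
    total = len(bonds)
    if total == 0:
        return {b: 0.0 for b in ALL_BONDS}
    return {b: counts.get(b, 0) / total for b in ALL_BONDS}


def structural_warnings(bonds):
    """List a warning for every bond type that falls below its threshold."""
    dist = bond_distribution(bonds)
    return [MESSAGES[b] for b, limit in THRESHOLDS.items() if dist[b] < limit]


if __name__ == "__main__":
    trace = ["EXPLORE", "NORMAL", "NORMAL", "NORMAL"]
    for warning in structural_warnings(trace):
        print(warning)
```

An empty trajectory is treated here as having zero share for every bond, so all three warnings fire; the real agent may handle that case differently.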
The agent supports two modes so you can empirically test whether structured reasoning improves answers for your use case:
- --mode cot (default): runs the full molecular reasoning pipeline, then synthesizes a conclusion
- --mode direct: sends the question straight to the LLM with no scaffolding
In our testing, CoT mode produces more opinionated, decisive answers that name specific uncertainties. Direct mode tends to produce longer, more generic responses that hedge with lists and frameworks instead of committing to a position.
git clone https://github.com/YOUR_USERNAME/molecular-cot.git
cd molecular-cot
pip install -r requirements.txt
cp .env.example .env # add your API key(s)

Run with any provider:
# Molecular CoT (structured reasoning)
python agent.py "Should a startup with 12 months runway cut costs or raise?" --provider openrouter
# Direct mode (single-shot, no CoT)
python agent.py "Should a startup with 12 months runway cut costs or raise?" --provider openrouter --mode direct

| Provider | Key needed | Default model |
|---|---|---|
| anthropic | ANTHROPIC_API_KEY | claude-sonnet-4-6 |
| openai | OPENAI_API_KEY | gpt-4o |
| openrouter | OPENROUTER_API_KEY | qwen/qwen3-30b-a3b |
| gemini | GOOGLE_API_KEY | gemini-2.0-flash |
| ollama | None (local) | llama3.1 |
Set your key(s) in .env or export them:
export OPENROUTER_API_KEY=sk-or-v1-...

# Choose provider and model
python agent.py "Your question" --provider openrouter
python agent.py "Your question" --provider openai --model gpt-4o-mini
# Control reasoning depth
python agent.py "Your question" --provider openrouter --steps 6
# Compare CoT vs direct
python agent.py "Your question" --provider openrouter --mode cot
python agent.py "Your question" --provider openrouter --mode direct
# Compare providers side by side
python agent.py "Your question" --compare anthropic openai --output results.json
# Save output to JSON
python agent.py "Your question" --provider openrouter --output result.json

from agent import MolecularCoTAgent, create_backend
backend = create_backend("openrouter") # or "anthropic", "openai", etc.
agent = MolecularCoTAgent(backend)
# Structured reasoning
result = agent.run("Your question here", max_steps=8)
print(result["trajectory"]["conclusion"])
print(result["trajectory"]["bond_distribution"])
print(result["warnings"])
# Direct (no CoT) for comparison
direct = agent.run_direct("Your question here")
print(direct["answer"])

Replace the default Markov chain with one estimated from your own traces:
# Reuses `backend` from the example above
from agent import MolecularCoTAgent, TransitionGraph, BondType
my_traces = [
[BondType.EXPLORE, BondType.DEEP, BondType.REFLECT, BondType.DEEP],
[BondType.EXPLORE, BondType.EXPLORE, BondType.DEEP, BondType.REFLECT],
]
graph = TransitionGraph.estimate_from_traces(my_traces)
agent = MolecularCoTAgent(backend, graph=graph)

{
"task": "...",
"backend": "openrouter/qwen/qwen3-30b-a3b",
"trajectory": {
"steps": [
{"step": 1, "bond": "EXPLORE", "reflects_on": null, "content": "..."},
{"step": 2, "bond": "DEEP", "reflects_on": null, "content": "..."},
{"step": 3, "bond": "REFLECT", "reflects_on": 2, "content": "..."}
],
"conclusion": "...",
"bond_distribution": {"EXPLORE": 0.2, "DEEP": 0.5, "REFLECT": 0.2, "NORMAL": 0.1}
},
"warnings": []
}

Warnings fire when the trajectory is structurally imbalanced:
- Low Self-Reflection (< 10%) — conclusions may rest on unchecked assumptions
- Low Self-Exploration (< 10%) — may have anchored on first framing
- Low Deep Reasoning (< 15%) — argument may be shallow
# Unit tests (no API calls, fast)
pytest test_agent.py -v -m "not integration"
# Integration tests (real API calls)
pytest test_agent.py -v -m "integration"
# All tests
pytest test_agent.py -v

.
├── agent.py # Core agent, backends, CLI
├── test_agent.py # Unit + integration tests
├── requirements.txt # Dependencies
├── .env.example # Template for API keys
└── .gitignore
- Start with EXPLORE: the agent always opens by generating multiple framings of the problem
- Transition: the Markov graph probabilistically picks the next bond type based on the current one
- Prompt per bond: each step uses a bond-specific prompt that constrains the LLM to only perform that type of reasoning (no summarizing, no concluding early)
- Safety net: if no REFLECT step has occurred by step max_steps - 2, one is forced to prevent unchecked conclusions
- Early exit: if the model produces a convergence signal (FINAL ANSWER:, \boxed{}), reasoning stops early
- Conclude: a separate conclusion prompt synthesizes the full trajectory into a direct answer
- Validate: the bond distribution is checked against paper thresholds and structural warnings are emitted
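The first five steps can be sketched as a single loop. This is an illustrative reading of the list above, not the actual control flow in `agent.py`: `run_trajectory`, `llm_step`, and `CONVERGENCE_SIGNALS` are hypothetical names, the safety net is interpreted as "force REFLECT at step `max_steps - 2` if none has occurred yet", and the conclusion and validation steps are left out.

```python
import random

BONDS = ["NORMAL", "DEEP", "REFLECT", "EXPLORE"]

# Default transition rows (columns follow BONDS order), from the paper's Figure 5.
TRANSITIONS = {
    "NORMAL":  [0.74, 0.10, 0.05, 0.11],
    "DEEP":    [0.32, 0.21, 0.10, 0.37],
    "REFLECT": [0.35, 0.10, 0.17, 0.38],
    "EXPLORE": [0.31, 0.11, 0.10, 0.48],
}

# Strings treated as "the model has converged".
CONVERGENCE_SIGNALS = ("FINAL ANSWER:", "\\boxed{")


def run_trajectory(task, llm_step, max_steps=8, rng=random):
    """Run up to max_steps bond-typed steps and return them as a list of dicts.

    llm_step(task, bond, steps) is a caller-supplied function that returns the
    text for one step; it stands in for a bond-specific prompt to a real LLM.
    """
    steps = []
    bond = "EXPLORE"  # always open with multiple framings
    for i in range(1, max_steps + 1):
        # Safety net: force a REFLECT step if none has happened by max_steps - 2.
        no_reflect_yet = all(s["bond"] != "REFLECT" for s in steps)
        if i > 1 and i == max_steps - 2 and no_reflect_yet:
            bond = "REFLECT"

        content = llm_step(task, bond, steps)
        steps.append({"step": i, "bond": bond, "content": content})

        # Early exit once the model signals a final answer.
        if any(signal in content for signal in CONVERGENCE_SIGNALS):
            break

        # Markov transition: sample the next bond from the current bond's row.
        bond = rng.choices(BONDS, weights=TRANSITIONS[bond], k=1)[0]
    return steps


if __name__ == "__main__":
    def fake_llm(task, bond, steps):
        # Stub that never converges; a real backend call would go here.
        return f"[{bond}] thinking about: {task}"

    for step in run_trajectory("Cut costs or raise?", fake_llm, rng=random.Random(1)):
        print(step["step"], step["bond"])
```

Keeping the LLM call behind a plain function argument makes the loop testable without API calls, in the same spirit as the unit tests marked "not integration".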
The paper (Section 5.2) found that merging outputs from two agents with incompatible reasoning structures causes performance collapse, even when the token-level content is similar (Pearson correlation above 0.9 on tokens, but below 0.8 on bond distributions).
The compare_backends function checks this automatically:
from agent import compare_backends
result = compare_backends(
task="...",
providers=["anthropic", "openai"],
)
# result["structural_compatibility"] = {"anthropic vs openai": True/False}

If two backends are structurally incompatible, don't merge their outputs; pick the better one.
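One plausible way to compute such a flag is a Pearson correlation between two bond distributions. The sketch below assumes that approach and assumes 0.8 as the cutoff, based on the figure quoted from Section 5.2; it is not necessarily what `compare_backends` does internally, and `pearson` and `structurally_compatible` are hypothetical names.

```python
import math

BONDS = ["NORMAL", "DEEP", "REFLECT", "EXPLORE"]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a flat distribution
    return cov / math.sqrt(var_x * var_y)


def structurally_compatible(dist_a, dist_b, threshold=0.8):
    """Compare two bond distributions; the 0.8 threshold is an assumption."""
    a = [dist_a.get(b, 0.0) for b in BONDS]
    b = [dist_b.get(b, 0.0) for b in BONDS]
    return pearson(a, b) >= threshold


if __name__ == "__main__":
    explorer = {"NORMAL": 0.1, "DEEP": 0.1, "REFLECT": 0.1, "EXPLORE": 0.7}
    executor = {"NORMAL": 0.7, "DEEP": 0.1, "REFLECT": 0.1, "EXPLORE": 0.1}
    print(structurally_compatible(explorer, explorer))
    print(structurally_compatible(explorer, executor))
```

With only four bond types the correlation is computed over four points, so it is a coarse signal; treat a borderline value as a prompt to inspect the trajectories rather than as a verdict.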
- Chen et al., "The Molecular Structure of Thought", 2026 — the paper this implementation is based on
- Section 3.2: Bond type definitions and transition analysis
- Section 5.2: Structural compatibility and performance collapse
- Figure 5: Transition probability matrix used as the default graph
MIT