KnowledgeGraphNode hash collision should be fixed and regression-tested #714

@3em0

KnowledgeGraphNode hash collision due to template rendering bugs

Summary

KnowledgeGraphNode.hash can collide for different knowledge graph nodes because the rendered graph identity omits actual entity and relationship values.

The issue is in backend/app/rag/retrievers/knowledge_graph/schema.py. The default templates use Jinja-style placeholders ({{ name }}), but the code renders them with Python str.format(). In Python format strings, doubled braces are escaped literals, so the supplied entity and relationship fields are ignored. In addition, _get_relationships_str() renders relationships with self.entity_template instead of self.relationship_template.
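The escaping behaviour can be checked in isolation. The sketch below uses hypothetical template strings and field values, not the actual defaults from schema.py:

```python
# Hypothetical templates for illustration; the real defaults live in schema.py.
jinja_style = "- {{ name }}: {{ description }}"
format_style = "- {name}: {description}"

fields_a = {"name": "TiDB", "description": "distributed SQL database"}
fields_b = {"name": "TiKV", "description": "distributed key-value store"}

# str.format() treats "{{" and "}}" as escaped literal braces, so the
# keyword arguments are silently ignored and both renders are identical.
old_a = jinja_style.format(**fields_a)
old_b = jinja_style.format(**fields_b)
print(old_a)           # - { name }: { description }
print(old_a == old_b)  # True

# Single-brace placeholders actually substitute the supplied fields.
new_a = format_style.format(**fields_a)
new_b = format_style.format(**fields_b)
print(new_a)           # - TiDB: distributed SQL database
print(new_a == new_b)  # False
```

Note that str.format() raises no error for unused keyword arguments, which is why the problem does not surface at render time.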

As a result, two nodes with the same query and the same number of entities/relationships can produce the same rendered identity and the same SHA-256 hash even when their actual graph content differs.
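A simplified model of the collision, assuming the hash is a SHA-256 digest over the query plus the rendered entity and relationship lines. The `identity_hash` helper, the templates, and the field names are hypothetical stand-ins, not the code in schema.py:

```python
import hashlib

# Hypothetical templates and field names; the real defaults are in schema.py.
OLD_ENTITY_TPL = "- {{ name }}: {{ description }}"
NEW_ENTITY_TPL = "- {name}: {description}"
NEW_REL_TPL = "- {rag_description}"


def identity_hash(query, entities, relationships, entity_tpl, rel_tpl):
    """SHA-256 over the query plus the rendered entities and relationships."""
    lines = [query]
    lines += [entity_tpl.format(**e) for e in entities]
    lines += [rel_tpl.format(**r) for r in relationships]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


query = "what is tidb"
graph_a = (
    [{"name": "TiDB", "description": "SQL layer"}],
    [{"rag_description": "TiDB stores data in TiKV"}],
)
graph_b = (
    [{"name": "PD", "description": "placement driver"}],
    [{"rag_description": "PD schedules TiKV regions"}],
)

# Old behaviour: Jinja-style templates, with the entity template also
# (incorrectly) used to render relationships.
old_a = identity_hash(query, *graph_a, OLD_ENTITY_TPL, OLD_ENTITY_TPL)
old_b = identity_hash(query, *graph_b, OLD_ENTITY_TPL, OLD_ENTITY_TPL)
print(old_a == old_b)  # True

# Fixed behaviour: str.format() placeholders and a separate relationship template.
new_a = identity_hash(query, *graph_a, NEW_ENTITY_TPL, NEW_REL_TPL)
new_b = identity_hash(query, *graph_b, NEW_ENTITY_TPL, NEW_REL_TPL)
print(new_a == new_b)  # False
```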

Affected Area

  • backend/app/rag/retrievers/knowledge_graph/schema.py
  • KnowledgeGraphNode.hash
  • KnowledgeGraphNode.get_content()
  • KnowledgeGraphNode._get_entities_str()
  • KnowledgeGraphNode._get_relationships_str()

Security Impact

The hash currently identifies the shape of the rendered graph more than the graph semantics. In RAG flows that deduplicate or track nodes by node.hash, this can merge or confuse semantically different knowledge graph retrieval results.

Potential impact:

  • RAG context confusion between different knowledge graph results.
  • Incorrect deduplication of semantically different nodes.
  • Loss of auditability because logged/rendered node content does not reflect the actual retrieved entities and relationships.
  • Integrity impact on downstream reranking, tracing, or fusion logic that relies on stable node identity.
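To illustrate the deduplication case, here is a minimal sketch of a hash-keyed dedup step. `RetrievedNode` and `dedupe_by_hash` are hypothetical and only approximate what a downstream consumer of node.hash might do:

```python
from dataclasses import dataclass


@dataclass
class RetrievedNode:
    """Hypothetical stand-in for a retrieved knowledge graph node."""
    hash: str
    content: str


def dedupe_by_hash(nodes):
    """Keep the first node seen for each hash value."""
    seen = {}
    for node in nodes:
        seen.setdefault(node.hash, node)
    return list(seen.values())


# Two semantically different results that share a colliding hash.
colliding = [
    RetrievedNode(hash="same-digest", content="TiDB stores data in TiKV"),
    RetrievedNode(hash="same-digest", content="PD schedules TiKV regions"),
]
kept = dedupe_by_hash(colliding)
print(len(kept))        # 1
print(kept[0].content)  # TiDB stores data in TiKV
```

The second result is dropped silently, so nothing downstream can tell that distinct graph content was retrieved.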

Verification

Local verification against upstream main produced the following minimal result:

old_render_equal= True
old_hash_a= 0e6d70c48929a38456ea8326315137999721012ef40930cccccbdd4208024c50
old_hash_b= 0e6d70c48929a38456ea8326315137999721012ef40930cccccbdd4208024c50
new_render_equal= False
new_hash_a= 95d22b15482ec725c5216f3eab098363d0bd4d117c921234381b5061ca97348d
new_hash_b= 59619f84dc7dcabb93fc007b2a0b4d99652439cc5cec48a801c3b13b254770c9

Proposed Fix

PR #709 already contains the required fix:

  • Replace Jinja-style placeholders with Python str.format() placeholders in the default entity and relationship templates.
  • Use self.relationship_template inside _get_relationships_str().
  • Add or keep a regression test that constructs two graph nodes with the same query/counts but different entity/relationship data and asserts different hash values.
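A regression test along these lines might look like the sketch below. `SketchKGNode` is a hypothetical stand-in with the fix applied; the real test in PR #709 would construct KnowledgeGraphNode itself, whose constructor arguments and field names are not shown here:

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class SketchKGNode:
    """Hypothetical stand-in for KnowledgeGraphNode, with the fix applied."""
    query: str
    entities: list = field(default_factory=list)
    relationships: list = field(default_factory=list)
    entity_template: str = "- {name}: {description}"
    relationship_template: str = "- {rag_description}"

    def get_content(self) -> str:
        entities = "\n".join(self.entity_template.format(**e) for e in self.entities)
        # Relationships are rendered with relationship_template, not entity_template.
        relationships = "\n".join(
            self.relationship_template.format(**r) for r in self.relationships
        )
        return f"{self.query}\n{entities}\n{relationships}"

    @property
    def hash(self) -> str:
        return hashlib.sha256(self.get_content().encode("utf-8")).hexdigest()


def test_hash_differs_for_different_graph_content():
    # Same query and same entity/relationship counts, different data.
    node_a = SketchKGNode(
        query="what is tidb",
        entities=[{"name": "TiDB", "description": "SQL layer"}],
        relationships=[{"rag_description": "TiDB stores data in TiKV"}],
    )
    node_b = SketchKGNode(
        query="what is tidb",
        entities=[{"name": "PD", "description": "placement driver"}],
        relationships=[{"rag_description": "PD schedules TiKV regions"}],
    )
    assert "TiDB" in node_a.get_content()
    assert node_a.get_content() != node_b.get_content()
    assert node_a.hash != node_b.hash


test_hash_differs_for_different_graph_content()
```

Asserting on the rendered content as well as the hash makes a failure easier to diagnose if the templates regress.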

Existing PR: #709

Environment

  • Repository: pingcap/autoflow
  • Upstream main checked: c4cb19d8fa205bdd4cb38d0ac250d273fcc3e5f2
  • Fix branch checked locally: fix/kg-node-hash-collision
  • Fix commit checked locally: b0d0f82cf3ecaacb7d5514763d02aeeebd33b331
