Adds RAG ADM Component and creates tagging pipeline to test it by ygefen · Pull Request #270 · ITM-Kitware/align-system

ygefen · 2026-04-23T16:12:29Z

No description provided.

ygefen · 2026-04-23T16:19:58Z

tagging_comparison_results.zip
Comparison between RAG pipeline and expert prompt pipeline for the tagging task

ygefen · 2026-04-27T14:29:34Z

tagging_comparison_results.zip
Results with the addition of tagging fewshot aligned experiments.

ygefen · 2026-04-27T14:30:22Z

This code needs refactoring once I get a thumbs up on the concept and before merge. Specifically, there is too much code repetition.

dmjoy

Generally looks pretty good to me, but I have a handful of requested changes. I'm waffling a bit on whether or not to include the tagging domain documents in the repo here. Longer term (as we accumulate more document sets) it doesn't seem sustainable, but it is more convenient and ensures we don't lose the documents we were using. Probably fine to leave them in for now (as we already include the ICL datasets in the repo).

dmjoy · 2026-04-27T15:54:34Z

+        scenario_description = call_with_coerced_args(
+            self.scenario_description_template,
+            {'scenario_state': scenario_state,
+             'rag_context': rag_context})


Interesting, so is this really the only difference with the base class? If so, I'm wondering whether it makes more sense to add it to the original class 🤔

Nvm. I think it's better as a distinct subclass, but let's make rag_context a required argument to the run function (if you're not using RAG, just use the base class).
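A minimal sketch of the suggestion above, under stated assumptions: BaseScenarioStep, RAGScenarioStep, and both template functions are hypothetical stand-ins, not the classes in this PR. The only point is that the RAG subclass declares rag_context without a default, so forgetting it fails loudly instead of silently running without retrieval.

```python
class BaseScenarioStep:
    """Hypothetical stand-in for the existing non-RAG component."""

    def __init__(self, scenario_description_template):
        self.scenario_description_template = scenario_description_template

    def run(self, scenario_state):
        return self.scenario_description_template(scenario_state=scenario_state)


class RAGScenarioStep(BaseScenarioStep):
    """Hypothetical RAG variant: rag_context has no default value."""

    def run(self, scenario_state, rag_context):
        return self.scenario_description_template(
            scenario_state=scenario_state,
            rag_context=rag_context,
        )


def plain_template(scenario_state):
    # Illustrative template for the base class.
    return f"Scenario: {scenario_state}"


def rag_template(scenario_state, rag_context):
    # Illustrative template that also renders the retrieved context.
    return f"Scenario: {scenario_state}\nReference material:\n{rag_context}"
```

With this shape, a caller that does not use RAG instantiates the base class, and calling RAGScenarioStep.run without rag_context raises a TypeError.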

dmjoy · 2026-04-27T15:57:15Z

+            alignment_target,
+            positive_icl_dialog_elements=[],
+            negative_icl_dialog_elements=[],
+            rag_context=None):


Same story here I think (making rag_context non-optional)

dmjoy · 2026-04-27T15:58:53Z

+        all_splits = text_splitter.split_documents(docs)
+
+        embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)
+        vector_store = FAISS.from_documents(all_splits, embeddings)


I'm guessing this vector_store is just in memory here, right? Or is it backed by a file / cache on disk? Not requesting any changes here, just curious (I know the documents we're using now are small, but what happens if it's a massive collection?)
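For context on the question above: to my knowledge, an index built with FAISS.from_documents lives in memory, and LangChain's FAISS wrapper offers save_local / load_local for persisting it. The sketch below does not use those calls; it shows the general caching idea with the standard library only, assuming a hypothetical build_index callable that returns a picklable object. The cache key is a hash of the document names and contents, so editing a document triggers a rebuild.

```python
import hashlib
import pickle
from pathlib import Path


def corpus_fingerprint(document_paths):
    """Hash file names and contents so any edit invalidates the cache."""
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in document_paths):
        digest.update(path.name.encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def load_or_build_index(document_paths, cache_dir, build_index):
    """Return the cached index for this corpus, or build and store it.

    build_index is a hypothetical callable: list of paths -> picklable index.
    """
    document_paths = list(document_paths)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{corpus_fingerprint(document_paths)}.pkl"

    if cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f)

    index = build_index(document_paths)
    with cache_file.open("wb") as f:
        pickle.dump(index, f)
    return index
```

For a large collection, the same keying scheme would let the expensive embedding step run once per corpus version rather than once per process.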

dmjoy · 2026-04-27T16:00:56Z

+DocumentFileListType = Iterable[DocumentFileType]
+
+
+class LangChainRAGIndexerADMComponent(ADMComponent):


I think having RAG implemented as an ADMComponent is probably OK for now, but I was thinking of it more like how we have the StructuredInferenceEngine pieces (as in the same instance could potentially be re-used by different ADM components). This might come into play for our multi-attribute ADMs where we need to do some kind of ICL retrieval for relevance computation, and then again for provided in-context examples.
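One way to picture the reuse described above is to construct the retriever once and pass the same instance to each component that needs it. Everything in this sketch is hypothetical (KeywordRetriever, RelevanceComponent, ICLExampleComponent), and the keyword-overlap scoring is a toy stand-in for a vector store lookup.

```python
class KeywordRetriever:
    """Hypothetical retriever; a real one would wrap a vector store."""

    def __init__(self, documents):
        self.documents = list(documents)
        self.query_count = 0  # only here to show that the instance is shared

    def retrieve(self, query, k=2):
        self.query_count += 1
        terms = set(query.lower().split())
        # Rank documents by how many query terms they share (toy scoring).
        ranked = sorted(
            self.documents,
            key=lambda doc: len(terms & set(doc.lower().split())),
            reverse=True,
        )
        return ranked[:k]


class RelevanceComponent:
    """Hypothetical component that retrieves for relevance computation."""

    def __init__(self, retriever):
        self.retriever = retriever

    def run(self, query):
        return self.retriever.retrieve(query, k=1)


class ICLExampleComponent:
    """Hypothetical component that retrieves in-context examples."""

    def __init__(self, retriever):
        self.retriever = retriever

    def run(self, query):
        return self.retriever.retrieve(query, k=2)
```

Because both components receive the retriever through their constructors, the index is built once and neither component needs to know how it was built.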

dmjoy · 2026-04-27T16:07:10Z

+      - /data/users/yonatan.gefen/align-system/align_system/documents/start.md
+      - /data/users/yonatan.gefen/align-system/align_system/documents/start_triage_flowchart.md
+      - /data/users/yonatan.gefen/align-system/align_system/documents/Salt.md


Better to check with Emily and Aaron here, but I don't know that we want all protocol documents available all the time (it may depend on the target, e.g. for a start target, we would use the two start documents, but not the Salt.md).

Also, it seems we already have a convention for storing / using files in the repo for this kind of thing (or at least use a /data/shared path rather than your own directory): https://github.com/ITM-Kitware/align-system/blob/main/align_system/configs/adm_component/icl/tagging.yaml#L17-L19
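A rough sketch of per-target document selection with paths resolved against a configurable directory instead of a user-specific absolute path. The file names come from the config above; the mapping itself, the "salt" key, and the documents_for_target helper are assumptions for illustration, and which documents belong to which target is still a question for the domain experts.

```python
from pathlib import Path

# Hypothetical mapping from target name to the protocol documents it should see.
DOCUMENTS_BY_TARGET = {
    "start": ["start.md", "start_triage_flowchart.md"],
    "salt": ["Salt.md"],
}


def documents_for_target(target, documents_dir):
    """Return the document paths configured for a target.

    documents_dir is whatever shared or in-repo directory holds the files.
    """
    try:
        names = DOCUMENTS_BY_TARGET[target.lower()]
    except KeyError:
        raise ValueError(f"No documents configured for target: {target}") from None
    return [Path(documents_dir) / name for name in names]
```

Called with "start", this returns only the two start documents, so Salt.md never reaches the index for that target.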

dmjoy · 2026-04-27T16:07:54Z

+      - /data/users/yonatan.gefen/align-system/align_system/documents/start.md
+      - /data/users/yonatan.gefen/align-system/align_system/documents/start_triage_flowchart.md
+      - /data/users/yonatan.gefen/align-system/align_system/documents/Salt.md


Same comment here about filepaths

dmjoy · 2026-04-27T16:10:43Z

            raise ValueError(f"Unknown target tagging protocol: {target_kdma}")
+
+
+class SimpleTaggingSystemPrompt:


Slight preference (seems more maintainable) for the rag_context based prompts to be separate classes (they can call super for __call__ and then just append the RAG context stuff to keep it a bit neater?)
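The subclassing idea above might look roughly like this. SimpleTaggingSystemPrompt is shown with a placeholder body and an assumed __call__ signature (the real one is in the PR), and RAGTaggingSystemPrompt is a hypothetical name.

```python
class SimpleTaggingSystemPrompt:
    """Placeholder body; the real prompt text and signature live in the PR."""

    def __call__(self, target_kdma):
        return f"You are a triage assistant using the {target_kdma} protocol."


class RAGTaggingSystemPrompt(SimpleTaggingSystemPrompt):
    """Hypothetical RAG variant: reuse the base prompt, then append context."""

    def __call__(self, target_kdma, rag_context):
        base_prompt = super().__call__(target_kdma)
        return f"{base_prompt}\n\nRelevant protocol excerpts:\n{rag_context}"
```

Keeping the RAG text in the subclass means a change to the base prompt propagates automatically, and the base class stays free of rag_context branches.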

dmjoy · 2026-04-27T16:12:52Z

Would prefer these scripts to go in the scratch repo

dmjoy · 2026-04-27T16:13:45Z

    "langchain>=0.2.5",
    "llama-index>=0.13.0",
 ]
+rag = [


Pros and cons of just including these dependencies in the main list? (How big / unwieldy is faiss-cpu? I suspect we already include the others.)
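One cost of keeping a separate rag extra is that RAG code paths then need to handle the packages being absent. A small sketch of that guard, using only the standard library; require_optional is a hypothetical helper, not something in the repo.

```python
import importlib


def require_optional(module_name, extra):
    """Import an optional module, or fail with a hint about which extra to install."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"'{module_name}' is needed for this feature; "
            f"install the package with the '{extra}' extra."
        ) from err
```

If the dependencies move into the main list instead, this kind of guard is unnecessary, at the price of a heavier default install.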

Adds RAG ADM Component and creates tagging pipeline to test it

49f70cd

ygefen requested a review from dmjoy April 23, 2026 16:20

ygefen self-assigned this Apr 23, 2026

Adds ICL RAG to alignment fewshot experiments

606239c

dmjoy requested changes Apr 27, 2026


Adds ablation testing and slurm capability to hydra

8e8c504

