Skip to content

Adds RAG ADM Component and creates tagging pipeline to test it#270

Open
ygefen wants to merge 3 commits into
mainfrom
ygefen/feat/rag
Open

Adds RAG ADM Component and creates tagging pipeline to test it#270
ygefen wants to merge 3 commits into
mainfrom
ygefen/feat/rag

Conversation

@ygefen
Copy link
Copy Markdown
Contributor

@ygefen ygefen commented Apr 23, 2026

No description provided.

@ygefen
Copy link
Copy Markdown
Contributor Author

ygefen commented Apr 23, 2026

tagging_comparison_results.zip
Comparison between RAG pipeline and expert prompt pipeline for the tagging task

@ygefen ygefen requested a review from dmjoy April 23, 2026 16:20
@ygefen ygefen self-assigned this Apr 23, 2026
@ygefen
Copy link
Copy Markdown
Contributor Author

ygefen commented Apr 27, 2026

tagging_comparison_results.zip
Results with the addition of tagging fewshot aligned experiments.

@ygefen
Copy link
Copy Markdown
Contributor Author

ygefen commented Apr 27, 2026

This code needs refactoring once I get a thumbs up on the concept and before merge. Specifically there is too much code repetition.

Copy link
Copy Markdown
Contributor

@dmjoy dmjoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks pretty good to me, but a handful of requested changes. I'm waffling a bit on whether or not to include the tagging domain documents in the repo here. Longer term (as we accumulate more document sets) it doesn't seem sustainable; but it is more convenient and ensures we don't lose the documents we were using. Probably fine to leave them in for now (as we already include the ICL datasets in the repo).

scenario_description = call_with_coerced_args(
self.scenario_description_template,
{'scenario_state': scenario_state,
'rag_context': rag_context})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, so is this really the only difference with the base class? If so wondering if it just makes more sense to add it to the original class 🤔

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nvm. Thinking it's better to be a distinct subclass but let's make rag_context a required argument to the run function (if you're not using RAG just use the base class?)

alignment_target,
positive_icl_dialog_elements=[],
negative_icl_dialog_elements=[],
rag_context=None):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same story here I think (making rag_context non-optional)

all_splits = text_splitter.split_documents(docs)

embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)
vector_store = FAISS.from_documents(all_splits, embeddings)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing this vector_store is just in memory here right? Or is it backed by a file / cache on disk? Not requesting any changes here just curious (I know the documents we're using now are small, but what happens if it's a massive collection)

DocumentFileListType = Iterable[DocumentFileType]


class LangChainRAGIndexerADMComponent(ADMComponent):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having RAG implemented as an ADMComponent is probably OK for now, but I was thinking of it more like how we have the StructuredInferenceEngine pieces (as in the same instance could potentially be re-used by different ADM components). This might come into play for our multi-attribute ADMs where we need to do some kind of ICL retrieval for relevance computation, and then again for provided in-context examples.

Comment on lines +32 to +34
- /data/users/yonatan.gefen/align-system/align_system/documents/start.md
- /data/users/yonatan.gefen/align-system/align_system/documents/start_triage_flowchart.md
- /data/users/yonatan.gefen/align-system/align_system/documents/Salt.md
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to check with Emily and Aaron here, but I don't know that we want all protocol documents available all the time (it may depend on the target, e.g. for a start target, we would use the two start documents, but not the Salt.md).

Also we have a convention for storing / using files in the repo for this kind of thing it seems (or at least use a /data/shared path rather than your own directory): https://github.com/ITM-Kitware/align-system/blob/main/align_system/configs/adm_component/icl/tagging.yaml#L17-L19

Comment on lines +15 to +17
- /data/users/yonatan.gefen/align-system/align_system/documents/start.md
- /data/users/yonatan.gefen/align-system/align_system/documents/start_triage_flowchart.md
- /data/users/yonatan.gefen/align-system/align_system/documents/Salt.md
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here about filepaths

raise ValueError(f"Unknown target tagging protocol: {target_kdma}")


class SimpleTaggingSystemPrompt:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slight preference (seems more maintainable) for the rag_context based prompts to be separate classes (they can call super for __call__ and then just append the RAG context stuff to keep it a bit neater?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would prefer these scripts to go in the scratch repo

Comment thread pyproject.toml
"langchain>=0.2.5",
"llama-index>=0.13.0",
]
rag = [
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pros and cons of just including these dependencies in the main list? (How big / unwieldy is faiss-cpu I suspect we already include the others?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants