Adds RAG ADM Component and creates tagging pipeline to test it#270
Adds RAG ADM Component and creates tagging pipeline to test it#270ygefen wants to merge 3 commits into
Conversation
|
tagging_comparison_results.zip |
|
tagging_comparison_results.zip |
|
This code needs refactoring once I get a thumbs up on the concept and before merge. Specifically there is too much code repetition. |
dmjoy
left a comment
There was a problem hiding this comment.
Generally looks pretty good to me, but a handful of requested changes. I'm waffling a bit on whether or not to include the tagging domain documents in the repo here. Longer term (as we accumulate more document sets) it doesn't seem sustainable; but it is more convenient and ensures we don't lose the documents we were using. Probably fine to leave them in for now (as we already include the ICL datasets in the repo).
| scenario_description = call_with_coerced_args( | ||
| self.scenario_description_template, | ||
| {'scenario_state': scenario_state, | ||
| 'rag_context': rag_context}) |
There was a problem hiding this comment.
Interesting, so is this really the only difference with the base class? If so wondering if it just makes more sense to add it to the original class 🤔
There was a problem hiding this comment.
Nvm. Thinking it's better to be a distinct subclass but let's make rag_context a required argument to the run function (if you're not using RAG just use the base class?)
| alignment_target, | ||
| positive_icl_dialog_elements=[], | ||
| negative_icl_dialog_elements=[], | ||
| rag_context=None): |
There was a problem hiding this comment.
Same story here I think (making rag_context non-optional)
| all_splits = text_splitter.split_documents(docs) | ||
|
|
||
| embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name) | ||
| vector_store = FAISS.from_documents(all_splits, embeddings) |
There was a problem hiding this comment.
I'm guessing this vector_store is just in memory here right? Or is it backed by a file / cache on disk? Not requesting any changes here just curious (I know the documents we're using now are small, but what happens if it's a massive collection)
| DocumentFileListType = Iterable[DocumentFileType] | ||
|
|
||
|
|
||
| class LangChainRAGIndexerADMComponent(ADMComponent): |
There was a problem hiding this comment.
I think having RAG implemented as an ADMComponent is probably OK for now, but I was thinking of it more like how we have the StructuredInferenceEngine pieces (as in the same instance could potentially be re-used by different ADM components). This might come into play for our multi-attribute ADMs where we need to do some kind of ICL retrieval for relevance computation, and then again for provided in-context examples.
| - /data/users/yonatan.gefen/align-system/align_system/documents/start.md | ||
| - /data/users/yonatan.gefen/align-system/align_system/documents/start_triage_flowchart.md | ||
| - /data/users/yonatan.gefen/align-system/align_system/documents/Salt.md |
There was a problem hiding this comment.
Better to check with Emily and Aaron here, but I don't know that we want all protocol documents available all the time (it may depend on the target, e.g. for a start target, we would use the two start documents, but not the Salt.md).
Also we have a convention for storing / using files in the repo for this kind of thing it seems (or at least use a /data/shared path rather than your own directory): https://github.com/ITM-Kitware/align-system/blob/main/align_system/configs/adm_component/icl/tagging.yaml#L17-L19
| - /data/users/yonatan.gefen/align-system/align_system/documents/start.md | ||
| - /data/users/yonatan.gefen/align-system/align_system/documents/start_triage_flowchart.md | ||
| - /data/users/yonatan.gefen/align-system/align_system/documents/Salt.md |
There was a problem hiding this comment.
Same comment here about filepaths
| raise ValueError(f"Unknown target tagging protocol: {target_kdma}") | ||
|
|
||
|
|
||
| class SimpleTaggingSystemPrompt: |
There was a problem hiding this comment.
Slight preference (seems more maintainable) for the rag_context based prompts to be separate classes (they can call super for __call__ and then just append the RAG context stuff to keep it a bit neater?
There was a problem hiding this comment.
Would prefer these scripts to go in the scratch repo
| "langchain>=0.2.5", | ||
| "llama-index>=0.13.0", | ||
| ] | ||
| rag = [ |
There was a problem hiding this comment.
Pros and cons of just including these dependencies in the main list? (How big / unwieldy is faiss-cpu I suspect we already include the others?)
No description provided.