A semantic retrieval workflow turns embeddings into useful search results. In this lesson, you will build the backend sequence that takes stored documents, embeds them, embeds a user query, compares similarity scores, ranks the results, and returns the top matches.
You will write a Python script for a developer documentation scenario, following the Identify → Assemble → Execute → Verify process. The script covers document metadata, query processing, cosine similarity, top-k ranking, and relevance checks.
You are a junior backend developer on a platform team. The team maintains internal developer documentation, but users often search with everyday language instead of exact article titles.
A developer asks:
“Why does the mobile app say my token is expired?”
The documentation site contains several articles about authentication, API keys, billing, and dashboard performance. Your task is to build a small semantic retrieval workflow that returns the most relevant documentation articles first.
- Python 3.10 or newer
- Visual Studio Code or another code editor
- Terminal or integrated terminal
- pipenv
- Ollama installed and running locally
- An embedding model, such as embeddinggemma
- Python package: ollama
Set up the embedding model:
ollama pull embeddinggemma
ollama run embeddinggemma "Hello world"
Install the project dependencies in the Pipfile and enter the virtual environment:
pipenv install
pipenv shell
To run the script later from inside the pipenv shell, use:
python semantic_retrieval_lesson.py
Follow the technical process: Identify → Assemble → Execute → Verify.
Next, you will define what the retrieval workflow should solve and what information it should return.
Create a short planning note that identifies the user need, searchable content, expected top result, and returned output.
Start by identifying the user role:
User or role:
Developers searching internal platform documentation.
Add the business problem:
Business problem:
Developers lose time when natural-language questions do not match documentation titles.
Add the user query that the backend will process:
User query:
"Why does the mobile app say my token is expired?"
Add the content the backend can search:
Searchable content:
Developer documentation summaries about authentication, API keys, billing, and dashboard performance.
Add the result you expect to rank first:
Expected top result:
An article about refreshing expired API access tokens.
Add the information each returned result should include:
Returned output:
Top 3 ranked results with document ID, title, category, similarity score, and source text.
Add a verification goal:
Verification goal:
The top result should be about expired or invalid API access tokens, not billing or dashboard performance.
Your completed planning note should look like this:
User or role:
Developers searching internal platform documentation.
Business problem:
Developers lose time when natural-language questions do not match documentation titles.
User query:
"Why does the mobile app say my token is expired?"
Searchable content:
Developer documentation summaries about authentication, API keys, billing, and dashboard performance.
Expected top result:
An article about refreshing expired API access tokens.
Returned output:
Top 3 ranked results with document ID, title, category, similarity score, and source text.
Verification goal:
The top result should be about expired or invalid API access tokens, not billing or dashboard performance.
You should have a clear output contract for your search workflow.
The output contract helps you design the code. A retrieval workflow is not only about finding text. It also needs to return enough source information for the user, frontend, or future RAG system to trust the result.
Do not return only the matching text. Include metadata such as document ID, title, category, and similarity score.
This step is strong when it connects the retrieval workflow to a real user need and a clear returned output.
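The output contract can also be written down as a small type, so the required fields are checked rather than remembered. The following is a minimal sketch and not part of the lesson script: the names SearchResult and missing_fields are hypothetical, and the field list simply mirrors the returned output in the planning note.

```python
from typing import Any, Dict, Set, TypedDict


class SearchResult(TypedDict):
    """Hypothetical shape of one ranked result, mirroring the planning note."""

    id: str
    title: str
    category: str
    score: float
    text: str


# Field names are read from the type so the check cannot drift from the contract.
REQUIRED_FIELDS: Set[str] = set(SearchResult.__annotations__)


def missing_fields(result: Dict[str, Any]) -> Set[str]:
    """Return the contract fields that a result dictionary does not contain."""
    return REQUIRED_FIELDS - set(result)


complete = {
    "id": "DEV-101",
    "title": "Refreshing Expired API Access Tokens",
    "category": "authentication",
    "score": 0.5357,
    "text": "Explains how to refresh expired API access tokens.",
}
text_only = {"text": "Explains how to refresh expired API access tokens."}

print(missing_fields(complete))           # nothing is missing
print(sorted(missing_fields(text_only)))  # every metadata field is missing
```

A result that contains only the matching text fails this check, which is the situation the contract is meant to prevent.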
Next, you will create the Python file, import the tools you need, choose one embedding model, set the number of results to return, and store searchable documents with traceable metadata.
Create a Python file named semantic_retrieval_lesson.py and add the imports, model name, top-k value, and document list.
Create the file:
touch semantic_retrieval_lesson.py
Open semantic_retrieval_lesson.py.
At the top of the file, import the tools the script will use:
from math import sqrt
from typing import Any, Dict, List
import ollama
These imports are used for specific parts of the workflow:
- sqrt calculates vector magnitudes for cosine similarity.
- Any, Dict, and List make the data structure easier to read as you store strings, scores, and vectors together.
- ollama sends text to the local embedding model and receives vector outputs.
Create a reusable constant for the embedding model:
MODEL = "embeddinggemma"
This keeps the model choice in one place. The same model must be used for both document embeddings and query embeddings.
Create a reusable constant for the number of results to return:
TOP_K = 3
TOP_K limits the output to the strongest matches. Returning only a few ranked results helps reduce noise.
Create the document list:
DOCUMENTS: List[Dict[str, str]] = []
Add the first searchable document:
DOCUMENTS.append(
    {
        "id": "DEV-101",
        "title": "Refreshing Expired API Access Tokens",
        "category": "authentication",
        "text": (
            "Explains how to refresh expired API access tokens, check token lifetime, "
            "and retry a request with a valid bearer token."
        ),
    }
)
This document should be the strongest expected match for the user query about an expired token.
Add another authentication-related document:
DOCUMENTS.append(
    {
        "id": "DEV-102",
        "title": "Fixing Invalid Authorization Headers",
        "category": "authentication",
        "text": (
            "Shows how to format authorization headers, include bearer tokens, "
            "and troubleshoot rejected API requests caused by malformed headers."
        ),
    }
)
This result may be related, but it is not exactly the same as an expired token.
Add an API key onboarding document:
DOCUMENTS.append(
    {
        "id": "DEV-103",
        "title": "Creating a New Developer API Key",
        "category": "onboarding",
        "text": (
            "Guides a new developer through creating an API key, copying the key value, "
            "and storing credentials securely."
        ),
    }
)
This gives the workflow a partial match that is still about API access.
Add a billing document:
DOCUMENTS.append(
    {
        "id": "DEV-104",
        "title": "Understanding Dashboard Billing Limits",
        "category": "billing",
        "text": (
            "Explains plan limits, monthly usage caps, billing warnings, "
            "and how to upgrade an account."
        ),
    }
)
This gives the workflow an unrelated document to rank lower for authentication queries.
Add a frontend performance document:
DOCUMENTS.append(
    {
        "id": "DEV-105",
        "title": "Troubleshooting Slow Dashboard Pages",
        "category": "frontend",
        "text": (
            "Covers browser caching, loading states, client-side rendering delays, "
            "and slow dashboard performance."
        ),
    }
)
This gives the workflow another unrelated option so you can check whether the ranking makes sense.
Your file should now include:
- imports,
- one model constant,
- one TOP_K constant,
- and five searchable documents with IDs, titles, categories, and source text.
Metadata makes retrieval easier to inspect and trust. If you return a result without source details, users and developers cannot easily verify where the information came from.
Short summaries are fine for this first workflow. Later, longer documents may need to be split into chunks before embedding.
This step is strong when the documents are varied enough to test whether semantic retrieval separates related and unrelated content.
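The chunking mentioned above can be sketched with a simple word-window split. This is one possible approach under stated assumptions, not the method the lesson uses: the function name chunk_words, the chunk size, and the overlap are all hypothetical, and real systems often split on sentences or headings instead.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into overlapping windows of words (hypothetical helper)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = "zero one two three four five six seven eight nine"
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Each chunk would then be embedded as its own record, keeping the parent document ID in the metadata so a matching chunk can still be traced to its source.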
Next, you will create a helper function that sends one text input to the embedding model and returns one vector.
Add a function named get_embedding() below the document records.
Start the helper function with a clear name, parameter, return type, and docstring:
def get_embedding(text: str) -> List[float]:
    """Return one embedding vector for one text input."""
Inside the function, send the text to the local Ollama embedding model:
    response = ollama.embed(model=MODEL, input=text)
This line tells Ollama which model to use and which text to convert into an embedding.
Return the first embedding vector from the response:
    return response["embeddings"][0]
The response stores embeddings in a list because the API can return embeddings for one input or multiple inputs. In this lesson, each call sends one text, so the vector you need is the first item.
Your completed helper function should look like this:
def get_embedding(text: str) -> List[float]:
    """Return one embedding vector for one text input."""
    response = ollama.embed(model=MODEL, input=text)
    return response["embeddings"][0]
You should now have a reusable function that accepts one text string and returns one list of numbers.
This helper function keeps embedding generation separate from the rest of the workflow. Instead of rewriting the Ollama call for every document and query, you can call get_embedding() whenever the backend needs a vector.
Make sure this function is not indented inside a document record. It should start at the left edge of the file.
This step is strong when the function has one clear job: convert one text input into one embedding vector.
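Because the embedding response holds a list of vectors, the same call can also embed several texts at once. The sketch below is an optional variation, not part of the lesson script. The name get_embeddings and the embed_call parameter are hypothetical; embed_call exists only so the function can be tried with a stand-in instead of a running Ollama server.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence

MODEL = "embeddinggemma"


def get_embeddings(
    texts: Sequence[str],
    embed_call: Optional[Callable[..., Any]] = None,
) -> List[List[float]]:
    """Return one embedding vector per input text, in the same order."""
    if embed_call is None:
        # Imported lazily so the sketch can be exercised without Ollama installed.
        import ollama

        embed_call = ollama.embed
    response = embed_call(model=MODEL, input=list(texts))
    vectors = [list(vector) for vector in response["embeddings"]]
    if len(vectors) != len(texts):
        raise ValueError(f"expected {len(texts)} embeddings, got {len(vectors)}")
    return vectors


def fake_embed(model: str, input: List[str]) -> Dict[str, Any]:
    """Stand-in for the real call: returns a one-number 'vector' per text."""
    return {"embeddings": [[float(len(text))] for text in input]}


print(get_embeddings(["token", "billing"], embed_call=fake_embed))  # [[5.0], [7.0]]
```

The stand-in vectors carry no meaning; they only show that the output order matches the input order, which is what keeps each vector tied to its source text.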
Next, you will create a function that compares two embedding vectors and returns a similarity score.
Add a function named cosine_similarity() below get_embedding().
Start the function with two vector parameters and a float return type:
def cosine_similarity(vector_a: List[float], vector_b: List[float]) -> float:
    """Compare two vectors by cosine similarity."""
Calculate the dot product:
    dot_product = sum(a * b for a, b in zip(vector_a, vector_b))
The dot product multiplies the values at matching positions in both vectors and adds the products together. It is one part of measuring whether the vectors point in a similar direction.
Calculate the magnitude of the first vector:
    magnitude_a = sqrt(sum(a * a for a in vector_a))
Calculate the magnitude of the second vector:
    magnitude_b = sqrt(sum(b * b for b in vector_b))
Magnitude represents the vector length. Cosine similarity measures direction only, so the score divides the dot product by both magnitudes to cancel out the effect of length.
Add a guard for zero-magnitude vectors:
    if magnitude_a == 0 or magnitude_b == 0:
        return 0.0
This prevents division by zero if either vector has no magnitude.
Return the cosine similarity score:
    return dot_product / (magnitude_a * magnitude_b)
Your completed function should look like this:
def cosine_similarity(vector_a: List[float], vector_b: List[float]) -> float:
    """Compare two vectors by cosine similarity."""
    dot_product = sum(a * b for a, b in zip(vector_a, vector_b))
    magnitude_a = sqrt(sum(a * a for a in vector_a))
    magnitude_b = sqrt(sum(b * b for b in vector_b))
    if magnitude_a == 0 or magnitude_b == 0:
        return 0.0
    return dot_product / (magnitude_a * magnitude_b)
You should now have a function that can compare a query embedding with a document embedding.
Cosine similarity gives the backend a ranking signal. Higher scores usually mean the texts are closer in meaning, but you still need to inspect the returned source text.
Do not compare embeddings from different models. Use the same model for the documents and the user query.
This step is strong when the function returns one score and does not depend on a specific query or document.
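A few tiny vectors make the score easier to read before you apply it to real embeddings. The sketch below repeats the same calculation and adds one optional guard that is not in the lesson function: a length check, named cosine_similarity_checked here as a hypothetical variant. The guard is included because zip() stops at the shorter vector, so vectors from two different models could otherwise be compared without any error.

```python
from math import sqrt
from typing import List


def cosine_similarity_checked(vector_a: List[float], vector_b: List[float]) -> float:
    """Cosine similarity that refuses vectors of different lengths."""
    if len(vector_a) != len(vector_b):
        # zip() would silently ignore the extra values of the longer vector.
        raise ValueError(f"vector lengths differ: {len(vector_a)} vs {len(vector_b)}")
    dot_product = sum(a * b for a, b in zip(vector_a, vector_b))
    magnitude_a = sqrt(sum(a * a for a in vector_a))
    magnitude_b = sqrt(sum(b * b for b in vector_b))
    if magnitude_a == 0 or magnitude_b == 0:
        return 0.0
    return dot_product / (magnitude_a * magnitude_b)


print(cosine_similarity_checked([1.0, 0.0], [1.0, 0.0]))   # 1.0: same direction
print(cosine_similarity_checked([3.0, 4.0], [6.0, 8.0]))   # 1.0: same direction, different magnitude
print(cosine_similarity_checked([1.0, 0.0], [0.0, 1.0]))   # 0.0: perpendicular
print(cosine_similarity_checked([1.0, 0.0], [-1.0, 0.0]))  # -1.0: opposite direction
```

The second example shows why magnitude is divided out: doubling a vector changes its length but not its direction, so the score stays at 1.0.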
Next, you will embed each document and keep each embedding connected to its source metadata.
Add a function named build_index() below cosine_similarity().
Start the function with one parameter for the document list:
def build_index(documents: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Embed each document and keep the embedding attached to source metadata."""
Create an empty list to store indexed documents:
    index: List[Dict[str, Any]] = []
Loop through each document:
    for document in documents:
Inside the loop, combine the title and document text:
        searchable_text = f"{document['title']}. {document['text']}"
The title often contains useful meaning. Combining the title and text gives the embedding model a stronger description of the document.
Create an embedding for the searchable text:
        embedding = get_embedding(searchable_text)
Store the original metadata and the new embedding together:
        index.append({**document, "embedding": embedding})
This keeps the document ID, title, category, source text, and vector connected.
After the loop, return the completed index:
    return index
Your completed function should look like this:
def build_index(documents: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Embed each document and keep the embedding attached to source metadata."""
    index: List[Dict[str, Any]] = []
    for document in documents:
        searchable_text = f"{document['title']}. {document['text']}"
        embedding = get_embedding(searchable_text)
        index.append({**document, "embedding": embedding})
    return index
You should now have a function that creates an in-memory index.
This step creates the searchable representation of your stored content. In a larger application, a vector database like Chroma could store these embeddings and metadata. Here, you keep them in memory so you can see the retrieval workflow clearly.
Keep the original metadata with the embedding. If the vector becomes separated from its document ID or title, the result will be hard to verify.
This step is strong when every embedding remains traceable to its original document.
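Traceability can be checked mechanically once the index is built. The following is a rough sketch under stated assumptions: find_index_problems is a hypothetical helper, and the three checks it runs (required keys present, IDs unique, all embeddings the same length) are one reasonable reading of "traceable", not a rule from the lesson.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("id", "title", "category", "text", "embedding")


def find_index_problems(index: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems found in an in-memory index."""
    problems: List[str] = []
    seen_ids = set()
    expected_length = None
    for position, entry in enumerate(index):
        label = entry.get("id", f"entry {position}")
        for key in REQUIRED_KEYS:
            if key not in entry:
                problems.append(f"{label}: missing '{key}'")
        if label in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(label)
        embedding = entry.get("embedding")
        if embedding is not None:
            if expected_length is None:
                expected_length = len(embedding)  # first vector sets the expected size
            elif len(embedding) != expected_length:
                problems.append(
                    f"{label}: embedding length {len(embedding)} != {expected_length}"
                )
    return problems


# Tiny hand-written vectors stand in for real embeddings.
good_index = [
    {"id": "DEV-101", "title": "A", "category": "authentication", "text": "a", "embedding": [0.1, 0.2]},
    {"id": "DEV-104", "title": "B", "category": "billing", "text": "b", "embedding": [0.3, 0.4]},
]
bad_index = [
    good_index[0],
    {"id": "DEV-101", "category": "billing", "text": "b", "embedding": [0.3]},
]

print(find_index_problems(good_index))
for problem in find_index_problems(bad_index):
    print(problem)
```

In the lesson script, this kind of check could run once on the list returned by build_index() before any query is searched.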
Next, you will embed the user query, compare it to every document embedding, sort the results by score, and return only the top matches.
Add a function named search() below build_index().
Start the function with parameters for the query, index, and number of results:
def search(query: str, index: List[Dict[str, Any]], top_k: int = TOP_K) -> List[Dict[str, Any]]:
    """Embed a query, compare it to each document, and return top-ranked results."""
Create an embedding for the user query:
    query_embedding = get_embedding(query)
Create an empty list for scored results:
    scored_results: List[Dict[str, Any]] = []
Loop through each indexed document:
    for document in index:
Compare the query embedding to the document embedding:
        score = cosine_similarity(query_embedding, document["embedding"])
Store the returned fields for this result:
        scored_results.append(
            {
                "id": document["id"],
                "title": document["title"],
                "category": document["category"],
                "score": score,
                "text": document["text"],
            }
        )
The result includes the score and source metadata. It does not return the raw embedding because users do not need to read the vector.
Sort the results from highest score to lowest score:
    ranked_results = sorted(
        scored_results,
        key=lambda result: result["score"],
        reverse=True,
    )
Return only the top results:
    return ranked_results[:top_k]
Your completed function should look like this:
def search(query: str, index: List[Dict[str, Any]], top_k: int = TOP_K) -> List[Dict[str, Any]]:
    """Embed a query, compare it to each document, and return top-ranked results."""
    query_embedding = get_embedding(query)
    scored_results: List[Dict[str, Any]] = []
    for document in index:
        score = cosine_similarity(query_embedding, document["embedding"])
        scored_results.append(
            {
                "id": document["id"],
                "title": document["title"],
                "category": document["category"],
                "score": score,
                "text": document["text"],
            }
        )
    ranked_results = sorted(
        scored_results,
        key=lambda result: result["score"],
        reverse=True,
    )
    return ranked_results[:top_k]
You should now have the core retrieval workflow:
documents → document embeddings → user query → query embedding → similarity scores → ranked top-k results
This is the retrieval workflow in action. The backend is comparing the meaning of the user’s query to the meaning of stored documents, then ranking the closest matches.
Sort in descending order so the highest similarity score appears first.
This step is strong when the returned results are ranked, limited to a useful top-k value, and traceable to source documents.
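Sorting every result and slicing is perfectly fine for five documents. For larger lists, the standard library offers heapq.nlargest, which returns the same top-k items in descending order without sorting the whole list. The sketch below is an optional alternative, not the lesson's method; the function name top_k_results and the sample scores are made up for illustration.

```python
import heapq
from typing import Any, Dict, List


def top_k_results(scored_results: List[Dict[str, Any]], top_k: int = 3) -> List[Dict[str, Any]]:
    """Return the top_k highest-scoring results, best first."""
    return heapq.nlargest(top_k, scored_results, key=lambda result: result["score"])


# Hypothetical scores, listed out of order on purpose.
scored = [
    {"id": "DEV-104", "score": 0.21},
    {"id": "DEV-101", "score": 0.54},
    {"id": "DEV-105", "score": 0.29},
    {"id": "DEV-102", "score": 0.34},
]

for result in top_k_results(scored, top_k=3):
    print(result["id"], result["score"])
```

Both approaches produce the same ranking here, so the choice only starts to matter when the index grows well beyond a handful of documents.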
Next, you will display the ranked results and test whether the workflow handles different user intents.
Add a print helper, a main() function, and the script entry point. Then run the file.
Start with a helper function that prints one query and its ranked results:
def print_results(query: str, results: List[Dict[str, Any]]) -> None:
    """Display ranked results in a readable format."""
Print the query and a divider:
    print(f"\nQuery: {query}")
    print("-" * 72)
Loop through the ranked results with a rank number:
    for rank, result in enumerate(results, start=1):
Print the document ID, title, and score:
        print(f"{rank}. {result['id']} | {result['title']} | score={result['score']:.4f}")
Print the category and source text:
        print(f" category: {result['category']}")
        print(f" {result['text']}")
Your completed print helper should look like this:
def print_results(query: str, results: List[Dict[str, Any]]) -> None:
    """Display ranked results in a readable format."""
    print(f"\nQuery: {query}")
    print("-" * 72)
    for rank, result in enumerate(results, start=1):
        print(f"{rank}. {result['id']} | {result['title']} | score={result['score']:.4f}")
        print(f" category: {result['category']}")
        print(f" {result['text']}")
Create the main() function:
def main() -> None:
Build the in-memory index:
    index = build_index(DOCUMENTS)
Add three test queries:
    test_queries = [
        "Why does the mobile app say my token is expired?",
        "My API request fails even though I added a bearer token.",
        "How can I increase my monthly usage limit?",
    ]
The first two queries should usually rank authentication documents highly. The third query should usually rank the billing document highly.
Loop through the test queries:
    for query in test_queries:
Run the search for each query:
        results = search(query, index, top_k=TOP_K)
Print the ranked results:
        print_results(query, results)
Add the script entry point at the bottom of the file:
if __name__ == "__main__":
    main()
Your completed main() function and entry point should look like this:
def main() -> None:
    index = build_index(DOCUMENTS)
    test_queries = [
        "Why does the mobile app say my token is expired?",
        "My API request fails even though I added a bearer token.",
        "How can I increase my monthly usage limit?",
    ]
    for query in test_queries:
        results = search(query, index, top_k=TOP_K)
        print_results(query, results)
if __name__ == "__main__":
    main()
Run the file from inside your pipenv shell:
python semantic_retrieval_lesson.py
Your output should show ranked results for each query.
Example output pattern:
Query: Why does the mobile app say my token is expired?
------------------------------------------------------------------------
1. DEV-101 | Refreshing Expired API Access Tokens | score=0.5357
category: authentication
Explains how to refresh expired API access tokens, check token lifetime, and retry a request with a valid bearer token.
2. DEV-102 | Fixing Invalid Authorization Headers | score=0.3431
category: authentication
Shows how to format authorization headers, include bearer tokens, and troubleshoot rejected API requests caused by malformed headers.
3. DEV-105 | Troubleshooting Slow Dashboard Pages | score=0.2919
category: frontend
Covers browser caching, loading states, client-side rendering delays, and slow dashboard performance.
Exact rankings and scores may vary by model. Authentication-related queries should usually rank authentication documents above billing or frontend documents.
Testing multiple queries helps you verify consistency. A workflow that works for one query may still fail on another query.
Use test queries that represent different intents. Include at least one query that should match a non-authentication document.
This step is strong when different queries produce different top results that match the user’s likely intent.
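Reading the printed output works for three queries, but the same expectations can be written down and checked automatically. The sketch below is a hedged illustration: failed_expectations and EXPECTED_TOP are hypothetical names, the expected IDs come from the lesson's own statements about which document should rank first, and canned_search is a stand-in that returns fixed results so the example runs without an embedding model.

```python
from typing import Any, Callable, Dict, List

# Queries paired with the document ID the lesson expects to rank first.
EXPECTED_TOP = {
    "Why does the mobile app say my token is expired?": "DEV-101",
    "How can I increase my monthly usage limit?": "DEV-104",
}


def failed_expectations(
    search_fn: Callable[[str], List[Dict[str, Any]]],
    expected: Dict[str, str],
) -> List[str]:
    """Run each query and report the ones whose top result is not the expected ID."""
    failures: List[str] = []
    for query, expected_id in expected.items():
        results = search_fn(query)
        top_id = results[0]["id"] if results else None
        if top_id != expected_id:
            failures.append(f"{query!r}: expected {expected_id}, got {top_id}")
    return failures


def canned_search(query: str) -> List[Dict[str, Any]]:
    """Stand-in for the real search: returns fixed, already-ranked results."""
    if "token" in query:
        return [{"id": "DEV-101", "score": 0.54}, {"id": "DEV-102", "score": 0.34}]
    return [{"id": "DEV-105", "score": 0.30}, {"id": "DEV-104", "score": 0.28}]


for failure in failed_expectations(canned_search, EXPECTED_TOP):
    print(failure)
```

In the lesson script, the stand-in could be replaced with something like lambda query: search(query, index). A reported failure is a prompt to read the results, not proof that retrieval is broken, because rankings can vary by model.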
Next, you will decide whether the retrieval output is useful enough to trust.
Review the output and write a short verification note.
Use these checks:
Functional output:
Did the workflow return ranked results?
Vector consistency:
Were documents and queries embedded with the same model?
Relevance:
Does the top result answer the user’s actual need?
Intent alignment:
Does the result match the meaning of the query, not just a shared word?
Top-k quality:
Are the extra results useful or noisy?
Source traceability:
Can each result be traced back to a document ID and title?
RAG readiness:
Would the top result provide grounded context for an AI-generated response?
A strong verification note might look like this:
For the expired token query, DEV-101 ranked first. This makes sense because the result explains token lifetime and retrying requests with a valid bearer token.
The output includes ID, title, category, score, and source text, so the result is traceable. This result could support a future RAG answer, but I would still confirm that the article is current before using it as final context.
You should have a short verification note that connects the ranking back to the original user need.
Verification protects users from weak retrieval. A semantic retrieval workflow can run correctly and still return incomplete or misleading context. Before using retrieved content in RAG, always check whether the result is relevant and source-grounded.
A high score is not the same as certainty. Read the top result before deciding whether it is useful.
This step is strong when your verification explains both what ranked first and why it is or is not useful.
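One extra signal that can support a verification note is the gap between the best and second-best scores. The sketch below is a heuristic only, offered under stated assumptions: top_score_margin is a hypothetical helper, the MIN_MARGIN threshold is an arbitrary value chosen for illustration, and the scores are copied from the example output earlier in the lesson.

```python
from typing import Any, Dict, List, Optional

MIN_MARGIN = 0.05  # hypothetical threshold; tune it against your own queries


def top_score_margin(results: List[Dict[str, Any]]) -> Optional[float]:
    """Return the gap between the top two scores, or None with fewer than two results.

    Assumes results are already sorted from highest to lowest score.
    """
    if len(results) < 2:
        return None
    return results[0]["score"] - results[1]["score"]


ranked = [
    {"id": "DEV-101", "score": 0.5357},
    {"id": "DEV-102", "score": 0.3431},
    {"id": "DEV-105", "score": 0.2919},
]

margin = top_score_margin(ranked)
if margin is not None and margin < MIN_MARGIN:
    print(f"Top two results are close (margin={margin:.4f}); read both before trusting the ranking.")
else:
    print(f"Top result leads by margin={margin:.4f}; still read the source text.")
```

A wide margin does not make the top result correct. It only suggests the model separated the first match from the rest, which is why the source text still needs to be read.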
Next, you will connect the in-memory workflow to vector databases, retrievers, APIs, and RAG.
Write a short reflection that explains what this script proves and what would change in a larger system.
Answer these questions:
1. What content did the workflow embed?
2. What did the workflow compare?
3. How did the workflow rank results?
4. Why did metadata matter?
5. What would change if this moved into Chroma, LangChain, Flask, or RAG?
A completed reflection might look like this:
The script embedded developer documentation summaries and embedded each user query with the same model. It compared the query embedding with each document embedding using cosine similarity, then sorted the results by score and returned the top matches.
Metadata mattered because each result needed an ID, title, category, and source text so the output could be checked. In a larger system, Chroma could store embeddings and metadata, LangChain could wrap the search as a retriever, Flask could expose the workflow through an API route, and RAG could use the retrieved source text as context before generating an answer.
You should have a reflection that explains how the manual workflow prepares you for future tools.
This step helps you avoid treating Chroma, LangChain, or RAG as magic. Those tools still depend on the same retrieval sequence: prepare content, embed text, compare meaning, rank results, return source-grounded context, and verify quality.
Focus your reflection on the workflow, not only the tool names. The tools change, but the retrieval logic remains similar.
This step is strong when it clearly separates “the code returned results” from “the retrieval output is useful and traceable.”
Use this completed file to check your work after you have built the script step by step.
from math import sqrt
from typing import Any, Dict, List

import ollama

MODEL = "embeddinggemma"
TOP_K = 3

DOCUMENTS: List[Dict[str, str]] = []

DOCUMENTS.append(
    {
        "id": "DEV-101",
        "title": "Refreshing Expired API Access Tokens",
        "category": "authentication",
        "text": (
            "Explains how to refresh expired API access tokens, check token lifetime, "
            "and retry a request with a valid bearer token."
        ),
    }
)

DOCUMENTS.append(
    {
        "id": "DEV-102",
        "title": "Fixing Invalid Authorization Headers",
        "category": "authentication",
        "text": (
            "Shows how to format authorization headers, include bearer tokens, "
            "and troubleshoot rejected API requests caused by malformed headers."
        ),
    }
)

DOCUMENTS.append(
    {
        "id": "DEV-103",
        "title": "Creating a New Developer API Key",
        "category": "onboarding",
        "text": (
            "Guides a new developer through creating an API key, copying the key value, "
            "and storing credentials securely."
        ),
    }
)

DOCUMENTS.append(
    {
        "id": "DEV-104",
        "title": "Understanding Dashboard Billing Limits",
        "category": "billing",
        "text": (
            "Explains plan limits, monthly usage caps, billing warnings, "
            "and how to upgrade an account."
        ),
    }
)

DOCUMENTS.append(
    {
        "id": "DEV-105",
        "title": "Troubleshooting Slow Dashboard Pages",
        "category": "frontend",
        "text": (
            "Covers browser caching, loading states, client-side rendering delays, "
            "and slow dashboard performance."
        ),
    }
)


def get_embedding(text: str) -> List[float]:
    """Return one embedding vector for one text input."""
    response = ollama.embed(model=MODEL, input=text)
    return response["embeddings"][0]


def cosine_similarity(vector_a: List[float], vector_b: List[float]) -> float:
    """Compare two vectors by cosine similarity."""
    dot_product = sum(a * b for a, b in zip(vector_a, vector_b))
    magnitude_a = sqrt(sum(a * a for a in vector_a))
    magnitude_b = sqrt(sum(b * b for b in vector_b))
    if magnitude_a == 0 or magnitude_b == 0:
        return 0.0
    return dot_product / (magnitude_a * magnitude_b)


def build_index(documents: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Embed each document and keep the embedding attached to source metadata."""
    index: List[Dict[str, Any]] = []
    for document in documents:
        searchable_text = f"{document['title']}. {document['text']}"
        embedding = get_embedding(searchable_text)
        index.append({**document, "embedding": embedding})
    return index


def search(query: str, index: List[Dict[str, Any]], top_k: int = TOP_K) -> List[Dict[str, Any]]:
    """Embed a query, compare it to each document, and return top-ranked results."""
    query_embedding = get_embedding(query)
    scored_results: List[Dict[str, Any]] = []
    for document in index:
        score = cosine_similarity(query_embedding, document["embedding"])
        scored_results.append(
            {
                "id": document["id"],
                "title": document["title"],
                "category": document["category"],
                "score": score,
                "text": document["text"],
            }
        )
    ranked_results = sorted(
        scored_results,
        key=lambda result: result["score"],
        reverse=True,
    )
    return ranked_results[:top_k]


def print_results(query: str, results: List[Dict[str, Any]]) -> None:
    """Display ranked results in a readable format."""
    print(f"\nQuery: {query}")
    print("-" * 72)
    for rank, result in enumerate(results, start=1):
        print(f"{rank}. {result['id']} | {result['title']} | score={result['score']:.4f}")
        print(f" category: {result['category']}")
        print(f" {result['text']}")


def main() -> None:
    index = build_index(DOCUMENTS)
    test_queries = [
        "Why does the mobile app say my token is expired?",
        "My API request fails even though I added a bearer token.",
        "How can I increase my monthly usage limit?",
    ]
    for query in test_queries:
        results = search(query, index, top_k=TOP_K)
        print_results(query, results)


if __name__ == "__main__":
    main()

| Issue | Why it matters | How to respond |
|---|---|---|
| Weak document summaries | The embedding may not represent enough meaning. | Add clearer source text or chunk longer documents. |
| Missing metadata | Users cannot trace where results came from. | Include ID, title, category, and source text. |
| Different embedding models | Query and document vectors may not compare reliably. | Use one approved model for all embeddings. |
| Top-k is too high | Too many results may add noise. | Start with top 3 and adjust based on task. |
| Top-k is too low | Useful context may be missed. | Test whether top 5 improves coverage. |
| Scores are trusted blindly | Similarity does not guarantee correctness. | Read the source text and compare it to user intent. |
| No RAG readiness check | Weak retrieval can lead to weak AI responses. | Confirm relevance and source grounding before generation. |

| Option | Use when | Tradeoff |
|---|---|---|
| Manual in-memory similarity | You are learning or testing a small dataset. | Easy to inspect, but not scalable. |
| Chroma vector store | You need to store and search many embeddings. | More realistic, but adds tooling complexity. |
| LangChain retriever | You need reusable retrieval in a RAG pipeline. | Useful abstraction, but can hide some mechanics. |
In this lesson, the in-memory workflow is intentional. It helps you see what happens before a vector database or retriever abstraction handles storage and search for you.