AI systems forget everything.
MemoraAI is a memory layer that helps AI assistants build long-term understanding of users by extracting, storing, retrieving, and injecting relevant context into future interactions.
The goal is not to store more information. The goal is to surface the right information at the right time.
Stateless Large Language Models (LLMs) treat every interaction as if it is the first time they are meeting the user. Existing solutions try to solve this by:
- Shoving everything into the context window: This leads to exponential token costs, latency spikes, and the "lost in the middle" retrieval degradation where LLMs fail to recall facts placed in large contexts.
- Naive RAG (Retrieval-Augmented Generation): Standard semantic retrieval frequently fetches irrelevant noise or fails to resolve exact keywords, acronyms, or references.
- Flat chat logs: Treating memory merely as a raw database of past messages without structural representation, classification, or long-term user preferences extraction.
MemoraAI runs as a containerized multi-tier service suite, engineered for reliability, low-latency execution, and granular observability.
flowchart TD
User[Client Browser] -->|HTTPS Port 443| Nginx[Nginx Reverse Proxy]
Nginx -->|Route: /| FE[Next.js Frontend :3001]
Nginx -->|Route: /api/v1| BE[NestJS Backend Gateway :3000]
BE -->|Route: internal /api/v1| AI[FastAPI AI Service :8000]
AI -->|pgvector Cosine Search / FTS| DB[(Supabase PostgreSQL)]
AI -->|LLM & Embeddings| Gemini[Google Gemini API]
- Trace Propagation: Client requests are intercepted by Nginx and forwarded to the Next.js Frontend. A unique
x-request-idtransaction trace is generated (or captured) and sent to the NestJS Backend Gateway. - NestJS Context Tracking: The NestJS Gateway handles global rate-limiting (via
@nestjs/throttlermapping30 req/minper IP), request schema validation, and maps the trace ID into Express Request context. UsingAsyncLocalStorage, NestJS propagates the trace token downstream to all microservices. - FastAPI Observability Middleware: The core FastAPI AI Service extracts the
x-request-idheader in itsObservabilityMiddleware, binding it to pythonstructlogcontexts. Every database query, exception, or external LLM request is logged alongside this trace ID. - Data Layer Ingress: Queries are whitelisted inside FastAPI's DB connector (
db/connection.py) against known database tables (chunks,documents) to block SQL injection vectors on dynamic queries, before executing against Supabase PostgreSQL.
To maximize both recall (finding all candidate matches) and precision (ensuring the most relevant chunks are at the top), MemoraAI utilizes a multi-stage retrieval pipeline:
graph TD
Query[User Query] --> Embed[Gemini Embedder]
Embed -->|3072-dim Vector| VS[pgvector Cosine Search]
Query -->|FTS Lexical Index| BM25[PostgreSQL ts_rank FTS]
VS -->|List A: Ranked Chunks| RRF[Reciprocal Rank Fusion]
BM25 -->|List B: Ranked Chunks| RRF
RRF -->|Top 10 Candidates| Rerank[Cross-Encoder Reranker]
Rerank -->|Top 5 Contexts| LLM[Gemini 2.5 Flash Generator]
-
pgvector Semantic Search: Embeds query into a 3072-dimensional vector using
gemini-embedding-001. Runs cosine distance (<=>operator) query against the database, filtering chunks above the configuredSIMILARITY_THRESHOLD. -
Lexical Full-Text Search (FTS): Formats the query into English lexical tokens and executes a GIN-indexed keyword search using
ts_rank(equivalent to BM25 relevance ranking) to match exact terminology, acronyms, or numbers. -
Reciprocal Rank Fusion (RRF): Blends the ranks of the vector and lexical search runs using the formula:
$$\text{RRF Score} = \sum_{m \in {\text{Vector}, \text{FTS}}} \frac{1}{60 + \text{Rank}_m}$$ -
Cross-Encoder Reranking: Takes the top 10 RRF candidates and processes them together with the query through the
ms-marco-MiniLM-L-6-v2transformer. The model uses self-attention to calculate deep token-level similarity, re-sorting candidates so the top 3-5 highly dense contexts are sent to the LLM.- Docker Optimization: The 150MB Reranker weights are cached directly inside the Docker image during the build stage, reducing cold-start container initialization in cloud platforms from 20s to 1.5s.
MemoraAI models human memory using a three-tier system: Short-Term, Long-Term, and Episodic.
┌────────────────────────────────────────────────────────┐
│ MEMORY LIFECYCLE │
├────────────────────────────────────────────────────────┤
│ │
│ [Ingestion / Input] │
│ │ │
│ ▼ │
│ [Short-Term Memory] ──► (Active Session Context) │
│ │ │
│ ├─► (Every 5 messages: Extract facts) │
│ ▼ │
│ [Long-Term Memory] ──► (Persistent User Facts) │
│ │ │
│ ├─► (On New Session/Clear: Summarize) │
│ ▼ │
│ [Episodic Memory] ──► (Chronological Session Events) │
└────────────────────────────────────────────────────────┘
- Ingestion (Active Interaction):
- Files (PDFs), URL contents, or plain texts are parsed, chunked via a token-based sliding window (700 tokens, 100 overlap), vectorized, and saved globally in the
documentsandchunkstables. - User chat inputs are immediately committed to the
short_term_memorytable.
- Files (PDFs), URL contents, or plain texts are parsed, chunked via a token-based sliding window (700 tokens, 100 overlap), vectorized, and saved globally in the
- Short-Term Context Retention:
- On every message query, the last 5 conversation pairs for the specific
session_idare injected into the prompt context to maintain immediate dialog flow.
- On every message query, the last 5 conversation pairs for the specific
- Long-Term Fact Extraction (LTM):
- For every 5 messages exchanged, a background extraction event executes. The recent chat logs are evaluated by Gemini using structured JSON output schemas to distill personal user facts (e.g., "user works as a backend developer", "user prefers Python") into the
long_term_memorytable. - Relevant facts are retrieved and prepended to the system prompt dynamically on all future interactions.
- For every 5 messages exchanged, a background extraction event executes. The recent chat logs are evaluated by Gemini using structured JSON output schemas to distill personal user facts (e.g., "user works as a backend developer", "user prefers Python") into the
- Episodic Summarization:
- When a session is closed (via clicking "New Session" or "Clear Chat" in the UI), the system calls
POST /api/v1/memory/episodic/summarize. - The full chat log is summarized, and key topics and sentiment metrics are extracted and saved to
episodic_memorybefore clearing short-term tables.
- When a session is closed (via clicking "New Session" or "Clear Chat" in the UI), the system calls
Engineering a production memory system requires balancing performance, cost, and complexity:
- Periodic vs. Inline Extraction: We run Long-Term Memory (LTM) fact extraction asynchronously in the background every 5 messages rather than inline during every user query. Inline extraction increases user latency by ~1.5s and bloats token expenses, whereas periodic asynchronous extraction keeps user interactions sub-second while maintaining semantic updates.
- Bi-Encoder Embeddings vs. Cross-Encoder Rerankers: Bi-encoder embeddings represent documents as static vectors for fast vector indexing, but cannot model complex query-context interactions. Cross-Encoders calculate deep cross-attention at runtime for superior accuracy but are compute-heavy. We run Vector+FTS first to retrieve 10 candidates, then run the Cross-Encoder only on those 10 candidates to maintain sub-100ms pipeline latencies while achieving high precision.
- Global Supabase SQL Database vs. Session-Isolated Storage: Ingested documents are stored globally in PostgreSQL rather than inside session scopes. This allows cross-session discovery (any session can query any past document) but requires extra care to prevent cross-user data leakage by enforcing user-id filters at the database query level.
- LLM JSON Extraction Fragility: LLMs occasionally output unescaped quotes or nested markdown lists that crash standard parsers. In the offline evaluation engine (
evaluator.py), we built an index-slicing backup parser (_parse_judge_json) that isolates brackets and keys to robustly extract scores and reasoning even if the JSON is malformed. - Vector Dimension Scaling Bottleneck: Upgrading the embedding model to a 3072-dimensional output caused Supabase pgvector indexing scripts to crash because pgvector
ivfflat/hnswindexes restrict vector dimensions to a maximum of 2000. We learned to disable the vector indexes to allow 3072-dimension vectors while using exact nearest-neighbor search, which remains sub-millisecond for our database size. - Port Collision and Proxy Loops: Running NestJS and Next.js both on port 3000 on Windows caused an infinite proxy loop because Next.js bound to IPv6 (
::1) while NestJS bound to IPv4 (0.0.0.0). Standardizing the frontend on port 3001 and configuring explicit health and proxy re-writes innext.config.jsfixed this loop.
- Entity-Relation Graph Memory: Move LTM from flat key-value facts to a Neo4j Graph database representation. This would allow the agent to understand multi-degree relationships between concepts (e.g. knowing that "React" is related to "Next.js").
-
Relevance Decay & Memory Pruning: Implement a chronological decay index (
$e^{-\lambda t}$ ) on user facts. Facts that haven't been referenced in months or are contradicted by newer inputs would automatically prune or archive themselves. - Adaptive Chunking: Replace fixed sliding-window chunking with semantic chunking that detects changes in topics, headings, or paragraph structures, preserving context bounds more cleanly.
Verify system health and metrics endpoints post-deployment:
# Core AI Service
curl http://localhost:8000/api/v1/health
# Backend Gateway
curl http://localhost:3000/api/v1/health# Scrape FastAPI core metrics
curl http://localhost:8000/metrics
# Scrape NestJS orchestrator metrics
curl http://localhost:3000/api/v1/health/metricsdocker compose exec ai-service python run_eval.pyThis command runs the automated quality suite (LLM-as-a-judge scorers for Faithfulness, Relevance, and Recall) and writes the results to docs/RAG_EVALUATION_REPORT.md.
| Layer | Technology | Description |
|---|---|---|
| Frontend | Next.js 15, TypeScript, Tailwind CSS, shadcn/ui, Zustand, Framer Motion | |
| Backend Gateway | NestJS, TypeScript, Express, Zod, Throttler Guard, class-validator | Gateway orchestration with distributed request tracing context and rate-limiting |
| AI Microservice | Python 3.11, FastAPI, asyncpg, Pydantic, Structlog, SentenceTransformers | Core service for hybrid retrieval, reranking, and cognitive memory lifecycle |
| Database | Supabase PostgreSQL + pgvector | Relational vector database supporting exact match indexes and FTS lexical ranks |
| AI Foundations | Google Gemini 2.5 Flash, gemini-embedding-001, ms-marco-MiniLM-L-6-v2 |
Advanced LLM generator, 3072-dimension embeddings, and cross-encoder rerank model |