<div align="center">
<pre>
██████╗ ██╗  ██╗ █████╗ ███╗   ██╗████████╗ ██████╗ ███╗   ███╗
██╔══██╗██║  ██║██╔══██╗████╗  ██║╚══██╔══╝██╔═══██╗████╗ ████║
██████╔╝███████║███████║██╔██╗ ██║   ██║   ██║   ██║██╔████╔██║
██╔═══╝ ██╔══██║██╔══██║██║╚██╗██║   ██║   ██║   ██║██║╚██╔╝██║
██║     ██║  ██║██║  ██║██║ ╚████║   ██║   ╚██████╔╝██║ ╚═╝ ██║
╚═╝     ╚═╝  ╚═╝╚═╝  ╚═╝╚═╝  ╚═══╝   ╚═╝    ╚═════╝ ╚═╝     ╚═╝
</pre>

Local-first document intelligence engine. Private, sovereign, and platform-agnostic.

</div>

Phantom is a voidnxlabs-grade document intelligence engine that classifies, sanitizes, and understands unstructured data — locally, privately, and fast.
It is architected as an infrastructure-agnostic system. While it leverages Nix for hermetic development environments, it is fully compatible with any OCI-compliant container runtime or standard Python 3.11+ environment. It interfaces with local LLMs via llama.cpp and indexes data into FAISS through a high-performance RAG pipeline.
Core Mission: Transform raw documents into structured intelligence — themes, patterns, PII reports, and vector search — without data ever leaving your controlled environment.
```
phantom/
├── src/phantom/      — Core Python logic (CORTEX, RAG, DAG, API)
├── cortex-desktop/   — Desktop GUI (Tauri 2 + SvelteKit)
├── spectre/          — Sentiment & Pattern Extraction component
├── docs/             — Structured documentation (Architecture, API, Guides)
├── nix/              — Hermetic environment & package definitions
└── tests/            — Comprehensive Python & integration test suite
```

- IntelAgent (Rust) — 8-crate workspace for decentralized agent governance, security, and memory.
- Cloud LLM Providers — Native support for OpenAI, Anthropic, and DeepSeek.
- Redis Semantic Cache — Low-latency response caching for recurring queries.
- Kubernetes Helm Charts — For scalable, self-hosted enterprise deployments.
Fork the repo, make your changes, and open a PR. Follow voidnxlabs on social media for updates.
For enterprise support, open a support ticket.
Phantom is optimized for Nix, but supports any OCI-compliant or Python 3.11+ environment.
```bash
git clone https://github.com/VoidNxSEC/phantom
cd phantom

# Drop into the fully-pinned dev environment
nix develop

# Run the test suite to confirm everything works
just test

# Start the API server
just serve

# Or run the full desktop app
just desktop
```

By leveraging Nix, the development environment is hermetic and reproducible. However, Phantom remains fully deployable via standard Python tools or Docker for production environments.
The heart of Phantom. Processes raw documents into structured insights through a multi-stage pipeline:
```
Document → SemanticChunker → EmbeddingGenerator → LLM Classifier → Pydantic Schema
```

- Chunking with configurable token budgets (default: 1024 tokens, 128 overlap)
- Parallel LLM calls with retry logic (3 attempts, 2s backoff)
- Real-time VRAM monitoring with auto-throttle — won't OOM your GPU
- Extracts: `Theme`, `Pattern`, `Learning`, `Concept`, `Recommendation`
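The token-budget chunking above boils down to a sliding window with overlap. A minimal sketch of the idea (hypothetical `chunk_tokens` helper using pre-split tokens; Phantom's SemanticChunker is more sophisticated):

```python
# Illustrative sliding-window chunker: windows of `budget` tokens,
# each sharing `overlap` tokens with its predecessor.
def chunk_tokens(tokens: list[str], budget: int = 1024, overlap: int = 128) -> list[list[str]]:
    if overlap >= budget:
        raise ValueError("overlap must be smaller than the token budget")
    step = budget - overlap  # how far the window advances each iteration
    return [tokens[i:i + budget] for i in range(0, max(len(tokens) - overlap, 1), step)]
```

The overlap preserves context at chunk boundaries so a sentence split across two chunks is still fully visible to at least one LLM call.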
```bash
# Process a document directory
just run extract --input ./docs --output ./insights.json

# Or hit the API directly
curl -X POST http://localhost:8000/process \
  -F "file=@report.pdf" \
  -F "chunk_strategy=recursive" \
  -F "chunk_size=1024"
```

Most RAG systems pick either semantic or keyword search. Phantom does both and fuses the results using Reciprocal Rank Fusion:
```
Query → FAISS (dense cosine) ─┐
                              ├→ RRF Fusion → Ranked Results
Query → BM25Okapi (sparse) ───┘
```

- FAISS `IndexFlatIP` with L2-normalized cosine similarity
- Optional GPU acceleration via `StandardGpuResources`
- BM25 index rebuilt lazily on each `add()` — no manual sync required
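Reciprocal Rank Fusion scores each document by summing `1 / (k + rank)` over every ranked list it appears in. A minimal sketch (hypothetical `rrf_fuse` helper with the conventional `k = 60`, not Phantom's actual implementation):

```python
from collections import defaultdict

# Fuse multiple ranked lists of doc IDs with Reciprocal Rank Fusion.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # classic RRF term
    return sorted(scores, key=scores.get, reverse=True)
```

A document that ranks well in both the dense and sparse lists floats to the top even if neither ranker put it first.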
```bash
curl -X POST http://localhost:8000/vectors/search \
  -H "Content-Type: application/json" \
  -d '{"query": "compliance requirements", "top_k": 5, "search_type": "hybrid"}'
```

Context-aware chat over your document base. Supports SSE streaming for real-time token delivery.
```bash
# Streaming chat
curl -X POST http://localhost:8000/api/chat/stream \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What are the key risks in the Q3 report?",
    "conversation_id": "session-001",
    "history": [],
    "context_size": 5
  }'
```

Phantom uses a strictly decoupled provider abstraction layer. Local inference via llama.cpp is the primary target for maximum data sovereignty, but the engine is not tied to it: the provider registry interoperates with any OpenAI-compatible API, so you can deploy across diverse environments without architectural lock-in.
Phantom's DAG pipeline processes files through a classification and sanitization chain before they ever touch your vector store:
```
Discovery → Fingerprint → Classify → Pseudonymize → Sanitize → Verify → Persist
```

Four sanitization levels:

| Level | What happens |
|---|---|
| `none` | Direct copy, no modifications |
| `strip_metadata` | EXIF, document properties, author fields removed |
| `redact_pii` | Email, phone, SSN, CPF/CNPJ, credit cards replaced with `[REDACTED]` |
| `full_sanitize` | Everything above + content normalization |
PII detection covers: email addresses, phone numbers, SSN, CPF/CNPJ, payment card numbers, AWS credentials, API keys, Bearer tokens, private keys, PGP blocks, IPv4/IPv6 ranges, UUIDs.
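As a rough illustration of how regex-based detection works (hypothetical patterns, far narrower than Phantom's actual `SENSITIVE_PATTERNS` list):

```python
import re

# Toy PII scanner: label → compiled pattern. Real patterns cover many
# more categories (API keys, private keys, CPF/CNPJ, card numbers, ...).
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def scan_text(text: str) -> list[tuple[str, str]]:
    """Return (label, match) pairs for every hit in `text`."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        hits.extend((label, m) for m in pattern.findall(text))
    return hits
```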
```bash
# Scan a directory for sensitive content
phantom-scan ./repo | jq '.findings[] | select(.risk_score > 0.7)'

# Sanitize before exporting
phantom-dag -i ./internal_dataset -o ./export --sanitize pii

# Dry-run to preview what would happen
phantom -i ./input -o ./output --dry-run
```

Every file processed gets a hash. You choose the algorithm:
| Algorithm | Use case |
|---|---|
| SHA256 | Baseline integrity, broad compatibility |
| BLAKE3 | High-throughput, modern standard |
| xxHash | Maximum speed, block-level streaming |
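A manifest is just a map from relative path to digest. A stdlib-only sketch of the idea (hypothetical `build_manifest` helper, SHA256 only, since BLAKE3 and xxHash require third-party packages; the real `phantom-hash` output format may differ):

```python
import hashlib
import json
from pathlib import Path

# Walk a directory tree and hash every file into a {relpath: sha256} map.
def build_manifest(root: str) -> dict[str, str]:
    manifest = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(root))] = digest
    return manifest
```

Dumping two such manifests with `json.dumps(..., sort_keys=True)` and diffing them is exactly the transfer-verification trick shown below.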
```bash
# Generate a manifest
phantom-hash ./directory > manifest.json

# Verify a file against a known hash
phantom-verify report.pdf abc123def456...

# Diff two manifests (transfer verification)
diff <(jq -S . before.json) <(jq -S . after.json)
```

The FastAPI server runs at http://localhost:8000 by default. Prometheus metrics at `/metrics`, OpenAPI docs at `/docs`.
| Endpoint | Method | Purpose |
|---|---|---|
| `/health` | GET | Liveness probe |
| `/ready` | GET | Readiness check with downstream deps |
| `/metrics` | GET | Prometheus metrics |
| `/api/system/metrics` | GET | CPU, RAM, VRAM, disk |
| `/process` | POST | Process document with CORTEX |
| `/extract` | POST | Extract insights from text |
| `/upload` | POST | Single file upload |
| `/api/upload` | POST | Multi-file upload with processing |
| `/vectors/search` | POST | Hybrid vector search |
| `/vectors/index` | POST | Index document to FAISS |
| `/vectors/batch-index` | POST | Batch indexing |
| `/api/chat` | POST | RAG-powered chat |
| `/api/chat/stream` | POST | SSE streaming chat |
| `/api/models` | GET | List available LLM models |
| `/api/prompt/test` | POST | Render and token-count a prompt |
| `/api/pipeline` | POST | Full DAG pipeline execution |
| `/api/pipeline/scan` | POST | Scan-only (read-only, no writes) |
| `/judge` | POST | AI-Agent-OS judgment integration |
All request/response bodies are validated by Pydantic v2. No silent failures.
```
output/
├── documents/    # PDF, DOCX, TXT, MD
├── images/       # PNG, JPG, SVG
├── audio/        # MP3, FLAC, WAV
├── video/        # MP4, MKV, AVI
├── code/         # PY, JS, RS, GO, NIX
├── data/         # JSON, CSV, PARQUET
├── archives/     # ZIP, TAR, 7Z
├── configs/      # ENV, CONF, INI
├── logs/         # LOG, OUT, ERR
├── crypto/       # PEM, KEY, P12
├── executables/  # ELF, EXE, DEB
├── unknown/      # Unclassified
└── .phantom/
    ├── phantom.db          # SQLite audit log
    ├── pseudonym_map.json  # Reversible path mapping
    ├── reports/            # JSON execution reports
    ├── audit/              # Chain of custody
    ├── staging/            # Processing scratch space
    └── quarantine/         # Files that failed validation
```

```json
{
  "phantom_version": "0.1.0",
  "statistics": {
    "total_files": 15420,
    "processed": 15398,
    "failed": 22,
    "success_rate": "99.86%",
    "total_size_human": "48.32 GB",
    "duration_seconds": "127.45",
    "throughput_files_per_sec": "120.81",
    "files_with_sensitive_data": 847
  },
  "sensitivity_breakdown": {
    "PUBLIC": 12453,
    "INTERNAL": 1892,
    "CONFIDENTIAL": 734,
    "SECRET": 289,
    "TOP_SECRET": 30
  }
}
```

Original paths are replaced with deterministic, reversible pseudonyms. Nothing is lost — the mapping is persisted in `pseudonym_map.json`.
```
/home/user/docs/secret_report_2024.pdf
                 ↓
PH-a1b2c3d4-e5f6a7b8-1234abcd.pdf
│  │        │        │
│  │        │        └─ Hexadecimal timestamp
│  │        └─ Random entropy block
│  └─ Deterministic path hash
└─ Namespace prefix
```
```bash
# Resolve it back
phantom --resolve PH-a1b2c3d4-e5f6a7b8-1234abcd.pdf
```

Three levels. No compromises.
```bash
# Everything
just test

# Targeted
just test-unit
just test-integration
just test-e2e

# With coverage report (enforced minimum: 70%)
just test-cov

# GPU-specific tests
just test-gpu

# Match a pattern
just test-match "test_vector"
```

```
tests/
├── conftest.py       # Shared fixtures
├── test_imports.py   # Critical import smoke tests
├── unit/             # 17 test modules (isolated, fast)
├── integration/      # API + CLI tests (requires running server)
└── e2e/              # Full pipeline tests (slow, thorough)
```

Coverage is enforced at a 70% minimum via `pytest --cov-fail-under=70`. CI will fail before you merge something that regresses it.
```python
# src/phantom/pipeline/phantom_dag.py
SENSITIVE_PATTERNS = [
    # (regex, label, risk_score)
    (r'your_pattern', 'YOUR_LABEL', 0.9),
]

EXT_MAP = {
    '.yourext': Classification.DOCUMENTS,
}
```

Implement `AIProvider` from `src/phantom/providers/base.py`:

```python
class YourProvider(AIProvider):
    async def generate(self, prompt: str, **kwargs) -> GenerationResult: ...
    async def stream(self, prompt: str, **kwargs) -> AsyncIterator[str]: ...
    async def health_check(self) -> ProviderStatus: ...
```

Register it in the API's provider resolver. Done.
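For illustration, here is a toy provider with the same async shape. The stub `GenerationResult` and the plain-string health status are stand-ins; the real types come from `src/phantom/providers/base.py` and differ in detail:

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator

# Stand-in for the real GenerationResult (illustration only).
@dataclass
class GenerationResult:
    text: str

class EchoProvider:
    """Toy provider that echoes the prompt back, showing the async contract."""

    async def generate(self, prompt: str, **kwargs) -> GenerationResult:
        return GenerationResult(text=f"echo: {prompt}")

    async def stream(self, prompt: str, **kwargs) -> AsyncIterator[str]:
        for token in prompt.split():
            yield token  # one "token" at a time, as SSE streaming would

    async def health_check(self) -> str:
        return "healthy"  # real code returns a ProviderStatus
```

`stream` is an async generator, which is what lets the API layer forward tokens to the client as they arrive instead of buffering the full response.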
- Files over 10MB are skipped during deep PII scanning (magic bytes + extension classification still applies).
- Encrypted file content cannot be classified beyond magic bytes and extension.
- Metadata stripping is best-effort on proprietary formats — some residual metadata may survive.
We are working on these issues — feel free to submit a PR or open an issue.
Normalize legacy storage:

```bash
phantom -i /mnt/legacy -o /mnt/normalized -w 8 -v
```

Export a sanitized dataset:

```bash
phantom-dag -i ./internal -o ./export --sanitize pii
```

Audit a repo before committing:

```bash
phantom-scan ./project | jq '.findings[] | select(.risk_score > 0.7)'
```

Verify a data transfer:

```bash
phantom-hash ./original > before.json
cp -r ./original ./destination
phantom-hash ./destination > after.json
diff <(jq -S . before.json) <(jq -S . after.json)
```

Ask questions about your documents:

```bash
just serve &
curl -X POST http://localhost:8000/vectors/index -F "file=@docs.pdf"
curl -X POST http://localhost:8000/api/chat \
  -d '{"message": "Summarize the main risks", "conversation_id": "s1", "history": []}'
```
- SAST: CodeQL (Python + JavaScript), Bandit
- Dependency audit: pip-audit, safety, cargo-audit
- Secret scanning: Trufflehog, detect-secrets
- SBOM: CycloneDX, SPDX JSON, Syft — every build
- Vulnerability scan: Grype against SBOM
- Supply chain: OpenSSF Scorecard
Found a vulnerability? See SECURITY.md.
```bash
nix develop   # enter the pinned shell
just lint     # ruff + mypy
just fmt      # ruff format
just quality  # lint + typecheck + security scan
just ci       # lint + test (what CI runs)
just stats    # project statistics
just info     # environment summary
```

All tasks live in the `justfile`. Run `just` with no arguments to list them.
Pre-commit hooks are installed automatically when you enter nix develop. They run ruff, mypy, and bandit before every commit.
| Component | Status |
|---|---|
| CORTEX Engine | Production ready |
| FAISS Vector Store + Hybrid Search | Production ready |
| FastAPI Server (20 endpoints) | Production ready |
| DAG Pipeline + Sanitization | Production ready |
| Prometheus Metrics + Structlog | Production ready |
| CI/CD (7 workflows) | Production ready |
| Cortex Desktop (Tauri + SvelteKit) | Beta |
| CLI Commands | Complete |
| SPECTRE Analysis | Production ready |
| IntelAgent (Rust) | Planned |
Apache 2.0. See LICENSE.
Read CONTRIBUTING.md before opening a PR.
For architecture changes or significant API modifications, open an issue first. The docs/adr/ directory has the decision history — read it before proposing something we already debated and rejected.
Contributions welcome. Hot takes about the architecture go in the issues. Fixes go in PRs.