Hi DeepSeek team and community!
I'd like to propose an addition to the DeepSeek API that would enable
P-tuning and other embedding-based techniques directly through the API.
The idea is simple: add an endpoint that accepts raw embedding tensors
alongside (or instead of) text, while keeping the current token-based billing.
This would allow researchers and engineers to:
- Do P-tuning without expensive local GPUs
- Create custom adapters for different tasks
- Experiment with prefix prompts and embedding manipulation
The implementation is minimal (concatenate embeddings before forward pass),
but the possibilities it opens up are enormous.
Full RFC with technical details and examples is below.
Looking forward to your feedback!
RFC: Embedding-aware API for DeepSeek
Author: DeepSeek Community
Date: 2026-02-22
Status: Proposal
Executive Summary
Add support for raw embedding tensors as input to DeepSeek API, while keeping the current billing model (pay-per-token). This minimal change opens up enormous possibilities for researchers and developers, while fully preserving DeepSeek's business model.
Motivation
Current Situation
- Researchers and engineers who need low-level control over the model (P-tuning, prefix prompts) are forced to run models locally
- This requires expensive hardware (GPUs) and expertise
Opportunity for DeepSeek
DeepSeek can give this audience the ability to work through the API, getting paid for the compute they would use anyway. Consider:
| For DeepSeek | For the Community |
|---|---|
| Increased revenue as a consequence | Access to large models without GPUs |
| New customers (research segment) | Ability to experiment |
| Differentiation from competitors | P-tuning via API |
Proposed API
Minimal Implementation
```
# New endpoint
POST /v1/completions_from_embeds
{
  "model": "deepseek-chat",
  "prompt_embeds": [
    [0.1, -0.2, 0.3, ...],  # virtual token 1 (vector of size hidden_size)
    [0.4, 0.5, -0.1, ...],  # virtual token 2
    ...                     # as many virtual tokens as needed
  ],
  "prompt_text": "continue this sentence",  # optional
  "max_tokens": 100,
  "temperature": 0.7
}
```
Alternative: Extend Existing API
```
POST /v1/chat/completions
{
  "model": "deepseek-chat",
  "input_embeds": [...],  # if provided, messages is ignored
  "max_tokens": 100
}
```
Tensor Format
- Plain JSON array (no binary formats)
- Shape: `[num_virtual_tokens, hidden_size]`
- Data type: float32 (or whatever the model uses)
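Given this format, a client-side helper could serialize and sanity-check the tensor before sending it. This is a sketch; `HIDDEN_SIZE = 768` and the rectangular-list payload shape are assumptions taken from the examples in this RFC, not a confirmed DeepSeek constant:

```python
import math

HIDDEN_SIZE = 768  # assumed model hidden size, per the examples in this RFC

def to_prompt_embeds(tensor):
    """Convert a [num_virtual_tokens, hidden_size] nested list into a
    JSON-safe payload, validating shape and values on the client side."""
    if not tensor or any(len(row) != HIDDEN_SIZE for row in tensor):
        raise ValueError(f"each virtual token must be a vector of size {HIDDEN_SIZE}")
    for row in tensor:
        if not all(isinstance(x, (int, float)) and math.isfinite(x) for x in row):
            raise ValueError("embeddings must be finite floats")
    # Plain JSON array of float32-representable values, no binary formats
    return [[float(x) for x in row] for row in tensor]

# Example: a payload of two virtual tokens
embeds = to_prompt_embeds([[0.1] * HIDDEN_SIZE, [0.2] * HIDDEN_SIZE])
```

Catching shape errors before the request saves a round trip; the server would still re-validate (see Security below).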
Billing — Unchanged
```python
def calculate_cost(request):
    # prompt_embeds are just tokens, albeit "virtual" ones
    virtual_tokens = len(request.prompt_embeds)
    if request.prompt_text:
        text_tokens = count_tokens(request.prompt_text)
    else:
        text_tokens = 0
    total_tokens = virtual_tokens + text_tokens
    # Then, existing DeepSeek billing logic
    return total_tokens * price_per_token
```
Key point: DeepSeek gets paid for exactly the same compute as before. Virtual tokens are no different from regular ones in terms of inference cost.
Technical Complexity
For DeepSeek (Implementation)
```python
# Current pipeline
def generate(prompt_text):
    input_ids = tokenize(prompt_text)
    embeds = model.embed(input_ids)
    return model.forward(embeds)

# New pipeline
def generate(prompt_embeds, prompt_text=None):
    if prompt_embeds is not None:
        if prompt_text is not None:  # both embeddings and text provided
            text_ids = tokenize(prompt_text)
            text_embeds = model.embed(text_ids)
            embeds = torch.cat([prompt_embeds, text_embeds])
        else:  # embeddings only
            embeds = prompt_embeds
    else:  # text only (legacy mode)
        input_ids = tokenize(prompt_text)
        embeds = model.embed(input_ids)
    return model.forward(embeds)
```
- Effort: 1 person-day
- Additional load: negligible (one concatenation)
- Risks: none (full backward compatibility)
Use Cases
Example 1: P-tuning for Classification
```python
import torch
import torch.nn as nn
import deepseek
from transformers import AutoTokenizer

# Tokenizer for labels
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-chat")

# Prompt encoder lives on the client side (2 layers, ~1M parameters)
encoder = nn.Sequential(
    nn.Linear(768, 1536),
    nn.ReLU(),
    nn.Linear(1536, 768)
)

# 5 virtual tokens (trainable)
virtual_tokens = nn.Parameter(torch.randn(5, 768))
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + [virtual_tokens], lr=1e-4
)

# Training loop
for text, label in dataset:
    # Tokenize label to get its length
    label_tokens = tokenizer.encode(label, add_special_tokens=False)

    # Generate prompt embeddings
    prompt_embeds = encoder(virtual_tokens)

    # Send to DeepSeek along with text
    response = deepseek.CompletionsFromEmbeds.create(
        model="deepseek-chat",
        prompt_embeds=prompt_embeds.tolist(),
        prompt_text=text,
        max_tokens=len(label_tokens)  # generate exactly as many tokens as the label
    )

    # Compute loss and update the encoder locally. Note: the API returns
    # generated text, not gradients, so loss_fn must be a differentiable
    # surrogate; otherwise a gradient-free estimator (e.g. SPSA) is needed.
    optimizer.zero_grad()
    loss = loss_fn(response.text, label)
    loss.backward()
    optimizer.step()

# DeepSeek got paid for 5 + len(text) + len(label_tokens) tokens
```
Example 2: Prefix Prompts (p-tuned model inference)
```python
# After training, use the learned prefix for inference
prefix_embeds = encoder(virtual_tokens).detach()  # [5, 768], trained for a specific task

# Apply it to each query
for user_query in queries:
    response = deepseek.CompletionsFromEmbeds.create(
        model="deepseek-chat",
        prompt_embeds=prefix_embeds.tolist(),
        prompt_text=user_query,
        max_tokens=100
    )
```
Example 3: Controlled Generation via Embedding Manipulation
```python
# Goal: influence the generation style
base_prompt = "Write a summary of 'War and Peace'"

# Get "neutral" embeddings
neutral_embeds = deepseek.Embeddings.create(
    model="deepseek-chat",
    input=base_prompt
).embeddings

# Mix in a "conciseness" vector (hypothetical)
concise_vector = load_concise_vector()  # [768]
concise_embeds = neutral_embeds + concise_vector * 0.1

# Generate
response = deepseek.CompletionsFromEmbeds.create(
    model="deepseek-chat",
    prompt_embeds=concise_embeds.tolist()
)
# Result should be shorter than usual
```
Example 4: Custom Adapters for Enterprise
```python
# An enterprise trains task-specific adapters for multiple tasks
sentiment_prefix = load_prefix("sentiment_analysis")  # [5, 768]
summarization_prefix = load_prefix("summarization")   # [8, 768]
translation_prefix = load_prefix("en_to_fr")          # [6, 768]

# Use the appropriate prefix for each request
for document, task in incoming_requests:
    prefix = {
        "sentiment": sentiment_prefix,
        "summary": summarization_prefix,
        "translate": translation_prefix
    }[task]
    response = deepseek.CompletionsFromEmbeds.create(
        model="deepseek-chat",
        prompt_embeds=prefix.tolist(),
        prompt_text=document,
        max_tokens=200
    )
```
Target Audience
| Category | Scenario | Value Proposition | Willingness to Pay |
|---|---|---|---|
| Researchers | P-tuning, prefix prompts | Save on GPU costs | ✅ High |
| ML Engineers | Rapid prototyping | No need to deploy models | ✅ Medium |
| Students/Universities | Coursework, theses | Access to large models | |
| Startups | Adaptation experiments | Low entry barrier | ✅ Medium |
| Enterprise | Custom adapters | Multiple tasks with shared infrastructure | ✅ High |
Security
- No access to weights — only forward pass via API
- Rate limiting — same as regular requests
- Dimension validation — embeddings must match model's hidden_size
- Backward compatibility — existing clients unaffected
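The dimension-validation point above could look like the following server-side sketch. The constants are illustrative assumptions, not actual DeepSeek limits:

```python
import math

HIDDEN_SIZE = 768         # illustrative; must equal the model's actual hidden size
MAX_VIRTUAL_TOKENS = 512  # illustrative cap, analogous to a context-length limit

def validate_prompt_embeds(prompt_embeds):
    """Reject malformed embedding payloads before they reach the model."""
    if not 1 <= len(prompt_embeds) <= MAX_VIRTUAL_TOKENS:
        raise ValueError(f"expected 1..{MAX_VIRTUAL_TOKENS} virtual tokens")
    for i, vec in enumerate(prompt_embeds):
        if len(vec) != HIDDEN_SIZE:
            raise ValueError(
                f"virtual token {i}: expected size {HIDDEN_SIZE}, got {len(vec)}"
            )
        if not all(math.isfinite(x) for x in vec):
            raise ValueError(f"virtual token {i}: non-finite value")
    return True
```

Since rejected requests never run a forward pass, this check also keeps the billing model honest: only validated virtual tokens are counted.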
Why This Benefits DeepSeek
| Aspect | Advantage |
|---|---|
| Technical | Minimal development effort |
| Commercial | New customers (research segment), increased revenue as a consequence |
| Marketing | First to market, differentiation from competitors |
| Strategic | Research community loyalty |
Competitor Comparison
| Provider | Text API | Embedding API | P-tuning via API |
|---|---|---|---|
| OpenAI | ✅ | ❌ (search only) | ❌ |
| Anthropic | ✅ | ❌ | ❌ |
| | ✅ | ❌ | ❌ |
| DeepSeek (current) | ✅ | ❌ | ❌ |
| DeepSeek (proposed) | ✅ | ✅ | ✅ |
Conclusion
This minimal change opens up enormous possibilities for the community while fully preserving DeepSeek's business model. Researchers gain access to large models without expensive hardware; DeepSeek gets paid for the compute they would use anyway.
- Technically: simple
- Commercially: profitable
- Strategically: timely