
Embedding-aware API for DeepSeek #802

@stasf25

Description

Hi DeepSeek team and community!

I'd like to propose an addition to the DeepSeek API that would enable
P-tuning and other embedding-based techniques directly through the API.

The idea is simple: add an endpoint that accepts raw embedding tensors
alongside (or instead of) text, while keeping the current token-based billing.

This would allow researchers and engineers to:

  • Do P-tuning without expensive local GPUs
  • Create custom adapters for different tasks
  • Experiment with prefix prompts and embedding manipulation

The implementation is minimal (concatenate embeddings before forward pass),
but the possibilities it opens up are enormous.

Full RFC with technical details and examples is below.

Looking forward to your feedback!

embedding-api-rfc.md

RFC: Embedding-aware API for DeepSeek

Author: DeepSeek Community
Date: 2026-02-22
Status: Proposal

Executive Summary

Add support for raw embedding tensors as input to DeepSeek API, while keeping the current billing model (pay-per-token). This minimal change opens up enormous possibilities for researchers and developers, while fully preserving DeepSeek's business model.

Motivation

Current Situation

  • Researchers and engineers who need low-level control over the model (P-tuning, prefix prompts) are forced to run models locally
  • This requires expensive hardware (GPUs) and expertise

Opportunity for DeepSeek

DeepSeek can give this audience the ability to work through the API, getting paid for the compute they would use anyway. Consider:

| For DeepSeek | For the Community |
| --- | --- |
| Increased revenue | Access to large models without GPUs |
| New customers (research segment) | Ability to experiment |
| Differentiation from competitors | P-tuning via API |

Proposed API

Minimal Implementation

# New endpoint
POST /v1/completions_from_embeds
{
    "model": "deepseek-chat",
    "prompt_embeds": [
        [0.1, -0.2, 0.3, ...],  # virtual token 1 (vector of size hidden_size)
        [0.4, 0.5, -0.1, ...],  # virtual token 2
        ...  # as many virtual tokens as needed
    ],
    "prompt_text": "continue this sentence",  # optional
    "max_tokens": 100,
    "temperature": 0.7
}
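For illustration, the request body above can be assembled client-side today. This is a sketch against the *proposed* endpoint, so the URL and field names are this RFC's assumptions, not an existing API:

```python
import json

# Proposed endpoint from this RFC -- not a live URL.
API_URL = "https://api.deepseek.com/v1/completions_from_embeds"


def build_request(prompt_embeds, prompt_text=None, max_tokens=100, temperature=0.7):
    """Assemble the JSON body for the proposed endpoint."""
    body = {
        "model": "deepseek-chat",
        "prompt_embeds": prompt_embeds,  # shape [num_virtual_tokens, hidden_size]
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    if prompt_text is not None:
        body["prompt_text"] = prompt_text
    return json.dumps(body)


payload = build_request(
    [[0.1, -0.2, 0.3, 0.0], [0.4, 0.5, -0.1, 0.2]],  # two toy virtual tokens
    prompt_text="continue this sentence",
)
```

A real call would simply POST `payload` to `API_URL` with the usual `Authorization` header, exactly as for the existing endpoints.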

Alternative: Extend Existing API

POST /v1/chat/completions
{
    "model": "deepseek-chat",
    "input_embeds": [...],  # if provided, messages is ignored
    "max_tokens": 100
}

Tensor Format

  • Plain JSON array (no binary formats)
  • Shape: [num_virtual_tokens, hidden_size]
  • Data type: float32 (or whatever the model uses)
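One practical consequence of the plain-JSON choice: payload size grows linearly with `num_virtual_tokens × hidden_size`. A quick estimate (the hidden size of 4096 is an assumption for illustration, not DeepSeek's actual value):

```python
import json

hidden_size = 4096        # illustrative; use the target model's actual hidden size
num_virtual_tokens = 5

row = [0.12345678] * hidden_size
payload = json.dumps([row] * num_virtual_tokens)
kib = len(payload) / 1024
# A handful of virtual tokens stays in the hundreds-of-KiB range,
# so plain JSON is workable and no binary format is required.
```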

Billing — Unchanged

def calculate_cost(request):
    # prompt_embeds are just tokens, albeit "virtual" ones
    virtual_tokens = len(request.prompt_embeds)
    
    if request.prompt_text:
        text_tokens = count_tokens(request.prompt_text)
    else:
        text_tokens = 0
    
    total_tokens = virtual_tokens + text_tokens
    
    # Then, existing DeepSeek billing logic
    return total_tokens * price_per_token

Key point: DeepSeek gets paid for exactly the same compute as before. Virtual tokens are no different from regular ones in terms of inference cost.
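A toy walk-through of the rule above (the per-token price is invented for the example and is not DeepSeek's actual rate):

```python
price_per_token = 0.000002   # hypothetical rate, for illustration only

virtual_tokens = 5           # rows in prompt_embeds
text_tokens = 12             # count_tokens(prompt_text)

total_tokens = virtual_tokens + text_tokens   # 17 prompt tokens billed as usual
cost = total_tokens * price_per_token
```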

Technical Complexity

For DeepSeek (Implementation)

# Current pipeline
def generate(prompt_text):
    input_ids = tokenize(prompt_text)
    embeds = model.embed(input_ids)
    return model.forward(embeds)

# New pipeline
def generate(prompt_embeds=None, prompt_text=None):
    if prompt_embeds is not None:  # truthiness of a tensor is ambiguous; test for None
        if prompt_text:  # both embeddings and text provided
            text_ids = tokenize(prompt_text)
            text_embeds = model.embed(text_ids)
            embeds = torch.cat([prompt_embeds, text_embeds], dim=0)
        else:  # embeddings only
            embeds = prompt_embeds
    else:  # text only (legacy mode)
        input_ids = tokenize(prompt_text)
        embeds = model.embed(input_ids)

    return model.forward(embeds)

Effort: roughly one person-day (estimate)
Additional load: negligible (a single concatenation per request)
Risks: low (existing endpoints are untouched, preserving backward compatibility)
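The "one concatenation" claim can be exercised standalone. Here is a dependency-free toy in which plain lists stand in for tensors (the embedding table and token ids are made up):

```python
hidden_size = 4

# Toy embedding table standing in for model.embed
embedding_table = {
    1: [0.1] * hidden_size,
    2: [0.2] * hidden_size,
    3: [0.3] * hidden_size,
}


def assemble_embeds(prompt_embeds=None, text_ids=None):
    """Mirror of the proposed pipeline, up to (but not including) the forward pass."""
    text_embeds = [embedding_table[t] for t in (text_ids or [])]
    return (prompt_embeds or []) + text_embeds  # the single new operation


full = assemble_embeds(
    prompt_embeds=[[0.0] * hidden_size] * 5,  # 5 virtual tokens
    text_ids=[1, 2, 3],                       # pretend-tokenized prompt_text
)
# full has 8 rows: 5 virtual tokens followed by 3 text tokens
```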

Use Cases

Example 1: P-tuning for Classification

import torch
import torch.nn as nn
from transformers import AutoTokenizer

import deepseek

# Tokenizer for labels
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-chat")

# Prompt encoder lives on the client side (2 layers, ~1M parameters)
encoder = nn.Sequential(
    nn.Linear(768, 1536),
    nn.ReLU(),
    nn.Linear(1536, 768)
)

# 5 virtual tokens (trainable), optimized together with the encoder
virtual_tokens = nn.Parameter(torch.randn(5, 768))
optimizer = torch.optim.Adam(list(encoder.parameters()) + [virtual_tokens], lr=1e-4)

# Training loop
for text, label in dataset:
    # Tokenize the label to get its length
    label_tokens = tokenizer.encode(label, add_special_tokens=False)

    # Generate prompt embeddings
    prompt_embeds = encoder(virtual_tokens)

    # Send to DeepSeek along with the text
    response = deepseek.CompletionsFromEmbeds.create(
        model="deepseek-chat",
        prompt_embeds=prompt_embeds.tolist(),
        prompt_text=text,
        max_tokens=len(label_tokens)  # generate exactly as many tokens as the label
    )

    # Compute loss, update encoder (locally!). Caveat: the API returns text,
    # not differentiable logits, so loss_fn must be a gradient-free surrogate
    # (e.g. a zeroth-order estimate) unless the endpoint also returns logprobs.
    loss = loss_fn(response.text, label)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # DeepSeek got paid for 5 + count_tokens(text) + len(label_tokens) tokens

Example 2: Prefix Prompts (p-tuned model inference)

# After training, use the learned prefix for inference
prefix_embeds = encoder(virtual_tokens).detach()  # [5, 768] - trained for a specific task

# Apply to each query
for user_query in queries:
    response = deepseek.CompletionsFromEmbeds.create(
        model="deepseek-chat",
        prompt_embeds=prefix_embeds.tolist(),
        prompt_text=user_query,
        max_tokens=100
    )

Example 3: Controlled Generation via Embedding Manipulation

import numpy as np

# Want to influence generation style
base_prompt = "Write a summary of 'War and Peace'"

# Get "neutral" embeddings (assumes an embeddings endpoint that returns
# per-token vectors of size hidden_size)
neutral_embeds = np.array(deepseek.Embeddings.create(
    model="deepseek-chat",
    input=base_prompt
).embeddings)  # [num_tokens, 768]

# Mix in a "conciseness" direction (hypothetical steering vector)
concise_vector = load_concise_vector()  # [768]
concise_embeds = neutral_embeds + concise_vector * 0.1  # broadcast over tokens

# Generate
response = deepseek.CompletionsFromEmbeds.create(
    model="deepseek-chat",
    prompt_embeds=concise_embeds.tolist()
)
# The result should be shorter than usual

Example 4: Custom Adapters for Enterprise

# Enterprise trains task-specific adapters for multiple tasks
sentiment_prefix = load_prefix("sentiment_analysis")  # [5, 768]
summarization_prefix = load_prefix("summarization")  # [8, 768]
translation_prefix = load_prefix("en_to_fr")  # [6, 768]

# Use appropriate prefix for each request
for document, task in incoming_requests:
    prefix = {
        "sentiment": sentiment_prefix,
        "summary": summarization_prefix,
        "translate": translation_prefix
    }[task]
    
    response = deepseek.CompletionsFromEmbeds.create(
        model="deepseek-chat",
        prompt_embeds=prefix.tolist(),
        prompt_text=document,
        max_tokens=200
    )

Target Audience

| Category | Scenario | Value Proposition | Willingness to Pay |
| --- | --- | --- | --- |
| Researchers | P-tuning, prefix prompts | Save on GPU costs | ✅ High |
| ML Engineers | Rapid prototyping | No need to deploy models | ✅ Medium |
| Students/Universities | Coursework, theses | Access to large models | ⚠️ Limited |
| Startups | Adaptation experiments | Low entry barrier | ✅ Medium |
| Enterprise | Custom adapters | Multiple tasks with shared infrastructure | ✅ High |

Security

  1. No access to weights — only forward pass via API
  2. Rate limiting — same as regular requests
  3. Dimension validation — embeddings must match model's hidden_size
  4. Backward compatibility — existing clients unaffected
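Points 2 and 3 amount to a few cheap guards before the forward pass. A sketch (the cap of 256 virtual tokens is an invented placeholder, not a proposed limit):

```python
def check_prompt_embeds(prompt_embeds, hidden_size, max_virtual_tokens=256):
    """Reject malformed or oversized embedding inputs before inference."""
    if not prompt_embeds:
        raise ValueError("prompt_embeds must contain at least one vector")
    if len(prompt_embeds) > max_virtual_tokens:
        raise ValueError(f"at most {max_virtual_tokens} virtual tokens allowed")
    for i, vec in enumerate(prompt_embeds):
        if len(vec) != hidden_size:
            raise ValueError(f"vector {i}: expected size {hidden_size}, got {len(vec)}")
        if not all(isinstance(x, (int, float)) for x in vec):
            raise ValueError(f"vector {i} contains non-numeric values")
    return len(prompt_embeds)  # number of virtual tokens, for billing
```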

Why This Benefits DeepSeek

| Aspect | Advantage |
| --- | --- |
| Technical | Minimal development effort |
| Commercial | New customers (research segment), with increased revenue as a consequence |
| Marketing | First to market; differentiation from competitors |
| Strategic | Research community loyalty |

Competitor Comparison

| Provider | Text API | Embedding-input API | P-tuning via API |
| --- | --- | --- | --- |
| OpenAI | ✅ | ❌ (search only) | ❌ |
| Anthropic | ✅ | ❌ | ❌ |
| Google | ✅ | ❌ | ❌ |
| DeepSeek (current) | ✅ | ❌ | ❌ |
| DeepSeek (proposed) | ✅ | ✅ | ✅ |

Conclusion

This minimal change opens up significant possibilities for the community while fully preserving DeepSeek's business model. Researchers gain access to large models without expensive hardware; DeepSeek is paid for the same compute those users would otherwise run locally.

Technically: simple
Commercially: profitable
Strategically: timely
