
Embedding-aware API for DeepSeek #802

@stasf25

Description

Hi DeepSeek team and community!

I'd like to propose an addition to the DeepSeek API that would enable
P-tuning and other embedding-based techniques directly through the API.

The idea is simple: add an endpoint that accepts raw embedding tensors
alongside (or instead of) text, while keeping the current token-based billing.

This would allow researchers and engineers to:

  • Do P-tuning without expensive local GPUs
  • Create custom adapters for different tasks
  • Experiment with prefix prompts and embedding manipulation

The implementation is minimal (concatenate embeddings before forward pass),
but the possibilities it opens up are enormous.

Full RFC with technical details and examples is below.

Looking forward to your feedback!

embedding-api-rfc.md

RFC: Embedding-aware API for DeepSeek

Author: DeepSeek Community
Date: 2026-02-22
Status: Proposal

Executive Summary

Add support for raw embedding tensors as input to DeepSeek API, while keeping the current billing model (pay-per-token). This minimal change opens up enormous possibilities for researchers and developers, while fully preserving DeepSeek's business model.

Motivation

Current Situation

  • Researchers and engineers who need low-level control over the model (P-tuning, prefix prompts) are forced to run models locally
  • This requires expensive hardware (GPUs) and expertise

Opportunity for DeepSeek

DeepSeek can give this audience the ability to work through the API, getting paid for the compute they would use anyway. Consider:

| For DeepSeek | For the Community |
| --- | --- |
| Increased revenue | Access to large models without GPUs |
| New customers (research segment) | Ability to experiment |
| Differentiation from competitors | P-tuning via API |

Proposed API

Minimal Implementation

# New endpoint
POST /v1/completions_from_embeds
{
    "model": "deepseek-chat",
    "prompt_embeds": [
        [0.1, -0.2, 0.3, ...],  # virtual token 1 (vector of size hidden_size)
        [0.4, 0.5, -0.1, ...],  # virtual token 2
        ...  # as many virtual tokens as needed
    ],
    "prompt_text": "continue this sentence",  # optional
    "max_tokens": 100,
    "temperature": 0.7
}
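For illustration, the request body above can be assembled client-side today. This is a sketch against the *proposed* endpoint, so the URL and field names are this RFC's assumptions, not an existing API:

```python
import json

# Proposed endpoint from this RFC -- not a live URL.
API_URL = "https://api.deepseek.com/v1/completions_from_embeds"


def build_request(prompt_embeds, prompt_text=None, max_tokens=100, temperature=0.7):
    """Assemble the JSON body for the proposed endpoint."""
    body = {
        "model": "deepseek-chat",
        "prompt_embeds": prompt_embeds,  # shape [num_virtual_tokens, hidden_size]
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    if prompt_text is not None:
        body["prompt_text"] = prompt_text
    return json.dumps(body)


payload = build_request(
    [[0.1, -0.2, 0.3, 0.0], [0.4, 0.5, -0.1, 0.2]],  # two toy virtual tokens
    prompt_text="continue this sentence",
)
```

A real call would simply POST `payload` to `API_URL` with the usual `Authorization` header, exactly as for the existing endpoints.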

Alternative: Extend Existing API

POST /v1/chat/completions
{
    "model": "deepseek-chat",
    "input_embeds": [...],  # if provided, messages is ignored
    "max_tokens": 100
}

Tensor Format

  • Plain JSON array (no binary formats)
  • Shape: [num_virtual_tokens, hidden_size]
  • Data type: float32 (or whatever the model uses)
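One practical consequence of the plain-JSON choice: payload size grows linearly with `num_virtual_tokens × hidden_size`. A quick estimate (the hidden size of 4096 is an assumption for illustration, not DeepSeek's actual value):

```python
import json

hidden_size = 4096        # illustrative; use the target model's actual hidden size
num_virtual_tokens = 5

row = [0.12345678] * hidden_size
payload = json.dumps([row] * num_virtual_tokens)
kib = len(payload) / 1024
# A handful of virtual tokens stays in the hundreds-of-KiB range,
# so plain JSON is workable and no binary format is required.
```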

Billing — Unchanged

def calculate_cost(request):
    # prompt_embeds are just tokens, albeit "virtual" ones
    virtual_tokens = len(request.prompt_embeds)
    
    if request.prompt_text:
        text_tokens = count_tokens(request.prompt_text)
    else:
        text_tokens = 0
    
    total_tokens = virtual_tokens + text_tokens
    
    # Then, existing DeepSeek billing logic
    return total_tokens * price_per_token

Key point: DeepSeek gets paid for exactly the same compute as before. Virtual tokens are no different from regular ones in terms of inference cost.
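A toy walk-through of the rule above (the per-token price is invented for the example and is not DeepSeek's actual rate):

```python
price_per_token = 0.000002   # hypothetical rate, for illustration only

virtual_tokens = 5           # rows in prompt_embeds
text_tokens = 12             # count_tokens(prompt_text)

total_tokens = virtual_tokens + text_tokens   # 17 prompt tokens billed as usual
cost = total_tokens * price_per_token
```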

Technical Complexity

For DeepSeek (Implementation)

# Current pipeline
def generate(prompt_text):
    input_ids = tokenize(prompt_text)
    embeds = model.embed(input_ids)
    return model.forward(embeds)

# New pipeline
def generate(prompt_embeds=None, prompt_text=None):
    if prompt_embeds is not None:  # truthiness of a tensor is ambiguous; test for None
        if prompt_text:  # both embeddings and text provided
            text_ids = tokenize(prompt_text)
            text_embeds = model.embed(text_ids)
            embeds = torch.cat([prompt_embeds, text_embeds], dim=0)
        else:  # embeddings only
            embeds = prompt_embeds
    else:  # text only (legacy mode)
        input_ids = tokenize(prompt_text)
        embeds = model.embed(input_ids)

    return model.forward(embeds)

Effort: roughly one person-day (estimate)
Additional load: negligible (a single concatenation per request)
Risks: low (existing endpoints are untouched, preserving backward compatibility)
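The "one concatenation" claim can be exercised standalone. Here is a dependency-free toy in which plain lists stand in for tensors (the embedding table and token ids are made up):

```python
hidden_size = 4

# Toy embedding table standing in for model.embed
embedding_table = {
    1: [0.1] * hidden_size,
    2: [0.2] * hidden_size,
    3: [0.3] * hidden_size,
}


def assemble_embeds(prompt_embeds=None, text_ids=None):
    """Mirror of the proposed pipeline, up to (but not including) the forward pass."""
    text_embeds = [embedding_table[t] for t in (text_ids or [])]
    return (prompt_embeds or []) + text_embeds  # the single new operation


full = assemble_embeds(
    prompt_embeds=[[0.0] * hidden_size] * 5,  # 5 virtual tokens
    text_ids=[1, 2, 3],                       # pretend-tokenized prompt_text
)
# full has 8 rows: 5 virtual tokens followed by 3 text tokens
```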

Use Cases

Example 1: P-tuning for Classification

import torch
import torch.nn as nn
from transformers import AutoTokenizer

import deepseek

# Tokenizer for labels
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-chat")

# Prompt encoder lives on the client side (2 layers, ~1M parameters)
encoder = nn.Sequential(
    nn.Linear(768, 1536),
    nn.ReLU(),
    nn.Linear(1536, 768)
)

# 5 virtual tokens (trainable), optimized together with the encoder
virtual_tokens = nn.Parameter(torch.randn(5, 768))
optimizer = torch.optim.Adam(list(encoder.parameters()) + [virtual_tokens], lr=1e-4)

# Training loop
for text, label in dataset:
    # Tokenize the label to get its length
    label_tokens = tokenizer.encode(label, add_special_tokens=False)

    # Generate prompt embeddings
    prompt_embeds = encoder(virtual_tokens)

    # Send to DeepSeek along with the text
    response = deepseek.CompletionsFromEmbeds.create(
        model="deepseek-chat",
        prompt_embeds=prompt_embeds.tolist(),
        prompt_text=text,
        max_tokens=len(label_tokens)  # generate exactly as many tokens as the label
    )

    # Compute loss, update encoder (locally!). Caveat: the API returns text,
    # not differentiable logits, so loss_fn must be a gradient-free surrogate
    # (e.g. a zeroth-order estimate) unless the endpoint also returns logprobs.
    loss = loss_fn(response.text, label)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # DeepSeek got paid for 5 + count_tokens(text) + len(label_tokens) tokens

Example 2: Prefix Prompts (p-tuned model inference)

# After training, use the learned prefix for inference
prefix_embeds = encoder(virtual_tokens).detach()  # [5, 768] - trained for a specific task

# Apply to each query
for user_query in queries:
    response = deepseek.CompletionsFromEmbeds.create(
        model="deepseek-chat",
        prompt_embeds=prefix_embeds.tolist(),
        prompt_text=user_query,
        max_tokens=100
    )

Example 3: Controlled Generation via Embedding Manipulation

import numpy as np

# Want to influence generation style
base_prompt = "Write a summary of 'War and Peace'"

# Get "neutral" embeddings (assumes an embeddings endpoint that returns
# per-token vectors of size hidden_size)
neutral_embeds = np.array(deepseek.Embeddings.create(
    model="deepseek-chat",
    input=base_prompt
).embeddings)  # [num_tokens, 768]

# Mix in a "conciseness" direction (hypothetical steering vector)
concise_vector = load_concise_vector()  # [768]
concise_embeds = neutral_embeds + concise_vector * 0.1  # broadcast over tokens

# Generate
response = deepseek.CompletionsFromEmbeds.create(
    model="deepseek-chat",
    prompt_embeds=concise_embeds.tolist()
)
# The result should be shorter than usual

Example 4: Custom Adapters for Enterprise

# Enterprise trains task-specific adapters for multiple tasks
sentiment_prefix = load_prefix("sentiment_analysis")  # [5, 768]
summarization_prefix = load_prefix("summarization")  # [8, 768]
translation_prefix = load_prefix("en_to_fr")  # [6, 768]

# Use appropriate prefix for each request
for document, task in incoming_requests:
    prefix = {
        "sentiment": sentiment_prefix,
        "summary": summarization_prefix,
        "translate": translation_prefix
    }[task]
    
    response = deepseek.CompletionsFromEmbeds.create(
        model="deepseek-chat",
        prompt_embeds=prefix.tolist(),
        prompt_text=document,
        max_tokens=200
    )

Target Audience

| Category | Scenario | Value Proposition | Willingness to Pay |
| --- | --- | --- | --- |
| Researchers | P-tuning, prefix prompts | Save on GPU costs | ✅ High |
| ML Engineers | Rapid prototyping | No need to deploy models | ✅ Medium |
| Students/Universities | Coursework, theses | Access to large models | ⚠️ Limited |
| Startups | Adaptation experiments | Low entry barrier | ✅ Medium |
| Enterprise | Custom adapters | Multiple tasks with shared infrastructure | ✅ High |

Security

  1. No access to weights — only forward pass via API
  2. Rate limiting — same as regular requests
  3. Dimension validation — embeddings must match model's hidden_size
  4. Backward compatibility — existing clients unaffected
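Points 2 and 3 amount to a few cheap guards before the forward pass. A sketch (the cap of 256 virtual tokens is an invented placeholder, not a proposed limit):

```python
def check_prompt_embeds(prompt_embeds, hidden_size, max_virtual_tokens=256):
    """Reject malformed or oversized embedding inputs before inference."""
    if not prompt_embeds:
        raise ValueError("prompt_embeds must contain at least one vector")
    if len(prompt_embeds) > max_virtual_tokens:
        raise ValueError(f"at most {max_virtual_tokens} virtual tokens allowed")
    for i, vec in enumerate(prompt_embeds):
        if len(vec) != hidden_size:
            raise ValueError(f"vector {i}: expected size {hidden_size}, got {len(vec)}")
        if not all(isinstance(x, (int, float)) for x in vec):
            raise ValueError(f"vector {i} contains non-numeric values")
    return len(prompt_embeds)  # number of virtual tokens, for billing
```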

Why This Benefits DeepSeek

| Aspect | Advantage |
| --- | --- |
| Technical | Minimal development effort |
| Commercial | New customers (research segment), with increased revenue as a consequence |
| Marketing | First to market; differentiation from competitors |
| Strategic | Research community loyalty |

Competitor Comparison

| Provider | Text API | Embedding-input API | P-tuning via API |
| --- | --- | --- | --- |
| OpenAI | ✅ | ❌ (search only) | ❌ |
| Anthropic | ✅ | ❌ | ❌ |
| Google | ✅ | ❌ | ❌ |
| DeepSeek (current) | ✅ | ❌ | ❌ |
| DeepSeek (proposed) | ✅ | ✅ | ✅ |

Conclusion

This minimal change opens up significant possibilities for the community while fully preserving DeepSeek's business model. Researchers gain access to large models without expensive hardware; DeepSeek is paid for the same compute those users would otherwise run locally.

Technically: simple
Commercially: profitable
Strategically: timely
