Full documentation: click the Zread link below.
An ASP.NET Core-based API Provider specifically designed to run AI models in ONNX format using ONNX Runtime, focusing on high performance and reliable hardware optimization.
- High Performance: Machine learning model inference optimized directly by ONNX Runtime.
- Ready-to-use REST API: Seamless integration with external applications via HTTP/JSON endpoints.
- Extensible Architecture: Designed to handle various ONNX model series (specifically tailored for Qwen).
- Auto Hardware Detection: Intelligent optimization based on the user's hardware specifications.
- .NET 10.0 SDK (or a later compatible version).
- Python (optional): for running additional test scripts such as debug_request.py.
IMPORTANT: Downloading models from the ONNX Community repository is highly recommended, because the codebase is already aligned with their configuration and tokenizer files.
Recommended Model: onnx-community/Qwen3.5-4B-ONNX (Versions: q4f16, q4, or int8).
- decoder_model_merged_q4f16.onnx (& .onnx_data if > 2GB)
- encoder_model_merged_q4f16.onnx (& .onnx_data if > 2GB)
- tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, special_tokens_map.json
- Optional: vision_encoder_q4f16.onnx (for Vision/Image support).
Important
Remember to update the model folder root path in appsettings.json.
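Before pointing appsettings.json at a model folder, it can help to confirm that the expected files are actually there. The following is a minimal sketch in Python (the language of the optional test scripts), not part of the project: the function name `check_model_folder` is hypothetical, the file names are the q4f16 ones listed above, and it assumes all files sit directly in the folder, which may not match the layout the project expects.

```python
from pathlib import Path

# File names for the q4f16 variant, as listed in this README.
REQUIRED_FILES = [
    "decoder_model_merged_q4f16.onnx",
    "encoder_model_merged_q4f16.onnx",
    "tokenizer.json",
    "tokenizer_config.json",
    "vocab.json",
    "merges.txt",
    "special_tokens_map.json",
]
OPTIONAL_FILES = ["vision_encoder_q4f16.onnx"]  # only needed for image input


def check_model_folder(root):
    """Return (missing_required, missing_optional) for a model folder.

    Assumption: every file lives directly in `root`. Adjust the paths
    if your download keeps the .onnx files in a subfolder.
    """
    root = Path(root)
    missing_required = [n for n in REQUIRED_FILES if not (root / n).is_file()]
    missing_optional = [n for n in OPTIONAL_FILES if not (root / n).is_file()]
    return missing_required, missing_optional
```

Calling `check_model_folder("/path/to/model")` returns two lists; an empty first list means every required file was found. The sketch does not check for the companion .onnx_data files, since those only exist for models larger than 2GB.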
To achieve maximum performance:
- Share the HardwareDetector.cs, Program.cs, and Qwen35inferenceengine.cs files with an AI (Claude/Gemini/ZAI).
- Provide your hardware specifications (GPU/CPU/RAM).
- Ask the AI to optimize the initialization parameters to match your specific hardware for peak efficiency.
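To make the hardware description easy to paste alongside those files, the CPU and RAM details can be collected with a short script. This is a rough sketch using only the Python standard library; the function name `collect_hardware_specs` is hypothetical, RAM detection relies on POSIX `os.sysconf` (so it reports `None` on Windows), and GPU details are not detected and must be added by hand.

```python
import json
import os
import platform


def collect_hardware_specs():
    """Gather basic CPU/RAM facts using only the standard library."""
    specs = {
        "os": platform.system(),
        "architecture": platform.machine(),
        "cpu": platform.processor() or "unknown",
        "logical_cpus": os.cpu_count(),
        "ram_gib": None,  # filled in below where the OS exposes it
    }
    try:
        # POSIX only: page size * number of physical pages = total RAM in bytes.
        total = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        specs["ram_gib"] = round(total / 2**30, 1)
    except (AttributeError, ValueError, OSError):
        pass  # not available here (e.g. Windows): report RAM manually
    return specs


# Print a JSON summary that can be pasted into the conversation.
print(json.dumps(collect_hardware_specs(), indent=2))
```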
This project has been extensively tested on server-grade hardware (Xeon v4) and stably handles a 64,000-token context window.
- Long Context: Supports up to 64K tokens with minimal performance degradation.
- Efficiency: The hybrid architecture ensures consistent prefill speeds even at large context scales.
- Beyond Industry Standards: Unlike llama.cpp or Ollama, which often become unstable or very slow on CPU for contexts exceeding 8K, this project stably processes a full 64K context on hardware from 2016.
Tip
Read the full performance report and benchmark statistics at: BENCHMARK.md
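When sizing requests against the tested context window, the arithmetic can be sketched as follows. This is an illustration only: the helper `remaining_tokens` is hypothetical, it assumes prompt tokens and generated tokens share the same 64,000-token window, and it says nothing about how the server itself enforces limits.

```python
CONTEXT_WINDOW = 64_000  # tokens, the tested figure quoted in this README


def remaining_tokens(prompt_tokens, max_new_tokens, context_window=CONTEXT_WINDOW):
    """Tokens left after the prompt and the requested generation.

    A negative result means the request would exceed the window.
    """
    if prompt_tokens < 0 or max_new_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return context_window - prompt_tokens - max_new_tokens
```

For example, a 60,000-token prompt with 2,000 new tokens leaves 2,000 tokens of headroom, while a 63,000-token prompt with the same generation length overshoots the window by 1,000.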
- Clone Repository:
  git clone https://github.com/USERNAME/SFCoreServerProviderOnnxRuntime.git
  cd SFCoreServerProviderOnnxRuntime
- Restore Dependencies:
  dotnet restore SFCore.OnnxRuntimeProvider.Api
- Run Application:
  dotnet run --project SFCore.OnnxRuntimeProvider.Api
Note
Default URL: http://localhost:5034 (or as configured in appsettings.json).
Once running, access the Swagger UI at:
http://localhost:<PORT>/swagger
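A quick way to confirm the server is up is to request the Swagger page from a script. The sketch below uses only the Python standard library; the helper names `swagger_url` and `is_reachable` are hypothetical, and the only path it relies on is the `/swagger` one given above. The API's own endpoints are not listed in this README, so none are assumed here.

```python
import urllib.error
import urllib.request


def swagger_url(base_url):
    """Build the Swagger UI address from the server's base URL."""
    return base_url.rstrip("/") + "/swagger"


def is_reachable(url, timeout=5.0):
    """Return True if the URL answers with a successful HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False
```

With the default port, `is_reachable(swagger_url("http://localhost:5034"))` should return `True` once the application has started; replace the base URL if appsettings.json configures a different one.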
We welcome contributions! Please check CONTRIBUTING.md for guidelines.
Distributed under the MIT License.