GitHub - Neul-labs/fast-litellm: High-performance Rust Acceleration For LiteLLM

License: MIT

High-performance Rust acceleration for LiteLLM: 2-20x performance improvements for token counting, routing, rate limiting, and connection management.

Fast LiteLLM is a drop-in Rust acceleration layer for LiteLLM that provides significant performance improvements:

5-20x faster token counting with batch processing
3-8x faster request routing with lock-free data structures
4-12x faster rate limiting with async support
2-5x faster connection management

Built with PyO3 and Rust, it integrates seamlessly with existing LiteLLM code with zero configuration required.

import fast_litellm  # Automatically accelerates LiteLLM
import litellm

# All LiteLLM operations now use Rust acceleration where available
response = litellm.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)

That's it! Just import fast_litellm before litellm and acceleration is applied automatically.

Acceleration uses PyO3 to create Python extensions from Rust code:

┌─────────────────────────────────────────────────────────────┐
│ LiteLLM Python Package                                      │
├─────────────────────────────────────────────────────────────┤
│ fast_litellm (Python Integration Layer)                    │
│ ├── Enhanced Monkeypatching                                │
│ ├── Feature Flags & Gradual Rollout                        │
│ ├── Performance Monitoring                                 │
│ └── Automatic Fallback                                     │
├─────────────────────────────────────────────────────────────┤
│ Rust Acceleration Components (PyO3)                        │
│ ├── core               (Advanced Routing)                   │
│ ├── tokens             (Token Counting)                    │
│ ├── connection_pool    (Connection Management)             │
│ └── rate_limiter       (Rate Limiting)                     │
└─────────────────────────────────────────────────────────────┘

Zero configuration: works automatically on import
Production safe: built-in feature flags, monitoring, and automatic fallback to Python
Performance monitoring: real-time metrics and optimization recommendations
Gradual rollout: support for canary deployments and percentage-based feature rollouts
Thread safe: lock-free data structures using DashMap for concurrent operations
Type safe: includes full Python type hints and type stubs

Component	Baseline	Optimized	Use case
Token counting	5-10x	15-20x	Batch processing, reference management
Request routing	3-5x	6-8x	Load balancing, model selection
Rate limiting	4-8x	10-12x	Request throttling, quota management
Connection pooling	2-3x	4-5x	HTTP reuse, latency reduction
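Speedups like these depend heavily on workload, so it can be worth measuring them against your own traffic. The sketch below is one way to do that with only the standard library. `measure_speedup`, `slow_count`, and `fast_count` are hypothetical names for illustration and are not part of fast_litellm; the two workloads are stand-ins for a call made with and without acceleration.

```python
import timeit
from typing import Callable


def measure_speedup(
    baseline: Callable[[], object],
    candidate: Callable[[], object],
    number: int = 1000,
    repeat: int = 5,
) -> float:
    """Return baseline_time / candidate_time, using the best of `repeat` runs."""
    base = min(timeit.repeat(baseline, number=number, repeat=repeat))
    cand = min(timeit.repeat(candidate, number=number, repeat=repeat))
    return base / cand


# Stand-in workloads (hypothetical). Replace them with calls into your own
# code, for example the same token-counting call with and without acceleration.
TEXT = "hello world " * 200


def slow_count() -> int:
    # Builds an intermediate list of words before counting them.
    return len([word for word in TEXT.split()])


def fast_count() -> int:
    # Counts separators directly; every word in TEXT is followed by a space.
    return TEXT.count(" ")


if __name__ == "__main__":
    ratio = measure_speedup(slow_count, fast_count)
    print(f"speedup: {ratio:.1f}x")
```

Taking the minimum over several repeats reduces noise from other processes; the ratio will still vary between machines and input sizes.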

Fast LiteLLM works out of the box with zero configuration. For advanced use cases, you can configure the behavior via environment variables:

# Disable specific features
export FAST_LITELLM_RUST_ROUTING=false

# Gradual rollout (10% of traffic)
export FAST_LITELLM_BATCH_TOKEN_COUNTING=canary:10

# Custom configuration file
export FAST_LITELLM_FEATURE_CONFIG=/path/to/config.json

See the configuration guide for all options.
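To make the `canary:10` style of value concrete, here is a minimal sketch of how a percentage-based flag could be parsed and applied per request. It is not fast_litellm's implementation: the accepted grammar (`true`, `false`, `canary:<percent>`), the `FlagState` type, and the hash-based bucketing are all assumptions chosen for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FlagState:
    enabled: bool
    percentage: float  # share of traffic that gets the feature, 0-100


def parse_flag(value: str) -> FlagState:
    """Parse 'true', 'false', or 'canary:<percent>' (assumed grammar)."""
    text = value.strip().lower()
    if text == "false":
        return FlagState(False, 0.0)
    if text == "true":
        return FlagState(True, 100.0)
    if text.startswith("canary:"):
        pct = float(text.split(":", 1)[1])
        if not 0.0 <= pct <= 100.0:
            raise ValueError(f"percentage out of range: {pct}")
        return FlagState(True, pct)
    raise ValueError(f"unrecognized flag value: {value!r}")


def is_enabled_for(state: FlagState, request_id: str) -> bool:
    """Decide deterministically whether this request falls inside the rollout."""
    if not state.enabled:
        return False
    # A stable hash keeps a given request ID in the same bucket across processes.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
    return bucket < state.percentage


if __name__ == "__main__":
    import os

    # The default of "true" when the variable is unset is an assumption.
    raw = os.environ.get("FAST_LITELLM_BATCH_TOKEN_COUNTING", "true")
    state = parse_flag(raw)
    print(state, is_enabled_for(state, "request-123"))
```

Hashing the request ID rather than drawing a random number means retries of the same request take the same code path, which makes canary behavior easier to debug.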

Python 3.8 or higher
LiteLLM

Rust is not required for installation: prebuilt wheels are available for all major platforms.

To contribute or build from source:

Prerequisites:

Python 3.8+
Rust toolchain (1.70+)
maturin for building Python extensions

To install:

git clone https://github.com/neul-labs/fast-litellm.git
cd fast-litellm

# Install maturin
pip install maturin

# Build and install in development mode
maturin develop

# Run unit tests
pip install pytest pytest-asyncio
pytest tests/

Fast LiteLLM includes comprehensive integration tests that run LiteLLM’s test suite with acceleration enabled:

# Setup LiteLLM for testing
./scripts/setup_litellm.sh

# Run LiteLLM tests with acceleration
./scripts/run_litellm_tests.sh

# Compare performance (with vs without acceleration)
./scripts/compare_performance.py

This ensures that Fast LiteLLM does not break any LiteLLM functionality. See the test guide for details.

See our contribution guide for more information.

When you import fast_litellm, it automatically patches performance-critical functions of LiteLLM with Rust implementations while maintaining full compatibility with the Python API.
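The general pattern of patching a function while keeping an automatic fallback can be sketched in a few lines of plain Python. This is an illustration of the technique only, not fast_litellm's code: `patch_with_fallback`, `demo_lib`, and `fast_count_words` are hypothetical names, and `demo_lib` stands in for a library module rather than LiteLLM's real API.

```python
import functools
import logging
import types
from typing import Any, Callable

logger = logging.getLogger("patch_sketch")


def patch_with_fallback(
    target: Any, name: str, fast_impl: Callable[..., Any]
) -> Callable[..., Any]:
    """Replace target.<name> with fast_impl, falling back to the original on error.

    Returns the original callable so the patch can be undone later.
    """
    original = getattr(target, name)

    @functools.wraps(original)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        try:
            return fast_impl(*args, **kwargs)
        except Exception:
            # Any failure in the fast path is logged and hidden from the caller.
            logger.warning("fast path for %s failed; using original", name)
            return original(*args, **kwargs)

    setattr(target, name, wrapper)
    return original


# Stand-in for a library module (hypothetical; not LiteLLM's real API).
demo_lib = types.SimpleNamespace(count_words=lambda text: len(text.split()))


def fast_count_words(text: str) -> int:
    # Pretend accelerated implementation that only supports ASCII input.
    if not text.isascii():
        raise ValueError("unsupported input")
    return text.count(" ") + 1 if text else 0


original_count_words = patch_with_fallback(demo_lib, "count_words", fast_count_words)

print(demo_lib.count_words("hello fast world"))  # served by the fast path
print(demo_lib.count_words("héllo wörld"))       # falls back to the original
```

One general property of this kind of patching is that a name bound earlier with `from module import name` keeps pointing at the unpatched function, so patches are most reliable when applied before other code imports the target.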

We welcome contributions! Please see our contribution guide.

This project is licensed under the MIT License – see the license file for details.
