twerkmeister/tokenflood: Tokenflood is a load testing framework for simulating arbitrary loads on instruction-tuned LLMs

TokenFlood is a load testing tool for instruction-tuned LLMs that lets you run arbitrary load profiles without requiring specific prompt and response data.
Define the desired prompt length, prefix length, output length, and request rate, and TokenFlood simulates this workload for you.

TokenFlood makes it easy to find out how latency changes when using different providers, hardware, quantization, or prompt and output lengths.

TokenFlood uses LiteLLM under the hood and supports all providers covered by LiteLLM.

Caution

TokenFlood can generate high costs if poorly configured and used with pay-per-token services. Make sure you only test workloads that are within a reasonable budget. See the section on safety measures below for more information.

Use cases:

  1. Load testing of self-hosted LLMs.
  2. Estimating the impact of hardware, quantization, and inference optimizations on latency, throughput, and cost.
  3. Assessing intraday latency variations of hosted LLM providers for your load types.
  4. Assessing and choosing a hosted LLM provider before going into production with them.

Example: assessing the effects of prompt adaptations in advance

Here’s an example of exploring the effects of prompt parameters on latency and throughput. The following graphs show various load scenarios. Together they show the impact of hypothetical changes to the prompt parameters.

The first graph represents the base case, our current prompt parameters: ~3000 input tokens, of which ~1000 are a common prefix that can be cached, and ~60 output tokens.

base-case-latency

In the graph, you can see the mean latency and the 50th, 90th, and 99th percentile latency. These percentile lines represent the latency below which 50%, 90%, and 99% of LLM requests fall. When designing latency-sensitive systems, it is important to have an understanding of the distribution, not just the average. At 3 requests per second, our system gives us a latency of approximately 1720 ms for the 50th percentile, 2700 ms for the 90th percentile, and 2950 ms for the 99th percentile. This means that 50% of requests returned below 1720 ms, 90% returned below 2700 ms, and 99% returned below 2950 ms.
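
As a quick illustration of how such percentile values are computed, here is a minimal sketch with made-up latency numbers (not data from the run above):

import numpy as np

# Toy latency sample in milliseconds (made-up values, for illustration only).
latencies_ms = np.array([1500, 1620, 1700, 1720, 1800, 2100, 2400, 2700, 2800, 2950])

# 50%, 90%, and 99% of requests finished at or below these values.
p50, p90, p99 = np.percentile(latencies_ms, [50, 90, 99])
print(p50, p90, p99)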

Suppose that in our hypothetical prompt, we can rearrange things a bit to increase the number of tokens at the beginning that are always the same, and thus increase the prefix-cached portion. If we change things we may have to spend some extra time re-tuning the prompt, so we would like to know how much such a change would improve latency before doing the work.

Let’s run the test by increasing the number of prefix tokens from 1000 to 2000:

more-prefix-cached-token-latency

We’re seeing a meaningful improvement to about 1100ms for the 50th percentile, 1340ms for the 90th percentile, and 1450ms for the 99th percentile at 3 requests per second.

Another option to reduce latency could be to reduce the number of output tokens. Maybe our current hypothetical prompt uses a JSON output format that is very verbose and requires lots of tokens for all the special characters. That format may be more expressive than the task actually requires, so how about testing the benefits of a more compact output format before implementing the change?

Let’s start with the base case again and reduce the number of output tokens from 60 to 30:

low-output-token-latency

We see a nice improvement here again, going down to 840 ms for the 50th, 1270 ms for the 90th, and 1900 ms for the 99th percentile at 3 requests per second.

In the end, we may wonder what the combined effect of both changes is, or whether one of them alone already provides most of the benefit. So we apply both changes, increasing the number of prefix tokens to 2000 and decreasing the number of output tokens to 30.

low-output-over-prefix-latency

Indeed, the improvements add up nicely in our setup with dedicated and limited compute: we reach a 50th percentile latency of 570 ms, a 90th percentile of 770 ms, and a 99th percentile of 870 ms.

Here is a brief extract of the data in tabular form:

| scenario | input tokens | prefix tokens | output tokens | 50th percentile latency (@3 req/s) | 90th percentile latency (@3 req/s) | 99th percentile latency (@3 req/s) |
|---|---|---|---|---|---|---|
| base case | 3038 | 1000 | 60 | 1720 ms | 2700 ms | 2950 ms |
| more prefix tokens | 3038 | 2000 | 60 | 1100 ms | 1340 ms | 1450 ms |
| small output | 3038 | 1000 | 30 | 840 ms | 1270 ms | 1900 ms |
| both changes | 3038 | 2000 | 30 | 570 ms | 770 ms | 870 ms |
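
For a quick sense of scale, here is the relative p50 improvement of each scenario over the base case, computed from the numbers in the table above (illustrative arithmetic only):

# p50 latencies at 3 req/s taken from the table above (in ms).
base_p50 = 1720
scenarios = {"more prefix tokens": 1100, "small output": 840, "both changes": 570}

for name, p50 in scenarios.items():
    # e.g. both changes: ~67% lower p50 latency than the base case
    print(f"{name}: {(base_p50 - p50) / base_p50:.0%} lower p50 latency")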

🛠️ Professional Services 🛠️

If you are looking for paid professional assistance to:

  • Optimize your LLM accuracy, latency, throughput, or cost
  • Fine-tune open models for your use case
  • Design and build custom AI systems

Feel free to contact me at thomas@werkmeister.me or on LinkedIn.

For a quick start, make sure vLLM is installed and serve a small model:

pip install vllm
vllm serve HuggingFaceTB/SmolLM-135M-Instruct

Next, create basic config files and do the first run:

# This creates config files for a tiny first run: run_suite.yml and endpoint.yml
tokenflood init
# Afterwards you can inspect those files and run them
tokenflood run run_suite.yml endpoint.yml

Finally, in the results folder you should find your run folder, which contains:

  • A graph showing the latency quantiles at the different request rates, as well as the network latency (latency_quantiles.png)
  • Raw data points collected from the LLM calls (llm_requests.csv)
  • Raw data points collected from measuring network latency (network_latency.csv)
  • A summary file that contains a lot of information about the run (summary.yml)
  • The run suite configuration used for the run (run_suite.yml)
  • The endpoint configuration used for the run (endpoint_spec.yml)
  • An error log (errors.csv)

The endpoint spec file lets you set the target of the load test. TokenFlood uses LiteLLM under the hood and supports all providers covered by LiteLLM.

Here you can see an example endpoint spec file from the quick start:

provider: hosted_vllm
model: HuggingFaceTB/SmolLM-135M-Instruct
base_url: http://127.0.0.1:8000/v1
api_key_env_var: null
deployment: null
extra_headers: {}

Explanation of parameters:

  • provider: The provider parameter used by LiteLLM; it determines how to actually interact with the endpoint, as different providers have different APIs.
  • model: The specific model to use on a given endpoint.
  • base_url: This is important if you’re self-hosting or using an endpoint in a specific region of a provider.
  • api_key_env_var: Name of the environment variable to use as an API key. Specifying this allows you to manage multiple API keys for the same provider for different regions without changing env files, e.g. AZURE_KEY_FRANKFURT and AZURE_KEY_LONDON.
  • deployment: Required for some providers like Azure.
  • extra_headers: May be useful in selecting models for some providers.

TokenFlood passes all these parameters directly to LiteLLM's completion call. To get a better understanding, it is best to take a look at the official documentation of the LiteLLM completion call.
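
For illustration, here is roughly how the quick-start endpoint spec maps onto a LiteLLM completion call. This is a sketch, not TokenFlood's actual code; MY_API_KEY is a hypothetical environment variable name, and the real parameter handling may differ:

import os
import litellm

response = litellm.completion(
    model="hosted_vllm/HuggingFaceTB/SmolLM-135M-Instruct",  # provider prefix + model
    api_base="http://127.0.0.1:8000/v1",                     # base_url from the spec
    api_key=os.environ.get("MY_API_KEY"),                    # resolved from api_key_env_var (hypothetical name)
    extra_headers={},                                         # extra_headers from the spec
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=16,
)
print(response.choices[0].message.content)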

Here are example endpoint configurations for some other providers:

Self-hosted vLLM server

provider: hosted_vllm
model: meta-llama/Llama-3.1-8B-Instruct
base_url: http://127.0.0.1:8000/v1

OpenAI

provider: openai
model: gpt-4o-mini

Environment variables: OPENAI_API_KEY

AWS Bedrock

provider: bedrock
model: anthropic.claude-3-sonnet-20240229-v1:0

Environment variables: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION_NAME

AWS SageMaker Inference Endpoint

provider: sagemaker_chat
model: your-sagemaker-endpoint

Environment variables: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION_NAME

Azure OpenAI

provider: azure
deployment: gpt-4o
model: gpt-4o
api_version: 2024-06-01
api_base: https://my-azure-url.openai.azure.com/

Environment variables: AZURE_API_KEY

Gemini

provider: gemini
model: gemini-2.5-flash-lite-preview-09-2025

Environment variables: GEMINI_API_KEY

Anthropic

provider: anthropic
model: claude-3-5-sonnet-20240620

Environment variables: ANTHROPIC_API_KEY

With a run suite you define the specific tests you want to run. Each test can have multiple phases with different request rates. All phases share the same length in seconds and the load types being sent.

Here is the run suite that is created for you when you call tokenflood init:

name: ripple
requests_per_second_rates:  # Defines the phases with the different request rates
- 1.0
- 2.0
test_length_in_seconds: 10  # each phase is 10 seconds long
load_types:                 # This run suite has two load types with equal weight
- prompt_length: 512        # prompt length in tokens
  prefix_length: 128        # prompt prefix length in tokens
  output_length: 32         # output length in tokens
  weight: 1                 # sampling weight for this load type
- prompt_length: 640
  prefix_length: 568
  output_length: 12
  weight: 1
percentiles:                # the latency percentiles to report
- 50
- 90
- 99
input_token_budget: 100000  # the maximum number of input tokens this test is allowed to use - prevents any load configuration that would use more than this from starting
output_token_budget: 10000  # the maximum number of output tokens this test is allowed to use - prevents any load configuration that would use more than this from starting
error_limit: 0.3            # the fraction of errors that are acceptable for the last 30 requests
task:                       # The task tokenflood uses to generate a lot of tokens which we can truncate using the max token parameters - makes sure we do not produce too few tokens!
  task: 'Task: Count up to 10000 naming each individual number like this: 1 2 3 4'
token_set:                  # The 1-token strings tokenflood uses to fill up the prompt and prefix up to the desired length   
  tokens:
  - ' A'
  - ' B'
  - ' C'
  - ' D'
  - ' E'
  - ' F'
  - ' G'
  - ' H'
  - ' I'
  - ' J'
  - ' K'
  - ' L'
  - ' M'
  - ' N'
  - ' O'
  - ' P'
  - ' Q'
  - ' R'
  - ' S'
  - ' T'
  - ' U'
  - ' V'
  - ' W'
  - ' X'
  - ' Y'
  - ' Z'
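
To get a feel for the budget parameters, here is a rough back-of-the-envelope estimate of what this run suite consumes. This is not necessarily TokenFlood's exact estimator, just the obvious arithmetic on the values above:

# Values taken from the generated run suite above.
rates = [1.0, 2.0]            # requests_per_second_rates
phase_seconds = 10            # test_length_in_seconds
load_types = [                # (prompt_length, output_length, weight)
    (512, 32, 1),
    (640, 12, 1),
]

total_requests = sum(rate * phase_seconds for rate in rates)      # 30 requests
total_weight = sum(w for _, _, w in load_types)
avg_prompt = sum(p * w for p, _, w in load_types) / total_weight  # 576 tokens
avg_output = sum(o * w for _, o, w in load_types) / total_weight  # 22 tokens

est_input_tokens = total_requests * avg_prompt                    # ~17,280
est_output_tokens = total_requests * avg_output                   # ~660

# Both estimates stay well below input_token_budget (100000) and
# output_token_budget (10000), so this configuration would be allowed to start.
print(est_input_tokens, est_output_tokens)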

TokenFlood does not require specific prompt data to run tests. Instead, it only requires metadata about the prompt and task: prompt length, prefix length, and output length, all counted in tokens. This allows quick testing of alternative configurations and loads. Changing token counts across load types is a matter of seconds, as opposed to adjusting the implementation and revisiting system prompts. Additionally, you can ensure that all models and configurations receive exactly the desired load profile, allowing direct comparisons between them.

TokenFlood uses a set of strings that each correspond to a single token in most tokenizers, such as a space followed by a capital letter. Sampling from this set of single-token strings generates the TokenFlood input prompt. The portion covering the defined prefix length is non-random, so it can be prefix-cached. Finally, a task is appended that reliably produces a longer answer. In combination with setting the maximum completion tokens for generation, TokenFlood achieves the desired output length.
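
A minimal sketch of this prompt-construction idea (not TokenFlood's actual implementation, and assuming each string in the token set really is a single token for the target tokenizer):

import random

TOKEN_SET = [f" {c}" for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"]
TASK = "Task: Count up to 10000 naming each individual number like this: 1 2 3 4"

def build_prompt(prompt_length: int, prefix_length: int) -> str:
    # The prefix part is deterministic, so repeated requests share it and can hit a prefix cache.
    prefix = "".join(TOKEN_SET[i % len(TOKEN_SET)] for i in range(prefix_length))
    # The rest of the prompt is sampled randomly, so it cannot be cached.
    filler = "".join(random.choice(TOKEN_SET) for _ in range(prompt_length - prefix_length))
    # The task at the end reliably produces a long answer, which is then truncated
    # via the request's max-tokens parameter to hit the desired output length.
    return prefix + filler + " " + TASK

print(build_prompt(prompt_length=512, prefix_length=128)[:60])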

This type of heuristic testing produces reliable data because the processing time of non-reasoning LLMs depends only on the length of the inputs and outputs and on any caching mechanisms involved.

Failure modes of the heuristic

Heuristic load testing comes with the risk of not fully achieving the desired token counts for specific models. If this happens, TokenFlood will warn you during the run whenever a request deviates from the expected input or output token length by more than 10%. At the end of the run, you will also be warned if the average deviation is more than 10% from the expected token count. In the summary file of a run, you can review the absolute and relative deviations.
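
The deviation check itself is simple; here is a hypothetical helper mirroring the 10% threshold described above (not TokenFlood's internal API):

def relative_deviation(expected: int, measured: int) -> float:
    return abs(measured - expected) / expected

expected_output, measured_output = 30, 41
deviation = relative_deviation(expected_output, measured_output)
if deviation > 0.10:
    print(f"warning: output length deviates by {deviation:.0%} from the expected token count")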

Important

You can specify the length of the prefix; however, whether or not the prefix is actually used depends on the specific endpoint and its configuration. Some providers, like OpenAI, only start using prefix caching when your total prompt length exceeds 1024 tokens. Additionally, LiteLLM does not always report the use of prefix caching: when using vLLM as an inference server, it never reports any cached tokens, even though a large difference in latency can be seen between using and not using prefix caching. Due to this problem, TokenFlood does not currently warn when the desired prefix tokens differ from the measured ones.

Using TokenFlood can lead to high token spend. To prevent negative surprises, TokenFlood has several safety measures in place:

  1. TokenFlood always estimates the token usage of a test in advance and asks you to confirm the estimate before starting the test.
  2. There are additional run suite variables that determine the maximum allowed input and output token budgets for the test. A test whose token usage estimate exceeds those limits will not be started.
  3. TokenFlood will not start the run if the first warm-up request fails, for example, due to API key misconfiguration.
  4. TokenFlood will terminate a run if the error rate over the last 30 requests exceeds the configured error limit (30% in the generated run suite); see the sketch after this list.
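
A sketch of the error-limit rule from point 4, using a rolling window over the 30 most recent requests (hypothetical helper, not TokenFlood's internal code):

from collections import deque

ERROR_LIMIT = 0.3   # matches error_limit in the generated run suite
WINDOW = 30

recent_failures: deque[bool] = deque(maxlen=WINDOW)  # True = request failed

def should_abort(request_failed: bool) -> bool:
    recent_failures.append(request_failed)
    if len(recent_failures) < WINDOW:
        return False  # not enough data yet
    return sum(recent_failures) / len(recent_failures) > ERROR_LIMIT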

Nevertheless, these measures do not provide complete protection against misconfiguration. Always be careful when using TokenFlood.

We welcome contributions! If you want to add new features, fix bugs, or improve the documentation:

  1. Fork the repository

  2. Install the project including dev dependencies:

    poetry install --all-groups
    
  3. Create a feature branch:

    git checkout -b feature/my-improvement
    
  4. Make your changes and add tests if applicable

  5. Run linting and the tests locally to make sure everything works properly.

  6. Submit a pull request with a clear description of your changes.

If you are planning a major change (for example, new test type or provider integration), please open an issue to discuss it first.


