GitHub – microsoft/fara


Fara-7B is Microsoft's first agentic small language model (SLM), designed specifically for computer use. With only 7 billion parameters, Fara-7B is an ultra-compact computer use agent (CUA) that achieves state-of-the-art performance within its size class and is competitive with larger, more resource-intensive agentic systems.

Try Fara-7B locally as follows (see Installation for detailed instructions):

# 1. Clone repository
git clone https://github.com/microsoft/fara.git
cd fara

# 2. Setup environment
python3 -m venv .venv 
source .venv/bin/activate
pip install -e .
playwright install

Then in one process, host the model:

vllm serve "microsoft/Fara-7B" --port 5000 --dtype auto 

You can then do an iterative query with:

fara-cli --task "whats the weather in new york now"

Hint: you may need to add --tensor-parallel-size 2 to the vllm command above if you run out of memory.
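`vllm serve` exposes an OpenAI-compatible HTTP API, so the hosted model can also be queried directly. A minimal sketch of building such a request (the endpoint path and payload shape follow the OpenAI chat convention; fara-cli layers the agent's own prompt and tool schema on top of this):

```python
import json

# Hypothetical sketch: build a raw OpenAI-style chat request for the model
# served by `vllm serve "microsoft/Fara-7B" --port 5000` above.
def build_chat_request(task: str, model: str = "microsoft/Fara-7B") -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": task}],
        "max_tokens": 512,
    }

payload = build_chat_request("whats the weather in new york now")
body = json.dumps(payload)
# POST `body` to http://localhost:5000/v1/chat/completions with a
# Content-Type: application/json header to receive a completion.
```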

What makes Fara-7B unique?

Unlike traditional chat models that generate text-based responses, Fara-7B operates the computer interface, mouse and keyboard, to perform multi-step tasks on behalf of users. The model:

  • is visually driven – it views webpages and performs actions such as scrolling, typing, and clicking directly on predicted coordinates
  • interacts the way humans do – no accessibility trees or separate parsing models are needed
  • enables on-device deployment – its compact 7B parameter size reduces latency and improves privacy, since user data stays local
  • completes tasks efficiently – only ~16 steps on average per task compared to ~41 for comparable models
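Coordinate-based actions of this kind could be dispatched to Playwright (which the repo already installs) roughly as follows; the action dicts here are illustrative, not Fara-7B's actual output schema:

```python
# Illustrative sketch: execute a CUA-style, coordinate-based action using
# Playwright's mouse/keyboard API on a `page` object.
def execute_action(page, action: dict) -> str:
    kind = action["kind"]
    if kind == "click":
        # Click directly at the model's predicted pixel coordinates.
        page.mouse.click(action["x"], action["y"])
        return f"clicked ({action['x']}, {action['y']})"
    if kind == "type":
        # Type into whatever element currently has focus, like a user would.
        page.keyboard.type(action["text"])
        return f"typed {action['text']!r}"
    if kind == "scroll":
        # Positive dy scrolls down; no accessibility tree is consulted.
        page.mouse.wheel(0, action["dy"])
        return f"scrolled {action['dy']}px"
    raise ValueError(f"unknown action kind: {kind}")
```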

Fara-7B is trained using a novel synthetic data generation pipeline built on the Magentic-One multi-agent framework, with 145K trajectories covering different websites, task types, and difficulty levels. The model is based on Qwen2.5-VL-7B and trained with supervised fine-tuning.

Fara-7B can automate everyday web tasks including:

  • Searching for information and summarizing results
  • Filling out forms and managing accounts
  • Booking travel, movie tickets, and restaurant reservations
  • Shopping and comparing prices between retailers
  • Finding job postings and real estate listings

Fara-7B achieves state-of-the-art results on several web agent benchmarks, outperforming both comparably sized models and larger systems:

| Model | Params | WebVoyager | Online-M2W | DeepShop | WebTailBench |
|---|---|---|---|---|---|
| **SoM Agents** | | | | | |
| SoM Agent (GPT-4o-0513) | – | 90.6 | 57.7 | 49.1 | 60.4 |
| SoM Agent (o3-mini) | – | 79.3 | 55.4 | 49.7 | 52.7 |
| SoM Agent (GPT-4o) | – | 65.1 | 34.6 | 16.0 | 30.8 |
| GLM-4.1V-9B-Thinking | 9B | 66.8 | 33.9 | 32.0 | 22.4 |
| **Computer Use Models** | | | | | |
| OpenAI computer-use-preview | – | 70.9 | 42.9 | 24.7 | 25.7 |
| UI-TARS-1.5-7B | 7B | 66.4 | 31.3 | 11.6 | 19.5 |
| Fara-7B | 7B | 73.5 | 34.1 | 26.2 | 38.4 |

Table: Online agent evaluation results showing success rates (%) across four web benchmarks. Results are averaged over 3 runs.

WebTailBench: a new benchmark for real-world web tasks

We are releasing WebTailBench, a new evaluation benchmark focused on 11 real-world task types that are underrepresented or missing in existing benchmarks. The benchmark consists of 609 tasks across these categories; the first 8 segments test single skills or objectives (usually on the same website), while the remaining 3 evaluate more difficult multi-step or cross-site tasks.

WebTailBench detailed results

| Task Segment | # Tasks | SoM GPT-4o-0513 | SoM o3-mini | SoM GPT-4o | GLM-4.1V-9B | OAI comp-use | UI-TARS-1.5 | Fara-7B |
|---|---|---|---|---|---|---|---|---|
| **Single-site tasks** | | | | | | | | |
| Shopping | 56 | 62.5 | 71.4 | 38.1 | 31.0 | 42.3 | 41.1 | 52.4 |
| Flights | 51 | 60.1 | 39.2 | 11.1 | 10.5 | 17.6 | 10.5 | 37.9 |
| Hotels | 52 | 68.6 | 56.4 | 31.4 | 19.9 | 26.9 | 35.3 | 53.8 |
| Restaurants | 52 | 67.9 | 59.6 | 47.4 | 32.1 | 35.9 | 22.4 | 47.4 |
| Activities | 80 | 70.4 | 62.9 | 41.7 | 26.3 | 30.4 | 9.6 | 36.3 |
| Ticketing | 57 | 58.5 | 56.7 | 37.4 | 35.7 | 49.7 | 30.4 | 38.6 |
| Real Estate | 48 | 34.0 | 17.4 | 20.1 | 16.0 | 9.0 | 9.7 | 23.6 |
| Jobs/Careers | 50 | 49.3 | 44.0 | 32.7 | 22.7 | 20.7 | 20.7 | 28.0 |
| **Multi-step tasks** | | | | | | | | |
| Shopping List (2 items) | 51 | 66.0 | 62.7 | 17.0 | 7.8 | 34.0 | 20.9 | 49.0 |
| Comparison Shopping | 57 | 67.3 | 59.1 | 27.5 | 22.8 | 1.2 | 8.8 | 32.7 |
| Compositional Tasks | 55 | 51.5 | 39.4 | 26.7 | 17.0 | 10.3 | 9.1 | 23.0 |
| **Overall** | | | | | | | | |
| Macro Average | 609 | 59.7 | 51.7 | 30.1 | 22.0 | 25.3 | 19.9 | 38.4 |
| Micro Average | 609 | 60.4 | 52.7 | 30.8 | 22.4 | 25.7 | 19.5 | 38.4 |

Table: Detailed WebTailBench results across all 11 segments. Success rates (%) are averaged over 3 independent runs. Fara-7B achieves the highest overall performance among computer-use models.
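For reference, the macro average in the table weights each segment equally, while the micro average weights each task equally, so large segments count more. A toy illustration of the difference (made-up numbers, not the benchmark's):

```python
# Toy segments: (success rate in %, number of tasks in the segment).
segments = [(60.0, 50), (30.0, 100)]

# Macro average: mean of per-segment rates; each segment counts once.
macro = sum(rate for rate, _ in segments) / len(segments)

# Micro average: task-weighted mean; each task counts once.
micro = sum(rate * n for rate, n in segments) / sum(n for _, n in segments)

print(macro)  # 45.0
print(micro)  # 40.0
```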

Coming soon:

  • Task verification pipeline for LLM-as-judge evaluation
  • Official human annotations for WebTailBench (in partnership with Browserbase)

Evaluation infrastructure

Our evaluation setup leverages:

  1. Playwright – a cross-browser automation framework that provides the browser environment
  2. Abstract web agent interface – allows plugging a model from any source into the evaluation environment
  3. Fara agent class – the reference implementation for running Fara models
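The abstract interface can be pictured as a minimal contract between the harness and any model; a hypothetical sketch (class and method names are illustrative, and may differ from the repo's actual code):

```python
from abc import ABC, abstractmethod

class WebAgent(ABC):
    """Hypothetical contract: the eval harness feeds an observation
    (screenshot, URL, page metadata) and receives the next action."""

    @abstractmethod
    def next_action(self, observation: dict) -> dict:
        ...

class EchoAgent(WebAgent):
    # Trivial stand-in model: terminates immediately, useful for
    # smoke-testing the harness without a real model endpoint.
    def next_action(self, observation: dict) -> dict:
        return {"action": "terminate", "status": "success"}

agent = EchoAgent()
print(agent.next_action({"url": "https://www.bing.com"}))
# {'action': 'terminate', 'status': 'success'}
```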

Note: Fara-7B is an experimental release designed to invite practical exploration and feedback from the community. We recommend running it in a sandboxed environment, monitoring its execution, and avoiding domains with sensitive data or high risk.


Install packages using uv or pip:

pip install -e .

or

uv pip install -e .

Then install the Playwright browser:

playwright install


Recommended: the easiest way to get started is to use Azure Foundry hosting, which doesn't require any GPU hardware or model downloads. Alternatively, you can self-host with vLLM if you have the GPU resources available.

Azure Foundry Hosting (recommended)

Deploy Fara-7B on Azure Foundry without the need to download weights or manage GPU infrastructure.

To set up:

  1. Deploy the Fara-7B model on Azure Foundry and get your endpoint URL and API key
  2. Add your endpoint details to a file in the existing endpoint_configs/ directory (example configurations are already provided):
# Edit one of the existing config files or create a new one
# endpoint_configs/fara-7b-hosting-ansrz.json (example format):
{
    "model": "Fara-7B",
    "base_url": "https://your-endpoint.inference.ml.azure.com/",
    "api_key": "YOUR_API_KEY_HERE"
}
  3. Run the Fara agent:
fara-cli --task "how many pages does wikipedia have" --start_page "https://www.bing.com"

That’s it! No GPU or model downloads required.
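A sketch of how such a config file might be loaded by a client script (field names match the example above; the helper itself is illustrative, not the repo's actual loader):

```python
import json
from pathlib import Path

def load_endpoint_config(path: str) -> dict:
    # Read the JSON config and sanity-check the fields shown above.
    cfg = json.loads(Path(path).read_text())
    for key in ("model", "base_url", "api_key"):
        if key not in cfg:
            raise KeyError(f"missing required field: {key}")
    return cfg
```

A client would then use `base_url` and `api_key` to talk to the deployed endpoint.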

If you have access to GPU resources, you can self-host Fara-7B using vLLM. This requires a GPU machine with sufficient VRAM.

To start the vLLM server, simply run:

vllm serve "microsoft/Fara-7B" --port 5000 --dtype auto 

Run the test script to see Fara in action:

fara-cli --task "how many pages does wikipedia have" --start_page "https://www.bing.com" --endpoint_config endpoint_configs/azure_foundry_config.json [--headful] [--downloads_folder "/path/to/downloads"] [--save_screenshots] [--max_rounds 100] [--browserbase]

In a self-hosting scenario, point --endpoint_config at endpoint_configs/vllm_config.json, which references the vLLM server started above.

If you set --browserbase, export your Browserbase API key and project ID as environment variables.

Initializing Browser...
Browser Running... Starting Fara Agent...
##########################################
Task: how many pages does wikipedia have
##########################################
Running Fara...


Thought #1: To find the current number of Wikipedia pages, I'll search for the latest Wikipedia page count statistics.
Action #1: executing tool 'web_search' with arguments {"action": "web_search", "query": "Wikipedia total number of articles"}
Observation#1: I typed 'Wikipedia total number of articles' into the browser search bar.

Thought #2: Wikipedia currently has 7,095,446 articles.
Action #2: executing tool 'terminate' with arguments {"action": "terminate", "status": "success"}
Observation#2: Wikipedia currently has 7,095,446 articles.

Final Answer: Wikipedia currently has 7,095,446 articles.

Enter another task (or press Enter to exit): 
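The transcript above follows a thought/action/observation loop that ends on a terminate action. A schematic sketch of that control flow (function and field names are illustrative, not the fara package's API):

```python
def run_agent(policy, tools, task: str, max_rounds: int = 100):
    """Ask the policy for an action, execute it, feed the observation back,
    and repeat until the policy terminates or the step budget runs out."""
    history = [{"role": "task", "content": task}]
    for step in range(1, max_rounds + 1):
        action = policy(history)                       # Thought + Action
        if action["action"] == "terminate":
            return action.get("final_answer"), step    # Final Answer
        observation = tools[action["action"]](action)  # Observation
        history.append({"role": "observation", "content": observation})
    return None, max_rounds  # step budget exhausted without terminating
```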

We provide a harness in webeval/ to reproduce our results on WebVoyager and OnlineMind2Web. Agent evaluations on live websites present unique challenges due to day-to-day changes. We implement a number of measures to ensure reliable and comparable evaluations:

Browserbase integration
We use Browserbase to manage browser session hosting, enabling reliable browser instance management.

Time-sensitive task updates
Tasks in benchmarks like WebVoyager may become stale or impossible. We:

  • Removed ~48 impossible tasks from the original WebVoyager benchmark
  • Updated ~50 tasks with future dates to keep them achievable
  • Example: “Find a hotel in Bali from January 1st to January 4th, 2024” becomes “Find a hotel in Bali from January 1st to January 4th, 2026”
  • Our updated WebVoyager benchmark is available at webeval/data/webvoyager/WebVoyager_data_08312025.jsonl
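The date refresh can be pictured as a simple rewrite; a toy sketch (the actual benchmark updates were curated per task, not applied blindly):

```python
import re

# Push an explicit past year in a task string forward so the task stays
# achievable. The year values here are just the example's.
def refresh_year(task: str, old: int = 2024, new: int = 2026) -> str:
    # \b guards keep longer numbers (e.g. "20245") untouched.
    return re.sub(rf"\b{old}\b", str(new), task)

print(refresh_year("Find a hotel in Bali from January 1st to January 4th, 2024"))
# Find a hotel in Bali from January 1st to January 4th, 2026
```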

Environmental error handling
Browser errors (connection drops, page timeouts) are handled carefully:

  • Trajectories are retried up to 5 times when environmental errors occur
  • Completed but incorrect trajectories are never retried
  • Each retry starts with a fresh browser session, with no retained state

Step budget
Each trajectory is limited to a maximum of 100 actions in all online benchmarks. Trajectories that exceed this budget without terminating are counted as incorrect.
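The retry policy above can be summarized in code; a sketch under the stated rules (the exception and result types are illustrative):

```python
class BrowserEnvError(Exception):
    """Illustrative stand-in for environmental failures (timeouts, drops)."""

def evaluate_task(run_trajectory, max_retries: int = 5):
    # Retry only on environmental errors; a completed-but-incorrect
    # trajectory is a valid sample and is never rerun.
    for _ in range(max_retries):
        try:
            # Each attempt is assumed to get a fresh browser session.
            return run_trajectory()  # final, whether success or failure
        except BrowserEnvError:
            continue
    return {"score": 0.0, "aborted": True}  # all retries failed
```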

WebEval Package Installation

conda create --name fara_webeval python=3.12
conda activate fara_webeval

# Install fara package
pip install -e .

# Install autogen submodule
git submodule update --init --recursive
cd autogen/python/packages
pip install -e autogen-core
pip install -e autogen-ext

# Install webeval
cd webeval
pip install -e .

# Install playwright
playwright install

Go to the scripts directory:

cd webeval/scripts

Make sure you set a valid OpenAI GPT-4o endpoint in endpoint_configs_gpt4o/dev to run the WebVoyager LLM-as-judge!

Option 1: Self-hosted vLLM

python webvoyager.py --model_url /path/where/you/want/to/download/model/ --model_port 5000 --eval_oai_config ../endpoint_configs_gpt4o/dev/ --out_url /path/to/save/eval/files --device_id 0,1 --processes 1 --run_id 1 --max_rounds 100

Option 2: Azure Foundry deployment

Deploy Fara-7B to a Foundry endpoint, then put the endpoint URL and key into a JSON file under endpoint_configs/:

python webvoyager.py --model_endpoint ../../endpoint_configs/ --eval_oai_config ../endpoint_configs_gpt4o/dev/ --out_url /path/to/save/eval/files --processes 1 --run_id 1_endpoint --max_rounds 100
  • We use the same LLM-as-judge prompt and model (GPT-4o) as WebVoyager, hence the --eval_oai_config argument
  • Set --browserbase for browser session management (requires exporting the API key and project ID environment variables)
  • Avoid overloading a single vLLM deployment with more than ~10 concurrent processes due to known issues
  • Debugging output is written to fara/webeval/scripts/stdout.txt

Analysis of evaluation results

evaluation output structure

The evaluation results are stored under --out_url in folders organized by:

  • model name
  • dataset
  • username
  • run id

Example path:

/runs/WebSurfer-fara-100-max_n_images-3/fara-7b//WebVoyager_WebVoyager_data_08312025.jsonl/

Each evaluation folder contains:

  • gpt_eval/ – LLM-as-judge evaluation results
  • traj/ – per-task trajectory subdirectories including:
    • final_answer.json (e.g., Amazon--1_final_answer.json); a missing final answer indicates an abort or step-budget exceedance
    • scores/gpt_eval.json – LLM judge score
    • web_surfer.log – history of actions and errors
    • screenshot_X.png – screenshots captured before each action
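Given that layout, aborted or over-budget runs can be spotted by the absence of a final answer file; a sketch (directory names follow the structure above, the helper name is hypothetical):

```python
from pathlib import Path

def find_aborted(traj_root: str) -> list:
    """Return task dirs under traj/ that contain no *final_answer.json."""
    aborted = []
    for task_dir in sorted(Path(traj_root).iterdir()):
        # A task directory with no final answer file never terminated cleanly.
        if task_dir.is_dir() and not list(task_dir.glob("*final_answer.json")):
            aborted.append(task_dir.name)
    return aborted
```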

Use the analysis notebook to compute metrics:

cd webeval/scripts/analyze_eval_results/
jupyter notebook analyze.ipynb

The notebook:

  • Identifies trajectories that were aborted mid-run and the reasons for the aborts
  • Computes the average score over non-aborted trajectories
  • Distinguishes between aborted trajectories (errors during sampling) and completed trajectories (with terminate() calls or step budget exceeded)

To re-run failed tasks, execute the evaluation script again with the same run_id and username – already completed (non-aborted) tasks will be skipped.
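The core metric reduction can be sketched as follows (record fields are illustrative, not the notebook's actual schema):

```python
def summarize(records: list) -> dict:
    # Aborted runs (sampling errors) are excluded from the success rate;
    # completed runs count whether they terminated or hit the step budget.
    done = [r for r in records if not r["aborted"]]
    rate = sum(r["score"] for r in done) / len(done) if done else 0.0
    return {"n_aborted": len(records) - len(done), "success_rate": rate}

print(summarize([
    {"score": 1.0, "aborted": False},
    {"score": 0.0, "aborted": False},
    {"score": 0.0, "aborted": True},
]))
# {'n_aborted': 1, 'success_rate': 0.5}
```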

Example WebVoyager GPT eval result:
{
  "score": 1.0,
  "gpt_response_text": "To evaluate the task, we need to verify if the criteria have been met:\n\n1. **Recipe Requirement**: A vegetarian lasagna recipe with zucchini and at least a four-star rating.\n\n2. **Search and Results**:\n   - The screenshots show that the search term used was \"vegetarian lasagna zucchini.\"\n   - Among the search results, \"Debbie's Vegetable Lasagna\" is prominently featured.\n   \n3. **Evaluation of the Recipe**:\n   - Rating: \"Debbie's Vegetable Lasagna\" has a rating of 4.7, which satisfies the requirement of being at least four stars.\n   - The presence of zucchini in the recipe is implied through the search conducted, though the screenshots do not explicitly show the ingredients list. However, the result response confirms the match to the criteria.\n\nGiven the information provided, the task seems to have fulfilled the requirement of finding a vegetarian lasagna recipe with zucchini and a four-star rating or higher. \n\n**Verdict: SUCCESS**"
}

If you use Fara in your research, please cite our work:



