ClioAI/kw-sdk: Knowledge Work SDK

A Python SDK for building performant AI agents for knowledge work: research, analysis, writing, and decision-making tasks that require iteration, verification, and structured thinking.

Why is knowledge work different from code?

Code has a tight feedback loop: write code → run tests → fix errors → repeat. The solution space is limited – there is usually only one correct answer, and automated tests tell you whether you have found it.

Knowledge work is fundamentally different. The solution space is vast and underspecified. A “market analysis” can be a two-paragraph summary or a 50-page deep dive. A “strategy recommendation” may emphasize cost, speed, risk, innovation, or any combination. There is no test suite that returns pass/fail.

Our approach: since knowledge tasks lack natural verification, we synthesize it with a rubric. A rubric defines what "good" looks like before execution begins (see the sketch after this list), enabling:

  • Self-verification: the agent checks its own output against explicit criteria
  • Iterative refinement: failed verification triggers targeted remediation
  • Transparent assessment: humans can audit the rubric and the verification process
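
For example, a rubric for the market-analysis task above might look like this (an illustrative sketch only; in practice rubrics are auto-generated or user-provided, as shown in the mode examples below):

## Scope (40 points)
- [ ] Covers both demand and supply dynamics
- [ ] Cites at least three primary data sources

## Rigor (30 points)
- [ ] Quantifies key claims instead of asserting them
- [ ] States assumptions and limitations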

This SDK implements a self-verifying agentic loop that brings structure to the inherently open-ended nature of knowledge work. The agent can search the web, read and write files, execute code, generate artifacts, and ask the user for clarification – all coordinated through an orchestrator that verifies its own output.

It started as a tool for running RL training on knowledge tasks. I’m open-sourcing it because:

  1. Knowledge workflows are less explored. Most AI tooling focuses on code. But knowledge work—research, analysis, strategy, writing—is where most professionals spend their time. The basics of building these systems are not yet well established.

  2. This can be a useful building block. If you’re building products that involve AI doing research, making recommendations, or drafting documents, this validation loop can save you weeks of iteration.

  3. Models still struggle with verification. The self-check step is the weakest link. If this pattern is adopted, open-source model providers could train specifically on rubric-based verification – improving the entire ecosystem.

I would rather see these ideas spread than remain proprietary.

┌─────────────────────────────────────────────────────────────┐
│  1. BRIEF CREATION                                          │
│     → Formalize task into structured requirements           │
└─────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────┐
│  2. RUBRIC CREATION                                         │
│     → Generate evaluation criteria (hidden from executor)   │
└─────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────┐
│  3. TASK EXECUTION                                          │
│     → Orchestrator delegates to subagents, runs searches    │
└─────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────┐
│  4. VERIFICATION                                            │
│     → Check answer against rubric → PASS or FAIL            │
└─────────────────────────────────────────────────────────────┘
                          ↓ (if FAIL)
                    ← ITERATE ←
                          ↓ (if PASS)
┌─────────────────────────────────────────────────────────────┐
│  5. SUBMISSION                                              │
│     → Submit verified answer                                │
└─────────────────────────────────────────────────────────────┘
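
Each stage's artifact is surfaced on the result object (see the result fields documented below), so you can inspect the loop's outputs directly. A minimal sketch with an illustrative task:

from verif import RLHarness

harness = RLHarness(provider="gemini")
result = harness.run_single("Analyze the competitive landscape for EV charging networks.")

print(result.brief)   # stage 1: structured requirements
print(result.rubric)  # stage 2: evaluation criteria
print(result.answer)  # stages 3-5: the verified, submitted answer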

As a dependency (recommended)

uv pip install git+https://github.com/ClioAI/kw-sdk.git

or add to your pyproject.toml:

dependencies = [
    "verif @ git+https://github.com/ClioAI/kw-sdk.git",
]

Or, for local development:

git clone https://github.com/ClioAI/kw-sdk.git
cd kw-sdk
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"

Create a .env file:

GEMINI_API_KEY=your_gemini_key
OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
from verif import RLHarness

harness = RLHarness(provider="gemini")  # or "openai" or "anthropic"
result = harness.run_single("Analyze the economic impact of remote work on urban real estate.")

print(result.answer)  # The analysis
print(result.rubric)  # Auto-generated evaluation criteria

The SDK provides various modes optimized for different types of knowledge tasks:

| Mode | Best for | Rubric strategy |
|------|----------|-----------------|
| standard | General research and analysis | Auto-created during execution |
| plan | Complex multi-step tasks | User-provided or auto-generated |
| explore | Creative/divergent thinking | Quality checklist (no accuracy rubric) |
| iterate | Refining existing work | Uses the existing rubric + feedback |

| Provider | Config | Thinking control |
|----------|--------|------------------|
| Gemini | provider="gemini" | thinking_level: LOW / MEDIUM / HIGH |
| OpenAI | provider="openai" | reasoning_effort: low / medium / high |
| Anthropic | provider="anthropic" | thinking_budget: token count (default 10000) |
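
To set the thinking controls from this table, pass a ProviderConfig instead of a bare provider string (this mirrors the full configuration reference further down):

from verif import RLHarness, ProviderConfig

harness = RLHarness(
    provider=ProviderConfig(name="openai", reasoning_effort="high"),
)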

Standard mode: for general tasks. The orchestrator creates the brief and rubric automatically.

from verif import RLHarness

harness = RLHarness(provider="gemini", enable_search=True)

result = harness.run_single(
    "Compare carbon tax vs cap-and-trade for reducing industrial emissions."
)

print(result.answer)
print(result.rubric)  # Auto-generated

See: examples/standard_mode.py

Plan mode: for structured execution with explicit control over strategy.

from verif import RLHarness

harness = RLHarness(provider="gemini", enable_search=True)

PLAN = """
## Investigation Phase
1. Research incident postmortem best practices
2. Identify key sections for blameless postmortems

## Writing Phase
3. Write executive summary
4. Document timeline with timestamps
5. Describe root cause analysis
"""

RUBRIC = """
## Structure (40 points)
- [ ] Has executive summary
- [ ] Includes timeline with timestamps
- [ ] Contains root cause analysis

## Blameless Culture (30 points)
- [ ] No individual blame
- [ ] Uses "we" language
"""

result = harness.run_single(
    task="Write a postmortem for a 47-minute database outage.",
    mode="plan",
    plan=PLAN,
    rubric=RUBRIC,  # Optional - omit to auto-create
)

See: examples/plan_mode.py

Explore mode: for divergent thinking – generate many distinct perspectives. Unlike standard mode, explore does not optimize for a single "correct" answer; it maps the solution space.

How explore differs from standard:

  • No accuracy rubric. Standard mode creates a rubric to verify correctness. Explore uses a quality checklist – are the takes distinct? Do they cover different framings?
  • Divergence is enforced. Each take must state its assumptions and what could break it, exposing blind spots that a single answer would hide.
  • Volume over convergence. Standard mode iterates toward one verified answer. Explore produces N parallel takes that may contradict each other – that's the point.
from verif import RLHarness

harness = RLHarness(provider="gemini", enable_search=True)

result = harness.run_single(
    task="""Explore database architectures for a fintech handling 10K TPS 
    with strong consistency and multi-region deployment.""",
    mode="explore",
    num_takes=3,  # Generate 3 distinct approaches
)

# Result contains multiple takes separated by ===
takes = result.answer.split("===")
for i, take in enumerate(takes, 1):
    print(f"--- Approach {i} ---\n{take[:500]}...")

Each take includes:

  • The solution/recommendation
  • Assumptions: what must be true for it to work (for example, "Assumes budget for multi-region replication")
  • Counterfactuals: what would cause it to fail (for example, "breaks if latency requirements tighten to <10ms")

The output ends with a set-level gap analysis: what's missing from the full set of takes? This tells you which angles weren't covered – maybe every take assumed the same cloud provider, or none considered regulatory hurdles. The gaps are often more valuable than the takes themselves.

Use Explore when you’re not sure what the right question is, or when the “best” answer depends on unstated constraints.
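
If you want the gap analysis programmatically, one approach – hypothetical, assuming the set-level gap analysis arrives as the final ===-separated segment – is:

*takes, gap_analysis = result.answer.split("===")
print(gap_analysis.strip())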

See: examples/explore_mode.py

Iterate mode: to refine existing work based on user feedback.

# Initial execution
result = harness.run_single(task="Write a market analysis memo.")

# User provides feedback
iterate_result = harness.iterate(
    task="Write a market analysis memo.",
    answer=result.answer,
    rubric=result.rubric,
    feedback="Use 2024 data instead of 2023. Add executive summary.",
    rubric_update="Must address data residency requirements.",  # Optional
)

print(iterate_result.answer)  # Refined version

See: examples/iterate_workflow.py

Save execution state at each step. Resume from any checkpoint with optional feedback and rubric updates.

from verif import RLHarness

harness = RLHarness(provider="gemini", enable_search=True)

# Run with checkpointing
result = harness.run_single(
    "Analyze the power dynamics among Olympian gods.",
    checkpoint=True,
)

# List checkpoints
for snap_id, snap in harness.snapshots.items():
    print(f"{snap_id} (step {snap.step})")

# Resume from any checkpoint with new direction
resumed = harness.resume(
    checkpoint_id="",  # fill in a snap_id from the loop above
    feedback="Focus more on the Trojan War.",
    rubric_update="Must include analysis of divine intervention in the Iliad.",
)

See: test/test_checkpoint.py


Explore → Select → Execute

The most powerful pattern: brainstorm, choose the best approach, then implement.

# Stage 1: Explore multiple approaches
explore_result = harness.run_single(task=TASK, mode="explore", num_takes=3)
takes = explore_result.answer.split("===")

# Stage 2: Use an LLM to select the best approach
selector = GeminiProvider()  # assumes GeminiProvider is importable from verif
selection = selector.generate(f"Pick the best approach:\n{explore_result.answer}")

# Stage 3: Execute with the selected plan and rubric (parsed from `selection`)
final_result = harness.run_single(
    task=TASK,
    mode="plan",
    plan=selected_plan,
    rubric=selected_rubric,
)

See: examples/end_to_end_workflow.py


harness = RLHarness(
    provider="gemini",
    enable_search=True,  # Adds search_web tool
)

See: examples/standard_with_search.py

harness = RLHarness(
    provider="gemini",
    enable_bash=True,  # Adds search_files tool (ls, find, grep, cat)
)
from verif.executor import SubprocessExecutor

harness = RLHarness(
    provider="gemini",
    enable_code=True,
    code_executor=SubprocessExecutor("./artifacts"),
    artifacts_dir="./artifacts",
)

Code execution is stateful – variables persist across calls. Files saved to artifacts_dir are tracked and returned.
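
Because the interpreter state persists, a single task can build on earlier computations across multiple execute_code calls. An illustrative prompt (harness setup as in the snippet above):

result = harness.run_single(
    "Simulate 10,000 coin flips, then reuse those same samples to estimate variance."
)
# The orchestrator's execute_code calls share one interpreter, so the second
# step can reuse variables defined in the first.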

See: examples/with_code_execution.py

from verif import RLHarness, Attachment, Prompt

# Create attachment with preview
attachment = Attachment(
    content="/path/to/data.csv",
    mime_type="text/csv",
    name="data.csv",
    preview="col1,col2\n1,2\n3,4...",  # First N lines
)

# Build multimodal prompt
prompt: Prompt = [
    "Analyze the attached sales data and create a summary.",
    attachment,
]

result = harness.run_single(prompt)

See: examples/with_files.py

User Clarification (ask_user)

Enable interactive clarification when tasks are underspecified:

import threading
from verif import RLHarness, ProviderConfig

def on_event(entry, harness):
    if entry.entry_type == "user_question":
        question_id = entry.metadata["question_id"]
        questions = entry.metadata["questions"]
        
        # Generate or collect answers
        answers = {0: "B2B SaaS platform", 1: "$50,000 budget"}
        
        # Send response back (in a thread to not block)
        threading.Thread(
            target=lambda: harness.provider.receive_user_response(question_id, answers)
        ).start()

harness = RLHarness(
    provider="gemini",
    enable_ask_user=True,
    on_event=lambda e: on_event(e, harness),
)

result = harness.run_single("Create a project plan for my product launch.")

The orchestrator can call ask_user to request clarification. Verification is blocked until all pending questions are answered.

See: test/test_ask_user.py


from verif import RLHarness, HistoryEntry

def on_event(event: HistoryEntry):
    if event.entry_type == "tool_call":
        print(f"→ {event.content}")
    elif event.entry_type == "thought":
        print(f"💭 {event.content[:100]}...")

harness = RLHarness(
    provider="gemini",
    on_event=on_event,
    stream=True,           # Stream orchestrator output
    stream_subagents=True, # Stream subagent output
)

See: examples/with_streaming.py


from verif import RLHarness, ProviderConfig, CompactionConfig
from verif.executor import SubprocessExecutor

harness = RLHarness(
    # Provider: "gemini" | "openai" | "anthropic" | ProviderConfig
    provider=ProviderConfig(
        name="gemini",
        thinking_level="MEDIUM",  # Gemini: LOW | MEDIUM | HIGH
        # OR for OpenAI:
        # name="openai",
        # reasoning_effort="medium",  # low | medium | high
        # OR for Anthropic:
        # name="anthropic",
        # thinking_budget=10000,  # token budget for extended thinking
    ),
    
    # Tool Capabilities
    enable_search=True,     # Web search tool
    enable_bash=False,      # File system navigation
    enable_code=False,      # Python code execution
    enable_ask_user=False,  # User clarification tool
    
    # Code Execution (required if enable_code=True)
    code_executor=SubprocessExecutor("./artifacts"),
    artifacts_dir="./artifacts",
    
    # Execution Limits
    max_iterations=30,
    
    # Mode Selection
    default_mode="standard",  # "standard" | "plan" | "explore"
    
    # Pre-set Rubric (optional)
    rubric="1. Must be accurate\n2. Must cite sources",
    
    # Event Streaming
    on_event=lambda e: print(f"[{e.entry_type}] {e.content[:100]}"),
    stream=True,
    stream_subagents=True,
    
    # Context Compaction (for long tasks)
    compaction_config=CompactionConfig(
        enabled=True,
        threshold=0.8,        # Trigger at 80% context capacity
        keep_recent_turns=3,
    ),
)

result = harness.run_single(task)

result.task          # Original task text
result.answer        # Final submitted answer
result.rubric        # Evaluation rubric used
result.history       # List[HistoryEntry] - full execution trace
result.mode          # Mode used: "standard" | "plan" | "explore"
result.plan          # Plan (if plan mode)
result.brief         # Brief (if available)

# Get formatted history
print(harness.get_history_markdown())
print(harness.get_history_text())

# Access raw entries
for entry in result.history:
    print(f"[{entry.timestamp}] {entry.entry_type}: {entry.content[:100]}")

Tools available to the orchestrator

| Tool | Description | When available |
|------|-------------|----------------|
| create_brief | Formalize task requirements | standard, explore |
| create_rubric | Generate evaluation criteria | standard, plan |
| spawn_subagent | Delegate subtasks | all modes |
| search_web | Web search | enable_search=True |
| search_files | Read/search files (ls, find, grep, cat) | enable_bash=True |
| execute_code | Python REPL | enable_code=True |
| ask_user | Request clarification from the user | enable_ask_user=True |
| verify_answer | Check answer against rubric | standard, plan, iterate |
| verify_exploration | Quality checklist for explore | explore |
| submit_answer | Submit the final answer | all modes |


  • Computer-access subagent – attach a computer-use-capable subagent for GUI interactions (filling out forms, navigating apps, extracting data from web interfaces).
  • Multi-app workflows – work across the browser, spreadsheets, and documents at once.
  • Parallel verification – run multiple verification passes and take a consensus, reducing single-verifier bias.
  • Rubric quality scoring – meta-assessment: score the rubric itself before using it for verification, catching "always-pass" rubrics early.
  • Structured output from runs – return typed sections (executive summary, recommendations, evidence) instead of a single answer string.
  • Eval harness – systematic comparison across providers/modes/rubric strategies on benchmark task sets. run_eval exists but needs scoring and reporting.
  • Token usage tracking – surface per-run token counts by stage (brief, rubric, execute, verify) for cost analysis.
  • Mixed-model orchestration – use different models for the orchestrator vs. subagents (e.g., Opus for orchestration, Flash for search subagents). Currently a single provider handles both – a deliberate choice, since RL training benefits from a single policy – but routing cheaper subtasks to smaller models would yield significant cost savings in production.

¹ See TOOL_CALLING_GUIDE.md for the philosophy: skip the MCP server, use code as the tool interface. ² See EXTENSIONS.md to create custom modes and providers.

See examples/output/ for sample execution results.



If you are using this for RL training:

Expect a noisy signal. The reward signal for knowledge work is noisy; what works for one task type may fail for another.

Train selectively on the control plane. In my experience, training works best when you focus on:

  • Orchestrator outputs (tool calls, sequencing decisions)
  • Brief construction (task formalization)
  • Rubric creation (evaluation criteria)

Exclude subagent outputs, search results, and code execution from the training signal – even when they are generated by the same policy. The goal is to improve the orchestration and verification layers. Everything else is downstream: if the orchestrator gets better at decomposition and the rubric gets better at capturing intent, subagents benefit automatically.
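
A minimal sketch of carving the control-plane signal out of a run's trace (the entry_type values match the streaming example above; treat the exact set as an assumption):

# Keep only orchestrator control-plane entries as the training signal
CONTROL_PLANE = {"tool_call", "thought"}  # assumed subset of entry_type values
train_entries = [e for e in result.history if e.entry_type in CONTROL_PLANE]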

Verification is the bottleneck. Most of the training gains come from improving the verification step. A model that can accurately assess its own work against a rubric is more valuable than one that produces a slightly better first draft.


Self-verification is only as good as the model. The rubric is created by the same model that executes the task. If the model has blind spots, the rubric will have them too. This is a fundamental limit of self-verification.

External grounding happens during execution, not verification. If you need external validation (for example, checking facts against a database), you can provide your own rubric. But be aware: the verifier is intentionally limited – it has no access to search or the file system. The design assumes grounding happens during task execution (via search and subagents), not during verification. The verifier checks internal consistency against the rubric, not external truth.

Rubrics can be gamed. A sufficiently clever model can write a rubric that is easy to pass. This is why human review of rubrics matters for high-stakes tasks.
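
One lightweight mitigation (a sketch, not a built-in feature): take the auto-generated rubric from a draft run, have a human review or edit it, then re-run with the approved rubric pre-set, as the configuration reference above allows:

draft = harness.run_single(task)
approved_rubric = human_review(draft.rubric)  # hypothetical human review/edit step
reviewed = RLHarness(provider="gemini", rubric=approved_rubric)
final = reviewed.run_single(task)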

Context compaction requires a Gemini API key. Compaction (mid-run summarization of context to stay under the token limit) uses gemini-3-flash-preview regardless of the provider you choose. If you enable compaction with OpenAI or Anthropic as the orchestrator, you will still need a GEMINI_API_KEY. Free keys are available from Google AI Studio.
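
A setup like the following therefore needs both ANTHROPIC_API_KEY and GEMINI_API_KEY in your .env (fields as in the full configuration reference above):

from verif import RLHarness, CompactionConfig

# Anthropic orchestrator + compaction: the compaction summarizer still runs on Gemini
harness = RLHarness(
    provider="anthropic",
    compaction_config=CompactionConfig(enabled=True, threshold=0.8, keep_recent_turns=3),
)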


MIT


