After OpenClaw, autoresearch by Andrej Karpathy is the new cool kid on the block in LLM-land. The core idea is an optimization loop that continuously lets the LLM propose a changeset and verify whether it improves on the baseline. While Karpathy's code focused on optimizing the training of a small language model on a single GPU, the method generalizes to any optimization problem. There are variants of autoresearch applied to specific problems, like GPU kernels (autokernel), as well as generic versions for coding agents: pi-autoresearch and its clone autoresearch-claude-code.
Since the pi plugin was released first, I decided to give it a spin and use the opportunity to try pi for the first time as well.
For LLM access, I used the Kimi-K2.5 model via OpenRouter, with the API key budget explicitly set to $10.
In this post, we will explore how the autoresearch loop works using three different scenarios:
- algorithm optimization of a simple function
- running unit tests for a Java multi-module project
- optimizing test runtime for a golang library
How does the plugin work?
The plugin provides /autoresearch to trigger the optimization. Under the hood, a skill instructs the model to keep track of the optimization in autoresearch.md and uses autoresearch.sh to execute the verification step and capture guardrail metrics.
The instructions, guardrails, and metrics for the optimization can be defined in autoresearch.md.
If not provided, these will be inferred from context and prompt. The plugin then triggers an infinite optimization loop.
An optimization candidate is applied to the code, the verification procedure is run, and metrics are captured.
If the optimization run results in better metrics than the prior result, it's committed to git and taken as the base for further optimization.
The optimization loop can be interrupted at any time and resumed just from the autoresearch.md file.
This also poses an opportunity to adjust the file manually, if needed.
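The loop itself is conceptually simple. Here is a minimal Python sketch of the keep-or-revert cycle — my own illustration of the idea, not the plugin's actual code (the plugin additionally commits kept changes to git and persists its state to files):

```python
def autoresearch_loop(apply_change, revert_change, verify, candidates):
    """Greedy keep-or-revert optimization loop; a lower metric is better."""
    best = verify()  # baseline measurement
    log = [{"run": 0, "metric": best, "status": "baseline"}]
    for run, candidate in enumerate(candidates, start=1):
        apply_change(candidate)
        metric = verify()
        if metric < best:        # improvement: keep it as the new base
            best = metric
            status = "keep"
        else:                    # regression: roll back to the previous best
            revert_change(candidate)
            status = "revert"
        log.append({"run": run, "metric": metric, "status": status})
    return best, log

# Toy demo: candidates are deltas to a fake runtime; negative deltas help.
state = {"runtime_ms": 100}

def apply_delta(d):
    state["runtime_ms"] += d

def revert_delta(d):
    state["runtime_ms"] -= d

best, log = autoresearch_loop(apply_delta, revert_delta,
                              lambda: state["runtime_ms"], [-20, 10, -30])
print(best)  # 50
```

The real plugin drives this log into autoresearch.jsonl and resumes from autoresearch.md, but the keep-or-revert decision above is the heart of it.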
Experiment 1: Algorithm optimization
For my first exploration, I took a simple prime number finder loop and a test file verifying that the function works correctly as the baseline. Both were generated by an agent; the exact code matters less here, since we just want to understand how the autoresearch loop works.
I kicked off the optimization process via:
/autoresearch optimize the test runtime, adjust the algorithm as needed
The agent then checked whether an autoresearch.md plan was provided.
Since there was none, it came up with its own ideas and kicked off the work, correctly picking test suite runtime as the metric.
For each iteration, the agent outlines the idea, displays the changeset, runs the verification and summarizes the findings.
The intermediate output can be useful to follow the agent’s actions and to verify that the loop is configured as intended.
After a few iterations, the agent decided that it completed its work and printed a nice summary of the findings:
Autoresearch Complete! 🎉
Final Result: 48% improvement (0.31s → 0.16s)
### Optimizations Applied:
| # | Optimization | Impact |
|---|--------------|--------|
| 1 | Hybrid approach: Use Sieve of Eratosthenes for n≥100, trial division for smaller n | Major |
| 2 | Odd-only sieve: Track only odd numbers ≥3, halving memory and operations | Major |
| 3 | bytearray + slice assignment: Faster marking than list with element-by-element loops | Medium |
| 4 | 6k±1 wheel in is_prime: Skip multiples of 2 and 3 in divisor checks | Medium |
| 5 | Tight upper bound: Reduced margin from +10 to +5 for smaller allocations | Minor |
### Key Algorithmic Changes:
- Before: Trial division (O(n²)) for all n
- After: Sieve (O(n log log n)) for n≥100, optimized trial division for smaller n
### Performance at n=5000:
- Trial division: ~18 ms → ~0.77 ms (23× faster)
- Sieve: ~3.3 ms → ~0.77 ms (4× faster)
The optimizations have been committed to git and future ideas documented in autoresearch.ideas.md.
Then, the autoresearch plugin detected this scenario and kindly decided to ask the agent to give it another go:
Autoresearch loop ended (likely context limit). Resume the experiment loop → read autoresearch.md and git log for context. Check autoresearch.ideas.md for promising paths to explore. Prune stale/tried ideas. Be careful not to overfit to the benchmarks and do not cheat on the benchmarks.
After a few more iterations, the agent decided that it ran out of ideas:
Autoresearch Session Complete
Final Result: 48% improvement (0.31s → 0.16s)
### Summary of This Session
Attempted optimizations (all rejected):
| Optimization | Result | Reason |
|--------------|--------|--------|
| 30-wheel factorization | ❌ Slower (0.19s) | List indexing/modulo overhead |
| Cached prime list | ❌ Slower (0.23s) | Cache management overhead |
| Local variable lookups | ❌ Slower (0.19s) | No benefit for C-level math functions |
| Threshold 90 or 110 | ❌ Slower | n=100 is near-optimal |
Current best approach remains:
- Hybrid: trial division for n < 100, sieve for n ≥ 100
- Odd-only sieve with bytearray and slice assignment
- 6k±1 wheel in is_prime
- Tight upper bound (+5 margin)
Performance at n=5000: ~0.71ms (23× faster than original)
All findings documented in autoresearch.ideas.md. No further low-hanging fruit identified without adding complexity (numpy, Cython, etc.).
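The winning combination — a hybrid threshold at n=100, an odd-only bytearray sieve with slice assignment, and a 6k±1 wheel — can be sketched as follows. This is my reconstruction from the summary above, not the agent's exact code:

```python
def is_prime(n):
    """Trial division with the 6k±1 wheel: skip multiples of 2 and 3."""
    if n < 2:
        return False
    if n < 4:                       # 2 and 3 are prime
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    i = 5
    while i * i <= n:               # candidate divisors are 6k-1 and 6k+1
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True

def primes_up_to(limit):
    """Hybrid: trial division for small limits, odd-only sieve otherwise."""
    if limit < 100:
        return [n for n in range(2, limit + 1) if is_prime(n)]
    # Odd-only sieve: index i represents the odd number 2*i + 3.
    size = (limit - 1) // 2
    sieve = bytearray([1]) * size
    for i in range(int(limit ** 0.5) // 2 + 1):
        if sieve[i]:
            p = 2 * i + 3
            start = (p * p - 3) // 2
            # Slice assignment zeroes all odd multiples of p in one C-level op.
            sieve[start::p] = bytearray(len(sieve[start::p]))
    return [2] + [2 * i + 3 for i in range(size) if sieve[i]]
```

The slice assignment marks all multiples of a prime in a single C-level operation, which is where most of the gain over an element-by-element marking loop comes from.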
The autoresearch plugin includes a simple widget that shows the baseline and the optimization progress so far, expressed in the chosen metrics (e.g. test runtime in seconds). When the research loop is finished, this widget provides a nice summary of the actions:

Summary of the executed optimization runs
The experiment with 8 runs cost me $0.284 in API costs and yielded a 47% improvement in test runtime.
Experiment 2: Unit test execution optimization
For the next test, I chose riptide - a Java multi-module project implementing client-side request routing for Spring. Here, I wanted to optimize the unit test execution, as running the tests took about two minutes. The research loop was triggered with:
/autoresearch optimize the unit test runtime in this project; tests can be run with './mvnw clean test -Dmaven.build.cache.skipCache=true'
This verification command ensures that the test runs remain comparable regardless of code changes. Autoresearch decided on two metrics as guardrails for the optimization: total runtime in seconds and the number of tests executed. Quite smart, as this ensures the model cannot cheat by removing tests.
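A guardrail like this can be captured with a small parser over the build output. Here is a sketch of the idea — my own illustration, not the plugin's actual autoresearch.sh — assuming Surefire's standard "Tests run: …" summary lines and Maven's "Total time:" footer:

```python
import re

def parse_surefire(output):
    """Sum the per-module test counts from Surefire summary lines and grab
    the total build time, so a run can be compared against the baseline."""
    tests = failures = 0
    for m in re.finditer(
            r"Tests run: (\d+), Failures: (\d+), Errors: (\d+), Skipped: (\d+)",
            output):
        tests += int(m.group(1))
        failures += int(m.group(2)) + int(m.group(3))
    time_m = re.search(r"Total time:\s+([\d.:]+)", output)
    return {"tests": tests, "failures": failures,
            "total_time": time_m.group(1) if time_m else None}

sample = """\
[INFO] Tests run: 12, Failures: 0, Errors: 0, Skipped: 1
[INFO] Tests run: 30, Failures: 0, Errors: 0, Skipped: 0
[INFO] Total time:  01:53 min
"""
print(parse_surefire(sample))
```

Summing the per-module counts yields a single number the loop can compare against the baseline, which is exactly what makes the "delete the slow tests" shortcut detectable.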
The optimization loop studied the maven configuration and decided to leverage parallelization, batch mode, tune GC parameters, and drop plugins that in its opinion were not essential for running the tests. One could argue whether dropping code coverage calculation is a good choice, though it could be indeed skipped for local runs and kept only for CI builds. After a few runs, the test execution was extended to capture the execution time of individual tests, which revealed that a subset of tests had waiting times due to socket communication, as expected for the library.
After >30 experiments and $0.686 in API spend, we got a 38% improvement in test runtime.
Autoresearch Complete: 38-39% Improvement Achieved ✅
After 31 experiments, the test runtime has been reduced from 113.4s to ~70s (38-39% improvement).
### Final Configuration Summary
| File | Key Settings |
|------|--------------|
| .mvn/jvm.config | -XX:+UseParallelGC<br>-Xmx2g |
| .mvn/maven.config | -B<br>-T2C |
| pom.xml (Surefire) | argLine: -XX:+UseParallelGC -Xms512m -Xmx512m |
| autoresearch.sh | Skip flags: jacoco, duplicate-finder, checkstyle, spotbugs, pmd, enforcer |
### Results
- Baseline: 113.4s
- Best: 69.2s (39.0% improvement)
- Typical: 69-72s (~38% improvement)
- All tests pass: ✅ (except occasional flaky ChaosPluginTest)
### Optimization Breakdown
1. Maven parallel build (-T2C) → ~26%
2. Skip JaCoCo → ~14%
3. JVM opts + argLine → ~3%
4. Skip duplicate-finder → ~2%
5. Batch mode (-B) → ~0.4%
### Failed Experiments
- Compiler fork, Surefire forkCount>1, G1GC, 4GB heap, Smart builder, offline mode → all slower or cause failures
The optimization has reached a plateau. The remaining ~70s is dominated by actual test execution time across 20+ modules, which cannot be reduced through Maven configuration alone.
The image below shows all attempted optimization runs and paints a more realistic picture than the first experiment: besides discarded runs, there were optimizations that resulted in crashes or an inability to run the tests in the first place:

Summary of the executed optimization runs for riptide
As before, the harness attempted to proceed with the optimization job, but this time the execution was interrupted by a failure from OpenRouter, as I ran out of credits before the API key limit was reached.
Experiment 3: golang library
Next, I took go-keyring, a library providing a cross-platform keyring interface for Go, as my test subject. I kicked off the optimization loop with a rather lazy prompt:
/autoresearch try optimizing the test runtime and memory usage
I mainly wanted to see how easily the model would be led astray during the optimization process. After 18 runs and $0.516 in API tokens, the model had tried the following optimizations:

Summary of the executed optimization runs for go-keyring
Two key changes were:
- Replace regex with lookup table in shellescape.Quote - reduces allocations and removes regex dependency
- Add testing.Short() checks to skip OS-level tests in short mode - reduces test time by ~55%
While the first one may indeed be more efficient for the test data in the code, the second one is borderline cheating.
Another interesting behavior was the model deciding to add microbenchmarks for certain functions used in the code. This is likely a result of Go having built-in benchmark support in its test command. Throughout the optimization loop, the verification was extended to also run benchmarks:
Now let me run the benchmarks to see the performance.
run_experiment cd /Users/bartek/repos/go-keyring && go test -bench=BenchmarkQuote -benchmem -benchtime=0.3s -run=^$ ./internal/shellescape/ 2>&1 (timeout: 60s)
✓
wall: 3.0s (truncated)
… 8 more lines
BenchmarkQuoteLongSafe-12 19541719 19.25 ns/op 0 B/op 0 allocs/op
BenchmarkQuoteLongEscape-12 7272855 46.32 ns/op 80 B/op 1 allocs/op
PASS
ok github.com/zalando/go-keyring/internal/shellescape 2.768s
Excellent! The benchmarks show good performance. Now let me also run the regular tests to make sure everything still works.
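This micro-benchmarking move translates to other ecosystems as well. As a purely illustrative Python analogue (unrelated to go-keyring's actual Go code), timeit can compare a regex-based "does this string need quoting?" check against a set-based scan, mirroring the regex-vs-lookup-table change above; the names and character class here are my own assumptions:

```python
import re
import timeit

# Characters considered shell-safe (illustrative set, not go-keyring's).
UNSAFE = re.compile(r"[^\w@%+=:,./-]")
SAFE = frozenset(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789@%+=:,./-_")

def needs_quote_regex(s):
    """Regex variant: one compiled search per call."""
    return bool(UNSAFE.search(s))

def needs_quote_scan(s):
    """Lookup variant: membership test per character, no regex machinery."""
    return any(c not in SAFE for c in s)

s = "plain-file_name.txt"
t_regex = timeit.timeit(lambda: needs_quote_regex(s), number=100_000)
t_scan = timeit.timeit(lambda: needs_quote_scan(s), number=100_000)
print(f"regex: {t_regex:.3f}s  scan: {t_scan:.3f}s")
```

Which variant wins depends heavily on the language and input distribution — in Go the lookup table avoided allocations, while in CPython a compiled regex is often hard to beat — which is precisely why measuring inside the loop, rather than assuming, matters.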
Viewing results
When the optimization loop is stopped, the output is saved as autoresearch.jsonl, autoresearch.ideas.md and in git commits.
The JSONL file holds all the commit ids and basic metadata on each optimization run, and is the basis for the widget shown in the terminal and for the screenshots included in this post.
Let’s take a look at one example optimization commit. First, its metadata:
$ cat autoresearch.jsonl | grep d636c43 | jq .
{
  "run": 4,
  "commit": "d636c43",
  "metric": 0.865,
  "metrics": {},
  "status": "keep",
  "description": "Remove redundant slice operation in keyring_darwin.go - use string(out) instead of string(out[:])",
  "timestamp": 1774123478289,
  "segment": 0,
  "confidence": 2,
  "asi": {
    "hypothesis": "Redundant slice operation creates unnecessary overhead",
    "keyring_pkg_time": "0.560s vs 0.571s baseline",
    "change": "string(out[:]) -> string(out)"
  }
}
Next, the actual code change:
$ git show d636c43
[...]
diff --git a/keyring_darwin.go b/keyring_darwin.go
index 5e3ae1b..e1ea9e4 100644
--- a/keyring_darwin.go
+++ b/keyring_darwin.go
@@ -53,7 +53,7 @@ func (k macOSXKeychain) Get(service, username string) (string, error) {
 		return "", err
 	}
-	trimStr := strings.TrimSpace(string(out[:]))
+	trimStr := strings.TrimSpace(string(out))
 	// if the string has the well-known prefix, assume it's encoded
 	if strings.HasPrefix(trimStr, encodingPrefix) {
 		dec, err := hex.DecodeString(trimStr[len(encodingPrefix):])
Successful optimizations are stacked on top of one another, meaning that the optimization we looked at was applied on top of others:
$ git log d636c43
commit d636c43b02d580b335ddada3b6e6962cbfa95717
Date: Sat Mar 21 21:04:38 2026 +0100
Remove redundant slice operation in keyring_darwin.go - use string(out) instead of string(out[:])
Result: {"status":"keep","test_duration":0.865}
commit 80baf8a9901dbb4abb67607c533272aacbd130cc
Date: Sat Mar 21 21:03:29 2026 +0100
Replace regex with lookup table in shellescape.Quote - reduces allocations and removes regex dependency
Result: {"status":"keep","test_duration":0.872}
commit de9cead40a8e5c62a99bc7245c90621dda56a545
Date: Sat Mar 21 21:01:56 2026 +0100
Baseline measurement - no optimizations
Result: {"status":"keep","test_duration":0.88}
commit e0dd98fca908ebee87a334a3c4d91b58a0f0fb6b (tag: v0.2.7, origin/master, origin/HEAD)
The metadata file helps choose the optimizations to merge or cherry-pick.
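Given the record shape shown above, a few lines of Python are enough to pull the kept runs out of autoresearch.jsonl, best metric first (assuming lower is better, as with test runtime):

```python
import json

def kept_runs(path="autoresearch.jsonl"):
    """Return kept optimization runs sorted by metric, best first."""
    with open(path) as f:
        runs = [json.loads(line) for line in f if line.strip()]
    keeps = [r for r in runs if r.get("status") == "keep"]
    return sorted(keeps, key=lambda r: r["metric"])

# Each kept run carries a commit id ready for `git cherry-pick`:
# for r in kept_runs():
#     print(r["commit"], r["metric"], r["description"])
```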
Summary
The pi-autoresearch plugin makes it extremely easy to trigger optimization loops. The optimization flow mimics what a human would do, just running indefinitely on autopilot, with all of its appeal and dangers. The API key budget limit was a nice way to ensure predictable spending, but such a capability may not be easily accessible to everyone, requiring other guardrails for spend control. Watching the optimizations being applied reminded me of the simulated annealing optimizations I wrote back at university for flow-shop scheduling problems.
The simple examples described in this post show how important a proper verification routine is, given that it determines the course of action for the optimization workflow. The faster the verification run, the more loops can be run in a unit of time. It’s also important to clearly scope the optimization run. If you’re just interested in optimizing a single function, create a test scoped to it and ensure the verification runs only this single test. The golang experiment was intentionally prompted with high ambiguity, showing how the model will attempt exploring different avenues of optimization, even if implicitly undesired by the user. It’s better to be specific in the research loop design, rather than taking chances that the model will do the right thing.
The optimization runs are also subject to the classic pitfalls of machine learning. If the tests are too narrow (e.g. checking prime numbers up to 5000), the “optimized” code may be overfitting to this exact test suite. This property will hopefully lead to an increase in test coverage in projects that choose to use autoresearch. Additionally, it’s highly advisable to inspect the attempted optimizations, before celebrating the optimization wins too early.
Cost-wise, the Kimi-K2.5 model has a clear advantage with its $0.45/$2.25 input/output price per 1M tokens, making my experiments rather cheap. However, I have to note that the experiments were interrupted by me running out of credits on OpenRouter, so it's unknown how long they would have been running otherwise. Costs are driven by the size of the code files the model needs to analyze in order to propose optimizations and, most importantly, by the size of the output the verification step generates. Long, verbose build output, as well as any screenshots the model has to analyze, will definitely drive up costs here.
I definitely look forward to applying pi-autoresearch in more projects. You should too!