It’s a very eventful time for the industry. Some compare it to the most exciting period in engineering since the rise of the Internet, while others see parallels with the adoption of cloud computing and microservices.
Others are already seeking a manifesto moment, though in Martin Fowler’s view it’s far too early for that: in XP and Agile terms, we’re at a stage comparable to the early 90s, full of experimentation with the ideas that would form XP, around the time of the Workshop on Object-Oriented Design (WOOD). 1
The sheer influx of information and day-to-day developments is really hard to follow. Just in January Steve Yegge released a new take on the IDE with Gastown, only to reflect a month later on being bitten by the AI vampire, highlighting the addictive side of building software with agents. The impact of FOMO and pressure on mental health is widely discussed in the industry as well, given agents can amplify undesired behavior. OpenClaw triggered euphoria, showing the power of agents to a wide audience, further accelerating the FOMO. In the meantime, its author Peter Steinberger was acquihired by OpenAI, with the project transitioning into a foundation (see State of the Claw for a recent update). Others claim not to be fooled again, recollecting MS-DOS times.
The key question is where things are going medium to long term, hence the title of the post.
My goal is to highlight some of the constraints, mechanisms, or factors that will influence how Agentic Engineering evolves, putting structure into notes that I’ve been collecting for a while.
Feed the text to your agent for a TL;DR or enjoy reading just like I enjoyed compiling the material and typing this post.
Equipped with coding agents, non-experts can create software and debug problems they encounter without needing to rely on their Google-fu, know the right keywords, or seek support from colleagues. They can build tools and businesses end-to-end in a way that was previously inaccessible to them, benefiting from the drastically lower cost of creation. Surely, many non-experts will have a different objective in mind when coding, seeing it merely as a tool and a means to get the job done. The created software may never be used by more than a single user. When it is, though, it will require security hardening, productionization, and a safe place to run. Otherwise, it will immediately become a security liability.
The (software) craftsmen among us are worried about skill atrophy in the age of LLMs. The bigger risk worth acknowledging is a lack of skill formation, as non-experts become dependent on tools without forming a deep enough understanding of their internals. Provider downtime can feel like a blackout: work simply stops. On a positive note, we use things all the time that no single person could build from scratch on their own, and we’re perfectly fine with this. “I, Pencil” retold by Milton Friedman offers a 2-minute lesson on the complexity and connectedness of our world.
Looking at implications for software teams, platform and developer experience teams are challenged to accommodate an expanded range of software contributors, going even beyond product managers or designers. Teams will need to tune their assumptions about development environments and onboarding approaches, and create new safety nets that lower contribution friction. It is also an opportunity to unify the tech stack and deployments of internal apps, forming portal-like marketplaces where apps can be easily adjusted, remixed, and integrated with existing APIs and tools. These managed apps could end up as a remixed experience of codepen.io, glitch.com, and Google’s AI Studio. Without such platforms, non-expert authors of software will be searching for a team to harden and operate their creations, directly breaking the “You build it, you run it” principle that many teams follow today.
All of them. Many times. Our engineering processes and the underlying platforms were built to scale with human activity. Agentic Engineering lowers the cost of producing changes faster than our existing systems and processes can adjust.
When Netflix’s platform is spinning up cloud compute to stream a video, there is a human using a device to access this video. When a customer service agent picks up a phone, it’s because a human had an interaction with the business and something went south. When existing system constraints were challenged, there were usually clear incentives: marketing (spam), building influence (misinformation through media content), financial profit (sneaker bots, event ticket bots, HFT, fraud, DDoS, malware). In all these, we observed the effect of leverage with a human operator or human-in-the-loop somewhere in the process. Agents will bring these effects into a multitude of places, far beyond putting leverage into the hands of the next generation of script kiddies.
Lowering the cost of contributions and the need for better verification
Lowering the cost of contributions results in more incoming changes. Many of these changes would not have existed before, as the cost of making them would have been too high relative to their value. Now, a change is developed and submitted faster than it would previously have taken to assess whether the change was needed at all. When going through existing processes, these code changes trigger code review requests and CI/CD runs with builds, tests, artifact uploads, security scans, etc. Any of these steps that are slow, cannot run in parallel, or require extensive human involvement will result in inefficiencies and frustration that continue to pile up. These steps used to be largely correlated with human activity and subject to human-level constraints. Dependency updates, if automated, were largely expected to be non-breaking. Well-run engineering orgs, or large orgs that needed to accommodate a high number of contributors, adopted practices that helped them scale (or reduce costs and lead time), such as Spotify’s fleetshift for fleet-wide refactoring. Other teams may never have seen the need for such optimizations, or assessed them as having a clearly negative ROI.
Agentic Engineering benefits from rapid verification cycles. A few minutes spent waiting for a PR build to complete or for a code review to come in directly affect the dopamine hits that operators of agents experience. Accelerating requires various strategies: splitting test suites, using multi-stage builds, being able to launch the application in parallel (locally or on a devbox), and test automation that keeps verification steps small enough to sustain agentic coding and optimization loops. As release frequency increases, relying only on real-user A/B tests may become too slow for early iteration loops, so simulated traffic from synthetic personas may be used more often as an early signal. To manage incidents, we have established practices in SRE where automating runbooks is far from novel, yet now far more accessible than before. Literally all existing tools and practices need to be challenged, adjusted, or dropped. Early innovators’ products are likely to be absorbed and integrated into existing, established platforms to cope with the pace of development.
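As a minimal sketch of one of these strategies, a test suite can be split deterministically across parallel CI jobs. The file names and shard count below are illustrative, not taken from any specific CI system; hash-based assignment keeps each file on the same shard across runs, so per-shard caches stay warm.

```python
import hashlib

def shard_tests(test_files, num_shards):
    """Deterministically assign test files to parallel CI jobs.

    Each job then runs only its own shard, so wall-clock verification
    time shrinks roughly with the number of shards."""
    shards = [[] for _ in range(num_shards)]
    for path in test_files:
        digest = hashlib.sha256(path.encode()).hexdigest()
        shards[int(digest, 16) % num_shards].append(path)
    return shards

# Example: split a small suite across 4 parallel jobs.
files = [f"tests/test_{name}.py" for name in ("auth", "api", "db", "ui", "cli", "jobs")]
shards = shard_tests(files, 4)
```

A real setup would pass the shard index to each CI job and balance shards by historical test duration rather than by file count alone.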
Drowning in code and loss of system understanding
When agents contribute code, change sets tend to increase not only in frequency but also in size. Does it make sense to review these large PRs? Tools like Devin help break PRs down into semantic chunks, lowering the review burden for larger PRs. One can also ask the agent to commit in small chunks and leverage stacked diffs for reviews. However, given the high rate of rework from agents on the same files, what’s the ROI of reviewing every PR? Maybe quality and security guardrails that, when met, result in an automatic merge are enough? Given human constraints on time, allocating fixed time chunks for the team to review the day’s changes, collectively reading and building a shared understanding of the codebase, may be a smart tactic. Getting comfortable with agents writing code without supervision will require more platform support (sandboxing) and engineers getting comfortable with building without writing, focusing on agent coordination. A world not every craftsman will enjoy.
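One way such auto-merge guardrails could be expressed, sketched here with invented signal names and thresholds rather than any particular platform’s API:

```python
from dataclasses import dataclass

@dataclass
class ChangeSignals:
    tests_passed: bool
    coverage_delta: float          # percentage points vs. main
    security_findings: int         # open findings from the scan stage
    touches_sensitive_paths: bool  # e.g. auth/ or billing/ directories

def auto_merge_allowed(s: ChangeSignals) -> bool:
    """Merge without human review only when every guardrail holds;
    anything risky or sensitive falls back to a human reviewer."""
    return (
        s.tests_passed
        and s.coverage_delta >= 0.0
        and s.security_findings == 0
        and not s.touches_sensitive_paths
    )
```

The interesting design question is not the boolean logic but which signals are trustworthy enough that meeting them genuinely substitutes for a human look.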
To cope with the increasing amount of code that needs to be understood in a structured way, we’re seeing the foundations of new tools being built. Codebase size influences the ability of agents to reason about it and affects iteration speed, as inference time is correlated with the number of input tokens. Projects like cased/kit, GitNexus, or GitLab Knowledge Graph aim to index codebases and expose their symbols and structure more efficiently than (rip)grep does. Coding agents also support Language Server Protocol (LSP) servers to access IDE-like code navigation features and jump around the codebase.
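The core idea behind such indexes can be sketched in a few lines with Python’s standard `ast` module; this is a toy, single-file version of what the projects above do at repository scale, and the example file name is invented:

```python
import ast

def index_symbols(source: str, path: str) -> dict:
    """Map top-level function and class names to their definition site.

    Unlike a grep over raw text, this resolves actual definitions, so a
    tool can jump straight to a symbol without wading through false
    matches in comments, strings, or call sites."""
    index = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            index[node.name] = (path, node.lineno)
    return index

code = "def handler(event):\n    return event\n\nclass Router:\n    pass\n"
symbols = index_symbols(code, "app.py")
```

A real indexer would additionally resolve references across files and languages, which is exactly where the token savings over repeated grep-and-read loops come from.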
Spec-driven development can complement Architecture Decision Records (ADRs) by turning intent into something executable and verifiable. What’s new is that full applications can be built purely from the spec. The first projects are shipping with disclaimers like “use an agent to make your own” from the released spec, with a tech stack of your choice. This approach brings us closer to a scenario where software can be rewritten on demand, with far less manual implementation effort. The key is the verification stage: being able to check the adjusted acceptance and verification criteria that come with each new iteration of the specification. I believe I’ve seen an ERP company have their product work this way, but I cannot find the reference anymore. An approach like this would also mean that framework upgrades and migrations can be executed in a similar fashion, addressing a large chunk of the technical debt that exists today.
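One toy shape such an executable spec could take, with acceptance criteria as predicates re-run after every regeneration of the implementation; the `slugify` feature and its criteria are invented for illustration:

```python
def slugify(title: str) -> str:
    # One possible generated implementation satisfying the spec below.
    return "-".join(title.lower().split())

# The spec carries named acceptance criteria as executable checks.
SPEC = {
    "feature": "URL slugs",
    "acceptance_criteria": [
        ("lowercases input", lambda: slugify("Hello World") == "hello-world"),
        ("no spaces in output", lambda: " " not in slugify("a b c")),
    ],
}

def verify(spec) -> list:
    """Return the names of failing criteria; an empty list means the
    regenerated implementation still satisfies the spec."""
    return [name for name, check in spec["acceptance_criteria"] if not check()]
```

The implementation is disposable and can be regenerated on each spec iteration; only the criteria, and the ability to run them cheaply, need to persist.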
Another bottleneck will be on the human side. We’re used to a certain pace of software development and delivery. Further acceleration increases the cognitive load and challenges our ability to reason about changes across systems and codebases. Increasing the number of changesets, their span, and the number of changes still in flight leads to a significant explosion in scope and complexity. It’s not unlikely that the already observed differences between high performers and the rest of the teams will start requiring structural changes to how we organize teams. It remains to be seen which exact ones.
Open Source is a prime example of bottlenecks, especially when critical projects are maintained by a single person. GitHub sees an influx of activity on their platform, on track for a 14x increase in commits. At the same time, the platform has less than 90% uptime, showing the pressure their SRE teams are under. Looking at repo creation stats, it’s ~331k repos per day, with clearly increasing momentum since the start of 2026. 2
Diffusion of quality and loss of discoverability of new projects
The influx of new repos affects discoverability, for example for hot topics such as coding agent sandboxing. Try finding out which of the ‘claude sandbox’ projects is good enough to be used safely. Sifting through the project list is more time-consuming than it used to be, because it’s much harder to assess whether a project just looks good or whether it actually works, and at what quality level. Large PRs break the UI, making it harder than necessary to review incoming contributions. Faced with an influx of activity, spam PRs, and spam comments, maintainers are seeing past incentives thrown out of balance. The current training data for LLMs is built on the prior generation of OSS projects. Many of these were high quality and key dependencies across a large number of projects. As incentives shift, the question is how strong and resilient the ecosystem really is. Open Source used to be a way to tap into developer capacity, especially the most committed community members willing to contribute value. Supported by agentic coding, some maintainers are considering closing their projects to trusted contributors only, making more efficient use of their time. There is even a GitHub feature that helps limit PRs to contributors only, and vouch, an experimental project for trust management, as a means to reinforce the strong links in the ecosystem.
AI contribution and attribution policy divergence
In addition to code contributions, projects now face more incoming issues and security advisories. Claude Code has a /bug command that creates GitHub issues. Browsing through the types of issues filed is a mix of amusing and frightening. curl is known for receiving an influx of low-quality vulnerability reports and struggling to sift through them. To adjust incentives, they stopped their bug bounty program. The problem here is the signal-to-noise ratio that buries the valuable reports.
In the State of the Claw talk, Peter mentioned that OpenClaw had received 1142 security advisories since January 31 (>16 per day), with an acceptance rate of 41%. This was an estimated 5700 hours of work over 69 calendar days (5700h ≈ 237 round-the-clock days, or >700 eight-hour working days). This particular project is popular enough to attract attention from both ends: contributors and attackers. Not every critical dependency in the software supply chain will be lucky enough to have enough hands on deck and a foundation structure to support governance. We see the first signs of large-scale attacks with ripple effects, such as the attack on trivy propagating through the ecosystem.
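The scale becomes tangible with a back-of-the-envelope check of the quoted numbers:

```python
# Advisory rate: 1142 advisories over 69 calendar days.
advisories = 1142
days = 69
per_day = advisories / days      # ≈ 16.6, i.e. more than 16 per day

# Effort: 5700 hours of estimated triage and fix work.
hours = 5700
calendar_days = hours / 24       # ≈ 237 round-the-clock days
working_days = hours / 8         # ≈ 712 eight-hour working days
```

In other words, keeping up would take roughly ten full-time people over those 69 days, which is exactly the kind of capacity most single-maintainer projects do not have.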
Compute capacity shortages as a driver for innovation?
Constraints are great, as they lead to reflection and innovation. As the AI datacenter build-out affects the whole supply chain, with memory and storage prices and the availability of GPUs, CPUs, and other components under pressure, compute efficiency will hopefully matter more and more. As an industry, we’ve become complacent and indifferent to resource usage because it’s so easy to just bump memory or CPU in the cloud instead of spending the time to profile an application and understand the reasons for performance bottlenecks. llama.cpp is a great example of a successful movement aiming to bring inference to local machines, with their compute constraints. The same goes for on-device inference with ExecuTorch.
Limits on coding plans are often the only reason for a person to ask: could I have used a cheaper model? Was this really the right task for a coding agent?
Without constraints, it’s too easy to just continuously run the currently most capable and expensive model, an approach that is not sustainable long-term. A lack of limits reinforces bad behaviors. Overly tight limits, on the other hand, don’t let users experience new capabilities in action.
GPU capacity constraints and hidden price hikes for models
GPU capacity shortages will also play a key role moving forward. They could explain features such as adaptive thinking, now enforced for new models while being deprecated for existing ones. Capacity limits would also make subtle bugs in prompt caching hurt much more. While the length of tasks that LLMs can solve keeps growing exponentially, it remains to be seen whether token usage rises exponentially as well. Anthropic’s 10x growth rates would suggest so, yet we lack clear data on the costs of AI agents.
What we definitely see are explicit price hikes with releases of OpenAI models: gpt-5.4 is 11% more expensive than gpt-5.2/5.3, which is 40% more expensive than gpt-5.1, roughly 55% in total. Surely, the model is more capable, though one has to ask whether all users are really feeding the models incrementally more complex tasks. Without clear, task-specific evals, engineering teams often opt to switch their coding model to the newest one because it is the (now) recommended one, feeding the FOMO on one end and filling the revenue hat on the other. Many users do not (yet) have the tools or the capacity to create their own task-specific benchmarks. Hopefully, they take the time to do so for customer-facing products, as this has a clearer ROI. If not, they will experience funny inference glitches, such as links being injected into LLM outputs where a single word was expected.
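The two quoted hikes compound multiplicatively rather than adding up, which is why the total lands near 55% instead of 51%:

```python
# Compounding the quoted per-release price increases.
hike_52_over_51 = 1.40   # gpt-5.2/5.3 is 40% more expensive than gpt-5.1
hike_54_over_52 = 1.11   # gpt-5.4 is 11% more expensive than gpt-5.2/5.3

total = hike_52_over_51 * hike_54_over_52 - 1.0   # ≈ 0.554, i.e. ~55%
```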
Anthropic’s price hikes are more subtle, hidden in features like adaptive thinking or system prompt changes that alter the behavior of their harness, or more visible ones like the tokenizer change in Claude Opus 4.7 that resulted in 38% higher request costs. On the other hand, they’re also more explicit with recent updates to the Enterprise plan, switching to usage-based pricing by introducing a $20 per-seat price just to get access to the tools, with all interactions billed at API price levels. No more free riding and no quotas. This reduces complexity and challenges existing budgets.
GitHub feels the heat as well, adjusting their individual plans, pausing sign-ups, and reducing the availability of Opus to the highest tier only. GitHub’s move to offer Opus 4.7 with a promotional 7.5x premium request multiplier (vs. 3x for Opus 4.6) is a 2.5x price hike as well. This just gives users more arguments to run Opus-class open-weight models like the newly released Kimi K2.6. Frequent changes to the pricing models of established providers may push more users toward open-weight alternatives, strengthening competitive pressure on proprietary vendors.
As products and platforms start getting agent-ready, they are forced to open up access to agents via official interfaces: CLIs, APIs, MCP, etc. Google’s NotebookLM is famous for not having an API, requiring hacks for agent access and possibly risking account bans if crawling gets classified as abuse. Many other platforms got away with not releasing programmatic access, locking in their users and making migration between products unnecessarily difficult. With Salesforce announcing their Headless 360 initiative, other players will feel even more pressure to catch up. The result will be two-fold. First, users will (hopefully) gain the ability to integrate and migrate between platforms with more ease. Second, the existing pricing models of platforms will be challenged.
All the problems we’ve seen with coding agents will also show up here. Seat-based pricing will be challenged as activity on accounts gains an additional component that is no longer correlated with human activity. Public platforms dealt with automated traffic via waiting rooms or bot protection products, including recent releases that allow charging for access, giving the HTTP response code 402 Payment Required a new life. That’s not going to fly for SaaS. The growing share of agent traffic will put a strain on margins, leading platforms to rethink their pricing models. An approach where “agents must buy seats just like human employees” is unlikely to succeed long-term, as the usage patterns of agents will continuously evolve and become more complex. Assigning agents subsidized quotas would result in agents gaming the system, registering accounts to be used in parallel, or simply running at full utilization of rate limits all the time (like crawlers do). The usage-based models that coding plans are converging on are much more likely to be applied more broadly to API-based products.
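A minimal sketch of the 402 pattern, using only Python’s standard library; the `X-Payment-Token` header name is an invented placeholder, not a standardized mechanism:

```python
from http import HTTPStatus

def gate_request(headers: dict) -> int:
    """Return the HTTP status for an incoming request: agent traffic
    without a valid payment token gets 402 Payment Required instead of
    a block page or waiting room."""
    if headers.get("X-Payment-Token"):
        return HTTPStatus.OK                 # 200: paid access proceeds
    return HTTPStatus.PAYMENT_REQUIRED       # 402: pay to play
```

The appeal of 402 over outright blocking is that it turns unwanted automated traffic into a priced, machine-readable negotiation rather than an arms race.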
Roughly 18 years ago, I had a course in Artificial Intelligence at university where the professor was explaining how an AI could compose computer programs through natural language analysis. What felt like a complete abstraction existing purely on the whiteboard is now a reality used by millions. Having witnessed both sides is fascinating and humbling at the same time.
The ecosystem around Agentic Engineering is evolving in parallel with an accelerated race to ship new, more capable models while dealing with constraints in GPU capacity. This mix of challenges in processes, tools, and infrastructure build-out results in fast-paced change with implications that cannot easily be foreseen. We will continue hitting various bottlenecks as progress is made, revisiting old approaches or adding new ones. All we can do is embrace the uncertainty and adjust course when needed. With the bottleneck shifting from generation to orchestration and verification, we have an opportunity to close gaps in processes and systems that have existed for a long time, improving the industry as a whole. To achieve this, we have to be able to teach Agentic Engineering while new ways and approaches are still being figured out.
More tools to come, more approaches to be tried out. As exhausting as it is to ride the wave of change, it’s highly rewarding as well.