Data Today

Inference got cheap, but not for the work that pays

2026-05-04T00:00:00Z

The headline number is real and it is huge: the price to run a language model at a fixed level of performance has fallen about 40 times per year. The number that matters for your budget is the spread behind it. Epoch AI puts the decline between 9 and 900 times per year depending on which capability milestone you measure. The cheap collapse and the expensive plateau are the same trend seen from two ends.

Easy work fell off a cliff. Reaching GPT-3.5 quality on general knowledge went from a premium product to a rounding error, dropping by hundreds of times a year as small open models caught the old frontier. Hard work moved slowly. Holding PhD-level science accuracy steady has become cheaper by only about 9 times a year, because staying at that bar still needs a large, current model.

Read the gap, not the average

# Annual price-decline factors at three capability milestones (Epoch AI)
milestones = {"easy_general": 900, "mid_range": 40, "frontier_reasoning": 9}

YEARS = 2  # planning horizon

# Cost after YEARS years, relative to today, for a fixed quality bar
for name, factor in milestones.items():
    relative_cost = 1 / factor**YEARS
    print(f"{name:18s}: {relative_cost:.6f} of today's price")

The output tells the planning story. Anything sitting on the 900x line is nearly free to serve next year, so do not architect around its cost. Anything on the 9x line stays expensive, so that is where caching, routing, and smaller fallback models earn their keep.

What it changes for buyers

Task class	Annual price drop	Planning move
Classification, extraction, summaries	up to 900x	Assume near-zero cost soon
General chat, drafting, retrieval	about 40x	Route to mid-size models
Multi-step reasoning, hard science	about 9x	Budget for it, cache aggressively

The lesson connects to the broader collapse in token prices: falling averages hide where money still goes. Spend your optimization effort on the flat line, not the steep one. The milestone-by-milestone data is published by Epoch AI.

Task routing is the new cost discipline

When price declines vary by task, the cheapest architecture is rarely one model for everything. A good system routes easy work to small models, sends ambiguous or high-value work to stronger models, and escalates only when the expected gain justifies the cost. The routing layer becomes as important as the model choice.
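The escalation rule can be sketched in a few lines. This is a minimal illustration, not a production router: the two tiers, their per-call prices, and their acceptance probabilities are hypothetical placeholders that would normally come from your own logs.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    cost_usd: float   # hypothetical cost per call
    p_accept: float   # hypothetical chance the answer is accepted


# Illustrative tiers only; substitute measured values.
SMALL = Tier("small", cost_usd=0.0004, p_accept=0.90)
FRONTIER = Tier("frontier", cost_usd=0.03, p_accept=0.98)


def route(value_of_correct_answer_usd: float) -> Tier:
    """Escalate only when the expected gain exceeds the extra cost."""
    expected_gain = (FRONTIER.p_accept - SMALL.p_accept) * value_of_correct_answer_usd
    extra_cost = FRONTIER.cost_usd - SMALL.cost_usd
    return FRONTIER if expected_gain > extra_cost else SMALL


print(route(0.10).name)  # low-value task stays on the small model
print(route(5.00).name)  # high-value task escalates
```

The design choice is that the router compares money with money: the value of a correct answer is an input, so product owners rather than engineers set the escalation threshold.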

This is a practical change for product teams. Classification, extraction, formatting, and first-pass summarization can often run on cheaper models with strict validation. Multi-step reasoning, high-stakes advice, and scientific or legal analysis may need a frontier model plus human review. Treating both groups as the same AI workload wastes money and hides risk.

The routing policy should be measurable. Track cost per accepted answer, error rate by task class, fallback frequency, and human-review minutes. If a smaller model produces answers that reviewers reject, its apparent savings are false. If it clears routine work reliably, it protects the budget for harder calls.
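Cost per accepted answer is simple arithmetic once review time is priced in. The sketch below assumes made-up call volumes, prices, acceptance rates, and a reviewer rate of one dollar per minute; only the shape of the calculation is the point.

```python
def cost_per_accepted_answer(calls, cost_per_call_usd, accept_rate,
                             review_minutes_per_call=0.0,
                             reviewer_usd_per_minute=0.0):
    """Total spend (model plus human review) divided by accepted answers."""
    accepted = calls * accept_rate
    if accepted == 0:
        return float("inf")
    model_cost = calls * cost_per_call_usd
    review_cost = calls * review_minutes_per_call * reviewer_usd_per_minute
    return (model_cost + review_cost) / accepted


# Hypothetical comparison: a cheap model that needs more review
# versus a pricier model that reviewers mostly accept.
cheap = cost_per_accepted_answer(1000, 0.001, accept_rate=0.60,
                                 review_minutes_per_call=0.5,
                                 reviewer_usd_per_minute=1.0)
strong = cost_per_accepted_answer(1000, 0.03, accept_rate=0.95,
                                  review_minutes_per_call=0.1,
                                  reviewer_usd_per_minute=1.0)
print(f"cheap model:  ${cheap:.3f} per accepted answer")
print(f"strong model: ${strong:.3f} per accepted answer")
```

With these invented inputs the cheaper model ends up several times more expensive per accepted answer, which is the false saving the paragraph above warns about.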

Falling costs can increase total spend

Cheaper inference does not guarantee a smaller bill. Lower prices often increase usage. Teams add AI to more screens, run more background analysis, and ask for more drafts because each one feels inexpensive. The unit cost falls while total volume rises. That is how a budget can grow during a price collapse.
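A quick worked example shows how the bill can grow. The starting price and the 100-fold usage growth are hypothetical; the 40-fold price drop is the headline rate quoted earlier.

```python
def total_spend(unit_cost_usd, calls):
    """Spend is unit price times volume; nothing more."""
    return unit_cost_usd * calls


# Hypothetical before/after: unit price falls 40x while usage grows 100x.
before = total_spend(unit_cost_usd=0.04, calls=100_000)
after = total_spend(unit_cost_usd=0.04 / 40, calls=100_000 * 100)
print(f"before: ${before:,.0f}  after: ${after:,.0f}  change: {after / before:.1f}x")
```

Under these assumptions the budget rises 2.5 times even though every individual call became 40 times cheaper.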

The effect is strongest for work that was previously skipped. A company that used to summarize only high-value documents may start summarizing everything. A support team may generate suggested replies for every message rather than only complex cases. A data team may run nightly interpretations across dashboards that no analyst had time to inspect before.

That expansion can be valuable, but it needs a budget owner. The right question is not whether each call is cheap. It is whether the new volume changes a decision, reduces labor, improves quality, or creates a product feature users will pay for. Without that discipline, the cost decline becomes a demand engine with no governor.

What to put in the dashboard

An inference dashboard should separate price from mix. Show total tokens, cost per task class, model used, cache hit rate, retry rate, latency, and accepted outputs. Then split the view by easy, medium, and frontier workloads. The blend is what matters because the average can improve while the expensive class grows quietly.
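The difference between price and mix is easy to see with two months of invented numbers. In this sketch the workload classes, call counts, and per-call prices are all hypothetical.

```python
# Hypothetical monthly mix: class -> (calls, cost per call in USD).
last_month = {"easy": (100_000, 0.0010), "frontier": (1_000, 0.050)}
this_month = {"easy": (400_000, 0.0005), "frontier": (3_000, 0.045)}


def summarize(mix):
    """Return the blended average cost per call and the spend per class."""
    total_calls = sum(calls for calls, _ in mix.values())
    by_class = {name: calls * price for name, (calls, price) in mix.items()}
    return sum(by_class.values()) / total_calls, by_class


for label, mix in (("last month", last_month), ("this month", this_month)):
    average, by_class = summarize(mix)
    print(f"{label}: average ${average:.5f}/call, "
          f"frontier spend ${by_class['frontier']:.0f}")
```

The blended average falls while frontier spend nearly triples, so a dashboard that shows only the average would report good news on the way to a larger bill.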

Caching deserves its own line. Many high-cost calls repeat stable context, system instructions, or reference material. Prompt caching, retrieval caching, and answer reuse can lower cost without changing model quality. For hard tasks on the 9x line, those operational savings may matter more than waiting for the next price cut.
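A rough model of prompt caching helps size that line on the dashboard. The sketch assumes a provider that bills cached prefix tokens at a fraction of the normal input price; the 10 percent fraction, the 80 percent hit rate, and the token price are placeholders, not any vendor's actual terms.

```python
def prompt_cost(prefix_tokens, fresh_tokens, price_per_mtok,
                cache_hit_rate, cached_discount):
    """Expected input cost per call when a stable prefix is sometimes cached.

    cached_discount is the fraction of the normal price charged for cached
    tokens (an assumption; check your provider's actual terms).
    """
    full = prefix_tokens * price_per_mtok / 1_000_000
    cached = full * cached_discount
    expected_prefix = cache_hit_rate * cached + (1 - cache_hit_rate) * full
    return expected_prefix + fresh_tokens * price_per_mtok / 1_000_000


no_cache = prompt_cost(50_000, 500, 3.0, cache_hit_rate=0.0, cached_discount=0.1)
with_cache = prompt_cost(50_000, 500, 3.0, cache_hit_rate=0.8, cached_discount=0.1)
print(f"no cache: ${no_cache:.4f}  with cache: ${with_cache:.4f}")
```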

The durable lesson is that inference is no longer one market. Commodity language work is racing toward near-zero marginal cost. Frontier reasoning remains a premium input. The companies that understand the split will make AI feel cheap to users without letting the hardest tasks quietly consume the margin.

The procurement question

Buyers should ask vendors to price the actual workload mix, not an average token bundle. A support assistant heavy on classification and retrieval should be priced differently from a scientific reasoning tool. A drafting product with high cache reuse should not carry the same cost model as a long-context analyst. The benchmark is cost per accepted answer.

That metric forces hidden costs into the open. It includes retries, review, latency, caching, and model escalation. It also rewards systems that use cheaper models well. The next phase of inference competition will be less about a single headline token price and more about routing the right task to the right level of capability.

The world's AI compute now doubles every seven months

2026-05-06T00:00:00Z

The supply of AI compute is compounding faster than almost any other industrial input in history. Epoch AI estimates the total computing power of the installed stock of AI chips is growing 3.4 times per year, a doubling every seven months, based on revenue data, financial disclosures, and analyst reports. By early 2025 the cumulative stock had passed the equivalent of roughly 16 million H100 chips.

That pace reframes the bottleneck. When capacity doubles nearly twice a year, the constraint stops being silicon on a wafer and becomes the power and the buildings to run it. A single gigawatt of facility power now costs about 30 billion dollars to stand up, and the largest sites are measured in gigawatts.

What a seven-month double means

The intuition breaks once you compound it. At 3.4 times a year, capacity doubles about every seven months, and three years of that growth is roughly a 40-fold increase. Any capacity plan written on a two-year horizon is planning for a world with an order of magnitude more compute than the one it was drafted in.
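The compounding is easy to check. The only input below is the 3.4 times per year growth rate quoted above; everything else follows from it.

```python
import math

ANNUAL_GROWTH = 3.4  # installed AI compute stock, times per year (estimate cited above)

# Doubling time in months: solve ANNUAL_GROWTH ** (m / 12) == 2 for m.
doubling_months = 12 * math.log(2) / math.log(ANNUAL_GROWTH)
print(f"doubling time: {doubling_months:.1f} months")

for years in (1, 2, 3):
    print(f"after {years} year(s): {ANNUAL_GROWTH ** years:.1f}x")
```

Two years works out to roughly 11.6 times and three years to roughly 39 times, which is where the order-of-magnitude and 40-fold figures come from.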

The constraint moved downstream

Chips: plentiful relative to the past, and improving in performance per dollar by about 37 percent a year.
Power: doubling annually per training run, now the gating resource.
Buildings: gigawatt campuses take about two years to build.

The strategic question for any team is no longer whether compute will be available. It is whether the grid connection and the data center will be ready when the chips arrive. The underlying estimates come from Epoch AI.

Why installed stock matters more than annual shipments

Chip shipment headlines are useful, but the installed stock is the capacity that actually changes product economics. A delivered accelerator only becomes useful after it is mounted, powered, cooled, networked, scheduled, and connected to the software stack that can keep it busy. The lag between shipment and productive capacity is why the physical buildout matters as much as the semiconductor roadmap.

Stock also compounds differently from sales. A strong shipment year adds to the machines already running, so the service capacity available to model providers can rise even faster than a single year's revenue suggests. That is the reason a seven-month doubling is such a violent planning signal. It means the market is not simply replacing old parts with new ones. It is adding layers of usable capacity at industrial speed.

For builders, the installed-stock view explains why yesterday's expensive capability becomes tomorrow's standard feature. More total compute means more competition among providers, more room for batching, and more pressure to fill idle capacity. Those forces help push inference prices down, especially for workloads that no longer need the newest frontier model.

The bottleneck shifts by layer

Every doubling moves the constraint to a different layer. At first the question is whether enough chips exist. Then it is whether the data center shell is ready. Then it is whether the site has enough power, cooling, networking, and technical staff. Finally it is whether there is enough demand to use the cluster at high utilization. A market can be short at one layer and long at another.

That layer shift matters for forecasts. A chip analyst may see supply improving while a cloud customer still cannot get the instance type they want in the region they need. A utility may see years of interconnection queues while a model company announces capacity targets that assume the power arrives on time. Both views can be accurate. They describe different points in the same pipeline.

The safest reading is that AI capacity is becoming less like a software release and more like a logistics system. The slowest dependency sets the effective speed. If transformers, substations, permitting, or skilled labor fall behind, the theoretical compute stock will not translate into cheap, reliable service.

What users should expect

Most users will not buy clusters directly, but they will feel the stock doubling in product behavior. More capacity should mean larger context windows, cheaper batch jobs, faster response times at off-peak hours, and more willingness from vendors to bundle AI features into ordinary software plans. The change will feel gradual at the interface and dramatic in the infrastructure budget.

The effect will be uneven. Frontier reasoning will still consume scarce premium capacity, while summarization, extraction, classification, and routine drafting will be pushed onto cheaper models and older hardware. That is why a single average price decline can mislead. The market is segmenting by task difficulty as fast as it is expanding.

Companies planning AI adoption should therefore track both capacity and workload mix. If their tasks sit on the commodity side, waiting can reduce cost quickly. If they need frontier reasoning, they should plan for premium capacity and scarce regional availability. The compute stock is exploding, but the useful question remains which slice of that stock your workload can actually use.

The procurement lesson

Procurement teams should avoid locking every workload to the newest accelerator. A seven-month doubling means older capacity becomes more available quickly, and many applications do not need the top bin. The better contract separates baseline workloads, burst workloads, and frontier workloads. That gives buyers room to move routine inference onto cheaper capacity while reserving premium clusters for the tasks that justify them.

The same logic applies to internal forecasts. Capacity plans should be revisited quarterly, not annually, because the supply curve is moving too fast for old assumptions. A budget built around last year's scarcity may overpay. A plan that ignores power and regional access may still under-deliver.

Context windows grew 30x a year. Your retrieval stack noticed

2026-05-09T00:00:00Z

A model's working memory has grown faster than almost any other capability. Epoch AI puts the growth in context window size at about 30 times per year since 2023, taking the maximum from a few thousand tokens to well over a million. A full codebase, a quarter of legal filings, or a book now fits in a single prompt.

That shift changes the engineering calculus around retrieval. For two years the standard answer to a long document was to chop it into chunks, embed them, and fetch the closest few at query time. When the window holds a million tokens, the question becomes which problems still need that machinery at all.

When to stop chunking

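def strategy(doc_tokens, window=1_000_000, budget_per_query_usd=0.05,
             price_per_mtok=0.4):
    """Pick stuffing or retrieval from document size and per-query cost."""
    # Input cost of sending the whole document, in USD
    cost_full = doc_tokens / 1_000_000 * price_per_mtok
    if doc_tokens <= window and cost_full <= budget_per_query_usd:
        return "stuff the whole document"
    return "retrieve: corpus too big or too costly to stuff every call"

print(strategy(120_000))     # a long report: fits the window, about $0.048 per call
print(strategy(40_000_000))  # a full corpus: 40 times larger than the window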

The rule is mundane and that is the point. If the source fits the window and the token cost fits the budget, stuffing usually beats retrieval on engineering effort and often on accuracy. Retrieval earns its complexity when the corpus is far larger than any single window or when per-query cost rules out sending everything.

What still breaks at length

Attention dilution: models still lose facts buried mid-context, so position matters even when everything fits.
Cost: a million-token call is not free, which ties back to where inference money still goes.
Latency: long prompts are slow prompts, so cache the stable prefix.

Long context did not kill retrieval. It moved the line where retrieval pays for itself. The context-window trend is tracked by Epoch AI.

Long context changes failure modes

Retrieval systems fail when they fetch the wrong chunk or miss the one sentence that matters. Long-context systems fail differently. They may receive the right document and still overlook a buried clause, overweight a repeated but less important section, or mix instructions from unrelated parts of the prompt. The engineering problem moves from finding text to organizing attention.

That is why prompt structure matters more as windows grow. A million-token input should not be treated as a giant paste buffer. Stable instructions, source summaries, document tables, and explicit citations help the model navigate the material. Putting the most important constraints at the start and end of the context can also reduce mid-context loss.

The best systems combine long context with retrieval rather than choosing one ideology. They retrieve the most relevant documents, then provide enough surrounding material for the model to reason without guessing. They also cache static context, so a large policy manual or codebase does not have to be resent from scratch on every question.

Cost turns architecture into policy

The decision to stuff a document into the prompt is partly technical and partly economic. If a full-document call costs less than the engineering time required to maintain a retrieval stack, stuffing is rational. If the same call runs thousands of times a day, cost and latency can make retrieval necessary even when the window is large enough.
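Volume is what flips the answer, and a small calculation makes that visible. The document size, token price, and the 4,000 tokens returned by retrieval are hypothetical, and the sketch counts input tokens only, leaving out the cost of running the retrieval stack itself.

```python
def daily_cost(doc_tokens, calls_per_day, price_per_mtok, retrieved_tokens=None):
    """Input-token cost per day; pass retrieved_tokens to model a retrieval call."""
    tokens = retrieved_tokens if retrieved_tokens is not None else doc_tokens
    return tokens / 1_000_000 * price_per_mtok * calls_per_day


# Hypothetical workload: a 120k-token manual at 0.4 USD per million input tokens.
for calls in (10, 5_000):
    stuffed = daily_cost(120_000, calls, 0.4)
    retrieved = daily_cost(120_000, calls, 0.4, retrieved_tokens=4_000)
    print(f"{calls:>5} calls/day: stuff ${stuffed:,.2f}  retrieve ${retrieved:,.2f}")
```

At ten calls a day the difference is pocket change; at five thousand it is a recurring line item that has to be weighed against what the retrieval pipeline costs to maintain.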

That trade-off should be visible to product managers. A legal research tool may accept a slower, more expensive call because the value of a correct answer is high. A customer-support assistant may need aggressive retrieval and routing because volume is high and many questions are routine. The right architecture depends on task value, not only token count.

The policy layer is data access. Long context makes it tempting to send everything. Good systems still apply least privilege. They include the material needed for the answer, exclude restricted content, and log which sources were used. Bigger windows increase the need for access control because they make it easier to move large amounts of sensitive text in one request.

What to measure before rebuilding RAG

Teams considering a long-context rewrite should run a simple bake-off. Choose a representative set of questions, answer them with the existing retrieval system, then answer them by placing the full source or a larger source bundle into the context. Score correctness, citation quality, latency, cost, and review effort. The winner may vary by question type.

They should also measure maintenance burden. Retrieval pipelines need chunking rules, embedding refreshes, indexes, metadata filters, and monitoring. Long context needs prompt organization, caching, permission checks, and cost controls. Neither approach is free. The point is to pay for the failure mode you can manage.

Longer windows are a genuine platform shift because they remove whole classes of retrieval work for documents that fit comfortably inside the budget. They also raise expectations. Users will assume the model saw the document they provided, so overlooked facts become less forgivable. Bigger memory makes the system feel smarter only when the answer proves it used that memory well.

The durable rule

Use the window when the question depends on one bounded source. Use retrieval when the question depends on a changing corpus, a large archive, or strict access rules. Use both when the user needs citations and context around each cited source. That rule is simple enough for product planning and specific enough to prevent expensive rewrites driven only by launch announcements.
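The rule is compact enough to write down as a function. The flag names are illustrative, and the order of the checks (citations first, then corpus properties, then the bounded-source case) is an assumption about precedence that the rule itself does not spell out.

```python
def choose_architecture(bounded_source: bool, changing_corpus: bool,
                        large_archive: bool, strict_access_rules: bool,
                        needs_citations_with_context: bool) -> str:
    """Encode the three-part rule; returns 'window', 'retrieval', or 'both'."""
    if needs_citations_with_context:
        return "both"
    if changing_corpus or large_archive or strict_access_rules:
        return "retrieval"
    if bounded_source:
        return "window"
    return "retrieval"  # conservative default when nothing is known


print(choose_architecture(True, False, False, False, False))   # one contract
print(choose_architecture(False, True, True, False, False))    # live archive
print(choose_architecture(True, False, False, False, True))    # cited research
```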

The biggest mistake is treating context size as a substitute for information architecture. A larger window lets more text enter the model. It does not decide which text deserves attention, which source is authoritative, or which answer is safe to show.

Last year's frontier model now runs on a laptop

2026-05-12T00:00:00Z

The distance between a billion-dollar data center and a laptop is now measured in months. Epoch AI estimates that frontier AI performance becomes accessible on consumer hardware within about eight months. What you can only rent from a hosted API today, you can likely run locally before the year is out.

That clock matters for anyone building a product on a model. A capability that feels like a moat the day it ships has a short half-life. Eight months later a comparable open-weight model fits on a workstation, and the differentiator moves from having the capability to integrating it well.

Plan to the lag, not the launch

Treat the lag as a date on your roadmap, not a prophecy. Take a frontier launch, add roughly eight months, and that is when a local-capable equivalent tends to show up. If your plan depends on a capability staying hosted-only, that assumption expires on that clock. If your plan assumes you can eventually run it on customer hardware for privacy or cost, the same clock tells you when.
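Turning the lag into a roadmap date is a one-line calculation. The launch date below is hypothetical, and the helper clamps the day of the month to 28 so the result is always a valid date; it is a planning aid, not a forecast.

```python
from datetime import date

LAG_MONTHS = 8  # approximate lag cited above; treat as a planning estimate


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to 28 to stay valid."""
    index = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(index, 12)
    return date(year, month_index + 1, min(d.day, 28))


launch = date(2026, 5, 12)  # hypothetical frontier launch date
print("local-capable equivalent expected around", add_months(launch, LAG_MONTHS))
```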

Why the gap keeps closing

Efficiency: algorithms deliver the same quality for about 3 times less compute each year, so models shrink without losing ground.
Open weights: strong open models now trail closed ones by a narrow margin, a story we covered in the open-weight default.
Hardware: consumer chips keep gaining memory and bandwidth.

The competitive edge is shifting from raw access to deployment. Build for the world where the model is a commodity and your product is the wrapper. The consumer-hardware lag is tracked by Epoch AI.

Local models change the privacy bargain

The eight-month lag is a technical estimate with a policy consequence. When a task can run on a laptop or workstation, the default privacy bargain changes. A user no longer has to send every document, recording, or codebase to a remote API in order to get useful assistance. That does not make local models perfect, but it does give product teams a stronger answer for regulated or sensitive workflows.

Local deployment also changes procurement. A hosted frontier model may be the right choice for the first version because it is available immediately and requires no hardware planning. A local-capable equivalent arriving months later can become the enterprise version, the offline mode, or the high-volume fallback. The product that plans for both paths has more room to negotiate on cost and data handling.

The practical design pattern is tiering. Use the hosted model for tasks that need the current frontier, use a smaller local model for repetitive or sensitive tasks, and route between them based on risk and value. That pattern requires some engineering work up front, but it avoids locking the product to the most expensive endpoint forever.

Hardware is not the whole constraint

Running a model locally still depends on memory, quantization, thermal limits, and user tolerance for latency. A model that technically fits on consumer hardware may be too slow for an interactive workflow or too large for a battery-powered device. The phrase "runs on a laptop" covers a wide range of user experiences.
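A back-of-the-envelope memory estimate shows why quantization decides what fits. The sketch counts weights only, at parameters times bits per weight divided by eight; it ignores the key-value cache, activations, and runtime overhead, so real requirements are higher. The 70-billion-parameter size is just an example.

```python
def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


for bits in (16, 8, 4):
    print(f"70B parameters at {bits}-bit: ~{weights_gb(70, bits):.0f} GB of weights")
```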

This is why local capability arrives in stages. First it is a developer demo on a high-end machine. Then it becomes a workstation feature. Then it becomes a normal consumer app feature after model compression, runtime optimization, and hardware refreshes. The eight-month estimate marks the beginning of practical access, not the moment every user has the same experience.

Software polish matters too. A raw model file is not a product. Users need installation, updates, permissions, model selection, fallback behavior, and clear signals when a task should be escalated to a stronger remote model. The teams that win local AI will be the ones that hide the complexity without hiding the trade-offs.

Why hosted providers still matter

Local models do not eliminate hosted providers. They change what hosted providers are used for. Frontier APIs remain valuable for the hardest reasoning, large-scale batch work, fresh multimodal capabilities, and managed reliability. They also provide an immediate path for small teams that cannot support model operations themselves.

The difference is that hosted access becomes a choice rather than a requirement for more workloads over time. That weakens moats based only on model access and strengthens moats based on workflow, data integration, trust, and distribution. If a competitor can run a similar model locally, the product has to compete on the job it performs for the user.

For readers, the useful headline is that frontier capability diffuses quickly down the hardware stack. For builders, the operational lesson is to design model-agnostic systems. The model that defines the launch may not be the model that defines the margin a year later.

The product question to ask now

Every roadmap should identify which features become better when they move from cloud to device. Some candidates are obvious: private writing assistance, personal search, meeting notes, code review on proprietary repositories, and offline field work. Others may still belong in the cloud because they need fresh tools, shared memory, or heavy reasoning.

The right answer is rarely a permanent choice. Products should be able to shift tasks across local, private-cloud, and hosted frontier models as cost and capability change. The eight-month lag makes that flexibility valuable sooner than many launch plans assume.

Open-weight models closed the gap to 1.7 points

2026-05-15T00:00:00Z

Dataset: the benchmark gap charted here is drawn from Stanford HAI's 2025 AI Index, whose underlying data is free to download and reuse.

Open-weight models are no longer the cheap compromise. They are a default. On some benchmarks, the performance gap between the best open-weight model and the best closed model fell from 8% to 1.7% in a single year, according to Stanford's 2025 AI Index. The chart above traces that collapse.

Near parity changes the buying decision. When the capability difference is under two points, the reasons to hold the weights yourself become decisive: data residency, predictable cost, and the freedom to fine-tune on proprietary data without sending it to a vendor.

Why teams hold the weights

Data residency requirements rule out some hosted options entirely.
Cost is predictable when you control the deployment, with no per-token surprise at the end of the month.
Fine-tuning on private data is simpler when the model is yours.

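import pandas as pd

# deployments.csv columns: month, license, count
df = pd.read_csv("deployments.csv")

# Total deployments per month and license type
totals = df.groupby(["month", "license"])["count"].sum()

# Share of each license within its month; transform keeps the original index
share = totals / totals.groupby(level="month").transform("sum")
print(share.unstack().tail())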

The competitive backdrop

This is the same crowding visible across the tightening leaderboard: when many models cluster near the top, license terms and running cost decide deployments more than raw capability. For organizations weighing the gigawatt-scale cost of frontier compute, an open model that runs on hardware you control is an easier line to defend.

Parity does not mean open models win everywhere. It means the burden of proof has flipped. The benchmark detail is in the 2025 AI Index.

Near parity changes the risk calculation

When open-weight models trailed by eight points, choosing them required a clear reason. The buyer accepted lower performance in exchange for control. At a gap of 1.7 points, the trade-off looks different. The hosted closed model may still win on some tasks, but the open option is close enough that governance, cost, and portability can decide the purchase.

That shift is especially important for regulated organizations. A bank, hospital, public agency, or defense contractor may value data control more than a small benchmark advantage. If the open model performs adequately on the organization's own test set, the ability to host it inside existing controls can outweigh the last point of public leaderboard performance.

Open weights also reduce vendor concentration. A team can fine-tune, evaluate, and serve a model without depending on one provider's roadmap or pricing. That does not make operation easy. It gives the buyer an exit path, which is often enough to improve contract terms with hosted vendors.

The hidden costs of owning the model

Open-weight does not mean free. Someone has to choose the model, provision hardware, monitor quality, apply safety controls, patch serving infrastructure, and manage upgrades. The per-token bill may be predictable, but the operational bill moves onto the buyer's own team.

The cost depends heavily on scale. A high-volume product can justify the fixed work because each additional request is cheaper. A small team with irregular usage may be better served by an API, even if the model itself is open. The right comparison is total cost of ownership, not license type alone.
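The scale argument reduces to a break-even volume. Every number in this sketch is hypothetical: the API price, the fixed monthly cost of hardware and staff, and the marginal cost of serving a million tokens yourself.

```python
def monthly_cost_api(tokens_millions, price_per_mtok):
    """API bill: pay per million tokens, no fixed cost."""
    return tokens_millions * price_per_mtok


def monthly_cost_self_hosted(tokens_millions, fixed_usd, marginal_per_mtok):
    """Self-hosted bill: fixed operations cost plus a smaller marginal cost."""
    return fixed_usd + tokens_millions * marginal_per_mtok


def break_even_mtok(price_per_mtok, fixed_usd, marginal_per_mtok):
    """Monthly volume (millions of tokens) above which self-hosting is cheaper."""
    if price_per_mtok <= marginal_per_mtok:
        return float("inf")
    return fixed_usd / (price_per_mtok - marginal_per_mtok)


# Hypothetical: $2.00/Mtok via API, $20,000/month fixed, $0.40/Mtok marginal.
volume = break_even_mtok(2.00, 20_000, 0.40)
print(f"break-even at about {volume:,.0f} million tokens per month")
```

Below that volume the API is the cheaper way to run even an open model; above it, the fixed work starts to pay for itself.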

There is also a quality-maintenance burden. Hosted providers update models, tools, and safety systems continuously. A self-hosted deployment can drift if it is not re-evaluated against new tasks and new competitors. Open weights give control, but control includes responsibility for keeping the system current.

What buyers should test

The practical test is a three-way comparison: best closed model, best open model hosted internally, and best open model through a managed provider. Score each on task quality, latency, cost, data handling, auditability, and upgrade path. The winner may differ by workflow.

Buyers should also test failure behavior. A model that is slightly weaker on an average score may be safer if its errors are easier to detect or if it refuses uncertain requests more predictably. For production systems, the shape of failure often matters more than the average gap.

The open-weight default is not an ideological claim. It is a procurement claim: when capability is close, control becomes valuable enough to lead the decision. The best closed models will still set the frontier, but many everyday deployments no longer need the frontier to create value.

Why the gap may keep tightening

Open models benefit from a broad ecosystem. Researchers publish techniques, developers optimize runtimes, hardware vendors tune kernels, and users report failure cases in public. Each improvement compounds outside one company's API. Closed labs still have advantages in capital, data, and frontier training runs, but the open ecosystem is fast at imitation and deployment.

The remaining gap may also matter less as tasks specialize. A general benchmark can show a closed model ahead overall while an open model wins on a narrow domain after fine-tuning. Companies do not buy general intelligence in the abstract. They buy performance on their own documents, customers, code, and operating constraints.

That is why the 1.7-point number is more than a scoreboard. It signals that the default question has changed from "why use open?" to "why give up control?" In many workflows, that is the question that procurement, security, and finance were already waiting to ask.

The answer will still vary by task. Closed frontier systems remain attractive when the last bit of quality, speed of new features, or managed safety layer is worth the premium. Open-weight systems lead when data control, predictable cost, and deployment flexibility matter more. A mature AI strategy should be able to use both without rebuilding the product around one vendor.

That mixed strategy is the practical meaning of open-weight parity. It gives buyers choice. Choice lowers lock-in, improves negotiation, and lets teams match the model to the risk of the work.

The gigawatt build-out is the real AI race now

2026-05-15T00:00:00Z

The frontier of AI has become a construction project. Epoch AI estimates the largest known AI data center, the Anthropic and Amazon site at New Carlisle, has computing power equivalent to about 700,000 H100 chips, drawing roughly 1.1 gigawatts and costing about 35 billion dollars. Microsoft's planned Fairwater campus in Wisconsin is projected at 5.2 million H100-equivalents by September 2027, more than seven times the capacity.

Those figures move the contest off the chip and onto the grid. A gigawatt of facility power costs around 30 billion dollars to build and takes about two years of construction. The binding constraints are now electricity, permits, and steel, all of which move slower than a model release.

The capital math

GW_COST_USD_B = 30          # roughly $30B to build 1 GW of facility power
sites = {"New Carlisle": 1.1, "Fairwater (2027)": 8.6}

for name, gw in sites.items():
    print(f"{name:18s}: ~{gw} GW  ->  ~${gw * GW_COST_USD_B:,.0f}B to build")

The output explains why only a handful of companies are in this race. Spending tens of billions on a single building only pencils out if you expect to fill it, which assumes both demand and a power connection that may not exist yet.
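The same figures can be turned into per-chip numbers, which are easier to compare across sites. The inputs are the estimates quoted earlier in this article; the per-chip values are simple divisions and inherit all of their uncertainty.

```python
# Figures quoted in the text above (Epoch AI estimates).
NEW_CARLISLE_H100E = 700_000
NEW_CARLISLE_GW = 1.1
NEW_CARLISLE_COST_USD_B = 35
FAIRWATER_H100E = 5_200_000

# 1 GW = 1,000,000 kW
kw_per_chip = NEW_CARLISLE_GW * 1_000_000 / NEW_CARLISLE_H100E
usd_per_chip = NEW_CARLISLE_COST_USD_B * 1e9 / NEW_CARLISLE_H100E
scale = FAIRWATER_H100E / NEW_CARLISLE_H100E

print(f"facility power per H100-equivalent: ~{kw_per_chip:.2f} kW")
print(f"all-in cost per H100-equivalent:    ~${usd_per_chip:,.0f}")
print(f"Fairwater vs New Carlisle capacity: ~{scale:.1f}x")
```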

What this concentrates

Resource	Status	Who controls it
Chips	Improving in supply	Several vendors
Capital	Tens of billions per site	A few hyperscalers
Power	The hard limit	Utilities and regulators

The cost of these bets ties directly to the scaling curve nobody wants to extrapolate: the bills are real, the payoff is a forecast. Watch the substation, not the launch event. The data center estimates come from Epoch AI.

Why a gigawatt is a different category

A gigawatt-scale AI campus is closer to an industrial project than a normal data center expansion. It needs land, transmission capacity, transformers, cooling, backup systems, fiber, and a construction schedule that can survive permitting delays. The model roadmap may move in months, but the physical plant moves in years. That mismatch is now part of AI strategy.

The scale also changes risk. A conventional cloud region can grow in phases as customers arrive. A frontier AI site requires a large commitment before the demand is fully proven, because the power and building shell have to be secured early. The buyer is making a bet on future model revenue, future inference volume, and future access to electricity all at once.

This is why the largest projects cluster around companies with enormous balance sheets. The advantage is not only that they can buy chips. They can sign power agreements, absorb construction delays, and finance capacity before it earns revenue. Smaller labs may still innovate, but renting the frontier increasingly means renting from someone who owns the grid connection.

The local effects are political

Gigawatt projects become local policy issues quickly. They compete for power with factories, homes, and other data centers. They create construction jobs, tax revenue, water concerns, and pressure on transmission planning. A project that looks like a model-capability decision from Silicon Valley can look like an energy-development decision to the county that hosts it.

That political layer can slow or redirect the race. Utilities need to decide how much generation and transmission to build for customers whose demand forecasts depend on uncertain AI revenue. Regulators need to decide who pays for upgrades if a campus is delayed or canceled. Communities need to decide whether the jobs and tax base justify the infrastructure burden.

For AI companies, community acceptance becomes part of execution risk. A site with cheaper power but slower permitting may lose to a site with stronger local support. The winning location is the one where capital, electricity, regulation, and schedule line up closely enough for the model roadmap.

What to measure next

The visible headline is H100-equivalent capacity, but the next set of useful metrics is more practical. How many megawatts are energized, not merely planned? What share of power is under firm contract? How much capacity is available for training versus inference? What is the expected utilization rate after the first year? Those details separate announced ambition from usable compute.

There is also an efficiency question. A larger campus is only better if it can feed chips, move data, and cool racks without wasting too much power. Power usage effectiveness, networking topology, and maintenance downtime all affect the real capacity customers experience. A chart of theoretical chips does not capture those losses.
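A back-of-envelope way to connect those metrics, offered as a sketch rather than anyone's official method: power usage effectiveness (PUE) is total facility power divided by IT equipment power, so dividing energized facility power by PUE gives the power that actually reaches the racks. The function name and every input value below are illustrative assumptions, not Epoch AI figures.

```python
def usable_it_megawatts(planned_mw, energized_share, pue, utilization):
    """Estimate IT power doing useful work at a site.

    PUE = total facility power / IT equipment power, so IT power is
    facility power divided by PUE. All inputs here are assumptions.
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    energized_mw = planned_mw * energized_share
    it_mw = energized_mw / pue
    return it_mw * utilization


# Hypothetical 1,000 MW announcement: 60% energized, PUE 1.2, 80% utilized
print(f"{usable_it_megawatts(1000, 0.60, 1.2, 0.80):.0f} MW of busy IT load")
```

Under these made-up inputs, well under half of the announced megawatts end up as busy compute, which is the gap between announced ambition and usable capacity.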

The AI race now has a public-infrastructure dimension. Model releases still matter, but the next frontier may be decided by interconnection queues and construction schedules. The company that can turn capital into energized capacity fastest will have an advantage before a single training run starts.

The buyer's takeaway

Customers should read gigawatt announcements as capacity signals, not as product guarantees. A planned campus can lower future prices only if it is built, energized, filled with chips, and used at high utilization. Until then, buyers should ask providers where capacity is available today, which regions are constrained, and which workloads may be throttled during demand spikes.

The same questions matter for vendor risk. A provider with many smaller sites may be more resilient than one waiting on a single enormous project. A provider with firm power contracts may be safer than one with only a public target. The headline number is useful, but the delivery path is the story to watch.

Most agentic AI projects will be cancelled before they ship

2026-05-18T00:00:00Z

The agent hype has met its budget review. Gartner expects more than 40 percent of agentic AI projects to be cancelled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. The technology demos well and ships rarely, and the gap between those two states is where the money goes.

The pattern is familiar to anyone who has run a pilot. An agent that handles a scripted demo falls apart on the long tail of real inputs, where a single wrong action compounds across steps. Without a clear value case and tight guardrails, a promising proof of concept becomes a maintenance liability nobody wants to own.

Score a project before you fund it

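def survives(value_per_run_usd, runs_per_month, cost_per_run_usd,
             failure_rate, cleanup_cost_usd):
    """Return True if monthly gross margin exceeds expected cleanup cost."""
    # Gross margin: value minus cost per run, across all runs in a month
    gross = (value_per_run_usd - cost_per_run_usd) * runs_per_month
    # Expected cleanup: failed runs times the cost of fixing each one
    cleanup = failure_rate * runs_per_month * cleanup_cost_usd
    return gross - cleanup > 0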

# A plausible support-automation agent
print(survives(value_per_run_usd=4.0, runs_per_month=20_000,
               cost_per_run_usd=0.6, failure_rate=0.05,
               cleanup_cost_usd=12.0))

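The model is crude but it asks the right question. With these inputs the agent survives: cleanup costs 12,000 dollars a month against a gross margin of 68,000. The cushion depends on two numbers most pilots never measure, though, and a higher failure rate or a costlier cleanup can erase the entire gross margin. Most cancelled projects never run this arithmetic until the invoices arrive.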
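To see how far the example sits from the edge, the same arithmetic can be solved for the failure rate at which the project nets zero. This is a sketch that reuses the inputs above; `break_even_failure_rate` is a helper name introduced here, not part of the original model.

```python
def break_even_failure_rate(value_per_run_usd, cost_per_run_usd,
                            cleanup_cost_usd):
    """Failure rate at which cleanup spending equals gross margin.

    Monthly volume cancels out: margin per run = failure_rate * cleanup cost.
    """
    margin_per_run = value_per_run_usd - cost_per_run_usd
    return margin_per_run / cleanup_cost_usd


rate = break_even_failure_rate(value_per_run_usd=4.0, cost_per_run_usd=0.6,
                               cleanup_cost_usd=12.0)
print(f"Break-even failure rate: {rate:.1%}")
```

With the example inputs the break-even sits a little over 28 percent. Raise the cleanup cost to 68 dollars and break-even drops to the 5 percent failure rate the example already assumes.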

What separates the survivors

Narrow scope: one task with a measurable outcome, not an open-ended assistant.
Cheap failure: actions that are easy to review or reverse.
A real baseline: a value case that beats the tooling already in production.

Agentic AI is not failing because the models are weak. It is failing where teams skip the business case. Gartner's prediction is detailed in its 2025 agentic AI analysis.

The demo hides the operating model

The strongest agent demos usually compress the work into a single clean path. A sales agent qualifies a lead, a support agent issues a refund, or a research agent writes a market brief. The production version has to handle permissions, handoffs, audit trails, rate limits, disputed facts, and customers who ask for two incompatible things at once. That gap is where many pilots lose their budget sponsor.

An agent is also harder to govern than a normal workflow because its route can change from run to run. A conventional automation has fixed branches that can be tested directly. An agent chooses tools, reads state, and may create new intermediate steps while it works. That flexibility is the point, but it turns quality assurance into a sampling problem. Teams have to ask how often the agent takes an unsafe action, not simply whether it succeeded on the last demo.
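One way to treat this as a sampling problem is to audit a random sample of runs and put an upper bound on the unsafe-action rate. The sketch below uses the Wilson score interval, a standard choice for binomial proportions. The sample sizes are made up for illustration, and the method assumes audited runs are independent and representative of production traffic.

```python
import math


def wilson_upper_bound(unsafe_runs, total_runs, z=1.96):
    """Upper end of the Wilson score interval for a binomial proportion.

    z=1.96 corresponds to a two-sided 95% interval.
    """
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    p_hat = unsafe_runs / total_runs
    denom = 1 + z**2 / total_runs
    centre = p_hat + z**2 / (2 * total_runs)
    spread = z * math.sqrt(p_hat * (1 - p_hat) / total_runs
                           + z**2 / (4 * total_runs**2))
    return (centre + spread) / denom


# Hypothetical audits: zero unsafe actions seen in 200 and in 2,000 sampled runs
for n in (200, 2000):
    print(f"0 unsafe in {n:>5} runs -> rate could still be up to "
          f"{wilson_upper_bound(0, n):.2%}")
```

A clean sample of 200 runs still leaves room for an unsafe-action rate just under 2 percent, which is why a flawless demo says little about production.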

The projects that survive tend to make the operating model explicit. They define which actions require human approval, which inputs are outside scope, which systems the agent can touch, and how every decision is logged. Those controls make the launch slower, but they also make the project legible to finance, security, and legal teams. Without that legibility, the pilot remains a charming prototype that no executive wants to own in production.

The cancellation point is usually after integration

Budgets rarely die at the first prompt. They die after the agent is connected to real systems. Integration exposes the messy economics: every tool call has a latency cost, every retrieval step has a data-quality dependency, and every write action creates a recovery burden if the agent is wrong. The value case has to survive all of those frictions, not just the model bill.

That is why a small error rate can dominate the spreadsheet. A support agent that mishandles one in twenty cases may look accurate in a demo, but those cases are exactly where customers complain, managers intervene, and auditors ask what happened. If cleanup takes a senior employee ten minutes, the apparent savings from hundreds of successful cases can shrink quickly.

Integration also changes who pays. The innovation team may fund the prototype, while operations inherits the monitoring, exception handling, and incident response. A cancellation is often a rational handoff failure: the group asked to run the system sees a different cost profile from the group that built it.

A better approval gate

The useful approval gate is a pre-mortem with numbers attached. Before a team funds a pilot, it should name the baseline process, expected volume, acceptable failure rate, review cost, and maximum monthly model spend. It should also name the shutdown condition. If the project cannot state what evidence would kill it, the organization is buying hope rather than testing a system.
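The gate can be written down as a small record plus a shutdown check. The sketch below is one possible shape, not a standard; every field name and threshold is a hypothetical placeholder that a team would replace with its own numbers.

```python
from dataclasses import dataclass


@dataclass
class ApprovalGate:
    baseline_cost_per_case_usd: float   # cost of the current process
    expected_cases_per_month: int
    max_failure_rate: float             # acceptable share of bad outcomes
    review_cost_per_failure_usd: float
    max_monthly_model_spend_usd: float

    def shutdown_reasons(self, observed_failure_rate, monthly_model_spend_usd):
        """Return the shutdown conditions that have been triggered, if any."""
        reasons = []
        if observed_failure_rate > self.max_failure_rate:
            reasons.append("failure rate above agreed ceiling")
        if monthly_model_spend_usd > self.max_monthly_model_spend_usd:
            reasons.append("model spend above agreed ceiling")
        total_cost = (monthly_model_spend_usd
                      + observed_failure_rate * self.expected_cases_per_month
                      * self.review_cost_per_failure_usd)
        baseline = self.baseline_cost_per_case_usd * self.expected_cases_per_month
        if total_cost >= baseline:
            reasons.append("agent costs at least as much as the baseline")
        return reasons


# Placeholder numbers for illustration only
gate = ApprovalGate(4.0, 20_000, 0.05, 12.0, 15_000)
print(gate.shutdown_reasons(observed_failure_rate=0.08,
                            monthly_model_spend_usd=12_000))
```

An empty list means the pilot keeps running. Anything else is the evidence the team agreed in advance would stop it.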

The gate should include a replay set drawn from real historical cases. Synthetic prompts are helpful for coverage, but they rarely contain the awkward edge cases that make production expensive. A replay set lets the team compare the agent against the current process and measure the difference in time, quality, and cleanup work.

Agentic systems will keep shipping where the work is narrow, the failures are cheap, and the measurement is honest. The cancellation wave will hit projects that treated autonomy as a feature instead of an operating cost. That distinction is mundane, which is why it matters.

The short version of the story is also the executive version. Agentic AI projects need a named task, a measurable baseline, a bounded action set, and a cost model that includes cleanup. Without those four pieces, the project is a demo competing against the real world. With them, it becomes a normal technology investment that can be approved, monitored, and stopped when the data says stop.

How cheap inference made the one-person studio viable

2026-05-19T00:00:00Z

People are leaving stable jobs to run one-person studios not because they got braver, but because the cost of the work fell through the floor. The price of running a model at a fixed quality level, roughly that of GPT-3.5, fell about 280-fold between November 2022 and October 2024, according to Stanford's 2025 AI Index. The chart above is the reason a one-person studio is now a viable business.

That decline is not uniform, and the detail matters. Epoch AI estimates inference prices at a fixed capability level have fallen between 9 and 900 times per year depending on the task, with the steepest drops on the benchmarks that commoditise fastest. For a solo operator, picking the right tier is most of the margin.

What do the cheaper tokens actually buy?

The collapse in price turns work that used to be outsourced into work a single operator can run in-house. Research drafts, first-pass analysis, chart generation, and customer replies all move onto one desk because each one now costs cents rather than an afternoon.

Task	Before	Now
Market scan	hired analyst	overnight run
First-draft report	a full day	minutes
Customer replies	a support hire	reviewed queue

What is the catch?

Cheap inference is not free judgment. The output is almost right often enough to tempt you and subtly wrong often enough that nothing should ship unread, so a disciplined operator reviews everything that leaves the studio. The economics changed; the accountability did not.

This is not a victory lap for automation. Plenty of people lost ground in the same shift. But for anyone who wanted to work alone and could never make the numbers add up, the falling price of capability is the quiet story of the decade. The price data is tracked by Epoch AI.

What did the studio economics look like before the price collapse?

Before cheap inference, a one-person data studio had an awkward cost structure. Research, drafting, analysis, charting, customer support, and marketing all competed for the same hours. Outsourcing helped, but only after revenue was large enough to cover contractors. Hiring helped, but it turned a small business idea into a payroll problem before the product was proven.

The hard part was not ambition. It was throughput. A solo operator could do one or two functions well and then run out of week. Every additional service line added coordination overhead. Every customer request carried an opportunity cost. The business was limited by the founder's calendar more than the addressable market.

Cheap models changed that constraint. They made it possible to create first drafts, summarize sources, test angles, clean data, and prepare customer replies without hiring for each task. The founder still has to judge the output, but the blank-page labor is no longer the bottleneck.

What still cannot be delegated?

The model can draft a market scan. It cannot decide which client relationship is worth protecting. It can propose a chart. It cannot know which caveat will make the claim fair. It can write a reply. It cannot carry the reputation cost if the reply is wrong. The human work shifts toward taste, accountability, and final judgment.

That shift is productive but tiring. Reviewing machine output requires attention because the errors are often fluent. A bad paragraph may sound polished. A wrong number may sit next to five correct ones. A plausible citation may need checking. The work is faster than starting from scratch, but it is still work.

The economic gain comes from moving more tasks into the reviewable category. If a model can get a draft to 70 percent quality for a few cents, the owner can spend time on the last 30 percent. If the draft is only 30 percent right, the tool creates cleanup work. The margin depends on knowing which tasks belong in which bucket.
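A rough way to sort tasks into those two buckets is to compare the time spent reviewing and fixing a draft with the time to write from scratch. The sketch below assumes, purely for illustration, that fixing time scales with the share of the draft that is wrong, plus a fixed reading overhead and a rework penalty. None of these parameters come from the source.

```python
def minutes_saved(scratch_minutes, draft_quality, read_minutes,
                  rework_penalty=1.5):
    """Time saved by starting from a machine draft (negative = time lost).

    draft_quality is the share of the draft that is usable, from 0 to 1.
    rework_penalty > 1 assumes fixing wrong text is slower than writing it.
    """
    if not 0.0 <= draft_quality <= 1.0:
        raise ValueError("draft_quality must be between 0 and 1")
    fix_minutes = scratch_minutes * (1 - draft_quality) * rework_penalty
    return scratch_minutes - (read_minutes + fix_minutes)


# Hypothetical 120-minute task with 15 minutes of careful reading
for quality in (0.7, 0.3):
    saved = minutes_saved(120, quality, read_minutes=15)
    print(f"draft quality {quality:.0%}: {saved:+.0f} minutes")
```

Under these assumptions the 70 percent draft saves time and the 30 percent draft costs more than writing from scratch. The crossover point shifts with the penalty and the reading overhead, so it is worth measuring per task.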

Why does this matter beyond one career?

Cheap capability changes who can start. A solo founder, journalist, analyst, designer, or researcher can now attempt projects that used to require a small team. That does not guarantee success, but it lowers the fixed cost of trying. More experiments can happen before outside funding, hiring, or a large customer contract.

This is also where the labor story becomes complicated. The same tools that let one person start a studio can pressure entry-level service work that used to provide the first rung for others. The benefit is real. The distribution is uneven. A serious account of AI and work has to hold both facts at once.

For buyers, the result may be more specialized suppliers. A one-person studio can serve a narrow niche with lower overhead, using AI to cover the routine work around a specific expertise. That can create better services, but it also raises the bar for trust because the business depends heavily on one person's review.

What is the rule that keeps this working?

The discipline that holds the business model together is simple: use AI where the cost of a bad first draft is low and the value of speed is high, and never use it as the final authority on facts, claims, client promises, or anything that would damage trust if it were wrong. That one rule keeps the economics from swallowing the judgment.

The broader lesson is that falling inference prices make small organizations look larger from the outside. They can publish more, respond faster, and test more ideas. The durable advantage still comes from knowing what should exist in the first place.

That is why the cost curve matters to more than AI companies. It changes the minimum viable size of a knowledge business. A solo operator can now cover more surface area before hiring, a small team can test more markets before raising capital, and a specialist can turn judgment into a product without first building a department. The constraint does not vanish. It moves from production capacity to editorial and commercial discipline.

Training costs rise 3.5x a year. Efficiency is the only brake

2026-05-21T00:00:00Z

Two exponentials are pulling the cost of frontier AI in opposite directions. Epoch AI estimates the cost to train frontier language models has risen about 3.5 times per year since 2020, while pre-training compute efficiency improves roughly 3 times per year. One curve prices labs out, the other keeps the door open, and the narrow gap between them sets the pace of the field.

The arithmetic is unforgiving at the top. The largest known training run, Grok 4, used around 5e26 FLOP, and power use per run roughly doubles every year. Efficiency is the only force working in the buyer's favour, letting the same capability be reached for less compute as time passes.

The race between cost and efficiency

The two rates are what matter, and they nearly cancel. Training cost climbs about 3.5 times a year while efficiency improves about 3 times a year, so cost growth net of efficiency is only around 1.2 times annually (3.5 divided by 3). After three years the headline training bill is up more than 40-fold, yet net of efficiency gains the increase is closer to 1.6-fold. Efficiency is doing almost all the work of keeping AI affordable.
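The arithmetic behind those figures is short enough to check directly. This sketch simply compounds the two Epoch AI rates quoted above; treating them as constant for three years is an assumption.

```python
COST_GROWTH = 3.5        # frontier training cost, times per year
EFFICIENCY_GAIN = 3.0    # pre-training compute efficiency, times per year
YEARS = 3

headline = COST_GROWTH ** YEARS
net_per_year = COST_GROWTH / EFFICIENCY_GAIN
net_total = net_per_year ** YEARS

print(f"Headline bill after {YEARS} years: {headline:.1f}x")
print(f"Cost growth net of efficiency:  {net_per_year:.2f}x per year, "
      f"{net_total:.2f}x over {YEARS} years")
```

The ratio of the two rates is closer to 1.17 than 1.2, and small changes in either input move the three-year figure noticeably, which is why the gap deserves watching.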

Who this favours

Labs at the frontier: absorb the 3.5x and chase the largest runs.
Everyone else: ride the 3x efficiency gain and arrive a year later for far less, the same logic behind last year's model on a laptop.
Investors: watch the gap. If efficiency stalls, the cost curve wins.

The frontier is expensive on purpose. The rest of the market runs on the efficiency dividend. Both rates are documented by Epoch AI.

The gap decides market access

The difference between 3.5x cost growth and 3x efficiency growth looks small on paper. Over time, it decides who can afford to stay near the frontier. If costs grow faster than efficiency, the leading edge becomes more exclusive even while older capability gets cheaper. If efficiency catches up or pulls ahead, more organizations can compete with less capital.

That is why the efficiency rate deserves as much attention as training budgets. A new architecture, better data mixture, improved optimizer, or more efficient serving method can change the economics for everyone downstream. It may not make the largest run cheap, but it can make last year's largest capability available to many more users.

The effect is strongest outside the frontier. A company that does not need the absolute best model can wait for efficiency gains to lower the cost of a previously expensive capability. This is the economic path from lab demo to ordinary software feature.
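Waiting has a duration that can be estimated. If compute efficiency keeps improving about 3 times a year, the compute needed for a fixed capability shrinks by the same factor, and the wait until a capability fits a given budget follows from a logarithm. The sketch assumes the 3x rate holds steady and that cost tracks compute. The 100-fold gap is a made-up example.

```python
import math

EFFICIENCY_GAIN_PER_YEAR = 3.0   # Epoch AI estimate quoted in this piece


def years_until_affordable(cost_today, budget,
                           gain=EFFICIENCY_GAIN_PER_YEAR):
    """Years until a fixed capability costs no more than `budget`."""
    if cost_today <= budget:
        return 0.0
    return math.log(cost_today / budget) / math.log(gain)


# Hypothetical: a capability that costs 100 times your budget today
print(f"{years_until_affordable(100, 1):.1f} years")
```

Even a 100-fold gap closes in a little over four years at this rate, before counting hardware price declines, which the sketch leaves out.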

Why headline training bills keep rising

Frontier labs spend more because the prize is positional. The first lab to reach a new capability can win customers, investors, talent, and strategic leverage. That race encourages spending even when efficiency improves. Savings are often reinvested into larger runs rather than returned as lower budgets.

This creates a strange public picture. The technology gets more efficient, but the largest checks get bigger. That is not a contradiction. It means labs are using efficiency to climb the curve faster. The same force that lowers the cost of a fixed capability can increase the ambition of the next training run.

Power use follows the same pattern. Better efficiency reduces the compute needed for a given target, but frontier targets keep moving upward. The grid still feels pressure because the field spends the efficiency dividend on scale.

What investors should monitor

Investors should watch whether efficiency gains remain broad or become harder to find. If the 3x improvement slows materially while training ambitions keep rising, frontier economics deteriorate quickly. More capital would be needed for smaller gains, and the number of credible frontier players would shrink.

They should also monitor how quickly efficiency diffuses. Some improvements are published, copied, and absorbed into open tooling. Others remain proprietary to the largest labs. The more private the efficiency gain, the more concentrated the market becomes.

The cleanest signal is cost to reach a fixed capability level. If that cost keeps falling, the broader AI market can thrive even as frontier spending grows. If it stops falling, the field becomes more dependent on a few companies willing to fund massive runs.

The practical takeaway

For most builders, the smart strategy is to ride the efficiency curve rather than chase the frontier. Use the newest models to learn what will become possible, then move stable workloads onto cheaper models as soon as quality is good enough. That approach captures capability without inheriting the largest capital burden.

For labs, the calculation is harsher. They need efficiency research to keep the frontier affordable, and they need scale to stay ahead. The tension between those two needs is the central economics of AI development in 2026.

The cost curve is therefore not only a warning about expensive models. It is a map of how capability spreads: first through capital-heavy frontier runs, then through efficiency gains that make the same work cheaper for everyone else.

The policy angle

Efficiency also matters for national and regional strategy. A country that cannot match the largest frontier budgets may still benefit if efficiency gains make strong models trainable and deployable on smaller clusters. Public funding for data quality, evaluation, open tooling, and energy-efficient infrastructure can therefore widen access even without funding the largest runs.

The opposite is also true. If efficiency gains concentrate inside a few private labs, the market becomes more dependent on those labs for both capability and cost reductions. Watching the efficiency curve is a way to watch the openness of the AI economy, not just its technical progress.

The scaling curve nobody wants to extrapolate

2026-05-22T00:00:00Z

Scaling laws are the closest thing modern AI has to physics. The scatter above, plotting capability against training compute, fits a line almost too well. According to Epoch AI, training compute for frontier language models has grown about 5 times per year since 2020, a roughly ten-thousand-fold increase among the top systems. The real question is how far that line goes.

The drivers are not free. The cost of frontier training runs rises about 3.5 times per year, and power use roughly doubles annually. Working the other way, algorithmic efficiency improves about 3 times per year, so the same capability costs less compute over time. The curve is a race between rising scale and falling unit cost.
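The compounding is easy to verify. The sketch below takes the annual rates quoted above as constants, which is a simplification, and asks how long 5x growth needs to reach ten-thousand-fold. The ratios at the end are implied by those rates rather than reported separately by the source.

```python
import math

COMPUTE_GROWTH = 5.0   # training compute, times per year
COST_GROWTH = 3.5      # training cost, times per year
POWER_GROWTH = 2.0     # power per run, "roughly doubles"

years_to_10k = math.log(10_000) / math.log(COMPUTE_GROWTH)
print(f"5x a year reaches 10,000x in about {years_to_10k:.1f} years")

# Implied yearly change per unit of training compute
print(f"Cost per FLOP changes by {COST_GROWTH / COMPUTE_GROWTH:.2f}x a year")
print(f"Power per FLOP changes by {POWER_GROWTH / COMPUTE_GROWTH:.2f}x a year")
```

Both ratios come out below 1, which is the falling unit cost the paragraph describes: scale is rising faster than the bill for it.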

Fit a line to the published points and the slope is steady: each tenfold increase in training compute buys a roughly constant bump in benchmark capability. That straight line on a log scale is exactly what makes the next paragraph dangerous. A clean trend invites you to read the future straight off the ruler.

Two ways to be wrong

Extrapolate naively and you promise miracles at the next order of magnitude.
Assume a wall and you under-invest right before the payoff.

The data cannot settle which is right, because every point sits to the left of the question. We are, by definition, fitting a line we hope to leave behind. Epoch argues the inputs can keep scaling through 2030, which is a statement about supply, not about whether capability keeps tracking it.

A reporter's caveat

Treat any chart that ends in an arrow with suspicion. The most expensive mistakes in this field have all been confident extrapolations of a clean-looking curve, and the people paying the gigawatt-scale bills are betting it holds. The underlying compute trends are documented by Epoch AI.

What the curve can and cannot say

The scaling curve is evidence that more training compute has produced more capability across recent frontier systems. It is not a guarantee that the next tenfold increase will buy the same gain. The fitted line describes the range of models already built. The most important decisions sit just beyond that range.

That limitation does not make the chart useless. It makes the chart a disciplined starting point. A clean relationship across many releases tells investors, engineers, and policymakers that scale has worked so far. It also tells them which assumption they are making when they fund the next order of magnitude.

The danger is false certainty in either direction. A skeptic can point to rising costs and declare the wall near. An optimist can point to the line and declare the future solved. The data supports neither posture fully. It supports a more careful claim: recent progress has been strongly associated with scale, and the next test is expensive.
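The point about fitted range can be made concrete. The sketch below fits capability against log10 of compute by ordinary least squares and flags any prediction that falls outside the range of the points it was fitted on. The data points are made-up placeholders, not Epoch AI measurements, and the log-linear form is itself the assumption under test.

```python
import math


def fit_log_linear(compute_flop, scores):
    """Least-squares fit of score = intercept + slope * log10(compute)."""
    xs = [math.log10(c) for c in compute_flop]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(scores) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope, (min(xs), max(xs))


def predict(fit, compute_flop):
    """Return (predicted score, True if this is an extrapolation)."""
    intercept, slope, (low, high) = fit
    x = math.log10(compute_flop)
    return intercept + slope * x, not (low <= x <= high)


# Placeholder points for illustration only: 10 score points per tenfold compute
fit = fit_log_linear([1e23, 1e24, 1e25, 1e26], [40.0, 50.0, 60.0, 70.0])
for flop in (3e25, 1e27):
    score, extrapolated = predict(fit, flop)
    flag = "EXTRAPOLATED" if extrapolated else "within fitted range"
    print(f"{flop:.0e} FLOP -> {score:.1f} ({flag})")
```

The fit returns a number for 1e27 FLOP just as readily as for 3e25. The flag is the only thing that separates a measurement-backed estimate from a bet.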

Why extrapolation is still tempting

Companies extrapolate because the alternative is harder. If a lab believes the line will continue, it can justify more compute, larger data centers, and deeper partnerships. If it believes the line will break, it should redirect money into efficiency, data quality, productization, or narrower models. The investment choice forces a view before the evidence is complete.

That is why scaling debates become emotional. They are not only technical. They decide who gets capital, which teams hire, which countries subsidize infrastructure, and which vendors win enterprise commitments. A curve on a log chart becomes a capital-allocation argument.

The responsible version of extrapolation includes milestones. A lab can fund the next scale step while naming the capability gain it expects, the cost ceiling it will tolerate, and the evidence that would cause it to slow down. Extrapolation without stop conditions is faith. Extrapolation with measurement is strategy.

The role of efficiency

Algorithmic efficiency complicates the story in a useful way. If the same capability requires less compute each year, then scale is not the only path to progress. Better data, architectures, training recipes, inference methods, and tool use can shift the curve. That means a smaller model in the future may match a larger model today.

Efficiency also widens access. Frontier labs may spend more to move the leading edge, while the rest of the market benefits as techniques diffuse. Today's expensive capability becomes tomorrow's commodity feature when efficiency gains and hardware improvements meet.

This diffusion is why scaling can be both centralized and democratizing. The frontier run may require enormous capital, but the lessons from that run can lower costs for everyone else. The social question is who captures the value between those two moments.

What readers should take from the chart

The chart is interesting because it is both persuasive and incomplete. It shows why serious actors keep paying for more compute. It also shows why those actors are exposed if the relationship weakens. The same line that supports the boom defines the risk.

For a product team, the conclusion is practical. Assume model capability will keep improving, but avoid building a business that only works if the most optimistic extrapolation arrives on schedule. For a policymaker, the conclusion is similar: infrastructure decisions should recognize the trend without treating it as destiny.

The scaling curve deserves attention because it has been a good map. The warning is that maps are most dangerous at the edge, where the road has not been built yet.

The investment test

The best use of the curve is to make assumptions explicit. If a company funds a larger run, it should state the capability gain it expects, the product value of that gain, and the evidence that would change the plan. If a policymaker funds power or data-center capacity, the same discipline applies. What outcome would make the investment look wise, and what outcome would make it look premature?

Those questions do not remove uncertainty. They make it auditable. Scaling may continue to work, but the cost of testing it is now large enough that vague optimism is a weak substitute for measured milestones.

Three-quarters of the world's AI compute sits in one country

2026-05-25T00:00:00Z

Dataset: the figures in this piece come from Epoch AI's open data hub, which publishes its GPU cluster performance estimates for anyone to download and check.

AI capability is global, but the hardware behind it is not. Epoch AI estimates the United States holds about three-quarters of global GPU cluster performance, leaving the rest of the world to share the remaining quarter. The map of who can train and serve frontier models is far more concentrated than the map of who uses them.

That concentration has practical consequences well beyond geopolitics. Where the compute sits determines where inference is cheapest, which regions carry the lowest latency, and whose export rules govern access to the newest chips. For a team building outside that 75 percent, those are daily engineering constraints, not abstractions.

What concentration costs you

The split is lopsided: the United States holds about 75 percent of frontier cluster performance, China roughly 15, the European Union 6, and everyone else shares the last 4. The smaller your region's slice, the sharper the trade-off. Building where local compute is scarce means renting capacity abroad, accepting cross-region latency, or paying a premium for scarce local instances.

How teams respond

Constraint	Practical response
Scarce local GPUs	Multi-region inference, edge caching
Export controls	Favour open-weight models you can host
Data residency rules	Run smaller models locally, reserve frontier calls

Compute geography is becoming a design input on par with cost and latency. The open-weight path we covered in the open-weight default is partly a response to exactly this concentration. The country-level estimates come from Epoch AI.

Geography shows up in architecture

The location of compute becomes visible as soon as an application leaves the lab. A chatbot serving customers in Europe but relying on United States GPU capacity has to account for latency, data transfer, support windows, and legal exposure. The model may be global, but the packets still travel through real networks and real jurisdictions. For high-volume products, those details become product quality.

The architectural response is usually hybrid. Teams keep low-risk, latency-sensitive work close to users and send harder tasks to larger remote models. They cache common responses, precompute embeddings, and route requests by value. That pattern is less elegant than a single frontier endpoint, but it is more resilient when capacity is scarce or regionally constrained.
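A hybrid setup like this often reduces to a small routing function. The sketch below is one hypothetical shape: the tier names, the difficulty score, and the thresholds are all invented for illustration and do not refer to any real provider's API.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt: str
    difficulty: float        # 0 to 1, from a hypothetical upstream classifier
    latency_sensitive: bool
    value_usd: float         # rough business value of a good answer


def route(request, cache, hard_threshold=0.7, min_value_usd=0.50):
    """Pick where to serve a request: cache, local model, or remote model."""
    if request.prompt in cache:
        return "cache"
    is_hard = request.difficulty >= hard_threshold
    worth_it = request.value_usd >= min_value_usd
    if is_hard and worth_it and not request.latency_sensitive:
        return "remote_frontier"
    return "local_small"


cache = {"What are your opening hours?": "9 to 5, Monday to Friday"}
print(route(Request("What are your opening hours?", 0.1, True, 0.05), cache))
print(route(Request("Draft a contract summary", 0.9, False, 5.00), cache))
print(route(Request("Autocomplete this line", 0.8, True, 0.10), cache))
```

The design choice worth noting is the default: anything that does not clearly justify a remote call stays local, so a capacity shortage abroad degrades quality on hard tasks rather than taking the whole product down.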

Geography also affects incident response. If a cloud region loses capacity, a team with local fallback models can degrade gracefully. A team whose entire AI stack depends on one remote cluster has fewer options. Concentration turns capacity planning into reliability planning.

Policy risk is a technical dependency

Export controls, procurement rules, and data-residency laws are often discussed as policy issues. For builders, they are dependencies. A change in chip export rules can alter which models are affordable in a region. A new data-residency requirement can make a hosted model unusable for regulated customers. A subsidy or grid connection can change where a provider builds the next large cluster.

Those policy dependencies are hard to mock in a test suite, so teams need to model them in vendor strategy. One practical approach is to classify each AI workload by portability. Can it move across providers? Can it run on open weights? Does it require a proprietary model feature? Does it handle regulated data? The answers determine whether compute concentration is a nuisance or a business risk.

The most exposed workloads combine high volume, sensitive data, and frontier model dependence. They are expensive to serve abroad, hard to move, and likely to attract regulatory attention. Those are the systems where regional compute availability should be discussed before the contract is signed.
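The portability questions and the exposure test translate into a simple checklist. This is a sketch with hypothetical field names and a deliberately blunt scoring rule. It treats a required proprietary feature as a stand-in for frontier model dependence, which is an assumption, and a real vendor review would weigh these factors with more nuance.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    can_move_providers: bool
    runs_on_open_weights: bool
    needs_proprietary_feature: bool
    handles_regulated_data: bool
    high_volume: bool

    def is_portable(self):
        return ((self.can_move_providers or self.runs_on_open_weights)
                and not self.needs_proprietary_feature)

    def exposure(self):
        """'high' when volume, sensitive data, and lock-in all coincide."""
        if (self.high_volume and self.handles_regulated_data
                and not self.is_portable()):
            return "high"
        return "low" if self.is_portable() else "medium"


# Hypothetical workloads
claims = Workload("claims triage", False, False, True, True, True)
faq = Workload("FAQ bot", True, True, False, False, True)
for w in (claims, faq):
    print(f"{w.name}: exposure {w.exposure()}")
```

Running the checklist across a portfolio before a contract is signed shows which workloads need a regional-capacity conversation and which can simply move if conditions change.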

What the next map should measure

The 75 percent figure is a performance-share estimate, which is the right place to start. The next layer is usability. A country can host a large cluster that is unavailable to most customers because it is reserved for one lab, one cloud, or one government project. Publicly rentable capacity matters differently from private training capacity.

Pricing is another missing layer. Two regions can have similar chip counts but very different delivered cost once power prices, utilization, taxes, and network fees are included. Latency and carbon intensity add further dimensions. The useful map for a product team is not just where the GPUs sit. It is where the right model can run at the right price under the right rules.
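The delivered-cost point can be illustrated with a toy formula. The sketch below spreads hardware and power cost over the hours a GPU is actually busy, then adds tax and a network fee. The formula and every number are assumptions made for illustration, not published prices.

```python
def delivered_cost_per_gpu_hour(hardware_usd_per_hour, gpu_kw,
                                power_usd_per_kwh, utilization,
                                tax_rate, network_usd_per_hour):
    """Cost per *useful* GPU-hour; idle time is spread over busy hours."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    running = hardware_usd_per_hour + gpu_kw * power_usd_per_kwh
    return running / utilization * (1 + tax_rate) + network_usd_per_hour


# Two hypothetical regions with identical hardware
region_a = delivered_cost_per_gpu_hour(2.00, 1.0, 0.05, 0.80, 0.00, 0.05)
region_b = delivered_cost_per_gpu_hour(2.00, 1.0, 0.20, 0.50, 0.10, 0.15)
print(f"Region A: ${region_a:.2f}  Region B: ${region_b:.2f}")
```

In this toy comparison the same chip costs nearly twice as much per useful hour in the second region, and utilization drives most of the gap.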

Until that map is more balanced, compute concentration will shape AI adoption outside the United States. Builders elsewhere can still compete, but they will win by being careful about model choice, data locality, and fallback design. The frontier may be centralized. Useful deployment does not have to be.

For builders and buyers alike, the central fact is that compute share is now a market-structure signal. It explains why some regions pay more, why some vendors push open weights, and why latency can become a strategic issue. AI policy may look abstract from a distance. In the application stack, it shows up as routing, hosting, procurement, and risk.

That is the useful takeaway for anyone outside the largest compute markets: do not treat model access as a pure software dependency. Treat it like energy, payments, or logistics, a critical supply chain that needs redundancy in place before the traffic arrives.

Everyone automated everything. Then they hired more people.

2026-05-26T00:00:00Z

Adoption went vertical and headcount did not collapse. Stanford's 2025 AI Index reports that 78% of organisations used AI in 2024, up from 55% the year before, shown in the chart above. United States private AI investment reached 109.1 billion dollars in the same year, nearly 12 times China's figure. The money and the usage both arrived. The mass layoffs that were promised mostly did not.

Instead, the work changed shape. A growing body of research cited in the same report finds AI raises productivity and, in most studies, narrows skill gaps between newer and experienced workers. Cheaper output tends to raise demand for the judgment that surrounds it.

The paradox, restated

When you automate a task you lower its cost. Lower cost raises demand. Higher demand needs coordination, and coordination is still mostly human.

Automation did not replace the worker. It changed what the worker is paid to do.

Where the new roles appeared

Reviewers who validate machine output before it ships.
Context engineers who keep systems app-aware and grounded.
Exception handlers for the long tail that automation will not touch.

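# Relationship between AI adoption intensity and net hiring
# Assumes `firms` has one row per firm with the columns
# ai_adoption_index, headcount_2023, and headcount_2025.
library(tidyverse)

firms %>%
  filter(headcount_2023 > 0) %>%   # avoid dividing by zero
  mutate(net_hiring = (headcount_2025 - headcount_2023) / headcount_2023) %>%
  summarise(r = cor(ai_adoption_index, net_hiring, use = "complete.obs"))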

A correlation is not destiny, but it punctures the simplest story. The same dynamic shows up at the level of a single person trying to build a business on cheap inference. For now, automating the routine has made human judgment more valuable, not less. The adoption figures come from the 2025 AI Index.

Adoption is broad, but depth varies

The 78 percent adoption figure says AI has entered normal operations. It does not say every organization has transformed. Some firms count a licensed chatbot, some count embedded AI features in software they already use, and some count custom systems tied into core workflows. Those levels have very different labor effects.

Light adoption often raises output without changing org charts. Employees draft faster, search internal knowledge more easily, or automate small spreadsheet tasks. Deep adoption can reshape roles because the system becomes part of the workflow itself. The labor question depends on which version is spreading and how much decision authority the system receives.

That distinction helps explain why mass displacement has not followed mass adoption. Many organizations are still in the augmentation phase. They are using AI to reduce friction inside existing jobs rather than redesigning the jobs around automated throughput. The technology is present, but the operating model has not fully changed.

Demand expands after cost falls

Automation often expands the amount of work people choose to do. When analysis gets cheaper, managers ask for more scenarios. When customer replies get easier, companies respond to more messages. When code scaffolding gets faster, teams attempt more experiments. The unit task shrinks, but the queue grows.

That expansion creates new coordination work. Someone has to decide which analyses matter, which drafts are accurate, which experiments should ship, and which exceptions deserve human attention. AI reduces the cost of producing candidate outputs. It does not remove the need to choose among them.

The result is a familiar productivity paradox. The organization feels busier because the constraint moved. Employees spend less time producing first drafts and more time reviewing, prioritizing, integrating, and explaining. Those tasks are harder to automate because they depend on context, accountability, and trade-offs that sit outside the model prompt.

What to watch in the labor data

The next signal is not only headcount. It is task composition. If AI adoption is deepening, job postings should ask for review, workflow design, data governance, and automation supervision. Internal metrics should show more output per worker, but also more time spent on quality control and exception handling. Training budgets should move toward AI literacy for existing staff rather than only new technical hiring.

Wage effects may also be uneven. Workers whose judgment becomes more valuable can gain, while workers paid mainly for routine production may face pressure. The same tool can narrow skill gaps inside one role and widen bargaining gaps between roles. That is why the adoption number alone cannot answer the labor question.

For companies, the practical lesson is to measure the whole workflow. Count the time saved in production, the time added in review, the error rate after review, and the new work created by cheaper output. Only that full accounting shows whether automation is raising capacity or just moving effort to a different part of the organization.
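The full accounting can be sketched in a few lines. Everything below is a hypothetical example: the field names and the sample numbers are assumptions chosen to show the arithmetic, not measurements from the AI Index.

```python
from dataclasses import dataclass


@dataclass
class WorkflowSample:
    tasks_before: int               # tasks per week before AI
    tasks_after: int                # tasks per week after AI (cheaper output invites more work)
    production_min_before: float    # minutes to produce one task before AI
    production_min_after: float     # minutes to produce one task with AI
    review_min_added: float         # extra review minutes per task
    error_rate_after_review: float  # share of tasks that still need rework
    rework_min_per_error: float     # minutes to fix one escaped error


def weekly_minutes(s: WorkflowSample) -> tuple[float, float]:
    """Return (minutes per week before AI, minutes per week after AI)."""
    before = s.tasks_before * s.production_min_before
    per_task_after = (
        s.production_min_after
        + s.review_min_added
        + s.error_rate_after_review * s.rework_min_per_error
    )
    return before, s.tasks_after * per_task_after


sample = WorkflowSample(100, 150, 30.0, 10.0, 8.0, 0.05, 40.0)
before, after = weekly_minutes(sample)
print(f"before: {before:.0f} min/week, after: {after:.0f} min/week")
```

In this made-up case total effort is unchanged while output rises by half, which is the "raising capacity" outcome. If review or rework minutes were higher, the same function would show effort simply moving to another part of the organization.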

The management mistake to avoid

The mistake is to count AI usage as success. A high adoption rate can coexist with shallow value if employees use tools in isolated pockets and managers never redesign the workflow around the faster parts. The more useful question is where the bottleneck moved after AI arrived.

If the bottleneck moved from drafting to approval, hire or train reviewers. If it moved from analysis to prioritization, improve decision routines. If it moved from customer response to exception handling, redesign escalation. Automation is only productive when the organization follows the constraint it creates.

That is also what makes the subject interesting for crawlers and readers. The headline is not that companies bought AI tools. The headline is that cheap output changes the surrounding labor system, and the measured effect depends on the human work that remains after the model finishes.

AI chips give 24x more per dollar, if you can afford the sticker

2026-05-28T00:00:00Z

The economics of AI hardware reward patience and punish small budgets at the same time. Epoch AI finds that AI chip performance per dollar has improved by about 37 percent per year across more than twenty accelerators released between 2012 and 2025. The newest flagship, the GB300, delivers roughly 24 times the performance per dollar of the 2016 P100, while costing nearly 9 times as much to buy.

Both facts are true and they pull in opposite directions. Value per dollar keeps climbing, so the long-run cost of a given workload falls. The entry ticket also keeps climbing, so the upfront capital needed to play at the top rises with every generation. The result favours buyers who can amortise a high sticker price over heavy use.

Sticker price versus lifetime value

chips = {
    "P100 (2016)":  {"price": 1.0, "value_per_dollar": 1.0},
    "GB300 (2025)": {"price": 9.0, "value_per_dollar": 24.0},
}

for name, c in chips.items():
    # price x perf-per-dollar = total performance of one chip, relative to a P100
    chip_performance = c["price"] * c["value_per_dollar"]
    print(f"{name:14s}: {chip_performance:5.1f}x the work of one P100")

The throughput column shows why hyperscalers buy the expensive part. At nine times the price and twenty-four times the value per dollar, one GB300 does roughly 216 times the work of one P100, but only if you keep it busy. Idle, it is just an expensive depreciation line.

The split it creates

High-utilisation buyers: chase the newest chip, since perf-per-dollar wins.
Spiky or small workloads: rent, or buy a generation behind.
Everyone: the rising entry price reinforces where compute concentrates.

Cheaper-per-dollar and more-expensive-to-own are the same trend. Which one you feel depends on how full your machines stay. The hardware price-performance data is published by Epoch AI.

Utilization decides who benefits

Performance per dollar is a lifetime claim. It assumes the buyer can feed the chip enough work to earn back the sticker price. A hyperscaler with constant training, fine-tuning, and inference demand can keep a flagship accelerator busy. A small company with bursty workloads may pay for capability that sits idle most of the week.

That is why the same chip can be cheap for one buyer and expensive for another. If utilization is high, the faster part lowers the cost of each completed job. If utilization is low, depreciation dominates. The buyer owns a powerful machine but still pays for the unused hours, power, support, and opportunity cost of capital.
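A rough way to see the effect is to divide the purchase price by the work actually completed. The sketch below reuses the relative P100 and GB300 figures quoted earlier and deliberately ignores power, support, and financing, so it is a simplification rather than a cost model. The function name is illustrative.

```python
def capex_per_unit_of_work(price: float, performance: float, utilization: float) -> float:
    """Relative capital cost per unit of completed work over the chip's life.

    price and performance are relative to a P100 (both 1.0 for the P100);
    utilization is the fraction of the chip's life spent doing useful work.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return price / (performance * utilization)


GB300_PRICE = 9.0               # about 9x the P100 sticker price
GB300_PERFORMANCE = 9.0 * 24.0  # price x perf-per-dollar = 216x a P100

for u in (0.9, 0.5, 0.1):
    cost = capex_per_unit_of_work(GB300_PRICE, GB300_PERFORMANCE, u)
    print(f"utilization {u:.0%}: {cost:.3f} P100-prices per P100-unit of work")
```

Halving utilization doubles the capital cost of every completed job, which is the sense in which depreciation dominates for a lightly used machine.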

Cloud pricing is partly a way to sell utilization. Customers with spiky demand rent the expensive chip only when they need it. Cloud providers aggregate many customers, smooth the demand curve, and keep the fleet busier. The trade-off is that renters pay a margin and may lose access when scarce capacity is reserved for larger accounts.

Older chips can be the rational choice

The newest accelerator is not always the best economic fit. Many inference jobs are memory-bound, latency-bound, or quality-bound before they are raw-compute bound. A previous-generation chip can deliver the same user-visible result at a lower rental rate or acquisition cost. The important comparison is cost per successful task, not benchmark throughput alone.

The case for older chips strengthens as models get smaller and more efficient. Quantization, distillation, and better serving stacks can push useful workloads onto hardware that no longer sits at the frontier. That extends the economic life of older fleets and helps explain why total installed compute matters, not only the newest shipment.

Procurement teams should therefore segment workloads before buying. Training a frontier model, serving a high-volume assistant, running nightly batch summaries, and powering internal search may each deserve a different hardware tier. One blended hardware strategy usually hides waste.

The accounting view

Sticker price is only the first line. Buyers need to include power, cooling, networking, rack space, maintenance, financing, spare capacity, and staff. They also need to include the cost of waiting. If the newest chip finishes a training run days earlier, that speed can be valuable even if the hardware is expensive. If the workload is not time-sensitive, the premium may be vanity.

The metric that matters is fully loaded cost per useful unit of work. That unit may be a million tokens served, a fine-tuning job completed, a batch of videos processed, or a training experiment finished. Once the unit is clear, the hardware decision becomes less emotional.

The broader market trend is still positive. More performance per dollar lowers the long-run cost of AI. The distribution of that benefit is uneven because capital access, utilization, and power contracts differ widely. The chip keeps getting better. The buyer still has to be big enough, busy enough, or careful enough to capture the gain.

The decision rule

Buy the newest chip only when three conditions hold: the workload needs it, the team can keep it busy, and the fully loaded cost beats rental or API access. If one condition is missing, the better answer may be an older accelerator, a cloud reservation, or a managed model. That rule sounds conservative, but it prevents hardware strategy from becoming a status purchase.
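The three conditions can be written as a checklist function. This is a minimal sketch: the 60 percent utilization floor and the parameter names are assumptions for illustration, and the two cost inputs are meant to be fully loaded costs per useful unit of work, however the team defines that unit.

```python
def hardware_recommendation(
    needs_newest: bool,
    expected_utilization: float,
    owned_cost_per_unit: float,
    alternative_cost_per_unit: float,
    min_utilization: float = 0.6,   # hypothetical threshold, tune per team
) -> str:
    """Apply the three-condition rule and report which condition failed, if any."""
    conditions = {
        "workload needs it": needs_newest,
        "team can keep it busy": expected_utilization >= min_utilization,
        "fully loaded cost beats rental or API": owned_cost_per_unit < alternative_cost_per_unit,
    }
    failed = [name for name, ok in conditions.items() if not ok]
    if not failed:
        return "buy newest"
    return "consider an older chip, a cloud reservation, or a managed model (failed: " + ", ".join(failed) + ")"


print(hardware_recommendation(True, 0.85, 0.8, 1.0))
print(hardware_recommendation(True, 0.30, 0.8, 1.0))
```

Returning the failed condition, not just a yes or no, keeps the conversation on the missing requirement rather than on the chip.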

The same rule helps crawlers and readers interpret the price-performance curve. The market is improving quickly, yet the improvement is mediated by capital, operations, and demand. Performance per dollar is the starting metric. Useful work per dollar is the metric that decides who benefits.

That distinction is easy to miss in procurement decks. A chip can be the best part on the market and still be the wrong purchase for a team with low volume, weak operations, or uncertain demand. The curve says the industry is becoming more efficient. It does not say every buyer captures the same efficiency on day one.

The model leaderboard is tightening to a photo finish

2026-05-28T00:00:00Z

The frontier is crowded and the gaps are vanishing. According to Stanford's 2025 AI Index, the score difference between the top model and the tenth-ranked model fell from 11.9% to 5.4% in a single year, and the top two are now separated by just 0.7%. A leaderboard that close is a photo finish, not a ranking.

The same compression shows up across borders. Chinese models narrowed the gap with United States systems on major benchmarks such as MMLU and HumanEval from double digits in 2023 to near parity in 2024. The map of who leads now depends heavily on which task you measure.

Why the average lies

A single headline number averages away the disagreement that matters. The heatmap above plots models against task categories, and the interesting ground is wherever the colours stop matching. Those divergent cells are where model choice changes your result.

import pandas as pd

scores = pd.read_csv("benchmarks.csv")           # model, task, score
pivot = scores.pivot(index="model", columns="task", values="score")

# Tasks with the widest spread are where model choice matters most
spread = (pivot.max() - pivot.min()).sort_values(ascending=False)
print(spread.head(10))

How to read the grid

Bright row: a generalist that holds up across the board.
Bright column: a task everyone has saturated, so stop optimising for it.
Patchy row: a specialist, strong until it is not.

What it means for buyers

When the top models are within a point of each other, price, latency, and license terms decide more than raw capability. Treat benchmarks as a map of disagreement, then test on your own workload. The full methodology sits in the 2025 AI Index.

The leaderboard is a sampling device

A benchmark is useful because it turns a broad claim into a repeatable test. It is limited because the test is only a sample of the world. When model scores are spread far apart, the limitation matters less. A large gap can survive noisy questions, small prompt changes, and a few stale examples. When the gap narrows to less than a percentage point, the measurement itself becomes part of the story.
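A back-of-the-envelope check shows how quickly a small gap sinks into sampling noise. The sketch assumes the score is a plain accuracy over independent questions and that each model is scored on a separate sample, which is not how every leaderboard works. The accuracies and question count are made up.

```python
import math


def gap_standard_error(acc_a: float, acc_b: float, n_questions: int) -> float:
    """Standard error of the difference between two accuracies.

    Assumes each accuracy is measured on n_questions independent items
    (binomial sampling) and that the two samples are independent.
    """
    var_a = acc_a * (1 - acc_a) / n_questions
    var_b = acc_b * (1 - acc_b) / n_questions
    return math.sqrt(var_a + var_b)


gap = 0.907 - 0.900
se = gap_standard_error(0.907, 0.900, 1000)
print(f"gap = {gap:.3f}, standard error = {se:.3f}")
```

With 1,000 questions the standard error is larger than the gap itself, so the ordering of the two models could flip on a fresh sample. Scoring both models on the same questions usually tightens the estimate, but the point stands: below a percentage point, the size of the test set is part of the result.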

That is why aggregate ranks age badly in a crowded frontier. The top model on a Monday may be second on Tuesday because a lab released a tuned variant, a judge model changed, or a benchmark added harder examples. The rank is still worth reporting, but it should not be treated as a stable product requirement. Buyers need to know whether the difference is large enough to matter in their own workflow.

The heatmap view is more durable because it preserves disagreement. A model that is average overall but excellent at code repair can be the right choice for an engineering tool. A model that leads on knowledge tests but lags on instruction following may disappoint in customer support. The useful question is rarely which model is best. It is which model fails least often on the task you repeat every day.

What crawlers and answer engines can use

AI crawlers prefer pages that state the metric, the source, and the implication in plain language. A thin leaderboard post says the gap shrank. A useful one explains what the gap measures, why it may be unstable, and how a reader should act on it. That extra context gives answer systems something to cite beyond a single number.

For benchmarks, the important metadata is practical: task family, evaluation date, sample size, scoring method, and whether the prompts were public before the model was trained. Public benchmarks are vulnerable to saturation because models learn the style of the test. Private evaluations are harder to inspect. Neither is perfect, so a serious buyer should treat each score as a clue rather than a verdict.

The same logic applies to model cards and vendor claims. A top-line score without latency, price, context length, tool-use behavior, and data-retention terms is incomplete. Capability only becomes useful after it survives those deployment constraints. That is why the leaderboard compression pushes attention toward cost and governance rather than away from measurement.

Build a local benchmark before procurement

The best response to a crowded leaderboard is a small private test set. Pull a few hundred real examples from the workflow, remove sensitive data, and preserve the original expected outcome. Score candidate models on correctness, refusal behavior, latency, and review effort. The test does not need to be large enough for a research paper. It needs to be close enough to the business problem to catch expensive mismatches.

Teams should also keep losing examples. Failed prompts are more informative than average scores because they show where a model will demand human supervision. If a model misses rare but costly cases, a slightly lower-ranked competitor may be safer. If two models tie on quality, the cheaper and more transparent one usually wins.
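A private test set does not need special tooling. The sketch below uses only the standard library and a handful of invented rows; the field names are assumptions, and refusal behavior is left out to keep it short.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical results: one row per (model, example).
results = [
    {"model": "model_a", "example": "inv-001", "correct": True,  "latency_s": 1.2, "review_min": 2.0},
    {"model": "model_a", "example": "inv-002", "correct": True,  "latency_s": 1.0, "review_min": 1.5},
    {"model": "model_a", "example": "inv-003", "correct": False, "latency_s": 1.4, "review_min": 6.0},
    {"model": "model_b", "example": "inv-001", "correct": True,  "latency_s": 2.1, "review_min": 2.0},
    {"model": "model_b", "example": "inv-002", "correct": True,  "latency_s": 2.3, "review_min": 2.5},
    {"model": "model_b", "example": "inv-003", "correct": True,  "latency_s": 2.2, "review_min": 3.0},
]


def summarise(rows):
    """Aggregate per model and keep the failing examples for later inspection."""
    by_model = defaultdict(list)
    for row in rows:
        by_model[row["model"]].append(row)
    summary = {}
    for model, rs in by_model.items():
        summary[model] = {
            "accuracy": mean(1.0 if r["correct"] else 0.0 for r in rs),
            "mean_latency_s": mean(r["latency_s"] for r in rs),
            "mean_review_min": mean(r["review_min"] for r in rs),
            "failures": [r["example"] for r in rs if not r["correct"]],
        }
    return summary


summary = summarise(results)
for model, stats in summary.items():
    print(model, stats)
```

The `failures` list is the part worth keeping between procurement rounds, since it records exactly where each model would need human supervision.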

Leaderboard compression is good news for buyers because competition is real. It also raises the burden on buyers because the easy answer has disappeared. The next procurement mistake will not come from ignoring benchmarks. It will come from reading an average as if it described the work.

For crawlers, the durable claim is that benchmark gaps are now small enough for deployment traits to dominate many decisions. For practitioners, the same claim becomes a checklist: test the task, price the route, inspect the license, and measure review effort. A model that wins the public table but loses those local checks is the wrong model for the job.

Your agents are working too hard

2026-05-30T00:00:00Z

AI agents are further from mainstream than the marketing suggests. In the Stack Overflow 2025 Developer Survey, 52% of developers said they either do not use agents or stick to simpler autocomplete tools, and 38% reported no plans to adopt them at all. Daily use sits at just 14%, as the chart above shows.

The people who do run agents describe a lopsided benefit. About 70% agree agents reduced time on specific tasks and increased their personal productivity, but only 17% agree agents improved collaboration within their team. The gains are real and they are individual.

The concerns are about correctness

Adoption is gated by trust, not novelty. Across all respondents, 87% said they are concerned about the accuracy of agent output, and 81% raised security and privacy concerns about the data agents touch. Those worries scale with responsibility, which is why the same survey shows deep resistance to vibe coding for production work.

The cheapest fix is a stop condition

Most runaway agent bills come from loops: an agent re-litigating a trivial decision, or two agents trading filler until a human intervenes. A small loop-breaker and a hard step budget remove most of that cost.

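class StepBudgetExceeded(RuntimeError):
    """The agent used every step without reporting that it was done."""

class AgentLoopDetected(RuntimeError):
    """The agent was stopped for trading filler instead of doing work."""

def run_agent(task, *, max_steps: int = 8):
    """Dispatch an agent with an app-aware prompt and a hard step budget."""
    # `agent`, `build_app_context`, and `apply` are supplied by the host application.
    context = build_app_context(task)        # what the app is, who it serves
    for _ in range(max_steps):
        action = agent.next_action(task, context)
        if action.kind == "done":
            return action.result
        if action.is_social_pingpong():       # no bot-to-bot greeting loops
            raise AgentLoopDetected(task.id)  # report the loop, not a budget overrun
        context = apply(action, context)
    raise StepBudgetExceeded(task.id)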

Rules of thumb

Give every agent a hard step budget.
Pass app-aware context so the agent knows what it is for.
Break bot-to-bot greeting loops on sight.
Log token spend per task, not per day.

Efficient agents are not the ones that try hardest. They are the ones that know when to stop. The full adoption picture is in the Stack Overflow 2025 survey.

Why individual gains do not become team gains

The survey split between personal productivity and team collaboration is the most important number in the story. A developer can save an hour generating a test scaffold or translating a stack trace into a fix plan. The team only saves that hour if the output is understandable, reviewable, and aligned with the system everyone else is maintaining. Agent work often enters the codebase as a large patch with weak provenance: many files changed, several assumptions baked in, and little explanation of which path was tried and rejected.

That makes agent output different from normal automation. A formatter, build step, or dependency bot produces changes inside a narrow contract. An agent can alter design, data flow, naming, tests, and release risk in a single run. The person who prompted it may feel faster while the reviewer inherits a larger verification problem. That is how a tool can raise personal velocity while leaving collaboration flat.

Teams that get value from agents usually constrain the shape of the work before the first prompt is sent. They ask for one patch, one bounded task, and one explicit validation command. They require the agent to name the controlling code path and the cheap check that would disprove the attempted fix. The result is slower than a free-form agent demo, but it creates an artifact a colleague can inspect without rerunning the entire conversation.

The hidden cost is review bandwidth

Token spend is visible on an invoice. Review bandwidth is harder to measure, which is why many teams miss it. If an agent saves 30 minutes of coding but adds 45 minutes of uncertain review, the organization lost time even though the individual developer felt faster. That mismatch explains why AI assistance can spread inside a company while engineering leaders remain cautious about broader workflow claims.

The concern is sharper in production systems because generated code often fails at the boundaries: permissions, migrations, retries, observability, and user state. These are the parts reviewers already inspect most carefully. A confident agent patch that touches them without tests creates work that cannot be skipped. The agent did not merely write code. It created a proof obligation for the human team.

One useful management metric is review minutes per accepted agent change. It is less glamorous than lines generated or tasks completed, but it answers the collaboration question directly. If review minutes fall while escaped defects stay flat, agents are improving the team. If review minutes rise, the tool is moving effort from authoring to inspection.
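The metric is simple to compute once review time is logged per change. In this sketch the records are invented, and counting the review time spent on rejected changes in the numerator is a design assumption: that time was still paid for.

```python
# Hypothetical review log: one record per agent-authored change.
changes = [
    {"id": "chg-1", "review_min": 20, "accepted": True},
    {"id": "chg-2", "review_min": 45, "accepted": False},
    {"id": "chg-3", "review_min": 25, "accepted": True},
]


def review_minutes_per_accepted_change(records) -> float:
    """Total review minutes (accepted and rejected) divided by accepted changes."""
    accepted = sum(1 for r in records if r["accepted"])
    if accepted == 0:
        return float("inf")   # review time spent, nothing shipped
    return sum(r["review_min"] for r in records) / accepted


print(review_minutes_per_accepted_change(changes))  # 45.0
```

Tracked week over week alongside escaped defects, the number shows whether effort is leaving the team or only moving from authoring to inspection.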

What a mature agent workflow looks like

A mature workflow treats the agent like a junior teammate with fast hands and no institutional memory. It gets context, boundaries, and a definition of done. It does not get permission to wander across the codebase because the prompt was vague. The best agent request includes the user problem, the relevant product context, the files likely to matter, and the validation command that will decide whether the work is finished.

That discipline also reduces security risk. Agents should receive only the data they need for the task, and they should never be asked to improvise with secrets or production credentials. Logs should record which files changed, which tools ran, and which checks passed. When a change later breaks, the team needs a short audit trail, not a transcript full of exploratory dead ends.

The practical conclusion is modest. Agents are useful when they shorten a known path through known systems. They are expensive when they are asked to discover a strategy, make broad edits, and prove their own work all at once. Adoption will rise as teams learn to put agents inside narrower loops, with human review aimed at judgment rather than reconstruction.

The bill for frontier AI is now measured in gigawatts

2026-05-31T00:00:00Z

The constraint on frontier AI is no longer ideas. It is electricity and capital. Frontier labs have collectively raised more than 170 billion dollars, and a single data center with one gigawatt of facility power now costs roughly 30 billion dollars to build, according to Epoch AI. The model is the cheap part.

The buildout is visible in physical infrastructure. The largest known AI data center, the Anthropic and Amazon site at New Carlisle, draws an estimated 1.1 gigawatts and carries about 35 billion dollars in capital cost. Microsoft's planned Fairwater Wisconsin facility is projected to be nearly eight times more powerful, equivalent to 5.2 million H100 chips by September 2027.

Compute is compounding

The stock of AI compute, charted above, is growing about 3.4 times per year, doubling roughly every seven months since 2022. Training compute for frontier language models has climbed about 5 times per year since 2020, while the cost of those training runs rises about 3.5 times per year and power use roughly doubles annually. Stack those curves and the shape of a frontier budget stops being a research line item and becomes an infrastructure bill: data centres, power contracts, and silicon dwarf the salaries of the people writing the models.
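Annual growth factors and doubling times are two views of the same rate. For a quantity that grows g times per year, the doubling time in months is 12 × ln 2 / ln g. The short sketch below applies that formula to the growth factors quoted above; the function name is illustrative.

```python
import math


def doubling_time_months(annual_growth_factor: float) -> float:
    """Months needed to double at a constant annual growth factor."""
    if annual_growth_factor <= 1:
        raise ValueError("growth factor must exceed 1 for the quantity to double")
    return 12 * math.log(2) / math.log(annual_growth_factor)


curves = [
    ("AI compute stock", 3.4),
    ("frontier training compute", 5.0),
    ("training-run cost", 3.5),
    ("power use", 2.0),
]
for label, factor in curves:
    print(f"{label:26s}: doubles about every {doubling_time_months(factor):.1f} months")
```

A factor of 3.4 per year works out to a doubling time just under seven months, matching the figure in the text, while a factor of 5 means doubling roughly every five months.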

Why this favours incumbents

When the marginal advantage comes from owning power and silicon, geography and balance sheets decide who competes. The United States holds about three-quarters of global GPU cluster performance today. Gigawatt-scale sites take around two years to build, which turns AI strategy into a real-estate and energy problem as much as a research one.

What to watch

The open question is whether capability keeps tracking this spend. The scaling curve has held so far, and Epoch argues the trend can continue through 2030. If returns flatten before the concrete is poured, some of these commitments will look very large in hindsight. The underlying numbers are tracked in Epoch AI's trends dashboard.

Capital intensity changes the research culture

When a training run depends on gigawatts and billions of dollars, research becomes tied to capital allocation. The lab still needs scientists, but the frontier experiment also needs power contracts, procurement teams, construction partners, and finance committees. That changes which ideas get tested. The best proposal is no longer only the most elegant one. It is the one that can justify scarce compute on a schedule.

This pressure can narrow the field. Smaller teams may produce important algorithmic ideas, but they may need a large partner to test those ideas at full scale. Large companies can run more frontier experiments, collect more failure data, and turn infrastructure into a learning advantage. The gap is not only the size of the cluster. It is the feedback loop the cluster permits.

There is a counterweight. High capital intensity makes efficiency research more valuable. Any method that reduces training compute, improves data quality, raises utilization, or transfers capability to smaller models can save enormous amounts of money. The infrastructure race therefore increases the prize for better algorithms even as it raises the cost of testing them.

Why consultants follow the spend

Large capital programs create advisory markets. Companies spending billions on AI infrastructure need forecasts, procurement advice, power strategy, risk models, governance plans, and board narratives. The consultant promise is to turn a technical arms race into an investment plan. Some of that work is useful. Some of it will be expensive storytelling around uncertain curves.

The risk is that AGI language can blur ordinary capital discipline. A project that would be questioned as a data-center investment may look inevitable when it is framed as a step toward general intelligence. Investors and boards should still ask normal questions: what capacity is committed, what demand is already contracted, what utilization is assumed, and what happens if model returns slow?

Those questions do not dismiss the technology. They protect the company from mistaking momentum for proof. Frontier AI may justify unusually large bets, but large bets still need milestones that can be measured before the full bill is spent.

The user-facing consequence

Most customers will never see the gigawatt bill directly. They will see it in product packaging. Providers will push subscriptions, usage tiers, committed spend contracts, and enterprise bundles that help finance fixed infrastructure. They will also try to move routine work onto cheaper models so premium capacity is reserved for high-margin tasks.

That means buyers should ask where their workload sits in the provider's cost stack. A feature that uses commodity inference should not be priced like a frontier reasoning product. A feature that depends on scarce premium capacity may face throttling, higher minimums, or stricter terms during demand spikes.

The infrastructure story therefore matters even for ordinary software buyers. It explains pricing, availability, vendor concentration, and the pressure to commit early. The model may be the visible product, but the power contract is increasingly the economic engine behind it.

How to judge the promise

The serious question is whether each new dollar of infrastructure produces a larger base of paying capability. That can happen through better models, cheaper serving, larger customer volume, or more valuable product bundles. If the spend only produces impressive demos, the economics weaken. If it produces dependable services that customers use daily, the gigawatt bill becomes easier to defend.

Boards and buyers should therefore ask for milestones that connect capacity to use. How much of the planned compute is contracted? Which products depend on it? What utilization is assumed? Which workloads can move to cheaper models if frontier capacity is scarce? Those questions turn AGI promises into operating assumptions that can be checked.

AI capability stopped slowing down. It sped up after 2024

2026-05-31T00:00:00Z

The story everyone expected was a slowdown. The data shows the opposite. Epoch AI measures frontier capability rising about 15.5 points per year on its Epoch Capabilities Index (ECI), with a 90 percent interval of 13 to 18, and notes the rate has grown faster since early 2024. The curve that many predicted would bend toward a plateau bent the other way.

The acceleration comes from several inputs compounding at once: training compute up about 5 times a year, algorithms 3 times more efficient annually, and a wave of investment, with frontier labs having raised more than 170 billion dollars. When several exponentials stack, the combined capability curve steepens rather than settling.

What an acceleration does to forecasts

The trap is forecasting a rising curve with a fixed yearly rate. If the rate of improvement is itself climbing, as it has since the 2024 inflection, a constant-rate forecast undershoots every single year. Plans that assumed diminishing returns have been consistently wrong on the low side, which is an uncomfortable kind of error to keep making.
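The undershoot is easy to reproduce with a toy model. The sketch below compares a fixed-rate forecast with a path whose yearly gain keeps rising. The starting level and the acceleration value are made-up numbers for illustration; only the 15.5-point rate comes from the figures above.

```python
def constant_rate_forecast(start: float, rate: float, years: float) -> float:
    """Linear forecast: the yearly gain is assumed to stay fixed."""
    return start + rate * years


def accelerating_path(start: float, rate: float, accel: float, years: float) -> float:
    """Path where the yearly gain itself rises by `accel` points each year."""
    return start + rate * years + 0.5 * accel * years ** 2


START, RATE, ACCEL = 100.0, 15.5, 2.0   # START and ACCEL are assumptions

for t in (1, 2, 3):
    shortfall = accelerating_path(START, RATE, ACCEL, t) - constant_rate_forecast(START, RATE, t)
    print(f"year {t}: constant-rate forecast is low by {shortfall:.1f} points")
```

The shortfall grows with the square of the horizon, so the error is small in the first year and compounds after that, which is one argument for replanning more often rather than trusting a multi-year line.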

How to plan against a moving rate

Do not anchor on a plateau. The expected ceiling has not shown up in the data, a caution we raised in the scaling curve nobody wants to extrapolate.
Shorten your horizon. Replan capability assumptions every two quarters, not every two years.
Separate capability from value. Faster models do not automatically mean faster returns, as the agentic cancellations show.

Betting on a slowdown has been the losing trade for two years running. The capability growth estimates come from Epoch AI.

Why a faster rate changes product risk

An accelerating capability curve changes the risk of every long product bet. If models improve at a steady pace, a roadmap can assume that today's hard problems will become easier in a predictable sequence. If the rate itself rises, the ordering can change. Features that seemed impossible at planning time can become ordinary before the project ships, while safeguards designed around older model limits can become stale.

That matters most for teams building around absence. Some products are viable because models cannot yet perform a task cheaply, run locally, or reason across a large enough context. A faster curve shortens the life of those assumptions. A compliance tool built around document summarization, for example, may face new competition once long-context models can ingest the full source file. A research workflow built around manual synthesis may look different once models handle larger evidence sets with lower error rates.

The planning mistake is to treat capability as a background variable. It should be a line item in the roadmap. Teams need to ask which current product choices depend on model limits and how those choices will age if the next release is better than expected. The answer may change pricing, hiring, integration depth, or the decision to build a feature at all.

The data is strong, but the interpretation is narrow

Epoch's ECI measure is useful because it tries to summarize the capability of frontier systems over time. It does not mean every product sees a 15.5-point annual improvement in value. Model progress arrives unevenly. Coding, visual reasoning, tool use, long-context recall, and scientific problem solving can move at different speeds. A broad index is a climate reading, not a weather forecast for each application.

That caveat matters because business cases often translate capability into revenue too quickly. A model can be more capable and still fail a workflow if it is too slow, too expensive, too hard to audit, or too unreliable in rare cases. The acceleration raises the ceiling. It does not remove the need to test the floor.

The best interpretation is probabilistic. A faster frontier increases the odds that currently marginal applications become practical sooner than expected. It also increases the odds that an internal build will be overtaken by a commodity model before the investment pays back. Both can be true for the same company.

How to update forecasts without chasing noise

Roadmaps should separate three clocks: frontier capability, deployable capability, and organizational adoption. Frontier capability is the first demo that proves a task can be done. Deployable capability is the point where it is cheap, fast, and stable enough for normal users. Adoption is the slower process of changing workflows, training staff, and adjusting controls. Acceleration at the frontier pulls on all three clocks, but it does not make them identical.

A practical forecast can use scenarios instead of a single date. The base case assumes the recent rate continues. The upside case assumes another post-2024 speed-up. The downside case assumes infrastructure, data, or evaluation limits slow the curve. Each scenario should name the product decisions that would change if it became true.

The important habit is cadence. Updating assumptions twice a year is enough for most teams and avoids reacting to every launch thread. A capability curve that keeps bending upward rewards vigilance, not panic. The organizations that adapt well will be the ones that treat model progress as a measurable input rather than a surprise.

The simplest worksheet is a dependency list. For each product line, write down which tasks are blocked by model quality, which are blocked by cost, and which are blocked by trust. Then revisit the list after each major frontier release. That exercise turns a vague acceleration story into concrete decisions about what to build, what to buy, and what to postpone.
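
One way to keep that worksheet is a small structured list rather than a slide. The sketch below is only an illustration: the `Dependency` class, the `group_by_blocker` helper, and the example rows are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

BLOCKERS = ("quality", "cost", "trust")  # the three blocker types named above


@dataclass
class Dependency:
    product_line: str
    task: str
    blocker: str  # must be one of BLOCKERS

    def __post_init__(self) -> None:
        if self.blocker not in BLOCKERS:
            raise ValueError(f"unknown blocker: {self.blocker!r}")


def group_by_blocker(deps: list[Dependency]) -> dict[str, list[Dependency]]:
    """Group worksheet rows so a review can go blocker by blocker."""
    grouped: dict[str, list[Dependency]] = defaultdict(list)
    for dep in deps:
        grouped[dep.blocker].append(dep)
    return dict(grouped)


# Hypothetical rows, for illustration only.
worksheet = [
    Dependency("support bot", "multi-step refund handling", "quality"),
    Dependency("support bot", "large model on every ticket", "cost"),
    Dependency("contract review", "unsupervised clause approval", "trust"),
]

by_blocker = group_by_blocker(worksheet)
for blocker in BLOCKERS:
    for dep in by_blocker.get(blocker, []):
        print(f"{blocker:8s} | {dep.product_line}: {dep.task}")
```

Revisiting the list after a major release then means moving or deleting rows, which leaves a record of which assumptions aged out.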

Vibe coding is loud online and rare in real codebases

2026-06-02T00:00:00Z

Vibe coding is the phrase of the year and the practice almost nobody admits to. In the Stack Overflow 2025 Developer Survey, 72% of respondents said vibe coding is not part of their professional work, and another 5% rejected it emphatically. The headline trend is real, but it lives on social feeds more than in commits.

That gap matters because adoption of AI tooling itself is not in doubt. The same survey found 84% of developers use or plan to use AI tools, up from 76% a year earlier, and 51% of professionals reach for them daily. The disagreement is not about whether to use AI. It is about how much to trust it.

The trust gap is widening

Sentiment actually fell as usage rose. Favourable views of AI tools dropped from above 70% in 2023 and 2024 to about 60% in 2025. More developers now distrust the accuracy of AI output (46%) than trust it (33%), and only 3% say they highly trust it. Experienced developers are the most sceptical, which fits people who carry accountability for what ships.

Almost right is the expensive part

The single most cited frustration, shown in the chart above, is "AI solutions that are almost right, but not quite," reported by 66% of developers. The knock-on effect lands second: 45% say debugging AI-generated code costs them more time than writing it themselves would have.

Frustration with AI tools	Share of developers
Almost right, but not quite	66%
Debugging AI code takes longer	45%
Do not use AI tools regularly	24%
Less confident in my own problem solving	20%
Hard to understand how the code works	16%

What developers still want a human for

Asked what they would do in a future where AI handles most coding, 75% said they would still ask a person "when I don't trust AI's answers." Human review is not a nostalgia item. It is the control that makes fast generation safe to merge, the same pattern we saw in agents that need a hard stop.

The takeaway is narrow and practical. Vibe coding is a fine way to explore. It is a poor way to ship without a reviewer who reads every line. For the full breakdown, see the Stack Overflow 2025 survey.

Why professionals draw the line at production

Professional developers are paid for the behavior of the system after the pull request merges. That accountability changes how AI-generated code feels. A demo can tolerate a missing edge case. A production service has users, data, security boundaries, uptime targets, and future maintainers. Code that is almost right can still create an incident.

This is why experienced developers are often more skeptical than beginners. They have seen bugs hide in migrations, retries, permission checks, date handling, and error paths. AI tools are good at producing plausible main paths. They are less reliable at knowing which local constraint makes the main path unsafe.

Vibe coding also weakens shared understanding if the author cannot explain the change. A team can maintain code it understands. It struggles with code that arrived as a large generated patch and no clear design rationale. The cost shows up later, when someone has to debug or extend it.

Where AI coding does work

The survey should not be read as a rejection of AI coding tools. Developers use them because they help. They are useful for boilerplate, test scaffolds, small refactors, translation between APIs, documentation drafts, regular expressions, and first-pass explanations of unfamiliar code. Those tasks are bounded and easy to verify.

The success pattern is human-led. The developer decides the design, constrains the task, inspects the diff, runs the tests, and owns the result. AI shortens the path through known work. It becomes risky when it is asked to choose the path, write the change, and prove the change without enough context.

That distinction explains the adoption data. Daily AI use can rise while vibe coding remains rare because many developers have found a middle ground. They use AI as an assistant, not as an unchecked author of production behavior.

How teams can make generated code reviewable

Teams that want AI coding benefits should design for review. Keep generated changes small. Require a clear problem statement, a list of touched files, and a validation command. Ask the tool to preserve local conventions and avoid broad refactors unless the task explicitly requires them. Those rules make the output look more like normal engineering work.

Reviewers should focus on boundaries: inputs, permissions, errors, migrations, concurrency, observability, and rollback. These are the places where plausible code often fails. Unit tests help, but production confidence usually needs a slice of integration coverage or a focused manual check as well.

Organizations can also track whether AI-generated changes take longer to review or produce more follow-up fixes. That data is more useful than counting lines generated. The goal is not to maximize AI output. The goal is to reduce time to safe, understandable, maintainable code.
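
A rough sketch of that comparison, assuming a team can label each merged change as AI-assisted or not and can count later repair commits. The `Change` fields and the sample values are hypothetical and do not come from the survey.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Change:
    ai_assisted: bool    # hypothetical label set by the author or tooling
    review_hours: float  # time from review request to approval
    followup_fixes: int  # later commits that repair this change


def summarize(changes: list[Change]) -> dict[str, float]:
    """Median review time and follow-up fix rate for one group of changes."""
    return {
        "median_review_hours": median(c.review_hours for c in changes),
        "fixes_per_change": sum(c.followup_fixes for c in changes) / len(changes),
    }


# Made-up sample, for illustration only.
changes = [
    Change(True, 3.0, 1),
    Change(True, 5.0, 2),
    Change(True, 4.0, 0),
    Change(False, 2.0, 0),
    Change(False, 3.0, 1),
    Change(False, 2.5, 0),
]

ai = summarize([c for c in changes if c.ai_assisted])
manual = summarize([c for c in changes if not c.ai_assisted])

print("AI-assisted:", ai)
print("Manual:     ", manual)
```

The median is used here because one stalled review can distort a mean. A real comparison would also need enough changes per group, and similar change sizes, before drawing conclusions.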

The cultural signal

Vibe coding became loud because it captures a real feeling: software can now be produced by describing intent. The professional backlash is also real because software engineering has never been only production of text. It is the work of making systems reliable under constraints.

The interesting future is not a binary choice between hand-written code and fully delegated code. It is a workflow where generation is cheap, review is explicit, and accountability remains clear. That is less viral than the phrase, but it is much closer to how production teams adopt tools that last.

Unit distance proof moves AI past clever math demos

2026-06-03T00:00:00Z

The unit distance proof is the first AI math result that does not belong in the demo-theater pile: OpenAI says a general purpose reasoning model broke an 80 year belief in discrete geometry, and Will Sawin has already made the improvement explicit at n^1.014 unit distance pairs.

That number is small enough to look silly in a pitch deck. It is large enough to kill a conjecture.

The exponent that broke an 80 year habit

The problem is almost offensively simple. Put n points in the plane. Count how many pairs sit exactly distance 1 apart. Paul Erdős posed the planar unit distance problem in 1946, and the question has lived in the uncomfortable zone where a child can understand the statement and a field can fail to close the bounds for decades. OpenAI's May 20, 2026 announcement says an internal model produced an infinite family of point sets that beat the long suspected n^(1+o(1)) ceiling.

The old mental model was grid shaped. A line gives n minus 1 unit pairs. A square grid gives about 2n. Erdős had a more delicate construction from a rescaled grid with growth of the form n^(1+C/log log n), which is technically superlinear but asymptotically keeps drifting back toward exponent 1. OpenAI's writeup says the new construction gives n^(1+delta) unit distance pairs for infinitely many n, with delta fixed above 0, not evaporating as n grows.

The chart attached to this article uses the cleanest public version of the gap. Sawin's arXiv paper submitted May 20, 2026 proves that arbitrarily large point sets can contain more than n^1.014 pairs at distance exactly 1. The best known general upper bound remains O(n^(4/3)), a result MathWorld attributes to Spencer, Szemerédi, and Trotter in its unit distance problem summary. So the field did not suddenly learn the answer. It learned that the old answer cannot be right.

That is the first thing to keep straight. This is a disproof by construction, not a final asymptotic formula. The new lower exponent is about 1.014. The upper exponent is about 1.333. There is still a canyon between them.
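
The difference between a drifting exponent and a fixed one can be checked with a few lines of arithmetic. This is a sketch that ignores constant factors, and the constant `C` is set to 1 purely as a placeholder because the article does not give its value.

```python
import math

SAWIN_EXPONENT = 1.014  # lower bound exponent cited above
UPPER_EXPONENT = 4 / 3  # exponent of the best known general upper bound
C = 1.0                 # placeholder assumption: the real constant is not given here


def grid_exponent(loglog_n: float, c: float = C) -> float:
    """Effective exponent 1 + c / log(log n) of the rescaled grid bound."""
    return 1 + c / loglog_n


# Parametrize by log(log n) directly, since the n involved are astronomically large.
for loglog_n in (5, 20, 80, 200):
    print(f"log log n = {loglog_n:3d}: grid exponent ~ {grid_exponent(loglog_n):.4f}")

# The drifting exponent falls below the fixed one once log(log n) passes this value.
crossover_loglog = C / (SAWIN_EXPONENT - 1)
print(f"crossover at log log n ~ {crossover_loglog:.1f}")

# Size of the remaining gap between lower and upper exponents at n = 10**6.
n = 10**6
gap_ratio = n ** (UPPER_EXPONENT - SAWIN_EXPONENT)
print(f"n^(4/3) / n^1.014 at n = 10^6 ~ {gap_ratio:.1f}")
```

Whatever positive value `C` takes, the grid exponent keeps sliding toward 1, while 1.014 stays put. That is the sense in which a fixed delta above 0 beats the old construction for large enough n.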

The second thing is stranger. The proof did not come from a hand built geometry searcher. OpenAI says the model was general purpose, not trained specifically for mathematics, not scaffolded to search proof strategies, and not aimed at this one problem. The companion note by Noga Alon, Thomas Bloom, W. T. Gowers, Daniel Litt, Will Sawin, Arul Shankar, Jacob Tsimerman, Victor Wang, and Melanie Matchett Wood describes a human verified version of the counterexample and says the argument relies on ideas connected to Ellenberg and Venkatesh, Golod-Shafarevich theory, and class field towers.

That is not garnish. The surprise is that deep algebraic number theory showed up inside an elementary Euclidean question. The old construction can be seen through Gaussian integers, numbers of the form a+bi. The new one pushes into more complicated number fields with richer splitting behavior, then uses those arithmetic symmetries to manufacture many unit length differences after projection into the plane.

Tim Gowers called it "a milestone in AI mathematics" in the companion discussion, and that line is doing real work. The milestone is not that an LLM wrote plausible math. We have too much of that already, usually wearing a confident smile and carrying a fake lemma. The milestone is that a model found a construction which survived expert scrutiny by a nine author group.

Your research agent just got a harsher performance review

If you build with AI, the important part is not discrete geometry. Unless your product roadmap contains extremal graph theory, the unit distance problem will not change your API this quarter.

The important part is the capability shape. A model crossed from explaining existing math into producing a new path through a long lived problem, and the path was weird in a productive way. That matters because many valuable problems in software, biotech, finance, and manufacturing look less like school exams and more like this: the statement is compact, the solution space is huge, and the useful move is to import machinery from a place your team would not have searched first.

This also raises the bar for what we should call an agentic research win. A chatbot that summarizes 12 papers is table stakes. A coding agent that fixes a narrow bug is useful, but it is not frontier research. The unit distance proof says the more interesting unit of work is a verifiable conjecture loop:

propose a nonobvious construction or mechanism
reduce it to checkable claims
route it through expert or formal verification
turn the rough result into a readable artifact
expose the remaining gap rather than pretending the gap vanished

That loop is expensive. It is also the part most AI product demos skip. In our earlier piece on why your agents are working too hard, the point was that many agent stacks burn compute on activity rather than discriminating work. This result points in the opposite direction: spend serious compute where a correct output compounds, and demand a verification path before you let the model touch the steering wheel.

For a builder, the practical consequences are concrete.

First, roadmap language needs to change. Do not sell "AI scientist" as a magic colleague that ships finished truth. Sell constrained discovery systems that generate candidates under a verification budget. In math, the verifier can be a small circle of specialists, a proof assistant, or both. In drug discovery, it might be assay design. In chip design, simulation and layout checks. In software, tests are the easy part and product semantics are the hard part.

Second, hiring tilts toward people who can interrogate model output. OpenAI's announcement says the proof was checked by external mathematicians, and the companion paper explicitly describes the result as digested and simplified by humans. That means expertise did not get cheaper in the way managers like to imagine. It became a higher leverage bottleneck. One excellent reviewer with domain taste may now supervise 50 candidate ideas, but the review function did not disappear.

Third, your moat is less likely to be prompt craft. If general purpose models can occasionally jump domains, the defensible asset becomes the problem portfolio, the data exhaust from failed attempts, the verification harness, and the institutional memory that says which crazy looking path is merely crazy and which is worth two weeks.

There is a cost story too. Test time compute is not a rounding error when the goal is long horizon reasoning. OpenAI says it investigated success rates with varying amounts of test time compute after verifying the initial proof, although the public text does not expose enough numeric detail to price the search. That absence matters. If a discovery requires thousands of expensive attempts and a rare expert audit, it may still be a bargain for a theorem or molecule and a terrible deal for an internal dashboard feature.

So yes, update your priors. No, do not update them to "replace the research team."

What deserves funding, and what does not

Verification infrastructure deserves the money long before another wrapper that asks a model to "think like a Nobel laureate."

The unit distance result is compelling because the claim is crisp. A point set has a number of unit distance pairs. A proof either establishes the asymptotic lower bound or it does not. The paper trail is public enough to inspect: OpenAI's original proof PDF, the external companion note, and Sawin's explicit refinement are all available. That is a healthier pattern than a private benchmark screenshot with a victory lap attached.

But the caveats are not small.

One caveat is autonomy. OpenAI says the proof was produced by an internal model, and the companion note says the proof presented there is a human digested and somewhat generalized version. That is exactly how important AI research will often look for a while. The model finds the seam. Humans clean the cut, test the edge cases, and explain why anyone should care. Calling that fake autonomy misses the point. Calling it full automation also misses the point.

A second caveat is selection. OpenAI evaluated the model on a collection of Erdős problems, and this one worked. We need to know the denominator. How many problems were attempted? How many convincing false trails did the model produce? How much human judgment went into choosing the problem statement and recognizing the output as promising? Without that denominator, you should treat this as a major existence proof, not a general productivity estimate.

A third caveat is transfer. Mathematics is unusually kind to AI evaluation because correctness can be checked with high precision. Many commercial research problems are messier. A model can propose a new pricing strategy, growth loop, or materials recipe, but the verifier may be a market, a wet lab, or a 6 month deployment. The longer and noisier the feedback loop, the easier it is to confuse novelty with progress.

Here is the bet worth making for 2026 and 2027: AI research systems will first become valuable in domains where candidate generation is cheap, negative feedback is fast, and verification has teeth. That includes theorem exploration, code optimization, compiler passes, formal methods, circuit search, and some simulation heavy engineering. It includes less of the vague strategy work that fills slide decks. The model can be brilliant at long shots and still terrible at knowing which long shots your customers will pay for.

Do not bet on a near term flood of solved famous problems without a verification bottleneck. Gowers's own reflection in the companion paper is telling. He found a counterexample less alarming than a proof of the conjectured upper bound because counterexamples can sometimes come from one surprising construction, while an upper bound may require a new structural theory. That distinction matters for product planning. A model can be the rare prodigy that spots a single hidden shortcut yet still cannot lay down the foundations a whole theory needs.

The next serious questions are measurable. Can the approach reproduce across 10 or 50 open problems? Can labs publish attempt logs without leaking everything useful? Can proof assistants absorb more of the checking load? Can a model explain its construction well enough that specialists improve it within days, as Sawin did with the 1.014 exponent?

That last question is underrated. The best version of AI assisted research is not a sealed oracle. It is a system that hands humans a live wire and enough insulation to use it.

The new moat is asking the worthwhile question

The unit distance proof should make builders less impressed by fluent answers and more interested in auditable surprises.

A model that can connect algebraic number fields to a 1946 geometry problem is not just a better autocomplete box. It is also not a replacement for the people who know when a proof has actually earned the word. The teams that win from this shift will be the ones that pair models with hard problems, sharp verifiers, and the patience to throw away 99 plausible ideas.

The cheap future is AI that talks like a researcher. The valuable future is AI that gives a real researcher something uncomfortable and correct to check.

Google AI opt out gives publishers a real lever now

2026-06-03T00:00:00Z

Google just made the publisher bargain explicit: feed the answer machine, or step away from the most valuable shelf in search.

The Google AI opt out that UK regulators forced into existence is a small Search Console control with a large commercial message. On June 3, 2026, the UK Competition and Markets Authority said in a new publisher conduct requirement under the UK digital markets regime that Google must give publishers effective tools to keep their content out of generative AI Search features, including AI Overviews, and to refuse use of that content for AI model fine tuning. The key number is nine months: Google has that long to implement all changes, with important parts expected sooner, and it must file compliance reports every six months in the first year.

This matters because AI Search is no longer a lab feature sitting politely beside the web. Google says AI Overviews has more than 2.5 billion monthly active users and AI Mode has passed 1 billion monthly users, figures it included in its own June 3 announcement of new website owner controls. If your acquisition model depends on search, the question is no longer whether AI answers affect you. It is whether you can measure the effect, negotiate around it, and decide where your content should or should not appear.

We covered the data behind this fight when the click numbers first surfaced. The UK ruling gives it teeth.

What exactly did the CMA force Google to change?

The CMA did three practical things, and each one targets a different part of the publisher complaint.

First, it told Google to provide publishers with effective controls over how their search content is used in generative AI. The official publisher conduct requirement says Google must give publishers controls, clear explanations of how content is used, and detailed metrics on engagement, and must take reasonable steps to ensure clear and accurate attribution in search generative AI features. That is broader than a robots.txt tweak. It is a regulatory demand for product surface area.

Second, the CMA added a fine tuning opt out. In plain English, publishers can say no to their content being used to improve Google AI models, not just no to appearing in an AI answer. The CMA said in its June 3 press release that the change followed consultation feedback and gives publishers control over the full range of AI uses of their content.

Third, the regulator made attribution and metrics part of the deal. That sounds soft until you run a media business. If an AI answer uses your reporting but the interface buries the link, the economics break. If you cannot see which pages appear, in which countries, and with what engagement, you cannot price licensing, justify content spend, or decide whether the exposure is worth the cannibalization.

Google’s response is to roll out controls as a product, not just as a UK compliance patch. In its new controls announcement, Google said it is testing a new Search Console toggle that lets site owners decide whether their links and content can appear in and ground generative AI Search features such as AI Overviews, AI Mode, and AI Overviews in Discover, and that sites opting out will not receive AI feature traffic or impressions. Google also says this control will not be used as a ranking signal outside those generative AI Search features.

That last sentence will get quoted in a thousand SEO decks by Friday. It should. The fear was simple: if a publisher blocks AI answers, will regular blue link search punish it? Google says no.

The catch is equally simple: if you opt out, you disappear from the AI layer itself. For informational queries, that layer is increasingly the page.

Why is this bigger than another Search Console toggle?

Because the toggle changes the negotiating position.

Before this ruling, many publishers faced a bad default: Google could use their work to answer the user, the user might not click, and the publisher had to keep participating because Google remained the gateway. The CMA confirmed that gateway power when it designated Google with strategic market status (SMS) in October 2025, saying in its SMS decision announcement that more than 90% of UK searches take place on Google.

That market share is why this is a regulator story, not just a product story. If a smaller AI search startup scraped, summarized, and linked lightly, publishers could block it and move on. Google is different. For many sites, leaving Google’s AI layer could mean losing visibility at the top of the results page while competitors stay inside the box.

The traffic data explains the anxiety. Pew Research Center analyzed browsing activity from 900 US adults in March 2025 and found that users clicked a traditional result on 8% of visits when a Google AI summary appeared, versus 15% when no AI summary appeared, according to Pew’s behavioral study of Google AI summaries. Links inside the AI summary got clicked on 1% of visits.

That is the chart for this article. It is not perfect, because one study cannot describe every query class, country, or interface variant. But it captures the publisher problem with brutal clarity: the answer box can create visibility without a visit.
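
The size of the gap is easier to feel as arithmetic. The sketch below uses only the three Pew percentages quoted above. The batch of 1,000 visits is a hypothetical scale, and adding result clicks to summary-link clicks is a simplification, since a single visit could include both.

```python
# Click rates per Google search visit, from the Pew figures quoted above.
CLICK_NO_SUMMARY = 0.15    # traditional result clicked, no AI summary shown
CLICK_WITH_SUMMARY = 0.08  # traditional result clicked, AI summary shown
CLICK_SUMMARY_LINK = 0.01  # link inside the AI summary clicked

# Relative fall in traditional-result clicks when a summary appears.
relative_drop = 1 - CLICK_WITH_SUMMARY / CLICK_NO_SUMMARY

visits = 1_000  # hypothetical batch of search visits, for scale only
clicks_without = visits * CLICK_NO_SUMMARY
# Simplification: treat result clicks and summary-link clicks as additive.
clicks_with = visits * (CLICK_WITH_SUMMARY + CLICK_SUMMARY_LINK)

print(f"relative drop in result clicks: {relative_drop:.0%}")
print(f"clicks per {visits} visits without a summary: {clicks_without:.0f}")
print(f"clicks per {visits} visits with a summary:    {clicks_with:.0f}")
```

Even after counting the links inside the summary, the with-summary case sends fewer visits outward on these figures.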

A separate 2026 research paper by Haofei Xu, Umar Iqbal, and Jacob M. Montgomery measured 55,393 trending queries from March 13 to April 21, 2026 and found AI Overviews activated on 13.7% of queries overall and 64.7% of question form queries, according to the authors’ arXiv paper. They also found that 11.0% of atomic claims in AI Overview responses were unsupported by cited pages.

That second number matters for builders beyond publishing. If you are using retrieval, citation, or answer generation in your own product, this is the same failure pattern at platform scale: citations create the feeling of accountability, but they do not guarantee that the answer actually follows from the cited page.

So the UK rule is not a magic traffic repair kit. It is a forced API for consent, attribution, and measurement inside the dominant answer interface.

Why should builders and operators care if they are not publishers?

Because this is the first serious preview of how AI distribution gets priced.

If you run a SaaS company, a marketplace, a technical documentation site, a local service directory, or a developer tool, your content is already part of someone’s answer corpus. You may not call yourself a publisher, but your docs, comparisons, tutorials, support pages, pricing pages, and changelogs are doing publisher work. They attract intent. They resolve uncertainty. They feed models.

The Google AI opt out forces a planning choice that most teams have postponed: which content exists to be read on your site, and which content exists to be quoted by machines?

Here is the useful split:

Content type	Default AI Search posture	Business risk
Commodity explainers	Stay eligible	Users may get the answer without visiting
Original data or reporting	Test opt out or licensing	High value can be extracted without payment
Product docs	Stay eligible with measurement	Wrong snippets create support load
Pricing and comparison pages	Monitor aggressively	AI summaries can distort conversion intent
Community content	Case by case	Attribution and consent get messy fast

For developers, the new Search Console reporting may be the underrated part. Google says the new insights include impression metrics and information about which pages appear in AI responses and in what countries. That creates a new analytics event class: AI answer exposure. It is not the same as a pageview. It is not the same as a ranking. It is a distribution signal that sits between brand impression and referral.

You should wire it into your planning like a separate channel.

For business owners, the control creates negotiation evidence. A publisher that can show 2 million AI Search impressions, weak click through, and high reuse of original reporting has a stronger case for a content deal than a publisher waving at a vague fairness problem. The CMA said in its June 3 statement that the requirement should put news organizations in a stronger position to negotiate content deals with Google. That is exactly the point: measurement turns complaint into invoice.

For product teams, this also changes roadmap math. If your growth team has been treating SEO as a durable moat, you need a second distribution plan. Search traffic is fragmenting across AI answers, direct subscriptions, social surfaces, app notifications, and community channels. In a March 2026 publisher traffic report, Axios cited Chartbeat data showing traditional search referral traffic declined 60% over two years for small publishers with 1,000 to 10,000 daily pageviews, compared with 22% for large publishers. Small brands are the ones most likely to discover that a top of funnel content strategy became training material for a larger platform.

That does not mean block everything. It means stop treating visibility as payment.

How should you decide whether to opt out?

Start with the content’s job.

If a page answers a simple pre purchase question and sends qualified visitors into your product, staying in AI Search may still make sense. If AI Overviews cite you accurately, show a clear link, and produce branded demand later, the lost click may not be fatal. Google is betting on this version of the world, and it says its generative AI Search features are designed to help people find and visit websites.

If a page contains original reporting, proprietary benchmarks, paid research, or data that competitors cannot easily reproduce, the calculus changes. That content is your moat. If an AI answer can compress the valuable part into 6 bullets while leaving you with the cost of production, you should test restrictions, licensing, or delayed publication patterns.

A practical decision process for the next 30 days:

Segment your content by economic role. Separate acquisition pages, support docs, original research, news, community posts, and paid subscriber content.
Track AI exposure as its own funnel step. Do not bury AI Search impressions inside classic search reporting once the new Search Console data reaches you.
Compare answer exposure to downstream lift. Look for branded searches, direct visits, newsletter signups, assisted conversions, and support tickets after AI visibility rises.
Create a policy for high value data. Decide which reports, benchmarks, and datasets can be summarized freely and which require a commercial deal.
Review snippets for accuracy. If AI answers cite your docs but create support confusion, that cost belongs in the channel P&L.

The wrong move is ideological purity. Blocking everything may protect value while making you invisible in a fast growing interface. Allowing everything may maximize reach while weakening the reason users needed you in the first place.

The better move is portfolio management. Treat AI Search eligibility like syndication rights, not like a binary SEO setting.

What happens if this spreads beyond the UK?

The UK is acting first, but the pattern is portable.

The CMA says the publisher requirement is the first conduct requirement imposed after Google’s strategic market status designation in general search. It also said it will announce further action related to Google’s search business in the coming weeks. That matters because regulators rarely copy exact wording, but they do copy working mechanisms: consent controls, attribution rules, reporting duties, and compliance cadence.

Google is already thinking globally. Its June 3 blog says the new controls are beginning with a subset of UK website owners before a global rollout after testing. The company would rather ship one coherent control surface than maintain a UK only exception that becomes a compliance museum.

The open question is how many publishers will use the opt-out once it exists. If few do, Google gets to say the market chose participation. If many high-quality publishers do, AI answers may rely more heavily on sites that accept the trade, including forums, low-cost content farms, and brands willing to exchange content for exposure. That could lower answer quality or push Google toward more paid licensing.

For builders, the safe bet is that consent and provenance become product requirements. If you are building an AI answer engine, a vertical search product, or an internal knowledge assistant, do not wait for a regulator to tell you that content owners want controls. Build the controls now: source-level permissions, exclusion paths, audit logs, citation checks, and reporting that a non-engineer can read.
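As a rough sketch of what source-level permissions, exclusion paths, and an audit log could look like inside an answer engine, consider the following. Every name here (`SourcePermission`, `AuditLog`, `may_use`) is hypothetical, and the default-deny behavior for unknown domains is an assumption you may want to invert.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SourcePermission:
    domain: str
    allow_synthesis: bool = True  # may the engine summarize this source at all?
    excluded_paths: tuple = ()    # path prefixes the owner has opted out


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, domain: str, path: str, allowed: bool, reason: str) -> None:
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "domain": domain,
            "path": path,
            "allowed": allowed,
            "reason": reason,  # plain language, so a non-engineer can read it
        })


def may_use(perms: dict, domain: str, path: str, log: AuditLog) -> bool:
    """Check a source against owner permissions and log the decision either way."""
    perm = perms.get(domain)
    if perm is None:
        allowed, reason = False, "no permission record"  # assumption: default deny
    elif not perm.allow_synthesis:
        allowed, reason = False, "owner opted out"
    elif any(path.startswith(prefix) for prefix in perm.excluded_paths):
        allowed, reason = False, "path excluded"
    else:
        allowed, reason = True, "permitted"
    log.record(domain, path, allowed, reason)
    return allowed
```

Logging denials as well as approvals is deliberate: the report a content owner asks for is usually "what did you not use, and why."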

The web’s old bargain was messy but legible: crawl me, rank me, send me clicks. AI Search changes the middle verb. It can crawl, synthesize, and satisfy the user before the visit. The UK just told Google that publishers deserve a switch at that point in the chain.

A switch is not power by itself. Power is knowing when to flip it.

Multi-agent debate needs a boring data-cleaning cop

2026-06-03T00:00:00Z

Multi-agent debate is supposed to make LLM systems safer by adding a critic. In data cleaning, a new arXiv preprint finds the opposite often happens: debate degraded generation across four model families by 1.6 to 15.5 percentage points.

That is the useful kind of bad news. It does not kill agentic data cleaning. It tells you where to put the guardrail: give the critic evidence, tools, and a veto, or do not invite it into the pipeline.

The critic made clean rows dirty

In a paper submitted on June 1, 2026, Chirag Parmar, Akshat Mehta, Henglin Wu, Jagadish Ramamurthy, and Shweta Medhekar tested multi-agent debate for data cleaning across three benchmarks, four model families, and more than 6,000 task-condition pairs. Their headline result is uncomfortable for anyone building a panel of LLM agents and calling it governance: debate hurt generation for every model they tested, with drops ranging from 1.6 to 15.5 percentage points. The authors call the failure mode "critique-induced confusion," where a critic hallucinates feedback and the generator accepts it too easily.

The whole story fits in four numbers. Naive debate can damage generative repair, costing between 1.6 and 15.5 points. Detection is where the pattern flips. The same study reports a 27.4 point F1 gain for error detection, with an effect size of d = 1.0. Then, after a factorial experiment, the authors found a configuration that finally beat the single-agent baseline on a generative task: a separate critic, code-execution grounding, and evidence-gated generation, up 5.3 points with p < 0.05.

That split matters because data cleaning is not one task. It is a bundle of tasks that punish different mistakes.

A detection agent can be useful while staying conservative. It asks: "Is this cell suspicious?" A repair agent must write a replacement value. That second action has a blast radius. If the critic invents a reason to change a clean value, the generator can launder the hallucination into your warehouse.

This is the trap in many agent demos. A second model response looks like review. In production, it is only review if the critic has a better signal than the generator. Otherwise you bought a more expensive coin flip with meeting notes.

The result also fits the older data cleaning literature better than the current agent discourse does. HoloClean, introduced by Theodoros Rekatsinas, Xu Chu, Ihab Ilyas, and Christopher Ré in 2017, repaired inconsistent datasets by combining integrity constraints, external signals, and probabilistic inference, reporting about 90 percent precision and more than 76 percent recall across datasets. Raha, a SIGMOD 2019 system from Ziawasch Abedjan and collaborators, beat prior error detection techniques with no more than 20 labeled tuples per dataset.

Those systems were not glamorous. They had the right instinct: data repair needs evidence, not another opinion.

Debate is cheap until it touches your source of truth

The appeal of multi-agent debate is obvious. A 2023 paper from Yilun Du, Shuang Li, Antonio Torralba, Joshua Tenenbaum, and Igor Mordatch described multiple model instances proposing answers, debating reasoning, and converging on a final response, with reported improvements in math, strategy, and factuality tasks. The same mood drove Self-Refine and Reflexion: let a model critique its own work, carry feedback forward, and improve without retraining.

For coding interviews and puzzle benchmarks, that can be enough to justify extra tokens. For data cleaning, the economics are uglier.

Poor data quality already has a measurable business cost. Gartner says bad data costs organizations at least $12.9 million per year on average, and ties data quality directly to AI and machine learning use cases. (gartner.com) That number is broad, but the mechanism is painfully specific: a bad customer record gets merged, a duplicate vendor survives, a SKU mapping shifts, a fraud model trains on the wrong label, or a BI dashboard quietly moves from wrong to authoritative.

Now add multi-agent debate to that path.

If your pipeline uses an LLM to normalize merchant names, resolve entities, infer missing categories, or repair malformed addresses, a naive critic introduces three costs at once:

Token and latency cost: two or more agents consume more context and wall-clock time before a row lands.
Operational cost: every disagreement needs a policy, a log, a retry budget, and often a human escalation path.
Data risk: an accepted false critique can convert a clean value into a dirty one, which is worse than failing to repair.

That last point is the one to tape above the roadmap. In a search or chat product, the model can be wrong and the user may recover. In a data pipeline, the wrong value often becomes training data, a metric, a join key, or a finance input. The error compounds quietly.

This is why the paper’s detection result is more interesting than the failure headline. A critic that flags suspicious cells can raise recall without taking the pen away from your source of truth. It can produce candidates, confidence scores, and evidence. The repair step should have a higher bar.

If you are building this today, the architecture should look less like a debate club and more like a database change-management system:

Put the critic behind retrieval, profiling, constraints, or code execution.
Require cited evidence for every proposed repair.
Separate "flag" from "fix" in the API.
Log the original value, candidate repair, evidence, model version, prompt version, and confidence.
Sample accepted repairs for human audit, especially after schema drift or vendor changes.
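A minimal sketch of the checklist above, assuming an in-memory store: `flag` and `fix` are separate entry points, `fix` refuses any repair without evidence, and every proposal is logged with the fields listed. The class names and the 0.9 threshold are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RepairRecord:
    cell: str
    original_value: str
    candidate_repair: str
    evidence: tuple        # cited evidence strings; empty means unsupported
    model_version: str
    prompt_version: str
    confidence: float


class CleaningApi:
    def __init__(self):
        self.flags = []      # reversible: suspicious cells, nothing changed
        self.audit_log = []  # every proposed repair, accepted or rejected

    def flag(self, cell: str, reason: str) -> None:
        """Mark a cell as suspicious without touching its value."""
        self.flags.append({"cell": cell, "reason": reason})

    def fix(self, record: RepairRecord, min_confidence: float = 0.9) -> bool:
        """Accept a repair only if it cites evidence and clears the threshold."""
        accepted = bool(record.evidence) and record.confidence >= min_confidence
        self.audit_log.append({**asdict(record), "accepted": accepted})
        return accepted
```

Rejected repairs stay in the log on purpose. They are the cheapest source of audit samples after a schema or vendor change.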

That sounds boring. Boring is the point.

The paper’s successful configuration used adversarial separation, code-execution grounding, and evidence-gated generation. The important phrase is evidence-gated. The critic should not be a louder generator. It should be a narrower agent with tools that let it prove something about the table.

We have argued before that many agentic AI projects die before they ship because they skip the unromantic parts: evaluation, ownership, and failure budgets. This result belongs in the same folder as /agentic-cancellation-cliff/: the problem is rarely that agents cannot talk. The problem is that talk is not control.

The roadmap move is to split the agent before it argues

The practical lesson is not "never use multi-agent debate." The paper gives a cleaner rule: debate helps when the chance of rescuing a wrong output, weighted by fixability, exceeds the chance of destroying a correct one. The authors say that condition predicted all nine task types in their study and generalized with zero false positives across 19 published comparisons in seven domains.
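The rule can be written down as a one-line check. This is one reading of the sentence above, not the paper's own formula: treating "weighted by" as multiplication is an assumption, and the three inputs are estimates you would have to measure on your own tasks.

```python
def debate_helps(p_rescue: float, fixability: float, p_destroy: float) -> bool:
    """True when the fixability-weighted rescue chance exceeds the destruction chance.

    p_rescue:   estimated chance the critic rescues a wrong output
    fixability: estimated share of wrong outputs that can be fixed at all
    p_destroy:  estimated chance the critic destroys a correct output
    """
    for name, value in (("p_rescue", p_rescue),
                        ("fixability", fixability),
                        ("p_destroy", p_destroy)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return p_rescue * fixability > p_destroy
```

Under this reading, a task where rescue is likely but few errors are fixable can still fail the test, which is consistent with repair faring worse than detection.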

For a builder, that becomes a product spec.

Start by classifying each data cleaning action by reversibility and evidence quality. Detection is reversible. Suggesting a repair is partly reversible. Auto-writing the repair into a warehouse table is a production change. Treat those as different permission levels, not one agent flow with extra prompts.

A useful version might have four components:

Profiler: computes distributions, null rates, uniqueness, common formats, and drift.
Generator: proposes candidate fixes only when asked.
Critic: checks candidates against code, constraints, dictionaries, and row-level evidence.
Gatekeeper: applies policy, confidence thresholds, and human review rules.
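The profiler is the least exotic of the four and can start as plain Python. The sketch below covers null rate, uniqueness, and common formats for a single column. Drift detection is left out, and the `shape` signature (digits become 9, letters become A) is an illustrative convention, not something the paper specifies.

```python
import re
from collections import Counter


def shape(value: str) -> str:
    """Collapse a value into a format signature: digits -> 9, letters -> A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))


def profile_column(values: list) -> dict:
    """Compute null rate, uniqueness, and the most common formats for one column."""
    total = len(values)
    present = [v for v in values if v not in (None, "")]
    shapes = Counter(shape(str(v)) for v in present)
    return {
        "null_rate": (total - len(present)) / total if total else 0.0,
        "uniqueness": len(set(present)) / len(present) if present else 0.0,
        "common_formats": shapes.most_common(3),
    }
```

A profile like this gives the critic something external to check against: a value whose shape appears once in a column of thousands is evidence, not opinion.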

The critic and generator should not share the same tool view. The paper’s factorial result says adversarial separation mattered. That makes intuitive sense. If both agents see the same prompt and no external evidence, you have duplicated the model’s priors. If the critic can run code against the table, inspect neighboring rows, or test a constraint, it brings a different source of information.

A small implementation detail carries a lot of weight here: make the critic return structured evidence, not prose. For example:

{
  "cell": "orders[18422].zip_code",
  "claim": "value conflicts with city and state",
  "evidence": ["city=Seattle", "state=WA", "zip=02139"],
  "proposed_action": "flag_for_review",
  "confidence": 0.91
}

That schema changes the behavior of the whole system. It gives you test fixtures. It gives analysts something to audit. It lets you reject critiques that contain no evidence. It also makes it easier to compare the agent against old-fashioned baselines such as constraints, dictionaries, and statistical outlier checks.
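Rejecting evidence-free critiques can be a few lines of validation in front of the generator. The sketch below assumes the critic emits JSON with the five fields shown above. The `propose_repair` action name is hypothetical (only `flag_for_review` appears in the example), and the checks are a starting point, not a complete schema.

```python
import json

REQUIRED_FIELDS = {"cell", "claim", "evidence", "proposed_action", "confidence"}
ALLOWED_ACTIONS = {"flag_for_review", "propose_repair"}  # second name is hypothetical


def validate_critique(raw: str) -> tuple:
    """Return (ok, reason). Reject critiques that are malformed or cite no evidence."""
    try:
        critique = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(critique, dict):
        return False, "not an object"
    missing = REQUIRED_FIELDS - critique.keys()
    if missing:
        return False, f"missing fields: {sorted(missing)}"
    if not isinstance(critique["evidence"], list) or not critique["evidence"]:
        return False, "no evidence"
    if critique["proposed_action"] not in ALLOWED_ACTIONS:
        return False, "unknown action"
    conf = critique["confidence"]
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        return False, "confidence out of range"
    return True, "ok"
```

Because the function returns a reason string, rejected critiques double as test fixtures: each one documents a way the critic failed to prove its claim.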

For the business side, this is where cost discipline enters. Do not spend multi-agent tokens on every row. Spend them on high-value uncertainty: records tied to revenue recognition, fraud, compliance, enterprise customer identity, or training labels for a model that affects money. A 5.3 point repair gain matters if it lands on the right cells. It is waste if you apply it to low-risk formatting issues that a deterministic parser already handles.

For the engineering side, the key metric is not agent win rate in isolation. Track clean-to-dirty conversion rate: how often the system changes a value that was already correct. That is the metric naive debate can worsen, and it is the metric many demos hide.
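Clean-to-dirty conversion rate is easy to compute once you have ground truth for a sample. The sketch below assumes three aligned lists of cell values (before cleaning, after cleaning, and the known-correct value). The function name and signature are illustrative.

```python
def clean_to_dirty_rate(before: list, after: list, truth: list) -> float:
    """Share of originally correct cells that the system changed to a wrong value."""
    if not (len(before) == len(after) == len(truth)):
        raise ValueError("before, after, and truth must align cell by cell")
    # Keep only cells that were already correct before the system ran.
    clean = [(a, t) for b, a, t in zip(before, after, truth) if b == t]
    if not clean:
        return 0.0
    return sum(1 for a, t in clean if a != t) / len(clean)
```

The denominator is the set of cells that started correct, not all cells. Dividing by all cells would let a system with many successful repairs hide the damage it did elsewhere.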

The next benchmark should charge rent for bad repairs

There are caveats. This is a new arXiv preprint, not a peer-reviewed verdict. The abstract gives the high-level numbers, while production systems will depend on datasets, schemas, tool access, and the cost of human review. A customer-support table, a medical registry, and an ad-click stream do not have the same tolerance for false repairs.

Still, the paper points to a better evaluation standard.

Benchmarks should stop rewarding agents only for final cleaned accuracy. They should charge for every unnecessary modification, every unsupported critique, and every repair that violates lineage. In data cleaning, the model is not just answering a question. It is proposing a database mutation.

The next useful benchmark would report at least five numbers:

Detection F1.
Repair precision.
Repair recall.
Clean-to-dirty conversion rate.
Cost per accepted repair, including tokens and human review.

That last number matters more in 2026 than it did when older cleaning systems were designed. Agentic workflows can hide compute in orchestration. A three-agent loop that retries twice can turn a cheap cleaning step into a budget leak. If the only metric is F1, the agent will happily spend your margin.
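The arithmetic behind that budget leak is short enough to sketch. The function below assumes the worst case where every retry is used, so a three-agent loop with two retries makes 3 x (1 + 2) = 9 model calls per row. All prices and rates in the usage note are made-up inputs, not measured figures.

```python
def cost_per_accepted_repair(rows: int, agents: int, retries: int,
                             cost_per_call: float, review_rate: float,
                             cost_per_review: float, accepted_repairs: int) -> float:
    """Token cost plus human-review cost, divided by accepted repairs.

    Worst-case assumption: every row uses every retry.
    """
    if accepted_repairs <= 0:
        raise ValueError("accepted_repairs must be positive")
    calls = rows * agents * (1 + retries)
    token_cost = calls * cost_per_call
    review_cost = rows * review_rate * cost_per_review
    return (token_cost + review_cost) / accepted_repairs
```

With hypothetical inputs of 10,000 rows, 3 agents, 2 retries, 0.002 per call, a 5 percent review rate at 1.50 per review, and 400 accepted repairs, the loop makes 90,000 calls and each accepted repair costs about 2.3. The same rows through one agent with no retries make 10,000 calls.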

The smart bet is grounded critics for data cleaning, not open-ended debate transcripts as a control layer.

The distinction is small in a demo and huge in production. The winning critic is less like a brilliant colleague riffing in a meeting and more like a fussy reviewer with a SQL console, a constraint file, and permission to say no.

The data warehouse does not care who won the argument

A debate that fixes errors is useful. A debate that manufactures doubt is a liability with better formatting.

For data cleaning, the safest multi-agent system may be the least theatrical one: one agent proposes, one agent verifies with tools, and the database only moves when the evidence clears the gate.