<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>Data Today</title>
  <subtitle>The latest in data &amp; AI, told like a newsroom.</subtitle>
  <link href="https://data-today.net/feed.xml" rel="self" />
  <link href="https://data-today.net/" />
  <updated>2026-06-03T00:00:00Z</updated>
  <id>https://data-today.net/</id>
  <author>
    <name>Data Today Newsroom</name>
  </author>
  <entry>
    <title>Inference got cheap, but not for the work that pays</title>
    <link href="https://data-today.net/inference-price-uneven/" />
    <updated>2026-05-04T00:00:00Z</updated>
    <id>https://data-today.net/inference-price-uneven/</id>
    <content type="html">&lt;p&gt;The headline number is real and it is huge: the price to run a language model at
a fixed level of performance has fallen about 40 times per year. The number that
matters for your budget is the spread behind it. Epoch AI puts the decline
between 9 and 900 times per year depending on which capability milestone you
measure. The cheap collapse and the expensive plateau are the same trend seen
from two ends.&lt;/p&gt;
&lt;p&gt;Easy work fell off a cliff. Reaching GPT-3.5 quality on general knowledge went
from a premium product to a rounding error, dropping by hundreds of times a year
as small open models caught the old frontier. Hard work moved slowly. Holding
PhD-level science accuracy steady has become cheaper by only about 9 times a
year, because staying at that bar still needs a large, current model.&lt;/p&gt;
&lt;h2 id=&quot;read-the-gap-not-the-average&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/inference-price-uneven/#read-the-gap-not-the-average&quot;&gt;&lt;span&gt;Read the gap, not the average&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;pre class=&quot;language-python&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;# Annual price-decline factors at three capability milestones (Epoch AI)&lt;/span&gt;
milestones &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;easy_general&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;900&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;mid_range&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;40&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;frontier_reasoning&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;9&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# Cost after two years, relative to today, for a fixed quality bar&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; name&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; factor &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; milestones&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;items&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    cost_in_2_years &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;/&lt;/span&gt; factor&lt;span class=&quot;token operator&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;&lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;name&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token format-spec&quot;&gt;18s&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;cost_in_2_years&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token format-spec&quot;&gt;.6f&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt; of today&#39;s price&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output tells the planning story. Anything sitting on the 900x line is nearly
free to serve next year, so do not architect around its cost. Anything on the 9x
line stays expensive, so that is where caching, routing, and smaller fallback
models earn their keep.&lt;/p&gt;
&lt;h2 id=&quot;what-it-changes-for-buyers&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/inference-price-uneven/#what-it-changes-for-buyers&quot;&gt;&lt;span&gt;What it changes for buyers&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task class&lt;/th&gt;
&lt;th&gt;Annual price drop&lt;/th&gt;
&lt;th&gt;Planning move&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classification, extraction, summaries&lt;/td&gt;
&lt;td&gt;up to 900x&lt;/td&gt;
&lt;td&gt;Assume near-zero cost soon&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;General chat, drafting, retrieval&lt;/td&gt;
&lt;td&gt;about 40x&lt;/td&gt;
&lt;td&gt;Route to mid-size models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-step reasoning, hard science&lt;/td&gt;
&lt;td&gt;about 9x&lt;/td&gt;
&lt;td&gt;Budget for it, cache aggressively&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The lesson connects to the broader
&lt;a href=&quot;https://data-today.net/everyone-automated-everything/&quot;&gt;collapse in token prices&lt;/a&gt;: falling averages
hide where money still goes. Spend your optimisation effort on the flat line, not
the steep one. The milestone-by-milestone data is published by
&lt;a href=&quot;https://epoch.ai/data-insights/llm-inference-price-trends&quot;&gt;Epoch AI&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;task-routing-is-the-new-cost-discipline&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/inference-price-uneven/#task-routing-is-the-new-cost-discipline&quot;&gt;&lt;span&gt;Task routing is the new cost discipline&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;When price declines vary by task, the cheapest architecture is rarely one model
for everything. A good system routes easy work to small models, sends ambiguous
or high-value work to stronger models, and escalates only when the expected gain
justifies the cost. The routing layer becomes as important as the model choice.&lt;/p&gt;
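&lt;p&gt;A minimal sketch of such a routing layer, assuming three hypothetical model tiers with made-up per-call prices. The tier names, the 0 to 1 difficulty score, and the 100x escalation margin are illustrative assumptions, not published figures.&lt;/p&gt;

```python
# Hypothetical tiers and per-call prices in USD; replace with your own.
TIER_COST = {"small": 0.0004, "mid": 0.004, "frontier": 0.04}

# Assumed margin: escalate only if the answer is worth 100x the call price.
ESCALATION_MARGIN = 100


def route(difficulty, value_usd):
    """Pick a tier from a 0-1 difficulty score and the value of a good answer."""
    if difficulty >= 0.7 and value_usd >= ESCALATION_MARGIN * TIER_COST["frontier"]:
        return "frontier"
    if difficulty >= 0.3 and value_usd >= ESCALATION_MARGIN * TIER_COST["mid"]:
        return "mid"
    return "small"


print(route(0.1, 0.05))   # easy, low-value request
print(route(0.5, 1.00))   # ambiguous, moderate value
print(route(0.9, 50.00))  # hard, high value
```

&lt;p&gt;Note that a hard but low-value request still lands on a cheaper tier here: escalation depends on expected gain, not on difficulty alone.&lt;/p&gt;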
&lt;p&gt;This is a practical change for product teams. Classification, extraction,
formatting, and first-pass summarization can often run on cheaper models with
strict validation. Multi-step reasoning, high-stakes advice, and scientific or
legal analysis may need a frontier model plus human review. Treating both groups
as the same AI workload wastes money and hides risk.&lt;/p&gt;
&lt;p&gt;The routing policy should be measurable. Track cost per accepted answer, error
rate by task class, fallback frequency, and human-review minutes. If a smaller
model produces answers that reviewers reject, its apparent savings are false. If
it clears routine work reliably, it protects the budget for harder calls.&lt;/p&gt;
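&lt;p&gt;As a rough illustration, cost per accepted answer can be computed as model spend plus review labor, divided by the answers reviewers keep. The call counts, prices, and reviewer rate below are hypothetical.&lt;/p&gt;

```python
def cost_per_accepted(calls, accepted, price_per_call,
                      review_minutes, reviewer_rate_per_hour):
    """Model spend plus human-review spend, divided by accepted answers."""
    if accepted == 0:
        return float("inf")
    model_cost = calls * price_per_call
    review_cost = review_minutes / 60 * reviewer_rate_per_hour
    return (model_cost + review_cost) / accepted


# Hypothetical month: the small model is 20x cheaper per call, but
# reviewers reject more of its output and spend longer checking it.
small = cost_per_accepted(calls=1000, accepted=600, price_per_call=0.001,
                          review_minutes=300, reviewer_rate_per_hour=60)
large = cost_per_accepted(calls=1000, accepted=950, price_per_call=0.02,
                          review_minutes=60, reviewer_rate_per_hour=60)

print(f"small model: ${small:.3f} per accepted answer")
print(f"large model: ${large:.3f} per accepted answer")
```

&lt;p&gt;With these made-up inputs the cheaper model costs more per accepted answer, because review time dominates the bill.&lt;/p&gt;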
&lt;h2 id=&quot;falling-costs-can-increase-total-spend&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/inference-price-uneven/#falling-costs-can-increase-total-spend&quot;&gt;&lt;span&gt;Falling costs can increase total spend&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Cheaper inference does not guarantee a smaller bill. Lower prices often increase
usage. Teams add AI to more screens, run more background analysis, and ask for
more drafts because each one feels inexpensive. The unit cost falls while total
volume rises. That is how a budget can grow during a price collapse.&lt;/p&gt;
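&lt;p&gt;The arithmetic is simple enough to sketch. The prices and call volumes here are invented for illustration only.&lt;/p&gt;

```python
def monthly_bill(unit_price_usd, calls):
    """Total spend is unit price times volume."""
    return unit_price_usd * calls


# Hypothetical: the unit price falls 10x while usage grows 25x.
bill_before = monthly_bill(0.010, 100_000)
bill_after = monthly_bill(0.001, 2_500_000)

print(f"before: ${bill_before:,.0f}  after: ${bill_after:,.0f}")
print(f"bill changed by {bill_after / bill_before:.1f}x")
```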
&lt;p&gt;The effect is strongest for work that was previously skipped. A company that
used to summarize only high-value documents may start summarizing everything.
A support team may generate suggested replies for every message rather than only
complex cases. A data team may run nightly interpretations across dashboards
that no analyst had time to inspect before.&lt;/p&gt;
&lt;p&gt;That expansion can be valuable, but it needs a budget owner. The right question
is not whether each call is cheap. It is whether the new volume changes a
decision, reduces labor, improves quality, or creates a product feature users
will pay for. Without that discipline, the cost decline becomes a demand engine
with no governor.&lt;/p&gt;
&lt;h2 id=&quot;what-to-put-in-the-dashboard&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/inference-price-uneven/#what-to-put-in-the-dashboard&quot;&gt;&lt;span&gt;What to put in the dashboard&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;An inference dashboard should separate price from mix. Show total tokens, cost
per task class, model used, cache hit rate, retry rate, latency, and accepted
outputs. Then split the view by easy, medium, and frontier workloads. The blend
is what matters because the average can improve while the expensive class grows
quietly.&lt;/p&gt;
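&lt;p&gt;A small sketch shows how the average can mislead. The two workload classes, call counts, and prices are hypothetical.&lt;/p&gt;

```python
def summarize(workloads):
    """workloads maps a class name to (calls, price_per_call_usd).

    Returns the average cost per call and the spend per class.
    """
    total_calls = sum(calls for calls, _ in workloads.values())
    spend = {name: calls * price for name, (calls, price) in workloads.items()}
    return sum(spend.values()) / total_calls, spend


# Hypothetical quarters: both prices fall and easy volume grows 5x,
# but frontier volume also grows 4x.
last_quarter = {"easy": (1_000_000, 0.001), "frontier": (50_000, 0.05)}
this_quarter = {"easy": (5_000_000, 0.0001), "frontier": (200_000, 0.04)}

avg_then, spend_then = summarize(last_quarter)
avg_now, spend_now = summarize(this_quarter)

print(f"average cost per call: {avg_then:.5f} then, {avg_now:.5f} now")
print(f"frontier spend: {spend_then['frontier']:,.0f} then, "
      f"{spend_now['frontier']:,.0f} now")
```

&lt;p&gt;In this example the average cost per call roughly halves while frontier spend more than triples, which is the pattern a split view is meant to expose.&lt;/p&gt;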
&lt;p&gt;Caching deserves its own line. Many high-cost calls repeat stable context,
system instructions, or reference material. Prompt caching, retrieval caching,
and answer reuse can lower cost without changing model quality. For hard tasks
on the 9x line, those operational savings may matter more than waiting for the
next price cut.&lt;/p&gt;
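&lt;p&gt;To put a number on that, here is a sketch of expected input cost when a stable prefix is sometimes served from a prompt cache. The discount for cached tokens (10 percent of full price), the token counts, and the price are assumptions; actual cache pricing varies by provider.&lt;/p&gt;

```python
def call_cost(prefix_tokens, fresh_tokens, price_per_mtok,
              hit_rate, cached_discount=0.1):
    """Expected input cost of one call in USD.

    cached_discount is the assumed fraction of full price charged for
    prefix tokens that are served from the cache.
    """
    full_prefix = prefix_tokens / 1_000_000 * price_per_mtok
    cached_prefix = full_prefix * cached_discount
    expected_prefix = hit_rate * cached_prefix + (1 - hit_rate) * full_prefix
    return expected_prefix + fresh_tokens / 1_000_000 * price_per_mtok


# Hypothetical hard task: 200k tokens of stable reference material,
# 2k tokens of new question text, 3 USD per million input tokens.
no_cache = call_cost(200_000, 2_000, 3.0, hit_rate=0.0)
with_cache = call_cost(200_000, 2_000, 3.0, hit_rate=0.9)

print(f"no cache: ${no_cache:.3f}  90% hit rate: ${with_cache:.3f}")
```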
&lt;p&gt;The durable lesson is that inference is no longer one market. Commodity language
work is racing toward near-zero marginal cost. Frontier reasoning remains a
premium input. The companies that understand the split will make AI feel cheap
to users without letting the hardest tasks quietly consume the margin.&lt;/p&gt;
&lt;h2 id=&quot;the-procurement-question&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/inference-price-uneven/#the-procurement-question&quot;&gt;&lt;span&gt;The procurement question&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Buyers should ask vendors to price the actual workload mix, not an average token
bundle. A support assistant heavy on classification and retrieval should be
priced differently from a scientific reasoning tool. A drafting product with
high cache reuse should not carry the same cost model as a long-context analyst.
The benchmark is cost per accepted answer.&lt;/p&gt;
&lt;p&gt;That metric forces hidden costs into the open. It includes retries, review,
latency, caching, and model escalation. It also rewards systems that use cheaper
models well. The next phase of inference competition will be less about a single
headline token price and more about routing the right task to the right level of
capability.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>The world&#39;s AI compute now doubles every seven months</title>
    <link href="https://data-today.net/compute-stock-doubling/" />
    <updated>2026-05-06T00:00:00Z</updated>
    <id>https://data-today.net/compute-stock-doubling/</id>
    <content type="html">&lt;p&gt;The supply of AI compute is compounding faster than almost any other industrial
input in history. Epoch AI estimates the total computing power of the installed
stock of AI chips is growing 3.4 times per year, a doubling every seven months,
based on revenue data, financial disclosures, and analyst reports. By early 2025
the cumulative stock had passed the equivalent of roughly 16 million H100 chips.&lt;/p&gt;
&lt;p&gt;That pace reframes the bottleneck. When capacity doubles nearly twice a year, the
constraint stops being silicon on a wafer and becomes the power and the buildings
to run it. A single gigawatt of facility power now costs about 30 billion dollars
to stand up, and the largest sites are measured in gigawatts.&lt;/p&gt;
&lt;h2 id=&quot;what-a-seven-month-double-means&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/compute-stock-doubling/#what-a-seven-month-double-means&quot;&gt;&lt;span&gt;What a seven-month double means&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The intuition breaks once you compound it. At &lt;strong&gt;3.4 times a year&lt;/strong&gt;, capacity
doubles about every seven months, and three years of that growth is roughly a
40-fold increase. Any capacity plan written on a two-year horizon is planning for
a world with an order of magnitude more compute than the one it was drafted in.&lt;/p&gt;
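&lt;p&gt;The compounding can be checked in a few lines. The only input is the 3.4x annual growth estimate quoted above; the rest is arithmetic.&lt;/p&gt;

```python
import math

GROWTH_PER_YEAR = 3.4  # annual growth factor quoted from Epoch AI

# Doubling time in months: solve GROWTH_PER_YEAR ** (t / 12) == 2 for t.
doubling_months = 12 * math.log(2) / math.log(GROWTH_PER_YEAR)
print(f"doubling time: {doubling_months:.1f} months")

for years in (1, 2, 3):
    print(f"after {years} year(s): {GROWTH_PER_YEAR ** years:.1f}x the compute")
```

&lt;p&gt;The doubling time comes out just under seven months, two years gives a bit more than a tenfold increase, and three years gives roughly fortyfold.&lt;/p&gt;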
&lt;h2 id=&quot;the-constraint-moved-downstream&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/compute-stock-doubling/#the-constraint-moved-downstream&quot;&gt;&lt;span&gt;The constraint moved downstream&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Chips:&lt;/strong&gt; plentiful relative to the past, and improving in
&lt;a href=&quot;https://data-today.net/hardware-value-per-dollar/&quot;&gt;performance per dollar&lt;/a&gt; by about 37 percent a
year.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power:&lt;/strong&gt; doubling annually per training run, now the gating resource.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Buildings:&lt;/strong&gt; gigawatt campuses take about two years to build.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The strategic question for any team is no longer whether compute will be
available. It is whether the grid connection and the data center will be ready
when the chips arrive. The underlying estimates come from
&lt;a href=&quot;https://epoch.ai/data-insights/ai-chip-production&quot;&gt;Epoch AI&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;why-installed-stock-matters-more-than-annual-shipments&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/compute-stock-doubling/#why-installed-stock-matters-more-than-annual-shipments&quot;&gt;&lt;span&gt;Why installed stock matters more than annual shipments&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Chip shipment headlines are useful, but the installed stock is the capacity that
actually changes product economics. A delivered accelerator only becomes useful
after it is mounted, powered, cooled, networked, scheduled, and connected to the
software stack that can keep it busy. The lag between shipment and productive
capacity is why the physical buildout matters as much as the semiconductor
roadmap.&lt;/p&gt;
&lt;p&gt;Stock also compounds differently from sales. A strong shipment year adds to the
machines already running, so the service capacity available to model providers
can rise even faster than a single year&#39;s revenue suggests. That is the reason a
seven-month doubling is such a violent planning signal. It means the market is
not simply replacing old parts with new ones. It is adding layers of usable
capacity at industrial speed.&lt;/p&gt;
&lt;p&gt;For builders, the installed-stock view explains why yesterday&#39;s expensive
capability becomes tomorrow&#39;s standard feature. More total compute means more
competition among providers, more room for batching, and more pressure to fill
idle capacity. Those forces help push inference prices down, especially for
workloads that no longer need the newest frontier model.&lt;/p&gt;
&lt;h2 id=&quot;the-bottleneck-shifts-by-layer&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/compute-stock-doubling/#the-bottleneck-shifts-by-layer&quot;&gt;&lt;span&gt;The bottleneck shifts by layer&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Every doubling moves the constraint to a different layer. At first the question
is whether enough chips exist. Then it is whether the data center shell is ready.
Then it is whether the site has enough power, cooling, networking, and technical
staff. Finally it is whether there is enough demand to use the cluster at high
utilization. A market can be short at one layer and long at another.&lt;/p&gt;
&lt;p&gt;That layer shift matters for forecasts. A chip analyst may see supply improving
while a cloud customer still cannot get the instance type they want in the
region they need. A utility may see years of interconnection queues while a
model company announces capacity targets that assume the power arrives on time.
Both views can be accurate. They describe different points in the same pipeline.&lt;/p&gt;
&lt;p&gt;The safest reading is that AI capacity is becoming less like a software release
and more like a logistics system. The slowest dependency sets the effective
speed. If transformers, substations, permitting, or skilled labor fall behind,
the theoretical compute stock will not translate into cheap reliable service.&lt;/p&gt;
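&lt;p&gt;One way to make the slowest-dependency idea concrete is to treat each layer as a capacity ceiling and take the minimum. The layer names and megawatt-equivalent figures are invented for illustration.&lt;/p&gt;

```python
def binding_constraint(layers):
    """Return the (layer, capacity) pair that caps usable capacity."""
    name = min(layers, key=layers.get)
    return name, layers[name]


# Hypothetical site, with every layer expressed in megawatts of
# compute it could support.
layers_mw = {
    "chips delivered": 1200,
    "building shell": 1000,
    "grid power": 400,
    "cooling": 900,
}

layer, capacity = binding_constraint(layers_mw)
print(f"usable capacity: {capacity} MW, limited by {layer}")
```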
&lt;h2 id=&quot;what-users-should-expect&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/compute-stock-doubling/#what-users-should-expect&quot;&gt;&lt;span&gt;What users should expect&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Most users will not buy clusters directly, but they will feel the stock doubling
in product behavior. More capacity should mean larger context windows, cheaper
batch jobs, faster response times at off-peak hours, and more willingness from
vendors to bundle AI features into ordinary software plans. The change will feel
gradual at the interface and dramatic in the infrastructure budget.&lt;/p&gt;
&lt;p&gt;The effect will be uneven. Frontier reasoning will still consume scarce premium
capacity, while summarization, extraction, classification, and routine drafting
will be pushed onto cheaper models and older hardware. That is why a single
average price decline can mislead. The market is segmenting by task difficulty
as fast as it is expanding.&lt;/p&gt;
&lt;p&gt;Companies planning AI adoption should therefore track both capacity and workload
mix. If their tasks sit on the commodity side, waiting can reduce cost quickly.
If they need frontier reasoning, they should plan for premium capacity and
scarce regional availability. The compute stock is exploding, but the useful
question remains which slice of that stock your workload can actually use.&lt;/p&gt;
&lt;h2 id=&quot;the-procurement-lesson&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/compute-stock-doubling/#the-procurement-lesson&quot;&gt;&lt;span&gt;The procurement lesson&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Procurement teams should avoid locking every workload to the newest accelerator.
A seven-month doubling means older capacity becomes more available quickly, and
many applications do not need the top bin. The better contract separates
baseline workloads, burst workloads, and frontier workloads. That gives buyers
room to move routine inference onto cheaper capacity while reserving premium
clusters for the tasks that justify them.&lt;/p&gt;
&lt;p&gt;The same logic applies to internal forecasts. Capacity plans should be revisited
quarterly, not annually, because the supply curve is moving too fast for old
assumptions. A budget built around last year&#39;s scarcity may overpay. A plan that
ignores power and regional access may still under-deliver.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Context windows grew 30x a year. Your retrieval stack noticed</title>
    <link href="https://data-today.net/context-windows-30x/" />
    <updated>2026-05-09T00:00:00Z</updated>
    <id>https://data-today.net/context-windows-30x/</id>
    <content type="html">&lt;p&gt;A model&#39;s working memory has grown faster than almost any other capability.
Epoch AI puts the growth in context window size at about 30 times per year since
2023, taking the maximum from a few thousand tokens to well over a million. A full
codebase, a quarter of legal filings, or a book now fits in a single prompt.&lt;/p&gt;
&lt;p&gt;That shift changes the engineering calculus around retrieval. For two years the
standard answer to a long document was to chop it into chunks, embed them, and
fetch the closest few at query time. When the window holds a million tokens, the
question becomes which problems still need that machinery at all.&lt;/p&gt;
&lt;h2 id=&quot;when-to-stop-chunking&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/context-windows-30x/#when-to-stop-chunking&quot;&gt;&lt;span&gt;When to stop chunking&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;pre class=&quot;language-python&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;strategy&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;doc_tokens&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; window&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1_000_000&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; budget_per_query_usd&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0.03&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
             price_per_mtok&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0.4&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    cost_full &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; doc_tokens &lt;span class=&quot;token operator&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1_000_000&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt; price_per_mtok
    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; doc_tokens &lt;span class=&quot;token operator&quot;&gt;&amp;lt;=&lt;/span&gt; window &lt;span class=&quot;token keyword&quot;&gt;and&lt;/span&gt; cost_full &lt;span class=&quot;token operator&quot;&gt;&amp;lt;=&lt;/span&gt; budget_per_query_usd&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;stuff the whole document&quot;&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;retrieve: corpus too big or too costly to stuff every call&quot;&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;strategy&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;120_000&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;     &lt;span class=&quot;token comment&quot;&gt;# a long report: fits the window, but $0.048 exceeds the $0.03 budget&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;strategy&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;40_000_000&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;token comment&quot;&gt;# a full corpus&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The rule is mundane and that is the point. If the source fits the window and the
token cost fits the budget, stuffing usually beats retrieval on engineering
effort and often on accuracy. Retrieval earns its complexity when the corpus is
far larger than any single window or when per-query cost rules out sending
everything.&lt;/p&gt;
&lt;h2 id=&quot;what-still-breaks-at-length&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/context-windows-30x/#what-still-breaks-at-length&quot;&gt;&lt;span&gt;What still breaks at length&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Attention dilution:&lt;/strong&gt; models still lose facts buried mid-context, so
position matters even when everything fits.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost:&lt;/strong&gt; a million-token call is not free, which ties back to
&lt;a href=&quot;https://data-today.net/inference-price-uneven/&quot;&gt;where inference money still goes&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; long prompts are slow prompts, so cache the stable prefix.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Long context did not kill retrieval. It moved the line where retrieval pays for
itself. The context-window trend is tracked by
&lt;a href=&quot;https://epoch.ai/trends&quot;&gt;Epoch AI&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;long-context-changes-failure-modes&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/context-windows-30x/#long-context-changes-failure-modes&quot;&gt;&lt;span&gt;Long context changes failure modes&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Retrieval systems fail when they fetch the wrong chunk or miss the one sentence
that matters. Long-context systems fail differently. They may receive the right
document and still overlook a buried clause, overweight a repeated but less
important section, or mix instructions from unrelated parts of the prompt. The
engineering problem moves from finding text to organizing attention.&lt;/p&gt;
&lt;p&gt;That is why prompt structure matters more as windows grow. A million-token input
should not be treated as a giant paste buffer. Stable instructions, source
summaries, document tables, and explicit citations help the model navigate the
material. Putting the most important constraints at the start and end of the
context can also reduce mid-context loss.&lt;/p&gt;
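&lt;p&gt;A sketch of that structure, assuming a plain-text prompt. The section labels and the helper name &lt;code&gt;build_prompt&lt;/code&gt; are hypothetical, not a standard format.&lt;/p&gt;

```python
def build_prompt(constraints, documents, question):
    """Put constraints first, number each source, and repeat the
    constraints at the end so they are not buried mid-context."""
    parts = ["INSTRUCTIONS:\n" + constraints, "SOURCES:"]
    for index, (title, text) in enumerate(documents, start=1):
        parts.append(f"[{index}] {title}\n{text}")
    parts.append("QUESTION:\n" + question)
    parts.append("REMINDER:\n" + constraints)
    return "\n\n".join(parts)


constraints = "Cite sources by number. Say so if the answer is not in the sources."
documents = [
    ("Master agreement", "...full text..."),
    ("Amendment 2", "...full text..."),
]
prompt = build_prompt(constraints, documents, "What is the notice period?")
print(prompt)
```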
&lt;p&gt;The best systems combine long context with retrieval rather than choosing one
ideology. They retrieve the most relevant documents, then provide enough
surrounding material for the model to reason without guessing. They also cache
static context, so a large policy manual or codebase does not have to be resent
from scratch on every question.&lt;/p&gt;
&lt;h2 id=&quot;cost-turns-architecture-into-policy&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/context-windows-30x/#cost-turns-architecture-into-policy&quot;&gt;&lt;span&gt;Cost turns architecture into policy&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The decision to stuff a document into the prompt is partly technical and partly
economic. If a full-document call costs less than the engineering time required
to maintain a retrieval stack, stuffing is rational. If the same call runs
thousands of times a day, cost and latency can make retrieval necessary even
when the window is large enough.&lt;/p&gt;
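&lt;p&gt;Volume is what flips the answer. This sketch reuses the 120,000-token report and the 0.4 USD per million tokens from the earlier example; the 5,000 calls a day and the 4,000-token retrieved context are assumptions.&lt;/p&gt;

```python
def daily_cost(tokens_per_call, calls_per_day, price_per_mtok):
    """Input-token spend per day in USD."""
    return tokens_per_call / 1_000_000 * price_per_mtok * calls_per_day


CALLS_PER_DAY = 5_000  # assumed traffic
PRICE_PER_MTOK = 0.4   # same price as the earlier example

stuffed = daily_cost(120_000, CALLS_PER_DAY, PRICE_PER_MTOK)  # whole report
retrieved = daily_cost(4_000, CALLS_PER_DAY, PRICE_PER_MTOK)  # a few chunks

print(f"stuff the document: ${stuffed:,.0f} per day")
print(f"retrieve chunks:    ${retrieved:,.0f} per day")
```

&lt;p&gt;The retrieval figure covers input tokens only. Embedding, indexing, and maintenance costs would need to be added for a fair comparison.&lt;/p&gt;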
&lt;p&gt;That trade-off should be visible to product managers. A legal research tool may
accept a slower, more expensive call because the value of a correct answer is
high. A customer-support assistant may need aggressive retrieval and routing
because volume is high and many questions are routine. The right architecture
depends on task value, not only token count.&lt;/p&gt;
&lt;p&gt;The policy layer here is data access. Long context makes it tempting to send
everything. Good systems still apply least privilege. They include the material
needed for the answer, exclude restricted content, and log which sources were
used. Bigger windows increase the need for access control because they make it
easier to move large amounts of sensitive text in one request.&lt;/p&gt;
&lt;h2 id=&quot;what-to-measure-before-rebuilding-rag&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/context-windows-30x/#what-to-measure-before-rebuilding-rag&quot;&gt;&lt;span&gt;What to measure before rebuilding RAG&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Teams considering a long-context rewrite should run a simple bake-off. Choose a
representative set of questions, answer them with the existing retrieval system,
then answer them by placing the full source or a larger source bundle into the
context. Score correctness, citation quality, latency, cost, and review effort.
The winner may vary by question type.&lt;/p&gt;
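&lt;p&gt;A minimal scoring sketch for such a bake-off, covering correctness only. The question types, system names, and graded results are invented, and a real comparison would add the other metrics listed above.&lt;/p&gt;

```python
from collections import defaultdict


def winners_by_type(results):
    """results is a list of (question_type, system, correct) tuples.

    Returns the most accurate system per question type. On a tie,
    the system seen first for that type wins.
    """
    tally = defaultdict(lambda: defaultdict(list))
    for question_type, system, correct in results:
        tally[question_type][system].append(correct)
    winners = {}
    for question_type, systems in tally.items():
        accuracy = {name: sum(marks) / len(marks) for name, marks in systems.items()}
        winners[question_type] = max(accuracy, key=accuracy.get)
    return winners


# Hypothetical graded answers.
results = [
    ("lookup", "retrieval", True), ("lookup", "retrieval", True),
    ("lookup", "long_context", True), ("lookup", "long_context", False),
    ("synthesis", "retrieval", False), ("synthesis", "retrieval", False),
    ("synthesis", "long_context", True), ("synthesis", "long_context", False),
]
print(winners_by_type(results))
```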
&lt;p&gt;They should also measure maintenance burden. Retrieval pipelines need chunking
rules, embedding refreshes, indexes, metadata filters, and monitoring. Long
context needs prompt organization, caching, permission checks, and cost controls.
Neither approach is free. The point is to pay for the failure mode you can
manage.&lt;/p&gt;
&lt;p&gt;Longer windows are a genuine platform shift because they remove whole classes of
retrieval work for documents that fit comfortably inside the budget. They also
raise expectations. Users will assume the model saw the document they provided,
so overlooked facts become less forgivable. Bigger memory makes the system feel
smarter only when the answer proves it used that memory well.&lt;/p&gt;
&lt;h2 id=&quot;the-durable-rule&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/context-windows-30x/#the-durable-rule&quot;&gt;&lt;span&gt;The durable rule&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Use the window when the question depends on one bounded source. Use retrieval
when the question depends on a changing corpus, a large archive, or strict access
rules. Use both when the user needs citations and context around each cited
source. That rule is simple enough for product planning and specific enough to
prevent expensive rewrites driven only by launch announcements.&lt;/p&gt;
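&lt;p&gt;Written as code, the rule might look like the sketch below. The flag names and the order in which the conditions are checked are assumptions made for illustration; the rule itself does not specify a priority.&lt;/p&gt;

```python
def choose_architecture(bounded_source, changing_corpus=False,
                        large_archive=False, strict_access=False,
                        needs_citations=False):
    """Map the rule of thumb to one of: window, retrieval, both."""
    if needs_citations:
        # Citations plus surrounding context call for both mechanisms.
        return "both"
    if changing_corpus or large_archive or strict_access:
        return "retrieval"
    if bounded_source:
        return "window"
    # Assumed default when nothing is known about the source.
    return "retrieval"


print(choose_architecture(bounded_source=True))
print(choose_architecture(bounded_source=False, large_archive=True))
print(choose_architecture(bounded_source=True, needs_citations=True))
```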
&lt;p&gt;The biggest mistake is treating context size as a substitute for information
architecture. A larger window lets more text enter the model. It does not decide
which text deserves attention, which source is authoritative, or which answer is
safe to show.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Last year&#39;s frontier model now runs on a laptop</title>
    <link href="https://data-today.net/consumer-hardware-lag/" />
    <updated>2026-05-12T00:00:00Z</updated>
    <id>https://data-today.net/consumer-hardware-lag/</id>
    <content type="html">&lt;p&gt;The distance between a billion-dollar data center and a laptop is now measured in
months. Epoch AI estimates that frontier AI performance becomes accessible on
consumer hardware within about eight months. What you can only rent from a hosted
API today, you can likely run locally before the year is out.&lt;/p&gt;
&lt;p&gt;That clock matters for anyone building a product on a model. A capability that
feels like a moat the day it ships has a short half-life. Eight months later a
comparable open-weight model fits on a workstation, and the differentiator moves
from having the capability to integrating it well.&lt;/p&gt;
&lt;h2 id=&quot;plan-to-the-lag-not-the-launch&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/consumer-hardware-lag/#plan-to-the-lag-not-the-launch&quot;&gt;&lt;span&gt;Plan to the lag, not the launch&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Treat the lag as a date on your roadmap, not a prophecy. Take a frontier launch,
add &lt;strong&gt;roughly eight months&lt;/strong&gt;, and that is when a local-capable equivalent tends to
show up. If your plan depends on a capability staying hosted-only, that assumption
expires on that clock. If your plan assumes you can eventually run it on customer
hardware for privacy or cost, the same clock tells you when.&lt;/p&gt;
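&lt;p&gt;The roadmap arithmetic above can be sketched in a few lines. This is a minimal illustration, not a forecast: the helper &lt;code&gt;add_months&lt;/code&gt; and the launch date are hypothetical, and the eight-month figure is the rough Epoch AI estimate quoted in this article.&lt;/p&gt;

```python
from datetime import date

LAG_MONTHS = 8  # rough Epoch AI estimate; a planning input, not a guarantee


def add_months(start: date, months: int) -> date:
    """Return the first day of the month that falls `months` after `start`."""
    total = start.year * 12 + (start.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


# Hypothetical launch date, used only for illustration
launch = date(2026, 1, 15)
print("Plan for a local-capable equivalent around:", add_months(launch, LAG_MONTHS))
```

&lt;p&gt;Rounding to the start of the month is deliberate: the estimate is too coarse to justify day-level precision.&lt;/p&gt;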
&lt;h2 id=&quot;why-the-gap-keeps-closing&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/consumer-hardware-lag/#why-the-gap-keeps-closing&quot;&gt;&lt;span&gt;Why the gap keeps closing&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Efficiency:&lt;/strong&gt; algorithms deliver the same quality for about 3 times less
compute each year, so models shrink without losing ground.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open weights:&lt;/strong&gt; strong open models now trail closed ones by a narrow margin,
a story we covered in &lt;a href=&quot;https://data-today.net/open-models-default/&quot;&gt;the open-weight default&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hardware:&lt;/strong&gt; consumer chips keep gaining memory and bandwidth.&lt;/li&gt;
&lt;/ul&gt;
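&lt;p&gt;The efficiency figure in the list above can be compounded over the lag window. The sketch below assumes the gain accrues smoothly through the year, which is a simplification; the function name is illustrative.&lt;/p&gt;

```python
ANNUAL_EFFICIENCY_GAIN = 3.0  # same quality for about 3x less compute per year
LAG_MONTHS = 8                # the consumer-hardware lag discussed in this article


def compute_reduction(months: float, annual_gain: float = ANNUAL_EFFICIENCY_GAIN) -> float:
    """Factor by which required compute shrinks after `months`, assuming smooth compounding."""
    return annual_gain ** (months / 12)


factor = compute_reduction(LAG_MONTHS)
print(f"About {factor:.2f}x less compute after {LAG_MONTHS} months")
```

&lt;p&gt;Under that assumption, algorithmic efficiency alone accounts for roughly a twofold reduction over eight months, which suggests the other two factors in the list carry a meaningful share of the gap.&lt;/p&gt;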
&lt;p&gt;The competitive edge is shifting from raw access to deployment. Build for the
world where the model is a commodity and your product is the wrapper. The
consumer-hardware lag is tracked by &lt;a href=&quot;https://epoch.ai/trends&quot;&gt;Epoch AI&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;local-models-change-the-privacy-bargain&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/consumer-hardware-lag/#local-models-change-the-privacy-bargain&quot;&gt;&lt;span&gt;Local models change the privacy bargain&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The eight-month lag is a technical estimate with a policy consequence. When a task
can run on a laptop or workstation, the default privacy bargain changes. A user
no longer has to send every document, recording, or codebase to a remote API in
order to get useful assistance. That does not make local models perfect, but it
does give product teams a stronger answer for regulated or sensitive workflows.&lt;/p&gt;
&lt;p&gt;Local deployment also changes procurement. A hosted frontier model may be the
right choice for the first version because it is available immediately and
requires no hardware planning. A local-capable equivalent arriving months later
can become the enterprise version, the offline mode, or the high-volume fallback.
The product that plans for both paths has more room to negotiate on cost and
data handling.&lt;/p&gt;
&lt;p&gt;The practical design pattern is tiering. Use the hosted model for tasks that
need the current frontier, use a smaller local model for repetitive or sensitive
tasks, and route between them based on risk and value. That pattern requires
some engineering work up front, but it avoids locking the product to the most
expensive endpoint forever.&lt;/p&gt;
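&lt;p&gt;A minimal sketch of that tiering pattern follows. The &lt;code&gt;Task&lt;/code&gt; fields, the tier names, and the rule that sensitive data always stays local are assumptions made for illustration; a real router would weigh risk and value with more nuance.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    sensitive: bool       # contains data that should stay on the device
    needs_frontier: bool  # requires the strongest available model


def route(task: Task) -> str:
    """Pick a tier. Assumed policy: sensitive data stays local even if quality suffers."""
    if task.sensitive:
        return "local"
    if task.needs_frontier:
        return "hosted"
    return "local"  # repetitive, low-risk work defaults to the cheaper tier


# Hypothetical tasks, used only to show the routing decisions
for task in [
    Task("summarize contract", sensitive=True, needs_frontier=False),
    Task("plan data migration", sensitive=False, needs_frontier=True),
    Task("tag support tickets", sensitive=False, needs_frontier=False),
]:
    print(f"{task.name} -> {route(task)}")
```

&lt;p&gt;Keeping the policy in one small function means it can change as local models improve, without touching the code that calls it.&lt;/p&gt;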
&lt;h2 id=&quot;hardware-is-not-the-whole-constraint&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/consumer-hardware-lag/#hardware-is-not-the-whole-constraint&quot;&gt;&lt;span&gt;Hardware is not the whole constraint&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Running a model locally still depends on memory, quantization, thermal limits,
and user tolerance for latency. A model that technically fits on consumer
hardware may be too slow for an interactive workflow or too large for a
battery-powered device. The phrase &amp;quot;runs on a laptop&amp;quot; covers a wide range of user
experiences.&lt;/p&gt;
&lt;p&gt;This is why local capability arrives in stages. First it is a developer demo on
a high-end machine. Then it becomes a workstation feature. Then it becomes a
normal consumer app feature after model compression, runtime optimization, and
hardware refreshes. The eight-month estimate marks the beginning of practical
access, not the moment every user has the same experience.&lt;/p&gt;
&lt;p&gt;Software polish matters too. A raw model file is not a product. Users need
installation, updates, permissions, model selection, fallback behavior, and clear
signals when a task should be escalated to a stronger remote model. The teams
that win local AI will be the ones that hide the complexity without hiding the
trade-offs.&lt;/p&gt;
&lt;h2 id=&quot;why-hosted-providers-still-matter&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/consumer-hardware-lag/#why-hosted-providers-still-matter&quot;&gt;&lt;span&gt;Why hosted providers still matter&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Local models do not eliminate hosted providers. They change what hosted
providers are used for. Frontier APIs remain valuable for the hardest reasoning,
large-scale batch work, fresh multimodal capabilities, and managed reliability.
They also provide an immediate path for small teams that cannot support model
operations themselves.&lt;/p&gt;
&lt;p&gt;The difference is that hosted access becomes a choice rather than a requirement
for more workloads over time. That weakens moats based only on model access and
strengthens moats based on workflow, data integration, trust, and distribution.
If a competitor can run a similar model locally, the product has to compete on
the job it performs for the user.&lt;/p&gt;
&lt;p&gt;The headline is that frontier capability diffuses quickly down the hardware
stack. For builders, the operational lesson is to design
model-agnostic systems. The model that defines the launch may not be the model
that defines the margin a year later.&lt;/p&gt;
&lt;h2 id=&quot;the-product-question-to-ask-now&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/consumer-hardware-lag/#the-product-question-to-ask-now&quot;&gt;&lt;span&gt;The product question to ask now&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Every roadmap should identify which features become better when they move from
cloud to device. Some candidates are obvious: private writing assistance,
personal search, meeting notes, code review on proprietary repositories, and
offline field work. Others may still belong in the cloud because they need fresh
tools, shared memory, or heavy reasoning.&lt;/p&gt;
&lt;p&gt;The right answer is rarely a permanent choice. Products should be able to shift
tasks across local, private-cloud, and hosted frontier models as cost and
capability change. The eight-month lag makes that flexibility valuable sooner
than many launch plans assume.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Open-weight models closed the gap to 1.7 points</title>
    <link href="https://data-today.net/open-models-default/" />
    <updated>2026-05-15T00:00:00Z</updated>
    <id>https://data-today.net/open-models-default/</id>
    <content type="html">&lt;p&gt;&lt;strong&gt;Dataset:&lt;/strong&gt; the benchmark gap charted here is drawn from &lt;a href=&quot;https://hai.stanford.edu/ai-index/2025-ai-index-report&quot;&gt;Stanford HAI&#39;s 2025 AI Index&lt;/a&gt;, whose underlying data is free to download and reuse.&lt;/p&gt;
&lt;p&gt;Open-weight models are no longer the cheap compromise. They are a default. On
some benchmarks, the performance gap between the best open-weight model and the
best closed model fell from 8% to 1.7% in a single year, according to Stanford&#39;s
2025 AI Index. The chart above traces that collapse.&lt;/p&gt;
&lt;p&gt;Near parity changes the buying decision. When the capability difference is under
two points, the reasons to hold the weights yourself become decisive: data
residency, predictable cost, and the freedom to fine-tune on proprietary data
without sending it to a vendor.&lt;/p&gt;
&lt;h2 id=&quot;why-teams-hold-the-weights&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/open-models-default/#why-teams-hold-the-weights&quot;&gt;&lt;span&gt;Why teams hold the weights&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data residency&lt;/strong&gt; requirements rule out some hosted options entirely.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost is predictable&lt;/strong&gt; when you control the deployment, with no per-token
surprise at the end of the month.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fine-tuning&lt;/strong&gt; on private data is simpler when the model is yours.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-python&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; pandas &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; pd

df &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; pd&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;read_csv&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;deployments.csv&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;     &lt;span class=&quot;token comment&quot;&gt;# month, license, count&lt;/span&gt;
share &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
    df&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;groupby&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;month&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;license&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;count&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;groupby&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;level&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;transform&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;lambda&lt;/span&gt; s&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; s &lt;span class=&quot;token operator&quot;&gt;/&lt;/span&gt; s&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;share&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;unstack&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;tail&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;the-competitive-backdrop&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/open-models-default/#the-competitive-backdrop&quot;&gt;&lt;span&gt;The competitive backdrop&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;This is the same crowding visible across the
&lt;a href=&quot;https://data-today.net/benchmark-heatmap-explainer/&quot;&gt;tightening leaderboard&lt;/a&gt;: when many models cluster
near the top, license terms and running cost decide deployments more than raw
capability. For organisations weighing the
&lt;a href=&quot;https://data-today.net/labs-promise-agi-consultants/&quot;&gt;gigawatt-scale cost of frontier compute&lt;/a&gt;, an
open model that runs on hardware you control is an easier line to defend.&lt;/p&gt;
&lt;p&gt;Parity does not mean open models win everywhere. It means the burden of proof has
flipped. The benchmark detail is in the
&lt;a href=&quot;https://hai.stanford.edu/ai-index/2025-ai-index-report&quot;&gt;2025 AI Index&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;near-parity-changes-the-risk-calculation&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/open-models-default/#near-parity-changes-the-risk-calculation&quot;&gt;&lt;span&gt;Near parity changes the risk calculation&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;When open-weight models trailed by eight points, choosing them required a clear
reason. The buyer accepted lower performance in exchange for control. At a gap
of 1.7 points, the trade-off looks different. The hosted closed model may still
win on some tasks, but the open option is close enough that governance, cost,
and portability can decide the purchase.&lt;/p&gt;
&lt;p&gt;That shift is especially important for regulated organizations. A bank,
hospital, public agency, or defense contractor may value data control more than
a small benchmark advantage. If the open model performs adequately on the
organization&#39;s own test set, the ability to host it inside existing controls can
outweigh the last point of public leaderboard performance.&lt;/p&gt;
&lt;p&gt;Open weights also reduce vendor concentration. A team can fine-tune, evaluate,
and serve a model without depending on one provider&#39;s roadmap or pricing. That
does not make operation easy. It gives the buyer an exit path, which is often
enough to improve contract terms with hosted vendors.&lt;/p&gt;
&lt;h2 id=&quot;the-hidden-costs-of-owning-the-model&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/open-models-default/#the-hidden-costs-of-owning-the-model&quot;&gt;&lt;span&gt;The hidden costs of owning the model&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Open-weight does not mean free. Someone has to choose the model, provision
hardware, monitor quality, apply safety controls, patch serving infrastructure,
and manage upgrades. The compute bill may be predictable, but the operational
bill moves onto the buyer&#39;s own team.&lt;/p&gt;
&lt;p&gt;The cost depends heavily on scale. A high-volume product can justify the fixed
work because each additional request is cheaper. A small team with irregular
usage may be better served by an API, even if the model itself is open. The
right comparison is total cost of ownership, not license type alone.&lt;/p&gt;
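&lt;p&gt;The scale argument can be made concrete with a break-even calculation. The sketch below is a simplified model under stated assumptions: a flat monthly fixed cost for self-hosting, a constant marginal cost per request, and a constant API price. All figures are hypothetical placeholders, not market data.&lt;/p&gt;

```python
def break_even_requests(fixed_cost, price_per_request, marginal_cost):
    """Monthly volume above which self-hosting is cheaper; None if it never is."""
    saving_per_request = price_per_request - marginal_cost
    if saving_per_request > 0:
        return fixed_cost / saving_per_request
    return None  # the API is at least as cheap per request, so fixed costs never pay back


# Hypothetical inputs: replace with your own quotes and measurements
volume = break_even_requests(
    fixed_cost=20_000.0,      # monthly hardware and operations cost, USD
    price_per_request=0.01,   # hosted API price per request, USD
    marginal_cost=0.002,      # self-hosted cost per additional request, USD
)
print(f"Self-hosting wins above roughly {volume:,.0f} requests per month")
```

&lt;p&gt;A team well below that volume is in the irregular-usage case described above, where an API is likely the cheaper option even for an open model.&lt;/p&gt;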
&lt;p&gt;There is also a quality-maintenance burden. Hosted providers update models,
tools, and safety systems continuously. A self-hosted deployment can drift if it
is not re-evaluated against new tasks and new competitors. Open weights give
control, but control includes responsibility for keeping the system current.&lt;/p&gt;
&lt;h2 id=&quot;what-buyers-should-test&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/open-models-default/#what-buyers-should-test&quot;&gt;&lt;span&gt;What buyers should test&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The practical test is a three-way comparison: best closed model, best open model
hosted internally, and best open model through a managed provider. Score each on
task quality, latency, cost, data handling, auditability, and upgrade path. The
winner may differ by workflow.&lt;/p&gt;
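&lt;p&gt;One way to run that comparison is a weighted scorecard. Everything in the sketch below is a placeholder: the 1 to 5 scores, the weight profiles, and the option names are hypothetical and should come from your own evaluation, not from this example.&lt;/p&gt;

```python
# Placeholder scores on a 1-5 scale (higher is better); replace with measured results
candidates = {
    "closed_hosted":    {"task_quality": 5, "latency": 4, "cost": 2, "data_handling": 2, "auditability": 2, "upgrade_path": 5},
    "open_self_hosted": {"task_quality": 4, "latency": 3, "cost": 4, "data_handling": 5, "auditability": 5, "upgrade_path": 2},
    "open_managed":     {"task_quality": 4, "latency": 4, "cost": 3, "data_handling": 3, "auditability": 3, "upgrade_path": 4},
}

# Hypothetical weight profiles for two different workflows
profiles = {
    "regulated_workflow": {"task_quality": 2, "latency": 1, "cost": 1, "data_handling": 3, "auditability": 3, "upgrade_path": 1},
    "fast_prototype":     {"task_quality": 3, "latency": 2, "cost": 1, "data_handling": 1, "auditability": 1, "upgrade_path": 2},
}


def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted average of the criterion scores."""
    return sum(weights[c] * scores[c] for c in weights) / sum(weights.values())


def best_option(weights: dict) -> str:
    """Name of the candidate with the highest weighted score."""
    return max(candidates, key=lambda name: weighted_score(candidates[name], weights))


for profile, weights in profiles.items():
    print(f"{profile}: {best_option(weights)}")
```

&lt;p&gt;With these placeholder numbers the two profiles pick different options, which is the point: the ranking depends on the weights a workflow assigns, not only on raw quality.&lt;/p&gt;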
&lt;p&gt;Buyers should also test failure behavior. A model that is slightly weaker on an
average score may be safer if its errors are easier to detect or if it refuses
uncertain requests more predictably. For production systems, the shape of
failure often matters more than the average gap.&lt;/p&gt;
&lt;p&gt;The open-weight default is not an ideological claim. It is a procurement claim:
when capability is close, control becomes valuable enough to lead the decision.
The best closed models will still set the frontier, but many everyday
deployments no longer need the frontier to create value.&lt;/p&gt;
&lt;h2 id=&quot;why-the-gap-may-keep-tightening&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/open-models-default/#why-the-gap-may-keep-tightening&quot;&gt;&lt;span&gt;Why the gap may keep tightening&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Open models benefit from a broad ecosystem. Researchers publish techniques,
developers optimize runtimes, hardware vendors tune kernels, and users report
failure cases in public. Each improvement compounds outside one company&#39;s API.
Closed labs still have advantages in capital, data, and frontier training runs,
but the open ecosystem is fast at imitation and deployment.&lt;/p&gt;
&lt;p&gt;The remaining gap may also matter less as tasks specialize. A general benchmark
can show a closed model ahead overall while an open model wins on a narrow
domain after fine-tuning. Companies do not buy general intelligence in the
abstract. They buy performance on their own documents, customers, code, and
operating constraints.&lt;/p&gt;
&lt;p&gt;That is why the 1.7-point number is more than a scoreboard. It signals that the
default question has changed from &amp;quot;why use open?&amp;quot; to &amp;quot;why give up control?&amp;quot; In
many workflows, that is the question that procurement, security, and finance
were already waiting to ask.&lt;/p&gt;
&lt;p&gt;The answer will still vary by task. Closed frontier systems remain attractive
when the last bit of quality, speed of new features, or managed safety layer is
worth the premium. Open-weight systems lead when data control, predictable cost,
and deployment flexibility matter more. A mature AI strategy should be able to
use both without rebuilding the product around one vendor.&lt;/p&gt;
&lt;p&gt;That mixed strategy is the practical meaning of open-weight parity. It gives
buyers choice. Choice lowers lock-in, improves negotiation, and lets teams match
the model to the risk of the work.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>The gigawatt build-out is the real AI race now</title>
    <link href="https://data-today.net/gigawatt-buildout/" />
    <updated>2026-05-15T00:00:00Z</updated>
    <id>https://data-today.net/gigawatt-buildout/</id>
    <content type="html">&lt;p&gt;The frontier of AI has become a construction project. Epoch AI estimates the
largest known AI data center, the Anthropic and Amazon site at New Carlisle, has
computing power equivalent to about 700,000 H100 chips, drawing roughly 1.1
gigawatts and costing about 35 billion dollars. Microsoft&#39;s planned Fairwater
campus in Wisconsin is projected at 5.2 million H100-equivalents by September
2027, roughly 7.4 times as large.&lt;/p&gt;
&lt;p&gt;Those figures move the contest off the chip and onto the grid. A gigawatt of
facility power costs around 30 billion dollars to build and takes about two years
of construction. The binding constraints are now electricity, permits, and steel,
all of which move slower than a model release.&lt;/p&gt;
&lt;h2 id=&quot;the-capital-math&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/gigawatt-buildout/#the-capital-math&quot;&gt;&lt;span&gt;The capital math&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;pre class=&quot;language-python&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;GW_COST_USD_B &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;30&lt;/span&gt;          &lt;span class=&quot;token comment&quot;&gt;# roughly $30B to build 1 GW of facility power&lt;/span&gt;
sites &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;New Carlisle&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1.1&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Fairwater (2027)&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;8.6&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; name&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; gw &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; sites&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;items&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;&lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;name&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token format-spec&quot;&gt;18s&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;: ~&lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;gw&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt; GW  -&gt;  ~$&lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;gw &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt; GW_COST_USD_B&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token format-spec&quot;&gt;,.0f&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;B to build&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output explains why only a handful of companies are in this race. Spending
tens of billions on a single building only pencils out if you expect to fill it,
which assumes both demand and a power connection that may not exist yet.&lt;/p&gt;
&lt;h2 id=&quot;what-this-concentrates&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/gigawatt-buildout/#what-this-concentrates&quot;&gt;&lt;span&gt;What this concentrates&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Who controls it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chips&lt;/td&gt;
&lt;td&gt;Improving in supply&lt;/td&gt;
&lt;td&gt;Several vendors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Capital&lt;/td&gt;
&lt;td&gt;Tens of billions per site&lt;/td&gt;
&lt;td&gt;A few hyperscalers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Power&lt;/td&gt;
&lt;td&gt;The hard limit&lt;/td&gt;
&lt;td&gt;Utilities and regulators&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The cost of these bets ties directly to
&lt;a href=&quot;https://data-today.net/scaling-curve-extrapolate/&quot;&gt;the scaling curve nobody wants to extrapolate&lt;/a&gt;:
the bills are real, the payoff is a forecast. Watch the substation, not the
launch event. The data center estimates come from
&lt;a href=&quot;https://epoch.ai/data/data-centers&quot;&gt;Epoch AI&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;why-a-gigawatt-is-a-different-category&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/gigawatt-buildout/#why-a-gigawatt-is-a-different-category&quot;&gt;&lt;span&gt;Why a gigawatt is a different category&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A gigawatt-scale AI campus is closer to an industrial project than a normal data
center expansion. It needs land, transmission capacity, transformers, cooling,
backup systems, fiber, and a construction schedule that can survive permitting
delays. The model roadmap may move in months, but the physical plant moves in
years. That mismatch is now part of AI strategy.&lt;/p&gt;
&lt;p&gt;The scale also changes risk. A conventional cloud region can grow in phases as
customers arrive. A frontier AI site requires a large commitment before the
demand is fully proven, because the power and building shell have to be secured
early. The buyer is making a bet on future model revenue, future inference
volume, and future access to electricity all at once.&lt;/p&gt;
&lt;p&gt;This is why the largest projects cluster around companies with enormous balance
sheets. The advantage is not only that they can buy chips. They can sign power
agreements, absorb construction delays, and finance capacity before it earns
revenue. Smaller labs may still innovate, but renting the frontier increasingly
means renting from someone who owns the grid connection.&lt;/p&gt;
&lt;h2 id=&quot;the-local-effects-are-political&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/gigawatt-buildout/#the-local-effects-are-political&quot;&gt;&lt;span&gt;The local effects are political&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Gigawatt projects become local policy issues quickly. They compete for power
with factories, homes, and other data centers. They create construction jobs,
tax revenue, water concerns, and pressure on transmission planning. A project
that looks like a model-capability decision from Silicon Valley can look like an
energy-development decision to the county that hosts it.&lt;/p&gt;
&lt;p&gt;That political layer can slow or redirect the race. Utilities need to decide how
much generation and transmission to build for customers whose demand forecasts
depend on uncertain AI revenue. Regulators need to decide who pays for upgrades
if a campus is delayed or canceled. Communities need to decide whether the jobs
and tax base justify the infrastructure burden.&lt;/p&gt;
&lt;p&gt;For AI companies, community acceptance becomes part of execution risk. A site
with cheaper power but slower permitting may lose to a site with stronger local
support. The winning location is the one where capital, electricity, regulation,
and schedule line up closely enough for the model roadmap.&lt;/p&gt;
&lt;h2 id=&quot;what-to-measure-next&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/gigawatt-buildout/#what-to-measure-next&quot;&gt;&lt;span&gt;What to measure next&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The visible headline is H100-equivalent capacity, but the next set of useful
metrics is more practical. How many megawatts are energized, not merely planned?
What share of power is under firm contract? How much capacity is available for
training versus inference? What is the expected utilization rate after the first
year? Those details separate announced ambition from usable compute.&lt;/p&gt;
&lt;p&gt;There is also an efficiency question. A larger campus is only better if it can
feed chips, move data, and cool racks without wasting too much power. Power
usage effectiveness, networking topology, and maintenance downtime all affect
the real capacity customers experience. A chart of theoretical chips does not
capture those losses.&lt;/p&gt;
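&lt;p&gt;A rough way to see those losses: power usage effectiveness (PUE) is total facility power divided by the power that reaches IT equipment, so dividing facility power by PUE gives the share available to chips. The sketch below treats the 1.1 gigawatt New Carlisle figure as total facility draw, and the PUE and downtime values are hypothetical, chosen only for illustration.&lt;/p&gt;

```python
def usable_it_power_gw(facility_gw: float, pue: float, downtime_fraction: float) -> float:
    """Average power reaching IT equipment after overhead and downtime."""
    if 1.0 > pue:
        raise ValueError("PUE cannot be below 1.0")
    it_power = facility_gw / pue                 # overhead such as cooling is excluded
    return it_power * (1.0 - downtime_fraction)  # discount time lost to maintenance


# 1.1 GW is the facility figure quoted in this article; PUE and downtime are assumptions
usable = usable_it_power_gw(1.1, pue=1.2, downtime_fraction=0.05)
print(f"Usable IT power: about {usable:.2f} GW of 1.1 GW")
```

&lt;p&gt;Networking bottlenecks are not modeled here; they would reduce effective capacity further.&lt;/p&gt;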
&lt;p&gt;The AI race now has a public-infrastructure dimension. Model releases still
matter, but the next frontier may be decided by interconnection queues and
construction schedules. The company that can turn capital into energized
capacity fastest will have an advantage before a single training run starts.&lt;/p&gt;
&lt;h2 id=&quot;the-buyers-takeaway&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/gigawatt-buildout/#the-buyers-takeaway&quot;&gt;&lt;span&gt;The buyer&#39;s takeaway&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Customers should read gigawatt announcements as capacity signals, not as product
guarantees. A planned campus can lower future prices only if it is built,
energized, filled with chips, and used at high utilization. Until then, buyers
should ask providers where capacity is available today, which regions are
constrained, and which workloads may be throttled during demand spikes.&lt;/p&gt;
&lt;p&gt;The same questions matter for vendor risk. A provider with many smaller sites
may be more resilient than one waiting on a single enormous project. A provider
with firm power contracts may be safer than one with only a public target. The
headline number is useful, but the delivery path is the story to watch.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Over 40 percent of agentic AI projects will be cancelled before they ship</title>
    <link href="https://data-today.net/agentic-cancellation-cliff/" />
    <updated>2026-05-18T00:00:00Z</updated>
    <id>https://data-today.net/agentic-cancellation-cliff/</id>
    <content type="html">&lt;p&gt;The agent hype has met its budget review. Gartner expects more than 40 percent of
agentic AI projects to be cancelled by the end of 2027, citing escalating costs,
unclear business value, and inadequate risk controls. The technology demos well
and ships rarely, and the gap between those two states is where the money goes.&lt;/p&gt;
&lt;p&gt;The pattern is familiar to anyone who has run a pilot. An agent that handles a
scripted demo falls apart on the long tail of real inputs, where a single wrong
action compounds across steps. Without a clear value case and tight guardrails, a
promising proof of concept becomes a maintenance liability nobody wants to own.&lt;/p&gt;
&lt;h2 id=&quot;score-a-project-before-you-fund-it&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/agentic-cancellation-cliff/#score-a-project-before-you-fund-it&quot;&gt;&lt;span&gt;Score a project before you fund it&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;pre class=&quot;language-python&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;survives&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;value_per_run_usd&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; runs_per_month&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; cost_per_run_usd&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
             failure_rate&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; cleanup_cost_usd&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    gross &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;value_per_run_usd &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt; cost_per_run_usd&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt; runs_per_month
    cleanup &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; failure_rate &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt; runs_per_month &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt; cleanup_cost_usd
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; gross &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt; cleanup &lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# A plausible support-automation agent&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;survives&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;value_per_run_usd&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;4.0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; runs_per_month&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;20_000&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
               cost_per_run_usd&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0.6&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; failure_rate&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0.05&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
               cleanup_cost_usd&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;12.0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
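&lt;p&gt;The model is crude but it asks the right question. In this example the expected
cleanup bill is 60 cents per run (5 percent of 12 dollars), about 18 percent of the
3.40 dollar margin, so the project survives. Drop the value per run to 1 dollar and
the same failure rate erases the entire gross margin. Most cancelled
projects never run this arithmetic until the invoices arrive.&lt;/p&gt;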
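&lt;p&gt;One way to make that arithmetic concrete is to solve for the failure rate at which
the per-run margin hits zero. The sketch below reuses the same inputs as the example
above. The helper name &lt;code&gt;break_even_failure_rate&lt;/code&gt; is hypothetical, and the
linear cost model is the same crude assumption.&lt;/p&gt;

```python
def break_even_failure_rate(value_per_run_usd, cost_per_run_usd, cleanup_cost_usd):
    """Failure rate at which expected cleanup cost equals the per-run margin.

    Hypothetical helper: assumes every failure costs the same to clean up
    and that a failed run still delivers its value (as in the model above).
    """
    if not cleanup_cost_usd > 0:
        raise ValueError("cleanup_cost_usd must be positive")
    margin_per_run = value_per_run_usd - cost_per_run_usd
    return margin_per_run / cleanup_cost_usd


# Same illustrative support-automation numbers as the example above
rate = break_even_failure_rate(value_per_run_usd=4.0, cost_per_run_usd=0.6,
                               cleanup_cost_usd=12.0)
print(f"break-even failure rate: {rate:.1%}")
```

&lt;p&gt;With these inputs the break-even sits near 28 percent, so a 5 percent failure rate
leaves room. At 1 dollar of value per run the break-even falls to roughly 3 percent,
and the same agent loses money.&lt;/p&gt;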
&lt;h2 id=&quot;what-separates-the-survivors&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/agentic-cancellation-cliff/#what-separates-the-survivors&quot;&gt;&lt;span&gt;What separates the survivors&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Narrow scope:&lt;/strong&gt; one task with a measurable outcome, not an open-ended
assistant.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cheap failure:&lt;/strong&gt; actions that are easy to review or reverse.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A real baseline:&lt;/strong&gt; a value case that beats the
&lt;a href=&quot;https://data-today.net/agents-working-too-hard/&quot;&gt;tooling already in production&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Agentic AI is not failing because the models are weak. It is failing where teams
skip the business case. Gartner&#39;s prediction is detailed in its
&lt;a href=&quot;https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027&quot;&gt;2025 agentic AI analysis&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;the-demo-hides-the-operating-model&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/agentic-cancellation-cliff/#the-demo-hides-the-operating-model&quot;&gt;&lt;span&gt;The demo hides the operating model&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The strongest agent demos usually compress the work into a single clean path. A
sales agent qualifies a lead, a support agent issues a refund, or a research
agent writes a market brief. The production version has to handle permissions,
handoffs, audit trails, rate limits, disputed facts, and customers who ask for
two incompatible things at once. That gap is where many pilots lose their
budget sponsor.&lt;/p&gt;
&lt;p&gt;An agent is also harder to govern than a normal workflow because its route can
change from run to run. A conventional automation has fixed branches that can be
tested directly. An agent chooses tools, reads state, and may create new
intermediate steps while it works. That flexibility is the point, but it turns
quality assurance into a sampling problem. Teams have to ask how often the agent
takes an unsafe action, not simply whether it succeeded on the last demo.&lt;/p&gt;
&lt;p&gt;The projects that survive tend to make the operating model explicit. They define
which actions require human approval, which inputs are outside scope, which
systems the agent can touch, and how every decision is logged. Those controls
make the launch slower, but they also make the project legible to finance,
security, and legal teams. Without that legibility, the pilot remains a charming
prototype that no executive wants to own in production.&lt;/p&gt;
&lt;h2 id=&quot;the-cancellation-point-is-usually-after-integration&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/agentic-cancellation-cliff/#the-cancellation-point-is-usually-after-integration&quot;&gt;&lt;span&gt;The cancellation point is usually after integration&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Budgets rarely die at the first prompt. They die after the agent is connected to
real systems. Integration exposes the messy economics: every tool call has a
latency cost, every retrieval step has a data-quality dependency, and every
write action creates a recovery burden if the agent is wrong. The value case has
to survive all of those frictions, not just the model bill.&lt;/p&gt;
&lt;p&gt;That is why a small error rate can dominate the spreadsheet. A support agent
that mishandles one in twenty cases may look accurate in a demo, but those cases
are exactly where customers complain, managers intervene, and auditors ask what
happened. If cleanup takes a senior employee ten minutes, the apparent savings
from hundreds of successful cases can shrink quickly.&lt;/p&gt;
&lt;p&gt;Integration also changes who pays. The innovation team may fund the prototype,
while operations inherits the monitoring, exception handling, and incident
response. A cancellation is often a rational handoff failure: the group asked to
run the system sees a different cost profile from the group that built it.&lt;/p&gt;
&lt;h2 id=&quot;a-better-approval-gate&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/agentic-cancellation-cliff/#a-better-approval-gate&quot;&gt;&lt;span&gt;A better approval gate&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The useful approval gate is a pre-mortem with numbers attached. Before a team
funds a pilot, it should name the baseline process, expected volume, acceptable
failure rate, review cost, and maximum monthly model spend. It should also name
the shutdown condition. If the project cannot state what evidence would kill it,
the organization is buying hope rather than testing a system.&lt;/p&gt;
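&lt;p&gt;A gate like that can be written down as a small record with an explicit shutdown
check. The sketch below is one possible shape, not a standard: the class
&lt;code&gt;PilotGate&lt;/code&gt;, its field names, and every number in the usage lines are
hypothetical and chosen only for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotGate:
    """Numbers a team commits to before funding (all names are hypothetical)."""
    baseline_cost_per_case_usd: float    # what the current process costs
    expected_cases_per_month: int
    max_failure_rate: float              # acceptable share of mishandled cases
    review_cost_per_case_usd: float
    max_monthly_model_spend_usd: float

    def shutdown_reasons(self, observed_failure_rate, observed_model_spend_usd):
        """Return the breached limits; an empty list means keep going."""
        reasons = []
        if observed_failure_rate > self.max_failure_rate:
            reasons.append("failure rate above agreed ceiling")
        if observed_model_spend_usd > self.max_monthly_model_spend_usd:
            reasons.append("model spend above agreed ceiling")
        return reasons


# Illustrative numbers only
gate = PilotGate(baseline_cost_per_case_usd=4.0, expected_cases_per_month=20_000,
                 max_failure_rate=0.05, review_cost_per_case_usd=0.5,
                 max_monthly_model_spend_usd=12_000.0)
print(gate.shutdown_reasons(observed_failure_rate=0.08,
                            observed_model_spend_usd=9_500.0))
```

&lt;p&gt;The point of the frozen record is that the limits are fixed before the pilot starts,
so the shutdown condition cannot quietly move once results come in.&lt;/p&gt;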
&lt;p&gt;The gate should include a replay set drawn from real historical cases. Synthetic
prompts are helpful for coverage, but they rarely contain the awkward edge cases
that make production expensive. A replay set lets the team compare the agent
against the current process and measure the difference in time, quality, and
cleanup work.&lt;/p&gt;
&lt;p&gt;Agentic systems will keep shipping where the work is narrow, the failures are
cheap, and the measurement is honest. The cancellation wave will hit projects
that treated autonomy as a feature instead of an operating cost. That distinction
is mundane, which is why it matters.&lt;/p&gt;
&lt;p&gt;The short version of the story is also the executive version. Agentic
AI projects need a named task, a measurable baseline, a bounded action set, and
a cost model that includes cleanup. Without those four pieces, the project is a
demo competing against the real world. With them, it becomes a normal technology
investment that can be approved, monitored, and stopped when the data says stop.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>How cheap inference made the one-person studio viable</title>
    <link href="https://data-today.net/without-ai-quit-my-job/" />
    <updated>2026-05-19T00:00:00Z</updated>
    <id>https://data-today.net/without-ai-quit-my-job/</id>
    <content type="html">&lt;p&gt;People are leaving stable jobs to run one-person studios not because they got
braver, but because the cost of the work fell through the floor. The price of
running a model at a fixed quality level, GPT-3.5 and better, dropped about 280
times between November 2022 and October 2024, according to Stanford&#39;s 2025 AI
Index. The chart above is the reason a one-person studio is now a viable
business.&lt;/p&gt;
&lt;p&gt;That decline is not uniform, and the detail matters. Epoch AI estimates inference
prices at a fixed capability level have fallen between 9 and 900 times per year
depending on the task, with the steepest drops on the benchmarks that commoditise
fastest. For a solo operator, picking the right tier is most of the margin.&lt;/p&gt;
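&lt;p&gt;The Stanford figure covers a 23-month window, while the Epoch range is quoted per
year, so the two are easier to compare after annualizing. The sketch below assumes a
constant exponential rate across the window, which is a simplification, and the
function name &lt;code&gt;annualized_factor&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
def annualized_factor(total_factor, months):
    """Convert a total price-decline factor over `months` into a per-year factor.

    Assumes the decline compounds at a constant rate, which real prices do not
    strictly follow.
    """
    return total_factor ** (12 / months)


# 280x between November 2022 and October 2024 is a 23-month window
per_year = annualized_factor(total_factor=280, months=23)
print(f"about {per_year:.0f}x cheaper per year")
```

&lt;p&gt;Under that assumption the 280-fold drop works out to roughly 19 times per year, which
sits inside the 9 to 900 range quoted above.&lt;/p&gt;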
&lt;h2 id=&quot;what-do-the-cheaper-tokens-actually-buy&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/without-ai-quit-my-job/#what-do-the-cheaper-tokens-actually-buy&quot;&gt;&lt;span&gt;What do the cheaper tokens actually buy?&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The collapse in price turns work that used to be outsourced into work a single
operator can run in-house. Research drafts, first-pass analysis, chart
generation, and customer replies all move onto one desk because each one now
costs cents rather than an afternoon.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;Now&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Market scan&lt;/td&gt;
&lt;td&gt;hired analyst&lt;/td&gt;
&lt;td&gt;overnight run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First-draft report&lt;/td&gt;
&lt;td&gt;a full day&lt;/td&gt;
&lt;td&gt;minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Customer replies&lt;/td&gt;
&lt;td&gt;a support hire&lt;/td&gt;
&lt;td&gt;reviewed queue&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&quot;what-is-the-catch&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/without-ai-quit-my-job/#what-is-the-catch&quot;&gt;&lt;span&gt;What is the catch?&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Cheap inference is not free judgment. The output is
&lt;a href=&quot;https://data-today.net/vibecoding-to-production/&quot;&gt;almost right too often&lt;/a&gt; to ship unread, so a
disciplined operator reviews everything that leaves the studio. The economics
changed; the accountability did not.&lt;/p&gt;
&lt;p&gt;This is not a victory lap for automation. Plenty of people lost ground in the
same shift. But for anyone who wanted to work alone and could never make the
numbers add up, the falling price of capability is the quiet story of the decade.
The price data is tracked by &lt;a href=&quot;https://epoch.ai/trends&quot;&gt;Epoch AI&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;what-did-the-studio-economics-look-like-before-the-price-collapse&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/without-ai-quit-my-job/#what-did-the-studio-economics-look-like-before-the-price-collapse&quot;&gt;&lt;span&gt;What did the studio economics look like before the price collapse?&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Before cheap inference, a one-person data studio had an awkward cost structure.
Research, drafting, analysis, charting, customer support, and marketing all
competed for the same hours. Outsourcing helped, but only after revenue was
large enough to cover contractors. Hiring helped, but it turned a small business
idea into a payroll problem before the product was proven.&lt;/p&gt;
&lt;p&gt;The hard part was not ambition. It was throughput. A solo operator could do one
or two functions well and then run out of week. Every additional service line
added coordination overhead. Every customer request carried an opportunity cost.
The business was limited by the founder&#39;s calendar more than the addressable
market.&lt;/p&gt;
&lt;p&gt;Cheap models changed that constraint. They made it possible to create first
drafts, summarize sources, test angles, clean data, and prepare customer replies
without hiring for each task. The founder still has to judge the output, but the
blank-page labor is no longer the bottleneck.&lt;/p&gt;
&lt;h2 id=&quot;what-still-cannot-be-delegated&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/without-ai-quit-my-job/#what-still-cannot-be-delegated&quot;&gt;&lt;span&gt;What still cannot be delegated?&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The model can draft a market scan. It cannot decide which client relationship is
worth protecting. It can propose a chart. It cannot know which caveat will make
the claim fair. It can write a reply. It cannot carry the reputation cost if the
reply is wrong. The human work shifts toward taste, accountability, and final
judgment.&lt;/p&gt;
&lt;p&gt;That shift is productive but tiring. Reviewing machine output requires attention
because the errors are often fluent. A bad paragraph may sound polished. A wrong
number may sit next to five correct ones. A plausible citation may need checking.
The work is faster than starting from scratch, but it is still work.&lt;/p&gt;
&lt;p&gt;The economic gain comes from moving more tasks into the reviewable category. If
a model can get a draft to 70 percent quality for a few cents, the owner can
spend time on the last 30 percent. If the draft is only 30 percent right, the
tool creates cleanup work. The margin depends on knowing which tasks belong in
which bucket.&lt;/p&gt;
&lt;h2 id=&quot;why-does-this-matter-beyond-one-career&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/without-ai-quit-my-job/#why-does-this-matter-beyond-one-career&quot;&gt;&lt;span&gt;Why does this matter beyond one career?&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Cheap capability changes who can start. A solo founder, journalist, analyst,
designer, or researcher can now attempt projects that used to require a small
team. That does not guarantee success, but it lowers the fixed cost of trying.
More experiments can happen before outside funding, hiring, or a large customer
contract.&lt;/p&gt;
&lt;p&gt;This is also where the labor story becomes complicated. The same tools that let
one person start a studio can pressure entry-level service work that used to
provide the first rung for others. The benefit is real. The distribution is
uneven. A serious account of AI and work has to hold both facts at once.&lt;/p&gt;
&lt;p&gt;For buyers, the result may be more specialized suppliers. A one-person studio
can serve a narrow niche with lower overhead, using AI to cover the routine work
around a specific expertise. That can create better services, but it also raises
the bar for trust because the business depends heavily on one person&#39;s review.&lt;/p&gt;
&lt;h2 id=&quot;what-is-the-rule-that-keeps-this-working&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/without-ai-quit-my-job/#what-is-the-rule-that-keeps-this-working&quot;&gt;&lt;span&gt;What is the rule that keeps this working?&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The discipline that holds the business model together is simple: use AI where the cost of a bad
first draft is low and the value of speed is high, and never use it as the final
authority on facts, claims, client promises, or anything that would damage trust
if it were wrong. That one rule keeps the economics from swallowing the judgment.&lt;/p&gt;
&lt;p&gt;The broader lesson is that falling inference prices make small organizations
look larger from the outside. They can publish more, respond faster, and test
more ideas. The durable advantage still comes from knowing what should exist in
the first place.&lt;/p&gt;
&lt;p&gt;That is why the cost curve matters to more than AI companies. It changes the
minimum viable size of a knowledge business. A solo operator can now cover more
surface area before hiring, a small team can test more markets before raising
capital, and a specialist can turn judgment into a product without first
building a department. The constraint does not vanish. It moves from production
capacity to editorial and commercial discipline.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Training costs rise 3.5x a year. Efficiency is the only brake</title>
    <link href="https://data-today.net/training-cost-vs-efficiency/" />
    <updated>2026-05-21T00:00:00Z</updated>
    <id>https://data-today.net/training-cost-vs-efficiency/</id>
    <content type="html">&lt;p&gt;Two exponentials are pulling the cost of frontier AI in opposite directions.
Epoch AI estimates the cost to train frontier language models has risen about 3.5
times per year since 2020, while pre-training compute efficiency improves roughly
3 times per year. One curve prices labs out, the other keeps the door open, and
the narrow gap between them sets the pace of the field.&lt;/p&gt;
&lt;p&gt;The arithmetic is unforgiving at the top. The largest known training run, Grok 4,
used around 5e26 FLOP, and power use per run roughly doubles every year. Efficiency
is the only force working in the buyer&#39;s favour, letting the same capability be
reached for less compute as time passes.&lt;/p&gt;
&lt;h2 id=&quot;the-race-between-cost-and-efficiency&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/training-cost-vs-efficiency/#the-race-between-cost-and-efficiency&quot;&gt;&lt;span&gt;The race between cost and efficiency&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The two rates are what matter, and they nearly cancel. Training cost climbs about
&lt;strong&gt;3.5 times a year&lt;/strong&gt; while efficiency improves about 3 times a year, so net of
efficiency the frontier bill rises only around 1.2 times annually (3.5 divided by 3).
After three years the headline training bill is up more than 40-fold, yet the
efficiency-adjusted bill is up only about 1.6-fold. Efficiency is doing almost all the
work of keeping AI affordable.&lt;/p&gt;
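&lt;p&gt;The compounding behind those figures can be sketched in a few lines. Both rates are
taken from the text and treated as constant, which is an assumption, and the names
below are hypothetical.&lt;/p&gt;

```python
COST_GROWTH_PER_YEAR = 3.5        # frontier training cost, per the article
EFFICIENCY_GAIN_PER_YEAR = 3.0    # pre-training compute efficiency, per the article


def compound(rate_per_year, years):
    """Total multiplier after `years` of constant exponential growth."""
    return rate_per_year ** years


for years in (1, 2, 3):
    headline = compound(COST_GROWTH_PER_YEAR, years)
    efficiency = compound(EFFICIENCY_GAIN_PER_YEAR, years)
    net = headline / efficiency   # cost growth left over after efficiency
    print(f"year {years}: bill x{headline:.1f}, efficiency x{efficiency:.0f}, "
          f"net x{net:.2f}")
```

&lt;p&gt;The net column is the gap the rest of this piece refers to: small in any one year,
but it compounds in the same way as the two headline rates.&lt;/p&gt;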
&lt;h2 id=&quot;who-this-favours&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/training-cost-vs-efficiency/#who-this-favours&quot;&gt;&lt;span&gt;Who this favours&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Labs at the frontier:&lt;/strong&gt; absorb the 3.5x and chase the largest runs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Everyone else:&lt;/strong&gt; ride the 3x efficiency gain and arrive a year later for far
less, the same logic behind
&lt;a href=&quot;https://data-today.net/consumer-hardware-lag/&quot;&gt;last year&#39;s model on a laptop&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Investors:&lt;/strong&gt; watch the gap. If efficiency stalls, the cost curve wins.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The frontier is expensive on purpose. The rest of the market runs on the
efficiency dividend. Both rates are documented by
&lt;a href=&quot;https://epoch.ai/publications/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year&quot;&gt;Epoch AI&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;the-gap-decides-market-access&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/training-cost-vs-efficiency/#the-gap-decides-market-access&quot;&gt;&lt;span&gt;The gap decides market access&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The difference between 3.5x cost growth and 3x efficiency growth looks small on
paper. Over time, it decides who can afford to stay near the frontier. If costs
grow faster than efficiency, the leading edge becomes more exclusive even while
older capability gets cheaper. If efficiency catches up or pulls ahead, more
organizations can compete with less capital.&lt;/p&gt;
&lt;p&gt;That is why the efficiency rate deserves as much attention as training budgets.
A new architecture, better data mixture, improved optimizer, or more efficient
serving method can change the economics for everyone downstream. It may not
make the largest run cheap, but it can make last year&#39;s largest capability
available to many more users.&lt;/p&gt;
&lt;p&gt;The effect is strongest outside the frontier. A company that does not need the
absolute best model can wait for efficiency gains to lower the cost of a
previously expensive capability. This is the economic path from lab demo to
ordinary software feature.&lt;/p&gt;
&lt;h2 id=&quot;why-headline-training-bills-keep-rising&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/training-cost-vs-efficiency/#why-headline-training-bills-keep-rising&quot;&gt;&lt;span&gt;Why headline training bills keep rising&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Frontier labs spend more because the prize is positional. The first lab to reach
a new capability can win customers, investors, talent, and strategic leverage.
That race encourages spending even when efficiency improves. Savings are often
reinvested into larger runs rather than returned as lower budgets.&lt;/p&gt;
&lt;p&gt;This creates a strange public picture. The technology gets more efficient, but
the largest checks get bigger. That is not a contradiction. It means labs are
using efficiency to climb the curve faster. The same force that lowers the cost
of a fixed capability can increase the ambition of the next training run.&lt;/p&gt;
&lt;p&gt;Power use follows the same pattern. Better efficiency reduces the compute needed
for a given target, but frontier targets keep moving upward. The grid still
feels pressure because the field spends the efficiency dividend on scale.&lt;/p&gt;
&lt;h2 id=&quot;what-investors-should-monitor&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/training-cost-vs-efficiency/#what-investors-should-monitor&quot;&gt;&lt;span&gt;What investors should monitor&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Investors should watch whether efficiency gains remain broad or become harder to
find. If the 3x improvement slows materially while training ambitions keep
rising, frontier economics deteriorate quickly. More capital would be needed for
smaller gains, and the number of credible frontier players would shrink.&lt;/p&gt;
&lt;p&gt;They should also monitor how quickly efficiency diffuses. Some improvements are
published, copied, and absorbed into open tooling. Others remain proprietary to
the largest labs. The more private the efficiency gain, the more concentrated
the market becomes.&lt;/p&gt;
&lt;p&gt;The cleanest signal is cost to reach a fixed capability level. If that cost
keeps falling, the broader AI market can thrive even as frontier spending grows.
If it stops falling, the field becomes more dependent on a few companies willing
to fund massive runs.&lt;/p&gt;
&lt;h2 id=&quot;the-practical-takeaway&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/training-cost-vs-efficiency/#the-practical-takeaway&quot;&gt;&lt;span&gt;The practical takeaway&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;For most builders, the smart strategy is to ride the efficiency curve rather
than chase the frontier. Use the newest models to learn what will become
possible, then move stable workloads onto cheaper models as soon as quality is
good enough. That approach captures capability without inheriting the largest
capital burden.&lt;/p&gt;
&lt;p&gt;For labs, the calculation is harsher. They need efficiency research to keep the
frontier affordable, and they need scale to stay ahead. The tension between
those two needs is the central economics of AI development in 2026.&lt;/p&gt;
&lt;p&gt;The cost curve is therefore not only a warning about expensive models. It is a
map of how capability spreads: first through capital-heavy frontier runs, then
through efficiency gains that make the same work cheaper for everyone else.&lt;/p&gt;
&lt;h2 id=&quot;the-policy-angle&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/training-cost-vs-efficiency/#the-policy-angle&quot;&gt;&lt;span&gt;The policy angle&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Efficiency also matters for national and regional strategy. A country that
cannot match the largest frontier budgets may still benefit if efficiency gains
make strong models trainable and deployable on smaller clusters. Public funding
for data quality, evaluation, open tooling, and energy-efficient infrastructure
can therefore widen access even without funding the largest runs.&lt;/p&gt;
&lt;p&gt;The opposite is also true. If efficiency gains concentrate inside a few private
labs, the market becomes more dependent on those labs for both capability and
cost reductions. Watching the efficiency curve is a way to watch the openness of
the AI economy, not just its technical progress.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>The scaling curve nobody wants to extrapolate</title>
    <link href="https://data-today.net/scaling-curve-extrapolate/" />
    <updated>2026-05-22T00:00:00Z</updated>
    <id>https://data-today.net/scaling-curve-extrapolate/</id>
    <content type="html">&lt;p&gt;Scaling laws are the closest thing modern AI has to physics. The scatter above,
plotting capability against training compute, fits a line almost too well.
According to Epoch AI, training compute for frontier language models has grown
about 5 times per year since 2020, a roughly ten-thousand-fold increase among
the top systems. The real question is how far that line goes.&lt;/p&gt;
&lt;p&gt;The drivers are not free. The cost of frontier training runs rises about 3.5
times per year, and power use roughly doubles annually. Working the other way,
algorithmic efficiency improves about 3 times per year, so the same capability
costs less compute over time. The curve is a race between rising scale and
falling unit cost.&lt;/p&gt;
&lt;p&gt;Fit a line to the published points and the slope is steady: each tenfold increase
in training compute buys a roughly constant bump in benchmark capability. That
straight line on a log scale is exactly what makes the next paragraph dangerous.
A clean trend invites you to read the future straight off the ruler.&lt;/p&gt;
&lt;h2 id=&quot;two-ways-to-be-wrong&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/scaling-curve-extrapolate/#two-ways-to-be-wrong&quot;&gt;&lt;span&gt;Two ways to be wrong&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Extrapolate naively&lt;/strong&gt; and you promise miracles at the next order of
magnitude.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Assume a wall&lt;/strong&gt; and you under-invest right before the payoff.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The data cannot settle which is right, because every point sits to the left of
the question. We are, by definition, fitting a line we hope to leave behind.
Epoch argues the inputs can keep scaling through 2030, which is a statement about
supply, not about whether capability keeps tracking it.&lt;/p&gt;
&lt;h2 id=&quot;a-reporters-caveat&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/scaling-curve-extrapolate/#a-reporters-caveat&quot;&gt;&lt;span&gt;A reporter&#39;s caveat&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Treat any chart that ends in an arrow with suspicion. The most expensive mistakes
in this field have all been confident extrapolations of a clean-looking curve,
and the people paying the
&lt;a href=&quot;https://data-today.net/labs-promise-agi-consultants/&quot;&gt;gigawatt-scale bills&lt;/a&gt; are betting it holds. The
underlying compute trends are documented by &lt;a href=&quot;https://epoch.ai/trends&quot;&gt;Epoch AI&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;what-the-curve-can-and-cannot-say&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/scaling-curve-extrapolate/#what-the-curve-can-and-cannot-say&quot;&gt;&lt;span&gt;What the curve can and cannot say&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The scaling curve is evidence that more training compute has produced more
capability across recent frontier systems. It is not a guarantee that the next
tenfold increase will buy the same gain. The fitted line describes the range of
models already built. The most important decisions sit just beyond that range.&lt;/p&gt;
&lt;p&gt;That limitation does not make the chart useless. It makes the chart a disciplined
starting point. A clean relationship across many releases tells investors,
engineers, and policymakers that scale has worked so far. It also tells them
which assumption they are making when they fund the next order of magnitude.&lt;/p&gt;
&lt;p&gt;The danger is false certainty in either direction. A skeptic can point to rising
costs and declare the wall near. An optimist can point to the line and declare
the future solved. The data supports neither posture fully. It supports a more
careful claim: recent progress has been strongly associated with scale, and the
next test is expensive.&lt;/p&gt;
&lt;h2 id=&quot;why-extrapolation-is-still-tempting&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/scaling-curve-extrapolate/#why-extrapolation-is-still-tempting&quot;&gt;&lt;span&gt;Why extrapolation is still tempting&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Companies extrapolate because the alternative is harder. If a lab believes the
line will continue, it can justify more compute, larger data centers, and deeper
partnerships. If it believes the line will break, it should redirect money into
efficiency, data quality, productization, or narrower models. The investment
choice forces a view before the evidence is complete.&lt;/p&gt;
&lt;p&gt;That is why scaling debates become emotional. They are not only technical. They
decide who gets capital, which teams hire, which countries subsidize
infrastructure, and which vendors win enterprise commitments. A curve on a log
chart becomes a capital-allocation argument.&lt;/p&gt;
&lt;p&gt;The responsible version of extrapolation includes milestones. A lab can fund the
next scale step while naming the capability gain it expects, the cost ceiling it
will tolerate, and the evidence that would cause it to slow down. Extrapolation
without stop conditions is faith. Extrapolation with measurement is strategy.&lt;/p&gt;
&lt;h2 id=&quot;the-role-of-efficiency&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/scaling-curve-extrapolate/#the-role-of-efficiency&quot;&gt;&lt;span&gt;The role of efficiency&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Algorithmic efficiency complicates the story in a useful way. If the same
capability requires less compute each year, then scale is not the only path to
progress. Better data, architectures, training recipes, inference methods, and
tool use can shift the curve. That means a smaller future model may match
today&#39;s larger one.&lt;/p&gt;
&lt;p&gt;Efficiency also widens access. Frontier labs may spend more to move the leading
edge, while the rest of the market benefits as techniques diffuse. Today&#39;s
expensive capability becomes tomorrow&#39;s commodity feature when efficiency gains
and hardware improvements meet.&lt;/p&gt;
&lt;p&gt;This diffusion is why scaling can be both centralized and democratizing. The
frontier run may require enormous capital, but the lessons from that run can
lower costs for everyone else. The social question is who captures the value
between those two moments.&lt;/p&gt;
&lt;h2 id=&quot;what-readers-should-take-from-the-chart&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/scaling-curve-extrapolate/#what-readers-should-take-from-the-chart&quot;&gt;&lt;span&gt;What readers should take from the chart&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The chart is interesting because it is both persuasive and incomplete. It shows
why serious actors keep paying for more compute. It also shows why those actors
are exposed if the relationship weakens. The same line that supports the boom
defines the risk.&lt;/p&gt;
&lt;p&gt;For a product team, the conclusion is practical. Assume model capability will
keep improving, but avoid building a business that only works if the most
optimistic extrapolation arrives on schedule. For a policymaker, the conclusion
is similar: infrastructure decisions should recognize the trend without treating
it as destiny.&lt;/p&gt;
&lt;p&gt;The scaling curve deserves attention because it has been a good map. The warning
is that maps are most dangerous at the edge, where the road has not been built
yet.&lt;/p&gt;
&lt;h2 id=&quot;the-investment-test&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/scaling-curve-extrapolate/#the-investment-test&quot;&gt;&lt;span&gt;The investment test&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The best use of the curve is to make assumptions explicit. If a company funds a
larger run, it should state the capability gain it expects, the product value of
that gain, and the evidence that would change the plan. If a policymaker funds
power or data-center capacity, the same discipline applies. What outcome would
make the investment look wise, and what outcome would make it look premature?&lt;/p&gt;
&lt;p&gt;Those questions do not remove uncertainty. They make it auditable. Scaling may
continue to work, but the cost of testing it is now large enough that vague
optimism is a weak substitute for measured milestones.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Three-quarters of the world&#39;s AI compute sits in one country</title>
    <link href="https://data-today.net/compute-geography/" />
    <updated>2026-05-25T00:00:00Z</updated>
    <id>https://data-today.net/compute-geography/</id>
    <content type="html">&lt;p&gt;&lt;strong&gt;Dataset:&lt;/strong&gt; the figures in this piece come from &lt;a href=&quot;https://epoch.ai/data&quot;&gt;Epoch AI&#39;s open data hub&lt;/a&gt;, which publishes its GPU cluster performance estimates for anyone to download and check.&lt;/p&gt;
&lt;p&gt;AI capability is global, but the hardware behind it is not. Epoch AI estimates
the United States holds about three-quarters of global GPU cluster performance,
leaving the rest of the world to share the remaining quarter. The map of who can
train and serve frontier models is far more concentrated than the map of who uses
them.&lt;/p&gt;
&lt;p&gt;That concentration has practical consequences well beyond geopolitics. Where the
compute sits determines where inference is cheapest, which regions carry the
lowest latency, and whose export rules govern access to the newest chips. For a
team building outside that 75 percent, those are daily engineering constraints,
not abstractions.&lt;/p&gt;
&lt;h2 id=&quot;what-concentration-costs-you&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/compute-geography/#what-concentration-costs-you&quot;&gt;&lt;span&gt;What concentration costs you&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The split is lopsided: the United States holds about &lt;strong&gt;75 percent&lt;/strong&gt; of frontier
cluster performance, China roughly 15, the European Union 6, and everyone else
shares the last 4. The smaller your region&#39;s slice, the sharper the trade-off.
Building where local compute is scarce means renting capacity abroad, accepting
cross-region latency, or paying a premium for scarce local instances.&lt;/p&gt;
&lt;h2 id=&quot;how-teams-respond&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/compute-geography/#how-teams-respond&quot;&gt;&lt;span&gt;How teams respond&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Constraint&lt;/th&gt;
&lt;th&gt;Practical response&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scarce local GPUs&lt;/td&gt;
&lt;td&gt;Multi-region inference, edge caching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Export controls&lt;/td&gt;
&lt;td&gt;Favour open-weight models you can host&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data residency rules&lt;/td&gt;
&lt;td&gt;Run smaller models locally, reserve frontier calls&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Compute geography is becoming a design input on par with cost and latency. The
open-weight path we covered in &lt;a href=&quot;https://data-today.net/open-models-default/&quot;&gt;the open-weight default&lt;/a&gt;
is partly a response to exactly this concentration. The country-level estimates
come from &lt;a href=&quot;https://epoch.ai/data/data-centers&quot;&gt;Epoch AI&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;geography-shows-up-in-architecture&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/compute-geography/#geography-shows-up-in-architecture&quot;&gt;&lt;span&gt;Geography shows up in architecture&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The location of compute becomes visible as soon as an application leaves the lab.
A chatbot serving customers in Europe but relying on United States GPU capacity
has to account for latency, data transfer, support windows, and legal exposure.
The model may be global, but the packets still travel through real networks and
real jurisdictions. For high-volume products, those details become product
quality.&lt;/p&gt;
&lt;p&gt;The architectural response is usually hybrid. Teams keep low-risk,
latency-sensitive work close to users and send harder tasks to larger remote models.
They cache common responses, precompute embeddings, and route requests by value.
That pattern is less elegant than a single frontier endpoint, but it is more
resilient when capacity is scarce or regionally constrained.&lt;/p&gt;
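&lt;p&gt;A minimal sketch of that hybrid pattern, under stated assumptions: the &lt;code&gt;Request&lt;/code&gt; fields, the &lt;code&gt;route&lt;/code&gt; function, and the tier names &lt;code&gt;local-small&lt;/code&gt; and &lt;code&gt;remote-frontier&lt;/code&gt; are hypothetical placeholders, not a real provider API.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    prompt: str
    latency_sensitive: bool  # must answer fast, close to the user
    hard: bool               # needs a larger model to get right
    high_value: bool         # worth paying for a remote frontier call


def route(req: Request, cache: dict) -> str:
    """Return where a request should be served (hypothetical tiers)."""
    # 1. Common responses are answered from the cache first.
    if req.prompt in cache:
        return "cache"
    # 2. Hard, high-value work goes to the larger remote model.
    if req.hard and req.high_value:
        return "remote-frontier"
    # 3. Everything else, including latency-sensitive work, stays local.
    return "local-small"


cache = {"What are your opening hours?": "9am to 5pm"}
print(route(Request("What are your opening hours?", True, False, False), cache))
print(route(Request("Draft a contract summary", False, True, True), cache))
print(route(Request("Autocomplete this sentence", True, False, False), cache))
```

&lt;p&gt;A production router would likely score requests rather than use booleans, but the ordering (cache, then value-based escalation, then local default) is the part that carries over.&lt;/p&gt;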
&lt;p&gt;Geography also affects incident response. If a cloud region loses capacity, a
team with local fallback models can degrade gracefully. A team whose entire AI
stack depends on one remote cluster has fewer options. Concentration turns
capacity planning into reliability planning.&lt;/p&gt;
&lt;h2 id=&quot;policy-risk-is-a-technical-dependency&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/compute-geography/#policy-risk-is-a-technical-dependency&quot;&gt;&lt;span&gt;Policy risk is a technical dependency&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Export controls, procurement rules, and data-residency laws are often discussed
as policy issues. For builders, they are dependencies. A change in chip export
rules can alter which models are affordable in a region. A new data-residency
requirement can make a hosted model unusable for regulated customers. A subsidy
or grid connection can change where a provider builds the next large cluster.&lt;/p&gt;
&lt;p&gt;Those policy dependencies are hard to mock in a test suite, so teams need to
model them in vendor strategy. One practical approach is to classify each AI
workload by portability. Can it move across providers? Can it run on open
weights? Does it require a proprietary model feature? Does it handle regulated
data? The answers determine whether compute concentration is a nuisance or a
business risk.&lt;/p&gt;
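&lt;p&gt;The four questions above can be turned into a rough classification. This is a sketch, not a standard: the &lt;code&gt;Workload&lt;/code&gt; fields mirror the questions, while the &lt;code&gt;exposure&lt;/code&gt; function and its three labels are assumptions chosen for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    multi_provider: bool        # can it move across providers?
    runs_on_open_weights: bool  # can it run on open weights?
    needs_proprietary: bool     # does it require a proprietary model feature?
    regulated_data: bool        # does it handle regulated data?


def exposure(w: Workload) -> str:
    """Classify exposure to compute concentration (illustrative rule)."""
    # A workload is locked in if it needs a proprietary feature,
    # or if it can neither change provider nor run on open weights.
    locked_in = w.needs_proprietary or not (w.multi_provider or w.runs_on_open_weights)
    if locked_in and w.regulated_data:
        return "business risk"
    if locked_in or w.regulated_data:
        return "watch"
    return "nuisance"


workloads = [
    Workload("internal search", True, True, False, False),
    Workload("claims triage", False, False, True, True),
    Workload("marketing drafts", False, False, True, False),
]
for w in workloads:
    print(f"{w.name}: {exposure(w)}")
```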
&lt;p&gt;The most exposed workloads combine high volume, sensitive data, and frontier
model dependence. They are expensive to serve abroad, hard to move, and likely
to attract regulatory attention. Those are the systems where regional compute
availability should be discussed before the contract is signed.&lt;/p&gt;
&lt;h2 id=&quot;what-the-next-map-should-measure&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/compute-geography/#what-the-next-map-should-measure&quot;&gt;&lt;span&gt;What the next map should measure&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The 75 percent figure is a performance-share estimate, which is the right place
to start. The next layer is usability. A country can host a large cluster that
is unavailable to most customers because it is reserved for one lab, one cloud,
or one government project. Publicly rentable capacity matters differently from
private training capacity.&lt;/p&gt;
&lt;p&gt;Pricing is another missing layer. Two regions can have similar chip counts but
very different delivered cost once power prices, utilization, taxes, and network
fees are included. Latency and carbon intensity add further dimensions. The
useful map for a product team is not just where the GPUs sit. It is where the
right model can run at the right price under the right rules.&lt;/p&gt;
&lt;p&gt;Until that map is more balanced, compute concentration will shape AI adoption
outside the United States. Builders elsewhere can still compete, but they will
win by being careful about model choice, data locality, and fallback design. The
frontier may be centralized. Useful deployment does not have to be.&lt;/p&gt;
&lt;p&gt;For builders and buyers alike, the central fact is that compute share is now a
market-structure signal. It explains why some regions pay more, why some vendors
push open weights, and why latency can become a strategic issue. AI policy may
look abstract from a distance. In the application stack, it shows up as routing,
hosting, procurement, and risk.&lt;/p&gt;
&lt;p&gt;That is the useful takeaway for anyone outside the largest compute markets: do
not treat model access as a pure software dependency. Treat it like energy,
payments, or logistics: a critical supply chain that needs redundancy before the
traffic arrives.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Everyone automated everything. Then they hired more people.</title>
    <link href="https://data-today.net/everyone-automated-everything/" />
    <updated>2026-05-26T00:00:00Z</updated>
    <id>https://data-today.net/everyone-automated-everything/</id>
    <content type="html">&lt;p&gt;Adoption went vertical and headcount did not collapse. Stanford&#39;s 2025 AI Index
reports that 78% of organisations used AI in 2024, up from 55% the year before,
shown in the chart above. United States private AI investment reached 109.1
billion dollars in the same year, nearly 12 times China&#39;s figure. The money and
the usage both arrived. The mass layoffs that were promised mostly did not.&lt;/p&gt;
&lt;p&gt;Instead, the work changed shape. A growing body of research cited in the same
report finds AI raises productivity and, in most studies, narrows skill gaps
between newer and experienced workers. Cheaper output tends to raise demand for
the judgment that surrounds it.&lt;/p&gt;
&lt;h2 id=&quot;the-paradox-restated&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/everyone-automated-everything/#the-paradox-restated&quot;&gt;&lt;span&gt;The paradox, restated&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;When you automate a task you lower its cost. Lower cost raises demand. Higher
demand needs coordination, and coordination is still mostly human.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Automation did not replace the worker. It changed what the worker is paid to do.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id=&quot;where-the-new-roles-appeared&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/everyone-automated-everything/#where-the-new-roles-appeared&quot;&gt;&lt;span&gt;Where the new roles appeared&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reviewers&lt;/strong&gt; who validate machine output before it ships.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context engineers&lt;/strong&gt; who keep systems app-aware and grounded.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Exception handlers&lt;/strong&gt; for the long tail that automation will not touch.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-r&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;# Relationship between AI adoption intensity and net hiring&lt;/span&gt;
library&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;tidyverse&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

firms &lt;span class=&quot;token percent-operator operator&quot;&gt;%&gt;%&lt;/span&gt;
  mutate&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;net_hiring &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;headcount_2025 &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt; headcount_2023&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;/&lt;/span&gt; headcount_2023&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token percent-operator operator&quot;&gt;%&gt;%&lt;/span&gt;
  summarise&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;r &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; cor&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;ai_adoption_index&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; net_hiring&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A correlation is not destiny, but it punctures the simplest story. The same
dynamic shows up at the level of a single person trying to
&lt;a href=&quot;https://data-today.net/without-ai-quit-my-job/&quot;&gt;build a business on cheap inference&lt;/a&gt;. For now,
automating the routine has made human judgment more valuable, not less. The
adoption figures come from the
&lt;a href=&quot;https://hai.stanford.edu/ai-index/2025-ai-index-report&quot;&gt;2025 AI Index&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;adoption-is-broad-but-depth-varies&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/everyone-automated-everything/#adoption-is-broad-but-depth-varies&quot;&gt;&lt;span&gt;Adoption is broad, but depth varies&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The 78 percent adoption figure says AI has entered normal operations. It does
not say every organization has transformed. Some firms count a licensed chatbot,
some count embedded AI features in software they already use, and some count
custom systems tied into core workflows. Those levels have very different labor
effects.&lt;/p&gt;
&lt;p&gt;Light adoption often raises output without changing org charts. Employees draft
faster, search internal knowledge more easily, or automate small spreadsheet
tasks. Deep adoption can reshape roles because the system becomes part of the
workflow itself. The labor question depends on which version is spreading and
how much decision authority the system receives.&lt;/p&gt;
&lt;p&gt;That distinction helps explain why mass displacement has not followed mass
adoption. Many organizations are still in the augmentation phase. They are using
AI to reduce friction inside existing jobs rather than redesigning the jobs
around automated throughput. The technology is present, but the operating model
has not fully changed.&lt;/p&gt;
&lt;h2 id=&quot;demand-expands-after-cost-falls&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/everyone-automated-everything/#demand-expands-after-cost-falls&quot;&gt;&lt;span&gt;Demand expands after cost falls&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Automation often expands the amount of work people choose to do. When analysis
gets cheaper, managers ask for more scenarios. When customer replies get easier,
companies respond to more messages. When code scaffolding gets faster, teams
attempt more experiments. The unit task shrinks, but the queue grows.&lt;/p&gt;
&lt;p&gt;That expansion creates new coordination work. Someone has to decide which
analyses matter, which drafts are accurate, which experiments should ship, and
which exceptions deserve human attention. AI reduces the cost of producing
candidate outputs. It does not remove the need to choose among them.&lt;/p&gt;
&lt;p&gt;The result is a familiar productivity paradox. The organization feels busier
because the constraint moved. Employees spend less time producing first drafts
and more time reviewing, prioritizing, integrating, and explaining. Those tasks
are harder to automate because they depend on context, accountability, and
trade-offs that sit outside the model prompt.&lt;/p&gt;
&lt;h2 id=&quot;what-to-watch-in-the-labor-data&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/everyone-automated-everything/#what-to-watch-in-the-labor-data&quot;&gt;&lt;span&gt;What to watch in the labor data&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The next signal is not only headcount. It is task composition. If AI adoption is
deepening, job postings should ask for review, workflow design, data governance,
and automation supervision. Internal metrics should show more output per worker,
but also more time spent on quality control and exception handling. Training
budgets should move toward AI literacy for existing staff rather than only new
technical hiring.&lt;/p&gt;
&lt;p&gt;Wage effects may also be uneven. Workers whose judgment becomes more valuable
can gain, while workers paid mainly for routine production may face pressure.
The same tool can narrow skill gaps inside one role and widen bargaining gaps
between roles. That is why the adoption number alone cannot answer the labor
question.&lt;/p&gt;
&lt;p&gt;For companies, the practical lesson is to measure the whole workflow. Count the
time saved in production, the time added in review, the error rate after review,
and the new work created by cheaper output. Only that full accounting shows
whether automation is raising capacity or just moving effort to a different
part of the organization.&lt;/p&gt;
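&lt;p&gt;A sketch of that full accounting, assuming a team logs the four quantities weekly. The &lt;code&gt;WorkflowWeek&lt;/code&gt; fields, the &lt;code&gt;spare_capacity_hours&lt;/code&gt; function, and the sample numbers are all hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowWeek:
    hours_saved_in_production: float
    hours_added_in_review: float
    outputs_shipped: int
    error_rate_after_review: float  # share of shipped outputs still wrong
    hours_to_fix_one_error: float
    hours_of_new_work: float        # extra work created by cheaper output


def spare_capacity_hours(w: WorkflowWeek) -> float:
    """Hours left over once review, rework, and new work are counted.

    Positive: automation raised capacity. Zero or negative: the effort
    moved to another part of the organization.
    """
    rework = w.outputs_shipped * w.error_rate_after_review * w.hours_to_fix_one_error
    return (
        w.hours_saved_in_production
        - w.hours_added_in_review
        - rework
        - w.hours_of_new_work
    )


# Made-up numbers, for illustration only.
week = WorkflowWeek(
    hours_saved_in_production=120.0,
    hours_added_in_review=40.0,
    outputs_shipped=200,
    error_rate_after_review=0.05,
    hours_to_fix_one_error=2.0,
    hours_of_new_work=30.0,
)
print(f"spare capacity: {spare_capacity_hours(week):.1f} hours")
```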
&lt;h2 id=&quot;the-management-mistake-to-avoid&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/everyone-automated-everything/#the-management-mistake-to-avoid&quot;&gt;&lt;span&gt;The management mistake to avoid&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The mistake is to count AI usage as success. A high adoption rate can coexist
with shallow value if employees use tools in isolated pockets and managers never
redesign the workflow around the faster parts. The more useful question is where
the bottleneck moved after AI arrived.&lt;/p&gt;
&lt;p&gt;If the bottleneck moved from drafting to approval, hire or train reviewers. If it
moved from analysis to prioritization, improve decision routines. If it moved
from customer response to exception handling, redesign escalation. Automation is
only productive when the organization follows the constraint it creates.&lt;/p&gt;
&lt;p&gt;That is also what makes the subject interesting for readers. The
headline is not that companies bought AI tools. The headline is that cheap output
changes the surrounding labor system, and the measured effect depends on the
human work that remains after the model finishes.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>AI chips give 24x more per dollar, if you can afford the sticker</title>
    <link href="https://data-today.net/hardware-value-per-dollar/" />
    <updated>2026-05-28T00:00:00Z</updated>
    <id>https://data-today.net/hardware-value-per-dollar/</id>
    <content type="html">&lt;p&gt;The economics of AI hardware reward patience and punish small budgets at the same
time. Epoch AI finds that AI chip performance per dollar has improved by about 37
percent per year across more than twenty accelerators released between 2012 and
2025. The newest flagship, the GB300, delivers roughly 24 times the performance
per dollar of the 2016 P100, while costing nearly 9 times as much to buy.&lt;/p&gt;
&lt;p&gt;Both facts are true and they pull in opposite directions. Value per dollar keeps
climbing, so the long-run cost of a given workload falls. The entry ticket also
keeps climbing, so the upfront capital needed to play at the top rises with every
generation. The result favours buyers who can amortise a high sticker price over
heavy use.&lt;/p&gt;
&lt;h2 id=&quot;sticker-price-versus-lifetime-value&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/hardware-value-per-dollar/#sticker-price-versus-lifetime-value&quot;&gt;&lt;span&gt;Sticker price versus lifetime value&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;pre class=&quot;language-python&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;chips &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token string&quot;&gt;&quot;P100 (2016)&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;price&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1.0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;value_per_dollar&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1.0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string&quot;&gt;&quot;GB300 (2025)&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;price&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;9.0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;value_per_dollar&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;24.0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; name&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; c &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; chips&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;items&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    total_throughput &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; c&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;price&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt; c&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;value_per_dollar&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;   &lt;span class=&quot;token comment&quot;&gt;# price x perf/$&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;&lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;name&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token format-spec&quot;&gt;14s&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;: &lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;total_throughput&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token format-spec&quot;&gt;5.1f&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt; x the throughput of one P100&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The throughput column shows why hyperscalers buy the expensive part. At nine
times the price and twenty-four times the value per dollar, one GB300 does about
216 times the work of one P100 (9 x 24), but only if you keep it busy. Idle, it
is just an expensive depreciation line.&lt;/p&gt;
&lt;h2 id=&quot;the-split-it-creates&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/hardware-value-per-dollar/#the-split-it-creates&quot;&gt;&lt;span&gt;The split it creates&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;High-utilisation buyers:&lt;/strong&gt; chase the newest chip, since perf-per-dollar wins.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spiky or small workloads:&lt;/strong&gt; rent, or buy a generation behind.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Everyone:&lt;/strong&gt; the rising entry price reinforces
&lt;a href=&quot;https://data-today.net/compute-geography/&quot;&gt;where compute concentrates&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Cheaper-per-dollar and more-expensive-to-own are the same trend. Which one you
feel depends on how full your machines stay. The hardware price-performance data
is published by &lt;a href=&quot;https://epoch.ai/data-insights/price-performance-hardware&quot;&gt;Epoch AI&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;utilization-decides-who-benefits&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/hardware-value-per-dollar/#utilization-decides-who-benefits&quot;&gt;&lt;span&gt;Utilization decides who benefits&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Performance per dollar is a lifetime claim. It assumes the buyer can feed the
chip enough work to earn back the sticker price. A hyperscaler with constant
training, fine-tuning, and inference demand can keep a flagship accelerator
busy. A small company with bursty workloads may pay for capability that sits
idle most of the week.&lt;/p&gt;
&lt;p&gt;That is why the same chip can be cheap for one buyer and expensive for another.
If utilization is high, the faster part lowers the cost of each completed job.
If utilization is low, depreciation dominates. The buyer owns a powerful machine
but still pays for the unused hours, power, support, and opportunity cost of
capital.&lt;/p&gt;
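&lt;p&gt;The effect of utilization can be sketched with the relative figures quoted earlier (GB300 at 9 times the P100&#39;s price and 24 times its performance per dollar). The function name is hypothetical, and the model is a deliberate simplification: it counts capital cost only, assumes equal hardware lifetimes, and ignores power, support, and financing.&lt;/p&gt;

```python
def cost_per_unit_of_work(price: float, value_per_dollar: float, utilization: float) -> float:
    """Relative capital cost per unit of completed work.

    At full utilization a chip does price * value_per_dollar units of work;
    idle time scales the delivered work down by `utilization`.
    """
    if not utilization > 0 or utilization > 1:
        raise ValueError("utilization must be in (0, 1]")
    work_done = price * value_per_dollar * utilization
    return price / work_done


# Baseline: a P100 (price 1, perf per dollar 1) kept fully busy.
p100_busy = cost_per_unit_of_work(1.0, 1.0, 1.0)

for u in (1.0, 0.5, 0.1, 1 / 24):
    gb300 = cost_per_unit_of_work(9.0, 24.0, u)
    print(f"GB300 at {u:6.1%} utilization: {gb300:.3f} per unit "
          f"(fully busy P100: {p100_busy:.3f})")
```

&lt;p&gt;Under these assumptions the flagship only loses its capital-cost advantage over a fully busy P100 when utilization falls to about 1/24, roughly 4 percent. Real break-even points sit higher once power, staff, and the cost of capital are added.&lt;/p&gt;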
&lt;p&gt;Cloud pricing is partly a way to sell utilization. Customers with spiky demand
rent the expensive chip only when they need it. Cloud providers aggregate many
customers, smooth the demand curve, and keep the fleet busier. The trade-off is
that renters pay a margin and may lose access when scarce capacity is reserved
for larger accounts.&lt;/p&gt;
&lt;h2 id=&quot;older-chips-can-be-the-rational-choice&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/hardware-value-per-dollar/#older-chips-can-be-the-rational-choice&quot;&gt;&lt;span&gt;Older chips can be the rational choice&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The newest accelerator is not always the best economic fit. Many inference jobs
are memory-bound, latency-bound, or quality-bound before they are raw-compute
bound. A previous-generation chip can deliver the same user-visible result at a
lower rental rate or acquisition cost. The important comparison is cost per
successful task, not benchmark throughput alone.&lt;/p&gt;
&lt;p&gt;The case for older chips strengthens as models get smaller and more efficient.
Quantization, distillation, and better serving stacks can push useful workloads
onto hardware that no longer sits at the frontier. That extends the economic
life of older fleets and helps explain why total installed compute matters, not
only the newest shipment.&lt;/p&gt;
&lt;p&gt;Procurement teams should therefore segment workloads before buying. Training a
frontier model, serving a high-volume assistant, running nightly batch
summaries, and powering internal search may each deserve a different hardware
tier. One blended hardware strategy usually hides waste.&lt;/p&gt;
&lt;h2 id=&quot;the-accounting-view&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/hardware-value-per-dollar/#the-accounting-view&quot;&gt;&lt;span&gt;The accounting view&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Sticker price is only the first line. Buyers need to include power, cooling,
networking, rack space, maintenance, financing, spare capacity, and staff. They
also need to include the cost of waiting. If the newest chip finishes a training
run days earlier, that speed can be valuable even if the hardware is expensive.
If the workload is not time-sensitive, the premium may be vanity.&lt;/p&gt;
&lt;p&gt;The metric that matters is fully loaded cost per useful unit of work. That unit
may be a million tokens served, a fine-tuning job completed, a batch of videos
processed, or a training experiment finished. Once the unit is clear, the
hardware decision becomes less emotional.&lt;/p&gt;
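&lt;p&gt;As a rough illustration of that metric, the sketch below sums annual cost lines and divides by useful output. The function name, the cost categories, and every figure are made up for the example; substitute your own ledger.&lt;/p&gt;

```python
def fully_loaded_cost_per_unit(annual_costs: dict, useful_units: float) -> float:
    """Total annual cost divided by useful units of work delivered."""
    if not useful_units > 0:
        raise ValueError("useful_units must be positive")
    return sum(annual_costs.values()) / useful_units


# Hypothetical annual figures in dollars, for illustration only.
annual_costs = {
    "hardware_depreciation": 300_000,
    "power_and_cooling": 60_000,
    "networking_and_rack_space": 25_000,
    "maintenance_and_staff": 90_000,
    "financing": 25_000,
}

# Here the unit of work is one million tokens served.
million_tokens_served = 2_000_000

cost = fully_loaded_cost_per_unit(annual_costs, million_tokens_served)
print(f"fully loaded cost: ${cost:.2f} per million tokens")
```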
&lt;p&gt;The broader market trend is still positive. More performance per dollar lowers
the long-run cost of AI. The distribution of that benefit is uneven because
capital access, utilization, and power contracts differ widely. The chip keeps
getting better. The buyer still has to be big enough, busy enough, or careful
enough to capture the gain.&lt;/p&gt;
&lt;h2 id=&quot;the-decision-rule&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/hardware-value-per-dollar/#the-decision-rule&quot;&gt;&lt;span&gt;The decision rule&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Buy the newest chip only when three conditions hold: the workload needs it, the
team can keep it busy, and the fully loaded cost beats rental or API access. If
one condition is missing, the better answer may be an older accelerator, a cloud
reservation, or a managed model. That rule sounds conservative, but it prevents
hardware strategy from becoming a status purchase.&lt;/p&gt;
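&lt;p&gt;Written as code, the rule is a three-way conjunction. The function and parameter names below are hypothetical, and the two cost inputs are assumed to be fully loaded costs per useful unit of work, measured the same way.&lt;/p&gt;

```python
def hardware_choice(
    workload_needs_it: bool,
    can_keep_it_busy: bool,
    owned_cost_per_unit: float,
    rental_or_api_cost_per_unit: float,
) -> str:
    """Apply the three-condition rule for buying the newest chip."""
    owning_is_cheaper = rental_or_api_cost_per_unit > owned_cost_per_unit
    if workload_needs_it and can_keep_it_busy and owning_is_cheaper:
        return "buy the newest chip"
    # Any missing condition points to one of the alternatives.
    return "consider an older accelerator, a cloud reservation, or a managed model"


print(hardware_choice(True, True, 0.25, 0.40))
print(hardware_choice(True, False, 0.25, 0.40))
```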
&lt;p&gt;The same rule helps readers interpret the price-performance curve.
The market is improving quickly, yet the improvement is mediated by capital,
operations, and demand. Performance per dollar is the starting metric. Useful
work per dollar is the metric that decides who benefits.&lt;/p&gt;
&lt;p&gt;That distinction is easy to miss in procurement decks. A chip can be the best
part on the market and still be the wrong purchase for a team with low volume,
weak operations, or uncertain demand. The curve says the industry is becoming
more efficient. It does not say every buyer captures the same efficiency on day
one.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>The model leaderboard is tightening to a photo finish</title>
    <link href="https://data-today.net/benchmark-heatmap-explainer/" />
    <updated>2026-05-28T00:00:00Z</updated>
    <id>https://data-today.net/benchmark-heatmap-explainer/</id>
    <content type="html">&lt;p&gt;The frontier is crowded and the gaps are vanishing. According to Stanford&#39;s 2025
AI Index, the score difference between the top model and the tenth-ranked model
fell from 11.9% to 5.4% in a single year, and the top two are now separated by
just 0.7%. A leaderboard that close is a photo finish, not a ranking.&lt;/p&gt;
&lt;p&gt;The same compression shows up across borders. Chinese models narrowed the gap
with United States systems on major benchmarks such as MMLU and HumanEval from
double digits in 2023 to near parity in 2024. The map of who leads now depends
heavily on which task you measure.&lt;/p&gt;
&lt;h2 id=&quot;why-the-average-lies&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/benchmark-heatmap-explainer/#why-the-average-lies&quot;&gt;&lt;span&gt;Why the average lies&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A single headline number averages away the disagreement that matters. The
heatmap above plots models against task categories, and the interesting ground
is wherever the colours stop matching. Those divergent cells are where model
choice changes your result.&lt;/p&gt;
&lt;pre class=&quot;language-python&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; pandas &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; pd

scores &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; pd&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;read_csv&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;benchmarks.csv&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;           &lt;span class=&quot;token comment&quot;&gt;# model, task, score&lt;/span&gt;
pivot &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; scores&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;pivot&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;index&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;model&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; columns&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;task&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; values&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;score&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token comment&quot;&gt;# Tasks with the widest spread are where model choice matters most&lt;/span&gt;
spread &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;pivot&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt; pivot&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;sort_values&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;ascending&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token boolean&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;spread&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;head&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;how-to-read-the-grid&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/benchmark-heatmap-explainer/#how-to-read-the-grid&quot;&gt;&lt;span&gt;How to read the grid&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Bright row:&lt;/strong&gt; a generalist that holds up across the board.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bright column:&lt;/strong&gt; a task everyone has saturated, so stop optimising for it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Patchy row:&lt;/strong&gt; a specialist, strong until it is not.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;what-it-means-for-buyers&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/benchmark-heatmap-explainer/#what-it-means-for-buyers&quot;&gt;&lt;span&gt;What it means for buyers&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;When the top models are within a point of each other, price, latency, and
&lt;a href=&quot;https://data-today.net/open-models-default/&quot;&gt;license terms&lt;/a&gt; decide more than raw capability. Treat
benchmarks as a map of disagreement, then test on your own workload. The full
methodology sits in the
&lt;a href=&quot;https://hai.stanford.edu/ai-index/2025-ai-index-report&quot;&gt;2025 AI Index&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;the-leaderboard-is-a-sampling-device&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/benchmark-heatmap-explainer/#the-leaderboard-is-a-sampling-device&quot;&gt;&lt;span&gt;The leaderboard is a sampling device&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A benchmark is useful because it turns a broad claim into a repeatable test. It
is limited because the test is only a sample of the world. When model scores are
spread far apart, the limitation matters less. A large gap can survive noisy
questions, small prompt changes, and a few stale examples. When the gap narrows
to less than a percentage point, the measurement itself becomes part of the
story.&lt;/p&gt;
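&lt;p&gt;A rough way to see why a sub-point gap is fragile is to compare it with the sampling error of the scores themselves. The sketch below uses the textbook binomial standard error and treats the two scores as independent, which is a simplifying assumption. The accuracies and the 1,000-question benchmark size are invented for illustration.&lt;/p&gt;

```python
import math


def gap_standard_error(acc_a: float, acc_b: float, n_questions: int) -> float:
    """Standard error of the accuracy gap, treating the two scores as independent."""
    var_a = acc_a * (1 - acc_a) / n_questions
    var_b = acc_b * (1 - acc_b) / n_questions
    return math.sqrt(var_a + var_b)


# Illustrative scores 0.7 points apart on a hypothetical 1,000-question benchmark.
acc_a, acc_b, n = 0.857, 0.850, 1000
se = gap_standard_error(acc_a, acc_b, n)
print(f"gap = {acc_a - acc_b:.3f}, standard error = {se:.3f}")
```

&lt;p&gt;Under these assumptions the standard error is roughly 1.6 points, more than twice the 0.7-point gap. A paired comparison on shared questions would usually be tighter, but the order of magnitude explains why a photo finish should not be read as a ranking.&lt;/p&gt;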
&lt;p&gt;That is why aggregate ranks age badly in a crowded frontier. The top model on a
Monday may be second on Tuesday because a lab released a tuned variant, a judge
model changed, or a benchmark added harder examples. The rank is still worth
reporting, but it should not be treated as a stable product requirement. Buyers
need to know whether the difference is large enough to matter in their own
workflow.&lt;/p&gt;
&lt;p&gt;The heatmap view is more durable because it preserves disagreement. A model that
is average overall but excellent at code repair can be the right choice for an
engineering tool. A model that leads on knowledge tests but lags on instruction
following may disappoint in customer support. The useful question is rarely
which model is best. It is which model fails least often on the task you repeat
every day.&lt;/p&gt;
&lt;h2 id=&quot;what-crawlers-and-answer-engines-can-use&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/benchmark-heatmap-explainer/#what-crawlers-and-answer-engines-can-use&quot;&gt;&lt;span&gt;What crawlers and answer engines can use&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;AI crawlers prefer pages that state the metric, the source, and the implication
in plain language. A thin leaderboard post says the gap shrank. A useful one
explains what the gap measures, why it may be unstable, and how a reader should
act on it. That extra context gives answer systems something to cite beyond a
single number.&lt;/p&gt;
&lt;p&gt;For benchmarks, the important metadata is practical: task family, evaluation
date, sample size, scoring method, and whether the prompts were public before
the model was trained. Public benchmarks are vulnerable to contamination and
saturation because test items leak into training data and models learn the style
of the test. Private evaluations are harder to inspect.
Neither is perfect, so a serious buyer should treat each score as a clue rather
than a verdict.&lt;/p&gt;
&lt;p&gt;The same logic applies to model cards and vendor claims. A top-line score
without latency, price, context length, tool-use behavior, and data-retention
terms is incomplete. Capability only becomes useful after it survives those
deployment constraints. That is why the leaderboard compression pushes attention
toward cost and governance rather than away from measurement.&lt;/p&gt;
&lt;h2 id=&quot;build-a-local-benchmark-before-procurement&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/benchmark-heatmap-explainer/#build-a-local-benchmark-before-procurement&quot;&gt;&lt;span&gt;Build a local benchmark before procurement&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The best response to a crowded leaderboard is a small private test set. Pull a
few hundred real examples from the workflow, remove sensitive data, and preserve
the original expected outcome. Score candidate models on correctness, refusal
behavior, latency, and review effort. The test does not need to be large enough
for a research paper. It needs to be close enough to the business problem to
catch expensive mismatches.&lt;/p&gt;
&lt;p&gt;Teams should also keep losing examples. Failed prompts are more informative than
average scores because they show where a model will demand human supervision.
If a model misses rare but costly cases, a slightly lower-ranked competitor may
be safer. If two models tie on quality, the cheaper and more transparent one
usually wins.&lt;/p&gt;
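&lt;p&gt;A local benchmark of this kind needs very little tooling. The sketch below assumes the grading has already been done and stored as one record per model and example. The field names, model names, and numbers are hypothetical.&lt;/p&gt;

```python
from collections import defaultdict
from statistics import mean


def row(model, example, correct, refused, latency_s, review_min):
    """Build one graded result record (hypothetical schema)."""
    return {"model": model, "example": example, "correct": correct,
            "refused": refused, "latency_s": latency_s, "review_min": review_min}


# Invented results for two candidate models on three real-looking examples.
results = [
    row("model-a", "ticket-001", True, False, 1.0, 2),
    row("model-a", "ticket-002", True, False, 2.0, 4),
    row("model-a", "ticket-003", False, True, 3.0, 6),
    row("model-b", "ticket-001", True, False, 0.5, 3),
    row("model-b", "ticket-002", False, False, 0.5, 9),
    row("model-b", "ticket-003", True, False, 0.5, 3),
]


def summarize(results):
    """Aggregate per-model scores and keep the failed examples for later review."""
    by_model = defaultdict(list)
    for r in results:
        by_model[r["model"]].append(r)
    summary = {}
    for model, rows in by_model.items():
        summary[model] = {
            "accuracy": mean(1.0 if r["correct"] else 0.0 for r in rows),
            "refusal_rate": mean(1.0 if r["refused"] else 0.0 for r in rows),
            "mean_latency_s": mean(r["latency_s"] for r in rows),
            "mean_review_min": mean(r["review_min"] for r in rows),
            "failures": [r["example"] for r in rows if not r["correct"]],
        }
    return summary


for model, stats in summarize(results).items():
    print(model, stats)
```

&lt;p&gt;In this invented data the two models tie on accuracy, so the decision falls to latency, review effort, and which examples each one failed. Keeping the failures list alongside the averages is what preserves the losing examples.&lt;/p&gt;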
&lt;p&gt;Leaderboard compression is good news for buyers because competition is real. It
also raises the burden on buyers because the easy answer has disappeared. The
next procurement mistake will not come from ignoring benchmarks. It will come
from reading an average as if it described the work.&lt;/p&gt;
&lt;p&gt;For crawlers, the durable claim is that benchmark gaps are now small enough for
deployment traits to dominate many decisions. For practitioners, the same claim
becomes a checklist: test the task, price the route, inspect the license, and
measure review effort. A model that wins the public table but loses those local
checks is the wrong model for the job.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Your agents are working too hard</title>
    <link href="https://data-today.net/agents-working-too-hard/" />
    <updated>2026-05-30T00:00:00Z</updated>
    <id>https://data-today.net/agents-working-too-hard/</id>
    <content type="html">&lt;p&gt;AI agents are further from mainstream than the marketing suggests. In the Stack
Overflow 2025 Developer Survey, 52% of developers said they either do not use
agents or stick to simpler autocomplete tools, and 38% reported no plans to
adopt them at all. Daily use sits at just 14%, as the chart above shows.&lt;/p&gt;
&lt;p&gt;The people who do run agents describe a lopsided benefit. About 70% agree agents
reduced time on specific tasks and increased their personal productivity, but
only 17% agree agents improved collaboration within their team. The gains are
real and they are individual.&lt;/p&gt;
&lt;h2 id=&quot;the-concerns-are-about-correctness&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/agents-working-too-hard/#the-concerns-are-about-correctness&quot;&gt;&lt;span&gt;The concerns are about correctness&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Adoption is gated by trust, not novelty. Across all respondents, 87% said they
are concerned about the accuracy of agent output, and 81% raised security and
privacy concerns about the data agents touch. Those worries scale with
responsibility, which is why the same survey shows
&lt;a href=&quot;https://data-today.net/vibecoding-to-production/&quot;&gt;deep resistance to vibe coding&lt;/a&gt; for production work.&lt;/p&gt;
&lt;h2 id=&quot;the-cheapest-fix-is-a-stop-condition&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/agents-working-too-hard/#the-cheapest-fix-is-a-stop-condition&quot;&gt;&lt;span&gt;The cheapest fix is a stop condition&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Most runaway agent bills come from loops: an agent re-litigating a trivial
decision, or two agents trading filler until a human intervenes. A small
loop-breaker and a hard step budget remove most of that cost.&lt;/p&gt;
&lt;pre class=&quot;language-python&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;StepBudgetExceeded&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;RuntimeError&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token triple-quoted-string string&quot;&gt;&quot;&quot;&quot;Raised when the agent uses its whole step budget without finishing.&quot;&quot;&quot;&lt;/span&gt;


&lt;span class=&quot;token keyword&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;LoopDetected&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;RuntimeError&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token triple-quoted-string string&quot;&gt;&quot;&quot;&quot;Raised when the agent falls into a bot-to-bot greeting loop.&quot;&quot;&quot;&lt;/span&gt;


&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;run_agent&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;agent&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; task&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; max_steps&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token triple-quoted-string string&quot;&gt;&quot;&quot;&quot;Dispatch an agent with an app-aware prompt and a hard step budget.&quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;token comment&quot;&gt;# build_app_context and apply_action are app-specific helpers, not shown here&lt;/span&gt;
    context &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; build_app_context&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;task&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;        &lt;span class=&quot;token comment&quot;&gt;# what the app is, who it serves&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; _ &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;max_steps&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        action &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; agent&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;next_action&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;task&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; context&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; action&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;kind &lt;span class=&quot;token operator&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;done&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; action&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;result
        &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; action&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;is_social_pingpong&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;       &lt;span class=&quot;token comment&quot;&gt;# no bot-to-bot greeting loops&lt;/span&gt;
            &lt;span class=&quot;token keyword&quot;&gt;raise&lt;/span&gt; LoopDetected&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;task&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        context &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; apply_action&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;action&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; context&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;raise&lt;/span&gt; StepBudgetExceeded&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;task&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;rules-of-thumb&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/agents-working-too-hard/#rules-of-thumb&quot;&gt;&lt;span&gt;Rules of thumb&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Give every agent a hard step budget.&lt;/li&gt;
&lt;li&gt;Pass app-aware context so the agent knows what it is for.&lt;/li&gt;
&lt;li&gt;Break bot-to-bot greeting loops on sight.&lt;/li&gt;
&lt;li&gt;Log token spend per task, not per day.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Efficient agents are not the ones that try hardest. They are the ones that know
when to stop. The full adoption picture is in the
&lt;a href=&quot;https://survey.stackoverflow.co/2025/ai&quot;&gt;Stack Overflow 2025 survey&lt;/a&gt;.&lt;/p&gt;
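&lt;p&gt;The fourth rule in the list above, logging spend per task, can be as small as a dictionary keyed by task id. The class and method names below are hypothetical, and the token counts are invented. A real system would read usage from whatever its model provider reports.&lt;/p&gt;

```python
from collections import defaultdict


class TokenLedger:
    """Accumulate token usage keyed by task id rather than by calendar day."""

    def __init__(self):
        self.spend = defaultdict(int)

    def record(self, task_id: str, prompt_tokens: int, completion_tokens: int) -> None:
        """Add one model call to the running total for its task."""
        self.spend[task_id] += prompt_tokens + completion_tokens

    def top_tasks(self, n: int = 3):
        """Return the n most expensive tasks, largest first."""
        return sorted(self.spend.items(), key=lambda kv: kv[1], reverse=True)[:n]


ledger = TokenLedger()
ledger.record("fix-login-bug", 1200, 300)
ledger.record("fix-login-bug", 900, 250)
ledger.record("rename-module", 400, 100)
print(ledger.top_tasks())
```

&lt;p&gt;Grouping by task makes a looping agent show up as one outsized entry instead of disappearing into a daily total.&lt;/p&gt;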
&lt;h2 id=&quot;why-individual-gains-do-not-become-team-gains&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/agents-working-too-hard/#why-individual-gains-do-not-become-team-gains&quot;&gt;&lt;span&gt;Why individual gains do not become team gains&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The survey split between personal productivity and team collaboration is the
most important number in the story. A developer can save an hour generating a
test scaffold or translating a stack trace into a fix plan. The team only saves
that hour if the output is understandable, reviewable, and aligned with the
system everyone else is maintaining. Agent work often enters the codebase as a
large patch with weak provenance: many files changed, several assumptions baked
in, and little explanation of which path was tried and rejected.&lt;/p&gt;
&lt;p&gt;That makes agent output different from normal automation. A formatter, build
step, or dependency bot produces changes inside a narrow contract. An agent can
alter design, data flow, naming, tests, and release risk in a single run. The
person who prompted it may feel faster while the reviewer inherits a larger
verification problem. That is how a tool can raise personal velocity while
leaving collaboration flat.&lt;/p&gt;
&lt;p&gt;Teams that get value from agents usually constrain the shape of the work before
the first prompt is sent. They ask for one patch, one bounded task, and one
explicit validation command. They require the agent to name the controlling
code path and the cheap check that would disprove the attempted fix. The result
is slower than a free-form agent demo, but it creates an artifact a colleague
can inspect without rerunning the entire conversation.&lt;/p&gt;
&lt;h2 id=&quot;the-hidden-cost-is-review-bandwidth&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/agents-working-too-hard/#the-hidden-cost-is-review-bandwidth&quot;&gt;&lt;span&gt;The hidden cost is review bandwidth&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Token spend is visible on an invoice. Review bandwidth is harder to measure,
which is why many teams miss it. If an agent saves 30 minutes of coding but adds
45 minutes of uncertain review, the organization loses 15 minutes net even though the
individual developer felt faster. That mismatch explains why AI assistance can
spread inside a company while engineering leaders remain cautious about broader
workflow claims.&lt;/p&gt;
&lt;p&gt;The concern is sharper in production systems because generated code often fails
at the boundaries: permissions, migrations, retries, observability, and user
state. These are the parts reviewers already inspect most carefully. A confident
agent patch that touches them without tests creates work that cannot be skipped.
The agent did not merely write code. It created a proof obligation for the human
team.&lt;/p&gt;
&lt;p&gt;One useful management metric is review minutes per accepted agent change. It is
less glamorous than lines generated or tasks completed, but it answers the
collaboration question directly. If review minutes fall while escaped defects
stay flat, agents are improving the team. If review minutes rise, the tool is
moving effort from authoring to inspection.&lt;/p&gt;
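&lt;p&gt;The metric is simple to compute once review time is logged. The sketch below makes one design assumption worth stating: minutes spent reviewing rejected changes still count, because that time was paid for. The record fields and the sample sprint are hypothetical.&lt;/p&gt;

```python
def review_minutes_per_accepted_change(changes):
    """Total review minutes on agent changes divided by the number accepted."""
    total_minutes = sum(c["review_min"] for c in changes)
    accepted = sum(1 for c in changes if c["accepted"])
    if accepted == 0:
        return float("inf")   # review time was spent (or nothing was logged) and nothing shipped
    return total_minutes / accepted


# Hypothetical review log for one sprint of agent-authored changes.
sprint = [
    {"id": "pr-1", "review_min": 20, "accepted": True},
    {"id": "pr-2", "review_min": 45, "accepted": False},
    {"id": "pr-3", "review_min": 25, "accepted": True},
]
print(review_minutes_per_accepted_change(sprint))
```

&lt;p&gt;Tracked sprint over sprint next to escaped defects, the trend in this number is what answers the collaboration question.&lt;/p&gt;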
&lt;h2 id=&quot;what-a-mature-agent-workflow-looks-like&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/agents-working-too-hard/#what-a-mature-agent-workflow-looks-like&quot;&gt;&lt;span&gt;What a mature agent workflow looks like&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A mature workflow treats the agent like a junior teammate with fast hands and no
institutional memory. It gets context, boundaries, and a definition of done. It
does not get permission to wander across the codebase because the prompt was
vague. The best agent request includes the user problem, the relevant product
context, the files likely to matter, and the validation command that will decide
whether the work is finished.&lt;/p&gt;
&lt;p&gt;That discipline also reduces security risk. Agents should receive only the data
they need for the task, and they should never be asked to improvise with secrets
or production credentials. Logs should record which files changed, which tools
ran, and which checks passed. When a change later breaks, the team needs a short
audit trail, not a transcript full of exploratory dead ends.&lt;/p&gt;
&lt;p&gt;The practical conclusion is modest. Agents are useful when they shorten a known
path through known systems. They are expensive when they are asked to discover a
strategy, make broad edits, and prove their own work all at once. Adoption will
rise as teams learn to put agents inside narrower loops, with human review aimed
at judgment rather than reconstruction.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>The bill for frontier AI is now measured in gigawatts</title>
    <link href="https://data-today.net/labs-promise-agi-consultants/" />
    <updated>2026-05-31T00:00:00Z</updated>
    <id>https://data-today.net/labs-promise-agi-consultants/</id>
    <content type="html">&lt;p&gt;The constraint on frontier AI is no longer ideas. It is electricity and capital.
Frontier labs have collectively raised more than 170 billion dollars, and a
single data center with one gigawatt of facility power now costs roughly 30
billion dollars to build, according to Epoch AI. The model is the cheap part.&lt;/p&gt;
&lt;p&gt;The buildout is visible in physical infrastructure. The largest known AI data
center, the Anthropic and Amazon site at New Carlisle, draws an estimated 1.1
gigawatts and carries about 35 billion dollars in capital cost. Microsoft&#39;s
planned Fairwater Wisconsin facility is projected to be nearly eight times more
powerful, equivalent to 5.2 million H100 chips by September 2027.&lt;/p&gt;
&lt;h2 id=&quot;compute-is-compounding&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/labs-promise-agi-consultants/#compute-is-compounding&quot;&gt;&lt;span&gt;Compute is compounding&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The stock of AI compute, charted above, is growing about 3.4 times per year,
doubling roughly every seven months since 2022. Training compute for frontier
language models has climbed about 5 times per year since 2020, while the cost of
those training runs rises about 3.5 times per year and power use roughly doubles
annually. Stack those curves and the shape of a frontier budget stops being a
research line item and becomes an infrastructure bill: data centres, power
contracts, and silicon dwarf the salaries of the people writing the models.&lt;/p&gt;
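&lt;p&gt;The growth multiples quoted above convert to doubling times with one logarithm. The sketch below assumes each rate is a constant annual multiple, which is a simplification of the underlying trend estimates.&lt;/p&gt;

```python
import math


def doubling_time_months(growth_per_year: float) -> float:
    """Months needed to double at a constant annual growth multiple."""
    return 12 * math.log(2) / math.log(growth_per_year)


# Annual multiples as quoted in the paragraph above.
rates = [
    ("AI compute stock", 3.4),
    ("frontier training compute", 5.0),
    ("training run cost", 3.5),
    ("power use", 2.0),
]
for label, rate in rates:
    print(f"{label}: doubles about every {doubling_time_months(rate):.1f} months")
```

&lt;p&gt;A 3.4 times annual rate works out to about 6.8 months per doubling, consistent with the seven-month figure.&lt;/p&gt;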
&lt;h2 id=&quot;why-this-favours-incumbents&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/labs-promise-agi-consultants/#why-this-favours-incumbents&quot;&gt;&lt;span&gt;Why this favours incumbents&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;When the marginal advantage comes from owning power and silicon, geography and
balance sheets decide who competes. The United States holds about three-quarters
of global GPU cluster performance today. Gigawatt-scale sites take around two
years to build, which turns AI strategy into a real-estate and energy problem as
much as a research one.&lt;/p&gt;
&lt;h2 id=&quot;what-to-watch&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/labs-promise-agi-consultants/#what-to-watch&quot;&gt;&lt;span&gt;What to watch&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The open question is whether capability keeps tracking this spend. The
&lt;a href=&quot;https://data-today.net/scaling-curve-extrapolate/&quot;&gt;scaling curve&lt;/a&gt; has held so far, and Epoch argues
the trend can continue through 2030. If returns flatten before the concrete is
poured, some of these commitments will look very large in hindsight. The
underlying numbers are tracked in
&lt;a href=&quot;https://epoch.ai/trends&quot;&gt;Epoch AI&#39;s trends dashboard&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;capital-intensity-changes-the-research-culture&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/labs-promise-agi-consultants/#capital-intensity-changes-the-research-culture&quot;&gt;&lt;span&gt;Capital intensity changes the research culture&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;When a training run depends on gigawatts and billions of dollars, research
becomes tied to capital allocation. The lab still needs scientists, but the
frontier experiment also needs power contracts, procurement teams, construction
partners, and finance committees. That changes which ideas get tested. The best
proposal is no longer only the most elegant one. It is the one that can justify
scarce compute on a schedule.&lt;/p&gt;
&lt;p&gt;This pressure can narrow the field. Smaller teams may produce important
algorithmic ideas, but they may need a large partner to test those ideas at full
scale. Large companies can run more frontier experiments, collect more failure
data, and turn infrastructure into a learning advantage. The gap is not only the
size of the cluster. It is the feedback loop the cluster permits.&lt;/p&gt;
&lt;p&gt;There is a counterweight. High capital intensity makes efficiency research more
valuable. Any method that reduces training compute, improves data quality,
raises utilization, or transfers capability to smaller models can save enormous
amounts of money. The infrastructure race therefore increases the prize for
better algorithms even as it raises the cost of testing them.&lt;/p&gt;
&lt;h2 id=&quot;why-consultants-follow-the-spend&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/labs-promise-agi-consultants/#why-consultants-follow-the-spend&quot;&gt;&lt;span&gt;Why consultants follow the spend&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Large capital programs create advisory markets. Companies spending billions on
AI infrastructure need forecasts, procurement advice, power strategy, risk
models, governance plans, and board narratives. The consultant promise is to
turn a technical arms race into an investment plan. Some of that work is useful.
Some of it will be expensive storytelling around uncertain curves.&lt;/p&gt;
&lt;p&gt;The risk is that AGI language can blur ordinary capital discipline. A project
that would be questioned as a data-center investment may look inevitable when it
is framed as a step toward general intelligence. Investors and boards should
still ask normal questions: what capacity is committed, what demand is already
contracted, what utilization is assumed, and what happens if model returns slow?&lt;/p&gt;
&lt;p&gt;Those questions do not dismiss the technology. They protect the company from
mistaking momentum for proof. Frontier AI may justify unusually large bets, but
large bets still need milestones that can be measured before the full bill is
spent.&lt;/p&gt;
&lt;h2 id=&quot;the-user-facing-consequence&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/labs-promise-agi-consultants/#the-user-facing-consequence&quot;&gt;&lt;span&gt;The user-facing consequence&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Most customers will never see the gigawatt bill directly. They will see it in
product packaging. Providers will push subscriptions, usage tiers, committed
spend contracts, and enterprise bundles that help finance fixed infrastructure.
They will also try to move routine work onto cheaper models so premium capacity
is reserved for high-margin tasks.&lt;/p&gt;
&lt;p&gt;That means buyers should ask where their workload sits in the provider&#39;s cost
stack. A feature that uses commodity inference should not be priced like a
frontier reasoning product. A feature that depends on scarce premium capacity
may face throttling, higher minimums, or stricter terms during demand spikes.&lt;/p&gt;
&lt;p&gt;The infrastructure story therefore matters even for ordinary software buyers.
It explains pricing, availability, vendor concentration, and the pressure to
commit early. The model may be the visible product, but the power contract is
increasingly the economic engine behind it.&lt;/p&gt;
&lt;h2 id=&quot;how-to-judge-the-promise&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/labs-promise-agi-consultants/#how-to-judge-the-promise&quot;&gt;&lt;span&gt;How to judge the promise&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The serious question is whether each new dollar of infrastructure produces a
larger base of paying capability. That can happen through better models, cheaper
serving, larger customer volume, or more valuable product bundles. If the spend
only produces impressive demos, the economics weaken. If it produces dependable
services that customers use daily, the gigawatt bill becomes easier to defend.&lt;/p&gt;
&lt;p&gt;Boards and buyers should therefore ask for milestones that connect capacity to
use. How much of the planned compute is contracted? Which products depend on it?
What utilization is assumed? Which workloads can move to cheaper models if
frontier capacity is scarce? Those questions turn AGI promises into operating
assumptions that can be checked.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>AI capability stopped slowing down. It sped up after 2024</title>
    <link href="https://data-today.net/capability-acceleration/" />
    <updated>2026-05-31T00:00:00Z</updated>
    <id>https://data-today.net/capability-acceleration/</id>
    <content type="html">&lt;p&gt;The story everyone expected was a slowdown. The data shows the opposite. Epoch AI
measures frontier capability rising about 15.5 Epoch Capabilities Index (ECI) points per year, with a 90 percent
interval of 13 to 18, and notes the rate has grown faster since early 2024. The
curve that many predicted would bend toward a plateau bent the other way.&lt;/p&gt;
&lt;p&gt;The acceleration comes from several inputs compounding at once: training compute
up about 5 times a year, algorithms 3 times more efficient annually, and a wave
of investment, with frontier labs having raised more than 170 billion dollars.
When several exponentials stack, the combined capability curve steepens rather
than settling.&lt;/p&gt;
&lt;h2 id=&quot;what-an-acceleration-does-to-forecasts&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/capability-acceleration/#what-an-acceleration-does-to-forecasts&quot;&gt;&lt;span&gt;What an acceleration does to forecasts&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The trap is forecasting a rising curve with a fixed yearly rate. If the rate of
improvement is itself climbing, as it has since the 2024 inflection, a
constant-rate forecast undershoots, and the shortfall widens each year.
Plans that assumed diminishing returns have been consistently wrong on the low
side, which is an uncomfortable kind of error to keep making.&lt;/p&gt;
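&lt;p&gt;A rough way to see the undershoot is to compare a fixed-rate forecast with a
path whose rate keeps climbing. The sketch below starts from the 15.5 per year
figure cited above. The acceleration value and both function names are
illustrative assumptions, not Epoch AI numbers.&lt;/p&gt;
&lt;pre class=&quot;language-python&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;# Illustrative sketch: BASE_RATE is the figure quoted in the article,
# ACCEL is a hypothetical value chosen only to show the shape of the error.
BASE_RATE = 15.5  # index points gained per year at the start of the forecast
ACCEL = 2.0       # assumed yearly increase in that rate (not a measured number)


def constant_forecast(years):
    # Gain predicted when the starting rate is held fixed.
    return BASE_RATE * years


def accelerating_path(years):
    # Gain when the rate climbs linearly: integral of BASE_RATE + ACCEL * t.
    return BASE_RATE * years + 0.5 * ACCEL * years ** 2


# Shortfall of the constant-rate forecast after 1 to 5 years.
shortfalls = [accelerating_path(y) - constant_forecast(y) for y in range(1, 6)]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Under these assumptions the shortfall grows with the square of the horizon, so
a plan that is slightly low after one year is badly low after five.&lt;/p&gt;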
&lt;h2 id=&quot;how-to-plan-against-a-moving-rate&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/capability-acceleration/#how-to-plan-against-a-moving-rate&quot;&gt;&lt;span&gt;How to plan against a moving rate&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Do not anchor on a plateau.&lt;/strong&gt; The expected ceiling has not shown up in the
data, a caution we raised in
&lt;a href=&quot;https://data-today.net/scaling-curve-extrapolate/&quot;&gt;the scaling curve nobody wants to extrapolate&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shorten your horizon.&lt;/strong&gt; Replan capability assumptions every two quarters, not
every two years.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Separate capability from value.&lt;/strong&gt; Faster models do not automatically mean
faster returns, as the &lt;a href=&quot;https://data-today.net/agentic-cancellation-cliff/&quot;&gt;agentic cancellations&lt;/a&gt;
show.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Betting on a slowdown has been the losing trade for two years running. The
capability growth estimates come from &lt;a href=&quot;https://epoch.ai/trends&quot;&gt;Epoch AI&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;why-a-faster-rate-changes-product-risk&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/capability-acceleration/#why-a-faster-rate-changes-product-risk&quot;&gt;&lt;span&gt;Why a faster rate changes product risk&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;An accelerating capability curve changes the risk of every long product bet. If
models improve at a steady pace, a roadmap can assume that today&#39;s hard problems
will become easier in a predictable sequence. If the rate itself rises, the
ordering can change. Features that seemed impossible at planning time can become
ordinary before the project ships, while safeguards designed around older model
limits can become stale.&lt;/p&gt;
&lt;p&gt;That matters most for teams building around absence. Some products are viable
because models cannot yet perform a task cheaply, run locally, or reason across
a large enough context. A faster curve shortens the life of those assumptions.
A compliance tool built around document summarization, for example, may face
new competition once long-context models can ingest the full source file. A
research workflow built around manual synthesis may look different once models
handle larger evidence sets with lower error rates.&lt;/p&gt;
&lt;p&gt;The planning mistake is to treat capability as a background variable. It should
be a line item in the roadmap. Teams need to ask which current product choices
depend on model limits and how those choices will age if the next release is
better than expected. The answer may change pricing, hiring, integration depth,
or the decision to build a feature at all.&lt;/p&gt;
&lt;h2 id=&quot;the-data-is-strong-but-the-interpretation-is-narrow&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/capability-acceleration/#the-data-is-strong-but-the-interpretation-is-narrow&quot;&gt;&lt;span&gt;The data is strong, but the interpretation is narrow&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Epoch&#39;s ECI measure is useful because it tries to summarize the capability of
frontier systems over time. It does not mean every product sees a 15.5-point
annual improvement in value. Model progress arrives unevenly. Coding, visual
reasoning, tool use, long-context recall, and scientific problem solving can
move at different speeds. A broad index is a climate reading, not a weather
forecast for each application.&lt;/p&gt;
&lt;p&gt;That caveat matters because business cases often translate capability into
revenue too quickly. A model can be more capable and still fail a workflow if it
is too slow, too expensive, too hard to audit, or too unreliable in rare cases.
The acceleration raises the ceiling. It does not remove the need to test the
floor.&lt;/p&gt;
&lt;p&gt;The best interpretation is probabilistic. A faster frontier increases the odds
that currently marginal applications become practical sooner than expected. It
also increases the odds that an internal build will be overtaken by a commodity
model before the investment pays back. Both can be true for the same company.&lt;/p&gt;
&lt;h2 id=&quot;how-to-update-forecasts-without-chasing-noise&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/capability-acceleration/#how-to-update-forecasts-without-chasing-noise&quot;&gt;&lt;span&gt;How to update forecasts without chasing noise&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Roadmaps should separate three clocks: frontier capability, deployable
capability, and organizational adoption. Frontier capability is the first demo
that proves a task can be done. Deployable capability is the point where it is
cheap, fast, and stable enough for normal users. Adoption is the slower process
of changing workflows, training staff, and adjusting controls. Acceleration at
the frontier pulls on all three clocks, but it does not make them identical.&lt;/p&gt;
&lt;p&gt;A practical forecast can use scenarios instead of a single date. The base case
assumes the recent rate continues. The upside case assumes another post-2024
speed-up. The downside case assumes infrastructure, data, or evaluation limits
slow the curve. Each scenario should name the product decisions that would
change if it became true.&lt;/p&gt;
&lt;p&gt;The important habit is cadence. Updating assumptions twice a year is enough for
most teams and avoids reacting to every launch thread. A capability curve that
keeps bending upward rewards vigilance, not panic. The organizations that adapt
well will be the ones that treat model progress as a measurable input rather
than a surprise.&lt;/p&gt;
&lt;p&gt;The simplest worksheet is a dependency list. For each product line, write down
which tasks are blocked by model quality, which are blocked by cost, and which
are blocked by trust. Then revisit the list after each major frontier release.
That exercise turns a vague acceleration story into concrete decisions about
what to build, what to buy, and what to postpone.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Vibe coding is loud online and rare in real codebases</title>
    <link href="https://data-today.net/vibecoding-to-production/" />
    <updated>2026-06-02T00:00:00Z</updated>
    <id>https://data-today.net/vibecoding-to-production/</id>
    <content type="html">&lt;p&gt;Vibe coding is the phrase of the year and the practice almost nobody admits to.
In the Stack Overflow 2025 Developer Survey, 72% of respondents said vibe coding
is not part of their professional work, and another 5% rejected it emphatically.
The headline trend is real, but it lives on social feeds more than in commits.&lt;/p&gt;
&lt;p&gt;That gap matters because adoption of AI tooling itself is not in doubt. The same
survey found 84% of developers use or plan to use AI tools, up from 76% a year
earlier, and 51% of professionals reach for them daily. The disagreement is not
about whether to use AI. It is about how much to trust it.&lt;/p&gt;
&lt;h2 id=&quot;the-trust-gap-is-widening&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/vibecoding-to-production/#the-trust-gap-is-widening&quot;&gt;&lt;span&gt;The trust gap is widening&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Sentiment actually fell as usage rose. Favorable views of AI tools dropped from
above 70% in 2023 and 2024 to about 60% in 2025. More developers now distrust
the accuracy of AI output (46%) than trust it (33%), and only 3% say they highly
trust it. Experienced developers are the most skeptical, which fits people who
carry accountability for what ships.&lt;/p&gt;
&lt;h2 id=&quot;almost-right-is-the-expensive-part&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/vibecoding-to-production/#almost-right-is-the-expensive-part&quot;&gt;&lt;span&gt;Almost right is the expensive part&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The single most cited frustration, listed in the table below, is &amp;quot;AI solutions
that are almost right, but not quite,&amp;quot; reported by 66% of developers. The
knock-on effect lands second: 45% say debugging AI-generated code costs them
more time than writing it themselves would have.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Frustration with AI tools&lt;/th&gt;
&lt;th style=&quot;text-align:right&quot;&gt;Share of developers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Almost right, but not quite&lt;/td&gt;
&lt;td style=&quot;text-align:right&quot;&gt;66%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debugging AI code takes longer&lt;/td&gt;
&lt;td style=&quot;text-align:right&quot;&gt;45%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Do not use AI tools regularly&lt;/td&gt;
&lt;td style=&quot;text-align:right&quot;&gt;24%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Less confident in my own problem solving&lt;/td&gt;
&lt;td style=&quot;text-align:right&quot;&gt;20%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hard to understand how the code works&lt;/td&gt;
&lt;td style=&quot;text-align:right&quot;&gt;16%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&quot;what-developers-still-want-a-human-for&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/vibecoding-to-production/#what-developers-still-want-a-human-for&quot;&gt;&lt;span&gt;What developers still want a human for&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Asked what they would do in a future where AI handles most coding, 75% said they
would still ask a person &amp;quot;when I don&#39;t trust AI&#39;s answers.&amp;quot; Human review is not a
nostalgia item. It is the control that makes fast generation safe to merge, the
same pattern we saw in &lt;a href=&quot;https://data-today.net/agents-working-too-hard/&quot;&gt;agents that need a hard stop&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The takeaway is narrow and practical. Vibe coding is a fine way to explore. It is
a poor way to ship without a reviewer who reads every line. For the full
breakdown, see the &lt;a href=&quot;https://survey.stackoverflow.co/2025/ai&quot;&gt;Stack Overflow 2025 survey&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;why-professionals-draw-the-line-at-production&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/vibecoding-to-production/#why-professionals-draw-the-line-at-production&quot;&gt;&lt;span&gt;Why professionals draw the line at production&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Professional developers are paid for the behavior of the system after the pull
request merges. That accountability changes how AI-generated code feels. A demo
can tolerate a missing edge case. A production service has users, data,
security boundaries, uptime targets, and future maintainers. Code that is almost
right can still create an incident.&lt;/p&gt;
&lt;p&gt;This is why experienced developers are often more skeptical than beginners. They
have seen bugs hide in migrations, retries, permission checks, date handling,
and error paths. AI tools are good at producing plausible main paths. They are
less reliable at knowing which local constraint makes the main path unsafe.&lt;/p&gt;
&lt;p&gt;Vibe coding also weakens shared understanding if the author cannot explain the
change. A team can maintain code it understands. It struggles with code that
arrived as a large generated patch and no clear design rationale. The cost shows
up later, when someone has to debug or extend it.&lt;/p&gt;
&lt;h2 id=&quot;where-ai-coding-does-work&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/vibecoding-to-production/#where-ai-coding-does-work&quot;&gt;&lt;span&gt;Where AI coding does work&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The survey should not be read as a rejection of AI coding tools. Developers use
them because they help. They are useful for boilerplate, test scaffolds, small
refactors, translation between APIs, documentation drafts, regular expressions,
and first-pass explanations of unfamiliar code. Those tasks are bounded and easy
to verify.&lt;/p&gt;
&lt;p&gt;The success pattern is human-led. The developer decides the design, constrains
the task, inspects the diff, runs the tests, and owns the result. AI shortens
the path through known work. It becomes risky when it is asked to choose the
path, write the change, and prove the change without enough context.&lt;/p&gt;
&lt;p&gt;That distinction explains the adoption data. Daily AI use can rise while vibe
coding remains rare because many developers have found a middle ground. They use
AI as an assistant, not as an unchecked author of production behavior.&lt;/p&gt;
&lt;h2 id=&quot;how-teams-can-make-generated-code-reviewable&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/vibecoding-to-production/#how-teams-can-make-generated-code-reviewable&quot;&gt;&lt;span&gt;How teams can make generated code reviewable&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Teams that want AI coding benefits should design for review. Keep generated
changes small. Require a clear problem statement, a list of touched files, and a
validation command. Ask the tool to preserve local conventions and avoid broad
refactors unless the task explicitly requires them. Those rules make the output
look more like normal engineering work.&lt;/p&gt;
&lt;p&gt;Reviewers should focus on boundaries: inputs, permissions, errors, migrations,
concurrency, observability, and rollback. These are the places where plausible
code often fails. Unit tests help, but production confidence usually needs a
slice of integration coverage or a focused manual check as well.&lt;/p&gt;
&lt;p&gt;Organizations can also track whether AI-generated changes take longer to review
or produce more follow-up fixes. That data is more useful than counting lines
generated. The goal is not to maximize AI output. The goal is to reduce time to
safe, understandable, maintainable code.&lt;/p&gt;
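&lt;p&gt;One possible shape for that tracking is a small comparison of review time and
follow-up fixes by change origin. The records below are invented placeholders,
and the tuple layout and function name are assumptions. A real team would pull
these fields from its own review tooling.&lt;/p&gt;
&lt;pre class=&quot;language-python&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;from statistics import mean

# Hypothetical records: (ai_generated, review_hours, follow_up_fixes).
# These values are placeholders for illustration, not survey data.
changes = [
    (True, 3.0, 2),
    (True, 1.5, 0),
    (False, 1.0, 0),
    (False, 2.0, 1),
]


def summarize(records, ai_generated):
    # Average review hours and follow-up fixes for one origin of change.
    rows = [r for r in records if r[0] == ai_generated]
    return mean(r[1] for r in rows), mean(r[2] for r in rows)


ai_review, ai_fixes = summarize(changes, True)
human_review, human_fixes = summarize(changes, False)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Averages like these, watched over a quarter, say more about review burden
than any count of generated lines.&lt;/p&gt;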
&lt;h2 id=&quot;the-cultural-signal&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/vibecoding-to-production/#the-cultural-signal&quot;&gt;&lt;span&gt;The cultural signal&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Vibe coding became loud because it captures a real feeling: software can now be
produced by describing intent. The professional backlash is also real because
software engineering has never been only production of text. It is the work of
making systems reliable under constraints.&lt;/p&gt;
&lt;p&gt;The interesting future is not a binary choice between hand-written code and
fully delegated code. It is a workflow where generation is cheap, review is
explicit, and accountability remains clear. That is less viral than the phrase,
but it is much closer to how production teams adopt tools that last.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Unit distance proof moves AI past clever math demos</title>
    <link href="https://data-today.net/unit-distance-openai-proof/" />
    <updated>2026-06-03T00:00:00Z</updated>
    <id>https://data-today.net/unit-distance-openai-proof/</id>
    <content type="html">&lt;p&gt;The &lt;strong&gt;unit distance proof&lt;/strong&gt; is the first AI math result that does not belong in the demo-theater pile: OpenAI says a general purpose reasoning model broke an 80 year belief in discrete geometry, and Will Sawin has already made the improvement explicit at &lt;strong&gt;n^1.014&lt;/strong&gt; unit distance pairs.&lt;/p&gt;
&lt;p&gt;That number is small enough to look silly in a pitch deck. It is large enough to kill a conjecture.&lt;/p&gt;
&lt;h2 id=&quot;the-exponent-that-broke-an-80-year-habit&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/unit-distance-openai-proof/#the-exponent-that-broke-an-80-year-habit&quot;&gt;&lt;span&gt;The exponent that broke an 80 year habit&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The problem is almost offensively simple. Put n points in the plane. Count how many pairs sit exactly distance 1 apart. Paul Erdős posed the planar unit distance problem in 1946, and the question has lived in the uncomfortable zone where a child can understand the statement and a field can fail to close the bounds for decades. OpenAI&#39;s &lt;a href=&quot;https://openai.com/index/model-disproves-discrete-geometry-conjecture/&quot;&gt;May 20, 2026 announcement&lt;/a&gt; says an internal model produced an infinite family of point sets that beat the long suspected n^(1+o(1)) ceiling.&lt;/p&gt;
&lt;p&gt;The old mental model was grid shaped. A line gives n minus 1 unit pairs. A square grid gives about 2n. Erdős had a more delicate construction from a rescaled grid with growth of the form n^(1+C/log log n), which is technically superlinear but asymptotically keeps drifting back toward exponent 1. OpenAI&#39;s writeup says the new construction gives n^(1+delta) unit distance pairs for infinitely many n, with delta fixed above 0, not evaporating as n grows.&lt;/p&gt;
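&lt;p&gt;The two baseline counts are easy to check by brute force. The sketch below only illustrates the problem statement, with hypothetical helper names, and says nothing about the new construction. It counts unit distance pairs for 100 points on a line and for a 10 by 10 square grid.&lt;/p&gt;
&lt;pre class=&quot;language-python&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;from itertools import combinations


def unit_pairs(points):
    # Count unordered pairs whose squared distance is exactly 1.
    # Integer coordinates keep the comparison exact.
    return sum(
        1
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if (x1 - x2) ** 2 + (y1 - y2) ** 2 == 1
    )


def line(n):
    # n points spaced one unit apart on a horizontal line.
    return [(i, 0) for i in range(n)]


def square_grid(side):
    # side * side points on the integer lattice.
    return [(i, j) for i in range(side) for j in range(side)]


line_count = unit_pairs(line(100))        # 99, which is n - 1
grid_count = unit_pairs(square_grid(10))  # 180 for n = 100
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The grid count of 180 is 2n minus 2 times the square root of n, which is why the square grid approaches 2n as n grows.&lt;/p&gt;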
&lt;p&gt;The chart attached to this article uses the cleanest public version of the gap. Sawin&#39;s &lt;a href=&quot;https://arxiv.org/abs/2605.20579&quot;&gt;arXiv paper submitted May 20, 2026&lt;/a&gt; proves that arbitrarily large point sets can contain more than n^1.014 pairs at distance exactly 1. The best known general upper bound remains O(n^(4/3)), a result MathWorld attributes to Spencer, Szemerédi, and Trotter in its &lt;a href=&quot;https://mathworld.wolfram.com/ErdosUnitDistanceProblem.html&quot;&gt;unit distance problem summary&lt;/a&gt;. So the field did not suddenly learn the answer. It learned that the old answer cannot be right.&lt;/p&gt;
&lt;p&gt;That is the first thing to keep straight. This is a disproof by construction, not a final asymptotic formula. The new lower exponent is about 1.014. The upper exponent is about 1.333. There is still a canyon between them.&lt;/p&gt;
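&lt;p&gt;To get a feel for the size of that canyon, one can compare the two growth rates directly. This sketch ignores the constants hidden in both bounds, so its outputs are orders of magnitude rather than actual pair counts, and the function name is made up for illustration.&lt;/p&gt;
&lt;pre class=&quot;language-python&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;LOWER_EXP = 1.014  # explicit lower exponent quoted in the article
UPPER_EXP = 4 / 3  # exponent of the best known general upper bound


def gap_factor(n):
    # Ratio of n ** UPPER_EXP to n ** LOWER_EXP, constants ignored.
    return n ** (UPPER_EXP - LOWER_EXP)


# Gap at a thousand, a million, and a billion points.
factors = [gap_factor(10 ** k) for k in (3, 6, 9)]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the exponent difference is about 0.32, the ratio grows roughly ninefold for every thousandfold increase in n.&lt;/p&gt;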
&lt;p&gt;The second thing is stranger. The proof did not come from a hand built geometry searcher. OpenAI says the model was general purpose, not trained specifically for mathematics, not scaffolded to search proof strategies, and not aimed at this one problem. The companion note by Noga Alon, Thomas Bloom, W. T. Gowers, Daniel Litt, Will Sawin, Arul Shankar, Jacob Tsimerman, Victor Wang, and Melanie Matchett Wood describes a &lt;a href=&quot;https://arxiv.org/abs/2605.20695&quot;&gt;human verified version&lt;/a&gt; of the counterexample and says the argument relies on ideas connected to Ellenberg and Venkatesh, Golod-Shafarevich theory, and class field towers.&lt;/p&gt;
&lt;p&gt;That is not garnish. The surprise is that deep algebraic number theory showed up inside an elementary Euclidean question. The old construction can be seen through Gaussian integers, numbers of the form a+bi. The new one pushes into more complicated number fields with richer splitting behavior, then uses those arithmetic symmetries to manufacture many unit length differences after projection into the plane.&lt;/p&gt;
&lt;p&gt;Tim Gowers called it &amp;quot;a milestone in AI mathematics&amp;quot; in the companion discussion, and that line is doing real work. The milestone is not that an LLM wrote plausible math. We have too much of that already, usually wearing a confident smile and carrying a fake lemma. The milestone is that a model found a construction which survived expert scrutiny by a nine author group.&lt;/p&gt;
&lt;h2 id=&quot;your-research-agent-just-got-a-harsher-performance-review&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/unit-distance-openai-proof/#your-research-agent-just-got-a-harsher-performance-review&quot;&gt;&lt;span&gt;Your research agent just got a harsher performance review&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;If you build with AI, the important part is not discrete geometry. Unless your product roadmap contains extremal graph theory, the unit distance problem will not change your API this quarter.&lt;/p&gt;
&lt;p&gt;The important part is the capability shape. A model crossed from explaining existing math into producing a new path through a long lived problem, and the path was weird in a productive way. That matters because many valuable problems in software, biotech, finance, and manufacturing look less like school exams and more like this: the statement is compact, the solution space is huge, and the useful move is to import machinery from a place your team would not have searched first.&lt;/p&gt;
&lt;p&gt;This also raises the bar for what we should call an agentic research win. A chatbot that summarizes 12 papers is table stakes. A coding agent that fixes a narrow bug is useful, but it is not frontier research. The unit distance proof says the more interesting unit of work is a verifiable conjecture loop:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;propose a nonobvious construction or mechanism&lt;/li&gt;
&lt;li&gt;reduce it to checkable claims&lt;/li&gt;
&lt;li&gt;route it through expert or formal verification&lt;/li&gt;
&lt;li&gt;turn the rough result into a readable artifact&lt;/li&gt;
&lt;li&gt;expose the remaining gap rather than pretending the gap vanished&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That loop is expensive. It is also the part most AI product demos skip. In our earlier piece on why &lt;a href=&quot;https://data-today.net/agents-working-too-hard/&quot;&gt;your agents are working too hard&lt;/a&gt;, the point was that many agent stacks burn compute on activity rather than discriminating work. This result points in the opposite direction: spend serious compute where a correct output compounds, and demand a verification path before you let the model touch the steering wheel.&lt;/p&gt;
&lt;p&gt;For a builder, the practical consequences are concrete.&lt;/p&gt;
&lt;p&gt;First, roadmap language needs to change. Do not sell &amp;quot;AI scientist&amp;quot; as a magic colleague that ships finished truth. Sell constrained discovery systems that generate candidates under a verification budget. In math, the verifier can be a small circle of specialists, a proof assistant, or both. In drug discovery, it might be assay design. In chip design, simulation and layout checks. In software, tests are the easy part and product semantics are the hard part.&lt;/p&gt;
&lt;p&gt;Second, hiring tilts toward people who can interrogate model output. OpenAI&#39;s announcement says the proof was checked by external mathematicians, and the companion paper explicitly describes the result as digested and simplified by humans. That means expertise did not get cheaper in the way managers like to imagine. It became a higher leverage bottleneck. One excellent reviewer with domain taste may now supervise 50 candidate ideas, but the review function did not disappear.&lt;/p&gt;
&lt;p&gt;Third, your moat is less likely to be prompt craft. If general purpose models can occasionally jump domains, the defensible asset becomes the problem portfolio, the data exhaust from failed attempts, the verification harness, and the institutional memory that says which crazy looking path is merely crazy and which is worth two weeks.&lt;/p&gt;
&lt;p&gt;There is a cost story too. Test time compute is not a rounding error when the goal is long horizon reasoning. OpenAI says it investigated success rates with varying amounts of test time compute after verifying the initial proof, although the public text does not expose enough numeric detail to price the search. That absence matters. If a discovery requires thousands of expensive attempts and a rare expert audit, it may still be a bargain for a theorem or molecule and a terrible deal for an internal dashboard feature.&lt;/p&gt;
&lt;p&gt;So yes, update your priors. No, do not update them to &amp;quot;replace the research team.&amp;quot;&lt;/p&gt;
&lt;h2 id=&quot;what-deserves-funding-and-what-does-not&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/unit-distance-openai-proof/#what-deserves-funding-and-what-does-not&quot;&gt;&lt;span&gt;What deserves funding, and what does not&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Verification infrastructure deserves the money long before another wrapper that asks a model to &amp;quot;think like a Nobel laureate.&amp;quot;&lt;/p&gt;
&lt;p&gt;The unit distance result is compelling because the claim is crisp. A point set has a number of unit distance pairs. A proof either establishes the asymptotic lower bound or it does not. The paper trail is public enough to inspect: OpenAI&#39;s &lt;a href=&quot;https://cdn.openai.com/pdf/74c24085-19b0-4534-9c90-465b8e29ad73/unit-distance-proof.pdf&quot;&gt;original proof PDF&lt;/a&gt;, the external companion note, and Sawin&#39;s explicit refinement are all available. That is a healthier pattern than a private benchmark screenshot with a victory lap attached.&lt;/p&gt;
&lt;p&gt;But the caveats are not small.&lt;/p&gt;
&lt;p&gt;One caveat is autonomy. OpenAI says the proof was produced by an internal model, and the companion note says the proof presented there is a human digested and somewhat generalized version. That is exactly how important AI research will often look for a while. The model finds the seam. Humans clean the cut, test the edge cases, and explain why anyone should care. Calling that fake autonomy misses the point. Calling it full automation also misses the point.&lt;/p&gt;
&lt;p&gt;A second caveat is selection. OpenAI evaluated the model on a collection of Erdős problems, and this one worked. We need to know the denominator. How many problems were attempted? How many convincing false trails did the model produce? How much human judgment went into choosing the problem statement and recognizing the output as promising? Without that denominator, you should treat this as a major existence proof, not a general productivity estimate.&lt;/p&gt;
&lt;p&gt;A third caveat is transfer. Mathematics is unusually kind to AI evaluation because correctness can be checked with high precision. Many commercial research problems are messier. A model can propose a new pricing strategy, growth loop, or materials recipe, but the verifier may be a market, a wet lab, or a 6 month deployment. The longer and noisier the feedback loop, the easier it is to confuse novelty with progress.&lt;/p&gt;
&lt;p&gt;Here is the bet worth making for 2026 and 2027: AI research systems will first become valuable in domains where candidate generation is cheap, negative feedback is fast, and verification has teeth. That includes theorem exploration, code optimization, compiler passes, formal methods, circuit search, and some simulation heavy engineering. It includes less of the vague strategy work that fills slide decks. The model can be brilliant at long shots and still terrible at knowing which long shots your customers will pay for.&lt;/p&gt;
&lt;p&gt;Do not bet on a near term flood of solved famous problems without a verification bottleneck. Gowers&#39;s own reflection in the companion paper is telling. He found a counterexample less alarming than a proof of the conjectured upper bound because counterexamples can sometimes come from one surprising construction, while an upper bound may require a new structural theory. That distinction matters for product planning. A model can be the rare prodigy that spots a single hidden shortcut yet still cannot lay down the foundations a whole theory needs.&lt;/p&gt;
&lt;p&gt;The next serious questions are measurable. Can the approach reproduce across 10 or 50 open problems? Can labs publish attempt logs without leaking everything useful? Can proof assistants absorb more of the checking load? Can a model explain its construction well enough that specialists improve it within days, as Sawin did with the 1.014 exponent?&lt;/p&gt;
&lt;p&gt;That last question is underrated. The best version of AI assisted research is not a sealed oracle. It is a system that hands humans a live wire and enough insulation to use it.&lt;/p&gt;
&lt;h2 id=&quot;the-new-moat-is-asking-the-worthwhile-question&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/unit-distance-openai-proof/#the-new-moat-is-asking-the-worthwhile-question&quot;&gt;&lt;span&gt;The new moat is asking the worthwhile question&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The unit distance proof should make builders less impressed by fluent answers and more interested in auditable surprises.&lt;/p&gt;
&lt;p&gt;A model that can connect algebraic number fields to a 1946 geometry problem is not just a better autocomplete box. It is also not a replacement for the people who know when a proof has actually earned the word. The teams that win from this shift will be the ones that pair models with hard problems, sharp verifiers, and the patience to throw away 99 plausible ideas.&lt;/p&gt;
&lt;p&gt;The cheap future is AI that talks like a researcher. The valuable future is AI that gives a real researcher something uncomfortable and correct to check.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Google AI opt out gives publishers a real lever now</title>
    <link href="https://data-today.net/google-ai-opt-out-lever/" />
    <updated>2026-06-03T00:00:00Z</updated>
    <id>https://data-today.net/google-ai-opt-out-lever/</id>
    <content type="html">&lt;p&gt;Google just made the publisher bargain explicit: feed the answer machine, or step away from the most valuable shelf in search.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;Google AI opt out&lt;/strong&gt; that UK regulators forced into existence is a small Search Console control with a large commercial message. On June 3, 2026, the UK Competition and Markets Authority said Google must give publishers effective tools to keep their content out of generative AI Search features, including AI Overviews, and to refuse use of that content for AI model fine tuning. Both obligations sit in a new publisher conduct requirement under the UK digital markets regime. The key number is &lt;strong&gt;nine months&lt;/strong&gt;: Google has that long to implement all changes, with important parts expected sooner, and it must file compliance reports every six months in the first year.&lt;/p&gt;
&lt;p&gt;This matters because AI Search is no longer a lab feature sitting politely beside the web. Google says AI Overviews has more than 2.5 billion monthly active users and AI Mode has passed 1 billion monthly users, figures it included in its own June 3 announcement of new website owner controls. If your acquisition model depends on search, the question is no longer whether AI answers affect you. It is whether you can measure the effect, negotiate around it, and decide where your content should or should not appear.&lt;/p&gt;
&lt;p&gt;We covered &lt;a href=&quot;https://data-today.net/google-ai-opt-out-data/&quot;&gt;the data behind this fight&lt;/a&gt; when the click numbers first surfaced. The UK ruling gives it teeth.&lt;/p&gt;
&lt;h2 id=&quot;what-exactly-did-the-cma-force-google-to-change&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/google-ai-opt-out-lever/#what-exactly-did-the-cma-force-google-to-change&quot;&gt;&lt;span&gt;What exactly did the CMA force Google to change?&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The CMA did three practical things, and each one targets a different part of the publisher complaint.&lt;/p&gt;
&lt;p&gt;First, it told Google to provide publishers with effective controls over how their search content is used in generative AI. The official &lt;a href=&quot;https://www.gov.uk/find-digital-markets-measures/google-search-publisher-conduct-requirement&quot;&gt;publisher conduct requirement&lt;/a&gt; says Google must give publishers controls, clear explanations of how content is used, and detailed metrics on engagement, and must take reasonable steps to ensure clear and accurate attribution in search generative AI features. That is broader than a robots.txt tweak. It is a regulatory demand for product surface area.&lt;/p&gt;
&lt;p&gt;Second, the CMA added a fine tuning opt out. In plain English, publishers can say no to their content being used to improve Google AI models, not just no to appearing in an AI answer. In its &lt;a href=&quot;https://www.gov.uk/government/news/cma-secures-fairer-deal-for-publishers-and-improves-google-search-services-in-uk&quot;&gt;June 3 press release&lt;/a&gt;, the CMA said the change followed consultation feedback and gives publishers control over the full range of AI uses of their content.&lt;/p&gt;
&lt;p&gt;Third, the regulator made attribution and metrics part of the deal. That sounds soft until you run a media business. If an AI answer uses your reporting but the interface buries the link, the economics break. If you cannot see which pages appear, in which countries, and with what engagement, you cannot price licensing, justify content spend, or decide whether the exposure is worth the cannibalization.&lt;/p&gt;
&lt;p&gt;Google’s response is to roll out controls as a product, not just as a UK compliance patch. In its &lt;a href=&quot;https://blog.google/products-and-platforms/products/search/new-controls-website-owners/&quot;&gt;new controls announcement&lt;/a&gt;, Google said it is testing a new Search Console toggle that lets site owners decide whether their links and content can appear in and ground generative AI Search features such as AI Overviews, AI Mode, and AI Overviews in Discover, and that sites opting out will not receive AI feature traffic or impressions. Google also says this control will not be used as a ranking signal outside those generative AI Search features.&lt;/p&gt;
&lt;p&gt;That last sentence will get quoted in a thousand SEO decks by Friday. It should. The fear was simple: if a publisher blocks AI answers, will regular blue link search punish it? Google says no.&lt;/p&gt;
&lt;p&gt;The catch is equally simple: if you opt out, you disappear from the AI layer itself. For informational queries, that layer is increasingly the page.&lt;/p&gt;
&lt;h2 id=&quot;why-is-this-bigger-than-another-search-console-toggle&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/google-ai-opt-out-lever/#why-is-this-bigger-than-another-search-console-toggle&quot;&gt;&lt;span&gt;Why is this bigger than another Search Console toggle?&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Because the toggle changes the negotiating position.&lt;/p&gt;
&lt;p&gt;Before this ruling, many publishers faced a bad default: Google could use their work to answer the user, the user might not click, and the publisher had to keep participating because Google remained the gateway. The CMA confirmed that gateway power when it designated Google with strategic market status in October 2025, saying in its &lt;a href=&quot;https://www.gov.uk/government/news/cma-confirms-google-has-strategic-market-status-in-search-services&quot;&gt;SMS decision announcement&lt;/a&gt; that more than 90% of UK searches take place on Google.&lt;/p&gt;
&lt;p&gt;That market share is why this is a regulator story, not just a product story. If a smaller AI search startup scraped, summarized, and linked lightly, publishers could block it and move on. Google is different. For many sites, leaving Google’s AI layer could mean losing visibility at the top of the results page while competitors stay inside the box.&lt;/p&gt;
&lt;p&gt;The traffic data explains the anxiety. Pew Research Center analyzed browsing activity from 900 US adults in March 2025 and found that users clicked a traditional result on 8% of visits when a Google AI summary appeared, versus 15% when no AI summary appeared, according to Pew’s &lt;a href=&quot;https://www.pewresearch.org/short-reads/2025/07/22/google-users-are-less-likely-to-click-on-links-when-an-ai-summary-appears-in-the-results/&quot;&gt;behavioral study of Google AI summaries&lt;/a&gt;. Links inside the AI summary got clicked on 1% of visits.&lt;/p&gt;
&lt;p&gt;That is the chart for this article. It is not perfect, because one study cannot describe every query class, country, or interface variant. But it captures the publisher problem with brutal clarity: the answer box can create visibility without a visit.&lt;/p&gt;
&lt;p&gt;A separate 2026 research paper by Haofei Xu, Umar Iqbal, and Jacob M. Montgomery measured 55,393 trending queries from March 13 to April 21, 2026 and found AI Overviews activated on 13.7% of queries overall and 64.7% of question form queries, according to the authors’ &lt;a href=&quot;https://arxiv.org/abs/2605.14021&quot;&gt;arXiv paper&lt;/a&gt;. They also found that 11.0% of atomic claims in AI Overview responses were unsupported by cited pages.&lt;/p&gt;
&lt;p&gt;That second number matters for builders beyond publishing. If you are using retrieval, citation, or answer generation in your own product, this is the same failure pattern at platform scale: citations create the feeling of accountability, but they do not guarantee that the answer actually follows from the cited page.&lt;/p&gt;
&lt;p&gt;So the UK rule is not a magic traffic repair kit. It is a forced API for consent, attribution, and measurement inside the dominant answer interface.&lt;/p&gt;
&lt;h2 id=&quot;why-should-builders-and-operators-care-if-they-are-not-publishers&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/google-ai-opt-out-lever/#why-should-builders-and-operators-care-if-they-are-not-publishers&quot;&gt;&lt;span&gt;Why should builders and operators care if they are not publishers?&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Because this is the first serious preview of how AI distribution gets priced.&lt;/p&gt;
&lt;p&gt;If you run a SaaS company, a marketplace, a technical documentation site, a local service directory, or a developer tool, your content is already part of someone’s answer corpus. You may not call yourself a publisher, but your docs, comparisons, tutorials, support pages, pricing pages, and changelogs are doing publisher work. They attract intent. They resolve uncertainty. They feed models.&lt;/p&gt;
&lt;p&gt;The Google AI opt out forces a planning choice that most teams have postponed: which content exists to be read on your site, and which content exists to be quoted by machines?&lt;/p&gt;
&lt;p&gt;Here is the useful split:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Content type&lt;/th&gt;
&lt;th style=&quot;text-align:right&quot;&gt;Default AI Search posture&lt;/th&gt;
&lt;th&gt;Business risk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Commodity explainers&lt;/td&gt;
&lt;td style=&quot;text-align:right&quot;&gt;Stay eligible&lt;/td&gt;
&lt;td&gt;Users may get the answer without visiting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Original data or reporting&lt;/td&gt;
&lt;td style=&quot;text-align:right&quot;&gt;Test opt out or licensing&lt;/td&gt;
&lt;td&gt;High value can be extracted without payment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Product docs&lt;/td&gt;
&lt;td style=&quot;text-align:right&quot;&gt;Stay eligible with measurement&lt;/td&gt;
&lt;td&gt;Wrong snippets create support load&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pricing and comparison pages&lt;/td&gt;
&lt;td style=&quot;text-align:right&quot;&gt;Monitor aggressively&lt;/td&gt;
&lt;td&gt;AI summaries can distort conversion intent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Community content&lt;/td&gt;
&lt;td style=&quot;text-align:right&quot;&gt;Case by case&lt;/td&gt;
&lt;td&gt;Attribution and consent get messy fast&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For developers, the new Search Console reporting may be the underrated part. Google says the new insights include impression metrics and information about which pages appear in AI responses and in what countries. That creates a new analytics event class: AI answer exposure. It is not the same as a pageview. It is not the same as a ranking. It is a distribution signal that sits between brand impression and referral.&lt;/p&gt;
&lt;p&gt;You should wire it into your planning like a separate channel.&lt;/p&gt;
&lt;p&gt;For business owners, the control creates negotiation evidence. A publisher that can show 2 million AI Search impressions, weak click-through, and high reuse of original reporting has a stronger case for a content deal than a publisher waving at a vague fairness problem. The CMA said in its June 3 statement that the requirement should put news organizations in a stronger position to negotiate content deals with Google. That is exactly the point: measurement turns complaint into invoice.&lt;/p&gt;
&lt;p&gt;For product teams, this also changes roadmap math. If your growth team has been treating SEO as a durable moat, you need a second distribution plan. Search traffic is fragmenting across AI answers, direct subscriptions, social surfaces, app notifications, and community channels. In a March 2026 &lt;a href=&quot;https://www.axios.com/2026/03/17/chartbeat-search-traffic-ai-chatbots&quot;&gt;publisher traffic report&lt;/a&gt;, Axios cited Chartbeat data showing that traditional search referral traffic declined 60% over two years for small publishers with 1,000 to 10,000 daily pageviews, compared with 22% for large publishers. Small brands are the ones most likely to discover that a top-of-funnel content strategy became training material for a larger platform.&lt;/p&gt;
&lt;p&gt;That does not mean block everything. It means stop treating visibility as payment.&lt;/p&gt;
&lt;h2 id=&quot;how-should-you-decide-whether-to-opt-out&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/google-ai-opt-out-lever/#how-should-you-decide-whether-to-opt-out&quot;&gt;&lt;span&gt;How should you decide whether to opt out?&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Start with the content’s job.&lt;/p&gt;
&lt;p&gt;If a page answers a simple pre purchase question and sends qualified visitors into your product, staying in AI Search may still make sense. If AI Overviews cite you accurately, show a clear link, and produce branded demand later, the lost click may not be fatal. Google is betting on this version of the world, and it says its generative AI Search features are designed to help people find and visit websites.&lt;/p&gt;
&lt;p&gt;If a page contains original reporting, proprietary benchmarks, paid research, or data that competitors cannot easily reproduce, the calculus changes. That content is your moat. If an AI answer can compress the valuable part into 6 bullets while leaving you with the cost of production, you should test restrictions, licensing, or delayed publication patterns.&lt;/p&gt;
&lt;p&gt;A practical decision process for the next 30 days:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Segment your content by economic role.&lt;/strong&gt; Separate acquisition pages, support docs, original research, news, community posts, and paid subscriber content.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Track AI exposure as its own funnel step.&lt;/strong&gt; Do not bury AI Search impressions inside classic search reporting once the new Search Console data reaches you.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compare answer exposure to downstream lift.&lt;/strong&gt; Look for branded searches, direct visits, newsletter signups, assisted conversions, and support tickets after AI visibility rises.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Create a policy for high value data.&lt;/strong&gt; Decide which reports, benchmarks, and datasets can be summarized freely and which require a commercial deal.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Review snippets for accuracy.&lt;/strong&gt; If AI answers cite your docs but create support confusion, that cost belongs in the channel P&amp;amp;L.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The wrong move is ideological purity. Blocking everything may protect value while making you invisible in a fast growing interface. Allowing everything may maximize reach while weakening the reason users needed you in the first place.&lt;/p&gt;
&lt;p&gt;The better move is portfolio management. Treat AI Search eligibility like syndication rights, not like a binary SEO setting.&lt;/p&gt;
&lt;h2 id=&quot;what-happens-if-this-spreads-beyond-the-uk&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/google-ai-opt-out-lever/#what-happens-if-this-spreads-beyond-the-uk&quot;&gt;&lt;span&gt;What happens if this spreads beyond the UK?&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The UK is acting first, but the pattern is portable.&lt;/p&gt;
&lt;p&gt;The CMA says the publisher requirement is the first conduct requirement imposed after Google’s strategic market status designation in general search. It also said it will announce further action related to Google’s search business in the coming weeks. That matters because regulators rarely copy exact wording, but they do copy working mechanisms: consent controls, attribution rules, reporting duties, and compliance cadence.&lt;/p&gt;
&lt;p&gt;Google is already thinking globally. Its June 3 blog says the new controls are beginning with a subset of UK website owners before a global rollout after testing. The company would rather ship one coherent control surface than maintain a UK only exception that becomes a compliance museum.&lt;/p&gt;
&lt;p&gt;The open question is how many publishers will use the opt out once it exists. If few do, Google gets to say the market chose participation. If many high quality publishers do, AI answers may rely more heavily on sites that accept the trade, including forums, low cost content farms, and brands willing to exchange content for exposure. That could lower answer quality or push Google toward more paid licensing.&lt;/p&gt;
&lt;p&gt;For builders, the safe bet is that consent and provenance become product requirements. If you are building an AI answer engine, a vertical search product, or an internal knowledge assistant, do not wait for a regulator to tell you that content owners want controls. Build the controls now: source level permissions, exclusion paths, audit logs, citation checks, and reporting that a non engineer can read.&lt;/p&gt;
&lt;p&gt;The web’s old bargain was messy but legible: crawl me, rank me, send me clicks. AI Search changes the middle verb. It can crawl, synthesize, and satisfy the user before the visit. The UK just told Google that publishers deserve a switch at that point in the chain.&lt;/p&gt;
&lt;p&gt;A switch is not power by itself. Power is knowing when to flip it.&lt;/p&gt;
&lt;h2 id=&quot;sources&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/google-ai-opt-out-lever/#sources&quot;&gt;&lt;span&gt;Sources&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.gov.uk/government/news/cma-secures-fairer-deal-for-publishers-and-improves-google-search-services-in-uk&quot;&gt;CMA secures fairer deal for publishers, gov.uk&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.gov.uk/find-digital-markets-measures/google-search-publisher-conduct-requirement&quot;&gt;Google Search publisher conduct requirement, gov.uk&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.gov.uk/government/news/cma-confirms-google-has-strategic-market-status-in-search-services&quot;&gt;CMA confirms Google strategic market status, gov.uk&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://blog.google/products-and-platforms/products/search/new-controls-website-owners/&quot;&gt;New controls for website owners, Google&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.pewresearch.org/short-reads/2025/07/22/google-users-are-less-likely-to-click-on-links-when-an-ai-summary-appears-in-the-results/&quot;&gt;Google users are less likely to click when an AI summary appears, Pew Research Center&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2605.14021&quot;&gt;AI Overviews measurement study, arXiv&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.axios.com/2026/03/17/chartbeat-search-traffic-ai-chatbots&quot;&gt;Chartbeat search traffic data, Axios&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
  </entry>
  <entry>
    <title>Multi-agent debate needs a boring data-cleaning cop</title>
    <link href="https://data-today.net/debate-data-cleaning/" />
    <updated>2026-06-03T00:00:00Z</updated>
    <id>https://data-today.net/debate-data-cleaning/</id>
    <content type="html">&lt;p&gt;Multi-agent debate is supposed to make LLM systems safer by adding a critic. In data cleaning, a new arXiv preprint finds the opposite often happens: debate degraded generation across four model families by &lt;strong&gt;1.6 to 15.5 percentage points&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;That is the useful kind of bad news. It does not kill agentic data cleaning. It tells you where to put the guardrail: give the critic evidence, tools, and a veto, or do not invite it into the pipeline.&lt;/p&gt;
&lt;h2 id=&quot;the-critic-made-clean-rows-dirty&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/debate-data-cleaning/#the-critic-made-clean-rows-dirty&quot;&gt;&lt;span&gt;The critic made clean rows dirty&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In a &lt;a href=&quot;https://arxiv.org/abs/2606.02866&quot;&gt;paper submitted on June 1, 2026&lt;/a&gt;, Chirag Parmar, Akshat Mehta, Henglin Wu, Jagadish Ramamurthy, and Shweta Medhekar tested multi-agent debate for data cleaning across three benchmarks, four model families, and more than 6,000 task-condition pairs. Their headline result is uncomfortable for anyone building a panel of LLM agents and calling it governance: debate hurt generation for every model they tested, with drops ranging from &lt;strong&gt;1.6 to 15.5 percentage points&lt;/strong&gt;. The authors call the failure mode &amp;quot;critique-induced confusion,&amp;quot; where a critic hallucinates feedback and the generator accepts it too easily.&lt;/p&gt;
&lt;p&gt;The chart below is the whole story in four numbers. Naive debate can damage generative repair. Detection is where the pattern flips. The same study reports a &lt;strong&gt;27.4 point F1 gain&lt;/strong&gt; for error detection, with an effect size of d = 1.0. Then, after a factorial experiment, the authors found a configuration that finally beat the single-agent baseline on a generative task: a separate critic, code-execution grounding, and evidence-gated generation, up &lt;strong&gt;5.3 points&lt;/strong&gt; (p &amp;lt; 0.05).&lt;/p&gt;
&lt;p&gt;That split matters because data cleaning is not one task. It is a bundle of tasks that punish different mistakes.&lt;/p&gt;
&lt;p&gt;A detection agent can be useful while staying conservative. It asks: &amp;quot;Is this cell suspicious?&amp;quot; A repair agent must write a replacement value. That second action has a blast radius. If the critic invents a reason to change a clean value, the generator can launder the hallucination into your warehouse.&lt;/p&gt;
&lt;p&gt;This is the trap in many agent demos. A second model response looks like review. In production, it is only review if the critic has a better signal than the generator. Otherwise you bought a more expensive coin flip with meeting notes.&lt;/p&gt;
&lt;p&gt;The result also fits the older data cleaning literature better than the current agent discourse does. &lt;a href=&quot;https://arxiv.org/abs/1702.00820&quot;&gt;HoloClean&lt;/a&gt;, introduced by Theodoros Rekatsinas, Xu Chu, Ihab Ilyas, and Christopher Ré in 2017, repaired inconsistent datasets by combining integrity constraints, external signals, and probabilistic inference, reporting about &lt;strong&gt;90 percent precision&lt;/strong&gt; and more than &lt;strong&gt;76 percent recall&lt;/strong&gt; across datasets. &lt;a href=&quot;https://research.uni-hannover.de/en/publications/raha-a-configuration-free-error-detection-system/&quot;&gt;Raha&lt;/a&gt;, a SIGMOD 2019 system from Ziawasch Abedjan and collaborators, beat prior error detection techniques with no more than &lt;strong&gt;20 labeled tuples&lt;/strong&gt; per dataset.&lt;/p&gt;
&lt;p&gt;Those systems were not glamorous. They had the right instinct: data repair needs evidence, not another opinion.&lt;/p&gt;
&lt;h2 id=&quot;debate-is-cheap-until-it-touches-your-source-of-truth&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/debate-data-cleaning/#debate-is-cheap-until-it-touches-your-source-of-truth&quot;&gt;&lt;span&gt;Debate is cheap until it touches your source of truth&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The appeal of multi-agent debate is obvious. A &lt;a href=&quot;https://arxiv.org/abs/2305.14325&quot;&gt;2023 paper&lt;/a&gt; from Yilun Du, Shuang Li, Antonio Torralba, Joshua Tenenbaum, and Igor Mordatch described multiple model instances proposing answers, debating reasoning, and converging on a final response, with reported improvements in math, strategy, and factuality tasks. The same mood drove &lt;a href=&quot;https://arxiv.org/abs/2303.17651&quot;&gt;Self-Refine&lt;/a&gt; and Reflexion: let a model critique its own work, carry feedback forward, and improve without retraining.&lt;/p&gt;
&lt;p&gt;For coding interviews and puzzle benchmarks, that can be enough to justify extra tokens. For data cleaning, the economics are uglier.&lt;/p&gt;
&lt;p&gt;Poor data quality already has a measurable business cost. &lt;a href=&quot;https://www.gartner.com/en/data-analytics/topics/data-quality&quot;&gt;Gartner says&lt;/a&gt; bad data costs organizations at least &lt;strong&gt;$12.9 million per year on average&lt;/strong&gt;, and ties data quality directly to AI and machine learning use cases. That number is broad, but the mechanism is painfully specific: a bad customer record gets merged, a duplicate vendor survives, a SKU mapping shifts, a fraud model trains on the wrong label, or a BI dashboard quietly moves from wrong to authoritative.&lt;/p&gt;
&lt;p&gt;Now add multi-agent debate to that path.&lt;/p&gt;
&lt;p&gt;If your pipeline uses an LLM to normalize merchant names, resolve entities, infer missing categories, or repair malformed addresses, a naive critic introduces three costs at once:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Token and latency cost:&lt;/strong&gt; two or more agents consume more context and wall-clock time before a row lands.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Operational cost:&lt;/strong&gt; every disagreement needs a policy, a log, a retry budget, and often a human escalation path.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data risk:&lt;/strong&gt; an accepted false critique can convert a clean value into a dirty one, which is worse than failing to repair.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That last point is the one to tape above the roadmap. In a search or chat product, the model can be wrong and the user may recover. In a data pipeline, the wrong value often becomes training data, a metric, a join key, or a finance input. The error compounds quietly.&lt;/p&gt;
&lt;p&gt;This is why the paper’s detection result is more interesting than the failure headline. A critic that flags suspicious cells can raise recall without taking the pen away from your source of truth. It can produce candidates, confidence scores, and evidence. The repair step should have a higher bar.&lt;/p&gt;
&lt;p&gt;If you are building this today, the architecture should look less like a debate club and more like a database change-management system:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Put the critic behind retrieval, profiling, constraints, or code execution.&lt;/li&gt;
&lt;li&gt;Require cited evidence for every proposed repair.&lt;/li&gt;
&lt;li&gt;Separate &amp;quot;flag&amp;quot; from &amp;quot;fix&amp;quot; in the API.&lt;/li&gt;
&lt;li&gt;Log the original value, candidate repair, evidence, model version, prompt version, and confidence.&lt;/li&gt;
&lt;li&gt;Sample accepted repairs for human audit, especially after schema drift or vendor changes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That sounds boring. Boring is the point.&lt;/p&gt;
&lt;p&gt;The paper’s successful configuration used adversarial separation, code-execution grounding, and evidence-gated generation. The important phrase is &lt;strong&gt;evidence-gated&lt;/strong&gt;. The critic should not be a louder generator. It should be a narrower agent with tools that let it prove something about the table.&lt;/p&gt;
&lt;p&gt;We have &lt;a href=&quot;https://data-today.net/agentic-cancellation-cliff/&quot;&gt;argued before&lt;/a&gt; that many agentic AI projects die before they ship because they skip the unromantic parts: evaluation, ownership, and failure budgets. This result belongs in the same folder: the problem is rarely that agents cannot talk. The problem is that talk is not control.&lt;/p&gt;
&lt;h2 id=&quot;the-roadmap-move-is-to-split-the-agent-before-it-argues&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/debate-data-cleaning/#the-roadmap-move-is-to-split-the-agent-before-it-argues&quot;&gt;&lt;span&gt;The roadmap move is to split the agent before it argues&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The practical lesson is not &amp;quot;never use multi-agent debate.&amp;quot; The paper gives a cleaner rule: debate helps when the chance of rescuing a wrong output, weighted by fixability, exceeds the chance of destroying a correct one. The authors say that condition predicted all &lt;strong&gt;nine task types&lt;/strong&gt; in their study and generalized with zero false positives across &lt;strong&gt;19 published comparisons&lt;/strong&gt; in seven domains.&lt;/p&gt;
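&lt;p&gt;One literal reading of that rule, as a sketch. The paper’s exact formulation may differ; the function names, parameter names, and sample probabilities below are illustrative assumptions, not values from the study:&lt;/p&gt;
&lt;pre class=&quot;language-python&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;def debate_margin(rescue_chance, fixability, destroy_chance):
    # Chance of rescuing a wrong output, weighted by fixability,
    # minus the chance of destroying a correct output.
    return rescue_chance * fixability - destroy_chance


def debate_helps(rescue_chance, fixability, destroy_chance):
    # Under this reading, debate is worth running only when the margin is positive.
    return debate_margin(rescue_chance, fixability, destroy_chance) > 0


# Hypothetical numbers for a detection-style task and a repair-style task.
detection_ok = debate_helps(rescue_chance=0.4, fixability=0.9, destroy_chance=0.05)
repair_ok = debate_helps(rescue_chance=0.2, fixability=0.5, destroy_chance=0.15)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With these made-up inputs the detection-style task clears the bar and the repair-style task does not, which is the shape of the split the study reports. The hard part in practice is estimating the three inputs from your own audit samples.&lt;/p&gt;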
&lt;p&gt;For a builder, that becomes a product spec.&lt;/p&gt;
&lt;p&gt;Start by classifying each data cleaning action by reversibility and evidence quality. Detection is reversible. Suggesting a repair is partly reversible. Auto-writing the repair into a warehouse table is a production change. Treat those as different permission levels, not one agent flow with extra prompts.&lt;/p&gt;
&lt;p&gt;A useful version might have four components:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Profiler:&lt;/strong&gt; computes distributions, null rates, uniqueness, common formats, and drift.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generator:&lt;/strong&gt; proposes candidate fixes only when asked.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Critic:&lt;/strong&gt; checks candidates against code, constraints, dictionaries, and row-level evidence.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gatekeeper:&lt;/strong&gt; applies policy, confidence thresholds, and human review rules.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The critic and generator should not share the same tool view. The paper’s factorial result says adversarial separation mattered. That makes intuitive sense. If both agents see the same prompt and no external evidence, you have duplicated the model’s priors. If the critic can run code against the table, inspect neighboring rows, or test a constraint, it brings a different source of information.&lt;/p&gt;
&lt;p&gt;A small implementation detail carries a lot of weight here: make the critic return structured evidence, not prose. For example:&lt;/p&gt;
&lt;pre class=&quot;language-json&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-json&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;&quot;cell&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;orders[18422].zip_code&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;&quot;claim&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;value conflicts with city and state&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;&quot;evidence&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;city=Seattle&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;state=WA&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;zip=02139&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;&quot;proposed_action&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;flag_for_review&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;token property&quot;&gt;&quot;confidence&quot;&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0.91&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That schema changes the behavior of the whole system. It gives you test fixtures. It gives analysts something to audit. It lets you reject critiques that contain no evidence. It also makes it easier to compare the agent against old-fashioned baselines such as constraints, dictionaries, and statistical outlier checks.&lt;/p&gt;
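&lt;p&gt;A minimal sketch of that rejection rule, assuming critiques arrive as Python dicts shaped like the example above. The function name &lt;code&gt;accept_critique&lt;/code&gt; and the &lt;code&gt;min_confidence&lt;/code&gt; threshold are hypothetical, not something the paper specifies:&lt;/p&gt;
&lt;pre class=&quot;language-python&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;# Field names come from the example critique; everything else is illustrative.
REQUIRED_FIELDS = ('cell', 'claim', 'evidence', 'proposed_action', 'confidence')


def accept_critique(critique, min_confidence=0.8):
    # Reject critiques that are missing any field of the schema.
    if any(field not in critique for field in REQUIRED_FIELDS):
        return False
    evidence = critique['evidence']
    # Evidence must be a non-empty list of non-empty strings.
    if not isinstance(evidence, list) or not evidence:
        return False
    if not all(isinstance(item, str) and item.strip() for item in evidence):
        return False
    confidence = critique['confidence']
    # Confidence must be a number at or above the assumed threshold.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return False
    return confidence >= min_confidence&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A gatekeeper could run a check like this before any critique reaches the generator, so an objection with no evidence never gets the chance to overwrite a clean value.&lt;/p&gt;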
&lt;p&gt;For the business side, this is where cost discipline enters. Do not spend multi-agent tokens on every row. Spend them on high-value uncertainty: records tied to revenue recognition, fraud, compliance, enterprise customer identity, or training labels for a model that affects money. A 5.3 point repair gain matters if it lands on the right cells. It is waste if you apply it to low-risk formatting issues that a deterministic parser already handles.&lt;/p&gt;
&lt;p&gt;For the engineering side, the key metric is not agent win rate in isolation. Track &lt;strong&gt;clean-to-dirty conversion rate&lt;/strong&gt;: how often the system changes a value that was already correct. That is the metric naive debate can worsen, and it is the metric many demos hide.&lt;/p&gt;
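&lt;p&gt;A rough sketch of how that rate could be computed on a labeled audit sample. It assumes three aligned lists (the value before cleaning, the value after, and a trusted ground truth); the function name and the sample data are illustrative:&lt;/p&gt;
&lt;pre class=&quot;language-python&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;def clean_to_dirty_rate(original, repaired, truth):
    # All three lists hold one value per cell, in the same order.
    if not (len(original) == len(repaired) == len(truth)):
        raise ValueError('inputs must have the same length')
    # Cells that were already correct before the system ran.
    clean = [i for i in range(len(truth)) if original[i] == truth[i]]
    if not clean:
        return 0.0
    # Of those, the cells the system changed anyway.
    damaged = [i for i in clean if repaired[i] != original[i]]
    return len(damaged) / len(clean)


# Hypothetical audit sample: the last cell was a real typo, the rest were clean.
original = ['WA', 'WA', 'MA', 'Seatle']
repaired = ['WA', 'CA', 'MA', 'Seattle']
truth = ['WA', 'WA', 'MA', 'Seattle']
rate = clean_to_dirty_rate(original, repaired, truth)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this toy sample three cells start out correct and one of them gets overwritten, so the rate is one in three, even though the system also fixed the one genuine error. A final-accuracy score alone would not show that damage.&lt;/p&gt;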
&lt;h2 id=&quot;the-next-benchmark-should-charge-rent-for-bad-repairs&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/debate-data-cleaning/#the-next-benchmark-should-charge-rent-for-bad-repairs&quot;&gt;&lt;span&gt;The next benchmark should charge rent for bad repairs&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;There are caveats. This is a new arXiv preprint, not a peer-reviewed verdict. The abstract gives the high-level numbers, while production systems will depend on datasets, schemas, tool access, and the cost of human review. A customer-support table, a medical registry, and an ad-click stream do not have the same tolerance for false repairs.&lt;/p&gt;
&lt;p&gt;Still, the paper points to a better evaluation standard.&lt;/p&gt;
&lt;p&gt;Benchmarks should stop rewarding agents only for final cleaned accuracy. They should charge for every unnecessary modification, every unsupported critique, and every repair that violates lineage. In data cleaning, the model is not just answering a question. It is proposing a database mutation.&lt;/p&gt;
&lt;p&gt;The next useful benchmark would report at least five numbers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Detection F1.&lt;/li&gt;
&lt;li&gt;Repair precision.&lt;/li&gt;
&lt;li&gt;Repair recall.&lt;/li&gt;
&lt;li&gt;Clean-to-dirty conversion rate.&lt;/li&gt;
&lt;li&gt;Cost per accepted repair, including tokens and human review.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That last number matters more in 2026 than it did when older cleaning systems were designed. Agentic workflows can hide compute in orchestration. A three-agent loop that retries twice can turn a cheap cleaning step into a budget leak. If the only metric is F1, the agent will happily spend your margin.&lt;/p&gt;
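&lt;p&gt;The arithmetic behind that leak can be sketched in a few lines. Every figure below is a made-up illustration rather than a number from the paper, the function and parameter names are hypothetical, and the model assumes the worst case where every record uses all of its retries.&lt;/p&gt;
&lt;pre class=&quot;language-python&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;def cost_per_accepted_repair(records, agents, retries, cost_per_call,
                             reviewed, cost_per_review, accepted):
    # Every attempt (the first pass plus each retry) runs every agent once.
    calls = records * agents * (1 + retries)
    total = calls * cost_per_call + reviewed * cost_per_review
    # No accepted repairs means the spend bought nothing.
    if accepted == 0:
        return float('inf')
    return total / accepted


# Illustrative inputs: 1,000 records, 50 human reviews, 40 accepted repairs.
single = cost_per_accepted_repair(1000, 1, 0, 0.002, 50, 0.50, 40)
looped = cost_per_accepted_repair(1000, 3, 2, 0.002, 50, 0.50, 40)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Holding review volume and accepted repairs constant isolates the orchestration effect: three agents with two retries means nine model calls per record instead of one, and an F1 score alone would never show the difference.&lt;/p&gt;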
&lt;p&gt;The smart bet for data cleaning is grounded critics, not open-ended debate transcripts used as a control layer.&lt;/p&gt;
&lt;p&gt;The distinction is small in a demo and huge in production. The winning critic is less like a brilliant colleague riffing in a meeting and more like a fussy reviewer with a SQL console, a constraint file, and permission to say no.&lt;/p&gt;
&lt;h2 id=&quot;the-data-warehouse-does-not-care-who-won-the-argument&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://data-today.net/debate-data-cleaning/#the-data-warehouse-does-not-care-who-won-the-argument&quot;&gt;&lt;span&gt;The data warehouse does not care who won the argument&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A debate that fixes errors is useful. A debate that manufactures doubt is a liability with better formatting.&lt;/p&gt;
&lt;p&gt;For data cleaning, the safest multi-agent system may be the least theatrical one: one agent proposes, one agent verifies with tools, and the database only moves when the evidence clears the gate.&lt;/p&gt;
</content>
  </entry>
</feed>