New Data Project Aims to Link AI Token Usage to Shipped Software

A new data project aims to help companies understand whether using more AI language model tokens actually leads to more shipped software.

The project moves beyond anecdotes to establish clear, comparable metrics. The question matters because token-to-value efficiency has become a core operating metric for AI programs (Deloitte Insights). With LLM APIs processing an estimated 50 trillion tokens daily (Fireworks AI), even minor inefficiencies can create significant budget overruns.

Current data connecting token usage to output is limited. While industry reports suggest varying conversion rates, analysts warn these figures often stem from single vendors, making them unreliable for wider application. A rigorous, privacy-preserving telemetry network across multiple software companies is needed to determine if token consumption actually accelerates software delivery or merely inflates costs.

The Data Project: Methodology and Metrics

The project aims to test whether AI token consumption tracks productivity by collecting anonymized telemetry from multiple companies. By correlating token counts with software delivery metrics such as merged pull requests, it aims to provide the first cross-firm benchmarks for measuring the ROI of generative AI.

The project will gather anonymized telemetry, including prompt lengths, token counts, latency, and workflow tags. Following OpenTelemetry best practices, it will strip direct identifiers and aggregate data at the cohort level to protect privacy. This telemetry will be aligned with release metadata, such as merged pull requests, to link AI usage to development output. Statistical controls such as cohort analysis will be used to adjust for confounding variables such as team size or sprint duration.

A preliminary data schema will track key fields for each interaction:

- Timestamp (rounded to the nearest hour)
- Model family and context window
- Input and output token counts
- Workflow tag (e.g., code completion, document summary)
- Hashed release artifact ID
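
As an illustration only, the fields above could be captured in a record like the following Python sketch. The class name `InteractionRecord`, the helper functions, the salted SHA-256 hashing scheme, and all sample values are assumptions made for this example; the project has not published a concrete schema.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


def round_to_nearest_hour(ts: datetime) -> datetime:
    """Round a timestamp to the nearest hour (30 minutes or more rounds up)."""
    floored = ts.replace(minute=0, second=0, microsecond=0)
    if ts - floored >= timedelta(minutes=30):
        return floored + timedelta(hours=1)
    return floored


def hash_artifact_id(raw_id: str, salt: str) -> str:
    """Salted SHA-256 digest (assumed scheme) so the raw release ID is never shared."""
    return hashlib.sha256((salt + raw_id).encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class InteractionRecord:
    """One AI interaction, limited to the fields listed in the preliminary schema."""
    timestamp: datetime          # already rounded to the nearest hour
    model_family: str
    context_window: int          # maximum context size, in tokens
    input_tokens: int
    output_tokens: int
    workflow_tag: str            # e.g. "code_completion", "document_summary"
    release_artifact_hash: str   # hashed release artifact ID

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


# Hypothetical sample values, for illustration only.
record = InteractionRecord(
    timestamp=round_to_nearest_hour(datetime(2025, 1, 15, 14, 47, tzinfo=timezone.utc)),
    model_family="example-model",
    context_window=128_000,
    input_tokens=1_200,
    output_tokens=300,
    workflow_tag="code_completion",
    release_artifact_hash=hash_artifact_id("release-123", salt="per-firm-secret"),
)
```

Rounding the timestamp and hashing the artifact ID before the record is created means the precise time and the raw release name never need to leave the contributing firm.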

This schema enables analysts to determine the number of tokens used per merged pull request and assess if factors like larger context windows reduce iteration cycles or QA defects.
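
A minimal sketch of the tokens-per-merged-pull-request calculation might look like this. It assumes interactions arrive as `(artifact_hash, input_tokens, output_tokens)` tuples and that merged pull request counts are keyed by the same hash; both shapes, the function name, and the sample numbers are hypothetical.

```python
from collections import defaultdict


def tokens_per_merged_pr(interactions, merged_pr_counts):
    """Return total tokens divided by merged PRs for each release artifact.

    interactions: iterable of (artifact_hash, input_tokens, output_tokens)
    merged_pr_counts: dict mapping artifact_hash -> number of merged PRs
    Artifacts with zero merged PRs are skipped to avoid dividing by zero.
    """
    totals = defaultdict(int)
    for artifact_hash, input_tokens, output_tokens in interactions:
        totals[artifact_hash] += input_tokens + output_tokens

    ratios = {}
    for artifact_hash, pr_count in merged_pr_counts.items():
        if pr_count > 0:
            ratios[artifact_hash] = totals.get(artifact_hash, 0) / pr_count
    return ratios


# Hypothetical sample data.
interactions = [("a1", 1000, 200), ("a1", 500, 300), ("b2", 4000, 1000)]
merged = {"a1": 4, "b2": 2, "c3": 0}
ratios = tokens_per_merged_pr(interactions, merged)
print(ratios)  # {'a1': 500.0, 'b2': 2500.0}
```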

Understanding Token Consumption and Total Cost

Early benchmarks already show wide variance in consumption: token usage for common enterprise tasks varies significantly with complexity and scope (Iternal AI). Code completion uses fewer tokens per request, but its high frequency adds up in daily totals. Token budgets are therefore driven by both task complexity and task frequency.

Furthermore, cost visibility extends beyond the API bill. Research from DigitalApplied shows that the total cost of ownership (TCO) for AI can be substantially higher than API fees due to orchestration and observability. Similarly, Deloitte notes that nearly half the cost of on-premise AI infrastructure comes from non-GPU expenses like networking and power. Therefore, any meaningful analysis must account for these overhead multipliers, not just raw token prices.
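
The effect of an overhead multiplier can be shown with simple arithmetic. In the sketch below, the multiplier of 1.5, the API spend, and the pull request count are placeholders chosen for illustration; the cited sources do not supply these figures, and the function names are hypothetical.

```python
def total_cost_of_ownership(api_spend: float, overhead_multiplier: float) -> float:
    """Scale raw API spend by a multiplier covering orchestration, observability, etc."""
    if overhead_multiplier < 1.0:
        raise ValueError("overhead_multiplier must be at least 1.0")
    return api_spend * overhead_multiplier


def cost_per_merged_pr(api_spend: float, overhead_multiplier: float, merged_prs: int) -> float:
    """Fully loaded cost attributed to each merged pull request."""
    if merged_prs <= 0:
        raise ValueError("merged_prs must be positive")
    return total_cost_of_ownership(api_spend, overhead_multiplier) / merged_prs


# Placeholder inputs: 10,000 in monthly API fees, an assumed 1.5x overhead, 200 merged PRs.
tco = total_cost_of_ownership(10_000.0, 1.5)
per_pr = cost_per_merged_pr(10_000.0, 1.5, 200)
print(tco, per_pr)  # 15000.0 75.0
```

With these placeholder inputs, the API bill alone would suggest 50 per merged pull request, while the loaded figure is 75, which is the kind of gap the analysis needs to capture.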

From Benchmarking Dashboards to Strategic Decisions

By aggregating shared telemetry, the project can produce percentile benchmarks for key efficiency metrics across participating organizations. These benchmarks will cover metrics such as tokens per merged pull request, cost per feature, and conversion ratios between token usage and shipped code.
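
One way such percentile benchmarks could be computed is sketched below with Python's standard `statistics` module. The five sample values stand in for tokens per merged pull request at five hypothetical organizations, and `percentile_rank` is an illustrative helper rather than a published project metric.

```python
from statistics import quantiles


def percentile_rank(value: float, population: list) -> float:
    """Percentage of organizations whose metric is at or below `value`."""
    at_or_below = sum(1 for x in population if x <= value)
    return 100 * at_or_below / len(population)


# Hypothetical tokens-per-merged-PR figures, one per participating organization.
tokens_per_pr = [100, 200, 300, 400, 500]

# Quartile cut points (25th, 50th, 75th percentiles) across the cohort.
quartiles = quantiles(tokens_per_pr, n=4, method="inclusive")
print(quartiles)  # [200.0, 300.0, 400.0]

# Where an organization using 350 tokens per merged PR sits in the distribution.
rank = percentile_rank(350, tokens_per_pr)
print(rank)  # 60.0
```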

These industry-wide benchmarks will empower strategic decision-making. Executives can identify teams where token consumption outpaces throughput, while investors can benchmark portfolio companies on AI efficiency.

The data also provides an early-warning system for governance. Alerts can flag sudden spikes in token usage from model upgrades or prompt changes, allowing for intervention before budgets are impacted. This addresses challenges noted by Deloitte where complex models can cause spend to multiply without proper controls.
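
A simple version of such an alert compares each day's usage with a trailing average. The sketch below is one possible approach, not the project's stated method; the seven-day window, the 2x threshold, and the sample series are all assumptions.

```python
from statistics import mean


def find_token_spikes(daily_tokens, window=7, threshold=2.0):
    """Return indexes of days whose usage exceeds `threshold` times the trailing mean.

    The baseline for day i is the mean of the `window` days before it,
    so the first `window` days are never flagged.
    """
    alerts = []
    for i in range(window, len(daily_tokens)):
        baseline = mean(daily_tokens[i - window:i])
        if baseline > 0 and daily_tokens[i] > threshold * baseline:
            alerts.append(i)
    return alerts


# Hypothetical daily token totals; day 7 simulates a model upgrade or prompt change.
usage = [100, 110, 90, 105, 95, 100, 100, 400, 120]
spikes = find_token_spikes(usage)
print(spikes)  # [7]
```

A trailing mean is easy to explain to budget owners, though a single spike inflates the baseline for the following days; a median-based baseline would be more robust at the cost of slower reaction to genuine shifts.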

Ultimately, a shared dataset breaks companies out of their telemetry silos. It provides the industry-wide visibility needed to differentiate between scaling product and merely scaling costs.

What exactly will the new data project measure?

The initiative will track anonymized token consumption data and link it to shipped product outcomes across multiple organizations. Instead of relying on single-vendor anecdotes, the project will gather multi-firm telemetry to produce statistically defensible benchmarks. Deliverables will include interactive dashboards, quarterly benchmarking reports, and best-practice playbooks that executives, investors, and vendors can use to judge where AI spend creates real economic value and where it drives waste.

How will the project protect sensitive data while still getting useful telemetry?

All data collection follows established best practices that OpenTelemetry and industry leaders already recommend:
- Data minimization - only prompt frequency, token counts, latency, and workflow tags are captured.
- Edge anonymization - direct identifiers are stripped or hashed in the OpenTelemetry Collector before data ever reaches a shared warehouse.
- Aggregation only - metrics are surfaced at team, project, or cohort level; no individual developer is ever exposed.
- Continuous review - raw telemetry is auto-deleted once derived metrics are validated, reducing long-term privacy risk.
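
The minimization and aggregation steps above can be sketched in plain Python. This is not OpenTelemetry Collector configuration; it only illustrates the idea. The field names, the `cohort` key, and the `min_events` suppression threshold are assumptions added for this example and are not described by the project.

```python
from collections import defaultdict

# Fields kept after minimization; anything else (emails, names, prompt text) is dropped.
ALLOWED_FIELDS = {"cohort", "token_count", "latency_ms", "workflow_tag"}


def minimize(event: dict) -> dict:
    """Keep only the allow-listed fields from a raw telemetry event."""
    return {key: value for key, value in event.items() if key in ALLOWED_FIELDS}


def aggregate_by_cohort(events, min_events=3):
    """Sum tokens per cohort, suppressing cohorts with fewer than `min_events` records.

    The suppression threshold is an assumed safeguard so that a very small
    cohort cannot be traced back to one developer.
    """
    token_counts = defaultdict(list)
    for event in events:
        clean = minimize(event)
        token_counts[clean["cohort"]].append(clean["token_count"])
    return {
        cohort: {"events": len(counts), "total_tokens": sum(counts)}
        for cohort, counts in token_counts.items()
        if len(counts) >= min_events
    }


# Hypothetical raw events, including an identifier that must not be shared.
raw_events = [
    {"user_email": "dev1@example.com", "cohort": "team-a", "token_count": 100,
     "latency_ms": 220, "workflow_tag": "code_completion"},
    {"user_email": "dev2@example.com", "cohort": "team-a", "token_count": 150,
     "latency_ms": 310, "workflow_tag": "code_completion"},
    {"user_email": "dev3@example.com", "cohort": "team-a", "token_count": 250,
     "latency_ms": 180, "workflow_tag": "document_summary"},
    {"user_email": "dev4@example.com", "cohort": "team-b", "token_count": 900,
     "latency_ms": 400, "workflow_tag": "document_summary"},
]
summary = aggregate_by_cohort(raw_events)
print(summary)  # {'team-a': {'events': 3, 'total_tokens': 500}}
```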

What kinds of statistics might the finished datasets reveal?

Early industry snapshots (via Deloitte and Keyhole Software [2][3]) show token usage can swing dramatically by use-case:

Task	Token usage per successful request (qualitative)	Source
Single-page summarization	Varies significantly by complexity	Iternal AI [1]
Contract clause extraction	Substantial variation based on document length	Iternal AI [1]
Inline code completion	Lower per-instance usage but high frequency	Iternal AI [1]

At an estimated 50 trillion tokens processed daily across the LLM API market (Fireworks AI [5]), even small underestimations of total cost of ownership (DigitalApplied [4]) become material. The new project will therefore quantify both extremes and surface median and percentile ranges so teams know when their spend sits in the fat tail.

Who should care about the findings and why?

Executives - to set defensible AI budgets and justify ROI to boards.
Investors - to spot vendors whose token efficiency is improving faster than price erosion.
Vendors - to benchmark their own tools against vertical peers and avoid a race to the bottom on per-token pricing.
Policymakers - to understand aggregate compute demand and carbon footprint before writing new regulations.

When will the first benchmarking report be available?

The project is currently recruiting pilot partners on an opt-in basis. Once anonymized telemetry pipelines are validated across a sufficient sample of participating organizations, the first public benchmarking report is planned for later this year. Pilot participants receive early-access dashboards and co-branding rights in exchange for transparent, anonymized data feeds.