Over the last five years – since the release of ChatGPT – the term "Artificial Intelligence" has become increasingly identified with one particular approach: very large proprietary models, trained by companies with very deep pockets on as much data as they can gather. These are then commercialized through human-like chatbots and tools that play a double role: delivering the models to users and collecting further data to improve them.
The paradigm has pushed, and continues to push, automation boundaries in economically significant sectors - systems like Claude Opus and GPT 5.2 CodeX have found undeniable success with software developers working for the technology companies that make up the majority of the S&P 500. Yet this scale-first approach represents the version of AI that best suits these same companies, which control the majority of computational resources and have uniquely centralized access to user data, giving them strong incentives to keep attention focused on these compute-hungry versions of AI even if it means warping public perception - including among policymakers and potential customers - of actual costs and viable alternatives.
In practice, however, not everything in AI is a multi-trillion-parameter behemoth with a 9-figure training budget. Different versions of the technology can often be a much better fit for the specific needs of the organizations building these systems and building with them. We outline some of these categories below:
The graph below indicates estimated training cost (USD) for open-weight and task-specific models, with reference lines for selected commercial models. Marker shape indicates confidence (see methodology for details).
Click a point on the chart to see model details in the report card to the right.
Most definitions of "Artificial Intelligence" cover significantly more than just chatbots; a very wide range of applications of Machine Learning to various categories of data and modalities - developing models for tasks like OCR, speech and image processing, or scientific research - would fall under that heading. Notably, many models trained over five years ago - before the explosion of compute resources - remain in significant use today as the backbone of diverse AI-powered applications: OpenAI's CLIP and Whisper for image and speech processing, variations on BERT or SentenceLM for run-of-the-mill text embedding applications, and models like Wav2vec, which are routinely fine-tuned to adapt to new languages. These models were trained at the beginning of the shift toward model scaling, which means they were often the largest models of their time in terms of parameters or dataset size. With today's resources, their training costs would be at most a few thousand dollars, if that. Nevertheless, they remain broadly useful and among the most-downloaded artifacts on the Hugging Face Hub.
Models trained in recent years for specific tasks and domains also tend to be more parsimonious with training compute than ones trained for "general-purpose" use. Corporate and individual contributors of open-weight models, companies with defined use cases, and research organizations continue to train models that meet their specific needs (or those of their communities). While growing access to compute does play a role in unlocking new use cases for the technology (generalized OCR and genetics stand out as two recent successes - e.g. DeepSeek-OCR), in most cases the compute expenses remain limited, and success hinges more on access to the right training data and on appropriate problem modeling. Artificially framing a problem as something a language model can do is often a waste of resources: while a system like ChatGPT could technically output a representation of a protein's 3D structure, training a separate model for that purpose, with a more appropriate output format than free text, is far easier and more efficient than trying to add the capability to a general-purpose model.
Dedicated models such as Google's AlphaFold3 and the open reproduction OpenFold3 represent the state of the art for the latter task; OpenFold3's benchmark scores are slightly lower but within a standard deviation, and its open nature has already allowed private companies and other research efforts to fine-tune it for their purposes. Given the greater overall efficiency in both training data and especially model size, inference compute costs are also significantly less of a concern for these approaches, with costs comparable to running a small or medium language model depending on the model. For example, recent OCR models can process tens of thousands of pages for a few dozen dollars.
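To give a sense of scale, the back-of-the-envelope sketch below turns the order-of-magnitude figures above ("a few dozen dollars" for "tens of thousands of pages") into a per-page cost; the exact numbers are illustrative, not measured.

```python
# Rough per-page cost of OCR with a small dedicated model, using the
# order-of-magnitude figures quoted above (both are illustrative).
total_cost_usd = 30        # "a few dozen dollars"
pages_processed = 30_000   # "tens of thousands of pages"

cost_per_thousand_pages = total_cost_usd / pages_processed * 1_000
print(f"~${cost_per_thousand_pages:.2f} per 1,000 pages")
# ~$1.00 per 1,000 pages, i.e. about a tenth of a cent per page
```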
While there are too many approaches to and applications of AI to provide an exhaustive list, we were able to review recent publications with sufficient information to estimate the compute cost of training for a few different use cases:
As outlined in the previous section, much of current AI takes forms other than (Large) Language Models. Still, given the prevalence of text in digital records and interactions, the latter have come to represent a significant portion of the AI landscape. Different size categories of LMs, however, have very different development and deployment requirements and can lead to drastically different approaches to building AI systems - and, notably, to a much more or much less concentrated ecosystem depending on which are prioritized.
A world of ubiquitous AI can follow one of two main organizing principles. The first relies on widespread access to large cloud-bound models, which requires centralizing resources and a constant flow of (sensitive and valuable) data from users' devices to large data centers controlled by the model providers. Alternatively, building AI systems around models that can run directly on users' devices provides a more privacy- and security-conscious path to AI-supported functionality without depending as much on centralized compute.
The latter has grown into a more plausible proposition over the last three years, as smaller ("smol") models have become a competitive approach to building AI systems, trailing the SOTA of large models by a few months on some benchmarks and overtaking them on others. Qwen3 4B Instruct and its 8B counterpart can run on even low-end GPUs, and are now commonly adopted for a wide range of applications; Qwen3 8B in particular has nearly 1,000 reported fine-tunes on Hugging Face across different modalities and applications, including models like NVIDIA's Orchestrator or AI2's Molmo that are popular in their own right. At the same time, we see models with just 1.5B parameters - small enough for "edge" devices, notably smartphones - do as well as the last generation of large commercial models on coding tasks in the case of Weibo VibeThinker, and even surpass the current state of the art on translation in the case of Tencent HY-MT. Even in domains as conceptually complex as mathematical theorem proving, 4B models like QED-Nano show remarkable performance. We increasingly see companies such as IBM bet on these model sizes for their flagship AI products to support particular commercially relevant applications of the technology.
Smaller models really shine in settings where AI is applied to a well-defined use case, and we're seeing mounting evidence that a little extra dataset and fine-tuning work can make them competitive with the largest alternatives at a fraction of the cost, without incurring the same competitiveness, liability, security, and sustainability risks.
Unsurprisingly, the smallest category of models is also the most financially and computationally accessible to train; it is also the category for which we have the most information on those costs. Hugging Face's SmolLM2 and SmolLM3 and IBM's Granite 3.0-2B and Granite 3.0-8B provide wall clock time, training infrastructure, and/or direct cost estimates for their training runs in addition to training data sizes, with costs ranging from $250,000 for SmolLM2 to $1,700,000 for Granite 3.0-8B. Looking at the differences in training dataset sizes, this likely implies a training compute cost of about $4,500,000 for Qwen3 8B, probably the most compute-intensive model in this category.
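As an illustration of how this kind of extrapolation works, the sketch below scales a known training cost by the ratio of (parameters × training tokens), following the common 6·N·D compute approximation with a fixed price per FLOP. The reference dataset sizes are assumptions on our part (roughly 12T tokens for Granite 3.0 and 36T for Qwen3), so the output should be read as a ballpark figure in the same range as the ~$4,500,000 estimate above, with the gap reflecting differences in hardware and training efficiency.

```python
# Back-of-the-envelope extrapolation: assume training compute cost scales
# roughly with (parameter count x training tokens), i.e. the ~6*N*D FLOPs
# approximation with a fixed price per FLOP.

def extrapolate_cost(ref_cost_usd, ref_params_b, ref_tokens_t,
                     new_params_b, new_tokens_t):
    """Scale a known training cost to a new model / dataset size."""
    return ref_cost_usd * (new_params_b * new_tokens_t) / (ref_params_b * ref_tokens_t)

# Reference: an 8B model trained on ~12T tokens for $1.7M (the Granite 3.0-8B
# disclosure cited above; the token count is our assumption), extrapolated to
# an 8B model trained on ~36T tokens (assumed for Qwen3 8B).
estimate = extrapolate_cost(
    ref_cost_usd=1_700_000, ref_params_b=8, ref_tokens_t=12,
    new_params_b=8, new_tokens_t=36,
)
print(f"Estimated training compute cost: ${estimate:,.0f}")  # ~$5.1M
```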
Since smaller models are designed to be developed and leveraged in a somewhat different fashion from large and generalist commercial APIs, it can also be useful to consider how the costs are distributed along the whole development chain. The Weibo VibeThinker model provides a textbook case: it is a model designed to excel at specific coding and tool-calling tasks. It was fine-tuned to that end on top of the domain-focused model Qwen-2.5-Math, which was itself the product of continued domain training from the general-purpose Qwen-2.5. Given the information we have about the model sizes and the training data sizes at each stage, we can estimate the compute cost of the general-purpose base model at around $410,000, followed by another $25,000 of domain adaptation to get to Qwen-2.5-Math, and a final $7,800 disclosed by the authors for the task-specific training. Similarly, QED-Nano only requires an additional $28,000 on top of its base model to reach generally competitive performance on mathematical applications. In practice, different categories of actors will start at different stages in the process; as long as they can rely on a sufficiently thriving ecosystem of open-weight general or domain-level models, they can train task-specific models for a fraction of the cost of the initial model training.
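The sketch below lays out this staged structure with the estimates quoted above, and shows what an actor entering the chain at the last stage actually pays.

```python
# Staged training costs along the VibeThinker chain, using the estimates
# quoted in this section.
pipeline_stages = [
    ("General-purpose base (Qwen-2.5)",        410_000),
    ("Domain adaptation (Qwen-2.5-Math)",       25_000),
    ("Task-specific fine-tune (VibeThinker)",    7_800),
]

total = sum(cost for _, cost in pipeline_stages)
print(f"Full chain, trained from scratch: ${total:,}")

# An actor starting from the open-weight domain model only pays the last stage.
final_stage_cost = pipeline_stages[-1][1]
print(f"Starting from an open domain model: ${final_stage_cost:,} "
      f"({final_stage_cost / total:.1%} of the full chain)")
```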
Models can take advantage of scale without depending as strongly on cloud compute for inference. Models up to 32B parameters have shown that they can benefit from the same scales of datasets (up to 36T tokens) as larger versions to boost their benchmark performance. These models have been particularly important in the development, deployment, and governance of AI for several reasons:
While the cost of training mid-range LMs can vary significantly, training compute costs are likely to fall in the $1-20M range. The most reliable information on training cost comes from OLMo 3.1, whose developers report not only model architecture and training data size but also the wall clock time and training infrastructure they used, putting the likely cost at about $2,750,000, assuming an overall rate of $2/hour per H100, which reflects current rates for large bulk reservations in the right regions on commercial platforms. At the high end of the range we find Qwen3-32B, which was trained on significantly more data; assuming a similar cost per unit of training data as OLMo - given the similarities in model architecture - would put its training compute cost at $16,000,000. Using this estimate, Qwen's comparison of the training costs of their 3B/30B MoE model Qwen3-30B-A3B to the 32B version, and linearly extrapolating to the similarly structured NVIDIA Nemotron Nano, gives us a training compute budget of $1,300,000 for the latter model. This is at the lower end of cost for performance, without even accounting for the fact that Nemotron was reportedly trained at lower precision, which would presumably further reduce the cost.
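For reference, this kind of estimate reduces to simple arithmetic once wall clock and infrastructure are disclosed: GPUs × hours × hourly rate. The sketch below uses the $2/H100-hour rate assumed above; the GPU count and duration are illustrative placeholders chosen to land near the OLMo-scale figure, not the developers' actual disclosure.

```python
# Training compute cost from disclosed wall clock and infrastructure:
#   cost = (number of GPUs) x (training hours) x ($ per GPU-hour)

def training_compute_cost(num_gpus, wall_clock_hours, usd_per_gpu_hour=2.0):
    """$2/hour per H100 matches the bulk-reservation rate assumed above."""
    return num_gpus * wall_clock_hours * usd_per_gpu_hour

# Illustrative placeholders: 1,024 H100s running for about 8 weeks.
cost = training_compute_cost(num_gpus=1024, wall_clock_hours=8 * 7 * 24)
print(f"Estimated training compute cost: ${cost:,.0f}")  # ~$2.75M
```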
Unfortunately, while models like GLM-4.7-Flash, OpenAI's GPT-OSS-20B, and Gemma-3 are certainly meaningful examples of this size category, we lack any information about the size of their training datasets, which would be a minimum requirement to extrapolate from the available information on more transparent models. However, assuming that their costs are in line with what we know about the more transparent alternatives, training models in this category amounts to a 7-to-low-8-figure R&D budget item - one of many we can expect to find on the books of the hundreds if not thousands of companies that already make similar investments in their own software development.
The current focus on large generalist models as the dominant (or at least most visible) paradigm of AI development started with GPT-3 - a 175B-parameter model trained on a corpus of 600B tokens of Internet text for a few million dollars - at a time when common practice was to train significantly smaller models on more curated datasets. Since then, companies like OpenAI, Anthropic, Google, and more recently xAI have kept increasing the size and training compute costs of their flagship models, with reports of training runs costing several hundred million dollars and billions of dollars spent yearly on compute for model development. These increasingly lavish compute expenses have allowed the most well-resourced companies to run an exclusive race for the top scores on evaluations aimed at measuring the models' general coverage of information about the world, increasingly complex software engineering tasks, and increasingly robust multi-stage "tool use" - with the top position changing hands every few months and granting its holder a boost in attention and interest from potential customers.
The largest of Large Language Models have two main appeals. First, they are trained to be versatile out of the box, with popular rankings increasingly requiring models to score highly on as many benchmarks as possible. This allows prospective users to start using them with a reasonable chance of getting some automation value without any specific Machine Learning development skills, which drastically increases their prospective customer base. Second, they can get value out of inference-time compute through "reasoning" (GPT-5.2 achieves its best scores on the "extra-high" setting), which means that in most cases users can increase an AI system's likelihood of succeeding at a task by simply turning up the cost dial. They also present one major constraint, with potentially dire consequences for competition in the space: they require access to data-center-grade GPUs to run at all, and to extensive compute infrastructure to be served at scale. This makes it all the more notable that most US developers of such models either operate their own hyperscale infrastructure or have hyperscalers significantly represented in their capitalization tables. Within this overall family of models, however, we can still distinguish categories with different cost dynamics:
Information about the training costs of the largest models is especially difficult to estimate given the secrecy of developers, particularly for closed API-only models developed by large companies. Estimations based on information about companies' compute infrastructure and reports of total FLOPs put two of 2025's large models at $218,000,000 and $334,000,000 in compute costs (Grok 3 and GPT 4.5 respectively, according to estimations by Epoch AI), and Claude 3.5 cost "a few 10M$" to train according to Anthropic CEO Dario Amodei. Any further investigation - or even educated guesses - is hampered by the fact that companies to date have failed to disclose even basic information about the models' sizes, architectures, or training dataset token counts.
We have a somewhat better sense of the training cost of the largest open-weight models. DeepSeek v3 helpfully provided not only its training dataset size but also the wall clock time for training; factoring in the extra data added for DeepSeek v3.2 would put the training compute cost of the latest version of the model at around $12M. US start-up Arcee also recently disclosed a total cost of under $20M for its 400B-parameter model Trinity-Large, including not just training compute but also data and personnel costs. Extrapolating this information to the largest and arguably strongest open-weight model to date on aggregated benchmarks, Kimi-K2.5, we can estimate up to $35M of training compute (up from $20M for Kimi-K2) for current top-of-the-line benchmark success.
At the lower end of training cost (or higher end of training efficiency), Qwen3 Next and Qwen-Coder-Next show benchmark performance that approaches top models on software engineering and coding tasks for estimated training compute costs in the range of $1.5-2M. This represents a factor of roughly 10x between the least and most training-compute-intensive top-of-the-line open models, and another 10x factor again for fully closed models. The first jump can easily be attributed to the range of model sizes. The source of the difference between open- and closed-weight models is less clear, but with user bases ranging from tens of millions to billions of monthly active users, it is likely that integrating usage data into the training process significantly raises the costs.
Beyond training, the cost of running models varies widely. The plot below compares deployment costs ($/hour) and inference costs ($/1M tokens) across selected open-weight models and proprietary APIs.
Select hosting precision or inference pricing mode to see costs. Left: log-scale $/hour. Right: log-scale $/M tokens.
The majority of the models in the Task and Domain AI and "Smol" Language Models categories can be used without relying on cloud compute services at all. Translation models like Tencent HY-MT and even general-purpose models like Qwen3 4B Instruct are small enough to run on a phone, a low-end GPU, or even a generic personal computer CPU when they only need to handle a few examples at a time in the course of individual interactions with a system. Encouragingly, important layers of the web infrastructure are starting to take advantage of these capabilities, with technologies like WebGPU enabling web applications to rely on users' local compute for models up to 20B parameters. While cloud compute can still be an attractive option for using models in these categories for significant workloads or scheduled scripts, the expenses remain closer to those of traditional data processing loads, e.g. up to a few dozen dollars to process tens of thousands of pages with DeepSeek-OCR. Models in the Mid-Range LMs category follow similar dynamics, with options like GLM 4.7 Flash or NVIDIA Nemotron Nano offering great value in local deployment settings - albeit in a way that is more weighted toward cloud use, with online GPUs at the more accessible end of the price range supporting more substantial use loads.
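As a concrete illustration of what local use looks like in practice, the minimal sketch below loads a small open-weight model with the transformers library and runs it on whatever hardware is available locally; the checkpoint identifier is an assumption on our part and should be checked against the Hugging Face Hub.

```python
# Minimal local-inference sketch: no cloud API involved, just a local
# checkpoint running on the user's own CPU or GPU.
# Requires `pip install transformers accelerate torch`.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen3-4B-Instruct-2507",  # assumed checkpoint id - verify on the Hub
    device_map="auto",                    # uses a local GPU if present, else CPU
)

messages = [{"role": "user", "content": "Summarize the key terms of this contract clause: ..."}]
result = generator(messages, max_new_tokens=128)
print(result[0]["generated_text"][-1]["content"])  # the model's reply
```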
Because these modes of deployment are so different from the dominant paradigm of centrally served AI, they often get left out of conversations about compute capacity for AI. In practice, even smaller AI models still increase the compute load of software significantly, and will require continued progress on hardware to reach their full potential - but this would look more like previous evolutions of computing paradigms to better accommodate the needs of graphic design or game engines. We are seeing promising efforts in this direction, which could help move AI compute expenses from an extremely concentrated commodity that encourages speculation to a more distributed and grounded resource.
As previously mentioned, we define our category of the Largest Language Models - both open-weight and proprietary versions - by their dependence on data-center-grade GPUs and cloud compute to function. Additionally, while Mid-Range LMs can be run on personal compute resources, heavy use loads might be better addressed by some form of cloud deployment. However, cost and infrastructure requirements vary significantly across model categories within the general paradigm of cloud-supported inference.
We provide a comparison of costs for selected models above using two categories of metrics, reflecting the two primary ways of using medium to large models. First, users can deploy any open-weight model themselves by renting compute instances from cloud providers. This approach provides greater control over data flow and flexibility for managing variable loads, but also requires more work to manage resource allocation. It also allows developers to choose between using a full-precision version of a model or a quantized alternative that may sacrifice some performance but reduces VRAM requirements - and thus the rental cost of the required instances - significantly. In this work, we provide deployment costs for both full precision and 4-bit quantization.
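A first-order way to reason about which instance a model needs is to estimate the memory taken by the weights alone from parameter count and precision, as in the sketch below; real deployments also need headroom for activations and the KV cache, so these are lower bounds.

```python
# Rough memory needed just to hold the model weights, by precision.
BYTES_PER_PARAM = {"fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params_billion, precision):
    """Gigabytes of memory occupied by the weights alone."""
    return num_params_billion * BYTES_PER_PARAM[precision]

for precision in ("fp16/bf16", "int4"):
    gb = weight_memory_gb(32, precision)  # e.g. a 32B-parameter model
    print(f"32B parameters @ {precision}: ~{gb:.0f} GB of weights")
# fp16/bf16: ~64 GB -> multiple GPUs or a large instance
# int4:      ~16 GB -> fits a single 24 GB GPU with room left for the KV cache
```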
Additionally, some companies take on the technical work of hosting selected models and managing their workloads, and charge developers for use on a per-token basis. This is the only pricing mode available for proprietary models (ChatGPT, Gemini, Claude, etc.), and platforms like OpenRouter and Hugging Face also provide access to a catalog of inference providers for a selection of popular open-weight models. While this means we can't directly compare the deployment costs of all open-weight models with those of proprietary APIs, the open-weight models that do have hosted options can provide an anchor for high-level comparisons - assuming, of course, that proprietary models are priced at a rate that neither is predatory nor reflects monopoly pricing.
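The two pricing modes can be compared on a given workload with the simple sketch below; the per-token and hourly prices are illustrative placeholders rather than figures from our comparison, and the main point is that a rented instance is a fixed cost while per-token pricing scales directly with use.

```python
# Comparing per-token API pricing to renting a dedicated instance.
# All prices below are illustrative placeholders.

def api_cost(tokens_per_month, usd_per_million_tokens):
    return tokens_per_month / 1e6 * usd_per_million_tokens

def self_hosted_cost(hours_per_month, usd_per_hour):
    return hours_per_month * usd_per_hour

workload = 2_000_000_000  # 2B tokens per month
print(f"Hosted API:  ${api_cost(workload, usd_per_million_tokens=0.50):,.0f}/month")
print(f"Self-hosted: ${self_hosted_cost(hours_per_month=730, usd_per_hour=1.80):,.0f}/month")
# Which option wins depends on utilization: the instance costs the same whether
# it is idle or saturated, while API costs track actual token throughput.
```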
Even the limited information we have shows a chasm between the costs of different LLMs, even in cases where benchmark performance is similar. Within the Largest Language Models class, OpenAI's GPT-OSS-120B and MiniMax M2.5 are notable examples of hosted models that are significantly cheaper than proprietary alternatives, even though the latter has benchmark performance similar to models that charge 10 or 20 times more per output token. Self-deployment costs show a similar range, with a price factor of up to 20 between quantized open-weight models, and over 100 between the cheapest and most expensive options overall.
Benchmark numbers by themselves do not tell the entire story of model performance, and we have already noted the appeal of models that offer more general off-the-shelf utility even in cases where they are significantly more expensive; but these numbers do show that deploying AI does not have to require the compute expenses that come with automatically relying on the best-known commercial offering. At the very least, it follows that any public conversation about widespread use of AI would be well served by demanding some efficiency of system developers, to ensure that models do not consume ten, a hundred, or a thousand times more resources than they need to achieve a particular goal - especially when ownership structures of compute infrastructure may disincentivize some developers from prioritizing efficiency.
In the sections above, we showed that training and deployment costs across all of the categories we have surveyed - task and domain models, small and medium language models, and large open-weight and proprietary systems - vary by orders of magnitude. This report provides quantitative estimates of those costs to show that the majority of commercially and scientifically significant AI development does not require 9-figure training runs or exclusive access to data-center-scale infrastructure. Open science and open-source developers have demonstrated that models that are competitive on a wide range of benchmarks can be trained for costs ranging from the low thousands to the low tens of millions of dollars, and that deployment can be as simple as running on a single GPU instance or on-device, or as scaled as paid API access. The gap between that reality and the narrative centered on frontier-scale systems is not merely technical; it has implications for who can build, audit, and govern AI.
A few aspects of our methodology deserve attention to better understand the scope - and limitations - of our study.
Deployment costs are taken from the recommended hardware and prices on the Hugging Face Endpoints service using AWS instances. API inference costs are taken from OpenRouter, prioritizing the developer's own price for proprietary APIs, then Google Vertex, then DeepInfra, which had the highest coverage of medium models. GPT-OSS-20B, while smaller than other models in the Medium category, is only compatible with newer GPUs, hence the higher deployment cost for "Full Precision". The cheapest available cloud GPU instance is a machine with a single NVIDIA T4, which fits models up to 7B parameters at full precision - as well as Q4-quantized GPT-OSS-20B.
For attribution in academic contexts, please cite this work as
"AI's Never Just One Thing: Different FLOPS for Different Folks", 2026.
BibTeX citation
@misc{different_flops_2026,
title={AI's Never Just One Thing: Different FLOPS for Different Folks},
author={Jernite, Yacine and Luccioni, Sasha},
year={2026},
url={https://huggingface.co/spaces/},
}