Over the last five years – since the release of ChatGPT – the term "Artificial Intelligence" has become increasingly identified with one particular approach: very large proprietary models, trained by companies with very deep pockets on as much data as they can gather. These are then commercialized through human-like chatbots and tools that play a double role: delivering the models to users and collecting further data to improve them.
The paradigm has pushed, and continues to push, automation boundaries in economically significant sectors - systems like Claude Opus and GPT 5.2 CodeX have found undeniable success with software developers working for the technology companies that make up the majority of the S&P 500. Yet this scale-first approach represents the version of AI that best suits these same companies, which control the majority of computational resources and have uniquely centralized access to user data, giving them strong incentives to keep attention focused on these compute-hungry versions of AI even if it means warping public perception - including among policymakers and potential customers - of actual costs and viable alternatives.
In practice, however, not everything in AI is a multi-trillion-parameter behemoth with a 9-figure training budget. Different versions of the technology can often be a much better fit for the specific needs of the organizations building these systems and building with them. We outline some of these categories below:
The graph below indicates estimated training cost (USD) for open-weight and task-specific models, with reference lines for selected commercial models. Marker shape indicates confidence (see methodology for details).
Click a point on the chart to see model details in the report card to the right.
Most definitions of "Artificial Intelligence" cover significantly more than just chatbots; a very wide range of applications of Machine Learning to various categories of data and modalities - developing models for tasks like OCR, speech and image processing, or scientific research - would fall under that heading. Notably, many models trained over five years ago - before the explosion of compute resources - remain in significant use today as the backbone of diverse AI-powered applications: OpenAI's CLIP and Whisper for image and speech processing, variations on BERT or SentenceLM for run-of-the-mill text embedding applications, and models like Wav2vec, which are routinely fine-tuned to adapt to new languages. These models were trained at the beginning of the shift toward model scaling, which means they were often the largest models of their time in terms of parameters or dataset size. With today's resources, their training costs would be at most a few thousand dollars, if that. Nevertheless, they remain broadly useful and among the most-downloaded artifacts on the Hugging Face Hub.
Models trained in recent years for specific tasks and domains also tend to be more parsimonious with training compute than ones trained for "general-purpose" use. Corporate and individual contributors of open-weight models, companies with defined use cases, and research organizations continue to train models that meet their specific needs (or those of their communities). While growing access to compute does play a role in unlocking new use cases for the technology (generalized OCR and genetics stand out as two recent successes - e.g. DeepSeek-OCR), in most cases the compute expenses remain limited, and success hinges more on access to the right training data and on appropriate problem modeling. Artificially framing a problem as something a language model can do is often a waste of resources: while a system like ChatGPT could technically output a representation of a protein's 3D structure, training a separate model for that purpose, with a more appropriate output format than free text, is far easier and more efficient than trying to add the capability to a general-purpose model.
Dedicated models such as Google's AlphaFold3 and the open reproduction OpenFold3 represent the state of the art for the latter task; OpenFold3's benchmark scores are slightly lower but within a standard deviation, and its open nature has already allowed private companies and other research efforts to fine-tune it for their purposes. Given the greater overall efficiency in both training data and especially model size, inference compute costs are also significantly less of a concern for these approaches, with costs comparable to running a small or medium language model depending on the model. For example, recent OCR models can process tens of thousands of pages for a few dozen dollars.
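To give a sense of scale, the back-of-the-envelope sketch below turns the order-of-magnitude figures above ("a few dozen dollars" for "tens of thousands of pages") into a per-page cost; the exact numbers are illustrative, not measured.

```python
# Rough per-page cost of OCR with a small dedicated model, using the
# order-of-magnitude figures quoted above (both are illustrative).
total_cost_usd = 30        # "a few dozen dollars"
pages_processed = 30_000   # "tens of thousands of pages"

cost_per_thousand_pages = total_cost_usd / pages_processed * 1_000
print(f"~${cost_per_thousand_pages:.2f} per 1,000 pages")
# ~$1.00 per 1,000 pages, i.e. about a tenth of a cent per page
```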
While there are too many approaches to and applications of AI to provide an exhaustive list, we were able to review recent publications with sufficient information to estimate the compute cost of training for a few different use cases:
As outlined in the previous section, much of current AI takes forms other than (Large) Language Models. Still, given the prevalence of text in digital records and interactions, the latter have come to represent a significant portion of the AI landscape. Different size categories of LMs, however, have very different development and deployment requirements and can lead to drastically different approaches to building AI systems - and, notably, to a much more or much less concentrated ecosystem depending on which are prioritized.
A world of ubiquitous AI can follow one of two main organizing principles. The first relies on widespread access to large cloud-bound models, which requires centralizing resources and a constant flow of (sensitive and valuable) data from users' devices to large data centers controlled by the model providers. Alternatively, building AI systems around models that can run directly on users' devices provides a more privacy- and security-conscious path to AI-supported functionality without depending as much on centralized compute.
The latter has grown into a more plausible proposition over the last three years, as smaller ("smol") models have become a competitive approach to building AI systems, trailing the SOTA of large models by a few months on some benchmarks and overtaking them on others. Qwen3 4B Instruct and its 8B counterpart can run on even low-end GPUs, and are now commonly adopted for a wide range of applications; Qwen3 8B in particular has nearly 1,000 reported fine-tunes on Hugging Face across different modalities and applications, including models like NVIDIA's Orchestrator or AI2's Molmo that are popular in their own right. At the same time, we see models with just 1.5B parameters - small enough for "edge" devices, notably smartphones - do as well as the last generation of large commercial models on coding tasks in the case of Weibo VibeThinker, and even surpass the current state of the art on translation in the case of Tencent HY-MT. Even in domains as conceptually complex as mathematical theorem proving, 4B models like QED-Nano show remarkable performance. We increasingly see companies such as IBM bet on these model sizes for their flagship AI products to support particular commercially relevant applications of the technology.
Smaller models really shine in settings where AI is applied to a well-defined use case, and we're seeing mounting evidence that a little extra dataset and fine-tuning work can make them competitive with the largest alternatives at a fraction of the cost, without incurring the same competitiveness, liability, security, and sustainability risks.
Unsurprisingly, the smallest category of models is also the most financially and computationally accessible to train; it is also the category for which we have the most information on those costs. Hugging Face's SmolLM2 and SmolLM3 and IBM's Granite 3.0-2B and Granite 3.0-8B provide wall clock time, training infrastructure, and/or direct cost estimates for their training runs in addition to training data sizes, with costs ranging from $250,000 for SmolLM2 to $1,700,000 for Granite 3.0-8B. Looking at the differences in training dataset sizes, this likely implies a training compute cost of about $4,500,000 for Qwen3 8B, probably the most compute-intensive model in this category.
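As an illustration of how this kind of extrapolation works, the sketch below scales a known training cost by the ratio of (parameters × training tokens), following the common 6·N·D compute approximation with a fixed price per FLOP. The reference dataset sizes are assumptions on our part (roughly 12T tokens for Granite 3.0 and 36T for Qwen3), so the output should be read as a ballpark figure in the same range as the ~$4,500,000 estimate above, with the gap reflecting differences in hardware and training efficiency.

```python
# Back-of-the-envelope extrapolation: assume training compute cost scales
# roughly with (parameter count x training tokens), i.e. the ~6*N*D FLOPs
# approximation with a fixed price per FLOP.

def extrapolate_cost(ref_cost_usd, ref_params_b, ref_tokens_t,
                     new_params_b, new_tokens_t):
    """Scale a known training cost to a new model / dataset size."""
    return ref_cost_usd * (new_params_b * new_tokens_t) / (ref_params_b * ref_tokens_t)

# Reference: an 8B model trained on ~12T tokens for $1.7M (the Granite 3.0-8B
# disclosure cited above; the token count is our assumption), extrapolated to
# an 8B model trained on ~36T tokens (assumed for Qwen3 8B).
estimate = extrapolate_cost(
    ref_cost_usd=1_700_000, ref_params_b=8, ref_tokens_t=12,
    new_params_b=8, new_tokens_t=36,
)
print(f"Estimated training compute cost: ${estimate:,.0f}")  # ~$5.1M
```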
Since smaller models are designed to be developed and leveraged in a somewhat different fashion from large and generalist commercial APIs, it can also be useful to consider how the costs are distributed along the whole development chain. The Weibo VibeThinker model provides a textbook case: it is a model designed to excel at specific coding and tool-calling tasks. It was fine-tuned to that end on top of the domain-focused model Qwen-2.5-Math, which was itself the product of continued domain training from the general-purpose Qwen-2.5. Given the information we have about the model sizes and the training data sizes at each stage, we can estimate the compute cost of the general-purpose base model at around $410,000, followed by another $25,000 of domain adaptation to get to Qwen-2.5-Math, and a final $7,800 disclosed by the authors for the task-specific training. Similarly, QED-Nano only requires an additional $28,000 on top of its base model to reach generally competitive performance on mathematical applications. In practice, different categories of actors will start at different stages in the process; as long as they can rely on a sufficiently thriving ecosystem of open-weight general or domain-level models, they can train task-specific models for a fraction of the cost of the initial model training.
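The sketch below lays out this staged structure with the estimates quoted above, and shows what an actor entering the chain at the last stage actually pays.

```python
# Staged training costs along the VibeThinker chain, using the estimates
# quoted in this section.
pipeline_stages = [
    ("General-purpose base (Qwen-2.5)",        410_000),
    ("Domain adaptation (Qwen-2.5-Math)",       25_000),
    ("Task-specific fine-tune (VibeThinker)",    7_800),
]

total = sum(cost for _, cost in pipeline_stages)
print(f"Full chain, trained from scratch: ${total:,}")

# An actor starting from the open-weight domain model only pays the last stage.
final_stage_cost = pipeline_stages[-1][1]
print(f"Starting from an open domain model: ${final_stage_cost:,} "
      f"({final_stage_cost / total:.1%} of the full chain)")
```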
Models can take advantage of scale without depending as strongly on cloud compute for inference. Models up to 32B parameters have shown that they can benefit from the same scales of datasets (up to 36T tokens) as larger versions to boost their benchmark performance. These models have been particularly important in the development, deployment, and governance of AI for several reasons:
While the cost of training mid-range LMs can vary significantly, training compute costs are likely to fall in the $1-20M range. The most reliable information on training cost comes from OLMo 3.1, whose developers report not only model architecture and training data size but also the wall clock time and training infrastructure they used, putting the likely cost at about $2,750,000, assuming an overall rate of $2/hour per H100, which reflects current rates for large bulk reservations in the right regions on commercial platforms. At the high end of the range we find Qwen3-32B, which was trained on significantly more data; assuming a similar cost per unit of training data as OLMo - given the similarities in model architecture - would put its training compute cost at $16,000,000. Using this estimate, Qwen's comparison of the training costs of their 3B/30B MoE model Qwen3-30B-A3B to the 32B version, and linearly extrapolating to the similarly structured NVIDIA Nemotron Nano, gives us a training compute budget of $1,300,000 for the latter model. This is at the lower end of cost for performance, without even accounting for the fact that Nemotron was reportedly trained at lower precision, which would presumably further reduce the cost.
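For reference, this kind of estimate reduces to simple arithmetic once wall clock and infrastructure are disclosed: GPUs × hours × hourly rate. The sketch below uses the $2/H100-hour rate assumed above; the GPU count and duration are illustrative placeholders chosen to land near the OLMo-scale figure, not the developers' actual disclosure.

```python
# Training compute cost from disclosed wall clock and infrastructure:
#   cost = (number of GPUs) x (training hours) x ($ per GPU-hour)

def training_compute_cost(num_gpus, wall_clock_hours, usd_per_gpu_hour=2.0):
    """$2/hour per H100 matches the bulk-reservation rate assumed above."""
    return num_gpus * wall_clock_hours * usd_per_gpu_hour

# Illustrative placeholders: 1,024 H100s running for about 8 weeks.
cost = training_compute_cost(num_gpus=1024, wall_clock_hours=8 * 7 * 24)
print(f"Estimated training compute cost: ${cost:,.0f}")  # ~$2.75M
```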
Unfortunately, while models like GLM-4.7-Flash, OpenAI's GPT-OSS-20B, and Gemma-3 are certainly meaningful examples of this size category, we lack any information about the size of their training datasets, which would be a minimum requirement to extrapolate from the available information on more transparent models. However, assuming that their costs are in line with what we know about the more transparent alternatives, training models in this category amounts to a 7-to-low-8-figure R&D budget item - one of many we can expect to find on the books of the hundreds if not thousands of companies that already make similar investments in their own software development.
The current focus on large generalist models as the dominant (or at least most visible) paradigm of AI development started with GPT-3 - a 175B-parameter model trained on a corpus of 600B tokens of Internet text for a few million dollars - at a time when common practice was to train significantly smaller models on more curated datasets. Since then, companies like OpenAI, Anthropic, Google, and more recently xAI have kept increasing the size and training compute costs of their flagship models, with reports of training runs costing several hundred million dollars and billions of dollars spent yearly on compute for model development. These increasingly lavish compute expenses have allowed the most well-resourced companies to run an exclusive race for the top scores on evaluations aimed at measuring the models' general coverage of information about the world, increasingly complex software engineering tasks, and increasingly robust multi-stage "tool use" - with the top position changing hands every few months and granting its holder a boost in attention and interest from potential customers.
The largest of Large Language Models have two main appeals. First, they are trained to be versatile out of the box, with popular rankings increasingly requiring models to score highly on as many benchmarks as possible. This allows prospective users to start using them with a reasonable chance of getting some automation value without any specific Machine Learning development skills, which drastically increases their prospective customer base. Second, they can get value out of inference-time compute through "reasoning" (GPT-5.2 achieves its best scores on the "extra-high" setting), which means that in most cases users can increase an AI system's likelihood of succeeding at a task by simply turning up the cost dial. They also present one major constraint, with potentially dire consequences for competition in the space: they require access to data-center-grade GPUs to run at all, and to extensive compute infrastructure to be served at scale. This makes it all the more notable that most US developers of such models either operate their own hyperscale infrastructure or have hyperscalers significantly represented in their capitalization tables. Within this overall family of models, however, we can still distinguish categories with different cost dynamics:
Information about the training costs of the largest models is especially difficult to estimate given the secrecy of developers, particularly for closed API-only models developed by large companies. Estimations based on information about companies' compute infrastructure and reports of total FLOPs put two of 2025's large models at $218,000,000 and $334,000,000 in compute costs (Grok 3 and GPT 4.5 respectively, according to estimations by Epoch AI), and Claude 3.5 cost "a few 10M$" to train according to Anthropic CEO Dario Amodei. Any further investigation - or even educated guesses - is hampered by the fact that companies to date have failed to disclose even basic information about the models' sizes, architectures, or training dataset token counts.
We have a somewhat better sense of the training cost of the largest open-weight models. DeepSeek v3 helpfully provided not only its training dataset size but also the wall clock time for training; factoring in the extra data added for DeepSeek v3.2 would put the training compute cost of the latest version of the model at around $12M. US start-up Arcee also recently disclosed a total cost of under $20M for its 400B-parameter model Trinity-Large, including not just training compute but also data and personnel costs. Extrapolating this information to the largest and arguably strongest open-weight model to date on aggregated benchmarks, Kimi-K2.5, we can estimate up to $35M of training compute (up from $20M for Kimi-K2) for current top-of-the-line benchmark success.
At the lower end of training cost (or higher end of training efficiency), Qwen3 Next and Qwen-Coder-Next show benchmark performance that approaches top models on software engineering and coding tasks for estimated training compute costs in the range of $1.5-2M. This represents a factor of roughly 10x between the least and most training-compute-intensive top-of-the-line open models, and another 10x factor again for fully closed models. The first jump can easily be attributed to the range of model sizes. The source of the difference between open- and closed-weight models is less clear, but with user bases ranging from tens of millions to billions of monthly active users, it is likely that integrating usage data into the training process significantly raises the costs.
Beyond training, the cost of running models varies widely. The plot below compares deployment costs ($/hour) and inference costs ($/1M tokens) across selected open-weight models and proprietary APIs.
Select hosting precision or inference pricing mode to see costs. Left: log-scale $/hour. Right: log-scale $/M tokens.
The majority of the models in the Task and Domain AI and "Smol" Language Models categories can be used without relying on cloud compute services at all. Translation models like Tencent HY-MT and even general-purpose models like Qwen3 4B Instruct are small enough to run on a phone, a low-end GPU, or even a generic personal computer CPU when they only need to handle a few examples at a time in the course of individual interactions with a system. Encouragingly, important layers of the web infrastructure are starting to take advantage of these capabilities, with technologies like WebGPU enabling web applications to rely on users' local compute for models up to 20B parameters. While cloud compute can still be an attractive option for using models in these categories for significant workloads or scheduled scripts, the expenses remain closer to those of traditional data processing loads, e.g. up to a few dozen dollars to process tens of thousands of pages with DeepSeek-OCR. Models in the Mid-Range LMs category follow similar dynamics, with options like GLM 4.7 Flash or NVIDIA Nemotron Nano offering great value in local deployment settings - albeit in a way that is more weighted toward cloud use, with online GPUs at the more accessible end of the price range supporting more substantial use loads.
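As a concrete illustration of what local use looks like in practice, the minimal sketch below loads a small open-weight model with the transformers library and runs it on whatever hardware is available locally; the checkpoint identifier is an assumption on our part and should be checked against the Hugging Face Hub.

```python
# Minimal local-inference sketch: no cloud API involved, just a local
# checkpoint running on the user's own CPU or GPU.
# Requires `pip install transformers accelerate torch`.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen3-4B-Instruct-2507",  # assumed checkpoint id - verify on the Hub
    device_map="auto",                    # uses a local GPU if present, else CPU
)

messages = [{"role": "user", "content": "Summarize the key terms of this contract clause: ..."}]
result = generator(messages, max_new_tokens=128)
print(result[0]["generated_text"][-1]["content"])  # the model's reply
```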
Because these modes of deployment are so different from the dominant paradigm of centrally served AI, they often get left out of conversations about compute capacity for AI. In practice, even smaller AI models still increase the compute load of software significantly, and will require continued progress on hardware to reach their full potential - but this would look more like previous evolutions of computing paradigms to better accommodate the needs of graphic design or game engines. We are seeing promising efforts in this direction, which could help move AI compute expenses from an extremely concentrated commodity that encourages speculation to a more distributed and grounded resource.
As previously mentioned, we define our category of the Largest Language Models - both open-weight and proprietary versions - by their dependence on data-center-grade GPUs and cloud compute to function. Additionally, while Mid-Range LMs can be run on personal compute resources, heavy use loads might be better addressed by some form of cloud deployment. However, cost and infrastructure requirements vary significantly across model categories within the general paradigm of cloud-supported inference.
We provide a comparison of costs for selected models above using two categories of metrics, reflecting the two primary ways of using medium to large models. First, users can deploy any open-weight model themselves by renting compute instances from cloud providers. This approach provides greater control over data flow and flexibility for managing variable loads, but also requires more work to manage resource allocation. It also allows developers to choose between using a full-precision version of a model or a quantized alternative that may sacrifice some performance but reduces VRAM requirements - and thus the rental cost of the required instances - significantly. In this work, we provide deployment costs for both full precision and 4-bit quantization.
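A first-order way to reason about which instance a model needs is to estimate the memory taken by the weights alone from parameter count and precision, as in the sketch below; real deployments also need headroom for activations and the KV cache, so these are lower bounds.

```python
# Rough memory needed just to hold the model weights, by precision.
BYTES_PER_PARAM = {"fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params_billion, precision):
    """Gigabytes of memory occupied by the weights alone."""
    return num_params_billion * BYTES_PER_PARAM[precision]

for precision in ("fp16/bf16", "int4"):
    gb = weight_memory_gb(32, precision)  # e.g. a 32B-parameter model
    print(f"32B parameters @ {precision}: ~{gb:.0f} GB of weights")
# fp16/bf16: ~64 GB -> multiple GPUs or a large instance
# int4:      ~16 GB -> fits a single 24 GB GPU with room left for the KV cache
```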
Additionally, some companies take on the technical work of hosting selected models and managing their workloads, and charge developers for use on a per-token basis. This is the only pricing mode available for proprietary models (ChatGPT, Gemini, Claude, etc.), and platforms like OpenRouter and Hugging Face also provide access to a catalog of inference providers for a selection of popular open-weight models. While this means we can't directly compare the deployment costs of all open-weight models with those of proprietary APIs, the open-weight models that do have hosted options can provide an anchor for high-level comparisons - assuming, of course, that proprietary models are priced at a rate that neither is predatory nor reflects monopoly pricing.
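The two pricing modes can be compared on a given workload with the simple sketch below; the per-token and hourly prices are illustrative placeholders rather than figures from our comparison, and the main point is that a rented instance is a fixed cost while per-token pricing scales directly with use.

```python
# Comparing per-token API pricing to renting a dedicated instance.
# All prices below are illustrative placeholders.

def api_cost(tokens_per_month, usd_per_million_tokens):
    return tokens_per_month / 1e6 * usd_per_million_tokens

def self_hosted_cost(hours_per_month, usd_per_hour):
    return hours_per_month * usd_per_hour

workload = 2_000_000_000  # 2B tokens per month
print(f"Hosted API:  ${api_cost(workload, usd_per_million_tokens=0.50):,.0f}/month")
print(f"Self-hosted: ${self_hosted_cost(hours_per_month=730, usd_per_hour=1.80):,.0f}/month")
# Which option wins depends on utilization: the instance costs the same whether
# it is idle or saturated, while API costs track actual token throughput.
```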
Even the limited information we have shows a chasm between the costs of different LLMs, even in cases where benchmark performance is similar. Within the Largest Language Models class, OpenAI's GPT-OSS-120B and MiniMax M2.5 are notable examples of hosted models that are significantly cheaper than proprietary alternatives, even though the latter has benchmark performance similar to models that charge 10 or 20 times more per output token. Self-deployment costs show a similar range, with a price factor of up to 20 between quantized open-weight models, and over 100 between the cheapest and most expensive options overall.
Benchmark numbers by themselves do not tell the entire story of model performance, and we have already noted the appeal of models that offer more general off-the-shelf utility even in cases where they are significantly more expensive; but these numbers do show that deploying AI does not have to require the compute expenses that come with automatically relying on the best-known commercial offering. At the very least, it follows that any public conversation about widespread use of AI would be well served by demanding some efficiency of system developers, to ensure that models do not consume ten, a hundred, or a thousand times more resources than they need to achieve a particular goal - especially when ownership structures of compute infrastructure may disincentivize some developers from prioritizing efficiency.
In the sections above, we showed that training and deployment costs across all of the categories we have surveyed - task and domain models, small and medium language models, and large open-weight and proprietary systems - vary by orders of magnitude. This report provides quantitative estimates of those costs to show that the majority of commercially and scientifically significant AI development does not require 9-figure training runs or exclusive access to data-center-scale infrastructure. Open science and open-source developers have demonstrated that models that are competitive on a wide range of benchmarks can be trained for costs ranging from the low thousands to the low tens of millions of dollars, and that deployment can be as simple as running on a single GPU instance or on-device, or as scaled as paid API access. The gap between that reality and the narrative centered on frontier-scale systems is not merely technical; it has implications for who can build, audit, and govern AI.
A few aspects of our methodology deserve attention to better understand the scope - and limitations - of our study.
Deployment costs are taken from the recommended hardware and prices on the Hugging Face Endpoints service using AWS instances. API inference costs are taken from OpenRouter, prioritizing the developer's own price for proprietary APIs, then Google Vertex, then DeepInfra, which had the highest coverage of medium models. GPT-OSS-20B, while smaller than other models in the Medium category, is only compatible with newer GPUs, hence the higher deployment cost for "Full Precision". The cheapest available cloud GPU instance is a machine with a single NVIDIA T4, which fits models up to 7B parameters at full precision - as well as Q4-quantized GPT-OSS-20B.
For attribution in academic contexts, please cite this work as
"AI's Never Just One Thing: Different FLOPS for Different Folks", 2026.
BibTeX citation
@misc{different_flops_2026,
title={AI's Never Just One Thing: Different FLOPS for Different Folks},
author={Jernite, Yacine and Luccioni, Sasha},
year={2026},
url={https://huggingface.co/spaces/},
}