
Self-Hosting an AI Model vs Paying for the Cloud: Which One Should You Actually Pick?

March 16, 2026

There is a question that keeps coming up in developer group chats, tech forums and social media posts lately. Should you run your own AI model on your own hardware, or just pay for access through a cloud provider?

It sounds like a simple question. It is not.

The answer depends on what you are trying to do, how sensitive your data is, how many users you are serving and how much you are honestly willing to spend. Most people who ask this question have already made a mental decision. They lean toward self-hosting because it sounds free. But “free” is doing a lot of work in that sentence, and it is worth unpacking before you go any further.

This article covers what both options actually mean, why frontier models like Claude Opus and GPT-5 cannot be self-hosted at all, and what running a powerful model locally would genuinely cost you. It then walks through four real use cases so you can match the right choice to your actual situation.

What Is Self-Hosting?

Self-hosting means you download an AI model and run it on hardware you own or control. The model lives on your machine. Your data never leaves your environment. You are responsible for the setup, the maintenance and every hardware cost that comes with it.

The standard tool for this today is Ollama, which lets you pull and run open-source models with a single command from your terminal. LM Studio offers a more visual desktop interface for the same purpose. Both are free. Both work on Windows, macOS and Linux.

The models available this way are the open-source ones: Llama from Meta, Qwen from Alibaba, DeepSeek, Mistral and a growing list of others. These are genuinely capable models. For many tasks they will serve you well. But they are not the same as the frontier models you access through cloud providers, and understanding that gap matters a lot when making this decision.

The hard constraint on self-hosting is VRAM. Your GPU has a fixed amount of video memory, and the model must fit inside it entirely to run at usable speed. If the model is too large for your VRAM, it spills into your system RAM or gets processed by the CPU instead. When that happens, speed drops from usable to unusable. Think five to ten minutes to generate a paragraph rather than a few seconds. At the consumer level, the most capable card available right now is the RTX 5090 with 32GB of GDDR7 VRAM, priced at USD 1,999 at MSRP. In Malaysia and Southeast Asia, expect to pay considerably more after duties and import costs, and stock has been scarce since launch. A full desktop setup built around this card will push your total well beyond the GPU price alone.

With 32GB of VRAM you can comfortably run models up to around 30 billion parameters in a compressed format. That covers a solid range of capable open-source models. You cannot run the largest and most capable open-source models at full quality without a much bigger setup.
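As a rough sizing sketch, the VRAM a model needs can be approximated as parameter count times bytes per weight, plus some fixed overhead for the KV cache and runtime buffers. The function below and its 2GB overhead figure are illustrative assumptions, not a precise formula:

```python
def estimated_vram_gb(params_billions: float, bits_per_weight: int,
                      overhead_gb: float = 2.0) -> float:
    """Very rough VRAM estimate: weight storage plus a fixed
    overhead for the KV cache and runtime buffers."""
    weight_gb = params_billions * bits_per_weight / 8  # 1B params at 8 bits ~ 1 GB
    return weight_gb + overhead_gb

# A 30B model quantized to 4 bits fits in a 32GB card; at 16 bits it does not.
print(estimated_vram_gb(30, 4))   # 17.0
print(estimated_vram_gb(30, 16))  # 62.0
```

This is why quantization matters so much at the consumer tier: the same 30B model either fits comfortably or does not fit at all, depending on the bits per weight.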

What Is Cloud AI?

Cloud AI means you access a model running on someone else’s servers, either through a paid subscription or by calling an API and paying per volume of text processed.

Subscription products like Claude.ai Pro, ChatGPT Plus or Google Gemini Advanced give you a monthly fee in exchange for access to a capable model through a browser interface. No code required. No infrastructure to manage.

API access is for developers who want to call a model programmatically from their own application. You send a request, you get a response, and you are billed based on how much text was processed. Billing is measured in tokens, which are roughly equivalent to word fragments. The providers here include Anthropic (Claude), OpenAI (GPT) and Google (Gemini). Pricing shifts regularly, so always check the provider’s current rate card before building anything.

The key advantage is that you get the most capable models available. You also get infrastructure you do not have to manage, availability guarantees and the ability to serve many users at once without any extra work on your end.

The Part Most Guides Skip: Can You Even Self-Host Opus or GPT-5?

This is the question that most comparison guides dodge, so let’s address it directly.

No. You cannot self-host Claude Opus, GPT-5, GPT-5.2 or any other frontier model.

These models are closed-source. Anthropic and OpenAI have never released the model weights for their flagship products. The weights are what actually contain the model’s knowledge and capability. Without them, there is nothing to download, nothing to run and no setup process to follow. These models exist exclusively on their respective providers’ servers. The only way to use them is to pay for access.

This is not a hardware limitation. It is a deliberate product and business decision by the companies that built them.

But What Would It Cost If You Could?

This is worth exploring anyway, because it illustrates the scale of what these models actually require.

Frontier models like GPT-5 are estimated to have hundreds of billions of parameters, running a unified system that combines a fast model for routine queries and a deeper reasoning model for harder problems. OpenAI has not confirmed exact specifications. The computational footprint required to run such a model at speed is significant enough that it demands data-center-grade hardware, not consumer GPUs.

The industry-standard GPU for running large models at this tier is the NVIDIA H100 with 80GB of HBM3 memory. A single H100 costs between USD 25,000 and USD 40,000 depending on the variant, the vendor and current market availability. That is just the card. You still need a server to put it in, power infrastructure to run it and a cooling setup to keep it alive.

The most capable openly available model at the time of writing that approaches frontier quality is GPT-OSS-120B, which OpenAI released in August 2025 as an open-source model alongside the GPT-5 launch. This is not GPT-5. It is a separate, smaller model released specifically for self-hosting, with capability roughly comparable to older mid-tier models. Running it requires a single H100 80GB in its most optimized configuration. Running it on a consumer RTX 5090 with 32GB of VRAM requires aggressive compression and still produces slower, lower-quality output.

Here is the hardware picture laid out plainly:

Consumer tier (RTX 5090, 32GB VRAM, approx. USD 2,000 for the card alone): Capable of running models up to around 30 billion parameters at compressed quality. Real-world examples include Llama 3.3 70B at reduced quality settings, Qwen 2.5 32B at full quality or Mistral variants. Speed is usable for a single person. Not suitable for multiple concurrent users. Model quality is noticeably below frontier cloud models for complex reasoning tasks.

Near-frontier tier (1x H100 80GB, approx. USD 30,000 to 40,000 for the card): Capable of running GPT-OSS-120B in its recommended configuration. This is roughly comparable in capability to older mid-tier cloud models. Still single-user at any given moment without a much larger setup. Does not match the current frontier models available through cloud APIs.

Multi-GPU server (8x H100s in a full DGX-style server): Required to approach the compute levels that power frontier cloud models. A full 8-GPU H100 server from a system integrator costs USD 300,000 or more. This is enterprise data-center territory. Power consumption alone runs to several kilowatts. This is not a home server. This is not a small office setup.

To put that in perspective: the hardware required to run a frontier-class model locally, at the spec level these models demand, costs more than most small companies raise in their first funding round.

How Cloud Pricing Actually Works

Given the above, cloud pricing starts to look very different.

Subscription access (for individual users):

  • Claude.ai Pro: USD 20 per month
  • Claude.ai Max: USD 100 per month
  • ChatGPT Plus: USD 20 per month

These plans give you access to capable models through a chat interface with no API calls required. For individual work, this is almost always the most cost-effective option.

API access (for developers building products):

Pricing is per million tokens. As a rough reference point using current Anthropic pricing, Claude Haiku 4.5 starts at USD 1 per million input tokens. Claude Sonnet 4.5 runs at USD 3 per million input tokens. Claude Opus 4.5 sits at USD 5 per million input tokens. Output tokens cost more than input tokens across all tiers, typically around four to five times as much.

A million tokens is roughly equivalent to 750,000 words of text. For most light-to-moderate applications, actual monthly API spend starts in the range of a few dollars and scales with usage.
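The billing arithmetic can be sketched as a small helper. The workload volumes and per-million-token rates in the example are hypothetical, chosen only to illustrate the calculation; always use the provider's current rate card:

```python
def api_cost_usd(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Spend for one billing period, with rates in USD per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Illustrative light workload: 2M input and 0.5M output tokens,
# at assumed rates of USD 3 in / USD 15 out per million.
print(api_cost_usd(2_000_000, 500_000, 3.0, 15.0))  # 13.5
```

Note how the output side dominates even at a quarter of the volume, which is why output-heavy applications cost more than their raw token counts suggest.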

Always verify current pricing directly with the provider before building anything, as rates change.

The Four Use Cases

Use Case 1: You Are a Solo Developer or Hobbyist Experimenting with AI

Verdict: Either option works. Self-hosting is a reasonable starting point if you already own decent hardware.

If you have an RTX 3090, 4090 or similar GPU already, self-hosting gives you something genuinely valuable: an environment where you can experiment without watching a billing meter. Pull a model with Ollama, run it locally, break things freely. For learning how these models work, understanding prompting techniques or building personal tools that do not need frontier-level reasoning, this is a solid setup.

The practical limitation is quality. A self-hosted 7B or 13B parameter model will not match the depth of reasoning you get from a frontier cloud model. For casual experimentation, learning and prototyping, this is usually acceptable.

If you do not already own a capable GPU, buying one purely to experiment is hard to justify. A cloud subscription at USD 20 per month gives you access to a better model than anything you can run locally at that price point. Start with the cloud and revisit hardware only if you find a specific reason to go local.

Cost picture: Cloud subscriptions start at USD 20 per month. API costs for light personal usage sit in the range of a few dollars a month. GPU hardware for self-hosting starts at several hundred dollars for older cards and scales steeply from there.
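One way to sanity-check the hardware-versus-subscription question is a simple break-even calculation. This sketch ignores electricity, resale value and the quality gap, all of which matter in practice:

```python
import math

def breakeven_months(hardware_cost_usd: float, monthly_cloud_usd: float) -> int:
    """Months of cloud subscription that add up to a one-off hardware spend."""
    return math.ceil(hardware_cost_usd / monthly_cloud_usd)

# An RTX 5090 at MSRP versus a USD 20/month subscription.
print(breakeven_months(2000, 20))  # 100 months, over eight years
```

The break-even only favours hardware if you would actually use it for years, which is why already owning a capable GPU changes the answer so much.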

Use Case 2: Your Work Involves Sensitive or Regulated Data

Verdict: Self-hosting is the right call.

Healthcare records, legal documents, financial reports, client contracts. If the data you are feeding into a model is sensitive, sending it to a third-party server over the internet is a genuine problem. Even providers with strong privacy policies are external parties. You are trusting their infrastructure, their security posture and their compliance with the regulations that apply to your industry.

Self-hosting removes that exposure. The data never leaves your environment. You control the logs. You control access. You can demonstrate to auditors exactly where the data went.

This does come with the hardware cost discussed above. But for a company handling regulated data, the cost of a proper self-hosted setup is small compared to the cost of a compliance failure or a breach.

Data that cannot leave your building should not be processed by a server you do not own.

Note that model quality will be lower than cloud alternatives. The trade-off is deliberate: privacy over peak capability. For many regulated use cases, a solid open-source model running locally is more than sufficient for the actual tasks involved.

Practical note: if you are self-hosting for compliance reasons, also think carefully about how you manage model updates, user access control and audit logging. The model itself is one part of a larger data governance picture.

Cost picture: Hardware investment of USD 2,000 to 5,000 or more depending on what you need. No ongoing per-token fees. The cost is front-loaded.

Use Case 3: A Small Internal Company Team (5 to 20 People)

Verdict: Depends on your data sensitivity. Both options are viable.

If your team’s use is general internal productivity work (writing, summarising, drafting communications, generating code) and the content is not sensitive, a shared cloud subscription is usually the most practical option. A small team plan gives everyone access to capable models, requires no infrastructure work and costs a predictable amount each month.

The math shifts if your team is using AI heavily throughout the day. At high usage volumes, per-token API costs accumulate. Run some estimates against your expected usage before committing.

If your team handles sensitive internal data, a self-hosted setup on a dedicated machine makes sense. One server with a capable GPU can serve a small team without much difficulty. Concurrent usage is limited, but for five to fifteen people making occasional queries through the day rather than all simultaneously, it is workable.

The quality gap is also worth naming. If your team’s work involves nuanced analysis, complex reasoning or tasks where answer quality directly affects decisions, a frontier cloud model will outperform most self-hosted alternatives today.

Cost picture: Cloud team plans vary by provider, roughly USD 30 per user per month for subscription access. A self-hosted server with a capable consumer GPU starts at USD 3,000 to 6,000 for a dedicated machine.
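For a rough comparison, you can amortize a self-hosted server over its useful life and divide by head count. The USD 5,000 server price and 36-month amortization window below are illustrative assumptions:

```python
def cost_per_user_per_month(team_size: int, server_cost_usd: float,
                            amortize_months: int) -> float:
    """Self-hosted server cost spread across the team and its useful life."""
    return round(server_cost_usd / amortize_months / team_size, 2)

# A USD 5,000 server amortized over three years for a team of 10,
# versus roughly USD 30 per user per month for a cloud team plan.
print(cost_per_user_per_month(10, 5000, 36))  # 13.89
```

On pure cost per seat, self-hosting can come out ahead at this team size, which is why the decision usually hinges on data sensitivity and quality rather than price.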

Use Case 4: You Are Building a SaaS Product or Public-Facing Service

Verdict: Use the cloud API. Self-hosting is not a realistic option here.

The moment you open your service to external users, the concurrency problem becomes a hard blocker for self-hosting. A single consumer GPU serves one request at a time. Two users querying simultaneously means one of them waits. Fifty users means a queue that grows by the minute.
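A toy FIFO model makes the concurrency problem concrete. It assumes one GPU serving requests strictly in sequence, with a fixed 12 seconds per response (an illustrative figure):

```python
def worst_case_wait_seconds(burst_size: int, seconds_per_request: float) -> float:
    """FIFO wait for the last user in a burst when one GPU
    serves a single request at a time."""
    return (burst_size - 1) * seconds_per_request

# Fifty users arriving at once, 12 seconds per response.
print(worst_case_wait_seconds(50, 12.0))  # 588.0 seconds, nearly ten minutes
```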

Cloud APIs are built for this. They handle spikes, scale automatically and you pay for what you use. You can go from ten users to ten thousand without changing your architecture. That is what a product needs.

If your service has more than a handful of concurrent users, self-hosting requires enterprise-grade infrastructure. That is a different conversation from a consumer GPU under your desk.

There is also the question of model quality. If you are selling a product, the quality of the AI output directly affects your customer’s experience. Frontier cloud models give you the best available output today. That matters when your product’s reputation depends on the results it delivers.

Cost picture: API costs scale with usage. For a new product with low traffic, costs are minimal. As the product grows, you manage costs through prompt optimisation, response caching and choosing the right model tier for each task. This is a much more manageable problem than building and maintaining your own GPU cluster.

Why Open-Source Models Are Not the Same as Frontier Models (And Why That Gap Matters)

There is a persistent misconception worth clearing up directly. People assume that running DeepSeek or Llama locally is basically the same as using Claude or GPT, just without the subscription cost. It is not.

Open-source models are impressive for what they are. The gap between what was open-source two years ago and what is open-source today is enormous, and the trajectory is genuinely exciting. But the frontier cloud models are ahead for a reason.

The organizations running these frontier models have invested billions of dollars in training compute, data curation, human feedback and infrastructure that no open-source project has been able to fully match. The results show up in reasoning quality, instruction following, nuanced judgment calls and performance on complex multi-step tasks.

For simple, well-defined tasks, the gap is smaller. For tasks that require judgment, nuance, complex reasoning or handling unexpected inputs gracefully, the gap is still significant. The right model for your use case depends on what kind of work you are actually doing.

What About Renting GPU Compute in the Cloud?

There is a middle option worth mentioning. If you want more control than a standard cloud API but do not want to buy hardware, you can rent GPU instances from providers like RunPod, Lambda Labs or Vast.ai. These services let you spin up a machine with an H100 or A100, deploy a self-hosted model and pay by the hour.

An H100 80GB rents for roughly USD 3 to 5 per hour from most providers, with variation based on demand and commitment length. This gives you privacy and control over the model without the capital expenditure of owning hardware. The trade-off is that the operational overhead still falls on you. You are managing deployments, handling downtime and keeping the environment running.

This path makes the most sense for teams that need data sovereignty, want to fine-tune or modify open-source models and have the engineering capacity to manage the infrastructure.
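A quick rent-versus-buy check is useful here: how many rental hours equal the purchase price of the card? This sketch ignores the server, power and ops costs that come with ownership:

```python
def rental_hours_to_match_purchase(card_price_usd: float,
                                   hourly_rate_usd: float) -> float:
    """Hours of cloud GPU rental that equal buying the card outright."""
    return card_price_usd / hourly_rate_usd

hours = rental_hours_to_match_purchase(30_000, 4.0)
print(hours)       # 7500.0 hours
print(hours / 24)  # 312.5 -> about ten months of continuous use
```

If your workload runs only a few hours a day, renting stays cheaper than owning for years; if it runs around the clock, the maths flips within the first year.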

The Summary

The honest version of this comparison is less exciting than either camp tends to admit.

Self-hosting makes the most sense when data must stay under your control, usage is predictable with a small number of concurrent users, you already own capable hardware and you can accept that model quality will be lower than the best available cloud models.

Cloud API access makes the most sense when you need the best possible model quality, usage is unpredictable or large, you are building something other people will use and you want to focus on your product rather than infrastructure.

Frontier models like Claude Opus and GPT-5 cannot be self-hosted under any circumstances. Their weights are not publicly available. The only way to use them is through their providers’ cloud APIs. Even if the weights were somehow available, running them at the hardware level they require would cost hundreds of thousands of dollars in server infrastructure alone.

The people who benefit most from self-hosting are developers experimenting without a billing meter, small teams with genuine data privacy requirements and organizations willing to accept a quality trade-off in exchange for full control over their data pipeline.

Everyone else, especially developers building products, teams who need consistent quality for important decisions and individuals who just want a capable assistant, is better served by paying for cloud access and focusing their energy on what they are actually building.

The question is never which approach is better in general. The question is which one fits what you are actually doing today.

Found this helpful?

If this article saved you time or solved a problem, consider supporting — it helps keep the writing going.

Originally published on Medium.

— Hafiq Iqmal