How to run Ollama on a VPS — and actually get good performance

Local LLM VPS Setup 12 min read · Updated March 2026

Most guides tell you to install Ollama on localhost. This one shows you how to run it 24/7 on a remote VPS, expose it securely, and get real token speeds — without spending a fortune on cloud GPUs.

Said · ilinuxu.com
Sysadmin, self-hosting everything since before it was cool
In this article
  1. Why run Ollama on a VPS at all?
  2. Choosing the right VPS — the RAM problem
  3. Server setup: Ubuntu 24.04 baseline
  4. Installing and configuring Ollama
  5. Securing the API with Nginx reverse proxy
  6. Real benchmark numbers from my setup
  7. Connecting Open WebUI as your frontend

Why run Ollama on a VPS at all?

Running Ollama on your laptop works fine for testing. But the second you close the lid, your model goes offline. If you want a persistent AI assistant — accessible from your phone, from a browser tab, from n8n automations, or from a client demo — you need it running somewhere that never sleeps.

That’s the VPS case. And it’s more practical than most people think. You don’t need a GPU server. CPU inference on a well-specced VPS is genuinely usable for smaller models — Llama 3 8B, Mistral 7B, Gemma 2 9B — with response times that won’t make you want to throw your laptop out the window.

Before we go further, one honest caveat: CPU inference is real and usable, not magical. A 70B model on a CPU VPS will be painful. This guide is built around the 7B–9B sweet spot, where CPU inference actually makes sense.

Choosing the right VPS — the RAM problem

Here’s what nobody tells you in the generic Ollama tutorials: the bottleneck isn’t CPU speed. It’s RAM. Ollama loads the entire model into memory. A 7B model in Q4 quantization needs roughly 4.5–5GB of RAM just to load. Add your OS overhead and you’re at 6–7GB minimum before a single token is generated.

This means a 4GB VPS is a dead end before you even start. You need at least 8GB, ideally 16GB if you want to run two models or a larger one without swapping to disk (which kills performance entirely).
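You can sanity-check this yourself before renting anything. A rough rule of thumb (my own back-of-envelope, not an official Ollama formula): a Q4 quant needs about 0.6 bytes per parameter, plus roughly 1.5GB for the KV cache, OS, and runtime overhead.

```shell
# Rough RAM estimate for a Q4-quantized model.
# Assumption: ~0.6 bytes/param at Q4, plus ~1.5GB of KV cache + overhead.
estimate_ram_gb() {
  awk -v p="$1" 'BEGIN { printf "%.1f GB\n", p * 0.6 + 1.5 }'
}

estimate_ram_gb 7   # 7B model -> ~5.7 GB
estimate_ram_gb 9   # 9B model -> ~6.9 GB
```

Run it for whatever model size you're eyeing; if the estimate is within a gigabyte or two of your VPS's total RAM, go up a tier.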

What you need before starting
  • A VPS with minimum 8GB RAM — 16GB recommended for comfort
  • Ubuntu 22.04 or 24.04 (this guide uses 24.04)
  • Root or sudo access
  • A domain pointed to your VPS IP (optional but recommended for HTTPS)
  • About 30 minutes
🖥️
The VPS I use for this exact setup Contabo Cloud VPS M — 8 vCPUs, 16GB RAM, 200GB NVMe for €8.99/month. Best RAM-per-euro ratio I’ve found for LLM workloads. I’ve been running my stack on Contabo for over a year.
Get Contabo VPS M →

I’ve tested Hetzner, DigitalOcean, and Vultr for this same workload. Hetzner is great (and I’ll write a full comparison soon), but at this RAM tier Contabo wins on pure value. For LLMs, RAM is king — optimize for it.

Server setup: Ubuntu 24.04 baseline

1 First, SSH into your fresh VPS and do the standard hardening. Don’t skip this — an open Ollama API on a public IP is a free inference endpoint for anyone who finds it.

bash — initial server setup
# Update everything first
apt update && apt upgrade -y

# Create a non-root user
adduser said
usermod -aG sudo said

# Basic firewall — allow SSH, HTTP, HTTPS only
ufw allow OpenSSH
ufw allow 80/tcp
ufw allow 443/tcp
ufw enable
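The block above covers the basics. If you log in with SSH keys (you should), you can go one step further and disable password authentication entirely — but only after confirming key-based login works for your new user, or you'll lock yourself out:

```shell
# Disable SSH password logins -- do this ONLY after verifying that
# key-based login works for your non-root user.
sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
systemctl restart ssh
```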

2 Install Nginx now — we’ll use it later to proxy Ollama’s API behind HTTPS. Exposing port 11434 directly to the internet is something you absolutely do not want to do.

bash
apt install -y nginx certbot python3-certbot-nginx

Installing and configuring Ollama

3 Ollama’s install script handles everything — binary, systemd service, the works. One line:

bash
curl -fsSL https://ollama.com/install.sh | sh

After install, Ollama runs as a systemd service listening on 127.0.0.1:11434 by default. That’s correct — it should only be accessible locally. We’ll proxy it through Nginx.
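Before pulling anything, it's worth confirming the service is up and bound to loopback only:

```shell
# Service should report "active"
systemctl is-active ollama

# The listening socket should show 127.0.0.1:11434, NOT 0.0.0.0:11434
ss -tlnp | grep ':11434'
```

If you ever see `0.0.0.0:11434` here, someone (or some env var like `OLLAMA_HOST`) has opened the API to the world — fix that before going further.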

4 Pull your first model. I’m using Llama 3.2 3B for this demo because it’s fast on CPU, but the steps are identical for any model:

bash
# 3B is fast — good for testing your setup first
ollama pull llama3.2:3b

# Once confirmed working, pull the real one
ollama pull llama3.1:8b

# Verify both models are installed
ollama list

5 Test it locally before touching Nginx:

bash
ollama run llama3.1:8b "What is self-hosting in one sentence?"

# Or via the API directly
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Hello", "stream": false}'

Securing the API with Nginx reverse proxy

6 Create an Nginx config that proxies /ollama/ to the local Ollama port. Replace yourdomain.com with your actual domain:

/etc/nginx/sites-available/ollama
server {
    listen 80;
    server_name yourdomain.com;

    location /ollama/ {
        proxy_pass http://127.0.0.1:11434/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Important for streaming responses
        proxy_buffering off;
        proxy_read_timeout 300s;
        chunked_transfer_encoding on;
    }
}
bash — enable and get SSL
ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
nginx -t && systemctl reload nginx
certbot --nginx -d yourdomain.com

Now your Ollama API is accessible at https://yourdomain.com/ollama/api/generate over HTTPS. No raw port exposed, no unencrypted traffic. One thing to be clear about: this gives you encryption, not authentication — anyone who discovers the path can still call the API. If that worries you, add Nginx basic auth or an IP allowlist to the location block.
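A quick sanity check from your laptop — `/api/tags` lists the models you've pulled (jq is optional, it just makes the output readable):

```shell
# List installed models through the HTTPS proxy
curl -s https://yourdomain.com/ollama/api/tags | jq -r '.models[].name'
```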

Real benchmark numbers from my setup

These are actual numbers from my Contabo VPS M (16GB RAM, 8 vCPU, AMD EPYC). Not cherry-picked. Not ideal conditions — these are normal daytime numbers with other services running:

Tokens/second on Contabo VPS M (CPU inference, Q4 quant)
Llama 3.2 3B
~18 t/s
Llama 3.1 8B
~7 t/s
Mistral 7B
~6 t/s
Gemma 2 9B
~4 t/s

7 tokens per second on Llama 3.1 8B is genuinely usable for a personal assistant or automation use case. It’s not GPT-4 speed, but you’re not paying per 1K tokens either — it’s flat-rate infrastructure you own.
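You can reproduce these numbers on your own box without any extra tooling: with `"stream": false`, `/api/generate` returns `eval_count` (tokens generated) and `eval_duration` (in nanoseconds), which is all you need:

```shell
# Ask for a completion, then compute tokens/sec from the timing
# fields Ollama reports in the response.
RESP=$(curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Explain DNS briefly.", "stream": false}')

# tokens/sec = eval_count / (eval_duration converted to seconds)
echo "$RESP" | jq -r '"\(.eval_count / (.eval_duration / 1e9) | floor) tokens/sec"'
```

Run it a few times — the first call includes model load time, so throw that one away.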

Connecting Open WebUI as your frontend

Ollama’s API is great, but you’ll want a proper chat interface. Open WebUI connects directly to Ollama and gives you a ChatGPT-style UI that’s entirely self-hosted. Here’s the one-liner using Docker:

bash — run Open WebUI
docker run -d \
  --name open-webui \
  --network host \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Then add another Nginx server block pointing your subdomain (e.g. chat.yourdomain.com) to port 8080. Run Certbot again. Done — you now have a private AI chat interface running 24/7 on your own server.
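If it helps, here's a sketch of that server block — `chat.yourdomain.com` and port 8080 are the values assumed above, so substitute your own. The websocket headers matter: Open WebUI streams chat responses over websockets, and a plain proxy block will break them.

```nginx
# /etc/nginx/sites-available/open-webui
server {
    listen 80;
    server_name chat.yourdomain.com;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Open WebUI streams over websockets
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```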

📊
Thinking about upgrading when your workload grows? When you start running multiple models or heavier automation, the VPS L (32GB RAM) is the natural next step. Same Contabo pricing model — still the best value for RAM-intensive workloads.
Compare Contabo VPS plans →

What’s next on ilinuxu.com

Part 2: Connecting your Ollama VPS to n8n for AI-powered automations — no API costs, no rate limits.

Subscribe to get it →
