Most guides tell you to install Ollama on localhost. This one shows you how to run it 24/7 on a remote VPS, expose it securely, and get real token speeds — without spending a fortune on cloud GPUs.
- Why run Ollama on a VPS at all?
- Choosing the right VPS — the RAM problem
- Server setup: Ubuntu 24.04 baseline
- Installing and configuring Ollama
- Securing the API with Nginx reverse proxy
- Real benchmark numbers from my setup
- Connecting Open WebUI as your frontend
Why run Ollama on a VPS at all?
Running Ollama on your laptop works fine for testing. But the second you close the lid, your model goes offline. If you want a persistent AI assistant — accessible from your phone, from a browser tab, from n8n automations, or from a client demo — you need it running somewhere that never sleeps.
That’s the VPS case. And it’s more practical than most people think. You don’t need a GPU server. CPU inference on a well-specced VPS is genuinely usable for smaller models — Llama 3 8B, Mistral 7B, Gemma 2 9B — with response times that won’t make you want to throw your laptop out the window.
Choosing the right VPS — the RAM problem
Here’s what nobody tells you in the generic Ollama tutorials: the bottleneck isn’t CPU speed. It’s RAM. Ollama loads the entire model into memory. A 7B model in Q4 quantization needs roughly 4.5–5GB of RAM just to load. Add your OS overhead and you’re at 6–7GB minimum before a single token is generated.
This means a 4GB VPS is a dead end before you even start. You need at least 8GB, ideally 16GB if you want to run two models or a larger one without swapping to disk (which kills performance entirely).
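A quick back-of-the-envelope version of that math: weight size is roughly parameter count times bits per weight, divided by 8, plus runtime overhead for the KV cache and the Ollama process itself. The 35% overhead factor below is a rough assumption, not a measured figure:

```shell
# Rough memory estimate for a quantized model (sketch):
#   weights (GB) ≈ billions of params × bits per weight / 8
#   total ≈ weights × 1.35 (assumed ~35% KV cache + runtime overhead)
params_b=7   # 7B model
bits=4       # Q4 quantization
awk -v p="$params_b" -v b="$bits" 'BEGIN {
  weights = p * b / 8
  printf "weights: %.1f GB, estimated total: %.1f GB\n", weights, weights * 1.35
}'
# → weights: 3.5 GB, estimated total: 4.7 GB
```

That lines up with the 4.5–5GB figure above, and it scales linearly: an 8B model at Q4 lands around 5.4GB before the OS takes its share.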
What you’ll need:
- A VPS with minimum 8GB RAM — 16GB recommended for comfort
- Ubuntu 22.04 or 24.04 (this guide uses 24.04)
- Root or sudo access
- A domain pointed to your VPS IP (optional but recommended for HTTPS)
- About 30 minutes
Get Contabo VPS M →
I’ve tested Hetzner, DigitalOcean, and Vultr for this same workload. Hetzner is great (and I’ll write a full comparison soon), but at this RAM tier Contabo wins on pure value. For LLMs, RAM is king — optimize for it.
Server setup: Ubuntu 24.04 baseline
1. First, SSH into your fresh VPS and do the standard hardening. Don’t skip this — an open Ollama API on a public IP is a free inference endpoint for anyone who finds it.
```shell
# Update everything first
apt update && apt upgrade -y

# Create a non-root user
adduser said
usermod -aG sudo said

# Basic firewall — allow SSH, HTTP, HTTPS only
ufw allow OpenSSH
ufw allow 80/tcp
ufw allow 443/tcp
ufw enable
```
2. Install Nginx now — we’ll use it later to proxy Ollama’s API behind HTTPS. Exposing port 11434 directly to the internet is something you absolutely do not want to do.
```shell
apt install -y nginx certbot python3-certbot-nginx
```
Installing and configuring Ollama
3. Ollama’s install script handles everything — binary, systemd service, the works. One line:
```shell
curl -fsSL https://ollama.com/install.sh | sh
```
After install, Ollama runs as a systemd service listening on 127.0.0.1:11434 by default. That’s correct — it should only be accessible locally. We’ll proxy it through Nginx.
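For reference, that bind address is controlled by the `OLLAMA_HOST` environment variable. You shouldn’t change it for this setup, but if you ever need to, a systemd drop-in is the clean way. A sketch of what `systemctl edit ollama` would write:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
```

Follow with `systemctl daemon-reload && systemctl restart ollama` to apply it.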
4. Pull your first model. I’m using Llama 3.2 3B for this demo because it’s fast on CPU, but the steps are identical for any model:
```shell
# 3B is fast — good for testing your setup first
ollama pull llama3.2:3b

# Once confirmed working, pull the real one
ollama pull llama3.1:8b

# Verify both models are available
ollama list
```
5. Test it locally before touching Nginx:
```shell
ollama run llama3.1:8b "What is self-hosting in one sentence?"

# Or via the API directly
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Hello", "stream": false}'
```
Securing the API with Nginx reverse proxy
6. Create an Nginx config at /etc/nginx/sites-available/ollama that proxies /ollama/ to the local Ollama port. Replace yourdomain.com with your actual domain:
```nginx
server {
    listen 80;
    server_name yourdomain.com;

    location /ollama/ {
        proxy_pass http://127.0.0.1:11434/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Important for streaming responses
        proxy_buffering off;
        proxy_read_timeout 300s;
        chunked_transfer_encoding on;
    }
}
```
```shell
ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
nginx -t && systemctl reload nginx
certbot --nginx -d yourdomain.com
```
Now your Ollama API is accessible at https://yourdomain.com/ollama/api/generate over HTTPS. No raw port exposed, no unencrypted traffic.
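One caveat: HTTPS encrypts the traffic, but the endpoint is still unauthenticated, so anyone who discovers the URL can run inference on your server. A minimal way to close that gap is a static bearer-token check inside the same location block. This is a sketch; the token string is a placeholder you’d replace with your own long random value:

```nginx
location /ollama/ {
    # Reject anything without the shared secret (placeholder token)
    if ($http_authorization != "Bearer replace-with-a-long-random-token") {
        return 401;
    }

    proxy_pass http://127.0.0.1:11434/;
    proxy_set_header Host $host;
    proxy_buffering off;
    proxy_read_timeout 300s;
}
```

Clients then pass the header on every call, e.g. `curl -H 'Authorization: Bearer ...'`. For something sturdier, Nginx’s `auth_basic` with an htpasswd file works too.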
Real benchmark numbers from my setup
These are actual numbers from my Contabo VPS M (16GB RAM, 8 vCPU, AMD EPYC). Not cherry-picked. Not ideal conditions — these are normal daytime numbers with other services running:
7 tokens per second on Llama 3.1 8B is genuinely usable for a personal assistant or automation use case. It’s not GPT-4 speed, but it’s also not €0.01 per 1K tokens either — it’s flat-rate infrastructure you own.
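If you want to measure your own tokens-per-second, you don’t need a benchmark harness. Ollama’s non-streaming /api/generate reply includes eval_count (tokens generated) and eval_duration (in nanoseconds), so the rate falls out with a little jq. This assumes jq is installed (`apt install -y jq`):

```shell
# Ask for a completion, then compute tokens/sec from the timing fields
resp=$(curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Explain DNS briefly.", "stream": false}')

echo "$resp" | jq -r '.response'
echo "$resp" | jq -r '"\(.eval_count / (.eval_duration / 1e9) | floor) tokens/sec"'
```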
Connecting Open WebUI as your frontend
Ollama’s API is great, but you’ll want a proper chat interface. Open WebUI connects directly to Ollama and gives you a ChatGPT-style UI that’s entirely self-hosted. Here’s the one-liner using Docker:
```shell
docker run -d \
  --name open-webui \
  --network host \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```
Then add another Nginx server block pointing your subdomain (e.g. chat.yourdomain.com) to port 8080. Run Certbot again. Done — you now have a private AI chat interface running 24/7 on your own server.
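A sketch of that second server block, assuming Open WebUI is listening on its default port 8080. The WebSocket upgrade headers matter; streamed chat responses break without them:

```nginx
server {
    listen 80;
    server_name chat.yourdomain.com;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # WebSocket upgrade, required for streaming chat
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```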
Compare Contabo VPS plans →
What’s next on ilinuxu.com
Part 2: Connecting your Ollama VPS to n8n for AI-powered automations — no API costs, no rate limits.
Subscribe to get it →