Sovereign AI

Deploy Mistral on-premise

Deploy a Mistral model on-premise or in a sovereign private cloud hosted in Europe. Model choice, GPU sizing, OpenAI-compatible private API, no lock-in.

Updated June 2026

Mistral / QwenSovereign private cloudGPU hosted in EuropeOn-premiseGPU on your premises

Mistral is a European open-weight model, which makes it the natural choice for sovereign AI: it performs well, its license allows commercial use, and you can host it yourself.

This guide explains how to deploy it on-premise or in a sovereign private cloud with Bunker, without sending a single prompt to a third-party API.

Which Mistral model for the need

The family covers several sizes. You pick based on the task and the hardware available.

  • Mistral 7B: the entry point. Excellent for summarization, classification, extraction, and an internal assistant. Fits on a single GPU, even quantized on modest hardware.
  • Mixtral (mixture of experts): better reasoning quality at a contained inference cost, since only part of the parameters activate per request.
  • Larger models: for demanding tasks (long-form writing, complex reasoning), at the cost of more VRAM and higher latency.

Start small. A well-integrated Mistral 7B often delivers more than a large, badly sized model that responds slowly.

On-premise or private cloud deployment

Two options, the same software:

Sovereign private cloudGPU hosted in EuropeOn-premiseGPU on your premisesPrivate APIOpenAI formatYour applications

In both cases, the API exposes the OpenAI format. Your existing libraries and integrations point at the new URL and keep working.

Sizing the GPU

The model decides the hardware. A few reference points for Mistral 7B:

  • fp16: around 16 GB of VRAM. A 24 GB GPU leaves room for context.
  • 8-bit quantized: around 8 to 10 GB.
  • 4-bit quantized: around 5 to 7 GB, workable on consumer hardware.

The longer the context (prompt length), the more memory it consumes on top of the model. If you process long documents, plan for headroom or a more generous GPU.

Putting it into service

The typical flow is the same regardless of mode:

  1. Choose the model and GPU size with the Bunker team or from the console.
  2. Inference is deployed in Europe, and you get the URL and key for your private API.
  3. You point your tools at that URL (same format as the OpenAI API).
  4. You measure real throughput and adjust the GPU if needed.

Because everything is open source, the deployment stays portable: you can re-internalise it later onto your hardware, without rewriting your applications.

Frequently asked questions

Can I really deploy Mistral without depending on a US API?

Yes. Mistral is open-weight. The model runs on the GPU you chose, in Europe or on your premises, and contacts no third-party service.

Is the API compatible with my existing code?

Yes. Inference exposes the OpenAI format. Changing the base URL and key is enough in most cases.

How long until it's running?

On the sovereign private cloud, deployment is fast once the model and GPU are chosen. On-premise depends on your hardware.

What if I need a larger model later?

We change the GPU size and the model without touching your applications, since the interface stays the same.

Deploy Mistral, in Europe or on your premises

Sovereign private cloud or on-premise, OpenAI-compatible API, zero lock-in.