How-To
Running AI on a Raspberry Pi, Part 2: Running AI on a Pi in Under 5 Minutes
In a recent
article, I discussed running AI locally on a relatively
low-powered system, specifically a Raspberry Pi 500+. That
article covered the major software components of a local AI setup,
including LLMs and RAG, what they are, and how they are used. I also
explained why I think the Pi, with its ARM processor, 16 GB of RAM, and
NVMe drive, can handle the load that AI requires. Since the Pi
500+ is based on the more popular Pi 5, I will use the two model names
interchangeably in this article, as I did in my past article.
In this article,
I will run a local Large Language Model (LLM) on a
Raspberry Pi. This was a great way for me to dip my toe into AI,
and the best part was that I could do it in less than five minutes!
Which LLM on a Pi?
Recent advances
in model architecture and aggressive quantization have made it
possible to run AI models on extremely small devices, such as the
Raspberry Pi. People have reported that LLMs in the 1–4 billion
parameter range can now deliver impressive performance for tasks such
as text generation, reasoning, coding, tool calling, and even
vision understanding, all without requiring GPUs, cloud resources, or
heavy infrastructure.
Using tools like
Ollama and quantized model formats, it is now practical to
experiment with local AI on low-power hardware, opening the door
to affordable, private, and portable AI deployments that fit
literally in the palm of your hand (Pi 5), or in my case, the inside
of a keyboard (Pi 500+).
Before starting
this project, I reviewed many popular tiny models, including several
from the Qwen3 family, EXAONE 4.0, Ministral 3, Jamba Reasoning,
IBM’s Granite Micro, and Microsoft’s Phi-4 Mini. Each had
different strengths in areas such as long-context processing,
reasoning, multimodal understanding, and agentic capabilities. In the
end, I narrowed it down to three small LLMs to work with.
Ollama
Ollama is an
open-source AI platform that lets you run LLMs locally, without
relying on cloud-hosted AI services. It allows you to download,
manage, and run models such as Llama, Mistral, Gemma, and others
directly on your own machines, whether they’re laptops, desktops,
servers, or even Pi systems.
Ollama
abstracts away much of the complexity involved in model setup,
dependency management, and hardware acceleration. It provides a clean
command-line interface and API that developers can easily integrate
into their workflows. Ollama is gaining widespread adoption among
developers, researchers, IT professionals, and organizations
experimenting with private AI deployments, local RAG pipelines, and
edge-based inference.
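That API can be exercised with a quick curl call. The following is a minimal sketch, assuming the Ollama server is running on its default port (11434) and that a model named llama3 has already been pulled; if the server is not up, the script just says so rather than failing:

```shell
# Query a local model through Ollama's REST API (default port 11434).
# Assumes "ollama serve" is running and llama3 has been pulled.
if curl -s --max-time 2 http://localhost:11434/ >/dev/null; then
  curl -s http://localhost:11434/api/generate \
    -d '{"model": "llama3", "prompt": "Why use a Pi for AI?", "stream": false}'
else
  echo "Ollama server not reachable; start it with: ollama serve"
fi
```

Setting "stream" to false returns the whole answer in a single JSON response, which is easier to read when experimenting by hand.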
Ollama emerged in
response to a movement toward greater access to AI and the
decentralization of inference workloads. It really started to gain
traction in 2023 and accelerated rapidly through 2024 and 2025 as
local model performance improved dramatically. As quantization
techniques, optimized runtimes, and efficient architectures made it
possible to run capable models on consumer-grade hardware, Ollama
quickly positioned itself as an easy way for those with limited
resources to get started with AI. Its rapid adoption has been
fueled by its simplicity, active open-source community, and growing
ecosystem of supported models and integrations.
Ollama Commands
Here are a few of
the commands I used with Ollama. These examples use llama3 as the
LLM, but you can use other LLMs.
ollama pull (example: ollama pull llama3): Downloads a model from the Ollama registry and stores it locally for offline use.
ollama run (example: ollama run llama3): Launches an interactive chat session with a model, downloading it first if needed.
ollama list (example: ollama list): Lists all models currently installed on the local system.
ollama rm (example: ollama rm llama3): Deletes a locally stored model to free disk space.
ollama show (example: ollama show llama3): Displays detailed information about a model, including parameters and configuration.
ollama serve (example: ollama serve): Starts the local Ollama API server for programmatic access and integrations.
ollama ps (example: ollama ps): Shows currently running models and active inference processes.
ollama stop (example: ollama stop llama3): Stops a currently running model session.
ollama create (example: ollama create my-model -f Modelfile): Builds a custom model from a Modelfile configuration.
ollama cp (example: ollama cp llama3 my-model): Copies an existing model locally, often used as a base for customization.
Tom’s Tip: Use /bye to exit an Ollama chat session.
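Putting those commands together, a first session might look like the sketch below. It uses llama3 only because that is the example model above, and it passes the prompt as an argument so the command returns on its own instead of opening an interactive chat:

```shell
# A complete first session, using llama3 as the example model.
if command -v ollama >/dev/null 2>&1; then
  ollama pull llama3                             # download the model (one time)
  ollama list                                    # confirm it is installed
  ollama run llama3 "Say hello in five words."   # one-shot prompt, then exit
else
  echo "Ollama is not installed yet; see the next section."
fi
```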
Installing
a Local LLM on Raspberry Pi
Installing and
running an LLM on Raspberry Pi is relatively simple using Ollama.
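The install itself is a single command using the official install script from ollama.com. The sketch below adds my own guard rails: it assumes a 64-bit Raspberry Pi OS (where "uname -m" reports aarch64) and skips the download on other architectures or if Ollama is already present:

```shell
# Install Ollama via the official install script. A 64-bit OS is
# required; on 64-bit Raspberry Pi OS, "uname -m" reports aarch64.
if [ "$(uname -m)" = "aarch64" ] && ! command -v ollama >/dev/null 2>&1; then
  curl -fsSL https://ollama.com/install.sh | sh
else
  echo "Skipping install: Ollama is already present or this is not an arm64 system."
fi
```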
Deciding which
LLM to run is more difficult. The “religious” wars over LLM
selection are intense, making past IT debates such as Windows vs. Mac
or file vs. block storage seem minor by comparison. After far too
much research, I decided to start with qwen2.5.
Qwen is an
interesting LLM and made quite a ruckus when it was first released,
as it comes from the Chinese company Alibaba. The Qwen family of
LLMs was designed to deliver strong performance across reasoning,
coding, multilingual understanding, and general-purpose text
generation.
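With the model chosen, the "under 5 minutes" claim boils down to two commands. The :3b tag in the sketch below is my own pick of a quantized size that fits comfortably in the Pi 500+'s 16 GB of RAM; the qwen2.5 family offers other tags as well:

```shell
# Pull and query qwen2.5. The :3b tag is one size that fits
# comfortably in 16 GB of RAM; smaller and larger tags exist.
if command -v ollama >/dev/null 2>&1; then
  ollama pull qwen2.5:3b
  ollama run qwen2.5:3b "Introduce yourself in one sentence."
else
  echo "Install Ollama first (see the previous section)."
fi
```

Running "ollama run qwen2.5:3b" without a prompt instead opens the interactive chat session, which you can leave with /bye.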