I’ve been putting this aside for some time, but the time has come: I will be experimenting with and learning about local LLM deployment, expanding it with RAG, and digging into agents, MCP, agentic flows, etc.
The only obvious limitation is not having access to GPUs, only my Mac M3, but that should be enough for a lot of the smaller models out there and fits this self-education project perfectly.
In this post I’ll cover my experience setting up a local LLM with Ollama.
Why Ollama?
I will be using Ollama because it’s simple, offers both UI-based and CLI/API usage, lets me try and switch between a large variety of models, and is an easy entry point for running LLMs locally.
Step 1: Install Ollama (1–2 minutes)
The Terminal Way
On macOS, installation is straightforward.
With Homebrew, in the terminal:
brew install ollama
Once installed, verify:
ollama --version
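Homebrew installs the CLI, but both the CLI and the HTTP API talk to a local Ollama server, which needs to be running (the desktop app starts it automatically). If it isn’t running yet, you can start it in a terminal; by default it listens on localhost:11434:
ollama serve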
Step 2: Pull Your First Model (2–3 minutes)
Ollama gives access to multiple large language models with different sizes and capabilities, including models for conversation, coding, reasoning, and multimodal tasks, all designed to run locally.
Let’s start with a model that is small enough to run on my machine.
Llama 3.1 is a new state-of-the-art model from Meta available in 8B, 70B and 405B parameter sizes.

ollama pull llama3.1
This will download the default tag, which is the smallest (8B) variant. If I wanted one of the larger ones I could do:
ollama pull llama3.1:405b
The smaller model took less than 2 minutes to download.
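To confirm the download, ollama list shows every model currently on disk together with its size:
ollama list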
Step 3: Run the Model Locally (Instant)
Once the model is downloaded:
ollama run llama3.1
That’s it. You’re now chatting with a local LLM.
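If you would rather not drop into the interactive session, you can also pass a prompt directly as an argument and get a single answer back (the question here is just an illustrative one):
ollama run llama3.1 "Explain recursion in one sentence"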

Try something simple:
What was the Y2K bug?
The response is generated 100% locally.

The above response is actually quite decent, but you can get very different results depending on what you are asking. I tried asking for the starting squad of FC Porto in the 2004 Champions League final against Monaco. The result is nothing short of hilarious 🙂 Lots of hallucination in there.

This can be due to poor coverage of this particular topic in the training data, the smaller model size, the lack of external sources, or all of these combined. It’s therefore important to understand what these models can do.
Understanding Models (Important)
Ollama is a gateway to a wide variety of LLMs, from Llama (Meta) to DeepSeek, Qwen3, and many others. They come in different sizes, which obviously affects how well they perform, how fast they run, and how much hardware they need.
Most of the models you run through Ollama are not raw base models. A true base model is trained mainly to predict the next token (word or part of a word) in a sequence. The models available through Ollama are usually instruction-tuned or otherwise fine-tuned, meaning they have been further trained to handle tasks such as conversation, reasoning, coding (e.g. codellama), or even multimodal tasks like image understanding (e.g. qwen3-vl), depending on the model type.
An instruction model (or instruction-tuned LLM) is a type of large language model specially trained to understand, follow, and execute specific user instructions or prompts. Unlike base models that simply predict the next word in a sequence, instruct models are fine-tuned to act as assistants, offering high accuracy for tasks like summarizing, coding, or answering questions. – a direct quote from Gemini 🙂
Even with this tuning, these models can sometimes feel underwhelming when compared with systems like ChatGPT.
One More Model
OK, let’s have a go at another model, this time a large language model that can use text prompts to generate and discuss code: codellama.
ollama pull codellama
ollama run codellama

Performance
Ollama exposes useful metrics via its API. By dividing the number of generated tokens by the generation time, it is possible to estimate model throughput.
danimmar@danimmar-mac / % curl http://localhost:11434/api/generate -d '{
"model": "codellama",
"prompt": "Which method prints a text in javascript",
"stream": false
}'
Running CodeLlama locally on my M3 Mac produced approximately 31 tokens per second during generation.
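That figure comes straight from the metrics in the response: with "stream": false the API returns a single JSON object that, next to the generated text, includes eval_count (the number of generated tokens) and eval_duration (the generation time in nanoseconds). Dividing one by the other gives tokens per second; a minimal sketch, assuming jq is installed:
curl -s http://localhost:11434/api/generate -d '{
"model": "codellama",
"prompt": "Which method prints a text in javascript",
"stream": false
}' | jq '.eval_count / (.eval_duration / 1e9)'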
danimmar@danimmar-mac / % curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "Which method prints a text in javascript",
"stream": false
}'
Llama 3.1 gave me around 26 tokens per second.
This is just to give an idea of the throughput of these models; it is by no means a benchmark.
Common Ollama CLI Commands
ollama pull <name> – Downloads a model from the library
ollama rm <name> – Deletes a local model
ollama list – Shows all models currently downloaded on your system
ollama run <name> – Pulls the model if needed and starts an interactive chat session
ollama ps – Displays which models are currently loaded into memory and running
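For example, after the runs above, ollama ps shows whether llama3.1 and codellama are still loaded (models stay in memory for a few minutes by default before being unloaded), and ollama rm frees the disk space of a model I no longer need:
ollama ps
ollama rm codellama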
Using the GUI
Actually, the first time I used Ollama was via the GUI, and it is as simple as it can be.
Download here

You can download the model you want and immediately start chatting.
