Ultimate Guide to Running Quantized LLMs on CPU with LLaMA.cpp
We are all witnessing the rapid evolution of Generative AI, with new Large Language Models (LLMs) emerging daily at various scales. However, running these models on local machines remains a challenge due to their high computational demands. LLaMA.cpp solves this problem by enabling the use of quantized models, which are significantly lighter while maintaining reasonable accuracy.
llama.cpp is a lightweight, open-source library developed by Georgi Gerganov to make LLM inference possible on local machines. It supports running quantized models, which are lower in precision and smaller in size, reducing memory requirements significantly. This allows efficient execution of LLMs without needing specialized hardware.
Running LLaMA.cpp from the Command Line
Steps to Run Inference with LLaMA.cpp
- Clone and build Llama.cpp.
- Download a quantized (GGUF) model of your choice.
- Run Inference.
Step 1: Install CMake (Required for Building LLaMA.cpp)
Make sure that CMake is installed. If not, it can be installed with your system's package manager.
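The commands below assume a Debian/Ubuntu system (an assumption about your OS); on macOS, `brew install cmake` is the equivalent:

```bash
# Install CMake and basic build tools (Debian/Ubuntu)
sudo apt-get update
sudo apt-get install -y cmake build-essential

# Verify the installation
cmake --version
```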
Step 2: Clone and Build LLaMA.cpp
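Clone the repository from GitHub:

```bash
# Clone the official llama.cpp repository and enter the directory
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```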
Once the cloning is completed, build llama.cpp using:
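```bash
# Standard CPU-only CMake build; binaries land in build/bin
cmake -B build
cmake --build build --config Release
```

This follows the standard CMake flow from the llama.cpp README; the resulting binaries (such as llama-cli) are placed in build/bin.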
Step 3: Download a GGUF Quantized Model
Download the GGUF quantized model of your choice. Remember, the lower the quantization bit-width, the smaller and faster the model, but at the cost of accuracy. Here, the 4-bit small version of Mistral-7B-Instruct is used.
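As a sketch, the model can be fetched with the Hugging Face CLI; the repository and file names below are illustrative, so substitute the exact model and quantization you picked:

```bash
# Download a 4-bit GGUF of Mistral-7B-Instruct (repo/file names are examples)
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q4_K_S.gguf \
  --local-dir ./models
```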
Step 4: Run Inference
After downloading the model, you can run inference. Make sure to pass the correct model path.
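A minimal invocation looks like the sketch below; the model path and prompt are placeholders, and the binary location reflects the CMake build above (older builds produce a ./main binary instead of llama-cli):

```bash
# Run a one-off prompt against the quantized model
./build/bin/llama-cli \
  -m ./models/mistral-7b-instruct-v0.2.Q4_K_S.gguf \
  -p "Explain quantization of LLMs in one paragraph." \
  -n 256   # maximum number of tokens to generate
```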
Here is the output.
Running LLaMA.cpp with Python Bindings
LLaMA.cpp provides bindings for different programming languages, allowing easy integration of quantized LLMs into applications. Here, we explore the Python bindings.
Install Dependencies and Python Bindings
Before installing the Python bindings, ensure your system has the necessary build tools (a C/C++ compiler and CMake).
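With those in place, the bindings can be installed from PyPI; the package is llama-cpp-python, and llama.cpp itself is compiled from source during installation:

```bash
# Install the Python bindings (builds llama.cpp under the hood)
pip install llama-cpp-python
```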
Text Generation
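With the bindings installed, a quantized GGUF model can be loaded and prompted in a few lines. The sketch below assumes the Mistral model downloaded earlier; the path, thread count, and prompt are illustrative:

```python
from llama_cpp import Llama

# Load the quantized GGUF model (adjust the path to your download)
llm = Llama(
    model_path="./models/mistral-7b-instruct-v0.2.Q4_K_S.gguf",
    n_ctx=2048,    # context window size
    n_threads=8,   # CPU threads to use
)

# Completion-style text generation
output = llm(
    "Q: What does quantization do to an LLM? A:",
    max_tokens=128,
    stop=["Q:"],   # stop before the model starts a new question
    echo=False,    # do not repeat the prompt in the output
)
print(output["choices"][0]["text"])
```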
Creating Embeddings with LLaMA.cpp
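Embeddings work through the same Llama class, loaded with embedding=True. The sketch below reuses the Mistral GGUF purely for illustration; a dedicated embedding model in GGUF format will usually give better vectors:

```python
from llama_cpp import Llama

# Load the model in embedding mode (model path is illustrative)
llm = Llama(
    model_path="./models/mistral-7b-instruct-v0.2.Q4_K_S.gguf",
    embedding=True,
)

# embed() returns the embedding vector for the given text
vector = llm.embed("LLaMA.cpp makes local inference practical.")
print(len(vector))  # dimensionality of the embedding
```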
Running Multimodal Vision LLMs in LLaMA.cpp
LLaMA.cpp now supports a number of open-source multimodal models as well, again through quantization. Unlike plain text LLMs, llama-cpp-python provides a separate chat-handler wrapper for each supported vision model. You can find the list of models and their chat handlers here.
In this guide, we will use MiniCPM, an open-source multimodal LLM (MLLM) that is supported in llama.cpp.
Understanding the Required Models
To run Vision LLMs, you need to download two models:
- The GGUF model:
  - This is the main language model.
  - Different quantized versions (e.g., Q3_K_L, Q4_K_M) are available for optimized performance.
- The MMProj (multimodal projector) model:
  - This acts as a bridge between the image encoder and the LLM.
  - It translates images into a format the LLM can understand.
  - These models are usually not quantized and are available only in full precision (fp16).
Step 1: Download the Required Models
Download both GGUF (quantized) and MMProj (full precision) models from Hugging Face.
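As a sketch, both files can be pulled with the Hugging Face CLI; the repository and file names below are assumptions based on the MiniCPM-V GGUF releases, so check the model card for the exact names:

```bash
# Main quantized language model (GGUF)
huggingface-cli download openbmb/MiniCPM-V-2_6-gguf \
  ggml-model-Q4_K_M.gguf --local-dir ./models

# Multimodal projector (full precision, fp16)
huggingface-cli download openbmb/MiniCPM-V-2_6-gguf \
  mmproj-model-f16.gguf --local-dir ./models
```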
Step 2: Load the Models
Provide the mmproj model path to the chat handler and the GGUF model path to Llama.
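A minimal loading sketch, assuming the MiniCPM-V chat handler shipped with llama-cpp-python (check the handler class name and file paths against your installed version and downloaded files):

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler

# The chat handler takes the mmproj (projector) model path
chat_handler = MiniCPMv26ChatHandler(
    clip_model_path="./models/mmproj-model-f16.gguf",
)

# Llama takes the quantized GGUF language model path
llm = Llama(
    model_path="./models/ggml-model-Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,  # extra context to leave room for image tokens
)
```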
Step 3: Generate a Response for an Image
To pass a local image to the model, it should first be converted to base64-encoded data and then passed via the url parameter.
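A sketch of that flow is below; the image path and prompt are placeholders, and the message format follows the OpenAI-style chat API that llama-cpp-python's vision handlers accept:

```python
import base64

def image_to_data_uri(path: str) -> str:
    # Read a local image and wrap it as a base64 data URI
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/png;base64,{encoded}"

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": image_to_data_uri("example.png")}},
                {"type": "text",
                 "text": "Describe this image in one sentence."},
            ],
        }
    ],
)
print(response["choices"][0]["message"]["content"])
```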
We have now seen that with LLaMA.cpp, you can run powerful LLMs in quantized form right on your local machine, with no specialized hardware needed. Whether you're generating text or diving into multimodal AI with vision models, you're all set to experiment and build.