Run any Hugging Face model locally

Adithya S K
Jan 1, 2024 · 10 min read

A guide and Colab notebook for quantizing LLMs into the GGUF format so you can run them locally

Code updated: 29th December 2023

Introduction

Language models sit at the forefront of natural language processing, playing a central role in comprehending and generating human-like text. Their increasing size and complexity present a critical challenge: achieving efficiency without compromising performance. As these models continue to advance, striking that balance becomes essential for deploying them effectively across applications.

This is where quantization steps in as a crucial technique.

Need for Quantization

Neural network models typically store their weights as 32-bit floating point values (float32), consuming 4 bytes of memory per value, which is why quantization becomes a necessity. With INT8 quantization, weights are represented as 8-bit integers instead of the original 32-bit floats, shrinking the model to roughly a quarter of its size and delivering a notable boost in computational speed. Specialized 8-bit matrix multiplication kernels in modern GPUs and TPUs further optimize INT8 performance.
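As a back-of-the-envelope illustration (simple arithmetic, not a benchmark), here is what that precision drop means for a 7B-parameter model:

```python
# Rough memory math for a 7B-parameter model (weights only;
# activations and runtime overhead are ignored).
params = 7_000_000_000

fp32_bytes = params * 4  # float32: 4 bytes per weight
int8_bytes = params * 1  # INT8: 1 byte per weight

print(f"float32 weights: {fp32_bytes / 1e9:.0f} GB")  # ~28 GB
print(f"INT8 weights:    {int8_bytes / 1e9:.0f} GB")  # ~7 GB
```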

However, the overarching goal remains preserving the original full-precision accuracy as much as possible. It’s important to recognize that a simplistic approach to quantization, such as merely rounding weights to lower precision, can pose a substantial threat to model accuracy. Particularly for large language models, sophisticated quantization techniques are imperative. These techniques are meticulously tailored to the unique characteristics of expansive language models, ensuring optimal accuracy while meeting the efficiency demands of diverse applications.
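To make the rounding pitfall concrete, here is a small NumPy sketch of naive absmax INT8 quantization. The tensor and its outlier are invented for illustration; more sophisticated schemes (such as the block-wise k-quants used in GGUF files) exist precisely to limit this effect:

```python
import numpy as np

# Naive absmax INT8 quantization: one scale for the whole tensor.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=10_000).astype(np.float32)
weights[0] = 2.0  # a single outlier, common in LLM weight matrices

scale = np.abs(weights).max() / 127            # the outlier dictates the scale
q = np.round(weights / scale).astype(np.int8)  # 8-bit codes
dequant = q.astype(np.float32) * scale         # reconstructed weights

print(f"mean abs error: {np.abs(weights - dequant).mean():.6f}")
# The outlier inflates the scale, so the many small weights collapse
# onto just a few levels -- exactly the accuracy threat described above.
```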

What is GGUF?

GGUF is a new format introduced by the llama.cpp team on August 21st, 2023, as a replacement for GGML, which is no longer supported by llama.cpp.

Here is an incomplete list of clients and libraries that are known to support GGUF:

  • llama.cpp. The source project for GGUF. Offers a CLI and a server option.
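For example, the llama-cpp-python bindings for llama.cpp let you load a GGUF file directly from Python. The sketch below is illustrative rather than part of the original guide: the Hub repository and filename are just one publicly available quantized model, so substitute whichever GGUF file you produced or downloaded.

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Fetch a quantized GGUF file from the Hugging Face Hub
# (any GGUF repo/filename works; this one is an example).
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf",
)

# Load the model and generate a completion locally (CPU by default).
llm = Llama(model_path=model_path, n_ctx=2048)
output = llm("Q: What is quantization? A:", max_tokens=128, stop=["Q:"])
print(output["choices"][0]["text"])
```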
