AI/ML

Quantization in LLMs 🌐

In the rapidly evolving world of artificial intelligence, large language models (LLMs) have become pivotal in various applications, from chatbots to recommendation systems. However, deploying these advanced models can be challenging due to high memory and computational requirements.

This is where quantization comes into play!

Do you know?

GPT-3 has around 175 billion parameters. OpenAI has not disclosed the size of GPT-4, though unofficial estimates put it in excess of 1 trillion parameters.

In this blog, let’s explore how quantization can make LLMs more efficient, accessible, and ready for deployment on edge devices. 🌍

What is Quantization? 🤔

Quantization is a procedure that maps high-precision weight values, such as FP32, to lower-precision data types such as FP16 or even INT8 (8-bit integer). By reducing the precision of the weights, we create a more compact version of the model, usually with only a small loss in accuracy.

TL;DR

Quantization transforms high-precision weights into lower-precision formats to optimize resource usage with minimal loss in accuracy.

Why Quantize? 🌟

Here are a few compelling reasons to consider quantization:

  1. Reduced Memory Footprint 🗄️
    Quantization dramatically lowers memory requirements, making it possible to deploy LLMs on lower-end machines and edge devices. This matters especially on edge hardware, where some microcontrollers and accelerators support only integer arithmetic.

  2. Faster Inference
    Lower-precision computations (such as integer arithmetic) are generally faster than higher-precision floating-point ones, and smaller weights also mean less data to move through memory. Many modern CPUs and GPUs also have specialized instructions for low-precision math, so quantized models can take advantage of hardware acceleration for even better performance!

  3. Reduced Energy Consumption 🔋
    Many contemporary hardware accelerators are optimized for lower-precision operations, capable of performing more calculations per watt of energy when models are quantized. This is a win-win for efficiency and sustainability!
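To make the memory savings concrete, here is a rough back-of-the-envelope sketch. It counts the weights only (no activations, KV cache, or per-block scale factors), and the 7B-parameter model is a hypothetical example, not a figure from this post.

```python
# Bytes needed to store one weight in each format.
BYTES_PER_WEIGHT = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 1e9


if __name__ == "__main__":
    num_params = 7_000_000_000  # hypothetical 7B-parameter model
    for dtype in BYTES_PER_WEIGHT:
        print(f"{dtype}: {weight_memory_gb(num_params, dtype):.1f} GB")
```

Going from FP32 to INT8 cuts the weight storage by 4x, and INT4 by 8x, which is often the difference between needing a server GPU and fitting on a laptop.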

Linear Quantization 📏

In linear quantization, we essentially perform scaling within a specified range. Here, the minimum value (Rmin) is mapped to its quantized minimum (Qmin), and the maximum (Rmax) to its quantized counterpart (Qmax).

The zero in the actual range corresponds to a specific zero_point in the quantized range, allowing for efficient mapping and representation.

To achieve quantization, we need a good way to project our range of FP32 weight values, [Rmin, Rmax], onto the INT4 range [Qmin, Qmax]. One method of doing this is called the affine quantization scheme, which is shown in the formula below:

$$ x_q = \mathrm{round}\left(\frac{x}{S} + Z\right) $$

where:

  • x_q: the quantized INT4 value that corresponds to the FP32 value x

  • S: a positive FP32 scaling factor

  • Z: the zero-point, i.e. the INT4 value that corresponds to 0 in the FP32 space

  • round: refers to the rounding of the resultant value to the closest integer
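Here is a minimal sketch of the affine scheme in plain Python. The formulas for S and Z below are one common choice and are an assumption on my part (the post only gives the mapping itself), as are the function names and the clamping step. The signed INT4 range [-8, 7] is used for the example.

```python
def affine_params(r_min, r_max, q_min, q_max):
    """Return (S, Z) that map the real range [r_min, r_max] onto [q_min, q_max]."""
    # Widen the range to include 0 so that real zero is exactly representable.
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)
    scale = (r_max - r_min) / (q_max - q_min)
    if scale == 0.0:          # all weights are zero; any positive scale works
        scale = 1.0
    zero_point = round(q_min - r_min / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, q_min, q_max):
    """x_q = round(x / S + Z), clamped to the integer range (clamping is an addition)."""
    x_q = round(x / scale + zero_point)  # Python rounds exact halves to the even integer
    return max(q_min, min(q_max, x_q))


def dequantize(x_q, scale, zero_point):
    """Approximate reconstruction of the real value: x is roughly S * (x_q - Z)."""
    return scale * (x_q - zero_point)


if __name__ == "__main__":
    weights = [-1.0, -0.25, 0.0, 0.5, 2.0]
    q_min, q_max = -8, 7  # signed INT4
    s, z = affine_params(min(weights), max(weights), q_min, q_max)
    for w in weights:
        w_q = quantize(w, s, z, q_min, q_max)
        print(w, "->", w_q, "->", dequantize(w_q, s, z))
```

Note that the smallest weight lands on Qmin, the largest on Qmax, and 0.0 lands exactly on the zero-point, which is what the Rmin/Rmax/zero_point description above asks for.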

Types of Quantization

PTQ 🛠️

As the name suggests, Post Training Quantization (PTQ) occurs after the LLM training phase.

In this process, the weights of an already trained model are converted from a higher-precision format to a lower-precision one; the same can be done for activations. While this improves speed, memory efficiency, and power usage, it comes with an accuracy trade-off.

Beware of Quantization Error

During quantization, rounding or truncation introduces quantization error, which can affect the model’s ability to capture fine details in weights.
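To see the error in numbers, the sketch below quantizes a handful of made-up weights, dequantizes them again, and reports the worst round-trip error. It assumes an affine scheme with a signed integer range, and the helper name is hypothetical. With rounding to the nearest integer, the error per weight is bounded by about half the scale S, so fewer bits (a larger S) means a larger possible error.

```python
def roundtrip_error(weights, num_bits):
    """Return (worst absolute error, scale) after quantizing to num_bits and back."""
    q_min, q_max = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # Include 0 in the real range so that zero maps exactly onto the zero-point.
    r_min, r_max = min(min(weights), 0.0), max(max(weights), 0.0)
    scale = (r_max - r_min) / (q_max - q_min) or 1.0
    zero_point = round(q_min - r_min / scale)
    worst = 0.0
    for w in weights:
        w_q = max(q_min, min(q_max, round(w / scale + zero_point)))
        w_hat = scale * (w_q - zero_point)  # dequantized value
        worst = max(worst, abs(w - w_hat))
    return worst, scale


if __name__ == "__main__":
    weights = [-0.62, -0.11, 0.0, 0.07, 0.35, 0.9]  # made-up example values
    for bits in (8, 4):
        err, scale = roundtrip_error(weights, bits)
        print(f"INT{bits}: scale={scale:.4f}, worst error={err:.4f}")
```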

QAT ⏰

Quantization-Aware Training (QAT) refers to methods of fine-tuning on data with quantization in mind. In contrast to PTQ techniques, QAT integrates the weight conversion process, i.e., calibration, range estimation, clipping, rounding, etc., during the training stage. This often results in superior model performance, but is more computationally demanding.
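One common way to put quantization "in the loop" is fake quantization with a straight-through estimator. The toy sketch below is only meant to show the idea on a single weight: it assumes a symmetric INT4 grid (zero-point of 0), a hand-written gradient, and hypothetical function names. Real QAT would be done with a deep learning framework's autograd and quantization tooling.

```python
def fake_quantize(w, scale, q_min=-8, q_max=7):
    """Quantize then immediately dequantize: w stays a float, but it can only
    take values on the integer grid (symmetric scheme, zero-point = 0)."""
    w_q = max(q_min, min(q_max, round(w / scale)))
    return scale * w_q


def train_step(w, x, y, scale, lr=0.05):
    """One gradient step on the loss (fake_quantize(w) * x - y) ** 2."""
    w_fq = fake_quantize(w, scale)  # the forward pass sees the quantized weight
    error = w_fq * x - y
    # Straight-through estimator: treat d(fake_quantize)/dw as 1, because the
    # true derivative of round() is zero almost everywhere.
    grad = 2 * error * x
    return w - lr * grad            # the update is applied to the float weight


if __name__ == "__main__":
    data = [(1.0, 0.5), (2.0, 1.0)]  # toy data generated by y = 0.5 * x
    w, scale = 0.9, 0.25
    for _ in range(20):
        for x, y in data:
            w = train_step(w, x, y, scale)
    print("float weight:", w, "quantized weight:", fake_quantize(w, scale))
```

The float weight keeps moving until its quantized version fits the data, which is the sense in which the model learns to live with the rounding.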

Tip

PTQ is easier to implement than QAT, as it needs no retraining (at most a small calibration dataset) and is faster. However, it can also result in reduced model accuracy from lost precision in the value of the weights.

Final Thoughts 💭

Quantization is not just a technical detail; it's a game-changer for making LLMs accessible and cost-effective.

By leveraging this technique, developers can democratize AI technology and deploy sophisticated language models on everyday CPUs.

So, whether you’re building intelligent chatbots, personalized recommendation engines, or innovative code generators, don’t forget to incorporate quantization into your toolkit—it might just be your secret weapon! 🚀

Happy learning 🧑‍🏫

LLM as a Judge 🧑‍⚖️

LLM-as-a-Judge is a powerful approach that uses an LLM to evaluate the responses of another LLM (or LLM system) against any specific criteria of your choice.

Potential issues with using an LLM as a Judge?

The non-deterministic nature of LLMs implies that even with controlled parameters, outputs may vary, raising concerns about the reliability of these judgments.

LLM Judge Prompt Example
prompt = """
You will be given 1 summary (LLM output) written for a news article published in Ottawa Daily. 
Your task is to rate the summary on how coherent it is to the original text (input). 

Original Text:
{input}

Summary:
{llm_output}

Score:
"""

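As a rough sketch of how such a prompt might be used, the snippet below fills in the template and parses the score out of the judge's reply. The shortened template, the 1 to 5 scale, and the function names are assumptions for illustration; the actual call to the judge model is left out because it depends on your provider.

```python
import re

# Shortened stand-in for the judge prompt above; the 1-5 scale is an assumption.
JUDGE_TEMPLATE = (
    "Rate the summary's coherence with the original text from 1 to 5.\n\n"
    "Original Text:\n{input}\n\nSummary:\n{llm_output}\n\nScore:"
)


def build_prompt(original_text: str, summary: str) -> str:
    """Fill the {input} and {llm_output} placeholders of the template."""
    return JUDGE_TEMPLATE.format(input=original_text, llm_output=summary)


def parse_score(reply: str, low: int = 1, high: int = 5):
    """Return the first integer in the reply if it lies in [low, high], else None."""
    match = re.search(r"-?\d+", reply)
    if match is None:
        return None
    score = int(match.group())
    return score if low <= score <= high else None


if __name__ == "__main__":
    prompt = build_prompt("City council approves new transit plan.", "Council OKs transit plan.")
    print(prompt)
    print(parse_score(" 4"))  # a reply like " 4" would come back from the judge model
```

Returning None for anything outside the expected range makes it easy to spot and retry replies where the judge did not follow the format.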
LLM Metrics 📊

  • Recall@k: It measures the proportion of all relevant documents that are retrieved in the top k results, and is crucial for ensuring the system captures a high percentage of pertinent information.

  • Precision@k: It complements recall by measuring the proportion of the top k retrieved documents that are relevant.

  • Mean Average Precision (MAP): It averages the per-query average precision, giving an overall measure of retrieval quality across different recall levels.

  • Normalized Discounted Cumulative Gain (NDCG): It is particularly valuable as it considers both the relevance and ranking of retrieved documents.
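The four metrics above can be sketched in a few lines of Python. This is a minimal version under some assumptions: document ids are hashable, k is positive, each query has at least one relevant document, and NDCG uses the plain (linear) gain with a log2 discount. The function names are my own.

```python
import math


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(doc in relevant for doc in retrieved[:k]) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)


def average_precision(retrieved, relevant):
    """Mean of precision@i over the ranks i where a relevant document appears."""
    hits, total = 0, 0.0
    for i, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)


def mean_average_precision(queries):
    """MAP: average of average_precision over (retrieved, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


def ndcg_at_k(retrieved, relevance, k):
    """DCG of the top k divided by the DCG of the ideal ordering.
    `relevance` maps document id -> graded relevance (missing ids count as 0)."""
    dcg = sum(relevance.get(doc, 0) / math.log2(i + 1)
              for i, doc in enumerate(retrieved[:k], start=1))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    retrieved = ["d1", "d2", "d3", "d4"]
    relevant = {"d1", "d3", "d5"}
    print(precision_at_k(retrieved, relevant, 2), recall_at_k(retrieved, relevant, 2))
    print(average_precision(retrieved, relevant))
    print(ndcg_at_k(retrieved, {"d1": 3, "d3": 2, "d5": 1}, 4))
```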

LLM Metric Types ⎐

Metrics for LLM calls can be broken up into two categories:

  • Absolute
  • Subjective

Absolute Metrics

Absolute metrics, such as latency and throughput, are easier to calculate.

Subjective Metrics

Subjective metrics are more difficult to calculate. They range from truthfulness, faithfulness, and answer relevancy to any custom metric your business cares about.

How to find the relevancy for subjective metrics?

Subjective metrics typically require some level of human reasoning to arrive at a numeric answer. Techniques used for evaluation are:

1. Human Evaluators

This is a time-intensive process, although it is sometimes considered the gold standard. It requires humans to go through and evaluate your answers. You need to select the evaluators carefully and make sure their grading instructions are clear.

It’s not unusual for a real-world LLM application to generate approximately 100,000 responses a month. I don’t know about you, but it takes me about 60 seconds on average to read through a few paragraphs and make a judgment about them. That adds up to around 6 million seconds, or about 69 consecutive days each month (without taking lunch breaks) to evaluate every single generated LLM response.

2. LLMs as a Judge

To use LLM-as-a-judge, first create an evaluation dataset and have humans grade it. Then iterate on the judge prompt until the LLM's grades generally agree with those of the human annotators.
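"Generally agree" can be made measurable. The sketch below compares human and LLM grades on the same items using the raw agreement rate and Cohen's kappa, which corrects for agreement that would happen by chance. Kappa is one common choice that I am suggesting here, not something the workflow above prescribes, and the grades in the example are made up.

```python
from collections import Counter


def agreement_rate(human, llm):
    """Fraction of items where the LLM grade equals the human grade."""
    return sum(h == m for h, m in zip(human, llm)) / len(human)


def cohens_kappa(human, llm):
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, llm)
    human_counts, llm_counts = Counter(human), Counter(llm)
    expected = sum(human_counts[label] * llm_counts[label]
                   for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both graders used one and the same label for every item
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    human_grades = [1, 1, 0, 0, 1, 0, 1, 1]  # made-up pass/fail labels
    llm_grades = [1, 0, 0, 0, 1, 0, 1, 1]
    print("agreement:", agreement_rate(human_grades, llm_grades))
    print("kappa:", cohens_kappa(human_grades, llm_grades))
```

You could re-run this after every prompt change and stop iterating once the score clears a threshold your team is comfortable with.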

Single-Layer Judge

The flow for a single-layer judge is shown below.

Multi-Layered Judgements ⵘ

We can also use a master LLM judge to review the judgements of the first-level judge, which can improve recall.

Why use sampling?

Evaluating a random sample of responses, rather than every single one, can be a good way to save resources.
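A minimal sketch of this idea with the standard library is shown below. The 5% fraction, the fixed seed, and the function name are arbitrary assumptions for illustration.

```python
import random


def sample_for_evaluation(responses, fraction=0.05, seed=0):
    """Pick a uniform random subset of responses to send to the judge."""
    sample_size = max(1, round(len(responses) * fraction))
    rng = random.Random(seed)  # fixed seed so the same sample can be drawn again
    return rng.sample(responses, min(sample_size, len(responses)))


if __name__ == "__main__":
    responses = [f"response-{i}" for i in range(1000)]  # stand-in for logged outputs
    to_judge = sample_for_evaluation(responses)
    print(len(to_judge), "of", len(responses), "responses selected")
```

With 100,000 responses a month, a 5% sample brings the judging workload down to 5,000 items while still giving a reasonable picture of overall quality.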

How to improve LLM Judgements? 📈

  • Use Chain of Thought (CoT) prompting by asking the judge to explain its reasoning process
  • Use few-shot prompting (this approach can be more computationally expensive)
  • Provide a reference guide for judgements
  • Evaluate based on QAG (Question Answer Generation)
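The first three tips can be combined in a single prompt builder. The sketch below is only one way to lay it out: the wording of the prompt, the example fields, and the function name are all assumptions, and QAG is not covered here.

```python
def build_judge_prompt(original_text, summary, examples=(), rubric=""):
    """Assemble a judge prompt with an optional reference guide, few-shot
    examples, and a request for step-by-step reasoning before the score."""
    parts = ["You are grading a summary for coherence with the original text."]
    if rubric:
        parts.append("Grading guide:\n" + rubric)
    # Each example is a dict with the keys: text, summary, reasoning, score.
    for ex in examples:
        parts.append(
            f"Original Text:\n{ex['text']}\n\nSummary:\n{ex['summary']}\n\n"
            f"Reasoning: {ex['reasoning']}\nScore: {ex['score']}"
        )
    parts.append(
        f"Original Text:\n{original_text}\n\nSummary:\n{summary}\n\n"
        "First explain your reasoning step by step, then give the score.\n"
        "Reasoning:"
    )
    return "\n\n---\n\n".join(parts)


if __name__ == "__main__":
    shots = [{
        "text": "The library extended its opening hours.",
        "summary": "The library is now open longer.",
        "reasoning": "The summary restates the main point without contradiction.",
        "score": 5,
    }]
    print(build_judge_prompt("Original article here.", "Summary here.",
                             examples=shots, rubric="5 = fully coherent, 1 = incoherent"))
```

Every few-shot example makes the prompt longer, which is where the extra compute cost mentioned above comes from.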