
Quantization in LLMs 🌐

In the rapidly evolving world of artificial intelligence, large language models (LLMs) have become pivotal in various applications, from chatbots to recommendation systems. However, deploying these advanced models can be challenging due to high memory and computational requirements.

This is where quantization comes into play!

Did you know?

GPT-3.5 is estimated to have around 175 billion parameters, while the state-of-the-art GPT-4 is reported to have in excess of 1 trillion parameters.

In this blog, let’s explore how quantization can make LLMs more efficient, accessible, and ready for deployment on edge devices. 🌍

What is Quantization? 🤔

Quantization is a procedure that maps the range of high precision weight values, like FP32, into lower precision values such as FP16 or even INT8 (8-bit Integer) data types. By reducing the precision of the weights, we create a more compact version of the model without significantly losing accuracy.

TL;DR

Quantization transforms high precision weights into lower precision formats to optimize resource usage without sacrificing performance.

Why Quantize? 🌟

Here are a few compelling reasons to consider quantization:

  1. Reduced Memory Footprint 🗄️
    Quantization dramatically lowers memory requirements, making it possible to deploy LLMs on lower-end machines and edge devices. This is particularly important as many edge devices only support integer data types for storage.

  2. Faster Inference
    Lower precision computations (such as integers) are inherently faster than higher precision (floats). By using quantized weights, mathematical operations during inference speed up significantly. Plus, modern CPUs and GPUs have specialized instructions designed for lower-precision computations, allowing you to take full advantage of hardware acceleration for even better performance!

  3. Reduced Energy Consumption 🔋
    Many contemporary hardware accelerators are optimized for lower-precision operations, capable of performing more calculations per watt of energy when models are quantized. This is a win-win for efficiency and sustainability!

Linear Quantization 📏

In linear quantization, we essentially perform scaling within a specified range. Here, the minimum value (Rmin) is mapped to its quantized minimum (Qmin), and the maximum (Rmax) to its quantized counterpart (Qmax).

The zero in the actual range corresponds to a specific zero_point in the quantized range, allowing for efficient mapping and representation.

To achieve quantization, we need to find the optimum way to project our range of FP32 weight values, which we’ll label [min, max], onto the INT4 space. One method of doing this is the affine quantization scheme, shown in the formula below:

$$ x_q = round(x/S + Z) $$

where:

  • x_q: the quantized INT4 value that corresponds to the FP32 value x

  • S: the scaling factor, a positive FP32 value that maps the real range onto the integer range

  • Z: the zero-point, i.e., the INT4 value that corresponds to 0 in the FP32 space

  • round: rounding of the resultant value to the closest integer
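
To make this concrete, here is a minimal NumPy sketch of the affine scheme for a signed 4-bit range [-8, 7]; the scale and zero-point are derived from the tensor’s [min, max], and the function names are purely illustrative.

import numpy as np

def affine_quantize(x, qmin=-8, qmax=7):
    # S maps the real range [min, max] onto the integer range [qmin, qmax]
    rmin, rmax = float(x.min()), float(x.max())
    S = (rmax - rmin) / (qmax - qmin)
    # Z is the integer that real 0.0 maps to: x_q = round(x / S + Z)
    Z = int(round(qmin - rmin / S))
    # np.int8 is only a container here, since NumPy has no 4-bit integer type
    x_q = np.clip(np.round(x / S + Z), qmin, qmax).astype(np.int8)
    return x_q, S, Z

def affine_dequantize(x_q, S, Z):
    # Recover approximate FP32 values: x ≈ S * (x_q - Z)
    return S * (x_q.astype(np.float32) - Z)

weights = np.array([-1.53, -0.21, 0.0, 0.74, 2.11], dtype=np.float32)
q, S, Z = affine_quantize(weights)
print(q, affine_dequantize(q, S, Z))  # the gap from `weights` is the quantization error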

Types of Quantization

PTQ 🛠️

As the name suggests, Post Training Quantization (PTQ) occurs after the LLM training phase.

In this process, the model’s weights are converted from higher precision formats to lower precision types, applicable to both weights and activations. While this enhances speed, memory efficiency, and power usage, it comes with an accuracy trade-off.

Beware of Quantization Error

During quantization, rounding or truncation introduces quantization error, which can affect the model’s ability to capture fine details in weights.
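
As a rough illustration of PTQ, the sketch below uses PyTorch’s dynamic quantization API to convert the Linear layers of an already-trained model to INT8 after training; exact APIs and supported layer types vary by PyTorch version, so treat this as a starting point rather than a definitive recipe.

import torch
import torch.nn as nn

# Stand-in for a model that has already been trained in FP32
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Post-training dynamic quantization: Linear weights are stored as INT8,
# activations are quantized on the fly at inference time
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.no_grad():
    # The small gap between the two outputs is the quantization error
    print((model(x) - quantized_model(x)).abs().max())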

QAT ⏰

Quantization-Aware Training (QAT) refers to methods of fine-tuning on data with quantization in mind. In contrast to PTQ techniques, QAT integrates the weight conversion process, i.e., calibration, range estimation, clipping, rounding, etc., during the training stage. This often results in superior model performance, but is more computationally demanding.

Tip

PTQ is easier to implement than QAT, as it requires less training data and is faster. However, it can also result in reduced model accuracy from lost precision in the value of the weights.
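
For comparison, here is a rough sketch of eager-mode QAT in PyTorch: fake-quantization modules are inserted before fine-tuning so the weights learn to tolerate INT8 rounding. A real setup usually also wraps the network with QuantStub/DeQuantStub and a full training loop, both omitted here for brevity.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model.train()

# Attach a QAT configuration: observers plus fake-quant modules that simulate
# INT8 rounding during the forward pass
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
qat_model = torch.quantization.prepare_qat(model)

# ... run the usual fine-tuning loop on qat_model here ...

# After fine-tuning, replace the fake-quantized modules with real INT8 ops
qat_model.eval()
int8_model = torch.quantization.convert(qat_model)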

Final Thoughts 💭

Quantization is not just a technical detail; it's a game-changer for making LLMs accessible and cost-effective.

By leveraging this technique, developers can democratize AI technology and deploy sophisticated language models on everyday CPUs.

So, whether you’re building intelligent chatbots, personalized recommendation engines, or innovative code generators, don’t forget to incorporate quantization into your toolkit—it might just be your secret weapon! 🚀

Happy learning 🧑‍🏫

LLM as a Judge 🧑‍⚖️

LLM-as-a-Judge is a powerful solution that uses LLMs to evaluate LLM responses based on any specific criteria of your choice, which means using LLMs to carry out LLM (system) evaluation.

Potential issues with using LLM as a Judge?

The non-deterministic nature of LLMs implies that even with controlled parameters, outputs may vary, raising concerns about the reliability of these judgments.

LLM Judge Prompt Example
prompt = """
You will be given 1 summary (LLM output) written for a news article published in Ottawa Daily. 
Your task is to rate the summary on how coherent it is to the original text (input). 

Original Text:
{input}

Summary:
{llm_output}

Score:
"""

LLM Metrics 📊

  • Recall@k: It measures the proportion of all relevant documents retrieved in the top k results, and is crucial for ensuring the system captures a high percentage of pertinent information.

  • Precision@k: It complements Recall@k by measuring the proportion of the top k retrieved documents that are relevant.

  • Mean Average Precision (MAP): It provides an overall measure of retrieval quality across different recall levels.

  • Normalized Discounted Cumulative Gain (NDCG): It is particularly valuable as it considers both the relevance and ranking of retrieved documents.
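
As a simple illustration, the sketch below computes Precision@k and Recall@k for a single query; the document IDs are made up, and production metrics would typically be averaged over many queries.

def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    # Precision@k: fraction of the top-k results that are relevant
    # Recall@k: fraction of all relevant documents that appear in the top-k
    top_k = retrieved_ids[:k]
    hits = len(set(top_k) & set(relevant_ids))
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# 3 of the top-5 retrieved documents are relevant, out of 4 relevant documents overall
p, r = precision_recall_at_k(["d1", "d7", "d3", "d9", "d4"], ["d1", "d3", "d4", "d8"], k=5)
print(p, r)  # 0.6, 0.75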

LLM Metric Types

Metrics for LLM calls can be broken up into two categories:

  • Absolute
  • Subjective

Absolute Metrics

Absolute metrics, such as latency and throughput, are easier to calculate.

Subjective Metrics

Subjective metrics are more difficult to calculate. They range from truthfulness, faithfulness, and answer relevancy to any custom metric your business cares about.

How do we score subjective metrics?

Scoring a subjective metric typically requires a level of human reasoning to arrive at a numeric answer. The techniques used for evaluation are:

1. Human Evaluators

This is a time-intensive process, although it is often considered the gold standard. It requires humans to go through and evaluate each answer. You need to select the evaluators carefully and make sure their instructions on how to grade are clear.

It’s not unusual for a real-world LLM application to generate approximately 100,000 responses a month. I don’t know about you, but it takes me about 60 seconds on average to read through a few paragraphs and make a judgment about them. That adds up to around 6 million seconds, or roughly 70 consecutive days each month (without taking lunch breaks) to evaluate every single generated LLM response.

2. LLMs as a Judge

To use LLM-as-a-judge, you have to iterate on a prompt until the human annotators generally agree with the LLM’s grades. An evaluation dataset should be created and graded by a human.

Single-Layer Judge

In a single-layer setup, a single judge LLM scores each response directly against the chosen criteria.

Multi-Layered Judgements

We can also use a master LLM judge to review the judgements of the first-level judge, which helps improve recall.

Why use sampling?

Evaluating a random sample of responses, rather than every response, can be a good way to save resources.

How to improve LLM Judgements? 📈

  • Use Chain-of-Thought (CoT) prompting: ask the judge to explain its reasoning before giving a score (see the example prompt below)
  • Use few-shot prompting: include pre-graded examples in the prompt, although this can be more computationally expensive
  • Provide a reference guide (rubric) for judgements
  • Evaluate based on QAG (Question Answer Generation)
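
The prompt below is an illustrative sketch combining the first and third techniques: the judge is asked to reason step by step (CoT) and is given a simple reference rubric. The criteria and the 1-5 scale are assumptions you would adapt to your own use case.

judge_prompt = """
You will be given an original text and a summary of it.

Reference rubric for coherence:
5 - faithful to the original and logically ordered
3 - mostly faithful, with some missing or reordered details
1 - contradicts or largely ignores the original text

First, explain your reasoning step by step.
Then output a final line of the form: Score: <1-5>

Original Text:
{input}

Summary:
{llm_output}
"""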

What are embeddings?

Embeddings are numerical representations of real-world objects that machine learning (ML) and artificial intelligence (AI) systems use to understand complex knowledge domains like humans do.

Example

A bird–nest pair and a lion–den pair are analogous, while day and night are opposites. Embeddings convert real-world objects into complex mathematical representations that capture inherent properties and relationships between real-world data. The entire process is automated, with AI systems self-creating embeddings during training and using them as needed to complete new tasks.

Advantages of using embeddings

Dimensionality reduction

Data scientists use embeddings to represent high-dimensional data in a low-dimensional space. In data science, the term dimension typically refers to a feature or attribute of the data. Higher-dimensional data in AI refers to datasets with many features or attributes that define each data point.

Training large language models

Embeddings improve data quality when training/re-training large language models (LLMs).

Types of embeddings

  • Image Embeddings - With image embeddings, engineers can build high-precision computer vision applications for object detection, image recognition, and other visual-related tasks.

  • Word Embeddings - With word embeddings, natural language processing software can more accurately understand the context and relationships of words.

  • Graph Embeddings - Graph embeddings extract and categorize related information from interconnected nodes to support network analysis.

What are Vectors?

ML models cannot interpret information intelligibly in its raw format and require numerical data as input. They use neural network embeddings to convert real-world information into numerical representations called vectors.

Vectors are numerical values that represent information in a multi-dimensional space. They help ML models to find similarities among sparsely distributed items.

For example, consider the following titles and their corresponding embeddings:

  • The Conference (Horror, 2023, Movie): embedding (1.2, 2023, 20.0)
  • Upload (Comedy, 2023, TV Show, Season 3): embedding (2.3, 2023, 35.5)
  • Crypt Tales (Horror, 1989, TV Show, Season 7): embedding (1.2, 1989, 36.7)
  • Dream Scenario (Horror-Comedy, 2023, Movie): embedding (1.8, 2023, 20.0)
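
As a small illustration of how a model might compare these vectors, the sketch below computes cosine similarity between them with NumPy; in practice the features would be normalized first, since the raw year value dominates the other dimensions here.

import numpy as np

titles = {
    "The Conference": np.array([1.2, 2023, 20.0]),
    "Upload":         np.array([2.3, 2023, 35.5]),
    "Crypt Tales":    np.array([1.2, 1989, 36.7]),
    "Dream Scenario": np.array([1.8, 2023, 20.0]),
}

def cosine_similarity(a, b):
    # 1.0 means the two vectors point in exactly the same direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = titles["The Conference"]
for name, vec in titles.items():
    print(name, round(cosine_similarity(query, vec), 4))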

Embedding Models

Data scientists use embedding models to enable ML models to comprehend and reason with high-dimensional data.

Types of embedding models are shown below

PCA

Principal component analysis (PCA) is a dimensionality-reduction technique that reduces complex data types into low-dimensional vectors. It finds data points with similarities and compresses them into embedding vectors that reflect the original data.
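
A minimal scikit-learn sketch of PCA-based dimensionality reduction is shown below; the random data and the choice of two components are purely illustrative.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 50)        # 100 items, each described by 50 features
pca = PCA(n_components=2)          # compress each item down to 2 dimensions
embeddings = pca.fit_transform(X)

print(embeddings.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)    # how much variance each component retains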

SVD

Singular value decomposition (SVD) is an embedding model that transforms a matrix into its singular matrices. The resulting matrices retain the original information while allowing models to better comprehend the semantic relationships of the data they represent.
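
The NumPy sketch below factors a small random matrix with SVD and rebuilds a low-rank approximation from its top singular values; the matrix size and the chosen rank are arbitrary choices for illustration.

import numpy as np

A = np.random.rand(6, 4)                         # original data matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                            # keep only the top-2 singular values
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.linalg.norm(A - A_approx))              # reconstruction error of the rank-2 approximation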
