Doubleword

    Deep Learning Glossary

    We've cut through the jargon so you don't have to (and we've built best-in-class infrastructure so you don't have to do that either!)

    A

    AGI

    Artificial General Intelligence (AGI), in simple terms, is the point at which artificial intelligence (AI) can perform all human cognitive tasks better than the smartest human, including the ability to teach itself.

    There are two broad debates within the machine learning field around AGI: 1) how to define the point at which AGI has been reached, and 2) whether developing an AGI system is possible at all.

    Currently, researchers are using all types of tests (the Turing Test, Steve Wozniak's coffee test, the bar exam, CFA exams, medical exams) as ways of measuring whether AI is close to reaching AGI, although no strict criteria exist to determine the point at which AGI will have been achieved. The general consensus remains that AGI has not yet been reached.

    Whilst researchers have previously claimed AGI would never be reached, or would be 50-100 years away, it is the openly stated goal of companies including OpenAI, Google DeepMind, and Anthropic. Geoffrey Hinton (one of the godfathers of AI) now believes achieving AGI is likely much less than 30 years away, whilst Anthropic's CEO, Dario Amodei, believes AGI is only 2-3 years away.


    API

    An API (Application Programming Interface) is a set of rules and protocols which allow different software applications to communicate and interact with each other. It enables developers to access specific features or data from external services, libraries, or platforms, making it useful for building AI-powered applications.

    In terms of AI adoption, many enterprises currently rely on API-based model deployments. This is because, historically, proprietary large language models, including GPT-4, have been considered the gold standard, whilst open source models were seen as significantly cheaper but, ultimately, poor-quality substitutes. Yet in 2023 there were significant improvements in the quality of open source models. In December, Mistral AI's Mixtral demonstrated significantly better performance than GPT-3.5. As major players, including Middle Eastern nations and Meta, continue to invest heavily in this space, we expect Llama 3 (or an equivalent) to be as good as, if not better than, GPT-4.

    API model deployment
    API model deployment is an effortless process.
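    As a sketch of what calling an API-based model involves, the snippet below builds the JSON body a typical hosted-model endpoint expects. The URL, model name, and field names here are illustrative assumptions, not any real provider's API — the point is that this payload (your data) is what leaves your environment.

```python
import json

# Hypothetical endpoint, for illustration only.
API_URL = "https://api.example.com/v1/chat/completions"

def build_request(prompt, model="example-model", max_tokens=256):
    """Build the JSON body a hosted LLM endpoint typically expects."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

body = build_request("Summarise this contract in one sentence.")
payload = json.dumps(body)  # this is the data that leaves your secure environment
```

    In a real application this payload would be POSTed to the provider's endpoint, and the response would travel back over the same network path.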

    API-based large language models

    API-based generative AI models (including ChatGPT, Bard, Cohere, Claude, LLaMA and PaLM) are hosted on external servers, meaning that every time the model is called, both the data and the responses are sent outside a business's secure environment to the environment where the model is hosted.

    Whilst this makes deployment effortless, it is not the most private and secure form of large language model deployment. Instead, self-hosting is considered the gold standard for private and secure large language model deployments.

    API vs self-hosted deployment comparison
    These are the differences typically between API-based large language model deployments and self-hosted deployments.

    Activation aware quantization (AWQ)

    Activation aware quantization (AWQ) is a process for quantizing large language models which largely maintains accuracy without the memory overhead of Quantization Aware Training. Quantizing very large language models is difficult because of outliers.

    Outliers are weights in the network which take on very large values. These large values can skew the distribution of weights at quantization time, making it harder to maintain performance whilst reducing weight precision. AWQ accounts for these outlier values during the quantization process by calculating scale factors to offset them, thereby maintaining model performance.

    AWQ quantization diagram
    Image credits: J. Lin et al, Source: arxiv.org/pdf/2306.00978.pdf
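    A highly simplified NumPy sketch of the AWQ idea (the toy weights, the scaling exponent, and the per-tensor int8 scheme are illustrative assumptions, not the paper's exact algorithm): input channels that see large activations are scaled up before quantization, and the inverse scale is folded back in afterwards, so their quantization error shrinks.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return q, scale

def dequantize(q, scale):
    return q * scale

# Toy weight matrix; channel 3 is "salient" because its activations are large.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)) * 0.02
act_magnitude = np.ones(8)
act_magnitude[3] = 50.0   # an activation outlier on input channel 3

# AWQ idea (simplified): scale salient channels up before quantization,
# then undo the scaling after dequantization (folded into activations in practice).
alpha = 0.5
s = act_magnitude ** alpha
q, scale = quantize_int8(W * s)
W_hat = dequantize(q, scale) / s
```

    The division by `s` would, in a real kernel, be fused into the preceding activation computation so that inference cost is unchanged.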

    Agent

    Agents are language models which are given persistent access to tooling and memory in order to solve an open-ended task. Agents might be given access to third-party APIs, code interpreters, or a scratch pad to record previously generated texts and told to use them when appropriate in order to complete a task. An agent operates autonomously (not directly controlled by a human operator).

    There are a number of types of agents within machine learning: 1) Reactive: the agent responds to stimuli from its environment in order to achieve a goal 2) Proactive: the agent takes initiative and plans ahead in order to achieve a goal 3) Fixed environment: the agent operates under a static set of rules 4) Dynamic environment: the rules are constantly changing, so the agent must regularly adapt to new circumstances 5) Single agent: a lone agent solves the task on its own 6) Multi-agent system: many agents work together to achieve a common goal

    Agent architecture diagram
    Image credits: Z. Xi et al, Source: arxiv.org/pdf/2309.07864.pdf
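    A minimal sketch of an agent loop. The "model" here is a hard-coded stub that decides the next action; a real agent would use an LLM for that decision. The tool name, scratchpad, and dispatch logic are all illustrative.

```python
def calculator(expression):
    """A toy tool the agent can call."""
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def stub_model(task, scratchpad):
    """Stand-in for a language model: pick the next action given past results."""
    if not scratchpad:
        return {"action": "calculator", "input": "6 * 7"}
    return {"action": "finish", "input": f"The answer is {scratchpad[-1]}"}

def run_agent(task, max_steps=5):
    scratchpad = []  # persistent memory of previous tool results
    for _ in range(max_steps):
        step = stub_model(task, scratchpad)
        if step["action"] == "finish":
            return step["input"]
        result = TOOLS[step["action"]](step["input"])
        scratchpad.append(result)
    return "gave up"

answer = run_agent("What is 6 * 7?")
```

    The loop structure — decide, act, record, repeat until done — is the core pattern; production agents add real tool APIs, error handling, and an actual model in place of the stub.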

    Attention

    Humans can naturally determine the context behind a word which has multiple meanings (a homonym), for example differentiating between "spring" the season and "spring" the metal coil. In large language models, the mechanism that performs this role is known as "attention". Attention mechanisms are therefore integral to large language models, underpinning their ability to understand and generate natural language.

    In machine learning, there are two types of attention typically talked about: 1) Self-attention: a core mechanism in models like transformers, it evaluates the importance of elements within the same input sequence (e.g. words in a sentence) by computing attention scores among them. This enables the model to assign varying degrees of importance to each element based on its relationships with the others, leading to richer context understanding. 2) Multi-head attention: extends self-attention by employing multiple parallel attention mechanisms, or "heads". Each head focuses on different aspects of the input, allowing the model to capture diverse relationships and representations simultaneously. The outputs of the individual heads are then combined to provide a comprehensive understanding of the data, enhancing model performance across a variety of machine learning tasks.
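    Self-attention can be sketched in a few lines of NumPy: scores are computed between every pair of positions, normalized with a softmax so each row forms a distribution, and used to mix the value vectors. The dimensions here are arbitrary toy choices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity between every pair of tokens
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
```

    Multi-head attention runs several such computations in parallel with different projection matrices and concatenates the results.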


    Auto regressive model

    An auto regressive model is a type of generative model that generates output one token at a time, where each new token is conditioned on all previously generated tokens. This sequential generation process allows the model to produce coherent and contextually relevant outputs.
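    The token-by-token loop can be sketched with a toy "model" — here just a bigram lookup table standing in for a neural network — to show how each new token is conditioned on what came before:

```python
# Toy autoregressive generation: each next token depends on the previous one.
BIGRAMS = {
    "<s>": "the",
    "the": "cat",
    "cat": "sat",
    "sat": "<end>",
}

def generate(max_tokens=10):
    tokens = []
    prev = "<s>"
    for _ in range(max_tokens):
        nxt = BIGRAMS.get(prev, "<end>")  # condition on the previous token
        if nxt == "<end>":
            break
        tokens.append(nxt)
        prev = nxt
    return tokens

print(generate())  # ['the', 'cat', 'sat']
```

    A real large language model conditions each step on the entire preceding sequence (not just the last token) and samples from a learned probability distribution rather than a fixed table.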


    Autoscaling

    Autoscaling is a cloud computing feature that automatically adjusts the number of computational resources (such as virtual machines) allocated to an application based on its current workload. This ensures optimal performance and cost efficiency as AI workloads fluctuate.

    B

    BERT

    BERT is a language model architecture commonly used for classification and embedding tasks; the term is often used interchangeably with "encoder-only language model". It was created by machine learning researchers at Google in 2018.


    Backpropagation

    Backpropagation is arguably the most fundamental building block in a neural network. It was popularized by Rumelhart et al in a paper entitled "Learning representations by back-propagating errors".

    Backpropagation is a process used to calculate the gradients of the weights of a machine learning model from a batch of input data. This is followed by the optimization step, where those gradients are used to update the model.
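    A minimal worked example for a one-parameter linear model: the forward pass computes predictions, the backward pass computes the gradient of the loss with respect to the weight via the chain rule, and the optimization step applies the update.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])  # true relationship: y = 2x
w = 0.0
lr = 0.1

def loss(w):
    return np.mean((w * x - y) ** 2)

for _ in range(50):
    pred = w * x                        # forward pass
    grad = np.mean(2 * (pred - y) * x)  # backward pass: dL/dw by the chain rule
    w -= lr * grad                      # optimization step
```

    After training, `w` converges to 2.0. In a deep network, backpropagation applies the same chain rule layer by layer, from the loss back to every parameter.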


    Bandwidth

    Bandwidth refers to the maximum data transfer rate of a network or internet connection. In the context of AI, sufficient bandwidth is crucial for transmitting large datasets, facilitating real-time communication with AI models, and ensuring smooth operations.


    Batch

    A batch of inputs is a set of inputs which are processed by a machine learning model in parallel. This is an effective technique to increase throughput when using GPUs.


    Batch accumulation

    Batch accumulation (also known as gradient accumulation) is a technique for reducing the GPU memory requirements of machine learning training. In a normal training step, the gradients with respect to each parameter of the model are computed for the whole batch, and a single update is performed. If the batch size is too large, computing these gradients can cause an "out-of-memory" error. With gradient accumulation, you instead accumulate the gradients in place across a set of smaller batches. This trades time for memory, allowing training to proceed as if the GPU had more VRAM, at the cost of a longer training time.

    Batch accumulation diagram
    Image credits: Precisely Docs
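    The key property can be shown in NumPy with a toy linear model: for equally sized micro-batches, the average of the micro-batch gradients equals the full-batch gradient, so accumulating over micro-batches gives the same update.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = np.zeros(3)

def grad(Xb, yb, w):
    """Gradient of mean-squared-error loss for a linear model."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = grad(X, y, w)

accumulated = np.zeros(3)
micro_batches = 4
for Xb, yb in zip(np.split(X, micro_batches), np.split(y, micro_batches)):
    accumulated += grad(Xb, yb, w)  # only one micro-batch in memory at a time
accumulated /= micro_batches
```

    The accumulated gradient matches the full-batch gradient exactly, which is why the technique changes memory usage and wall-clock time but not the training trajectory (for loss functions that average over examples).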

    Batch size

    Batch size is simply the number of distinct items in a batch.


    Big data

    Big data is a term that has been used to describe tools for managing and processing massive data sets. Usually, these tools have to be specifically architected to extract, transform and manage fractions of these huge datasets in an efficient streaming manner.

    C

    CI/CD Pipelines

    CI/CD pipelines refer to the automated processes of Continuous Integration (CI) and Continuous Delivery or Continuous Deployment (CD). CI involves automatically integrating code changes from multiple contributors into a single software project, usually accompanied by automated testing to ensure code quality. CD extends this by automating the release of the tested changes to a staging or production environment, enabling rapid and reliable software development and deployment.


    CPU

    The CPU (central processing unit) is the heart of any modern computer. It executes sequences of instructions across a small number of cores.

    CPUs can also be used for machine learning inference, but their architecture is less well suited to accelerate massively parallelizable modern machine learning architectures than GPUs. CPUs are, however, significantly cheaper and easier to come by than GPUs, and can be effective for machine learning, especially at inference time, and for small batch sizes.


    Chain of thought prompting

    Chain of thought prompting is a prompting method which involves asking language models to "think step-by-step" in order to encourage longer outputs with more steps that resemble reasoning about the problem. This method has been shown to improve the output of models for complex reasoning tasks.

    Chain of thought prompting example
    Image credits: J. Wei et al, Source: arxiv.org/pdf/2201.11903.pdf
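    A sketch of how such a prompt might be assembled. The worked example mirrors the style used in chain-of-thought papers; the exact wording and the trigger phrase are illustrative choices, not a fixed recipe.

```python
def chain_of_thought_prompt(question):
    """Build a prompt with one worked example plus a step-by-step trigger."""
    example = (
        "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
        "How many tennis balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
        "5 + 6 = 11. The answer is 11.\n\n"
    )
    return example + f"Q: {question}\nA: Let's think step by step."

prompt = chain_of_thought_prompt(
    "If a train travels 60 miles in 1.5 hours, what is its average speed?"
)
```

    The worked example shows the model the shape of a reasoning trace, and the trailing "Let's think step by step." encourages it to produce one for the new question.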

    Classification

    Classification is the task of placing data into one of a fixed set of buckets.


    Classification model

    A classification model is a model designed to place inputs into one of a fixed number of buckets.


    Cloud computing

    Before the advent of cloud computing, companies looking to sell digital services (an online store, a social media platform, etc.) had to buy their own computing hardware, which required a great deal of expertise to set up, maintain, and troubleshoot. Cloud computing is the name for the paradigm shift the computing industry has undergone over the last few decades: cloud providers, usually spun out of large IT-focused companies which had built substantial experience operating computing hardware themselves, began to rent access to this hardware over the network. With cloud computing services, users can pay by the day for virtual machines (VMs) with various capabilities, pay by the gigabyte for storage, or access many other services on-demand. (See on-prem.)


    Compression

    Compression in AI typically refers to model compression. Model compression helps to reduce the size of the neural network, without significant accuracy loss. There are four main types of model compression often used by machine learning engineers:

    1. Quantization 2. Pruning 3. Knowledge distillation 4. Low-rank factorization


    Containerized

    Containerization involves encapsulating software in a container with its own operating environment, libraries, and dependencies, ensuring consistency and efficiency across different computing environments. Containers offer lightweight, portable units for application development, deployment, and management, facilitating faster delivery and scalability. This technology isolates applications from the underlying infrastructure, enhancing security, and making it easier for teams to collaborate on and deploy applications regardless of the host system.


    Context Length

    Context length describes the upper limit of tokens that the model can recall during text generation. A longer context window allows the model to understand long-range dependencies in text better.


    Continuous batching

    Continuous batching is an algorithm which increases the throughput of large language model (LLM) serving. It allows the size of the batch on which a machine learning model is working to grow and shrink dynamically over time. This means responses are served to users more quickly at high load (significantly higher throughput).

    D

    Data parallelism

    Data parallelism involves running models in parallel on different devices where each sees a distinct subset of data during training or inference. It accelerates training deep learning models, reducing training time significantly.

    It is often confused with model parallelism, but both can be applied in tandem to make the most of available resources, reduce training and deployment times, and optimize the end-to-end machine learning process for better results and faster development. See model parallelism.


    Deep learning

    Deep learning is the process of producing deep neural networks to solve machine learning problems.


    Deep neural network (DNN)

    A neural network is typically considered a deep neural network (DNN) when it is complex, usually with at least two hidden layers. Neural networks consist of a series of layers, each of which performs a successive transformation on the data passed into the model. Layers usually consist of a linear operation followed by a simple, elementwise nonlinearity. The composition of many such simple operations can be used to build up any data transformation. At the same time, their relative simplicity means they map well to modern computer hardware. (See also deep learning.)
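    The layer structure described above can be sketched in NumPy: each layer is a linear operation followed by an elementwise nonlinearity (here ReLU), and the layers are composed in sequence. The layer sizes are arbitrary toy choices.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def forward(x, layers):
    """Compose linear + nonlinearity layers; final layer is left linear."""
    for W, b in layers[:-1]:
        x = relu(x @ W + b)  # linear operation + simple elementwise nonlinearity
    W, b = layers[-1]
    return x @ W + b

rng = np.random.default_rng(0)
sizes = [4, 16, 16, 2]  # input dim, two hidden layers, output dim
layers = [(rng.normal(size=(m, n)) * 0.1, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

out = forward(rng.normal(size=(5, 4)), layers)  # batch of 5 inputs
```

    Training such a network means adjusting each `(W, b)` pair via backpropagation; here the weights are random, so the forward pass only illustrates the shape of the computation.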


    Distillation

    Distillation is the process of using a larger model to train smaller models. This has been shown to be more effective than training small models from scratch. It can involve using intermediate states of the larger model to assist the smaller model, or using a large generative model to produce new text on which the smaller model is trained.

    Distillation process diagram
    Image credits: C. Hsieh et al, Source: arxiv.org/pdf/2305.02301.pdf
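    One common formulation — a sketch using the classic temperature-softened soft-target loss, rather than any specific paper's recipe — trains the student to match the teacher's softened output distribution:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    p = softmax(teacher_logits / T)  # soft targets from the larger model
    q = softmax(student_logits / T)
    return float(np.sum(p * np.log(p / q)))

teacher = np.array([4.0, 1.0, 0.5])
student = np.array([3.0, 1.5, 0.2])
loss = distillation_loss(student, teacher)
```

    Raising the temperature `T` softens the teacher's distribution so the student also learns from the relative probabilities of the wrong classes, not just the top prediction.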

    Docker

    Docker is an open-source platform that automates the deployment of applications inside software containers, providing an additional layer of abstraction and automation of operating-system-level virtualization on Linux. It enables developers to package applications with all of their dependencies into a standardized unit for software development, ensuring that it works seamlessly in any environment. Docker simplifies the process of managing applications in containers, making it easier to create, deploy, and run applications by using containers.


    Dynamic batching

    Dynamic batching is a process of adjusting the batch size run during inference to match the incoming traffic. During times of high traffic, the model runs at large batches to maximize GPU utilization, and during times of low traffic, a lower batch size is used to minimize time spent waiting for additional requests.
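    A toy simulation of the idea (queue handling, timeouts, and the actual model call are omitted): batches are sized to match whatever traffic has arrived, up to a maximum, rather than using a fixed batch size.

```python
from collections import deque

MAX_BATCH = 4

def drain_batches(queue):
    """Pull batches off the request queue, sized to match incoming traffic."""
    batches = []
    while queue:
        take = min(MAX_BATCH, len(queue))
        batches.append([queue.popleft() for _ in range(take)])
    return batches

# High traffic: full batches maximize GPU utilization.
high = drain_batches(deque(range(10)))
# Low traffic: a small batch, so no time is wasted waiting for more requests.
low = drain_batches(deque(range(1)))
```

    A production server would also apply a short timeout so that a request arriving alone is not held indefinitely waiting for the batch to fill.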

    E

    Encoder

    A machine learning model designed to produce a representation of the input data which can be used for further downstream processing. Encoders are used to populate vector databases, or can be combined with decoder models for generation. They benefit tasks including data compression, anomaly detection, transfer learning, and recommendation systems.


    Epoch

    During model training, an epoch has passed when the model has seen all pieces of data in the training set once.

    F

    F-score

    Accuracy is a common metric for assessing the performance of binary classification models. However, it can be a difficult metric to interpret properly when the number of examples of different classes is highly unbalanced. This is where the F-score, also known as the F1-score, comes in: it combines precision and recall into a single score to provide a balanced measure of a model's accuracy.

    It is calculated as the harmonic mean of precision and recall, and it balances the trade-off between precision (the ratio of true positives to all predicted positives) and recall (the ratio of true positives to all actual positives). This balance is important when dealing with situations where one metric may be favoured over the other.

    The F-score ranges between 0 and 1, with higher values indicating better model performance. It is particularly useful when you want to strike a balance between precision and recall, such as in information retrieval, medical diagnoses, or fraud detection, where false positives and false negatives have different consequences.

    F-score formula
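    The calculation is straightforward to write out (binary labels assumed). The imbalanced toy example below shows why F1 is more informative than raw accuracy when positives are rare:

```python
def f1_score(y_true, y_pred):
    """F1 as the harmonic mean of precision and recall (binary labels)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Imbalanced example: 3 positives out of 8; the classifier catches only one.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
```

    Here accuracy is 5/8 despite the classifier missing two of the three positives; the F1 of 0.4 reflects that weakness directly.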

    Falcon

    Falcon is a family of large language models released by the UAE's Technology Innovation Institute (TII). Falcon's 40B model was trained on AWS Cloud continuously for two months with 384 GPUs. The pre-training data largely consisted of public web data, with a few sources drawn from research papers and social media conversations. Offering high performance whilst being more cost-effective than competitors, Falcon garnered the #1 spot on Hugging Face's open large language model leaderboard.


    Few shot learning

    Few shot learning is the ability of a model to learn new behaviours, having been shown only a "few" examples of the desired behaviour.

    It is typically useful for: 1. Scarcity of data: Collecting large amounts of labelled data is impractical and/or costly. 2. Rapid adaptation: Enables models to quickly adapt to new tasks or classes without the need for extensive retraining. 3. Efficient training: Requires less computational resources and time compared to traditional deep learning methods. 4. Generalization: Encourages models to generalize from the limited number of examples available. 5. Low resource settings: Particularly helpful in low-resource settings where collecting extensive labelled data is challenging.

    Few shot learning is not to be confused with zero shot learning and one shot learning.


    Few shot prompting

    Few shot prompting is the ability of a model to learn new behaviours having only been shown a "few" examples of the desired behaviour as part of its input prompt. It can help with a variety of tasks without the need for extensive fine-tuning or training on specific examples.

    Use cases include: 1. Custom chatbots: For developing custom chatbots and virtual assistants, few shot prompting is helpful when guiding the model's responses. 2. Question answering: You can provide a question as the prompt, along with relevant context, to help the model generate accurate answers. 3. Creative writing and storytelling: Few shot prompting is helpful for seeding creativity, generating stories, or assisting with narrative generation.
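    A sketch of assembling a few-shot prompt for sentiment classification (the task, labels, and formatting are illustrative): a handful of labelled examples are placed in the prompt itself, so the model can infer the task without any fine-tuning.

```python
def few_shot_prompt(examples, query):
    """Format labelled examples plus a new query into a single prompt."""
    shots = "\n".join(
        f"Review: {text}\nSentiment: {label}" for text, label in examples
    )
    return f"{shots}\nReview: {query}\nSentiment:"

examples = [
    ("The food was wonderful.", "positive"),
    ("Service was painfully slow.", "negative"),
    ("Decent coffee, nothing special.", "neutral"),
]
prompt = few_shot_prompt(examples, "Absolutely loved the dessert!")
```

    The prompt deliberately ends mid-pattern (`Sentiment:`) so the model's natural continuation is the label for the new example.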


    Foundation model

    A foundation model, also known as "general purpose artificial intelligence" or "GPAI", is a large-scale, pre-trained machine learning model which serves as a base for further fine tuning and customization. It is usually trained on vast amounts of text data in order to capture general language patterns.

    Foundation models form the basis of many applications including: 1. OpenAI's ChatGPT 2. Microsoft's Bing 3. Midjourney 4. Adobe Photoshop's generative fill tools 5. Many other chatbots

    G

    GPT

    GPT stands for Generative Pre-trained Transformer. It is a type of decoder-only transformer, first defined by OpenAI, which underlies the famous GPT-3/GPT-4 series of models.


    GPU

    GPUs (graphics processing units) are a type of computing hardware which can perform very many computations in parallel. This makes them useful in graphics applications; however, they have since become a crucial tool within machine learning. Used in both training and inference, they have significantly accelerated training times for deep learning models, enabling the development of state-of-the-art AI systems.

    However, there are a number of challenges associated with GPU usage in machine learning: 1. High costs: High-performance GPUs can be expensive 2. Compatibility: GPUs require compatible hardware and software 3. Energy usage: GPUs consume a substantial amount of power 4. Parallelism: Not all machine learning algorithms are highly parallelizable 5. Memory constraints: GPUs have limited memory compared to traditional CPUs 6. Driver and software updates: Must be kept up to date for optimal performance 7. Portability: Deploying GPU-based models in resource-constrained environments can be challenging 8. Vendor lock-ins: Different GPU vendors may have specific tools and libraries


    Generative AI (GenAI)

    Generative models model the entire distribution of the data on which they were trained. Generative AI is the new wave of such models, which can produce everything from art and language to music and video. Its ability to produce such diverse and creative content has led to its rapid adoption across industries and revolutionized how humans generate, create, and interact with data and media.

    Prominent examples of generative AI applications: 1. Text generation: Models like GPT-4 can generate human-like text 2. Image generation: Generative adversarial networks (GANs) can create realistic images 3. Music composition: AI algorithms can compose music 4. Video synthesis: AI can generate video content 5. Drug discovery: AI-driven generative models can suggest new chemical compounds

    H

    HIPAA

    HIPAA, the Health Insurance Portability and Accountability Act of 1996, is a United States legislation that provides data privacy and security provisions for safeguarding medical information. It sets standards for the protection of sensitive patient health information, ensuring that it is handled with confidentiality and security. HIPAA applies to healthcare providers, health plans, healthcare clearinghouses, and business associates of those entities that process health information.


    Hallucination

    Hallucination is the tendency of generative models to produce plausible-sounding but ultimately incorrect completions of a prompt. Machine learning researchers have found retrieval augmented generation (RAG) to be a successful technique for reducing hallucinations.


    Human in the loop

    Machine learning models can sometimes be unreliable. A human in the loop system is one in which machine learning inferences are assessed continually by a human operator. For example, GitHub's Copilot system has a human in the loop - in that the responses from the code model are accepted or rejected by the human writing the code.

    In essence, using a human in the loop approach is valuable in situations where human judgement, expertise, or oversight is essential to enhance the performance, safety, ethics, and overall reliability of AI systems.

    Human in the loop diagram
    Image credits: Anderson Anthony
    I

    Inference

    Inference in generative AI refers to the process where a trained generative model generates new data samples based on learned patterns and structures. This process involves the model taking input (which can be minimal or even none in some cases) and producing output that aligns with the distribution of the data it was trained on.

    For example, in a generative AI model trained on images, inference would be the act of the model creating a new image that resembles the types of images it has seen during training. Similarly, in text-based models like GPT-3 or GPT-4, inference involves generating text that is coherent and contextually appropriate based on the input prompt.


    Inference Server

    Inference servers are the "workhorse" of AI applications: they are the bridge between a trained AI model and real-world, useful applications. An inference server is specialised software that efficiently manages and executes these crucial inference tasks.

    The inference server handles requests to process data, runs the model, and returns results. An inference server is deployed on a single "node" (a GPU, or group of GPUs), and is scaled across nodes for elastic scale through integrations with orchestration tools like Kubernetes.


    Inference optimization

    Inference optimization is the process of making machine learning models run quickly at inference time. This might include model compilation, pruning, quantization, or other general-purpose code optimizations. The result is improved efficiency, speed, and resource utilization.

    The use of inference optimization matters for several reasons: 1. Efficiency: Optimizing inference ensures predictions are made quickly 2. Cost reduction: Efficient inference leads to reduced hardware and operational costs 3. Scalability: Optimized inference allows for seamless scalability 4. Energy efficiency: Contributes to energy savings 5. Resource compatibility: Models can be deployed on a wide range of hardware 6. Enhanced user experience: Faster inference reduces waiting times 7. Deployment flexibility: Optimized models are easier to deploy across various environments


    Instruction tuning

    Instruction tuning is the process of fine tuning a language model on datasets of instruction-output pairs. The purpose is to make the model more likely to follow instructions given by the user, as opposed to simply continuing the instruction text.

    Instruction tuning diagram
    Image credits: S. Zhang et al, Source: arxiv.org/pdf/2308.10792.pdf
    K

    Kernel

    A kernel is a single function which gets executed on a GPU.


    Kubernetes

    Kubernetes is an open-source platform designed to automate deploying, scaling, and operating application containers. It groups containers that make up an application into logical units for easy management and discovery. Developed by Google, Kubernetes is widely used for cloud-native applications due to its efficiency in managing containerized environments. In AI, Kubernetes is often used to scale Inference Servers over nodes.

    L

    LLaMA

    LLaMA is a family of language models released by Meta. LLaMA-2 models are released with weights, and are free for many commercial use cases.


    Language model

    A language model is a machine learning model which is trained to be able to model natural language. It learns statistical patterns, relationships, and structures of language by analyzing large datasets of text. This understanding allows it to predict and generate coherent and contextually relevant text. Language models are fundamental components of natural language processing (NLP) systems and are used for various tasks, including language translation, text generation and sentiment analysis.


    Large language model

    A large language model (LLM) is a specific type of language model which is characterized by its extensive size (typically measured in terms of the number of parameters it contains). Large language models have tens or hundreds of millions, or even billions, of parameters. These models are pre-trained on vast amounts of text data to capture a broad and deep understanding of language. Notable examples include GPT-3, BERT, and T5.


    Latency

    Latency refers to the time taken from when an input is provided to the model until an output is received. Low latency is critical in real-time applications where swift responses are essential as it directly impacts user experience.


    LLaVA

    LLaVA is a novel end-to-end trained large multimodal model which combines a vision encoder with Vicuna for general-purpose visual and language understanding. It achieves impressive chat capabilities, mimicking the multimodal GPT-4, and set a new state-of-the-art accuracy on Science QA.

    LLaVA architecture
    Source: huggingface.co/docs/transformers/model_doc/llava
    M

    Machine learning (ML)

    Machine learning (ML) is a subset of artificial intelligence (AI). It involves the use and development of computer systems which are able to learn and adapt without following explicit instructions, by using algorithms and statistical models to analyze and draw inferences from patterns in data. It encompasses various techniques, including supervised learning, unsupervised learning, reinforcement learning, and deep learning.


    Machine learning inference

    Machine learning inference is often referred to as "moving a model into production". It is the vital process of using a trained machine learning model to make predictions or classifications on fresh, unprocessed data, thus enabling efficient, cost-effective, and scalable deployments of machine learning solutions.


    Mistral

    Mistral is a series of language models created by Mistral AI. The models are open source, and are widely considered to be France's closest answer to OpenAI's language models.


    Mixture of Experts (MoE)

    Mixture of Experts (MoE) models use a form of conditional computation in which only parts of the network are activated for each example. This approach has been proposed as a way to dramatically increase model capacity without a proportional increase in computation.


    Model

    A model in the context of machine learning is an object which is used to transform input data into insights. Models usually consist of some fixed data encoding their knowledge, and then an algorithm for generating results from the combination of this data, and input data.

    There are a number of popular model types. These include: 1. Large language models 2. Deep neural networks 3. Linear regression 4. Logistic regression 5. Decision trees 6. Linear discriminant analysis 7. Naive Bayes 8. Support vector machines 9. Learning vector quantization 10. K-nearest neighbours 11. Random forests


    Model compilation

    Model compilation is an essential step in the deployment of AI models. It is a process applied in some deep learning frameworks to prepare a model for inference. Software frameworks designed to make training machine learning models easy, often leave a lot of inference-time performance on the table because they must be flexible for practitioners to be able to experiment rapidly. Compilation takes the output of a training process and squeezes out this flexibility, leaving only the information required to run the model in inference.


    Model monitoring

    Model monitoring is an important part of the MLOps pipeline. Observability and monitoring are key to building reliable and flexible operations, and model monitoring applies these principles to machine learning models. This might involve subsampling and saving the input data, tracking model performance, and producing online accuracy metrics, as well as encompassing techniques from standard DevOps observability best practices.


    Model parallelism

    Model parallelism is a form of parallelism where a model is divided up to sit on different GPUs. It is useful for increasing the speed of models at inference and during training. Not to be confused with data parallelism, both can be applied in tandem to make the most of available resources, reduce training and deployment times, and optimize the end-to-end machine learning process.

    M

    Model serving

    Model serving is the process of taking a machine learning model and deploying it behind a server. A server is a continuously running listener process which waits for requests from end-users, processes them, and then sends responses. This should be distinguished from, for example, batch processing, where a process churns through a list of data on some regular schedule.

    M

    Multi-GPU inference

    Multi-GPU inference distributes a large language model (LLM) across multiple GPUs. This allows for the inference of larger models and enables larger batch sizes. It is advantageous for applications which require high throughput, reduced latency, and efficient utilization of computational resources.

    N

    Natural language processing (NLP)

    Natural Language Processing (NLP) is a branch of artificial intelligence (AI) concerned with processing natural language to extract insights from human text and speech, combining the power of computational linguistics and AI.

    Typical uses of NLP within AI include: 1. Text analysis: Sentiment analysis, text summarization and topic modelling 2. Machine translation: Automatic translation of text and/or speech 3. Chatbots and virtual assistants: Natural language conversations 4. Search engines: Improved accuracy and relevance of results 5. Speech recognition: Voice commands and transcription services 6. Text generation: Human-like text for content generation

    N

    Natural language understanding (NLU)

    Natural language understanding (NLU) refers to tasks which involve processing language without being required to generate new text. Typical tasks include classification, closed question answering, and named entity extraction.

    N

    Neural networks

    (Artificial) neural networks are a type of machine learning model inspired by biological networks in mammalian brains. These networks of neurons process information by composing vast numbers of simple computations into massively-connected networks which are capable of performing all kinds of difficult tasks. Artificial neural networks are highly simplified when compared to their biological forebears, but the underlying principle of composing simple computations to produce intelligent results is the same.

    Neural network diagram
    Image credits: C. Gershenson, Source: arxiv.org/ftp/cs/papers/0308/0308031.pdf
    N

    Ngram

    An ngram is a sequence of n adjacent tokens. For example, if each word is a token then "my name is" is a 3-gram. Ngram models, which estimate the probabilities of token sequences from their observed frequencies, were considered state-of-the-art in language modelling before the rise of neural approaches, and they remain a useful pairing with large language models (LLMs), for example in speculative decoding.
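
    As an illustrative sketch (function and variable names are our own, not from any particular library), extracting and counting ngrams from a token list takes only a few lines:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "my name is my name is".split()
bigrams = Counter(ngrams(tokens, 2))
# ("my", "name") occurs twice in this toy sequence
```

    An ngram language model would turn such counts into conditional probabilities, e.g. P("name" | "my") from the frequency of ("my", "name") relative to "my".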

    N

    Node

    A node refers to a single computer or machine within a larger network of computers that work together. Each node might perform a portion of a larger task in parallel computing. When deploying Generative AI models a node typically refers to a GPU or a defined collection of GPUs.

    O

    On-prem

    Companies that don't make use of cloud computing services must maintain compute capability themselves, in the form of a large number of networked servers. These services are stored "on the premises", commonly abbreviated as "on-prem". This is often a result of regulatory or privacy-sensitivity. See Cloud computing.

    P

    Paged Attention

    Paged Attention is a memory management technique for optimizing GPU usage during LLM inference by partitioning the key-value cache into smaller, non-contiguous blocks called pages. This approach minimizes memory fragmentation, allowing for dynamic adjustment of sequence lengths and batch sizes while maximizing throughput and GPU efficiency.

    P

    Perplexity

    Perplexity is a measurement of how well a probability distribution or probability model predicts a sample. A low perplexity indicates the model is good at predicting the sample.
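
    The definition above can be written as the exponential of the average negative log-probability the model assigns to each observed token. A minimal sketch (the function name is illustrative):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability
    assigned by the model to each observed token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every token has perplexity 4,
# i.e. it is "as confused" as a uniform choice between 4 options:
perplexity([0.25, 0.25, 0.25])  # -> 4.0 (up to floating point rounding)
```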

    P

    Pretrained model

    Pretraining is the process of producing a general purpose, flexible model from a massive corpus of general purpose data. Modern machine learning training (especially in language processing) usually has two training phases: pretraining - where the model is taught to understand general language, logic, and conceptual features, and, finetuning - where the model is taught to understand concepts or language specific to a domain.

    P

    Prompt engineering

    Prompt engineering is the process of writing better prompts to large language models to get the desired output more often. It is a complex language task which requires deep reasoning as it involves closely examining a model's errors, hypothesizing what is missing and/or misleading in the current prompt, and then, communicating the task more clearly back to the large language model.

    Prompt engineering techniques
    Image credits: Q. Ye et al, Source: arxiv.org/pdf/2311.05661.pdf
    P

    Pruning

    Pruning is a machine learning optimization technique which is applicable to deep neural networks (DNNs). Pruning involves finding weights (and sometimes entire neurons) in a neural network that do not contribute significantly to the performance of the model, and then removing them. This can improve the processing speed of the model, so long as it is done in a way which can be accelerated by the underlying hardware.

    Pruning diagram
    Image credits: T. Liang et al, Source: arxiv.org/pdf/2101.09671.pdf
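
    One simple variant is unstructured magnitude pruning: zero out the weights with the smallest absolute values. A toy sketch (the function name and tie-breaking behaviour are our own simplifications):

```python
def magnitude_prune(weights, sparsity):
    """Zero out (roughly) the smallest-magnitude fraction `sparsity`
    of weights - a toy sketch of unstructured magnitude pruning."""
    k = int(len(weights) * sparsity)          # number of weights to remove
    threshold = sorted(abs(w) for w in weights)[k - 1] if k else float("-inf")
    return [0.0 if abs(w) <= threshold else w for w in weights]

magnitude_prune([0.9, -0.05, 0.4, 0.01], sparsity=0.5)
# the two smallest-magnitude weights (-0.05 and 0.01) are zeroed
```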
    P

    Public Cloud

    A public cloud is a platform that uses the standard cloud computing model to make resources, such as virtual machines, applications, or storage, available to users over the internet. It's operated by third-party cloud service providers, offering scalability, reliability, and cost-efficiency, where resources are shared among multiple customers and billed on a pay-per-use basis.

    Q

    Quantization

    Quantization is a machine learning optimization technique which is applicable to deep neural networks. Neural networks store a large number of variables (called weights) that encode the model's knowledge of the task it is trained to perform. During training, these numbers must be stored at (relatively) high precision to make sure the model learns from the data effectively. At inference time, it is often possible to decrease the precision with which these weights are stored without a substantial drop in model ability.
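
    As a sketch of one simple scheme (symmetric int8 quantization; function names are illustrative), each float weight is mapped to an 8-bit integer plus a shared scale factor:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats in [-max|w|, max|w|]
    to integers in [-127, 127], plus one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return [x * scale for x in q]

q, scale = quantize_int8([0.3, -1.2, 0.05])
approx = dequantize(q, scale)   # close to the originals, at ~1/4 the storage
```

    Real quantization schemes (per-channel scales, 4-bit formats, activation quantization) are more involved, but the core idea is the same trade of precision for memory and speed.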

    Q

    Quantization aware training

    Quantization aware training is an optimization technique for performing quantization without incurring substantial accuracy losses. The goal is to find the best way to reduce the stored precision of a model with respect to its performance on a dataset. To that end, the effects of quantization are simulated during training, so the model learns weights that remain accurate once their precision is reduced.

    Quantization aware training process
    Image credits: P. Novak et al, Source: researchgate.net
    R

    RAG (Retrieval Augmented Generation)

    Retrieval Augmented Generation (RAG) is a method for enhancing factuality and groundedness of the outputs of a machine learning model with a corpus. Unconstrained generation from LLMs is prone to hallucination, and finetuning to add capabilities or knowledge to a model can be difficult and error-prone. Allowing access to a corpus of data at model runtime, for example, a company wiki or open source documentation, can add capabilities without requiring finetuning.

    R

    Rate Limits

    Rate limits in the context of API-accessed Large Language Models (LLMs) like ChatGPT refer to the policies that restrict the number of API requests a user or application can make within a specified time period. These limits are implemented to ensure equitable access, prevent abuse, and maintain the performance and reliability of the service for all users. Self-hosted LLMs do not experience the same kind of rate limiting.

    R

    Recurrent neural network (RNN)

    A recurrent neural network (RNN) processes sequences one element at a time, producing intermediate states which are passed between steps to maintain a memory of previously seen items in the sequence.
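
    A minimal single-unit sketch of that recurrence (weights and names are arbitrary toy values, not a trained model):

```python
import math

def rnn_step(x, h, w_xh, w_hh, b):
    """One step of a minimal single-unit RNN: the new hidden state mixes
    the current input x with the previous hidden state h (the 'memory')."""
    return math.tanh(w_xh * x + w_hh * h + b)

h = 0.0                       # initial hidden state
for x in [1.0, 0.5, -0.3]:    # process the sequence one element at a time
    h = rnn_step(x, h, w_xh=0.8, w_hh=0.5, b=0.0)
# h now summarizes the whole sequence seen so far
```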

    R

    Repetition penalty

    Repetition penalty is a factor applied to discourage the model from generating repetitive text or phrases. By adjusting this penalty, users can influence the model's output, reducing the likelihood of it producing redundant or repeated content. A higher repetition penalty generally results in more diverse outputs, whilst a lower value might lead to more repetition.
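
    One common formulation divides the logits of already-generated tokens when they are positive and multiplies them when negative, so repeated tokens always become less likely. A sketch (function name and toy values are ours):

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Penalize tokens that already appear in the generated sequence:
    divide positive logits (and multiply negative ones) by `penalty`."""
    out = list(logits)
    for t in set(generated_ids):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1])
# token 0's logit drops from 2.0 to ~1.67; token 1's falls from -1.0 to -1.2
```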

    R

    Rust

    Rust is a popular programming language which emphasizes performance, memory safety, and developer productivity. Rust's strong type system and zero-cost abstractions allow developers to write very robust code, whilst its compile-time memory management (via ownership and borrowing, with no garbage collector) means that Rust performance is best-in-class.

    S

    Sampling temperature

    Sampling temperature is a parameter used during the text generation process in large language models (LLMs). It controls the randomness of the model's output. A higher temperature results in more random and diverse outputs, whilst a lower temperature makes the output more deterministic and focused on the most likely predictions.
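
    Mechanically, temperature scales the logits before the softmax. A self-contained sketch (the function name is illustrative):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature before softmax. T < 1 sharpens the
    distribution (more deterministic); T > 1 flattens it (more random)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

softmax_with_temperature([2.0, 1.0, 0.1], temperature=0.5)  # peaks sharply on the first token
softmax_with_temperature([2.0, 1.0, 0.1], temperature=2.0)  # much flatter distribution
```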

    S

    Self-hosted models

    Self-hosted models are AI models that are run and maintained on a business' own infrastructure rather than relying on third-party providers. It is the most private and secure method of deploying large language models, and, since there is no reliance on third-party providers, it is often significantly cheaper at scale than API-based model deployments.

    Typically, self-hosting is considered to be an incredibly complex and time consuming process for machine learning teams to build and maintain.

    S

    Sentiment analysis

    Sentiment analysis, also known as opinion mining, is the process of extracting the sentiment of a body of text. It aims to classify text into different categories or sentiments, such as positive, negative, or neutral, to understand the attitudes, opinions, and emotions conveyed by the author.

    S

    Serving

    Serving is the act of hosting an AI model such that it is able to be used at scale and power downstream applications.

    S

    Speculative decoding

    Speculative decoding is a method for accelerating text generation. It employs a smaller language model to cheaply produce candidate tokens; these candidates are then verified by the larger model, and only tokens the larger model agrees with are accepted.

    Speculative decoding is typically used to: 1. Reduce generation latency 2. Increase effective throughput 3. Make better use of GPU hardware, all without degrading the quality of the larger model's output.

    Speculative decoding diagram
    Image credits: R. Zhu, TitanML
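
    The draft-then-verify loop can be sketched with toy deterministic models (all names and the lambda "models" below are hypothetical; real systems verify all draft tokens in one parallel pass of the large model, and use a probabilistic acceptance rule so the output distribution matches the large model exactly):

```python
def speculative_decode(target_next, draft_next, prompt, k=3, max_new=6):
    """Toy greedy speculative decoding: a cheap draft model proposes up to k
    tokens; the expensive target model checks them, accepting tokens until
    the first disagreement, where the target's own token is used instead."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        draft = []
        for _ in range(k):                   # draft k candidate tokens cheaply
            draft.append(draft_next(seq + draft))
        # In a real system the target scores all k positions in parallel;
        # here we just check agreement token by token.
        for tok in draft:
            out = target_next(seq)
            seq.append(out)                  # output always matches the target
            if tok != out or len(seq) - len(prompt) >= max_new:
                break
    return seq[len(prompt):]

# Hypothetical toy models: the target cycles a -> b -> c, the draft always says "a".
target = lambda s: {"a": "b", "b": "c", "c": "a"}[s[-1]]
draft = lambda s: "a"
speculative_decode(target, draft, ["a"])  # -> ["b", "c", "a", "b", "c", "a"]
```

    Note that the generated text is identical to what the target model alone would produce; the speedup comes from verifying several draft tokens per (expensive) target pass.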
    S

    Supervised learning

    Supervised learning is machine learning where the training process includes data labelled by some supervisory process, usually a human labeller.

    S

    Synthetic data

    One of the most expensive and time consuming parts of the MLOps pipeline is the construction of datasets for machine learning training. This is usually performed by human labellers at high cost. The promise of synthetic data is that this data can be constructed by automatic processes. There are various methods for doing so - the most promising of which is to use other machine learning models to construct the data.

    T

    TPU

    TPUs (Tensor Processing Units) are specialist hardware designed by Google for deep learning training and inference.

    T

    Tensor Parallelism

    Tensor parallelism is a technique used to distribute a large model across multiple GPUs. For instance, during the multiplication of input tensors with the first weight tensor, the process involves splitting the weight tensor column-wise, multiplying each column separately with the input, and then concatenating the resulting outputs.

    Tensor parallelism diagram
    Source: HuggingFace
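
    The column-wise split described above can be checked with plain Python lists standing in for tensors (in practice each shard's multiplication would run on a different GPU; all function names here are our own):

```python
def matmul(a, b):
    """Naive matrix multiply for small nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(w, parts):
    """Split a weight matrix column-wise into `parts` shards (one per 'GPU')."""
    cols = list(zip(*w))
    n = len(cols) // parts
    return [list(map(list, zip(*cols[i * n:(i + 1) * n]))) for i in range(parts)]

x = [[1.0, 2.0]]                     # input activation (1x2)
w = [[1.0, 2.0, 3.0, 4.0],          # weight matrix (2x4)
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(w, parts=2)   # each shard would live on a different GPU
partials = [matmul(x, s) for s in shards]       # computed independently
result = [sum((p[0] for p in partials), [])]    # concatenate along columns
# `result` equals matmul(x, w) computed on a single device
```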
    T

    Throughput

    Throughput denotes the number of input samples or tasks that a model can process within a specific time frame. It is a measure of the system's capacity and efficiency in handling multiple requests.

    High throughput is beneficial when speed, real-time processing, and scalability are critical. It allows AI systems to handle a large number of tasks or data points quickly and efficiently. Use cases include autonomous vehicles, real-time financial trading, customer support chatbots, and content delivery networks.

    Low throughput might be acceptable or even preferable as it may allow for deeper analysis, more complex computations, and a focus on accuracy over speed. Use cases include scientific simulations, complex data analysis, and research experiments.

    T

    Titan Takeoff Inference Server

    The Titan Takeoff Inference Server is the flagship product of TitanML. It is the easiest way to run inference on self-hosted models locally, applying state-of-the-art techniques in inference optimization alongside integrations with other software crucial for language models.

    Titan Takeoff Inference Server
    T

    Token

    A token is a discrete chunk of a larger sentence. In language modelling a token can be a single character, a word, a subword, or a group of words.

    T

    Tokenization

    Tokenization, in the context of large language models (LLMs), is the process of converting input text into smaller units, or "tokens," which can then be processed by the model. This process is a critical preprocessing step before feeding data to a large language model (LLM), as it ensures the text is in a format the model can understand and process.
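
    A toy word-level sketch of the idea (real LLM tokenizers use subword schemes such as BPE, and these function names are our own, but the principle of text in, integer ids out is the same):

```python
def build_vocab(corpus):
    """Map each unique whitespace-separated word to an integer id."""
    vocab = {}
    for word in corpus.split():
        vocab.setdefault(word, len(vocab))
    return vocab

def tokenize(text, vocab, unk=-1):
    """Convert text to a list of token ids; unknown words map to `unk`."""
    return [vocab.get(word, unk) for word in text.split()]

vocab = build_vocab("the cat sat on the mat")
tokenize("the mat sat", vocab)  # -> [0, 4, 2]
```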

    T

    Top K

    Top K sampling is a text generation strategy in which the model considers only the top 'K' most likely next tokens for its next word prediction. By restricting the pool of possible tokens, this method ensures the generated content remains coherent and contextually relevant.
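
    A minimal sketch over a toy token-to-probability dictionary (function name and values are illustrative):

```python
import random

def top_k_sample(probs, k=2):
    """Keep only the k highest-probability tokens, renormalize implicitly
    by sampling within their total mass, and return one of them."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    r = random.uniform(0, sum(p for _, p in top))
    for token, p in top:
        r -= p
        if r <= 0:
            return token
    return top[-1][0]

probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "zebra": 0.05}
top_k_sample(probs, k=2)  # only ever returns "cat" or "dog"
```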

    T

    Top P

    Top P, also known as nucleus sampling, is a method used in large language models (LLMs) to select a subset of possible next tokens where the cumulative probability exceeds a specified threshold 'P'. By sampling from this subset, it ensures a balance between randomness and predictability in a model's outputs. This method offers more dynamic sampling than top K and can lead to more varied and high-quality generated content.
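
    Sketched the same way as top K above (illustrative names; note how the nucleus size adapts to the shape of the distribution rather than being fixed):

```python
import random

def top_p_sample(probs, p=0.9):
    """Nucleus sampling: take the smallest set of highest-probability tokens
    whose cumulative probability reaches p, then sample within that set."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    r = random.uniform(0, cumulative)
    for token, prob in nucleus:
        r -= prob
        if r <= 0:
            return token
    return nucleus[-1][0]

probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "zebra": 0.05}
top_p_sample(probs, p=0.8)  # nucleus is {"cat", "dog"}; "zebra" is never sampled
```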

    T

    Training data

    Deep learning models are trained on data. During the training process, the training data is passed through the model, and the model is updated to improve its performance on a task defined over that data. For example, for generative NLP models, the training task is usually to predict the next word (more precisely, the next token) given all of the previous ones. The goal is for the model, at inference time, to generalise beyond the information encoded in its training data to unseen test data.

    T

    Training set

    The training set is simply the training data used to train a model.

    T

    Transfer learning

    Transfer learning is an attempt to reuse concepts which have been previously encoded in a machine learning model for a new task. For example, machine learning training often involves two phases: pretraining, where the model picks up general information, and finetuning, where the model learns domain-specific knowledge.

    T

    Transformer

    A transformer is a particular architecture of deep learning networks used for language, image, and audio modelling. There are many variants but most transformer models involve alternating fully-connected and attention-based layers.

    T

    Turing Test

    Conceived by Alan Turing in 1950, the Turing Test is a test of a machine's ability to exhibit intelligent behaviour indistinguishable from that of a human.

    U

    Unsupervised learning

    Unsupervised learning is machine learning where the training process does not include data labelled by a supervisor. Instead, the model is trained to extract insights from the statistical structure of the raw data. A typical example of a task possibly solvable via unsupervised learning is that of clustering.

    V

    Virtual Private Cloud (VPC)

    A Virtual Private Cloud (VPC) is an isolated, private section within a public cloud that allows organizations to run their cloud resources in a secure and secluded environment. It provides configurable IP address ranges, subnets, and security settings, enabling users to closely manage networking and security, similar to a traditional data center but with the scalability and flexibility of the cloud.

    Virtual Private Cloud diagram
    W

    Weight

    The weights of a language model are the values which are used to calculate the outputs of the model. These are what are optimized during training.

    Z

    Zero shot learning

    Zero shot learning is the ability of a model to generalise to unseen tasks or inputs without being given any examples. It stands in contrast to one shot and few shot learning, where one or a few examples are provided.