
Azure OpenAI Models

Models available

Azure OpenAI Service provides REST API access to OpenAI's powerful language models, including the following:

  • GPT-4
  • GPT-3.5-Turbo
  • Embeddings model series
  • Codex
  • DALL-E


Azure OpenAI Service provides access to OpenAI's powerful large language models such as ChatGPT, GPT, Codex, and Embeddings models. These models enable various natural language processing (NLP) solutions to understand, converse, and generate content. Users can access the service through REST APIs, SDKs, and Azure OpenAI Studio.

Azure OpenAI includes several types of models:

  • GPT-4 models are the latest generation of generative pretrained transformer (GPT) models that can generate natural language and code completions based on natural language prompts.
  • GPT-3.5 models can generate natural language and code completions based on natural language prompts. In particular, GPT-35-turbo models are optimized for chat-based interactions and work well in most generative AI scenarios.
  • Embeddings models convert text into numeric vectors, and are useful in language analytics scenarios such as comparing text sources for similarities.
  • DALL-E models are used to generate images based on natural language prompts. Currently, DALL-E models are in preview. DALL-E models aren't listed in the Azure OpenAI Studio interface and don't need to be explicitly deployed.

Model Names

The model family and capability are indicated in the name of the base model, such as text-davinci-003, which specifies that it's a text model, with davinci-level capability, at version 3. Details on models, capability levels, and naming conventions can be found on the AOAI Models documentation page.

Key concepts

Prompts & completions

  • A prompt is the text portion of a request that is sent to the deployed model's completions endpoint.
  • Responses are referred to as completions, which can come in the form of text, code, or other formats. Here's an example of a simple prompt and completion:

    Prompt: """ count to 5 in a for loop """

    Completion: for i in range(1, 6): print(i)
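A request to the completions endpoint carries the prompt together with generation settings in a JSON body. The sketch below only builds that body; the parameter values are illustrative, and a real request would be sent to your Azure OpenAI resource's deployment URL with an API key:

```python
import json

# Illustrative request body for the completions endpoint.
payload = {
    "prompt": '""" count to 5 in a for loop """',
    "max_tokens": 50,     # cap on the length of the completion
    "temperature": 0,     # low temperature favors deterministic code output
}
body = json.dumps(payload)
print(body)
```

Setting a low temperature is a common choice for code generation, where a single correct answer is preferred over varied phrasings.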


Azure OpenAI processes text by breaking it down into tokens. Tokens can be words or just chunks of characters. For example, the word “hamburger” gets broken up into the tokens “ham”, “bur” and “ger”, while a short and common word like “pear” is a single token. Many tokens start with a whitespace, for example “ hello” and “ bye”.
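The real models use byte-pair encoding (BPE), but the splitting behavior can be illustrated with a toy greedy longest-match tokenizer over a hand-built vocabulary (the vocabulary below is invented for this example, not the model's actual token set):

```python
# Toy vocabulary: real BPE vocabularies contain tens of thousands of entries.
vocab = {"ham", "bur", "ger", "pear", " hello", " bye"}

def toy_tokenize(text: str) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        # Try the longest substring starting at i that is in the vocabulary.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            tokens.append(text[i])  # unknown character passes through as-is
            i += 1
    return tokens

print(toy_tokenize("hamburger"))  # ['ham', 'bur', 'ger']
print(toy_tokenize("pear"))      # ['pear']
```

Note how "hamburger" splits into three subword tokens while "pear" stays whole, mirroring the example above.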


An embedding is a special format of data representation that machine learning models and algorithms can easily use. The embedding is an information-dense representation of the semantic meaning of a piece of text.


Each embedding is a vector of floating-point numbers, such that the distance between two embeddings in the vector space is correlated with semantic similarity between two inputs in the original format. For example, if two texts are similar, then their vector representations should also be similar.
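A common way to measure that similarity is the cosine of the angle between the two vectors. The sketch below uses tiny made-up 3-dimensional vectors for clarity; real embeddings from the embeddings models have far more dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented vectors standing in for embeddings of three texts.
cat = [0.8, 0.1, 0.1]
kitten = [0.7, 0.2, 0.1]
invoice = [0.1, 0.1, 0.9]

print(cosine_similarity(cat, kitten))   # similar texts: value near 1.0
print(cosine_similarity(cat, invoice))  # unrelated texts: noticeably lower
```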


Retrieval Augmented Generation (RAG) is a pattern that works with pretrained Large Language Models (LLM) and your own data to generate responses.
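The RAG pattern can be sketched in two steps: retrieve the document most relevant to the query, then build a prompt grounded in that document. The document texts and embedding vectors below are invented placeholders; in practice an embeddings model produces the vectors and a vector store performs the search:

```python
import math

# Toy corpus: each document mapped to a made-up 2-dimensional embedding.
docs = {
    "Returns are accepted within 30 days.": [0.9, 0.1],
    "Our office is closed on public holidays.": [0.1, 0.9],
}
query = "What is the refund policy?"
query_vec = [0.85, 0.15]  # assumed embedding of the query

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Retrieve: pick the document whose embedding is closest to the query's.
best = max(docs, key=lambda d: cos(docs[d], query_vec))

# Augment: ground the prompt with the retrieved text before generation.
prompt = f"Answer using only this context:\n---\n{best}\n---\n{query}"
print(prompt)
```

The grounded prompt is then sent to the LLM, which answers from the retrieved context rather than from its training data alone.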

Prompt Engineering

Prompt engineering in Azure OpenAI is a technique that involves designing prompts for natural language processing models. This process improves accuracy and relevancy in responses, optimizing the performance of the model.

What is prompt engineering?

Response quality from large language models (LLMs) in Azure OpenAI depends on the quality of the prompt provided. Improving prompt quality through various techniques is called prompt engineering.

For example, if we want an OpenAI model to generate product descriptions, we can provide it with a detailed description that describes the features and benefits of the product. By providing this context, the model can generate more accurate and relevant product descriptions.


Some endpoints we have are:

  • Completion
  • Chat Completion
  • Embedding

It's worth noting that ChatCompletion can also be used for non-chat scenarios, where any instructions are included in the system message and user content is provided in the user role message.
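For example, a one-shot translation task can be expressed as a chat exchange: the instruction goes in the system message and the text to process goes in a single user message. The sketch below only builds the messages list; the content strings are illustrative:

```python
# Non-chat task (translation) expressed in the chat message format.
messages = [
    {
        "role": "system",
        "content": "You are a translator. Translate the user's text "
                   "from English to Spanish and reply only with the translation.",
    },
    {"role": "user", "content": "The weather is nice today."},
]
print(messages)
```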

Model Params

  • Temperature
  • Top-p

In particular, temperature and top_p (top_probability) are the most likely to impact a model's response as they both control randomness in the model, but in different ways.
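Temperature rescales the model's next-token probability distribution before sampling: low values sharpen it toward the most likely token, high values flatten it. A minimal sketch with invented logits (top_p would instead truncate the distribution to the smallest set of tokens whose cumulative probability exceeds p):

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    # Divide logits by the temperature, then apply a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
cool = softmax_with_temperature(logits, 0.2)  # sharp: top token dominates
warm = softmax_with_temperature(logits, 2.0)  # flat: more randomness

print(cool[0], warm[0])
```

This is why it's generally recommended to adjust temperature or top_p, but not both at once: each alone already reshapes how random the sampling is.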

Primary, supporting, and grounding content

Including content for the model to use to respond with allows it to answer with greater accuracy. This content can be thought of in three ways: primary, supporting, and grounding content.

Primary content refers to content that is the subject of the query, such as a sentence to translate or an article to summarize. This content is often included at the beginning or end of the prompt (as an instruction and differentiated by --- blocks), with instructions explaining what to do with it.

For example, say we have a long article that we want to summarize. We could put it in a --- block in the prompt, then end with the instruction.

<insert full article here, as primary content>

Summarize this article and identify three takeaways in a bulleted list

Supporting content is content that may alter the response, but isn't the focus or subject of the prompt. Examples of supporting content include things like names, preferences, future date to include in the response, and so on. Providing supporting content allows the model to respond more completely, accurately, and be more likely to include the desired information.

For example, given a very long promotional email, the model is able to extract key information. If you then add supporting content to the prompt specifying something specific you're looking for, the model can provide a more useful response. In this case the email is the primary content, with the specifics of what you're interested in as the supporting content.

<insert full email here, as primary content>
<the next line is the supporting content>
Topics I'm very interested in: AI, webinar dates, submission deadlines

Extract the key points from the above email, and put them in a bulleted list:
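A prompt like the one above can be assembled programmatically from the primary and supporting pieces. In this sketch the email text is a placeholder and the topic list mirrors the example:

```python
# Primary content: the email itself (placeholder text here).
email_text = "<full promotional email goes here>"
# Supporting content: specifics the user cares about.
topics = ["AI", "webinar dates", "submission deadlines"]

prompt = (
    f"{email_text}\n"
    f"Topics I'm very interested in: {', '.join(topics)}\n\n"
    "Extract the key points from the above email, and put them in a bulleted list:"
)
print(prompt)
```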

Grounding content allows the model to provide reliable answers by giving it content to draw answers from. Grounding content could be an essay or article that you then ask questions about, a company FAQ document, or information that is more recent than the data the model was trained on. If you need more reliable and current responses, or you need to reference unpublished or specific information, grounding content is highly recommended.

Grounding content differs from primary content in that it's the source of information used to answer the prompt query, rather than the content being operated on for things like summarization or translation. For example, when provided an unpublished research paper on the history of AI, the model can then answer questions using that grounding content.

<insert unpublished paper on the history of AI here, as grounding content>

Where and when did the field of AI start?

This grounding data allows the model to give more accurate and informed answers that may not be part of the dataset it was trained on.


Cues are leading words for the model to build upon, and often help shape the response in the right direction. They are often used with instructions, but not always. Cues are particularly helpful if prompting the model for code generation. Current Azure OpenAI models can generate some interesting code snippets; however, code generation will be covered in more depth in a future module.

For example, if you want help creating a SQL query, provide instructions of what you need along with the beginning of the query as the cue:

Write a join query to get customer names with purchases in the past 30 days between tables named orders and customer on customer ID.

SELECT

The model response picks up where the prompt left off, continuing in SQL, even though we never asked for a specific language.

Conversation history

Along with the system message, other messages can be provided to the model to enhance the conversation. Conversation history enables the model to continue responding in a similar way (such as tone or formatting) and allows the user to reference previous content in subsequent queries. This history can be provided in two ways: from an actual chat history, or from a user-defined example conversation.
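In the chat message format, history is supplied as alternating user and assistant messages placed after the system message and before the new query. The message contents below are invented for illustration:

```python
# Conversation history as prior user/assistant turns before the new query.
messages = [
    {"role": "system", "content": "You are a helpful assistant that answers briefly."},
    # Prior turns: an actual chat history, or a hand-written example conversation.
    {"role": "user", "content": "What is Azure OpenAI?"},
    {"role": "assistant", "content": "A service offering REST API access to OpenAI models on Azure."},
    # The new query can refer back to earlier content ("it").
    {"role": "user", "content": "Which models does it include?"},
]
roles = [m["role"] for m in messages]
print(roles)
```

Because the earlier turns travel with every request, the model can resolve references like "it" and keep a consistent tone.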

System message

The system message is included at the beginning of a prompt and is designed to give the model instructions, perspective to answer from, or other information helpful to guide the model's response. This system message might include tone or personality, topics that shouldn't be included, or specifics (like formatting) of how to answer.

For example, you could give it some of the following system messages:

  • "I want you to act like a command line terminal. Respond to commands exactly as cmd.exe would, in one unique code block, and nothing else."
  • "I want you to be a translator, from English to Spanish. Don't respond to anything I say or ask, only translate between those two languages and reply with the translated text."
  • "Act as a motivational speaker, freely giving out encouraging advice about goals and challenges. You should include lots of positive affirmations and suggested activities for reaching the user's end goal."
