Amar's TechSpace 🛸

Skillset

In Azure AI Search, you can apply Artificial Intelligence (AI) skills as part of the indexing process to enrich the source data with new information, which can be mapped to index fields. The skills used by an indexer are encapsulated in a skillset that defines an enrichment pipeline in which each step enhances the source data with insights obtained by a specific AI skill. Examples of the kind of information that can be extracted by an AI skill include:

  • The language in which a document is written.
  • Key phrases that might help determine the main themes or topics discussed in a document.
  • A sentiment score that quantifies how positive or negative a document is.
  • Specific locations, people, organizations, or landmarks mentioned in the content.
  • AI-generated descriptions of images, or image text extracted by optical character recognition.
  • Custom skills that you develop to meet specific requirements.
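
The kind of structure involved is sketched below: a minimal skillset definition of the sort you submit to the Azure AI Search REST API (or build through the portal or SDKs). The skillset name, Azure AI services key, and field names are placeholders; only a single key phrase extraction skill is shown here, and the sections that follow add further skills to the pipeline.

{
    "name": "<skillset_name>",
    "description": "Enrichment pipeline for the search solution",
    "cognitiveServices": {
        "@odata.type": "#Microsoft.Azure.Search.CognitiveServicesByKey",
        "description": "Azure AI services resource used to bill skill execution",
        "key": "<azure_ai_services_key>"
    },
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
            "context": "/document",
            "defaultLanguageCode": "en",
            "inputs": [
                { "name": "text", "source": "/document/content" }
            ],
            "outputs": [
                { "name": "keyPhrases", "targetName": "keyphrases" }
            ]
        }
    ]
}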

Indexer

The indexer is the engine that drives the overall indexing process. It takes the outputs extracted using the skills in the skillset, along with the data and metadata values extracted from the original data source, and maps them to fields in the index.

An indexer is automatically run when it is created, and can be scheduled to run at regular intervals or run on demand to add more documents to the index. In some cases, such as when you add new fields to an index or new skills to a skillset, you may need to reset the index before re-running the indexer.
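
As a rough sketch, an indexer definition submitted to the REST API ties these pieces together; the data source, index, and skillset names below are placeholders, and the schedule shown runs the indexer every two hours.

{
    "name": "<indexer_name>",
    "dataSourceName": "<data_source_name>",
    "targetIndexName": "<index_name>",
    "skillsetName": "<skillset_name>",
    "schedule": { "interval": "PT2H" }
}

Field mappings and output field mappings, discussed later, are also declared on the indexer.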

Index

The index is the searchable result of the indexing process. It consists of a collection of JSON documents, with fields that contain the values extracted during indexing. Client applications can query the index to retrieve, filter, and sort information.

Each index field can be configured with the following attributes:

  • Key: Fields that define a unique key for index records.
  • Searchable: Fields that can be queried using full-text search.
  • Filterable: Fields that can be included in filter expressions to return only documents that match specified constraints.
  • Sortable: Fields that can be used to order the results.
  • Facetable: Fields that can be used to generate facets, enabling users to drill down and filter results by category.
  • Retrievable: Fields that can be included in search results (by default, all fields are retrievable unless this attribute is explicitly removed).
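
To make these attributes concrete, here is a sketch of the field collection in an index definition; the field names anticipate the examples used later in this article, and any attribute not set explicitly falls back to the service defaults.

{
    "name": "<index_name>",
    "fields": [
        { "name": "id", "type": "Edm.String", "key": true, "searchable": false },
        { "name": "metadata_storage_name", "type": "Edm.String", "searchable": true, "retrievable": true },
        { "name": "metadata_author", "type": "Edm.String", "filterable": true, "sortable": true, "facetable": true },
        { "name": "merged_content", "type": "Edm.String", "searchable": true },
        { "name": "language", "type": "Edm.String", "filterable": true, "facetable": true },
        { "name": "keyphrases", "type": "Collection(Edm.String)", "searchable": true, "filterable": true, "facetable": true }
    ]
}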

Indexing Process

The indexing process works by creating a document for each indexed entity. During indexing, an enrichment pipeline iteratively builds the documents that combine metadata from the data source with enriched fields extracted by cognitive skills. You can think of each indexed document as a JSON structure, which initially consists of a document with the index fields you have mapped to fields extracted directly from the source data, like this:

document
    metadata_storage_name
    metadata_author
    content

When the documents in the data source contain images, you can configure the indexer to extract the image data and place each image in a normalized_images collection, like this:

document
    metadata_storage_name
    metadata_author
    content
    normalized_images
        image0
        image1

Normalizing the image data in this way enables you to use the collection of images as an input for skills that extract information from image data.
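
Image extraction itself is switched on through the indexer's parameters. For a blob data source, the relevant configuration block looks roughly like this (other settings are left at their defaults):

"parameters": {
    "configuration": {
        "dataToExtract": "contentAndMetadata",
        "imageAction": "generateNormalizedImages"
    }
}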

Each skill adds fields to the document. For example, a skill that detects the language in which a document is written might store its output in a language field, like this:

document
    metadata_storage_name
    metadata_author
    content
    normalized_images
        image0
        image1
    language
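
A language detection skill that produces the language field above might be defined along these lines; the targetName in the output determines the name of the node added to the document.

{
    "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
    "context": "/document",
    "inputs": [
        { "name": "text", "source": "/document/content" }
    ],
    "outputs": [
        { "name": "languageCode", "targetName": "language" }
    ]
}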

The document is structured hierarchically, and the skills are applied to a specific context within the hierarchy, enabling you to run the skill for each item at a particular level of the document. For example, you could run an optical character recognition (OCR) skill for each image in the normalized images collection to extract any text they contain:

document
    metadata_storage_name
    metadata_author
    content
    normalized_images
        image0
            Text
        image1
            Text
    language
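
The context property is what scopes a skill to a level of the hierarchy. An OCR skill applied to each normalized image could be sketched like this, writing its text output onto every image node:

{
    "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
    "context": "/document/normalized_images/*",
    "detectOrientation": true,
    "inputs": [
        { "name": "image", "source": "/document/normalized_images/*" }
    ],
    "outputs": [
        { "name": "text", "targetName": "text" }
    ]
}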

The output fields from each skill can be used as inputs for other skills later in the pipeline, which in turn store their outputs in the document structure. For example, we could use a merge skill to combine the original text content with the text extracted from each image to create a new merged_content field that contains all of the text in the document, including image text.

document
    metadata_storage_name
    metadata_author
    content
    normalized_images
        image0
            Text
        image1
            Text
    language
    merged_content
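
A merge skill that produces the merged_content field might be sketched as follows; the itemsToInsert and offsets inputs tell the skill which image text to insert and where it belongs in the original content.

{
    "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
    "context": "/document",
    "insertPreTag": " ",
    "insertPostTag": " ",
    "inputs": [
        { "name": "text", "source": "/document/content" },
        { "name": "itemsToInsert", "source": "/document/normalized_images/*/text" },
        { "name": "offsets", "source": "/document/normalized_images/*/contentOffset" }
    ],
    "outputs": [
        { "name": "mergedText", "targetName": "merged_content" }
    ]
}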

Full text search describes search solutions that parse text-based document contents to find query terms. Full text search queries in Azure AI Search are based on the Lucene query syntax, which provides a rich set of query operations for searching, filtering, and sorting data in indexes. Azure AI Search supports two variants of the Lucene syntax:

  • Simple - An intuitive syntax that makes it easy to perform basic searches that match literal query terms submitted by a user.
  • Full - An extended syntax that supports complex filtering, regular expressions, and other more sophisticated queries.
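
As a rough illustration, the two request bodies below express similar intent in each variant, posted to the index's docs/search endpoint; the filter and orderby clauses in the second example assume the fields sketched earlier.

Simple syntax:

{
    "search": "free parking",
    "queryType": "simple",
    "count": true
}

Full Lucene syntax:

{
    "search": "\"free parking\" AND air*",
    "queryType": "full",
    "filter": "language eq 'en'",
    "orderby": "metadata_author desc"
}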

Query processing consists of four stages:

  1. Query parsing - The search expression is evaluated and reconstructed as a tree of appropriate subqueries. Subqueries might include term queries (finding specific individual words in the search expression - for example hotel), phrase queries (finding multi-term phrases specified in quotation marks in the search expression - for example, "free parking"), and prefix queries (finding terms with a specified prefix - for example air*, which would match airway, air-conditioning, and airport).
  2. Lexical analysis - The query terms are analyzed and refined based on linguistic rules. For example, text is converted to lower case and nonessential stopwords (such as "the", "a", "is", and so on) are removed. Then words are converted to their root form (for example, "comfortable" might be simplified to "comfort") and composite words are split into their constituent terms.
  3. Document retrieval - The query terms are matched against the indexed terms, and the set of matching documents is identified.
  4. Scoring - A relevance score is assigned to each result based on a term frequency/inverse document frequency (TF/IDF) calculation.

Azure AI Search provides a cloud-based solution for indexing and querying a wide range of data sources, and creating comprehensive and high-scale search solutions. With Azure AI Search, you can:

  • Index documents and data from a range of sources.
  • Use cognitive skills to enrich index data.
  • Store extracted insights in a knowledge store for analysis and integration.

Replicas and partitions

Depending on the pricing tier you select, you can optimize your solution for scalability and availability by creating replicas and partitions.

  • Replicas are instances of the search service - you can think of them as nodes in a cluster. Increasing the number of replicas can help ensure there is sufficient capacity to service multiple concurrent query requests while managing ongoing indexing operations.

  • Partitions are used to divide an index into multiple storage locations, enabling you to split I/O operations such as querying or rebuilding an index.

The combination of replicas and partitions you configure determines the search units used by your solution. Put simply, the number of search units is the number of replicas multiplied by the number of partitions (R x P = SU). For example, a resource with four replicas and three partitions is using 12 search units.
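
Replica and partition counts are properties of the search service resource itself. As a sketch, an ARM-style definition for a Standard tier service sized as in the example above (four replicas, three partitions, 12 search units) would include something like:

{
    "sku": { "name": "standard" },
    "properties": {
        "replicaCount": 4,
        "partitionCount": 3
    }
}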

Azure Cognitive Search (now Azure AI Search) is a search service hosted in Azure that can index content located on-premises or in the cloud.

How does indexing happen?

During the indexing process, Cognitive Search crawls your content, processes it, and creates a list of words that will be added to the index, together with their location. There are five stages to the indexing process:

  • Document Cracking: In document cracking, the indexer opens the content files and extracts their content.
  • Field Mappings: Fields such as titles, names, dates, and more are extracted from the content. You can use field mappings to control how they're stored in the index (see the sketch after this list).
  • Skillset Execution: In the optional skillset execution stage, custom AI processing is done on the content to enrich the final index.
  • Output field mappings: If you're using a custom skillset, its output is mapped to index fields in this stage.
  • Push to index: The results of the indexing process are stored in the index in Azure Cognitive Search.
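
Field mappings and output field mappings are both declared on the indexer. The fragment below sketches the general shape: fieldMappings rename values cracked directly from the source, while outputFieldMappings pull enriched values out of the document hierarchy by path (the target field names here are illustrative).

"fieldMappings": [
    {
        "sourceFieldName": "metadata_storage_name",
        "targetFieldName": "file_name"
    }
],
"outputFieldMappings": [
    {
        "sourceFieldName": "/document/merged_content",
        "targetFieldName": "merged_content"
    },
    {
        "sourceFieldName": "/document/normalized_images/*/text",
        "targetFieldName": "image_text"
    }
]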

Integrate Cognitive Search and Azure AI Document Intelligence

To integrate Azure AI Document Intelligence into the Cognitive Search indexing process, you must write a web service that implements the custom skill interface.

If you've developed an Azure AI Document Intelligence solution, you may be using it to accept scanned or photographed forms or documents from users, perhaps from an app on their mobile device. Azure AI Document Intelligence can use either a built-in model or a custom model to analyze the content of these images and return text, structural information, languages used, key-value pairs, and other data.

That's the kind of data that may be useful in a Cognitive Search index. For example, if the content that you index includes scanned sales invoices, Azure AI Document Intelligence can identify fields such as currency amounts, retailer names, and tax information by using its prebuilt Invoice model. When users search for a retailer, you'd like them to receive a link to invoices from that retailer in their results.

To integrate Azure AI Document Intelligence into the Cognitive Search indexing pipeline, you must:

  • Create an Azure AI Document Intelligence resource in your Azure subscription.
  • Configure one or more models in Azure AI Document Intelligence. You can either select prebuilt models, such as Invoice or Business Card, or train your own model for unusual or unique form types.
  • Develop and deploy a web service that can call your Azure AI Document Intelligence resource. In this module, you'll use an Azure Function to host this service.
  • Add a custom web API skill, with the correct configuration to the Cognitive Search skillset. This skill should be configured to send requests to the web service.
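
That last step might look broadly like the sketch below. The function URL, input, and output names are placeholders for whatever your Azure Function and Document Intelligence model actually expose; the skill simply forwards each document's data to the web service using the request and response schemas described in the next section.

{
    "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
    "description": "Call Azure AI Document Intelligence via an Azure Function",
    "uri": "https://<your-function-app>.azurewebsites.net/api/<function_name>?code=<function_key>",
    "httpMethod": "POST",
    "timeout": "PT30S",
    "batchSize": 1,
    "context": "/document",
    "inputs": [
        { "name": "formUrl", "source": "/document/metadata_storage_path" }
    ],
    "outputs": [
        { "name": "vendorName", "targetName": "vendor_name" }
    ]
}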

Custom Skill

Your custom skill must implement the input and output data schema expected by skills in an Azure AI Search skillset.

The input schema for a custom skill defines a JSON structure containing a record for each document to be processed. Each document has a unique identifier, and a data payload with one or more inputs, like this:

{
    "values": [
      {
        "recordId": "<unique_identifier>",
        "data":
           {
             "<input1_name>":  "<input1_value>",
             "<input2_name>": "<input2_value>",
             ...
           }
      },
      {
        "recordId": "<unique_identifier>",
        "data":
           {
             "<input1_name>":  "<input1_value>",
             "<input2_name>": "<input2_value>",
             ...
           }
      },
      ...
    ]
}

The schema for the results returned by your custom skill mirrors the input schema. The output must contain a record for each input record, with either the results produced by the skill or details of any errors and warnings that occurred.

{
    "values": [
      {
        "recordId": "<unique_identifier_from_input>",
        "data":
           {
             "<output1_name>":  "<output1_value>",
              ...
           },
         "errors": [...],
         "warnings": [...]
      },
      {
        "recordId": "< unique_identifier_from_input>",
        "data":
           {
             "<output1_name>":  "<output1_value>",
              ...
           },
         "errors": [...],
         "warnings": [...]
      },
      ...
    ]
}

Knowledge stores

While the index might be considered the primary output from an indexing process, the enriched data it contains might also be useful in other ways. For example:

  • Since the index is essentially a collection of JSON objects, each representing an indexed record, it might be useful to export the objects as JSON files for integration into a data orchestration process using tools such as Azure Data Factory.
  • You may want to normalize the index records into a relational schema of tables for analysis and reporting with tools such as Microsoft Power BI.
  • Having extracted embedded images from documents during the indexing process, you might want to save those images as files.

Azure AI Search supports these scenarios by enabling you to define a knowledge store in the skillset that encapsulates your enrichment pipeline. The knowledge store consists of projections of the enriched data, which can be JSON objects, tables, or image files. When an indexer runs the pipeline to create or update an index, the projections are generated and persisted in the knowledge store.
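
A knowledge store is declared as a knowledgeStore property of the skillset. The sketch below shows the three projection types side by side; the storage connection string is a placeholder, and the object and table projections assume a structure assembled earlier in the pipeline by a Shaper skill (named knowledge_projection here purely for illustration).

"knowledgeStore": {
    "storageConnectionString": "<azure_storage_connection_string>",
    "projections": [
        {
            "objects": [
                { "storageContainer": "enriched-docs", "source": "/document/knowledge_projection" }
            ],
            "tables": [],
            "files": []
        },
        {
            "objects": [],
            "tables": [
                { "tableName": "Documents", "generatedKeyName": "document_id", "source": "/document/knowledge_projection" }
            ],
            "files": []
        },
        {
            "objects": [],
            "tables": [],
            "files": [
                { "storageContainer": "images", "source": "/document/normalized_images/*" }
            ]
        }
    ]
}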

