Using Snowflake's Document AI to Query Documents with Natural Language

Hello, I'm Harmit Saini.

I'm passionate about cloud computing, machine learning, and solving real-world problems through code. I recently earned the Snowflake SnowPro Core Certification, which inspired me to start this blog. My goal is to share what I've learned, break down complex topics, and support others on their Snowflake journey. Whether you're just starting out or are already a professional, I hope you find something helpful and insightful here.

Table of Contents

  1. What is Snowflake?
  2. What is Document AI?
  3. Using Document AI
  4. Conclusion

1. What is Snowflake?

Snowflake is a cloud data platform; more specifically, it is the first data platform built for the cloud. Its architecture allows data specialists to create not only data warehouses but also cloud data lakehouses, because it can manage both structured and unstructured data with ease. It stands out for its unique architecture that separates storage, compute, and cloud services, enabling independent scaling and optimization of each component. This separation ensures that users only pay for the resources they actually use, whether they are storing data or running queries.
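
For example, compute can be resized or suspended without touching the data it queries. A minimal, hypothetical illustration (ANALYTICS_WH is just a placeholder warehouse name):

-- Create a small warehouse that suspends itself when idle
CREATE WAREHOUSE IF NOT EXISTS ANALYTICS_WH
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;

-- Scale compute up for a heavy workload; the stored data is unaffected
ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'LARGE';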

2. What is Document AI?

Document AI is a Snowflake AI feature that uses Arctic-TILT, a proprietary large language model (LLM), to extract data from documents. It processes documents of various formats and extracts information from both text-heavy paragraphs and content in graphical form, such as logos, handwritten text (signatures), or checkmarks. With Document AI, you can prepare pipelines for continuous processing of new documents of a specific type, such as invoices or finance statements.

Document AI provides both zero-shot extraction and fine-tuning. Zero-shot means that the foundation model can locate and extract information specific to a document type, even if the model has never seen the document before. This is because the foundation model is trained on a large volume of various documents, so the model broadly understands the type of document being processed.

Additionally, you can fine-tune the Snowflake Arctic-TILT model to improve your results by training the model on the documents specific to your use case. The fine-tuned model (including the training data used) is available only to you and is not shared with other Snowflake customers.

3. Using Document AI

Step 1: Set up the required objects and privileges

To get started, please click here and download the required files.

The data is taken from GitHub and contains sample PDF invoices; however, you can use your own dataset if you prefer.

If you don't have a Snowflake account already, you can create one here.

Note: You can use your personal or company email to sign up. Each account gets the free trial only once. While creating an account, please select the Enterprise edition and AWS as the cloud provider in the Tokyo region. Also, ensure that you get $400 in free credits so that you can follow this tutorial for free.

Create a warehouse, database, schema, and custom role

To create a warehouse, database, schema, and a role to work with Document AI, do the following:

  • Create a custom role DOC_ROLE and assign necessary privileges:
USE ROLE ACCOUNTADMIN;
CREATE ROLE DOC_ROLE;
GRANT DATABASE ROLE SNOWFLAKE.DOCUMENT_INTELLIGENCE_CREATOR TO ROLE DOC_ROLE;
  • Grant the custom role to the current user:
GRANT ROLE DOC_ROLE TO USER <your_user_name>;

Note: To get your username, run the SELECT CURRENT_USER() query in Snowflake.

  • Create a warehouse and database, and assign the CREATE SCHEMA privilege to the created role:
CREATE WAREHOUSE DOC_WH;

-- GIVE THE DOC_ROLE ROLE ACCESS TO THE WAREHOUSE
GRANT USAGE, OPERATE, MODIFY ON WAREHOUSE DOC_WH TO ROLE DOC_ROLE;

-- CREATE DATABASE AND SCHEMA TO BE USED, GIVE THE DOC_ROLE ACCESS
CREATE DATABASE DOC_DB;
GRANT CREATE SCHEMA, MODIFY, USAGE ON DATABASE DOC_DB TO ROLE DOC_ROLE;
  • Use the DOC_ROLE and create a schema.
USE ROLE DOC_ROLE;
        
-- CREATE A SCHEMA FOR THE DOCUMENT AI MODEL, STAGE etc
CREATE SCHEMA DOC_DB.DOC_SCHEMA;

-- EXPLICITLY GRANT USAGE AND CREATE SNOWFLAKE.ML.DOCUMENT_INTELLIGENCE ON THE SCHEMA
GRANT USAGE ON SCHEMA DOC_DB.DOC_SCHEMA TO ROLE DOC_ROLE;
GRANT CREATE SNOWFLAKE.ML.DOCUMENT_INTELLIGENCE ON SCHEMA DOC_DB.DOC_SCHEMA TO ROLE DOC_ROLE;
  • Create a stage for uploading the files we will use to test the Document AI model:
CREATE STAGE DOC_DB.DOC_SCHEMA.DOC_STAGE
    DIRECTORY = (enable = true)
    ENCRYPTION = (type = 'snowflake_sse');
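
To confirm the stage was created with the directory table enabled, you can optionally run a quick check:

-- Verify the stage exists and inspect its properties (directory, encryption, etc.)
SHOW STAGES IN SCHEMA DOC_DB.DOC_SCHEMA;
DESCRIBE STAGE DOC_DB.DOC_SCHEMA.DOC_STAGE;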

Step 2: Create a Document AI model build

To create a Document AI model build, do the following:

  1. In the navigation menu, select AI & ML » Document AI.
  2. Select the warehouse we just created, DOC_WH (the selection dropdown should be in the top-right corner, next to the Build button).
  3. Select + Build.
  4. In the dialog that appears, enter DOC_TEST_BUILD as a name for your model build, and select the location (DOC_DB database and DOC_SCHEMA schema).
  5. Select Create.

(If you encounter an error: "Unable to create a build on the specified database and schema. Please check the documentation to learn more", then please switch your role from ACCOUNTADMIN to DOC_ROLE. The selection menu is at the bottom left corner where your username is written.)
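
If switching roles does not resolve the error, it may also help to verify that the grants from Step 1 are in place, for example:

-- Check that DOC_ROLE has the document intelligence grants and that your user has the role
SHOW GRANTS TO ROLE DOC_ROLE;
SHOW GRANTS TO USER <your_user_name>;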

Upload documents to the Document AI model build

To upload documents to the newly created Document AI model build, do the following:

  1. In the DOC_TEST_BUILD model build, select the Build Details tab.
  2. Click on the Upload documents button.
  3. Select Browse and upload any 10 documents from the folder you downloaded.
  4. Select Upload.

I have uploaded 10 files, but you can upload more files for better accuracy of the model. You should see something like this when you are done uploading:

Build Details Tab

Step 3: Define data values and review the results

To define values for the Document AI model build:

  1. In the DOC_TEST_BUILD model build, select the Build Details tab.
  2. Select Define values.
  3. You should see the document view with a button to add values on the right-hand side.
  4. In the Documents review view, select + Value.
  5. For each document, enter the pairs of value names and questions.
  6. For each document and data value, review the answers that the model provides:
    • If the answer is correct, select the checkmark.
    • If the answer is incorrect, enter the correct value manually.

The pairs of value names and questions for my use case were defined like this:

Sr | Value name       | Question
---|------------------|------------------------------
1  | CLIENT           | Who is the client?
2  | SHIPPING_ADDRESS | What is the shipping address?
3  | TOTAL            | What is the Total?
4  | DISCOUNT         | What is the discount?
5  | ITEM_NAME        | What is the item name?

Note: When I entered the discount value, I got (10%) in response. But I wanted the discount amount, not the percentage. So, if we are unsatisfied with the model's answer, we can edit it.

Before:

Discount Before

After:

Discount After

Once done, click the Accept all and review next button to verify the same questions for all the other documents.

(Note: If you are unable to click the Accept all and review next button, it is probably because all the values already have a green 1.00 box, which indicates you have already accepted them.)

After accepting values for all the documents, the Documents tab should look something like this:

Documents Tab

Step 4: Publish a Document AI model build

To publish the model build, do the following:

  1. In the DOC_TEST_BUILD model build, select the Build Details tab.
  2. Under Model accuracy, select Publish version.
  3. In the dialog that appears, select Publish to confirm.
  4. Copy the Extracting Query that appears below and keep it somewhere.

For example, my extracting query looks like this:

SELECT DOC_DB.DOC_SCHEMA.DOC_TEST_BUILD!PREDICT(
  GET_PRESIGNED_URL(@<stage_name>, '<relative_file_path>'), 1);

For reference, this is what your Build Details tab should look like:

Build Details Tab

Note: If the model accuracy and the results are not satisfactory, you can optionally fine-tune the model to improve it. For more information about evaluating and training the model, see Evaluate a Document AI model.

Step 5: Testing the Document AI model build

Now, to test the model, I used a random document from the 1000-document folder and uploaded it to the Snowflake stage that we created earlier.

Upload a document to the Snowflake Stage

To upload files onto your stage, do the following:

  1. Select Data » Add Data.
  2. On the Add Data page, select Load files into a Stage.
  3. In the Upload Your Files dialog that appears, select the files that you want to upload. You can upload multiple files at the same time.
  4. Select the database schema in which we created the stage, then select the stage.
  5. Select Upload.

See the images below for reference:

1] Add Data from Snowsight

Add Data

2] Upload the files

Select Correct Stage
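
As an alternative to the Snowsight UI, if you have SnowSQL (or another Snowflake client) installed, a PUT command along these lines should also work; the local file path is a placeholder, and AUTO_COMPRESS is disabled so the PDF is stored as-is:

-- Hypothetical upload from a local machine via SnowSQL
PUT file:///path/to/invoice.pdf @DOC_DB.DOC_SCHEMA.DOC_STAGE AUTO_COMPRESS = FALSE;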

Testing the Model Build

Use the copied Extracting Query in a worksheet to query your documents and get the answers. I have only one file uploaded to the stage, so let's use the single-file query. You can also query all files in a stage with the second Extracting Query shown in the Build Details tab.

SELECT DOC_DB.DOC_SCHEMA.DOC_TEST_BUILD!PREDICT(
  GET_PRESIGNED_URL(@<stage_name>, '<relative_file_path>'), 1);
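
For the all-files case mentioned above, a query along these lines should work; it reads each file's RELATIVE_PATH from the stage's directory table and runs the prediction per file (the exact query in your Build Details tab may differ slightly):

SELECT RELATIVE_PATH,
       DOC_DB.DOC_SCHEMA.DOC_TEST_BUILD!PREDICT(
         GET_PRESIGNED_URL(@DOC_DB.DOC_SCHEMA.DOC_STAGE, RELATIVE_PATH), 1) AS extracted
FROM DIRECTORY(@DOC_DB.DOC_SCHEMA.DOC_STAGE);

-- If the directory table appears empty, refresh it first:
-- ALTER STAGE DOC_DB.DOC_SCHEMA.DOC_STAGE REFRESH;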

To get the <relative_file_path>, go to your stage under Data » Databases » DOC_DB » DOC_SCHEMA » Stages » DOC_STAGE.

Where to Find Stage
Copy the file name. Example: invoice_Sh****y S****dt_14495.pdf in my case.
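
If you prefer SQL over the UI, listing the stage contents is another way to find the file names:

-- List the files in the stage along with their size and last-modified time
LIST @DOC_DB.DOC_SCHEMA.DOC_STAGE;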

Run the query in the Query editor:

SELECT DOC_DB.DOC_SCHEMA.DOC_TEST_BUILD!PREDICT(
  GET_PRESIGNED_URL(@DOC_STAGE, 'invoice_Sh****y S****dt_14495.pdf'), 1);

I got the following response after running the above query:

{
  "CLIENT": [
    {
      "score": 0.983,
      "value": "Sh****y S****dt"
    }
  ],
  "DISCOUNT": [
    {
      "score": 0.35,
      "value": "(50%)"
    }
  ],
  "ITEM_NAME": [
    {
      "score": 0.907,
      "value": "S***o C***ic B****a**, Traditional Bookcases, Furniture, FUR-BO-5763"
    }
  ],
  "SHIPPING_ADDRESS": [
    {
      "score": 0.906,
      "value": "G****n G****en Netherlands"
    }
  ],
  "TOTAL": [
    {
      "score": 0.876,
      "value": "$4,012.61"
    }
  ],
  "__documentMetadata": {
    "ocrScore": 0.92
  }
}

Note: The score refers to the confidence level and the value refers to the answer.
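
Because the result comes back as a single semi-structured (VARIANT) value, you can also pull individual fields out into regular columns. A small sketch, assuming the same file and build version (the column aliases are mine):

WITH raw AS (
  SELECT DOC_DB.DOC_SCHEMA.DOC_TEST_BUILD!PREDICT(
           GET_PRESIGNED_URL(@DOC_STAGE, 'invoice_Sh****y S****dt_14495.pdf'), 1) AS result
)
SELECT result:CLIENT[0]:value::STRING AS client,
       result:TOTAL[0]:value::STRING  AS total,
       result:TOTAL[0]:score::FLOAT   AS total_score
FROM raw;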

Step 6: Re-training the model

You can see that the discount value was (50%) rather than the amount we wanted. This is because the build is based on only 10 documents (note the low score for the DISCOUNT key). You can fine-tune the model by clicking the Train button in the Build Details tab, and then query it with the new Extracting Query.

After you click the Train button and the training finishes, your build version will be incremented from 1 to 2.

FineTune

Execute the new query:

SELECT DOC_DB.DOC_SCHEMA.DOC_TEST_BUILD!PREDICT(
  GET_PRESIGNED_URL(@DOC_STAGE, 'invoice_Sh****y S****dt_14495.pdf'), 2);

Note the build version has changed from 1 to 2.

The results I got:

{
  "CLIENT": [
    {
      "score": 0.998,
      "value": "Sh****y S****dt"
    }
  ],
  "DISCOUNT": [
    {
      "score": 1,
      "value": "$3,943.35"
    }
  ],
  "ITEM_NAME": [
    {
      "score": 1,
      "value": "S***o C***ic B****a**, Traditional Bookcases, Furniture, FUR-BO-5763"
    }
  ],
  "SHIPPING_ADDRESS": [
    {
      "score": 1,
      "value": "G****n G****en Netherlands"
    }
  ],
  "TOTAL": [
    {
      "score": 1,
      "value": "$4,012.61"
    }
  ],
  "__documentMetadata": {
    "ocrScore": 0.92
  }
}

Here, we finally got the discount amount instead of the percentage, because we trained the model on the values we provided in the Define values section. Hence, Document AI is a very useful Snowflake feature for applying AI-based OCR to documents and extracting their information through SQL queries.

Clean up by executing these commands:

-- Drop the schema (as DOC_ROLE, which owns it), then the remaining objects as ACCOUNTADMIN
DROP SCHEMA DOC_DB.DOC_SCHEMA;
USE ROLE ACCOUNTADMIN;
DROP DATABASE DOC_DB;
DROP ROLE DOC_ROLE;
DROP WAREHOUSE DOC_WH;

Make sure to delete the build from Document AI as well. This way, you avoid unnecessary costs on your Snowflake account.

4. Conclusion

To summarize, this is how Snowflake's Document AI can be used to get answers to natural language questions and to query unseen documents in stages. As a next step, we could create a data pipeline so that each newly added file is processed automatically and the results are stored in a table for analysis, just like a normal relational table. This is how we can query unstructured data in Snowflake as if it were structured, using its AI capabilities.
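
To give a flavor of that next step, here is a rough sketch of such a pipeline using a stream on the stage's directory table and a scheduled task. The object names and the schedule are my own assumptions, and for an internal stage you may need to refresh the directory table so that new files show up in the stream:

-- Table to hold the extraction results
CREATE TABLE DOC_DB.DOC_SCHEMA.INVOICE_RESULTS (
    FILE_NAME VARCHAR,
    EXTRACTED VARIANT
);

-- Stream that captures newly added files in the stage's directory table
CREATE STREAM DOC_DB.DOC_SCHEMA.DOC_STREAM ON STAGE DOC_DB.DOC_SCHEMA.DOC_STAGE;

-- Task that runs the model on each new file and stores the result
CREATE TASK DOC_DB.DOC_SCHEMA.PROCESS_NEW_DOCS
  WAREHOUSE = DOC_WH
  SCHEDULE = '1 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('DOC_DB.DOC_SCHEMA.DOC_STREAM')
AS
INSERT INTO DOC_DB.DOC_SCHEMA.INVOICE_RESULTS
SELECT RELATIVE_PATH,
       DOC_DB.DOC_SCHEMA.DOC_TEST_BUILD!PREDICT(
         GET_PRESIGNED_URL(@DOC_DB.DOC_SCHEMA.DOC_STAGE, RELATIVE_PATH), 2)
FROM DOC_DB.DOC_SCHEMA.DOC_STREAM
WHERE METADATA$ACTION = 'INSERT';

-- Tasks are created in a suspended state; resume the task to start the schedule
ALTER TASK DOC_DB.DOC_SCHEMA.PROCESS_NEW_DOCS RESUME;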

Author:

Harmit Saini

JTP Co., Ltd.