{"id":6002,"date":"2023-07-07T09:00:00","date_gmt":"2023-07-07T07:00:00","guid":{"rendered":"https:\/\/blog.besharp.it\/?p=6002"},"modified":"2023-07-07T12:35:22","modified_gmt":"2023-07-07T10:35:22","slug":"extracting-actionable-data-from-structured-documents-with-amazon-textract-aws-lambda-e-amazon-s3","status":"publish","type":"post","link":"https:\/\/blog.besharp.it\/extracting-actionable-data-from-structured-documents-with-amazon-textract-aws-lambda-e-amazon-s3\/","title":{"rendered":"Extracting actionable data from structured documents with Amazon Textract, AWS Lambda and Amazon S3"},"content":{"rendered":"\n

In the digital age, effectively processing and managing large quantities of documents is a priority for companies in every sector. Many organizations are faced with the task of digitizing large volumes of paper documents or processing data from structured documents, such as invoices or contracts, automatically and with minimum human intervention. In this context, Optical Character Recognition (OCR)<\/strong> has proved to be an indispensable tool for automating processes and improving overall efficiency.<\/p>\n\n\n\n

However, being able to extract text from a document is only part of what most applications need; it can be considered a primitive function. The goal is more often to extract specific information<\/strong>, selecting text based on the structure of the document.<\/p>\n\n\n\n

To correctly select valuable information, it is crucial to obtain details on the structure of the document itself, such as how the text is grouped or tabulated and where it sits on the page.<\/p>\n\n\n\n

Providing exactly this kind of structural information is where Amazon Textract<\/strong> excels.<\/p>\n\n\n\n

In addition to providing the ability to extract text from documents, Amazon Textract can identify and return page structure information, opening up a wide range of data processing possibilities. Unlike traditional OCR software, which requires manual configurations and continuous updates to adapt to changes in forms, Amazon Textract uses machine learning models to process any type of document, ensuring accurate extraction of text, handwriting, tables, and other data without any manual intervention.<\/p>\n\n\n\n

Let’s move on to the description of a use case.<\/p>\n\n\n\n

Extracting information from an invoice<\/h2>\n\n\n\n

To explore the potential of Amazon Textract, we will use a (not so) hypothetical case: some information must be automatically extracted from company purchase invoices so that the amounts and dates can be entered in a database that is periodically loaded into the financial management software.<\/p>\n\n\n\n

Therefore, we need to build an automatic system capable of extracting the amount and date from the invoices it receives. For simplicity, let’s assume that all the invoices have the same structure because they come from the site of the single supplier from which our company obtains its consumer goods, although Textract can easily analyze heterogeneous invoices.<\/p>\n\n\n\n

Invoices are PDF documents meant to be read by a human. They contain headers, the vendor logo image, text, and tables in different places on the page.<\/p>\n\n\n\n

The site sends an invoice via email for each order. In our scenario, the address refers to an email group; therefore, we can ensure that the automatic system receives a copy without affecting the processes involving our operators.<\/p>\n\n\n\n

In this situation, we could sketch the following high-level solution:<\/p>\n\n\n\n

<\/p>\n\n\n

[Figure: infrastructure diagram]<\/figure><\/div>\n\n\n

<\/p>\n\n\n\n

By subscribing an ad-hoc email address to the email group we can route a copy of the emails containing invoices to our system.<\/p>\n\n\n\n

To receive and process emails, we can use Amazon SES. Although it is best known for sending emails in the AWS environment, it can also receive them and pass them on to other AWS services for processing.<\/p>\n\n\n\n

Email receiving supports various integrations, but in our case the most practical one is the integration with Amazon S3.<\/p>\n\n\n\n

Through the integration, Amazon SES saves an object containing the raw data of the mail in MIME format on the designated Amazon S3 bucket. This allows us to track – on the AWS side – a history of the raw body of each message received. This is useful for both archiving and investigating any malfunctions.<\/p>\n\n\n\n

Amazon S3 storage comes with minimal costs, and if very large quantities of messages are received, the cost can be optimized further with Lifecycle policies, low-cost storage classes, and the other Amazon S3 storage management features.<\/p>\n\n\n\n
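
As a rough illustration of the Lifecycle policies mentioned above, the following Python sketch builds a lifecycle configuration and applies it with the boto3 S3 client. The prefix, rule ID, and day counts are hypothetical placeholders, not values taken from the scenario.<\/p>\n\n\n\n

```python
# Minimal sketch: lifecycle rule for the bucket that stores raw emails.
# The prefix, rule ID, and day counts below are hypothetical placeholders.

def build_lifecycle_configuration(prefix='raw-emails/', ia_after_days=30,
                                  expire_after_days=365):
    # Objects under the prefix move to a cheaper storage class, then expire.
    return {
        'Rules': [
            {
                'ID': 'archive-raw-emails',
                'Filter': {'Prefix': prefix},
                'Status': 'Enabled',
                'Transitions': [
                    {'Days': ia_after_days, 'StorageClass': 'STANDARD_IA'}
                ],
                'Expiration': {'Days': expire_after_days},
            }
        ]
    }


def apply_lifecycle(bucket_name, s3_client=None):
    # boto3 is imported lazily so the builder above can be used offline.
    if s3_client is None:
        import boto3
        s3_client = boto3.client('s3')
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration=build_lifecycle_configuration(),
    )
```

Whether archived messages should ever expire depends on the retention requirements of the organization, so the expiration rule is only one possible choice.<\/p>\n\n\n\n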

At this point, an Amazon S3 trigger comes into operation, starting a Lambda Function that takes care of parsing the body of the email and extracting the attachment. The file can then be saved to a dedicated S3 bucket.<\/p>\n\n\n\n

In this article, we will not explore the code necessary to extract the attachments because it is not the core of what we intend to cover. However, libraries exist for most popular languages that make parsing and manipulating MIME data easier. For example, this is a Node.js-based library<\/a> from which to build the function.<\/p>\n\n\n\n
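
Although attachment extraction is out of scope here, a rough starting point in Python can rely on the standard library email package alone. The sketch below assumes the invoice is a direct PDF attachment of the message; the function name and the fallback filename are hypothetical.<\/p>\n\n\n\n

```python
import email
from email import policy


def extract_attachments(raw_mime_bytes, wanted_types=('application/pdf',)):
    # Parse the raw MIME message as it would be read from the S3 object.
    message = email.message_from_bytes(raw_mime_bytes, policy=policy.default)
    attachments = []
    # iter_attachments() walks the direct attachments of the message only;
    # nested multipart attachments would need extra handling.
    for part in message.iter_attachments():
        if part.get_content_type() in wanted_types:
            filename = part.get_filename() or 'attachment.pdf'
            # decode=True reverses the transfer encoding (for example base64).
            attachments.append((filename, part.get_payload(decode=True)))
    return attachments
```

Each returned pair holds the filename and the decoded bytes, which the Lambda function could then write to the dedicated bucket.<\/p>\n\n\n\n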

At this point, if the invoice is saved in one of the formats supported by Amazon Textract, it is possible to proceed with extracting the information. Otherwise, a further function can be added, or the one that manipulates the email can be extended, to convert the attachment to a widely supported format such as PDF.<\/p>\n\n\n\n

A second Amazon S3 trigger starts a Lambda function that invokes Amazon Textract, passing it the object to be analyzed as input. The Lambda function then navigates the result of the analysis to find the total and issue date of the invoice and saves them in the database.<\/p>\n\n\n\n

The integration with Amazon Textract is pretty seamless, and the methods exist in all the AWS SDKs for major programming languages, such as boto3 for Python and the AWS SDK for JavaScript.<\/p>\n\n\n\n
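
To make the flow concrete, here is a minimal Python sketch of a Lambda handler that reacts to the S3 trigger and calls Textract through boto3. It assumes the synchronous analyze_document call with the FORMS and TABLES features; the helper name and the returned summary are hypothetical, and saving to the database is left out.<\/p>\n\n\n\n

```python
from urllib.parse import unquote_plus


def analyze_invoice(bucket, key, textract_client=None):
    # boto3 is imported lazily so the function can be exercised with a stub.
    if textract_client is None:
        import boto3
        textract_client = boto3.client('textract')
    # FORMS and TABLES ask Textract to return key-value pairs and table
    # structure in addition to the plain text.
    return textract_client.analyze_document(
        Document={'S3Object': {'Bucket': bucket, 'Name': key}},
        FeatureTypes=['FORMS', 'TABLES'],
    )


def lambda_handler(event, context, textract_client=None):
    # Each record of an S3 event notification describes one uploaded object.
    results = []
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record['s3']['object']['key'])
        response = analyze_invoice(bucket, key, textract_client)
        results.append({'key': key, 'block_count': len(response.get('Blocks', []))})
    return results
```

The optional client argument exists only to make the sketch easy to test; Lambda itself calls the handler with the event and the context.<\/p>\n\n\n\n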

The reference invoice<\/h2>\n\n\n\n

In our scenario, we will use invoices generated by Amazon Business, which look something like this:<\/p>\n\n\n\n

<\/p>\n\n\n

[Figure: sample invoice]<\/figure><\/div>\n\n\n

<\/p>\n\n\n\n

The document presents the total both outside the table – in the header detail box – and at the bottom of the table.<\/p>\n\n\n\n

This is worth considering, since we can select whichever total is easiest to identify with Amazon Textract, build specific logic that searches for the total both inside and outside the table, or pick whichever of the two results has the greater confidence.<\/p>\n\n\n\n
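
The last option, picking the candidate with the greater confidence, could look like the Python sketch below. It assumes each candidate has already been located in the Textract response and reduced to its text and confidence score; the dictionary shape and function names are hypothetical, and amounts are assumed to use a dot as the decimal separator.<\/p>\n\n\n\n

```python
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    # Keep digits and dots only; assumes a dot as the decimal separator.
    cleaned = ''.join(ch for ch in text if ch.isdigit() or ch == '.')
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def pick_total(candidates):
    # candidates: dicts with 'text' and 'confidence', for example one read
    # from the header box and one read from the bottom of the table.
    parsed = [
        {'amount': parse_amount(c['text']), 'confidence': c['confidence']}
        for c in candidates
    ]
    usable = [p for p in parsed if p['amount'] is not None]
    if not usable:
        return None
    # Keep the amount whose recognition confidence is highest.
    return max(usable, key=lambda p: p['confidence'])['amount']
```

A stricter variant could also flag invoices where the two totals disagree, so that an operator can review them.<\/p>\n\n\n\n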

Textract API<\/h2>\n\n\n\n

The service provides various API calls, both synchronous and asynchronous.<\/p>\n\n\n\n

Synchronous versions allow you to send a document for immediate analysis, and the response to the API call is the result of the analysis itself. These API calls have important limitations: the accepted formats are restricted, and the caller must wait for the analysis to finish, which can take several seconds.<\/p>\n\n\n\n

The asynchronous versions, on the other hand, immediately return a JobId, through which the result of the analysis can be requested at a later time. There is also a mechanism to be notified via SNS when the analysis completes: the SNS topic can be specified when starting the asynchronous analysis. See this content<\/a> for details.<\/p>\n\n\n\n

Carefully consider using the asynchronous version, especially if the computation is based on Lambda. It allows you to better decouple the infrastructural components, build robust retry mechanisms, increase the overall reliability of the solution, and reduce costs by no longer paying for Lambda compute time spent waiting synchronously for the analysis result.<\/p>\n\n\n\n
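
The asynchronous flow can be sketched in Python with boto3 as follows. One function starts the job and asks for an SNS notification, the other collects the paginated result once the job has finished. The function names are hypothetical, and the SNS topic and IAM role ARNs are assumed to exist already.<\/p>\n\n\n\n

```python
def _default_client():
    # boto3 is imported lazily so the functions can be tested with a stub.
    import boto3
    return boto3.client('textract')


def start_analysis(bucket, key, sns_topic_arn, role_arn, textract_client=None):
    client = textract_client or _default_client()
    response = client.start_document_analysis(
        DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': key}},
        FeatureTypes=['FORMS', 'TABLES'],
        # Textract publishes the completion status to this SNS topic.
        NotificationChannel={'SNSTopicArn': sns_topic_arn, 'RoleArn': role_arn},
    )
    return response['JobId']


def fetch_blocks(job_id, textract_client=None):
    client = textract_client or _default_client()
    blocks = []
    next_token = None
    while True:
        kwargs = {'JobId': job_id}
        if next_token:
            kwargs['NextToken'] = next_token
        page = client.get_document_analysis(**kwargs)
        if page['JobStatus'] != 'SUCCEEDED':
            raise RuntimeError('Textract job not finished: ' + page['JobStatus'])
        blocks.extend(page.get('Blocks', []))
        # Large results are split across pages linked by NextToken.
        next_token = page.get('NextToken')
        if not next_token:
            return blocks
```

In a Lambda-based design, the first function would typically run in the function triggered by S3, and the second in a separate function subscribed to the SNS topic.<\/p>\n\n\n\n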

At the time of writing this article, the available API calls are as follows:<\/p>\n\n\n\n