{"id":6002,"date":"2023-07-07T09:00:00","date_gmt":"2023-07-07T07:00:00","guid":{"rendered":"https:\/\/blog.besharp.it\/?p=6002"},"modified":"2023-07-07T12:35:22","modified_gmt":"2023-07-07T10:35:22","slug":"extracting-actionable-data-from-structured-documents-with-amazon-textract-aws-lambda-e-amazon-s3","status":"publish","type":"post","link":"https:\/\/blog.besharp.it\/extracting-actionable-data-from-structured-documents-with-amazon-textract-aws-lambda-e-amazon-s3\/","title":{"rendered":"Extracting actionable data from structured documents with Amazon Textract, AWS Lambda and Amazon S3"},"content":{"rendered":"\n
In the digital age, effectively processing and managing large quantities of documents is a priority for companies in every sector. Many organizations are faced with the task of digitizing large volumes of paper documents or processing data from structured documents, such as invoices or contracts, automatically and with minimum human intervention. In this context, Optical Character Recognition (OCR)<\/strong> has proved to be an indispensable tool for automating processes and improving overall efficiency.<\/p>\n\n\n\n
However, being able to extract text from a document is only part of what most applications need; it can be considered a primitive function. The goal is more often to extract specific information<\/strong>, selecting text based on the structure of the document.<\/p>\n\n\n\n
To correctly select valuable information, it is crucial to obtain details on the structure of the document itself, such as how the text is grouped or tabulated and where it sits on the page.<\/p>\n\n\n\n
Providing this kind of structural information is exactly where Amazon Textract<\/strong> excels.<\/p>\n\n\n\n
In addition to providing the ability to extract text from documents, Amazon Textract can identify and return page structure information, opening up a wide range of data processing possibilities. Unlike traditional OCR software, which requires manual configurations and continuous updates to adapt to changes in forms, Amazon Textract uses machine learning models to process any type of document, ensuring accurate extraction of text, handwriting, tables, and other data without any manual intervention.<\/p>\n\n\n\n
Let’s move on to the description of a use case.<\/p>\n\n\n\n
Extracting information from an invoice<\/h2>\n\n\n\n
To explore the potential of Amazon Textract, we will use a (not so) hypothetical case: we need to automatically extract some information from company purchase invoices so that the amounts and dates can be entered in a database that is periodically loaded into the financial management software.<\/p>\n\n\n\n
Therefore, we need to build an automatic system capable of extracting the amount and date from the invoices it receives. For simplicity, let’s assume that all the invoices have the same structure because they all come from the site of the supplier from which our company buys its consumer goods, although Textract can also analyze heterogeneous invoices.<\/p>\n\n\n\n
Invoices are PDF documents meant to be read by a human. They contain headers, the vendor logo image, text, and tables in different places on the page.<\/p>\n\n\n\n
The site sends an invoice via email for each order. In our scenario, the receiving address is an email group; therefore, we can ensure that the automatic system receives a copy of each invoice without affecting the processes involving our operators.<\/p>\n\n\n\n
In this situation, we could sketch the following high-level solution:<\/p>\n\n\n\n