Mindee - AI-based PDF data extraction (API)

PDF scraping/data extraction using the AI-based services of the Mindee API

Our company has a few hundred CAD (computer-aided design) drawings in PDF format. Each holds a lot of information, some of which are important parameters worth extracting. These describe the product (code and name) and its specific version/revision. This information helps us identify the most recent version (issue) of each engineering drawing, and it is also key to determining how the most up-to-date product should look and what it is made of. Using outdated drawings would lead to non-conforming products that the customer(s) would not accept, with financial consequences on top of the Production department's lost time. To avoid such a situation, all files were gathered and the data was extracted.
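Once the product code and revision have been extracted from each drawing, picking the most recent issue is a small grouping step. Below is a minimal sketch of that idea, not the code I actually used. The field names (`product_code`, `revision`, `file`) are hypothetical, and it assumes revisions are either plain numbers without zero padding ("2", "10") or letter sequences ("A", "B", ..., "AA"), so that a longer revision string is always newer.

```python
from typing import Dict, Iterable, Tuple


def revision_key(revision: str) -> Tuple[int, str]:
    """Order revisions by length first, then alphabetically.

    Under the stated assumption this gives "A" < "B" < "AA"
    and "2" < "10".
    """
    rev = revision.strip().upper()
    return (len(rev), rev)


def latest_revisions(records: Iterable[dict]) -> Dict[str, dict]:
    """Return, for each product code, the record with the highest revision."""
    latest: Dict[str, dict] = {}
    for rec in records:
        code = rec["product_code"]
        current = latest.get(code)
        if current is None or revision_key(rec["revision"]) > revision_key(current["revision"]):
            latest[code] = rec
    return latest


# Hypothetical extraction results, one record per PDF drawing.
records = [
    {"product_code": "P-100", "revision": "B", "file": "p100_b.pdf"},
    {"product_code": "P-100", "revision": "AA", "file": "p100_aa.pdf"},
    {"product_code": "P-200", "revision": "2", "file": "p200_2.pdf"},
    {"product_code": "P-200", "revision": "10", "file": "p200_10.pdf"},
]
latest = latest_revisions(records)
```

If your drawings use a different revision scheme (dates, mixed letters and digits), only `revision_key` needs to change.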

The process had to meet a few requirements:
- it should either be done manually,
- or in a similarly simple way (easy coding), using a robust and reliable method, at low cost (or, as usual, even better if free).

I have tested document processing software based on Optical Character Recognition (OCR), for example some Python modules such as Google Tesseract and Keras-OCR, as well as AI software promising automatic "PDF data extraction", e.g. Mindee.

I have also tested a few standalone OCR services, e.g. Adobe PDF Services API, ABBYY's Fine Reader EnhancedOCR engine (API), and Cloudmersive imageapi. In the end, I found Mindee to be the best for our case.

Mindee, like API (Application Programming Interface) services in general, requires a profile (a user account, including a username and an API key). Being quite new on the market, the software has tried to position itself by allowing free usage below a certain number of pages per month; see the website for details.

Its easy handling, together with good documentation and online tutorials covering different languages such as
    - Python 3
    - Node.js
    - Ruby
    - Java
    - .NET
    - PHP

helped me write the code for a simple (Mindee-defined "Off-the-shelf") API and a customised API for data extraction in a few hours.

The website could be slightly more user-friendly when it comes to adding or training new APIs, but the documentation is always there to help you out at every step.

Setting up Mindee APIs

Prepared APIs are ready to use, for example for reading invoice or passport data. I tried the latter. After a few simple registration steps, you can immediately select APIs and add them to your own list of APIs. They are easy to use and require no model training, as Mindee has already done a thorough job of preparing the AI models.
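To give an idea of what a call to a prepared API looks like, here is a minimal sketch using only the Python standard library instead of Mindee's own client package. It is not the code from my repository. The endpoint URL, the `Authorization: Token ...` header and the `document` form field follow the pattern of Mindee's v1 REST API as I understand it; treat all three as assumptions and check them against the current documentation. The helper names `build_predict_request` and `predict` are made up for this example.

```python
import json
import os
import urllib.request
import uuid

# Assumed v1 endpoint of the off-the-shelf passport API; verify in the Mindee docs.
PASSPORT_URL = "https://api.mindee.net/v1/products/mindee/passport/v1/predict"


def build_predict_request(api_key, filename, file_bytes, url=PASSPORT_URL):
    """Build (but do not send) a multipart/form-data POST carrying one document."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        (
            'Content-Disposition: form-data; name="document"; '
            f'filename="{os.path.basename(filename)}"\r\n'
        ).encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        file_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {
        "Authorization": f"Token {api_key}",  # assumed auth scheme
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def predict(api_key, pdf_path):
    """Send one PDF to the API and return the parsed JSON response."""
    with open(pdf_path, "rb") as fh:
        request = build_predict_request(api_key, pdf_path, fh.read())
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `predict("my-api-key", "passport.pdf")` would perform the actual upload and count against the page quota. In practice the official client libraries hide these details, so this is mainly useful for seeing what travels over the wire.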


APIs can also be customized to match your desired application, but in that case I recommend using at least 50 documents for training, so that the AI model's learning phase gives a reliable outcome. The trained models are shown in the APIs list as well.
In my case, the invoice reader API could be used for free for up to 250 pages (not files!) per month.
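Because the limit counts pages rather than files, it helps to plan which files to send in a given month. The sketch below shows the arithmetic only. The function name `plan_month` is made up, the page counts are assumed to be known already (for example from a PDF library), and the budget of 250 is the figure quoted above.

```python
def plan_month(files, page_budget=250):
    """Pick files in order until the monthly page budget is used up.

    `files` is a list of (filename, page_count) pairs. Files that do not
    fit into the remaining budget are deferred to a later month.
    """
    selected, deferred, used = [], [], 0
    for name, pages in files:
        if used + pages <= page_budget:
            selected.append(name)
            used += pages
        else:
            deferred.append(name)
    return selected, deferred, used


# Hypothetical file names and page counts.
files = [("a.pdf", 100), ("b.pdf", 120), ("c.pdf", 40), ("d.pdf", 30)]
selected, deferred, used = plan_month(files)
```

Here four files already add up to 290 pages, so one of them has to wait even though the file count is tiny.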

Use it, it's worth it!

The files are on my GitHub, in the Mindee folder.
