Product identification - a key point in production
Project
The hundreds of engineering CAD (Computer Aided Design) drawings that describe our manufacturing processes are stored in PDF format. Each one contains a wealth of information, from which we had to extract the data that identifies the product accurately (product revision, document version), as well as the product code and name.
Product revision (or version) information helps us identify the latest version (release) of the technical drawing, which reflects the latest modifications to the manufacturing process or the material content of the product. It also determines what the latest product should look like. Using outdated drawings would lead to non-conforming products that the customer(s) would not accept, with significant further consequences. To avoid such situations, all files were collected and the necessary data was extracted with the help of a tool and my own Python code.
Mindee - the toolset
Mindee is a PDF/image scraping tool that combines AI-based computer vision (CV), optical character recognition (OCR) and a Natural Language Processing (NLP) model to extract data from files. These combined services can be accessed through an API (application programming interface) at Mindee.
Preparations
The following were required to complete the plan:
- a supported programming language; from the list I chose Python 3,
- registration on the Mindee website to get a username and an API key,
- a customized OCR API matching our needs, with a model pre-trained on a small part of our files,
- the mindee Python module installed locally,
- files (preferably in PDF format) to scrape, or the Mindee invoice sample file,
- a reliable internet connection, as the files are uploaded to the Mindee server, processed, and then the extracted data is sent back to your application.
Python3 coding in Anaconda Spyder IDE
I used Spyder, the programming environment (Integrated Development Environment, IDE) provided with the Anaconda Python distribution, although it was not required for this project. An IDE is not compulsory, but it helps to reduce coding time in large projects; in this case a plain Python runtime environment and a simple text/code editor would have been enough.
The mindee module was installed with pip (but not with the conda package manager :( ):
pip install mindee
It was fast and easy.
Registration and generating API key
Registration is also simple and can be done on the website. Creating an API key is easy as well: navigate to the API keys page of your profile and select Add new API key. You shortly get your secret API key, which you should note (or copy) as it is needed in your code.
The API key gives you the right to use the services; it also determines which API you may use and how.
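Because the key is secret, one option is to keep it out of the source file. The following is only a minimal sketch: the environment variable name MINDEE_API_KEY and the helper load_api_key are my own assumptions for illustration, not something Mindee prescribes.

```python
import os


def load_api_key(var_name: str = "MINDEE_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    The variable name is an assumption made for this sketch.
    """
    api_key = os.environ.get(var_name, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return api_key
```

The returned value could then be passed on when the client is created, for example Client(api_key=load_api_key()).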
Use of the API is free for up to 250 pages (not files!) per month.
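Since the limit counts pages and not files, it may help to plan the uploads in advance. The sketch below assumes that the page count of each file is already known (the file names and numbers are made up) and simply groups the files into consecutive batches that each fit the monthly limit.

```python
def plan_monthly_batches(page_counts, monthly_limit=250):
    """Group files into consecutive batches whose page totals fit the limit.

    page_counts: list of (filename, number_of_pages) tuples.
    """
    batches, current, used = [], [], 0
    for name, pages in page_counts:
        if pages > monthly_limit:
            raise ValueError(f"{name} alone exceeds the monthly limit")
        if used + pages > monthly_limit:
            # the current batch is full, start the next month's batch
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += pages
    if current:
        batches.append(current)
    return batches


# made-up example data: (file name, number of pages)
drawings = [("a.pdf", 120), ("b.pdf", 100), ("c.pdf", 60), ("d.pdf", 250)]
print(plan_monthly_batches(drawings))
# prints [['a.pdf', 'b.pdf'], ['c.pdf'], ['d.pdf']]
```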
Customizing an OCR model
In my case the off-the-shelf API solutions were not appropriate, so I decided to build my own API model, training it first with 20 and finally with more than 60 pages. Training an AI model with a larger number of samples generally gives better predictions (higher precision), that is, a better match with the real text. As we only have a few hundred files, it was not worth using more of them for training, and at around 70 pages the model's prediction accuracy was already above 90%. This may sound low, but trust me, it was good enough for the project.
Training was done on 5-20 uploaded pages per round (uploaded in steps of 5 files). I only had to mark the (text) boxes I wanted to extract and type or correct the predicted text to match the real content of each box as feedback. It is important to note that each field has a data format, which has to be selected from the following:
- number (different kind),
- string (full text or mixed),
- date (recognised with higher precision in the US English order: mm-dd-yyyy),
- phone number,
- email address,
- URL.
I used the date format for the publishing date, the number format for the version number, and the string format for the item code and item name.
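As the date order matters, it can be useful to bring the extracted date strings into one unambiguous form before comparing them. This is a small sketch using only the standard library; the list of candidate formats is an assumption and should be adjusted to what the drawings actually contain.

```python
from datetime import datetime

# Candidate formats are assumptions; adjust them to what the drawings contain.
DATE_FORMATS = ("%m-%d-%Y", "%m/%d/%Y", "%Y-%m-%d")


def normalize_date(text):
    """Return the date as an ISO string (yyyy-mm-dd), or None if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None
```

Returning None instead of raising an error makes it easy to collect the unparseable values for a manual check.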
Keep in mind that with a wrong format choice the model can misinterpret the data and, as a consequence, the output may mislead the workers. To minimize this kind of error, approximately 10% of the drawing files were double-checked manually. Comparing the extracted data with the real data by hand let us verify whether the model was reliable.
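The manual check can be supported by a few lines of code. The sketch below is not the procedure I actually followed, only an illustration: the function names are hypothetical, the files are picked at random, and the extracted values are compared with the manually verified ones as exact strings.

```python
import random


def pick_for_manual_check(file_names, share=0.10, seed=None):
    """Randomly pick roughly `share` of the files (at least one) for review."""
    rng = random.Random(seed)
    count = max(1, round(len(file_names) * share))
    return rng.sample(list(file_names), count)


def match_rate(extracted, verified):
    """Share of manually verified values that the extraction reproduced exactly.

    Both arguments map (file name, field name) -> value as a string.
    """
    if not verified:
        raise ValueError("no verified values to compare against")
    hits = sum(1 for key, value in verified.items() if extracted.get(key) == value)
    return hits / len(verified)
```

A fixed seed makes the selection repeatable, which helps if the check has to be documented.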
Optimizing by further training
After a new set of 20 pages has been validated, Mindee creates an updated AI model and sends an email ("[mindee] New model trained and deployed"). I really enjoyed those moments.
There is also an online tool with which you can immediately verify the newly created or updated model by dragging in a PDF and confirming or rejecting the predictions; these checks also count towards the training amount.
Accessing data with a software of your choice
Once the model is ready, the API is available, and the API key and username have been created, software written in a supported language (Python 3 in my case) can reach the API service and receive the extracted data.
I created two functions to initialize and to save a Pandas dataframe which in the process is filled in with extracted data. During the initialization I defined the column names (the header) of the dataframe.
'productCode': an alphanumeric code referring to one type of product,
'productName': name of the product (defined by the creator, or the creator's company),
'productVer': product version, which corresponds to a well-defined setup, look and material content of the product,
'verDate': date when the product version was declared,
'doc_ver': version of the document, different from product version,
'printDate': publishing date of the document,
'other': a field that may or may not contain data (depending on the publisher/creator).
The function that saves the Pandas dataframe uses the built-in .to_excel() method, which requires a writer engine. Here the engine is the openpyxl module, but the xlsxwriter module may be used as well. The required filepath is the path where we would like to save the dataframe, ending with '.xlsx', as the aim is to save to an Excel file for further use.
import pandas as pd

def create_pd_DF():
    # function to create the Pandas dataframe (df) that stores the extracted data
    # predefine column names:
    colname_list = ['doc_ver', 'other', 'printDate', 'verDate', 'productCode', 'productName', 'productVer']
    df_CAD = pd.DataFrame(columns=colname_list)
    return df_CAD, colname_list

def save_pd_DF(df_CAD):
    # save the gathered data stored in the Pandas dataframe to the given filepath (xlsx!)
    cad_filepath = 'FILEPATH_HERE'
    writer_CADinfo = pd.ExcelWriter(cad_filepath, engine='openpyxl')  # or engine='xlsxwriter'
    df_CAD.to_excel(writer_CADinfo, header=True)
    writer_CADinfo.close()  # close the file to release it
The code starts by creating a Mindee client (use your API key here!); the Mindee services then recognise the endpoint based on the username and endpoint name.
The folder path is given as a string (ending with the appropriate '/' or '\\' separator), and the filenames are defined separately in a list.
from mindee import Client, PredictResponse, product

mindee_client = Client(api_key='32CHARACTER_API_KEY_HERE')
custom_endpoint = mindee_client.create_endpoint("ENDPOINT_NAME", "USERNAME")
# Folder and names of the local files to process
folderPath = 'FOLDER_PATH'
fileList = ['file1.PDF', 'file2.PDF']  # insert filenames here
df_CAD, colname_list = create_pd_DF()
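Typing the file names by hand works for a few files, but with hundreds of drawings the list can also be collected from the folder. This is a sketch using the standard pathlib module; the helper name list_pdf_paths is hypothetical, and it assumes that all drawings sit directly in one folder.

```python
from pathlib import Path


def list_pdf_paths(folder):
    """Return the PDF files in `folder`, sorted by name (any letter case)."""
    return sorted(
        path for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Each returned path already contains the folder, so no trailing '/' or '\\' has to be handled; str(path) gives the full path as a string where one is needed.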
The files are processed in a loop, in which the full path is built by concatenating the folder path and the file name.
rowindex = 0  # dataframe row of the file being processed
for file_name in fileList:
    full_path = folderPath + file_name  # os.path.join() may be used
    input_doc = mindee_client.source_from_path(full_path)
    result: PredictResponse = mindee_client.parse(product.CustomV1, input_doc, endpoint=custom_endpoint)
    # Print a brief summary of the parsed data - if needed
    print(result.document)
    # process files ...
Here is the core of the code, which processes the loaded files. It either prints the complete document in the format in which the API returns the data (including header information, paging, ...) or directly accesses the field names along with the predicted field values (the data that we would like to extract):
    # (inside the file loop; rowindex is the dataframe row of the current file and starts at 0)
    # Print a brief summary of the parsed data - if needed
    print(result.document)
    # Iterate over all the fields in the document
    index = 0
    for field_name, field_values in result.document.inference.prediction.fields.items():
        if index == 2:
            print(field_name, "=", field_values)  # spot check: print the third field only
        # fields are assumed to arrive in the same order as colname_list
        df_CAD.at[rowindex, colname_list[index]] = str(field_values)
        index += 1
    rowindex += 1
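The loop above matches fields to columns by position, which only works while the API returns the fields in the same order as the column list. A possible alternative is to match them by name. In the sketch below a plain dictionary stands in for the fields returned by the API, the helper name fields_to_row is hypothetical, and it is assumed that the field names of the model are identical to the column names.

```python
def fields_to_row(fields, column_names):
    """Build one row dict, matching extracted fields to columns by name.

    `fields` stands in for the name -> value mapping returned by the API;
    missing fields become empty strings instead of shifting other columns.
    """
    return {
        name: (str(fields[name]) if name in fields else "")
        for name in column_names
    }


columns = ["doc_ver", "other", "printDate", "verDate",
           "productCode", "productName", "productVer"]
sample_fields = {"productCode": "AB-123", "productVer": 4, "doc_ver": "B"}  # made-up values
print(fields_to_row(sample_fields, columns))
```

This way a field that is missing from one drawing leaves its own cell empty and does not move the remaining values into the wrong columns.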
At the end the data(frame) is saved to the given path.
save_pd_DF(df_CAD)
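With the data collected, the latest revision of each product can be looked up. This is only a sketch of the idea, not part of my original script: the helper name latest_versions is hypothetical, the rows are plain dictionaries using the column names above, and it assumes that 'verDate' has already been brought into ISO form (yyyy-mm-dd).

```python
from datetime import date


def latest_versions(rows):
    """Keep, for each productCode, the row with the most recent verDate.

    Each row is a dict with 'productCode', 'productVer' and an ISO 'verDate'.
    """
    latest = {}
    for row in rows:
        code = row["productCode"]
        current = latest.get(code)
        if current is None or (
            date.fromisoformat(row["verDate"]) > date.fromisoformat(current["verDate"])
        ):
            latest[code] = row
    return latest
```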
SUMMARY
Advantages of Mindee (customised API) services
+ optical character recognition (OCR) and computer vision (CV) in one application, a very useful combination that consequently outperforms competing PDF extraction tools,
+ easy registration and API key generation, and a website that is simple to use, thanks to
+ a vast amount of tutorials and detailed documentation, which also leads to
+ simple usage of the API in any of the several supported programming languages,
+ an immediate online functionality test with a drag&drop option to verify the trained model's reliability and performance.
Disadvantages
- 250 pages per month is IN GENERAL enough for only a portion of the documents. Even a small company uses more PDF files in its electronic communication. Fortunately we needed the service for only one purpose, but do not forget that there are paid options,
- the model training does not give immediate feedback when a new API model is created, at least not in a form that I could easily notice, either on the website or through an instantly sent email.
Altogether, I was extremely satisfied, especially because I finally found a service that worked reliably on our files.