Decision Support subject - education

 Decision Support Systems and Business Intelligence


During my recent university studies (2021-24), I had a subject called Decision Support Systems and Business Intelligence. This included theoretical background on Decision Support Systems (DSS) and Business Intelligence (BI) as key tools for business decision making. DSS is a set of systems that use data, models and analytical techniques to help managers make informed decisions. These systems allow complex problems to be solved quickly and efficiently, especially in situations where human intuition alone is not sufficient or where gathering the information without a complex information system would take more time. So the advantage of using such systems is not only more precise decision making, but also making the required decisions at the right time.

Business intelligence is a broader concept that includes data collection, processing, analysis and reporting, i.e. all the preparation before the decision. BI tools and systems are designed to provide visibility into company processes, or, if external data (from cooperating or competing businesses) is available, to identify market trends and support strategic decision-making. Business intelligence enables companies to convert their data into a competitive advantage. Both DSS and BI systems share the common goal of providing the right information to decision-makers at the right time, thus increasing the efficiency and success of the organization.

On the practical side of the subject we were taught the use of MS Excel, Power Query and, most of all, MS Power BI. ETL and analysis processes were practiced both in Excel and in Power BI.

The exam involved theoretical and practical questions to answer, as well as problems to solve.



Power BI - logo
Power BI is considered the most powerful tool on the list above.
The problem-solving task was based on a dataset of a flower shop selling different types of flowers to specific people (Customers). One of the questions was to show the income of the shop broken down by flower type.

flower shop income according to different flowers
Sharp-eyed readers can spot that there is no 4th type of flower. There may be several reasons for it; if I had been in contact with the flower shop owner, I would have asked about it, just out of curiosity.
It seems evident that the light blue colored product (Flower A) had the highest impact on the revenue.
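Just to make the underlying calculation explicit (this is not part of the exam solution), here is a minimal pandas sketch of the same revenue-per-flower-type aggregation; the column names and values are assumptions, not the original dataset:

import pandas as pd

# hypothetical invoice-line data; column names and values are illustrative only
sales = pd.DataFrame({
    'FlowerType': ['A', 'B', 'A', 'C'],
    'Quantity':   [10, 5, 3, 8],
    'UnitPrice':  [3000, 1200, 3000, 900],
})

# revenue per line = quantity * unit price, then summed per flower type
sales['Revenue'] = sales['Quantity'] * sales['UnitPrice']
revenue_by_type = sales.groupby('FlowerType')['Revenue'].sum().sort_values(ascending=False)
print(revenue_by_type)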
After the exam I went further and explored the data for the Customers' preferred flower-buying day. I would have bet on Friday, counting on casual (romantic) Customers surprising their loved ones before the weekend starts... here you are:
Friday is the busiest day of the Flower shop.
I was right. Maybe my justification is not the real one, but that could only be verified with further investigation in the field, supplementing the dataset with additional information.
There is a slight inaccuracy of the BI software to point out: in the legend the days are not listed in the order of their day number within the week, but despite this discrepancy the colors and day names can still be paired.
Day numbers had to be matched with their English names, for which I created (typed in) a new table and connected it to the Day data extracted from the Invoice date (datetime) column.
Creating Connection between Tables (fields) - Power BI
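Outside Power BI, the day-number-to-day-name mapping described above can be sketched in pandas as well; the column names below are assumptions for illustration, not the original model:

import pandas as pd

# hypothetical invoice dates; the column name is an assumption
invoices = pd.DataFrame({'InvoiceDate': pd.to_datetime(['2023-03-03', '2023-03-04', '2023-03-06'])})

# extract the day number from the invoice date (here Monday=0 ... Sunday=6)
invoices['DayNo'] = invoices['InvoiceDate'].dt.dayofweek

# the small typed-in lookup table pairing day numbers with English day names
day_names = pd.DataFrame({'DayNo': range(7),
                          'DayName': ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                                      'Friday', 'Saturday', 'Sunday']})

# the merge plays the role of the relationship created between the two tables in Power BI
invoices = invoices.merge(day_names, on='DayNo', how='left')
print(invoices)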

Another aim of the exploration was to see whether there are people who insist on buying one type of flower, or always the same type of bouquet. Since the Customers and Flower types were named with alphabet letters, I converted these to numbers with a dedicated function, creating new columns, and then plotted the data:
Flower type and Customer ID, bought quantities and unit price - Power BI

The scatter plot involves Flower types (1-6, except 4) and Customer IDs (1-13), the total quantities bought of the specific flower types and the unit price of the specific flower (see the color legend in the top-left corner). Flowers 1 and 2 were bought by the most Customers (the most spots along the horizontal axis). Customer 12 bought the largest amount of one type of flower, Flower type 3 (the most across all flower types), as shown by the largest spot on the plot, where spot size reflects the total quantity purchased by that Customer. Whether it was bought all at once, resulting in an enormous bouquet, or the same flower was bought many times, cannot be determined from this plot.
The spots are colored according to unit price, in increasing order from top to bottom, as shown in the Unit price legend in the top-left corner of the image.
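The letter-to-number conversion described above can also be sketched in pandas; the column names and the helper function are illustrative assumptions, not the exact formula used in Power BI:

import pandas as pd

# hypothetical purchase records; Customers and Flower types are single letters as in the exam dataset
purchases = pd.DataFrame({'Customer': ['a', 'b', 'm'], 'Flower': ['A', 'B', 'F']})

def letter_to_number(letter):
    # 'A'/'a' -> 1, 'B'/'b' -> 2, ... mirroring the numeric IDs shown on the scatter plot
    return ord(letter.upper()) - ord('A') + 1

purchases['CustomerID'] = purchases['Customer'].apply(letter_to_number)
purchases['FlowerID'] = purchases['Flower'].apply(letter_to_number)
print(purchases)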
To make it easier to compare the impact of each flower on the income, I attached an additional horizontal bar graph to the plot, in exactly the same order as the flower types appear on the plot:
Flower type and Customer ID, bought quantities and unit price - Power BI
This combined figure demonstrates that although Flower A was not bought in the largest quantity, taking into account its highest unit price (3000) it has the highest impact on the income, as was also shown in a simpler way on the first bar graph above.

Note: Power BI pops up a tooltip with the data behind a spot if you hover over it. This is very helpful for beginner presenters who stick to the data and want to back up their explanation with the right numbers. This functionality is not demonstrated here.

Mindee - Product identification with CAD file scraping

Product identification - a key point in production

AI Model training - Mindee

Project


Read other Mindee related posts

The hundreds of engineering CAD (Computer Aided Design) drawings of our manufacturing processes are in PDF format. Each contains a wealth of information, from which we had to extract the data that allows us to identify the product accurately (product revision, document version), as well as to obtain the product code and name.
Product revision (or version) information helps us to identify the latest release of the technical drawing, reflecting the latest modifications in the product manufacturing process or in the material content. It is also key to determining how the latest product should look. Using outdated drawings would lead to non-conforming products that would not be accepted by the customer(s), with further significant consequences. With this in mind, all files were collected and the necessary data was extracted with the help of a tool and my Python code.

Mindee - the toolset

Mindee is a PDF/image scraping tool that uses an AI-based computer vision (CV) model supplemented with optical character recognition (OCR) and Natural Language Processing (NLP) to extract data from files. These joint services can be accessed through an API (application programming interface) at Mindee.

Preparations

The following were required to complete the plan:
- a programming language that is supported; from the list I chose Python3,
- registration on the Mindee website to get a username and an API key,
- a customized OCR API to match our needs, with a model pre-trained on a small part of our files,
- the mindee Python module installed locally,
- files (preferably in PDF format) to scrape, or the mindee invoice sample file,
- a reliable internet connection, as the files are uploaded to the mindee server, processed, and then the extracted data is sent back to your application.

Python3 coding in Anaconda Spyder IDE

I used Spyder as a programming environment (Integrated Development Environment, IDE), provided with the Anaconda Python distribution; however, for this project it was not required. These tools are not compulsory but help to reduce coding time in large projects; in this case a simple Python runtime environment and a simple text/code editor would have been enough.
The mindee module was installed easily (but not with the conda package manager :( ):

pip install mindee
It was fast and easy.

Registration and generating API key

API key creation (Mindee)
Registration is simple and can be done on the website. API key creation is easy as well: navigate to the API keys page of your profile and select Add new API key. Shortly you get your secret API key, which you should note (or copy) as it is needed in your code.
Created API key - Mindee
The API key gives you the right to use the services; it also determines which API you may use and how.
Up to 250 pages (not files!) per month, the use of the API is free.


Customizing an OCR model

In my case the off-the-shelf API solutions on offer were not appropriate, so I decided to build my own API model, training it first with 20 and finally with more than 60 pages. Training an AI model on a larger number of samples generally gives you better predictions (higher precision), i.e. a better match with the real text. As we only have a few hundred files, it was not worth using more files for training, and at the level of around 70 pages the model's precision was already above 90%. This may sound low, but trust me, it was good enough for the project.

AI Model training - Mindee


Field types - Mindee
Training was done on 5-20 uploaded pages per round (uploaded in steps of 5 files). I only had to identify the (text) boxes that I wanted to extract and type in or correct the predicted text to match the real content of the box as feedback. It is important to note that there are different data formats from which we have to select the appropriate one, such as
- number (different kinds),
- string (full text or mixed),
- date (higher precision with the English-style order: mm-dd-yyyy),
- phone number,
- email address,
- URL.
I used the date, number and string formats for the publishing date, the version number, and the item code and item name, respectively.
It is important to keep in mind that with the wrong format choice the model can misinterpret the data and, as a consequence, the output may mislead the workers. To minimize this kind of error, approximately 10% of the total amount of drawing files were double-checked manually. By comparing the extracted and the real data manually, we could verify whether the model was reliable.
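As a rough illustration of how such a spot check could be scripted, here is a minimal pandas sketch; the file names, the shared key column and the compared field are all hypothetical assumptions:

import pandas as pd

# hypothetical files: the model output and the manually collected reference sample (~10% of the drawings)
extracted = pd.read_excel('extracted_CAD_data.xlsx')
reference = pd.read_excel('manual_check_sample.xlsx')

# join the two sets on the product code and compare the version field row by row
merged = reference.merge(extracted, on='productCode', suffixes=('_manual', '_model'))
mismatches = merged[merged['productVer_manual'] != merged['productVer_model']]

accuracy = 1 - len(mismatches) / len(merged)
print(f'Version field accuracy on the checked sample: {accuracy:.1%}')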

Optimizing by further training

After validating a new set of 20 pages, Mindee creates an updated AI model (an email is sent: "[mindee] New model trained and deployed"). I really enjoyed those moments.

Custom model optimization - Mindee

For each PDF and each item the Validate button was hit ... many, many times, but that's how it works.

There is an online tool, so you can immediately verify the newly updated/created model by dragging in a PDF and confirming/rejecting the predictions, which indeed counts toward the training amount.

Accessing data with a software of your choice

When the model is ready and the API is available, and the API key and username have been created, a software written in a supported language from the list (in my case Python 3) can reach the API service and receive the extracted data.
I created two functions: one to initialize and one to save a Pandas dataframe, which during the process is filled with the extracted data. During the initialization I defined the column names (the header) of the dataframe:
'productCode': a code (alphanumeric) referencing to one type of product,
'productName': name of the product (defined by the creator, or creator's company),
'productVer': product version; refers to a well-defined setup, look and material content of the product,
'verDate': date when the product version was declared,
'doc_ver':  version of the document, different from product version,
'printDate': publishing date of the document,
'other':  a field that may or may not contain data (depending on the publisher/creator).
In the function that saves the Pandas dataframe, the built-in .to_excel() method is used, which requires a writer engine. The engine here uses the openpyxl module, but the xlsxwriter module may be used as well. The required filepath is the path where we would like to save the dataframe, ending with '.xlsx', as the aim is to save to an Excel file for further use.

import pandas as pd

def create_pd_DF():
    # create a Pandas dataframe (df) to store and later save the extracted data
    # predefine column names:
    colname_list = ['doc_ver', 'other', 'printDate', 'verDate', 'productCode', 'productName', 'productVer']
    df_CAD = pd.DataFrame(columns=colname_list)
    return df_CAD, colname_list

def save_pd_DF(df_CAD):
    # save gathered data stored in the Pandas dataframe -> given filepath (xlsx!)
    cad_filepath = 'FILEPATH_HERE'
    writer_CADinfo = pd.ExcelWriter(cad_filepath, engine='openpyxl')  # engine='xlsxwriter')
    df_CAD.to_excel(writer_CADinfo, header=True)
    writer_CADinfo.close()  # close file to release it
The code starts with a Mindee client call (use your API key here!), then the endpoint is recognised by the Mindee services based on the username and endpoint name.
The folder path is given as a string (ending with the appropriate '/' or '\\' sign); the filenames are defined separately in a list.
from mindee import Client, PredictResponse, product

mindee_client = Client(api_key='32CHARACTER_API_KEY_HERE')
custom_endpoint = mindee_client.create_endpoint("ENDPOINT_NAME", "USERNAME")

# Load a file from network
folderPath = 'FOLDER_PATH'
fileList = ['file1.PDF', 'file2.PDF'] #insert filenames here

df_CAD, colname_list = create_pd_DF()
The files are processed in a loop in which the full path is determined by concatenating the folder path and the file name.
rowindex = 0  # row counter for the dataframe (one row per processed file)
for index, file in enumerate(fileList):
    full_path = folderPath + file  # os.path.join() may be used instead
    input_doc = mindee_client.source_from_path(full_path)
    result: PredictResponse = mindee_client.parse(product.CustomV1, input_doc, endpoint=custom_endpoint)

    # Print a brief summary of the parsed data - if needed
    print(result.document) 

    # process files ...
Here is the core of the code, processing the loaded files and printing out the complete document in the format in which the API returns the data (with header information, paging, ...), or directly accessing the field names along with the predicted field values (the data we would like to extract):

    # Print a brief summary of the parsed data - if needed
    print(result.document)

    # Iterate over all the fields in the document
    col_index = 0
    for field_name, field_values in result.document.inference.prediction.fields.items():
        if col_index == 2:
            print(field_name, "=", field_values)
        # store the predicted value as text in the matching dataframe column
        df_CAD.at[rowindex, colname_list[col_index]] = str(field_values)
        col_index += 1
    rowindex += 1
At the end the data(frame) is saved to the given path.
save_pd_DF(df_CAD)
Check complete sample code here or on GITHUB.

SUMMARY

Advantages of Mindee (customised API) services

+ optical character recognition (OCR) + computer vision (CV) in one application, a very useful combination that consequently outperforms competing PDF extraction tools,
+ easy registration, API key generation and simple use of the website, thanks to
+ a vast amount of tutorials and detailed documentation, which also enables
+ simple usage of the API in any of the several supported programming languages,
+ immediate online functionality test with drag&drop option to verify the trained model's reliability and performance.

Disadvantages

- 250 pages per month is IN GENERAL not enough, it covers only a portion of the documents; even a small company uses more PDF files in its electronic communication. Fortunately we needed it only for one purpose, but do not forget that there are paid options,
- the model training does not give you immediate feedback on the creation of a new API model, at least not in a form that I could easily notice, either on the website or as an instant email.

Altogether, I was extremely satisfied, especially because I finally found a service that reliably worked on our files.

Mindee - a simple invoice data extractor

Mindee invoice OCR API- using Python3

MINDEE LOGO

Read other Mindee related posts


A piece of software that I found after a long search covering online and offline solutions. I needed a PDF scraping/data extraction tool using AI-based services accessed through an API.
In addition to the many useful features of mindee, a repository of predefined document data extraction models is available through an AI-based API, which registered users can access for free - with a limited number of pages per month, of course.
Among the several "off-the-shelf" solutions (OCRs: invoice, receipt, passport, licence plates), the use of the invoice API is presented below. The code is very simple, as mindee's well-constructed website, including sample codes, is a great help in using the service.
What is needed:
- Python (or another supported language, see below) installed; it is easier if an IDE (Integrated Development Environment) helps with coding, I used Spyder provided in the Anaconda distribution,
- mindee registration (see above),
- the mindee Python module installed (see below),
- files (preferably in PDF format) to scrape, or the mindee invoice sample file,
- a reliable internet connection, as the files are uploaded to the mindee server, processed, and then the extracted data is sent back to your application.

The mindee documentation on invoice file data extraction starts with the installation of the required Python module (in the system prompt window):

pip install mindee
Although I have the Anaconda distribution, which uses the conda package manager, I was forced to use pip to install the mindee module. That was the only drawback of the whole project, never mind!
The installation is a relatively fast process (of course this also needs an internet connection!).

As usual, API services require a user account, including a username and the API key generated by the website (mindee system) to access the tool.
Pricing was a tempting offer: free up to 250 pages (not files!) of usage per month. See Mindee website for details!
To receive an API key, select the required API from the several offered "off-the-shelf" solutions, in this case the invoice API. Click on the Invoice card in the APIs Store.
This is added to your APIs:

Now you only need to copy the API key and username from the website. 

It is important to note that there are examples (codes) in the following languages for all APIs, including the invoice OCR case: Python3, Node.js, Ruby, Java, .NET, PHP,
so if you are more interested in one of those, go ahead and check their website!

The provided code only needs your API key and the filepath, filled in at the right places, which I highlighted.

from mindee import Client, PredictResponse, product

# Init a new client
mindee_client = Client(api_key= "YOUR32CHARACTERAPICODEHERE")

# Load a file from disk or from local network
input_doc = mindee_client.source_from_path('FILEPATH')

# Parse the loaded document.
result: PredictResponse = mindee_client.parse(product.InvoiceV4, input_doc)
As stated in the original code, "The endpoint name must be specified since it cannot be determined from the class." It is not mentioned, but defining product.InvoiceV4 is not enough: you really have to activate the invoice API service on the website, as I indicated above. Otherwise, despite the correct Python code, the process returns with an error due to the unreachable service.
If everything goes fine, you can print out the whole document or print the "predictions", i.e. the texts found in the uploaded document, according to the invoice API:
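A minimal sketch of guarding the parse call so that an unreachable or not-yet-activated API fails with a readable message instead of a bare traceback; the exception type is kept generic on purpose, check the mindee documentation for the exact error classes:

try:
    result: PredictResponse = mindee_client.parse(product.InvoiceV4, input_doc)
except Exception as err:
    # most likely the Invoice API was not activated on the website, or the API key is wrong
    print('Mindee request failed:', err)
    raise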

# Print a summary of the API result
print(result.document)

# Print the document-level summary
print(result.document.inference.prediction)

The output includes (if the fields are declared on the invoice):

  • Invoice Number, Purchase Date, Due Date,
  • amounts such as Total Net, Total Amount, Total Tax,
  • Supplier data such as Name, Company Address,
  • Customer data such as Name, Company Address.
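If only a few of these values are needed, they can be read from the prediction object directly. The attribute names below follow the Mindee InvoiceV4 documentation as I recall it, so treat them as assumptions and verify against the current SDK:

prediction = result.document.inference.prediction

# attribute names are assumed from the InvoiceV4 docs; adjust if the SDK differs
print('Invoice number:', prediction.invoice_number.value)
print('Total amount  :', prediction.total_amount.value)
print('Supplier name :', prediction.supplier_name.value)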

For compatibility and other details, check the Release notes section.

 AI & AUT EXPO




"Exhibition for Local, Regional and European Technology and Innovation"
It was held in Budapest at the Hungarian Railway Museum, on 14-15 February 2023.

My aim was to get a glimpse of the state and level of Hungarian companies and universities with regard to AI and automation topics, and also to meet people working in the domain of optical character recognition (OCR) based automated image processing, to push forward my project of extracting the data that I need for production resource lists (recipes) to import into our ERP system.

I had the chance to meet a lot of people and to discuss details of ongoing developments ... not only those connected to my first aim; as I am a curious man, I tried to discover and understand almost all applications presented in a talk or by someone at a booth. Sometimes I felt that I could inspire the developers/researchers with my viewpoints and thoughts, or would even be happy to join their team.

Automated object recognition software built into different cars, machine learning based systems for autonomous driving, medical text and MRI image processing, other healthcare related topics, autonomous drones for the transportation of the future, large language models, speech recognition software and much more were the topics of the conference.

I felt some "AI vibe" while being there. I am happy that I could attend the lectures and that I could get some useful tricks in my question as well.

I will check further OCR-based AI models, for example Google Tesseract or its Python wrapper (pytesseract), and will also try to find an API (if possible Python-based) that could give better results on data extraction compared to the tools I have already checked in the previous months.

Retail company - Creation, load and analysis - MSSQL

Analysis of a retail company's transaction dataset using an MS SQL database.

In this description, Microsoft SQL Server Management Studio (SSMS) version 2022 was used to create the dataset and to analyse the data.

  1. Checking server and database properties

First of all, to avoid discrepancies, it is better to adjust the character set (collation) of the server to match the collation of the data.
-- server COLLATION property
SELECT SERVERPROPERTY('collation');
GO

The reply can be for example: SQL_Latin1_General_CP1_CI_AS.

The database was named Questions and Answers, QADB for short.

USE QADB;
GO

SELECT DATABASEPROPERTYEX(DB_NAME(), 'collation');
SELECT DB_NAME(); 
GO
The latter SQL queries can only be used once the database has been created. It is best if the database collation matches that of the loaded data, and even better if it also matches the database server collation, leading to fewer errors in special character handling and, consequently, in data analysis/display.

Actually, to create a new database, first change the context of the current database to the master database. The master database is a system database that contains crucial information about the SQL Server instance, such as system configuration settings, login accounts and metadata about all other databases within the instance. To allow a complete, safe restart of the SQL DB and query building process, a DROP statement is inserted, followed by the actual database creation:

USE master;
DROP DATABASE IF EXISTS QADB;
GO

-- creating new database (QADB)
CREATE DATABASE QADB;
GO

In case the collation of the entire database has to be changed, use:
ALTER DATABASE QADB
COLLATE SQL_Latin1_General_CP1_CI_AS;
In another case, when the collation of a single column has to be changed, for example textual entries with special characters for proper display, use (the table name below is a placeholder):
ALTER TABLE KDB.TableName -- replace with the table that holds the column
ALTER COLUMN city nvarchar(100)
COLLATE SQL_Latin1_General_CP1_CI_AS
NOT NULL;
Create a user (at master DB level) with specified access to the (QA) database:
CREATE LOGIN NewUser WITH PASSWORD = 'SecurePassword';

This can also be set later, but for simple analysis it is not required, as by creating the database you immediately get access to it as the default user.

Start using the created database by switching from the master DB to QADB:
USE QADB;
GO
CREATE SCHEMA KDB
AUTHORIZATION dbo;
GO
The second part instructs the database server to create the master database file (mdf) and the transaction log file (ldf). There are certain restrictions and details to follow to keep the database and the server always at their best performance; see the links.
  2. Creating tables and fields with constraints
Create the tables of the database with realistic and logical limitations (constraints 1, 2), such as: product_price cannot be negative; product_name should not be longer than 50 characters but can be shorter, which can be defined with a variable character length type: nvarchar.
DROP TABLE IF EXISTS KDB.Product; -- safety drop on restart
CREATE TABLE KDB.Product (
	productID bigint IDENTITY(1000000000000,1) NOT NULL
		CONSTRAINT PK_Product_productID PRIMARY KEY,
	product_name nvarchar(50) NOT NULL,
	product_price int DEFAULT 0 NOT NULL
		CHECK (product_price >= 0) -- price cannot be negative
);
The CHECK constraint is a limitation that does not allow values which do not match the declared condition to be entered (from the moment of creation with the defined limitation).
Create another table to store warehouse related information, such as the quantity of available products. Being a new table, it requires an identity (primary key*) field as well: the stock ID:
CREATE TABLE KDB.ProductRegister (
product_stockID bigint NOT NULL, -- same data type as KDB.Product(productID)
product_quantity INT NOT NULL
	CHECK (product_quantity >= 1)
);
The product_stockID defined above in the ProductRegister table is related to productID in a 1:1 relationship, so these fields have to be connected: besides being an ID in its own table, product_stockID also requires a foreign key reference to productID:
ALTER TABLE KDB.ProductRegister
ADD CONSTRAINT PK_ProductRegister_product_stockID PRIMARY KEY (product_stockID),
	CONSTRAINT FK_ProductRegister_productID FOREIGN KEY (product_stockID)
	REFERENCES KDB.Product(productID);
GO
Primary keys and foreign keys may be named as you like, but it is optimal to give a descriptive name, including its nature ("PK_" or "FK_"), the name of the field in the current table and, in the case of a foreign key, the name of the referenced primary key in the other table as well.
A business always has clients, and it is important to store the related information in the database:
CREATE TABLE KDB.Customers (
customerID INT NOT NULL IDENTITY(1,1), 
customer_name nvarchar(50) NOT NULL,
phonenumber varchar(15),
address varchar(100) -- this could be split into parts if required
);

ALTER TABLE KDB.Customers
	ADD CONSTRAINT PK_Customers_customerID PRIMARY KEY (customerID);
GO
If the Warehouse, Product and Customer information all have their fields defined, then only the purchases remain to be registered. There are simple and complex ways to create the required structure; here I chose a medium-complexity solution. We store the main purchase (invoice/bill/transaction) data in one table (SalesHead), and the exact items (codes) and related information are stored in a separate but connected table, Sales:
DROP TABLE IF EXISTS KDB.SalesHead; -- for safety restart reasons
CREATE TABLE KDB.SalesHead (
customerID INT NOT NULL
	CONSTRAINT FK_Customers_SalesHead_customerID FOREIGN KEY
	REFERENCES KDB.Customers (customerID),
transactionID INT IDENTITY(1,1) NOT NULL
	CONSTRAINT PK_SalesHead_transactionID PRIMARY KEY,
totalamount INT NOT NULL
	CHECK (totalamount >= 1),
t_date SMALLDATETIME NOT NULL
	);

ALTER TABLE KDB.SalesHead
	ADD CONSTRAINT UQ_SalesHead_salesCustomerID UNIQUE (customerID, transactionID);
GO
customerID and transactionID together create a value pair which is able (as a pair) to identify the transaction, but they were not chosen as a composite (combined) primary key* because, for simplicity, it is better to create a separate ID for every event: transactionID.
The unique nature of the identity (typically a large number or a long alphanumeric value) is provided by the server, as on every new entry the identity value 'increases'.
DROP TABLE IF EXISTS KDB.Sales;
CREATE TABLE KDB.Sales (
salesID INT NOT NULL, -- several item rows may share the same salesID (one row per purchased item)
	CONSTRAINT FK_SalesHeadtransactionID_SalesID FOREIGN KEY (salesID)
	REFERENCES KDB.SalesHead(transactionID),
itemnumber INT IDENTITY(1,1) NOT NULL
		CONSTRAINT PK_SalesItemNumID PRIMARY KEY,
sold_product INT NOT NULL,
sold_quantity INT NOT NULL,
	CHECK (sold_quantity >= 1)
);

* Primary key: One or more fields (composite primary key) whose values uniquely identify each record in a table. A primary key cannot allow Null values and must always have a unique index. A primary key is used to relate a table to foreign keys in other tables.

Next steps...

Forecasting with built-in ML module Further posts in  Snowflake  topic SnowFlake universe, part#1 SnowFlake, part#2 SnowPark Notebook Snow...