OnlineRetail - PowerBI analysis #3.2 - clustering

OnlineRetail dataset - MS Power BI, clustering (analysis #2)

Dataset reference: the analyzed dataset, mentioned on the DataCamp website, was downloaded from the site of the Online Retail dataset's owner.

In the previous two projects, I analyzed the dataset with Python and SQL, using the built-in visualization of the Notebook offered on the DataCamp website and also matplotlib for special purposes.
A basic Power BI analysis of the Online Retail dataset revealed simple facts. In an additional Power BI project, the frequency of each purchased amount (1, 2, 3, ... pieces) was plotted against the purchased amount (pieces) on the horizontal axis, with the different countries shown separately (Frequency-Amount-Country plots).

The complete dataset and the BI solution are zipped together. The package is 26 MB in size and requires an installed copy of Microsoft Power BI (ver 2.130+; made with ver 2.131.1203.0).

As a simple first step, Microsoft (MS) Power BI can easily load data from Excel files. The DataCamp project defined simple questions that could be answered in a short time.

Here I demonstrate a clustering visualization that can be created neither by the DataCamp (or any) online Notebook application, even though those usually have a built-in data visualization tool, nor by offline analysis in Python using the matplotlib module alone. A more advanced Python module is required for clustered plots.

It is easy to plot 2 datasets ("columns") on the X and Y axes of a 2D graph. The advantage of Power BI is that it can add a 3rd layer (not an axis) of additional context, strongly connected to the two plotted datasets. In such a case, a completely new viewpoint comes into the picture. As another advantage of Power BI, the plotted data can be clustered into subsets based on the X-Y values, and the clusters can be highlighted or individually selected for visualization. Consequently, a confusingly large dataset can be analyzed easily.
Note: clustering in Power BI can be done in an automated or semi-automated way. The latter means that the number of requested (provisional) clusters may be defined by the user, but the actual data analysis is still done automatically by the software (AI in the background).

This time, the Online Retail sales data was analyzed and clustered on two key metrics: Quantity and Total Price (Quantity * Unit price). The third layer is the Product Code. The resulting clustered plot offers valuable insights into customer purchasing patterns (or sales patterns, depending on the point of view). By grouping similar sales transactions together, clustering tries to reveal distinct customer behaviors or product demand trends. For instance, certain clusters may represent high-volume, low-cost purchases, indicating bulk buying of simple goods, while others might reflect high-cost, low-quantity purchases, typical of premium or otherwise specific products. Visualizing these clusters on a plot makes the patterns easy to recognize, enabling the company to tailor its marketing strategies, optimize inventory management, and improve customer targeting for better sales performance.
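The two plotted metrics can be sketched in plain Python as a rough illustration. This assumes each dot on the plot stands for one Product Code with its summed Quantity and Total Price. The function name and the sample rows below are made up for the example and are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical sample rows: (product code, quantity, unit price).
# Made-up values for illustration only, not real dataset records.
rows = [
    ("A100", 10, 2.0),
    ("A100", 5, 2.0),
    ("B200", 1, 150.0),
    ("C300", 400, 0.5),
]

def aggregate_by_product(rows):
    """Sum Quantity and Total Price (Quantity * Unit price) per product code."""
    totals = defaultdict(lambda: [0, 0.0])
    for code, quantity, unit_price in rows:
        totals[code][0] += quantity
        totals[code][1] += quantity * unit_price
    return {code: (qty, price) for code, (qty, price) in totals.items()}

per_product = aggregate_by_product(rows)
for code, (qty, total) in sorted(per_product.items()):
    print(code, qty, total)
```

Each resulting (Quantity, Total Price) pair would be one point on the X-Y plot, with the Product Code as the third layer.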

Using fully automated clustering in Power BI, the transactions were split into two domains:

2 clusters on Sales data (Quantity & Total price)
I found this oversimplified, as visually I could identify at least one more domain, so I requested 3 clusters (the analysis itself still runs automatically on the data):
3 clusters on Sales data (Quantity & Total price)
The following three domains were defined:
  • low price, low amounts
  • low price, high amounts
  • high price, low amounts.
I could go further with 4 clusters ... but a more precise clustering surely requires another tool (Python or another adjustable AI tool).
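As a hint of what the Python route could look like, here is a minimal sketch of clustering with a chosen number of clusters. It does not say which algorithm Power BI runs in the background; plain k-means is used here only as a familiar stand-in. The function name, the deterministic starting points, and the sample points are all assumptions made for this example.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means on 2D points; returns (labels, centroids)."""
    # Deterministic start: take k points spread evenly over the input order.
    step = max(1, len(points) // k)
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centroids

# Hypothetical (Quantity, Total Price) points, made up for illustration.
points = [
    (5, 10), (8, 15), (6, 12),          # low price, low amounts
    (900, 40), (1000, 50), (950, 45),   # low price, high amounts
    (10, 900), (12, 1000), (9, 950),    # high price, low amounts
]

labels3, centroids3 = kmeans(points, 3)
print(labels3)
```

Calling `kmeans(points, 4)` requests 4 clusters the same way. On real data the two axes probably have very different ranges, so some scaling before clustering would likely be needed.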
Clusters - how I see it
There are 2-3 dots above the blue lines which I would consider members of the "high price, low amounts" cluster. The choice between the light and the dark blue borderline is a matter for ... the Management of the company. The position of the orange line is fine for marking the edge between the "low price, low amounts" and "low price, high amounts" domains.
I would split the points above 30k in Quantity into a 4th cluster (see the circled dots on the right), as those are the "low price, extremely high amounts" items, that is, bulk products. These create the lowest relative profit but require the most work if the Sales+Logistics processes are not well organized, and they may turn out to be loss-making goods. And yes, the most extreme outlier in the bottom-right corner is a product that should definitely be double-checked from every Sales+Logistics viewpoint, and probably an item to drop in the future!
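The manual split described above can also be written down as a simple rule set. In this sketch only the 30k Quantity limit comes from the text. The other two borderline values are hypothetical placeholders, and the borderlines are assumed to be straight and parallel to the axes, which may not match the lines drawn on the plot.

```python
BULK_QUANTITY = 30_000    # from the text: points above 30k in Quantity
HIGH_PRICE = 50_000.0     # hypothetical stand-in for the blue borderline
HIGH_QUANTITY = 5_000     # hypothetical stand-in for the orange borderline

def label_point(quantity, total_price):
    """Assign one of the four domains using fixed, axis-parallel limits."""
    if quantity > BULK_QUANTITY:
        return "low price, extremely high amounts"
    if total_price > HIGH_PRICE:
        return "high price, low amounts"
    if quantity > HIGH_QUANTITY:
        return "low price, high amounts"
    return "low price, low amounts"

# Made-up example points (Quantity, Total Price), for illustration only.
for point in [(40_000, 8_000.0), (300, 80_000.0), (12_000, 9_000.0), (50, 120.0)]:
    print(point, "->", label_point(*point))
```

Moving `HIGH_PRICE` up or down corresponds to choosing between the light and the dark blue borderline.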



