OnlineRetail dataset - MS Power BI, clustering (analysis #2)
In the previous two projects, I analyzed the dataset with Python and SQL, using the built-in visualization of the Notebook offered on the DataCamp website and matplotlib for special purposes.
A basic Power BI analysis of the Online Retail dataset revealed simple facts. An additional Power BI project plotted the frequency of well-defined purchased amounts (1, 2, 3, ... pieces) against the purchased amounts (pieces) on the horizontal axis, with the different countries shown separately (Frequency-Amount-Country plots).
The complete dataset and the BI solution are zipped. The package is 26 MB and requires an installed copy of Microsoft Power BI (ver 2.130+; made with ver 2.131.1203.0).
As a simple first step, Microsoft (MS) Power BI can easily load data from Excel files. The DataCamp project defined simple questions that could be answered in a short time.
Here I demonstrate a clustering visualization that can be created neither by the DataCamp (or any other) online Notebook application, even though those regularly have a built-in data visualization tool, nor by offline analysis in Python using the matplotlib module alone. A more advanced Python module is required for clustered plots.
It is easy to plot 2 datasets ("columns") on the X and Y axes of a 2D graph. The advantage of the Power BI software is that it can add a 3rd layer (not an axis) of additional context that is strongly connected to those datasets, which brings a completely new viewpoint into the picture. As another advantage of Power BI, the plotted data can be clustered into subsets based on the X-Y axis values, and the clusters can be highlighted or specifically selected for visualization. Consequently, a confusingly large dataset can be analyzed easily.
Note: clustering in Power BI can be done in an automated or a semi-automated way. In the latter, the number of requested (provisional) clusters is defined by the user, but the actual data analysis is still done automatically by the software (AI in the background).
This time, the Online Retail sales data was analyzed and clustered based on key metrics: Quantity and Total Price (Quantity*Unit price). The third layer is the Product Code. The resulting clustered plot offers valuable insights into customer purchasing patterns (or sales patterns, depending on the point of view). By grouping similar sales transactions together, clustering tries to reveal distinct customer behaviors or product demand trends. For instance, certain clusters may represent high-volume, low-cost purchases, indicating bulk buying of simple goods, while others might reflect high-cost, low-quantity purchases, typical of premium or otherwise specific products. Visualizing these clusters on a plot makes the patterns easy to recognize, enabling the company to tailor its marketing strategies, optimize inventory management, and enhance customer targeting for better sales performance.
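For readers who want to reproduce the two plotted metrics outside Power BI, here is a minimal Python sketch of how they could be prepared. It assumes each dot on the plot stands for one Product Code with its Quantity and Total Price summed over all transactions, which this post does not state explicitly. The sample rows and the function name `totals_per_product` are invented for illustration and are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical sample rows: (product_code, quantity, unit_price).
# The values are invented for illustration only.
rows = [
    ("A100", 10, 1.50),
    ("A100", 5, 1.50),
    ("B200", 1, 120.00),
    ("C300", 2000, 0.20),
]


def totals_per_product(rows):
    """Sum Quantity and Total Price (Quantity * Unit price) per product code."""
    totals = defaultdict(lambda: [0, 0.0])
    for code, quantity, unit_price in rows:
        totals[code][0] += quantity
        totals[code][1] += quantity * unit_price
    # Round the money value to 2 decimals to hide floating-point noise.
    return {code: (qty, round(total, 2)) for code, (qty, total) in totals.items()}


product_totals = totals_per_product(rows)
```

Each entry of `product_totals` then holds one (Quantity, Total Price) pair that can serve as an X-Y point, with the product code as the third layer.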
Using fully automated clustering in Power BI, the transaction cases were split into two domains:
I found this oversimplified, as visually I could define at least one more domain, so I requested 3 clusters (the data analysis is still run automatically by the AI):
The following three domains were defined:
- low price, low amounts
- low price, high amounts
- high price, low amounts.
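To give an idea of what "requesting 3 clusters" could look like in an adjustable tool, here is a small pure-Python sketch of k-means (Lloyd's algorithm). This is a swapped-in technique for illustration: the post does not say which algorithm Power BI runs in the background, so the result need not match the Power BI clusters. The sample points, the starting centroids and the function name `kmeans` are all invented assumptions.

```python
import math


def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's k-means on 2D points, starting from the given centroids."""
    centroids = list(centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Hypothetical (Quantity, Total Price) points, invented for illustration.
points = [
    (10, 20), (15, 30), (20, 25),          # low price, low amounts
    (900, 200), (1000, 250), (950, 220),   # low price, high amounts
    (12, 900), (8, 1000), (20, 950),       # high price, low amounts
]

# Assumption: one starting centroid is picked by hand from each visible group.
labels, centroids = kmeans(points, [points[0], points[3], points[6]])
```

On real data the two axes have very different scales, so rescaling Quantity and Total Price before clustering would probably be needed; the sketch skips that step to stay short.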
I could go further with 4 clusters ... but a more precise clustering surely requires another tool (Python or another adjustable AI tool).
There are 2-3 dots above the blue lines which I would consider members of the "high price, low amounts" cluster. Whether the light or the dark blue borderline is used is a matter for ... the Management of the company. The position of the orange line is fine for defining the edge between the "low price, low amounts" and "low price, high amounts" domains.
I would split the points above 30k in Quantity into a 4th cluster (see the circled dots on the right), as those are the "low price, extremely high amounts" = bulk products, which create the lowest relative profit but require the most work if the Sales+Logistics processes are not well organized... and may turn out to be loss-making goods. Oh yes, the most extreme outlier in the bottom-right corner is for sure a product to be double-checked from all viewpoints of Sales+Logistics, and probably an item to drop in the future!
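The manual 4th cluster described above can be expressed as a simple rule on top of any existing cluster labels. The sketch below is only an illustration: the 30k threshold comes from the text, while the sample labels, the quantities and the function name `split_bulk` are invented.

```python
BULK_QUANTITY_THRESHOLD = 30_000  # the "above 30k in Quantity" cut from the text


def split_bulk(labels, quantities, bulk_label=3):
    """Move every point above the quantity threshold into a separate cluster."""
    return [
        bulk_label if quantity > BULK_QUANTITY_THRESHOLD else label
        for label, quantity in zip(labels, quantities)
    ]


# Hypothetical cluster labels (0, 1, 2) and quantities, invented for illustration.
old_labels = [0, 1, 1, 2, 1]
quantities = [120, 8_000, 45_000, 15, 80_000]

new_labels = split_bulk(old_labels, quantities)
```

Points at or below the threshold keep their original label, so the three automatic clusters stay untouched apart from the bulk items.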