OnlineRetail dataset - MS Power BI, additional analysis #1
Dataset reference: the analyzed dataset, mentioned on the DataCamp website, was downloaded from the site where the owner of the Online Retail dataset shares it.
Basic analysis

In two other projects, I analyzed the dataset with Python and SQL, using the built-in visualization of the Notebook offered on the DataCamp website and matplotlib for special purposes.
The complete dataset and the BI solution are zipped together. The package is 26 MB and requires an installed copy of Microsoft Power BI (version 2.130 or later; made with version 2.131.1203.0).
As a simple first step, Microsoft (MS) Power BI can load data directly from Excel files. The DataCamp project posed simple questions that could be answered in a short time.
Here I demonstrate a visualization that can be created neither with the DataCamp (or any other) online Notebook application, even though those usually have a built-in data visualization tool, nor by offline analysis in Python using the matplotlib module.
The advantage of Power BI is that it can plot two data columns on the X and Y axes of a 2D graph and add a third layer (not an axis) of additional information that is closely connected to those columns. This brings a completely new viewpoint into the picture. Another advantage of Power BI is that the plot can be made dynamic: selecting a value of the third layer highlights or isolates the matching subset of data, so a confusingly large dataset can be analyzed according to our needs.
The plot(s) below show how frequently each purchased quantity occurs, with the purchaser's country as additional information. The horizontal axis (X) shows the purchased quantity (1, 2, 3, ... pieces) and the vertical axis (Y) shows how often that quantity was purchased, while different countries are shown in different colors (set automatically by Power BI). Note that both axes are logarithmic, so the distance from 1 to 100 is the same as the distance from 100 to 10 000 units; for a first reading this is not important.
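For readers who want to see what is being counted, here is a minimal Python sketch of the underlying frequency table. The sample rows and the function name `quantity_frequencies` are made up for illustration; they are not taken from the dataset or from the Power BI solution. The last lines also check the remark about the logarithmic axes.

```python
import math
from collections import Counter

# Hypothetical sample rows (country, purchased quantity); not real dataset values.
rows = [
    ("United Kingdom", 1),
    ("United Kingdom", 1),
    ("United Kingdom", 12),
    ("France", 12),
    ("France", 12),
    ("Germany", 6),
]


def quantity_frequencies(records):
    """Count how often each (country, quantity) pair occurs."""
    return Counter(records)


freq = quantity_frequencies(rows)
for (country, quantity), count in sorted(freq.items()):
    # quantity would go on the X axis, count on the Y axis, country sets the color
    print(country, quantity, count)

# On a log10 axis, 1 -> 100 and 100 -> 10 000 both span two decades.
span_low = math.log10(100) - math.log10(1)
span_high = math.log10(10_000) - math.log10(100)
```

Each counted pair corresponds to one point in the plot, so a country with many invoices contributes many points with high counts.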
Frequency of purchased quantities by Countries (default view)
This is a bit crowded, so it would be great to have a selector with which we can define the subsets we are interested in. Power BI has a Slicer option into which the selection categories can be loaded by simple drag and drop. In this case 30+ countries appear in the slicer, each country in a separate box.
Frequency of purchased quantities - all countries plot, with an additional slicer.
Indeed, the slicer takes up quite a large part of the page, but there is no option to rearrange the boxes or reduce their internal margins. Still, it is a really good way to select, for example, the United Kingdom (UK) related data:
Frequency of purchased quantities - UK
The UK purchases form a well-defined triangle, which seems to have a sharp lower edge. This is because UK customers have at least 100 times more data entries than other countries, so the frequency values cannot be low for small purchased quantities.
Let's see the non-UK purchase quantities:
Frequency of purchased quantities - non-UK countries
Note: based on the plots above, the 3rd DataCamp question, "are non-UK purchased amounts significantly higher than UK purchased amounts?", cannot be answered precisely, although we may get an impression. To test for a significant difference, a statistical t-test should be used (not presented here).
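As a rough illustration of such a test, the sketch below computes Welch's t statistic (the unequal-variance variant, my choice here, not something prescribed by DataCamp) using only the Python standard library. The two samples are invented numbers, not values from the dataset, and `welch_t` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the approximate degrees of freedom."""
    # Squared standard errors of the two sample means
    se2_a = variance(sample_a) / len(sample_a)
    se2_b = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Made-up purchased quantities, for illustration only.
uk_quantities = [1, 2, 2, 3, 4, 6]
non_uk_quantities = [6, 8, 10, 12, 12, 24]

t_stat, dof = welch_t(non_uk_quantities, uk_quantities)
print(t_stat, dof)
```

A positive t statistic means the first sample has the higher mean. Turning it into a p-value requires the t distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)` if SciPy is available. Because the purchased quantities are strongly skewed, it may be worth running the test on log-transformed values as well.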
Countries of our interest can be selected for specific analyses:
Purchases as income and refunds related losses on maps
Country-related total values of purchases or returns expressed in quantities or money may be plotted using Power BI maps.
A heat map would be more useful to express the differences between the countries' contributions to the totals. Keeping in mind that UK entries vastly outnumber non-UK entries in the dataset, relative numbers should be plotted. Map not presented.
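One possible reading of "relative numbers" is each country's share of the grand total. The sketch below assumes that interpretation; the totals are invented and `relative_shares` is a hypothetical helper, not part of the Power BI solution.

```python
def relative_shares(totals):
    """Convert per-country totals into fractions of the grand total."""
    grand_total = sum(totals.values())
    return {country: value / grand_total for country, value in totals.items()}


# Hypothetical per-country totals (quantities or money), not real dataset values.
totals = {"United Kingdom": 900, "France": 60, "Germany": 40}

shares = relative_shares(totals)
for country, share in shares.items():
    print(f"{country}: {share:.1%}")
```

With shares this unbalanced, a logarithmic color scale or a separate non-UK map might still be needed to make the smaller countries distinguishable.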
Additional analysis #2 - Clustering with Power BI