OnlineRetail - PowerBI analysis #3.2 - clustering

OnlineRetail dataset - MS Power BI, clustering (analysis #2)

Dataset reference: the analyzed dataset - referenced on the DataCamp website - was downloaded from the dataset owner's site.

In the previous two projects, I analyzed the dataset with Python and SQL, using the in-built visualization of the Notebook offered on the DataCamp website and matplotlib for special purposes.
Power BI - logo
A basic Power BI analysis of the Online Retail dataset revealed simple facts. In another Power BI project, the frequency of well-defined purchased amounts (1, 2, 3, ... pieces) was plotted against the purchased amounts (pieces) on the horizontal axis, with different countries presented separately (Frequency-Amount-Country plots).

The complete dataset and the BI solution are zipped. The package is 26 MB and requires installed Microsoft Power BI software (ver 2.130+, made with ver 2.131.1203.0).

As a simple first step, Microsoft (MS) Power BI can load data easily from Excel files. The DataCamp project defined simple questions that could be answered easily and quickly.

Here I demonstrate a clustering visualization that can be created neither by the DataCamp (or any other) online Notebook application, even though those regularly offer an in-built data visualization tool, nor by offline analysis in Python using the matplotlib module. A more advanced Python module is required for clustered plots.

It is easy to plot two datasets ("columns") on a 2D graph's X and Y axes; the advantage of Power BI is that it can add a third layer (not an axis) of additional context, strongly connected to the previous datasets. In such a case, a completely new viewpoint comes into the picture. As another advantage of Power BI, the plotted data can be clustered into subsets based on the X-Y values, and the clusters can be highlighted or specifically selected for visualization. Consequently, a confusingly large dataset can be analyzed easily.
Note: clustering in Power BI can be done in an automated or semi-automated way; the latter means that the number of requested (provisional) clusters may be defined, while the actual data analysis is still performed automatically by the software (AI in the background).

This time, the Online Retail sales data was clustered based on key metrics: Quantity and Total Price (Quantity * Unit Price). The third layer is the Product Code. The resulting clustered plot offers valuable insights into customer purchasing patterns (or sales patterns, depending on the point of view). By grouping similar sales transactions together, clustering tries to reveal distinct customer behaviors or product demand trends. For instance, certain clusters may represent high-volume, low-cost purchases, indicating bulk buying of simple goods, while others might reflect high-cost, low-quantity purchases, typical of premium or otherwise specific products. Visualizing these clusters helps in recognizing such patterns easily, enabling the company to tailor its marketing strategies, optimize inventory management, and enhance customer targeting for better sales performance.

Using fully automated clustering in Power BI, the transactions were split into two domains:

2 clusters on Sales data (Quantity & Total price)
I found this oversimplified, as visually I could define at least one more domain, so I requested 3 clusters (the analysis itself still runs automatically on the data):
3 clusters on Sales data (Quantity & Total price)
The following three domains were defined:
  • low price, low amounts
  • low price, high amounts
  • high price, low amounts
I could go further with 4 clusters... but for a more precise clustering, another (Python or other adjustable) tool is surely required.
Clusters - how I see it
There are 2-3 dots above the blue lines that I would consider members of the "high price, low amounts" cluster. The decision between the light and dark blue borderlines is a matter for... the management of the company. The orange line's position is fine to define the edge between the "low price, low amounts" and "low price, high amounts" domains.
I would split the points above 30k in Quantity into a 4th cluster (see the circled dots on the right), as those are "low price, extremely high amounts" = bulk products, which create the lowest relative profit but require the most work if the Sales+Logistics processes are not well organized... and may turn out to be loss-making goods. And yes, the biggest outlier in the bottom-right corner is a product that should surely be double-checked from all Sales+Logistics viewpoints, and probably an item to drop in the future!
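A more adjustable clustering, as suggested above, can be sketched in Python with scikit-learn. The data below is a synthetic stand-in for the (Quantity, Total Price) pairs of the real transactions, mimicking the three domains identified above; as in Power BI's semi-automated mode, the analyst fixes the number of clusters and the algorithm does the rest:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for (Quantity, Total Price) pairs of the transactions:
# three loose groups, mimicking the domains identified in the post.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal([5, 10], [2, 4], size=(200, 2)),       # low price, low amounts
    rng.normal([500, 100], [100, 30], size=(40, 2)),  # low price, high amounts
    rng.normal([3, 800], [1, 150], size=(40, 2)),     # high price, low amounts
])

# Semi-automated clustering: the analyst fixes k, the algorithm assigns labels
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)
print(len(labels), len(np.unique(labels)))  # 280 3
```

With n_clusters=4, the "low price, extremely high amounts" bulk-product group discussed above could be separated in the same way.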




Google track modification - python software

Modify Google Timeline/Maps track

KML editor - GITHUB

We were on summer holiday in the mountains and, as always, Google (my phone) tracked our excursions. More precisely, it tried to track them, but in the middle of high mountains the GPS signal is not always available, which leads to missing or imprecise geo-locations in the Timeline that would be necessary for smooth and realistic path matching.

I decided to optimize some of the paths to better match the real track that we drove or hiked.

As the simplest solution, the software was planned to:

  1. load a KML file downloaded from my Google Maps Timeline (of the selected day)
  2. graphically represent the data on a 2D surface, i.e. plot all LineString points in a GUI window
  3. make the points movable by drag & drop (left-click on a selected point, hold the mouse button while dragging, and release the point at the required position)
  4. in case any point of the track is modified, save the new path
  5. ... as a copy, in the simplest but Google Maps compatible KML format, so as not to overwrite the original data. Google Maps compatibility means that the resulting KML file should work properly when imported as a layer in the Google My Maps service, i.e. the modified track should appear properly on the map.
  6. require no map background, as that would highly complicate the functionality, requiring a Google Maps API key and a continuous internet connection to update the map after each modification. The original concept was to modify the track (files) offline.
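Step 1 above, extracting the LineString points, can be sketched with the standard-library XML parser (the final software uses the fastkml module, whose API differs between versions, so this illustration avoids it; the embedded miniature KML is made up for the example):

```python
import xml.etree.ElementTree as ET

KML_NS = "{http://www.opengis.net/kml/2.2}"

# Miniature stand-in for a Google Timeline export (structure as in the post)
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document><Placemark><LineString>
    <coordinates>17.2012,47.8064,0 17.2100,47.8100,0 17.2200,47.8150,0</coordinates>
  </LineString></Placemark></Document>
</kml>"""

root = ET.fromstring(DOC)
coords_text = root.find(f".//{KML_NS}coordinates").text

# Each whitespace-separated triple is lon,lat,alt; keep lon and lat only
points = [tuple(map(float, triple.split(",")[:2]))
          for triple in coords_text.split()]
print(points)  # [(17.2012, 47.8064), (17.21, 47.81), (17.22, 47.815)]
```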

Simple KML editor

The code was written in Python 3; the graphical user interface and the interactivity were built with the tkinter module, while the KML file was loaded and the data extracted with the fastkml module (after failing with the simplekml module, which is suited only to creating KML files, not to modifying them).
The original path on the map shows imprecise points (off the road), unnecessary points (driving to the shop, the right part from Unterkrimml), and big jumps in space and time due to the missing GPS signal (see the horizontal line from the leftmost point to Unterkrimml).
Original (Google) Map with unprecise track

The same curve was stretched into a fixed-size (800x600 px) software window in such a way as to have a 10-pixel padding around the curve. The window colours were not modified, as prettifying was not planned this time.
Original, unprecise track in KML Editor
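The stretch-into-window mapping described above can be sketched as a small helper; the function name and signature are illustrative, not taken from the actual code:

```python
def to_canvas(lon, lat, bounds, width=800, height=600, pad=10):
    """Map a geo-coordinate into a fixed-size canvas, keeping a pixel padding."""
    min_lon, min_lat, max_lon, max_lat = bounds
    x = pad + (lon - min_lon) / (max_lon - min_lon) * (width - 2 * pad)
    # a tkinter canvas y grows downward, while latitude grows upward
    y = height - pad - (lat - min_lat) / (max_lat - min_lat) * (height - 2 * pad)
    return x, y

b = (17.20, 47.80, 17.22, 47.82)            # min_lon, min_lat, max_lon, max_lat
print(to_canvas(17.20, 47.80, b))           # bottom-left corner -> (10.0, 590.0)
print(to_canvas(17.22, 47.82, b))           # top-right corner -> (790.0, 10.0)
```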

I highlighted in yellow the points that were imprecise or not required, and marked which of them I modified.
Highlighted unprecise track points in KML Editor
Note: colors were modified only in an image editor.


This software does not allow adding or removing points. As a consequence, off-road points can be corrected in position (see the yellow point), but missing points cannot be added to restore cut-off hairpin turns, and unnecessary points cannot be deleted; instead, I grouped them at our starting point (see the blue spots):
Highlighted modified track points in KML Editor

Note: colors were modified only in an image editor.

At first, the KML file of the modified track was not properly compatible with Google Maps: in the second attempt the above points showed up on Google My Maps as individual geo-locations, not as parts of an excursion track. In the final version a <LineString>... </LineString> tag was wrapped around the <coordinates>... </coordinates> element in the KML file:

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
  <Placemark>
    <LineString>
      <coordinates>17.2012735,47.8064775,0 </coordinates>
    </LineString>
  </Placemark>
</Document>
</kml>
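Writing an edited point list back into this Google-compatible structure takes only a few lines; a minimal sketch (the template mirrors the snippet above, the function name is illustrative):

```python
# Minimal writer for a Google My Maps compatible track (template from the post)
KML_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
  <Placemark>
    <LineString>
      <coordinates>{coords}</coordinates>
    </LineString>
  </Placemark>
</Document>
</kml>"""

def track_to_kml(points):
    """points: list of (lon, lat) tuples in track order; altitude fixed at 0."""
    coords = " ".join(f"{lon},{lat},0" for lon, lat in points)
    return KML_TEMPLATE.format(coords=coords)

doc = track_to_kml([(17.2012735, 47.8064775), (17.21, 47.81)])
```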

After the modification, the file was imported into a new layer of a newly created Google My Maps map:
Modified track in Google My Maps

Read more details in the KML editor - GITHUB readme. 

This is a simple, aim-oriented piece of software that could be developed further to overcome the above-mentioned weaknesses:
  • removing points would be a key and easy-to-make option,
  • in contrast, adding track points is a bit trickier, because the points should be inserted at the right place in the point order (LineString) to avoid jumbled lines,
  • highlighting a point by clicking is an easy request, but I consider that it does not add much to the track modification functionality; it rather serves fanciness only,
  • adding real maps (with continuous updates) in the background would be the best improvement, but it has a high coding-time demand; moreover, either the software would need a common Google API key, which for economic reasons Google does not offer, or each user would need their own API key, which I consider an overcomplication of the task and functionality.

OnlineRetail dataset, Analysis#3.1 - Power BI

OnlineRetail dataset - MS Power BI, additional analysis #1

Power BI - logo
Dataset reference: the analyzed dataset - referenced on the DataCamp website - was downloaded from the dataset owner's site.

Basic analysis by Power BI of the Online Retail dataset revealed simple facts.
In another two projects, I analyzed the dataset with Python and SQL, using the in-built visualization of the Notebook offered on the DataCamp website and matplotlib for special purposes.

The complete dataset and the BI solution are zipped. The package is 26 MB and requires installed Microsoft Power BI software (ver 2.130+, made with ver 2.131.1203.0).

As a simple first step, Microsoft (MS) Power BI can load data easily from Excel files. The DataCamp project defined simple questions that could be answered easily and quickly.

Here I demonstrate a visualization that can be created neither by the DataCamp (or any other) online Notebook application, even though those regularly offer an in-built data visualization tool, nor by offline analysis in Python using the matplotlib module.

The advantage of Power BI is that, besides plotting two datasets ("columns") on a 2D graph's X and Y axes, it can add a third layer (not an axis) of additional information, strongly connected to the previous datasets. In such a case, a completely new viewpoint comes into the picture. As another advantage of Power BI, the plot can be made dynamic: by selecting a value of the third layer, a subset of the data can be highlighted or specifically selected for visualization, so a confusingly large dataset can be analyzed easily according to our needs.

The plot(s) below visualize the frequency of purchased product quantities with the additional information of the purchaser's country. The vertical axis (Y) represents the frequency of well-defined purchased amounts (1, 2, 3, ... pieces) versus the purchased amounts (pieces) on the horizontal axis, while different countries are presented with different colors (automatically set by Power BI). Note that both axes are in logarithmic representation, therefore the 1-100 distance is the same as the 100-10,000 distance, but for a first understanding this is not important.

Frequency of purchased quantities by Countries (default)

Frequency of purchased quantities by Countries (default view)

Power BI - Slicer
This is a bit crowded, therefore it would be great to have a selector with which we could define the subsets we are interested in. Power BI has a Slicer option into which the selection categories can be loaded by simple drag & drop. In this case 30+ countries appear in the slicer, each country in a separate box.
Frequency of purchased quantities by Countries (with Slicer)
Frequency of purchased quantities - all countries plot, with an additional slicer.

Indeed, the slicer takes up quite a large area, and there is no option to rearrange the boxes or minimize their internal margins. Still, it is a really good opportunity to select, for example, United Kingdom (UK) related data:

Frequency of purchased quantities - UK
Frequency of purchased quantities - UK

UK purchases form a well-defined triangle, which seems to have a sharp lower edge. This is due to the fact that UK customers have at least 100 times more data entries than other countries, so it is evident that the frequency values cannot take low numbers for low purchased quantities.
Let's see the non-UK purchase quantities:

Frequency of purchased quantities - non-UK
Frequency of purchased quantities - non-UK countries

Note: based on the plots above, DataCamp's 3rd question ("are non-UK purchased amounts significantly higher than UK purchased amounts?") cannot be precisely answered, although we may form an impression. For a significance analysis, a statistical t-test should be used (not presented here).

Countries of our interest can be selected for specific analyses:

Purchases as income and refunds related losses on maps

Country-related total values of purchases or returns, expressed in quantities or money, may be plotted using Power BI maps.

A heat map would be more useful to express the differences in countries' contributions to the total amounts. Keeping in mind that UK entries are vastly overrepresented in the dataset compared to non-UK countries, relative numbers should be plotted. (Map not presented.)

Additional analysis #2 - Clustering with Power BI

OnlineRetail dataset, Analysis#3 - Power BI

Third analysis attempt using MS Power BI on the DataCamp OnlineRetail dataset


Additional Power BI #2 - Clustering with Power BI


In the first two runs, I analyzed the dataset with Python and SQL, using the in-built visualization of the Notebook offered on the DataCamp website and matplotlib for special purposes.

Dataset reference: The original dataset (xlsx file) was downloaded from the Dataset owner's site.

The complete dataset and the BI solution are zipped. The package is 26 MB and requires installed Microsoft Power BI software (ver 2.130+, made with ver 2.131.1203.0).

Microsoft (MS) Power BI can easily load data from Excel (or from multiple files or online databases) and recognizes data types automatically, so I only had to concentrate on visualizing the required information. I created the simplest figures with minimal style adjustments, as making more elaborate or eye-catching visuals would have nearly doubled the time to complete the project.

Before the questions and answers, here is some general information on the data and the dataset:

Dataset general info

First of all, the most important piece of information for Sales & Management in general is: what are the flagship products? Of course they already know, but this is a proof of properly working queries on a well-designed database/dataset. See it in table and in graph format:

Online retail - best sellers

It is important to know the dataset size and the number of countries and users (website user customers); see the numbers on the right:

Evidently, the dataset being UK-centered, it is also beneficial to know the most returned goods in the UK, or even better, the users (identifiers) who have returned the largest amounts of goods, to extract their list and target them either with a discount or with a feedback questionnaire to understand the reasons for the returns... with the aim of caring for them and regaining their trust and loyalty toward our company and products.

In the order of the questions, let's go through the answers.

1) The most returned items



The number of return-related entries in the dataset is around 10,620, sent back by 1,590 customers from 30 countries.
Returns - numbers

The details of the Returns - general dashboard

Most returned goods

Returns - by countries
Not surprisingly, the returned goods with the highest amounts are related to UK customers; this is a direct consequence of the fact that most purchases originate from there as well (data not shown).


In the bottom Treemap, UK-related transactions are excluded to demonstrate the contributions of different countries to the amount of returned goods (the tile graph could be optimized in size).

2) Profits earned in the UK, on different timescales

A simple demand, easy to answer using the in-built [Month], [MonthNo] or [Quarter] parts of the DateTime variable. I used 'Online Retail'[InvoiceDate][MonthNo] to define a new column in the data table.

In the case of weeks, the week number can be determined with the formula
WEEKNUM('Online Retail'[InvoiceDate])
in a new column.

The profit is considered the sum of purchase revenues and return-related refunds, i.e. a 'simple' sum of Quantity * Unit Price over all transactions in the given time period. Remember that returned quantities are recorded as negative integers.
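The same profit logic can be sketched in Python/pandas (column names follow the dataset; the tiny table is made up for illustration), showing how negative return quantities subtract automatically:

```python
import pandas as pd

# Tiny stand-in for the transaction table (column names as in the dataset)
df = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(["2011-01-05", "2011-01-20", "2011-02-03"]),
    "Quantity":   [10, -2, 4],        # negative quantity = return
    "UnitPrice":  [1.5, 1.5, 3.0],
})

# Profit = Quantity * UnitPrice; returns reduce the sum automatically
df["Profit"] = df["Quantity"] * df["UnitPrice"]
monthly = df.groupby(df["InvoiceDate"].dt.month)["Profit"].sum()
weekly = df.groupby(df["InvoiceDate"].dt.isocalendar().week)["Profit"].sum()
print(monthly.to_dict())  # {1: 12.0, 2: 12.0}
```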

Profits at different timescales

3) Are non-UK purchased amounts significantly larger than UK quantities?

First thing to mention: Power BI does not have a built-in (statistical) t-test functionality, so for the moment I provide only simple graphs and the related data that would form the basis of a t-test.
Expert data analyser


Note that a larger mean value does not immediately prove that non-UK customers buy goods in significantly larger amounts. While the plots may suggest so to a professional data scientist, the actual proof can be given by a t-test (not shown here).
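The t-test mentioned above can be sketched with SciPy (version 1.6+ for the one-sided `alternative` option) on synthetic stand-in samples; the real order quantities come from the dataset, and the Poisson shapes below are an assumption for illustration only:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for order quantities (distribution shapes are assumptions)
rng = np.random.default_rng(1)
uk_qty = rng.poisson(8, size=5000)       # UK order quantities
non_uk_qty = rng.poisson(15, size=500)   # non-UK order quantities

# Welch's t-test, one-sided: are non-UK order sizes larger than UK ones?
t_stat, p_value = stats.ttest_ind(non_uk_qty, uk_qty,
                                  equal_var=False, alternative="greater")
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```

A small p-value (conventionally below 0.05) would support the claim of significantly larger non-UK order sizes.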

In addition to the above-mentioned supplementary information, many other facts and conclusions can be extracted from the dataset, but this post is about answering the three DataCamp-defined questions using Power BI.

OnlineRetail dataset, Analysis#2 - SQL

DataCamp - OnlineRetail dataset, Analysis #2 - SQL with Notebook

Previously I analyzed the dataset presented below with Python, using the in-built visualization of the Notebook offered on the DataCamp website and matplotlib for special purposes. I also created another analysis with MS Power BI.

I have elaborated the answers for the so-called Online Retail dataset with SQL queries and zipped all required files (data and explanation as csv files, notebook as ipynb file) on GitHub. If you have no access to any Notebook that is compatible* with the ipynb file, a pdf version is available here: SQL queries (0.1 MB), or the complete Python and SQL solutions without the coding part: Python+SQL (12 MB).

My solution includes additional details and goes beyond the required questions.

The Online retail project (original data source) is presented by DataCamp in steps:
  1. Problem definition of E-Commerce Data 

  2. Data Dictionary - explaining the content, the data types and the meaning of specific values or signs


  3. The "Don't know where to start?" section defines Exploration, Analysis, and Visualisation challenges:
    1. Explore: Negative order quantities indicate returns. Which products have been returned the most?
    2. Visualize: Create a plot visualizing the profits earned from UK customers weekly, monthly.
    3. Analyze: Are order sizes from countries outside the United Kingdom significantly larger than orders from inside the United Kingdom?

With SQL alone, visualization and statistical analysis are not possible, but the data can be extracted for further analysis. The DataCamp website's in-built Notebook functionality can help with visualization, but a t-test is still not doable on a pure SQL basis.

DuckDB's SQL is a bit different from other SQL dialects, so even loading the data required a specific method:
Loading data from csv - Duck DB method

Searching for the most returned items over the whole period covered by the dataset:
Most returned items - SQL query

Preparing data for profit (in UK) visualization at different timescales:

See the Python-based study or the Power BI solution on the same topic.

* Compatible Notebook readers/editors: DataCamp online Notebook (requires registration). Please note that Jupyter Notebook versions of any kind are not automatically compatible with SQL interpreters. There are ways (search online for solutions), but I recommend reading the pdf version provided above.


OnlineRetail dataset, Analysis#1 - Python

DataCamp - Online Retail dataset, Analysis #1 - Python supported by Notebook functionality

I have elaborated the answers for the so-called Online Retail dataset and zipped all required files (data and explanation as csv files, notebook as ipynb file) on GitHub. If you have no access to any Notebook that is compatible* with the ipynb file, a pdf version is available here: Analysis with Python (10 MB).

My solution includes additional details and goes beyond the required questions.

The Online retail dataset (original source) is presented by DataCamp in steps:

  1. Problem definition of E-Commerce Data 

  2. Data Dictionary - explaining the content, the data types and the meaning of specific values or signs


  3. The "Don't know where to start?" section defines Exploration, Analysis, and Visualisation challenges:
    1. Explore: Negative order quantities indicate returns. Which products have been returned the most?
    2. Visualize: Create a plot visualizing the profits earned from UK customers weekly, monthly.
    3. Analyze: Are order sizes from countries outside the United Kingdom significantly larger than orders from inside the United Kingdom?
Here are some extracted figures.
Monthly income and repayment
The monthly income (Quantity * UnitPrice) and repayments (Quantity * UnitPrice). The December-January returns follow the increased October-December purchases. "End of the year" craziness. :)

The two figures below are specific visualizations of "How often is each product returned (frequency), and in what amount?" They help to spot the outliers: the weakest and the best-selling (lowest return rate) products in our portfolio.



The t-test result of the hypothesis "non-UK customers purchase goods in significantly higher amounts than UK customers". Further details can be found in the above-mentioned Notebook or pdf files.


See the SQL-based study or the Power BI solution on the same topic.

* Compatible Notebook readers/editors: DataCamp online Notebook (requires registration) or Anaconda Jupyter Notebook (requires installation) or the online version of Jupyter
