SnowFlake universe, part#2 SnowPark Notebook

 SnowPark Python - Notebook

"The Snowpark API provides an intuitive library for querying and processing data
at scale in Snowflake."
SnowPark supports three coding languages: Java, Python, and Scala, and SQL code can also be run in the Notebook (note that the SnowPark SQL engine is limited to standard SQL syntax compared to some other SQL engines on the market). The SnowFlake system also provides tools and tutorials on AI and ML practices using their services.

How to use Notebooks is a helpful guide for the first steps, among the wide-ranging knowledge base available on the SnowFlake Tutorials website.

I found it really helpful that 3rd-party Python packages may easily be added to a SnowPark Notebook or Worksheet, so you can, for example, integrate a slider for interactive SQL/Python queries (see the SnowFlake docs or my slider test).
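As a minimal sketch of such a slider in a Notebook cell, using Streamlit (which SnowFlake Notebooks support) together with the active Snowpark session; the table and column names here are my own placeholders, not from the tutorial:

    import streamlit as st
    from snowflake.snowpark.context import get_active_session

    session = get_active_session()  # the Notebook's already-authenticated session

    # The slider value drives the SQL filter below
    min_total = st.slider("Minimum order total", 0, 500, 100)

    df = session.sql(
        f"SELECT * FROM orders WHERE order_total >= {min_total} LIMIT 100"
    ).to_pandas()
    st.dataframe(df)  # the table re-renders whenever the slider is moved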

I chose the so-called Tasty Bytes learning dataset to verify the usability of the site, testing both SQL and Python in SnowPark.


A database and a table were created, and (minimal) resources were selected: COMPUTE_WH is the default warehouse name, and X-Small is the smallest selectable capacity.
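Purely as a sketch of what this setup does (the SQL below is my assumption of the equivalent statements, not copied from the tutorial):

    from snowflake.snowpark.context import get_active_session

    session = get_active_session()
    # Minimal warehouse and a database, as selected in the UI
    session.sql("CREATE WAREHOUSE IF NOT EXISTS COMPUTE_WH WAREHOUSE_SIZE = 'X-SMALL'").collect()
    session.sql("CREATE DATABASE IF NOT EXISTS tasty_bytes").collect()
    session.sql("USE SCHEMA tasty_bytes.public").collect()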


Notebook in SQL mode

As usual, cells may be defined to be interpreted as SQL or Python code, or used as Markdown cells for notes or (formatted) text, e.g. for teaching purposes.

As a simple first step, the dataset was fully loaded to get to know its content.

First, data was copied to a database on SnowFlake from a given site (as defined in the tutorial):
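The load itself is a COPY INTO from a stage pointing at the tutorial's source location; a sketch with placeholder stage, table, and path names (the tutorial defines the real ones):

    from snowflake.snowpark.context import get_active_session

    session = get_active_session()
    session.sql("""
        COPY INTO raw_menu
        FROM @tasty_bytes_stage/raw_pos/menu/
        FILE_FORMAT = (TYPE = 'CSV')
    """).collect()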

At every step, the resources used and the processing time are indicated on the right side. This helps to calculate the consumed resources, and consequently the price of the data handling, but it can also help with code (or data-loading) optimization, leading to cheaper operation.
Dataset parameters may be extracted using SQL's LIST command.
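For instance, listing the staged files returns their names, sizes, and checksums (stage path as in the sketch above, hypothetical):

    from snowflake.snowpark.context import get_active_session

    session = get_active_session()
    files = session.sql("LIST @tasty_bytes_stage/raw_pos/menu/").collect()
    for f in files:
        print(f["name"], f["size"])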

There are various built-in data-representation options, such as tables, plots, and line graphs, the usual formats in Notebooks:


Running another SQL query, the resulting data may be represented as a chart instead of a table:
The X and Y values are automatically recognised by the SnowPark system, but they can be modified in the right-hand panel. Of course, the chart only renders with valid settings; otherwise it returns an error pointing out that we made a mistake in the settings.

A heatmap is also an option; however, the graphical representation is not the best:
Resizable cells and a selectable colour palette would greatly increase its readability.
See further details on representation below, in the Representation in general section.

Notebook in Python mode

As a basic concept, Notebooks were originally made for Python-based data wrangling. I have found that SnowPark offers the most-requested Python functionality.

Importing the required modules:
After some preparatory steps, the dataset may be used:
The outcome is a (Pandas) DataFrame that can easily be handled in further data-processing steps and represented in the above-mentioned modes.
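A minimal sketch of that workflow (the table and column names loosely follow the Tasty Bytes tutorial, but treat them as assumptions):

    from snowflake.snowpark.context import get_active_session
    import snowflake.snowpark.functions as F

    session = get_active_session()

    menu = session.table("raw_menu")         # lazy, server-side DataFrame
    counts = (menu.group_by("menu_type")     # aggregation runs in SnowFlake, not locally
                  .count()
                  .sort(F.col("count").desc()))

    pdf = counts.to_pandas()                 # materialise as a Pandas DataFrame
    pdf.head()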

SnowPark also offers function hints (autocomplete) for Python code, which enormously eases the use of the Notebook.


To see more, check out SnowFlake Dashboards, a tool simple in design but indispensable for any software or cloud service that handles large datasets.

Representation in general

Regardless of whether Python or SQL is used to collect and/or filter our data, SnowPark has some robust and reliable methods for representing the final datasets. One advantage was noted already: if the Chart option is selected, the X and Y axis values are automatically recognised and the corresponding values plotted; these may be overridden or modified to match our needs or expectations regarding data representation.

I chose another free dataset: Finance & Economics from Cybersyn (about the initial steps and data selection, you may read more in part#3).

Another advantage of the SnowPark system is that when query results are displayed as a table, you get immediate insight into the overall details of the data on the right-hand side. By clicking on a column header, we get an instantaneous breakdown of that column's content, and we may also use that bar chart to filter the displayed data, defining the left and right edges of the shown range to match our interest.

It is simple to add another dataset to a chart (bar or line colours are defined automatically).


When handling time-series data, the ranges (bucketing) may easily be changed from days to weeks, months, or year-quarters; and if the data includes a time of day, not only the date, then hours, minutes, or seconds can also be selected as the basis of bucketing.
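Under the hood, this bucketing corresponds to a DATE_TRUNC aggregation; a sketch of the monthly case, with hypothetical table and column names:

    from snowflake.snowpark.context import get_active_session

    session = get_active_session()
    monthly = session.sql("""
        SELECT DATE_TRUNC('MONTH', issue_date) AS month,
               SUM(cards_issued)               AS cards_issued
        FROM card_issuance
        GROUP BY 1
        ORDER BY 1
    """).to_pandas()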

Here I provide two snapshots of the time settings, in case the video does not help, with the background changed to white for better visibility of the line-chart menu. Data (the number of credit cards issued by a particular bank) with daily sums:

and after changing to a monthly basis:



Part#1 SnowFlake in general

Part#3 initializing SnowFlake environment

Part#4 using Python 3rd party modules

Part#5 Dashboards

Part#6 AI & ML using SnowFlake ...

