SnowFlake universe, part#5 ML - forecasting

Forecasting from time series data with Snowflake ML

Finding and loading real-life data 

First of all, Snowflake's Machine Learning (ML) tutorials offer some data for training, but those are 'prepared' datasets, which in itself is not a problem. Still, I consider a test more convincing when it is made on real data. There are plenty of free or paid data sources out there; the question is where to find a small but suitable amount of data.

I found and downloaded an Electric Production dataset from the Kaggle website (Kaggle source data or my source, Electric_Production.csv, 7.32 kB) and uploaded it to Snowflake through the Data / Add data / Load files into a stage option.

Snowflake - Add data / load to stage

Snowflake - Add data - Select file
This works easily with drag & drop, then by creating a new database (I chose EPRODUCT as the database name) or selecting an existing one to load the data into. As a result, the database is ready to use:
Snowflake - New database added
Note 1: I uploaded my own data because (in the trial version) copying/modifying a shared database is not allowed. Therefore all databases loaded from the Snowflake Marketplace are excluded. The Python-based model creation described above can be used to work around this limitation.

Note 2: I found a lot of data sharing websites, for example NYC Open Data or UC Irvine. They offer a large variety of real-life datasets, but those are so huge that Snowflake (in trial mode?) refused to load them due to an upper limit of 200 MB per file (see the file selection window above).

The dataset consists of 2 columns, date (date format) and production data (float), and 398 rows (entries per column).

ML forecast model - automatic model configuration 

Snowflake provides a 'model creation wizard' that helps you through the model configuration and creation steps and creates a worksheet of SQL commands.
First of all, the model name (eProduct_Forecast_model) and metrics can be set, along with the warehouse on which the model should run. To better separate the model's cost from all other processes, it is suggested to create a new warehouse (I have not done so in this test).
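
For reference, a minimal sketch of what creating such a dedicated warehouse could look like, run from a worksheet or a notebook cell via the Snowpark session; the name ML_WH and all settings are my own illustrative assumptions, not part of the wizard:
from snowflake.snowpark.context import get_active_session
session = get_active_session()
# Sketch: a small, auto-suspending warehouse dedicated to the ML workload (all values are assumptions).
session.sql("""
    CREATE WAREHOUSE IF NOT EXISTS ML_WH
      WITH WAREHOUSE_SIZE = 'XSMALL'
           AUTO_SUSPEND = 60            -- suspend after 60 s of inactivity to limit credit use
           AUTO_RESUME = TRUE
           INITIALLY_SUSPENDED = TRUE
""").collect()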
 
The process is simple: choose the right data and settings (see the video or read the list below):
  • Database and Schema
  • time or date(stamp) type column from the dataset to serve as training data for the forecast model
  • additionally, a series identifier and additional features (possibly affecting the values over time) may be selected (which I did not do, as the dataset had only 2 columns)
  • forecast horizon (the amount of time to forecast, in units of the given data, e.g. days) and prediction interval width (default 0.95) may be declared...
and there you are: the model is created in the background along with a SQL worksheet containing all the commands required to regenerate, modify, or run the created model. Running the whole worksheet takes some tens of seconds (using an X-Small warehouse) and results in a view containing all existing data (the loaded, pre-existing data plus the values forecasted with the newly trained model):
 
See more on Quickstart to forecasting (Snowflake).

The SQL commands in the worksheet are responsible for selecting and preparing the data, training the model and running the forecast.

Preparing model creation

A new Snowpark Notebook (eProduct_Forecast) was created to cut the model creation process into simple steps and to demonstrate it:
Right after creation, it shows up under the Projects / Notebooks option in the left menu panel.
Projects / Notebooks - ML Forecast study
Each simple step was created (as usual in Notebooks) in a separate cell (with increasing numbers as identifiers), where the coding language is set to SQL (see the top-left corner of the cell in the image below). The environment (user role, warehouse, database and schema) has to be declared first:
The runtime is indicated in the top-right corner of each cell, which helps to track resource (credit) consumption. A significant amount is consumed when the model is created or run.
The output of a code block may be simple text (like above) or data (presented in table format, usually the first 10 rows).

It is always advised to check the loaded data (quality, format, ...) both in Tables and in Views. Above, the data appears identical and is OK. The format check of the Date column is shown below.

The code below was not run, as the view (ELECTRONIC_PRODUCTION_v1) had already been created earlier, when the complete SQL command list of the wizard-generated worksheet was run.
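
Roughly, such a view definition could look like the following sketch; the source table and column names are my assumptions, not a copy of the wizard-generated worksheet:
from snowflake.snowpark.context import get_active_session
session = get_active_session()
# Sketch: build the training view on top of the loaded table (table/column names are assumptions).
session.sql("""
    CREATE OR REPLACE VIEW ELECTRONIC_PRODUCTION_v1 AS
    SELECT
        TO_TIMESTAMP_NTZ(DATE) AS DATE,       -- the forecast model expects a timestamp column
        IPG2211A2N             AS PRODUCTION  -- the target (float) column, name assumed from the Kaggle CSV
    FROM EPRODUCT.PUBLIC.ELECTRIC_PRODUCTION
""").collect()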

Model creation

The CREATE SNOWFLAKE.ML.FORECAST command invokes the Snowflake machine learning module to create a forecast model based on the data stored in the given view (which is the training dataset for the model). It may take about a minute on an X-SMALL warehouse for this ~400-datapoint dataset. The timestamp and 'target' column names were given in the 'model wizard' (see the video above):
Snowflake ML - forecast model created
An instance (= a model) was created and is ready to be used for forecasting.
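
The wizard-generated SQL behind this step is roughly of the following shape; a hedged sketch based on the Snowflake forecasting quickstart, with the view and column names taken from the assumptions above rather than from the actual worksheet:
from snowflake.snowpark.context import get_active_session
session = get_active_session()
# Sketch of the model-creation call (view and column names are assumptions).
session.sql("""
    CREATE OR REPLACE SNOWFLAKE.ML.FORECAST eProduct_Forecast_model(
        INPUT_DATA        => SYSTEM$REFERENCE('VIEW', 'ELECTRONIC_PRODUCTION_v1'),
        TIMESTAMP_COLNAME => 'DATE',
        TARGET_COLNAME    => 'PRODUCTION'
    )
""").collect()
# List the existing forecast instances to confirm the model exists (cf. the SHOW step below).
session.sql("SHOW SNOWFLAKE.ML.FORECAST").collect()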

Do not be surprised that the created eProduct_Forecast_model is not listed in the Snowflake / AI&ML / Models section on the left panel: to show up there, a model has to be registered, which is another story (not discussed here):
Snowflake / AI&ML / Models section
The created forecast model is not shown among the Snowflake Models (see the SQL verification later in the text).
The existence of the model can be verified by SHOWing the SNOWFLAKE.ML.FORECAST instances:
Running the created model is easy using the CALL (SQL) command:
Snowflake ML - Run forecast model
Above, the forecasted values with their lower and upper bounds are shown (in a table), but if we want to use them for further calculations or just to plot them, the data should be stored in a (temporary or permanent) table of the database. Note that this time SELECT * FROM reads from TABLE(...) and the CALL keyword is not used; unfortunately I ran into an error when I simply copied the whole CALL command above inside the TABLE() function.
Snowflake / Notebook - Create table from query result
Besides the prediction length (set to 14), given in the unit of the loaded dataset (here days), the prediction interval can also be set (for example: CONFIG_OBJECT => {'prediction_interval': 0.95}).
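
A sketch of those two steps together, running the forecast and capturing the result into a table; the RESULT_SCAN pattern follows the Snowflake forecasting quickstart, and the names reuse the assumptions above:
from snowflake.snowpark.context import get_active_session
session = get_active_session()
# Run a 14-step forecast with a 95% prediction interval (sketch).
session.sql("""
    CALL eProduct_Forecast_model!FORECAST(
        FORECASTING_PERIODS => 14,
        CONFIG_OBJECT       => {'prediction_interval': 0.95}
    )
""").collect()
# Persist the result of the previous call into a table for plotting / further use.
session.sql("""
    CREATE OR REPLACE TABLE MY_FORECASTS_EPRODUCT AS
    SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
""").collect()
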
The table MY_FORECASTS_EPRODUCT is created but cannot be visualized as is, because the Snowpark Notebook does not have a built-in chart representation for query tables.
There are several ways to fetch data from a created table (Python Connector, pandas DataFrame.from_records() or read_sql(), Snowflake SQLAlchemy; see this Stack Overflow post for the options) or from the most recently run SQL query result. Here I demonstrate the latter case using the Python get_active_session() method (from the long list of Snowflake APIs).
Snowflake - Python, get_active_session()
From that moment on, with an activated session, almost everything becomes trackable, usable and modifiable in the context of the current session. Now we can extract the data from a table, including both the original (loaded) data and the forecasted values, using a UNION query:
Snowflake - SQL query UNION
Note that this SQL query was run in the 14th cell of the SnowPark Notebook.
Now, fetch the data using the previous SQL query block, cell14 (in this case), and the .to_pandas() method to load it into a pandas DataFrame, which is easy to use and much liked by data scientists.
Snowflake - Python, fetch data from SQL query
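
An equivalent sketch that runs the UNION query directly through the active session and converts the result to pandas; the view/table and column names (PRODUCTION, TS, FORECAST, LOWER_BOUND, UPPER_BOUND) are assumptions:
import pandas as pd
from snowflake.snowpark.context import get_active_session
session = get_active_session()
# Combine the historical data with the forecasted bounds into one result set (names are assumptions).
union_query = """
    SELECT DATE, PRODUCTION, NULL AS LOWER_BOUND, NULL AS UPPER_BOUND
    FROM ELECTRONIC_PRODUCTION_v1
    UNION ALL
    SELECT TS AS DATE, FORECAST AS PRODUCTION, LOWER_BOUND, UPPER_BOUND
    FROM MY_FORECASTS_EPRODUCT
"""
df = session.sql(union_query).to_pandas()   # same idea as cell14.to_pandas() in the notebook
df['DATE'] = pd.to_datetime(df['DATE'])     # ensure the Date column is a proper timestamp for plotting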

Snowflake - verifying data type (Python) to ensure the Timestamp format of the 'Date' column in the dataframe.

Using Altair for visualization was a simple choice, as it is the only built-in visualization tool (in trial?), so I did not have to bother with installing any of the other Snowflake-supported visualization tools (see the list of Python modules) in the Anaconda-based Python environment working in the background.
Altair has a large variety of applications with a very nice look (see the Altair plot examples). The choice made my work a bit harder, as I am used to Matplotlib (see its plot examples), but it seemed to save time by avoiding Python module installation. The syntax is different, despite their common Python basis. Let's move on.
Importing the Altair module is simple (as usual in Python):
Importing Altair module for vizualisation
First, I selected the original data to plot (markers are circles) in a Chart, where the x and y axis data can be selected in the encode() method by giving the name of the column in the pandas dataframe. It is worth noting that plotting does not consume a considerable amount of resources (credits), although of course it depends on the size of the plotted dataset.
The full dataset of real data plus the predicted lower and upper bounds of the 14-day forecast was plotted as a line plot, with three independent lines in defined colours.
To have a better look at the predicted values, the last 30 entries (.tail(30)) were selected for plotting, which includes some of the real data and the 14-day prediction of the lower and upper limits. The model has learned the fluctuation of the data and predicts in a reasonable range, although I found the outcome a bit off.
Snowflake - Forecast on EProduct dataset
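
A sketch of the kind of Altair code that produces such a chart, assuming the combined dataframe df from the fetch step above (the column names are the same assumptions as before):
import altair as alt
plot_df = df.sort_values('DATE').tail(30)        # the last real data points plus the 14 forecasted ones
base   = alt.Chart(plot_df).encode(x='DATE:T')
actual = base.mark_line(color='steelblue').encode(y='PRODUCTION:Q')   # historical + forecasted values
lower  = base.mark_line(color='orange').encode(y='LOWER_BOUND:Q')     # lower prediction bound
upper  = base.mark_line(color='green').encode(y='UPPER_BOUND:Q')      # upper prediction bound
chart = actual + lower + upper                   # layer the three lines in one chart
chart                                            # in a notebook cell the chart object renders itself
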
Fine-tuning the settings or using a longer training dataset may lead to better results.
Finally, Snowflake offers checking the forecast evaluation metrics (see the sketch after the list below):
Snowflake - Forecast metrics

Mean Absolute Error (MAE)
Mean Absolute Percentage Error (MAPE)
Mean Directional Accuracy (MDA)
Mean Squared Error (MSE)
Symmetric Mean Absolute Percentage Error (SMAPE)
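
These metrics can be queried from the trained model itself; a minimal sketch, assuming the model object created above:
from snowflake.snowpark.context import get_active_session
session = get_active_session()
# Sketch: ask the trained model for its evaluation metrics (MAE, MAPE, MDA, MSE, SMAPE, ...).
session.sql("CALL eProduct_Forecast_model!SHOW_EVALUATION_METRICS()").collect()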

SnowFlake universe, part#4 using 3rd party Python modules

Enhancing SnowPark Notebook capabilities

SnowPark, as the online notebook app of the SnowFlake system, can be upgraded in functionality. Not surprisingly, given its Python nature, additional modules can be imported to enhance data handling. There are several modules such as scikit-learn for analysis and AI functionality with built-in visualization ability, or 'simple' data visualization tools such as matplotlib or seaborn (see more).

Streamlit, the app generator 

The Streamlit website and a demo are good starting points if you are, or become, interested in it.

"Streamlit turns data scripts into shareable web apps in minutes.
All in pure Python. No front‑end experience required." (Streamlit website)

Streamlit is an open-source Python library that allows data scientists and developers to collaborate easily and to quickly develop interactive data visualizations and web applications with minimal web (HTML, CSS, PHP/JavaScript) development skills. Streamlit stands out from the market by enabling rapid prototyping and further development through its streamlined and simple development process.

It is a piece of software with increasing impact on the market; the most widely used comparable data visualization and web application development tools are:

  • Dash (Plotly, Python-based)
  • Panel (HoloViz library, also Python-based)
  • Shiny (for R, from RStudio)

Why Streamlit?

Ease of use and fast development, which is "one of Streamlit's biggest advantages because it has an intuitive API, so data analysts and data scientists can quickly build interactive apps with just a few lines of Python code."

Rapid app development is made easier also because the app automatically reloads itself after changes are made to the code, making the development process fast and flexible. That is valid, if you don't mess up a step of the development process.😊

The dozens of built-in visualization and interactive components (e.g. sliders, text boxes, radio buttons) make interaction more satisfying for end users on the app's user side. Individual Streamlit components can be easily embedded into applications, even into a SnowPark Notebook. A typical case is, for example, that setting the minimum and/or maximum values of a slider causes the query from the database to run again with the changed parameter and thus update the (filtered) extracted data from the dataset according to the user's expectations.

Sliders demo made by Streamlit (source: their slider documentation website):

If it does not work, see the image.

My own test of the slider functionality, filtering fetched data by defining a lower price limit (on a tutorial dataset), is shown in the following video. Read the notebook notes (markdown parts) in the video for better understanding.

After importing the Streamlit module, we declare the min_price variable, whose value is received from a slider set by the user.
import streamlit as st
st.markdown("# Move the slider to define lower price limit to filter data")
# col1 = st.columns(1)
# with col1:
min_price = st.slider('Define min_price', 1, 20, 2)  # min 1, max 20, default value 2

Once the interactive slider is in place, a one-line Python expression using the previously defined min_price variable can filter the query result, which here is taken from the part of the dataset where the company is called 'Freezing Point'.

df_menu_freezing_point[df_menu_freezing_point['SALE_PRICE_USD'] \
    > min_price][['TRUCK_BRAND_NAME','MENU_ITEM_NAME','SALE_PRICE_USD']]

A similar, but SQL, query using the previously defined min_price variable as the lower sale price limit, this time querying the whole dataset including all companies:

SELECT truck_brand_name, menu_item_name, sale_price_usd
FROM tasty_bytes_sample_data.raw_pos.menu
WHERE sale_price_usd > {{min_price}}

Streamlit has the added advantage of integrating with popular data visualization libraries such as Matplotlib, Plotly and Altair, and therefore, although not by itself, it offers a wide range of data visualization options through these integrations. Accordingly, its visualization capabilities are basically determined by the integrated Python module.

Streamlit Cloud also offers cloud support, making it easy to share and run prototypes and applications. Find examples in the Streamlit App Gallery. It is amusingly self-referential, but still impressive, that the cheat-sheet website for Streamlit development was made as an app with Streamlit itself.

In terms of the SnowFlake system, Streamlit has the advantage of being able to connect directly to the SnowFlake data warehouse, making it easy to create interactive data visualizations based on data from SnowFlake and take advantage of the backend solutions provided by the SnowFlake system. This can be particularly useful for data analysts and business decision makers as they can query and visualize data from SnowFlake in real time. In simpler cases, this can even replace the use of more expensive BI software (I am not listing any software here, being respectful).

Streamlit compared to other software on the market

Streamlit is not suitable for the development of large and complex (multimodal) web applications, as it has limited scalability and does not support detailed user permissions, nor advanced front-end customization options.

If an application requires multiple pages, complex navigation or detailed user identification, Streamlit is not an ideal choice as it does not support these features well. In this respect, Dash or Shiny may offer more options.

Streamlit applications are ideal for smaller data visualization projects, but if you are working with more complex or larger data sets on the input side, or need to display multiple types of data on the output side, or need to serve multiple users simultaneously, performance can be severely degraded or constrained for the development team and for the end-user as well.

Note that Dash or Panel offers more customisation and performance optimisation options.

Streamlit lacks built-in data manipulation tools and, as with the visualization toolset mentioned above, it relies on integration with various data processing libraries such as pandas and their own data manipulation capabilities. Data processing must be handled by separate modules and the results passed back to the Streamlit application for final visualization. This is not necessarily a real disadvantage, as Python users are used to this mentality, but in the case of Dash and Shiny, the tight integration of Plotly and RStudio as data manipulation layers gives a wider range of built-in data processing capabilities. I would say coding is simpler in the latter case, but it is not an unbearable situation to push and pull data between modules.

Streamlit allows more limited functionality, so applications with complex operations cannot be created with it. The queries themselves (math or code) can be complex, but they cannot, for example, be built on top of each other.

Not sure if it's actually a drawback, but Streamlit is specifically Python-based and therefore not very appealing to R-using statisticians (as far as I know). Dash may be more popular among statisticians and data scientists because it is more versatile for Python and R developers (I have not checked market data on this topic).


Stored Python procedures

(Almost) everyone has heard about stored SQL procedures, but this is about stored Python procedures, which are rarely found on the market. Read more about the topic.
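
As a taste of the idea, a minimal sketch of registering a Python stored procedure through Snowpark; the procedure name ROW_COUNT_SP and the counted view are illustrative assumptions:
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.types import IntegerType, StringType
session = get_active_session()
# Sketch: a tiny temporary Python stored procedure that counts the rows of a table or view.
def row_count(session, table_name: str) -> int:
    return session.table(table_name).count()
session.sproc.register(
    func=row_count,
    name="ROW_COUNT_SP",
    return_type=IntegerType(),
    input_types=[StringType()],
    packages=["snowflake-snowpark-python"],
    replace=True,
)
# Once registered, it can also be called from SQL:
session.sql("CALL ROW_COUNT_SP('ELECTRONIC_PRODUCTION_v1')").collect()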

SnowFlake universe, part#3 Initializing the system

Initializing the SnowFlake system - back to the beginning

Signing up

... is easy (but nowadays that is almost true for every site that does not want to bother its prospective users with time-wasting steps).

"HIPAA, PCI DSS, SOC 1 and SOC 2 Type 2 compliant, and FedRAMP Authorized" says the website's first page when trying to create an account. OK, check it for yourself if you want to know more, I was concentrating on getting into the system. 

You must select a cloud provider and define its location from a dropdown list. I chose Microsoft Azure.


An image captcha check has to be completed for safety reasons. Then some questions should be answered to better serve your needs, but only a minimal amount. Name and email address have to be given in the first step.
 
 
In the next step you are notified that you have succeeded and that an email has been sent to the email address you gave. At the bottom of the left panel, $400 of credit is shown, to be consumed during your 30-day trial period. Well, we are already there!

Some help is provided in the form of links to get you started (which they consider to be relevant):
  • "GETTING STARTED VIDEO" - 8 minutes about the whole system.
  • VIRTUAL HANDS-ON LAB - "Join an instructor-led, virtual hands-on lab to learn how to get started with Snowflake"
  • FULL SNOWFLAKE DOCUMENTATION - the link to the documentation of SnowFlake ecosystem

Time travel - a data-retention management option of up to 90 days to reach historical data; of course it costs money. Not important for studying purposes.

Multi-cluster warehouses - "Snowflake supports allocating, either statically or dynamically, additional clusters to make a larger pool of compute resources available." Of course this costs money, and it is also not important for studying purposes.

Materialized Views - basically a pre-computed dataset derived from a query and stored for later use, therefore providing much faster execution than querying the base table of the view.
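
A minimal sketch of what defining one could look like (it requires an edition that supports materialized views; the view name, table and filter below are illustrative assumptions):
from snowflake.snowpark.context import get_active_session
session = get_active_session()
# Sketch: pre-compute a filtered slice of a base table (names and the filter are assumptions).
session.sql("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS RECENT_PRODUCTION_MV AS
    SELECT DATE, IPG2211A2N AS PRODUCTION
    FROM EPRODUCT.PUBLIC.ELECTRIC_PRODUCTION
    WHERE DATE >= '2010-01-01'
""").collect()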

Setting the environment parameters

In Snowflake, data is stored in databases, while (virtual) warehouses provide the compute engine, which should also be chosen at the beginning of our data processing to ensure an optimal time-to-cost ratio. Snowflake resources can easily be scaled (keep the economic consequences in mind); for very large datasets or for complex queries/calculations the X-SMALL size is not enough!

 
The COMPUTE_WH warehouse (the default name) should be started from its initially suspended state (see the images above). Here you can see that X-SMALL with 1 cluster was enough to deal with the (simple) queries I ran on the relatively large offered datasets, and the results were still presented within a few seconds.
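
The same start (and, if needed, a resize) can also be done in code; a sketch assuming the default COMPUTE_WH, run from a context that already has an active session:
from snowflake.snowpark.context import get_active_session
session = get_active_session()
# Resume the suspended warehouse and, if the workload demands it, scale it up (sketch).
session.sql("ALTER WAREHOUSE COMPUTE_WH RESUME IF SUSPENDED").collect()
session.sql("ALTER WAREHOUSE COMPUTE_WH SET WAREHOUSE_SIZE = 'SMALL'").collect()  # scale back down the same way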

Get some data

If the engine is already running, we now need some data to target for analysis. There are plenty of options to load data into the SnowFlake system. There are connectors to a large variety of data lakes or warehouses such as Amazon S3, Microsoft Azure or Google Cloud Platform,
but other connectors can also be selected in the left menu under the Data / Add data option, from which I chose the SnowFlake Marketplace (3rd row, rightmost one):
I aimed for some financial data to be able to create time series plots. The steps of selecting and loading the dataset:

For the SnowFlake tutorial on loading the "Tasty Bytes" sample data, follow the detailed documentation or one of the SQL or Python tutorials (Projects / Worksheets in the left panel menu):

Finance & Economics from Cybersyn was chosen: "Aggregate financial data for the banking industry. Calculate the total quarterly net income for the banking industry over the last four decades."

As a final step, additional roles of the (current) user may be defined.
 
It sounds appropriate for my study goals and, importantly, it is free (there are dozens of free datasets available).

There are several places in the SnowFlake UI that show you have acquired the required dataset after pushing the Get data button.

 

Clicking on the 3 dots beside the dataset name provides some information and also options to change certain properties:

The newly added database appears in the list of available databases.

Dataset selection is made easier by the fact that SnowFlake provides an insight into the data before selection (loading), regarding both values and data types:

Set the context

The context of the whole warehouse-engine system should be set before running any code. If it is not set, a "Data not found" error is shown (see the middle of the image); in the case below, the warehouse was not running.


The warehouse (named COMPUTE_WH by default) was then switched from the suspended to the started state.
 

From the FINANCE_ECONOMICS dataset, the CYBERSYN schema was selected (not the Information schema):
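
The same context can also be set with a few statements; a sketch where the role and the exact database/schema names are assumptions based on the dataset chosen above:
from snowflake.snowpark.context import get_active_session
session = get_active_session()
# Sketch: set role, warehouse, database and schema for the current session (names are assumptions).
session.sql("USE ROLE ACCOUNTADMIN").collect()
session.sql("USE WAREHOUSE COMPUTE_WH").collect()
session.sql("USE DATABASE FINANCE_ECONOMICS").collect()
session.sql("USE SCHEMA CYBERSYN").collect()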


Now the Python/SQL code can run (in the SnowPark notebook), producing the demanded data. Have fun!









SnowFlake universe, part#2 SnowPark Notebook

 SnowPark Python - Notebook

"The Snowpark API provides an intuitive library for querying and processing data
at scale in Snowflake."
SnowPark supports three coding languages, Java, Python, and Scala, and SQL code can also be injected into the Notebook (note that the SnowPark SQL engine is limited to standard SQL syntax compared to some other SQL engines on the market). The SnowFlake system also provides tools and tutorials on AI and ML practices using their services.

How to use Notebooks is a helpful guide for the first steps among the wide range of knowledge-base material on the SnowFlake Tutorials website.

I found it really helpful that 3rd party Python packages can easily be added to a SnowPark Notebook or Worksheet, so you can integrate a slider for interactive SQL/Python queries (SnowFlake docs, or my slider test).

I chose the so-called Tasty Bytes learning dataset to verify the usability of the website, for testing both SQL and Python in SnowPark.


A database and table were created, and (minimal) resources were selected: COMPUTE_WH is the default engine name, X-Small is the smallest selectable capacity.


Notebook in SQL mode

As usual, cells may be defined to be interpreted as SQL or Python code, or used only for notes or (formatted) text, e.g. markdown cells for teaching purposes.

As a simple first step, the dataset was fully loaded to get to know its content.

First, the data was copied into a database on SnowFlake from a given site (as defined in the tutorial):
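
In the Tasty Bytes tutorial this copy is done with an external stage and COPY INTO; a sketch of that pattern, where the S3 URL is a placeholder rather than the tutorial's real bucket and the target table is assumed to exist already:
from snowflake.snowpark.context import get_active_session
session = get_active_session()
# Sketch: create an external stage pointing at a public bucket, inspect it, then copy into an existing table.
session.sql("""
    CREATE OR REPLACE STAGE tasty_bytes_stage
      URL = 's3://example-bucket/tasty-bytes/'      -- placeholder URL, not the tutorial's real bucket
      FILE_FORMAT = (TYPE = 'CSV')
""").collect()
session.sql("LIST @tasty_bytes_stage").collect()    # list the staged files and their sizes
session.sql("""
    COPY INTO tasty_bytes_sample_data.raw_pos.menu  -- assumes the target table has been created already
    FROM @tasty_bytes_stage
    FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
""").collect()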

At every step, the resources and processing time are indicated on the right side. This helps to calculate the consumed resources and consequently the price of the data handling, but it can also help with code (or data loading) optimization, leading to cheaper operation.
The parameters of the staged dataset (file names, sizes) may be extracted using the LIST command of SQL (as in the sketch above).

There are different built-in data representation options such as tables, plots and line graphs, the usual formats in Notebooks:


Running another SQL query, the resulting data may be represented as a Chart instead of a table:
The X-Y values are automatically recognised by the SnowPark system, but they can be modified in the right-hand side panel. Of course, the chart only shows something with valid settings; otherwise it returns an error pointing out that we made a mistake in the settings.

A heatmap is also an option, however the graphical representation is not the best:
Resizable cells and a selectable recolouring palette option would greatly increase its readability.
See further details on representation below, in the Representation in general section.

Notebook in Python mode

By basic concept, notebooks were originally made for Python-based data wrangling experiments. I have found that SnowPark provides the most requested Python functionalities.

importing the required modules
After some preparatory steps the dataset may be used:
The outcome is a (pandas) DataFrame that can easily be handled for further data processing and represented in the modes mentioned above.
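
A sketch of that typical pattern, using the Tasty Bytes tutorial table (the same one used in the slider example of part#4):
from snowflake.snowpark.context import get_active_session
session = get_active_session()
# Sketch: read the table lazily with Snowpark, then pull a small sample into pandas for local inspection.
menu_df = session.table("tasty_bytes_sample_data.raw_pos.menu")   # lazy Snowpark DataFrame
pdf = menu_df.limit(10).to_pandas()                               # data is only fetched at this point
pdf.head()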

SnowPark also offers function(ality) hints for Python code, which greatly eases the use of the Notebook.


To see more, check the Dashboards of SnowFlake, a simple-by-design but indispensable tool of all large-dataset-handling software or cloud services.

Representation in general

Independently of whether Python or SQL is used to collect and/or filter our data, SnowPark has some robust and reliable methods to represent the final datasets. One advantage was noted already: if the Chart option is selected, the X-Y axis values are automatically recognised and the corresponding values are plotted, and this may be overwritten or modified to match our needs or expectations regarding data representation.

I chose another free dataset, Finance & Economics from Cybersyn (about the initial steps and data selection you may read in part#3).

Another advantage of the SnowPark system is that when query results are shown as a table, you get immediate insight into the overall details of the data on the right-hand side. By clicking on the different column headers we get an instantaneous breakdown of that column's content, and we may also use that bar chart to filter the represented data by defining the left and right edges of the shown range to match our interest.

It is simple to add another dataset to a chart (bar or line colours are defined automatically).


When handling time series data, the buckets may easily be changed from days to weeks, months or year-quarters; and if the time of day is also given, not only the date, then hours, minutes or seconds can also be selected as the base of the bucketing.
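
Under the hood, this kind of re-bucketing corresponds to a DATE_TRUNC aggregation; a sketch where the table and column names are placeholders, not the Cybersyn dataset's real identifiers:
from snowflake.snowpark.context import get_active_session
session = get_active_session()
# Sketch: re-bucket a daily series into monthly sums with DATE_TRUNC (names are placeholders).
monthly = session.sql("""
    SELECT DATE_TRUNC('MONTH', date) AS month, SUM(value) AS monthly_value
    FROM finance_economics.cybersyn.some_timeseries_table
    GROUP BY 1
    ORDER BY 1
""").to_pandas()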

Here I provide two snapshots of the time settings in case the video does not help, with the background changed to white for better visibility of the line chart menu. Data (number of credit card issues of a particular bank) summed up daily:

and after changing to a monthly basis:



Part#1 SnowFlake in general

Part#3 initializing SnowFlake environment

Part#4 using Python 3rd party modules

Part#5 Dashboards

Part#6 AI & ML using SnowFlake ...
