SnowFlake universe, part#5 ML - forecasting

Forecasting from time series data with Snowflake ML

Finding and loading real-life data 

First of all, Snowflake's Machine Learning (ML) tutorials offer some data for training, but those are 'prepared' datasets, which in itself is not a problem. Still, I consider a test more convincing if it is made on real data. There are plenty of free or paid data sources out there; the question is where to find a small but suitable amount of data.

I found and downloaded electric production data from the Kaggle website (Kaggle source data or mysource, Electric_Production.csv, 7.32 kB) and uploaded it to Snowflake through the Data / Add data / Load files into a stage option.

Snowflake - Add data / load to stage

Snowflake - Add data - Select file
This works easily with drag & drop, then creating a new database (choosing EPRODUCT as the database name) or selecting an existing one to load the data into. As a result the database is ready to use:
Snowflake - New database added
Note 1: I uploaded my own data because (in the trial version) copying/modifying a shared database is not allowed, so all databases loaded from the Snowflake Marketplace are excluded. Try the Python-based model creation described earlier to avoid this limitation.

Note 2: I have found a lot of data-sharing websites, for example NYC Open Data or UC Irvine. They offer a large variety of real-life datasets, but those are so huge that Snowflake (in trial mode?) refused to load them due to an upper limit of 200 MB (see the file selection window above).

The dataset consists of 2 columns, a date (date format) and the production value (float), with 398 rows (entries per column).

ML forecast model - automatic model configuration 

Snowflake provides a 'model creation wizard' that helps you through the model configuration and creation steps and creates a worksheet of SQL commands.
First of all, the model name (eProduct_Forecast_model) and the metrics can be set, as well as the warehouse on which the model should run. To better separate the model's cost from all other processes, creating a new warehouse is suggested (I have not done so in this testing situation).
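If you prefer to do this in code rather than in the UI, a dedicated warehouse can be created from a Snowpark Python cell roughly as below (a minimal sketch; the warehouse name FORECAST_WH and the sizing/suspend parameters are my own assumptions, not values from the wizard):

    from snowflake.snowpark.context import get_active_session

    session = get_active_session()

    # Hypothetical dedicated warehouse, so the forecasting credits show up separately
    session.sql("""
        CREATE WAREHOUSE IF NOT EXISTS FORECAST_WH
          WITH WAREHOUSE_SIZE = 'XSMALL'
               AUTO_SUSPEND = 60          -- suspend after 60 s of inactivity
               AUTO_RESUME = TRUE
               INITIALLY_SUSPENDED = TRUE
    """).collect()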
 
The process is simple, choosing the right data and settings (see the video or read the list):
  • Database and Schema
  • a time or date(stamp) type column from the dataset to serve as the timestamp of the training data for the forecast model
  • optionally, a series identifier and additional features (possibly affecting the values over time) may be selected (which I did not do, as the database had only 2 data columns)
  • the forecast horizon (amount of time to forecast, in units of the given data, e.g. days) and the prediction interval width (default 0.95) may be declared...
and there you are: the model is created in the background along with a SQL worksheet containing all the commands required to regenerate, modify, or run the created model. Running the whole worksheet takes some tens of seconds (using an X-Small warehouse) and results in a View containing all existing data (loaded = pre-existing, plus the values forecasted with the newly trained model):
 
See more on Quickstart to forecasting (Snowflake).

The SQL commands in the worksheet handle the selection and preparation of the data, the training of the model, and the forecasting itself.

Preparing model creation

A new Snowpark Notebook (eProduct_Forecast) was created to break the process into simple steps and to demonstrate model creation:
It shows up under the Projects / Notebooks option on the left menu panel right after creation.
Projects / Notebooks - ML Forecast study
Each simple step was created (as usual in Notebooks) in a separate cell (with increasing numbers as identifiers), where the coding language is set to SQL (see the top-left corner of the cell in the image below). The environment (user role, warehouse, database and schema) has to be declared first:
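For reference, the same environment declaration can also be written as a Snowpark Python cell (a sketch only; the role and warehouse names below are placeholders, EPRODUCT is the database created earlier and PUBLIC is an assumed schema name):

    from snowflake.snowpark.context import get_active_session

    session = get_active_session()

    # Placeholder names -- substitute your own role and warehouse
    session.use_role("ACCOUNTADMIN")
    session.use_warehouse("COMPUTE_WH")
    session.use_database("EPRODUCT")
    session.use_schema("PUBLIC")

    print(session.get_current_database(), session.get_current_schema())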
The runtime is indicated in the top-right corner of each cell, which helps keep track of resource (credit) consumption. A significant amount is consumed when the model is created or run.
The output of a code block may be simple text (like above) or data (presented in table format, usually the first 10 lines).

It is always advisable to check the loaded data (quality, format, ...) both in Tables and in Views. Above, the data appears identical and OK. A format check of the Date column is shown below.
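Such a check might look roughly like the sketch below (the fully qualified table name EPRODUCT.PUBLIC.ELECTRIC_PRODUCTION and the column name DATE are assumptions; the actual names depend on what the load wizard created):

    from snowflake.snowpark.context import get_active_session

    session = get_active_session()

    # Column names and data types of the uploaded table
    session.sql("DESCRIBE TABLE EPRODUCT.PUBLIC.ELECTRIC_PRODUCTION").show()

    # Row count and date range -- should report roughly 398 rows
    session.sql("""
        SELECT COUNT(*) AS n_rows,
               MIN(DATE) AS first_date,
               MAX(DATE) AS last_date
        FROM EPRODUCT.PUBLIC.ELECTRIC_PRODUCTION
    """).show()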

The code below was not run, as the View (ELECTRONIC_PRODUCTION_v1) had already been created earlier, when the complete command list of the wizard-created worksheet was executed.
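For completeness, a view of this kind could be (re)created with a statement of the following shape (a sketch; only the view name ELECTRONIC_PRODUCTION_v1 comes from the worksheet, while the source table and the column names DATE and VALUE are assumed):

    from snowflake.snowpark.context import get_active_session

    session = get_active_session()

    # Training view: one timestamp column and one target column
    session.sql("""
        CREATE OR REPLACE VIEW ELECTRONIC_PRODUCTION_v1 AS
        SELECT TO_TIMESTAMP_NTZ(DATE) AS DATE,
               VALUE
        FROM EPRODUCT.PUBLIC.ELECTRIC_PRODUCTION
    """).collect()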

Model creation

The CREATE SNOWFLAKE.ML.FORECAST command invokes the Snowflake machine learning module to create a forecast model based on the data stored in the given view (which is the training dataset for the model). It may take a minute on an X-Small warehouse for this ~400-datapoint dataset. The timestamp and 'target' column names were given in the 'model wizard' (see the video above):
Snowflake ML - forecast model created
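Expressed in code, the statement follows the documented CREATE SNOWFLAKE.ML.FORECAST pattern (a sketch wrapped in Snowpark Python; the column names 'DATE' and 'VALUE' are assumptions, while the model and view names come from the worksheet):

    from snowflake.snowpark.context import get_active_session

    session = get_active_session()

    # Train a forecast model on the view prepared above
    session.sql("""
        CREATE OR REPLACE SNOWFLAKE.ML.FORECAST eProduct_Forecast_model(
            INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'ELECTRONIC_PRODUCTION_v1'),
            TIMESTAMP_COLNAME => 'DATE',
            TARGET_COLNAME => 'VALUE'
        )
    """).collect()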
An instance (= a model) was created and is ready to be used for forecasting.

Do not be surprised that the created eProduct_Forecast_model is not listed in the Snowflake / AI&ML / Models section on the left panel: showing up there requires model registration, which is another story (not discussed here):
Snowflake / AI&ML / Models section
The created forecast model is not shown among the Snowflake Models (see the SQL verification later in the text).
The existence of the model can instead be verified by SHOWing the SNOWFLAKE.ML.FORECAST instances:
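In code this is a one-liner (a sketch; the SHOW SNOWFLAKE.ML.FORECAST command lists the forecast model instances visible in the current schema):

    from snowflake.snowpark.context import get_active_session

    session = get_active_session()

    # Lists existing forecast model instances, including eProduct_Forecast_model
    session.sql("SHOW SNOWFLAKE.ML.FORECAST").show()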
Running the created model is easy using the SQL CALL command:
Snowflake ML - Run forecast model
Above, the forecasted values with their lower and upper bounds are shown (in a table), but if we want to use them for further calculations or simply plot them, the data should be stored in a (temporary or permanent) table of the database. Note that this time SELECT * FROM selects from a TABLE(...) expression, and the CALL command is not used inside it! Unfortunately I ran into this error when I simply copied the whole CALL command above into the TABLE() function.
Snowflake / Notebook - Create table from query result
Besides the prediction length (set to 14), given in the time unit of the loaded dataset (here days), the prediction interval can also be set (for example: CONFIG_OBJECT => {'prediction_interval': 0.95}).
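Putting the pieces together, the forecast call and the table creation could look like the sketch below (RESULT_SCAN(LAST_QUERY_ID()) is the usual way to capture the output of the previous CALL, which is why CALL itself does not appear inside TABLE()):

    from snowflake.snowpark.context import get_active_session

    session = get_active_session()

    # 14-step forecast with a 95% prediction interval
    session.sql("""
        CALL eProduct_Forecast_model!FORECAST(
            FORECASTING_PERIODS => 14,
            CONFIG_OBJECT => {'prediction_interval': 0.95}
        )
    """).collect()

    # Persist the output of the CALL just above; the two statements must run
    # back to back in the same session for LAST_QUERY_ID() to pick up the CALL
    session.sql("""
        CREATE OR REPLACE TABLE MY_FORECASTS_EPRODUCT AS
        SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
    """).collect()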
The table MY_FORECASTS_EPRODUCT is created but cannot be visualised as is, because the Snowpark notebook does not have a built-in chart representation for query tables.
There are several ways to fetch data from a created table (Python Connector; pandas DataFrame.from_records() or read_sql(); Snowflake SQLAlchemy; source of options: a StackOverflow post) or from the results of a recently run SQL query; here I demonstrate the latter case using the Python get_active_session() method (from the long list of Snowflake APIs).
Snowflake - Python, get_active_session()
From that moment on, with the session activated, almost everything becomes trackable, usable and modifiable in the context of the current session. Now we can extract the data from a table, including both the original (loaded) data and the forecasted values, using a UNION query:
Snowflake - SQL query UNION
Note that this SQL query was run in the 14th cell of the SnowPark Notebook.
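The query could be shaped roughly as below (a sketch, wrapped here in Snowpark Python although in the notebook it ran as a plain SQL cell; the column names VALUE, TS, FORECAST, LOWER_BOUND and UPPER_BOUND are assumptions about the view and the forecast output):

    from snowflake.snowpark.context import get_active_session

    session = get_active_session()

    # Original data and forecast bounds stacked into one result set
    union_df = session.sql("""
        SELECT DATE, VALUE AS actual,
               NULL AS forecast, NULL AS lower_bound, NULL AS upper_bound
        FROM ELECTRONIC_PRODUCTION_v1
        UNION ALL
        SELECT TS, NULL, FORECAST, LOWER_BOUND, UPPER_BOUND
        FROM MY_FORECASTS_EPRODUCT
        ORDER BY DATE
    """)
    union_df.show()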
Now, fetch the data using the previous SQL query block (cell14 this time) and the .to_pandas() method to load it into a pandas dataframe variable, which is easy to use and much liked by data scientists.
Snowflake - Python, fetch data from SQL query

Snowflake - verifying data type (python)
Snowflake - verifying data type (python) to ensure Timestamp format of 'Date' column in dataframe.
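In a Python cell this step might look roughly like this (a sketch; cell14 is the name of the SQL cell holding the UNION query, and DATE is the assumed name of the date column):

    import pandas as pd

    # cell14 is the SQL cell with the UNION query; Snowflake notebooks expose
    # the result of a named cell as a Python object in later cells
    df = cell14.to_pandas()

    # Make sure the date column is a real timestamp, not a string
    df["DATE"] = pd.to_datetime(df["DATE"])
    print(df.dtypes)
    df.head()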

Choosing Altair for visualisation was a simple decision, as it is the only built-in visualisation tool (in trial?), so I did not have to bother with installing (in the Python Anaconda system working in the background) any of the other Snowflake-supported visualisation tools (see the list of Python modules).
Although Altair has a large variety of applications with very nice-looking output (see Altair plot examples), that choice made my work harder, as I am used to Matplotlib (see plot examples), but it seemed to save time by avoiding Python module installation. The syntax is different, despite the common Python basis. Let's move on.
Importing the Altair module is simple (as usual in Python):
Importing Altair module for vizualisation
First of all, I selected the original data to plot (with circle markers) in a Chart, where the x and y axis data can be selected in the encode() method by giving the name of the corresponding column of the pandas dataframe. It is important to note that plotting does not consume a considerable amount of resources (credits), though of course this depends on the size of the plotted dataset.
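A chart of that kind might be built as follows (a sketch; df is the dataframe fetched above, and the column names DATE and ACTUAL match the aliases assumed in the UNION query):

    import altair as alt

    # df = cell14.to_pandas()  (fetched in the previous cell)

    # Original production values as circle markers
    chart = alt.Chart(df).mark_circle(size=20).encode(
        x="DATE:T",       # temporal axis
        y="ACTUAL:Q",     # quantitative axis
    ).properties(width=700, height=300, title="Electric production - original data")

    chart  # the notebook renders the Altair chart as the cell output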
The full dataset of real data and the predicted lower and upper bounds of the 14-day forecast were then plotted as a line plot, with three independent lines of defined colours.
To have a better look at the predicted values, the last 30 entries (.tail(30)) were selected for plotting, which includes some of the real data and the 14-day prediction of the lower and upper limits. The model has learned the fluctuation of the data and predicts within a reasonable range; however, I found the outcome a bit off.
Snowflake - Forecast on EProduct dataset
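One possible way to build the layered line chart of the last 30 entries (a sketch; the colours and column names are my own choices, matching the aliases assumed earlier):

    import altair as alt

    last30 = df.tail(30)   # a few real datapoints plus the 14 forecasted ones

    base = alt.Chart(last30).encode(x="DATE:T")

    lines = alt.layer(
        base.mark_line(color="steelblue").encode(y="ACTUAL:Q"),
        base.mark_line(color="green").encode(y="LOWER_BOUND:Q"),
        base.mark_line(color="red").encode(y="UPPER_BOUND:Q"),
    ).properties(width=700, height=300, title="Last 30 entries with the 14-step forecast bounds")

    lines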
More precise settings or a longer training dataset may lead to better results.
Finally, Snowflake offers forecast evaluation metrics to check (a query sketch follows the list below):
Snowflake - Forecast metrics

  • Mean Absolute Error (MAE)
  • Mean Absolute Percentage Error (MAPE)
  • Mean Directional Accuracy (MDA)
  • Mean Squared Error (MSE)
  • Symmetric Mean Absolute Percentage Error (SMAPE)
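The metrics can be queried from the model itself (a sketch; SHOW_EVALUATION_METRICS is the documented method of forecast model instances):

    from snowflake.snowpark.context import get_active_session

    session = get_active_session()

    # Evaluation metrics (MAE, MAPE, MDA, MSE, SMAPE, ...) computed at training time
    session.sql("CALL eProduct_Forecast_model!SHOW_EVALUATION_METRICS()").show()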
