Forecasting from time series data with Snowflake ML
Finding and loading real-life data
First of all, Snowflake's Machine Learning (ML) tutorials offer some data for training, but those are 'prepared' datasets, which in itself is not a problem. Still, I consider a test more convincing if it is made on real data. There are plenty of free and paid data sources out there; the question is where to find a small but suitable amount of data.
I found and downloaded Electric production data from the Kaggle website (Kaggle source data or mysource, Electric_Production.csv, 7.32 kB) and uploaded it to Snowflake through the Data / Add data / Load files into a stage option.
This works easily with drag & drop, then creating a new database (choosing EPRODUCT as the database name) or selecting an existing one to load the data into. As a result the database is ready to use:
Note 1: I uploaded my own data because (in the trial version) copying or modifying a shared database is not allowed, so all databases loaded from the Snowflake Marketplace are excluded. Try the Python-based model creation defined above to avoid this limitation.
Note 2: I have found a lot of data sharing websites, for example NYC Open Data or UC Irvine. They offer a large variety of real-life datasets, but those are so huge that Snowflake (in trial mode?) refused to load them due to the upper limit of 200 MB (see the file selection window above).
The dataset consists of 2 columns, date (date format) and production value (float), with 398 rows.
ML forecast model - automatic model configuration
Snowflake provides a 'model creation wizard' that helps you through the model configuration and creation steps and creates a worksheet of SQL commands.
First of all, the model name (eProduct_Forecast_model) and the metrics can be set, as well as the warehouse on which the model should run. To better separate the model's cost from all other processes, it is suggested to create a new warehouse (I have not done so in this test).
The process is simple, choosing the right data and settings (see the video or read the list):
- Database and Schema
- the time or date(stamp) type column and the target column of the dataset that serve as training data for the forecast model
- additionally, a series identifier and additional features (possibly affecting the values over time) may be selected (which I did not do, as the database had only 2 data columns)
- forecast horizon (amount of time to forecast, in units of the given data, e.g. days) and prediction interval width (default 0.95) may be declared...
and there you are: the model is created in the background, along with a SQL worksheet containing all the commands required to regenerate, modify or run the created model. Running the whole worksheet takes some tens of seconds (using an X-Small warehouse) and results in a view containing all existing data (loaded = pre-existing, plus values forecasted with the newly trained model):
The SQL commands in the worksheet take care of the selection and preparation of the data, the training of the model, and the forecasting itself.
Preparing model creation
A new Snowpark Notebook (eProduct_Forecast) was created to break the model creation process into simple steps and to demonstrate it:

It shows up under the Projects / Notebooks option in the left menu panel right after creation.
Each simple step was created (as usual in notebooks) in a separate cell (with increasing numbers as identifiers), with the coding language set to SQL (see the top-left corner of the cell in the image below). The environment (user role, warehouse, database and schema) has to be declared first:
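For reference, a minimal Snowpark Python sketch of such an environment cell, assuming the EPRODUCT database from above; the role, warehouse and schema names are placeholders:

```python
from snowflake.snowpark.context import get_active_session

# Reuse the session the Snowflake Notebook has already opened
session = get_active_session()

# Set the working context; role, warehouse and schema are placeholders
session.use_role("ACCOUNTADMIN")
session.use_warehouse("COMPUTE_WH")
session.use_database("EPRODUCT")
session.use_schema("PUBLIC")

# Confirm the current context
print(session.get_current_role(),
      session.get_current_warehouse(),
      session.get_current_database(),
      session.get_current_schema())
```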
The runtime is indicated in the top-right corner of each cell, which helps keep track of resource (credit) consumption; a significant amount is consumed when the model is created or run.
The output of a cell may be simple text (like above) or data (presented in table format, usually the first 10 rows).
It is always advisable to check the loaded data (quality, format, ...) both in tables and in views. Above, the data appears identical and looks OK. A format check of the DATE column is shown below.
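A quick, hedged sketch of such a check from Python; ELECTRIC_PRODUCTION is an assumed name for the table created from the uploaded CSV:

```python
# Preview the uploaded data and confirm the column types;
# ELECTRIC_PRODUCTION is an assumed table name
session.sql("SELECT * FROM ELECTRIC_PRODUCTION LIMIT 10").show()
session.sql("DESCRIBE TABLE ELECTRIC_PRODUCTION").show()
session.sql("SELECT COUNT(*) AS ROWS_LOADED FROM ELECTRIC_PRODUCTION").show()
```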
The code below was not run, as the view (ELECTRONIC_PRODUCTION_v1) had already been created earlier, when the complete SQL command list in the wizard-generated worksheet was run.
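For reference, a hedged sketch of what such a view definition might look like, run from Python; the source table name ELECTRIC_PRODUCTION and the value column name VALUE are assumptions:

```python
# Re-create the training view; DATE is cast to TIMESTAMP_NTZ because the
# forecast function expects a timestamp column. Table/column names are assumed.
session.sql("""
    CREATE OR REPLACE VIEW ELECTRONIC_PRODUCTION_v1 AS
    SELECT
        TO_TIMESTAMP_NTZ(DATE) AS DATE,   -- timestamp column for training
        VALUE                             -- electric production figure (float)
    FROM ELECTRIC_PRODUCTION
""").collect()
```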

Model creation
The CREATE SNOWFLAKE.ML.FORECAST command invokes the Snowflake machine learning module to create a forecast model based on the data stored in the given view (which is the training dataset for the model). It may take about a minute on an X-Small warehouse for this ~400-datapoint dataset. The timestamp and 'target' column names were given in the 'model wizard' (see the video above):
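A sketch of that model-creation call issued from Python; the DATE and VALUE column names follow the assumptions of the view sketch above:

```python
# Train the forecast model on the view; this is the most credit-hungry step
session.sql("""
    CREATE OR REPLACE SNOWFLAKE.ML.FORECAST eProduct_Forecast_model(
        INPUT_DATA        => TABLE(ELECTRONIC_PRODUCTION_v1),
        TIMESTAMP_COLNAME => 'DATE',
        TARGET_COLNAME    => 'VALUE'
    )
""").collect()
```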
An instance (= a model) was created and is ready to be used for forecasting.
Do not be surprised that the created eProduct_Forecast_model is not listed in the Snowflake / AI&ML / Models section on the left panel: for the model to show up there it has to be registered, which is another story (not discussed here):
The created forecast model is not shown among the Snowflake Models (see the SQL verification later in the text).
The existence of the model can be confirmed by showing the SNOWFLAKE.ML.FORECAST instances:
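For example, from a Python cell:

```python
# List the forecast-model instances visible to the current role
session.sql("SHOW SNOWFLAKE.ML.FORECAST INSTANCES").show()
```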
Running the created model is easy using the CALL (SQL) command:
Above, the forecasted values with their lower and upper bounds are shown (in a table), but if we want to use them for further calculations, or just to plot them, the data should be stored in a (temporary or permanent) table of the database. Note that this time the SELECT * FROM selects from a TABLE() of results and the CALL command is not used; I unfortunately ran into an error when I simply copied the whole CALL command above into the TABLE() function.
Besides the prediction length (set to 14), given in units of the loaded dataset (here in days), the prediction interval can also be set (for example: CONFIG_OBJECT => {'prediction_interval': 0.95}).
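A hedged sketch of the forecast call and of persisting its output into the table used below; the wizard-generated worksheet may use a slightly different result-scan pattern:

```python
# Forecast 14 periods ahead with a 95% prediction interval
session.sql("""
    CALL eProduct_Forecast_model!FORECAST(
        FORECASTING_PERIODS => 14,
        CONFIG_OBJECT       => {'prediction_interval': 0.95}
    )
""").collect()

# Store the rows returned by the CALL above in a regular table,
# so they can be queried and plotted without re-running the model
session.sql("""
    CREATE OR REPLACE TABLE MY_FORECASTS_EPRODUCT AS
    SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
""").collect()
```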
The table MY_FORECASTS_EPRODUCT is created, but it cannot be visualized as is, because the Snowpark Notebook does not have a built-in chart representation for query tables.
From that moment on, with an active session, almost everything becomes trackable, usable and modifiable in the context of the current session. Now we can extract, with a UNION query, data that contains both the original (loaded) values and the forecasted ones:
Note that this SQL query was run in the 14th cell of the Snowpark Notebook.
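A sketch of such a UNION query run from Python; the ACTUAL alias and the forecast output column names (TS, FORECAST, LOWER_BOUND, UPPER_BOUND) are assumptions based on the usual forecast output:

```python
# Combine the historical rows with the forecasted rows into one result set
combined = session.sql("""
    SELECT DATE AS TS, VALUE AS ACTUAL,
           NULL AS FORECAST, NULL AS LOWER_BOUND, NULL AS UPPER_BOUND
    FROM ELECTRONIC_PRODUCTION_v1
    UNION ALL
    SELECT TS, NULL AS ACTUAL, FORECAST, LOWER_BOUND, UPPER_BOUND
    FROM MY_FORECASTS_EPRODUCT
    ORDER BY TS
""")
combined.show()
```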
Now, fetch the data produced by the previous SQL query cell (cell14 this time) and use the .to_pandas() method to load it into a pandas DataFrame, which is easy to use and very much liked by data scientists.
Snowflake: verifying the data type (Python) to ensure the timestamp format of the 'Date' column in the DataFrame.
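A brief Python sketch of this step; in the notebook the SQL cell's result can be referenced directly as cell14, while here the combined Snowpark DataFrame from the sketch above is reused:

```python
import pandas as pd

# In the notebook: df = cell14.to_pandas(); here we reuse the DataFrame above
df = combined.to_pandas()

# Verify that the timestamp column has a datetime dtype before plotting
print(df.dtypes)
df["TS"] = pd.to_datetime(df["TS"])
```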
Choosing Altair for visualization was a simple decision, as it is the only built-in visualization tool (in the trial?), so I did not have to bother with installing any of the other Snowflake-supported visualization tools (see the list of Python modules) in the Python Anaconda system working in the background. Although Altair offers a large variety of chart types with a very nice look (see the Altair plot examples), the choice made my work harder, since I am used to Matplotlib (see its plot examples); still, it seemed to save the time of installing extra Python modules. The syntax is different, in spite of the common Python basis. Let's move on.
Importing the Altair module is simple (as usual in Python):
First of all, I selected the original data to plot (markers are circles) in a Chart, where the x and y axis data can be selected in the encode() method by giving the name of the column in the pandas DataFrame. It is important to note that plotting does not consume a considerable amount of resources (credits), although this of course depends on the size of the plotted dataset.
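A hedged Altair sketch of this first chart, including the import; the TS and ACTUAL column names follow the aliases assumed in the UNION sketch above:

```python
import altair as alt

# Plot the original (loaded) production data as circle markers
original_chart = (
    alt.Chart(df)
    .mark_circle(size=20)
    .encode(
        x=alt.X("TS:T", title="Date"),
        y=alt.Y("ACTUAL:Q", title="Electric production"),
    )
)
original_chart  # the last expression of a notebook cell renders the chart
```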
The full dataset of real data, together with the predicted lower and upper bounds for the 14-day forecast, was plotted as a line plot with three independent lines of defined colours.
To have a better look at the predicted values, the last 30 entries (.tail(30)) were selected for plotting, which includes some of the real data and the 14-day prediction of the lower and upper limits. The model has learned the fluctuation of the data and predicts in a reasonable range; however, I found the outcome a bit off.
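A sketch of the layered line plots, again using the assumed column aliases: three separately coloured lines for the actual values and the forecast bounds, plus the zoomed-in view of the last 30 rows:

```python
def production_lines(data):
    """Layer three coloured lines: actual values, lower and upper bounds."""
    base = alt.Chart(data).encode(x=alt.X("TS:T", title="Date"))
    return (
        base.mark_line(color="steelblue").encode(y=alt.Y("ACTUAL:Q", title="Production"))
        + base.mark_line(color="orange").encode(y="LOWER_BOUND:Q")
        + base.mark_line(color="green").encode(y="UPPER_BOUND:Q")
    )

full_plot = production_lines(df)           # whole history plus the forecast bounds
zoom_plot = production_lines(df.tail(30))  # last 30 rows: recent real data + 14 forecasted points
zoom_plot
```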
Further tuning of the settings or a longer training dataset may give better results. The quality of the trained model can also be judged from the evaluation metrics Snowflake computes, such as:
- Mean Absolute Error (MAE)
- Mean Absolute Percentage Error (MAPE)
- Mean Directional Accuracy (MDA)
- Mean Squared Error (MSE)
- Symmetric Mean Absolute Percentage Error (SMAPE)
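These can be retrieved for the trained model; a hedged sketch, assuming the model's built-in SHOW_EVALUATION_METRICS method applies here:

```python
# Ask the trained model for its evaluation metrics (MAE, MAPE, MDA, MSE, SMAPE, ...)
session.sql(
    "CALL eProduct_Forecast_model!SHOW_EVALUATION_METRICS()"
).show()
```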