Initializing the SnowFlake system - back to the beginning
Signing up
... is easy (but nowadays that is true for almost every site that does not want to bother its prospective users with time-wasting steps).
"HIPAA, PCI DSS, SOC 1 and SOC 2 Type 2 compliant, and FedRAMP Authorized" says the website's first page when trying to create an account. OK, check it for yourself if you want to know more, I was concentrating on getting into the system.
You must select a cloud provider and define its region from a dropdown list. I chose Microsoft Azure.
An image captcha check must be completed for safety reasons. Then some questions have to be answered so they can better serve your needs, but only a minimal amount: name and email address are asked for in the first step.
In the next step you are notified that you have succeeded and that an email has been sent to the email address you gave. At the bottom of the left panel a $400 credit is shown, to be consumed during your 30-day trial period. Well, we are already there!
Some help is provided in the form of links (which they consider relevant) to get you started:
- "GETTING STARTED VIDEO" - 8 minutes about the whole system.
- VIRTUAL HANDS-ON LAB - "Join an instructor-led, virtual hands-on lab to learn how to get started with Snowflake"
- FULL SNOWFLAKE DOCUMENTATION - the link to the documentation of the SnowFlake ecosystem
Time Travel - a data storage management option that lets you reach historical warehouse data up to 90 days back; of course it costs money. Not important for study purposes.
Multi-cluster warehouses - "Snowflake supports allocating, either statically or dynamically, additional clusters to make a larger pool of compute resources available." Of course this costs money, and it is also not important for study purposes.
Materialized Views - basically a pre-computed data set derived from a query and stored for later use, which therefore provides much faster execution than querying the base table of the view (a small example follows below).
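To make that last point concrete, here is a minimal sketch of a materialized view in SnowFlake SQL. The table and column names are made up for the illustration, and as far as I know the feature itself requires the Enterprise edition:

-- Hypothetical names: bank_income_raw, report_date and net_income are placeholders.
CREATE MATERIALIZED VIEW daily_income_mv AS
    SELECT report_date, SUM(net_income) AS total_income
    FROM bank_income_raw
    GROUP BY report_date;

-- Queries against the view are served from the pre-computed result:
SELECT * FROM daily_income_mv WHERE report_date >= '2020-01-01';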
Setting the environment parameters
Despite the name, in SnowFlake a warehouse is actually the compute engine (the data itself sits in databases), and its size should be defined at the beginning of our data processing to ensure an optimal time-to-cost ratio. As SnowFlake resources can easily be scaled (keep in mind the economic consequences), for very large datasets or for complex queries/calculations the X-SMALL size is not enough!
The COMPUTE_WH warehouse (its default name) should be started from the initially suspended state (see the images above). Here you can see that X-SMALL size with 1 cluster was set to deal with the (simple) queries I ran on the offered, relatively large datasets, and the results were still returned within a few seconds.
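The same can be done from a worksheet instead of the UI; a minimal sketch, assuming the warehouse still carries its default name:

-- Resume the suspended default warehouse and keep it small and cheap.
ALTER WAREHOUSE COMPUTE_WH RESUME IF SUSPENDED;
ALTER WAREHOUSE COMPUTE_WH SET WAREHOUSE_SIZE = 'XSMALL';
-- Optional: let it suspend itself after 60 seconds of inactivity to save credits.
ALTER WAREHOUSE COMPUTE_WH SET AUTO_SUSPEND = 60;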
Get some data
Now that the engine is running, we need some data to analyze. There are plenty of options to load data into the SnowFlake system. There are connectors to a large variety of data lakes and warehouses like Amazon S3, Microsoft Azure or Google Cloud Platform,
but other connectors can also be selected in the left menu under the Data / Add data option, from which I chose SnowFlake Marketplace (3rd row, rightmost one):

I aimed for some financial data, to be able to create time series plots. The steps of selecting and loading the dataset:
(There is also a SnowFlake tutorial on loading the "Tasty Bytes" sample data, which can be followed via the detailed documentation or one of the SQL or Python tutorials under Projects / Worksheets in the left panel menu.)
Finance & Economics from Cybersyn was chosen: "Aggregate financial data for the banking industry. Calculate the total quarterly net income for the banking industry over the last four decades."
It sounds appropriate for my study goals and, importantly, it is free (there are dozens of free datasets available). As a final step, additional roles (beyond the current one) that may access the dataset can be defined.
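The same kind of access can also be granted later from SQL; a minimal sketch, assuming the shared database shows up under the FINANCE_ECONOMICS name and that a role called ANALYST already exists in the account:

-- Give another role read access to the shared Marketplace database.
-- FINANCE_ECONOMICS is the name as it appeared in my account; ANALYST is just an example role.
GRANT IMPORTED PRIVILEGES ON DATABASE FINANCE_ECONOMICS TO ROLE ANALYST;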
There are several places in the SnowFlake system where it shows that you have obtained the required dataset after pushing the Get data button.
Clicking on the 3 dots beside the dataset name provides some information and also options to change some of its properties:
The newly added database appears in the list of available databases.
Dataset selection is made easier by the fact that SnowFlake provides an insight into the data before selection (loading), with regard to both values and data types:
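A similar quick check can also be done from a worksheet once the share is added; a minimal sketch, where SOME_TABLE is only a placeholder for any table listed under the new database:

-- List the databases and peek into one table of the new share.
SHOW DATABASES;
DESCRIBE TABLE FINANCE_ECONOMICS.CYBERSYN.SOME_TABLE;          -- column names and types
SELECT * FROM FINANCE_ECONOMICS.CYBERSYN.SOME_TABLE LIMIT 10;  -- first few rows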
Set the context
The context of the whole warehouse-engine system should be set before running any code. If it is not set, a "Data not found" error is indicated (see the middle of the image); in the case below the warehouse was not running.
The warehouse (named COMPUTE_WH by default) was then set from the suspended to the started state.
From the FINANCE_ECONOMICS database the CYBERSYN schema was selected (not the Information schema):
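In a worksheet the same context can be set explicitly; a minimal sketch using the names from my trial account:

-- Set the compute, database and schema context for the session.
USE WAREHOUSE COMPUTE_WH;
USE DATABASE FINANCE_ECONOMICS;
USE SCHEMA CYBERSYN;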
Now the Python/SQL code can be run (in a Snowpark notebook), producing the desired data; a sketch of such a query is shown below. Have fun!
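As a closing illustration, here is a hedged sketch of the kind of time series aggregation the dataset description promises (quarterly net income of the banking industry). The table and column names are placeholders only; the real ones have to be looked up in the CYBERSYN schema:

-- Hypothetical query: bank_financials, report_date and net_income are placeholders,
-- not the real Cybersyn object names.
SELECT DATE_TRUNC('QUARTER', report_date) AS quarter,
       SUM(net_income)                    AS total_net_income
FROM bank_financials
WHERE report_date >= DATEADD('YEAR', -40, CURRENT_DATE)
GROUP BY quarter
ORDER BY quarter;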