SnowFlake universe, part#4 - using 3rd party Python modules

Enhancing SnowPark Notebook capabilities

SnowPark, as the online notebook app of the SnowFlake system, can be upgraded in functionality. Not surprisingly, given its Python nature, additional modules can be imported to enhance data handling. There are several such modules: scikit-learn for analysis and AI functionality with visualization ability included, or 'simple' data visualization tools such as matplotlib or seaborn (see more).
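As a quick, hedged illustration (the toy data below is invented for this post, not taken from any SnowFlake dataset), once such packages are made available in the notebook environment they are imported and used exactly as in any other Python environment:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Toy data: a noisy linear trend
X = np.arange(10).reshape(-1, 1)
y = 2.5 * X.ravel() + np.random.default_rng(0).normal(0, 1, 10)

# Fit a simple scikit-learn model and visualize it with matplotlib
model = LinearRegression().fit(X, y)
fig, ax = plt.subplots()
ax.scatter(X, y, label='data')
ax.plot(X, model.predict(X), label='fitted line')
ax.legend()
fig  # as the last expression of a notebook cell, the figure is shown as the cell output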

Streamlit, the app generator 

The Streamlit website and a demo are good starting points if you are, or become, interested in it.

"Streamlit turns data scripts into shareable web apps in minutes.
All in pure Python. No front‑end experience required." (Streamlit website)

Streamlit is an open-source Python library that allows data scientists and developers to collaborate easily and to quickly develop interactive data visualizations and web applications with minimal web development (HTML, CSS, PHP/JavaScript) skills. Streamlit stands out from the market by enabling rapid prototyping and further development through its streamlined and simple development process.

It is a piece of software with an increasing impact on the market among similar applications. For comparison, here are the most widely used data visualization and web application development tools:

  • Dash (Plotly, Python-based)
  • Panel (HoloViz library, also Python-based)
  • Shiny (for R, from RStudio)

Why Streamlit?

Ease of use and fast development: this is "one of Streamlit's biggest advantages because it has an intuitive API, so data analysts and data scientists can quickly build interactive apps with just a few lines of Python code."

Rapid app development is also made easier because the app automatically reloads itself after changes are made to the code, making the development process fast and flexible. That holds, of course, as long as you don't mess up a step of the development process.😊
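To give a feel for what "a few lines of Python code" means in practice, here is a minimal, self-contained sketch of a Streamlit app (my own example; the file name app.py and the widget labels are placeholders, not from the quoted material):

import streamlit as st

st.title('Minimal Streamlit demo')
name = st.text_input('Your name', 'world')   # interactive text box
st.write(f'Hello, {name}!')                  # re-runs on every input change

Saving this as app.py and running streamlit run app.py serves the app locally; every change to the script or to a widget value re-executes the script from top to bottom, which is what makes the edit-and-reload loop so quick.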

The dozens of built-in visualization and interactive components (e.g. sliders, text boxes, radio buttons) make interaction more satisfying for end users on the app's user side. Individual Streamlit components can be easily embedded into applications, even into a SnowPark Notebook. A typical case: setting the minimum and/or maximum value of a slider causes the database query to run again with the changed parameter, updating the (filtered) data extracted from the dataset according to the user's expectations.

Sliders demo made by Streamlit (source: their slider documentation website):

If the embedded demo does not work, see the image instead.

My own test of the slider functionality, filtering fetched data by defining a lower price limit (on a tutorial dataset), is shown in the following video. Read the notebook notes (markdown parts) in the video for better understanding.

After importing the Streamlit module we declare a min_price variable whose value is received from a slider (range) set by the user.
import streamlit as st
st.markdown("# Move the slider to define lower price limit to filter data")
# Optional column layout, left commented out here:
# col1 = st.columns(1)
# with col1:
# Slider labelled 'Define min_price': minimum 1, maximum 20, default value 2
min_price = st.slider('Define min_price', 1, 20, 2)

After the interactive slider is set, a one-line Python expression using the previously defined min_price variable filters the query result, which here is taken from the restricted part of the whole dataset where the company is called 'Freezing point'.

# Keep only the rows above min_price and show three columns of interest
df_menu_freezing_point[df_menu_freezing_point['SALE_PRICE_USD'] \
    > min_price][['TRUCK_BRAND_NAME','MENU_ITEM_NAME','SALE_PRICE_USD']]
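For context, here is a hedged sketch of how df_menu_freezing_point could have been produced in the notebook before this filtering step. It assumes a Snowflake notebook where get_active_session() returns the already open Snowpark session, and it uses the tasty_bytes_sample_data tutorial table quoted in the SQL below; the exact brand spelling depends on the dataset.

from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.functions import col

session = get_active_session()  # reuse the notebook's existing Snowpark session

# Restrict the menu table to the 'Freezing Point' brand and pull it into pandas
df_menu_freezing_point = (
    session.table('tasty_bytes_sample_data.raw_pos.menu')
           .filter(col('TRUCK_BRAND_NAME') == 'Freezing Point')
           .to_pandas()
)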

A similar query, but in SQL, uses the previously defined min_price variable as the lower sale price limit, this time querying the whole dataset including all companies:

SELECT truck_brand_name, menu_item_name, sale_price_usd
FROM tasty_bytes_sample_data.raw_pos.menu
WHERE sale_price_usd > {{min_price}}
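The same parameterised query can also be issued from Python through the Snowpark session. This is a hedged sketch, assuming the session object from the notebook sketch above and simple string formatting of the numeric slider value (a production app would rather use bind parameters):

# Build the SQL text with the slider value and run it through Snowpark
query = f"""
    SELECT truck_brand_name, menu_item_name, sale_price_usd
    FROM tasty_bytes_sample_data.raw_pos.menu
    WHERE sale_price_usd > {min_price}
"""
df_all_brands = session.sql(query).to_pandas()
df_all_brands  # the notebook renders the filtered result as the cell output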

Streamlit has the added advantage of integrating with popular data visualization libraries such as Matplotlib, Plotly and Altair; therefore, although not on its own, it offers a wide range of data visualization options through these integrations. Accordingly, its visualization capabilities are largely determined by the integrated Python module.
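A hedged sketch of such an integration, reusing the filtered pandas DataFrame and the min_price slider from the example above: Matplotlib draws the chart and Streamlit only displays the figure.

import matplotlib.pyplot as plt
import streamlit as st

# Plot the filtered prices with Matplotlib, then hand the figure to Streamlit
filtered = df_menu_freezing_point[df_menu_freezing_point['SALE_PRICE_USD'] > min_price]
fig, ax = plt.subplots()
ax.bar(filtered['MENU_ITEM_NAME'], filtered['SALE_PRICE_USD'])
ax.set_ylabel('Sale price (USD)')
ax.tick_params(axis='x', rotation=90)  # long menu item names
st.pyplot(fig)  # st.plotly_chart and st.altair_chart work the same way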

Streamlit Cloud also offers hosted deployment, making it easy to share and run prototypes and applications. Find examples in the Streamlit App Gallery. Amusingly self-referential, but still impressive: there is a cheat-sheet website for Streamlit development, built as an app by Streamlit itself.

In terms of the SnowFlake system, Streamlit has the advantage of being able to connect directly to the SnowFlake data warehouse, making it easy to create interactive data visualizations based on SnowFlake data and to take advantage of the backend solutions provided by the SnowFlake system. This can be particularly useful for data analysts and business decision makers, as they can query and visualize data from SnowFlake in real time. In simpler cases, this can even replace the use of more expensive BI software (I am not listing any software here, being respectful).
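Outside a notebook, a standalone Streamlit app can open that direct connection itself. This is a hedged sketch using Streamlit's built-in Snowflake connection (available in recent Streamlit versions, with credentials configured in .streamlit/secrets.toml), again against the tutorial table used earlier:

import streamlit as st

conn = st.connection('snowflake')  # reads the [connections.snowflake] secrets block

min_price = st.slider('Define min_price', 1, 20, 2)
df = conn.query(
    'SELECT truck_brand_name, menu_item_name, sale_price_usd '
    'FROM tasty_bytes_sample_data.raw_pos.menu '
    f'WHERE sale_price_usd > {min_price}',
    ttl=600,  # cache the query result for 10 minutes
)
st.dataframe(df)  # interactive, sortable table in the app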

Streamlit compared to other software on the market

Streamlit is not suitable for the development of large and complex (multimodal) web applications, as it has limited scalability and does not support detailed user permissions or advanced front-end customization options.

If an application requires multiple pages, complex navigation or detailed user identification, Streamlit is not an ideal choice as it does not support these features well. In this respect, Dash or Shiny may offer more options.

Streamlit applications are ideal for smaller data visualization projects, but if you are working with more complex or larger data sets on the input side, or need to display multiple types of data on the output side, or need to serve multiple users simultaneously, performance can be severely degraded or constrained for the development team and for the end-user as well.

Note that Dash and Panel offer more customisation and performance optimisation options.

Streamlit lacks built-in data manipulation tools; like the data visualization toolset mentioned above, it relies on integration with various data processing libraries, such as Pandas, and on their own built-in data manipulation tools. Data processing must be handled by separate modules and the results returned to the Streamlit application for final visualization. This is not necessarily a real disadvantage, as Python users are used to this mentality, but in the case of Dash and Shiny, the full integration of Plotly and RStudio as data manipulation software gives a wider range of built-in data processing capabilities. I would say it makes coding simpler in the latter case, but it is not an unbearable situation to push and pull data between modules.

Streamlit offers more limited functionality, so applications with complex operations cannot be created with it. The queries themselves (maths or code) can be complex, but they cannot, for example, be built on top of each other.

Not sure if it's actually a drawback, but Streamlit is specifically Python-based and is therefore not that appealing to R-using statisticians (as far as I know). Dash may be more popular among statisticians and data scientists because it's more versatile for Python and R developers (I have not checked market data regarding this topic).


Stored Python procedures

(Almost) everyone has heard about stored SQL procedures, but this is about stored Python procedures, which are rarely found on the market. Read more about the topic.
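As a hedged sketch of what a stored Python procedure can look like when registered from Snowpark (assuming the session object from the Snowpark sketch above; the procedure name, the stage and the trivial row-counting body are my own placeholders, not anything from the linked article):

from snowflake.snowpark import Session

def row_count(session: Session, table_name: str) -> str:
    # Runs inside Snowflake: count the rows of the given table
    return f'{table_name} has {session.table(table_name).count()} rows'

# Register the function as a permanent stored procedure
session.sproc.register(
    func=row_count,
    name='row_count_proc',
    packages=['snowflake-snowpark-python'],
    is_permanent=True,
    stage_location='@my_stage',   # placeholder stage for the uploaded code
    replace=True,
)

# Call it like any other stored procedure
session.call('row_count_proc', 'tasty_bytes_sample_data.raw_pos.menu')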
