DuckDB
To follow these examples, make sure your installation is set up for duckdb.
Install
pip install cuallee
pip install cuallee[duckdb]
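Note that on shells that expand square brackets (such as zsh), the extras specifier needs quoting:

```shell
pip install "cuallee[duckdb]"
```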
The examples illustrated in this section use pandas to create a test data structure, as opposed to using DML in duckdb to create a table or view for the demonstration.
duckdb uses pyarrow to access pandas dataframes in memory, and is therefore capable of interpreting a pandas dataframe as a virtual table against which it can run queries.
The name given to the virtual table is the variable name; in the majority of the examples this is df or, alternatively, df2.
You can find more information here.
Data Structures
DuckDBPyConnection
duckdb is a rapidly changing framework with a relatively fast release cadence, which makes keeping up to date with its releases challenging. For example, in version 0.10.0 the project introduced a PySpark API that allows users to interact with a DuckDB database through an interface very similar to Spark's. However, parity between the APIs is not 100%, and in practice the most reliable API for duckdb is always the SQL interface. For that reason, in cuallee we decided to use that interface to duckdb, as opposed to the one closest to the dataframe.
What this means is that cuallee uses the following data type to interact with the data:
Supported
import duckdb
conn = duckdb.connect(":memory:")
type(conn)
duckdb.duckdb.DuckDBPyConnection
DuckDBPyRelation
cuallee does not support the relational API as of version 0.10.3. Implementing it will depend on its parity of functionality with the native SQL API in DuckDB. Currently, not all operations supported in duckdb's native SQL API are available through the relational API, and it therefore has limitations when trying to match its siblings, for instance PySpark or Pandas.
Not Supported
import duckdb
rel = duckdb.read_parquet("data.parquet")
type(rel)
duckdb.duckdb.DuckDBPyRelation
Checks
is_complete
It validates the completeness attribute of a data set. It confirms that a column does not contain null values.
In this example, we validate that the column id does not have any missing values.
import pandas as pd
import duckdb
from cuallee import Check
conn = duckdb.connect(":memory:")
df = pd.DataFrame({"id": [1, 2, 3, 4, 5]})
check = Check(table_name="df")
check.is_complete("id")
# Validate
check.validate(conn)
output:
id timestamp check level column rule value rows violations pass_rate pass_threshold status
1 2024-05-18 16:22:53 cuallee.check WARNING id is_complete N/A 5 0 1.0 1.0 PASS
In this example, we intentionally place 2 null values in the dataframe, which produces a FAIL result for the check.
import pandas as pd
import duckdb
from cuallee import Check
conn = duckdb.connect(":memory:")
df = pd.DataFrame({"id": [1, 2, 3, None, None]})
check = Check(table_name="df")
check.is_complete("id")
# Validate
check.validate(conn)
output:
id timestamp check level column rule value rows violations pass_rate pass_threshold status
1 2024-05-18 16:33:55 cuallee.check WARNING id is_complete N/A 5 2 0.6 1.0 FAIL
In this example, we reuse the data frame with empty values from the previous example; however, we set our tolerance to 0.6 via the pct parameter on the is_complete rule. This now produces a PASS result on the check, despite the 2 null values present.
import pandas as pd
import duckdb
from cuallee import Check
conn = duckdb.connect(":memory:")
df = pd.DataFrame({"id": [1, 2, 3, None, None]})
check = Check(table_name="df")
check.is_complete("id", pct=0.6)
# Validate
check.validate(conn)
output:
id timestamp check level column rule value rows violations pass_rate pass_threshold status
1 2024-05-18 16:33:55 cuallee.check WARNING id is_complete N/A 5 2 0.6 0.6 PASS