DuckDB
In order to follow these examples, make sure your installation is set up for duckdb.
Install
pip install cuallee
pip install cuallee[duckdb]
The examples illustrated in this section use pandas to create a test data structure, as opposed to using DML in duckdb to create a table or view for the demonstration.
duckdb uses pyarrow to access pandas dataframes in memory, and is therefore capable of interpreting a pandas dataframe as a virtual table from which it can run queries. The name given to the virtual table is the variable name; in the majority of the examples this is df, or alternatively df2.
You can find more information here.
Data Structures
DuckDBPyConnection
duckdb is a rapidly changing framework with a relatively fast release cadence, which makes keeping up to date with its releases challenging. For example, in version 0.10.0 the project introduced a PySpark API that allows users to interact with a DuckDB database through an API very similar to Spark's. However, parity between the APIs is not 100%, and in practice the most reliable API for duckdb is always the SQL interface. For that reason, in cuallee we decided to use that interface to duckdb, as opposed to the one closest to the dataframe.
What this means is that cuallee uses the following data type to interact with the data:
Supported
import duckdb
conn = duckdb.connect(":memory:")
type(conn)
duckdb.duckdb.DuckDBPyConnection
DuckDBPyRelation
cuallee does not support the relational API as of version 0.10.3. Implementing it will depend on the parity of functionality with the native SQL API in DuckDB. Currently, not all the operations supported in the native SQL API of duckdb are available through the relational API, and therefore it has limitations when trying to match their siblings, for instance in PySpark or Pandas.
Not Supported
import duckdb
rel = duckdb.read_parquet("data.parquet")
type(rel)
duckdb.duckdb.DuckDBPyRelation
Checks
is_complete
It validates the completeness attribute of a data set. It confirms that a column does not contain null values.
is_complete
In this example, we validate that the column id
does not have any missing values.
import pandas as pd
import duckdb
from cuallee import Check
conn = duckdb.connect(":memory:")
df = pd.DataFrame({"id" : [1,2,3,4,5]})
check = Check(table_name="df")
check.is_complete("id")
# Validate
check.validate(conn)
output:
id timestamp check level column rule value rows violations pass_rate pass_threshold status
1 2024-05-18 16:22:53 cuallee.check WARNING id is_complete N/A 5 0 1.0 1.0 PASS
In this example, we intentionally place 2 null values in the dataframe, which produces a FAIL check as a result.
import pandas as pd
import duckdb
from cuallee import Check
conn = duckdb.connect(":memory:")
df = pd.DataFrame({"id" : [1,2,3,None, None]})
check = Check(table_name="df")
check.is_complete("id")
# Validate
check.validate(conn)
output:
id timestamp check level column rule value rows violations pass_rate pass_threshold status
1 2024-05-18 16:33:55 cuallee.check WARNING id is_complete N/A 5 2 0.6 1.0 FAIL
In this example, we reuse the data frame with null values from the previous example; however, we set our tolerance via the pct parameter on the rule is_complete to 0.6. This now produces a PASS result on the check, regardless of the 2 null values present.
import pandas as pd
import duckdb
from cuallee import Check
conn = duckdb.connect(":memory:")
df = pd.DataFrame({"id" : [1,2,3,None, None]})
check = Check(table_name="df")
check.is_complete("id", pct=0.6)
# Validate
check.validate(conn)
output:
id timestamp check level column rule value rows violations pass_rate pass_threshold status
1 2024-05-18 16:33:55 cuallee.check WARNING id is_complete N/A 5 2 0.6 0.6 PASS
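The pass/fail decision in the output above can be sketched with plain arithmetic: 5 rows with 2 violations give a pass_rate of 0.6, which meets the 0.6 threshold. (This is a simplified illustration of the threshold comparison, not cuallee's internal implementation.)

```python
# Values taken from the example output above
rows, violations = 5, 2
pass_rate = (rows - violations) / rows  # 3/5 = 0.6
pass_threshold = 0.6  # set via the pct parameter

# The check passes when the observed rate meets the threshold
status = "PASS" if pass_rate >= pass_threshold else "FAIL"
# status == "PASS"
```

Lowering pct below the observed pass_rate is therefore the way to tolerate a known amount of missing data in a column.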