Introduction
Welcome to cuallee, and thanks for using this framework. None of this work would have been possible without the inspiration of pydeequ. So, thanks to the AWS folks for putting the research work and references together, so that we could build on the shoulders of giants.
This pure-Python implementation of unit tests for your data will help you define validations using the 3 concepts described below.
Entities
To better understand cuallee, you will need to get familiar with the following 3 concepts: Check, Rule and ComputeInstruction.
Entity | Description
---|---
Check | Use it to define a group of validations on a dataframe and report them as WARNING or ERROR. You can chain as many rules into a check as you like; internally, cuallee will make sure the same rule is not executed twice.
Rule | A rule represents the predicate you want to test on one or more columns of a dataframe. A rule has 4 attributes: method, the name of the predicate; column, the column in the dataframe; value, the value to compare against; and coverage, the percentage of positive predicates required to set the status of the check to PASS.
ComputeInstruction | The implementation-specific representation of the predicate in a rule. Because cuallee is a dataframe-agnostic data quality framework, the implementation of the rules relies on the creation of compute instructions passed to the specific dataframe of choice, including the following dataframe options: pandas, pyspark and snowpark.
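The relationship between a check and its rules can be illustrated with a minimal, hypothetical sketch. The class and attribute names below mirror the concepts above but are simplified stand-ins, not cuallee's internal API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A predicate to test on a dataframe column (simplified stand-in)."""
    method: str            # name of the predicate, e.g. "is_complete"
    column: str            # column in the dataframe
    value: object = None   # value to compare against, if any
    coverage: float = 1.0  # fraction of positive predicates required to PASS


class Check:
    """Groups rules for a dataframe (simplified stand-in)."""
    def __init__(self, level: str, name: str):
        self.level, self.name = level, name
        self._rules = set()  # a set ensures the same rule is not executed twice

    def is_complete(self, column: str) -> "Check":
        self._rules.add(Rule("is_complete", column))
        return self  # returning self allows chaining rules fluently


# Chaining the same rule twice collapses into a single rule
check = Check("WARNING", "demo").is_complete("id").is_complete("id")
print(len(check._rules))  # → 1
```

The frozen dataclass makes each Rule hashable, so storing rules in a set is what deduplicates repeated predicates.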
In principle, the only interface you need to be familiar with is the Check, as it is through this object that you chain your validations and then, via the validate method, execute them on any DataFrame.
Process Flow
```mermaid
graph LR
    U((start)) --> A;
    A[Check] -.-> B(is_complete);
    A[Check] -.-> C(is_between);
    A[Check] -.-> D(is_on_weekday);
    B --> E{all rules?};
    C --> E{all rules?};
    D --> E{all rules?};
    E --> F[/read dataframe/];
    A -.-> G{want results?};
    F --> G;
    G --> H(validate);
    H --> I([get results]);
    I --> K((end));
```
Installation
cuallee is designed to work primarily with pyspark==3.3.0, which is its only dependency. It uses the Observation API features in pyspark to reduce the computation time for aggregations, calculating summaries in one pass over the data frames being validated.
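The single-pass idea behind this can be sketched in plain Python: instead of scanning the data once per metric, every summary is accumulated during one traversal. This is a conceptual illustration of the approach, not pyspark or cuallee code:

```python
# Toy data standing in for a dataframe column being validated
rows = [{"fare": 12.5}, {"fare": None}, {"fare": 300.0}]

# One pass over the data accumulates all metrics at once,
# rather than re-reading the data once per rule
count = nulls = positives = 0
for row in rows:
    count += 1
    if row["fare"] is None:
        nulls += 1
    elif row["fare"] > 0:
        positives += 1

print(count, nulls, positives)  # → 3 1 2
```

With many rules on a large dataframe, collapsing the aggregations into a single scan is what keeps validation cost close to one read of the data.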
pip

```shell
# Latest
pip install cuallee
```
Check
Validating data sets is about creating a Check and adding rules to it.
You can choose from different rule types: numeric, date algebra, range of values, temporal, and many others.
A Check provides a declarative interface to build a comprehensive validation on a dataframe, as shown below:
```python
# Imports
from cuallee import Check, CheckLevel
from cuallee import dataframe as D

# Check
check = Check(CheckLevel.WARNING, "TaxiNYCheck")

# Data
df = spark.read.parquet("temp/taxi/*.parquet")

# Adding rules
# =============
# All fields are filled
[check.is_complete(name) for name in df.columns]

# Verify taxi ride distance is positive
[check.is_greater_than(name, 0) for name in D.numeric_fields(df)]

# Confirm that tips are not outliers
[check.is_less_than(name, 1e4) for name in D.numeric_fields(df)]

# 70% of data is on weekdays
[check.is_on_weekday(name, .7) for name in D.timestamp_fields(df)]

# Binary classification fields
[check.has_entropy(name, 1.0, 0.5) for name in D.numeric_fields(df)]

# Percentage of big tips
[check.is_between(name, (1000, 2000)) for name in D.numeric_fields(df)]

# Confirm 22 years of data
[check.is_between(name, ("2000-01-01", "2022-12-31")) for name in D.timestamp_fields(df)]

# Validation
check.validate(df)
```