Catalogue
The following tables list all checks available in cuallee.
Checks
Check | Description | DataType |
---|---|---|
is_complete | Zero nulls | agnostic |
is_empty | All nulls | agnostic |
is_unique | Zero duplicates | agnostic |
is_primary_key | Zero duplicates | agnostic |
are_complete | Zero nulls on group of columns | agnostic |
are_unique | Composite primary key check | agnostic |
is_composite_key | Zero duplicates on multiple columns | agnostic |
is_greater_than | col > x | numeric |
is_positive | col > 0 | numeric |
is_negative | col < 0 | numeric |
is_greater_or_equal_than | col >= x | numeric |
is_less_than | col < x | numeric |
is_less_or_equal_than | col <= x | numeric |
is_equal_than | col == x | numeric |
is_contained_in | col in [a, b, c, ...] | agnostic |
is_in | Alias of is_contained_in | agnostic |
not_contained_in | col not in [a, b, c, ...] | agnostic |
not_in | Alias of not_contained_in | agnostic |
is_between | a <= col <= b | numeric, date |
has_pattern | Matching a pattern defined as a regex | string |
is_legit | String not null & not empty ^\S$ | string |
has_min | min(col) == x | numeric |
has_max | max(col) == x | numeric |
has_std | σ(col) == x | numeric |
has_mean | μ(col) == x | numeric |
has_sum | Σ(col) == x | numeric |
has_percentile | %(col) == x | numeric |
has_cardinality | count(distinct(col)) == x | agnostic |
has_infogain | count(distinct(col)) > 1 | agnostic |
has_max_by | A utility predicate for max(col_a) == x for max(col_b) | agnostic |
has_min_by | A utility predicate for min(col_a) == x for min(col_b) | agnostic |
has_correlation | Finds correlation between 0..1 on corr(col_a, col_b) | numeric |
has_entropy | Calculates the entropy of a column, entropy(col) == x, for classification problems | numeric |
is_inside_interquartile_range | Verifies that column values lie within the interquartile range, Q1 <= col <= Q3; used for anomaly detection | numeric |
is_in_millions | col >= 1e6 | numeric |
is_in_billions | col >= 1e9 | numeric |
is_t_minus_1 | For date fields, confirms the date is 1 day ago (t-1) | date |
is_t_minus_2 | For date fields, confirms the date is 2 days ago (t-2) | date |
is_t_minus_3 | For date fields, confirms the date is 3 days ago (t-3) | date |
is_t_minus_n | For date fields, confirms the date is n days ago (t-n) | date |
is_today | For date fields, confirms the date is the current date (t-0) | date |
is_yesterday | For date fields, confirms the date is 1 day ago (t-1) | date |
is_on_weekday | For date fields, confirms the day is between Mon-Fri | date |
is_on_weekend | For date fields, confirms the day is Sat or Sun | date |
is_on_monday | For date fields, confirms the day is Mon | date |
is_on_tuesday | For date fields, confirms the day is Tue | date |
is_on_wednesday | For date fields, confirms the day is Wed | date |
is_on_thursday | For date fields, confirms the day is Thu | date |
is_on_friday | For date fields, confirms the day is Fri | date |
is_on_saturday | For date fields, confirms the day is Sat | date |
is_on_sunday | For date fields, confirms the day is Sun | date |
is_on_schedule | For date fields, confirms time windows, e.g. 9:00 - 17:00 | timestamp |
is_daily | Verifies daily continuity on date fields. The default schedule is [2,3,4,5,6], which represents Mon-Fri in PySpark; custom schedules can be supplied for other date continuity | date |
has_workflow | Adjacency matrix validation on a 3-column graph, based on group, event and order columns | agnostic |
satisfies | An open SQL expression builder to construct custom checks | agnostic |
validate | The final transformation of a check, taking a dataframe as input for validation | agnostic |
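The is_daily entry refers to the schedule [2,3,4,5,6] as Mon-Fri in PySpark. The sketch below illustrates that numbering in plain Python, assuming Spark's day-of-week convention (1 = Sunday through 7 = Saturday). The helpers `spark_dayofweek` and `missing_days` are hypothetical names used only for illustration; they are not part of cuallee and do not reflect how the library implements the check.

```python
from datetime import date, timedelta


def spark_dayofweek(d: date) -> int:
    """Map a date to Spark-style numbering: 1 = Sunday ... 7 = Saturday (assumed)."""
    # date.isoweekday() returns Monday = 1 ... Sunday = 7
    return d.isoweekday() % 7 + 1


def missing_days(dates, schedule=(2, 3, 4, 5, 6)):
    """Return scheduled dates between the first and last observed date that are absent."""
    present = set(dates)
    start, end = min(present), max(present)
    gaps = []
    current = start
    while current <= end:
        # Only days whose number is in the schedule are expected to appear
        if spark_dayofweek(current) in schedule and current not in present:
            gaps.append(current)
        current += timedelta(days=1)
    return gaps


# A Monday-to-Friday week with Wednesday 2023-05-10 left out
observed = [date(2023, 5, 8), date(2023, 5, 9), date(2023, 5, 11), date(2023, 5, 12)]
gaps = missing_days(observed)
print(gaps)
```

Passing a different `schedule` tuple, for example one that includes 1 and 7, would treat weekend days as expected too, which is the idea behind custom schedules.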
Controls
Currently available only for pyspark and pandas dataframes.
Check | Description | DataType |
---|---|---|
completeness | Zero nulls | agnostic |
information | Zero nulls and cardinality > 1 | agnostic |
intelligence | Zero nulls, zero empty strings and cardinality > 1 | agnostic |
percentage_fill | % rows not empty | agnostic |
percentage_empty | % rows empty | agnostic |
Demo
import pandas as pd
from cuallee import Control
df = pd.DataFrame({"X":[1,2,3], "Y": [10,20,30]})
# Check every column in the dataframe using the is_complete check
Control.completeness(df)
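To make the idea of a control concrete, here is a minimal plain-Python sketch of a per-column fill rate and an all-columns completeness test. It is an illustration under assumptions, not cuallee's implementation: `fill_rate` and `all_complete` are hypothetical helpers, nulls are taken to mean `None` or NaN, and rates are expressed as fractions between 0 and 1.

```python
import math


def is_null(value) -> bool:
    """Treat None and float NaN as missing values (assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_rate(columns: dict) -> dict:
    """Fraction of non-null values in each column, between 0.0 and 1.0."""
    return {
        name: sum(not is_null(v) for v in values) / len(values)
        for name, values in columns.items()
    }


def all_complete(columns: dict) -> bool:
    """True only when every column has zero nulls."""
    return all(rate == 1.0 for rate in fill_rate(columns).values())


data = {"X": [1, 2, 3], "Y": [10, None, 30]}
rates = fill_rate(data)
print(rates)
print(all_complete(data))
```

In this example column X is fully populated while Y has one missing value out of three, so the completeness test over the whole table fails.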
ISO
Starting with cuallee>=0.4.0, a new module verifies that dataframe columns comply with International Organization for Standardization (ISO) standards. Access the check.iso interface to add these checks, as shown below.
Check | Description | DataType |
---|---|---|
iso_4217 | Currency codes compliant with ISO 4217 (e.g. a ccy column) | string |
iso_3166 | Country codes compliant with ISO 3166 (e.g. a country column) | string |
Demo
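from pyspark.sql import SparkSession
from cuallee import Check, CheckLevel
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1, "USD"], [2, "MXN"], [3, "CAD"], [4, "EUR"], [5, "CHF"]], ["id", "ccy"])
check = Check(CheckLevel.WARNING, "ISO Compliant")
check.iso.iso_4217("ccy")
check.validate(df).show()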
+---+-------------------+-------------+-------+------+---------------+--------------------+----+----------+---------+--------------+------+
| id| timestamp| check| level|column| rule| value|rows|violations|pass_rate|pass_threshold|status|
+---+-------------------+-------------+-------+------+---------------+--------------------+----+----------+---------+--------------+------+
| 1|2023-05-14 18:28:02|ISO Compliant|WARNING| ccy|is_contained_in|{'BHD', 'CRC', 'M...| 5| 0.0| 1.0| 1.0| PASS|
+---+-------------------+-------------+-------+------+---------------+--------------------+----+----------+---------+--------------+------+
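The output above reports the ISO check under the rule is_contained_in, with a set of currency codes as its value. The sketch below shows, in plain Python, how such a membership check could produce the rows, violations, pass_rate and status columns. The formula `pass_rate = 1 - violations / rows` and the comparison against `pass_threshold` are assumptions inferred from the column names, `membership_result` is a hypothetical helper, and `KNOWN_CURRENCIES` is only a small illustrative subset of ISO 4217.

```python
# Illustrative subset only; the real ISO 4217 list contains many more codes.
KNOWN_CURRENCIES = {"USD", "MXN", "CAD", "EUR", "CHF", "BHD", "CRC"}


def membership_result(values, allowed, pass_threshold=1.0):
    """Summarise a membership check in the shape of the result columns above."""
    rows = len(values)
    # A violation is any value that is not in the allowed set
    violations = sum(v not in allowed for v in values)
    pass_rate = 1 - violations / rows  # assumed formula
    status = "PASS" if pass_rate >= pass_threshold else "FAIL"
    return {
        "rows": rows,
        "violations": violations,
        "pass_rate": pass_rate,
        "pass_threshold": pass_threshold,
        "status": status,
    }


ok = membership_result(["USD", "MXN", "CAD", "EUR", "CHF"], KNOWN_CURRENCIES)
bad = membership_result(["USD", "ZZZ", "CAD", "EUR", "CHF"], KNOWN_CURRENCIES)
print(ok)
print(bad)
```

With the five valid codes from the demo there are no violations, matching the PASS row above; replacing one code with a value outside the set yields one violation and a failing status under the default threshold.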