Catalogue

The following table contains the list of all available checks in cuallee:

Checks

| Check | Description | DataType |
|---|---|---|
| is_complete | Zero nulls | agnostic |
| is_empty | All nulls | agnostic |
| is_unique | Zero duplicates | agnostic |
| is_primary_key | Zero duplicates | agnostic |
| are_complete | Zero nulls on group of columns | agnostic |
| are_unique | Composite primary key check | agnostic |
| is_composite_key | Zero duplicates on multiple columns | agnostic |
| is_greater_than | col > x | numeric |
| is_positive | col > 0 | numeric |
| is_negative | col < 0 | numeric |
| is_greater_or_equal_than | col >= x | numeric |
| is_less_than | col < x | numeric |
| is_less_or_equal_than | col <= x | numeric |
| is_equal_than | col == x | numeric |
| is_contained_in | col in [a, b, c, ...] | agnostic |
| is_in | Alias of is_contained_in | agnostic |
| not_contained_in | col not in [a, b, c, ...] | agnostic |
| not_in | Alias of not_contained_in | agnostic |
| is_between | a <= col <= b | numeric, date |
| has_pattern | Matching a pattern defined as a regex | string |
| is_legit | String not null & not empty, `^\S$` | string |
| has_min | min(col) == x | numeric |
| has_max | max(col) == x | numeric |
| has_std | σ(col) == x | numeric |
| has_mean | μ(col) == x | numeric |
| has_sum | Σ(col) == x | numeric |
| has_percentile | %(col) == x | numeric |
| has_cardinality | count(distinct(col)) == x | agnostic |
| has_infogain | count(distinct(col)) > 1 | agnostic |
| has_max_by | A utility predicate: max(col_a) == x for max(col_b) | agnostic |
| has_min_by | A utility predicate: min(col_a) == x for min(col_b) | agnostic |
| has_correlation | Finds correlation between 0..1 for corr(col_a, col_b) | numeric |
| has_entropy | Calculates the entropy of a column, entropy(col) == x, for classification problems | numeric |
| is_inside_interquartile_range | Verifies column values reside inside the limits of the interquartile range, Q1 <= col <= Q3; used for anomaly detection | numeric |
| is_in_millions | col >= 1e6 | numeric |
| is_in_billions | col >= 1e9 | numeric |
| is_t_minus_1 | For date fields, confirms the date is 1 day ago (t-1) | date |
| is_t_minus_2 | For date fields, confirms the date is 2 days ago (t-2) | date |
| is_t_minus_3 | For date fields, confirms the date is 3 days ago (t-3) | date |
| is_t_minus_n | For date fields, confirms the date is n days ago (t-n) | date |
| is_today | For date fields, confirms the date is the current date (t-0) | date |
| is_yesterday | For date fields, confirms the date is 1 day ago (t-1) | date |
| is_on_weekday | For date fields, confirms the day falls between Mon-Fri | date |
| is_on_weekend | For date fields, confirms the day falls between Sat-Sun | date |
| is_on_monday | For date fields, confirms the day is Mon | date |
| is_on_tuesday | For date fields, confirms the day is Tue | date |
| is_on_wednesday | For date fields, confirms the day is Wed | date |
| is_on_thursday | For date fields, confirms the day is Thu | date |
| is_on_friday | For date fields, confirms the day is Fri | date |
| is_on_saturday | For date fields, confirms the day is Sat | date |
| is_on_sunday | For date fields, confirms the day is Sun | date |
| is_on_schedule | For timestamp fields, confirms the time falls within a window, e.g. 9:00 - 17:00 | timestamp |
| is_daily | Verifies daily continuity on date fields; defaults to [2, 3, 4, 5, 6], which represents Mon-Fri in PySpark, but new schedules can be supplied for custom date continuity | date |
| has_workflow | Adjacency matrix validation on a 3-column graph, based on group, event and order columns | agnostic |
| satisfies | An open SQL expression builder to construct custom checks | agnostic |
| validate | Runs all rules added to a check against an input dataframe and returns the validation results | agnostic |
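
As a quick illustration, several of the checks above can be added to a single Check object and validated against a dataframe. This is a minimal sketch using pandas; the dataframe and column names are illustrative, not part of the catalogue.

```python
import pandas as pd
from cuallee import Check, CheckLevel

df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.5, 20.0, 35.2]})

# Illustrative rules picked from the catalogue above
check = Check(CheckLevel.WARNING, "Catalogue Demo")
check.is_complete("id")                 # zero nulls in id
check.is_unique("id")                   # zero duplicates in id
check.is_greater_than("amount", 0)      # amount > 0
check.is_contained_in("id", (1, 2, 3))  # id in (1, 2, 3)

# Returns a dataframe with one row per rule and its pass/fail status
check.validate(df)
```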

Controls

Currently available only for PySpark and pandas dataframes.

| Check | Description | DataType |
|---|---|---|
| completeness | Zero nulls | agnostic |
| information | Zero nulls and cardinality > 1 | agnostic |
| intelligence | Zero nulls, zero empty strings and cardinality > 1 | agnostic |
| percentage_fill | % rows not empty | agnostic |
| percentage_empty | % rows empty | agnostic |

Demo

```python
import pandas as pd
from cuallee import Control

df = pd.DataFrame({"X": [1, 2, 3], "Y": [10, 20, 30]})

# Check all columns in the dataframe using the is_complete check
Control.completeness(df)
```
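
The remaining controls in the table above appear to follow the same calling convention as completeness; a short sketch under that assumption:

```python
import pandas as pd
from cuallee import Control

df = pd.DataFrame({"X": [1, None, 3], "Y": [10, 20, None]})

# Assumption: these controls take a dataframe directly, like Control.completeness
Control.percentage_fill(df)   # % of rows that are not empty
Control.percentage_empty(df)  # % of rows that are empty
```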

ISO

A new module has been incorporated in cuallee>=0.4.0 which allows the verification of columns against International Organization for Standardization (ISO) standards in data frames. Simply access the check.iso interface to add the set of checks as shown below.

| Check | Description | DataType |
|---|---|---|
| iso_4217 | Currency compliant (ISO 4217) ccy column | string |
| iso_3166 | Country compliant (ISO 3166) country column | string |

Demo

```python
df = spark.createDataFrame([[1, "USD"], [2, "MXN"], [3, "CAD"], [4, "EUR"], [5, "CHF"]], ["id", "ccy"])
check = Check(CheckLevel.WARNING, "ISO Compliant")
check.iso.iso_4217("ccy")
check.validate(df).show()
```

```
+---+-------------------+-------------+-------+------+---------------+--------------------+----+----------+---------+--------------+------+
| id|          timestamp|        check|  level|column|           rule|               value|rows|violations|pass_rate|pass_threshold|status|
+---+-------------------+-------------+-------+------+---------------+--------------------+----+----------+---------+--------------+------+
|  1|2023-05-14 18:28:02|ISO Compliant|WARNING|   ccy|is_contained_in|{'BHD', 'CRC', 'M...|   5|       0.0|      1.0|           1.0|  PASS|
+---+-------------------+-------------+-------+------+---------------+--------------------+----+----------+---------+--------------+------+
```
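
iso_3166 follows the same interface for country codes; a minimal sketch reusing the pattern above, with illustrative data and column names:

```python
# Illustrative dataframe; "country" is an example column name
df_geo = spark.createDataFrame([[1, "US"], [2, "MX"], [3, "CA"]], ["id", "country"])

check = Check(CheckLevel.WARNING, "ISO Country")
check.iso.iso_3166("country")
check.validate(df_geo).show()
```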