PySpark
To follow these examples, make sure your installation is set up for PySpark.
Install
pip install cuallee
pip install cuallee[pyspark]
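To confirm the environment is ready, a quick sanity check like the one below should run without errors (a minimal sketch; it assumes nothing beyond the two packages above):
# Sanity check: both packages import and a local Spark session starts
import pyspark
from cuallee import Check
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
print(pyspark.__version__)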
is_complete
It validates the completeness attribute of a data set. It confirms that a column does not contain null values.
In this example, we validate that the column id does not have any missing values.
from pyspark.sql import SparkSession
from cuallee import Check
spark = SparkSession.builder.getOrCreate()
# A 10-row dataframe with no missing values in the id column
df = spark.range(10)
check = Check()
check.is_complete("id")
# Validate
check.validate(df).show()
output:
+---+-------------------+-------------+-------+------+-----------+-----+----+----------+---------+--------------+------+
| id| timestamp| check| level|column| rule|value|rows|violations|pass_rate|pass_threshold|status|
+---+-------------------+-------------+-------+------+-----------+-----+----+----------+---------+--------------+------+
| 1|2024-05-18 16:53:56|cuallee.check|WARNING| id|is_complete| N/A| 10| 0| 1.0| 1.0| PASS|
+---+-------------------+-------------+-------+------+-----------+-----+----+----------+---------+--------------+------+
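Since validate returns a regular PySpark dataframe, the result can be post-processed with the usual dataframe API. For example, to keep only failing rules (plain PySpark filtering, nothing cuallee-specific; the result is empty here because the check passed):
# Show only rules with a FAIL status; empty for this check
check.validate(df).filter("status = 'FAIL'").show()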
In this example, we intentionally place 2 null values in the dataframe, which produces a FAIL result for the check.
from pyspark.sql import SparkSession
from cuallee import Check
spark = SparkSession.builder.getOrCreate()
# 8 complete rows plus 2 rows with null ids
df = spark.range(8).union(spark.createDataFrame([(None,), (None,)], schema="id int"))
check = Check()
check.is_complete("id")
# Validate
check.validate(df).show()
output:
+---+-------------------+-------------+-------+------+-----------+-----+----+----------+---------+--------------+------+
| id| timestamp| check| level|column| rule|value|rows|violations|pass_rate|pass_threshold|status|
+---+-------------------+-------------+-------+------+-----------+-----+----+----------+---------+--------------+------+
| 1|2024-05-18 16:53:56|cuallee.check|WARNING| id|is_complete| N/A| 10| 2| 0.8| 1.0| FAIL|
+---+-------------------+-------------+-------+------+-----------+-----+----+----------+---------+--------------+------+
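The violations column matches a direct null count on the column, which you can confirm with plain PySpark:
from pyspark.sql import functions as F
# Count the null ids directly; matches the violations column
df.filter(F.col("id").isNull()).count()  # 2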
In this example, we reuse the data frame with null values from the previous example, but we relax our tolerance via the pct parameter of the is_complete rule, setting it to 0.8. This now produces a PASS result on the check, despite the 2 null values present.
from pyspark.sql import SparkSession
from cuallee import Check
spark = SparkSession.builder.getOrCreate()
# Same dataframe as before: 8 complete rows plus 2 rows with null ids
df = spark.range(8).union(spark.createDataFrame([(None,), (None,)], schema="id int"))
check = Check()
check.is_complete("id", pct=0.8)
# Validate
check.validate(df).show()
output:
+---+-------------------+-------------+-------+------+-----------+-----+----+----------+---------+--------------+------+
| id| timestamp| check| level|column| rule|value|rows|violations|pass_rate|pass_threshold|status|
+---+-------------------+-------------+-------+------+-----------+-----+----+----------+---------+--------------+------+
| 1|2024-05-18 16:53:56|cuallee.check|WARNING| id|is_complete| N/A| 10| 2| 0.8| 0.8| PASS|
+---+-------------------+-------------+-------+------+-----------+-----+----+----------+---------+--------------+------+
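The PASS follows from the arithmetic of the pass rate: pass_rate = (rows - violations) / rows = (10 - 2) / 10 = 0.8, which meets the 0.8 threshold. A minimal sketch of that criterion (the helper name is ours for illustration, not part of cuallee):
# Hypothetical helper illustrating the pass criterion, not part of cuallee
def passes(rows: int, violations: int, pct: float) -> bool:
    pass_rate = (rows - violations) / rows
    return pass_rate >= pct

print(passes(rows=10, violations=2, pct=0.8))  # True -> PASS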