BigQuery
In order to follow this examples, make sure your installation is all set for bigquery
Install
pip install cuallee
pip install cuallee[bigquery]
Pre-Requisites
You will need a Google Cloud account active, with the BigQuery API enabled to proceed with this examples.
Once your account is enabled with BigQuery, you will have to export a service account
credential file in json
format.
cuallee
will read the environment variable GOOGLE_APPLICATION_CREDENTIALS
expecting the name of the file that contains your service account credentials
Cost Associated
Be aware that running cuallee
checks in bigquery
incurs into cloud costs.
The process inside cuallee
to handle the credentials is as follows:
Credentials Handling
import os
from google.cloud import bigquery
credentials = os.getenv('GOOGLE_APPLICATION_CREDENTIALS')
client = bigquery.Client(project="GOOGLE_CLOUD_PROJECT_IDENTIFIER", credentials=credentials)
is_complete
It validates the completeness attribute of a data set. It confirms that a column does not contain null
values .
is_complete
In this example, we validate that the column id
does not have any missing values.
from google.cloud import bigquery
from cuallee import Check
# Public dataset in BigQuery
df = bigquery.dataset.Table("bigquery-public-data.chicago_taxi_trips.taxi_trips")
check = Check()
check.is_complete("taxi_id")
# Validate
check.validate(df)
output:
timestamp check level column rule value rows violations pass_rate pass_threshold status
id
1 2024-05-18 21:24:15 cuallee.check WARNING taxi_id is_complete N/A 102589284 0 1.0 1.0 PASS
In this example, we intentionally place 2 null
values in the dataframe and that produces a FAIL
check as result.
from google.cloud import bigquery
from cuallee import Check
# Public dataset in BigQuery
df = bigquery.dataset.Table("bigquery-public-data.chicago_taxi_trips.taxi_trips")
check = Check()
check.is_complete("trip_end_timestamp")
# Validate
check.validate(df)
output:
timestamp check level column rule value rows violations pass_rate pass_threshold status
id
1 2024-05-18 21:24:15 cuallee.check WARNING trip_end_timestamp is_complete N/A 102589284 1589 0.999985 1.0 FAIL
In this example, we validate reuse the data frame with empty values from the previous example, however we set our tolerance via the pct
parameter on the rule is_complete
to 0.6
. Producing now a PASS
result on the check, regardless of the 2
present null
values.
from google.cloud import bigquery
from cuallee import Check
# Public dataset in BigQuery
df = bigquery.dataset.Table("bigquery-public-data.chicago_taxi_trips.taxi_trips")
check = Check()
check.is_complete("trip_end_timestamp", pct=0.9)
# Validate
check.validate(df)
output:
timestamp check level column rule value rows violations pass_rate pass_threshold status
id
1 2024-05-18 21:24:15 cuallee.check WARNING trip_end_timestamp is_complete N/A 102589284 1589 0.999985 0.9 PASS