Skip to content

BigQuery

In order to follow this examples, make sure your installation is all set for bigquery

Install

pip install cuallee
pip install cuallee[bigquery]

Pre-Requisites

You will need a Google Cloud account active, with the BigQuery API enabled to proceed with this examples.
Once your account is enabled with BigQuery, you will have to export a service account credential file in json format.

cuallee will read the environment variable GOOGLE_APPLICATION_CREDENTIALS expecting the name of the file that contains your service account credentials

Cost Associated

Be aware that running cuallee checks in bigquery incurs into cloud costs.

The process inside cuallee to handle the credentials is as follows:

Credentials Handling

import os
from google.cloud import bigquery

credentials = os.getenv('GOOGLE_APPLICATION_CREDENTIALS')
client = bigquery.Client(project="GOOGLE_CLOUD_PROJECT_IDENTIFIER", credentials=credentials)

is_complete

It validates the completeness attribute of a data set. It confirms that a column does not contain null values .

is_complete

In this example, we validate that the column id does not have any missing values.

from google.cloud import bigquery
from cuallee import Check

# Public dataset in BigQuery
df = bigquery.dataset.Table("bigquery-public-data.chicago_taxi_trips.taxi_trips")
check = Check()
check.is_complete("taxi_id")

# Validate
check.validate(df)

output:

            timestamp          check    level   column         rule value       rows  violations  pass_rate  pass_threshold status
id
1   2024-05-18 21:24:15  cuallee.check  WARNING  taxi_id  is_complete   N/A  102589284           0        1.0             1.0   PASS

In this example, we intentionally place 2 null values in the dataframe and that produces a FAIL check as result.

from google.cloud import bigquery
from cuallee import Check

# Public dataset in BigQuery
df = bigquery.dataset.Table("bigquery-public-data.chicago_taxi_trips.taxi_trips")
check = Check()
check.is_complete("trip_end_timestamp")

# Validate
check.validate(df)

output:

            timestamp          check    level              column         rule value       rows  violations  pass_rate  pass_threshold status
id
1   2024-05-18 21:24:15  cuallee.check  WARNING  trip_end_timestamp  is_complete   N/A  102589284        1589   0.999985             1.0   FAIL

In this example, we validate reuse the data frame with empty values from the previous example, however we set our tolerance via the pct parameter on the rule is_complete to 0.6. Producing now a PASS result on the check, regardless of the 2 present null values.

from google.cloud import bigquery
from cuallee import Check

# Public dataset in BigQuery
df = bigquery.dataset.Table("bigquery-public-data.chicago_taxi_trips.taxi_trips")
check = Check()
check.is_complete("trip_end_timestamp", pct=0.9)

# Validate
check.validate(df)

output:

            timestamp          check    level              column         rule value       rows  violations  pass_rate  pass_threshold status
id
1   2024-05-18 21:24:15  cuallee.check  WARNING  trip_end_timestamp  is_complete   N/A  102589284        1589   0.999985             0.9   PASS