# Documentation

Pipedog is a lightweight CLI for data quality checks and schema drift detection on flat files. This page covers everything from installation to CI/CD integration.
## Quick Start

Install Pipedog from PyPI:

```bash
pip install pipedog
```

Verify the installation:

```bash
pipedog --version
```

Initialise a baseline from one or more historical files:

```bash
pipedog init orders_jan.csv orders_feb.csv --profile orders
```

Scan a new file against the baseline:

```bash
pipedog scan orders_mar.csv --profile orders
```

> **Tip:** Pipedog stores baselines in a `.pipedog/` directory relative to where you run the command. Add it to `.gitignore` or commit it — your choice.
## Commands

Pipedog exposes five top-level commands. Run `pipedog --help` or `pipedog <command> --help` for full flag details.
### init

Reads one or more files, merges them into a single baseline, and saves a schema snapshot plus quality checks to `.pipedog/<profile>/`.

```bash
pipedog init <file1> [file2 ...] [--profile <name>]
```

```bash
# Examples
pipedog init data.csv
pipedog init jan.csv feb.csv mar.csv --profile monthly_sales
pipedog init warehouse.parquet --profile wh
```

| Flag | Default | Description |
|---|---|---|
| `--profile` | `default` | Named profile for this dataset |
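The merge step can be pictured as folding per-file column statistics into one combined snapshot. The sketch below is illustrative only, not Pipedog's actual internals; the stat fields (`min`, `max`, `nulls`, `rows`) are assumptions:

```python
# Illustrative sketch of baseline merging: each input file contributes
# per-column stats, and init folds them into one combined snapshot that
# later scans are compared against.

def merge_baselines(file_stats):
    """Fold a list of per-file column stats into a single baseline."""
    baseline = {}
    for stats in file_stats:
        for col, s in stats.items():
            b = baseline.setdefault(
                col, {"min": s["min"], "max": s["max"], "nulls": 0, "rows": 0}
            )
            b["min"] = min(b["min"], s["min"])    # widen observed range
            b["max"] = max(b["max"], s["max"])
            b["nulls"] += s["nulls"]              # accumulate counts
            b["rows"] += s["rows"]
    return baseline

jan = {"amount": {"min": 5, "max": 90, "nulls": 0, "rows": 1000}}
feb = {"amount": {"min": 2, "max": 120, "nulls": 3, "rows": 1200}}
merged = merge_baselines([jan, feb])
# merged["amount"] spans both months: min 2, max 120, 3 nulls over 2200 rows
```

Merging this way means a wider historical sample produces looser, less flaky range checks, which is why `init` accepts multiple files.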
### scan

Reads a single file and runs all checks from the stored baseline. Saves an HTML report to `.pipedog/<profile>/reports/` and appends an entry to `history.json`. Exits with 1 if any error-severity check fails.

```bash
pipedog scan <file> [--profile <name>]
```

```bash
# Examples
pipedog scan orders_apr.csv --profile orders
pipedog scan new_data.parquet
```

| Flag | Default | Description |
|---|---|---|
| `--profile` | `default` | Named profile to compare against |
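The `history.json` append is a simple read-modify-write of a JSON array. The exact schema of Pipedog's history entries is not documented here; the field names below (`file`, `passed`) are illustrative assumptions used only to show the pattern:

```python
# Hypothetical sketch of the append-to-history pattern scan uses.
import json
import tempfile
from pathlib import Path

def append_history(history_path, entry):
    """Append one scan entry to a JSON-array history file."""
    path = Path(history_path)
    history = json.loads(path.read_text()) if path.exists() else []
    history.append(entry)
    path.write_text(json.dumps(history, indent=2))
    return history

tmp = Path(tempfile.mkdtemp()) / "history.json"
append_history(tmp, {"file": "orders_mar.csv", "passed": True})
log = append_history(tmp, {"file": "orders_apr.csv", "passed": False})
# log now holds both scan entries, oldest first
```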
### profile

Displays a rich summary of the stored baseline for a profile — column types, sample values, null rates, numeric ranges, and more.

```bash
pipedog profile [--profile <name>]
```

```bash
# Example
pipedog profile --profile orders
```

### checks
Manage the quality checks stored in the baseline. Three subcommands:

```bash
# List all checks for a profile
pipedog checks list [--profile <name>]

# Open checks file in $EDITOR for manual edits
pipedog checks edit [--profile <name>]

# Add a custom check interactively
pipedog checks add [--profile <name>]
```

### report
List saved HTML reports for a profile and optionally open one in the default browser.

```bash
# List reports
pipedog report [--profile <name>]

# Open the most recent report
pipedog report --open [--profile <name>]
```

## Multi-Profile Workflow
Use `--profile` to track different datasets or pipelines independently. Each profile stores its own baseline, checks, and report history.

```bash
# Initialise two separate profiles
pipedog init sales_jan.csv sales_feb.csv --profile sales
pipedog init events_jan.csv --profile events

# Scan each independently
pipedog scan sales_mar.csv --profile sales
pipedog scan events_feb.csv --profile events

# Inspect each baseline
pipedog profile --profile sales
pipedog profile --profile events
```

Profiles are stored as subdirectories inside `.pipedog/`:
```
.pipedog/
├── sales/
│   ├── schema.json
│   ├── checks.json
│   ├── history.json
│   └── reports/
│       └── scan_2026-03-29_143012.html
└── events/
    ├── schema.json
    ├── checks.json
    └── reports/
```

## Auto-Generated Checks
Pipedog infers which checks to generate by analysing your baseline data. You never write rules manually — but you can edit or extend them with `pipedog checks edit`.
| Check | When Generated | Severity |
|---|---|---|
| `not_null` | Column had zero nulls at init | error |
| `null_rate` | Column had some nulls; threshold = baseline % + 10pp | warning |
| `min_value` | Numeric column; locks observed minimum | error |
| `max_value` | Numeric column; locks observed maximum | error |
| `unique` | Every value was distinct (key column detection) | error |
| `allowed_values` | String/boolean column with ≤ 50 distinct values | error |
| `std_dev_change` | Numeric column; flags distribution shift > 50% | warning |
| `row_count` | Every file; threshold = 80% of baseline row count | error |
Checks are stored as JSON in `.pipedog/<profile>/checks.json`. You can add, remove, or modify thresholds directly in that file or via `pipedog checks edit`.
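Two of the thresholds in the table above are simple arithmetic, which is worth seeing concretely. The function names below are illustrative, not Pipedog's API; only the threshold rules themselves come from the table:

```python
# Sketch of two auto-generated thresholds from the table above.

def null_rate_ok(baseline_null_pct, scanned_null_pct):
    """warning check: allow the baseline null rate plus 10 percentage points."""
    return scanned_null_pct <= baseline_null_pct + 10.0

def row_count_ok(baseline_rows, scanned_rows):
    """error check: scanned file must have at least 80% of baseline rows."""
    return scanned_rows >= 0.8 * baseline_rows

# Baseline had 2% nulls, so anything up to 12% passes; 15% trips the warning.
null_rate_ok(2.0, 12.0)   # True
null_rate_ok(2.0, 15.0)   # False

# Baseline of 10,000 rows means anything under 8,000 rows fails.
row_count_ok(10_000, 8_500)  # True
row_count_ok(10_000, 7_000)  # False
```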
## CI/CD Integration

Pipedog exits with code 0 when all checks pass and 1 when any error-severity check fails. This makes it a natural fit for any CI/CD pipeline.
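The exit-code contract means any orchestrator can gate on a scan with a plain subprocess call. A minimal sketch of the pattern (using `sys.executable` as a stand-in command so the snippet runs without Pipedog installed; in practice the argv would be `["pipedog", "scan", ...]`):

```python
import subprocess
import sys

def gate(argv):
    """Return True when the command exits 0, False on any nonzero exit."""
    return subprocess.run(argv).returncode == 0

# Stand-in commands that exit 0 and 1, mimicking a passing/failing scan.
passing = gate([sys.executable, "-c", "raise SystemExit(0)"])
failing = gate([sys.executable, "-c", "raise SystemExit(1)"])
```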
### GitHub Actions

```yaml
# .github/workflows/data-quality.yml
name: Data Quality

on:
  push:
    paths:
      - 'data/**'

jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install Pipedog
        run: pip install pipedog
      - name: Run quality scan
        run: pipedog scan data/orders_latest.csv --profile orders
```

### Apache Airflow
```python
from airflow.operators.bash import BashOperator

# Inside a DAG definition; {{ ds }} is Airflow's execution-date template.
quality_check = BashOperator(
    task_id="pipedog_scan",
    bash_command="pipedog scan {{ ds }}_orders.csv --profile orders",
)
```

### Makefile
```makefile
ingest: quality
	@echo "Quality passed — ingesting..."

quality:
	pipedog scan $(FILE) --profile $(PROFILE)
```

## File Formats
Pipedog auto-detects the file format from the extension. All three formats are treated identically after loading — the same checks apply regardless of format.

- **CSV** (`.csv`): comma-separated values. Most common format.
- **Parquet** (`.parquet`): Apache Parquet columnar format. Requires `pyarrow` (installed automatically).
- **JSON** (`.json`): JSON array of records or newline-delimited JSON. Both forms supported.
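Extension-based detection usually boils down to a suffix lookup. The sketch below assumes hypothetical loader names; Pipedog's real loading code is not shown here:

```python
from pathlib import Path

# Hypothetical loader names keyed by lowercased file extension.
LOADERS = {
    ".csv": "load_csv",
    ".parquet": "load_parquet",  # would require pyarrow
    ".json": "load_json",        # array of records or NDJSON
}

def detect_loader(filename):
    """Pick a loader by extension, case-insensitively."""
    suffix = Path(filename).suffix.lower()
    try:
        return LOADERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file format: {suffix!r}")

detect_loader("orders_mar.csv")     # "load_csv"
detect_loader("warehouse.PARQUET")  # "load_parquet"
```

Normalising the suffix with `.lower()` keeps detection case-insensitive, and the explicit `ValueError` gives a clear failure for unknown extensions instead of a bare `KeyError`.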
Something missing? Open an issue on GitHub →
v0.2.1