Documentation

Pipedog is a lightweight CLI for data quality checks and schema drift detection on flat files. This page covers everything from installation to CI/CD integration.

Quick Start

Install Pipedog from PyPI:

bash
pip install pipedog

Verify the installation:

bash
pipedog --version

Initialise a baseline from one or more historical files:

bash
pipedog init orders_jan.csv orders_feb.csv --profile orders

Scan a new file against the baseline:

bash
pipedog scan orders_mar.csv --profile orders

Tip: Pipedog stores baselines in a .pipedog/ directory relative to where you run the command. Add it to .gitignore or commit it — your choice.

Commands

Pipedog exposes five top-level commands. Run pipedog --help or pipedog <command> --help for full flag details.

init

Reads one or more files, merges them into a single baseline, and saves a schema snapshot plus quality checks to .pipedog/<profile>/.

bash
pipedog init <file1> [file2 ...] [--profile <name>]

# Examples
pipedog init data.csv
pipedog init jan.csv feb.csv mar.csv --profile monthly_sales
pipedog init warehouse.parquet --profile wh
Flag        Default    Description
--profile   default    Named profile for this dataset
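The docs don't spell out how the merge treats numeric columns, but given that the auto-generated min_value/max_value checks lock the observed extremes, a natural reading is that ranges widen to cover every input file. A sketch under that assumption (merge_ranges is an illustrative name, not a Pipedog API):

```python
def merge_ranges(ranges: list[tuple[float, float]]) -> tuple[float, float]:
    """Combine per-file (min, max) pairs into one baseline range."""
    lows, highs = zip(*ranges)
    return (min(lows), max(highs))

# Three monthly files produce one range covering all of them:
# merge_ranges([(5.0, 90.0), (3.5, 87.0), (4.0, 99.5)]) -> (3.5, 99.5)
```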

scan

Reads a single file and runs all checks from the stored baseline. Saves an HTML report to .pipedog/<profile>/reports/ and appends an entry to history.json. Exits with 1 if any error-severity check fails.

bash
pipedog scan <file> [--profile <name>]

# Examples
pipedog scan orders_apr.csv --profile orders
pipedog scan new_data.parquet
Flag        Default    Description
--profile   default    Named profile to compare against

profile

Displays a rich summary of the stored baseline for a profile — column types, sample values, null rates, numeric ranges, and more.

bash
pipedog profile [--profile <name>]

# Example
pipedog profile --profile orders

checks

Manage the quality checks stored in the baseline. Three subcommands:

bash
# List all checks for a profile
pipedog checks list [--profile <name>]

# Open checks file in $EDITOR for manual edits
pipedog checks edit [--profile <name>]

# Add a custom check interactively
pipedog checks add [--profile <name>]

report

List saved HTML reports for a profile and optionally open one in the default browser.

bash
# List reports
pipedog report [--profile <name>]

# Open the most recent report
pipedog report --open [--profile <name>]

Multi-Profile Workflow

Use --profile to track different datasets or pipelines independently. Each profile stores its own baseline, checks, and report history.

bash
# Initialise two separate profiles
pipedog init sales_jan.csv sales_feb.csv --profile sales
pipedog init events_jan.csv --profile events

# Scan each independently
pipedog scan sales_mar.csv --profile sales
pipedog scan events_feb.csv --profile events

# Inspect each baseline
pipedog profile --profile sales
pipedog profile --profile events

Profiles are stored as subdirectories inside .pipedog/:

text
.pipedog/
├── sales/
│   ├── schema.json
│   ├── checks.json
│   ├── history.json
│   └── reports/
│       └── scan_2026-03-29_143012.html
└── events/
    ├── schema.json
    ├── checks.json
    └── reports/

Auto-Generated Checks

Pipedog infers which checks to generate by analysing your baseline data, so you never have to write rules by hand. You can still edit or extend the generated checks with pipedog checks edit.

Check            When Generated                                        Severity
not_null         Column had zero nulls at init                         error
null_rate        Column had some nulls; threshold = baseline % + 10pp  warning
min_value        Numeric column; locks observed minimum                error
max_value        Numeric column; locks observed maximum                error
unique           Every value was distinct (key column detection)       error
allowed_values   String/boolean column with ≤ 50 distinct values       error
std_dev_change   Numeric column; flags distribution shift > 50%        warning
row_count        Every file; threshold = 80% of baseline row count     error

Checks are stored as JSON in .pipedog/<profile>/checks.json. You can add, remove, or modify thresholds directly in that file or via pipedog checks edit.
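Two of the auto-generated thresholds are simple arithmetic on baseline statistics, which is worth keeping in mind when tuning them by hand. For concreteness (these helpers are illustrative, not Pipedog's API):

```python
def null_rate_threshold(baseline_rate: float) -> float:
    """null_rate: warn when the null rate exceeds baseline + 10 percentage points."""
    return baseline_rate + 0.10

def row_count_threshold(baseline_rows: int) -> int:
    """row_count: fail when a file has fewer than 80% of the baseline rows."""
    return int(baseline_rows * 0.8)

# A column that was 3% null at init tolerates up to 13% nulls;
# a 10,000-row baseline requires at least 8,000 rows per scan.
```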

CI/CD Integration

Pipedog exits with code 0 when all checks pass and 1 when any error-severity check fails. This makes it a natural fit for any CI/CD pipeline.
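Because the contract is just the exit code, a thin wrapper is enough to gate any downstream step from a script or orchestrator. A minimal sketch (the wrapper and the ingest/alert names are mine; only the pipedog command line comes from these docs):

```python
import subprocess

def checks_passed(cmd: list[str]) -> bool:
    """Run a command and report success purely from its exit code."""
    return subprocess.run(cmd).returncode == 0

# Gate ingestion on the scan result, e.g.:
# if checks_passed(["pipedog", "scan", "orders_apr.csv", "--profile", "orders"]):
#     ingest()
# else:
#     alert()
```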

GitHub Actions

yaml
# .github/workflows/data-quality.yml
name: Data Quality

on:
  push:
    paths:
      - 'data/**'

jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install Pipedog
        run: pip install pipedog

      - name: Run quality scan
        run: pipedog scan data/orders_latest.csv --profile orders

Apache Airflow

python
from airflow.operators.bash import BashOperator

quality_check = BashOperator(
    task_id="pipedog_scan",
    bash_command="pipedog scan {{ ds }}_orders.csv --profile orders",
)

Makefile

makefile
ingest: quality
	@echo "Quality passed — ingesting..."

quality:
	pipedog scan $(FILE) --profile $(PROFILE)

File Formats

Pipedog auto-detects file format from the extension. All three formats are treated identically after loading — the same checks apply regardless of format.
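Extension-based dispatch of this kind amounts to a small lookup table. A sketch of the idea (detect_format and its error behaviour are illustrative, not Pipedog internals):

```python
from pathlib import Path

FORMATS = {".csv": "csv", ".parquet": "parquet", ".json": "json"}

def detect_format(path: str) -> str:
    """Pick a loader from the file extension, case-insensitively."""
    suffix = Path(path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"unsupported extension: {suffix!r}")
    return FORMATS[suffix]
```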

.csv (CSV): Comma-separated values; the most common format.

.parquet (Parquet): Apache Parquet columnar format. Requires pyarrow (installed automatically).

.json (JSON): A JSON array of records or newline-delimited JSON; both variants are supported.
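The two JSON variants can be told apart by the first non-whitespace character: an array opens with a bracket, while NDJSON starts an object on each line. A sketch of how a loader might handle both (load_json_records is my helper, not Pipedog's):

```python
import json

def load_json_records(text: str) -> list[dict]:
    """Load either a JSON array of records or newline-delimited JSON."""
    stripped = text.lstrip()
    if stripped.startswith("["):
        return json.loads(stripped)          # array of records
    # NDJSON: one JSON object per non-empty line
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]
```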

Something missing? Open an issue on GitHub →

v0.2.1