Documentation

Pipedog is a lightweight CLI for data quality checks and schema drift detection on flat files. This page covers everything from installation to CI/CD integration.

Quick Start

Install Pipedog from PyPI:

bash
pip install pipedog

Verify the installation:

bash
pipedog --version

Initialise a baseline from one or more historical files:

bash
pipedog init orders_jan.csv orders_feb.csv --profile orders

Scan a new file against the baseline:

bash
pipedog scan orders_mar.csv --profile orders

Tip: Pipedog stores baselines in a .pipedog/ directory relative to where you run the command. Add it to .gitignore or commit it — your choice.

Commands

Pipedog exposes five top-level commands. Run pipedog --help or pipedog <command> --help for full flag details.

init

Reads one or more files, merges them into a single baseline, and saves a schema snapshot plus quality checks to .pipedog/<profile>/.

bash
pipedog init <file1> [file2 ...] [--profile <name>]

# Examples
pipedog init data.csv
pipedog init jan.csv feb.csv mar.csv --profile monthly_sales
pipedog init warehouse.parquet --profile wh
Flag        Default    Description
--profile   default    Named profile for this dataset
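The docs don't spell out how the merge treats numeric columns, but given that the auto-generated min_value/max_value checks lock the observed extremes, a natural reading is that ranges widen to cover every input file. A sketch under that assumption (merge_ranges is an illustrative name, not a Pipedog API):

```python
def merge_ranges(ranges: list[tuple[float, float]]) -> tuple[float, float]:
    """Combine per-file (min, max) pairs into one baseline range."""
    lows, highs = zip(*ranges)
    return (min(lows), max(highs))

# Three monthly files produce one range covering all of them:
# merge_ranges([(5.0, 90.0), (3.5, 87.0), (4.0, 99.5)]) -> (3.5, 99.5)
```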

scan

Reads a single file and runs all checks from the stored baseline. Saves an HTML report to .pipedog/<profile>/reports/ and appends an entry to history.json. Exits with 1 if any error-severity check fails.

bash
pipedog scan <file> [--profile <name>]

# Examples
pipedog scan orders_apr.csv --profile orders
pipedog scan new_data.parquet
Flag        Default    Description
--profile   default    Named profile to compare against

profile

Displays a rich summary of the stored baseline for a profile — column types, sample values, null rates, numeric ranges, and more.

bash
pipedog profile [--profile <name>]

# Example
pipedog profile --profile orders

checks

Manage the quality checks stored in the baseline. Three subcommands:

bash
# List all checks for a profile
pipedog checks list [--profile <name>]

# Open checks file in $EDITOR for manual edits
pipedog checks edit [--profile <name>]

# Add a custom check interactively
pipedog checks add [--profile <name>]

report

List saved HTML reports for a profile and optionally open one in the default browser.

bash
# List reports
pipedog report [--profile <name>]

# Open the most recent report
pipedog report --open [--profile <name>]

Multi-Profile Workflow

Use --profile to track different datasets or pipelines independently. Each profile stores its own baseline, checks, and report history.

bash
# Initialise two separate profiles
pipedog init sales_jan.csv sales_feb.csv --profile sales
pipedog init events_jan.csv --profile events

# Scan each independently
pipedog scan sales_mar.csv --profile sales
pipedog scan events_feb.csv --profile events

# Inspect each baseline
pipedog profile --profile sales
pipedog profile --profile events

Profiles are stored as subdirectories inside .pipedog/:

text
.pipedog/
├── sales/
│   ├── schema.json
│   ├── checks.json
│   ├── history.json
│   └── reports/
│       └── scan_2026-03-29_143012.html
└── events/
    ├── schema.json
    ├── checks.json
    └── reports/

Auto-Generated Checks

Pipedog infers which checks to generate by analysing your baseline data, so you never have to write rules by hand. You can still edit or extend the generated checks with pipedog checks edit.

Check            When Generated                                        Severity
not_null         Column had zero nulls at init                         error
null_rate        Column had some nulls; threshold = baseline % + 10pp  warning
min_value        Numeric column; locks observed minimum                error
max_value        Numeric column; locks observed maximum                error
unique           Every value was distinct (key column detection)       error
allowed_values   String/boolean column with ≤ 50 distinct values       error
std_dev_change   Numeric column; flags distribution shift > 50%        warning
row_count        Every file; threshold = 80% of baseline row count     error

Checks are stored as JSON in .pipedog/<profile>/checks.json. You can add, remove, or modify thresholds directly in that file or via pipedog checks edit.
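Two of the auto-generated thresholds are simple arithmetic on baseline statistics, which is worth keeping in mind when tuning them by hand. For concreteness (these helpers are illustrative, not Pipedog's API):

```python
def null_rate_threshold(baseline_rate: float) -> float:
    """null_rate: warn when the null rate exceeds baseline + 10 percentage points."""
    return baseline_rate + 0.10

def row_count_threshold(baseline_rows: int) -> int:
    """row_count: fail when a file has fewer than 80% of the baseline rows."""
    return int(baseline_rows * 0.8)

# A column that was 3% null at init tolerates up to 13% nulls;
# a 10,000-row baseline requires at least 8,000 rows per scan.
```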

CI/CD Integration

Pipedog exits with code 0 when all checks pass and 1 when any error-severity check fails. This makes it a natural fit for any CI/CD pipeline.
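Because the contract is just the exit code, a thin wrapper is enough to gate any downstream step from a script or orchestrator. A minimal sketch (the wrapper and the ingest/alert names are mine; only the pipedog command line comes from these docs):

```python
import subprocess

def checks_passed(cmd: list[str]) -> bool:
    """Run a command and report success purely from its exit code."""
    return subprocess.run(cmd).returncode == 0

# Gate ingestion on the scan result, e.g.:
# if checks_passed(["pipedog", "scan", "orders_apr.csv", "--profile", "orders"]):
#     ingest()
# else:
#     alert()
```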

GitHub Actions

yaml
# .github/workflows/data-quality.yml
name: Data Quality

on:
  push:
    paths:
      - 'data/**'

jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install Pipedog
        run: pip install pipedog

      - name: Run quality scan
        run: pipedog scan data/orders_latest.csv --profile orders

Apache Airflow

python
from airflow.operators.bash import BashOperator

quality_check = BashOperator(
    task_id="pipedog_scan",
    bash_command="pipedog scan {{ ds }}_orders.csv --profile orders",
)

Makefile

makefile
ingest: quality
	@echo "Quality passed — ingesting..."

quality:
	pipedog scan $(FILE) --profile $(PROFILE)

File Formats

Pipedog auto-detects file format from the extension. All three formats are treated identically after loading — the same checks apply regardless of format.
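Extension-based dispatch of this kind amounts to a small lookup table. A sketch of the idea (detect_format and its error behaviour are illustrative, not Pipedog internals):

```python
from pathlib import Path

FORMATS = {".csv": "csv", ".parquet": "parquet", ".json": "json"}

def detect_format(path: str) -> str:
    """Pick a loader from the file extension, case-insensitively."""
    suffix = Path(path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"unsupported extension: {suffix!r}")
    return FORMATS[suffix]
```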

.csv (CSV): Comma-separated values; the most common format.

.parquet (Parquet): Apache Parquet columnar format. Requires pyarrow (installed automatically).

.json (JSON): A JSON array of records or newline-delimited JSON; both variants are supported.
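The two JSON variants can be told apart by the first non-whitespace character: an array opens with a bracket, while NDJSON starts an object on each line. A sketch of how a loader might handle both (load_json_records is my helper, not Pipedog's):

```python
import json

def load_json_records(text: str) -> list[dict]:
    """Load either a JSON array of records or newline-delimited JSON."""
    stripped = text.lstrip()
    if stripped.startswith("["):
        return json.loads(stripped)          # array of records
    # NDJSON: one JSON object per non-empty line
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]
```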

Something missing? Open an issue on GitHub →

v0.2.1