Sample Corpus
samples/ contains 150 Python generators producing synthetic fund documents across 8 fund families and 47 standalone files.

Fund Families

| Family | Strategy | Documents |
|---|---|---|
| Apex Capital | PE buyout | 14 |
| Greenfield Ventures | VC | 12 |
| Cornerstone RE | Real estate | 19 |
| European Growth | Luxembourg PE | 10 |
| Pacific Credit | Credit/lending | 14 |
| Meridian Secondaries | GP-led continuation | 10 |
| Catalyst Infrastructure | UK infrastructure | 14 |
| Summit Macro | Hedge fund | 10 |
| Standalone files | Cross-cutting | 47 |
Each family is a coherent set of related documents, enabling cross-document skills: compare, reconcile, multi-doc-analyze, mfn-tracker, and playbook.
Intentional Issues
Every document embeds intentional issues for testing detection accuracy:
- Mathematical errors (carried interest calculations, distribution waterfalls)
- Compliance gaps (missing disclosures, regulatory deadline errors)
- Cross-document conflicts (side letter terms inconsistent with LPA)
- Wire fraud signals (suspicious wire instruction changes)
The sample LPA (generate_sample_lpa.py) contains 10 specific issues. See LPA Safety Score for the full list.
Generating the Corpus
cd fundadmin-ai
# Generate all 150 documents
python3 samples/generate_all.py
# Generate one fund family only
python3 samples/generate_all.py --family apex
# Compile-check only (no output files)
python3 samples/generate_all.py --check
# List all generators
python3 samples/generate_all.py --listGenerated output goes to samples/output/ (gitignored — reproducible from scripts).
Individual Generators
# Single document types (each with intentional issues)
python3 generate_sample_lpa.py # LPA — 10 issues
python3 generate_sample_ppm.py # PPM — 7 issues
python3 generate_sample_side_letter.py # Side letter — 3 issues
python3 generate_sample_subscription.py # Subscription — 5 issuesShared Utilities
samples/generators/common.py provides:
- Document styles and setup
- Fund family constants
- Investor profiles
- Standard issue injection helpers
Classifier Accuracy
The content classifier is regression-tested to maintain ≥ 97% accuracy across all 150 PDFs:
node tests/audit-classifier.mjs
python3 -m unittest tests/test_classifier_accuracy.py -vCatalog
See samples/SAMPLE-CATALOG.md for the full index with a list of embedded issues per document.