Company data backup and restore

This guide covers how to export and import complete company datasets using the management commands. These tools are designed for:

  • Creating full backups of company data
  • Migrating companies between environments (dev → staging → production)
  • Testing with production-like data in lower environments (potentially with anonymization)

Overview

The backup system consists of two management commands:

  • export_company_data - Exports all data for a company to JSON files and uploads to S3
  • import_company_data - Downloads and imports company data from S3 backups

Both commands read from and write to S3-compatible storage.

Exporting company data

The export command creates a complete snapshot of a company's data, including infrastructure, emissions, events, users, and configuration.

Basic export usage

bash
python manage.py export_company_data --owner "Company Name"

Export command options

| Option | Description | Required | Default |
| --- | --- | --- | --- |
| --owner | Company name (must be exact match) | Yes | - |
| --bucket-name | S3 bucket name for upload | No | From AWS_STORAGE_BUCKET_NAME env var |
| --access-key-id | S3 access key | No | From AWS_S3_ACCESS_KEY_ID env var |
| --secret-access-key | S3 secret key | No | From AWS_S3_SECRET_ACCESS_KEY env var |
| --endpoint-url | S3 endpoint URL | No | From AWS_S3_ENDPOINT_URL env var |
| --batch-size | Records per batch for memory-efficient processing | No | 1000 |

S3 credentials

The export command requires S3 credentials, which can be provided via:

  1. Command-line arguments (highest priority)
  2. Environment variables:
    • AWS_STORAGE_BUCKET_NAME
    • AWS_S3_ACCESS_KEY_ID
    • AWS_S3_SECRET_ACCESS_KEY
    • AWS_S3_ENDPOINT_URL
  3. Django settings (lowest priority)
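
For reference, this precedence can be expressed as a small helper. The sketch below is illustrative only (the option and setting names mirror the table above; the helper itself is not part of the command):

python
# Illustrative sketch of the credential resolution order (CLI > env > settings).
# Not part of the management command itself.
import os

from django.conf import settings


def resolve_s3_setting(cli_value, env_var, settings_name):
    """Return the first non-empty value: CLI option, environment variable, Django setting."""
    if cli_value:
        return cli_value
    env_value = os.environ.get(env_var)
    if env_value:
        return env_value
    return getattr(settings, settings_name, None)


# Example (inside a management command's handle(), where options holds parsed args):
# bucket = resolve_s3_setting(options.get("bucket_name"), "AWS_STORAGE_BUCKET_NAME", "AWS_STORAGE_BUCKET_NAME")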

What gets exported

The export includes all related data for the company:

  • Infrastructure: Sites, equipment, hierarchies, aerial images, pipeline systems
  • Emissions: Data batches, data points, emission records, scene observations
  • Events: Events, root causes, action plans, event associations
  • Configuration: Notification settings, matching configurations, waffle switches
  • Users: Company users, memberships, permissions (excluding passwords)
  • Analytics: Analytics data snapshots
  • Historical records: Complete audit trail from django-simple-history

Export output

Files are uploaded to S3 in the following structure:

s3://bucket-name/company-name/YYYY-MM-DD/
  ├── manifest.json                              # Export metadata
  ├── accounts_Company.json
  ├── accounts_User.json
  ├── infrastructure_Site.json
  ├── emissions_EmissionRecord.json
  └── ... (one file per model)
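
To check which backup dates exist for a company, a quick one-off script against this layout works; this is a sketch that assumes boto3 is available and the bucket structure shown above:

python
# Sketch: list available backup dates for a company (assumes the layout above).
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="...",       # AWS_S3_ACCESS_KEY_ID
    aws_secret_access_key="...",   # AWS_S3_SECRET_ACCESS_KEY
    endpoint_url="https://...",    # AWS_S3_ENDPOINT_URL (if used)
)

response = s3.list_objects_v2(Bucket="bucket-name", Prefix="company-name/", Delimiter="/")
# Each CommonPrefixes entry is one backup "folder", e.g. company-name/2026-01-23/
for entry in response.get("CommonPrefixes", []):
    print(entry["Prefix"])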

Memory-efficient processing

The export uses streaming writes to handle large datasets without memory issues:

  • Processes records in batches (default 1000)
  • Writes directly to files without accumulating in memory
  • Preserves created_at and updated_at timestamps
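
Conceptually, the batched write loop looks like the sketch below; this is not the command's exact code, but it illustrates the iterator-plus-batch pattern described above:

python
# Conceptual sketch of batched, streaming serialization (not the command's exact code).
from django.core import serializers


def export_queryset(queryset, output_path, batch_size=1000):
    """Write a queryset to a JSON file batch by batch, never holding all rows in memory."""
    batch = []
    first = True
    with open(output_path, "w") as handle:
        handle.write("[")
        for obj in queryset.iterator(chunk_size=batch_size):
            batch.append(obj)
            if len(batch) >= batch_size:
                first = _write_batch(handle, batch, first)
                batch = []
        if batch:
            first = _write_batch(handle, batch, first)
        handle.write("]")


def _write_batch(handle, batch, first):
    # serializers.serialize returns a JSON array string; strip the surrounding
    # brackets so consecutive batches concatenate into one valid JSON array.
    payload = serializers.serialize("json", batch)[1:-1]
    if not first:
        handle.write(",")
    handle.write(payload)
    return False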

Importing company data

The import command downloads backups from S3 and restores them to the database. This is useful for migrating data between environments or restoring from backups.

Basic import usage

bash
python manage.py import_company_data \
  --owner "Company Name" \
  --backup-date "2026-01-23"

Import command options

| Option | Description | Required | Default |
| --- | --- | --- | --- |
| --owner | Company name from export | Yes | - |
| --backup-date | Date of backup (YYYY-MM-DD) | Yes | - |
| --models | Specific models to import (space-separated) | No | All models |
| --dry-run | Preview import without making changes | No | false |
| --batch-size | Records per batch | No | 1000 |
| --disable-constraints | Disable FK constraints during import (faster) | No | false |
| --reindex-presets | Rebuild geo filter indexes after import | No | false |
| --disable-notifications | Skip importing notification settings | No | false |
| --src-data-bucket-name | Source S3 bucket name | Yes | From AWS_SRC_DATA_BUCKET_NAME env var |
| --src-data-access-key-id | Source S3 access key | Yes | From AWS_SRC_DATA_S3_ACCESS_KEY_ID env var |
| --src-data-secret-access-key | Source S3 secret key | Yes | From AWS_SRC_DATA_S3_SECRET_ACCESS_KEY env var |
| --src-endpoint-url | Source S3 endpoint URL | No | From AWS_SRC_DATA_ENDPOINT_URL env var |

S3 credentials

The import command requires two sets of S3 credentials:

  1. Source credentials (AWS_SRC_DATA_*) - To download backup files from the source S3 location
  2. Target credentials (AWS_*) - For the target environment's S3 storage

This dual-credential setup allows importing data from a different S3 location (e.g., production backups) into another environment (e.g., staging) that has its own S3 storage.

Source credentials can be provided via:

  1. Command-line arguments (highest priority):

    • --src-data-bucket-name
    • --src-data-access-key-id
    • --src-data-secret-access-key
    • --src-endpoint-url
  2. Environment variables (used when the corresponding arguments are not provided):

    • AWS_SRC_DATA_BUCKET_NAME
    • AWS_SRC_DATA_S3_ACCESS_KEY_ID
    • AWS_SRC_DATA_S3_SECRET_ACCESS_KEY
    • AWS_SRC_DATA_ENDPOINT_URL

Target credentials are read from environment variables or Django settings:

  • AWS_STORAGE_BUCKET_NAME
  • AWS_S3_ACCESS_KEY_ID
  • AWS_S3_SECRET_ACCESS_KEY
  • AWS_S3_ENDPOINT_URL

Import workflow

The import process follows these steps:

  1. Download - Downloads JSON files from S3 to temporary directory
  2. Validate - Checks manifest and verifies files exist
  3. Import - Restores data in dependency order (foreign keys respected)
  4. Timestamps - Restores original created_at/updated_at values
  5. Sequences - Resets PostgreSQL sequences for auto-increment fields
  6. Cleanup - Removes temporary files
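
Step 3 boils down to a deserialize-and-save loop over each model file; a minimal sketch (not the command's exact code):

python
# Minimal sketch of the deserialize-and-save loop for one model file
# (not the command's exact code).
from django.core import serializers
from django.db import transaction


def import_model_file(json_path, batch_size=1000):
    saved = 0
    with open(json_path) as handle, transaction.atomic():
        for deserialized in serializers.deserialize("json", handle):
            deserialized.save()
            saved += 1
            if saved % batch_size == 0:
                print(f"  ... {saved} records imported")
    return saved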

Selective imports

You can import specific models using the --models option:

bash
python manage.py import_company_data \
  --owner "Company Name" \
  --backup-date "2026-01-23" \
  --models infrastructure.Site infrastructure.Equipment

This is useful for:

  • Importing only infrastructure without emissions data
  • Updating specific datasets without touching others
  • Testing imports of problematic models

Preserving notification settings

When migrating data between environments, you may want to hold off on importing notification settings so the target environment does not start sending notification emails prematurely. Use the --disable-notifications flag:

bash
python manage.py import_company_data \
  --owner "Company Name" \
  --backup-date "2026-01-23" \
  --disable-notifications

This skips importing EmissionNotificationSettings, so make sure to import that model later when needed via the --models argument.

Performance optimization

For large imports, these options can significantly improve performance:

bash
python manage.py import_company_data \
  --owner "Company Name" \
  --backup-date "2026-01-23" \
  --disable-constraints \
  --batch-size 2000

⚠️ Warning: Disabling constraints can lead to inconsistent data if the import fails partway through. Only use in controlled environments.

Dry run mode

Always test imports with --dry-run first to preview what would be imported:

bash
python manage.py import_company_data \
  --owner "Company Name" \
  --backup-date "2026-01-23" \
  --dry-run

This shows which files would be processed without making any database changes.

Complete import checklist

This checklist should generally be followed when performing a company import.

Pre-import tasks

⚠️ Critical: Complete ALL pre-import tasks before starting the import

  1. Scale down scheduler tasks

    bash
    # Set scheduler container task number to 0 to prevent scheduled tasks from running
  2. Clear S3 bucket for CVX

    • Delete existing plumes, aerial images, data_downloads in the target environment's S3 bucket
    • The target db should be empty, so these files are orphans anyway
  3. Increase container resources

    • Increase the size (CPU/memory) of the container the import will run on (typically the long-running container)
    • Import is memory-intensive and requires additional resources
  4. Drop and recreate database

    bash
    # On target environment
    # Drop and recreate db or relevant schemas
  5. Run migrations

    bash
    python manage.py migrate
  6. Set credentials

    bash
    # Source credentials (where backup files are stored)
    export AWS_SRC_DATA_BUCKET_NAME="production-backup-bucket"
    export AWS_SRC_DATA_S3_ACCESS_KEY_ID="prod-access-key"
    export AWS_SRC_DATA_S3_SECRET_ACCESS_KEY="prod-secret-key"
    
    # Verify target credentials are set (for current environment's S3)
    echo $AWS_STORAGE_BUCKET_NAME
    echo $AWS_S3_ACCESS_KEY_ID

Run import

Execute the import in two phases:

Phase 1: Import database records

bash
python manage.py import_company_data \
  --owner "{company_name}" \
  --backup-date "{YYYY-MM-DD}" \
  --disable-notifications \
  --reindex-presets

Phase 2: Copy S3 files

bash
python manage.py import_company_data \
  --owner "{company_name}" \
  --backup-date "{YYYY-MM-DD}" \
  --mode copy-files

Note: The copy-files mode transfers the actual files (images, documents) from the source S3 bucket to the target S3 bucket. This can take considerable time for large datasets, though the time required is offset by queuing the copy jobs on the dataimport container.

Post-import tasks

Complete these tasks immediately after import finishes:

  1. Upload user guide

    • Upload company-specific user guide documentation
    • Update any environment-specific links or instructions
  2. Delete provider-specific data (if applicable for target environment)

    bash
    # Delete Bridger and GHGSat data if not needed in non-production environments
    python manage.py shell
    >>> # Import paths for DataPoint, PlumeImage, Scene and SiteNonDetect are assumed; adjust to your project layout
    >>> from emissions.models import DataBatch, DataPoint, EmissionRecord, PlumeImage, Scene, SiteNonDetect
    >>> from event_management.models import Event
    >>> batch_ids = DataBatch.objects.filter(data_provider__name__in=["Bridger", "GHGSat"]).values_list("pk", flat=True)
    >>> Event.objects.filter(main_emission_record__data_point__data_batch_id__in=batch_ids).delete()
    >>> EmissionRecord.objects.filter(data_point__data_batch_id__in=batch_ids).delete()
    >>> PlumeImage.objects.filter(data_batch_id__in=batch_ids).delete()
    >>> DataPoint.objects.filter(data_batch_id__in=batch_ids).delete()
    >>> DataBatch.objects.filter(pk__in=batch_ids).delete()
    >>> Scene.objects.filter(data_provider__name__in=["Bridger", "GHGSat"]).delete()
    >>> SiteNonDetect.objects.filter(data_provider__name__in=["Bridger", "GHGSat"]).delete()
  3. Create SSO setup for Aerscape

    • Configure SSO settings in Django Admin
    • Add Aerscape email domain to SSO configuration
  4. Resize container to normal size

    • Return container resources to standard allocation
    • Remove the temporary resource increase from pre-import step

Post-import tasks (to complete later)

These tasks should be completed after verifying the import was successful:

  1. Enable notifications (when ready to start sending emails)

    bash
    python manage.py import_company_data \
      --owner "{company_name}" \
      --backup-date "{YYYY-MM-DD}" \
      --models emissions.EmissionNotificationSettings

    Important: Only import notification settings after verifying the environment is properly configured to send emails. This prevents accidentally spamming users during testing.

  2. Restore scheduler tasks

    bash
    # Set the scheduler/Celery worker replica count back to 1
    # This re-enables automated background tasks

Import verification checklist

After completing the import, verify the following:

  • [ ] Users can log in successfully
  • [ ] Infrastructure (sites, equipment) displays correctly on maps
  • [ ] Emission records are visible and properly matched
  • [ ] Events show correct status and associations
  • [ ] File uploads (aerial images, documents) are accessible
  • [ ] Geo filters and presets work correctly
  • [ ] No duplicate records or data inconsistencies
  • [ ] Celery tasks remain disabled until verification complete
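
A quick record-count comparison against the source environment can back up the manual checks; a sketch for the Django shell (model import paths follow those used elsewhere in this guide and may need adjusting):

python
# Sanity-check record counts after import; compare the numbers with the source
# environment. Adjust the import paths if your project layout differs.
from accounts.models import Company, User
from emissions.models import DataBatch, EmissionRecord
from event_management.models import Event
from infrastructure.models import Site

print("Company present:", Company.objects.filter(name="Acme Corp").exists())
print("Users:", User.objects.count())
print("Sites:", Site.objects.count())
print("Emission records:", EmissionRecord.objects.count())
print("Events:", Event.objects.count())
print("Data batches:", DataBatch.objects.count())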

Common workflows

Full company migration (production → staging)

Export from source environment:

bash
# On production database
python manage.py export_company_data --owner "Acme Corp"

Import to target environment (requires both credential sets):

bash
# On staging database - set source credentials to point to production S3
export AWS_SRC_DATA_BUCKET_NAME="production-backup-bucket"
export AWS_SRC_DATA_S3_ACCESS_KEY_ID="prod-access-key"
export AWS_SRC_DATA_S3_SECRET_ACCESS_KEY="prod-secret-key"

# Target credentials (AWS_STORAGE_BUCKET_NAME, etc.) should already be set for staging environment

python manage.py import_company_data \
  --owner "Acme Corp" \
  --backup-date "2026-01-23" \
  --disable-notifications

Infrastructure-only import

bash
# Import only sites and equipment
python manage.py import_company_data \
  --owner "Acme Corp" \
  --backup-date "2026-01-23" \
  --models infrastructure.Site infrastructure.Equipment \
  --dry-run

Technical details

Memory efficiency

Both commands use streaming I/O to handle millions of records with minimal memory:

  • Export: Writes records to JSON files in batches without accumulating
  • Import: Deserializes records one at a time using Django's streaming API
  • Typical memory usage: ~100-200MB regardless of dataset size

Timestamp preservation

Django's auto_now and auto_now_add fields are normally excluded from serialization. These commands preserve them:

  • Export: Manually extracts timestamps after serialization
  • Import: Uses raw SQL with CASE statements to restore timestamps in batches

This ensures imported records maintain their original creation/modification times.
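
In SQL terms, the batched restore is roughly equivalent to the sketch below; the table and column names are illustrative, not the command's actual statements:

python
# Rough sketch of restoring created_at for a batch of rows with one CASE update.
# Table/column names are illustrative; timestamp values should be datetime objects.
from django.db import connection


def restore_created_at(table, id_to_timestamp):
    if not id_to_timestamp:
        return
    cases = " ".join("WHEN %s THEN %s" for _ in id_to_timestamp)
    in_placeholders = ", ".join("%s" for _ in id_to_timestamp)
    sql = (
        f"UPDATE {table} SET created_at = CASE id {cases} END "
        f"WHERE id IN ({in_placeholders})"
    )
    params = [value for pk, ts in id_to_timestamp.items() for value in (pk, ts)]
    params += list(id_to_timestamp)
    with connection.cursor() as cursor:
        cursor.execute(sql, params)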

Sequence reset

PostgreSQL sequences for auto-increment primary keys are automatically reset after import to prevent ID conflicts:

  • Only resets sequences for models that were imported
  • Sets sequence to MAX(id) value to avoid collisions
  • Handles both regular AutoField and historical model history_id fields
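
The reset is equivalent to calling setval() with the table's current MAX(id); a sketch using Django's database connection (the helper name is illustrative):

python
# Sketch: reset a table's primary-key sequence to MAX(id) so new inserts do not
# collide with imported rows. The helper name is illustrative.
from django.db import connection


def reset_sequence(table, pk_column="id"):
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT setval(pg_get_serial_sequence(%s, %s), "
            f"COALESCE((SELECT MAX({pk_column}) FROM {table}), 1))",
            [table, pk_column],
        )


# Example: reset_sequence("accounts_company")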

Import order

Models are imported in a specific order to respect foreign key dependencies. The order is defined in the MODEL_IMPORT_ORDER constant in the import command.

Signal handling

Django signals are temporarily disabled during import to prevent:

  • Automatic creation of related objects (e.g., notification settings)
  • Triggering workflows or notifications
  • Side effects from model save() methods
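
A common way to achieve this is a context manager that disconnects the relevant receivers for the duration of the import; a hedged sketch (the command may do this differently, and the receiver/sender names in the usage comment are hypothetical):

python
# Sketch: temporarily disconnect a post_save receiver while importing.
# The actual command may implement signal suppression differently.
from contextlib import contextmanager

from django.db.models.signals import post_save


@contextmanager
def signal_disabled(receiver, sender):
    post_save.disconnect(receiver, sender=sender)
    try:
        yield
    finally:
        post_save.connect(receiver, sender=sender)


# Usage (receiver and sender names are hypothetical):
# with signal_disabled(create_notification_settings, sender=Company):
#     import_model_file("accounts_Company.json")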

Troubleshooting

Export process killed

If an export fails with a "Killed" message, the process likely ran out of memory. Reduce the batch size or increase the container size:

bash
python manage.py export_company_data \
  --owner "Company Name" \
  --batch-size 500

Import foreign key errors

If imports fail with foreign key constraint violations, try:

bash
python manage.py import_company_data \
  --owner "Company Name" \
  --backup-date "2026-01-23" \
  --disable-constraints

Sequence conflicts after import

If you see "duplicate key value violates unique constraint" errors after import, sequences weren't reset properly. Manually reset:

bash
python manage.py shell
>>> from django.db import connection
>>> cursor = connection.cursor()
>>> cursor.execute("SELECT setval('accounts_company_id_seq', (SELECT MAX(id) FROM accounts_company))")

S3 connection issues

Verify credentials are set correctly.

For export:

bash
echo $AWS_STORAGE_BUCKET_NAME
echo $AWS_S3_ACCESS_KEY_ID
echo $AWS_S3_SECRET_ACCESS_KEY

For import (requires BOTH sets):

bash
# Source credentials (to read backup files)
echo $AWS_SRC_DATA_BUCKET_NAME
echo $AWS_SRC_DATA_S3_ACCESS_KEY_ID
echo $AWS_SRC_DATA_S3_SECRET_ACCESS_KEY

# Target credentials (for current environment's S3)
echo $AWS_STORAGE_BUCKET_NAME
echo $AWS_S3_ACCESS_KEY_ID
echo $AWS_S3_SECRET_ACCESS_KEY

Source credentials can also be provided via command-line arguments.
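
If the variables are set but connections still fail, a quick standalone check isolates whether the credentials and endpoint work at all; a sketch using boto3 (shown for the source credentials, swap in the AWS_* variables to check the target set):

python
# One-off connectivity check for the source S3 credentials; swap in the AWS_*
# variables to check the target set instead.
import os

import boto3
from botocore.exceptions import ClientError

client = boto3.client(
    "s3",
    aws_access_key_id=os.environ["AWS_SRC_DATA_S3_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SRC_DATA_S3_SECRET_ACCESS_KEY"],
    endpoint_url=os.environ.get("AWS_SRC_DATA_ENDPOINT_URL"),
)
try:
    client.head_bucket(Bucket=os.environ["AWS_SRC_DATA_BUCKET_NAME"])
    print("Source bucket reachable")
except ClientError as exc:
    print("Source bucket check failed:", exc)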

Best practices

  1. Always use --dry-run first when importing to new environments
  2. Export regularly - Automate exports with cron or scheduled tasks
  3. Test restores periodically - Verify backups can be restored successfully
  4. Use --disable-notifications when importing to non-production environments
  5. Monitor S3 storage - Old backups can accumulate; implement retention policies
  6. Document backup dates - Keep a log of when backups were created and why
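
For point 2, regular exports can be scheduled with a small wrapper task; a hedged sketch using Celery (the task name is illustrative, and the schedule itself would live in your existing Celery beat or cron configuration):

python
# Hedged sketch: a periodic task that runs the export command for one company.
# The task name is illustrative; wire it into your Celery beat or cron setup.
from celery import shared_task
from django.core.management import call_command


@shared_task
def scheduled_company_export(company_name):
    # Equivalent to: python manage.py export_company_data --owner "<company_name>"
    call_command("export_company_data", owner=company_name)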

Security considerations

  • User passwords are excluded from exports
  • Sensitive fields like API keys should be reviewed before cross-environment imports
  • S3 buckets should use appropriate IAM policies to restrict access
  • Consider encrypting S3 buckets for sensitive company data