Company data backup and restore

This guide covers how to export and import complete company datasets using the management commands. These tools are designed for:

  • Creating full backups of company data
  • Migrating companies between environments (dev → staging → production)
  • Testing with production-like data in lower environments (potentially with anonymization)

Overview

The backup system consists of two management commands:

  • export_company_data - Exports all data for a company to JSON files and uploads to S3
  • import_company_data - Downloads and imports company data from S3 backups

Both commands read from and write to S3-compatible storage.

Exporting company data

The export command creates a complete snapshot of a company's data, including infrastructure, emissions, events, users, and configuration.

Basic export usage

bash
python manage.py export_company_data --owner "Company Name"

Export command options

| Option | Description | Required | Default |
| --- | --- | --- | --- |
| --owner | Company name (must be exact match) | Yes | - |
| --bucket-name | S3 bucket name for upload | No | From AWS_STORAGE_BUCKET_NAME env var |
| --access-key-id | S3 access key | No | From AWS_S3_ACCESS_KEY_ID env var |
| --secret-access-key | S3 secret key | No | From AWS_S3_SECRET_ACCESS_KEY env var |
| --endpoint-url | S3 endpoint URL | No | From AWS_S3_ENDPOINT_URL env var |
| --batch-size | Records per batch for memory-efficient processing | No | 1000 |

S3 credentials

The export command requires S3 credentials, which can be provided via:

  1. Command-line arguments (highest priority)
  2. Environment variables:
    • AWS_STORAGE_BUCKET_NAME
    • AWS_S3_ACCESS_KEY_ID
    • AWS_S3_SECRET_ACCESS_KEY
    • AWS_S3_ENDPOINT_URL
  3. Django settings (lowest priority)
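
For reference, this precedence can be expressed as a small helper. The sketch below is illustrative only (the option and setting names mirror the table above; the helper itself is not part of the command):

python
# Illustrative sketch of the credential resolution order (CLI > env > settings).
# Not part of the management command itself.
import os

from django.conf import settings


def resolve_s3_setting(cli_value, env_var, settings_name):
    """Return the first non-empty value: CLI option, environment variable, Django setting."""
    if cli_value:
        return cli_value
    env_value = os.environ.get(env_var)
    if env_value:
        return env_value
    return getattr(settings, settings_name, None)


# Example (inside a management command's handle(), where options holds parsed args):
# bucket = resolve_s3_setting(options.get("bucket_name"), "AWS_STORAGE_BUCKET_NAME", "AWS_STORAGE_BUCKET_NAME")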

What gets exported

The export includes all related data for the company:

  • Infrastructure: Sites, equipment, hierarchies, aerial images, pipeline systems
  • Emissions: Data batches, data points, emission records, scene observations
  • Events: Events, root causes, action plans, event associations
  • Configuration: Notification settings, matching configurations, waffle switches
  • Users: Company users, memberships, permissions (excluding passwords)
  • Analytics: Analytics data snapshots
  • Historical records: Complete audit trail from django-simple-history

Export output

Files are uploaded to S3 in the following structure:

s3://bucket-name/company-name/YYYY-MM-DD/
  ├── manifest.json                              # Export metadata
  ├── accounts_Company.json
  ├── accounts_User.json
  ├── infrastructure_Site.json
  ├── emissions_EmissionRecord.json
  └── ... (one file per model)
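
To check which backup dates exist for a company, a quick one-off script against this layout works; this is a sketch that assumes boto3 is available and the bucket structure shown above:

python
# Sketch: list available backup dates for a company (assumes the layout above).
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="...",       # AWS_S3_ACCESS_KEY_ID
    aws_secret_access_key="...",   # AWS_S3_SECRET_ACCESS_KEY
    endpoint_url="https://...",    # AWS_S3_ENDPOINT_URL (if used)
)

response = s3.list_objects_v2(Bucket="bucket-name", Prefix="company-name/", Delimiter="/")
# Each CommonPrefixes entry is one backup "folder", e.g. company-name/2026-01-23/
for entry in response.get("CommonPrefixes", []):
    print(entry["Prefix"])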

Memory-efficient processing

The export uses streaming writes to handle large datasets without memory issues:

  • Processes records in batches (default 1000)
  • Writes directly to files without accumulating in memory
  • Preserves created_at and updated_at timestamps
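
Conceptually, the batched write loop looks like the sketch below; this is not the command's exact code, but it illustrates the iterator-plus-batch pattern described above:

python
# Conceptual sketch of batched, streaming serialization (not the command's exact code).
from django.core import serializers


def export_queryset(queryset, output_path, batch_size=1000):
    """Write a queryset to a JSON file batch by batch, never holding all rows in memory."""
    batch = []
    first = True
    with open(output_path, "w") as handle:
        handle.write("[")
        for obj in queryset.iterator(chunk_size=batch_size):
            batch.append(obj)
            if len(batch) >= batch_size:
                first = _write_batch(handle, batch, first)
                batch = []
        if batch:
            first = _write_batch(handle, batch, first)
        handle.write("]")


def _write_batch(handle, batch, first):
    # serializers.serialize returns a JSON array string; strip the surrounding
    # brackets so consecutive batches concatenate into one valid JSON array.
    payload = serializers.serialize("json", batch)[1:-1]
    if not first:
        handle.write(",")
    handle.write(payload)
    return False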

Importing company data

The import command downloads backups from S3 and restores them to the database. This is useful for migrating data between environments or restoring from backups.

Basic import usage

bash
python manage.py import_company_data \
  --owner "Company Name" \
  --backup-date "2026-01-23"

Import command options

| Option | Description | Required | Default |
| --- | --- | --- | --- |
| --owner | Company name from export | Yes | - |
| --backup-date | Date of backup (YYYY-MM-DD) | Yes | - |
| --models | Specific models to import (space-separated) | No | All models |
| --dry-run | Preview import without making changes | No | false |
| --batch-size | Records per batch | No | 1000 |
| --disable-constraints | Disable FK constraints during import (faster) | No | false |
| --reindex-presets | Rebuild geo filter indexes after import | No | false |
| --disable-notifications | Skip importing notification settings | No | false |
| --src-data-bucket-name | Source S3 bucket name | Yes | From AWS_SRC_DATA_BUCKET_NAME env var |
| --src-data-access-key-id | Source S3 access key | Yes | From AWS_SRC_DATA_S3_ACCESS_KEY_ID env var |
| --src-data-secret-access-key | Source S3 secret key | Yes | From AWS_SRC_DATA_S3_SECRET_ACCESS_KEY env var |
| --src-endpoint-url | Source S3 endpoint URL | No | From AWS_SRC_DATA_ENDPOINT_URL env var |

S3 credentials

The import command requires two sets of S3 credentials:

  1. Source credentials (AWS_SRC_DATA_*) - To download backup files from the source S3 location
  2. Target credentials (AWS_*) - For the target environment's S3 storage

This dual-credential setup allows importing data from a different S3 location (e.g., production backups) into another environment (e.g., staging) that has its own S3 storage.

Source credentials can be provided via:

  1. Command-line arguments (highest priority):

    • --src-data-bucket-name
    • --src-data-access-key-id
    • --src-data-secret-access-key
    • --src-endpoint-url
  2. Environment variables (used when the corresponding arguments are not provided):

    • AWS_SRC_DATA_BUCKET_NAME
    • AWS_SRC_DATA_S3_ACCESS_KEY_ID
    • AWS_SRC_DATA_S3_SECRET_ACCESS_KEY
    • AWS_SRC_DATA_ENDPOINT_URL

Target credentials are read from environment variables or Django settings:

  • AWS_STORAGE_BUCKET_NAME
  • AWS_S3_ACCESS_KEY_ID
  • AWS_S3_SECRET_ACCESS_KEY
  • AWS_S3_ENDPOINT_URL

Import workflow

The import process follows these steps:

  1. Download - Downloads JSON files from S3 to temporary directory
  2. Validate - Checks manifest and verifies files exist
  3. Import - Restores data in dependency order (foreign keys respected)
  4. Timestamps - Restores original created_at/updated_at values
  5. Sequences - Resets PostgreSQL sequences for auto-increment fields
  6. Cleanup - Removes temporary files
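
Step 3 boils down to a deserialize-and-save loop over each model file; a minimal sketch (not the command's exact code):

python
# Minimal sketch of the deserialize-and-save loop for one model file
# (not the command's exact code).
from django.core import serializers
from django.db import transaction


def import_model_file(json_path, batch_size=1000):
    saved = 0
    with open(json_path) as handle, transaction.atomic():
        for deserialized in serializers.deserialize("json", handle):
            deserialized.save()
            saved += 1
            if saved % batch_size == 0:
                print(f"  ... {saved} records imported")
    return saved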

Selective imports

You can import specific models using the --models option:

bash
python manage.py import_company_data \
  --owner "Company Name" \
  --backup-date "2026-01-23" \
  --models infrastructure.Site infrastructure.Equipment

This is useful for:

  • Importing only infrastructure without emissions data
  • Updating specific datasets without touching others
  • Testing imports of problematic models

Preserving notification settings

When migrating data between environments, you may want to hold off on importing notification settings so the target environment does not start sending notification emails prematurely. Use the --disable-notifications flag:

bash
python manage.py import_company_data \
  --owner "Company Name" \
  --backup-date "2026-01-23" \
  --disable-notifications

This skips importing EmissionNotificationSettings, so make sure to import that model later when needed via the --models argument.

Performance optimization

For large imports, these options can significantly improve performance:

bash
python manage.py import_company_data \
  --owner "Company Name" \
  --backup-date "2026-01-23" \
  --disable-constraints \
  --batch-size 2000

⚠️ Warning: Disabling constraints can lead to inconsistent data if the import fails partway through. Only use in controlled environments.

Dry run mode

Always test imports with --dry-run first to preview what would be imported:

bash
python manage.py import_company_data \
  --owner "Company Name" \
  --backup-date "2026-01-23" \
  --dry-run

This shows which files would be processed without making any database changes.

Complete import checklist

This checklist should generally be followed when performing a company import.

Pre-import tasks

⚠️ Critical: Complete ALL pre-import tasks before starting the import

  1. Scale down scheduler tasks

    bash
    # Set scheduler container task number to 0 to prevent scheduled tasks from running
  2. Clear S3 bucket for CVX

    • Delete existing plumes, aerial images, data_downloads in the target environment's S3 bucket
    • The target db should be empty, so these files are orphans anyway
  3. Increase container resources

    • Increase the size (CPU/memory) of the container the import will run on (typically the long-running container)
    • Import is memory-intensive and requires additional resources
  4. Drop and recreate database

    bash
    # On target environment
    # Drop and recreate db or relevant schemas
  5. Run migrations

    bash
    python manage.py migrate
  6. Set credentials

    bash
    # Source credentials (where backup files are stored)
    export AWS_SRC_DATA_BUCKET_NAME="production-backup-bucket"
    export AWS_SRC_DATA_S3_ACCESS_KEY_ID="prod-access-key"
    export AWS_SRC_DATA_S3_SECRET_ACCESS_KEY="prod-secret-key"
    
    # Verify target credentials are set (for current environment's S3)
    echo $AWS_STORAGE_BUCKET_NAME
    echo $AWS_S3_ACCESS_KEY_ID

Run import

Execute the import in two phases:

Phase 1: Import database records

bash
python manage.py import_company_data \
  --owner "{company_name}" \
  --backup-date "{YYYY-MM-DD}" \
  --disable-notifications \
  --reindex-presets

Phase 2: Copy S3 files

bash
python manage.py import_company_data \
  --owner "{company_name}" \
  --backup-date "{YYYY-MM-DD}" \
  --mode copy-files

Note: The copy-files mode transfers the actual files (images, documents) from the source S3 bucket to the target S3 bucket. This can take considerable time for large datasets, though the time required is offset by queuing the copy jobs on the dataimport container.

Post-import tasks

Complete these tasks immediately after import finishes:

  1. Upload user guide

    • Upload company-specific user guide documentation
    • Update any environment-specific links or instructions
  2. Delete provider-specific data (if applicable for target environment)

    bash
    # Delete Bridger and GHGSat data if not needed in non-production environments
    python manage.py shell
    >>> # Import paths for DataPoint, PlumeImage, Scene and SiteNonDetect are assumed; adjust to your project layout
    >>> from emissions.models import DataBatch, DataPoint, EmissionRecord, PlumeImage, Scene, SiteNonDetect
    >>> from event_management.models import Event
    >>> batch_ids = DataBatch.objects.filter(data_provider__name__in=["Bridger", "GHGSat"]).values_list("pk", flat=True)
    >>> Event.objects.filter(main_emission_record__data_point__data_batch_id__in=batch_ids).delete()
    >>> EmissionRecord.objects.filter(data_point__data_batch_id__in=batch_ids).delete()
    >>> PlumeImage.objects.filter(data_batch_id__in=batch_ids).delete()
    >>> DataPoint.objects.filter(data_batch_id__in=batch_ids).delete()
    >>> DataBatch.objects.filter(pk__in=batch_ids).delete()
    >>> Scene.objects.filter(data_provider__name__in=["Bridger", "GHGSat"]).delete()
    >>> SiteNonDetect.objects.filter(data_provider__name__in=["Bridger", "GHGSat"]).delete()
  3. Create SSO setup for Aerscape

    • Configure SSO settings in Django Admin
    • Add Aerscape email domain to SSO configuration
  4. Resize container to normal size

    • Return container resources to standard allocation
    • Remove the temporary resource increase from pre-import step

Post-import tasks (to complete later)

These tasks should be completed after verifying the import was successful:

  1. Enable notifications (when ready to start sending emails)

    bash
    python manage.py import_company_data \
      --owner "{company_name}" \
      --backup-date "{YYYY-MM-DD}" \
      --models emissions.EmissionNotificationSettings

    Important: Only import notification settings after verifying the environment is properly configured to send emails. This prevents accidentally spamming users during testing.

  2. Restore scheduler tasks

    bash
    # Set the scheduler/Celery worker replica count back to 1
    # This re-enables automated background tasks

Import verification checklist

After completing the import, verify the following:

  • [ ] Users can log in successfully
  • [ ] Infrastructure (sites, equipment) displays correctly on maps
  • [ ] Emission records are visible and properly matched
  • [ ] Events show correct status and associations
  • [ ] File uploads (aerial images, documents) are accessible
  • [ ] Geo filters and presets work correctly
  • [ ] No duplicate records or data inconsistencies
  • [ ] Celery tasks remain disabled until verification complete
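
A quick record-count comparison against the source environment can back up the manual checks; a sketch for the Django shell (model import paths follow those used elsewhere in this guide and may need adjusting):

python
# Sanity-check record counts after import; compare the numbers with the source
# environment. Adjust the import paths if your project layout differs.
from accounts.models import Company, User
from emissions.models import DataBatch, EmissionRecord
from event_management.models import Event
from infrastructure.models import Site

print("Company present:", Company.objects.filter(name="Acme Corp").exists())
print("Users:", User.objects.count())
print("Sites:", Site.objects.count())
print("Emission records:", EmissionRecord.objects.count())
print("Events:", Event.objects.count())
print("Data batches:", DataBatch.objects.count())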

Common workflows

Full company migration (production → staging)

Export from source environment:

bash
# On production database
python manage.py export_company_data --owner "Acme Corp"

Import to target environment (requires both credential sets):

bash
# On staging database - set source credentials to point to production S3
export AWS_SRC_DATA_BUCKET_NAME="production-backup-bucket"
export AWS_SRC_DATA_S3_ACCESS_KEY_ID="prod-access-key"
export AWS_SRC_DATA_S3_SECRET_ACCESS_KEY="prod-secret-key"

# Target credentials (AWS_STORAGE_BUCKET_NAME, etc.) should already be set for staging environment

python manage.py import_company_data \
  --owner "Acme Corp" \
  --backup-date "2026-01-23" \
  --disable-notifications

Infrastructure-only import

bash
# Import only sites and equipment
python manage.py import_company_data \
  --owner "Acme Corp" \
  --backup-date "2026-01-23" \
  --models infrastructure.Site infrastructure.Equipment \
  --dry-run

Technical details

Memory efficiency

Both commands use streaming I/O to handle millions of records with minimal memory:

  • Export: Writes records to JSON files in batches without accumulating
  • Import: Deserializes records one at a time using Django's streaming API
  • Typical memory usage: ~100-200MB regardless of dataset size

Timestamp preservation

Django's auto_now and auto_now_add fields are normally excluded from serialization. These commands preserve them:

  • Export: Manually extracts timestamps after serialization
  • Import: Uses raw SQL with CASE statements to restore timestamps in batches

This ensures imported records maintain their original creation/modification times.
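
In SQL terms, the batched restore is roughly equivalent to the sketch below; the table and column names are illustrative, not the command's actual statements:

python
# Rough sketch of restoring created_at for a batch of rows with one CASE update.
# Table/column names are illustrative; timestamp values should be datetime objects.
from django.db import connection


def restore_created_at(table, id_to_timestamp):
    if not id_to_timestamp:
        return
    cases = " ".join("WHEN %s THEN %s" for _ in id_to_timestamp)
    in_placeholders = ", ".join("%s" for _ in id_to_timestamp)
    sql = (
        f"UPDATE {table} SET created_at = CASE id {cases} END "
        f"WHERE id IN ({in_placeholders})"
    )
    params = [value for pk, ts in id_to_timestamp.items() for value in (pk, ts)]
    params += list(id_to_timestamp)
    with connection.cursor() as cursor:
        cursor.execute(sql, params)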

Sequence reset

PostgreSQL sequences for auto-increment primary keys are automatically reset after import to prevent ID conflicts:

  • Only resets sequences for models that were imported
  • Sets sequence to MAX(id) value to avoid collisions
  • Handles both regular AutoField and historical model history_id fields
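
The reset is equivalent to calling setval() with the table's current MAX(id); a sketch using Django's database connection (the helper name is illustrative):

python
# Sketch: reset a table's primary-key sequence to MAX(id) so new inserts do not
# collide with imported rows. The helper name is illustrative.
from django.db import connection


def reset_sequence(table, pk_column="id"):
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT setval(pg_get_serial_sequence(%s, %s), "
            f"COALESCE((SELECT MAX({pk_column}) FROM {table}), 1))",
            [table, pk_column],
        )


# Example: reset_sequence("accounts_company")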

Import order

Models are imported in a specific order to respect foreign key dependencies. The order is defined in the MODEL_IMPORT_ORDER constant in the import command.

Signal handling

Django signals are temporarily disabled during import to prevent:

  • Automatic creation of related objects (e.g., notification settings)
  • Triggering workflows or notifications
  • Side effects from model save() methods
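
A common way to achieve this is a context manager that disconnects the relevant receivers for the duration of the import; a hedged sketch (the command may do this differently, and the receiver/sender names in the usage comment are hypothetical):

python
# Sketch: temporarily disconnect a post_save receiver while importing.
# The actual command may implement signal suppression differently.
from contextlib import contextmanager

from django.db.models.signals import post_save


@contextmanager
def signal_disabled(receiver, sender):
    post_save.disconnect(receiver, sender=sender)
    try:
        yield
    finally:
        post_save.connect(receiver, sender=sender)


# Usage (receiver and sender names are hypothetical):
# with signal_disabled(create_notification_settings, sender=Company):
#     import_model_file("accounts_Company.json")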

Troubleshooting

Export process killed

If an export fails with a "Killed" message, the process likely ran out of memory. Reduce the batch size or increase the container size:

bash
python manage.py export_company_data \
  --owner "Company Name" \
  --batch-size 500

Import foreign key errors

If imports fail with foreign key constraint violations, try:

bash
python manage.py import_company_data \
  --owner "Company Name" \
  --backup-date "2026-01-23" \
  --disable-constraints

Sequence conflicts after import

If you see "duplicate key value violates unique constraint" errors after import, sequences weren't reset properly. Manually reset:

bash
python manage.py shell
>>> from django.db import connection
>>> cursor = connection.cursor()
>>> cursor.execute("SELECT setval('accounts_company_id_seq', (SELECT MAX(id) FROM accounts_company))")

S3 connection issues

Verify credentials are set correctly.

For export:

bash
echo $AWS_STORAGE_BUCKET_NAME
echo $AWS_S3_ACCESS_KEY_ID
echo $AWS_S3_SECRET_ACCESS_KEY

For import (requires BOTH sets):

bash
# Source credentials (to read backup files)
echo $AWS_SRC_DATA_BUCKET_NAME
echo $AWS_SRC_DATA_S3_ACCESS_KEY_ID
echo $AWS_SRC_DATA_S3_SECRET_ACCESS_KEY

# Target credentials (for current environment's S3)
echo $AWS_STORAGE_BUCKET_NAME
echo $AWS_S3_ACCESS_KEY_ID
echo $AWS_S3_SECRET_ACCESS_KEY

Source credentials can also be provided via command-line arguments.
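
If the variables are set but connections still fail, a quick standalone check isolates whether the credentials and endpoint work at all; a sketch using boto3 (shown for the source credentials, swap in the AWS_* variables to check the target set):

python
# One-off connectivity check for the source S3 credentials; swap in the AWS_*
# variables to check the target set instead.
import os

import boto3
from botocore.exceptions import ClientError

client = boto3.client(
    "s3",
    aws_access_key_id=os.environ["AWS_SRC_DATA_S3_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SRC_DATA_S3_SECRET_ACCESS_KEY"],
    endpoint_url=os.environ.get("AWS_SRC_DATA_ENDPOINT_URL"),
)
try:
    client.head_bucket(Bucket=os.environ["AWS_SRC_DATA_BUCKET_NAME"])
    print("Source bucket reachable")
except ClientError as exc:
    print("Source bucket check failed:", exc)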

Best practices

  1. Always use --dry-run first when importing to new environments
  2. Export regularly - Automate exports with cron or scheduled tasks
  3. Test restores periodically - Verify backups can be restored successfully
  4. Use --disable-notifications when importing to non-production environments
  5. Monitor S3 storage - Old backups can accumulate; implement retention policies
  6. Document backup dates - Keep a log of when backups were created and why
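
For point 2, regular exports can be scheduled with a small wrapper task; a hedged sketch using Celery (the task name is illustrative, and the schedule itself would live in your existing Celery beat or cron configuration):

python
# Hedged sketch: a periodic task that runs the export command for one company.
# The task name is illustrative; wire it into your Celery beat or cron setup.
from celery import shared_task
from django.core.management import call_command


@shared_task
def scheduled_company_export(company_name):
    # Equivalent to: python manage.py export_company_data --owner "<company_name>"
    call_command("export_company_data", owner=company_name)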

Security considerations

  • User passwords are excluded from exports
  • Sensitive fields like API keys should be reviewed before cross-environment imports
  • S3 buckets should use appropriate IAM policies to restrict access
  • Consider encrypting S3 buckets for sensitive company data