Model Training

Nenya implements self-supervised contrastive learning for training models on satellite imagery. This page describes the training process, model architecture, and configuration options.

Training Process

The training entry point is implemented in train.py:

from nenya.train import main as train_main

# Train a model using options from a JSON file
train_main("path/to/opts_file.json", debug=False)

The training function handles:

Loading and preprocessing options
Building the model and criterion (loss function)
Setting up the optimizer
Training for the specified number of epochs
Validating the model periodically
Saving checkpoints and the final model
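
The periodic steps in the list above can be pictured with a small scheduling sketch. It assumes 1-indexed epochs and that validation and checkpointing fire whenever the epoch number is a multiple of valid_freq and save_freq (the option names used on this page). The helper plan_epochs is hypothetical and is not part of Nenya.

```python
def plan_epochs(epochs, save_freq, valid_freq):
    """Return {epoch: [actions]} for a run; a hypothetical helper, not Nenya's API."""
    plan = {}
    for epoch in range(1, epochs + 1):
        actions = ["train"]
        # Assumption: periodic steps trigger on multiples of the frequency.
        if epoch % valid_freq == 0:
            actions.append("validate")
        if epoch % save_freq == 0:
            actions.append("save_checkpoint")
        plan[epoch] = actions
    if epochs >= 1:
        # The final model is saved once, after the last epoch.
        plan[epochs] = plan[epochs] + ["save_final"]
    return plan


plan = plan_epochs(epochs=20, save_freq=10, valid_freq=5)
print(plan[5])   # ['train', 'validate']
print(plan[20])  # ['train', 'validate', 'save_checkpoint', 'save_final']
```

With the example values shown later on this page (epochs 200, save_freq 10, valid_freq 5), the same rule would give 40 validation passes and 20 periodic checkpoints.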

Model Architecture

The default model architecture is based on ResNet with a customized projection head for contrastive learning:

from nenya.models.resnet_big import SupConResNet

# Create a model with a specific backbone and feature dimension
model = SupConResNet(name='resnet50', feat_dim=128)

Nenya supports several ResNet variants ('resnet18', 'resnet34', 'resnet50', etc.), which can be specified in the options file.

Loss Function

Nenya uses a contrastive loss function implemented in losses.py:

from nenya.losses import SupConLoss

# Create a loss function with a specific temperature
criterion = SupConLoss(temperature=0.07)

The contrastive loss encourages representations of different augmentations of the same image to be similar, while pushing representations of different images apart.
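
To make the pull-together and push-apart behaviour concrete, here is a minimal pure-Python sketch of the SimCLR-style (NT-Xent) loss for two augmented views per image. It is written for readability on tiny inputs and is not the SupConLoss implementation in losses.py; the function names are illustrative assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def nt_xent_loss(view1, view2, temperature=0.07):
    """NT-Xent loss for two lists of embeddings (one list per augmented view)."""
    z = [l2_normalize(v) for v in view1 + view2]
    n = len(view1)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same image
        # Similarities of sample i to every other sample, scaled by temperature.
        logits = [
            sum(a * b for a, b in zip(z[i], z[k])) / temperature
            for k in range(2 * n) if k != i
        ]
        pos_logit = sum(a * b for a, b in zip(z[i], z[pos])) / temperature
        # log-sum-exp with the max subtracted for numerical stability
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(s - m) for s in logits))
        total += log_denom - pos_logit
    return total / (2 * n)


view1 = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]     # views agree with their originals
mismatched = [[0.1, 0.9], [0.9, 0.1]]  # views swapped between images
print(nt_xent_loss(view1, matched) < nt_xent_loss(view1, mismatched))  # True
```

The loss is small when each embedding is closest to the other view of its own image and large when it is closer to a view of a different image.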

Option Configuration

Training options are specified in a JSON file and loaded using the Params class:

from nenya import params

# Load options from a JSON file
opt = params.Params("path/to/opts_file.json")

# Preprocess options (set derived values)
params.option_preprocess(opt)

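Key training options include the following. The // annotations are for explanation only; standard JSON does not allow comments, so omit them from an actual options file: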

{
  "ssl_method": "SimCLR",     // Training method (SimCLR or SupCon)
  "ssl_model": "resnet50",    // Backbone model
  "learning_rate": 0.05,      // Initial learning rate
  "batch_size_train": 64,     // Batch size for training
  "batch_size_valid": 64,     // Batch size for validation
  "epochs": 200,              // Number of epochs
  "feat_dim": 128,            // Feature dimension size
  "temp": 0.07,               // Temperature parameter for loss
  "weight_decay": 1e-4,       // Weight decay for optimizer
  "momentum": 0.9,            // Momentum for optimizer
  "cosine": true,             // Use cosine learning rate schedule
  "random_jitter": [5, 5],    // Jitter parameters for augmentation
  "model_root": "models/v5",  // Root directory for model output
  "train_key": "train",       // Dataset key for training
  "valid_key": "valid",       // Dataset key for validation
  "save_freq": 10,            // Save checkpoint every N epochs
  "valid_freq": 5             // Validate every N epochs
}
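
As a quick check that an options file is well formed, the sketch below writes a few of the options above to a temporary file and reads them back with Python's json module, which rejects // comments. The helper load_options and the REQUIRED_KEYS subset are hypothetical and only illustrate the idea; Nenya itself loads options through params.Params.

```python
import json
import os
import tempfile
from types import SimpleNamespace

# Illustrative subset of keys; not a list enforced by Nenya.
REQUIRED_KEYS = ("ssl_method", "ssl_model", "learning_rate", "epochs")


def load_options(path):
    """Read a JSON options file into an attribute-style namespace (hypothetical helper)."""
    with open(path) as f:
        raw = json.load(f)  # raises json.JSONDecodeError on // comments
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise KeyError(f"options file is missing: {missing}")
    return SimpleNamespace(**raw)


options = {"ssl_method": "SimCLR", "ssl_model": "resnet50",
           "learning_rate": 0.05, "epochs": 200, "cosine": True}
with tempfile.TemporaryDirectory() as tmp:
    opts_path = os.path.join(tmp, "opts_file.json")
    with open(opts_path, "w") as f:
        json.dump(options, f, indent=2)
    opt_example = load_options(opts_path)

print(opt_example.ssl_model, opt_example.learning_rate)  # resnet50 0.05
```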

Data Loaders

Training and validation data loaders are created using the nenya_loader function:

from nenya.train_util import nenya_loader

# Create a training data loader
train_loader = nenya_loader(opt, valid=False)

# Create a validation data loader
valid_loader = nenya_loader(opt, valid=True)

These loaders apply the appropriate transformations and augmentations to the input images.
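
Contrastive training needs two independently augmented views of each image. The sketch below shows the idea on a tiny nested-list "image" using random flips and quarter-turn rotations. The function names and the choice of augmentations are assumptions for illustration and are not the transformations applied by nenya_loader.

```python
import random


def rotate90(image):
    """Rotate a 2D list clockwise by 90 degrees."""
    return [list(row) for row in zip(*image[::-1])]


def random_view(image, rng):
    """Return one randomly flipped and rotated copy of a 2D list."""
    view = [list(row) for row in image]
    if rng.random() < 0.5:
        view = [row[::-1] for row in view]  # horizontal flip
    if rng.random() < 0.5:
        view = view[::-1]                   # vertical flip
    for _ in range(rng.randrange(4)):       # 0 to 3 quarter turns
        view = rotate90(view)
    return view


def two_views(image, rng):
    """Return the pair of augmented views that forms one positive pair."""
    return random_view(image, rng), random_view(image, rng)


rng = random.Random(0)  # seeded so the example is reproducible
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
view_a, view_b = two_views(image, rng)
print(view_a)
print(view_b)
```

Both views contain the same pixel values as the input, only rearranged, which is what lets the loss treat them as a positive pair.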

Training Loop

The core training loop is implemented in train_model:

from nenya.train_util import train_model

# Train for one epoch
loss, losses_step, losses_avg = train_model(
    train_loader, model, criterion, optimizer, epoch, opt,
    cuda_use=opt.cuda_use)

For each batch, the function:

Loads images and applies augmentations
Forwards the augmented views through the model
Calculates the contrastive loss
Updates the model parameters through backpropagation

Learning Rate Schedule

Nenya supports learning rate warmup and decay:

from nenya.util import adjust_learning_rate, warmup_learning_rate

# Adjust learning rate according to epoch
adjust_learning_rate(opt, optimizer, epoch)

# Apply warmup to the learning rate within an epoch (idx is the batch index)
warmup_learning_rate(opt, epoch, idx, len(train_loader), optimizer)
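
The shape of such a schedule can be sketched in a few lines. The example assumes a textbook cosine decay and a linear warmup ramp; the function names, the min_lr parameter, and the step-based warmup are illustrative assumptions and may differ from what adjust_learning_rate and warmup_learning_rate compute.

```python
import math


def cosine_lr(base_lr, epoch, total_epochs, min_lr=0.0):
    """Cosine decay from base_lr (epoch 0) to min_lr (epoch == total_epochs)."""
    cos_term = (1 + math.cos(math.pi * epoch / total_epochs)) / 2
    return min_lr + (base_lr - min_lr) * cos_term


def warmup_lr(start_lr, target_lr, step, warmup_steps):
    """Linear ramp from start_lr to target_lr over warmup_steps optimizer steps."""
    fraction = min(step / warmup_steps, 1.0)
    return start_lr + fraction * (target_lr - start_lr)


base_lr, epochs = 0.05, 200
for epoch in (0, 50, 100, 150, 200):
    print(epoch, round(cosine_lr(base_lr, epoch, epochs), 5))

# Halfway through a 100-step warmup from 0.01 the rate is midway to base_lr.
print(round(warmup_lr(0.01, base_lr, 50, 100), 5))
```

The cosine curve starts at the base rate, passes through half of it at the midpoint of training, and approaches the minimum at the final epoch.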

Model Saving

Models are saved periodically during training and at the end:

import os

from nenya.util import save_model

# Save model checkpoint
save_file = os.path.join(opt.model_folder, f'ckpt_epoch_{epoch}.pth')
save_model(model, optimizer, opt, epoch, save_file)

Monitoring Training

Training progress is monitored using the AverageMeter class:

from nenya.util import AverageMeter

# Create meters for tracking statistics
batch_time = AverageMeter()
data_time = AverageMeter()
losses = AverageMeter()

# Update meter with new values
losses.update(loss.item(), bsz)
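
The idea behind such a meter is a sample-weighted running mean. The sketch below follows the common pattern and assumes, as in the call above, that update takes a value and a sample count. The class name RunningAverage is hypothetical and the code is not Nenya's AverageMeter.

```python
class RunningAverage:
    """Tracks the latest value and the sample-weighted mean of a series."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0.0    # most recent value
        self.sum = 0.0    # weighted sum of all values
        self.count = 0    # total number of samples seen
        self.avg = 0.0    # weighted mean so far

    def update(self, val, n=1):
        # val is a mean over n samples, so weight it by n.
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


meter = RunningAverage()
meter.update(2.0, 64)  # full batch with mean loss 2.0
meter.update(1.0, 32)  # smaller final batch with mean loss 1.0
print(meter.avg)       # weighted mean: (2.0 * 64 + 1.0 * 32) / 96
```

Weighting by the batch size keeps a short final batch from distorting the epoch average.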

Learning curves (loss over time) are saved to HDF5 files for later analysis:

import h5py
import numpy as np

# loss_train, loss_step_train, and loss_avg_train are collected during training
with h5py.File(losses_file_train, 'w') as f:
    f.create_dataset('loss_train', data=np.array(loss_train))
    f.create_dataset('loss_step_train', data=np.array(loss_step_train))
    f.create_dataset('loss_avg_train', data=np.array(loss_avg_train))

Multi-GPU Training

Nenya supports multi-GPU training through PyTorch’s DataParallel:

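import torch
import torch.backends.cudnn as cudnn

if torch.cuda.is_available() and cuda_use:
    if torch.cuda.device_count() > 1:
        # Wrap only the encoder so each batch is split across the available GPUs
        model.encoder = torch.nn.DataParallel(model.encoder)
    model = model.cuda()
    criterion = criterion.cuda()
    cudnn.benchmark = True  # let cuDNN auto-tune kernels for fixed input sizes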

Training Tips

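Batch Size: Larger batch sizes generally work better for contrastive learning because each batch supplies more negative pairs. If GPU memory is limited, gradient accumulation can smooth the updates, but it does not increase the number of negatives seen in a single loss computation.
Temperature: The temperature parameter divides the similarity scores before the softmax in the loss, so it controls how sharply the loss concentrates on the most similar pairs. Lower values (e.g., 0.07) typically work well.
Learning Rate: A cosine learning rate schedule with warmup often leads to better results; the cosine schedule is enabled by setting "cosine" to true in the options file.
Augmentations: Strong augmentations are crucial for contrastive learning. Experiment with different combinations of rotation, jitter, and flips.
Feature Dimension: The feat_dim option sets the size of the projection output. A larger value (e.g., 256 instead of 128) can capture more information but requires more GPU memory.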
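
The effect of the temperature tip above can be seen by applying a softmax to the same similarity scores at two temperatures. This is a small stand-alone illustration with made-up similarity values; the function name is an assumption and the code is not taken from Nenya.

```python
import math


def softmax_with_temperature(similarities, temperature):
    """Softmax over similarity scores after dividing by the temperature."""
    scaled = [s / temperature for s in similarities]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


similarities = [0.9, 0.7, 0.1]  # made-up cosine similarities to three candidates
sharp = softmax_with_temperature(similarities, 0.07)
smooth = softmax_with_temperature(similarities, 1.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in smooth])
```

At a temperature of 0.07 nearly all of the probability mass lands on the most similar candidate, while at 1.0 it is spread much more evenly, which is why small temperatures make the loss focus on the hardest comparisons.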