latents_extraction

The latents_extraction module extracts latent representations from preprocessed images.

Classes

class nenya.latents_extraction.HDF5RGBDataset(file_path, partition, allowed_indices=None)

A PyTorch dataset for HDF5 data used in latent extraction.

Parameters:
  • file_path (str) – Path to the HDF5 file

  • partition (str) – Dataset name in the HDF5 file (e.g., 'train', 'valid')

  • allowed_indices (numpy.ndarray, optional) – Indices of the images to include. Defaults to None (all images).

__len__()

Return the number of samples in the dataset.

Returns:

Number of samples

Return type:

int

__getitem__(index)

Get a sample from the dataset.

Parameters:

index (int) – Index of the sample

Returns:

Tuple of (data, metadata)

Return type:

tuple
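To illustrate how an allowed_indices argument can restrict what a dataset exposes, here is a minimal sketch that avoids HDF5 and PyTorch entirely: a plain list stands in for the partition, and SubsetDataset is a hypothetical name. It assumes that allowed_indices selects rows of the underlying array and that the metadata returned is the original row index; the actual HDF5RGBDataset may differ on both points.

```python
class SubsetDataset:
    """Hypothetical stand-in for an index-restricted dataset (not nenya's class)."""

    def __init__(self, images, allowed_indices=None):
        self.images = images
        # Assumption: None means every row is visible.
        if allowed_indices is None:
            allowed_indices = range(len(images))
        self.indices = list(allowed_indices)

    def __len__(self):
        # The dataset length is the number of allowed rows, not the array size.
        return len(self.indices)

    def __getitem__(self, index):
        # Map the dataset position to the row in the underlying array.
        row = self.indices[index]
        return self.images[row], row


images = ["img0", "img1", "img2", "img3"]
subset = SubsetDataset(images, allowed_indices=[1, 3])
full = SubsetDataset(images)
```

With this mapping, position 0 of subset refers to row 1 of the underlying data, so downstream code can still trace each latent back to its source image.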

Functions

nenya.latents_extraction.main(opt_path, pp_files, clobber=False, debug=False)

Main function for batch latent extraction.

Parameters:
  • opt_path (str) – Path to options file

  • pp_files (list) – List of preprocessed file paths

  • clobber (bool, optional) – Whether to overwrite existing files. Defaults to False.

  • debug (bool, optional) – Whether to run in debug mode. Defaults to False.

nenya.latents_extraction.build_loader(data_file, dataset, batch_size=1, num_workers=1, allowed_indices=None)

Create a data loader for latent extraction.

Parameters:
  • data_file (str) – Path to the data file

  • dataset (str) – Dataset name in the file (e.g., 'train', 'valid')

  • batch_size (int, optional) – Batch size for data loading. Defaults to 1.

  • num_workers (int, optional) – Number of worker processes. Defaults to 1.

  • allowed_indices (numpy.ndarray, optional) – Indices of the images to include. Defaults to None (all images).

Returns:

Tuple of (dataset, data loader)

Return type:

tuple

nenya.latents_extraction.calc_latent(model, image_tensor, using_gpu)

Calculate latent representations for an image tensor.

Parameters:
  • model (torch.nn.Module) – Nenya model

  • image_tensor (torch.Tensor) – Image tensor

  • using_gpu (bool) – Whether to use the GPU

Returns:

Latent vectors as numpy array

Return type:

numpy.ndarray

nenya.latents_extraction.prep(opt)

Prepare the environment for latent extraction.

Parameters:

opt (nenya.params.Params) – Model options

Returns:

Tuple of (model base name, list of existing latent files)

Return type:

tuple

nenya.latents_extraction.model_latents_extract(opt, data_file, model_path, remove_module=True, in_loader=None, partitions=('train', 'valid'), allowed_indices=None, debug=False)

Extract latents from a data file using a model.

Parameters:
  • opt (nenya.params.Params) – Model options

  • data_file (str) – Path to the data file

  • model_path (str) – Path to the model file

  • remove_module (bool, optional) – Whether to remove the 'module.' prefix from keys. Defaults to True.

  • in_loader (torch.utils.data.DataLoader, optional) – Pre-configured data loader. Defaults to None.

  • partitions (tuple, optional) – Dataset partitions to process. Defaults to ('train', 'valid').

  • allowed_indices (numpy.ndarray, optional) – Indices of the images to include. Defaults to None (all images).

  • debug (bool, optional) – Whether to run in debug mode. Defaults to False.

Returns:

Dictionary of latent vectors for each partition

Return type:

dict
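The remove_module option most likely targets checkpoints saved from a model wrapped in torch.nn.DataParallel, which prefixes every parameter name with 'module.'. That reading is an assumption, as this page does not say which keys are meant. A minimal sketch of such a key cleanup, using a hypothetical helper strip_module_prefix and placeholder values instead of weight tensors:

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        # Only strip a leading prefix; keys without it pass through unchanged.
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Placeholder integers stand in for weight tensors.
saved = {"module.encoder.weight": 1, "module.head.bias": 2, "fc.weight": 3}
cleaned = strip_module_prefix(saved)
```

Stripping only a leading prefix keeps keys such as 'fc.weight' intact, so the same helper is safe to apply to checkpoints that were never wrapped.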

Example Usage

from nenya.latents_extraction import model_latents_extract, main
from nenya import io as nenya_io

# Extract latents for specific files
pp_files = [
    's3://bucket/PreProc/data_file1_preproc.h5',
    's3://bucket/PreProc/data_file2_preproc.h5'
]

# Batch extraction
main("path/to/opts.json", pp_files, clobber=False)

# Individual extraction
opt, model_path = nenya_io.load_opt('v5')
latent_dict = model_latents_extract(opt, "data_file_preproc.h5", model_path)

# Access latents
valid_latents = latent_dict['valid']
train_latents = latent_dict['train']

Implementation Details

The latent extraction process:

  1. Loads the model and its weights

  2. Creates data loaders for each partition in the data file

  3. Passes batches of images through the model

  4. Collects the latent vectors

  5. Returns a dictionary with latent vectors for each partition
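Steps 2 to 5 above can be sketched as a simple loop. This is an illustration under stated assumptions, not the module's implementation: extract_latents and toy_model are hypothetical names, lists of batches stand in for data loaders, and the "model" is an ordinary function so the example runs without PyTorch.

```python
def extract_latents(model_fn, loaders):
    """Run model_fn over every batch and collect the outputs per partition."""
    latents = {}
    for partition, loader in loaders.items():
        collected = []
        for batch in loader:
            # model_fn returns one latent vector per image in the batch.
            collected.extend(model_fn(batch))
        latents[partition] = collected
    return latents


def toy_model(batch):
    # Hypothetical "model": maps each image (a list of numbers) to a 1-D latent.
    return [[sum(image)] for image in batch]


loaders = {
    "train": [[[1, 2], [3, 4]], [[5, 6]]],  # two batches
    "valid": [[[7, 8]]],                    # one batch
}
latent_dict = extract_latents(toy_model, loaders)
```

The result is keyed by partition name, which matches the access pattern latent_dict['train'] and latent_dict['valid'] shown under Example Usage.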

When using the main function, the process also includes:

  1. Downloading files from S3 if necessary

  2. Checking for existing latent files to avoid duplicating work

  3. Saving extracted latents to HDF5 files

  4. Uploading results to S3

  5. Cleaning up temporary files
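The check for existing latent files, combined with the clobber flag, can be sketched as a filter over the input list. Everything specific here is an assumption made for illustration: the helpers latent_name and files_to_process are hypothetical, as are the 'preproc' to 'latents' naming convention and the s3://bucket/Latents/ location.

```python
import os


def latent_name(pp_file):
    # Hypothetical naming convention: swap 'preproc' for 'latents' in the base name.
    return os.path.basename(pp_file).replace("preproc", "latents")


def files_to_process(pp_files, existing_latents, clobber=False):
    """Return the preprocessed files whose latents still need to be computed."""
    if clobber:
        # Overwrite mode: reprocess everything regardless of existing output.
        return list(pp_files)
    existing = {os.path.basename(f) for f in existing_latents}
    return [f for f in pp_files if latent_name(f) not in existing]


pp_files = [
    "s3://bucket/PreProc/data_file1_preproc.h5",
    "s3://bucket/PreProc/data_file2_preproc.h5",
]
# Hypothetical listing of latent files that already exist.
existing = ["s3://bucket/Latents/data_file1_latents.h5"]
todo = files_to_process(pp_files, existing)
```

With clobber=False only data_file2 remains in todo; passing clobber=True returns both files.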