.. _latent_extraction: Latent Extraction =============== After training a Nenya model, the next step is to extract latent representations from images. These latent vectors capture meaningful features and patterns in the data. After training a Nenya model, the next step is to extract latent representations from images. These latent vectors capture meaningful features and patterns in the data. Basic Latent Extraction --------------------- The main function for extracting latents is ``model_latents_extract`` in ``latents_extraction.py``: .. code-block:: python from nenya.latents_extraction import model_latents_extract from nenya import io as nenya_io # Load options and model path opt, model_path = nenya_io.load_opt('v5') # Extract latents from a data file latent_dict = model_latents_extract(opt, data_file, model_path) # Access latents for different partitions valid_latents = latent_dict['valid'] train_latents = latent_dict['train'] Single Image Analysis ------------------- To extract latents from a single image, you can use functions from ``analyze_image.py``: .. code-block:: python from nenya import analyze_image # Extract latents from a single image latents, pp_img = analyze_image.get_latents(image, model_file, opt) # Calculate DT (temperature difference) for the image DT = analyze_image.calc_DT(image, opt.random_jitter) # UMAP embed the image embedding, pp_img, table_file, DT, latents = analyze_image.umap_image('v5', image) Batch Processing -------------- For batch processing of multiple images, you can use the main function in ``latents_extraction.py``: .. code-block:: python from nenya.latents_extraction import main as extract_main # Path to options file opt_path = "path/to/opts_file.json" # List of preprocessed files to process pp_files = [ 's3://bucket/PreProc/data_preproc_1.h5', 's3://bucket/PreProc/data_preproc_2.h5' ] # Extract latents from all files extract_main(opt_path, pp_files, clobber=False, debug=False) This will: 1. Download each preprocessed file 2. Extract latents for all images in the file 3. Save the latents to a new file with "_latents" suffix 4. Upload the results to S3 (if configured) Data Loaders for Latent Extraction -------------------------------- Nenya provides custom data loaders for latent extraction: .. code-block:: python class HDF5RGBDataset(torch.utils.data.Dataset): """Dataset for loading HDF5 data for latent extraction""" def __init__(self, file_path, partition, allowed_indices=None): self.file_path = file_path self.partition = partition self.meta_dset = partition + '_metadata' self.h5f = h5py.File(file_path, 'r') self.allowed_indices = allowed_indices or np.arange(self.h5f[self.partition].shape[0]) # Implementation details... The ``build_loader`` function creates a data loader for efficient batch processing: .. code-block:: python def build_loader(data_file, dataset, batch_size=1, num_workers=1, allowed_indices=None): """Create a dataloader for latent extraction""" dset = HDF5RGBDataset(data_file, partition=dataset, allowed_indices=allowed_indices) loader = torch.utils.data.DataLoader( dset, batch_size=batch_size, shuffle=False, collate_fn=id_collate, drop_last=False, num_workers=num_workers) return dset, loader Loading Models for Extraction --------------------------- Models are loaded using functions from ``train_util.py``: .. code-block:: python from nenya.train_util import set_model # Load the model model, _ = set_model(opt, cuda_use=using_gpu) # Load the model state from a file if not using_gpu: model_dict = torch.load(model_path, map_location=torch.device('cpu')) else: model_dict = torch.load(model_path) # Load the model weights if remove_module: # Remove 'module.' prefix from DataParallel models new_dict = {} for key in model_dict['model'].keys(): new_dict[key.replace('module.','')] = model_dict['model'][key] model.load_state_dict(new_dict) else: model.load_state_dict(model_dict['model']) Computing Latents --------------- The actual computation of latents is handled by the ``calc_latent`` function: .. code-block:: python def calc_latent(model, image_tensor, using_gpu): """Calculate latent representation for an image""" model.eval() if using_gpu: latents_tensor = model(image_tensor.cuda()) latents_numpy = latents_tensor.cpu().numpy() else: latents_tensor = model(image_tensor) latents_numpy = latents_tensor.numpy() return latents_numpy Latent Storage Format ------------------ Latents are stored in HDF5 files with the following structure: - File name: Original filename with "_latents" suffix - Datasets: - ``train``: Latent vectors for training set (if present) - ``valid``: Latent vectors for validation set - Shape: ``(n_samples, feat_dim)`` where ``feat_dim`` is typically 128 or 512 Working with S3 Storage -------------------- When using S3 storage, the workflow typically involves: 1. Checking if the latent file already exists in S3 2. Downloading the preprocessed file locally 3. Extracting latents 4. Saving results locally 5. Uploading to S3 6. Cleaning up local files Tips for Latent Extraction ------------------------ 1. **Memory Management**: Process files in batches to manage memory usage 2. **GPU Acceleration**: Use GPU for faster processing when available 3. **Preprocessing**: Ensure images are properly preprocessed before extraction 4. **Batch Size**: Adjust batch size based on available GPU memory 5. **Model Selection**: Choose appropriate model version based on your data (MODIS vs VIIRS) Exploring Extracted Latents ------------------------- After extraction, you can explore latents through: 1. UMAP visualization (see :ref:`umap_analysis`) 2. PCA analysis 3. Clustering algorithms 4. Similarity search