Data Handling

Description

The Data Handling module prepares raw data for machine learning experiments through the DataHandler class.

Main Class

class DataHandler

A class for handling dataset operations including creation, enhancement, splitting, and saving images.

__init__()

Initializes the DataHandler.

load_dataset(data: tf.data.Dataset | dict | pandas.DataFrame)

Loads a dataset from the given data and stores it in the ‘datasets_container’ under ‘complete_dataset’.

Parameters:

data (Union[tf.data.Dataset, dict, pandas.DataFrame]) –

The data to load. It can be:

  1. A TensorFlow dataset of (image, label) tuples, where each image has shape (height, width, channels) with 1 or 3 channels.

  2. A dictionary or pandas DataFrame with ‘path’ and ‘label’ columns.
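As an illustration of the second input form, the sketch below builds a dictionary with ‘path’ and ‘label’ columns and checks its shape before it would be handed to load_dataset. The file paths, the labels, and the helper check_path_label_columns are hypothetical and not part of the module; the sketch also assumes that each column is a list with one entry per image, which the description above does not state explicitly.

```python
# Hypothetical file paths and labels, for illustration only.
data = {
    "path": ["images/cat_001.png", "images/dog_001.png", "images/cat_002.png"],
    "label": [0, 1, 0],
}


def check_path_label_columns(table: dict) -> int:
    """Return the number of rows after checking the 'path' and 'label' columns.

    Hypothetical helper: it is not provided by DataHandler.
    """
    missing = {"path", "label"} - set(table)
    if missing:
        raise KeyError(f"missing columns: {sorted(missing)}")
    if len(table["path"]) != len(table["label"]):
        raise ValueError("'path' and 'label' must have the same length")
    return len(table["path"])


num_rows = check_path_label_columns(data)
```

A pandas DataFrame with the same two columns would carry the same information.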

prepare_datasets(dataset_names: List[str] | None = None, batch_size: int | None = None, shuffle_seed: int | None = None, prefetch_buffer_size: int = tf.data.experimental.AUTOTUNE, repeat_num: int | None = None)

Prepares datasets by applying transformations and updates them in the ‘datasets_container’.

Parameters:
  • dataset_names (Optional[List[str]]) – The names of the datasets to prepare. Can be ‘complete_dataset’ or any of the split datasets (‘train_dataset’, ‘val_dataset’, ‘test_dataset’). If None, all datasets are processed.

  • batch_size (Optional[int]) – The batch size for the dataset. If None, no batching is applied.

  • shuffle_seed (Optional[int]) – The seed for shuffling. If None, no shuffling is applied.

  • prefetch_buffer_size (int) – The prefetch buffer size. Defaults to tf.data.experimental.AUTOTUNE.

  • repeat_num (Optional[int]) – The number of times to repeat the dataset. If None, no repetition is applied.
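The way the optional parameters switch individual steps on or off can be modelled in plain Python. The sketch below is not the module's implementation: it works on a list instead of a tf.data.Dataset, the function name prepare is hypothetical, and the order of the steps (shuffle, then batch, then repeat) is an assumption. Prefetching is left out because it affects throughput only, not the elements produced.

```python
import random
from typing import Iterable, List, Optional


def prepare(
    items: Iterable,
    batch_size: Optional[int] = None,
    shuffle_seed: Optional[int] = None,
    repeat_num: Optional[int] = None,
) -> List:
    """Plain-Python model of the optional steps; each None skips its step."""
    result = list(items)
    if shuffle_seed is not None:
        # A fixed seed makes the shuffled order reproducible.
        random.Random(shuffle_seed).shuffle(result)
    if batch_size is not None:
        # Group consecutive elements; the last batch may be smaller.
        result = [result[i:i + batch_size] for i in range(0, len(result), batch_size)]
    if repeat_num is not None:
        # Repeat the whole (possibly batched) sequence repeat_num times.
        result = result * repeat_num
    return result


batched = prepare(range(5), batch_size=2)
repeated = prepare(range(4), batch_size=2, repeat_num=2)
```

With all optional arguments left at None, the model returns the elements unchanged, which mirrors the "If None, no ... is applied" wording in the parameter list.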

split_dataset(train_split: float = 0.8, val_split: float = 0.1, test_split: float = 0.1, dataset_size: int | None = None)

Splits ‘complete_dataset’ into ‘train_dataset’, ‘val_dataset’, and ‘test_dataset’. ‘complete_dataset’ is removed after splitting.

Parameters:
  • train_split (float) – Proportion of the dataset for training. Defaults to 0.8.

  • val_split (float) – Proportion of the dataset for validation. Defaults to 0.1.

  • test_split (float) – Proportion of the dataset for testing. Defaults to 0.1.

  • dataset_size (Optional[int]) – The dataset size. If None, the size is determined using the ‘cardinality’ method.
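To make the proportions concrete, the sketch below converts the three fractions into element counts for a given dataset size. The helper split_sizes is hypothetical, and its rounding rule (round the training and validation counts down, give the remainder to the test set) and its check that the fractions sum to 1 are assumptions; the module may handle both differently.

```python
from typing import Tuple


def split_sizes(
    dataset_size: int,
    train_split: float = 0.8,
    val_split: float = 0.1,
    test_split: float = 0.1,
) -> Tuple[int, int, int]:
    """Return (train, val, test) element counts for the given proportions."""
    total = train_split + val_split + test_split
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"splits must sum to 1.0, got {total}")
    # Assumed rounding rule: floor train and val, remainder goes to test.
    train_size = int(dataset_size * train_split)
    val_size = int(dataset_size * val_split)
    test_size = dataset_size - train_size - val_size
    return train_size, val_size, test_size


default_sizes = split_sizes(1000)
small_sizes = split_sizes(25)
```

Under these assumptions the default proportions give 800, 100, and 100 elements for a dataset of 1000, and 20, 2, and 3 elements for a dataset of 25, so no element is lost to rounding.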

save_images(output_dir: str, prefix: str | Callable[[Any], str] | None = None, num_images: int | None = None)

Saves images from the dataset to a specified directory.

Parameters:
  • output_dir (str) – The directory to save the images.

  • prefix (Optional[Union[str, Callable[[Any], str]]]) – The prefix for the image files. If callable, it should take the label as input and return a string. If None, a default prefix is used.

  • num_images (Optional[int]) – The number of images to save. If None, all images in the dataset are saved.
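The three forms of the prefix argument can be sketched as follows. The helper resolve_prefix, the default prefix "image", and the file name pattern are all hypothetical; only the rule that a callable receives the label and returns a string comes from the description above.

```python
from typing import Any, Callable, Optional, Union


def resolve_prefix(
    prefix: Optional[Union[str, Callable[[Any], str]]],
    label: Any,
    default: str = "image",  # assumed default; the real one is not documented here
) -> str:
    """Turn the prefix argument into a string for one image."""
    if prefix is None:
        return default
    if callable(prefix):
        # A callable prefix receives the label and returns a string.
        return prefix(label)
    return prefix


def label_prefix(label: Any) -> str:
    """Example callable: derive the prefix from the class label."""
    return f"class_{label}"


labels = [0, 1, 0]
# Assumed naming pattern "<prefix>_<index>.png", for illustration only.
names = [f"{resolve_prefix(label_prefix, label)}_{i}.png" for i, label in enumerate(labels)]
```

A callable prefix is useful when the class should be recoverable from the file name; a plain string gives every saved image the same prefix.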

backup_datasets()

Creates a backup of the current dataset container.

restore_datasets()

Restores the dataset container from the backup.
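A typical reason to take a backup is to try a transformation and then return to the earlier state. The toy class below models that round trip. It is a sketch only: ToyContainer is hypothetical, it assumes the container is a dictionary keyed by dataset name, and it stores a shallow copy, none of which is confirmed for DataHandler.

```python
class ToyContainer:
    """Toy stand-in for a dataset container with backup and restore."""

    def __init__(self) -> None:
        self.datasets_container = {}
        self._backup = None

    def backup_datasets(self) -> None:
        # Shallow copy: the dataset objects themselves are shared.
        self._backup = dict(self.datasets_container)

    def restore_datasets(self) -> None:
        if self._backup is None:
            raise RuntimeError("no backup available")
        self.datasets_container = dict(self._backup)


toy = ToyContainer()
toy.datasets_container["complete_dataset"] = [1, 2, 3, 4]
toy.backup_datasets()

# Simulate a split that replaces 'complete_dataset' with two parts.
full = toy.datasets_container.pop("complete_dataset")
toy.datasets_container["train_dataset"] = full[:3]
toy.datasets_container["test_dataset"] = full[3:]

toy.restore_datasets()
restored_names = sorted(toy.datasets_container)
```

After the restore, the container again holds only the entry that existed when the backup was taken.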