Data Handling

Description

The Data Handling module prepares raw data for machine learning experiments through the DataHandler class.

Main Class

class DataHandler

A class for handling dataset operations including creation, enhancement, splitting, and saving images.

__init__(): Initializes the DataHandler.

load_dataset(data: tf.data.Dataset | dict | pandas.DataFrame)

Loads a dataset from the given data and stores it in the ‘datasets_container’ under ‘complete_dataset’.

Parameters:

data (Union[tf.data.Dataset, dict, pandas.DataFrame]) –

The data to load. It can be:

A TensorFlow dataset of tuples (image, label), where image shape is (height, width, 1|3).
A dictionary or pandas DataFrame with ‘path’ and ‘label’ columns.
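
As a rough illustration of the second input form, the sketch below builds a column-style dictionary with 'path' and 'label' keys from a list of file paths. The helper build_path_label_dict, the example paths, and the convention of reading the label from the parent folder name are assumptions made for illustration; this page only states that 'path' and 'label' columns are expected.

```python
from pathlib import PurePosixPath


def build_path_label_dict(paths):
    """Return {'path': [...], 'label': [...]} with one entry per image path.

    Assumption: each image lives in a folder named after its class,
    e.g. 'images/cat/001.png' -> label 'cat'.
    """
    records = {"path": [], "label": []}
    for raw in paths:
        p = PurePosixPath(raw)
        records["path"].append(str(p))
        records["label"].append(p.parent.name)
    return records


# Hypothetical file layout, used only to show the resulting structure.
example_paths = ["images/cat/001.png", "images/dog/002.png"]
data = build_path_label_dict(example_paths)
```

A dictionary of this shape could presumably be passed to load_dataset directly or wrapped in a pandas DataFrame first. Whether labels may be strings or must be integers is not specified here.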

prepare_datasets(dataset_names: List[str] | None = None, batch_size: int | None = None, shuffle_seed: int | None = None, prefetch_buffer_size: int = tf.data.experimental.AUTOTUNE, repeat_num: int | None = None)

Prepares datasets by applying transformations and updates them in the ‘datasets_container’.

Parameters:

dataset_names (Optional[List[str]]) – The names of the datasets to prepare. Each name can be ‘complete_dataset’ or one of the split datasets (‘train_dataset’, ‘val_dataset’, ‘test_dataset’). If None, all datasets in the container are processed.
batch_size (Optional[int]) – The batch size for the dataset. If None, no batching is applied.
shuffle_seed (Optional[int]) – The seed for shuffling. If None, no shuffling is applied.
prefetch_buffer_size (int) – The prefetch buffer size. Defaults to tf.data.experimental.AUTOTUNE.
repeat_num (Optional[int]) – The number of times to repeat the dataset. If None, no repetition is applied.
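
The "None means skip this step" behaviour described above can be sketched with a pure-Python stand-in. This is not the library's implementation: the function prepare is hypothetical, it works on plain lists rather than tf.data pipelines, and the order of steps (shuffle, then batch, then repeat) is an assumption. Prefetching is left out because it affects throughput only, not the elements produced.

```python
import random


def prepare(items, batch_size=None, shuffle_seed=None, repeat_num=None):
    """Pure-Python stand-in for the optional steps; not the library's code.

    Assumed order: shuffle, then batch, then repeat.
    """
    items = list(items)
    if shuffle_seed is not None:
        # A fixed seed makes the shuffled order reproducible.
        random.Random(shuffle_seed).shuffle(items)
    if batch_size is not None:
        # Group consecutive elements; the last batch may be smaller.
        items = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    if repeat_num is not None:
        # Repeat the whole (possibly batched) sequence repeat_num times.
        items = items * repeat_num
    return items
```

With all optional arguments left as None the input passes through unchanged, which mirrors the documented defaults.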

split_dataset(train_split: float = 0.8, val_split: float = 0.1, test_split: float = 0.1, dataset_size: int | None = None)

Splits ‘complete_dataset’ into ‘train_dataset’, ‘val_dataset’, and ‘test_dataset’. Removes the ‘complete_dataset’ after splitting.

Parameters:

train_split (float) – Proportion of the dataset for training. Defaults to 0.8.
val_split (float) – Proportion of the dataset for validation. Defaults to 0.1.
test_split (float) – Proportion of the dataset for testing. Defaults to 0.1.
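dataset_size (Optional[int]) – The number of elements in the dataset. If None, the size is determined using the dataset's ‘cardinality’ method. TensorFlow can report an unknown cardinality for some pipelines (for example, after filtering), so passing an explicit value may be safer in those cases.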
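
To make the proportions concrete, the following sketch turns the three split fractions into element counts. The helper split_sizes is hypothetical, and both the rounding policy (floor for train and validation, remainder to test) and the requirement that the fractions sum to 1 are assumptions; the library may handle these differently.

```python
import math


def split_sizes(dataset_size, train_split=0.8, val_split=0.1, test_split=0.1):
    """Convert split proportions into (train, val, test) element counts."""
    total = train_split + val_split + test_split
    if not math.isclose(total, 1.0):
        raise ValueError(f"split proportions sum to {total}, expected 1.0")
    train_size = int(dataset_size * train_split)
    val_size = int(dataset_size * val_split)
    # Give the remainder to the test split so no element is dropped.
    test_size = dataset_size - train_size - val_size
    return train_size, val_size, test_size


sizes = split_sizes(1000)  # default 0.8 / 0.1 / 0.1 proportions
```

For small datasets the floor rounding can leave the validation split empty, which is one reason to check the resulting counts before training.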

save_images(output_dir: str, prefix: str | Callable[[Any], str] | None = None, num_images: int | None = None)

Saves images from the dataset to a specified directory.

Parameters:

output_dir (str) – The directory to save the images.
prefix (Optional[Union[str, Callable[[Any], str]]]) – The prefix for the image files. If callable, it should take the label as input and return a string. If None, a default prefix is used.
num_images (Optional[int]) – The number of images to save. If None, all images in the dataset are saved.
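
The three accepted forms of prefix (None, a string, or a callable that receives the label) can be illustrated as below. Both label_prefix and resolve_prefix are hypothetical helpers, and the default value "image" is a placeholder, since the library's actual default prefix is not documented on this page.

```python
def label_prefix(label):
    """Example callable prefix: maps a label to a file-name prefix."""
    return f"class_{label}"


def resolve_prefix(prefix, label, default="image"):
    """Return the prefix string for one image.

    The default value 'image' is a placeholder; the library's actual
    default prefix is not documented here.
    """
    if prefix is None:
        return default
    if callable(prefix):
        # A callable prefix receives the label and returns a string.
        return prefix(label)
    return prefix
```

A callable such as label_prefix is what would be passed as the prefix argument when file names should reflect the class of each image.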

backup_datasets(): Creates a backup of the current dataset container.

restore_datasets(): Restores the dataset container from the backup.
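
One plausible use of the backup pair is to snapshot the container before a split, since splitting removes ‘complete_dataset’. The sketch below imitates that flow with a hypothetical stand-in class, TinyHandler, which stores plain lists in a dictionary. It only illustrates the intended semantics; how the real DataHandler copies its container is not described on this page.

```python
class TinyHandler:
    """Minimal stand-in illustrating backup/restore semantics (hypothetical)."""

    def __init__(self):
        self.datasets_container = {}
        self._backup = None

    def backup_datasets(self):
        # Shallow copy: snapshot which names map to which dataset objects.
        self._backup = dict(self.datasets_container)

    def restore_datasets(self):
        if self._backup is None:
            raise RuntimeError("no backup available")
        self.datasets_container = dict(self._backup)


handler = TinyHandler()
handler.datasets_container["complete_dataset"] = [1, 2, 3, 4]
handler.backup_datasets()

# Simulate a split that removes 'complete_dataset'.
full = handler.datasets_container.pop("complete_dataset")
handler.datasets_container["train_dataset"] = full[:3]
handler.datasets_container["test_dataset"] = full[3:]

# Restoring brings back the container as it was before the split.
handler.restore_datasets()
```

After the restore call in this sketch, the container again holds only ‘complete_dataset’, so a different split could be tried from the same starting point.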