Data Handling
Description
The Data Handling module provides the DataHandler class, which prepares raw data for machine learning experiments.
Main Class
- class DataHandler
A class for handling dataset operations including creation, enhancement, splitting, and saving images.
- __init__()
Initializes the DataHandler.
- load_dataset(data: tf.data.Dataset | dict | pandas.DataFrame)
Loads a dataset from the given data and stores it in the ‘datasets_container’ under ‘complete_dataset’.
- Parameters:
data (Union[tf.data.Dataset, dict, pandas.DataFrame]) –
The data to load. It can be:
A TensorFlow dataset of tuples (image, label), where image shape is (height, width, 1|3).
A dictionary or pandas DataFrame with ‘path’ and ‘label’ columns.
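As an illustration of the dictionary form, the sketch below builds a column-oriented dict with ‘path’ and ‘label’ keys and checks its layout. The file paths, the labels, the column-oriented layout (one list per key), and the helper check_records are all assumptions made for this example; they are not part of the DataHandler API.

```python
# Hypothetical file paths and integer labels, for illustration only.
records = {
    "path": ["images/cat_001.png", "images/dog_001.png", "images/cat_002.png"],
    "label": [0, 1, 0],
}


def check_records(data: dict) -> int:
    """Check the assumed 'path'/'label' layout and return the row count.

    This helper is hypothetical; it only mirrors the documented requirement
    that both columns are present.
    """
    missing = {"path", "label"} - data.keys()
    if missing:
        raise KeyError(f"missing columns: {sorted(missing)}")
    if len(data["path"]) != len(data["label"]):
        raise ValueError("'path' and 'label' must have the same length")
    return len(data["path"])


num_rows = check_records(records)
```

A dict that passes this check is the kind of value the data parameter describes; the same two columns would apply to a pandas DataFrame.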
- prepare_datasets(dataset_names: List[str] | None = None, batch_size: int | None = None, shuffle_seed: int | None = None, prefetch_buffer_size: int = tf.data.experimental.AUTOTUNE, repeat_num: int | None = None)
Prepares datasets by applying transformations and updates them in the ‘datasets_container’.
- Parameters:
dataset_names (Optional[List[str]]) – The names of the datasets to prepare. Each name can be ‘complete_dataset’ or one of the split datasets (‘train_dataset’, ‘val_dataset’, ‘test_dataset’). If None, all datasets are processed.
batch_size (Optional[int]) – The batch size for the dataset. If None, no batching is applied.
shuffle_seed (Optional[int]) – The seed for shuffling. If None, no shuffling is applied.
prefetch_buffer_size (int) – The prefetch buffer size. Defaults to tf.data.experimental.AUTOTUNE.
repeat_num (Optional[int]) – The number of times to repeat the dataset. If None, no repetition is applied.
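The "If None, the step is skipped" behaviour of these parameters can be pictured with a pure-Python stand-in that operates on a list instead of a tf.data.Dataset. The function prepare and the order of the steps (shuffle, then batch, then repeat) are assumptions for this sketch, not the library's implementation. Prefetching is left out because it affects only performance, not the elements produced.

```python
import random
from typing import Any, List, Optional


def prepare(
    items: List[Any],
    batch_size: Optional[int] = None,
    shuffle_seed: Optional[int] = None,
    repeat_num: Optional[int] = None,
) -> List[Any]:
    """List-based stand-in for the optional transformations (hypothetical)."""
    out = list(items)
    if shuffle_seed is not None:
        # A fixed seed makes the shuffled order reproducible.
        random.Random(shuffle_seed).shuffle(out)
    if batch_size is not None:
        # Group consecutive elements; the last batch may be smaller.
        out = [out[i:i + batch_size] for i in range(0, len(out), batch_size)]
    if repeat_num is not None:
        # Concatenate repeat_num copies of the sequence.
        out = out * repeat_num
    return out


unchanged = prepare([1, 2, 3, 4, 5])
batched_twice = prepare([1, 2, 3, 4, 5], batch_size=2, repeat_num=2)
```

With every optional argument left as None the input comes back unchanged, while batch_size=2 and repeat_num=2 yield the three batches twice in a row.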
- split_dataset(train_split: float = 0.8, val_split: float = 0.1, test_split: float = 0.1, dataset_size: int | None = None)
Splits ‘complete_dataset’ into ‘train_dataset’, ‘val_dataset’, and ‘test_dataset’. Removes the ‘complete_dataset’ after splitting.
- Parameters:
train_split (float) – Proportion of the dataset for training. Defaults to 0.8.
val_split (float) – Proportion of the dataset for validation. Defaults to 0.1.
test_split (float) – Proportion of the dataset for testing. Defaults to 0.1.
dataset_size (Optional[int]) – The dataset size. If None, the size is determined using the ‘cardinality’ method.
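To make the proportions concrete, the following sketch converts them into element counts for a dataset of known size. The helper split_sizes, the requirement that the proportions sum to 1, and the choice to give the test split whatever remains after truncation are assumptions for illustration; the source does not state how split_dataset rounds.

```python
import math
from typing import Tuple


def split_sizes(
    dataset_size: int,
    train_split: float = 0.8,
    val_split: float = 0.1,
    test_split: float = 0.1,
) -> Tuple[int, int, int]:
    """Turn split proportions into element counts (hypothetical helper)."""
    # Assumption: the three proportions must cover the whole dataset.
    if not math.isclose(train_split + val_split + test_split, 1.0):
        raise ValueError("split proportions must sum to 1")
    train_size = int(train_split * dataset_size)
    val_size = int(val_split * dataset_size)
    # The test split takes the remainder so that no element is dropped.
    test_size = dataset_size - train_size - val_size
    return train_size, val_size, test_size


sizes = split_sizes(1000)
```

With the default proportions, 1000 elements split into 800 for training, 100 for validation, and 100 for testing.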
- save_images(output_dir: str, prefix: str | Callable[[Any], str] | None = None, num_images: int | None = None)
Saves images from the dataset to a specified directory.
- Parameters:
output_dir (str) – The directory to save the images.
prefix (Optional[Union[str, Callable[[Any], str]]]) – The prefix for the image files. If callable, it should take the label as input and return a string. If None, a default prefix is used.
num_images (Optional[int]) – The number of images to save. If None, all images in the dataset are saved.
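The three accepted forms of prefix (a fixed string, a callable that receives the label, or None) can be sketched as follows. The helper resolve_prefix, the callable label_prefix, and the fallback value "image" are hypothetical; the source does not say what the default prefix is or how file names are assembled.

```python
from typing import Any, Callable, Optional, Union

Prefix = Optional[Union[str, Callable[[Any], str]]]


def resolve_prefix(prefix: Prefix, label: Any, default: str = "image") -> str:
    """Resolve the prefix for one image (hypothetical helper)."""
    if prefix is None:
        # Assumed placeholder; the real default prefix is not documented here.
        return default
    if callable(prefix):
        # A callable receives the label and must return a string.
        return prefix(label)
    return prefix


def label_prefix(label: Any) -> str:
    """Example callable: derive the prefix from the class label."""
    return f"class_{label}"


names = [resolve_prefix(p, 3) for p in ("sample", label_prefix, None)]
```

A callable such as label_prefix is the kind of value that could be passed as prefix when the file names should reflect each image's label.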
- backup_datasets()
Creates a backup of the current dataset container.
- restore_datasets()
Restores the dataset container from the backup.
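One way to picture the backup and restore pair is a container dict that is copied on backup and copied back on restore. The class ContainerWithBackup below is a hypothetical stand-in written for this sketch; the shallow copy and the error raised when no backup exists are assumptions, not documented DataHandler behaviour.

```python
import copy
from typing import Any, Dict, Optional


class ContainerWithBackup:
    """Minimal stand-in for a dataset container with backup support."""

    def __init__(self) -> None:
        self.datasets_container: Dict[str, Any] = {}
        self._backup: Optional[Dict[str, Any]] = None

    def backup_datasets(self) -> None:
        # Shallow copy: the dict is duplicated, the datasets themselves are not.
        self._backup = copy.copy(self.datasets_container)

    def restore_datasets(self) -> None:
        if self._backup is None:
            raise RuntimeError("no backup available")
        self.datasets_container = copy.copy(self._backup)


handler = ContainerWithBackup()
handler.datasets_container["complete_dataset"] = [1, 2, 3, 4]
handler.backup_datasets()

# Simulate a split that removes 'complete_dataset' from the container.
full = handler.datasets_container.pop("complete_dataset")
handler.datasets_container["train_dataset"] = full[:3]
handler.datasets_container["test_dataset"] = full[3:]

handler.restore_datasets()
restored_keys = sorted(handler.datasets_container)
```

In this sketch, taking a backup before a destructive step such as splitting makes it possible to return to the unsplit container afterwards.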