osl_dynamics.data#

Classes and functions used to handle data.

Note

New users may find the tutorials in the documentation helpful.

Submodules#

Package Contents#

Classes#

Data

Data Class.

SessionLabels

Class for session labels.

Functions#

load_tfrecord_dataset(tfrecord_dir, batch_size[, ...])

Load a TFRecord dataset.

class osl_dynamics.data.Data(inputs, data_field='X', picks=None, reject_by_annotation=None, sampling_frequency=None, mask_file=None, parcellation_file=None, time_axis_first=True, load_memmaps=False, store_dir='tmp', buffer_size=100000, use_tfrecord=False, session_labels=None, n_jobs=1)[source]#

Data Class.

The Data class enables the input and processing of data. When given a list of files, it produces a set of numpy memory maps which contain their raw data. It also provides methods for batching data and creating TensorFlow Datasets.

Parameters:
  • inputs (list of str or str or np.ndarray) –

    • A path to a directory containing .npy files. Each .npy file should be a subject or session.

    • A list of paths to .npy, .mat or .fif files. Each file should be a subject or session. If a .fif file is passed it must end with 'raw.fif' or 'epo.fif'.

    • A numpy array. The array will be treated as continuous data from the same subject.

    • A list of numpy arrays. Each numpy array should be the data for a subject or session.

    The data files or numpy arrays should be in the format (n_samples, n_channels). If your data is in (n_channels, n_samples) format, use time_axis_first=False.

  • data_field (str, optional) – If a MATLAB (.mat) file is passed, this is the field that corresponds to the time series data. By default we read the field 'X'. If a numpy (.npy) or fif (.fif) file is passed, this is ignored.

  • picks (str or list of str, optional) – Only used if a fif file is passed. We load the data using the mne.io.Raw.get_data method. We pass this argument to the Raw.get_data method. By default picks=None retrieves all channel types.

  • reject_by_annotation (str, optional) –

    Only used if a fif file is passed. We load the data using the mne.io.Raw.get_data method. We pass this argument to the Raw.get_data method. By default reject_by_annotation=None retrieves all time points. Use reject_by_annotation="omit" to remove segments marked as bad.

  • sampling_frequency (float, optional) – Sampling frequency of the data in Hz.

  • mask_file (str, optional) – Path to mask file used to source reconstruct the data.

  • parcellation_file (str, optional) – Path to parcellation file used to source reconstruct the data.

  • time_axis_first (bool, optional) – Is the input data of shape (n_samples, n_channels)? Default is True. If your data is in format (n_channels, n_samples), use time_axis_first=False.

  • load_memmaps (bool, optional) – Should we load the data as memory maps (memmaps)? If True, we store the data on disk rather than loading it into memory.

  • store_dir (str, optional) – If load_memmaps=True, then we save data to disk and load it as a memory map. This is the directory to save the memory maps to. Default is ./tmp.

  • buffer_size (int, optional) – Buffer size for shuffling a TensorFlow Dataset. Smaller values will lead to less random shuffling but will be quicker. Default is 100000.

  • use_tfrecord (bool, optional) – Should we save the data as a TensorFlow Record? This is recommended for training on large datasets. Default is False.

  • session_labels (list of SessionLabels, optional) – Extra session labels.

  • n_jobs (int, optional) – Number of processes to load the data in parallel. Default is 1, which loads data in serial.
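
Examples

A minimal construction sketch (the directory path and keyword values below are illustrative, not defaults):

from osl_dynamics.data import Data

# Each .npy file in the directory is loaded as a separate session
data = Data(
    "path/to/npy_dir",
    sampling_frequency=250,
    n_jobs=4,
)
print(data.n_sessions, data.n_channels)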

property raw_data#

Return raw data as a list of arrays.

property n_channels#

Number of channels in the data files.

property n_samples#

Number of samples for each array.

property n_sessions#

Number of arrays.

__iter__()[source]#
__getitem__(item)[source]#
__str__()[source]#

Return str(self).

set_keep(keep)[source]#

Context manager to temporarily set the kept arrays.

Parameters:

keep (int or list of int) – Indices to keep in the Data.arrays list.
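
Examples

A sketch of temporarily restricting the object to a subset of sessions (the indices are illustrative):

# Only sessions 0 and 1 are used inside this block
with data.set_keep([0, 1]):
    ts = data.time_series(concatenate=True)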

set_sampling_frequency(sampling_frequency)[source]#

Sets the sampling_frequency attribute.

Parameters:

sampling_frequency (float) – Sampling frequency in Hz.

set_buffer_size(buffer_size)[source]#

Set the buffer_size attribute.

Parameters:

buffer_size (int) – Buffer size for shuffling a TensorFlow Dataset. Smaller values will lead to less random shuffling but will be quicker.

time_series(prepared=True, concatenate=False)[source]#

Time series data for all arrays.

Parameters:
  • prepared (bool, optional) – Should we return the latest data after we have prepared it or the original data we loaded into the Data object?

  • concatenate (bool, optional) – Should we return the time series for each array concatenated?

Returns:

ts – Time series data for each array.

Return type:

list or np.ndarray
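
Examples

A short usage sketch:

# Prepared data for each session, as a list of arrays
ts = data.time_series()

# Original (unprepared) data concatenated into a single array
raw = data.time_series(prepared=False, concatenate=True)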

load_raw_data()[source]#

Import data into a list of memory maps.

Returns:

  • memmaps (list of np.memmap) – List of memory maps.

  • raw_data_filenames (list of str) – List of paths to the raw data memmaps.

validate_data()[source]#

Validate data files.

select(channels=None, sessions=None, use_raw=False)[source]#

Select channels and/or sessions to keep.

This is an in-place operation.

Parameters:
  • channels (int or list of int, optional) – Channel indices to keep. If None, all channels are retained.

  • sessions (int or list of int, optional) – Session indices to keep. If None, all sessions are retained.

  • use_raw (bool, optional) – Should we select channels from the original ‘raw’ data that we loaded?

Returns:

data – The modified Data object.

Return type:

osl_dynamics.data.Data

filter(low_freq=None, high_freq=None, use_raw=False)[source]#

Filter the data.

This is an in-place operation.

Parameters:
  • low_freq (float, optional) – Frequency in Hz for a high pass filter. If None, no high pass filtering is applied.

  • high_freq (float, optional) – Frequency in Hz for a low pass filter. If None, no low pass filtering is applied.

  • use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?

Returns:

data – The modified Data object.

Return type:

osl_dynamics.data.Data

downsample(freq, use_raw=False)[source]#

Downsample the data.

This is an in-place operation.

Parameters:
  • freq (float) – Frequency in Hz to downsample to.

  • use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?

Returns:

data – The modified Data object.

Return type:

osl_dynamics.data.Data

pca(n_pca_components=None, pca_components=None, whiten=False, use_raw=False)[source]#

Principal component analysis (PCA).

This function will first standardize the data then perform PCA. This is an in-place operation.

Parameters:
  • n_pca_components (int, optional) – Number of PCA components to keep. If None, then pca_components should be passed.

  • pca_components (np.ndarray, optional) – PCA components to apply if they have already been calculated. If None, then n_pca_components should be passed.

  • whiten (bool, optional) – Should we whiten the PCA’ed data?

  • use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?

Returns:

data – The modified Data object.

Return type:

osl_dynamics.data.Data

tde(n_embeddings, use_raw=False)[source]#

Time-delay embedding (TDE).

This is an in-place operation.

Parameters:
  • n_embeddings (int) – Number of data points to embed the data.

  • use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?

Returns:

data – The modified Data object.

Return type:

osl_dynamics.data.Data

tde_pca(n_embeddings, n_pca_components=None, pca_components=None, whiten=False, use_raw=False)[source]#

Time-delay embedding (TDE) and principal component analysis (PCA).

This function will first standardize the data, then perform TDE followed by PCA. It is useful to do both operations in a single method because it avoids having to save the time-embedded data. This is an in-place operation.

Parameters:
  • n_embeddings (int) – Number of data points to embed the data.

  • n_pca_components (int, optional) – Number of PCA components to keep. If None, then pca_components should be passed.

  • pca_components (np.ndarray, optional) – PCA components to apply if they have already been calculated. If None, then n_pca_components should be passed.

  • whiten (bool, optional) – Should we whiten the PCA’ed data?

  • use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?

Returns:

data – The modified Data object.

Return type:

osl_dynamics.data.Data
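
Examples

A sketch of calling this method directly on a Data object (the argument values are illustrative):

data.tde_pca(n_embeddings=15, n_pca_components=80)
data.standardize()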

amplitude_envelope(use_raw=False)[source]#

Calculate the amplitude envelope.

This is an in-place operation.

Parameters:

use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?

Returns:

data – The modified Data object.

Return type:

osl_dynamics.data.Data

moving_average(n_window, use_raw=False)[source]#

Calculate a moving average.

This is an in-place operation.

Parameters:
  • n_window (int) – Number of data points in the sliding window. Must be odd.

  • use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?

Returns:

data – The modified Data object.

Return type:

osl_dynamics.data.Data

standardize(use_raw=False)[source]#

Standardize (z-transform) the data.

This is an in-place operation.

Parameters:

use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?

Returns:

data – The modified Data object.

Return type:

osl_dynamics.data.Data

prepare(methods)[source]#

Prepare data.

Wrapper for calling a series of data preparation methods. Any method in Data can be called. Note that if the same method is called multiple times, the method name should be appended with an underscore and a number, e.g. standardize_1 and standardize_2.

Parameters:

methods (dict) – Each key is the name of a method to call. Each value is a dict containing keyword arguments to pass to the method.

Returns:

data – The modified Data object.

Return type:

osl_dynamics.data.Data

Examples

TDE-PCA data preparation:

methods = {
    "tde_pca": {"n_embeddings": 15, "n_pca_components": 80},
    "standardize": {},
}
data.prepare(methods)

Amplitude envelope data preparation:

methods = {
    "filter": {"low_freq": 1, "high_freq": 45},
    "amplitude_envelope": {},
    "moving_average": {"n_window": 5},
    "standardize": {},
}
data.prepare(methods)

trim_time_series(sequence_length=None, n_embeddings=None, n_window=None, prepared=True, concatenate=False, verbose=False)[source]#

Trims the data time series.

Removes the data points that are lost when the data is prepared, i.e. due to time embedding and separating into sequences, but does not perform time embedding or batching into sequences on the time series.

Parameters:
  • sequence_length (int, optional) – Length of the segment of data to feed into the model. Can be passed to trim the time points that are lost when separating into sequences.

  • n_embeddings (int, optional) – Number of data points used to embed the data. If None, then we use Data.n_embeddings (if it exists).

  • n_window (int, optional) – Number of data points in the sliding window applied to the data. If None, then we use Data.n_window (if it exists).

  • prepared (bool, optional) – Should we return the prepared data? If not we return the raw data.

  • concatenate (bool, optional) – Should we concatenate the data for each array?

  • verbose (bool, optional) – Should we print the number of data points we’re removing?

Returns:

Trimmed time series for each array.

Return type:

list of np.ndarray
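
Examples

A sketch of trimming the original data so it aligns with quantities derived from the prepared, sequenced data (the values are illustrative and should match those used during preparation and training):

trimmed = data.trim_time_series(
    sequence_length=200,
    n_embeddings=15,
    prepared=False,
    concatenate=True,
)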

count_sequences(sequence_length, step_size=None)[source]#

Count sequences.

Parameters:
  • sequence_length (int) – Length of the segment of data to feed into the model.

  • step_size (int, optional) – The number of samples by which to move the sliding window between sequences. Defaults to sequence_length.

Returns:

n – Number of sequences for each session’s data.

Return type:

np.ndarray

_create_data_dict(i, array)[source]#

Create a dictionary of data for a single session.

Parameters:
  • i (int) – Index of the session.

  • array (np.ndarray) – Time series data for a single session.

Returns:

data – Dictionary of data for a single session.

Return type:

dict

dataset(sequence_length, batch_size, shuffle=True, validation_split=None, concatenate=True, step_size=None, drop_last_batch=False)[source]#

Create a TensorFlow Dataset for training or evaluation.

Parameters:
  • sequence_length (int) – Length of the segment of data to feed into the model.

  • batch_size (int) – Number of sequences in each mini-batch used to train the model.

  • shuffle (bool, optional) – Should we shuffle sequences (within a batch) and batches?

  • validation_split (float, optional) – Ratio to split the dataset into a training and validation set.

  • concatenate (bool, optional) – Should we concatenate the datasets for each array?

  • step_size (int, optional) – Number of samples to slide the sequence across the dataset. Default is no overlap.

  • drop_last_batch (bool, optional) – Should we drop the last batch if it is smaller than the batch size?

Returns:

dataset – Dataset for training or evaluating the model along with the validation set if validation_split was passed.

Return type:

tf.data.Dataset or tuple
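
Examples

A usage sketch (argument values are illustrative):

# Single training dataset
training_dataset = data.dataset(sequence_length=200, batch_size=32)

# Training and validation datasets
training_dataset, validation_dataset = data.dataset(
    sequence_length=200,
    batch_size=32,
    validation_split=0.1,
)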

save_tfrecord_dataset(tfrecord_dir, sequence_length, step_size=None, overwrite=False)[source]#

Save the data as TFRecord files.

Parameters:
  • tfrecord_dir (str) – Directory to save the TFRecord datasets.

  • sequence_length (int) – Length of the segment of data to feed into the model.

  • step_size (int, optional) – Number of samples to slide the sequence across the dataset. Default is no overlap.

  • overwrite (bool, optional) – Should we overwrite the existing TFRecord datasets if there is a need?

tfrecord_dataset(sequence_length, batch_size, shuffle=True, validation_split=None, concatenate=True, step_size=None, drop_last_batch=False, tfrecord_dir=None, overwrite=False)[source]#

Create a TFRecord Dataset for training or evaluation.

Parameters:
  • sequence_length (int) – Length of the segment of data to feed into the model.

  • batch_size (int) – Number of sequences in each mini-batch used to train the model.

  • shuffle (bool, optional) – Should we shuffle sequences (within a batch) and batches?

  • validation_split (float, optional) – Ratio to split the dataset into a training and validation set.

  • concatenate (bool, optional) – Should we concatenate the datasets for each array?

  • step_size (int, optional) – Number of samples to slide the sequence across the dataset. Default is no overlap.

  • drop_last_batch (bool, optional) – Should we drop the last batch if it is smaller than the batch size?

  • tfrecord_dir (str, optional) – Directory to save the TFRecord datasets. If None, then Data.store_dir is used.

  • overwrite (bool, optional) – Should we overwrite the existing TFRecord datasets if there is a need?

Returns:

dataset – Dataset for training or evaluating the model.

Return type:

tf.data.Dataset

add_session_labels(label_name, label_values, label_type)[source]#

Add session labels as a new channel to the data.

Parameters:
  • label_name (str) – Name of the new channel.

  • label_values (np.ndarray) – Labels for each session.

  • label_type (str) – Type of label, either “categorical” or “continuous”.
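
Examples

A sketch of adding one label value per session (the label name and values are illustrative and assume three sessions):

import numpy as np

data.add_session_labels("age", np.array([23, 31, 45]), "continuous")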

get_session_labels()[source]#

Get the session labels.

Returns:

session_labels – List of session labels.

Return type:

List[SessionLabels]

save_preparation(output_dir='.')[source]#

Save a pickle file containing preparation settings.

Parameters:

output_dir (str) – Path to save data files to. Default is the current working directory.

load_preparation(inputs)[source]#

Loads a pickle file containing preparation settings.

Parameters:

inputs (str) – Path to directory containing the pickle file with preparation settings.

save(output_dir='.')[source]#

Saves (prepared) data to numpy files.

Parameters:

output_dir (str) – Path to save data files to. Default is the current working directory.

delete_dir()[source]#

Deletes store_dir.

class osl_dynamics.data.SessionLabels[source]#

Class for session labels.

Parameters:
  • name (str) – Name of the session label.

  • values (np.ndarray) – Value for each session. Must be a 1D array of numbers.

  • label_type (str) – Type of the session label. Options are “categorical” and “continuous”.

name: str#
values: numpy.ndarray#
label_type: str#
__post_init__()[source]#
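
Examples

Session labels can also be constructed directly and passed to Data via the session_labels argument (a sketch; names, values and the path are illustrative):

import numpy as np
from osl_dynamics.data import Data, SessionLabels

labels = SessionLabels(
    name="group",
    values=np.array([0, 0, 1]),
    label_type="categorical",
)
data = Data("path/to/npy_dir", session_labels=[labels])
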
osl_dynamics.data.load_tfrecord_dataset(tfrecord_dir, batch_size, shuffle=True, validation_split=None, concatenate=True, drop_last_batch=False, buffer_size=100000, keep=None)[source]#

Load a TFRecord dataset.

Parameters:
  • tfrecord_dir (str) – Directory containing the TFRecord datasets.

  • batch_size (int) – Number of sequences in each mini-batch used to train the model.

  • shuffle (bool, optional) – Should we shuffle sequences (within a batch) and batches?

  • validation_split (float, optional) – Ratio to split the dataset into a training and validation set.

  • concatenate (bool, optional) – Should we concatenate the datasets for each array?

  • drop_last_batch (bool, optional) – Should we drop the last batch if it is smaller than the batch size?

  • buffer_size (int, optional) – Buffer size for shuffling a TensorFlow Dataset. Smaller values will lead to less random shuffling but will be quicker. Default is 100000.

  • keep (list of int, optional) – List of session indices to keep. If None, then all sessions are kept.

Returns:

dataset – Dataset for training or evaluating the model along with the validation set if validation_split was passed.

Return type:

tf.data.Dataset or tuple
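
Examples

A round-trip sketch combining Data.save_tfrecord_dataset with this function (paths and values are illustrative):

from osl_dynamics.data import load_tfrecord_dataset

# Write TFRecord files from a prepared Data object
data.save_tfrecord_dataset("tfrecords", sequence_length=200)

# Later, load them without re-preparing the data
training_dataset = load_tfrecord_dataset("tfrecords", batch_size=32)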