osl_dynamics.data.base#

Base class for handling data.

Classes#

Data

Data Class.

SessionLabels

Class for session labels.

Module Contents#

class osl_dynamics.data.base.Data(inputs, data_field='X', picks=None, reject_by_annotation=None, sampling_frequency=None, mask_file=None, parcellation_file=None, time_axis_first=True, load_memmaps=False, store_dir='tmp', buffer_size=4000, use_tfrecord=False, session_labels=None, extra_channels=None, n_jobs=1)[source]#

Data Class.

The Data class enables the input and processing of data. When given a list of files, it loads their raw data as NumPy arrays, or as memory maps stored on disk when load_memmaps=True. It also provides methods for batching data and creating TensorFlow Datasets.

Parameters:
  • inputs (list of str or pathlib.Path or str or pathlib.Path or np.ndarray) –

    • A path (str or pathlib.Path) to a directory containing .npy files. Each .npy file should be a subject or session.

    • A list of paths (str or pathlib.Path) to .npy, .mat or .fif files. Each file should be a subject or session. If a .fif file is passed, its name must end with 'raw.fif' or 'epo.fif'.

    • A numpy array. The array will be treated as continuous data from the same subject.

    • A list of numpy arrays. Each numpy array should be the data for a subject or session.

    The data files or numpy arrays should be in the format (n_samples, n_channels). If your data is in (n_channels, n_samples) format, use time_axis_first=False.

  • data_field (str, optional) – If a MATLAB (.mat) file is passed, this is the field that corresponds to the time series data. By default we read the field 'X'. If a numpy (.npy) or fif (.fif) file is passed, this is ignored.

  • picks (str or list of str, optional) – Only used if a fif file is passed. The data is loaded with the mne.io.Raw.get_data method and this argument is passed to it. By default, picks=None retrieves all channel types.

  • reject_by_annotation (str, optional) –

    Only used if a fif file is passed. The data is loaded with the mne.io.Raw.get_data method and this argument is passed to it. By default, reject_by_annotation=None retrieves all time points. Use reject_by_annotation="omit" to remove segments marked as bad.

  • sampling_frequency (float, optional) – Sampling frequency of the data in Hz.

  • mask_file (str, optional) – Path to mask file used to source reconstruct the data.

  • parcellation_file (str, optional) – Path to parcellation file used to source reconstruct the data.

  • time_axis_first (bool, optional) – Is the input data of shape (n_samples, n_channels)? Default is True. If your data is in format (n_channels, n_samples), use time_axis_first=False.

  • load_memmaps (bool, optional) – Should we load the data as memory maps (memmaps)? If True, we will store the data on disk rather than loading it into memory.

  • store_dir (str, optional) – If load_memmaps=True, then we save data to disk and load it as a memory map. This is the directory to save the memory maps to. Default is ./tmp.

  • buffer_size (int, optional) – Buffer size for shuffling a TensorFlow Dataset. Smaller values will lead to less random shuffling but will be quicker. Default is 4000.

  • use_tfrecord (bool, optional) – Should we save the data as a TensorFlow Record? This is recommended for training on large datasets. Default is False.

  • session_labels (list of SessionLabels, optional) – Extra session labels.

  • extra_channels (dict, optional) – Extra channels to add to the data. The keys are the channel names and the values are the channel data.

  • n_jobs (int, optional) – Number of processes to load the data in parallel. Default is 1, which loads data in serial.
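The shape convention described for inputs and time_axis_first can be checked with plain NumPy before a Data object is constructed. The sketch below is illustrative only: the array is made up, and the manual transpose is simply one alternative a caller has to passing time_axis_first=False. It is not taken from the library's source.

```python
import numpy as np

# Hypothetical recording stored channels-first: 4 channels, 1000 samples.
channels_first = np.arange(4 * 1000, dtype=float).reshape(4, 1000)

# Data expects (n_samples, n_channels) by default, so either transpose by
# hand as below, or leave the array as-is and pass time_axis_first=False.
time_first = channels_first.T

n_samples, n_channels = time_first.shape
```

After the transpose, row t holds every channel's value at time point t, which is the layout the parameter descriptions above assume.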

data_field = 'X'[source]#
picks = None[source]#
reject_by_annotation = None[source]#
original_sampling_frequency = None[source]#
sampling_frequency = None[source]#
mask_file = None[source]#
parcellation_file = None[source]#
time_axis_first = True[source]#
load_memmaps = False[source]#
buffer_size = 4000[source]#
use_tfrecord = False[source]#
n_jobs = 1[source]#
inputs = [][source]#
store_dir[source]#
n_raw_data_channels[source]#
arrays[source]#
prepared_data_filenames[source]#
keep[source]#
property raw_data: List[numpy.ndarray][source]#

Return raw data as a list of arrays.

Return type:

List[numpy.ndarray]

property n_channels: int[source]#

Number of channels in the data files.

Return type:

int

property n_samples: int[source]#

Number of samples across all arrays.

Return type:

int

property n_sessions: int[source]#

Number of arrays.

Return type:

int

property input_shapes: Dict[source]#

Get the input shapes for the model.

Returns:

shapes – Dictionary of input shapes.

Return type:

dict

set_keep(keep)[source]#

Context manager to temporarily set the kept arrays.

Parameters:

keep (int or list of int) – Indices to keep in the Data.arrays list.
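A context manager that temporarily overrides an attribute typically saves the old value, swaps in the new one, and restores the old value on exit. The sketch below shows that pattern on a hypothetical ToyData class. The class, its kept_arrays helper, and the handling of a single int are assumptions made for illustration, not the library's implementation.

```python
from contextlib import contextmanager


class ToyData:
    """Hypothetical stand-in for Data, holding one array per session."""

    def __init__(self, arrays):
        self.arrays = arrays
        self.keep = list(range(len(arrays)))  # keep every session by default

    @contextmanager
    def set_keep(self, keep):
        # Accept a single index or a list of indices.
        new_keep = [keep] if isinstance(keep, int) else list(keep)
        old_keep = self.keep
        self.keep = new_keep
        try:
            yield
        finally:
            # Restore the previous selection even if the body raised.
            self.keep = old_keep

    def kept_arrays(self):
        return [self.arrays[i] for i in self.keep]


toy = ToyData(arrays=["session0", "session1", "session2"])
with toy.set_keep([0, 2]):
    inside = toy.kept_arrays()  # only sessions 0 and 2 are visible here
after = toy.kept_arrays()  # all sessions are visible again
```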

set_sampling_frequency(sampling_frequency)[source]#

Sets the sampling_frequency attribute.

Parameters:

sampling_frequency (float) – Sampling frequency in Hz.

Return type:

None

set_buffer_size(buffer_size)[source]#

Set the buffer_size attribute.

Parameters:

buffer_size (int) – Buffer size for shuffling a TensorFlow Dataset. Smaller values will lead to less random shuffling but will be quicker.

Return type:

None
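The trade-off between buffer size and randomness follows from how buffered shuffling works: a buffer is filled, a random element is emitted, and the next input element takes its place. The pure-Python sketch below is a simplified model of that idea. The function name and the seed argument are hypothetical, and it does not call TensorFlow.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Shuffle items through a fixed-size buffer (buffer_size must be >= 1)."""
    rng = random.Random(seed)
    buffer, output = [], []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Emit a random buffered element and replace it with the new item.
        idx = rng.randrange(buffer_size)
        output.append(buffer[idx])
        buffer[idx] = item
    # Input exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    output.extend(buffer)
    return output


items = list(range(20))
unshuffled = buffered_shuffle(items, buffer_size=1)  # no mixing at all
local = buffered_shuffle(items, buffer_size=5)  # only local mixing
```

With a buffer of 5, an element can be emitted at most 4 positions earlier than it arrived, so sequences from distant parts of a long recording are rarely mixed unless the buffer is large.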

time_series(prepared=True, concatenate=False)[source]#

Time series data for all arrays.

Parameters:
  • prepared (bool, optional) – Should we return the latest data after we have prepared it or the original data we loaded into the Data object?

  • concatenate (bool, optional) – Should we return the time series for each array concatenated?

Returns:

ts – Time series data for each array.

Return type:

list or np.ndarray

load_raw_data()[source]#

Import data into a list of memory maps.

Returns:

  • memmaps (list of np.memmap) – List of memory maps.

  • raw_data_filenames (list of str) – List of paths to the raw data memmaps.

Return type:

Tuple[List[numpy.ndarray], List[str]]

validate_data()[source]#

Validate data files.

Return type:

None

validate_extra_channels(data, extra_channels)[source]#

Validate extra channels.

Parameters:
  • data (List[numpy.ndarray])

  • extra_channels (Dict)

Return type:

Dict

select(channels=None, sessions=None, use_raw=False)[source]#

Select channels and/or sessions.

This is an in-place operation.

Parameters:
  • channels (int or list of int, optional) – Channel indices to keep. If None, all channels are retained.

  • sessions (int or list of int, optional) – Session indices to keep. If None, all sessions are retained.

  • use_raw (bool, optional) – Should we select channels from the original ‘raw’ data that we loaded?

Returns:

data – The modified Data object.

Return type:

osl_dynamics.data.Data

filter(low_freq=None, high_freq=None, order=5, use_raw=False)[source]#

Filter the data.

This is an in-place operation.

Parameters:
  • low_freq (float, optional) – Frequency in Hz for a high pass filter. If None, no high pass filtering is applied.

  • high_freq (float, optional) – Frequency in Hz for a low pass filter. If None, no low pass filtering is applied.

  • order (int, optional) – Order for a butterworth filter.

  • use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?

Returns:

data – The modified Data object.

Return type:

osl_dynamics.data.Data

downsample(freq, use_raw=False)[source]#

Downsample the data.

This is an in-place operation.

Parameters:
  • freq (float) – Frequency in Hz to downsample to.

  • use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?

Returns:

data – The modified Data object.

Return type:

osl_dynamics.data.Data

pca(n_pca_components=None, pca_components=None, whiten=False, use_raw=False)[source]#

Principal component analysis (PCA).

This function will first standardize the data then perform PCA. This is an in-place operation.

Parameters:
  • n_pca_components (int, optional) – Number of PCA components to keep. If None, then pca_components should be passed.

  • pca_components (np.ndarray, optional) – PCA components to apply if they have already been calculated. If None, then n_pca_components should be passed.

  • whiten (bool, optional) – Should we whiten the PCA’ed data?

  • use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?

Returns:

data – The modified Data object.

Return type:

osl_dynamics.data.Data
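The standardize-then-project steps described for pca can be sketched with NumPy alone. This is a minimal illustration on synthetic data, assuming an eigendecomposition of the channel covariance. How the library estimates the components (for example across several sessions) is not shown in this reference, so treat the details as assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical session: 500 samples, 6 correlated channels.
x = rng.standard_normal((500, 6)) @ rng.standard_normal((6, 6))

# Standardize each channel (zero mean, unit variance).
x = (x - x.mean(axis=0)) / x.std(axis=0)

# Eigendecomposition of the channel covariance, sorted by descending variance.
cov = np.cov(x, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

n_pca_components = 3
pca_components = eigvecs[:, :n_pca_components]  # (n_channels, n_pca_components)

# Whitening rescales each retained component to unit variance.
whitened_components = pca_components / np.sqrt(eigvals[:n_pca_components])

projected = x @ pca_components
whitened = x @ whitened_components
```

A matrix shaped like pca_components here is the kind of object that could be reused on new data, which is the role the pca_components argument plays above.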

tde(n_embeddings, use_raw=False)[source]#

Time-delay embedding (TDE).

This is an in-place operation.

Parameters:
  • n_embeddings (int) – Number of data points used to embed the data.

  • use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?

Returns:

data – The modified Data object.

Return type:

osl_dynamics.data.Data
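Time-delay embedding augments each time point with lagged copies of every channel, so the channel count grows by a factor of n_embeddings while n_embeddings - 1 time points are lost. The sketch below is one plain-NumPy way to do this. The function name is hypothetical, and the column ordering used by the library may differ.

```python
import numpy as np


def time_delay_embed(x, n_embeddings):
    """Stack n_embeddings lagged copies of x side by side.

    x has shape (n_samples, n_channels); the result has shape
    (n_samples - n_embeddings + 1, n_channels * n_embeddings).
    """
    n_out = x.shape[0] - n_embeddings + 1
    # One shifted copy of the data per lag, all trimmed to the same length.
    lagged = [x[lag : lag + n_out] for lag in range(n_embeddings)]
    return np.concatenate(lagged, axis=1)


x = np.arange(20).reshape(10, 2)  # 10 samples, 2 channels
embedded = time_delay_embed(x, n_embeddings=3)
```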

tde_pca(n_embeddings, n_pca_components=None, pca_components=None, whiten=False, use_raw=False)[source]#

Time-delay embedding (TDE) and principal component analysis (PCA).

This function will first standardize the data, then perform TDE, then PCA. It is useful to do both operations in a single method because it avoids having to save the time-embedded data. This is an in-place operation.

Parameters:
  • n_embeddings (int) – Number of data points used to embed the data.

  • n_pca_components (int, optional) – Number of PCA components to keep. If None, then pca_components should be passed.

  • pca_components (np.ndarray, optional) – PCA components to apply if they have already been calculated. If None, then n_pca_components should be passed.

  • whiten (bool, optional) – Should we whiten the PCA’ed data?

  • use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?

Returns:

data – The modified Data object.

Return type:

osl_dynamics.data.Data

amplitude_envelope(use_raw=False)[source]#

Calculate the amplitude envelope.

This is an in-place operation.

Parameters:

use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?

Returns:

data – The modified Data object.

Return type:

osl_dynamics.data.Data
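An amplitude envelope is commonly defined as the magnitude of the analytic signal. The sketch below computes it with NumPy's FFT, assuming that definition. The function name and the test signal are hypothetical, and the library's own implementation is not shown in this reference.

```python
import numpy as np


def amplitude_envelope(x):
    """Magnitude of the analytic signal of each channel (column) of x."""
    n_samples = x.shape[0]
    spectrum = np.fft.fft(x, axis=0)

    # Keep DC, double the positive frequencies, zero the negative ones.
    weights = np.zeros(n_samples)
    weights[0] = 1.0
    if n_samples % 2 == 0:
        weights[n_samples // 2] = 1.0  # Nyquist bin is kept once
        weights[1 : n_samples // 2] = 2.0
    else:
        weights[1 : (n_samples + 1) // 2] = 2.0

    analytic = np.fft.ifft(spectrum * weights[:, None], axis=0)
    return np.abs(analytic)


t = np.arange(1000) / 250.0  # 4 seconds at a hypothetical 250 Hz
carrier = np.cos(2 * np.pi * 10.0 * t)  # 10 Hz, a whole number of cycles
signal = np.column_stack([carrier, 0.5 * carrier])
envelope = amplitude_envelope(signal)
```

For these constant-amplitude oscillations the envelope is flat, at 1.0 for the first channel and 0.5 for the second.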

moving_average(n_window, use_raw=False)[source]#

Calculate a moving average.

This is an in-place operation.

Parameters:
  • n_window (int) – Number of data points in the sliding window. Must be odd.

  • use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?

Returns:

data – The modified Data object.

Return type:

osl_dynamics.data.Data
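A centred moving average with an odd window has a well-defined middle sample, which is one reason to require an odd n_window. The sketch below is a plain-NumPy version that drops n_window // 2 samples from each end. The function name is hypothetical, and edge handling in the library may differ.

```python
import numpy as np


def moving_average(x, n_window):
    """Centred moving average of each channel (column) of x."""
    if n_window % 2 == 0:
        raise ValueError("n_window must be odd")
    kernel = np.ones(n_window) / n_window
    # mode="valid" keeps only positions where the window fits entirely,
    # which removes n_window // 2 samples from each end.
    return np.column_stack(
        [np.convolve(x[:, c], kernel, mode="valid") for c in range(x.shape[1])]
    )


x = np.arange(10, dtype=float).reshape(10, 1)  # a single ramp channel
smoothed = moving_average(x, n_window=3)
```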

standardize(use_raw=False)[source]#

Standardize (z-score) the data.

This is an in-place operation.

Parameters:

use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?

Returns:

data – The modified Data object.

Return type:

osl_dynamics.data.Data

align_channel_signs(template_data=None, template_cov=None, n_init=3, n_iter=2500, max_flips=20, n_embeddings=1, standardize=True, use_raw=False)[source]#

Align the sign of each channel across sessions.

If no template data/covariance is passed, we use the median session.

Parameters:
  • template_data (np.ndarray or str, optional) – Data to align the sign of channels to. If str, the file will be read in the same way as the inputs to the Data object.

  • template_cov (np.ndarray or str, optional) – Covariance to align the sign of channels to. This must be the covariance of the time-delay embedded data. If str, must be the path to a .npy file.

  • n_init (int, optional) – Number of initializations.

  • n_iter (int, optional) – Number of sign flipping iterations per subject to perform.

  • max_flips (int, optional) – Maximum number of channels to flip in an iteration.

  • n_embeddings (int, optional) – We may want to compare the covariance of time-delay embedded data when aligning the signs. This is the number of embeddings. The returned data is not time-delay embedded.

  • standardize (bool, optional) – Should we standardize the data before comparing across sessions?

  • use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?

Returns:

data – The modified Data object.

Return type:

osl_dynamics.data.Data

prepare(methods)[source]#

Prepare data.

Wrapper for calling a series of data preparation methods. Any method in Data can be called. Note that if the same method is called multiple times, the method name should be appended with an underscore and a number, e.g. standardize_1 and standardize_2.

Parameters:

methods (dict) – Each key is the name of a method to call. Each value is a dict containing keyword arguments to pass to the method.

Returns:

data – The modified Data object.

Return type:

osl_dynamics.data.Data

Examples

TDE-PCA data preparation:

methods = {
    "tde_pca": {"n_embeddings": 15, "n_pca_components": 80},
    "standardize": {},
}
data.prepare(methods)

Amplitude envelope data preparation:

methods = {
    "filter": {"low_freq": 1, "high_freq": 45},
    "amplitude_envelope": {},
    "moving_average": {"n_window": 5},
    "standardize": {},
}
data.prepare(methods)
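
The wrapper behaviour described for prepare, including the numbered suffix for repeated methods, can be pictured as a loop over the dict that looks up each method by name. The sketch below uses a hypothetical ToyData class. The regular expression used to strip the suffix is an assumption about how such dispatch could work, not the library's code.

```python
import re


class ToyData:
    """Hypothetical stand-in for Data that records which methods ran."""

    def __init__(self):
        self.calls = []

    def standardize(self):
        self.calls.append(("standardize", {}))
        return self

    def moving_average(self, n_window):
        self.calls.append(("moving_average", {"n_window": n_window}))
        return self

    def prepare(self, methods):
        for key, kwargs in methods.items():
            # Strip a trailing "_<number>" so "standardize_1" calls standardize().
            name = re.sub(r"_\d+$", "", key)
            getattr(self, name)(**kwargs)
        return self


methods = {
    "standardize_1": {},
    "moving_average": {"n_window": 5},
    "standardize_2": {},
}
toy = ToyData().prepare(methods)
called = [name for name, _ in toy.calls]
```

Because dict keys must be unique, the numbered suffix is what allows the same method to appear twice in one methods dict.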

trim_time_series(sequence_length=None, n_embeddings=None, n_window=None, prepared=True, concatenate=False, verbose=False)[source]#

Trims the data time series.

Removes the data points that are lost when the data is prepared, i.e. due to time embedding and separating into sequences, but does not perform time embedding or batching into sequences on the time series.

Parameters:
  • sequence_length (int, optional) – Length of the segment of data to feed into the model. Can be passed to trim the time points that are lost when separating into sequences.

  • n_embeddings (int, optional) – Number of data points used to embed the data. If None, then we use Data.n_embeddings (if it exists).

  • n_window (int, optional) – Number of data points in the sliding window applied to the data. If None, then we use Data.n_window (if it exists).

  • prepared (bool, optional) – Should we return the prepared data? If not we return the raw data.

  • concatenate (bool, optional) – Should we concatenate the data for each array?

  • verbose (bool, optional) – Should we print the number of data points we’re removing?

Returns:

Trimmed time series for each array.

Return type:

list of np.ndarray
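The number of samples that survive trimming can be estimated with simple arithmetic. The helper below is a hypothetical sketch. It assumes that embedding and windowing each remove half a window (rounded down) from both ends, and that a trailing remainder shorter than sequence_length is dropped. The library's exact rules are not spelled out in this reference.

```python
def trimmed_length(n_samples, sequence_length=None, n_embeddings=None, n_window=None):
    """Samples left after trimming, under the assumptions stated above."""
    if n_embeddings:
        n_samples -= 2 * (n_embeddings // 2)  # lost to time-delay embedding
    if n_window:
        n_samples -= 2 * (n_window // 2)  # lost to the sliding window
    if sequence_length:
        n_samples -= n_samples % sequence_length  # incomplete final sequence
    return n_samples


# 10000 samples, 15 embeddings: 10000 - 14 = 9986, then 49 full sequences of 200.
remaining = trimmed_length(10_000, sequence_length=200, n_embeddings=15)
```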

count_sequences(sequence_length, step_size=None)[source]#

Count sequences.

Parameters:
  • sequence_length (int) – Length of the segment of data to feed into the model.

  • step_size (int, optional) – The number of samples by which to move the sliding window between sequences. Defaults to sequence_length.

Returns:

n – Number of sequences for each session’s data.

Return type:

np.ndarray
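For a sliding window of length sequence_length moved in steps of step_size, the number of complete sequences in one session follows from integer division. The function below is a hypothetical re-derivation of that count for a single session length. It does not call the library, and the library's rounding may differ.

```python
def count_sequences(n_samples, sequence_length, step_size=None):
    """Number of complete sequences that fit in n_samples."""
    step_size = step_size or sequence_length  # default: no overlap
    if n_samples < sequence_length:
        return 0
    return (n_samples - sequence_length) // step_size + 1


# One count per hypothetical session length, with non-overlapping sequences.
counts = [count_sequences(n, sequence_length=200) for n in (1000, 1150, 150)]

# Halving the step size roughly doubles the number of sequences.
overlapping = count_sequences(1000, sequence_length=200, step_size=100)
```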

dataset(sequence_length, batch_size, shuffle=True, validation_split=None, concatenate=True, step_size=None, drop_last_batch=False)[source]#

Create a TensorFlow Dataset for training or evaluation.

Parameters:
  • sequence_length (int) – Length of the segment of data to feed into the model.

  • batch_size (int) – Number of sequences in each mini-batch used to train the model.

  • shuffle (bool, optional) – Should we shuffle sequences (within a batch) and batches?

  • validation_split (float, optional) – Ratio to split the dataset into a training and validation set.

  • concatenate (bool, optional) – Should we concatenate the datasets for each array?

  • step_size (int, optional) – Number of samples to slide the sequence across the dataset. Default is no overlap.

  • drop_last_batch (bool, optional) – Should we drop the last batch if it is smaller than the batch size?

Returns:

dataset – Dataset for training or evaluating the model along with the validation set if validation_split was passed.

Return type:

tf.data.Dataset or tuple of tf.data.Dataset
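The roles of sequence_length, step_size, batch_size and drop_last_batch can be illustrated without TensorFlow by cutting an array into sequences and grouping them. The sketch below is a simplified NumPy model with a hypothetical function name. It omits shuffling and the validation split, and it assumes the input has at least sequence_length samples.

```python
import numpy as np


def make_batches(x, sequence_length, batch_size, step_size=None, drop_last_batch=False):
    """Cut x (n_samples, n_channels) into sequences, then group into batches."""
    step_size = step_size or sequence_length  # default: no overlap
    starts = range(0, x.shape[0] - sequence_length + 1, step_size)
    sequences = np.stack([x[s : s + sequence_length] for s in starts])
    batches = [
        sequences[i : i + batch_size] for i in range(0, len(sequences), batch_size)
    ]
    if drop_last_batch and len(batches[-1]) < batch_size:
        batches = batches[:-1]
    return batches


x = np.zeros((1050, 3))  # hypothetical session: 1050 samples, 3 channels
batches = make_batches(x, sequence_length=100, batch_size=4)
full_only = make_batches(x, sequence_length=100, batch_size=4, drop_last_batch=True)
```

Here the final 50 samples do not fill a sequence and are discarded, and the ten remaining sequences form batches of 4, 4 and 2.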

save_tfrecord_dataset(tfrecord_dir, sequence_length, step_size=None, validation_split=None, overwrite=False)[source]#

Save the data as TFRecord files.

Parameters:
  • tfrecord_dir (str) – Directory to save the TFRecord datasets.

  • sequence_length (int) – Length of the segment of data to feed into the model.

  • step_size (int, optional) – Number of samples to slide the sequence across the dataset. Default is no overlap.

  • validation_split (float, optional) – Ratio to split the dataset into a training and validation set.

  • overwrite (bool, optional) – Should we overwrite the existing TFRecord datasets if necessary?

Return type:

None

tfrecord_dataset(sequence_length, batch_size, shuffle=True, validation_split=None, concatenate=True, step_size=None, drop_last_batch=False, tfrecord_dir=None, overwrite=False)[source]#

Create a TFRecord Dataset for training or evaluation.

Parameters:
  • sequence_length (int) – Length of the segment of data to feed into the model.

  • batch_size (int) – Number of sequences in each mini-batch used to train the model.

  • shuffle (bool, optional) – Should we shuffle sequences (within a batch) and batches?

  • validation_split (float, optional) – Ratio to split the dataset into a training and validation set.

  • concatenate (bool, optional) – Should we concatenate the datasets for each array?

  • step_size (int, optional) – Number of samples to slide the sequence across the dataset. Default is no overlap.

  • drop_last_batch (bool, optional) – Should we drop the last batch if it is smaller than the batch size?

  • tfrecord_dir (str, optional) – Directory to save the TFRecord datasets. If None, then Data.store_dir is used.

  • overwrite (bool, optional) – Should we overwrite the existing TFRecord datasets if necessary?

Returns:

dataset – Dataset for training or evaluating the model along with the validation set if validation_split was passed.

Return type:

tf.data.TFRecordDataset or tuple of tf.data.TFRecordDataset

add_session_labels(label_name, label_values, label_type)[source]#

Add session labels as a new channel to the data.

Parameters:
  • label_name (str) – Name of the new channel.

  • label_values (np.ndarray) – Labels for each session.

  • label_type (str) – Type of label, either “categorical” or “continuous”.

Return type:

None
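Turning one label per session into a channel means repeating that label for every time point of the session. The sketch below shows this broadcasting with NumPy on made-up session lengths. It is an assumption about what "a new channel" looks like, not a copy of the library's code.

```python
import numpy as np

# Hypothetical number of samples in each of three sessions.
session_lengths = [300, 500, 200]

# One categorical label per session, e.g. a group index.
label_values = np.array([0, 1, 0])

# Repeat each session's label once per time point of that session.
label_channel = [
    np.full(n_samples, value)
    for n_samples, value in zip(session_lengths, label_values)
]
```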

add_extra_channel(channel_name, channel_values)[source]#

Add an extra channel to the data.

Parameters:
  • channel_name (str)

  • channel_values (List[numpy.ndarray])

Return type:

None

get_session_labels()[source]#

Get the session labels.

Returns:

session_labels – List of session labels.

Return type:

List[SessionLabels]

recommend_model_config()[source]#

Recommends arguments for a model config based on the data.

Return type:

None

save_preparation(output_dir='.')[source]#

Save a pickle file containing preparation settings.

Parameters:

output_dir (str) – Path to save data files to. Default is the current working directory.

Return type:

None

load_preparation(inputs)[source]#

Loads a pickle file containing preparation settings.

Parameters:

inputs (str) – Path to directory containing the pickle file with preparation settings.

Return type:

None

save(output_dir='.', outnames=None)[source]#

Saves (prepared) data.

The ordering of the saved files matches the order of the input files.

Parameters:
  • output_dir (str) – Path to save data files to. Default is the current working directory.

  • outnames (list of str, optional) – Names/IDs for output files (excluding the extension).

Return type:

None

delete_dir()[source]#

Deletes store_dir.

Return type:

None

class osl_dynamics.data.base.SessionLabels[source]#

Class for session labels.

Parameters:
  • name (str) – Name of the session label.

  • values (np.ndarray) – Value for each session. Must be a 1D array of numbers.

  • label_type (str) – Type of the session label. Options are “categorical” and “continuous”.

name: str[source]#
values: numpy.ndarray[source]#
label_type: str[source]#