osl_dynamics.data#
Classes and functions used to handle data.
Submodules#
Package Contents#
Classes#
- Data – Data Class.
- SessionLabels – Class for session labels.
Functions#
- load_tfrecord_dataset – Load a TFRecord dataset.
- class osl_dynamics.data.Data(inputs, data_field='X', picks=None, reject_by_annotation=None, sampling_frequency=None, mask_file=None, parcellation_file=None, time_axis_first=True, load_memmaps=False, store_dir='tmp', buffer_size=100000, use_tfrecord=False, session_labels=None, n_jobs=1)[source]#
Data Class.
The Data class enables the input and processing of data. When given a list of files, it produces a set of numpy memory maps which contain their raw data. It also provides methods for batching data and creating TensorFlow Datasets.
- Parameters:
inputs (list of str or str or np.ndarray) –
- A path to a directory containing .npy files. Each .npy file should be a subject or session.
- A list of paths to .npy, .mat or .fif files. Each file should be a subject or session. If a .fif file is passed it must end with 'raw.fif' or 'epo.fif'.
- A numpy array. The array will be treated as continuous data from the same subject.
- A list of numpy arrays. Each numpy array should be the data for a subject or session.
The data files or numpy arrays should be in the format (n_samples, n_channels). If your data is in (n_channels, n_samples) format, use time_axis_first=False.
data_field (str, optional) – If a MATLAB (.mat) file is passed, this is the field that corresponds to the time series data. By default we read the field 'X'. If a numpy (.npy) or fif (.fif) file is passed, this is ignored.
picks (str or list of str, optional) – Only used if a fif file is passed. We load the data using the mne.io.Raw.get_data method and pass this argument to it. By default picks=None retrieves all channel types.
reject_by_annotation (str, optional) – Only used if a fif file is passed. We load the data using the mne.io.Raw.get_data method and pass this argument to it. By default reject_by_annotation=None retrieves all time points. Use reject_by_annotation="omit" to remove segments marked as bad.
sampling_frequency (float, optional) – Sampling frequency of the data in Hz.
mask_file (str, optional) – Path to mask file used to source reconstruct the data.
parcellation_file (str, optional) – Path to parcellation file used to source reconstruct the data.
time_axis_first (bool, optional) – Is the input data of shape (n_samples, n_channels)? Default is True. If your data is in format (n_channels, n_samples), use time_axis_first=False.
load_memmaps (bool, optional) – Should we load the data as memory maps (memmaps)? If True, we will store the data on disk rather than loading it into memory.
store_dir (str, optional) – If load_memmaps=True, then we save data to disk and load it as a memory map. This is the directory to save the memory maps to. Default is ./tmp.
buffer_size (int, optional) – Buffer size for shuffling a TensorFlow Dataset. Smaller values will lead to less random shuffling but will be quicker. Default is 100000.
use_tfrecord (bool, optional) – Should we save the data as a TensorFlow Record? This is recommended for training on large datasets. Default is False.
session_labels (list of SessionLabels, optional) – Extra session labels.
n_jobs (int, optional) – Number of processes to load the data in parallel. Default is 1, which loads data in serial.
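For illustration, input arrays in the expected (n_samples, n_channels) format might look like the following (array names and shapes are hypothetical; the Data call is shown only as a comment):

```python
import numpy as np

# Two hypothetical sessions in (n_samples, n_channels) format;
# all sessions must share the same number of channels.
session_1 = np.random.randn(1000, 42)
session_2 = np.random.randn(1200, 42)

# These could then be loaded with, e.g.:
#     from osl_dynamics.data import Data
#     data = Data([session_1, session_2])
print(session_1.shape[1], session_2.shape[1])  # 42 42
```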
- property raw_data#
Return raw data as a list of arrays.
- property n_channels#
Number of channels in the data files.
- property n_samples#
Number of samples for each array.
- property n_sessions#
Number of arrays.
- set_keep(keep)[source]#
Context manager to temporarily set the kept arrays.
- Parameters:
keep (int or list of int) – Indices to keep in the Data.arrays list.
- set_sampling_frequency(sampling_frequency)[source]#
Sets the sampling_frequency attribute.
- Parameters:
sampling_frequency (float) – Sampling frequency in Hz.
- set_buffer_size(buffer_size)[source]#
Set the buffer_size attribute.
- Parameters:
buffer_size (int) – Buffer size for shuffling a TensorFlow Dataset. Smaller values will lead to less random shuffling but will be quicker.
- time_series(prepared=True, concatenate=False)[source]#
Time series data for all arrays.
- Parameters:
prepared (bool, optional) – Should we return the latest data after we have prepared it or the original data we loaded into the Data object?
concatenate (bool, optional) – Should we return the time series for each array concatenated?
- Returns:
ts – Time series data for each array.
- Return type:
list or np.ndarray
- load_raw_data()[source]#
Import data into a list of memory maps.
- Returns:
memmaps (list of np.memmap) – List of memory maps.
raw_data_filenames (list of str) – List of paths to the raw data memmaps.
- select(channels=None, sessions=None, use_raw=False)[source]#
Select channels and/or sessions to keep.
This is an in-place operation.
- Parameters:
channels (int or list of int, optional) – Channel indices to keep. If None, all channels are retained.
sessions (int or list of int, optional) – Session indices to keep. If None, all sessions are retained.
use_raw (bool, optional) – Should we select channels from the original ‘raw’ data that we loaded?
- Returns:
data – The modified Data object.
- Return type:
osl_dynamics.data.Data
- filter(low_freq=None, high_freq=None, use_raw=False)[source]#
Filter the data.
This is an in-place operation.
- Parameters:
low_freq (float, optional) – Frequency in Hz for a high pass filter. If None, no high pass filtering is applied.
high_freq (float, optional) – Frequency in Hz for a low pass filter. If None, no low pass filtering is applied.
use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?
- Returns:
data – The modified Data object.
- Return type:
osl_dynamics.data.Data
- downsample(freq, use_raw=False)[source]#
Downsample the data.
This is an in-place operation.
- Parameters:
freq (float) – Frequency in Hz to downsample to.
use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?
- Returns:
data – The modified Data object.
- Return type:
osl_dynamics.data.Data
- pca(n_pca_components=None, pca_components=None, whiten=False, use_raw=False)[source]#
Principal component analysis (PCA).
This function will first standardize the data then perform PCA. This is an in-place operation.
- Parameters:
n_pca_components (int, optional) – Number of PCA components to keep. If None, then pca_components should be passed.
pca_components (np.ndarray, optional) – PCA components to apply if they have already been calculated. If None, then n_pca_components should be passed.
whiten (bool, optional) – Should we whiten the PCA’ed data?
use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?
- Returns:
data – The modified Data object.
- Return type:
osl_dynamics.data.Data
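As an illustration of what this step does, here is a minimal numpy sketch of standardize-then-PCA (not the library's implementation; the whitening convention used here is an assumption):

```python
import numpy as np

def pca_sketch(x, n_pca_components, whiten=False):
    # Standardize each channel first, as described above
    x = (x - x.mean(axis=0)) / x.std(axis=0)
    # SVD gives the principal directions in the rows of vt
    _, s, vt = np.linalg.svd(x, full_matrices=False)
    components = vt[:n_pca_components].T  # (n_channels, n_pca_components)
    y = x @ components
    if whiten:
        # Scale each component to unit variance (assumed convention)
        y = y / (s[:n_pca_components] / np.sqrt(len(x)))
    return y

x = np.random.randn(1000, 20)
y = pca_sketch(x, n_pca_components=5, whiten=True)
print(y.shape)  # (1000, 5)
```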
- tde(n_embeddings, use_raw=False)[source]#
Time-delay embedding (TDE).
This is an in-place operation.
- Parameters:
n_embeddings (int) – Number of data points to embed the data.
use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?
- Returns:
data – The modified Data object.
- Return type:
osl_dynamics.data.Data
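To make the shape bookkeeping concrete, here is a minimal numpy sketch of time-delay embedding (illustrative only; the library's lag ordering and edge conventions may differ):

```python
import numpy as np

def tde_sketch(x, n_embeddings):
    # Replace each channel with n_embeddings lagged copies of itself:
    # (n_samples, n_channels) -> (n_samples - n_embeddings + 1,
    #                             n_channels * n_embeddings)
    n_samples, _ = x.shape
    n_kept = n_samples - n_embeddings + 1
    lags = [x[lag : lag + n_kept] for lag in range(n_embeddings)]
    return np.concatenate(lags, axis=1)

x = np.random.randn(1000, 5)
y = tde_sketch(x, n_embeddings=15)
print(y.shape)  # (986, 75)
```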
- tde_pca(n_embeddings, n_pca_components=None, pca_components=None, whiten=False, use_raw=False)[source]#
Time-delay embedding (TDE) and principal component analysis (PCA).
This function will first standardize the data, then perform TDE, then PCA. It is useful to do both operations in a single method because it avoids having to save the time-embedded data. This is an in-place operation.
- Parameters:
n_embeddings (int) – Number of data points to embed the data.
n_pca_components (int, optional) – Number of PCA components to keep. If None, then pca_components should be passed.
pca_components (np.ndarray, optional) – PCA components to apply if they have already been calculated. If None, then n_pca_components should be passed.
whiten (bool, optional) – Should we whiten the PCA’ed data?
use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?
- Returns:
data – The modified Data object.
- Return type:
osl_dynamics.data.Data
- amplitude_envelope(use_raw=False)[source]#
Calculate the amplitude envelope.
This is an in-place operation.
- Parameters:
use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?
- Returns:
data (osl_dynamics.data.Data) – The modified Data object.
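A common definition of the amplitude envelope is the magnitude of the analytic (Hilbert-transformed) signal; assuming that is what is computed here, a self-contained numpy sketch is:

```python
import numpy as np

def amplitude_envelope_sketch(x):
    # Analytic signal via the FFT (equivalent to scipy.signal.hilbert),
    # then its magnitude. x has shape (n_samples, n_channels).
    n = x.shape[0]
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1 : n // 2] = 2.0
    else:
        h[1 : (n + 1) // 2] = 2.0
    analytic = np.fft.ifft(np.fft.fft(x, axis=0) * h[:, None], axis=0)
    return np.abs(analytic)

t = np.arange(1000)
x = np.cos(2 * np.pi * 10 * t / 1000)[:, None]  # pure 10-cycle tone
env = amplitude_envelope_sketch(x)
print(env.mean())  # ~1.0: a pure tone has a flat unit envelope
```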
- moving_average(n_window, use_raw=False)[source]#
Calculate a moving average.
This is an in-place operation.
- Parameters:
n_window (int) – Number of data points in the sliding window. Must be odd.
use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?
- Returns:
data – The modified Data object.
- Return type:
osl_dynamics.data.Data
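A minimal numpy sketch of the windowing (illustrative; the edge handling here, dropping incomplete windows, is an assumption):

```python
import numpy as np

def moving_average_sketch(x, n_window):
    # Sliding-window mean over each channel; n_window should be odd.
    # (n_samples, n_channels) -> (n_samples - n_window + 1, n_channels)
    kernel = np.ones(n_window) / n_window
    return np.stack(
        [np.convolve(x[:, c], kernel, mode="valid") for c in range(x.shape[1])],
        axis=1,
    )

x = np.random.randn(1000, 3)
y = moving_average_sketch(x, n_window=5)
print(y.shape)  # (996, 3)
```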
- standardize(use_raw=False)[source]#
Standardize (z-transform) the data.
This is an in-place operation.
- Parameters:
use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?
- Returns:
data (osl_dynamics.data.Data) – The modified Data object.
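Standardization here is a per-channel z-transform; a short numpy sketch:

```python
import numpy as np

def standardize_sketch(x):
    # Subtract each channel's mean and divide by its standard deviation
    return (x - x.mean(axis=0)) / x.std(axis=0)

x = 3.0 + 2.0 * np.random.randn(1000, 4)
z = standardize_sketch(x)
print(np.allclose(z.mean(axis=0), 0), np.allclose(z.std(axis=0), 1))  # True True
```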
- prepare(methods)[source]#
Prepare data.
Wrapper for calling a series of data preparation methods. Any method in Data can be called. Note that if the same method is called multiple times, the method name should be appended with an underscore and a number, e.g. standardize_1 and standardize_2.
- Parameters:
methods (dict) – Each key is the name of a method to call. Each value is a dict containing keyword arguments to pass to the method.
- Returns:
data – The modified Data object.
- Return type:
osl_dynamics.data.Data
Examples
TDE-PCA data preparation:
```python
methods = {
    "tde_pca": {"n_embeddings": 15, "n_pca_components": 80},
    "standardize": {},
}
data.prepare(methods)
```
Amplitude envelope data preparation:
```python
methods = {
    "filter": {"low_freq": 1, "high_freq": 45},
    "amplitude_envelope": {},
    "moving_average": {"n_window": 5},
    "standardize": {},
}
data.prepare(methods)
```
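Since dict keys must be unique, the underscore-number suffix is what lets the same method appear twice; a sketch of the (assumed) name handling, with a hypothetical pipeline that standardizes both before and after smoothing:

```python
# Hypothetical pipeline calling standardize twice
methods = {
    "standardize_1": {},
    "moving_average": {"n_window": 5},
    "standardize_2": {},
}
# Trailing "_<number>" is stripped to recover the Data method to call
base_names = [
    k.rsplit("_", 1)[0] if k.rsplit("_", 1)[-1].isdigit() else k
    for k in methods
]
print(base_names)  # ['standardize', 'moving_average', 'standardize']
```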
- trim_time_series(sequence_length=None, n_embeddings=None, n_window=None, prepared=True, concatenate=False, verbose=False)[source]#
Trims the data time series.
Removes the data points that are lost when the data is prepared, i.e. due to time embedding and separating into sequences, but does not perform time embedding or batching into sequences on the time series.
- Parameters:
sequence_length (int, optional) – Length of the segment of data to feed into the model. Can be passed to trim the time points that are lost when separating into sequences.
n_embeddings (int, optional) – Number of data points used to embed the data. If None, then we use Data.n_embeddings (if it exists).
n_window (int, optional) – Number of data points in the sliding window applied to the data. If None, then we use Data.n_window (if it exists).
prepared (bool, optional) – Should we return the prepared data? If not we return the raw data.
concatenate (bool, optional) – Should we concatenate the data for each array?
verbose (bool, optional) – Should we print the number of data points we’re removing?
- Returns:
Trimmed time series for each array.
- Return type:
list of np.ndarray
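A rough sketch of the bookkeeping (the exact conventions are assumptions, not the library's code): time-delay embedding costs n_embeddings - 1 samples, and separating into sequences drops the remainder at the end:

```python
def n_samples_after_trim(n_samples, n_embeddings=1, sequence_length=None):
    # Samples surviving time-delay embedding (assumed convention)
    n = n_samples - (n_embeddings - 1)
    if sequence_length is not None:
        # Drop the partial sequence left over at the end
        n = (n // sequence_length) * sequence_length
    return n

print(n_samples_after_trim(1000, n_embeddings=15, sequence_length=200))  # 800
```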
- count_sequences(sequence_length, step_size=None)[source]#
Count sequences.
- Parameters:
sequence_length (int) – Length of the segment of data to feed into the model.
step_size (int, optional) – The number of samples by which to move the sliding window between sequences. Defaults to sequence_length.
- Returns:
n – Number of sequences for each session’s data.
- Return type:
np.ndarray
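For intuition, the count likely follows the standard sliding-window formula (an assumption, shown here as a sketch rather than the library's code):

```python
def count_sequences_sketch(n_samples, sequence_length, step_size=None):
    # Default: non-overlapping sequences (step_size == sequence_length)
    step_size = step_size or sequence_length
    return max(0, (n_samples - sequence_length) // step_size + 1)

print(count_sequences_sketch(1000, 200))                 # 5
print(count_sequences_sketch(1000, 200, step_size=100))  # 9
```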
- _create_data_dict(i, array)[source]#
Create a dictionary of data for a single session.
- Parameters:
i (int) – Index of the session.
array (np.ndarray) – Time series data for a single session.
- Returns:
data – Dictionary of data for a single session.
- Return type:
dict
- dataset(sequence_length, batch_size, shuffle=True, validation_split=None, concatenate=True, step_size=None, drop_last_batch=False)[source]#
Create a TensorFlow Dataset for training or evaluation.
- Parameters:
sequence_length (int) – Length of the segment of data to feed into the model.
batch_size (int) – Number of sequences in each mini-batch used to train the model.
shuffle (bool, optional) – Should we shuffle sequences (within a batch) and batches?
validation_split (float, optional) – Ratio to split the dataset into a training and validation set.
concatenate (bool, optional) – Should we concatenate the datasets for each array?
step_size (int, optional) – Number of samples to slide the sequence across the dataset. Default is no overlap.
drop_last_batch (bool, optional) – Should we drop the last batch if it is smaller than the batch size?
- Returns:
dataset – Dataset for training or evaluating the model, along with the validation set if validation_split was passed.
- Return type:
tf.data.Dataset or tuple
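To see how sequence_length, batch_size and drop_last_batch interact, a small arithmetic sketch (non-overlapping sequences assumed; the numbers are hypothetical):

```python
n_samples, sequence_length, batch_size = 20000, 200, 32

n_sequences = n_samples // sequence_length       # 100 sequences
n_batches = -(-n_sequences // batch_size)        # ceil division -> 4
n_batches_dropped = n_sequences // batch_size    # drop_last_batch=True -> 3

print(n_sequences, n_batches, n_batches_dropped)  # 100 4 3
```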
- save_tfrecord_dataset(tfrecord_dir, sequence_length, step_size=None, overwrite=False)[source]#
Save the data as TFRecord files.
- Parameters:
tfrecord_dir (str) – Directory to save the TFRecord datasets.
sequence_length (int) – Length of the segment of data to feed into the model.
step_size (int, optional) – Number of samples to slide the sequence across the dataset. Default is no overlap.
overwrite (bool, optional) – Should we overwrite the existing TFRecord datasets if needed?
- tfrecord_dataset(sequence_length, batch_size, shuffle=True, validation_split=None, concatenate=True, step_size=None, drop_last_batch=False, tfrecord_dir=None, overwrite=False)[source]#
Create a TFRecord Dataset for training or evaluation.
- Parameters:
sequence_length (int) – Length of the segment of data to feed into the model.
batch_size (int) – Number of sequences in each mini-batch used to train the model.
shuffle (bool, optional) – Should we shuffle sequences (within a batch) and batches?
validation_split (float, optional) – Ratio to split the dataset into a training and validation set.
concatenate (bool, optional) – Should we concatenate the datasets for each array?
step_size (int, optional) – Number of samples to slide the sequence across the dataset. Default is no overlap.
drop_last_batch (bool, optional) – Should we drop the last batch if it is smaller than the batch size?
tfrecord_dir (str, optional) – Directory to save the TFRecord datasets. If None, then Data.store_dir is used.
overwrite (bool, optional) – Should we overwrite the existing TFRecord datasets if needed?
- Returns:
dataset – Dataset for training or evaluating the model.
- Return type:
tf.data.Dataset
- add_session_labels(label_name, label_values, label_type)[source]#
Add session labels as a new channel to the data.
- Parameters:
label_name (str) – Name of the new channel.
label_values (np.ndarray) – Labels for each session.
label_type (str) – Type of label, either “categorical” or “continuous”.
- get_session_labels()[source]#
Get the session labels.
- Returns:
session_labels – List of session labels.
- Return type:
List[SessionLabels]
- save_preparation(output_dir='.')[source]#
Save a pickle file containing preparation settings.
- Parameters:
output_dir (str) – Path to save data files to. Default is the current working directory.
- load_preparation(inputs)[source]#
Loads a pickle file containing preparation settings.
- Parameters:
inputs (str) – Path to directory containing the pickle file with preparation settings.
- class osl_dynamics.data.SessionLabels[source]#
Class for session labels.
- Parameters:
name (str) – Name of the session label.
values (np.ndarray) – Value for each session. Must be a 1D array of numbers.
label_type (str) – Type of the session label. Options are “categorical” and “continuous”.
- name: str#
- values: numpy.ndarray#
- label_type: str#
- osl_dynamics.data.load_tfrecord_dataset(tfrecord_dir, batch_size, shuffle=True, validation_split=None, concatenate=True, drop_last_batch=False, buffer_size=100000, keep=None)[source]#
Load a TFRecord dataset.
- Parameters:
tfrecord_dir (str) – Directory containing the TFRecord datasets.
batch_size (int) – Number of sequences in each mini-batch used to train the model.
shuffle (bool, optional) – Should we shuffle sequences (within a batch) and batches?
validation_split (float, optional) – Ratio to split the dataset into a training and validation set.
concatenate (bool, optional) – Should we concatenate the datasets for each array?
drop_last_batch (bool, optional) – Should we drop the last batch if it is smaller than the batch size?
buffer_size (int, optional) – Buffer size for shuffling a TensorFlow Dataset. Smaller values will lead to less random shuffling but will be quicker. Default is 100000.
keep (list of int, optional) – List of session indices to keep. If None, then all sessions are kept.
- Returns:
dataset – Dataset for training or evaluating the model, along with the validation set if validation_split was passed.
- Return type:
tf.data.Dataset or tuple