osl_dynamics.data.base
Base class for handling data.
Classes
Data | Data Class.
SessionLabels | Class for session labels.
Module Contents
- class osl_dynamics.data.base.Data(inputs, data_field='X', picks=None, reject_by_annotation=None, sampling_frequency=None, mask_file=None, parcellation_file=None, time_axis_first=True, load_memmaps=False, store_dir='tmp', buffer_size=4000, use_tfrecord=False, session_labels=None, extra_channels=None, n_jobs=1)
Data Class.
The Data class enables the input and processing of data. When given a list of files, it produces a set of numpy memory maps which contain their raw data. It also provides methods for batching data and creating TensorFlow Datasets.
- Parameters:
inputs (list of str or pathlib.Path or str or pathlib.Path or np.ndarray) – One of the following:
  - A path (str or pathlib.Path) to a directory containing .npy files. Each .npy file should be a subject or session.
  - A list of paths (str or pathlib.Path) to .npy, .mat or .fif files. Each file should be a subject or session. If a .fif file is passed, its name must end with 'raw.fif' or 'epo.fif'.
  - A numpy array. The array will be treated as continuous data from the same subject.
  - A list of numpy arrays. Each numpy array should be the data for a subject or session.
The data files or numpy arrays should be in the format (n_samples, n_channels). If your data is in (n_channels, n_samples) format, use time_axis_first=False.
data_field (str, optional) – If a MATLAB (.mat) file is passed, this is the field that corresponds to the time series data. By default we read the field 'X'. If a numpy (.npy) or fif (.fif) file is passed, this is ignored.
picks (str or list of str, optional) – Only used if a fif file is passed. The data is loaded with the mne.io.Raw.get_data method and this argument is passed to it. By default, picks=None retrieves all channel types.
reject_by_annotation (str, optional) – Only used if a fif file is passed. The data is loaded with the mne.io.Raw.get_data method and this argument is passed to it. By default, reject_by_annotation=None retrieves all time points. Use reject_by_annotation="omit" to remove segments marked as bad.
sampling_frequency (float, optional) – Sampling frequency of the data in Hz.
mask_file (str, optional) – Path to mask file used to source reconstruct the data.
parcellation_file (str, optional) – Path to parcellation file used to source reconstruct the data.
time_axis_first (bool, optional) – Is the input data of shape (n_samples, n_channels)? Default is True. If your data is in format (n_channels, n_samples), use time_axis_first=False.
load_memmaps (bool, optional) – Should we load the data as memory maps (memmaps)? If True, we store the data on disk rather than loading it into memory.
store_dir (str, optional) – If load_memmaps=True, then we save data to disk and load it as a memory map. This is the directory to save the memory maps to. Default is ./tmp.
buffer_size (int, optional) – Buffer size for shuffling a TensorFlow Dataset. Smaller values will lead to less random shuffling but will be quicker. Default is 4000.
use_tfrecord (bool, optional) – Should we save the data as a TensorFlow Record? This is recommended for training on large datasets. Default is False.
session_labels (list of SessionLabels, optional) – Extra session labels.
extra_channels (dict, optional) – Extra channels to add to the data. The keys are the channel names and the values are the channel data.
n_jobs (int, optional) – Number of processes to load the data in parallel. Default is 1, which loads data in serial.
- property raw_data: List[numpy.ndarray]
Return raw data as a list of arrays.
- Return type:
List[numpy.ndarray]
- property input_shapes: Dict
Get the input shapes for the model.
- Returns:
shapes – Dictionary of input shapes.
- Return type:
dict
- set_keep(keep)
Context manager to temporarily set the kept arrays.
- Parameters:
keep (int or list of int) – Indices to keep in the Data.arrays list.
- set_sampling_frequency(sampling_frequency)
Sets the sampling_frequency attribute.
- Parameters:
sampling_frequency (float) – Sampling frequency in Hz.
- Return type:
None
- set_buffer_size(buffer_size)
Sets the buffer_size attribute.
- Parameters:
buffer_size (int) – Buffer size for shuffling a TensorFlow Dataset. Smaller values will lead to less random shuffling but will be quicker.
- Return type:
None
- time_series(prepared=True, concatenate=False)
Time series data for all arrays.
- Parameters:
prepared (bool, optional) – Should we return the latest data after we have prepared it or the original data we loaded into the Data object?
concatenate (bool, optional) – Should we return the time series for each array concatenated?
- Returns:
ts – Time series data for each array.
- Return type:
list or np.ndarray
- load_raw_data()
Import data into a list of memory maps.
- Returns:
memmaps (list of np.memmap) – List of memory maps.
raw_data_filenames (list of str) – List of paths to the raw data memmaps.
- Return type:
Tuple[List[numpy.ndarray], List[str]]
- validate_extra_channels(data, extra_channels)
Validate extra channels.
- Parameters:
data (List[numpy.ndarray])
extra_channels (Dict)
- Return type:
Dict
- select(channels=None, sessions=None, use_raw=False)
Select channels and/or sessions.
This is an in-place operation.
- Parameters:
channels (int or list of int, optional) – Channel indices to keep. If None, all channels are retained.
sessions (int or list of int, optional) – Session indices to keep. If None, all sessions are retained.
use_raw (bool, optional) – Should we select channels from the original ‘raw’ data that we loaded?
- Returns:
data – The modified Data object.
- Return type:
osl_dynamics.data.Data
- filter(low_freq=None, high_freq=None, order=5, use_raw=False)
Filter the data.
This is an in-place operation.
- Parameters:
low_freq (float, optional) – Frequency in Hz for a high pass filter. If None, no high pass filtering is applied.
high_freq (float, optional) – Frequency in Hz for a low pass filter. If None, no low pass filtering is applied.
order (int, optional) – Order for a Butterworth filter.
use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?
- Returns:
data – The modified Data object.
- Return type:
osl_dynamics.data.Data
- downsample(freq, use_raw=False)
Downsample the data.
This is an in-place operation.
- Parameters:
freq (float) – Frequency in Hz to downsample to.
use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?
- Returns:
data – The modified Data object.
- Return type:
osl_dynamics.data.Data
- pca(n_pca_components=None, pca_components=None, whiten=False, use_raw=False)
Principal component analysis (PCA).
This function will first standardize the data then perform PCA. This is an in-place operation.
- Parameters:
n_pca_components (int, optional) – Number of PCA components to keep. If None, then pca_components should be passed.
pca_components (np.ndarray, optional) – PCA components to apply if they have already been calculated. If None, then n_pca_components should be passed.
whiten (bool, optional) – Should we whiten the PCA’ed data?
use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?
- Returns:
data – The modified Data object.
- Return type:
osl_dynamics.data.Data
- tde(n_embeddings, use_raw=False)
Time-delay embedding (TDE).
This is an in-place operation.
- Parameters:
n_embeddings (int) – Number of data points to embed the data.
use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?
- Returns:
data – The modified Data object.
- Return type:
osl_dynamics.data.Data
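To make the idea of time-delay embedding concrete, here is a minimal dependency-free sketch. It is not the osl-dynamics implementation: the function name time_delay_embed, the column ordering and the use of plain lists instead of NumPy arrays are all assumptions made for illustration. Each output row stacks n_embeddings consecutive samples of every channel, so the output has n_channels * n_embeddings columns and n_embeddings - 1 fewer rows than the input.

```python
def time_delay_embed(x, n_embeddings):
    """Time-delay embed x, a list of samples (each a list of channel values).

    Illustrative sketch only; the ordering of the output columns is an
    assumption and may differ from what the library produces.
    """
    if n_embeddings < 1:
        raise ValueError("n_embeddings must be a positive integer")
    n_samples = len(x)
    n_channels = len(x[0])
    embedded = []
    # Keep only windows that fit entirely inside the data.
    for start in range(n_samples - n_embeddings + 1):
        window = x[start:start + n_embeddings]
        # One output column per (channel, lag) pair.
        row = [window[lag][ch] for ch in range(n_channels) for lag in range(n_embeddings)]
        embedded.append(row)
    return embedded


x = [[float(t), 10.0 * t] for t in range(6)]  # 6 samples, 2 channels
y = time_delay_embed(x, n_embeddings=3)
print(len(y), len(y[0]))  # 4 6
```

The lost rows are the reason trim_time_series accepts an n_embeddings argument.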
- tde_pca(n_embeddings, n_pca_components=None, pca_components=None, whiten=False, use_raw=False)
Time-delay embedding (TDE) and principal component analysis (PCA).
This function will first standardize the data, then perform TDE, then PCA. It is useful to do both operations in a single method because it avoids having to save the time-embedded data. This is an in-place operation.
- Parameters:
n_embeddings (int) – Number of data points to embed the data.
n_pca_components (int, optional) – Number of PCA components to keep. If None, then pca_components should be passed.
pca_components (np.ndarray, optional) – PCA components to apply if they have already been calculated. If None, then n_pca_components should be passed.
whiten (bool, optional) – Should we whiten the PCA’ed data?
use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?
- Returns:
data – The modified Data object.
- Return type:
osl_dynamics.data.Data
- amplitude_envelope(use_raw=False)
Calculate the amplitude envelope.
This is an in-place operation.
- Parameters:
use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?
- Returns:
data – The modified Data object.
- Return type:
osl_dynamics.data.Data
- moving_average(n_window, use_raw=False)
Calculate a moving average.
This is an in-place operation.
- Parameters:
n_window (int) – Number of data points in the sliding window. Must be odd.
use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?
- Returns:
data – The modified Data object.
- Return type:
osl_dynamics.data.Data
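As an illustration of the sliding window, the sketch below computes a moving average over a single channel using plain Python lists. It is a stand-alone example under stated assumptions, not the library's code: the function name is reused only for readability, and keeping just the windows that fit fully inside the data is an assumption about edge handling. An odd n_window lets the window be centred on a sample, with n_window // 2 points on either side.

```python
def moving_average(x, n_window):
    """Moving average of a 1D list using only fully overlapping windows."""
    if n_window < 1 or n_window % 2 == 0:
        raise ValueError("n_window must be a positive odd integer")
    # Each output value is the mean of n_window consecutive samples.
    return [
        sum(x[i:i + n_window]) / n_window
        for i in range(len(x) - n_window + 1)
    ]


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
smoothed = moving_average(signal, n_window=3)
print(smoothed)  # [2.0, 3.0, 4.0, 5.0]
```

With this edge handling the output is n_window - 1 samples shorter than the input.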
- standardize(use_raw=False)
Standardize (z-score) the data.
This is an in-place operation.
- Parameters:
use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?
- Returns:
data – The modified Data object.
- Return type:
osl_dynamics.data.Data
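For reference, z-scoring subtracts each channel's mean and divides by its standard deviation. The sketch below shows this on a small (n_samples, n_channels) list of lists. It is illustrative only: the helper name zscore_channels is hypothetical, the population standard deviation is an assumed choice, and constant channels (zero standard deviation) are not handled.

```python
from statistics import fmean, pstdev


def zscore_channels(x):
    """Z-score each channel (column) of x, a list of samples."""
    n_channels = len(x[0])
    columns = [[row[ch] for row in x] for ch in range(n_channels)]
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population std (assumption)
    return [
        [(row[ch] - means[ch]) / stds[ch] for ch in range(n_channels)]
        for row in x
    ]


x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
z = zscore_channels(x)
z_columns = [[row[ch] for row in z] for ch in range(2)]
```

After this step every channel has zero mean and unit variance, whatever its original scale.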
- align_channel_signs(template_data=None, template_cov=None, n_init=3, n_iter=2500, max_flips=20, n_embeddings=1, standardize=True, use_raw=False)
Align the sign of each channel across sessions.
If no template data/covariance is passed, we use the median session.
- Parameters:
template_data (np.ndarray or str, optional) – Data to align the sign of channels to. If str, the file will be read in the same way as the inputs to the Data object.
template_cov (np.ndarray or str, optional) – Covariance to align the sign of channels. This must be the covariance of the time-delay embedded data. If str, must be the path to a .npy file.
n_init (int, optional) – Number of initializations.
n_iter (int, optional) – Number of sign flipping iterations per subject to perform.
max_flips (int, optional) – Maximum number of channels to flip in an iteration.
n_embeddings (int, optional) – We may want to compare the covariance of time-delay embedded data when aligning the signs. This is the number of embeddings. The returned data is not time-delay embedded.
standardize (bool, optional) – Should we standardize the data before comparing across sessions?
use_raw (bool, optional) – Should we prepare the original ‘raw’ data that we loaded?
- Returns:
data – The modified Data object.
- Return type:
osl_dynamics.data.Data
- prepare(methods)
Prepare data.
Wrapper for calling a series of data preparation methods. Any method in Data can be called. Note that if the same method is called multiple times, the method name should be appended with an underscore and a number, e.g. standardize_1 and standardize_2.
- Parameters:
methods (dict) – Each key is the name of a method to call. Each value is a dict containing keyword arguments to pass to the method.
- Returns:
data – The modified Data object.
- Return type:
osl_dynamics.data.Data
Examples
TDE-PCA data preparation:
    methods = {
        "tde_pca": {"n_embeddings": 15, "n_pca_components": 80},
        "standardize": {},
    }
    data.prepare(methods)
Amplitude envelope data preparation:
    methods = {
        "filter": {"low_freq": 1, "high_freq": 45},
        "amplitude_envelope": {},
        "moving_average": {"n_window": 5},
        "standardize": {},
    }
    data.prepare(methods)
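The numeric suffix is needed because a Python dict cannot hold the same key twice. The sketch below shows one way such a wrapper could dispatch on the keys. ToyData, scale and shift are hypothetical stand-ins invented for this example, and stripping the suffix with a regular expression is an assumption rather than a description of how Data.prepare is written.

```python
import re


class ToyData:
    """Hypothetical stand-in for Data; not the osl-dynamics class."""

    def __init__(self, values):
        self.values = list(values)

    def scale(self, factor=1.0):
        self.values = [v * factor for v in self.values]
        return self

    def shift(self, offset=0.0):
        self.values = [v + offset for v in self.values]
        return self

    def prepare(self, methods):
        # Dicts preserve insertion order, so methods run in the order given.
        for name, kwargs in methods.items():
            # Drop a trailing "_<number>" so "scale_1" and "scale_2"
            # both resolve to scale(); names like "tde_pca" are untouched.
            method_name = re.sub(r"_\d+$", "", name)
            getattr(self, method_name)(**kwargs)
        return self


data = ToyData([1.0, 2.0])
data.prepare(
    {
        "scale_1": {"factor": 2.0},
        "shift": {"offset": 1.0},
        "scale_2": {"factor": 10.0},
    }
)
print(data.values)  # [30.0, 50.0]
```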
- trim_time_series(sequence_length=None, n_embeddings=None, n_window=None, prepared=True, concatenate=False, verbose=False)
Trims the data time series.
Removes the data points that are lost when the data is prepared, i.e. due to time embedding and separating into sequences, but does not perform time embedding or batching into sequences on the time series.
- Parameters:
sequence_length (int, optional) – Length of the segment of data to feed into the model. Can be passed to trim the time points that are lost when separating into sequences.
n_embeddings (int, optional) – Number of data points used to embed the data. If None, then we use Data.n_embeddings (if it exists).
n_window (int, optional) – Number of data points in the sliding window applied to the data. If None, then we use Data.n_window (if it exists).
prepared (bool, optional) – Should we return the prepared data? If not we return the raw data.
concatenate (bool, optional) – Should we concatenate the data for each array?
verbose (bool, optional) – Should we print the number of data points we’re removing?
- Returns:
Trimmed time series for each array.
- Return type:
list of np.ndarray
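The following sketch illustrates the kind of trimming described above on a single 1D list. The exact rules are assumptions made for this example, not taken from the library source: n_embeddings // 2 and n_window // 2 samples are dropped from each end, and the remainder is truncated to a whole number of sequences. The helper name trim is hypothetical.

```python
def trim(x, sequence_length=None, n_embeddings=1, n_window=1):
    """Trim samples that preparation would lose (illustrative sketch)."""
    half_embed = n_embeddings // 2
    if half_embed > 0:
        # Samples lost at both ends due to time-delay embedding.
        x = x[half_embed:len(x) - half_embed]
    half_window = n_window // 2
    if half_window > 0:
        # Samples lost at both ends due to the sliding window.
        x = x[half_window:len(x) - half_window]
    if sequence_length is not None:
        # Samples at the end that do not fill a complete sequence.
        n_sequences = len(x) // sequence_length
        x = x[:n_sequences * sequence_length]
    return x


samples = list(range(100))
trimmed = trim(samples, sequence_length=20, n_embeddings=15, n_window=5)
print(len(trimmed), trimmed[0], trimmed[-1])  # 80 9 88
```

Trimming in this way keeps a time series aligned, sample for sample, with quantities inferred from the prepared data.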
- count_sequences(sequence_length, step_size=None)
Count sequences.
- Parameters:
sequence_length (int) – Length of the segment of data to feed into the model.
step_size (int, optional) – The number of samples by which to move the sliding window between sequences. Defaults to sequence_length.
- Returns:
n – Number of sequences for each session’s data.
- Return type:
np.ndarray
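A sliding window of length sequence_length moved in steps of step_size fits into n_samples data points a fixed number of times. The sketch below counts those positions for several sessions. The formula is the standard one for full windows and is assumed, not confirmed, to match what the method computes; the function here takes session lengths directly, which differs from the method's signature.

```python
def count_sequences(n_samples, sequence_length, step_size=None):
    """Number of full windows of length sequence_length in n_samples."""
    if step_size is None:
        step_size = sequence_length  # non-overlapping sequences
    if n_samples < sequence_length:
        return 0
    return (n_samples - sequence_length) // step_size + 1


session_lengths = [1000, 250, 90]
no_overlap = [count_sequences(n, 100) for n in session_lengths]
half_overlap = [count_sequences(n, 100, step_size=50) for n in session_lengths]
print(no_overlap)    # [10, 2, 0]
print(half_overlap)  # [19, 4, 0]
```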
- dataset(sequence_length, batch_size, shuffle=True, validation_split=None, concatenate=True, step_size=None, drop_last_batch=False)
Create a TensorFlow Dataset for training or evaluation.
- Parameters:
sequence_length (int) – Length of the segment of data to feed into the model.
batch_size (int) – Number of sequences in each mini-batch used to train the model.
shuffle (bool, optional) – Should we shuffle sequences (within a batch) and batches?
validation_split (float, optional) – Ratio to split the dataset into a training and validation set.
concatenate (bool, optional) – Should we concatenate the datasets for each array?
step_size (int, optional) – Number of samples to slide the sequence across the dataset. Default is no overlap.
drop_last_batch (bool, optional) – Should we drop the last batch if it is smaller than the batch size?
- Returns:
dataset – Dataset for training or evaluating the model along with the validation set if validation_split was passed.
- Return type:
tf.data.Dataset or tuple of tf.data.Dataset
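To show how sequence_length, step_size, batch_size and drop_last_batch interact, here is a small sketch that cuts one time series into sequences and groups them into batches using plain lists. It does not use TensorFlow and is not the library's pipeline: make_batches is a hypothetical helper, and shuffling and the validation split are left out.

```python
def make_batches(x, sequence_length, batch_size, step_size=None, drop_last_batch=False):
    """Split x into sequences, then group the sequences into batches."""
    if step_size is None:
        step_size = sequence_length  # default: no overlap between sequences
    sequences = [
        x[start:start + sequence_length]
        for start in range(0, len(x) - sequence_length + 1, step_size)
    ]
    batches = [
        sequences[i:i + batch_size]
        for i in range(0, len(sequences), batch_size)
    ]
    # Optionally discard a final batch that holds fewer than batch_size sequences.
    if drop_last_batch and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


x = list(range(50))
batches = make_batches(x, sequence_length=10, batch_size=2)
full_only = make_batches(x, sequence_length=10, batch_size=2, drop_last_batch=True)
print([len(b) for b in batches])    # [2, 2, 1]
print([len(b) for b in full_only])  # [2, 2]
```

Dropping the last batch can be useful when a model expects every batch to have the same shape.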
- save_tfrecord_dataset(tfrecord_dir, sequence_length, step_size=None, validation_split=None, overwrite=False)
Save the data as TFRecord files.
- Parameters:
tfrecord_dir (str) – Directory to save the TFRecord datasets.
sequence_length (int) – Length of the segment of data to feed into the model.
step_size (int, optional) – Number of samples to slide the sequence across the dataset. Default is no overlap.
validation_split (float, optional) – Ratio to split the dataset into a training and validation set.
overwrite (bool, optional) – Should we overwrite the existing TFRecord datasets if there is a need?
- Return type:
None
- tfrecord_dataset(sequence_length, batch_size, shuffle=True, validation_split=None, concatenate=True, step_size=None, drop_last_batch=False, tfrecord_dir=None, overwrite=False)
Create a TFRecord Dataset for training or evaluation.
- Parameters:
sequence_length (int) – Length of the segment of data to feed into the model.
batch_size (int) – Number of sequences in each mini-batch used to train the model.
shuffle (bool, optional) – Should we shuffle sequences (within a batch) and batches?
validation_split (float, optional) – Ratio to split the dataset into a training and validation set.
concatenate (bool, optional) – Should we concatenate the datasets for each array?
step_size (int, optional) – Number of samples to slide the sequence across the dataset. Default is no overlap.
drop_last_batch (bool, optional) – Should we drop the last batch if it is smaller than the batch size?
tfrecord_dir (str, optional) – Directory to save the TFRecord datasets. If None, then Data.store_dir is used.
overwrite (bool, optional) – Should we overwrite the existing TFRecord datasets if there is a need?
- Returns:
dataset – Dataset for training or evaluating the model along with the validation set if validation_split was passed.
- Return type:
tf.data.TFRecordDataset or tuple of tf.data.TFRecordDataset
- add_session_labels(label_name, label_values, label_type)
Add session labels as a new channel to the data.
- Parameters:
label_name (str) – Name of the new channel.
label_values (np.ndarray) – Labels for each session.
label_type (str) – Type of label, either “categorical” or “continuous”.
- Return type:
None
- add_extra_channel(channel_name, channel_values)
Add an extra channel to the data.
- Parameters:
channel_name (str)
channel_values (List[numpy.ndarray])
- Return type:
None
- get_session_labels()
Get the session labels.
- Returns:
session_labels – List of session labels.
- Return type:
List[SessionLabels]
- recommend_model_config()
Recommends arguments for a model config based on the data.
- Return type:
None
- save_preparation(output_dir='.')
Save a pickle file containing preparation settings.
- Parameters:
output_dir (str) – Path to save data files to. Default is the current working directory.
- Return type:
None
- load_preparation(inputs)
Loads a pickle file containing preparation settings.
- Parameters:
inputs (str) – Path to directory containing the pickle file with preparation settings.
- Return type:
None
- save(output_dir='.', outnames=None)
Saves (prepared) data.
The ordering of the saved files matches the order of the input files.
- Parameters:
output_dir (str) – Path to save data files to. Default is the current working directory.
outnames (list of str, optional) – Names/IDs for output files (excluding the extension).
- Return type:
None