Loading Data#

In this tutorial we demonstrate the various options for loading data. This tutorial covers:

  1. The Data Class

  2. Getting Example Data

  3. Loading Data in NumPy Format

  4. Loading Data in MATLAB Format

  5. Loading Data in fif Format

  6. Loading Data in txt Format

The Data class#

In osl-dynamics we typically load data using the osl_dynamics.data.base.Data class. The Data class has many useful methods for modifying the data.

Inputs#

There is one mandatory argument that needs to be passed to the Data class: inputs. This can be:

  • A path to a directory containing .npy files. Each .npy file should be a subject or session.

  • A list of paths to .npy, .mat, .fif, or .txt files. Each file should be a subject or session.

  • A numpy array. The array will be treated as continuous data from the same subject.

  • A list of numpy arrays. Each numpy array should be the data for a subject or session.

Data format#

The data files or numpy arrays should be in the format (n_samples, n_channels), i.e. time by channels. If your data is in (n_channels, n_samples) format, you should also pass time_axis_first=False to the Data class.
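If you would rather fix the orientation yourself before creating a Data object, a plain NumPy transpose should have the same effect. The following is a minimal sketch using a synthetic array; the variable names and the array contents are illustrative assumptions, not part of the example dataset.

```python
import numpy as np

# Hypothetical data stored channels-first: 42 channels, 6000 samples
X_channels_first = np.random.default_rng(0).standard_normal((42, 6000))

# Transpose so that time is the first axis: (n_samples, n_channels)
X_time_first = X_channels_first.T

print(X_time_first.shape)
```

The transposed array can then be passed to the Data class without the time_axis_first argument.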

The temporary store directory#

Note, there is an option to load the data as a memory map. This allows us to access the data without holding it in memory. To use this feature, pass load_memmaps=True. The Data class creates a directory called tmp which is used for storing temporary data (memory map files and prepared data). This directory can be safely deleted after you run your script. You can specify the name of the temporary directory by passing the store_dir argument.
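To illustrate what a memory map is, here is a minimal sketch using NumPy alone. It only demonstrates the general concept; how the Data class creates and manages its own memory map files is not shown here, and the file name and array size are assumptions made for the example.

```python
import os
import tempfile

import numpy as np

# Work in a throwaway directory (hypothetical file name)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.npy")

# Save a small synthetic (n_samples, n_channels) array to disk
np.save(path, np.zeros((1000, 4)))

# Open it as a read-only memory map: values are read from disk on access
X_mm = np.load(path, mmap_mode="r")
print(type(X_mm).__name__, X_mm.shape)

# Copy only the requested slice into an ordinary in-memory array
first_rows = np.array(X_mm[:10])
print(first_rows.shape)
```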

We will demonstrate how the Data class is used with example data below.

Loading data in parallel#

The Data class has an n_jobs argument that can be used to load multiple data files in parallel. Note, if n_jobs is passed, the Data class will also prepare the data in parallel (see Preparing M/EEG Data and Preparing fMRI Data for more information regarding data preparation).
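The general idea of loading several files concurrently can be sketched with the Python standard library. This is not how osl-dynamics implements n_jobs; it is only an illustration of the concept, and the file names and the choice of a thread pool are assumptions.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import numpy as np

# Create three small synthetic .npy files (hypothetical names)
tmp_dir = tempfile.mkdtemp()
paths = []
for i in range(3):
    path = os.path.join(tmp_dir, f"array{i}.npy")
    np.save(path, np.full((100, 4), i, dtype=float))
    paths.append(path)

# Load the files with two worker threads; map() preserves the input order
with ThreadPoolExecutor(max_workers=2) as pool:
    arrays = list(pool.map(np.load, paths))

print([a.shape for a in arrays])
```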

Getting Example Data#

Download the dataset#

We will download example data hosted on OSF.

import os

def get_data(name):
    os.system(f"osf -p by2tc fetch data/{name}.zip")
    os.system(f"unzip -o {name}.zip -d {name}")
    os.remove(f"{name}.zip")
    return f"Data downloaded to: {name}"

# Download the dataset (approximately 6 MB)
get_data("example_loading_data")

# List the contents of the downloaded directory containing the dataset
print("Contents of example_loading_data:")
os.listdir("example_loading_data")
Contents of example_loading_data:

['txt_format', 'matlab_format', 'fif_format', 'numpy_format']

We can see there are four directories in example_loading_data:

  • numpy_format, which contains .npy files.

  • matlab_format, which contains .mat files.

  • fif_format, which contains directories with .fif files.

  • txt_format, which contains .txt files.

We’ll show how to load data in each of these data types.

Loading Data in NumPy Format#

Let’s first list the example_loading_data/numpy_format directory.

os.listdir("example_loading_data/numpy_format")
['array1.npy', 'array0.npy']

We can see there are two numpy files. Each file contains a 2D numpy array in time by channels format. If we wanted to load this data using the numpy package, we could do:

import numpy as np

# Just load one of the files
X = np.load("example_loading_data/numpy_format/array0.npy")
print(X.shape)
(6000, 42)

Importing a numpy array directly#

If we have already loaded a numpy array and just want to create an osl_dynamics.data.Data object, we can simply pass it to the class:

from osl_dynamics.data import Data

data = Data(X)
print(data)
Loading files:   0%|          | 0/1 [00:00<?, ?it/s]
Loading files: 100%|██████████| 1/1 [00:00<00:00, 3019.66it/s]
Data
 id: 139852514822432
 n_sessions: 1
 n_samples: 6000
 n_channels: 42

We normally like to keep the data for each subject separate. If we have multiple 2D numpy arrays (one for each subject), we can collate them into a Python list and pass that to the Data class:

# Load numpy files
X0 = np.load("example_loading_data/numpy_format/array0.npy")
X1 = np.load("example_loading_data/numpy_format/array1.npy")

# Collate into a list
X = [X0, X1]

# Create a Data object
data = Data(X)
print(data)
Loading files:   0%|          | 0/2 [00:00<?, ?it/s]
Loading files: 100%|██████████| 2/2 [00:00<00:00, 2182.83it/s]
Data
 id: 139852517618464
 n_sessions: 2
 n_samples: 12000
 n_channels: 42

Loading from file#

Rather than loading the data into memory then creating a Data object, we could load the data directly from the file.

# Just load one of the files
data = Data("example_loading_data/numpy_format/array0.npy")
print(data)
Loading files:   0%|          | 0/1 [00:00<?, ?it/s]
Loading files: 100%|██████████| 1/1 [00:00<00:00, 2861.05it/s]
Data
 id: 139852027591856
 n_sessions: 1
 n_samples: 6000
 n_channels: 42

We can see the number of samples and channels matches the shape of the array we loaded with numpy. To access the 2D numpy array we can use the time_series() method.

ts = data.time_series()
print(ts.shape)
(6000, 42)

Normally, we would want to load the data for multiple subjects. We could do this in two ways if the data is in numpy format (i.e. .npy). We could pass a list of file paths:

files = [f"example_loading_data/numpy_format/array{i}.npy" for i in [0, 1]]
data = Data(files)
print(data)
Loading files:   0%|          | 0/2 [00:00<?, ?it/s]
Loading files: 100%|██████████| 2/2 [00:00<00:00, 1982.65it/s]
Data
 id: 139852024346960
 n_sessions: 2
 n_samples: 12000
 n_channels: 42

or just pass the path to the directory containing the .npy files:

data = Data("example_loading_data/numpy_format")
print(data)
Loading files:   0%|          | 0/2 [00:00<?, ?it/s]
Loading files: 100%|██████████| 2/2 [00:00<00:00, 2186.24it/s]
Data
 id: 139852024347392
 n_sessions: 2
 n_samples: 12000
 n_channels: 42

Note, when we have multiple subjects, calling the time_series() method returns a list of numpy arrays. Each item in the list is the data for one subject.

ts = data.time_series()
print(len(ts))
print(ts[0].shape)
print(ts[1].shape)
2
(6000, 42)
(6000, 42)
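The n_samples value printed for a Data object with two sessions (12000) appears to be the total over sessions, since each session here has 6000 samples. The sketch below reproduces that arithmetic with synthetic arrays standing in for the list returned by time_series(); the array contents are placeholders.

```python
import numpy as np

# Two synthetic sessions with the same shape as the example data
ts = [np.zeros((6000, 42)), np.ones((6000, 42))]

# Per-session shapes
print([x.shape for x in ts])

# Concatenating along the time axis gives the total number of samples
X_all = np.concatenate(ts, axis=0)
print(X_all.shape)
```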

Loading Data in MATLAB Format#

We will discuss two methods for loading MATLAB files. First, we will load the MATLAB files using public python packages (scipy and mat73), then we’ll show how to pass MATLAB files to the Data class.

Loading MATLAB files in Python#

The popular Python package SciPy has a function for loading MATLAB files: scipy.io.loadmat. Note, this function can only load MATLAB files saved in formats older than v7.3. If you saved your files using the v7.3 format, you need to use mat73.loadmat to load them in Python. Both of these packages are automatically installed when you install osl-dynamics.
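If you do not know in advance which format a file was saved in, one option is to try scipy first and fall back to mat73. The helper below is a hypothetical sketch, not part of osl-dynamics. It relies on scipy.io.loadmat raising NotImplementedError for v7.3 files, and it is demonstrated on a small synthetic file written with scipy.io.savemat. Note that the dictionary returned by mat73 does not include the extra header keys that scipy adds.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat


def load_mat_any_version(path):
    """Hypothetical helper: try scipy first, fall back to mat73 for v7.3."""
    try:
        return loadmat(path)
    except NotImplementedError:
        # scipy raises NotImplementedError for v7.3 (HDF5-based) files
        import mat73

        return mat73.loadmat(path)


# Write a small synthetic file in a pre-v7.3 format, then read it back
path = os.path.join(tempfile.mkdtemp(), "subject_demo.mat")
savemat(path, {"X": np.zeros((100, 4)), "T": np.array([[100]])})

mat_demo = load_mat_any_version(path)
print(mat_demo["X"].shape)
```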

Let’s first see what files we have in the example_loading_data/matlab_format directory.

os.listdir("example_loading_data/matlab_format")
['subject1.mat', 'subject0.mat']

Let’s load the first subject’s data using a standard Python function.

from scipy.io import loadmat

# Load the first subject
mat = loadmat("example_loading_data/matlab_format/subject0.mat")
print(mat)
{'__header__': b'MATLAB 5.0 MAT-file, Platform: MACA64, Created on: Thu May 15 21:24:56 2025', '__version__': '1.0', '__globals__': [], 'T': array([[6000]], dtype=uint16), 'X': array([[  8.55446529,   6.2701211 ,  14.08341885, ...,  -3.40541101,
         11.97496605,  -9.9605751 ],
       [ 38.68119812,  70.39207458, -24.58280182, ..., -22.34135818,
         91.68431854,  31.93983269],
       [-23.42669296,  58.61187744,  15.04336166, ..., -76.06550598,
         17.29451942,  -4.01404285],
       ...,
       [ -0.60452783,  52.73875809, -38.78281784, ...,  48.0056572 ,
        -14.50024414,  21.15813446],
       [-15.44007874,  94.75162506, -68.22749329, ...,  26.43334007,
        -57.44473648, -18.39329338],
       [-72.70446014,  91.29540253, -74.58401489, ..., -37.10783768,
        -58.87734604, -23.26953506]])}

We can see the loadmat function returns a python dict. We can list the fields using:

mat.keys()
dict_keys(['__header__', '__version__', '__globals__', 'T', 'X'])

The important field is X, which contains the 2D time series data for this subject. Note, MATLAB files created using the HMM-MAR toolbox come in the above format, i.e. with an X and a T field. For us, only X matters.

Loading MATLAB data into the Data class#

We can pass the numpy array contained in the X field of the dictionary directly to the Data class:

data = Data(mat["X"])
print(data)
Loading files:   0%|          | 0/1 [00:00<?, ?it/s]
Loading files: 100%|██████████| 1/1 [00:00<00:00, 7781.64it/s]
Data
 id: 139852513345808
 n_sessions: 1
 n_samples: 6000
 n_channels: 42

However, we would prefer to load the data directly from the file. We can do this by passing the file path to the .mat file and the data_field argument to the Data class.

data = Data("example_loading_data/matlab_format/subject0.mat", data_field="X")
print(data)
Loading files:   0%|          | 0/1 [00:00<?, ?it/s]
Loading files: 100%|██████████| 1/1 [00:00<00:00, 111.27it/s]
Data
 id: 139852024255392
 n_sessions: 1
 n_samples: 6000
 n_channels: 42

Note, the default value for the data_field argument is X, so the Data class would still be able to load the data without it being passed. The data_field argument is useful if the data is stored in a MATLAB file in a field with a different name.

If we wanted to load multiple data files in MATLAB format we would need to pass a list of file paths.

files = [f"example_loading_data/matlab_format/subject{i}.mat" for i in [0, 1]]
data = Data(files)
print(data)
Loading files:   0%|          | 0/2 [00:00<?, ?it/s]
Loading files: 100%|██████████| 2/2 [00:00<00:00, 119.89it/s]
Data
 id: 139852024347488
 n_sessions: 2
 n_samples: 12000
 n_channels: 42

Loading Data in fif Format#

Another data format that can be loaded with the Data class is fif files. This format is commonly used in MNE-Python and is the data format used in osl-ephys. Here, we will load source reconstructed (parcellated) data created with osl-ephys. In osl-ephys, we often have a separate directory for each subject. The fif_format directory contains two directories, one for each subject.

os.listdir('example_loading_data/fif_format')
['sub-002_run-01', 'sub-001_run-01']

Let’s see what’s inside sub-001_run-01.

os.listdir('example_loading_data/fif_format/sub-001_run-01')
['sub-001_run-01_sflip_lcmv-parc-raw.fif']

We have a fif file which contains the data for this subject. We could load this with MNE.

import mne

raw = mne.io.read_raw_fif("example_loading_data/fif_format/sub-001_run-01/sub-001_run-01_sflip_lcmv-parc-raw.fif")
print(raw.info)
Opening raw data file example_loading_data/fif_format/sub-001_run-01/sub-001_run-01_sflip_lcmv-parc-raw.fif...
    Range : 61000 ... 68500 =    244.000 ...   274.000 secs
Ready.
<Info | 11 non-empty values
 bads: []
 ch_names: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...
 chs: 38 misc, 3 Stimulus
 custom_ref_applied: False
 description: (meg) Vectorview system at Cambridge OSL BATCH PROCESSING ...
 dig: 0 items
 file_id: 4 items (dict)
 highpass: 0.0 Hz
 lowpass: 125.0 Hz
 meas_date: 2009-04-07 14:39:35 UTC
 meas_id: 4 items (dict)
 nchan: 41
 projs: []
 sfreq: 250.0 Hz
>

We can see this particular fif file contains 38 misc channels and 3 stim channels. We’re interested in the misc channels. Let’s load these into the Data class.

data = Data(
    "example_loading_data/fif_format/sub-001_run-01/sub-001_run-01_sflip_lcmv-parc-raw.fif",
    picks="misc",
    reject_by_annotation="omit",
)
print(data)
Loading files:   0%|          | 0/1 [00:00<?, ?it/s]
Loading files: 100%|██████████| 1/1 [00:00<00:00, 105.31it/s]
Data
 id: 139852520187296
 n_sessions: 1
 n_samples: 6002
 n_channels: 38

The reject_by_annotation="omit" argument is used to make sure we don’t include bad segments. This argument is passed to Raw.get_data in MNE.
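The effect of omitting bad segments can be illustrated with NumPy alone: samples marked as bad are dropped and the remaining samples are concatenated, so the resulting time series is shorter than the original recording. This is a conceptual sketch on synthetic data with a made-up bad segment; it does not call MNE.

```python
import numpy as np

# Synthetic recording: 10 samples, 2 channels (time by channels)
X = np.arange(20, dtype=float).reshape(10, 2)

# Hypothetical bad segment covering samples 3, 4 and 5
bad = np.zeros(10, dtype=bool)
bad[3:6] = True

# "Omit" behaviour: drop the bad samples and keep what remains, in order
X_good = X[~bad]
print(X_good.shape)
```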

To load multiple subjects we can do:

files = [
    f"example_loading_data/fif_format/sub-{i:03d}_run-01/sub-{i:03d}_run-01_sflip_lcmv-parc-raw.fif"
    for i in range(1, 3)
]
data = Data(files, picks="misc", reject_by_annotation="omit")
print(data)
Loading files:   0%|          | 0/2 [00:00<?, ?it/s]
Loading files: 100%|██████████| 2/2 [00:00<00:00, 112.25it/s]
Data
 id: 139852022665856
 n_sessions: 2
 n_samples: 12003
 n_channels: 38

Loading Data in txt Format#

The final data format we will load using the Data class is txt files. fMRI data preprocessed with FSL is in this format. To load this type of data, simply pass the list of filenames.

files = [
    f"example_loading_data/txt_format/dr_stage1_subject{i:05d}.txt"
    for i in range(1, 3)
]
data = Data(files)
print(data)
Loading files:   0%|          | 0/2 [00:00<?, ?it/s]
Loading files: 100%|██████████| 2/2 [00:00<00:00, 1412.46it/s]
Data
 id: 139852024347488
 n_sessions: 2
 n_samples: 512
 n_channels: 25
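If you want to inspect a txt file yourself before passing it to the Data class, NumPy can read a plain numeric table directly. The sketch below assumes a whitespace-delimited file with one row per time point and one column per channel; it writes a small synthetic file (hypothetical name) rather than reading the example dataset. Whether the Data class reads txt files in exactly this way is not confirmed here.

```python
import os
import tempfile

import numpy as np

# Write a small synthetic text file: 5 time points (rows), 3 columns
path = os.path.join(tempfile.mkdtemp(), "demo_timecourses.txt")
np.savetxt(path, np.arange(15, dtype=float).reshape(5, 3))

# Read it back as a 2D (n_samples, n_channels) array
X_txt = np.loadtxt(path)
print(X_txt.shape)
```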
