Multimodal motor imagery decoding method based on temporal spatial feature alignment and fusion

Yukun Zhang; Shuang Qiu; Huiguang He

doi:10.1088/1741-2552/acbfdf

1. Introduction

A brain-computer interface (BCI) translates brain signals into commands for outside devices without normal neuromuscular output pathways [1]. It can be used to replace or restore lost motor function. Motor imagery (MI) is one of the main paradigms of BCI that decodes spontaneous movement intention from brain signals. MI-BCI can help neuromuscular injury patients recover or replace their motor abilities [2–5]. It can also be applied to robot control, smart home applications, and entertainment [6–13].

Due to the valuable application prospects in the future, MI-BCI has received great research attention in recent years. The brain signals used in MI-BCI include electroencephalography (EEG), functional near infrared spectroscopy (fNIRS), functional magnetic resonance imaging, magnetoencephalography and electrocorticography. Among these signals, EEG and fNIRS are commonly used due to their easy access and high safety. There into, EEG-based MI-BCI studies have a relatively long history. The common spatial pattern (CSP) [14] is one of the most classic EEG decoding methods. An extension version of CSP—filter bank CSP (FBCSP) achieves a kappa value of 0.569 in four class MI decoding tasks and won the BCI competition IV dataset 2a [15]. Later, deep learning (DL) -based MI decoding methods such as shallow convolutional neural network (CNN) [16], EEGNet [17], and C2CM [18] were proposed and improved the decoding accuracy of MI-BCI. At the same time, some fNIRS methods have been proposed to decode fNIRS-based MI tasks. These methods use statistical measures such as the mean and slope value of the concentration of hemoglobin as features and employ traditional machine learning classifiers or deep neural networks to predict the class label [19–21].

More recently, there have been some multimodal MI-BCI studies that use both EEG and fNIRS data to decode human movement intention. EEG and fNIRS represent two different modalities respectively. Shin et al proposed using a stacking-based ensemble learning method to fuse EEG and fNIRS single modality classification results and achieved 8.6% higher decoding accuracy than single modality MI decoding [22]. Chiarelli et al proposed concatenating manual features of EEG and fNIRS and using a deep neural network for classification. Their multimodal decoding accuracy is higher than the single modality decoding accuracy by 9.9% [23]. Sun et al proposed using polynomial fusion to combine EEG and fNIRS features and achieved 5.98% higher decoding accuracy than single modality MI decoding. Generally, multimodal MI-BCI decoding methods can be simply divided into late fusion and mid fusion according to the multimodal fusion position. Late fusion or decision fusion (DF)-based decoding methods predict class labels on each modality separately and integrate the result by voting or ensemble learning [24–26]. Mid fusion methods concatenate or use polynomial fusion to combine the extracted multimodal features and then use a single classifier to predict class labels [22, 23, 27]. Both types of methods achieved higher decoding accuracy than single modality methods. However, it is necessary to explore how to extract better multimodal representations for the improvement of decoding performance.

Multimodal representation learning is a branch of multimodal machine learning and has gained great attention in vision and language research [28–32]. Many multimodal representation learning methods have been proposed to learn intramodality and intermodality interactions [31, 33, 34]. Intramodality focuses on learning better feature representations within each single modality, and intermodality focuses on learning the interactions between different modalities jointly and simultaneously. In this study, learning intramodality interactions means learning from EEG for fNIRS data only, while learning intermodality interactions means learning from EEG and fNIRS data jointly. However, how to learn good representation in multimodal MI decoding has not been widely researched. Effective and robust feature representation will promote the performance of downstream tasks [31]. Therefore, designing appropriate constraints for multimodal data may result in higher separability and higher robustness of multimodal features, thus enhancing the multimodal MI decoding accuracy.

Existing multimodal MI studies mainly focus on two-class MI tasks [22, 23, 27, 35]. However, the low number of classes limits the number of output control commands in MI-BCI systems. Indeed, many multiclass single modality MI studies have been proposed [36, 37]. There have also been some multiclass multimodal MI studies (e.g. Nagasawa et al, three classes) [38] and motor execution studies [26, 39, 40]. However, due to the limitation of multimodal decoding accuracy, there are few five-class multimodal MI decoding studies. Therefore, further increasing the number of classes and enhancing decoding accuracy in multimodal MI-BCI will enhance the practicability of MI-BCI systems.

In this study, an EEG-fNIRS MI dataset from 15 subjects was built. The dataset contains five-class tasks, including left-hand grasping MI, right-hand grasping MI, both-hand grasping MI, both feet pedalling MI, and rest. Compared with the existing EEG-fNIRS open dataset [22], the collected dataset contains more classes. This dataset will be open accessed. Moreover, a multimodal MI decoding neural network was designed, which includes two feature extractors for EEG and fNIRS signals, a fusion module, and a classifier. To extract features with better representations, an intramodality center loss, an intramodality contrastive loss, and an intermodality contrastive loss were proposed. Intramodality losses work on EEG and fNIRS single modalities separately, while intermodality loss works on both modalities simultaneously. Thereinto, to extract single-modal features with better separability, intramodality center loss was designed to pull the features from the same class closer under the measure of mean square error. Additionally, intramodality contrastive loss was proposed to pull the features from the same classes closer and push features from different classes further under the measure of mutual information. To extract a common feature space between EEG and fNIRS and enhance the stability by mutual guidance of two modalities, intermodality contrastive loss was proposed to pull the features from two modalities of the same sample closer and push the features from different samples further. Altogether, the intra- and intermodality losses will result in higher separability of multimodal features and higher robustness, thus enhancing the multimodal MI decoding accuracy. In addition, an attention network-based fusion module was proposed that dynamically focuses on different temporal points according to the features from both modalities.

The main contributions of this paper are summarized as follows:

A five-class EEG-fNIRS MI dataset from 15 subjects was collected, and the signal quality was evaluated from temporal, frequency and spatial perspectives. The dataset will be open accessed.
A multimodal MI decoding neural network was proposed that includes modality-specific feature extractors, an attention-based modality fusion module, and a classifier.
Intra- and intermodality constraints were proposed. The intramodality constraints bring better separability to the features. The intermodality constraints facilitate multimodal fusion and enhance the robustness of the network.
Extensive data analysis, comparison experiments, ablation studies and visualizations were conducted to evaluate the proposed decoding method. The experimental results showed that the proposed method can enhance multimodal MI decoding accuracy.

The rest of this article is organized as follows. In section 2, MI decoding methods based on EEG and fNIRS are reviewed. In section 3, the proposed EEG-fNIRS multimodal data acquisition methods are described. In section 4, concrete descriptions of the proposed neural network and the multimodal constraints are given. Section 5 shows the analysis of the collected multimodal data, the performance comparison of the proposed method and baseline method, the ablation study results, and the visualization results. The proposed method and experimental results are discussed in section 6. Finally, the last section concludes this study.

2. Related work

2.1. Existing EEG-fNIRS datasets

In 2017, Shin et al collected two-class MI EEG-fNIRS data from 29 subjects [22]. The MI tasks were left hand and right hand grabbing. For each MI task, 30 samples were collected. This dataset had been open accessed. Later, there have been many left-hand vs. right-hand multimodal MI studies [23, 25, 41]. However, these data are not open accessed. In addition, these studies contain only two classes. This small class set restricts applications of multimodal MI-BCI. There are some multiclass EEG-fNIRS motor execution decoding studies [26, 39, 40]. However, there are gaps between movement execution and MI brain signals, which are difficult to use to produce MI decoding methods. Increasing the number of classes will enlarge the command set of MI-BCI, thus broadening the application prospects of MI-BCI. A multimodal MI dataset with more classes will be helpful for future multimodal MI decoding studies. To the best of our knowledge, no five class EEG-fNIRS MI decoding studies have been published. In this study, a five-class EEG-fNIRS dataset including left hand, right hand, both hand, feet, and rest MI tasks was built. It will be helpful for developing multiclass multimodal MI decoding methods.

2.2. Single modality MI decoding

Decoding MI from EEG has been studied for decades. In 1999, CSP was employed to decode movement intention from EEG signals [14]. Later, many improved versions of CSP were proposed [42–44]. In 2008, the FBCSP, one of the most classic CSP-based methods, was proposed by Kai Keng et al [42]. The FBCSP alleviated the subject-specific frequency band selection problem and achieved 90.3% decoding accuracy in MI classification tasks of the right hand and right foot. In 2012, FBCSP achieved the best decoding accuracy and won BCI competition IV on dataset 2a [15, 36]. More recently, Miao et al proposed common time-frequency-spatial patterns to extract features from multiple time windows and achieved 85% decoding accuracy on a three-class MI decoding task. Some Riemannian geometry-based methods have also been proposed in MI decoding [45, 46]. In 2012, Barachant et al proposed the Riemannian minimum distance to mean method and achieved a decoding accuracy of 63.2% in a four-class MI decoding task.

In recent years, DL has made great progress in vision processing and natural language processing. Many DL-based methods have been proposed for decoding MI tasks [47] and have become a trend for single modality MI decoding [48]. In 2017, Schirrmeister et al proposed a CNN-based shallow network (shallow CNN) and a CNN-based network (deep CNN) [16]. They achieved decoding accuracies of 73.7% and 70.9%, respectively, on a four-class MI decoding task. In 2018, another CNN-based neural network, EEGNet, was proposed by Lawhern et al to decode MI from EEG data [17]. Compared with shallow CNN, EEGNet contains fewer parameters and can be applied in many EEG decoding tasks, including steady-state visual evoked potentials and event-related potentials. In 2018, Sakhavi et al proposed the C2CM for MI decoding. They first applied the FBCSP spatial filter to EEG data. The envelope of the resulting data was taken as a feature. A small CNN-based network was proposed to classify manually designed features [18]. They achieved 74.46% decoding accuracy on a four-class MI decoding task. In 2021, Li et al proposed a TS-SEFF network that utilizes both temporal and spectral EEG features [49]. They achieved 74.71% decoding accuracy on a four-class MI decoding. Liu et al proposed the multiscale multitask CNN (MSMT-CNN) network to extract space-time-frequency EEG representations and proposed a multitask learning framework to enhance the feature representation [50]. In the same year, Mane et al proposed FBCNet. A filter bank is adopted to filter EEG signals into multiple subbands. A multiview CNN was designed to decode multiband EEG and achieved 76.2% decoding accuracy on the four-class MI task [51]. In 2022, Ma et al proposed a TD-Attn network that uses a time-distributed attention network to classify CSP features. They achieved 46.15% decoding accuracy on five-class same lamb MI decoding task [52]. Also in 2022, Pan et al proposed an MAtt network that takes the covariance matrix of the EEG signal as input. A manifold attention module is proposed to capture the spatial-temporal information in the covariance matrix time series. They achieved 74.71% decoding accuracy on four-class MI decoding [53].

As neuroimaging techniques have improved in the last few decades, many commercial brain fNIRS devices have appeared. The number of fNIRS channels has increased, and the devices have become more portable [54]. Therefore, fNIRS data have become easier to access for MI studies. Compared with EEG, fNIRS signals have lower temporal resolution and higher spatial resolution. Therefore, the feature extraction and classification methods of EEG and fNIRS are different. In 2017, Shin et al extracted the signal mean and signal slope over time from oxygenated hemoglobin (HbO) signals as features and adopted an linear discriminant analysis (LDA) classifier in two-class MI decoding, which achieved a decoding accuracy of 66.5% [22]. In the same year, Qureshi et al proposed using signal mean, signal peak, signal slope, signal standard deviation signal kurtosis and signal skewness as features and SVM as the classifier to predict class labels. They achieved 82.3% decoding accuracy on MI vs. rest decoding [20]. In 2020, Hosni et al used the same feature set as Qureshi did and conducted channel selection based on generalized linear mode analysis. They achieved an average decoding accuracy of 85.4% from eight amyotrophic lateral sclerosis patients [21]. In the same year, Chhabra et al proposed a deep CNN to decode fNIRS signals and achieved 72.35% decoding accuracy in the four-class MI task [55].

2.3. Multimodal MI decoding

In 2017, Shin et al proposed a stacking-based method that used single modality prediction results as the input for the meta classifier to decode MI tasks and achieved a decoding accuracy of 74.2% in the two-class MI task [22]. In 2018, Chiarelli et al proposed using manual features and a deep neural network to decode MI from EEG-fNIRS data. ERD and ERS were used as the EEG features, while average values of deoxygenated hemoglobin (HbR) and HbO concentration changes were used as the fNIRS features. They achieved a decoding accuracy of 83.28% on two-class (left-hand/right-hand) MI decoding [23]. In 2020, Hasan et al proposed a computationally efficient method for EEG-fNIRS signal decoding. They used mean, skewness, kurtosis, and peak values as features for both EEG and fNIRS and then conducted channel selection based on the Pearson product-moment correlation coefficient. They achieved an accuracy of 75.28% on classification of the four motor execution tasks versus rest [26]. In 2020, Sun et al proposed a DL-based EEG-fNIRS decoding neural network that uses two CNN base networks to extract EEG and fNIRS features separately and employs polynomial fusion to combine the multimodal features. They achieved 77.53% decoding accuracy in two classes of MI decoding (left hand and right hand) [27]. In 2021, Wang et al proposed a feature concatenation-based and DF-based method to decode EEG-fNIRS data. They achieved 84.45% decoding accuracy on well-trained subjects on two-class MI decoding (left hand and right hand) [25]. These EEG-fNIRS decoding methods use feature fusion or DF rather than consider the intramodality relationship between EEG and fNIRS signals.

3. Multimodal MI data collection

3.1. Subjects

Fifteen right-handed subjects participated in the experiment, including nine males and six females, aged 25.6 ± 2.1 (mean ± standard deviation). All subjects had no history of neurological or psychiatric disease. They were informed about the experimental details and signed consent forms before the experiment started. This study was approved by the ethical committee of the Institute of automation, Chinese Academy of Sciences.

3.2. Data acquisition

EEG and fNIRS signals were simultaneously recorded during the experiment. EEG data were collected with electrodes, and each electrode turned into one channel of EEG signal. fNIRS data were collected with probes, including the light source and detector. Each pair of sources and detectors composed one channel of the fNIRS signal. The EEG and fNIRS shared one standard 128-channel cap for signal collection. The specific channel positions of EEG and fNIRS are shown in figure 1. The EEG signal was recorded by a LiveAMP amplifier (Brain Products Company), which has 62 channels. The EEG channel setting was the same as a standard 64-channel electrode cap except that the CB1 and CB2 channels were not collected, as these two channels are too far from the sensorimotor area. Channel FCz was taken as the reference channel. Channel AFz was taken as the ground. The sampling rate was 1000 Hz. The impedance of all EEG channels was kept below 10 kΩ during the experiment. The EEG data were bandpass filtered between 0.5 Hz and 100 Hz during recording. A notch filter of 50 Hz was also applied to remove the power-line interference. fNIRS data were recorded by an NIRsports amplifier (NIRx Medical Technologies company). There were 16 sources and 16 detectors used, and each pair of adjacent sources and detectors formed one fNIRS channel. In total, 52 fNIRS channels were recorded. The sampling rate was 10.2 Hz. To synchronize the EEG signal, fNIRS signal and the MI task, triggers were sent to a synchronize box by the MI task control program.

**Figure 1.** Distribution of EEG electrodes and fNIRS probes.
Download figure:
Standard image High-resolution image

3.3. Experimental paradigm

Each subject participated in two days of the experiment. The tasks in the two days were exactly the same, while on the first day, no signal was collected. The aim of the first day of the experiment was to ensure that the subject was familiar with the experimental paradigm and practice to perform MI tasks well. On the second day of the experiment, multimodal MI signals were collected.

In the experiment, a subject sat in front of a 19 inch (1280 × 1024 pixels) screen and acted following the instructions presented on the screen. Each subject was asked to perform ten runs of the MI task. Each run consisted of 25 trials. Figure 2 shows the timing of the trials. Each trial contained three phases. The first phase was the preparing phase, when a red circle was presented on the screen for 2 s. In this phase, the subject was asked to be mentally prepared for the following task. The second phase was the task phase and lasted for 10 s. During this phase, one of the following texts in Chinese was shown on the screen: 'left hand', 'right hand', 'both hand', 'both feet', 'rest'. The subject was asked to perform the corresponding MI task according to the text instructions. In the third phase, the subject was asked to rest. The rest phase lasted for a random length from 9 s to 11 s. Random duration can help avoid the influence of the expectation of the subject on the neural signal. The timing of our experiment is designed following the studies of Shin et al [22].

**Figure 2.** Timing of one trial.
Download figure:
Standard image High-resolution image

There were a total of five kinds of tasks, four MI tasks and one rest task. These tasks are widely used in single modality MI studies [56]. In the MI tasks of hands, the subject imagined repeatedly grasping a palm-sized ball with their best effort at an approximately 1 Hz rate. Subjects imagined left, right or both hand grasping as the screen instructed. In the MI task of feet, the subject imagined riding a bicycle with the best effort. In the rest task, the subject relaxed and rested. Before each run started, the subject was asked to sit in a comfortable pose and be relaxed. During the whole session, the subject was asked to not move any part of their body, no swallow and no facial actions, keep their visual attention around the center of the screen. There were short breaks between sessions, and the subject could relax and move their body during the break time.

3.4. Data preprocessing

The raw EEG signal was first downsampled to 200 Hz. Then, 44 channels around motor cortices ('Fz', 'F1', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7', 'F8', 'FC1', 'FC2', 'FC3', 'FC4', 'FC5', 'FC6', 'FT7', 'FT8', 'Cz', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'T7', 'T8', 'CPz', 'CP1', 'CP2', 'CP3', 'CP4', 'CP5', 'CP6', 'TP7', 'TP8', 'Pz', 'P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8') were picked for further use. These channels are close to sensorimotor brain areas that are activated during MI tasks [57, 58]. Other channels were too far from the sensorimotor brain area and were discarded to avoid the disturbance of unrelated information. A fifth-order IIR bandpass filter is applied to extract the 4–40 Hz signal. For each trial, a 0–10 s EEG signal with respect to the start of the second phase was extracted as one EEG data sample for further analysis.

The Beer‒Lambert law describes how light attenuates when traveling through different materials [59] The raw fNIRS signal is first resampled to 10 Hz and then converted to concentration changes in HbR and HbO by the modified Beer–Lambert law [60]. Then, a fourth-order IIR bandpass filter is applied to extract 0.01–0.10 Hz signal. Baseline correction is performed by subtracting the signal from −3 s to 0 s with respect to the start of the second phase. Then, HbO and HbR signals from 0 s to 15 s were extracted as one fNIRS data sample for further analysis.

4. Decoding method

4.1. Basic notations

The multimodal dataset is represented as $\mathcal{D}\,{\text{ = }}\,\left\{ {(x_i^E,x_i^N,{y_i})} \right\}_{i = 1}^n$ , where $n$ is the total number of samples. $x_i^E$ is the ith preprocessed multichannel EEG data $x_i^E \in {\mathbb{R}^{1 \times 44 \times 2000}}$ . $x_i^N \in {\mathbb{R}^{2 \times 52 \times 150}}$ is the ith multichannel fNIRS data, which consists of HbR data ${x^{{\text{HbR}}}} \in {\mathbb{R}^{1 \times 52 \times 150}}$ and HbO data ${x^{{\text{HbO}}}} \in {\mathbb{R}^{1 \times 52 \times 150}}$ . ${y_i}$ is the label of the ith multimodal data sample.

In the following chapters, superscript $s$ is used to represent the spatial dimensionality of one array, and $t$ is used to represent the temporal dimensionality. Superscript $E$ and $N$ are used to indicate EEG modality and fNIRS modality. Letters i, j, and k will be repeatedly used as sample indexes.

4.2. Decoding network

The overall decoding network is illustrated in figure 3. It mainly consists of an EEG feature extractor, a fNIRS feature extractor, a fusion module, and a classifier. Intramodality alignment losses and intermodality alignment loss were also proposed. The intramodality alignment losses pull the features from the same class closer and push features from different classes further. The intermodality constrains pull the EEG and fNIRS features from the same sample closer and push the EEG and fNIRS features from different samples further.

The EEG feature extractor mainly consists of two convolution layers, a power activation layer, and a pooling layer. The first convolution layer carries out temporal convolution along the temporal axis of the input EEG data to extract temporal information. The shape of the temporal convolution kernel is 1 $\times$ 25, and the number of kernels is 40. Then, the second convolution layer is applied along the spatial axis of EEG data to extract spatial information from multichannel EEG data. The kernel shape is the shape of the spatial kernel, which is $44\ \times$ 1, and the number of kernels is 40. The kernel length is the same as the channel number of EEG data to capture the global spatial pattern. The Pow operation is selected as activation for EEG, which can reflect the power features. Finally, a slicing and pooling layer is used to conduct data augmentation and reduce the dimensionality of EEG features. Concretely, a sliding window is first applied to the EEG feature with a window length of 476 temporal points and a window step of ten temporal points. Then, in each window, a mean pooling layer of temporal length 75 and temporal step 15 is applied to reduce the temporal dimensionality. With a sliding window, we could augment EEG data by generating more data samples. Adding a sliding window before the convolution layers will greatly increase the repeated temporal computation. A sliding window was applied after temporal convolution, which means that for each input EEG data sample, only one temporal convolution operation along the full temporal length was conducted. The output features should be the same as if a sliding window was performed before the convolution operations. By this means, considerable computational effort can be saved.

The fNIRS feature extractor mainly contains temporal pooling, temporal differential, spatial convolution and slicing layers. Temporal pooling and differential layers are applied in parallel to the input fNIRS data. The length and step of the pooling layer are 30 and 1 temporal points, respectively. Mean pooling was used to capture the coarse temporal trend of fNIRS data. The differential layer calculates the mean difference of each sliding window with a window length of 30 and a window step of 1 temporal point. The differential layer could capture the fine local temporal features of fNIRS data. The outputs of the pooling layer and differential layer are concatenated along the convolution channel dimension. Similar to the EEG feature extractor, a spatial convolution layer is adopted to extract the spatial pattern of multichannel fNIRS data. Data augmentation is also applied to fNIRS features at the end of the fNIRS feature extractor. The slicing layer picks every 30-th temporal array to compose a final fNIRS feature, which results in 30 augmentations of the fNIRS feature with a temporal length of 4 for each input fNIRS data sample. For example, the 1,31,61,91th temporal points compose one augmentation. The 2,32,62,92th temporal points compose another augmentation. Each augmentation contains full temporal information in the 15 s fNIRS data.

An attention-based fusion module was designed to merge the EEG features and fNIRS features. EEG features were first transferred to query and value with two linear mapping layers. Concretely, each temporal point of one EEG feature $E \in {\mathbb{R}^{{s^E} \times {t^E}}}$ is a vector ${E_i} \in {\mathbb{R}^{{s^E} \times 1}}$ , and ${s^E}$ and ${t^E}$ are the spatial and temporal dimensionalities of the EEG feature. A linear map is applied to each vector ${E_i}$ , resulting in a new vector ${Q_i} \in {\mathbb{R}^{d \times 1}}$ , where $d$ is the dimensionality of the quarry array and equals ${s^E}$ in our network. Then, the query array $Q \in {\mathbb{R}^{d \times {t^E}}} = \{ {Q_i}\} _{i = 1}^{{t^E}}$ was obtained. In the same way, a value $V \in {\mathbb{R}^{d \times {t^E}}}$ array for EEG and a key array $K \in {\mathbb{R}^{d \times {t^N}}}$ for fNIRS were obtained. The attention map is calculated as $A \in {\mathbb{R}^{{t^E} \times {t^N}}} = {Q^{\textrm T}} \times K$ , where T represents the matrix transport operation. The attention map is averaged along dimension two, which results in an attention vector $a \in {\mathbb{R}^{{t^E}}}$ . The attention vector represents the attention weights of each temporal point in the EEG feature. The temporal weighted EEG feature is calculated as ${E^w} = \{ {E_i} \times {a_i}\} _{i = 1}^{{t^E}}$ . Similar to most attention modules, a residual block was also adopted to stabilize the model performance. Attention to fNIRS features was also applied; that is, fNIRS features were used to generate queries and values while using EEG features to generate keys. The resulting weighted fNIRS feature is concatenated with the weighted EEG feature along the temporal dimension.

Finally, a one-layer fully connected neural network along with a log softmax activation layer is adopted as a classifier. As feature augmentation was performed in the prior network, the output of augmentation from one single original sample was averaged to obtain the final output. Dropout and batch norm are employed to enhance model performance by reducing the effect of overfitting and making the training process more stable [61, 62]. The network structure and specific parameters are summarized in table 1.

Table 1. Layers and parameters of the proposed network.

Module	Layer	Parameters	Output
EEG feature extractor input: (1,44,2000)	Temporal convolution	40 * (1,25)	(40,44,1976)
	Spatial convolution	40 * (44,1)	(40,1,1976)
	Batch norm	40 * 2	(40,1,1976)
	Pow activation		(40,1,1976)
	Reshape		(40,1976)
	Sliding window	Window length = 476, step = 10	N₁*(40,476)
	Mean pooling	(1,75), step = 15	N₁*(40,27)
fNIRS feature extractor input: (2,52,150)	Temporal pooling and differential	(1,30), step = 1	(4,52,121)
	Spatial convolution	40 * (52,1)	(40,1,121)
	Slicing		N₂ * (40,1,4)
	Reshape		N₂ * (40,4)
Multimodal fusion input: EEG:(40,27) fNIRS: (40,4)	Attention		(1,27), (1,4)
	Weightning		(40,27), (40,4)
	Concatenate		(40,31)
Classifier input: (40,31)	Dropout	Probability = 0.5	(40,31)
	Reshape		1240
	Fully connected	1240 * Number of classes	Number of classes
	Activate	Softmax	Number of classes

(N₁ and N₂ are the number of slicing windows of EEG and fNIRS).

4.3. Optimization objective

The heterogeneity of EEG and fNIRS data makes the multimodal fusing process difficult. EEG and fNIRS features are distributed in their own specific space. Straight fusing EEG and fNIRS features, for example, concatenating EEG and fNIR features, leads to poor multimodal decoding accuracy. Therefore, two intramodality alignment losses and one intermodality alignment loss were proposed for extracting better feature representations.

4.3.1. Intramodality center loss

A center loss was added to the fNIRS features. The center loss pulls the features from the same class closer:

$\begin{equation}{l_{{\text{center}}}} = \sum\limits_{k = 1}^{{n_c}} \sum\limits_{i = 1}^{{n_m}} {\left( {N_i^{\,k} - {{\overline N}^{\,k}}} \right)^2}\end{equation} \tag{ 1 }$

$\begin{equation}{\overline N ^{\,k}} = \frac{1}{{{n_m}}}\sum\limits_{i = 1}^{{n_m}} {N_i^{\,k}} \end{equation} \tag{ 2 }$

where ${n_c}$ is the number of classes and ${n_m}$ is the number of samples belonging to one class (same for each class). $N_i^{\,k}$ is the fNIRS feature of sample $i$ from class $k$ . ${\overline N ^{\,k}}$ is the average of fNIRS features from class $k$ .

4.3.2. Intramodality contrastive loss

$\begin{equation}{l_{\operatorname{intra} }} = \sum\limits_{i = 1}^n {(l_i^E + l_i^N)} \end{equation} \tag{ 3 }$

$\begin{align}l_i^E &= - \frac{1}{{{n_m} - 1}}\sum\limits_{j = 1}^n {{\textbf{I}}_{i \ne j}} \cdot {{\textbf{I}}_{{y_i} = {y_j}}}{} \nonumber\\ &\quad\cdot \log \frac{{{{\text{e}}^{{\text{sim}}({E_i},{E_j})/\tau }}}}{{\sum\limits_{k = 1}^n {{\textbf{I}}_{i \ne k}} \cdot {{\text{e}}^{{\text{sim}}({E_i},{E_k})/\tau }}}}\end{align} \tag{ 4 }$

$\begin{align}l_i^N &= - \frac{1}{{{n_m} - 1}}\sum\limits_{j = 1}^n {{\textbf{I}}_{i \ne j}}\nonumber\\ &\quad \cdot {{\textbf{I}}_{{y_i} = {y_j}}} \cdot \log \frac{{{{\text{e}}^{{\text{sim}}({N_i},{N_j})/\tau }}}}{{\sum\limits_{k = 1}^n {{\textbf{I}}_{i \ne k}} \cdot {{\text{e}}^{{\text{sim}}({N_i},{N_k})/\tau }}}}\end{align} \tag{ 5 }$

$\begin{equation}{\text{sim}}(u,v) = {u^{\text{T}}}v/\left\| u \right\|\left\| v \right\|\end{equation} \tag{ 6 }$

where n is the number of samples from all classes. $\tau$ is a temperature hyper parameter and is set to 0.1. ${\textbf{I}}$ represents the indicator function, ${{\textbf{I}}_{i \ne j}} = 1$ if $i \ne j$ ; otherwise, ${{\textbf{I}}_{i \ne j}} = 0$ . Info noise-contrastive estimation (infoNCE) loss makes similar sample pairs stay close to each other while dissimilar ones are far apart [63]. The proposed intramodality contrastive loss takes the form of infoNCE loss. The key to adopting infoNCE loss into our network is to define positive and negative sample pairs. In our study, in one modality, pairs of features from the same class are positive pairs, and pairs of features from different classes are negative pairs.

The difference between center loss and intramodality contrastive loss lies in the fact that center loss measures the similarity of two features by squared error, while intramodality contrastive loss measures the cosine distance. In addition, center loss only pulls features from the same class together, while intramodality contrastive loss also pushes the features from different classes further.

4.3.3. Intermodality contrastive loss

$\begin{equation}{l_{\operatorname{inter} }} = \frac{1}{n}\sum\limits_{i = 1}^n {(l_i^E + l_i^N} )\end{equation} \tag{ 7 }$

$\begin{equation}l_i^E = - \log \frac{{{{\text{e}}^{{\text{sim}}({E_i},{N_i})/\tau }}}}{{\sum\limits_{k = 1}^n {{\textbf{I}}_{{\text{k}} \ne i}}{{\text{e}}^{{\text{sim}}({E_i},{N_k})/\tau }} + \sum\limits_{k = 1}^n {{\textbf{I}}_{{\text{k}} \ne i}}{{\text{e}}^{{\text{sim}}({E_i},{E_k})/\tau }}}}\end{equation} \tag{ 8 }$

$\begin{align}l_i^N = - \log \frac{{{{\text{e}}^{({\text{sim}}({N_i},{E_i})/\tau )}}}}{{\sum\limits_{k = 1}^n {{\textbf{I}}_{{\text{k}} \ne i}}{{\text{e}}^{({\text{sim}}({N_i},{N_k})/\tau )}} + \sum\limits_{k = 1}^n {{\textbf{I}}_{{\text{k}} \ne i}}{{\text{e}}^{({\text{sim}}({N_i},{{\text{E}}_k})/\tau )}}}}.\end{align} \tag{ 9 }$

Here, EEG and fNIRS feature one single data sample and are defined as a positive pair. EEG and fNIRS features from different data samples are negative pairs.

4.3.4. Overall optimization objective

The classification employs cross-entropy loss:

$\begin{equation}{l_{{\text{ce}}}} = - {{\textrm E}_{{x^{\,E}},{x^{\,N}},y\sim \mathcal{D}}}{\text{CE}}\left( {{\text{Net}}\left( {{x^{\,E}},{x^{\,N}}} \right),y} \right)\end{equation} \tag{ 10 }$

where $\operatorname{CE} \left( \cdot \right)$ represents the cross-entropy loss function and $\operatorname{Net} \left( x \right)$ represents our whole network.

The overall optimization objective is:

$\begin{equation}{l_{{\text{all}}}} = {l_{{\text{ce}}}} + {w_1}{l_{{\text{center}}}} + {w_2}{l_{{\text{intra}}}} + {w_3}{l_{{\text{inter}}}}\end{equation} \tag{ 11 }$

where ${w_1}$ , ${w_2}$ , and ${w_3}$ are hyperparameters of the loss weights.

4.4. Experimental setting and comparison methods

Stratified 5-fold cross-validation was conducted on our dataset and a public dataset [22], which is referred to as the MI2 dataset in the rest of the paper. The MI2 dataset contains EEG and fNIRS data of left-hand and right-hand MI from 29 subjects in a comparison experiment to evaluate the decoding accuracy of the proposed method. An ablation study was also conducted where each part of the proposed method was removed in turn to verify their effectiveness.

The comparison methods include EEG single modality methods, fNIRS single modality methods and multimodal methods.

(a)
EEG methods: EEG single modality methods include FBCSP, EEGNet, shallow CNN, deep CNN, and C2CM.
- 1.
  FBCSP [42] followed by LDA is a classic MI decoding method. It is implemented using the MNE toolbox [64]. https://mne.tools/stable/index.html.
- 2.
  EEGNet [17] is a compact DL-based MI decoding model. It is implemented by adopting the source code from https://github.com/vlawhern/arl-eegmodels.
- 3.
  Shallow CNN and Deep CNN [16] extract EEG features with temporal and spatial convolution layers. A fully connected layer is then used to output the predicted result. The codes were adopted from https://github.com/TNTLFreiburg/braindecode.
- 4.
  C2CM [18] first uses FBCSP to filter EEG data. Then, it takes the envelope of filtered signals as a feature and uses small convolutional layers to learn temporal and spatial information. As we do not have access to the source code, this method was reimplemented following the original paper.
- 5.
  FBCNet [51] employs a filter bank to create multiview EEG features and uses a multiview CNN with temporal variance layers to classify EEG signals. The codes were adopted from https://github.com/ravikiran-mane/FBCNet.
- 6.
  TD-Attn [52] extracts CSP features as input and uses a time-distributed attention network to classify EEG signals. We adopted the codes provided by the original author.
- 7.
  Matt [53] is a Riemannian geometric-based DL method that takes the covariance matrix time series of EEG signals as input and applies an attention mechanism to capture the spatial-temporal information. The codes were adopted from https://github.com/CECNL/MAtt.
(b)
fNIRS methods: The fNIRS single modality method extracts the means and slope of HbR and HbO signals as features and uses shrinkage LDA (SLDA) as a classifier. This fNIRS decoding method is referred to as SLDA and was reimplemented following the paper of Shin et al [22].
(c)
Multimodal method: multimodal methods, including DF and polynomial fusion.
- 1.
  The DF method takes vote of the output of the FBCSP and SLDA. The method was reimplemented following Wang et al [25].
- 2.
  PFNet [27] first extracts EEG and fNIRS features with a CNN. Then, it fuses the multimodal features with the polynomial fusion method. Finally, a fully connected layer is adopted to output the prediction result. The code was adopted from https://github.com/sunzhe839/tensorfusion_EEG_NIRS. As the original code did not use a validation set, we added a validation process to the code. In short, the model that gains the highest validation accuracy in training epochs was selected as the final network. The validation process is the same as that in our method.

5. Results

5.1. Data analysis

Figure 4 shows the average time-frequency maps of four MI tasks and one resting task across all subjects for EEG electrode positions C3, Cz, and C4. Blue indicates ERD. For simplicity, we use LH, RH, BH, F, R to represent left hand, right hand, both hand, feet, and resting MI tasks in the following context. For four MI tasks (LH, RH, BH, and F), the time-frequency maps show obvious ERDs in the alpha and beta frequency bands. For LH, RH, and BH, ERD is strong at C3 and C4 in the alpha and beta bands, which starts at approximately 0.5 s after MI tasks begin. Compared with the hand MI tasks, there was a weak ERD pattern at C3, C4, and Cz in the feet MI task. In the resting state, no obvious ERD is observed in the alpha and beta frequency bands.

According to the time-frequency maps, 11.5–13.5 Hz is selected to obtain the average topological maps of EEG signals across all subjects (figure 5). Obvious ERD is observed on motor cortices in topological maps of LH, RH, BH, and F MI tasks. For the resting state, the activation in the motor area is relatively low. For LF and RH, contralateral dominance was observed, and electrodes from contralateral sides of the MI task hand showed much stronger ERD than ipsilateral side electrodes. For BH, both left-side and right-side electrodes show strong ERD. For feet MI, ERD is observed on central and both sides electrodes of the motor area.

Figure 6 shows the average HbO waveform for three fNIRS channels (CCP3-CCP5, CCP2-CCP1, CCP6-CCP4) across all subjects. The channels were selected from the left motor area, middle motor area and right motor area. The positions of the three fNIRS channels are close to the positions of EEG electrodes C3, Cz, and C4. For the HbO waveforms of LH, RH, BH, and F, the HbO values increased as the MI task began and reached the highest point at approximately 8 s. For the resting state, the increase in the HbO value was relatively low. One-way repeated-measures ANOVA showed that MI class had a significant main effect on the waveform values in 2.2–13.5 s in the CCP3-CCP5 channel, in 7.5–13.8 s in the CCP2-CCP1 channel and in 7.0–12.5 s in the CCP6-CCP4 channel (p < 0.05). In the CCP3-CCP5 channel in the left hemisphere, this time range is much wider than that in the CCP2-CCP1 channel and CCP6-CCP4 channel. Figure 7 shows the HbO waveform of LH vs. BH and RH vs. BH. LH and BH waveforms are significantly different on CCP3-CCP5channel in the left hemisphere from 4.6 to 7.7 s. LH and BH waveforms tend to be different on CCP2-CCP1 channel in the mid hemisphere from 13.7 to 16.6 s. RH and BH waveforms are significantly different on CCP6-CCP4 channel in the right hemisphere from 6.4 to 11.3 s. RH and BH waveforms are also significantly different on CCP2-CCP1 channel from 15.3 s to 20 s.

**Figure 7.** Average HbO waveform across all subject for left hand, right hand and both hand fNIRS signal. Gray area indicates that difference between classes is significant (p < 0.05). Light gray area indicates that difference between classes tend to be significant (p < 0.1).
Download figure:
Standard image High-resolution image

According to the HbO waveforms, the time range of 8–10 s was selected to obtain the average topological map of HbO across all subjects (figure 8). In figure 8, LH, RH, BH, and F MI show obvious activation in the motor brain area, while the rest state shows no activation. Contralateral dominance was observed for the left- and right-hand HbO signals. For both hand MI, both left and right side channels show obvious activation. For feet MI, left side, right side and post middle channels in motor brain area shows activation.

Overall, the EEG signals correspond to the fNIR signals. Compared to the rest state, the hands and feet MI induced obvious activation of the motor brain area for both EEG and fNIRS signals. Contralateral dominance is observed on both EEG and fNIRS signals. For feet MI, central channels show relatively higher activation on EEG and fNIRS signals.

5.2. Comparison experiment

The decoding accuracy of our method and all compared methods on our dataset is presented in table 2. Each column refers to each subject in our experiment, and each row refers to one decoding method. One-way repeated-measures ANOVA showed a significant main effect of the decoding method on classification accuracy (F = 17.09, p < 0.01). Post hoc analysis shows that our method is significantly better than all compared methods (post hoc test, all p < 0.01).

Table 2. Classification accuracy of different algorithms on our dataset (in percentage %, '*', '**' represents compared with our method P < 0.05 and P < 0.01 respectively). EEG decoding methods are reported from the FBCSP to the MAtt row, SLDA is a fNIRS decoding method, multimodal decoding methods are reported from the PFNet to the last row.

Method	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	Mean
FBCSP [42]	68.0	25.2	41.2	76.8	88.8	35.6	36.4	68.4	47.2	66.8	52.8	48.8	57.6	61.2	40.4	54.3 **
EEGNet [17]	74.4	38.8	56.8	74.8	87.2	44.0	38.4	60.0	52.4	67.2	56.8	38.4	65.6	62.8	40.8	57.2 **
Shallow [16]	76.0	42.0	50.8	72.0	85.2	44.8	47.2	61.6	61.6	71.2	57.6	37.2	62.0	66.8	45.2	58.7 **
Deep [16]	67.6	32.4	45.6	66.0	88.4	31.2	28.4	54.4	52.8	52.0	52.0	31.6	57.2	60.4	28.0	49.9 **
C2CM [18]	69.6	35.2	52.4	64.0	86.0	38.8	40.8	60.8	48.0	67.2	52.4	36.0	56.0	59.6	34.4	53.4 **
FBCNet [51]	68.4	34.4	37.2	70.0	81.6	39.2	44.0	63.6	48.0	64.8	45.6	34.8	53.6	53.2	41.6	52.0 **
TD-Attn [52]	76.8	24.0	37.2	61.6	80.0	37.6	32.8	64.4	48.4	69.2	51.2	40.4	49.2	58.0	37.6	51.2 **
MAtt [53]	73.2	31.6	44.8	62.8	82.0	42.0	46.4	54.4	54.8	51.6	53.6	38.8	53.6	60.4	41.6	52.8 **
SLDA [22]	31.6	43.2	27.2	49.6	34.8	30.0	33.6	50.8	27.6	47.2	33.2	36.8	80.8	51.6	42.4	41.4 **
PFNet [27]	47.6	30.4	21.6	39.6	40.8	20.8	26.8	36.4	28.8	32.0	31.6	25.2	64.4	43.2	28.0	34.5 **
DF [25]	69.2	43.2	40.4	72.8	87.6	34.4	43.2	66.8	50.0	69.2	48.4	43.6	89.2	62.8	49.2	58.0 **
Ours	76.0	50.4	56.0	70.8	87.6	50.8	51.2	71.2	59.6	77.2	59.2	48.4	83.2	68.8	55.2	64.4

Among EEG single modality methods, shallow CNN achieves the highest decoding accuracy of 58.7%, and its performance is significantly higher than other EEG single modality methods (all p < 0.05). SLDA, a fNIRS single modality method, achieves a decoding accuracy of 41.4%, which is significantly lower than most of the EEG single modality methods (FBCSP, EEGNet, Shallow CNN, C2CM, FBCNet, Matt, all p < 0.05). This finding suggests that the EEG signal contains more useful information for MI decoding than the fNIRS signal. Our method achieves significantly higher decoding accuracy than the best single modality method Shallow CNN (p < 0.01). As our method uses information from two modalities based on our framework, we can obtain higher decoding performance.

Our multimodal method achieves a decoding accuracy of 64.4%, which is 6.4% higher than that of the multimodal compared method, DF (p < 0.01). These results show that our method can better utilize information from EEG and fNIRS data and enhance the decoding accuracy of MI-BCI.

The decoding accuracy of our method and all compared methods on the MI2 dataset is presented in table 3. On this public dataset, the decoding accuracy of fNIRS data from some subjects is below chance level; thus, this kind of fNIRS data could not contribute useful information in multimodal learning. Hence, the decoding accuracies of all subjects, subjects whose fNIRS decoding accuracy was below 50% (chance level) and subjects whose fNIRS decoding accuracy was above 50% were separately calculated. One-way repeated-measures ANOVA showed a significant main effect of decoding method on classification accuracy for all the cases (all p < 0.01). Post hoc analysis shows that our method is significantly better than all compared methods except MAtt under the case fNIRS > 50% (all p < 0.01). For case all subjects, our method achieves significantly higher decoding accuracy than all compared methods except shallow CNN and MAtt. For case fNIRS < 50%, the decoding accuracy of our method is not significantly different from that of any of the compared methods. These results show that our method can enhance MI decoding accuracy when the fNIRS modality is informative.

Table 3. Classification accuracy of different algorithms on the MI2 dataset (in percentage %, '*', '**' represents compared with our method P < 0.05 and P < 0.01 respectively). EEG decoding methods are reported from the FBCSP to the MAtt row, SLDA is a fNIRS decoding method, multimodal decoding methods are reported from the PFNet to the last row.

Method	All subjects	fNIRS < 50%	fNIRS > 50%
FBCSP [42]	67.1**	57.6	71.3**
EEGNet [17]	68.5**	59.3	72.7**
Shallow [16]	72.5	63.5	76.5*
Deep [16]	64.4**	54.1	69.0**
C2CM [18]	68.6**	59.4	72.8**
FBCNet [51]	59.1**	50.4	63.1**
TD-Attn [52]	68.2**	62.2	70.8**
MAtt [53]	73.3	62.4	78.3
SLDA [22]	56.9**	49.6	60.2**
PFNet [27]	53.5**	49.1	55.4**
DF [25]	69.6**	58.7	74.5**
Ours	73.62	60.0	79.8

5.3. Ablation study

An ablation study was conducted to evaluate the effectiveness of each part in our method. Table 4 shows the decoding accuracy of different models. Each column refers to one subject. Rows 'fNIRS' and 'EEG' represent our single modality network, without any proposed loss and fusion module. 'w/o inter', 'w/o intra', and 'w/o center' refer to our network trained without inter- and intramodality contrastive loss and center loss, respectively. In the 'w/o attn' case, EEG features and fNIRS features are fused by concatenation instead of our attention-based multimodal fusion module (all losses used). One-way repeated-measures ANOVA showed a significant main effect of model on decoding accuracy (F = 22.20, p < 0.01). The post hoc test shows that our method achieves significantly higher decoding accuracy than each compared model (p < 0.05). This indicates that each of our proposed loss and fusion models can enhance decoding accuracy. Our full multimodal method achieves 4.6% higher decoding accuracy than EEG single modality. This shows that by incorporating the fNIRS modality, we could further enhance the decoding accuracy of MI-BCI. It is noteworthy that without center loss, the decoding accuracy dropped by 8.6%. This indicates that center loss plays an important role in our method.

Table 4. Ablation study result (in percentage %, '*', '**' represents compared with our method P < 0.05 and P < 0.01 respectively).

Method	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	Mean
fNIRS	35.2	44.4	32.4	52.8	35.2	30.0	34.8	48.0	34.0	53.6	30.4	38.8	79.6	54.0	44.8	43.2 **
EEG	74.0	42.4	55.6	70.0	87.6	48.8	47.6	68.4	62.4	71.6	59.6	36.4	65.2	65.2	41.6	59.8 *
w/o inter	77.2	48.4	54.4	71.2	86.0	48.0	50.0	68.8	59.6	76.4	57.2	46.8	80.0	67.6	52.8	63.0 **
w/o intra	75.2	45.6	56.4	72.0	85.2	46.0	50.8	63.6	58.8	77.6	57.2	42.0	80.0	67.2	50.8	61.9 **
w/o center	66.0	46.8	44.0	63.2	67.2	37.6	42.0	64.4	43.6	71.2	45.2	43.6	88.8	60.4	52.8	55.8 **
w/o attn	76.4	50.4	57.2	68.8	87.6	49.2	48.4	67.2	57.6	77.6	56.0	45.2	82.4	70.4	54.0	63.2 *
Ours	76.0	50.4	56.0	70.8	87.6	50.8	51.2	71.2	59.6	77.2	59.2	48.4	83.2	68.8	55.2	64.4

5.4. Visualization of data distribution

To explore the effect of center loss on multimodal fusion, t-distributed stochastic neighbor embedding was applied to visualize the data distribution from the network. One network with center loss and one without center loss were trained on one folder of data from subject 1, which are named Net1 and Net2, respectively. Both networks are trained, and then these two networks are used to extract the features of all data from subject 1. Figures 9(a) and (b) show the distribution of raw EEG and fNIRS signals. Figures 9(c) and (d) show the distribution of EEG and fNIRS features extracted by Net2 without center loss. Figures 9(e) and (f) show the distribution of EEG and fNIRS features extracted by Net1 with center loss. Figures 9(g) and (h) show the distribution of fused multimodal features through Net2 and Net1, respectively.

**Figure 9.** Visualization of data distribution through our network with and without center loss of subject 1.
Download figure:
Standard image High-resolution image

In figures 9(a) and (b), the maximum overlap was observed among different classes of raw EEG and raw fNIRS signals. After processing by our feature extractor, the samples of each class are clustered into multiple clusters for both EEG and fNIRS features (figures 9(c)–(f)). Thereinto, clusters of EEG features are easier to distinguish (figures 9(d) and (f)), while fNIRS features of different classes are entangled together (figures 9(c) and (e)). After processed by our multimodal fusion module of Net₂ without center loss, the fused feature (figure 9(g)) shows better separability than the fNIRS feature (figure 9(c)) but worse separability than the EEG feature (figure 9(d)). After processing by our multimodal fusion module of Net1 with center loss, the fused feature (figure 9(g)) shows better separability than fNIRS (figure 9(e)) and EEG (figure 9(f)) single modality features. Notably, the EEG feature of BH is separated into two clusters in figure 9(f), while the fused feature of BH forms a single cluster between the left hand and right hand in figure 9(h). These results show that with center loss, our network could extract better multimodal feature representations.

5.5. Effect of the multimodal fusion module

To explore how our attention-based multimodal fusion module works, the average output attention weights for the fNIRS feature and the average temporal waveform of the fNIRS signal from subject 1 were visualized (figure 10). In figure 10(a), attention weights gradually grow over time, which shows a similar pattern to the fNIRS waveform in figure 10(b). As suggested in figure 6, time points from 7.5 s to 12 s are significantly different between classes. Our network gives higher weights to these time points. This shows that our fusion module learns to allocate weights according to the nature of the input feature and makes our model focus on time points that are more valuable for classification.

**Figure 10.** (a) Attention weight for NIRs feature of subject 1 (b) HbO waveform on channel CCP2-CCP1 of subject 1.
Download figure:
Standard image High-resolution image

To explore how the multimodal fusion enhances the decoding performance of each MI class, the confusion matrixes of the single modality decoding results, multimodal decoding result and the subtraction of the EEG confusion matrix from the multimodal confusion matrix were further plotted (figure 11). For EEG single modality MI decoding (figure 11(a)), LH and RH show similar decoding accuracies (57.3% and 57.5%), which are higher than the decoding accuracy of BH (42.3%). The feet and rest classes achieve 76.1% and 79.3% accuracies, respectively, which are higher than the accuracies of LH, RH and BH. BH is prone to be misclassed with LH and RH (BH → LH: 24.8%; BH → RH: 23.3%; LH → BH: 24.9%; RH → BH: 24.8%). fNIRS single-modality decoding shows similar results to EEG decoding (figure 11(b)). F and Rest achieve higher decoding accuracy than hand tasks. LH and RH are prone to be misclassed with BH. In figure 11(c), the multimodal classification accuracy of each class is higher than that of the single modal classification. As shown in figure 11(d), with multimodal fusion, the classification accuracies of LH, RH, BH, F, and R increased 5.3%, 5.3%, 3.7%, 2.3% and 4.0%, respectively, compared with EEG single modality decoding. Compared with other classes, the LH and RH classes have greater improvement in accuracy and greater reduction in misclassification with BH. Thus, for hand MI, multimodal fusion increases the decoding accuracy of LH and RH by reducing their misclassification with BH. These results demonstrate that multimodal fusion improves the decoding performance of each class.

**Figure 11.** Confusion matrix of (a) EEG single modal, (b) fNIRS single modal, (c) multimodal, and (d) the subtraction of EEG single modal confusion matrix from multimodal confusion matrix (data from subject 1).
Download figure:
Standard image High-resolution image

6. Discussion

In this study, multimodal MI data from 15 subjects were collected. Time-frequency maps, waveforms and topological maps of the collected data were analyzed. A five-class multimodal MI dataset was built, which will be open accessed and will help the studies of multimodal MI decoding methods. A decoding method was proposed and evaluated on both our collected dataset and a public dataset. A comparison experiment shows that our method achieves higher decoding accuracy than the compared methods. An ablation study shows that each part of our proposed method contributes to the decoding performance.

6.1. The mechanism of each part of our model

Our modality-specific feature extractors showed good performance. The experimental results show that the single modality decoding accuracies are comparable to the best single modality decoding methods (our EEG: 59.8% ± 3.7%, shallow CNN 58.7% ± 3.6%, p = 0.092; our fNIRS 43.2% ± 3.4%, SLDA 41.4% ± 3.6%, p < 0.05). This good performance depends on our reasonable model design based on the characteristics of EEG and fNIRS data during MI tasks. The EEG feature extractor uses temporal convolution to extract patterns in the time series and applies pow activation to extract energy information. Indeed, MI tasks induce event-related synchronization and desynchronization of EEG data. The fNIRS modality network calculates the temporal mean and slope of fNIRS data with low temporal resolution in multiple time windows and then employs spatial convolutions to extract spatial information from all fNIRS channels.

It is worth noting that center loss was found to play a very important role in multimodal MI decoding in our ablation study and visualization results (table 4 and figure 9). Also, multimodal decoding accuracy without center loss is worse than EEG single modality decoding accuracy (table 4). The distribution of the fused feature (figure 9(g)) is worse than that of the EEG single modality feature (figure 9(f)). This implies that fusing EEG and fNIRS data without proper guidance may damage the fusion feature representation and reduce the decoding accuracy. The fused multimodal feature with center loss shows a much better distribution than the fused multimodal feature without center loss (figure 9(h) vs. Figure 9(g)). This indicates that center loss can guide multimodal fusion rather than enhance single modality feature representation.

Our intramodality contrastive loss and intermodality contrastive loss significantly improved the decoding accuracy (table 4). It may be due to that our intramodality contrastive loss can enhance the single modality feature representation and that our intermodality contrastive loss can serve as a pathway for the two modalities to guide each other during network training. EEG data and fNRIS data are easily disturbed by noise. When one modality of a sample is disturbed by strong noise, the extracted feature may be far from the class center. This would confuse the classifier during training. In this case, aligning features of noisy modalities to features of the other modality would indirectly pull the features of the noisy modality closer to their class center through intermodality contrastive loss, which may alleviate the problem of noise disturbance and enhance the robustness of our method. Hence, the proposed contrastive losses improve the decoding performance in MI tasks.

Finally, the attention-based EEG-fNIRS fusion module dynamically focuses on different time points, alleviates the impact of nonstationarity and noise and enhances multimodal decoding accuracy.

6.2. When multimodal improves over EEG single modality

The mean multimodal decoding accuracy across subjects is higher than that of widely researched EEG single modal decoding accuracy. However, for different subjects, the improvement in decoding accuracy is different. For a few subjects, multimodal decoding accuracy is even decreased. To explore what might affect the accuracy improvement, we analyzed the relationship between multimodal accuracy improvement (subtraction of EEG decoding accuracy from multimodal decoding accuracy) and the single modality accuracies as well as the difference between EEG and fNIRS accuracy on our dataset.

Multimodal accuracy showed no relation with EEG accuracy (r = 0.219, p = 0.214), a positive relationship with fNIRS accuracy (r = 0.612, p < 0.01, figure 12(a)) and a negative relationship with the difference between EEG and fNIRS accuracy (r = 0.727, p < 0.01, figure 12(b)).

Higher fNIRS decoding accuracy implies that our decoding model is able to extract useful information from fNIRS data. This may be helpful to enhance the decoding accuracy when combined with EEG data. A larger difference between EEG decoding accuracy and fNIRS decoding accuracy implies that the separability of fNIRS and EEG features extracted by our network are very different. In this case, it may be difficult for our intermodality contrastive loss to align the fNIRS and EEG feature distributions and to affect the ability of our network to extract good multimodal feature representations. Thus, when the difference between EEG and fNIRS decoding accuracy is very large, our multimodal alignment and fusion modules may be hard to work well.

6.3. EEG channel selection

In this study, considering unrelated information may damage the decoding performance. Forty-four EEG channels were manually selected from the motor sensory area. The decoding accuracy of the EEG signal with all channels and the EEG signal with selected channels were compared. Two representative EEG decoding methods, the FBCSP and shallow CNN, are selected to perform this experiment. The results are presented in table 5. Each column refers to one subject. Two-way repeated-measures ANOVA showed a significant main effect of channel selection on decoding accuracy (F = 5.45, p < 0.05) and a significant interaction between the decoding method and channel selection (F = 11.52, p < 0.01). For shallow CNN, the decoding accuracy of selected channels is significantly higher than the accuracy of all channels (p < 0.01). For the FBCSP, the decoding accuracy of the selected channels was not significantly higher than the accuracy of all channels (p = 0.428). These results show that our manually selected EEG channels are feasible.

Table 5. Effect of EEG channel selection (in percentage %, '*', '**' represents compared with all channel P < 0.05 and P < 0.01 respectively).

Method	Channel	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	Mean
Shallow	All	74.4	40.0	45.6	58.8	70.4	37.2	50.0	52.0	50.4	62.0	48.4	36.0	60.8	50.4	46.8	52.2
Shallow	Selected	76.0	42.0	50.8	72.0	85.2	44.8	47.2	61.6	61.6	71.2	57.6	37.2	62.0	66.8	45.2	58.7 **
FBCSP	All	67.2	27.2	40.8	76.0	91.2	32.8	38.4	47.2	49.6	69.6	52.0	49.6	63.2	52.4	52.8	54.0
FBCSP	Selected	68.0	25.2	41.2	76.8	88.8	35.6	36.4	68.4	47.2	66.8	52.8	48.8	57.6	61.2	40.4	54.3

6.4. Simulated online experiment

In our experiments, offline five-fold cross-validation was adopted to evaluate the effectiveness of our method. Here, a simulated online experiment was performed to evaluate the performance of our method by using the first eight sessions of data as the training set and the remaining two sessions of data as the test set. Table 6 presents the results of offline cross validation and simulated online validation. Each column refers to one subject. Our method achieves 64.3% decoding accuracy in the simulated online experiment and 64.4% decoding accuracy in the offline experiment. The paired t test shows that the decoding accuracies of the offline and simulated online experiments are not significantly different. In addition, the decoding accuracy of our method in the simulated online experiment is significantly higher than the accuracies of all compared methods in the offline experiments (all p < 0.05). These results suggest that our proposed method may work well in online MI-BCI application scenes. In future work, online experiments should be conducted to solidly evaluate the decoding methods.

Table 6. Comparation of decoding accuracy between offline and simulated online experiment (in percentage %).

Method	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	Mean
Offline	76.0	50.4	56.0	70.8	87.6	50.8	51.2	71.2	59.6	77.2	59.2	48.4	83.2	68.8	55.2	64.4
Simulated online	76.0	38.0	60.0	78.0	92.0	56.0	44.0	70.0	60.0	80.0	64.0	48.0	76.0	68.0	54.0	64.3

6.5. Limitations

Although our proposed method based on EEG and fNIRS data enhances the multimodal MI decoding accuracy. There are still several limitations. First, each time one uses MI-BCI, a large amount of training data is needed to train the decoding model. This process is time consuming and impacts the practicality of MI-BCI. Transferring knowledge from existing data to facilitate the model training of new subjects can alleviate this problem. There have been some studies on single modality transfer learning of MI decoding. However, multimodal transfer learning of MI decoding needs further research.

Second, there is a practical problem in MI-BCI that the online decoding accuracy declines over a long period of time. However, many single-modality incremental learning methods have been proposed. The online update method of the multimodal MI-BCI decoding model needs further study.

7. Conclusion

In this study, a multimodal MI dataset from 15 subjects was collected, and the time-frequency map, waveform, and topological maps of the collected data were analyzed. Then, a multimodal MI decoding method was proposed. The decoding method includes a modal specific feature extractor for EEG and fNIRS data, an attention-based fusion module, intermodality contrastive loss, intramodality contrastive loss, and center loss. A comparison experiment shows that our method achieves higher decoding accuracy than the compared methods. An ablation study shows that each part of our proposed method contributes to the decoding performance. This study provides a new approach for enhancing multimodal MI-BCI feature representation and decoding accuracy.

Data availability statement

The data cannot be made publicly available upon publication because they are not available in a format that is sufficiently accessible or reusable by other researchers. The data that support the findings of this study are available upon reasonable request from the authors.

Acknowledgments

This work was supported by the Beijing Natural Science Foundation (J210010 and 7222311)，the National Natural Science Foundation of China (62020106015, U21A20388, and 62276262)，and the Strategic Priority Research Program of CAS (XDB32040000).