Data Format

This page documents the data format accepted by this package.

Example data is provided in the data directory. Let’s load one file and inspect its format.

import numpy as np
import scipy.io  # import the submodule explicitly; a bare `import scipy` may not expose scipy.io

file_path = "../data/data_CRMN_vs_MMN_imbalLDA_order_proj_1.mat"
data = scipy.io.loadmat(file_path)
data.keys()

dict_keys(['__header__', '__version__', '__globals__', 'user_class_min_1', 'user_feat_1', 'user_prob_1', 'user_resp_1', 'user_source_1', 'user_tr_order_1', 'user_train_prob_1', 'user_weights_1'])

The source code that generated this data file can be accessed at this link

Variables of Interest

tr_num
- Trial Number
user_feat_{tr_num}
- Feature engineered data for each user.
user_source_{tr_num}
- Source Information
user_resp_{tr_num}
- Response Information
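
The naming pattern above can be sketched as a small helper. The function name `variable_keys` and the `data_keys` list are illustrative assumptions (the list copies the keys printed earlier and stands in for the loaded dictionary); this is not part of the package's API.

```python
def variable_keys(tr_num):
    """Build the variable names of interest for one trial number (hypothetical helper)."""
    return {
        "features": f"user_feat_{tr_num}",
        "source": f"user_source_{tr_num}",
        "response": f"user_resp_{tr_num}",
    }

# Stand-in for the keys of the dict returned by scipy.io.loadmat.
data_keys = [
    "__header__", "__version__", "__globals__",
    "user_class_min_1", "user_feat_1", "user_prob_1", "user_resp_1",
    "user_source_1", "user_tr_order_1", "user_train_prob_1", "user_weights_1",
]

keys = variable_keys(1)
# Any expected variable that is absent from the file would be listed here.
missing = [name for name in keys.values() if name not in data_keys]
print(keys, missing)
```

Checking for missing names up front gives a clearer error than a `KeyError` later in the analysis.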

User Features

user_features = data["user_feat_1"][0]
user_features.shape

(26,)

The 26 in the shape indicates that this trial has 26 participants.

user_features[0].shape, user_features[1].shape

((432, 72), (134, 72))

user_features[0]

array([[ -7.23661128, -13.93739628, -22.81788437, ...,  -9.40298223,
        -11.77818306, -18.10076694],
       [ -2.63865315,   0.3358343 ,  -1.92867504, ...,  14.19302948,
          9.98168734,  17.5732571 ],
       [ -4.28406267,  -9.36639654, -16.71320915, ...,  -2.0318536 ,
         -3.8530683 ,   3.86939731],
       ...,
       [ -3.88435706,  -4.79033675,  -5.75699235, ...,  14.81179504,
         17.04872137,  13.36880608],
       [-13.37804033,  -7.12282845,  -1.5593848 , ...,  11.92473593,
         15.36710888,  26.49037254],
       [  0.18273821,  -0.71073707,  -6.72626149, ...,  -3.00038489,
         -4.9075405 , -15.84218536]])

Each participant has a different number of trials (rows), but every trial has the same number of features (the second dimension, 72). Detailed information is documented in the ALL_DATA_1.mat file. (TODO)
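
As a minimal sketch of working with this ragged layout, the snippet below checks that the feature dimension is shared and stacks all participants into one matrix. The data is synthetic: it only mimics the structure of `data["user_feat_1"][0]`, and the three trial counts are assumed for illustration.

```python
import numpy as np

# Synthetic stand-in for data["user_feat_1"][0]: an object array holding one
# (n_trials, n_features) matrix per participant. Values and counts are made up.
rng = np.random.default_rng(0)
trial_counts = [432, 134, 250]
user_features = np.empty(len(trial_counts), dtype=object)
for i, n in enumerate(trial_counts):
    user_features[i] = rng.normal(size=(n, 72))

# Every participant should share the same feature dimension.
feature_dims = {feat.shape[1] for feat in user_features}
assert len(feature_dims) == 1, f"inconsistent feature counts: {feature_dims}"

# Stack all trials and record which participant each row came from.
X_all = np.vstack(list(user_features))
participant_id = np.concatenate(
    [np.full(feat.shape[0], i) for i, feat in enumerate(user_features)]
)
print(X_all.shape, participant_id.shape)
```

Keeping `participant_id` alongside the stacked matrix makes it possible to split by participant again later, for example for per-participant cross-validation.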

Each observation in the data has an associated source and response label.

Encoding Table

Source Information

Source information is a column array (shape (n, 1)) that holds one numeric source label per observation. The encoding is given in this table:

Encoding Number	Full Description	Abbreviation
1	Source Correct	SC
2	Correct Rejection	CR
3	Source Incorrect	SI
4	Miss	Miss
5	False Alarm	FA

Response Information

Response information is a column array (shape (n, 1)) that holds one numeric response label per observation. The encoding is given in this table:

Encoding Number	Full Description	Abbreviation
1	Remember Source	RS
2	Remember Other	RO
3	Familiarity	F
4	Maybe New	MN
5	Sure New	SN
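
The two encoding tables can be captured as lookup dictionaries so that numeric codes become readable in plots and summaries. This is a sketch: the names `SOURCE_LABELS`, `RESPONSE_LABELS`, and `decode` are illustrative and are not defined by the package.

```python
import numpy as np

# Code-to-abbreviation mappings, copied from the encoding tables.
SOURCE_LABELS = {1: "SC", 2: "CR", 3: "SI", 4: "Miss", 5: "FA"}
RESPONSE_LABELS = {1: "RS", 2: "RO", 3: "F", 4: "MN", 5: "SN"}

def decode(codes, table):
    """Map an array of integer codes (any shape) to a flat array of abbreviations."""
    flat = np.asarray(codes).ravel()
    return np.array([table[int(code)] for code in flat])

# Small made-up example shaped like one participant's column array.
example_source = np.array([[1], [2], [5]], dtype=np.uint8)
decoded = decode(example_source, SOURCE_LABELS)
print(decoded)
```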

Data Shapes and Trials

There are 26 participants in this trial (tr_num = 1).

source_info = data["user_source_1"][0]
source_info.shape

(26,)

The first and second participants have 432 and 134 observations, respectively.

source_info[0].shape, source_info[1].shape

((432, 1), (134, 1))

The source information is numerically coded from 1 to 5.

np.unique(source_info[0], return_counts=True)

(array([1, 2, 3, 4, 5], dtype=uint8), array([131, 120,  48, 105,  28]))

The shape of the response labels should match that of the source information.

resp_info = data["user_resp_1"][0]
resp_info.shape

(26,)

resp_info[0].shape, resp_info[1].shape

((432, 1), (134, 1))

np.unique(resp_info[0], return_counts=True)

(array([1, 2, 3, 4, 5], dtype=uint8), array([ 96,  16,  95, 102, 123]))
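
The alignment described above can be checked programmatically, and since each observation carries both a source and a response code, a 5 x 5 count table of the two is a natural summary. The sketch below uses synthetic labels; `check_alignment` and `crosstab` are hypothetical helper names, not package functions.

```python
import numpy as np

def check_alignment(features, source, response):
    """Raise if any participant's trial counts differ across the three arrays."""
    for i, (f, s, r) in enumerate(zip(features, source, response)):
        if not (f.shape[0] == s.shape[0] == r.shape[0]):
            raise ValueError(f"participant {i}: mismatched trial counts")

def crosstab(source, response, n_codes=5):
    """Count observations per (source code, response code) pair; codes start at 1."""
    table = np.zeros((n_codes, n_codes), dtype=int)
    rows = np.asarray(source).ravel().astype(int) - 1
    cols = np.asarray(response).ravel().astype(int) - 1
    np.add.at(table, (rows, cols), 1)  # unbuffered add handles repeated pairs
    return table

# Synthetic single-participant data with the documented shapes.
rng = np.random.default_rng(0)
n_trials = 432
features = [rng.normal(size=(n_trials, 72))]
source = [rng.integers(1, 6, size=(n_trials, 1), dtype=np.uint8)]
response = [rng.integers(1, 6, size=(n_trials, 1), dtype=np.uint8)]

check_alignment(features, source, response)
table = crosstab(source[0], response[0])
print(table)
```

Row `i` and column `j` of the table correspond to source code `i + 1` and response code `j + 1`.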

Machine Learning Problem – Features and Labels

Replicating the primary results has been the main focus of this research project. We can frame the classification problem as follows: can we use an interpretable algorithm to discriminate between different categories of trials based on the EEG features? In doing so, we hope to explore different mechanisms of memory retrieval.

The features, \(\mathbb{X}\), are data["user_feat_1"]. Since each observation is identified by both a source and a response variable, we need a mechanism to filter out labels that are not of interest and to combine several smaller labels into a larger class.
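
One possible shape for such a mechanism is sketched below. It is an assumption, not the package's implementation: the helper `build_labels` and the example `class_map` grouping are hypothetical, and the arrays are tiny made-up inputs. Each class is defined as a list of (source code, response code) pairs; observations matching no pair are dropped.

```python
import numpy as np

def build_labels(source, response, class_map):
    """Return (keep_mask, y) for observations matching any pair in class_map.

    class_map maps a class id to a list of (source_code, response_code) pairs.
    """
    source = np.asarray(source).ravel()
    response = np.asarray(response).ravel()
    y = np.full(source.shape[0], -1, dtype=int)  # -1 marks "not of interest"
    for class_id, pairs in class_map.items():
        for src_code, resp_code in pairs:
            y[(source == src_code) & (response == resp_code)] = class_id
    keep = y >= 0
    return keep, y[keep]

# Hypothetical grouping: class 0 merges CR/MN with CR/SN, class 1 merges
# Miss/MN with Miss/SN. Every other combination is filtered out.
class_map = {0: [(2, 4), (2, 5)], 1: [(4, 4), (4, 5)]}

source = np.array([[2], [2], [4], [1], [4], [5]], dtype=np.uint8)
response = np.array([[4], [5], [4], [1], [5], [2]], dtype=np.uint8)
X = np.arange(12, dtype=float).reshape(6, 2)  # placeholder features

keep, y = build_labels(source, response, class_map)
X_selected = X[keep]
print(X_selected.shape, y)
```

Returning the mask separately from the labels lets the same filter be applied to the feature matrix, so rows of `X_selected` and entries of `y` stay aligned.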