Data Format

This page documents the data format accepted by this package.

The example data is stored in the data directory. Let’s load it and inspect its format.

import numpy as np
import scipy.io  # import the io submodule explicitly so scipy.io.loadmat is available
file_path = "../data/data_CRMN_vs_MMN_imbalLDA_order_proj_1.mat"
data = scipy.io.loadmat(file_path)
data.keys()
dict_keys(['__header__', '__version__', '__globals__', 'user_class_min_1', 'user_feat_1', 'user_prob_1', 'user_resp_1', 'user_source_1', 'user_tr_order_1', 'user_train_prob_1', 'user_weights_1'])

The source code that generated this data file can be accessed at this link

Variables of Interest

  • tr_num
    • Trial number
  • user_feat_{tr_num}
    • Feature-engineered data for each user.
  • user_source_{tr_num}
    • Source information
  • user_resp_{tr_num}
    • Response information
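The naming pattern above can be turned into a small lookup helper. The sketch below is illustrative only: the function name `variable_keys` and the short dictionary keys are assumptions, not part of the package.

```python
def variable_keys(tr_num):
    """Build the .mat variable names for one trial number.

    Hypothetical helper: only the name pattern comes from the data file.
    """
    return {
        "features": f"user_feat_{tr_num}",
        "source": f"user_source_{tr_num}",
        "response": f"user_resp_{tr_num}",
    }

keys = variable_keys(1)
print(keys)
```

With the loaded file, `data[keys["features"]]` would then select the same variable as `data["user_feat_1"]`.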

User Features

user_features = data["user_feat_1"][0]
user_features.shape
(26,)

The 26 in the shape indicates that this trial has 26 participants.

user_features[0].shape, user_features[1].shape
((432, 72), (134, 72))
user_features[0]
array([[ -7.23661128, -13.93739628, -22.81788437, ...,  -9.40298223,
        -11.77818306, -18.10076694],
       [ -2.63865315,   0.3358343 ,  -1.92867504, ...,  14.19302948,
          9.98168734,  17.5732571 ],
       [ -4.28406267,  -9.36639654, -16.71320915, ...,  -2.0318536 ,
         -3.8530683 ,   3.86939731],
       ...,
       [ -3.88435706,  -4.79033675,  -5.75699235, ...,  14.81179504,
         17.04872137,  13.36880608],
       [-13.37804033,  -7.12282845,  -1.5593848 , ...,  11.92473593,
         15.36710888,  26.49037254],
       [  0.18273821,  -0.71073707,  -6.72626149, ...,  -3.00038489,
         -4.9075405 , -15.84218536]])

Each participant has a different number of trials, but every trial has the same number of features (the second dimension, 72). Detailed information is documented in the ALL_DATA_1.mat file. (TODO)
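The shared feature dimension can be checked programmatically. The following is a minimal sketch on synthetic data that mimics the structure of data["user_feat_1"][0]. The third participant's trial count and the random values are made up for illustration.

```python
import numpy as np

# Synthetic stand-in for data["user_feat_1"][0]: an object array holding one
# (n_trials, 72) matrix per participant. Sizes and values are illustrative.
rng = np.random.default_rng(0)
trial_counts = [432, 134, 250]
user_features = np.empty(len(trial_counts), dtype=object)
for i, n in enumerate(trial_counts):
    user_features[i] = rng.normal(size=(n, 72))

# The number of trials differs per participant; the feature count should not.
n_trials = [feat.shape[0] for feat in user_features]
n_features = {feat.shape[1] for feat in user_features}
assert len(n_features) == 1, "participants disagree on the feature count"

# Stacking all participants works only because the second dimension is shared.
stacked = np.vstack(list(user_features))
print(n_trials, n_features, stacked.shape)
```

Running the same check on the real array would be one way to confirm that every participant has 72 features before pooling trials.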

Each observation in the data has an associated source and response label.

Encoding Table

Source Information

Source information is a one-dimensional array that holds a numeric source code for each observation. The encoding is given in this table:

Encoding Number   Full Description    Abbreviation
1                 Source Correct      SC
2                 Correct Rejection   CR
3                 Source Incorrect    SI
4                 Miss                Miss
5                 False Alarm         FA

Response Information

Response information is a one-dimensional array that holds a numeric response code for each observation.

Encoding Number   Full Description   Abbreviation
1                 Remember Source    RS
2                 Remember Other     RO
3                 Familiarity        F
4                 Maybe New          MN
5                 Sure New           SN
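The two encoding tables can be expressed as lookup dictionaries for decoding label arrays. This is a sketch: the names `SOURCE_LABELS`, `RESPONSE_LABELS`, and `decode` are assumptions introduced here, and the example arrays are made up.

```python
import numpy as np

# Code -> abbreviation, copied from the encoding tables above.
SOURCE_LABELS = {1: "SC", 2: "CR", 3: "SI", 4: "Miss", 5: "FA"}
RESPONSE_LABELS = {1: "RS", 2: "RO", 3: "F", 4: "MN", 5: "SN"}

def decode(codes, table):
    """Map an (n, 1) or (n,) array of integer codes to abbreviations."""
    return [table[int(c)] for c in np.ravel(codes)]

# Made-up label columns in the same uint8 format as the data file.
example_source = np.array([[1], [4], [5]], dtype=np.uint8)
example_resp = np.array([[3], [5], [2]], dtype=np.uint8)

decoded_source = decode(example_source, SOURCE_LABELS)
decoded_resp = decode(example_resp, RESPONSE_LABELS)
print(decoded_source)  # ['SC', 'Miss', 'FA']
print(decoded_resp)    # ['F', 'SN', 'RO']
```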

Data Shapes and Trials

There are 26 participants in this trial (tr_num = 1).

source_info = data["user_source_1"][0]
source_info.shape
(26,)

The first and second participants have 432 and 134 observations, respectively.

source_info[0].shape, source_info[1].shape
((432, 1), (134, 1))

The source information is coded numerically from 1 to 5.

np.unique(source_info[0], return_counts=True)
(array([1, 2, 3, 4, 5], dtype=uint8), array([131, 120,  48, 105,  28]))

The shape of the response labels should match the shape of the source information.

resp_info = data["user_resp_1"][0]
resp_info.shape
(26,)
resp_info[0].shape, resp_info[1].shape
((432, 1), (134, 1))
np.unique(resp_info[0], return_counts=True)
(array([1, 2, 3, 4, 5], dtype=uint8), array([ 96,  16,  95, 102, 123]))
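The alignment between features, source labels, and response labels can be verified per participant, and the two label arrays can be cross-tabulated. The sketch below uses small made-up arrays. The helper names `check_alignment` and `cross_tab` are assumptions, not part of the package.

```python
import numpy as np

def check_alignment(features, source, resp):
    """Raise if any participant's arrays disagree on the number of trials."""
    if not (len(features) == len(source) == len(resp)):
        raise ValueError("participant counts differ")
    for i, (x, s, r) in enumerate(zip(features, source, resp)):
        if not (x.shape[0] == s.shape[0] == r.shape[0]):
            raise ValueError(f"participant {i}: trial counts differ")

def cross_tab(source, resp, n_codes=5):
    """Count trials per (source, response) pair; codes run from 1 to n_codes."""
    # Cast to int before subtracting so uint8 values cannot wrap around.
    rows = np.ravel(source).astype(int) - 1
    cols = np.ravel(resp).astype(int) - 1
    table = np.zeros((n_codes, n_codes), dtype=int)
    np.add.at(table, (rows, cols), 1)
    return table

# Made-up data for two participants with 4 and 2 trials.
features = [np.zeros((4, 72)), np.zeros((2, 72))]
source = [np.array([[1], [1], [2], [5]], dtype=np.uint8),
          np.array([[3], [4]], dtype=np.uint8)]
resp = [np.array([[1], [3], [5], [4]], dtype=np.uint8),
        np.array([[2], [4]], dtype=np.uint8)]

check_alignment(features, source, resp)   # passes silently when aligned
table = cross_tab(source[0], resp[0])     # rows: source code, columns: response code
print(table)
```

Row i and column j of `table` count the trials with source code i + 1 and response code j + 1, which shows how many observations fall into each combined category.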

Machine Learning Problem – Features and Labels

Replicating the primary results has been the main focus of this research project. We may frame the classification problem as follows: can we use an interpretable algorithm to discriminate between different categories of trials based on the EEG features? In doing so, we hope to explore different mechanisms of memory retrieval.

The features, \(\mathbb{X}\), are data["user_feat_1"]. Because each observation is identified by both a source and a response variable, we need a mechanism to filter out labels that are not of interest and to combine several smaller labels into a larger class.
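One possible shape for such a mechanism is sketched below. It is not the package's implementation: the function `build_xy`, the `class_pairs` argument, and the particular grouping of (source, response) pairs are all hypothetical and chosen purely for illustration, and the data are made up.

```python
import numpy as np

def build_xy(features, source, resp, class_pairs):
    """Keep trials whose (source, response) pair belongs to one of the classes.

    class_pairs maps a non-negative class id to a set of
    (source_code, response_code) pairs. Trials matching no class are dropped.
    """
    source = np.ravel(source).astype(int)
    resp = np.ravel(resp).astype(int)
    y = np.full(source.shape[0], -1)  # -1 marks "not of interest"
    for class_id, pairs in class_pairs.items():
        for s, r in pairs:
            y[(source == s) & (resp == r)] = class_id
    keep = y >= 0
    return features[keep], y[keep]

# Hypothetical grouping: several small labels merged into two larger classes.
class_pairs = {
    0: {(1, 1), (1, 2)},  # SC with RS or RO
    1: {(2, 4), (2, 5)},  # CR with MN or SN
}

# Made-up data for one participant: 6 trials, 2 features.
features = np.arange(12, dtype=float).reshape(6, 2)
source = np.array([[1], [1], [2], [2], [3], [4]], dtype=np.uint8)
resp = np.array([[1], [2], [4], [5], [3], [4]], dtype=np.uint8)

X, y = build_xy(features, source, resp, class_pairs)
print(X.shape, y)  # (4, 2) [0 0 1 1]
```

Keeping the class definition in a plain dictionary means a different comparison only requires changing `class_pairs`, not the filtering code.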