import numpy as np
import scipy.io
Data Format
The example data is stored in the ../data directory. Let’s load one file and inspect its format.
file_path = "../data/data_CRMN_vs_MMN_imbalLDA_order_proj_1.mat"
data = scipy.io.loadmat(file_path)
data.keys()
dict_keys(['__header__', '__version__', '__globals__', 'user_class_min_1', 'user_feat_1', 'user_prob_1', 'user_resp_1', 'user_source_1', 'user_tr_order_1', 'user_train_prob_1', 'user_weights_1'])
The source code that generated this data file can be accessed at this link
Variables of Interest
tr_num
- Trial Number
user_feat_{tr_num}
- Feature engineered data for each user.
user_source_{tr_num}
- Source Information
user_resp_{tr_num}
- Response Information
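Because every variable name follows the same pattern, the keys for a given trial number can be built programmatically. The following is a minimal sketch; the helper name variable_keys is hypothetical and not part of the project code.

```python
def variable_keys(tr_num: int) -> dict:
    """Build the .mat variable names for one trial number (hypothetical helper)."""
    return {
        "features": f"user_feat_{tr_num}",
        "source": f"user_source_{tr_num}",
        "response": f"user_resp_{tr_num}",
    }


keys = variable_keys(1)
print(keys["features"])  # user_feat_1
```

With the file loaded as data, data[keys["features"]][0] would then select the per-participant feature arrays.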
User Features
user_features = data["user_feat_1"][0]
user_features.shape
(26,)
The shape (26,) indicates that this trial (tr_num = 1) contains data from 26 participants.
user_features[0].shape, user_features[1].shape
((432, 72), (134, 72))
user_features[0]
array([[ -7.23661128, -13.93739628, -22.81788437, ..., -9.40298223,
-11.77818306, -18.10076694],
[ -2.63865315, 0.3358343 , -1.92867504, ..., 14.19302948,
9.98168734, 17.5732571 ],
[ -4.28406267, -9.36639654, -16.71320915, ..., -2.0318536 ,
-3.8530683 , 3.86939731],
...,
[ -3.88435706, -4.79033675, -5.75699235, ..., 14.81179504,
17.04872137, 13.36880608],
[-13.37804033, -7.12282845, -1.5593848 , ..., 11.92473593,
15.36710888, 26.49037254],
[ 0.18273821, -0.71073707, -6.72626149, ..., -3.00038489,
-4.9075405 , -15.84218536]])
Each participant has a different number of trials, but every trial has the same number of features (the second dimension, 72). Detailed information is documented in the ALL_DATA_1.mat file. (TODO)
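The claim that trial counts vary while the feature count stays fixed can be checked across all participants at once. The sketch below uses a synthetic stand-in with the two shapes shown above instead of the real file; summarize_participants is a hypothetical helper name.

```python
import numpy as np


def summarize_participants(user_features):
    """Return per-participant trial counts and the set of feature counts."""
    n_trials = [int(x.shape[0]) for x in user_features]
    n_features = {int(x.shape[1]) for x in user_features}
    return n_trials, n_features


# Synthetic stand-in mimicking the object array returned by loadmat
# (random values, NOT the real EEG features).
rng = np.random.default_rng(0)
demo = np.empty(2, dtype=object)
demo[0] = rng.normal(size=(432, 72))
demo[1] = rng.normal(size=(134, 72))

n_trials, n_features = summarize_participants(demo)
print(n_trials, n_features)  # [432, 134] {72}
```

If the set of feature counts contains more than one value, the participants cannot be stacked into a single feature matrix without further processing.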
Each observation in the data has an associated source and response label.
Encoding Table
Source Information
Source information is a single-column array that contains one numeric code per observation. The encoding is given in this table:
Encoding Number | Full Description | Abbreviation |
---|---|---|
1 | Source Correct | SC |
2 | Correct Rejection | CR |
3 | Source Incorrect | SI |
4 | Miss | Miss |
5 | False Alarm | FA |
Response Information
Response information is a single-column array that contains one numeric code per observation. The encoding is given in this table:
Encoding Number | Full Description | Abbreviation |
---|---|---|
1 | Remember Source | RS |
2 | Remember Other | RO |
3 | Familiarity | F |
4 | Maybe New | MN |
5 | Sure New | SN |
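The two encoding tables can be kept in code as plain dictionaries, which makes the numeric codes readable when inspecting data. This is a small sketch; the names SOURCE_LABELS, RESPONSE_LABELS, and decode are hypothetical.

```python
# Numeric code -> abbreviation, copied from the two encoding tables.
SOURCE_LABELS = {1: "SC", 2: "CR", 3: "SI", 4: "Miss", 5: "FA"}
RESPONSE_LABELS = {1: "RS", 2: "RO", 3: "F", 4: "MN", 5: "SN"}


def decode(source_code, resp_code):
    """Translate one (source, response) code pair into abbreviations."""
    return SOURCE_LABELS[int(source_code)], RESPONSE_LABELS[int(resp_code)]


print(decode(2, 4))  # ('CR', 'MN')
```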
Data Shapes and Trials
There are 26 participants in this trial (tr_num = 1)
source_info = data["user_source_1"][0]
source_info.shape
(26,)
The first and the second participant have 432 and 134 observations respectively
source_info[0].shape, source_info[1].shape
((432, 1), (134, 1))
The source information is coded with integers from 1 to 5
np.unique(source_info[0], return_counts=True)
(array([1, 2, 3, 4, 5], dtype=uint8), array([131, 120, 48, 105, 28]))
The shape of the response labels should match that of the source information
resp_info = data["user_resp_1"][0]
resp_info.shape
(26,)
resp_info[0].shape, resp_info[1].shape
((432, 1), (134, 1))
np.unique(resp_info[0], return_counts=True)
(array([1, 2, 3, 4, 5], dtype=uint8), array([ 96, 16, 95, 102, 123]))
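Since source and response labels are aligned observation by observation, counting each (source, response) pair shows how the two codings combine. The sketch below runs on a small synthetic example, not the real data; cross_tab is a hypothetical helper, and it assumes codes run from 1 to 5 as in the tables above.

```python
import numpy as np


def cross_tab(source, resp, n_codes=5):
    """Count observations for every (source, response) code pair.

    Row i holds source code i + 1; column j holds response code j + 1.
    """
    source = np.asarray(source).ravel().astype(int)
    resp = np.asarray(resp).ravel().astype(int)
    if source.shape != resp.shape:
        raise ValueError("source and response labels are not aligned")
    table = np.zeros((n_codes, n_codes), dtype=int)
    # np.add.at accumulates correctly when an index pair repeats.
    np.add.at(table, (source - 1, resp - 1), 1)
    return table


# Synthetic column vectors shaped like the (n, 1) uint8 arrays in the file.
demo_source = np.array([[2], [2], [4], [1]], dtype=np.uint8)
demo_resp = np.array([[4], [5], [4], [1]], dtype=np.uint8)

table = cross_tab(demo_source, demo_resp)
print(table[1])  # counts for source code 2 (CR): [0 0 0 1 1]
```

On the real data, the call would be cross_tab(source_info[0], resp_info[0]) for the first participant.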
Machine Learning Problem – Features and Labels
Replicating the primary results has been the main focus of this research project. We can frame the classification problem as follows: can an interpretable algorithm discriminate between different categories of trials based on the EEG features? In doing so, we hope to explore different mechanisms of memory retrieval.
The features, or \(\mathbb{X}\), are data["user_feat_1"]. Since each observation is identified by both a source and a response variable, we need a mechanism to filter out labels that are not of interest and to combine several smaller labels into a larger class.
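One possible form for such a mechanism is a mapping from each target class to the (source, response) code pairs it should absorb, with every other observation dropped. The sketch below is illustrative only: build_dataset and the class definitions in demo_classes are hypothetical and are not the project's actual class definitions, and the data is synthetic.

```python
import numpy as np


def build_dataset(features, source, resp, class_defs):
    """Keep only observations whose (source, response) pair is listed in
    class_defs, and relabel them with the corresponding class id."""
    features = np.asarray(features)
    source = np.asarray(source).ravel()
    resp = np.asarray(resp).ravel()
    y = np.full(source.shape[0], -1, dtype=int)  # -1 marks "not of interest"
    for class_id, pairs in class_defs.items():
        for s, r in pairs:
            y[(source == s) & (resp == r)] = class_id
    keep = y >= 0
    return features[keep], y[keep]


# Hypothetical class definitions: class id -> list of (source, response) pairs.
demo_classes = {
    0: [(2, 4), (2, 5)],  # CR answered MN or SN, merged into one class
    1: [(4, 4), (4, 5)],  # Miss answered MN or SN, merged into one class
}

# Synthetic data: 6 observations with 2 features each.
demo_X = np.arange(12).reshape(6, 2)
demo_source = np.array([[2], [2], [4], [1], [4], [5]], dtype=np.uint8)
demo_resp = np.array([[4], [5], [4], [1], [5], [4]], dtype=np.uint8)

X, y = build_dataset(demo_X, demo_source, demo_resp, demo_classes)
print(X.shape, y)  # (4, 2) [0 0 1 1]
```

Keeping the class definitions in a single dictionary means a different contrast can be tried by editing the mapping without touching the filtering logic.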