Association analysis
Samplelevel permutationbased association testing
Whereas most Luna commands operate on EDFs in a serial manner, this command instead operates at the sample level, not based on the original EDF/annotations but rather on derived metrics (i.e. the output of prior Luna commands). As such, association analyses in Luna do not require a samplelist as input.
Command  Description 

CPT 
Apply clusterbased permutation to one or more sets of metrics 
CPT
Clusterbased permutation analysis
This command fits a set of linear models, using a permutationbased approach that allows for covariates, to generate pointwise empirical significance values, as well as familywise corrected empirical pvalues.
It additionally employs a simple clustering heuristic to find groups of adjacent predictors, and to evaluate the evidence for association over these clusters, based on the sum of test statistics. This command is set up to define clusters with respect to three types of adjacency:

by frequency (e.g. power for 4.5 Hz and 5.0 Hz might be tested jointly)

by spatial location (e.g.
CZ
andC1
might be considered nearby, and so tested jointly) 
by pairs of channels (e.g. for connectivity measures, the pair
C3F1
might be considered adjacent toCZFZ
) 
by time (e.g. for perievent data)
These stratifiers are automatically determined by the presence of F
,
CH
or CH1
and CH2
or T
columns in the dependent variable files.
The idea behind clusterbased analyses nonparametric analysis (outlined here) is that power might be increased by searching for more modest effects that span "similar" predictors, with respect to topography or frequency.
This command can accept multiple types of DV in the same analysis:
e.g. spindle density (DENS
) as well as amplitude (AMP
). With
respect to multiple test correction (the PC
and clusterbased
empirical pvalues below), this analysis will correct for both sets of
values (accounting for any correlation between them). However,
clustering only happens within a particular class of variable. That
is, if each mesaure is present for 64 EEG channels, clusters of
topographically adjacent channels may be formed for DENS
, and
separate clusters may be formed for AMP
, but no cluster would
contain both DENS
and AMP
measures (i.e. because there is no
general way to specify the adjacency of different classes of
variable).
The approach to permutation with nuissance variables is the FreedmanLane method and follows an implementation described here.
Hint
For correctlyformated inputs, the CPT
command can be used to analyse any types of numeric inputs,
whether they come from prior Luna commands or not.
Parameters
Primary parameters to specify the data and any outlier actions for the dependent variables:
Option  Example  Description 

ivfile 
demo.txt 
Name of a single, tabdelimited text file containing the primary independent variable and other covariates 
iv 
DIS 
The primary IV (assumed to be a column in the ivfile ) 
covar 
AGE,SEX 
Covariates, coded numerically (binary 0/1 or realvalued, assumed to be columns in ivfile ) 
dvfile 
spec.txt,psd.txt 
One or more dependent variable files, in longformat (see below) 
dv 
DENS,AMP 
The name of one or more DVs (assumed to be columns in the dvfile set 
alldvs 
Use all DVs from the DV files (equivalent to dv=* ) 

th 
5 
SD units for individuallevel DV outlier removal (note: casewise deletion) 
winsor 
0.02 
Proportion (e.g. 2% here) for Winsorizing the DV (no outlier removal) 
Parameters for the (clusterbased) permutation:
Option  Example  Description 

nreps 
1000 
Number of permutations to perform 
clocs 
clocs.txt 
File containing channel location information, described here 
thspatial 
0.5  Threshold for defining adjacent channels (Euclidean distance, 0 to 2) 
thfreq 
1  Threshold for defining adjacent frequencies (Hz) 
thtime 
0.5  Threshold for defining adjacent timepoints (seconds) 
thcluster 
2  Absolute value of tstatistic for inclusion in a cluster 
Secondary parameters
Option  Example  Description 

abs 
Take the absolute value of all DVs  
dB 
Take the log of all DVs  
flwr 
0.5  Ignore values for frequencies below 0.5 Hz 
fupr 
25  Ignore values for frequencies above 25 Hz 
completeobs 
Instead of casewise dropping individuals with missing data, flag an error  
exids 
id001,id002 
List of individual IDs to exclude 
incids 
@{include.txt} 
List of individual IDs to include 
1sided 
Assume a 1sided test (that B > 0 ) 
Outputs
Variablelevel output (strata: VAR
):
Variable  Description 

CH 
Channel name 
CH1 
First channel (for variables stratified by channelpairs) 
CH2 
Second channel (for variables stratified by channelpairs) 
F 
Frequency (Hz) (for variables stratified by frequency) 
T 
Time (e.g seconds) (for variables stratified by time) 
B 
Beta from linear regression 
STAT 
The correspondong tstatistic 
PU 
Uncorrected empirical significance value 
PC 
Familywise corrected empirical significance value 
CLST 
For variables assigned to a nominallysignificant cluster (P<0.05), the cluster number K (else 0) 
Clusterlevel output (strata: K
)
Variable  Description 

N 
Number of variables in this cluster 
P 
Empirical significance value 
SEED 
Seed variable (most significant) 
Clustermember output (strata: K
x M
)
Variable  Description 

VAR 
Variable name, i.e. member M of cluster K 
Example
This command requires that both dependent variable (i.e. Luna sleep
metrics) and the independent variable (i.e. demographics, disease
state, etc) have a column labeled ID
(casesensitive) to link
individuals across files.
The command expects metrics in the long format, as would come from the r
option of destrat for example:
ID CH F PSD
subj01 AF3 0.5 393.263017810112
subj01 AF3 0.75 483.725395908896
subj01 AF3 1 359.270993287664
subj01 AF3 1.25 248.829826621191
subj01 AF3 1.5 161.100179146164
subj01 AF3 1.75 103.629116449848
subj01 AF3 2 67.5148408348248
subj01 AF3 2.25 49.2884864501349
subj01 AF3 2.5 38.0206381125483
...
This file has 1,029,990 rows: one for each individual, channel and frequency combination.
The file descr.txt
has 130 rows, containing the descripive statistics (independent variables):
ID C1 C2 X1
subj01 1 1 28
subj03 1 0 43
subj04 1 1 33
subj05 1 0 24
subj06 1 1 35
subj07 1 1 31
subj09 1 1 31
subj11 1 1 46
subj12 1 0 28
...
The CPT command is invoked as follows, piping the arguments into Luna via standard input (here from the shell echo
command).
echo 'ivfile=descr.txt iv=X1 covar=C1,C2
dvfile=n2spec.txt dv=PSD dB=PSD
nreps=1000
clocs=~/luna/clocs
thspatial=0.5 thfreq=0.5 thcluster=2
fupr=20 '  luna cpt o out.db
This specifics a set of regressions of the IV X1
in the file
descr.txt
, with covariates C1
and C2
. The dependent variable(s)
are the values of PSD
from the file n2spec.txt
. These DVs will
be logtransformed (due to the dB
option). We further specify 1000 permutations,
with clustering based on channel locations in the file clocs
and the given set of
clusterdefining thresholds. We restrict the analysis to power values for frequencies under 20 Hz
(i.e. if the file n2spec.txt
might contain higher values).
Only people present in both the IV and DV files will be included in
analysis (based on the ID
field). The order of individuals need not
be the same across files.
The output written to the console is as follows:
===================================================================
+++ luna  v0.25.5, 31Mar2021  starting 20May2021 14:15:07 +++
===================================================================
read 130 people from sample_descr.txt (of total 130 data rows)
reading metrics from n2spec.txt
converting input files to a single matrix
found 130 rows (individuals) and 4503 columns (features)
identified 0 of 130 individuals with at least some missing data
finished making regular data matrix on 130 individuals
final datasets contains 4503 DVs on 130 individuals, 1 primary IV(s), and 2 covariate(s)
read 64 channel locations from ~/luna/clocs
defining adjacent variables...
on average, each variable has 25.348 adjacencies
0 variable(s) have no adjacencies
running permutations, assuming a twosided test...
found 199 clusters, maximum statistic is 107.218
.......... .......... .......... .......... .......... 50 perms
.......... .......... .......... .......... .......... 100 perms
...
.......... .......... .......... .......... .......... 950 perms
.......... .......... .......... .......... .......... 1000 perms
2 clusters significant at corrected empirical P<0.05
all done.

+++ luna  finishing 20May2021 14:15:25 +++
===================================================================
Based on the constraints, there are 4,503 power values per individual (i.e. frequency values from 0.5 Hz to 20 Hz in 0.25 Hz bins gives 79 bins; for 57 EEG channels (as this dataset contains) this yields 79 * 57 = 4,503 values per individual).
Taking advantage of the Eigen library for matrix operations, the implementation is relatively efficient: including loading all the data, and forming the adjacency map for the 4,503 measures, the above command fits just under 4.5 million linear models (each one for N=130 individuals) including the original data and the 1000 null permutations, but the entire analysis takes approximately 17 seconds running on a single macOS laptop.
The output notes that on average, each variable has about 25 'adjacent' neighbours (in both frequency and topographical space); no variables have 0 adjacent points. Overall, 199 clusters are extracted from the original data, given a minimim tstatistic threshold of 2.
The pointwise (i.e. variablebyvariable) output can be extracted as follows:
destrat out.db +CPT r VAR
ID VAR B CH CLST F PC PU STAT
. AF3~0.5~PSD 0.07959794 AF3 0 0.5 0.68131 0.02297 2.339
. AF3~0.75~PSD 0.09944737 AF3 0 0.75 0.09191 0.00299 3.426
. AF3~1~PSD 0.08683515 AF3 0 1 0.28371 0.00299 2.923
. AF3~1.25~PSD 0.07991619 AF3 0 1.25 0.47152 0.00899 2.614
. AF3~1.5~PSD 0.07243292 AF3 0 1.5 0.58441 0.01198 2.460
. AF3~1.75~PSD 0.06208055 AF3 0 1.75 0.75424 0.02497 2.225
. AF3~10~PSD 0.05459445 AF3 0 10 0.99600 0.12387 1.548
. AF3~10.25~PSD 0.05860681 AF3 0 10.25 0.98901 0.098909 1.642
Hint
The ID
field in the output of the CPT command is always .
, i.e. as this reflects samplelevel output rather than any one observation.
Also note that outputs in later versions of Luna will now include a T
field as the timestratifier, and the VAR
labels are expanded to reflect this additional extra stratifier.
The variable names (under VAR
) are automatically constructed from
the specified dv
variables (i.e. PSD
here) and any frequency and
channel stratifiers, with the ~
character separating them. To ease
parsing, the output also includes the channel (CH
) and frequency
(F
) associated with each variable.
Note that Luna's generic output mechanism sorts by the VAR
stratifier, which is alphanumeric rather than numeric. This means the order of rows
may not be in strictly increasing numeric order (i.e. 1 , 10 , 11 , 2 , 3 , ... ). To facilitate plots etc, you can simply sort the table by F
, e.g.
if in the R package:
d < d[ order(d$F) , ]
We'll use the lunaR ltopo.xy()
routine to produce a plot of these data, first defining an additional variable as the signed log10 of the empirical pvalue:
d$S < sign( d$B ) * log10( d$PU )
The Luna graph command:
ltopo.xy( c=d$CH, x=d$F, y=d$S, z=d$S, pch=20, col=rbpal, cex=0.4, xline=15, yline=c(2,0,2), y.symm=T)
Directly examining the PC
column, we see there are 46 variables significant at 0.05 after correction:
table( d$PC < 0.05 )
table( d$F[ d$PC < 0.05 ] , d$CH[ d$PC < 0.05 ] )
C1 C2 CP1 CP2 CP3 CPZ CZ F1 F2 F4 FC1 FC2 FC3 FC4 FCZ Fp1 Fp2 FPZ FZ P2 POz PZ
0.5 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
0.75 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
1 1 1 0 0 0 1 1 0 0 0 1 1 0 0 1 1 1 0 0 1 0 1
1.25 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
3.25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
3.5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
15.25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
15.5 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
15.75 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16.25 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
The console log notes there are two clusters significant at the empirical 0.05 level (after correction for all multiple testing, i.e. based on comparing the maximum of the cluster sumstatistics in the observed versus the permuted data). The pvalues and seed variables can be obtained as follows:
destrat out.db +CPT r K
ID K N P SEED
. 1 30 0.04495 CP4~16~PSD
. 2 33 0.01498 POz~15.25~PSD
(which note, does not include a cluster around 1 or 3 Hz). The members of each cluster can be found as follows:
destrat out.db +CPT r K M
ID K M VAR
. 1 1 C4~15.5~PSD
. 1 2 C4~15.75~PSD
. 1 3 C4~16.25~PSD
. 1 4 C4~16.5~PSD
. 1 5 C4~16~PSD
. 1 6 CP2~15.5~PSD
. 1 7 CP2~15.75~PSD
. 1 8 CP2~16.25~PSD
. 1 9 CP2~16.5~PSD
. 1 10 CP2~16~PSD
...
CLST
variable in the pointwise output (r VAR
).
Warning
Further guidance on best practice on cluster thresholds, and implementing alternative methods to more intelligently select adaptive thresholds will be implemented in future releases. As is, please consider the clustering of the CPT as a secondary feature that has not been optimized in Luna yet: the primary application is to efficiently generate pointwise, corrected empirical pvalues.