Once the thickness maps are available for each subject, statistical analysis can be performed across the entire cortex or for specific regions. Currently no cortical segmentation tool is available, so the only regional analysis that is actively supported is mean thickness.
The statistical framework is built upon R, though a few pieces of code are available only in MATLAB.
This page serves two functions: it provides links to the man pages of the individual analysis functions, and it gives an overview of what is currently possible.
The purpose of cortical statistics is to allow for hypothesis testing at every vertex of the surface. For each analysis both input data and a series of independent variables are required. The simplest case is testing whether cortical thickness is related to group on either a regional or global level, though more advanced multivariate models are certainly possible.
The input data contains one line per subject. In each line the first column is the filename of the cortical thickness output, and subsequent columns hold the independent variables. The very first line of the input file is optionally a header for all of the subsequent columns. A sample file would thus look like the following:
Filename Intercept Group Age
/somedir/someplace/subject001_thickness.txt 1 Normal 23
/somedir/someplace/subject002_thickness.txt 1 Patient 25
/somedir/someplace/subject003_thickness.txt 1 Patient 22
/somedir/someplace/subject004_thickness.txt 1 Normal 20
Some notable elements:

* The first column of each data line must be the path to that subject's thickness file.
* The example includes an explicit Intercept column of ones.
* The header line is optional (see the header argument to mni.read.glim.file below).
* Categorical predictors, such as Group, can be entered as text labels.
The usual statistical run proceeds in a series of ordered steps:
The first step is to start the R statistical environment by typing "R" in a shell; after a short delay R should start.
The second step is to load the library for dealing with cortical statistics. This is accomplished with the following invocation:
> library(mni.cortical.statistics)
The input matrix is prepared in a simple text file as described above. It is then loaded with the command mni.read.glim.file:
> gf <- mni.read.glim.file(filename)
There are two optional boolean arguments to mni.read.glim.file, namely header and fill. Both default to FALSE; set header=TRUE if the first line of the input file contains a data header, and set fill=TRUE if the input file may contain missing values.
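For example, to read the sample file shown above, which includes a header line (the path here is hypothetical):

> gf <- mni.read.glim.file("/somedir/someplace/thickness_glim.txt", header=TRUE)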
The next step in the analysis is to define the hypothesis and to express it in terms of a linear model. Any variable from the glim file can take the place of the usual beta parameters (if no header was provided in the glim file, the columns are assigned the names V1, V2, V3, ... Vn). So, for example, given the file above, to test the effect of age the following formula would be used:
y ~ Age
Or, to test the effects of Group and Age together, the following:
y ~ Group + Age
In all these examples, y takes the place of the metric to be estimated, usually cortical thickness.
An often interesting question is to test mean cortical thickness against the predictors. This would be done in the following way:
> ms <- mni.mean.statistics(gf, 'y ~ Age')
> summary(ms)
The key call here is to mni.mean.statistics, which in its basic form takes two parameters: the glim matrix as read in by mni.read.glim.file, and a symbolic version of the formula. The terms tested against have to exist either in the general R environment or in the glim file. The function returns a linear model object, which can be further interrogated by calling the summary function on it.
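Since the return value is a standard linear model object, the usual base R accessors can be used alongside summary; a brief sketch:

> anova(ms)      # ANOVA table for the fitted model
> coef(ms)       # the estimated regression coefficients
> confint(ms)    # confidence intervals on those coefficients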
Testing for an effect at each vertex is almost identical to testing for effects on the mean. The main difference is that it takes longer, and that the output is not as easily summarised by a single command. To repeat the test against age used on the mean above, the following command would be invoked:
> vs <- mni.vertex.statistics(gf, 'y ~ Age')
The only change is the replacement of mean in the function name by vertex, i.e. mni.vertex.statistics. The output is a list of the important statistical elements, such as the F- and t-statistics, R-squared values, etc., with one entry for each vertex tested. To make sense of it, this list should be written out to file and loaded into a visualisation tool.
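Before writing the results out, the components of the returned list can be examined with base R; a quick sketch:

> names(vs)             # which statistics were computed
> str(vs, max.level=1)  # one-level overview of the list structure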
Once the statistics have been run on the mean as well as each vertex, the results should be written out to file. To follow the above example, the following incantation should be used:
> mni.write.vertex.stats(vs, "output_filename.vertstats", mean.stats=ms, glim.matrix=gf)
This will place the results, in the vertstats format, into the file specified as the second argument. Specifying the mean statistics and the glim matrix is optional; if those arguments are provided, the relevant information is included in the header, which can be quite useful for recovering what was done at a later point.
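Putting all of the steps together, a minimal session (with hypothetical file names) would look like this:

> library(mni.cortical.statistics)
> gf <- mni.read.glim.file("thickness_glim.txt", header=TRUE)
> ms <- mni.mean.statistics(gf, 'y ~ Age')
> vs <- mni.vertex.statistics(gf, 'y ~ Age')
> mni.write.vertex.stats(vs, "age_effect.vertstats", mean.stats=ms, glim.matrix=gf)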
The example session described above deals with the most common usage of the statistical environment. There are, however, several more advanced topics dealing with more intricate statistics. In no particular order, the topics covered below are: interaction terms, comparing models, correction for multiple comparisons, and power analysis.
An interaction term, as used here, refers to a predictor that has to be modified in some way before use in the formula. The most common use is for nonlinear terms. A second-order formula, for example, would look like this:
y ~ Age + I(Age^2)
Basically, the I expression causes whatever is inside to be interpreted arithmetically rather than symbolically.
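For contrast, without I() the same operators are read symbolically; standard R formula syntax treats them as model-building operators rather than arithmetic:

y ~ (Age + Group)^2    # symbolic: expands to Age + Group + Age:Group
y ~ Age + I(Age^2)     # arithmetic: Age plus its literal square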
Sometimes it is desirable to compare the efficacy of two separate formulas: for example, where in the cortex is a second-order equation superior to a simple first-order one? The function mni.vertex.compare.models performs an ANOVA and returns an F-value at every vertex. It is used like so:
> mni.vertex.compare.models(gf, 'y ~ Age', 'y ~ Age + I(Age^2)', dt)
where gf is the glim file as read by mni.read.glim.file and dt is the data table as created by mni.build.data.table.
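A full invocation might therefore look as follows; the single-argument call to mni.build.data.table is an assumption on our part, so consult its man page for the exact signature:

> dt <- mni.build.data.table(gf)   # assumed to take the glim matrix; see the man page
> f <- mni.vertex.compare.models(gf, 'y ~ Age', 'y ~ Age + I(Age^2)', dt)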
One of the issues that any user of the cortical statistics package will have to address is that of the correction for multiple comparisons. Briefly, the problem is that in the normal statistical run over 40,000 individual linear models will be computed. If the standard threshold of alpha = 0.05 is used, that then means that on average over 2000 vertices will be considered significant even if the data is purely random. We currently have two ways of computing a sensible threshold in the face of multiple comparisons: Random Field Theory and the False Discovery Rate.
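The arithmetic behind that expectation is easy to check in R itself:

> 40000 * 0.05    # expected false positives under the null: 2000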
Random Field Theory is the more stringent of the two methods, as it tries to ensure, to within a certain confidence level, that no false positives at all lie above the chosen threshold. At this point in time it is also the more difficult of the two to compute, as the procedure involves several MATLAB scripts run outside of the R environment.
Computing the False Discovery Rate (FDR) is a whole lot simpler, as it is entirely incorporated into the statistics package. First of all, the goal of the FDR is different: the threshold gives you the expected proportion of false positives in your results. I.e., if you use a threshold of 0.05 and 1000 vertices end up above it, then on average 50 of those are false positives. A corresponding array of q values is also computed for the t-statistics in your results. The q values can be roughly interpreted as the probability of that vertex being a false positive. You can then simply threshold the data by setting the maximum q value that is acceptable to you. To explicitly get a threshold, this invocation will work:
> q <- mni.compute.FDR(t.stats, df=degrees.of.freedom, fdr=0.05)
> q$fdr.threshold
where t.stats was computed using mni.vertex.statistics (though this method is generic and can take any set of t-stats or even p-values ... see the manual), df is the degrees of freedom in the data, and fdr is the desired threshold. See the man page for mni.compute.FDR.
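A sketch of thresholding the vertex results follows; the assumption that the threshold applies two-sidedly to the magnitude of t is ours, so verify it against the man page:

> q <- mni.compute.FDR(t.stats, df=degrees.of.freedom, fdr=0.05)
> surviving <- abs(t.stats) >= q$fdr.threshold   # assumes a two-sided test on |t|
> sum(surviving)                                 # number of vertices passing the threshold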
It is often useful to determine the statistical power of the cortical statistics methods for a given population. Basically, what is needed is the standard deviation of either the mean thickness or the thickness at each vertex. Given that, and either the desired change to be recaptured or the number of subjects in each group, one can determine the power of the technique. The function used is mni.vertex.power.analysis; see its man page for the necessary parameters.
The results returned contain both the n needed to recapture the specified delta and the delta that can be obtained given two groups of n subjects.