AUC considered harmful

intermediate

signal processing

medical statistics

clinical studies

Author

Dieter Menne

Published

November 18, 2024

The title has been stolen from Dijkstra.

Papers submitted to medical journals below the top rank are most often reviewed by medical reviewers who have used the Wilcoxon test as a catch-all in their active days. In order to sneak a paper through their phalanx, it is best to garnish it generously with:

p-values
Wilcoxon tests (t-test: “The authors should test for normality”)
p-values
Bland-Altman plots
p-values
Area under curve (AUC)
p-values

Other modernistic stuff will delay the review process: “Authors should supplement their Bayesian analysis with non-random methods that yield correct p-values”. There is nothing wrong with Wilcoxon tests and Bland-Altman plots, and the AUC mentioned above sits on the border of semi-professionalism, so it is worth a look. Anything else I forgot to mention?

End of soapbox, let’s get serious

The AUC (area under the curve) or the AOB (signed area over baseline) is popular because it is easily calculated: for the case of equidistant points, it is the average over all values multiplied by the record duration. Or, equivalently, the AUC is the area of the equivalent rectangle shown as the shaded area below, with a height equal to the mean volume and a width equal to the record duration. There are better approximations for the AUC (Gabrielsson and Weiner (2006), p. 161), but that does not change the validity of the arguments that follow.

The area under curve is the mean of the raw data multiplied by the duration
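To make the equivalent rectangle concrete, here is a minimal sketch that compares "mean times duration" with the trapezoidal rule. The volume curve and the function names `auc_rectangle` and `auc_trapezoid` are invented for illustration; they are not taken from any package.

```python
import math


def auc_rectangle(values, duration):
    """Area of the equivalent rectangle: mean height times record duration."""
    return sum(values) / len(values) * duration


def auc_trapezoid(times, values):
    """Trapezoidal rule, one of the slightly better approximations."""
    return sum(
        (t1 - t0) * (v0 + v1) / 2.0
        for t0, t1, v0, v1 in zip(times, times[1:], values, values[1:])
    )


# Hypothetical volume curve in ml, sampled once per minute for two hours
times = [float(t) for t in range(0, 121)]
volumes = [100.0 * (t / 20.0) * math.exp(1.0 - t / 20.0) for t in times]
duration = times[-1] - times[0]

rect = auc_rectangle(volumes, duration)
trap = auc_trapezoid(times, volumes)
print(f"mean x duration: {rect:.0f} ml*min")
print(f"trapezoid:       {trap:.0f} ml*min")
```

With dense, equidistant sampling the two numbers agree to within a couple of percent; the difference comes only from how the first and last point are weighted.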

The average over all data points recorded is moderately informative. Assuming the vertical axis shows a volume, it has units of ml, which is easy to understand, but the dependence on the arbitrary recording duration is awkward. Should the study result depend on the time the patient or experimenter is willing to sit around and wait?
The record duration is mostly arbitrary. It may differ between records, e.g. when we have to stop recording before the target time.

The AUC is highly biased when used naively; i.e. when calculated over an arbitrary data set duration.

The AUC is very useful when computed up to very large times, when the volume has gone back to baseline and no further contribution to the area is to be expected. This is illustrated in the curves below, which are similar to the PDR curves recorded in ¹³C breath tests. For all time series shown, the total area under the curve up to infinity has been chosen to be the same. The AUC is a measure of bioavailability (Gabrielsson and Weiner 2006), which is the chief reason why it is the prime parameter of interest in non-compartmental pharmacokinetics.


The AUC computed from the recorded data range is biased in the left panel, but is a good approximation of the total area in the right panel.
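The size of the bias can be sketched with a curve whose total area is known. The model f(t) = t/tau² · exp(-t/tau) is an assumption chosen only because it integrates to 1 from zero to infinity for every tau; it is not the function behind the plots above, and `auc_fraction` is a made-up helper.

```python
import math


def auc_fraction(record_duration, tau):
    """Fraction of the total (0 to infinity) area seen up to record_duration
    for the assumed curve f(t) = t / tau**2 * exp(-t / tau)."""
    x = record_duration / tau
    return 1.0 - (1.0 + x) * math.exp(-x)


record_duration = 240.0  # minutes, the same for all curves
for tau in (30.0, 60.0, 120.0):
    percent = 100.0 * auc_fraction(record_duration, tau)
    print(f"tau = {tau:3.0f} min: {percent:.0f}% of the total area recorded")
```

All three curves have the same total area, yet the naive AUC over a 240 minute record sees nearly all of it for the fast curve and only about 59% for the slow one.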

Record too short? Nothing lost!

It is possible to compute the AUC up to infinity even when only a limited data set is available; or, even worse, when the duration of the records is not the same between test units. First, you have to fit a model curve to your data that goes back to the reference line at infinity, and has a finite improper integral; the first condition does not imply the second. Remember high school calculus: integrated from 1 to infinity, the function 1/x does not have a finite improper integral (it diverges), but 1/x² has one; even 1/x^1.0001 has one.
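For those who skipped that calculus lesson, a small sketch using the closed-form integral of 1/x^p from 1 to an upper limit. The helper name `integral_inverse_power` is invented for this example.

```python
import math


def integral_inverse_power(p, upper):
    """Closed-form integral of 1 / x**p from 1 to upper."""
    if p == 1:
        return math.log(upper)  # grows without bound
    return (1.0 - upper ** (1.0 - p)) / (p - 1.0)


for upper in (1e2, 1e4, 1e8):
    print(
        f"upper = {upper:.0e}: "
        f"p=1 -> {integral_inverse_power(1, upper):.3f}, "
        f"p=1.0001 -> {integral_inverse_power(1.0001, upper):.3f}, "
        f"p=2 -> {integral_inverse_power(2, upper):.6f}"
    )
```

For p = 2 the value settles at 1 quickly. For p = 1.0001 the limit is finite, 1/(p - 1) or about 10000, but the approach is so slow that at an upper limit of 10⁸ it is still hard to tell apart from the diverging logarithm. Having a finite integral in theory does not mean your record is anywhere near it.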

In analysis of ¹³C breath test curves (see Bluck (2009) for a review), the Gamma function from earlier publications does not have an improper integral, while the exponential beta function has one; some examples are given here. In analysis of gastric emptying, the power exponential generally does not have an improper integral, but the linexp function has one, so the AUC can be calculated.

Once you have fitted a parametric curve to your data, the (the only, real, valid, nice!) AUC can be computed by consulting an integral table or feeding Wolfram Alpha. Note that a spline function cannot be used in this case, because the spline is a spline is a spline and thus flexible enough to bend to the data’s capriciousness, but it does not know where the volume will wander in the infinite future.
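The recipe "fit, then integrate analytically" can be sketched as follows. Everything here is an assumption made for illustration: the model y(t) = a · t · exp(-t/tau) with total area a · tau², the noise-free synthetic record, and the log-linearized least-squares fit, which stands in for the nonlinear fit one would normally use on real, noisy data.

```python
import math


def fit_log_linear(times, values):
    """Fit y = a * t * exp(-t / tau) via a straight line through
    (t, ln(y / t)); needs t > 0 and y > 0. Returns (a, tau)."""
    ys = [math.log(v / t) for t, v in zip(times, values)]
    n = len(times)
    mean_x = sum(times) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in times)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(times, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -1.0 / slope


def auc_to_infinity(a, tau):
    """Analytic integral of a * t * exp(-t / tau) from 0 to infinity."""
    return a * tau ** 2


# Synthetic record, stopped after 120 min although the curve is far from baseline
true_a, true_tau = 0.5, 90.0
times = [10.0 * i for i in range(1, 13)]
values = [true_a * t * math.exp(-t / true_tau) for t in times]

a, tau = fit_log_linear(times, values)
auc_inf = auc_to_infinity(a, tau)

# Naive trapezoidal AUC over the recorded range, starting from (0, 0)
ts, vs = [0.0] + times, [0.0] + values
auc_recorded = sum(
    (t1 - t0) * (v0 + v1) / 2.0 for t0, t1, v0, v1 in zip(ts, ts[1:], vs, vs[1:])
)
print(f"AUC over the record: {auc_recorded:.0f}")
print(f"AUC to infinity:     {auc_inf:.0f}")
```

In this synthetic case the record covers well under half of the total area, while the fitted curve recovers the full value. With real data the extrapolated AUC is only as good as the model, so check the fit before trusting the number.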

Too much to remember for the next meeting with your boss (the “we did all with Wilcox” guy)? Read the executive summary:

When you present a time series curve …

… and someone suggests “compute the AUC”, follow Nancy Reagan and just say no. It is almost always the wrong approach, even if the tests give wonderful p-values.
… if that someone insists, ask whether using the average over all measured points instead will do¹. A ‘yes’ answer is acceptable, a ‘no’ means that someone misuses poor AUC for pseudo-scientific blow-off.
… if you have measured the curve long enough until it is back to baseline, say “Sorry, will try”. With the exception of some gastric emptying studies, I have never encountered such a case, because patient and experimenters share the desire to be home in time.
… if you have fitted a parametric curve, you can often compute the AUC. It might be useful to quantify your study data, but always consider other alternatives, such as peak height or peak position, which you can also extract from the fit. And don’t try half a dozen parameters extracted from a fitted curve to pick “the significant one”; I wish more people would disappear into Nirvana when hiking the garden of forking paths of Borges and Gelman.

¹ Use a weighted mean for non-equidistant points.
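One way to read the footnote, as a sketch: weight each point by the time span it represents, which amounts to dividing the trapezoidal AUC by the record duration. The sampling times, values and the name `time_weighted_mean` are invented for illustration.

```python
def time_weighted_mean(times, values):
    """Trapezoidal AUC divided by the record duration."""
    area = sum(
        (t1 - t0) * (v0 + v1) / 2.0
        for t0, t1, v0, v1 in zip(times, times[1:], values, values[1:])
    )
    return area / (times[-1] - times[0])


# Hypothetical record with dense early and sparse late sampling (minutes, ml)
times = [0.0, 5.0, 10.0, 20.0, 40.0, 80.0]
values = [10.0, 50.0, 60.0, 40.0, 20.0, 10.0]

plain = sum(values) / len(values)
weighted = time_weighted_mean(times, values)
print(f"plain mean:    {plain:.2f} ml")     # 31.67 ml
print(f"weighted mean: {weighted:.2f} ml")  # 26.56 ml
```

The plain mean overstates the average volume here because the early, high values were sampled more densely than the long, low tail.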

Robustness can hurt

One feature of AUC is its robustness; the form of the curve does not play a role. A curve with two peaks can have the same AUC as one dropping monotonically, even if the two curves tell different stories. Robustness can be a virtue, but can also hide interesting effects; or, to tell your boss: you may lose power by reducing curve details to an area.

When you find a good model function for your interesting alternatives, you are back in power.
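How much detail an area can swallow is easy to sketch. Both curves below are invented for illustration; the two-peaked one is rescaled so that its trapezoidal AUC matches that of the monotone decay exactly.

```python
import math


def auc_trapezoid(times, values):
    """Trapezoidal AUC over the recorded range."""
    return sum(
        (t1 - t0) * (v0 + v1) / 2.0
        for t0, t1, v0, v1 in zip(times, times[1:], values, values[1:])
    )


def count_peaks(values):
    """Number of strict interior local maxima."""
    return sum(1 for a, b, c in zip(values, values[1:], values[2:]) if a < b > c)


times = [float(t) for t in range(0, 241, 5)]

# Curve 1: monotone exponential decay
decay = [math.exp(-t / 60.0) for t in times]

# Curve 2: two separate peaks at 40 and 140 min, rescaled to the same AUC
raw = [
    math.exp(-(((t - 40.0) / 15.0) ** 2)) + math.exp(-(((t - 140.0) / 25.0) ** 2))
    for t in times
]
scale = auc_trapezoid(times, decay) / auc_trapezoid(times, raw)
two_peaks = [scale * v for v in raw]

for name, curve in (("decay", decay), ("two peaks", two_peaks)):
    print(f"{name:9s}: AUC = {auc_trapezoid(times, curve):.2f}, peaks = {count_peaks(curve)}")
```

Same AUC, zero peaks versus two. A test on the area alone cannot tell these stories apart; a model with a parameter for the second peak can.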

Citations

Bluck LJC (2009) Recent advances in the interpretation of the ¹³C octanoate breath test for gastric emptying. J Breath Res 3:034002. doi: 10.1088/1752-7155/3/3/034002

Gabrielsson J, Weiner D (2006) Pharmacokinetic and pharmacodynamic data analysis: Concepts and applications, 4th ed., rev. and expanded. Swedish Pharmaceutical Press, Stockholm