Multi-parameter Fits: Theorems C and D
Guysquared
09-16-2007, 02:01 PM
Hello to all:
I would like to broach this seemingly esoteric, but actually very practical and endemic, issue: how does one usefully and correctly report confidence limits on the parameters of multi-parameter fits? Of course, for the discussion to be interesting, the parameters have to be correlated; if they are not, the model could be separated and each part treated as a single-parameter fit.
Theorem C (Chapter 14 in my NR edition), which I think of as "joint confidence intervals for multi-parameter fits," was first described convincingly to me in Lampton, Margon, and Bowyer (1976, ApJ, 208, 177). I believe it to be true that the confidence intervals for a model with N free parameters are determined by finding models whose delta chi2 follows a chi2 distribution with N degrees of freedom. So, e.g., if there are 6 parameters, the 1-sigma limits for each parameter are found within the ensemble of models whose chi2 lies within 7.1 of the global minimum chi2.
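(To make those thresholds concrete, here is a minimal sketch of where the numbers come from, assuming scipy is available; it simply evaluates the chi2 quantiles quoted above.)

    # Delta chi2 thresholds for 68.3% joint confidence regions (Theorem C):
    # the 68.3% quantile of a chi2 distribution with N degrees of freedom,
    # where N is the number of jointly estimated parameters.
    from scipy.stats import chi2

    for n in (1, 2, 6):
        print(n, round(chi2.ppf(0.683, n), 2))
    # -> 1: ~1.00, 2: ~2.30, 6: ~7.04 (the ~7.1 quoted above)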
However, the problem arises with Theorem D. It implies that if one ignores all the other parameters in a multi-parameter fit, one can treat the single remaining parameter as if the model were a one-parameter fit and define a 1-sigma confidence interval with delta chi2 = 1. To do this one steps through the "interesting" parameter and, at each step, allows the other N-1 (= 5 in the example) parameters to vary. Note, however, that the ensemble of fits and chi2 values obtained in this procedure is the same ensemble one gets by allowing all the parameters to vary; it is equivalent to making a fine grid over all 6 parameters. One gets the same ensemble of chi2 values and fits from any of these procedures: letting all parameters vary, stepping through one parameter and letting the others vary, or stepping through all the parameters.
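(A minimal sketch of that step-one-parameter, let-the-others-float procedure; chi2_func and other_start are placeholders for the actual chi2 function and the starting values of the remaining parameters.)

    import numpy as np
    from scipy.optimize import minimize

    def profile_chi2(chi2_func, grid, other_start):
        """Step the 'interesting' parameter over `grid`; at each step,
        minimize chi2 over all the remaining (nuisance) parameters."""
        prof = []
        for value in grid:
            res = minimize(lambda others: chi2_func(value, *others), other_start)
            prof.append(res.fun)
            other_start = res.x  # warm-start the next grid step
        return np.array(prof)

    # Theorem D: the 68.3% interval for the stepped parameter is the range of
    # `grid` where profile_chi2 - min(profile_chi2) <= 1, even though the joint
    # 6-parameter region (Theorem C) uses <= ~7.04.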
So, authors adopt Theorem D because it seemingly allows them to gain something for nothing: their confidence intervals shrink from the min chi2 + 7.1 ensemble to the min chi2 + 1 ensemble--a very significant "improvement." To do this, they must at best throw away all the information regarding the other parameters. But worse, authors serially apply this procedure to all 6 parameters and present a single table, in one paper, showing the delta chi2 = 1 confidence limits for each of the parameters. This is clearly wrong, since each of the confidence limits corresponds to a different ensemble of solutions, and the combined confidence interval of all the parameters corresponds to << 1-sigma of the acceptable fits.
The "best" solution of "throwing away" the information from the other parameters is also wrong because that information was use explicitly in finding the ensemble of fits. Furthermore, after one author publishes a paper on his favorite parameter alone, a second author could publish a paper on his different favorite parameter using the same model and data. Both of these papers follow Theorem D (as currently interpreted by many authors) yet they would be incompatible with each other. Can either then be true?
My contention is that Theorem D is either untrue, incorrectly used, or both. Any comments?
TLoredo
10-15-2007, 01:13 PM
Hi Guysquared-
Nice moniker for a stats discussion! Saul Teukolsky asked me to comment on
your post. I'm sorry it's taken so long (various deadlines intervened), but
here are some remarks that might help.
These remarks hold under the proviso that the assumptions stated in the text
before the theorems hold. I'll offer some comments at the end about what to do
when they don't hold.
To directly answer your summary question at the end of your post, Theorem D is
true, but *sometimes* it's incorrectly used (perhaps more often than "sometimes"
in certain research areas).
A source of confusion in using Theorem D is that a particular combination of
data-plus-model may be used to address *multiple* statistical questions.
Some of these may require simultaneous consideration of all of the model
parameters to address; others may require simultaneous consideration of
different subsets of the parameters. For an investigator with just one precise
scientific question in mind, the situation is relatively easy: find and use
the confidence region for the (subset of) parameters of interest. The problem
is that, more often than not, there are several (possibly related) scientific
questions we want to address with the data+model; moreover, other investigators
may want to address still other questions with the same data+model. The issue
then is: How can I report results that correctly address the issues I want to
address, and that enable correct re-use of the fit information to address
other, subsequent questions?
The glib (but correct!) answer is: report the full likelihood function
(equivalently, the full chisqr function). Then other investigators can make
full use of the data+model to address their questions. The problem is that
there is no simple way to quantitatively report a multi-dimensional function in
scientific literature, except in 1 or 2 dimensions when one can show a curve or
contour plot. Yet one *needs* to report *some* summary of the fit; also, one
can anticipate some uses, and provide summaries that address those uses.
Tables of single-parameter projected confidence regions are certainly a valid
such summary, provided they are properly described when published, and properly
used by readers. A particular team can't do much about the latter, but
certainly reviewers and editors should insist on the former, e.g., by
insisting on explicit descriptions of tabulated fit results, including caveats
on how they may be used. I just did this myself last month, as a reviewer for a
paper that presented a table of individual parameter estimates, without stating
how the tabulated individual regions were calculated or how they should be
interpreted/used.
A silly metaphor that might help in understanding how such results should be
understood is that of a "radio button" form linked to a table on a web page.
For example, for a three parameter model (say, spectrum temperature, power law
index, amplitude), such a summary of a fit might look like this:
68% 1-D projected confidence regions
[*]  T      [ 15 +- 3 ] keV
[ ]  alpha  [         ]
[ ]  A      [         ] photons/s
The "*" indicates a selected radio button. The key is that a radio button lets
you select *only one* of multiple objects. So in such a table, you can only
see the estimate (best value + region) for a single parameter at a time,
portraying the fact that you are only permitted to use the results from this
table for a *single* parameter. A printed publication obviously can't do this,
but perhaps the metaphor helps clarify the meaning of such regions.
Personally, if a model has more than one parameter, I consider it
scientifically irresponsible not to provide summaries that attempt to display
dependencies/correlations among the parameters. Providing a set of contours
showing projected 2-D regions is an obvious option; if one can also report
simple parameterized descriptions of such regions (e.g., via bivariate normal
distributions), that aids quantitatively accurate re-use. Such 2-D summaries
are not only helpful but *necessary* if there are questions whose consideration
depends on two parameters.
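(As a sketch of what such a parameterized 2-D summary enables: assuming the paper tabulates a best-fit pair and a 2x2 covariance matrix, a reader can reconstruct the quadratic-approximation delta chi2 surface and hence the 68.3% joint contour, delta chi2 <= 2.30.)

    import numpy as np

    def delta_chi2_2d(theta, best, cov):
        """Delta chi2 at a 2-parameter point under a bivariate-normal
        (quadratic) approximation described by best-fit values and a
        covariance matrix."""
        d = np.asarray(theta, float) - np.asarray(best, float)
        return float(d @ np.linalg.inv(cov) @ d)

    # The 68.3% joint 2-D region is the set of theta with
    # delta_chi2_2d(theta, best, cov) <= 2.30, reconstructed from the
    # tabulated summary alone.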
For joint consideration of 3 parameters, various types of 2-D renderings of 3-D
objects can help give one some intuition about the relationships, but these
don't facilitate quantitative re-use of the fit. The problem is obviously
worse in higher dimensions. There is no standard way to readily provide
quantitatively useful summaries in such cases. In the limit where Theorem D
holds perfectly, a covariance matrix does the trick. But if Theorem D is only
a reasonable approximation, it doesn't.
A concrete example might help. Consider analyzing data from an exoplanet
system that contains a single planet, with data from Doppler radial velocity
observations. A full model for the velocity curve nominally requires at least 6
parameters (e.g., orbital period, eccentricity, velocity amplitude, origin of
time, orbit orientation, and center-of-mass velocity). For many current
data sets, the assumptions underlying Theorem D do not hold (e.g., one
diagnostic is that 2-D contours often look highly asymmetric, or do not close
far from a parameter boundary). But let's assume we have plentiful data for a
particular system so that the theorem is approximately valid.
In this case, a table of 1-D projected regions could be useful for several
purposes. A theorist who claims to predict the period distribution of planets
could legitimately make use of the period estimates for many planets. Another
investigator uninterested in planets but studying stellar kinematics might even
be interested in the COM velocity estimate.
However, investigators interested in the *mass* (more realistically, m*sin(i))
of this planet (or the distribution of planet masses) must *not* use the
tabulated single-parameter estimates to calculate a mass for their study. They
need to use the 3-D profile likelihood (the statistician's term for the
likelihood as a function of a subset of interesting parameters, maximized at
each point over the nuisance parameters) for period, eccentricity, and amplitude
to construct a mass estimate (I'm assuming the host star mass is known).
Alternately, if they have access to the actual data+model, they can
reparameterize the full model so that one of the parameters is mass (e.g.,
replace period with mass). Then they can find a 1-D region for the mass
directly, using Theorem D. Both approaches will give the same result under the
conditions for which Theorem D holds.
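(A rough sketch of the first approach, under stated assumptions: the standard Doppler minimum-mass formula for m*sin(i), plus a lower-envelope reduction of a gridded 3-D profile chi2 over period, amplitude, and eccentricity into a 1-D profile over mass. The grid arrays and `chi2_3d` are hypothetical stand-ins for whatever the actual fit provides.)

    import numpy as np

    G = 6.674e-11  # SI units

    def msini(P, K, e, M_star):
        """Minimum planet mass from period P [s], velocity semi-amplitude
        K [m/s], eccentricity e, and stellar mass M_star [kg], assuming
        the planet mass is much smaller than the stellar mass."""
        return K * np.sqrt(1.0 - e**2) * (P / (2.0 * np.pi * G))**(1.0 / 3.0) \
               * M_star**(2.0 / 3.0)

    def mass_profile(mass_bins, grid_P, grid_K, grid_e, chi2_3d, M_star):
        """Collapse a 3-D profile chi2 over (P, K, e), of shape
        len(grid_P) x len(grid_K) x len(grid_e), into a 1-D profile over
        m*sin(i): in each mass bin, keep the smallest chi2 among the grid
        points whose derived mass lands in that bin (the lower envelope)."""
        P, K, e = np.meshgrid(grid_P, grid_K, grid_e, indexing='ij')
        m = msini(P, K, e, M_star).ravel()
        c = np.asarray(chi2_3d).ravel()
        prof = np.full(len(mass_bins) - 1, np.inf)
        idx = np.digitize(m, mass_bins) - 1
        ok = (idx >= 0) & (idx < prof.size)
        np.minimum.at(prof, idx[ok], c[ok])
        return prof  # then apply the delta chi2 = 1 rule to this 1-D profile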
It's okay that a single paper might report various estimates (of varying
dimensionality) to address various scientific questions. As long as each
question is meaningful in isolation, the various regions are useful and
meaningful. One does have to be careful, though, if the answers to the
questions are later considered jointly to address some higher-level question;
correlations between parameters might again rear their ugly head(s).
Regarding the correctness of Theorem D, it is well-established in the
statistics literature. The reference you mentioned by Lampton et al. provides
a good discussion (see the Appendix), including a pointer to the standard
reference on the somewhat messy linear algebra needed to prove it (Cramer's
book). You'll also find it in other stats books on linear models.
However, it's often the case in astronomy that the conditions for Theorem D do
not hold. Astronomers aren't alone in this regard, and there is a lot of work
in statistics on what to do in this case. I'll say more about this in a second
response (the site limits the length of each response).
Cheers,
Tom Loredo
TLoredo
10-15-2007, 01:16 PM
Hi again, Guysquared et al.-
This continues where I left off, regarding situations where the assumptions
leading to Theorem D don't hold.
Taking the conventional "frequentist" approach to statistics, there are two
main threads of work in this direction. The first relies on higher-order
asymptotics; roughly speaking, this involves distributional expansions beyond
the "root-N" Gaussian approximation. There are two main ingredients in the
resulting procedures. First, one finds that a proper definition of a
confidence region requires "conditioning"---one should not consider the
ensemble of *all* possible data to define the coverage of a region, but rather a
subset of data that are "like" the observed data in some respect (again roughly
speaking, some data sets provide wider likelihoods than others of the same
size; one should consider data that provide similarly informative likelihood
surfaces to that provided by the real data). The second main ingredient is a
"correction" or "adjustment" to the profile likelihood---a function of the
parameters of interest that shifts the profile likelihood location. Such
adjustments account for how the uncertainties in the nuisance parameters vary
as a function of the interesting parameters (in the normal limit of Theorem D,
the uninteresting parameters have the same uncertainty no matter what values of
the interesting parameters you look at). There is a significant literature on
these adjusted likelihood methods; unfortunately, they are complicated to
implement.
The second thread of frequentist work going beyond Theorem D relies on
"resampling"---cleverly reusing the available data to build a delta-function
approximation to the unknown underlying true distribution for the data.
Bootstrapping is probably the best-known such resampling method. Bootstrapping
is pretty straightforward if one has a reasonable estimator for a parameter and
one merely wants to reduce its bias. But using bootstrapping to find
*confidence regions* is considerably more difficult, and if you do it wrong, the
resulting regions may be *worse* than if you just stick with Theorem D. I'm
afraid the discussion of bootstrapping in NR2 is sorely lacking in this regard,
and a number of astronomers, relying solely on the brief description of
bootstrapping in NR2, have improperly implemented bootstrapping and reported
incorrect regions. To be fair, the NR2 discussion is so brief that no careful
investigator should have considered it an adequate basis for an analysis.
Hopefully NR3 either expands the discussion, or provides appropriate caveats
and better references. If you pursue this approach, a good hint that you're on
the right track is if you find yourself having to worry about "pivotal
quantities."
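(For what it's worth, here is a minimal sketch of the most naive version, a percentile bootstrap for a scalar estimator; `fit_param` is a stand-in for whatever estimator is used, and, per the caveats above, percentile intervals can perform poorly unless the underlying quantity is approximately pivotal.)

    import numpy as np

    def percentile_bootstrap(x, y, fit_param, n_boot=2000, level=0.68, seed=None):
        """Naive percentile bootstrap interval for a scalar estimator:
        resample (x, y) pairs with replacement, refit, and take quantiles
        of the resulting estimates."""
        rng = np.random.default_rng(seed)
        n = len(x)
        estimates = []
        for _ in range(n_boot):
            idx = rng.integers(0, n, n)  # resample with replacement
            estimates.append(fit_param(x[idx], y[idx]))
        lo, hi = np.percentile(estimates, [50 * (1 - level), 50 * (1 + level)])
        return lo, hi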
Finally, there's the Bayesian approach, which multiplies the likelihood by a
prior; the product is the posterior distribution, and can be straightforwardly
integrated over parameter space (possibly with changes of variables) to
directly address various questions of interest. Such regions are
rigorous summaries without the conditions of Theorem D (but with a
different interpretation than frequentist confidence regions, and a
dependence on the prior). Admittedly, the integrals can be challenging once
you go beyond 3 dimensions, but modern Markov chain Monte Carlo (MCMC) methods
can make them tenable by constructing a random number generator that samples the
posterior (generating hypothetical *hypotheses* rather than the hypothetical
*data* used in conventional Monte Carlo simulation). A nice aspect of this
"posterior sampling" approach is that one can in principle provide a table of
the MCMC samples to readers, providing a full summary of the posterior that
could be easily re-used for various purposes. This is not yet a common
practice. I believe NR3 has some discussion of Bayesian inference and MCMC;
perhaps this will help make such an approach more common, reducing the misuse of
Theorem D and misuse of projected regions that led to your post.
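(To illustrate what posterior sampling looks like in practice, here is a bare-bones random-walk Metropolis sketch; `log_post` is a placeholder for the log of prior times likelihood, and the proposal scale `step` would need tuning and convergence checks in any real application.)

    import numpy as np

    def metropolis(log_post, start, step, n_samples, seed=None):
        """Random-walk Metropolis: generates samples whose long-run
        distribution is proportional to exp(log_post(theta))."""
        rng = np.random.default_rng(seed)
        theta = np.array(start, dtype=float)
        lp = log_post(theta)
        samples = []
        for _ in range(n_samples):
            proposal = theta + step * rng.standard_normal(theta.size)
            lp_new = log_post(proposal)
            if np.log(rng.random()) < lp_new - lp:  # accept with prob min(1, ratio)
                theta, lp = proposal, lp_new
            samples.append(theta.copy())
        return np.array(samples)

    # 1-D marginal credible regions then come from simple quantiles of each
    # column of `samples` (marginalizing, rather than profiling, over the
    # other parameters).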
Cheers,
Tom Loredo
Bill Press
10-15-2007, 08:40 PM
Thanks, Tom. I also appreciate your taking the time to spell this out.
Cheers,
Bill P.
Guysquared
11-27-2007, 11:31 AM
Dear Tom:
Thank you for the detailed responses. You address my fundamental concern in your first posting: when, if ever, is it appropriate to report 1-D (or n-D) projected confidence limits from a multi-parameter fit?
You gave an example, repeated here: Consider analyzing data from an exoplanet system that contains a single planet, with data from Doppler radial velocity observations. A full model for the velocity curve nominally requires at least 6 parameters (e.g., orbital period, eccentricity, velocity amplitude, origin of time, orbit orientation, and center-of-mass velocity). In this case, a table of 1-D projected regions could be useful for several purposes. A theorist who claims to predict the period distribution of planets could legitimately make use of the period estimates for many planets. Another investigator uninterested in planets but studying stellar kinematics might even be interested in the COM velocity estimate.
I believe this example illustrates the problem with 1-D projected regions. Let's assume both investigators separately publish papers. The first uses the one-sigma 1-D projected confidence region (delta chi2 = 1) for the orbital period, while the second uses the one-sigma 1-D projected confidence region for the COM velocity. Both investigators are pleased to use results quoted at 68% confidence, as prescribed by Theorem D and reported in the Doppler paper. But two independent scientific investigations have been performed using mutually inconsistent data: the 1-D projected confidence range for each parameter is obtained by letting the other parameters take values well outside their own 1-D projected confidence regions. How can both papers be acceptable? And what if other investigators similarly make use of the other 4 parameters?
Your example also seems to contradict your radio-button example directly. In the radio-button table, only one projected parameter can be seen at a time, presumably implying that only one can be used while the others are excluded. In the Doppler example, the projected parameters are all presented, and they are used simultaneously to draw scientific conclusions.
Finally, I don’t understand how various scientific questions can be meaningful in isolation when they are based on correlated data as in your Doppler example. This too is a contradiction.
Theorem D always leads to these and similar contradictions. Here is another one from the literature. A researcher wants to demonstrate that photometric measurements of planetary systems, e.g. Kepler satellite data, can resolve the sin(i) ambiguity of radial velocity data. He performs some simulations and shows that the signal-to-noise ratio is often sufficient to determine the Keplerian orbit from the photometric data alone. He reports that the 1-D projected confidence interval on sin(i) is, e.g., 0.3 +/- 0.03. Now when is this number ever useful? If a particular planetary system is measured by both photometric and RV instruments, then obviously a joint fit is performed to estimate the Keplerian orbit, and the inclination is one parameter of that fit. The photometric-only limit on sin(i) is never used in conjunction with any RV measurements, so the reported confidence range is meaningless. Again, the result at first appears to be meaningful in isolation, but in fact it has no application and cannot be consistent with the RV data, which limit the range of the non-interesting parameters.
Reporting 1-D projected confidence regions is a good way to report small numbers--best-in-the-world precisions--but with loss of information, inevitably resulting in misuse of data. Using Theorem D in the way you suggest creates a conflict of interest for a scientist: his data appear vastly "better" than they would using joint confidence limits, because the one-sigma limits shrink dramatically. Shouldn't we be suspicious when something is gained without the "pain" of acquiring better data?
A last problem is this: What about the scientist who uses the same data and reports the joint confidence limits on the parameters? Everyone agrees that Theorem C is correct. In the sin(i) example above, he might write that sin(i) is 0.3 +/- 0.2 (in addition, of course, to the limits on the other parameters), thereby showing that sin(i) = 0.45 is allowed within the one-sigma joint confidence limits. A reader must then choose between this definitely correct result and the other result, obtained using Theorem D, which indicates that sin(i) = 0.45 is well beyond the one-sigma limit of 0.3 +/- 0.03. Both papers are correct? Both papers are meaningful?