Saul Teukolsky
02-28-2002, 12:05 PM
The KS test in kstwo assumes you are comparing data from two continuous distributions. You can use the test on discrete data, in which case it is a conservative test (it tends to overestimate the significance probability, so it rejects the null hypothesis less often than the nominal level suggests). This is discussed in several textbooks on nonparametric statistics.
Once you admit discrete distributions, you have to decide how to handle tied data. Even continuous distributions can produce tied data if you have large data sets and only a small number of significant digits to represent the values. In that case you are effectively binning the data. The current implementation of kstwo does not handle tied data.
The "standard" method of handling ties in KS is to combine all the tied data points and add them to the CDF at once, so that the two empirical CDFs are compared only after every point sharing a value has been counted, while keeping the sample sizes unchanged. (See, e.g., Hollander and Wolfe, "Nonparametric Statistical Methods", 2nd ed., p. 183.) This makes sense for data from a discrete distribution. For ties that arise from not resolving a continuous distribution properly, several readers have written to suggest one should "dither" the data to break the ties randomly. I don't know of any rigorous study of this, but it sounds sensible.
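As a rough illustration of the two approaches described above, here is a minimal Python sketch. It is not the kstwo code and not taken from Hollander and Wolfe. The function names ks_statistic_with_ties and dither, and the resolution parameter, are hypothetical names chosen for this example. The first function computes only the statistic d, stepping both empirical CDFs past every copy of a tied value before comparing them. The second adds uniform noise of at most half the recording resolution, which is one plausible way to break ties randomly, offered with the same caveat that it has not been rigorously studied.

```python
import random


def ks_statistic_with_ties(data1, data2):
    """Two-sample KS distance d, comparing the empirical CDFs only at
    distinct values so that tied points enter the CDF all at once."""
    a = sorted(data1)
    b = sorted(data2)
    n1, n2 = len(a), len(b)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    i = j = 0
    d = 0.0
    while i < n1 and j < n2:
        x = min(a[i], b[j])
        # Advance past every copy of x in both samples before comparing.
        while i < n1 and a[i] == x:
            i += 1
        while j < n2 and b[j] == x:
            j += 1
        # Sample sizes n1 and n2 are left unchanged.
        d = max(d, abs(i / n1 - j / n2))
    return d


def dither(data, resolution, rng):
    """Break ties by adding uniform noise within +/- resolution/2.
    'resolution' is assumed to be the spacing at which values were recorded."""
    half = 0.5 * resolution
    return [x + rng.uniform(-half, half) for x in data]
```

For two identical samples that contain repeated values, such as [1, 1, 2] and [1, 1, 2], the tie-aware function returns d = 0. A point-by-point walk that steps only one sample at a time can report a spurious nonzero distance in that case.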
The NR text should carry a warning about tied data - maybe in the next reprinting.