Question regarding K-S (two) tests of samples of the same size


N.J.Cooper
08-27-2009, 03:22 PM
In trying to debug a program that uses kstwo from NR 3rd Ed, I wanted to see how the probability statistic behaved under user-defined conditions before I used the test on my data. So I wrote a loop that creates 1000 sets of two populations (e.g. of random doubles ranging from 1 to 10) and compares each set using kstwo. I then made a histogram of the list of 1000 p-statistics. To my surprise, when the two populations were the same size (n = 127), certain values of p (e.g. 0.565425) cropped up over and over, such that in a histogram of 32 bins, only half have N > 0. Is this normal behavior for p?

davekw7x
08-27-2009, 11:26 PM
...To my surprise, when the two populations were the same size (n = 127), certain values of p ...

If you look at the methodology (and the code) you may be able to see that, if both populations have the same size, n, all values of d will be multiples of 1/n.

For your example, all values of d will be multiples of 1/127. In other words, K-S has already put the d values into bins of width 1/n.
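To see where the multiples of 1/n come from, here is a minimal standalone sketch of the two-sample K-S distance, written from the textbook definition (largest gap between the two empirical CDFs). It is not the NR3 kstwo source, and the name ks_distance is just something I made up for illustration. With equal sample sizes n, every gap it looks at has the form |i - j|/n for integers i and j, so d can only land on multiples of 1/n.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Two-sample K-S distance: the largest absolute difference between the
// empirical CDFs of a and b. (Hypothetical helper, not the NR3 routine.)
double ks_distance(std::vector<double> a, std::vector<double> b)
{
    assert(!a.empty() && !b.empty());
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    const double na = static_cast<double>(a.size());
    const double nb = static_cast<double>(b.size());
    std::size_t i = 0, j = 0;   // how many points of a and b are <= x so far
    double d = 0.0;

    while (i < a.size() && j < b.size()) {
        // Step to the next data value and move past every point equal to it.
        const double x = std::min(a[i], b[j]);
        while (i < a.size() && a[i] <= x) ++i;
        while (j < b.size() && b[j] <= x) ++j;
        // The empirical CDFs at x are i/na and j/nb.
        // When na == nb == n, this gap is |i - j| / n.
        d = std::max(d, std::fabs(i / na - j / nb));
    }
    return d;
}
```

For two samples of 127 points each, multiplying the returned d by 127 should give an integer up to rounding error. With unequal sizes the gaps are differences of i/na and j/nb, so the set of possible d values is much finer.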

Want some numbers?

Here's a little test that prints out all unique values of d from tests with kstwo using a number of instances of two populations of the same size. After storing values obtained from kstwo for all of the tests, I just sorted the array of d values so that it's easier to see how it goes.

//
// xkstwo.cpp
//
//
// The file "points.txt" will show all of the values of d and prob.
// The file "uniq.txt" will show only points with different values of d.
//
//
// davekw7x
//
#include "../code/nr3.h"
#include "../code/ran.h"
#include "../code/sort.h"
#include "../code/ksdist.h"
#include "../code/kstests.h"
#include <iomanip>

inline Doub r(Ran & ran, Int size)
{
    return size * ran.doub();
}


int main(int argc, char **argv)
{
    Ran ran(7654321); // Use this for debugging so that it's the same every time
    //Ran ran(time(0)); // Use this to get new distributions each run

    Int size = 10; // Will get "random" values between 0 and 10

    // Default number of points and number of tests may be
    // overridden by command line arguments.
    Int num_points = 127;
    Int num_tests = 1000;

    if (argc > 1) {
        num_tests = atoi(argv[1]);
    }
    if (argc > 2) {
        num_points = atoi(argv[2]);
    }
    ofstream pointsfile("points.txt");
    ofstream uniqfile("uniq.txt");
    if (!pointsfile || !uniqfile) {
        cout << "There was a problem opening one of the output files."
             << endl;
        return EXIT_FAILURE;
    }

    VecDoub a(num_points), b(num_points);
    VecDoub d(num_tests), prob(num_tests);
    Int uniq = 1;
    for (Int n = 0; n < num_tests; n++) {
        for (Int i = 0; i < num_points; i++) {
            a[i] = r(ran, size);
            b[i] = r(ran, size);
        }
        kstwo(a, b, d[n], prob[n]);
    }

    // Sort d into ascending order, carrying prob along with it
    sort2(d, prob);

    pointsfile << scientific;
    uniqfile << scientific;

    pointsfile << " n d prob delta d" << endl;
    uniqfile << " n d prob delta d" << endl;

    uniqfile << setw(5) << 0 << setw(14) << d[0] << setw(14) << prob[0]
             << endl;

    for (Int n = 0; n < num_tests; n++) {

        pointsfile << setw(5) << n
                   << setw(14) << d[n]
                   << setw(14) << prob[n];

        if (n > 0) {
            // A jump in the sorted d values marks a new distinct value
            if ((d[n]-d[n-1]) > 1.0e-10) {
                pointsfile << setw(14) << d[n]-d[n-1];
                uniqfile << setw(5) << n
                         << setw(14) << d[n]
                         << setw(14) << prob[n]
                         << setw(14) << d[n]-d[n-1] << endl;
                ++uniq;
            }
        }
        pointsfile << endl;
    }
    uniqfile << "Number of tests = " << num_tests << endl;
    uniqfile << "Number of points for each test = " << num_points << endl;
    uniqfile << "Number of unique values of d = " << uniq << endl;

    return 0;
}


uniq.txt contains

    n             d          prob       delta d
    0  4.724409e-02  9.985557e-01
    5  5.511811e-02  9.884444e-01  7.874016e-03
   35  6.299213e-02  9.569055e-01  7.874016e-03
   89  7.086614e-02  8.965011e-01  7.874016e-03
  183  7.874016e-02  8.103527e-01  7.874016e-03
  275  8.661417e-02  7.082521e-01  7.874016e-03
  384  9.448819e-02  6.009963e-01  7.874016e-03
  492  1.023622e-01  4.971489e-01  7.874016e-03
  580  1.102362e-01  4.022043e-01  7.874016e-03
  663  1.181102e-01  3.189829e-01  7.874016e-03
  737  1.259843e-01  2.483810e-01  7.874016e-03
  807  1.338583e-01  1.900694e-01  7.874016e-03
  842  1.417323e-01  1.430166e-01  7.874016e-03
  881  1.496063e-01  1.058448e-01  7.874016e-03
  919  1.574803e-01  7.705962e-02  7.874016e-03
  944  1.653543e-01  5.519368e-02  7.874016e-03
  960  1.732283e-01  3.889297e-02  7.874016e-03
  970  1.811024e-01  2.696362e-02  7.874016e-03
  978  1.889764e-01  1.839132e-02  7.874016e-03
  986  1.968504e-01  1.234174e-02  7.874016e-03
  988  2.047244e-01  8.148332e-03  7.874016e-03
  993  2.125984e-01  5.292856e-03  7.874016e-03
  994  2.204724e-01  3.382522e-03  7.874016e-03
  997  2.283465e-01  2.126768e-03  7.874016e-03
  998  2.362205e-01  1.315615e-03  7.874016e-03
  999  2.440945e-01  8.006942e-04  7.874016e-03
Number of tests = 1000
Number of points for each test = 127
Number of unique values of d = 26


All values of d are multiples of 7.874016e-3 (which is 1/127).

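Is this normal? Yes. For fixed sample sizes, prob is computed from d alone, so it can take only as many distinct values as d does (26 in the run above). Any of your bins that don't encompass one of those discrete values will be empty.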
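As a rough cross-check, the discrete p values can be tabulated directly from d = k/n without generating any random data. The sketch below sums the standard series Q_KS(z) = 2 * sum over j >= 1 of (-1)^(j-1) * exp(-2 j^2 z^2) term by term. The effective-size factor (en + 0.12 + 0.11/en) is the form I recall kstwo using, so treat it as an assumption; it does reproduce the prob column in uniq.txt above for the rows I checked. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Q_KS(z) by direct summation of its alternating series.
// (Hypothetical helper; NR3 evaluates the same function differently.)
double qks_series(double z)
{
    assert(z >= 0.0);
    // Below about z = 0.04 the result is 1 to double precision, and the
    // series would need a very large number of terms, so short-circuit.
    if (z < 0.04) return 1.0;

    double sum = 0.0;
    double sign = 1.0;
    for (int j = 1; j <= 1000; ++j) {
        const double term = std::exp(-2.0 * j * j * z * z);
        sum += sign * term;
        if (term < 1.0e-17) break;   // remaining terms are negligible
        sign = -sign;
    }
    const double q = 2.0 * sum;
    // Clamp away tiny rounding excursions outside [0, 1].
    return q < 0.0 ? 0.0 : (q > 1.0 ? 1.0 : q);
}

// p value for d = k/n when both samples have n points.
// Assumes the effective-size correction described in the text above.
double ks_pvalue_equal_n(int k, int n)
{
    assert(n > 0 && k >= 0 && k <= n);
    const double d = static_cast<double>(k) / n;
    const double en = std::sqrt(n / 2.0);   // sqrt(n*n / (n + n))
    return qks_series((en + 0.12 + 0.11 / en) * d);
}
```

Calling ks_pvalue_equal_n(k, 127) for k = 6 through 31 should walk down the prob column of uniq.txt. Since k is an integer, those are the only p values a histogram of results at that d range can ever contain.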

Regards,

Dave