Chapter 6 - incgammabeta.h : binomialdist:: cdf (bug?)


emilpohl
09-13-2010, 12:59 AM
Hello all,

I've translated to Java. So I'm not sure if it's a bug. But the binomialdist.cdf only 'worked' after I've replaced the line:

return 1. - betai( (Doub)k, n - k + 1., pe);

by

return betai(n - (Doub)k, k + 1., 1. - pe);

Is this difference related to k (since it's discrete), a bug or bad java translation?

regards,

Carlos

davekw7x
09-13-2010, 10:45 AM
...So I'm not sure if it's a bug....
Hmmm...I hadn't tested that function by itself.

So, I wrote a little program for p = 1.0/6.0 (probability of a "one" turning up when you toss a fair six-sided die).
Here's the program

#include "../code/nr3.h"
#include "../code/gamma.h"
//
// uncomment the following to use the original version
#include "../code/incgammabeta.h"

//
// uncomment the following to use the modified version
//#include "./incgammabeta1.h"

int main()
{
int n;
double p;
// Change #if 0 to #if 1 to let the user enter the number of trials and
// the probability of a desired outcome.
#if 0
cout << "Enter n and p: ";
cin >> n >> p;
if (!cin || n < 1 || p <= 0 || p >= 1) {
cout << "Invalid input." << endl;
exit(EXIT_FAILURE);
}
#else
// Probability for a given face of a six-sided die
n = 20; p = 1.0/6.0;
#endif
cout << "CDF for binomial distribution with n = " << n
<< " and p = " << p << endl << endl;
Binomialdist bd(n, p);
cout << fixed;
cout << scientific;
cout << setprecision(11);
cout << " x Pr(X <= x)" << endl;
for (int k = 0; k < n; k++) {
cout << setw(3) << k << " " << setw(20) << bd.cdf(k) << endl;
}
return 0;
}

And here's the output:

CDF for binomial distribution with n = 20 and p = 0.166667

x Pr(X <= x)

0 0.00000000000e+00
1 2.60840533046e-02
2 1.30420266523e-01
3 3.28659071638e-01
4 5.66545637776e-01
5 7.68749218993e-01
6 8.98159510972e-01
7 9.62864656961e-01
8 9.88746715357e-01
9 9.97158384336e-01
.
.
.


Now, for a seat-of-pants sanity check: X can't be less than zero, so the probability that X is less than or equal to zero is equal to the probability that X is equal to zero. This value is not zero as given by the cdf() function; it is (5/6) to the power 20. That's equal to 0.026084053...

Similarly for the other values: It seems to me that for the usual definition of CDF the function uses a value of argument that's off by 1.

I haven't tested to see how this affects any other functions that use this function, but here's a test of cdf() with a slight modification:

Instead of changing the return statement (which would require changing the conditionals that test special cases), I just inserted the following line at the beginning of Binomialdist::cdf() in incgammabeta.h:

++k;

Now I get the following (which is what I would expect):

CDF for binomial distribution with n = 20 and p = 0.166667

x Pr(X <= x)
0 2.60840533046e-02
1 1.30420266523e-01
2 3.28659071638e-01
3 5.66545637776e-01
4 7.68749218993e-01
5 8.98159510972e-01
6 9.62864656961e-01
7 9.88746715357e-01
8 9.97158384336e-01
9 9.99401496063e-01
.
.
.


Or am I missing something? (It wouldn't be the first time, but see Footnote.)


Regards,

Dave

Footnote:
I don't "do" java, but as a quick cross check, I ran the following in GNU octave:

# GNU Octave m-file to check binomial cdf
n = 20;
p = 1.0/6.0;
printf('CDF for binomial distribution with n = %d and p = %f\n\n', n, p);
printf(' x Pr(X <= x)\n\n')
for x=0:n
printf('%3d %.11e\n', x, binocdf(x, n, p))
end


Results agree with my modified cdf() function:

CDF for binomial distribution with n = 20 and p = 0.166667

x Pr(X <= x)

0 2.60840533046e-02
1 1.30420266523e-01
2 3.28659071638e-01
3 5.66545637776e-01
4 7.68749218993e-01
5 8.98159510972e-01
6 9.62864656961e-01
7 9.88746715357e-01
8 9.97158384336e-01
9 9.99401496063e-01
.
.
.

emilpohl
09-13-2010, 11:45 AM
Hi! Dave

thank you for your fast reply.

adding k++ makes cdf looks "right". I've read Poisson explanation 6.14.13. So, it might be the same case.

However, for x = 1; k>20 and pe > 0.9 the cdf goes to 0.0 below 1.0E-16 adding ++k, k++ or not.

could you test this, please?

regards,

Carlos

davekw7x
09-13-2010, 02:19 PM
...
However, for x = 1; k>20 and pe > 0.9 the cdf goes to 0.0...Welcome to the world of finite precision!

Let's look at it for n = 25 and pe = 0.95. With my modified incgammabeta.h, cdf(0) returns zero. Not nearly zero. Exactly zero.

My little sanity check says that for k = 0, the result should be 0.05 to the power 25. That's something like 2.98...e-33. That's pretty small, but it is not beyond representation as a double precision floating point number.

So, what happened?

Well...

For k = 0, the cdf function calls betai with the values:
a = 1, b = 25, and c = 0.95

Inside betai, you can confirm that for these values it is doing the last exit with a calculated return value equal to 1.0-bt*betacf(b,a,1.0-x)/b This turns out to be something like 1.0 - 2.9802322387696186e-33

Since there are only 16 or 17 significant decimal digits in a C++ double precision number, the returned value is 1.0

Then the cdf function returns 1.0 - the betai() value, so it becomes zero.

See what happened? 1.0 - (1.0 - very_small_number) should be equal to the very_small_number, but because of roundoff error it becomes 0.0. Associative and distributive "laws" don't always hold for finite-precision floating point numbers. Bummer!

Bottom line: perhaps the cdf function can be rewritten so that it calculates the smaller terms directly instead of using the the betai stuff, but for now, at least there is my explanation of why everything smaller than approximately 1.0e-16 becomes zero. Understanding the results can give insight.

Regards,

Dave

Footnote:
My little GNU octave program gives pretty much the same results (everything smaller than 1.0e-16 becomes zero), so they apparently use some similar calculations.

On the other hand...

When I used the GNU Scientific Library function gsl_cdf_binomial_P(), I got more precision.

Here's the C program:

#include <stdio.h>
#include <gsl/gsl_cdf.h>
int main()
{
int n = 25;
double p = 0.95;
int k;
printf("CDF of binomial distribution with n = %d, and p = %f\n\n", n, p);
printf(" x Pr(X <= x)\n\n");
for (k = 0; k <= n; k++) {
printf("%3d %18.6e\n", k, gsl_cdf_binomial_P(k, p, n));
}
return 0;
}



Here's the output:

CDF of binomial distribution with n = 25, and p = 0.950000

x Pr(X <= x)

0 2.980232e-33
1 1.418591e-30
2 3.241777e-28
3 4.733943e-26
4 4.960433e-24
5 3.970253e-22
6 2.522780e-20
7 1.305786e-18
8 5.604966e-17
9 2.020747e-15
10 6.174753e-14
11 1.609214e-12
12 3.591139e-11
13 6.876528e-10
14 1.130173e-08
15 1.591912e-07
16 1.915378e-06
17 1.958055e-05
18 1.687532e-04
19 1.212961e-03
20 7.164948e-03
21 3.409060e-02
22 1.271065e-01
23 3.576241e-01
24 7.226104e-01
25 1.000000e+00


In Other Words:

"There's more than one way to skin a cat."
My dear old granny used to say that,
but I don't know where she got it. I mean,
no one in my family ever actually skinned a cat.
(At least as far as I know, they didn't. I mean
she said that we were eating squirrel. But
why would she mention something about cats?
Oh, crap!)
---davekw7x

emilpohl
09-13-2010, 04:31 PM
Dave,

Now, after your explanation, it's clear to me. I've gotten the same GNU Scientific Library results using Colt-1.2.0, SSJ-2.4 java libraries.

This doubt has arisen when I was trying to find a good invcdf function, but for cdf=1.418591e-30, n=25 and p 0.95 the result is 8.0 instead 1., for example.

Thank you very much,

Carlos