eric_r
02-18-2004, 11:46 PM
Hello. I have recently joined a team working on a complex C++ project that
involves calculating the maxima of multidimensional likelihood surfaces. As part
of an investigation into the behavior of the existing code, I have written a
small, standalone implementation of the BFGS algorithm to find the minima of
n-dimensional analytical functions of my own choosing. Perplexed by the behavior
of this test application, I changed its code to exactly mirror the BFGS code
given in Numerical Recipes in C, 2nd ed. (1995 printing). I have implemented
both the BFGS algorithm from sec. 10.7, and the line search algorithm from sec.
9.7.
Although the code generally succeeds in finding the minima of my test functions,
I find it behaves poorly in multidimensional cases, often failing to converge
within 1000 iterations. The algorithm appears to be unable to produce
positive-definite approximations to the inverse Hessian matrix in some of my
test cases. Furthermore, for some multidimensional functions, the behavior
appears to vary strongly with the direction of the starting vector: starting
points of the form (1.234, 1.234, 1.234, ...) work well (positive-definite
matrices, convergence within a few iterations), while points of the form
(-1.234, +1.234, -1.234, ...) do not (non-positive-definite matrices, a very
slow crawl toward the minimum, and then only if I reset these matrices to the
unit matrix in each iteration).
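(For anyone who wants to see what I mean by a "bad" matrix: I treat the current
approximation to the inverse Hessian as bad when it is no longer positive
definite. One way to test that is an attempted Cholesky factorization; here is
a minimal sketch in that spirit -- not my exact code, and the helper names are
made up.)

#include <vector>
#include <cmath>

// Sketch only: a symmetric matrix is positive definite exactly when its
// Cholesky factorization exists; bail out at the first non-positive pivot.
bool is_positive_definite(const std::vector<std::vector<double> >& a)
{
    const std::size_t n = a.size();
    std::vector<std::vector<double> > L(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j <= i; ++j) {
            double sum = a[i][j];
            for (std::size_t k = 0; k < j; ++k)
                sum -= L[i][k] * L[j][k];
            if (i == j) {
                if (sum <= 0.0)
                    return false;      // non-positive pivot: not positive definite
                L[i][i] = std::sqrt(sum);
            } else {
                L[i][j] = sum / L[j][j];
            }
        }
    }
    return true;
}

// In the BFGS driver, a "bad" update is then handled roughly like this
// (hessin is the running inverse-Hessian approximation):
//
//     if (!is_positive_definite(hessin)) {
//         ++bad_matrix_count;
//         reset_to_identity(hessin);   // hypothetical helper
//     }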
The text in Numerical Recipes states that the BFGS algorithm can generate
non-positive-definite matrices when the starting point is far from the minimum,
or due to round-off errors. (Other texts state this, too.) However, I am
observing non-positive-definite matrices even for starting points very close to
the minimum, and I am never seeing the "round-off error!" message that the line
search code of sec. 9.7 can generate.
Has anyone observed this behavior? Does anyone have any suggestions?
Here are more details about my tests. I tried getting BFGS to find the minimum
of
f(x, y, z, ...) = -10 + (x-1)^6 + (y-1)^6 + (z-1)^6 + ...,
which is at the point (1, 1, 1, ...). I tested functions obeying this equation
from one parameter (f(x1)) to ten parameters (f(x1, x2, ..., x10)). For each
dimensionality, I ran the algorithm on four different sets of
randomly-determined starting points:
I) x1 = x2 = x3 = ..., xi far from the minimum (-19 < xi < 21)
II) x1 = x2 = x3 = ..., xi close to the minimum (0.9 < xi < 1.1)
III) x1 != x2 != x3 != ..., each xi independent and far from the minimum (-19 < xi < 21)
IV) x1 != x2 != x3 != ..., each xi independent and close to the minimum (0.9 < xi < 1.1)
(This function somewhat resembles an n-dimensional pillow floating in space; in
cases II and IV, the starting point is on one of the faces of the pillow, while
in cases I and III it is usually off the pillow.)
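(In case it helps to see the first test function concretely, here is roughly
what my function and analytic-gradient routines look like -- a sketch, not the
verbatim code; the derivative of each term (xi-1)^6 is 6*(xi-1)^5.)

#include <vector>
#include <cmath>

// f(x) = -10 + sum_i (x_i - 1)^6; minimum value -10 at x = (1, 1, ..., 1).
double test_func(const std::vector<double>& x)
{
    double f = -10.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        f += std::pow(x[i] - 1.0, 6);
    return f;
}

// Analytic gradient: df/dx_i = 6 * (x_i - 1)^5.
void test_grad(const std::vector<double>& x, std::vector<double>& g)
{
    g.resize(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        g[i] = 6.0 * std::pow(x[i] - 1.0, 5);
}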
For each of these cases, I supplied 1000 random starting points, and computed
summary statistics. Here is a summary of my results:
1 parameter:
============
Case I:
-------
Avg. chi-squared = 0.00083 (between observed min and true (1, 1, ...))
Avg. BFGS iterations = 28.3
Avg. iterations producing "bad" matrices, which were reset to unity: 0
Stop condition: Convergence on gradient 97% of the time, on x 3% of the time.
Case II:
--------
Avg. chi-squared = 0.00053 (between observed min and true (1, 1, ...))
Avg. BFGS iterations = 4.6
Avg. iterations producing "bad" matrices, which were reset to unity: 0
Stop condition: Convergence on gradient 70% of the time, on x 30% of the time.
(For only 1 parameter, cases III and IV are identical to cases I and II.)
2 parameters:
=============
Case I:
-------
Avg. chi-squared = 0.0012 (between observed min and true (1, 1, ...))
Avg. BFGS iterations = 15.6
Avg. iterations producing "bad" matrices, which were reset to unity: 3.1 (20%)
Stop condition: Convergence on gradient 99.9% of the time, on x 0.1% of the
time.
Case II:
--------
Avg. chi-squared = 0.0011 (between observed min and true (1, 1, ...))
Avg. BFGS iterations = 4.6
Avg. iterations producing "bad" matrices, which were reset to unity: 0
Stop condition: Convergence on gradient 70% of the time, on x 30% of the time.
Case III:
-------
Avg. chi-squared = 0.010 (between observed min and true (1, 1, ...))
Avg. BFGS iterations = 927
Avg. iterations producing "bad" matrices, which were reset to unity: 910 (98%)
Stop condition: Convergence on gradient 7.5% of the time;
gave up after 1000 iterations 92.5% of the time.
(Line search always succeeded, always reached condition 9.7.7, "sufficient
function decrease.")
Case IV:
--------
Avg. chi-squared = 0.0031 (between observed min and true (1, 1, ...))
Avg. BFGS iterations = 578
Avg. iterations producing "bad" matrices, which were reset to unity: 573 (99%)
Stop condition: Convergence on gradient 34% of the time, on x 8% of the time;
gave up after 1000 iterations 58% of the time.
(Line search always succeeded, always reached condition 9.7.7, "sufficient
function decrease.")
10 parameters:
=============
Case I:
-------
Avg. chi-squared = 0.0062 (between observed min and true (1, 1, ...))
Avg. BFGS iterations = 16.0
Avg. iterations producing "bad" matrices, which were reset to unity: 2.8 (17%)
Stop condition: Convergence on gradient 99.9% of the time, on x 0.1% of the
time.
Case II:
--------
Avg. chi-squared = 0.0055 (between observed min and true (1, 1, ...))
Avg. BFGS iterations = 4.8
Avg. iterations producing "bad" matrices, which were reset to unity: 0
Stop condition: Convergence on gradient 72% of the time, on x 28% of the time.
Case III:
-------
Avg. chi-squared = 0.059 (between observed min and true (1, 1, ...))
Avg. BFGS iterations = 1000
Avg. iterations producing "bad" matrices, which were reset to unity: 1000 (100%)
Stop condition: Gave up after 1000 iterations 100% of the time.
(Line search always succeeded, always reached condition 9.7.7, "sufficient
function decrease.")
Case IV:
--------
Avg. chi-squared = 0.024 (between observed min and true (1, 1, ...))
Avg. BFGS iterations = 1000
Avg. iterations producing "bad" matrices, which were reset to unity: 1000 (100%)
Stop condition: Gave up after 1000 iterations 100% of the time.
(Line search always succeeded, always reached condition 9.7.7, "sufficient
function decrease.")
You can see the dramatic difference I am observing when I choose starting
vectors in arbitrary directions vs. starting vectors along the line x = y = z =
....
For this case, I also tried supplying the exact inverse Hessian matrix, which
is inherently positive definite and happens to be diagonal, with entries
(1/30)*(xi-1)^-4. The accuracy (chi-squared values) was practically identical, but the
efficiency was tremendously improved, requiring only 20 BFGS iterations in case
I, and only 29 iterations in case III, for 10 parameters.
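(Concretely: the exact Hessian of this f is diagonal with entries 30*(xi-1)^4,
so the exact inverse Hessian is diagonal with entries (1/30)*(xi-1)^-4.
Supplying it in place of the running BFGS approximation looks roughly like the
sketch below; the function name is illustrative.)

#include <vector>
#include <cmath>

// Sketch: exact inverse Hessian of f(x) = -10 + sum_i (x_i - 1)^6 at the
// current point x.  It is diagonal with entries 1 / (30 * (x_i - 1)^4),
// hence positive definite wherever every x_i != 1.
void exact_inverse_hessian(const std::vector<double>& x,
                           std::vector<std::vector<double> >& hessin)
{
    const std::size_t n = x.size();
    hessin.assign(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        hessin[i][i] = 1.0 / (30.0 * std::pow(x[i] - 1.0, 4));
}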
I also used the test function
f(x, y, z, ...) = -exp(-(x-1)^2)*exp(-(y-1)^2)*exp(-(z-1)^2)*...
and tried to get BFGS to find the minimum at (1, 1, 1, ...). For this function,
I changed cases I and II to probe the region -1 < xi < 3 instead of -19 < xi <
21. The algorithm produced non-positive-definite matrices at least 99.9% of the
time for 2 to 10 parameters, 17.5% of the time for 1 parameter starting far
from the minimum, and never when starting close to the minimum in the
one-parameter case. (I can provide full data upon request.)
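(For completeness, a sketch of this second test function and its analytic
gradient, again illustrative rather than verbatim. Since
f(x) = -exp(-sum_j (x_j - 1)^2), the gradient is df/dx_i = -2*(x_i - 1)*f(x).)

#include <vector>
#include <cmath>

// f(x) = -prod_i exp(-(x_i - 1)^2) = -exp(-sum_i (x_i - 1)^2);
// minimum value -1 at x = (1, 1, ..., 1).
double gauss_func(const std::vector<double>& x)
{
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        s += (x[i] - 1.0) * (x[i] - 1.0);
    return -std::exp(-s);
}

// Analytic gradient: df/dx_i = 2 * (x_i - 1) * exp(-s) = -2 * (x_i - 1) * f(x).
void gauss_grad(const std::vector<double>& x, std::vector<double>& g)
{
    const double f = gauss_func(x);
    g.resize(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        g[i] = -2.0 * (x[i] - 1.0) * f;
}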
Finally, I used this test function,
f(x, y) = (x-1)^4 * (y-1)^2 + (x-1)^4 + (y-1)^2 - 2,
which produces a thin valley when viewed in 3 dimensions. I haven't thoroughly
inspected this function yet, but my preliminary results do not show any trouble
with the direction calculations.
My main questions are:
1. Does anyone understand why I'm observing this behavior?
2. Can anyone suggest a solution that will reduce the number of iterations?
3. Can any conclusions drawn about this behavior be applied to the similar
behavior ("bad" matrices, slow creeps toward the minimum) that I'm observing in
a complicated but continuous function of 8 parameters in our application code
base?
4. Regarding the line search algorithm of sec. 9.7, what should be done if the
minimum lambda, alamin = TOLX/test, exceeds the maximum lambda value of 1.0
before the line search loop even begins? (A sketch of the relevant quantities
follows this list.)
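(For context on question 4: the quantities involved are computed before the
backtracking loop starts, roughly as in the sketch below. It follows the
structure of the sec. 9.7 routine but is paraphrased, not the book's verbatim
code; p is the proposed full quasi-Newton step and xold the current point. When
p is tiny relative to xold, test is tiny and alamin = TOLX/test can exceed the
first and largest lambda ever tried, 1.0.)

#include <vector>
#include <cmath>
#include <algorithm>

// Sketch of the pre-loop setup in a sec. 9.7 style line search.
double compute_alamin(const std::vector<double>& p,     // proposed full step
                      const std::vector<double>& xold,  // current point
                      double TOLX)                      // convergence tolerance on x
{
    double test = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        double temp = std::fabs(p[i]) / std::max(std::fabs(xold[i]), 1.0);
        if (temp > test) test = temp;
    }
    // The backtracking loop then starts from alam = 1.0 and quits when
    // alam < alamin; question 4 is about the case alamin > 1.0 at this point.
    return TOLX / test;
}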
Many thanks for any help anyone can provide!
--Eric R.