Wednesday, January 2, 2008

Hunting Memory Leaks in the NTL Wrapper

Sage's NTL wrapper is more or less famous for leaking memory. During Bug Day 5 at the Clay in Boston I found what was then the last known leak in a live session. It was caused by a str() method and, as it turned out, the same mistake had to be corrected three or four times in the NTL wrapper and a couple of times outside of it. Since then the NTL wrapper has been rewritten to be faster and more feature complete, but the rewrite introduced new memory leaks. I found those before the merge, but at the time they seemed minor and insignificant. Unfortunately, they turned out to cause massive problems in the padics code.

I occasionally run the whole Sage test suite under valgrind's memcheck. A full run takes about a day when skipping memcheck on all external executables. In the 2.9.0 -> 2.9.1 release cycle I fixed three memory leaks, one of them in the graph code, in the conversion of arbitrary precision numbers to their string representation in base 2. Since that function is called quite often when reading or writing adjacency matrices, the leak could be quite a pain. The other two were relatively minor.

The point of running memcheck on the test suite was to find and fix all known memory leaks before the 2.9.2 release. So far I have been unable to fix two of them:
  • LinBox leaks via Givaro in certain situations, for example when computing characteristic polynomials over certain fields. Clement Pernet and I investigated the issue during Sage Days 6, but so far we haven't found a solution. Since it is relatively minor, especially compared to the memory leaks we already fixed in LinBox, it has been somewhat on the back burner.
  • The padic doctests leak tremendous amounts of memory. One doctest alone leaks more than 50 MBytes. That is clearly unacceptable and would make long-running computations involving padics nearly impossible. But so far I have been unable to figure out why the code, especially the __pow__ methods, is leaking. I attacked the problem with memcheck as well as omega, and omega points to some auto-generated Cython code, which is always a bad sign. I guess I need to bother Robert Bradshaw about this :)
So, after attacking the problem repeatedly with zero progress, I have given up on fixing it for the 2.9.2 release. Maybe with a little distance I will finally figure out what is wrong.

Oh well, sometimes it just takes time to figure it all out. And afterwards one has to wonder why it took so long to see the obvious.


