[RASMB] SAUCE - x-ray optics for the AUC

Borries Demeler demeler at biochem.uthscsa.edu
Wed Mar 18 06:18:55 PDT 2009


Dear Peter,

First, let me preface my response by suggesting that you and I both
try our best to keep this conversation civil, cordial and collegial. I
feel that the tone of your last message is inappropriate for the RASMB
forum, which has many newcomers to AUC who could be turned off by
shouting and CAPITALIZATION. I may choose not to respond at all if
you continue with this tone. RASMB is a great place to exchange ideas
and discuss concepts, and get others excited about science that can be
accomplished with AUC, and learn new things. Let's try to stay friendly,
OK? I certainly pledge to adhere to this, and think anything else would
just be counterproductive for the whole community.  Nevertheless, I think
you bring up a lot of good points which I don't mind discussing,
and there is of course always room for different opinions.

Let me address your first point: You are absolutely correct, I should
have said 10x100 and not 10x10. As you correctly point out, and I hope
everyone will readily see, a 10x10 solution fit doesn't make any sense.
Of course I intended to write 10x100 in some places (not everywhere)
in the manuscript, but along the way the extra zero was regrettably
and inadvertently dropped. Please accept my sincere apologies for this
oversight; it was an honest mistake and I regret any false impression
it may have left with the reader. It certainly was never my intention
to blatantly misrepresent your work. I will definitely ask Tony Watts
to publish an erratum to correct this mistake. So, thanks for pointing
out the missing zero to me.

More to the science: As you can tell from Figure 2 in the paper, the
statistics clearly show that it is irrelevant whether you use a 10x10 or
a 10x100 grid; the resolution of neither grid suffices, and one of the
central points we make in the paper is entirely unaffected by this
error. Whether you use a 10x10, a 10x100, or a 10x50 grid (the default
in your software), all of them are still outside of the range where,
in our opinion, you should be fitting. It is not until you use a
resolution of 10,000 or higher that the statistics and parameter values
stabilize and generate reproducible results. In practice, unless the s
range is very narrow based on a preliminary dc/dt or van Holde Weischet
analysis, I usually use 20,000 - 25,000 solutes in my 2DSA fits.

Next point: Memory requirements. Again, this should have been the
10x100 grid; here is the arithmetic: number of scans * number of points
per scan * number of s values * number of f/f0 values * 8 bytes per
double-precision value. If you have interference data and measure about
100 scans, you might see this:

100 * 800 * 100 * 10 * 8 = 0.64 gigabyte.
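
If you want to play with these numbers yourself, a few lines of Python
reproduce the estimate (this is just an illustration, not UltraScan code,
and the grid sizes are only examples):

    # Rough size of the dense matrix handed to NNLS:
    # (scans * points per scan) rows by (s values * f/f0 values) columns,
    # times the size of one matrix element.
    def matrix_bytes(scans, points, s_values, ff0_values, bytes_per_value=8):
        return scans * points * s_values * ff0_values * bytes_per_value

    print(matrix_bytes(100, 800, 100, 10) / 1e9)   # ~0.64 GB for a 10x100 grid
    print(matrix_bytes(100, 800, 100, 100) / 1e9)  # ~6.4 GB for 10,000 solutes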

Perhaps you don't use as many scans or use only floats, in which case
you can reduce this number. Now our point is that if you use the grid
resolution suggested by the statistics in Fig. 2, you need to switch to
an alternative approach for solving the NNLS problem to avoid memory
constraints. Emre's moving-grid/divide-and-conquer approach provides a
very elegant solution to this problem, one that works not just serially
but can also be nicely parallelized to take advantage of supercomputers
and modern multi-core architectures. Sure, a 10x100 grid can be solved
easily in a few minutes on a moderately equipped PC, no argument there;
it's just not what you want to be doing after you look at our results.
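
To make the idea concrete, here is a deliberately stripped-down sketch of
a single divide-and-conquer pass in Python. It ignores the moving grids,
noise terms and parallelization of the real implementation, and the
build_model_column() helper (which would simulate one solute over all
scans and radii) is purely hypothetical:

    import numpy as np
    from scipy.optimize import nnls

    def solve_subgrid(data, grid_points, build_model_column):
        # One column per (s, f/f0) grid point; NNLS forces amplitudes >= 0.
        A = np.column_stack([build_model_column(s, ff0)
                             for s, ff0 in grid_points])
        amplitudes, _ = nnls(A, data)
        # Only the solutes with non-zero amplitude survive this partition.
        return [p for p, a in zip(grid_points, amplitudes) if a > 0.0]

    def divide_and_conquer(data, full_grid, n_partitions, build_model_column):
        step = max(1, len(full_grid) // n_partitions)
        survivors = []
        for i in range(0, len(full_grid), step):
            survivors += solve_subgrid(data, full_grid[i:i + step],
                                       build_model_column)
        # Merge the survivors of all partitions and refit them together.
        return solve_subgrid(data, survivors, build_model_column)

Each partition stays small enough to fit in memory, and the subgrid fits
are independent of each other, which is what makes the method so easy to
parallelize.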

Next point: You claim that a proper minimization cannot be achieved with
our divide and conquer approach of fitting coarse grids and merging
the results. I am glad you bring this up, because it may not be clear
enough in our paper, but that assertion is incorrect. We have shown
in Brookes et al. (2006), cited under "Iterative Refinement",
that you get _exactly_ the same results with this approach that you get
when you fit the same system in one single iteration with all solutions
included. If you don't believe it, try it out. This is true regardless of
system size and was tested and verified on many problems. If you do not
use the iterative refinement, you will not get exactly the same result,
but for systems with high enough resolution the differences are truly
negligible, and you can set whatever convergence criterion in the
iterative method you deem necessary to reach the error level you need.
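
In pseudo-Python, my reading of the iterative refinement loop (reusing
the hypothetical solve_subgrid() helper from the sketch above, and again
omitting all the details of the actual implementation) looks roughly like
this: the survivors of each pass are seeded back into every partition of
the next pass until the recovered solute set stops changing.

    def iterative_refinement(data, full_grid, n_partitions,
                             build_model_column, max_iterations=10):
        survivors = []
        for _ in range(max_iterations):
            step = max(1, len(full_grid) // n_partitions)
            merged = []
            for i in range(0, len(full_grid), step):
                # Seed each partition with the previous pass's survivors.
                subgrid = list(dict.fromkeys(full_grid[i:i + step] + survivors))
                merged += solve_subgrid(data, subgrid, build_model_column)
            new_survivors = sorted(set(merged))
            if new_survivors == survivors:
                break  # converged: no change in the recovered solute set
            survivors = new_survivors
        return survivors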

I should note here that when this method is used in conjunction with
time or radially invariant noise subtraction, the number of iterations
will increase until the same convergence criterion is reached. Again,
with the appropriate convergence criterion the differences are much too
small to be of any concern in this application. This is all discussed in:
Brookes, E., Boppana, R.V., and B. Demeler. (2006) Computing Large Sparse
Multivariate Optimization Problems with an Application in Biophysics. ACM
Proceedings 0-7695-2700-0/06 Supercomputing 2006.

Next point: The importance of regularization. Peter, I agree with
you on this. Regularization is very useful for sedimentation velocity
data fitting, however, I have a different philosophy on what kind of
regularization should be used. Tikhonov or ME regularization ends up
broadening/smoothing your solution and introducing infinitely many
solutes with smoothly varying amplitudes into the result. This could be
interpreted as a confidence region, but it doesn't satisfy Occam's razor,
which states - and I am paraphrasing - that the solution with the fewest
parameters is the preferred solution. To obtain a regularized solution in
this sense we proposed the parsimonious regularization method, which can
be applied after obtaining a 2DSA fit to further improve the results:
Brookes, E and B. Demeler. Parsimonious Regularization using Genetic
Algorithms Applied to the Analysis of Analytical Ultracentrifugation
Experiments. GECCO ACM Proceedings 978-1-59593-697-4/07/0007 (2007). 
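
For readers who have not seen the two approaches side by side: Tikhonov
regularization amounts to penalizing the norm of the solution vector,
which can be rewritten as an ordinary NNLS problem on an augmented
matrix. Here is a minimal sketch in the standard textbook form (this is
not our parsimonious method, which is described in the GECCO paper):

    import numpy as np
    from scipy.optimize import nnls

    def tikhonov_nnls(A, b, lam):
        # Minimize ||A x - b||^2 + lam * ||x||^2 subject to x >= 0,
        # by stacking sqrt(lam) * I under A and zeros under b.
        n_solutes = A.shape[1]
        A_aug = np.vstack([A, np.sqrt(lam) * np.eye(n_solutes)])
        b_aug = np.concatenate([b, np.zeros(n_solutes)])
        x, _ = nnls(A_aug, b_aug)
        return x

The larger lam is, the more the amplitude is spread over neighboring grid
points; that is the broadening I am referring to. The parsimonious
approach instead searches for the smallest set of discrete solutes that
still fits the data.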

In order to obtain confidence intervals and statistics, we feel that
Monte Carlo is a much better approach for exploring the range of each
parameter. True, it does take more computational time and is not
practical on a laptop, but with parallel computing so
readily available, there is no reason not to use it. With Monte Carlo the
effect of experimental noise is determined without sacrificing parsimony
(see: Demeler, B. and E. Brookes. Monte Carlo analysis of sedimentation
experiments. Colloid Polym Sci (2008) 286(2) 129-137). You may also want
to look at our strategy for optimizing a sedimentation velocity analysis
by applying all of these methods in the proper sequence: Demeler B,
Brookes E, Nagel-Steger L. Analysis of heterogeneity in molecular weight
and shape by analytical ultracentrifugation using parallel distributed
computing. Methods Enzymol. 2009;454:87-113.
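
For those who want the gist of the Monte Carlo procedure without reading
the paper, here is a bare-bones sketch. The fit_model() and best_fit()
helpers stand in for the actual 2DSA fit and are hypothetical
placeholders, and the real noise model is more careful than the simple
Gaussian used here:

    import numpy as np

    def monte_carlo(data, fit_model, best_fit, n_iterations=100, seed=0):
        rng = np.random.default_rng(seed)
        model = best_fit(data)         # best-fit simulation of the data
        sigma = (data - model).std()   # noise level from the residuals
        parameter_sets = []
        for _ in range(n_iterations):
            # Generate a synthetic data set with new noise and refit it.
            synthetic = model + rng.normal(0.0, sigma, size=data.shape)
            parameter_sets.append(fit_model(synthetic))
        # 95% confidence intervals from the distribution of each parameter.
        return np.percentile(np.asarray(parameter_sets), [2.5, 97.5], axis=0)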

Your last point addresses the use of different scaling laws and segmented
f/f0 values. I don't see anything wrong with that; it is certainly
another valid approach to studying SV experiments. For example, I am sure
that you would be able to successfully fit the DNA/lysozyme example
presented in our paper with the bimodal f/f0 c(s) constraint. But how
would you know what to use when fitting the 5-component system? The SV
data do not show separated boundaries, except for the fastest components
at the highest speed. Therefore, I prefer a method that is general
enough not to require the user to make this decision beforehand,
and to obtain an answer that is unbiased by the choices the
user makes. The 2DSA provides this in a more efficient way than any
other method I have seen. I wonder how much of an expert you would
need to be to correctly pick the proper constraints, scaling laws and
f/f0 segmentations, and how you would test that you picked the right
ones. This challenge is avoided by using 2DSA.

HPC resources are nowadays commonplace and affordable, so why not use
them?  We make them freely available through the UltraScan LIMS (courtesy
of NSF Teragrid), and all analyses can be done from a user-friendly
web interface.  We invite anyone who would like to use this method for
SV analysis to try it out - all software is free, and compute cycles on
the HPC backend are also free. You can find everything on the Teragrid
Science Gateway: http://www.teragrid.org/gateways/gateway_list.php

Sorry for the lengthy response, enough said, now back to Stimulus grants...

-Borries


