[RASMB] averages from SV
Peter Schuck
pschuck at helix.nih.gov
Sat Jan 28 12:06:07 PST 2006
Hi All,
It appears to me from Jack and Walter's comments that my earlier point
was perhaps misunderstood, and I would like to clarify. In my view,
these issues become more transparent when keeping the following
distinction in mind:
(A) the ensemble of real molecules we believe are moving around in our
test tube, and the distribution and averages over it
(B) the distribution of values we get from applying a mathematical
algorithm to the experimental data acquired in some measurement, and the
numbers derived from applying the "formal" averages to this calculated
distribution.
If we knew (A), there would be no point in asking any further questions. No
measurement is perfect, however, and therefore we have to restrict the
interpretation of (B) to features which represent faithfully the
corresponding features in (A). For sedimentation coefficient
distributions (depending on which one is used), these can be peak
locations, the weight average, or even peak areas of trace components, but it
is certainly *not* the detailed peak shapes. Unfortunately, the number,
and the z and higher averages are highly sensitive to just that - the
detailed peak shapes.
[This is why I said that these averages depend on a variety of
factors, including the signal-to-noise ratio (others are rotor speed,
diffusion coefficients, and the number of scans considered in the
analysis). The signal-to-noise ratio is a huge factor for the sharpness
of peaks when using regularization, for example. This implies that for
an experiment with a given signal-to-noise ratio, the determined values
may be reproducible, but their numeric values will differ from those of
another series of experiments with a different signal-to-noise ratio. I believe
this is what Jack may have misread to mean "Peter Schuck also suggests
they are too noisy". It seems that the only one who actually said they
are "too noisy" may have been Walter referring to z+1 averages in his
last comment...]
The distinction between (A) and (B) becomes very clear, for example, if
one considers a situation where there is only one single species. In
this case, applying the formulas for the averages to (A) results in sn =
sw = sz = sz+1, since there is only one species to sum over. But how
does a single species appear in a measured distribution? Unfortunately,
not as a delta-function; the best we can hope for are probably Gaussians (or
something similar). It is well known, for example, that dcdt ideally
produces nearly Gaussian-shaped peaks for a single species, which are
broadened mainly from diffusion. If we run the averages over a
Gaussian, or any other smooth function, we find sn < sw < sz < sz+1.
Obviously, these averages do not reflect the "true" averages in (A), and
this is why I argue they usually don't make much sense. (Again, there
is a special consideration about the sw, or the corresponding signal
average, in (B), which can be a faithful property of (A) because of the
way it can be related to the experiment.) This shows that the imperfect
representation of distribution *peak shapes* in (B) destroys the
relationship of the theoretical averages in (A) even for a single
species. This applies to any known sedimentation coefficient
distribution (unless it contains delta-functions, such as in the hybrid
discrete/continuous distribution of SEDPHAT).
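The fanning out of the formal averages over a smooth peak is easy to verify numerically. Here is a minimal sketch in Python (the peak position of 5 S and standard deviation of 0.5 S are arbitrary illustration values, not taken from any particular experiment):

```python
import numpy as np

# Formal s-averages of a smooth, Gaussian-shaped peak in an apparent
# sedimentation coefficient distribution g(s).  Peak position (5 S) and
# standard deviation (0.5 S) are arbitrary values for illustration only.
s = np.linspace(2.0, 8.0, 20001)            # s-grid (Svedberg units)
g = np.exp(-0.5 * ((s - 5.0) / 0.5) ** 2)   # Gaussian peak

def moment(k):
    """k-th moment of g(s), i.e. integral of g(s) * s^k ds, on the uniform grid."""
    return np.sum(g * s**k) * (s[1] - s[0])

sn  = moment(0) / moment(-1)   # number average
sw  = moment(1) / moment(0)    # weight average
sz  = moment(2) / moment(1)    # z average
sz1 = moment(3) / moment(2)    # z+1 average
print(f"sn = {sn:.4f}, sw = {sw:.4f}, sz = {sz:.4f}, sz+1 = {sz1:.4f}")
# For a true single species (a delta-function) all four would coincide;
# for the broadened peak they fan out: sn < sw < sz < sz+1.
```

The wider the peak relative to its position, the larger the spread between the averages, even though only a single species is present.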
What exactly we get for these averages for a single species depends on
the experiment. This is why this discussion is different for
sedimentation equilibrium and sedimentation velocity, since the
measurement process and the data analysis contain different problems.
One key factor would be - how broad is the peak for a single species?
For sedimentation velocity, the effect can be quite large.
Walter's references are papers about sedimentation equilibrium, not
sedimentation velocity. Without the distinction between the averages in
(A) and (B), it might seem that this doesn't matter. The method he describes
silently assumes that we can take algebraic relationships between
averages in (A) and apply them in the same way to the averages in (B).
It is clear that this can't be rigorous for two species, since even for
a single species this correspondence is wrong when applied to
sedimentation velocity (see above).
Fortunately, it is very simple for anybody to test this assumption, for
example, using noise-free simulated data (such tools may not have been
easily available 40 years ago, when these averages attracted some
interest). I picked two protein species at equal concentrations, with
s1 = 4 S and s2 = 5.5 S (this would be the hypothesized system of (A)), and
calculated Lamm equation solutions for sedimentation at 50,000 rpm. I
took a selection of scans from the middle of the cell, and the g(s*)
numbers are sn = 4.275, sw = 4.662, sz = 5.005, and sz+1 = 5.310. If we
take these averages and insert them in Walter's equations and solve for
s1 and s2 (I'm attaching a short MATLAB script to solve the simultaneous
set, which otherwise is probably cumbersome), we get s1(calculated) =
3.105 and s2(calculated) = 5.689. It seems the imperfections in the
correspondence between the averages in (A) and (B) have actually been
amplified to an error of 0.9 S!
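For comparison, the formal averages applied directly to the two hypothesized species of (A) can be computed trivially (a quick sanity check in Python, independent of the g(s*) analysis and of Walter's equations):

```python
# Formal s-averages applied directly to the hypothesized system (A):
# two species at equal concentration, with s1 = 4 S and s2 = 5.5 S.
c = [1.0, 1.0]   # (relative) concentrations
s = [4.0, 5.5]   # sedimentation coefficients (S)

def moment(k):
    """k-th moment, sum_i c_i * s_i^k, of the discrete distribution."""
    return sum(ci * si**k for ci, si in zip(c, s))

sn  = moment(0) / moment(-1)   # number average
sw  = moment(1) / moment(0)    # weight average
sz  = moment(2) / moment(1)    # z average
sz1 = moment(3) / moment(2)    # z+1 average
print(f"sn = {sn:.4f}, sw = {sw:.4f}, sz = {sz:.4f}, sz+1 = {sz1:.4f}")
# -> sn = 4.6316, sw = 4.7500, sz = 4.8684, sz+1 = 4.9811
```

Note that all four true averages of (A) lie within about 0.35 S of each other, between the two species' s-values, whereas the g(s*)-derived averages (4.275 to 5.310) are spread over more than 1 S by the diffusional peak broadening.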
This is the only example I tried, and I suspect there are probably cases
where it is worse, and others where it works better. The latter probably
would include systems with very little diffusion or where the separation
of s-values is much larger than the diffusional spread. (For example,
if we look at the same system with c(s) where we get baseline-separated
peaks, and we then take the averages, the above calculation gives almost
correct s1 = 3.99 and s2 = 5.51). But then again, if we can have
baseline-separated peaks, why would we bother to calculate the s-values
of these species via averages, instead of looking at the separate
peaks? Also, in general, if we don't have baseline-separated peaks,
this problem can be solved much more easily and rigorously with Lamm
equation fitting, either with a two-discrete-species model or in combination
with continuous segments.
This strong dependence of the calculated higher averages on the
measurement and on the 'response function' of the applied distribution is
also why I doubt that trying to come up with some model, as Jack alluded
to, will actually be promising. In contrast, the reason why I think
the Gilbert-Jenkins theory is so useful for the quantitative study of
interacting systems is that it allows us to extract *faithful*
parameters from the c(s) sedimentation coefficient distributions: peak
areas and weight-average (or signal-average) s-values taken over the
reaction boundary and the undisturbed boundary. In this way, we do not
need to interpret the detailed peak shapes, but nevertheless can exploit
the bimodal boundary structure of a two-component heterogeneous
interacting system. We have compared that in detail with the full Lamm
equation modeling of reacting systems (as implemented in SEDPHAT,
similarly as in BPCFIT and SEDANAL), and found that it can be more
robust than Lamm equation modeling, even though not quite as detailed.
However, for many systems, this may be an advantage, given the sometimes
limited purity and stability of biologically interesting protein samples...
Sorry for the long email again,
Peter
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: testaverages.m
URL: <http://list.rasmb.org/pipermail/rasmb-rasmb.org/attachments/20060128/f82c6c24/attachment.ksh>