[RASMB] averages from SV

Peter Schuck pschuck at helix.nih.gov
Sat Jan 28 12:06:07 PST 2006


Hi All,
It appears to me from Jack and Walter's comments that my earlier point 
was perhaps misunderstood, and I would like to clarify.  In my view, 
these issues become more transparent when keeping the following 
distinction in mind:

(A) the ensemble of real molecules we believe are moving around in our 
test tube, and the distribution and averages over it

(B) the distribution of values we get from applying a mathematical 
algorithm to the experimental data acquired in some measurement, and the 
numbers derived from applying the "formal" averages to this calculated 
distribution.

If we knew (A), there would be no point in asking any further 
questions.  No measurement is perfect, and therefore we have to 
restrict the interpretation of (B) to features that faithfully 
represent the corresponding features in (A).  For sedimentation 
coefficient distributions (depending on which one is used), these can 
be peak locations, the weight average, even peak areas of trace 
components, but certainly *not* the detailed peak shapes.  
Unfortunately, the number average and the z and higher averages are 
highly sensitive to just that - the detailed peak shapes. 
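
Just to fix notation: by these averages I mean the usual moment ratios 
over a sedimentation coefficient distribution g(s), in analogy with 
the molar mass averages but with s in place of M (with the 
corresponding sums over species for discrete mixtures):

    sn   = \int g(s) ds      / \int [g(s)/s] ds
    sw   = \int s g(s) ds    / \int g(s) ds
    sz   = \int s^2 g(s) ds  / \int s g(s) ds
    sz+1 = \int s^3 g(s) ds  / \int s^2 g(s) ds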

[This is why I said that these averages depend on a variety of 
factors, including the signal-to-noise ratio (others are rotor speed, 
diffusion coefficients, and the number of scans considered in the 
analysis).  The signal-to-noise ratio is a huge factor for the 
sharpness of peaks when using regularization, for example.  This 
implies that for an experiment with a given signal-to-noise ratio, the 
determined values may be reproducible, but their numerical values will 
differ from those of another series of experiments with a different 
signal-to-noise ratio.  I believe this is what Jack may have misread 
to mean "Peter Schuck also suggests they are too noisy".  It seems 
that the only one who actually said they are "too noisy" may have been 
Walter, referring to z+1 averages in his last comment...]

The distinction between (A) and (B) becomes very clear, for example, if 
one considers a situation where there is only a single species.  In 
this case, applying the formulas for the averages to (A) results in sn = 
sw = sz = sz+1, since there is only one species to sum up.  How do we 
measure a single species?  Unfortunately, it does not appear as a 
delta-function; the best we can hope for is probably a Gaussian (or 
something similar).  It is well known, for example, that dcdt ideally 
produces close to Gaussian-shaped peaks for single species, which are 
broadened mainly by diffusion.  If we run the averages over a 
Gaussian, or any other smooth function, we find sn < sw < sz < sz+1.  
Obviously, these averages do not reflect the "true" averages in (A), and 
this is why I argue they usually don't make much sense.  (Again, there 
is a special consideration about sw, or the corresponding signal 
average, in (B), which can be a faithful property of (A) because of the 
way it can be related to the experiment.)  This shows that the imperfect 
representation of the distribution *peak shapes* in (B) destroys the 
relationship to the theoretical averages in (A), even for a single 
species.  This applies to any known sedimentation coefficient 
distribution (unless it contains delta-functions, such as in the hybrid 
discrete/continuous distribution of SEDPHAT).
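
To make this easy to check numerically, here is a minimal sketch (in 
Python/numpy rather than MATLAB; the peak position and width are just 
assumed round numbers, not taken from any particular data set): 
running the moment-ratio averages over a Gaussian "peak" immediately 
gives sn < sw < sz < sz+1, even though only a single species is 
represented.

    import numpy as np

    # Hypothetical single-species "peak": a Gaussian g(s), standing in
    # for a diffusion-broadened peak such as dcdt would produce.
    s = np.linspace(1.0, 8.0, 2000)             # s grid (Svedberg)
    s0, sigma = 4.0, 0.5                        # assumed position and width
    g = np.exp(-0.5 * ((s - s0) / sigma) ** 2)

    def m(k):
        # k-th moment of g(s); on a uniform grid the ds factor cancels
        # in the moment ratios, so plain sums are sufficient
        return np.sum(s ** k * g)

    sn  = m(0) / m(-1)      # number average
    sw  = m(1) / m(0)       # weight average
    sz  = m(2) / m(1)       # z average
    sz1 = m(3) / m(2)       # z+1 average

    print(sn, sw, sz, sz1)  # roughly 3.94 < 4.00 < 4.06 < 4.12

The spread between the averages grows with the width of the peak, 
i.e., it reflects the measurement rather than the sample.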

What exactly we get for these averages for a single species depends on 
the experiment.  This is why this discussion is different for 
sedimentation equilibrium and sedimentation velocity, since the 
measurement process and the data analysis pose different problems.  
One key factor is: how broad is the peak for a single species? 
For sedimentation velocity, the effect can be quite large. 

Walter's references are papers about sedimentation equilibrium, not 
sedimentation velocity.  Without the distinction between averages in 
(A) and (B), it would seem that this doesn't matter.  The method he 
describes silently assumes that we can take algebraic relationships 
between averages in (A) and apply them in the same way to the averages 
in (B).  It is clear that this cannot be rigorous for two species, 
since even for a single species this correspondence is wrong when 
applied to sedimentation velocity (see above). 

Fortunately, it is very simple for anybody to test this assumption, for 
example, using noise-free simulated data (such tools may not have been 
easily available 40 years ago, when these averages attracted some 
interest).  I picked two protein species at equal concentrations, with 
s1 = 4 S and s2 = 5.5 S (this would be the hypothesized system of (A)), 
and calculated Lamm equation solutions for sedimentation at 50,000 rpm.  
I took a selection of scans from the middle of the cell, and the g(s*) 
numbers are sn = 4.275, sw = 4.662, sz = 5.005, and sz+1 = 5.310.  If we 
take these averages, insert them into Walter's equations, and solve for 
s1 and s2 (I'm attaching a short MATLAB script to solve the simultaneous 
set, which otherwise is probably cumbersome), we get s1(calculated) = 
3.105 and s2(calculated) = 5.689.  It seems the imperfections in the 
correspondence between the averages in (A) and (B) have actually been 
amplified to an error of 0.9 S! 
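
For those who do not want to run MATLAB, the same calculation can be 
sketched in a few lines of Python/numpy.  This is not the attached 
script; it assumes that the simultaneous set relates sn, sw, and sz of 
a two-species mixture to s1, s2, and the weight fraction of the slower 
species (three equations, three unknowns), which reproduces the 
numbers quoted above.

    import numpy as np

    # g(s*) averages obtained from the simulated two-species data (see text)
    sn, sw, sz = 4.275, 4.662, 5.005

    # For two species with weight fractions f and 1-f, the moments
    # m_k = f*s1**k + (1-f)*s2**k satisfy m_{k+1} = p*m_k - q*m_{k-1},
    # with p = s1 + s2 and q = s1*s2.  Using m_{-1} = 1/sn, m_0 = 1,
    # m_1 = sw, and m_2 = sz*sw, this gives a 2x2 linear system for (p, q).
    m_minus1, m0, m1, m2 = 1.0 / sn, 1.0, sw, sz * sw
    A = np.array([[m0, -m_minus1],
                  [m1, -m0]])
    b = np.array([m1, m2])
    p, q = np.linalg.solve(A, b)

    # s1 and s2 are the roots of x**2 - p*x + q = 0
    s1, s2 = np.sort(np.roots([1.0, -p, q]))
    f = (s2 - sw) / (s2 - s1)   # weight fraction of the slower species

    print(s1, s2, f)            # approximately 3.105, 5.689, 0.40

(The recovered weight fraction comes out near 0.4 instead of the true 
0.5, which illustrates the same distortion.)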

This is the only example I tried, and I suspect there are probably 
cases where it is worse, and others where it works better.  The latter 
would probably include systems with very little diffusion, or where the 
separation of s-values is much larger than the diffusional spread.  
(For example, if we look at the same system with c(s), where we get 
baseline-separated peaks, and then take the averages, the above 
calculation gives the almost correct values s1 = 3.99 and s2 = 5.51.)  
But then again, if we can have baseline-separated peaks, why would we 
bother to calculate the s-values of these species via averages, instead 
of looking at the separate peaks?  Also, in general, if we don't have 
baseline-separated peaks, this problem can be solved much more easily 
and more rigorously with Lamm equation fitting, either with a model of 
two discrete species or in combination with continuous segments.

This strong dependence of the calculated higher averages on the 
measurement and on the 'response function' of the applied distribution 
is also why I doubt that trying to come up with some model, as Jack 
alluded to, will actually be promising.  In contrast, the reason why I 
think the Gilbert-Jenkins theory is so useful for the quantitative 
study of interacting systems is that it allows us to extract *faithful* 
parameters from the c(s) sedimentation coefficient distributions:  peak 
areas and weight-average (or signal-average) s-values taken over the 
reaction boundary and the undisturbed boundary.  In this way, we do not 
need to interpret the detailed peak shapes, but can nevertheless exploit 
the bimodal boundary structure of a two-component heterogeneous 
interacting system.  We have compared this in detail with the full Lamm 
equation modeling of reacting systems (as implemented in SEDPHAT, 
similarly as in BPCFIT and SEDANAL), and found that it can be more 
robust than Lamm equation modeling, even though not quite as detailed.  
However, for many systems, this may be an advantage, given the sometimes 
limited purity and stability of biologically interesting protein samples...

Sorry for the long email again,
Peter



-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: testaverages.m
URL: <http://list.rasmb.org/pipermail/rasmb-rasmb.org/attachments/20060128/f82c6c24/attachment.asc>

