CCP4i: Graphical User Interface
MIR Tutorial Bath - Heavy atom refinement

6. Refinement of heavy-atom parameters

6.1. Background

The heavy-atom refinement programs in the CCP4 package (MLPHARE, which can do phasing as well as refinement, and VECREF, which does refinement only) have been in use now for about 5 years. These programs differ in several fundamental ways from their predecessors, and to understand what the new programs do differently, it is useful to review some recent history.

The method commonly used before 1991 to refine heavy atoms in protein derivatives was originally conceived in 1961 and remained basically unchanged for 30 years. In the 1970s, however, it was observed that the method had poor convergence properties, particularly when several derivatives had some or all of their major sites in common, as is often the case. The basic idea was to calculate the most probable value of the native phase for each reflection in turn, based on one derivative or a subset of the derivatives, and then to use this fixed estimate of the phase to obtain the calculated value of |FPH|. The difference between this and the measured |FPH| is the lack-of-closure error, and the sum of the squared errors could be minimised in a conventional least-squares refinement procedure. In theory, initial estimates of the heavy-atom parameters could be adjusted in this way to produce, at convergence, a set of parameters that best fitted the measurements.
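The lack-of-closure error described above can be illustrated with a minimal Python sketch. The function names and the numerical values are hypothetical and chosen only for illustration; this is not the PHARE code. The native phase is treated as a fixed estimate, and the heavy-atom structure factor FH is supplied as a complex number.

```python
import cmath
import math

def lack_of_closure(fp_amp, phase_deg, fh, fph_obs):
    """Observed minus calculated |FPH| for a fixed estimate of the native phase."""
    fp = cmath.rect(fp_amp, math.radians(phase_deg))  # native structure factor
    fph_calc = abs(fp + fh)                           # calculated |FPH| = |FP + FH|
    return fph_obs - fph_calc

def sum_squared_closure(reflections):
    """Least-squares target: sum of squared lack-of-closure errors."""
    return sum(lack_of_closure(*r) ** 2 for r in reflections)

# Hypothetical reflections: (|FP|, native phase in degrees, complex FH, observed |FPH|)
reflections = [
    (100.0, 0.0, complex(10.0, 0.0), 112.0),
    (50.0, 90.0, complex(0.0, -5.0), 44.0),
]
residual = sum_squared_closure(reflections)
print(round(residual, 6))
```

A least-squares refinement of this kind would adjust the heavy-atom parameters that determine FH so as to reduce `residual`, while the phase estimates stay fixed.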

The method, which was implemented by the program PHARE (phase and refine), worked reasonably well provided that the set of sites used to calculate phases was not the same as the set whose heavy-atom parameters were being refined; in particular it could not cope with the case where only one derivative is available (single isomorphous replacement or SIR).

In order to get round these problems, an alternative method ("FHLE") that required the measurement of anomalous differences was devised, where the heavy-atom amplitudes were estimated directly without the necessity of calculating the protein phases, and consequently each derivative was refined completely independently of the others. This method, which was implemented by the program REFINE (a separate program PHASE was needed to do the phasing), often worked quite well in practice, but had the disadvantage that reasonably accurate anomalous differences had to be used, otherwise the resulting heavy-atom parameters suffered from large errors. Being much smaller than the isomorphous differences, the anomalous differences are in fact notoriously difficult to measure accurately.

Because the isomorphous difference is a good approximation to |FH| for centric reflections, these can be used in the initial stages of refinement; however, this is not a general solution because several space groups have either no centric zone (e.g. R3) or only one (e.g. P2₁).
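The reason the centric reflections are special can be shown with a small numerical sketch. For a centric reflection the phases of FP and FH are restricted to two values 180° apart, so the two vectors are collinear and the isomorphous difference recovers |FH| (provided |FH| is smaller than |FP|). For a general acentric reflection it underestimates |FH|. The function name and the amplitudes below are hypothetical.

```python
import cmath
import math

def isomorphous_difference(fp_amp, fp_phase_deg, fh_amp, fh_phase_deg):
    """Return ||FPH| - |FP|| where FPH = FP + FH."""
    fp = cmath.rect(fp_amp, math.radians(fp_phase_deg))
    fh = cmath.rect(fh_amp, math.radians(fh_phase_deg))
    return abs(abs(fp + fh) - fp_amp)

# Centric case: FH is parallel or antiparallel to FP
centric_parallel = isomorphous_difference(100.0, 0.0, 20.0, 0.0)
centric_antiparallel = isomorphous_difference(100.0, 0.0, 20.0, 180.0)

# Acentric case: FH at 90 degrees to FP
acentric = isomorphous_difference(100.0, 0.0, 20.0, 90.0)

print(centric_parallel, centric_antiparallel, acentric)
```

In both centric cases the difference equals the |FH| of 20 that was put in, whereas in the acentric case it is much smaller.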

6.2. The maximum likelihood method of refinement

The principle of maximum likelihood is to construct a joint conditional probability density function whose value measures the likelihood that the set of measurements actually obtained would have been obtained, given any specified set of values for the unknown parameters. The optimal set of parameters is the one that maximises the likelihood of having made the actual set of measurements. Usually the errors in the individual measurements are all independent of each other, so the likelihood is just a product of individual probability functions. The algebraic form of each function is based on informed guesswork about the probability of making a measurement, given its true value and its known error estimate.

$$L = \prod_{hkl} P\big(|F_P|, |F_{PH_j}| \mid (x_i, y_i, z_i, B_i, O_i)_j,\ \sigma^2(|\Delta_j|)\big)$$

The likelihood $L$ is the conditional probability of having made the set of observations $|F_P|$, $|F_{PH_j}|$, given values of the heavy-atom parameters $(x_i, y_i, z_i, B_i, O_i)_j$ and $\sigma^2(|\Delta_j|)$.

$$\log L = \sum_{hkl} \log P(\ldots \mid \ldots)$$

The most likely set of parameters will be the one that maximises the log(likelihood).

The main drawback of the old methods of refinement was that the protein phase was either fixed or just ignored during refinement, leading either to bias in the parameters or to loss of information. The important breakthrough with the new method is that all possible values of the phase are considered during refinement, each value being weighted according to its probability of being correct.
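The idea of weighting every possible phase by its probability can be sketched as follows. This is a minimal illustration under stated assumptions, not the algebra used in MLPHARE: the lack-of-closure error is assumed to be Gaussian with a known standard deviation `sigma`, all native phases are assumed equally probable a priori, and the function names and data are hypothetical.

```python
import cmath
import math

def reflection_likelihood(fp_amp, fph_obs, fh, sigma, n_steps=360):
    """Average a Gaussian lack-of-closure probability over a uniform grid
    of native phases (assumed error model, uniform phase prior)."""
    total = 0.0
    for k in range(n_steps):
        phase = 2.0 * math.pi * k / n_steps
        fph_calc = abs(cmath.rect(fp_amp, phase) + fh)
        eps = fph_obs - fph_calc                      # lack of closure at this phase
        total += math.exp(-0.5 * (eps / sigma) ** 2)
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return norm * total / n_steps

def log_likelihood(observations, fh_values, sigma):
    """Sum of log-likelihoods over independent reflections."""
    return sum(
        math.log(reflection_likelihood(fp, fph, fh, sigma))
        for (fp, fph), fh in zip(observations, fh_values)
    )

# Hypothetical observations (|FP|, |FPH|) and two competing heavy-atom models
observations = [(100.0, 120.0), (80.0, 65.0), (60.0, 72.0)]
fh_model = [complex(20.0, 0.0), complex(15.0, 0.0), complex(0.0, 12.0)]
fh_null = [0j, 0j, 0j]  # no heavy-atom contribution at all

ll_model = log_likelihood(observations, fh_model, sigma=2.0)
ll_null = log_likelihood(observations, fh_null, sigma=2.0)
print(ll_model > ll_null)
```

No single phase is ever selected: each reflection contributes through the whole range of phases, and a refinement would adjust the parameters that determine FH to increase the summed log-likelihood.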

The MLPHARE program, a direct descendant of PHARE, implements the likelihood maximisation procedure, and adjusts the overall and individual heavy-atom parameters of a set of derivatives simultaneously from initial estimates to optimum values.

6.3. The vector-space method of refinement

The principle of vector-space refinement is very simple: the Patterson is calculated from the initial heavy-atom parameters, it is compared with the observed isomorphous difference Patterson, and the parameters are adjusted to minimise the sum of weighted squared differences between the calculated and observed Pattersons.

It can be shown that the isomorphous difference Patterson has the same peaks as the heavy-atom Patterson, but at half height, and with additional uncorrelated noise peaks. So, to reduce the effect of this noise, not all the grid points in the Patterson are used in the refinement, only those that fall within the peaks of the calculated Patterson.
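The half-height result can be seen from a short first-order argument, given here as a sketch that assumes the heavy-atom contribution is small compared with the protein amplitude. The symbols $\varphi_P$ and $\varphi_H$ (the phases of $F_P$ and $F_H$) are introduced here for the illustration.

```latex
% With F_{PH} = F_P + F_H and |F_H| \ll |F_P|, to first order:
|F_{PH}| - |F_P| \approx |F_H| \cos(\varphi_H - \varphi_P)

% Squaring gives the coefficients of the isomorphous difference Patterson:
\left(|F_{PH}| - |F_P|\right)^2
  \approx |F_H|^2 \cos^2(\varphi_H - \varphi_P)
  = \tfrac{1}{2}|F_H|^2 + \tfrac{1}{2}|F_H|^2 \cos 2(\varphi_H - \varphi_P)
```

The first term is the heavy-atom Patterson coefficient at half weight. In the second term the angle between the protein and heavy-atom phases is essentially uncorrelated with the heavy-atom structure, so this term contributes noise rather than systematic peaks.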

The weight is the reciprocal variance of the Patterson density, which depends only on the position in Patterson space (positions on or near symmetry elements in the point group have higher variance than the average). The VECREF program implements this real-space method in complete contrast with the reciprocal-space method used by MLPHARE, and in fact by all other heavy-atom refinement programs (as far as the author is aware).
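The vector-space target described above can be sketched with a one-dimensional toy example. This is not the VECREF implementation: the grid size, the Gaussian peak shape, the uniform weights, the mask threshold and all names are assumptions made for illustration, and symmetry is ignored.

```python
import math

N_GRID = 100  # grid points along a one-dimensional toy "unit cell"

def periodic_delta(a, b):
    """Shortest fractional separation between a and b in a periodic cell."""
    d = (a - b) % 1.0
    return min(d, 1.0 - d)

def calc_patterson(atoms, width=0.02):
    """One Gaussian peak per interatomic vector, with height equal to the
    product of the two occupancies; the origin peak is left out."""
    density = [0.0] * N_GRID
    for i, (xi, oi) in enumerate(atoms):
        for j, (xj, oj) in enumerate(atoms):
            if i == j:
                continue
            vec = (xi - xj) % 1.0
            for g in range(N_GRID):
                d = periodic_delta(g / N_GRID, vec)
                density[g] += oi * oj * math.exp(-0.5 * (d / width) ** 2)
    return density

def masked_residual(atoms, observed, weights, threshold=0.1):
    """Weighted squared difference, summed only over grid points that lie
    within the peaks of the calculated Patterson."""
    calc = calc_patterson(atoms)
    cutoff = threshold * max(calc)
    return sum(
        w * (o - c) ** 2
        for o, c, w in zip(observed, calc, weights)
        if c > cutoff
    )

# Synthetic "observed" map: peaks from the true sites plus a small ripple as noise
true_atoms = [(0.10, 1.0), (0.35, 0.8)]  # (fractional x, occupancy)
observed = [
    p + 0.02 * math.sin(2.0 * math.pi * 7 * g / N_GRID)
    for g, p in enumerate(calc_patterson(true_atoms))
]
weights = [1.0] * N_GRID  # uniform weights (assumption; see the text for the real weighting)

# Crude "refinement": scan the position of the second site and keep the best fit
candidates = [0.30 + 0.01 * k for k in range(11)]
best_x = min(
    candidates,
    key=lambda x: masked_residual([(0.10, 1.0), (x, 0.8)], observed, weights),
)
print(round(best_x, 2))
```

A real refinement would adjust all coordinates, occupancies and temperature factors by least squares rather than by scanning a single coordinate, but the target function has the same form.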

The isomorphous difference Patterson of course contains the complete set of information embodied in the native and derivative measured intensities, and with the exception that the overall scale factor is assumed to be correct, does not incorporate any additional assumptions. In particular, and in contrast with the difference Fourier, the Patterson does not rely on phase information. Because any Fourier tends to be dominated by the phases, and not the amplitudes, it is very easily biased by the sites used to calculate the phases.

It therefore seems logical to make the essential cross-check against the Patterson an integral part of the refinement. A bonus of this procedure is that wrong sites are much more easily discriminated by the Patterson than by the Fourier, and so instead of the cautious stepwise addition of new sites that is required when refining in reciprocal space, caution can be thrown to the winds, and many new trial sites can be added in one go.

It might be argued that several of the weak peaks in the difference Fourier may not be heavy-atom sites at all and may be due instead to imperfect isomorphism, but against this it should be remembered that the aim is to model all differences between the native and derivative structure. Provided there is no significant change in the unit cell dimensions it doesn't matter whether the differences are caused by heavy-atom substitution or by movement of atoms in the native structure, the effect is the same.

Although VECREF can be used for the refinement, it does not calculate phases, so MLPHARE is still used for this.