Analysis of Caries Experience Taking Inter-observer Bias and Variability into Account
1 Biostatistical Centre and 2 School of Dentistry, Catholic University Leuven, Kapucijnenvoer 35, B-3000 Leuven, Belgium
Correspondence: * corresponding author, emmanuel.lesaffre{at}med.kuleuven.ac.be
In larger oral health surveys, clinical measurements are often obtained using several examiners. This raises the issue of inter-observer variability in measurement. Often, the problem is dealt with by reporting kappa values obtained in a calibration exercise. In the present study, the limitations of this statistic are presented, and an alternative, based on a Bayesian approach, is proposed. When the alternative approach was applied to caries experience data obtained in an oral health screening survey in seven-year-old Flemish children (Signal Tandmobiel® study), it could be ruled out that the observed geographic East-West gradient was due to bias induced by variability in scoring of the different dental examiners involved. The proposed method offers an opportunity to refine existing analytical approaches and is relevant to any health outcome study.
Key Words: calibration; caries experience; kappa value; logistic regression
It is customary to report kappa values (Cohen, 1960), denoted as κ, whenever multiple examiners are involved in an epidemiological study. High values of κ indicate that the examiners' scoring is reliable. Since the introduction of Cohen's kappa, several paradoxes in its interpretation have been pointed out (Cicchetti and Feinstein, 1990; Feinstein and Cicchetti, 1990). Further, when a gold standard or benchmark scorer is available, the appropriate measures are the sensitivity and specificity of each examiner vis-à-vis the gold standard (or benchmark scorer). These issues are seldom considered in dental research. Furthermore, the measures of agreement do not indicate the impact of the bias and variability of the examiners' scoring on the estimates of the regression coefficients of an epidemiological regression model. From the literature on errors-in-variables (see, e.g., Carroll et al., 1995), it is known that when covariates are measured with error (with misclassification as a special case), the estimated regression coefficients are attenuated. When the model is non-linear, like a logistic regression model, attenuation also occurs when the response is measured with error. Furthermore, when the bias in scoring is related to some covariates in the model, the regression estimates will also be biased if this phenomenon is not accounted for. Our intention is to correct for examiners' misclassification in caries experience. More specifically, we show here how the scores of different dental examiners can be corrected in a logistic regression model with data from calibration exercises. For simplicity, we will explain the concepts for a logistic regression model with a binary outcome. However, the method has been applied to the data from the Signal Tandmobiel® study with the use of an ordinal logistic regression model, whereby the outcome is an ordinal score derived by categorization of the dmft-score into 4 classes (Mwalili et al., 2004).
The correction method necessitates the availability of at least a benchmark scorer. Even better is the use of a gold standard, e.g., a score obtained through histological examination, because then the corrected model will be free of attenuation.
Epidemiological Dataset
The Signal-Tandmobiel® project was a prospective (1996–2001) oral health screening project in Flanders, Belgium. The study design and research methods have been described in detail (Vanobbergen et al., 2000). For this project, 16 trained dentist-examiners conducted annual examinations of 4468 children (2315 boys and 2153 girls) from 179 primary schools, after parental consent was obtained. Data on oral hygiene and dietary habits were obtained through structured questionnaires, completed by the parents. The study protocol was reviewed and approved by the Ethical Committee of the Catholic University of Leuven. The presence of caries was scored visually (no radiographs were obtained) and recorded (at the cavitation level) according to the diagnostic criteria published by the BASCD (Pine et al., 1997). Here, only the first year's (cross-sectional) data were used, i.e., when the children were in their first year of primary school.
Calibration Data
Measures of Agreement
Cohen's kappa is defined as κ = (p0 - pe)/(1 - pe), where p0 is the observed proportion of agreement between the two scores and pe is the agreement obtained purely by chance. Using the misclassification matrix

a  b
c  d

with a + b + c + d = n, we obtain p0 = (a + d)/n and pe = [(a + b)(a + c) + (c + d)(b + d)]/n². In practice, the values of
In the first table, the examiner (row) clearly underscores caries experience compared with the gold standard (column), while in the fifth table the reverse is true. In the third table,
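The contrast between agreement and direction of error can be illustrated with a small sketch. It assumes a 2 × 2 table with cells a, b, c, d in which rows are the examiner, columns are the reference scorer, and the first row and column mean "caries experience present"; that cell orientation, the function names, and the counts are hypothetical and are not taken from the calibration data. Because p0 and pe are unchanged when b and c are swapped, kappa cannot distinguish an underscoring examiner from an overscoring one, whereas sensitivity and specificity can.

```python
def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 table with cells a, b (first row) and c, d (second row)."""
    n = a + b + c + d
    p0 = (a + d) / n                                      # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2   # chance agreement
    return (p0 - pe) / (1 - pe)


def sensitivity_specificity(a, b, c, d):
    """Assumed layout: rows = examiner, columns = reference, first = caries present."""
    sensitivity = a / (a + c)   # reference-positive children the examiner also scores positive
    specificity = d / (b + d)   # reference-negative children the examiner also scores negative
    return sensitivity, specificity


# Hypothetical counts: an underscoring examiner and the mirror-image overscoring one.
under = (40, 5, 20, 35)
over = (40, 20, 5, 35)

for label, table in (("underscoring", under), ("overscoring", over)):
    kappa = cohen_kappa(*table)
    sens, spec = sensitivity_specificity(*table)
    print(f"{label}: kappa={kappa:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

Both tables give the same kappa, while the sensitivity and specificity swap roles, which is why the direction of the scoring bias has to be read from the latter.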
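The Logistic Regression Model
We denote the true dichotomized dmft score as Y = 0,1 for use as a response in a logistic regression model. A logistic regression model relating Y to p regressors x1, x2, ..., xp (also called risk factors) is given by:

logit(π) = β0 + β1x1 + β2x2 + ... + βpxp     (1)

where π = P(Y = 1|X), with X = (x1, x2, ..., xp), represents the probability of having caries experience given specific values for the regressors, and logit(π) = log[π/(1 - π)].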
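As a minimal sketch of how a model of this form turns regressor values into a probability of caries experience, the snippet below evaluates the inverse logit of a linear predictor. The function names, the two binary risk factors, and the coefficient values are hypothetical and chosen only for illustration; they are not estimates from the study.

```python
import math


def inverse_logit(eta):
    """Map a linear predictor (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def caries_probability(x, intercept, slopes):
    """P(Y = 1 | X = x) under a logistic regression model."""
    eta = intercept + sum(b * xi for b, xi in zip(slopes, x))
    return inverse_logit(eta)


# Hypothetical coefficients for two binary risk factors (not study estimates).
p_low = caries_probability([0, 0], intercept=-1.0, slopes=[0.8, 0.5])
p_high = caries_probability([1, 1], intercept=-1.0, slopes=[0.8, 0.5])
print(round(p_low, 3), round(p_high, 3))
```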
Correction for Scoring Bias and Variability for a Logistic Regression Model
The probabilities (1 -
where P(Y = 1|X) is the logistic regression model (1).
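The effect of misclassified outcomes on a logistic regression can be sketched numerically. The snippet below uses the standard relation for an error-prone binary score, in which the probability of an observed positive is sensitivity × P(Y = 1|X) + (1 - specificity) × [1 - P(Y = 1|X)]. This is offered as an illustration under the assumption of a single examiner whose error rates do not depend on X; the parameter names and all numerical values are hypothetical and do not reproduce the notation or estimates of the original analysis.

```python
import math


def inverse_logit(eta):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def logit(p):
    """Map a probability to log-odds."""
    return math.log(p / (1.0 - p))


def observed_probability(p_true, sensitivity, specificity):
    """Probability that an error-prone examiner scores caries experience as present."""
    return sensitivity * p_true + (1.0 - specificity) * (1.0 - p_true)


# Hypothetical true model with one binary risk factor x.
beta0, beta1 = -1.0, 1.0
sens, spec = 0.8, 0.9   # hypothetical examiner error rates

p_obs = [observed_probability(inverse_logit(beta0 + beta1 * x), sens, spec)
         for x in (0, 1)]

# Log-odds ratio that a naive analysis of the observed scores would target.
observed_slope = logit(p_obs[1]) - logit(p_obs[0])
print(round(observed_slope, 3), "versus true slope", beta1)
```

The naive log-odds ratio is smaller than the true coefficient, which is the attenuation described in the text; with sensitivity and specificity both equal to 1 the two coincide.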
In general, the correction parameters

The Bayesian approach offers an alternative way to estimate the parameters. Prior dental knowledge can be combined with the observed data (epidemiological + validation data) to yield a posterior distribution of the parameters using a sampling procedure called the Markov chain Monte Carlo approach (Spiegelhalter et al., 1996). However, here only non-informative priors were chosen. Further, the software WinBUGS (Spiegelhalter et al., 1996) allows various epidemiological models and models for the validation data to be tested with a minimum of effort. Finally, the delta method (which can be cumbersome) is replaced by another simple sampling procedure (see also Mwalili et al., 2004).
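The idea of replacing the delta method by sampling can be sketched in a much simpler setting than the regression model fitted in WinBUGS. The snippet below is not MCMC: it assumes flat Beta(1, 1) priors, for which the posteriors of sensitivity, specificity, and the observed prevalence are Beta distributions that can be sampled directly, and it propagates each draw through the usual misclassification correction for a prevalence. All counts, names, and the choice of a prevalence rather than a regression coefficient are hypothetical simplifications.

```python
import random


def corrected_prevalence_draws(pos, n, tp, fn, tn, fp, draws=5000, seed=1):
    """Posterior draws of a misclassification-corrected prevalence.

    pos, n         : observed positives and sample size in the survey (hypothetical)
    tp, fn, tn, fp : validation counts against the reference scorer (hypothetical)
    """
    rng = random.Random(seed)
    out = []
    for _ in range(draws):
        # Beta posteriors under flat Beta(1, 1) priors.
        se = rng.betavariate(tp + 1, fn + 1)
        sp = rng.betavariate(tn + 1, fp + 1)
        p_obs = rng.betavariate(pos + 1, n - pos + 1)
        denom = se + sp - 1.0
        if denom <= 0.0:
            continue  # draw implies an examiner no better than chance; skipped in this sketch
        p_true = (p_obs + sp - 1.0) / denom
        out.append(min(1.0, max(0.0, p_true)))  # crude truncation to [0, 1]
    return sorted(out)


samples = corrected_prevalence_draws(pos=300, n=1000, tp=40, fn=20, tn=35, fp=5)
median = samples[len(samples) // 2]
lower = samples[int(0.025 * len(samples))]
upper = samples[int(0.975 * len(samples))]
print(f"median={median:.3f}, 95% interval=({lower:.3f}, {upper:.3f})")
```

The interval is read directly from the sorted draws, so no analytical variance approximation is needed; a small validation sample shows up as a wide interval for the corrected quantity.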
One can see a clear East-West gradient in the level of caries experience, with a higher level of caries in Eastern Flanders (Fig.
However, a roughly similar East-West gradient in the scoring behavior of the 16 dental examiners is seen in the validation dataset (Fig.
Using an extension of model (3) to the ordinal logistic regression model, we incorporated the under- and overscoring behaviors of the dental examiner vis-à-vis the benchmark examiner into the analysis of the cross-sectional data. We observed that the East-West gradient was again highly significant (in a Bayesian sense), practically to the degree of our first analysis, but was estimated with less precision (Table 1
Large-scale epidemiologic surveys necessarily involve multiple examiners, due to the large numbers of persons to be examined and some unavoidable organizational aspects, like geographical distances. This implies that the (dental) measurements like caries experience, plaque score, etc., could be scored differently by different examiners. This phenomenon is called measurement error, reflecting the idea that there is a true value for the measurement, taken by a gold standard, and the scores of the examiners might deviate from that true score. When the measurement is binary, one speaks of misclassification error. Measurement error on the regressor generally causes an attenuation of the true relationship between the risk factor and the response (disease). This also happens when the measurement error is on the outcome and when the regression model is non-linear. Furthermore, when the measurement error is confounded with other regressors, the estimated regression coefficients of the other regressors are also affected. Therefore, one needs to correct for measurement error, which is possible only when a validation dataset is available, e.g., by the performance of calibration exercises. In the dental and medical literature, kappa values are reported not only to indicate the agreement of the scoring of the different examiners, but also to highlight the overall quality of the study. We have shown that these kappa values were uninformative in our analysis. Moreover, we argue that, in most studies with multiple raters, reporting of kappa values is not sufficient. Geographical differences in, e.g., caries experience are often reported (Nadanovsky and Sheiham, 1994; Tickle et al., 2003). The analysis of determining factors for these differences is of utmost importance and facilitates the introduction of region-specific measures and/or interventions. 
In spite of the considerable efforts that are undertaken to calibrate examiners involved in such surveys, variability in scoring cannot be avoided. Since examiners often operate in well-defined geographical areas, the presence of possible bias can influence results considerably. The methodology presented here offers an opportunity to refine current analytical approaches, allowing more reliable conclusions to be drawn. We have opted for a Bayesian approach for two reasons. First, the Bayesian approach allows for the incorporation of oral health knowledge into the statistical analysis. Although we have not done so here, we believe that this is an important feature of the approach. Indeed, the validation datasets are most often quite small, implying that the correction terms are then (relatively) poorly estimated. In that case, any external useful oral health information can improve the stability of the estimated correction terms. Second, the Bayesian software provides a flexible way to fit quite complex statistical models and to switch from one model to another with a limited amount of extra work, usually implying much less analytical work, which can be quite cumbersome once one deviates from classic statistical approaches. Finally, despite the fact that, in our study, a gold standard was not available, but only a benchmark examiner, our analysis is not invalidated. Indeed, our regression coefficients estimate an ordinal logistic regression model as if all children were scored by the same individual, in this case the benchmark examiner. Of course, if the benchmark examiner also scores with error, then some attenuation will still be present in the analysis.
This investigation was supported by Research Grant OT/00/35, Catholic University Leuven; data collection was supported by Unilever, Belgium. The Signal-Tandmobiel® project was comprised of the following partners: D. Declerck (Dental School, Catholic University Leuven), L. Martens (Dental School, University Ghent), J. Vanobbergen (Oral Health Promotion and Prevention, Flemish Dental Association), P. Bottenberg (Dental School, University Brussels), E. Lesaffre (Biostatistical Centre, Catholic University Leuven), and K. Hoppenbrouwers (Youth Health Department, Catholic University Leuven; Flemish Association for Youth Health Care). Further, the first two authors are also partially funded by research grant P5/24 from the IAP research network of the Belgian State (Federal Office for Scientific, Technical and Cultural Affairs). Received for publication January 6, 2004. Revision received May 28, 2004. Accepted for publication August 26, 2004.
Journal of Dental Research, Vol. 83, No. 12, 951-955 (2004)