Journal of Dental Research
Clinical

Analysis of Caries Experience Taking Inter-observer Bias and Variability into Account

E. Lesaffre1,*, S.M. Mwalili1 and D. Declerck2

1 Biostatistical Centre and 2 School of Dentistry, Catholic University Leuven, Kapucijnenvoer 35, B-3000 Leuven, Belgium;

Correspondence: * corresponding author, emmanuel.lesaffre{at}med.kuleuven.ac.be


    ABSTRACT
In larger oral health surveys, clinical measurements are often obtained using several examiners. This raises the issue of inter-observer variability in measurement. Often, the problem is dealt with by reporting kappa values obtained in a calibration exercise. In the present study, the limitations of this statistic are presented, and an alternative, based on a Bayesian approach, is proposed. When the alternative approach was applied to caries experience data obtained in an oral health screening survey in seven-year-old Flemish children (Signal Tandmobiel® study), it could be ruled out that the observed geographic East-West gradient was due to bias induced by variability in scoring of the different dental examiners involved. The proposed method offers an opportunity to refine existing analytical approaches and is relevant to any health outcome study.

Key Words: calibration • caries experience • kappa value • logistic regression


    INTRODUCTION
It is customary to report kappa values (Cohen, 1960), denoted as κ, whenever multiple examiners are involved in an epidemiological study. High values of κ indicate that the examiners’ scoring is reliable. Since the introduction of Cohen’s kappa, several ‘paradoxes’ in its interpretation have been pointed out (Cicchetti and Feinstein, 1990; Feinstein and Cicchetti, 1990). Further, when a gold standard or benchmark scorer is available, then the appropriate measures are the sensitivity and specificity of each examiner vis-à-vis the gold standard (or benchmark scorer). These issues are seldom considered in dental research. Furthermore, the measures of agreement do not indicate the impact of the bias and variability of scoring of the examiners on the estimates of the regression coefficients of an epidemiological regression model. From the literature on errors-in-variables (see, e.g., Carroll et al., 1995), it is known that when covariates are measured with error (with misclassification as a special case), this will result in an attenuation of the estimated regression coefficients. When the model is non-linear, like a logistic regression model, attenuation also occurs when the response is measured with error. Furthermore, when the bias in scoring is related to some covariates in the model, then bias will also occur in the regression estimates if this phenomenon is not accounted for.

Our intention is to correct for examiners’ misclassification in caries experience. More specifically, we show here how the scores of different dental examiners can be corrected in a logistic regression model with data from calibration exercises. For simplicity, we will explain the concepts for a logistic regression model with a binary outcome. But the method has been applied to the data from the Signal Tandmobiel® study with the use of an ordinal logistic regression model, whereby the outcome is an ordinal score derived by categorization of the dmft-score into 4 classes (Mwalili et al., 2004). The correction method necessitates the availability of at least a benchmark scorer. Even better is the use of a gold standard, e.g., a score obtained through histological examination, because then the corrected model will be free of attenuation.


    MATERIALS & METHODS
Epidemiological Dataset
The Signal-Tandmobiel® project was a prospective (1996–2001) oral health screening project in Flanders, Belgium. The study design and research methods have been described in detail (Vanobbergen et al., 2000). For this project, 16 trained dentist-examiners conducted annual examinations of 4468 children (2315 boys and 2153 girls) from 179 primary schools, after parental consent was obtained. Data on oral hygiene and dietary habits were obtained through structured questionnaires, completed by the parents. The study protocol was reviewed and approved by the ethical Committee of the Catholic University of Leuven.

The presence of caries was scored visually (no radiographs were obtained) and recorded (at the cavitation level) according to the diagnostic criteria published by the BASCD (Pine et al., 1997). Here, only the first year’s (cross-sectional) data were used, i.e. when the children were in their first year of primary school.

Calibration Data
Besides oral-health-related data, data from calibration exercises were also available. We organized 3 calibration exercises to assess the scoring of caries experience of the 16 dental examiners. A minimum of 12 children was included in each exercise. The last author of this paper served as benchmark examiner. The calibration exercise yielded, for each child, a dmft score for each dental examiner and the benchmark scorer. To avoid statistical problems when modeling the calibration data (see Mwalili et al., 2004), we classified the dmft score into 4 categories: value 1 corresponding to dmft = 0, 2 to dmft = 1, 3 to 1 < dmft < 4, and 4 to 4 < dmft < 20. We realize that our approach is not standard, but the distribution of the dmft score is complex (over-dispersed Poisson). Taking the dmft score as a response would imply a complex modeling analysis, which is beyond the scope of this paper.

Measures of Agreement
Cohen’s kappa is a coefficient of agreement for binary outcomes (0, 1) and is equal to (Cohen, 1960):


$$
\kappa = \frac{p_0 - p_e}{1 - p_e}
$$

where p0 is the observed proportion of agreement between the two scores and pe is the agreement obtained purely by chance. Using the misclassification matrix



$$
\begin{array}{c|cc}
 & \text{gold standard} = 0 & \text{gold standard} = 1 \\ \hline
\text{examiner} = 0 & a & b \\
\text{examiner} = 1 & c & d
\end{array}
$$

and (a + b + c + d = n), we obtain p0 = (a + d)/n and pe = [(a + b)(a + c) + (c + d)(b + d)]/n². In practice, the values of κ range from 0 (agreement not better than by chance) to 1 (perfect agreement) (Ludbrook, 2002). For an ordinal outcome, as in our study, a weighted kappa is used, with weights penalizing the severe more than the mild disagreements (Shoukri, 2003). Kappa statistics, even when high (> 0.8), do not rule out that the results of an epidemiological analysis are biased when different examiners are involved, as will be shown in our analysis below. Further, kappa statistics do not distinguish between bias and variability. The misclassification matrices below illustrate this. They all correspond to κ = 0.6.



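$$
\begin{pmatrix} 37 & 13 \\ 3 & 27 \end{pmatrix}, \quad
\begin{pmatrix} 35 & 11 \\ 5 & 29 \end{pmatrix}, \quad
\begin{pmatrix} 32 & 8 \\ 8 & 32 \end{pmatrix}, \quad
\begin{pmatrix} 29 & 5 \\ 11 & 35 \end{pmatrix}, \quad
\begin{pmatrix} 27 & 3 \\ 13 & 37 \end{pmatrix}
$$

(rows: examiner scores 0 and 1; columns: gold standard scores 0 and 1; each column sums to 40)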

In the first table, the examiner (row) clearly underscores caries experience compared with the gold standard (column), while in the fifth table the reverse is true. In the third table, κ = 0.6 because of scoring variability. When a gold standard is available, it is preferable to calculate the sensitivity (sens) and specificity (spec) of each examiner vis-à-vis this gold standard. In the above notation, sens = d/(b + d) estimates the probability that the examiner rates caries when the gold standard also rated caries, while spec = a/(a + c) estimates the probability that the examiner did not rate caries when the gold standard also did not. The sensitivity and specificity can make the distinction between bias and variability. Indeed, the sensitivities of the above tables are 27/40, 29/40, 32/40, 35/40, and 37/40, respectively, and the specificities are 37/40, 35/40, 32/40, 29/40, and 27/40. The sensitivity and specificity of each examiner vs. the benchmark scorer will be used as correction terms for the logistic regression model.
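The arithmetic behind these five tables can be checked with a short script. This is a minimal sketch: the function names are hypothetical, and the cell counts are rebuilt from the reported sensitivities and specificities under the assumption (implied by the common denominator of 40) that the gold standard scored 40 children as caries-free and 40 as having caries.

```python
def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 table (rows: examiner 0/1, columns: gold standard 0/1)."""
    n = a + b + c + d
    p0 = (a + d) / n                                        # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2   # chance agreement
    return (p0 - pe) / (1 - pe)


def sens_spec(a, b, c, d):
    """Sensitivity d/(b+d) and specificity a/(a+c) of the examiner."""
    return d / (b + d), a / (a + c)


# Numerators of the reported sensitivities and specificities (denominator 40).
SENS_NUM = [27, 29, 32, 35, 37]
SPEC_NUM = [37, 35, 32, 29, 27]
PER_COLUMN = 40  # assumed number of children per gold-standard category

# Rebuild (a, b, c, d) for each table from the reported values.
tables = [
    (spec, PER_COLUMN - sens, PER_COLUMN - spec, sens)
    for sens, spec in zip(SENS_NUM, SPEC_NUM)
]
kappas = [cohen_kappa(*t) for t in tables]

for t, k in zip(tables, kappas):
    sens, spec = sens_spec(*t)
    print(t, f"kappa={k:.3f} sens={sens:.3f} spec={spec:.3f}")
```

Every table has observed agreement 64/80 and chance agreement 0.5, so kappa is the same throughout, while the sensitivity and specificity move in opposite directions from the first table to the fifth.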

The Logistic Regression Model
Our methodology is explained for a binary outcome (say no-caries [0] vs. caries [1]). The extension to an ordinal outcome does not introduce new ideas but would complicate matters unnecessarily, and we refer to Mwalili et al. (2004) for more technical details.

We denote the true dichotomized dmft score as Y = 0,1 for use as a response in a logistic regression model. A logistic regression model relating Y to p regressors x1, x2, ... xp (also called risk factors) is given by:


$$
\text{logit}(\pi) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \qquad (1)
$$

where π = P(Y = 1|X), with X = (x1, x2, ..., xp), represents the probability of having caries experience given specific values for the regressors and logit(π) = log(π/[1 - π]). The coefficients β0, β1, ..., βp are called regression coefficients and are estimated according to the method of maximum likelihood (see, e.g., Agresti, 2002). The coefficient β0 is called the intercept. The other coefficients measure the strength of the relationship between the regressors and the response. Here, the regressors are age and gender and the geographical location (x- and y-coordinates) of the school of the child. Since children of the same school share some common characteristics, ‘school’ was included in the model as a random effect (see Mwalili et al., 2004) measuring the between-school variation. In a subsequent model, the 16 examiners were included in the model as binary variables, i.e., equal to 1 when the jth examiner scored the child and 0 when another examiner scored the child. Model (1) is called the ordinary multiple logistic regression model. For an ordinal outcome with k classes, there are (k - 1) intercepts, and we call it an ordinal (multiple) logistic regression model.
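For the ordinal case, one common way to write a model with (k - 1) intercepts is the cumulative-logit (proportional-odds) form sketched below. This parameterization, including the sign convention and the symbols β0j, is an assumption made for illustration; the text does not spell it out and refers to Mwalili et al. (2004) for the details.

$$
\text{logit}\, P(Y \le j \mid X) = \beta_{0j} + \beta_1 x_1 + \cdots + \beta_p x_p, \qquad j = 1, \ldots, k - 1
$$

Under this form, the k = 4 dmft classes would give three ordered intercepts and a single shared set of slopes β1, ..., βp.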

Correction for Scoring Bias and Variability for a Logistic Regression Model
Assume that the benchmark scorer rates the true caries experience by Y while the dental examiner scores it as Y*. Let γ0 and γ1 denote the response misclassification probabilities

$$
\gamma_0 = P(Y^{*} = 1 \mid Y = 0), \qquad \gamma_1 = P(Y^{*} = 0 \mid Y = 1). \qquad (2)
$$

The probabilities (1 - γ0) and (1 - γ1) are the specificity and sensitivity, respectively, of the scoring behavior of the dental examiner. If the misclassification is independent of the covariates, then the true model of the observed Y* is (Neuhaus, 1999)

$$
P(Y^{*} = 1 \mid X) = \gamma_0 + (1 - \gamma_0 - \gamma_1)\, P(Y = 1 \mid X), \qquad (3)
$$

where P(Y = 1|X) is the logistic regression model (1).
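Model (3) follows from the law of total probability. A short derivation, writing Y* for the examiner's score (a notation assumed here), π = P(Y = 1|X), and using the assumption that misclassification does not depend on X:

$$
\begin{aligned}
P(Y^{*} = 1 \mid X) &= P(Y^{*} = 1 \mid Y = 0)\, P(Y = 0 \mid X) + P(Y^{*} = 1 \mid Y = 1)\, P(Y = 1 \mid X) \\
&= \gamma_0 (1 - \pi) + (1 - \gamma_1)\, \pi \\
&= \gamma_0 + (1 - \gamma_0 - \gamma_1)\, \pi .
\end{aligned}
$$

With a perfect examiner (γ0 = γ1 = 0) the expression reduces to π, i.e., to model (1).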

In general, the correction parameters γ0 and γ1 must be estimated from a validation dataset, here obtained from the calibration exercises. After one imputes the estimated correction parameters, the regression coefficients can be estimated using maximum likelihood based on model (3). For the calculation of the P-value (and 95% CI) of the regression parameters, the imprecision with which the correction parameters are estimated needs to be taken into account, e.g., with the delta method (Carroll et al., 1995).
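The plug-in step can be illustrated with a small, deterministic sketch. It is not the authors' implementation: it assumes a binary outcome, a single covariate x, correction parameters treated as known, and hypothetical values (intercept -0.5, slope 1.0, γ0 = 0.10, γ1 = 0.20). The "data" are the exact proportions implied by model (3), so the script shows the attenuation of the uncorrected fit and its removal by the corrected likelihood, without standard errors.

```python
import math


def expit(eta):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-eta))


def observed_prob(pi, gamma0, gamma1):
    """Model (3): probability that the examiner scores caries."""
    return gamma0 + (1.0 - gamma0 - gamma1) * pi


def fit_corrected_logistic(xs, ps, gamma0=0.0, gamma1=0.0, n_iter=100):
    """Fisher scoring for logit(pi) = b0 + b1 * x under model (3).

    xs: covariate values; ps: observed proportion scored as caries at each x.
    gamma0 = gamma1 = 0 gives the ordinary (uncorrected) logistic fit.
    """
    b0, b1 = 0.0, 0.0
    scale = 1.0 - gamma0 - gamma1
    for _ in range(n_iter):
        grad0 = grad1 = i00 = i01 = i11 = 0.0
        for x, p in zip(xs, ps):
            pi = expit(b0 + b1 * x)
            q = observed_prob(pi, gamma0, gamma1)
            dq = scale * pi * (1.0 - pi)     # derivative of q w.r.t. linear predictor
            var = q * (1.0 - q)
            score = (p - q) * dq / var       # contribution to the score vector
            weight = dq * dq / var           # contribution to the Fisher information
            grad0 += score
            grad1 += score * x
            i00 += weight
            i01 += weight * x
            i11 += weight * x * x
        det = i00 * i11 - i01 * i01
        step0 = (i11 * grad0 - i01 * grad1) / det   # solve the 2x2 system by hand
        step1 = (i00 * grad1 - i01 * grad0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < 1e-12:
            break
    return b0, b1


# Hypothetical true model and examiner error rates (illustration only).
TRUE_B0, TRUE_B1 = -0.5, 1.0
GAMMA0, GAMMA1 = 0.10, 0.20

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ps = [observed_prob(expit(TRUE_B0 + TRUE_B1 * x), GAMMA0, GAMMA1) for x in xs]

naive_b0, naive_b1 = fit_corrected_logistic(xs, ps)
corr_b0, corr_b1 = fit_corrected_logistic(xs, ps, GAMMA0, GAMMA1)

print(f"uncorrected slope: {naive_b1:.3f}")
print(f"corrected slope:   {corr_b1:.3f} (true value {TRUE_B1})")
```

The uncorrected slope comes out well below the true slope, which is the attenuation described in the Introduction; fitting model (3) with the error rates plugged in recovers the true coefficients in this idealized setting.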

The Bayesian approach offers an alternative way to estimate the parameters. Prior dental knowledge can be combined with the observed data (epidemiological + validation data) to yield a posterior distribution of the parameters using a sampling procedure called the Markov-Chain Monte Carlo approach (Spiegelhalter et al., 1996). However, here only non-informative priors were chosen. Further, the software WinBUGS (Spiegelhalter et al., 1996) allows various epidemiological models and models for the validation data to be tested with a minimum of effort. Finally, the delta method (which can be cumbersome) is replaced by another simple sampling procedure (see also Mwalili et al., 2004).
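To give a feel for how sampling can stand in for the delta method, the sketch below draws the sensitivity and specificity of one examiner from their posterior distributions given calibration counts. It is an illustration only, not the WinBUGS model used in the study: it assumes a binary score, independent uniform Beta(1, 1) priors as a stand-in for "non-informative", and it borrows the counts of the first illustrative misclassification table (a = 37, b = 13, c = 3, d = 27). The function names are hypothetical.

```python
import random


def posterior_draws(successes, failures, n_draws, rng, prior_a=1.0, prior_b=1.0):
    """Draws from the Beta(prior_a + successes, prior_b + failures) posterior."""
    return [
        rng.betavariate(prior_a + successes, prior_b + failures)
        for _ in range(n_draws)
    ]


def summarize(draws):
    """Posterior mean and equal-tailed 95% interval from a list of draws."""
    ordered = sorted(draws)
    n = len(ordered)
    mean = sum(ordered) / n
    return mean, ordered[int(0.025 * n)], ordered[int(0.975 * n)]


rng = random.Random(2004)  # fixed seed so the sketch is reproducible

# Calibration counts (rows: examiner 0/1, columns: benchmark 0/1), illustrative.
a, b, c, d = 37, 13, 3, 27

sens_draws = posterior_draws(d, b, 20000, rng)   # sensitivity = 1 - gamma1
spec_draws = posterior_draws(a, c, 20000, rng)   # specificity = 1 - gamma0
gamma1_draws = [1.0 - s for s in sens_draws]
gamma0_draws = [1.0 - s for s in spec_draws]

sens_mean, sens_lo, sens_hi = summarize(sens_draws)
spec_mean, spec_lo, spec_hi = summarize(spec_draws)
print(f"sensitivity: {sens_mean:.3f} ({sens_lo:.3f}, {sens_hi:.3f})")
print(f"specificity: {spec_mean:.3f} ({spec_lo:.3f}, {spec_hi:.3f})")
```

In a full analysis, each pair of draws (γ0, γ1) would feed into model (3) together with the regression coefficients, so the uncertainty about the correction terms carries through to the posterior of the coefficients without an analytical approximation.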


    RESULTS
One can see a clear East-West gradient in the level of caries experience, with a higher level of caries in Eastern Flanders (Fig.). This gradient was confirmed by a highly significant regression coefficient for the x-coordinate of the geographical location of the school based on a (Bayesian) ordinal logistic regression analysis (Table 1, without correction).


Figure. Map of Flanders with level of caries experience and over- and underscoring of dental examiners. Caries experience was split into 3 categories according to quartiles of the mean dmft scores obtained per school and coded as 0 (minimum to Q1), 1 (Q1 to Q3), or 2 (above Q3). The over- and underscoring of the examiner is indicated with the symbols "–", "*", and "+". The symbol "–" signifies that the dental examiner scoring the respective school underscored 5% to 15% compared with the benchmark examiner in the calibration exercises. The symbol "*" signifies between 5% under- and 5% overscoring, and the symbol "+" signifies at least 5% overscoring (up to 18%).

 

Table 1. Ordinal Logistic Regression Analysis Without (and With) Correction for the Bias and Variability of the Dental Examiners
 
However, a roughly similar East-West gradient in the scoring behavior of the 16 dental examiners is seen in the validation dataset (Fig.). Hence, one could question whether the first East-West gradient is genuine or caused by a different scoring of the examiners. The fact that the (weighted) kappa values ranged between 0.72 and 0.91 and are higher than 0.80 for 13 of the 16 dental examiners did not help us here. Therefore, the examiners were taken into account directly when the degree of caries experience was predicted from the x-coordinate. First, the 16 dental examiners in the ordinal logistic regression model were considered as regressors (see MATERIALS & METHODS). The regression coefficient of the x-coordinate is still significant, but to a much lesser degree and has shrunk considerably (Table 2). However, this analysis quantifies only a local East-West gradient in the geographical area where the dental examiner was operating. A better way to take the examiner effect into account is given in the next analysis.


Table 2. Ordinal Logistic Regression Analysis with the 16 Dental Examiners as Covariates
 
Using an extension of model (3) to the ordinal logistic regression model, we incorporated the under- and overscoring behaviors of the dental examiner vis-à-vis the benchmark examiner into the analysis of the cross-sectional data. We observed that the East-West gradient was again highly significant (in a Bayesian sense), practically to the degree of our first analysis, but was estimated with less precision (Table 1, with correction). Further analyses, including other covariates such as deprivation indices of the region where the school belongs, fluoride level of the drinking water of the region, etc., did not remove the East-West gradient. In all analyses, the random school effect was significant but had a minor effect on the regression estimates.


    DISCUSSION
Large-scale epidemiologic surveys necessarily involve multiple examiners, due to the large numbers of persons to be examined and some unavoidable organizational aspects, like geographical distances. This implies that the (dental) measurements like caries experience, plaque score, etc., could be scored differently by different examiners. This phenomenon is called ‘measurement error’, reflecting the idea that there is a true value for the measurement, taken by a gold standard, and the scores of the examiners might deviate from that true score. When the measurement is binary, one speaks of ‘misclassification error’. Measurement error on the regressor generally causes an attenuation of the true relationship between the risk factor and the response (disease). This also happens when the measurement error is on the outcome and when the regression model is non-linear. Furthermore, when the measurement error is confounded with other regressors, the estimated regression coefficients of the other regressors are also affected. Therefore, one needs to correct for measurement error, which is possible only when a validation dataset is available, e.g., by the performance of calibration exercises.

In the dental and medical literature, kappa values are reported not only to indicate the agreement of the scoring of the different examiners, but also to highlight the overall quality of the study. We have shown that these kappa values were uninformative in our analysis. Moreover, we argue that, in most studies with multiple raters, reporting of kappa values is not sufficient.

Geographical differences in, e.g., caries experience are often reported (Nadanovsky and Sheiham, 1994; Tickle et al., 2003). The analysis of determining factors for these differences is of utmost importance and facilitates the introduction of region-specific measures and/or interventions. In spite of the considerable efforts that are undertaken to calibrate examiners involved in such surveys, variability in scoring cannot be avoided. Since examiners often operate in well-defined geographical areas, the presence of possible bias can influence results considerably. The methodology presented here offers an opportunity to refine current analytical approaches, allowing more reliable conclusions to be drawn.

We have opted for a Bayesian approach for two reasons. First, the Bayesian approach allows for the incorporation of oral health knowledge into the statistical analysis. Although we have not done so here, we believe that this is an important feature of the approach. Indeed, the validation datasets are most often quite small, implying that the correction terms are then (relatively) poorly estimated. In that case, any external useful oral health information can improve the stability of the estimated correction terms. Second, the Bayesian software provides a flexible way to fit quite complex statistical models and to switch from one model to another with a limited amount of extra work, usually implying much less analytical work, which can be quite cumbersome once one deviates from classic statistical approaches.

Finally, despite the fact that, in our study, a gold standard was not available, but only a benchmark examiner, our analysis is not invalidated. Indeed, our regression coefficients estimate an ordinal logistic regression model as if all children were scored by the same individual, in this case the benchmark examiner. Of course, if the benchmark examiner also scores with error, then some attenuation will still be present in the analysis.


    ACKNOWLEDGMENTS
 
This investigation was supported by Research Grant OT/00/35, Catholic University Leuven; data collection was supported by Unilever, Belgium. The Signal-Tandmobiel® project was comprised of the following partners: D. Declerck (Dental School, Catholic University Leuven), L. Martens (Dental School, University Ghent), J. Vanobbergen (Oral Health Promotion and Prevention, Flemish Dental Association), P. Bottenberg (Dental School, University Brussels), E. Lesaffre (Biostatistical Centre, Catholic University Leuven), and K. Hoppenbrouwers (Youth Health Department, Catholic University Leuven; Flemish Association for Youth Health Care). Further, the first two authors are also partially funded by research grant P5/24 from the IAP research network of the Belgian State (Federal Office for Scientific, Technical and Cultural Affairs).

Received for publication January 6, 2004. Revision received May 28, 2004. Accepted for publication August 26, 2004.


    REFERENCES

  • Agresti A (2002). Categorical data analysis. New York: Wiley.
  • Carroll RJ, Ruppert D, Stefanski LA (1995). Non-linear measurement error models. London: Chapman and Hall.
  • Cicchetti DV, Feinstein AR (1990). High agreement but low kappa: II. Resolving the two paradoxes. J Clin Epidemiol 43:551–558.
  • Cohen J (1960). A coefficient of agreement for nominal scales. Educ Psycholog Meas XX(1):37–46.
  • Feinstein AR, Cicchetti DV (1990). High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol 43:543–549.
  • Ludbrook J (2002). Statistical techniques for comparing measurers and methods of measurement: a critical review. Clin Exp Pharmacol Physiol 29:527–536.
  • Mwalili SM, Lesaffre E, Declerck E (2004). Correcting for inter-observer effects in a geographical oral health study. Appl Statist (in press).
  • Nadanovsky P, Sheiham A (1994). The relative contribution of dental services to the changes and geographical variations in caries status of 5- and 12-year-old children in England and Wales in the 1980s. Community Dent Health 11:215–223.
  • Neuhaus JM (1999). Bias and efficiency loss due to misclassified responses in binary regression. Biometrika 86:843–855.
  • Pine CM, Pitts NB, Nugent ZJ (1997). British Association for the Study of Community Dentistry (BASCD) guidance on the statistical aspects of training and calibration of examiners for surveys of child dental health. A BASCD coordinated dental epidemiology programme quality standard. Community Dent Health 14(Suppl 1):18–29.
  • Shoukri MM (2003). Measures of interobserver agreement. Boca Raton: Chapman and Hall/CRC.
  • Spiegelhalter D, Thomas A, Best N, Gilks W (1996). Bayesian inference using Gibbs Sampling Manual (version ii). Cambridge, UK.
  • Tickle M, Milsom KM, Jenner TM, Blinkhorn AS (2003). The geodemographic distribution of caries experience in neighboring fluoridated and nonfluoridated populations. Public Health Dent 63:92–98.
  • Vanobbergen J, Martens L, Lesaffre E, Declerck D (2000). The Signal-Tandmobiel® project, a longitudinal intervention health promotion study in Flanders (Belgium): baseline and first year results. Eur J Paediatr Dent 2:87–96.

Journal of Dental Research, Vol. 83, No. 12, 951-955 (2004)
DOI: 10.1177/154405910408301212

