|
Sign In to gain access to subscriptions and/or personal tools.
|
Modeling Multivariate Binary Responses with Multiple Levels of Nesting Based on Alternating Logistic Regressions: an Application to Caries Aggregation
C.V. Ananth1,* and
M.L. Kantor2
1 Section of Epidemiology and Biostatistics, Department of Obstetrics, Gynecology and Reproductive Sciences, University of Medicine and Dentistry of New Jersey, Robert Wood Johnson Medical School, 125 Paterson Street, New Brunswick, NJ 08901-1977; and 2 Division of Oral and Maxillofacial Radiology, Department of Diagnostic Sciences, University of Medicine and Dentistry of New Jersey, New Jersey Dental School, Newark, NJ;
Correspondence: * corresponding author, cande.ananth{at}umdnj.edu
 |
ABSTRACT
|
|---|
Clustered binary responses are commonly encountered in dental research. Data analysis may include modeling both the marginal response probabilities (i.e., risk) and the dependence structure between pairs of responses (i.e., aggregation). While second-order generalized estimating equations (GEE2) is a well-known approach for such data, alternating logistic regressions (ALR) is a computationally efficient alternative method, especially for large clusters. We illustrate ALR with an application to caries aggregation using a dataset with 3 levels of nesting: tooth surfaces within an interproximal (IP) region, IP regions within a jaw, and jaws within a subject. Caries lesions appear to aggregate strongly within subjects with a spatially distributed risk. The minimum within-IP-region odds ratio (OR) was 2.25 (95% confidence interval 1.15, 4.41), and the within-IP-region ORs were always greater than the between-IP-region ORs. ALR is a convenient and useful regression technique for explicit modeling of the dependence structure, and may be applicable to other dental research problems involving clustered or nested responses.
Key Words: clustered responses nested responses correlated binary responses alternating logistic regressions generalized estimating equations site-specific caries data
 |
INTRODUCTION
|
|---|
Clustered binary responses are frequently encountered in the biomedical sciences. Clustering violates the basic statistical assumption of independence of observations, thereby requiring sophisticated approaches to statistical modeling. In dentistry, for instance, caries lesions occur on individual tooth surfaces, and tooth surfaces are clustered within patients. In this paper, we will use a caries dataset to illustrate an approach for analyzing clustered data. There may be 2 objectives in the analysis of clustered response data. The first is to model the marginal response probabilities as a function of covariates (called marginal models), which is to model the overall risk of caries. The second is to model the dependence structure among pairs of binary responses (called the log-odds ratio or dependence model), which is to model the pairwise association of caries or caries aggregation.
Dental caries causes destruction of discrete tooth surfaces. Site-specific sampling (the tooth surface) results in not only clustered (correlated) but also nested (hierarchical) responses. One approach is to avoid the problem of clustered data by using the subject as the unit of analysis (Manji and Fejerskov, 1994). Most epidemiological and caries risk studies use the total number of decayed, missing, or filled teeth (DFMT) or surfaces (DFMS) as a subject-specific outcome. Another approach is to ignore the clustering and use ordinary logistic regression (OLR) to model caries risk. However, the variance of regression parameters from an OLR model will be imprecise and may lead to incorrect inferences.
There are several regression techniques for the analysis of clustered binary responses when modeling the marginal mean is the focus and the dependence structure is considered a nuisance. For such scenarios, first-order generalized estimating equations (GEE1) offer one elegant solution (Liang and Zeger, 1986; Prentice, 1988). However, when modeling the dependence structure is of scientific interest, second-order generalized estimating equations (GEE2), using the correlation coefficient (Prentice, 1988) or the odds ratio (Lipsitz et al., 1991) as the measure of pairwise dependence, provides an analytic solution. GEE2 methods have also been applied to marginal models for correlated binary responses with multiple classes and several levels of nested responses (Qagish and Liang, 1992; Podgor et al., 1996). However, when the cluster size gets large, the GEE2 procedure can quickly become computationally impractical. Alternating logistic regressions (ALR) overcomes this limitation, and can be used to model, simultaneously, the marginal probability and the pairwise association for large clusters (Carey et al., 1993).
In this paper, we illustrate ALR with an application to clustered data with multiple levels of nesting. The data came from a radiographic efficacy study (Kantor, 1989; Kantor and Stinnett, 1990), where the clustered response variable is the presence or absence of a caries lesion on each of 16 tooth surfaces per subject. There are 3 levels of nesting: tooth surfaces within an interproximal (IP) region, IP regions within a jaw, and jaws within a subject. We applied ALR to assess the extent of caries aggregation at each level of nesting.
 |
MATERIALS & METHODS
|
|---|
Caries Study
In this 1986 institutional review board-approved study, a bitewing radiograph produced with D-speed film at 70 kVp was obtained on the right or left side (randomly selected) in each of 100 subjects. Each bitewing radiograph was read independently by five community-based dentists, who scored each proximal surface for dental caries (Fig. ). A complete description of the variables is shown in Table 1 .

View larger version (28K):
[in this window]
[in a new window]
|
Figure. Schematic of a bitewing radiograph showing 16 scorable surfaces per subject. A caries lesion on the mesial surface of the maxillary second premolar is represented by the dot, and the molar 1/molar 2 (MM) interproximal region by the oval. The MM interproximal region serves as the reference value for the remaining interproximal regions: canine/premolar 1 (CP); premolar 1/premolar 2 (PP); and premolar 2/molar 1 (PM).
|
|
A tooth with its 2 proximal surfaces seems like a natural cluster. However, the biological micro-environment of an IP region may be more alike than the mesial and distal surfaces of a given tooth. Therefore, for this analysis, we considered 2 adjacent tooth surfaces nested within an IP region, IP regions within a jaw, and jaws within a subject. Patient demographics were not available.
The Multivariate Model
As we analyzed individual tooth surfaces for aggregation at 3 levels of nesting, there were 16 correlated response variables; hence, a multivariate approach to modeling was required. Following the notations of Qaqish and Liang (1992), let Yij denote the jth binary response on the ith cluster, 1 i K and 1 j ni; for example, the presence (Yij = 1) or absence (Yij = 0) of caries on the jth tooth surface for the ith subject (or cluster). For a single cluster i, Yi = (Yi1, Yi2,...,Yini)' denotes a vector of ni-binary responses. Let Xij (1 j ni) denote a p-covariate vector associated with the jth binary response on the ith cluster Yij, and let Zijk (j < k) denote a q-covariate vector associated with the jth and kth response pair (for instance, the presence or absence of a caries lesion on 2 separate tooth surfaces for the ith subject). We suppress the conditioning of covariates in the models of the form E(Yij | xij), but rather denote them simply as E(Yij).
Let µij = E(Yij) denote the (ni x 1) first-order mean vector, and let µijk = E(Yij,Yik), 1 j < k ni, denote the {ni(ni – 1)/2 x 1} second-order mean vector. With Yij being distributed as a Bernoulli variate, the marginal probability of caries (i.e., caries risk) is modeled assuming a logistic representation:
 | (1) |
and the dependence (i.e., caries aggregation) as
 | (2) |
Eq. 1 is the log odds for E(Yij), and Eq. (2) is the log of the odds ratio relating Yij to Yik.
The pairwise association for the 2 binary responses (Yij, Yik), (j < k), is measured in terms of a marginal odds ratio (Lipsitz et al., 1991), given by
 |
The log of the odds ratio, ijk, for caries between 2 tooth surfaces (j,k) for the ith subject is the log of the ratio of the odds of caries on tooth surface j, given that tooth surface k is carious, to the odds of caries on tooth surface j, given that tooth surface k is caries-free. Therefore, when ijk is zero, the odds ratio is 1, implying the absence of clustering of caries between the 2 surfaces (j,k) within a subject. An odds ratio significantly different from the null suggests the presence of "aggregation" of caries within a subject. The dependence structure in our study was modeled with the marginal odds ratio (Lipsitz et al., 1991), instead of the correlation coefficient (Prentice, 1988). A limitation of using the correlation coefficient as a measure of dependence involves range restrictions (Diggle et al., 2002), but using the odds ratio as a measure of association overcomes this limitation (Lipsitz et al., 1991; Liang et al., 1992). Finally, odds ratios are commonly used in epidemiologic studies and are easy to interpret.
Ordinary Logistic Regression (OLR)
When the clustering is ignored, Eq. (1) is the OLR model. However, the estimated variances from this model can be imprecise, and so, statistical tests and biologic inferences may be incorrect.
First-order Generalized Estimating Equations (GEE1)
GEE1 are multivariate analogues of the quasi-likelihood estimating functions (Wedderburn, 1974). When interest centers solely on modeling the marginal mean, µij, and the dependence is considered a nuisance, then the estimating function for β has the form
Let Wi = (Y1Y2, Y1Y3,...,Yni–1Yni)' denote the {ni(ni – 1)/2 x 1} vector of pairwise (cross-products) outcomes of the vector Yi. Define i = E(Wi). Since the joint distribution of Yi is not fully specified by the parameters = (β, ), where is the "working" dependence structure, a full likelihood-based approach to estimation is not feasible (Liang et al., 1992). Liang and Zeger (1986) showed that the solution of of U( ) = 0, as K , has an asymptotic multivariate Gaussian distribution with mean 0 and covariance which can be consistently estimated using , where H0 and H1 are defined as
The advantages of GEE1 are that even when the dependence structure is incorrectly specified, GEE1 provides consistent estimation of the marginal mean parameters.
Alternating Logistic Regressions (ALR)
When the dependence structure (i.e., caries aggregation) is the focus, GEE2 may be used to model µij and µijk simultaneously (Prentice, 1988). However, the simultaneous estimation of µij and µijk can become computationally prohibitive, especially when the cluster sizes become large. For instance, for a cluster of size ni, the dimension of the matrix cov(Yi,Wi) is of the order {ni + ni(ni – 1)/2} x {ni + ni(ni – 1)/2}. In our study, the maximum cluster size for each subject was 16 surfaces, resulting in a 136 x 136 matrix ({16 + 16(16 – 1)/2}) for cov(Yi,Wi). Moreover, when odds ratios are used to model the pairwise associations, estimation of cov(Yi,Wi) requires solving O(n3) cubic and O(n2) seventh-degree polynomial equations for each cluster (Carey et al., 1993).
ALR overcomes the computational difficulty of GEE2 and can be used to estimate = (β, ) for simultaneous modeling of µij and µijk (Carey et al., 1993). For cluster sizes 4, not only can using GEE2 to model µij and µijk become computationally prohibitive, but also the efficiency in the estimates of β from GEE1 and ALR is > 90% compared with that of GEE2 (Heagerty and Zeger, 1996). Briefly, the ALR algorithm iterates between the following two steps until convergence:
Step 1: Given the initial values of = (r), calculate côv(r), and solve the following estimating equation (Score function) for an updated (r+1):
Step 2: Given (r+1) and (r), evaluate the offset from the following equation:
and perform the offset logistic regression of Yij on ZijkYik, to obtain (r+1).
Let ijk = E(Yij | Yik = yik). Define Rijk = Yij – ijk to denote a vector of residuals, and let Si denote a matrix with diagonal elements ijk(1 – ijk). The ALR estimate of = (β, ) is the simultaneous solution of the following set of unbiased estimating equations:
 |
 |
The loss of efficiency in estimating the pairwise association parameters is typically much less severe when centered values, yij – E(yij), are used, rather than yij itself (Heagerty and Zeger, 1996). A limitation of ALR is that although modeling µij is robust to different choices of working correlation, µijk modeling is not (i.e., the model must be specified correctly). Thus, measures of association among responses need to be cautiously interpreted, since they are sensitive to the assumption that the model is correctly specified (Stokes et al., 2000). All regression models were fit in SAS version 8.2 (SAS Institute, Cary, NC, USA) with the GENMOD procedure based on the binomial distribution assumption and with the logit link function for uncentered products.
 |
RESULTS
|
|---|
Modeling Caries Risk
Eleven percent of surfaces had a caries lesion. The full model for the marginal probability of caries was fitted:
 | (3) |
where ij = Pr(Yij = 1), and I(.) is an indicator function for observers. We also tested for other two-way interactions in Eq. (3) , and none reached statistical significance. For comparison, we show the parameter estimates with naïve standard errors of OLR (model a) that ignored the clustering, and the estimates based on GEE1 (model b) that accounted for clustered observations (Table 2 ). The naïve standard errors for all terms in the OLR model were smaller than with GEE1 estimation. We did not interpret the marginal model in depth, since our purpose was to illustrate the application of ALR for modeling the dependence structure, that is, caries aggregation.
View this table:
[in this window]
[in a new window]
|
Table 2. Analyses of the Dental Caries Data: Results from Modeling the Marginal Mean and the Dependence Structure (K = 100 subjects)a
|
|
Modeling Caries Aggregation
In conjunction with the marginal mean model (Eq. 3 ), 4 different regressions were specified for modeling the aggregation of caries risk:
 | (4) |
 | (5) |
 | (6) |
 | (7) |
All models were fitted with ALR. Eq. 4 was based on a constant pairwise log odds ratio within a subject (model c), Eq. 5 allowed for separate log odds ratios for within- and between-tooth surfaces (model d), Eq. 6 allowed for separate estimation of the pairwise association for the two jaws (model e), and finally, Eq. 7 allowed for estimation of the 10 separate associations of within- and between-IP regions (model f). Model (f) was preferred over models (c-e) since it allowed for an assessment of caries aggregation at the lowest level of nesting (i.e., IP region).
The parameter estimates for these 4 models with their robust standard errors are shown in Table 2 . A subject with caries on any tooth surface was exp(0.935) or 2.55 (95% CI 2.18, 2.97) times as likely to have caries on any other different surface (model c). Caries was more likely to aggregate within- and between-tooth surfaces (model d). A test of 1 = 2 = 3 = 0 resulted in a significant Wald-like (Table 2 ). Caries was also more likely to aggregate within and between jaws (model e) once again, a test of 4 = 5 = 6 = 0 being highly significant (Wald-like ).
Model (f) allows for separate estimation of the pairwise associations by IP regions. A simultaneous 10-df test of 7 = ... = 16 = 0 was significant (Wald-like , P < 0.001), confirming the presence of strong caries aggregation. Odds ratios from model (f), with robust 95% confidence intervals, are presented in Table 3 . Three notable findings are apparent from this analysis. First, pairwise associations of caries appeared to be stronger within rather than between IP regions (P < 0.05); that is, the odds ratios on the diagonal are greater than on the off-diagonal entries. For example, a subject with caries on 1 surface in the PP IP region was exp(1.509) or 4.52 times as likely to have caries on the adjacent surface within the PP IP region, but at most exp(1.046) or 2.85 times as likely to have caries in another IP region (i.e., the PM region in this example). Second, as the distance between IP regions increased, the association grew weaker (P < 0.05). This is seen from the diminishing odds ratios going from the left to right for each row, and from bottom to top for each column in Table 3 . Third, regardless of the choice of the pairwise association structure, parameter estimates for the marginal probability and their robust standard errors among models (c-f) were fairly similar (Table 2 ).
View this table:
[in this window]
[in a new window]
|
Table 3. Odds Ratios and 95% Confidence Intervals for the Pairwise Association Models for Caries Derived from Alternating Logistic Regressions (Model f in Table 2 ; K = 100 subjects)*
|
|
 |
DISCUSSION
|
|---|
In this paper, we illustrated the application of ALR to the analysis of a complex dataset involving caries. The complexity of the analysis was due to the multiple levels of nesting: tooth surfaces within an IP region, IP regions within a jaw, and jaws within a subject. The goal of our analysis was to assess if caries aggregated within a subject. In our study, the within-IP-region odds ratios were always greater than the between-IP-regions odds ratios. In general, the odds ratios diminished as distance between the IP region pairs increased, which might indicate an auto-regressive or moving-average structure. While our study did not consider the traditional risk factors for dental caries, it demonstrates an application of ALR when the association between pairs of responses is the scientific focus of research. Given a caries lesion on an index surface, ALR enables us to determine the odds of a caries lesion on the adjacent tooth surface within the index IP region and on a tooth surface in a distant IP region.
Given that modeling caries aggregation was the goal of our analysis, we could have applied either GEE2 or ALR. Since GEE2 is computationally inefficient relative to ALR when the number of clusters is large (Carey et al., 1993), we preferred ALR. Consistent estimation of the parameters in the marginal probability model with GEE2 requires the correct specification of the model for ijk. However, ALR provides consistent estimates of the mean parameters, even when the dependence structure is misspecified.
Alternatively, the marginal probability and the conditional pairwise odds ratios of the multivariate binary responses can be modeled with the log-linear framework (Fitzmaurice and Laird, 1993). However, such models have several shortcomings. The primary disadvantages of log-linear models are two-fold. First, they are not reproducible (Liang et al., 1992), in that if the vector Yij = (Yi1, Yi2,...,YiKi)' satisfies a log-linear model, then a subset of Yij need not necessarily follow the same log-linear model. This restriction has rather profound implications in oral health studies when the cluster size varies across clusters. In contrast, marginal models are fully reproducible. The second disadvantage of log-linear models is that if one is interested in modeling the pairwise associations as a function of covariates, the log-linear framework is not applicable. These issues are discussed in greater detail by Liang et al.(1992) and Qaqish and Liang (1992).
In the case of clusters of size 2, Rosner (1989) proposed estimating the pairwise association based on beta-binomial regression models, and Ananth and Preisser (1999) demonstrated an application of bivariate logistic regression. Recently, the ALR algorithm has been extended to model ordinal responses when the association structure is the primary focus of analysis (Heagerty and Zeger, 1996). Population-averaged and cluster-specific regression models have been applied to a longitudinal study of periodontal disease progression where estimation of the marginal probability was the primary focus (Ten Have et al., 1995). The issue of analyzing clustered responses, specifically within- and between-subject comparisons, in periodontal disease studies has been the focus of recent papers (DeRouen et al., 1995; Mancl et al., 2000).
In summary, the ALR procedure offers a computationally convenient method for fitting complex regression structures that involve multiple classes and several levels of nesting which are commonly encountered in dental research. Specifically, in our case study, the analysis reveals that caries lesions appear to aggregate strongly within subjects.
 |
ACKNOWLEDGMENTS
|
|---|
The original data are from a study funded by a grant (R03-DE07422) from the National Institutes of Health awarded to Dr. Kantor. Dr. Ananth is supported, in part, though a grant (R01-HD038902) awarded to him by the National Institutes of Health. We acknowledge Drs. John Preisser and Bahjat Qaqish from the Department of Biostatistics, Dr. James Bader from the Department of Operative Dentistry, The University of North Carolina at Chapel Hill, NC; Dr. Vincent Carey from the Channing Laboratory, Harvard Medical School, Boston, MA, and Dr. John Smulian, from the Department of Obstetrics, Gynecology and Reproductive Sciences, University of Medicine and Dentistry of New Jersey-Robert Wood Johnson Medical School, New Brunswick, NJ, for reviewing the manuscript and offering valuable suggestions. We also thank the two anonymous referees for their comments that helped improve the manuscript and for pointing us to some pertinent literature.
Received for publication March 28, 2003.
Revision received March 18, 2004.
Accepted for publication April 23, 2004.
 |
REFERENCES
|
|---|
- Ananth CV, Preisser JS (1999). Bivariate logistic regression: modelling the association of small for gestational age births in twin gestations. Stat Med 18:2011–2023.[CrossRef][Medline]
[Order article via Infotrieve]
- Carey V, Zeger SL, Diggle P (1993). Modeling multivariate binary data with alternating logistic regressions. Biometrika 80:517–526.[Abstract/Free Full Text]
- DeRouen TA, Hujoel PP, Mancl LA (1995). Statistical issues in periodontal research. J Dent Res 74:1731–1737.
- Diggle PJ, Heagerty P, Liang K-Y, Zeger SL (2002). Analysis of longitudinal data. 2nd ed. New York, NY: Oxford Science Publications.
- Fitzmaurice GM, Laird NM (1993). A likelihood-based method for analysing longitudinal binary responses. Biometrika 80:141–151.[Abstract/Free Full Text]
- Heagerty PJ, Zeger SL (1996). Marginal regression models for clustered ordinal measurements. J Am Stat Assoc 91:1024–1036.[CrossRef]
- Kantor ML (1989). Radiographic caries detection as a function of film speed and kilovoltage (abstract). J Dent Res 68(Spec Iss):961.
- Kantor ML, Stinnett SS (1990). Factors influencing radiographic detection of dental caries (abstract). J Dent Res 69(Spec Iss):288.
- Liang K-Y, Zeger SL (1986). Longitudinal data analysis using generalized estimating equations. Biometrika 73:13–22.[Abstract/Free Full Text]
- Liang K-Y, Zeger SL, Qaqish BF (1992). Multivariate regression analyses for categorical data (with discussion). J R Stat Soc [Ser B] 54:3–40.
- Lipsitz SR, Laird NM, Harrington DP (1991). Generalized estimating equations for correlated binary data: using the odds ratio as a measure of association. Biometrika 78:153–160.[Abstract/Free Full Text]
- Mancl LA, Leroux BG, DeRouen TA (2000). Between-subject and within-subject statistical information in dental research. J Dent Res 79:1778–1781.
- Manji F, Fejerskov O (1994). An epidemiological approach to dental caries. In: Textbook of clinical cariology. 2nd ed. Thylstrup A, Fejerskov O, editors. Copenhagen, DK: Munksgaard, pp. 159–191.
- Podgor MJ, Hiller R, the Framingham Eye Studies Group (1996). Associations of types of lens opacities between and within eyes of individuals: an application of second-order generalised estimating equations. Stat Med 15:145–156.[CrossRef][Medline]
[Order article via Infotrieve]
- Prentice RL (1988). Correlated binary regression with covariates specific to each binary observation. Biometrics 44:1033–1048.[CrossRef][Medline]
[Order article via Infotrieve]
- Qaqish BF, Liang K-Y (1992). Marginal models for correlated binary responses with multiple classes and multiple levels of nesting. Biometrics 48:939–950.[CrossRef][Medline]
[Order article via Infotrieve]
- Rosner B (1989). Multivariate methods for clustered binary data with more than one level of nesting. J Am Stat Assoc 84:373–380.[CrossRef]
- Stokes ME, Davis CS, Koch GG (2000). Categorical data analysis using the SAS system. 2nd ed. Cary, NC: SAS Institute, Inc., pp. 527–533.
- Ten Have TR, Landis R, Weaver SL (1995). Association models for periodontal disease progression: a comparison of methods for clustered binary data. Stat Med 14:413–429.[Medline]
[Order article via Infotrieve]
- Wedderburn RWN (1974). Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika 61:439–447.[Abstract/Free Full Text]
Journal of Dental Research, Vol. 83, No. 10,
776-781 (2004)
DOI: 10.1177/154405910408301008

CiteULike Complore Connotea Del.icio.us Digg Reddit Technorati Twitter What's this?
This article has been cited by other articles:

|
 |

|
 |
 
S. O. Manda, M. S Gilthorpe, Y.-K. Tu, A. Blance, and M. T Mayhew
A Bayesian analysis of amalgam restorations in the Royal Air Force using the counting process approach with nested frailty effects
Statistical Methods in Medical Research,
December 1, 2005;
14(6):
567 - 578.
[Abstract]
[PDF]
|
 |
|
|
|