Actualmente los modelos multivariables de regresión son parte importante del arsenal de la investigación clínica, ya sea para la creación de puntuaciones con fines pronósticos o en investigación dedicada a generar nuevas hipótesis. En la creación de estos modelos, se debe tener en cuenta: a) el uso apropiado de la técnica estadística, que ha de ser acorde con el tipo de información disponible; b) mantener el número de variables por evento no mayor de 10:1 para evitar la sobresaturación del modelo, relación que se puede considerar una medida grosera de la potencia estadística; c) tener presentes los inconvenientes del uso de los procesos automáticos en la selección de las variables, y d) evaluar el modelo final con relación a las propiedades de calibración y discriminación. En la creación de modelos de predicción, en la medida de lo posible se debe evaluar estas mismas medidas en una población diferente.

Multivariable regression models are widely used in health science research, mainly for two purposes: prediction and effect estimation. Various strategies have been recommended when building a regression model: a) use the right statistical method that matches the structure of the data; b) ensure an appropriate sample size by limiting the number of variables according to the number of events; c) prevent or correct for model overfitting; d) be aware of the problems associated with automatic variable selection procedures (such as stepwise), and e) always assess the performance of the final model in regard to calibration and discrimination measures. If resources allow, validate the prediction model on external data.

Multivariable regression models are widely used in health science research. Data are frequently collected to investigate interrelationships among variables or to determine factors affecting an outcome of interest. It is here where multivariable regression models become a tool to find a simplified mathematical explanation between the candidate predictors and the outcome. The ultimate goal is to derive a parsimonious model that makes sense from the subject matter point of view, closely matches the observed data, and has valid predictions on independent data.

Due to advances in statistical software, which have made them friendlier to the user, more researchers with limited background in biostatistics are now engaged in data analysis. Thus, the goal of this review is to provide practical advice on how to build a parsimonious and more effective multivariable model. The overall steps in any regression model exercise are listed in Table 1. Due to limited space, only the most practical points are presented.

Table 1. Overall Steps in Multivariable Regression Modeling.

Determining the aim of the model |

Prediction (prognostic models) |

Effect size (or explanatory models) |

Ascertainment of true outcome |

Minimize endpoint misclassification error |

Prefer hard outcomes for prognostic models |

If combined endpoint, ensure the direction of the effect is the same for both components |

Consider using new outcomes such as days alive and out of hospital in heart failure studies |

Choosing appropriate statistical method dependent on outcome and prediction type |

Continuous: linear regression |

Binary: logistic regression |

Binary with censored observations |

Cox proportional regression |

Parametric survival regression |

Competing risks |

Longitudinal & time-to-event endpoint: Joint modeling approach |

Longitudinal data with interest in intermediate endpoints: multi-state Markov modeling |

Proper model building, including internal validation |

Parsimony versus complexity |

Selection of the right variables. Caution on the inappropriate use of stepwise procedures. Use backward instead of forward. Tune up the stopping rule according to the sample size 1 |

Avoidance of overffiting (EPV rule of thumb) |

Do not assume linearity for continuous variables; transform them if necessary. Use FPF or RCS for complex nonlinear functions |

Assessing model's performance |

Internal validation (preferably bootstrapping). Parameters to be evaluated |

Overall performance measures (R2, Brier score) |

Discrimination ability (AUC, C-statistics, IDI, NRI) |

Calibration (Hosmer-Lemeshow goodness of fit test, calibration plot, calibration slope, Gronnesby and Borgan test, calibration-in-the-large) |

External validation. The same parameters, but in external data |

The need for regression coefficient's shrinkage |

If calibration assessment shows overly optimistic coefficients, then |

Adjust shrinkage based on calibration slope, or |

Use more complex penalization methods such as LASSO and MLE |

Presenting the results |

Unadjusted versus adjusted |

Relative metrics (OR, HR) |

Absolute metrics (ARD, NNT) |

ARD, absolute risk difference; AUC, area under the ROC curve; C-statistics, equivalent to AUC for censored data; EPV, number of events per variable; FPF, fractional polynomial function; HR, hazard ratio; IDI, integrated discrimination index; LASSO, least absolute shrinkage and selection operator; MLE, maximum likelihood estimation; NNT, number needed to treat; NRI, net reclassification index; OR, odds ratio; R2, explained variation measure; RCS, restricted cubic splines.

Regression models share a general form that should be familiar to most, usually: response=weight1×predictor1+weight2×predictor2+… weightk×predictork | normal error term. The variable to be explained is called the dependent (or response) variable. When the dependent variable is binary, the medical literature refers to it as an outcome (or endpoint). The factors that explain the dependent variable are called independent variables, which encompass the variable of interest (or explanatory variable) and the remaining variables, generically called covariates. Not infrequently, the unique function of these covariates is to adjust for imbalances that may be present in the levels of the explanatory variable. Sometimes, however, the identification of the predictors for the response variable is the main study goal, and in this case, every independent variable becomes of interest. Models can be used for different tasks (Table 1) that can be summarized as prediction and/or effect estimation.2 Table 2 summarizes the differences in strategies between prediction and effect estimation models.

Table 2. Differences in Strategies According to the Task Assigned to the Model.

Aims | Considerations | Validation |

Prediction | Parsimony over complexity; the integrated information from all predictors is what matters | Calibration and discrimination using bootstrapping (internal validation) |

Prediction of an outcome of interest (prognostic score) | Avoid overuse of cutpoints | Shrinkage of main effects (linear shrinkage or penalization) |

Identification of important predictors | Multiple imputation if missingness>5% | Calibration and discrimination on independent data (external validation) |

Stratification by risk | Don’t excessively rely on stepwise procedures. If too many predictors or insufficient subject matter knowledge, then use MFP algorithm | Update coefficients in the new data if necessary |

Use EPV rule of thumb to limit the number of predictors. Group predictors if needed (i.e., propensity score) | Convert regression coefficients into scores using appropriate algorithms | |

Spend additional degrees of freedom on interactions and modeling nonlinear relationship of continuous predictors if prior knowledge suggested. Limit testing of individual terms; instead consider overall significance | Evaluate clinical usefulness of the derived score (nomogram, prognostic score chart, decision curve analysis) | |

If clinical decision making is in mind, also report absolute metrics (ARD or NNT) | ||

Effect estimation | The weight between parsimony and complexity varies according to the research question | At least calibration and discrimination ability of the model should be reported; internal validation is a plus |

Explanatory: understanding the effect of predictors | Always minimize continuous predictors categorization | If clinical decision making in mind, also report absolute metrics (ARD or NNT) |

Adjustment for predictors in experimental design to increase statistical precision | Multiple imputation if missingness>5% | |

Focus on the independent information provided by one predictor (or explanatory variable) | Use MFP algorithm as the preferred selection variable approach | |

Always keep present the EPV rule of thumb. Group predictors in case of small sample size (i.e. propensity score) | ||

If sample size allows, model nonlinear relationship of continuous predictors with no more than 4-5 degree of freedom (FPF or RCS). Final decision should be based on the overall significance (omnibus P-value) | ||

As hypothesis generating study, test for interactions. Keep interaction terms only when the omnibus P-value is significant |

ARD, absolute risk difference; EPV, number of events per variable; FPF, fractional polynomial function; MFP, multivariable fractional polynomial; NNT, number needed to treat; RCS, restricted cubic splines.

These models are created when the main goal is to predict the probability of the outcome in each subject, often beyond the data from which it originated. As an example, the clinical prediction rules derived from a model fitted to the Framingham data has been shown, after multiple external validations, to provide a quantitative estimation of the absolute risk of coronary heart disease in a general population.3 For these types of models, the researcher needs to balance complexity (and accuracy) versus parsimony; in other words, how closely the model needs to fit the data at hand versus how generalizable the predictions will be in external populations. Complex models, such as those with multiple interactions, excessive number of predictors, or continuous predictors modeled through complex nonlinear relationship, tend to fit poorly in other populations.

Several recommendations have been proposed for building these types of models,2, 4, 5 the following being the most important: a) incorporate as much accurate data as possible, with wide distribution for predictor values; b) impute data if necessary as sample size is important; c) specify in advance the complexity or degree of nonlinearity that should be allowed for each predictor; d) limit the number of interactions, and include only those prespecified and based on biological plausibility; e) for binary endpoints, follow the 10-15 events per variable (EPV) rule to prevent overfitting. If not possible, then proceed to data reduction; f) be aware of the problems with stepwise selection strategies. If used, proceed with a backward elimination instead, and set the criterion for stopping rule equivalent to AIC (P=.157). With small samples, relax even more the stopping rule (P=.25 to .5). Use prior knowledge whenever possible; g) check the degree of collinearity between important predictors and use subject matter expertise to decide which of the collinear predictors should be included in the final model; h) validate the final model for calibration and discrimination, preferably using bootstrapping, and i) use shrinkage methods if validation shows over-optimistic predictions.

Models for Effect EstimationThese models are created either as tools for effect estimation, or as a basis for hypothesis testing. Most articles published in biomedical literature are based on this type of models. Because there is a little concern for parsimony, the balance would be in favor of developing a more accurate and complex model that reflects the data at hand. However, always use principles that prevent overfitted estimates, and if necessary precede to data reduction methods. It is always a good principle to validate the final model based on calibration and discrimination measures.

Types of Models Driven by the Structure of the DataAnother consideration when building a regression model is to choose the appropriate statistical model that matches the type of dependent variable. There is a multitude of variation in how the data is collected, and not infrequently the same data can be analyzed with more than one regression method. Table 1 shows the type of regression methods that match the most frequent types of data collected (based on a normal error assumption). A detailed explanation of each of these methods is beyond the scope of this paper, and therefore, only the most important aspects will be provided.

Linear Regression AnalysisThe main assumption is the linearity of the relationship between a continuous dependent variable and predictors. When untenable, linearize the relationship using variable transformation or apply nonparametric methods.

Logistic Regression AnalysisThe logistic regression model is appropriate for modeling a binary outcome disregarding time dimension. All we need to know about the outcome is whether it is present or absent for each subject at the end of the study. The resulting estimate of effect for treatment is the odds ratio (OR) adjusted for other factors included as covariates. Sometimes, logistic regression has been used inappropriately to analyze time-to-event data. Annesi et al.6 demonstrated that, compared to Cox, the two methods yielded similar estimates, and an asymptotic relative efficiency of both models close to 1 only in studies with short follow-up and low event rate. Therefore, logistic regression should be considered as an alternative to Cox regression only when the duration of the cohort follow-up can be disregarded for being too short, or when the proportion of censoring is minimal and similar between the two levels of the explanatory variable.

Time-to-Event DesignSurvival regression methods have been designed to account for the presence of censored observations, and therefore are the right choice to analyze time-to-event data. Cox proportional hazard is the most common method used. The effect size is expressed in relative metrics as a hazard ratio (HR). A constant instantaneous hazard risk difference during follow-up is the main assumption. Several nonparametric alternatives to Cox regression have been proposed in the presence of nonproportionality.7

Parametric survival methods are recommended as follows: a) when the baseline hazard or survival function is of primary interest; b) to get more accurate estimates in situations where the shape of the baseline hazard function is known by the researcher; c) as a way to estimate the adjusted absolute risk difference (ARD) and number needed to treat (NNT) at prespecified time points; d) when the proportionality assumption for the explanatory variable is not tenable,8 and e) when there is a need to extrapolate the results beyond the observed data.

In survival analysis, each subject can experience one of the several different types of events at follow-up. If the occurrence of one type of event either influences or prevents the event of interest, a competing risks situation arises. For example, in a study of patients with acute heart failure, hospital readmission, the event of interest, is prevented if the patient dies at follow-up. Death here is a competing risk, preventing that this patient could be readmitted. In a competing risks scenario, several studies have demonstrated that the traditional survival methods, such as Cox regression and Kaplan-Meier method (KM) are inappropriate.9 Alternative methods have been proposed, such as the cumulative incidence functions and the proportional subdistribution hazards model by Fine and Gray.10

In summary, consider these methods: a) when the event of interest is an intermediate endpoint and the patient's death prevents its occurrence; b) when one specific form of death (such as cardiovascular death) needs to be adjusted by other causes of death, and c) to adjust for events that are in the causal pathway between the exposure and the outcome of interest (usually a terminal event). For instance, revascularization procedures may be considered in this category by modifying the patient's natural history of the disease, and therefore, influencing the occurrence of mortality.

Repeated Measures With Time-to-Event EndpointMany studies in biomedical research have designs that involve repeated measurements over time of a continuous variable across a group of subjects. In cardiovascular registries, for instance, information is obtained from patients at each hospitalization and collected along their entire follow-up; this information may include continuous markers, such as BNP, left ventricle ejection fraction, etc. In this setting, the researcher may be interested in modeling the marker across time, by determining its trajectory and the factors responsible for it; or perhaps the effect of the marker on mortality becomes the main interest; or the marker may be simply used as a time-varying adjuster for a treatment indicated at baseline. A frequent and serious problem in such studies is the occurrence of missing data, which in many cases is due to a patient's death, leading to a premature termination of the series of repeated measurements. This mechanism of missingness has been called informative censoring (or informative drop-out), and requires special statistical methodology for analysis.11, 12, 13, 14 This type of approach is called Joint modeling regression, and started being implemented in standard statistical softwares.15, 16

In a different setting, when the aim is to describe a process in which a subject moves through a series of states in continuous time, Multistate Markov modeling becomes the right analytical tool.17, 18 The natural history of many chronic diseases can be represented by a series of successive stages, and with an “absorbing state” at the end of the follow-up (usually death). Within this model, patients may advance into or recover from adjacent disease stages, or die, allowing the researcher to determine transition probabilities between stages, the factors influencing such transitions, and the predictive role of each intermediate stage on death.

Data manipulationNot infrequently, the data require clean-up before fitting the model. Three important areas need to be considered here:

1. Missing data. This is a ubiquitous problem in health science research. Three types of missingness mechanisms have been distinguished19: missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR) (Table 3). Multiple imputation was developed for dealing with missing data under MAR and MCAR assumptions by replacing missing values with a set of plausible values based on auxiliary information available on the data. Despite the fact that in most cases the missing mechanism is untestable and seldom MCAR, most contemporary statisticians are in favor of imputing missing values with complex multiple imputation algorithms, particularly when the missingness is equal to or greater than 5%.

Table 3. Missing Mechanisms.

Mechanism Description Example Effects MCAR Probability of missing not related either to observed or unobserved data Accidental loss of patient records by a fire. Patient lost to follow-up because of new job Loss of statistical power. No bias on estimated parameters MAR Given the observed data, the probability of missingness does not depend on unobserved data Missingness related to known patient characteristics, time, place, or outcome Loss of statistical power and bias with CC analysis on data with > 25% missingness. 20 Bias can be minimized with the use of multiple imputation (with missingness between 10-50%) NMAR The probability of missingness depends on unobserved variables and/or missing values in the data Missingness related to the value of the predictor, or characteristics not available in the analysis Loss of statistical power. Bias cannot be reduced in this case, and sensitivity analyses have to be conducted under various NMAR assumptions CC, completed case; MAR, missing at random; MCAR, missing completely at random; NMAR, not missing at random.

2. Variables coding. Variables must be modeled under appropriate coding. Try to collapse categories for an ordered variable if data reduction is needed. Keep variables continuous, as much as possible, since their categorization (or even worse, their dichotomization) would lead to an important loss of prediction information, to say nothing about the arbitrariness of the chosen cutpoint. Therefore, in the case of variable dichotomization, provide arguments on how the threshold was chosen, or if it was based on an acceptable cutpoint in the medical field.

3. Check for overly influential observations. When evaluating the adequacy of the fit model, it is important to determine if any observation has a disproportionate influence on the estimated parameters, through influence or leverage analysis. Unfortunately, there is no firm guidance regarding how to treat influential observations. A careful examination of the corresponding data sources may be needed to identify the origin of the influence.

Variable selection is a crucial step in the process of model creation (Table 1). Including in the model the right variables is a process heavily influenced by the prespecified balance between complexity and parsimony (Table 3). Predictive models should include those variables that reflect the pattern in the population from which our sample was drawn. Here, what matters is the information that the model as a whole represents. For effect estimation, however, a fitted model that reflects the idiosyncrasy of the data is acceptable as long as the estimated parameters are corrected for overfitting.

Overfitting is a term used to describe a model fitted with too many degrees of freedom with respect to the number of observations (or events for binary models). It usually occurs when the model includes too many predictors and/or complicated relations between the predictors and the response (such as interactions, complex nonlinear effects) that may indeed exist in the sample, but not in the population. As a consequence, predictions from the overfitted model will not likely replicate in a new sample, some selected predictors may be spuriously associated to the response variable, and regression coefficients will be biased against the null (over-optimism). In other words, if you put too many predictors in a model, you are very likely to get something that looks important regardless of whether there is anything important going on in the population. There are a variety of rules of thumb to approximate the sample size according to the number of predictors. In linear multiple regression, a minimum of 10 to 15 observations per predictor has been recommended.21 For survival models, the number of events is the limiting factor (10 to 15).22 For logistic regression, if the number of non-events is smaller than the number of events, then it will become the number to be used. In simulation studies, 10 to 15 events per variable were the optimal ratio.23, 24

Additional measures have been proposed to correct for overfitting: a) use subject matter expertise to eliminate unimportant variables; b) eliminate variables whose distributions are too narrow; c) eliminate predictors with a high number of missing values; d) apply shrinkage and penalization techniques on the regression coefficients, and e) try to group, by measures of similarity, several variables into one, either by using multivariate statistical techniques, an already validated score, or an estimated propensity score.

Automatic Selection of VariablesMost statistical software offers an option to automatically select the “best model” by sequentially entering into and/or removing predictor variables. In forward selection, the initial model comprises only a constant, and at each subsequent step the variable that leads to the greatest (and significant) improvement in the fit is added to the model. In backward deletion, the initial model is the full model including all variables, and at each step a variable is excluded when its exclusion leads to the smallest (nonsignificant) decrease in the model fit. A “combination” approach is also possible, which begins with forward selection but after the inclusion of the second variable it tests at each step whether a variable already included can be dropped from the model without a significant decrease in the model fit. The final model of each of these stepwise procedures should include a set of the predictor variables that best explains the response.

The use of stepwise procedures has been criticized on multiple grounds.2, 24, 25, 26, 27 Stepwise methods frequently fail by not including all variables that actually have influence on the response, or by selecting those with no influence at all. Relaxing the P=.05 value used as the stopping rule improves the selection of important variables in small datasets.1 Stepwise procedures have also been associated with icreased probability of finding at least one of the variables significant by chance (type I error), due to multiple testing and no error-level adjustment. A forward selection with 10 predictor variables performs 10 significance tests in the first step, 9 significance tests in the second step, and so on, and each time includes a variable in the model when it reaches the specified criterion. In addition, stepwise procedures tend to be unstable, meaning that only slight changes in the data can lead to different results as to which variables are included in the final model and the sequence in which they are entered. Therefore, these procedures are not appropriate for ranking the relative importance of a predictor within the model.

To overcome the drawbacks associated with stepwise procedures, Royston28 has developed a procedure called multivariable fractional polynomials (MFP), which includes two algorithms for fractional polynomials model selection, both of which combine backward elimination with the selection of a fractional polynomials function for each continuous predictor. It starts by transforming the variable from the most complex permitted fractional polynomials,and then, in an attempt to simplify the model, reduces the degree of freedom toward 1 (or linear). During this linearization process, the optimal transformation of the variable is set by statistical testing. It has been claimed that the default algorithm resembles a closed-test procedure by maintaining the overall type I error rate at the prespecified nominal level (usually 5%). The alternative algorithm available in MFP, the sequential algorithm, is associated with an overall type I error rate about twice that of the closed-test procedure, although it is believed that this inflated type I error rate confers increased power to detect nonlinear relationships.

Other sophisticated alternatives have been proposed to improve the process of variable selection, such as best subset regression,29 stepwise techniques on multiple imputed data,30, 31 using automated selection algorithms that make appropriate corrections, such as the LASSO (least absolute shrinkage and selection operator) method,32 and maximum likelihood penalization.33

Recently, the use of bootstrap resampling techniques has been advocated as a means to evaluate the degree of stability of the models resulting from stepwise procedures.34, 35, 36, 37 Such bootstrap samples mimic the structure of the data at hand. The frequency of the variables selected in each sample, called bootstrap inclusion fractions (BIF), could be interpreted as criterion for the importance of a variable. A variable that is weakly correlated with others and significant in the full model should be selected in about half of bootstrap samples (BIF≥50%). With lower P-values, the BIF increases toward 100%.

Final model evaluationCentral to the idea of building a regression model is the question of model performance assessment. A number of model performance metrics have been proposed, although they can be grouped in two main categories: calibration and discrimination measures (Table 1, Table 3). Independent of the aim for which the model was created, these two performance measures need to be derived from the data which gave origin to the model, or even better, through bootstrap resampling (known as internal validity). With bootstrapping, it is possible to quantify the degree of over-optimism and the amount of shrinkage necessary to correct the model's coefficients. However, if the goal is to evaluate the model's generalizability, a crucial aspect in prediction models, then these performance measures need to be estimated on external data. However, this is not always possible due to lack of resources. For a comprehensive review of this topic, refer to “Clinical Prediction Models” by Steyerberg.38

Calibration refers to the agreement between observed outcomes and model predictions. In other words, it is the ability of the model to produce unbiased estimates of the probability of the outcome. The most common calibration measures are calibration-in-the-large, calibration slope (both derived from calibration plots), and the Hosmer-Lemeshow test (or its equivalent for Cox regression, the Gronnesby and Borgan test39).

Discrimination is the model's ability to assign the right outcome to a randomly selected pair of subjects; in other words, it allows the model to classify subjects in a binary prediction outcome setting. The area under the ROC curve (AUC) is the most common performance measure used in the evaluation of the discriminative ability for normal-error models with a binary outcome. The equivalent for censored data is the C-statistic.40

In the context of translational research (omics & biomarkers era), the evaluation of the added predictive value for a predictor is perhaps as important as the validation of the prediction accuracy of the model as a whole. Several approaches have been proposed, the most important being net reclassification improvement, integrated discrimination improvement,41, 42 and decision curve analysis.43

In summary, a good model calibration and discrimination, evaluated through bootstrapping, is currently considered an important prerequisite for the application of any prediction model, assuming that independent data testing is not feasible.

Results presentationThe final consideration in the process of model creation is how the estimated parameters will be presented. Commonly, the statistical packages, when comparing two groups on a binary outcome, express the effect size of the explanatory variable in relative metrics. For logistic and Cox regressions, the OR and HR are the traditional metrics to indicate the degree of association found between a factor and the outcome. Because these are a ratio of proportions, the information conveyed about their effect size is relative to each other. Similarly, in randomized controlled trials, the relative risk, which is no more than a ratio of proportions, is often used to summarize the association between the intervention and the outcome. The main limitation inherent to these relative measures is that they are not affected by variations in baseline event rate. For instance, in Cox regression the HR does not take into account the baseline hazard function, and thus must be interpreted as a constant effect of the exposure (or intervention) during the follow-up. As such, there is a concern that relative measures provide limited clinical information, while absolute measures, by incorporating the information about the baseline risk of the event, will be more relevant for clinical decision making.

The most common absolute measures are: ARD and NNT. It must follow that NNT is just a reciprocal of the ARD. Because of randomization, in randomized controlled trials it is relatively straightforward to estimate the ARD as simply the difference in the outcome proportions, measured at the end of the trial, between treated and untreated subjects. Even when the outcome is time-to-event in nature, the differences in survival probabilities can be estimated at different durations of follow-up with the KM survival curves.44 However, with observational studies, subjects in the 2 groups of the exposure variable often differ systematically in prognostically important baseline covariates, which in turn leads to the application of statistical methods that allow the calculation of adjusted ARD (and NNT).45

In summary, in observational cohort studies, ARD and NNT can be derived from adjusted logistic and Cox regression models, and as much as possible such measures should supplement the reporting of the traditional regression estimates.

Conflicts of interestNone declared.

Corresponding author: Epidemiology and Statistical Services, Cuore International, 2914 Leiszs Bridge Rd. Reading, PA 19605, United States. enunezb@gmail.com

*et al*.