Unbiased measures of variable importance for random forests

Abstract

Random forests have become very popular in many scientific fields because they can cope with "small n large p" problems involving complex interactions. Random forest variable importance measures have been suggested as screening tools, e.g., for gene expression studies. However, these variable importance measures have been shown to be biased in favor of predictor variables of certain types and towards correlated predictor variables.

While the former issue could be addressed straightforwardly by means of unbiased split selection and resampling schemes (Strobl et al., 2007), in the case of correlated predictors the original permutation importance is highly misleading, creating a new source of bias in interpretations drawn from random forests. Therefore, Strobl et al. (2008) suggested a solution for this problem in the form of a new, conditional permutation importance measure.
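The difference between the two permutation schemes can be sketched in a few lines. The following is a minimal illustration, not the implementation of Strobl et al. (2008): it assumes a single hand-picked cutpoint on one correlated covariate, whereas the published measure derives its conditioning grid from the fitted trees. The function names `permute_marginal` and `permute_within_strata` and the simulated data are hypothetical.

```python
import random
from collections import defaultdict


def permute_marginal(values, rng):
    """Shuffle the whole column, as in the original permutation scheme."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def permute_within_strata(values, strata, rng):
    """Shuffle values only among observations sharing a stratum label."""
    groups = defaultdict(list)
    for idx, label in enumerate(strata):
        groups[label].append(idx)

    result = list(values)
    for indices in groups.values():
        block = [values[i] for i in indices]
        rng.shuffle(block)
        for i, v in zip(indices, block):
            result[i] = v
    return result


rng = random.Random(0)

# Simulated data (assumption): x is strongly correlated with z.
z = [rng.gauss(0, 1) for _ in range(200)]
x = [zi + 0.3 * rng.gauss(0, 1) for zi in z]

# Hypothetical conditioning grid: one cutpoint on z at zero.
strata = [zi > 0 for zi in z]

x_marginal = permute_marginal(x, rng)
x_conditional = permute_within_strata(x, strata, rng)
```

Permuting within strata keeps each value of `x` in the same stratum of `z`, so the coarse association between the two predictors survives the permutation, while the marginal scheme breaks it along with the association to the response.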

In the talk, the rationale and application of this measure are outlined and illustrated. Moreover, some hands-on advice is given for sensibly using and interpreting random forests.

References

  • C. Strobl, A.-L. Boulesteix, A. Zeileis and T. Hothorn (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8:25.
  • C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin and A. Zeileis (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9:307.