Predictive models are typically trained on one or a few data sources but are often expected to work well in other environments. Model robustness can be demonstrated through external validation, the process of evaluating model performance on data sources that were not used for its derivation. In this project we adopt a more proactive approach and aim to train robust models that perform well across multiple datasets.
The methodologies we explore combine ideas from causal inference and distributionally robust optimization and address the complexities of medical data, such as population variability across geographies as well as differences in clinical settings and policies. Specifically, we assume full access to a few data sources along with some summary statistics from other external sources, and we investigate methods that exploit the heterogeneity of those external sources to guide robust model development.
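To make the distributionally robust optimization idea concrete, the sketch below shows a group-DRO-style training loop over several heterogeneous data sources: a single linear model is fit while an adversarial weight vector up-weights whichever source currently incurs the worst loss. This is only an illustrative toy, not the project's actual method; the data simulation, function names (make_source, group_dro_fit), and hyperparameters are all assumptions introduced here.

```python
import numpy as np

def make_source(n, shift, rng):
    """Simulate one data source with a source-specific covariate shift (illustrative only)."""
    X = rng.normal(loc=shift, scale=1.0, size=(n, 2))
    y = X @ np.array([1.0, -0.5]) + rng.normal(scale=0.3, size=n)
    return X, y

def group_dro_fit(sources, lr=0.05, eta=0.1, steps=500):
    """Minimize the worst-case (adversarially weighted) squared loss across sources."""
    d = sources[0][0].shape[1]
    w = np.zeros(d)                           # shared model parameters
    q = np.ones(len(sources)) / len(sources)  # adversary's distribution over sources
    for _ in range(steps):
        losses = np.array([np.mean((X @ w - y) ** 2) for X, y in sources])
        # Adversary step: exponentiated-gradient update toward the worst-loss source
        q *= np.exp(eta * losses)
        q /= q.sum()
        # Learner step: gradient descent on the q-weighted loss
        grad = sum(qi * 2 * X.T @ (X @ w - y) / len(y)
                   for qi, (X, y) in zip(q, sources))
        w -= lr * grad
    return w, q

rng = np.random.default_rng(0)
sources = [make_source(200, shift, rng) for shift in (-1.0, 0.0, 2.0)]
w, q = group_dro_fit(sources)
print("coefficients:", w, "source weights:", q)
```

In this toy setup the learned source weights q concentrate on the hardest environment, so the fitted model trades some average accuracy for better worst-case performance across sources, which is the kind of robustness-versus-accuracy trade-off the project studies.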