Symposia
Research Methods and Statistics
Alessandro S. De Nadai, Ph.D. (he/him/his)
Assistant Professor
McLean Hospital/Harvard Medical School
Belmont, Massachusetts, United States
Ryan Zamora, M.S. (he/him/his)
Senior Data Analyst
McLean Hospital/Harvard Medical School
Belmont, Massachusetts, United States
Alyse Finch, M.A. (she/her/hers)
Clinical Research Assistant
McLean Hospital/Harvard Medical School
Belmont, Massachusetts, United States
The availability of big data has created exciting opportunities to advance research and practice in cognitive behavioral therapy (CBT). However, problems frequently and inadvertently arise in CBT data that stem from mental health-specific causes and are not widely recognized. These issues include 1) unreliable measurement, 2) heterogeneous construct definition, 3) population mixtures with differing biopsychosocial mechanisms, 4) behavioral reporting bias by both patients and clinicians, 5) selection bias, and 6) data that are not missing at random. Collectively, we term these unintentional problems “data pollution” (De Nadai et al., 2022). This concept contrasts with “data poisoning,” which reflects intentional attempts to introduce bias into models, such as misinformation that is purposefully added to social media platforms.
In common research designs, currently accepted unreliability standards can reduce effect sizes by nearly 30%, which more than doubles sample size requirements (Faul et al., 2009). In the case of population heterogeneity, between-group effect sizes can be reduced by 70% when only 10% of the population departs from the main distribution (Wilcox et al., 2013). These distortions not only prevent the emergence of critical new findings, but they also jeopardize both individual and federal investments in research. Without addressing data pollution more effectively, CBT research will be hindered in its efforts to improve clinical outcomes. Similarly, data pollution threatens clinical applications intended for widespread service delivery. For instance, smartphone intervention apps were part of approximately 115 NIH-funded projects as of 2018 (Hansen and Scheier, 2019). Data pollution affects the models on which these interventions depend, which threatens their success and could even lead to harm if inaccurate models lead to misguided clinical services.
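The arithmetic behind both figures can be sketched in a few lines. This is an illustrative sketch, not an analysis from the cited studies: the true correlation of .30, the reliability of .70 on both measures (a commonly accepted floor), and the contaminated-normal parameters (10% of scores from a subgroup with tenfold standard deviation) are all assumed values chosen to reproduce the stated magnitudes.

```python
import math

# -- Unreliability: Spearman's attenuation of a correlation -----------------
def attenuated_r(true_r, rel_x, rel_y):
    """Observed correlation when both measures contain error (classical test theory)."""
    return true_r * math.sqrt(rel_x * rel_y)

def n_for_correlation(r):
    """Approximate n to detect correlation r (Fisher z; two-tailed alpha=.05, power=.80)."""
    z_alpha, z_beta = 1.959964, 0.841621
    c = 0.5 * math.log((1 + r) / (1 - r))
    return math.ceil(((z_alpha + z_beta) / c) ** 2 + 3)

true_r = 0.30                              # hypothetical true effect
obs_r = attenuated_r(true_r, 0.70, 0.70)   # assumed reliability .70 on both measures
n_true, n_obs = n_for_correlation(true_r), n_for_correlation(obs_r)
# obs_r = 0.21, a 30% reduction; the required sample size more than doubles

# -- Heterogeneity: a small wide-variance subgroup shrinks Cohen's d --------
def mixture_sd(p, sd_main, sd_tail):
    """SD of a mean-centered two-component scale mixture."""
    return math.sqrt((1 - p) * sd_main ** 2 + p * sd_tail ** 2)

d_clean = 0.8 / 1.0                          # standardized difference, homogeneous case
d_mixed = 0.8 / mixture_sd(0.10, 1.0, 10.0)  # 10% of scores from a wide subgroup
# d_mixed is roughly 70% smaller than d_clean
```

The attenuation formula and the inflated pooled standard deviation are the standard textbook mechanisms; the specific numbers above simply make the cited 30% and 70% reductions concrete.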
In this presentation, we will show how immediately available corrections for unreliability and population heterogeneity can substantially mitigate the problems introduced by data pollution. Specifically, we will use real-world data to illustrate how latent variable modeling and mixture modeling can be applied to CBT research. In our example, these adjustments resulted in a four-fold increase in effect sizes and a 90% reduction in required sample sizes. Overall, this research has the potential to enable across-the-board improvements in effect sizes, reduce sample size requirements, and enhance methodological rigor for many CBT-relevant applications.
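To illustrate the mixture-modeling idea in miniature: the sketch below fits a two-component Gaussian mixture by expectation-maximization to simulated scores in which 10% of the sample comes from a subgroup with a different mean. This is a deliberately minimal, self-contained sketch, not the analysis from the presentation; the simulated distributions are assumptions, and applied work would use established software (e.g., Mplus or scikit-learn's GaussianMixture).

```python
import math
import random

def norm_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_component(xs, iters=50):
    """Minimal EM for a one-dimensional two-component Gaussian mixture."""
    mu = [min(xs), max(xs)]      # crude but adequate initialization here
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior probability that each point belongs to each component
        resp = []
        for x in xs:
            w = [pi[k] * norm_pdf(x, mu[k], var[k]) for k in range(2)]
            total = sum(w)
            resp.append([wk / total for wk in w])
        # M-step: update each component's weight, mean, and variance
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, xs)) / nk, 1e-6)
    return mu, var, pi

random.seed(0)
# Simulated symptom scores: 90% from the main population, 10% from a
# subgroup with a different mechanism (all parameters are illustrative)
xs = ([random.gauss(0, 1) for _ in range(900)]
      + [random.gauss(5, 1) for _ in range(100)])
mu, var, pi = em_two_component(xs)
# The fitted mixture recovers the hidden subgroup (mean near 5, weight near .10),
# so effects can be estimated within components rather than across a mixed sample
```

Once the components are separated, within-component effect sizes are no longer diluted by the heterogeneous subgroup, which is the mechanism behind the effect-size gains described above.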