Technology/Digital Health
What AI Chatbots Say About Non-Suicidal Self-Harm: Comparisons to Expert Factsheets
Shealyn K. Tomlinson, B.A.
Graduate Student
Texas A&M University-Corpus Christi
Corpus Christi, Texas, United States
Sean Lauderdale, Ph.D.
Assistant Professor
University of Houston – Clear Lake
Houston, Texas, United States
With advances in internet availability and search capabilities, more people will rely on internet searches for guidance with mental health. OpenAI was the first to release one such advancement, ChatGPT-3, an artificial intelligence (AI) language-model chatbot that delivers search results in a narrative mimicking human speech (Ramponi, 2022). Since ChatGPT-3, Google (Bard) and Microsoft (Bing) have released their own AI chatbots. All of these chatbots can generate inaccurate and biased results (Ramponi, 2022), and users unaware of this may find the narrative responses unduly persuasive. Additionally, most internet searchers do not practice digital media literacy (Stvilia et al., 2009) and trust health information provided by chatbots (Abd-Alrazaq et al., 2021). Given the uncertainty about the quality of AI-generated psychoeducation concerning mental health, we compared ChatGPT-3, Bard, and Bing outputs about non-suicidal self-injury (NSSI) to expert-written factsheets to assess the accuracy and readability characteristics of the generated information.
Qualitatively, all chatbot outputs exceeded the average United States reading level (mean grade level = 13.25 vs. 7). Bing's responses were shorter and less detailed than the factsheets, ChatGPT-3, and Bard. Bing also included unprompted graphic images of self-harm in its output. Finally, ChatGPT-3 initially refused to answer some queries (e.g., "What are the risks of self-harming?") despite having answered the same queries in the past (as part of another investigation).
Method: Factsheets prepared by experts and vetted by advocacy/government organizations (i.e., the National Alliance on Mental Illness, the UK National Health Service, Rethink Mental Illness, and Better Health Channel) were used to compare the accuracy and readability characteristics of output about NSSI generated by Bard, Bing, and ChatGPT-3. Headings from the factsheets were used to pose queries. Cosine similarity, which represents each document as a vector of word frequencies, was used to assess document similarity; cosines closer to 1 indicate greater similarity. Mann-Whitney U tests were used to compare differences in word count, vocabulary density, Coleman-Liau Index (years of education needed to read a text fluently), words per sentence, positive words, and negative words. AI outputs were also reviewed for relevance.
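The two core measures described above can be illustrated with a minimal Python sketch. This is not the authors' actual analysis pipeline; it assumes a simple word-frequency representation for cosine similarity and the standard Coleman-Liau formula (0.0588L − 0.296S − 15.8, where L is letters per 100 words and S is sentences per 100 words).

```python
import math
import re
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between word-frequency vectors of two texts (0 to 1)."""
    a = Counter(re.findall(r"[a-z']+", text_a.lower()))
    b = Counter(re.findall(r"[a-z']+", text_b.lower()))
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def coleman_liau(text: str) -> float:
    """Coleman-Liau Index: approximate U.S. grade level needed to read the text."""
    words = re.findall(r"[A-Za-z']+", text)
    letters = sum(len(w) for w in words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    L = letters / len(words) * 100     # letters per 100 words
    S = sentences / len(words) * 100   # sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8
```

In practice, a comparison like the one described here would compute the cosine between each factsheet and the corresponding chatbot output, and the Coleman-Liau Index for each document separately; the Mann-Whitney U tests could then be run on the resulting per-document metrics (e.g., with scipy.stats.mannwhitneyu).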
Results: Cosine similarities between the factsheets and the AI chatbot outputs were moderate (ranging from .67 to .78). ChatGPT-3 had a higher word count and vocabulary density, and more positive and negative words, than the factsheets (mean Z = 3.88, ps < .01). Bard had more total words, unique words, positive words, and negative words than the factsheets (mean Z = 3.82, ps < .05). Bing had only more positive and negative words than the factsheets (mean Z = 2.34, p < .05).
Conclusions: While the factsheets and AI chatbot outputs were broadly similar, the chatbots' higher reading levels could limit accessibility. In particular, ChatGPT-3's elevated reading level and content density may limit its usefulness to readers. Bing's inclusion of explicit visuals risks triggering adverse reactions in some users. Further data analysis may reveal additional disparities in accuracy and treatment coverage.