Assessment
Development and validation of a database of diverse validated face emotion morphs
Thea McAfee, B.A.
Research Associate
Yale University School of Medicine
New Haven, Connecticut, United States
Wenjing Luo, Ph.D.
Postdoctoral Associate
Yale University School of Medicine
New Haven, Connecticut, United States
Jennifer Klix
Student Intern
Hamilton College
New Haven, Connecticut, United States
Sarah Fineberg, M.D., Ph.D.
Assistant Professor
Yale University School of Medicine
New Haven, Connecticut, United States
Facial Emotion Recognition (FER) tasks are commonly used measures of social cognition, but existing face stimulus sets often lack diversity, stimulus validation, and open access. Williams et al. (2023) published accuracy ratings for a face set with male and female actors of four ethnicities showing five emotions at graded intensity. Unfortunately, that face set was limited by floor and ceiling effects for accuracy and was not open access.
Using the RADIATE face set, we replicate and extend these results by generating a novel database of face stimuli that vary by gender, race, emotion, and emotion intensity. These stimuli are accessible for free online use, and the large number of stimuli allows tasks to be scalable for varied research aims.
We recruited human raters on Prolific to assign each stimulus to one of five emotions (Anger, Disgust, Fear, Happiness, Sadness) and to complete surveys (demographics, the Borderline Symptom List (BSL-23), and the Post-traumatic Stress Disorder Checklist (PCL-5)). First, we tested percent rater accuracy for 80 face stimuli (50% female; 25% each Black, Asian, Hispanic, and White). Raters (n = 482) each rated 16 face stimuli. Face stimulus quality was quite variable, with accuracy ranges by emotion: happy 88-100%, sad 52-94%, angry 24-97%, fear 65-90%, disgust 34-97%.
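For illustration, per-stimulus accuracy as described above reduces to the share of rater choices matching the intended emotion label (a minimal sketch; the function name and inputs are hypothetical, not from the study materials):

```python
def percent_accuracy(responses, intended):
    """Percent of rater choices matching the stimulus's intended emotion label."""
    hits = sum(1 for choice in responses if choice == intended)
    return 100.0 * hits / len(responses)

# Hypothetical example: 9 of 10 raters label a happy face as Happiness.
print(percent_accuracy(["Happiness"] * 9 + ["Sadness"], "Happiness"))  # 90.0
```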
To examine the quality of faces morphed for intensity (20-100% of full emotion), we used visual inspection to identify face sets with good range (low to high accuracy across intensity levels) and a stepwise increase in accuracy at each intensity level. For face morph sets limited by ceiling and/or floor effects, we rescaled the morph range and re-tested in a new cohort of 590 raters (57% female, 40% male, 3% not listed or other; 67% White, 10% Black, 7% Asian, 3% Hispanic), who each rated 79 re-scaled faces. Mean BSL-23 score was 0.65 (range 0-3.5, SD = 0.70). For the 52.5% of raters reporting a history of a traumatic event, mean PCL-5 score was 23.7 (range 0-76, SD = 17.8).
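The rescaling step can be sketched as remapping the five morph intensities onto a narrower range when the original 20-100% spacing produces floor or ceiling effects (a minimal illustration; the bounds below are hypothetical, not the values used in the study):

```python
def rescale_levels(n_levels, new_min, new_max):
    """Evenly spaced morph intensities within a narrower, re-scaled range."""
    step = (new_max - new_min) / (n_levels - 1)
    return [round(new_min + i * step, 3) for i in range(n_levels)]

# Compress a ceiling-limited 20-100% morph set into a hypothetical 10-60% range.
print(rescale_levels(5, 0.10, 0.60))  # [0.1, 0.225, 0.35, 0.475, 0.6]
```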
This approach yielded a final database of morphed face sets (15 anger, 9 disgust, 8 fear, 10 happiness, and 13 sadness), each with faces at five levels of intensity.
Using these validated face sets, we confirmed that face intensity predicts accuracy (logistic regression, p < 0.001). Even in these validated morph sets, the face sets differed in accuracy by emotion (neutral > happy > fear > disgust > anger > sadness; logistic regression, p < 0.01). Next, we examined the effect of demographic concordance on accuracy. There was no significant interaction for either gender or race (p > 0.05). We did find a significant effect of PCL-5 score on accuracy, but only for 20% fear stimuli (z = 1.98, p = 0.047). There was no significant effect of BSL-23 score on accuracy.
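The intensity-accuracy relationship can be illustrated with a toy logistic regression fit by gradient ascent on synthetic trials (a sketch only; the data-generating probabilities and sample sizes here are invented, not the study's):

```python
import math
import random

random.seed(0)

# Hypothetical synthetic trials: higher morph intensity -> higher chance correct.
intensities = [0.2, 0.4, 0.6, 0.8, 1.0]
trials = [(x, 1 if random.random() < 0.3 + 0.6 * x else 0)
          for x in intensities for _ in range(200)]

# Fit logistic regression accuracy ~ intensity by gradient ascent on the
# log-likelihood, with intercept b0 and slope b1.
b0, b1, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    g0 = g1 = 0.0
    for x, y in trials:
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))
        g0 += y - p
        g1 += (y - p) * x
    b0 += lr * g0 / len(trials)
    b1 += lr * g1 / len(trials)

print(b1 > 0)  # positive slope: intensity predicts accuracy
```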
Using a confusion matrix, we found a qualitative increase in faces misread as disgust among raters with high PCL-5 scores (>33), but no clear differences between high and low BSL-23 scores (cutoff 1.3).
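A confusion matrix of this kind simply tallies chosen versus intended emotions, which makes systematic misreadings (e.g., toward disgust) visible (a minimal sketch; the example ratings are hypothetical):

```python
from collections import Counter

EMOTIONS = ["Anger", "Disgust", "Fear", "Happiness", "Sadness"]

def confusion_matrix(ratings):
    """Rows: intended emotion; columns: emotion chosen by the rater."""
    counts = Counter(ratings)  # keys are (intended, chosen) pairs
    return [[counts[(row, col)] for col in EMOTIONS] for row in EMOTIONS]

# Hypothetical ratings in which two fear faces are misread as Disgust.
demo = [("Fear", "Fear"), ("Fear", "Disgust"), ("Fear", "Disgust"),
        ("Anger", "Anger")]
print(confusion_matrix(demo)[2])  # Fear row: [0, 2, 1, 0, 0]
```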
These results confirm that actor-generated face emotion stimuli need validation, and that rescaling of raw stimuli can generate high-quality morphed face sets. This work contributes to an important critique of commonly used face emotion recognition task methods, offers a new validated face set to the psychology research community, and serves as a roadmap for other teams developing similar tasks.