This study investigated the ability of children with ASD, including the minimally verbal subgroup, to perceive angry, neutral, and happy prosody in low-pass filtered speech when provided with a structured training paradigm.
Thirteen children with ASD and 21 TD children completed the experimental task and two additional measures (nonverbal cognitive abilities, social responsiveness deficits) for regression analyses.
The ASD group recognized prosodic conditions significantly less accurately than the TD group and took significantly longer to recognize all sentences. Angry prosody was consistently the most difficult to recognize across groups. Nonverbal cognitive ability was a significant predictor of successful recognition of neutral and happy prosody, although low nonverbal cognitive skills do not preclude minimally verbal children with ASD from accurately perceiving affective prosody.
The present study shows it is possible for minimally verbal children with ASD to successfully participate in experimental research using judgment tasks when provided with appropriate training.
Autism spectrum disorder (ASD) is characterized by impairments in two broad areas: (1) persistent difficulties in social communication and social interaction; and (2) restricted and repetitive behavior patterns [
Prosody refers to the rhythm and tune of speech [
At the most fundamental level, however, prosody can be considered an integral building block of language acquisition and social functioning. Research has shown that infants prefer listening to infant-directed speech (also known as Motherese) from as young as one month of age [
As alluded to above, there is a known subgroup of children with ASD who do not develop a robust system of communication – a subgroup variously estimated at 10–30% of the ASD population [
Studies that have investigated the specific ability of individuals with ASD to perceive affective prosody and therefore make judgments about the emotional state of others have tended to focus on the higher-functioning end of the spectrum. In one study, adolescents with Asperger Syndrome (AS; diagnosed prior to DSM-5), compared to age-matched controls, were shown to have significantly poorer ability to use nonverbal cues such as facial expression, body gestures, and prosody to interpret the feelings of actors featured in videotaped scenes [
While there has been research that indicates individuals with HFA do have intact affective prosody interpretation abilities, and they are able to identify emotions as well as their typically developing peers [
In general, it appears that individuals with ASD demonstrate difficulties in interpreting emotions from affective prosody. What is unequivocal is that most of the research has focused on individuals with ASD who are high-functioning; there is little to no research examining the minimally verbal subgroup, and still less investigating their ability to perceive and/or interpret affective prosody. The inherent heterogeneity within the ASD population adds complexity to any research on this population. Most pressing, perhaps, is the glaring lack of research in the subgroup of children with ASD who are minimally verbal. While recent scientific inquiry and technological advancements (e.g., brain imaging) have allowed us a glimpse into the world of ASDs, still very little is known about this minimally verbal subgroup, poignantly termed “the neglected end of the spectrum” [
Given the complexity of the disorder, the heterogeneity of the autism spectrum, and the seemingly contradictory results, it is difficult to make conclusions regarding the nature and the extent of prosodic deficits in individuals with ASD. The subgroups within the ASD population, as well as the different functions of prosody, need to be clearly specified experimentally before pertinent findings can be derived. Critically, studies need to begin including the subgroup of children with ASD who are minimally verbal, instead of continuing to ignore them. The present study therefore seeks to investigate the ability of children with ASD, including those who are minimally verbal, to perceive affective prosody compared to typically-developing (TD) peers. A conscious effort was made to design innovative experimental procedures that allow even minimally verbal children with ASD to participate and complete the task.
The main objective of this study is to determine the ability of children with ASD to perceive affective prosody. The prosodic qualities of happy and angry will be investigated, as recent research using near-infrared spectroscopy indicated that typically-developing infants as young as 7 months show increased response in right temporal voice-sensitive regions of the brain when hearing happy and/or angry prosody [
The main hypothesis is that children with ASD will show poorer ability to perceive affective prosody compared to TD peers. While it remains to be seen what particular aspect(s) of affective prosody contribute to this poorer performance, it may be that some minimally verbal children never gain competence with language and with the nuances of language in conveying feelings because of deficits in perceiving affective prosody. A secondary hypothesis is that there will be a negative relationship between the ability to perceive affective prosody and the degree of deficit in social responsiveness. Finally, it is also hypothesized that the ability to perceive affective prosody is not related to nonverbal cognitive abilities.
The set of stimulus sentences used in the experiment was selected from a larger group of low-pass filtered sentences based on the results of a pretest to validate the emotional manipulation of prosody and a series of acoustic analyses. The goal of low-pass filtering the sentences was to remove the spectral information associated with the perception of phonemes and therefore the semantic content of the utterance, while preserving prosodic contours. These low-pass filtered sentences were then pretested to identify those that were most accurately perceived as communicating the target emotions. The acoustic analyses examined the fundamental frequency (F0) contours and duration of the stimuli.
One potentially confounding factor in the research on affective prosody is the interaction between syntax, semantics, and emotions. Humans have access to a limited set of acoustic features to express different intentions through speech. How much focus one assigns to the categories of grammatical, pragmatic, and affective prosody becomes a negotiation that speakers necessarily must engage in during the encoding of information, which in turn affects the subsequent decoding of this information. In one such study aimed at understanding the interplay between grammatical, pragmatic, and affective prosody, Pell [
At the same time, words tend to be nuanced with layers of meanings, and as such hold varying degrees of emotional valence that differ from person to person, even when there is a shared overall concept of a given lexical item. How one teases apart the contribution of semantics from true expressions of emotional states impacts the cogency of any research purporting to investigate affective prosody. Many researchers have used an evolutionary framework to explain the ecological significance and validity of vocal expressions of emotion, offering neurophysiological responses and social adaptation as explanations [
To circumvent the potential confound introduced by lexical and semantic contexts to the accurate interpretation of affective prosody, stimuli were low-pass filtered in this study. Low-pass filtered sentences were also intended to help level the playing field between the ASD and TD groups, especially in accommodating the heterogeneity within the autism spectrum. Specifically, low-pass filtered sentences would (1) allow even minimally verbal ASD participants to complete the perception task despite impaired linguistic abilities; and (2) prevent TD participants from relying on lexical-semantic cues to determine affect.
We recorded the original sentences with a female native American English speaker, who is both a graduate student in communication sciences and disorders knowledgeable about prosodic theory, and an actress with very good control of her voice. We used Praat software [
Following the recording session, stimuli were low-pass filtered using Praat software to remove the frequencies associated with the perception of phonemes, thereby eliminating access to lexical-semantic content while preserving prosodic contours. Past research investigating the recognition of affective prosody with low-pass filtered speech has used different cutoffs, including 333 Hz [
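To make the filtering step concrete, the sketch below applies a windowed-sinc FIR low-pass filter to a sampled signal. It is an illustration only: the study used Praat's own filtering, and the function name `low_pass`, the 400 Hz cutoff, the tap count, and the synthetic test tones are assumptions chosen for the example rather than values from this study.

```python
import math

def low_pass(samples, sample_rate, cutoff_hz, num_taps=101):
    """Windowed-sinc FIR low-pass filter (illustrative; not Praat's algorithm)."""
    fc = cutoff_hz / sample_rate              # normalized cutoff (cycles per sample)
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        # Ideal low-pass impulse response (sinc), with the centre tap handled separately.
        h = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        # Hamming window limits the ripple caused by truncating the sinc.
        w = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(h * w)
    gain = sum(taps)
    taps = [t / gain for t in taps]           # unity gain at 0 Hz
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i - k
            if j >= 0:                        # samples before the start count as silence
                acc += t * samples[j]
        out.append(acc)
    return out

def rms(xs):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))

# Hypothetical demo: a 150 Hz "pitch" tone and a 3000 Hz "segmental" tone.
RATE = 16000
times = [n / RATE for n in range(2000)]
low_tone = [math.sin(2 * math.pi * 150 * s) for s in times]
high_tone = [math.sin(2 * math.pi * 3000 * s) for s in times]
low_out = low_pass(low_tone, RATE, cutoff_hz=400)
high_out = low_pass(high_tone, RATE, cutoff_hz=400)
```

In this toy case the low tone, which stands in for the F0 contour, passes largely intact, while the high tone, which stands in for the spectral detail that carries phonemic information, is strongly attenuated.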
All 105 low-pass filtered sentences were pretested to determine those for which judgment regarding the target emotion communicated by the speaker was most accurate. Ten neurotypical native American English adult speakers, who were colleagues of the authors, participated in the pretest. The sound file for each sentence was embedded into a Microsoft PowerPoint document in random order, and shared with the participants. Three training items were also included in both the original and low-pass filtered versions to allow listeners to gain familiarity with the target emotion as communicated through filtered speech. Participants were instructed to listen to each stimulus sentence no more than three times, and rate each item as being angry, neutral, or happy.
Overall, the full set of low-pass filtered sentences was recognized correctly at a high mean level of 86.7%, where chance performance was 33.3%. This rate of recognition is broadly in line with existing research on the perception of vocally expressed emotions, which tends to yield accuracy scores at higher-than-chance levels [
Given that listeners across studies are consistently able to correctly identify emotions in unfamiliar languages at levels well above chance, researchers have posited the existence of a set of basic emotional intonations (e.g., anger, happiness, sadness, fear) that prevail cross-linguistically. Studies in the categorical perception of vocally expressed emotions support the presence of discrete emotion categories in the auditory modality [
With this in mind, acoustic analyses on the set of low-pass filtered sentences were used for two purposes:
(1) To identify the set of acoustic parameters that differentiated the three target emotions in the current study
(2) To validate the fidelity of the low-pass filtered sentences with reference to current knowledge regarding acoustic markers of the three target emotions in this study
Praat software was used to obtain durational measures (in seconds) and measures of pertinent F0 parameters (in Hz), including minimum F0 (MinF0), maximum F0 (MaxF0), mean F0 (MeanF0) and F0 range (F0_Range), by prosodic condition. The means and standard deviations of these acoustic parameters are shown in
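As a rough illustration of how the four F0 parameters relate to one another, the sketch below summarizes an F0 track that has already been extracted. The function name `f0_summary`, the convention that unvoiced frames are coded as 0 or `None`, and the sample values are assumptions for the example; they are not taken from the study or from Praat's output format.

```python
from statistics import mean

def f0_summary(f0_track_hz):
    """Summarize an F0 track; unvoiced frames are assumed to be coded as 0 or None."""
    voiced = [f for f in f0_track_hz if f]    # drop unvoiced frames
    if not voiced:
        raise ValueError("no voiced frames in F0 track")
    lowest, highest = min(voiced), max(voiced)
    return {
        "MinF0": lowest,
        "MaxF0": highest,
        "MeanF0": mean(voiced),
        "F0_Range": highest - lowest,         # range is simply MaxF0 minus MinF0
    }

# Hypothetical F0 track in Hz (made-up values, not study data).
track = [0, 210.0, 250.0, 300.0, None, 280.0, 0, 190.0]
summary = f0_summary(track)
```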
For each acoustic parameter, a repeated measures ANOVA was conducted across the prosodic conditions. There was a significant main effect of Prosody on all F0 acoustic parameters as well as duration: MinF0 [
Overall, happy prosody had the highest minimum, maximum and mean F0, and largest F0 range, closely followed by angry prosody. Neutral prosody had the lowest maximum and mean F0, and the lowest F0 range. This is generally in line with previous research on the acoustic correlates of emotional prosody – happy and angry utterances are characterized as having a higher mean F0 and greater variation in F0 [
To validate the fidelity of the low-pass filtered sentences, the prosodic contours of several sentences were visually examined against known acoustic markers of angry, neutral, and happy prosody. Sound waveforms and the corresponding pitch contours were extracted for several sentences in each of the angry, neutral, and happy prosodic conditions using Praat software.
Thirteen children with ASD and 21 TD children participated in this study. Of the 13 children with ASD, 11 were minimally verbal (i.e., their speech was neither spontaneous nor generative, and in many cases verbal expression was extremely limited if not absent). The participants were recruited through local schools and via word of mouth through colleagues. All participants came from English-speaking homes, and were reported to have normal hearing and vision.
ASD participants had to have met criteria for autistic disorder as stated in the DSM-IV, diagnosed by a certified clinical child psychologist/child psychiatrist/neurologist or qualified pediatrician. All the ASD participants attended schools for children with autism.
Participants were administered the Matrices subtest from the Kaufman Brief Intelligence Test, Second Edition (KBIT2; [
The ethics of this study with regard to human subject participants and the procedures were approved by the Spaulding Rehabilitation Hospital Institutional Review Board. The parents of all children gave written informed consent prior to their participation, and all procedures were conducted according to the approved protocol.
The training stimuli consisted of a mix of unfiltered and low-pass filtered sentences (n=18), and the experimental stimuli consisted of a set of 21 low-pass filtered sentences, each produced with happy, angry, and neutral prosody (i.e., n=63 sentences). As discussed above, the sentences were selected based on perceptual saliency given results of the pretest, as well as acoustic features highly characteristic of happy, angry, and neutral prosody. The sentences ranged in length from four to 13 syllables, with an equal number of sentences matched on syllable length (i.e., 3 sentences × 7 syllable lengths × 3 prosodic conditions). Matching an equal number of sentences on syllable length allowed us to investigate the effect of sentence length on various aspects of performance.
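The arithmetic of the stimulus design (3 sentences × 7 syllable lengths × 3 prosodic conditions) can be checked with a short enumeration. The seven syllable lengths are those reported in the sentence-length analysis; the sentence identifiers and field names are hypothetical placeholders, not the study's actual file names.

```python
from itertools import product

SYLLABLE_LENGTHS = [4, 5, 7, 8, 10, 11, 13]   # the seven lengths in the length analysis
PROSODIES = ["angry", "neutral", "happy"]
SENTENCES_PER_LENGTH = 3

# One record per stimulus: a base sentence crossed with a prosodic condition.
stimuli = [
    {"sentence": f"len{length}_s{idx}", "syllables": length, "prosody": prosody}
    for length, idx, prosody in product(
        SYLLABLE_LENGTHS, range(1, SENTENCES_PER_LENGTH + 1), PROSODIES
    )
]
base_sentences = {s["sentence"] for s in stimuli}
per_prosody = {p: sum(1 for s in stimuli if s["prosody"] == p) for p in PROSODIES}
```

Under this layout there are 21 base sentences, 63 stimuli in total, and 21 trials per prosodic condition.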
Participants were tested in a quiet room. The experiment was run using FLXLab software [
The training component was specifically designed to allow even ASD children with very low language levels to be able to successfully participate in the experiment. The levels of progression of the training block are shown in
After successful completion of the training block, participants completed the experiment. There were 63 sentences in the experiment phase, divided into three blocks. All trials were initiated with the presentation of a photograph of a boy wearing headphones to alert the participants to listen, followed by a 1,500 ms inter-stimulus interval. Next, the stimulus sentence was presented in its entirety. Reaction time was measured from the offset of the stimulus to the time of the response key press. After participants rendered an emotion decision, a reinforcing image was shown indicating success. The examiner then initiated the next trial. The order of presentation of experiment blocks was systematically rotated across participants, and the order of presentation of stimulus sentences within each block was automatically randomized within the FLXLab program.
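The trial logic described above can be sketched as follows. This is a minimal illustration, not the FLXLab configuration: the exact rotation scheme for block order is not specified in the text, so a simple cyclic rotation is assumed, and all function names and the seed parameter are hypothetical.

```python
import random

def block_order(participant_index, n_blocks=3):
    """Rotate the starting block across participants (cyclic rotation, assumed)."""
    start = participant_index % n_blocks
    return [(start + i) % n_blocks for i in range(n_blocks)]

def shuffled_trials(block, seed):
    """Randomize trial order within a block; the seed is only for reproducibility here."""
    rng = random.Random(seed)
    trials = list(block)
    rng.shuffle(trials)
    return trials

def reaction_time_ms(stimulus_onset_ms, stimulus_duration_ms, keypress_ms):
    """RT measured from stimulus offset, i.e. onset plus duration, to the key press."""
    return keypress_ms - (stimulus_onset_ms + stimulus_duration_ms)

# Hypothetical trial: stimulus starts at 1500 ms, lasts 1890 ms, key pressed at 4800 ms.
example_rt = reaction_time_ms(1500, 1890, 4800)
```

Because RT is anchored to the stimulus offset rather than its onset, longer sentences give a listener more time to reach a decision before the clock starts.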
Pearson correlation coefficients across groups were computed between the stimulus lengths and reaction time (RT). The relationships among the RTs for the different lengths of stimuli were tested for the neutral prosodic condition to determine stimulus length categories with the greatest discriminant validity. Results of the correlational analyses produced the following two categories:
(1) Short sentences: 4-, 5-, and 7-syllable sentences
(2) Long sentences: 8-, 10-, 11-, and 13-syllable sentences
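The pairwise correlations used to group sentence lengths can be sketched as below. The per-participant RT values are made up for illustration, and the names `pearson_r` and `rt_by_length` are hypothetical; the sketch shows only the computation, not the study's data or its grouping decision.

```python
import math
from itertools import combinations

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up mean RTs (ms) per participant for three syllable lengths.
rt_by_length = {
    4: [1500, 1700, 1300, 1600, 1450],
    5: [1450, 1750, 1350, 1580, 1500],
    13: [1200, 1250, 1400, 1100, 1500],
}

# Correlate every pair of lengths across participants.
pairwise = {
    (a, b): pearson_r(rt_by_length[a], rt_by_length[b])
    for a, b in combinations(sorted(rt_by_length), 2)
}
```

Lengths whose RTs correlate strongly with one another would then be candidates for the same category.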
All correlation coefficients were statistically significant within the two categories except for one in the long sentence category (10-syllable and 13-syllable sentences with a coefficient of r=0.33,
The mean accuracy for short sentences was 44.1% (SD=14.1) while the mean accuracy for long sentences was not significantly different at 43.4% (SD=15.4),
The mean RT across groups for short sentences was 1,518.4 ms (SD=522.4), and the mean RT across groups for long sentences was significantly shorter, 1,333.6 ms (SD=582.0),
There was also a main effect of Length,
An error matrix was generated to further analyze the patterns of responses made by the participants.
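An error matrix of this kind tabulates, for each stimulus prosody, the percentage of trials on which each response was chosen. The sketch below shows one way to build it from (stimulus, response) pairs; the trial data and the function name `error_matrix` are hypothetical.

```python
from collections import Counter

LABELS = ["angry", "neutral", "happy"]

def error_matrix(trials):
    """Row percentages: for each stimulus prosody, the share of each response."""
    counts = Counter(trials)                  # keys are (stimulus, response) pairs
    matrix = {}
    for stim in LABELS:
        row_total = sum(counts[(stim, resp)] for resp in LABELS)
        matrix[stim] = {
            resp: (100.0 * counts[(stim, resp)] / row_total if row_total else 0.0)
            for resp in LABELS
        }
    return matrix

# Made-up responses, not study data.
trials = [
    ("angry", "neutral"), ("angry", "angry"), ("angry", "neutral"), ("angry", "happy"),
    ("neutral", "neutral"), ("neutral", "neutral"),
    ("happy", "happy"), ("happy", "angry"),
]
matrix = error_matrix(trials)
```

The diagonal cells give accuracy for each condition, and the off-diagonal cells show which label was chosen in error.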
To further analyze the apparent difficulty that TD participants have in recognition of angry prosody, data collected across all TD participants in both the pretest and experimental study were compared. A small pilot study was also conducted with TD children from 9–15 years of age.
Accuracy across prosodic conditions generally shows a developmental trend, the most significant of which appears to be for angry prosody.
Each ASD participant’s chronological age and nonverbal age-equivalence on the KBIT2 were examined together with their correct recognition of affective prosody to explore the relationship between these few factors (see
At first glance, the data appear to trend towards a positive correlation between nonverbal intelligence age-equivalence on the KBIT2 and correct recognition of affective prosody. However, three participants (Participants 2, 3, and 12) had low nonverbal intelligence age-equivalence on the KBIT2, yet demonstrated the ability to consistently recognize neutral and/or happy prosody at levels that were above chance, suggesting that low nonverbal cognitive abilities may not always predict poor ability to recognize affective prosody.
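One way a reader might quantify "above chance" for an individual participant is an exact binomial tail probability, given 21 trials per prosodic condition and a chance level of one in three. The sketch below is offered as an illustration of that calculation only; it was not part of the reported analysis, and the function name `p_at_least` and the example counts are assumptions.

```python
from math import comb

def p_at_least(k, n, p):
    """Probability of k or more successes in n independent trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

N_TRIALS = 21          # sentences per prosodic condition
CHANCE = 1 / 3         # three response options

# Example counts: 13 of 21 correct is 61.9%; 9 of 21 correct is 42.9%.
p_13 = p_at_least(13, N_TRIALS, CHANCE)
p_9 = p_at_least(9, N_TRIALS, CHANCE)
```

Smaller tail probabilities indicate performance that is harder to attribute to guessing; with only 21 trials per condition, scores modestly above 33.3% carry limited evidential weight on their own.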
A series of all-possible-subsets regression analyses were conducted to identify the significant effects of behavioral score profiles on the dependent experimental variables of accuracy and reaction time. Specifically, we were interested in the KBIT2 and SRS total T-score as predictor variables. A decision was made to perform regressions only for the ASD group, given the clinical relevance of this population. In addition, only the statistically significant variables from the repeated measures ANOVA were entered into the regression models as criterion variables (i.e., mean accuracy by prosodic condition and mean reaction time by length). One significant model and one model approaching significance emerged with respect to the accuracy data (see
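With two candidate predictors, an all-possible-subsets approach fits three models: each predictor alone and both together. The sketch below illustrates the idea with ordinary least squares solved from the normal equations. It is a simplified stand-in, not the statistical software or selection criterion used in the study, and the data, variable names, and helper functions are all made up for the example.

```python
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(columns, y):
    """Fit y = b0 + sum(b_i * x_i); return (coefficients, R squared)."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return beta, 1 - ss_res / ss_tot

def all_subsets(predictors, y):
    """Fit one model per non-empty subset of the named predictors."""
    names = sorted(predictors)
    results = {}
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            results[subset] = ols([predictors[name] for name in subset], y)
    return results

# Made-up data for illustration only (not study data).
predictors = {"KBIT2": [40, 45, 52, 60, 75, 107], "SRS_T": [90, 70, 85, 61, 78, 66]}
accuracy = [2 + 0.5 * k for k in predictors["KBIT2"]]   # exactly linear in KBIT2
models = all_subsets(predictors, accuracy)
```

In practice the candidate models would be compared on a criterion such as adjusted R squared or significance of the coefficients; the text does not state which criterion was applied.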
The aim of the study was to investigate the ability of children with ASD, including those who are minimally verbal, to perceive affective prosody when presented with low-pass filtered sentences as stimuli. The inclusion of this subgroup of children with ASD who are minimally verbal represents a departure from a long tradition of research that has tended to focus on high-functioning individuals on the autism spectrum. As discussed, there is a dearth of research into the minimally verbal “neglected end of the spectrum” [
The goal of the present study, to investigate the perception of affective prosody in children with ASD and TD peers, is driven by the fundamental role that prosody plays in language acquisition. Given findings that there are some very young children with ASD who are less responsive to their mother’s voice [
The main hypothesis that children with ASD will show a poorer ability to perceive affective prosody compared to TD peers was borne out in this study. The ASD group recognized prosodic conditions significantly less accurately than the TD group. This is perhaps not surprising, and is in line with the majority of past research that has shown that individuals with HFA/AS typically demonstrate deficits in recognition of affective prosody [
The ASD group also showed a different pattern in time to recognize affective prosody. The TD group required a significantly longer time to recognize the affective prosody in shorter sentences compared to longer sentences. However, the ASD group took more time on all sentences in general but did not differ significantly between short and long sentences. This finding suggests that TD children may need extra processing time to judge emotions in prosody when less verbal information is available, although it is also possible that the longer sentences allowed the TD participants to form a judgment while the sentence was still playing, resulting in a shorter reaction time. In contrast, individuals with ASD spent more time overall processing all stimuli. While sentence length has not been a focus of research on affective prosody in the past, findings from the present study may motivate further investigation into the role of stimulus length in supporting successful decoding of emotions from the prosody of speech (e.g., as an instructional strategy).
In any perception study, it is often informative to analyze what and where errors were made that influenced accuracy. In the present study, both the ASD and TD groups struggled with recognizing angry prosody. Not only was angry prosody recognized at approximately chance level across both groups, it was also most often mistakenly categorized as neutral-sounding. While there may have been an inherent bias in the selection of neutral as a response, this finding is important because it shows that both ASD and TD participants show similar error patterns when it comes to angry prosody. It is an interesting contrast to the research cited above in which infants from around 5 months are able to discriminate between happy, angry, and sad emotional prosody in familiar contexts [
A secondary hypothesis was that a negative relationship would exist between the ability to perceive affective prosody and the degree of social responsiveness deficits. One finding here suggests that the more deficits one has in social responsiveness, the better one will be at recognizing neutral prosody, contrary to the hypothesis. This counterintuitive finding may be an artifact of the SRS questionnaire (used to measure social responsiveness deficits) as administered in the present study. That is, one subset of the ASD group had questionnaires completed by parents while the other subset of the ASD group had questionnaires completed by teachers. Inherent differences in the relationship to the child, as well as in the depth and quality of interaction with the child, may have introduced a confound to the SRS scores.
A final hypothesis was that the ability to perceive affective prosody was not related to nonverbal cognitive abilities. The significant models generated from the series of all-possible-subsets regression analyses did not support this hypothesis. Nonverbal cognitive ability was a significant predictor of mean accuracy for neutral and happy prosody: higher nonverbal cognitive ability (as measured on the KBIT2) reliably predicted that a child with ASD was better able to recognize neutral and happy prosody. While it seems logical that nonverbal cognitive abilities would have a direct impact on one’s ability to perceive and accurately recognize affective prosody, it is also the contention of these authors that low nonverbal cognitive skills do not preclude one from being able to accurately perceive affective prosody. At the very least, the current study provides initial evidence that low nonverbal cognitive abilities may not necessarily hinder one’s capacity to correctly recognize neutral and/or happy prosody.
One limitation of the present study is the small sample size, especially in the ASD group. There was also a lack of experimental control in terms of including both high- and low-functioning individuals with ASD as well as controlling for nonverbal cognitive abilities in order to better tease out differences in the perception of affective prosody. The two different administrations of the SRS questionnaire may also have skewed findings somewhat, especially with regard to properly identifying predictor variables for accuracy rate and reaction time. At the same time, a better definition of the minimally verbal child with ASD may also be helpful to properly identify and include this subgroup in future research. The inclusion of different prosodic conditions in future studies with minimally verbal children with ASD will shed further light on what is being differentially processed by this population of children.
In conclusion, the driving force behind the present study is to extend research into the subgroup of children with ASD who are minimally verbal. To that end, a structured experimental design with graded training was developed for the present study. While the mean accuracy rates for the ASD group tended to approximate chance levels, individual differences were found, a necessary consequence of the inherent heterogeneity within the population. Most importantly, however, this is one of a few studies that have included minimally verbal children with ASD in the sample. The present study has shown that it is possible for minimally verbal children with ASD to successfully participate in experimental research using judgment tasks when provided with appropriate training and scaffolding to task expectations. It remains our hope that more can and will be done to further our understanding of this long-neglected end of the autism spectrum.
Mean accuracy of all sentences by prosodic condition.
Mean accuracy of all sentences by group.
RT for short vs. long sentences by group.
RT for all sentences by length.
RT for all sentences by group.
Pretest % Correct Recognition of Prosodic Condition
Prosodic condition | All pretested sentences (n=105) | Selected sentences |
---|---|---|
Angry | 80.3 | 87.6 |
Neutral | 91.4 | 94.3 |
Happy | 88.3 | 93.3 |
Means and Standard Deviations of Acoustic Parameters by Prosodic Condition
Prosodic condition | Duration (s) | MaxF0 (Hz) | MinF0 (Hz) | MeanF0 (Hz) | F0_Range (Hz) |
---|---|---|---|---|---|
Angry | 1.89 (0.61) | 370 (52.55) | 108 (27.04) | 249 (25.75) | 261 (55.54) |
Neutral | 1.78 (0.68) | 336 (53.12) | 112 (31.62) | 204 (16.62) | 224 (57.07) |
Happy | 1.79 (0.57) | 484 (24.73) | 152 (39.99) | 321 (24.69) | 331 (45.24) |
Spectrograms and Prosodic Contours of Selected Sentences
Qualitative description: Angry prosody is characterized by moderate-high F0, moderate-high F0 variability [
Qualitative description: Neutral prosody is characterized by low-moderate F0, less F0 variability compared to angry prosody and happy prosody, faster speech rate [
Qualitative description: Happy prosody is characterized by high F0, high F0 variability [
Descriptive Characteristics of Participants by Group
Measure | Statistic | ASD (n=13) | TD (n=21) |
---|---|---|---|
Gender (#) | M | 12 | 8 |
Gender (#) | F | 1 | 13 |
Age (yr;mth) | Mean (SD) | 11;7 (1;0) | 8;0 (0;5) |
Age (yr;mth) | Min/Max | 9;10/12;10 | 6;11/8;8 |
SRS total T-score | Mean (SD) | 76.46 (13.43) | 48.62 (9.48) |
SRS total T-score | Min/Max | 61/108 | 40/76 |
KBIT2 | Mean (SD) | 52.38 (18.94) | 106.43 (10.66) |
KBIT2 | Min/Max | 40/107 | 79/122 |
Nonverbal intelligence age-equivalent (yr;mth) | Mean (SD) | 5;4 (2;11) | 8;11 (1;10) |
Nonverbal intelligence age-equivalent (yr;mth) | Min/Max | 3;11/14;8 | 5;2/12;8 |
SRS total T-score ≤59T indicates normal range; ≥76T indicates severe range strongly associated with autistic disorder.
Training Block – Levels of Progression
Level 1 (Maximum scaffolding) | Unfiltered sentences (n=3) with matching semantic content and affective prosody presented, direct teaching by the tester with verbal explanation and gestural modeling (e.g., “I am sitting” and pointing to neutral face). |
Level 2a (Moderate scaffolding) | Filtered Level 1 sentences (n=3) maintaining prosodic contour without semantic content presented, direct teaching by the tester with verbal explanation and gestural modeling (e.g., “she feels in the middle” and pointing to neutral face). |
Level 2b (Minimal scaffolding) | Filtered Level 1 sentences (n=3) maintaining prosodic contour without semantic content presented, independent response by the participant with verbal correction and gestural modeling if an error was made. |
Level 3 (No scaffolding) | Filtered sentences (n=9) from the original stimulus set used during the pretest presented, independent response by the participant with no feedback. |
% Correct Recognition across Prosodic Conditions by Group
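Stimulus prosody | ASD response: Angry | ASD response: Neutral | ASD response: Happy | TD response: Angry | TD response: Neutral | TD response: Happy |
---|---|---|---|---|---|---|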
Angry prosody | 30.2 | 41.9 | 27.9 | 34.2 | 47.8 | 17.9 |
Neutral prosody | 32.3 | 36.6 | 31.1 | 25.6 | 65.1 | 9.4 |
Happy prosody | 34.4 | 30.1 | 34.4 | 21.9 | 25.3 | 52.7 |
% Correct Recognition of Prosodic Conditions by TD Participants across Age Clusters
Prosodic condition | Recognition (% Correct) in Experiment | Recognition (% Correct) in Pilot | | Recognition (% Correct) in Pretest |
---|---|---|---|---|
Angry | 34.2 | 34.9 | 61.9 | 87.6 |
Neutral | 65.1 | 85.7 | 81.0 | 94.3 |
Happy | 52.7 | 79.4 | 61.9 | 93.3 |
Chronological Age, Nonverbal Intelligence Age-Equivalence, and % Correct Recognition by Individual ASD Participant
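Participant | CA (yr;mth) | NVIA AE (yr;mth) | Overall (% correct) | Angry (% correct) | Neutral (% correct) | Happy (% correct) |
---|---|---|---|---|---|---|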
ASD1 | 12;2 | 6;3 | 38.1 | 42.9 | 33.3 | 38.1 |
ASD2 | 12;10 | 4;0 | 39.7 | 33.3 | 42.9 | 42.9 |
ASD3 | 10;7 | 4;0 | 36.5 | 23.8 | 47.6 | 38.1 |
ASD4 | 12;6 | 6;0 | 31.7 | 38.1 | 38.1 | 19.0 |
ASD5 | 12;1 | 14;8 | 46.0 | 14.3 | 61.9 | 61.9 |
ASD6 | 11;3 | 4;10 | 31.7 | 38.1 | 28.6 | 28.6 |
ASD7 | 11;5 | 4;8 | 19.0 | 14.3 | 28.6 | 14.3 |
ASD8 | 11;9 | 3;11 | 30.2 | 19.0 | 38.1 | 33.3 |
ASD9 | 10;0 | 3;11 | 34.9 | 42.9 | 23.8 | 38.1 |
ASD10 | 10;11 | 4;10 | 41.3 | 52.4 | 38.1 | 33.3 |
ASD11 | 12;1 | 3;11 | 17.5 | 23.8 | 19.0 | 9.5 |
ASD12 | 9;10 | 3;11 | 34.9 | 23.8 | 38.1 | 42.9 |
ASD13 | 12;10 | 3;11 | 25.4 | 14.3 | 28.6 | 33.3 |
CA=chronological age; NVIA AE=nonverbal intelligence age-equivalence.
Significant Regression Models for Accuracy
Dependent variable: Mean accuracy for neutral sentences
■ KBIT2, B=0.328,
■ SRS T-score, B=0.382,
Dependent variable: Mean accuracy for happy sentences
■ KBIT2, B=0.374,