Speech is the most common type of communication between humans and is often described
as the most natural form of human communication. With little effort, people use speech
to communicate, learn and share messages. However, human communication
is not limited to vocal sounds alone; it is complemented by nonverbal cues.
Speech is often accompanied by various gestures, facial expressions, posture, touch
and so on, which are perceived unconsciously by all the senses available in a given
situation. The information thus gathered is processed in the brain, enabling us,
just as unconsciously, to interpret the message correctly and recognise its
context. Nonverbal communication is therefore an important supplement to
vocal communication: it conveys additional information that
makes it possible to comprehend the message efficiently and place it into context.
The aim of this doctoral dissertation is to research the formation and perception of
vocal communication. Vocal communication can be defined as the utterance of words
in a certain language and the accompanying nonverbal signs, which are often hidden.
Nonetheless, with attentive listening the recipient can easily recognise these signs and respond
to them accordingly. Nonverbal communication modifies the acoustic message and is frequently
described as paralanguage. It is an integral part of vocal communication and can
be divided into several components: rhythm, tone, intonation, slips of the tongue, word
emphases, pauses and silence. The sum of these components, combined with the uttered
words, forms the entirety of vocal communication.
Another component of paralanguage is the set of paralinguistic states of the speaker, and
emotions represent a distinctive part of these states. Speakers who are experiencing
various emotional states will often modify their speech accordingly and communicate
it with unique nonverbal signs. People are rarely aware of how they modify their
speech, yet this is precisely what often helps to recognise
the true meaning of the communicated message. The recipient of an emotionally expressed
vocal message can thus easily recognise such vocal modifications and classify
the message, albeit unconsciously, into a certain group of the interlocutor’s emotional
states. This unconscious classification, as well as the unconscious formation of emotional
messages, is a part of our day-to-day lives and influences verbal communication,
comprehension and, last but not least, the perception of messages. The combination
of verbal communication and all of its paralanguage components represents the entirety
of a message, which is formed and perceived unconsciously. Speech, together with all
the elements of nonverbal communication, is thus one of our most natural means of
communication, experienced daily and spontaneously.
Ever since the beginning of the digital era, researchers have wished to develop a way
in which humans and machines could interact most naturally, i.e. by speaking to each
other. Such human-machine verbal dialogue ought to reflect interpersonal communication
as closely as possible. This means that both machine and human form and receive
verbal messages. The reception of messages by machines is defined as the problem of
speech recognition, while the formation of speech is defined as speech synthesis. Both
fields have many common characteristics, and speech synthesis is often described as the
inverse process of speech recognition. Recently, the principles and processes involved
in both have been significantly refined. However, despite having increasingly more
powerful machines such as personal computers, smartphones and other modern-day
digital devices, we still do not communicate with them verbally. Besides the inherently
difficult research work required in the field of speech technologies, one reason for this
could be language diversity. The fact that systems for speech modeling and
synthesis are highly dependent on the language involved means that specific acoustic
and lexical research must be carried out on each language separately. At the time
of this writing, there are only a handful of languages for which systems for limited
human-machine dialogue have been developed. Unfortunately, the majority of languages
still lack such systems. One of the reasons for this could be the absence of
individual language databases, which are necessary for the implementation of already
developed solutions. Only well annotated and sufficiently large speech databases make
the development of such systems possible.
The dissertation treats the development of systems for artificial synthesis of Slovenian
speech. The main goal of these systems is to produce artificial speech that is
understandable and natural. Artificial speech often does not sufficiently
resemble natural speech, which is why researchers endeavour to
develop systems with improved performance in both respects. If they had access
to a speech database which was large enough to reflect all the characteristics of the
language of a particular speaker, they would undoubtedly be able to create a superior
system. Unfortunately, there are no such databases available at this time. The development
of well-performing systems is thus held back by the amount of data available in speech
databases.
Because building speech databases is a lengthy and costly process, smaller and more
specialized databases are often produced. For the purpose of making artificial speech
more natural, a recent trend in database production has been to add labels describing
individual paralanguage components and speaker’s emotional states. Speech synthesis
requires the analysis of as many utterances by a single speaker as possible. With access
to such information, modern approaches to speech synthesis are able to model the
characteristics of an individual’s speech reasonably well. If we add emotion labels, we
are able to model emotional characteristics as well. Again, however, this is only possible
if there are enough samples available of an individual’s speech in a particular emotional
state. Additionally, acquiring sufficiently large quantities of emotional speech samples
is not the only problem encountered when building a database; there is also the question
of clear definition. It is impossible to unambiguously define a speaker’s emotional state.
Therefore, its perception is always subjective and dependent on the recipient. People
will inevitably vary in their interpretations of emotional states, especially when the
speaker is someone they do not know. The acquisition of quality labels is one of the
major problems in emotional speech sampling and will be treated extensively in this
doctoral dissertation.
Modern literature records two distinct approaches to speech synthesis. One focuses
on joining natural speech segments, while the other is based on the parametrization and
modeling of speech segments. The main characteristic of the first approach is its ability
to produce more natural-sounding artificial speech, since it uses segments of real
recordings, while the second produces artificial speech by modeling speech segments and
using acoustic units. A major distinction between the two is in the amount of resources
needed to produce working systems, with the joining approach requiring significantly
larger databases than the modeling approach. This is compounded when trying to
implement components of paralanguage or emotional states. In this case, the joining
approach needs even more data.
Since emotional states are hard to define, we can expect an insufficient amount
of quality emotion resources. Because of this, we based our development of a system
for artificial synthesis of Slovenian emotional speech on parametric speech models. We
obtain these models with the use of hidden Markov models (HMM). Building the system
on the basis of speech parametrization enables us to model the speech with the use of
statistical models estimated from the speech database. By modifying the parameters
of these statistical models, we can change the acoustic and intonational properties of the
speech as well as its duration. This is performed through the processes of adaptation and
interpolation, which in this doctoral dissertation are also used to produce emotional
states.
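To make the idea of interpolation concrete, the following minimal sketch shows how the Gaussian output distributions of two trained models, e.g. a neutral and an emotional one, can be combined with a single weight. The parameter values are hypothetical placeholders; the actual system operates on complete HMM parameter sets rather than a single state.

```python
import numpy as np

def interpolate_gaussians(mu_a, var_a, mu_b, var_b, alpha):
    """Linearly interpolate two Gaussian output distributions;
    alpha = 0 yields model A, alpha = 1 yields model B."""
    mu = (1.0 - alpha) * mu_a + alpha * mu_b
    var = (1.0 - alpha) * var_a + alpha * var_b
    return mu, var

# Hypothetical state-level spectral parameters (e.g. mel-cepstral means)
mu_neutral, var_neutral = np.zeros(25), np.ones(25)
mu_angry, var_angry = np.full(25, 0.8), np.full(25, 1.5)

# A model half way between neutral and angry speech
mu_mix, var_mix = interpolate_gaussians(mu_neutral, var_neutral,
                                        mu_angry, var_angry, alpha=0.5)
```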
Every system for artificial speech synthesis needs to be evaluated. As mentioned
before, there are two levels of criteria: on the first level, we evaluate how understandable
the speech is, and on the second we evaluate its naturalness. The evaluation can
be performed with a process similar to the one used for building the emotion database:
the recordings of synthesized speech are evaluated by human evaluators, who fill out
questionnaires to determine whether a particular recording exhibits the required emotional
states. To guarantee a reliable evaluation, we need a large number of evaluators
working on large amounts of artificially synthesized emotional speech samples. This
process is categorised as subjective evaluation; such evaluations are known to be both
lengthy and costly. That is why researchers endeavour to develop faster and more objective
ways of evaluating their systems. However, at the time of this writing, there are no
reliable and objective techniques available which would offer faster and more efficient
evaluation of systems for emotional speech synthesis.
The doctoral dissertation focuses on the development of a system for artificial
emotional speech synthesis in the Slovenian language. We build all the components
necessary for the development of a parametric system for artificial speech synthesis.
Through modifying existing methods based on hidden Markov models (HMM), we propose
a new technique to build a system for Slovenian emotional speech that works with
limited emotion resources. The proposed technique is based on the statistical analysis of
the quality of labels applied to emotional speech recordings. Such an approach enables
us to extract specific information expressed by speakers in certain emotional states using
only small amounts of emotional speech. Emotionally neutral speech resources also
play an important role in this process. Such recordings are usually the most abundant
in emotional speech databases and thus represent the basis for building a system for
artificial emotional speech synthesis. Emotionally neutral resources can be used with
HMM techniques to develop a basic statistical model. Adaptation techniques allow us
to transform statistical models of natural speech which scored well in evaluation into
statistical models of a particular emotional state. Once we have such a model, we can
use it to synthesize high quality speech in the target emotional state.
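As an illustration of the adaptation step, the sketch below performs a simplified, mean-only MAP-style update that pulls a neutral Gaussian mean towards scarce emotional adaptation data. All names and data are hypothetical; the dissertation relies on established HMM adaptation techniques, of which this is only a toy analogue.

```python
import numpy as np

def map_adapt_mean(mu_neutral, frames_emotional, tau=10.0):
    """Mean-only MAP-style update: shift a neutral Gaussian mean towards
    the sample mean of the emotional adaptation frames. tau controls how
    strongly the prior (neutral) mean is retained."""
    n = len(frames_emotional)
    mu_data = frames_emotional.mean(axis=0)
    return (tau * mu_neutral + n * mu_data) / (tau + n)

# Hypothetical data: a neutral state mean and a few emotional frames
rng = np.random.default_rng(0)
mu_neutral = np.zeros(25)
frames = rng.normal(loc=0.6, scale=1.0, size=(40, 25))  # scarce data

mu_adapted = map_adapt_mean(mu_neutral, frames)
```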
Another innovation introduced by the dissertation pertains to objective evaluation
of systems for emotional speech synthesis. We propose a technique based on
Euclidean distance between mel-cepstral feature vectors of the original and artificial
speech recordings. The dissimilarities exhibited by each artificially synthesized emotional
speech recording represent the measure of its closeness to the original recording.
The smaller the dissimilarities, the closer the artificial recording is to the original. If
the latter is annotated with emotion labels, we can use this method of verification to
automatically acquire a result that expresses whether or not the system for artificial
speech synthesis produced speech that is closest to the original.
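A minimal sketch of such a distance measure is given below. It approximates mel-cepstral features with MFCCs and uses the librosa library for feature extraction and dynamic time warping; both choices are assumptions for illustration, not the dissertation's exact implementation.

```python
import numpy as np
import librosa  # assumed available for feature extraction and DTW

def melcep_distance(path_original, path_synth, n_mfcc=25):
    """Mean Euclidean distance between DTW-aligned mel-cepstral
    feature vectors of an original and a synthesized recording."""
    y_o, sr_o = librosa.load(path_original, sr=None)
    y_s, _ = librosa.load(path_synth, sr=sr_o)
    mc_o = librosa.feature.mfcc(y=y_o, sr=sr_o, n_mfcc=n_mfcc)
    mc_s = librosa.feature.mfcc(y=y_s, sr=sr_o, n_mfcc=n_mfcc)
    # Align the two sequences, since their durations generally differ
    _, wp = librosa.sequence.dtw(X=mc_o, Y=mc_s, metric='euclidean')
    dists = [np.linalg.norm(mc_o[:, i] - mc_s[:, j]) for i, j in wp]
    return float(np.mean(dists))
```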
We also present the new Slovenian emotional speech database that we built from the
recordings of Slovenian radio plays. The acquisition, labeling and further processing
of recordings was performed with full permission from RTV Slovenija. Although the
resources involve emotional states that are acted out, we presume them to be similar to the
emotional states that arise in spontaneous speech. This presumption is based on the
wider context of plays and the dialogues between the protagonists. The actors carry
out their roles with a wide array of emotional states expressed by emotional speech.
This means that we are not limited merely to one play, but can instead gather acoustic
material of a particular actor or actress from several plays.
An important factor when gathering acoustic recordings is their quality.
Radio plays are generally recorded with professional studio equipment, so the recordings
are of sufficient quality to allow further processing. Based on the examples from
one actor and one actress, the dissertation presents the methodology required for the
extraction of acoustic emotion resources from radio plays. Through measures of agreement
between evaluators, we illustrate the problem of the individual perception of emotional
states. High-quality labeled resources were obtained by having the same evaluators
perform the evaluation twice at different times. This also allowed us to check the
consistency of the individuals’ perceptions of emotional states. Besides the recordings
and their transcriptions, the resulting database contains emotion labels with scores expressing
their quality. The fact that our emotion labels are scored puts this database
among the few Slovenian emotional speech databases that contain such information.
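As a simple illustration, the consistency of a single evaluator across the two labeling rounds can be quantified with an agreement measure such as Cohen's kappa; the labels below are hypothetical examples.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical emotion labels from two labeling rounds by the same evaluator
round_1 = ["neutral", "anger", "joy", "neutral", "sadness", "anger"]
round_2 = ["neutral", "anger", "neutral", "neutral", "sadness", "anger"]

# kappa = 1 means perfect agreement; 0 means chance-level agreement
kappa = cohen_kappa_score(round_1, round_2)
print(f"intra-evaluator agreement (Cohen's kappa): {kappa:.2f}")
```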
The doctoral dissertation presents all the above-mentioned innovations. It is
divided into six chapters. The introduction comprises a presentation of the topic, a
description of the research goals determined at the start of research, and a detailed
overview of the content. In the second chapter, we frame our work within the wider
field of speech technologies and highlight the existing techniques that form the basis
for the development of systems for artificial emotional speech synthesis. At the same
time, we attempt to explain the research paths chosen by providing a broad overview
of the treated research field.
The new Slovenian emotional speech database, as well as the methodology used
in its development, are presented in the third chapter. We focus on the difficulty of
labeling emotional states in speech and underline it with the results of two separate
labelings of selected emotional recordings carried out by the same evaluators. Double
labeling of emotional states provides us with an insight into the evaluators’ consistency.
The labels are analysed and an objective evaluation of emotional speech is given with
the use of an automated system for distinguishing speaker-dependent emotional states.
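The summary does not prescribe a particular classifier for this automated system; as one possible sketch, a speaker-dependent emotion recogniser could be built with an SVM over utterance-level features. The data below is a synthetic placeholder, so the reported accuracy is meaningless beyond showing the workflow.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Placeholder data: one feature vector (e.g. utterance-level statistics
# of mel-cepstral coefficients) per recording of a single speaker
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 50))    # 120 utterances, 50 features each
y = rng.integers(0, 4, size=120)  # 4 hypothetical emotion classes

# Cross-validated recognition accuracy of a speaker-dependent classifier
clf = SVC(kernel="rbf")
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean recognition accuracy: {scores.mean():.2f}")
```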
The fourth chapter focuses on describing the proposed method of artificial emotional
speech synthesis based on the quality of labels applied to emotion resources. We start
by presenting the existing method of synthesizing artificial emotional speech on the
basis of HMM modeling. For the sake of clarity, we divide the process into individual
components, which also allows us to emphasise the differences encountered in developing
the system for the synthesis of emotional states. In the following part, we continue with the
description of the adjustments to the process involving the developed emotional speech
database, where we make good use of the quality of labels applied to emotion resources.
The issue of evaluating the systems for artificial speech synthesis is treated in the
fifth chapter. Here, we describe the existing subjective and objective evaluation techniques.
Special attention is given to the evaluation of emotionally expressed artificial
speech and the proposed method for objective evaluation is presented. Our method is
based on the verification of artificially synthesized emotional speech recordings. The
process of verification involves the comparison between text-dependent artificially synthesized
signals and their original recordings. If the target emotional state label corresponds
to the original one, we can consider the artificially synthesized recording as
the closest possible approximation of the original recording. We conclude the chapter
with a presentation of the results of the evaluation of the developed system for artificial
Slovenian emotional speech synthesis, which was developed on the basis of emotion
resources in the EmoLUKS database.
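To make this verification rule concrete, the sketch below (reusing the melcep_distance sketch from the discussion of objective evaluation above, with hypothetical file names) selects the emotional state whose synthesized version lies closest to the original and checks it against the original label.

```python
def verify_emotion(original_path, synth_paths_by_emotion, true_label):
    """Compare an original recording with text-dependent synthesized
    versions of the same utterance, one per emotional state; the emotion
    whose synthesis is closest is taken as the recognised label."""
    distances = {emotion: melcep_distance(original_path, path)
                 for emotion, path in synth_paths_by_emotion.items()}
    recognised = min(distances, key=distances.get)
    return recognised == true_label

# Hypothetical usage: does the 'anger' synthesis match an angry original?
# verify_emotion("orig_anger.wav",
#                {"neutral": "synth_neutral.wav",
#                 "anger": "synth_anger.wav",
#                 "sadness": "synth_sadness.wav"},
#                true_label="anger")
```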
In the final chapter, we summarise the most important achievements of the dissertation
and attempt to evaluate them. We close the chapter by proposing directions for
further research and giving guidelines from our findings for possible improvements to
the systems for artificial Slovenian emotional speech synthesis.