Voice conversion is a technique for changing the speech uttered by a source speaker so that it sounds as if it were spoken by a different, target speaker. Here, an approach for static voice conversion is developed and implemented. Static speech parameters are those over which the speaker has the least control, such as vocal tract structure and the natural pitch of speech. Two main parameters are considered here: vocal tract structure and pitch. Two different approaches are studied and implemented in MATLAB.
In the first approach, the source and target speech are resolved into an excitation component and a filter component using an LPC based source-filter technique, and pitch modification is achieved using a method called PSOLA (Pitch Synchronous Overlap-Add). The second approach is based on a speech generation model governed by voicing detection. For voiced frames, pitch is estimated using the autocorrelation method, and the excitation component is generated using a set of signal generators driven by a voicing detection flag. The filter coefficients are modified to approach the target speaker's coefficients.
Finally, a user-friendly demo with a MATLAB GUI is developed to demonstrate the idea behind the system. This field of speech technology can contribute greatly to the entertainment industry, and it can significantly reduce the database size of multiple-speaker TTS (Text-to-Speech) systems, making them more convenient to implement on portable devices.
VOICE CONVERSION INTRODUCTION
Voice conversion is the process of transforming the parameters of a source voice into those of a target voice. The source voice is recorded speech, whereas the target voice can be either another speech recording or a set of more general descriptors such as pitch, formant frequencies, or prosody. These general descriptors can be specified indirectly in terms of age, gender and speaking style.
SPEECH SIGNAL ANALYSIS
Human Speech Production System:
To understand the voice conversion process, it is necessary to understand the human speech production process and the parameters that distinguish one person's voice from another's.
Figure: The anatomy of the human speech production system.
The human speech production system begins with the lungs and ends with the mouth and nasal cavity. Neural signals from the brain are the driving and controlling element throughout the speech production process.
Modeling the Speech Signal:
A speech signal can be modeled as an unstructured signal generated by a source (the lungs) and passed through an interconnection of systems that structures the signal to yield speech. The system can be modeled as either linear or nonlinear. Although a linear model does not mimic the exact behavior, it is preferred because it provides a fair amount of accuracy and is easy to implement.
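As an illustration of the linear model, the following is a minimal Python sketch (the project itself was implemented in MATLAB) of an all-pole source-filter pair: an excitation is passed through a filter 1/A(z), and the filter coefficients are then re-estimated with the autocorrelation method and the Levinson-Durbin recursion. The function names, the one-pole test filter and the single-impulse excitation are assumptions chosen for illustration, not the author's code.

```python
def all_pole_filter(excitation, a):
    """Pass the excitation through 1/A(z), A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * out[n - k]
        out.append(acc)
    return out


def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] as sums of pointwise products."""
    return [sum(x[n] * x[n + lag] for n in range(len(x) - lag))
            for lag in range(max_lag + 1)]


def lpc(x, order):
    """Estimate A(z) with the Levinson-Durbin recursion (autocorrelation method)."""
    r = autocorr(x, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    if err == 0.0:
        return a, err  # silent frame: nothing to estimate
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


# Hypothetical one-pole "vocal tract" excited by a single impulse.
true_a = [1.0, -0.5]
excitation = [1.0] + [0.0] * 63
speech = all_pole_filter(excitation, true_a)
estimated_a, residual_energy = lpc(speech, 2)
```

In this toy case the estimated coefficients come out very close to the ones used for synthesis, which is the sense in which the filter component can be separated from the excitation.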
PSOLA BASED APPROACH
Aim of the approach
The aim of this approach is to modify the source pitch to match the target pitch. This cannot be done by simply increasing the pitch value (that is, decreasing the pitch period), because doing so compresses or expands the time scale and the speech no longer remains intelligible.
TD-PSOLA stands for Time-Domain Pitch Synchronous Overlap-Add. It is a simple and effective algorithm for both time-scale and pitch-scale modification. The idea is to process the speech signal on a short-time basis, with the segments obtained pitch synchronously. These segments are then concatenated in an appropriate manner to obtain the desired modification. The main steps of the algorithm are explained here.
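The time-scale problem can be seen in a small Python sketch. It assumes that "simply changing the pitch" means reading the samples faster or slower (naive resampling, here with linear interpolation); the function name and the 8 kHz sampling rate are illustrative assumptions.

```python
def resample_linear(signal, factor):
    """Read the signal `factor` times faster, using linear interpolation."""
    out = []
    pos = 0.0
    last = len(signal) - 1
    while pos <= last:
        i = int(pos)
        frac = pos - i
        nxt = signal[min(i + 1, last)]
        out.append((1.0 - frac) * signal[i] + frac * nxt)
        pos += factor
    return out


fs = 8000                      # assumed sampling rate in Hz
source = [0.0] * 800           # 0.1 s of placeholder samples
raised = resample_linear(source, 2.0)    # pitch doubled
lowered = resample_linear(source, 0.5)   # pitch halved

duration_source = len(source) / fs
duration_raised = len(raised) / fs
duration_lowered = len(lowered) / fs
```

Doubling the pitch this way halves the duration, and halving the pitch roughly doubles it, which is why a pitch-synchronous method is used instead.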
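The idea can be sketched in Python as follows. This is a simplified pitch-scale version only, under several assumptions that are not taken from the original work: the pitch marks are already known, each segment is a Hann-windowed span of two pitch periods centred on a mark, the output is normalised by the summed window weights, and the duration is kept unchanged by reusing the nearest analysis segment. All names are hypothetical.

```python
import math


def hann(length):
    """Symmetric Hann window with zero-valued end points."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]


def td_psola(signal, marks, pitch_factor):
    """Change pitch by `pitch_factor` (> 1 raises it) while keeping the duration.

    `marks` are sorted analysis pitch-mark sample indices (at least two).
    """
    n = len(signal)
    out = [0.0] * n
    norm = [0.0] * n
    periods = [marks[i + 1] - marks[i] for i in range(len(marks) - 1)]
    periods.append(periods[-1])
    t = float(marks[0])                      # current synthesis pitch mark
    while t <= marks[-1]:
        # Reuse the analysis segment whose mark is closest to the synthesis mark.
        i = min(range(len(marks)), key=lambda j: abs(marks[j] - t))
        p = periods[i]
        window = hann(2 * p + 1)             # two pitch periods wide
        centre = int(round(t))
        for k in range(-p, p + 1):
            src = marks[i] + k
            dst = centre + k
            if 0 <= src < n and 0 <= dst < n:
                w = window[k + p]
                out[dst] += w * signal[src]  # overlap-add
                norm[dst] += w
        t += p / pitch_factor                # new spacing between pitch marks
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


# Toy input: an impulse train with a pitch period of 40 samples.
train = [1.0 if i % 40 == 0 else 0.0 for i in range(800)]
pitch_marks = list(range(40, 800, 40))
converted = td_psola(train, pitch_marks, 2.0)
pulse_positions = [i for i, v in enumerate(converted) if v > 0.1]
```

For this toy impulse train, the output pulses are spaced 20 samples apart instead of 40 while the signal length is unchanged, so the pitch is doubled without altering the time scale.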
SPEECH SYNTHESIZER APPROACH
Auto-Correlation based approach
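In this method, we first take a voiced frame of the signal and compute its autocorrelation function r(s), defined as the sum of the pointwise products of the frame and a copy of itself shifted by a lag s, taken over some interval. The pitch period is the lag at which r(s) has its largest peak away from s = 0. (A sum of pointwise absolute differences defines the related average magnitude difference function, which has a minimum rather than a maximum at the pitch period.)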
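A minimal Python sketch of autocorrelation-based pitch estimation is given below. The search range of 60 to 400 Hz, the voicing threshold of 0.3 and the function names are assumptions made for illustration and are not taken from the original implementation.

```python
import math


def autocorrelation(frame, lag):
    """r(lag): sum of pointwise products of the frame and its shifted copy."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))


def estimate_pitch(frame, fs, f_min=60.0, f_max=400.0, voicing_threshold=0.3):
    """Return the pitch in Hz, or None if the frame looks unvoiced or silent."""
    r0 = autocorrelation(frame, 0)
    if r0 == 0.0:
        return None                                   # silent frame
    lag_min = int(fs / f_max)
    lag_max = min(int(fs / f_min), len(frame) - 1)
    best_lag = max(range(lag_min, lag_max + 1),
                   key=lambda s: autocorrelation(frame, s))
    if autocorrelation(frame, best_lag) / r0 < voicing_threshold:
        return None                                   # weak peak: treat as unvoiced
    return fs / best_lag


fs = 8000                                             # assumed sampling rate in Hz
voiced_frame = [math.sin(2.0 * math.pi * 200.0 * n / fs) for n in range(400)]
pitch_hz = estimate_pitch(voiced_frame, fs)
silent_pitch = estimate_pitch([0.0] * 400, fs)
```

The returned None value plays the role of a voicing detection flag: a frame with no strong autocorrelation peak is treated as unvoiced and gets no pitch estimate.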
VOICE CONVERSION DEMO
The first demo is a GUI-based program made in MATLAB 7.0 that gives basic insight into how basic parameter modifications enable the voice to be changed effectively. The figure is a snapshot of the main screen of the GUI.
The second demo is also a MATLAB GUI program, which presents the conversion process described in the PSOLA based approach. This GUI takes recorded source and target speech as inputs, extracts the parameters and carries out the conversion as described.
Demo 3 is an implementation of the speech synthesizer approach as a MATLAB program. The results obtained are not very encouraging, yet they are displayed here for two test files. The converted voice definitely resembles the target sound, but it is too "noisy" to be used any further.
IMPROVEMENTS AND FUTURE WORK
The results obtained suggest that the PSOLA based approach outperforms the synthesizer based approach; a noise cancellation strategy could further improve the quality. The system can also be made more efficient by including prosodic modifications and time alignment.
The system would become more robust, efficient and generic if training were implemented. Because voice conversion is a new field in speech technology, there is a lot of scope for improving the system.
Here, two different approaches are developed to achieve voice conversion. The MATLAB demos developed here give a primitive insight into the field of voice conversion. The system discussed here operates on speech samples that have already been time aligned. Considerable effort is still required to implement the modifications that would make the present system more robust, efficient and generic.
One such modification is training. An ideal voice conversion system should include a training phase, so that the system can be trained with target speech and then used to convert any arbitrary speech uttered by the source speaker. This could not be done in the present work because of time restrictions and inadequate knowledge at the current stage.
High quality transformations can be obtained with more complex and computationally expensive techniques. Real-time voice conversion can also be achieved with powerful digital signal processors or similar hardware. Voice conversion is still a largely unexplored field in speech technology, and it awaits many contributions from speech researchers in the coming years.
Source: Nirma University
Author: Akash Mecwan