Development of a Voice Conversion System

ABSTRACT

Voice conversion is a technique for changing the speech uttered by a source speaker so that it sounds as if it were spoken by a target speaker. Here, an approach for static voice conversion is developed and implemented. Static speech parameters are those over which the speaker has the least control, such as vocal tract structure and the natural pitch of speech. Two main parameters are considered: vocal tract structure and pitch. Two different approaches are studied and implemented in MATLAB.

In the first approach, the source and target speech are resolved into an excitation component and a filter component using an LPC-based source-filter technique, and pitch modification is achieved using a method called PSOLA (Pitch Synchronous Overlap-Add). The second approach is based on a speech generation model governed by voicing detection. For voiced frames, pitch is estimated using the autocorrelation method, and the excitation component is generated using a set of signal generators driven by a voicing detection flag. Filter coefficients are modified to approach the target speaker's coefficients.

Finally, a user-friendly demo using a MATLAB GUI is developed, which demonstrates the idea behind the system. This field of speech technology can contribute greatly to the entertainment industry and can significantly reduce the database size for multiple-speaker TTS (Text-to-Speech) systems, making them more convenient to implement on portable devices.

VOICE CONVERSION INTRODUCTION

Voice conversion is the process of transforming the parameters of a source voice into those of a target voice. The source voice is recorded speech, whereas the target voice can be either another recording or more general descriptors such as pitch, formant frequencies, or prosody. These general descriptors can be specified indirectly in terms of age, gender, and speaking style.

Prosodic Dependencies of Human Speech

General Framework:

General Framework of a Voice Conversion System

SPEECH SIGNAL ANALYSIS

Human Speech Production System:

To understand the voice conversion process, it is necessary to understand the human speech production process and the parameters responsible for the differences between human voices.

Anatomically, the human speech production system begins with the lungs and ends with the mouth and nasal cavity, with neural signals from the brain acting as the driving or controlling element of the whole speech production process.

Human Speech Production System

Modeling the Speech Signal:

A speech signal can be modeled as an unstructured signal generated by a source (the lungs) and passed through an interconnection of systems that structure the signal to yield speech. The system can be modeled as either linear or nonlinear. Although a linear model does not mimic the exact behavior, it is preferred because it provides fair accuracy and is easy to implement.
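A minimal sketch of the linear source-filter idea, written in Python rather than the MATLAB used for the project. It assumes an idealized impulse-train excitation and a single two-pole resonance standing in for the vocal tract; the function names and the numeric values (8 kHz sampling, 100 Hz pitch, 700 Hz resonance) are illustrative assumptions, not taken from the original system.

```python
import math

def impulse_train(num_samples, period):
    """Idealized voiced excitation: one unit impulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]

def all_pole_filter(excitation, coeffs):
    """Linear all-pole filter: y[n] = x[n] - sum_k coeffs[k-1] * y[n-k]."""
    output = []
    for n, x in enumerate(excitation):
        y = x
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                y -= a * output[n - k]
        output.append(y)
    return output

def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Denominator coefficients of one two-pole resonance (a crude 'formant')."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)   # pole radius < 1
    theta = 2.0 * math.pi * freq_hz / sample_rate         # pole angle
    return [-2.0 * r * math.cos(theta), r * r]

# Assumed illustrative values, not parameters from the original report.
sample_rate = 8000
excitation = impulse_train(400, period=80)        # 100 Hz pitch at 8 kHz
vocal_tract = resonator_coeffs(700.0, 100.0, sample_rate)
speech_like = all_pole_filter(excitation, vocal_tract)
```

In this picture the excitation carries the pitch and the filter coefficients carry the vocal tract shape, which is why the two can be modified separately.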

Block Diagram of the Speech Signal Production Process

PSOLA BASED APPROACH

Aim of the approach

The aim of this approach is to modify the source pitch to match the target pitch. This cannot be done by simply increasing the pitch value or decreasing the pitch period, as this compresses or expands the time scale and the speech no longer remains intelligible.
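The time-scale problem can be seen in a small Python sketch. It assumes the naive pitch change is done by reading the waveform faster with linear interpolation; `resample_linear` is a hypothetical helper introduced only for this illustration.

```python
import math

def resample_linear(signal, factor):
    """Read `signal` `factor` times faster using linear interpolation."""
    out_len = int((len(signal) - 1) / factor) + 1
    out = []
    for m in range(out_len):
        pos = m * factor
        i = int(pos)
        frac = pos - i
        nxt = signal[min(i + 1, len(signal) - 1)]
        out.append((1.0 - frac) * signal[i] + frac * nxt)
    return out

# Assumed test tone: 100 Hz sine, 8 kHz sampling, 0.1 s long (800 samples).
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 100.0 * n / sample_rate) for n in range(800)]

# Doubling the pitch this way halves the pitch period but also the duration.
raised = resample_linear(tone, 2.0)
original_duration = len(tone) / sample_rate
raised_duration = len(raised) / sample_rate
```

The pitch period drops from 80 to 40 samples, but the signal also shrinks from 800 to 400 samples, so the utterance is spoken twice as fast. A pitch-synchronous method is needed to change one without the other.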

TD-PSOLA

TD-PSOLA stands for Time-Domain Pitch Synchronous Overlap-Add. It is a simple and effective algorithm for both time-scale and pitch-scale modifications. The idea is to process the speech signal on a short-time basis, where the segments are obtained pitch synchronously. These segments are concatenated in an appropriate manner to obtain the desired modifications. The main steps of the algorithm are explained here.
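The overlap-add idea can be sketched in Python under strong simplifying assumptions: the pitch marks are already known, the signal is voiced throughout with a roughly constant period, and no gain normalization is applied. The function `td_psola` and its parameters are hypothetical names for this illustration, not the routine used in the project.

```python
import math

def hann(length):
    """Symmetric Hann window (length >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def td_psola(signal, pitch_marks, pitch_scale):
    """Scale pitch by `pitch_scale` while keeping the duration unchanged.

    pitch_marks: sorted sample indices, one per pitch period (assumed given).
    """
    output = [0.0] * len(signal)
    # Average analysis period, estimated from the spacing of the marks.
    period = (pitch_marks[-1] - pitch_marks[0]) / (len(pitch_marks) - 1)
    new_period = period / pitch_scale
    half = int(round(period))
    window = hann(2 * half + 1)          # two-period window per segment

    t = float(pitch_marks[0])            # current synthesis mark
    while t <= pitch_marks[-1]:
        # Pick the analysis segment whose mark is closest to this position.
        mark = min(pitch_marks, key=lambda m: abs(m - t))
        center = int(round(t))
        # Overlap-add the windowed segment at the synthesis mark.
        for k in range(-half, half + 1):
            src, dst = mark + k, center + k
            if 0 <= src < len(signal) and 0 <= dst < len(output):
                output[dst] += window[k + half] * signal[src]
        t += new_period
    return output

# Assumed toy input: impulse train with an 80-sample period.
source = [1.0 if n % 80 == 0 else 0.0 for n in range(800)]
marks = list(range(80, 721, 80))
shifted = td_psola(source, marks, pitch_scale=2.0)   # pulses now 40 apart
```

Segments are repeated when the pitch is raised and skipped when it is lowered, so the output keeps the length of the input. A practical implementation would also need pitch-mark detection, handling of unvoiced frames, and amplitude compensation.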

SPEECH SYNTHESIZER APPROACH:

Auto-Correlation based approach

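In this method, the voiced frame of the signal is first used to generate the autocorrelation function r(s), defined as the sum of the pointwise product of the frame and a copy of itself shifted by s samples over some interval. (A sum of pointwise absolute differences gives the related average magnitude difference function, which has minima rather than maxima at multiples of the pitch period.)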
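A minimal Python sketch of autocorrelation-based pitch estimation for one voiced frame. It assumes the conventional product-sum definition of autocorrelation and a pitch search range of 60 to 400 Hz; both the function names and the search range are assumptions made for illustration, not values from the original report.

```python
import math

def autocorrelation(frame, shift):
    """r(shift): sum of pointwise products of the frame and its shifted copy."""
    return sum(frame[n] * frame[n + shift] for n in range(len(frame) - shift))

def estimate_pitch(frame, sample_rate, min_hz=60.0, max_hz=400.0):
    """Return the pitch in Hz whose lag maximizes r(s) within the search range."""
    min_lag = int(sample_rate / max_hz)
    max_lag = int(sample_rate / min_hz)
    best_lag = max(range(min_lag, max_lag + 1),
                   key=lambda s: autocorrelation(frame, s))
    return sample_rate / best_lag

# Assumed test frame: 125 Hz sine at 8 kHz, i.e. a 64-sample pitch period.
sample_rate = 8000
frame = [math.sin(2.0 * math.pi * 125.0 * n / sample_rate) for n in range(512)]
pitch_hz = estimate_pitch(frame, sample_rate)
```

The strongest peak of r(s) away from zero shift falls at the pitch period, so the lag of that peak gives the pitch. Restricting the lag range keeps the search away from the trivial maximum at s = 0.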

Representation of Auto correlation at a Particular shift

VOICE CONVERSION DEMO

Demo 1:

The first demo is a GUI-based program made in MATLAB 7.0, which gives a basic insight into how simple parameter modifications enable the voice to be changed effectively. The figure is a snapshot of the main screen of the GUI.

Demo 2:

This is also a MATLAB GUI program, which presents the conversion process described in the PSOLA-based approach. This GUI takes recorded source and target speech as inputs, extracts the parameters, and carries out the conversion as described.

Demo 3:

Demo 3 is an implementation of the speech synthesizer approach as a MATLAB program. The results obtained are not very encouraging, yet they are displayed here for two test files. The converted voice clearly resembles the target voice, but it is too “noisy” to be used any further.

IMPROVEMENTS AND FUTURE WORK

The results obtained suggest that the PSOLA-based approach outperforms the synthesizer-based approach; a noise cancellation strategy could further improve the quality. The system could also be made more effective by including prosodic modifications and time alignment.

The system could become considerably more robust, efficient, and generic if training were implemented. Because voice conversion is a new field in speech technology, there is a lot of scope for improving the system.

CONCLUSION

Here, two different approaches are developed to achieve voice conversion. The MATLAB demos developed here give a basic insight into the field of voice conversion. The system discussed here operates on pre-time-aligned speech samples. Considerable effort is required to implement the many modifications needed to make the present system more robust, efficient, and generic.

One such modification is training. An ideal voice conversion system should include a training phase, so that the system can be trained with target speech and then used to convert any arbitrary speech uttered by the source speaker. This could not be done at the current stage because of limited time and knowledge.

High-quality transformations can be obtained with more complex and computationally expensive techniques. Real-time voice conversion can also be achieved with powerful digital signal processors or similar hardware. Voice conversion is still a largely unexplored field in speech technology and awaits substantial contributions from speech researchers in the coming years.

Source: Nirma University
Author: Akash Mecwan
