When combined with acoustic speech information, visual speech information (lip movement) significantly improves Automatic Speech Recognition (ASR) in acoustically noisy environments. Previous research has demonstrated that the visual modality is a viable tool for identifying speech. However, visual information has yet to be utilized in mainstream ASR systems due to the difficulty of accurately tracking lips in real-world conditions. This paper presents our current progress in addressing this issue. We derive several algorithms based on a modified HSI color space to locate the face, eyes, and lips, and test these algorithms on imagery collected in visually challenging environments.
For example, a 360×240 frame size produces an ROI size of 69×90, yielding a height/width ratio of approximately 1.3. The three face models are created by manually selecting the face (excluding head hair and neck) of three subjects. Each model is then converted to its representative pdf form and stored offline (prior to invoking the system). The three face models can be seen in Figure 2.
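The paper does not show the comparison step itself, but a stored face-model pdf is typically matched against a candidate ROI using the Bhattacharyya coefficient mentioned later in the paper. The following is a minimal sketch, assuming the pdfs are normalized color histograms; the bin count and the sample histograms are illustrative, not taken from the paper.

```python
import numpy as np

def to_pdf(hist):
    """Normalize a histogram so its bins sum to 1 (a discrete pdf)."""
    hist = np.asarray(hist, dtype=float)
    return hist / hist.sum()

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two discrete pdfs.

    Returns a value in [0, 1]; 1 indicates identical distributions,
    so a higher value means the ROI better matches the face model.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(np.sqrt(p * q)))

# Hypothetical 8-bin color histograms: a stored face model and a candidate ROI.
model_pdf = to_pdf([4, 12, 30, 25, 15, 8, 4, 2])
roi_pdf = to_pdf([5, 10, 28, 27, 14, 9, 5, 2])
score = bhattacharyya(model_pdf, roi_pdf)
```

In practice, the ROI with the highest coefficient against any of the three stored models would be selected as the most likely face location.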
Requiring the summation of pixel containment within the bounding box to exceed any previous row's summation by 2% biases the "slide" toward the top of the image, removing potential coordinate-selection errors caused by neck visibility. In other words, because the typical face has a height/width ratio of approximately 1.2 and this module selects skin, a visible neck would otherwise distort the extracted bounds. Examples of this processing can be seen in Figure 4.
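The row-selection rule above can be sketched as a vertical scan over a binary skin mask. This is an assumption-laden illustration: the box width, the scan stride, and the exact pixel-containment measure are not specified in the paper, so a face-proportioned window (height = 1.2 × width) summed over the full mask width stands in for them.

```python
import numpy as np

def slide_face_box(skin_mask, box_w, ratio=1.2, bias=0.02):
    """Slide a face-proportioned box down a binary skin mask and return
    the top row of the selected box.

    A lower row replaces the current best only if its skin-pixel count
    exceeds the best by more than `bias` (2%), which favors boxes nearer
    the top of the image and reduces errors caused by visible neck skin.
    """
    box_h = int(round(ratio * box_w))  # typical face height/width ratio ~1.2
    h, _ = skin_mask.shape
    best_row, best_count = 0, 0
    for top in range(h - box_h + 1):
        # Skin-pixel containment within the box window at this row.
        count = int(skin_mask[top:top + box_h, :].sum())
        if count > best_count * (1 + bias):
            best_row, best_count = top, count
    return best_row

# Hypothetical 100×50 mask with skin pixels in rows 20-49.
mask = np.zeros((100, 50), dtype=int)
mask[20:50, :] = 1
top_row = slide_face_box(mask, box_w=25)
```

Because a later row must beat the running best by a strict 2% margin, a uniformly skin-colored region (face plus neck) resolves to its topmost qualifying row rather than drifting downward.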
All testing was performed using MATLAB R2006a on a desktop PC with a 2.93 GHz Celeron processor and 1.0 GB of memory. A total of seven video files were used for testing. The first frame in which a face is detected is passed to the Extract Face Coordinates and Extract Lip Coordinates modules.
In this paper, we presented two modules of our lip-parameter extraction system. Based on five regions of interest and their respective Bhattacharyya coefficients, the approximate location of a face can be determined; our modules then accurately locate the face and lips for downstream processing. Current work focuses on increasing the accuracy of the Extract Lip Coordinates module, which would allow removal of the Track Lips and Create Lip Target Model modules. In future work, the physical dimensions of the lips will be extracted from the identified lip region and input to a recognition engine to perform automatic speech recognition.
Source: California Polytechnic State University
Authors: Brandon Crow | Jane Xiaozheng Zhang