The Audio-Video OZstralian English Speech (AVOZES) data corpus has been made publicly available to interested researchers. It is the first publicly available audio-video speech data corpus for Australian English. It contains recordings from 20 speakers, and the sequences provide systematic coverage of the phonemes and visemes of Australian English as well as some application-driven utterances. AVOZES is also the first audio-video speech data corpus with stereo-video recordings, which enable a more accurate measurement of geometric facial features.
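As a rough illustration of why stereo recordings help with geometric facial measurements, the sketch below compares a mouth width measured in 3D with the same width measured in a 2D image while the head turns. It is a minimal sketch, not the AVOZES processing pipeline: the lip-corner coordinates, the rotation angles, and the orthographic projection are all assumptions made for illustration.

```python
import math

def rotate_about_vertical_axis(point, angle_deg):
    """Rotate a 3D point (x, y, z) about the vertical (y) axis."""
    x, y, z = point
    a = math.radians(angle_deg)
    return (x * math.cos(a) + z * math.sin(a),
            y,
            -x * math.sin(a) + z * math.cos(a))

def distance_3d(p, q):
    """Euclidean distance between two 3D points."""
    return math.dist(p, q)

def distance_2d(p, q):
    """Distance after an assumed orthographic projection (z is dropped)."""
    return math.dist(p[:2], q[:2])

# Hypothetical lip-corner positions in millimetres (illustrative values only).
left_corner = (-25.0, 0.0, 0.0)
right_corner = (25.0, 0.0, 0.0)

for angle in (0, 15, 30):
    l = rotate_about_vertical_axis(left_corner, angle)
    r = rotate_about_vertical_axis(right_corner, angle)
    print(f"{angle:>2} deg: 3D width = {distance_3d(l, r):.1f} mm, "
          f"2D width = {distance_2d(l, r):.1f} mm")
```

Under these assumptions the 3D width stays constant as the head turns, while the 2D width shrinks with the cosine of the rotation angle.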
For testing and comparing results published by various research groups in the field of Audio Visual Speech Processing (AVSP), a common basis in the form of a comprehensive, systematically designed AV speech data corpus would be of great value. Many corpora appear to have been designed with a specific application in mind, rather than being based on a general phonemic and visemic analysis. The Audio-Video OZstralian English Speech (AVOZES) data corpus was designed and recorded with two major goals in mind. Firstly, a new framework for the design of comprehensive, well-structured, multiple-use AV speech data corpora was proposed and followed in the production of the AVOZES data corpus. Secondly, the first publicly available, comprehensive AV speech data corpus for Australian English (AuE) was produced.
In addition, AVOZES is the first AV speech data corpus to use a stereo vision system. A stereo vision system has the advantage over monocular systems that 3D coordinates can be recovered accurately. Thus, 3D distances can be measured, not just distances in 2D image coordinates, which makes the measurements robust against rotations of the face.
A further set of factors relates to the corpus recording process. One can argue that recordings made in laboratories do not exactly mirror conditions in the real world. However, being able to control the experimental conditions is an advantage when interpreting experimental results. These conditions include the recording equipment, the possible use of markers, the layout of the recording room (e.g. the background), the seating arrangement, the illumination arrangement, and the level of acoustic noise. Going through all possible combinations of these conditions in a systematic way would result in an exponential growth of the corpus and quickly become impractical.
It is suggested here to hold all conditions but one constant and to vary only that one condition at a time, so that its effect can be studied in isolation rather than mixed with the effects of other changing conditions in one recording.
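The difference between recording every combination of conditions and varying one condition at a time can be made concrete with a small count. The conditions are those named above; the number of settings per condition is purely hypothetical and is not taken from the AVOZES design.

```python
import math

# Conditions named in the text; the number of settings per condition
# is a hypothetical choice made only for this illustration.
levels = {
    "recording equipment": 2,
    "markers": 2,
    "room layout": 3,
    "seating arrangement": 2,
    "illumination": 3,
    "acoustic noise": 4,
}

def full_factorial_count(levels):
    """Number of recording setups if every combination is covered."""
    return math.prod(levels.values())

def one_at_a_time_count(levels):
    """One baseline setup plus each non-baseline setting of each condition."""
    return 1 + sum(n - 1 for n in levels.values())

print(full_factorial_count(levels))  # 2*2*3*2*3*4 = 288
print(one_at_a_time_count(levels))   # 1 + (1+1+2+1+2+3) = 11
```

With these assumed settings, covering all combinations needs 288 setups per speaker and utterance, whereas varying one condition at a time needs 11. The full count multiplies with every added condition, while the one-at-a-time count only adds.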
AVOZES currently contains recordings made from 20 native speakers of AuE. The group is gender-balanced, with ten female and ten male speakers. Six speakers wear glasses, three wear lip make-up, and two have beards. At the time of the recordings, these speakers were between 23 and 56 years old. The recording assistant tentatively classified the speakers into the three speech varieties of AuE (broad, general, cultivated), which yielded groups of 6 speakers for broad AuE, 12 speakers for general AuE, and only 2 speakers for cultivated AuE.
Video information is encoded in the NTSC format at 720×480 pixels and a 29.97 Hz frame rate. The AVOZES AVI files use the Adaptec DVSoft codec, which comes pre-installed with most media players, such as RealPlayer and Windows Media Player.
Audio information is encoded as 48 kHz, 16-bit stereo.
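For planning storage or audio-video alignment, the stated encoding parameters translate into data rates as sketched below. Two assumptions are made that the record does not state: that the audio is uncompressed linear PCM, and that the nominal 29.97 Hz frame rate is the exact NTSC rate of 30000/1001 frames per second.

```python
from fractions import Fraction

SAMPLE_RATE_HZ = 48_000   # from the record: 48 kHz
BITS_PER_SAMPLE = 16      # from the record: 16-bit
CHANNELS = 2              # from the record: stereo
# Assumption: exact NTSC frame rate, nominally 29.97 Hz.
FRAME_RATE = Fraction(30_000, 1_001)

def pcm_bytes_per_second():
    """Audio data rate, assuming uncompressed linear PCM."""
    return SAMPLE_RATE_HZ * (BITS_PER_SAMPLE // 8) * CHANNELS

def audio_samples_per_frame():
    """Audio samples (per channel) spanned by one video frame."""
    return Fraction(SAMPLE_RATE_HZ) / FRAME_RATE

print(pcm_bytes_per_second())            # 192000
print(float(audio_samples_per_frame()))  # 1601.6
```

Because one video frame spans a non-integer number of audio samples under these assumptions, exact arithmetic with `Fraction` avoids drift when converting between frame indices and sample positions.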
Data time period: 2000 to 2001
Spatial coverage (polygon; longitude,latitude pairs): 108.70312,-10.21691 155.10937,-10.56271 154.75781,-44.14227 107.29687,-43.63555 108.70312,-10.21691
User Contributed Tags: Australian English, English Language, Linguistics
- Local : canberra.edu.au/Collection/Davozes001