Understanding VoiceXML

VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations.

The main goal of VoiceXML is to bring the full power of web development and content delivery to voice response applications, and to free the authors of such applications from low-level programming and resource management. VoiceXML enables integration of voice services with data services that use the familiar client-server paradigm. A voice service is viewed as a sequence of interaction dialogs between a user and an implementation platform.

Document servers, which may be external to the implementation platform, provide the dialogs. Document servers maintain overall service logic, perform database and legacy system operations, and produce dialogs. A VoiceXML document specifies each interaction dialog to be conducted by a VoiceXML interpreter. User input affects dialog interpretation and is collected into requests submitted to a document server. The document server may reply with another VoiceXML document to continue the user session with other dialogs.
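
The request cycle described above can be sketched as a minimal VoiceXML 2.0 document. This is an illustrative sketch, not a reference implementation: the grammar file drink.grxml and the server URL http://www.example.com/drink.cgi are hypothetical placeholders. The interpreter plays the prompt, collects the caller's answer into the field variable, and submits it to the document server, which may reply with another VoiceXML document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="order">
    <!-- The interpreter plays the prompt, then listens for the caller's answer -->
    <field name="drink">
      <prompt>Would you like coffee, tea, or milk?</prompt>
      <!-- Hypothetical grammar file that defines the accepted answers -->
      <grammar src="drink.grxml" type="application/srgs+xml"/>
    </field>
    <!-- Once the field is filled, send its value to the (hypothetical) document server -->
    <block>
      <submit next="http://www.example.com/drink.cgi" namelist="drink"/>
    </block>
  </form>
</vxml>
```

The server-side script named in the submit element would hold the service logic and return the next dialog, which keeps the interaction code in the document and the business logic on the server.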

VoiceXML is a markup language that performs the following tasks:

Minimizes client/server interactions by specifying multiple interactions per document.
Shields application authors from low-level and platform-specific details.
Separates user interaction code (in VoiceXML) from service logic (for example, CGI scripts).
Promotes service portability across implementation platforms; VoiceXML is a common language for content providers, tool providers, and platform providers.
Keeps simple interactions easy to author, yet provides language features to support complex dialogs.

The VoiceXML language describes the human-machine interaction provided by voice response systems, which includes the following:

Output of synthesized speech (Text-To-Speech, or TTS)
Output of audio files
Recognition of spoken input
Recognition of DTMF input
Recording of spoken input
Provision of telephony features such as call transfer and disconnect
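
As a rough illustration of how the capabilities listed above map onto VoiceXML 2.0 elements, the sketch below combines them in one form. The audio file chime.wav, the destination number, and the form and variable names are assumptions made up for this example, and platform behavior (for instance, supported audio formats or bridged transfer) may vary.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="helpdesk">
    <block>
      <!-- Synthesized speech (TTS) followed by an audio file (hypothetical chime.wav) -->
      <prompt>
        Welcome to the help desk.
        <audio src="chime.wav">chime</audio>
      </prompt>
    </block>

    <!-- Recognition of spoken or DTMF input: each option accepts a word or a key -->
    <field name="choice">
      <prompt>Say agent or press 1. Say message or press 2.</prompt>
      <option dtmf="1" value="agent">agent</option>
      <option dtmf="2" value="message">message</option>
    </field>

    <!-- Recording of spoken input, visited only if the caller chose "message" -->
    <record name="msg" cond="choice == 'message'" beep="true"
            maxtime="30s" dtmfterm="true">
      <prompt>Please leave your message after the beep.</prompt>
    </record>

    <!-- Telephony: call transfer to a placeholder number, only if "agent" was chosen -->
    <transfer name="call" cond="choice == 'agent'"
              dest="tel:+15555550100" bridge="true">
      <prompt>Connecting you now.</prompt>
    </transfer>

    <!-- Telephony: end the call -->
    <block>
      <prompt>Goodbye.</prompt>
      <disconnect/>
    </block>
  </form>
</vxml>
```

The cond attributes let a single form cover both paths: the interpreter skips any form item whose condition evaluates to false and moves on to the next one.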

VoiceXML provides the means for collecting character (DTMF key) and/or spoken input, for assigning the input to document-defined request variables, and for making decisions that affect the interpretation of documents written in the language. You can use URLs to hyperlink a document to other documents.
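
A minimal sketch of these three mechanisms together, assuming VoiceXML 2.0: the caller's input is assigned to the field variable dept, an if element makes a decision based on its value, and link and goto elements reference other documents by URL. All URLs and names here are hypothetical placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Hyperlink active throughout the document: pressing 0 jumps to another document -->
  <link next="http://www.example.com/operator.vxml" dtmf="0"/>

  <form id="route">
    <!-- Spoken or keyed input is assigned to the field variable "dept" -->
    <field name="dept">
      <prompt>Press 1 or say sales. Press 2 or say support.</prompt>
      <option dtmf="1" value="sales">sales</option>
      <option dtmf="2" value="support">support</option>

      <!-- Runs once "dept" has a value; the decision selects the next document -->
      <filled>
        <if cond="dept == 'sales'">
          <goto next="http://www.example.com/sales.vxml"/>
        <else/>
          <goto next="http://www.example.com/support.vxml"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```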