Public Draft Technical Report
Offering CSTA (ECMA-269) Voice Services in Web
Browsers
(July 2011)
Introduction
One notable addition to CSTA (ECMA-269) since Edition 6 is the enhanced voice
services for automatic speech recognition, speech verification, speaker
identification, speaker verification, and text-to-speech synthesis. Although
historically these functions have mostly been made available through CSTA
switching function implementations in call centers, the strong demand for
internet access from mobile and other web-enabled devices has given rise to
developments where ECMA-269 interactive voice devices are implemented as an
integral part of web browsers. One such implementation is Speech Application
Language Tags (SALT), made available by Microsoft in 2004 as a browser plug-in
for Internet Explorer.
Recently, the World Wide Web Consortium (W3C) has begun work on version 5 of
the Hypertext Markup Language (HTML) recommendation, the first major revision
since HTML 4.01 was adopted as ISO/IEC 15445 in May 2000. Among the new
functionality proposed for HTML 5 is native support for multimedia
capabilities. Following this trend, Google, on March 22, 2011, released a new
version of the Chrome browser that implements a proposal for an HTML 5 Speech
Input Application Program Interface (API).
This Technical Report examines the speech input capabilities in the two web
browsers and concludes that they are largely compliant with the interactive
voice device specification in ECMA-269.
This TR is part of a suite of CSTA Phase III Standards and Technical Reports. All of
the Standards and Technical Reports in this Suite are based upon
the practical experience of Ecma member companies and each one
represents a pragmatic and widely based consensus.
References
This TR provides examples of how subsets of CSTA Interactive Voice services can
be included to facilitate browser-based speech processing. ECMA-269,
Services for Computer Supported Telecommunications Applications (CSTA) Phase
III, should be used as the definitive reference for CSTA. This TR also makes
reference to how CSTA Interactive Voice services, adapted from SALT, can be
implemented in web browsers in an object-oriented manner. ECMA TR/88 should be
used as the reference.
In addition, this TR refers to an HTML speech input API proposal that has been
implemented and distributed via Google's Chrome browser. The proposed
specification examined by this TR has been published by the W3C HTML Speech
Incubator Group at
http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Feb/att-0020/api-draft.html.
ECMA-269 Speech Input Capabilities
The Listener Interactive Voice Device (ECMA-269 Clause 6.1.1.4.9) specifies a
CSTA device that provides speech input capabilities. The operational model for
the Listener resource consists of the following three states:
- the NULL state, in which the call and the Listener resource are not
interacting,
- the STARTED state, in which the Listener resource is processing the incoming
audio, and
- the SPEECH DETECTED state, in which speech-like sound is detected in the
audio and the Listener resource has started the process of matching the audio
to the speech grammar.
The services that can prompt the state transitions (ECMA-269 Clause 26.1) and
the events that can be raised as the result of the transitions (ECMA-269
Clause 26.2) are depicted in Figure 6-17 of ECMA-269, reproduced below:

A Listener resource can operate in one of three modes:
- the “multiple” mode allows a self-transition on the SPEECH DETECTED state,
with multiple Recognized events being raised,
- the “automatic” mode (default) transitions from the SPEECH DETECTED state to
the NULL state upon the first Recognized event, and
- the “single” mode makes the transition from the SPEECH DETECTED state to the
NULL state and raises a single Recognized event only upon an explicit Stop
service, preventing occasional pauses in the user’s speech from inadvertently
cutting off the recognition process.
Upon leaving the NULL state, CSTA further specifies three timeout conditions
that may lead the Listener resource to reset itself to the NULL state:
- Silence timeout: the maximum time allowed before entering the SPEECH
DETECTED state,
- Babble timeout: the maximum time allowed to stay in the SPEECH DETECTED
state, and
- Maximum timeout: the maximum time allowed before returning to the NULL
state.
Unlike the Silence timeout, which raises a dedicated Silence Timeout Expired
event (Clause 26.2.11), the other two timeout expirations raise the Voice
Error Occurred event (Clause 26.2.18).
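The operational model described above can be sketched in ECMAScript as a small
state machine. This sketch is illustrative only and is not taken from
ECMA-269: the function and method names are hypothetical, and the timeout
conditions are omitted for brevity.

```javascript
// Illustrative sketch of the Listener operational model: the three states
// (NULL, STARTED, SPEECH DETECTED), the three modes, and the Recognized-event
// behaviour of each mode. All names are hypothetical.
function createListener(mode) {
  return {
    state: "NULL",
    mode: mode || "automatic",   // "automatic" is the default mode
    events: [],
    start() {                    // Start service: NULL -> STARTED
      if (this.state === "NULL") {
        this.state = "STARTED";
        this.events.push("Started");
      }
    },
    detectSpeech() {             // speech-like sound: STARTED -> SPEECH DETECTED
      if (this.state === "STARTED") {
        this.state = "SPEECH DETECTED";
        this.events.push("Speech Detected");
      }
    },
    match() {                    // a grammar match completes
      if (this.state !== "SPEECH DETECTED") return;
      if (this.mode === "multiple") {
        this.events.push("Recognized"); // self-transition: stay in SPEECH DETECTED
      } else if (this.mode === "automatic") {
        this.events.push("Recognized"); // first result resets to NULL
        this.state = "NULL";
      }
      // "single" mode: the result is held until an explicit Stop
    },
    stop() {                     // Stop service: always return to NULL
      if (this.mode === "single" && this.state === "SPEECH DETECTED") {
        this.events.push("Recognized"); // the single Recognized event
      }
      this.state = "NULL";
    },
  };
}
```

In “automatic” mode the first match alone returns the resource to NULL, while
in “single” mode the same match only takes effect once the Stop service is
requested.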
A Listener resource can additionally be configured with a speech grammar to
govern its behaviour. For speech recognition purposes, the W3C Speech
Recognition Grammar Specification (SRGS) format with W3C Semantic
Interpretation for Speech Recognition (SISR) annotation is required to specify
the grammar. Two CSTA services, Activate and Deactivate (Clauses 26.1.1 and
26.1.4, respectively), allow the SRGS grammar rules to be put in the active or
dormant state. W3C Extensible Multimodal Annotation (EMMA) or Microsoft
Semantic Markup Language is required to describe the outcomes when the
Recognized event is raised (Clause 26.2.8). All the configuration parameters
of a Listener resource can be set or retrieved using the Set Voice Attribute
(Clause 26.1.13) and Query Voice Attribute (Clause 26.1.7) requests,
respectively.
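As an illustration (not taken from this TR’s references), a minimal SRGS
grammar with an SISR annotation might look as follows; the rule name, file
content, and returned values are hypothetical:

```xml
<!-- Illustrative sketch: an SRGS grammar whose SISR <tag> annotations
     return a city code as the semantic interpretation -->
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" root="city" tag-format="semantics/1.0">
  <rule id="city" scope="public">
    <one-of>
      <item>Seattle <tag>out = "SEA";</tag></item>
      <item>Boston <tag>out = "BOS";</tag></item>
    </one-of>
  </rule>
</grammar>
```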
Listener Implementation in SALT and Chrome
All CSTA IVDs are implemented by SALT as an Internet Explorer browser plug-in.
Specifically, the Listener resource is embodied as a <listen> element. The
element may use a “src” attribute or one or more <grammar> sub-elements to
specify speech grammars. An optional sub-element <param> allows the
application to specify the URI of any remote server resources needed.
All three modes and the three timeout mechanisms are implemented by SALT. The
mode of the Listener can be specified with a namesake attribute, and the
timeout values can be specified as XML attributes of the element. The names of
these timeout attributes are “initialtimeout”, “babbletimeout”, and
“maxtimeout”, respectively. When the Recognized event is raised, SALT makes
available the “result” and “text” parameters (Clause 26.2.8.1) as namesake
properties of the <listen> element.
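Putting these attributes together, a SALT <listen> element configured for the
“single” mode with all three timeouts might look as follows; the id, grammar
URI, handler name, and timeout values (assumed to be in milliseconds) are
illustrative:

```html
<!-- Illustrative sketch: "single" mode with the three timeout attributes -->
<salt:listen id="listenNotes" mode="single" src="notes.grxml"
             initialtimeout="3000" babbletimeout="15000" maxtimeout="30000"
             onreco="handleSpeechInput()" />
```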
The Chrome implementation of speech input adheres to the CSTA Listener
operational model. The Listener resource is made available by declaring a
“speech” attribute on the HTML <input> or <textarea> element. A “grammar”
attribute is used to specify the speech grammar. The current proposal only
implements the Silence timeout, which is specified in Chrome through a
“nospeechtimeout” attribute of the HTML element. Only the “automatic” mode is
considered in the current specification. Instead of the W3C Extensible
Multimodal Annotation (EMMA) format, the Chrome implementation makes available
the recognition outcome exclusively as an ECMAScript (ECMA-262) object.
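Combining these attributes, a speech-enabled text field in Chrome might be
declared as follows; the grammar URI and timeout value (assumed to be in
milliseconds) are illustrative:

```html
<!-- Illustrative sketch: the "speech", "grammar", and "nospeechtimeout"
     attributes on a standard HTML input element -->
<input type="text" speech grammar="city.grxml" nospeechtimeout="3000">
```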
The following table summarises the syntactical mapping of the Listener resource service requests
and the events to SALT and Chrome implementations:
| ECMA-269 (clause)                 | SALT in Internet Explorer | Chrome                     |
|-----------------------------------|---------------------------|----------------------------|
| Start (26.1.14)                   | Start                     | startSpeechInput           |
| Stop (26.1.15)                    | Stop                      | stopSpeechInput            |
| Clear (26.1.2)                    | Cancel                    | cancelSpeechInput          |
| Activate (26.1.1)                 | Activate                  | -                          |
| Deactivate (26.1.4)               | Deactivate                | -                          |
| Emptied (26.2.4)                  | -                         | speechend                  |
| Interruption Detected (26.2.5)    | audiointerrupt            | -                          |
| Not Recognized (26.2.6)           | noreco                    | speecherror (error code 5) |
| Recognized (26.2.8)               | reco                      | speechchange               |
| Silence Timeout Expired (26.2.11) | silence                   | speecherror (error code 4) |
| Speech Detected (26.2.12)         | speechdetected            | speechstart                |
| Started (26.2.13)                 | -                         | capturestart               |
| Voice Error Occurred (26.2.18)    | error                     | speecherror                |
Sample Programs
This section illustrates the similarities between Chrome’s and SALT’s
implementations of the ECMA-269 Listener resource, using the sample code in
SALT’s and Chrome’s specifications as examples. As can be seen, the two
implementations bear a strong resemblance to each other.
Even though the differences are minor, they notably reflect distinct design
philosophies. While the Chrome speech API aims to modify the HTML
specification and is applicable only to a new version of HTML, SALT is
designed as a general XML application that can be embedded into any markup
language (e.g. SVG, openXML, etc.) to provide speech functionality for its
hosting environment. As such, all the SALT samples below introduce speech
features without changing the HTML specification.
Click-to-Talk Example
This example shows an HTML web page that allows users to enter a city name
into a text input element.
SALT:
Because speech input is an add-on to HTML, applications can assign the speech
recognition outcome to any HTML element (or elements) by scripting:
<html xmlns:salt="urn:schemas.saltforum.org/2002/02/SALT">
...
<script type="text/javascript">
  function handleSpeechInput() {
    // IE event model: event.srcElement is the <listen> element raising onreco
    textBoxCity.value = event.srcElement.text;
  }
</script>
...
<input id="textBoxCity" type="text">
<input type="button" name="q" onclick="listenCity.Start()">
<salt:listen id="listenCity" src="city.grxml" onreco="handleSpeechInput()">
...
Chrome:
By changing HTML, applications can speech-enable an individual HTML input
element in Chrome in a very straightforward manner:
<input id="textBoxCity" type="text" speech grammar="city.grxml">
Search by voice, with "Did you say..."
Chrome:
This example demonstrates how the second-best recognition result
can be parsed out and submitted to the web search engine so that the search
results page can display a link with the text "Did you say second_best?". The
example is taken from Section 2 of Google's HTML 5 Speech API proposal.
<script type="text/javascript">
  function startSearch(event) {
    if (event.target.results.length > 1) {
      var second = event.target.results[1].utterance;
      document.getElementById("second_best").value = second;
    }
    event.target.form.submit();
  }
</script>
<form action="http://www.google.com/search">
<input type="search" name="q" speech required onspeechchange="startSearch(event)">
<input type="hidden" name="second_best" id="second_best">
</form>
SALT:
The same scripting method as in the previous example is equally applicable to
implementing this scenario.
<script type="text/javascript">
  function startSearch() {
    // IE event model: recoresult holds the N-best list as child nodes
    var nBest = event.srcElement.recoresult.childNodes;
    q.value = nBest.item(0).text;
    if (nBest.length > 1) {
      second_best.value = nBest.item(1).text;
    }
    event.srcElement.form.submit();
  }
</script>
<form action="http://www.google.com/search">
<input type="search" name="q">
<input type="button" onclick="listenSearch.Start()">
<salt:listen id="listenSearch" onreco="startSearch()">
<input type="hidden" name="second_best" id="second_best">
</form>