A multimodal interactive system based on hierarchical Model-View-Controller architecture

Multimodal interactive systems are expected to come into wide use. To realize life-like agents or humanoid robots, a flexible architecture for integrating software modules is necessary. Many frameworks have been proposed.

  • Joseph Polifroni, Stephanie Seneff. 2000. Galaxy-II as an Architecture for Spoken Dialogue Evaluation. Proceedings of Second International Conference on Language Resources and Evaluation, pp.42-50.
  • Yosuke Matsusaka, Kentaro Oku, Tetsunori Kobayashi. 2003. Design and Implementation of Data Sharing Architecture for Multi-Functional Robot Development. Trans. of IEICE, Vol.J86-D1, No.5, pp.318-329 (in Japanese).
  • SRI International, The Open Agent Architecture. http://www.ai.sri.com/~oaa/

In this post, the following topics related to the Galatea Toolkit are discussed:

  1. A developer should be able to easily customize a parameter that influences many modules within the system.
  2. A developer who does not have knowledge of speech technology should be able to develop spoken dialog applications efficiently.

Galatea is a project for providing an open-source, license-free software toolkit for building anthropomorphic spoken dialogue agents.

  • Shin-ichi Kawamoto, et al. 2002. Open-source Software for Developing Anthropomorphic Spoken Dialog Agent. Proc. of PRICAI-02, International Workshop on Lifelike Animated Agents, pp.64-69.
  • The development of the Galatea Toolkit started in 2000 under the support of IPA (Information-technology Promotion Agency, Japan). In 2003, the first public version was released. Since 2003, the newly established Interactive Speech Technology Consortium (ISTC), under IPSJ SIG-SLP, has been working on development and improvement of the toolkit. In 2009, the latest version of the software is scheduled to be released to the public again.
  • See http://sourceforge.jp/projects/galatea/

The Galatea system comprises the following basic modules: Speech Recognition Module (SRM), Speech Synthesis Module (SSM), Face Synthesis Module (FSM), Agent Manager (AM), and Dialog Manager (DM). Currently, only Japanese is supported.

The integration system for Linux adopts an architecture in which the modules behave as virtual machines communicating through simple text commands.
The Agent Manager, which mediates the communication between modules, is implemented in Perl.
An example of the Galatea protocol, in which the Agent Manager invokes the Speech Synthesis Module, follows:

to   @SSM set Text = hello
from @SSM rep Run = LIVE
from @SSM rep Speak.stat = PROCESSING
from @SSM rep Text.pho = h[20] e[20]…
from @SSM rep Speak.stat = READY
to   @SSM set Speak = NOW
from @SSM rep Speak.stat = SPEAKING
from @SSM rep Speak.stat = READY
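As a minimal sketch, such protocol lines can be split into a direction, a module name, a command, a key, and a value. The exact message grammar is an assumption inferred from the example above, not a specification:

```ruby
# Minimal sketch of parsing Galatea-style protocol lines.
# The message grammar below is an assumption based on the example above.
def parse_message(line)
  if line =~ /\A(to|from)\s+@(\w+)\s+(set|rep)\s+([\w.]+)\s*=\s*(.*)\z/
    { direction: $1, module: $2, command: $3, key: $4, value: $5 }
  end
end

msg = parse_message("from @SSM rep Speak.stat = READY")
# msg[:module] => "SSM", msg[:key] => "Speak.stat", msg[:value] => "READY"
```

A module implementation could dispatch on the key (such as Speak.stat) to track the state of its peer.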

A notable feature of the toolkit is that users can easily create a new persona with its own face and voice.

  • Face Animation: A texture-mapped wire-frame model for photo-realistic 3D images can be produced from a single photograph. A GUI-based photo-fitting tool (FaceMaker) is bundled.
  • Speech Synthesis: HMM-based, speaker-adaptive Japanese speech synthesis (GalateaTalk) is provided. A training tool for the statistical speaker model (VoiceMaker) is bundled.

Galatea Dialog Studio, which performs the dialog management, is implemented in Java and Ruby. It is developed to adopt VoiceXML, one of the standards for voice interface description. Since the previous release in 2003, Ubuntu, one of the latest Linux environments, has been newly selected as the execution environment. For ease of installation, ‘deb’ packages are provided.

The galatea-generate command is newly introduced to generate a set of project-related files, which contain configurations such as the options given to each module.

The interaction between the DM and the Julius-based SRM has been modified to improve compatibility with the VoiceXML standard and the productivity of application developers.

More natural face movements and emotional expressions have been realized, and utterances and changes of expression can be performed in parallel. The graphical user interface for developers has been reinforced: system states, logs, and errors can be viewed more easily. Compatibility with the latest web application development tools (such as Ruby on Rails) has also been improved.

There is a demand to easily change settings that affect several modules at once. For example, when the virtual character is changed, the setting affects both the SSM and the FSM. Editing each configuration file separately is error-prone and inefficient. Therefore, it is effective to dynamically generate the several configuration files from one master configuration file using a template engine.

The YAML format of master configuration:

[project.yml]
agents:
- speaker: female01
  gender: female
  mask: woman01
- speaker: male01
  gender: male
  mask: man01

The templates and outputs for SSM/FSM configurations:

[(template) ssm.conf.erb]
<% @agents.each do |a| %>
SPEAKER-ID: <%= a.speaker %>
GENDER: <%= a.gender %>
DUR-TREE-FILE:   ../speakers/<%= a.speaker %>/tree-dur.inf
...(omitted)...
<% end %>

[(template) fsm.conf.erb]
<% @agents.each do |a| %>
MaskFile <%= a.mask %> ../sample/<%= a.mask %>.rgb ...
<% end %>

[(output) ssm.conf]
SPEAKER-ID: female01
GENDER: female
DUR-TREE-FILE:   ../speakers/female01/tree-dur.inf
...(omitted)...

[(output) fsm.conf]
MaskFile woman01 ../sample/woman01.rgb ../sample/woman01.pnt
MaskFile man01 ../sample/man01.rgb ../sample/man01.pnt
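The generation step above can be sketched with Ruby's standard yaml and erb libraries. The YAML fragment and the template fragment follow the examples above; wrapping each agent entry in an OpenStruct is an assumption made here so that the template can use the a.speaker style of access:

```ruby
require 'yaml'
require 'erb'
require 'ostruct'

# Master configuration, as in project.yml above (one agent shown).
yaml = <<~YML
  agents:
  - speaker: female01
    gender: female
    mask: woman01
YML

# Expose each agent entry as an object so the template can write a.speaker.
@agents = YAML.safe_load(yaml)['agents'].map { |h| OpenStruct.new(h) }

# A fragment of the ssm.conf.erb template shown above.
template = <<~ERB
  <% @agents.each do |a| %>
  SPEAKER-ID: <%= a.speaker %>
  GENDER: <%= a.gender %>
  <% end %>
ERB

# Render the template against the loaded configuration.
output = ERB.new(template).result(binding)
puts output
```

In the toolkit, the same master file would be rendered once per template (ssm.conf.erb, fsm.conf.erb, and so on) to keep all module configurations consistent.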

Useful applications can be realized with our system, although the current implementation does not support spontaneous speech recognition and understanding. It has been suggested that building a good machine, rather than striving for natural conversation, leads to effective use of speech recognition.

  • Bruce Balentine, Leslie Degler. 2007. It’s Better to Be a Good Machine Than a Bad Person: Speech Recognition and Other Exotic User Interfaces in the Twilight of the Jetsonian Age. ICMI Press.

Our previous work on exploratory search also indicates that

  1. conversations between human beings have rational structures, and
  2. natural interaction can be realized by adopting these structures.
  • Takuya Nishimoto, Takayuki Watanabe. 2008. An analysis of human-to-human dialogs and its application to assist visually-impaired people. Computers Helping People with Special Needs, LNCS 5105, Springer, (Proceedings of 11th International Conference, ICCHP 2008, Linz, Austria) pp.809-812.

Our hypothesis is that, to build a good talking machine, it is important to connect the speech interface with the technology for developing complicated applications that use large-scale databases.

The speech interface committee of the ITSCJ proposed a hierarchical architecture of the MMI system.

  • Masahiro Araki, Tsuneo Nitta, Kouichi Katsurada, Takuya Nishimoto,
    Tetsuo Amakasu, Shin-ichi Kawamoto. 2007. Proposal of a Hierarchical Architecture for Multimodal Interactive Systems. W3C Workshop on Multimodal Architecture and Interfaces. http://www.w3.org/2007/08/mmi-arch/agenda.html

The Model-View-Controller (MVC) architecture, which separates the application logic from the user interface description with the controller mediating between them, is suitable for aggregating modality-dependent processing and for facilitating the addition of new modalities. VoiceXML is used between Layer-4 and Layer-5. Galatea Dialog Studio (including its sub-modules) corresponds to Layer-4 and below.

To complete the multimodal system, Layer-5 and Layer-6 should be implemented. Ruby on Rails is expected to contribute to building those layers.

Rails is a web development framework written in the Ruby language. It is organized around the MVC architecture and allows the developer to write less code than many other languages and frameworks require.

Conventionally, the management of a spoken dialog system is realized with state transition models, whether the transitions are deterministic or probabilistic. Often, however, the state transition model depends on the modality configuration.

When REST (Representational State Transfer) is introduced in the style of Rails, in contrast, the states can be abstracted by controller and action.

  • Leonard Richardson, Sam Ruby. 2007. RESTful Web Services. O’Reilly.

For example, a dialog that queries the prices of products can be abstracted by the index and show actions of the product controller.

[app/controllers/product_controller.rb]
class ProductController < ApplicationController
  def index
    @products = Product.find(:all)     
  end
  def show
    @product = Product.find(params[:id])
  end
end

[app/views/product/index.vxml.erb]
<vxml> 
<form id='main'>
 <field name='id'>
  <prompt> <% @products.each do |p| %> <%=h p.name %>, <% end %> </prompt>
  <grammar> <rule> <one-of>
   <% @products.each do |p| %>
    <item> <token sym="<%= p.yomi %>" slot="id" value="<%= p.id %>">
     <%= p.name %> </token> </item>
   <% end %>
  </one-of> </rule> </grammar>
 </field>
 <block> <submit next="<%= url_for(:action=>'show', :format=>'vxml')%>"/> </block>
</form> 
</vxml>

[app/views/product/show.vxml.erb]
<vxml>
 <form id='main'>
 <block>
 <prompt> The price of <%= @product.name %> is <%= @product.price %> yen. </prompt>
 <goto next="<%= url_for(:action=>'index', :format=>'vxml')%>" />
 </block>
 </form> 
</vxml>

The product controller uses the product model, which corresponds to a table in a SQL database.
Only the View depends on the modality; it differs between HTML and VoiceXML.

In the controller, an action can be forwarded to any other action; therefore, state transitions that are not deterministic are also feasible.
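As a plain-Ruby sketch, the dialog state can be abstracted as a (controller, action) pair, with a transition table forwarding any action to any other depending on the recognition result. The table entries and event names below are hypothetical; in an actual Rails controller the same idea would be expressed with redirect_to between actions:

```ruby
# Dialog state abstracted as a (controller, action) pair.
# The transition table and event names are hypothetical examples.
TRANSITIONS = {
  ['product', 'index'] => { 'item_selected' => ['product', 'show'],
                            'no_input'      => ['product', 'index'] },
  ['product', 'show']  => { 'done'          => ['product', 'index'] }
}

# Look up the next (controller, action) pair for a recognition event.
def next_state(state, event)
  TRANSITIONS.fetch(state).fetch(event)
end

state = ['product', 'index']
state = next_state(state, 'item_selected')   # => ["product", "show"]
state = next_state(state, 'done')            # => ["product", "index"]
```

Because the table is data rather than code, the same abstraction accommodates probabilistic or context-dependent transition policies without changing the controller interface.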

Suppose an application system designed for an HTML-based interface has a strictly separated MVC structure. Then the Models and Controllers, which are the modality-independent parts of the system, can be realized effectively, because we can utilize the support of Rails for these parts. It is then easy to add VoiceXML-based views to the original HTML-oriented Views of Rails.

We investigated the typical problems in MMI systems. Interestingly, the problem of device initialization and the problem of task description can be solved in the same way, that is, by utilizing the MVC architecture, the Ruby language, and the template engine. This offers a strong suggestion for standardizing the description language in each of the six layers of the hierarchy.

Future work includes:

  1. An interactive development environment may be realized based on the irb command of Ruby.
  2. The internationalization functions of Rails may be used to absorb the differences between modalities as well as between languages.

It is important to excite the interest of researchers and developers in the tool. Localization or internationalization must be realized. Live CD/DVD versions of Galatea based on Knoppix may help users get started. The documents and licenses must be arranged so as to expand the open-source community.

Published by nishimotz

A freelance consultant. doctor of engineering. speech interface, open-source software, accessibility, #nvdajp. Facebook: http://bit.ly/ckUk20