Voice interface and effectiveness

One of my colleagues gave a presentation at the Human-Agent-Interaction symposium in Tokyo yesterday.

Our assumption is that human-like spoken dialogs are highly effective. We propose using reinforcement learning to acquire a strategy for responding quickly to overlapped utterances, interruptions, or gestures during spoken dialog between a human and a machine. Although the research is still at an early stage, we hope that something like mind-reading will become possible; in other words, users of spoken dialog systems would not need to speak every utterance from beginning to end.

In another research project, conducted with a company during 2007-2008, a prototype multimodal interface system was developed. In this work, we expected that real-time visual feedback while the user is speaking would help the user notice whether the speaking style or vocabulary is valid. Moreover, this exploratory search allows the user to stop speaking before the utterance is finished, as soon as the expected results are visible.

Another study related to effectiveness was conducted by my ex-colleague, Dr. Masahiro Araki.

Input Prediction Method of Speech Front End Processor using Prosodic Information

In general, prosody of speech contains various information. For example, in Japanese, accent information is used for distinguishing homonyms and identifying word boundaries. In this paper, we propose a HMM-based accent type recognition method and, as an application of this method, an input prediction front end processor for dictation. From a few morae inputs, completion candidates that are sorted by input history and by the accent pattern are listed up. We examined two accent usage methods for both registered words and unregistered words, and implemented an input prediction system combining a speech recognizer, a prediction server and an accent usage module.

  • Another publication: Masahiro Araki, Hiroyoshi Ohmiya, and Satoshi Kida, “Input Prediction Method of Speech Front End Processor Using Prosodic Information,” Proceedings of International Conference: Speech Prosody 2004, Nara, Japan, 2004.
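
The ranking step described in the abstract above can be illustrated with a small sketch. It is not the authors' implementation: the `Entry` type, the `predict` function, the toy lexicon, the accent labels, and the exact ordering (accent match first, then usage count) are all assumptions made for illustration. The HMM-based accent recognizer is replaced here by an accent label passed in directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    word: str          # label shown to the user (English gloss in this toy lexicon)
    morae: tuple       # reading split into morae, romanized for readability
    accent_type: int   # illustrative accent label, not taken from the paper


def predict(prefix, accent_type, lexicon, history, limit=5):
    """Return up to `limit` completion candidates for a mora prefix."""
    prefix = tuple(prefix)
    matches = [e for e in lexicon if e.morae[:len(prefix)] == prefix]
    # Assumed ranking: accent match first, then more frequently used words,
    # then alphabetical order so that ties are resolved deterministically.
    matches.sort(key=lambda e: (e.accent_type != accent_type,
                                -history.get(e.word, 0),
                                e.word))
    return [e.word for e in matches[:limit]]


# Hypothetical lexicon: four words that share the morae "ha", "shi".
LEXICON = [
    Entry("chopsticks", ("ha", "shi"), 1),
    Entry("bridge", ("ha", "shi"), 2),
    Entry("edge", ("ha", "shi"), 0),
    Entry("run", ("ha", "shi", "ru"), 2),
]
# Hypothetical input history: how often each word was chosen before.
HISTORY = {"bridge": 3, "chopsticks": 1}

print(predict(("ha", "shi"), 1, LEXICON, HISTORY))
```

With accent label 1, this sketch ranks "chopsticks" first even though "bridge" has been chosen more often, which shows how prosody could reorder candidates that the input history alone would rank differently.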

Effective voice interfaces of this kind have been reinvented repeatedly.
The important point is that the strategy for an effective interface depends heavily on the situation, the task, and the available modalities.
That is the reason for my recent interest in machine learning of such skills.

Our concept of a voice interface was inspired by the Hyakunin-isshu game. Hyakunin-isshu (The hundred poems by one hundred poets) is a famous game in Japan.

http://www.japanlink.co.jp/ka/play1.htm#Card%20game

Karuta (Card game)

The word karuta is said to have come from the Portuguese carta. Karuta are rectangular, like ordinary playing cards, with pictures or Japanese writing drawn on them. When playing, one player reads out a card for reading (yomi-fuda) and the other players compete to take the picture card (efuda) that matches it; the player who takes the most cards is the winner. As typical Japanese aspects of the game, there are iroha-garuta that contain the Japanese proverbs and poem cards (uta-garuta) on which the poems in 31 syllables known as tanka are written. Nowadays, the game is played principally at New Year.

Hyakunin-isshu (The hundred poems by one hundred poets)

This generally refers to the poetry anthology entitled “Ogura hyakunin-isshu,” compiled by FUJIWARA no Teika (Sadaie). It gathered 100 waka (the classical Japanese poem, specifically in this case the poem in 31 syllables), one each by the most outstanding poets from the Heian Period (794-1185) and the early years of the Kamakura Period (1185-1333). From the time of the Edo Period (1603-1867), this poetry anthology was widely spread and used as the poem cards. The overwhelming majority (43 selections) are love poems, followed by seasonal poems (32 selections). 79 of the poets are male, 21 female, and they express thoughts of love, nature and the seasons with a refinement unique to the Japanese people. It has become well-known as the representative work of classical Japanese literature. It’s one of the essential games of New Year.

For seven of the hundred cards, a player can identify the single correct card as soon as the reader has spoken only the first syllable.
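
The idea that a card can be taken as soon as its opening becomes unambiguous can be sketched as a shortest-unique-prefix computation. This is only an illustration: `deciding_lengths` is a hypothetical helper, and the four openings below are toy syllable sequences, not the actual poems.

```python
def deciding_lengths(cards):
    """Map each card to the number of leading syllables that identify it uniquely.

    `cards` maps a card name to its opening as a tuple of syllables.
    A card that never becomes unique within the given syllables maps to None.
    """
    result = {}
    for name, syllables in cards.items():
        result[name] = None
        for n in range(1, len(syllables) + 1):
            prefix = syllables[:n]
            # Other cards that still match after hearing n syllables.
            rivals = [other for other, s in cards.items()
                      if other != name and s[:n] == prefix]
            if not rivals:
                result[name] = n
                break
    return result


# Toy openings (hypothetical, romanized); real play uses all one hundred poems.
CARDS = {
    "card_a": ("mu", "ra", "sa"),
    "card_b": ("a", "ki", "no"),
    "card_c": ("a", "ki", "ka"),
    "card_d": ("a", "ri", "a"),
}

for name, length in deciding_lengths(CARDS).items():
    print(name, length)
```

In this toy set, "card_a" is decided after one syllable because no other opening starts with "mu", while the cards starting with "a" need two or three syllables. A voice interface could use the same test to decide when it has heard enough of an utterance to respond.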
