Multimodal Silent Speech-based Text Entry with Word-initials Conditioned LLM
Authors
Su, Zixiong and Fang, Shitao and Rekimoto, Jun
Abstract
Although silent speech interfaces show great potential for enabling seamless communication between humans and conversational agents, large-vocabulary recognition remains challenging for them. In this research, we propose a novel interaction technique that combines silent speech and typing to enable more efficient text entry while preserving privacy. The technique lets users enter abbreviated phrases while maintaining high accuracy by leveraging visual information: by fine-tuning a large language model with a visual speech encoder, we condition the model to decode the speech content using word initials as hints. Evaluations on existing datasets show that our model reduces the word error rate from 20.3\% to 9.19\% compared to state-of-the-art visual speech recognition models. Results from a user study demonstrate significant improvements in input speed and keystroke savings. Participants reported that our prototype, LipType, leads to an overall lower perceived workload, particularly in the effort and physical demand dimensions.
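The abstract does not detail how word initials constrain decoding; the paper's actual method fine-tunes an LLM on visual speech features. As a much-simplified sketch of the underlying conditioning idea only, the toy Python function below rescores per-position word candidates from a hypothetical visual recognizer so that only words matching the typed initials survive. All names here (decode_with_initials, visual_candidates) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of word-initials-conditioned decoding (illustrative only).
# Assumed setup: a visual speech recognizer yields per-position word
# candidates with probabilities; the user types the first letter of each
# intended word as a hint.

def decode_with_initials(candidates, initials):
    """Greedy decoding: at each word position, keep only candidates whose
    first letter matches the typed initial, then take the highest-scoring one.

    candidates: list of dicts mapping word -> probability, one per position.
    initials:   string of word-initial letters, e.g. "hay" for "how are you".
    """
    decoded = []
    for position, initial in zip(candidates, initials):
        # Constrain the hypothesis space with the typed initial (the "hint").
        matching = {w: p for w, p in position.items() if w.startswith(initial)}
        if not matching:  # no candidate fits the hint; fall back to all words
            matching = position
        decoded.append(max(matching, key=matching.get))
    return " ".join(decoded)

# Hypothetical recognizer output for the silently spoken phrase "how are you".
# Note that the unconstrained top-1 words ("now", "or") would be wrong.
visual_candidates = [
    {"how": 0.4, "now": 0.5, "who": 0.1},
    {"are": 0.35, "or": 0.4, "air": 0.25},
    {"you": 0.6, "your": 0.4},
]

print(decode_with_initials(visual_candidates, "hay"))  # -> "how are you"
```

In the paper's actual system the hint is consumed by a fine-tuned LLM rather than a hard filter, which lets the model recover even when the visual hypotheses are noisy; the sketch above only shows why an initials hint sharply prunes ambiguity in visual speech recognition.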