Multimodal Silent Speech-based Text Entry with Word-initials Conditioned LLM
Authors
Su, Zixiong and Fang, Shitao and Rekimoto, Jun
Abstract
Although silent speech interfaces show great potential for enabling seamless communication between humans and conversational agents, large-vocabulary recognition remains challenging for them. In this research, we propose a novel interaction technique that combines silent speech and typing to enable more efficient text entry while preserving privacy. The technique lets users enter abbreviated phrases while maintaining high accuracy by leveraging visual information: by fine-tuning a large language model with a visual speech encoder, we condition the model to decode the speech content using word initials as hints. Evaluations on existing datasets show that our model reduces the word error rate from 20.3\% to 9.19\% compared to state-of-the-art visual speech recognition models. Results from a user study demonstrate significant improvements in input speed and keystroke savings. Participants reported that our prototype, LipType, leads to an overall lower perceived workload, particularly in the effort and physical demand dimensions.
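The abstract does not detail how word initials constrain decoding; the paper's actual method fine-tunes an LLM on visual speech features. As a much-simplified sketch of the underlying conditioning idea only, the toy Python function below rescores per-position word candidates from a hypothetical visual recognizer so that only words matching the typed initials survive. All names here (decode_with_initials, visual_candidates) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of word-initials-conditioned decoding (illustrative only).
# Assumed setup: a visual speech recognizer yields per-position word
# candidates with probabilities; the user types the first letter of each
# intended word as a hint.

def decode_with_initials(candidates, initials):
    """Greedy decoding: at each word position, keep only candidates whose
    first letter matches the typed initial, then take the highest-scoring one.

    candidates: list of dicts mapping word -> probability, one per position.
    initials:   string of word-initial letters, e.g. "hay" for "how are you".
    """
    decoded = []
    for position, initial in zip(candidates, initials):
        # Constrain the hypothesis space with the typed initial (the "hint").
        matching = {w: p for w, p in position.items() if w.startswith(initial)}
        if not matching:  # no candidate fits the hint; fall back to all words
            matching = position
        decoded.append(max(matching, key=matching.get))
    return " ".join(decoded)

# Hypothetical recognizer output for the silently spoken phrase "how are you".
# Note that the unconstrained top-1 words ("now", "or") would be wrong.
visual_candidates = [
    {"how": 0.4, "now": 0.5, "who": 0.1},
    {"are": 0.35, "or": 0.4, "air": 0.25},
    {"you": 0.6, "your": 0.4},
]

print(decode_with_initials(visual_candidates, "hay"))  # -> "how are you"
```

In the paper's actual system the hint is consumed by a fine-tuned LLM rather than a hard filter, which lets the model recover even when the visual hypotheses are noisy; the sketch above only shows why an initials hint sharply prunes ambiguity in visual speech recognition.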