STICKERCONV: Generating Multimodal Empathetic Responses from Scratch

*Equal Contribution. Corresponding Author.
1School of Computer Science and Engineering, Northeastern University, Shenyang, China
2Alibaba Group, Hangzhou, China

Abstract

Stickers, while widely recognized for enhancing empathetic communication in online interactions, remain underexplored in current empathetic dialogue research, largely due to the lack of comprehensive datasets. In this paper, we introduce the Agent for STICKERCONV (Agent4SC), which uses collaborative agent interactions to realistically simulate human sticker usage and thereby enhance multimodal empathetic communication. Building on this foundation, we develop STICKERCONV, a multimodal empathetic dialogue dataset comprising 12.9K dialogue sessions, 5.8K unique stickers, and 2K diverse conversational scenarios; it serves as a benchmark for multimodal empathetic generation. We further propose PErceive and Generate Stickers (PEGS), a multimodal empathetic response generation framework, complemented by a comprehensive set of LLM-based empathy evaluation metrics. Our experiments demonstrate PEGS's effectiveness in generating contextually relevant and emotionally resonant multimodal empathetic responses, contributing to the development of more nuanced and engaging empathetic dialogue systems.

Technical Description


• Agent for STICKERCONV (Agent4SC)


Figure 1: Overview of Agent4SC. The Memory and Plan modules enable the agent to mimic human observation and reflection, overcoming LLMs' difficulty in grasping nuanced emotions. The Action module generates insights with human-like emotional reactions, and the Profile module gives each agent distinct reflections and actions. Furthermore, Agent4SC treats stickers as a Tool for more natural conversation, allowing the agent to choose stickers as humans do. These modules streamline observation, reflection, and action, while the Manager Agent maintains overall performance and quality.
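
To make the module interplay concrete, the following minimal Python sketch mirrors the flow in Figure 1: Memory stores observations, Plan reflects over recent memory, Action produces a turn (optionally attaching a sticker via the sticker Tool), Profile differentiates agents, and the Manager Agent reviews each turn. All names here (`UserAgent`, `StickerTool`, `ManagerAgent`, `call_llm`, ...) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the Agent4SC control flow described in Figure 1.
# All class/function names are assumptions for illustration only.
from dataclasses import dataclass, field


def call_llm(prompt: str) -> str:
    """Placeholder for a call to the underlying LLM backend."""
    return f"[LLM response to: {prompt[:40]}...]"


@dataclass
class Profile:
    """Gives each agent a distinct persona and emotional disposition."""
    persona: str
    emotion: str


@dataclass
class Memory:
    """Stores observations so the agent can reflect on dialogue history."""
    events: list = field(default_factory=list)

    def observe(self, event: str) -> None:
        self.events.append(event)

    def retrieve(self, k: int = 5) -> list:
        return self.events[-k:]


class StickerTool:
    """Treats the sticker pool as a tool the agent may query, like a human would."""
    def __init__(self, stickers: dict):
        self.stickers = stickers  # description -> sticker id

    def select(self, emotion: str):
        for desc, sid in self.stickers.items():
            if emotion in desc:
                return sid
        return None


class UserAgent:
    def __init__(self, profile: Profile, tool: StickerTool):
        self.profile, self.memory, self.tool = profile, Memory(), tool

    def plan(self) -> str:
        """Plan module: reflect on recent memory before acting."""
        context = " | ".join(self.memory.retrieve())
        return call_llm(f"As {self.profile.persona} feeling {self.profile.emotion}, "
                        f"given context: {context}, decide what to say next.")

    def act(self, incoming: str) -> dict:
        """Action module: produce a text reply and optionally attach a sticker."""
        self.memory.observe(incoming)
        reply = call_llm(self.plan())
        return {"text": reply, "sticker": self.tool.select(self.profile.emotion)}


class ManagerAgent:
    """Checks each turn for quality and consistency."""
    def review(self, turn: dict) -> bool:
        verdict = call_llm(f"Rate the consistency and empathy of: {turn}")
        # In the full system the Manager would reject low-quality turns and
        # trigger regeneration; the stubbed LLM here always passes.
        return "reject" not in verdict.lower()


if __name__ == "__main__":
    tool = StickerTool({"sad cat": "sticker_017", "happy dog": "sticker_042"})
    user = UserAgent(Profile(persona="a stressed student", emotion="sad"), tool)
    manager = ManagerAgent()
    turn = user.act("How was your week?")
    if manager.review(turn):
        print(turn)
```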

• STICKERCONV


Figure 2: An example of a multimodal conversation from our STICKERCONV dataset. Both parties can use stickers to express their emotions, which enhances interactivity and expressiveness. The assistant empathizes with the user based on the conversational context (green text).
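
To give a concrete sense of how a session like the one in Figure 2 might be represented, here is a purely illustrative Python sketch of a single session record; the field names and values are assumptions for illustration and do not reflect the released STICKERCONV schema.

```python
# Hypothetical structure of a single multimodal dialogue session; field names
# are illustrative assumptions, not the dataset's actual schema.
from collections import Counter

example_session = {
    "scenario": "The user is anxious about an upcoming job interview.",
    "turns": [
        {
            "speaker": "user",
            "text": "I can't stop worrying about tomorrow's interview.",
            "sticker": "stickers/anxious_penguin.png",  # optional sticker attachment
        },
        {
            "speaker": "assistant",
            "text": "It's completely normal to feel nervous. You've prepared well, "
                    "and that effort will show.",
            "sticker": "stickers/encouraging_thumbs_up.png",
            "empathy_explanation": "Acknowledges the user's anxiety and offers reassurance.",
        },
    ],
}

# Quick sanity check over the sketch: count sticker usage per speaker.
usage = Counter(t["speaker"] for t in example_session["turns"] if t.get("sticker"))
print(usage)
```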



Figure 3: The statistics of STICKERCONV.



Figure 4: Emotional distribution of sticker choices by users and the system. The chart reveals a striking trend: users show a marked preference for stickers conveying negative emotions, whereas the system predominantly uses stickers expressing neutral and positive emotions. This contrast not only reflects the distinct emotional expression preferences of users and the system but also highlights the system's active and supportive role in interactions.



Figure 5: Emotion distribution of user profiles in Agent4SC.


Figure 6: The 200 most popular emotion-related words in STICKERCONV.

• PEGS


Figure 7: The architecture of the PEGS framework, with the different routing options distinguished by colored connecting lines. Input stickers are jointly encoded by an image encoder, a Q-Former, and a linear layer, with Vicuna serving as the language model. The LLM's output activates two sets of special tokens, used differently across model versions: one set for image retrieval and the other as a textual condition for generation. Finally, the frozen image decoder generates images.
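
The routing in Figure 7 can be summarized with a structural PyTorch sketch: a frozen image encoder, Q-Former, and linear projection map sticker features into the LLM embedding space; the LLM emits special tokens whose hidden states are routed either to an image-retrieval head or used as a textual condition; and a frozen image decoder produces the output sticker. All modules below are lightweight stand-ins (small random layers rather than the real ViT/Q-Former/Vicuna/decoder), so this is a sketch of the data flow under stated assumptions, not the authors' implementation.

```python
# Structural sketch of the PEGS routing; every module is a lightweight stand-in.
import torch
import torch.nn as nn


class PEGSSketch(nn.Module):
    def __init__(self, vis_dim=256, llm_dim=512, n_query=8, n_img_tokens=4):
        super().__init__()
        self.image_encoder = nn.Linear(3 * 32 * 32, vis_dim)      # stand-in for a frozen ViT
        self.q_former = nn.TransformerEncoderLayer(vis_dim, nhead=4, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)                   # linear layer into LLM space
        self.llm = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)  # stand-in for Vicuna
        # Two heads driven by the LLM's special-token hidden states:
        self.retrieval_head = nn.Linear(llm_dim, 128)             # embedding for sticker retrieval
        self.condition_head = nn.Linear(llm_dim, 64)              # textual condition for the decoder
        self.image_decoder = nn.Linear(64, 3 * 32 * 32)           # stand-in for a frozen image decoder
        self.img_tokens = nn.Parameter(torch.randn(1, n_img_tokens, llm_dim))
        self.n_query = n_query

    def forward(self, sticker_pixels, text_embeds):
        # 1) Encode the input sticker and expand it into a short token sequence
        #    for the Q-Former stand-in, then project into the LLM embedding space.
        vis = self.image_encoder(sticker_pixels.flatten(1)).unsqueeze(1)
        vis = vis.repeat(1, self.n_query, 1)
        vis = self.proj(self.q_former(vis))
        # 2) Concatenate visual tokens, text embeddings, and the special image
        #    tokens, then run the (stand-in) LLM.
        b = sticker_pixels.size(0)
        seq = torch.cat([vis, text_embeds, self.img_tokens.expand(b, -1, -1)], dim=1)
        hidden = self.llm(seq)
        img_states = hidden[:, -self.img_tokens.size(1):]         # states of the special tokens
        # 3) Route the special-token states: retrieval embedding vs. generation condition.
        retrieval_embed = self.retrieval_head(img_states.mean(dim=1))
        condition = self.condition_head(img_states.mean(dim=1))
        generated = self.image_decoder(condition).view(b, 3, 32, 32)
        return retrieval_embed, generated


if __name__ == "__main__":
    model = PEGSSketch()
    sticker = torch.randn(2, 3, 32, 32)   # toy sticker batch
    text = torch.randn(2, 10, 512)        # toy text embeddings already in LLM space
    retrieval_embed, generated_sticker = model(sticker, text)
    print(retrieval_embed.shape, generated_sticker.shape)
```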

Results



Figure 8: Examples of users interacting with PEGS. Users can chat with multimodal content (text and stickers) and receive multimodal empathetic responses. Left: a conversation characterized by positive emotion (happiness). Right: a conversation characterized by negative emotion (sadness).

BibTeX


@article{zhang2024stickerconv,
  title={STICKERCONV: Generating Multimodal Empathetic Responses from Scratch},
  author={Zhang, Yiqun and Kong, Fanheng and Wang, Peidong and Sun, Shuang and Wang, Lingshuai and Feng, Shi and Wang, Daling and Zhang, Yifei and Song, Kaisong},
  journal={arXiv preprint arXiv:2402.01679},
  year={2024}
}