Multimodal Interaction

Broken Conversations – A Practical Guide for Improving Chatbot UX

This category looks at how well the chatbot handles inputs and outputs beyond text, such as voice, images, document uploads, or structured data, and whether these modes are used effectively.

Multimodal interaction refers to a chatbot’s ability to process and deliver information across various formats, including text, voice, images, document uploads, and structured data. User efficiency is maximized when the input and output methods are tailored to the specific task. For example, it is often faster to upload an image of a complex error code than to transcribe it, while a well-structured table may be easier to parse than a long voice description.

When these modes are misaligned with the user’s needs, the result is always the same: a frustrated user.

Scenario 1: Image upload constraints

The chatbot asks the user for complex information that is hard to type out, but it accepts only text input and offers no image upload or scanning option.

Scenario 2: Lack of visual support for complex data

A user asks the chatbot for information that is dense or multi-dimensional. The chatbot responds with a long, text-only explanation.

Scenario 3: Non-interactive visual responses

A user asks a chatbot for information they want to quickly scan, copy, reuse, or open in separate tabs (e.g. instructions, links, or policy references). Instead of returning selectable text with links, the chatbot provides the answer as a static image or screenshot.

Scenario 4: Missing voice input options

A user interacts with a chatbot on their mobile device while on the move, multitasking, or holding items, making typing inconvenient. The user would prefer to ask a question verbally or dictate a longer description rather than typing it out. However, the chatbot only accepts typed text and does not support voice input or voice messages.

Scenario 1: Image upload constraints

The chatbot asks the user for complex information that is hard to type out, but it accepts only text input and offers no image upload or scanning option.

Examples

Do I need to type this word for word?

The chatbot asked the user to share the error message they encountered. The message was hard for the user to reproduce because it contained long alphanumeric codes.

Did I miss a digit?

The chatbot asked the user to share a long and complex identifier, such as their IMEI (International Mobile Equipment Identity: a 15-digit unique numeric identifier for mobile phones) or VIN (Vehicle Identification Number: a 17-character identifier for individual vehicles).

Why is this an issue?

The chatbot forces the user to manually enter complex information:

Users must manually type or copy long strings of characters.
Errors are common and may go unnoticed.
The interaction becomes slower and more frustrating than necessary.

Why do we care?

Typing long strings of characters is error-prone and takes much longer than uploading an image that contains the requested information:

Higher error rate: Long alphanumeric strings are prone to transcription mistakes.
Inefficiency: Manual entry takes significantly longer than sharing an image.
Task failure: Invalid inputs lead to retries, loops, or escalation to support.
Poor experience: The chatbot feels rigid and misaligned with real-world usage.

What is the remedy?

When users are asked to provide complex identifiers, the chatbot should minimize manual effort and thus the risk of mistakes:

Prefer image-based input when possible: Allow users to upload images with the required information.
Support scanning and automation: Offer a scanning option to reduce transcription errors, where possible.
Offer a copy/paste-friendly flow: Add a “paste from clipboard” prompt, and detect and correct common formatting issues (e.g. leading or trailing spaces).
Allow lookup via alternative identifiers: Ask for a license plate to retrieve the VIN, for example.
Use a structured input UI: To simplify entry, use segmented fields (e.g. 3–4 character blocks), auto-formatting, and character restrictions where applicable.
Validate early and clearly: Run real-time checks (e.g. length and forbidden characters). Highlight likely mistakes immediately and explain how to fix them.
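
The cleanup, segmented display, and early validation described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function names and hint wording are hypothetical, and it assumes that the IMEI check digit follows the Luhn algorithm and that VINs never contain the letters I, O, or Q. It deliberately skips the VIN check digit, which is not used in every region.

```python
import re

# Assumption: a VIN uses digits and capital letters, excluding I, O, and Q.
VIN_PATTERN = re.compile(r"[A-HJ-NPR-Z0-9]{17}")


def normalize_identifier(raw: str) -> str:
    """Remove whitespace and common separators, then upper-case the input."""
    return re.sub(r"[\s\-./]", "", raw).upper()


def format_in_blocks(value: str, size: int = 4) -> str:
    """Split a long identifier into space-separated blocks for easier reading."""
    return " ".join(value[i:i + size] for i in range(0, len(value), size))


def luhn_is_valid(digits: str) -> bool:
    """Return True if a string of ASCII digits passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def check_imei(raw: str) -> list[str]:
    """Return user-facing hints; an empty list means the IMEI looks plausible."""
    imei = normalize_identifier(raw)
    if not re.fullmatch(r"[0-9]+", imei):
        return ["An IMEI contains digits only."]
    if len(imei) != 15:
        return [f"An IMEI has 15 digits; you entered {len(imei)}."]
    if not luhn_is_valid(imei):
        return ["The check digit does not match; please look for a mistyped digit."]
    return []


def check_vin(raw: str) -> list[str]:
    """Return user-facing hints; an empty list means the VIN looks plausible."""
    vin = normalize_identifier(raw)
    if len(vin) != 17:
        return [f"A VIN has 17 characters; you entered {len(vin)}."]
    if set(vin) & set("IOQ"):
        return ["A VIN never contains the letters I, O, or Q; check for a 1 or 0 instead."]
    if not VIN_PATTERN.fullmatch(vin):
        return ["A VIN contains only letters and digits."]
    return []
```

Running these checks on every keystroke or paste, rather than only on submit, lets the chatbot point out a missing digit or a mistyped letter before the user sends the message. Each hint says what is wrong and how to fix it instead of a generic "invalid input".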

Are there any exceptions to this rule?

There are justified exceptions when uploading an image might not be possible, such as:

Channel limitations: Channels like voice bots inherently do not support attachments or camera access.

In these cases, the chatbot should optimize the entry process as much as possible.