Bringing vision to apps: .NET MAUI introduces multimodal intelligence

Microsoft is making it easier for developers to integrate vision intelligence into their apps with the latest advancements in .NET MAUI. By combining AI and multimodal input, developers can now create richer, image-aware experiences beyond traditional text-based interactions.

Here’s how you can let users capture or select images—and have AI extract actionable data to create projects and tasks in your app:

  • Step 1: Launch the Camera

    Tapping the camera icon on the MainPage's floating action button navigates to the PhotoPage, where the MediaPicker API handles photo capture and selection (see the first sketch after this list).

  • Step 2: Handle Media Input

    The PhotoPageModel manages image input through an EventToCommandBehavior that invokes a command when the page's Appearing lifecycle event fires (Steps 2 and 3 are sketched together in the second code block after this list).

  • Step 3: Capture or Select Based on Device

    Decorated with [RelayCommand], the PageAppearing method decides at runtime whether to launch the camera or the file picker, using .NET MAUI's cross-platform APIs such as DeviceInfo and MediaPicker to abstract away platform differences.

  • Step 4: Display and Interact

    Once the image is received, it's displayed alongside an optional Editor where users can type instructions. The prompt is built with StringBuilder, and an IChatClient instance (from Microsoft.Extensions.AI) handles both text and image data through a ChatMessage that combines TextContent and DataContent (see the final sketch after this list).
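
Here's a minimal sketch of Step 1's navigation. MainPage and PhotoPage come from the walkthrough; the handler name OnCameraClicked and Shell-based routing are assumptions for illustration:

```csharp
// MainPage.xaml.cs -- hypothetical handler for the floating action button's
// camera icon; wire it to the button's Clicked event.
private async void OnCameraClicked(object sender, EventArgs e)
{
    // Assumes PhotoPage is registered as a Shell route (e.g., in AppShell).
    await Shell.Current.GoToAsync(nameof(PhotoPage));
}
```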
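
Steps 2 and 3 come together in the page model. The sketch below assumes CommunityToolkit.Mvvm for [RelayCommand] and [ObservableProperty] and CommunityToolkit.Maui for EventToCommandBehavior; the property names and the idiom check are illustrative, not the post's actual implementation:

```csharp
using CommunityToolkit.Mvvm.ComponentModel;
using CommunityToolkit.Mvvm.Input;

public partial class PhotoPageModel : ObservableObject
{
    // [ObservableProperty] generates a bindable Photo property for the
    // page's Image control.
    [ObservableProperty]
    private ImageSource? _photo;

    private FileResult? _photoFile;

    // [RelayCommand] generates PageAppearingCommand, which the XAML binds to
    // the page's Appearing event, roughly:
    //   <toolkit:EventToCommandBehavior EventName="Appearing"
    //                                   Command="{Binding PageAppearingCommand}" />
    [RelayCommand]
    private async Task PageAppearing()
    {
        // Use the camera where capture is supported (typically phones and
        // tablets); fall back to the picker on desktop.
        bool useCamera = MediaPicker.Default.IsCaptureSupported &&
                         DeviceInfo.Current.Idiom != DeviceIdiom.Desktop;

        _photoFile = useCamera
            ? await MediaPicker.Default.CapturePhotoAsync()
            : await MediaPicker.Default.PickPhotoAsync();

        if (_photoFile is not null)
            Photo = ImageSource.FromFile(_photoFile.FullPath);
    }
}
```

PickPhotoAsync stands in for the post's file-picker path here; FilePicker.Default.PickAsync with FilePickerFileType.Images would work equally well.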
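
Finally, a sketch of Step 4's multimodal call, continuing the same page model. The Analyze command name, the prompt text, and constructor injection of IChatClient are assumptions; ChatMessage, TextContent, and DataContent are the Microsoft.Extensions.AI types the post names, and GetResponseAsync matches recent releases of that library (earlier previews called it CompleteAsync):

```csharp
using System.Text;
using CommunityToolkit.Mvvm.Input;
using Microsoft.Extensions.AI;

public partial class PhotoPageModel
{
    // Assumed to be injected; register a concrete IChatClient (OpenAI, Azure,
    // Ollama, ...) in MauiProgram's service collection.
    private readonly IChatClient _chatClient;

    public PhotoPageModel(IChatClient chatClient) => _chatClient = chatClient;

    // Hypothetical command bound to a "Send" button next to the Editor.
    [RelayCommand]
    private async Task Analyze(string? instructions)
    {
        if (_photoFile is null)
            return;

        // Read the captured/selected image into memory.
        using var stream = await _photoFile.OpenReadAsync();
        using var ms = new MemoryStream();
        await stream.CopyToAsync(ms);

        // Build the text half of the prompt with StringBuilder, folding in
        // the user's optional Editor instructions.
        var prompt = new StringBuilder()
            .AppendLine("Extract the projects and tasks shown in this image.")
            .AppendLine(instructions ?? string.Empty)
            .ToString();

        // A single ChatMessage carries both the text and the image bytes.
        var message = new ChatMessage(ChatRole.User,
        [
            new TextContent(prompt),
            // "image/jpeg" is an assumed media type; _photoFile.ContentType
            // reports the real one.
            new DataContent(ms.ToArray(), "image/jpeg")
        ]);

        var response = await _chatClient.GetResponseAsync([message]);
        // response.Text holds the model's reply, ready to be parsed into
        // projects and tasks.
    }
}
```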