Selected work

Distyl AI: Designing for golden set management

Timeline

September - December 2024 (4 months)

Team

1 Product Manager

1 Project Mentor

2 Product Designers

1 UX Researcher

4 Software Engineers

Role

Product Designer

Tools

Figma

Notion

Context

Overview

Distyl AI develops enterprise-focused LLM solutions, providing model implementations tailored to specific customer use cases. AI strategists improve and evaluate these implementations through a patchwork of existing tools.

The ask was to build a tool for these audiences to view and manage golden sets (human-created input and response pairs) for specific model implementations. This involved visualizing output comparisons and fostering iterative feedback between AI strategists.

Project Brief

Golden sets are human-created input and response pairs used to compare and validate AI model responses. My task was to build a tool for these audiences to view and manage golden sets for specific model implementations.
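
To make the core object concrete, here is a minimal sketch of how a golden set entry could be represented; the field names are illustrative assumptions, not Distyl's actual schema:

# Illustrative shape of a golden set entry (field names are assumptions, not Distyl's schema).
from dataclasses import dataclass, field
from typing import List

@dataclass
class GoldenSetEntry:
    input_prompt: str       # human-authored input
    expected_response: str  # human-validated ground-truth response
    tags: List[str] = field(default_factory=list)  # e.g., use case or reviewer annotations

entry = GoldenSetEntry(
    input_prompt="What is the return policy for damaged items?",
    expected_response="Damaged items can be returned within 30 days with proof of purchase.",
)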

⚡️ The Challenge: The current golden set management process on Distyl Cloud is fragmented and difficult to navigate.

When reviewing model performance, AI strategists must leave Distyl Cloud and manage solutions in external tools. This inefficient process leads to frustration, slows down iteration, and introduces inconsistencies, resulting in more unattended golden sets and a reliance on manual QA.

The user pulls ground-truth answers from the client platform and manually logs them into Excel.

Model outputs from Distyl Cloud are exported separately & compared in parallel sheets.

Users inspect mismatches row by row. Even correct but differently phrased responses are flagged, requiring a manual override (see the sketch after this list).

Feedback loops happen in Notion, where users document disruptions & validate answers, resulting in fragmented & non-scalable workflows.
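
A minimal sketch of why this step is painful, assuming a plain string comparison like the spreadsheet workflow above (word overlap here stands in for the LLM-based comparison discussed later; none of this is Distyl's actual code):

# Spreadsheet-style check: an exact string match flags responses that are
# correct but phrased differently, forcing a manual override.
golden_answer = "Damaged items can be returned within 30 days with proof of purchase."
model_output = "You may return a damaged item within 30 days if you have a receipt."

exact_match = model_output.strip().lower() == golden_answer.strip().lower()
print("exact match:", exact_match)  # False -> flagged as a mismatch

# A looser signal (word overlap as a stand-in for an LLM-based comparison)
# helps triage "correct but different" cases instead of overriding each one by hand.
shared_words = set(model_output.lower().split()) & set(golden_answer.lower().split())
print("shared words:", len(shared_words))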

User Research

We conducted one-on-one interviews with AI strategists and SMEs to identify pain points in their current LLM systems. Through these discussions, we gathered several key insights that informed our design process:

Dynamic tool integration - Spreadsheet tools do not integrate with the model systems, so model recomputations cannot be incorporated automatically.

Actions to Promote Collaboration - Allow in-line and global editing, version control, tagging teammates, and leaving annotated feedback.

User Experience - Existing evaluation processes are disjointed, creating a frustrating user experience and decreased actionability.

Together with the project's user researcher, we also identified the primary user persona, known as the Enactor.

Enactor

Responsible for a bit of everything, from model implementations to validation. Works with enterprise customers to understand requirements, refine solutions, and ensure quality.

“We keep the models accurate, up to date, and do everything in between.”

Mindsets

Centered on the customer, the Enactor's mindsets span the full lifecycle:

Drive AI adoption

Plan & design the future

Procure the feedback loop

Maintain the evaluation standard

Ensure a scalable process

Develop & integrate solutions

Common Titles

AI Strategist (IC or Manager/Director)

ML Implementation (IC or Manager/Director)

Data Scientist (IC or Manager/Director)

Product Manager/Director

Organization Type

Mostly SMB and midmarket organizations (fewer than 5,000 employees)

Top Tool Types

Excel/Google Sheets

Distyl Cloud

Synchronization (Notion)

Coffee (Internal System)

Goals

Scale AI implementations across multiple enterprises

Track & validate golden set performance

Collaboration with customers & team members

High end-user satisfaction

Key Tasks

Curate and update golden sets

Review and annotate current model outputs

Assess and tag/assign cases to collaborators

Maintain the data inventory and remove retired sets

Translate enterprise requirements into application data

Manage support cases and customer feedback

Manage stakeholder engagement and system testing/validation criteria

Manage product demos during weekly check-ins

Challenges

Fragmented workflows across Excel, Notion, and Distyl Cloud

No model system integration; relies on manual exports and test environments

Difficult to track version history & set changes

Inconsistent annotations without cross-feedback

No centralized system for status tracking & evaluation

Ideal Features

Iteration: String-level match, golden set and LLM-based comparison

Support: Case management, deployment guides, technical documentation, learning materials

Milestones: Summary of model version milestones and evaluation benchmarks

Data Governance: Simplified permission and access control for test sets

Upgrades: AI-assisted prompts and output suggestions

End-to-End Involvement

Involvement (rated from high to none) is mapped across the lifecycle stages: Need, Order, Pay, Implement & Onboard, Evaluate, Optimize & Modify, Scale, and Retire & Remove.

Solution Brainstorming

During our brainstorming process, we employed story-framing to assess the feasibility, impact and user engagement of each concept. This approach allowed us to visualize how each solution could effectively address user pain points and align with our design objectives.

Decision 1 - Solution Direction

Spotify Wrapped

Our first idea focused on highlighting the model implementation's golden set health, similar to Spotify's year-in-review. However, it didn't ease the evaluation process or address the key takeaways from user research, so it didn't solve the core problem.

Support Route

We chose this route instead because it directly addresses the evaluation experience by providing immediate, actionable support. This eliminates the need for users to leave the platform, reducing the time and effort required to compare sets.

[VISUAL - Examples of our storyframes]


Decision 2 - Technical AI implementation

Chatbot with Generative AI

While considering a chatbot, we identified several critical issues:

Technical feasibility and reliability: Developing a reliable AI chatbot requires significant resources and advanced technology, which may not be fully mature or available for our needs.

Risks associated with Generative AI: Distyl handles sensitive customer information where enterprise trust is critical, and generative AI carries inherent risks of hallucination and misinformation.

Build with AI

We identified that most of the necessary data already existed across existing implementations, making it accessible for analysis and pattern recognition. By training language models on this data, we could build intelligent logic to curate answers to queries based on user inputs. We also considered incorporating SME feedback and business rules to enhance the accuracy of our solutions.
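
As a rough illustration of this direction (the retrieval, drafting, and rule functions below are hypothetical stand-ins, not Distyl's implementation), the logic could retrieve similar historical cases, draft an answer, and pass it through SME-defined business rules before surfacing a suggestion:

# Hypothetical sketch: curate an answer from existing implementation data,
# then filter it through SME-defined business rules before surfacing it.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class HistoricalCase:
    query: str
    validated_answer: str  # an answer an SME has already approved

def retrieve_similar(query: str, cases: List[HistoricalCase], k: int = 3) -> List[HistoricalCase]:
    # Naive stand-in for retrieval: rank past cases by word overlap with the new query.
    words = set(query.lower().split())
    return sorted(cases, key=lambda c: -len(words & set(c.query.lower().split())))[:k]

def draft_answer(query: str, context: List[HistoricalCase]) -> str:
    # Placeholder for a model call that drafts an answer from the retrieved context.
    return context[0].validated_answer if context else "No suggestion available."

def apply_business_rules(answer: str, rules: List[Callable[[str], bool]]) -> Optional[str]:
    # Only surface the suggestion if every SME-defined rule passes.
    return answer if all(rule(answer) for rule in rules) else None

cases = [HistoricalCase("How do I reset a user's access?",
                        "Use the admin console under Settings > Users.")]
rules = [lambda a: len(a) < 500]  # e.g., keep suggestions concise
query = "reset access for a user"
suggestion = apply_business_rules(draft_answer(query, retrieve_similar(query, cases)), rules)
print(suggestion)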

Decision 3 - Side Pane vs. Modal

Streamlined User Experience

A side panel provides a seamless, uninterrupted flow of information while still allowing the user to follow and engage with the background content. It causes minimal disruption and provides continuous context.

Consistency with AI Interfaces

Competitive analysis of products such as ChatGPT showed that this format is familiar to users because of its common use in AI interfaces for data review and edits.

Scalable Design

Side panels adapt better to extensibility and nesting, providing a better experience for users who access the content in a split-pane or multi-edit flow.

V1 Design

For our initial design, we focused on exploring page layouts and utilizing components in our existing Figma libraries. We used an existing query dataset in our testing environment as our example use case to populate content.

[VISUAL - B&W Page Layout 1]

V2 Design

Version Control Display

After getting feedback from other team members and the client, we implemented these changes:

[VISUAL - Page Layout 2]


CRUD platforms, including ours, often include version control visibility. We created two different flows to display version control:

Option 1 - Version Block

The user toggles into a Version History view, where changes are displayed as version blocks of past states, keeping them informed of asynchronous reviews.

[VISUAL - Option 1]

Option 2 - Activity Drawer

The user is informed of asynchronous review in a slide-in activity drawer. The user can minimize the drawer and interact with the full golden set display. 

Preference testing showed that users preferred Option 2.

Given that tags and comments drive collaboration, users liked being able to maintain context and validate at a granular level rather than viewing version snapshots.

Common positive feedback centered on the ability to minimize the drawer, since it optimized screen space and kept the layout consistent with the golden set comparison panels.

Prototyping

My project mentor and I took ownership of building the final prototype that was presented to the client to demonstrate our idea.

No More Messy Pages!

A pain point we wanted to address was the common, large “spider web” of frames, which makes a file messy to navigate and difficult for other designers to collaborate on.

Example of a prototype found online. This sucks :(

Prototyping with Systems

Instead of creating a frame for every screen, we approached this with design-system thinking. This meant building the prototype with scalability in mind: adding interactions to molecules that are reused constantly across screens.

Motion Sequence

I also learned about motion sequence chains that are prominent in designing for AI experiences. These chains allow the user to maintain context and orientation within the experience.

Motion sequence chains are made by combining motion sequences together. For example, you can place the Responding state sequence right after the Processing state sequence to transition smoothly from one sequence to the next.

[VISUAL - Motion sequence chain: static state → processing (start, loop, end) → responding (start, loop, end) → static state]
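
As a rough sketch of the chaining idea (the playback model below is a hypothetical stand-in, not Figma's actual behavior), a chain simply plays ordered sequences back to back, bookended by the static state:

# Hypothetical sketch: a motion chain is ordered sequences played back to back.
processing = ["processing-start", "processing-loop", "processing-end"]
responding = ["responding-start", "responding-loop", "responding-end"]

def chain(*sequences):
    frames = ["static state"]
    for seq in sequences:
        frames.extend(seq)
    frames.append("static state")
    return frames

print(chain(processing, responding))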

Results

In alignment with Distyl's strategic shift towards enhancing the Customer Experience (CX) through AI integration, our team presented our solution to executive leadership, alongside two other teams contributing ideas to this new direction.

What We Did Well

Our solution received positive feedback for its strong technical feasibility and clear implementation strategy within the given time frame, which aligned with Distyl’s practical needs. Leadership appreciated our user-centered approach and how we identified realistic AI applications that could be brought to life in the near term.

Where We Can Improve

However, we were also encouraged to innovate further and push the boundaries of our AI integration, with suggestions to enhance the novelty and impact of our concepts. 


Our ideas played a significant role in shaping the future direction of the Distyl LLM evaluation system. The feedback we received not only validated our execution but also influenced the client's decision to emphasize AI-driven experiences as a core component of the department's new strategic focus. Our work demonstrated the importance of aligning technological advancements with user needs and laid the groundwork for future AI initiatives within Distyl AI.