Timeline
September - December 2024 (4 months)
Team
1 Product Manager
1 Project Mentor
2 Product Designers
1 UX Researcher
4 Software Engineers
Role
Product Designer
Tools
Figma
Notion
Context
Overview
Distyl AI develops enterprise-focused LLM solutions that provide model implementations for specific customer use cases. AI strategists evaluate and improve these models using a patchwork of existing tools.
The ask was to build a tool for AI strategists to view and manage golden sets (human-created input and response pairs) for specific model implementations. This involved visualizing output comparisons and fostering iterative feedback among AI strategists.
Project Brief
Golden sets are human-created input and response pairs used to compare and validate AI model responses. My task was to build a tool for AI strategists to view and manage golden sets for specific model implementations.
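To make the object we were designing around concrete, here is a minimal sketch of what a single golden set entry might contain. The field names are illustrative assumptions, not Distyl's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class GoldenSetEntry:
    """One human-created input/response pair for a specific model implementation.

    Field names are illustrative assumptions, not Distyl's actual schema.
    """
    entry_id: str                  # unique identifier for the pair
    model_impl: str                # which model implementation this entry validates
    input_text: str                # the prompt or query given to the model
    expected_response: str         # the human-approved ground-truth answer
    tags: list[str] = field(default_factory=list)  # e.g. reviewer tags or topics
    status: str = "active"         # e.g. "active", "needs_review", "retired"

# Example entry
entry = GoldenSetEntry(
    entry_id="gs-001",
    model_impl="claims-triage-v2",
    input_text="What documents are required to file a claim?",
    expected_response="A completed claim form and proof of loss.",
)
```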
⚡️ The Challenge: The current golden set management process on Distyl Cloud is fragmented and difficult to navigate.
When reviewing model performance, AI Strategists must leave Distyl Cloud and manage golden sets in external tools. This inefficient process leads to frustration, slows down iteration, and introduces inconsistencies, resulting in more unattended golden sets and a reliance on manual QA.

The user manually pulls ground truth answers from the client platform and logs them in Excel.

Model outputs from Distyl Cloud are exported separately & compared in parallel sheets.

Users inspect mismatches row-by-row. Even correct but differently phrased responses are flagged, requiring manual override.

Feedback loops happen in Notion, where users document disruptions & validate answers, resulting in fragmented & non-scalable workflows.
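To illustrate the pain point, here is a rough sketch (Python/pandas, standing in for the spreadsheet workflow rather than the team's actual tooling) of the parallel-sheet comparison: an exact string match flags any response worded differently from the ground truth, even when it is correct, which is what forces the row-by-row manual overrides.

```python
import pandas as pd

# Hypothetical data standing in for the exported sheets described above.
ground_truth = pd.DataFrame({
    "query": ["What is the refund window?"],
    "expected": ["Refunds are accepted within 30 days of purchase."],
})
model_outputs = pd.DataFrame({
    "query": ["What is the refund window?"],
    "response": ["You can get a refund within 30 days of buying the item."],
})

merged = ground_truth.merge(model_outputs, on="query")
# Exact string comparison: the only check a parallel-sheet workflow gives you.
merged["mismatch"] = merged["expected"] != merged["response"]

# The response above is correct but phrased differently, so it is flagged
# as a mismatch and must be manually overridden.
print(merged[["query", "mismatch"]])
```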
User Research
We conducted one-on-one interviews with AI strategists and SMEs to identify pain points in their current LLM systems. Through these discussions, we gathered several key insights that informed our design process:
Dynamic tool integration - Spreadsheet tools do not integrate with the model system, so model recomputations cannot be incorporated automatically.
Actions to Promote Collaboration - Allow in-line and global editing, version control, tagging teammates, and leaving annotated feedback.
User Experience - Existing evaluation processes are disjointed, creating a frustrating user experience and decreasing actionability.
Together with the project's UX researcher, we also identified the primary user persona, known as the Enactor.
Enactor
Responsible for a bit of everything, from model implementations to validation. Works with enterprise customers to understand requirements, refine solutions, and ensure quality.
We keep the models accurate, up-to-date and do everything in between.
Mindsets
DRIVE AI ADOPTION
PLAN & DESIGN THE FUTURE
PROCURE FEEDBACK LOOP
MAINTAIN EVALUATION STANDARD
ENSURE A SCALABLE PROCESS
DEVELOP & INTEGRATE SOLUTIONS
Common Titles
AI Strategist (IC or Manager/Director)
ML Implementation (IC or Manager/Director)
Data Scientist (IC or Manager/Director)
Product Manager/Director
Organization Type
Mostly SMB and midmarket organizations (fewer than 5,000 employees)
Top Tool Types
Excel/Google Sheets
Distyl Cloud
Synchronization (Notion)
Coffee (Internal System)
Goals
Scale AI implementations across multiple enterprises
Track & validate golden set performance
Collaboration with customers & team members
High end-user satisfaction
Key Tasks
Curate and update golden sets
Review and annotate current model outputs
Assess and tag/assign cases to collaborators
Maintain data inventory and remove retired sets
Translate enterprise requirements into application data
Manage support cases and customer feedback
Manage stakeholder engagement and system testing/validation criteria
Manage product demos during weekly check-ins
Challenges
Fragmented workflows across Excel, Notion and Distyl Cloud
Lacks model system integration and relies on manual exports and test environments
Difficult to track version history & set changes
Inconsistent annotations without cross-feedback
No centralized system for status-tracking & evaluation
Ideal Features
Iteration - String-level match, golden set and LLM-based comparison (sketched in code below)
Support - Case management, deployment guides, technical documentation, learning materials
Milestones - Summary of model version milestones and evaluation benchmarks
Data Governance - Simplified permission and access control for test sets
Upgrades - AI-assisted prompts and output suggestions
END-TO-END INVOLVEMENT
Involvement ranges from none to high across the customer lifecycle stages: Need, Order, Pay, Implement & Onboard, Evaluate, Optimize & Modify, Scale, and Retire & Remove.
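As a rough illustration of the "Iteration" feature above, here is a hedged sketch of how string-level and LLM-based comparison might be combined when checking a model output against its golden set entry. The judge function is a placeholder for whatever evaluation model the platform would use; it is an assumption, not Distyl's implementation.

```python
from typing import Callable

def string_level_match(expected: str, actual: str) -> bool:
    """Cheap first pass: normalized exact comparison."""
    return expected.strip().lower() == actual.strip().lower()

def token_overlap_judge(expected: str, actual: str) -> bool:
    """Stand-in for an LLM-based equivalence check.

    A real system would ask an evaluation model whether `actual` conveys the
    same answer as `expected`; this crude token-overlap heuristic only exists
    so the sketch runs end to end.
    """
    exp, act = set(expected.lower().split()), set(actual.lower().split())
    return len(exp & act) / max(len(exp), 1) > 0.6

def compare_to_golden(expected: str, actual: str,
                      judge: Callable[[str, str], bool] = token_overlap_judge) -> str:
    if string_level_match(expected, actual):
        return "match"            # identical wording, no review needed
    if judge(expected, actual):
        return "semantic_match"   # reworded but equivalent, auto-accepted
    return "needs_review"         # genuine disagreement, surfaced to the strategist

print(compare_to_golden(
    "Refunds are accepted within 30 days of purchase.",
    "Refunds are accepted within 30 days of purchase, per policy.",
))  # -> "semantic_match"
```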
Solution Brainstorming
During our brainstorming process, we employed story-framing to assess the feasibility, impact and user engagement of each concept. This approach allowed us to visualize how each solution could effectively address user pain points and align with our design objectives.
Decision 1 - Solution Direction
Spotify Wrapped
Our first idea focused on highlighting the model implementation's golden set health, similar to Spotify's year-in-review. However, it didn't ease the evaluation process or address the key takeaways from user research, so it didn't solve the core problem.
Support Route
We chose this route instead because it directly addresses the evaluation experience by providing immediate, actionable support. This eliminates the need for users to leave the platform, reducing the time and effort required to compare sets.
[VISUAL - Examples of our storyframes]

Decision 2 - Technical AI implementation
Chatbot with Generative AI
While considering a chatbot, we identified several critical issues:
Technical feasibility and reliability: Developing a reliable AI chatbot requires significant resources and advanced technology, which may not be fully mature or available for our needs.
Risks associated with Generative AI: Distyl handles sensitive customer information where enterprise integrity is critical, and generative AI carries inherent risks of hallucination and misinformation.
Build with AI
We identified that most of the necessary data already existed across existing implementations, making it accessible for analysis and pattern recognition. By training language models on this data, we could build intelligent logic to curate answers to queries based on user inputs. Additionally, we considered incorporating SME feedback and business rules to enhance the accuracy of our solutions.
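As a hedged sketch of this direction (my own illustration, not the engineering team's design): answers could be curated by retrieving the most similar past query from existing implementation data and then filtering the result through SME-authored business rules, rather than generating free-form text.

```python
from difflib import SequenceMatcher

# Hypothetical store of existing implementation data: past queries with
# SME-validated answers. In practice this would come from Distyl Cloud,
# not a hard-coded list.
KNOWLEDGE = [
    {"query": "How do I re-run the golden set after a model update?",
     "answer": "Trigger a recomputation from the implementation's evaluation tab."},
    {"query": "Who can approve changes to a golden set?",
     "answer": "Only strategists with editor access can approve changes."},
]

def passes_business_rules(answer: str) -> bool:
    # SME-authored rule applied after retrieval; illustrative only.
    banned_terms = ["guaranteed", "always"]
    return not any(term in answer.lower() for term in banned_terms)

def curate_answer(user_query: str) -> str:
    # Retrieve the closest known query instead of generating new text,
    # which avoids the hallucination risk discussed above.
    best = max(KNOWLEDGE,
               key=lambda item: SequenceMatcher(None, user_query.lower(),
                                                item["query"].lower()).ratio())
    if passes_business_rules(best["answer"]):
        return best["answer"]
    return "No approved answer found; routing to an SME for review."

print(curate_answer("How can I rerun the golden set after updating the model?"))
```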


Decision 3 - Side Pane vs. Modal
Streamlined User Experience
The side panel provides a seamless, uninterrupted flow of information while allowing the user to follow and engage with the background content. It causes minimal disruption and provides continuous context.
Consistency with AI Interfaces
Competitive analysis of products such as ChatGPT showed that this format is familiar to users due to its common use in AI interfaces for data review and edits.
Scalable Design
Side Panels are more adaptable for extensibility and nesting, providing a better experience for users accessing the content in a split pane or multi-edit function.
V1 Design
For our initial design, we focused on exploring page layouts and utilizing components in our existing Figma libraries. We used an existing query dataset in our testing environment as our example use case to populate content.
[VISUAL - B&W Page Layout 1]
V2 Design
Version Control Display
After getting feedback from other team members and the client, we implemented these changes:
[VISUAL - Page Layout 2]
CRUD platforms, including our example, often include version control visibility. We created two different flows to display version control:
Option 1 - Version Block
The user toggles into a Version History view, where changes are displayed as version blocks of past states, keeping them informed of asynchronous reviews.
[VISUAL - Option 1]
Option 2 - Activity Drawer
The user is informed of asynchronous review in a slide-in activity drawer. The user can minimize the drawer and interact with the full golden set display.
Preference testing showed that users preferred Option 2.
Given that tags and comments drive collaboration, users liked being able to maintain context and validate at a granular level rather than view version snapshots.
Common positive feedback centered on the ability to minimize the drawer, since it optimized screen space and kept the layout consistent with the golden set comparison panels.

Prototyping
My project mentor and I took ownership of building our final prototype that was presented to clients to demonstrate our idea.
No More Messy Pages!
A common pain point we wanted to address was the large “spider web” of frames, which makes a file messy to navigate and difficult for other designers to collaborate on.

Example of a prototype found online. This sucks :(
Prototyping with Systems
Instead of creating a frame for every different screen, we approached this with a design system way of thinking. This meant building the prototype with scalability in mind: adding interactions to molecules that are constantly reused across screens.

Motion Sequence
I also learned about motion sequence chains that are prominent in designing for AI experiences. These chains allow the user to maintain context and orientation within the experience.
Motion sequence chains are made by combining motion sequences together. For example, you can put the Responding state sequence right after the Processing state sequence to transition smoothly from one sequence to another.
[VISUAL - Motion sequence chain: static state → processing-start → processing-loop → processing-end → responding-start → responding-loop → responding-end → static state]
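To make the chaining idea concrete outside of Figma, here is a small, hypothetical sketch of the same structure in code: each state sequence has a start, loop, and end phase, and a chain is built by placing one sequence's end against the next sequence's start, bookended by static states.

```python
from dataclasses import dataclass

@dataclass
class MotionSequence:
    """A single state sequence: start -> loop -> end, as in the diagram above."""
    name: str

    def phases(self) -> list[str]:
        return [f"{self.name}-start", f"{self.name}-loop", f"{self.name}-end"]

def chain(*sequences: MotionSequence) -> list[str]:
    """Combine sequences into a motion sequence chain, bookended by static states."""
    timeline = ["static state"]
    for seq in sequences:
        timeline.extend(seq.phases())
    timeline.append("static state")
    return timeline

# Processing followed immediately by Responding, as described above.
print(chain(MotionSequence("processing"), MotionSequence("responding")))
```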
Results
In alignment with Distyl's strategic shift towards enhancing the Customer Experience (CX) through AI integration, our team presented our solution to executive leadership, alongside two other teams contributing ideas to this new direction.
What We Did Well
Our solution received positive feedback for its strong technical feasibility and clear implementation strategy within the given time frame, which aligned with Distyl’s practical needs. Leadership appreciated our user-centered approach and how we identified realistic AI applications that could be brought to life in the near term.
Where We Can Improve
However, we were also encouraged to innovate further and push the boundaries of our AI integration, with suggestions to enhance the novelty and impact of our concepts.
Our ideas played a significant role in shaping the future direction of the Distyl LLM evaluation system. The feedback we received not only validated our execution but also influenced client’s decision to emphasize AI-driven experiences as a core component of the department’s new strategic focus. Our work demonstrated the importance of aligning technological advancements with user needs, and laid the groundwork for future AI initiatives within Distyl AI.