Designing AI Infrastructure: The UX Behind IBM's Vela 2

Vela 2 is a cloud-based AI supercomputing platform that enables clients to configure, monitor, and manage high-demand infrastructure. As the sole Product Designer, I designed a UI that balances usability with advanced customization, providing scalable AI solutions without hardware investment. Refined through iterative design, Vela 2 integrates seamlessly into the broader AI ecosystem.

Solution context

Duration: 8 Months | Beta: 4Q24 | GA: 1Q25 | Role: Lead Product Designer

Every time someone interacts with generative AI, whether a consumer rewriting an email or a banking institution detecting fraudulent activity, their inquiry travels from their device to a set of powerful computers in a distant data center. These computers process the request and generate a response within fractions of a second. This is cloud-native AI in action. While a phone’s hardware may handle simple tasks, complex operations like predicting weather patterns or real-time health monitoring demand reliable systems with near-zero latency, something only a carefully designed AI infrastructure can provide.

Vela 2 is that infrastructure. More than just a collection of computers, Vela 2 is a cloud-native, high-performance computing infrastructure for large-scale model training and inferencing.

My contribution

At the start of 2024, IBM Cloud’s cross-functional Infrastructure Compute and Networking teams began developing a client-facing supercomputing platform to integrate into our existing Virtual Private Cloud portfolio. The project progressed from concept to Beta within eight months, from March to November of 2024, requiring rapid, iterative design and development.

As the sole Product Designer, I translated highly technical architectures and ideas into a client-facing UI. The design enables seamless AI cluster configuration, management, and monitoring while meeting technical and client requirements. To deliver an accurate set of designs, I led collaboration across engineering, API, architecture, research, and product teams.

Networking paradigms

One of the primary challenges was building an interface that would allow clients to configure clusters across two distinct networking paradigms:
North-South Networking is what allows clients to access, send, and receive data from their servers. In consumer-facing generative AI use cases, this is how you send an inquiry and receive a response.
East-West Networking is the communication channel between servers and their GPUs. Vela 2 servers are unusual in that each uses eight GPUs rather than the usual single GPU. Clients will not interface with this paradigm beyond initial configuration.

3-Node Cluster Network Architecture

User flow

The Vela 2 UI experience is organized around two main Virtual Private Cloud infrastructure pillars:
Virtual Server Instances is where users set up and deploy each server with accelerated GPUs. We call this a node.
Cluster Networks is the new top-level resource we created to support bi-directional networking and to group nodes into a unified cluster. With cluster networks, or AI clusters, servers work cohesively with each other and with the user’s existing Cloud deployments.
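
To make the relationship between the two pillars concrete, here is a minimal sketch of how nodes and a cluster network relate. It is an illustration only, not the Vela 2 API: the class names, fields, and the `build_cluster` helper are all hypothetical. The one detail taken from this case study is that each server uses eight GPUs.

```python
from dataclasses import dataclass, field
from typing import List

GPUS_PER_NODE = 8  # from the case study: each Vela 2 server uses eight GPUs


@dataclass
class Node:
    """A virtual server instance with accelerated GPUs (hypothetical shape)."""
    name: str
    gpus: int = GPUS_PER_NODE


@dataclass
class ClusterNetwork:
    """A top-level resource that groups nodes into one cluster (hypothetical shape)."""
    name: str
    nodes: List[Node] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes.append(node)

    def total_gpus(self) -> int:
        # Sum the GPUs across every node grouped in this cluster.
        return sum(node.gpus for node in self.nodes)


def build_cluster(name: str, node_count: int) -> ClusterNetwork:
    """Create a cluster and attach `node_count` nodes with default GPU counts."""
    cluster = ClusterNetwork(name)
    for i in range(node_count):
        cluster.add_node(Node(f"{name}-node-{i + 1}"))
    return cluster
```

Under these assumptions, `build_cluster("demo", 3).total_gpus()` counts the GPUs in a three-node cluster like the one in the architecture diagram above.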

Service stack across Compute and Network infrastructure pillars

Because Vela 2 serves as the infrastructure for an institution’s generative AI platform, it supports a spectrum of prospective users, each interacting with different services across the entire system. Focusing on their diverse use cases, I designed the UI to include a streamlined “one-click” deployment for quick setup, as well as advanced options for detailed customization as needed. Three primary pages guide clients through setup and management:
List View offers users a high-level overview of their clusters and shows how they fit in with other cloud deployments. It surfaces essential statuses at first glance, with a way to view more details as needed.
Provisioning is where users configure each node, the cluster that houses it, and any supporting deployments. I designed quick setup templates for clients who want a plug-and-play experience, while also providing options for those who want full control to dive into each setting and customize their deployments.
Details is a hub where users view in-depth details of each resource after node and cluster deployment. It offers a centralized portal for clients to monitor and adjust each configuration as needed.
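
The "quick setup plus advanced options" pattern in Provisioning can be sketched as a template of defaults that a client may selectively override. This is a simplified illustration under my own assumptions: the setting names and default values below are hypothetical and are not taken from the Vela 2 product or its API.

```python
from typing import Any, Dict, Optional

# Hypothetical quick-setup defaults; these keys and values are illustrative only.
QUICK_SETUP_TEMPLATE: Dict[str, Any] = {
    "node_count": 2,
    "gpus_per_node": 8,
    "attach_to_existing_vpc": True,
}


def resolve_provisioning_config(
    overrides: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Start from the template, then apply any advanced overrides on top."""
    config = dict(QUICK_SETUP_TEMPLATE)  # copy so the template is never mutated
    for key, value in (overrides or {}).items():
        if key not in QUICK_SETUP_TEMPLATE:
            # Reject settings the template does not know about.
            raise KeyError(f"unknown setting: {key}")
        config[key] = value
    return config
```

Calling `resolve_provisioning_config()` with no arguments models the one-click path, while passing something like `{"node_count": 3}` models a client who adjusts a single setting and keeps the remaining defaults.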

Mid fidelity wireframes

Designing each new page and integrating the Vela 2 experience into existing cloud resources and configuration pages required an in-depth understanding of the API calls behind each UI element. Success relied not only on understanding and communicating the backend architecture to non-technical stakeholders but also on effectively managing the frontend integration.

Usability testing

I conducted moderated usability testing with both internal users and external clients to redefine and refine the experience as it progressed. Internal sessions with the backend teams of related projects, like Kubernetes Clusters and IBM’s flagship AI service watsonx, helped optimize workflows for technical users and ensured consistency with other IBM products. Interviews with top energy sector clients provided valuable insights, shaping product expectations and market alignment. These sessions highlighted the need for a secure, flexible system capable of processing and analyzing petabytes of proprietary data without upfront hardware investment.

“[sessions] like these allow two companies to work together and give feedback, get visibility on how you think about products and you get some sense of how I think about consuming your products. I feel successful in this engagement and hope you feel the same on your side.” — Technical Program HPC Director, Energy Client

User flows of A/B test

Conclusion

Vela 2’s Virtual Server Instances and Cluster Networks have successfully completed their QA and Beta periods and are scheduled for a general availability release within the first quarter of 2025.

This project was both exciting and challenging, pushing me to evolve my design communication toolkit and expand my understanding of cloud architectures. I played a key role in guiding the team through a highly iterative design and development process within a fast-paced, high-stakes environment. Throughout my time on this project, I gained a deeper appreciation for the engineering and hardware constraints that shape every Cloud and AI experience. Check out a preview of the UI below, and contact me to learn more.

Team feedback

“Releasing a new product within one year is incredible. I witnessed tremendous growth in both your domain expertise and technical skills. You expanded your cloud computing knowledge by building on your networking background and connecting it to compute concepts. Not only that, but you also helped partners grasp new concepts like east-west networking! As part of the […] initiative, we’re transitioning IBM Software to run on IBM Cloud. Before the beta release of Cluster Networking, watsonx operated on IBM Research’s Vela A100 system. Now, thanks to Eli’s efforts, watsonx runs on our own cloud infrastructure. Once GA’d, this will enable customers to train their own AI models. Bravo on supporting such a pivotal project and contributing to this incredible achievement!” — Director and Head of Design, IBM Cloud Platform