Case Studies

The Invisible Interface of Cloud-Based AI Supercomputers

Designing an IBM Cloud Software Suite

Product Design Lead | 12 Months | 2024 | Live Product

Imagine building, configuring, and managing an AI supercomputer from scratch. Not in a high-tech data center, but in a browser window. That’s the reality of Vela 2, IBM’s cloud-based AI supercomputing platform. As the sole Product Designer, I led the design from concept to launch, creating a scalable, intuitive UI that balances usability with advanced customization. By abstracting intricate configurations and reducing the risk of errors, I made the complexity invisible to the user for easy AI cluster management at scale.

2025 Outstanding Technical Achievement Award Winner | GPU Accelerated AI Workloads Experience

Pulling Back the Curtain of Generative AI

Every time someone interacts with generative AI, whether a consumer rewriting an email or a banking institution detecting fraudulent activity, their inquiry travels from their device to a set of powerful computers in a distant data center. These computers process the request and generate a response within a fraction of a second. This is cloud-native AI in action. While a phone’s hardware may handle simple tasks, complex operations like predicting weather patterns or real-time health monitoring demand reliable systems with near-zero latency, something only a carefully designed AI infrastructure can provide.

Vela 2 is that infrastructure. More than just a collection of computers, Vela 2 is a cloud-native, high-performance computing infrastructure for model training and inferencing at scale.

Designing the Invisible

At the start of 2024, IBM Cloud’s cross-functional Infrastructure Compute and Networking teams began developing a client-facing supercomputing platform to integrate into our existing Virtual Private Cloud portfolio. The project progressed from concept to General Availability within eight months, from March to November of 2024, requiring rapid, iterative design and development.

As the sole Product Designer, I translated highly technical architectures and ideas into a client-facing UI. The design enables seamless AI cluster configuration, management, and monitoring while meeting technical and client requirements. To deliver an accurate set of designs, I worked with and led the collaboration between engineering, API, architecture, research and product teams.

Rethinking Established Networking Paradigms

Introducing a dedicated back-end infrastructure for AI workloads was a challenge that demanded a united, innovative effort. Prior to Vela 2, IBM Cloud was built exclusively to support general-purpose cloud traffic. Enterprise e-commerce platforms rely on our infrastructure to handle North-South traffic (end-user requests to browse and purchase products), Burst traffic (high demand during peak hours), Latency-Sensitive traffic (real-time payment processing and inventory updates), and many more types of traffic all at once. In consumer-facing generative AI use cases, this is how you send an inquiry and receive a response. This design works well for unpredictable, high-entropy traffic, but isn’t optimized for any single use case.

*Data traffic flowing between the four nodes of a cluster, 2x multiplex (two paths per 8 AI accelerators).*

AI workloads are fundamentally different. Training and fine-tuning models generate structured, low-entropy traffic that depends on high bandwidth and East-West communication (direct, robust traffic between GPUs within one cluster). This required a purpose-built network infrastructure to handle these unique patterns at scale without interfering with well-established cloud traffic. With Vela 2, we designed and implemented a dedicated network for AI traffic, then integrated it into our existing cloud ecosystem.
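To make the North-South versus East-West distinction concrete, here is a minimal sketch in Python. It assumes a flow counts as East-West when both endpoints sit inside the same cluster, and as North-South when only one does. The node names, the `Flow` type, and the `classify` function are all hypothetical illustrations and are not part of Vela 2 or any IBM Cloud API.

```python
from dataclasses import dataclass

# Hypothetical node names for a four-node cluster (illustration only).
CLUSTER_NODES = {"node-1", "node-2", "node-3", "node-4"}


@dataclass(frozen=True)
class Flow:
    """A single traffic flow between two named endpoints."""
    source: str
    destination: str


def classify(flow: Flow, cluster_nodes: set[str] = CLUSTER_NODES) -> str:
    """Label a flow by whether it stays inside the cluster."""
    source_inside = flow.source in cluster_nodes
    destination_inside = flow.destination in cluster_nodes
    if source_inside and destination_inside:
        # Both endpoints are cluster nodes: GPU-to-GPU style traffic.
        return "east-west"
    if source_inside or destination_inside:
        # One endpoint is outside the cluster, such as an end user.
        return "north-south"
    # Neither endpoint belongs to this cluster.
    return "external"


flows = [
    Flow("node-1", "node-3"),
    Flow("end-user", "node-2"),
    Flow("node-4", "node-1"),
]
labels = [classify(flow) for flow in flows]
```

In this simplified view, a dedicated AI network would carry only the flows labeled East-West, leaving North-South traffic on the existing general-purpose network.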

From Single Nodes to Inferencing Clusters

The Vela 2 UI experience is organized within two main Virtual Private Cloud infrastructure pillars:
Virtual Server Instances is where users set up and deploy each GPU-accelerated server. We call this a node.
Cluster Networks are the new top-level resource we created to support bi-directional networking with nodes grouped into a unified cluster. With cluster networks, or AI clusters, servers work cohesively with each other and with the user’s existing Cloud deployments.

*Service stack across Compute and Network infrastructure pillars*

Because Vela 2 serves as the infrastructure for an institution’s generative AI platform, it supports a spectrum of prospective users, each interacting with different services across the entire system. Focusing on their diverse use cases, I designed the UI to include a streamlined “one-click” deployment for quick setup, as well as advanced options for detailed customization as needed. Three primary pages guide clients through setup and management:
List View offers users a high-level overview of their clusters and shows how they fit in with other cloud deployments. We provide essential statuses at first glance, and a way to view more details as needed.
Provisioning is where users configure each node, its housing cluster, and any supporting deployments. I designed quick setup templates for clients who want a plug-and-play experience, while also providing options for those who want full control to dive into each setting and customize their deployments.
Details is a hub for users to view in-depth details of each resource after node and cluster deployments. This offers a centralized portal for clients to monitor and adjust each configuration as needed.

Designing each new page and integrating the Vela 2 experience into existing cloud resources and configuration pages required an in-depth understanding of the API calls behind each UI element. Success relied not only on understanding and communicating the backend architecture to non-technical stakeholders but also on effectively managing the frontend integration.

Testing the Boundaries with Real Customers

I conducted moderated usability testing with both internal users and external clients to redefine and refine the experience as it progressed. Internal sessions with the backend teams of related projects, like Kubernetes Clusters and IBM’s flagship AI service watsonx, helped optimize workflows for technical users and ensured consistency with other IBM products. Interviews with top energy sector clients provided valuable insights, shaping product expectations and market alignment. These sessions highlighted the need for a secure, flexible system capable of processing and analyzing petabytes of proprietary data without upfront hardware investment.

“[sessions] like these allow two companies to work together and give feedback, get visibility on how you think about products and you get some sense of how I think about consuming your products. I feel successful in this engagement and hope you feel the same on your side.” — Technical Program HPC Director, Energy Client

Making it Real and Available to the Public

Vela 2’s Virtual Server Instances and Cluster Networks have successfully completed their QA and beta periods, and are scheduled for a general availability release within the first quarter of 2025.

This project was both exciting and challenging, pushing me to evolve my design communication toolkit and expand my understanding of cloud architectures. I played a key role in guiding the team through a highly iterative design and development process within a fast-paced, high-stakes environment. Throughout my time on this project, I gained a deeper appreciation for the engineering and hardware constraints that shape every Cloud and AI experience. Check out a preview of the UI below, and contact me to learn more.

Team Feedback

‍“Releasing a new product within one year is incredible. I witnessed tremendous growth in both your domain expertise and technical skills. You expanded your cloud computing knowledge by building on your networking background and connecting it to compute concepts. Not only that, but you also helped partners grasp new concepts like east-west networking! As part of the […] initiative, we’re transitioning IBM Software to run on IBM Cloud. Before the beta release of Cluster Networking, watsonx operated on IBM Research’s Vela A100 system. Now, thanks to Eli’s efforts, watsonx runs on our own cloud infrastructure. Once GA’d, this will enable customers to train their own AI models. Bravo on supporting such a pivotal project and contributing to this incredible achievement!” — Director and Head of Design, IBM Cloud Platform