As Vice President of Software Engineering, Doug is responsible for all aspects of the Cornelis Networks software stack, including the Omni-Path Architecture drivers, messaging software, and embedded device control systems. Before joining Cornelis Networks, Doug led software engineering teams at Red Hat in cloud storage and data services. Doug’s career in HPC and cloud computing began at Ames National Laboratory’s Scalable Computing Laboratory. Following several roles in university research computing, Doug joined the US Department of Energy’s Oak Ridge National Laboratory in 2009, where he developed and integrated new technologies at the world-class Oak Ridge Leadership Computing Facility.
Cornelis Networks is a technology leader delivering purpose-built high-performance fabrics for High Performance Computing (HPC), High Performance Data Analytics (HPDA), and Artificial Intelligence (AI) to leading commercial, scientific, academic, and government organizations.
What initially attracted you to computer science?
I just seemed to enjoy working with technology. I enjoyed working with computers growing up; we had a modem at our school that let me try out the Internet, and I found it fascinating. As a freshman in college, I met a USDOE computational scientist while volunteering for the National Science Bowl. He invited me to tour his HPC lab, and I was hooked. I’ve been a supercomputer geek ever since.
You worked at Red Hat from 2015 to 2019. What were some of the projects you worked on, and what were your key takeaways from this experience?
My main project at Red Hat was Ceph distributed storage. I’d previously focused entirely on HPC, and this gave me an opportunity to work on technologies that were critical to cloud infrastructure. The two fields rhyme: many of the principles of scalability, manageability, and reliability are extremely similar, even though they are aimed at solving slightly different problems. In terms of technology, my most important takeaway was that cloud and HPC have a lot to learn from one another. We’re increasingly building different projects with the same Lego set. It’s really helped me understand how the enabling technologies, including fabrics, can come to bear on HPC, cloud, and AI applications alike. It’s also where I really came to understand the value of Open Source and how to execute the Open Source, upstream-first software development philosophy that I brought with me to Cornelis Networks. Personally, Red Hat was where I really grew and matured as a leader.
You’re currently the Vice President of Software Engineering at Cornelis Networks. What are some of your responsibilities, and what does your average day look like?
As Vice President of Software Engineering, I am responsible for all aspects of the Cornelis Networks software stack, including the Omni-Path Architecture drivers, messaging software, fabric management, and embedded device control systems. Cornelis Networks is an exciting place to be, especially in this moment and this market. Because of that, I’m not sure I have an “average” day. Some days I’m working with my team to solve the latest technology challenge. Other days I’m interacting with our hardware architects to make sure our next-generation products will deliver for our customers. I’m often in the field meeting with our amazing community of customers and collaborators, making sure we understand and anticipate their needs.
Cornelis Networks offers next-generation networking for High Performance Computing and AI applications. Could you share some details on the hardware that is offered?
Our hardware consists of a high-performance switched network fabric. To that end, we provide all the necessary devices to fully integrate HPC, cloud, and AI fabrics. The Omni-Path Host-Fabric Interface (HFI) is a low-profile PCIe card for endpoint devices. We also produce a 48-port 1U “top-of-rack” switch. For larger deployments, we make two fully integrated “director-class” switches: one that packs 288 ports into 7U, and a 1,152-port, 20U device.
Can you discuss the software that manages this infrastructure and how it is designed to decrease latency?
First, our embedded management platform provides easy installation and configuration as well as access to a wide variety of performance and configuration metrics produced by our switch ASICs.
Our driver software is developed as part of the Linux kernel. In fact, we submit all our software patches to the Linux kernel community directly. That ensures that all of our customers enjoy maximum compatibility across Linux distributions and easy integration with other software such as Lustre. While not in the latency path, having an in-tree driver dramatically reduces installation complexity.
The Omni-Path fabric manager (FM) configures and routes an Omni-Path fabric. By optimizing traffic routes and recovering quickly from faults, the FM provides industry-leading performance and reliability on fabrics from tens to thousands of nodes.
Omni-Path Express (OPX) is our high-performance messaging software, released in November 2022. It was specifically designed to reduce latency compared to our earlier messaging software. We ran cycle-accurate simulations of our send and receive code paths to minimize instruction count and cache utilization. This produced dramatic results: when you’re in the microsecond regime, every cycle counts!
We also integrated with the OpenFabrics Interfaces (OFI), an open standard produced by the OpenFabrics Alliance. OFI’s modular architecture helps minimize latency by allowing higher-level software, such as MPI, to leverage fabric features without additional function calls.
The entire network is also designed to increase scalability. Could you share some details on how it is able to scale so well?
Scalability is at the core of Omni-Path’s design principles. At the lowest levels, we use Cray link-layer technology to correct link errors with no latency impact. This affects fabrics at all scales but is particularly important for large-scale fabrics, which naturally experience more link errors. Our fabric manager is focused both on programming optimal routing tables and on doing so in a rapid manner. This ensures that routing for even the largest fabrics can be completed in a minimum amount of time.
Scalability is also a critical component of OPX. Minimizing cache utilization improves scalability on individual nodes with large core counts. Minimizing latency also improves scalability by improving time to completion for collective algorithms. Using our host-fabric interface resources more efficiently enables each core to communicate with more remote peers. The strategic choice of libfabric allows us to leverage software features like scalable endpoints using standard interfaces.
Could you share some details on how AI is incorporated into some of the workflow at Cornelis Networks?
We’re not quite ready to talk externally about our internal uses of and plans for AI. That said, we do eat our own dog food, so we get to take advantage of the latency and scalability enhancements we’ve made to Omni-Path to support AI workloads. It makes us all the more excited to share those benefits with our customers and partners. We have certainly observed that, like in traditional HPC, scaling out infrastructure is the only path forward, but the challenge is that network performance is easily stifled by Ethernet and other traditional networks.
What are some changes that you foresee in the industry with the advent of generative AI?
First off, the use of generative AI will make people more productive; no technology in history has made human beings obsolete. Every technology evolution and revolution we’ve had, from the cotton gin to the automatic loom to the telephone, the Internet, and beyond, has made certain jobs more efficient, but we haven’t worked humanity out of existence.
Through the application of generative AI, I believe companies will advance technologically at a faster rate, because those running the company will have more free time to focus on those advancements. For instance, if generative AI provides more accurate forecasting, reporting, planning, and so on, companies can focus on innovation in their field of expertise.
I specifically feel that AI will make each of us a multidisciplinary expert. For example, as a scalable software expert, I understand the connections between HPC, big data, cloud, and AI applications that drive them toward solutions like Omni-Path. Equipped with a generative AI assistant, I can delve deeper into the meaning of the applications used by our customers. I have no doubt that this will help us design even more effective hardware and software for the markets and customers we serve.
I also foresee an overall improvement in software quality. AI can effectively function as “another set of eyes” to statically analyze code and develop insights into bugs and performance problems. This will be particularly interesting at large scales where performance issues can be particularly difficult to spot and expensive to reproduce.
Finally, I hope and believe that generative AI will help our industry to train and onboard more software professionals without previous experience in AI and HPC. Our field can seem daunting to many and it can take time to learn to “think in parallel.” Fundamentally, just like machines made it easier to manufacture things, generative AI will make it easier to consider and reason about concepts.
Is there anything else that you would like to share about your work or Cornelis Networks in general?
I’d like to encourage anyone with the interest to pursue a career in computing, especially in HPC and AI. In this field, we’re equipped with the most powerful computing resources ever built, and we bring them to bear against humanity’s greatest challenges. It’s an exciting place to be, and I’ve enjoyed it every step of the way. Generative AI is taking our field to even greater heights as the demand for computing capability increases drastically. I can’t wait to see where we go next.
Thank you for the great interview. Readers who wish to learn more should visit Cornelis Networks.