High Performance Networking for Distributed DL Training in Production K8s - Nivedita Viswanath
Production Multi-node Jobs with Gang Scheduling, K8s, GPUs... Madhukar Korupolu & Sanjay Chatterjee
NSDI '19 - Tiresias: A GPU Cluster Manager for Distributed Deep Learning
NCCL and Libfabric: High-Performance Networking for Machine Learning
Large Scale Distributed Deep Learning on Kubernetes Clusters - Yuan Tang, Ant Financial & Yong Tang
Building a GPU cluster for AI
Kubernetes and High Performance Computing
What Do You Mean K8s Doesn't Have Users? How Do I Manage User Access Then? - Jussi Nummelin
PipelineAI High Performance TensorFlow + GPU + Kubernetes + Jupyter Workshop
AI Inference: The Secret to AI's Superpowers
Efficient Data Parallel Distributed Training with Flyte, Spark & Horovod
Building Scalable End-To-End Deep Learning Pipelines In The Cloud
Tensorflow XLA JIT AOT Compiling Nvidia GPU Half Precision FP16, INT8 TensorRT Spark JSON Parsing
Using Kubernetes to Offer Scalable Deep Learning on Alibaba Cloud - Kai Zhang & Yang Che, Alibaba
Distributed deep learning and why you may not need it - Jakub Sanojca, Mikuláš Zelinka
Co-Location of CPU and GPU Workloads with High Resource Efficiency - Penghao Cen & Jian He
IoT Application Running on KubeEdge + ARM Platform - Xuan Jia & Bin Lu
Scaling Language Training to Trillion-parameter Models on a GPU Cluster
OSDI '20 - A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous...
Hybrid Heterogeneous auto-scalable FPGA deployment on AWS
GPUs: Explained
Lightning Talk: Managing Drivers in a Kubernetes Cluster - Renaud Gaubert, NVIDIA
Using MPI Operator for GPU-Accelerated Workloads with Lustre FS - David Gray, Red Hat | NVIDIA GTC OSCG
Canonical - Using Kubernetes and OpenStack for CPU and GPU Intensive Workloads
Collective-on-Ray: High-performance Collective Communication for Distributed Machine Learning on Ray