High Performance Networking for Distributed DL Training in Production K8s - Nivedita Viswanath
Production Multi-node Jobs with Gang Scheduling, K8s, GPUs... Madhukar Korupolu & Sanjay Chatterjee
Building a GPU cluster for AI
NCCL and Libfabric: High-Performance Networking for Machine Learning
NSDI '19 - Tiresias: A GPU Cluster Manager for Distributed Deep Learning
PipelineAI High Performance TensorFlow + GPU + Kubernetes + Jupyter Workshop
Large Scale Distributed Deep Learning on Kubernetes Clusters - Yuan Tang, Ant Financial & Yong Tang
Kubernetes and High Performance Computing
What Do You Mean K8s Doesn't Have Users? How Do I Manage User Access Then? - Jussi Nummelin
AI Inference: The Secret to AI's Superpowers
Efficient Data Parallel Distributed Training with Flyte, Spark & Horovod
Scaling Language Training to Trillion-parameter Models on a GPU Cluster
Distributed deep learning and why you may not need it - Jakub Sanojca, Mikuláš Zelinka
GPUs: Explained
Using Kubernetes to Offer Scalable Deep Learning on Alibaba Cloud - Kai Zhang & Yang Che, Alibaba
OSDI '20 - A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous...
Co-Location of CPU and GPU Workloads with High Resource Efficiency - Penghao Cen & Jian He
Canonical - Using Kubernetes and OpenStack for CPU and GPU Intensive Workloads
Scaling AI Inference Workloads with GPUs and Kubernetes - Renaud Gaubert & Ryan Olson, NVIDIA
Tensorflow XLA JIT AOT Compiling Nvidia GPU Half Precision FP16, INT8 TensorRT Spark JSON Parsing