Model training is the most resource-intensive stage in the AI workflow. Training on large datasets and training large, deep models both demand substantial memory. At the same time, the increasing heterogeneity of hardware platforms creates new challenges at every level of the systems stack.

We explore two major open problems in this project: (1) the mismatch between compute and data-movement rates, which slows overall execution and demands new memory-aware data allocation and transfer techniques; and (2) the growing variety of memory devices, which raises data-persistence and capacity-planning requirements and demands new scalability improvements for efficient parallel model training. To ensure correctness and maximize performance on heterogeneous hardware when handling large volumes of training data, we need dynamic data management that accounts for device-specific performance characteristics. This project therefore leverages disaggregated servers with CXL memory to optimize data allocation and data transfer during AI model training. We design comprehensive end-to-end memory-storage support across the systems stack for AI frameworks, building on advanced libraries such as DeepSpeed and open-source disaggregated storage solutions such as OpenMPDK DSS, to make optimal use of the available hardware and software resources across GPUs and CPUs in an AI cluster.
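As one illustration of the kind of framework-level memory-storage support involved, the sketch below assembles a DeepSpeed-style ZeRO stage-3 configuration that offloads optimizer state and parameters from GPU memory to a CPU or NVMe-backed tier. The tier choice and the `nvme_path` value are illustrative assumptions for this sketch, not settings prescribed by the project.

```python
# Illustrative sketch: building a DeepSpeed ZeRO-3 config dict that
# offloads optimizer state and parameters off the GPU. The tier choice
# ("cpu" vs. "nvme") and the nvme_path below are hypothetical examples.

def make_zero3_config(offload_device: str = "cpu",
                      nvme_path: str = "/mnt/dss") -> dict:
    """Return a DeepSpeed-style config dict with ZeRO stage-3 offloading."""
    offload = {"device": offload_device}
    if offload_device == "nvme":
        # NVMe offload needs a path to a fast local or disaggregated store.
        offload["nvme_path"] = nvme_path
    return {
        "train_micro_batch_size_per_gpu": 4,
        "zero_optimization": {
            "stage": 3,  # partition params, grads, and optimizer state
            "offload_optimizer": dict(offload),
            "offload_param": dict(offload),
        },
        "bf16": {"enabled": True},
    }

cfg = make_zero3_config("nvme")
```

Such a dict would typically be serialized to JSON and passed to `deepspeed.initialize`; the point here is only that the offload target becomes a tunable placement decision, which is exactly where device-aware data management can intervene.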

This project (Mar. 2023 - Feb. 2024) is funded by Samsung MSL through NSF IUCRC for ASIC.
