PhD student Matthew Dearing received Best Paper Award

Matthew Dearing

Matthew Dearing received the 38th Association for Computing Machinery (ACM) SIGSIM-PADS PhD Colloquium Best Paper Award for his paper titled “Deep Learning Surrogate Models for Network Simulation.”

The work focuses on understanding how to simulate applications running on high-performance computing (HPC) systems to help optimize how jobs are scheduled, maximize workflow, and even inform the design of next-generation HPC networks. HPC systems are complex, and they continue to get bigger and faster. One way to measure computer performance is in floating point operations per second, or FLOPS. A modern personal computer operates in the gigaflop range, or billions of operations per second. In contrast, the Aurora supercomputer at Argonne National Laboratory is an exascale computer that can perform a quintillion operations per second.

Multiple workflows or tasks are run simultaneously on supercomputers, using what is known as parallel computing. Large problems can be divided into smaller problems, which can be solved at the same time. Prioritizing the order in which tasks are completed is key to maximizing computing power.

Learning how best to prioritize such tasks and optimize the performance of real-world HPC systems like Aurora is difficult to do simply by observing these workflows as they run in real time. By creating simulations of these workflows, researchers can experimentally test various ideas for what might improve the system. However, HPC systems generate enormous amounts of data, so simulating them is expensive and time consuming: it can take days to simulate just a few seconds of a real-world application, let alone multiple applications running at one time.

Dearing employed a deep learning surrogate approach, which incorporates machine learning techniques to predict HPC application runtimes during stable periods of network traffic.

“By conducting a time series analysis using deep machine learning and AI techniques, we can try to predict what’s happening during part of the HPC application simulation,” Dearing said. “Then, when we need more detailed data, we can switch back to the long, slow high-fidelity simulation.”
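The switching idea Dearing describes can be illustrated with a minimal sketch. It is not the paper's method: the stability test, the toy traffic pattern, the mean-based "surrogate," and every function and parameter name below are assumptions made purely for illustration. The loop uses a fast predictor while recent traffic looks stable and falls back to the slow, detailed simulation otherwise, with a periodic forced check against the detailed simulation.

```python
from statistics import mean, pstdev


def is_stable(window, tolerance=0.05):
    """Assumed criterion: traffic is 'stable' when its relative spread is small."""
    avg = mean(window)
    return avg > 0 and pstdev(window) / avg <= tolerance


def high_fidelity_step(t):
    """Hypothetical stand-in for one slow, detailed simulation step.

    Toy traffic pattern: a burst of 140.0 between steps 10 and 19, else 100.0.
    """
    return 140.0 if 10 <= t < 20 else 100.0


def surrogate_step(window):
    """Hypothetical stand-in for a trained model: the mean of recent values."""
    return mean(window)


def run_hybrid(total_steps, window_size=5, max_surrogate_steps=4):
    """Alternate between the surrogate and the high-fidelity simulation."""
    history, modes = [], []
    consecutive = 0  # surrogate steps taken since the last high-fidelity step
    for t in range(total_steps):
        window = history[-window_size:]
        can_predict = (
            len(window) == window_size
            and is_stable(window)
            and consecutive < max_surrogate_steps
        )
        if can_predict:
            value, mode = surrogate_step(window), "surrogate"
            consecutive += 1
        else:
            # Not enough history, unstable traffic, or a scheduled re-check.
            value, mode = high_fidelity_step(t), "high_fidelity"
            consecutive = 0
        history.append(value)
        modes.append(mode)
    return history, modes


if __name__ == "__main__":
    history, modes = run_hybrid(30)
    for t, (value, mode) in enumerate(zip(history, modes)):
        print(t, mode, value)
```

In this toy run the surrogate handles the quiet stretches, and a scheduled check at step 14 notices the burst and hands control back to the detailed simulation. The surrogate also mispredicts steps 10 through 13 before that check happens, which is the kind of accuracy-for-speed trade-off the switching point has to balance.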

While a deep learning surrogate simulation may be somewhat less accurate than a high-fidelity simulation, it runs far faster, and multiple simulations can be run to improve the model’s accuracy.

“We were researching how accurately we could get these surrogate models to behave, how much better can we do,” Dearing said. “Early results suggest that it’s doing pretty well; it’s predicting what the high-fidelity simulation is doing better than anything we’ve seen before.”

Not only could these simulations optimize how jobs are placed on an HPC system, but they could also influence the architecture of the next generation of supercomputers.

“The design part is very expensive, so as much as you can learn ahead of time the better,” Dearing said.

Additional research questions Dearing hopes to explore include homing in on the point at which it becomes necessary to switch from the predictive model to the high-fidelity simulation, and whether a specific surrogate model could be used in multiple simulations.

Dearing is a second-year doctoral student in Professor Zhiling Lan’s Systems for Performance, Energy, and Resiliency (SPEARS) lab.