Rice lab develops circuit-switching tech to help data centers recover from failures
HOUSTON – (Aug. 16, 2018) – Anyone who has ever cursed a computer network as it slowed to a crawl will appreciate the remedy offered by scientists at Rice University.
Rice computer scientist Eugene Ng and his team say their solution will keep data on the fast track when failures inevitably happen.
Ng introduced ShareBackup, a strategy that would allow shared backup switches in data centers to take on network traffic within a fraction of a second after a software or hardware switch failure.
He will present a peer-reviewed paper on the work this week at the SIGCOMM 2018 conference in Budapest, Hungary. The paper is available for download.
Ng said the idea would solve a common annoyance among data professionals, scientists and everyone who relies on a network to deliver results day in and day out.
“A data network consists of servers and network switches,” said Ng, a professor of computer science and electrical and computer engineering. “Switches move data packets to where they need to go. But things fail, especially in large-scale data centers with thousands of pieces of hardware.”
The usual response to a failed switch is to shunt the flow of data to another line. “Generally, the network has multiple paths for connecting servers so, just like if there’s a closure on the highway, we’d drive around it. This is a conventional, natural approach that makes a lot of sense: You reroute around the failure to get where you need to go.”
But sometimes that other road is congested and everything slows down. “Data centers aren’t the internet; they’re not about people surfing websites,” Ng said. “They’re about supporting data-intensive applications like data mining or machine learning. And a lot of these applications have stringent performance deadlines, so blindly rerouting traffic could be the wrong thing to do in a data center.”
Rather than the expensive option of installing redundant switches throughout a network, the Ng lab’s strategy would put fast switches and software in strategic locations that could pick up the traffic from a failed switch in less than a millisecond. Once the failed switch is repaired, the team’s software frees the backup switch to handle another failure.
The switchover is fast enough that most users would never know that part of the system had failed: the failure-recovery time is 0.73 milliseconds, including latency from hardware and control systems.
“The reality is that the fraction of devices that fail at any given time is very small, and most of these failures can be addressed by things like rebooting the device,” Ng said. “Sometimes the software gets screwed up and a simple power cycle will bring it back. These failures may also not last long.
“These are the characteristics we’re trying to exploit,” he said. “Because of that, we can get away with having very few devices back up a large number of devices.”
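The sharing idea Ng describes, a few backups standing in for many switches and returning to the pool after each repair, can be illustrated with a short sketch. This is only a toy model under stated assumptions, not the ShareBackup software: the `BackupPool` class, its `fail` and `repair` methods and the switch names are all hypothetical, and the real system's control logic and hardware are not described in this article.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BackupPool:
    """Hypothetical model: a small pool of backup switches shared by many switches."""

    free: list[str]                                          # backups not in use
    in_use: dict[str, str] = field(default_factory=dict)    # failed switch -> backup

    def fail(self, switch: str) -> str | None:
        """Assign a free backup to a failed switch, or None if the pool is empty."""
        if switch in self.in_use:
            return self.in_use[switch]   # already covered by a backup
        if not self.free:
            return None                  # more simultaneous failures than backups
        backup = self.free.pop()
        self.in_use[switch] = backup
        return backup

    def repair(self, switch: str) -> None:
        """Release the backup once the failed switch is back in service."""
        backup = self.in_use.pop(switch, None)
        if backup is not None:
            self.free.append(backup)


# One backup shared by a whole group of switches (names are made up).
pool = BackupPool(free=["backup-1"])
print(pool.fail("switch-3"))   # backup-1 takes over for switch-3
print(pool.fail("switch-7"))   # None: the only backup is busy
pool.repair("switch-3")        # switch-3 rebooted, backup returns to the pool
print(pool.fail("switch-7"))   # backup-1 is available again
```

The sketch shows why a small pool can suffice when failures are rare and short-lived: a backup is tied up only for the duration of one repair, so simultaneous demand on the pool stays low.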
Ng said ShareBackup could save data centers time and money not only by maintaining full bandwidth but also by helping to analyze problems, including the misconfigurations that commonly lead to network failure.
“Part of our work is to help data centers figure out what went wrong in the network,” he said. “Once the backup is activated, you can take the failed device out of the production network and test it to identify which component caused the problem.
“Now, if we take two devices out and can’t figure out which went bad, both need to be replaced,” he said. “It’s very likely only one of the devices is having the problem. Our software can diagnose these devices in a semiautomatic manner, and if one of the parts is good, it can be reinstated.”
Lead authors of the paper are Rice graduate student Dingming Wu and alumnus Yiting Xia, now a computer scientist at Facebook. Co-authors are Rice graduate students Xiaoye Steven Sun, Xin Sunny Huang and Simbarashe Dzinamarira.
The National Science Foundation supported the research.