AZ-10GE Applications Acceleration Zone: Analysis of TCP Throughput Collapse in Ordinary Ethernet-based Clustered Storage Systems

Situation
Client access to data on a storage cluster or iSCSI-based storage system over ordinary Ethernet can be severely impaired, leaving the client with much lower read bandwidth than the configured network links should provide.

The following example depicts a client-initiated synchronized read operation across a simple clustered storage system.

Incast Problem Definition
Incast is a catastrophic TCP throughput collapse that occurs when the number of storage servers sending data to a client grows beyond the ability of an Ethernet switch to buffer a sufficient number of packets.

Anatomy of the Incast Problem
The Incast problem arises from a subtle interaction between depleted Ethernet buffers, cluster-centric communication patterns, and inadequate TCP loss-recovery mechanisms. A synchronized read operation of striped data from storage servers floods the switch buffers, leading to packet loss and TCP timeouts. Because striping also couples the behavior of multiple storage servers, one stalled server holds up the entire read, and overall system latency can rise to hundreds of milliseconds or more, which is an order of magnitude or more above typical data fetch times.
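The mechanism described above can be illustrated with a toy model. This is only a rough sketch, not the CMU team's simulation: the function name and every default value (block size, buffer size, per-server window, and a 200 ms retransmission timeout) are assumptions chosen for illustration.

```python
def synchronized_read_goodput(num_servers, block_bytes=1_000_000, link_bps=1e9,
                              buffer_bytes=64_000, window_bytes=16_000, rto_s=0.2):
    """Toy estimate of client goodput (bits/s) for one striped block read.

    All defaults are hypothetical values, not measurements from any switch.
    """
    # Ideal time to move the whole block over the client's link.
    ideal_s = block_bytes * 8 / link_bps
    # Worst-case burst: every server sends one full window at the same moment.
    burst_bytes = num_servers * window_bytes
    # Assumption: if the burst exceeds the output-port buffer, at least one
    # server loses its data and must wait out one retransmission timeout.
    overflow = burst_bytes > buffer_bytes
    total_s = ideal_s + (rto_s if overflow else 0.0)
    return block_bytes * 8 / total_s, overflow


if __name__ == "__main__":
    for n in (2, 4, 5, 8):
        goodput, overflow = synchronized_read_goodput(n)
        print(f"{n} servers: {goodput / 1e6:.1f} Mbit/s, overflow={overflow}")
```

In this model a single timeout dominates the transfer time, which is why goodput drops sharply as soon as the combined burst no longer fits in the buffer.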

The following graph illustrates TCP throughput collapse during synchronized read for a simple clustered storage system.

For details on the Incast problem, simulation, and real-world test results, please refer to the USENIX Association paper titled "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems" by Amar Phanishayee et al. from Carnegie Mellon University (CMU). A copy of this paper can be found here.

Best-of-Breed Ordinary Ethernet Switch Behavior
Does the Incast problem occur in real-world storage clusters with best-of-breed Ethernet switches?

The CMU team analyzed the issue for the following three best-of-breed 1GE and 10GE switches:
1) HP ProCurve 2848
- 44x 1GE ports with 4x 1GE SFP ports, List Price: $3,299
2) Force10 S50
- 48x 1GE ports and up to 4x 10GE ports, List Price: $16,500
3) Force10 E1200
- Up to 1260x 1GE ports or 224x 10GE ports, List Price: >$500,000

The following graph depicts TCP throughput collapse in each scenario.

Observations
(1) Incast is a generic problem with ordinary Ethernet switches
(2) QoS implementation and memory allocation policies for buffer management are vendor specific
(3) QoS is typically implemented by partitioning output queues for each class of service. Disabling QoS increases the effective size of the output queues and can affect the onset of Incast
(4) Switch buffer sizes play an important role in mitigating Incast
(5) HP ProCurve 2848 uses small buffers. Incast-induced throughput collapse occurs around seven servers
(6) Force10 S50 allocates a relatively large amount of buffer space and switch resources to support QoS. With QoS disabled, Incast-induced throughput collapse occurs around 35 servers
(7) On Force10 E1200, Incast-induced throughput collapse occurs around 87 servers
(8) Status-quo Ethernet mechanisms are inadequate for handling mission-critical storage traffic in data centers. A fundamentally different approach is necessary.
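
The observation that disabling QoS can delay the onset of Incast follows from simple arithmetic. As a rough sketch, assume a port's buffer is split evenly across QoS classes; the function name and all numbers below are hypothetical, and real switches use vendor-specific allocation policies.

```python
def servers_before_overflow(port_buffer_bytes, qos_classes, window_bytes):
    """Estimate how many synchronized senders fit in one output queue.

    Assumes the port buffer is divided evenly among QoS classes and that
    each server sends one full window at once. Both are simplifications.
    """
    # With QoS disabled (one class), a single queue gets the whole buffer.
    per_queue_bytes = port_buffer_bytes // max(qos_classes, 1)
    return per_queue_bytes // window_bytes


if __name__ == "__main__":
    # Hypothetical 512 KB port buffer and 16 KB per-server window.
    print(servers_before_overflow(512_000, qos_classes=8, window_bytes=16_000))
    print(servers_before_overflow(512_000, qos_classes=1, window_bytes=16_000))
```

Under these assumptions, collapsing eight queues into one raises the number of servers that can send simultaneously without overflow by the same factor of eight.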

Friday, May 2, 2008
