Friday, May 2, 2008

Analysis of TCP Throughput Collapse in Ordinary Ethernet-based Clustered Storage Systems

Situation
Client access to data from a storage cluster or iSCSI-based storage system over ordinary Ethernet can be severely impaired, leaving the client with much lower read bandwidth than the configured network links should provide.

The following example depicts a client-initiated synchronized read operation across a simple clustered storage system.


Incast Problem Definition
Incast is a catastrophic TCP throughput collapse that occurs when the number of storage servers sending data to a client increases past the ability of an Ethernet switch to buffer the resulting packets.

Anatomy of the Incast Problem
The Incast problem arises from a subtle interaction between exhausted Ethernet switch buffers, cluster-centric communication patterns, and inadequate TCP loss-recovery mechanisms. A synchronized read of striped data from storage servers floods the switch buffers, leading to packet loss and TCP timeouts. Because striping also couples the behavior of multiple storage servers, overall system latency can rise to hundreds of milliseconds or more, which is at least an order of magnitude greater than typical data fetch times.
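The mechanism above can be sketched with a toy estimate. This is not the CMU simulation; it is a minimal illustration under stated assumptions. The function name, the 8-packet window, the 1 MB block, and the 200 ms retransmission timeout are all hypothetical values chosen for illustration, and the model assumes that any buffer overflow costs the client exactly one full timeout, because a striped read completes only when the slowest server finishes.

```python
def estimate_goodput_mbps(num_servers, buffer_pkts, window_pkts=8,
                          block_bytes=1_000_000, link_mbps=1000.0,
                          rto_ms=200.0):
    """Toy estimate of client goodput for one synchronized block read.

    All parameters are illustrative assumptions, not measured values.
    """
    # Time to move the whole block over the client's link with no loss.
    ideal_ms = block_bytes * 8 / (link_mbps * 1000.0)

    # Every server answers at once, so the first burst that reaches the
    # switch output port is num_servers windows of packets.
    burst_pkts = num_servers * window_pkts

    # Assumption: if the burst overflows the buffer, at least one server
    # loses packets it can only recover by timing out, and the client
    # stalls for one full retransmission timeout.
    stall_ms = rto_ms if burst_pkts > buffer_pkts else 0.0

    return block_bytes * 8 / ((ideal_ms + stall_ms) * 1000.0)


if __name__ == "__main__":
    for n in (2, 4, 8, 16, 32):
        print(n, round(estimate_goodput_mbps(n, buffer_pkts=64), 1))
```

With a 64-packet buffer, the estimate stays at line rate up to 8 servers and then drops sharply, because a single 200 ms stall dwarfs the 8 ms the transfer would otherwise take. Real switches and TCP stacks behave far less cleanly, so the sketch only conveys the shape of the collapse.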

The following graph illustrates TCP throughput collapse during synchronized read for a simple clustered storage system.


For details on the Incast problem, simulation, and real-world test results, please refer to the USENIX Association paper titled "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems" by Amar Phanishayee et al. from Carnegie Mellon University (CMU). A copy of this paper can be found here.

Best-of-Breed Ordinary Ethernet Switch Behavior
Does the Incast problem occur in real-world storage clusters with best-of-breed Ethernet switches?

The CMU team analyzed the issue for the following three best-of-breed 1GE and 10GE switches:
1) HP ProCurve 2848
- 44x 1GE ports with 4x 1GE SFP ports, List Price: $3,299
2) Force10 S50
- 48x 1GE ports and up to 4x 10GE ports, List Price: $16,500
3) Force10 E1200
- Up to 1260x 1GE ports or 224x 10GE ports, List Price: >$500,000

The following graph depicts TCP throughput collapse for each switch.


Observations
(1) Incast is a generic problem with ordinary Ethernet switches
(2) QoS implementation and memory allocation policies for buffer management are vendor-specific
(3) QoS is typically implemented by partitioning output queues for each class of service. Disabling QoS increases the effective size of output queues and can delay the onset of Incast
(4) Switch buffer sizes play an important role in mitigating Incast
(5) HP ProCurve 2848 uses small buffers. Incast-induced throughput collapse occurs around seven servers
(6) Force10 S50 allocates a relatively large amount of buffer space and switch resources to support QoS. With QoS disabled, Incast-induced throughput collapse occurs around 35 servers
(7) On Force10 E1200, Incast-induced throughput collapse occurs around 87 servers
(8) Status-quo Ethernet mechanisms are inadequate for handling mission-critical storage traffic in data centers. A fundamentally different approach is necessary.
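
The observation about QoS partitioning can be made concrete with some simple arithmetic. The sketch below assumes a static, equal split of a port's buffer across classes of service, which is a simplification: as noted above, actual allocation policies are vendor-specific. The function names, buffer size, class count, and window size are hypothetical and do not describe any of the switches tested.

```python
def effective_queue_pkts(port_buffer_pkts, qos_classes, qos_enabled=True):
    """Packets available to one traffic class on an output port.

    Assumes the buffer is split equally and statically across classes
    when QoS is on (an illustrative simplification).
    """
    if qos_enabled:
        return port_buffer_pkts // qos_classes
    return port_buffer_pkts


def max_servers_before_overflow(queue_pkts, window_pkts):
    """Largest number of servers whose simultaneous windows still fit."""
    return queue_pkts // window_pkts


if __name__ == "__main__":
    # Hypothetical port: 256-packet buffer, 8 classes, 8-packet windows.
    for enabled in (True, False):
        queue = effective_queue_pkts(256, 8, qos_enabled=enabled)
        print(enabled, queue, max_servers_before_overflow(queue, 8))
```

In this hypothetical configuration, turning QoS off raises the queue available to storage traffic from 32 to 256 packets, so eight times as many servers can respond at once before the first burst overflows. That is the same direction of effect reported for the Force10 S50, though the real numbers depend on each vendor's buffer management.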

Thursday, May 1, 2008

AZ-10GE – Not Just Another Acronym – A Tectonic Shift!

Teak Technologies has pioneered a new category of scalable and standards-compliant switching solutions that deliver breakthrough price-performance and transform data center networks into an Applications Acceleration Zone (AAZ). This discontinuous innovation alters the Ethernet switching landscape at a fundamental level.

Applications Acceleration Zone?
An Applications Acceleration Zone is an isolated data center network environment that tightly controls the artifacts that reduce the performance of distributed mission-critical applications.

An Applications Acceleration Zone is to a data center as a "clean room" is to a semiconductor facility - or, for the uninitiated, an "operating room" to a hospital.

Think of "germ-free." Think of "isolation." Think of "environmental pollutants." Think of "vital signs." Think of "life saving."

Clean rooms and operating rooms are isolated environments with tightly controlled levels of contamination from pollutants. Isolation is just as critical when processing semiconductor wafers as it is when saving lives.

Collating the Concepts
AAZ enables applications to maintain their vital performance signs in distributed and virtualized data center environments.

LANs based on AZ-10GE allow performance-impacting traffic to cut through all networking artifacts, including congestion. Innovative IT managers substitute AZ-10GE for ordinary 10GE in all mission-critical applications with stringent requirements for performance, reliability, and predictability. AZ-10GE LANs have 4x fewer links, each utilized to capacity. They consume up to 4x less power, are simpler to manage, and reduce time-to-profitability.

Applications Acceleration Zone Attributes
- 10Gbps overlay network (AZ-10GE)
- Isolated environment delivers predictable application performance
- It is Ethernet - just simply a whole lot better
- All applications run unmodified
- Complementary approach. Requires no forklift upgrades
- Works with legacy 1GE and 10GE equipment
- Leverages portals or gateway entry points for legacy 1GE and ordinary 10GE applications

How Large is Large?
AZ-10GE switching solutions can be deployed everywhere in the data center: within a blade server chassis, at the rack level to aggregate rack server and storage traffic, and across the entire access layer, where they scale linearly. Innovative portal and gateway appliances also enable applications with ordinary 1GE and 10GE connectivity to participate in the Applications Acceleration Zone without requiring any forklift upgrades.

Acceleration Zone Scales Linearly Across the Data Center