Greg Papadopoulos, CTO of Sun Microsystems, gave the keynote presentation at the CNS research review this morning. This was personally gratifying, as Greg was an inspiration to me (and many other graduate students) during our work on the Network of Workstations (NOW) project.
Greg challenged several pieces of commonly held wisdom in large-scale computer systems design, namely:
- Ethernet is good enough
- SMPs are too expensive
- Failures are frequent
As systems designers, we have become accustomed to embracing simplicity and weak semantics in the underlying system architecture, though often at the cost of increased complexity in the higher-level applications that are left to deal with the failures. His challenge to the audience was to consider the inherent costs of building more reliable, higher-performance systems. In many cases, the savings to application developers and the improvements in end-to-end performance will be worth those costs.
The talk went through a number of interesting examples. With Ethernet, we treat congestion spreading through the network (no isolation), packet drops, and similar problems as things that application developers have to account for. In building large-scale systems, we assume that all parts should be commodity and quite failure-prone. The networks we employ to connect data-center-scale systems top out at a relatively modest maximum bisection bandwidth, leaving many applications starved for network bandwidth.
Put another way, Greg advocates re-examining how we apply the end-to-end argument, a driving force behind network design. The end-to-end argument states that reliability, fault tolerance, and similar properties ultimately belong in the application, because that is the only place where they can be completely implemented. One of my takeaways from the talk is that we may have become too lazy in applying the end-to-end principle. The original paper clearly states that there are exceptions to the principle, especially where additional performance gains are possible or, by extension in the current environment, where cost or energy consumption can be reduced.
Are there opportunities to invest more engineering in the system infrastructure to reduce application complexity and overall cost?