Archive for the 'academia' Category

Gray Sort: The Most Fun I’ve Ever Had with (a few racks of) Computers

Things have been quiet on the blog, but not for lack of things to say.  In fact, so much has been happening that I have not had the idle cycles to write about it.  However, I do want to highlight some of the interesting things that have taken place over the past few months.

There has been significant recent interest in large-scale data processing.  Many would snicker that this is far from a new problem; indeed, the database community has been pioneering in this space for decades.  However, I believe there has been an uptick in commercial interest in this space, for example to index and analyze the wealth of information available on the Internet or to process the multiple billions of requests per day made to popular web services.  MapReduce and open source tools like Hadoop have significantly intensified the debate over the right way to perform large-scale data processing (see my earlier post on this topic).

Observing this recent trend along with my group’s recent focus on data center networking (along with a healthy dose of naivete) led us to go after the world record in data sorting.  The team recently set the records for both Gray Sort (fastest time to sort 100 TB of data) and Minute Sort (most data sorted in one minute) in the “Indy” category. See the sort benchmark page for details. This has been one of the most gratifying projects I have ever been involved with.  The work was of course really interesting but the best part was seeing the team (Alex Rasmussen, Radhika Niranjan Mysore, Harsha V. Madhyastha, Alexander Pucher, Michael Conley, and George Porter) go after a really challenging problem. While some members of the team would disagree, it was also at least interesting to set the records with just minutes to spare before the 2010 deadline.

Our focus in this work was not so much to set the record (though we are happy to have done so) but to achieve high levels of efficiency while operating at scale.  Recently, setting the sort record has largely been a test of how much computing capacity an organization could throw at the problem, often sacrificing per-server efficiency.  For example, Yahoo’s record for Gray Sort used an impressive 3452 servers to sort 100 TB of data in less than 3 hours.  However, per-server throughput worked out to less than 3 MB/s, a factor of 30 less bandwidth than is available from even a single disk.  Large-scale data sorting involves carefully balancing all per-server resources (CPU, memory capacity, disk capacity, disk I/O, and network I/O), all while maintaining overall system scale.  We wanted to determine the limits of a scalable and efficient data processing system.  Given current commodity server capacity, is it feasible to run at 30 MB/s or 300 MB/s per server?  That is, could we reduce the required number of machines for sorting 100 TB of data by a factor of 10 or even 100?

The interesting thing about large-scale data sorting is that it exercises all aspects of the computer system.

  • CPU is required to perform the O(n log n) operation to sort the data.  While not the most compute-intensive application, CPU requirements nonetheless cannot be ignored.
  • Disk Bandwidth: earlier work proves that external memory sort (the case where the data set size is larger than aggregate physical memory) requires at least two reads of the data and two writes of the data.  One of the banes of system efficiency is the orders of magnitude difference in I/O performance for sequential versus random disk I/O.  A key requirement for high-performance sort is ensuring that disks are performing sequential I/O (either read or write) near continuously.
  • Disk capacity: Sorting 100 TB of data requires at least 200 TB of storage, 300 TB if the input data cannot be erased.  While not an enormous amount of data by modern standards, simply storing this amount of data amounts to an interesting systems challenge.
  • Memory capacity: certainly in our architecture, and perhaps fundamentally, ensuring streaming I/O while simultaneously limiting the number of disk operations to 2 reads and 2 writes per tuple requires a substantial amount of memory and careful memory management to buffer data in preparation for large, contiguous writes to disk.
  • Network bandwidth: in a parallel sort system, data must be shuffled in an all-to-all manner across all servers.  Saturating available per-server CPU and storage capacity requires significant network bandwidth, approaching 10 Gb/s of sustained network throughput per server in our configuration.
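
The balance among the resources above can be sketched with a back-of-the-envelope calculation.  The sketch below uses illustrative numbers (16 disks at 100 MB/s sequential, a 10 Gb/s NIC) rather than an exact model of any particular hardware:

```python
def per_server_sort_rate(num_disks, disk_mb_s, nic_gb_s):
    """Rough upper bound on sustained sort throughput (MB/s) per server.

    In a two-pass external sort, every byte is read twice and written
    twice, so in each phase half the disks read while the other half
    write, and the effective sort rate is half the per-phase read rate.
    """
    phase_rate = (num_disks // 2) * disk_mb_s  # MB/s read per phase
    disk_limit = phase_rate / 2                # two passes over the data
    # The all-to-all shuffle moves roughly every byte over the network
    # once, overlapped with the first phase.
    net_limit = nic_gb_s * 1000 / 8            # Gb/s -> MB/s
    return min(disk_limit, net_limit)

print(per_server_sort_rate(16, 100, 10))  # -> 400.0, disk-limited
```

With 16 disks the storage subsystem, not the 10 Gb/s NIC, sets the ceiling; drop to a 1 Gb/s NIC and the network immediately becomes the bottleneck instead.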

Managing the interaction of these disparate resources, along with parallelism both within a single server and across a cluster of machines, was far more challenging than we anticipated.  Our goal was to use commodity servers to break the sort record while focusing on high efficiency.  We constructed a cluster with dual-socket, four-core Intel processors, initially 12GB RAM (later upgraded to 24GB RAM once we realized we could not maintain sequential I/O with just 12GB RAM/server), 2x10GE NIC (only one port active for the experiment), and 16 500GB drives.  The number of hard drives per server was key to delivering high levels of performance.  Each of our drives could sustain approximately 100 MB/s of sequential read or write throughput.  We knew that, in the optimal case (see this paper), we would read and write the data twice in two discrete phases separated by a barrier.  So, if we managed everything perfectly, in the first phase, we would read data from 8 drives at an aggregate rate of 800 MB/s (8*100 MB/s) while simultaneously writing it out to the remaining 8 disks at an identical rate.  In the second phase, we would similarly read the data at 800 MB/s while writing the fully-sorted data out at 800 MB/s.  Once again, in the best case, we would average 400 MB/s of sorting per server.
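
The two-phase structure described above can be illustrated with a toy in-memory analogue.  This is a didactic sketch of the general read-twice/write-twice pattern, not our actual implementation (which manages buffers, disks, and the network shuffle explicitly):

```python
import heapq
import os
import tempfile

def external_sort(records, run_size=4):
    """Toy two-phase external sort over a list of strings.

    Phase 1 streams the input into sorted runs on disk; phase 2 merges
    all runs in a single sequential pass.  A real system sizes runs to
    fit memory and overlaps sorting with sequential I/O.
    """
    # Phase 1: sort fixed-size runs and write each out sequentially.
    run_paths = []
    for i in range(0, len(records), run_size):
        run = sorted(records[i:i + run_size])
        with tempfile.NamedTemporaryFile("w", delete=False) as f:
            f.writelines(r + "\n" for r in run)
            run_paths.append(f.name)
    # Phase 2: single k-way streaming merge of all runs.
    files = [open(p) for p in run_paths]
    merged = [line.rstrip("\n") for line in heapq.merge(*files)]
    for f in files:
        f.close()
    for p in run_paths:
        os.remove(p)
    return merged
```

Each record is read twice and written twice (once per phase), matching the lower bound for external sorting when the data does not fit in memory.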

Interestingly, the continuing chasm between CPU performance and disk I/O (even in the streaming case) means that building a “balanced” data-intensive processing cluster requires a large number of drives per server to maintain overall system balance. While 16 disks per server seems large, one conclusion of our work is that servers dedicated to large-scale data processing should likely have even more disks.  At the same time, significant work needs to be done in the operating system and disk controllers to harness the I/O bandwidth available from such large disk arrays in a scalable fashion.

Our initial goal was to break the record with just 30 servers.  This would correspond to 720 GB/min assuming 400 MB/s/server, allowing us to sort 100 TB of data in ~138 minutes. We did not quite get there (yet); our record-setting runs were on a 48-server configuration. For our “certified” record-setting run, we ran at 582 GB/min on 48 servers, or 200 MB/s/server.  This corresponds to 50% of the maximum efficiency/capacity of our underlying hardware.  Since the certified experiments, we have further tuned our code to sort at ~780 GB/min aggregate or 267 MB/s/server. These newest runs correspond to ~67% efficiency.  Now obsessed with squeezing the last ounce of efficiency from the system, we continue to target >90% efficiency or more than 1 TB/min of sorting on 48 machines.
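
The efficiency arithmetic above is straightforward to check.  The sketch below assumes decimal units throughout; small differences from the figures quoted in the post may come from GB-versus-GiB accounting:

```python
# Per-server rate and efficiency for an aggregate sort rate, assuming
# a 400 MB/s best-case per-server rate (16 disks, two passes).
PEAK_MB_S = 400
SERVERS = 48

def efficiency(aggregate_gb_per_min):
    per_server_mb_s = aggregate_gb_per_min * 1000 / 60 / SERVERS
    return per_server_mb_s, per_server_mb_s / PEAK_MB_S

print(efficiency(582))  # roughly 202 MB/s/server, ~50% of peak
print(efficiency(780))  # roughly 271 MB/s/server, ~68% of peak
```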

While beyond the scope of this post, it has been very interesting just how much we had to do for even this level of performance.  In no particular order:

  • We had to revise, redesign, and fine tune both our architecture and implementation multiple times. There is no one right architecture because the right technique varies with evolving hardware capabilities and balance.
  • We had to experiment with multiple file systems and file system configurations before settling on ext4.
  • We were bitten multiple times by the performance and caching behavior of our hardware RAID controllers.
  • While our job overall is not CPU bound, thread scheduling and core contention became a significant issue.  In the end, we had to come up with our own custom core allocation scheme, bypassing the Linux kernel’s own approach.  One interesting requirement was avoiding the core that, by default, performed most of the in-kernel system call work.
  • Performing all-to-all communication at near 10 Gb/s, even among 48 hosts on a single switch, is an unsolved challenge to the best of our knowledge.  We had to resort to brittle and arcane socket configuration to sustain even ~5Gb/s.
  • We had to run with virtual memory disabled because the operating system’s memory management behaved in unexpected ways close to capacity.  Of course, with virtual memory disabled, we had to tolerate kernel panics if we were not careful about memory allocation.

In the end, simultaneously addressing these challenges turned out to be a lot of fun, especially with a great group of people working on the project.  Large-scale sort exercises many aspects of the operating system, the network protocol stack, and distributed systems.  It is far from trivial, but it is also simple enough to (mostly) keep in your head at once. In addition to improving the efficiency of our system, we are also working to generalize our infrastructure to arbitrary MapReduce-style computation. Fundamentally, we are interested to determine how much efficiency and scale we can maintain in a general-purpose data processing infrastructure.

Doing “Big Science” In Academia

Recently, there has been a lot of handwringing in the systems community about the work that we can do in the age of mega-scale data centers and cloud computing.  The worry is that the really interesting systems today consist of tens of thousands of machines interconnected both within data centers and across the wide area.  Further, appropriate system architectures depend heavily on the workloads imposed by millions of users on particular software architectures.  The concern is that we in academia cannot perform good research because we do not have access to either systems of the appropriate scale or application workloads to inform appropriate system architectures.

The concern further goes that systems research is increasingly being co-opted by industry, with many (sometimes most) of the papers in top systems and networking conferences being written by our colleagues in industry.

One of my colleagues hypothesized that perhaps the void in the systems community was partially caused by the void in “big funding” that was historically available to the academic systems community from DARPA.  Starting in about 2000, DARPA moved to more focused funding of efforts likely to have direct impact in the near term.  Though it looks like this policy is changing under new DARPA leadership, the effects in the academic community have yet to be felt.

My feeling is that all this worry is entirely misplaced.  I will outline some of the opportunities that go along with the challenges that we currently face in academic research.

First, for me, this may in fact be another golden age in systems research, borne out of tremendous opportunity to address a whole new scale of problems collaboratively between industry and academia. Personally, I find interactions with my colleagues in industry to be a terrific source of concrete problems to work on.  For example, our recent work on data center networking could never have happened without detailed understanding of the real problems faced in large-scale network deployments.  While we had to carry out a significant systems building effort as part of the work, we did not need to build a 10,000-node network to carry out interesting work in this space.  Even the terrific work coming out of Microsoft Research on related efforts such as VL2, DCell, and BCube typically employ relatively modest-sized system implementations as proofs of concepts of their designs.

A related approach is to draw inspiration from a famous baseball quote by Willie Keeler, “I keep my eyes clear and I hit ’em where they ain’t.” The analog in systems research is to focus on topics that may not currently be addressed by industry.  For example, while there has been tremendous interest and effort in building systems that scale seemingly arbitrarily, there has been relatively little focus on per-node efficiency.  So a recent focus of my group has been on building scalable systems that do not necessarily sacrifice efficiency.  More on this in a subsequent post.

The last, and perhaps best, strategy is to actively seek out collaborations with industry to increase overall impact on both sides. One of the best papers I read in the set of submissions to SIGCOMM 2010 was on DCTCP, a variant of TCP targeting the data center.  This work was a collaboration between Microsoft Research and Stanford with the protocol deployed live on a cluster consisting of thousands of machines.  The best paper award from IMC 2009 was on a system called WhyHigh, a system for diagnosing performance problems in Google’s Content Distribution Network.  This was a multi-way collaboration between Google, UC San Diego, University of Washington, and Stony Brook.  Such examples of fruitful collaborations abound.  Companies like Akamai and AT&T are famous for multiple very successful academic collaborations with actual impact on business operations.  I have personally benefitted from insights and collaborations with HP Labs on topics such as virtualization and system performance debugging.

I think the big thing to note is that industry and academia have long lived in a symbiotic relationship.  When I was a PhD student at Berkeley, many of the must-read systems papers came out of industry: the Alto, Grapevine, RPC, NFS, Firefly, Logic of Authentication, Pilot, etc., just as systems such as GFS, MapReduce, Dynamo, PNUTS, and Dryad are heavily influencing academic research today.  At the same time, GFS likely could not have happened without the lineage of academic file systems research, from AFS, Coda, LFS, and Zebra to xFS.  Similarly, Dynamo would not have been as straightforward if it had not been informed by Chord, Pastry, Tapestry, CAN, and all the peer-to-peer systems that came afterward.  The novel consistency model in PNUTS that enables its scalability was informed by decades of research in strong and weak data consistency models.

Sometimes things go entirely full circle multiple times between industry and academia.  IBM’s seminal work on virtual machines in the 1960’s lay dormant for a few decades before inspiring some of the top academic work of the 1990’s, SimOS and DISCO.  This work in turn led to the founding of VMware, perhaps one of the most influential companies to come directly out of the systems community.  And of course, VMware has helped define part of the research agenda for the systems community in the past decade, through academic efforts like Xen.  Interestingly, academic work on Xen led to a second high-profile company, XenSource.

This is all to say that I believe that the symbiotic relationship between industry and academia in systems and networking will continue.  We in academia do not need a 100,000-node data center to do good research, especially by focusing on direct collaboration with industry where it makes sense and otherwise on topics that may not be directly addressed by industry.  And the fact that there are so many great systems and networking papers from industry in top conferences should only serve as inspiration, both to define important areas for further research and to set the bar higher for the quality of our own work in academia.

Finally, and only partially in jest, all the fundamental work in industrial research is perhaps further affirmation of the important role that academia plays, since many of the people carrying out the work were MS and PhD students in academia not so long ago.

SIGCOMM 2010 Travel Grants and VISA Workshop

This year, I had the pleasure of serving on the SIGCOMM 2010 program committee.  I may write more about the experience later, but the short version is that I really enjoyed reading the papers and was particularly impressed by the deep discussions at the two-day program committee meeting last month.  K.K. Ramakrishnan and Geoff Voelker did a terrific job as co-chairs and I believe their efforts are well reflected in a very strong program.

The conference will be held in New Delhi this year and the organizing committee has been fortunate to secure some generous support for travel grants.  This year, grants will be available not just for students, but also for post docs and junior faculty.  The deadline for application has been extended to June 12, 2010.  Full details are available here.  On behalf of the SIGCOMM organizing committee, I encourage everyone interested to apply.

If you do attend SIGCOMM, let me also put in a plug for the VISA workshop.  This is the second workshop on Virtualized Infrastructure Systems and Architecture, building on the successful program we had last year.  I was the co-program chair this year with Guru Parulkar and Cedric Westphal.  Virtualization remains an important topic and VISA is playing an important role for discussion of important problems across systems and networks.

Why Go to Graduate School and How to Get into the Program of your Dreams

This is the time of year when applications to graduate schools are due and I see a lot of both misinformation and lack of information among applicants.  I thought it might be valuable to put together some advice on the application process from “the other side,” someone who spends a lot of time looking at the applications and helping to decide who is admitted.  My experience is with applications to a highly research-oriented MS and PhD program in Computer Science and Engineering at UC San Diego.  However, in speaking with my colleagues over the years, I believe that the thoughts below generalize to a variety of top CS research programs and, to some extent, to science and engineering graduate programs as a whole.

For myself, I did not know most of the below when I was applying to graduate programs.  All I knew was that I wanted to be a professor and that I needed a PhD.  Sometimes, that is enough.

I will edit this document as I get additional questions and feedback, so feel free to post your thoughts and comments.

Q: Why should I go to graduate school?

There are a number of good reasons to go to graduate school, though of course it is not for everyone.

  • You love Computer Science and are passionate about learning more about it.  Four years was just not enough to cover everything you wanted to learn.  More advanced classwork on topics you saw as an undergraduate will often make the material “click”.  The opportunity to perform research will give you a new perspective on how to approach problem solving and a new skill set that will be broadly applicable to many different work settings.
  • You really want to be a college professor. You look around at your own professors and think they have the greatest job imaginable. There are of course exceptions that prove the rule, but in general you need a PhD to become a professor.
  • You want to perform research.  You like working on open-ended problems with the opportunity to both advance the state of scientific understanding and the chance to perhaps influence the way people and companies do things in the future. There are a number of great industrial research labs that hire PhDs to perform exactly this kind of work.
  • You have an entrepreneurial bent and want to start a company. This reason is probably a bit controversial since you may be better off just going to work and learning about important problems facing industry. However, a higher-risk though perhaps higher-reward approach is to go to graduate school to learn about cutting-edge ideas with an eye toward applying them to the marketplace. This applicant is fairly rare as it requires both a strong entrepreneurial spirit and the ability to perform leading research (which often times does not have any immediate commercial application).
  • You want to get a better job, with more interesting responsibilities and a higher salary, than what you might be able to get with a Bachelor’s degree.  Depending on the job market and your own qualifications, this could be a great reason to go to graduate school, but very likely an MS rather than a PhD program. A typical 2-year MS program at a good school is likely to put you in a position for better jobs with higher starting salaries.  However, a PhD is likely the wrong way to go because by the time you account for all the years required to complete your PhD, you would have been better off starting in industry, gaining experience, gaining promotions, and perhaps moving on to your second or third job. For better or for worse, people rarely stay at companies for very long these days.  In Silicon Valley, the median amount of time at a company seems to be 18 months. You may be able to get significantly further ahead by just working and gaining experience and contacts rather than going for a PhD.

Q: What does the admissions committee look for in a successful applicant?

The ideal graduate student will have the following characteristics:

  • Research experience. Nothing prepares a student for graduate work like actually focusing on the most important aspect of the graduate school learning process, performing original research. However, such experience is relatively rare for undergraduates, paradoxically especially so at major research universities. The most important aspect of research experience is typically not the actual work you do but the opportunity for you to get to know a professor relatively well. And this leads to the next point.
  • Letters of recommendation. Having strong letters of recommendation is critical and, in many cases, something that you can control more readily than research experience. It does help to have letters from writers whom members of the admissions committee know. So if you are interested in doing research in operating systems, then you should try to take that course early. Chances are decent that some member of the admissions committee at some of the schools you are applying to will know the operating systems professor at your university.
  • Important personal characteristics. There are a number of qualities that are more important predictors of success in graduate school (and beyond) than generic intelligence. These qualities include creativity, focus, leadership, independence, diligence, passion, integrity. Unfortunately, it is possible to attend a great school, earn terrific grades, and even publish some papers without having these critical qualities.  This is why appropriately detailed letters of recommendation are so important.  If they can attest to some of these difficult to quantify characteristics, then the applicant will definitely have a leg up.
  • Rigorous undergraduate program. Attending a strong undergraduate program ensures that you have some baseline mastery of important computer science topics and techniques. Essentially, the admissions committee is looking for applicants that are as “research ready” as possible. If you do not have to spend time to learn the basics, then you can get started with successful research more quickly.
  • Strong GPA/GRE scores.  The definition of a good GPA is calibrated by the quality of a school and also historic norms for “grade inflation” at a particular institution.  Since we see many applications from a subset of schools every year, the admissions committee often has an informal database of norms to consult.  GRE scores are a bit more difficult to evaluate, especially since it is possible to essentially memorize one’s way to strong GRE scores.
  • Work experience.  Contrary to popular opinion, a few years of industrial experience can be a huge plus for an applicant. Practical experience in leading industrial positions can expose students to important problems and often leads to students who have stronger implementation skills coming into the program.  In addition, an applicant who spends time in industry and makes the conscious decision to come back to graduate school (giving up regular hours, a higher salary, etc.), typically shows a high level of dedication to graduate study.  They know it is what they want, rather than “it seemed like the next thing to do.”
  • Personal statement. You can consider this to be a writing sample that also gives some insight into your personality and maturity.  This is your chance to describe some of the work that you have done and why you found it interesting and important.  If you already have an idea of what research you would like to pursue and why, this would be a great place to discuss it.  If you have spent the time to get to know the research of one or more professors in the department you are applying to, it would definitely help to include a personalized paragraph in the personal statement.  Many applicants use the personal statement as an opportunity to wax eloquent on the beauty of basic research and how they were set on the path to fundamentally change scientific understanding at an early age.  Some faculty (e.g., me) have a soft spot for such idealism.  But most are turned off by it, so on balance it is best to avoid such discussion unless you have something really distinctive or substantial to say (the wonder in your eye when you first laid eyes on a computer does not count).

On the PhD side, applicant screening is difficult because the characteristics of a good PhD student are different from the characteristics of a great undergraduate student.  Doing well in undergraduate courses requires being able to apply a relatively small set of concepts in a particular course to a relatively focused problem domain.  Individual problems may take hours to solve and, in rare cases, may require more focused work for days or weeks. Performing well in research requires applying ideas from a large set of domains to a problem that is likely poorly defined and almost certainly has no fixed answer.  Still, the admissions committee does consider a student’s grades as reflective of raw intellect and baseline knowledge of important computer science skills.

GRE scores are similarly an indication of at least some baseline mathematical and writing ability.  Overall, the GRE scores tend to provide the least differentiation among applicants. I cannot think of a single instance where one student was selected over another based on GRE scores.  Still, it is something that the admissions committee does at least look at.  Since the GRE tests the most basic mathematics and since Computer Science typically requires strong mathematical and analytical abilities, most admissions committee members look for near-perfect GRE math scores.  Some admissions committee members largely dismiss the GRE math as only an indication of an applicant’s ability to perform simple mathematics quickly.  I know at least a few admissions committee members who put significant weight on the GRE Verbal score.  Communicating research ideas, both through oral presentations and written research papers, is critically important.  Since this skill is relatively under-developed in many graduate students, it is something that we look for.

Q: What can I do to prepare for graduate school applications?

The key is to be organized and to plan ahead (two skills not necessarily required for success in undergraduate programs but that will prove to be critical for success in graduate school!).  Many programs now offer online admissions applications (we certainly do here at UCSD).  Still, you have to arrange for all of your letter writers to send their letters to the various programs you are applying to.  Many schools offer letter services for their undergraduates where they can ask their writers to place a letter in a file for them.  The applicant can then simply request that copies of the letter be sent to individual programs.  You have to ensure that your GRE scores are similarly delivered.

As indicated above, having strong letters of support is one of the most important parts of an application.  And this is simply not something that you can start preparing for in November, just before applications are due in December.  Ideally, this is a process that spans multiple years by cultivating a relationship with faculty members in your department.  Summer internships at companies are also a good opportunity for securing letters.  Becoming involved in a research internship at a remote institution for the summer is another terrific opportunity.  A number of programs such as NSF’s REU (Research Experiences for Undergraduates) recruit for such positions at universities across the country.  This is something to apply for in your sophomore or junior year (or earlier!).

Of course, another option is to work on research with faculty in your own institution.  If you have done well in a professor’s class, they are very likely to be happy to work with you.  Doing research during the academic year is challenging because of all of the short-term demands on your time (a preview for your first few years of graduate school!).  So again you have to be organized.  A great way to get momentum for research is to start over the summer. Some professors offer paid internships for undergraduates over the summer.  Other times, however, such funding is not available.  My advice would be that if you have an opportunity to perform research with a great faculty member/set of students over the summer and you are very interested in learning about research/graduate school, then volunteering for an unpaid internship is a great investment in your future.

Q: Should I apply for MS or PhD programs?

There are multiple tradeoffs here.  I will summarize at a high level.

PhD:
+ Largely a prerequisite if you want to teach at the college/university level or focus on basic research in industry (there are exceptions that prove the rule).
+ Typically, admission comes with a guarantee of funding.

MS:
+ Significantly easier to be admitted into an MS program.  From the department’s perspective, the risk is lower because typically there is no offer of financial support and the commitment is for two years rather than five to six years.  Someone with a strong undergraduate record is also fairly likely to perform well in an MS program though perhaps may not be an excellent researcher.
+ Relatively short time commitment (18-24 months) with significantly improved job prospects relative to a Bachelor’s degree.
+ If your record is relatively borderline for admission into top PhD programs, can use the MS as a proving ground to significantly improve chances for PhD admission later.

While the MS option typically does not guarantee funding, many MS students (certainly at UCSD) still obtain funding through TAships, RAships, or summer internships (currently a 3-month summer internship in the US often pays in the $18-20k total range).  Still, you should only go into an unfunded MS position if you have the means to fund it (through loans or otherwise) in the worst case.  Looking at it another way, many graduate students in law and medicine in the US go into debt (certainly more than the cost of an 18-24 month MS program) as an investment against their future earning power.  It may be worth considering the tradeoffs here as well if you are very excited about pursuing graduate work in computer science.

Q: Does it help to send email to a professor asking for an evaluation?

In general, sending a generic form letter to hundreds of professors is unlikely to help at all.  If you do send such a letter, make sure that you proofread it and that you get the professor’s name and area of research correct.  A poorly written note or one that cites a different professor’s papers can leave a bad impression.  However, if you have something intelligent to say about a professor’s research, beyond a simple “I found your paper on X to be very interesting and in line with my own interests,” then it could be worthwhile.  And, of course, if you have an exceptionally strong record where you might be a clear admit, then it could be worthwhile to get yourself on a particular faculty member’s radar.

But note that the bar for “clear admit” is quite high.  At UC San Diego, we get many strong applications where the line between accept and reject is very fine and impossible to predict ahead of time.  Clear admit says essentially: independent of available funding, current research focus, the strength of the rest of the pool, etc., this student will be admitted in any given year.  At most top 25 departments, this means at least 3 of 4 of the following: top 1% recommendation letters from well-known letter writers, top undergraduate institution, very high GPA/GRE scores, and research experience preferably with published papers in top venues. Out of 1000 applicants, we might only have 40-50 that fall into this clear admit category in any given year.

Q: I have been admitted to a number of programs.  What should I look for in a school?

The biggest mistake I see students make, especially among foreign applicants, is to order their admits based on US News and World Report rankings and select the school with the highest ranking.  Your goal is to maximize your long-term success and that means maximizing your prospects once you complete your degree.  I will focus here on the PhD side, but similar considerations apply for the MS degree.

In maximizing your experience in graduate school, in general you want to maximize the quality of the research that you perform, and the single most important thing here is your research adviser and the other graduate students you work with on a day-to-day basis.  So, in considering a school, the first thing to look at is the set of faculty members that you might be interested in working with.  If you are not sure what you might like to do, you should make sure that the various areas that you are interested in are well represented in a particular department.  If you are interested in working in a particular area, is there more than one faculty member working in that space?  You might love the work of a particular professor, but that professor may not be taking on students or may be on leave in a given year. More subtly, your personalities may not mesh well, or the advising style of a particular professor may not work well for you.  Some high level distinctions include students who like significant freedom versus professors who might have a relatively narrow set of topics that they want their students to work on.  The reverse can also be problematic: some professors are very hands off while a particular student may need relatively close interaction (at least initially).

The best way to determine whether you would enjoy working with a faculty member is to attend the school’s visit day.  This will give you the opportunity not only to meet the professor but also to speak with the professor’s other students to get a good feel for what it would be like working with that faculty member.  Of course, the difficulty of attending visit days for foreign students is one of the challenges in accurately evaluating all of the alternatives on a list.  In this case, students should still be proactive in setting up telephone conversations with both faculty and students at the institution.  At the very least, you should verify that some of the faculty you are interested in working with have the capacity (in terms of both time and money) to take on additional students.

Circling back to the topic of rankings, if a higher ranked institution does not have any professors working in areas you are interested in or if your style of working does not mesh well with the available faculty, then it is less likely that you will be able to perform high quality research.  And, of course, this will in turn impact your chances of getting your dream job upon graduation.

Clearly, rankings do play some role in your subsequent success and it would be naive to think they do not matter at all.  If you are able to do work of equivalent quality at two institutions and one is substantially more prestigious than another, then choosing the higher-ranked one makes sense.  But the quality of your work trumps all other considerations in my opinion.  Certainly, when we evaluate faculty applicants for our own department, the quality and impact of the research performed by an applicant is by far the number one evaluation criterion.  Probably the second most important criterion is the leadership skills and vision of the applicant.  School ranking is never explicitly considered.

Since you will not be spending 100% of your time doing research and since your personal happiness goes a long way in determining your overall work productivity, other considerations are also important.  Essentially, are there factors about the location of a school that would impact the things you like to do in your free time (e.g., spending time with friends or family, going to the theatre, snowboarding, museums, outdoor sports, etc.).

Q: One school is offering me a better financial aid package than another.  Can I use this to negotiate?

You can try, but in most cases, schools offer the best financial packages they can to an applicant.  If the difference is between no funding at one school and full support at another, then it is worth inquiring about available funding.  However, if the difference is a few thousand dollars in the form of a special fellowship at one school relative to another, I would consider the difference to be in the noise relative to all the other things that go into determining your long term success.  Once again, if everything else is equal, then choosing the school with a slightly better financial package makes sense.  But in virtually all cases, other considerations will be more important than the total amount of support.

Another question to consider is the length of guaranteed support in an offer letter.  Some schools promise support to PhD applicants for five+ years, while others may only promise support for one, two, or three years.  You should not place too much stock in the various differences here.  The fact is that, currently, virtually all PhD students in top tier departments receive one form of support or another as long as they are making good progress toward their dissertation. Available support of course varies from school to school and from research area to research area, but it is the clear exception where a PhD student making good progress has no funding options.

And guaranteeing funding has legal implications at some schools that make it difficult to provide such guarantees.  For example, if a professor wishes to recruit 2 new graduate students in a given year and the historical accept rate for admissions offers is 40%, then the professor may wish to admit 5 students total.  However, a particular university might require the professor to demonstrate funding for all 5 students for all 5 years, or 25 years of total graduate student support.  This requirement comes despite the fact that the faculty member only expects 2 of the students to accept and hence really only needs 10 years of total support.  (If there is a “success disaster” where 3 or 4 students accept, presumably that same professor would not recruit in subsequent years to absorb the bubble.)  So overall, depending on campus requirements, it may not even be possible for a faculty member to guarantee support since there may be legal contractual obligations associated with the guarantee.

In general, the best way to determine what the real funding situation is like at a school or a particular group is to ask other students.  If senior students have all had full RAships and full summer support for the past five years, then you can typically use the past as a good predictor for the future, independent of the specifics of the offer letter.

Presentation Summary “High Performance at Massive Scale: Lessons Learned at Facebook”

Recently, we were fortunate to host Jeff Rothschild, the Vice President of Technology at Facebook, for a visit for the CNS lecture series.  Jeff’s talk, “High Performance at Massive Scale: Lessons Learned at Facebook,” was highly detailed, providing real insights into the Facebook architecture. Jeff spoke to a packed house of faculty, staff, and students interested in the technology and research challenges associated with running an Internet service at scale.  The talk is archived here as part of the CNS lecture series.  I encourage you to check it out; below are my notes on the presentation.
Site Statistics:
  • Facebook is the #2 property on the Internet as measured by the time users spend on the site.
  • Over 200 billion monthly page views.
  • >3.9 trillion feed actions processed per day.
  • Over 15,000 websites use Facebook content
  • In 2004, the shape of the curve plotting user population as a function of time showed exponential growth to 2M users.  Five years later, they have stayed on the same exponential curve with >300M users.
  • Facebook is a global site, with 70% of users outside of the US.
  • Today, there are 1.3B people in the world who have quality Internet connectivity, so there is at least another factor of 4 growth that Facebook is going after. Jeff presented statistics for the number of users that each engineer supports at a variety of high-profile Internet companies: 1.1M for Facebook, 190,000 for Google, 94,000 for Amazon, and 75,000 for Microsoft.
Photo sharing on Facebook:
  • Facebook stores 20 billion photos in 4 resolutions
  • 2-3 billion new photos uploaded every month
  • Originally provisioned photo storage for 6 months, but blew through available storage in 1.5 weeks.
  • Facebook serves 600k photos/second; serving them is more difficult than storing them.
Scaling photos, first the easy way:
  • Upload tier: handles uploads, scales the images, and stores them on the NFS tier
  • Serving tier: Images are served from NFS via HTTP
  • NFS Storage tier built from commercial products
  • Filesystems aren’t really good at supporting large numbers of files
Scaling photos, 2nd generation:
  • Cachr: cache the high volume smaller images to offload the main storage systems.
  • Only 300M images in 3 resolutions
  • Distribute these through a CDN to reduce network latency.
  • Cache them in memory.
Scaling photos, 3rd Generation System: Haystack
  • How many I/Os do you need to serve an image?  Originally, 10 I/Os at Facebook because of the complex directory structure.
  • Optimizations got it down to 2-4 I/Os per file served
  • Facebook built a better version called Haystack by merging multiple files into a single large file. In the common case, serving a photo now requires 1 I/O operation.  Haystack is available as open source.
Facebook architecture consists of:
  • Load balancers at the front end distribute requests to web servers, which retrieve the actual content from a large memcached layer because of the latency requirements for individual requests.
  • Presentation Layer employs PHP
  • Simple to learn: small set of expressions and statements
  • Simple to write: loose typing and universal “array”
  • Simple to read
But this comes at a cost:
  • High CPU and memory consumption.
  • C++ Interoperability Challenging.
  • PHP does not encourage good programming in the large (at 3M lines of code it is a significant organizational challenge).
  • Initialization cost of each page scales with size of code base
Thus Facebook engineers undertook implementing optimizations to PHP:
  • Lazy loading
  • Cache priming
  • More efficient locking semantics for variable cache
  • Memcache client extension
  • Asynchronous event-handling
Back-end services that require the performance are implemented in C++. Services Philosophy:
  • Create a service iff required.
  • Real overhead for deployment, maintenance, separate code base.
  • Another failure point.
  • Create a common framework and toolset that will allow for easier creation of services: Thrift (open source).
A number of things break at scale, one example: syslog
  • Became impossible to push large amounts of data through the logging infrastructure.
  • Implemented Scribe for logging.
  • Today, Scribe processes 25TB of messages/day.
Site Architecture
Overall, Facebook currently runs approximately 30k servers, with the bulk of them acting as web servers.
The Facebook Web Server, running PHP, is responsible for retrieving all of the data required to compose the web page.  The data itself is stored authoritatively in a large cluster of MySQL servers.  However, to hit performance targets, most of the data is also stored in memory across an array of memcached servers. For traditional websites, each user interacts with his or her own data.  And for most web sites, only 1-2% of registered users concurrently access the site at any given time.  Thus, the site only needs to cache 1-2% of all data in RAM.  However, data at Facebook is deeply interconnected; each user is interested in the state of hundreds of other users.  Hence, even with only 1-2% of the user population at any given time, virtually all data must still be available in RAM.
Data partitioning was easy when Facebook was a college web site: simply partition data at the level of individual colleges.  After considering a variety of data clustering algorithms, the engineers found that there was very little win for the additional complexity of clustering.  So at Facebook, user data is randomly partitioned across individual databases and machines in the cluster.  Hence, each user access requires retrieving data corresponding to user state spread across hundreds of machines.  Intra-cluster network performance is hence critical to site performance.
Facebook employs memcache to store the vast majority of user data in memory spread across thousands of machines in the cluster.  In essence, nodes maintain a distributed hash table to determine the machine responsible for a particular user’s data.  Hot data from MySQL is stored in the cache.  The cache supports get/set/incr/decr and multiget/multiset operations.
Initially, the architecture needed to support 15-20k requests/sec/machine, but that number has scaled to approximately 250k requests/sec/machine today.  Servers have gotten faster to keep up to some extent, but Facebook engineers also had to perform some fundamental re-engineering of memcached to improve its performance.  System performance improved from 50k requests/sec/machine to 150k, to 200k, and to 250k by adding multithreading, polling device drivers, stats locking, and batched packet handling, respectively. In aggregate, memcache at Facebook processes 120M requests/sec.
One networking challenge with memcached was so-called network incast. A front-end web server would collect responses from hundreds of memcache machines in parallel to compose an individual HTTP response. All responses would come back within the same approximately 40 microsecond window.  Hence, while overall network utilization at Facebook was low on average, there were significant, correlated packet losses at very fine timescales.  These microbursts overflowed the limited packet buffering in commodity switches (see my earlier post for more discussion on this issue).
To deal with the significant slowdown that resulted from synchronized loss in relatively small TCP windows, Facebook built a custom congestion-aware UDP-based transport that manages congestion across multiple requests rather than within a single connection. This optimization allowed Facebook to avoid, for example, the 200 ms timeouts associated with the loss of an entire window’s worth of data in TCP.
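Facebook’s actual transport is a UDP protocol whose details were not presented, but the cross-request idea can be loosely illustrated at the application level: a single window shared by all outstanding fetches caps how many responses can be in flight at once, so hundreds of replies cannot arrive in one burst and overflow shallow switch buffers.  A toy Python sketch, with all names hypothetical:

```python
import threading

class FetchWindow:
    """Toy sketch of congestion control shared across requests: a
    semaphore caps the number of simultaneously outstanding fetches
    to back-end servers, regardless of which page request they serve."""

    def __init__(self, window_size):
        self.slots = threading.Semaphore(window_size)

    def fetch(self, fetch_fn, key):
        self.slots.acquire()        # wait for a slot in the shared window
        try:
            return fetch_fn(key)    # issue the request while holding a slot
        finally:
            self.slots.release()    # free the slot for the next fetch
```

In a real transport the window would also grow and shrink in response to observed loss, which is the “congestion-aware” part; this sketch only shows the shared-window structure.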
Authoritative Storage
Authoritative Facebook data is stored in a pool of MySQL servers. The overall experience with MySQL has been very positive at Facebook, with thousands of MySQL servers in multiple datacenters.  It is simple, fast, and reliable.  Facebook currently has 8,000 server-years of runtime experience without data loss or corruption.
Facebook has learned a number of lessons about data management:
  • Shared architecture should be avoided; there are no joins in the code.
  • Storing dynamically changing data in a central database should be avoided.
  • Similarly, heavily-referenced static data should not be stored in a central database.
There are a number of challenges with MySQL as well, including:
  • Logical migration of data is very difficult.
  • The workaround is to create a large number of logical dbs and load balance them over a varying number of physical nodes.
  • Easier to scale CPU on web tier than on the DB tier.
  • Data driven schemas make for happy programmers and difficult operations.
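The logical-database indirection from the list above can be sketched as follows: users hash to a fixed, large number of logical databases, and a separate placement table maps logical databases onto however many physical hosts exist today, so rebalancing moves whole logical dbs rather than individual rows.  This is a minimal Python sketch; the shard count, host names, and placement policy are all hypothetical.

```python
NUM_LOGICAL_DBS = 4096   # hypothetical: fixed and much larger than the host count

def logical_db(user_id):
    """Random (hash/modulo) partitioning of users across logical databases."""
    return user_id % NUM_LOGICAL_DBS

# Placement of logical dbs onto physical hosts; in practice this table
# would be rebuilt when hosts are added, moving whole shards at a time.
physical_hosts = ["db%02d" % i for i in range(32)]   # hypothetical hosts
placement = {db: physical_hosts[db % len(physical_hosts)]
             for db in range(NUM_LOGICAL_DBS)}

def host_for(user_id):
    return placement[logical_db(user_id)]
```

Because `NUM_LOGICAL_DBS` never changes, `logical_db(user_id)` is stable for the life of the system; only the placement table changes as capacity grows.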

Lots of examples of Facebook’s contribution back to open source here.

Given its global user population, Facebook eventually had to move to replicating its content across multiple data centers.  Facebook now runs two large data centers, one on the West coast of the US and one on the East coast.  However, this introduces the age-old problem of data consistency. Facebook adopts a primary/slave replication scheme where the West coast MySQL replicas are the authoritative stores for data.  All updates are applied to these master replicas and asynchronously replicated to the slaves on the East coast.  However, without synchronous updates, consecutive requests to the same data item from the same user can return inconsistent or stale results.
The approach taken at Facebook is to set a cookie on user update requests that will redirect all subsequent requests from that user to the West coast master for some configurable time period to ensure that read operations do not return inconsistent results.  More details on this approach are available on the Facebook blog.
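A minimal sketch of this cookie-based redirection: after a write, a timestamp cookie pins the user’s reads to the master until replication has had time to catch up.  The cookie name, replica names, and sticky window below are hypothetical, not Facebook’s actual values.

```python
import time

STICKY_SECONDS = 20   # hypothetical window; configurable in practice

def on_write(cookies, now=None):
    """Record the time of the user's last write in a cookie."""
    cookies["wrote_at"] = str(time.time() if now is None else now)

def choose_replica(cookies, now=None):
    """Route reads to the West coast master within the sticky window
    after a write; otherwise the local East coast replica is safe."""
    now = time.time() if now is None else now
    wrote_at = float(cookies.get("wrote_at", 0))
    if now - wrote_at < STICKY_SECONDS:
        return "master-west"
    return "replica-east"
```

The window only needs to cover the replication lag, so most reads still go to the nearby replica while recent writers see their own updates.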
Areas for future research at Facebook:
  • Load balancing
  • Middle tier: balance between programmer productivity and machine efficiency
  • Graph-based caching and storage systems
  • Search relevance via the social graph
  • Object discovery and ranking
  • Storage systems
  • Personalization
Jeff also relayed an interesting philosophy from Mark Zuckerberg: “Work fast and don’t be afraid to break things.”  Overall, the idea is to avoid spending the entire year working cautiously, delivering rock-solid code, but not much of it.  A corollary: if you take the entire site down, it’s not the end of your career.

YY Zhou Joins UC San Diego

I wanted to welcome Professor YY Zhou to UC San Diego.  YY is also joining our Center for Networked Systems as our 20th faculty member.  We were very happy to hire YY, most recently from the University of Illinois Computer Science department.  YY has been prolific in operating systems, storage, computer architecture, software engineering, and a variety of other areas.  I think it is fair to say that she and her students have performed some of the most creative work in recent years, pushing the state of the art in some of the most difficult problems in system reliability.

YY and her graduate students co-founded PatternInsight to commercialize some of their advancements.  The company already has a number of customers for their product, including places such as Intel, Cisco, Juniper, and Network Appliance.

Earlier, her work on software reliability made quite a splash at SOSP, the premier computer systems conference, with six of her papers appearing there over the past three iterations.  Her most recent paper at SOSP 2009 investigates techniques for reproducing concurrency bugs in multicore/multiprocessor systems, a critical problem in software reliability as virtually all software must become increasingly concurrent to take advantage of performance improvements in underlying processors.

Her work at SOSP would be enough for most, but YY and her colleagues have also been regular contributors to MICRO, ISCA, ASPLOS, FAST, and OSDI.

We are very excited to have YY join our systems and networking group.

David vs. Goliath, UCSD vs. Microsoft?

In high school, I took journalism and worked on the school newspaper.  This means that I know the value of a good headline, that the headline may not reflect reality, and that the author of an article often does not write the headline.  In fact, the headline may not even reflect the contents of the article.  Still, I was surprised to see the headline to a recent Network World article, It’s Microsoft vs. the professors with competing data center architectures.  The article invokes an image of one side or the other throwing down the gauntlet and declaring war. In fact, nothing could be further from the truth.  If it were true, I would definitely feel worried about taking on Microsoft.

My group has been active in data center research.  The article does a very nice job of describing the architecture of PortLand, our recent work on a layer 2 network fabric designed to scale to very large data centers.  The article also describes recent work on VL2 from my colleagues at Microsoft Research.  Both papers appeared at SIGCOMM 2009 and were presented back-to-back in the same session at the conference.

I will leave detailed comparison between the two approaches to the papers themselves.  However, at a high level, both efforts start with a similar premise: the data center networking fabric, at the scale of 10k-100k ports, should be managed as a single network fabric.  One desirable goal here is to manage the network as a single layer 2 domain.  However, conventional wisdom dictates that you cannot go beyond a few hundred ports for a layer 2 domain because of scalability and performance problems with traditional layer 2 protocols.  I described one such scalability limitation, limited switch state for forwarding tables, in an earlier post.  There are other challenges as well, including spanning tree protocols and the broadcast overhead of ARP.

So the main takeaway is that we cannot scale a layer 2 network to target levels without changing some of the underlying protocols, at least a bit.  With perfect hindsight, the key difference between PortLand and VL2 is one of philosophy.  Both groups agree that the network should consist of unmodified switch hardware.  However, we believe that the end hosts should also remain unmodified, with new functionality instead introduced by modifying switch software.  All switch hardware vendors export some API for programming switch forwarding tables, and recently systems such as OpenFlow have provided standard APIs for doing so.  In fact, we implemented our prototype of PortLand using OpenFlow with the goal of maintaining the boundary between system and network administration.  VL2, on the other hand, prefers to leave the switch software unmodified and instead introduces its new functionality by modifying the end hosts themselves.  This leads to different architectural techniques and different designs.

One of our overriding goals is to reduce management burden, so we further introduce a decentralized Location Discovery Protocol (LDP) to automatically assign hierarchical prefixes to switches and end hosts.  These prefixes are the basis for compact forwarding tables in intermediate switches.  Both VL2 and PortLand leverage a directory service to essentially find an efficient path between a source and destination without resorting to broadcast (as would be required by default with ARP).

I consider the VL2 paper to be excellent.  I certainly learned a lot from reading the paper.  Perhaps the ultimate compliment I can give is that I plan to assign it to my class in the spring when I teach graduate computer networks again.

Still, it is true that one of the best things about research is that we live in a marketplace of ideas and hence there must be some implicit competition.  We can only get better knowing that the folks at Microsoft are working on similar problems and certainly the “truth” as ascertained with 20/20 hindsight in 5-10 years will consist of some mixture of the competing techniques.  That way, everyone can declare victory.

Amin Vahdat is a Professor in Computer Science and Engineering at UC San Diego.
