Why Vista took so long to build

Joel Spolsky complained that the Start menu in Vista was too complicated for him. I didn’t pay too much attention to that, but another former Microsoftie by the name of Moishe Lettvin, who had worked on that very same Start menu, replied with a very interesting post about the build process used for the Vista operating system. Here are some excerpts:

I worked at Microsoft for about 7 years total, from 1994 to 1998, and from 2002 to 2006.

The most frustrating year of those seven was the year I spent working on Windows Vista, which was called Longhorn at the time. I spent a full year working on a feature which should’ve been designed, implemented and tested in a week. […]

I worked on the “Windows Mobile PC User Experience” team. This team was part of Longhorn from a feature standpoint but was organizationally part of the Tablet PC group. To find a common manager to other people I needed to work with required walking 6 or 7 steps up the org chart from me. […]

So just on my team, these are the people who came to every single planning meeting about this feature:

1 program manager
1 developer
1 developer lead
2 testers
1 test lead
1 UI designer
1 user experience expert

8 people total

These planning meetings happened every week, for the entire year I worked on Windows.

In addition to the above, we had dependencies on the shell team (the guys who wrote, designed and tested the rest of the Start menu), and on the kernel team (who promised to deliver functionality to make our shutdown UI as clean and simple as we wanted it). The relevant part of the shell team was about the same size as our team, as was the relevant part of kernel team.

So that nets us a conservative estimate of 24 people involved in this feature. Also, each team of 8 was separated by 6 layers of management from the leads, so let’s add them in too, giving us 24 + (6 * 3) + 1 (the shared manager) = 43 total people with a voice in this feature. Twenty-four of them were connected sorta closely to the code, and of those twenty-four there were exactly zero with final say in how the feature worked. Somewhere in those other 19 was somebody who did have final say, but who that was I have no idea, since when I left the team – after a year – there was still no decision about exactly how this feature would work.

By the way “feature” is much too strong a word; a better description would be “menu”. Really. By the time I left the team the total code that I’d written for this “feature” was a couple hundred lines, tops. […]

I’d also like to sketch out how actual coding – what there is of it – works on the Windows team.

In small programming projects, there’s a central repository of code. Builds are produced, generally daily, from this central repository. Programmers add their changes to this central repository as they go, so the daily build is a pretty good snapshot of the current state of the product.

In Windows, this model breaks down simply because there are far too many developers to access one central repository – among other problems, the infrastructure just won’t support it. So Windows has a tree of repositories: developers check in to the nodes, and periodically the changes in the nodes are integrated up one level in the hierarchy. At a different periodicity, changes are integrated down the tree from the root to the nodes. In Windows, the node I was working on was 4 levels removed from the root. The periodicity of integration decayed exponentially and unpredictably as you approached the root so it ended up that it took between 1 and 3 months for my code to get to the root node, and some multiple of that for it to reach the other nodes. It should be noted too that the only common ancestor that my team, the shell team, and the kernel team shared was the root.
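(As an aside, a toy model makes it easy to see where the months go. Only the tree depth comes from the quote above; the per-level integration periods below are invented numbers, not Microsoft’s actual schedule.)

  # Toy model of change propagation through a tree of repositories.
  # Only the depth (4 levels from leaf to root) comes from the quote;
  # the per-level integration periods are made up for illustration.

  # Days between reverse integrations (child -> parent), leaf to root.
  # Per the quote, integrations get rarer the closer you are to the root.
  up_periods_days = [7, 14, 21, 30]
  # Days between forward integrations (parent -> child), root back down.
  down_periods_days = [30, 21, 14, 7]

  # Worst case: a change lands just after each integration, so it waits
  # one full period at every level going up and again coming back down.
  to_root = sum(up_periods_days)
  round_trip = to_root + sum(down_periods_days)

  print(f"worst case, my node to the root:       ~{to_root} days")     # ~72
  print(f"worst case, my node to a sibling node: ~{round_trip} days")  # ~144

Even with these fairly tame made-up periods, that lands squarely in the “1 to 3 months to the root, and some multiple of that to the other nodes” range Lettvin describes.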

So in addition to the above problems with decision-making, each team had no idea what the other team was actually doing until it had been done for weeks.

The end result of all this is what finally shipped: the lowest common denominator, the simplest and least controversial option.

Here’s one interesting anonymous comment on the weblog:

Disclaimer: I was a manager at Microsoft during some of this period (a member of the class of 19 uninformed decision makers), although not on this feature, er, menu.

The people who designed the source control system for Windows were not idiots. They were trying to solve the following problem:

  • thousands of developers,
  • promiscuous dependency taking between parts of Windows without much analysis of the consequences
    –> with a single codebase, if each developer broke the build once every two years there would never be a Longhorn build (or some such statistic - I forget the actual number)
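(The arithmetic behind that statistic is easy to reconstruct with invented numbers; neither figure below is the commenter’s forgotten one.)

  # Back-of-the-envelope only: both numbers below are invented.
  developers = 4000                 # rough order of magnitude for Windows
  breaks_per_dev_per_year = 0.5     # "broke the build once every two years"
  working_days_per_year = 250

  breaks_per_day = developers * breaks_per_dev_per_year / working_days_per_year
  print(f"expected build breaks per working day: {breaks_per_day:.1f}")  # ~8
  # At roughly 8 breaks a day, a single shared trunk would essentially
  # never be clean long enough to produce a good build.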

There are three obvious solutions to this problem:

  1. federate out the source tree, and pay the forward and reverse integration taxes (primarily delay in finding build breaks), or…
  2. remove a large number of the unnecessary dependencies between the various parts of Windows, especially the circular dependencies.
  3. both 1 & 2.

#1 was the winning solution, in large part because it could be executed by a small team over a defined period of time. #2 would have required herding all the Windows developers (and PMs, managers, UI designers…), and is potentially an unbounded problem.

(There was much work done analyzing the internal structure of Windows, which certainly counts as a Microsoft trade secret, so I am not at liberty to discuss it.)

Note: the open source community does not have this problem (at least not to the same degree), because open source projects tend not to take dependencies on each other as freely; specifically, they:

  • rarely take dependencies on unshipped code
  • rarely make circular dependencies
  • mostly take dependencies on mature, stable components.

As others have mentioned, the real surprise here is that they managed to ship anything.

And a follow-up comment by one RichB:

As someone else pointed out, dependency management is Microsoft’s big weakness when compared to Linux. This manifests itself in the complexity of version compatibility with beta versions of software (see the VS CTP compatibility tables on the web). It also manifests itself in the horrid MSI-based security patching that Microsoft use. And as someone said, it manifests itself in the internal build process. If you can’t run a static dependency checking analyzer over your dependencies, then you have no idea whether a build will succeed or not.

It’s also worth noting that SourceDepot is a conventional VCS, whereas the Linux kernel and most of the large open source products (Java, Xen, etc.) use a distributed VCS. I’ve never used Git, Bzr, Mercurial, or any of the other systems, so I don’t know their advantages over a central, non-distributed system. Perhaps Microsoft should have given them a go?
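(RichB’s point about static dependency checking is worth making concrete: once the component-to-dependency edges have been extracted, checking for a feasible build order, and for the circular dependencies the earlier commenter mentions, is a standard topological sort. A minimal sketch with an invented graph; the component names are illustrative only.)

  from graphlib import TopologicalSorter, CycleError  # Python 3.9+

  # Invented graph: component -> components it depends on.
  deps = {
      "kernel":       set(),
      "ui_framework": {"kernel"},
      "shell":        {"kernel", "ui_framework"},
      "tablet_pc":    {"shell", "ui_framework"},
  }

  try:
      order = list(TopologicalSorter(deps).static_order())
      print("feasible build order:", order)
  except CycleError as err:
      # A cycle means there is no clean layering: the components involved
      # can only be built, shipped and patched as one lump.
      print("circular dependency:", err.args[1])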

It seems to me that Microsoft will have to bite the bullet and drastically reduce interdependencies between Windows components if the next Windows version is going to happen at all. The whole system ships as a bundle of DLLs, after all – it’s silly that they still need to treat it as a single super-slow build because those DLLs aren’t separated cleanly enough.

A camel is a horse designed by bureaucrats…

There’s a fairly interesting interview with Jim Allchin here:

http://blogs.zdnet.com/microsoft/?p=65

Apparently the OS team knows that the componentization problem was killing them; hopefully the next rev of the OS will be able to realize some of the benefits of the supposedly ongoing work in this area. Here are the relevant bits of the interview:

Q: You say you’ve learned a lot from the whole Longhorn/Vista development process. Could you share some of these lessons?

Allchin: What you really want to do is componentize, get the stuff done, make sure you understand the dependencies, easy as one, two, three. But you need to understand those dependencies and then you put the pieces together. And without that, your ability to build large, complicated products — could be cars or operating systems or whatever — it doesn’t scale.

So you’ve got to get it down to components that you really can thoroughly understand and make sure you understand the dependencies so you’re not building them on a quicksand-ish or shaky foundation.

So it sounds trivial, (but) at scale it’s very hard to do. You (might) say, well, I’ll do the memory manager separate. No, the memory manager can’t be separate because it’s so integrated in with the other parts of the operating system. And so it’s a tricky balance. And when we were doing stuff in Windows – actually, alpha Longhorn — there were many learnings that we had to go through there because we had pieces that we knew eventually had to fit together, and we were just binding them too closely, we didn’t understand the dependencies well enough.

Q: But inside Microsoft’s Core Operating System Division (COSD), it seems you already knew about building modularly and were doing development that way. How much more modular can you get than you were?

Allchin: Much.

Q: How?

Allchin: We paid a huge price in the engineering in Windows Vista for the componentization work that we’ve done. Now, components mean different things to different people, and so I don’t want to spend time on a definition here.

But we have rearchitected the system for a new setup technology, which is componentized so that people have pieces of code that have manifests associated, which describe how they touch the other parts of the system. We have dependency analysis between all these components now so we understand the layers of the system, we understand who impacts what.

And so, for example, we’re in the middle of escrow right now for Windows Vista as we walk through it, and if we have to do a bug fix at a particular point in the system, we know what other components in the system will be impacted, so where else we have to test. We didn’t have that understanding before. So we’ve paid a huge cost, which the benefits for Windows Vista, it’s very hard for me to point to one other than if you’ve seen fast install of a clean system, that’s a direct customer benefit, but most of it is all benefits for the future.

Now, with that being said, I consider it to be just step one in Windows Vista. There is a tremendous amount more to be done in componentization, and breaking up bigger components into smaller components, and just making it cleaner, cleaner, cleaner, but we paid a huge cost. It’s like Windows Vista in componentization is like the first time we did localization — or internationalization earlier, because once you put the foundation in place, then you can easily say, okay, I’m going to add new languages or I’m going to tweak it, I’m going to continue to refine it.

Q: So you’re taking the hit now. You’re going to get the payoff later.

Allchin: There are many things that we took a hit in Longhorn that will have huge benefits in the future. The step towards componentization we did, which I include in that all the stuff about the dependency analysis and the layering and all of those things. Those are return benefits later.

Security, I mean, in that particular space (it’s my opinion that) it will be years before people appreciate the improvements that we did in Vista, because it’s not something super visible. And improvements in the code quality underneath the covers in Vista is a big step, but there’s a whole bunch more to do.
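(Concretely, the “who impacts what” query Allchin describes comes down to a transitive closure over reverse dependencies. A minimal sketch, assuming the manifests have already been boiled down to a component graph; the names are illustrative, not real manifest data.)

  from collections import defaultdict

  # Invented graph: component -> components it depends on.
  deps = {
      "kernel32":   set(),
      "shell32":    {"kernel32"},
      "start_menu": {"shell32"},
      "explorer":   {"shell32", "kernel32"},
  }

  # Invert the edges: component -> components that depend on it directly.
  rdeps = defaultdict(set)
  for component, wanted in deps.items():
      for dep in wanted:
          rdeps[dep].add(component)

  def impacted_by(component: str) -> set[str]:
      """Everything that directly or transitively depends on `component`,
      i.e. everything that needs re-testing after a fix to it."""
      seen: set[str] = set()
      stack = [component]
      while stack:
          for dependant in rdeps[stack.pop()]:
              if dependant not in seen:
                  seen.add(dependant)
                  stack.append(dependant)
      return seen

  print(impacted_by("shell32"))  # {'explorer', 'start_menu'} (order may vary)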

I built a dependency graph for Windows that could run on every build. Did that project get abandoned?

Oh. That’s what happened.

I’ve never worked at MS, but it sounds to me like he’s waaaay exaggerating the number of people with a voice in the feature.

For example, he’s got 3 testers in his primary group of 8. In my experience, testers like to sit in on design meetings so that they can get a head start designing their tests, but they don’t (and shouldn’t) actually contribute to the design/implementation of the feature. He’s also adding in a ton of managers who probably didn’t actually poke their oars in much – usually that stuff stops with the program manager.

I’ve worked on features that (if we included every person who was in any way connected with the feature/project) probably had 60 people involved, but we still managed to get done inside of a month. The thing is, most of those people are only tangentially involved.

Seems like part of this is the size of the bites that they’re taking. Maybe they should stop trying to rev the whole beast every six years and version it every year instead.

Microsoft definitely didn’t plan on it being six years…

That’s the last correct thing you said in your post regarding Microsoft, unfortunately. It’s not quite like any other place I’ve ever worked. Sometimes it works real well, sometimes it doesn’t.

Vista development was actually rather efficient – once they started working on the product that became Vista, as opposed to the first few years of “Longhorn.” The OS that’s about to ship came together in about two years. Google, er, Live Search “Longhorn reset” for more background. Yeah, there was quite a bit of time spent before that, but boy, was that a lesson learned and taken to heart.

He is exaggerating. There’s a PM who “owns” the feature and can make a final call. A good PM takes input from design, user assistance, the feature team (programmers), and so on, sure. But then the PM needs to make the call. And typically, this is what happens, although there are poor managers in every company. He may have been in a chaotic team.

The team size may be unusual or exaggerated; anyway I’d expect projects at big companies to require a lot of coordination with many people involved.

What surprised me more was a source code control system that takes months to propagate any code change to all other teams, and a level of interdependence that means teams cannot risk any significant changes because they would have to wait for this change propagation to complete.

However, according to the Allchin interview spacerat quoted, Microsoft is working to improve the situation, so I suppose we might see another Windows version in our lifetimes!

Actually, considering that QA are the people who use the final product, they are the first barrier between the software and the end user. Having QA around can quite often result in catching a potential UI problem long before it propagates.

There are about 200 people working on my project (a game), and at least half a day a week is still burned simply on coordination.

That’s because the quicker changes propagate, the less time testers have to find potential issues before a change gets sent out to everyone’s base. And with that many people, if someone introduces a random crash that blocks everyone using that code from working, a weekly propagation means you immediately flush a week of everyone’s work.

Propagating over months sounds like a bad idea, until you think of the alternative. Granted, I’m sure there are better ways (for example, an ‘unsafe’ code server which keeps the merged versions of everything), but honestly, I would not want to work on a project with a thousand developers who check directly into the core branch. It’s bad enough with 40 programmers; I can’t imagine it with 1000.

In companies where the software team is developing for hardware that the same company is building at the same time, coordination can be even more difficult. You can’t put your physical components into a differential changelog system like CVS. Thankfully, modern hardware runs software at its core, such as the BIOS, which can be version-controlled, and there are such things as EEPROMs for updating it.

But we’re talking about an operating system here, composed of many modules that (should) have clearly differentiated tasks and well-defined interfaces to the rest of the system. As long as the individual teams are programming to these interfaces, they shouldn’t be stopped dead by bugs in another team’s implementation. Every module should be run against mock systems and test suites before any changes are checked in, and if bugs do slip through, other teams can continue local development on the basis of these test suites.
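(A sketch of what that looks like in practice. All of the names here are invented for illustration; the point is just that the consumer is written and tested against the interface plus a fake, not against the other team’s latest binary.)

  from typing import Protocol
  import unittest

  class ShutdownService(Protocol):
      """Hypothetical interface owned by another team; only the contract matters."""
      def request_shutdown(self, force: bool) -> bool: ...

  class StartMenu:
      """Consumer code written purely against the interface."""
      def __init__(self, shutdown: ShutdownService) -> None:
          self._shutdown = shutdown

      def on_shutdown_clicked(self) -> str:
          ok = self._shutdown.request_shutdown(force=False)
          return "shutting down" if ok else "shutdown blocked"

  class FakeShutdownService:
      """Stand-in used until the real implementation is stable."""
      def __init__(self, will_succeed: bool) -> None:
          self.will_succeed = will_succeed

      def request_shutdown(self, force: bool) -> bool:
          return self.will_succeed

  class StartMenuTests(unittest.TestCase):
      def test_blocked_shutdown_is_reported(self) -> None:
          menu = StartMenu(FakeShutdownService(will_succeed=False))
          self.assertEqual(menu.on_shutdown_clicked(), "shutdown blocked")

  if __name__ == "__main__":
      unittest.main()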

But not without pain, if they’ve already synced to a broken version. Consider the case where one of the base modules has a strange crash triggered by its use of another module, and both of those modules are required to run the system. Sure, you can roll back, but that requires finding out when that specific crash was introduced, which involves spending time with other teams figuring it out, and ultimately it’s lost time.

You may also need a new feature added, and you can’t add it yourself because your group doesn’t have the authority, and the group that is responsible isn’t going to make it a high priority, and it has to propagate up to the root (or common parent branch) and then back down to you because integrating straight across could neglect other crucial dependencies, but you’re not ready to resync because your other changes will break Group B’s tests and they haven’t caught up yet and…

I have enough trouble with relatively tiny teams and medium-sized apps that I can barely conceive of the difficulty a place like MS must face.

And yet, an article I just read about Vista said that they were doing away with DLLs entirely.

They might have meant in name only, or as traditional chunks of Win32 code, as things are shifting towards .NET assemblies. They’re still .DLL files, but they are supposed to avoid some of the dependency problems, thanks to private copies and better versioning information.

They definitely wouldn’t be getting rid of the concept of shared libraries entirely, and there’s too much of a need for traditional DLLs for compatibility, at least.

“And yet, an article I just read about Vista said that they were doing away with DLLs entirely.”

I don’t think you read that correctly.