Test Plan Charlie Unplugged: An Interview with David Boyes
Meet David Boyes
What did you do at Telia? How was their application a good Linux and S/390 fit?
Telia is a perfect example of the infrastructure play. The piece of Telia that's interested in this is their ISP division. They have to look at provisioning and getting the customer online as quickly as possible. The other thing they're looking at is some really I/O intensive applications. They have to collect billing information from their dialup servers. That's multiple hundreds of megabytes per day. They've got Usenet feeds that have to handle hundreds of megabytes per day. That's something that plays very well in this structure. They're lucky in that most of their applications are either open source or they wrote the code themselves. They were easily ported. IBM provided a very scalable platform for those applications to run. The 390 vastly accelerated their ability to react to resource demands.
The problem with the 70 Suns [which Telia replaced] is that they could not move resources around. You ended up doing box replacements, or many boxes. Solaris is a good general-purpose solution but it's not optimized for anything. It's like a Swiss army knife. The combination of Linux and VM for Telia was interesting because they got to keep all their UNIX applications. They already knew how to run that; all they did was move them to a new container. What that container gave them was the ability to move resources around, and to have a much bigger container. They were able to start out with a very flexible platform and could move things around without having to go out and touch the physical machines every time.
Are there other companies doing similar things? Are you involved?
Yes. I can't discuss a lot of the details, but there are a lot of companies in the enterprise, telecom, and financial fields that are absolutely beating down our doors about this. Particularly in non-US areas where floor space is at a premium -- Japan is an example, where floor space is $475 a square foot. We're not talking about companies with a few dozen web servers, but companies with 30 or 40 thousand web servers.
If you actually do the math on what an NT server costs, the physical server is cheap but the support environment, management, and backups around it are very expensive.
We really have more work than we can do. But as competitive as the marketplace is now, none of these companies want to raise their kimonos. This is a seriously competitive tool. I've gotten calls from as high as Lou Gerstner's office wanting to know who these companies are, and I can't tell them. These are generally very secretive projects because you're messing with the bottom line, and they're playing for keeps.
People were amazed when you announced the results of Test Plan: Charlie, in which over forty thousand Linux images ran simultaneously on one mainframe. Can you explain just exactly what Test Plan: Charlie involved?
Test Plan: Charlie was originally a demonstration project. A customer, a telco, was looking to build an ISP service. They wanted to sell bandwidth. One of the services they wanted to do was a managed router project. These services tend to be everything down to the plug into your LAN. The telco sells you a server, a router, they preconfigure everything and manage it at their office where you don't have to see it. The telco was looking at the back end of this.
What they did was go to a consulting firm and pay them a large sum of money to get them an infrastructure to provide this service. They had to provide a pretty stringent service level, with guaranteed uptime. None of the versions of Linux and NT that I'm aware of provide the ability to control the resources that any given user uses. In the mainframe environment this is something that was solved years ago. When they went to the original consultant, they said, 'Give us a design.' The consultant gave them a pretty basic Sun design, with two machines for DNS and one machine for a more I/O-intensive application: two UE2s and one UE450, mostly because you needed bunches of disks. This is about $50K worth of hardware and they were going to replicate it for each customer. They projected 250 initial customers, so they would need 750 machines.
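The sizing arithmetic in that answer can be checked with a short sketch. The figures are the ones quoted in the interview. The function name `sizing` and the reading of "$50K" as the cost of one three-machine set per customer are assumptions made for illustration, and the result covers hardware only.

```python
# Figures quoted in the interview; names are illustrative, not from the source.
MACHINES_PER_CUSTOMER = 3            # two DNS machines plus one I/O-heavy machine
HARDWARE_COST_PER_CUSTOMER = 50_000  # "about $50K worth of hardware" per set (assumed per customer)
INITIAL_CUSTOMERS = 250              # projected initial customer count


def sizing(customers):
    """Return (machine count, hardware cost in dollars) for a customer count."""
    machines = customers * MACHINES_PER_CUSTOMER
    hardware_cost = customers * HARDWARE_COST_PER_CUSTOMER
    return machines, hardware_cost


machines, hardware_cost = sizing(INITIAL_CUSTOMERS)
print(f"{machines} machines, ${hardware_cost:,} in hardware")
```

Under these assumptions the day-one design comes to 750 machines and roughly $12.5 million in hardware alone, before any facilities, networking, or staffing costs.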
You also need rack space, and a building to put it in. You need network cables, routers, switches -- all this peripheral infrastructure. If you're held to service level agreements, you need some way to measure that. You need a way to quickly restore a system if you have a hardware failure, backups -- this all takes people.
They were looking at a price tag on the order of 50 to 60 million dollars. As everyone knows, the phone company has lots of money and can do this kind of system. We had done some other work for them, though, so they called us and said, 'Can you give us a second opinion?' We realized this was a configuration in which most of the servers were identical. What if we could do this in an environment of virtual servers?
This particular customer has a System/390 already in place. They use it to print about 300,000 or 400,000 bills a month, just for one region of their national network. They had a test partition available and were already VM customers. We proposed to them building a similar server "farm" using Linux, VM, and open source tools to provide the services to the customers on their 390 platform.
We did the initial study in terms of floor space and power consumption, assuming they bought a new [System/390] machine to do it. It came out around two to three million dollars. They were uncomfortable with committing to this kind of solution without actually seeing it work. We were uncomfortable too -- this was all pretty leading edge stuff.
The initial [project] was to build a working model of day one, with 250 customers. Over the next week, we built a lot of tools, did a lot of experimenting and crashed Linux about ninety billion times. The process was to build tools that let us create, and duplicate, Linux environments between virtual machines. We had to come up with what could be shared between instances. It's different from a development environment where people are screwing around with stuff. The only thing that had to be different was the actual data that you're working on.
We built the web infrastructure, and something to generate the load. They put some pretty serious constraints around it. It had to be a pretty real-world case. We did it in stages. The first was the initial 250 customers. Test Plan: Beta took this to two thousand, and then ten thousand customers, to see if this would work, and what were the boundaries on how far this thing could scale.
Test Plan: Charlie was just, 'Let's go for broke.' So we started duplicating the virtual servers. When we hit 41,400 servers, we saw a message that said, 'I can't allocate any more resources.' It didn't crash and burn, it stayed up. This is an insane load! This thing was paging its brains out. [It had only 128 meg of RAM.] I wouldn't recommend this configuration for the real world.
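To give a rough sense of why the system was paging so heavily, the sketch below divides the quoted memory by the quoted image count. It assumes that "128 meg" means 128 MiB of real memory shared by all 41,400 images. The interview does not state how memory was actually divided, so this is only an average and the function name is hypothetical.

```python
# Assumption: "128 meg" is read as 128 MiB of real memory shared by every image.
REAL_MEMORY_BYTES = 128 * 1024 * 1024
IMAGES = 41_400  # server count reached in Test Plan: Charlie


def real_memory_per_image(total_bytes, images):
    """Average real memory available per image, in bytes."""
    return total_bytes / images


per_image = real_memory_per_image(REAL_MEMORY_BYTES, IMAGES)
print(f"about {per_image / 1024:.1f} KiB of real memory per image on average")
```

On that reading each image averages only a few kilobytes of real memory, so nearly all of every image's working set would have to live in paging space.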