
Thursday, May 3, 2012

How many backup servers do I need?

This has got to be the worst question I can get asked, because I don't have a direct answer. I always tell people "It depends," which sounds like a cop-out, but it's true. It depends on a lot of factors, such as:

  • How much total data are you backing up?
  • How fast is your production storage at sending data to the backup server?
  • How fast is your media server/proxy server at processing the data it receives?
  • How fast is your network at handling the flow of data (unless you are doing LAN-free backups)?
  • How fast is your target storage for the backup files (for disk-to-disk backups)?
  • How fast is your network from the media server/proxy server to your target storage?
  • How much data are you backing up each night?
  • How often are you doing full backups?
  • Are you doing active fulls or synthetic fulls?
  • How big is your backup window?
  • Is it OK for your backups to overflow into production hours?
  • Do you need to back up using application agents?
  • Are you backing up physical or virtual machines?
  • If virtual, are you using an image-based backup solution to get the VM?
  • If virtual, are you using VMware technologies like Changed Block Tracking?

It all comes down to one question: how fast can your infrastructure go?
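To make the "it depends" concrete, here is a back-of-envelope sketch in Python. All of the throughput and data-size numbers are made-up examples, not measurements from any real environment; the point is only that the slowest link in the chain sets your backup time:

```python
# Rough backup-window estimate: the slowest component in the chain
# (source storage, network, proxy, target storage) sets the pace.
# All throughput numbers below are hypothetical examples.

def backup_hours(data_gb, throughputs_mb_s):
    """Hours to move data_gb through a chain of components,
    each with a sustained throughput in MB/s."""
    bottleneck = min(throughputs_mb_s)        # slowest link wins
    seconds = (data_gb * 1024) / bottleneck   # GB -> MB, then divide
    return seconds / 3600

# 4 TB nightly: source storage 400 MB/s, network ~110 MB/s (GbE),
# target storage 250 MB/s -- the network is the bottleneck here.
hours = backup_hours(4096, [400, 110, 250])   # ~10.6 hours
```

If that estimate doesn't fit inside your backup window, you also know which link to upgrade first.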
You can relate this question to one from your personal life: when someone asks "How long is it going to take you to get there?" you can give a rough estimate, but there are many things to consider. What is your rate of travel? What is the traffic like? Which streets are you going to take? Are you going by bike, car, bus, or on foot? All of this applies to your backups too: it depends on how you're going to get your data from point A to point B, and on how congested that route is.

Now it gets much easier when these questions are answered, and the more information you give us, the more detailed a response you'll get. But nothing beats good ol' testing. Test your environment. See how long it takes to back up one server; then see how long that server's incremental takes; then see how long it takes to back up two servers at the same time. In the IT world, running things in parallel is not a linear improvement. One backup might take 30 minutes, two backups might take 35 minutes, and three backups might take 45 minutes. There will definitely be a point of diminishing returns, and that is where testing pays off. You will reach a point where backing everything up at once takes longer than if you had backed off one or two servers and waited for the others to finish. Where that point lies depends entirely on your hardware; it is most likely not the software. Most backup software these days is built for multi-threading and uses resources as efficiently as possible, but you can't magically squeeze more out of your hardware than its capacity allows. Some products do offer compression and deduplication, which effectively let you send more data than others.
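Using the hypothetical timings above (30, 35, and 45 minutes for one, two, and three parallel jobs, plus a made-up fourth data point), a tiny sketch can show where parallelism stops paying off. These numbers are illustrative, not benchmarks; plug in your own test results:

```python
# Effective cost per job when n jobs run in parallel. The concurrency
# level with the lowest per-job time is the sweet spot before
# diminishing returns kick in.

def per_job_minutes(total_minutes):
    """total_minutes[i] = wall-clock minutes with i+1 concurrent jobs."""
    return [t / n for n, t in enumerate(total_minutes, start=1)]

def best_concurrency(total_minutes):
    costs = per_job_minutes(total_minutes)
    return costs.index(min(costs)) + 1

measured = [30, 35, 45, 70]        # minutes for 1, 2, 3, 4 parallel jobs
print(per_job_minutes(measured))   # [30.0, 17.5, 15.0, 17.5]
print(best_concurrency(measured))  # 3 jobs at a time is the sweet spot
```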

I hope you enjoyed my little lesson here. It is something I learned when I came to the dark side (more on this later).

VMware CPU Ready, What is it?

It seems like there are still a lot of people out there who don't really understand, or even know about, CPU Ready within a VMware environment. CPU Ready is a metric that VMware reports either as a percentage or in milliseconds; which value you get depends on where you look.

CPU Ready
CPU Ready is the time a virtual CPU is ready to run but is not being scheduled on a physical CPU. When the guest OS wants to run something on the processor, the hypervisor is saying, "OK, hang on until it's your turn." Think of the metered ramps when getting onto the highway during rush hour. This happens when you have too many vCPUs on your host.

A lot of new virtualization admins tend to think that just because you have available CPU resources (usage), you can throw more vCPUs or VMs onto the host. This is not always the case: even when the CPU isn't actively processing data, the hypervisor is still scheduling VMs that have work to do. Think of it sort of like round-robin, or an old token ring network: everybody gets a turn, and you have to wait for yours. If you add 5 VMs with 1 vCPU each, there are 5 vCPUs that need to be scheduled on the physical host. Add 5 more VMs with 1 vCPU each and you have 10 vCPUs that need to be scheduled. To make matters worse, start adding VMs with multiple vCPUs, let's say 3 VMs with 2 vCPUs each. Now not only do you have 16 vCPUs that need to be scheduled, you have 3 pairs of 2 that need to be scheduled at relatively the same time. And all of this happens even if all of the VMs are idle.
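A quick overcommit calculation makes that scheduling pressure visible. The VM counts here are the ones from the example above; the 8-core host is an assumption for illustration:

```python
# vCPU-to-physical-core overcommit ratio. The higher the ratio (and the
# wider the VMs), the more time vCPUs spend waiting to be scheduled.
# The 8 physical cores are an assumed host size, not from the post.

def overcommit_ratio(vm_vcpu_counts, physical_cores):
    return sum(vm_vcpu_counts) / physical_cores

vms = [1] * 10 + [2] * 3            # ten 1-vCPU VMs + three 2-vCPU VMs
ratio = overcommit_ratio(vms, 8)    # 16 vCPUs / 8 cores = 2.0
```

A 2:1 ratio can be fine for idle workloads, but remember the 2-vCPU VMs also need both of their vCPUs scheduled at relatively the same time.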

I mentioned that depending on where you look you will see different values for CPU Ready. In the vSphere client you will see milliseconds (ms). In esxtop you will see %RDY (percent ready). Here is a conversion chart showing the relationship (based on the real-time chart's 20-second sampling interval):

1% = 200ms (0.2 seconds)
5% = 1,000ms (1 second)
10% = 2,000ms (2 seconds)
50% = 10,000ms (10 seconds)
100% = 20,000ms (quit your job before you're fired)
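The whole chart falls out of one fact: esxtop's %RDY is a percentage of the real-time chart's 20-second (20,000 ms) sampling interval. A small helper (a sketch of the arithmetic, not any VMware API) does the conversion both ways:

```python
# Convert between esxtop %RDY and the vSphere client's CPU Ready
# milliseconds, assuming the default 20-second real-time sample.

SAMPLE_MS = 20_000  # 20-second real-time chart interval

def rdy_pct_to_ms(pct):
    return pct / 100 * SAMPLE_MS

def rdy_ms_to_pct(ms):
    return ms / SAMPLE_MS * 100

print(rdy_pct_to_ms(5))      # 1000.0 ms -> matches the chart
print(rdy_ms_to_pct(2000))   # 10.0 %
```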

How do I know to look for CPU Ready?
First off, you should always be looking at your CPU Ready times. I'm not saying to always have your vSphere client open on the CPU Ready graph, but you should be checking it often. CPU Ready is probably the easiest thing to notice when it is high: you will start to see lagging inside the VM. Users will notice it when trying to run things within the VM, or you'll be processing something in the VM and there isn't much CPU utilization when there should be.

I first got smacked with CPU Ready when I was managing a large Citrix XenApp farm running on vSphere 3.5 (it was still called Presentation Server at the time). My users were experiencing lag within their sessions, but CPU utilization wasn't very high. I had originally given my Citrix servers 2 vCPUs because I thought that, with so many users and so many threads, they would make use of both. Boy, was I wrong. I was getting high CPU Ready times because I was running about five 2-vCPU Citrix Presentation Servers on a host with two dual-core processors, so 10 vCPUs on 4 physical cores. That is not good. To make things worse, each of these VMs hosted about 10-15 Citrix users, which caused a ton of context switching (Citrix servers generate a lot of context switching by themselves; add virtualization and you multiply the problem), and on top of that the hypervisor had to schedule 2 vCPUs at relatively the same time. I know, it was crazy. My %RDY was about 8%-14% (1.6s-2.8s), which meant my users were not happy: when they clicked, nothing came up for a second or two. I ended up dropping my Citrix servers to 1 vCPU, my issues subsided, and I was able to keep the user count the same with fewer latency issues.

I hope this helps you understand CPU Ready and gets you looking for it, so you can head it off before it starts disrupting the end-user experience. Now I'm sure I'll get asked, "What is an optimal/acceptable CPU Ready time?" Well, that depends on your workload: if your workload runs fine at a sustained 5% CPU Ready, then 5% is acceptable for you. I would try to keep it under 3% if possible; that isn't always achievable, so keep it under 10% at most. Once you get above that, you start seeing the latency in the VMs.

I recommend you set up a custom alarm in your vSphere client to warn you of high CPU Ready times, or get something like Veeam Monitor, which already has a built-in alarm for CPU Ready with a warning at 10% and an error at 20%.
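If you build that custom alarm yourself, note that the CPU Ready counter in the vSphere client is reported in milliseconds, so you'll want the 10%/20% thresholds above translated into ms. A little sketch of the arithmetic (assuming the 20-second real-time sample), plus a rough traffic light based on the guidance above:

```python
# Translate %RDY alarm thresholds into the millisecond values the
# CPU Ready counter uses (20,000 ms real-time sample assumed).

def threshold_ms(pct, sample_ms=20_000):
    return int(pct / 100 * sample_ms)

warning_ms = threshold_ms(10)   # 2000 ms
error_ms = threshold_ms(20)     # 4000 ms

def classify(rdy_pct):
    """Rough traffic light per the guidance above: under 3% is fine,
    up to 10% is worth watching, above that users will feel it."""
    if rdy_pct <= 3:
        return "ok"
    if rdy_pct <= 10:
        return "watch"
    return "problem"
```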