I worked recently on a project to improve the performance and stability of a set of engineering applications after migration to a new datacentre and computing platform. We had excellent data produced by the application centre business analysts, showing in detail that applications were significantly slower than before, across a wide range of transactions. On average, transactions were taking 25% longer (let's say). Someone set the objective that we would not be satisfied until 90% of transactions were within the benchmark figure for each transaction.
On the face of it this was going to be difficult, because we knew that there would always be variability, and the new target effectively outlawed variation. We did not know the previous variability: if the benchmark transaction times had only been met, say, 70% of the time previously, then there was no reason to expect them to be met 90% of the time now.
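To make that concrete: the target is really a statement about a percentile of the distribution of times, not about typical speed. A minimal sketch, with invented figures and assuming a 10-second benchmark for one transaction:

```python
# Invented sample of transaction times (seconds) against an assumed
# 10-second benchmark. The share of runs within the benchmark is a
# property of the whole distribution, including its tail.
benchmark = 10.0
times = [8.2, 9.1, 7.5, 12.4, 9.8, 8.8, 11.0, 9.5, 25.3, 9.0]

within = sum(t <= benchmark for t in times) / len(times)
print(f"{within:.0%} of runs met the benchmark")  # 70% here
```

If the old system behaved like this sample, a 90% target is a claim about the tail of the distribution, which no amount of improvement to the typical case will satisfy on its own.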
The first and obvious variable was the user site. We found that if we excluded the sites with known poor networks, or those sites with a much higher incidence of poor results (because that is how we knew they had a poor network), then the number of transactions outside the benchmark dropped significantly. But it was still well over 10%. Clearly the site and the network did not account for all the poor performance.
The second obvious factor was the performance of the computing platform (Citrix XenDesktop). We could not tell whether a poor test result coincided with generally poor performance for other users of the platform at that precise time, but the general feeling was that the platform must have periods of poor performance. So the number of virtual machines was increased, the number of users per virtual machine reduced, and in some cases the number of vCPUs per virtual machine increased. It made no difference. There continued to be a significant number of transactions outside the benchmark times.
One of the issues for us was that we could not reproduce the problem on demand. The analysts had all experienced a bad transaction, but it was not repeatable, so we knew we were looking for erratic rather than predictable results. When we looked at the test data again, we found that the average time across repeated instances of the same transaction was very misleading. The median was in fact well below the benchmark transaction time. Most people were experiencing good performance most of the time, but some people were experiencing poor performance some of the time, and the measurements at the times of poor performance were so extreme that they dragged the averages up.
The example I think of is taking a train to work. It normally takes 30 minutes. Four times out of five the train runs on time, but the fifth time it is cancelled: you wait 20 minutes for the next train, which also runs more slowly, taking 40 minutes. It is not useful to say that the journey takes 36 minutes on average. You would not be on time for work more often if you allowed 36 minutes. The real conclusion is that the service is unreliable, which is quite a different thing.
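The arithmetic behind the train example can be checked in a few lines of Python; the journey times are the hypothetical figures from the example, not measured data:

```python
import statistics

# Four on-time journeys of 30 minutes, and one bad day: a 20-minute
# wait plus a slower 40-minute journey, totalling 60 minutes.
journeys = [30, 30, 30, 30, 60]

mean_time = statistics.mean(journeys)      # 36.0 -- the misleading figure
median_time = statistics.median(journeys)  # 30 -- the typical experience

print(f"mean: {mean_time}, median: {median_time}")
```

The mean sits at a value no journey ever takes; the median describes what actually happens four days out of five.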
So we plotted the actual times in a scatter graph, and it was immediately clear that the real problem was not performance but reliability. We also calculated the standard deviation, as a direct measure of variability, and it told us the same thing.
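A minimal sketch of that calculation, using invented times for two hypothetical transactions, one steady and one erratic. The medians are almost identical; the standard deviation is what separates them:

```python
import statistics

# Invented transaction times in seconds. Both transactions look the
# same if you only report the median; the erratic one has occasional
# extreme outliers, which the standard deviation exposes.
steady  = [5.1, 5.0, 5.2, 4.9, 5.0, 5.1, 5.0, 4.9]
erratic = [5.0, 5.1, 4.9, 5.0, 31.2, 5.1, 4.8, 28.7]

for name, sample in (("steady", steady), ("erratic", erratic)):
    print(name,
          "median:", statistics.median(sample),
          "stdev:", round(statistics.stdev(sample), 2))
```

On a scatter graph the erratic sample shows as a tight band with a few points far above it, which is exactly the picture we saw.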
We decided that, instead of looking at the things that affect performance (vCPU, vRAM, disk latency, network latency), we would look at the things that affect reliability. We started by analysing each transaction with Sysinternals Process Monitor and Wireshark, to understand exactly where the time was going. The results were a revelation. We found a set of causes that we would not have guessed existed:
- A benchmark transaction had been exported from the old system without its version history. The transaction attempted to validate the version number by checking prior versions before giving up and running.
- An export to Excel failed if Excel was already open in the background, and continued to fail silently until the user ran it with Excel closed.
- A transaction called an external module signed with a certificate from the vendor. The transaction attempted to check the certificate for revocation. If the user had an invalid proxy server configuration, there was a delay before a timeout expired and it continued. If the transaction was run a second time there was no check and it was fast.
- On logon, the application searched various non-existent locations for a user configuration. After around 20 seconds it found a configuration and began.
- Running a transaction for the first time caused the data to be cached locally. The second time it ran from cache and was fast, so the recorded time depended on whether it was the first run or a later one.
- A report wrote to an Excel file at a network location. The data was transferred to the remote file in very small packets, taking a long time. Another report wrote to a local file, which was then copied to the remote destination, and completed in a fraction of the time.
The conclusions? It is important to look at the data statistically, to see whether the problem is about performance or reliability; and you need to understand the makeup of a transaction to know what may cause it to take longer than expected.