The call went something like this.
“Hi, I’d like to log a call, please.”
“Sure, how can I help?”
“Well, <insert application name here> is running slow and it’s been like that for a while.”
“OK, what is it that you’re trying to do?”
“Just my normal work, you know, the usual activity.”
“OK, are you getting any error messages?”
“No, it’s just slow. It does this a lot.”
“Oh, really? You mean all the time or just recently? Any particular time of the day?”
“Hard to say really. It was OK for most of last week, then it was slow for a while on Thursday and it was like it again on Monday. It’s starting to get really bad and that’s why I’m calling.”
“Is anyone else having the same problems?”
“Not sure. Probably. Most of the time we just get on with it but I know people have been frustrated and some say it’s really slowing them down in their work.”
“Do you know if any other calls have been logged?”
“No, but it’s getting really frustrating now.”
“Can you tell me exactly what you were doing leading up to the problems?”
“No, truly I can’t.”
Of course, this type of scenario never happens because everyone knows precisely what they were doing before they have a problem. They also know the exact time of the occurrence, the frequency and all of the details relative to their previous problems, right?
No. Absolutely wrong. The business users are trying to do what’s in the best interests of the business – their job. They just want an application that works as it’s supposed to work.
So, where do we go with this problem? Well, a few options come to mind.
1 – Problem diagnosis as usual
We could probably spend a lot of time further diagnosing the root cause of this issue: who else is affected (possibly canvassing users), their locations, their connectivity, the specific times of day that are problematic, what application activity is underway at the time, what other applications are competing for network resources, and so on. We will probably get there in the end this way, although it’s likely to involve a lot of time and resources, a little head-scratching and perhaps some finger-pointing.
If we have some tools that can do this for us, then great. That said, if we had a great Application Performance Management (APM) tool, we would’ve been on top of the issue before the user called. Besides, this method looks too long, so perhaps we should skip to option 2.
2 – What’s on the network?
“The network’s slow today.” Ever heard that? No, me neither!! Whilst we know in our hearts that “the network” is rarely the root cause of a problem, it’s diligent (and often an eye opener) to take a good look, perhaps by way of ruling it out, but also as an opportunity to streamline traffic.
Of course, you may not know the entire scope of what’s on your network or what may be contributing to any issues. Often we need to validate that providers have adhered to requirements, gain some insight into available capacity, or assess any other constraints that could impact the user experience with key applications.
Here, an exercise to assess the network can be useful to pull out factors such as which users and applications are generating traffic, and any anomalous conditions such as packet loss, slow server responses, data corruption, etc. The iTrinegy Network Profiler devices are especially helpful in this regard, since they install very quickly and immediately begin to report.
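To make the idea concrete, here is a minimal sketch of the kind of per-server summary such an assessment produces. The field names, input format and the 200 ms "slow" threshold are illustrative assumptions for this example; they are not tied to any particular profiler’s output.

```python
# Hypothetical sketch: summarise per-server response times from capture data.
# The (server, response_ms) input format and SLOW_MS threshold are
# illustrative assumptions, not any specific tool's format.
from collections import defaultdict
from statistics import mean

SLOW_MS = 200  # assumed threshold for flagging a "slow" server response

def summarise(samples):
    """samples: iterable of (server, response_ms) pairs from a capture."""
    by_server = defaultdict(list)
    for server, response_ms in samples:
        by_server[server].append(response_ms)
    report = {}
    for server, times in by_server.items():
        report[server] = {
            "count": len(times),                         # responses seen
            "avg_ms": round(mean(times), 1),             # average latency
            "slow": sum(1 for t in times if t > SLOW_MS) # anomalies to chase
        }
    return report

if __name__ == "__main__":
    capture = [("app-db", 40), ("app-db", 550), ("web-01", 35), ("web-01", 30)]
    for server, stats in summarise(capture).items():
        print(server, stats)
```

Even a rough cut like this quickly surfaces which servers are responding slowly and how often, which is exactly the conversation-starter you want before anyone blames "the network".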
3 – The fast option
We could determine the root cause of the issue by using Application Profiling, a technique specifically designed to rapidly capture and examine the network packets generated by the affected users’ PCs at the time of the problem. Lightweight agents can be distributed across the infrastructure as needed. If the problem is intermittent, we can set these to unattended capture mode, and when the issue resurfaces we can roll back the time and analyse the relevant slice of data. This will tell us, in a very clean visual format, where the extended time was spent (client, network, servers, etc.) and why. You can see more about application profiling here. This will find the answer to the current issue; ideally, however, we need to move to a proactive stance in performance management.
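The "unattended capture, then roll back the time" idea can be sketched as a bounded ring buffer of timestamped events: old data ages out automatically, and when the problem resurfaces you extract just the slice around the reported time. This is an illustrative sketch of the concept, not the product’s actual mechanism; the class and method names are assumptions.

```python
# Illustrative sketch of unattended capture with rollback: a bounded ring
# buffer of timestamped events plus time-slice extraction. Names are
# hypothetical; real capture agents work on packets, not strings.
from collections import deque

class RollingCapture:
    def __init__(self, max_events=100_000):
        # deque with maxlen drops the oldest events automatically,
        # so memory stays bounded during long unattended runs
        self.buffer = deque(maxlen=max_events)

    def record(self, timestamp, event):
        self.buffer.append((timestamp, event))

    def slice(self, start, end):
        """Return the events captured in [start, end] for analysis."""
        return [(t, e) for t, e in self.buffer if start <= t <= end]

if __name__ == "__main__":
    cap = RollingCapture(max_events=5)
    for t in range(10):
        cap.record(t, f"packet-{t}")
    # Only the five most recent events remain; pull the slice around t=7.
    print(cap.slice(6, 8))
```

The design choice worth noting is the bounded buffer: capture can run indefinitely without filling a disk, at the cost that you must analyse a problem window before it ages out.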
4 – Taking control
Ideally we should be looking for an APM solution that will tell us about the issues before the users are affected, remembering that users can be internal or external (our customers) and so the impact of a severe issue can be felt all round. APM can mean many things to many people, however in this context we’re looking at measuring performance where it truly matters, at the User Experience. After all, that’s where the calls come from and where business can be won or lost.
A good APM solution should automatically discover your applications and your underlying architecture. In today’s world systems change frequently, traffic patterns and routes fluctuate and infrastructure and applications are spun up and down as needed. As a business we don’t want to have to tell a tool where everything is or when anything changes. One of the best (and coincidentally the market leader in its field) is Dynatrace.
One of its key strengths is that Dynatrace captures all users’ experiences (good and bad). This way we know what “good” looks like and can correlate it to business metrics. For example, in a B2C environment you may want to know that you do more business with customers who have X experience with you. Or, when monitoring internal users, you could distinguish the failing sessions from the positive ones and so have evidence of a real problem rather than a perceived one.
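One widely used way to turn raw session timings into a single "good vs bad" experience figure is the Apdex score: given a target response time T, sessions at or under T count as satisfied, those under 4T as tolerating, and the rest as frustrated, with Apdex = (satisfied + tolerating/2) / total. The sketch below assumes a 0.5 s target purely for illustration; it is not how any particular APM product computes its metrics internally.

```python
# Apdex sketch: score a set of session response times against a target.
# Apdex = (satisfied + tolerating / 2) / total, where
#   satisfied:  response <= T
#   tolerating: T < response <= 4T
# The 0.5 s default target is an illustrative assumption.

def apdex(response_times_s, target_s=0.5):
    satisfied = sum(1 for t in response_times_s if t <= target_s)
    tolerating = sum(1 for t in response_times_s
                     if target_s < t <= 4 * target_s)
    return (satisfied + tolerating / 2) / len(response_times_s)

if __name__ == "__main__":
    sessions = [0.3, 0.4, 1.2, 0.6, 5.0]  # observed response times (seconds)
    print(round(apdex(sessions), 2))      # -> 0.6
```

A score like this gives you the "evidence of a real problem rather than a perceived one" the paragraph above describes: a drop from 0.95 to 0.6 is hard data, not a hunch.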
5 – Consider performance early
We have to acknowledge the role of testing here since some key areas should be covered early in the application release cycle in the quest for a positive user experience.
We’ve all heard the tales. From ambulance dispatch systems to websites for the electronic submission of tax returns, systems fail as they scale and experience peak demands. Good practice is to start the development or upgrade of a system with its performance clearly in mind, particularly in regard to scalability testing, volume testing and load testing.
To create this performance testing focus, it’s helpful to research and quantify the target data and transaction volumes; identify the locations and likely connectivity of your users; recognise any external services and limitations; consider the test environment; and, importantly, allow sufficient time for testing … and re-testing. Further reading on performance testing can be found here.
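At its simplest, a load test means firing concurrent requests at the system and watching latency as volume grows. This is a minimal sketch of that idea: `call_app` is a hypothetical stand-in for a real request (in practice you would swap in an HTTP call and a proper tool), and the worker/request counts are illustrative.

```python
# Minimal load-test sketch: run N requests across a pool of workers and
# report observed latencies. call_app is a hypothetical placeholder for a
# real application request; counts are illustrative assumptions.
import time
from concurrent.futures import ThreadPoolExecutor

def call_app():
    start = time.perf_counter()
    time.sleep(0.01)  # placeholder for real application work
    return time.perf_counter() - start

def load_test(workers=20, requests=100):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(lambda _: call_app(), range(requests)))
    return {"max_s": max(latencies),
            "avg_s": sum(latencies) / len(latencies)}

if __name__ == "__main__":
    print(load_test())
```

Running a sketch like this at increasing worker counts, against a realistic environment and data volumes, is what surfaces the scaling failures the tales above describe, before the peak day does it for you.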
When dealing with an ongoing performance problem, consider the impact. Commercially, this may include revenue, investigative support time and resources (internal and external), etc. Personally, it may involve staff morale, customer experience, retention and brand image. Weigh these against the cost of delivering a short-term fix and of implementing a long-term solution. Remember: the earlier in a project you start to think about performance, the less pain you may feel further down the road. Contact us if you’d like to know more about how to rapidly resolve performance issues and create a strategy of problem prevention.
Author: David Collins is the Operations Manager at Verify Solutions.