Does Web Application Downtime Cut into Your Bottom Line?
Last year, an internal report at one company revealed that regular website downtime was costing thousands of dollars per day in lost revenue, along with furious clients and low customer satisfaction ratings. The CTO’s hair went grey overnight. This is a global company for which reliable 24/7 support is a top business priority.
From the company’s point of view it had taken the common-sense approach of trying to solve the issue in-house. It quickly became apparent to the CTO, however, that his team was struggling to diagnose problems due to the following:
- The team was already stretched by normal business operations.
- Off-hours support was limited when problems arose in the application.
- There was a lack of troubleshooting expertise within the company.
- The IT team was stuck in the detail, so it couldn’t conceptualize the overall problem or assess the real impact from many disparate logs.
The CTO realised that he needed a fresh pair of eyes to support his team and that’s when we got a call to see if we could make sense of what was causing the downtime.
All aspects of the Web App needed to be examined, from infrastructure problems to application glitches.
In many of these scenarios, we see certain similarities when assessing the causes of downtime within an application. These similarities include:
- Lack of automation of failure or error events
- Existing automation that misses critical failure / error events or raises many false positives
- Lack of overall visibility of the health of the environment / system
- Time-consuming process of collecting and analysing log / event data
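To make the false-positive point concrete, here is a minimal sketch of one common way to reduce alert noise: only raise an alert when several errors cluster inside a short time window, rather than on every single error. The function name `should_alert`, the 60-second window and the threshold of 5 are illustrative assumptions, not values taken from this engagement.

```python
from collections import deque


def should_alert(error_times, window_seconds=60, threshold=5):
    """Return True if `threshold` or more errors fall inside any time window.

    `error_times` holds error timestamps in seconds. The default window and
    threshold are hypothetical and would need tuning per application.
    """
    recent = deque()
    for t in sorted(error_times):
        recent.append(t)
        # Drop errors that fall outside the window ending at time t.
        while t - recent[0] >= window_seconds:
            recent.popleft()
        if len(recent) >= threshold:
            return True
    return False


# Five errors within 40 seconds trigger an alert; five spread over
# several minutes do not.
burst = should_alert([0, 10, 20, 30, 40])
spread = should_alert([0, 100, 200, 300, 400])
```

A rule like this trades a small delay in detection for far fewer spurious alerts, which is usually the right trade when an on-call team is already stretched.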
We approach Web Application challenges with the following steps:
- Full assessment of logging and event management for the current environment
- Presentation / working session explaining the latest best practices
- Implementation of robust / scalable logging and event monitoring solution
- Training and support for effective use of the new system and an improved SLA
In this particular case, the downtime stemmed from insufficient time to carry out the necessary in-depth log analysis, which resulted in system mitigations based on partial information or guesswork.
To address this cause, we incorporated the following solutions to support their Web Applications:
- ElastiCache (Redis) – an extremely fast and robust pub/sub model that decouples log data shipping from log data ingestion.
- Elasticsearch Service – a fully managed ES cluster, eliminating the need for configuration / patching / backup / recovery etc.
- Kibana – provided with the managed AWS Elasticsearch Service, it facilitates the visualization and ultra-fast search of log data.
- Filebeat – a flexible open source agent that can be configured to extract all or any subset of logs and ship them to the AWS services.
- Logstash – an efficient system for the collection and transmission of log data.
- CloudWatch – a fully managed service that collects and stores AWS service metrics and provides rules and triggers for event management.
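The value of placing a buffer between shipping and ingestion can be shown with a small, self-contained sketch. It uses Python's in-process `queue.Queue` purely as a stand-in for the Redis buffer, and plain functions as stand-ins for the shipper and ingestion roles (assumed here to be Filebeat and Logstash respectively). All names, including `ship`, `ingest` and `run_pipeline`, are hypothetical, and nothing below reflects the actual production configuration.

```python
import json
import queue
import threading


def ship(buffer, lines, source):
    """Shipper role: push raw log lines into the buffer and return at once."""
    for line in lines:
        buffer.put(json.dumps({"source": source, "message": line}))


def ingest(buffer, index, stop):
    """Ingestion role: drain the buffer at its own pace and enrich events."""
    while not (stop.is_set() and buffer.empty()):
        try:
            raw = buffer.get(timeout=0.1)
        except queue.Empty:
            continue
        event = json.loads(raw)
        event["level"] = "ERROR" if "ERROR" in event["message"] else "INFO"
        index.append(event)  # stand-in for indexing into the search cluster
        buffer.task_done()


def run_pipeline(lines, source="web-01"):
    """Run one shipper and one ingester joined only by the buffer."""
    buffer = queue.Queue()  # stand-in for the Redis buffer
    index, stop = [], threading.Event()
    worker = threading.Thread(target=ingest, args=(buffer, index, stop))
    worker.start()
    ship(buffer, lines, source)
    buffer.join()  # wait until every shipped event has been processed
    stop.set()
    worker.join()
    return index


events = run_pipeline(["GET /cart 200", "ERROR payment timeout"])
```

Because the shipper only ever talks to the buffer, a slow or restarting ingestion layer does not block the application servers, and bursts of log traffic are absorbed rather than dropped.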
This type of solution can help companies for which web app downtime carries high costs in time, money and reputation, whether that means clients unable to order, staff unable to work on client files, or SLAs mandated by compliance requirements.
Feel free to contact us and discover for yourself how our experienced team of local Cloud Architects can help you improve your company’s infrastructure.
Follow us for the latest cloud news and job opportunities