When the report came back showing that the company’s website downtime was translating into thousands in lost revenue per day, along with furious clients and low customer satisfaction ratings, the CTO’s hair went grey overnight. This was a global company, and 24/7 support and response were a top priority for the business.
From the company’s point of view, it had taken a common-sense approach by trying to solve the issue in-house. It quickly became apparent to the CTO, however, that his team was struggling to diagnose the problems for the following reasons:
- The team was already stretched by normal business operations.
- Off-hours support was limited when problems arose on the application.
- The company lacked troubleshooting expertise.
- The IT team was stuck in the detail and found it a challenge to conceptualize the overall problems or assess the real impact from many disparate logs.
The CTO realised that he needed a fresh pair of eyes to diagnose the problem and support his team, and that’s when we got a call to see if we could make sense of what on earth was causing the outages.
Of course, that’s easier said than done! All aspects of the web app need to be examined, ranging from infrastructure problems to application glitches, and it can take some time.
In many of these scenarios, we see certain similarities when assessing the causes of downtime within an application. These include:
- Lack of automation of failure or error events
- Existing automation that misses critical failure / error events or generates many false positives
- Lack of overall visibility of the health of the environment / system
- Time-consuming process of collecting and analysing log / event data
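To make the first two points more concrete, here is a minimal Python sketch of automated error detection that only alerts on a burst of errors, which is one simple way to limit false positives. The log format, the function names and the threshold values are all assumptions made for illustration; they are not taken from the client's environment.

```python
from collections import deque
from datetime import datetime, timedelta


# Assumed log format (hypothetical): "<ISO timestamp> <LEVEL> <message>"
def parse_line(line):
    """Split one log line into (timestamp, level, message)."""
    stamp, level, message = line.split(" ", 2)
    return datetime.fromisoformat(stamp), level, message


def find_error_bursts(lines, threshold=3, window=timedelta(minutes=1)):
    """Return timestamps at which `threshold` ERROR lines fell within `window`.

    A single isolated error does not raise an alert.
    """
    recent = deque()  # timestamps of recent ERROR lines
    alerts = []
    for line in lines:
        stamp, level, _ = parse_line(line)
        if level != "ERROR":
            continue
        recent.append(stamp)
        # Drop errors that have aged out of the window.
        while stamp - recent[0] > window:
            recent.popleft()
        if len(recent) >= threshold:
            alerts.append(stamp)
            recent.clear()  # start fresh so one burst yields one alert
    return alerts


sample = [
    "2020-01-01T10:00:00 INFO request served",
    "2020-01-01T10:00:05 ERROR upstream timeout",
    "2020-01-01T10:05:00 ERROR upstream timeout",
    "2020-01-01T10:05:10 ERROR upstream timeout",
    "2020-01-01T10:05:20 ERROR database unavailable",
]
alerts = find_error_bursts(sample)
```

In this sample the lone error at 10:00:05 is ignored, while the three errors between 10:05:00 and 10:05:20 produce a single alert. In practice the threshold and window would need tuning against real traffic.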
Lucrodyne approaches Web Application challenges with the following steps:
- Full assessment of logging and event management for the current environment
- Presentation / working session explaining the latest best practices
- Implementation of robust / scalable logging and event monitoring solution
- Training and support for effective use of the new system and an improved SLA
In this particular case, the downtime was being caused by insufficient time to carry out the necessary deep-level log analysis, resulting in system mitigations based on partial information or guesswork.
To address this cause, we incorporated the following software to support the application:
ElastiCache (Redis) – an extremely fast and robust pub/sub model that decouples log data shipping from log data ingestion.
Elasticsearch Service – a fully managed Elasticsearch cluster, eliminating the need for configuration, patching, backup, recovery, etc.
Kibana – an AWS-managed service that facilitates visualization and ultra-fast search of log data.
Filebeat – a flexible open-source agent that can be configured to extract all or any subset of logs and ship them to the AWS service.
Logstash – an efficient system for the collection and transmission of log data.
CloudWatch – a fully managed service that collects and stores AWS service metrics and provides rules and triggers for event management.
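The value of putting a buffer between log shipping and log ingestion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory deque stands in for the Redis buffer, a plain list stands in for the search index, and the `ship` and `ingest` functions are hypothetical names, not the Filebeat, Logstash or Redis APIs.

```python
import json
from collections import deque

buffer = deque()  # stand-in for the Redis buffer between shipper and ingester
index = []        # stand-in for the search index


def ship(raw_line, source):
    """Shipper side: wrap a raw log line as JSON and push it to the buffer."""
    buffer.append(json.dumps({"source": source, "message": raw_line}))


def ingest(batch_size=100):
    """Ingestion side: pull up to `batch_size` events, tag them, index them."""
    taken = 0
    while buffer and taken < batch_size:
        event = json.loads(buffer.popleft())
        # Naive level tagging, purely for illustration.
        event["level"] = "ERROR" if "ERROR" in event["message"] else "INFO"
        index.append(event)
        taken += 1
    return taken


# The shipper keeps accepting logs even while ingestion is not running.
ship("ERROR payment gateway timeout", source="web-01")
ship("GET /health 200", source="web-02")
queued = len(buffer)

# When ingestion catches up, it drains the backlog.
ingested = ingest()
errors = [event for event in index if event["level"] == "ERROR"]
```

Because the shipper only ever writes to the buffer, a slow or restarting ingestion stage delays indexing but does not block the servers producing the logs.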
This type of solution can assist clients for whom web app downtime costs time, money and reputation, whether that means customers being unable to order, staff being unable to work on client files, or a company whose SLAs are mandated by compliance requirements.
Chief Technology Officer, Lucrodyne
Tel: 905 882 0701 x 221