In an always-on world, companies look to systems and processes to keep their services up and running at all times. The most important part of maintaining this uptime is having an Incident Management process in place to restore your services in the event of an interruption or unplanned downtime.
Incident Management processes are typically used to respond to incidents that affect services and work on restoring their uptime.
Incidents that disrupt services are unavoidable. But every breakdown is an opportunity to learn & improve.
Every infrastructure incident helps product development & engineering teams better understand the capabilities of the system architecture. That understanding, in turn, helps us build a more sustainable and reliable product.
Incidents are identified through reports from monitoring systems or by manual identification. Once an incident is identified, it is logged. An incident log can be used to validate that all incidents have been addressed and to identify trends. At this point, the incident is categorised by adding information such as severity, functional area, and ownership.
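As a rough illustration of logging and categorisation, here is a minimal Python sketch of an incident log. The `Incident` and `IncidentLog` classes, their field names, and the sample severity labels are all hypothetical, chosen only to mirror the steps described above; a real setup would normally rely on a ticketing or incident management tool.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Incident:
    """One logged incident (hypothetical structure for illustration)."""
    title: str
    source: str                          # "monitoring" or "manual"
    severity: str = "unclassified"       # filled in during categorisation
    functional_area: str = "unknown"
    owner: str = "unassigned"
    resolved: bool = False
    logged_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class IncidentLog:
    """In-memory incident log; a stand-in for a real tracking system."""

    def __init__(self):
        self.incidents = []

    def log(self, incident):
        # Record the incident as soon as it is identified.
        self.incidents.append(incident)
        return incident

    def categorise(self, incident, severity, functional_area, owner):
        # Attach severity, functional area, and ownership.
        incident.severity = severity
        incident.functional_area = functional_area
        incident.owner = owner

    def unaddressed(self):
        # Validate that every incident has been addressed.
        return [i for i in self.incidents if not i.resolved]

    def trend_by_area(self):
        # Count incidents per functional area to surface trends.
        return Counter(i.functional_area for i in self.incidents)


if __name__ == "__main__":
    log = IncidentLog()
    spike = log.log(Incident("Checkout latency spike", source="monitoring"))
    login = log.log(Incident("Login errors reported by support", source="manual"))
    # "SEV2" and "SEV3" are placeholder labels, not the levels defined in this article.
    log.categorise(spike, "SEV2", "payments", "payments-oncall")
    log.categorise(login, "SEV3", "auth", "auth-oncall")
    spike.resolved = True
    print([i.title for i in log.unaddressed()])
    print(log.trend_by_area())
```

Keeping unresolved incidents queryable is what lets the log double as a checklist, while the per-area counts give a simple starting point for spotting trends.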
This stage is about notifying the right people to address an incident. It may also involve assigning tasks and performing escalation procedures. In a high-pressure situation, how easily you can communicate can make or break customer perception and, ultimately, the impact on your bottom line.
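One way to picture an escalation procedure is as an ordered list of contacts, each brought in once the incident has gone unacknowledged for a given time. The sketch below assumes exactly that model; the `EscalationStep` type, the contact names, and the 15 and 30 minute thresholds are hypothetical and would differ per team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    contact: str
    after_minutes: int  # minutes the incident has gone unacknowledged


# Hypothetical policy: names and thresholds are placeholders.
POLICY = [
    EscalationStep("primary-oncall", 0),
    EscalationStep("secondary-oncall", 15),
    EscalationStep("engineering-manager", 30),
]


def contacts_to_notify(policy, minutes_unacknowledged):
    """Return every contact whose threshold has been reached."""
    return [
        step.contact
        for step in policy
        if step.after_minutes <= minutes_unacknowledged
    ]


if __name__ == "__main__":
    for elapsed in (0, 20, 45):
        print(elapsed, contacts_to_notify(POLICY, elapsed))
```

Writing the policy down as data rather than tribal knowledge means nobody has to decide whom to page in the middle of an incident.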
Based on the business impact of the incident, we assign one of the following severity levels.
Once you establish the impact of the incident, confirm or adjust its severity and communicate that severity to the team.
The roles and responsibilities during an incident are: