Achieving Operational Excellence in AWS

Mar 07, 2021

Category: AWS Well-Architected Framework

Operational excellence is one of the pillars of the AWS Well-Architected Framework. It suggests ways to achieve operational excellence (OpEx) at various levels of an organization. OpEx essentially means running your operations in the most efficient way, whether they are application operations or platform operations.

Every organization defines its business objectives. To support them, applications are developed and deployed on infrastructure. It would not be wrong to say that business productivity directly depends on how well the IT workloads are managed. OpEx is about efficiently managing the applications and the platforms they run on, along with the organizational and people aspects. Below are the 5 design principles of OpEx described in the AWS Well-Architected Framework.

Automate everything - Application code is managed using tools like version control systems, build tools, and tests. The idea here is to handle operations the same way, as code. Build your infrastructure in the form of code and write operational tasks as automation scripts.
Make frequent, small, and reversible changes - Doing this enables teams to deliver beneficial patches, fixes, or upgrades into production earlier. Early delivery of desired features directly benefits customers. Small, reversible changes are easier to roll back if something fails, which limits the impact of a bad update.
Refine operational procedures frequently - Having operational procedures in place is not enough. They should be tested frequently to make sure they work when they are needed. OpEx suggests running mock drills of operational procedures to prepare the people involved for real events.
Anticipate failure - We should think of anything that can go wrong and be prepared for it. Simulate various failure scenarios to learn from them and build handlers.
Learn from operational failures - Any failed operational activity should be documented, analyzed, and resolved before it happens again. The lessons learned should be shared across teams and the organization.
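
To make "operations as code" and "small, reversible changes" a little more concrete, here is a minimal Python sketch of a runbook in which every step carries its own rollback. The names `Step` and `run_runbook` and the `config` dictionary are hypothetical and used only for illustration; in a real setup the steps would call your infrastructure tooling instead of editing a dictionary.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One small, reversible operational change."""
    name: str
    apply: Callable[[], None]
    rollback: Callable[[], None]


def run_runbook(steps: List[Step]) -> bool:
    """Apply steps in order; on failure, undo the completed ones in reverse."""
    done: List[Step] = []
    for step in steps:
        try:
            step.apply()
            done.append(step)
        except Exception as exc:
            print(f"step '{step.name}' failed: {exc}; rolling back")
            for finished in reversed(done):
                finished.rollback()
            return False
    return True


# Hypothetical example: a dict stands in for real infrastructure state.
config = {"replicas": 2}


def scale_up() -> None:
    config["replicas"] = 4


def scale_down() -> None:
    config["replicas"] = 2


def bad_change() -> None:
    raise RuntimeError("health check failed")


succeeded = run_runbook([
    Step("scale up", scale_up, scale_down),
    Step("risky change", bad_change, lambda: None),
])
print(succeeded, config)
```

Because the second step fails, the first one is rolled back and the configuration ends up where it started. Keeping each step small is what makes the rollback this simple.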

OpEx is composed of 4 areas - organization, preparation, operation, and evolution. Every part of the organization contributes to operational excellence - management, development teams, and operations teams. Management defines the business objectives and prioritizes them. Everybody in the organization should be made aware of these objectives so that efforts are aligned and focused on the right areas.

The choice of operational model also influences operational excellence. The whitepaper describes some of the models as below:

Fully separated operational model - where each activity is handled by a different team: application development, application operations, platform development, and platform operations.
Separated Application engineering and operations (AEO) and Infrastructure engineering and operations (IEO) with centralized governance - where application teams both build and operate their applications, and infrastructure teams both build and operate the platform.
Separated AEO and IEO with centralized governance with Service Provider - when the organization does not have the required expertise in platform engineering and operations, the efforts are outsourced to third parties.
Separated AEO and IEO with decentralized governance.

Teams coordinate efforts using processes deployed on IT service management systems in the form of tickets or tasks. Processes like incident management, change management, and problem management are used. The processes should be fine-tuned so that independently working units can collaborate with low friction.

Continual improvement of the operational processes depends on how we prepare our infrastructure. We should implement various logging and monitoring methods to gain insights into the technical, operational, and procedural workings. This is also known as designing telemetry. AWS services like CloudWatch and CloudTrail can be used to log and analyze various types of information.
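
As one possible illustration of telemetry, the sketch below builds a log line in the CloudWatch embedded metric format (EMF), a JSON structure from which CloudWatch can extract metrics when the line reaches CloudWatch Logs. The helper `emf_record`, the namespace `Demo/App`, the `Service` dimension, and the `Latency` metric are all assumptions made for this example. It only prints the line; shipping it to CloudWatch Logs is left out.

```python
import json
import time


def emf_record(namespace, service, metric, value, unit="Milliseconds"):
    """Build one log line in CloudWatch embedded metric format (EMF)."""
    return json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [["Service"]],
                "Metrics": [{"Name": metric, "Unit": unit}],
            }],
        },
        "Service": service,  # value of the "Service" dimension
        metric: value,       # value of the metric declared above
    })


# Hypothetical names, for illustration only.
line = emf_record("Demo/App", "checkout", "Latency", 123)
print(line)
```

Emitting metrics as structured log lines keeps the application code free of direct API calls for every datapoint, which is one reason this style is popular for telemetry.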

User access, system performance, API calls, network bandwidth, and dependencies are some of the areas telemetry can focus on. Logs generated from these areas help identify weak spots, improve the overall flow of operations, and mitigate deployment risks. Keeping all this in perspective gives us a view of our operational readiness.

Once you get the ball rolling, i.e., once operations begin, we should monitor the health of the workloads. Alarms should be set to flag threshold breaches in system health and performance, as well as any anomalies.
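
The idea of alarming on a threshold breach can be sketched in a few lines of Python. This is a local illustration that loosely mirrors the "threshold plus evaluation periods" idea behind a CloudWatch alarm; it is not how the service itself is implemented. The function name `in_alarm` and the sample CPU values are made up.

```python
from typing import Sequence


def in_alarm(datapoints: Sequence[float], threshold: float, periods: int) -> bool:
    """True when the most recent `periods` datapoints all exceed `threshold`."""
    if periods <= 0 or len(datapoints) < periods:
        return False  # not enough data to decide
    return all(value > threshold for value in datapoints[-periods:])


# Sustained breach: the last three readings are all above 90%.
cpu_percent = [42.0, 55.0, 91.0, 93.0, 95.0]
breached = in_alarm(cpu_percent, threshold=90.0, periods=3)

# A single spike followed by recovery should not trigger the alarm.
transient = in_alarm([42.0, 95.0, 50.0], threshold=90.0, periods=2)

print(breached, transient)
```

Requiring several consecutive breaches is a simple way to avoid waking someone up for a short spike.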

Operational excellence is all about driving efficiency in operational procedures by automating everything under the hood, be it infrastructure automation or the scripts that are part of runbooks and playbooks.

