Imagine waking up in the morning, arriving at the office, and finding no open tickets for production issues and no emails from management about cost increases or security issues. Instead, you have time for new projects, architecture, and strategy. Sounds good, right?
With the right approach and mindset for infrastructure management, you can actually get there.
Day 2 Challenges
When organizations reach “Day 2” in the cloud, meaning they have operated extensive environments for a few years, they encounter a distinct set of challenges. At this stage, the original architects may have moved on, leaving behind a complex network of systems.
In my conversations with DevOps leaders in such organizations, I hear similar issues: their teams are overstretched and have become bottlenecks because of the complexity of their cloud environments. As a result, other engineers can't safely make changes or provision new infrastructure, which hinders innovation.
Instead of leading initiatives like Generative AI, DevOps teams are preoccupied with infrastructure plumbing.
Moreover, your cloud environment is constantly evolving. New projects emerge, new team members require onboarding, and acquisitions can suddenly expand your infrastructure responsibilities. All of these add to the existing complexity.
Addressing these Day 2 challenges is essential for maintaining agility, encouraging innovation, and supporting business growth.
Proactive DevOps Strategy – A New Standard for Cloud Management
The good news is that there’s a solution for Day 2 challenges – I call it ‘Proactive DevOps Strategy’.
This isn’t just a methodology; it’s a philosophy or a state of mind we’re helping DevOps/SRE/Platform Engineering teams achieve.
Let’s start with the benefits of the ‘Proactive DevOps’ mindset:
- There are few to no surprises in production; you stop production issues well before they happen.
- Firefighting misconfigurations in production becomes a thing of the past.
- You’re no longer a bottleneck; everyone can suggest or request an infrastructure change.
- Inquiries from management about operational issues are significantly reduced.
- You have ample time for innovation, strategy, and architecture.
‘Proactive DevOps’ provides a new standard for managing your cloud infrastructure. It shapes the tools you give your team and makes management more efficient: you can manage more infrastructure in fewer hours.
Proactive DevOps Strategy – Building Blocks
To meet this new standard, there are 3 main building blocks:
- Desired Configuration for Your Cloud Infrastructure – The first thing needed is defining what resources should be running in your cloud and with which configuration.
Infrastructure-as-code (IaC) makes this tangible: the target is 99% IaC coverage for your cloud resources.
With a clear understanding of how your cloud should be configured, you’re better equipped to be in full control.
- Real-Time Identification of Deviations/Drifts – Now that you have the desired configuration for your cloud resources, it’s crucial to identify any deviations between the desired configuration and the actual state in real-time or near real-time.
Any deviation could result in a cost, security, or compliance incident. It also stops other engineers from making changes to the drifted resource, because you first need to resolve the ambiguity: is the correct state the one in the code, or the one that’s actually running?
Remediation of drifts months after occurrence requires significant time to investigate, track changes, and decide on the “correct” state. You shouldn’t wait months.
- Quality Gate to your production with Proactive Controls – The next building block is to define a single, unified “Quality Gate” that all changes to your production go through.
This means that each suggested change can be inspected before it is deployed. Because you manage your cloud with infra-as-code (the first building block), you can follow the same methodologies you already use for your application code.
So each change to the IaC in your Git repo can be inspected at the Pull/Merge Request level, and the engineer gets instant feedback on whether they can merge and apply the change.
This means shifting left your cloud policies.
This is also helpful for auditing, team transparency, and knowledge sharing.
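To make the second building block more concrete, here is a minimal Python sketch of drift detection: it compares a desired configuration (what the code declares) with the actual state (what is running) and reports the differences, along with a simple IaC coverage figure. The resource IDs, attribute names, and the dictionary shape are assumptions made for illustration; a real setup would load both sides from your IaC tool and your cloud provider.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical model for this sketch: resource ID -> attribute dictionary.
Config = dict[str, dict[str, Any]]


@dataclass
class Drift:
    resource_id: str
    kind: str      # "changed", "missing", or "unmanaged"
    details: str


def detect_drift(desired: Config, actual: Config) -> list[Drift]:
    """Compare the desired (code) configuration with the actual cloud state."""
    drifts: list[Drift] = []
    for rid, want in desired.items():
        have = actual.get(rid)
        if have is None:
            drifts.append(Drift(rid, "missing", "declared in code but not running"))
            continue
        # Only attributes declared in code are compared, so provider-computed
        # attributes in the actual state do not show up as noise.
        for key, value in want.items():
            if have.get(key) != value:
                drifts.append(
                    Drift(rid, "changed", f"{key}: code={value!r}, cloud={have.get(key)!r}")
                )
    for rid in actual:
        if rid not in desired:
            drifts.append(Drift(rid, "unmanaged", "running but not declared in code"))
    return drifts


def iac_coverage(desired: Config, actual: Config) -> float:
    """Share of running resources that are declared in code (0.0 to 1.0)."""
    if not actual:
        return 1.0
    managed = sum(1 for rid in actual if rid in desired)
    return managed / len(actual)


if __name__ == "__main__":
    # Illustrative data only.
    desired = {
        "s3/logs": {"acl": "private", "versioning": True},
        "rds/orders": {"instance_class": "db.t3.medium"},
    }
    actual = {
        "s3/logs": {"acl": "public-read", "versioning": True},
        "ec2/tmp-debug": {"instance_type": "t3.large"},
    }
    for drift in detect_drift(desired, actual):
        print(f"[{drift.kind}] {drift.resource_id}: {drift.details}")
    print(f"IaC coverage: {iac_coverage(desired, actual):.0%}")
```

Running a comparison like this on a schedule, or on cloud change events, is one way to get closer to the near real-time detection described above; the sketch leaves that scheduling out.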
Let’s discuss shift-left cloud policies. Shift-left is a term used mostly for security, but why not use it for all cloud policies like tags, cost, naming conventions, allowed regions, and more?
By shifting left your cloud policies, you can prevent most of your production issues.
Let’s get back to the engineer who is getting instant feedback when they create a pull request on their Terraform code. They understand immediately whether they can merge and apply it or if they need to make corrections first.
This empowers any engineer, even on their first day on the job, to suggest changes to your cloud environment without fear of mistakes.
Consider the amount of time you can save on code reviews with this approach. Until your policy engine approves the PR, there’s no need to involve a colleague in a code review. This effectively eliminates a major bottleneck in infrastructure delivery.
Here are some examples of proactive policies:
- Requiring a specific tag on all resources
- Capping the cost of a specific environment at $5K per month
- Prohibiting the deletion of RDS instances
- Preventing the creation of public S3 buckets
- Allowing resources to be created only from pre-approved modules
- Allowing resources to be created only in specific regions (e.g., for GDPR)
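As a rough illustration of how policies like these could be checked at the pull request level, here is a minimal Python sketch of a policy check over a set of planned changes. Everything specific in it is an assumption for the example: the `Change` shape, the `owner` tag, the region and module names, and the cost-delta field. In practice the changes would be derived from the plan output of your IaC tool, and many teams would use a dedicated policy engine rather than hand-written checks.

```python
import sys
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical policy settings for this sketch.
REQUIRED_TAG = "owner"
ALLOWED_REGIONS = {"eu-west-1", "eu-central-1"}
APPROVED_MODULES = {"modules/network", "modules/storage", "modules/compute"}
MONTHLY_BUDGET_USD = 5000.0


@dataclass
class Change:
    """One planned resource change (assumed shape, not a real tool's format)."""
    action: str                      # "create", "update", or "delete"
    resource_type: str               # e.g. "s3_bucket", "rds_instance"
    name: str
    region: str
    tags: dict = field(default_factory=dict)
    public: bool = False
    module: Optional[str] = None
    monthly_cost_delta_usd: float = 0.0  # estimated change in monthly cost


def check_change(change: Change) -> list[str]:
    """Return the policy violations for a single planned change."""
    label = f"{change.resource_type}.{change.name}"
    violations: list[str] = []
    if change.action == "delete":
        if change.resource_type == "rds_instance":
            violations.append(f"{label}: deleting RDS instances is prohibited")
        return violations  # the remaining rules apply to created/updated resources
    if REQUIRED_TAG not in change.tags:
        violations.append(f"{label}: missing required tag '{REQUIRED_TAG}'")
    if change.resource_type == "s3_bucket" and change.public:
        violations.append(f"{label}: public S3 buckets are not allowed")
    if change.region not in ALLOWED_REGIONS:
        violations.append(f"{label}: region '{change.region}' is not allowed")
    if change.action == "create" and change.module not in APPROVED_MODULES:
        violations.append(f"{label}: not created from a pre-approved module")
    return violations


def evaluate(changes: list[Change], current_monthly_cost_usd: float = 0.0) -> list[str]:
    """Check every change, then check the projected environment cost."""
    violations = [v for change in changes for v in check_change(change)]
    projected = current_monthly_cost_usd + sum(c.monthly_cost_delta_usd for c in changes)
    if projected > MONTHLY_BUDGET_USD:
        violations.append(
            f"projected monthly cost ${projected:,.0f} exceeds ${MONTHLY_BUDGET_USD:,.0f}"
        )
    return violations


if __name__ == "__main__":
    # Illustrative changes only.
    planned = [
        Change("create", "s3_bucket", "reports", "us-east-1", public=True,
               monthly_cost_delta_usd=20.0),
        Change("delete", "rds_instance", "orders", "eu-west-1"),
    ]
    problems = evaluate(planned, current_monthly_cost_usd=4800.0)
    for problem in problems:
        print("POLICY VIOLATION:", problem)
    # A non-zero exit code lets a CI job block the merge.
    sys.exit(1 if problems else 0)
```

Because the script exits non-zero when any violation is found, it could serve as a required check on the pull request, which is what gives the engineer the instant pass-or-fix feedback.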
Moreover, consider the difference between handling issues reactively versus proactively. In a reactive model, a wrong change to production triggers alerting systems, necessitating investigation, rollback, identifying the mistake’s origin, and conducting enablement/education sessions.
This process wastes a tremendous amount of time that could be saved by just giving the engineer instant feedback before making the change.
The Impact
The Impact of Proactive DevOps? A significant reduction in production issues, a DevOps team that’s up to 30% more efficient (according to our customers), and 100% control over your cloud.
The time your team saves on infrastructure fire drills can be invested in new initiatives, like GenAI, cost optimization, and exploring new technologies that will help your business grow.
Want to set a new standard in managing your cloud and adopt a proactive mindset?
Let’s talk about transforming your DevOps strategy from reactive firefighting to proactive innovation.