Role Purpose & Context
Role Summary
The Global Cloud Operations Assistant Manager is responsible for defining and executing the operational strategy for a major part of our cloud infrastructure. You'll lead a team of talented engineers, making sure our systems are reliable, secure, and performant, all while keeping a keen eye on the budget. This role sits right at the heart of our technical operations, translating high-level business goals into concrete, actionable cloud strategies and ensuring your team has everything they need to deliver. When this role is done well, our services are rock-solid, our costs are optimised, and our engineers are growing. When it's not, we're looking at outages, spiralling cloud bills, and a frustrated team. The challenge is balancing aggressive growth with unwavering stability and cost control. The reward is seeing your team thrive and knowing you're directly responsible for the foundational reliability of our entire business.
Reporting Structure
- Reports to: Director, Global Cloud Operations
- Direct reports: Roughly 10-25 individuals, including some team leads or junior managers.
- Matrix relationships: Principal Cloud Engineer, Head of Cloud Operations, Cloud Infrastructure Lead (Manager)
Key Stakeholders
Internal:
- SVP of Engineering
- Executive Peers (e.g., Head of Product, Head of Security)
- Finance Leadership
- Other Cloud Operations Managers
- Engineering Leads across various product teams
External:
- Key Cloud Vendor Account Managers (AWS, Azure, GCP)
- Third-party service providers (e.g., Datadog, PagerDuty)
- Industry bodies and compliance auditors
Organisational Impact
Scope: This role directly shapes our organisational strategy and capability within cloud operations. You'll be accountable for the operational health, cost efficiency, and resilience of a significant business unit, influencing how we build and run services across the entire company. Your decisions here genuinely impact our ability to deliver on customer promises and maintain our market position.
Performance Metrics
Quantitative Metrics
- Metric: Cloud Spend Variance
- Desc: Keeping our cloud costs within the agreed forecast, identifying opportunities to save money without compromising service quality.
- Target: Within 3% of the forecasted budget for your owned services.
- Freq: Monthly and Quarterly
- Example: If the Q3 budget for your services was £1.5M, you'd aim for actual spend between £1.455M and £1.545M. This means proactively rightsizing, managing reserved instances, and cleaning up orphaned resources.
- Metric: P1 Incident Reduction
- Desc: Reducing the number of critical, customer-impacting incidents that affect the services your team is responsible for.
- Target: Decrease the number of P1 incidents by 20% year-on-year for your owned services.
- Freq: Quarterly
- Example: If your services had 10 P1 incidents last year, we'd expect to see no more than 8 this year, showing real improvements in system resilience and proactive problem solving.
- Metric: Team Performance & Development
- Desc: The health and growth of your team, measured by retention and internal progression.
- Target: Maintain a team attrition rate below 10% annually and promote at least two engineers within your team each year.
- Freq: Annually
- Example: If you have a team of 15, you'd aim to lose no more than one person in a year, and you'd be actively working to get two of your engineers ready for their next career step, whether that's Senior Engineer or Lead.
- Metric: Mean Time to Recovery (MTTR)
- Desc: How quickly your team restores service after a critical incident, reflecting their efficiency and preparation.
- Target: Reduce average MTTR for P1/P2 incidents by 15% year-on-year.
- Freq: Quarterly
- Example: If the average time to get a P1 service back up was 60 minutes last year, you'd be pushing to get that down to 51 minutes this year through better runbooks, automation, and incident response training.
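The worked examples above all reduce to simple arithmetic. As a minimal sketch (the function names are illustrative and not part of any internal tooling), the targets could be checked like this:

```python
def spend_within_tolerance(forecast: float, actual: float, tolerance: float = 0.03) -> bool:
    """True when actual spend is within +/- tolerance of the forecast."""
    return abs(actual - forecast) <= tolerance * forecast


def reduction_ceiling(baseline: float, reduction: float) -> float:
    """Highest value that still meets a year-on-year reduction target."""
    return baseline * (1 - reduction)


def attrition_rate(leavers: int, headcount: int) -> float:
    """Share of the team that left during the year."""
    return leavers / headcount


# Figures taken from the examples above.
print(spend_within_tolerance(1_500_000, 1_540_000))  # is spend inside the 3% band?
print(round(reduction_ceiling(10, 0.20), 2))         # P1 incidents allowed this year
print(round(reduction_ceiling(60, 0.15), 2))         # MTTR target in minutes
print(attrition_rate(1, 15) < 0.10)                  # one leaver from a team of 15
print(attrition_rate(2, 15) < 0.10)                  # two leavers from a team of 15
```

With a team of 15, a second leaver takes attrition to roughly 13%, which is why the example caps it at one person.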
Qualitative Metrics
- Metric: Strategic Influence & Collaboration
- Desc: Your ability to shape the broader cloud strategy and work effectively with other departments.
- Evidence: You're regularly invited to high-level planning meetings with Product and Engineering leadership. Your input is actively sought when major architectural decisions are being made. Other teams see your group as a partner, not just a service desk. You're able to get buy-in for your team's initiatives, even when they require effort from others.
- Metric: Operational Excellence Culture
- Desc: Fostering a culture of continuous improvement, blameless post-mortems, and proactive problem-solving within your team.
- Evidence: Your team consistently conducts thorough, blameless post-mortems that lead to concrete action items. They're actively proposing and implementing automation to reduce 'toil'. There's a clear focus on root cause analysis rather than just quick fixes. They're sharing knowledge and best practices internally and, where appropriate, externally.
- Metric: Talent Development & Mentorship
- Desc: How effectively you're growing the skills and capabilities of your direct reports.
- Evidence: Your team members feel supported and challenged. They have clear development plans and are making measurable progress. You're regularly providing constructive feedback and creating opportunities for them to take on more responsibility. You're seen as a trusted advisor and mentor within the team and the wider organisation.
Primary Traits
- Trait: Calm Under Pressure (and leading others to be)
- Manifestation: When a P1 incident hits at 2 AM, you're the voice of reason on 'the bridge'. You're not just calm yourself, you're actively guiding your team to stay focused, follow the runbook, and communicate clearly. You can take the heat from a C-suite executive without losing your head, and then translate that pressure into clear, actionable steps for your team. You're the eye of the storm.
- Benefit: When systems are down and money is being lost, panic spreads like wildfire. As a manager, your composure sets the tone for the entire incident response. If you freak out, your team will too, and that's when mistakes happen. We need someone who can lead a team through chaos, ensuring they execute methodically and restore service as quickly as possible, even when the pressure is immense.
- Trait: Strategic Process Architect
- Manifestation: You don't just follow processes; you design them. You're always thinking about how to make our incident response, change management, or deployment pipelines more robust, efficient, and scalable. You love a well-defined SOP, but you're also constantly looking for ways to automate parts of it or improve its effectiveness. You're the person who sees the bigger picture of how all our operational processes fit together.
- Benefit: At this level, we're not just reacting; we're building the systems that prevent future problems. Your ability to think systematically about processes and then implement them across a team or department is crucial for reducing human error, improving reliability, and scaling our operations. Without this, we're constantly firefighting instead of building for the future.
- Trait: Unwavering Reliability & Accountability
- Manifestation: When you say something will get done, it gets done. Period. You're the person who follows up, closes the loop, and ensures commitments are met, both by yourself and your team. You take full ownership of your domain's operational health, meaning you're the first to step up when things go wrong and the one who ensures lessons are learned. Your word is your bond, and you expect the same from your team.
- Benefit: Our entire business relies on the cloud operations team to keep everything running. As a manager, you're the ultimate guarantor of that reliability for your area. If you're not consistently reliable, that erodes trust across the organisation, impacts customer satisfaction, and ultimately hits the bottom line. We need someone who lives and breathes accountability, and instils it in their team.
Supporting Traits
- Trait: Meticulous Strategist
- Desc: You're able to zoom out to the 10,000-foot view to define strategy, but also zoom in to spot the critical detail in a Terraform module or a post-mortem report that could make or break a solution. You appreciate precision at all levels.
- Trait: Constructively Skeptical
- Desc: You don't just accept data or proposals at face value. You ask probing questions, challenge assumptions, and push for the 'why' behind every decision, especially when it comes to operational changes or new technologies. This helps us avoid costly mistakes and build more robust solutions.
- Trait: Empathetic Collaborator
- Desc: You understand that your team's success depends on strong relationships with Engineering, Product, and Security. You actively seek to understand their perspectives, remove blockers for your team, and build bridges, not walls. You're a leader who genuinely cares about your people and their growth.
- Trait: Change Agent
- Desc: You're not content with the status quo. You're always looking for ways to improve processes, introduce new technologies, or optimise our cloud spend. You're comfortable leading your team through change and helping them adapt to new ways of working.
Primary Motivators
- Motivator: Building High-Performing Teams
- Daily: You'll spend a good chunk of your week coaching, mentoring, and developing your direct reports. This means regular 1:1s, helping them navigate tricky technical problems, providing career guidance, and celebrating their successes. You'll also be involved in hiring and onboarding new talent, shaping the future of your team.
- Motivator: Strategic Impact & Ownership
- Daily: You'll be defining the roadmap for your operational domain, deciding which automation projects get prioritised, how we approach cloud cost optimisation, and what our incident response strategy looks like. You'll own the operational health of a significant part of our cloud estate, and your decisions will have real, measurable impact on the business.
- Motivator: Solving Complex Organisational Challenges
- Daily: This isn't just about technical problems; it's about people, processes, and politics. You'll be navigating trade-offs between cost and reliability, convincing different teams to adopt new operational standards, and figuring out how to scale our operations efficiently as the business grows. It's often messy, but deeply rewarding.
Potential Demotivators
Honestly, this isn't a role for everyone. You'll find yourself in endless meetings, often trying to get different departments to agree on what 'done' actually means. You'll be responsible for a significant budget, which means constant pressure to justify every pound spent, even when it's for essential reliability. You'll still get pulled into critical incidents, even if you're not on-call, because ultimately, the buck stops with you. You'll also have to deal with the frustrations of 'shadow IT' – where other teams spin up resources without telling anyone, leaving you to clean up the mess and explain the unexpected bill.
Common Frustrations
- The constant tension between cost-cutting pressure and the business's demand for higher uptime and performance.
- Dealing with legacy systems and processes that are hard to change, even when everyone agrees they're inefficient.
- The 'blame game' during incidents, where you have to prove the infrastructure is stable before other teams will look at their code.
- Managing underperforming team members, which is never easy, but essential for team health.
- The sheer volume of administrative tasks that come with managing a team (HR processes, budgeting, reporting).
What Role Doesn't Offer
- A purely hands-on technical role; your time will be split between technical leadership, people management, and strategic planning.
- A quiet, predictable 9-to-5 job; incidents don't care about your schedule, and strategic challenges are always evolving.
- The chance to avoid difficult conversations; you'll need to provide tough feedback, negotiate with senior leaders, and sometimes deliver bad news.
ADHD Positives
- The fast-paced nature of incident response and the constant need to switch contexts can be highly engaging and stimulating, preventing boredom.
- The drive to automate 'toil' and find novel solutions to operational problems aligns well with a preference for innovation and efficiency.
- The high-stakes environment of critical incidents can provide intense focus, leading to rapid problem-solving and effective leadership under pressure.
ADHD Challenges and Accommodations
- The sheer volume of administrative tasks, reporting, and long-term strategic planning can be challenging; we can help by providing tools for task management, breaking down large projects into smaller, manageable chunks, and offering executive assistant support for certain administrative duties.
- Maintaining focus during lengthy, less stimulating meetings might be difficult; we encourage active participation, note-taking tools, and short breaks.
- The need for meticulous documentation and process adherence, while critical, might require structured templates and regular check-ins to ensure consistency.
Dyslexia Positives
- Strong spatial reasoning and pattern recognition skills are invaluable for understanding complex cloud architectures and identifying anomalies in monitoring dashboards.
- Excellent problem-solving abilities, often approaching issues from unique angles, can lead to innovative solutions for operational challenges.
- The ability to see the 'big picture' and connect disparate pieces of information is crucial for strategic planning and incident root cause analysis.
Dyslexia Challenges and Accommodations
- Extensive reading and writing of documentation, reports, and emails might be time-consuming; we offer tools like Grammarly, text-to-speech software, and templates for common documents.
- Ensuring accuracy in written communication and code (e.g., IaC files) is vital; we use robust code review processes, automated linting, and encourage verbal communication for complex explanations.
- Reliance on visual aids, diagrams, and verbal explanations in meetings can be beneficial; we prioritise these communication methods.
Autism Positives
- A deep appreciation for logical systems, processes, and structured environments is a huge asset in cloud operations, where consistency is key.
- Exceptional focus on detail can lead to identifying subtle issues in configurations or monitoring data that others might miss, preventing outages.
- The ability to maintain calm and methodical execution during high-stress incidents, focusing on facts and procedures, is highly valued.
- A preference for clear, direct communication can cut through ambiguity, which is essential in incident management and strategic discussions.
Autism Challenges and Accommodations
- Navigating complex social dynamics, office politics, and ambiguous interpersonal communication in a management role can be demanding; we offer clear communication guidelines, direct feedback, and support for understanding team dynamics.
- Unexpected changes to plans or urgent, context-switching demands might be challenging; we aim for transparency in planning, provide advance notice where possible, and support structured transitions between tasks.
- Sensory overload during intense incident 'bridge' calls (multiple people talking, flashing alerts) can be difficult; we encourage using noise-cancelling headphones, offer quiet spaces for focused work, and use structured communication protocols during incidents.
Sensory Considerations
Our operations centre can be a busy, sometimes noisy environment during major incidents, with multiple screens, flashing alerts, and concurrent conversations. However, for day-to-day work, we offer flexible seating options, quiet zones, and the option to work from home on certain days to manage sensory input. Meetings are typically held in dedicated rooms, and we encourage the use of headphones for focused work.
Flexibility Notes
We understand that everyone works differently. We're committed to providing a flexible work environment where possible, including hybrid working options, adjustable schedules, and the tools you need to thrive. Let's chat about what works best for you.
Key Responsibilities
Responsibilities by Experience Level
- Level: Global Cloud Operations Assistant Manager (Level 5)
- Responsibilities: Set the vision and strategy for your cloud operations domain, aligning it with broader business objectives and the Director's overall plan. This means looking 12-24 months ahead, not just reacting to today's problems.
- Build and lead a high-performing team of Cloud Operations Engineers, which involves everything from hiring and onboarding to performance reviews, career development, and sometimes, difficult conversations. You'll be a coach, a mentor, and a shield for your team.
- Own the P&L for your operational function, managing a budget of roughly £500K-£2M. This means making tough decisions about where to invest, where to cut costs, and how to get the most bang for our buck in the cloud.
- Drive the transformation of our cloud operations, constantly looking for ways to automate 'toil', improve our incident response capabilities, and enhance our overall system resilience. This isn't about incremental changes; it's about step-changes.
- Act as the primary point of contact and escalation for major incidents within your domain, leading the response, ensuring clear communication to senior leadership, and overseeing thorough, blameless post-mortems.
- Represent the organisation externally in discussions with key cloud vendors and industry bodies, influencing their roadmaps where possible and ensuring we're getting the best value and support.
- Define and enforce cloud governance policies, including tagging strategies, security best practices, and compliance requirements, working closely with Security and Compliance teams. You'll make sure we're doing things by the book, and that the book makes sense.
- Supervision: You'll be largely self-directed, with quarterly objectives set in alignment with the Director. We trust you to manage your team and your domain autonomously, only stepping in for strategic alignment or major escalations. Think of it as owning your own mini-business unit.
- Decision: You'll have full authority for your function, including budget allocation up to £2M, all hiring and firing decisions within your team, and vendor selection up to £100K. You'll consult with the Director on major organisational design changes or external commitments above £100K. Board-level decisions, naturally, will require alignment with the Director and CEO.
- Success: Your success will be measured by the stability and cost-efficiency of your cloud domain, the growth and retention of your team, and your ability to drive strategic initiatives that improve our overall operational posture. Ultimately, it's about delivering reliable, cost-effective cloud services that enable the business to thrive.
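The spending limits in the Decision entry above can be read as a simple routing rule. The sketch below is illustrative only: the `Approver` enum and `approver_for` function are hypothetical names, and sending budget decisions above £2M to the Director is an assumption, since the text only states the limit itself.

```python
from enum import Enum


class Approver(Enum):
    MANAGER = "manager decides"
    DIRECTOR = "consult the Director"


# Limits quoted in the Decision entry, in pounds.
LIMITS_GBP = {
    "vendor": 100_000,    # vendor selection / external commitments
    "budget": 2_000_000,  # budget allocation within the function
}


def approver_for(kind: str, amount_gbp: int) -> Approver:
    """Return who signs off a decision of the given kind and size."""
    if kind not in LIMITS_GBP:
        raise ValueError(f"unknown decision kind: {kind!r}")
    # Assumption: anything above the stated limit goes to the Director.
    if amount_gbp <= LIMITS_GBP[kind]:
        return Approver.MANAGER
    return Approver.DIRECTOR


print(approver_for("vendor", 80_000).value)
print(approver_for("vendor", 150_000).value)
```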
Decision-Making Authority
- Type: Cloud Architecture Changes
- Entry: Escalate all proposed changes to a Senior Engineer or Lead.
- Mid: Propose minor architectural changes within existing patterns; consult with Senior Engineers.
- Senior: Design and implement significant architectural changes within a workstream; inform Lead Engineer.
- Type: Budget Allocation (Cloud Spend)
- Entry: No authority; report any unexpected spend to supervisor.
- Mid: Identify cost-saving opportunities; propose changes to manager.
- Senior: Recommend budget optimisations for specific services (£5K-£10K impact); get manager approval.
- Type: Team Hiring & Performance
- Entry: No involvement beyond providing feedback on team culture.
- Mid: Participate in interview panels for junior roles; provide peer feedback.
- Senior: Interview candidates for junior and mid-level roles; mentor new hires.
Tool: Automated Root Cause Analysis
Benefit: AI features like Datadog Watchdog or Dynatrace Davis can sift through mountains of logs, metrics, and traces during an incident. Instead of your team spending hours manually correlating data, the tooling surfaces the most probable root causes for engineers to confirm. This means faster resolution, less 'toil' for your engineers, and quicker service restoration for our customers. You'll get to focus on the strategic fix, not the initial detective work.
Tool: Predictive Anomaly Detection & Cost Optimisation
Benefit: AI models can analyse performance trends and identify subtle anomalies – like a slow memory leak or creeping disk usage – that predict future outages *before* they even trigger a hard alert. For you, this means fewer P1 incidents to manage. On the FinOps side, AI can spot underutilised resources or inefficient spend patterns, giving you actionable insights to keep your cloud budget in check without manual auditing. It's about being proactive, not reactive.
Tool: IaC & Script Generation (for your team)
Benefit: While you're leading strategy, your team can use AI copilots like GitHub Copilot or Amazon CodeWhisperer (now part of Amazon Q Developer) to generate boilerplate Terraform configurations, Ansible playbooks, or Python automation scripts. Paired with code review, this can speed up development, cut down on routine errors, and free your engineers to tackle more complex, interesting problems. You'll set the standards, and AI will help your team meet them faster.
Tool: Incident Report & Post-mortem Drafting
Benefit: After a major incident, the last thing anyone wants to do is write a lengthy report. With AI, you can feed an LLM the incident timeline, relevant Slack conversations, and technical logs. It can then generate a comprehensive first draft of both the customer-facing incident report and the internal blameless post-mortem. You and your team can then review, refine, and ensure accuracy, saving hours of painful documentation work per incident.
Weekly time savings potential: Your team could collectively save 15-25 hours weekly, allowing them to focus on strategic projects and automation initiatives, rather than repetitive tasks. For you, it means more time for leadership and less for firefighting.
Typical tool investment: Expect to use 3-5 core AI-powered tools and platforms, integrated into our existing cloud operations ecosystem.
Competency Requirements
Foundation Skills (Transferable)
Beyond the technical wizardry, we need someone who can lead, communicate, and navigate the complex landscape of a growing organisation. These are the bedrock skills that will make you a truly effective manager.
- Category: Strategic Leadership & Vision
- Skills: Organisational Design: Structuring teams and processes for optimal efficiency and scalability.
- Vision Setting: Translating business goals into a clear, actionable operational strategy for your domain.
- Change Management: Guiding your team and stakeholders through significant shifts in technology or process.
- Delegation & Empowerment: Trusting your team with responsibility and giving them the autonomy to deliver.
- Category: Communication & Influence
- Skills: Executive Presence: Articulating complex technical concepts clearly and concisely to non-technical senior leaders.
- Negotiation & Conflict Resolution: Mediating disagreements between teams or individuals to find common ground and move forward.
- Team Communication: Fostering open, honest, and constructive dialogue within your team and across departments.
- Stakeholder Management: Building strong relationships with internal and external partners, understanding their needs, and managing expectations.
- Category: Problem Solving & Decision Making
- Skills: Complex Problem Solving: Tackling ambiguous, multi-faceted operational challenges with no clear-cut answers.
- Risk Management: Identifying, assessing, and mitigating operational risks across your domain.
- Data-Driven Decision Making: Using metrics and insights to inform strategic choices and justify investments.
- Crisis Management: Leading effectively during high-pressure incidents, making critical decisions under extreme time constraints.
- Category: Personal Effectiveness & Development
- Skills: Mentorship & Coaching: Developing the skills and careers of your direct reports.
- Time Management & Prioritisation: Effectively managing your own workload and helping your team prioritise theirs.
- Resilience & Adaptability: Bouncing back from setbacks and adapting your strategy as business needs evolve.
- Continuous Learning: Staying abreast of industry trends, new technologies, and leadership best practices.
Functional Skills (Role-Specific Technical)
You'll need a deep, practical understanding of cloud operations, not just the theory. This role demands mastery of the tools and methodologies that keep our cloud infrastructure humming, and the ability to apply them at a strategic level.
Technical Competencies
- Skill: ITIL Framework (Strategic Application)
- Desc: You won't just know the theory; you'll be defining and optimising our Incident, Problem, and Change Management processes across your domain. This means knowing when to declare a P1 versus a P2, how to streamline emergency Change Advisory Board (CAB) procedures, and how to embed ITIL principles into our daily operations to drive efficiency and reduce risk.
- Level: Expert
- Skill: SRE Principles (Defining & Enforcing)
- Desc: You'll be responsible for defining Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets for the services your team owns. This means using data to strategically push back on feature releases when reliability is at risk, establishing clear targets for uptime, and fostering a culture of reliability engineering within your team.
- Level: Expert
- Skill: Cloud Cost Management (FinOps Strategy)
- Desc: This is about more than just finding orphaned resources. You'll be defining and enforcing enterprise-wide tagging strategies, evaluating and recommending Reserved Instances/Savings Plans, rightsizing workloads at scale, and implementing a holistic FinOps strategy to control and forecast cloud spend across your domain. You'll be working closely with Finance to manage your domain's cloud budget of roughly £500K-£2M.
- Level: Expert
- Skill: Infrastructure as Code (IaC) Governance
- Desc: You'll be designing and enforcing the IaC framework for your domain, including module registries, CI/CD integration, and policy-as-code. This isn't just about writing Terraform; it's about ensuring consistency, security, and scalability across all our infrastructure deployments, and making sure your team adheres to best practices.
- Level: Expert
- Skill: Disaster Recovery (DR) Planning & Ownership
- Desc: You'll own the DR plan for your services, differentiating between RPO (Recovery Point Objective) and RTO (Recovery Time Objective) at a strategic level. This includes overseeing and documenting failover tests, ensuring the Business Continuity Plan (BCP) is robust, and making critical decisions during actual DR scenarios.
- Level: Expert
- Skill: System Hardening & Security Posture (Strategic)
- Desc: You'll be responsible for defining and enforcing security best practices for cloud infrastructure within your domain. This means ensuring the principle of least privilege is applied, security group configurations are robust, and vulnerability scanning is integrated into our CI/CD pipelines. You'll work closely with the Security team to maintain a strong security posture.
- Level: Advanced
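To make the error budget idea in the SRE entry concrete, here is a minimal sketch of the arithmetic. The 99.9% SLO, the 30-day window, and the function names are assumptions chosen for illustration, not targets taken from this role profile.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime the SLO permits over the window, in minutes."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


def release_allowed(slo: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """Simple gate: feature releases continue only while budget is left."""
    return budget_remaining(slo, downtime_minutes, window_days) > 0


# Hypothetical service with a 99.9% availability SLO over 30 days.
print(round(error_budget_minutes(0.999), 1))  # minutes of downtime permitted
print(release_allowed(0.999, 20))             # 20 minutes of downtime so far
print(release_allowed(0.999, 50))             # 50 minutes of downtime so far
```

A gate like this is what gives the "push back on feature releases" conversation a number to point at, rather than a judgement call.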
Digital Tools
- Tool: Cloud Platforms (AWS, Azure, GCP)
- Level: Strategic
- Usage: Leading multi-cloud strategy, evaluating new cloud services for enterprise adoption, managing top-level billing and account structures. You'll be making decisions about which cloud services to use and how to best use them across your domain.
- Tool: Monitoring & Observability (Datadog, Grafana, Splunk, ELK Stack)
- Level: Architect
- Usage: Defining the enterprise observability strategy for your domain, selecting and integrating platforms, and implementing AIOps for predictive monitoring. You'll ensure your team has the right tools to monitor and troubleshoot effectively.
- Tool: Infrastructure as Code (Terraform, Ansible, GitLab CI, Jenkins)
- Level: Strategic
- Usage: Designing the enterprise IaC framework for your domain, including module registries, CI/CD integration, and policy-as-code (e.g., Sentinel, Open Policy Agent). You'll ensure our infrastructure is deployed consistently and securely.
- Tool: Incident Management (PagerDuty, Opsgenie, Jira Service Management)
- Level: Strategic
- Usage: Owning the entire incident management process for your domain. This means managing on-call schedules and escalations at a global level, reporting on MTTA/MTTR to executive leadership, and driving continuous improvement in incident response.
- Tool: Scripting & Automation (Python, Bash, Boto3/Azure SDK)
- Level: Strategic
- Usage: Driving the 'automate everything' vision within your team. Identifying major opportunities for operational efficiency through automation, and managing platforms like ServiceNow GRC for compliance automation. You'll ensure your team is building robust, scalable automation.
- Tool: Executive & Planning (Anaplan, CloudHealth, Tableau Server, Power BI Premium)
- Level: Expert
- Usage: Managing cloud budgets and forecasting, presenting operational health metrics and strategic roadmaps to leadership. You'll use these tools to justify investments, track performance, and communicate effectively with senior stakeholders.
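The MTTA/MTTR figures reported to leadership are averages over incident timestamps. The sketch below assumes both are measured from the moment the alert fires, and uses made-up incident records; in practice the data would come from an export of whichever incident tool is in use.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents: list[dict], start_key: str, end_key: str) -> float:
    """Average gap between two timestamps across incidents, in minutes."""
    gaps = [(i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents]
    return mean(gaps)


# Hypothetical incident records for illustration only.
incidents = [
    {
        "triggered": datetime(2024, 3, 1, 2, 0),
        "acknowledged": datetime(2024, 3, 1, 2, 4),
        "resolved": datetime(2024, 3, 1, 2, 50),
    },
    {
        "triggered": datetime(2024, 3, 9, 14, 0),
        "acknowledged": datetime(2024, 3, 9, 14, 6),
        "resolved": datetime(2024, 3, 9, 15, 10),
    },
]

mtta = mean_minutes(incidents, "triggered", "acknowledged")
mttr = mean_minutes(incidents, "triggered", "resolved")
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```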
Industry Knowledge
- Area: Cloud Native Architectures
- Desc: Deep understanding of microservices, containers (Kubernetes, Docker), serverless computing, and event-driven architectures. You'll guide your team in operating these complex systems effectively.
- Area: DevOps & GitOps Principles
- Desc: A strong grasp of continuous integration/continuous deployment (CI/CD), version control, and the cultural aspects of DevOps. You'll foster a collaborative environment between operations and development.
- Area: Cyber Security Best Practices (Cloud)
- Desc: Comprehensive knowledge of cloud security models, identity and access management (IAM), data encryption, network security, and compliance frameworks relevant to cloud environments.
Regulatory Compliance
- Reg: GDPR (General Data Protection Regulation)
- Usage: Ensuring all cloud operations and data handling within your domain comply with GDPR requirements, particularly regarding data residency, access controls, and incident response procedures for data breaches. You'll be a key point of contact for audits.
- Reg: ISO 27001 (Information Security Management)
- Usage: Implementing and maintaining controls that align with ISO 27001 standards across your cloud infrastructure. This includes defining security policies, managing risks, and contributing to the overall Information Security Management System (ISMS).
- Reg: SOC 2 (Service Organisation Control 2)
- Usage: Understanding and contributing to the controls required for SOC 2 compliance, especially concerning security, availability, processing integrity, confidentiality, and privacy of customer data in the cloud environment. You'll help prepare for and respond to auditor requests.
Essential Prerequisites
- Proven experience (12-16 years) in cloud operations or site reliability engineering, with at least 5 years in a leadership or management role.
- Demonstrable experience managing a team of 5+ engineers, including hiring, performance management, and career development.
- A track record of successfully managing significant cloud budgets (e.g., £500K+) and driving cost optimisation initiatives.
- Deep, hands-on experience with at least one major cloud provider (AWS, Azure, or GCP) at an expert level, and strong familiarity with another.
- Extensive experience in designing, implementing, and managing Infrastructure as Code (IaC) frameworks at scale.
- Proven ability to lead major incident response efforts, conduct blameless post-mortems, and drive continuous improvement in operational processes.
- Strong understanding of SRE principles and their practical application in a production environment.
Career Pathway Context
These aren't just checkboxes; they're the foundational experiences that will allow you to step into this role and make an immediate impact. We're looking for someone who has 'been there, done that' in cloud operations and is ready to lead a team to new heights. If you've spent years in the trenches, built resilient systems, and now feel ready to shape the strategy and develop others, this could be your next big step.
Qualifications & Credentials
Emerging Foundation Skills
- Skill: AIOps Strategy & Implementation
- Why: Manual incident response and monitoring are becoming unsustainable as systems grow more complex. AIOps tools can drastically reduce MTTR, predict outages, and cut down on alert fatigue, but they need a strategic leader to implement them effectively and integrate them into existing workflows.
- Concept: Event Correlation & Anomaly Detection
- Desc: Using AI to automatically link related alerts and spot unusual patterns that indicate a problem before it escalates.
- Concept: Predictive Analytics for Outages
- Desc: Leveraging machine learning to forecast potential system failures based on historical data and real-time metrics.
- Concept: Automated Remediation Workflows
- Desc: Designing AI-driven responses that can automatically resolve common issues without human intervention.
- Concept: Natural Language Processing for Incident Summaries
- Desc: Using LLMs to quickly summarise incident timelines and generate first-draft post-mortems.
- Prepare: This quarter: Research leading AIOps platforms (e.g., Datadog Watchdog, Dynatrace Davis) and their capabilities.
- Next quarter: Pilot an AIOps feature on a non-critical service to understand its benefits and limitations.
- Month 6: Develop a business case for broader AIOps adoption, outlining ROI and implementation roadmap.
- Month 9: Lead the integration of an AIOps tool into your team's incident management workflow, training your engineers.
- QuickWin: Start experimenting with existing AI features in your current monitoring tools (e.g., Datadog's Watchdog) to see what insights they can provide immediately. Encourage your team to use LLMs to draft initial incident communications.
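To make the anomaly detection concept above more concrete, here is a minimal sketch of its simplest form: a trailing-window z-score over a latency series. It is illustrative only and does not represent how Datadog Watchdog, Dynatrace Davis, or any other AIOps product works internally. The function name, window size, threshold, and sample data are all assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as anomalous.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical p95 latency samples in milliseconds, with one obvious spike.
latency_ms = [120, 118, 122, 121, 119, 120, 480, 121, 119]
print(find_anomalies(latency_ms))  # prints [6]
```

Real platforms layer seasonality handling and cross-signal correlation on top of this kind of baseline, which is exactly the capability worth probing during a pilot.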
- Skill: Cloud Native Security Posture Management
- Why: Traditional security models don't cut it in dynamic cloud-native environments. As we move more towards serverless and containers, the attack surface changes, and operations managers need to lead the charge in securing these new paradigms, not just react to security incidents.
- Concept: Zero Trust Architectures
- Desc: Implementing security models where no user or device is trusted by default, regardless of their location.
- Concept: Container & Serverless Security
- Desc: Specific practices for securing Docker containers, Kubernetes clusters, and serverless functions.
- Concept: Cloud Security Posture Management (CSPM)
- Desc: Using tools to continuously monitor cloud configurations for security misconfigurations and compliance violations.
- Concept: Identity & Access Management (IAM) at Scale
- Desc: Advanced strategies for managing permissions and access controls across complex multi-cloud environments.
- Prepare: This quarter: Review current cloud security best practices (e.g., AWS Well-Architected Framework - Security Pillar).
- Next quarter: Work with the Security team to identify top 3 security risks in your domain and develop mitigation plans.
- Month 6: Lead a project to implement a CSPM tool and integrate its findings into your operational dashboards.
- Month 9: Conduct a 'tabletop exercise' with your team and Security to simulate a cloud-native security incident.
- QuickWin: Ensure your team is regularly reviewing IAM policies for least privilege and automating security group audits. Start using cloud provider security tools (e.g., AWS Security Hub, Microsoft Defender for Cloud, formerly Azure Security Center) if you're not already.
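As a rough illustration of what "automating security group audits" can mean in practice, the sketch below flags ingress rules that expose SSH or RDP to the whole internet. The rule format is a simplified, hypothetical structure and not any provider's API response; in a real audit you would populate it from your cloud provider's SDK or a CSPM export. The function name and the choice of sensitive ports are also assumptions.

```python
# Ports we assume should never be reachable from the public internet.
SENSITIVE_PORTS = {22, 3389}  # SSH and RDP
WORLD_CIDRS = ("0.0.0.0/0", "::/0")


def find_open_rules(security_groups):
    """Return (group name, CIDR, exposed ports) for every ingress rule that
    opens a sensitive port to any address."""
    findings = []
    for group in security_groups:
        for rule in group["ingress"]:
            if rule["cidr"] not in WORLD_CIDRS:
                continue
            exposed = sorted(
                port for port in SENSITIVE_PORTS
                if rule["from_port"] <= port <= rule["to_port"]
            )
            if exposed:
                findings.append((group["name"], rule["cidr"], exposed))
    return findings


# Hypothetical inventory in the simplified format described above.
groups = [
    {"name": "web", "ingress": [{"cidr": "0.0.0.0/0", "from_port": 443, "to_port": 443}]},
    {"name": "bastion", "ingress": [{"cidr": "0.0.0.0/0", "from_port": 22, "to_port": 22}]},
    {"name": "legacy", "ingress": [{"cidr": "10.0.0.0/8", "from_port": 0, "to_port": 65535}]},
]
for finding in find_open_rules(groups):
    print(finding)
```

Running a check like this on a schedule and feeding the findings into your operational dashboards turns a periodic manual review into a continuous control.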
Advancing Technical Skills
- Skill: Serverless Architecture Governance
- Why: More services are moving to serverless, and while it simplifies some operational aspects, it introduces new challenges around cost management, observability, and cold starts. You'll need to define the operational best practices for serverless deployments.
- Concept: Serverless Cost Optimisation
- Desc: Strategies for managing costs in serverless environments, where billing models are often complex.
- Concept: Distributed Tracing & Observability
- Desc: Implementing tools and practices to monitor and troubleshoot highly distributed serverless applications.
- Concept: Event-Driven Architecture Patterns
- Desc: Understanding how to design and operate systems that react to events rather than traditional requests.
- Prepare: This quarter: Deep dive into serverless best practices from AWS Lambda, Azure Functions, or GCP Cloud Functions.
- Next quarter: Work with a development team to define operational standards for their new serverless application.
- Month 6: Evaluate a new serverless monitoring tool and present its benefits to your Director.
- Month 9: Lead a project to automate serverless deployment and rollback procedures.
- QuickWin: Encourage your team to experiment with serverless for internal tooling. Review existing serverless applications for cost and performance bottlenecks.
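When reviewing serverless applications for cost, a back-of-the-envelope model helps. The sketch below assumes a pay-per-use billing model made up of a per-request charge plus a charge per GB-second of execution. The function name is hypothetical and the rates in the usage example are placeholders, not quoted prices; check your provider's current pricing page for real rates and free tiers.

```python
def estimate_monthly_cost(invocations, avg_duration_ms, memory_mb,
                          price_per_million_requests, price_per_gb_second):
    """Estimate monthly spend for one function under a pay-per-use model."""
    request_cost = invocations / 1_000_000 * price_per_million_requests
    # GB-seconds = invocations x duration in seconds x memory in GB.
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    compute_cost = gb_seconds * price_per_gb_second
    return {
        "request_cost": request_cost,
        "compute_cost": compute_cost,
        "total": request_cost + compute_cost,
    }


# Placeholder rates for illustration only.
RATES = {"price_per_million_requests": 0.20, "price_per_gb_second": 0.0000167}

# Compare two hypothetical configurations of the same function: doubling
# memory is assumed here to shorten average duration.
small = estimate_monthly_cost(10_000_000, 200, 512, **RATES)
large = estimate_monthly_cost(10_000_000, 120, 1024, **RATES)
print(f"512 MB:  {small['total']:.2f}")
print(f"1024 MB: {large['total']:.2f}")
```

The useful part is the comparison rather than the absolute figures: more memory only pays for itself if duration falls by more than the memory increase, which is worth measuring before changing defaults.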
- Skill: Sustainability in Cloud Operations
- Why: As environmental concerns grow, optimising cloud carbon footprint will become a critical operational metric. This means making choices about regions, instance types, and architecture with sustainability in mind, and you'll be leading that charge.
- Concept: Carbon Footprint Measurement
- Desc: Tools and methods to measure the energy consumption and carbon emissions of cloud resources.
- Concept: Green Cloud Architectures
- Desc: Designing infrastructure for energy efficiency, such as using efficient regions or instance types.
- Concept: Workload Optimisation for Sustainability
- Desc: Strategies to reduce resource usage, like rightsizing, auto-scaling, and shutting down idle environments.
- Prepare: This quarter: Read up on cloud provider sustainability reports and initiatives.
- Next quarter: Identify one service in your domain with high resource usage and propose a 'green' optimisation plan.
- Month 6: Integrate sustainability metrics into your operational dashboards and reporting.
- Month 9: Lead a cross-functional initiative to establish company-wide cloud sustainability goals.
- QuickWin: Start by identifying and decommissioning orphaned resources. Encourage your team to use the most energy-efficient instance types for new deployments.
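Identifying orphaned and idle resources is easy to prototype before investing in tooling. The sketch below classifies a resource inventory into unattached volumes and low-utilisation instances. The inventory format, field names, and the 5% CPU threshold are all assumptions for illustration; a real inventory would come from your provider's APIs or billing exports.

```python
def classify_resources(inventory, idle_cpu_percent=5.0):
    """Split an inventory into orphaned volumes and idle instances."""
    orphaned, idle = [], []
    for resource in inventory:
        if resource["type"] == "volume" and resource.get("attached_to") is None:
            # A volume with no attachment is a decommissioning candidate.
            orphaned.append(resource["id"])
        elif (resource["type"] == "instance"
              and resource["avg_cpu_percent"] < idle_cpu_percent):
            # Sustained low CPU suggests rightsizing or shutdown.
            idle.append(resource["id"])
    return {"orphaned": orphaned, "idle": idle}


# Hypothetical inventory in the simplified format described above.
inventory = [
    {"id": "vol-a", "type": "volume", "attached_to": "vm-1"},
    {"id": "vol-b", "type": "volume", "attached_to": None},
    {"id": "vm-1", "type": "instance", "avg_cpu_percent": 42.0},
    {"id": "vm-2", "type": "instance", "avg_cpu_percent": 1.5},
]
print(classify_resources(inventory))
```

Treat the output as a review list rather than a delete list: some low-CPU instances are legitimately memory-bound or kept warm for failover.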
Future Skills Closing Note
The future of cloud operations isn't just about technology; it's about leadership, vision, and adaptability. Your role will be to guide your team through these evolving landscapes, ensuring we remain at the forefront of operational excellence. It's a challenging, but incredibly rewarding, path.
Education Requirements
- Level: Minimum
- Req: A Bachelor's degree in Computer Science, Engineering, Information Technology, or a closely related technical field.
- Alts: We're pragmatic. If you've got equivalent practical experience (think 16+ years in a highly technical, leadership-focused cloud operations role) and a track record of success that speaks for itself, we'd still love to hear from you. We value demonstrated ability over a piece of paper.
- Level: Preferred
- Req: A Master's degree in a relevant technical field or an MBA.
- Alts: While not essential, a Master's or MBA can give you an edge, especially in understanding the broader business context and strategic decision-making. Again, strong practical experience can often outweigh this.
Experience Requirements
You'll need roughly 12-16 years of progressive experience in technical roles, with a solid 5-8 years specifically focused on cloud operations or site reliability engineering. Crucially, at least 5 of those years should have been in a leadership or management capacity, where you were directly responsible for a team of engineers, their performance, and their development. We're looking for someone who has managed significant cloud environments, owned multi-million-pound budgets, and has a proven track record of driving operational excellence and strategic change.
Preferred Certifications
- Cert: ITIL Expert or ITIL Managing Professional
- Prod: Axelos
- Usage: Demonstrates a comprehensive understanding of IT service management best practices, which is crucial for defining and optimising our operational processes at a strategic level.
- Cert: FinOps Certified Practitioner (FOCP)
- Prod: FinOps Foundation
- Usage: Shows a dedicated focus on cloud financial management, which is absolutely vital for managing our cloud spend effectively and driving cost optimisation initiatives.
- Cert: Certified Information Security Manager (CISM) or Certified Cloud Security Professional (CCSP)
- Prod: ISACA / (ISC)²
- Usage: Highlights a strong understanding of information security governance and cloud security principles, which is increasingly important for managing risk in our cloud environments.
Recommended Activities
- Regularly attending industry conferences (e.g., AWS re:Invent, Microsoft Ignite, KubeCon) to stay abreast of the latest cloud technologies and operational best practices.
- Participating in leadership development programmes or executive coaching to hone your management and strategic thinking skills.
- Engaging with industry groups and communities (e.g., FinOps Foundation, SRE communities) to share knowledge and learn from peers.
- Contributing to open-source projects or writing technical blogs/articles, demonstrating thought leadership in cloud operations.
Career Progression Pathways
Entry Paths to This Role
- Path: Senior Cloud Operations Engineer / Staff Cloud Operations Engineer
- Time: 3-5 years in previous role
- Path: Technical Lead, Cloud Infrastructure
- Time: 4-6 years in previous role
- Path: Cloud Architect
- Time: 5-7 years in previous role
Career Progression From This Role
- Pathway: Director, Global Cloud Operations (Level 6)
- Time: 3-5 years in current role
- Pathway: Principal Cloud Architect / Principal SRE (Individual Contributor Path)
- Time: 3-5 years in current role
Long Term Vision Potential Roles
- Title: VP of Infrastructure & Operations (Level 7)
- Time: 5-10 years from current role
- Title: Chief Technology Officer (CTO)
- Time: 10-15 years from current role
- Title: Head of Platform Engineering
- Time: 5-10 years from current role
Sector Mobility
The skills you'll gain in this role are highly transferable across various technical sectors. Cloud operations expertise is in high demand in SaaS companies, financial services, e-commerce, and any organisation that relies heavily on cloud infrastructure. You could move into a similar leadership role in a completely different industry, or even transition into a more general technology leadership position.
How Zavmo Delivers This Role's Development
DISCOVER Phase: Skills Gap Analysis
Zavmo maps your current competencies against all requirements in this job description through conversational assessment. We evaluate your foundation skills (communication, strategic thinking), functional skills (cloud cost management, incident response), and readiness for career progression.
Output: Personalised skills gap heat map showing strengths and priorities, estimated time to competency, neurodiversity accommodations.
DISCUSS Phase: Personalised Learning Pathway
Based on your DISCOVER results, Zavmo creates a personalised learning plan prioritised by impact: foundation skills first, then functional skills. We adapt to your learning style, pace, and neurodiversity needs (ADHD, dyslexia, autism).
Output: Week-by-week schedule, each module linked to specific job responsibilities, checkpoints and milestones.
DELIVER Phase: Conversational Learning
Learn through conversation, not boring modules. Zavmo uses 10 conversation types (including Socratic dialogue, role-play, coaching, and case studies) to build competence. Practise leading major incident calls, presenting cloud budgets to Finance, and negotiating with cloud vendors in a safe AI environment before facing the real thing.
Example: "For 'Stakeholder Mapping', Zavmo will guide you through analysing a complex cross-functional cloud initiative, identifying key decision-makers, and building an engagement strategy."
DEMONSTRATE Phase: Competency Assessment
Zavmo automatically builds your evidence portfolio as you learn. Every conversation, practice scenario, and application example is captured and mapped to National Occupational Standards (NOS) performance criteria. When ready, your portfolio supports Ofqual qualification claims and demonstrates competence to employers.
Output: Competency matrix, evidence portfolio (downloadable), qualification readiness, career progression score.