Role Purpose & Context
Role Summary
The Manager, Cloud Infrastructure is here to lead and shape a high-performing team that designs, builds, and runs our global cloud platforms. You'll be making sure our core systems are always available, secure, and performant, which directly impacts our ability to serve customers and grow the business. You'll sit right at the intersection of our overall infrastructure strategy and the day-to-day operational excellence, translating big-picture ideas into concrete, reliable services.
When this role is done well, our engineering teams can deploy faster, our customers experience fewer outages, and our cloud spend is optimised without cutting corners. When it's not, we're looking at costly outages, security vulnerabilities, and ballooning cloud bills – frankly, it impacts everything. The biggest challenge here is balancing rapid innovation with rock-solid stability and cost efficiency, all while managing a diverse team and navigating constant change. The reward, though, is seeing your team thrive, building resilient systems that genuinely enable the company's success, and knowing you're at the helm of something truly critical.
Reporting Structure
- Reports to: Director of Global Infrastructure
- Direct reports: Roughly 10-25 engineers and potentially some team leads, depending on how we've structured things at the time. It's a proper leadership role.
- Matrix relationships: Head of Infrastructure Operations, Principal Infrastructure Lead, Senior Infrastructure Manager
Key Stakeholders
Internal:
- SVP of Engineering
- Head of Product
- Finance Leadership (especially the CFO's team)
- Security Operations
- Other Engineering Managers (DevOps, Data Engineering)
- Internal Audit and Compliance
External:
- Cloud Service Providers (AWS, Azure, GCP account managers)
- Key Infrastructure Vendors (e.g., Palo Alto Networks, Datadog)
- External Auditors and Regulators (occasionally)
- Industry Bodies and Peer Networks
Organisational Impact
Scope: Your work directly impacts our operational stability, security posture, and the financial health of our cloud spend. Get it right, and you're enabling faster product delivery and better customer experience. Get it wrong, and we're talking about outages, data breaches, and millions in wasted spend. It's a high-stakes game, honestly.
Performance Metrics
Quantitative Metrics
- Metric: Availability & Uptime
- Desc: Maintaining the highest possible uptime for our Tier 1 business applications, which are the ones that directly impact revenue or customer experience.
- Target: Achieve 99.99% ('four nines') uptime for all Tier 1 business applications.
- Freq: Monthly, reported quarterly to the executive committee.
- Example: If our main customer-facing platform is down for more than about 4.3 minutes in a 30-day month (0.01% of 43,200 minutes), we've missed this target. It's a tight margin, but that's the expectation for critical systems.
- Metric: Global Infrastructure Budget Variance
- Desc: Managing the multi-million pound global infrastructure budget, ensuring we stay within our allocated spend and can justify any deviations.
- Target: Manage the global infrastructure budget to within +/- 5% of the forecast in Anaplan.
- Freq: Reviewed monthly with Finance, quarterly with the Director.
- Example: If your forecasted Q2 cloud spend was £1.2M, you'd need to keep actual spend between £1.14M and £1.26M. Anything outside that requires a clear explanation and an action plan.
- Metric: Total Cost of Ownership (TCO) Reduction
- Desc: Actively working to reduce the overall cost of running our infrastructure, especially by optimising cloud resources and migrating off expensive legacy platforms.
- Target: Deliver a 10% reduction in Total Cost of Ownership for legacy platforms through cloud migration and optimisation within 12 months.
- Freq: Annually, tracked quarterly.
- Example: If a particular on-prem database cluster costs £500K annually to run (licences, hardware, power, people), a 10% reduction would mean finding £50K in savings by moving it to a more efficient cloud service or optimising its current setup.
- Metric: Mean Time To Recovery (MTTR) for P1 Incidents
- Desc: How quickly we can restore service after a critical (Priority 1) incident. This isn't just about fixing it, but about getting it back to a stable state.
- Target: Reduce MTTR for P1 incidents by 15% year-over-year.
- Freq: Quarterly, based on post-mortem data.
- Example: If our average P1 MTTR was 60 minutes last year, you'd be aiming for 51 minutes or less this year. It's a tough one, but crucial for business continuity.
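The arithmetic behind the four targets above can be sketched in a few lines of Python. This is only an illustration: the function names and default parameters are assumptions made for this example, not an existing internal tool, and the figures come from the worked examples in the list.

```python
# Illustrative helpers for the quantitative targets above.
# All function names and defaults are hypothetical.

def allowed_downtime_minutes(slo: float, days: int = 30) -> float:
    """Downtime allowance in minutes for an availability target over `days` days."""
    return (1.0 - slo) * days * 24 * 60


def within_budget_tolerance(forecast: float, actual: float,
                            tolerance: float = 0.05) -> bool:
    """True when actual spend sits within +/- `tolerance` of the forecast."""
    return abs(actual - forecast) <= tolerance * forecast


def tco_saving(annual_cost: float, reduction: float = 0.10) -> float:
    """Annual saving needed to hit a fractional TCO reduction."""
    return annual_cost * reduction


def mttr_target(previous_mttr_minutes: float, reduction: float = 0.15) -> float:
    """This year's MTTR ceiling given last year's average and a reduction goal."""
    return previous_mttr_minutes * (1.0 - reduction)
```

For example, `allowed_downtime_minutes(0.9999)` gives roughly 4.32 minutes for a 30-day month, `tco_saving(500_000)` gives the £50K from the TCO example, and `mttr_target(60)` gives the 51-minute ceiling from the MTTR example.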
Qualitative Metrics
- Metric: Team Health & Development
- Desc: Ensuring your team is engaged, growing, and feels supported. This isn't just about output, but about building a sustainable, high-performing group.
- Evidence: High team retention rates (above 85%), positive feedback in internal engagement surveys, clear progression plans for individual team members, successful internal promotions, and active participation in mentorship programmes.
- Metric: Strategic Initiative Adoption
- Desc: How effectively your team adopts and champions new infrastructure strategies, like FinOps or Zero Trust, across the wider engineering organisation.
- Evidence: Successful implementation of FinOps practices leading to demonstrable cost savings, positive feedback from other engineering teams on new security standards, and clear evidence of architectural patterns shifting towards the defined strategy.
- Metric: Risk & Compliance Posture
- Desc: Maintaining a strong security and compliance posture for our cloud infrastructure, minimising audit findings and ensuring we meet regulatory requirements.
- Evidence: Zero critical or major findings in internal or external audits related to your domain, successful completion of compliance certifications (e.g., SOC 2), and proactive identification and mitigation of potential security risks.
- Metric: Vendor Relationship Management
- Desc: Building and maintaining strong, productive relationships with our key cloud providers and infrastructure vendors, ensuring we get the best service and value.
- Evidence: Positive feedback from vendor account managers, successful negotiation of enterprise agreements, proactive engagement on new features and cost optimisation programmes, and efficient resolution of vendor-related issues.
Primary Traits
- Trait: Decisive Under Fire
- Manifestation: When a SEV-1 outage hits at 2 AM, and you're getting conflicting information from three different engineers, you're the one who cuts through the noise. You'll make a clear 'go/no-go' decision on a high-risk fix within minutes, even if it means authorising emergency cloud spend without waiting for layers of approval. You don't freeze; you act.
- Benefit: Indecision during a major incident costs us millions per hour in lost revenue and reputational damage. When the platform is down, your team needs a commander, not a committee. Your ability to make a quick, informed decision can literally save the company.
- Trait: Pragmatically Influential
- Manifestation: You can convince the CFO to approve a £5M cloud migration by framing it as a strategic shift from Capex to Opex with predictable cost benefits, not just a purely technical upgrade. You'll get buy-in from sceptical engineering teams to adopt a new, stricter security standard by demonstrating how it actually reduces their on-call burden or makes their lives easier in the long run. It's about translating tech into business value and navigating the politics.
- Benefit: Infrastructure is often seen as a cost centre, but it's the engine that enables revenue. You must be able to translate complex technical needs into clear business value and navigate political landscapes to secure the necessary resources, budget, and alignment from other departments. Without this, your strategic initiatives simply won't get off the ground.
- Trait: Unflappably Accountable
- Manifestation: After a major incident or even a data breach, you'll stand before the executive committee, present a blameless post-mortem, and clearly articulate the plan to prevent recurrence without making excuses. You own the 'red' on your KPI dashboard and proactively communicate the recovery plan, not just when things are going well. You're the one who takes the hit, learns from it, and moves forward.
- Benefit: Trust is absolutely paramount here. When systems fail—and honestly, they will, it's the nature of the beast—leadership, stakeholders, and your team need to know the person in charge owns the outcome completely. You need to be focused on the solution, not the blame, and inspire confidence that things will be put right.
Supporting Traits
- Trait: Calm Under Pressure
- Desc: Your emotional state during an incident sets the tone for the entire response team. If you're panicking, everyone else will. We need someone who can be the eye of the storm.
- Trait: Process-Minded
- Desc: You can distinguish between value-add process (like a robust change management system) and soul-crushing bureaucracy. You'll implement the former and ruthlessly cut the latter.
- Trait: Fiscally Astute
- Desc: You innately understand the financial implications of technical decisions. You know the Total Cost of Ownership (TCO) of running Kubernetes versus using a serverless platform, and you can articulate it clearly to Finance.
- Trait: Coaching & Mentoring
- Desc: You genuinely enjoy helping your team members grow, identifying their strengths and weaknesses, and providing the guidance they need to excel. Your success is their success.
Primary Motivators
- Motivator: Building and Developing High-Performing Teams
- Daily: You'll spend time coaching individual engineers, helping them unblock technical challenges, and ensuring they have clear career paths. You'll also be actively involved in hiring and onboarding new talent, shaping the team's culture and capabilities.
- Motivator: Driving Strategic Impact and Transformation
- Daily: You'll be leading initiatives like multi-cloud migrations, implementing SRE principles, or rolling out new security architectures. This means defining the 'how,' getting buy-in, and seeing these complex programmes through to completion.
- Motivator: Solving Complex Organisational Challenges
- Daily: You won't just be solving technical problems; you'll be tackling issues like 'cloud sprawl,' balancing security and development velocity, or getting disparate teams to agree on a common infrastructure standard. It's often more about people and process than pure tech.
Potential Demotivators
Honestly, this role isn't for everyone. If you're someone who needs every project to be perfectly defined from day one, or if you prefer to just focus on the technical bits without the 'people stuff,' you might find it tough going. You'll still get the 3 AM PagerDuty call when a critical system fails, and you're the ultimate escalation point – that doesn't stop just because you're a manager. You'll spend a fair bit of time fighting the cloud bill, constantly justifying a massive, fluctuating Opex expense to a finance department that's used to predictable Capex. You'll also inherit some legacy tech anchors; that critical, ancient on-premise system that no one understands but is too risky to decommission? Yeah, that's yours now, and it'll prevent some of your modernisation efforts.
Common Frustrations
- The 3 AM PagerDuty Alert: The unavoidable reality that you are the ultimate escalation point when critical systems break in the middle of the night. Your phone will ring.
- Fighting the Cloud Bill: Constantly justifying a massive, fluctuating Opex bill to a finance department accustomed to predictable Capex cycles. It's a never-ending battle.
- Legacy Tech Anchors: Being held back by a critical, ancient on-premise system that no one understands but is too risky to decommission, preventing modernisation efforts and draining resources.
- Shadow IT Ambush: Discovering a department has been running a massive, unsecured data analytics workload on a credit card for six months, and now it's your problem to secure and support it – often with no budget.
- The Security vs. Velocity Squeeze: Getting caught between the security team demanding you lock everything down and development teams demanding faster, more flexible access. It's a constant balancing act.
- Vendor Lock-in Regret: Dealing with the long-term consequences (and costs) of a strategic platform decision made five years ago that is now a technical and financial burden, but you're stuck with it.
What Role Doesn't Offer
- A purely hands-on technical role: While you need deep technical knowledge, your day-to-day won't be writing code or configuring servers. It's about guiding and enabling others.
- A predictable 9-to-5 schedule: Incidents don't care about your calendar, and strategic planning often requires deep thought outside normal hours.
- An environment without ambiguity: You'll be dealing with complex, ill-defined problems where there isn't always a clear 'right' answer.
- A role where you can avoid difficult conversations: Managing people and budgets means tough decisions and sometimes even tougher conversations.
ADHD Positives
- The fast-paced, high-stakes nature of incident response can be incredibly engaging, playing to strengths in hyperfocus and rapid problem-solving.
- The need to jump between strategic planning, team management, and urgent operational issues can suit individuals who thrive on varied tasks and novelty.
- Leading multiple initiatives simultaneously can provide the stimulation and challenge that keeps things interesting.
ADHD Challenges and Accommodations
- The constant context-switching between strategic, tactical, and operational demands can be overwhelming if not managed well. We can help by clearly defining priorities and providing tools for task management.
- Long, detailed budget reviews or compliance documentation might be challenging. We can support with structured templates, dedicated focus time, or delegation where appropriate.
- Maintaining focus during lengthy, less stimulating meetings can be difficult. We encourage active participation, note-taking, and breaks.
Dyslexia Positives
- People with dyslexia often excel at big-picture thinking, identifying patterns, and strategic problem-solving, which are crucial for infrastructure architecture and long-term planning.
- Strong verbal communication skills can be a huge asset when influencing stakeholders and leading a team through complex technical concepts.
- A knack for visualising complex systems and processes, which is invaluable for designing robust infrastructure.
Dyslexia Challenges and Accommodations
- Extensive reading of technical documentation, vendor contracts, or detailed reports might be time-consuming. We can offer text-to-speech software, provide executive summaries, or encourage verbal briefings.
- Writing detailed post-mortems or board updates can be a hurdle. We can support with AI drafting tools (as mentioned below), proofreading resources, or templates that structure the information clearly.
- Ensuring clarity in written communications can take extra effort. We encourage using simple language, bullet points, and having a colleague review important documents.
Autism Positives
- A deep, analytical approach to problem-solving, particularly for complex system failures or architectural challenges, can be a significant strength.
- Exceptional attention to detail in technical specifications, security configurations, and compliance requirements.
- A preference for logical, data-driven decision-making, which is essential when managing infrastructure performance and costs.
- Strong adherence to established processes (like ITIL) can ensure consistency and reliability in operations.
Autism Challenges and Accommodations
- Navigating complex social dynamics and unspoken expectations in leadership roles can be challenging. We foster a direct, transparent communication culture and provide clear expectations for leadership behaviours.
- Unexpected changes in priorities or sudden, urgent incidents might be disruptive. We aim for clear communication about changes and provide structured incident response protocols.
- Busy, open-plan office environments can cause sensory overload. We offer flexible working arrangements, quiet zones, and noise-cancelling headphones.
Sensory Considerations
Our main office is a modern, open-plan space, which can sometimes get a bit lively. That said, we've got plenty of quiet zones, meeting rooms you can book for focused work, and we're pretty flexible with working from home a few days a week. We want you to be comfortable and productive, so if you need specific adjustments like noise-cancelling headphones or a particular lighting setup, just let us know. We'll make it work.
Flexibility Notes
We're big believers in flexibility. While there are core hours for team meetings and collaboration, we trust you to manage your time effectively. If you need to pick up kids from school, or prefer to work some evenings, that's generally fine, as long as the work gets done and your team is supported. We're about outcomes, not clock-watching.
Key Responsibilities
Experience Levels Responsibilities
- Level: Manager, Cloud Infrastructure (12-16 years' experience)
- Responsibilities: Build and lead a high-performing team of Cloud and SRE engineers, including hiring, performance management, coaching, and career development. You're responsible for their growth and well-being.
- Own the budget for your specific infrastructure domain (e.g., Cloud Operations, SRE), which typically ranges from £500K to £2M annually. That means forecasting, tracking, and justifying every pound spent.
- Define and drive the strategic roadmap for your area of responsibility, ensuring it aligns with the overall global infrastructure vision and wider business objectives. You'll be setting the direction.
- Oversee the design, implementation, and operation of our critical cloud infrastructure, making sure it's resilient, secure, scalable, and cost-optimised. This isn't hands-on coding, but deep architectural oversight.
- Establish and champion SRE principles and FinOps practices within your team and across the broader engineering organisation. You'll be driving a cultural shift towards proactive reliability and cost accountability.
- Manage key vendor relationships, particularly with our major cloud providers (AWS, Azure, GCP), negotiating contracts and ensuring we're getting maximum value from our partnerships.
- Be the ultimate escalation point for major incidents within your domain, providing decisive leadership and clear communication during critical outages. You'll be the one making the tough calls.
- Supervision: You'll be largely self-directed, working against quarterly objectives set with the Director of Global Infrastructure. Expect monthly 1:1s and strategic alignment meetings, but otherwise, you're trusted to get on with it.
- Decision: You have full authority over your functional budget (up to £2M), hiring and firing decisions within your team, and vendor selection up to £100K. Strategic decisions that impact other departments or require significant capital expenditure (above £2M) will need alignment with the Director and potentially the SVP of Engineering. Organisational design within your team is yours to define.
- Success: Success looks like a highly engaged and effective team, consistently meeting availability and cost targets, and successfully delivering on strategic initiatives that move our infrastructure forward. You'll be measured on the health of your team, the reliability of your systems, and your financial stewardship.
Decision-Making Authority
- Type: Budget Allocation
- Entry: No authority. Must get approval for any spend.
- Mid: Can approve minor purchases (up to £1K) within pre-approved project budgets.
- Senior: Can approve project-level spend up to £5K. Recommends larger budget requests to Director.
- Type: Hiring & Team Structure
- Entry: No involvement beyond interviewing.
- Mid: Provides input on candidate suitability.
- Senior: Leads interview panels, makes recommendations for hires. No authority on team structure.
- Type: Technical Architecture & Tooling
- Entry: Follows existing architectural patterns and uses approved tools.
- Mid: Proposes minor architectural improvements or new tools for specific projects, with team lead approval.
- Senior: Designs and implements major architectural components. Recommends new tools for a workstream, with Director input.
- Type: Incident Management & Resolution
- Entry: Executes runbooks, escalates issues to senior engineers.
- Mid: Independently troubleshoots and resolves routine incidents. Escalates complex issues.
- Senior: Leads incident response for major incidents (P2). Makes critical technical decisions during incidents.
- Type: Vendor Selection & Management
- Entry: No direct involvement.
- Mid: Researches potential vendors for specific tools, provides technical feedback.
- Senior: Evaluates vendor solutions, participates in contract negotiations. Recommends preferred vendors.
- Tool: Predictive Outage Detection (AIOps)
- Benefit: Use AI tools to analyse vast amounts of monitoring data, predicting potential hardware failures or service degradation *before* they cause a user-facing outage. This means fewer 3 AM pager calls and more proactive fixes. Honestly, it's a game-changer for incident prevention.
- Tool: Cloud Cost Anomaly Analysis
- Benefit: Deploy AI-powered FinOps tools to automatically scan our AWS, Azure, and GCP bills. It'll instantly flag anomalous spend, identify specific resources responsible for cost overruns, and even suggest optimisation strategies. This automates hours of manual spreadsheet analysis, giving you back precious time for strategic cost management.
- Tool: Vendor Contract & Security Report Summarisation
- Benefit: Feed lengthy vendor contracts, SOC 2 reports, or new CVE vulnerability descriptions into an LLM. Get a concise summary of key risks, obligations, and action items in minutes, not hours. It's like having a dedicated research assistant for all your due diligence.
- Tool: Post-Mortem & Board Update Drafting
- Benefit: Provide an LLM with a timeline of an incident and key data points. Ask it to generate a first draft of a blameless post-mortem or a non-technical executive summary for a board update. This significantly reduces writer's block on critical communications, letting you focus on the content and strategy.
- Weekly time savings potential: Expect to save 15-25 hours weekly, giving you more time for strategic thinking and team development.
- Typical tool investment: You'll have access to four core AI tools that are specifically tailored for infrastructure management.
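To make the cost anomaly idea above concrete, here is a minimal sketch of how anomalous daily spend might be flagged. It is a simple stand-in (a trailing-window z-score), not how any particular FinOps product works; the function name, window size, and threshold are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def flag_spend_anomalies(daily_spend, window=7, threshold=3.0):
    """Return indices of days whose spend deviates sharply from the trailing window.

    Hypothetical helper: a day is flagged when it sits more than `threshold`
    standard deviations from the mean of the previous `window` days.
    """
    anomalies = []
    for i in range(window, len(daily_spend)):
        history = daily_spend[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: any change at all is worth a look.
            if daily_spend[i] != mu:
                anomalies.append(i)
            continue
        if abs(daily_spend[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies
```

Given a week of spend around £100 a day followed by a £250 day, the function flags that day's index. A real tool would also need to account for seasonality, planned launches, and per-service breakdowns.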
Competency Requirements
Foundation Skills (Transferable)
These are the bedrock skills that every leader needs, but for a Manager, Cloud Infrastructure, they take on a specific flavour. It's about leading people, solving complex, often ambiguous problems, and communicating effectively across the organisation.
- Category: Communication & Influence
- Skills: Executive Presentation: The ability to distil complex technical concepts into clear, concise, and compelling narratives for senior leadership and non-technical audiences. You'll be presenting to the SVP and potentially the board.
- Stakeholder Management: Building strong relationships and gaining buy-in from diverse groups—Product, Finance, Security, other Engineering teams—often with competing priorities. It's about negotiation and consensus-building.
- Conflict Resolution: Mediating disagreements within your team or between your team and other departments, finding constructive solutions that move things forward.
- Written Communication: Crafting clear, unambiguous technical documentation, strategic proposals, and incident reports that can be understood by a wide audience.
- Category: Problem-Solving & Strategic Thinking
- Skills: Organisational Problem Solving: Tackling issues that span beyond just technology, like process inefficiencies, team dynamics, or inter-departmental friction. It's about seeing the bigger picture.
- Strategic Planning: Developing long-term roadmaps and visions for your domain, anticipating future needs, and aligning technical initiatives with business goals. You're not just reacting, you're shaping the future.
- Risk Management: Identifying, assessing, and mitigating operational, security, and financial risks associated with our infrastructure. This includes proactive planning for disaster recovery.
- Analytical Decision Making: Using data (performance metrics, cost reports, incident trends) to make informed decisions about architecture, resource allocation, and team priorities.
- Category: Leadership & Adaptability
- Skills: Team Leadership & Development: Inspiring, motivating, and coaching a team of highly skilled engineers. This includes fostering a culture of continuous learning, accountability, and psychological safety.
- Change Leadership: Guiding your team and the wider organisation through significant technological and process changes, managing resistance and driving adoption.
- Resilience & Grace Under Pressure: Maintaining composure and clarity of thought during high-stress situations, like major incidents or critical deadlines. Your team looks to you for calm.
- Continuous Learning: Staying abreast of the latest trends in cloud technology, SRE, FinOps, and security, and encouraging your team to do the same. The tech landscape moves fast.
Functional Skills (Role-Specific Technical)
These are the specific technical and domain-level skills you'll need to effectively lead our Cloud Infrastructure team. You won't be hands-on with every tool, but you'll need a deep understanding to guide your team and make informed strategic decisions.
Technical Competencies
- Skill: ITIL Framework
- Desc: Deep understanding of Incident, Problem, and Change Management processes, not just as theory but as the political and operational reality of a large organisation. You'll be driving the adoption and optimisation of these processes.
- Level: Expert
- Skill: Site Reliability Engineering (SRE)
- Desc: Implementing and championing SRE principles to move the organisation from a reactive 'firefighting' model to a proactive, data-driven one based on SLOs, SLIs, and error budgets. You'll be responsible for the cultural shift.
- Level: Expert
- Skill: Cloud FinOps
- Desc: Establishing practices and a culture of accountability around cloud spending, including showback/chargeback models, reserved instance/savings plan optimisation, and waste identification. This is crucial for managing our budget.
- Level: Expert
- Skill: Disaster Recovery (DR) & Business Continuity Planning (BCP)
- Desc: Designing, testing, and being ultimately accountable for the RPO/RTO of critical systems, from tabletop exercises to full failover simulations. You own our ability to recover from the worst-case scenario.
- Level: Expert
- Skill: Zero Trust Security Architecture
- Desc: Moving beyond the traditional perimeter-based security model to an identity-centric approach, embedding security into all infrastructure design and operations. You'll be a key driver of this transformation.
- Level: Advanced
- Skill: Capacity Planning & Performance Modelling
- Desc: Using historical data and business forecasts to predict future infrastructure needs, preventing performance bottlenecks and overspending. You'll be guiding your team on these methodologies.
- Level: Advanced
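As a rough illustration of the capacity planning skill above, the sketch below fits a straight-line trend to historical usage and estimates how long until a capacity limit is reached. It assumes evenly spaced monthly samples and linear growth; the function names are hypothetical, and real modelling would normally also fold in business forecasts and seasonality.

```python
def linear_trend(values):
    """Least-squares slope and intercept for evenly spaced samples (x = 0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def months_until_capacity(monthly_peak_usage, capacity):
    """Months from the latest sample until the trend line reaches `capacity`.

    Returns None when usage is flat or falling, and 0.0 when the trend
    has already reached the limit.
    """
    slope, intercept = linear_trend(monthly_peak_usage)
    if slope <= 0:
        return None
    crossing = (capacity - intercept) / slope
    return max(0.0, crossing - (len(monthly_peak_usage) - 1))
```

With peak usage of 100, 110, 120, and 130 units over four months and a limit of 160, the trend reaches the limit about three months after the last sample.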
Digital Tools
- Tool: Cloud Platforms (AWS, Azure, GCP)
- Level: Strategic/Architect
- Usage: Setting multi-cloud/hybrid strategy, negotiating enterprise agreements (EAs) with providers, owning the overall cloud governance framework, and guiding architectural decisions across these platforms.
- Tool: Infrastructure as Code (Terraform, Ansible)
- Level: Strategic/Architect
- Usage: Mandating IaC adoption across the enterprise, setting standards for tooling (e.g., Terraform Cloud vs. Atlantis) and security (e.g., Checkov, tfsec), and ensuring automation is a core part of our delivery.
- Tool: Container Orchestration (Kubernetes, EKS, AKS, GKE)
- Level: Strategic/Architect
- Usage: Determining the enterprise container strategy, making build-vs-buy decisions on platforms (e.g., OpenShift, Rancher vs. vanilla K8s), and governing security and cost for containerised workloads.
- Tool: Observability & Monitoring (Datadog, New Relic, Grafana, Splunk)
- Level: Strategic/Architect
- Usage: Selecting and owning the enterprise observability platform, defining Service Level Objectives (SLOs), and using data from these tools to forecast capacity needs and justify budget to the CFO.
- Tool: Networking & Security (Palo Alto Networks, WAFs, VPC/VNet)
- Level: Strategic/Architect
- Usage: Developing the global network architecture and Zero Trust security model, and being accountable for network performance, security posture, and compliance (e.g., PCI-DSS, SOC 2).
- Tool: Financial & GRC Platforms (Anaplan, Workday Adaptive Planning, ServiceNow GRC/ITSM)
- Level: Strategic/Architect
- Usage: Managing a multi-million pound budget in Anaplan or Workday Adaptive Planning, using ServiceNow GRC/ITSM for change management, incident tracking, and audit evidence, and presenting risk posture to the board.
Industry Knowledge
- Area: Cloud Economics & Cost Optimisation
- Desc: Deep understanding of cloud pricing models, reserved instances, savings plans, and strategies for optimising spend across multiple cloud providers. You'll be the expert here.
- Area: Cyber Security Best Practices
- Desc: Comprehensive knowledge of current cyber threats, attack vectors, and defence mechanisms, including identity and access management, data encryption, and network segmentation. You'll guide our security posture.
- Area: Regulatory Compliance & Audit Requirements
- Desc: Familiarity with relevant industry regulations (e.g., GDPR, PCI-DSS, SOC 2) and how they impact infrastructure design and operations. You'll ensure we're always compliant.
- Area: Enterprise Architecture Principles
- Desc: Understanding how infrastructure decisions fit into the broader enterprise architecture, ensuring alignment and avoiding technical debt. You'll be a key contributor to our architectural vision.
Regulatory Compliance Regulations
- Reg: General Data Protection Regulation (GDPR)
- Usage: Ensuring our cloud infrastructure design and data handling practices comply with GDPR requirements, particularly regarding data residency, access controls, and breach notification. You'll be accountable for this within your domain.
- Reg: Payment Card Industry Data Security Standard (PCI-DSS)
- Usage: If we handle payment card data, you'll be responsible for ensuring our infrastructure meets PCI-DSS compliance standards, including network segmentation, encryption, and regular security assessments. This is non-negotiable.
- Reg: SOC 2 Type II
- Usage: Overseeing the controls and evidence collection for our annual SOC 2 audit, demonstrating the security, availability, processing integrity, confidentiality, and privacy of our systems. You'll work closely with internal audit.
- Reg: ISO 27001 (Information Security Management)
- Usage: Implementing and maintaining an Information Security Management System (ISMS) within your domain that aligns with ISO 27001 standards, ensuring robust information security practices are embedded.
Essential Prerequisites
- Extensive experience (10+ years) in a senior technical role within cloud infrastructure, SRE, or DevOps, with a clear progression in responsibility.
- Demonstrable experience leading and managing technical teams (at least 3-5 years as a formal manager or team lead). This isn't your first rodeo.
- A deep, practical understanding of at least two major cloud platforms (AWS, Azure, GCP) at an architectural level.
- Proven track record of managing significant infrastructure budgets (£500K+) and delivering cost optimisation initiatives.
- Experience in designing and implementing large-scale, highly available, and secure infrastructure solutions.
- Strong understanding of ITIL, SRE, and FinOps principles, with practical experience applying them in a production environment.
Career Pathway Context
These prerequisites mean you've already walked the path of a Senior SRE or Lead Infrastructure Engineer. You've seen the challenges, you've built the systems, and now you're ready to lead the people who do it. We're looking for someone who's not just technically brilliant, but also a proven leader and strategic thinker. If you haven't managed a team before, this role probably isn't the right fit just yet, but keep an eye out for our Lead roles!
Qualifications & Credentials
Emerging Foundation Skills
- Skill: Prompt Engineering & LLM Integration for Operations
- Why: Our competitors are already using generative AI to automate incident summaries, draft runbooks, and even suggest code fixes in minutes. Analysts and engineers who master this will outproduce their peers significantly, and managers need to guide this adoption.
- Concept: Context Windows & Token Limits
- Desc: Understanding how much information an LLM can process at once and how to optimise inputs for complex operational tasks.
- Concept: RAG Architectures for Proprietary Data
- Desc: Implementing Retrieval Augmented Generation to allow LLMs to securely query and summarise internal documentation, incident databases, and knowledge bases.
- Concept: Output Validation & Hallucination Detection
- Desc: Developing strategies and processes to verify the accuracy of AI-generated content, especially for critical operational tasks or compliance reports.
- Concept: AI-Powered Incident Response Playbooks
- Desc: Designing and integrating LLMs into incident management workflows to provide real-time diagnostic suggestions, communication drafts, and resolution steps.
- Prepare: This week: Experiment with Claude or ChatGPT to summarise incident timelines and draft initial post-mortem sections. See what it can do.
- This month: Identify one repetitive documentation or reporting task within your team that could be significantly accelerated by LLM integration.
- Month 2: Lead a pilot project to integrate an LLM API (e.g., OpenAI, Anthropic) into a specific operational workflow, perhaps for generating initial support responses.
- Month 3: Document the productivity gains and present a proposal to the Director on broader AI adoption within infrastructure operations.
- QuickWin: Start using generative AI to draft internal communications, summarise lengthy vendor reports, or brainstorm solutions for complex architectural problems. No formal approval is needed; just start experimenting and see what benefits emerge.
- Skill: Sustainable Cloud & Green IT Practices
- Why: Customers, investors, and regulators are increasingly demanding demonstrable commitments to environmental sustainability. Our cloud footprint has a significant impact, and you'll need to lead efforts to minimise it.
- Concept: Carbon Footprint of Cloud Services
- Desc: Understanding how different cloud regions, instance types, and data storage impact environmental sustainability.
- Concept: Energy Efficiency Optimisation
- Desc: Strategies for reducing energy consumption in cloud infrastructure, such as rightsizing, auto-scaling, and using low-carbon regions.
- Concept: Circular Economy Principles in IT
- Desc: Applying principles of reuse, repair, and recycling to hardware and data centre operations, even in a cloud-first world.
- Concept: Reporting & Metrics for Green IT
- Desc: Developing and tracking metrics to measure and report on the environmental impact of our infrastructure, for internal and external stakeholders.
- Prepare: This week: Read up on AWS/Azure/GCP's sustainability reports and tools (e.g., AWS Customer Carbon Footprint Tool).
- This month: Identify one area within your current cloud spend that could be optimised for energy efficiency (e.g., migrating to more efficient instance types).
- Month 2: Research industry best practices for Green IT and present a summary of potential initiatives to your Director.
- Month 3: Start incorporating sustainability considerations into new architectural designs and vendor selection criteria.
- QuickWin: Begin by simply being aware of the carbon impact of our current cloud choices. Ask your team to consider energy efficiency when proposing new solutions. It's a mindset shift that starts small.
Advancing Technical Skills
- Skill: Advanced Cloud Native Security Posture Management
- Why: The attack surface of cloud-native environments is constantly expanding. As a manager, you need to lead the strategy for proactive defence, not just reactive patching. This means understanding the latest threats and how to build security in from the start.
- Concept: Supply Chain Security for Cloud Native
- Desc: Securing the entire software supply chain, from source code to deployment, in a containerised and microservices environment.
- Concept: Runtime Security & Behavioural Analysis
- Desc: Implementing tools and strategies to detect and respond to anomalous behaviour within running containers and cloud functions.
- Concept: Policy-as-Code & GitOps for Security
- Desc: Automating security policy enforcement through code, integrated into CI/CD pipelines and managed via Git.
- Concept: Cloud Security Posture Management (CSPM) & Cloud Workload Protection Platforms (CWPP)
- Desc: Leading the selection and implementation of advanced security tools to continuously monitor and improve our cloud security posture.
- Prepare: This week: Review recent cloud security breaches and understand the root causes and mitigation strategies.
- This month: Evaluate our current CSPM/CWPP tools and identify gaps in our security coverage. Talk to your Security Ops peers.
- Month 2: Lead a cross-functional workshop on integrating security earlier into the development lifecycle (Shift Left Security).
- Month 3: Develop a roadmap for enhancing our cloud-native security posture, including new tooling and process improvements.
- QuickWin: Ensure all new cloud deployments are reviewed against a robust security checklist. Push for automated security scanning in all CI/CD pipelines. It's about making security a default, not an afterthought.
- Skill: Advanced Data Observability & AIOps Integration
- Why: As systems become more distributed and complex, traditional monitoring isn't enough. You need to lead the charge on data observability to understand the health of data pipelines and integrate AI for predictive insights, preventing outages rather than just reacting to them.
- Concept: Data Lineage & Data Quality Monitoring
- Desc: Implementing tools to track data from source to destination and monitor its quality, ensuring reliability for critical business decisions.
- Concept: Distributed Tracing & Service Mesh Observability
- Desc: Gaining deep insights into the performance and interactions of microservices within a complex, distributed architecture.
- Concept: Anomaly Detection & Root Cause Analysis with AI
- Desc: Using machine learning to automatically detect unusual patterns in metrics and logs, and to assist in quickly identifying the root cause of issues.
- Concept: Predictive Capacity Planning with ML
- Desc: Leveraging machine learning models to forecast future infrastructure capacity needs based on historical usage and business growth patterns.
- Prepare: This week: Review our current observability stack. What are its blind spots? What data are we missing?
- This month: Research leading data observability platforms and AIOps solutions. What could they bring to our environment?
- Month 2: Pilot a data observability tool on a critical data pipeline or microservice, demonstrating its value in identifying issues early.
- Month 3: Develop a strategy for integrating AIOps capabilities into our incident management and capacity planning processes.
- QuickWin: Start asking your team 'What data are we missing?' during incident reviews. Push for better instrumentation on new services. Small changes in data collection can lead to big insights later.
Future Skills Closing Note
The reality is, the pace of change in infrastructure won't slow down. Your role as a manager isn't just to keep up, but to anticipate and lead. This means continuously evolving your own understanding and empowering your team to do the same. It's a journey, not a destination, and we're looking for someone who's excited by that challenge.
Education Requirements
- Level: Minimum
- Req: A Bachelor's degree in Computer Science, Engineering, or a closely related technical field.
- Alts: We're pragmatic here. If you've got equivalent practical experience (say, 15+ years in infrastructure with a proven track record of leadership and complex problem-solving), we're definitely interested. We value real-world impact over just a piece of paper.
- Level: Preferred
- Req: A Master's degree in a relevant field (e.g., Computer Science, Business Administration with a tech focus).
- Alts: Relevant industry certifications (like a top-tier cloud architect certification or a CISSP) combined with extensive experience can often be just as valuable, if not more so, than a Master's.
Experience Requirements
You'll need roughly 12-16 years of progressive experience in technology infrastructure, with at least 5-8 years specifically focused on cloud platforms (AWS, Azure, or GCP). Crucially, you must have at least three years, and ideally five, in a formal people management role, leading teams of engineers. We're looking for someone who hasn't just held a 'senior' title but has actively hired, mentored, and developed technical talent, and owned significant budgets and strategic initiatives.
Preferred Certifications
- Cert: ITIL Expert or ITIL Managing Professional
- Prod: PeopleCert (formerly AXELOS)
- Usage: Demonstrates a deep understanding of IT service management best practices, which is crucial for establishing and optimising our operational processes.
- Cert: Certified Information Systems Security Professional (CISSP)
- Prod: ISC²
- Usage: Shows a comprehensive understanding of information security principles and practices, essential for leading our security posture in the cloud.
- Cert: FinOps Certified Practitioner or FinOps Certified Professional
- Prod: FinOps Foundation
- Usage: Validates expertise in cloud financial management, which is a core responsibility for managing our multi-million pound cloud spend.
Recommended Activities
- Regularly attending industry conferences (e.g., AWS re:Invent, Azure Summit, KubeCon) to stay abreast of the latest trends and network with peers.
- Participating in online courses or bootcamps on emerging technologies like advanced AI/ML for operations, quantum computing basics, or serverless architecture patterns.
- Engaging in leadership development programmes, focusing on areas like executive presence, strategic negotiation, and advanced team coaching.
- Contributing to open-source projects or writing technical blogs/articles to share knowledge and build thought leadership.
- Mentoring junior engineers or participating in internal knowledge-sharing sessions to reinforce your own learning and give back to the community.
Career Progression Pathways
Entry Paths to This Role
- Path: Senior SRE / Senior Cloud Architect
- Time: 3-5 years in a senior individual contributor role.
- Path: Lead Infrastructure Engineer / Principal SRE
- Time: 2-4 years in a lead technical role, often with some direct reports or significant project ownership.
- Path: Technical Project/Programme Manager (Infrastructure focus)
- Time: 4-6 years managing complex infrastructure projects or programmes.
Career Progression From This Role
- Pathway: Director of Global Infrastructure
- Time: 3-5 years in the Manager, Cloud Infrastructure role.
Long Term Vision Potential Roles
- Title: VP of Engineering (Infrastructure)
- Time: 5-8 years from this role.
- Title: Chief Technology Officer (CTO)
- Time: 8-12 years from this role.
- Title: Chief Information Officer (CIO)
- Time: 8-12 years from this role.
Sector Mobility
The skills you'll gain here, especially in cloud management, SRE, FinOps, and leading high-performing teams, are highly transferable across almost any industry. Whether it's FinTech, healthcare, e-commerce, or the public sector, every organisation needs robust, cost-effective, and secure infrastructure. You'll be a sought-after leader.
How Zavmo Delivers This Role's Development
DISCOVER Phase: Skills Gap Analysis
Zavmo maps your current competencies against all requirements in this job description through conversational assessment. We evaluate your foundation skills (communication, strategic thinking), functional skills (cloud platform architecture, FinOps), and readiness for career progression.
Output: Personalised skills gap heat map showing strengths and priorities, estimated time to competency, neurodiversity accommodations.
DISCUSS Phase: Personalised Learning Pathway
Based on your DISCOVER results, Zavmo creates a personalised learning plan prioritised by impact: foundation skills first, then functional skills. We adapt to your learning style, pace, and neurodiversity needs (ADHD, dyslexia, autism).
Output: Week-by-week schedule, each module linked to specific job responsibilities, checkpoints and milestones.
DELIVER Phase: Conversational Learning
Learn through conversation, not boring modules. Zavmo uses 10 conversation types (Socratic dialogue, role-play, coaching, case studies) to build competence. Practise difficult QBR presentations, negotiate tough renewals, and handle churn conversations in a safe AI environment before facing real clients.
Example: "For 'Stakeholder Mapping', Zavmo will guide you through analysing a complex enterprise account, identifying key decision-makers, and building an engagement strategy."
DEMONSTRATE Phase: Competency Assessment
Zavmo automatically builds your evidence portfolio as you learn. Every conversation, practice scenario, and application example is captured and mapped to National Occupational Standards (NOS) performance criteria. When ready, your portfolio supports Ofqual qualification claims and demonstrates competence to employers.
Output: Competency matrix, evidence portfolio (downloadable), qualification readiness, career progression score.