Principal/Manager (12-16 years)

Manager, Cloud Infrastructure

This isn't just about keeping the lights on; it's about building the engine that powers our entire business. You'll be leading a significant chunk of our infrastructure team, making sure our cloud platforms are robust, cost-effective, and ready for whatever our product teams throw at them. Frankly, you're the person who translates strategic vision into operational reality, owning a critical piece of our global tech backbone.

Job ID
JD-TEIN-MGRTEIN-005
Department
Technical Roles
NOS Level
Level 5
OFQUAL Level
Level 7-8
Experience
Principal/Manager (12-16 years)

Role Purpose & Context

Role Summary

The Manager, Cloud Infrastructure is here to lead and shape a high-performing team that designs, builds, and runs our global cloud platforms. You'll be making sure our core systems are always available, secure, and performant, which directly impacts our ability to serve customers and grow the business. You'll sit right at the intersection of our overall infrastructure strategy and the day-to-day operational excellence, translating big-picture ideas into concrete, reliable services. When this role is done well, our engineering teams can deploy faster, our customers experience fewer outages, and our cloud spend is optimised without cutting corners. When it's not, we're looking at costly outages, security vulnerabilities, and ballooning cloud bills – frankly, it impacts everything. The biggest challenge here is balancing rapid innovation with rock-solid stability and cost efficiency, all while managing a diverse team and navigating constant change. The reward, though, is seeing your team thrive, building resilient systems that genuinely enable the company's success, and knowing you're at the helm of something truly critical.

Reporting Structure

Key Stakeholders

Internal:

External:

Organisational Impact

Scope: Your work directly impacts our operational stability, security posture, and the financial health of our cloud spend. Get it right, and you're enabling faster product delivery and better customer experience. Get it wrong, and we're talking about outages, data breaches, and millions in wasted spend. It's a high-stakes game, honestly.

Performance Metrics

Quantitative Metrics

  1. Metric: Availability & Uptime
     Desc: Maintaining the highest possible uptime for our Tier 1 business applications, which are the ones that directly impact revenue or customer experience.
     Target: Achieve 99.99% ('four nines') uptime for all Tier 1 business applications.
     Freq: Monthly, reported quarterly to the executive committee.
     Example: If our main customer-facing platform is down for more than about 4.3 minutes in a 30-day month (0.01% of 43,200 minutes), we've missed this target. It's a tight margin, but that's the expectation for critical systems.
  2. Metric: Global Infrastructure Budget Variance
     Desc: Managing the multi-million pound global infrastructure budget, ensuring we stay within our allocated spend and can justify any deviations.
     Target: Manage the global infrastructure budget to within +/- 5% of the forecast in Anaplan.
     Freq: Reviewed monthly with Finance, quarterly with the Director.
     Example: If your forecasted Q2 cloud spend was £1.2M, you'd need to keep actual spend between £1.14M and £1.26M. Anything outside that requires a clear explanation and an action plan.
  3. Metric: Total Cost of Ownership (TCO) Reduction
     Desc: Actively working to reduce the overall cost of running our infrastructure, especially by optimising cloud resources and migrating off expensive legacy platforms.
     Target: Deliver a 10% reduction in Total Cost of Ownership for legacy platforms through cloud migration and optimisation within 12 months.
     Freq: Annually, tracked quarterly.
     Example: If a particular on-prem database cluster costs £500K annually to run (licences, hardware, power, people), a 10% reduction would mean finding £50K in savings by moving it to a more efficient cloud service or optimising its current setup.
  4. Metric: Mean Time To Recovery (MTTR) for P1 Incidents
     Desc: How quickly we can restore service after a critical (Priority 1) incident. This isn't just about fixing it, but about getting it back to a stable state.
     Target: Reduce MTTR for P1 incidents by 15% year-over-year.
     Freq: Quarterly, based on post-mortem data.
     Example: If our average P1 MTTR was 60 minutes last year, you'd be aiming for 51 minutes or less this year. It's a tough one, but crucial for business continuity.
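
For anyone who wants to check the arithmetic behind these targets, here is a minimal sketch. The function names are hypothetical and exist only for illustration; the inputs are the figures quoted in the examples above, and a 30-day month is assumed for the uptime calculation.

```python
# Illustrative helpers only: the names are hypothetical, not part of any internal tool.

def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` at the given availability target."""
    return (1 - availability) * days * 24 * 60


def tolerance_band(forecast: float, tolerance: float = 0.05) -> tuple[float, float]:
    """Lower and upper spend limits for a forecast with a +/- tolerance."""
    return forecast * (1 - tolerance), forecast * (1 + tolerance)


def reduced_target(baseline: float, reduction: float) -> float:
    """Value remaining after cutting `baseline` by the fractional `reduction`."""
    return baseline * (1 - reduction)


if __name__ == "__main__":
    # 99.99% uptime over an assumed 30-day month
    print(f"Downtime budget: {downtime_budget_minutes(0.9999):.2f} minutes")

    # +/- 5% around a £1.2M quarterly forecast
    low, high = tolerance_band(1_200_000)
    print(f"Budget band: £{low:,.0f} to £{high:,.0f}")

    # 10% TCO reduction on a £500K platform, 15% MTTR reduction on 60 minutes
    print(f"TCO target: £{reduced_target(500_000, 0.10):,.0f}")
    print(f"MTTR target: {reduced_target(60, 0.15):.0f} minutes")
```

The uptime figure is the one that surprises people: four nines leaves only a few minutes of slack per month, so a single slow failover can consume the whole allowance.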

Qualitative Metrics

  1. Metric: Team Health & Development
     Desc: Ensuring your team is engaged, growing, and feels supported. This isn't just about output, but about building a sustainable, high-performing group.
     Evidence: High team retention rates (above 85%), positive feedback in internal engagement surveys, clear progression plans for individual team members, successful internal promotions, and active participation in mentorship programmes.
  2. Metric: Strategic Initiative Adoption
     Desc: How effectively your team adopts and champions new infrastructure strategies, like FinOps or Zero Trust, across the wider engineering organisation.
     Evidence: Successful implementation of FinOps practices leading to demonstrable cost savings, positive feedback from other engineering teams on new security standards, and clear evidence of architectural patterns shifting towards the defined strategy.
  3. Metric: Risk & Compliance Posture
     Desc: Maintaining a strong security and compliance posture for our cloud infrastructure, minimising audit findings and ensuring we meet regulatory requirements.
     Evidence: Zero critical or major findings in internal or external audits related to your domain, successful completion of compliance certifications (e.g., SOC 2), and proactive identification and mitigation of potential security risks.
  4. Metric: Vendor Relationship Management
     Desc: Building and maintaining strong, productive relationships with our key cloud providers and infrastructure vendors, ensuring we get the best service and value.
     Evidence: Positive feedback from vendor account managers, successful negotiation of enterprise agreements, proactive engagement on new features and cost optimisation programmes, and efficient resolution of vendor-related issues.

Primary Traits

Supporting Traits

Primary Motivators

  1. Motivator: Building and Developing High-Performing Teams
     Daily: You'll spend time coaching individual engineers, helping them unblock technical challenges, and ensuring they have clear career paths. You'll also be actively involved in hiring and onboarding new talent, shaping the team's culture and capabilities.
  2. Motivator: Driving Strategic Impact and Transformation
     Daily: You'll be leading initiatives like multi-cloud migrations, implementing SRE principles, or rolling out new security architectures. This means defining the 'how,' getting buy-in, and seeing these complex programmes through to completion.
  3. Motivator: Solving Complex Organisational Challenges
     Daily: You won't just be solving technical problems; you'll be tackling issues like 'cloud sprawl,' balancing security and development velocity, or getting disparate teams to agree on a common infrastructure standard. It's often more about people and process than pure tech.

Potential Demotivators

Honestly, this role isn't for everyone. If you're someone who needs every project to be perfectly defined from day one, or if you prefer to just focus on the technical bits without the 'people stuff,' you might find it tough going. You'll still get the 3 AM PagerDuty call when a critical system fails, and you're the ultimate escalation point – that doesn't stop just because you're a manager. You'll spend a fair bit of time fighting the cloud bill, constantly justifying a massive, fluctuating Opex expense to a finance department that's used to predictable Capex. You'll also inherit some legacy tech anchors; that critical, ancient on-premise system that no one understands but is too risky to decommission? Yeah, that's yours now, and it'll prevent some of your modernisation efforts.

Common Frustrations

  1. The 3 AM PagerDuty Alert: The unavoidable reality that you are the ultimate escalation point when critical systems break in the middle of the night. Your phone will ring.
  2. Fighting the Cloud Bill: Constantly justifying a massive, fluctuating Opex bill to a finance department accustomed to predictable Capex cycles. It's a never-ending battle.
  3. Legacy Tech Anchors: Being held back by a critical, ancient on-premise system that no one understands but is too risky to decommission, preventing modernisation efforts and draining resources.
  4. Shadow IT Ambush: Discovering a department has been running a massive, unsecured data analytics workload on a credit card for six months, and now it's your problem to secure and support it – often with no budget.
  5. The Security vs. Velocity Squeeze: Getting caught between the security team demanding you lock everything down and development teams demanding faster, more flexible access. It's a constant balancing act.
  6. Vendor Lock-in Regret: Dealing with the long-term consequences (and costs) of a strategic platform decision made five years ago that is now a technical and financial burden, but you're stuck with it.

What Role Doesn't Offer

  1. A purely hands-on technical role: While you need deep technical knowledge, your day-to-day won't be writing code or configuring servers. It's about guiding and enabling others.
  2. A predictable 9-to-5 schedule: Incidents don't care about your calendar, and strategic planning often requires deep thought outside normal hours.
  3. An environment without ambiguity: You'll be dealing with complex, ill-defined problems where there isn't always a clear 'right' answer.
  4. A role where you can avoid difficult conversations: Managing people and budgets means tough decisions and sometimes even tougher conversations.

ADHD Positives

  1. The fast-paced, high-stakes nature of incident response can be incredibly engaging, playing to strengths in hyperfocus and rapid problem-solving.
  2. The need to jump between strategic planning, team management, and urgent operational issues can suit individuals who thrive on varied tasks and novelty.
  3. Leading multiple initiatives simultaneously can provide the stimulation and challenge that keeps things interesting.

ADHD Challenges and Accommodations

  1. The constant context-switching between strategic, tactical, and operational demands can be overwhelming if not managed well. We can help by clearly defining priorities and providing tools for task management.
  2. Long, detailed budget reviews or compliance documentation might be challenging. We can support with structured templates, dedicated focus time, or delegation where appropriate.
  3. Maintaining focus during lengthy, less stimulating meetings can be difficult. We encourage active participation, note-taking, and breaks.

Dyslexia Positives

  1. Often excel at big-picture thinking, identifying patterns, and strategic problem-solving, which are crucial for infrastructure architecture and long-term planning.
  2. Strong verbal communication skills can be a huge asset when influencing stakeholders and leading a team through complex technical concepts.
  3. A knack for visualising complex systems and processes, which is invaluable for designing robust infrastructure.

Dyslexia Challenges and Accommodations

  1. Extensive reading of technical documentation, vendor contracts, or detailed reports might be time-consuming. We can offer text-to-speech software, provide executive summaries, or encourage verbal briefings.
  2. Writing detailed post-mortems or board updates can be a hurdle. We can support with AI drafting tools (as mentioned below), proofreading resources, or templates that structure the information clearly.
  3. Ensuring clarity in written communications can take extra effort. We encourage using simple language, bullet points, and having a colleague review important documents.

Autism Positives

  1. A deep, analytical approach to problem-solving, particularly for complex system failures or architectural challenges, can be a significant strength.
  2. Exceptional attention to detail in technical specifications, security configurations, and compliance requirements.
  3. A preference for logical, data-driven decision-making, which is essential when managing infrastructure performance and costs.
  4. Strong adherence to established processes (like ITIL) can ensure consistency and reliability in operations.

Autism Challenges and Accommodations

  1. Navigating complex social dynamics and unspoken expectations in leadership roles can be challenging. We foster a direct, transparent communication culture and provide clear expectations for leadership behaviours.
  2. Unexpected changes in priorities or sudden, urgent incidents might be disruptive. We aim for clear communication about changes and provide structured incident response protocols.
  3. Busy, open-plan office environments can cause sensory overload. We offer flexible working arrangements, quiet zones, and noise-cancelling headphones.

Sensory Considerations

Our main office is a modern, open-plan space, which can sometimes get a bit lively. That said, we've got plenty of quiet zones, meeting rooms you can book for focused work, and we're pretty flexible with working from home a few days a week. We want you to be comfortable and productive, so if you need specific adjustments like noise-cancelling headphones or a particular lighting setup, just let us know. We'll make it work.

Flexibility Notes

We're big believers in flexibility. While there are core hours for team meetings and collaboration, we trust you to manage your time effectively. If you need to pick up kids from school, or prefer to work some evenings, that's generally fine, as long as the work gets done and your team is supported. We're about outcomes, not clock-watching.

Key Responsibilities

Experience Levels Responsibilities

  Level: Manager, Cloud Infrastructure (12-16 years experience)
  Responsibilities:
  1. Build and lead a high-performing team of Cloud and SRE engineers, including hiring, performance management, coaching, and career development. You're responsible for their growth and well-being.
  2. Own the budget for your specific infrastructure domain (e.g., Cloud Operations, SRE), which typically ranges from £500K to £2M annually. That means forecasting, tracking, and justifying every pound spent.
  3. Define and drive the strategic roadmap for your area of responsibility, ensuring it aligns with the overall global infrastructure vision and wider business objectives. You'll be setting the direction.
  4. Oversee the design, implementation, and operation of our critical cloud infrastructure, making sure it's resilient, secure, scalable, and cost-optimised. This isn't hands-on coding, but deep architectural oversight.
  5. Establish and champion SRE principles and FinOps practices within your team and across the broader engineering organisation. You'll be driving a cultural shift towards proactive reliability and cost accountability.
  6. Manage key vendor relationships, particularly with our major cloud providers (AWS, Azure, GCP), negotiating contracts and ensuring we're getting maximum value from our partnerships.
  7. Be the ultimate escalation point for major incidents within your domain, providing decisive leadership and clear communication during critical outages. You'll be the one making the tough calls.
  Supervision: You'll be largely self-directed, working against quarterly objectives set with the Director of Global Infrastructure. Expect monthly 1:1s and strategic alignment meetings, but otherwise, you're trusted to get on with it.
  Decision: You have full authority over your functional budget (up to £2M), hiring and firing decisions within your team, and vendor selection up to £100K. Strategic decisions that impact other departments or require significant capital expenditure (above £2M) will need alignment with the Director and potentially the SVP of Engineering. Organisational design within your team is yours to define.
  Success: Success looks like a highly engaged and effective team, consistently meeting availability and cost targets, and successfully delivering on strategic initiatives that move our infrastructure forward. You'll be measured on the health of your team, the reliability of your systems, and your financial stewardship.

Decision-Making Authority

Save 15-25 hours weekly: Supercharge your Infrastructure Leadership with AI

Let's be real, managing a global infrastructure team means a mountain of tasks, from budget reviews to incident reports. Imagine if a significant chunk of that could be handled by AI, freeing you up for the truly strategic work. That's not a pipe dream; it's happening now.

Tool: Predictive Outage Detection (AIOps)

Benefit: Use AI tools to analyse vast amounts of monitoring data, predicting potential hardware failures or service degradation *before* they cause a user-facing outage. This means fewer 3 AM pager calls and more proactive fixes. Honestly, it's a game-changer for incident prevention.

Tool: Cloud Cost Anomaly Analysis

Benefit: Deploy AI-powered FinOps tools to automatically scan our AWS, Azure, and GCP bills. It'll instantly flag anomalous spend, identify specific resources responsible for cost overruns, and even suggest optimisation strategies. This automates hours of manual spreadsheet analysis, giving you back precious time for strategic cost management.
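
To make "flag anomalous spend" concrete, here is a minimal sketch of one simple way to do it: compare each day's spend with the mean and standard deviation of the preceding week. This is an assumption for illustration, not a description of how any particular FinOps product works; the function name, window, threshold, and sample figures are all hypothetical.

```python
from statistics import mean, stdev


def flag_anomalies(daily_spend, window=7, threshold=3.0):
    """Return indices of days whose spend sits more than `threshold`
    standard deviations above the mean of the preceding `window` days."""
    flagged = []
    for i in range(window, len(daily_spend)):
        history = daily_spend[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip perfectly flat history to avoid dividing by zero.
        if sigma > 0 and (daily_spend[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Made-up daily spend in pounds; day 7 is a deliberate spike.
sample = [100, 102, 98, 101, 99, 103, 100, 250, 101]
print(flag_anomalies(sample))  # -> [7]
```

Real tooling typically adds seasonality handling and per-resource attribution on top of this; the sketch only shows the basic idea of comparing today's spend against recent history.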

Tool: Vendor Contract & Security Report Summarisation

Benefit: Feed lengthy vendor contracts, SOC 2 reports, or new CVE vulnerability descriptions into an LLM. Get a concise summary of key risks, obligations, and action items in minutes, not hours. It's like having a dedicated research assistant for all your due diligence.

Tool: Post-Mortem & Board Update Drafting

Benefit: Provide an LLM with a timeline of an incident and key data points. Ask it to generate a first draft of a blameless post-mortem or a non-technical executive summary for a board update. This significantly reduces writer's block on critical communications, letting you focus on the content and strategy.

Weekly time savings potential: Expect to save 15-25 hours weekly, giving you more time for strategic thinking and team development.
Typical tool investment: You'll have access to 4 core AI tools that are specifically tailored for infrastructure management.

Competency Requirements

Foundation Skills (Transferable)

These are the bedrock skills that every leader needs, but for a Manager, Cloud Infrastructure, they take on a specific flavour. It's about leading people, solving complex, often ambiguous problems, and communicating effectively across the organisation.

Functional Skills (Role-Specific Technical)

These are the specific technical and domain-level skills you'll need to effectively lead our Cloud Infrastructure team. You won't be hands-on with every tool, but you'll need a deep understanding to guide your team and make informed strategic decisions.

Technical Competencies

Digital Tools

Industry Knowledge

Regulatory Compliance Regulations

Essential Prerequisites

Career Pathway Context

These prerequisites mean you've already walked the path of a Senior SRE or Lead Infrastructure Engineer. You've seen the challenges, you've built the systems, and now you're ready to lead the people who do it. We're looking for someone who's not just technically brilliant, but also a proven leader and strategic thinker. If you haven't managed a team before, this role probably isn't the right fit just yet, but keep an eye out for our Lead roles!

Qualifications & Credentials

Emerging Foundation Skills

Advancing Technical Skills

Future Skills Closing Note

The reality is, the pace of change in infrastructure won't slow down. Your role as a manager isn't just to keep up, but to anticipate and lead. This means continuously evolving your own understanding and empowering your team to do the same. It's a journey, not a destination, and we're looking for someone who's excited by that challenge.

Education Requirements

Experience Requirements

You'll need roughly 12-16 years of progressive experience in technology infrastructure, including at least 5 years specifically focused on cloud platforms (AWS, Azure, or GCP). Crucially, you must have at least 3 years in a formal people management role, leading teams of engineers. We're looking for someone who's not just been 'senior' but has actively hired, mentored, and developed technical talent, and owned significant budgets and strategic initiatives.

Preferred Certifications

Recommended Activities

Career Progression Pathways

Entry Paths to This Role

Career Progression From This Role

Long Term Vision Potential Roles

Sector Mobility

The skills you'll gain here—especially in cloud management, SRE, FinOps, and leading high-performing teams—are highly transferable across almost any industry. Whether it's FinTech, healthcare, e-commerce, or even public sector, every organisation needs robust, cost-effective, and secure infrastructure. You'll be a sought-after leader.

How Zavmo Delivers This Role's Development

DISCOVER Phase: Skills Gap Analysis

Zavmo maps your current competencies against all requirements in this job description through conversational assessment. We evaluate your foundation skills (communication, strategic thinking), functional skills (cloud platforms, SRE, FinOps), and readiness for career progression.

Output: Personalised skills gap heat map showing strengths and priorities, estimated time to competency, neurodiversity accommodations.

DISCUSS Phase: Personalised Learning Pathway

Based on your DISCOVER results, Zavmo creates a personalised learning plan prioritised by impact: foundation skills first, then functional skills. We adapt to your learning style, pace, and neurodiversity needs (ADHD, dyslexia, autism).

Output: Week-by-week schedule, each module linked to specific job responsibilities, checkpoints and milestones.

DELIVER Phase: Conversational Learning

Learn through conversation, not boring modules. Zavmo uses 10 conversation types (including Socratic dialogue, role-play, coaching, and case studies) to build competence. Practise leading a major-incident call, justifying cloud spend to Finance, and negotiating vendor contracts in a safe AI environment before facing the real thing.

Example: "For 'Stakeholder Mapping', Zavmo will guide you through analysing a complex enterprise account, identifying key decision-makers, and building an engagement strategy."

DEMONSTRATE Phase: Competency Assessment

Zavmo automatically builds your evidence portfolio as you learn. Every conversation, practice scenario, and application example is captured and mapped to NOS performance criteria. When ready, your portfolio supports OFQUAL qualification claims and demonstrates competence to employers.

Output: Competency matrix, evidence portfolio (downloadable), qualification readiness, career progression score.