Senior Manager, Core Infrastructure Engineering
Manages teams delivering scalable distributed systems and components on a 2–4 quarter horizon. Standardizes engineering practices and scalability requirements across teams; oversees optimization for high‑throughput, hyper‑scale workloads; and ensures effective use of distributed state tools and data plane platforms. Guides teams to design fault‑tolerant, in‑service‑upgradable systems, set SLO‑aligned durability/availability targets, and implement resiliency mechanisms (load‑shedding, throttling, rate‑limiting). Provides oversight for KPIs, telemetry, and moderately complex dashboards; directs design of functional/correctness requirements, fault‑injection tests, and replication/synchronization strategies. Ensures proactive incident management, operational readiness, and on‑call coverage; drives encryption/access control practices, remediation plans, and compliance documentation. Oversees development and maintenance of automation/IaC and partners with teams on change‑management plans enabling safe patching, updates, and rollbacks.
Key Responsibilities
System Design & Architecture – System Scalability:
- Manages the development and implementation of scalable distributed systems and components across multiple teams, including the effective use of distributed state management tools.
- Oversees code and/or system optimization efforts for large-scale data processing and high-throughput requirements within and across teams to support hyper-scale systems.
- Guides teams to define scalability requirements for owned components and ensures design and implementation requirements are met.
- Manages the use of data plane platforms to effectively handle large-scale data retrieval, storage, and processing.
- Ensures teams design accurate and representative performance and load tests.
System Design & Architecture – System Reliability Design:
- Manages the strategy for building fault-tolerant components and systems capable of withstanding in-service updates by guiding the implementation of redundancy, replication, and automatic failover mechanisms.
- Develops design strategies for systems to effectively handle service disruptions (e.g., network partitions) by making deliberate trade-offs among consistency, availability, and partition tolerance.
- Leads cross-team initiatives to implement and optimize approaches for handling network unreliability, including load-shedding, throttling, and rate-limiting.
- Guides teams to design components and systems that are durable and adhere to service level objectives (SLOs), setting expectations for availability and durability of other computing services within the department.
System Design & Architecture – System Reliability Performance:
- Provides oversight in defining key performance indicators (KPIs) and telemetry to identify gaps or issues in running systems.
- Oversees the building and customization of moderately complex dashboards, telemetry systems, and alerting mechanisms to proactively monitor components and system health.
System Design & Architecture – Correctness / Availability:
- Oversees the design and implementation of functional and correctness requirements for feature sets and/or components in new or existing systems.
- Guides teams to design complex test scenarios (e.g., fault-injection, brown-out) to evaluate system correctness.
- Directs implementation strategies for data replication and synchronization techniques to maintain data integrity and availability.
Operational Troubleshooting & Incident Management:
- Guides teams to be proactive when diagnosing, debugging, and resolving issues in active components and systems to support ongoing operation.
- Ensures teams apply their expertise to resolve issues without service interruptions, so that no maintenance windows are required for customers and users.
- Oversees operational readiness protocols and ensures teams remain knowledgeable about owned components and systems to support effective troubleshooting and performance.
- Oversees and approves schedules for operational support rotations.
Compliance & Security:
- Oversees implementation of robust security measures to protect data and applications in multi-tenant environments, ensuring team strategies incorporate encryption techniques and access controls.
- Directs execution of remediation plans to address identified security gaps, promoting continuous improvement of security measures.
- Ensures documentation is comprehensive and cloud infrastructure complies with industry standards and regulations.
Automation & Change Management:
- Oversees the development and maintenance of automation scripts and tools (e.g., Infrastructure as Code (IaC)) to manage cloud infrastructure.
- Works with teams to create and adhere to change management plans for patching, updating, and rolling back applications, and guides development of components to allow for automation of these processes.
Core Responsibilities
Planning & Execution:
- Manages multiple medium- to large-scale projects or initiatives across teams, ensuring timelines, deliverables, and budgets (when applicable) are monitored and met.
- Provides direction to teams on project work, setting priorities, and aligning with business needs.
- Guides teams on adjusting plans to accommodate resource or timeline changes.
Collaboration & Partnership:
- Drives cross-functional partnerships to align on expectations and shared objectives across multiple teams.
- Coaches team members to develop strategic relationships with business leaders, stakeholders, and external partners to foster collaboration and long-term success.
- Promotes inclusivity by actively seeking and listening to diverse perspectives, ensuring others feel heard and respected.
Problem Solving:
- Provides direction to multiple teams on addressing complex operational and/or technical issues, as well as guidance on analyzing complex data and/or information to identify solutions.
- Reviews and provides insights into unresolved or critical issues, helping teams to identify potential solutions.
Continuous Learning:
- Models continuous learning to deepen expertise and stay ahead of industry trends, integrating best practices into strategic planning.
- Leverages feedback to drive personal and team skill improvements.
- Identifies skill gaps across teams and empowers team members to pursue learning and knowledge-sharing opportunities that build their expertise in new areas, coaching them to apply learnings to advance the organization.
Continuous Improvement:
- Drives and oversees teams as they collaborate on, develop, and implement ideas to increase the efficiency and effectiveness of processes, protocols, and workflows within and across teams.
- Guides teams to adopt new ideas for alternative approaches and methods and encourages feedback for continued improvement.
Performance and Development:
- Drives performance across teams by providing feedback and coaching in alignment with performance management processes, guidelines, and expectations.
- Discusses development goals with team members, shares opportunities to facilitate career development, and ensures individual goals are aligned with broader organizational goals.
- Develops and manages the talent acquisition pipeline by leading candidate interviews, monitoring promotion eligibility, and/or orchestrating talent resources.
Minimum Job Qualifications
Education and/or Experience:
- 9 years of experience in software development
OR
- Bachelor of Technology (B.Tech) Degree in Computer Science, Computer Engineering, Software Engineering, Electrical/Electronics Engineering, Computer Information Systems, Information Systems, Information Technology, Telecommunications, Mathematics, Physics, or related field AND 5 years of experience in software development
OR
- Bachelor’s Degree in Computer Science, Computer Engineering, Software Engineering, Electrical/Electronics Engineering, Computer Information Systems, Information Systems, Information Technology, Telecommunications, Mathematics, Physics, or related field AND 5 years of experience in software development
OR
- Master of Technology (M.Tech) Degree in Computer Science, Computer Engineering, Software Engineering, Electrical/Electronics Engineering, Computer Information Systems, Information Systems, Information Technology, Telecommunications, Mathematics, Physics, or related field AND 3 years of experience in software development
OR
- Master’s Degree in Computer Science, Computer Engineering, Software Engineering, Electrical/Electronics Engineering, Computer Information Systems, Information Systems, Information Technology, Telecommunications, Mathematics, Physics, or related field AND 3 years of experience in software development
OR
- Doctorate in Computer Science, Computer Engineering, Software Engineering, Electrical/Electronics Engineering, Computer Information Systems, Information Systems, Information Technology, Telecommunications, Mathematics, Physics, or related field AND 1 year of experience in software development.
Job Skills:
- Same skills as prior level, plus:
- Agile Methodologies: Demonstrated ability to use agile methodologies to drive continuous improvement and product delivery.
- Automation: Demonstrated ability in or knowledge of automation, including designing, implementing, and managing automated tools, processes, or systems to streamline operations.
- Compliance: Demonstrated knowledge of and adherence to regulatory, legal, and organizational compliance requirements.
- Data Management: Demonstrated ability in or knowledge of data management, including designing, operationalizing, securing, and monitoring data pipelines.
- Data Integration: Demonstrated ability in or knowledge of data integration, including specifying and using services to load and manage data from various sources.
- Patent Innovation / Filing: Demonstrated ability to prepare, file, and manage patent applications for innovative products or technology.
Cloud – Cloud Platforms Experience:
- 3 years of professional experience with cloud platforms (e.g., AWS, Azure, Google Cloud, Oracle Cloud).
Testing and Automation Experience:
- 6 years of experience working in testing and automation at the system level.
Distributed Systems Experience:
- 2 years of experience delivering and operating large-scale distributed systems.
Preferred Job Qualifications
Education and/or Experience:
- 10 years of experience in software development
OR
- Bachelor of Technology (B.Tech) Degree in Computer Science, Computer Engineering, Software Engineering, Electrical/Electronics Engineering, Computer Information Systems, Information Systems, Information Technology, Telecommunications, Mathematics, Physics, Business Administration, or related field AND 6 years of experience in software development
OR
- Bachelor’s Degree in Computer Science, Computer Engineering, Software Engineering, Electrical/Electronics Engineering, Computer Information Systems, Information Systems, Information Technology, Telecommunications, Mathematics, Physics, Business Administration, or related field AND 6 years of experience in software development
OR
- Master of Technology (M.Tech) Degree in Computer Science, Computer Engineering, Software Engineering, Electrical/Electronics Engineering, Computer Information Systems, Information Systems, Information Technology, Telecommunications, Mathematics, Physics, Business Administration, or related field AND 4 years of experience in software development
OR
- Master’s Degree in Computer Science, Computer Engineering, Software Engineering, Electrical/Electronics Engineering, Computer Information Systems, Information Systems, Information Technology, Telecommunications, Mathematics, Physics, Business Administration, or related field AND 4 years of experience in software development
OR
- Doctorate in Computer Science, Computer Engineering, Software Engineering, Electrical/Electronics Engineering, Computer Information Systems, Information Systems, Information Technology, Telecommunications, Mathematics, Physics, Business Administration, or related field AND 2 years of experience in software development.
People Leadership / Management Experience:
- 2 years of experience in a technical lead role with direct reports.