Manager, Site Reliability Engineering

Bengaluru, India | Full-time | Posted Jun 23, 2026

Manager, Site Reliability Engineering leads and mentors Site Reliability Engineering (SRE) teams, fostering a culture of continuous improvement and inclusivity, while collaborating across the organization to enhance system resilience, scalability, and alignment with business objectives.

Manages and leads a team of Site Reliability Engineering colleagues, enabling a culture of continuous learning, growth opportunities, and inclusivity for all individual colleagues and teams
Provides leadership, guidance, and coaching to Site Reliability Engineering teams, supporting training and development of best practices in software development, resiliency, and non-functional system requirements
Recruits and develops a high-performing team, recognizing and rewarding achievements and creating an environment that motivates and energizes colleagues to achieve business objectives
Oversees and facilitates collaboration with Software Engineering teams to design and implement features that improve system resilience, scalability, and performance
Collaborates with executives, product managers, and other stakeholders to ensure SRE principles are embedded throughout the organization
Leads comprehensive chaos engineering experiments and resiliency tests, driving the analysis of outcomes and the implementation of improvements that enhance system robustness and recovery capabilities
Conducts regular drills and strategic planning to ensure the organization is prepared for, and can swiftly recover from, complex and unexpected disruptions
Collaborates and co-creates effectively with teams in product and the business to align technology initiatives with business objectives

Education and Qualifications:

Bachelor’s degree in Computer Science, Information Technology, or Engineering, and/or comparable experience; advanced degree preferred
Knowledge of a modern observability stack (e.g., Splunk, Elasticsearch, Prometheus, Grafana)
Knowledge of containerization technologies (e.g., Kubernetes, Docker) and microservices architecture
Knowledge of observability tools and methodologies, including experience with logging, monitoring, tracing, and performance analysis platforms
Knowledge of cloud-based Site Reliability Engineering (SRE) practices and experience with public cloud platforms such as AWS, Azure, or Google Cloud

Work Experience:

Experience in software development or technology operations, with a focus on Site Reliability Engineering
Experience with Linux/Unix systems, object-oriented programming languages (e.g., Java), scripting languages (e.g., Python, Bash), and cloud platforms (e.g., AWS, Azure, GCP)

Licenses and Certifications:

Advanced certification in Site Reliability Engineering (SRE) or a related field is a plus