Manager-Site Reliability Eng

American Express·Oracle Recruiting
Bengaluru, India · Full-time · Posted Jun 23, 2026

Manager, Site Reliability Engineering leads and mentors Site Reliability Engineering (SRE) teams, fostering a culture of continuous improvement and inclusivity, while collaborating across the organization to enhance system resilience, scalability, and alignment with business objectives.

  • Manages and leads a team of Site Reliability Engineering colleagues, enabling a culture of continuous learning, growth opportunities, and inclusivity for all individual colleagues and teams
  • Provides leadership, guidance, and coaching to Site Reliability Engineering teams, supporting training and development of best practices in software development, resiliency, and non-functional system requirements
  • Recruits and develops a high-performing team, recognizing and rewarding achievements, and creating an environment that motivates and energizes colleagues to achieve business objectives
  • Oversees and facilitates collaboration with Software Engineering teams to design and implement features that improve system resilience, scalability, and performance, ensuring optimal functionality
  • Collaborates with executives, product managers, and other stakeholders to ensure SRE principles are embedded throughout the organization
  • Leads comprehensive chaos engineering experiments and resiliency tests, driving the analysis of outcomes and the implementation of improvements that enhance system robustness and recovery capabilities
  • Plans regular drills and leads strategic planning to ensure the organization is prepared for, and can swiftly recover from, complex and unexpected disruptions
  • Collaborates and co-creates effectively with teams in product and the business to align technology initiatives with business objectives

Education Qualifications:

  • Bachelor’s degree in Computer Science, Information Technology, Engineering, or comparable experience; advanced degree preferred
  • Knowledge of the modern observability stack – Splunk, Elasticsearch, Prometheus, Grafana
  • Knowledge of containerization technologies (e.g., Kubernetes, Docker) and microservices architecture
  • Knowledge of observability tools and methodologies, including experience with logging, monitoring, tracing, and performance analysis platforms
  • Knowledge of cloud-based Site Reliability Engineering (SRE) practices and experience with public cloud platforms such as AWS, Azure, or Google Cloud

Work Experience:

  • Experience in software development or technology operations, with a focus on Site Reliability Engineering
  • Experience with Linux/Unix systems, object-oriented programming languages (e.g., Java), scripting languages (e.g., Python, Bash), and cloud platforms (e.g., AWS, Azure, GCP)

 

Licenses and Certifications:

  • Advanced certification in Site Reliability Engineering (SRE) or a related field is a plus
