This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Strategic Operations Engineer III based in the United States.
This role sits at the intersection of engineering operations, reliability, and AI-driven systems management, with a strong focus on keeping large-scale cloud infrastructure resilient, observable, and continuously improving. You will be responsible for shaping how operational processes are run across incident, problem, and change management, ensuring high availability and rapid recovery across complex distributed systems. The environment is highly technical and data-driven, with a strong emphasis on automation, AI-enabled insights, and operational excellence at scale. You will partner closely with engineering teams to improve system reliability, reduce noise in monitoring, and accelerate resolution of critical incidents. This is a high-impact role where you will influence both day-to-day operational stability and long-term platform resilience. It is ideal for someone who thrives in fast-paced environments where engineering rigor and structured operational thinking are equally important.
Accountabilities:
- Lead end-to-end incident management processes, including detection, triage, escalation, coordination, and resolution of high-severity production issues.
- Drive major incident management (MIM) communications and ensure clear, timely updates across stakeholders during critical events.
- Develop and improve incident response playbooks, runbooks, and automation to reduce MTTR and improve operational consistency.
- Own and evolve problem management practices, leveraging data and AI/ML insights to identify recurring issues and drive long-term remediation.
- Lead change management processes, including CAB governance, risk evaluation, and enforcement of safe, compliant deployment practices.
- Enhance observability and monitoring systems to reduce alert fatigue and improve signal quality across large-scale environments.
- Apply AIOps methodologies to detect anomalies, enable predictive alerting, and improve root cause analysis and operational workflows.
Requirements:
- 5+ years of experience in IT operations, Site Reliability Engineering (SRE), or similar infrastructure-focused roles in large-scale environments.
- Strong expertise in incident, problem, and change management frameworks (ITIL or equivalent).
- Hands-on experience improving operational processes, governance models, and production reliability in high-availability systems.
- Solid understanding of AI/ML concepts such as anomaly detection, predictive analytics, and data-driven operational insights.
- Experience with AIOps platforms or building automation and AI-driven operational solutions for monitoring and incident response.
- Proficiency with operational tooling such as Jira, ServiceNow, FireHydrant, Moogsoft, or similar platforms.
- Strong communication, analytical, and stakeholder management skills with the ability to drive cross-functional alignment.
Benefits:
- Competitive salary range ($123,000 – $175,000 USD) with performance-based compensation considerations
- Comprehensive health, dental, and vision insurance coverage
- Remote-first work environment across the United States
- Opportunity to work on large-scale, high-availability cloud infrastructure systems
- Strong focus on automation, AI-driven operations, and continuous improvement
- Collaborative engineering culture with emphasis on ownership and operational excellence
- Commitment to diversity, equity, inclusion, and employee belonging