Senior Manager of Site Reliability Engineering - Securitized Products, Production Management - NA
Guide and shape the future of technology at a globally recognized firm, driven by pride in ownership.
As a Senior Manager of Site Reliability Engineering at JPMorgan Chase within the Corporate Investment Bank, Markets team, you are the non-functional requirement owner and champion for the applications in your remit. You are a key influencer in your team’s strategic planning, driving continual improvement in customer experience, resiliency, security, scalability, monitoring, instrumentation, and automation of the software in your area. You act in a blameless, data-driven manner and navigate difficult situations with composure and tact.
Job responsibilities
- Manage day-to-day execution of SRE functions (workload prioritization, shift coverage, triage quality, escalations, runbooks, and handoffs) to ensure consistent and timely outcomes during market hours
- Drive reuse-first adoption of enterprise-authorized AI capabilities within the work environment to improve reliability operations and customer experience outcomes, with human-in-the-loop validation and appropriate handling of sensitive data
- Provide North America leadership for production management teams supporting trading desks across multiple Markets lines of business; ensure reliable day-to-day operations and sustained stability improvements
- Lead and coordinate L1/L2 investigations and incident response; ensure clear ownership, high-quality communications, and follow-through to root cause and prevention
- Act as a key technology partner to the trading desks: monitor operational signals, drive rapid engagement, translate business impact into technical action, and communicate clearly under pressure
- Drive adoption of SRE practices across delivery teams, ensuring best practices are implemented and demonstrated empirically via stability and reliability metrics (e.g., SLOs, error budgets, incident trends)
- Own and evolve observability (dashboards/alerts/SLOs, instrumentation, monitoring strategy) and use data to prioritize resiliency, performance, and scalability improvements
- Deliver automation and tooling that reduces operational toil and improves support effectiveness (faster diagnosis, safer remediation, repeatable fixes, and self-service workflows)
- Establish and enforce operational standards for delivery teams (operational readiness, testing discipline, release safety, rollback strategy, post-incident actions) and hold teams accountable for closing gaps
- Establish team standards for AI-assisted reliability workflows across automation and delivery practices, ensuring traceability/auditability, resiliency, and security controls
Required qualifications, capabilities, and skills
- Formal training or certification on site reliability engineering concepts and 5+ years of applied experience. In addition, 2+ years of experience leading technologists to manage and solve complex technical items within your domain of expertise
- Demonstrated experience supporting front-office / trading desk workflows or similarly time-sensitive production environments, with comfort operating during market hours
- Proven production management / SRE leadership experience (support rotations, incident response, root cause analysis, post-incident actions, reliability improvements)
- Experience leading teams in the safe use of enterprise-authorized AI capabilities within the work environment for reliability engineering workflows, including validation habits and awareness of data sensitivity
- Ability to set and reinforce organization-level practices for reviewing AI-assisted recommendations and escalating uncertain decisions while maintaining resiliency, security, and auditability outcomes
- Experience leading technologists in a player/coach capacity, including guiding support staff and influencing senior engineers and delivery teams
- Strong engineering fundamentals: distributed systems thinking, debugging, performance analysis, and pragmatic tradeoffs under pressure
- Practical AWS experience supporting production services (troubleshooting, deployment, operational visibility)
- Strong stakeholder management and communication skills, especially during incidents and high-pressure periods
- Proficiency in at least one programming language (e.g., Python, Java/Spring Boot, .NET) and ability to automate/engineer solutions that reduce toil
Preferred qualifications, capabilities, and skills
- Ability to code and demonstrate data fluency
- Prior Markets experience (Fixed Income preferred; experience supporting multiple lines of business is a plus)
- Hands-on experience with AWS CLI, CloudWatch, and cloud-native operational patterns
- Experience with Datadog (metrics/logs/traces, alert tuning, SLOs) and/or comparable observability stacks
- Experience applying AI-assisted tooling to reduce operational toil (e.g., incident summarization, runbook assistance), with appropriate controls and governance
- Experience coaching and developing engineers through individualized mentoring, and ensuring knowledge is documented and shared via internal forums and communities of practice