Site Reliability Engineer Lead

Plano | Full-time | Posted Jul 2, 2026

Job Description:

At Bank of America, we are guided by a common purpose to help make financial lives better through the power of every connection. We do this by driving Responsible Growth and delivering for our clients, teammates, communities and shareholders every day.

Being a Great Place to Work is core to how we drive Responsible Growth. This includes our commitment to being an inclusive workplace, attracting and developing exceptional talent, supporting our teammates’ physical, emotional, and financial wellness, recognizing and rewarding performance, and how we make an impact in the communities we serve.

Bank of America is committed to an in-office culture that has specific requirements for office-based attendance and allows for an appropriate level of flexibility for our teammates and businesses based on role-specific considerations.

At Bank of America, you can build a successful career with opportunities to learn, grow, and make an impact. Join us!

This job is responsible for partnering with engineering and technology teams to implement measures prescribed by the Site Reliability Engineer teams it leads. Key responsibilities include ensuring appropriate instrumentation, tooling, ticketing, alerting and on call routines are in place for key services, demonstrating technical expertise within domains, and decomposing objectives into work units. Job expectations include advancing efficient solution delivery practices and promoting exceptional design, engineering, and organizational practices.

The individual in this role is accountable for establishing and maintaining partnerships with Application Development and Production Support teams to implement the measures prescribed through the collaboration of the Senior Site Reliability Engineer (SRE) and the SRE team(s) they are leading. This includes ensuring that the appropriate instrumentation, tooling, ticketing, alerting and on-call routines are in place for key services. The role requires a high level of technical expertise within one or more technical domains and the ability to decompose issues or objectives into units of work that can be assigned to other team members. This individual will advocate and advance more efficient solution delivery practices and evangelize great design, engineering and organizational practices.

Responsibilities:

Collaborates with Development and Infrastructure teams to understand technical solutions and implement monitoring capabilities outlined in the application and system monitoring designs put forward by the Senior Site Reliability Engineer (SRE)
Develops and maintains reliability scripts, tools and libraries, leveraging them for common instrumentation, automation, and operational needs and for mentoring SRE resources on reliability practices and established tools/capabilities
Partners to implement code changes to make use of common reliability libraries and tools and helps Application Production Services and Application Development teammates understand how to use them
Participates regularly in architecture community of practice meetings and communication via other channels
Identifies vulnerabilities and opportunities for reliability improvement, such as investigating low level error rates and 'noise' in monitoring, and defines solutions to reduce manual support effort and/or improve system reliability
Engages as a subject matter expert in major incident triage efforts, failure scenario modeling, and root cause diagnosis with the Problem Manager for major incident/problem management investigations
See the position summary and the required/desired qualifications

Required Qualifications:

5+ years of experience in platform, systems, or infrastructure engineering, with a strong focus on automation and integration
Proficiency in SRE best practices; proven ability to reduce toil and improve observability of the environment
Experience with automation and orchestration tools (e.g., Ansible or similar), and scripting with Go, Python, or equivalent
Experience with supporting enterprise service mesh platforms
Experience with Infrastructure as Code (IaC) concepts and CI/CD pipelines supporting automated builds, validation, and deployments
Experience integrating provisioning workflows with platform services such as virtualization, networking, identity, monitoring, and configuration management systems
Strong focus on testing and reliability, including automated integration/validation testing and troubleshooting of complex workflows

Desired Qualifications:

Linux System Administration
Splunk Administration
OpenShift Containers
Dynatrace Administration
Grafana
Ansible Automation
Horizon CI/CD (Jenkins, XLR, Artifactory, Bitbucket)
Azure/AWS/GCP Cloud
Fast learner
Proven ability to work independently with minimal supervision and as part of a team with direct responsibilities
Systematic problem-solving approach, sense of ownership and drive
Ability to juggle competing priorities and adapt to changes in project scope

Skills:

Automation
Collaboration
Influence
Production Support
Result Orientation
Analytical Thinking
Application Development
Architecture
Solution Design
Stakeholder Management
Other
Terraform

Shift:

1st shift (United States of America)

Hours Per Week: