This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Platform Reliability Engineer based in the United States.
This role is focused on ensuring the stability, scalability, and performance of large-scale distributed systems that power critical business services.
You will operate at the intersection of software engineering and infrastructure, building automation and observability solutions that reduce operational friction and improve system resilience.
The position plays a key role in defining and maintaining service reliability standards, including SLOs and incident response practices.
You will work closely with engineering teams to design systems that are fault-tolerant, highly available, and production-ready from day one.
The environment is fast-paced and highly technical, emphasizing automation, continuous improvement, and strong engineering discipline.
You will also be responsible for improving deployment practices, monitoring systems, and operational workflows across production platforms.
This is a hands-on role where your work directly impacts uptime, customer experience, and engineering efficiency.
Accountabilities:
- Define, monitor, and continuously improve service-level objectives (SLOs), service-level indicators (SLIs), and error budgets to guide reliability priorities.
- Lead incident response efforts, including acting as incident commander, coordinating resolution, and driving post-incident reviews.
- Design and implement observability solutions using modern tooling for monitoring, logging, tracing, and alerting.
- Build and maintain automation tools to eliminate operational toil and improve system efficiency and repeatability.
- Architect and operate Kubernetes-based infrastructure, including scaling, networking, and workload optimization.
- Develop and improve CI/CD pipelines to support safe, frequent, and reliable software delivery.
- Conduct capacity planning, performance engineering, and reliability testing, including load and chaos testing initiatives.
- Partner with engineering teams to embed reliability, security, and fault tolerance into system design from the outset.
- Improve system resilience through redundancy, failover strategies, and proactive dependency management.
- Mentor engineers and contribute to a strong operational excellence culture across the organization.
Requirements:
- Bachelor’s degree in Computer Science, Engineering, or a related technical field.
- 5+ years of experience in Site Reliability Engineering, DevOps, or production infrastructure roles.
- Strong programming skills in Python, Go, or Java for automation and tooling development.
- Hands-on experience operating Linux-based production systems at scale, including networking and performance tuning.
- Proven experience managing Kubernetes environments and containerized workloads in production.
- Strong knowledge of observability stacks such as Prometheus, Grafana, OpenTelemetry, ELK/EFK, or similar tools.
- Experience building CI/CD pipelines and supporting production deployment workflows.
- Solid understanding of distributed systems concepts, including fault tolerance and system consistency.
- Demonstrated experience in incident management and production troubleshooting.
- Strong communication skills, with the ability to collaborate across engineering and operations teams.
- Experience with cloud platforms (AWS, Azure, or GCP) and familiarity with reliability engineering practices such as SLOs or chaos engineering are a plus.
Benefits:
- Competitive salary range of $135,000 – $185,000 annually.
- 100% remote position within the United States.
- Full-time W2 employment with long-term stability.
- Opportunity to work on large-scale distributed systems and mission-critical infrastructure.
- Exposure to modern cloud, Kubernetes, and observability ecosystems.
- Career growth through technical leadership, mentoring, and ownership of reliability strategy.
- Inclusive and equal opportunity workplace culture.