This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Senior Software Engineer, Cloud Development, based in Canada.
This role sits at the core of a modern AI platform team responsible for building and operating large-scale infrastructure that powers intelligent product experiences. You will design and maintain cloud-native services that support model training, deployment, and high-throughput inference in production environments. The work spans distributed systems, Kubernetes-based orchestration, and GPU-accelerated workloads at global scale. You will contribute to the evolution of reliable, secure, and privacy-conscious AI systems used by millions of users. The environment is highly collaborative, bringing together engineering, product, infrastructure, and security teams. This is a hands-on role for someone who thrives in complex backend systems and cares deeply about performance, scalability, and operational excellence.
Accountabilities:
- Design, build, and operate scalable platform services and APIs that support production AI and backend workloads.
- Own service reliability end-to-end, improving availability, latency, scalability, and cost efficiency across distributed systems.
- Develop and optimize Kubernetes-based infrastructure, including deployment pipelines, environment configuration, and resource management.
- Improve service lifecycle practices such as packaging, versioning, testing, validation, and automated deployments.
- Implement observability systems (metrics, logging, tracing, alerting) to strengthen operational visibility and incident response.
- Collaborate with cross-functional teams to deliver secure, scalable, and privacy-respecting platform capabilities.
- Participate in architectural discussions, operational processes, on-call rotations, and incident postmortems while mentoring peers.
Requirements:
- Bachelor’s degree with 4–6+ years of relevant experience, or equivalent hands-on production systems experience.
- Strong Python development skills with experience building maintainable services, libraries, and CLIs.
- Proven experience running production workloads in cloud environments (GCP preferred) and managing infrastructure at scale.
- Deep knowledge of Kubernetes and Helm, including multi-environment deployments and progressive rollouts.
- Experience with infrastructure-as-code tools such as Terraform for provisioning and managing cloud resources.
- Strong understanding of distributed systems, API design, and production-grade service reliability.
- Familiarity with observability tools (e.g., Grafana) and debugging performance or reliability issues in complex systems.
- Excellent communication skills and experience collaborating across engineering, product, and infrastructure teams.
- On-call and incident response experience in production environments.
- Bonus: experience with GPU workloads, Ray/Ray Serve, ML infrastructure, or multi-provider LLM systems.
Benefits:
- Competitive performance-based bonus program with shared success model.
- Comprehensive medical, dental, and vision coverage.
- Strong retirement contributions with immediate 100% vesting.
- Quarterly company-wide wellness days and additional paid holidays.
- Home office stipend and annual professional development budget.
- Quarterly well-being allowance for personal wellness needs.
- Generous parental leave policies.
- Employee referral bonuses and additional country-specific benefits (life insurance, disability coverage, EAP, etc.).