This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for an MLOps Lead based in Portugal.
As an MLOps Lead, you will shape the strategy, architecture, and operational excellence of a cutting-edge machine learning infrastructure supporting large-scale AI systems. Leading a team of MLOps engineers, you will bridge the gap between research and production, ensuring that machine learning models are deployed, monitored, and scaled efficiently in high-performance environments. This role combines technical leadership with hands-on architectural decision-making, offering the opportunity to build robust infrastructure from the ground up while collaborating closely with engineering, research, and product teams. Working in a fully remote, international environment, you will help establish best practices and drive innovation across the entire machine learning lifecycle, enabling the delivery of reliable and scalable AI solutions.
Accountabilities
- Lead, mentor, and develop a high-performing team of MLOps engineers while fostering a culture of collaboration, technical excellence, and continuous improvement.
- Define and execute the MLOps roadmap, aligning infrastructure initiatives with research, engineering, and product objectives.
- Design, implement, and maintain scalable machine learning infrastructure, including automated training pipelines, CI/CD workflows, orchestration frameworks, and deployment processes.
- Drive architectural decisions for model serving platforms, ensuring low-latency, high-throughput inference using modern serving technologies.
- Build and optimize feature stores, data pipelines, and storage solutions that support large-scale model training and production inference.
- Collaborate closely with research teams to streamline the transition of machine learning models from experimentation to production environments.
- Establish monitoring, logging, alerting, and observability strategies to ensure model performance, system reliability, and early detection of drift or operational issues.
- Define engineering standards, operational best practices, and scalable infrastructure processes that support long-term platform growth.
Requirements
- Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
- Minimum of 7 years of experience in MLOps or machine learning infrastructure engineering, including at least 3 years in a technical leadership role.
- Strong software engineering expertise in Python, with working knowledge of Bash and/or Go.
- Proven experience building, scaling, and leading MLOps infrastructure from the ground up.
- Deep knowledge of machine learning platforms and frameworks such as MLflow, Weights & Biases (W&B), PyTorch, and TensorFlow.
- Extensive experience with model serving technologies including Triton Inference Server, TorchServe, TensorFlow Serving, or KServe.
- Hands-on expertise with Kubernetes, cloud platforms (AWS, GCP, or Azure), infrastructure-as-code and deployment tooling (Terraform, Helm, GitOps workflows), and production-grade data pipelines.
- Strong experience with monitoring and observability solutions such as Prometheus, Grafana, Datadog, and OpenTelemetry.
- Excellent communication skills with the ability to collaborate effectively across research and engineering teams.
- Experience with workflow orchestration tools, FastAPI, Databricks, Snowflake, LLM infrastructure, SRE practices, or AI startup environments is considered an advantage.
Benefits
- Competitive compensation package including salary and equity participation.
- Comprehensive healthcare coverage for employees and eligible dependents.
- Generous paid parental leave supporting biological, adoptive, and surrogate parenthood.
- Relocation assistance for employees joining one of the company's office locations, where applicable.
- Fully remote work environment with international collaboration opportunities.
- Opportunity to lead cutting-edge AI infrastructure initiatives with significant technical ownership.
- Inclusive, mission-driven culture that values innovation, collaboration, diversity of thought, and continuous learning.