Senior Data Platforms SRE
IQVIA Data Platform Governance
Join IQVIA's Data Platform Team
Join IQVIA's Governance Team to run and support our data platforms (Databricks & Snowflake) as a managed service. You will be part of the operational layer that keeps platforms healthy, cost-controlled, and compliant by handling day-to-day user requests, monitoring signals, and common incidents with clear documentation and disciplined ticket handling. As a Senior Data Platforms SRE, you will lead complex L2 troubleshooting, strengthen controls, and help build a measurable managed service.
Job Overview
As a Senior Data Platforms SRE, you will serve as a technical escalation point and resolve higher-complexity incidents and technical problems end to end. This role requires advanced troubleshooting expertise in data platforms, operational discipline, and the ability to drive service improvements across the governance team. You will work directly with Databricks and Snowflake platforms deployed on Azure and AWS, investigating non-trivial technical issues, implementing platform guardrails, and mentoring junior engineers to elevate the team's operational maturity. The position combines hands-on technical work with service improvement initiatives, including defining operational metrics, tracking incident resolution trends, and contributing to SLA/SLO verification to ensure platform reliability and cost-effectiveness. Level 2 support engineers handle technically complex issues that exceed the scope of Level 1 support, focusing on incident resolution through deep technical knowledge and advanced troubleshooting skills.
Key Responsibilities
Advanced Incident Handling & Escalation: Investigate non-trivial job and pipeline failures, performance and capacity symptoms, and recurring platform issues. Coordinate escalation to platform engineering or vendors with complete diagnostics, including root cause analysis, system logs, and performance metrics, to ensure rapid resolution. L2 engineers perform deep diagnoses to trace and resolve issues, managing support tickets to ensure timely resolution of all technical incidents.
Resource & Cost Governance: Perform spend anomaly checks to identify runaway queries, oversized clusters, and job loops that drive uncontrolled costs. Drive remediation actions including shutdowns and guardrails using policy and operational authority, ensuring alignment with financial controls and budget targets.
Platform Guardrails Implementation: Implement and maintain cluster policies and operational restrictions that prevent uncontrolled cost growth, aligned with governance objectives. Configure and enforce workspace-level controls, quota management, and resource limits across Databricks and Snowflake environments.
Environment Lifecycle Execution: Support workspace and environment provisioning and decommissioning activities, including provisioning driven by Infrastructure as Code where defined. Execute environment setup, configuration validation, and teardown procedures following established operational standards.
Service Improvement & Metrics: Help define and track operational metrics including incident volumes, resolution trends, and mean time to resolution. Contribute to SLA/SLO verification and improve documentation quality and consistency across the service to support continuous operational improvement. L2 support engineers maintain detailed documentation and knowledge articles as key activities in their role.
Mentoring & Coaching: Mentor junior engineers on troubleshooting methodologies, ticket handling best practices, and technical documentation standards. Review tickets for completeness, ensure clean handoffs between support tiers, and raise the overall operational maturity of the team through knowledge sharing and skills development.
Required Technical Skills
This position requires a comprehensive technical skill set spanning data platforms, infrastructure technologies, and operational tools. Candidates must demonstrate hands-on proficiency with Databricks and Snowflake platforms, including cluster management, SQL query optimization, and performance troubleshooting. Strong experience with Azure and AWS cloud services is essential, particularly in areas of compute, storage, networking, and identity management. Linux console proficiency is required for log analysis, system diagnostics, and troubleshooting at the operating system level. Infrastructure as Code experience with Terraform is necessary for environment provisioning and configuration management. Container technologies including Kubernetes and Docker are required for understanding platform architecture and troubleshooting containerized workloads. DevOps practices and tools are essential for supporting CI/CD pipelines and automated deployments. Experience with monitoring and observability tools for platform health checks, alerting, and performance analysis is mandatory. Ticketing systems expertise, particularly with Jira, is required for structured incident handling and service request management. Advanced SQL skills are necessary for query analysis, optimization, and troubleshooting data processing issues.
Technology Area: Required Skills
Data Platforms: Databricks (cluster management, workspace administration, SQL query optimization, performance troubleshooting), Snowflake (warehouse management, query optimization, data sharing, security configuration)
Cloud Infrastructure: Azure (Compute, Storage, Networking, Identity Management), AWS (EC2, S3, IAM, VPC), Cloud architecture patterns, Resource optimization
Operating Systems: Linux console proficiency, Shell scripting (Bash), Log analysis and diagnostics, System troubleshooting, Performance monitoring
Infrastructure as Code: Terraform (resource provisioning, state management, module development), Configuration management, Environment automation
Container Technologies: Kubernetes (cluster architecture, pod troubleshooting, service debugging), Docker (container management, image troubleshooting, networking)
DevOps Tools: CI/CD pipelines (Azure DevOps, Jenkins, GitLab CI), Version control (Git), Build and deployment automation, Release management
Monitoring & Observability: Platform health monitoring, Log aggregation and analysis (ELK, Splunk), Alerting configuration, Performance metrics analysis, Dashboard creation
Ticketing Systems: Jira (incident management, service request tracking, workflow configuration), ITSM best practices, SLA management
Data Technologies: Advanced SQL (query optimization, execution plans, performance tuning), Database concepts (indexes, partitioning, statistics), Query troubleshooting and debugging
Required Qualifications
Candidates must possess strong hands-on experience supporting Databricks and Snowflake platforms in production environments, including troubleshooting job failures, performance degradation, and access or permission issues. This experience should include working with distributed computing frameworks, understanding cluster configurations, and resolving complex data pipeline problems. Experience with operational monitoring and alerting systems is required, including log analysis, metrics interpretation, and structured incident handling following established procedures. The ability to produce clear technical narratives in tickets is essential, documenting symptoms, diagnostic steps, root cause findings, and resolution actions in a format that supports knowledge transfer and future troubleshooting. A demonstrated ownership mindset is critical, including the ability to identify repeat issues, propose preventive controls, and drive standardization across operational processes. Candidates should show evidence of proactive problem-solving, initiative in process improvement, and commitment to operational excellence. Strong communication skills are necessary for coordinating with platform engineering teams, vendors, and business stakeholders during escalation and resolution activities. L2 support engineers must communicate complex technical details in a way that clients can understand, acting as an intermediary between the client and technical team.
Preferred Qualifications
• Familiarity with ITIL frameworks and best practices for IT service management, including incident management, problem management, and change management processes.
• Understanding of SLO and SLA concepts, including how to define service level objectives, measure compliance, and report on service performance metrics.
• Experience with FinOps principles and cloud cost governance practices, including cost allocation, budget tracking, spend optimization, and financial accountability for cloud platform resources.
• Knowledge of capacity planning methodologies and techniques for forecasting resource needs based on usage patterns and growth trends.
• Experience implementing automation for operational tasks including alerting, remediation workflows, and self-service capabilities.
• Background in platform security and compliance, including understanding of data governance requirements, access controls, and audit logging for regulated environments.
IQVIA is a leading global provider of clinical research services, commercial insights and healthcare intelligence to the life sciences and healthcare industries. We create intelligent connections to accelerate the development and commercialization of innovative medical treatments to help improve patient outcomes and population health worldwide. Learn more at https://jobs.iqvia.com
IQVIA is committed to integrity in our hiring process and maintains a zero-tolerance policy for candidate fraud. All information and credentials submitted in your application must be truthful and complete. Any false statements, misrepresentations, or material omissions during the recruitment process will result in immediate disqualification of your application, or termination of employment if discovered later, in accordance with applicable law. We appreciate your honesty and professionalism.