Lead Principal Core Infrastructure Engineer

Oracle·Oracle Recruiting

India·Full-time·Posted Jun 30, 2026

Mentors teams and leads the architecture of highly scalable, interdependent distributed systems. Identifies and removes performance/scalability bottlenecks for hyper‑scale workloads; defines scalability requirements with stakeholders; and designs elastic, high‑impact systems while advancing innovation in data plane platforms. Engineers and oversees fault‑tolerant, in‑service‑upgradable designs; optimizes resilience mechanisms (load‑shedding, throttling, rate‑limiting); and sets SLO‑aligned durability and availability standards across dependent services. Establishes KPIs and advanced telemetry; applies formal verification for complex features; and develops robust replication/synchronization strategies. Advises and leads resolution of complex production issues, sets operational readiness and SOP standards, and directs incident response and RCAs. Architects advanced security controls, drives remediation and compliance, and delivers enterprise‑level automation (IaC) and change strategies enabling safe, automated patching, updates, and rollbacks.

Key Responsibilities
System Design & Architecture - System Scalability:
–Mentor the team in the architecture and design of highly scalable, interdependent distributed systems, ensuring horizontal and vertical scalability and overall performance, including leveraging distributed state management tools.
–Lead the identification of performance and scalability bottlenecks, and recommend code and system optimizations that improve large-scale data processing and throughput in hyper-scale systems.
–Lead collaboration with stakeholders to define system scalability requirements, ensuring the defined requirements meet customer expectations.
–Leverage deep expertise to design high-impact, interdependent systems that scale elastically (i.e., scale both up and down effectively).
–Drive innovation in the use of data plane platforms.
–Evaluate whether systems are meeting nonfunctional scalability requirements, and proactively anticipate growing business needs within the business unit.
System Design & Architecture - System Reliability Design:
–Design and oversee the implementation of fault-tolerant, interdependent systems capable of withstanding in-service updates by implementing sophisticated redundancy, replication, and automatic failover capabilities.
–Lead the design and implementation of systems that effectively handle service disruptions (e.g., network partitions) by prioritizing consistency, availability, or partition tolerance.
–Guide the optimization of advanced mechanisms to handle network unreliability, including load-shedding, throttling, and rate-limiting.
–Design interdependent systems that are durable and adhere to service level objectives (SLOs), driving standards for availability and durability of other computing services within the organization.
System Design & Architecture - System Reliability Performance:
–Define key performance indicators (KPIs) and telemetry to identify risks, gaps, or circular dependencies in running, interdependent systems.
–Drive the creation and customization of highly complex dashboards, telemetry systems, and alerting mechanisms, proactively ensuring system health and reliability.
System Design & Architecture - Correctness / Availability:
–Maintain expertise in industry standards for verifying correctness and apply existing techniques to interdependent systems.
–Formally verify complex features (e.g., via TLA+) to ensure system design correctness for various interdependent systems.
–Develop advanced strategies for data replication and synchronization, ensuring robust data integrity and availability.
Operational Troubleshooting & Incident Management:
–Advise on efforts to diagnose, debug, and resolve complex issues in active, interdependent systems to support ongoing operation.
–Develop and implement comprehensive strategies to prevent interruptions, ensuring no maintenance windows are required for customers and users when resolving issues.
–Maintain expertise in dependencies, dependents, and owned systems to drive effective troubleshooting and performance.
–Set standards for operational readiness and standard operating procedures within the department, and hold third-party partners accountable for meeting those standards.
–Oversee operational support rotations, providing expert guidance in incident response and leading root cause investigations to prevent future occurrences.
Compliance & Security:
–Architect advanced security measures to protect data and applications in multi-tenant environments, and lead initiatives to enhance data and application protection.
–Guide the execution of comprehensive remediation plans to address identified security vulnerabilities.
–Ensure cloud infrastructure is in compliance with industry standards and regulations, and guide documentation efforts across projects.
Automation & Change Management:
–Develop enterprise-level automation tools and strategies (e.g., Infrastructure as Code (IaC)) and oversee their implementation.
–Drive alignment of change management plans and organizational initiatives for patching, updating, and rolling back applications, and design interdependent systems to allow for automation of these processes.

Core Responsibilities
Planning & Execution:
–Manages and provides direction on timelines, deliverables, and budgets when applicable for critical high-impact projects or initiatives that impact the line of business, ensuring timely completion and adherence to requirements. Anticipates and plans for shifts in resources or timelines based on changing business priorities, ensuring optimal outcomes.
Collaboration & Partnership:
–Influences cross-functional leaders and external stakeholders to gain alignment on strategic objectives. Fosters partnerships with key business leaders, stakeholders, and/or customers, identifying opportunities for expanding partnerships and promoting long-term organizational success. Champions transparency and inclusivity by actively seeking, listening to, and incorporating diverse perspectives.
Problem Solving:
–Leads specialized, advanced problem-solving efforts, serving as an escalation point for complex issues. Guides others to leverage innovative data-driven techniques to address ambiguous or novel issues and identify root causes, and drives the implementation of solutions that prevent future issues.
Continuous Learning:
–Leverages deep industry knowledge and expertise to serve as a thought leader within the organization. Contributes to the advancement of the field or industry through thought leadership (e.g., conference presentations, white papers, research contributions). Maintains and evolves expertise in relevant areas by proactively monitoring emerging trends, technologies, and industry standards, ensuring the organization remains current with best practices. Champions continuous learning and knowledge sharing, promoting professional development across teams. Applies new knowledge to drive advancement and mentors others to do the same.
Continuous Improvement:
–Develops innovative solutions and drives the implementation of ideas that increase the efficiency and effectiveness of processes, protocols, and workflows across the organization. Evaluates effectiveness of updated approaches and methods for continued improvement to enhance efficiencies and ensure changes align with organizational goals. Designs and develops metrics to measure success of improvement initiatives.
Performance and Development:
–Serves as a subject matter expert regarding talent needs and organizational talent strategy. Imparts leadership and expert knowledge throughout the talent development pipeline including candidate interviews, candidate assessment, and hiring decisions, ensuring alignment with organizational talent strategy.

Basic Qualifications

BS or MS degree in Computer Science or a relevant technical field involving coding, or equivalent practical experience
10+ years of total experience in software development
Demonstrated ability to write high-quality code in Java, Go, C#, or similar languages
Proven ability to deliver products and experience with the full software development lifecycle
Experience working on large-scale, highly distributed services infrastructure
Experience working in an operational environment with mission-critical tier-one livesite servicing
Systematic problem-solving approach, strong communication skills, a sense of ownership, and drive
Experience designing architectures that demonstrate deep technical depth in one area, or span many products, to enable high availability, scalability, market-leading features and flexibility to meet future business demands

Preferred Qualifications

Experience as technical lead on a large-scale cloud service
Hands-on experience developing and maintaining services on a public cloud platform (e.g., AWS, Azure, Oracle)
Experience working on Kubernetes
Knowledge of Infrastructure as Code (IaC) languages, preferably Terraform
Strong knowledge of databases (SQL and NoSQL)
Strong knowledge of Computer Networking (OSI layers, HTTP, DNS, TCP/IP, DHCP, Routers, Gateways, Subnets, etc.)
Knowledge of Linux internals, Linux/Unix troubleshooting skills
Familiarity with host virtualization technologies (KVM, Containers, Docker, etc.)
Able to effectively communicate technical ideas verbally and in writing (technical proposals, design specs, architecture diagrams and presentations)
Experience with hiring, mentorship and raising the talent bar across the organization

Career Level - IC5