AI Cluster Validation Student
No longer listed · NVIDIA · Workday
Israel · Part-time · Posted Jun 30, 2026
We are looking for a motivated Student AI Cluster Validation Engineer to join the Networking Solution Validation (NSV) team within the Networking Cluster Solutions (NCS) organization. You will work on large-scale AI cluster solutions, helping validate infrastructure, monitor system health, analyze telemetry data, and improve reliability across hardware, software, and AI workloads.
This is an excellent opportunity to gain hands-on experience with AI infrastructure, cluster operations, system reliability, and advanced engineering workflows while working alongside experienced engineers on cutting-edge technologies.
What you'll be doing:
- Support cluster owners in maintaining cluster health, readiness, and operational stability.
- Participate in cluster bring-up, validation, monitoring, and reliability activities.
- Monitor and analyze PHY health, telemetry, logs, and system metrics.
- Assist in troubleshooting system-level, hardware-level, and PHY-related issues.
- Support MTBI/MTBF analysis, reliability assessments, and long-term cluster health monitoring.
- Assist in root-cause analysis and corrective action tracking.
- Work with AI-based engineering tools to improve troubleshooting, analysis, and workflow efficiency.
- Collaborate with hardware, infrastructure, software, validation, and AI teams.
- Learn and work with advanced technologies including AI clusters, GPUs, telemetry systems, and high-speed interfaces.
What we need to see:
- B.Sc. student in Electrical Engineering, Information Systems Engineering, Computer Engineering, Computer Science, or a related field.
- Strong analytical, troubleshooting, and problem-solving skills.
- Interest in system architecture, reliability engineering, PHY technologies, and AI infrastructure.
- Ability to analyze logs, telemetry data, and monitoring metrics.
- Experience using AI-based engineering tools for analysis, automation, or productivity improvements.
- Strong communication, collaboration, and documentation skills.
Ways to stand out from the crowd:
- Understanding of PHY concepts and high-speed communication systems.
- Familiarity with telemetry, monitoring platforms, or data analysis.
- Exposure to AI infrastructure, GPUs, HPC environments, or data center technologies.
- Familiarity with InfiniBand, Ethernet, PCIe, NVLink, or similar high-speed interfaces.
- Participation in technical projects, military technology units, hackathons, open-source projects, or personal engineering initiatives.