AI Cluster Validation Student

No longer listed
NVIDIA·Workday
Israel · Part-time · Posted Jun 30, 2026

We are looking for a motivated Student AI Cluster Validation Engineer to join the Networking Solution Validation (NSV) team within the Networking Cluster Solutions (NCS) organization. You will work on large-scale AI cluster solutions, helping validate infrastructure, monitor system health, analyze telemetry data, and improve reliability across hardware, software, and AI workloads.

This is an excellent opportunity to gain hands-on experience with AI infrastructure, cluster operations, system reliability, and advanced engineering workflows while working alongside experienced engineers on cutting-edge technologies.

What you'll be doing:

  • Support cluster owners in maintaining cluster health, readiness, and operational stability.
  • Participate in cluster bring-up, validation, monitoring, and reliability activities.
  • Monitor and analyze PHY health, telemetry, logs, and system metrics.
  • Assist in troubleshooting system-level, hardware-level, and PHY-related issues.
  • Support MTBI/MTBF analysis, reliability assessments, and long-term cluster health monitoring.
  • Assist in root-cause analysis and corrective action tracking.
  • Work with AI-based engineering tools to improve troubleshooting, analysis, and workflow efficiency.
  • Collaborate with hardware, infrastructure, software, validation, and AI teams.
  • Learn and work with advanced technologies including AI clusters, GPUs, telemetry systems, and high-speed interfaces.

What we need to see:

  • B.Sc. student in Electrical Engineering, Information Systems Engineering, Computer Engineering, Computer Science, or a related field.
  • Strong analytical, troubleshooting, and problem-solving skills.
  • Interest in system architecture, reliability engineering, PHY technologies, and AI infrastructure.
  • Ability to analyze logs, telemetry data, and monitoring metrics.
  • Experience using AI-based engineering tools for analysis, automation, or productivity improvements.
  • Strong communication, collaboration, and documentation skills.

Ways to stand out from the crowd:

  • Understanding of PHY concepts and high-speed communication systems.
  • Familiarity with telemetry, monitoring platforms, or data analysis.
  • Exposure to AI infrastructure, GPUs, HPC environments, or data center technologies.
  • Familiarity with InfiniBand, Ethernet, PCIe, NVLink, or similar high-speed interfaces.
  • Participation in technical projects, military technology units, hackathons, open-source projects, or personal engineering initiatives.
