Reliability Engineer (Supercomputing)
Job description
TL;DR
Reliability Engineer (Supercomputing): Ensuring the stability and performance of large-scale GPU supercomputing clusters, with an emphasis on hardware, firmware, and OS-level diagnostics. Focus on investigating complex hardware failures, automating fleet reliability monitoring, and working with vendors to identify and fix root causes in support of frontier AI research.
Location: Must be based in San Francisco, California
Compensation: $350,000 - $475,000 USD
Company
Thinking Machines Lab is an AI research organization dedicated to advancing collaborative general intelligence and building widely used AI tools.
What you will do
- Investigate, reproduce, and remediate issues across large GPU clusters.
- Manage the drivers, kernel surface, and diagnostics spanning hardware, firmware, and OS.
- Automate fleet reliability monitoring and analyze error rates to validate fixes.
- Drive the firmware lifecycle, including tracking, qualification, and regression analysis.
- Engage directly with hardware and server vendors to secure technical fixes and manage RMA processes.
- Write detailed postmortems and vendor cases to drive systemic improvements.
Requirements
- Must be based in San Francisco, California
- Bachelor’s degree or equivalent experience in computer science or engineering.
- Proficiency in Python or Rust.
- Experience operating large-scale clusters with orchestration and scheduling systems such as Kubernetes or Slurm.
- Strong ability to own projects end-to-end and work across cross-functional teams.
- Visa sponsorship available for qualified candidates.
Nice to have
- Fluency with Linux systems and kernel-level debugging.
- Experience with out-of-band management (BMC, iDRAC, IPMI, Redfish).
- Deep knowledge of GPU hardware health (Xid errors, NVLink, NVSwitch, DCGM).
- Proven statistical rigor in reliability analysis.
Culture & Benefits
- Generous health, dental, and vision insurance.
- Unlimited PTO and paid parental leave.
- Relocation support provided as needed.
- Collaborative environment working on frontier AI research.