9 days ago

Reliability Engineer (Supercomputing)

$350,000 - $475,000

Work format

onsite

Employment type

fulltime

Grade

senior

English

Country

Relocation

Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Reliability Engineer (Supercomputing): Ensuring the stability and performance of large-scale GPU supercomputing clusters, with an emphasis on hardware, firmware, and OS-level diagnostics. The role focuses on investigating complex hardware failures, automating fleet reliability monitoring, and working with vendors to resolve root causes in support of frontier AI research.

Location: Must be based in San Francisco, California

Compensation: $350,000 - $475,000 USD

Company

Thinking Machines Lab is an AI research organization dedicated to advancing collaborative general intelligence and building widely used AI tools.

What you will do

Investigate, reproduce, and remediate issues across large GPU clusters.
Manage the drivers, kernel surface, and diagnostics spanning hardware, firmware, and OS.
Automate fleet reliability monitoring and analyze error rates to validate fixes.
Drive the firmware lifecycle, including tracking, qualification, and regression analysis.
Engage directly with hardware and server vendors to secure technical fixes and manage RMA processes.
Write detailed postmortems and vendor cases to drive systemic improvements.

Requirements

Must be based in San Francisco, California
Bachelor’s degree or equivalent experience in computer science or engineering.
Proficiency in Python or Rust.
Experience operating large-scale clusters and container orchestration systems like Kubernetes or Slurm.
Strong ability to own projects end-to-end and work across cross-functional teams.
Visa sponsorship available for qualified candidates.

Nice to have

Fluency with Linux systems and kernel-level debugging.
Experience with out-of-band management (BMC, iDRAC, IPMI, Redfish).
Deep knowledge of GPU hardware health (Xid errors, NVLink, NVSwitch, DCGM).
Proven statistical rigor in reliability analysis.

Culture & Benefits

Generous health, dental, and vision insurance.
Unlimited PTO and paid parental leave.
Relocation support provided as needed.
Collaborative environment working on frontier AI research.
