Updated 21 days ago

Principal Site Reliability Engineer (AI)

Work format
Remote (Global)
Employment type
Full-time
Grade
Senior
English
B2
Listing from Hirify Global, a list of international tech companies

Job description

TL;DR

Principal Site Reliability Engineer (AI): Lead reliability strategy and architectural design for high-performance AI and HPC infrastructure, with an emphasis on scalability, automation, and operational excellence. The role focuses on designing large-scale control-plane systems, defining reliability standards, and driving systemic improvements across GPU and network platforms.

Company

Nscale provides high-performance, cost-effective GPU cloud infrastructure engineered specifically for AI start-ups and enterprise customers.

What you will do

  • Own and evolve the long-term reliability strategy for AI and HPC infrastructure.
  • Design and lead the development of large-scale control-plane systems and automation frameworks.
  • Define reliability standards, SLO frameworks, and operational best practices.
  • Act as a senior technical escalation point during critical incidents to ensure systemic resolution.
  • Partner with cross-functional leadership to influence platform design and operational maturity.
  • Mentor senior and mid-level engineers to elevate SRE practices across the organization.

Requirements

  • 10+ years of experience in SRE, Systems, or Software Engineering operating complex infrastructure.
  • Expert-level software engineering skills in building production-grade automation.
  • Deep expertise in Linux, networking, and distributed systems design at scale.
  • Extensive experience debugging failures across hardware, OS, network, and application layers.
  • Proven ability to lead technical initiatives across teams without direct authority.
  • Strong systems-thinking mindset balancing reliability, velocity, and cost.

Nice to have

  • Hands-on experience with AI/HPC platforms, InfiniBand/RDMA, and workload schedulers like SLURM.
  • Experience designing observability systems for high-cardinality and high-throughput environments.
  • Familiarity with Kubernetes at scale and hybrid or bare-metal cloud architectures.

Culture & Benefits

  • Competitive base and equity package with annual reviews.
  • Remote-first environment with a focus on trust, autonomy, and flexible work.
  • Opportunity to work at a fast-growing startup building cutting-edge AI technology.
  • Collaborative, supportive environment with a focus on professional growth and progression.
