Updated 21 days ago

Principal Site Reliability Engineer (AI)

Work format

remote (Global)

Employment type

full-time

Grade

senior

English

Vacancy from Hirify Global, a list of international tech companies

Job description


TL;DR

Principal Site Reliability Engineer (AI): Lead reliability strategy and architectural design for high-performance AI and HPC infrastructure, with an emphasis on scalability, automation, and operational excellence. The role focuses on designing large-scale control-plane systems, defining reliability standards, and driving systemic improvements across GPU and network platforms.

Company

Nscale provides high-performance, cost-effective GPU cloud infrastructure engineered specifically for AI start-ups and enterprise customers.

What you will do

Own and evolve the long-term reliability strategy for AI and HPC infrastructure.
Design and lead the development of large-scale control-plane systems and automation frameworks.
Define reliability standards, SLO frameworks, and operational best practices.
Act as a senior technical escalation point during critical incidents to ensure systemic resolution.
Partner with cross-functional leadership to influence platform design and operational maturity.
Mentor senior and mid-level engineers to elevate SRE practices across the organization.

Requirements

10+ years of experience in SRE, Systems, or Software Engineering operating complex infrastructure.
Expert-level software engineering skills in building production-grade automation.
Deep expertise in Linux, networking, and distributed systems design at scale.
Extensive experience debugging failures across hardware, OS, network, and application layers.
Proven ability to lead technical initiatives across teams without direct authority.
Strong systems-thinking mindset balancing reliability, velocity, and cost.

Nice to have

Hands-on experience with AI/HPC platforms, InfiniBand/RDMA, and workload schedulers like SLURM.
Experience designing observability systems for high-cardinality and high-throughput environments.
Familiarity with Kubernetes at scale and hybrid or bare-metal cloud architectures.

Culture & Benefits

Competitive base and equity package with annual reviews.
Remote-first environment with a focus on trust, autonomy, and flexible work.
Opportunity to work at a fast-growing startup building cutting-edge AI technology.
Collaborative, supportive environment with a focus on professional growth and progression.

