Manager, Site Reliability Engineering
Job description
TL;DR
Manager, Site Reliability Engineering (SRE/DevOps): Lead a team of Site Reliability Engineers to maintain the reliability, scalability, and performance of systems, with an emphasis on multi-cloud reliability strategy, incident response, and automation. Focus on building SLO/SLI/SLA practices, improving observability and deployment processes, and driving infrastructure resilience (high availability and disaster recovery), with continuous learning through RCA and post-mortems.
Location: Sofia, Bulgaria (Hybrid)
Company
is a travel technology company powering intelligent offer and revenue optimization for airlines.
What you will do
- Lead and mentor the SRE team, driving reliability, accountability, and continuous improvement.
- Develop and implement strategies for multi-cloud reliability, monitoring, and incident response.
- Drive automation for deployment processes, infrastructure as code (IaC), and operational efficiency.
- Manage observability tooling for logging, metrics, and alerting; establish SLOs/SLIs/SLAs.
- Oversee root cause analysis (RCA) and post-mortems to improve systems and processes.
- Ensure high availability and disaster recovery strategies are in place and regularly tested; optimize cloud infrastructure costs.
Requirements
- 7+ years of experience in software engineering, SRE, or DevOps, including 3+ years in a managerial or leadership role.
- Strong cloud platform knowledge (Azure, AWS, IBM Cloud) and containerization (Docker, Kubernetes).
- Proficiency with automation and configuration management tools (Terraform, Ansible, Puppet, The Foreman).
- Experience with monitoring and observability tools (e.g., Prometheus, Grafana, PagerDuty, Graylog).
- Solid programming/scripting skills in Python, Go, Bash, or similar languages.
- Expertise in CI/CD pipelines and modern deployment strategies; strong analytical and problem-solving skills.
Nice to have
- Experience with large-scale distributed systems.
- Knowledge of networking, security, and compliance best practices.
- Experience with incident response and the ITIL framework.
- Background in high-availability, customer-facing production environments.
Culture & Benefits
- Flexible ways of working with a hybrid setup.
- Culture focused on ownership, innovation, and care.
- Continuous learning and support to grow and innovate.
- Collaboration between software development and operations teams.
Hiring process
- Interviews to assess leadership, SRE/DevOps experience, and technical depth across reliability, automation, and observability.
- Discussion of collaboration approach and experience improving production reliability through incident management and RCA.