Role overview

XTN-DAD3686 | SITE RELIABILITY ENGINEER

Requirements and responsibilities

Cluster Validation & Testing

  • Validate GPU clusters of varying sizes to ensure hardware and system integrity prior to production release
  • Perform functional and reliability testing of GPUs, servers, and associated components
  • Verify network connectivity and performance, including InfiniBand where applicable

Orchestration & Benchmarking

  • Provision and configure GPU clusters using automated workflows
  • Execute and analyse performance and stability benchmarks orchestrated via Slurm
  • Validate results against expected performance and reliability thresholds

Test Framework & Automation

  • Maintain and extend the automated validation framework built using Python and Ansible
  • Integrate new test cases to support additional hardware platforms and GPU generations
  • Improve test reliability, coverage, and execution efficiency

Remediation & System Integrity

  • Diagnose and remediate unhealthy nodes through configuration changes or software fixes
  • Coordinate with on-site support and Smart Hands teams for hardware replacements when required
  • Ensure all issues are resolved and documented prior to handover to production operations

Documentation & Handover

  • Produce clear, accurate documentation of test results, hardware states, and remediation actions
  • Ensure smooth handovers to operations and engineering teams
  • Maintain up-to-date runbooks and validation procedures

Role area: Site Reliability Engineer
Candidate level: Middle
Primary skills: Python
Eligibility: 1 accepted country
