Sleek

Senior Site Reliability Engineer (SRE)

Remote Site Reliability Engineering role with clear candidate location fit.

Posted: Jul 5, 2026

Eligible countries: 19 accepted countries

Seniority signal: Senior

Work setting: Remote

Accepted candidate locations

Australia, Bangladesh, Cambodia, China, Hong Kong, India, +13 more

AWS Azure CI/CD GCP Kubernetes Node.js Python

Requirements and responsibilities

You will ensure:

High-quality, secure, and scalable infrastructure capable of supporting modern applications and advanced AI workloads
Robust automation across CI/CD, infrastructure provisioning, and operations to increase reliability and reduce manual overhead
Thoughtful and pragmatic integration of AI into operational workflows to improve efficiency, detect anomalies, and accelerate delivery
Reliable systems engineering practices, including monitoring, incident response, performance tuning, and capacity planning
Strong DevOps standards, including reproducibility, testing, documentation, and operational excellence
Clear technical communication and cross-team alignment to enable predictable delivery and collaborative problem-solving
Mentorship and technical leadership that elevates platform engineering, DevOps maturity, and overall engineering quality across the organisation

Outcomes:

Conduct a full review of Sleek’s cloud infrastructure and propose a roadmap for reliability and scalability improvements
Lead upgrades or redesigns of core platform components such as networking, containers, orchestration, or databases
Improve incident response processes, SLIs, SLOs, and on-call readiness

Outcomes:

Ensure platform and infrastructure are capable of supporting AI-powered features
Build or refine pipelines for model hosting, embeddings, vector search, or related AI services if required
Implement monitoring and guardrails for AI service performance, cost, and stability

Increase Engineering Velocity Through Automation

Enhance CI/CD pipelines for speed, safety, and reliability
Introduce infrastructure automation, testing automation, and deployment tooling to reduce manual steps
Champion modern DevOps and AI-assisted tooling to improve engineering productivity

Improve Observability and Operational Excellence

Strengthen logging, monitoring, tracing, and alerting across services
Reduce noisy alerts and improve the signal-to-noise ratio for incidents
Implement readiness checks, runbooks, and automated recovery paths for critical services

Security and Compliance Improvements

Ensure secure configuration, secrets management, access control, and identity management
Implement automated security scanning, dependency monitoring, and hardened pipeline practices
Prepare platform-level requirements needed for reliable and secure AI usage

To do this you will have:

6+ years of progressive experience in Site Reliability Engineering (SRE)
6+ years of strong, hands-on experience across multi-cloud environments such as AWS, GCP, and Azure, including expertise in networking, compute, storage, security, and cost optimization
6+ years of deep expertise in containerization and orchestration (e.g., Kubernetes, EKS, ECS)
6+ years of extensive experience with Infrastructure as Code (IaC) (e.g., Terraform, Pulumi, CloudFormation)
System Reliability: Proven ability to design, build, and operate highly reliable, scalable production systems utilizing advanced Zero-Downtime Deployment Patterns (e.g., Blue/Green, Canary, progressive delivery).
Modern Delivery & Tooling: Expertise in modernizing deployments via GitOps practices (e.g., ArgoCD, Flux) and building Self-Service Developer Platforms that enable engineering efficiency (e.g., environment automation, internal tooling).
Networking & Edge Routing: Experience implementing and managing Multi-Cloud API Gateways and Edge Routing solutions (e.g., Kong, Traefik, Cloudflare, multi-cluster ingress).
Security & Hardening: Strong background in platform security, including secrets management, Identity and Access Management (IAM), and runtime security hardening with tools like Falco/eBPF and WAFs.
Observability: Solid understanding and practical experience with modern observability stacks (e.g., Prometheus, OpenTelemetry, OpenSearch, ELK, CloudWatch).
AI/ML Infrastructure: Experience supporting or deploying AI/ML workloads (e.g., model inference, vector databases, GPU workloads), or strong familiarity with the infrastructure requirements for these systems.
Communication: Excellent communication and collaboration skills, with a proven ability to explain complex infrastructure decisions clearly and a track record of driving improvements in engineering practices.
Development Expertise: Familiarity with modern languages, runtimes, and frameworks such as Python, Node.js, and NestJS is highly desirable for extending DevOps capabilities or integrating tooling.
