Role summary

Future Openings - SRE Support Engineer - Observability

Requirements and responsibilities

Success Measures

  • Healthy volume of threads and tickets handled with high-quality outcomes
  • Consistent achievement of time-based SLAs
  • High customer satisfaction, as measured by surveys
  • Accurate classification of issue type, severity, and recurring patterns
  • Reduced repeat issues through better docs, tooling, and scalable onboarding

What Will Be True When You Succeed

  • Customers can onboard smoothly to monitoring/alerting with minimal friction
  • Monitoring and alerting issues are resolved quickly, with fewer escalations
  • Linux and networking-related incidents reach resolution faster due to strong troubleshooting and clean handoffs
  • Engineering and SRE teams receive clear, actionable feedback based on real customer trends
  • Knowledge base content prevents tickets and accelerates self-service

1) Frontline Support for Observability & Tooling

  • Manage Slack threads and tickets (roughly 50/50)
  • Handle a broad range of customer support requests, from simple issue resolution to end-to-end onboarding
  • Provide clear, structured guidance to highly technical customers
  • Maintain strong attention to detail while managing multiple interactions in parallel

2) Deep-Dive Troubleshooting & Incident Support

  • Troubleshoot, isolate, and resolve monitoring and alerting issues (especially Prometheus + AlertManager)
  • Troubleshoot complex Linux and networking issues (TCP/IP fundamentals required)
  • Support OpenTelemetry, tracing, and telemetry pipelines, including investigation of gaps in signals and instrumentation
  • Drive incidents to resolution in partnership with Engineering/SRE teams

3) Documentation & Knowledge Development

  • Build and maintain customer-facing and internal knowledge base articles
  • Create informational posts for the community support platform
  • Turn repeated issues into reusable guides, checklists, and onboarding playbooks

4) Trend Analysis & Feedback to Engineering

  • Analyze and categorize customer interaction trends
  • Provide accurate, meaningful feedback to Engineering and SRE orgs to improve product/tooling
  • Identify “top offenders” and propose practical fixes (tooling, docs, process, product)

5) Operational Excellence & Continuous Improvement

  • Participate in post-mortem reviews and drive follow-through on improvements
  • Contribute meaningfully to team objectives and goals (process, tooling, and service scaling)
  • Bring creativity and discretion to resolving highly complex issues that call for outside-the-box thinking

Frontline Support

  • Moves smoothly from triage to deeper analysis without losing the customer
  • Communicates clearly and confidently with technical users
  • Maintains clean follow-ups and thread hygiene even with high context switching

Troubleshooting

  • Rapidly isolates issues across monitoring/alerting configs, Linux runtime behavior, and network connectivity
  • Uses structured approaches to incident handling: hypothesis → test → evidence → resolution
  • Produces high-signal writeups that accelerate downstream resolution

Documentation & Enablement

  • Documentation is clear enough that customers avoid opening tickets
  • Onboarding flows reduce time-to-value and prevent common misconfigurations
  • Captures “tribal knowledge” quickly and makes it reusable

Operational Excellence

  • Obsesses over details: correct severity, accurate tagging, clean timelines, strong handoffs
  • Spots patterns early and proactively proposes improvements that scale support

Typical Day / Work Patterns

  • ~50% Slack support, ~50% ticket handling
  • Deep-dive investigations during lower ticket volume periods
  • Documentation writing and lightweight tooling/process improvements when patterns emerge
  • Weekly team review of escalations, themes, and operational improvements
  • High rate of context switching and parallel issue management

Required Skills & Experience (Non-Negotiable)

  • Several years supporting highly scalable applications and web services
  • Hands-on experience with open-source observability and cloud-native tooling, including:
      • Kubernetes (and container fundamentals)
      • Prometheus and AlertManager troubleshooting
      • OpenTelemetry and distributed tracing concepts
  • Strong understanding of the Linux operating system (command line, process/network debugging, logs)
  • Good understanding of infrastructure observability principles (signals, alerting strategy, SLO thinking, noise reduction)
  • Good understanding of the TCP/IP suite and practical networking troubleshooting
  • Strong experience troubleshooting ambiguous, multi-layer issues
  • Excellent analytical capability and strong attention to detail
  • Strong written and verbal communication (clear, structured, customer-friendly)
  • Comfortable working with a very technical customer base
  • Passion for Technical Support and a service mindset

Nice-to-Haves

  • Experience improving or supporting internal support tooling or workflows (automation, templates, runbooks)
  • Experience operating at scale in a services environment (pattern detection, KPI/SLA awareness, operational process maturity)
  • Familiarity with Grafana, log aggregation, incident tooling, and production support practices
  • Prior SRE or platform support experience

Minimum Qualifications

  • 3–7+ years in Technical Support Engineering, SRE support, DevOps, Platform Support, or similar
  • Demonstrated experience supporting distributed systems, IaaS, or cloud platforms
  • Strong Linux, troubleshooting, and customer-facing communication background
  • Evidence of documentation and knowledge-base contributions, and a process-improvement mindset

What You’ll Love

  • Real technical problem solving with tangible customer impact
  • A role that blends deep troubleshooting with scaling support via docs, tooling, and process
  • High autonomy in a remote-first environment

What May Be Challenging

  • High context switching and managing multiple threads in parallel
  • Repeated patterns that require discipline to convert pain into scalable improvements
  • Supporting high-visibility systems where speed and accuracy matter

Focus (role area): SRE
Seniority signal (candidate level): Middle
Stack (core skills): Kubernetes
Location (eligibility): 1 country accepted
