Egen

Lead Machine Learning Engineer, Inference & Performance

Remote Machine Learning Engineering role with a clear candidate-location fit.

Published: Jul 5, 2026

Eligible countries: 1

Seniority: Senior

Work model: Remote

Accepted candidate locations

United States


Source current as of: Jul 5, 2026


Stack match: Kubernetes, LLM

Application path: Company website


WithMira lists this role for discovery; the application is completed on the company's website.

Job summary


Optimize Inference: Build and tune production LLM serving with vLLM and SGLang—maximizing throughput and minimizing latency through batching, paged attention, quantization, and KV-cache strategies
Profile & Accelerate Training: Instrument and profile training runs to find bottlenecks, then resolve them with the right attention implementations (e.g. FlashAttention) tuned to the underlying hardware (H200, GB200)
Engineer for the Hardware: Apply a working understanding of GPU architecture and attention internals to choose the right approach per accelerator, rather than relying on defaults
Serve at Scale: Deploy and operate multiple models within shared GPU clusters on GKE, with autoscaling, efficient bin-packing, and graceful handling of mixed workloads
Drive Efficiency: Own GPU utilization as a first-class metric—measure it, improve throughput-per-dollar, and continuously raise the ceiling on what our fleet can deliver
Collaborate & Consult: Work directly with clients to understand performance, latency, and cost requirements, and translate them into pragmatic serving and training architectures

Core Languages: Mastery of Python and shell scripting; comfort reading and reasoning about lower-level (CUDA-adjacent) performance code is a strong plus
Inference Frameworks: Hands-on experience with vLLM, SGLang, or comparable high-performance serving stacks
GPU & Model Internals: Solid grasp of GPU architecture, the fundamentals of LLM inference, and the attention mechanism—including where the bottlenecks live and how FlashAttention and similar techniques address them across hardware generations (H200, GB200)
Profiling: Fluency with profiling tools to diagnose training and inference bottlenecks (compute-bound vs. memory-bound, kernel-level analysis)
Infrastructure: Strong Kubernetes (GKE) experience—deploying and autoscaling multiple models on shared GPU clusters on Google Cloud
Mindset: A strong software engineering foundation—you write clean, maintainable code, measure before optimizing, and understand the full SDLC

Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field
5+ years of experience in ML/AI engineering, with a meaningful portion focused on performance, infrastructure, or systems
Proven track record of deploying and optimizing models in a production environment
Demonstrated experience profiling and improving GPU utilization for training and/or inference
Experience with Classic Machine Learning (neural nets, training, tuning) is a strong plus
Knowledge of Data Engineering and SQL

Ownership: You take pride in your work and see optimizations through from profile to production
Curiosity: Hardware and serving frameworks change fast; you are a lifelong learner who stays ahead of the curve
Rigor: You measure before you optimize and let data, not intuition, guide where you spend effort
Consultative Spirit: You enjoy interacting with clients and can translate technical complexity into business value
Ethics: You prioritize responsible AI development and data privacy
