Egen
Lead Machine Learning Engineer, Inference & Performance
Remote Machine Learning Engineering role with a clear candidate location fit.
Published: Jul 5, 2026
Eligible countries: 1 country accepted
Seniority signal: Senior
Work model: Remote
Accepted candidate locations
United States
Requirements and responsibilities
What You Will Do:
- Optimize Inference: Build and tune production LLM serving with vLLM and SGLang—maximizing throughput and minimizing latency through batching, paged attention, quantization, and KV-cache strategies
- Profile & Accelerate Training: Instrument and profile training runs to find bottlenecks, then resolve them with the right attention implementations (e.g. FlashAttention) tuned to the underlying hardware (H200, GB200)
- Engineer for the Hardware: Apply a working understanding of GPU architecture and attention internals to choose the right approach per accelerator, rather than relying on defaults
- Serve at Scale: Deploy and operate multiple models within shared GPU clusters on GKE, with autoscaling, efficient bin-packing, and graceful handling of mixed workloads
- Drive Efficiency: Own GPU utilization as a first-class metric—measure it, improve throughput-per-dollar, and continuously raise the ceiling on what our fleet can deliver
- Collaborate & Consult: Work directly with clients to understand performance, latency, and cost requirements, and translate them into pragmatic serving and training architectures
Your Technical Toolkit:
- Core Languages: Mastery of Python and shell scripting; comfort reading and reasoning about lower-level (CUDA-adjacent) performance code is a strong plus
- Inference Frameworks: Hands-on experience with vLLM, SGLang, or comparable high-performance serving stacks
- GPU & Model Internals: Solid grasp of GPU architecture, the fundamentals of LLM inference, and the attention mechanism—including where the bottlenecks live and how FlashAttention and similar techniques address them across hardware generations (H200, GB200)
- Profiling: Fluency with profiling tools to diagnose training and inference bottlenecks (compute-bound vs. memory-bound, kernel-level analysis)
- Infrastructure: Strong Kubernetes (GKE) experience—deploying and autoscaling multiple models on shared GPU clusters on Google Cloud
- Mindset: A strong software engineering foundation—you write clean, maintainable code, measure before optimizing, and understand the full SDLC
Basic Qualifications:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field
- 5+ years of experience in ML/AI engineering, with a meaningful portion focused on performance, infrastructure, or systems
- Proven track record of deploying and optimizing models in a production environment
- Demonstrated experience profiling and improving GPU utilization for training and/or inference
- Experience with Classic Machine Learning (neural nets, training, tuning) is a strong plus
- Knowledge of Data Engineering and SQL
Personal Attributes:
- Ownership: You take pride in your work and see optimizations through from profile to production
- Curiosity: Hardware and serving frameworks change fast; you are a lifelong learner who stays ahead of the curve
- Rigor: You measure before you optimize and let data, not intuition, guide where you spend effort
- Consultative Spirit: You enjoy interacting with clients and can translate technical complexity into business value
- Ethics: You prioritize responsible AI development and data privacy