Staff Research Engineer, Post-training & Evaluation

Remote Anti-Evil Engineering role with a clear candidate-location fit.

Posted: Recently added

Eligible countries: 1 country accepted

Seniority signal: Lead

Work model: Remote

Accepted candidate locations

United States

CI/CD, Python, REST

Application path: Company site

Company: Reddit

Role summary

Staff Research Engineer, Post-training & Evaluation

Requirements and responsibilities

Responsibilities

Define the "Reddit Benchmark" evaluation standard: Own the methodology, not just the harness, for rigorously measuring model quality across safety, reasoning, representation/retrieval, and Reddit-specific knowledge. Decide what "Reddit-native" means in measurable terms and set the bar the org trains against.
Own evaluation reliability and statistical rigor: Establish the science behind trustworthy evals — judge variance, multi-sample scoring, inter-rater/inter-sample agreement, sampling and temperature effects, and calibration of automated judges. You are accountable for whether a benchmark delta is real or noise. Drive the practice of evaluation as a release gate — offline against frozen datasets, and pre-merge in CI/CD — so regressions are caught before endpoints ship.
Design model-as-a-judge methodology: Own judge selection, prompt design, calibration, and reliability for automated evaluation using frontier external models, enabling rapid, trustworthy iteration cycles.
Set post-training recipes and strategy: Design SFT recipes (data mixtures, curriculum, ablation strategy) that convert base models into helpful, well-aligned endpoints; partner with engineering to scale them.
Evaluate base and CPT checkpoints, not just endpoints: Design checkpoint-selection methodology across CPT experiments and LR studies, so we pick the right base before committing post-training compute.
Drive synthetic data generation strategy: Define and curate high-quality instruction and evaluation sets to improve generalization where human data is scarce.
Partner with Safety Engineering: Translate high-level safety policy into concrete classification metrics, probe sets, and CI/CD unit tests — including precision/recall at threshold, label-noise handling, and false-positive taxonomy for abuse detection (HHV).
Diagnose post-training instability: Dive into loss curves and eval logs to identify alignment tax and capability degradation, and recommend the fix.
Lead research direction: Set technical direction for evaluation and post-training across the team, mentor engineers and scientists, and represent the work internally (and externally where appropriate).

Required Qualifications

6+ years of professional ML experience (or a PhD plus 4+ years) with a direct focus on LLM post-training and evaluation.
PhD or MS in CS, ML, NLP, IR, or a related quantitative field — or equivalent industry research experience.
Deep expertise in evaluation reliability: judge/sample variance, multi-sample scoring, calibration, statistical significance, and the failure modes of automated evaluation.
Strong experience building custom, domain-specific evaluation harnesses (e.g., lm-eval-harness, Inspect AI, LightEval) — you know the strengths and limits of benchmarks like MMLU and GSM8K and when they don't apply, and you treat eval sets as versioned, frozen, regression-tracked code.
Experience evaluating both generation and representation/classification: model-as-a-judge for generative quality; precision/recall, PR-AUC, retrieval/MTEB-style metrics, gold-label denoising, and label-noise handling for representation and classification.
Deep understanding of Continuous Pre-training (CPT), Instruction Tuning (SFT), and how data quality shapes model behavior.
Fluency in Python; strong data-pipeline and eval-harness engineering (e.g., Hugging Face Transformers, vLLM, lm-eval-harness). Working knowledge of PyTorch and distributed training (FSDP2, DeepSpeed ZeRO-3) sufficient to direct and debug post-training runs.

Nice to haves

Experience with MLflow or similar experiment-tracking frameworks.
Familiarity with modern fine-tuning frameworks (Axolotl, TorchTune) and PyTorch-native training stacks (TorchTitan).
Synthetic data generation techniques (e.g., Self-Instruct).
Experience with preference optimization (DPO, RLHF, RLAIF, GRPO).
Publications in NLP/ML/FAccT or related venues, or other evidence of research leadership.
Experience evaluating multimodal models (embeddings, hateful-memes-style classification).

Benefits

Comprehensive Healthcare Benefits and Income Replacement Programs
401k with Employer Match
Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support
Family Planning Support
Gender-Affirming Care
Mental Health & Coaching Benefits
Flexible Vacation & Paid Volunteer Time Off
Generous Paid Parental Leave
