Machine Learning Ops Manager (Remote)

We are seeking a Machine Learning Ops Manager to build and scale the infrastructure powering our intelligent platform.

Role Description

In this role, you will lead the development of machine learning and AI systems that drive automation, workflow optimization, and decision support across the CRT order lifecycle, from referral intake and documentation validation to insurance authorization and billing.

You will own the architecture and scalability of ML systems that support real-world healthcare operations. This includes building pipelines that can handle increasing data complexity, improving model performance in production, and enabling rapid iteration of AI-driven features that reduce inefficiencies and compliance risk.

Responsibilities

  • Architect, design, and scale ML systems that support high-volume, workflow-driven healthcare operations.
  • Build and maintain infrastructure for model training, evaluation, deployment, and monitoring across the ATLAS platform.
  • Develop robust data pipelines that power ML models used in documentation validation, workflow automation, and operational intelligence.
  • Partner closely with ML engineers, software engineers, and product teams to integrate models seamlessly into real-world CRT workflows.
  • Establish best practices for ML lifecycle management, including CI/CD, versioning, and reproducibility.
  • Monitor system performance, troubleshoot production issues, and continuously optimize for reliability, scalability, and efficiency.
  • Create tools and frameworks for evaluating ML and AI model performance, including both offline testing and real-time production feedback.
  • Drive improvements in observability, including logging, metrics, tracing, and model performance monitoring.


Candidate Qualifications

  • 5+ years of experience building ML infrastructure and deploying and scaling models in production environments.
  • Strong software engineering skills with proficiency in Python and modern backend technologies.
  • Experience designing and operating highly available, scalable ML systems for inference, evaluation, and experimentation.
  • Deep understanding of observability practices, including monitoring, alerting, and performance tracking for ML systems.
  • Experience with distributed systems, reliability engineering, and incident response in production environments.
  • Ability to work through ambiguity with a strong sense of ownership and urgency in a fast-paced, remote-first environment.

Nice To Have

  • Experience with ML CI/CD pipelines and with ML frameworks such as PyTorch or TensorFlow.
  • Familiarity with modern inference frameworks (e.g., vLLM, Triton, TensorRT).
  • Experience managing GPU workloads, including orchestration, utilization, and cost optimization.
  • Knowledge of performance optimization techniques for training and inference (e.g., CUDA profiling, memory optimization, multi-GPU systems).
  • Exposure to healthcare, CRT, or workflow-heavy operational platforms.

Submit Resume

HR@atlas-vue.com