Entrupy is seeking a Senior ML-Ops Engineer to join our team responsible for our machine learning ops infrastructure. This is an onsite role based out of our Bangalore office. This role will report to Ashwath, Director of Engineering.
This role involves a mix of technical oversight, technical contribution, and operations work. In addition, this role will serve as a point of contact with various US-based and India-based teams responsible for model delivery: annotation teams, infrastructure engineers, machine learning engineers, and product teams.
Some project areas this team is responsible for include:
Infrastructure and libraries to define, deploy, run, and monitor training and inference jobs
Providing interfaces and tooling for ML engineers to work with
Job graph visualization and analytics
Bringing research models and code to production
Hybrid cloud server provisioning and automation
Internal dashboards and annotation tools
Contributing to best practices and methodology guidelines for data science teams
Platform advocacy, training, and mentoring
What You'll Do:
Work with engineering leads to define and build new features and subsystems for our machine learning platform.
Build interfaces and tools for researchers, engineers, and data teams.
Help bring research algorithms into production and scale them alongside their products.
Develop and maintain automated testing and contribute to integration testing and rollouts.
Assist research and product teams in the use of the platform.
Stay in regular communication with research and product teams to understand how the platform can best assist in other teams' objectives.
Participate in code review and contribute to Entrupy's technical standards.
Mentor junior developers and researchers.
Manage annotation work from first- and third-party annotation teams as required.
Collaborate with infrastructure teams to provision services and establish deployment workflows.
Who You Are:
Experience with data pipelines, job schedulers, or applied machine learning systems.
Experience with architecting, implementing, deploying, and monitoring backend systems.
Prior experience deploying, maintaining, and monitoring machine learning models in production environments, with a focus on model lifecycle management (retraining, versioning, etc.).
Strong proficiency with Python.
Experience maintaining infrastructure on AWS or other cloud services.
Experience with reviewing code/pull requests from junior developers.
At least 5 years of software development experience.
Good to Have:
Experience with cloud automation and DevOps tools (logging, CI, GitHub Actions, shell scripting, etc.).
Experience with low-latency services and performance-sensitive code.
Familiarity with Anyscale for scalable and distributed machine learning workloads is a plus.
Experience with data visualization or analytics (e.g., Tableau).
Experience with tools like AWS CloudFormation or Terraform is a plus.
Experience with orchestration frameworks (e.g., Airflow).
Experience with experiment tracking and model versioning (e.g., Weights & Biases, MLflow).
Experience with model monitoring and alerting (e.g., Grafana, Prometheus, Datadog).
Previous experience working in a startup-like environment.
What We Offer:
Market-competitive, pay-equity-focused compensation structure
Hybrid work, with the flexibility to work from anywhere for 4 weeks per year
Generous time away including company holidays, paid time off, sick time, parental leave, and more!
Rich medical benefits and insurance coverage
Opportunity to be part of the core team in a growing setup.