Work with SageMaker, TensorFlow, PyTorch, Triton, Spark, or equivalent large-scale distributed machine learning technologies on a modern containerized deployment stack using Kubernetes, Spinnaker, and other technologies
Experience building distributed microservices on AWS, GCP, or other public cloud substrates
Eat, sleep, and breathe services. You have experience balancing live-site management, feature delivery, and retirement of technical debt
Partner with Product Managers, Architects, and Data Scientists to understand customer requirements, and help translate those requirements into working software
Own the technology for fully orchestrated machine learning APIs for Einstein Platform
Contribute to the long-range plan, and help drive the microservices architectures for machine learning
Design, develop, debug, and operate resilient distributed systems that run across thousands of compute nodes in multiple datacenters
Participate in the team's on-call rotation to address complex problems in real time and keep services operational and highly available
Create and enforce processes that ensure quality of work, and drive engineering excellence
Exhibit a customer-first mentality while making decisions, and be responsible and accountable for the output of the team
Partner with vendors like AWS and with Data Science teams to pick the best fit in terms of libraries and compute to deliver cost-effective and scalable model hosting and tuning/training capabilities
Work collaboratively in geographically distributed teams across North America, EMEA, and APAC
Core Qualifications:
BS, MS, or PhD in computer science or a related field, or equivalent work experience, with 9 to 15 years of experience
3+ years of hands-on experience designing and developing complex big data, machine learning systems, and microservices architectures
Track record of leading highly impactful projects from conception to production
Expertise in JVM-based languages (Java, Scala) and/or Python
Experience leading or working in teams that have built and run machine learning services, such as for training and inference, at scale for predictive and generative models
Experience with open source projects such as Spark, Kafka, Feast, and Iceberg
Experience building software on AWS cloud services such as OpenSearch, DynamoDB, EMR, and S3