- Design and implement solutions to enhance the reliability and scalability of platforms and applications to accommodate rapidly growing demands.
- Analyze defects, propose improvements, and drive efficiencies in systems and processes.
- Optimize the performance and utilization of AI/ML platforms and infrastructure.
- Develop observability, security, and FinOps tools and orchestration.
- Author and improve the quality of technical engineering documentation.
- Debug and solve issues in a production environment.
- Participate in on-call rotations and escalation workflows.
- Guide and assist others in building appropriate-level designs and gaining consensus from peers where appropriate.
- Collaborate with other software engineers and teams to design and implement deployment approaches using automated continuous integration and continuous delivery pipelines.
- Collaborate with other software engineers and teams to design, develop, test, and implement availability, reliability, and scalability solutions in their applications.
- Implement infrastructure, configuration, and network as code for the applications and platforms in your remit.
Required qualifications, capabilities, and skills
- Formal training or certification on Site Reliability Engineering concepts and 3+ years applied experience
- Proficiency in site reliability culture and principles, and familiarity with how to implement site reliability within an application or platform
- Expertise in programming with Python and cutting-edge software engineering practices.
- Coding skills in one or more languages such as Python, Java, PHP, shell scripting, or PowerShell scripting
- Experience in designing and implementing large-scale distributed systems and cloud-native architecture.
- Experience developing on cloud platforms, especially AWS, and knowledge of Infrastructure as Code tools such as Terraform
- Ability to identify new technologies and relevant solutions to ensure design constraints are met by the software team
- Ability to initiate and implement ideas to solve business problems
Preferred qualifications, capabilities, and skills
- Prior experience working in AI, ML, or data engineering.
- Systematic problem-solving and troubleshooting skills in a complex system.
- Excellent communication skills working with stakeholders and domain experts across the company to design solutions to user problems.
- Self-disciplined, self-managed, self-motivated with a strong sense of ownership, urgency, and
Employment Type: Full Time, Permanent