Design, implement, and maintain monitoring and alerting systems to ensure high availability and performance of satellite communication platforms.
Proactively identify and address system bottlenecks, vulnerabilities, and other reliability challenges.
Ensure infrastructure is capable of supporting AI and ML workloads at scale, with a focus on automation and efficiency.
Infrastructure Management & Automation:
Build and maintain CI/CD pipelines for satellite communication AI/ML applications, ensuring smooth deployment and integration processes.
Implement and optimize cloud-native architectures, using platforms such as AWS, GCP, or Azure, to support AI/ML models and satellite communication systems.
Automate scaling, deployment, and configuration of infrastructure to ensure high availability and fault tolerance.
Incident Management & Root Cause Analysis:
Lead incident response efforts, including troubleshooting, root cause analysis, and resolution of production issues.
Implement post-mortem analysis processes to continuously improve the reliability and performance of systems.
Establish best practices for incident documentation, including actionable feedback and lessons learned.
Collaboration & Continuous Improvement:
Work closely with engineering teams, including AI/ML developers, software engineers, and network engineers, to identify areas for improvement and optimize system performance.
Collaborate with satellite engineers to integrate AI/ML solutions into the satellite communication stack, ensuring performance optimization and automation.
Contribute to the development of internal tools and dashboards to enhance system reliability and transparency.
Security & Compliance:
Ensure security best practices are implemented across the satellite communication platform, particularly regarding AI/ML data privacy and satellite systems.
Collaborate with security teams to ensure systems are compliant with industry standards and regulations.
Qualifications:
Required:
Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
7+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering.
Strong knowledge of cloud platforms (AWS, GCP, Azure) and containerization and orchestration tools (Docker, Kubernetes).
Experience with infrastructure-as-code tools (Terraform, Ansible, etc.).
Strong expertise in monitoring, logging, and alerting tools (Prometheus, Grafana, ELK Stack, etc.).
Familiarity with AI/ML systems and how they can be scaled and managed in production environments.
Experience with scripting languages (Python, Bash, Go, etc.) for automation and tool development.
Preferred:
Experience with satellite communication systems or space-based infrastructure.
Knowledge of networking protocols and technologies related to satellite communication.
Experience with machine learning frameworks (TensorFlow, PyTorch, etc.) and deploying AI models in production.
Familiarity with disaster recovery, backup strategies, and high-availability configurations for cloud-based systems.
Certification in cloud platforms (AWS Certified Solutions Architect, Google Cloud Professional Cloud Architect, etc.).
Skills & Attributes:
Problem-Solving & Critical Thinking: Ability to think creatively and analytically to solve complex problems in real time.
Collaboration: Excellent team player with the ability to work cross-functionally in a collaborative environment.
Adaptability: Able to thrive in a fast-paced, constantly evolving environment and adapt to new technologies and methodologies.
Communication: Strong written and verbal communication skills, with the ability to explain technical concepts clearly to non-technical stakeholders.
What We'll Offer
Professional development opportunities.
Collaborative and innovative work environment.
Aviation and maritime domain exposure and business knowledge.
Connectivity and content engineering and business knowledge
Opportunity to work in cross-functional teams.
End-to-end enterprise product experience with direct customer feedback.
Travel & Entertainment discounts (cruises).
Performance-based bonus
International travel/vacation; a cruise or in-flight end-user experience
Opportunity to work across teams and organizations.