Work with customers, AWS, GCP, Azure, Netskope Platform Engineering and Operations, and other teams to ensure all services are successfully deployed and running smoothly.
Deploy Kubernetes services across public and private clouds. Validate the health and functionality of all services.
Develop innovative ways to measure, monitor, and report application and infrastructure health (including CPU, memory, storage, cache, and message queues).
Maintain the deployment configuration for our services running on K8s.
Help with migration of services from VMs to K8s or with necessary changes to support infrastructure improvements.
Run scripts and other tools in production to validate new system deployments and to update binaries and content.
Gain deep knowledge of our application stack.
Help formulate runbooks for any production related activities.
Monitor for successful log ingestion and metrics emission.
Help improve the performance of our microservices and remove latency bottlenecks.
Solve scaling issues including auto-scaling, capacity management and planning.
Function well in a fast-paced and rapidly changing environment.
Participate with the dev teams in 24x7 on-call rotations.
Debug and optimize code and automate routine tasks.
Drive efficiencies in systems and processes: capacity planning, configuration management, performance tuning, monitoring and root cause analysis.
Wear multiple hats.
Solve complex, exciting challenges and improve the depth and breadth of your technical and analytical skills.
Required skills and experience:
5+ years of experience troubleshooting Unix/Linux.
Excellent written and verbal communication skills.
Understanding of networking concepts: TCP/IP, SSL/TLS, IPsec, GRE, VPN.
Experience with one or more of the following languages: Python, Golang, C++, C.
Experience with algorithms, data structures, complexity analysis, and software design.
Hands-on experience with cloud services (OpenStack, AWS, GCP, Azure) in a highly available and scalable production environment.
Experience with CI/CD and automation tools such as Jenkins, Ansible, and Spinnaker.
Knowledge of distributed systems is a big plus.
Previous experience working with a geographically distributed team, with proven ability to work well in a diverse environment with customers, management, other DevOps engineers and SREs, developers, and quality assurance engineers.
Experience with Sumo Logic, Grafana/Prometheus and Elastic Stack.
Demonstrated ability to own and deliver projects independently and in a timely manner.
Education:
BSCS or equivalent required; MSCS or equivalent strongly preferred.