Design, build, and maintain scalable and resilient infrastructure for our applications, leveraging cloud platforms (e.g., AWS, Azure, GCP), container orchestration platforms (e.g., Kubernetes, Docker Swarm), and serverless technologies.
Automate infrastructure provisioning and management using tools like Terraform, Ansible, or Puppet, ensuring consistency and efficiency.
Operational Excellence:
Develop, implement, and monitor Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to measure and improve system performance and reliability.
Proactively identify and mitigate potential risks and bottlenecks within our infrastructure.
Respond swiftly and effectively to incidents and outages, conducting thorough post-mortem analyses to identify root causes and implement preventative measures.
Collaboration and Communication:
Foster strong working relationships with software engineering teams to optimize application performance, reliability, and security.
Clearly communicate technical concepts and solutions to both technical and non-technical audiences.
Actively participate in knowledge sharing and mentoring within the team.
Continuous Improvement:
Stay abreast of the latest advancements in cloud computing, containerization, and other relevant technologies.
Continuously evaluate and refine our operational processes and tooling to enhance efficiency and effectiveness.
Participate in on-call rotations to ensure 24/7 system availability and support.
Qualifications
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
3+ years of hands-on experience as an SRE, DevOps Engineer, or Systems Administrator.
Proven expertise in cloud computing platforms (AWS, Azure, or GCP) and in containerization and orchestration technologies (Docker, Kubernetes).
Demonstrated experience with infrastructure-as-code tools (Terraform, Ansible, Puppet) and their practical application.
Strong understanding of networking fundamentals (TCP/IP, DNS, load balancing, firewalls).
Proficiency in scripting and programming languages (Python, Bash, Go) for automation and system administration tasks.
Experience with monitoring and alerting systems (e.g., Prometheus, Grafana, Datadog) and their effective configuration.
Excellent analytical, problem-solving, and troubleshooting skills.
Strong communication and collaboration skills with a focus on teamwork and knowledge sharing.
Bonus Points
Experience with security best practices and tools.
Experience with serverless computing platforms (AWS Lambda, Google Cloud Functions).
Experience with stream processing technologies (Kafka, Kinesis).