Chaos Engineering & Resilience Testing: Design and implement chaos experiments that simulate failure scenarios, stress test the system, and uncover potential weaknesses in our infrastructure and applications.
AWS Infrastructure Expertise: Work with AWS services such as EC2, S3, RDS, and Lambda to develop and execute fault injection tests that check the resilience of the cloud environment against real-world failure scenarios.
Distributed Systems & Data Reliability: Use experience with distributed databases (e.g., Cassandra) and data streaming platforms (e.g., Kafka) to test the fault tolerance and reliability of critical data processes.
Monitoring & Observability: Use monitoring tools (e.g., Prometheus, Grafana, ELK stack) to gain insight into system health and performance, analyzing metrics to detect and diagnose issues.
Networking Fundamentals: Apply knowledge of TCP/IP networking principles to simulate and understand network-related failures and their impact on system resilience.
Collaboration with Performance Team: Work closely with the Performance Test team to design experiments aligned with performance goals, share insights, and develop strategies to improve overall system resilience.
Containerization & Orchestration (Plus): Use container tooling such as Kubernetes, Docker, and AWS Fargate to run and manage chaos tests in containerized environments.

Skills and Qualifications

Education: Bachelor's degree in Information Technology, Computer Science, or a related field.
Advanced/Fluent English communication.
5+ years of experience in Chaos Engineering, Site Reliability Engineering, or related roles, with a strong focus on distributed systems.
In-depth knowledge of AWS services and architecture, including EC2, S3, RDS, and Lambda.
Proficiency with distributed systems, and experience with databases such as Cassandra and data streaming platforms such as Kafka.
Strong understanding of monitoring and observability tools, particularly Prometheus, Grafana, and the ELK stack.
Networking fundamentals, with practical knowledge of TCP/IP principles in distributed environments.