As a Site Reliability Engineer (SRE) (m/f/d), you will play a crucial role in maintaining and improving the reliability, performance, and scalability of our cloud-based software solutions used in the construction industry. Working closely with cross-functional teams, you'll contribute to the design, implementation, and operation of highly available systems, ensuring seamless operations and a superior user experience.
What your day will look like
Collaborate with development, operations, and infrastructure teams to design and implement robust and scalable solutions for our cloud-based software platforms.
Monitor, maintain, and optimize the reliability, performance, and availability of our systems by implementing best practices in infrastructure, monitoring, and automation.
Troubleshoot and resolve complex technical issues related to infrastructure, applications, and performance to ensure minimal downtime and maximum system efficiency.
Develop and maintain tools for automation, configuration management, and continuous integration/delivery (CI/CD) to streamline deployment processes.
Implement security best practices and ensure compliance with relevant industry standards to safeguard our systems and data.
Participate in on-call rotations and incident response activities to address and resolve critical system issues promptly.
Contribute to capacity planning, scalability assessments, and disaster recovery planning to support the growing demands of our software platforms.
Continuously evaluate and adopt new technologies, tools, and methodologies to improve system reliability, performance, and efficiency.
What you need to fulfill the role
Bachelor's degree in Computer Science, Information Technology, or a related field.
Proven experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role within a cloud-based environment.
Strong expertise in cloud platforms such as AWS or Azure.
Hands-on experience with infrastructure as code (IaC) tools (e.g., Terraform, CloudFormation) and configuration management tools (e.g., Ansible, Puppet).
Proficiency in scripting and programming languages such as Python, Bash, or Go for automation tasks.
Solid understanding of networking principles, databases, and web services.
Experience with monitoring and logging tools (e.g., Prometheus, ELK Stack, Grafana) for system performance analysis and troubleshooting.
Excellent problem-solving skills, the ability to work in a fast-paced environment, and strong communication skills.