Site Reliability Engineer - System Administration & Support (7-9 yrs)
Fork Technologies
posted 2d ago
Flexible timing
We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our dynamic team. As an SRE, you will be responsible for ensuring the reliability, scalability, and performance of our production systems. You will collaborate with cross-functional teams, identify and resolve system bottlenecks, and proactively monitor the health of our infrastructure. If you are passionate about infrastructure management, automation, and maintaining highly available systems, this role is for you!
Key Responsibilities :
System Administration and Support :
- Maintain and manage both Linux and Windows-based systems, ensuring their performance, availability, and security.
- Install, configure, and upgrade system software and hardware.
- Develop, implement, and enforce security policies to protect the infrastructure.
System Architecture and Configuration Management :
- Work with development and operations teams to design and maintain scalable, fault-tolerant, and highly available systems.
- Use tools like Ansible, Puppet, Chef, or SaltStack for configuration management and automation of tasks.
- Design and implement infrastructure-as-code (IaC) solutions using tools such as Terraform or CloudFormation.
Networking and Protocols Expertise :
- Apply a strong understanding of networking protocols (TCP/IP, HTTP, DNS) and load-balancing techniques to ensure optimal performance and uptime of systems.
- Manage network services, ensuring their high availability and low latency.
Monitoring and Performance Optimization :
- Implement and configure monitoring tools such as Grafana, Prometheus, and Loki to track system health and performance metrics.
- Set up alerts and dashboards to proactively monitor key system metrics (e.g., CPU, memory, disk I/O, network usage).
- Analyze logs and metrics to identify patterns, detect issues early, and recommend improvements to ensure reliability and stability.
Incident Response and Troubleshooting :
- Lead incident response efforts by coordinating with development, operations, and support teams to resolve critical incidents swiftly.
- Troubleshoot issues across complex systems and services, from the application layer to the network, to restore services quickly.
- Conduct root cause investigations of incidents to implement long-term solutions and minimize recurrence.
CI/CD and Automation :
- Maintain and improve CI/CD pipelines to enable seamless software delivery and system updates with minimal downtime.
- Automate manual processes and tasks to increase efficiency and reduce the chance of human error.
- Develop and manage scripts and tools for automating deployment, monitoring, backup, and recovery operations.
Load Testing and Performance Benchmarking :
- Perform API and load testing using tools like Gatling and JMeter to assess the scalability and performance of critical services and APIs.
- Analyze performance results and recommend improvements so that systems can handle increasing traffic loads and scale seamlessly.
Collaboration and Documentation :
- Collaborate with cross-functional teams, including development, QA, and operations, to implement system improvements, optimize performance, and ensure service reliability.
- Maintain comprehensive documentation for system configurations, processes, troubleshooting steps, and operational procedures.
- Effectively communicate complex technical concepts to both technical and non-technical stakeholders.
Key Skills and Qualifications :
Extensive Knowledge of Linux and Windows Systems :
- Strong hands-on experience with system administration and troubleshooting in both Linux (e.g., Ubuntu, CentOS) and Windows environments.
System Architecture and Configuration Management :
- Experience with configuration management tools (e.g., Ansible, Puppet, Chef).
- Familiarity with containerization (e.g., Docker, Kubernetes) and cloud services (e.g., AWS, Azure, GCP).
Networking Knowledge :
- In-depth knowledge of TCP/IP, HTTP, DNS, load balancing, and related networking concepts and protocols.
Monitoring Tools and Metrics Analysis :
- Proficient in using monitoring tools such as Grafana, Prometheus, and Loki for real-time system monitoring and alerting.
- Experience with analyzing and troubleshooting system performance metrics and logs.
Incident Management and Root Cause Analysis :
- Proven experience in managing incidents, conducting post-mortems, and implementing measures to prevent future incidents.
CI/CD and Automation :
- Hands-on experience with CI/CD tools like Jenkins, GitLab CI, or CircleCI for automating deployments.
- Ability to write automation scripts in languages like Python, Bash, or Ruby to streamline operational workflows.
Performance Testing :
- Expertise in using performance testing tools like Gatling and JMeter to assess system scalability and performance under load.
Collaboration and Documentation Skills :
- Excellent interpersonal skills with the ability to work collaboratively in cross-functional teams.
- Strong technical writing skills for documenting complex systems and processes.
Preferred Qualifications :
- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent work experience).
- Certifications in relevant technologies (e.g., AWS Certified Solutions Architect, Kubernetes Administrator, Red Hat Certified Engineer).
- Experience with distributed systems, container orchestration (e.g., Kubernetes), and microservices architecture.
- Familiarity with database technologies (e.g., MySQL, PostgreSQL, MongoDB, Cassandra).
Functional Areas: IT Hardware & Telecom