Goibibo
Site Reliability Engineer
posted 2d ago
Flexible timing
The Site Reliability Team is responsible for monitoring all aspects of MakeMyTrip, including production servers and services. You will act as the first line of defense against any service unavailability or performance degradation in our production services, 24 x 7 x 365.
You will frequently interact with various groups within the organization, such as Engineering, Sales and Products, so a good all-round understanding of components, systems and networks is a must. Diligence and attention to detail are also key skills, along with the ability to multitask and prioritize work appropriately.
We don't expect you to have all the required knowledge when you join us, as many of these skills can be picked up through experience on the job. However, those who want to gain new skills and grow must be prepared to spend time on suitable research and learning. You must be an eager and quick learner with decent communication skills, and must be able to use your initiative to tackle a broad range of problems.
Prime Responsibilities:
- Regularly examine multiple monitoring systems for unexpected deviations in any of the application layers.
- React to alerts with well-defined procedures, escalate problems to the appropriate people, follow up until resolution, and complete incident reporting.
- Set up and monitor alerts on OPS tools and monitoring applications such as Zabbix, Grafana and the ELK stack.
- Create shell/Python script-based reports and schedule them with cron to support periodic reporting.
- Adhere to defined processes and be ready for ad hoc and unexpected incidents.
- Help your coworkers by creating documentation and detailed knowledge sharing for continuous improvement.
- Communicate and report clearly.
- Troubleshoot live-site production issues by correlating different components.
- Perform day-to-day maintenance of the application systems in operation, including identifying and troubleshooting application issues and resolving or escalating them.
Desired Skills:
- 3-6 years of relevant experience in a 24x7 AWS cloud-based Linux production environment.
- Ability to monitor diverse architectures, troubleshoot problems, analyze impact and escalate.
- Willingness to work precise schedules, night shifts and weekends on a rotational basis to support our 24x7 systems.
- Basic Linux command skills are a must; experience in a scripting language (Shell/Python) is a plus.
- Basic knowledge of Web/Internet concepts, e.g. DNS, common protocols, ports, cookies, Firebug.
- Hands-on experience in L2 debugging, such as finding errors/exceptions in logs.
- Basic knowledge of SQL queries.
- Ability to work well in a busy team, learn quickly and deal with a wide range of issues.
- Prior experience with ELK, Zabbix or Grafana would be an added advantage.
- Knowledge of the AWS cloud environment is a huge plus.
Employment Type: Full Time, Permanent