Tekgence India

4.5

based on 3 Reviews

Reliability Engineer - Ansible/Terraform (8-9 yrs)

Tekgence India Private Limited

8-9 years

posted 5d ago

Job Role Insights

Key skills for the job

Linux Administration, Datadog, Site Reliability Engineering, Jenkins, Splunk Admin, Terraform

Job Description

Job Summary:

We are looking for a strong Reliability Engineer to take on the responsibilities listed below.

Years of experience needed: 8+ years overall in product development.

Tasks to perform:

- Manage end-to-end customer migration interactions and close out issues. This is primarily the responsibility of the PE, in consultation with the RE wherever required.

- Run the Gameday (Operational Acceptance Testing)

- Ensure DR and resiliency are set up in line with business needs.

- Experience in Linux Administration

- Ensure logging, alarms, and troubleshooting are set up.

- Server Monitoring - BigPanda

- Application Monitoring - Datadog

- Service Level Metrics - Define the quality of service, service level agreement (SLA), and service level objective (SLO)

- Resource Monitoring - Azure Monitor

- Soft decommissioning of applications.

- Engage with the other teams as needed (e.g. DB Support, Network Support, Security etc.) to reduce service disruptions.

- Good knowledge of Azure Cloud (Oracle on VM, Networking, Security, Log Analytics)

- Knowledge of Key-Vault integration

- Knowledge of DevOps CI/CD pipelines (not needed for immediate implementation, but for future support) and Terraform.

Primary Skills:

- SRE, DevOps, Ansible, Terraform, Python, Docker, AWS (Atlas), ECS-based internal tooling, and monitoring tools (Datadog, Splunk, Dynatrace, Grafana).

Secondary Skills:

- Shell scripting, Linux, ThousandEyes, Gremlin, etc.

1. Automation principles and tools (Ansible, etc.). Should have worked on toil identification and quality-of-life automation.

2. Advanced working experience with two or more of the following: Unix/Linux, Windows Server, Oracle, MSSQL, MongoDB.

3. Experience with Python, Java, curl scripting, or any other type of scripting.

4. Experience with JIRA, Confluence, Bitbucket, GitHub, Jenkins, Jules, Terraform.

5. Experience with two or more of the following observability tools: AppDynamics, Geneos, Dynatrace, ECS-based internal tooling, Datadog, CloudWatch, BigPanda, Elasticsearch (ELK), Google Cloud Logging, Grafana, Prometheus, Splunk, ThousandEyes, etc.

6. Experience with logging, monitoring, and event detection on Cloud or Distributed platforms.

7. Experience creating and modifying technical documentation such as environment flows, functional requirements, and non-functional requirements.

8. Effective production management: incident and change management, production control, ITSM, ServiceNow; problem-solving and analytical skills, with the ability to turn findings into strategic imperatives.

9. Technical operations and application support experience, covering stability, reliability, and resiliency.

10. Minimum 4-6 years of hands-on experience in SRE implementation of monitoring systems and dashboard development for application reliability using Splunk, Dynatrace, Grafana, AppDynamics, Datadog, and BigPanda.

11. Experience working on Configuration as Code, Infrastructure as Code, and AWS (Atlas).

12. Provides technical direction on monitoring and logging to less experienced staff, or develops highly complex original solutions. Acts as an expert technical resource for modeling, simulation, and analysis efforts.

13. Overall, we are looking for an Automation Engineer who can reduce toil and improve the system's reliability and scalability.