Actively monitor systems, applications, and infrastructure across cloud environments (GCP and Azure).
Ensure that service levels, such as uptime and performance, meet the expected standards.
Support Tickets and Issue Resolution:
Work on support tickets raised by platform users, addressing technical problems and providing timely solutions to ensure smooth operations.
Incident Management:
Lead the management and resolution of incidents, minimizing downtime and ensuring quick recovery.
Manage the incident lifecycle from detection to resolution, coordinating across teams as necessary.
Root Cause Analysis and Problem Management:
Perform root cause analysis for incidents and recurring problems to prevent future occurrences. Document findings and implement preventive measures to maintain service reliability.
Automation and Optimization:
Write scripts and automation tools (primarily using Python) to reduce manual intervention and optimize operational tasks, driving efficiency and consistency.
Cloud Monitoring and Operations (GCP and Azure):
Leverage your expertise in cloud technologies to monitor and manage resources in GCP and Azure environments.
Ensure seamless integration, configuration, and scaling of cloud services.
ServiceNow Integration:
Use ServiceNow for managing and tracking incidents, requests, and changes. Ensure proper documentation and ticket management following ITIL best practices.
Collaboration with Cross-functional Teams:
Work closely with development, operations, and other engineering teams to maintain a unified approach to platform reliability and performance. Provide input for the continuous improvement of the platform and its processes.