Salesforce

4.1

based on 802 Reviews

132 Salesforce Jobs

Lead/Principal - Site Reliability Engineering

9-16 years

Hyderabad / Secunderabad

1 vacancy

posted 1mon ago

Job Role Insights

Flexible timing

Key skills for the job

Salesforce Automation Testing Customer Support Operations Linux Administration CRM

+ 4 more

Job Description

As a Lead/Principal Software Engineer in Site/Product Reliability Engineering, you will play a pivotal role in ensuring and scaling the reliability of our AgentForce platform. Working in our India operations center, this role requires shift work, including weekends, to support services aligned with US hours. You will be part of a high-impact team dedicated to maintaining the availability and performance of Salesforce's AgentForce platform, with a focus on generative and predictive AI platform production support.

About Salesforce:
We're Salesforce, the Customer Company, inspiring the future of business with AI + Data + CRM. We help companies blaze new trails and connect with customers in meaningful ways, while empowering our teams to drive positive change in the world and achieve their career goals.

Your Impact:
In this role, you will address end-to-end production challenges related to the AgentForce AI platform. You will lead the triaging of production issues for critical projects within our Generative AI platform, implement automated solutions to enhance reliability, and maintain comprehensive documentation of production incidents. Additionally, you will collaborate closely with AgentForce AI, product, and platform teams as part of a dynamic and innovative group of developers, architects, and product engineers.

Key Responsibilities:

Passion for triaging and solving complex problems in production systems.
You will establish the reliability process and collaborate closely with lead engineers.
Multi-System Debugging and Triage (must-have): AgentForce integrates multiple Salesforce platforms, such as Core, Service Cloud, Sales Cloud, Data Cloud, and AI Cloud, in addition to LLM providers like OpenAI, Azure OpenAI, and AWS Bedrock. Expertise in diagnosing and triaging performance and scalability issues across these diverse systems and vendors, as well as addressing scaling challenges, is essential.
Capable of investigating alerts and customer-reported issues, comprehensively analyzing the end-to-end stack. This includes first-level triage to assess all systems involved in a specific use case, identifying root causes, and generating detailed reports. Escalate to relevant engineering contacts and work to resolve the issue when necessary.
Salesforce Core Platform Knowledge (nice to have): Familiarity with the Salesforce Core platform and its architecture is a plus, given AgentForce's diverse configurations, user permissions, and CRM licensing setups. Strong knowledge of feature provisioning, user permissions, and CRM licensing requirements is beneficial.
Production Support & Issue Triage: Lead and shape the production triage process for AgentForce, focusing on service, infrastructure deployment, configuration, performance, and latency issues.
Collaborate with cross-functional teams and external partners to ensure scalable and reliable services.
Maintain comprehensive documentation of production issues, workflows, and areas for improvement.
Infrastructure & Scaling Management: Understand and support capacity modeling and forecasting to ensure adequate capacity for AgentForce services in production.
Ensure that Large Language Models (LLMs) and associated services in production scale in line with projected capacity requirements based on usage patterns. Regularly review chatbot and AI model utilization and optimize capacity based on usage trends to prevent outages.
Automation & Operational Excellence: Create and maintain playbooks and detailed knowledge articles for future analysis and troubleshooting. Automate manual processes to maintain high availability and repeatability of production systems.
Monitoring & Trust Management: Use the availability and trust dashboards, and adjust SLOs and SLIs based on production feedback.
Identify automation gaps in production and establish critical user journey (CUJ) benchmarks for reliability and trustworthiness.
Cross-functional Collaboration: Establish strong partnerships with the Customer Support Groups (CSG) team to streamline escalations and minimize disruptions.
Be part of the 24x7 on-call support and multi-GEO coverage to maintain service reliability during peak periods.
Stakeholder Collaboration: Collaborate with business and engineering stakeholders for operational excellence, processes, and SLAs. Drive improvements based on key metrics, KPIs, and customer feedback.

Minimum Qualifications:

Bachelor's degree in Computer Science, Engineering, or a related technical field.
Proven expertise in implementing robust reliability processes across full-stack, end-to-end ML platforms, with in-depth understanding of Generative AI architecture and systems.
8+ years of experience in production support and triaging roles with a focus on end-to-end infrastructure and operational reliability.
Experience in DevOps or data center management roles with expertise in Linux system engineering.
Strong knowledge of cloud services (AWS preferred), container technologies (Docker, Kubernetes), and CI/CD tools (Jenkins, GitLab).
Proficiency in scripting languages (Python, Shell, Golang) and knowledge of AI model deployment and scaling.

Preferred Qualifications:

Experience in managing large-scale AI applications and services, including monitoring and diagnostic techniques.
Expertise in deploying and managing LLMs and technologies like Retrieval-Augmented Generation (RAG).
Background in monitoring tools such as Splunk, Prometheus, Grafana, and the ELK stack.
Knowledge of Java profilers (e.g., Java Flight Recorder) and OpenTelemetry.
Knowledge of TCP/IP networking protocols and infrastructure services in IaaS environments.
Familiarity with MLOps tools and practices for supporting the machine learning lifecycle.
AWS or Salesforce certifications are a plus.

What We Offer:

An opportunity to lead and scale key initiatives within our AI platform.
A collaborative work environment focused on innovation and impact.
Competitive compensation and benefits package.

Employment Type: Full Time, Permanent

People are getting interviews at Salesforce through

(based on 121 Salesforce interviews)

Job Portal

Referral

Company Website

Campus Placement

Walkin

Recruitment Consultant

30%

26%

15%

18% candidates got the interview through other sources.

High Confidence

What people at Salesforce are saying

3.6

Rating based on 4 Site Reliability Engineer reviews

Anonymous · Software Development in Hyderabad/Secunderabad

Likes

Work culture, work life balance, great benefits, very good scope for learning.

Dislikes

Tough to get promotion

Site Reliability Engineer salary at Salesforce

reported by 17 employees with 2-8 years exp.

₹18.9 L/yr - ₹42 L/yr

74% more than the average Site Reliability Engineer Salary in India

What Salesforce employees are saying about work life

based on 803 employees

67%

85%

72%

66%

Flexible timing

Monday to Friday

No travel

Day Shift

Compare Salesforce with

SAP: 4.2
Zoho: 4.3
Oracle: 3.7
Adobe: 4.0
Freshworks: 3.5
SPRINKLR: 3.2
Google: 4.4
Atlassian: 3.7
IBM: 4.1
OutSystems: 3.4
Pegasystems: 3.6
Twilio: 4.0
Microsoft Corporation: 4.1
Accenture: 3.9
Infosys: 3.7
Wipro: 3.7
TCS: 3.7
HCLTech: 3.6
Bosch Global Software Technologies: 4.0
Amdocs: 3.8

Similar Jobs for you

Site Reliability Engineer Lead at Zenoti · Hyderabad / Secunderabad · 10-13 Yrs · ₹ 35-40 LPA
Site Reliability Engineer at Sagent M&c · Chennai · 12-15 Yrs · ₹ 35-45 LPA
Senior Manager at Salesforce · Hyderabad / Secunderabad, Bangalore / Bengaluru · 8-13 Yrs · ₹ 50-75 LPA
Senior Site Reliability Engineer at American Express Company · Gurgaon / Gurugram · 6-12 Yrs · ₹ 35-40 LPA
Site Reliability Engineer at Angel One · Bangalore / Bengaluru · 5-9 Yrs · ₹ 25-40 LPA
Site Reliability Engineer at Sagent · Chennai · 12-15 Yrs · ₹ 35-50 LPA
Principal Site Reliability Engineer at Providence Global Center · Hyderabad / Secunderabad · 9-14 Yrs · ₹ 30-40 LPA
Site Reliability Engineer at Live Connections · Hyderabad / Secunderabad · 10-15 Yrs · ₹ 35-45 LPA
Site Reliability Engineer at NCR Voyix · Hyderabad / Secunderabad · 9-14 Yrs · ₹ 30-45 LPA
Technical Support Engineer at Salesforce · Hyderabad / Secunderabad, Bangalore / Bengaluru · 10-15 Yrs · ₹ 37.5-45 LPA