132 Salesforce Jobs

Lead/Principal - Site Reliability Engineering

9-16 years

Hyderabad / Secunderabad

1 vacancy

Salesforce

Posted 1 month ago

Job Role Insights

Flexible timing

Job Description

As a Lead/Principal Software Engineer in Site/Product Reliability Engineering, you will play a pivotal role in ensuring and scaling the reliability of our AgentForce platform. Working in our India operations center, this role requires shift work, including weekends, to support services aligned with US hours. You will be part of a high-impact team dedicated to maintaining the availability and performance of Salesforce's AgentForce platform, with a focus on production support for the generative and predictive AI platform.

About Salesforce:
We're Salesforce, the Customer Company, inspiring the future of business with AI + Data + CRM. We help companies blaze new trails and connect with customers in meaningful ways, while empowering our teams to drive positive change in the world and achieve their career goals.

Your Impact:
In this role, you will address end-to-end production challenges related to the AgentForce AI platform. You will lead the triaging of production issues for critical projects within our Generative AI platform, implement automated solutions to enhance reliability, and maintain comprehensive documentation of production incidents. Additionally, you will collaborate closely with AgentForce AI, product, and platform teams as part of a dynamic and innovative group of developers, architects, and product engineers.
Key Responsibilities:
  • Passion for triaging and solving complex problems in production systems.
  • You will establish the reliability process and collaborate closely with lead engineers.
  • Multi-System Debugging and Triage (must-have): AgentForce integrates multiple Salesforce platforms, such as Core, Service Cloud, Sales Cloud, Data Cloud, and AI Cloud, in addition to LLM providers like OpenAI, Azure OpenAI, and AWS Bedrock. Expertise in diagnosing and triaging performance and scalability issues across these diverse systems and vendors, as well as addressing scaling challenges, is essential.
  • Capable of investigating alerts and customer-reported issues by comprehensively analyzing the end-to-end stack. This includes first-level triage to assess all systems involved in a specific use case, identifying root causes, and generating detailed reports. Escalate to relevant engineering contacts when necessary and work to resolve the issue.
  • Salesforce Core Platform Knowledge (nice to have): Familiarity with the Salesforce Core platform and its architecture is a plus, given AgentForce's diverse configurations, user permissions, and CRM licensing setups. Strong knowledge of feature provisioning, user permissions, and CRM licensing requirements is beneficial.
  • Production Support & Issue Triage: Lead and shape the production triage process for AgentForce, focusing on service, infrastructure deployment, configuration, performance, and latency issues.
  • Collaborate with cross-functional teams and external partners to ensure scalable and reliable services.
  • Maintain comprehensive documentation of production issues, workflows, and areas for improvement.
  • Infrastructure & Scaling Management: Understand and support capacity modeling and forecasting to ensure adequate capacity for Agentforce services in production.
  • Ensure that the scaling of Large Language Models (LLMs) and associated services in production stays in line with projected capacity requirements based on usage patterns. Regularly review chatbot and AI model utilization and optimize capacity based on usage trends to prevent outages.
  • Automation & Operational Excellence: Create and maintain playbooks and detailed knowledge articles for future analysis and troubleshooting. Automate manual processes to maintain high availability and repeatability of production systems.
  • Monitoring & Trust Management: Utilize the availability and trust dashboards, and adjust SLOs and SLIs based on production feedback.
  • Identify automation gaps in production and compare against the established critical user journey (CUJ) benchmarks for reliability and trustworthiness.
  • Cross-functional Collaboration: Establish strong partnerships with the Customer Support Groups (CSG) team to streamline escalations and minimize disruptions.
  • Be part of the 24x7 on-call support and multi-GEO coverage to maintain service reliability during peak periods.
  • Stakeholder Collaboration: Collaborate with business and engineering stakeholders for operational excellence, processes, and SLAs. Drive improvements based on key metrics, KPIs, and customer feedback.
Minimum Qualifications:
  • Bachelor's degree in Computer Science, Engineering, or a related technical field.
  • Proven expertise in implementing robust reliability processes across full-stack, end-to-end ML platforms, with in-depth understanding of Generative AI architecture and systems.
  • 8+ years of experience in production support and triaging roles with a focus on end-to-end, infrastructure, and operational reliability.
  • Experience in DevOps or data center management roles with expertise in Linux system engineering.
  • Strong knowledge of cloud services (AWS preferred), container technologies (Docker, Kubernetes), and CI/CD tools (Jenkins, GitLab).
  • Proficiency in scripting languages (Python, Shell, Golang) and knowledge of AI model deployment and scaling.
Preferred Qualifications:
  • Experience in managing large-scale AI applications and services, including monitoring and diagnostic techniques.
  • Expertise in deploying and managing LLMs and technologies like Retrieval-Augmented Generation (RAG).
  • Background in monitoring tools such as Splunk, Prometheus, Grafana, and the ELK stack.
  • Knowledge of Java profilers (e.g., Java Flight Recorder) and OpenTelemetry.
  • Knowledge of TCP/IP networking protocols and infrastructure services in IaaS environments.
  • Familiarity with MLOps tools and practices for supporting the machine learning lifecycle.
  • AWS or Salesforce certifications are a plus.
What We Offer:
  • An opportunity to lead and scale key initiatives within our AI platform.
  • A collaborative work environment focused on innovation and impact.
  • Competitive compensation and benefits package.

Employment Type: Full Time, Permanent

People are getting interviews at Salesforce through

(based on 121 Salesforce interviews)
  • Job Portal: 30%
  • Referral: 26%
  • Company Website: 15%
  • Campus Placement: 7%
  • Walk-in: 2%
  • Recruitment Consultant: 2%
18% of candidates got the interview through other sources.
High Confidence: the data is based on a large number of responses received from the candidates.

What people at Salesforce are saying

3.6 rating based on 4 Site Reliability Engineer reviews

Likes

Work culture, work life balance, great benefits, very good scope for learning.

Dislikes

Tough to get promotion

Site Reliability Engineer salary at Salesforce

reported by 17 employees with 2-8 years exp.
₹18.9 L/yr - ₹42 L/yr
74% more than the average Site Reliability Engineer Salary in India

What Salesforce employees are saying about work life

based on 803 employees
  • Flexible timing: 67%
  • Monday to Friday: 85%
  • No travel: 72%
  • Day Shift: 66%

Salesforce Benefits

Free Food
Work From Home
Health Insurance
Cafeteria
Education Assistance
Free Transport +6 more

Compare Salesforce with

  • SAP: 4.2
  • Zoho: 4.3
  • Oracle: 3.7
  • Adobe: 4.0
  • Freshworks: 3.5
  • SPRINKLR: 3.2
  • Google: 4.4
  • Atlassian: 3.7
  • IBM: 4.1
  • OutSystems: 3.4
  • Pegasystems: 3.6
  • Twilio: 4.0
  • Microsoft Corporation: 4.1
  • Accenture: 3.9
  • Infosys: 3.7
  • Wipro: 3.7
  • TCS: 3.7
  • HCLTech: 3.6
  • Bosch Global Software Technologies: 4.0
  • Amdocs: 3.8

Similar Jobs for you

Site Reliability Engineer Lead at Zenoti

Hyderabad / Secunderabad

10-13 Yrs

₹ 35-40 LPA

Site Reliability Engineer at Sagent M&c

Chennai

12-15 Yrs

₹ 35-45 LPA

Senior Manager at Salesforce

Hyderabad / Secunderabad, Bangalore / Bengaluru

8-13 Yrs

₹ 50-75 LPA

Senior Site Reliability Engineer at American Express Company

Gurgaon / Gurugram

6-12 Yrs

₹ 35-40 LPA

Site Reliability Engineer at Angel One

Bangalore / Bengaluru

5-9 Yrs

₹ 25-40 LPA

Site Reliability Engineer at Sagent

Chennai

12-15 Yrs

₹ 35-50 LPA

Principal Site Reliability Engineer at Providence Global Center

Hyderabad / Secunderabad

9-14 Yrs

₹ 30-40 LPA

Site Reliability Engineer at Live Connections

Hyderabad / Secunderabad

10-15 Yrs

₹ 35-45 LPA

Site Reliability Engineer at NCR Voyix

Hyderabad / Secunderabad

9-14 Yrs

₹ 30-45 LPA

Technical Support Engineer at Salesforce

Hyderabad / Secunderabad, Bangalore / Bengaluru

10-15 Yrs

₹ 37.5-45 LPA

Lead/Principal - Site Reliability Engineering

9-16 Yrs

Hyderabad / Secunderabad

1mon ago·via naukri.com

Salesforce Project Manager

6-10 Yrs

Kolkata, Mumbai, New Delhi +4 more

5hr ago·via naukri.com

Senior Manager, Data Engineering

8-13 Yrs

Hyderabad / Secunderabad, Bangalore / Bengaluru

7hr ago·via naukri.com

Mulesoft - Customer Success Manager

5-10 Yrs

Hyderabad / Secunderabad, Bangalore / Bengaluru

7hr ago·via naukri.com

Sales Strategy Manager

5-10 Yrs

Bangalore / Bengaluru

7hr ago·via naukri.com

Lead Technical Writer

8-10 Yrs

Hyderabad / Secunderabad

7hr ago·via naukri.com

Manager, Technical Consulting

9-15 Yrs

Mumbai, Hyderabad / Secunderabad, Pune +3 more

1d ago·via naukri.com

Principal Solution Engineer

12-20 Yrs

Mumbai

1d ago·via naukri.com

Senior Solution Engineer

6-11 Yrs

Gurgaon / Gurugram

1d ago·via naukri.com

Agile Coach Senior

12-15 Yrs

Hyderabad / Secunderabad

1d ago·via naukri.com