Tableau Software

3.3

based on 3 Reviews

124 Tableau Software Jobs

Lead/Principal - Site Reliability Engineering

11-15 years

Hyderabad / Secunderabad

1 vacancy

Posted 1 month ago

Job Role Insights

Flexible timing

Key skills for the job

Salesforce Automation Testing Customer Support Operations Linux Administration CRM

Job Description

To get the best candidate experience, please consider applying for a maximum of 3 roles within 12 months to ensure you are not duplicating efforts.

Job Category

Software Engineering

Job Details

Role Description:
As a Lead/Principal Software Engineer in Site/Product Reliability Engineering, you will play a pivotal role in ensuring and scaling the reliability of our AgentForce platform. Based in our India operations center, this role requires shift work, including weekends, to support services aligned with US hours. You will be part of a high-impact team dedicated to maintaining the availability and performance of Salesforce's AgentForce platform, with a focus on generative and predictive AI platform production support.

About Salesforce:
We're Salesforce, the Customer Company, inspiring the future of business with AI + Data + CRM. We help companies blaze new trails and connect with customers in meaningful ways, while empowering our teams to drive positive change in the world and achieve their career goals.

Your Impact:
In this role, you will address end-to-end production challenges related to the AgentForce AI platform. You will lead the triaging of production issues for critical projects within our Generative AI platform, implement automated solutions to enhance reliability, and maintain comprehensive documentation of production incidents. Additionally, you will collaborate closely with AgentForce AI, product, and platform teams as part of a dynamic and innovative group of developers, architects, and product engineers.

Key Responsibilities:

Passion for triaging and solving complex problems in production systems.
You will establish the reliability process and collaborate closely with lead engineers.
Multi-System Debugging and Triage (must-have): AgentForce integrates multiple Salesforce platforms, such as Core, Service Cloud, Sales Cloud, Data Cloud, and AI Cloud, in addition to LLM providers like OpenAI, Azure OpenAI, and AWS Bedrock. Expertise in diagnosing and triaging performance and scalability issues across these diverse systems and vendors, as well as addressing scaling challenges, is essential.
Capable of investigating alerts and customer-reported issues by comprehensively analyzing the end-to-end stack. This includes first-level triage to assess all systems involved in a specific use case, identifying root causes, and generating detailed reports. Escalate to the relevant engineering contacts and work to resolve the issue when necessary.
Salesforce Core Platform Knowledge (nice to have): Familiarity with the Salesforce Core platform and its architecture is a plus, given AgentForce's diverse configurations, user permissions, and CRM licensing setups. Strong knowledge of feature provisioning, user permissions, and CRM licensing requirements is beneficial.
Production Support Issue Triage: Lead and shape the production triage process for AgentForce, focusing on service, infrastructure deployment, configuration, performance, and latency issues.
Collaborate with cross-functional teams and external partners to ensure scalable and reliable services.
Maintain comprehensive documentation of production issues, workflows, and areas for improvement.
Infrastructure Scaling Management: Understand and support capacity modeling and forecasting to ensure adequate capacity for Agentforce services in production.
Ensure that the scaling of Large Language Models (LLMs) and associated services in production stays in line with projected capacity requirements based on usage patterns. Consistently review chatbot and AI model utilization and optimize capacity based on usage trends to prevent outages.
Automation and Operational Excellence: Create and maintain playbooks and detailed knowledge articles for future analysis and troubleshooting. Automate manual processes to maintain high availability and repeatability of production systems.
Monitoring and Trust Management: Utilize the availability and trust dashboards, and adjust SLOs and SLIs based on production feedback.
Identify automation gaps in production and compare them against the established critical user journey (CUJ) benchmarks for reliability and trustworthiness.
Cross-functional Collaboration: Establish strong partnerships with the Customer Support Groups (CSG) team to streamline escalations and minimize disruptions.
Be part of the 24x7 on-call support and multi-GEO coverage to maintain service reliability during peak periods.
Stakeholder Collaboration: Collaborate with business and engineering stakeholders for operational excellence, processes, and SLAs. Drive improvements based on key metrics, KPIs, and customer feedback.

Minimum Qualifications:

Bachelor's degree in Computer Science, Engineering, or a related technical field.
Proven expertise in implementing robust reliability processes across full-stack, end-to-end ML platforms, with in-depth understanding of Generative AI architecture and systems.
8+ years of experience in production support and triaging roles with a focus on end-to-end, infrastructure, and operational reliability.
Experience in DevOps or data center management roles with expertise in Linux system engineering.
Strong knowledge of cloud services (AWS preferred), container technologies (Docker, Kubernetes), and CI/CD tools (Jenkins, GitLab).
Proficiency in scripting languages (Python, Shell, Golang) and knowledge of AI model deployment and scaling.

Preferred Qualifications:

Experience in managing large-scale AI applications and services, including monitoring and diagnostic techniques.
Expertise in deploying and managing LLMs and technologies like Retrieval-Augmented Generation (RAG).
Background in monitoring tools such as Splunk, Prometheus, Grafana, and the ELK stack.
Knowledge of Java profilers (e.g., Java Flight Recorder) and OpenTelemetry.
Knowledge of TCP/IP networking protocols and infrastructure services in IaaS environments.
Familiarity with MLOps tools and practices for supporting the machine learning lifecycle.
AWS or Salesforce certifications are a plus.

What We Offer:

An opportunity to lead and scale key initiatives within our AI platform.
A collaborative work environment focused on innovation and impact.
Competitive compensation and benefits package.

Learn more about Equality at www.equality.com and explore our company benefits at www.salesforcebenefits.com.

Salesforce welcomes all.

Employment Type: Full Time, Permanent

What Tableau Software employees are saying about work life

based on 3 employees

Flexible timing

Monday to Friday

No travel

Compare Tableau Software with

Qlik: 3.4
MicroStrategy: 2.0
Domo: 3.4
Bosch Global Software Technologies: 4.0
Amdocs: 3.8
Automatic Data Processing (ADP): 4.0
24/7 Customer: 3.5
Google: 4.4
Thomson Reuters: 4.1
Oracle Cerner: 3.7
VMware Software: 4.4
Adobe: 4.0
R Systems International: 3.4
OpenText Technologies: 3.7
Chetu: 3.2
Dassault Systemes: 4.0
Onward Technologies Inc: 3.2
Salesforce: 4.1
Temenos: 3.3
Globant: 3.9

Similar Jobs for you

Principal Site Reliability Engineer at Zycus Infotech Pvt Ltd

Mumbai

8-12 Yrs

₹ 10-14 LPA

Principal Site Reliability Engineer at Zycus Infotech Pvt Ltd

Bangalore / Bengaluru

8-12 Yrs

₹ 10-14 LPA

Site Reliability Engineer at Zycus Infotech Pvt Ltd

Mumbai

8-12 Yrs

₹ 10-14 LPA

Site Reliability Engineer at Zycus Infotech Pvt Ltd

Bangalore / Bengaluru

8-12 Yrs

₹ 10-14 LPA

Site Reliability Engineer at Reuters News Agency

Mumbai, Hyderabad / Secunderabad

8-12 Yrs

₹ 10-14 LPA

Principal Site Reliability Engineer at WebEx Communications India (P) Ltd.

Bangalore / Bengaluru

6-10 Yrs

₹ 8-12 LPA

Site Reliability Engineer Lead at Zycus Infotech Pvt Ltd

Mumbai

8-12 Yrs

₹ 10-14 LPA

Site Reliability Engineer Lead at Zycus Infotech Pvt Ltd

Bangalore / Bengaluru

8-12 Yrs

₹ 10-14 LPA

Site Reliability Engineer at Institutional Shareholder Services Inc.

Mumbai

10-20 Yrs

₹ 12-12 LPA

Site Reliability Engineer at Trimble

Chennai

6-10 Yrs

₹ 8-12 LPA