Lead/Principal - Site Reliability Engineering

11-15 years

Hyderabad / Secunderabad

1 vacancy

Tableau Software

posted 1mon ago

Job Role Insights

Flexible timing

Job Description

To get the best candidate experience, please consider applying for a maximum of 3 roles within 12 months to ensure you are not duplicating efforts.
Job Category
Software Engineering
Job Details
Role Description:
As a Lead/Principal Software Engineer in Site/Product Reliability Engineering, you will play a pivotal role in ensuring and scaling the reliability of our AgentForce platform. Working in our India operations center, this role requires shift work, including weekends, to support services aligned with US hours. You will be part of a high-impact team dedicated to maintaining the availability and performance of Salesforce's AgentForce platform, with a focus on generative and predictive AI platform production support.

About Salesforce:
We're Salesforce, the Customer Company, inspiring the future of business with AI + Data + CRM. We help companies blaze new trails and connect with customers in meaningful ways, while empowering our teams to drive positive change in the world and achieve their career goals.

Your Impact:
In this role, you will address end-to-end production challenges related to the AgentForce AI platform. You will lead the triaging of production issues for critical projects within our Generative AI platform, implement automated solutions to enhance reliability, and maintain comprehensive documentation of production incidents. Additionally, you will collaborate closely with AgentForce AI, product, and platform teams as part of a dynamic and innovative group of developers, architects, and product engineers.
Key Responsibilities:
  • Passion for triaging and solving complex problems in production systems.
  • You will establish the reliability process and collaborate closely with lead engineers.
  • Multi-System Debugging and Triage (must-have): AgentForce integrates multiple Salesforce platforms, such as Core, Service Cloud, Sales Cloud, Data Cloud, and AI Cloud, in addition to LLM providers like OpenAI, Azure OpenAI, and AWS Bedrock. Expertise in diagnosing and triaging performance and scalability issues across these diverse systems and vendors, as well as addressing scaling challenges, is essential.
  • Investigate alerts and customer-reported issues by comprehensively analyzing the end-to-end stack. This includes first-level triage to assess all systems involved in a specific use case, identifying root causes, and generating detailed reports. Escalate to the relevant engineering contacts and work to resolve the issue when necessary.
  • Salesforce Core Platform Knowledge (nice to have): Familiarity with the Salesforce Core platform and its architecture is a plus, given AgentForce's diverse configurations, user permissions, and CRM licensing setups. Strong knowledge of feature provisioning, user permissions, and CRM licensing requirements is beneficial.
  • Production Support Issue Triage: Lead and shape the production triage process for AgentForce, focusing on service, infrastructure deployment, configuration, performance, and latency issues.
  • Collaborate with cross-functional teams and external partners to ensure scalable and reliable services.
  • Maintain comprehensive documentation of production issues, workflows, and areas for improvement.
  • Infrastructure Scaling Management: Understand and support capacity modeling and forecasting to ensure adequate capacity for Agentforce services in production.
  • Ensure that the scaling of Large Language Models (LLMs) and associated services in production stays in line with projected capacity requirements based on usage patterns. Regularly review chatbot and AI model utilization and optimize capacity based on usage trends to prevent outages.
  • Automation and Operational Excellence: Create and maintain playbooks and detailed knowledge articles for future analysis and troubleshooting. Automate manual processes to maintain high availability and repeatability of production systems.
  • Monitoring and Trust Management: Use the availability and trust dashboards, and adjust SLOs and SLIs based on production feedback.
  • Identify automation gaps in production and compare them against the established critical user journey (CUJ) benchmarks for reliability and trustworthiness.
  • Cross-functional Collaboration: Establish strong partnerships with the Customer Support Groups (CSG) team to streamline escalations and minimize disruptions.
  • Be part of the 24x7 on-call support and multi-GEO coverage to maintain service reliability during peak periods.
  • Stakeholder Collaboration: Collaborate with business and engineering stakeholders for operational excellence, processes, and SLAs. Drive improvements based on key metrics, KPIs, and customer feedback.
Minimum Qualifications:
  • Bachelor's degree in Computer Science, Engineering, or a related technical field.
  • Proven expertise in implementing robust reliability processes across full-stack, end-to-end ML platforms, with in-depth understanding of Generative AI architecture and systems.
  • 8+ years of experience in production support and triaging roles with a focus on end-to-end, infrastructure, and operational reliability.
  • Experience in DevOps or data center management roles with expertise in Linux system engineering.
  • Strong knowledge of cloud services (AWS preferred), container technologies (Docker, Kubernetes), and CI/CD tools (Jenkins, GitLab).
  • Proficiency in scripting languages (Python, Shell, Golang) and knowledge of AI model deployment and scaling.
Preferred Qualifications:
  • Experience in managing large-scale AI applications and services, including monitoring and diagnostic techniques.
  • Expertise in deploying and managing LLMs and technologies like Retrieval-Augmented Generation (RAG).
  • Background in monitoring tools such as Splunk, Prometheus, Grafana, and the ELK stack.
  • Knowledge of Java profilers (e.g., Java Flight Recorder) and OpenTelemetry.
  • Knowledge of TCP/IP networking protocols and infrastructure services in IaaS environments.
  • Familiarity with MLOps tools and practices for supporting the machine learning lifecycle.
  • AWS or Salesforce certifications are a plus.
What We Offer:
  • An opportunity to lead and scale key initiatives within our AI platform.
  • A collaborative work environment focused on innovation and impact.
  • Competitive compensation and benefits package.
Learn more about Equality at www.equality.com and explore our company benefits at www.salesforcebenefits.com.
Salesforce welcomes all.

Employment Type: Full Time, Permanent

What people at Tableau Software are saying

What Tableau Software employees are saying about work life

based on 3 employees
50%
100%
100%
Flexible timing
Monday to Friday
No travel

Tableau Software Benefits

Free Transport
Child care
Gymnasium
Cafeteria
Work From Home
Free Food +6 more

Compare Tableau Software with

  • Qlik: 3.4
  • MicroStrategy: 2.0
  • Domo: 3.4
  • Bosch Global Software Technologies: 4.0
  • Amdocs: 3.8
  • Automatic Data Processing (ADP): 4.0
  • 24/7 Customer: 3.5
  • Google: 4.4
  • Thomson Reuters: 4.1
  • Oracle Cerner: 3.7
  • VMware Software: 4.4
  • Adobe: 4.0
  • R Systems International: 3.4
  • OpenText Technologies: 3.7
  • Chetu: 3.2
  • Dassault Systemes: 4.0
  • Onward Technologies Inc: 3.2
  • Salesforce: 4.1
  • Temenos: 3.3
  • Globant: 3.9

Similar Jobs for you

Principal Site Reliability Engineer at Zycus Infotech Pvt Ltd

Mumbai

8-12 Yrs

₹ 10-14 LPA

Principal Site Reliability Engineer at Zycus Infotech Pvt Ltd

Bangalore / Bengaluru

8-12 Yrs

₹ 10-14 LPA

Site Reliability Engineer at Zycus Infotech Pvt Ltd

Mumbai

8-12 Yrs

₹ 10-14 LPA

Site Reliability Engineer at Zycus Infotech Pvt Ltd

Bangalore / Bengaluru

8-12 Yrs

₹ 10-14 LPA

Site Reliability Engineer at Reuters News Agency

Mumbai, Hyderabad / Secunderabad

8-12 Yrs

₹ 10-14 LPA

Principal Site Reliability Engineer at WebEx Communications India (P) Ltd.

Bangalore / Bengaluru

6-10 Yrs

₹ 8-12 LPA

Site Reliability Engineer Lead at Zycus Infotech Pvt Ltd

Mumbai

8-12 Yrs

₹ 10-14 LPA

Site Reliability Engineer Lead at Zycus Infotech Pvt Ltd

Bangalore / Bengaluru

8-12 Yrs

₹ 10-14 LPA

Site Reliability Engineer at Institutional Shareholder Services Inc.

Mumbai

10-20 Yrs

₹ 12-12 LPA

Site Reliability Engineer at Trimble

Chennai

6-10 Yrs

₹ 8-12 LPA

Lead /Principal - Site Reliability Engineering

11-15 Yrs

Hyderabad / Secunderabad

1mon ago·via naukri.com

Manager, Technical Consulting (Salesforce Technical Architect)

6-11 Yrs

Mumbai, Hyderabad / Secunderabad, Pune +3 more

1d ago·via naukri.com

Territory Account Executive -BFSI

1-4 Yrs

Mumbai

1d ago·via naukri.com

Senior Solution Engineer

3-7 Yrs

Gurgaon / Gurugram

1d ago·via naukri.com

Lead, Account SE - Retail & Consumer Goods Industry

5-8 Yrs

Gurgaon / Gurugram

1d ago·via naukri.com

Enterprise Account Executive - BFSI

2-5 Yrs

Mumbai

1d ago·via naukri.com

Full-Stack Software Engineer / MTS - Bangalore

2-5 Yrs

Bangalore / Bengaluru

1d ago·via naukri.com

Information Security Associate - CIR-1

1-4 Yrs

Hyderabad / Secunderabad

1d ago·via naukri.com

Manager, Proactive Monitoring Engineering-1

4-8 Yrs

Hyderabad / Secunderabad

1d ago·via naukri.com

Principal Solution Engineer

14-17 Yrs

Mumbai

1d ago·via naukri.com