CitiusTech
Types of clusters in Databricks include Standard, High Concurrency, and Single Node clusters.
Standard clusters are used for general-purpose workloads
High Concurrency clusters are optimized for concurrent workloads
Single Node clusters are used for development and testing purposes
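For illustration, a minimal sketch of how these choices surface in a cluster definition, using the payload shape of the Databricks Clusters REST API; the runtime version and node type strings are placeholders for whatever your workspace offers.

# Hypothetical cluster definitions (field names follow the Databricks
# Clusters API; version/node-type values are workspace-specific placeholders).
standard_cluster = {
    "cluster_name": "etl-standard",
    "spark_version": "13.3.x-scala2.12",  # placeholder runtime version
    "node_type_id": "Standard_DS3_v2",    # placeholder node type
    "num_workers": 4,                     # a driver plus four workers
}

single_node_cluster = {
    "cluster_name": "dev-single-node",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 0,  # no workers: everything runs on the driver
    "spark_conf": {"spark.databricks.cluster.profile": "singleNode"},
    "custom_tags": {"ResourceClass": "SingleNode"},
}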
Catalyst optimizer is a query optimizer in Apache Spark that leverages advanced techniques to optimize and improve the performance of Spark SQL queries.
Catalyst optimizer uses a rule-based and cost-based optimization approach to generate an optimized query plan.
It performs various optimizations such as constant folding, predicate pushdown, and projection pruning to improve query performance.
Catalyst optimizer also leverages Scala's pattern matching and runtime code generation to implement its optimization rules
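You can watch Catalyst at work by printing the query plans with explain(); a minimal sketch, assuming a local Spark session and placeholder dataset path and column names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()
df = spark.read.parquet("/tmp/events.parquet")  # placeholder dataset

# Catalyst pushes the filter down to the Parquet scan (predicate pushdown)
# and reads only the selected column (projection pruning); both appear in
# the optimized logical and physical plans printed below.
df.filter(df.country == "IN").select("user_id").explain(True)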
Explode function is used in Apache Spark to split an array (or map) column into multiple rows.
Creates a new row for each element in the array
Commonly used in data processing and transformation tasks
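A minimal PySpark sketch of explode on a toy DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", [1, 2]), ("b", [3])], ["key", "values"])

# One output row per array element: (a, 1), (a, 2), (b, 3)
df.select("key", explode("values").alias("value")).show()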
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.
Delta Lake provides ACID transactions, schema enforcement, and time travel capabilities on top of data lakes.
A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.
Delta Lake is optimized for big data workloads and provides reliability and performance improvements over plain data lake storage.
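A minimal sketch of the write and time-travel flow, assuming a Spark session with Delta Lake available (built in on Databricks, elsewhere via the delta-spark package), a DataFrame df, and a placeholder path:

# Write a DataFrame as a Delta table; the transaction log provides ACID
# guarantees and schema enforcement (a mismatched append is rejected).
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Time travel: read the table as it existed at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")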
RDD stands for Resilient Distributed Dataset, a fundamental data structure in Apache Spark.
RDD is a fault-tolerant collection of elements that can be operated on in parallel.
RDDs are immutable, meaning they cannot be changed once created.
RDDs support two types of operations: transformations (creating a new RDD from an existing one) and actions (returning a value to the driver program).
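A minimal sketch of the transformation/action split, assuming a SparkContext sc (e.g. spark.sparkContext):

rdd = sc.parallelize([1, 2, 3, 4])          # distribute a local list as an RDD
squared = rdd.map(lambda x: x * x)          # transformation: lazy, returns a new RDD
total = squared.reduce(lambda a, b: a + b)  # action: runs the job, returns 30 to the driver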
posted on 25 May 2024
I applied via Approached by Company and was interviewed in Apr 2024. There was 1 interview round.
Performance optimization techniques in PySpark involve partitioning, caching, and using efficient transformations.
Partitioning data to distribute workload evenly
Caching intermediate results to avoid recomputation
Preferring narrow transformations like map and filter, which do not require a shuffle
Avoiding unnecessary shuffling of data
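A minimal sketch tying these techniques together, with df as a placeholder DataFrame and customer_id a hypothetical key:

# Repartition by the key used downstream so work is spread evenly and
# the subsequent groupBy on the same key avoids extra data movement.
df = df.repartition("customer_id")

# Cache an intermediate result that several queries will reuse.
df.cache()

counts = df.groupBy("customer_id").count()
counts.show()  # first action materializes the cache; later actions reuse it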
Role | Salaries reported | Salary range
Senior Software Engineer | 2.6k salaries | ₹5.8 L/yr - ₹20 L/yr
Technical Lead | 2.1k salaries | ₹7.4 L/yr - ₹27.4 L/yr
Software Engineer | 1.2k salaries | ₹3.3 L/yr - ₹11.2 L/yr
Technical Lead 1 | 359 salaries | ₹7 L/yr - ₹25.4 L/yr
Technical Lead 2 | 290 salaries | ₹7.8 L/yr - ₹28 L/yr
Accenture
Capgemini
TCS
Wipro