I have worked on technologies such as Apache Spark, Hadoop, SQL, Python, and AWS.
Apache Spark
Hadoop
SQL
Python
AWS
OOPs, DSA, SQL, and computer networks.
To find the nth highest salary in PySpark, use a window function with row_number and filter on the desired rank.
Use window function with row_number to assign a rank to each salary
Filter the result to get the row with the desired rank
Example: df.withColumn('rank', F.row_number().over(Window.orderBy(F.col('salary').desc()))).filter(F.col('rank') == n).select('salary')
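A minimal runnable sketch of this approach follows; the DataFrame contents and the salary column are illustrative assumptions, not from the original answer.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("nth-highest-salary").getOrCreate()

# Illustrative data; any DataFrame with a numeric salary column works.
employees = spark.createDataFrame(
    [("a", 3000), ("b", 5000), ("c", 4000), ("d", 5000)],
    ["name", "salary"],
)

n = 2  # the rank we want
w = Window.orderBy(F.col("salary").desc())

# row_number assigns a unique rank per row; with duplicate salaries,
# dense_rank would treat ties as the same rank instead.
nth = (
    employees
    .withColumn("rank", F.row_number().over(w))
    .filter(F.col("rank") == n)
    .select("salary")
)
nth.show()
```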
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.
Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing.
It stores data in Parquet format and uses Apache Spark for processing.
Delta Lake ensures data reliability and data quality by providing schema enforcement and data versioning.
It supports time travel, allowing queries against earlier versions of the data.
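As a hedged illustration of these properties, here is a minimal sketch of writing a Delta table and reading back an earlier version via time travel. It assumes the delta-spark package is installed and on the classpath; the path and data are illustrative.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.range(5)
# Each overwrite creates a new table version in the Delta transaction log.
df.write.format("delta").mode("overwrite").save("/tmp/delta-table")            # version 0
df.withColumn("id", df.id * 10) \
  .write.format("delta").mode("overwrite").save("/tmp/delta-table")            # version 1

# Time travel: read an earlier version of the table by version number.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table")
v0.show()
```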
Tuning operations in Databricks involves optimizing performance and efficiency of data processing tasks.
Use cluster configuration settings to allocate resources efficiently
Optimize code by minimizing data shuffling and reducing unnecessary operations
Leverage Databricks Auto Optimize to automatically tune performance
Monitor job performance using Databricks Runtime Metrics and Spark UI
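A sketch of what some of these knobs look like in practice; the values and table name are illustrative assumptions, and the Auto Optimize properties assume Delta tables on a Databricks runtime.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

# Right-size shuffles and let adaptive query execution coalesce partitions.
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Minimize shuffling: broadcast the small side of a join instead of
# shuffling both sides across the cluster.
big = spark.range(1_000_000).withColumn("key", F.col("id") % 100)
small = spark.range(100).withColumnRenamed("id", "key")
joined = big.join(F.broadcast(small), "key")
joined.explain()  # the plan should show a BroadcastHashJoin

# Auto Optimize is enabled per Delta table via table properties
# (the table name here is hypothetical):
# ALTER TABLE my_table SET TBLPROPERTIES (
#   'delta.autoOptimize.optimizeWrite' = 'true',
#   'delta.autoOptimize.autoCompact'   = 'true')
```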
I was approached by the company and interviewed in Apr 2024. There were 3 interview rounds.
Spark RDD and DF are two data structures in Apache Spark for processing and analyzing data.
RDD (Resilient Distributed Dataset) is a distributed collection of elements that can be operated on in parallel. It is immutable and fault-tolerant.
DF (DataFrame) is a distributed collection of data organized into named columns. It provides a more structured and efficient way to work with data compared to RDDs.
RDD is low-level and offers fine-grained control over transformations, while DF operations are planned through Spark's Catalyst optimizer.
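A small side-by-side sketch of the two APIs; the data and column names are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()
data = [("alice", 30), ("bob", 25)]

# RDD: a distributed collection of raw Python objects, transformed functionally.
rdd = spark.sparkContext.parallelize(data)
ages = rdd.map(lambda row: row[1]).collect()

# DataFrame: the same data with named columns and an optimized, SQL-like API.
df = spark.createDataFrame(data, ["name", "age"])
df.select("age").show()
```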
End-to-end data engineering architecture
Executor memory is the amount of memory allocated to each executor in a Spark application.
Executor memory is specified using the 'spark.executor.memory' configuration property.
It determines how much memory each executor can use to process tasks.
It is important to properly configure executor memory to avoid out-of-memory errors or inefficient resource utilization.
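For illustration, a minimal sketch of setting executor memory when building a SparkSession; the sizes are illustrative assumptions, and these values can only be set at application launch, not changed at runtime.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("executor-memory-demo")
    .config("spark.executor.memory", "4g")            # JVM heap per executor
    .config("spark.executor.memoryOverhead", "512m")  # off-heap overhead per executor
    .getOrCreate()
)
```

The same settings can be passed on the command line, e.g. `spark-submit --conf spark.executor.memory=4g app.py`.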
RDD is a low-level abstraction in Spark representing distributed data, while DataFrames are higher-level structured APIs for working with data.
RDD is an immutable distributed collection of objects, while DataFrames are distributed collections of data organized into named columns.
RDDs are more suitable for unstructured data and low-level transformations, while DataFrames provide a more user-friendly API for structured data.
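A brief sketch of moving between the two representations; the data and column names are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-df-interop").getOrCreate()

rdd = spark.sparkContext.parallelize([("alice", 30), ("bob", 25)])
df = rdd.toDF(["name", "age"])      # RDD -> DataFrame, attaching column names
rows = df.rdd                       # DataFrame -> RDD of Row objects
df.filter(df.age > 26).explain()    # DataFrame queries run through the Catalyst optimizer
```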
| Role | Salaries reported | Salary range |
|---|---|---|
| Software Engineer | 16 | ₹7 L/yr - ₹10 L/yr |
| Software Developer | 5 | ₹7.5 L/yr - ₹12 L/yr |
| Senior Software Engineer | 4 | ₹13.8 L/yr - ₹27 L/yr |
| Software Engineer Level 1 | 3 | ₹8 L/yr - ₹8 L/yr |
| Machine Learning Engineer | 3 | ₹6.8 L/yr - ₹16.2 L/yr |
Fractal Analytics
Mu Sigma
Tiger Analytics
LatentView Analytics