I applied via LinkedIn and was interviewed in Aug 2024. There were 2 interview rounds.
I was asked basic Python coding questions (strings).
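As an illustration only (the exact questions asked were not specified), a typical basic Python string exercise looks like this:

    # Illustrative example of a common string question; not one of the
    # actual questions from the interview.
    def is_palindrome(s: str) -> bool:
        """Return True if s reads the same forwards and backwards."""
        cleaned = "".join(ch.lower() for ch in s if ch.isalnum())
        return cleaned == cleaned[::-1]

    print(is_palindrome("Madam"))     # True
    print(is_palindrome("data eng"))  # False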
In 5 years, I see myself leading a team of data engineers, implementing cutting-edge technologies, and driving impactful data-driven decisions.
Leading a team of data engineers
Implementing cutting-edge technologies
Driving impactful data-driven decisions
Strength: Strong problem-solving skills. Weakness: Sometimes too detail-oriented.
Strength: Ability to analyze complex data sets and find efficient solutions
Weakness: Occasionally get caught up in minor details and lose sight of the bigger picture
OOPs, DSA, SQL, and networking.
To find the nth highest salary in PySpark, use a window function with row_number and filter on the desired rank.
Use window function with row_number to assign a rank to each salary
Filter the result to get the row with the desired rank
Example: df.withColumn('rank', F.row_number().over(Window.orderBy(F.col('salary').desc()))).filter(F.col('rank') == n).select('salary')
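A self-contained version of that snippet (the DataFrame contents and n are illustrative; with duplicate salaries, dense_rank may fit better than row_number):

    # Minimal runnable sketch; data, column names, and n are assumptions.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("nth-highest-salary").getOrCreate()

    df = spark.createDataFrame(
        [("a", 100), ("b", 300), ("c", 200), ("d", 250)],
        ["name", "salary"],
    )
    n = 2  # which rank to fetch

    # row_number assigns a unique rank per row, ordered by salary descending.
    w = Window.orderBy(F.col("salary").desc())
    nth = (
        df.withColumn("rank", F.row_number().over(w))
          .filter(F.col("rank") == n)
          .select("salary")
    )
    nth.show()  # 250: the 2nd highest salary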
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.
Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing.
It stores data in Parquet format and uses Apache Spark for processing.
Delta Lake ensures data reliability and data quality by providing schema enforcement and data versioning.
It supports time travel, allowing queries against earlier versions of the data.
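A minimal sketch of these properties in code, assuming a Spark build with Delta Lake available (the path and session setup are illustrative and vary by environment; on Databricks, Delta is the default format and the session already exists):

    from pyspark.sql import SparkSession

    # Documented configs for enabling Delta on a plain Spark build.
    spark = (
        SparkSession.builder
        .appName("delta-demo")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    path = "/tmp/delta-demo"  # illustrative path

    # Writes are ACID: readers see either the old or the new snapshot.
    spark.range(5).write.format("delta").mode("overwrite").save(path)

    # Time travel: read an earlier version of the table by version number.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
    v0.show()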
Tuning operations in Databricks involves optimizing performance and efficiency of data processing tasks.
Use cluster configuration settings to allocate resources efficiently
Optimize code by minimizing data shuffling and reducing unnecessary operations
Leverage Databricks Auto Optimize to automatically tune performance
Monitor job performance using Databricks Runtime Metrics and Spark UI
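As a hedged illustration of the bullets above (the table name my_table is hypothetical, and the setting values are assumptions that depend on workload and cluster size):

    from pyspark.sql import SparkSession

    # On Databricks a `spark` session already exists; building one here
    # keeps the sketch self-contained.
    spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

    # Reduce shuffle cost: fewer partitions for small data, and adaptive
    # query execution to handle skewed joins at runtime.
    spark.conf.set("spark.sql.shuffle.partitions", "64")
    spark.conf.set("spark.sql.adaptive.enabled", "true")

    # Enable Delta Auto Optimize via table properties; assumes the Delta
    # table my_table already exists.
    spark.sql("""
        ALTER TABLE my_table SET TBLPROPERTIES (
            'delta.autoOptimize.optimizeWrite' = 'true',
            'delta.autoOptimize.autoCompact'   = 'true'
        )
    """)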
I was approached by the company and was interviewed in Apr 2024. There were 3 interview rounds.
Spark RDD and DF are two data structures in Apache Spark for processing and analyzing data.
RDD (Resilient Distributed Dataset) is a distributed collection of elements that can be operated on in parallel. It is immutable and fault-tolerant.
DF (DataFrame) is a distributed collection of data organized into named columns. It provides a more structured and efficient way to work with data compared to RDDs.
RDD is low-level, while DF is a higher-level, optimized abstraction.
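A small runnable sketch contrasting the two APIs (the data and column names are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()

    # RDD: low-level and schema-less; transformations operate on raw
    # Python objects accessed by position.
    rdd = spark.sparkContext.parallelize([("alice", 30), ("bob", 15)])
    adults_rdd = rdd.filter(lambda row: row[1] >= 18)

    # DataFrame: named columns, optimized by the Catalyst query planner.
    df = spark.createDataFrame([("alice", 30), ("bob", 15)], ["name", "age"])
    adults_df = df.filter(df.age >= 18)

    print(adults_rdd.collect())
    adults_df.show()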
End-to-end data engineering architecture
Executor memory is the amount of memory allocated to each executor in a Spark application.
Executor memory is specified using the 'spark.executor.memory' configuration property.
It determines how much memory each executor can use to process tasks.
It is important to properly configure executor memory to avoid out-of-memory errors or inefficient resource utilization.
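A minimal sketch of setting this property when building a session; the 4g figure is an assumption to be sized to the cluster's worker nodes:

    from pyspark.sql import SparkSession

    # Executor memory is fixed at application launch, so it is set on the
    # builder (or via spark-submit), not on an already-running session.
    spark = (
        SparkSession.builder
        .appName("executor-memory-demo")
        .config("spark.executor.memory", "4g")        # heap per executor
        .config("spark.executor.memoryOverhead", "512m")  # off-heap overhead
        .getOrCreate()
    )
    print(spark.conf.get("spark.executor.memory"))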
RDD is a low-level abstraction in Spark representing distributed data, while DataFrames are higher-level structured APIs for working with data.
RDD is an immutable distributed collection of objects, while DataFrames are distributed collections of data organized into named columns.
RDDs are more suitable for unstructured data and low-level transformations, while DataFrames provide a more user-friendly API for structured data.
Designation | Salaries reported | Salary range
Data Engineer | 5 | ₹14 L/yr - ₹15.4 L/yr
GIS Analyst | 5 | ₹3.7 L/yr - ₹4 L/yr
Support Engineer | 4 | ₹16.5 L/yr - ₹16.5 L/yr
Front End Developer | 4 | ₹3.5 L/yr - ₹5 L/yr
Business Development Associate | 3 | ₹4.5 L/yr - ₹5 L/yr
Fractal Analytics
Mu Sigma
Tiger Analytics
LatentView Analytics