HDFS is a distributed file system designed to store large data sets reliably and fault-tolerantly.
HDFS stands for Hadoop Distributed File System
It is the primary storage system used by Hadoop applications
It is designed to store large files and data sets across multiple machines
It provides high throughput access to application data
It is fault-tolerant and can handle node failures
It uses a master/slave architecture with a single NameNode (the master, which manages file system metadata) and multiple DataNodes (the slaves, which store the actual data blocks).
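For context, here is a minimal PySpark sketch of an application using HDFS as its storage layer; the NameNode host and file paths are hypothetical placeholders, and it assumes Spark is running against a Hadoop cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-demo").getOrCreate()

# Read a file stored on HDFS (hypothetical NameNode host and path).
df = spark.read.csv("hdfs://namenode-host:8020/data/sales.csv",
                    header=True, inferSchema=True)

# Write results back to HDFS; splitting the file into blocks and replicating
# them across DataNodes is handled by HDFS itself, not by the application.
df.write.mode("overwrite").parquet("hdfs://namenode-host:8020/output/sales_parquet")
```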
The default replication factor of Hadoop 2.x is 3.
Replication factor determines the number of copies of data blocks that are stored across the Hadoop cluster.
The default replication factor in Hadoop 2.x is 3, which means that each data block is replicated three times.
The replication factor can be configured in the Hadoop configuration files.
The replication factor affects the fault tolerance and performance of the Hadoop cluster.
The default block size of Hadoop is 128 MB.
Hadoop uses HDFS (Hadoop Distributed File System) to store data in a distributed manner.
The default block size of HDFS is 128 MB.
This block size can be changed by modifying the dfs.blocksize property in the Hadoop configuration files.
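To illustrate both settings above (replication factor and block size), a hedged PySpark sketch: any `spark.hadoop.*` property is passed through to the underlying Hadoop configuration, so these values only apply to files written by this application; cluster-wide defaults are still set by the administrator in hdfs-site.xml.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hdfs-config-demo")
    # Override the replication factor for files written by this app (default is 3).
    .config("spark.hadoop.dfs.replication", "2")
    # Override the HDFS block size (default 128 MB); the value is in bytes.
    .config("spark.hadoop.dfs.blocksize", str(256 * 1024 * 1024))
    .getOrCreate()
)
```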
OOPs, DSA, SQL, networking.
To find the nth highest salary in pyspark, use the window function with row_number and filter on the desired rank.
Use window function with row_number to assign a rank to each salary
Filter the result to get the row with the desired rank
Example: df.withColumn('rank', F.row_number().over(Window.orderBy(F.col('salary').desc()))).filter(F.col('rank') == n).select('salary')
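A self-contained version of that snippet, sketched with toy data; it swaps `row_number` for `dense_rank` so that duplicate salaries share a rank, which is usually what "nth highest salary" means when ties exist.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("nth-salary").getOrCreate()

# Toy data; the column names are assumptions for illustration.
df = spark.createDataFrame(
    [("a", 5000), ("b", 7000), ("c", 7000), ("d", 3000)],
    ["emp", "salary"],
)

n = 2  # the desired rank

w = Window.orderBy(F.col("salary").desc())
nth_salary = (
    df.withColumn("rank", F.dense_rank().over(w))
      .filter(F.col("rank") == n)
      .select("salary")
      .distinct()
)
nth_salary.show()  # 5000 for n = 2 with this toy data
```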
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.
Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing.
It stores data in Parquet format and uses Apache Spark for processing.
Delta Lake ensures data reliability and data quality by providing schema enforcement and data versioning.
It supports time travel, allowing you to query or roll back to earlier versions of a table.
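A minimal sketch of these properties, assuming the open-source delta-spark package is installed (on Databricks the Delta configuration below is already in place); the table path is a hypothetical local location.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/events_delta"  # hypothetical table location

# Version 0: initial write (Parquet data files plus a transaction log).
spark.range(0, 5).write.format("delta").mode("overwrite").save(path)

# Version 1: overwrite in a single ACID transaction.
spark.range(100, 105).write.format("delta").mode("overwrite").save(path)

# Time travel: read the table as it was at version 0.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```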
Tuning operations in Databricks involves optimizing performance and efficiency of data processing tasks.
Use cluster configuration settings to allocate resources efficiently
Optimize code by minimizing data shuffling and reducing unnecessary operations
Leverage Databricks Auto Optimize to automatically tune performance
Monitor job performance using Databricks Runtime Metrics and Spark UI
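As a sketch of the code-level points (cluster sizing and Auto Optimize live in the workspace UI or table properties), here is a hedged PySpark example; the tables and sizes are made up for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

# Adaptive Query Execution lets Spark choose shuffle partition counts and
# join strategies at runtime (enabled by default on recent runtimes).
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Hypothetical large fact table and small dimension table.
facts = spark.range(0, 1_000_000).withColumn("key", F.col("id") % 100)
dims = (spark.range(0, 100)
             .withColumnRenamed("id", "key")
             .withColumn("label", F.col("key").cast("string")))

# Broadcasting the small side avoids shuffling the large table across the cluster.
joined = facts.join(F.broadcast(dims), "key")

# Cache only when a result is reused several times; otherwise it wastes memory.
joined.cache()
joined.count()
```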
I applied after being approached by the company and was interviewed in Apr 2024. There were 3 interview rounds.
Spark RDD and DF are two data structures in Apache Spark for processing and analyzing data.
RDD (Resilient Distributed Dataset) is a distributed collection of elements that can be operated on in parallel. It is immutable and fault-tolerant.
DF (DataFrame) is a distributed collection of data organized into named columns. It provides a more structured and efficient way to work with data compared to RDDs.
RDD is low-level and gives fine-grained control over transformations, while DataFrames are optimized automatically by Spark's Catalyst optimizer.
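A small word-count sketch showing the same job written both ways; the input lines are made up for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()
sc = spark.sparkContext

lines = ["spark makes big data simple", "spark is fast"]

# RDD: low-level functional transformations over arbitrary Python objects.
rdd_counts = (
    sc.parallelize(lines)
      .flatMap(lambda line: line.split())
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)
)
print(rdd_counts.collect())

# DataFrame: named columns and SQL-style expressions, optimized by Catalyst.
df_counts = (
    spark.createDataFrame([(l,) for l in lines], ["line"])
         .select(F.explode(F.split(F.col("line"), " ")).alias("word"))
         .groupBy("word")
         .count()
)
df_counts.show()
```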
End to end data engineering architecture
Executor memory is the amount of memory allocated to each executor in a Spark application.
Executor memory is specified using the 'spark.executor.memory' configuration property.
It determines how much memory each executor can use to process tasks.
It is important to properly configure executor memory to avoid out-of-memory errors or inefficient resource utilization.
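A hedged sketch of setting it when the session is created; note that executor memory must be fixed before the application launches (for example here, via spark-submit --executor-memory, or in the cluster configuration) and cannot be changed with spark.conf.set() once executors are running.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("executor-memory-demo")
    # JVM heap available to each executor.
    .config("spark.executor.memory", "4g")
    # Extra off-heap memory per executor for overheads such as Python workers.
    .config("spark.executor.memoryOverhead", "512m")
    .getOrCreate()
)
```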
RDD is a low-level abstraction in Spark representing distributed data, while DataFrames are higher-level structured APIs for working with data.
RDD is an immutable distributed collection of objects, while DataFrames are distributed collection of data organized into named columns.
RDDs are more suitable for unstructured data and low-level transformations, while DataFrames provide a more user-friendly API for structured data.
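To complement the word-count comparison above, a short sketch of moving between the two abstractions; the rows are made up for illustration.

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("rdd-df-conversion").getOrCreate()
sc = spark.sparkContext

# Start from an RDD of Row objects (low-level, schema known only to the programmer).
rdd = sc.parallelize([Row(name="asha", dept="data"), Row(name="ravi", dept="ml")])

# Promote it to a DataFrame to get named columns and Catalyst optimization.
df = spark.createDataFrame(rdd)
df.printSchema()

# Drop back to the underlying RDD when fine-grained control is needed.
print(df.rdd.map(lambda r: r.name.upper()).collect())
```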
| Designation | Salaries reported | Salary range |
| --- | --- | --- |
| Software Engineer | 4 | ₹3 L/yr - ₹7 L/yr |
| Senior Software Developer | 3 | ₹5 L/yr - ₹10.8 L/yr |
| Senior Software Engineer | 3 | ₹5.4 L/yr - ₹7.6 L/yr |