NTT Data
I applied via Naukri.com and was interviewed in Jul 2023. There were 2 interview rounds.
Spark internals and optimization techniques
Spark uses Directed Acyclic Graph (DAG) for optimizing workflows
Lazy evaluation helps in optimizing transformations by combining them into a single stage
Caching and persistence of intermediate results can improve performance
Partitioning data can help in parallel processing and reducing shuffle operations (see the sketch after this list)
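A minimal PySpark sketch of these ideas, assuming a local SparkSession; the DataFrame and column names are illustrative only:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("dag-demo").getOrCreate()

df = spark.range(1_000_000)                      # no job runs yet (lazy)
doubled = df.withColumn("x2", F.col("id") * 2)   # transformation, still lazy
even = doubled.filter(F.col("id") % 2 == 0)      # chained into the same DAG stage

even.cache()                                     # persist the intermediate result in memory
print(even.count())                              # first action: the DAG is optimized and executed
print(even.agg(F.sum("x2")).first()[0])          # reuses the cached partitions

repartitioned = even.repartition(8, "id")        # explicit partitioning for parallelism
print(repartitioned.rdd.getNumPartitions())      # 8

spark.stop()
```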
I was interviewed in Nov 2024.
Use the 'hdfs diskbalancer' tool to check and even out disk utilisation within a DataNode in Hadoop
Run 'hdfs diskbalancer -report' to see which DataNodes would benefit from balancing
Use 'hdfs diskbalancer -plan <datanode>' to generate a balancing plan for a given DataNode
Check the DataNode logs (and 'hdfs dfsadmin -report') for overall disk health and usage
Spark Architecture consists of Driver, Cluster Manager, and Executors. Driver manages the execution of Spark jobs.
Driver: Manages the execution of Spark jobs, converts user code into tasks, and coordinates with Cluster Manager.
Cluster Manager: Manages resources across the cluster and allocates resources to Spark applications.
Executors: Execute tasks assigned by the Driver and store data in memory or on disk for further processing (see the configuration sketch below)
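A hedged configuration sketch of how the three roles show up in code; it assumes a YARN cluster is available as the cluster manager, and the instance counts are made up:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("architecture-demo")
    # The process running this script becomes the Driver.
    .master("yarn")                           # YARN acts as the Cluster Manager (assumption)
    .config("spark.executor.instances", "4")  # Cluster Manager allocates 4 Executors
    .config("spark.executor.memory", "4g")    # memory each Executor gets for tasks and cached data
    .config("spark.executor.cores", "2")      # cores per Executor; tasks run in these slots
    .getOrCreate()
)

# The Driver turns this job into tasks and ships them to the Executors.
print(spark.range(100).count())
spark.stop()
```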
Optimization techniques in Spark improve performance and efficiency of data processing.
Partitioning data to distribute workload evenly
Caching frequently accessed data in memory
Using broadcast variables for small lookup tables (sketched after this list)
Avoiding shuffling operations whenever possible
Tuning memory settings and garbage collection parameters
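A short sketch of the broadcast-variable point, using a tiny made-up lookup table on a local SparkSession:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("broadcast-demo").getOrCreate()

facts = spark.createDataFrame([(1, 100.0), (2, 50.0), (1, 75.0)], ["dept_id", "amount"])
depts = spark.createDataFrame([(1, "Sales"), (2, "HR")], ["dept_id", "dept_name"])

# Broadcasting the small table ships it to every executor,
# so the large table is joined without a shuffle.
joined = facts.join(F.broadcast(depts), "dept_id")
joined.show()
spark.stop()
```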
I am unable to provide this information as it is confidential.
Confidential information about salaries in previous organizations should not be disclosed.
It is important to respect the privacy and confidentiality of past employers.
Discussing specific salary details may not be appropriate in a professional setting.
To create a pivot table in SQL from a non-pivot table, you can use the CASE statement with aggregate functions.
Use the CASE statement to categorize data into columns
Apply aggregate functions like SUM, COUNT, AVG, etc. to calculate values for each category
Group the data by the columns that should remain as rows, so each pivoted column is aggregated per group (see the sketch below)
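A runnable sketch of the CASE-based pivot using Python's built-in sqlite3 module; the sales table and its columns are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (region TEXT, quarter TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('North', 'Q1', 100), ('North', 'Q2', 150),
        ('South', 'Q1', 80),  ('South', 'Q2', 120);
""")

# Each CASE expression becomes one pivoted column; SUM aggregates per region.
rows = con.execute("""
    SELECT region,
           SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS q1_total,
           SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS q2_total
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()

print(rows)  # [('North', 100, 150), ('South', 80, 120)]
```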
Creating triggers in a database involves defining the trigger, specifying the event that will activate it, and writing the code to be executed.
Define the trigger using the CREATE TRIGGER statement
Specify the event that will activate the trigger (e.g. INSERT, UPDATE, DELETE)
Write the code or actions to be executed when the trigger is activated
Test the trigger to ensure it functions as intended (see the sketch below)
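A minimal trigger sketch covering those four steps, again with sqlite3; the audit-table schema is hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, salary INTEGER);
    CREATE TABLE salary_audit (emp_id INTEGER, old_salary INTEGER, new_salary INTEGER);

    -- 1. define the trigger, 2. bind it to an event (UPDATE), 3. give it a body
    CREATE TRIGGER log_salary_change
    AFTER UPDATE OF salary ON employees
    BEGIN
        INSERT INTO salary_audit VALUES (OLD.id, OLD.salary, NEW.salary);
    END;
""")

# 4. test it
con.execute("INSERT INTO employees VALUES (1, 50000)")
con.execute("UPDATE employees SET salary = 60000 WHERE id = 1")
print(con.execute("SELECT * FROM salary_audit").fetchall())  # [(1, 50000, 60000)]
```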
I applied via Naukri.com and was interviewed in Aug 2024. There were 2 interview rounds.
Different types of joins in SQL with examples
Inner Join: Returns rows when there is a match in both tables
Left Join: Returns all rows from the left table and the matched rows from the right table
Right Join: Returns all rows from the right table and the matched rows from the left table
Full Outer Join: Returns all rows from both tables, with NULLs where there is no match (all four joins are shown in the sketch below)
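A quick PySpark illustration of the four join types on two tiny assumed DataFrames:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("joins-demo").getOrCreate()

left = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
right = spark.createDataFrame([(2, "HR"), (3, "Sales")], ["id", "dept"])

left.join(right, "id", "inner").show()  # only id 2 matches
left.join(right, "id", "left").show()   # ids 1 and 2; dept is null for 1
left.join(right, "id", "right").show()  # ids 2 and 3; name is null for 3
left.join(right, "id", "full").show()   # ids 1, 2 and 3
spark.stop()
```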
Large Spark datasets can be handled by partitioning, caching, optimizing transformations, and tuning resources.
Partitioning data to distribute workload evenly across nodes
Caching frequently accessed data to avoid recomputation
Optimizing transformations to reduce unnecessary processing
Tuning resources like memory allocation and parallelism for optimal performance
Spark configuration settings can be tuned to optimize query performance by adjusting parameters like memory allocation, parallelism, and caching.
Increase executor memory and cores to allow for more parallel processing
Adjust shuffle partitions to optimize data shuffling during joins and aggregations
Enable dynamic allocation to scale resources based on workload demands
Utilize caching to store intermediate results and avoid recomputation (see the sketch below)
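A sketch of those knobs set through the SparkSession builder; the values are illustrative, not recommendations:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("tuning-demo")
    .config("spark.executor.memory", "8g")              # more memory per executor
    .config("spark.executor.cores", "4")                # more parallel task slots
    .config("spark.sql.shuffle.partitions", "400")      # partitions used by joins/aggregations
    .config("spark.dynamicAllocation.enabled", "true")  # scale executors with workload
    # on a real cluster dynamic allocation also needs a shuffle service
    # or shuffle tracking; it is ignored in local mode
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)

df = spark.range(10_000).withColumnRenamed("id", "key")
df.cache()  # keep intermediate results in memory for reuse
df.groupBy((df.key % 10).alias("bucket")).count().show()
spark.stop()
```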
To handle data skew and partition imbalance in Spark, strategies include using salting, bucketing, repartitioning, and optimizing join operations.
Use salting to evenly distribute skewed keys across partitions
Implement bucketing to pre-partition data based on a specific column
Repartition data based on a specific key to balance partitions
Optimize join operations by broadcasting small tables or using partitioning strategies (the salting approach is sketched below)
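A salting sketch on made-up data: the hot key is split across N sub-keys, and the small side is replicated once per salt value so the join still matches (N = 8 is an arbitrary choice here):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("salt-demo").getOrCreate()
N = 8  # salt factor

# Skewed fact table: almost all rows share the key 'hot'
facts = spark.createDataFrame([("hot", i) for i in range(1000)] + [("cold", 0)], ["key", "val"])
dims = spark.createDataFrame([("hot", "H"), ("cold", "C")], ["key", "attr"])

# Add a random salt to the skewed side ...
salted_facts = facts.withColumn("salt", (F.rand() * N).cast("int"))
# ... and replicate the small side once per salt value
salts = spark.range(N).withColumnRenamed("id", "salt")
salted_dims = dims.crossJoin(salts)

# Joining on (key, salt) spreads the hot key across up to N partitions
joined = salted_facts.join(salted_dims, ["key", "salt"]).drop("salt")
print(joined.count())  # 1001, same result as an unsalted join
spark.stop()
```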
I applied via Company Website and was interviewed in Sep 2024. There was 1 interview round.
Spark optimization techniques involve partitioning, caching, and tuning resource allocation.
Partitioning data to distribute workload evenly
Caching frequently accessed data to avoid recomputation
Tuning resource allocation for optimal performance
I applied via Referral and was interviewed in Dec 2024. There were 2 interview rounds.
30 questions in 20 minutes
More questions about coding in SQL and PySpark
Coalesce is used to reduce the number of partitions in a DataFrame, while repartition is used to increase the number of partitions.
Coalesce is a narrow transformation that can only decrease the number of partitions without shuffling data.
Repartition is a wide transformation that can both increase or decrease the number of partitions and involves shuffling data across the cluster.
Coalesce is more efficient for reducing the number of partitions because it avoids a full shuffle (demonstrated below).
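A quick demonstration of that partition-count behaviour on a local SparkSession:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("coalesce-demo").getOrCreate()

df = spark.range(1000).repartition(8)            # wide: full shuffle into 8 partitions
print(df.rdd.getNumPartitions())                 # 8

print(df.coalesce(2).rdd.getNumPartitions())     # 2, partitions merged without a shuffle
print(df.coalesce(16).rdd.getNumPartitions())    # still 8: coalesce cannot increase the count
print(df.repartition(16).rdd.getNumPartitions()) # 16, at the cost of a shuffle
spark.stop()
```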
Questions on RANK vs DENSE_RANK and CTEs
Python data structures
I applied via Naukri.com and was interviewed in Jun 2022. There were 4 interview rounds.
Internal tables are managed by Hive, while external tables are managed by the user.
Internal tables are stored in a Hive-managed warehouse directory, while external tables can be stored anywhere.
Internal tables are deleted when the table is dropped, while external tables are not.
External tables can be used to access data stored in non-Hive formats, such as CSV or JSON.
Internal tables are typically used for temporary or intermediate data that Hive should fully manage (see the DDL sketch below)
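A sketch of the managed-versus-external distinction using Hive-style DDL through Spark SQL; it assumes a Spark build with Hive support, and the /tmp/ext_events location is made up:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.master("local[*]")
    .appName("hive-tables-demo")
    .enableHiveSupport()  # assumes Spark was built with Hive support
    .getOrCreate()
)

# Managed (internal): Hive owns the data; DROP TABLE deletes the files too.
spark.sql("CREATE TABLE IF NOT EXISTS managed_events (id INT, name STRING)")

# External: only metadata is registered; DROP TABLE leaves the files in place.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS ext_events (id INT, name STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/tmp/ext_events'
""")

spark.sql("DROP TABLE managed_events")  # data directory is removed
spark.sql("DROP TABLE ext_events")      # data under /tmp/ext_events survives
spark.stop()
```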
I applied via Referral and was interviewed in Mar 2022. There was 1 interview round.
Spark optimization techniques improve performance and efficiency of Spark applications.
Partitioning data to reduce shuffling
Caching frequently used data
Using broadcast variables for small data
Using efficient data formats like Parquet (sketched after this list)
Tuning memory and CPU usage
Using appropriate cluster size
Avoiding unnecessary data shuffling
Using appropriate serialization formats
Using appropriate join strategies
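A small sketch of the columnar-format point, writing and reading partitioned Parquet at an illustrative /tmp path:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("parquet-demo").getOrCreate()

df = spark.range(100_000).withColumn("category", (F.col("id") % 5).cast("string"))
df.write.mode("overwrite").partitionBy("category").parquet("/tmp/events_parquet")

# Readers can prune both columns and partitions, cutting I/O dramatically.
back = spark.read.parquet("/tmp/events_parquet").where(F.col("category") == "3")
print(back.count())  # 20000
spark.stop()
```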
We use Hadoop Distributed File System (HDFS) for our project.
HDFS is a distributed file system designed to run on commodity hardware.
It provides high-throughput access to application data and is fault-tolerant.
HDFS is used by many big data processing frameworks like Hadoop, Spark, etc.
It stores data in a distributed manner across multiple nodes in a cluster.
HDFS is optimized for large files and sequential reads and writes.
I applied via Recruitment Consultant and was interviewed in Jul 2021. There were 3 interview rounds.
NTT Data salaries:

| Role | Salaries reported | Salary range |
| --- | --- | --- |
| Software Engineer | 935 | ₹2.8 L/yr - ₹11 L/yr |
| Senior Associate | 761 | ₹1.2 L/yr - ₹7.3 L/yr |
| Network Engineer | 654 | ₹1.8 L/yr - ₹10 L/yr |
| Software Developer | 615 | ₹2.5 L/yr - ₹13 L/yr |
| Senior Software Engineer | 510 | ₹6.5 L/yr - ₹25.5 L/yr |
Companies similar to NTT Data: Tata Communications, Bharti Airtel, Reliance Communications, Vodafone Idea