Was asked simple questions on PySpark and SQL concepts
I applied via Campus Placement and was interviewed before Jan 2023. There was 1 interview round.
I applied via Naukri.com and was interviewed in Oct 2024. There were 2 interview rounds.
Optimizing SQL queries involves using indexes, avoiding unnecessary joins, and optimizing the query structure.
Use indexes on columns frequently used in WHERE clauses
Avoid using SELECT * and only retrieve necessary columns
Optimize joins by using INNER JOIN instead of OUTER JOIN when possible
Use EXPLAIN to analyze query performance and make necessary adjustments
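The effect of an index on a WHERE clause can be checked with a query plan. The sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's built-in sqlite3 module; the `orders` table, its columns, and the index name are hypothetical, and other databases print their plans in a different format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

# Select only the needed columns instead of SELECT *.
query = "SELECT order_id, amount FROM orders WHERE customer_id = ?"


def query_plan(sql, params):
    """Return the plan detail strings SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description


plan_before = query_plan(query, (42,))  # no index yet: full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(query, (42,))   # the index on customer_id is now used

print(plan_before)
print(plan_after)
```

Comparing the two plans shows whether the index on the filtered column is actually picked up.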
Performance optimization in Spark involves tuning configurations, optimizing code, and utilizing caching.
Tune Spark configurations such as executor memory, number of executors, and shuffle partitions.
Optimize code by reducing unnecessary shuffles, using efficient transformations, and avoiding unnecessary data movements.
Utilize caching to store intermediate results in memory and avoid recomputation.
Example: In my projec...
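SparkContext is the entry point for Spark's core (RDD) functionality, while SparkSession, introduced in Spark 2.0, is the unified entry point for DataFrames and Spark SQL.
SparkContext is the entry point for low-level API functionality in Spark.
SparkSession is the entry point for DataFrame and Spark SQL functionality, and it wraps a SparkContext (available as spark.sparkContext).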
SparkContext is used to create RDDs (Resilient Distributed Datasets) in Spark.
SparkSession provides a unified entry point for reading data from various sources and performing
When a Spark job is submitted, several steps run behind the scenes to process it.
The job is submitted to the Spark driver program.
The driver program communicates with the cluster manager to request resources.
The cluster manager allocates resources (CPU, memory) to the job.
The driver program builds a DAG (Directed Acyclic Graph) of the job's stages and tasks.
Tasks are then scheduled and executed on worker nodes ...
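Calculate the second-highest salary using SQL and PySpark
In SQL, select the distinct salaries, ORDER BY salary in descending order, and use LIMIT 1 OFFSET 1 to skip the highest value
In PySpark, order the distinct salaries with orderBy() in descending order and read the second row from take(2), or rank them with dense_rank() over a window and keep rank 2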
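A minimal sketch of the SQL side, run through Python's built-in sqlite3 module. The `employees` table and its rows are made up for illustration; the duplicate top salary shows why DISTINCT matters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Asha", 90000), ("Ben", 120000), ("Chitra", 120000), ("Dev", 75000)],
)

# Approach 1: distinct salaries, highest first, skip one row.
row = conn.execute(
    "SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 1"
).fetchone()
second_by_offset = row[0] if row else None

# Approach 2: the largest salary that is strictly below the maximum.
second_by_max = conn.execute(
    "SELECT MAX(salary) FROM employees "
    "WHERE salary < (SELECT MAX(salary) FROM employees)"
).fetchone()[0]

print(second_by_offset, second_by_max)  # both are 90000 for this data
```

Without DISTINCT, the first approach would return 120000 again, because two employees share the top salary.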
Spark has two deploy modes, client mode and cluster mode, which differ in where the driver runs.
Client mode: the driver runs on the machine that submits the job, which suits interactive work, development, and testing.
Cluster mode: the driver runs inside the cluster, which is managed by a cluster manager such as YARN, Mesos, Kubernetes, or Spark's own standalone manager, and suits production workloads.
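Client mode gives the lowest latency for interactive work, because the driver runs on the machine that submitted the job.
In client mode the driver runs in the submitting process, so output and results come back to the user directly.
In cluster mode the driver runs on a node inside the cluster, so the user reaches it over the network, which adds a hop for interactive use.
Client mode is preferred for interactive and debugging sessions where quick feedback is crucial, provided the submitting machine is close to the worker nodes on the network.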
I applied via Recruitment Consultant and was interviewed in Nov 2024. There were 2 interview rounds.
I applied via AmbitionBox and was interviewed in Nov 2024. There were 4 interview rounds.
I utilize tools such as Excel, Python, SQL, and Tableau for data analysis.
Excel for basic data manipulation and visualization
Python for advanced data analysis and machine learning
SQL for querying databases
Tableau for creating interactive visualizations
Code analysis questions in the context of data analysis.
A written paper of logical coding questions.
posted on 17 Jul 2024
I applied via Naukri.com and was interviewed in Aug 2024. There were 2 interview rounds.
I am a Senior Data Engineer with experience in developing data pipelines and optimizing data storage for various projects.
Developed data pipelines using Apache Spark for real-time data processing
Optimized data storage using technologies like Hadoop and AWS S3
Worked on a project to analyze customer behavior and improve marketing strategies
My day-to-day job in the project involved designing and implementing data pipelines, optimizing data workflows, and collaborating with cross-functional teams.
Designing and implementing data pipelines to extract, transform, and load data from various sources
Optimizing data workflows to improve efficiency and performance
Collaborating with cross-functional teams including data scientists, analysts, and business stakeholde...
DAGs handle fault tolerance by rerunning failed tasks and maintaining task dependencies.
DAGs rerun failed tasks automatically to ensure completion.
DAGs maintain task dependencies to ensure proper sequencing.
DAGs can be configured to retry failed tasks a certain number of times before marking them as failed.
Shuffling is the process of redistributing data across partitions in a distributed computing environment.
Shuffling is necessary when data needs to be grouped or aggregated across different partitions.
It can be handled efficiently by minimizing the amount of data being shuffled and optimizing the partitioning strategy.
Techniques like partitioning, combiners, and reducers can help reduce the amount of shuffling in MapRed
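A plain-Python toy model of a shuffle, assuming a hash partitioner and a sum aggregation (the function names and sample records are hypothetical, and this is not how Spark or MapReduce is implemented internally). It shows how a map-side combine reduces the number of records that cross the network while leaving the final result unchanged.

```python
from collections import defaultdict


def shuffle(partitions, num_reducers):
    """Route each (key, value) record to the reducer chosen by hashing its key."""
    reducers = [[] for _ in range(num_reducers)]
    for partition in partitions:
        for key, value in partition:
            reducers[hash(key) % num_reducers].append((key, value))
    return reducers


def combine(partition):
    """Pre-aggregate one partition locally so fewer records need to be shuffled."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return list(totals.items())


def reduce_sums(reducers):
    """Sum the values per key across all reducer partitions."""
    result = {}
    for reducer in reducers:
        for key, value in reducer:
            result[key] = result.get(key, 0) + value
    return result


map_partitions = [
    [("IN", 1), ("US", 1), ("IN", 1), ("IN", 1)],
    [("US", 1), ("US", 1), ("IN", 1), ("UK", 1)],
]

plain = shuffle(map_partitions, 2)
combined = shuffle([combine(p) for p in map_partitions], 2)

records_plain = sum(len(r) for r in plain)        # 8 records shuffled
records_combined = sum(len(r) for r in combined)  # 5 records shuffled
print(records_plain, records_combined)
print(reduce_sums(plain) == reduce_sums(combined))
```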
Repartition increases or decreases the number of partitions in a DataFrame, while Coalesce only decreases the number of partitions.
Repartition can increase or decrease the number of partitions in a DataFrame, leading to a shuffle of data across the cluster.
Coalesce only decreases the number of partitions in a DataFrame without performing a full shuffle, making it more efficient than repartition.
Repartition is typically...
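A plain-Python toy model of the difference, not Spark's implementation: partitions are lists, `coalesce` merges whole partitions without moving individual records between groups, and `repartition` redistributes every record round-robin. The function names mirror the Spark methods only for readability, and the sample partitions are made up.

```python
def coalesce(partitions, n):
    """Merge whole partitions into at most n groups; never increases the count."""
    if n >= len(partitions):
        return [list(p) for p in partitions]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Neighbouring partitions land in the same group; records are not split up.
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions, n):
    """Redistribute every record across n new partitions (a full shuffle)."""
    out = [[] for _ in range(n)]
    records = [r for p in partitions for r in p]
    for i, record in enumerate(records):
        out[i % n].append(record)
    return out


parts = [[1, 2, 3, 4, 5, 6], [7], [8], [9, 10]]

fewer = coalesce(parts, 2)         # cheap, but sizes stay uneven
balanced = repartition(parts, 2)   # every record moves, sizes are even
more = repartition(parts, 5)       # repartition can also increase the count

print([len(p) for p in fewer])
print([len(p) for p in balanced])
print(len(coalesce(parts, 8)), len(more))
```

In this model the coalesced partitions keep the skew of the input, which is the usual trade-off for avoiding a full shuffle.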
Incremental data is handled by identifying new data since the last update and merging it with existing data.
Identify new data since last update
Merge new data with existing data
Update data warehouse or database with incremental changes
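The three steps above can be sketched with a watermark, that is, the latest `updated_at` value already loaded. This is a minimal in-memory illustration; the field names `id` and `updated_at`, the function name, and the sample rows are assumptions, and a real pipeline would typically run the merge inside the warehouse.

```python
from datetime import datetime


def incremental_merge(target, source_rows, last_watermark):
    """Upsert rows changed after last_watermark into target; return the new watermark."""
    new_watermark = last_watermark
    for row in source_rows:
        if row["updated_at"] > last_watermark:   # step 1: identify new data
            target[row["id"]] = row              # steps 2 and 3: merge and update
            new_watermark = max(new_watermark, row["updated_at"])
    return new_watermark


target = {1: {"id": 1, "name": "Asha", "updated_at": datetime(2024, 1, 1)}}
source = [
    {"id": 1, "name": "Asha K", "updated_at": datetime(2024, 1, 5)},  # changed row
    {"id": 2, "name": "Ben", "updated_at": datetime(2024, 1, 3)},     # new row
    {"id": 3, "name": "Old", "updated_at": datetime(2023, 12, 1)},    # before watermark
]

watermark = incremental_merge(target, source, datetime(2024, 1, 1))
print(sorted(target), watermark)
```

The returned watermark is stored and passed in on the next run, so each run only touches rows changed since the previous one.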
SCD stands for Slowly Changing Dimension, a concept in data warehousing to track changes in data over time.
SCD is used to maintain historical data in a data warehouse.
The three most commonly used types of SCD are Type 1, Type 2, and Type 3.
Type 1 SCD overwrites old data with new data.
Type 2 SCD creates a new record for each change, preserving history.
Type 3 SCD maintains both old and new values in the same record.
SCD is important for...
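A minimal in-memory sketch of Type 2 behaviour, assuming each history row carries a start date, an end date, and a current-row flag (the column names, the function name, and the sample customer are hypothetical). A change closes the current row and appends a new one, so earlier values remain queryable.

```python
from datetime import date


def apply_scd2(history, key, new_attrs, effective):
    """Close the current row for `key` if its attributes changed, then add a new current row."""
    current = next(
        (r for r in history if r["key"] == key and r["is_current"]), None
    )
    if current is not None:
        if current["attrs"] == new_attrs:
            return  # nothing changed, keep the current row as it is
        current["end_date"] = effective
        current["is_current"] = False
    history.append(
        {
            "key": key,
            "attrs": dict(new_attrs),
            "start_date": effective,
            "end_date": None,
            "is_current": True,
        }
    )


history = []
apply_scd2(history, 101, {"city": "Pune"}, date(2023, 1, 1))
apply_scd2(history, 101, {"city": "Mumbai"}, date(2024, 6, 1))
apply_scd2(history, 101, {"city": "Mumbai"}, date(2024, 7, 1))  # no change, no new row

for row in history:
    print(row)
```

A Type 1 update would instead overwrite `attrs` on the existing row, and a Type 3 update would keep the old value in an extra field such as a previous-city column.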
Reverse a string using SQL and Python.
In SQL, use the REVERSE() function, which is available in MySQL, SQL Server, and PostgreSQL.
In Python, use slicing with a step of -1 to reverse a string.
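A short Python illustration of the slicing approach, with `reversed()` as an equivalent alternative; the sample string is arbitrary.

```python
text = "PySpark"

# Slicing with a step of -1 walks the string from the last character to the first.
reversed_by_slice = text[::-1]

# reversed() yields the characters in reverse order; join() rebuilds the string.
reversed_by_join = "".join(reversed(text))

print(reversed_by_slice)  # krapSyP
```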
Use Spark and SQL to find the top 5 countries with the highest population.
Use Spark to load the data and perform data processing.
Use SQL queries to group by country and sum the population.
Order the results in descending order and limit to top 5.
Example: SELECT country, SUM(population) AS total_population FROM table_name GROUP BY country ORDER BY total_population DESC LIMIT 5
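The query above can be tried out with Python's built-in sqlite3 module. The `city_population` table stands in for `table_name`, and the population figures are invented for illustration, not real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city_population (country TEXT, city TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO city_population VALUES (?, ?, ?)",
    [
        ("India", "Mumbai", 20), ("India", "Delhi", 30),
        ("China", "Shanghai", 25), ("China", "Beijing", 21),
        ("USA", "New York", 8), ("USA", "Los Angeles", 4),
        ("Brazil", "Sao Paulo", 11),
        ("Japan", "Tokyo", 14),
        ("Nigeria", "Lagos", 9),
        ("UK", "London", 7),
    ],
)

# Group by country, sum the population, order descending, keep the top 5.
top5 = conn.execute(
    """
    SELECT country, SUM(population) AS total_population
    FROM city_population
    GROUP BY country
    ORDER BY total_population DESC
    LIMIT 5
    """
).fetchall()

for country, total in top5:
    print(country, total)
```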
Find the records that different joins return for two tables
Use SQL queries with the different join types: INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN
Identify the key columns in both tables to join on
Select the columns from both tables and use a WHERE clause, for example checking the other table's key for NULL after an outer join, to isolate the rows that do not match
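A small sketch of how the join type changes the result, using Python's built-in sqlite3 module with two hypothetical tables, `customers` and `orders`. Only INNER and LEFT joins are used, since older SQLite versions lack RIGHT and FULL JOIN; swapping the table order in a LEFT JOIN gives the right-hand view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chitra');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 4);
    """
)

# INNER JOIN: only rows with a match on both sides.
inner_rows = conn.execute(
    "SELECT c.id, o.id FROM customers c "
    "INNER JOIN orders o ON o.customer_id = c.id ORDER BY o.id"
).fetchall()

# LEFT JOIN: every customer, with NULL where no order matches.
left_rows = conn.execute(
    "SELECT c.id, o.id FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id ORDER BY c.id, o.id"
).fetchall()

# Unmatched rows on each side: outer join, then filter on NULL in the other table's key.
customers_without_orders = conn.execute(
    "SELECT c.name FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id WHERE o.id IS NULL ORDER BY c.id"
).fetchall()
orders_without_customers = conn.execute(
    "SELECT o.id FROM orders o "
    "LEFT JOIN customers c ON c.id = o.customer_id WHERE c.id IS NULL"
).fetchall()

print(len(inner_rows), len(left_rows))
print(customers_without_orders, orders_without_customers)
```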
The Catalyst optimizer is the query optimizer in Spark SQL; it improves performance by turning a query into an efficient execution plan.
Catalyst is an extensible query optimization framework in Apache Spark that supports both rule-based and cost-based optimization.
It applies rules to optimize the logical query plan and then selects a physical plan for execution.
The optimizer applies various optimization techniques like predicate pushdown, constant folding, and join reordering.
By o...
Used query optimization techniques to improve performance in database queries.
Utilized indexing to speed up search queries.
Implemented query caching to reduce redundant database calls.
Optimized SQL queries by restructuring joins and subqueries.
Utilized database partitioning to improve query performance.
Used query profiling tools to identify and optimize slow queries.
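Check whether the data frame has any rows: len() works for a pandas DataFrame, while a PySpark DataFrame needs isEmpty() or a one-row fetch.
For a pandas DataFrame, len(df) returns the number of rows; if the length is 0, the data frame is empty.
A PySpark DataFrame does not support len(); use df.isEmpty() (PySpark 3.3+) or check df.take(1), which avoids counting every row.
Example (pandas): if len(df) == 0: print('Data frame is empty')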
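A hedged sketch of a helper that covers both cases by duck typing: it tries len() first and falls back to fetching at most one row with take(1). The helper name `is_empty` and the `FakeSparkFrame` class are hypothetical; the class only stands in for a PySpark DataFrame so the example runs without Spark installed.

```python
def is_empty(df):
    """Return True if a DataFrame-like object has no rows."""
    try:
        return len(df) == 0            # works for pandas DataFrames and plain lists
    except TypeError:
        return len(df.take(1)) == 0    # PySpark-style: fetch at most one row


class FakeSparkFrame:
    """Hypothetical stand-in for a PySpark DataFrame: no __len__, take(n) returns a list."""

    def __init__(self, rows):
        self._rows = rows

    def take(self, n):
        return self._rows[:n]


print(is_empty([]))                              # True
print(is_empty(FakeSparkFrame([("a", 1)])))      # False
print(is_empty(FakeSparkFrame([])))              # True
```

Fetching a single row is usually cheaper than count() on a large distributed data set, because it can stop as soon as one row is found.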
Cores and worker nodes are decided based on the workload requirements and scalability needs of the data processing system.
Consider the size and complexity of the data being processed
Evaluate the processing speed and memory requirements of the tasks
Take into account the parallelism and concurrency needed for efficient data processing
Monitor the system performance and adjust cores and worker nodes as needed
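One commonly cited rule of thumb for sizing executors on a YARN cluster can be written as simple arithmetic. It is an assumption here, not something the answer above states: reserve one core and 1 GB per node for the OS, give each executor about five cores, keep one executor slot for the application master, and set aside roughly 10% of executor memory as overhead. The function name and the cluster figures are hypothetical, and the result is a starting point to refine through monitoring.

```python
def size_executors(nodes, cores_per_node, mem_gb_per_node,
                   cores_per_executor=5, overhead_fraction=0.10):
    """Estimate executor count, cores, and heap size from node resources (heuristic)."""
    usable_cores = cores_per_node - 1      # leave one core per node for the OS
    usable_mem_gb = mem_gb_per_node - 1    # leave 1 GB per node for the OS
    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node == 0:
        raise ValueError("not enough cores per node for one executor")
    total_executors = executors_per_node * nodes - 1   # one slot for the application master
    mem_per_executor = usable_mem_gb / executors_per_node
    heap_gb = int(mem_per_executor * (1 - overhead_fraction))  # rest is memory overhead
    return {
        "num_executors": total_executors,
        "executor_cores": cores_per_executor,
        "executor_memory_gb": heap_gb,
    }


plan = size_executors(nodes=10, cores_per_node=16, mem_gb_per_node=64)
print(plan)
```

For ten nodes with 16 cores and 64 GB each, this gives 29 executors with 5 cores and about 18 GB of heap apiece.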
Enforcing schema ensures that data conforms to a predefined structure and rules.
Ensures data integrity by validating incoming data against predefined schema
Helps in maintaining consistency and accuracy of data
Prevents data corruption and errors in data processing
Can lead to rejection of data that does not adhere to the schema
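A minimal plain-Python sketch of schema enforcement, assuming a schema expressed as a mapping from field name to expected type (the schema, field names, and sample records are made up). Records that do not conform are rejected together with the reasons.

```python
SCHEMA = {"id": int, "name": str, "amount": float}


def validate(record, schema):
    """Return a list of problems; an empty list means the record conforms."""
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    for field in record:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors


def split_by_schema(records, schema):
    """Separate conforming records from rejected ones, keeping the rejection reasons."""
    accepted, rejected = [], []
    for record in records:
        errors = validate(record, schema)
        if errors:
            rejected.append((record, errors))
        else:
            accepted.append(record)
    return accepted, rejected


records = [
    {"id": 1, "name": "Asha", "amount": 9.5},    # conforms
    {"id": "2", "name": "Ben", "amount": 1.0},   # id has the wrong type
    {"id": 3, "name": "Chitra"},                 # amount is missing
]

accepted, rejected = split_by_schema(records, SCHEMA)
print(len(accepted), len(rejected))
for record, errors in rejected:
    print(record, errors)
```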
I applied via Company Website and was interviewed in Dec 2024. There was 1 interview round.
I applied via Campus Placement and was interviewed in Dec 2024. There were 2 interview rounds.
Basics of mathematical ability and verbal ability
I applied via Referral and was interviewed in Nov 2024. There was 1 interview round.
I applied via Naukri.com and was interviewed in Nov 2024. There was 1 interview round.
Content Writer: 4 salaries, ₹3.6 L/yr - ₹5 L/yr
Mobile App Developer: 3 salaries, ₹5.6 L/yr - ₹6.5 L/yr
HR Executive: 3 salaries, ₹1.9 L/yr - ₹7.9 L/yr
Network Engineer: 3 salaries, ₹2.4 L/yr - ₹12 L/yr
QA Executive: 3 salaries, ₹3.6 L/yr - ₹3.6 L/yr