
Capgemini

3.8 rating, based on 38.8k reviews

30+ Interview Questions and Answers

Updated 18 Oct 2024

Q1. How will you Join if two tables are large in pyspark?

Ans.

For two large tables, repartition both DataFrames on the join key so Spark can perform a sort-merge join; a broadcast join only helps when one side is small enough to fit in executor memory.

  • Use a broadcast join when one table is small; use a partition (sort-merge) join when both tables are large (both patterns are sketched below).

  • Broadcast join - broadcast the smaller table to all worker nodes.

  • Partition join - partition both tables on the join key and join them.

  • Example: df1.join(broadcast(df2), 'join_key')

  • Example: df1.join(df2, 'join_key').repartition('join_key')
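
A minimal sketch of both patterns, assuming two DataFrames read from hypothetical tables and joined on a column named 'join_key' (all names are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("join-example").getOrCreate()

    df1 = spark.table("large_table_a")   # hypothetical source tables
    df2 = spark.table("large_table_b")

    # If one side is small enough to fit in executor memory, broadcast it
    small_side_join = df1.join(broadcast(df2), "join_key")

    # If both sides are large, repartition on the join key so Spark can
    # perform a sort-merge join on co-located partitions
    large_join = (
        df1.repartition("join_key")
           .join(df2.repartition("join_key"), "join_key")
    )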


Q2. Write a SQL to get Student names who got marks>45 in each subject from Student table

Ans.

SQL query to retrieve student names with marks > 45 in each subject

  • Use GROUP BY on the student and HAVING MIN(marks) > 45, so the condition holds for every subject the student has a mark in

  • Join Student table with Marks table on student_id to get marks for each student

  • Select student names from Student table based on the conditions
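
A possible query under the answer's assumption of a Student table (student_id, name) and a Marks table (student_id, subject, marks); the names are illustrative, and it is run here through spark.sql although the SQL itself is engine-agnostic:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()

    # A student qualifies only if the lowest mark across all subjects exceeds 45
    result = spark.sql("""
        SELECT s.name
        FROM Student s
        JOIN Marks m ON s.student_id = m.student_id
        GROUP BY s.student_id, s.name
        HAVING MIN(m.marks) > 45
    """)
    result.show()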


Q3. How to remove Duplicates in Data frame using pyspark?

Ans.

Use dropDuplicates() function in pyspark to remove duplicates in a data frame.

  • Use dropDuplicates() function on the data frame to remove duplicates based on all columns.

  • Pass a list of column names, e.g. dropDuplicates(['col1', 'col2']), to remove duplicates based on those columns only.

  • Use the distinct() function to remove duplicates and keep only distinct rows.
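
A brief illustration on a throwaway DataFrame (the column names are made up):

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1, "a"), (1, "a"), (2, "b")],
        ["id", "value"],
    )

    df.dropDuplicates().show()         # deduplicate across all columns
    df.dropDuplicates(["id"]).show()   # deduplicate on a subset of columns
    df.distinct().show()               # same result as dropDuplicates() on all columns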


Q4. What will be spark configuration to process 2 gb of data

Ans.

Set spark configuration with appropriate memory and cores for efficient processing of 2 GB data

  • Increase executor memory and cores to handle larger data size

  • Adjust spark memory overhead to prevent out of memory errors

  • Optimize shuffle partitions for better performance
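
2 GB is small by Spark standards, so defaults are often adequate; the values below are purely an illustrative starting point, not a recommendation for any specific cluster:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("process-2gb")
        .config("spark.executor.memory", "2g")             # per-executor heap
        .config("spark.executor.cores", "2")               # cores per executor
        .config("spark.executor.memoryOverhead", "512m")   # off-heap headroom
        .config("spark.sql.shuffle.partitions", "16")      # ~128 MB per partition for 2 GB
        .getOrCreate()
    )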


Q5. How you will run a child notebook into a parent notebook using dbutils command

Ans.

Use dbutils.notebook.run() command to run a child notebook in a parent notebook

  • Use dbutils.notebook.run() command with the path to the child notebook and any parameters needed

  • Ensure that the child notebook is accessible and has necessary permissions

  • Handle any return values or errors from the child notebook appropriately
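
A minimal sketch from inside a Databricks parent notebook, where dbutils is available implicitly; the child-notebook path and the parameter are placeholders:

    # Run the child notebook with a 60-second timeout and one named parameter,
    # then capture whatever the child returns via dbutils.notebook.exit()
    result = dbutils.notebook.run(
        "/Workspace/Shared/child_notebook",   # hypothetical path
        60,                                   # timeout in seconds
        {"run_date": "2024-10-18"},           # arguments passed to the child
    )
    print(result)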


Q6. Assume below Dataframes DF1 (UserID,Name) DF2 (UserID,PageID,Timestamp,Events) Write code to Join the DF's, Count the No of Events and filter Users with 0 Events

Ans.

Join DF's, count events, filter users with 0 events

  • Left-join DF1 and DF2 on UserID so that users with no activity are kept

  • Group by UserID and count the number of event rows per user

  • Keep only the users whose event count is 0
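
One way to write it in PySpark, using tiny made-up rows in place of the real data; count() ignores nulls, so users with no matching rows in DF2 come out with a count of 0:

    from pyspark.sql import SparkSession, functions as F
    spark = SparkSession.builder.getOrCreate()

    # Sample data matching the schemas in the question
    DF1 = spark.createDataFrame([(1, "alice"), (2, "bob")], ["UserID", "Name"])
    DF2 = spark.createDataFrame(
        [(1, "p1", "2024-10-18 10:00:00", "click")],
        ["UserID", "PageID", "Timestamp", "Events"],
    )

    joined = DF1.join(DF2, "UserID", "left")

    event_counts = (
        joined.groupBy("UserID", "Name")
              .agg(F.count("Events").alias("event_count"))
    )

    users_with_no_events = event_counts.filter(F.col("event_count") == 0)
    users_with_no_events.show()   # only bob, who has no events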


Q7. What is Re-Partition and Coalesce? How are these used?

Ans.

Re-Partition and Coalesce are methods used to control the number of partitions in a dataset in Apache Spark.

  • Re-Partition is used to increase or decrease the number of partitions in a dataset by shuffling the data across the cluster.

  • Coalesce is used to decrease the number of partitions in a dataset without shuffling the data, which can improve performance.

  • Re-Partition is typically used to increase parallelism or rebalance skewed data, while Coalesce is preferred when simply reducing the number of partitions, e.g. before writing output files (see the sketch below).
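
A quick illustration of the difference (the partition counts are arbitrary):

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()

    df = spark.range(1_000_000)
    print(df.rdd.getNumPartitions())        # whatever the default parallelism gives

    wider = df.repartition(200)             # full shuffle; can increase or decrease
    narrower = df.coalesce(10)              # merges existing partitions; decrease only

    print(wider.rdd.getNumPartitions())     # 200
    print(narrower.rdd.getNumPartitions())  # 10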


Q8. What is sql , explain normalizing ?

Ans.

SQL is a programming language used to manage and manipulate relational databases. Normalizing is the process of organizing data in a database to minimize redundancy.

  • SQL stands for Structured Query Language

  • It is used to create, modify, and query relational databases

  • Normalization is the process of breaking down a database into smaller, more manageable tables to reduce redundancy and improve data integrity

  • There are different levels of normalization, such as first normal form (1NF), second normal form (2NF), and third normal form (3NF)


Q9. Write Python code to Extract Pincode from Address Field in Dataframe using Pyspark?

Ans.

Extract Pincode from Address Field in Dataframe using Pyspark

  • Use the regexp_extract() function from pyspark.sql.functions to extract the pincode from the address field

  • Create a new column in the dataframe to store the extracted pincode

  • Specify the regular expression pattern for pincode extraction

  • Example: df.withColumn('pincode', regexp_extract(df['address'], r'\b\d{6}\b', 0)) — note the raw string, so \b and \d are not treated as Python escape sequences (a fuller sketch follows)
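
A small end-to-end sketch, assuming an 'address' column that contains a six-digit Indian pincode (the sample rows are made up):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_extract

    spark = SparkSession.builder.getOrCreate()

    addresses = spark.createDataFrame(
        [("12, MG Road, Bengaluru 560001",), ("Plot 7, Baner, Pune 411045",)],
        ["address"],
    )

    # Capture the first standalone run of exactly six digits
    with_pincode = addresses.withColumn(
        "pincode", regexp_extract("address", r"\b\d{6}\b", 0)
    )
    with_pincode.show(truncate=False)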


Q10. What is the activity use for creating email notification?

Ans.

Email notifications are typically implemented by an activity or task that sends an email over SMTP or through an email service.

  • Use SMTP (Simple Mail Transfer Protocol) to send emails

  • Set up an email server or use a third-party email service provider

  • Include the recipient's email address, subject, and message content

  • Can be automated using tools like Python's smtplib library or email marketing platforms like Mailchimp
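
A minimal smtplib sketch along the lines of the last point; the SMTP host, port, and addresses are placeholders, and real servers usually require authentication:

    import smtplib
    from email.message import EmailMessage

    msg = EmailMessage()
    msg["Subject"] = "Pipeline completed"
    msg["From"] = "alerts@example.com"       # placeholder sender
    msg["To"] = "team@example.com"           # placeholder recipient
    msg.set_content("The nightly load finished successfully.")

    with smtplib.SMTP("smtp.example.com", 587) as server:   # placeholder host
        server.starttls()
        server.send_message(msg)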


Q11. What is Broadcast join ? How is this useful

Ans.

Broadcast join is a type of join operation in distributed computing where one smaller dataset is broadcasted to all nodes for efficient processing.

  • Reduces data shuffling by sending smaller dataset to all nodes

  • Useful when one dataset is significantly smaller than the other

  • Improves performance by reducing network traffic and processing time


Q12. What is a database and how it is used ?

Ans.

A database is a collection of data that is organized in a way that allows for efficient retrieval and manipulation of data.

  • A database is used to store and manage data.

  • It allows for easy retrieval and manipulation of data.

  • Examples of databases include MySQL, Oracle, and MongoDB.

  • Databases are used in various industries such as finance, healthcare, and e-commerce.


Q13. what are the Components of Data factory pipeline ?

Ans.

Components of Data factory pipeline include datasets, activities, linked services, triggers, and pipelines.

  • Datasets: Define the data structure and location for input and output data.

  • Activities: Define the actions to be performed on the data such as data movement, data transformation, or data processing.

  • Linked Services: Define the connections to external data sources or destinations.

  • Triggers: Define the conditions that determine when a pipeline should be executed.

  • Pipelines: Define the logical grouping of activities that together perform a unit of work


Q14. How to Enable Hive support in spark?

Ans.

Enable Hive support in Spark for seamless integration of Hive tables and queries.

  • Set 'spark.sql.catalogImplementation' to 'hive' in SparkConf

  • Include 'spark-hive' dependency in the Spark application

  • Ensure Hive configuration files are available in the classpath

  • Use HiveContext or enable Hive support in SparkSession
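
A minimal sketch of building a Hive-enabled session (the application name is arbitrary):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-example")
        .config("spark.sql.catalogImplementation", "hive")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Hive-managed databases and tables are now visible to Spark SQL
    spark.sql("SHOW DATABASES").show()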


Q15. what Purpose of Lineage graph ?

Ans.

Lineage graph is used to track the flow of data from source to destination, helping in understanding data dependencies and impact analysis.

  • Helps in understanding data dependencies and relationships

  • Tracks the flow of data from source to destination

  • Aids in impact analysis and troubleshooting

  • Useful for data governance and compliance

  • Can be visualized to easily comprehend complex data pipelines


Q16. Write a code to count frequency of elements in a list.

Ans.

Code to count frequency of elements in a list of strings.

  • Use a dictionary to store the frequency of each element in the list.

  • Iterate through the list and update the count in the dictionary.

  • Return the dictionary with element frequencies.
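
Two equivalent ways in plain Python (the sample list is arbitrary):

    from collections import Counter

    items = ["a", "b", "a", "c", "b", "a"]

    # Manual dictionary approach
    freq = {}
    for item in items:
        freq[item] = freq.get(item, 0) + 1
    print(freq)             # {'a': 3, 'b': 2, 'c': 1}

    # Standard-library shortcut
    print(Counter(items))   # Counter({'a': 3, 'b': 2, 'c': 1})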


Q17. what is df.explain() in pyspark

Ans.

df.explain() in pyspark is used to display the physical plan of the DataFrame operations.

  • df.explain() is used to show the execution plan of the DataFrame operations in pyspark.

  • It helps in understanding how the operations are being executed and optimized by Spark.

  • By default df.explain() prints only the physical plan; df.explain(extended=True) also shows the parsed, analyzed, and optimized logical plans
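
A quick illustration (the exact output depends on the Spark version and the query):

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()

    df = spark.range(100).filter("id > 10").groupBy().count()

    df.explain()               # physical plan only
    df.explain(extended=True)  # parsed, analyzed, optimized, and physical plans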


Q18. What is Lazy evaluation in spark

Ans.

Lazy evaluation is a strategy used by Spark to delay the execution of transformations until an action is called.

  • Lazy evaluation improves performance by optimizing the execution plan

  • Transformations in Spark are not executed immediately, but rather recorded as a lineage graph

  • Actions trigger the execution of the transformations and produce a result

  • Lazy evaluation allows Spark to optimize the execution plan by combining and reordering transformations

  • Example: a line such as data = spark.read.csv('path') only records the read; nothing runs until an action like data.count() is called (see the sketch below)
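
A small PySpark illustration of the same idea:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()

    df = spark.range(1_000_000)

    # Transformations: nothing executes yet, Spark only builds up the plan
    filtered = df.filter("id % 2 == 0")
    doubled = filtered.selectExpr("id * 2 AS doubled")

    # Action: this is the point where the whole plan actually runs
    print(doubled.count())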


Q19. Why IT after electrical engineering

Ans.

Combining my electrical engineering background with IT skills allows me to work on cutting-edge technologies and solve complex problems.

  • Interest in technology and data analysis sparked during electrical engineering studies

  • Realized the potential of combining electrical engineering knowledge with IT for innovative solutions

  • Opportunities in data engineering field align with my career goals


Q20. Explain Joins in spark using pyspark

Ans.

Joins in Spark using PySpark are used to combine data from two different DataFrames based on a common key.

  • Joins are performed using the join() function in PySpark.

  • Common types of joins include inner join, outer join, left join, and right join.

  • Example: df1.join(df2, df1.key == df2.key, 'inner')
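
A short sketch of a few join types on made-up DataFrames:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()

    left = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
    right = spark.createDataFrame([(1, 100), (3, 300)], ["id", "amount"])

    left.join(right, "id", "inner").show()   # only id 1
    left.join(right, "id", "left").show()    # ids 1 and 2, amount is null for 2
    left.join(right, "id", "outer").show()   # ids 1, 2, and 3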


Q21. Compare Databricks and azure synapse notebook.

Ans.

Databricks and Azure Synapse Notebook are both cloud-based platforms for data engineering and analytics.

  • Databricks is primarily focused on big data processing and machine learning, while Azure Synapse Notebook is part of a larger analytics platform.

  • Databricks provides a collaborative environment for data scientists and engineers to work together, while Azure Synapse Notebook is integrated with other Azure services for seamless data integration and analysis.

  • Databricks offers b...read more


Q22. what is BQ what are advantages

Ans.

BQ stands for BigQuery, a fully managed, serverless, and highly scalable cloud data warehouse provided by Google Cloud.

  • Advantages of BigQuery include fast query performance due to its distributed architecture

  • Scalability to handle large datasets without the need for infrastructure management

  • Integration with other Google Cloud services like Dataflow, Dataproc, and Data Studio

  • Support for standard SQL queries and real-time data streaming

  • Cost-effectiveness with a pay-as-you-go pricing model


Q23. How to handle data skewness ?

Ans.

Data skewness can be handled by partitioning data, using sampling techniques, optimizing queries, and using parallel processing.

  • Partitioning data based on key values to distribute workload evenly

  • Using sampling techniques to estimate skewed data distribution

  • Optimizing queries by using appropriate indexes and query optimization techniques

  • Using parallel processing to distribute workload across multiple nodes
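
One Spark-side remedy not listed above is key salting, which spreads a hot join key across several partitions; a rough sketch, assuming a large skewed DataFrame and a smaller dimension DataFrame joined on 'join_key' (all names and the salt factor are illustrative):

    from pyspark.sql import SparkSession, functions as F
    spark = SparkSession.builder.getOrCreate()

    SALT_BUCKETS = 8  # pick based on the observed skew

    # Hypothetical inputs; in practice these come from your own tables
    skewed_df = spark.table("big_fact")
    small_df = spark.table("small_dim")

    # Add a random salt to the large, skewed side
    skewed_salted = skewed_df.withColumn(
        "salt", (F.rand() * SALT_BUCKETS).cast("int")
    )

    # Replicate the small side once per salt value so every salt finds a match
    salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
    small_salted = small_df.crossJoin(salts)

    joined = skewed_salted.join(small_salted, ["join_key", "salt"])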


Q24. Write a query to find the 2nd highest salary.

Ans.

Query to find the 2nd highest salary in a database table.

  • Use the ORDER BY clause to sort salaries in descending order.

  • Use the LIMIT clause to retrieve the second row.

  • Consider handling cases where there may be ties for the highest salary.
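
Two common formulations, run here through spark.sql against a hypothetical employees(salary) table; the window version handles ties more directly:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()

    # Subquery form: the largest salary strictly below the overall maximum
    spark.sql("""
        SELECT MAX(salary) AS second_highest
        FROM employees
        WHERE salary < (SELECT MAX(salary) FROM employees)
    """).show()

    # Window form: rank salaries and keep rank 2 (DENSE_RANK skips no ranks on ties)
    spark.sql("""
        SELECT DISTINCT salary
        FROM (
            SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
            FROM employees
        ) ranked
        WHERE rnk = 2
    """).show()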


Q25. Difference between Left join and inner join

Ans.

Left join returns all records from the left table and the matching records from the right table.

  • Inner join returns only the matching records from both tables.

  • Left join includes all records from the left table, even if there are no matches in the right table.

  • Inner join excludes the non-matching records from both tables.

  • Left join is used to retrieve all records from one table and the matching records from another table.

  • Inner join is used to retrieve only the records that have matching values in both tables


Q26. External Table vs Internal table

Ans.

External tables store data outside the database while internal tables store data within the database.

  • External tables reference data stored outside the database, such as in HDFS or S3, while internal tables store data within the database itself.

  • External tables are typically used for data that is not managed by the database system, while internal tables are used for data that is managed by the database system.

  • External tables are often used for data that needs to be shared across multiple systems or tools


Q27. Tell about joins in C

Ans.

Joins in C are used to combine records from two or more tables based on a related column between them.

  • Joins in C are typically implemented using nested loops or hash tables.

  • Common types of joins include inner join, outer join, left join, and right join.

  • Example: Performing an inner join between two tables to retrieve only the matching records.


Q28. Explain spark architecture

Ans.

Spark architecture is a distributed computing framework that consists of a driver program, cluster manager, and worker nodes.

  • Spark driver program coordinates the execution of tasks and maintains the overall state of the application.

  • Cluster manager allocates resources for the application and monitors its execution.

  • Worker nodes execute the tasks assigned by the driver program and store data in memory or disk.

  • Spark architecture supports various data processing workloads like batch processing, streaming, machine learning, and interactive SQL queries


Q29. triggers in azure

Ans.

Triggers in Azure are used to initiate an action based on a specific event or condition.

  • Triggers can be used in Azure Functions to execute code in response to events such as changes in data or messages in a queue.

  • Azure Logic Apps also use triggers to start workflows based on events such as receiving an email or a new file being added to a storage account.

  • Triggers can be configured to run on a schedule or based on a specific condition being met.

  • Examples of triggers include HTTP triggers, timer triggers, and blob or queue storage triggers


Q30. optimization technique in pyspark

Ans.

One optimization technique in PySpark is using partitioning to distribute data evenly across nodes.

  • Use partitioning to distribute data evenly across nodes

  • Avoid shuffling data unnecessarily

  • Cache intermediate results to avoid recomputation


Q31. Python to write a factorial

Ans.

Python code to calculate factorial of a number

  • Use a recursive function to calculate the factorial

  • Base case: if n is 0 or 1, return 1

  • Recursive case: return n * factorial(n-1)

  • Example: def factorial(n): return 1 if n == 0 or n == 1 else n * factorial(n-1)
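
The recursive one-liner from the last point, plus an iterative alternative (both assume a non-negative integer input):

    def factorial(n: int) -> int:
        return 1 if n in (0, 1) else n * factorial(n - 1)

    def factorial_iterative(n: int) -> int:
        result = 1
        for i in range(2, n + 1):
            result *= i
        return result

    print(factorial(5))             # 120
    print(factorial_iterative(5))   # 120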


Q32. azure and its details

Ans.

Azure is a cloud computing platform and service provided by Microsoft.

  • Azure offers a wide range of services including virtual machines, storage, databases, and analytics.

  • It provides scalability, reliability, and security for businesses to build, deploy, and manage applications.

  • Azure supports various programming languages and frameworks, allowing developers to use their preferred tools.

  • It offers AI and machine learning capabilities through services like Azure Machine Learning and Azure Cognitive Services


Q33. Find the second largest

Ans.

To find the second largest element in an array

  • Sort the array in descending order

  • Return the element at index 1; deduplicate first if repeated maximum values should not count as the second largest (see the sketch below)
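
A single-pass sketch that avoids sorting and treats duplicates of the maximum as one value:

    def second_largest(nums):
        """Return the second largest distinct value, or None if it does not exist."""
        largest = second = None
        for n in nums:
            if largest is None or n > largest:
                largest, second = n, largest
            elif n != largest and (second is None or n > second):
                second = n
        return second

    print(second_largest([4, 1, 7, 7, 3]))   # 4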


Q34. Explain snowflake architecture

Ans.

Snowflake architecture is a cloud-based data warehousing solution that separates storage and compute resources for scalability and performance.

  • Snowflake uses a unique architecture with three layers: storage, compute, and services.

  • Storage layer stores data in a columnar format for efficient querying.

  • Compute layer processes queries independently, allowing for elastic scalability.

  • Services layer manages metadata, security, and query optimization.

  • Snowflake's architecture enables a...read more


Q35. spark optimization technique

Ans.

Spark optimization techniques improve performance and efficiency of Spark jobs.

  • Use partitioning to distribute data evenly across nodes

  • Cache intermediate results to avoid recomputation

  • Use broadcast variables for small lookup tables

  • Optimize shuffle operations by reducing data shuffling

  • Tune memory settings for better performance


Q36. small file problem in spark

Ans.

Small file problem in Spark refers to inefficiency when processing multiple small files in Spark jobs.

  • Small files can lead to inefficient resource utilization and slow job execution in Spark.

  • One solution is to combine small files into larger files before processing.

  • Another solution is to use coalesce or repartition to reduce the number of partitions in RDDs.

  • Aim for output file sizes close to the HDFS block size (128 MB by default) so that each task reads a reasonably large chunk of data


Q37. Architecture of Spark

Ans.

Spark is a distributed computing system that provides an interface for programming clusters with implicit data parallelism.

  • Spark is built on the concept of Resilient Distributed Datasets (RDDs), which are fault-tolerant collections of objects.

  • It supports various programming languages such as Scala, Java, Python, and R.

  • Spark provides high-level APIs for distributed data processing, including transformations and actions.

  • It offers in-memory computing capabilities, enabling faster iterative and interactive processing


Q38. Explain project structure

Ans.

Project structure refers to the organization of files, folders, and resources within a project.

  • Project structure should be logical and easy to navigate

  • Common structures include separating code into modules, organizing files by type (e.g. scripts, data, documentation), and using version control

  • Example: A data engineering project may have folders for data extraction, transformation, loading, and documentation


Q39. Coding in sql joins

Ans.

SQL joins are used to combine rows from two or more tables based on a related column between them.

  • Use INNER JOIN to return rows when there is at least one match in both tables

  • Use LEFT JOIN to return all rows from the left table, and the matched rows from the right table

  • Use RIGHT JOIN to return all rows from the right table, and the matched rows from the left table

  • Use FULL JOIN to return all rows from both tables, matching them where possible and filling the gaps with NULLs

Interview Process at Capgemini

Based on 25 interviews in the last 1 year: 2 interview rounds (Technical Round 1 and Technical Round 2).