Capgemini
I was interviewed in Jun 2024.
Use dropDuplicates() function in pyspark to remove duplicates in a data frame.
Use dropDuplicates() function on the data frame to remove duplicates based on all columns.
Pass a subset of columns to dropDuplicates() to remove duplicates based only on those columns.
Use the distinct() function to remove duplicates and keep only distinct rows.
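A minimal PySpark sketch of both approaches (the column names 'name' and 'city' are illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup_example").getOrCreate()
df = spark.createDataFrame(
    [("Asha", "Pune"), ("Asha", "Pune"), ("Ravi", "Delhi")],
    ["name", "city"],
)

df.dropDuplicates().show()            # drop rows duplicated across all columns
df.dropDuplicates(["name"]).show()    # drop duplicates based on the 'name' column only
df.distinct().show()                  # same effect as dropDuplicates() with no subset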
Broadcast join is a type of join operation in distributed computing where one smaller dataset is broadcasted to all nodes for efficient processing.
Reduces data shuffling by sending smaller dataset to all nodes
Useful when one dataset is significantly smaller than the other
Improves performance by reducing network traffic and processing time
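A hedged sketch of a broadcast join, where the customers table is assumed to be small enough to fit in executor memory:
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast_join_example").getOrCreate()

# Illustrative data: a larger fact table and a small lookup table
orders = spark.createDataFrame([(1, 101), (2, 102), (3, 101)], ["order_id", "cust_id"])
customers = spark.createDataFrame([(101, "Asha"), (102, "Ravi")], ["cust_id", "name"])

# broadcast() hints Spark to ship the small table to every executor,
# so the large table is joined locally without being shuffled.
result = orders.join(broadcast(customers), on="cust_id", how="inner")
result.show()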
Re-Partition and Coalesce are methods used to control the number of partitions in a dataset in Apache Spark.
Re-Partition is used to increase or decrease the number of partitions in a dataset by shuffling the data across the cluster.
Coalesce is used to decrease the number of partitions in a dataset without shuffling the data, which can improve performance.
Re-Partition is typically used when there is a need to increase parallelism by adding partitions, while Coalesce is preferred when reducing the partition count, for example before writing output.
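A small sketch contrasting the two (the partition counts 200 and 10 are illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition_example").getOrCreate()
df = spark.range(0, 1_000_000)

print(df.rdd.getNumPartitions())          # current partition count

wide = df.repartition(200)                # full shuffle; can increase or decrease partitions
narrow = wide.coalesce(10)                # no shuffle; can only reduce partitions

print(wide.rdd.getNumPartitions(), narrow.rdd.getNumPartitions())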
Extract Pincode from Address Field in Dataframe using Pyspark
Use pyspark.sql.functions regexp_extract() function to extract pincode from address field
Create a new column in the dataframe to store the extracted pincode
Specify the regular expression pattern for pincode extraction
Example: df.withColumn('pincode', regexp_extract(df['address'], r'\b\d{6}\b', 0))
SQL query to retrieve student names with marks > 45 in each subject
Use GROUP BY and HAVING clauses to filter students with marks > 45 in each subject
Join Student table with Marks table on student_id to get marks for each student
Select student names from Student table based on the conditions
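A hedged sketch via spark.sql, assuming tables Student(student_id, name) and Marks(student_id, subject, marks) are registered as views and an active SparkSession named spark; requiring MIN(marks) > 45 enforces the condition for every subject:
query = """
    SELECT s.name
    FROM Student s
    JOIN Marks m ON s.student_id = m.student_id
    GROUP BY s.student_id, s.name
    HAVING MIN(m.marks) > 45
"""
spark.sql(query).show()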
Enable Hive support in Spark for seamless integration of Hive tables and queries.
Set 'spark.sql.catalogImplementation' to 'hive' in SparkConf
Include 'spark-hive' dependency in the Spark application
Ensure Hive configuration files are available in the classpath
Use HiveContext or enable Hive support in SparkSession
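A minimal sketch of a Hive-enabled SparkSession (assumes the spark-hive dependency and hive-site.xml are available on the classpath):
from pyspark.sql import SparkSession

# enableHiveSupport() sets spark.sql.catalogImplementation=hive under the hood
spark = (
    SparkSession.builder
    .appName("hive_enabled_app")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()   # lists databases from the Hive metastore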
Joins in Spark using PySpark are used to combine data from two different DataFrames based on a common key.
Joins are performed using the join() function in PySpark.
Common types of joins include inner join, outer join, left join, and right join.
Example: df1.join(df2, df1.key == df2.key, 'inner')
Use broadcast join or partition join in pyspark to join two large tables efficiently.
Use broadcast join for smaller table and partition join for larger table.
Broadcast join - broadcast the smaller table to all worker nodes.
Partition join - partition both tables on the join key and join them.
Example: df1.join(broadcast(df2), 'join_key')
Example: df1.join(df2, 'join_key').repartition('join_key')
df.explain() in pyspark is used to display the physical plan of the DataFrame operations.
df.explain() is used to show the execution plan of the DataFrame operations in pyspark.
It helps in understanding how the operations are being executed and optimized by Spark.
The output of df.explain() includes details like the logical and physical plans, optimizations applied, and stages of execution.
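A quick illustration of the different output modes (mode='formatted' needs Spark 3.0+):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("explain_example").getOrCreate()
df = spark.range(0, 1000).withColumn("bucket", F.col("id") % 10)
agg = df.groupBy("bucket").count()

agg.explain()                  # physical plan only
agg.explain(extended=True)     # parsed, analyzed, optimized logical plans plus physical plan
agg.explain(mode="formatted")  # formatted output, Spark 3.0 and later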
Spark architecture is a distributed computing framework that consists of a driver program, cluster manager, and worker nodes.
Spark driver program coordinates the execution of tasks and maintains the overall state of the application.
Cluster manager allocates resources for the application and monitors its execution.
Worker nodes execute the tasks assigned by the driver program and store data in memory or disk.
Lineage graph is used to track the flow of data from source to destination, helping in understanding data dependencies and impact analysis.
Helps in understanding data dependencies and relationships
Tracks the flow of data from source to destination
Aids in impact analysis and troubleshooting
Useful for data governance and compliance
Can be visualized to easily comprehend complex data pipelines
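Within Spark specifically, the lineage of an RDD can be inspected with toDebugString(); a small illustrative sketch:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage_example").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100))
transformed = rdd.map(lambda x: x * 2).filter(lambda x: x % 3 == 0)

# toDebugString() prints the chain of parent RDDs - the lineage graph
# Spark uses to recompute lost partitions.
print(transformed.toDebugString().decode("utf-8"))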
External tables store data outside the database while internal tables store data within the database.
External tables reference data stored outside the database, such as in HDFS or S3, while internal tables store data within the database itself.
External tables are typically used for data that is not managed by the database system, while internal tables are used for data that is managed by the database system.
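A hedged Spark SQL sketch (assumes a SparkSession named spark with Hive support; the table names and the S3 location are placeholders):
# Managed (internal) table: the warehouse owns both metadata and data,
# so DROP TABLE removes the underlying files too.
spark.sql("CREATE TABLE IF NOT EXISTS sales_internal (id INT, amount DOUBLE)")

# External table: only metadata is managed; the files at LOCATION survive a DROP TABLE.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_external (id INT, amount DOUBLE)
    LOCATION 's3a://my-bucket/sales/'
""")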
Join the DFs, count events per user, and identify users with 0 events
Use a left join to combine DF1 and DF2 on UserID so users with no events are retained
Group by UserID and count the number of events per user
Filter for users whose event count is 0
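A sketch of the approach, assuming DF1 holds users (UserID), DF2 holds events (UserID, EventID), and the goal is to find users with zero events:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("zero_event_users").getOrCreate()

df1 = spark.createDataFrame([(1,), (2,), (3,)], ["UserID"])                  # users
df2 = spark.createDataFrame([(1, "login"), (1, "click"), (2, "login")],
                            ["UserID", "EventID"])                           # events

counts = (
    df1.join(df2, on="UserID", how="left")         # left join keeps users with no events
       .groupBy("UserID")
       .agg(F.count("EventID").alias("event_count"))
)

counts.filter(F.col("event_count") == 0).show()    # users with 0 events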
I applied via campus placement at Government College of Engineering, Salem and was interviewed in Oct 2024. There was 1 interview round.
I applied via LinkedIn and was interviewed in Aug 2024. There were 3 interview rounds.
I am a detail-oriented data engineer with a passion for problem-solving and a strong background in programming and data analysis.
Experienced in designing and implementing data pipelines
Proficient in programming languages such as Python, SQL, and Java
Skilled in data modeling and database management
Strong analytical skills and ability to work with large datasets
Excellent communication and teamwork skills
I have a Bachelor's degree in Computer Science and a Master's degree in Data Engineering.
Bachelor's degree in Computer Science
Master's degree in Data Engineering
Some DSA problems medium level
I applied via Naukri.com and was interviewed in Sep 2024. There was 1 interview round.
To find the second largest element in an array
Sort the array in descending order
Return the element at index 1
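A small Python sketch; sorting the distinct values guards against duplicates of the maximum masking the true second largest:
def second_largest(arr):
    # Deduplicate so repeated maximum values don't hide the second-largest value
    unique_vals = sorted(set(arr), reverse=True)
    if len(unique_vals) < 2:
        raise ValueError("need at least two distinct values")
    return unique_vals[1]

print(second_largest([5, 1, 9, 9, 7]))   # 7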
I applied via Company Website and was interviewed in Jul 2024. There were 4 interview rounds.
Numerical ability, reasoning, maths
Python, data structures, C, C++, Java
Python code to calculate factorial of a number
Use a recursive function to calculate the factorial
Base case: if n is 0 or 1, return 1
Recursive case: return n * factorial(n-1)
Example: def factorial(n): return 1 if n == 0 or n == 1 else n * factorial(n-1)
Combining my electrical engineering background with IT skills allows me to work on cutting-edge technologies and solve complex problems.
Interest in technology and data analysis sparked during electrical engineering studies
Realized the potential of combining electrical engineering knowledge with IT for innovative solutions
Opportunities in data engineering field align with my career goals
One optimization technique in PySpark is using partitioning to distribute data evenly across nodes.
Use partitioning to distribute data evenly across nodes
Avoid shuffling data unnecessarily
Cache intermediate results to avoid recomputation
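An illustrative sketch combining these points (the partition count and column name are assumptions):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("optimization_example").getOrCreate()
df = spark.range(0, 1_000_000).withColumnRenamed("id", "customer_id")

# Repartition on the column used in later joins/aggregations so work is spread evenly
df = df.repartition(64, "customer_id")

# Cache an intermediate result that is reused, instead of recomputing it
filtered = df.filter("customer_id % 2 = 0").cache()
print(filtered.count())   # first action materializes the cache
print(filtered.count())   # served from cache, no recomputation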
Data skewness can be handled by partitioning data, using sampling techniques, optimizing queries, and using parallel processing.
Partitioning data based on key values to distribute workload evenly
Using sampling techniques to estimate skewed data distribution
Optimizing queries by using appropriate indexes and query optimization techniques
Using parallel processing to distribute workload across multiple nodes
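One concrete technique for skewed joins is key salting; a hedged sketch with toy data (the salt factor of 10 is an assumption to tune against the observed skew):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salting_example").getOrCreate()
SALT_BUCKETS = 10

# Toy data: key "A" is heavily skewed on the large side
skewed_df = spark.createDataFrame([("A", i) for i in range(1000)] + [("B", 1)], ["key", "value"])
small_df = spark.createDataFrame([("A", "alpha"), ("B", "beta")], ["key", "label"])

# Add a random salt to the skewed side so the hot key is split across partitions
skewed_salted = skewed_df.withColumn(
    "salted_key",
    F.concat_ws("_", F.col("key"), (F.rand() * SALT_BUCKETS).cast("int").cast("string")),
)

# Replicate the small side once per salt value so every salted key finds a match
salts = spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
small_salted = small_df.crossJoin(salts).withColumn(
    "salted_key", F.concat_ws("_", F.col("key"), F.col("salt").cast("string"))
)

joined = skewed_salted.join(small_salted, on="salted_key", how="inner")
print(joined.count())   # 1001 rows, now spread across partitions instead of one hot key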
Query to find the 2nd highest salary in a database table.
Use the ORDER BY clause to sort salaries in descending order.
Use LIMIT with an OFFSET (or a window function such as DENSE_RANK) to retrieve the second-highest value.
Consider handling cases where there may be ties for the highest salary.
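Two common forms of the query, sketched here through spark.sql; the table and column names employees/salary are assumptions, and OFFSET support varies by engine (Spark 3.4+):
# Assumes a table or view named employees with a salary column
spark.sql("""
    SELECT DISTINCT salary
    FROM employees
    ORDER BY salary DESC
    LIMIT 1 OFFSET 1
""").show()

# Window-function alternative that handles ties for the top salary cleanly
spark.sql("""
    SELECT salary
    FROM (SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk FROM employees) t
    WHERE rnk = 2
""").show()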
Code to count frequency of elements in a list of strings.
Use a dictionary to store the frequency of each element in the list.
Iterate through the list and update the count in the dictionary.
Return the dictionary with element frequencies.
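A short Python sketch; collections.Counter would do the same, shown here with an explicit dictionary:
def count_frequency(items):
    # Map each string to the number of times it appears in the list
    freq = {}
    for item in items:
        freq[item] = freq.get(item, 0) + 1
    return freq

print(count_frequency(["a", "b", "a", "c", "b", "a"]))   # {'a': 3, 'b': 2, 'c': 1}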
I applied via Naukri.com and was interviewed in May 2024. There were 3 interview rounds.
Components of Data factory pipeline include datasets, activities, linked services, triggers, and pipelines.
Datasets: Define the data structure and location for input and output data.
Activities: Define the actions to be performed on the data such as data movement, data transformation, or data processing.
Linked Services: Define the connections to external data sources or destinations.
Triggers: Define the conditions under which a pipeline run is started, such as a schedule or an event.
Email notifications are created with an activity that sends an email, typically through an SMTP server or an email service.
Use SMTP (Simple Mail Transfer Protocol) to send emails
Set up an email server or use a third-party email service provider
Include the recipient's email address, subject, and message content
Can be automated using tools like Python's smtplib library or email marketing platforms like Mailchimp
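A minimal smtplib sketch (the SMTP host, port, credentials, and addresses are placeholders):
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "pipeline@example.com"        # placeholder sender
msg["To"] = "oncall@example.com"            # placeholder recipient
msg["Subject"] = "Pipeline run completed"
msg.set_content("The daily load finished successfully.")

# Placeholder SMTP server; real setups usually require TLS and authentication
with smtplib.SMTP("smtp.example.com", 587) as server:
    server.starttls()
    server.login("pipeline@example.com", "app-password")   # placeholder credentials
    server.send_message(msg)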
English test covering reading, writing, listening, and speaking skills
I was interviewed in Mar 2024.
Set spark configuration with appropriate memory and cores for efficient processing of 2 GB data
Increase executor memory and cores to handle larger data size
Adjust spark memory overhead to prevent out of memory errors
Optimize shuffle partitions for better performance
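An illustrative configuration; the exact values depend on the cluster, and these are assumptions sized for roughly 2 GB of input:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("two_gb_job")
    .config("spark.executor.memory", "4g")            # headroom above the 2 GB input
    .config("spark.executor.cores", "4")
    .config("spark.executor.memoryOverhead", "1g")    # guards against off-heap OOMs
    .config("spark.sql.shuffle.partitions", "64")     # fewer than the default 200 for small data
    .getOrCreate()
)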
Use dbutils.notebook.run() command to run a child notebook in a parent notebook
Use dbutils.notebook.run() command with the path to the child notebook and any parameters needed
Ensure that the child notebook is accessible and has necessary permissions
Handle any return values or errors from the child notebook appropriately
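A Databricks-specific sketch (dbutils is only available inside a Databricks notebook; the path and parameters are placeholders):
# Runs the child notebook with a 600-second timeout and passes parameters
# that the child reads via dbutils.widgets.get(...)
result = dbutils.notebook.run(
    "/Shared/child_notebook",             # placeholder path to the child notebook
    600,                                   # timeout in seconds
    {"run_date": "2024-06-01", "env": "dev"},
)

print(result)   # whatever the child returned via dbutils.notebook.exit(...)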
The duration of the Capgemini Data Engineer interview process can vary, but it typically takes less than 2 weeks to complete (based on 40 interviews).
Consultant: 55.2k salaries (₹5.2 L/yr - ₹17.5 L/yr)
Associate Consultant: 50.8k salaries (₹3 L/yr - ₹10 L/yr)
Senior Consultant: 46.1k salaries (₹7.5 L/yr - ₹24.5 L/yr)
Senior Analyst: 20.6k salaries (₹2 L/yr - ₹7.5 L/yr)
Senior Software Engineer: 20.2k salaries (₹3.5 L/yr - ₹12.1 L/yr)