Capgemini
Azure Data Factory (ADF) is a cloud-based data integration service for creating data-driven workflows.
ADF allows for the creation of ETL (Extract, Transform, Load) processes.
It supports data movement between various sources like Azure Blob Storage, SQL databases, and on-premises data.
ADF provides a visual interface for designing data pipelines, making it user-friendly.
It integrates with other Azure services like A...
OOP concepts include encapsulation, inheritance, polymorphism, and abstraction, forming the foundation of object-oriented programming.
Encapsulation: Bundling data and methods that operate on the data within a single unit (class). Example: A class 'Car' with attributes like 'color' and methods like 'drive()'.
Inheritance: Mechanism where a new class derives properties and behavior from an existing class. Example: 'E...
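A minimal Python sketch of these pillars; the class names (Vehicle, Car, ElectricCar) and attributes are illustrative, not taken from the original answer.

```python
from abc import ABC, abstractmethod

class Vehicle(ABC):                          # abstraction: expose only an interface
    @abstractmethod
    def drive(self):
        ...

class Car(Vehicle):                          # encapsulation: data and methods in one unit
    def __init__(self, color):
        self._color = color

    def drive(self):
        return f"Driving a {self._color} car"

class ElectricCar(Car):                      # inheritance: reuses Car's state and behaviour
    def drive(self):                         # polymorphism: same call, different behaviour
        return f"Silently driving a {self._color} electric car"

for vehicle in [Car("red"), ElectricCar("blue")]:
    print(vehicle.drive())
```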
Pandas is a powerful Python library for data manipulation and analysis, ideal for handling structured data.
Data Cleaning: Remove duplicates and handle missing values. Example: df.drop_duplicates() or df.fillna(0).
Data Transformation: Reshape data using pivot tables. Example: df.pivot_table(values='sales', index='date', columns='product').
Data Aggregation: Group data for summary statistics. Example: df.groupby('cat...
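A small pandas sketch of the three operations above; the DataFrame and its column names (date, product, category, sales) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "product": ["A", "B", "A", "B"],
    "category": ["toys", "toys", "books", "books"],
    "sales": [100, None, 150, 200],
})

clean = df.drop_duplicates().fillna(0)                     # cleaning: dedupe, fill missing
pivot = clean.pivot_table(values="sales", index="date",    # transformation: reshape
                          columns="product", aggfunc="sum")
summary = clean.groupby("category")["sales"].sum()         # aggregation: per-group totals
print(pivot, summary, sep="\n")
```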
Optimization techniques in Apache Spark improve performance and efficiency.
Partitioning data to distribute work evenly
Caching frequently accessed data in memory
Using broadcast variables for small lookup tables
Optimizing shuffle operations to reduce data movement
Tuning memory and parallelism settings for specific workloads
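A hedged PySpark sketch touching several of the techniques listed above; the input paths, join column, and the shuffle-partition value are assumptions, not recommendations.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# Illustrative tuning of shuffle parallelism; the right value depends on the workload.
spark = (SparkSession.builder
         .appName("spark-optimization-sketch")
         .config("spark.sql.shuffle.partitions", "200")
         .getOrCreate())

events = spark.read.parquet("/data/events")         # hypothetical large input
lookup = spark.read.parquet("/data/country_codes")  # hypothetical small lookup table

events = events.repartition("country_code")         # partition to spread work evenly
events.cache()                                       # cache a frequently reused DataFrame

# Broadcast the small table so the join avoids shuffling the large side.
enriched = events.join(broadcast(lookup), "country_code")
enriched.count()
```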
To find the second largest element in an array
Sort the array in descending order
Return the element at index 1
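A minimal Python sketch of that approach; it assumes the array has at least two elements.

```python
def second_largest(arr):
    """Sort in descending order and return the element at index 1."""
    return sorted(arr, reverse=True)[1]

print(second_largest([7, 3, 9, 1]))   # 7
```

If duplicate values should not count as the second largest, sort the unique values instead: sorted(set(arr), reverse=True)[1].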
Dropping duplicates involves removing repeated entries from a dataset to ensure data integrity and accuracy.
Use the 'drop_duplicates()' method in pandas to remove duplicate rows from a DataFrame.
Example: df.drop_duplicates(subset=['column_name'], keep='first') removes duplicates based on 'column_name'.
In SQL, use 'SELECT DISTINCT' to retrieve unique records from a table.
Example: SELECT DISTINCT column_name FROM ta...
SQL joins are used to combine rows from two or more tables based on a related column between them.
Use INNER JOIN to return rows when there is at least one match in both tables
Use LEFT JOIN to return all rows from the left table, and the matched rows from the right table
Use RIGHT JOIN to return all rows from the right table, and the matched rows from the left table
Use FULL JOIN to return rows when there is a match ...
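A short runnable sketch of the two most common join types, using tiny illustrative tables registered as Spark temporary views (the table and column names are made up).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-types-sketch").getOrCreate()

# Tiny illustrative tables registered as temp views.
spark.createDataFrame([(1, "Asha"), (2, "Ravi")], ["id", "name"]) \
     .createOrReplaceTempView("customers")
spark.createDataFrame([(1, 500)], ["customer_id", "amount"]) \
     .createOrReplaceTempView("orders")

spark.sql("""
    SELECT c.name, o.amount
    FROM customers c
    INNER JOIN orders o ON c.id = o.customer_id   -- only rows matched in both tables
""").show()

spark.sql("""
    SELECT c.name, o.amount
    FROM customers c
    LEFT JOIN orders o ON c.id = o.customer_id    -- all customers, NULL where no order
""").show()
```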
Snowflake is a cloud-based data warehousing solution whose architecture separates storage and compute resources for scalability and performance.
Snowflake uses a unique architecture with three layers: storage, compute, and services.
Storage layer stores data in a columnar format for efficient querying.
Compute layer processes queries independently, allowing for elastic scalability.
Services layer manages metadata, secu...
Data skewness can be handled by partitioning data, using sampling techniques, optimizing queries, and using parallel processing.
Partitioning data based on key values to distribute workload evenly
Using sampling techniques to estimate skewed data distribution
Optimizing queries by using appropriate indexes and query optimization techniques
Using parallel processing to distribute workload across multiple nodes
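One common way to apply the partitioning idea to a skewed join key is key salting. The sketch below is illustrative; the input paths, the join column name 'key', and the bucket count are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-salting-sketch").getOrCreate()

facts = spark.read.parquet("/data/facts")   # hypothetical: large table, skewed on 'key'
dims = spark.read.parquet("/data/dims")     # hypothetical: smaller table, keyed on 'key'

SALT_BUCKETS = 8
# Add a random salt to the skewed side so one hot key is spread over several partitions.
salted_facts = facts.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
# Replicate the other side once per salt value so the salted keys still match.
salted_dims = dims.crossJoin(
    spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt"))

joined = salted_facts.join(salted_dims, ["key", "salt"]).drop("salt")
```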
One optimization technique in PySpark is using partitioning to distribute data evenly across nodes.
Use partitioning to distribute data evenly across nodes
Avoid shuffling data unnecessarily
Cache intermediate results to avoid recomputation
Optimization techniques in Apache Spark improve performance and efficiency.
Partitioning data to distribute work evenly
Caching frequently accessed data in memory
Using broadcast variables for small lookup tables
Optimizing shuffle operations to reduce data movement
Tuning memory and parallelism settings for specific workloads
I appeared for an interview in Jun 2024.
Use dropDuplicates() function in pyspark to remove duplicates in a data frame.
Use dropDuplicates() function on the data frame to remove duplicates based on all columns.
Specify subset of columns to remove duplicates based on specific columns.
Use the distinct() function to remove duplicates and keep only distinct rows.
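A compact, runnable sketch of the three variants mentioned above, on a toy DataFrame:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedupe-sketch").getOrCreate()
df = spark.createDataFrame([(1, "a"), (1, "a"), (1, "b")], ["id", "val"])

df.dropDuplicates().show()            # dedupe on all columns -> (1, a), (1, b)
df.dropDuplicates(["id"]).show()      # dedupe on a subset    -> one row per id
df.distinct().show()                  # same effect as dropDuplicates() on all columns
```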
Broadcast join is a type of join operation in distributed computing where one smaller dataset is broadcasted to all nodes for efficient processing.
Reduces data shuffling by sending smaller dataset to all nodes
Useful when one dataset is significantly smaller than the other
Improves performance by reducing network traffic and processing time
Re-Partition and Coalesce are methods used to control the number of partitions in a dataset in Apache Spark.
Re-Partition is used to increase or decrease the number of partitions in a dataset by shuffling the data across the cluster.
Coalesce is used to decrease the number of partitions in a dataset without shuffling the data, which can improve performance.
Re-Partition is typically used when there is a need to increase p...
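A minimal sketch contrasting the two; the partition counts are arbitrary.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-control-sketch").getOrCreate()
df = spark.range(1_000_000)

wider = df.repartition(200)            # full shuffle; can increase or decrease partitions
narrower = wider.coalesce(10)          # merges existing partitions; no shuffle, decrease only

print(wider.rdd.getNumPartitions())    # 200
print(narrower.rdd.getNumPartitions()) # 10
```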
Extract Pincode from Address Field in Dataframe using Pyspark
Use pyspark.sql.functions regexp_extract() function to extract pincode from address field
Create a new column in the dataframe to store the extracted pincode
Specify the regular expression pattern for pincode extraction
Example: df.withColumn('pincode', regexp_extract(df['address'], r'\b\d{6}\b', 0)) (use a raw string so the backslashes reach the regex engine intact)
SQL query to retrieve student names with marks > 45 in each subject
Use GROUP BY and HAVING clauses to filter students with marks > 45 in each subject
Join Student table with Marks table on student_id to get marks for each student
Select student names from Student table based on the conditions
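A sketch of that query via Spark SQL. The table and column names (Student, Marks, student_id, subject, marks) are assumptions based on the description, and HAVING MIN(marks) > 45 is one way to express "above 45 in every subject".

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("marks-query-sketch").getOrCreate()

# Tiny illustrative tables registered as temp views.
spark.createDataFrame([(1, "Asha"), (2, "Ravi")],
                      ["student_id", "name"]).createOrReplaceTempView("Student")
spark.createDataFrame([(1, "Math", 80), (1, "Science", 70),
                       (2, "Math", 90), (2, "Science", 30)],
                      ["student_id", "subject", "marks"]).createOrReplaceTempView("Marks")

spark.sql("""
    SELECT s.name
    FROM Student s
    JOIN Marks m ON s.student_id = m.student_id
    GROUP BY s.student_id, s.name
    HAVING MIN(m.marks) > 45   -- above 45 in every subject means the minimum is above 45
""").show()                    # only Asha qualifies
```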
Enable Hive support in Spark for seamless integration of Hive tables and queries.
Set 'spark.sql.catalogImplementation' to 'hive' in SparkConf
Include 'spark-hive' dependency in the Spark application
Ensure Hive configuration files are available in the classpath
Use HiveContext or enable Hive support in SparkSession
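A hedged sketch of enabling Hive support when building the session; it assumes hive-site.xml and the spark-hive module are available, as the answer notes.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-enabled-app")
         .config("spark.sql.catalogImplementation", "hive")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW DATABASES").show()   # queries the Hive metastore catalog
```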
Joins in Spark using PySpark are used to combine data from two different DataFrames based on a common key.
Joins are performed using the join() function in PySpark.
Common types of joins include inner join, outer join, left join, and right join.
Example: df1.join(df2, df1.key == df2.key, 'inner')
Use broadcast join or partition join in pyspark to join two large tables efficiently.
Use broadcast join for smaller table and partition join for larger table.
Broadcast join - broadcast the smaller table to all worker nodes.
Partition join - partition both tables on the join key and join them.
Example: df1.join(broadcast(df2), 'join_key')
Example: df1.join(df2, 'join_key').repartition('join_key')
df.explain() in pyspark is used to display the physical plan of the DataFrame operations.
df.explain() is used to show the execution plan of the DataFrame operations in pyspark.
It helps in understanding how the operations are being executed and optimized by Spark.
The output of df.explain() includes details like the logical and physical plans, optimizations applied, and stages of execution.
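A tiny sketch of calling explain(); the extended mode shown is available in Spark 3.x.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("explain-sketch").getOrCreate()
df = spark.range(100).withColumn("bucket", F.col("id") % 10)

agg = df.groupBy("bucket").count()
agg.explain()                  # physical plan only
agg.explain(mode="extended")   # parsed, analyzed, optimized logical plans + physical plan
```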
Spark architecture is a distributed computing framework that consists of a driver program, cluster manager, and worker nodes.
Spark driver program coordinates the execution of tasks and maintains the overall state of the application.
Cluster manager allocates resources for the application and monitors its execution.
Worker nodes execute the tasks assigned by the driver program and store data in memory or disk.
Spark archit...
Lineage graph is used to track the flow of data from source to destination, helping in understanding data dependencies and impact analysis.
Helps in understanding data dependencies and relationships
Tracks the flow of data from source to destination
Aids in impact analysis and troubleshooting
Useful for data governance and compliance
Can be visualized to easily comprehend complex data pipelines
External tables store data outside the database while internal tables store data within the database.
External tables reference data stored outside the database, such as in HDFS or S3, while internal tables store data within the database itself.
External tables are typically used for data that is not managed by the database system, while internal tables are used for data that is managed by the database system.
External ta...
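A hedged Spark SQL sketch contrasting the two; it assumes Hive support is enabled, and the table names and HDFS location are illustrative.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("table-types-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Managed (internal) table: the metastore owns both the metadata and the files.
spark.sql("CREATE TABLE IF NOT EXISTS sales_internal (id INT, amount DOUBLE)")

# External table: only metadata is managed; the data stays at the external location.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_external (id INT, amount DOUBLE)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/sales'
""")
```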
Join the DFs, count events per user, and filter users with 0 events
Use join operation to combine DF1 and DF2 on UserID
Group by UserID and count the number of events
Filter out users with 0 events
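A minimal PySpark sketch of that flow, assuming DF1 holds UserID and DF2 holds (UserID, event) rows:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("zero-event-users-sketch").getOrCreate()

# Assumed shapes: DF1 = users(UserID), DF2 = events(UserID, event).
df1 = spark.createDataFrame([(1,), (2,), (3,)], ["UserID"])
df2 = spark.createDataFrame([(1, "click"), (1, "view")], ["UserID", "event"])

counts = (df1.join(df2, "UserID", "left")              # keep users with no events
             .groupBy("UserID")
             .agg(F.count("event").alias("events")))   # nulls from the left join not counted

counts.filter(F.col("events") == 0).show()             # users with 0 events -> 2 and 3
```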
I applied via Campus Placement and was interviewed in Oct 2024. There was 1 interview round.
I applied via LinkedIn and was interviewed in Aug 2024. There were 3 interview rounds.
I am a detail-oriented data engineer with a passion for problem-solving and a strong background in programming and data analysis.
Experienced in designing and implementing data pipelines
Proficient in programming languages such as Python, SQL, and Java
Skilled in data modeling and database management
Strong analytical skills and ability to work with large datasets
Excellent communication and teamwork skills
I have a Bachelor's degree in Computer Science and a Master's degree in Data Engineering.
Bachelor's degree in Computer Science
Master's degree in Data Engineering
Some medium-level DSA problems
I applied via Naukri.com and was interviewed in Sep 2024. There was 1 interview round.
To find the second largest element in an array
Sort the array in descending order
Return the element at index 1
Dropping duplicates involves removing repeated entries from a dataset to ensure data integrity and accuracy.
Use the 'drop_duplicates()' method in pandas to remove duplicate rows from a DataFrame.
Example: df.drop_duplicates(subset=['column_name'], keep='first') removes duplicates based on 'column_name'.
In SQL, use 'SELECT DISTINCT' to retrieve unique records from a table.
Example: SELECT DISTINCT column_name FROM table_n...
I applied via Company Website and was interviewed in Jul 2024. There were 4 interview rounds.
Numerical ability, reasoning, and maths
Python, data structures, C, C++, and Java
Python code to calculate factorial of a number
Use a recursive function to calculate the factorial
Base case: if n is 0 or 1, return 1
Recursive case: return n * factorial(n-1)
Example: def factorial(n): return 1 if n == 0 or n == 1 else n * factorial(n-1)
Combining my electrical engineering background with IT skills allows me to work on cutting-edge technologies and solve complex problems.
Interest in technology and data analysis sparked during electrical engineering studies
Realized the potential of combining electrical engineering knowledge with IT for innovative solutions
Opportunities in data engineering field align with my career goals
One optimization technique in PySpark is using partitioning to distribute data evenly across nodes.
Use partitioning to distribute data evenly across nodes
Avoid shuffling data unnecessarily
Cache intermediate results to avoid recomputation
Data skewness can be handled by partitioning data, using sampling techniques, optimizing queries, and using parallel processing.
Partitioning data based on key values to distribute workload evenly
Using sampling techniques to estimate skewed data distribution
Optimizing queries by using appropriate indexes and query optimization techniques
Using parallel processing to distribute workload across multiple nodes
I appeared for an interview in Mar 2025, where I was asked the following questions.
Azure Data Factory (ADF) is a cloud-based data integration service for creating data-driven workflows.
ADF allows for the creation of ETL (Extract, Transform, Load) processes.
It supports data movement between various sources like Azure Blob Storage, SQL databases, and on-premises data.
ADF provides a visual interface for designing data pipelines, making it user-friendly.
It integrates with other Azure services like Azure ...
Window functions in SQL and PySpark allow for advanced data analysis over specified ranges of rows.
Window functions perform calculations across a set of table rows related to the current row.
Common window functions include ROW_NUMBER(), RANK(), DENSE_RANK(), SUM(), AVG(), etc.
In SQL: SELECT employee_id, salary, RANK() OVER (ORDER BY salary DESC) AS salary_rank FROM employees;
In PySpark: df.withColumn('salary_rank', F.r...
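A runnable PySpark sketch along the lines of the truncated example above; the employees data and column names are illustrative.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("window-sketch").getOrCreate()
employees = spark.createDataFrame(
    [(1, "Sales", 50000), (2, "Sales", 60000), (3, "HR", 55000)],
    ["employee_id", "department", "salary"])

w = Window.orderBy(F.desc("salary"))                          # rank across all employees
employees.withColumn("salary_rank", F.rank().over(w)).show()

w_dept = Window.partitionBy("department").orderBy(F.desc("salary"))
employees.withColumn("dept_rank", F.row_number().over(w_dept)).show()
```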
Query to find the 2nd highest salary in a database table.
Use the ORDER BY clause to sort salaries in descending order.
Use the LIMIT clause to retrieve the second row.
Consider handling cases where there may be ties for the highest salary.
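A sketch via Spark SQL on a toy table; comparing against MAX(salary) in a subquery ignores ties for the top salary, as the answer suggests considering. An ORDER BY ... LIMIT/OFFSET variant also works on engines that support OFFSET.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("second-salary-sketch").getOrCreate()
spark.createDataFrame([(1, 90000), (2, 90000), (3, 75000), (4, 60000)],
                      ["employee_id", "salary"]).createOrReplaceTempView("employees")

# Ties for the highest salary are excluded by comparing against MAX(salary).
spark.sql("""
    SELECT MAX(salary) AS second_highest
    FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
""").show()   # 75000
```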
Code to count frequency of elements in a list of strings.
Use a dictionary to store the frequency of each element in the list.
Iterate through the list and update the count in the dictionary.
Return the dictionary with element frequencies.
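A minimal Python sketch of that approach:

```python
def element_frequencies(items):
    """Count how often each string appears in the list."""
    freq = {}
    for item in items:
        freq[item] = freq.get(item, 0) + 1
    return freq

print(element_frequencies(["a", "b", "a", "c", "b", "a"]))   # {'a': 3, 'b': 2, 'c': 1}
```

collections.Counter(items) gives the same result in a single call.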
The duration of the Capgemini Data Engineer interview process can vary, but it typically takes less than 2 weeks to complete, based on 44 interview experiences.
Consultant (58.6k salaries): ₹8.9 L/yr - ₹16.5 L/yr
Associate Consultant (51.2k salaries): ₹4.5 L/yr - ₹10 L/yr
Senior Consultant (50k salaries): ₹12.5 L/yr - ₹21 L/yr
Senior Analyst (22k salaries): ₹3.1 L/yr - ₹7.5 L/yr
Senior Software Engineer (21.6k salaries): ₹4.7 L/yr - ₹12.8 L/yr