I applied via Referral and was interviewed in Aug 2023. There were 2 interview rounds.
Airflow is a platform to programmatically author, schedule, and monitor workflows.
Airflow is written in Python and uses Directed Acyclic Graphs (DAGs) to define workflows.
It has a web-based UI for visualization and monitoring of workflows.
Airflow consists of a scheduler, a metadata database, a web server, and an executor.
Tasks in Airflow are defined as operators, which determine what actually gets executed.
Example: A DAG can define a daily ETL pipeline in which an extract task runs before the transform and load tasks, as sketched below.
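A minimal sketch of such a DAG, assuming Airflow 2.x with the standard PythonOperator; the dag_id, task names, and the extract/transform/load callables are illustrative only:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extracting data")


def transform():
    print("transforming data")


def load():
    print("loading data")


# A simple daily ETL DAG: extract runs first, then transform, then load.
with DAG(
    dag_id="daily_etl_example",
    start_date=datetime(2023, 8, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load
```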
RDD stands for Resilient Distributed Dataset in Spark, which is an immutable distributed collection of objects.
RDD is the fundamental data structure in Spark, representing a collection of elements that can be operated on in parallel.
RDDs are fault-tolerant, meaning they can automatically recover from failures.
RDDs support two types of operations: transformations (creating a new RDD from an existing one) and actions (triggering computation and returning a result to the driver or writing it to storage).
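A small PySpark sketch of the two operation types; the numbers and values are arbitrary:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations are lazy: they only describe how to build new RDDs.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions trigger the computation and return results to the driver.
print(evens.collect())   # [4, 16]
print(squares.count())   # 5
```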
RDD Lineage is the record of transformations applied to an RDD and the dependencies between RDDs.
RDD Lineage tracks the sequence of transformations applied to an RDD from its source data.
It helps in fault tolerance by allowing RDDs to be reconstructed in case of data loss.
RDD Lineage is used in Spark to optimize the execution plan by eliminating unnecessary computations.
Example: If an RDD is created from a text file and then filtered and mapped, its lineage records each of those steps, so lost partitions can be recomputed from the source file.
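A sketch of inspecting lineage with toDebugString(); the input path is hypothetical:

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

# Hypothetical pipeline: read a text file, split into words, keep long words.
lines = sc.textFile("input.txt")                 # assumed input path
words = lines.flatMap(lambda line: line.split())
long_words = words.filter(lambda word: len(word) > 5)

# toDebugString() shows the lineage Spark would replay to rebuild lost partitions.
print(long_words.toDebugString().decode("utf-8"))
```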
Broadcast Variables are read-only shared variables that are cached on each machine in a Spark cluster rather than being sent with tasks.
Broadcast Variables are used to efficiently distribute large read-only datasets to all worker nodes in a Spark cluster.
They are useful for tasks that require the same data to be shared across multiple stages of a job.
Broadcast Variables are created using the broadcast() method on the SparkContext.
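A minimal sketch; the lookup dictionary is made up for illustration:

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

# The lookup table is shipped to each executor once and cached there.
country_lookup = sc.broadcast({"IN": "India", "US": "United States"})

codes = sc.parallelize(["IN", "US", "IN"])
names = codes.map(lambda code: country_lookup.value.get(code, "Unknown"))
print(names.collect())   # ['India', 'United States', 'India']
```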
Broadcasting is a technique used in Apache Spark to optimize data transfer by sending smaller data to all nodes in a cluster.
Broadcasting is used to efficiently distribute read-only data to all nodes in a cluster to avoid unnecessary data shuffling.
It is commonly used when joining large datasets with smaller lookup tables.
Broadcast variables are cached in memory and reused across multiple stages of a Spark job.
The main limitation is that a broadcast variable must fit in each executor's memory, so only relatively small datasets should be broadcast.
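A sketch of a broadcast join with the DataFrame API; the two tables are tiny stand-ins for a large fact table and a small lookup table:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Stand-ins: a large fact table and a small lookup table.
orders = spark.createDataFrame([(1, "IN"), (2, "US")], ["order_id", "country_code"])
countries = spark.createDataFrame(
    [("IN", "India"), ("US", "United States")], ["country_code", "country"]
)

# The broadcast() hint ships the small table to every executor,
# so the large table is joined without being shuffled.
joined = orders.join(broadcast(countries), on="country_code", how="left")
joined.show()
```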
Accumulators are used for aggregating values across tasks, while Catalyst optimizer is a query optimizer for Apache Spark.
Accumulators are variables that are only added to through an associative and commutative operation and can be used to implement counters or sums.
Catalyst optimizer is a rule-based query optimizer that leverages advanced programming language features to build an extensible query optimizer.
Catalyst optimizer turns DataFrame and SQL queries into optimized logical and physical execution plans.
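A sketch of an accumulator used as a counter across tasks; the sample data is arbitrary:

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

# An accumulator counting malformed records across tasks.
bad_records = sc.accumulator(0)

def parse(line):
    try:
        return int(line)
    except ValueError:
        bad_records.add(1)   # tasks may only add to the accumulator
        return 0

data = sc.parallelize(["1", "2", "oops", "4"])
total = data.map(parse).sum()   # the action triggers the computation

print(total)              # 7
print(bad_records.value)  # 1 (read back on the driver)
```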
To debug a slow block, start by identifying potential bottlenecks, analyzing logs, checking for errors, and profiling the code.
Identify potential bottlenecks in the code or system that could be causing the slow performance.
Analyze logs and error messages to pinpoint any issues or exceptions that may be occurring.
Use profiling tools to analyze the performance of the code and identify areas that need optimization.
Check for resource constraints (CPU, memory, I/O) and for data skew or inefficient queries that could explain the slowdown.
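As one concrete way to profile a suspect block in Python, a sketch using the standard-library cProfile; `slow_block` is a placeholder for the real code:

```python
import cProfile
import pstats

def slow_block():
    # Placeholder for the code under investigation.
    total = 0
    for i in range(1_000_000):
        total += i * i
    return total

# Profile the block and list the functions where the most time is spent.
profiler = cProfile.Profile()
profiler.enable()
slow_block()
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```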
The number of executors required to load 200 petabytes of data depends on each executor's memory and cores, the partition size, and the cache available for processing.
Estimate how much data each executor can realistically handle from its resources and the total data size.
Consider the amount of memory available for caching during processing.
Determine the optimal number of executors from these factors; a rough back-of-the-envelope calculation is sketched below.
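The calculation below is purely illustrative; the executor memory, usable fraction, and partition size are assumptions, not values from the original answer:

```python
# Every number below is an assumption used only to show the arithmetic.
total_data_gb = 200 * 1024 * 1024          # 200 PB ≈ 209,715,200 GB

executor_memory_gb = 64                    # assumed memory per executor
usable_fraction = 0.6                      # assumed fraction left after overhead/cache
data_per_executor_gb = executor_memory_gb * usable_fraction

# Executors needed if all data had to sit in executor memory at once.
executors_all_in_memory = total_data_gb / data_per_executor_gb
print(f"{executors_all_in_memory:,.0f} executors to hold 200 PB in memory at once")

# In practice the data is split into partitions and processed in waves,
# so far fewer executors each work through many partitions sequentially.
partition_size_gb = 0.128                  # assumed ~128 MB partitions
partitions = total_data_gb / partition_size_gb
print(f"{partitions:,.0f} partitions of ~128 MB")
```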
Prepartition is the process of dividing data into smaller partitions before performing any operations on it.
Prepartitioning helps in improving query performance by reducing the amount of data that needs to be processed.
It can also help in distributing data evenly across multiple nodes in a distributed system.
Examples include partitioning a large dataset based on a specific column like date or region before running analytical queries, as in the sketch below.
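A PySpark sketch; the column name and storage paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: repartition by the column used in later joins/aggregations.
sales = spark.read.parquet("s3://bucket/sales/")
sales_by_date = sales.repartition("sale_date")

# Writing with partitionBy lays the data out so later queries can prune partitions.
(sales_by_date.write
    .partitionBy("sale_date")
    .mode("overwrite")
    .parquet("s3://bucket/sales_partitioned/"))
```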
Find the 2nd highest salary employee in each department using PySpark.
Read the CSV file into a DataFrame using spark.read.csv().
Define a window partitioned by 'Department' and ordered by salary descending, then apply the dense_rank() function to rank salaries.
Filter the DataFrame to get employees with a rank of 2.
Select the 'Employee name' and 'Department' columns for the final output.
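A sketch of the full query; the file path and the column names 'Employee name', 'Department', and 'Salary' follow the description above but are otherwise assumed:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, dense_rank

spark = SparkSession.builder.getOrCreate()

# Assumed columns: 'Employee name', 'Department', 'Salary'.
df = spark.read.csv("employees.csv", header=True, inferSchema=True)

# Rank salaries within each department, highest first.
window = Window.partitionBy("Department").orderBy(col("Salary").desc())

second_highest = (df.withColumn("rank", dense_rank().over(window))
                    .filter(col("rank") == 2)
                    .select("Employee name", "Department"))

second_highest.show()
```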
Calculate the frequency of each unique string in an array and display the results.
Use a dictionary to count occurrences: {'a': 3, 'b': 2, 'c': 1}.
Iterate through the list and update counts for each character.
Example: For input ['a', 'a', 'b'], output should be 'a,2' and 'b,1'.
Utilize collections.Counter for a more concise solution.
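Both approaches in one small sketch, using the sample input from the example above:

```python
from collections import Counter

items = ['a', 'a', 'b']                       # sample input from the example above

# Plain dictionary approach.
counts = {}
for item in items:
    counts[item] = counts.get(item, 0) + 1

# Equivalent, more concise version.
counts = Counter(items)

for value, count in counts.items():
    print(f"{value},{count}")                 # prints a,2 then b,1
```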
Case classes are a Scala feature for creating immutable objects for pattern matching and data modeling; Python approximates them with frozen dataclasses or namedtuples.
Case classes are typically used in functional programming to represent data structures.
They are immutable, meaning their values cannot be changed once they are created.
Case classes automatically define equality, hashCode, and toString methods based on the class constructor arguments.
They are commonly used to model records passed through Spark jobs and other data pipelines.
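Since this document's examples are in Python, here is a sketch of the closest Python analogue, a frozen dataclass; the Employee fields are made up:

```python
from dataclasses import dataclass

# frozen=True makes instances immutable and generates __eq__, __hash__,
# and __repr__ from the declared fields, much like a Scala case class.
@dataclass(frozen=True)
class Employee:
    name: str
    department: str
    salary: float

e1 = Employee("Asha", "Data", 12.5)
e2 = Employee("Asha", "Data", 12.5)

print(e1 == e2)        # True: structural equality
print(e1)              # Employee(name='Asha', department='Data', salary=12.5)
# e1.salary = 15.0     # would raise FrozenInstanceError
```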
To load specific columns from a file, use data processing tools to filter the required columns efficiently.
Use libraries like Pandas in Python: `df = pd.read_csv('file.csv', usecols=['col1', 'col2', ...])`.
In SQL, you can specify columns in your SELECT statement: `SELECT col1, col2 FROM table_name;`.
For CSV files, tools like awk can be used: `awk -F, '{print $1,$2,...}' file.csv`.
In ETL processes, configure the extraction step to read only the required columns from the source.
Lambda Architecture is a data processing architecture designed to handle massive quantities of data by combining batch and stream processing. A lambda function, by contrast, is a small anonymous function that can take any number of arguments but contains only one expression.
Lambda Architecture combines batch processing and stream processing to handle large amounts of data efficiently.
Batch layer stores and processes the complete historical dataset, the speed layer handles recent data in real time, and a serving layer merges the two views for queries.
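A tiny sketch of lambda functions in Python; the data is arbitrary:

```python
# A lambda is an anonymous single-expression function.
add = lambda a, b: a + b
print(add(2, 3))                               # 5

# Typically used inline, for example as a sort key.
rows = [("IN", 3), ("US", 1), ("UK", 2)]
print(sorted(rows, key=lambda row: row[1]))    # [('US', 1), ('UK', 2), ('IN', 3)]
```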
I code in Python and use many tools such as scikit-learn; for dashboarding I use Tableau, and I am also skilled in ML.
I applied via Naukri.com and was interviewed before Feb 2023. There were 2 interview rounds.
Two questions on the basics of data structures and algorithms; easy and medium difficulty levels were included.
I applied via LinkedIn and was interviewed before Mar 2023. There were 3 interview rounds.
That was great and easy
Two coding questions were given.
The difficulty level was medium.
Handle a P1 situation by prioritizing the issue, communicating effectively, and collaborating with team members.
Prioritize the issue based on impact and urgency
Communicate with stakeholders about the situation and potential solutions
Collaborate with team members to address the issue efficiently
It would be challenging to work all rotational shifts without shift allowance.
Working all rotational shifts without shift allowance can lead to burnout and decreased job satisfaction.
It may be difficult to maintain work-life balance without shift allowance.
Financial compensation for working rotational shifts is a common practice in many industries.
Without shift allowance, employees may feel undervalued and demotivated.
...
Capgemini offers a diverse range of projects and opportunities for growth, with a strong focus on innovation and collaboration.
Capgemini has a global presence, providing opportunities to work on projects with clients from various industries and regions.
The company values innovation and encourages employees to think creatively and implement new ideas.
Capgemini promotes a collaborative work environment, where teamwork and knowledge sharing are encouraged.
I have a strong analytical background, proven track record of delivering results, and excellent communication skills.
I have a Master's degree in Business Analytics and 5+ years of experience in data analysis.
I have consistently exceeded performance targets in my previous roles by utilizing advanced analytical techniques.
I have excellent communication skills, which allow me to effectively present complex data insights to stakeholders and non-technical audiences.
Engineering allows me to apply problem-solving skills to create innovative solutions and make a positive impact on society.
Passion for problem-solving and innovation
Desire to make a positive impact on society through technology
Interest in applying scientific principles to real-world challenges
I applied via Campus Placement and was interviewed before Nov 2018. There were 4 interview rounds.
Goals in IT include career advancement, skill development, and contributing to innovative projects.
Career advancement through promotions or moving to higher positions
Skill development through training, certifications, and learning new technologies
Contributing to innovative projects that make a difference in the industry
Setting personal goals and targets to measure progress and success
Networking and building relationships with professionals across the industry.
I applied via Company Website and was interviewed before Sep 2019. There were 5 interview rounds.