Materialized views are database objects that store the result of a query for faster access and improved performance.
Materialized views store data physically, unlike regular views which are virtual.
They can improve query performance by pre-computing expensive joins and aggregations.
Materialized views can be refreshed automatically or manually to keep data up-to-date.
Example: A materialized view can aggregate sales ...
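A minimal sketch of the sales example, assuming a Databricks SQL context reachable through spark.sql and an existing sales table (all names are illustrative; the exact DDL varies by platform):
spark.sql("""
    CREATE MATERIALIZED VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
""")
# Refresh on demand (or on a schedule) to pick up new rows in the base table.
spark.sql("REFRESH MATERIALIZED VIEW sales_by_region")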
Unity Catalog is Databricks' centralized governance layer for data and AI assets across workspaces.
It provides a single place to manage access control, auditing, lineage, and data discovery.
It governs assets such as catalogs, schemas, tables, views, volumes, functions, and ML models.
Data is addressed through a three-level namespace: catalog.schema.table.
Unity Catalog helps standardize security and sharing of data across teams and workspaces (see the sketch below).
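A short sketch of the three-level namespace and a grant, assuming Unity Catalog is enabled and the referenced table already exists (catalog, schema, table, and group names are illustrative):
spark.sql("CREATE CATALOG IF NOT EXISTS main_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS main_catalog.sales")
spark.sql("GRANT SELECT ON TABLE main_catalog.sales.orders TO `data_analysts`")  # centralized access control
df = spark.table("main_catalog.sales.orders")  # three-level namespace: catalog.schema.table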
Optimisation techniques for improving Databricks performance
Utilize cluster sizing and autoscaling to match workload demands
Optimize data storage formats like Parquet for efficient querying
Use partitioning and indexing to speed up data retrieval
Leverage caching for frequently accessed data
Monitor and tune query performance using Databricks SQL Analytics
Consider using Delta Lake for ACID transactions and improved p...
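A hedged sketch combining a few of these techniques on a Delta table (the path, column names, and partitioning scheme are assumptions):
df = spark.read.format("delta").load("/mnt/data/events")
recent = df.filter("event_date >= '2024-01-01'")   # partition pruning if event_date is a partition column
recent.cache()                                      # cache data that is reused across queries
recent.count()                                      # first action materializes the cache
spark.sql("OPTIMIZE delta.`/mnt/data/events` ZORDER BY (user_id)")  # compact small files, co-locate related data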
Auto Loader is a Databricks feature that incrementally and automatically ingests new files from cloud storage as they land, without manual intervention.
It removes the need for manually tracking which files have already been loaded.
It is typically used in ETL pipelines that stage raw files in ADLS, S3, or GCS before loading them into Delta tables.
It is exposed through Structured Streaming using the cloudFiles source format (see the sketch below).
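A minimal Auto Loader sketch, assuming a Databricks workspace with access to the storage account (paths, file format, and table names are illustrative):
stream = (spark.readStream
          .format("cloudFiles")                                   # Auto Loader source
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/mnt/checkpoints/schemas/raw_events")
          .load("abfss://raw@mystorageaccount.dfs.core.windows.net/events/"))
(stream.writeStream
       .option("checkpointLocation", "/mnt/checkpoints/raw_events")
       .trigger(availableNow=True)                                # process all new files, then stop
       .toTable("bronze.raw_events"))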
Alerting mechanism in ADF for failed pipelines involves setting up alerts in Azure Monitor and configuring email notifications.
Set up alerts in Azure Monitor for monitoring pipeline runs
Configure alert rules to trigger notifications when a pipeline run fails
Use Azure Logic Apps to send email notifications for failed pipeline runs
Persist lets you choose a storage level (memory, disk, or both), while cache always uses the default storage level.
persist() with a level such as MEMORY_AND_DISK spills partitions to disk when they do not fit in memory, avoiding recomputation after eviction.
cache() is shorthand for persist() with the default level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames).
Use persist() when you need control over where the data is kept; use cache() for quick reuse of a dataset within a job.
Example: rdd.persist(StorageLevel.MEMORY_AND_DISK) can spill to disk, while rdd.cache() keeps the data in memory only (see the sketch below).
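A small PySpark sketch of the difference, assuming an active SparkSession named spark:
from pyspark import StorageLevel

df = spark.range(1_000_000)
df.persist(StorageLevel.MEMORY_AND_DISK)   # explicit storage level: spill to disk when memory is full
df.count()                                  # an action materializes the persisted data
df.unpersist()

df.cache()                                  # shorthand for persist() with the default storage level
df.count()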
An accumulator is a variable used in distributed computing to aggregate values across multiple tasks or nodes.
Accumulators are used in Spark to perform calculations in a distributed manner.
Tasks can only add to them through an associative and commutative operation; only the driver program can read their value.
Accumulators are used for tasks like counting elements or summing values in parallel processing.
Example: counting the number of er...
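A sketch of counting malformed records with an accumulator, assuming a SparkSession named spark and an illustrative input file:
sc = spark.sparkContext
bad_records = sc.accumulator(0)

def parse(line):
    fields = line.split(",")
    if len(fields) != 3:
        bad_records.add(1)        # executors can only add to the accumulator
        return None
    return fields

parsed = sc.textFile("/mnt/data/raw.csv").map(parse).filter(lambda x: x is not None)
parsed.count()                                   # an action must run before the total is populated
print("Malformed records:", bad_records.value)   # only the driver reads .value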
Developed a data pipeline to process and analyze large datasets for real-time insights in a retail environment.
Designed ETL processes using Apache Airflow to automate data extraction from various sources.
Utilized AWS services like S3 for storage and Redshift for data warehousing.
Implemented data quality checks to ensure accuracy and reliability of the data.
Created dashboards using Tableau for visualizing sales tre...
Activities in ADF and their uses
Data movement activities like Copy Data
Data transformation activities like Mapping Data Flow and Wrangling Data Flow
Control flow activities like Execute Pipeline and Wait
Iteration and conditional activities like ForEach and If Condition
Integration Runtimes for executing activities in ADF
Databricks supports two types of clusters: Standard and High Concurrency.
Databricks supports Standard clusters for single user workloads
Databricks supports High Concurrency clusters for multi-user workloads
Standard clusters are suitable for ad-hoc analysis and ETL jobs
High Concurrency clusters are suitable for shared notebooks and interactive dashboards
I applied via Naukri.com and was interviewed in Dec 2024. There was 1 interview round.
Optimisation techniques for improving Databricks performance
Utilize cluster sizing and autoscaling to match workload demands
Optimize data storage formats like Parquet for efficient querying
Use partitioning and indexing to speed up data retrieval
Leverage caching for frequently accessed data
Monitor and tune query performance using Databricks SQL Analytics
Consider using Delta Lake for ACID transactions and improved perfor...
Auto Loader is a Databricks feature that incrementally and automatically ingests new files from cloud storage as they land, without manual intervention.
It removes the need for manually tracking which files have already been loaded.
It is typically used in ETL pipelines that stage raw files in ADLS, S3, or GCS before loading them into Delta tables.
It is exposed through Structured Streaming using the cloudFiles source format.
Unity Catalog is Databricks' centralized governance layer for data and AI assets across workspaces.
It provides a single place to manage access control, auditing, lineage, and data discovery.
It governs assets such as catalogs, schemas, tables, views, volumes, functions, and ML models.
Data is addressed through a three-level namespace: catalog.schema.table.
Unity Catalog helps standardize security and sharing of data across teams and workspaces.
Alerting mechanism in ADF for failed pipelines involves setting up alerts in Azure Monitor and configuring email notifications.
Set up alerts in Azure Monitor for monitoring pipeline runs
Configure alert rules to trigger notifications when a pipeline run fails
Use Azure Logic Apps to send email notifications for failed pipeline runs
I applied via Recruitment Consultant and was interviewed in Jun 2024. There was 1 interview round.
I would rate myself 4 in Pyspark, 5 in Python, and 4 in SQL.
Strong proficiency in Python programming language
Experience in working with Pyspark for big data processing
Proficient in writing complex SQL queries for data manipulation
Familiarity with optimizing queries for performance
Hands-on experience in data engineering projects
Use Python's built-in data structures like sets or dictionaries to handle duplicates.
Use a set to remove duplicates from a list: unique_list = list(set(original_list))
Use a dictionary to remove duplicates from a list while preserving order: unique_list = list(dict.fromkeys(original_list))
Use Databricks-provided tooling such as the SYNC command or the UCX migration utility to migrate Hive metastore objects to Unity Catalog.
Attach the workspace to a Unity Catalog metastore and create the target catalog and schemas.
Use SYNC (or UCX assessment and migration jobs) to upgrade external Hive metastore tables to Unity Catalog.
For managed Hive tables, recreate the data under Unity Catalog, for example with CREATE TABLE AS SELECT or a deep clone (see the sketch below).
Validate the migration by chec...
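A hedged sketch of upgrading external Hive tables with the SYNC command, assuming the workspace is already attached to a Unity Catalog metastore (catalog and schema names are illustrative):
spark.sql("CREATE SCHEMA IF NOT EXISTS main.sales")
spark.sql("SYNC SCHEMA main.sales FROM hive_metastore.sales")   # upgrade external tables in place
# Managed Hive tables are copied instead, e.g. with CREATE TABLE AS SELECT:
spark.sql("CREATE TABLE main.sales.orders AS SELECT * FROM hive_metastore.sales.orders")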
To read a CSV file from an ADLS path, you can use libraries like pandas or pyspark.
Use pandas library in Python to read a CSV file from ADLS path
Use pyspark library in Python to read a CSV file from ADLS path
Ensure you have the necessary permissions to access the ADLS path
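A PySpark sketch, assuming access to the ADLS account is already configured (storage account, container, and file path are illustrative):
path = "abfss://raw@mystorageaccount.dfs.core.windows.net/sales/2024/sales.csv"
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv(path))
df.show(5)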
The number of stages created from the code depends on how many shuffle boundaries it contains.
Spark pipelines narrow transformations (such as map and filter) into a single stage and starts a new stage at every wide transformation (such as groupBy, join, or repartition) that requires a shuffle.
To determine the total, trace the lineage and count the shuffle boundaries: a simple job has one stage plus one additional stage per shuffle.
The Spark UI shows the stages and tasks that were actually created, which is useful for verification.
Narrow transformations need data from only a single parent partition for each output partition, while wide transformations need data from multiple partitions and therefore shuffle data across the cluster.
Narrow transformations can be pipelined within one stage, making them easy to parallelize and optimize.
Wide transformations introduce a shuffle and a stage boundary, which can lead to expensive data movement and performance issues.
Examples of narrow transformations include map and filter; examples of wide transformations include groupByKey, reduceByKey, and join (see the sketch below).
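A quick sketch showing where the stage boundary appears, assuming a SparkSession named spark:
rdd = spark.sparkContext.parallelize(range(100), 4)
narrow = rdd.map(lambda x: x * 2).filter(lambda x: x % 3 == 0)            # narrow: no shuffle, same stage
wide = narrow.map(lambda x: (x % 5, x)).reduceByKey(lambda a, b: a + b)   # wide: shuffle, new stage
wide.collect()   # in the Spark UI the job splits into two stages at the shuffle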
Actions and transformations are key concepts in data engineering, involving the manipulation and processing of data.
Actions are operations that trigger the execution of a data transformation job in a distributed computing environment.
Transformations are functions that take an input dataset and produce an output dataset, often involving filtering, aggregating, or joining data.
Examples of actions include 'saveAsTextFile'...
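A short illustration of lazy transformations and an action, assuming a SparkSession named spark:
df = spark.range(1000)                             # nothing runs yet
doubled = df.withColumn("double", df["id"] * 2)    # transformation: only extends the plan
filtered = doubled.filter("double > 100")          # still lazy
print(filtered.count())                            # action: triggers execution of the whole plan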
Enforcing the schema ensures data consistency and validation, while manually defining the schema in code allows for more flexibility and customization.
Enforcing the schema ensures that all data conforms to a predefined structure and format, preventing errors and inconsistencies.
Manually defining the schema in code allows for more flexibility in handling different data types and structures.
Enforcing the schema can be do...
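A sketch of manually defining and enforcing a schema on read (the file path and column names are assumptions):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

schema = StructType([
    StructField("order_id", IntegerType(), False),
    StructField("customer", StringType(), True),
    StructField("amount", DoubleType(), True),
])
df = spark.read.schema(schema).option("header", True).csv("/mnt/raw/orders.csv")  # no schema inference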
Optimizations like partitioning, caching, and using efficient file formats can reduce overhead in reading large datasets in Spark.
Partitioning data based on key can reduce the amount of data shuffled during joins and aggregations
Caching frequently accessed datasets in memory can avoid recomputation
Using efficient file formats like Parquet or ORC can reduce disk I/O and improve read performance
SQL query to find the name of person who logged in last within each country from Person Table
Use a subquery to find the max login time for each country
Join the Person table with the subquery on country and login time to get the name of the person
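A sketch of that approach wrapped in spark.sql; the Person table and its name, country, and login_time columns are assumptions:
spark.sql("""
    SELECT p.country, p.name
    FROM Person p
    JOIN (SELECT country, MAX(login_time) AS last_login
          FROM Person
          GROUP BY country) latest
      ON p.country = latest.country AND p.login_time = latest.last_login
""").show()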
List is mutable, Tuple is immutable in Python.
List can be modified after creation, Tuple cannot be modified.
List is defined using square brackets [], Tuple is defined using parentheses ().
Example: list_example = [1, 2, 3], tuple_example = (4, 5, 6)
Rank and Dense Rank give tied rows the same rank, while Row Number assigns a distinct sequential number to every row.
Rank leaves gaps in the sequence after ties (for example 1, 1, 3).
Dense Rank assigns consecutive ranks with no gaps after ties (for example 1, 1, 2).
Row Number assigns a unique number to each row regardless of ties in the ordering values (see the sketch below).
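A small PySpark illustration of the three functions over the same ordering (the sample data is made up):
from pyspark.sql import Window, functions as F

df = spark.createDataFrame([("a", 300), ("b", 300), ("c", 200)], ["name", "salary"])
w = Window.orderBy(F.desc("salary"))
df.select("name", "salary",
          F.rank().over(w).alias("rank"),              # 1, 1, 3 -> gap after the tie
          F.dense_rank().over(w).alias("dense_rank"),  # 1, 1, 2 -> no gap
          F.row_number().over(w).alias("row_number")   # 1, 2, 3 -> unique per row
          ).show()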
List comprehension is a concise way to create lists in Python by applying an expression to each item in an iterable.
Syntax: [expression for item in iterable]
Can include conditions: [expression for item in iterable if condition]
Example: squares = [x**2 for x in range(10)]
Optimized data processing and storage to enhance performance and reduce latency in ETL workflows.
Implemented partitioning in our data warehouse to improve query performance, reducing data scan times by 40%.
Utilized indexing on frequently queried columns, leading to a 30% decrease in query execution time.
Migrated from a traditional RDBMS to a distributed NoSQL database, which improved scalability and read/write speeds.
O...
Interactive clusters allow for real-time interaction and exploration, while job clusters are used for running batch jobs.
Interactive clusters are used for real-time data exploration and analysis.
Job clusters are used for running batch jobs and processing large amounts of data.
Interactive (all-purpose) clusters stay up for ongoing use and can be shared by several users and notebooks.
Job clusters are created for a specific job run and terminate automatically when the job finishes, which helps control cost.
To add a column in a dataframe, use the 'withColumn' method. To rename a column, use the 'withColumnRenamed' method.
To add a column, use the 'withColumn' method with the new column name and the expression to compute the values for that column.
Example: df.withColumn('new_column', df['existing_column'] * 2)
To rename a column, use the 'withColumnRenamed' method with the current column name and the new column name.
Example:...
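A short sketch using a made-up DataFrame:
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, 10), (2, 20)], ["id", "existing_column"])
df = df.withColumn("new_column", F.col("existing_column") * 2)    # add a derived column
df = df.withColumnRenamed("existing_column", "original_value")    # rename the existing column
df.show()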
Coalesce reduces the number of partitions without a full shuffle, while Repartition reshuffles the data to increase or decrease the number of partitions in a DataFrame.
Coalesce reduces the number of partitions in a DataFrame by combining small partitions into larger ones.
Repartition increases or decreases the number of partitions in a DataFrame by shuffling the data across partitions.
Coalesce is more efficient than Repartit...
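A quick sketch of the partition counts, assuming a SparkSession named spark:
df = spark.range(1_000_000)
df_wide = df.repartition(200, "id")   # full shuffle; can increase partitions and rebalance by key
df_small = df_wide.coalesce(10)       # merges partitions without a full shuffle; only reduces the count
print(df_wide.rdd.getNumPartitions(), df_small.rdd.getNumPartitions())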
I applied via Company Website and was interviewed in Sep 2024. There was 1 interview round.
Union combines and removes duplicates, while union all combines all rows including duplicates.
Union removes duplicates from the result set
Union all includes all rows, even duplicates
Use union when you want to remove duplicates, use union all when duplicates are needed
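A DataFrame sketch of the same idea (Spark's DataFrame union keeps duplicates, like UNION ALL; the data is made up):
df_a = spark.createDataFrame([(1,), (2,)], ["id"])
df_b = spark.createDataFrame([(2,), (3,)], ["id"])
df_a.union(df_b).show()              # behaves like UNION ALL: the duplicate 2 is kept
df_a.union(df_b).distinct().show()   # behaves like UNION: duplicates removed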
Rank and dense rank both give tied values the same rank; they differ in how the following rank is assigned.
Rank skips ranks after ties, while dense rank does not.
Rank may therefore have gaps in the ranking sequence, while dense rank does not.
Rank is useful when you want to know the exact position of a value in a sorted list, while dense rank is useful when you want to know the relative position of a value com...
Facts tables contain numerical data while dimensions tables contain descriptive attributes.
Facts tables store quantitative data like sales revenue or quantity sold
Dimensions tables store descriptive attributes like product name or customer details
Facts tables are typically used for analysis and reporting, while dimensions tables provide context for the facts
Lambda functions in Python are anonymous functions that can have any number of arguments but only one expression.
Lambda functions are defined using the lambda keyword.
They are commonly used for small, one-time tasks.
Lambda functions can be used as arguments to higher-order functions like map, filter, and reduce.
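A few small examples:
nums = [1, 2, 3, 4, 5]
squares = list(map(lambda x: x ** 2, nums))               # [1, 4, 9, 16, 25]
evens = list(filter(lambda x: x % 2 == 0, nums))          # [2, 4]
pairs = sorted([("b", 2), ("a", 1)], key=lambda p: p[0])  # sort by the first element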
I applied via Approached by Company and was interviewed in Oct 2024. There were 2 interview rounds.
It was a 60-minute test with 11 MCQs, 3 SQL questions, and 1 Python question.
Use Databricks code to read multiple files from ADLS and write into a single file
Use Databricks File System (DBFS) to access files in ADLS
Read multiple files using Spark's read method
Combine the dataframes using union or merge
Write the combined dataframe to a single file using Spark's write method
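A hedged sketch of that flow (the ADLS paths are illustrative and assume storage access is configured):
input_path = "abfss://raw@mystorageaccount.dfs.core.windows.net/daily_files/*.csv"
output_path = "abfss://curated@mystorageaccount.dfs.core.windows.net/combined/"

df = spark.read.option("header", True).csv(input_path)                            # reads every matching file
df.coalesce(1).write.mode("overwrite").option("header", True).csv(output_path)    # single output part file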
Spark architecture is a distributed computing framework that provides high-level APIs for various languages.
Spark architecture consists of a cluster manager, worker nodes, and a driver program.
It uses Resilient Distributed Datasets (RDDs) for fault-tolerant distributed data processing.
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object.
It supports various data so...
I appeared for an interview before Jul 2024, where I was asked the following questions.
Optimization techniques enhance performance and efficiency in data processing and storage.
Indexing: Use indexes in databases to speed up query performance. For example, adding an index on a 'date' column can improve search times.
Partitioning: Split large datasets into smaller, manageable pieces. For instance, partitioning a sales table by year can optimize query performance.
Caching: Store frequently accessed data in me...
Retrieve the second highest salary from a salary table using SQL.
Use the DISTINCT keyword to avoid duplicate salaries.
Utilize the ORDER BY clause to sort salaries in descending order.
Limit the results to the second row using OFFSET or a subquery.
Example query: SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 1;
I appeared for an interview in Apr 2025, where I was asked the following questions.
Materialized views are database objects that store the result of a query for faster access and improved performance.
Materialized views store data physically, unlike regular views which are virtual.
They can improve query performance by pre-computing expensive joins and aggregations.
Materialized views can be refreshed automatically or manually to keep data up-to-date.
Example: A materialized view can aggregate sales data ...
I applied via Naukri.com and was interviewed in Sep 2024. There was 1 interview round.
Use set() function to remove duplicates from a list in Python.
Convert the list to a set using set() function
Convert the set back to a list to remove duplicates
Example: list_with_duplicates = ['a', 'b', 'a', 'c']; list_without_duplicates = list(set(list_with_duplicates))
In PySpark, you can add a column with a default value using the withColumn method and lit function.
Use the withColumn method to add a new column to a DataFrame.
Utilize the lit function from pyspark.sql.functions to set a default value.
Example: df = df.withColumn('new_column', lit('default_value')).
This will add 'new_column' with 'default_value' for all rows in the DataFrame.
The duration of the Accenture Data Engineer interview process can vary, but it typically takes less than 2 weeks to complete.
Role | Salaries reported | Salary range
Application Development Analyst | 39.3k | ₹4.8 L/yr - ₹11 L/yr
Application Development - Senior Analyst | 27.7k | ₹8.3 L/yr - ₹16.1 L/yr
Team Lead | 26.6k | ₹12.6 L/yr - ₹22.5 L/yr
Senior Analyst | 19.5k | ₹9.1 L/yr - ₹15.7 L/yr
Senior Software Engineer | 18.5k | ₹10.4 L/yr - ₹18 L/yr