70+ D P Kapoor & Co Interview Questions and Answers

Updated 3 Jan 2025

Q1. What optimisations are possible to reduce the overhead of reading large datasets in Spark?

Ans.

Optimizations like partitioning, caching, and using efficient file formats can reduce overhead in reading large datasets in Spark.

  • Partitioning data based on key can reduce the amount of data shuffled during joins and aggregations

  • Caching frequently accessed datasets in memory can avoid recomputation

  • Using efficient file formats like Parquet or ORC can reduce disk I/O and improve read performance

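A minimal PySpark sketch of these read-side optimisations; the SparkSession setup is standard, while the path, partition column, and filter value are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-optimizations").getOrCreate()

    # Columnar formats like Parquet let Spark read only the columns a query needs
    df = spark.read.parquet("/data/events")

    # Partition pruning: filtering on a partition column skips whole directories
    recent = df.filter(df["event_date"] == "2025-01-01")

    # Cache a frequently reused dataset so it is not re-read from disk
    recent.cache()
    recent.count()  # an action, which materializes the cache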

Q2. Write a SQL query to find the name of the person who logged in last within each country from the Person table.

Ans.

Join the Person table to the per-country maximum login time to get the name of the person who logged in last in each country.

  • Use a subquery to find the max login time for each country

  • Join the Person table with the subquery on country and login time to get the name of the person

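A sketch of the query described above, run through spark.sql; the Person table is assumed to have name, country, and login_time columns:

    # Subquery finds the latest login per country; the join recovers the name
    query = """
    SELECT p.name, p.country
    FROM Person p
    JOIN (
        SELECT country, MAX(login_time) AS last_login
        FROM Person
        GROUP BY country
    ) m
    ON p.country = m.country AND p.login_time = m.last_login
    """
    result = spark.sql(query)
    result.show()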

Q3. How do you import data from an RDBMS via Sqoop without a primary key?

Ans.

Use the --split-by option in Sqoop to import data from an RDBMS table that has no primary key.

  • Use --split-by option to specify a column to split the import into multiple mappers

  • Use --boundary-query option to specify a query to determine the range of values for --split-by column

  • Example: sqoop import --connect jdbc:mysql://localhost/mydb --username root --password password --table mytable --split-by id

  • Example: sqoop import --connect jdbc:mysql://localhost/mydb --username root --password password ...


Q4. What is the difference between coalesce and repartition, and when is each used?

Ans.

Coalesce is used to combine multiple small partitions into a larger one, while Repartition is used to increase or decrease the number of partitions in a DataFrame.

  • Coalesce reduces the number of partitions in a DataFrame by combining small partitions into larger ones.

  • Repartition increases or decreases the number of partitions in a DataFrame by shuffling the data across partitions.

  • Coalesce is more efficient than Repartition as it minimizes data movement.

  • Coalesce is typically used when reducing the partition count, e.g. before writing output, while Repartition is used to increase parallelism or even out skew.

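A short sketch contrasting the two calls on an assumed DataFrame df:

    # repartition triggers a full shuffle and can increase or decrease partitions
    df_wide = df.repartition(200)

    # coalesce merges existing partitions without a full shuffle,
    # so it is the cheaper choice when only reducing the partition count
    df_narrow = df.coalesce(10)
    print(df_narrow.rdd.getNumPartitions())  # 10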

Q5. What happens when we enforce the schema versus when we manually define the schema in code?

Ans.

Enforcing the schema ensures data consistency and validation, while manually defining the schema in code allows for more flexibility and customization.

  • Enforcing the schema ensures that all data conforms to a predefined structure and format, preventing errors and inconsistencies.

  • Manually defining the schema in code allows for more flexibility in handling different data types and structures.

  • Enforcing the schema can be done using self-describing formats like Apache Avro or Apache Parquet, while manually defining it is done in code, e.g. with a StructType in PySpark.

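A sketch of manually defining a schema in PySpark; the file path and field names are illustrative:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Explicit schema: no inference pass over the data, and type mismatches surface early
    schema = StructType([
        StructField("id", IntegerType(), nullable=False),
        StructField("name", StringType(), nullable=True),
    ])

    df = spark.read.schema(schema).csv("/data/people.csv", header=True)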

Q6. What is the difference between RANK, DENSE_RANK, and ROW_NUMBER, and when is each used?

Ans.

RANK and DENSE_RANK give tied rows the same rank (RANK leaves gaps after ties, DENSE_RANK does not), while ROW_NUMBER gives every row a unique sequential number.

  • Rank assigns the same rank to rows with the same value, leaving gaps in the ranking if there are ties.

  • Dense Rank assigns a unique rank to each distinct row, leaving no gaps in the ranking.

  • Row Number assigns a unique number to each row, without any regard for the values in the rows.

  • Rank is used when you want the ranking of each row with gaps for ties, Dense Rank when gaps are unwanted, and Row Number when every row needs a distinct sequence number.

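A sketch showing the three functions side by side on an assumed DataFrame df with dept and salary columns:

    from pyspark.sql import Window, functions as F

    w = Window.partitionBy("dept").orderBy(F.col("salary").desc())

    ranked = df.select(
        "dept", "salary",
        F.rank().over(w).alias("rnk"),            # ties share a rank; gaps follow
        F.dense_rank().over(w).alias("dense"),    # ties share a rank; no gaps
        F.row_number().over(w).alias("row_num"),  # unique sequential numbers
    )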

Q7. How do you add a column to a DataFrame? How do you rename a column in a DataFrame?

Ans.

To add a column in a dataframe, use the 'withColumn' method. To rename a column, use the 'withColumnRenamed' method.

  • To add a column, use the 'withColumn' method with the new column name and the expression to compute the values for that column.

  • Example: df.withColumn('new_column', df['existing_column'] * 2)

  • To rename a column, use the 'withColumnRenamed' method with the current column name and the new column name.

  • Example: df.withColumnRenamed('old_column', 'new_column')


Q8. How many stages will be created from the code written above?

Ans.

The number of stages depends on the shuffle boundaries in the code: Spark starts a new stage at every wide transformation.

  • Narrow transformations such as map and filter are pipelined together into a single stage.

  • Each wide transformation (e.g. groupBy, join, repartition) introduces a shuffle and hence a new stage.

  • Counting the shuffle boundaries in the code gives the number of stages.


Q9. What are the methods for migrating the Hive metastore to Unity Catalog in Databricks?

Ans.

Use Databricks provided tools like databricks-connect and databricks-cli to migrate Hive metadata to Unity catalog.

  • Use databricks-connect to connect to the Databricks workspace from your local development environment.

  • Use databricks-cli to export the Hive metadata from the existing Hive metastore.

  • Create a new Unity catalog in Databricks and import the exported metadata using databricks-cli.

  • Validate the migration by checking the tables and databases in the Unity catalog.


Q10. Difference between an interactive cluster and a job cluster?

Ans.

Interactive clusters allow for real-time interaction and exploration, while job clusters are used for running batch jobs.

  • Interactive clusters are used for real-time data exploration and analysis.

  • Job clusters are used for running batch jobs and processing large amounts of data.

  • Interactive clusters are typically smaller in size and have shorter lifespans.

  • Job clusters are usually larger and more powerful to handle heavy workloads.

  • Examples: Interactive clusters can be used for ad-hoc notebook analysis, while job clusters run scheduled production workloads.


Q11. How do you handle duplicates in Python?

Ans.

Use Python's built-in data structures like sets or dictionaries to handle duplicates.

  • Use a set to remove duplicates from a list: unique_list = list(set(original_list))

  • Use a dictionary to remove duplicates from a list while preserving order: unique_list = list(dict.fromkeys(original_list))


Q12. What are actions and transformations?

Ans.

Actions and transformations are key concepts in data engineering, involving the manipulation and processing of data.

  • Actions are operations that trigger the execution of a data transformation job in a distributed computing environment.

  • Transformations are functions that take an input dataset and produce an output dataset, often involving filtering, aggregating, or joining data.

  • Examples of actions include 'saveAsTextFile' in Apache Spark, which saves the RDD to a text file, and 'collect', which returns all elements to the driver.


Q13. Rate yourself out of 5 in PySpark, Python, and SQL.

Ans.

I would rate myself 4 in Pyspark, 5 in Python, and 4 in SQL.

  • Strong proficiency in Python programming language

  • Experience in working with Pyspark for big data processing

  • Proficient in writing complex SQL queries for data manipulation

  • Familiarity with optimizing queries for performance

  • Hands-on experience in data engineering projects


Q14. What are the technologies you have worked on?

Ans.

I have worked on various technologies including Hadoop, Spark, SQL, Python, and AWS.

  • Experience with Hadoop and Spark for big data processing

  • Proficient in SQL for data querying and manipulation

  • Skilled in Python for data analysis and scripting

  • Familiarity with AWS services such as S3, EC2, and EMR

  • Knowledge of data warehousing and ETL processes


Q15. How do you set up an alerting mechanism in ADF for failed pipelines?

Ans.

Alerting mechanism in ADF for failed pipelines involves setting up alerts in Azure Monitor and configuring email notifications.

  • Set up alerts in Azure Monitor for monitoring pipeline runs

  • Configure alert rules to trigger notifications when a pipeline run fails

  • Use Azure Logic Apps to send email notifications for failed pipeline runs


Q16. What is the difference between Hadoop and Spark? What is the difference between coalesce and repartition? SQL query, HDFS.

Ans.

Hadoop is a distributed storage and processing framework, while Spark is a fast and general-purpose cluster computing system.

  • Hadoop is primarily used for batch processing of large datasets, while Spark is known for its in-memory processing capabilities.

  • Hadoop uses MapReduce for processing data, while Spark uses Resilient Distributed Datasets (RDDs).

  • Coalesce is used to reduce the number of partitions in a DataFrame or RDD without shuffling data, while repartition is used to increase or decrease the partition count with a full shuffle.


Q17. What are the different kinds of triggers available in Data Factory, and what is the use case of each?

Ans.

Different kinds of triggers in Data Factory and their use cases

  • Schedule Trigger: Runs pipelines on a specified schedule, like daily or hourly

  • Tumbling Window Trigger: Triggers pipelines based on a defined window of time

  • Storage Event Trigger: Triggers pipelines on storage events such as a file arriving in or being deleted from Blob Storage or ADLS Gen2

  • Custom Event Trigger: Triggers pipelines from custom events published to Azure Event Grid


Q18. What is Dataproc and why did you choose it in your project?

Ans.

Dataproc is Google Cloud's managed service for running Apache Spark and Hadoop clusters.

  • It automates cluster creation, scaling, and deletion, so the focus stays on jobs rather than infrastructure

  • It integrates with other GCP services such as Cloud Storage and BigQuery

  • It is a common choice for migrating existing Spark or Hadoop workloads to GCP with minimal code changes


Q19. What is a list comprehension?

Ans.

List comprehension is a concise way to create lists in Python by applying an expression to each item in an iterable.

  • Syntax: [expression for item in iterable]

  • Can include conditions: [expression for item in iterable if condition]

  • Example: squares = [x**2 for x in range(10)]


Q20. How do you read a CSV file from an ADLS path?

Ans.

To read a CSV file from an ADLS path, you can use libraries like pandas or pyspark.

  • Use pandas library in Python to read a CSV file from ADLS path

  • Use pyspark library in Python to read a CSV file from ADLS path

  • Ensure you have the necessary permissions to access the ADLS path

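A PySpark sketch, assuming access to the storage account (service principal or account key) is already configured; the abfss path is illustrative:

    path = "abfss://container@account.dfs.core.windows.net/raw/file.csv"

    df = spark.read.csv(path, header=True, inferSchema=True)
    df.show(5)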

Q21. Difference between a list and a tuple?

Ans.

List is mutable, Tuple is immutable in Python.

  • List can be modified after creation, Tuple cannot be modified.

  • List is defined using square brackets [], Tuple is defined using parentheses ().

  • Example: list_example = [1, 2, 3], tuple_example = (4, 5, 6)


Q22. What is Slowly Changing Dimension Type 2?

Ans.

Slowly Changing Dimension 2 (SCD2) is a data warehousing concept where historical data is preserved by creating new records for changes.

  • SCD2 is used to track historical changes in data over time.

  • It involves creating new records for changes while preserving old records.

  • Commonly used in data warehousing to maintain historical data for analysis.

  • Example: If a customer changes their address, a new record with the updated address is created while the old record is retained for historical reference.


Q23. 1) Reverse a string in Python and Java 2) PySpark architecture 3) Number series code 4) Fibonacci series 5) SQL window function query

Ans.

Questions related to string manipulation, data processing, and SQL queries for a Data Engineer role.

  • To reverse a string in Python, you can use slicing with a step of -1. Example: 'hello'[::-1] will return 'olleh'.

  • To reverse a string in Java, you can convert the string to a character array and then swap characters from start and end indexes. Example: 'hello' -> 'olleh'.

  • PySpark architecture includes Driver, Executor, and Cluster Manager components for distributed data processing.

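Python sketches for the string-reversal and Fibonacci items:

    def reverse_string(s: str) -> str:
        # Slicing with step -1 walks the string backwards
        return s[::-1]

    def fibonacci(n: int) -> list:
        # Iteratively build the first n Fibonacci numbers
        series, a, b = [], 0, 1
        for _ in range(n):
            series.append(a)
            a, b = b, a + b
        return series

    print(reverse_string("hello"))  # olleh
    print(fibonacci(7))             # [0, 1, 1, 2, 3, 5, 8]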

Q24. Narrow vs wide transformations?

Ans.

In a narrow transformation each output partition depends on a single input partition; in a wide transformation output partitions depend on many input partitions, which forces a shuffle.

  • Narrow transformations need no shuffle, so Spark pipelines them together within a single stage.

  • Wide transformations shuffle data across the cluster, which is expensive and marks a stage boundary.

  • Examples of narrow transformations include map and filter operations, while examples of wide transformations include groupBy and join operations.

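A small sketch: mapValues is narrow (stays within partitions), while reduceByKey is wide (shuffles matching keys together):

    rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])

    doubled = rdd.mapValues(lambda v: v * 2)      # narrow: no shuffle
    totals = rdd.reduceByKey(lambda x, y: x + y)  # wide: shuffle across partitions

    print(totals.collect())  # e.g. [('a', 4), ('b', 2)]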

Q25. Write a SQL query to get the second highest salary from a table.

Ans.

SQL query to retrieve second highest salary from a table

  • Use the ORDER BY clause to sort salaries in descending order

  • Use LIMIT 1 OFFSET 1 to retrieve the second row, or a MAX subquery that excludes the top salary

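Two common forms of the query, shown as strings for spark.sql; the employees table and salary column are assumed, as is an engine that supports LIMIT ... OFFSET:

    q_offset = "SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 1"

    q_subquery = """
    SELECT MAX(salary) AS second_highest
    FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
    """

    spark.sql(q_subquery).show()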

Q26. Explain the different kinds of joins and give a use case for a self join.

Ans.

Different types of joins include inner join, outer join, left join, and right join. Self join is used to join a table with itself.

  • Inner join: Returns rows when there is a match in both tables

  • Full outer join: Returns all rows from both tables, with NULLs where there is no match

  • Left join: Returns all rows from the left table and the matched rows from the right table

  • Right join: Returns all rows from the right table and the matched rows from the left table

  • Self join: Used to join a table with itself, e.g. matching employees to the managers stored in the same table


Q27. Write end-to-end Databricks code to read multiple files from ADLS and write them into a single file.

Ans.

Use Databricks code to read multiple files from ADLS and write into a single file

  • Use Databricks File System (DBFS) to access files in ADLS

  • Read multiple files using Spark's read method

  • Combine the dataframes using union or merge

  • Write the combined dataframe to a single file using Spark's write method

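A minimal end-to-end sketch; the container, account, and folder names are illustrative, and storage access is assumed to be configured:

    src = "abfss://container@account.dfs.core.windows.net/input/*.csv"
    dst = "abfss://container@account.dfs.core.windows.net/output/"

    # The wildcard read pulls every matching file into one DataFrame
    df = spark.read.csv(src, header=True, inferSchema=True)

    # coalesce(1) forces a single partition, hence a single output part file
    df.coalesce(1).write.mode("overwrite").csv(dst, header=True)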

Q28. Optimisation techniques to improve the performance of Databricks

Ans.

Optimisation techniques for improving Databricks performance

  • Utilize cluster sizing and autoscaling to match workload demands

  • Optimize data storage formats like Parquet for efficient querying

  • Use partitioning and indexing to speed up data retrieval

  • Leverage caching for frequently accessed data

  • Monitor and tune query performance using Databricks SQL Analytics

  • Consider using Delta Lake for ACID transactions and improved performance


Q29. Difference between select and withColumn in PySpark

Ans.

select is used to select specific columns from a DataFrame, while withColumn is used to add or update columns in a DataFrame.

  • select is used to select specific columns from a DataFrame

  • withColumn is used to add or update columns in a DataFrame

  • Both return a new DataFrame; select projects only the named columns, while withColumn keeps all existing columns plus the added or updated one

  • Example: df.select('col1', 'col2') - selects columns col1 and col2 from DataFrame df

  • Example: df.withColumn('new_col', df['col1'] * 2) - adds new_col computed from col1


Q30. What is the Delta table concept?

Ans.

Delta Table is a type of table in Delta Lake that supports ACID transactions and time travel capabilities.

  • Delta Table is a type of table in Delta Lake that supports ACID transactions.

  • It allows users to read and write data in an Apache Spark environment.

  • Delta Table provides time travel capabilities, enabling users to access previous versions of data.

  • It helps in ensuring data consistency and reliability in data pipelines.


Q31. How is data processed using PySpark?

Ans.

Data is processed using PySpark by creating Resilient Distributed Datasets (RDDs) and applying transformations and actions.

  • Data is loaded into RDDs from various sources such as HDFS, S3, or databases.

  • Transformations like map, filter, reduceByKey, etc., are applied to process the data.

  • Actions like collect, count, saveAsTextFile, etc., are used to trigger the actual computation.

  • PySpark provides a distributed computing framework for processing large datasets efficiently.

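A classic word-count sketch showing transformations building up lazily and an action triggering execution; the input path is illustrative:

    rdd = spark.sparkContext.textFile("/data/words.txt")

    counts = (rdd.flatMap(lambda line: line.split())  # transformation
                 .map(lambda w: (w, 1))               # transformation
                 .reduceByKey(lambda a, b: a + b))    # transformation

    print(counts.take(5))  # action: triggers the actual computation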

Q32. What does select count(0) mean?

Ans.

Select count(0) returns the count of rows in a table, regardless of the values in the specified column.

  • Select count(0) counts all rows in a table, ignoring the values in the specified column.

  • It is equivalent to select count(*) or select count(1).

  • Example: SELECT COUNT(0) FROM table_name;


Q33. Difference between a primary key and a foreign key

Ans.

Primary key uniquely identifies a record in a table, while foreign key establishes a link between two tables.

  • Primary key ensures each record is unique in a table

  • Foreign key establishes a relationship between tables

  • Primary key is used to enforce entity integrity

  • Foreign key is used to enforce referential integrity


Q34. Performance tuning in Spark

Ans.

Performance tuning in Spark involves optimizing resource allocation and minimizing data shuffling.

  • Use appropriate cluster configuration and resource allocation

  • Minimize data shuffling by using appropriate partitioning and caching

  • Use efficient transformations and actions

  • Avoid unnecessary operations and transformations

  • Use broadcast variables for small data sets

  • Use appropriate serialization formats

  • Monitor and optimize garbage collection

  • Use appropriate hardware and network configuration


Q35. Difference between variables and parameters in ADF

Ans.

Variables are used to store values that can be changed, while parameters are used to pass values into activities in ADF.

  • Variables can be modified within a pipeline, while parameters are set at runtime and cannot be changed within the pipeline.

  • Variables are defined within a pipeline, while parameters are defined at the pipeline level.

  • Variables can be used to store intermediate values or results, while parameters are used to pass values between activities.

  • Example: A variable can store an intermediate result produced by one activity for use in a later activity


Q36. Azure tech stack used in the current project

Ans.

Azure tech stack used in the current project includes Azure Data Factory, Azure Databricks, and Azure SQL Database.

  • Azure Data Factory for data integration and orchestration

  • Azure Databricks for big data processing and analytics

  • Azure SQL Database for storing and querying structured data


Q37. Activities in ADF and their uses

Ans.

Activities in ADF and their uses

  • Data movement activities like Copy Data and Data Flow

  • Data transformation activities like Mapping Data Flow and Wrangling Data Flow

  • Data orchestration activities like Execute Pipeline and Wait

  • Control activities like If Condition and For Each

  • Integration Runtimes for executing activities in ADF


Q38. Fact vs dimension tables

Ans.

Fact tables contain numerical measures, while dimension tables contain descriptive attributes.

  • Fact tables store quantitative data like sales revenue or quantity sold

  • Dimension tables store descriptive attributes like product name or customer details

  • Fact tables are typically used for analysis and reporting, while dimension tables provide context for the facts


Q39. What are slots in BigQuery?

Ans.

Slots in BigQuery are units of computational capacity that BigQuery allocates to execute queries.

  • Slots help in managing query resources and controlling costs

  • Users can purchase additional slots to increase query capacity

  • Slots are used to allocate processing power for queries based on the amount purchased


Q40. What is Unity Catalog?

Ans.

Unity Catalog is Databricks' unified governance solution for data and AI assets.

  • It provides a centralized metastore that can be shared across Databricks workspaces.

  • It offers fine-grained access control on catalogs, schemas, tables, and views.

  • Objects are addressed with a three-level namespace: catalog.schema.table.

  • It also captures data lineage and audit logs for governance.


Q41. What is clustering and partitioning?

Ans.

Clustering is grouping similar data points together, while partitioning is dividing data into subsets based on certain criteria.

  • Clustering is a technique used in unsupervised machine learning to group similar data points together.

  • Partitioning involves dividing a dataset into subsets based on specific criteria, such as range of values or categories.

  • Examples of clustering algorithms include K-means and hierarchical clustering.

  • Examples of partitioning methods include range partitioning, hash partitioning, and list partitioning.


Q42. What is Spark architecture?

Ans.

Spark architecture refers to the structure of Apache Spark, including components like driver, executor, and cluster manager.

  • Spark architecture consists of a driver program that manages the execution of tasks.

  • Executors are worker nodes that run tasks and store data in memory or disk.

  • Cluster manager allocates resources and coordinates tasks across the cluster.

  • Spark applications run on a cluster of machines managed by a cluster manager like YARN or Mesos.

  • Data is processed in parallel across the executors in the cluster.


Q43. What are external views?

Ans.

External views are virtual tables that provide a way to present data from one or more tables in a database.

  • External views do not store data themselves, but instead provide a way to access data from underlying tables.

  • They can be used to simplify complex queries by presenting data in a more user-friendly format.

  • External views can also be used to restrict access to certain columns or rows of data for security purposes.


Q44. What are BigQuery slots?

Ans.

BigQuery slots are units of computational capacity used to process queries in Google BigQuery.

  • BigQuery slots are used to allocate resources for query processing in Google BigQuery.

  • Each query consumes a certain number of slots based on the complexity and size of the data being processed.

  • Users can purchase additional slots to increase query processing capacity.

  • Slots are used to parallelize query execution and improve performance.

  • Example: Running a complex query on a large dataset consumes more slots than a simple query on a small one.


Q45. Difference between Dataproc and Dataflow

Ans.

Dataproc is Google Cloud's managed Spark/Hadoop cluster service, while Dataflow is its serverless service for Apache Beam pipelines.

  • Dataproc runs existing Spark and Hadoop workloads on clusters that you size and manage.

  • Dataflow runs Apache Beam pipelines with fully automatic provisioning and scaling.

  • Dataproc suits lift-and-shift migrations of Spark or Hadoop jobs.

  • Dataflow suits new unified batch and streaming pipelines where no cluster management is wanted.


Q46. Wide transformations in PySpark?

Ans.

Wide Transformation in pyspark involves shuffling data across partitions, typically used for operations like groupBy.

  • Wide transformations involve shuffling data across partitions

  • They are typically used for operations like groupBy, join, and sortByKey

  • They require data movement and can be more expensive in terms of performance compared to narrow transformations


Q47. What is an Accumulator?

Ans.

An accumulator is a variable used in distributed computing to aggregate values across multiple tasks or nodes.

  • Accumulators are used in Spark to perform calculations in a distributed manner.

  • Tasks can only add to an accumulator through an associative and commutative operation; only the driver can read its value.

  • Accumulators are used for tasks like counting elements or summing values in parallel processing.

  • Example: counting the number of errors encountered during data processing.

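A sketch of the error-counting example; the parse function and sample data are hypothetical:

    sc = spark.sparkContext
    error_count = sc.accumulator(0)

    def parse(line):
        try:
            return int(line)
        except ValueError:
            error_count.add(1)  # tasks may only add; only the driver reads .value
            return 0

    sc.parallelize(["1", "2", "oops", "4"]).map(parse).collect()
    print(error_count.value)  # 1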

Q48. What is PySpark architecture?

Ans.

PySpark architecture is a distributed computing framework that combines Python and Spark to process big data.

  • PySpark architecture includes a driver program, cluster manager, and worker nodes.

  • The driver program is responsible for converting the user code into tasks and scheduling them on the worker nodes.

  • Cluster manager allocates resources and monitors the worker nodes.

  • Worker nodes execute the tasks and return the results to the driver program.

  • PySpark uses RDDs (Resilient Distributed Datasets) as its core data abstraction.


Q49. Triggers and their types in ADF

Ans.

Triggers in Azure Data Factory (ADF) are events that cause a pipeline to execute.

  • Types of triggers in ADF include schedule, tumbling window, event-based, and manual.

  • Schedule triggers run pipelines on a specified schedule, like daily or hourly.

  • Tumbling window triggers run pipelines at specified time intervals.

  • Event-based triggers execute pipelines based on events like file arrival or HTTP request.

  • Manual triggers require manual intervention to start a pipeline.


Q50. UNION vs UNION ALL

Ans.

Union combines and removes duplicates, while union all combines all rows including duplicates.

  • Union removes duplicates from the result set

  • Union all includes all rows, even duplicates

  • Use union when you want to remove duplicates, use union all when duplicates are needed


Q51. RANK vs DENSE_RANK

Ans.

Both give tied values the same rank; RANK leaves gaps after ties, while DENSE_RANK assigns consecutive ranks with no gaps.

  • Rank skips ranks after ties, while dense rank does not

  • Rank may have gaps in the ranking sequence, while dense rank does not

  • Rank is useful when you want to know the exact position of a value in a sorted list, while dense rank is useful when you want to know the relative position of a value compared to others


Q52. Lambda in Python

Ans.

Lambda functions in Python are anonymous functions that can have any number of arguments but only one expression.

  • Lambda functions are defined using the lambda keyword.

  • They are commonly used for small, one-time tasks.

  • Lambda functions can be used as arguments to higher-order functions like map, filter, and reduce.

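Short sketches of lambdas used with higher-order functions:

    nums = [1, 2, 3, 4]

    print(list(map(lambda x: x * 2, nums)))                  # [2, 4, 6, 8]
    print(list(filter(lambda x: x % 2 == 0, nums)))          # [2, 4]
    print(sorted(["bb", "a", "ccc"], key=lambda s: len(s)))  # ['a', 'bb', 'ccc']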

Q53. What is sharding?

Ans.

Sharding is a database partitioning technique where large databases are divided into smaller, more manageable parts called shards.

  • Sharding helps distribute data across multiple servers to improve performance and scalability.

  • Each shard contains a subset of the data, allowing for parallel processing and faster query execution.

  • Common sharding strategies include range-based sharding, hash-based sharding, and list-based sharding.

  • Examples of sharded databases include MongoDB and Cassandra.


Q54. How do you create mount points?

Ans.

Mount points are directories in a Unix-like operating system where additional file systems can be attached.

  • Use the 'mount' command to attach a file system to a directory

  • Specify the device or file system to be mounted and the directory where it should be attached

  • Use the 'umount' command to detach a file system from a directory


Q55. What is Auto Loader?

Ans.

In Databricks, Auto Loader incrementally and efficiently ingests new data files as they arrive in cloud storage.

  • It detects and loads only new files, without reprocessing files already ingested.

  • It is exposed in Structured Streaming through the cloudFiles source format.

  • It can infer the schema of incoming data and evolve it as the data changes.


Q56. Python - remove duplicates using a set

Ans.

Use set() function to remove duplicates from a list in Python.

  • Convert the list to a set using set() function

  • Convert the set back to a list to remove duplicates

  • Example: list_with_duplicates = ['a', 'b', 'a', 'c']; list_without_duplicates = list(set(list_with_duplicates))


Q57. Difference between persist and cache

Ans.

cache() persists data with the default storage level, while persist() lets you choose the storage level explicitly.

  • cache() on an RDD uses MEMORY_ONLY; on a DataFrame it defaults to memory and disk.

  • persist() accepts a StorageLevel such as MEMORY_AND_DISK or DISK_ONLY.

  • Use persist() when the default level does not fit, e.g. to spill to disk or store data serialized.

  • Example: rdd.persist(StorageLevel.MEMORY_AND_DISK) keeps what fits in memory and spills the rest to disk.

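A sketch of the difference on an RDD; StorageLevel comes from the pyspark package:

    from pyspark import StorageLevel

    rdd = spark.sparkContext.parallelize(range(1000))

    rdd.cache()   # default level: MEMORY_ONLY for RDDs
    rdd.count()   # an action materializes the cached data

    rdd.unpersist()
    rdd.persist(StorageLevel.MEMORY_AND_DISK)  # explicit level: spill to disk if needed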

Q58. Remove duplicate characters from string

Ans.

Remove duplicate characters from a string

  • Iterate through the string and keep track of characters seen

  • Use a set to store unique characters and remove duplicates

  • Reconstruct the string without duplicates

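A sketch following the bullets above:

    def dedupe_chars(s: str) -> str:
        seen = set()
        out = []
        for ch in s:
            if ch not in seen:  # keep only the first occurrence of each character
                seen.add(ch)
                out.append(ch)
        return "".join(out)

    print(dedupe_chars("programming"))  # progamin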

Q59. Tell me about data pipelines

Ans.

Data pipeline is a series of processes that collect, transform, and move data from one system to another.

  • Data pipeline involves extracting data from various sources

  • Data is then transformed and cleaned to ensure quality and consistency

  • Finally, the data is loaded into a destination for storage or analysis

  • Examples of data pipeline tools include Apache NiFi, Apache Airflow, and AWS Glue


Q60. Describe Spark architecture

Ans.

Spark architecture is a distributed computing framework that provides high-level APIs for various languages.

  • Spark architecture consists of a cluster manager, worker nodes, and a driver program.

  • It uses Resilient Distributed Datasets (RDDs) for fault-tolerant distributed data processing.

  • Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object.

  • It supports various data sources like HDFS, Cassandra, HBase, etc.



Q61. Types of clusters in Databricks

Ans.

Databricks supports two types of clusters: Standard and High Concurrency.

  • Databricks supports Standard clusters for single user workloads

  • Databricks supports High Concurrency clusters for multi-user workloads

  • Standard clusters are suitable for ad-hoc analysis and ETL jobs

  • High Concurrency clusters are suitable for shared notebooks and interactive dashboards


Q62. Integration Runtime in ADF

Ans.

Integration Runtime (IR) is the compute infrastructure that Azure Data Factory uses to run its activities.

  • Azure IR runs data movement and transformation activities in the cloud.

  • Self-hosted IR connects to data sources on-premises or inside private networks.

  • Azure-SSIS IR runs existing SSIS packages within ADF.

  • The choice of IR determines where an activity executes and which networks it can reach.


Q63. Brief about Hadoop and Kafka

Ans.

Hadoop is a distributed storage and processing system for big data, while Kafka is a distributed streaming platform.

  • Hadoop is used for storing and processing large volumes of data across clusters of computers.

  • Kafka is used for building real-time data pipelines and streaming applications.

  • Hadoop uses HDFS (Hadoop Distributed File System) for storage, while Kafka uses topics to publish and subscribe to streams of data.

  • Hadoop MapReduce is a processing framework within Hadoop, while Kafka relies on engines such as Kafka Streams, Spark, or Flink for processing.


Q64. Streaming tools for big data

Ans.

Streaming tools for big data are essential for real-time processing and analysis of large datasets.

  • Apache Kafka is a popular streaming tool for handling real-time data streams.

  • Apache Spark Streaming is another tool that enables real-time processing of big data.

  • Amazon Kinesis is a managed service for real-time data streaming on AWS.


Q65. Working of the Spark framework

Ans.

Spark framework is a distributed computing system that provides in-memory processing capabilities for big data analytics.

  • Spark can use the Hadoop Distributed File System (HDFS) for storage and Apache Mesos, Hadoop YARN, or Kubernetes for resource management.

  • It supports multiple programming languages such as Scala, Java, Python, and R.

  • Spark provides high-level APIs like Spark SQL for structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.


Q66. Serverless computing in Databricks

Ans.

Serverless computing in Databricks allows users to run code without managing servers, scaling automatically based on workload.

  • Serverless computing in Databricks enables users to focus on writing code without worrying about server management.

  • It automatically scales resources based on workload, reducing costs and improving efficiency.

  • Users can run code in Databricks without provisioning or managing servers, making it easier to deploy and scale applications.

  • Examples of serverless offerings in Databricks include serverless SQL warehouses.


Q67. Cluster types in Databricks

Ans.

There are two types of clusters in Databricks: Standard and High Concurrency.

  • Standard clusters are used for single user workloads and are terminated when not in use.

  • High Concurrency clusters are used for multiple users and remain active even when not in use.

  • Both types of clusters can be configured with different sizes and auto-scaling options.


Q68. Difference between RDD and DataFrame

Ans.

RDD stands for Resilient Distributed Dataset and is the fundamental data structure of Spark. DF stands for DataFrame and is a distributed collection of data organized into named columns.

  • RDD is a low-level abstraction representing a collection of elements distributed across many nodes in a cluster, while DF is a higher-level abstraction built on top of RDDs that provides a more structured and optimized way to work with data.

  • RDDs suit unstructured data and require manual optimization, while DataFrames are optimized automatically by the Catalyst optimizer.


Q69. Explain Databricks

Ans.

Databricks is a unified analytics platform that combines data engineering, data science, and business analytics.

  • Databricks provides a collaborative workspace for data engineers, data scientists, and business analysts to work together on big data projects.

  • It integrates with popular tools like Apache Spark for data processing and machine learning.

  • Databricks offers automated cluster management and scaling to handle large datasets efficiently.

  • It allows for easy visualization of data through notebooks and built-in dashboards.


Q70. Explain your project

Ans.

I led a project to develop a real-time data processing system for a retail company.

  • Designed data pipelines to ingest, process, and analyze large volumes of data

  • Implemented ETL processes using tools like Apache Spark and Kafka

  • Built data models and dashboards for business insights

  • Collaborated with cross-functional teams to gather requirements and deliver solutions


Q71. SCD in Informatica

Ans.

Slowly Changing Dimension (SCD) in Informatica is used to track historical data changes in a data warehouse.

  • SCD Type 1: Overwrite old data with new data

  • SCD Type 2: Add new row for each change with effective start and end dates

  • SCD Type 3: Add columns to track changes without adding new rows


Q72. Containers in SSIS

Ans.

Containers in SSIS are used to group and organize tasks and workflows.

  • Containers provide a way to group related tasks together.

  • They help in organizing and managing complex workflows.

  • There are different types of containers in SSIS, such as Sequence Container, For Loop Container, and Foreach Loop Container.

  • Containers can be nested within each other to create hierarchical structures.

  • They allow for better control flow and error handling in SSIS packages.



Interview Process at D P Kapoor & Co

based on 85 interviews
3 Interview rounds
Technical Round - 1
Technical Round - 2
HR Round