Accenture
70+ D P Kapoor & Co Interview Questions and Answers
Q1. What optimisations are possible to reduce the overhead of reading large datasets in Spark?
Optimizations like partitioning, caching, and using efficient file formats can reduce overhead in reading large datasets in Spark.
Partitioning data based on key can reduce the amount of data shuffled during joins and aggregations
Caching frequently accessed datasets in memory can avoid recomputation
Using efficient file formats like Parquet or ORC can reduce disk I/O and improve read performance
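A minimal PySpark sketch combining these ideas; the dataset path, partition column, and column names below are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-optimisation").getOrCreate()

# Columnar format + partition pruning: only the needed partition and columns are read
events = (spark.read
          .parquet("/data/events")                      # hypothetical Parquet dataset partitioned by event_date
          .where("event_date = '2024-01-01'")           # partition filter is pushed down
          .select("user_id", "event_type"))             # column pruning

events.cache()                                          # keep the frequently used subset in memory
events.count()                                          # materialise the cache
events.groupBy("event_type").count().show()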
Q2. Write a SQL query to find the name of the person who logged in last within each country from the Person table.
Use a subquery (or a window function) to identify the most recent login per country and return the matching person's name.
Use a subquery to find the max login time for each country
Join the Person table with the subquery on country and login time to get the name of the person
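A hedged example of the subquery approach described above, assuming the Person table has name, country, and login_time columns (the column names are assumptions):

last_login = spark.sql("""
    SELECT p.name, p.country, p.login_time
    FROM Person p
    JOIN (
        SELECT country, MAX(login_time) AS max_login
        FROM Person
        GROUP BY country
    ) m
      ON p.country = m.country AND p.login_time = m.max_login
""")
last_login.show()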
Q3. How to import data from an RDBMS via Sqoop without a primary key?
Use the --split-by option in Sqoop to import data from an RDBMS table that has no primary key.
Use --split-by option to specify a column to split the import into multiple mappers
Use --boundary-query option to specify a query to determine the range of values for --split-by column
Example: sqoop import --connect jdbc:mysql://localhost/mydb --username root --password password --table mytable --split-by id
Q4. Difference between Coalesce and Repartition and In which case we are using it ?
Coalesce merges existing partitions into fewer partitions without a full shuffle, while Repartition redistributes the data into any number of partitions with a full shuffle.
Coalesce reduces the number of partitions in a DataFrame by combining small partitions into larger ones.
Repartition increases or decreases the number of partitions in a DataFrame by shuffling the data across partitions.
Coalesce is more efficient than Repartition as it minimizes data movement.
Coalesce is typically used to reduce the number of partitions after heavy filtering since it avoids a full shuffle; Repartition is used when you need more partitions or an even distribution by key.
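A short PySpark illustration of the difference:

df = spark.range(1_000_000)           # example DataFrame with the default number of partitions
print(df.rdd.getNumPartitions())

df_small = df.coalesce(4)             # merge down to 4 partitions, no full shuffle
df_wide  = df.repartition(200)        # full shuffle into 200 partitions
df_keyed = df.repartition(200, "id")  # shuffle by key, useful before joins and aggregations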
Q5. What happens when we enforce the schema and when we manually define the schema in the code ?
Enforcing the schema ensures data consistency and validation, while manually defining the schema in code allows for more flexibility and customization.
Enforcing the schema ensures that all data conforms to a predefined structure and format, preventing errors and inconsistencies.
Manually defining the schema in code allows for more flexibility in handling different data types and structures.
Enforcing the schema can be done with self-describing formats like Apache Avro or Apache Parquet, while manually defining the schema (e.g. a StructType in PySpark) gives explicit control over column names and types.
Q6. Difference between Rank , Dense Rank and Row Number and when we are using each of them ?
Rank and Dense Rank give tied rows the same rank, while Row Number gives every row a unique sequential number.
Rank assigns the same rank to rows with the same value and skips the following rank(s), leaving gaps after ties.
Dense Rank also assigns the same rank to ties but uses consecutive ranks, leaving no gaps.
Row Number assigns a unique number to each row, regardless of ties in the ordering values.
Use Rank or Dense Rank for leaderboard-style rankings (with or without gaps), and Row Number when every row needs a distinct ordinal, e.g. for deduplication or pagination.
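A hedged Spark SQL example over a hypothetical salaries table (employee, dept, salary):

ranked = spark.sql("""
    SELECT employee, dept, salary,
           RANK()       OVER (PARTITION BY dept ORDER BY salary DESC) AS rnk,
           DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS dense_rnk,
           ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS row_num
    FROM salaries
""")
ranked.show()
# For two employees tied on the top salary: rnk = 1,1,3  dense_rnk = 1,1,2  row_num = 1,2,3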
Q7. How to add a column in dataframe ? How to rename the column in dataframe ?
To add a column in a dataframe, use the 'withColumn' method. To rename a column, use the 'withColumnRenamed' method.
To add a column, use the 'withColumn' method with the new column name and the expression to compute the values for that column.
Example: df.withColumn('new_column', df['existing_column'] * 2)
To rename a column, use the 'withColumnRenamed' method with the current column name and the new column name.
Example: df.withColumnRenamed('old_column', 'new_column')
Q8. How many stages will be created from the code I have written?
The number of stages depends on the number of shuffle boundaries in the code.
Spark groups narrow transformations (map, filter) into a single stage and starts a new stage at every wide transformation (groupBy, join, repartition).
So the stage count is roughly the number of shuffles plus one for each job triggered by an action.
Analyse the code (or the DAG in the Spark UI) to identify the shuffle boundaries and count the stages.
Q9. Methods of migrating the Hive metastore to Unity Catalog in Databricks?
Use Databricks provided tools like databricks-connect and databricks-cli to migrate Hive metadata to Unity catalog.
Use databricks-connect to connect to the Databricks workspace from your local development environment.
Use databricks-cli to export the Hive metadata from the existing Hive metastore.
Create a new Unity catalog in Databricks and import the exported metadata using databricks-cli.
Validate the migration by checking the tables and databases in the Unity catalog.
Q10. Difference between the interactive cluster and job cluster ?
Interactive clusters allow for real-time interaction and exploration, while job clusters are used for running batch jobs.
Interactive clusters are used for real-time data exploration and analysis.
Job clusters are used for running batch jobs and processing large amounts of data.
Interactive (all-purpose) clusters are created manually and shared across notebooks and users.
Job clusters are created automatically for a scheduled job run and terminated when the job finishes, which makes them cheaper for production workloads.
Examples: an interactive cluster for ad-hoc analysis in notebooks; a job cluster for a nightly ETL pipeline.
Q11. How to handle duplicates in python ?
Use Python's built-in data structures like sets or dictionaries to handle duplicates.
Use a set to remove duplicates from a list: unique_list = list(set(original_list))
Use a dictionary to remove duplicates from a list while preserving order: unique_list = list(dict.fromkeys(original_list))
Q12. What are action and transformation ?
In Spark, transformations build up a lazy execution plan and actions trigger the actual computation.
Transformations such as map, filter, and join take a dataset and define a new dataset; nothing is executed until an action is called.
Actions such as count, collect, and save trigger execution of the accumulated transformations and return or persist a result.
Examples of actions include 'saveAsTextFile', which writes an RDD to text files, and 'collect', which returns results to the driver; 'map' and 'filter' are typical transformations.
Q13. Rate yourself out of 5 in Pyspark , Python and SQL
I would rate myself 4 in Pyspark, 5 in Python, and 4 in SQL.
Strong proficiency in Python programming language
Experience in working with Pyspark for big data processing
Proficient in writing complex SQL queries for data manipulation
Familiarity with optimizing queries for performance
Hands-on experience in data engineering projects
Q14. What are the technologies you have worked on?
I have worked on various technologies including Hadoop, Spark, SQL, Python, and AWS.
Experience with Hadoop and Spark for big data processing
Proficient in SQL for data querying and manipulation
Skilled in Python for data analysis and scripting
Familiarity with AWS services such as S3, EC2, and EMR
Knowledge of data warehousing and ETL processes
Q15. How do you set up an alerting mechanism in ADF for failed pipelines?
Alerting mechanism in ADF for failed pipelines involves setting up alerts in Azure Monitor and configuring email notifications.
Set up alerts in Azure Monitor for monitoring pipeline runs
Configure alert rules to trigger notifications when a pipeline run fails
Use Azure Logic Apps to send email notifications for failed pipeline runs
Q16. What is difference between hadoop and spark? Difference between coalesce and repartition? Sql query HDFS
Hadoop is a distributed storage and processing framework, while Spark is a fast and general-purpose cluster computing system.
Hadoop is primarily used for batch processing of large datasets, while Spark is known for its in-memory processing capabilities.
Hadoop uses MapReduce for processing data, while Spark uses Resilient Distributed Datasets (RDDs).
Coalesce is used to reduce the number of partitions in a DataFrame or RDD without a full shuffle, while repartition is used to increase or decrease the number of partitions and always shuffles the data.
Q17. what are different kind of triggers available in data factory and tell use case of each trigger
Different kinds of triggers in Data Factory and their use cases
Schedule Trigger: Runs pipelines on a specified schedule, like daily or hourly
Tumbling Window Trigger: Triggers pipelines based on a defined window of time
Storage Event Trigger: Fires when blobs are created or deleted in Azure Storage (including ADLS Gen2), e.g. on file arrival
Custom Event Trigger: Fires on custom events published to Azure Event Grid
Q18. What is Dataproc and why did you choose it in your project?
Dataproc is Google Cloud's managed Spark and Hadoop service for running batch processing, querying, and streaming jobs on clusters.
Clusters spin up quickly, can autoscale, and are billed per second, which suits on-demand ETL workloads
It integrates with Cloud Storage, BigQuery, and other GCP services
Example: an existing Spark job can be lifted to Dataproc with little or no code change
Q19. What is List Comprehension ?
List comprehension is a concise way to create lists in Python by applying an expression to each item in an iterable.
Syntax: [expression for item in iterable]
Can include conditions: [expression for item in iterable if condition]
Example: squares = [x**2 for x in range(10)]
Q20. Read a CSV file from ADLS path ?
To read a CSV file from an ADLS path, you can use libraries like pandas or pyspark.
Use pandas library in Python to read a CSV file from ADLS path
Use pyspark library in Python to read a CSV file from ADLS path
Ensure you have the necessary permissions to access the ADLS path
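A minimal PySpark sketch; the storage account, container, and file path are placeholders, and authentication (service principal, SAS token, or a mount point) is assumed to be configured on the cluster:

path = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/raw/sales.csv"  # placeholder ADLS Gen2 path

df = (spark.read
      .option("header", "true")       # first row contains column names
      .option("inferSchema", "true")  # let Spark infer column types
      .csv(path))
df.show(5)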
Q21. Difference between List and Tuple ?
List is mutable, Tuple is immutable in Python.
List can be modified after creation, Tuple cannot be modified.
List is defined using square brackets [], Tuple is defined using parentheses ().
Example: list_example = [1, 2, 3], tuple_example = (4, 5, 6)
Q22. What is Slowly Changing Dimension 2
Slowly Changing Dimension 2 (SCD2) is a data warehousing concept where historical data is preserved by creating new records for changes.
SCD2 is used to track historical changes in data over time.
It involves creating new records for changes while preserving old records.
Commonly used in data warehousing to maintain historical data for analysis.
Example: If a customer changes their address, a new record with the updated address is created while the old record is retained for historical reporting.
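A simplified two-step SCD2 sketch on Delta tables, assuming hypothetical dim_customer and stg_customer tables whose only columns are customer_id, address, start_date, end_date, and is_current (in that order):

# Step 1: expire the current version of customers whose address changed
spark.sql("""
    MERGE INTO dim_customer d
    USING stg_customer s
      ON d.customer_id = s.customer_id AND d.is_current = true
    WHEN MATCHED AND d.address <> s.address THEN
      UPDATE SET d.is_current = false, d.end_date = current_date()
""")

# Step 2: insert a new current version for changed and brand-new customers
spark.sql("""
    INSERT INTO dim_customer
    SELECT s.customer_id, s.address, current_date() AS start_date,
           CAST(NULL AS DATE) AS end_date, true AS is_current
    FROM stg_customer s
    LEFT JOIN dim_customer d
      ON d.customer_id = s.customer_id AND d.is_current = true
    WHERE d.customer_id IS NULL
""")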
Q23. 1) Reverse a string in Python and Java 2) PySpark architecture 3) Number series code 4) Fibonacci series 5) SQL window function query
Questions related to string manipulation, data processing, and SQL queries for a Data Engineer role.
To reverse a string in Python, you can use slicing with a step of -1. Example: 'hello'[::-1] will return 'olleh'.
To reverse a string in Java, you can convert the string to a character array and then swap characters from start and end indexes. Example: 'hello' -> 'olleh'.
PySpark architecture includes Driver, Executor, and Cluster Manager components for distributed data processing.
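Minimal Python sketches for the string-reversal and Fibonacci parts (the Java version follows the same idea with a character array):

def reverse_string(s: str) -> str:
    return s[::-1]                      # slicing with step -1 walks the string backwards

def fibonacci(n: int):
    seq = [0, 1]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2])   # each term is the sum of the previous two
    return seq[:n]

print(reverse_string("hello"))          # olleh
print(fibonacci(8))                     # [0, 1, 1, 2, 3, 5, 8, 13]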
Q24. Narrow vs Wide Transformation ?
In a narrow transformation each output partition depends on a single input partition, while in a wide transformation output partitions depend on data from many input partitions.
Narrow transformations need no data movement, so Spark can pipeline them within a single stage.
Wide transformations require a shuffle across the cluster, which starts a new stage and is more expensive.
Examples of narrow transformations include map and filter operations, while examples of wide transformations include groupBy and join operations.
Q25. Write a sql query to get second highest salary from table
SQL query to retrieve second highest salary from a table
Sort the distinct salaries in descending order and pick the second one, or take the maximum salary below the overall maximum
A DENSE_RANK() window function also works and extends naturally to the Nth highest salary
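Two common variants, shown here through spark.sql over a hypothetical employees table with a salary column:

# Variant 1: the maximum salary below the overall maximum
spark.sql("""
    SELECT MAX(salary) AS second_highest
    FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
""").show()

# Variant 2: DENSE_RANK handles ties and generalises to the Nth highest
spark.sql("""
    SELECT DISTINCT salary
    FROM (
        SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
        FROM employees
    ) t
    WHERE rnk = 2
""").show()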
Q26. explain different kind of joins and give use case of self join
Different types of joins include inner join, outer join, left join, and right join. Self join is used to join a table with itself.
Inner join: Returns rows when there is a match in both tables
Full outer join: Returns all rows from both tables, with NULLs where there is no match
Left join: Returns all rows from the left table and the matched rows from the right table
Right join: Returns all rows from the right table and the matched rows from the left table
Self join: Used to join a table with itself, e.g. matching each employee to their manager stored in the same employees table (see the sketch below)
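A hedged self-join example, assuming an employees table with emp_id, name, and manager_id columns:

pairs = spark.sql("""
    SELECT e.name AS employee, m.name AS manager
    FROM employees e
    LEFT JOIN employees m            -- the same table joined to itself
      ON e.manager_id = m.emp_id
""")
pairs.show()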
Q27. End to end databricks code to read the multiple files from adls and writing it into a single file
Use Databricks code to read multiple files from ADLS and write into a single file
Access the files through an ADLS mount point or an abfss:// path
Read all the files at once with a wildcard path, or read them individually with Spark's read method
If read separately, combine the DataFrames with union
Coalesce to a single partition and write the combined DataFrame out with Spark's write method (sketch below)
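A hedged end-to-end sketch; the container names, paths, and header option are assumptions:

source = "abfss://raw@mystorageaccount.dfs.core.windows.net/sales/2024/*.csv"   # wildcard picks up all files
target = "abfss://curated@mystorageaccount.dfs.core.windows.net/sales_combined"

df = (spark.read
      .option("header", "true")
      .csv(source))                  # all matching files are read into one DataFrame

(df.coalesce(1)                      # collapse to one partition => one part file in the target folder
   .write
   .mode("overwrite")
   .option("header", "true")
   .csv(target))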
Q28. Optimisation technic to improve the performance of databricks
Optimisation techniques for improving Databricks performance
Utilize cluster sizing and autoscaling to match workload demands
Optimize data storage formats like Parquet for efficient querying
Use partitioning and indexing to speed up data retrieval
Leverage caching for frequently accessed data
Monitor and tune query performance using Databricks SQL Analytics
Consider using Delta Lake for ACID transactions and improved performance
Q29. Difference between select and withcolumn in pyspark
select is used to select specific columns from a DataFrame, while withColumn is used to add or update columns in a DataFrame.
select is used to select specific columns from a DataFrame
withColumn is used to add or update columns in a DataFrame
Neither modifies the original DataFrame; select returns a new DataFrame with only the listed columns, while withColumn returns a new DataFrame with all existing columns plus the added/updated one
Example: df.select('col1', 'col2') - selects columns col1 and col2 from DataFrame df
Example: df.withColumn('new_col', df['col1'] * 2) - adds a column new_col derived from col1
Q30. What is Delta Table concept
Delta Table is a type of table in Delta Lake that supports ACID transactions and time travel capabilities.
Delta tables store data as Parquet files together with a transaction log that records every change.
It allows users to read and write data in an Apache Spark environment.
Delta Table provides time travel capabilities, enabling users to access previous versions of data.
It helps in ensuring data consistency and reliability in data pipelines.
Q31. How is data processed using PySpark?
Data is processed using PySpark by creating Resilient Distributed Datasets (RDDs) and applying transformations and actions.
Data is loaded into RDDs from various sources such as HDFS, S3, or databases.
Transformations like map, filter, reduceByKey, etc., are applied to process the data.
Actions like collect, count, saveAsTextFile, etc., are used to trigger the actual computation.
PySpark provides a distributed computing framework for processing large datasets efficiently.
Q32. What does select count(0) mean?
Select count(0) returns the count of rows in a table, regardless of the values in the specified column.
Select count(0) counts all rows in a table, ignoring the values in the specified column.
It is equivalent to select count(*) or select count(1).
Example: SELECT COUNT(0) FROM table_name;
Q33. difference between primary key and foreign key
Primary key uniquely identifies a record in a table, while foreign key establishes a link between two tables.
Primary key ensures each record is unique in a table
Foreign key establishes a relationship between tables
Primary key is used to enforce entity integrity
Foreign key is used to enforce referential integrity
Q34. Performance tuning in spark
Performance tuning in Spark involves optimizing resource allocation and minimizing data shuffling.
Use appropriate cluster configuration and resource allocation
Minimize data shuffling by using appropriate partitioning and caching
Use efficient transformations and actions
Avoid unnecessary operations and transformations
Use broadcast variables for small data sets
Use appropriate serialization formats
Monitor and optimize garbage collection
Use appropriate hardware and network configuration
Q35. Difference between variables and parameters in ADF
Variables are used to store values that can be changed, while parameters are used to pass values into activities in ADF.
Variables can be modified within a pipeline, while parameters are set at runtime and cannot be changed within the pipeline.
Parameters are read-only once the run starts, while variables can be updated during the run with the Set Variable activity.
Variables can be used to store intermediate values or results, while parameters are used to pass values between activities.
Example: A variable can hold a file name computed during the run, while a parameter passes the environment name into the pipeline when it is triggered.
Q36. azure tech stack used in the current project
Azure tech stack used in the current project includes Azure Data Factory, Azure Databricks, and Azure SQL Database.
Azure Data Factory for data integration and orchestration
Azure Databricks for big data processing and analytics
Azure SQL Database for storing and querying structured data
Q37. activities in adf and there uses
Activities in ADF and their uses
Data movement activities like Copy Data and Data Flow
Data transformation activities like Mapping Data Flow and Wrangling Data Flow
Data orchestration activities like Execute Pipeline and Wait
Control activities like If Condition and For Each
Integration Runtimes for executing activities in ADF
Q38. Facts vs dimensions table
Facts tables contain numerical data while dimensions tables contain descriptive attributes.
Facts tables store quantitative data like sales revenue or quantity sold
Dimensions tables store descriptive attributes like product name or customer details
Facts tables are typically used for analysis and reporting, while dimensions tables provide context for the facts
Q39. What are slots in BigQuery?
Slots in BigQuery are units of virtual compute capacity that determine how much processing a query gets.
Slots help in managing query resources and controlling costs
Users can purchase additional slots to increase query capacity
Slots are used to allocate processing power for queries based on the amount purchased
Q40. What is Unity Catalog?
Unity Catalog is Databricks' centralized governance layer for data and AI assets across workspaces.
It provides a single place to manage access control, auditing, lineage, and data discovery.
Objects are organised in a three-level namespace: catalog.schema.table.
A single Unity Catalog metastore can be shared by multiple Databricks workspaces.
It governs tables, views, volumes, files, and machine learning models, replacing the workspace-scoped Hive metastore.
Q41. What are clustering and partitioning?
Clustering is grouping similar data points together, while partitioning is dividing data into subsets based on certain criteria.
Clustering is a technique used in unsupervised machine learning to group similar data points together.
Partitioning involves dividing a dataset into subsets based on specific criteria, such as range of values or categories.
Examples of clustering algorithms include K-means and hierarchical clustering.
Examples of partitioning methods include range partitioning, hash partitioning, and list partitioning.
Q42. What is spark architecture?
Spark architecture refers to the structure of Apache Spark, including components like driver, executor, and cluster manager.
Spark architecture consists of a driver program that manages the execution of tasks.
Executors are worker nodes that run tasks and store data in memory or disk.
Cluster manager allocates resources and coordinates tasks across the cluster.
Spark applications run on a cluster of machines managed by a cluster manager like YARN or Mesos.
Data is processed in parallel across the executors in the cluster.
Q43. What are external views?
External views are virtual tables that provide a way to present data from one or more tables in a database.
External views do not store data themselves, but instead provide a way to access data from underlying tables.
They can be used to simplify complex queries by presenting data in a more user-friendly format.
External views can also be used to restrict access to certain columns or rows of data for security purposes.
Q44. What are BigQuery slots?
BigQuery slots are units of computational capacity used to process queries in Google BigQuery.
BigQuery slots are used to allocate resources for query processing in Google BigQuery.
Each query consumes a certain number of slots based on the complexity and size of the data being processed.
Users can purchase additional slots to increase query processing capacity.
Slots are used to parallelize query execution and improve performance.
Example: Running a complex query on a large dataset consumes more slots than a simple query on a small table.
Q45. Difference between Dataproc and Dataflow
Dataproc is GCP's managed Spark and Hadoop service running on clusters, while Dataflow is GCP's serverless service for batch and streaming pipelines built on Apache Beam.
Dataproc is a good fit for lifting and shifting existing Spark/Hadoop jobs; you choose and manage the cluster size.
Dataflow is fully managed and autoscaling, with no clusters to provision.
Dataflow pipelines are written with the Apache Beam SDK and run the same code for batch and streaming.
Example: reuse an existing PySpark job on Dataproc, but build a new streaming ingestion pipeline on Dataflow.
Q46. Wide Transformation in pyspark?
Wide Transformation in pyspark involves shuffling data across partitions, typically used for operations like groupBy.
Wide transformations involve shuffling data across partitions
They are typically used for operations like groupBy, join, and sortByKey
They require data movement and can be more expensive in terms of performance compared to narrow transformations
Q47. What is an Accumulator
An accumulator is a variable used in distributed computing to aggregate values across multiple tasks or nodes.
Accumulators are used in Spark to perform calculations in a distributed manner.
They are read-only variables that can only be updated by an associative and commutative operation.
Accumulators are used for tasks like counting elements or summing values in parallel processing.
Example: counting the number of errors encountered during data processing.
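A small PySpark sketch of the error-counting example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-demo").getOrCreate()
sc = spark.sparkContext

error_count = sc.accumulator(0)     # shared counter; executors can only add to it

logs = sc.parallelize(["INFO ok", "ERROR disk full", "ERROR timeout"])
logs.foreach(lambda line: error_count.add(1) if line.startswith("ERROR") else None)

print(error_count.value)            # 2, read back on the driver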
Q48. what is pyspark architecture?
PySpark architecture is a distributed computing framework that combines Python and Spark to process big data.
PySpark architecture includes a driver program, cluster manager, and worker nodes.
The driver program is responsible for converting the user code into tasks and scheduling them on the worker nodes.
Cluster manager allocates resources and monitors the worker nodes.
Worker nodes execute the tasks and return the results to the driver program.
PySpark uses RDDs (Resilient Distributed Datasets) and DataFrames as its core distributed data abstractions.
Q49. Triggers and their types in ADF
Triggers in Azure Data Factory (ADF) are events that cause a pipeline to execute.
Types of triggers in ADF include schedule, tumbling window, storage event, and custom event triggers.
Schedule triggers run pipelines on a specified schedule, like daily or hourly.
Tumbling window triggers run pipelines at specified time intervals.
Event-based triggers execute pipelines based on events like file arrival or HTTP request.
Pipelines can also be run on demand (manually or through the API) without any trigger.
Q50. Union vs union all
Union combines and removes duplicates, while union all combines all rows including duplicates.
Union removes duplicates from the result set
Union all includes all rows, even duplicates
Use union when you want to remove duplicates, use union all when duplicates are needed
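A quick illustration via spark.sql:

spark.sql("SELECT 1 AS id UNION SELECT 1 AS id").show()      # one row: duplicates removed
spark.sql("SELECT 1 AS id UNION ALL SELECT 1 AS id").show()  # two rows: duplicates kept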
Q51. Rank vs dense rank
Rank and dense rank both give tied values the same rank; rank leaves gaps after ties, while dense rank assigns consecutive ranks.
Rank skips the next rank(s) after a tie, while dense rank does not
Rank may have gaps in the ranking sequence, while dense rank does not
Rank is useful when you want to know the exact position of a value in a sorted list, while dense rank is useful when you want to know the relative position of a value compared to others
Q52. Lambda in python
Lambda functions in Python are anonymous functions that can have any number of arguments but only one expression.
Lambda functions are defined using the lambda keyword.
They are commonly used for small, one-time tasks.
Lambda functions can be used as arguments to higher-order functions like map, filter, and reduce.
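A few short examples:

add = lambda x, y: x + y
print(add(2, 3))                                                  # 5

nums = [1, 2, 3, 4, 5]
print(list(map(lambda x: x ** 2, nums)))                          # [1, 4, 9, 16, 25]
print(list(filter(lambda x: x % 2 == 0, nums)))                   # [2, 4]
print(sorted(["banana", "fig", "apple"], key=lambda s: len(s)))   # ['fig', 'apple', 'banana']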
Q53. What is sharding?
Sharding is a database partitioning technique where large databases are divided into smaller, more manageable parts called shards.
Sharding helps distribute data across multiple servers to improve performance and scalability.
Each shard contains a subset of the data, allowing for parallel processing and faster query execution.
Common sharding strategies include range-based sharding, hash-based sharding, and list-based sharding.
Examples of sharded databases include MongoDB and Cassandra.
Q54. How to create mount points
In Databricks, a mount point attaches cloud storage (e.g. an ADLS Gen2 container) under /mnt so it can be accessed like a local path.
Use dbutils.fs.mount with the storage URI, the target mount point, and the authentication configs (e.g. a service principal stored in a secret scope)
List existing mounts with dbutils.fs.mounts() and detach one with dbutils.fs.unmount
Once mounted, notebooks can read and write the storage with ordinary paths such as /mnt/mycontainer/... (sketch below)
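A hedged Databricks notebook sketch using OAuth with a service principal; every identifier below (secret scope, key names, tenant id, storage account, container) is a placeholder:

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("my-scope", "sp-client-id"),          # placeholders
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("my-scope", "sp-client-secret"),  # placeholders
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://mycontainer@mystorageaccount.dfs.core.windows.net/",  # placeholder container/account
    mount_point="/mnt/mycontainer",
    extra_configs=configs,
)

display(dbutils.fs.ls("/mnt/mycontainer"))   # the mounted storage now behaves like a local path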
Q55. What is Autoloader
Auto Loader is a Databricks feature that incrementally ingests new files from cloud storage as they arrive, using the cloudFiles source in Structured Streaming.
It keeps track of which files have already been processed, so each file is ingested exactly once without manual bookkeeping.
It supports schema inference and schema evolution for formats such as CSV, JSON, and Parquet.
It is commonly used to land raw files from ADLS or S3 into Delta (bronze) tables.
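A hedged Auto Loader sketch; the paths and target table name are placeholders:

raw_path        = "abfss://landing@mystorageaccount.dfs.core.windows.net/orders/"   # placeholder source folder
schema_path     = "/mnt/checkpoints/orders/schema"                                  # placeholder
checkpoint_path = "/mnt/checkpoints/orders/stream"                                  # placeholder

stream = (spark.readStream
          .format("cloudFiles")                            # Auto Loader source
          .option("cloudFiles.format", "csv")
          .option("cloudFiles.schemaLocation", schema_path)
          .load(raw_path))

(stream.writeStream
 .format("delta")
 .option("checkpointLocation", checkpoint_path)
 .trigger(availableNow=True)                               # process what has arrived, then stop
 .toTable("bronze_orders"))                                # hypothetical target table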
Q56. python- remove duplicate from set
Use set() function to remove duplicates from a list in Python.
Convert the list to a set using set() function
Convert the set back to a list to remove duplicates
Example: list_with_duplicates = ['a', 'b', 'a', 'c']; list_without_duplicates = list(set(list_with_duplicates))
Q57. Difference between persist and cache
cache() stores data at the default storage level, while persist() lets you choose the storage level (memory, disk, or both).
cache() is shorthand for persist() with the default level: MEMORY_ONLY for RDDs and MEMORY_AND_DISK for DataFrames.
persist() accepts a StorageLevel such as MEMORY_ONLY, MEMORY_AND_DISK, or DISK_ONLY.
Use persist(MEMORY_AND_DISK) when the data may not fit entirely in memory; release it with unpersist() when no longer needed.
Example: df.persist(StorageLevel.MEMORY_AND_DISK) versus df.cache().
Q58. Remove duplicate characters from string
Remove duplicate characters from a string
Iterate through the string and keep track of characters seen
Use a set to store unique characters and remove duplicates
Reconstruct the string without duplicates
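A small Python sketch that keeps the first occurrence of each character:

def remove_duplicate_chars(s: str) -> str:
    seen = set()
    result = []
    for ch in s:
        if ch not in seen:        # keep only the first occurrence
            seen.add(ch)
            result.append(ch)
    return "".join(result)

print(remove_duplicate_chars("programming"))   # progamin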
Q59. Tell me about data pipeline
Data pipeline is a series of processes that collect, transform, and move data from one system to another.
Data pipeline involves extracting data from various sources
Data is then transformed and cleaned to ensure quality and consistency
Finally, the data is loaded into a destination for storage or analysis
Examples of data pipeline tools include Apache NiFi, Apache Airflow, and AWS Glue
Q60. Describe about spark architecture
Spark architecture is a distributed computing framework that provides high-level APIs for various languages.
Spark architecture consists of a cluster manager, worker nodes, and a driver program.
It uses Resilient Distributed Datasets (RDDs) for fault-tolerant distributed data processing.
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object.
It supports various data sources like HDFS, Cassandra, HBase, etc.
Q61. Types of clusters in Databricks
Databricks supports two types of clusters: Standard and High Concurrency.
Databricks supports Standard clusters for single user workloads
Databricks supports High Concurrency clusters for multi-user workloads
Standard clusters are suitable for ad-hoc analysis and ETL jobs
High Concurrency clusters are suitable for shared notebooks and interactive dashboards
Q62. Integration runtime in ADF
The integration runtime (IR) is the compute infrastructure that Azure Data Factory uses to run activities such as data movement, data flows, and activity dispatch.
Azure IR: fully managed compute for copying data and running data flows between cloud data stores.
Self-hosted IR: installed on your own VMs or on-premises machines to reach private-network and on-premises data sources.
Azure-SSIS IR: a managed cluster for lifting and shifting SSIS packages into ADF.
For example, copying data from an on-premises SQL Server to ADLS requires a self-hosted integration runtime.
Q63. Brief about Hadoop and kafka
Hadoop is a distributed storage and processing system for big data, while Kafka is a distributed streaming platform.
Hadoop is used for storing and processing large volumes of data across clusters of computers.
Kafka is used for building real-time data pipelines and streaming applications.
Hadoop uses HDFS (Hadoop Distributed File System) for storage, while Kafka uses topics to publish and subscribe to streams of data.
Hadoop MapReduce is a batch processing framework within Hadoop, while Kafka itself is not a processing engine and is usually paired with consumers such as Spark Streaming or Kafka Streams.
Q64. Streaming tools for big data
Streaming tools for big data are essential for real-time processing and analysis of large datasets.
Apache Kafka is a popular streaming tool for handling real-time data streams.
Apache Spark Streaming is another tool that enables real-time processing of big data.
Amazon Kinesis is a managed service for real-time data streaming on AWS.
Q65. Working of spark framework
Spark framework is a distributed computing system that provides in-memory processing capabilities for big data analytics.
Spark framework is built on top of the Hadoop Distributed File System (HDFS) for storage and Apache Mesos or Hadoop YARN for resource management.
It supports multiple programming languages such as Scala, Java, Python, and R.
Spark provides high-level APIs like Spark SQL for structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.
Q66. Serverless computing in databricks
Serverless computing in Databricks allows users to run code without managing servers, scaling automatically based on workload.
Serverless computing in Databricks enables users to focus on writing code without worrying about server management.
It automatically scales resources based on workload, reducing costs and improving efficiency.
Users can run code in Databricks without provisioning or managing servers, making it easier to deploy and scale applications.
Examples of serverless offerings in Databricks include serverless SQL warehouses, where the compute is managed entirely by Databricks.
Q67. Clusters Types in databricks
There are two types of clusters in Databricks: Standard and High Concurrency.
Standard clusters are optimized for single-user workloads such as ad-hoc analysis and scheduled jobs.
High Concurrency clusters provide fine-grained sharing and isolation so multiple users can run workloads on the same cluster.
Both types of clusters can be configured with different sizes and auto-scaling options.
Q68. difference between rdd & df
RDD stands for Resilient Distributed Dataset and is the fundamental data structure of Spark. DF stands for DataFrame and is a distributed collection of data organized into named columns.
RDD is a low-level abstraction representing a collection of elements distributed across many nodes in a cluster, while DF is a higher-level abstraction built on top of RDDs that provides a more structured and optimized way to work with data.
RDDs are more suitable for unstructured data and require manual optimization, while DataFrames benefit from the Catalyst optimizer and the Tungsten execution engine.
Q69. Explain databricks
Databricks is a unified analytics platform that combines data engineering, data science, and business analytics.
Databricks provides a collaborative workspace for data engineers, data scientists, and business analysts to work together on big data projects.
It integrates with popular tools like Apache Spark for data processing and machine learning.
Databricks offers automated cluster management and scaling to handle large datasets efficiently.
It allows for easy visualization of data through built-in notebooks and dashboards.
Q70. Project explain
I led a project to develop a real-time data processing system for a retail company.
Designed data pipelines to ingest, process, and analyze large volumes of data
Implemented ETL processes using tools like Apache Spark and Kafka
Built data models and dashboards for business insights
Collaborated with cross-functional teams to gather requirements and deliver solutions
Q71. SCD in informatica
Slowly Changing Dimension (SCD) in Informatica is used to track historical data changes in a data warehouse.
SCD Type 1: Overwrite old data with new data
SCD Type 2: Add new row for each change with effective start and end dates
SCD Type 3: Add columns to track changes without adding new rows
Q72. Containers in ssis
Containers in SSIS are used to group and organize tasks and workflows.
Containers provide a way to group related tasks together.
They help in organizing and managing complex workflows.
There are different types of containers in SSIS, such as Sequence Container, For Loop Container, and Foreach Loop Container.
Containers can be nested within each other to create hierarchical structures.
They allow for better control flow and error handling in SSIS packages.