20+ Dhwani Rural Information Systems Interview Questions and Answers
Q1. 1) How do you handle data skewness in Spark?
Data skewness in Spark can be handled by partitioning, bucketing, or using salting techniques.
Partitioning the data based on a key column can distribute the data evenly across the nodes.
Bucketing can group the data into buckets based on a key column, which can improve join performance.
Salting involves adding a random prefix to the key column, which can distribute the data evenly.
Using broadcast joins for small tables can also help in reducing skewness.
Using dynamic resource allocation can also help by letting Spark scale executors up or down to match the workload.
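A minimal PySpark sketch of the salting idea, assuming a hypothetical skewed fact table joined to a small dimension; all DataFrame and column names are illustrative.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salting-demo").getOrCreate()

# Hypothetical data: join_key 1 is heavily skewed on the large side
df_large = spark.createDataFrame([(1, "a")] * 5 + [(2, "b")], ["join_key", "val"])
df_small = spark.createDataFrame([(1, "x"), (2, "y")], ["join_key", "dim_val"])

num_salts = 10

# Add a random salt to the skewed (large) side
salted_large = df_large.withColumn("salt", (F.rand() * num_salts).cast("int"))

# Replicate the small side once per salt value so every salted key finds a match
salts = spark.range(num_salts).select(F.col("id").cast("int").alias("salt"))
salted_small = df_small.crossJoin(salts)

# Join on the original key plus the salt, then drop the helper column
joined = salted_large.join(salted_small, ["join_key", "salt"]).drop("salt")
joined.show()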
Q2. 5) How do you create a Kafka topic with a replication factor of 2?
To create a Kafka topic with replication factor 2, use the command line tool or Kafka API.
Use the command line tool 'kafka-topics.sh' with the '--replication-factor' flag set to 2.
Alternatively, use the Kafka API to create a topic with a replication factor of 2.
Ensure that the number of brokers in the Kafka cluster is greater than or equal to the replication factor.
Consider setting the 'min.insync.replicas' configuration property to 2 so that at least two replicas must acknowledge a write before it succeeds (when producers use acks=all).
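The kafka-topics.sh command referred to above might look like this, assuming a broker reachable at localhost:9092 and an illustrative topic name (on Kafka versions before 2.2, --zookeeper was used instead of --bootstrap-server):

kafka-topics.sh --create --topic orders --partitions 3 --replication-factor 2 --bootstrap-server localhost:9092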
Q3. 4) How do you read JSON data using Spark?
To read JSON data using Spark, use the SparkSession.read.json() method.
Create a SparkSession object
Use the read.json() method to read the JSON data
Specify the path to the JSON file or directory containing JSON files
The resulting DataFrame can be manipulated using Spark's DataFrame API
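A minimal PySpark sketch of the steps above; the file paths are assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-json").getOrCreate()

# Read a single JSON file or a directory of JSON files into a DataFrame
df = spark.read.json("/data/events/")

# If each JSON object spans multiple lines, enable the multiLine option
df_nested = spark.read.option("multiLine", True).json("/data/nested.json")

df.printSchema()
df.show(5)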
Q4. 2) Difference between partitioning and bucketing
Partitioning is dividing data into smaller chunks based on a column value. Bucketing is dividing data into equal-sized buckets based on a hash function.
Partitioning is used for organizing data for efficient querying and processing.
Bucketing is used for evenly distributing data across nodes in a cluster.
Partitioning is done based on a column value, such as date or region.
Bucketing is done by applying a hash function to a column value and assigning each row to one of a fixed number of buckets.
Partitioning can improve query performance through partition pruning, because queries scan only the partitions they need.
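A short PySpark sketch contrasting the two write paths; the data, paths, and table names are illustrative, and bucketBy only works together with saveAsTable.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("layout-demo").getOrCreate()
df = spark.createDataFrame(
    [("2024-01-01", "IN", 101), ("2024-01-02", "US", 102)],
    ["order_date", "region", "user_id"],
)

# Partitioning: one output directory per distinct region value
df.write.mode("overwrite").partitionBy("region").parquet("/tmp/orders_partitioned")

# Bucketing: rows hashed on user_id into 8 fixed buckets within a managed table
df.write.mode("overwrite").bucketBy(8, "user_id").sortBy("user_id").saveAsTable("orders_bucketed")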
Q5. 3) Difference between cache and persistent storage
Cache is temporary storage used to speed up access to frequently accessed data. Persistent storage is permanent storage used to store data even after power loss.
Cache is faster but smaller than persistent storage
Cache is volatile and data is lost when power is lost
Persistent storage is non-volatile and data is retained even after power loss
Examples of cache include CPU cache, browser cache, and CDN cache
Examples of persistent storage include hard disk drives (HDDs) and solid-state drives (SSDs)
Q6. What is the difference between UNION and UNION ALL?
Union combines and removes duplicates, Union All combines all rows including duplicates.
Union merges two tables and removes duplicates
Union All merges two tables and includes duplicates
Union is slower than Union All as it removes duplicates
Syntax: SELECT column1, column2 FROM table1 UNION/UNION ALL SELECT column1, column2 FROM table2
Example: SELECT name FROM table1 UNION SELECT name FROM table2
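The same behaviour in PySpark, with hypothetical DataFrames: DataFrame.union keeps duplicates (UNION ALL), and adding distinct() gives UNION semantics.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("union-demo").getOrCreate()
df1 = spark.createDataFrame([("alice",), ("bob",)], ["name"])
df2 = spark.createDataFrame([("bob",), ("carol",)], ["name"])

union_all = df1.union(df2)               # keeps the duplicate 'bob'
union_dedup = df1.union(df2).distinct()  # removes the duplicate 'bob'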
Q7. What are the components we use in graphs to remove duplicates
Components used in graphs to remove duplicates include HashSet and HashMap.
Use HashSet to store unique elements
Use HashMap to store key-value pairs with unique keys
Iterate through the graph and add elements to HashSet or HashMap to remove duplicates
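A tiny Python illustration of the set/map idea on hypothetical data:

records = [3, 1, 3, 2, 1]

unique_values = set(records)                     # {1, 2, 3}, order not preserved
deduped_in_order = list(dict.fromkeys(records))  # [3, 1, 2], first-seen order kept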
Q8. Why are there two keys available in Azure resources?
Two access keys are provided for Azure resources mainly so that keys can be rotated without downtime.
Both keys (primary and secondary) grant the same access to the resource.
Applications can switch to the secondary key while the primary key is regenerated, so access is never interrupted.
Rotating keys regularly limits the impact if a key is ever leaked.
Having two keys therefore adds operational safety as well as security to Azure resources.
Examples of Azure resources that use two keys are Azure Storage and Azure Event Hubs.
Q9. What do you know about Forms and Templates and their use in workflows and web reports?
Forms and Templates are used in workflow and web reports to standardize data input and presentation.
Forms are used to collect data in a structured manner, often with predefined fields and formats
Templates are pre-designed layouts for presenting data in a consistent way
Forms and Templates help streamline processes, ensure data consistency, and improve reporting accuracy
In workflow management, Forms can be used to gather input from users at different stages of a process
Web reports can use Templates to present the collected data in a consistent layout for stakeholders
Q10. 1) Project architecture 2) Complex job handling in the project 3) Types of lookup 4) SCD-2 implementation in DataStage 5) SQL - analytical functions, scenario-based question 6) Unix - SED/GREP commands
The interview questions cover project architecture, complex job handling, lookup types, SCD-2 implementation, SQL analytical functions, and Unix commands.
Project architecture involves designing the overall structure of a data project.
Complex job handling refers to managing intricate data processing tasks within a project.
Lookup types include exact match, range match, and fuzzy match.
SCD-2 implementation in DataStage involves capturing historical changes in data.
SQL analytical functions such as ROW_NUMBER, RANK, LAG, and LEAD are commonly tested through scenario-based questions.
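Example (illustrative file names): grep -i 'error' app.log prints the lines of app.log containing 'error' regardless of case, and sed 's/foo/bar/g' input.txt replaces every occurrence of 'foo' with 'bar' in the output.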
Q11. How would you implement SCD Type 2 in Informatica?
Implementing SCD Type 2 in Informatica involves using Slowly Changing Dimension transformations and mapping variables.
Use Slowly Changing Dimension (SCD) transformations in Informatica to track historical changes in data.
Create mapping variables to keep track of effective start and end dates for each record.
Use Update Strategy transformations to handle inserts, updates, and deletes in the target table.
Implement Type 2 SCD by inserting new records with the updated data and marking the previous versions as expired, for example by setting an end date or a current-record flag.
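A generic SQL sketch of the Type 2 pattern described above (not Informatica-specific; the dim_customer table, its columns, and the key value are hypothetical):

-- Expire the current version of the changed record
UPDATE dim_customer
SET end_date = CURRENT_DATE, current_flag = 'N'
WHERE customer_id = 101 AND current_flag = 'Y';

-- Insert the new version as the current record
INSERT INTO dim_customer (customer_id, name, start_date, end_date, current_flag)
VALUES (101, 'New Name', CURRENT_DATE, NULL, 'Y');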
Q12. What do you know about CS Workflows
CS Workflows refer to the processes and steps involved in managing and analyzing data in a computer science context.
CS Workflows involve defining data sources and transformations
They often include data cleaning, processing, and analysis steps
Tools like Apache Airflow and Luigi are commonly used for managing workflows
CS Workflows help automate data pipelines and ensure data quality and consistency
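A minimal Airflow sketch of such a workflow, assuming Airflow 2.x; the DAG id, schedule, and task functions are hypothetical.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source")      # placeholder step

def transform():
    print("clean and reshape the data")     # placeholder step

with DAG(dag_id="daily_etl", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task          # transform runs only after extract succeeds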
Q13. Difference between coalesce and repartition in PySpark
coalesce and repartition are both used to control the number of partitions in a PySpark DataFrame.
coalesce reduces the number of partitions by combining them, while repartition shuffles the data to create new partitions
coalesce is a narrow transformation and does not trigger a full shuffle, while repartition is a wide transformation and triggers a shuffle
coalesce is useful when reducing the number of partitions, while repartition is useful when increasing the number of partitions
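A short PySpark illustration of the difference; the partition counts are arbitrary.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()
df = spark.range(1_000_000)

wide = df.repartition(200)   # full shuffle; can increase or decrease partitions
narrow = wide.coalesce(10)   # merges existing partitions without a full shuffle

print(wide.rdd.getNumPartitions(), narrow.rdd.getNumPartitions())  # 200 10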
Q14. Tell me about your overall IT experience
I have over 5 years of experience in IT, with a focus on data engineering and database management.
Worked on designing and implementing data pipelines to extract, transform, and load data from various sources
Managed and optimized databases for performance and scalability
Collaborated with cross-functional teams to develop data-driven solutions
Experience with tools like SQL, Python, Hadoop, and Spark
Participated in data modeling and data architecture design
Q15. How many graphs have you built so far?
I have built 10 graphs so far, including network graphs, bar graphs, and pie charts.
I have built 10 graphs in total
I have experience building network graphs, bar graphs, and pie charts
I have used tools like matplotlib and seaborn for graph building
Q16. Delete vs truncate vs drop
Difference between delete, truncate and drop in SQL
DELETE removes specific rows from a table
TRUNCATE removes all rows from a table
DROP removes the entire table from the database
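Example (illustrative table): DELETE FROM employees WHERE dept = 'HR' removes only the matching rows, TRUNCATE TABLE employees empties the table but keeps its definition, and DROP TABLE employees removes the table itself from the database.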
Q17. Difference between row_number and dense_rank
row_number assigns unique sequential integers to rows, while dense_rank assigns ranks to rows with no gaps between ranks.
row_number function assigns a unique sequential integer to each row in the result set
dense_rank function assigns ranks to rows with no gaps between ranks
row_number gives tied rows different numbers, while dense_rank gives tied rows the same rank
Example: row_number - 1, 2, 3, 4; dense_rank - 1, 2, 2, 3
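Syntax: SELECT name, score, ROW_NUMBER() OVER (ORDER BY score DESC) AS rn, DENSE_RANK() OVER (ORDER BY score DESC) AS dr FROM results; with two rows tied on score, rn still increments (1, 2, 3, 4) while dr repeats the rank (1, 2, 2, 3). Table and column names are illustrative.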
Q18. How do you deal with escalations?
I address escalations by identifying the root cause, communicating effectively, collaborating with stakeholders, and finding a resolution.
Identify the root cause of the escalation to understand the issue thoroughly
Communicate effectively with all parties involved to ensure clarity and transparency
Collaborate with stakeholders to gather necessary information and work towards a resolution
Find a resolution that addresses the escalation and prevents similar issues in the future
Q19. What is a broadcast variable?
Broadcast variable is a read-only variable that is cached on each machine in a cluster instead of being shipped with tasks.
Broadcast variables are used to efficiently distribute large read-only datasets to worker nodes in Spark applications.
They are cached in memory on each machine and can be reused across multiple stages of a job.
Broadcast variables help in reducing the amount of data that needs to be transferred over the network during task execution.
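A minimal PySpark sketch of a broadcast variable; the lookup dictionary is hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()
sc = spark.sparkContext

# Small read-only lookup table, shipped to each executor once and cached there
country_names = sc.broadcast({"IN": "India", "US": "United States"})

codes = sc.parallelize(["IN", "US", "IN"])
# Tasks read the cached copy via .value instead of serialising the dict with every task
names = codes.map(lambda c: country_names.value.get(c, "unknown")).collect()
print(names)  # ['India', 'United States', 'India']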
Q20. Advantages and disadvantages of Hive?
Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Advantages: SQL-like query language for querying large datasets, optimized for OLAP workloads, supports partitioning and bucketing for efficient queries.
Disadvantages: Slower performance compared to traditional databases for OLTP workloads, limited support for complex queries and transactions.
Example: Hive can be used to analyze large volumes of log data to extract insights such as usage trends and summary reports.
Q21. What is RCP in DataStage?
RCP in DataStage stands for Runtime Column Propagation.
RCP is a feature in IBM DataStage that allows the runtime engine to determine the columns that are needed for processing at runtime.
It helps in optimizing the job performance by reducing unnecessary column processing.
RCP can be enabled or disabled at the job level or individual stage level.
Example: By enabling RCP, DataStage can dynamically propagate only the required columns for processing, improving job efficiency.
Q22. Optimisation techniques
Optimisation techniques are methods used to improve the efficiency and performance of data processing tasks.
Use indexing to speed up data retrieval
Implement parallel processing to distribute workloads
Utilize caching to store frequently accessed data
Optimize algorithms for better performance
Use data compression techniques to reduce storage space
Q23. Optimisation done in the code
Optimisation in code involves improving efficiency and performance.
Use of efficient data structures and algorithms
Minimizing unnecessary computations
Reducing memory usage
Parallel processing for faster execution
Profiling and identifying bottlenecks
Q24. Spark optimization techniques
Spark optimization techniques improve the performance and efficiency of Spark jobs.
Partitioning data to optimize parallelism
Caching data in memory to avoid recomputation
Using broadcast variables for small lookup tables
Avoiding shuffles by using narrow transformations
Tuning memory and executor settings for optimal performance
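A few of these techniques combined in a hedged PySpark sketch; the dataset, key, and configuration value are illustrative rather than tuned recommendations.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("opt-demo").getOrCreate()

# Tune shuffle parallelism for the workload (value is arbitrary here)
spark.conf.set("spark.sql.shuffle.partitions", "200")

large = spark.range(1_000_000).withColumn("key", F.col("id") % 100)
small = spark.createDataFrame([(i, "name_" + str(i)) for i in range(100)], ["key", "name"])

# Broadcast the small lookup table so the large side is not shuffled
joined = large.join(F.broadcast(small), "key")

# Cache a result that several downstream actions will reuse
joined.cache()
joined.count()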
Q25. What is DataStage?
Datastage is an ETL tool used for extracting, transforming, and loading data from various sources to a target destination.
Datastage is a popular ETL tool developed by IBM.
It allows users to design and run jobs that move and transform data.
Datastage supports various data sources such as databases, flat files, and cloud services.
It provides a graphical interface for designing data integration jobs.
Datastage jobs can be scheduled and monitored for data processing.
Example: Using DataStage to extract data from a source database, transform it, and load it into a data warehouse.
Q26. OOP concepts in Java
OOP concepts in Java refer to Object-Oriented Programming principles like Inheritance, Encapsulation, Polymorphism, and Abstraction.
Inheritance: Allows a class to inherit properties and behavior from another class.
Encapsulation: Bundling data and methods that operate on the data into a single unit.
Polymorphism: Ability of a method to do different things based on the object it is acting upon.
Abstraction: Hiding the implementation details and showing only the necessary features to the user.
Q27. Difference between the two
The two roles being compared here are Data Engineer and Data Scientist.
Data Engineer focuses on designing and maintaining data pipelines and infrastructure for data storage and processing.
Data Scientist focuses on analyzing and interpreting complex data to provide insights and make data-driven decisions.
Data Engineer typically works on building and optimizing data pipelines using tools like Apache Spark or Hadoop.
Data Scientist uses statistical and machine learning techniques to build models and turn data into insights.
Q28. Snowflake's Architecture
Snowflake is a cloud-based data warehousing platform that separates storage and compute, providing scalability and flexibility.
Snowflake uses a unique architecture called multi-cluster, shared data architecture.
It separates storage and compute, allowing users to scale each independently.
Data is stored in virtual warehouses, which are compute resources that can be scaled up or down based on workload.
Snowflake stores all data in a central repository on cloud object storage, which every virtual warehouse can access.
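Example (illustrative warehouse name): in Snowflake SQL, ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE' resizes only the compute layer; the stored data is untouched.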
Q29. Day to day tasks
Day to day tasks involve data collection, processing, analysis, and maintenance to ensure data quality and availability.
Collecting and storing data from various sources
Cleaning and preprocessing data for analysis
Developing and maintaining data pipelines
Analyzing data to extract insights and trends
Collaborating with data scientists and analysts to support their work