IBM
Data skewness in Spark can be handled by partitioning, bucketing, or using salting techniques.
Partitioning the data based on a key column can distribute the data evenly across the nodes.
Bucketing can group the data into buckets based on a key column, which can improve join performance.
Salting involves adding a random prefix to the key column, which can distribute the data evenly.
Using broadcast joins for small tables can avoid shuffling the large, skewed side across the network (see the salting sketch below, which also uses a broadcast hint).
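A minimal PySpark sketch of the salting idea described above; the DataFrame names, file paths, join key (user_id), and salt range are hypothetical, not taken from the original answer.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salting-demo").getOrCreate()

# Hypothetical DataFrames: 'events' is large and skewed on user_id,
# 'users' is the smaller dimension table being joined in.
events = spark.read.parquet("/data/events")   # assumed path
users = spark.read.parquet("/data/users")     # assumed path

SALT_BUCKETS = 8

# Add a random salt to the skewed side so one hot key is split
# across SALT_BUCKETS partitions instead of landing on one executor.
events_salted = events.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Explode the small side so every (user_id, salt) combination exists.
salts = spark.range(SALT_BUCKETS).withColumn("salt", F.col("id").cast("int")).drop("id")
users_salted = users.crossJoin(salts)

# Broadcast the small side to avoid shuffling the large one at all.
joined = events_salted.join(F.broadcast(users_salted), on=["user_id", "salt"])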
Partitioning is dividing data into smaller chunks based on a column value. Bucketing is dividing data into a fixed number of buckets based on a hash function.
Partitioning is used for organizing data for efficient querying and processing.
Bucketing is used for evenly distributing data across nodes in a cluster.
Partitioning is done based on a column value, such as date or region.
Bucketing is done based on a hash function, such as hashing a customer ID into a fixed number of buckets (see the write sketch below).
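A rough PySpark sketch contrasting partitioning and bucketing on write; the table name, path, and columns (region, customer_id) are assumed for illustration only.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-bucket-demo").getOrCreate()

# Hypothetical sales DataFrame with 'region' and 'customer_id' columns.
sales = spark.read.parquet("/data/sales")  # assumed path

# Partitioning: one directory per distinct region value.
# Bucketing: rows hashed on customer_id into 16 fixed buckets per partition.
(sales.write
    .partitionBy("region")
    .bucketBy(16, "customer_id")
    .sortBy("customer_id")
    .mode("overwrite")
    .saveAsTable("sales_bucketed"))   # bucketBy requires saveAsTable

Partition pruning then skips whole directories for queries filtered on region, while bucketing lets joins on customer_id avoid a full shuffle.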
Cache is temporary storage used to speed up access to frequently accessed data. Persistent storage is permanent storage used to store data even after power loss.
Cache is faster but smaller than persistent storage
Cache is volatile and data is lost when power is lost
Persistent storage is non-volatile and data is retained even after power loss
Examples of cache include CPU cache, browser cache, and CDN cache
Examples of persistent storage include hard disks, SSDs, and databases
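A small Python sketch of the contrast, assuming functools.lru_cache as the in-memory cache and a JSON file as the persistent store; the function and file names are made up.

import json
from functools import lru_cache

@lru_cache(maxsize=128)            # in-memory cache: fast, but lost when the process exits
def expensive_lookup(key: str) -> int:
    print(f"computing {key} ...")  # only printed on a cache miss
    return sum(ord(c) for c in key)

expensive_lookup("spark")   # computed
expensive_lookup("spark")   # served from the cache, no recomputation

# Persistent storage: slower to access, but survives restarts and power loss.
with open("results.json", "w") as f:
    json.dump({"spark": expensive_lookup("spark")}, f)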
To read JSON data using Spark, use the SparkSession.read.json() method.
Create a SparkSession object
Use the read.json() method to read the JSON data
Specify the path to the JSON file or directory containing JSON files
The resulting DataFrame can be manipulated using Spark's DataFrame API
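A minimal PySpark sketch of the steps above; the file path and the commented column names are assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-json-demo").getOrCreate()

# One JSON object per line is the default; use the multiLine option if a
# single record spans multiple lines.
df = spark.read.json("/data/events.json")   # file or directory of JSON files
# df = spark.read.option("multiLine", True).json("/data/events.json")

df.printSchema()
df.show(5)
# The DataFrame can then be manipulated as usual, e.g.
# df.select("user_id", "event_type")  (hypothetical column names)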
To create a Kafka topic with replication factor 2, use the command line tool or Kafka API.
Use the command line tool 'kafka-topics.sh' with the '--replication-factor' flag set to 2.
Alternatively, use the Kafka API to create a topic with a replication factor of 2.
Ensure that the number of brokers in the Kafka cluster is greater than or equal to the replication factor.
Consider setting the 'min.insync.replicas' configuration so that producers using acks=all get the durability guarantee they expect.
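A hedged sketch using the confluent-kafka Python admin client, with the kafka-topics.sh command from the answer shown as a comment; the broker address and topic name are assumptions.

# Equivalent CLI (as mentioned above):
#   kafka-topics.sh --create --topic orders --partitions 3 \
#       --replication-factor 2 --bootstrap-server localhost:9092
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumed broker address

topic = NewTopic(
    "orders",                  # hypothetical topic name
    num_partitions=3,
    replication_factor=2,
    config={"min.insync.replicas": "2"},
)

# create_topics is asynchronous; each value in the returned dict is a future.
for name, future in admin.create_topics([topic]).items():
    future.result()            # raises if topic creation failed
    print(f"created topic {name}")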
I can join within two weeks of receiving an offer.
I can start within two weeks of receiving an offer.
I need to give notice at my current job before starting.
I have some personal commitments that I need to wrap up before joining.
DataStage is an ETL tool used for extracting, transforming, and loading data from various sources to a target destination.
DataStage is a popular ETL tool developed by IBM.
It allows users to design and run jobs that move and transform data.
DataStage supports various data sources such as databases, flat files, and cloud services.
It provides a graphical interface for designing data integration jobs.
DataStage jobs can be scheduled, run, and monitored using the DataStage Director client.
RCP in DataStage stands for Runtime Column Propagation.
RCP is a feature in IBM DataStage that lets columns which are not explicitly defined in the job design be carried through stages automatically at runtime.
It makes jobs more flexible and reusable, since every stage does not have to declare the full column metadata.
RCP can be enabled or disabled at the project, job, or individual stage (output link) level.
Example: with RCP enabled, extra columns arriving from a source are propagated through intermediate stages to the target without being listed in each stage's definitions.
Snowflake is a cloud-based data warehousing platform that separates storage and compute, providing scalability and flexibility.
Snowflake uses a unique architecture called multi-cluster, shared data architecture.
It separates storage and compute, allowing users to scale each independently.
Compute is provided by virtual warehouses, which can be scaled up, down, or paused independently based on workload.
Snowflake uses a central, shared storage layer for persisted data that all virtual warehouses can access.
I am a data engineer with a strong background in programming and data analysis.
Experienced in designing and implementing data pipelines
Proficient in programming languages like Python, SQL, and Java
Skilled in data modeling and database management
Familiar with big data technologies such as Hadoop and Spark
Developed a data pipeline to process and analyze customer feedback data
Used Apache Spark for data processing
Implemented machine learning models for sentiment analysis
Visualized insights using Tableau for stakeholders
Collaborated with cross-functional teams to improve customer experience
row_number assigns unique sequential integers to rows, while dense_rank assigns ranks to rows with no gaps between ranks.
row_number function assigns a unique sequential integer to each row in the result set
dense_rank function assigns ranks to rows with no gaps between ranks
row_number gives tied rows different sequential numbers, while dense_rank gives tied rows the same rank
Example: row_number - 1, 2, 3, 4; dense_rank - 1, 2, 2, 3
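A small PySpark illustration of the difference, using a made-up DataFrame with a tie on the score column.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("rank-demo").getOrCreate()

# Hypothetical scores with a tie on 85.
df = spark.createDataFrame(
    [("a", 90), ("b", 85), ("c", 85), ("d", 80)], ["name", "score"]
)

w = Window.orderBy(F.col("score").desc())

df.withColumn("row_number", F.row_number().over(w)) \
  .withColumn("dense_rank", F.dense_rank().over(w)) \
  .show()
# row_number: 1, 2, 3, 4   (tied rows still get distinct numbers)
# dense_rank: 1, 2, 2, 3   (tied rows share a rank, with no gaps afterwards)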
Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Advantages: SQL-like query language for querying large datasets, optimized for OLAP workloads, supports partitioning and bucketing for efficient queries.
Disadvantages: Slower performance compared to traditional databases for OLTP workloads, limited support for complex queries and transactions.
Example: Hive can run SQL-like (HiveQL) queries over large datasets stored in HDFS, as sketched below.
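A rough sketch of that kind of usage via Spark's Hive support; the table and column names are hypothetical.

from pyspark.sql import SparkSession

# enableHiveSupport lets Spark create and query Hive (metastore) tables.
spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE TABLE IF NOT EXISTS web_logs (url STRING, status INT)
    PARTITIONED BY (log_date STRING)
    STORED AS PARQUET
""")

# A typical OLAP-style aggregation that benefits from partition pruning.
spark.sql("""
    SELECT log_date, status, COUNT(*) AS hits
    FROM web_logs
    WHERE log_date = '2024-04-01'
    GROUP BY log_date, status
""").show()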
I applied via Referral and was interviewed in Apr 2024. There was 1 interview round.
I have over 5 years of experience in IT, with a focus on data engineering and database management.
Worked on designing and implementing data pipelines to extract, transform, and load data from various sources
Managed and optimized databases for performance and scalability
Collaborated with cross-functional teams to develop data-driven solutions
Experience with tools like SQL, Python, Hadoop, and Spark
Participated in data m...
I applied via Naukri.com and was interviewed in Jul 2024. There was 1 interview round.
Broadcast variable is a read-only variable that is cached on each machine in a cluster instead of being shipped with tasks.
Broadcast variables are used to efficiently distribute large read-only datasets to worker nodes in Spark applications.
They are cached in memory on each machine and can be reused across multiple stages of a job.
Broadcast variables help in reducing the amount of data that needs to be transferred over the network during task execution (see the sketch below).
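A minimal PySpark sketch of a broadcast variable, assuming a small, made-up country-code lookup table.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical small lookup table shipped once to every executor.
country_codes = {"IN": "India", "US": "United States", "DE": "Germany"}
bc_codes = sc.broadcast(country_codes)

rdd = sc.parallelize(["IN", "US", "IN", "DE"])

# Each task reads the cached copy via .value instead of receiving the
# dictionary with every task closure.
full_names = rdd.map(lambda code: bc_codes.value.get(code, "unknown")).collect()
print(full_names)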
A 1-hour coding test with one coding question and one SQL question. The coding question was of average difficulty and easy to solve; the SQL question was very easy.
The IBM Data Engineer interview process typically takes less than 2 weeks and involves about 3 interview rounds (based on 40 interviews).
Designation | Salaries reported | Salary range
Application Developer | 11.7k salaries | ₹5.9 L/yr - ₹26.5 L/yr
Software Engineer | 5.5k salaries | ₹5.4 L/yr - ₹22.6 L/yr
Advisory System Analyst | 5.2k salaries | ₹9.4 L/yr - ₹26 L/yr
Senior Software Engineer | 4.8k salaries | ₹8 L/yr - ₹30 L/yr
Senior Systems Engineer | 4.5k salaries | ₹5.6 L/yr - ₹20 L/yr
Oracle
TCS
Cognizant
Accenture