TCS
60+ Interview Questions and Answers
Q1. What are internal and external tables in Hive?
Internal tables store data within Hive's warehouse directory while external tables store data outside of it.
Internal tables are managed by Hive and are deleted when the table is dropped
External tables are not managed by Hive and data is not deleted when the table is dropped
Internal tables are faster for querying as data is stored within Hive's warehouse directory
External tables are useful for sharing data between different systems
Example: CREATE TABLE my_table (col1 INT, col2...read more
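A minimal sketch of the two table types, written as Hive DDL run through PySpark; the table names, columns, and the /data/events path are illustrative, not taken from the original answer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Managed (internal) table: Hive owns the files; DROP TABLE deletes the data.
spark.sql("""
    CREATE TABLE IF NOT EXISTS managed_events (id INT, name STRING)
    STORED AS PARQUET
""")

# External table: Hive only tracks metadata; DROP TABLE leaves the files in place.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS external_events (id INT, name STRING)
    STORED AS PARQUET
    LOCATION '/data/events'
""")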
Q2. What is a view in SQL, and what is dense rank?
A view is a virtual table created from a SQL query. Dense rank ranks rows with no gaps, assigning the same rank to tied rows.
A view is a saved SQL query that can be used as a table
Dense rank assigns the same rank to tied rows and leaves no gaps between consecutive ranks
Dense rank is used to rank rows based on a specific column or set of columns
Example: SELECT * FROM my_view WHERE column_name = 'value'
Example: SELECT column_name, DENSE_RANK() OVER (ORDER BY column_name) FROM my_table
Q3. How to deal with data quality issues
Data quality issues can be dealt with by identifying the root cause, implementing data validation checks, and establishing data governance policies.
Identify the root cause of the data quality issue
Implement data validation checks to prevent future issues
Establish data governance policies to ensure data accuracy and consistency
Regularly monitor and audit data quality
Involve stakeholders in the data quality process
Use data profiling and cleansing tools
Ensure data security and p...read more
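As a rough illustration of the validation-check idea, here is a small PySpark sketch that counts nulls per column and flags duplicate keys; the customer data and column names are hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input; in a real pipeline this would come from the source system.
df = spark.createDataFrame(
    [(1, "a@x.com"), (2, None), (2, "b@x.com")], ["customer_id", "email"]
)

# Check 1: missing values per column.
df.select([F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]).show()

# Check 2: duplicate business keys.
df.groupBy("customer_id").count().filter("count > 1").show()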
Q4. Do you have experience in AWS Glue? How would you use Glue for data migration?
Yes, I have experience in AWS Glue and can use it for data migration.
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics.
To use Glue for data migration, I would start by creating a Glue job that defines the source and target data sources, as well as any transformations needed.
I would then configure the job to run on a schedule or trigger it manually to migrate the data from the source to the target.
G...read more
Q5. How do you select the unique customers from the last 3 months of sales?
Use SQL query to select unique customers in last 3 months sales
Filter sales data for the last 3 months
Use DISTINCT keyword to select unique customers
Join with customer table if necessary
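A minimal sketch of that query in PySpark, with a made-up sales view standing in for the real table (the customer_id and sale_date column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame(
    [("c1", "2024-05-01"), ("c1", "2024-05-02"), ("c2", "2023-01-01")],
    ["customer_id", "sale_date"]
).createOrReplaceTempView("sales")

# DISTINCT gives unique customers; the date filter keeps the last 3 months of sales.
spark.sql("""
    SELECT DISTINCT customer_id
    FROM sales
    WHERE to_date(sale_date) >= add_months(current_date(), -3)
""").show()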
Q6. How can we join a table without any identity columns?
You can join tables without identity columns using other unique columns or composite keys.
Use other unique columns or composite keys to join the tables
Consider using a combination of columns to create a unique identifier for joining
If no unique columns are available, consider using a combination of non-unique columns with additional logic to ensure accurate joins
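A small PySpark sketch of a composite-key join, with hypothetical orders and stores data standing in for tables that lack a single identity column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame(
    [("2024-01-01", "S1", 100)], ["order_date", "store_code", "amount"]
)
stores = spark.createDataFrame(
    [("2024-01-01", "S1", "Mumbai")], ["order_date", "store_code", "city"]
)

# No surrogate key, so join on the combination of columns that together identify a row.
orders.join(stores, on=["order_date", "store_code"], how="inner").show()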
Q7. What are partitioning and coalesce?
Partitioning is dividing a large dataset into smaller, manageable parts. Coalescing is merging small partitions into larger ones.
Partitioning is useful for parallel processing and optimizing query performance.
Coalescing reduces the number of partitions and can improve query performance.
In Spark, partitioning can be done based on a specific column or by specifying the number of partitions.
Coalescing can be used to reduce the number of partitions after filtering or joining oper...read more
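A short PySpark sketch of both operations; the partition counts are arbitrary choices for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

wide = df.repartition(200, "id")   # full shuffle; can increase or decrease partitions
narrow = wide.coalesce(10)         # avoids a full shuffle; can only reduce partitions

print(df.rdd.getNumPartitions(), wide.rdd.getNumPartitions(), narrow.rdd.getNumPartitions())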
Q8. What are repartitioning and bucketing?
Repartitioning and bucketing are techniques used in Apache Spark to optimize data processing.
Repartitioning is the process of redistributing data across partitions to optimize parallelism and improve performance.
Bucketing is a technique used to organize data into more manageable and efficient groups based on a specific column or set of columns.
Repartitioning and bucketing can be used together to further optimize data processing.
Repartitioning can be done using the repartition...read more
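A minimal sketch of bucketing at write time in PySpark; the bucket count, column, and table name are illustrative, and bucketed tables must be saved with saveAsTable:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.range(100).withColumnRenamed("id", "user_id")

# Hash the rows into 8 buckets by user_id; later joins and aggregations
# on user_id can then avoid a shuffle.
(df.write
   .bucketBy(8, "user_id")
   .sortBy("user_id")
   .mode("overwrite")
   .saveAsTable("users_bucketed"))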
Q9. What are the types of stages in Snowflake?
Snowflake stages are locations used to load and unload data; they are either internal or external.
Internal stages store files inside Snowflake: each user has a user stage, each table has a table stage, and named internal stages can be created explicitly.
External stages point to files held in external cloud storage such as Amazon S3, Azure Blob Storage, or Google Cloud Storage.
Data is loaded from a stage into a table with the COPY INTO command.
Stages can be created and managed using SQL commands or the Snowflake web interface.
Q10. 1) Difference between partitioning and bucketing in Hive 2) Difference between internal and external tables 3) Explain Hive architecture 4) Difference between cache and persist 5) What is RDD?
1) Partitioning divides data into directories based on a column's values, while bucketing divides data into a fixed number of files based on a hash of a column. 2) Internal tables store data in a default location managed by Hive, while external tables store data in a user-defined location. 3) Hive architecture consists of a metastore, driver, compiler, optimizer, and execution engine. 4) cache() stores data with the default MEMORY_ONLY storage level, while persist() lets you choose the storage level (memory, disk, or both). 5) RDD (Resilie...read more
Q11. What is clustering? What is the difference between pods and nodes?
Clustering is the process of grouping similar data points together. Pods are groups of one or more containers, while nodes are individual machines in a cluster.
Clustering is a technique used in machine learning to group similar data points together based on certain features or characteristics.
Pods in a cluster are groups of one or more containers that share resources and are scheduled together on the same node.
Nodes are individual machines within a cluster that run multiple p...read more
Q12. What is a window function?
Window function is a SQL function that performs a calculation across a set of rows that are related to the current row.
Window functions are used to calculate running totals, moving averages, and other calculations that depend on the order of rows.
They allow you to perform calculations on a subset of rows within a larger result set.
Examples of window functions include ROW_NUMBER, RANK, DENSE_RANK, and NTILE.
Window functions are often used in conjunction with the OVER clause to...read more
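A compact PySpark sketch of a window function: ranking rows and keeping a running total within each region (the sales data is made up for illustration):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
sales = spark.createDataFrame(
    [("A", 100), ("A", 80), ("B", 90)], ["region", "amount"]
)

w = Window.partitionBy("region").orderBy(F.desc("amount"))
(sales
   .withColumn("rank_in_region", F.row_number().over(w))
   .withColumn("running_total",
               F.sum("amount").over(w.rowsBetween(Window.unboundedPreceding, Window.currentRow)))
   .show())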
Q13. How to find rank without using aggregator in Informatica
To find rank without using the Aggregator in Informatica, use the Rank transformation or an Expression transformation with variable ports.
Use the Rank transformation to return the top or bottom N rows per group.
Alternatively, sort the data and use an Expression transformation with variable ports that increment a running counter when the value changes.
Configure the group-by ports and the number of ranks on the Rank transformation to control how ranks are assigned.
Q14. 1. What are partitioning and bucketing? 2. Difference between UNION and UNION ALL. 3. Spark architecture. 4. Managed and external tables in Hive and the difference between them. 5. Basic SQL and Python problems.
Answers to interview questions for Data Engineer position.
1. Partitioning is a way to divide a large dataset into smaller, more manageable parts based on a specific column or expression. Bucketing is a technique to further organize the data within each partition into smaller, equally-sized files based on a hash function.
2. UNION combines the result sets of two or more SELECT statements, removing duplicate rows. UNION ALL also combines the result sets, but retains all rows, in...read more
Q15. What is an anonymous function?
An anonymous function is a function without a name.
Also known as lambda functions; in Python they are created with the lambda keyword
Can be used as arguments to higher-order functions
Can be defined inline without a separate declaration
Example: lambda x: x**2 defines a function that squares its input
Q16. Oracle, Python, PySpark: have you worked on SQL? What was your job role at your previous company?
Yes, I have experience working with Oracle, Python, PySpark, and SQL in my previous roles as a Data Engineer.
Worked extensively with Oracle databases for data storage and retrieval
Utilized Python for data manipulation, analysis, and automation tasks
Implemented data processing and analytics using PySpark
Proficient in writing and optimizing SQL queries for data extraction and transformation
Q17. How to call a notebook from another notebook in Databricks
To call a notebook from another notebook in Databricks, use the %run command followed by the path of the notebook.
Use the %run command followed by the path of the notebook to call it from another notebook.
Make sure the notebook you want to call is in the same workspace or accessible to the notebook you are calling it from.
You can also pass parameters to the notebook being called using the %run command.
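A sketch of both approaches as they would appear in a Databricks notebook cell; the notebook paths, timeout, and parameter names are hypothetical:

# In a Databricks cell, %run inlines the other notebook's functions and variables:
# %run ./utils/shared_functions

# dbutils.notebook.run (Databricks only) executes the notebook as a separate job,
# passes parameters, and returns whatever the callee exits with.
result = dbutils.notebook.run("/Shared/etl/load_customers", 600, {"run_date": "2024-01-01"})
print(result)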
Q18. How to migrate data from a local server to AWS Redshift
Data can be migrated from a local server to AWS Redshift using tools like AWS Database Migration Service or manual ETL processes.
Use AWS Database Migration Service for automated migration
Export data from local server to S3 and then load into Redshift using COPY command
Use ETL tools like AWS Glue for data transformation and loading into Redshift
Q19. Difference between rank and dense_rank, Left vs Left anti join
Rank assigns unique ranks to rows, while dense_rank handles ties by assigning the same rank to tied rows. Left join includes all rows from the left table and matching rows from the right table, while left anti join includes only rows from the left table that do not have a match in the right table.
Rank assigns unique ranks to rows based on the specified order, while dense_rank handles ties by assigning the same rank to tied rows.
Example: If we have scores of 90, 85, 85, 80, th...read more
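A small PySpark sketch showing the tie behaviour on those scores, plus the expected output of both functions:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
scores = spark.createDataFrame([(90,), (85,), (85,), (80,)], ["score"])

w = Window.orderBy(F.desc("score"))
(scores
   .withColumn("rank", F.rank().over(w))
   .withColumn("dense_rank", F.dense_rank().over(w))
   .show())
# rank:       1, 2, 2, 4  (a gap appears after the tie)
# dense_rank: 1, 2, 2, 3  (no gap)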
Q20. How do you ingest data into your pipeline?
I ingest data in the pipeline using tools like Apache Kafka and Apache NiFi.
Use Apache Kafka for real-time data streaming
Utilize Apache NiFi for data ingestion and transformation
Implement data pipelines using tools like Apache Spark or Apache Flink
Q21. How to migrate data from a local server to AWS Redshift
To migrate data from a local server to AWS Redshift, you can use various methods such as AWS Database Migration Service, AWS Glue, or manual ETL processes.
Use AWS Database Migration Service (DMS) to replicate data from the local server to Redshift
Create a DMS replication instance and endpoints for the source and target databases
Configure the replication task to specify the source and target endpoints, table mappings, and transformation rules
Start the replication task to migra...read more
Q22. What is Spark? Why is it so popular?
Spark is a fast and general-purpose cluster computing system for big data processing.
Spark is popular for its speed and ease of use in processing large datasets.
It provides in-memory processing capabilities, making it faster than traditional disk-based processing systems.
Spark supports multiple programming languages like Java, Scala, Python, and R.
It offers a wide range of libraries for diverse tasks such as SQL, streaming, machine learning, and graph processing.
Spark can run...read more
Q23. What is MERGE Statement used for?
MERGE statement is used to perform insert, update, or delete operations in a single statement based on a condition.
Combines INSERT, UPDATE, and DELETE operations into a single statement
Helps to avoid multiple separate statements for different operations
Useful for synchronizing data between two tables based on a condition
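A sketch of the MERGE syntax for upserting a staging table into a target. It is written here as Spark SQL, which requires Delta tables; the same statement shape exists in SQL Server, Oracle, and Snowflake. The table and column names are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumes customers and customers_staging already exist as Delta tables.
spark.sql("""
    MERGE INTO customers AS t
    USING customers_staging AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET t.email = s.email
    WHEN NOT MATCHED THEN INSERT (customer_id, email) VALUES (s.customer_id, s.email)
""")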
Q24. What types of Spark join strategies are there?
Spark join strategies include broadcast join, shuffle hash join, and shuffle sort merge join.
Broadcast join is used when one of the DataFrames is small enough to fit in memory on all nodes.
Shuffle hash join is used when joining two large DataFrames by partitioning and shuffling the data based on the join key.
Shuffle sort merge join is used when joining two large DataFrames by sorting and merging the data based on the join key.
Q25. What is your current CTC and expected CTC?
I am currently earning X amount and my expected salary is Y amount.
Current CTC is X amount
Expected CTC is Y amount
Q26. How do you add a column to a DataFrame?
To add a column in a df, use the df['new_column'] = value syntax.
Use the df['new_column'] = value syntax to add a new column to a DataFrame.
Value can be a single value, a list, or a Series.
Example: df['new_column'] = 10
Q27. Corrupt Record Handling in Spark
Corrupt record handling in Spark involves identifying and handling data that does not conform to expected formats.
Use DataFrameReader option("badRecordsPath", "path/to/bad/records") to save corrupt records to a separate location for further analysis.
Use DataFrame.na.drop() or DataFrame.na.fill() to handle corrupt records by dropping or filling missing values.
Implement custom logic to identify and handle corrupt records based on specific requirements.
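A minimal sketch of the built-in read modes for corrupt records; the schema, column names, and input path are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("_corrupt_record", StringType()),   # receives rows that fail to parse
])

# PERMISSIVE keeps bad rows in _corrupt_record, DROPMALFORMED silently drops them,
# FAILFAST aborts the read on the first bad row.
df = (spark.read
      .schema(schema)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("/data/input.json"))

bad_rows = df.filter("_corrupt_record IS NOT NULL")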
Q28. What is Magic Table in SQL?
Magic Table in SQL is a temporary table that is automatically created and populated with data during triggers execution.
Magic Tables are known as the 'inserted' and 'deleted' tables in SQL Server.
They are used in triggers to access the data that was inserted, updated, or deleted in a table.
For example, in an 'AFTER INSERT' trigger, the Magic Table contains the rows that were just inserted.
Q29. How do you design data pipelines
Data pipelines are designed by identifying data sources, defining data transformations, and selecting appropriate tools and technologies.
Identify data sources and understand their structure and format
Define data transformations and processing steps
Select appropriate tools and technologies for data ingestion, processing, and storage
Consider scalability, reliability, and performance requirements
Implement error handling and data quality checks
Monitor and optimize the data pipeli...read more
Q30. Explain the data engineering life cycle and its tools
Data engineer life cycle involves collecting, storing, processing, and analyzing data using various tools.
Data collection: Gathering data from various sources such as databases, APIs, and logs.
Data storage: Storing data in databases, data lakes, or data warehouses.
Data processing: Cleaning, transforming, and enriching data using tools like Apache Spark or Hadoop.
Data analysis: Analyzing data to extract insights and make data-driven decisions.
Tools: Examples of tools used in d...read more
Q31. Difference between partitioning and clustering in GCP
Partitioning is dividing data into smaller parts for better management, while clustering is grouping similar data together for efficient querying.
Partitioning is used to divide data into smaller chunks based on a specific column or key, which helps in managing and querying large datasets efficiently.
Clustering is used to group similar rows of data together physically on disk, which can improve query performance by reducing the amount of data that needs to be scanned.
In Google...read more
Q32. What is a CTE in SQL?
CTE stands for Common Table Expression in SQL, used to create temporary result sets that can be referenced within a query.
CTEs improve readability and maintainability of complex queries
They can be recursive, allowing for hierarchical data querying
CTEs are defined using the WITH keyword followed by the CTE name and query
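A small sketch of a CTE, run here through spark.sql; the orders view and the 100 threshold are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([(1, 500), (2, 90)], ["customer_id", "amount"]) \
     .createOrReplaceTempView("orders")

# The CTE (high_value) is defined once with WITH and then referenced like a table.
spark.sql("""
    WITH high_value AS (
        SELECT customer_id, amount FROM orders WHERE amount > 100
    )
    SELECT customer_id FROM high_value
""").show()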
Q33. How do you optimize Spark jobs?
Optimizing Spark jobs involves tuning configurations, optimizing code, and utilizing resources efficiently.
Tune Spark configurations such as executor memory, cores, and parallelism
Optimize code by reducing unnecessary shuffles, caching intermediate results, and using efficient transformations
Utilize resources efficiently by monitoring job performance, scaling cluster resources as needed, and optimizing data storage formats
Q34. How can we optimize a stored procedure (SP)?
Optimizing stored procedures involves improving performance by reducing execution time and resource usage.
Identify and eliminate unnecessary or redundant code
Use appropriate indexing to speed up data retrieval
Avoid using cursors and loops for better performance
Update statistics regularly to help the query optimizer make better decisions
Consider partitioning large tables to improve query performance
Q35. How to avoid data skewness?
Avoid data skewness by partitioning data, using sampling techniques, and optimizing queries.
Partition data to distribute evenly across nodes
Use sampling techniques to analyze data distribution
Optimize queries to prevent skewed data distribution
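One common fix for a skewed join key is salting; the sketch below spreads a hot key across several sub-keys (the fact/dimension data and the salt count of 8 are hypothetical). For a small dimension table, a broadcast join is the simpler alternative.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

facts = spark.createDataFrame([(1, 500), (1, 300), (2, 100)], ["key", "amount"])
dim = spark.createDataFrame([(1, "hot"), (2, "cold")], ["key", "label"])

n_salts = 8
# Spread rows of the hot key across n_salts random sub-keys...
salted_facts = facts.withColumn("salt", (F.rand() * n_salts).cast("long"))
# ...and replicate each dimension row once per salt value so the join still matches.
salted_dim = dim.crossJoin(spark.range(n_salts).withColumnRenamed("id", "salt"))

joined = salted_facts.join(salted_dim, on=["key", "salt"]).drop("salt")
joined.show()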
Q36. What are integration runtimes?
Integration runtimes are compute infrastructure used by Azure Data Factory to provide data integration capabilities across different network environments.
Integration runtimes can be self-hosted or Azure-hosted.
They are used to move data between cloud and on-premises data stores.
Integration runtimes provide connectivity to various data sources and destinations.
Examples include Azure Integration Runtime and Self-hosted Integration Runtime.
Q37. What is your Expected CTC?
My expected CTC is based on industry standards, my experience, and the responsibilities of the role.
My expected CTC is in line with the market rates for Data Engineers with similar experience and skills.
I have taken into consideration the responsibilities and requirements of the role when determining my expected CTC.
I am open to negotiation based on the overall compensation package offered by the company.
Q38. Difference between DataStage and Informatica
Datastage and Informatica are both ETL tools used for data integration, but they have differences in terms of features and capabilities.
Datastage is developed by IBM and is known for its parallel processing capabilities, while Informatica is developed by Informatica Corporation and is known for its strong data quality features.
Datastage has a more user-friendly interface compared to Informatica, making it easier for beginners to use.
Informatica offers more advanced features f...read more
Q39. What is cloud computing?
Cloud computing is the delivery of computing services over the internet, including storage, databases, networking, software, and more.
Cloud computing allows users to access and use resources on-demand without the need for physical infrastructure.
Examples of cloud computing services include Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform.
It offers scalability, flexibility, cost-effectiveness, and the ability to access resources from anywhere with an inter...read more
Q40. The 5 Vs of data
The 5 Vs of data are Volume, Velocity, Variety, Veracity, and Value.
Volume refers to the amount of data being generated and stored.
Velocity refers to the speed at which data is being generated and processed.
Variety refers to the different types of data being generated, such as structured, unstructured, and semi-structured data.
Veracity refers to the accuracy and reliability of the data.
Value refers to the usefulness and relevance of the data to the organization.
Q41. Name a few functions used in PL/SQL
Some functions used in PL/SQL include TO_CHAR, TO_DATE, NVL, and CONCAT.
TO_CHAR: Converts a number or date to a string
TO_DATE: Converts a string to a date
NVL: Replaces NULL values with a specified default value
CONCAT: Concatenates two or more strings
Q42. How do you improve performance?
Improving performance in data engineering involves optimizing code, utilizing efficient algorithms, and scaling infrastructure.
Optimize code by reducing unnecessary computations and improving data processing efficiency.
Utilize efficient algorithms and data structures to minimize time and space complexity.
Scale infrastructure by leveraging cloud services, parallel processing, and distributed computing.
Monitor performance metrics and conduct regular performance tuning to identi...read more
Q43. What is data engineering
Data engineering involves designing, building, and maintaining data pipelines to collect, store, and process data for analysis.
Designing and implementing data pipelines to collect and process data from various sources
Building and maintaining data infrastructure such as databases and data warehouses
Optimizing data workflows for efficiency and scalability
Collaborating with data scientists and analysts to ensure data quality and availability
Using tools like Apache Spark, Hadoop,...read more
Q44. SCD 1 vs SCD 2
SCD 1 overwrites old data with new data, while SCD 2 keeps track of historical changes.
SCD 1 updates existing records with new data, losing historical information.
SCD 2 creates new records for each change, preserving historical data.
SCD 1 is simpler and faster, but can lead to data loss.
SCD 2 is more complex and slower, but maintains a full history of changes.
Q45. What is Databricks?
Databricks is a unified analytics platform that provides collaborative environment for data scientists, engineers, and analysts.
Databricks allows users to write and run Apache Spark code in a collaborative environment.
It integrates with popular programming languages like Python, Scala, and SQL.
Databricks provides tools for data visualization, machine learning, and data engineering.
It offers automated cluster management and optimization for Spark jobs.
Databricks is commonly us...read more
Q46. What is SQL, and what are its types?
SQL is a programming language used for managing and manipulating relational databases.
SQL stands for Structured Query Language
SQL command types include DDL, DML, DCL, and TCL; popular SQL databases include MySQL, PostgreSQL, Oracle, and SQL Server
SQL is used for querying, updating, and managing databases
Common SQL commands include SELECT, INSERT, UPDATE, DELETE
Q47. What is the OOPs concept?
Object-oriented programming (OOP) is a programming paradigm based on the concept of objects, which can contain data in the form of fields and code in the form of procedures.
OOP focuses on creating objects that interact with each other to solve a problem
Key concepts include encapsulation, inheritance, polymorphism, and abstraction
Encapsulation involves bundling data and methods that operate on the data into a single unit
Inheritance allows a class to inherit properties and beha...read more
Q48. What is spark-submit?
Spark submit is a command-line tool used to submit Spark applications to a cluster.
Spark submit is used to launch Spark applications on a cluster.
It is a command-line interface that allows users to specify the application's main class or JAR file, along with other configuration options.
Spark submit handles the deployment of the application code and resources to the cluster, and manages the execution of the application.
It supports various options for configuring the applicatio...read more
Q49. Are you ready to relocate?
Yes, I am open to relocating for the right opportunity.
I am willing to relocate for the right job opportunity
I have relocated in the past for work
I am flexible and open to new experiences
Q50. Types of cloud computing
Types of cloud computing include public, private, hybrid, and multicloud.
Public cloud: Services are delivered over the internet and shared across multiple organizations. Example: AWS, Azure, Google Cloud
Private cloud: Services are maintained on a private network and dedicated to a single organization. Example: VMware, OpenStack
Hybrid cloud: Combination of public and private clouds, allowing data and applications to be shared between them. Example: AWS Outposts, Azure Stack
Mul...read more
Q51. Reverse strings in a Python list
Reverse strings in a Python list
Use list comprehension to iterate through the list and reverse each string
Use the slice notation [::-1] to reverse each string
Example: strings = ['hello', 'world'], reversed_strings = [s[::-1] for s in strings]
Q52. Find the 2nd highest salary in SQL
To find the 2nd highest salary, order salaries in descending order and skip the first one, or use a subquery or window function.
Use ORDER BY salary DESC with LIMIT 1 OFFSET 1 to skip the top salary and return the next one.
Alternatively, select MAX(salary) from the rows whose salary is below the overall MAX(salary).
Use DISTINCT or DENSE_RANK to handle ties correctly; see the sketch below.
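A hedged sketch of both approaches, run through spark.sql against a made-up employees view:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([("a", 100), ("b", 90), ("c", 90), ("d", 80)], ["name", "salary"]) \
     .createOrReplaceTempView("employees")

# Subquery approach: the highest salary that is below the overall maximum.
spark.sql("""
    SELECT MAX(salary) AS second_highest
    FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
""").show()

# Window approach: DENSE_RANK handles ties cleanly.
spark.sql("""
    SELECT DISTINCT salary
    FROM (SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk FROM employees) AS ranked
    WHERE rk = 2
""").show()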
Q53. Optimizations in PySpark
Optimizations in pyspark involve techniques to improve performance and efficiency of data processing.
Use partitioning to distribute data evenly across nodes for parallel processing
Utilize caching to store intermediate results in memory for faster access
Avoid unnecessary shuffling of data by using appropriate join strategies
Optimize the execution plan by analyzing and adjusting the stages of the job
Use broadcast variables for small lookup tables to reduce data transfer
Q54. What is a Spark RDD?
Spark RDD stands for Resilient Distributed Dataset, which is a fundamental data structure in Apache Spark.
RDD is an immutable distributed collection of objects that can be operated on in parallel.
It allows for fault-tolerant distributed data processing in Spark.
RDDs can be created from Hadoop InputFormats, local collections, or by transforming other RDDs.
Operations on RDDs are lazily evaluated, allowing for efficient data processing.
Example: val rdd = sc.parallelize(List(1, 2...read more
Q55. Queue & Stack Algorithm
Queue & Stack Algorithm involves data structures for storing and retrieving data in a specific order.
Queue follows First In First Out (FIFO) principle, like a line at a grocery store.
Stack follows Last In First Out (LIFO) principle, like a stack of plates.
Examples: Queue - BFS algorithm in graph traversal. Stack - Undo feature in text editors.
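A tiny Python sketch of both structures, using collections.deque for the queue and a plain list for the stack:

from collections import deque

# Queue: first in, first out.
queue = deque()
queue.append("first")        # enqueue
queue.append("second")
print(queue.popleft())       # dequeue -> "first"

# Stack: last in, first out.
stack = []
stack.append("first")        # push
stack.append("second")
print(stack.pop())           # pop -> "second"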
Q56. Writing PySpark code
Writing pyspark codes involves using PySpark API to process big data in a distributed computing environment.
Use PySpark API to create SparkContext and SparkSession objects
Utilize transformations like map, filter, reduceByKey, etc. to process data
Implement actions like collect, count, saveAsTextFile, etc. to trigger computation
Optimize performance by caching RDDs and using broadcast variables
Handle errors and exceptions using try-except blocks
Q57. SQL code with live examples
Writing SQL queries, illustrated with common examples
Use SELECT statement to retrieve data from a database table
Use WHERE clause to filter data based on specific conditions
Use JOIN to combine rows from two or more tables based on a related column
Q58. Explain your project.
Developed a data pipeline to ingest, clean, and analyze customer feedback data for product improvements.
Used Apache Kafka for real-time data streaming
Implemented data cleaning and transformation using Python and Pandas
Utilized SQL for data analysis and visualization
Collaborated with product managers to identify key insights for product enhancements
Q59. Optimization techniques
Optimizing techniques are essential for improving data processing efficiency.
Utilize indexing to speed up data retrieval operations
Implement caching mechanisms to reduce redundant data processing
Use parallel processing to distribute workloads and improve performance
Optimize database queries by analyzing and restructuring them for efficiency
Q60. Indexing in SQL
Indexing in SQL improves query performance by creating a data structure that allows for faster retrieval of data.
Indexes are created on columns in a table to speed up SELECT queries.
Types of indexes include clustered, non-clustered, unique, and composite indexes.
Examples of creating an index: CREATE INDEX idx_name ON table_name(column_name);
Q61. Types of hashing
Hashing is a technique used to convert data into a fixed-size string of bytes.
Hash functions are used to map data of arbitrary size to fixed-size values.
Common hashing algorithms include MD5, SHA-1, and SHA-256.
Hashing is commonly used in data security, password storage, and data retrieval.
Q62. Write SQL queries
SQL queries for data manipulation and retrieval
Use SELECT statement to retrieve data from a table
Use WHERE clause to filter data based on specific conditions
Use JOIN clause to combine data from multiple tables
Use GROUP BY clause to group data based on a specific column
Use ORDER BY clause to sort the results in ascending or descending order
Q63. Rotation of Array
Rotate an array of strings by a given number of positions.
Create a new array and copy elements from the original array based on the rotation index.
Handle cases where the rotation index is greater than the array length by using modulo operation.
Example: Original array ['a', 'b', 'c', 'd', 'e'], rotate by 2 positions -> ['c', 'd', 'e', 'a', 'b']
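A short Python sketch of the rotation described above, with the modulo guard for rotation counts larger than the list:

def rotate(arr, k):
    """Rotate a list of strings to the left by k positions."""
    if not arr:
        return arr
    k = k % len(arr)              # handles k greater than the list length
    return arr[k:] + arr[:k]

print(rotate(["a", "b", "c", "d", "e"], 2))   # ['c', 'd', 'e', 'a', 'b']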