TCS Data Engineer Interview Questions
Distribution keys in Amazon Redshift determine how a table's rows are distributed across the nodes of a cluster.
Choosing the right distribution key minimizes data movement between nodes during query execution, which improves performance.
For example, if you frequently join two tables on a common column, setting that column as the distribution key for both tables co-locates matching rows on the same node, so the join needs no redistribution; a DDL sketch follows.
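A minimal sketch of that pattern, shown as Redshift DDL strings inside Python since it needs a live Redshift cluster to execute; the orders and customers tables are hypothetical.

```python
# Hypothetical Redshift DDL: distribute both tables on customer_id so
# rows with the same key land on the same slice (DISTSTYLE KEY).
create_orders = """
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(10, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id);
"""

create_customers = """
CREATE TABLE customers (
    customer_id BIGINT,
    name        VARCHAR(100)
)
DISTSTYLE KEY
DISTKEY (customer_id);
"""
# With matching DISTKEYs, `orders JOIN customers ON customer_id`
# can be evaluated locally on each node, with no cross-node shuffle.
```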
Pyspark is a powerful tool for big data processing using Python, enabling efficient data manipulation and analysis.
Pyspark allows distributed data processing using RDDs (Resilient Distributed Datasets). Example: rdd = spark.sparkContext.parallelize([1, 2, 3])
DataFrames in Pyspark provide a higher-level abstraction for structured data. Example: df = spark.read.csv('file.csv')
Pyspark supports SQL queries through the Spark SQL module, by registering a DataFrame as a temporary view and querying it with spark.sql(); a combined sketch of the three APIs follows.
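A minimal, runnable sketch of the three APIs mentioned above, assuming a local PySpark installation; the data is invented.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-demo").getOrCreate()

# RDD API: low-level distributed collection
rdd = spark.sparkContext.parallelize([1, 2, 3])
print(rdd.map(lambda x: x * 2).collect())  # [2, 4, 6]

# DataFrame API: higher-level, structured abstraction
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()

# Spark SQL: query a DataFrame registered as a temporary view
df.createOrReplaceTempView("t")
spark.sql("SELECT id FROM t WHERE label = 'b'").show()

spark.stop()
```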
Lists are mutable, ordered collections; tuples are immutable, ordered collections in Python.
Lists are defined using square brackets: `my_list = [1, 2, 3]`.
Tuples are defined using parentheses: `my_tuple = (1, 2, 3)`.
Lists can be modified (add/remove elements): `my_list.append(4)`.
Tuples cannot be modified after creation: `my_tuple[0] = 10` raises an error.
Lists have more built-in methods (e.g., sort, reverse) compared to tuples; a short demonstration follows.
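A quick runnable demonstration of the difference:

```python
# Lists are mutable: they can grow and be modified in place.
my_list = [1, 2, 3]
my_list.append(4)      # OK: add an element
my_list[0] = 10        # OK: reassign an element
print(my_list)         # [10, 2, 3, 4]

# Tuples are immutable: item assignment raises TypeError.
my_tuple = (1, 2, 3)
try:
    my_tuple[0] = 10
except TypeError as e:
    print(e)           # 'tuple' object does not support item assignment

# A consequence: tuples (of hashable elements) can be dict keys; lists cannot.
coords = {(0, 0): "origin"}
print(coords[(0, 0)])  # origin
```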
Data engineer with 5 years of experience in building scalable data pipelines and optimizing data workflows for analytics.
5 years of experience in data engineering, focusing on ETL processes and data warehousing.
Proficient in Python and SQL for data manipulation and analysis.
Experience with cloud platforms like AWS and Azure for deploying data solutions.
Implemented a real-time data pipeline using Apache Kafka, improving real-time data processing.
Worked on a data pipeline project to streamline ETL processes for a retail analytics platform.
Designed and implemented ETL processes using Apache Airflow.
Utilized AWS services like S3 for storage and Redshift for data warehousing.
Optimized data transformation scripts in Python for performance improvements.
Collaborated with data scientists to ensure data quality and accessibility; a sketch of the pipeline's orchestration layer follows.
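As a rough illustration of what such a pipeline looks like in code, here is a minimal Airflow DAG sketch (assuming Airflow 2.4+); the dag_id, task names, and task bodies are hypothetical, not the project's actual code.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data, e.g. from S3")

def transform():
    print("clean and reshape with Python")

def load():
    print("load into Redshift")

with DAG(
    dag_id="retail_etl",            # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",              # Airflow 2.4+ parameter
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3                  # run the steps in sequence
```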
Query optimization improves database performance by making queries execute more efficiently and consume fewer resources.
Use indexes to speed up data retrieval. For example, indexing a 'last_name' column can improve search performance in large tables.
Avoid SELECT *; specify only the columns needed to reduce data transfer and processing time.
Use WHERE clauses to filter data early in the query process, minimizing the number of rows scanned and processed; a runnable sketch of these tips follows.
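A small sqlite3 demonstration of the three tips above; the employees table and its data are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (id INTEGER, last_name TEXT, salary REAL)")
cur.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Smith", 50000), (2, "Jones", 60000), (3, "Smith", 55000)],
)

# Tip 1: index the column used for lookups so the engine can seek, not scan.
cur.execute("CREATE INDEX idx_last_name ON employees (last_name)")

# Tips 2 and 3: select only the needed columns and filter early with WHERE.
cur.execute("SELECT id, salary FROM employees WHERE last_name = ?", ("Smith",))
print(cur.fetchall())

# SQLite's EXPLAIN QUERY PLAN shows the index being used for the search.
cur.execute("EXPLAIN QUERY PLAN SELECT id FROM employees WHERE last_name = 'Smith'")
print(cur.fetchall())
conn.close()
```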
Apache Spark architecture includes a cluster manager, worker nodes, and driver program.
Apache Spark architecture consists of a cluster manager, such as YARN or Mesos, which allocates resources and schedules tasks.
Worker nodes execute the tasks and store data in memory or disk.
The driver program coordinates the execution of the application and interacts with the cluster manager to distribute tasks.
Spark applications run as independent sets of processes on the cluster, coordinated by the SparkContext in the driver program; a short sketch follows.
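A sketch of how those pieces map to code: the script itself is the driver program, and the master URL names the cluster manager to contact; local[*] is used here so the snippet runs on a single machine.

```python
from pyspark.sql import SparkSession

# This script is the driver; the master URL selects the cluster manager
# ("yarn", "mesos://...", or "local[*]" for a single machine, used here).
spark = (
    SparkSession.builder
    .appName("architecture-demo")
    .master("local[*]")  # swap for "yarn" on a real cluster
    .getOrCreate()
)

# The driver splits the job into tasks; executors on worker nodes run them.
result = (
    spark.sparkContext
    .parallelize(range(1000), numSlices=8)  # 8 partitions -> 8 parallel tasks
    .map(lambda x: x * x)
    .sum()
)
print(result)  # 332833500
spark.stop()
```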
Cloud computing is the delivery of computing services over the internet, including storage, databases, networking, software, and more.
Cloud computing allows users to access and use resources on-demand without the need for physical infrastructure.
Examples of cloud computing services include Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform.
It offers scalability, flexibility, cost-effectiveness, and pay-as-you-go pricing.
SQL queries for data manipulation and retrieval
Use SELECT statement to retrieve data from a table
Use WHERE clause to filter data based on specific conditions
Use JOIN clause to combine data from multiple tables
Use GROUP BY clause to group data based on a specific column
Use ORDER BY clause to sort the results in ascending or descending order; the example below combines all five clauses.
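A runnable sqlite3 example exercising all five clauses in one query; the schema and data are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")])
cur.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 100.0), (2, 1, 250.0), (3, 2, 75.0)],
)

cur.execute("""
    SELECT c.name, SUM(o.amount) AS total      -- SELECT: pick columns
    FROM orders o
    JOIN customers c ON o.customer_id = c.id   -- JOIN: combine tables
    WHERE o.amount > 50                        -- WHERE: filter rows early
    GROUP BY c.name                            -- GROUP BY: aggregate per customer
    ORDER BY total DESC                        -- ORDER BY: sort the results
""")
print(cur.fetchall())  # [('Asha', 350.0), ('Ravi', 75.0)]
conn.close()
```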
Some functions used in PL/SQL include TO_CHAR, TO_DATE, NVL, and CONCAT.
TO_CHAR: Converts a number or date to a string
TO_DATE: Converts a string to a date
NVL: Replaces NULL values with a specified default value
CONCAT: Concatenates two strings; an illustrative query using all four functions follows.
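An illustrative Oracle-style query using the four functions; it is shown only as a string, since running it requires an Oracle database (the employees table and its columns follow Oracle's sample HR schema and are used here just for illustration).

```python
# Illustrative Oracle SQL; not executable without an Oracle connection.
query = """
SELECT
    TO_CHAR(hire_date, 'YYYY-MM-DD')    AS hired_on,     -- date -> string
    TO_DATE('2024-01-15', 'YYYY-MM-DD') AS parsed_date,  -- string -> date
    NVL(commission_pct, 0)              AS commission,   -- default for NULL
    CONCAT(first_name, last_name)       AS full_name     -- join two strings
FROM employees
"""
```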
I appeared for an interview in Apr 2025, where I was asked the following questions.
I applied via Walk-in
RANK and DENSE_RANK both give tied rows the same rank; the difference is that RANK leaves gaps in the sequence after a tie (1, 1, 3, ...), while DENSE_RANK does not (1, 1, 2, ...). A left join includes all rows from the left table plus the matching rows from the right table, while a left anti join returns only the left-table rows that have no match in the right table. A PySpark sketch of both distinctions follows.
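A minimal sketch, assuming a working PySpark installation; the data and column names are invented.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import rank, dense_rank, col

spark = SparkSession.builder.appName("rank-demo").getOrCreate()

scores = spark.createDataFrame(
    [("a", 90), ("b", 90), ("c", 80)], ["student", "score"]
)
w = Window.orderBy(col("score").desc())
scores.select(
    "student",
    rank().over(w).alias("rank"),         # ties share a rank; next rank skips
    dense_rank().over(w).alias("dense"),  # ties share a rank; no gaps
).show()
# rank: a=1, b=1, c=3    dense_rank: a=1, b=1, c=2

left = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "v"])
right = spark.createDataFrame([(1, "z")], ["id", "w"])
left.join(right, "id", "left").show()       # all left rows, matched or not
left.join(right, "id", "left_anti").show()  # only unmatched left rows (id=2)
spark.stop()
```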
I applied via Recruitment Consultant and was interviewed in Aug 2024. There were 2 interview rounds.
Focus on quantitative maths and aptitude a bit more.
I applied via LinkedIn and was interviewed in Oct 2024. There was 1 interview round.
Reverse strings in a Python list
Use list comprehension to iterate through the list and reverse each string
Use the slice notation [::-1] to reverse each string
Example: strings = ['hello', 'world'], reversed_strings = [s[::-1] for s in strings]
To find the 2nd highest salary in SQL, use a SELECT statement with ORDER BY, LIMIT, and OFFSET.
Use the 'SELECT' statement to retrieve the salary column from the table.
Use the 'ORDER BY' clause to sort the salaries in descending order.
Use LIMIT 1 OFFSET 1 to skip the highest salary and return only the next row; SELECT DISTINCT salary guards against ties for the top salary. A runnable example follows.
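A runnable sqlite3 version of the query; the table and data are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
cur.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("A", 90000), ("B", 90000), ("C", 80000), ("D", 70000)],
)
cur.execute("""
    SELECT DISTINCT salary      -- DISTINCT: duplicate top salaries count once
    FROM employees
    ORDER BY salary DESC        -- highest first
    LIMIT 1 OFFSET 1            -- skip the highest, return the next
""")
print(cur.fetchone())  # (80000,)
conn.close()
```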
I appeared for an interview in Sep 2024.
I applied via Approached by Company and was interviewed in Sep 2024. There was 1 interview round.
SCD 1 overwrites old data with new data, while SCD 2 keeps track of historical changes.
SCD 1 updates existing records with new data, losing historical information.
SCD 2 creates new records for each change, preserving historical data.
SCD 1 is simpler and faster, but can lead to data loss.
SCD 2 is more complex and slower, but maintains a full history of changes; example statements for both approaches follow.
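A hedged sketch in standard SQL, shown as Python strings; the dim_customer table and its columns (start_date, end_date, is_current) are hypothetical.

```python
scd1_update = """
-- SCD Type 1: overwrite in place; the previous address is lost
UPDATE dim_customer
SET address = 'New Address'
WHERE id = 42;
"""

scd2_upsert = """
-- SCD Type 2: close out the current row, then insert a new current row
UPDATE dim_customer
SET end_date = DATE('now'), is_current = 0
WHERE id = 42 AND is_current = 1;

INSERT INTO dim_customer (id, address, start_date, end_date, is_current)
VALUES (42, 'New Address', DATE('now'), NULL, 1);
"""
```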
Corrupt record handling in Spark involves identifying and handling data that does not conform to expected formats.
Use the DataFrameReader option("badRecordsPath", "path/to/bad/records") (a Databricks-specific option) to save corrupt records to a separate location for further analysis.
Use DataFrame.na.drop() or DataFrame.na.fill() to handle corrupt records by dropping or filling missing values.
Implement custom logic to identify and handle corrupt records that the built-in options do not cover; a sketch using Spark's parse modes follows.
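A minimal sketch using Spark's built-in parse modes on a JSON source; the input path and schema are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("corrupt-demo").getOrCreate()

schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("_corrupt_record", StringType()),  # captures unparseable rows
])

df = (
    spark.read
    .schema(schema)
    .option("mode", "PERMISSIVE")  # keep bad rows; also: DROPMALFORMED, FAILFAST
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .json("path/to/input.json")    # hypothetical path
)

df.cache()  # some Spark versions require this before querying only the corrupt column
good = df.filter(df["_corrupt_record"].isNull()).drop("_corrupt_record")
bad = df.filter(df["_corrupt_record"].isNotNull())  # route to quarantine/analysis
```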
Object-oriented programming (OOP) is a programming paradigm based on the concept of objects, which can contain data in the form of fields and code in the form of procedures.
OOP focuses on creating objects that interact with each other to solve a problem
Key concepts include encapsulation, inheritance, polymorphism, and abstraction
Encapsulation involves bundling data and the methods that operate on that data into a single unit; a short Python illustration follows.
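A tiny Python illustration of the four concepts; the class names are invented.

```python
class Animal:
    def __init__(self, name):
        self._name = name          # encapsulation: data + methods in one unit

    def speak(self):               # abstraction: callers see only the interface
        raise NotImplementedError

class Dog(Animal):                 # inheritance: Dog reuses Animal
    def speak(self):               # polymorphism: same call, subclass behavior
        return f"{self._name} says woof"

class Cat(Animal):
    def speak(self):
        return f"{self._name} says meow"

for pet in (Dog("Rex"), Cat("Mia")):
    print(pet.speak())             # dispatches to each subclass's speak()
```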
Data engineer life cycle involves collecting, storing, processing, and analyzing data using various tools.
Data collection: Gathering data from various sources such as databases, APIs, and logs.
Data storage: Storing data in databases, data lakes, or data warehouses.
Data processing: Cleaning, transforming, and enriching data using tools like Apache Spark or Hadoop.
Data analysis: Analyzing data to extract insights and make data-driven decisions; a compact sketch of the cycle follows.
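A compact PySpark sketch of the four stages, assuming an S3-backed data lake; the paths and column names are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.appName("lifecycle").getOrCreate()

# Collect: read raw data from a source (placeholder path)
raw = spark.read.csv("s3://bucket/raw/events.csv", header=True, inferSchema=True)

# Process: clean and deduplicate
clean = raw.dropna().dropDuplicates()

# Store: persist curated data to the warehouse/lake layer
clean.write.mode("overwrite").parquet("s3://bucket/curated/")

# Analyze: aggregate for insight (hypothetical columns)
clean.groupBy("category").agg(avg("value")).show()
```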
Spark join strategies include broadcast join, shuffle hash join, and shuffle sort merge join.
Broadcast join is used when one of the DataFrames is small enough to fit in memory on all nodes.
Shuffle hash join is used when joining two large DataFrames by partitioning and shuffling the data based on the join key.
Shuffle sort merge join is used when joining two large DataFrames by sorting and merging the data based on the join key; a PySpark sketch follows.
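A sketch of forcing a broadcast join in PySpark and inspecting the chosen strategy; the tables are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-demo").getOrCreate()

big = spark.range(1_000_000).withColumnRenamed("id", "key")
small = spark.createDataFrame([(i, f"dim_{i}") for i in range(10)], ["key", "label"])

# Broadcast join: ship the small table to every executor; `big` is not shuffled.
joined = big.join(broadcast(small), "key")
joined.explain()  # physical plan shows BroadcastHashJoin

# Without the hint, Spark picks a strategy itself (influenced by
# spark.sql.autoBroadcastJoinThreshold); large-large joins typically
# appear as SortMergeJoin in the plan.
spark.stop()
```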
Spark is a fast and general-purpose cluster computing system for big data processing.
Spark is popular for its speed and ease of use in processing large datasets.
It provides in-memory processing capabilities, making it faster than traditional disk-based processing systems.
Spark supports multiple programming languages like Java, Scala, Python, and R.
It offers a wide range of libraries for diverse tasks such as SQL, streaming, machine learning, and graph processing.
Clustering is the process of grouping similar data points together. Pods are groups of one or more containers, while nodes are individual machines in a cluster.
Clustering is a technique used in machine learning to group similar data points together based on certain features or characteristics.
Pods in a cluster are groups of one or more containers that share resources and are scheduled together on the same node.
Nodes are the individual machines (physical or virtual) in a cluster that supply the resources on which pods are scheduled.
The duration of the TCS Data Engineer interview process varies, but based on 101 interview experiences it typically takes less than 2 weeks to complete.