7 Eleven
Thomson Reuters Interview Questions and Answers
Q1. Write a query to get the customer with the highest total order value for each year, month. [Note: Order table is different and Customer table is different. Order_ID and Customer_ID are the PK of the table with ...
read moreQuery to get the customer with the highest total order value for each year, month.
Join the Order and Customer tables on the foreign key
Group the results by year, month, and customer
Calculate the total order value for each group
Find the maximum total order value for each year, month
If there are multiple customers with the same highest total order value, select the one with the lower Customer_ID
Q2. There are 10 million records in the table and the schema does not contain the ModifiedDate column. One cell was modified the next day in the table. How will you fetch that particular information that needs to b...
read moreTo fetch the modified information without ModifiedDate column from a table with 10 million records.
Create a trigger to capture the modified information and insert it into a separate table with ModifiedDate column.
Use a tool like Change Data Capture (CDC) to track changes in the table and extract the modified information.
Use a query to compare the current table with a backup copy taken the previous day to identify the modified information.
Q3. How do you handle data pipeline when the schema information keeps changing at the source?
Handle changing schema by using schema evolution techniques and version control.
Use schema evolution techniques like adding new fields, renaming fields, and changing data types.
Implement version control to track changes and ensure backward compatibility.
Use tools like Apache Avro or Apache Parquet to store data in a self-describing format.
Implement automated testing to ensure data quality and consistency.
Collaborate with data producers to establish clear communication and doc...read more
Q4. What is Normalisation and Denormalisation? When do we use them? Give a real-time example that is implemented in your project.
Normalisation is the process of organizing data in a database to reduce redundancy and improve data integrity. Denormalisation is the opposite process.
Normalisation is used to eliminate data redundancy and improve data integrity.
Denormalisation is used to improve query performance by reducing the number of joins required.
A real-time example of normalisation is breaking down a customer's information into separate tables such as customer details, order details, and payment deta...read more
Q5. Difference between Parquet and ORC file. Why industry uses parquet over ORC? Can schema evolution happen in ORC?
Parquet and ORC are columnar storage formats. Parquet is preferred due to its cross-platform support and better compression. ORC supports schema evolution.
Parquet is a columnar storage format that is widely used in the industry due to its cross-platform support and better compression.
ORC is another columnar storage format that supports schema evolution.
Parquet is preferred over ORC due to its better compression and support for a wider range of programming languages.
ORC is pre...read more
Q6. What are the different types of schema you know in Data Warehousing?
There are three types of schema in Data Warehousing: Star Schema, Snowflake Schema, and Fact Constellation Schema.
Star Schema: central fact table connected to dimension tables in a star shape
Snowflake Schema: extension of star schema with normalized dimension tables
Fact Constellation Schema: multiple fact tables connected to dimension tables in a complex structure
Q7. Difference between Broadcast variable and accumulator variable
Broadcast variables are read-only variables that are cached on each worker node while accumulator variables are write-only variables that are used to accumulate values across multiple tasks.
Broadcast variables are used to give every node a copy of a large input dataset or a small lookup table.
Accumulator variables are used to keep a running total of values across multiple tasks.
Broadcast variables are used for read-only operations while accumulator variables are used for writ...read more
Q8. Different SCD Types
SCD (Slowly Changing Dimensions) types are used to track changes in data over time.
Type 1: Overwrite the old data with new data
Type 2: Create a new record with a new primary key
Type 3: Create a new column to store the old data
Type 4: Create a separate table to store the old data
Type 6: Hybrid of Type 1 and Type 2
Top Senior Data Engineer Interview Questions from Similar Companies
Reviews
Interviews
Salaries
Users/Month