LTIMindtree
30+ GMR Group Interview Questions and Answers
Q1. 7) How does query acceleration speed up query processing?
Query acceleration speeds up query processing by optimizing query execution and reducing the time taken to retrieve data.
Query acceleration uses techniques like indexing, partitioning, and caching to optimize query execution.
It reduces the time taken to retrieve data by minimizing disk I/O and utilizing in-memory processing.
Examples include using columnar storage formats like Parquet or optimizing join operations.
Q2. SQL: What are the conditions used in SQL? When we have a table but we want to create
SQL conditions filter rows based on specified criteria. They appear in the WHERE clause and are built from operators such as AND, OR, IN, BETWEEN, and LIKE.
Common condition operators include comparisons (=, <, >), AND, OR, NOT, IN, BETWEEN, and LIKE.
Conditions filter data in SELECT queries and also limit which rows UPDATE and DELETE statements affect.
Example: WHERE salary > 50000 AND (department = 'IT' OR age < 30)
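A minimal sketch of the operators listed above, assuming an in-memory SQLite database through Python's sqlite3 module. The employees table, its rows, and the names() helper are hypothetical and only serve the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER, age INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        ("Asha", "IT", 72000, 34),
        ("Ben", "HR", 48000, 28),
        ("Chen", "IT", 51000, 41),
        ("Dina", "Sales", 39000, 35),
    ],
)


def names(where_clause):
    """Return the sorted names that match a WHERE clause (hypothetical helper)."""
    rows = conn.execute(f"SELECT name FROM employees WHERE {where_clause}")
    return sorted(row[0] for row in rows)


high_paid_it = names("salary > 50000 AND department = 'IT'")  # both conditions must hold
it_or_young = names("department = 'IT' OR age < 30")          # either condition may hold
in_list = names("department IN ('HR', 'Sales')")              # membership in a list
mid_band = names("salary BETWEEN 40000 AND 60000")            # inclusive range
starts_with_d = names("name LIKE 'D%'")                       # pattern match
```

The helper interpolates the clause into the SQL text only to keep the sketch short; real code should pass user-supplied values as bound parameters.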
Q3. How to handle missing data in a PySpark DataFrame?
Handle missing data in a PySpark DataFrame with functions like dropna(), fillna(), or replace().
Use dropna() to remove rows that contain null values; the how, thresh, and subset arguments control which rows are dropped
Use fillna() to fill null values with a specified value, optionally per column by passing a dict
Use replace() to swap placeholder values (such as 'N/A' or -1) for another value; nulls themselves are handled by dropna() and fillna()
Q4. In Databricks, when a Spark job is submitted, what happens at the backend? Explain the flow.
When a Spark job is submitted in Databricks, several backend processes are triggered to execute it.
The Spark driver builds a DAG from the submitted job and splits it into stages and tasks.
The tasks are then scheduled to run on executors on the available worker nodes in the cluster.
The executors run the tasks and return the results to the driver.
The driver aggregates the results and presents them to the user.
Data may be shuffled between stages, and optimizations such as caching may be applied during the execution proc...
Q5. How would you delete duplicate records from a table?
To delete duplicate records from a table, you can use the DELETE statement with a self-join or subquery.
Identify the duplicate records using a self-join or subquery
Use the DELETE statement to remove the duplicate records
Consider using a temporary table to store the unique records before deleting the duplicates
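One way to sketch the subquery approach, assuming an in-memory SQLite database through Python's sqlite3 module. The customers table and the rule "keep the lowest id per email" are assumptions made for the illustration; other databases may need a different form, such as a self-join or ROW_NUMBER().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO customers (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",), ("a@example.com",), ("a@example.com",)],
)

# Keep the lowest id for each email and delete every other row.
conn.execute(
    """
    DELETE FROM customers
    WHERE id NOT IN (SELECT MIN(id) FROM customers GROUP BY email)
    """
)

remaining = conn.execute("SELECT id, email FROM customers ORDER BY id").fetchall()
```

Running the inner SELECT on its own first is a cheap way to confirm which rows will survive before deleting anything.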
Q6. How do we create a duplicate table? What are window functions? What are the types of joins? Explain each join.
To duplicate a table, use CREATE TABLE AS or INSERT INTO SELECT. Window functions are used for calculations across a set of table rows. Types of joins include INNER, LEFT, RIGHT, and FULL OUTER joins.
To duplicate a table, use CREATE TABLE AS or INSERT INTO SELECT
Window functions are used for calculations across a set of table rows
Types of joins include INNER, LEFT, RIGHT, and FULL OUTER joins
Explain each join: INNER - returns rows when there is at least one match in both tabl...
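The table copy and the window function can be sketched as follows, assuming an in-memory SQLite database (version 3.25 or later for window functions) through Python's sqlite3 module. The sales table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 100), ("East", 50), ("West", 70)],
)

# Duplicate the table (columns and rows) with CREATE TABLE ... AS SELECT.
conn.execute("CREATE TABLE sales_copy AS SELECT * FROM sales")
copied_rows = conn.execute("SELECT COUNT(*) FROM sales_copy").fetchone()[0]

# Window function: every row is kept, and each row also sees its region's total.
with_totals = conn.execute(
    """
    SELECT region, amount, SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales_copy
    ORDER BY region, amount
    """
).fetchall()
```

CREATE TABLE ... AS SELECT copies the data but generally not the indexes or constraints of the source table, so those need to be recreated separately if they matter.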
Q7. How to filter data from A dashboard to B dashboard?
Use data connectors or APIs to extract and transfer data from one dashboard to another.
Utilize data connectors or APIs provided by the dashboard platforms to extract data from A dashboard.
Transform the data as needed to match the format of B dashboard.
Use data connectors or APIs of B dashboard to transfer the filtered data from A dashboard to B dashboard.
Q8. How do you do performance optimization in Spark?
Performance optimization in Spark involves tuning configurations, optimizing code, and utilizing caching.
Tune Spark configurations such as executor memory, cores, and parallelism
Optimize code by reducing unnecessary shuffles, using efficient transformations, and avoiding unnecessary data movements
Utilize caching to store intermediate results in memory for faster access
Q9. Do you have hands-on experience with big data tools?
Yes, I have hands-on experience with big data tools.
I have worked extensively with Hadoop, Spark, and Kafka.
I have experience with data ingestion, processing, and storage using these tools.
I have also worked with NoSQL databases like Cassandra and MongoDB.
I am familiar with data warehousing concepts and have worked with tools like Redshift and Snowflake.
Q10. 4) Describe the SSO process between Snowflake and Azure Active Directory.
SSO process between Snowflake and Azure Active Directory involves configuring SAML-based authentication.
Configure Snowflake to use SAML authentication with Azure AD as the identity provider
Set up a trust relationship between Snowflake and Azure AD
Users authenticate through Azure AD and are granted access to Snowflake resources
SSO eliminates the need for separate logins and passwords for Snowflake and Azure AD
Q11. How much data can be stored in MySQL database?
The maximum amount of data that can be stored in a MySQL database depends on various factors.
The maximum size of a MySQL database is determined by the file system and operating system limitations.
With the InnoDB storage engine, a tablespace (and therefore a table) can grow to 64 terabytes (TB) at the default 16 KB page size; MyISAM tables are limited to 256 TB by default.
MySQL does not impose a fixed limit on the number of rows; if a table uses an AUTO_INCREMENT key, the column's data type caps how many values can be generated.
The maximum size of a row in MySQL is 65,5...
Q12. Time Travel, different types of tables in Snowflake and their retention periods
Snowflake has different types of tables with varying retention periods. Time Travel allows accessing historical data.
Snowflake has three main table types: permanent, transient, and temporary
Transient tables persist until explicitly dropped and have a Time Travel retention period of 0 or 1 day (1 day by default), with no Fail-safe period
Temporary tables exist only for the session that created them and have the same 0 or 1 day retention limit
Permanent tables have a retention period of 1 day by default, which can be extended to up to 90 days on Enterprise Edition or higher, followed by a 7-day Fail-safe period
Time Travel in Snowflake allows accessing historical data at different points in time
Time travel is enabled by default for 1 day for transient tab...
Q13. 6) Automatic data loading from pipes into Snowflake.
Automate data loading from pipes into Snowflake for efficient data processing.
Use Snowpipe, a continuous data ingestion service provided by Snowflake, to load staged files into Snowflake tables automatically.
Snowpipe monitors a stage for new data files and loads them into the specified table in near real time.
Configure Snowpipe to trigger a data load whenever new data files are added to the stage, eliminating the need for manual intervention.
Snowpipe supports various file forma...
Q14. Combine two columns in a PySpark DataFrame
Use the withColumn method in PySpark to combine two columns in a DataFrame.
Use the withColumn method to create a new column by combining two existing columns
Specify the new column name and the expression to combine the two columns
Example (with concat, col, and lit imported from pyspark.sql.functions): df = df.withColumn('combined_column', concat(col('column1'), lit(' '), col('column2')))
Q15. How would you truncate a table?
Truncating a table removes all data from the table while keeping the structure intact.
Truncate is a DDL (Data Definition Language) command in SQL.
It is used to quickly delete all rows from a table.
Truncate is faster than using the DELETE statement.
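In many databases (for example MySQL and Oracle) TRUNCATE commits implicitly and cannot be rolled back; it also generates far less log data than DELETE because rows are not logged individually.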
The table structure, indexes, and constraints remain intact after truncation.
Q16. Why would someone index a table?
To improve query performance by reducing the time it takes to retrieve data from a table.
Indexes help to speed up data retrieval operations by allowing the database to quickly locate the required data.
They can be used to optimize queries that involve filtering, sorting, or joining data.
Indexes add overhead to inserts, updates, and deletes because every index must be maintained, although they can help updates and deletes locate the affected rows faster.
Choosing the right columns to index is important to ensure maximum benefit.
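The effect of an index on data retrieval can be observed with a small sketch, assuming an in-memory SQLite database through Python's sqlite3 module. The employees table, the index name, and the plan() helper are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, department TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees (department, salary) VALUES (?, ?)",
    [("IT" if i % 3 == 0 else "HR", 40000 + i) for i in range(1000)],
)

query = "SELECT id, salary FROM employees WHERE department = 'IT'"


def plan(sql):
    """Return the planner's description of how it intends to run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[3] for row in rows)


plan_before = plan(query)  # without an index the whole table is scanned
conn.execute("CREATE INDEX idx_employees_department ON employees (department)")
plan_after = plan(query)   # with the index the planner can search by department
```

Comparing plan_before and plan_after shows the planner switching from a full table scan to a search that uses idx_employees_department.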
Q17. What is the difference between the Lookup activity and the Stored Procedure activity?
Lookup retrieves a single row or a set of rows from a dataset, while the Stored Procedure activity executes a stored procedure in a database.
Lookup is used in data pipelines to retrieve a single value or a set of values from a dataset.
Stored procedure activity is used in ETL processes to execute a stored procedure in a database.
Lookup is typically used for data enrichment or validation purposes.
Stored procedure activity is commonly used for data transformation or loading tasks.
Q18. How to handle large amounts of data in Tableau?
Utilize Tableau's features like data extracts, data blending, and performance optimization techniques.
Use data extracts to improve performance by reducing the amount of data being processed.
Utilize data blending to combine data from multiple sources without the need for complex ETL processes.
Optimize performance by using filters, aggregations, and calculations efficiently.
Consider using Tableau's in-memory data engine for faster processing of large datasets.
Q19. 1) Snowflake architecture in your current project.
Snowflake architecture is used in our project for cloud-based data warehousing.
Snowflake follows a multi-cluster shared data architecture.
It separates storage and compute resources, allowing for independent scaling.
Data is stored in a centralized storage layer, while virtual warehouses are the compute clusters that query it and can be scaled up or down based on workload.
Snowflake uses a unique architecture called a multi-cluster, shared data architecture, which separates storage and compute resources for improved perfor...
Q20. Performance tuning techniques in Spark & Hive
Performance tuning techniques in Spark & Hive involve optimizing resource allocation, partitioning data, using appropriate data formats, and caching.
Optimize resource allocation by adjusting memory and CPU settings based on workload requirements
Partition data to distribute processing load evenly across nodes
Use appropriate data formats like Parquet or ORC for efficient storage and retrieval
Cache intermediate results to avoid recomputation and improve query performance
Q21. Difference between extract and live connection
Extract connection imports data into Tableau while live connection directly connects to the data source.
Extract connection creates a static snapshot of data while live connection accesses real-time data from the source.
Extract connection is useful for large datasets or when offline access is needed.
Live connection is beneficial for real-time analysis and when data needs to be updated frequently.
Examples: Extract connection - importing a CSV file into Tableau. Live connection ...
Q22. Spark architecture in detail
Spark architecture includes driver, executor, and cluster manager components for distributed data processing.
Spark architecture consists of a driver program that manages the execution of tasks across multiple worker nodes.
Executors are responsible for executing tasks on worker nodes and storing data in memory or disk.
Cluster manager is used to allocate resources and schedule tasks across the cluster.
Spark applications run as independent sets of processes on a cluster, coordin...
Q23. What is DENSE_RANK in SQL?
Dense rank in SQL assigns a rank to each row in a result set, giving tied rows the same rank and leaving no gaps between the ranks.
Dense rank is used to assign a rank to each row in a result set without any gaps.
It differs from regular rank in that it does not skip ranks if there are ties.
For example, if two rows have the same value and are ranked 1st, the next row will be ranked 2nd, not 3rd.
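The difference from regular rank can be sketched side by side, assuming an in-memory SQLite database (version 3.25 or later for window functions) through Python's sqlite3 module. The scores table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("A", 90), ("B", 90), ("C", 80), ("D", 70)],
)

# RANK() skips positions after a tie; DENSE_RANK() does not.
ranked = conn.execute(
    """
    SELECT player,
           RANK()       OVER (ORDER BY points DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY points DESC) AS dense_rnk
    FROM scores
    ORDER BY points DESC, player
    """
).fetchall()
```

Players A and B tie for first place, so C gets rank 3 under RANK() but rank 2 under DENSE_RANK().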
Q24. What are the OOP concepts in Java?
The main object-oriented programming (OOP) concepts in Java are encapsulation, inheritance, polymorphism, and abstraction.
Encapsulation: bundling data and methods that operate on the data into a single unit
Inheritance: allows a class to inherit properties and behavior from another class
Polymorphism: ability of a method to do different things based on the object it is acting upon
Abstraction: hiding the implementation details and showing only the functionality to the user
Q25. 2) Database roles in Snowflake.
Database roles in Snowflake define permissions and access control for users and objects.
Database roles in Snowflake are used to manage permissions and access control for users and objects.
Roles can be assigned to users or other roles to grant specific privileges.
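Examples of system-defined account roles in Snowflake include ACCOUNTADMIN, SYSADMIN, SECURITYADMIN, and PUBLIC; database roles, by contrast, are created within and scoped to a single database.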
Q26. What is PySpark?
PySpark is the Python API for Apache Spark, a powerful open-source distributed computing system.
PySpark allows users to write Spark applications in the Python programming language.
It provides high-level APIs in Python for Spark's core functionality.
PySpark can be used for processing large datasets in a distributed computing environment.
Example: Using PySpark to perform data analysis and machine learning tasks on big data.
Q27. What is a Spark cluster?
A Spark cluster is a group of interconnected computers that work together to process large datasets using Apache Spark.
Consists of a master node and multiple worker nodes
Master node manages the distribution of tasks and resources
Worker nodes execute the tasks in parallel
Used for processing big data and running distributed computing jobs
Q28. How does Hive work with HDFS?
Hive is a data warehouse system built on top of Hadoop for querying and analyzing large datasets stored in HDFS.
Hive translates SQL-like queries into MapReduce, Tez, or Spark jobs to process data stored in HDFS
It uses a metastore to store metadata about tables and partitions
HiveQL is the query language used in Hive, similar to SQL
Hive supports partitioning, bucketing, and indexing for optimizing queries
Q29. 3) Session Policy in Snowflake.
Session Policy in Snowflake defines the behavior of a session, including session timeout and idle timeout settings.
A session policy can be set at the account or user level in Snowflake.
Session policy settings include idle timeouts for client sessions and for the Snowflake web interface.
Example: Setting an idle timeout of 30 minutes will automatically end the session if there is no activity for 30 minutes.
Q30. 5) Network Policy in Snowflake.
Network Policy in Snowflake controls access to Snowflake resources based on IP addresses or ranges.
Network Policies are used to restrict access to Snowflake resources based on IP addresses or ranges.
They can be applied at the account or user level (and to certain security integrations), but not to roles.
Network Policies can define an allowed list of IP addresses or ranges that may access Snowflake resources.
They can also define a blocked list of IP addresses or ranges that are not allowed to access Snowflake resou...
Q31. Write a SQL query to find the maximum salary
Use SQL query with MAX function to find the highest salary in a table.
Use SELECT MAX(salary) FROM table_name;
Make sure to replace 'salary' with the actual column name in the table.
Ensure proper permissions to access the table.
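A minimal sketch of the MAX query, assuming an in-memory SQLite database through Python's sqlite3 module. The employees table and its rows are hypothetical; the second query is an optional extension that returns who earns the maximum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Asha", 72000), ("Ben", 48000), ("Chen", 72000)],
)

# Highest salary in the table.
max_salary = conn.execute("SELECT MAX(salary) FROM employees").fetchone()[0]

# Everyone who earns that salary; ties return more than one row.
top_earners = conn.execute(
    """
    SELECT name FROM employees
    WHERE salary = (SELECT MAX(salary) FROM employees)
    ORDER BY name
    """
).fetchall()
```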
Q32. Default join in tableau
Default join in Tableau is inner join
Default join in Tableau is inner join, which only includes rows that have matching values in both tables
Other types of joins in Tableau include left join, right join, and full outer join
To change the join type in Tableau, click the join icon between the two tables and select the desired join type
Q33. PySpark optimization techniques
One PySpark optimization technique is using broadcast variables to efficiently distribute read-only data across all nodes.
Use broadcast variables to efficiently distribute read-only data across all nodes
Avoid shuffling data unnecessarily by using partitioning and caching
Optimize data processing by using appropriate transformations and actions
Q34. Explain blending
Blending is the process of combining multiple data sources or models to create a single, unified dataset or prediction.
Blending involves taking the outputs of multiple models and combining them to improve overall performance.
It is commonly used in machine learning competitions to create an ensemble model that outperforms individual models.
Blending can also refer to combining different data sources, such as blending demographic data with sales data for analysis.
Q35. Left join in SQL
Left join in SQL combines rows from two tables based on a related column, including all rows from the left table.
Left join keyword: LEFT JOIN
Syntax: SELECT columns FROM table1 LEFT JOIN table2 ON table1.column = table2.column
Retrieves all rows from table1 and the matching rows from table2, if any
Rows from table1 that have no match in table2 will have NULL values for the columns from table2
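The LEFT JOIN behaviour can be checked with a small sketch, assuming an in-memory SQLite database through Python's sqlite3 module. The customers and orders tables and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Asha"), (2, "Ben"), (3, "Chen")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, 250), (11, 1, 40), (12, 3, 90)])

# Every customer appears at least once; customers without orders get NULL amounts.
rows = conn.execute(
    """
    SELECT c.name, o.amount
    FROM customers AS c
    LEFT JOIN orders AS o ON c.id = o.customer_id
    ORDER BY c.id, o.id
    """
).fetchall()
```

Ben has no orders, so his row is still returned with None (SQL NULL) in the amount column.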
Q36. Introduce yourself
I am a Senior Data Engineer with expertise in data processing and analysis.
Experienced in designing and implementing data pipelines
Proficient in programming languages like Python and SQL
Skilled in working with big data technologies like Hadoop and Spark
Familiar with data warehousing and ETL processes
Strong problem-solving and analytical skills