GCP Data Engineer

50+ GCP Data Engineer Interview Questions and Answers

Updated 25 Feb 2025

Q1. GCP services: What is the use of BigQuery? What are Pub/Sub, Dataflow, and Cloud Storage? Questions related to previous roles and responsibilities.

Ans.

BigQuery is a cloud-based data warehousing tool used for analyzing large datasets quickly. Pub/Sub is a messaging service, Dataflow is a data processing tool, and Cloud Storage is a scalable object storage service.

  • BigQuery is used for analyzing large datasets quickly

  • Pub/Sub is a messaging service used for asynchronous communication between applications

  • Dataflow is a data processing tool used for batch and stream processing

  • Cloud Storage is a scalable object storage service used for storing and retrieving any amount of data

Q2. What is IAM? What is an SA? What is BigQuery? Various optimisations, joins, complex SQL queries, QlikSense, GitHub, schemas, routines, schedules, delete/drop/truncate, GUI and Terraform, Spark basics, file formats...

Ans.

IAM is Identity and Access Management, SA is Service Account, BigQuery is a data warehouse, QlikSense is a data visualization tool, GitHub is a version control system, Spark is a distributed computing framework, Airflow is a workflow automation tool, Bigtable is a NoSQL database, Cloud Composer is a managed workflow orchestration service, Pub/Sub is a messaging service.

  • IAM is used to manage access to resources in Google Cloud Platform.

  • SA is a special Google account that represents a non-human identity, such as an application or a VM, used to authenticate to GCP services.

GCP Data Engineer Interview Questions and Answers for Freshers


Q3. How do you migrate a data warehouse to GCP using real-time data services?

Ans.

Real-time GCP data services can be used to migrate a data warehouse incrementally.

  • Use Cloud Dataflow to ingest and transform data in real-time

  • Use Cloud Pub/Sub to stream data to BigQuery or Cloud Storage

  • Use Cloud Dataproc to process data in real-time

  • Use Cloud Composer to orchestrate data pipelines

  • Use Cloud Spanner for real-time transactional data

  • Use Cloud SQL for real-time relational data

  • Use Cloud Bigtable for real-time NoSQL data

Q4. Explain Google Cloud BigQuery architecture.

Ans.

Google Cloud BigQuery is a fully-managed, serverless data warehouse that uses a distributed architecture for processing and analyzing large datasets.

  • BigQuery uses a distributed storage system called Capacitor for storing and managing data.

  • It uses a distributed query engine called Dremel for executing SQL-like queries on large datasets.

  • BigQuery separates storage and compute, allowing users to scale compute resources independently.

  • It supports automatic data partitioning and clustering to optimize query performance and cost.


Q5. What is GCP BigQuery? Architecture of BQ, Cloud Composer, what is a DAG, and visualization tools like Looker and Data Studio.

Ans.

GCP BigQuery is a serverless, highly scalable, and cost-effective data warehouse for analyzing big data sets.

  • BigQuery is a fully managed, petabyte-scale data warehouse that enables super-fast SQL queries using the processing power of Google's infrastructure.

  • BigQuery's architecture includes storage, Dremel execution engine, and SQL layer.

  • Cloud Composer is a managed workflow orchestration service that helps you create, schedule, and monitor pipelines using Apache Airflow.

  • A DAG (Directed Acyclic Graph) is a collection of tasks with explicit dependencies and no cycles; Airflow uses it to determine the order in which tasks run.

Q6. How do you create a data pipeline in GCP?

Ans.

Data pipelines in GCP can be created using various tools like Dataflow, Dataproc, and Cloud Composer.

  • Choose the appropriate tool based on the use case and data volume

  • Define the data source and destination

  • Create a pipeline using the chosen tool and define the data transformations

  • Test and deploy the pipeline

  • Monitor and troubleshoot the pipeline for any issues


Q7. Explain the pub-sub mechanism. How did you implement it in your project?

Ans.

Pub-sub mechanism is a messaging pattern where senders (publishers) of messages are decoupled from receivers (subscribers).

  • Pub-sub stands for publish-subscribe.

  • Publishers send messages to a topic, and subscribers receive messages from that topic.

  • Google Cloud Pub/Sub is a fully-managed real-time messaging service that allows you to send and receive messages between independent applications.

  • In my project, we used Google Cloud Pub/Sub to decouple the components of our data pipeline, allowing producers and consumers to scale independently.

Q8. Which of these two is faster: select * from table, or select * from table limit 100?

Ans.

select * from table limit 100 returns results faster

  • 'select * from table' retrieves all rows from the table, which can be slow if the table is large

  • 'select * from table limit 100' limits the number of rows returned, so results come back sooner

  • Note that in BigQuery, LIMIT reduces the data returned, not the bytes scanned or billed; selecting fewer columns and filtering on partitions is what reduces scan cost


Q9. If a query fails in BigQuery, how can you find the error?

Ans.

To find out errors in a failed BigQuery query, check the query job history and error message details.

  • Check the query job history in the BigQuery console for details on the failed query.

  • Look for error messages in the job history to identify the specific issue that caused the query to fail.

  • Review the query syntax and data being queried to troubleshoot common errors such as syntax errors or data type mismatches.

Q10. Where do you use Dataproc, and where do you use Cloud Composer?

Ans.

Dataproc is used for processing large datasets in Hadoop/Spark, while Cloud Composer is used for orchestrating workflows and managing pipelines.

  • Use Dataproc for processing large datasets in Hadoop/Spark

  • Use Cloud Composer for orchestrating workflows and managing pipelines

  • Dataproc is ideal for running big data processing jobs, while Cloud Composer is suitable for managing complex workflows

Q11. What services in GCP have you used?

Ans.

I have used various services in GCP including BigQuery, Dataflow, Cloud Storage, and Pub/Sub.

  • BigQuery for data warehousing and analytics

  • Dataflow for data processing and ETL

  • Cloud Storage for storing and accessing data

  • Pub/Sub for messaging and event-driven architectures

Q12. What are the modules you've used in python?

Ans.

I have used modules like pandas, numpy, matplotlib, and sklearn in Python for data manipulation, analysis, visualization, and machine learning tasks.

  • pandas - for data manipulation and analysis

  • numpy - for numerical computing and array operations

  • matplotlib - for data visualization

  • sklearn - for machine learning tasks

Q13. Best practices used while writing queries in BigQuery.

Ans.

Best practices for writing queries in BigQuery

  • Use standard SQL syntax for better performance and compatibility

  • Avoid using SELECT * and instead specify only the columns needed

  • Optimize queries by using appropriate functions and operators

  • Use query caching to reduce costs and improve performance

  • Partition tables and use clustering to improve query performance

Q14. bq command to show the schema of a table

Ans.

Use 'bq show' command to display the schema of a table in BigQuery.

  • Use 'bq show' command followed by the dataset and table name to display the schema.

  • The schema includes the column names, data types, and mode (nullable or required).

  • Example: bq show project_id:dataset.table_name

Q15. Write Python code to trigger a Dataflow job in a Cloud Function

Ans.

Python code to trigger a dataflow job in cloud function

  • Use the googleapiclient library to interact with the Dataflow API

  • Authenticate using service account credentials

  • Submit a job to Dataflow using the projects.locations.templates.launch endpoint
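
The steps above can be sketched as a Cloud Function that launches a Dataflow job from a template. This is a hedged sketch: it assumes the google-api-python-client package and default service-account credentials, and the project, region, bucket, and template paths are illustrative placeholders, not values from the source.

```python
# Sketch of a Cloud Function that launches a Dataflow job from a template.
# Project, region, bucket, and template names below are placeholders.

def build_launch_body(job_name, input_path, output_path, temp_location):
    """Build the request body for projects.locations.templates.launch."""
    return {
        "jobName": job_name,
        "parameters": {"inputFile": input_path, "output": output_path},
        "environment": {"tempLocation": temp_location},
    }

def trigger_dataflow(event, context):
    """Cloud Function entry point (e.g. triggered by a GCS event)."""
    # Imported here so the module loads even where the client isn't installed
    from googleapiclient.discovery import build  # uses default credentials

    service = build("dataflow", "v1b3", cache_discovery=False)
    body = build_launch_body(
        job_name="my-dataflow-job",
        input_path="gs://my-bucket/input.txt",
        output_path="gs://my-bucket/output",
        temp_location="gs://my-bucket/temp",
    )
    request = service.projects().locations().templates().launch(
        projectId="my-project",
        location="us-central1",
        gcsPath="gs://dataflow-templates/latest/Word_Count",
        body=body,
    )
    return request.execute()
```

The request-body builder is kept as a pure function so it can be unit-tested without credentials or network access.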

Q16. Case study: Using GCP tools, build a pipeline to transfer a file from one GCS bucket to another

Ans.

Use GCP Dataflow to transfer files between GCS buckets

  • Create a Dataflow pipeline using Apache Beam to read from source bucket and write to destination bucket

  • Use GCS connector to read and write files in Dataflow pipeline

  • Set up appropriate permissions for Dataflow service account to access both buckets

Q17. SQL: Find keys present in table A but not in B (B is an old copy of A)

Ans.

Use SQL to find keys present in table A but not in table B (old copy of A).

  • Use a LEFT JOIN to combine tables A and B based on the key column

  • Filter the results where the key column in table B is NULL

  • This will give you the keys present in table A but not in table B
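
The LEFT JOIN anti-join pattern above can be run locally with Python's built-in sqlite3; the table and column names here are invented for the example.

```python
# Runnable illustration of the anti-join: keys in a that are missing from b.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE a (id INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE b (id INTEGER PRIMARY KEY)")
cur.executemany("INSERT INTO a VALUES (?)", [(1,), (2,), (3,)])
cur.executemany("INSERT INTO b VALUES (?)", [(1,), (2,)])  # old copy, missing id 3

# LEFT JOIN keeps every row of a; unmatched rows have b.id = NULL
cur.execute(
    """
    SELECT a.id
    FROM a
    LEFT JOIN b ON a.id = b.id
    WHERE b.id IS NULL
    """
)
new_keys = [row[0] for row in cur.fetchall()]
print(new_keys)  # [3]
```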

Q18. How do you display a string in reverse using Python?

Ans.

Use Python's slicing feature to display a string in reverse order.

  • Use string slicing with a step of -1 to reverse the string.

  • Example: 'hello'[::-1] will output 'olleh'.

Q19. How do you schedule a job to trigger every hour in Airflow?

Ans.

To schedule a job to trigger every hour in Airflow, you can use a cron schedule interval.

  • Define a DAG (Directed Acyclic Graph) in Airflow

  • Set the schedule_interval parameter to '0 * * * *' to trigger the job every hour

  • Example: schedule_interval='0 * * * *'

Q20. What are the GCP services used in your project

Ans.

The GCP services used in our project include BigQuery, Dataflow, Pub/Sub, and Cloud Storage.

  • BigQuery for data warehousing and analytics

  • Dataflow for real-time data processing

  • Pub/Sub for messaging and event ingestion

  • Cloud Storage for storing data and files

Q21. Python: list and tuple differences

Ans.

List and tuple are both used to store collections of data, but they have some differences.

  • Lists are mutable while tuples are immutable

  • Lists use square brackets [] while tuples use parentheses ()

  • Lists are typically used for collections of homogeneous data while tuples are used for heterogeneous data

  • Lists have more built-in methods than tuples
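
A small demonstration of the differences listed above:

```python
# Lists are mutable; tuples are immutable and hashable.
nums_list = [1, 2, 3]   # square brackets
nums_tuple = (1, 2, 3)  # parentheses

nums_list.append(4)      # fine: lists can change in place
try:
    nums_tuple[0] = 99   # tuples cannot be modified
except TypeError as e:
    error = str(e)       # "'tuple' object does not support item assignment"

# Because tuples are hashable, they can serve as dict keys; lists cannot.
coords = {(0, 0): "origin"}
print(nums_list, error, coords[(0, 0)])
```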

Q22. How many slots are there in BigQuery?

Ans.

BigQuery slots are units of compute capacity; the number available depends on the pricing model rather than being a fixed property of the service.

  • A slot is a unit of CPU, memory, and network capacity used to execute SQL queries.

  • With on-demand pricing, a project shares a default slot quota (up to 2,000 slots) across its queries.

  • With capacity-based pricing (reservations), you purchase a committed number of slots.

  • BigQuery dynamically schedules slots across the stages of a query, so a single query's slot usage varies with its complexity and data size.

Q23. What are the data sources used?

Ans.

Various data sources such as databases, APIs, files, and streaming services are used for data ingestion and processing.

  • Databases (e.g. MySQL, PostgreSQL)

  • APIs (e.g. RESTful APIs)

  • Files (e.g. CSV, JSON)

  • Streaming services (e.g. Kafka, Pub/Sub)

Q24. What is a materialized view in BigQuery?

Ans.

Materialized view in BigQuery is a precomputed result set stored as a table for faster query performance.

  • Materialized views store the results of a query and can be used to speed up query performance by avoiding the need to recompute the same result multiple times.

  • They are updated periodically to reflect changes in the underlying data.

  • Materialized views are particularly useful for complex queries that involve aggregations or joins.

  • Example: CREATE MATERIALIZED VIEW dataset.my_materialized_view AS SELECT col, COUNT(*) AS cnt FROM dataset.base_table GROUP BY col

Q25. Write code to find the max number of products by customer

Ans.

Code to find max number of product by customer

  • Iterate through each customer's purchases

  • Keep track of the count of each product for each customer

  • Find the product with the maximum count for each customer
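
The steps above can be sketched over a list of (customer, product) purchase records; the sample data is invented for the example.

```python
# Count products per customer, then pick each customer's most-bought product.
from collections import Counter, defaultdict

purchases = [
    ("alice", "apple"), ("alice", "apple"), ("alice", "pear"),
    ("bob", "milk"), ("bob", "bread"), ("bob", "milk"), ("bob", "milk"),
]

counts = defaultdict(Counter)
for customer, product in purchases:
    counts[customer][product] += 1

top_product = {
    customer: counter.most_common(1)[0]  # (product, count) pair
    for customer, counter in counts.items()
}
print(top_product)  # {'alice': ('apple', 2), 'bob': ('milk', 3)}
```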

Q26. Which types of jobs are handled in BigQuery?

Ans.

BigQuery handles various types of jobs, including querying, loading, exporting, and copying data.

  • Query jobs: running SQL for analysis and reporting

  • Load jobs: ingesting data into BigQuery for storage and processing

  • Extract (export) jobs: exporting data from BigQuery to Cloud Storage or other systems

  • Copy jobs: copying tables within BigQuery or across datasets and projects

Q27. GCP main components and their uses

Ans.

GCP main components include Compute Engine, Cloud Storage, BigQuery, and Dataflow for various uses.

  • Compute Engine - Virtual machines for running workloads

  • Cloud Storage - Object storage for storing data

  • BigQuery - Data warehouse for analytics

  • Dataflow - Stream and batch processing of data

Q28. What is meant by partitioning and clustering? Types of partitioning

Ans.

Partitioning is dividing data into smaller, manageable parts. Clustering is grouping similar data together. Types include range, hash, list, and composite partitioning.

  • Partitioning divides large tables into smaller, more manageable parts based on a chosen criteria.

  • Clustering groups together rows with similar values for one or more columns to improve query performance.

  • Types of partitioning include range partitioning, hash partitioning, list partitioning, and composite partitioning.

Q29. What types of NoSQL databases are in GCP?

Ans.

Types of NoSQL databases in GCP include Firestore, Bigtable, and Datastore.

  • Firestore is a flexible, scalable database for mobile, web, and server development.

  • Bigtable is a high-performance NoSQL database service for large analytical and operational workloads.

  • Datastore is a highly scalable NoSQL database for web and mobile applications.

Q30. bq commands to create a table and load a CSV file

Ans.

Using bq commands to create a table and load a CSV file in Google BigQuery

  • Use 'bq mk' command to create a new table in BigQuery

  • Use 'bq load' command to load a CSV file into the created table

  • Specify schema and source format when creating the table

  • Specify source format and destination table when loading the CSV file

  • Example: bq mk --table dataset.table_name schema.json

  • Example: bq load --source_format=CSV dataset.table_name data.csv

Q31. Explain leaf nodes and columnar storage.

Ans.

Leaf nodes are the bottom nodes in a tree structure, while columnar storage stores data in columns rather than rows.

  • Leaf nodes are the end nodes in a tree structure, containing actual data or pointers to data.

  • Columnar storage stores data in columns rather than rows, allowing for faster query performance on specific columns.

  • Columnar storage is commonly used in data warehouses and analytics databases.

  • Leaf nodes are important for efficient data retrieval in tree-based data structures.

Q32. What are generators and decorators?

Ans.

Generators and decorators are features in Python. Generators are functions that can pause and resume execution, while decorators are functions that modify other functions.

  • Generators are functions that use the yield keyword to return values one at a time, allowing for efficient memory usage.

  • Decorators are functions that take another function as input and return a new function with added functionality.

  • Generators can be used to iterate over large datasets without loading everything into memory at once.
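
A minimal example of both features:

```python
import functools

# Generator: yields squares one at a time instead of building a full list
def squares(n):
    for i in range(n):
        yield i * i  # execution pauses here until the next value is requested

# Decorator: counts how many times the wrapped function is called
def count_calls(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@count_calls
def greet(name):
    return f"hello {name}"

first_three = list(squares(3))  # [0, 1, 4]
greet("a")
greet("b")
print(first_three, greet.calls)
```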

Q33. What are partitioning and clustering?

Ans.

Partitioning is dividing data into smaller parts based on a key, while clustering is storing data together based on similar values.

  • Partitioning is used to improve query performance by reducing the amount of data that needs to be scanned.

  • Clustering is used to physically store related data together on disk to improve query performance.

  • In BigQuery, partitioning can be done on a date column, while clustering can be done on one or more columns to group related data together.

Q34. What are SQL joins? Explain with respect to BigQuery.

Ans.

SQL joins are used to combine rows from two or more tables based on a related column between them.

  • SQL joins are used to retrieve data from multiple tables based on a related column between them

  • Types of SQL joins include INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN

  • In BigQuery, joins can be performed using standard SQL syntax

  • Example: SELECT * FROM table1 INNER JOIN table2 ON table1.column = table2.column

Q35. GCP storage class types

Ans.

GCP offers different storage classes for varying performance and cost requirements.

  • Standard Storage: for frequently accessed data

  • Nearline Storage: for data accessed less frequently

  • Coldline Storage: for data accessed very infrequently

  • Archive Storage: for data stored for long-term retention

Q36. SQL optimisation techniques

Ans.

SQL optimization techniques focus on improving query performance by reducing execution time and resource usage.

  • Use indexes to speed up data retrieval

  • Avoid using SELECT * and instead specify only the columns needed

  • Optimize joins by using appropriate join types and conditions

  • Limit the use of subqueries and instead use JOINs where possible

  • Use EXPLAIN to analyze query execution plans and identify bottlenecks
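
Two of the techniques above, indexing the filter column and inspecting the query plan, can be demonstrated with Python's built-in sqlite3 (the table and data are invented; the exact plan wording varies by SQLite version):

```python
# Compare the query plan before and after creating an index.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
cur.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, f"cust{i % 10}", i * 1.5) for i in range(1000)],
)

query = "SELECT id, total FROM orders WHERE customer = 'cust3'"

# Without an index, the plan is a full table scan
before = cur.execute("EXPLAIN QUERY PLAN " + query).fetchall()

cur.execute("CREATE INDEX idx_orders_customer ON orders(customer)")

# With the index, the plan searches the index instead of scanning
after = cur.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(before[0][-1])  # e.g. SCAN orders
print(after[0][-1])   # e.g. SEARCH orders USING INDEX idx_orders_customer ...
```

Note that BigQuery itself relies on partitioning and clustering rather than user-created B-tree indexes; this demo shows the general SQL technique.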

Q37. How do you use indexing in SQL?

Ans.

Indexing in SQL is used to improve the performance of queries by creating indexes on columns in tables.

  • Indexes can be created on columns that are frequently used in WHERE, JOIN, and ORDER BY clauses.

  • Indexes can speed up query performance by allowing the database to quickly locate rows based on the indexed columns.

  • Primary keys automatically create a unique index on the column(s) specified.

  • Example: CREATE INDEX idx_name ON table_name(column_name);

  • Example: DROP INDEX idx_name;

Q38. What are window functions in BigQuery?

Ans.

Window functions in BigQuery are used to perform calculations across a set of table rows related to the current row.

  • Window functions allow you to perform calculations on a set of rows related to the current row

  • They are used with the OVER() clause in SQL queries

  • Common window functions include ROW_NUMBER(), RANK(), and NTILE()

  • They can be used to calculate moving averages, cumulative sums, and more
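
The OVER() clause is standard SQL, so the same syntax BigQuery accepts can be tried locally with Python's sqlite3 (SQLite 3.25+); the sample data is invented for the example.

```python
# ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) ranks rows per group.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (dept TEXT, amount INTEGER)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("a", 10), ("a", 30), ("a", 20), ("b", 50), ("b", 40)],
)

# Rank rows within each department by amount, highest first
rows = cur.execute(
    """
    SELECT dept, amount,
           ROW_NUMBER() OVER (PARTITION BY dept ORDER BY amount DESC) AS rn
    FROM sales
    ORDER BY dept, rn
    """
).fetchall()
print(rows)
# [('a', 30, 1), ('a', 20, 2), ('a', 10, 3), ('b', 50, 1), ('b', 40, 2)]
```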

Q39. Difference between Nearline and Coldline.

Ans.

Nearline is for data accessed less frequently, while coldline is for data accessed very infrequently.

  • Nearline storage is designed for data that is accessed less frequently but still needs to be readily available.

  • Coldline storage is for data that is accessed very infrequently and is stored at a lower cost.

  • Nearline storage has a higher retrieval cost compared to coldline storage.

  • Examples include the Google Cloud Storage Nearline and Coldline storage classes.

Q40. Difference between Bigtable and BigQuery

Ans.

Bigtable is a NoSQL database for real-time workloads, while BigQuery is a fully managed data warehouse for running SQL queries.

  • Bigtable is a NoSQL wide-column database designed for low-latency, high-throughput applications.

  • BigQuery is a fully managed data warehouse that allows users to run SQL queries on large datasets.

  • Bigtable is optimized for high-speed reads and writes, making it suitable for real-time data serving.

  • BigQuery is optimized for running complex analytical SQL queries on large datasets.

Q41. Discuss other orchestration tools in GCP

Ans.

Cloud Composer is another orchestration tool in GCP

  • Cloud Composer is a fully managed workflow orchestration service built on Apache Airflow

  • It allows you to author, schedule, and monitor workflows that span across GCP services

  • Cloud Composer provides a rich set of features like DAGs, plugins, and monitoring capabilities

  • It integrates seamlessly with other GCP services like BigQuery, Dataflow, and Dataproc

Q42. What is a Cloud Function?

Ans.

Cloud Functions are event-driven functions that run in response to cloud events.

  • Serverless functions that automatically scale based on demand

  • Can be triggered by events from various cloud services

  • Supports multiple programming languages like Node.js, Python, etc.

Q43. Explain SCD and MERGE in BigQuery

Ans.

SCD stands for Slowly Changing Dimension and Merge is a SQL operation used to update or insert data in BigQuery.

  • SCD is used to track changes to data over time in a data warehouse

  • Merge in BigQuery is used to perform insert, update, or delete operations in a single statement

  • Example: MERGE INTO target_table USING source_table ON condition WHEN MATCHED THEN UPDATE SET col1 = value1 WHEN NOT MATCHED THEN INSERT (col1, col2) VALUES (value1, value2)

Q44. Different storage types in GCP.

Ans.

Different storage types in GCP include Cloud Storage, Persistent Disk, Cloud SQL, Bigtable, and Datastore.

  • Cloud Storage: object storage for storing and accessing data from Google Cloud

  • Persistent Disk: block storage for virtual machine instances

  • Cloud SQL: fully-managed relational database service

  • Bigtable: NoSQL wide-column database service for large analytical and operational workloads

  • Datastore: NoSQL document database for web and mobile applications

Q45. Explain lazy evaluation in Spark.

Ans.

Lazy evaluation in Spark delays the execution of transformations until an action is called.

  • Transformations in Spark are not executed immediately, but are stored as a directed acyclic graph (DAG) of operations.

  • Actions trigger the execution of the DAG, allowing for optimizations like pipelining and avoiding unnecessary computations.

  • Lazy evaluation helps in optimizing the execution plan and improving performance by delaying the actual computation until necessary.

Q46. Use cases of BigQuery and SQL

Ans.

BigQuery is used for analyzing large datasets and running complex queries, while SQL is used for querying databases.

  • BigQuery is used for analyzing large datasets quickly and efficiently

  • SQL is used for querying databases to retrieve specific data

  • BigQuery can handle petabytes of data, making it ideal for big data analysis

  • SQL can be used to perform operations like filtering, sorting, and aggregating data

Q47. Dataflow function to split a sentence

Ans.

Dataflow function to split sentence

  • Use the Split transform in Dataflow to split the sentence into words

  • Apply ParDo function to process each word individually

  • Use regular expressions to handle punctuation and special characters
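
The per-element logic a Dataflow (Apache Beam) ParDo would apply can be kept as a plain function; the Beam pipeline around it is sketched in comments and assumes the apache-beam package, with illustrative names not taken from the source.

```python
import re

# Split a sentence into words, handling punctuation with a regular expression.
def extract_words(sentence):
    return re.findall(r"[A-Za-z']+", sentence)

# Sketch of the surrounding Beam pipeline (requires apache-beam):
#
#   import apache_beam as beam
#   with beam.Pipeline() as p:
#       (p
#        | beam.Create(["Hello, world! It's GCP."])
#        | beam.FlatMap(extract_words)   # FlatMap is a shorthand for a ParDo
#        | beam.Map(print))

print(extract_words("Hello, world! It's GCP."))  # ['Hello', 'world', "It's", 'GCP']
```

Keeping the splitting logic in its own function makes it testable without running a pipeline.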

Q48. Recursive function for factorial

Ans.

Recursion function to calculate factorial of a number

  • Define a function that takes an integer as input

  • Base case: if input is 0, return 1

  • Recursive case: return input multiplied by factorial of input-1

  • Example: factorial(5) = 5 * factorial(4) = 5 * 4 * factorial(3) = ... = 5 * 4 * 3 * 2 * 1 = 120
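
The steps above as code, with a guard for negative input added for safety:

```python
# Recursive factorial: base case at 0, recursive case n * factorial(n - 1).
def factorial(n):
    if n < 0:
        raise ValueError("factorial is undefined for negative numbers")
    if n == 0:                       # base case
        return 1
    return n * factorial(n - 1)      # recursive case

print(factorial(5))  # 120
```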

Q49. Partitioning vs clustering

Ans.

Partitioning is dividing data into smaller chunks for efficient storage and retrieval, while clustering is organizing data within those partitions based on a specific column.

  • Partitioning is done at the storage level to distribute data across multiple nodes for better performance.

  • Clustering is done at the query level to physically group data based on a specific column, improving query performance.

  • Example: Partitioning a sales table by date can improve query performance by letting the engine scan only the relevant partitions.

Q50. Why did you choose TCS?

Ans.

I chose TCS for its reputation, global presence, diverse opportunities, and focus on innovation.

  • TCS is a renowned company with a strong reputation in the IT industry

  • TCS has a global presence with offices in multiple countries, providing opportunities for international exposure

  • TCS offers diverse opportunities for career growth and development in various domains

  • TCS is known for its focus on innovation and cutting-edge technologies, which aligns with my career goals


Made with ❤️ in India. Trademarks belong to their respective owners. All rights reserved © 2024 Info Edge (India) Ltd.
