Neurons-IT Interview Questions and Answers
Q1. What services have you used in GCP?
I have used services like BigQuery, Dataflow, Pub/Sub, and Cloud Storage in GCP.
BigQuery for data warehousing and analytics
Dataflow for batch and streaming (real-time) data processing pipelines
Pub/Sub for messaging and event ingestion
Cloud Storage for storing data and files
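A minimal sketch of calling one of these services from Python, assuming the google-cloud-bigquery client library and application-default credentials are set up; the project and table names are invented placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

# Placeholder table; BigQuery runs the SQL and streams the result rows back.
query = "SELECT COUNT(*) AS n FROM `my-project.analytics.events`"
for row in client.query(query).result():
    print(row.n)
```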
Q2. What are accumulators in Spark?
Accumulators are shared variables that are updated by worker nodes and can be used for aggregating information across tasks.
Accumulators are used for implementing counters and sums in Spark.
Tasks running on worker nodes can only add to an accumulator; only the driver program can read its value.
Accumulators are useful for debugging and monitoring purposes.
Example: counting the number of errors encountered during processing.
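A minimal PySpark sketch of that error-counting example, assuming pyspark is installed and run in local mode:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("acc-demo").getOrCreate()
sc = spark.sparkContext

error_count = sc.accumulator(0)  # shared counter; tasks add, driver reads

def check(line):
    if line.startswith("ERROR"):
        error_count.add(1)  # updated on the workers during the action

sc.parallelize(["INFO ok", "ERROR boom", "ERROR bad"]).foreach(check)
print(error_count.value)  # only the driver can read the value: 2

spark.stop()
```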
Q3. Write a query to find duplicate data using SQL
Group rows on the columns that define a duplicate and keep only the groups that occur more than once.
Use GROUP BY with a HAVING clause to identify duplicate records.
Select the columns to check for duplicates.
Use the COUNT() function to count occurrences of each combination; a runnable sketch follows.
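A sketch using Python's built-in sqlite3 module so the pattern can be tried end to end; the table and column names are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("a", "a@x.com"), ("b", "b@x.com"), ("a", "a@x.com")],
)

# Group on the columns that define a duplicate;
# HAVING keeps only the groups seen more than once.
rows = conn.execute(
    """
    SELECT name, email, COUNT(*) AS occurrences
    FROM users
    GROUP BY name, email
    HAVING COUNT(*) > 1
    """
).fetchall()
print(rows)  # [('a', 'a@x.com', 2)]
```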
Q4. What is pub/sub?
Pub/sub is a messaging pattern where senders (publishers) of messages do not program the messages to be sent directly to specific receivers (subscribers).
Pub/sub stands for publish/subscribe.
Publishers send messages to a topic, and subscribers receive messages from that topic.
It allows for decoupling of components in a system, enabling scalability and flexibility.
Examples include Apache Kafka, Google Cloud Pub/Sub, and MQTT.
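A toy in-process broker in Python, only to illustrate the decoupling: the publisher knows the topic name, never the subscribers. Real systems like Kafka or Cloud Pub/Sub add durability, ordering, and delivery guarantees on top.

```python
from collections import defaultdict
from typing import Callable

class Broker:
    def __init__(self) -> None:
        # topic name -> list of subscriber callbacks
        self._subs: defaultdict[str, list[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self._subs[topic].append(handler)

    def publish(self, topic: str, message: str) -> None:
        for handler in self._subs[topic]:  # fan out to every subscriber
            handler(message)

broker = Broker()
broker.subscribe("orders", lambda m: print("billing saw:", m))
broker.subscribe("orders", lambda m: print("shipping saw:", m))
broker.publish("orders", "order-42 created")  # publisher never names a receiver
```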
Q5. Explain Spark architecture
Spark uses a distributed computing architecture consisting of a driver program, a cluster manager, and worker nodes.
Spark architecture includes a driver program that manages the execution of the Spark application.
It also includes a cluster manager that allocates resources and schedules tasks on worker nodes.
Worker nodes are responsible for executing the tasks and storing data in memory or disk.
Spark achieves fault tolerance through resilient distributed datasets (RDDs), which can be recomputed from their lineage if a partition is lost.
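A minimal sketch of those roles in code, assuming pyspark in local mode (the "cluster" is one machine, but driver, scheduler, and tasks behave the same way):

```python
from pyspark.sql import SparkSession

# The driver program: builds the session and defines the job.
spark = SparkSession.builder.master("local[4]").appName("arch-demo").getOrCreate()

rdd = spark.sparkContext.parallelize(range(1000), numSlices=4)  # 4 partitions -> 4 tasks
squares = rdd.map(lambda x: x * x)  # transformation; executed on the workers

# The action triggers scheduling: tasks run on executors, the result returns to the driver.
print(squares.sum())

spark.stop()
```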
Q6. Design an ETL flow in GCP
Designing ETL flow in Google Cloud Platform (GCP) involves defining data sources, transformation processes, and loading destinations.
Identify data sources and extract data using GCP services like Cloud Storage, BigQuery, or Cloud SQL.
Transform data using tools like Dataflow or Dataprep to clean, enrich, and aggregate data.
Load transformed data into target destinations such as BigQuery, Cloud Storage, or other databases.
Schedule and automate the ETL process using Cloud Composer (managed Apache Airflow), as in the sketch below.
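A hedged Cloud Composer (Airflow 2.x) sketch wiring the extract/load and transform steps together; it assumes the apache-airflow-providers-google package, and the bucket, dataset, and SQL below are invented placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG("gcs_to_bq_etl", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    # Extract/load: copy raw CSVs from Cloud Storage into a BigQuery staging table.
    load_raw = GCSToBigQueryOperator(
        task_id="load_raw",
        bucket="my-landing-bucket",                # placeholder bucket
        source_objects=["events/*.csv"],           # placeholder objects
        destination_project_dataset_table="proj.stage.events_raw",
        write_disposition="WRITE_TRUNCATE",
    )

    # Transform inside BigQuery: clean and aggregate into the serving table.
    transform = BigQueryInsertJobOperator(
        task_id="transform",
        configuration={
            "query": {
                "query": (
                    "CREATE OR REPLACE TABLE proj.mart.daily_counts AS "
                    "SELECT user_id, COUNT(*) AS events "
                    "FROM proj.stage.events_raw GROUP BY user_id"
                ),
                "useLegacySql": False,
            }
        },
    )

    load_raw >> transform
```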
Q7. Find duplicates in a list in Python
Use a dictionary to find duplicates in a list of strings in Python.
Create an empty dictionary to store the count of each string in the list.
Iterate through the list and update the count in the dictionary for each string.
Print out the strings that have a count greater than 1 to find duplicates.
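A direct implementation of the dictionary-count approach described above:

```python
def find_duplicates(items):
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1  # tally each string
    return [item for item, n in counts.items() if n > 1]

print(find_duplicates(["a", "b", "a", "c", "b"]))  # ['a', 'b']
```

The standard-library collections.Counter does the same tallying in one call and is the more idiomatic choice in practice.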