Tech Mahindra
Inline Technologies Interview Questions and Answers
Q1. how to remove duplicate rows from bigquery? find the month of a given date in bigquery.
To remove duplicate rows from BigQuery, use the DISTINCT keyword. To find the month of a given date, use the EXTRACT function.
To remove duplicate rows, use SELECT DISTINCT * FROM table_name;
To find the month of a given date, use SELECT EXTRACT(MONTH FROM date_column) AS month_name FROM table_name;
Make sure to replace 'table_name' and 'date_column' with the appropriate values in your query.
Q2. what operator is used in composer to move data from gcs to bq
The operator used in Composer to move data from GCS to BigQuery is the GCS to BigQuery operator.
The GCS to BigQuery operator is used in Apache Airflow, which is the underlying technology of Composer.
This operator allows you to transfer data from Google Cloud Storage (GCS) to BigQuery.
You can specify the source and destination parameters in the operator to define the data transfer process.
Q3. write a code for this - input = [1,2,3,4] output = [1,4,9,16]
Code to square each element in the input array.
Iterate through the input array and square each element.
Store the squared values in a new array to get the desired output.
Q4. architecture of bq. Query optimization techniques in bigquery.
BigQuery architecture includes storage, execution, and optimization components for efficient query processing.
BigQuery stores data in Capacitor storage system for fast access.
Query execution is distributed across multiple nodes for parallel processing.
Query optimization techniques include partitioning tables, clustering tables, and using query cache.
Using partitioned tables can help eliminate scanning unnecessary data.
Clustering tables based on certain columns can improve que...read more
Q5. difference between bigtable and bigquery.
Bigtable is a NoSQL database for real-time analytics, while BigQuery is a fully managed data warehouse for running SQL queries.
Bigtable is a NoSQL database designed for real-time analytics and high throughput, while BigQuery is a fully managed data warehouse for running SQL queries.
Bigtable is used for storing large amounts of semi-structured data, while BigQuery is used for analyzing structured data using SQL queries.
Bigtable is suitable for real-time data processing and hig...read more
Q6. RDD vs dataframe vs dataset in pyspark
RDD vs dataframe vs dataset in PySpark
RDD (Resilient Distributed Dataset) is the basic abstraction in PySpark, representing a distributed collection of objects
Dataframe is a distributed collection of data organized into named columns, similar to a table in a relational database
Dataset is a distributed collection of data with the ability to use custom classes for type safety and user-defined functions
Dataframes and Datasets are built on top of RDDs, providing a more structured...read more
Q7. dataflow vs dataproc.
Dataflow is a fully managed stream and batch processing service, while Dataproc is a managed Apache Spark and Hadoop service.
Dataflow is a serverless data processing service that automatically scales to handle your data, while Dataproc is a managed Spark and Hadoop service that requires you to provision and manage clusters.
Dataflow is designed for both batch and stream processing, allowing you to process data in real-time, while Dataproc is more focused on batch processing.
Da...read more
Interview Process at Inline Technologies
Top Data Engineer Interview Questions from Similar Companies
Reviews
Interviews
Salaries
Users/Month