10+ We Care Consultancy Services Interview Questions and Answers
Q1. What's the use of broadcast and accumulator in Spark
Broadcast and accumulator are used in Spark for efficient data sharing and aggregation across tasks.
Broadcast variables are used to efficiently distribute large read-only data to all tasks in a Spark job.
Accumulators are used for aggregating values from all tasks in a Spark job to a shared variable.
Broadcast variables help in reducing data transfer costs and improving performance.
Accumulators are used for tasks like counting or summing values across all tasks.
Example: Broadca...read more
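A minimal PySpark sketch of both concepts above (the lookup dictionary and sample RDD are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast_accumulator_demo").getOrCreate()
sc = spark.sparkContext

# Broadcast a small read-only lookup table so every task reuses one copy
lookup = sc.broadcast({"a": 1, "b": 2})

# Accumulator that counts unknown keys across all tasks
bad_records = sc.accumulator(0)

def score(key):
    if key not in lookup.value:
        bad_records.add(1)          # aggregated back to the driver
        return 0
    return lookup.value[key]

rdd = sc.parallelize(["a", "b", "c"])
print(rdd.map(score).collect())     # [1, 2, 0]
print(bad_records.value)            # 1 (reliable only after an action has run)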
Q2. PySpark - How to add a new column to the data? How to read data from a CSV file?
To add a new column to data in PySpark, use the 'withColumn' method. To read data from a CSV file, use the 'spark.read.csv' method.
To add a new column to data in PySpark, use the 'withColumn' method
Example: df.withColumn('new_column', df['existing_column'] * 2)
To read data from a CSV file, use the 'spark.read.csv' method
Example: df = spark.read.csv('file.csv', header=True, inferSchema=True)
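Putting the two examples together in a self-contained sketch (the file name and column names are the same illustrative ones used above):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv_with_column_demo").getOrCreate()

# Read a CSV file with a header row, letting Spark infer column types
df = spark.read.csv("file.csv", header=True, inferSchema=True)

# Add a new column derived from an existing one
df = df.withColumn("new_column", F.col("existing_column") * 2)
df.show()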
Q3. How to migrate from Hive to BigQuery
Migrating from Hive to BigQuery involves exporting data from Hive, transforming it into a compatible format, and importing it into BigQuery.
Export data from Hive using tools like Sqoop or Apache NiFi
Transform the data into a compatible format like Avro or Parquet
Import the transformed data into BigQuery using tools like Dataflow or the BigQuery Data Transfer Service
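As a hedged sketch of the final import step, assuming the Hive data was exported as Parquet and copied to Cloud Storage (the project, bucket, dataset, and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project="my-project")              # placeholder project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,            # Hive export format
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# Load the Parquet files staged in Cloud Storage into BigQuery
load_job = client.load_table_from_uri(
    "gs://my-bucket/hive_export/*.parquet",                 # placeholder GCS path
    "my-project.my_dataset.my_table",                       # placeholder destination
    job_config=job_config,
)
load_job.result()                                           # wait for the job to finish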
Q4. Difference between external and internal table
External tables reference data stored outside the database, while internal tables store data within the database.
External tables are defined on data that is stored outside the database, such as in HDFS or S3.
Internal (managed) tables store data in a warehouse location that the system controls, such as the Hive warehouse directory.
External tables do not delete data when dropped, while internal tables do.
Internal tables are managed by the database, while external tables are not.
Example: Creating an ...read more
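For illustration, a small Spark SQL sketch of the distinction (table names and the HDFS location are invented, and Hive support is assumed to be available):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Managed (internal) table: data lives in the warehouse directory and is
# removed when the table is dropped
spark.sql("CREATE TABLE IF NOT EXISTS managed_sales (id INT, amount DOUBLE)")

# External table: only metadata is registered; dropping the table leaves
# the files at the given location untouched
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS external_sales (id INT, amount DOUBLE)
    LOCATION 'hdfs:///data/external_sales'
""")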
Q5. Difference between rdd and data frame
RDD is a low-level abstraction in Spark representing distributed data, while DataFrames are higher-level structured APIs for working with data.
RDD is an immutable distributed collection of objects, while a DataFrame is a distributed collection of data organized into named columns.
RDDs are more suitable for unstructured data and low-level transformations, while DataFrames provide a more user-friendly API for structured data processing.
DataFrames offer optimizations like query op...read more
Q6. What is executor memory
Executor memory is the amount of memory allocated to each executor in a Spark application.
Executor memory is specified using the 'spark.executor.memory' configuration property.
It determines how much memory each executor can use to process tasks.
It is important to properly configure executor memory to avoid out-of-memory errors or inefficient resource utilization.
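For example, the setting can be supplied when building the session or via spark-submit (the 4g value is purely illustrative):

from pyspark.sql import SparkSession

# Equivalent to: spark-submit --conf spark.executor.memory=4g ...
spark = (
    SparkSession.builder
    .appName("executor_memory_demo")
    .config("spark.executor.memory", "4g")   # heap per executor; tune to the workload
    .getOrCreate()
)
print(spark.conf.get("spark.executor.memory"))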
Q7. Explain ADF questions in detail
ADF questions refer to Azure Data Factory questions which are related to data integration and data transformation processes.
ADF questions are related to Azure Data Factory, a cloud-based data integration service.
These questions may involve data pipelines, data flows, activities, triggers, and data movement.
Candidates may be asked about their experience with designing, monitoring, and managing data pipelines in ADF.
Examples of ADF questions include how to create a pipeline, ho...read more
Q8. Coalesce and repartition in Spark
Coalesce and repartition are operations in Spark used to control the number of partitions in a DataFrame.
Coalesce reduces the number of partitions without shuffling data, while repartition reshuffles data to create a specified number of partitions.
Coalesce is more efficient when reducing partitions, as it minimizes data movement.
Repartition is useful for evenly distributing data across a specified number of partitions.
Example: df.coalesce(1) will reduce the DataFrame to a sin...read more
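A minimal sketch contrasting the two (the DataFrame and partition counts are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition_demo").getOrCreate()

df = spark.range(1_000_000)                      # example DataFrame
print(df.rdd.getNumPartitions())                 # initial partition count

# repartition(8) performs a full shuffle to produce 8 evenly sized partitions
evenly_spread = df.repartition(8)

# coalesce(1) merges existing partitions without a shuffle, e.g. before writing one file
single_file_ready = evenly_spread.coalesce(1)
print(single_file_ready.rdd.getNumPartitions())  # 1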
Q9. Write Python code for palindrome
Python code to check if a string is a palindrome or not.
Define a function that takes a string as input.
Use string slicing to reverse the input string.
Compare the reversed string with the original string to check for palindrome.
Return True if the string is a palindrome, False otherwise.
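One way to implement the steps above:

def is_palindrome(text: str) -> bool:
    # Reverse the string with slicing and compare with the original
    return text == text[::-1]

print(is_palindrome("madam"))   # True
print(is_palindrome("hello"))   # False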
Q10. What is PySpark
PySpark is a Python API for Apache Spark, a powerful open-source distributed computing system.
PySpark is used for processing large datasets with distributed computing.
It provides high-level APIs in Python for Spark programming.
PySpark allows seamless integration with Python libraries like Pandas and NumPy.
Example: PySpark can be used for data processing, machine learning, and real-time analytics.
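A tiny sketch of the Pandas integration mentioned above (the sample data is made up):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark_pandas_demo").getOrCreate()

pdf = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

sdf = spark.createDataFrame(pdf)    # Pandas -> Spark DataFrame
sdf.show()
back_to_pandas = sdf.toPandas()     # Spark -> Pandas DataFrame
print(back_to_pandas)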
Q11. Explain Spark theory questions
Apache Spark is a fast and general-purpose cluster computing system.
Apache Spark is an open-source distributed computing system that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
It can be used for a wide range of applications such as batch processing, real-time stream processing, machine learning, and graph processing.
Spark provides high-level APIs in Java, Scala, Python, and R, and supports SQL, streaming data, mach...read more
Q12. Merge 2 unsorted arrays
Merge two unsorted arrays into a single sorted array.
Create a new array to store the merged result
Sort each input array first (or sort after concatenation), since an element-by-element merge needs sorted input
Iterate through both sorted arrays with two pointers, comparing elements to merge in sorted order (see the sketch below)
Handle remaining elements in either array after one array is fully processed
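A simple Python sketch of that approach:

def merge_unsorted(a, b):
    # The element-by-element merge needs sorted input, so sort both arrays first
    a, b = sorted(a), sorted(b)
    merged, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # Append whatever remains in either array
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged

print(merge_unsorted([5, 1, 3], [4, 2]))   # [1, 2, 3, 4, 5]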
Q13. Optimization Techniques
Optimization techniques are methods used to improve the efficiency and performance of data processing.
Use indexing to speed up data retrieval
Implement caching to reduce redundant computations
Utilize parallel processing for faster execution
Optimize algorithms for better performance
Use data partitioning to distribute workload evenly
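As a small plain-Python illustration of the caching idea (the expensive function is hypothetical):

from functools import lru_cache

@lru_cache(maxsize=None)
def expensive_lookup(key: str) -> int:
    # Stand-in for a costly computation or remote call
    print(f"computing {key}...")
    return len(key) * 42

expensive_lookup("abc")   # computed once
expensive_lookup("abc")   # served from the cache, no recomputation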
Q14. Spark Optimisation techniques
Spark optimization techniques aim to improve performance and efficiency of Spark jobs.
Use partitioning to distribute data evenly
Cache intermediate results to avoid recomputation
Optimize shuffle operations by reducing data shuffling
Use broadcast variables for small lookup tables
Tune memory and executor settings for better performance
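A hedged PySpark sketch touching several of these ideas (the file paths, column names, and partition count are invented):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("spark_optimisation_demo").getOrCreate()

facts = spark.read.parquet("facts.parquet")      # large fact table (illustrative path)
lookup = spark.read.parquet("lookup.parquet")    # small lookup table

# Broadcast the small table to avoid a shuffle-heavy join
joined = facts.join(broadcast(lookup), on="key", how="left")

# Repartition on the key to spread work evenly, then cache the intermediate
# result because it is reused by both actions below
joined = joined.repartition(200, "key").cache()

joined.groupBy("key").agg(F.sum("amount").alias("total")).show()
print(joined.count())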
Q15. Architecture of Spark
Spark is a distributed computing framework that provides in-memory processing capabilities for big data analytics.
Spark has a driver-executor architecture: a central coordinator (the driver) schedules tasks that run in executors on worker nodes; in standalone mode a Spark Master allocates resources to Spark Workers.
It uses Resilient Distributed Datasets (RDDs) for fault-tolerant distributed data processing.
Spark supports various programming languages like Scala, Java, Python, and R for writing applications.
It includes components like Spark SQL...read more