Sigmoid
Klenzaids Interview Questions and Answers
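Q1. inferSchema in PySpark when reading a file
inferSchema is a reader option in PySpark that makes Spark derive column types from the data instead of requiring an explicit schema.
When inferSchema is set to True for a CSV read, Spark scans the data to decide each column's type; without it, every CSV column is read as a string
It is useful when the schema of the file is not known beforehand
Because inference needs an extra pass over the data, supplying an explicit schema is usually faster and more predictable for large or production inputs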
Example: df = spark.read.csv('file.csv', header=True, inferSchema=True)
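To make the idea concrete, here is a rough sketch of what schema inference involves, written in plain Python rather than Spark. The helpers infer_type and infer_schema are hypothetical and far simpler than Spark's own inference (no dates, booleans, or nulls handling); they only show why the whole file has to be scanned before the types are known.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest type name that fits every non-empty value."""
    def fits(cast):
        try:
            for v in values:
                if v != "":
                    cast(v)
            return True
        except ValueError:
            return False

    if fits(int):
        return "int"
    if fits(float):
        return "double"
    return "string"


def infer_schema(csv_text):
    """Scan all rows once and map each header name to an inferred type."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = zip(*data)  # transpose rows into columns
    return {name: infer_type(col) for name, col in zip(header, columns)}


# Hypothetical sample data, not taken from any real file
sample = "id,price,city\n1,9.5,Pune\n2,12,Delhi\n"
schema = infer_schema(sample)
print(schema)
```

Note that the price column only resolves to a floating-point type because every row was inspected; a single non-numeric value anywhere in the column would push it to string.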
Q2. What is SCD in data warehousing?
SCD stands for Slowly Changing Dimension in Data Warehousing.
SCD is a technique used in data warehousing to track changes to dimension data over time.
There are different types of SCDs; the most common are Type 1, Type 2, and Type 3.
Type 1 overwrites the old value with the new one and keeps no history. Type 2 adds a new row for each change and keeps the old row as history, usually marked with effective dates or a current-row flag. Type 3 keeps the previous and current values in separate columns of the same row.
Example: In a customer dimension table, if a customer changes their address, a Type 2 SCD would create a new record ...
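The difference between Type 1 and Type 2 can be sketched with a small in-memory dimension table. This is only an illustration in plain Python: the column names valid_from, valid_to, and is_current and the helper functions are assumptions, not a standard, and a real warehouse would do this with a MERGE or equivalent.

```python
from datetime import date


def apply_type1(rows, customer_id, new_address):
    """Type 1: overwrite the attribute in place; no history is kept."""
    for row in rows:
        if row["customer_id"] == customer_id:
            row["address"] = new_address
    return rows


def apply_type2(rows, customer_id, new_address, change_date):
    """Type 2: close the current row and append a new current row."""
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["is_current"] = False
            row["valid_to"] = change_date
    rows.append({
        "customer_id": customer_id,
        "address": new_address,
        "valid_from": change_date,
        "valid_to": None,
        "is_current": True,
    })
    return rows


def new_dim():
    """Hypothetical starting state: one customer living in Pune."""
    return [{
        "customer_id": 1,
        "address": "Pune",
        "valid_from": date(2020, 1, 1),
        "valid_to": None,
        "is_current": True,
    }]


type1 = apply_type1(new_dim(), 1, "Delhi")
type2 = apply_type2(new_dim(), 1, "Delhi", date(2024, 6, 1))
print(len(type1), len(type2))
```

After the change, the Type 1 table still has one row and the old address is gone, while the Type 2 table has two rows and the old address remains queryable by date range.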
Q3. Optimization techniques in Spark
Common Spark optimizations involve partitioning, caching, and resource tuning for efficient data processing.
Partition data so that work is spread evenly across executors and no single task is skewed
Cache or persist DataFrames that are reused several times to avoid recomputation
Tune resources such as executor memory and cores, and parallelism settings such as spark.sql.shuffle.partitions
Q4. repartition vs coalesce
Repartition can increase or decrease the number of partitions in a DataFrame, while coalesce is meant only for decreasing it.
Repartition performs a full shuffle of the data across the network, which can be expensive in terms of performance and resources.
Coalesce is usually cheaper because it avoids a full shuffle and only merges existing partitions, although the resulting partitions can be uneven in size.
Repartition is typically used when there is a need for more parallelism or to evenly distribute data for better performance.
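The trade-off can be illustrated with a toy model in plain Python, where each partition is just a list. This is an assumption-laden sketch, not Spark's algorithm: the functions below are hypothetical stand-ins for df.coalesce(n) and df.repartition(n), and real Spark chooses which partitions to merge based on locality rather than by index.

```python
def coalesce(partitions, n):
    """Merge whole partitions into n groups; records stay with their partition."""
    if n >= len(partitions):
        return partitions  # like Spark, this toy coalesce never increases the count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged


def repartition(partitions, n):
    """Redistribute every record round-robin over n new partitions (a full shuffle)."""
    records = [r for part in partitions for r in part]
    out = [[] for _ in range(n)]
    for i, r in enumerate(records):
        out[i % n].append(r)
    return out


# Hypothetical skewed input: one large partition and three tiny ones
parts = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]
coalesced = coalesce(parts, 2)
reshuffled = repartition(parts, 2)
print([len(p) for p in coalesced])
print([len(p) for p in reshuffled])
```

In this model coalesce leaves the skew in place because it moves whole partitions, while repartition evens out the sizes at the cost of touching every record.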
Q5. Normalization in databases and its types
Normalization in databases is the process of organizing data in a database to reduce redundancy and improve data integrity.
It involves breaking down a table into smaller tables and defining relationships between them.
There are different normal forms such as 1NF, 2NF, 3NF, and BCNF.
Normalization reduces redundancy and prevents insert, update, and delete anomalies; the trade-off is that reads may need more joins, so it does not by itself make queries faster.
Example: In a database, instead of storing...
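As a small illustration of splitting one table into related tables, the sketch below uses Python's built-in sqlite3 module. The customers and orders tables and their columns are hypothetical, chosen only to show that a customer attribute stored once can be updated in a single row and then joined back to every order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    city TEXT NOT NULL
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount REAL NOT NULL
);
INSERT INTO customers VALUES (1, 'Asha', 'Pune');
INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 99.5);
""")

# The city lives in exactly one row, so one UPDATE fixes it everywhere
cursor = conn.execute("UPDATE customers SET city = 'Delhi' WHERE customer_id = 1")
updated_rows = cursor.rowcount

# Reads pay for this with a join back to the customers table
rows = conn.execute("""
    SELECT o.order_id, c.name, c.city
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()
print(updated_rows, rows)
```

Had the city been repeated on every order row, the same change would have required updating every order and risked leaving some rows inconsistent.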
Q6. Transformation vs action
A transformation lazily defines a new RDD or DataFrame from an existing one, while an action triggers execution and produces a result.
Transformations are lazy: they only record the step in the lineage and run nothing until an action is called
An action runs the recorded transformations and either returns a value to the driver or writes output to storage
Examples of transformations in Spark include map, filter, and reduceByKey
Examples of actions in Spark include count, collect, reduce, and saveAsTextFile
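Lazy evaluation can be demonstrated by analogy with Python's own lazy iterators. This is not Spark code: the built-in map and filter stand in for transformations, and list stands in for an action such as collect. The calls list is a hypothetical probe that records when work actually happens.

```python
calls = []


def double(x):
    calls.append(x)  # record that the function really ran
    return x * 2


data = [1, 2, 3, 4]

# "Transformations": build a lazy pipeline; nothing is computed yet
pipeline = filter(lambda v: v > 4, map(double, data))
calls_before = len(calls)

# "Action": consuming the pipeline triggers the whole computation
result = list(pipeline)
calls_after = len(calls)

print(calls_before, calls_after, result)
```

The function double is not invoked at all until the pipeline is consumed, which mirrors how Spark defers work until an action is called.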
Q7. rank vs dense rank
Both functions give tied rows the same rank; rank leaves gaps after ties, while dense rank does not.
The rank function assigns each row its position in the ordering, so rows with equal values share a rank.
The dense rank function also gives tied rows the same rank, but the next distinct value always gets the next consecutive rank.
After a tie, rank skips as many ranks as there were tied rows (1, 2, 2, 4), while dense rank does not skip (1, 2, 2, 3).
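The two numbering schemes can be reproduced in a few lines of plain Python. The functions below are a minimal sketch for a descending ordering over a hypothetical list of scores; in SQL or PySpark the same results would come from the RANK and DENSE_RANK window functions.

```python
def rank(scores):
    """Ties share a rank; the rank after a tie skips ahead by the tie count."""
    ordered = sorted(scores, reverse=True)
    # index() finds the first position of the value, so ties get the same rank
    return [ordered.index(s) + 1 for s in ordered]


def dense_rank(scores):
    """Ties share a rank; the next distinct value gets the next consecutive rank."""
    ordered = sorted(scores, reverse=True)
    distinct = sorted(set(scores), reverse=True)
    return [distinct.index(s) + 1 for s in ordered]


scores = [90, 100, 90, 80]  # hypothetical data with one tie
ranks = rank(scores)
dense_ranks = dense_rank(scores)
print(ranks)
print(dense_ranks)
```

With these scores the tie at 90 pushes the lowest score to rank 4 under rank, but only to rank 3 under dense rank.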