UST
I applied via Campus Placement and was interviewed in Oct 2024. There was 1 interview round.
Use a regular expression to remove special characters from a string
Use the regex pattern [^a-zA-Z0-9\s] to match any character that is not a letter, digit, or whitespace
Use the replace() function in your programming language to replace the matched special characters with an empty string
Example: input string 'Hello! How are you?' will become 'Hello How are you' after removing special characters
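The answer above can be sketched in Python, assuming the standard `re` module stands in for the generic replace() call. The function name `remove_special_characters` is made up for illustration.

```python
import re

# Matches any character that is not an ASCII letter, digit, or whitespace.
SPECIAL_CHARS = re.compile(r"[^a-zA-Z0-9\s]")


def remove_special_characters(text: str) -> str:
    """Return text with every matched special character replaced by ''."""
    return SPECIAL_CHARS.sub("", text)


print(remove_special_characters("Hello! How are you?"))  # Hello How are you
```

Note that this pattern also strips underscores and non-ASCII letters, which may or may not be wanted depending on the data.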
Rank gives tied rows the same rank and leaves gaps after ties, while dense rank gives tied rows the same rank and assigns consecutive ranks without gaps.
Rank leaves gaps in the rank sequence if there are ties, while dense rank does not
Rank function ranks rows based on a specified column; rows with equal values share a rank, so ranks are not unique (ROW_NUMBER is the function that numbers every row uniquely)
Dense rank function ranks rows the same way but never skips a rank value
Example: If two rows tie for rank 1, rank function assigns 1 to both and 3 to the next row, while dense rank assigns 1 to both and 2 to the next row
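To make the difference concrete, here is a small pure-Python sketch that mimics what SQL's RANK() and DENSE_RANK() return when ordering by a score in descending order. The helper name and the sample scores are made up for illustration; a real query would use the window functions directly.

```python
def rank_and_dense_rank(scores):
    """Return (score, rank, dense_rank) tuples, ordered by score descending."""
    ordered = sorted(scores, reverse=True)
    rows = []
    rank = 0
    dense = 0
    previous = None
    for position, score in enumerate(ordered, start=1):
        if score != previous:
            rank = position  # RANK jumps to the row position, so ties leave gaps
            dense += 1       # DENSE_RANK advances by one per distinct value
            previous = score
        rows.append((score, rank, dense))
    return rows


for row in rank_and_dense_rank([90, 100, 100, 80]):
    print(row)
# (100, 1, 1)
# (100, 1, 1)
# (90, 3, 2)
# (80, 4, 3)
```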
Use Google Cloud Storage to load CSV data into BigQuery
Upload the CSV file to Google Cloud Storage
Create a BigQuery table with the appropriate schema
Use the 'bq load' command to load the data from the CSV file into the BigQuery table
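As a rough sketch of the last step, the snippet below only assembles a `bq load` command line in Python; it does not run it. The dataset, table, bucket, and schema values are placeholders, and the helper name is hypothetical. The flags used (`--source_format`, `--skip_leading_rows`, `--autodetect`) are standard `bq load` options.

```python
import shlex


def build_bq_load_command(table, gcs_uri, schema=None, skip_header_rows=1):
    """Build the argument list for loading a CSV file from GCS into BigQuery."""
    args = [
        "bq", "load",
        "--source_format=CSV",
        f"--skip_leading_rows={skip_header_rows}",
    ]
    if schema is None:
        # No explicit schema given: ask BigQuery to infer one from the file.
        args.append("--autodetect")
    args += [table, gcs_uri]
    if schema is not None:
        # Inline schema in name:TYPE,name:TYPE form.
        args.append(schema)
    return args


# Placeholder names; replace with a real dataset, table, and bucket.
command = build_bq_load_command(
    "my_dataset.my_table",
    "gs://my-bucket/data.csv",
    schema="name:STRING,age:INTEGER",
)
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run(command, check=True)` on a machine where the Google Cloud SDK is installed and authenticated.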
I applied via Naukri.com and was interviewed in Jan 2024. There was 1 interview round.
ADF triggers are used in Azure Data Factory to schedule and orchestrate data pipelines.
ADF triggers enable the automation of data movement and data transformation activities.
Triggers can run pipelines on a schedule or in response to events.
They start pipeline runs and can pass parameters to the pipeline; tumbling window triggers can also depend on other tumbling window triggers.
Examples of triggers include time-based schedules, event-based t...
IR stands for Integration Runtime. Dataset is a representation of data, while linked service is a connection to the data source.
IR is the compute infrastructure that provides data integration capabilities
Dataset is a named view that points to the data used in activities, such as a table or file within a linked service's data store
Linked service is a connection to a data source, providing access to the data
IR enables data movement and transformation between different da...
Optimization techniques in Spark
Partitioning data to optimize data locality
Caching frequently accessed data
Using broadcast variables for small data sets
Using appropriate data structures and algorithms
Avoiding unnecessary shuffling of data
posted on 31 Dec 2024
Apache Spark architecture includes a cluster manager, worker nodes, and driver program.
The cluster manager allocates cluster resources (executors) to applications; it does not schedule individual tasks.
Executors running on worker nodes execute tasks and store data in memory or on disk.
The driver program builds the execution plan, schedules tasks on executors, and requests resources from the cluster manager.
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkCon...
reduceByKey is used to aggregate data based on key, while groupByKey is used to group data based on key.
reduceByKey is a transformation that combines the values of each key using an associative and commutative function; it takes no 'zero value' (foldByKey and aggregateByKey are the variants that do).
groupByKey is a transformation that groups the data based on a key and returns each key with the collection of its values.
reduceByKey is more efficient for aggregating data as it reduces the data before shuffling, while groupByKey shu...
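The efficiency point can be illustrated without a cluster. The sketch below is a pure-Python simulation, not Spark code: each inner list stands for one partition, and the two hypothetical helpers count how many records would cross the shuffle boundary under each strategy.

```python
from collections import defaultdict
from operator import add


def reduce_by_key(partitions, func):
    """Combine values per key inside each partition, then merge the partials."""
    shuffled = []
    for partition in partitions:
        combined = {}
        for key, value in partition:
            combined[key] = func(combined[key], value) if key in combined else value
        # Only one partial result per key per partition is shuffled.
        shuffled.extend(combined.items())
    result = {}
    for key, value in shuffled:
        result[key] = func(result[key], value) if key in result else value
    return result, len(shuffled)


def group_by_key(partitions):
    """Shuffle every record as-is, then collect the values of each key."""
    shuffled = [pair for partition in partitions for pair in partition]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    return dict(groups), len(shuffled)


partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]
totals, reduce_shuffled = reduce_by_key(partitions, add)
groups, group_shuffled = group_by_key(partitions)
print(totals, reduce_shuffled)  # {'a': 3, 'b': 3} 4
print(groups, group_shuffled)   # {'a': [1, 1, 1], 'b': [1, 1, 1]} 6
```

With only two keys the saving is small, but the gap grows with the number of repeated keys per partition.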
RDD is a low-level abstraction representing a distributed collection of objects, while DataFrame is a higher-level abstraction representing a distributed collection of data organized into named columns.
RDD is more suitable for unstructured data and low-level transformations, while DataFrame is more suitable for structured data and high-level abstractions.
DataFrames provide optimizations like query optimization and code...
The different modes of execution in Apache Spark include local mode, standalone mode, YARN mode, and Mesos mode.
Local mode: Spark runs on a single machine in one JVM, with the driver and executor sharing the process (for example, local[*] uses one worker thread per core).
Standalone mode: Spark runs on a cluster managed by a standalone cluster manager.
YARN mode: Spark runs on a Hadoop cluster using YARN as the resource manager.
Mesos mode: Spark runs on a Mesos cluster with Mesos as the resource manager.
I applied via Approached by Company and was interviewed in Apr 2024. There was 1 interview round.
I have handled terabytes of data in my POCs, including data from various sources and formats.
Handled terabytes of data in POCs
Worked with data from various sources and formats
Used tools like Hadoop, Spark, and SQL for data processing
Repartition performs a full shuffle and can increase or decrease the number of partitions, while coalesce decreases partitions by merging existing ones without a full shuffle.
Repartition is used when more partitions are needed to increase parallelism, or when data must be redistributed evenly.
Coalesce is used when there are too many partitions and they need to be reduced without the cost of a full shuffle.
Example: Repartition can be used before a join operation to evenly distribute data across partitions for bette...
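The trade-off can be pictured with a small pure-Python simulation. This is not Spark's actual algorithm (Spark also takes data locality into account when coalescing); the two helper functions are hypothetical and only show that merging whole partitions keeps skew, while a full shuffle evens it out.

```python
def coalesce(partitions, n):
    """Merge whole partitions into n groups; records are never split up."""
    if n >= len(partitions):
        # Without a shuffle, coalesce cannot increase the partition count.
        return [list(p) for p in partitions]
    merged = [[] for _ in range(n)]
    for index, partition in enumerate(partitions):
        merged[index % n].extend(partition)
    return merged


def repartition(partitions, n):
    """Full shuffle: deal every record round-robin across n new partitions."""
    records = [record for partition in partitions for record in partition]
    result = [[] for _ in range(n)]
    for index, record in enumerate(records):
        result[index % n].append(record)
    return result


skewed = [[1, 2, 3, 4, 5, 6], [7], [8], []]
print([len(p) for p in coalesce(skewed, 2)])     # [7, 1]
print([len(p) for p in repartition(skewed, 2)])  # [4, 4]
```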
Designing/configuring a cluster for 10 petabytes of data involves considerations for storage capacity, processing power, network bandwidth, and fault tolerance.
Consider using a distributed file system like HDFS or object storage like Amazon S3 to store and manage the large volume of data.
Implement a scalable processing framework like Apache Spark or Hadoop to efficiently process and analyze the data in parallel.
Utilize...
I applied via Job Portal and was interviewed in Aug 2024. There were 3 interview rounds.
It's a mandatory test, even for experienced candidates.
The question is about a PySpark problem.
Use SparkSession to create a Spark application
Load data from a source like CSV or Parquet files
Perform transformations and actions on the data using PySpark functions
Optimize performance by using caching and partitioning
I applied via LinkedIn and was interviewed in Sep 2024. There was 1 interview round.
I am a data engineer with a strong background in programming and data analysis.
Experienced in programming languages such as Python, SQL, and Java
Skilled in data manipulation, ETL processes, and data modeling
Worked on projects involving big data technologies like Hadoop and Spark
Handled conflict by facilitating open communication and finding a mutually beneficial solution
Identified the root cause of the conflict
Encouraged all parties involved to share their perspectives
Facilitated a discussion to find common ground and reach a resolution
Ensured that all parties felt heard and respected
Implemented strategies to prevent similar conflicts in the future
1 Interview rounds
based on 11 reviews
Software Developer (2k salaries): ₹2.5 L/yr - ₹12.2 L/yr
Senior Software Engineer (1.6k salaries): ₹6.5 L/yr - ₹26 L/yr
Software Engineer (1.3k salaries): ₹3.6 L/yr - ₹14.7 L/yr
System Analyst (1.2k salaries): ₹6.5 L/yr - ₹22.2 L/yr
Senior Software Developer (1.1k salaries): ₹5.5 L/yr - ₹19.6 L/yr