Hadoop is a distributed computing framework for processing large data sets across clusters of computers.
Hadoop consists of HDFS (Hadoop Distributed File System) for storage and MapReduce for processing.
HDFS divides data into blocks and stores them across multiple nodes in a cluster.
MapReduce is a programming model for processing large data sets in parallel across a distributed cluster.
Hadoop also includes ...
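The MapReduce model described above can be sketched without Hadoop itself. This is a minimal pure-Python illustration of the map, shuffle, and reduce phases; the function names are illustrative, not Hadoop APIs:

```python
from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) pairs, as a word-count mapper would
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Group all values by key (the framework does this between map and reduce)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big cluster", "big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

In real Hadoop the map and reduce phases run on different nodes and the shuffle moves data over the network; the data flow, however, is the same.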
Hadoop is a distributed storage system while Spark is a distributed processing engine.
Hadoop is primarily used for storing and processing large volumes of data in a distributed environment.
Spark is designed for fast data processing and can perform in-memory computations, making it faster than Hadoop for certain tasks.
Hadoop uses MapReduce for processing data, while Spark uses Resilient Distributed Datasets (RDDs) for f...
posted on 31 Dec 2024
Apache Spark architecture includes a cluster manager, worker nodes, and driver program.
Apache Spark architecture consists of a cluster manager, which allocates resources and schedules tasks.
Worker nodes execute tasks and store data in memory or disk.
Driver program coordinates tasks and communicates with the cluster manager.
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkCon...
reduceByKey is used to aggregate values per key, while groupByKey groups all values for each key.
reduceByKey is a transformation that merges the values for each key using an associative and commutative function (a related transformation, foldByKey, additionally takes a neutral 'zero value').
groupByKey is a transformation that groups the values for each key and returns a grouped data set.
reduceByKey is more efficient for aggregating data as it combines values on each partition before shuffling, while groupByKey shu...
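The efficiency difference can be illustrated without Spark. This pure-Python sketch (not Spark's API) shows that groupByKey-style grouping carries every value across the shuffle, while reduceByKey-style aggregation combines values per key first, so only one partial result per key needs to move:

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

def group_by_key(pairs):
    # groupByKey-style: every (key, value) pair is retained and shuffled
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

def reduce_by_key(pairs, func):
    # reduceByKey-style: combine values per key as they arrive,
    # so only one (key, partial_result) per key would be shuffled
    combined = {}
    for k, v in pairs:
        combined[k] = func(combined[k], v) if k in combined else v
    return combined

print(group_by_key(pairs))                        # {'a': [1, 3, 5], 'b': [2, 4]}
print(reduce_by_key(pairs, lambda x, y: x + y))   # {'a': 9, 'b': 6}
```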
RDD is a low-level abstraction representing a distributed collection of objects, while DataFrame is a higher-level abstraction representing a distributed collection of data organized into named columns.
RDD is more suitable for unstructured data and low-level transformations, while DataFrame is more suitable for structured data and high-level abstractions.
DataFrames provide optimizations like query optimization and code...
The different modes of execution in Apache Spark include local mode, standalone mode, YARN mode, and Mesos mode.
Local mode: Spark runs on a single machine, with the driver and executors in one JVM.
Standalone mode: Spark runs on a cluster managed by a standalone cluster manager.
YARN mode: Spark runs on a Hadoop cluster using YARN as the resource manager.
Mesos mode: Spark runs on a Mesos cluster with Mesos as the resource manager.
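The mode is typically selected via the `--master` flag of `spark-submit`. A sketch of the four modes listed above, where the hostnames, ports, and `app.py` are placeholders:

```shell
# Local mode: all components in a single JVM, using all available cores
spark-submit --master "local[*]" app.py

# Standalone mode: connect to a standalone cluster manager (placeholder host)
spark-submit --master spark://master-host:7077 app.py

# YARN mode: YARN acts as the resource manager
spark-submit --master yarn --deploy-mode cluster app.py

# Mesos mode: Mesos acts as the resource manager (placeholder host)
spark-submit --master mesos://mesos-host:5050 app.py
```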
posted on 7 May 2024
I applied via Approached by Company and was interviewed in Apr 2024. There was 1 interview round.
Create a list containing all Python data types.
Use the following data types: int, float, complex, str, list, tuple, dict, set, bool, bytes, bytearray, memoryview, NoneType (None is the sole value of NoneType)
Example: ['int', 'float', 'complex', 'str', 'list', 'tuple', 'dict', 'set', 'bool', 'bytes', 'bytearray', 'memoryview', 'NoneType']
Extract a character from a string in a list of strings.
Iterate through the list of strings
Use indexing to extract the desired character from each string
Handle cases where the index is out of range
Return the extracted characters as a new list
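The steps above can be sketched as a short function; the function name and sample data are illustrative:

```python
def extract_chars(strings, index):
    # Collect the character at `index` from each string,
    # skipping strings too short to have that index
    result = []
    for s in strings:
        if -len(s) <= index < len(s):
            result.append(s[index])
    return result

print(extract_chars(['spark', 'hadoop', 'ml', 'etl'], 2))  # ['a', 'd', 'l']
```

Note that 'ml' is skipped: index 2 is out of range for a two-character string, which covers the out-of-range case mentioned above.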
Create a dictionary with Name and Age for 4 records in Python.
Use curly braces {} to create a dictionary.
Separate key-value pairs with a colon :
Separate each record with a comma ,
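A minimal sketch of the four records; the names and ages are made up. One common shape is a list of dictionaries, with a single Name-to-Age dictionary as an alternative:

```python
# Four records as a list of dictionaries (names/ages are illustrative)
records = [
    {'Name': 'Asha', 'Age': 28},
    {'Name': 'Ravi', 'Age': 34},
    {'Name': 'Meera', 'Age': 25},
    {'Name': 'Karan', 'Age': 31},
]

# Alternative single-dictionary shape: Name -> Age
ages = {r['Name']: r['Age'] for r in records}
print(len(records), ages['Ravi'])  # 4 34
```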
Function to check if a string is a palindrome.
Create a function that takes a string as input.
Reverse the string and compare it with the original string.
Return true if they are the same, false otherwise.
Example: 'racecar' is a palindrome.
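The reverse-and-compare approach described above, using Python's slice reversal:

```python
def is_palindrome(text):
    # Reverse with slicing and compare to the original
    return text == text[::-1]

print(is_palindrome('racecar'))  # True
print(is_palindrome('spark'))    # False
```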
Data normalization is the process of organizing data in a database efficiently, while data standardization is the process of ensuring consistency and uniformity in data.
Data normalization involves organizing data into tables and columns to reduce redundancy and improve data integrity.
Data standardization involves ensuring that data is consistent and uniform across the database.
Normalization helps in reducing data redun...
The probability of drawing 3 balls of the same color from a box containing 4 balls of each color (Red, Green, Blue).
Calculate the total number of ways to draw 3 balls out of 12 balls
Calculate the number of ways to draw 3 balls of the same color
Divide the number of favorable outcomes by the total number of outcomes to get the probability
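The steps above can be worked through directly, assuming 3 balls are drawn at once without replacement:

```python
from fractions import Fraction
from math import comb

total = comb(12, 3)          # ways to draw any 3 of the 12 balls: 220
favorable = 3 * comb(4, 3)   # pick a color (3 ways), then 3 of its 4 balls: 12
probability = Fraction(favorable, total)
print(probability)           # 3/55
```

So the probability is 12/220 = 3/55, roughly 5.5%.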
Apache Spark is a fast and general-purpose cluster computing system.
Apache Spark is an open-source distributed computing system that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
It has a unified architecture that combines SQL, streaming, machine learning, and graph processing capabilities.
Spark architecture consists of a driver program that coordinates the exe...
I applied via Approached by Company and was interviewed in Apr 2024. There was 1 interview round.
I have handled terabytes of data in my POCs, including data from various sources and formats.
Handled terabytes of data in POCs
Worked with data from various sources and formats
Used tools like Hadoop, Spark, and SQL for data processing
Repartition is used for increasing partitions for parallelism, while coalesce is used for decreasing partitions to reduce shuffling.
Repartition is used when there is a need for more partitions to increase parallelism.
Coalesce is used when there are too many partitions and they need to be reduced without a full shuffle.
Example: Repartition can be used before a join operation to evenly distribute data across partitions for bette...
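The distinction can be illustrated with a toy model (not Spark's API): coalesce merges whole existing partitions into fewer groups, while repartition redistributes every individual record. The merge strategy below is simplified; Spark's actual coalesce combines partitions by locality:

```python
# Illustrative model: a dataset as a list of partitions
partitions = [[1, 2], [3], [4, 5], [6], [7, 8], [9]]

def coalesce(parts, n):
    # Merge existing partitions into n groups without moving
    # individual records between groups (no full shuffle)
    merged = [[] for _ in range(n)]
    for i, p in enumerate(parts):
        merged[i % n].extend(p)
    return merged

def repartition(parts, n):
    # Redistribute every record across n partitions (full shuffle)
    flat = [x for p in parts for x in p]
    return [flat[i::n] for i in range(n)]

print(coalesce(partitions, 2))     # [[1, 2, 4, 5, 7, 8], [3, 6, 9]]
print(repartition(partitions, 3))  # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
```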
Designing/configuring a cluster for 10 petabytes of data involves considerations for storage capacity, processing power, network bandwidth, and fault tolerance.
Consider using a distributed file system like HDFS or object storage like Amazon S3 to store and manage the large volume of data.
Implement a scalable processing framework like Apache Spark or Hadoop to efficiently process and analyze the data in parallel.
Utilize...
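A back-of-envelope capacity calculation for the storage side; every parameter here is an assumption chosen for illustration, not a recommendation:

```python
# Back-of-envelope sizing; all parameters are assumptions
raw_data_pb = 10            # logical data volume in petabytes
replication = 3             # HDFS-style 3x replication
overhead = 1.25             # ~25% headroom for temp/intermediate data
disk_per_node_tb = 12 * 16  # e.g. 12 drives of 16 TB per node

required_tb = raw_data_pb * 1000 * replication * overhead
nodes = -(-required_tb // disk_per_node_tb)  # ceiling division
print(int(required_tb), int(nodes))  # 37500 196
```

Under these assumptions, 10 PB of logical data needs roughly 37.5 PB of raw disk, or about 196 nodes for storage alone, before accounting for compute, network, and failure-domain requirements.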
I applied via Job Portal and was interviewed in Aug 2024. There were 3 interview rounds.
It's a mandatory test even for experienced people
I applied via campus placement at KLS Institute of Management Education and Research, Belgaum and was interviewed in Jun 2024. There were 2 interview rounds.
Basic coding test, e.g. checking whether a number is prime
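A typical prime check of the kind such a test asks for, using trial division up to the square root:

```python
def is_prime(n):
    # Trial division: n is composite iff it has a divisor <= sqrt(n)
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

print([x for x in range(2, 20) if is_prime(x)])  # [2, 3, 5, 7, 11, 13, 17, 19]
```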
Process Advisor: 705 salaries, ₹1 L/yr - ₹7 L/yr
Assistant Manager: 504 salaries, ₹7.5 L/yr - ₹22 L/yr
Assistant Vice President: 435 salaries, ₹13.5 L/yr - ₹39 L/yr
Senior Analyst: 343 salaries, ₹3.5 L/yr - ₹10.1 L/yr
Process Associate: 197 salaries, ₹1 L/yr - ₹6.6 L/yr
HSBC Group
JPMorgan Chase & Co.
Standard Chartered
Deutsche Bank