RDD stands for Resilient Distributed Dataset and is the fundamental data structure of Apache Spark.
An RDD is a distributed collection of objects that can be operated on in parallel.
DataFrames and Datasets are higher-level abstractions built on top of RDDs.
RDDs are more low-level and offer more control over data processing compared to DataFrames and Datasets.
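The RDD idea (an immutable, partitioned collection transformed with functional operations and collected back to the driver) can be sketched in plain Python. `SimpleRDD` below is a hypothetical stand-in to illustrate the concept, not the real pyspark API:

```python
class SimpleRDD:
    """Toy stand-in for an RDD: data split into partitions, transformed functionally."""
    def __init__(self, partitions):
        self.partitions = partitions  # list of lists, one per "node"

    def map(self, f):
        # apply f element-wise within each partition (parallelizable per partition)
        return SimpleRDD([[f(x) for x in p] for p in self.partitions])

    def filter(self, pred):
        return SimpleRDD([[x for x in p if pred(x)] for p in self.partitions])

    def collect(self):
        # gather results from all partitions back to the "driver"
        return [x for p in self.partitions for x in p]

rdd = SimpleRDD([[1, 2, 3], [4, 5, 6]])
result = rdd.map(lambda x: x * 2).filter(lambda x: x > 4).collect()
print(result)  # [6, 8, 10, 12]
```

In real Spark, `map` and `filter` are lazy transformations and only `collect` triggers execution; this sketch evaluates eagerly for simplicity.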
Partitioning is the process of dividing data into smaller chunks for better organization and processing in distributed systems.
Partitioning helps in distributing data across multiple nodes for parallel processing.
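Partitioning into roughly equal chunks can be sketched in a few lines of plain Python (the function name and chunking strategy are illustrative, not Spark's actual partitioner):

```python
def partition(data, num_partitions):
    """Split data into roughly equal chunks, one per worker node."""
    size, rem = divmod(len(data), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < rem else 0)  # spread the remainder
        parts.append(data[start:end])
        start = end
    return parts

print(partition(list(range(10)), 3))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```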
Coalesce is used to reduce the number of partitions without shuffling data, while repartition is used to increase the number of partitions by shuffling data.
Example: coalesce(5) will merge the existing partitions into 5 partitions without a full shuffle.
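The key point of coalesce, that whole partitions are concatenated rather than every record being redistributed, can be sketched in plain Python (a simplification; Spark's real coalesce groups contiguous partitions and keeps data node-local):

```python
def coalesce(partitions, n):
    """Reduce partition count by concatenating whole partitions (no full shuffle):
    each existing partition is assigned intact to one of n output buckets."""
    out = [[] for _ in range(n)]
    for i, p in enumerate(partitions):
        out[i % n].extend(p)
    return out

parts = [[1], [2], [3], [4], [5], [6]]
print(coalesce(parts, 3))  # [[1, 4], [2, 5], [3, 6]]
```

A repartition, by contrast, would reshuffle individual records across all output partitions, which is more expensive but can fix skew.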
Spark is a distributed computing framework that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Spark has a master-slave architecture with a driver program that communicates with a cluster manager to distribute work across worker nodes.
It uses Resilient Distributed Datasets (RDDs) for fault-tolerant distributed data processing.
Spark supports various programming languages, including Scala, Java, Python, and R.
DAG stands for Directed Acyclic Graph. It is a finite directed graph with no cycles.
DAG is a collection of nodes connected by edges where each edge goes from one node to another, but no cycles are allowed.
In the context of Spark, a DAG represents the sequence of transformations that need to be applied to the input data to get the final output.
When a Spark job is submitted, Spark creates a DAG of the transformations specified, then splits it into stages of tasks for execution.
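The ordering that a DAG imposes on transformations can be illustrated with Python's standard-library `graphlib`; the node names here are hypothetical stages, not Spark internals:

```python
from graphlib import TopologicalSorter

# Each key lists the transformations it depends on, mirroring how a
# scheduler orders stages from a DAG: a node runs only after its inputs.
dag = {
    "read": set(),
    "map": {"read"},
    "filter": {"map"},
    "join": {"filter", "read"},
    "save": {"join"},
}
order = list(TopologicalSorter(dag).static_order())
print(order)  # a valid execution order, e.g. ['read', 'map', 'filter', 'join', 'save']
```

Because the graph is acyclic, a valid linear execution order always exists; a cycle would make the job unschedulable.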
I was interviewed in Dec 2024.
Our team follows a CI/CD workflow that includes automated testing, code reviews, and continuous integration.
Automated testing is run on every code change to catch bugs early.
Code reviews are conducted before merging changes to ensure code quality.
Continuous integration is used to automatically build and test code changes in a shared repository.
Deployment pipelines are set up to automate the release process.
Version control is used to track changes and collaborate on code effectively.
Yes, there have been security incidents and I have handled them effectively.
Implemented security protocols to prevent future incidents
Conducted thorough investigation to identify the root cause
Collaborated with IT team to strengthen security measures
Communicated with stakeholders to ensure transparency and trust
Provided training to employees on cybersecurity best practices
Authentication verifies the identity of a user, while authorization determines what actions a user is allowed to perform.
Authentication confirms the identity of a user through credentials like passwords or biometrics.
Authorization determines the level of access or permissions a user has once their identity is confirmed.
Authentication is the process of logging in, while authorization is the process of granting or denying access to resources.
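The distinction can be sketched with two tiny functions; the user store, role map, and plaintext passwords below are toy assumptions for illustration only:

```python
USERS = {"alice": "s3cret"}           # username -> password (toy only, never store plaintext)
ROLES = {"alice": {"read", "write"}}  # username -> allowed actions

def authenticate(username, password):
    """Who are you? Verify identity from credentials."""
    return USERS.get(username) == password

def authorize(username, action):
    """What may you do? Check permissions of an already-known identity."""
    return action in ROLES.get(username, set())

assert authenticate("alice", "s3cret")   # identity confirmed
assert authorize("alice", "read")        # permitted
assert not authorize("alice", "delete")  # authenticated, but not authorized
```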
LLD for an authentication and authorization system
Separate modules for authentication and authorization
Use of secure hashing algorithms for storing passwords
Role-based access control implementation
Audit logging for tracking user actions
Integration with external identity providers
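Two of the points above, secure password hashing and role-based access control, can be sketched with the standard library. This is a minimal illustration of the design, not a production implementation; the role names and iteration count are assumptions:

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Store a salted PBKDF2-SHA256 hash, never the plaintext password."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password, salt, digest):
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, digest)  # constant-time comparison

# Role-based access control: roles map to permission sets.
PERMISSIONS = {"admin": {"read", "write", "delete"}, "viewer": {"read"}}

def can(role, action):
    return action in PERMISSIONS.get(role, set())

salt, digest = hash_password("hunter2")
assert verify_password("hunter2", salt, digest)
assert can("viewer", "read") and not can("viewer", "delete")
```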
Design a document management storage system like Google Drive as an end-to-end (E2E) solution.
Implement user authentication and authorization for secure access.
Create a user-friendly interface for uploading, organizing, and sharing documents.
Include features like version control, file syncing, and search functionality.
Utilize cloud storage for scalability and accessibility.
Implement encryption for data security.
Integrate with third-party apps.
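The upload and version-control pieces of the design can be sketched with a toy in-memory store; the class and method names are hypothetical, and a real system would back this with cloud object storage:

```python
class DocumentStore:
    """Toy in-memory sketch of versioned document storage."""
    def __init__(self):
        self._docs = {}  # name -> list of versions (latest last)

    def upload(self, name, content):
        self._docs.setdefault(name, []).append(content)
        return len(self._docs[name])  # new version number

    def read(self, name, version=None):
        versions = self._docs[name]
        return versions[-1] if version is None else versions[version - 1]

    def search(self, term):
        return [n for n, vs in self._docs.items() if term in vs[-1]]

store = DocumentStore()
store.upload("notes.txt", "draft one")
store.upload("notes.txt", "final draft")
print(store.read("notes.txt"))     # "final draft"
print(store.read("notes.txt", 1))  # "draft one"
print(store.search("final"))       # ["notes.txt"]
```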
I applied via Naukri.com and was interviewed in Dec 2024. There was 1 interview round.
I applied via campus placement at National Institute of Technology (NIT), Warangal
1 hour aptitude test
posted on 11 Sep 2024
I applied via Company Website and was interviewed in Aug 2024. There was 1 interview round.
RAG pipeline is a data processing pipeline used in data science to categorize data into Red, Amber, and Green based on certain criteria.
RAG stands for Red, Amber, Green which are used to categorize data based on certain criteria
Red category typically represents data that needs immediate attention or action
Amber category represents data that requires monitoring or further investigation
Green category represents data that is on track and requires no action.
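A Red/Amber/Green categorization as described above is essentially thresholding; a minimal sketch (the metric and threshold values are hypothetical):

```python
def rag_status(value, red_below=50, amber_below=75):
    """Categorize a metric into Red/Amber/Green; thresholds are illustrative."""
    if value < red_below:
        return "Red"    # needs immediate attention
    if value < amber_below:
        return "Amber"  # monitor or investigate further
    return "Green"      # on track, no action needed

print([rag_status(v) for v in (30, 60, 90)])  # ['Red', 'Amber', 'Green']
```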
A confusion matrix is used to evaluate the performance of a classification model by comparing predicted values with actual values.
Confusion matrix is a table that describes the performance of a classification model.
It consists of four different metrics: True Positive, True Negative, False Positive, and False Negative.
These metrics are used to calculate other evaluation metrics like accuracy, precision, recall, and F1 score.
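Computing the four counts and the derived metrics takes only a few lines; the sample labels below are made up for illustration:

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)

accuracy  = (tp + tn) / len(y_true)              # 4/6
precision = tp / (tp + fp)                       # 2/3: of predicted positives, how many were right
recall    = tp / (tp + fn)                       # 2/3: of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
print(tp, tn, fp, fn)  # 2 2 1 1
```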
I applied via Recruitment Consultant and was interviewed in Apr 2024. There was 1 interview round.
SQL, Python coding …
I applied via Company Website and was interviewed in Sep 2024. There were 2 interview rounds.
Basic mathematical and reasoning questions.
Developed a predictive model for customer churn in a telecom company
Collected and cleaned customer data including usage patterns and demographics
Used machine learning algorithms such as logistic regression and random forest
Evaluated model performance using metrics like accuracy and AUC-ROC curve
Random forest is an ensemble learning method that uses multiple decision trees to make predictions, while a decision tree is a single tree-like structure that makes decisions based on features.
Random forest is a collection of decision trees that work together to make predictions.
Decision tree is a single tree-like structure that makes decisions based on features.
Random forest reduces overfitting by averaging the predictions of many trees.
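The core of a random forest's prediction, majority voting across many trees, can be sketched with stand-in "trees" (the three rules below are hypothetical weak classifiers, not trained trees):

```python
from collections import Counter

# Hypothetical "trees": each is a weak rule voting on a label.
trees = [
    lambda x: "spam" if "win" in x else "ham",
    lambda x: "spam" if "free" in x else "ham",
    lambda x: "spam" if len(x) > 40 else "ham",
]

def forest_predict(x):
    """Majority vote across trees: the ensemble overrules any single tree."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

print(forest_predict("win a free prize"))  # 'spam' (2 of 3 trees vote spam)
```

A single decision tree corresponds to using just one of these rules, which is why its individual errors carry straight through to the prediction.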
A cost function is a mathematical formula used to measure the cost of a particular decision or set of decisions.
Cost function helps in evaluating the performance of a model by measuring how well it is able to predict the outcomes.
It is used in optimization problems to find the best solution that minimizes the cost.
Examples include mean squared error in linear regression and cross-entropy loss in logistic regression.
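Mean squared error, the first example above, is short enough to write out directly (the sample values are made up):

```python
def mse(y_true, y_pred):
    """Mean squared error: the cost function commonly minimized in linear regression."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

print(mse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))  # (1 + 0 + 4) / 3 ≈ 1.667
```

Training then amounts to searching for the model parameters that make this number as small as possible.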
I applied via Referral and was interviewed in May 2024. There were 3 interview rounds.
I was asked to write SQL queries for the 3rd highest employee salary, some name filtering, and GROUP BY tasks.
Python code to find the index of the maximum number without using numpy.
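Both tasks can be tried end to end with the standard library, using an in-memory SQLite table; the table name, columns, and sample rows are made up for illustration:

```python
import sqlite3

# In-memory table to try the "3rd highest salary" query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("a", 100), ("b", 300), ("c", 200), ("d", 300), ("e", 150)])

# DISTINCT handles tied salaries; OFFSET 2 skips the top two distinct values.
third = conn.execute(
    "SELECT DISTINCT salary FROM employees "
    "ORDER BY salary DESC LIMIT 1 OFFSET 2"
).fetchone()[0]
print(third)  # 150  (distinct salaries: 300, 200, 150, 100)

# Index of the maximum number without numpy:
def argmax(nums):
    return max(range(len(nums)), key=lambda i: nums[i])

print(argmax([3, 9, 4, 9, 1]))  # 1 (first occurrence of the maximum)
```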
Answering questions related to data science concepts and techniques.
Recall is the ratio of correctly predicted positive observations to the total actual positives. Precision is the ratio of correctly predicted positive observations to the total predicted positives.
To reduce variance in an ensemble model, techniques like bagging, boosting, and stacking can be used. Bagging involves training multiple models on different bootstrap samples of the data and averaging their predictions.
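The bagging idea can be sketched with each "model" reduced to the mean of a bootstrap resample; the data and model count below are arbitrary assumptions:

```python
import random
import statistics

data = [2.0, 4.0, 6.0, 8.0, 10.0]

def bootstrap_means(data, n_models=200):
    """Bagging sketch: each 'model' is the mean of a bootstrap resample
    (sampling with replacement); the ensemble averages all of them."""
    rng = random.Random(42)  # fixed seed for reproducibility
    return [statistics.mean(rng.choices(data, k=len(data)))
            for _ in range(n_models)]

means = bootstrap_means(data)
ensemble_estimate = statistics.mean(means)
print(round(ensemble_estimate, 2))  # close to the true mean 6.0
```

The individual bootstrap means scatter widely, but their average is much more stable, which is exactly the variance reduction bagging provides for full models.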
I applied via Referral and was interviewed in Aug 2024. There were 2 interview rounds.
Topics like sql, python, aptitude were covered.
posted on 30 Mar 2023
I applied via LinkedIn and was interviewed before Mar 2022. There were 3 interview rounds.
Senior Applied Data Scientist | 128 salaries | ₹10.9 L/yr - ₹20 L/yr
Lead Applied Data Scientist | 85 salaries | ₹17 L/yr - ₹28.5 L/yr
Applied Data Scientist | 79 salaries | ₹10 L/yr - ₹16.5 L/yr
Senior Engineer | 61 salaries | ₹10 L/yr - ₹30 L/yr
Senior Data Scientist | 49 salaries | ₹9 L/yr - ₹28 L/yr
Fractal Analytics
Mu Sigma
AbsolutData
Algonomy