10+ Infosys Data Scientist Interview Questions and Answers

Updated 19 Oct 2024

Q1. An XGBoost model has 10-20 features. How are the splits decided, and which feature is chosen for each split?

Ans.

XGBoost uses a greedy approach: at each node it evaluates candidate splits and keeps the one with the highest gain in its regularized objective.

  • XGBoost computes a gain for each candidate split of each feature from the first- and second-order gradients of the loss, rather than from feature importance or entropy-based information gain.

  • The feature and threshold with the highest gain are chosen for the split.

  • This process is repeated recursively for each node in the tree.

  • Numerical features are split on thresholds; categorical features are typically encoded first or handled through the library's categorical support.

  • Example: If a feature like 'age' has the highest gain, the data will be split based on di...
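
The greedy split search can be sketched as follows. This is a minimal illustration, not XGBoost's actual implementation: it assumes squared-error loss (so each gradient is `pred - target` and each hessian is 1), scores candidates with the second-order gain formula described in the XGBoost paper, and uses hypothetical toy data and function names.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Second-order gain of splitting a node into left and right children.

    g_* are sums of gradients, h_* are sums of hessians, lam is the L2
    penalty on leaf weights and gamma is the cost of adding a leaf.
    """
    def score(g, h):
        return g * g / (h + lam)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


def best_split(rows, targets, preds, lam=1.0, gamma=0.0):
    """Exhaustive greedy search over every feature and threshold.

    Assumes squared-error loss: gradient = pred - target, hessian = 1.
    Returns (gain, feature_index, threshold) of the best candidate.
    """
    grads = [p - t for p, t in zip(preds, targets)]
    best = (float("-inf"), None, None)
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [g for row, g in zip(rows, grads) if row[j] < thr]
            right = [g for row, g in zip(rows, grads) if row[j] >= thr]
            # With hessian = 1 per row, the hessian sums are the row counts.
            gain = split_gain(sum(left), len(left), sum(right), len(right), lam, gamma)
            if gain > best[0]:
                best = (gain, j, thr)
    return best


# Hypothetical toy data: column 0 is 'age', column 1 is a noise feature.
rows = [[22, 5], [25, 1], [47, 4], [52, 2], [46, 3], [56, 1]]
targets = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
preds = [0.5] * len(rows)  # constant initial prediction
gain, feature, threshold = best_split(rows, targets, preds)
print(feature, threshold, round(gain, 3))
```

On this toy data the search selects the 'age' column, because splitting it separates the two target groups cleanly. The real library avoids the exhaustive scan by using sorted or histogram-based candidate thresholds.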


Q2. Explain precision and recall. In which scenarios is each one used?

Ans.

Precision and recall are metrics used in evaluating the performance of classification models.

  • Precision measures the accuracy of positive predictions, while recall measures the ability of the model to find all positive instances.

  • Precision = TP / (TP + FP)

  • Recall = TP / (TP + FN)

  • Precision is important when false positives are costly, while recall is important when false negatives are costly.

  • For example, in a spam email detection system, high precision is desired to avoid classif...
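
The two formulas above can be computed directly from true and predicted labels. The sketch below is illustrative only; the function name and the label lists are hypothetical.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: 1 = spam, 0 = not spam.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(round(precision, 3), round(recall, 3))
```

Here TP = 2, FP = 1 and FN = 2, so precision is 2/3 while recall is 1/2: the model is right about most of what it flags but misses half of the actual positives.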


Q3. What is an activation function? Explain Naive Bayes, the confusion matrix, hyperparameters in DL, and hypothesis testing.

Ans.

An activation function is a mathematical function used in neural networks to introduce non-linearity.

  • The activation function is applied to the weighted sum of inputs (plus a bias) at each node of the network.

  • Its result is the node's output, the activation that is passed on to the next layer.

  • Common activation functions include sigmoid, tanh, ReLU, and softmax.

  • Activation functions introduce non-linearity, allowing neural networks to learn complex patterns.

  • They help in improving the accuracy and performance of ...
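
As a rough illustration of the functions named above, the following sketch implements sigmoid, tanh, ReLU, and softmax with the standard library and applies one to a single node. The `neuron` helper and its inputs are hypothetical.

```python
import math


def sigmoid(x):
    """Squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squashes any real number into (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Passes positive values through and zeroes out negatives."""
    return max(0.0, x)


def softmax(xs):
    """Turns a list of scores into probabilities that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def neuron(inputs, weights, bias, activation):
    """Weighted sum of inputs plus bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Hypothetical node: z = 0.5*1.0 + (-1.0)*2.0 + 0.5 = -1.0
out_relu = neuron([1.0, 2.0], [0.5, -1.0], 0.5, relu)
out_sigmoid = neuron([1.0, 2.0], [0.5, -1.0], 0.5, sigmoid)
probs = softmax([2.0, 1.0, 0.1])
print(out_relu, round(out_sigmoid, 4), [round(p, 3) for p in probs])
```

The same pre-activation value gives different outputs depending on the function chosen, which is why the choice of activation is treated as a design decision per layer.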


Q4. What is SMOTE? Do you have any experience working on time series? Code analysis of a global variable?

Ans.

SMOTE stands for Synthetic Minority Over-sampling Technique, used to balance imbalanced datasets by generating synthetic samples.

  • SMOTE is commonly used in machine learning to address class imbalance by creating synthetic samples of the minority class.

  • It works by generating new instances of the minority class by interpolating between existing instances.

  • SMOTE is particularly useful in scenarios where the minority class is underrepresented and traditional sampling techniques may...
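
The interpolation idea can be sketched in a few lines. This is a simplified illustration of the SMOTE principle, not the reference implementation from any library: the function name, the choice of `k`, and the sample points are all assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class samples with two features each.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [1.2, 2.2]]
synthetic = smote_sketch(minority, n_new=5)
print(len(synthetic))
```

Every synthetic point lies on a line segment between two existing minority samples, so the new data stays inside the region the minority class already occupies.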


Q5. Do you have any experience with cloud platforms?

Ans.

Yes, I have experience working on cloud platforms such as AWS and Google Cloud.

  • Experience with AWS services like S3, EC2, and Redshift

  • Familiarity with Google Cloud services like BigQuery and Compute Engine

  • Utilized cloud platforms for data storage, processing, and analysis


Q6. What are Python and R?

Ans.

Python and R are programming languages commonly used in data science and statistical analysis.

  • Python is a general-purpose language with a large community and many libraries for data manipulation and machine learning.

  • R is a language specifically designed for statistical computing and graphics, with a wide range of packages for data analysis and visualization.

  • Both languages are popular choices for data scientists and have their own strengths and weaknesses.

  • Python is often used ...


Q7. What are L1 and L2 regularization?

Ans.

L1 and L2 regularization are techniques used in machine learning to prevent overfitting by adding penalty terms to the cost function.

  • L1 regularization adds the sum of the absolute values of the coefficients as a penalty term to the cost function.

  • L2 regularization adds the sum of the squared values of the coefficients as a penalty term to the cost function.

  • L1 regularization can lead to sparse models by forcing some coefficients to be exactly zero.

  • L2 regularization is computationally more efficient comp...
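
The difference between the two penalties, and why L1 produces exact zeros, can be illustrated with a small sketch. The penalty strength `lam`, the weights, and the function names are hypothetical; the two shrinkage functions are the standard proximal steps for `lam * |w|` and `lam * w**2`.

```python
def l1_penalty(weights, lam):
    """lam times the sum of absolute coefficient values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """lam times the sum of squared coefficient values."""
    return lam * sum(w * w for w in weights)


def l1_shrink(w, lam):
    """Soft-thresholding: moves w toward zero by lam and stops at zero."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def l2_shrink(w, lam):
    """Scales w toward zero; a non-zero w never becomes exactly zero."""
    return w / (1.0 + 2.0 * lam)


weights = [0.05, -0.3, 2.0]  # hypothetical coefficients
lam = 0.1
print(l1_penalty(weights, lam), l2_penalty(weights, lam))
print([l1_shrink(w, lam) for w in weights])
print([l2_shrink(w, lam) for w in weights])
```

The smallest coefficient is set to exactly zero by the L1 step but only scaled down by the L2 step, which matches the sparsity behaviour described above.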


Q8. What is multicollinearity?

Ans.

Multicollinearity is a phenomenon where two or more independent variables in a regression model are highly correlated.

  • It can lead to unstable and unreliable estimates of regression coefficients.

  • It can also make it difficult to determine the individual effect of each independent variable on the dependent variable.

  • It can be detected using methods such as a correlation matrix, the variance inflation factor (VIF), and the eigenvalues of the correlation matrix.

  • Example: In a regression model predicting house prices, ...
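
As a small illustration of the VIF check, the sketch below handles the special case of exactly two predictors, where the R² from regressing one predictor on the other equals their squared Pearson correlation, so VIF = 1 / (1 - r²). The feature names and values are hypothetical, and the threshold of 10 mentioned in the comment is a common rule of thumb rather than a fixed standard.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


# Hypothetical house features.
area_sqft = [800, 1000, 1200, 1500, 1800, 2200]
rooms = [2, 3, 3, 4, 5, 6]
age_years = [30, 5, 18, 25, 2, 12]

vif_area_rooms = vif_two_predictors(area_sqft, rooms)
vif_area_age = vif_two_predictors(area_sqft, age_years)
# A VIF above roughly 10 is often read as a sign of strong multicollinearity.
print(round(vif_area_rooms, 1), round(vif_area_age, 2))
```

Area and room count move together almost perfectly, so their VIF is large, while area and age are only weakly related and give a VIF close to 1. With more than two predictors, each VIF needs a full regression of that predictor on all the others.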


Q9. What are entropy and information gain?

Ans.

Entropy is a measure of randomness or uncertainty in a dataset, while information gain is the reduction in entropy after splitting a dataset based on a feature.

  • Entropy is used in decision tree algorithms to determine the best feature to split on.

  • Information gain measures the effectiveness of a feature in classifying the data.

  • Higher information gain indicates that a feature is more useful for splitting the data.

  • Entropy is calculated using the formula: -p1*log2(p1) - p2*log2(p2...
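
Both quantities can be computed with a few lines of standard-library Python. The sketch generalizes the two-class formula to any number of classes; the function names and the label lists are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - weighted


# Hypothetical node with a 50/50 class mix and two candidate splits.
labels = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"]]

print(entropy(labels))
print(information_gain(labels, perfect_split))
print(information_gain(labels, useless_split))
```

A split that separates the classes completely recovers the full 1 bit of parent entropy, while a split that leaves each child with the same 50/50 mix gains nothing.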


Q10. What is hypothesis testing?

Ans.

Hypothesis testing is a statistical method used to make inferences about a population based on sample data.

  • Hypothesis testing involves formulating a null hypothesis and an alternative hypothesis.

  • The null hypothesis is assumed to be true until there is enough evidence to reject it.

  • Statistical tests quantify how likely data at least as extreme as the observed sample would be if the null hypothesis were true.

  • The p-value is that probability; a p-value below the chosen significance level (commonly 0.05) leads to rejecting the null hypothesis.

  • Common hypothesis tests in...
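
The workflow above can be illustrated with one concrete test. The sketch below runs a two-sided one-sample z-test, chosen only because it needs nothing beyond the standard library; it assumes the population standard deviation is known, and the sample, `mu0`, and `sigma` values are hypothetical.

```python
import math
from statistics import NormalDist, mean


def one_sample_z_test(sample, mu0, sigma):
    """Two-sided z-test of H0: population mean == mu0, with known sigma.

    Returns (z statistic, p-value).
    """
    z = (mean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
    # Probability of a |z| at least this large under the null hypothesis.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical measurements; H0 says the true mean is 50.
sample = [52, 55, 49, 58, 54, 53, 57, 56, 51]
z, p_value = one_sample_z_test(sample, mu0=50, sigma=4)

alpha = 0.05  # significance level chosen before looking at the data
reject_null = p_value < alpha
print(round(z, 3), round(p_value, 4), reject_null)
```

With this sample the p-value falls below 0.05, so the null hypothesis would be rejected at that level. When sigma is unknown, a t-test is the usual substitute.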


Q11. What is data imbalance?

Ans.

Data imbalance refers to an unequal distribution of classes in a dataset, where one class has significantly more samples than the others.

  • Data imbalance can lead to biased models that favor the majority class.

  • It can result in poor performance for minority classes, as the model may struggle to accurately predict them.

  • Techniques like oversampling, undersampling, and using different evaluation metrics can help address data imbalance.

  • For example, in a fraud detection dataset, the majorit...
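
Of the techniques listed above, random oversampling is the simplest to sketch: minority rows are duplicated at random until every class matches the majority count. The function name and the toy dataset are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes
    have as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical dataset: 8 legitimate transactions, 2 fraudulent ones.
rows = [[i] for i in range(10)]
labels = ["legit"] * 8 + ["fraud"] * 2
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(labels), Counter(balanced_labels))
```

Oversampling should be applied to the training split only; duplicating rows before the train/test split leaks copies of the same sample into the evaluation data.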


Q12. What is data science?

Ans.

Data science is the field of extracting insights and knowledge from data using various techniques and tools.

  • Data science involves collecting, cleaning, and analyzing data to extract insights.

  • It uses various techniques such as machine learning, statistical modeling, and data visualization.

  • Data science is used in various fields such as finance, healthcare, and marketing.

  • Examples of data science applications include fraud detection, personalized medicine, and recommendation syst...


Q13. Explain the XGBoost algorithm

Ans.

XGBoost is a gradient-boosted decision tree algorithm known for its speed and performance on large datasets.

  • XGBoost stands for eXtreme Gradient Boosting; it is an optimized implementation of gradient boosting machines.

  • It is widely used in machine learning competitions.

  • XGBoost uses boosting, where weak learners are added one after another, each correcting the errors of the ensemble so far, to create a strong learner.

  • It builds a series of decision trees to predict the target v...
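
The boosting idea can be sketched with plain gradient boosting on one feature, using depth-1 trees (stumps) and squared-error loss. This is not XGBoost itself: it omits the regularization, second-order gain, and system optimizations that distinguish the library, and all names and data below are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-threshold regression stump under squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x < thr]
        right = [r for x, r in zip(xs, residuals) if x >= thr]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    return best[1:]  # (threshold, left value, right value)


def predict_stump(stump, x):
    thr, lmean, rmean = stump
    return lmean if x < thr else rmean


def fit_boosting(xs, ys, n_rounds=20, learning_rate=0.3):
    """Each round fits a stump to the current residuals (the negative
    gradient of squared error) and adds a damped copy to the ensemble."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical one-feature regression data with three plateaus.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
model = fit_boosting(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict(model, x) for x in xs])
print(round(baseline_mse, 4), round(boosted_mse, 4))
```

Each stump on its own is a weak learner, but the training error of the summed ensemble falls well below that of the constant baseline.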


Q14. Correlation vs covariance

Ans.

Covariance measures how two variables vary together in their original units, while correlation is a standardized measure of the strength and direction of their linear relationship.

  • Covariance can be positive, negative, or zero, indicating the direction of the relationship between variables.

  • Correlation is always between -1 and 1, with 1 indicating a perfect positive linear relationship, -1 indicating a perfect negative linear relationship, and 0 indicating no linear relationship.

  • Covariance is affected by the scale of the variables, while corre...
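
The scale dependence is easy to check numerically. In the sketch below the same heights are expressed in metres and in centimetres; the data and function names are hypothetical, and the sample (n - 1) form of covariance is assumed.

```python
import math


def covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


# Hypothetical measurements.
height_m = [1.60, 1.70, 1.75, 1.80, 1.90]
weight_kg = [55.0, 68.0, 70.0, 80.0, 92.0]
height_cm = [h * 100 for h in height_m]  # same data, different unit

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)
corr_m = correlation(height_m, weight_kg)
corr_cm = correlation(height_cm, weight_kg)
print(round(cov_m, 4), round(cov_cm, 2), round(corr_m, 4), round(corr_cm, 4))
```

Changing the unit multiplies the covariance by 100 but leaves the correlation unchanged, which is why correlation is preferred when comparing relationships across variables with different scales.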


Interview Process at Infosys Data Scientist

based on 16 interviews
3 Interview rounds
Technical Round
HR Round
Group Discussion Round