Machine Learning interviews often span from conceptual understanding to hands-on coding implementations. Whether you’re just starting out or already working in data science, having a strong command of ML fundamentals, algorithms, and practical applications is key to cracking your next interview.
In this blog post, I'll walk you through 18 essential ML questions, divided into easy, medium, hard, and practical code-based categories, complete with code snippets, visual intuition, and real-world applications.
🟢 Easy Level Questions
1. What is the Bias-Variance Tradeoff in Data Science?
In supervised learning, the bias-variance tradeoff is the tension between two types of errors:
- Bias: Assumptions made by the model to simplify the learning process. High bias can cause underfitting.
- Variance: Sensitivity to fluctuations in training data. High variance can lead to overfitting.
A model with low bias and low variance is ideal, but in reality, improving one often worsens the other.
Example:
- High Bias: Linear regression on non-linear data
- High Variance: Deep decision tree on small dataset
Visualization:
Model Complexity | Bias | Variance | Total Error |
---|---|---|---|
Low | High | Low | High |
Medium | Medium | Medium | Low |
High | Low | High | High |
Takeaway: Choose algorithms and hyperparameters that balance both.
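To make the two failure modes concrete, here is a minimal sketch (assuming scikit-learn and a made-up non-linear dataset) that contrasts a high-bias linear model with a high-variance deep tree:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic non-linear data (hypothetical example)
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("Linear (high bias)", LinearRegression()),
                    ("Deep tree (high variance)", DecisionTreeRegressor())]:
    model.fit(X_train, y_train)
    print(name,
          "| train MSE:", round(mean_squared_error(y_train, model.predict(X_train)), 3),
          "| test MSE:", round(mean_squared_error(y_test, model.predict(X_test)), 3))
```

The linear model shows similar, high errors on both sets (underfitting), while the unconstrained tree fits the training set almost perfectly but does worse on the test set (overfitting).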
2. Difference Between Parametric and Non-Parametric Models
Aspect | Parametric Models | Non-Parametric Models |
---|---|---|
Assumption | Predefined form (e.g., linear) | No fixed structure |
Parameters | Fixed number (e.g., weights) | Grows with data |
Flexibility | Less flexible | Highly flexible |
Interpretability | Easy to interpret | Hard to interpret |
Examples | Linear Regression, Logistic Reg. | KNN, Decision Trees, SVM |
Use Case:
- Use parametric when data is small, noise-free, and assumptions hold.
- Use non-parametric when data is large or complex.
3. What is the Central Limit Theorem (CLT)?
The Central Limit Theorem states that the sampling distribution of the sample mean of a large enough number of independent samples approaches a normal distribution, regardless of the shape of the original distribution.
Why it matters:
CLT justifies using normality-based statistical tests (like z-tests, t-tests) even if the data isn’t normally distributed.
Real-World Use:
- Quality control in manufacturing
- A/B testing in marketing
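As a quick numerical illustration (a NumPy sketch; the exponential distribution is just an arbitrary skewed choice), averaging many independent samples produces means that cluster normally around the population mean:

```python
import numpy as np

rng = np.random.default_rng(42)
# Heavily skewed source distribution
population = rng.exponential(scale=2.0, size=100_000)

# Means of many independent samples of size n
n = 50
sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)

print("mean of sample means:", sample_means.mean())            # close to the population mean (~2.0)
print("std of sample means:", sample_means.std())               # close to population std / sqrt(n)
print("population std / sqrt(n):", population.std() / np.sqrt(n))
```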
4. What is Cross-Validation?
Cross-validation is a resampling technique used to evaluate machine learning models on a limited dataset.
K-Fold Cross-Validation:
- Divide the dataset into k parts (folds)
- Train on k-1 folds, test on the remaining fold
- Repeat for each fold
- Average the results
Why use it?
- Helps prevent overfitting
- Gives more stable estimate of model performance
Code (Sklearn):
```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)  # higher iteration limit so the solver converges on iris
scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
print("Accuracy:", scores.mean())
```
🟡 Medium Level Questions
5. What is a p-value?
A p-value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true.
- p < 0.05: Statistically significant (reject the null hypothesis)
- p ≥ 0.05: Not statistically significant (fail to reject the null hypothesis)
Use Case in ML:
In feature selection (e.g., backward elimination), features with high p-values can be dropped.
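For intuition, here is a minimal sketch using SciPy's two-sample t-test on made-up A/B data; the 0.05 threshold follows the convention above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical A/B groups with a small real difference in means
group_a = rng.normal(loc=10.0, scale=2.0, size=100)
group_b = rng.normal(loc=10.8, scale=2.0, size=100)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: the means likely differ.")
else:
    print("Fail to reject the null hypothesis.")
```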
6. What is Multicollinearity? How can it be detected?
Multicollinearity occurs when independent variables in a regression model are highly correlated, making coefficient estimates unstable.
How to detect:
- Correlation Matrix
- VIF (Variance Inflation Factor): VIF > 10 is problematic.
Code:
```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd

X = pd.DataFrame(...)  # feature matrix
vif = pd.DataFrame()
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["Feature"] = X.columns
```
7. What is the K-Means Clustering Algorithm?
K-Means is an unsupervised learning algorithm that partitions data into K clusters.
Steps:
- Initialize K centroids randomly
- Assign data points to nearest centroid
- Recompute centroids
- Repeat until convergence
Application:
- Customer segmentation
- Image compression
- Anomaly detection
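A minimal scikit-learn sketch of the steps above, using synthetic blob data and K=3 as illustrative assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)  # assign each point to its nearest centroid

print("Cluster centers:\n", kmeans.cluster_centers_)
print("Inertia (within-cluster sum of squares):", kmeans.inertia_)
```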
8. How Does the Random Forest Algorithm Work?
Random Forest is a bagging-based ensemble method that trains multiple decision trees on different subsets of data and features.
Key Concepts:
- Each tree votes; the final prediction is majority vote (classification) or average (regression)
- Helps reduce overfitting compared to single decision trees
When to Use:
- Noisy data and outliers (errors of individual trees average out)
- Datasets with missing values (in implementations that handle them natively)
- When feature importance estimates are needed
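A short scikit-learn sketch (the iris dataset is just an illustrative choice) showing training, accuracy, and feature importance extraction:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap sample with a random feature subset per split
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

print("Test accuracy:", rf.score(X_test, y_test))
print("Feature importances:", rf.feature_importances_)
```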
9. Explain Gradient Boosting
Gradient Boosting builds trees sequentially, where each new tree learns from the errors of the previous trees by minimizing a loss function using gradient descent.
Popular Frameworks:
- XGBoost: Highly optimized, parallel
- LightGBM: Fast, low memory usage
- CatBoost: Categorical handling
When to use:
- Structured/tabular data
- Winning competitions (Kaggle, Zindi)
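For a dependency-free illustration, here is a sketch with scikit-learn's own GradientBoostingClassifier rather than XGBoost/LightGBM/CatBoost; the dataset and hyperparameters are arbitrary:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 200 shallow trees fits the gradient of the loss left by the trees before it
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3)
gbm.fit(X_train, y_train)
print("Test accuracy:", gbm.score(X_test, y_test))
```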
10. What is Principal Component Analysis (PCA)?
PCA is a dimensionality reduction technique that transforms features into new variables (principal components) that capture the maximum variance.
Steps:
- Standardize the data
- Compute covariance matrix
- Calculate eigenvalues/vectors
- Project data onto top N components
Use Case:
- Speed up ML models
- Remove multicollinearity
- Data visualization (2D/3D)
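A minimal scikit-learn sketch of the steps above (standardize, then project onto the top 2 components); the iris dataset is just an example:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Step 1: standardize; the covariance/eigen steps are handled internally by PCA
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_2d.shape)  # (150, 2)
```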
🔴 Hard Level Questions
11. Difference Between Bagging and Boosting
Factor | Bagging | Boosting |
---|---|---|
Strategy | Train models in parallel | Train models sequentially |
Objective | Reduce variance | Reduce bias |
Example Models | Random Forest | AdaBoost, XGBoost, LightGBM |
Data Sampling | Bootstrapped datasets | Weighted data (focus on hard examples) |
Overfitting | Less prone | More prone (unless regularized) |
When to Use:
- Use bagging when the model is overfitting (high variance)
- Use boosting when the model is underfitting (high bias)
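A small sketch contrasting the two on the same data (assuming scikit-learn; Random Forest stands in for bagging and AdaBoost for boosting):

```python
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

bagging = RandomForestClassifier(n_estimators=100, random_state=0)  # parallel trees, reduces variance
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)     # sequential learners, reduces bias

for name, model in [("Bagging (Random Forest)", bagging), ("Boosting (AdaBoost)", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, "| CV accuracy:", scores.mean().round(3))
```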
12. Generative vs. Discriminative Models
Type | Generative | Discriminative |
---|---|---|
Learns | Joint distribution P(X, Y) | Conditional distribution P(Y\|X) |
Usage | Can generate new data | Best for classification |
Examples | Naive Bayes, GANs, HMM | Logistic Regression, SVM, XGBoost |
13. What is the Vanishing Gradient Problem?
In deep neural networks, gradients used in backpropagation become extremely small as they move toward the input layers, making learning very slow.
Occurs mostly with:
- Sigmoid/Tanh activation
- Deep RNNs
Solutions:
- Use ReLU
- Batch Normalization
- Residual connections (ResNet)
- LSTM/GRU for RNNs
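To see where the shrinkage comes from, this toy NumPy sketch multiplies per-layer sigmoid derivatives (each at most 0.25) the way backpropagation chains them; it ignores the weight matrices for simplicity:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
depth = 30
grad = 1.0
for layer in range(depth):
    z = rng.normal()                               # pre-activation at this layer
    local_grad = sigmoid(z) * (1 - sigmoid(z))     # sigmoid derivative, always <= 0.25
    grad *= local_grad                             # chain rule multiplies these factors

print(f"Gradient magnitude after {depth} sigmoid layers: {grad:.2e}")
# With ReLU the local derivative is 1 for active units, so the product does not shrink this way
```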
14. What are RNNs and Their Variants?
Recurrent Neural Networks (RNNs) are specialized for sequential data (text, time series).
They maintain a hidden state that captures previous inputs, enabling context learning.
Variants:
- LSTM: Solves vanishing gradient, long memory
- GRU: Simpler than LSTM, performs similarly
Use Cases:
- Text generation
- Language modeling
- Time series forecasting
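A minimal NumPy sketch of the recurrence itself, with made-up dimensions and random, untrained weights:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 3, 5, 4

Wxh = rng.normal(size=(input_dim, hidden_dim))    # input-to-hidden weights
Whh = rng.normal(size=(hidden_dim, hidden_dim))   # hidden-to-hidden weights
h = np.zeros(hidden_dim)                          # initial hidden state

sequence = rng.normal(size=(seq_len, input_dim))  # a toy input sequence
for x_t in sequence:
    # The hidden state carries information from all previous time steps
    h = np.tanh(x_t @ Wxh + h @ Whh)

print("Final hidden state:", h.round(3))
```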
15. What is a Support Vector Machine (SVM)?
SVM is a supervised learning model that finds the optimal hyperplane to separate different classes with maximum margin.
Key Concepts:
- Supports linear and non-linear classification via kernels (RBF, Polynomial)
- Effective in high-dimensional space
Use Cases:
- Image recognition
- Text classification
- Bioinformatics
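A short scikit-learn sketch with an RBF kernel on the built-in digits dataset (chosen only to echo the image-recognition use case):

```python
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Handwritten 8x8 digit images
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF kernel captures non-linear boundaries; feature scaling matters for SVMs
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print("Test accuracy:", svm.score(X_test, y_test))
```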
👨‍💻 Practical Code-Based Questions
16. Python Function for Logistic Regression using Gradient Descent
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_regression(X, y, lr=0.01, epochs=1000):
    m, n = X.shape
    weights = np.zeros(n)  # no intercept term in this simplified version
    for _ in range(epochs):
        predictions = sigmoid(np.dot(X, weights))
        error = predictions - y
        gradient = np.dot(X.T, error) / m  # gradient of the log-loss w.r.t. the weights
        weights -= lr * gradient           # gradient descent update
    return weights
```
17. Implement Neural Network From Scratch
```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_deriv(x):
    # x is already a sigmoid output, so this is s * (1 - s)
    return x * (1 - x)

# XOR problem: not linearly separable, so a hidden layer is required
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

input_dim, hidden_dim, output_dim = 2, 4, 1
w0 = np.random.randn(input_dim, hidden_dim)
w1 = np.random.randn(hidden_dim, output_dim)

for epoch in range(10000):
    # Forward pass
    l1 = sigmoid(np.dot(X, w0))
    l2 = sigmoid(np.dot(l1, w1))
    # Backward pass (a learning rate of 1 is implicit in the updates)
    l2_error = y - l2
    l2_delta = l2_error * sigmoid_deriv(l2)
    l1_error = l2_delta.dot(w1.T)
    l1_delta = l1_error * sigmoid_deriv(l1)
    w1 += l1.T.dot(l2_delta)
    w0 += X.T.dot(l1_delta)
```
18. Naive Bayes Classifier From Scratch
Read the full theory here: https://en.wikipedia.org/wiki/Naive_Bayes_classifier
```python
import numpy as np

class NaiveBayes:
    def fit(self, X, y):
        # Per-class Gaussian statistics and prior probabilities
        self.classes = np.unique(y)
        self.stats = {}
        for c in self.classes:
            X_c = X[y == c]
            self.stats[c] = {
                "mean": np.mean(X_c, axis=0),
                "var": np.var(X_c, axis=0),
                "prior": len(X_c) / len(X),
            }

    def predict(self, X):
        def gaussian(x, mean, var):
            # Gaussian likelihood of each feature value
            return (1 / np.sqrt(2 * np.pi * var)) * np.exp(-((x - mean) ** 2) / (2 * var))

        predictions = []
        for x in X:
            probs = {}
            for c in self.classes:
                # Work in log space to avoid numerical underflow
                prior = np.log(self.stats[c]["prior"])
                likelihood = np.sum(np.log(gaussian(x, self.stats[c]["mean"], self.stats[c]["var"])))
                probs[c] = prior + likelihood
            predictions.append(max(probs, key=probs.get))
        return predictions
```
Conclusion
Cracking ML interviews requires a mix of conceptual clarity and coding ability. These 18 questions span the full range, from theoretical concepts like the bias-variance tradeoff to practical implementations like logistic regression and neural nets.
Tip: Don't just memorize; understand the intuition, practice the code, and apply these ideas in mini-projects or real datasets.
Also Read
- Top 18 Data Science GEN AI Real Interview Questions Asked in Microsoft
- Best 10 Certifications for Career Switch to Data Science
- Switch from Non-IT to IT Career in 2025 (Step-by-Step Guide)
- Top 21 Generative AI Engineer Interview Questions and Answers
- AI Agent vs Agentic AI: Top 10 Interview Q&A (Accenture Experience)