Machine Learning interviews often range from conceptual questions to hands-on coding implementations. Whether you’re just starting out or already working in data science, a strong command of ML fundamentals, algorithms, and practical applications is key to cracking your next interview.
In this blog post, I’ll walk you through 18 essential ML questions — divided into easy, medium, hard, and practical code-based categories, complete with code snippets, visual intuition, and real-world applications.
Easy Level Questions
1. What is the Bias-Variance Tradeoff in Data Science?
In supervised learning, the bias-variance tradeoff is the tension between two types of errors:
- Bias: Assumptions made by the model to simplify the learning process. High bias can cause underfitting.
- Variance: Sensitivity to fluctuations in training data. High variance can lead to overfitting.
A model with low bias and low variance is ideal, but in reality, improving one often worsens the other.
Example:
- High Bias: Linear regression on non-linear data
- High Variance: Deep decision tree on small dataset
Visualization:
Model Complexity | Bias | Variance | Total Error |
---|---|---|---|
Low | High | Low | High |
Medium | Medium | Medium | Low |
High | Low | High | High |
Takeaway: Choose algorithms and hyperparameters that balance both.
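To make this concrete, here is a small sketch (the synthetic data and model choices are illustrative assumptions, not a prescribed setup): a linear model underfits a non-linear signal, while an unconstrained decision tree memorizes the training set.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # non-linear signal plus noise
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("High bias (linear)", LinearRegression()),
                    ("High variance (deep tree)", DecisionTreeRegressor(random_state=0))]:
    model.fit(X_train, y_train)
    print(name,
          "| train MSE:", round(mean_squared_error(y_train, model.predict(X_train)), 3),
          "| test MSE:", round(mean_squared_error(y_test, model.predict(X_test)), 3))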
2. Difference Between Parametric and Non-Parametric Models
Aspect | Parametric Models | Non-Parametric Models |
---|---|---|
Assumption | Predefined form (e.g., linear) | No fixed structure |
Parameters | Fixed number (e.g., weights) | Grows with data |
Flexibility | Less flexible | Highly flexible |
Interpretability | Easy to interpret | Hard to interpret |
Examples | Linear Regression, Logistic Reg. | KNN, Decision Trees, SVM |
Use Case:
- Use parametric when data is small, noise-free, and assumptions hold.
- Use non-parametric when data is large or complex.
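A quick illustrative sketch (synthetic data; the specific models and numbers are assumptions for demonstration): a parametric model compresses the data into a fixed set of coefficients, while a non-parametric model keeps the training data and predicts from it directly.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 5, size=(100, 1))
y = X.ravel() ** 2 + rng.normal(scale=1.0, size=100)  # non-linear relationship

# Parametric: learns a fixed number of parameters (one slope plus an intercept)
lin = LinearRegression().fit(X, y)
print("Linear coefficients:", lin.coef_, lin.intercept_)

# Non-parametric: stores the training data and predicts from the nearest neighbours
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
print("KNN prediction at x = 2.5:", knn.predict([[2.5]]))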
3. What is the Central Limit Theorem (CLT)?
The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size grows, regardless of the shape of the original distribution (assuming independent, identically distributed samples with finite variance).
Why it matters:
CLT justifies using normality-based statistical tests (like z-tests, t-tests) even if the data isn’t normally distributed.
Real-World Use:
- Quality control in manufacturing
- A/B testing in marketing
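A quick simulation makes this concrete (a minimal NumPy sketch; the exponential distribution, sample size, and number of samples are arbitrary choices):
import numpy as np

rng = np.random.default_rng(42)
sample_size = 50    # observations per sample
n_samples = 10_000  # number of repeated samples

# Draw samples from a clearly non-normal (exponential, skewed) distribution
# and look at the distribution of their means.
means = rng.exponential(scale=2.0, size=(n_samples, sample_size)).mean(axis=1)

# The sample means cluster around the true mean (2.0) and are approximately
# normally distributed, even though the underlying data is heavily skewed.
print("Mean of sample means:", means.mean())
print("Std of sample means:", means.std())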
4. What is Cross-Validation?
Cross-validation is a resampling technique used to evaluate machine learning models on a limited dataset.
K-Fold Cross Validation:
- Divide the dataset into k parts (folds)
- Train on k-1 folds, test on the remaining fold
- Repeat so each fold serves once as the test set
- Average the results
Why use it?
- Helps prevent overfitting
- Gives more stable estimate of model performance
Code (Sklearn):
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5)
print("Accuracy: ", scores.mean())
Medium Level Questions
5. What is a p-value?
A p-value is the probability of observing a result at least as extreme as the one actually observed, assuming the null hypothesis is true.
- p < 0.05: Statistically significant (reject the null hypothesis)
- p ≥ 0.05: Not statistically significant (fail to reject the null hypothesis)
Use Case in ML:
In feature selection (e.g., backward elimination), features with high p-values can be dropped.
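As an illustration (a minimal sketch with statsmodels; the synthetic data and the 0.05 cutoff are assumptions for demonstration), an OLS fit exposes per-feature p-values that backward elimination would inspect:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # three candidate features
y = 2 * X[:, 0] + rng.normal(size=100)   # only the first feature actually matters

X_const = sm.add_constant(X)             # statsmodels needs an explicit intercept term
model = sm.OLS(y, X_const).fit()
print(model.pvalues)  # features with p-values above 0.05 are candidates for removal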
6. What is Multicollinearity? How can it be detected?
Multicollinearity occurs when independent variables in a regression model are highly correlated, making coefficient estimates unstable.
How to detect:
- Correlation Matrix
- VIF (Variance Inflation Factor): VIF > 10 is problematic.
Code:
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
X = pd.DataFrame(...) # feature matrix
vif = pd.DataFrame()
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["Feature"] = X.columns
7. What is the K-Means Clustering Algorithm?
K-Means is an unsupervised learning algorithm that partitions data into K clusters.
Steps:
- Initialize K centroids randomly
- Assign data points to nearest centroid
- Recompute centroids
- Repeat until convergence
Application:
- Customer segmentation
- Image compression
- Anomaly detection
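Code (Sklearn): a minimal sketch (the synthetic blobs and K = 3 are illustrative assumptions):
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit K-Means with K = 3 and inspect the result
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print("Centroids:\n", kmeans.cluster_centers_)
print("First 10 cluster assignments:", labels[:10])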
8. How Does the Random Forest Algorithm Work?
Random Forest is a bagging-based ensemble method that trains multiple decision trees on different subsets of data and features.
Key Concepts:
- Each tree votes; the final prediction is majority vote (classification) or average (regression)
- Helps reduce overfitting compared to single decision trees
When to Use:
- Works well with missing values
- Robust to outliers
- Feature importance extraction
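Code (Sklearn): a short sketch (the iris dataset and hyperparameters are arbitrary choices for illustration):
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 trees, each trained on a bootstrap sample with random feature subsets
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
print("Feature importances:", rf.feature_importances_)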
9. Explain Gradient Boosting
Gradient Boosting builds trees sequentially, where each new tree learns from the errors of the previous trees by minimizing a loss function using gradient descent.
Popular Frameworks:
- XGBoost: Highly optimized, parallel
- LightGBM: Fast, low memory usage
- CatBoost: Categorical handling
When to use:
- Structured/tabular data
- Winning competitions (Kaggle, Zindi)
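Code (Sklearn): the frameworks above have their own APIs; this is a minimal sketch using scikit-learn's built-in GradientBoostingClassifier (the dataset and hyperparameters are arbitrary choices):
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each new tree is fit to the gradient of the loss with respect to the current predictions
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3, random_state=42)
gb.fit(X_train, y_train)
print("Test accuracy:", gb.score(X_test, y_test))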
10. What is Principal Component Analysis (PCA)?
PCA is a dimensionality reduction technique that transforms features into new variables (principal components) that capture the maximum variance.
Steps:
- Standardize the data
- Compute covariance matrix
- Calculate eigenvalues/vectors
- Project data onto top N components
Use Case:
- Speed up ML models
- Remove multicollinearity
- Data visualization (2D/3D)
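Code (Sklearn): a short sketch (using the iris dataset and 2 components as illustrative assumptions):
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Standardize first so every feature contributes on the same scale
X_scaled = StandardScaler().fit_transform(X)

# Project onto the top 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_2d.shape)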
Hard Level Questions
11. Difference Between Bagging and Boosting
Factor | Bagging | Boosting |
---|---|---|
Strategy | Train models in parallel | Train models sequentially |
Objective | Reduce variance | Reduce bias |
Example Models | Random Forest | AdaBoost, XGBoost, LightGBM |
Data Sampling | Bootstrapped datasets | Weighted data (focus on hard examples) |
Overfitting | Less prone | More prone (unless regularized) |
When to Use:
- Use bagging when the model is overfitting (high variance)
- Use boosting when the model is underfitting (high bias)
12. Generative vs. Discriminative Models
Type | Generative | Discriminative |
---|---|---|
Learns | Joint distribution P(X, Y) | Conditional distribution P(Y \| X) |
Usage | Can generate new data | Best for classification |
Examples | Naive Bayes, GANs, HMM | Logistic Regression, SVM, XGBoost |
13. What is the Vanishing Gradient Problem?
In deep neural networks, gradients used in backpropagation become extremely small as they move toward the input layers, making learning very slow.
Occurs mostly with:
- Sigmoid/Tanh activation
- Deep RNNs
Solutions:
- Use ReLU
- Batch Normalization
- Residual connections (ResNet)
- LSTM/GRU for RNNs
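A tiny numerical sketch (pure NumPy; the depth and inputs are arbitrary, and weight terms are ignored for simplicity) shows why sigmoids cause this: each layer multiplies the gradient by sigmoid'(z), which is at most 0.25, so the product collapses toward zero.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Chain rule through 20 sigmoid activations: multiply by sigmoid'(z) at each layer
rng = np.random.default_rng(0)
grad = 1.0
for _ in range(20):
    s = sigmoid(rng.normal())
    grad *= s * (1 - s)   # sigmoid'(z) = s * (1 - s) <= 0.25
print("Gradient factor after 20 layers:", grad)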
14. What are RNNs and Their Variants?
Recurrent Neural Networks (RNNs) are specialized for sequential data (text, time series).
They maintain a hidden state that captures previous inputs, enabling context learning.
Variants:
- LSTM: Solves vanishing gradient, long memory
- GRU: Simpler than LSTM, performs similarly
Use Cases:
- Text generation
- Language modeling
- Time series forecasting
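To show the recurrence itself, here is a minimal vanilla RNN forward pass in NumPy (the dimensions, random weights, and toy sequence are illustrative assumptions; real frameworks add biases, training, and gating):
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 3, 5, 4

W_xh = rng.normal(size=(input_dim, hidden_dim))   # input -> hidden
W_hh = rng.normal(size=(hidden_dim, hidden_dim))  # hidden -> hidden (the recurrence)

x_seq = rng.normal(size=(seq_len, input_dim))     # one toy sequence
h = np.zeros(hidden_dim)                          # initial hidden state

# The hidden state h carries information from all previous timesteps
for x_t in x_seq:
    h = np.tanh(x_t @ W_xh + h @ W_hh)
print("Final hidden state:", h)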
15. What is a Support Vector Machine (SVM)?
SVM is a supervised learning model that finds the optimal hyperplane to separate different classes with maximum margin.
Key Concepts:
- Supports linear and non-linear classification via kernels (RBF, Polynomial)
- Effective in high-dimensional space
Use Cases:
- Image recognition
- Text classification
- Bioinformatics
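Code (Sklearn): a minimal sketch (the iris dataset and RBF kernel settings are arbitrary choices):
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The RBF kernel lets the SVM learn a non-linear decision boundary
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train, y_train)
print("Test accuracy:", svm.score(X_test, y_test))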
Practical Code-Based Questions
16. Python Function for Logistic Regression using Gradient Descent
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_regression(X, y, lr=0.01, epochs=1000):
    m, n = X.shape
    weights = np.zeros(n)
    for _ in range(epochs):
        predictions = sigmoid(np.dot(X, weights))  # predicted probabilities
        error = predictions - y                    # prediction error
        gradient = np.dot(X.T, error) / m          # gradient of the log-loss
        weights -= lr * gradient                   # gradient descent step
    return weights
17. Implement Neural Network From Scratch
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_deriv(x):
    # assumes x is already a sigmoid output
    return x * (1 - x)

# XOR problem: not linearly separable, so a hidden layer is required
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

input_dim, hidden_dim, output_dim = 2, 4, 1
w0 = np.random.randn(input_dim, hidden_dim)   # input -> hidden weights
w1 = np.random.randn(hidden_dim, output_dim)  # hidden -> output weights

for epoch in range(10000):
    # Forward pass
    l1 = sigmoid(np.dot(X, w0))
    l2 = sigmoid(np.dot(l1, w1))
    # Backpropagation of the error
    l2_error = y - l2
    l2_delta = l2_error * sigmoid_deriv(l2)
    l1_error = l2_delta.dot(w1.T)
    l1_delta = l1_error * sigmoid_deriv(l1)
    # Weight updates (implicit learning rate of 1)
    w1 += l1.T.dot(l2_delta)
    w0 += X.T.dot(l1_delta)
18. Naive Bayes Classifier From Scratch
Read the full theory here: https://en.wikipedia.org/wiki/Naive_Bayes_classifier
import numpy as np

class NaiveBayes:
    def fit(self, X, y):
        # Store the per-class mean, variance, and prior
        self.classes = np.unique(y)
        self.stats = {}
        for c in self.classes:
            X_c = X[y == c]
            self.stats[c] = {
                "mean": np.mean(X_c, axis=0),
                "var": np.var(X_c, axis=0),
                "prior": len(X_c) / len(X)
            }

    def predict(self, X):
        def gaussian(x, mean, var):
            # Gaussian likelihood of each feature value given the class statistics
            return (1 / np.sqrt(2 * np.pi * var)) * np.exp(-((x - mean) ** 2) / (2 * var))

        predictions = []
        for x in X:
            probs = {}
            for c in self.classes:
                # Log-prior plus summed log-likelihoods (the "naive" independence assumption)
                prior = np.log(self.stats[c]["prior"])
                likelihood = np.sum(np.log(gaussian(x, self.stats[c]["mean"], self.stats[c]["var"])))
                probs[c] = prior + likelihood
            predictions.append(max(probs, key=probs.get))
        return predictions
Conclusion
Cracking ML interviews requires a mix of conceptual clarity and coding ability. These 18 questions span that range, from theoretical concepts like the bias-variance tradeoff to practical implementations like logistic regression and neural networks.
Tip: Don’t just memorize. Understand the intuition, practice the code, and apply these ideas in mini-projects or on real datasets.