- Classification and regression
- Uses hinge loss
- Evaluated using maximum margin
- SVM with no kernel is similar to logistic regression
- Math: convex optimization and Hilbert space theory
- Hard margin -> kernel trick -> soft margin
- Similar to a linear model, but you apply a kernel function to get non-linear decision boundaries
- Non-parametric (grows with sample space)
- Hyperparameters: the "right" kernel, regularization penalties, the slack variable
- One SVM for each class
Pros:
- Works with non-linearly separable data when used with kernels
- SVMs don't penalize examples for which the correct decision is made with sufficient confidence. This may be good for generalization.
- SVMs have a nice dual form, giving sparse solutions when using the kernel trick (better scalability).
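To make the SVM bullets above concrete, a minimal scikit-learn sketch; the moons dataset, the RBF kernel, and the C/gamma values are illustrative assumptions, not tuned choices:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Non-linearly separable toy data
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kernel="linear" behaves much like a regularized linear classifier;
# the RBF kernel gives a non-linear decision boundary via the kernel trick.
# C is the soft-margin (slack) penalty.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))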
Bayes theorem:
The denominator is independent of C, so it is constant.
The numerator corresponds to the chain rule over the features, i.e. the joint probability.
The naive approach assumes that P(x_i | x_1, ..., x_n, C) = P(x_i | C), that is, each feature is conditionally independent of every other feature given the class. So it can be written as P(C_k | x) ∝ P(C_k) * Π_i P(x_i | C_k).
- Training: calculate the probabilities for each class and feature at training time
  - P(C_k) for all classes
  - P(x | C_k) for all values of x, for all features, for every class
    - nb_dict[class][feature] = dict(x_val, prob) (see the sketch below)
- Use it at inference time
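A rough sketch of the training/inference scheme described above for categorical features; train_nb, predict_nb, and the nb_dict layout are illustrative names, not a library API:

from collections import Counter, defaultdict

def train_nb(X, y):
    # P(Ck) for all classes
    class_prior = {c: n / len(y) for c, n in Counter(y).items()}
    # nb_dict[class][feature] -> {x_val: P(x_val | Ck)}
    nb_dict = defaultdict(dict)
    for c in class_prior:
        rows = [x for x, label in zip(X, y) if label == c]
        for f in range(len(X[0])):
            counts = Counter(row[f] for row in rows)
            nb_dict[c][f] = {v: n / len(rows) for v, n in counts.items()}
    return class_prior, nb_dict

def predict_nb(x, class_prior, nb_dict):
    # Score each class by P(Ck) * prod_i P(x_i | Ck); the denominator (evidence) is constant and ignored
    scores = {}
    for c, prior in class_prior.items():
        score = prior
        for f, v in enumerate(x):
            score *= nb_dict[c][f].get(v, 1e-9)  # crude smoothing for values unseen at training time
        scores[c] = score
    return max(scores, key=scores.get)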
Simple non-parametric supervised method used for classification and regression
- Distance metric:
  - For real-valued data, the Euclidean distance can be used.
  - For other types of data, such as categorical or binary data, the Hamming distance can be used.
- In the case of regression problems, the average of the predicted attribute may be returned.
- In the case of classification, the most prevalent class may be returned.
- Instance-based, lazy learning algorithm
import numpy as np
from collections import Counter

def find_knn(training_set, test_instance, k):
    # Euclidean distance from the test point to every training point; keep the indices of the k nearest
    return np.argsort(np.linalg.norm(training_set - test_instance, axis=1))[:k]

def predict(training_set, labels, test_instance, k):
    # training_set: (n, d) array, labels: (n,) array; majority vote (use the mean for regression)
    candidates = find_knn(training_set, test_instance, k)
    return Counter(labels[candidates]).most_common(1)[0][0]
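A quick usage check of the sketch above, continuing from it, on made-up toy arrays:

X = np.array([[0, 0], [0, 1], [5, 5], [6, 5]])
y = np.array([0, 0, 1, 1])
print(predict(X, y, np.array([5, 6]), k=3))  # -> 1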
- Work well with non-linear relations. TODO: be able to explain why
- Input: continuous or discrete features
- Output: can be used for both regression and classification tasks
- Find the best split point at each level
- Split nodes by all possible features and select the one which most increases homogeneity in the children
- Greedy: only looks at the current best split
- And keep splitting
- Pre-pruning: a stopping condition, e.g. only split a node if it contains more than 50 samples
- Knobs: prune (post-pruning) to avoid overfitting
- Gini impurity: ideal split = 0
  - Calculate the metric for each child and weight by (#samples in child / total samples) to get the metric at the split point (see the sketch after this list of criteria).
- Information gain: ideal split has entropy = 0
  - Calculate the metric for each child and weight by (#samples in child / total samples) to get the metric at the split point.
  - Entropy = 0 for a completely homogeneous/pure node, 1 for an equal (binary) split
  - Information gain = entropy(parent) - weighted entropy(children)
- Chi-square:
  - sqrt((Actual - Expected)^2 / Expected) for each child and class
  - Add all of them to calculate the chi-square for the split
- Reduction in variance (for regression):
  - Variance = sum((x - x_mean)^2) / n for each node
  - Variance of the split = weighted sum of the variances of the child nodes
  - MSE?
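A minimal sketch of the Gini/entropy criteria and the weighted split score described above; the function names and the toy split are illustrative:

import numpy as np

def gini(labels):
    # Gini impurity of one node: 0 for a pure node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy of one node: 0 for a pure node, 1 for a 50/50 binary split
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def split_score(children, criterion=gini):
    # Weight each child's impurity by its share of the samples (#samples in child / total samples)
    n = sum(len(c) for c in children)
    return sum(len(c) / n * criterion(c) for c in children)

# Toy split of class labels; information gain = entropy(parent) - split_score(children, entropy)
left, right = np.array([0, 0, 0, 1]), np.array([1, 1])
print(split_score([left, right], gini), split_score([left, right], entropy))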
- Constraints (greedy, only consider the current state) / pre-pruning:
  - Minimum samples for a node split / terminal node
  - Maximum depth of the tree and number of terminal nodes
  - Max features; rule of thumb = sqrt(total_num_feats), max around 30-40%
- Post-pruning: let the tree grow, then prune the nodes with negative information gain?
  - scikit-learn historically did not support post-pruning (newer versions add cost-complexity pruning via ccp_alpha); XGBoost does.
  - Merge/prune if error(post-merge) < error(pre-merge)
Advantages:
- Mimics human intuition; easy to interpret for non-statisticians
- Can be used to find significant features or to create new features
- Less sensitive to outliers (why?) and handles missing values (why?)
- Non-parametric, hence makes no assumptions about the distribution of the data
- Handles non-linearity better
- Captures order?
Disadvantages:
- Overfitting, but pruning and constraints can help
- Not a great fit for continuous variables: loses information when it buckets them into different categories. Why?
- Can become unstable: small variations in the training data can result in completely different trees?
- Try decision trees in scikit-learn
  - Identify significant variables
  - Understand splits in regression decision trees
Final points:
- A good implementation would take the following parameters:
  - Split strategy
  - Constraints
  - Pruning strategy
- Knobs (see the scikit-learn sketch below):
  - max_depth, max observations, etc.
  - Minimum impurity
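Putting the split strategy, constraints, and pruning knobs above together, a hedged scikit-learn sketch; the iris dataset and the parameter values are illustrative, not recommendations:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Pre-pruning constraints: depth, minimum samples per split/leaf, feature subsampling;
# ccp_alpha enables cost-complexity post-pruning in newer scikit-learn versions.
tree = DecisionTreeClassifier(
    criterion="gini",
    max_depth=4,
    min_samples_split=20,
    min_samples_leaf=10,
    max_features="sqrt",
    ccp_alpha=0.0,
    random_state=0,
)
tree.fit(X, y)
print(tree.get_depth(), tree.feature_importances_)  # importances ~ the "significant variables" idea above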
- Better accuracy and stability
- Bagging: train multiple classifiers/models on random samples (with replacement) of the training data and take the average/mean/mode of their predictions
- Bootstrap: repeated sampling with replacement
Cons of bagging:
- Can lead to highly correlated/near-identical trees, which can give false confidence in overfitted features
- Random forests: a bagging method on bootstrapped samples; take y^ as the average of y^ over all trees
- Handles highly correlated trees by also using a (random) subset of features for each tree/split, not just subsets of the data
- Can calculate the out-of-bag error instead of using a train/test set, similar to k-fold validation?
- Can be used for regression and classification
- Non-parametric (grows with sample space)
- Few hyperparameters beyond the number of trees
- One forest for all classes
Notes:
- Also does dimensionality reduction, by identifying strong features?
- Handles missing values and outliers. How?
- Can be used for unsupervised learning. How?
- Learning is faster?
Disadvantages:
- Not very good for regression, as it cannot predict values outside the training range?
- Can feel like a black box?
- Inference can be slower?
Iteratively learns by combining many weak classifiers to produce a powerful committee
-
Also uses bootsrapped sample datasets
-
Boosting: Iterative - Each tree learns from the last one
- Weights each training example by how incorrectly it was classified (or) only works on misclassified samples?
-
TODO: Take notes from https://www.youtube.com/watch?time_continue=2&v=sRktKszFmSk
-
https://github.com/dmlc/xgboost/blob/master/doc/parameter.md
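A hedged sketch of gradient boosting with XGBoost's scikit-learn wrapper; the dataset and parameter values are illustrative and untuned (see the parameter docs linked above):

import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Shallow trees act as the weak learners; learning_rate shrinks each tree's contribution;
# subsample < 1.0 gives the row-subsampled (stochastic) variant mentioned above.
model = xgb.XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1, subsample=0.8)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))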
- Market segmentation
- Grouping of news
- Social network analysis
- No labels
- Very high dimensionality
- Detect patterns
- Exploratory step: can be used for feature selection as well
- Find subgroups?
Initialize centroids
Until convergence:
    Assign points to the nearest centroid
    Recalculate the new centroids
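A hedged numpy sketch of that loop; the fixed iteration budget and the random initialization from data points are simplifying assumptions:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initialize centroids from random data points
    for _ in range(n_iter):  # "until convergence" approximated with a fixed iteration budget
        # assign each point to its nearest centroid
        labels = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        # recalculate each centroid as the mean of its assigned points (keep the old one if a cluster empties)
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j] for j in range(k)])
    return labels, centroids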
Pros:
- Does not assume an underlying distribution? (But assumes roughly spherical clusters?)
- Can work in many dimensions
Cons:
- Need to pick k; assumes the existence of underlying groupings
- Feature engineering: only accepts numeric, normalized features
Finding the right k:
- If we underfit (k too small), too many dissimilar samples might end up in the same group: over-generalization
- If we overfit (k too large), the groups are so specific that new samples might not fall into them
- An elbow plot is useful to find the sweet spot (number of clusters versus the within-group sum of squares); see the sketch below
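A minimal sketch of generating elbow-plot data with scikit-learn's KMeans; the blob dataset and the range of k are assumptions:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
X = StandardScaler().fit_transform(X)  # k-means expects numeric, normalized features

# inertia_ is the within-cluster sum of squares; plot it against k and look for the elbow
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))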
https://www.kaggle.com/arthurtok/interactive-intro-to-dimensionality-reduction
PCA: unsupervised
LDA: supervised
t-SNE: unsupervised, preserves local neighbourhood structure (topology)
- Hierarchical clustering
  - Connectivity-based
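A quick scikit-learn sketch of the three techniques above; the digits dataset and the 2-component choices are illustrative:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # 64-dimensional digit images
X_pca = PCA(n_components=2).fit_transform(X)  # unsupervised, linear
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # supervised, uses the labels
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)  # unsupervised, preserves local neighbourhoods
print(X.shape, X_pca.shape, X_lda.shape, X_tsne.shape)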
A note on statistical learning versus machine learning:
- In statistical learning you assume some distribution of your data and fit a model
- Boosting is based on weak learners (high bias, low variance). In terms of decision trees, weak learners are shallow trees, sometimes even as small as decision stumps (trees with two leaves). Boosting reduces error mainly by reducing bias (and also, to some extent, variance, by aggregating the output from many models).
- On the other hand, Random Forest uses fully grown decision trees (low bias, high variance). It tackles the error-reduction task in the opposite way: by reducing variance. The trees are made uncorrelated to maximize the decrease in variance, but the algorithm cannot reduce bias (which is slightly higher than the bias of an individual tree in the forest). Hence the need for large, unpruned trees, so that the bias is initially as low as possible.
Please note that unlike boosting (which is sequential), RF grows trees in parallel.
A weak learner is a learner that, no matter what the distribution over the training data is, will always do better than chance when it tries to label the data. Doing better than chance means we are always going to have an error rate of less than 1/2.
This means that the learner algorithm is always going to learn something, though not always completely accurately, i.e., it is weak and poor when it comes to learning the relationships between X (inputs) and Y (target).
http://fastml.com/what-is-better-gradient-boosted-trees-or-random-forest/
Logistic regression:
- Convenient probability scores for observations
- Multi-collinearity is not really an issue and can be countered with L2 regularization to an extent
- Doesn't perform well when the feature space is too large
- Doesn't handle a large number of categorical features/variables well
- Relies on transformations for non-linear features
- Relies on the entire data [not a very serious drawback, I'd say]?
Decision trees:
- Intuitive decision rules
- Can handle non-linear features
- Take variable interactions into account?
- Highly biased to the training set [random forests to the rescue]
- No ranking score as a direct result
SVM:
- Can handle a large feature space
- Can handle non-linear feature interactions
- Does not rely on the entire data?
- Not very efficient with a large number of observations
- It can sometimes be tricky to find an appropriate kernel
Choosing between them:
- Always start with logistic regression, if nothing else to use its performance as a baseline
- See if decision trees (random forests) provide a significant improvement. Even if you do not end up using the resulting model, you can use random forest results to remove noisy variables
- Go for SVM if you have a large number of features and the number of observations is not a limitation for the available resources and time
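A hedged sketch of that workflow with scikit-learn; the dataset, the specific models, and their default settings are illustrative assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
models = {
    "logistic regression (baseline)": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "svm (rbf kernel)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}
for name, model in models.items():
    # 5-fold cross-validated accuracy: treat the logistic regression number as the baseline
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))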