Frequently Asked Questions
Simple model, high bias -> lower the bias -> complex model -> overfitting -> higher variance
Complex model, high variance -> lower the variance -> simplified model -> underfitting -> higher bias
A higher-order polynomial function is an example of the more complex model.
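A minimal sketch (assuming numpy and a made-up sine-plus-noise dataset) of how increasing polynomial degree moves a model from underfitting to overfitting:

```python
# Sketch: train/test error of polynomial fits of increasing degree.
# The data-generating function and noise level are made-up illustrations.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 20)
x_test = np.linspace(-1, 1, 200)
true_fn = lambda x: np.sin(2 * np.pi * x)           # assumed ground-truth signal
y_train = true_fn(x_train) + rng.normal(0, 0.3, x_train.shape)
y_test = true_fn(x_test)

for degree in (1, 3, 9, 15):
    coeffs = np.polyfit(x_train, y_train, degree)   # fit a degree-d polynomial
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # Low degree: both errors high (high bias / underfitting).
    # High degree: train error near zero but test error grows (high variance / overfitting).
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```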
To determine a new instance's class/regression value, find the K closest historical data points and predict by majority vote (classification) or by averaging (regression).
Normalize the features so that large-valued features do not dominate the distance and small-valued features are not ignored.
Use a KD-tree or grids over the multi-dimensional space to accelerate the K-nearest-neighbor search.
Weight the neighbors: the closer a neighbor, the larger its weight, which makes the prediction more sensitive to similarity (see the sketch below).
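A minimal KNN sketch illustrating the points above, with feature normalization and inverse-distance weighting; the toy dataset and k value are assumptions for illustration:

```python
# Sketch: KNN classification with feature normalization and distance-weighted voting.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Normalize features so large-valued features do not dominate the distance.
    mean, std = X_train.mean(axis=0), X_train.std(axis=0) + 1e-12
    Xn = (X_train - mean) / std
    qn = (x_query - mean) / std

    # Euclidean distances to every historical sample (a KD-tree would speed this up).
    dists = np.linalg.norm(Xn - qn, axis=1)
    nearest = np.argsort(dists)[:k]

    # Distance-weighted vote: closer neighbors get larger weights.
    votes = Counter()
    for i in nearest:
        votes[y_train[i]] += 1.0 / (dists[i] + 1e-12)
    return votes.most_common(1)[0][0]

X = np.array([[1.0, 100.0], [1.2, 110.0], [3.0, 300.0], [3.1, 290.0]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([1.1, 105.0])))   # expected: 0
```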
A way to put unlabeled data into K groups where data points in the same group are similar and data points in different groups are far apart.
Choose K -> randomly initialize centers -> iteratively assign each point to its nearest center and recalculate each center from its cluster's points.
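A minimal sketch of these steps (plain Lloyd's iterations in numpy; the toy data and K are illustrative):

```python
# Sketch: plain K-means (Lloyd's algorithm) with numpy.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    for _ in range(n_iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(X, k=2)
print(centers)
```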
Pros:
Easy to understand, guaranteed to converge, scalable to large data, relatively fast
Cons:
K must be set manually, sensitive to initialization, sensitive to outliers, only linear boundaries, O(n) per iteration
To improve:
Evaluate performance with different K and select K at the turning point (the elbow).
K-means++: sequentially select initial centers so that each new center tends to be far away from the previous ones (see the sketch after this list).
Pre-process the data to normalize features and filter outliers.
Use a kernel to map data points into a higher-dimensional space, then apply linear boundaries there.
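A minimal sketch of the K-means++ seeding idea mentioned above (numpy-based; not a specific library API):

```python
# Sketch: K-means++ seeding -- each new center is drawn with probability
# proportional to its squared distance from the nearest already-chosen center.
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]              # first center: uniform at random
    for _ in range(k - 1):
        d2 = np.min(np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :],
                                   axis=2) ** 2, axis=1)
        probs = d2 / d2.sum()                        # far-away points are more likely
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```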
Both rely on distance measures: Euclidean/Manhattan distance (Minkowski distance with power n), cosine similarity, etc.
KNN requires a label for every historical data sample in advance.
K-means is an unsupervised learning method and doesn't require labels at all.
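A small sketch of the distance/similarity measures mentioned above (numpy-based, with illustrative vectors):

```python
# Sketch of common distance/similarity measures used by KNN and K-means.
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)                    # Minkowski distance with power 2

def manhattan(a, b):
    return np.abs(a - b).sum()                      # Minkowski distance with power 1

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

a, b = np.array([1.0, 2.0]), np.array([2.0, 4.0])
print(euclidean(a, b), manhattan(a, b), cosine_similarity(a, b))
```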
Positive/Negative: the predicted results; True/False: whether the prediction is correct.
TP, TN: Correctly predicted positive and correctly predicted negative
FP, FN: Wrongly predicted positive and wrongly predicted negative
TP / (TP + FP) -> among all positive predictions, how many are correct: precision
TP / (TP + FN) -> among all positive ground truths, how many are correctly predicted: recall, true positive rate
FP / (FP + TN) -> among all negative ground truths, how many are wrongly predicted positive: false positive rate
ROC curve: TPR plotted against FPR as the decision threshold varies.
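A minimal sketch computing these rates from binary predictions and sweeping a threshold to trace the ROC curve; the labels and scores are made up:

```python
# Sketch: confusion-matrix counts, precision, recall (TPR), and FPR from binary labels.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.4, 0.7, 0.3, 0.6, 0.2, 0.8, 0.1])   # predicted probabilities

def rates(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn)        # true positive rate
    fpr = fp / (fp + tn)           # false positive rate
    return precision, recall, fpr

# Sweeping the threshold traces out the ROC curve (TPR vs. FPR).
for thr in (0.25, 0.5, 0.75):
    p, tpr, fpr = rates(y_true, (y_score >= thr).astype(int))
    print(f"thr={thr:.2f}  precision={p:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```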
P(theta | x) = P(x | theta) * P(theta) / P(x)
P(x, theta): P(x | theta) * P(theta), joint probability
P(x | theta): the probability of observing x given theta, i.e., the likelihood of theta given the observation x.
P(theta): Prior probability
P(theta | x): Posterior probability
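A small worked example of Bayes' rule with made-up numbers (a diagnostic test with an assumed 1% base rate):

```python
# Sketch: Bayes' rule with illustrative numbers -- a test with 99% sensitivity and
# a 5% false positive rate for a condition whose prior (base rate) is 1%.
p_theta = 0.01                       # P(theta): prior
p_x_given_theta = 0.99               # P(x | theta): likelihood of a positive test
p_x_given_not_theta = 0.05           # false positive rate of the test

# P(x) = P(x | theta) P(theta) + P(x | not theta) P(not theta)   (total probability)
p_x = p_x_given_theta * p_theta + p_x_given_not_theta * (1 - p_theta)

# P(theta | x) = P(x | theta) * P(theta) / P(x)   (posterior)
p_theta_given_x = p_x_given_theta * p_theta / p_x
print(round(p_theta_given_x, 3))     # ~0.167: posterior is far smaller than the likelihood
```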
Naivety: the class-conditional probability is computed as the plain product of the individual feature probabilities. This assumes the features are conditionally independent given the class, a condition rarely met in real life.
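A minimal sketch of that naive factorization, P(theta | x1..xn) proportional to P(theta) * product over i of P(x_i | theta); the classes and per-word probabilities are made-up illustrations:

```python
# Sketch: naive Bayes scoring via the product of per-feature likelihoods,
# done in log space to avoid numerical underflow.
import math

priors = {"spam": 0.4, "ham": 0.6}
# P(word present | class), assumed conditionally independent given the class.
likelihoods = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham":  {"free": 0.2, "meeting": 0.7},
}

def predict(words):
    scores = {}
    for cls, prior in priors.items():
        log_p = math.log(prior)
        for w in words:
            log_p += math.log(likelihoods[cls].get(w, 1e-6))   # naive independence
        scores[cls] = log_p
    return max(scores, key=scores.get)

print(predict(["free"]))             # "spam" under these made-up tables
```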
Only the loss term: empirical risk minimization
Loss + regularization: structural risk minimization
L1 corresponds to a Laplacian prior on theta, L2 to a Gaussian prior. https://www.bilibili.com/video/BV1aE411L7sj?p=6&spm_id_from=pageDriver
L2 regularization tends to spread weight across all parameters, while L1 pushes many parameters to exactly zero (sparse solutions).
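A small sketch contrasting empirical and structural risk for a linear model; lam and the toy data are illustrative assumptions:

```python
# Sketch: empirical risk (loss only) vs. structural risk (loss + L1/L2 penalty).
import numpy as np

def empirical_risk(w, X, y):
    return np.mean((X @ w - y) ** 2)                 # loss term only (ERM)

def structural_risk(w, X, y, lam=0.1, norm="l2"):
    penalty = np.sum(np.abs(w)) if norm == "l1" else np.sum(w ** 2)
    return empirical_risk(w, X, y) + lam * penalty   # loss + regularization (SRM)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 2.0])
print(empirical_risk(w, X, y), structural_risk(w, X, y, norm="l1"))
```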
Type I error: false positive
Type II error: false negative
A Fourier transform converts a signal from time to frequency domain—it’s a very common way to extract features from audio signals or other time series such as sensor data.
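A minimal sketch of frequency-feature extraction with numpy's FFT; the sampling rate and synthetic signal are assumptions:

```python
# Sketch: extracting frequency-domain features from a 1-D signal with the FFT.
import numpy as np

fs = 100                                          # sampling rate in Hz
t = np.arange(0, 1, 1 / fs)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

spectrum = np.fft.rfft(signal)                    # time domain -> frequency domain
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
magnitudes = np.abs(spectrum)

# The dominant frequencies (here ~5 Hz and ~20 Hz) can serve as features.
top = freqs[np.argsort(magnitudes)[-2:]]
print(sorted(top))                                # roughly [5.0, 20.0]
```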
In statistics, the likelihood function (often simply called the likelihood) measures the goodness of fit of a statistical model to a sample of data for given values of the unknown parameters.
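A small sketch of that idea: the (log-)likelihood of Gaussian parameters given a sample, with made-up data and candidate parameter values:

```python
# Sketch: log-likelihood of Gaussian parameters (mu, sigma) given observed data.
import numpy as np

data = np.array([4.8, 5.1, 5.0, 4.9, 5.2])

def gaussian_log_likelihood(mu, sigma, x):
    # log L(mu, sigma | x) = sum_i log N(x_i; mu, sigma)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2))

# Parameters that fit the sample well have higher (log-)likelihood.
print(gaussian_log_likelihood(5.0, 0.15, data))   # good fit -> larger value
print(gaussian_log_likelihood(3.0, 0.15, data))   # poor fit -> much smaller value
```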
A generative model learns the distribution of the data, while a discriminative model learns the boundary between different categories of data. Discriminative models often achieve better classification performance.