Bias vs Variance 😆

Simple model, high bias -> lower the bias -> complex model -> overfitting -> higher variance

Complex model, high variance -> lower the variance -> simplified model -> underfitting -> higher bias

A higher-order polynomial is a more complex model.
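
A minimal sketch of the trade-off above, assuming a made-up sine target with fixed alternating "noise" and a hand-rolled least-squares fit (all names and numbers here are illustrative, not from the notes). It fits polynomials of degree 1, 3 and 5 and records the training and held-out error for each. The training error cannot go up as the degree grows; whether the held-out error goes up depends on the data.

```python
import math

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (A^T A) c = A^T y."""
    n = degree + 1
    ata = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    aty = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    aug = [row + [rhs] for row, rhs in zip(ata, aty)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution; coeffs[i] multiplies x**i.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (aug[i][n] - tail) / aug[i][i]
    return coeffs

def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial on (xs, ys)."""
    preds = [sum(c * x ** i for i, c in enumerate(coeffs)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

def target(x):
    # Hypothetical ground-truth function.
    return math.sin(math.pi * x)

# Training data: 10 points in [-1, 1] with deterministic +/-0.2 "noise".
train_x = [-1 + 2 * i / 9 for i in range(10)]
train_y = [target(x) + (0.2 if i % 2 else -0.2) for i, x in enumerate(train_x)]
# Held-out data: noise-free points between the training points.
test_x = [-0.9 + 2 * i / 10 for i in range(10)]
test_y = [target(x) for x in test_x]

errors = {}  # degree -> (training MSE, held-out MSE)
for degree in (1, 3, 5):
    coeffs = fit_polynomial(train_x, train_y, degree)
    errors[degree] = (mse(coeffs, train_x, train_y), mse(coeffs, test_x, test_y))
    print(degree, errors[degree])
```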


KNN

To determine a new input instance's class or regression value, find the K closest historical data points and predict by majority vote (classification) or by averaging their values (regression).

  • Normalize the features so that large-valued features do not dominate the distance and small-valued features are not ignored.

  • Use a KD tree (a recursive partition of the multi-dimensional space) to accelerate the search for the K nearest neighbors.

  • Weight the neighbors: the closer the neighbor, the larger its weight, which makes the prediction more sensitive to similarity.
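
A minimal sketch of nearest-neighbor classification with normalized features and distance-weighted votes. It assumes min-max scaling and inverse-distance weights, which are just one possible choice, and uses brute-force search instead of a KD tree. The data and the names `normalize` and `knn_predict` are made up for illustration.

```python
import math
from collections import defaultdict

def normalize(rows):
    """Min-max scale each feature to [0, 1]; returns the scaled rows and the scaler."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # guard against constant features

    def scale(row):
        return [(v - lo) / sp for v, lo, sp in zip(row, lows, spans)]

    return [scale(r) for r in rows], scale

def knn_predict(train_x, train_y, query, k=3):
    """Distance-weighted vote among the k nearest training points (brute force)."""
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    votes = defaultdict(float)
    for d, label in dists[:k]:
        votes[label] += 1.0 / (d + 1e-9)  # closer neighbors get larger weights
    return max(votes, key=votes.get)

# Hypothetical data: (height in cm, income) -> label.
raw_x = [[150, 30000], [155, 32000], [180, 90000], [185, 95000]]
labels = ["a", "a", "b", "b"]

train_x, scale = normalize(raw_x)
prediction = knn_predict(train_x, labels, scale([152, 31000]), k=3)
print(prediction)
```

Without the scaling step, the income column would dominate the Euclidean distance because its raw values are much larger than the heights.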

K-means 🧐

A way to put unlabeled data into K groups where data points in the same group are similar and data points in different groups are far apart.

  • Choose K -> randomly initialize the centers -> assign each point to its nearest center -> recompute each center as the mean of its cluster -> repeat until the assignments stop changing.

  • Pros:

    • Easy to understand, guaranteed to converge (though possibly to a local optimum), scalable to large data, relatively fast

  • Cons:

    • K must be set manually, sensitive to initialization, sensitive to outliers, linear boundaries only, O(n) per step for fixed K and dimension

  • To improve:

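    • Evaluate performance for different values of K and select K at the turning point (the "elbow") of the curve.

    • K-means++: select the initial centers one at a time so that each new center is likely to be far away from the previous ones (it is sampled with probability proportional to its squared distance from the nearest center already chosen).

    • Pre-process the data: normalize features and filter outliers.

    • Use a kernel to map data points into a higher-dimensional space, then apply linear boundaries there.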
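
A minimal sketch of the K-means loop with K-means++ seeding, assuming Euclidean distance and a toy data set of two well-separated blobs. The function names, the seed and the points are made up for illustration; this is not a tuned implementation.

```python
import math
import random

def kmeans_pp_init(points, k, rng):
    """K-means++ seeding: sample each new center with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

def kmeans(points, k, iterations=100, seed=0):
    """Alternate between assigning points and recomputing centers."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    assignment = None
    for _ in range(iterations):
        # Assignment step: index of the nearest center for every point.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, assignment

# Hypothetical data: two small blobs far apart.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = kmeans(points, k=2)
print(centers, assignment)
```

Which cluster gets index 0 depends on the random seeding, so only the grouping itself is meaningful, not the label numbers.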

KNN vs K-means 🤔

Both rely on a distance or similarity measure: Euclidean or Manhattan distance, their generalization to power n (Minkowski distance), cosine similarity, etc.

  • KNN is a supervised method: every historical data sample must be labeled in advance.

  • K-means is an unsupervised learning method and doesn't require labels at all.
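
A small sketch of the measures mentioned above. The parameter name `p` for the power is a naming assumption: `p=1` gives Manhattan distance and `p=2` gives Euclidean distance.

```python
import math

def minkowski(a, b, p=2):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

manhattan = minkowski((0, 0), (3, 4), p=1)
euclidean = minkowski((0, 0), (3, 4), p=2)
same_direction = cosine_similarity((1, 2), (2, 4))
orthogonal = cosine_similarity((1, 0), (0, 1))
print(manhattan, euclidean, same_direction, orthogonal)
```

Note that cosine similarity ignores vector length, so it behaves differently from the distance measures when features have very different magnitudes.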

Metrics 😤

Positive/Negative: the predicted results; True/False: whether the prediction is correct.

TP, TN: Correctly predicted positive and correctly predicted negative

FP, FN: Wrongly predicted positive and wrongly predicted negative

TP / (TP + FP) -> among all positive predictions, how many are correct: precision

TP / (TP + FN) -> among all positive ground truths, how many are predicted correctly: recall, True Positive Rate (TPR)

FP / (FP + TN) -> among all negative ground truths, how many are wrongly predicted positive: False Positive Rate (FPR)

ROC curve: TPR plotted against FPR as the decision threshold varies
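
A short sketch that computes these quantities from labels and scores, assuming binary labels (1 = positive) and a score threshold for turning scores into predictions. The labels and scores are invented for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 means positive."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def rates(y_true, scores, threshold):
    """Precision, recall (TPR) and FPR when predicting positive for score >= threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return precision, recall, fpr

# Hypothetical ground truths and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

precision, recall, fpr = rates(y_true, scores, threshold=0.5)

# ROC points: one (FPR, TPR) pair per threshold, from strict to lenient.
roc_points = [(0.0, 0.0)]
for threshold in sorted(set(scores), reverse=True):
    _, tpr, fpr_at_t = rates(y_true, scores, threshold)
    roc_points.append((fpr_at_t, tpr))
print(precision, recall, fpr, roc_points)
```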

Bayes' Theorem 🤠

P(theta | x) = P(x | theta) * P(theta) / P(x)

P(x, theta): P(x | theta) * P(theta), joint probability

P(x | theta): the probability of observing x given theta; viewed as a function of theta, this is the likelihood of theta given x.

P(theta): Prior probability

P(theta | x): Posterior probability

Naivety (as in Naive Bayes): the conditional probability P(x | theta) is calculated as the plain product of the individual probabilities of the components of x. This assumes the features are independent of each other given the class, a condition probably never met in real life.
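
A worked sketch of the formula with the naive product, assuming two classes and made-up prior and per-feature probabilities (the class names, features and numbers are purely illustrative).

```python
def naive_bayes_posterior(priors, likelihoods, observed):
    """Posterior P(theta | x) for each class theta.

    priors:      {class: P(theta)}
    likelihoods: {class: {feature: P(feature | theta)}}
    observed:    list of features present in x
    """
    joint = {}
    for cls, prior in priors.items():
        p = prior
        for feature in observed:
            p *= likelihoods[cls][feature]  # naive step: multiply per-feature terms
        joint[cls] = p  # P(x, theta) = P(x | theta) * P(theta)
    evidence = sum(joint.values())  # P(x), summed over all classes
    return {cls: p / evidence for cls, p in joint.items()}

# Hypothetical numbers.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.4},
}

posterior = naive_bayes_posterior(priors, likelihoods, ["free"])
print(posterior)
```

With these numbers the joint probabilities are 0.4 * 0.5 = 0.2 for spam and 0.6 * 0.1 = 0.06 for ham, so P(x) = 0.26 and the posterior for spam is 0.2 / 0.26.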

L1 L2 Regularization

Only the loss term: empirical risk minimization

Loss + regularization: structural risk minimization

L1 corresponds to a Laplacian prior on theta; L2 corresponds to a Gaussian prior on theta.

L2 regularization tends to shrink all weights and spread the penalty across them; L1 drives many weights to exactly zero, giving a sparse solution.
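
A one-dimensional sketch of why L1 is sparse and L2 is not, assuming the simplified loss 0.5 * (w - a)^2 where a is the unregularized weight. Under that assumption both penalties have closed-form minimizers. The function names and the values of a and lam are made up for illustration.

```python
def l1_solution(a, lam):
    """argmin over w of 0.5*(w - a)**2 + lam*abs(w): soft-thresholding."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0  # small weights are set exactly to zero

def l2_solution(a, lam):
    """argmin over w of 0.5*(w - a)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return a / (1.0 + lam)

unregularized = [3.0, 0.4, -0.2, -2.0]
l1_weights = [l1_solution(a, 0.5) for a in unregularized]
l2_weights = [l2_solution(a, 0.5) for a in unregularized]
print(l1_weights)
print(l2_weights)
```

The L1 result zeroes out the two small weights, while the L2 result scales every weight by the same factor and leaves none at exactly zero.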

Type I Error and Type II Error

Type I error: false positive

Type II error: false negative

Fourier Transform

A Fourier transform converts a signal from the time domain to the frequency domain. It is a very common way to extract features from audio signals or other time series such as sensor data.
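
A minimal sketch using a direct discrete Fourier transform (O(n^2), not the fast FFT algorithm) on a synthetic sine wave. The signal length and the frequency of 3 cycles per window are arbitrary choices for illustration.

```python
import cmath
import math

def dft(signal):
    """Direct discrete Fourier transform: one complex coefficient per frequency bin."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]

n = 32
# Synthetic time-domain signal: a sine completing 3 cycles over the window.
signal = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]

magnitudes = [abs(c) for c in dft(signal)]
# Only the first half of the bins is independent for a real-valued signal.
dominant_bin = max(range(n // 2), key=lambda k: magnitudes[k])
print(dominant_bin, magnitudes[dominant_bin])
```

The magnitudes per bin (or summaries of them, such as the dominant bin) are the kind of frequency-domain features the paragraph refers to.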


Likelihood

In statistics, the likelihood function (often simply called the likelihood) measures the goodness of fit of a statistical model to a sample of data for given values of the unknown parameters.

Generative vs Discriminative

A generative model learns the distribution of the data (the joint P(x, y)), while a discriminative model learns the boundary between different categories of data (P(y | x) directly). The discriminative model often performs better on classification tasks.

