Density Estimation

What if we told you that a single statistical method could revolutionize fields from finance to healthcare? Welcome to the world of density estimation. Let's explore how this unsung hero of data science works and why it's so important.

By: Anish Kasam & Rushil Chandrupatla

What is Density Estimation?

Definition

Density estimation is a technique for estimating the probability density function from which data was drawn. There are two main types of density estimation:

Nonparametric Density Estimation: a technique where no assumptions are made about the underlying distribution of the data. Rather, the estimated density is fit to the data itself.

Parametric Density Estimation: a technique where the underlying distribution (ex: normal, exponential, etc.) that the data comes from is assumed. The parameter(s) that maximize the likelihood of observing the given data are used to estimate the density.

Importance & Applications

Understanding the underlying distribution of data yields highly useful insights. Density estimation is employed by many different machine learning methods, including Naive Bayes classifiers and Markov models.

These models have applications to countless different industries. They're used in finance for risk assessment and fraud detection; in healthcare for disease diagnosis and treatment personalization; and in agriculture for yield maximization and pest control.

Histogram Estimators

Intuition

Histogram estimators are a nonparametric density estimation technique. Data is split up into non-overlapping bins and the proportion of data points in each bin is calculated. These proportions are then normalized (scaled down) so that the resulting histogram is a valid density function.

When estimating the density at a specific value, the height of the bin containing that value is used. The formula for the height is given by:

$$f(x)=\frac{\text{points in bin containing }x}{\text{total points}\times\text{bin size}}$$
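To make the formula concrete, here is a minimal NumPy sketch of a histogram estimator; the function name, bin count, and sample data are illustrative choices, not from the original article.

```python
import numpy as np

def histogram_density(data, bins=10):
    """Return a function that evaluates a histogram density estimate.

    Bins span the observed data range, and each bin's height is
    (points in bin) / (total points * bin width), so the total area is 1.
    """
    data = np.asarray(data)
    edges = np.linspace(data.min(), data.max(), bins + 1)
    counts, _ = np.histogram(data, bins=edges)
    heights = counts / (len(data) * np.diff(edges))

    def estimate(x):
        # Find the bin containing each x; values outside the range get density 0.
        idx = np.searchsorted(edges, x, side="right") - 1
        idx = np.clip(idx, 0, bins - 1)
        inside = (x >= edges[0]) & (x <= edges[-1])
        return np.where(inside, heights[idx], 0.0)

    return estimate

# Usage: estimate the density of 1,000 samples from a standard normal.
rng = np.random.default_rng(0)
f_hat = histogram_density(rng.normal(size=1000), bins=20)
print(f_hat(np.array([0.0, 1.5, 10.0])))
```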

Histogram Simulation

The following visualization depicts how histogram estimators can be incredibly close to the true distribution as the number of data points and bins increases.


Pros & Cons of Histogram Estimators

The main advantages of histogram estimators are that they are very easy to implement and don't make any assumptions about the underlying distribution of the data.

However, histogram estimators also have a few key disadvantages. Firstly, they produce a discontinuous density estimate, which may be inaccurate. Additionally, as seen in the simulation, a good estimate requires a lot of data. Ideally, every bin will contain at least one data point. In low dimensions, this is feasible; however, once you start working with high-dimensional data, the number of bins grows exponentially with the number of dimensions.

Gaussian Kernel Density Estimators (KDE)

Intuition

Sometimes, however, a dataset does not follow any common density function. For example, in the graph below, it is clear that a single Gaussian will not be sufficient to capture the nuances of this dataset. A more flexible density estimator is needed to model this data.

For datasets such as these, a Gaussian kernel density estimator can be used to build a density function that captures the data more accurately. The main idea is to place a kernel function, in this case a normalized Gaussian, at every data point. The normalized sum of all of these Gaussians results in a valid density function that is smoother than the one derived using a histogram estimator.

The kernel function is a Gaussian with a mean of $0$ and a variance of $1$; its pdf is given by:

$$K(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}$$

The overall density function is given by:

$$f(x)=\frac{1}{nh}\sum^n_{i=1}K\!\left(\frac{x-x_i}{h}\right)$$

In this equation, the factor $\frac{1}{n}$ ensures that the sum of all $n$ Gaussians is normalized, maintaining a total area of $1$, while $h$ (the bandwidth) controls the width of each Gaussian; dividing by $h$ rescales each widened kernel so that it still integrates to $1$.
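This estimator translates directly into code. The following is a minimal NumPy sketch that evaluates the sum above; the bandwidth value and the bimodal sample are illustrative assumptions.

```python
import numpy as np

def make_gaussian_kde(data, h=0.5):
    """Return f(x) = (1 / (n * h)) * sum_i K((x - x_i) / h) for the given data.

    A minimal sketch; in practice the bandwidth h would be tuned
    (e.g., with a rule of thumb or cross-validation).
    """
    data = np.asarray(data)
    n = len(data)

    def K(u):
        # Standard normal kernel: mean 0, variance 1.
        return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

    def estimate(x):
        x = np.atleast_1d(x)
        # One kernel centered at every data point, averaged and rescaled by h.
        u = (x[:, None] - data[None, :]) / h
        return K(u).sum(axis=1) / (n * h)

    return estimate

# Usage: a smooth density estimate from a bimodal sample.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 0.5, 500)])
f_hat = make_gaussian_kde(sample, h=0.4)
print(f_hat([-2.0, 0.0, 3.0]))
```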

Pros & Cons of KDE Estimators

Just like histogram estimators, KDE estimators are easy to understand and don't make assumptions about the data. Additionally, KDE estimators address some of the key issues with histogram estimators. Unlike histogram estimators, they produce a smooth estimate of the density function.

However, KDE estimators have their own set of disadvantages. Firstly, they're computationally expensive; computing a Gaussian for each data point can be very costly, especially if the dataset is massive. Furthermore, they still need a substantial amount of data, albeit less than histogram estimators, to provide accurate estimates.

Maximum Likelihood Estimators (MLE)

Intuition

Unlike nonparametric density estimators, maximum likelihood estimators assume the data follows a specific probability distribution. These distributions have parameters which define the characteristics of the distribution. For example, the parameters in a normal distribution are the mean (center) and standard deviation (spread).

The optimal parameters are found by maximizing the likelihood function, which measures how probable the observed data is given particular values of the parameters. It is defined as the product of the probability density functions evaluated at the observed data:

$$L(\theta ;x)=\prod_{i=1}^n p_x(x_i|\theta)$$
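As a hedged illustration, the sketch below fits a normal distribution by maximum likelihood. For a Gaussian, the maximizing parameters have a closed form (the sample mean and the uncorrected sample standard deviation), and the likelihood is evaluated on the log scale to avoid numerical underflow; the helper names and sample data are assumptions for the example.

```python
import numpy as np

def normal_mle(data):
    """Maximum likelihood estimates of a normal distribution's parameters.

    For a Gaussian, maximizing the likelihood gives the sample mean and the
    uncorrected (1/n) sample standard deviation.
    """
    data = np.asarray(data)
    return data.mean(), data.std()  # std() uses 1/n, matching the MLE

def normal_log_likelihood(data, mu, sigma):
    """log L(theta; x) = sum_i log p(x_i | mu, sigma) for a normal density."""
    data = np.asarray(data)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - mu)**2 / (2 * sigma**2))

# Usage: the MLE parameters score at least as well as any other guess.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)
mu_hat, sigma_hat = normal_mle(x)
print(mu_hat, sigma_hat)
print(normal_log_likelihood(x, mu_hat, sigma_hat)
      >= normal_log_likelihood(x, 4.0, 2.5))
```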

MLE Simulation

Try experimenting with different values of $\mu$ (mean) and $\sigma$ (standard deviation) to find the parameter values that maximize the likelihood of observing the data points below.


Pros & Cons of MLE Estimators

One advantage of using MLE compared to nonparametric methods is the limited amount of data that it requires; since the distribution is already assumed, it just has to be fit. Another advantage comes from the fact that the assumed distributions are already established. For example, if a normal distribution is assumed, then all of the useful properties of a normal distribution can be applied.

The biggest disadvantage of using a maximum likelihood estimator has to do with the fact that it makes an assumption about the underlying distribution. If the data doesn't actually come from the assumed distribution, the estimated density can be wildly inaccurate, leading to misguided predictions.

Density Estimation in Practice

Classification

One common technique for classification is to fit a separate density for each class. When predicting the class of a new data point, you compute the probability that the data point belongs to each class based on these density estimates. This approach forms the basis of techniques like Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA). For example, a classification problem for a two-class, one-feature dataset could look like:

Another common technique is to assume features are conditionally independent and then fit a separate density for each feature within each class. To predict the class of a new data point, you calculate the product of the probabilities of each feature given each class, and select the class with the highest resulting value. This technique is known as the Naive Bayes classifier.
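To tie the two ideas together, here is a minimal sketch of a Gaussian Naive Bayes classifier: it fits a per-class, per-feature normal density by maximum likelihood and picks the class with the highest prior-times-density score, computed in log space. The function names and toy dataset are illustrative, not a reference implementation.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Fit per-class, per-feature Gaussian densities plus class priors."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {"prior": len(Xc) / len(X),
                     "mean": Xc.mean(axis=0),
                     "std": Xc.std(axis=0)}
    return params

def predict_gaussian_nb(params, X):
    """Choose the class with the highest log prior + sum of log feature densities."""
    classes = list(params)
    scores = []
    for c in classes:
        p = params[c]
        log_pdf = (-0.5 * np.log(2 * np.pi * p["std"]**2)
                   - (X - p["mean"])**2 / (2 * p["std"]**2))
        scores.append(np.log(p["prior"]) + log_pdf.sum(axis=1))
    return np.array(classes)[np.argmax(np.column_stack(scores), axis=1)]

# Usage on a toy two-class, two-feature dataset.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
params = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(params, np.array([[0.0, 0.0], [3.0, 3.0]])))
```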

Histogram, KDE, or MLE?

All three density estimation techniques have their own advantages and disadvantages. So which one should be used? That depends on the data and the task at hand.

In specific scenarios, each one can be the best option. For example, if you have a vast amount of data, histogram estimators could be the best choice. If you have data that is almost guaranteed to come from a known distribution, MLE might be the best. On the other hand, if you have data that comes from an unknown distribution, KDE could be appropriate.

In practice, it's important to tailor your approach to the characteristics of your data and the objectives of your analysis.

Data Quality

Regardless of whether you choose a parametric or nonparametric method of density estimation, the data being used to fit the density is incredibly important. Specifically, it's important that the data is representative of the population. Otherwise, the estimated density won't be representative of the population either.

Data is typically collected through sampling, and there are three main methodologies:

Simple Random Sampling: each member of the population has an equal chance of being selected; oftentimes a random number generator is used.

Stratified Random Sampling: the population is divided into non-overlapping groups called strata. Within each stratum, members are sampled using simple random sampling.

Clustered Random Sampling: the population is divided into non-overlapping groups called clusters. Clusters are then sampled at random, and all members of selected clusters are part of the final sample.

Stratified sampling often ensures a more representative sample since subgroups are guaranteed to be sampled from. Clustered sampling, on the other hand, sacrifices representativeness for cost effectiveness, since sampling entire clusters at once can be less expensive.
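The sketch below illustrates all three schemes on a toy population of 1,000 member IDs, with ten artificial groups serving as both strata and clusters; the sample sizes and group structure are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
population = np.arange(1000)     # member IDs
groups = population % 10         # 10 groups, used as both strata and clusters

# Simple random sampling: every member has an equal chance of selection.
simple = rng.choice(population, size=100, replace=False)

# Stratified random sampling: a simple random sample within each stratum.
stratified = np.concatenate([
    rng.choice(population[groups == g], size=10, replace=False)
    for g in np.unique(groups)
])

# Clustered random sampling: pick whole clusters at random, keep every member.
chosen_clusters = rng.choice(np.unique(groups), size=2, replace=False)
clustered = population[np.isin(groups, chosen_clusters)]

print(len(simple), len(stratified), len(clustered))
```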

Sampling Visualized

The following visualization depicts how the population is partitioned when different sampling techniques are employed.

Conclusion

All in all, density estimation is an incredibly powerful technique that can be adapted to a wide variety of use cases. It's the foundation of many statistical and machine learning methods. By selecting and applying the appropriate technique, be it histogram estimators, KDEs, or MLEs, you can uncover valuable insights, improve predictive models, and make informed decisions based on your data.

Sources

The explanations and visualizations for histogram estimators and maximum likelihood estimation were adapted from Professor Justin Eldridge's DSC 140A - Probabilistic Modeling & Machine Learning.

The explanations for kernel density estimators were adapted from Jaroslaw Drapala's article, Kernel Density Estimator Explained.

Appendix

It's impossible to truly understand density estimation without a strong foundation in probability and statistics. The explanations below should serve as a refresher on those fundamentals.

Probability Spaces & Random Variables

Probability Spaces

A probability space can be expressed as a combination of a sample space $\Omega$, a collection of events $\mathcal{F}$, and a probability measure $\mathbb{P}$.

A sample space is the collection of all possible outcomes; sample spaces can be either discrete or continuous.

Discrete Sample Space: a discrete sample space consists of a finite or countably infinite set of distinct outcomes.

Continuous Sample Space: a continuous sample space consists of an uncountably infinite set of outcomes, typically corresponding to all possible values within a certain interval.

An event is a subset of the possible outcomes, and $\mathcal{F}$ is the collection of events under consideration. Finally, a probability measure is a function $\mathcal{F}\rightarrow[0,1]$ that satisfies the following axioms:

  1. $0 \leq \mathbb{P}(A) \leq 1$ where $A$ is an event.
  2. $\mathbb{P}(\varnothing)=0,\ \mathbb{P}(\Omega)=1$ ($\varnothing$ denotes the empty set).
  3. If $A_1,\dots,A_n$ are disjoint events, then $\mathbb{P}(A_1\cup\dots\cup A_n)=\sum^n_{i=1}\mathbb{P}(A_i)$.

Note: two events are disjoint if their intersection is the empty set.

Inclusion-Exclusion Principle: for any two events $A$ and $B$ in a probability space, $\mathbb{P}(A\cup B)=\mathbb{P}(A)+\mathbb{P}(B)-\mathbb{P}(A\cap B)$.
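As a quick worked example (not from the original text), roll a fair six-sided die and let $A$ be "the roll is even" and $B$ be "the roll is at most 3". Then $\mathbb{P}(A)=\frac{3}{6}$, $\mathbb{P}(B)=\frac{3}{6}$, and $A\cap B=\{2\}$, so

$$\mathbb{P}(A\cup B)=\frac{3}{6}+\frac{3}{6}-\frac{1}{6}=\frac{5}{6}$$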

Random Variables

A random variable is a function $X: \Omega\rightarrow\R$, where $\Omega$ is the sample space of some probability space. A random variable can be either discrete (having a specific set of possible values) or continuous (having any value in a specific range).

Conditional Probability & Independence

Conditional Probability

Given a probability space $(\Omega,\mathcal{F},\mathbb{P})$ and a positive-probability event $B\in\mathcal{F}$, we can define a new probability measure $\mathbb{P}(-|B)$ on $(\Omega,\mathcal{F})$ defined by:

$$\mathbb{P}(A|B)=\frac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)}$$

Multiplication Rule: $\mathbb{P}(A\cap B)=\mathbb{P}(A|B)\mathbb{P}(B)=\mathbb{P}(B|A)\mathbb{P}(A)$.

Law of Total Probability: if $B_1,B_2,\dots \in \mathcal{F}$ is a finite or infinite sequence of events partitioning $\Omega$, then for any event $A\in\mathcal{F}$: $\mathbb{P}(A)=\sum_i\mathbb{P}(A\cap B_i)=\sum_i\mathbb{P}(A|B_i)\mathbb{P}(B_i)$.

Bayes' Theorem

$$\mathbb{P}(A|B)=\frac{\mathbb{P}(B|A)\mathbb{P}(A)}{\mathbb{P}(B)}$$
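As a hedged worked example (the numbers are illustrative, not from the original text): suppose a condition affects $1\%$ of a population, a test detects it with probability $0.95$, and it falsely flags healthy individuals with probability $0.05$. By Bayes' theorem, with the denominator expanded via the law of total probability,

$$\mathbb{P}(\text{condition}|\text{positive})=\frac{0.95\times 0.01}{0.95\times 0.01+0.05\times 0.99}\approx 0.16$$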

Independence

Two events $A$ and $B$ in a probability space are independent if any of the three equivalent conditions hold:

  1. $\mathbb{P}(A|B)=\mathbb{P}(A)$.
  2. $\mathbb{P}(B|A)=\mathbb{P}(B)$.
  3. $\mathbb{P}(A\cap B)=\mathbb{P}(A)\mathbb{P}(B)$.

Mutual Independence: a sequence of events $A_1,A_2,\dots$ in a probability space is mutually independent if any collection of events $A_{i_1},A_{i_2},\dots,A_{i_n}$ chosen from $A_1,A_2,\dots$ satisfies $\mathbb{P}(A_{i_1}\cap A_{i_2}\cap\dots\cap A_{i_n})=\mathbb{P}(A_{i_1})\mathbb{P}(A_{i_2})\cdots\mathbb{P}(A_{i_n})$.

Probability Distribution Functions

Cumulative Distribution Functions (CDFs)

Let $X$ be a random variable. The cumulative distribution function (cdf) of $X$ is the function $F_x: \R\rightarrow\R$ defined by $F_x(t)=\mathbb{P}(X\leq t)=\mathbb{P}(X\in(-\infty ,t])$. The cdf gives us the probability that $X$ is less than or equal to some value.

A function $F: \R\rightarrow\R$ is the cdf of a random variable if and only if:

  1. $\lim_{t\to-\infty}F(t)=0$ and $\lim_{t\to\infty}F(t)=1$
  2. $F$ is nondecreasing: if $s \leq t$ then $F(s) \leq F(t)$
  3. $F$ is right continuous: $\lim_{t\to a^+}F(t)=F(a)$

Probability Mass Functions (PMFs)

Let $X$ be a discrete random variable. The probability mass function (pmf) of $X$ is defined as $p_x(a)=\mathbb{P}(X=a)$. The pmf gives us the probability that $X$ is exactly equal to some value.

A function $p: \R\to\R$ is the pmf of a discrete random variable if and only if there exists a finite or infinite sequence of numbers $a_1,a_2,\dots\in\R$ such that:

  1. $0 \leq p(a_i) \leq 1$ for all $a_i$
  2. $\sum_i p(a_i)=1$
  3. $p(a)=0$ if $a\notin\{a_1,a_2,\dots\}$

Probability Density Functions (PDFs)

Let $X$ be a continuous random variable. The probability density function (pdf) of $X$ is written $f_x(t)$. The pdf gives us the relative likelihood (not a probability) that $X$ is exactly equal to some value.

To find the probability that XX falls in a certain interval, we can take the integral of the pdf over that interval:

$$\mathbb{P}(X\in[a,b])=\int^b_a f_x(t)\,dt$$

A function $f: \R\to\R$ is the pdf of a continuous random variable if and only if:

  1. $f$ is integrable
  2. $f(t)\geq 0$ for all $t$
  3. $\int^{\infty}_{-\infty} f(t)\,dt=1$

Converting PMFs/PDFs to CDFs

If $X$ is discrete:

$$F_x(t)=\mathbb{P}(X\leq t)=\sum_{a\leq t}p_x(a)$$

If $X$ is continuous:

$$F_x(t)=\mathbb{P}(X\leq t)=\int^{t}_{-\infty}f_x(s)\,ds$$
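For example (a standard calculation, not from the original text), if $X\sim\text{Unif}[a,b]$ and $t\in[a,b]$, then

$$F_x(t)=\int^{t}_{-\infty}f_x(s)\,ds=\int^{t}_{a}\frac{1}{b-a}\,ds=\frac{t-a}{b-a}$$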

Converting CDFs to PMFs/PDFs

Let $X$ be a random variable and $F_x$ its cdf.

If $F_x$ is piecewise constant, then $X$ is discrete and its pmf is found by calculating the magnitude of the jumps at the points of discontinuity.

If $F_x$ is continuous and piecewise continuously differentiable, then $X$ is continuous and its pdf is given by $f_x(t)=F_x'(t)$.

Discrete Random Variables

Bernoulli Random Variables

A Bernoulli random variable models a single experiment with two possible outcomes: success and failure. It has a success probability of $p$.

Let $X$ be a Bernoulli random variable. Then, we write $X\sim \text{Ber}(p)$. Its pmf is given by:

$$\mathbb{P}(X=1)=p,\qquad \mathbb{P}(X=0)=1-p$$

Binomial Random Variables

A Binomial random variable represents the number of successes in $n$ independent Bernoulli trials, each with a success probability $p$.

Let $X$ be a Binomial random variable. Then, we write $X\sim \text{Bin}(n,p)$. Its pmf for $0 \leq k \leq n$ is given by:

$$\mathbb{P}(X=k)={n\choose k}p^k(1-p)^{n-k}$$

Geometric Random Variables

A Geometric random variable represents the number of Bernoulli trials (with success probability $p$) needed to get the first success.

Let $X$ be a Geometric random variable. Then, we write $X\sim \text{Geom}(p)$. Its pmf for $1 \leq k$ is given by:

$$\mathbb{P}(X=k)=p(1-p)^{k-1}$$

Poisson Random Variables

A Poisson random variable counts the number of times an event happens in a fixed period of time. It is defined by the average rate of occurrence, $\lambda$.

Let $X$ be a Poisson random variable. Then, we write $X\sim \text{Poisson}(\lambda)$. Its pmf for $0 \leq k$ is given by:

$$\mathbb{P}(X=k)=e^{-\lambda}\frac{\lambda^k}{k!}$$

Continuous Distributions

Uniform Distributions

The uniform distribution is a probability distribution in which all outcomes are equally likely within a specified range, $[a,b]$.

Let $X$ be a uniform random variable. Then, we write $X\sim \text{Unif}[a,b]$. Its pdf is given by:

$$f_x(t)=\begin{cases} \frac{1}{b-a}, & t\in[a,b] \\ 0, & t\notin[a,b] \end{cases}$$

Normal (Gaussian) Distributions

The normal distribution is a bell-shaped probability distribution defined by its mean, $\mu$ (the center), and its standard deviation, $\sigma$ (the spread).

Let $X$ be a normal random variable. Then, we write $X\sim \mathcal{N}(\mu,\sigma^2)$. Its pdf is given by:

$$f_x(t)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2}\left(\frac{t-\mu}{\sigma}\right)^2}$$

Expectation & Variance

Expectation of Discrete Random Variables

Suppose $X$ is a discrete random variable with pmf $p_x$. Its expectation (also known as mean/average) is given by:

$$\mathbb{E}[X]=\sum_{a\in\R}a\cdot \mathbb{P}(X=a)$$

If $W\sim \text{Ber}(p)$, then $\mathbb{E}[W]=p$.

If $X\sim \text{Bin}(n,p)$, then $\mathbb{E}[X]=np$.

If $Y\sim \text{Geom}(p)$, then $\mathbb{E}[Y]=\frac{1}{p}$.

If $Z\sim \text{Poisson}(\lambda)$, then $\mathbb{E}[Z]=\lambda$.

Theorem: expectation is linear; if $X$ and $Y$ are random variables defined on the same probability space and $c\in\R$, then $\mathbb{E}[X+Y]=\mathbb{E}[X]+\mathbb{E}[Y]$ and $\mathbb{E}[cX]=c\mathbb{E}[X]$.

Theorem: if $X$ is a discrete random variable and $g$ is a function, then $\mathbb{E}[g(X)]=\sum_{a\in\R}g(a)\cdot\mathbb{P}(X=a)$.
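As a quick worked example (not from the original text), if $X$ is the outcome of a roll of a fair six-sided die, then

$$\mathbb{E}[X]=\sum_{a=1}^{6}a\cdot\frac{1}{6}=\frac{1+2+3+4+5+6}{6}=3.5$$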

Expectation of Continuous Random Variables

Suppose $X$ is a continuous random variable with pdf $f_x$. Its expectation is given by:

$$\mathbb{E}[X]=\int^{\infty}_{-\infty}t\cdot f_x(t)\,dt$$

If $U\sim \text{Unif}[a,b]$, then $\mathbb{E}[U]=\frac{a+b}{2}$.

If $X\sim \mathcal{N}(\mu,\sigma^2)$, then $\mathbb{E}[X]=\mu$.

Theorem: if $X$ is a continuous random variable and $g$ is a function, then $\mathbb{E}[g(X)]=\int^{\infty}_{-\infty}g(t)f_x(t)\,dt$.

Variance

Let $X$ be a random variable. The variance of $X$ is:

$$\text{Var}(X)=\mathbb{E}[(X-\mathbb{E}[X])^2]$$

Another formula for variance that is often easier to calculate is:

$$\text{Var}(X)=\mathbb{E}[X^2]-(\mathbb{E}[X])^2$$

Note: the standard deviation of $X$ is $\sqrt{\text{Var}(X)}$.
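As a quick worked example (not from the original text), if $X\sim\text{Ber}(p)$, then $X^2=X$, so $\mathbb{E}[X^2]=\mathbb{E}[X]=p$ and

$$\text{Var}(X)=\mathbb{E}[X^2]-(\mathbb{E}[X])^2=p-p^2=p(1-p)$$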

Important Theorems

Central Limit Theorem

Suppose $X_1,X_2,\dots$ is an infinite sequence of independent and identically distributed random variables with mean $\mu$ and variance $\sigma^2$. Let $S_n=X_1+X_2+\dots+X_n$. Then:

$$\lim_{n\to\infty}\frac{S_n-n\mu}{\sigma\sqrt{n}}\sim\mathcal{N}(0,1)$$
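A short Monte Carlo sketch (illustrative, not from the original text) makes the theorem tangible: standardized sums of i.i.d. $\text{Unif}[0,1]$ variables behave like draws from $\mathcal{N}(0,1)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 10_000
mu, sigma = 0.5, np.sqrt(1 / 12)   # mean and standard deviation of Unif[0, 1]

samples = rng.uniform(0, 1, size=(trials, n))
S_n = samples.sum(axis=1)
Z = (S_n - n * mu) / (sigma * np.sqrt(n))  # standardized sums

# Z should have mean ~0, standard deviation ~1,
# and roughly 95% of the values should fall within +/- 1.96.
print(Z.mean(), Z.std(), np.mean(np.abs(Z) < 1.96))
```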

Law of Large Numbers

Suppose $X_1,X_2,\dots$ is an infinite sequence of independent and identically distributed random variables with mean $\mu$ and finite variance. Then:

$$\mathbb{P}\left(\lim_{n\to\infty}\overline{X}_n=\mu\right)=1$$

where $\overline{X}_n=\frac{1}{n}\sum_{i=1}^n X_i$ denotes the sample mean.