Density Estimation

What if we told you that a single statistical method could revolutionize fields from finance to healthcare? Welcome to the world of density estimation. Let's explore how this unsung hero of data science works and why it's so important.

By: Anish Kasam & Rushil Chandrupatla

What is Density Estimation?

Definition

Density estimation is a technique for estimating the probability density function from which data was drawn. There are two main types of density estimation:

Nonparametric Density Estimation: a technique where no assumptions are made about the underlying distribution of the data. Rather, the estimated density is fit to the data itself.

Parametric Density Estimation: a technique where the underlying distribution (ex: normal, exponential, etc.) that the data comes from is assumed. The parameter(s) that maximize the likelihood of observing the given data are used to estimate the density.

Importance & Applications

Understanding the underlying distribution of data yields highly useful insights. Density estimation is employed by many different machine learning methods, including Naive Bayes classifiers and Markov models.

These models have applications to countless different industries. They're used in finance for risk assessment and fraud detection; in healthcare for disease diagnosis and treatment personalization; and in agriculture for yield maximization and pest control.

Histogram Estimators

Intuition

Histogram estimators are a nonparametric density estimation technique. Data is split up into non-overlapping bins and the proportion of data points in each bin is calculated. These proportions are then normalized (scaled down) so that the resulting histogram is a valid density function.

When estimating the density at a specific value, the height of the bin containing that value is used. The formula for the height is given by:

$$f(x)=\frac{\text{points in bin containing }x}{\text{total points}\times\text{bin size}}$$
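To make the formula concrete, here is a minimal NumPy sketch of a histogram estimator; the function name, bin count, and sample data are illustrative choices, not from the original article.

```python
import numpy as np

def histogram_density(data, bins=10):
    """Return a function that evaluates a histogram density estimate.

    Bins span the observed data range, and each bin's height is
    (points in bin) / (total points * bin width), so the total area is 1.
    """
    data = np.asarray(data)
    edges = np.linspace(data.min(), data.max(), bins + 1)
    counts, _ = np.histogram(data, bins=edges)
    heights = counts / (len(data) * np.diff(edges))

    def estimate(x):
        # Find the bin containing each x; values outside the range get density 0.
        idx = np.searchsorted(edges, x, side="right") - 1
        idx = np.clip(idx, 0, bins - 1)
        inside = (x >= edges[0]) & (x <= edges[-1])
        return np.where(inside, heights[idx], 0.0)

    return estimate

# Usage: estimate the density of 1,000 samples from a standard normal.
rng = np.random.default_rng(0)
f_hat = histogram_density(rng.normal(size=1000), bins=20)
print(f_hat(np.array([0.0, 1.5, 10.0])))
```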

Histogram Simulation

The following visualization depicts how histogram estimators can be incredibly close to the true distribution as the number of data points and bins increases.


Pros & Cons of Histogram Estimators

The main advantages of histogram estimators are that they are very easy to implement and don't make any assumptions about the underlying distribution of the data.

However, histogram estimators also have a few key disadvantages. Firstly, they produce a discontinuous density estimate, which may be inaccurate. Additionally, as seen in the simulation, a good estimate requires a lot of data. Ideally, every bin will contain at least one data point. In low dimensions, this is feasible; however, once you start working with high-dimensional data, the number of bins grows exponentially with the number of dimensions.

Gaussian Kernel Density Estimators (KDE)

Intuition

Sometimes, however, a dataset does not follow any common density function. For example, in the graph below, it is clear that a single Gaussian will not be sufficient to capture the nuances of this dataset. A more flexible density estimator is needed to model this data.

For datasets such as these, a Gaussian kernel density estimator can be used to build a density function that captures the data more accurately. The main idea is to place a kernel function, in this case a normalized Gaussian, at every data point. The normalized sum of all of these Gaussians results in a valid density function that is smoother than the one derived using a histogram estimator.

The kernel function is a Gaussian with a mean of $0$ and a variance of $1$; its pdf is given by:

$$K(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}$$

The overall density function is given by:

$$f(x)=\frac{1}{nh}\sum^n_{i=1}K\!\left(\frac{x-x_i}{h}\right)$$

In this equation, the factor $\frac{1}{n}$ ensures that the sum of all $n$ Gaussians is normalized, maintaining a total area of $1$, while $h$ (the bandwidth) controls the width of each Gaussian; dividing by $h$ rescales each widened kernel so that it still integrates to $1$.
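This estimator translates directly into code. The following is a minimal NumPy sketch that evaluates the sum above; the bandwidth value and the bimodal sample are illustrative assumptions.

```python
import numpy as np

def make_gaussian_kde(data, h=0.5):
    """Return f(x) = (1 / (n * h)) * sum_i K((x - x_i) / h) for the given data.

    A minimal sketch; in practice the bandwidth h would be tuned
    (e.g., with a rule of thumb or cross-validation).
    """
    data = np.asarray(data)
    n = len(data)

    def K(u):
        # Standard normal kernel: mean 0, variance 1.
        return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

    def estimate(x):
        x = np.atleast_1d(x)
        # One kernel centered at every data point, averaged and rescaled by h.
        u = (x[:, None] - data[None, :]) / h
        return K(u).sum(axis=1) / (n * h)

    return estimate

# Usage: a smooth density estimate from a bimodal sample.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 0.5, 500)])
f_hat = make_gaussian_kde(sample, h=0.4)
print(f_hat([-2.0, 0.0, 3.0]))
```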

Pros & Cons of KDE Estimators

Just like histogram estimators, KDE estimators are easy to understand and don't make assumptions about the data. Additionally, KDE estimators address some of the key issues with histogram estimators. Unlike histogram estimators, they produce a smooth estimate of the density function.

However, KDE estimators have their own set of disadvantages. Firstly, they're computationally expensive; computing a Gaussian for each data point can be very costly, especially if the dataset is massive. Furthermore, they still need a substantial amount of data, albeit less than histogram estimators, to provide accurate estimates.

Maximum Likelihood Estimators (MLE)

Intuition

Unlike nonparametric density estimators, maximum likelihood estimators assume the data follows a specific probability distribution. These distributions have parameters which define the characteristics of the distribution. For example, the parameters in a normal distribution are the mean (center) and standard deviation (spread).

The optimal parameters are found by maximizing the likelihood function, which measures how probable the observed data is given particular values of the parameters. It is defined as the product of the probability density functions evaluated at the observed data:

$$L(\theta ;x)=\prod_{i=1}^n p_x(x_i|\theta)$$
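As a hedged illustration, the sketch below fits a normal distribution by maximum likelihood. For a Gaussian, the maximizing parameters have a closed form (the sample mean and the uncorrected sample standard deviation), and the likelihood is evaluated on the log scale to avoid numerical underflow; the helper names and sample data are assumptions for the example.

```python
import numpy as np

def normal_mle(data):
    """Maximum likelihood estimates of a normal distribution's parameters.

    For a Gaussian, maximizing the likelihood gives the sample mean and the
    uncorrected (1/n) sample standard deviation.
    """
    data = np.asarray(data)
    return data.mean(), data.std()  # std() uses 1/n, matching the MLE

def normal_log_likelihood(data, mu, sigma):
    """log L(theta; x) = sum_i log p(x_i | mu, sigma) for a normal density."""
    data = np.asarray(data)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - mu)**2 / (2 * sigma**2))

# Usage: the MLE parameters score at least as well as any other guess.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)
mu_hat, sigma_hat = normal_mle(x)
print(mu_hat, sigma_hat)
print(normal_log_likelihood(x, mu_hat, sigma_hat)
      >= normal_log_likelihood(x, 4.0, 2.5))
```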

MLE Simulation

Try experimenting with different values of $\mu$ (mean) and $\sigma$ (standard deviation) to find the parameter values that maximize the likelihood of observing the data points below.


Pros & Cons of MLE Estimators

One advantage of using MLE compared to nonparametric methods is the limited amount of data that it requires; since the distribution is already assumed, it just has to be fit. Another advantage comes from the fact that the assumed distributions are already established. For example, if a normal distribution is assumed, then all of the useful properties of a normal distribution can be applied.

The biggest disadvantage of using a maximum likelihood estimator has to do with the fact that it makes an assumption about the underlying distribution. If the data doesn't actually come from the assumed distribution, the estimated density can be wildly inaccurate, leading to misguided predictions.

Density Estimation in Practice

Classification

One common technique for classification is to fit a separate density for each class. When predicting the class of a new data point, you compute the probability that the data point belongs to each class based on these density estimates. This approach forms the basis of techniques like Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA). For example, a classification problem for a two-class, one-feature dataset could look like:

Another common technique is to assume features are conditionally independent and then fit a separate density for each feature within each class. To predict the class of a new data point, you calculate the product of the probabilities of each feature given each class, and select the class with the highest resulting value. This technique is known as the Naive Bayes classifier.
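To tie the two ideas together, here is a minimal sketch of a Gaussian Naive Bayes classifier: it fits a per-class, per-feature normal density by maximum likelihood and picks the class with the highest prior-times-density score, computed in log space. The function names and toy dataset are illustrative, not a reference implementation.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Fit per-class, per-feature Gaussian densities plus class priors."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {"prior": len(Xc) / len(X),
                     "mean": Xc.mean(axis=0),
                     "std": Xc.std(axis=0)}
    return params

def predict_gaussian_nb(params, X):
    """Choose the class with the highest log prior + sum of log feature densities."""
    classes = list(params)
    scores = []
    for c in classes:
        p = params[c]
        log_pdf = (-0.5 * np.log(2 * np.pi * p["std"]**2)
                   - (X - p["mean"])**2 / (2 * p["std"]**2))
        scores.append(np.log(p["prior"]) + log_pdf.sum(axis=1))
    return np.array(classes)[np.argmax(np.column_stack(scores), axis=1)]

# Usage on a toy two-class, two-feature dataset.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
params = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(params, np.array([[0.0, 0.0], [3.0, 3.0]])))
```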

Histogram, KDE, or MLE?

All three density estimation techniques have their own advantages and disadvantages. So which one should be used? That depends on the data and the task at hand.

In specific scenarios, each one can be the best option. For example, if you have a vast amount of data, histogram estimators could be the best choice. If you have data that is almost guaranteed to come from a known distribution, MLE might be the best. On the other hand, if you have data that comes from an unknown distribution, KDE could be appropriate.

In practice, it's important to tailor your approach to the characteristics of your data and the objectives of your analysis.

Data Quality

Regardless of whether you choose a parametric or nonparametric method of density estimation, the data being used to fit the density is incredibly important. Specifically, it's important that the data is representative of the population. Otherwise, the estimated density won't be representative of the population either.

Data is typically collected through sampling, and there are three main methodologies:

Simple Random Sampling: each member of the population has an equal chance of being selected; oftentimes a random number generator is used.

Stratified Random Sampling: the population is divided into non-overlapping groups called strata. Within each stratum, members are sampled using simple random sampling.

Clustered Random Sampling: the population is divided into non-overlapping groups called clusters. Clusters are then sampled at random, and all members of selected clusters are part of the final sample.

Stratified sampling often ensures a more representative sample since subgroups are guaranteed to be sampled from. Clustered sampling, on the other hand, sacrifices representativeness for cost effectiveness, since sampling entire clusters at once can be less expensive.
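The sketch below illustrates all three schemes on a toy population of 1,000 member IDs, with ten artificial groups serving as both strata and clusters; the sample sizes and group structure are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
population = np.arange(1000)     # member IDs
groups = population % 10         # 10 groups, used as both strata and clusters

# Simple random sampling: every member has an equal chance of selection.
simple = rng.choice(population, size=100, replace=False)

# Stratified random sampling: a simple random sample within each stratum.
stratified = np.concatenate([
    rng.choice(population[groups == g], size=10, replace=False)
    for g in np.unique(groups)
])

# Clustered random sampling: pick whole clusters at random, keep every member.
chosen_clusters = rng.choice(np.unique(groups), size=2, replace=False)
clustered = population[np.isin(groups, chosen_clusters)]

print(len(simple), len(stratified), len(clustered))
```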

Sampling Visualized

The following visualization depicts how the population is partitioned when different sampling techniques are employed.

Conclusion

All in all, density estimation is an incredibly powerful technique that can be adapted to a wide variety of use cases. It's the foundation of many statistical and machine learning methods. By selecting and applying the appropriate technique, be it histogram estimators, KDEs, or MLEs, you can uncover valuable insights, improve predictive models, and make informed decisions based on your data.

Sources

The explanations and visualizations for histogram estimators and maximum likelihood estimation were adapted from Professor Justin Eldridge's DSC 140A - Probabilistic Modeling & Machine Learning.

The explanations for kernel density estimators were adapted from Jaroslaw Drapala's article, Kernel Density Estimator Explained.

Appendix

It's impossible to truly understand density estimation without a strong foundation in probability and statistics. The explanations below should serve as a refresher on those fundamentals.

Probability Spaces & Random Variables

Probability Spaces

A probability space can be expressed as a combination of a sample space $\Omega$, a collection of events $\mathcal{F}$, and a probability measure $\mathbb{P}$.

A sample space is the collection of all possible outcomes; sample spaces can be either discrete or continuous.

Discrete Sample Space: a discrete sample space consists of a finite or countably infinite set of distinct outcomes.

Continuous Sample Space: a continuous sample space consists of an uncountably infinite set of outcomes, typically corresponding to all possible values within a certain interval.

An event is a subset of the possible outcomes, and $\mathcal{F}$ is the collection of events under consideration. Finally, a probability measure is a function $\mathcal{F}\rightarrow[0,1]$ that satisfies the following axioms:

  1. $0 \leq \mathbb{P}(A) \leq 1$ where $A$ is an event.
  2. $\mathbb{P}(\varnothing)=0,\ \mathbb{P}(\Omega)=1$ ($\varnothing$ denotes the empty set).
  3. If $A_1,\dots,A_n$ are disjoint events, then $\mathbb{P}(A_1\cup\dots\cup A_n)=\sum^n_{i=1}\mathbb{P}(A_i)$.

Note: two events are disjoint if their intersection is the empty set.

Inclusion-Exclusion Principle: for any two events $A$ and $B$ in a probability space, $\mathbb{P}(A\cup B)=\mathbb{P}(A)+\mathbb{P}(B)-\mathbb{P}(A\cap B)$.
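As a quick worked example (not from the original text), roll a fair six-sided die and let $A$ be "the roll is even" and $B$ be "the roll is at most 3". Then $\mathbb{P}(A)=\frac{3}{6}$, $\mathbb{P}(B)=\frac{3}{6}$, and $A\cap B=\{2\}$, so

$$\mathbb{P}(A\cup B)=\frac{3}{6}+\frac{3}{6}-\frac{1}{6}=\frac{5}{6}$$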

Random Variables

A random variable is a function $X: \Omega\rightarrow\R$, where $\Omega$ is the sample space of some probability space. A random variable can be either discrete (having a specific set of possible values) or continuous (having any value in a specific range).

Conditional Probability & Independence

Conditional Probability

Given a probability space $(\Omega,\mathcal{F},\mathbb{P})$ and a positive-probability event $B\in\mathcal{F}$, we can define a new probability measure $\mathbb{P}(-|B)$ on $(\Omega,\mathcal{F})$ defined by:

$$\mathbb{P}(A|B)=\frac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)}$$

Multiplication Rule: $\mathbb{P}(A\cap B)=\mathbb{P}(A|B)\mathbb{P}(B)=\mathbb{P}(B|A)\mathbb{P}(A)$.

Law of Total Probability: if $B_1,B_2,\dots \in \mathcal{F}$ is a finite or infinite sequence of events partitioning $\Omega$, then for any event $A\in\mathcal{F}$: $\mathbb{P}(A)=\sum_i\mathbb{P}(A\cap B_i)=\sum_i\mathbb{P}(A|B_i)\mathbb{P}(B_i)$.

Bayes' Theorem

$$\mathbb{P}(A|B)=\frac{\mathbb{P}(B|A)\mathbb{P}(A)}{\mathbb{P}(B)}$$
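As a hedged worked example (the numbers are illustrative, not from the original text): suppose a condition affects $1\%$ of a population, a test detects it with probability $0.95$, and it falsely flags healthy individuals with probability $0.05$. By Bayes' theorem, with the denominator expanded via the law of total probability,

$$\mathbb{P}(\text{condition}|\text{positive})=\frac{0.95\times 0.01}{0.95\times 0.01+0.05\times 0.99}\approx 0.16$$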

Independence

Two events $A$ and $B$ in a probability space are independent if any of the three equivalent conditions hold:

  1. $\mathbb{P}(A|B)=\mathbb{P}(A)$.
  2. $\mathbb{P}(B|A)=\mathbb{P}(B)$.
  3. $\mathbb{P}(A\cap B)=\mathbb{P}(A)\mathbb{P}(B)$.

Mutual Independence: a sequence of events $A_1,A_2,\dots$ in a probability space is mutually independent if any collection of events $A_{i_1},A_{i_2},\dots,A_{i_n}$ chosen from $A_1,A_2,\dots$ satisfies $\mathbb{P}(A_{i_1}\cap A_{i_2}\cap\dots\cap A_{i_n})=\mathbb{P}(A_{i_1})\mathbb{P}(A_{i_2})\cdots\mathbb{P}(A_{i_n})$.

Probability Distribution Functions

Cumulative Distribution Functions (CDFs)

Let $X$ be a random variable. The cumulative distribution function (cdf) of $X$ is the function $F_x: \R\rightarrow\R$ defined by $F_x(t)=\mathbb{P}(X\leq t)=\mathbb{P}(X\in(-\infty ,t])$. The cdf gives us the probability that $X$ is less than or equal to some value.

A function $F: \R\rightarrow\R$ is the cdf of a random variable if and only if:

  1. $\lim_{t\to-\infty}F(t)=0$ and $\lim_{t\to\infty}F(t)=1$
  2. $F$ is nondecreasing: if $s \leq t$ then $F(s) \leq F(t)$
  3. $F$ is right continuous: $\lim_{t\to a^+}F(t)=F(a)$

Probability Mass Functions (PMFs)

Let $X$ be a discrete random variable. The probability mass function (pmf) of $X$ is defined as $p_x(a)=\mathbb{P}(X=a)$. The pmf gives us the probability that $X$ is exactly equal to some value.

A function $p: \R\to\R$ is the pmf of a discrete random variable if and only if there exists a finite or infinite sequence of numbers $a_1,a_2,\dots\in\R$ such that:

  1. $0 \leq p(a_i) \leq 1$ for all $a_i$
  2. $\sum_i p(a_i)=1$
  3. $p(a)=0$ if $a\notin\{a_1,a_2,\dots\}$

Probability Density Functions (PDFs)

Let $X$ be a continuous random variable. The probability density function (pdf) of $X$ is written $f_x(t)$. The pdf gives us the relative likelihood (not a probability) that $X$ is exactly equal to some value.

To find the probability that XX falls in a certain interval, we can take the integral of the pdf over that interval:

$$\mathbb{P}(X\in[a,b])=\int^b_a f_x(t)\,dt$$

A function $f: \R\to\R$ is the pdf of a continuous random variable if and only if:

  1. $f$ is integrable
  2. $f(t)\geq 0$ for all $t$
  3. $\int^{\infty}_{-\infty} f(t)\,dt=1$

Converting PMFs/PDFs to CDFs

If $X$ is discrete:

$$F_x(t)=\mathbb{P}(X\leq t)=\sum_{a\leq t}p_x(a)$$

If $X$ is continuous:

$$F_x(t)=\mathbb{P}(X\leq t)=\int^{t}_{-\infty}f_x(s)\,ds$$
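For example (a standard calculation, not from the original text), if $X\sim\text{Unif}[a,b]$ and $t\in[a,b]$, then

$$F_x(t)=\int^{t}_{-\infty}f_x(s)\,ds=\int^{t}_{a}\frac{1}{b-a}\,ds=\frac{t-a}{b-a}$$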

Converting CDFs to PMFs/PDFs

Let $X$ be a random variable and $F_x$ its cdf.

If $F_x$ is piecewise constant, then $X$ is discrete and its pmf is found by calculating the magnitude of the jumps at the points of discontinuity.

If $F_x$ is continuous and piecewise continuously differentiable, then $X$ is continuous and its pdf is given by $f_x(t)=F_x'(t)$.

Discrete Random Variables

Bernoulli Random Variables

A Bernoulli random variable models a single experiment with two possible outcomes: success and failure. It has a success probability of $p$.

Let $X$ be a Bernoulli random variable. Then, we write $X\sim \text{Ber}(p)$. Its pmf is given by:

$$\mathbb{P}(X=1)=p,\qquad \mathbb{P}(X=0)=1-p$$

Binomial Random Variables

A Binomial random variable represents the number of successes in $n$ independent Bernoulli trials, each with a success probability $p$.

Let $X$ be a Binomial random variable. Then, we write $X\sim \text{Bin}(n,p)$. Its pmf for $0 \leq k \leq n$ is given by:

$$\mathbb{P}(X=k)={n\choose k}p^k(1-p)^{n-k}$$

Geometric Random Variables

A Geometric random variable represents the number of Bernoulli trials (with success probability $p$) needed to get the first success.

Let $X$ be a Geometric random variable. Then, we write $X\sim \text{Geom}(p)$. Its pmf for $1 \leq k$ is given by:

$$\mathbb{P}(X=k)=p(1-p)^{k-1}$$

Poisson Random Variables

A Poisson random variable counts the number of times an event happens in a fixed period of time. It is defined by the average rate of occurrence, $\lambda$.

Let $X$ be a Poisson random variable. Then, we write $X\sim \text{Poisson}(\lambda)$. Its pmf for $0 \leq k$ is given by:

$$\mathbb{P}(X=k)=e^{-\lambda}\frac{\lambda^k}{k!}$$

Continuous Distributions

Uniform Distributions

The uniform distribution is a probability distribution in which all outcomes are equally likely within a specified range, $[a,b]$.

Let $X$ be a uniform random variable. Then, we write $X\sim \text{Unif}[a,b]$. Its pdf is given by:

$$f_x(t)=\begin{cases} \frac{1}{b-a}, & t\in[a,b] \\ 0, & t\notin[a,b] \end{cases}$$

Normal (Gaussian) Distributions

The normal distribution is a bell-shaped probability distribution defined by its mean, $\mu$ (the center), and its standard deviation, $\sigma$ (the spread).

Let $X$ be a normal random variable. Then, we write $X\sim \mathcal{N}(\mu,\sigma^2)$. Its pdf is given by:

$$f_x(t)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2}\left(\frac{t-\mu}{\sigma}\right)^2}$$

Expectation & Variance

Expectation of Discrete Random Variables

Suppose $X$ is a discrete random variable with pmf $p_x$. Its expectation (also known as mean/average) is given by:

$$\mathbb{E}[X]=\sum_{a\in\R}a\cdot \mathbb{P}(X=a)$$

If $W\sim \text{Ber}(p)$, then $\mathbb{E}[W]=p$.

If $X\sim \text{Bin}(n,p)$, then $\mathbb{E}[X]=np$.

If $Y\sim \text{Geom}(p)$, then $\mathbb{E}[Y]=\frac{1}{p}$.

If $Z\sim \text{Poisson}(\lambda)$, then $\mathbb{E}[Z]=\lambda$.

Theorem: expectation is linear; if $X$ and $Y$ are random variables defined on the same probability space and $c\in\R$, then $\mathbb{E}[X+Y]=\mathbb{E}[X]+\mathbb{E}[Y]$ and $\mathbb{E}[cX]=c\mathbb{E}[X]$.

Theorem: if $X$ is a discrete random variable and $g$ is a function, then $\mathbb{E}[g(X)]=\sum_{a\in\R}g(a)\cdot\mathbb{P}(X=a)$.
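As a quick worked example (not from the original text), if $X$ is the outcome of a roll of a fair six-sided die, then

$$\mathbb{E}[X]=\sum_{a=1}^{6}a\cdot\frac{1}{6}=\frac{1+2+3+4+5+6}{6}=3.5$$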

Expectation of Continuous Random Variables

Suppose $X$ is a continuous random variable with pdf $f_x$. Its expectation is given by:

$$\mathbb{E}[X]=\int^{\infty}_{-\infty}t\cdot f_x(t)\,dt$$

If $U\sim \text{Unif}[a,b]$, then $\mathbb{E}[U]=\frac{a+b}{2}$.

If $X\sim \mathcal{N}(\mu,\sigma^2)$, then $\mathbb{E}[X]=\mu$.

Theorem: if $X$ is a continuous random variable and $g$ is a function, then $\mathbb{E}[g(X)]=\int^{\infty}_{-\infty}g(t)f_x(t)\,dt$.

Variance

Let $X$ be a random variable. The variance of $X$ is:

$$\text{Var}(X)=\mathbb{E}[(X-\mathbb{E}[X])^2]$$

Another formula for variance that is often easier to calculate is:

$$\text{Var}(X)=\mathbb{E}[X^2]-(\mathbb{E}[X])^2$$

Note: the standard deviation of $X$ is $\sqrt{\text{Var}(X)}$.
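As a quick worked example (not from the original text), if $X\sim\text{Ber}(p)$, then $X^2=X$, so $\mathbb{E}[X^2]=\mathbb{E}[X]=p$ and

$$\text{Var}(X)=\mathbb{E}[X^2]-(\mathbb{E}[X])^2=p-p^2=p(1-p)$$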

Important Theorems

Central Limit Theorem

Suppose $X_1,X_2,\dots$ is an infinite sequence of independent and identically distributed random variables with mean $\mu$ and variance $\sigma^2$. Let $S_n=X_1+X_2+\dots+X_n$. Then:

$$\lim_{n\to\infty}\frac{S_n-n\mu}{\sigma\sqrt{n}}\sim\mathcal{N}(0,1)$$
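A short Monte Carlo sketch (illustrative, not from the original text) makes the theorem tangible: standardized sums of i.i.d. $\text{Unif}[0,1]$ variables behave like draws from $\mathcal{N}(0,1)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 10_000
mu, sigma = 0.5, np.sqrt(1 / 12)   # mean and standard deviation of Unif[0, 1]

samples = rng.uniform(0, 1, size=(trials, n))
S_n = samples.sum(axis=1)
Z = (S_n - n * mu) / (sigma * np.sqrt(n))  # standardized sums

# Z should have mean ~0, standard deviation ~1,
# and roughly 95% of the values should fall within +/- 1.96.
print(Z.mean(), Z.std(), np.mean(np.abs(Z) < 1.96))
```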

Law of Large Numbers

Suppose $X_1,X_2,\dots$ is an infinite sequence of independent and identically distributed random variables with mean $\mu$ and finite variance. Then:

$$\mathbb{P}\left(\lim_{n\to\infty}\overline{X}_n=\mu\right)=1$$

where $\overline{X}_n=\frac{1}{n}\sum_{i=1}^n X_i$ denotes the sample mean.