Density Estimation
What if we told you that a single statistical method could revolutionize fields from finance to healthcare? Welcome to the world of density estimation. Let's explore how this unsung hero of data science works and why it's so important.
What is Density Estimation?
Definition
Density estimation is a technique used to estimate the probability density function from which observed data was drawn. There are two main types of density estimation:
Nonparametric Density Estimation: a technique where no assumptions are made about the underlying distribution of the data. Rather, the estimated density is fit to the data itself.
Parametric Density Estimation: a technique where the underlying distribution (ex: normal, exponential, etc.) that the data comes from is assumed. The parameter(s) that maximize the likelihood of observing the given data are used to estimate the density.
Importance & Applications
Understanding the underlying distribution of data yields highly useful insights. Density estimation is employed by many different machine learning methods, including Naive Bayes classifiers and Markov models.
These models have applications to countless different industries. They're used in finance for risk assessment and fraud detection; in healthcare for disease diagnosis and treatment personalization; and in agriculture for yield maximization and pest control.
Histogram Estimators
Intuition
Histogram estimators are a nonparametric density estimation technique. Data is split up into non-overlapping bins and the proportion of data points in each bin is calculated. These proportions are then normalized (divided by the bin width) so that the resulting histogram is a valid density function with total area 1.
When estimating the density at a specific value, the height of the bin containing that value is used. The formula for the height is given by:

$$\hat{f}(x) = \frac{\text{number of data points in the bin containing } x}{n \cdot w}$$

where $n$ is the total number of data points and $w$ is the bin width.
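To make the formula concrete, here is a minimal NumPy sketch of a histogram density estimator; the function name, the equal-width binning, and the choice of 10 bins are illustrative assumptions rather than anything prescribed above.

```python
import numpy as np

def histogram_density(data, num_bins=10):
    """Estimate a density with equal-width bins: height = count / (n * bin width)."""
    data = np.asarray(data)
    edges = np.linspace(data.min(), data.max(), num_bins + 1)
    counts, _ = np.histogram(data, bins=edges)
    width = edges[1] - edges[0]
    heights = counts / (len(data) * width)  # bin proportions scaled so total area is 1

    def estimate(x):
        # Find the bin containing x (clipping so the right edge of the last bin is included)
        idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, num_bins - 1)
        return heights[idx]

    return estimate

# Example: the estimate at 0 for standard normal samples should be roughly 0.4
rng = np.random.default_rng(0)
f_hat = histogram_density(rng.normal(size=1_000))
print(f_hat(0.0))
```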
Histogram Simulation
The following visualization depicts how histogram estimators can be incredibly close to the true distribution as the number of data points and bins increases.
Pros & Cons of Histogram Estimators
The main advantages of histogram estimators are that they are very easy to implement and don't make any assumptions about the underlying distribution of the data.
However, histogram estimators also have a few key disadvantages. First, they produce a discontinuous density estimate, which may be inaccurate. Additionally, as seen in the simulation, a good estimate requires a lot of data. Ideally, in practice, every bin will contain at least one data point. In low dimensions this is feasible; however, once you start working with high-dimensional data, the number of bins grows exponentially with the dimension, and so does the amount of data required.
Gaussian Kernel Density Estimators (KDE)
Intuition
Sometimes, however, a dataset does not follow any common, well-known distribution. For example, in the graph below, it is clear that a single Gaussian will not be sufficient to capture the nuances of the dataset. A more flexible density estimator is needed to model this data.
In datasets such as these, a Gaussian kernel density estimator can be used to represent a density function that captures the data more accurately. The main idea is to place a kernel function, in this case a normalized Gaussian, at every data point. The scaled sum of all of these Gaussians results in a valid density function that is smoother than the one produced by a histogram estimator.
The kernel function is a Gaussian with a mean of $0$ and a variance of $1$; its pdf is given by:

$$K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2 / 2}$$

The overall density function is given by:

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

In this equation, the factor $\frac{1}{nh}$ ensures that the sum of all Gaussians is normalized, maintaining a total area of $1$. Additionally, the bandwidth $h$ controls the width of each Gaussian.
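The formula translates almost line for line into code. Below is a minimal sketch of a Gaussian KDE built directly from the equations above; the bimodal toy dataset and the bandwidth $h = 0.3$ are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal pdf: mean 0, variance 1."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(data, h):
    """Return f_hat(x) = (1 / (n * h)) * sum_i K((x - x_i) / h)."""
    data = np.asarray(data)

    def estimate(x):
        u = (x - data) / h  # scaled distance from x to every data point
        return gaussian_kernel(u).sum() / (len(data) * h)

    return estimate

rng = np.random.default_rng(1)
# A bimodal dataset that a single Gaussian would model poorly
data = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)])
f_hat = kde(data, h=0.3)
print(f_hat(-2.0), f_hat(0.0))  # high density near a mode, low density between the modes
```

Note that evaluating the estimate at a single point touches every data point, which is exactly the computational cost discussed below.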
Pros & Cons of KDE Estimators
Just like histogram estimators, KDE estimators are easy to understand and don't make assumptions about the data. Additionally, KDE estimators address some of the key issues with histogram estimators. Unlike histogram estimators, they produce a smooth estimate of the density function.
However, KDE estimators have their own set of disadvantages. Firstly, they're computationally expensive; computing a Gaussian for each data point can be very costly especially if the data set is massive. Furthermore, they still need a substantial amount of data, albeit less than histogram estimators, to provide accurate estimates.
Maximum Likelihood Estimators (MLE)
Intuition
Unlike nonparametric density estimators, maximum likelihood estimators assume the data follows a specific probability distribution. These distributions have parameters which define the characteristics of the distribution. For example, the parameters in a normal distribution are the mean (center) and standard deviation (spread).
The optimal parameters are found by maximizing the likelihood function, which measures how probable the observed data is given particular values of the parameters. It is defined as the product of the probability density function evaluated at each observed data point:

$$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$
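In practice the log of the likelihood is maximized, since a product of many small densities underflows. The sketch below fits a normal model both numerically and with the closed-form MLE (the sample mean and standard deviation); the simulated data and starting values are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, data):
    """Negative log of L(mu, sigma) = prod_i f(x_i | mu, sigma) for a normal model."""
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    return 0.5 * np.sum(((data - mu) / sigma) ** 2) + len(data) * np.log(sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(2)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)

# Maximizing the likelihood is the same as minimizing its negative log
result = minimize(neg_log_likelihood, x0=[0.0, 1.0], args=(data,), method="Nelder-Mead")
mu_hat, sigma_hat = result.x

print(mu_hat, sigma_hat)        # numeric optimum
print(data.mean(), data.std())  # closed-form MLE for a normal model; should closely match
```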
MLE Simulation
Try experimenting with different values of $\mu$ (mean) and $\sigma$ (standard deviation) to find the parameter values that maximize the likelihood of observing the data points below.
Pros & Cons of MLE Estimators
One advantage of using MLE compared to nonparametric methods is the limited amount of data that it requires; since the distribution is already assumed, it just has to be fit. Another advantage comes from the fact that the assumed distributions are already established. For example, if a normal distribution is assumed, then all of the useful properties of a normal distribution can be applied.
The biggest disadvantage of using a maximum likelihood estimator has to do with the fact that it makes an assumption about the underlying distribution. If the data doesn't actually come from the assumed distribution, the estimated density can be wildly inaccurate, leading to misguided predictions.
Density Estimation in Practice
Classification
One common technique for classification is to fit a separate density for each class. When predicting the class of a new data point, you compute the probability that the data point belongs to each class based on these density estimates. This approach forms the basis of techniques like Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA). For example, a classification problem for a two-class, one-feature dataset could look like the visualization below.
Another common technique is to assume features are conditionally independent given the class and then fit a separate density for each feature within each class. To predict the class of a new data point, you calculate, for each class, the product of the class's prior probability and the probabilities of each feature given that class, and select the class with the highest resulting value. This technique is known as the Naive Bayes classifier.
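As a sketch of this idea, here is a tiny Gaussian Naive Bayes classifier that fits one normal density per feature per class; the toy two-class dataset and all names are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def fit_naive_bayes(X, y):
    """For each class, store its prior plus a per-feature mean and standard deviation."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = {"prior": len(Xc) / len(X), "mean": Xc.mean(axis=0), "std": Xc.std(axis=0)}
    return model

def predict(model, x):
    """Pick the class maximizing prior * product of per-feature densities (done in log space)."""
    scores = {
        c: np.log(p["prior"]) + norm.logpdf(x, p["mean"], p["std"]).sum()
        for c, p in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical two-class, two-feature dataset
rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 1, (100, 2)), rng.normal([3, 3], 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

model = fit_naive_bayes(X, y)
print(predict(model, np.array([2.5, 3.2])))  # expected to be class 1
```

Working in log space avoids underflow when many feature probabilities are multiplied together.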
Histogram, KDE, or MLE?
All three density estimation techniques have their own advantages and disadvantages. So which one should be used? That depends on the data and the task at hand.
In specific scenarios, each one can be the best option. For example, if you have a vast amount of data, histogram estimators could be the best choice. If your data is almost guaranteed to come from a known distribution, MLE might be best. On the other hand, if your data comes from an unknown distribution, KDE could be appropriate.
In practice, it's important to tailor your approach to the characteristics of your data and the objectives of your analysis.
Data Quality
Regardless of whether you choose a parametric or nonparametric method of density estimation, the data being used to fit the density is incredibly important. Specifically, it's important that the data is representative of the population. Otherwise, the estimated density won't be representative of the population either.
Data is typically collected through sampling, and there are three main methodologies:
Simple Random Sampling: each member of the population has an equal chance of being selected; oftentimes a random number generator is used.
Stratified Random Sampling: the population is divided into non-overlapping groups called strata. Within each stratum, members are sampled using simple random sampling.
Clustered Random Sampling: the population is divided into non-overlapping groups called clusters. Clusters are then sampled at random, and all members of selected clusters are part of the final sample.
Stratified sampling often ensures a more representative sample since every subgroup is guaranteed to be sampled from. Clustered sampling, on the other hand, sacrifices representation for cost effectiveness, since sampling entire clusters at once can be less expensive.
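The sketch below mimics the three schemes with NumPy; the population of 1,000 members split across 10 regions is purely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical population of 1,000 members, each assigned to one of 10 regions
population = np.arange(1_000)
region = rng.integers(0, 10, size=1_000)

# Simple random sampling: every member has an equal chance of selection
simple = rng.choice(population, size=100, replace=False)

# Stratified random sampling: a simple random sample of 10 members within every region (stratum)
stratified = np.concatenate([
    rng.choice(population[region == r], size=10, replace=False) for r in range(10)
])

# Clustered random sampling: pick 2 regions (clusters) at random and keep everyone in them
chosen = rng.choice(10, size=2, replace=False)
clustered = population[np.isin(region, chosen)]

print(len(simple), len(stratified), len(clustered))
```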
Sampling Visualized
The following visualization depicts how the population is partitioned when different sampling techniques are employed.
Conclusion
All in all, density estimation is an incredibly powerful technique that can be adapted for a wide variety of use cases. It's the foundation of many statistical and machine learning methods. By selecting and applying the appropriate technique, be it histogram estimators, KDEs, or MLEs, you can uncover valuable insights, improve predictive models, and make informed decisions based on your data.
Sources
The explanations and visualizations for histogram estimators and maximum likelihood estimation were adapted from Professor Justin Eldridge's DSC 140A - Probabilistic Modeling & Machine Learning.
The explanations for kernel density estimators were adapted from Jaroslaw Drapala's article, Kernel Density Estimator Explained.
Appendix
It's impossible to truly understand density estimation without a strong foundation in probability and statistics. The explanations below should serve as a refresher on those fundamentals.
Probability Spaces & Random Variables
Probability Spaces
A probability space can be expressed as a combination of a sample space $\Omega$, a collection of events $\mathcal{F}$, and a probability measure $P$.
A sample space is the collection of all possible outcomes; sample spaces can be either discrete or continuous.
Discrete Sample Space: a discrete sample space consists of a finite or countably infinite set of distinct outcomes.
Continuous Sample Space: a continuous sample space consists of an uncountably infinite set of outcomes, typically corresponding to all possible values within a certain interval.
An event is a subset of the possible outcomes, and $\mathcal{F}$ is the collection of all events. Finally, a probability measure $P$ is a function that satisfies the following axioms:
- $0 \leq P(A) \leq 1$, where $A$ is an event.
- $P(\Omega) = 1$ and $P(\emptyset) = 0$ ($\emptyset$ denotes the empty set).
- If $A_1, A_2, \ldots$ are a disjoint sequence of events, then $P\left(\bigcup_i A_i\right) = \sum_i P(A_i)$.
Note: two events are disjoint if their intersection is the empty set.
Inclusion-Exclusion Principle: for any two events $A$ and $B$ in a probability space, $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
Random Variables
A random variable is a function $X: \Omega \to \mathbb{R}$, where $\Omega$ is the sample space of some probability space. A random variable can be either discrete (taking values in a specific, countable set) or continuous (taking any value in a specific range).
Conditional Probability & Independence
Conditional Probability
Given a probability space $(\Omega, \mathcal{F}, P)$ and an event $B$ with positive probability, we can define a new probability measure $P(\cdot \mid B)$ on $(\Omega, \mathcal{F})$ defined by:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Multiplication Rule: $P(A \cap B) = P(A \mid B)\, P(B)$.
Law of Total Probability: if $B_1, B_2, \ldots$ is a finite or infinite sequence of events partitioning $\Omega$, then for any event $A$: $P(A) = \sum_i P(A \mid B_i)\, P(B_i)$.
Bayes' Theorem
For events $A$ and $B$ with $P(A) > 0$:

$$P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}$$
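As a quick numerical illustration with made-up numbers: suppose a test detects a condition with probability 0.95 when it is present, has a 10% false positive rate, and the condition affects 1% of the population.

```python
# Hypothetical inputs: P(positive | disease), P(positive | healthy), P(disease)
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.10
p_disease = 0.01

# Law of total probability gives the denominator P(positive)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(disease | positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # about 0.088, despite the test looking accurate
```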
Independence
Two events $A$ and $B$ in a probability space are independent if any of the three equivalent conditions hold:
- $P(A \cap B) = P(A)\, P(B)$.
- $P(A \mid B) = P(A)$.
- $P(B \mid A) = P(B)$.
Mutual Independence: a sequence of events $A_1, A_2, \ldots$ in a probability space is mutually independent if every finite collection of events $A_{i_1}, \ldots, A_{i_k}$ chosen from the sequence satisfies $P(A_{i_1} \cap \cdots \cap A_{i_k}) = P(A_{i_1}) \cdots P(A_{i_k})$.
Probability Distribution Functions
Cumulative Distribution Functions (CDFs)
Let $X$ be a random variable. The cumulative distribution function (cdf) of $X$ is the function $F_X: \mathbb{R} \to [0, 1]$ defined by $F_X(x) = P(X \leq x)$. The cdf gives us the probability that $X$ is less than or equal to some value.
A function $F: \mathbb{R} \to [0, 1]$ is the cdf of a random variable if and only if:
- $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$
- $F$ is nondecreasing: if $x \leq y$, then $F(x) \leq F(y)$
- $F$ is right continuous: $\lim_{x \to a^{+}} F(x) = F(a)$ for every $a$
Probability Mass Functions (PMFs)
Let $X$ be a discrete random variable. The probability mass function (pmf) of $X$ is defined as $p_X(x) = P(X = x)$. The pmf gives us the probability that $X$ is exactly equal to some value.
A function $p: \mathbb{R} \to [0, 1]$ is the pmf of a discrete random variable if and only if there exists a finite or infinite sequence of numbers $x_1, x_2, \ldots$ such that:
- $p(x_i) > 0$ for all $i$, and $\sum_i p(x_i) = 1$
- $p(x) = 0$ if $x \notin \{x_1, x_2, \ldots\}$
Probability Density Functions (PDFs)
Let $X$ be a continuous random variable. The probability density function (pdf) of $X$ is defined as the derivative of its cdf: $f_X(x) = \frac{d}{dx} F_X(x)$. The pdf gives us the relative likelihood (not a probability) that $X$ is exactly equal to some value.
To find the probability that $X$ falls in a certain interval, we can take the integral of the pdf over that interval:

$$P(a \leq X \leq b) = \int_a^b f_X(x)\, dx$$

A function $f: \mathbb{R} \to \mathbb{R}$ is the pdf of a continuous random variable if and only if:
- $f$ is integrable and $\int_{-\infty}^{\infty} f(x)\, dx = 1$
- $f(x) \geq 0$ for all $x$
Converting PMFs/PDFs to CDFs
If $X$ is discrete: $F_X(x) = \sum_{x_i \leq x} p_X(x_i)$.
If $X$ is continuous: $F_X(x) = \int_{-\infty}^{x} f_X(t)\, dt$.
Converting CDFs to PMFs/PDFs
Let $X$ be a random variable and $F_X$ its cdf.
If $F_X$ is piecewise constant, then $X$ is discrete and its pmf is found by calculating the magnitude of the jumps at the points of discontinuity.
If $F_X$ is continuous and piecewise continuously differentiable, then $X$ is continuous and its pdf is given by $f_X(x) = F_X'(x)$.
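As a small sanity check of the cdf-to-pdf relationship, numerically differentiating a cdf recovers the pdf; the sketch uses SciPy's standard normal as an arbitrary example.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-3, 3, 601)
cdf = norm.cdf(x)

# Numerical derivative of the cdf (central differences in the interior)
pdf_from_cdf = np.gradient(cdf, x)

# The recovered values should agree closely with the pdf evaluated directly
print(np.max(np.abs(pdf_from_cdf - norm.pdf(x))))  # very small
```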
Discrete Random Variables
Bernoulli Random Variables
A Bernoulli random variable models a single experiment with two possible outcomes: success and failure. It has a success probability of $p$.
Let $X$ be a Bernoulli random variable. Then, we write $X \sim \text{Bernoulli}(p)$. Its pmf for $x \in \{0, 1\}$ is given by:

$$p_X(x) = p^x (1 - p)^{1 - x}$$
Binomial Random Variables
A Binomial random variable represents the number of successes in $n$ independent Bernoulli trials, each with a success probability of $p$.
Let $X$ be a Binomial random variable. Then, we write $X \sim \text{Bin}(n, p)$. Its pmf for $k = 0, 1, \ldots, n$ is given by:

$$p_X(k) = \binom{n}{k} p^k (1 - p)^{n - k}$$
Geometric Random Variables
A Geometric random variable represents the number of Bernoulli trials (with success probability $p$) needed to get the first success.
Let $X$ be a Geometric random variable. Then, we write $X \sim \text{Geom}(p)$. Its pmf for $k = 1, 2, \ldots$ is given by:

$$p_X(k) = (1 - p)^{k - 1} p$$
Poisson Random Variables
A Poisson random variable counts the number of times an event happens in a fixed period of time. It is defined by the average rate of occurrence, $\lambda$.
Let $X$ be a Poisson random variable. Then, we write $X \sim \text{Pois}(\lambda)$. Its pmf for $k = 0, 1, 2, \ldots$ is given by:

$$p_X(k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
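The pmfs above can be checked against SciPy's implementations; the parameter values here are arbitrary.

```python
from math import comb, exp, factorial
from scipy.stats import binom, geom, poisson

n, p, lam, k = 10, 0.3, 4.0, 2

# Binomial: C(n, k) * p^k * (1 - p)^(n - k)
print(comb(n, k) * p**k * (1 - p) ** (n - k), binom.pmf(k, n, p))

# Geometric (number of trials until the first success): (1 - p)^(k - 1) * p
print((1 - p) ** (k - 1) * p, geom.pmf(k, p))

# Poisson: lambda^k * e^(-lambda) / k!
print(lam**k * exp(-lam) / factorial(k), poisson.pmf(k, lam))
```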
Continuous Distributions
Uniform Distributions
The uniform distribution is a probability distribution in which all outcomes are equally likely within a specified range, $[a, b]$.
Let $X$ be a uniform random variable. Then, we write $X \sim \text{Unif}(a, b)$. Its pdf is given by:

$$f_X(x) = \begin{cases} \frac{1}{b - a} & a \leq x \leq b \\ 0 & \text{otherwise} \end{cases}$$
Normal (Gaussian) Distributions
The normal distribution is a bell-shaped probability distribution defined by its mean, $\mu$ (the center), and its standard deviation, $\sigma$ (the spread).
Let $X$ be a normal random variable. Then, we write $X \sim \mathcal{N}(\mu, \sigma^2)$. Its pdf is given by:

$$f_X(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
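Similarly, the continuous pdfs can be checked against SciPy; note that scipy.stats.norm is parameterized by the standard deviation, and scipy.stats.uniform by loc = a and scale = b - a. The values below are arbitrary.

```python
import numpy as np
from scipy.stats import norm, uniform

mu, sigma, x = 1.0, 2.0, 0.5
a, b = -1.0, 3.0

# Normal pdf: (1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))
by_hand = np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
print(by_hand, norm.pdf(x, loc=mu, scale=sigma))

# Uniform pdf on [a, b]: 1 / (b - a) inside the interval
print(1 / (b - a), uniform.pdf(x, loc=a, scale=b - a))
```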
Expectation & Variance
Expectation of Discrete Random Variables
Suppose $X$ is a discrete random variable with pmf $p_X$. Its expectation (also known as mean/average) is given by:

$$E[X] = \sum_x x\, p_X(x)$$

If $X \sim \text{Bernoulli}(p)$, then $E[X] = p$.
If $X \sim \text{Bin}(n, p)$, then $E[X] = np$.
If $X \sim \text{Geom}(p)$, then $E[X] = \frac{1}{p}$.
If $X \sim \text{Pois}(\lambda)$, then $E[X] = \lambda$.
Theorem: expectation is linear; if $X$ and $Y$ are random variables defined on the same probability space and $a, b \in \mathbb{R}$, then $E[X + Y] = E[X] + E[Y]$ and $E[aX + b] = a\,E[X] + b$.
Theorem: if $X$ is a discrete random variable and $g$ is a function, then $E[g(X)] = \sum_x g(x)\, p_X(x)$.
Expectation of Continuous Random Variables
Suppose $X$ is a continuous random variable with pdf $f_X$. Its expectation is given by:

$$E[X] = \int_{-\infty}^{\infty} x\, f_X(x)\, dx$$

If $X \sim \text{Unif}(a, b)$, then $E[X] = \frac{a + b}{2}$.
If $X \sim \mathcal{N}(\mu, \sigma^2)$, then $E[X] = \mu$.
Theorem: if $X$ is a continuous random variable and $g$ is a function, then $E[g(X)] = \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx$.
Variance
Let $X$ be a random variable. The variance of $X$ is:

$$\text{Var}(X) = E\left[(X - E[X])^2\right]$$

Another formula for variance that is often easier to calculate is:

$$\text{Var}(X) = E[X^2] - (E[X])^2$$

Note: the standard deviation of $X$ is $\text{SD}(X) = \sqrt{\text{Var}(X)}$.
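A quick simulation (using an exponential distribution as an arbitrary example) shows that the two variance formulas agree.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=2.0, size=100_000)  # arbitrary example; its true variance is 4

mean = x.mean()
var_definition = np.mean((x - mean) ** 2)  # E[(X - E[X])^2]
var_shortcut = np.mean(x**2) - mean**2     # E[X^2] - (E[X])^2

print(var_definition, var_shortcut)  # identical up to floating-point error, both near 4
```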
Important Theorems
Central Limit Theorem
Suppose $X_1, X_2, \ldots$ is an infinite sequence of independent and identically distributed random variables with mean $\mu$ and variance $\sigma^2$. Let $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$. Then:

$$\sqrt{n}\, \frac{\bar{X}_n - \mu}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1) \quad \text{as } n \to \infty$$
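A short simulation illustrates the theorem: even for a skewed population (an exponential distribution, chosen arbitrarily), standardized sample means behave like draws from a standard normal.

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, n = 2.0, 2.0, 100  # an exponential with scale 2 has mean 2 and std 2

# Draw 10,000 independent samples of size n and standardize each sample mean
samples = rng.exponential(scale=2.0, size=(10_000, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma

print(z.mean(), z.std())  # close to 0 and 1, as the CLT predicts
```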
Law of Large Numbers
Suppose $X_1, X_2, \ldots$ is an infinite sequence of independent and identically distributed random variables with mean $\mu$ and finite variance. Then:

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \longrightarrow \mu \quad \text{as } n \to \infty$$
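And a running-average simulation of the same exponential population illustrates the law of large numbers: the sample mean settles toward the true mean as more observations arrive.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=2.0, size=100_000)  # true mean is 2

# Sample mean after the first 1, 2, ..., n observations
running_mean = np.cumsum(x) / np.arange(1, len(x) + 1)

print(running_mean[99], running_mean[9_999], running_mean[-1])  # drifts toward 2
```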