In the Bayesian method of parameter estimation, the unknown parameter $\theta$ is treated as a random variable. The parameter $\theta$ is assigned a prior distribution $f_\Theta(\theta)$, which encodes what we know about the parameter before observing the data. For a given value $\Theta = \theta$, the data have a probability density function $f_{X\mid\Theta}(x\mid\theta)$. In this notation, read $X\mid\Theta$ as (the distribution of) $X$ given $\Theta$. The joint distribution of $X$ and $\Theta$ can then be written as:
$$f_{X,\Theta}(x,\theta) = f_{X\mid\Theta}(x\mid\theta)\, f_\Theta(\theta).$$
Joint Distribution: the probability density function $f_{A,B}(a,b)$ is defined such that $f_{A,B}(a,b)\,da\,db$ gives the probability that $a \le A \le a + da \,\wedge\, b \le B \le b + db$.
The marginal distribution of the data is obtained, by definition, by integrating the joint distribution over the parameter space:
$$f_X(x) = \int d\theta\, f_{X,\Theta}(x,\theta) = \int d\theta\, f_{X\mid\Theta}(x\mid\theta)\, f_\Theta(\theta).$$
However, note that we can also write the joint distribution as $f_{X,\Theta}(x,\theta) = f_{\Theta\mid X}(\theta\mid x)\, f_X(x)$, so that the distribution of $\Theta$ given the data $X$ is given by:
$$f_{\Theta\mid X}(\theta\mid x) = \frac{f_{X\mid\Theta}(x\mid\theta)\, f_\Theta(\theta)}{f_X(x)}.$$
This is also called the posterior distribution, representing what we know about $\Theta$ given the data $X$. The denominator serves as a normalization constant; the information is contained in the numerator:
$$\text{posterior density} \propto \text{likelihood} \times \text{prior density}.$$
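To make this concrete, here is a minimal numerical sketch in Python, with made-up example data and an assumed parameter range, that builds a posterior on a grid: the likelihood and the prior are multiplied pointwise, and the normalization integral plays the role of the marginal $f_X(x)$.

```python
import numpy as np

# Hypothetical data: a few observations assumed to be exponentially distributed.
x = np.array([0.8, 1.3, 0.4, 2.1, 0.9])

# Grid over the parameter space for lambda (assumed range (0, 10]).
lam = np.linspace(1e-3, 10.0, 2000)

# Likelihood f(x | lambda) for an i.i.d. exponential sample, on the grid.
likelihood = np.array([np.prod(l * np.exp(-l * x)) for l in lam])

# Flat prior on the grid (any constant works; it drops out after normalizing).
prior = np.ones_like(lam)

# Unnormalized posterior: likelihood times prior.
unnorm = likelihood * prior

# Normalize with the trapezoidal rule: the integral is the marginal f(x).
posterior = unnorm / np.trapz(unnorm, lam)

print("posterior mean:", np.trapz(lam * posterior, lam))
```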
To construct the Bayesian estimate $\hat\theta_{\text{Bay}}$, we can extract the average or the most probable value from the posterior distribution: this is a matter of choice. As a measure of the uncertainty in the estimate, we can take the standard deviation of the distribution.
Example
Let us consider the exponential distribution again, so that $f_{X_i\mid\Lambda}(x_i\mid\lambda) = \lambda e^{-\lambda x_i}$. The conditional distribution $f_{X\mid\Lambda}(x\mid\lambda)$ of the full sample is then given by:
$$f_{X\mid\Lambda}(x\mid\lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \left(\lambda e^{-\lambda \bar{x}_n}\right)^n,$$
where $\bar{x}_n$ is the sample average for the sample of size $n$.
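As a quick check of this identity, one can compare the product form with the compact $(\lambda e^{-\lambda\bar{x}_n})^n$ form numerically; the sample below is simulated with an assumed true value $\lambda = 2.5$ (in practice one works with the log-likelihood to avoid underflow for large $n$).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0 / 2.5, size=8)    # sample, assumed true lambda = 2.5
lam = 2.0                                       # a trial parameter value
xbar = x.mean()

product_form = np.prod(lam * np.exp(-lam * x))  # product over i of lambda * exp(-lambda x_i)
compact_form = (lam * np.exp(-lam * xbar)) ** len(x)

print(product_form, compact_form)               # the two forms agree
```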
There is no formal prescription for what the prior distribution should look like. Usually, we take a conservative approach that is as unbiased as possible, so that the data can speak for themselves. To this end, we take a flat prior, i.e., one that is constant on an interval in which we expect the true parameter to lie. Assuming the parameter falls inside the interval $\lambda \in [\lambda_{\min}, \lambda_{\max}]$, we get:
$$f_\Lambda(\lambda) = \frac{1}{\lambda_{\max} - \lambda_{\min}}, \qquad \lambda_{\min} \le \lambda \le \lambda_{\max},$$
and zero otherwise. Of course, choosing the interval to be very narrow makes even a flat prior biased. We will take the most conservative approach: we only assume that the parameter is positive and take the interval $\lambda \in [0, \infty)$. (Strictly speaking, a flat prior on an infinite interval is improper, but the posterior below remains normalizable.) The posterior is then:
$$f_{\Lambda\mid X}(\lambda\mid x) = \frac{1}{I(x)}\left(\lambda e^{-\lambda \bar{x}_n}\right)^n, \qquad \lambda \ge 0,$$
where $I(x)$ is a normalization factor depending on the data. Notice that since the prior is flat, it cancelled out of the posterior: the only trace it left is the integration range of the normalization integral. Substituting $y = n\bar{x}_n\lambda$, the normalization integral can be written in terms of the gamma function as:
$$I(x) = \frac{1}{(n\bar{x}_n)^{n+1}} \int_0^\infty dy\, y^n e^{-y} = \frac{\Gamma(n+1)}{(n\bar{x}_n)^{n+1}}.$$
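A short sanity check of this closed form against direct numerical integration, for an assumed small sample size and sample mean (large $n$ overflows in this direct form, so one would then work with logarithms):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

n, xbar = 5, 0.4          # assumed sample size and sample mean

# Direct numerical integration of the unnormalized posterior.
numeric, _ = quad(lambda lam: (lam * np.exp(-lam * xbar)) ** n, 0, np.inf)

# Closed form: Gamma(n + 1) / (n * xbar)^(n + 1).
closed = gamma(n + 1) / (n * xbar) ** (n + 1)

print(numeric, closed)    # both should agree closely
```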
Therefore, we find the posterior to be:
$$f_{\Lambda\mid X}(\lambda\mid x) = \frac{(n\bar{x}_n)^{n+1}}{\Gamma(n+1)} \left(\lambda e^{-\lambda \bar{x}_n}\right)^n,$$
which can be recognized as a Gamma distribution $f(\lambda;\alpha,\beta) = \beta^\alpha \lambda^{\alpha-1} e^{-\beta\lambda}/\Gamma(\alpha)$ with parameters $\alpha = n+1$ and $\beta = n\bar{x}_n$. From the posterior, we take the expectation value as the estimate and the standard deviation as the error in the estimate:
$$\hat\lambda_{\text{Bay}} = \frac{\alpha}{\beta} = \frac{n+1}{n\bar{x}_n}, \qquad \hat\sigma_{\hat\lambda} = \frac{\sqrt{\alpha}}{\beta} = \frac{\sqrt{n+1}}{n\bar{x}_n}.$$
In the limit of large sample size $n \to \infty$, we find that the estimator becomes $\hat\lambda_{\text{Bay}} = 1/\bar{x}_n$, equivalent to the Method of Moments and Maximum Likelihood estimators; the same applies to the error estimate. Below, we plot the posterior obtained for a sample of size $n = 1000$, together with the estimate and its error.
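The sketch below illustrates this numerically for a simulated sample of size $n = 1000$ with an assumed true value $\lambda = 2$: the Gamma posterior is built with scipy, and the Bayesian estimate and its error are compared with the Maximum Likelihood value $1/\bar{x}_n$ (the posterior curve itself can be plotted from `posterior.pdf` on a grid).

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(42)
true_lam = 2.0                       # assumed true parameter
n = 1000
x = rng.exponential(scale=1.0 / true_lam, size=n)
xbar = x.mean()

# Posterior is Gamma(alpha = n + 1, beta = n * xbar); scipy uses scale = 1 / beta.
posterior = gamma(a=n + 1, scale=1.0 / (n * xbar))

lam_bay = posterior.mean()           # (n + 1) / (n * xbar)
sigma_bay = posterior.std()          # sqrt(n + 1) / (n * xbar)
lam_mle = 1.0 / xbar                 # Maximum Likelihood estimate

print(f"Bayes: {lam_bay:.4f} +/- {sigma_bay:.4f}")
print(f"ML:    {lam_mle:.4f}")
```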
Figure: Posterior distribution for the estimation of the parameter $\lambda$.