Encoding distribution and "true" distribution modeling #348

@teaandscones

Hello @YodaEmbedding,

I was reading your answer in this previously posted issue:

For a single fixed encoding distribution $p$, the average rate cost for encoding a single symbol that is drawn from the same distribution $p$ is:

$$R = \sum_t - p(t) \, \log p(t)$$

But this is not what we're doing. What we're actually interested in is the cross-entropy, that is, the average rate cost for encoding a single symbol drawn from the true distribution $\hat{p}$:

$$R = \sum_t - \hat{p}(t) \, \log p(t)$$

To be consistent with our notation above, we should also sprinkle in some $i$'s:

$$R_i = \sum_t - \hat{p}_i(t) \, \log p_i(t)$$

In our case, we know exactly what $\hat{p}$ is...

$$\hat{p}_i(t) = \delta[t - \hat{y}_i] = \begin{cases}1 & \text{if } t = \hat{y}_i \\ 0 & \text{otherwise}\end{cases}$$

If we plug this into the earlier equation, the rate cost for encoding the $i$-th element becomes:

$$R_i = -\log p_i(\hat{y}_i)$$

Originally posted by @YodaEmbedding in #314
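To make the collapse from the cross-entropy sum to $-\log p_i(\hat{y}_i)$ concrete, here is a small numerical sketch. The distribution values are made up for illustration; the only structural assumption is that $\hat{p}$ is a delta at the observed symbol:

```python
import math

# Toy encoding distribution p over 4 symbols (hypothetical values).
p = [0.5, 0.25, 0.125, 0.125]

# Suppose the symbol actually produced is y_hat = 2, so the "true"
# distribution p_hat is a delta at index 2.
y_hat = 2
p_hat = [1.0 if t == y_hat else 0.0 for t in range(len(p))]

# Cross-entropy rate: R = -sum_t p_hat(t) * log2 p(t).
# Terms with p_hat(t) == 0 contribute nothing, so we skip them.
R = -sum(ph * math.log2(pt) for ph, pt in zip(p_hat, p) if ph > 0)

# The sum collapses to a single term: -log2 p(y_hat).
print(R, -math.log2(p[y_hat]))  # both equal 3.0 bits
```

Only the term at $t = \hat{y}_i$ survives, which is why the per-symbol rate reduces to the negative log-likelihood of the symbol that was actually encoded.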

I’m trying to better understand the reasoning behind this formulation, and I have two main questions:

  1. Why is the true distribution $\hat{p}_i$ considered different from the encoding distribution $p_i$?
    What does it mean to refer to a “true” distribution in this context?

  2. Why is $\hat{p}_i$ modeled as a delta function?
    This seems to imply that there is no uncertainty at all — only one possible symbol $\hat{y}_i$ with probability 1. If that’s the case, what motivates using a probabilistic framework at all?

Thanks in advance for any clarification!
