It looks like the script:
https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/utils/lang/add_unigrams_arpa.pl
doesn't make any attempt to ensure that the unigram probabilities sum to 1.0. I don't know whether this is a problem or not.
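For reference, a rough way to check this (not part of the script; the function name and usage below are purely illustrative) is to sum 10^logprob over the \1-grams section of the ARPA file:

```python
import sys

def unigram_mass(arpa_path):
    """Sum the probabilities in the \\1-grams: section of an ARPA LM.

    ARPA files store log10 probabilities in the first column of each
    n-gram line, so the total mass is sum(10 ** logprob).
    """
    total, in_unigrams = 0.0, False
    with open(arpa_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line == "\\1-grams:":
                in_unigrams = True
                continue
            if in_unigrams:
                if line.startswith("\\"):  # next section (\2-grams: or \end\)
                    break
                if line:
                    total += 10 ** float(line.split()[0])
    return total

if __name__ == "__main__":
    print(unigram_mass(sys.argv[1]))
```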
My suggestion would be to treat the "scale" parameter as the probability of OOV, P(OOV), as suggested in the script. Then the following normalizations could be done:
- Normalize non-OOV unigrams so they sum to 1 - P(OOV)
- Normalize OOV unigrams so they sum to P(OOV)
That should ensure that the set of specified OOV words is treated as having a combined probability of P(OOV), while the rest of the lexicon picks up the remaining probability mass.
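A minimal sketch of that renormalization, assuming the unigrams have already been read out of the ARPA file into a dict of log10 probabilities (the function and variable names here are made up, not part of add_unigrams_arpa.pl):

```python
import math

def renormalize_unigrams(unigrams, oov_words, p_oov):
    """Rescale log10 unigram probs so the OOV words sum to p_oov and the
    in-vocabulary words sum to 1 - p_oov.

    unigrams:  dict word -> log10 probability
    oov_words: set of user-added words
    Assumes both the OOV and non-OOV sets are non-empty.
    """
    oov_mass = sum(10 ** lp for w, lp in unigrams.items() if w in oov_words)
    iv_mass = sum(10 ** lp for w, lp in unigrams.items() if w not in oov_words)
    out = {}
    for w, lp in unigrams.items():
        if w in oov_words:
            # Scale OOV words so their total mass becomes p_oov.
            out[w] = lp + math.log10(p_oov / oov_mass)
        else:
            # Scale everything else so its total mass becomes 1 - p_oov.
            out[w] = lp + math.log10((1.0 - p_oov) / iv_mass)
    return out
```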
It may also make sense to ensure that any word specified by the user that already exists in the lexicon is moved to the OOV set, so that it inherits the probability specified by the user. I actually don't know whether that's a good idea, since it will affect all backoff N-grams. So perhaps a warning is better and these words are skipped, or an option could exist to apply the user-specified probabilities to in-vocabulary words if that's really what the user wants to do.
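To make that last point concrete, here is one possible shape for the warn-and-skip behaviour with an optional override; the --apply-to-in-vocab flag is purely hypothetical and does not exist in the script:

```python
import sys

def merge_user_words(user_words, lm_vocab, apply_to_in_vocab=False):
    """Decide which user-specified words to add/override.

    user_words: dict word -> probability supplied by the user
    lm_vocab:   set of words already present in the ARPA LM
    By default, words already in the vocabulary are skipped with a warning;
    the hypothetical --apply-to-in-vocab option would force the override.
    """
    accepted = {}
    for word, prob in user_words.items():
        if word in lm_vocab and not apply_to_in_vocab:
            print(f"WARNING: '{word}' already in vocabulary; skipping "
                  f"(use --apply-to-in-vocab to override)", file=sys.stderr)
            continue
        accepted[word] = prob
    return accepted
```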