Generating Arabic Diacritics in Transcribed Text #1213
abedkhooli
started this conversation in
General
Whisper models rarely output Arabic short vowels and other diacritics, even though these symbols are in the tokenizer. This could be due to the training sets and/or to output adjustments (suppressed tokens?). The question is: does faster-whisper have parameters or options to enable, disable, or penalize such tokens? And what is the preferred approach for including diacritics in the transcription output, even if they are pronounced incorrectly? (I know the output can be post-processed to add them with a model trained on diacritized text.)
Example: the 4th token is called Damma, and the second-to-last one is called Tanween. There are 9 diacritics in total in the example.
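To make the terminology concrete, here is a minimal sketch of how the diacritics in a transcript could be located and named. It is plain Python and does not use faster-whisper. The helper names (`find_diacritics`, `strip_diacritics`) and the sample word are my own illustration, not the example referred to above. Positions are Unicode code point indices, which may not match Whisper tokenizer tokens.

```python
# Arabic diacritic (tashkeel) code points U+064B..U+0652 with common names.
ARABIC_DIACRITICS = {
    "\u064B": "Fathatan (tanween)",
    "\u064C": "Dammatan (tanween)",
    "\u064D": "Kasratan (tanween)",
    "\u064E": "Fatha",
    "\u064F": "Damma",
    "\u0650": "Kasra",
    "\u0651": "Shadda",
    "\u0652": "Sukun",
}


def find_diacritics(text):
    """Return (index, name) for every Arabic diacritic code point in text."""
    return [
        (i, ARABIC_DIACRITICS[ch])
        for i, ch in enumerate(text)
        if ch in ARABIC_DIACRITICS
    ]


def strip_diacritics(text):
    """Return text with the diacritics listed above removed."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


# Hypothetical sample: the word "kitabun" written with full diacritics
# (kaf, kasra, ta, fatha, alef, ba, dammatan).
sample = "\u0643\u0650\u062A\u064E\u0627\u0628\u064C"

for index, name in find_diacritics(sample):
    print(index, name)

print(len(find_diacritics(sample)), "diacritics found")
```

Running `find_diacritics` on each transcribed segment would give a quick measure of how often a model emits diacritics at all, and `strip_diacritics` gives the undiacritized form for comparison.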