Generating Arabic Diacritics in Transcribed Text #1213

abedkhooli · 2024-12-23T15:08:20Z

Whisper models rarely output short vowels and other phonetic marks (diacritics), even though the corresponding tokens exist in the tokenizer. This could be due to the training sets, to output adjustments (suppressed tokens?), or both. The question is: does faster-whisper have parameters or options to enable, disable, or penalize such tokens? And what is the preferred approach for getting diacritics as part of the transcription output, even when a word is pronounced incorrectly? (I know the output can be post-processed to add them using a text-trained model.)

Example: in the tokenization below, the 4th token is the Damma and the second-to-last token is the Tanween. There are 9 diacritics in total in the example.

from transformers import AutoTokenizer

model_id = "openai/whisper-large-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tokenize a diacritized sentence, then map the tokens to their vocabulary ids.
tokens = tokenizer.tokenize("ذهبتُ إلى المدرسةِ في الصّباحِ، وكان الجوُّ بارداً جدّاً.")
ids = tokenizer.convert_tokens_to_ids(tokens)
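
To check how many diacritics a given transcript actually contains, independent of any tokenizer, the marks can also be counted at the character level. This is only a minimal sketch: it assumes "diacritics" means the tashkeel marks U+064B through U+0652, and the helper names `find_diacritics` and `strip_diacritics` are hypothetical, not part of faster-whisper or transformers.

```python
import unicodedata

# Assumption: "diacritics" = tashkeel marks U+064B (FATHATAN) .. U+0652 (SUKUN).
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}


def find_diacritics(text):
    """Return (character index, Unicode name) for each tashkeel mark in text."""
    return [(i, unicodedata.name(ch)) for i, ch in enumerate(text) if ch in TASHKEEL]


def strip_diacritics(text):
    """Return text with all tashkeel marks removed."""
    return "".join(ch for ch in text if ch not in TASHKEEL)


sentence = "ذهبتُ إلى المدرسةِ في الصّباحِ، وكان الجوُّ بارداً جدّاً."
marks = find_diacritics(sentence)

print(len(marks), "diacritics found")
for index, name in marks:
    print(index, name)
```

Running the same count on a model transcript and on a diacritized reference gives a rough measure of how many marks the model dropped.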
