correct predictions to make the data less
imbalanced and retrain our homograph model in
the hope of attaining a better homograph score.
If a word can be followed by ezafe, the Ariana
Lexicon marks it with the GEN symbol. Currently,
every word in the input text is passed to the
ezafe module. We can use the information from the
lexicon to detect words that are never followed by
ezafe and skip the ezafe module for them, thereby
improving system performance.
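The lexicon-based filtering described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the lexicon layout (a dictionary from word to a set of tags), the function name `ezafe_candidates`, the transliterated toy entries, and every tag other than GEN are assumptions made for the example. Words missing from the lexicon are still sent to the ezafe module, since the lexicon says nothing about them.

```python
def ezafe_candidates(tokens, lexicon):
    """Return the indices of tokens that should be passed to the ezafe module.

    `lexicon` is a hypothetical mapping from a word to its set of tags;
    a word that may be followed by ezafe carries the "GEN" tag.
    """
    candidates = []
    for index, token in enumerate(tokens):
        tags = lexicon.get(token)
        # Keep OOV words (tags is None): without a lexicon entry we cannot
        # rule out ezafe, so the ezafe module must still decide.
        if tags is None or "GEN" in tags:
            candidates.append(index)
    return candidates


# Toy lexicon with transliterated entries (illustrative only).
toy_lexicon = {
    "ketab": {"N", "GEN"},  # a noun that can take ezafe
    "va": {"CONJ"},         # a conjunction, never followed by ezafe
}

tokens = ["ketab", "va", "daftar"]  # "daftar" is OOV in the toy lexicon
print(ezafe_candidates(tokens, toy_lexicon))
```

Only the returned positions would need a prediction from the ezafe module; all other words can be labelled "no ezafe" directly.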
Since we are working on a low-resource task,
an acoustic model using this G2P system would
not be able to detect the position of stress properly
on its own. To address this problem, we currently
use the rules explained in [16]. As an improvement,
we can add a component that uses the output of our
G2P system (the phoneme sequence) to determine
the position of stress in words, thereby helping
the acoustic model produce more natural and
humanlike speech signals. This component can
comprise a Transformer encoder block that receives
a phoneme sequence and a linear layer that
produces a 0-1 vector indicating the presence or
absence of stress for each phoneme in the middle
word.
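One possible shape for such a stress component is sketched below in PyTorch. It is an untrained illustration under several assumptions, not a description of an implemented module: the class name `StressTagger`, the learned positional embedding, all hyperparameters, the 0.5 decision threshold, and the `word_start`/`word_end` arguments that mark the middle word are hypothetical choices.

```python
import torch
import torch.nn as nn


class StressTagger(nn.Module):
    """Hypothetical stress tagger: Transformer encoder block + linear layer."""

    def __init__(self, num_phonemes, d_model=64, nhead=4, max_len=128):
        super().__init__()
        self.phoneme_emb = nn.Embedding(num_phonemes, d_model)
        # Learned positional embedding (an assumption; any scheme would do).
        self.pos_emb = nn.Embedding(max_len, d_model)
        block = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=4 * d_model,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(block, num_layers=1)
        # One logit per phoneme: stressed vs. unstressed.
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, phoneme_ids):
        # phoneme_ids: (batch, seq_len) integer phoneme indices
        positions = torch.arange(phoneme_ids.size(1), device=phoneme_ids.device)
        x = self.phoneme_emb(phoneme_ids) + self.pos_emb(positions)
        hidden = self.encoder(x)                    # (batch, seq_len, d_model)
        return self.classifier(hidden).squeeze(-1)  # (batch, seq_len) logits


def predict_stress(model, phoneme_ids, word_start, word_end):
    """Return a 0-1 vector for the phonemes in [word_start, word_end)."""
    model.eval()
    with torch.no_grad():
        logits = model(phoneme_ids)
    probs = torch.sigmoid(logits[:, word_start:word_end])
    return (probs > 0.5).long()


# Usage with random phoneme indices; the weights are untrained, so the
# predicted values are meaningless and only the shapes are of interest.
model = StressTagger(num_phonemes=40)
phoneme_ids = torch.randint(0, 40, (2, 12))  # 2 sequences of 12 phonemes
stress = predict_stress(model, phoneme_ids, word_start=4, word_end=8)
print(stress.shape)
```

Training such a model would typically use a per-phoneme binary cross-entropy loss (for example `nn.BCEWithLogitsLoss`) on the logits of the middle word, with the surrounding words serving only as context.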
Needless to say, using pre-trained models
and/or having access to more resources would
improve the system’s performance. In our case,
for instance, a pre-trained BERT model would
improve our system in terms of ezafe recognition.
7. Conclusions
This paper presents a sequence-level multi-module
framework for G2P conversion of Persian text.
The system comprises a GRU-based model
combined with an attention layer and a
Transformer-based model, which together tackle
the OOV, homograph, and ezafe problems. The
models were evaluated on the Bijankhan corpus in
terms of word-level accuracy. Moreover, we
introduce a new evaluation metric, the homograph
score, for assessing how well G2P tools
disambiguate homograph pronunciations.
Acknowledgements
This research was supported by Asr Gooyesh
Pardaz Co. We thank Khosro Hosseinzadeh and
Farokh Kakaei, who provided insight and
expertise that assisted the research.
References
[1] Avanesov, Ruben Ivanovich. Modern Russian
Stress: The Commonwealth and International
Library of Science, Technology, Engineering and
Liberal Studies: Pergamon Oxford Russian
Series. Elsevier, 2015.
[2] Juzová, Markéta, Daniel Tihelka, and Jakub Vít.
“Unified Language-Independent DNN-Based
G2P Converter.” In INTERSPEECH, pp. 2085-
2089. 2019.
[3] Veiga, Arlindo, Sara Candeias, and Fernando
Perdigão. “Generating a pronunciation dictionary
for European Portuguese using a joint-sequence
model with embedded stress
assignment.” Journal of the Brazilian Computer
Society 19, no. 2 (2013): 127-134.
[4] Chaurasia, Mousmi Ajay. “Cloud Computing:
Challenges of Security Issues.” International
Journal of Current Engineering and Scientific
Research (IJCESR) 3, no. 8 (2016): 23-29.
ISSN (print): 2393-8374, (online): 2394-0697.
[5] Yolchuyeva, Sevinj, Géza Németh, and Bálint
Gyires-Tóth. “Transformer based grapheme-to-
phoneme conversion.” arXiv preprint
arXiv:2004.06338 (2020).
[6] Sun, Hao, Xu Tan, Jun-Wei Gan, Hongzhi Liu,
Sheng Zhao, Tao Qin, and Tie-Yan Liu. “Token-
level ensemble distillation for grapheme-to-
phoneme conversion.” arXiv preprint
arXiv:1904.03446 (2019).
[7] Meersman, Robrecht, “Grafeem-naar-
foneemconversie door middel van.” Master’s
Thesis, Ghent University, 2019, pp. 7-13,
https://www.scriptieprijs.be/sites/default/files/thesis/2019-07/thesis.pdf.
[8] Novak, Josef R., Nobuaki Minematsu, and
Keikichi Hirose. “WFST-based grapheme-to-
phoneme conversion: Open source tools for
alignment, model-building and decoding.”
In Proceedings of the 10th International
Workshop on Finite State Methods and Natural
Language Processing, pp. 45-49. 2012.
[9] Dai, Dongyang, Zhiyong Wu, Shiyin Kang, Xixin
Wu, Jia Jia, Dan Su, Dong Yu, and Helen Meng.
“Disambiguation of Chinese Polyphones in an
End-to-End Framework with Semantic Features