Multi-Module G2P Converter for Persian Focusing on Relations
between Words
Mahdi Rezaei, Negar Nayeri, Saeed Farzi, Hossein Sameti
K. N. Toosi University of Technology, Tehran, Iran
Sharif University of Technology, Tehran, Iran
Abstract
In this paper, we investigate the application of
end-to-end and multi-module frameworks for
G2P conversion for the Persian language. The
results demonstrate that our proposed multi-
module G2P system outperforms our end-to-end
systems in terms of accuracy and speed. The
system consists of a pronunciation dictionary as
our look-up table, along with separate models to
handle homographs, OOVs and ezafe in Persian
created using GRU and Transformer
architectures. The system is sequence-level rather
than word-level, which allows it to effectively
capture the unwritten relations between words
(cross-word information) necessary for
homograph disambiguation and ezafe recognition
without the need for any pre-processing. After
evaluation, our system achieved a 94.48% word-
level accuracy, outperforming the previous G2P
systems for Persian.
Index Terms: Grapheme-to-Phoneme (G2P), Text-to-
Speech (TTS), Deep Learning (DL), Transformers,
homograph disambiguation, cross-word information, low-
resource language, attention mechanism
1. Introduction
Before building the grapheme-to-phoneme
(G2P) system presented in this research, we
attempted to train a Tacotron model to receive
grapheme sequences and return speech directly.
However, the model achieved poor results because
Persian is a low-resource language; hence, we
decided to build a G2P system instead. G2P systems aim
to convert a grapheme (letter) sequence into its
pronunciation sequence, and are an essential
component of text-to-speech (TTS) and speech
recognition systems for any language lacking
consistent pronunciation rules.
A good G2P system must address the issues of
out-of-vocabulary (OOV) words and cross-word
relations. OOV words are those which are not
present in the lexicon, meaning they were not
seen during model training. In the case of G2P,
the lexicon is a dictionary consisting of
graphemes and their respective phonemes. As for
cross-word relations, a Persian G2P task is
mainly concerned with homographs and ezafe
constructions.
Homographs are words that are represented by
the same grapheme, but are pronounced
differently. Homograph detection in Persian is
much more complicated than in English since
there are more homographs in Persian vocabulary
and more possible pronunciations for each one.
The difference in the part-of-speech (POS) tags
of homographs can sometimes help disambiguate
the pronunciation of these words; however, it is
not an inclusive solution since some homographs
also have the same POS tag. For instance, the
Persian grapheme “می‌برد” can carry three
meanings: “is taking”, “is cutting” and “was
taking”, all verbs, respectively pronounced
[mi:bæræd], [mi:boræd] and [mi:bord]. Here, the
difference in pronunciation is because, in many
cases in Persian, vowels such as [æ], [e] and [ɔ]
are represented by diacritics, which are often
absent in Persian text. This absence of
representation is among the reasons why there is
usually no one-to-one relationship between
graphemes and phonemes in Persian.
One of the significant tasks while processing
Persian text is ezafe recognition. Ezafe links two
words and creates compounds such as nominal
and possessive constructions. However, ezafe does
not occur every time a noun is followed by an adjective or
another noun, thus making context information a
necessary factor in ezafe recognition. Table 1
shows two example sentences along with their
pronunciations and English translations. The
highlighted words in Example 1 have an
unrepresented ezafe in between that links them,
whereas this is not the case for Example 2.
Table 1. Impact of context on ezafe formation.
Example | Phonetics | English Translation
1. خودکار قرمز را برداشتم | [xodkɑ:re ʁermez rɑ: bærdɑ:ʃtæm] | I picked up the red pen.
2. رنگ آن خودکار قرمز است | [rænge ʌn xodkɑ:r ʁermez æst] | The color of that pen is red.
Ezafe is often absent from Persian text, further
complicating tasks such as G2P; however, if
represented explicitly, it appears with the
following graphemes: a) with the diacritic ِ,
called kasre and pronounced [e], used when the
first word in the compound ends in a consonant,
and b) with ی or ٔ, both pronounced [je],
used when the first word in the compound ends in
a vowel. Table 2 offers examples of ezafe
representations followed by their pronunciation
and English translation.
Table 2. Examples of ezafe representations in Persian text.
Ezafe | Compound | Phonetics | English Translation
ِ (kasre) | دفترِ آبی | [dæftære ɑ:bi] | blue notebook
ی | خانه‌ی ما | [xɑ:neje mɑ:] | our house
ٔ | خانهٔ ما | [xɑ:neje mɑ:] | our house
Considering the above-mentioned problems,
we questioned the suitability of an end-to-end
approach for our G2P system. After several experiments,
we concluded that assigning separate modules to
each problem would increase the accuracy as well
as the performance of the system since the
modules can run simultaneously. Moreover, by
using a pronunciation dictionary, we could reach
100% accuracy in predicting the pronunciation of
many grapheme sequences. It is worth
mentioning that we used no pre-processing in
terms of splitting words into their composing
morphemes.
Cross-word relations exist in different
languages in various forms and if not represented
in text, can lead to complexities in G2P and
incorrect pronunciations. For example, in Arabic,
different grammatical relations, POS of words
and definiteness of nouns can assign certain
diacritics to the last character of words in
sentences. There are more diacritics in Arabic
than in Persian; however, as is the case in Persian,
they are often absent from the text. Additionally,
in languages such as Russian and Thai, a change
in the position of stress in words can change the
meaning or produce a meaningless sentence [1],
[2].
In Persian text, many letters have an initial as
well as a final form. For example, بـ and هـ
are the initial forms of the letters be and he, while
their final forms are represented as ـب and ـه.
The zero-width non-joiner (ZWNJ) character
(with the Unicode symbol U+200C) is placed
between a letter in its final form and another in its
initial form. The token “کتاب‌هایشان”, meaning
“their books”, contains a ZWNJ between the letters
ب and ه. As seen in this example, the ZWNJ
character can produce syntactically-loaded
tokens. We can add such tokens to the Persian
training lexicon to enrich the pronunciation
dictionary and increase G2P performance.
In this paper, we review related works and
state-of-the-art approaches addressing the
problem of G2P conversion (section 2) and
introduce our proposed method to solve this
problem in Persian (section 3). We then compare
our experiments with previous works (section 4)
and evaluate and compare the performance of our
models (section 5). Finally, we propose
recommendations for future work (section 6) and
draw conclusions (section 7).
2. Related Works
The main solutions proposed to solve the G2P
problem include rule-based models, joint-
sequence models [3], [4], deep neural networks,
and, more recently, Transformer-based models.
Joint-sequence models create graphonemes
by generating alignments between graphemes
and phonemes, and then build an n-gram model
that returns the probability of a graphoneme
given the preceding graphonemes. [3]
constructs a joint-sequence G2P converter for
Portuguese with embedded rules for stressed
vowel assignment, using a pronunciation
dictionary containing the most frequent words
alongside a list of homographs.
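To make the joint-sequence idea concrete, the Python sketch below estimates bigram graphoneme probabilities. The one-to-one grapheme-phoneme alignment and the absence of smoothing are our simplifying assumptions, not properties of the cited systems, which learn many-to-many alignments (typically with EM) and smooth the n-gram estimates.

# A minimal sketch of a joint-sequence (graphoneme) bigram model,
# assuming pre-aligned one-to-one grapheme/phoneme pairs.
from collections import Counter

def train_bigram(corpus):
    # corpus: list of (grapheme sequence, phoneme sequence) pairs.
    unigrams, bigrams = Counter(), Counter()
    for g_seq, p_seq in corpus:
        units = [("<s>", "<s>")] + list(zip(g_seq, p_seq))  # graphonemes
        for prev, cur in zip(units, units[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    # Maximum-likelihood estimate of P(cur | prev); real systems smooth.
    return {(p, c): n / unigrams[p] for (p, c), n in bigrams.items()}

probs = train_bigram([(["b", "a", "d"], ["b", "æ", "d"])])
print(probs[(("b", "b"), ("a", "æ"))])  # 1.0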
[5] trained three sets of Transformers with
different structures and numbers of encoder and
decoder layers. The fewer parameters in the
proposed method compared to previous DNN
models with convolutional and recurrent layers
led to faster training, while achieving a similar
accuracy. The models were evaluated on phoneme
error rate (PER) and word error rate (WER). [6]
experimented with sequence- and token-
level knowledge distillation using single and
ensemble teacher models to label words with
their phoneme sequences. Their proposed
Transformer-based model outperformed the CNN
and RNN-based ones for the converter model.
This model used token-level ensemble distillation
to increase the task’s accuracy and improve WER
and PER. Knowledge transfer was also
performed to achieve lightweight models for
better online deployment.
The models mentioned above receive words in
isolation without considering the context
information, which would significantly decrease
the performance for languages with complex
cross-word relations such as Persian. Moreover,
G2P models have often been evaluated on
CMUDict dataset for US English [5], [6], [7], [8]
and NetTalk [5], [8], making them directly
comparable. This comparison is not applicable to
languages other than English. In the following,
we review [9], [2] and [7] as examples of research
that pay attention to cross-word information.
[9] built an end-to-end framework to receive
raw Chinese text and perform polyphone
disambiguation with the help of a pre-trained
BERT. The contextual information helped the
model extract semantic features and effectively
disambiguate polyphones. [2] created a unified-
DNN-based G2P converter for languages with
regular and/or irregular pronunciation, which was
verified for English, Russian and Czech. The
proposed models consisted of an embedding
layer, a bi-LSTM layer, another
LSTM/convolutional layer, a bi-LSTM layer for
the encoder, and an LSTM decoder followed by a
linear projection with softmax activation. The
model with the LSTM variation in the encoder
outperformed the one with the convolutional
layer. This architecture considers cross-word and
within-word relations and is basically the same
architecture we used for one of our proposed
systems in the present paper (section 4). [7]
developed three neural networks with recurrent,
convolutional and Transformer architectures,
among which the Transformer obtained the best
results for both word and phoneme error rates in
English, Croatian and Turkish. The model
handles the homograph problem better when
trained on the sentence level rather than word
level. As opposed to previous G2P models, the
proposed Transformer model is more robust
against longer graphemes. The paper mentions
that rule-based models complement neural
networks to better overcome the OOV problem
and return lower error rates.
[10], [11] and [12] acknowledge Persian as a
challenging language for the G2P conversion task
due to the absence of vowel diacritics from the
text. [10] develops a BRNN-LSTM accepting 10-
sentence sequences derived from FarsDat and
Bijankhan corpora. The model achieved a 98%
accuracy on the phoneme level. [10] briefly
mentions the multi-module approach (which is
thoroughly examined in the current article) to
tackle homograph and ezafe problems in Persian,
but declines to adopt it, arguing that the
cumulative errors from the multiple modules
would negatively affect the overall accuracy.
[11] builds a multi-layer
Perceptron (MLP) to solve this problem and
defines a variable called Vowel State, which
determines whether or not a letter contains a
vowel and the type of vowel. The model gained a
97.34% accuracy on the phoneme level when
tested on 2024 common Persian words. [12]
proposed a G2P system containing a rule-based
layer and two MLP layers to determine
pronunciations and gemination, which obtained
an 87% word-to-phoneme and a 61% letter-to-
phoneme accuracy on a 3000-word test set.
Unlike [10], however, [11] and [12] have not addressed the
issues of ezafe and homographs, i.e., cross-word
relations in Persian.
3. Proposed Work
In our proposed architecture, we take a multi-
module approach to the problem and allocate
OOV words, ezafe recognition and homographs
to separate modules. This multi-module system
also uses a dictionary containing words and their
phoneme transcription. After normalizing and
tokenizing the data, we loop through every word
in the text sequence. If the word is in the
vocabulary, i.e., within the dictionary, its
pronunciation is extracted from the dictionary. If
the word is within our homograph list, it is passed
to the homograph model along with the two
nearest preceding and following words;
otherwise, it will be passed to the OOV model. It
is worth mentioning that all words go through
ezafe recognition to determine whether they
should be followed by an ezafe. The ezafe model
receives each word along with its preceding and
following words, in the same way as the
homograph model. Lastly, the results are
concatenated and the G2P system generates the
phoneme sequence. The models are described in
the following.
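The dispatch logic described above can be summarized in the following Python sketch. The function and variable names are ours, and the three model objects are placeholders for the modules described in 3.1-3.3, not the actual implementation.

# A minimal sketch of the multi-module dispatch loop.
def g2p(tokens, lexicon, homographs, homograph_model, oov_model, ezafe_model):
    padded = ["<pad>", "<pad>"] + tokens + ["<pad>", "<pad>"]
    phonemes = []
    for i, word in enumerate(tokens):
        context = padded[i:i + 5]            # the word plus two neighbours per side
        if word in homographs:
            pron = homograph_model(context)  # needs cross-word information
        elif word in lexicon:
            pron = lexicon[word]             # direct dictionary look-up
        else:
            pron = oov_model(word)           # word-level seq2seq prediction
        if ezafe_model(context):             # every word is checked for ezafe
            pron += "e"                      # simplified: [e]; [je] after vowels
        phonemes.append(pron)
    return " ".join(phonemes)

# Toy usage with stand-in models:
print(g2p(["ketab"], {"ketab": "ketɑ:b"}, set(),
          None, lambda w: w, lambda ctx: True))  # ketɑ:be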
Figure 1. Transformer architecture in OOV module [13].
3.1. OOV Module
OOVs or out-of-vocabulary words are those
which did not exist in the model’s training set.
Predicting the pronunciation of OOVs is a
straightforward seq2seq task assigned to a Transformer
model. Figure 1 [13] shows the architecture of the Transformer.
The graphemes (characters) of the input words
are fed into the model. Since this model is on the
word level and is not concerned with cross-word
relations, the phoneme sequence output does not
contain any ezafe pronunciation at its end, which
is also the case for the homograph model.
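As an illustration only, a character-level Transformer of this kind can be sketched in PyTorch as follows. The hyper-parameters are arbitrary and positional encodings are omitted for brevity, so this is not the configuration used in our experiments.

# A minimal character-level Transformer seq2seq sketch for OOV G2P.
import torch
import torch.nn as nn

class OOVTransformer(nn.Module):
    def __init__(self, n_graphemes, n_phonemes, d_model=256):
        super().__init__()
        self.src_emb = nn.Embedding(n_graphemes, d_model)
        self.tgt_emb = nn.Embedding(n_phonemes, d_model)
        # Positional encodings are omitted here for brevity.
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=3,
                                          num_decoder_layers=3,
                                          batch_first=True)
        self.out = nn.Linear(d_model, n_phonemes)

    def forward(self, src, tgt):
        # src: (batch, graphemes of one word); tgt: shifted-right phonemes.
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.transformer(self.src_emb(src), self.tgt_emb(tgt), tgt_mask=mask)
        return self.out(h)                   # logits over the phoneme inventory

model = OOVTransformer(n_graphemes=40, n_phonemes=30)
logits = model(torch.randint(0, 40, (2, 8)), torch.randint(0, 30, (2, 9)))
print(logits.shape)                          # torch.Size([2, 9, 30])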
Figure 2. The structure of our bidirectional GRU neural network in the homograph model [14].
3.2. Homograph Module
After normalization and tokenization, similar
to the architecture of the G2P converter used in
[2], the text data input is fed into the homograph
model in 5-gram sequences. The embedding layer
performs vectorization for each character in each
of the five words (tokens), and padding is applied
for words shorter than the maximum length of
graphemes (characters) in each batch. Five
bidirectional GRU models extract within-word
relations separately for each word (gram). In the
next step, we create five word representation
vectors using the hidden activations of those five
GRU layers. We build the word representations
in the same way encoder states are passed to the
decoder in machine translation. The word
representations are received by another
bidirectional GRU layer which is responsible for
extracting cross-word information in terms of
boundary and ezafe. Finally, the within-word and
cross-word information extracted by the models
regarding the middle word are concatenated and
fed into a GRU decoder. Here, the cross-word
information is in fact the meaning and the
corresponding pronunciation of the homograph.
Figure 2 derived from [14] shows the structure of
the homograph model.
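The following PyTorch sketch mirrors the hierarchy described above: five character-level bi-GRUs (sharing weights here for brevity), a cross-word bi-GRU, and a GRU decoder over the concatenated middle-word representations. The dimensions are illustrative, and feeding the same concatenated vector at every decoder step is a simplification of the actual decoder.

# A minimal sketch of the hierarchical bi-GRU homograph model (Figure 2).
import torch
import torch.nn as nn

class HomographModel(nn.Module):
    def __init__(self, n_chars, n_phonemes, d=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d)
        self.word_gru = nn.GRU(d, d, bidirectional=True, batch_first=True)
        self.ctx_gru = nn.GRU(2 * d, d, bidirectional=True, batch_first=True)
        self.decoder = nn.GRUCell(4 * d, 2 * d)   # within- + cross-word input
        self.out = nn.Linear(2 * d, n_phonemes)

    def forward(self, five_words, steps=10):
        # five_words: (batch, 5, max_chars) padded character indices.
        b, n, L = five_words.shape
        chars = self.char_emb(five_words.view(b * n, L))
        _, h = self.word_gru(chars)                      # (2, b*n, d)
        word_vecs = h.transpose(0, 1).reshape(b, n, -1)  # (b, 5, 2d) word vectors
        ctx, _ = self.ctx_gru(word_vecs)                 # cross-word information
        middle = torch.cat([word_vecs[:, 2], ctx[:, 2]], dim=-1)
        state = middle.new_zeros(b, self.decoder.hidden_size)
        logits = []
        for _ in range(steps):                           # unrolled greedy decoder
            state = self.decoder(middle, state)
            logits.append(self.out(state))
        return torch.stack(logits, dim=1)                # (b, steps, n_phonemes)

model = HomographModel(n_chars=40, n_phonemes=30)
print(model(torch.randint(0, 40, (2, 5, 7))).shape)      # torch.Size([2, 10, 30])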
A homograph dictionary was extracted from
the Ariana lexicon, which is introduced in 4.1. In
case the model generates a phoneme sequence
that is not among the pronunciations in the
homograph dictionary, the sequence is corrected
using a minimum edit distance algorithm, and the
pronunciation with the highest score is selected.
In the case where there are equal scores, one
pronunciation is selected at random.
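A minimal sketch of this correction step follows, assuming pronunciations are compared symbol by symbol with the standard Levenshtein distance; the scoring details of our implementation may differ.

# Snap a generated sequence to the closest licensed pronunciation,
# breaking ties at random, as described above.
import random

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance (rolling array).
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (x != y))
    return dp[-1]

def correct(generated, licensed_prons):
    if generated in licensed_prons:
        return generated
    dists = [(edit_distance(generated, p), p) for p in licensed_prons]
    best = min(d for d, _ in dists)
    return random.choice([p for d, p in dists if d == best])

print(correct("mi:bored", ["mi:bæræd", "mi:boræd", "mi:bord"]))  # mi:bord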
Here, it is worth mentioning the case of multi-
pronunciation words, which are words with more
than one correct pronunciation all having the
same meaning, unlike homographs. These words
are passed to our pronunciation dictionary, which
only contains their most common pronunciation.
For instance, the word “آسمان”, meaning sky, has
the two possible pronunciations “aseman” and
“asman”, whereas the pronunciation dictionary
only contains the former.
3.3. Ezafe Module
In one experiment, the ezafe module uses a
deep neural network to predict whether the
middle word in a 5-gram sequence has ezafe at its
end. Very similar to the architecture in the
homograph model, we create five word
representation vectors, and then, thanks to a
bidirectional GRU, cross-word information is
extracted. This information output is passed to a
linear layer rather than a decoder, which will
return a 0-1 vector predicting the presence or
absence of ezafe in the middle word in the 5-
gram. This model will be called ezafe model Ⅰ.
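A sketch of ezafe model Ⅰ is given below, under the same illustrative PyTorch assumptions as the homograph sketch in 3.2 (sizes and names are ours).

# A minimal sketch of ezafe model I: character-level word encodings,
# a cross-word bi-GRU, and a linear classifier for the middle word.
import torch
import torch.nn as nn

class EzafeModelI(nn.Module):
    def __init__(self, n_chars, d=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d)
        self.word_gru = nn.GRU(d, d, bidirectional=True, batch_first=True)
        self.ctx_gru = nn.GRU(2 * d, d, bidirectional=True, batch_first=True)
        self.clf = nn.Linear(2 * d, 2)   # ezafe present / absent

    def forward(self, five_words):
        b, n, L = five_words.shape
        chars = self.char_emb(five_words.view(b * n, L))
        _, h = self.word_gru(chars)
        word_vecs = h.transpose(0, 1).reshape(b, n, -1)  # (b, 5, 2d)
        ctx, _ = self.ctx_gru(word_vecs)                 # cross-word information
        return self.clf(ctx[:, 2])       # logits for the middle word

model = EzafeModelI(n_chars=40)
print(model(torch.randint(0, 40, (2, 5, 7))).shape)  # torch.Size([2, 2])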
In another experiment, we attempted to create
the ezafe model on the word level (ezafe model
Ⅱ). The embedding layer performed vectorization
for each token in the 5-gram, leading to a large
embedding table. A bidirectional GRU learned
the cross-word information, which was passed to
a linear layer and 0-1 vectors were returned,
indicating the presence of ezafe in the middle
word.
While labeling the training data for this model,
we paid attention to pairs of grapheme and
phoneme sequences due to the similar
pronunciation of definite words and ezafe forms
in Persian. For instance, مداده and مدادِ are
both pronounced “medade”, while the former
means “the pencil” and the latter is “pencil”
followed by ezafe.
4. Experiments and Results
In this section, after introducing the datasets used
to train and evaluate the models, we introduce
other state-of-the-art architectures for later
comparison in section 5. These architectures use
a single end-to-end model which simultaneously
addresses issues of OOVs, ezafe and homographs
in the same module.
4.1. Datasets
We used a portion of the Bijankhan corpus [15],
which is the most well-known corpus for Persian
G2P and consists of formal sentences, to train and
evaluate the G2P system. Our data consists of
42,540 sentences with a total of one million
words and their phonemic transcriptions. Our
phonetic experts corrected the existing errors in
the transcriptions. Furthermore, phonemes are
represented with only one symbol. For example,
the word “شکسته” (shekaste), meaning broken, is
transcribed as “$ek/ste”. Here, the advantage of
using $ over sh is that the model makes one
prediction for this single phoneme rather than
two, which ultimately affects the accuracy and
efficiency of the model.
The data was randomly shuffled and split into
train, validation and test sets with proportions of
80%, 5% and 15%, respectively, and converted into
the input format required by each model. The
same data was used for all modules and systems,
except for OOV and homograph modules which
used a subset of the original test set. This is why
we used a larger portion of data for the test set
compared to the validation set (15% vs. 5% of the
original data).
While training the OOV model, we used all
occurrences of all tokens so that the model would
gain an understanding of the frequency and
importance of words. We did not exclude
homographs from the training set for the OOV
model, and the homograph model was also
trained on non-homographs as middle grams.
This way, the models are trained on more data,
often of the same roots, and thus, generate more
phoneme sequences during training time.
The Ariana lexicon was used to tag the words
in the Bijankhan corpus for the purpose of
training the ezafe model (3.3). This lexicon
consists of word entries along with all possible
phoneme transcriptions and POS tags in both
simple and ezafe forms. The Ariana lexicon was
also used to create our look-up table which is a
pronunciation dictionary and a list of homograph
words. Figure 3 shows an example of word
entries in the Ariana lexicon. The symbol GEN
refers to the presence of ezafe in the phonemic
transcription. Moreover, in Persian, some
syntactic forms are attached to words with zero-
width non-joiner (ZWNJ) character and form a
single token, as in “کتاب‌هایشان”, which means
“their books”. This lexicon considers a separate
entry for these syntactically loaded tokens.
Figure 3. An example of entries in the Ariana lexicon.
4.2. Other State-of-the-Art Techniques
Based on the G2P converter developed in [2] and
[14] and similar to the homograph model in 3.2,
we used GRU models to learn within-word and
cross-word information. The information about
the middle word is then concatenated and fed into
a single GRU decoder. This DNN-based model
produced relatively poor results, so we do not
report its performance here. The
other G2P system differed from this DNN-based
system in using the attention mechanism in the
decoder.
Similar to the end-to-end Transformer-based
G2P converter with sentence-level inputs in [7],
our Transformer-based system used an end-to-end
model which received 5-gram inputs on the
character level. The sequence
characters were separated using # symbols as
border characters, and the middle word was
within parentheses. The system returned the
phoneme sequence of the middle word as output
(Figure 4). For instance, for the example input
below (A), our Transformer-based system would
return the output B, which is the phoneme
sequence of the word “غذا”, meaning food, in
ezafe form.
A: 
B: q / z a y e
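The exact spacing and delimiting are not fully specified here, but the input construction can be sketched as follows, with “#” as the word border and parentheses around the middle word; the formatting details in this sketch are assumptions.

# A minimal sketch of the 5-gram, character-level input format.
def format_5gram(words):
    assert len(words) == 5
    parts = []
    for i, w in enumerate(words):
        chars = " ".join(w)                       # split word into characters
        parts.append("( " + chars + " )" if i == 2 else chars)
    return " # ".join(parts)                      # '#' marks word borders

print(format_5gram(["in", "the", "blue", "house", "now"]))
# i n # t h e # ( b l u e ) # h o u s e # n o w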
Figure 4. The structure of our end-to-end Transformer-
based G2P converter.
5. Results
We used accuracy on the word level, which
calculates the percentage of words whose
predicted phoneme sequences exactly match the
reference phoneme sequences over the total
number of words. Table 3 shows a comparison of
the evaluation results for our experiments. In
what follows, we present the performance results
of the OOV, ezafe and homograph modules in our
G2P system.
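For clarity, word-level accuracy as used in this section reduces to the following exact-match computation (the toy data is illustrative):

# Word-level accuracy: exact match between predicted and reference
# phoneme sequences, averaged over all words.
def word_level_accuracy(predicted, reference):
    assert len(predicted) == len(reference)
    correct = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * correct / len(reference)

print(word_level_accuracy(["b æ d", "mi:bord"], ["b æ d", "mi:boræd"]))  # 50.0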
5.1. OOV Model Results
The performance of the OOV model was
evaluated on a subset extracted from the original
test data with no overlap with the train data. This
model achieved a 74.41% accuracy on the word
level.
Table 3. Comparison of word-level accuracy between G2P systems.
System | Word-Level Accuracy
Our multi-module system | 94.48%
DNN-based + attention [2], [14] | 87.3%
Transformer-based [7] | 91.04%
5.2. Ezafe Model Results
Table 4 indicates a performance comparison
between our ezafe models. Here, we only report
the word-level accuracy (the number of words
correctly labeled with ezafe over the total
number of words) since ezafe occurrence was
balanced throughout the original test set (unlike
the evaluation process carried out in 5.3 for our
homograph model). The poor performance of
Model Ⅱ, the word-level model, is due to the fact
that we are working with low resources; hence the
model has few or zero encounters with many
words. Here, the character-level approach can
help increase performance through character
embeddings. Model Ⅲ refers to the ezafe prediction
conducted by our Transformer-based system.
Table 4. Comparison of accuracy between ezafe models.
Model Name | Architecture | Accuracy
Model Ⅰ | Character-level GRU-based | 97.14%
Model Ⅱ | Word-level GRU-based | 82.41%
Model Ⅲ | Component of end-to-end Transformer | 94.32%
5.3. Homograph Model Results
The homograph model was evaluated on a subset
extracted from the original test set where the
middle gram was a homograph, and achieved a
91.38% word-level accuracy in predicting the
phoneme sequence for homographs. However,
due to the uneven distribution of homographs
throughout our data, accuracy is not a good
measure of the performance of this
model. Therefore, using the concept of sensitivity
and specificity in statistics, we propose a measure
called homograph score, which ranges from 0 to
1 with 1 being the best score, and evaluates a
model’s performance in predicting the correct
phoneme sequence associated with homographs.
S_j = \frac{1}{n} \sum_{i=1}^{n} \frac{\text{correct}(P_i)}{\text{appearances}(P_i)}   (1)

\text{homograph score} = \frac{1}{C} \sum_{j=1}^{C} S_j   (2)

(1) calculates S_j: the number of correct calls for
each pronunciation P_i is divided by the total
appearances of P_i in the test dataset, and the sum
over all pronunciations is divided by n, the number
of possible pronunciations of homograph j. (2)
computes (1) for every unique homograph in the
test set and divides the summation by the total
number of unique homographs in the test set (C),
resulting in the homograph score. Our homograph
model attained a homograph score of 0.7348.
6. Future Work
After the service is released for customer use, we
can log the words categorized as OOV and add
their phoneme sequences to our look-up table.
Homographs often have less frequent variants.
To provide the model with more training
examples of the less frequent word(s), we can run
our homograph model on raw text and corpora
and manually check the predictions that assume the
less frequent pronunciations. We can then use the
correct predictions to make the data less
imbalanced and retrain our homograph model in
the hope of attaining a better homograph score.
In case a word can be followed by ezafe, it is
shown in the Ariana Lexicon with the GEN
symbol. Currently, every word in the input text is
passed to the ezafe module. We can use the
information from the lexicon to detect words that
are never followed by ezafe and not pass them to
the ezafe module, thereby increasing system
performance.
Since we are working on a low-resource task,
an acoustic model using this G2P system would
not be able to detect the position of stress properly
on its own. To solve this problem, we are
currently using the rules explained in [16], but for
improvement, we can add a component that uses
the output of our G2P system (the phoneme
sequence) to determine the position of stress in
words, thereby helping the acoustic model
produce more natural and humanlike speech
signals. This component can comprise a
Transformer encoder block to receive a phoneme
sequence and a linear layer to produce 0-1 vectors
indicating the presence or absence of stress for
the phonemes in the middle word.
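As a design sketch only (this component is future work and has not been built), such a stress tagger could look as follows in PyTorch, with illustrative sizes and names of our own choosing.

# A sketch of the proposed stress component: a Transformer encoder over
# the phoneme sequence and a linear layer emitting a 0-1 stress label
# per phoneme.
import torch
import torch.nn as nn

class StressTagger(nn.Module):
    def __init__(self, n_phonemes, d=128):
        super().__init__()
        self.emb = nn.Embedding(n_phonemes, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.clf = nn.Linear(d, 2)   # stressed / unstressed per phoneme

    def forward(self, phonemes):
        # phonemes: (batch, seq_len) phoneme indices.
        return self.clf(self.encoder(self.emb(phonemes)))

model = StressTagger(n_phonemes=30)
print(model(torch.randint(0, 30, (2, 12))).shape)  # torch.Size([2, 12, 2])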
Needless to say, using pre-trained models
and/or having access to sufficient resources
would affect the system’s performance. In our
case, for instance, a pre-trained BERT would
improve our system in terms of ezafe recognition.
7. Conclusions
This paper presents a sequence-level multi-
module framework for G2P conversion of Persian
text. The system comprises a GRU-based
model combined with an attention layer, and a
Transformer-based model to tackle OOV,
homograph and ezafe problems. The models were
evaluated using the Bijankhan corpus in terms of
word-level accuracy. Moreover, we introduce a
new evaluation metric called homograph score
for assessing homograph pronunciation
disambiguation in G2P tools.
Acknowledgements
This research was supported by Asr Gooyesh
Pardaz Co. We thank Khosro Hosseinzadeh and
Farokh Kakaei who provided insight and
expertise that assisted the research.
References
[1] Avanesov, Ruben Ivanovich. Modern Russian
Stress: The Commonwealth and International
Library of Science, Technology, Engineering and
Liberal Studies: Pergamon Oxford Russian
Series. Elsevier, 2015.
[2] Juzová, Markéta, Daniel Tihelka, and Jakub Vít.
“Unified Language-Independent DNN-Based
G2P Converter.” In INTERSPEECH, pp. 2085-
2089. 2019.
[3] Veiga, Arlindo, Sara Candeias, and Fernando
Perdigão. “Generating a pronunciation dictionary
for European Portuguese using a joint-sequence
model with embedded stress
assignment.” Journal of the Brazilian Computer
Society 19, no. 2 (2013): 127-134.
[4] Chaurasia, Mousmi Ajay. “Cloud Computing:
Challenges of Security Issues.” International
Journal of Current Engineering and Scientific
Research (IJCESR) 3, no. 8 (2016): 23-29.
[5] Yolchuyeva, Sevinj, Géza Németh, and Bálint
Gyires-Tóth. “Transformer based grapheme-to-
phoneme conversion.” arXiv preprint
arXiv:2004.06338 (2020).
[6] Sun, Hao, Xu Tan, Jun-Wei Gan, Hongzhi Liu,
Sheng Zhao, Tao Qin, and Tie-Yan Liu. “Token-
level ensemble distillation for grapheme-to-
phoneme conversion.” arXiv preprint
arXiv:1904.03446 (2019).
[7] Meersman, Robrecht, “Grafeem-naar-
foneemconversie door middel van.” Master’s
Thesis, Ghent University, 2019, pp. 7-13,
https://www.scriptieprijs.be/sites/default/files/the
sis/2019-07/thesis.pdf.
[8] Novak, Josef R., Nobuaki Minematsu, and
Keikichi Hirose. “WFST-based grapheme-to-
phoneme conversion: Open source tools for
alignment, model-building and decoding.”
In Proceedings of the 10th International
Workshop on Finite State Methods and Natural
Language Processing, pp. 45-49. 2012.
[9] Dai, Dongyang, Zhiyong Wu, Shiyin Kang, Xixin
Wu, Jia Jia, Dan Su, Dong Yu, and Helen Meng.
“Disambiguation of Chinese Polyphones in an
End-to-End Framework with Semantic Features
Extracted by Pre-Trained BERT.” In Interspeech,
pp. 2090-2094. 2019.
[10] Behbahani, Yasser Mohseni, Bagher Babaali, and
Mussa Turdalyuly. “Persian sentences to
phoneme sequences conversion based on
recurrent neural networks.” Open Computer
Science 6, no. 1 (2016): 219-225.
[11] Rasekh, Ehsan, and Mohammad Eshghi. “An
Efficient Network for Farsi Text to Speech
Conversion Using Vowel State.” In TENCON
2006-2006 IEEE Region 10 Conference, pp. 1-4.
IEEE, 2006.
[12] Namnabat, M., and M. M. Homayounpour. “A
letter to sound system for Farsi Language using
neural networks.” In 2006 8th international
Conference on Signal Processing, vol. 1. IEEE,
2006.
[13] Vaswani, Ashish, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. "Attention is
all you need." Advances in neural information
processing systems 30 (2017).
[14] Jůzová, Markéta, and Jakub Vít. "Using Auto-
Encoder BiLSTM Neural Network for Czech
Grapheme-to-Phoneme Conversion."
In International Conference on Text, Speech, and
Dialogue, pp. 91-102. Springer, Cham, 2019.
[15] Bijankhan, Mahmood, Javad Sheykhzadegan,
Mohammad Bahrani, and Masood Ghayoomi.
“Lessons from building a Persian written corpus:
Peykare.” Language resources and
evaluation 45, no. 2 (2011): 143-164.
[16] Ferguson, Charles A. "Word stress in
Persian." Language 33, no. 2 (1957): 123-135.