worked on by: Rashid Harvey
Outline
The Arabic alphabet is fundamentally different from the Latin alphabet. For this reason, transliteration and transcription systems have been developed to represent the Arabic writing system with Latin letters. There are two main use cases:
1. The user cannot write Arabic, for example because they do not know how to or because they do not have an Arabic keyboard.
2. The user can read and write Arabic, but wants to represent it in a way that people who cannot read Arabic script can still read.
The DMG standard (DIN 31635) emerged in Germany in 1935 and has inspired many other popular Latin transliteration systems. However, it is mainly used by scholars studying the Arabic language.
The standard mostly serves the second use case, because it is almost as complicated to write as Arabic itself: most letters are transliterated one-to-one, which requires many special characters that are not found in everyday use or on a regular keyboard. In that sense the standard is quite outdated, as it was conceived when most writing was done with pen and paper, but it is still widely taught and used in German academia. Transliterating Arabic by hand remains difficult and tedious, and I was told that an automatic transliteration tool for DMG would be helpful and does not exist yet. After some consideration I started working on one, and once I felt that I would be able to achieve a good result, I registered it as my bachelor's thesis.
A first prototype is currently available at https://transliteration.eu.pythonanywhere.com/
Thesis Requirements
First, the thesis should define what a valid transliteration is, depending on a fixed set of preferences. It might seem counter-intuitive to solve a problem that you define yourself, and there is even a public official standard (https://www.aai.uni-hamburg.de/voror/medien/dmg.pdf). In practice, however, the rules are often adapted, and different people and organizations develop different alternatives. The official standard is 90 years old, and a lot has changed in the meantime. It is therefore necessary to determine and specify what exactly constitutes a correct (and maybe even preferred) transliteration.
Second, the thesis should contain an algorithm that takes a fully vocalized Arabic text and returns a corresponding transliteration. The correct transliteration of Arabic names is excluded from this requirement.
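To illustrate the core of such an algorithm, here is a minimal sketch of a letter-by-letter mapping for a single fully vocalized word. It is not the thesis implementation: the names `CONSONANTS`, `split_units` and `transliterate_word` are made up for this example, and the table follows the usual DIN 31635 consonant values. Hamza, tanwin, the definite article, ta marbuta, diphthong conventions and names are deliberately left out.

```python
# Hypothetical sketch, not the thesis code: DIN 31635 consonants only.
CONSONANTS = {
    "ب": "b", "ت": "t", "ث": "ṯ", "ج": "ǧ", "ح": "ḥ", "خ": "ḫ",
    "د": "d", "ذ": "ḏ", "ر": "r", "ز": "z", "س": "s", "ش": "š",
    "ص": "ṣ", "ض": "ḍ", "ط": "ṭ", "ظ": "ẓ", "ع": "ʿ", "غ": "ġ",
    "ف": "f", "ق": "q", "ك": "k", "ل": "l", "م": "m", "ن": "n",
    "ه": "h", "و": "w", "ي": "y",
}
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
SHADDA, SUKUN = "\u0651", "\u0652"
SHORT_VOWELS = {FATHA: "a", DAMMA: "u", KASRA: "i"}
# A short vowel followed by its bare letter of prolongation is a long vowel.
LONG_VOWELS = {(FATHA, "ا"): "ā", (DAMMA, "و"): "ū", (KASRA, "ي"): "ī"}
DIACRITICS = set(SHORT_VOWELS) | {SHADDA, SUKUN}


def split_units(word: str) -> list[tuple[str, list[str]]]:
    """Group each base letter with the diacritics that follow it."""
    units: list[tuple[str, list[str]]] = []
    for ch in word:
        if ch in DIACRITICS and units:
            units[-1][1].append(ch)
        else:
            units.append((ch, []))
    return units


def transliterate_word(word: str) -> str:
    units = split_units(word)
    out = []
    k = 0
    while k < len(units):
        letter, marks = units[k]
        latin = CONSONANTS.get(letter, letter)  # unknown characters pass through
        if SHADDA in marks:
            latin *= 2  # shadda doubles the consonant
        out.append(latin)
        vowel = next((m for m in marks if m in SHORT_VOWELS), None)
        if vowel is not None:
            nxt = units[k + 1] if k + 1 < len(units) else None
            is_long = (
                nxt is not None
                and (vowel, nxt[0]) in LONG_VOWELS
                and not (set(nxt[1]) - {SUKUN})  # prolongation letter carries no vowel
            )
            if is_long:
                out.append(LONG_VOWELS[(vowel, nxt[0])])
                k += 1  # the prolongation letter has been consumed
            else:
                out.append(SHORT_VOWELS[vowel])
        k += 1
    return "".join(out)
```

Grouping diacritics per letter first makes the sketch independent of whether shadda is stored before or after the vowel mark. Everything beyond this simple core (prefixes, hamzat al-wasl, names, unvocalized input) is where the real difficulty lies.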
Furthermore, the thesis should contain the design of a usable application for the specific use case of a researcher who regularly uses the app in their work. All the core aspects of usability should be considered here: the application should be easy to learn, efficient, error-resistant and satisfying.
A final evaluation is very important. It should estimate the accuracy of the transliteration algorithm and the usability of the application. A necessary part of this is a user test. The evaluation should also give an outlook into the future: What is still missing and why? What could be improved and how? How could the algorithm be extended to serve more purposes?
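One possible way to estimate the accuracy is the character error rate against a manually produced reference transliteration. The following is a sketch under that assumption; the metric and the function names `levenshtein` and `character_error_rate` are not taken from the thesis plan. Both strings are normalized to NFC first, since letters such as ā can be encoded in more than one way.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(predicted: str, reference: str) -> float:
    """Edit distance divided by the reference length (0.0 means identical)."""
    predicted = unicodedata.normalize("NFC", predicted)
    reference = unicodedata.normalize("NFC", reference)
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(predicted, reference) / len(reference)
```

A character-level metric is more forgiving than whole-word accuracy, which may matter because a single missing macron would otherwise count as a fully wrong word.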
Optionally, depending on how much time is left, there will be an implementation that mainly relies on AI. In a few tests, I have found that both ChatGPT and GitHub Copilot can very easily pick up on how the transliteration works if sufficient examples are provided. Pre-trained LLMs could be an easy win for transliteration, as they are flexible and have a built-in sense of semantics; however, hallucinations and their nondeterministic nature might pose a challenge. Other approaches using AI should be considered as well. This section should also have a separate evaluation, which includes a comparison to the non-AI algorithm and prospects for further development.
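Providing examples to an LLM usually means building a few-shot prompt. The sketch below only assembles such a prompt as a string and does not call any model; the function name `build_few_shot_prompt`, the wording of the instruction and the prompt layout are assumptions made for illustration.

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], text: str) -> str:
    """Assemble a few-shot prompt from (Arabic, DMG) example pairs."""
    lines = [
        "Transliterate the Arabic text into Latin script "
        "following the DMG (DIN 31635) conventions.",
        "",
    ]
    for arabic, latin in examples:
        lines += [f"Arabic: {arabic}", f"DMG: {latin}", ""]
    # The final block is left open so that the model completes it.
    lines += [f"Arabic: {text}", "DMG:"]
    return "\n".join(lines)
```

Keeping the examples as data makes it easy to vary their number and to reuse the same pairs as test cases for the rule-based algorithm.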
Planning
The next steps are the following:
- Further improvements, bug fixes, etc. in correspondence with an expert
- Working on the usability of the application, for example by enabling different input methods and explaining the UI
- Literature review (Finding and reading related works)
- Gathering necessary testing and training data (contacting libraries and universities, etc.)
- Testing different approaches to certain problems like NER, vocalization and prefixing.
- Once the application is more mature: iterative synchronous and asynchronous user tests, in person as soon as possible (when I am in Berlin)
Reception so far
Anyway, it's very much an open topic in research. You could probably come up with a solution that would be of interest to people, even if it weren't flawless.
The drudgery of manual romanization is something that scholars in fields like Arabistik and Islamwissenschaft would love to put behind them.
-- Dr. Theodore S. Beers
I have just tried out your software and am really impressed.
It already works very well.
-- Doğa Akpınar
Many thanks for your email and for pointing me to your exciting work.
-- Dr. Till Grallert
the project sounds very exciting
Apart from that, I find it entirely legitimate for a BA thesis to assume vocalization in the input.
All in all, this is a nice BA project, and I would very much like to encourage you not to let yourself be discouraged by the language or by the poor state of research and software.
-- Dr. Jonas Müller-Laackman
thank you very much for your message. Your bachelor's thesis sounds like an exciting project!
-- Dr. Victoria Mummelthei
Your inquiry is very interesting to us
-- Dr. Ruben Schenzle
Weekly Status
Week 1 (CW 13)
Activities
- Writing this Wiki
- Working on the UI and some bug fixes
- Reading a few papers from ArabicNLP2023 to get a feeling for the type of topics that are accepted, which is mostly AI:
- Summarization, Translation (especially of dialects) and RAG/QA and LLMs in general
- Looking into the Qalamos-Project for potential uses for this project
- Correspondence with Dr. Theodore Beers
Results
- I believe it might be difficult to submit a paper to ArabicNLP2024, but worth a shot. Drafts are explicitly allowed. Possible paper sizes are 2, 4, and 8 pages. So 2 pages might be a good size.
- Qalamos transliterates its titles and the author names. These likely present very valuable data.
- Theo caught a nice bug. He also referred me to Jonas Müller-Laackman, whom I have already contacted, and to Till Grallert. He also told me that "Most researchers in English-speaking countries are now using the IJMES standard (and libraries use ALA-LC)." The "now" is particularly interesting. For ALA-LC, there is already a solution online (https://transliterate.arabicalphabet.net/). It also explicitly states how the tool works, and I have already taken some inspiration from it by using Mishkal. Theo rated Mishkal well: "It actually does a decent job most of the time." He then ended the email with: "The thing is, the problem space is almost unbounded. You could keep improving the program to account for more quirks of Arabic orthography, but there would always be the possibility that it would choke on some valid input that you haven't tested before. I think this will turn out to be a labor of love, if you stick with it. For a BA thesis project, you've done a lot already." This is interesting, as I have only just started writing my bachelor's thesis and still plan to improve the tool week by week.
Next Steps
- I will integrate a virtual Arabic keyboard to make it more accessible to people without an Arabic keyboard
- I will fix a few more bugs like the one that Theo caught
- I will revisit Named Entity Recognition. As it stands, it is too slow to use in practice. It also does not even attempt to fulfil the specification.
- I will contact Qalamos to acquire easy API access to their data. Otherwise, I will write a scraper to gather the necessary data from their public website
- I will dig even deeper into how the programming libraries I am using work, even though I have already dug quite deep (no deliverables)
- Apart from digging deeper, I will also try to recreate some of the code, as I have already done with some of the data. The goal is to be able to improve on these libraries, as they are no longer being developed and their quality is poor. I might even fork the projects on GitHub and use my forks instead of the publicly available versions
- Lastly, if time permits, I will work on writing more tests. These are important to catch regressions, and I have too often been a bit sloppy with them in the past.
Problems
Week 2 (CW 14) 01.04.2024 – 07.04.2024
Activities
Results
- The UI now has a virtual keyboard
- I got a lot more feedback and a bug report
- Also, a newer, more modern specification document from the University of Bamberg: Translit.pdf
Next Steps
- Most of the things from before
- Reworking the feedback process
- Adding examples to the settings
- Developing the tests jointly with a specification
Problems
- Unfortunately, I didn't have a lot of resources to invest this week, partially due to a paper, a lab report and a trip
- Also, adding a virtual keyboard was much more difficult than I had estimated.
Week 3 (CW 15) 08.04. - 14.04.
Activities
- Emails
- UI
- Bug fixing
- Writing tests
- Gathering data
Results
- Found a lot more libraries for POS-tagging and lemmatization on GitHub
- Even better keyboard
- One-click feedback
- Added illustrative examples to settings
- More comprehensive tests using more real-world inputs
- Some more data and models for NER (1, 2, 3, 4)
- More project referrals:
- CtG (Closing the Gap in non-Latin script data): Possibly more transliteration data
- Rule-based IJMES to Arabic reverse transliteration in XSLT and in Python using a transitional representation called BetaCode
- How could this be useful? Maybe something can be learned from the algorithms or how they were implemented. The same applies for any test data etc.
- Also, if a DMG to IJMES conversion turns out to be easy (which is unlikely), these algorithms could be used to validate the transliteration more automatically, using only arbitrary Arabic text (Arabic → DMG → IJMES → Arabic)
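The round-trip idea (Arabic → DMG → IJMES → Arabic) can be sketched generically. In the sketch below, `forward` and `backward` are placeholders: `forward` would be the chain Arabic → DMG → IJMES and `backward` one of the reverse transliteration tools mentioned above. Neither exists in this form yet, and the function name `round_trip_failures` is hypothetical.

```python
from typing import Callable, Iterable


def round_trip_failures(
    texts: Iterable[str],
    forward: Callable[[str], str],
    backward: Callable[[str], str],
) -> list[tuple[str, str]]:
    """Return (original, restored) pairs for which the round trip changed the text."""
    failures = []
    for original in texts:
        restored = backward(forward(original))
        if restored != original:
            failures.append((original, restored))
    return failures
```

A passing round trip only shows that no information was lost along the way. It cannot show that the intermediate DMG form follows the specification, so it would complement rather than replace tests with hand-checked transliterations.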
Next Steps
- NER
- Vocalization
- Writing the paper for ArabicNLP (deadline 05.2024). I believe the most interesting topic for the conference would be studying different transformers and LLMs for transliteration. Google Colab is probably a good platform for developing this
Problems
- It is hard for me to determine the best solutions for specific problems. Either the solution is rule-based and therefore generally of poor quality, or it uses a DNN (RNN, GRU, LSTM, Transformer), which is generally too resource-intensive. The most valuable resource is time, especially response time: the complete algorithm should run in a few milliseconds, and DNNs cannot provide this kind of efficiency.
- The other problem is the qualitative assessment: even among the rule-based approaches, or among the DNN-based ones, it is difficult for me to determine which algorithms are the best, most modern and most general. I also find it difficult to determine how to improve the quality further. Is it just a matter of data, or do I need to adapt the rules to optimize them for this specific use case?
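The response-time concern can be checked empirically before ruling an approach in or out. The following sketch measures the median latency of any candidate component on a set of inputs; the name `median_latency_ms` and the choice of the median are assumptions, and the measured callable stands in for whichever rule-based or neural component is being compared.

```python
import statistics
import time
from typing import Callable, Sequence


def median_latency_ms(
    func: Callable[[str], object],
    inputs: Sequence[str],
    repeats: int = 5,
) -> float:
    """Median wall-clock time of func over all inputs, in milliseconds."""
    samples = []
    for text in inputs:
        for _ in range(repeats):
            start = time.perf_counter()
            func(text)
            samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

Running the same measurement on each candidate with realistic input lengths gives comparable numbers against the budget of a few milliseconds.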
Week 4 (CW 16) 15.04. - 21.04.
Activities
- meetings
- work on transliteration
Results
- Resolved a few doubts and questions with Professor Prechelt
- Had a meeting with my Arabic teacher and got a lot of feedback
- Tried to get an appointment with the Coranica project, which has a Qur'anic transliteration
- Better prefixing
- Custom NER detection
- Better hamzatul wasl and ta marbutah handling
Next Steps
- Working on merging vocalization libraries to seriously improve vocalization performance
Problems
Week 5 (CW 17) 22.04. - 28.04.
Activities
- More bug fixes
- Researching the current approaches to vocalization
Results
- Most tests are passing. One still fails, but that is due to an underlying library that I will need to revisit
- For vocalization, I found one possible test data set (Tashkeela) with 75.6 million words, and three possible vocalization libraries in addition to Mishkal, which is currently in use
Next Steps
- First, I will need to design different accumulation algorithms that can combine the results from the different models
- At the same time, I will need to run many tests and compare the results
- Finally, I must write the paper
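One simple candidate for combining the results of several vocalization models is a word-level majority vote. This is only a sketch of one possible accumulation strategy, not a decision made in this project: the function name `vote_vocalization` is hypothetical, and it assumes that all models return the same number of words for the same input.

```python
from collections import Counter


def vote_vocalization(candidates: list[list[str]]) -> list[str]:
    """Pick, for every word position, the form proposed by most models.

    candidates holds one word list per model. Ties go to the form seen
    first, so models should be passed in order of trust.
    """
    if not candidates:
        raise ValueError("at least one model output is required")
    if len({len(words) for words in candidates}) != 1:
        raise ValueError("all models must return the same number of words")
    return [
        Counter(forms).most_common(1)[0][0]
        for forms in zip(*candidates)
    ]
```

Voting per word keeps the sketch short. Voting per letter or weighting models by their measured accuracy would be natural variations to compare on the Tashkeela data.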
Problems
- This week and the following two weeks (weeks 17-19), I have a few exams that take a big toll on my time