Cantonese is a Chinese dialect used mainly in Hong Kong, Macau, Guangdong Province, and some other regions. Although Cantonese is very similar to Mandarin, translating it accurately remains challenging, and many native Mandarin speakers have difficulty reading Cantonese texts. Because Cantonese is rarely written in formal texts, it is a low-resource language compared with more widely used, better-resourced languages, and the available data is too sparse to support training neural machine translation models. Ku Su Wa and Chong Iok Hei, outstanding 2020/2021 undergraduates in the Department of Computer and Information Science, recognised this problem and chose it as the topic of their graduation project. Under the guidance of Professor Derek Wong, they built “Candarin”, a bilingual-dictionary-based neural machine translation system that requires no parallel data, and combined it with back-translation and dual learning. With these methods, they created a translator that works in both directions between Cantonese and Mandarin, addressing the low-resource problem in Cantonese translation. To make it convenient for the general public, they also designed an easy-to-use website and a mobile version. Their graduation project was selected as one of the ‘Best Excellent Projects’.
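For readers curious how a bilingual dictionary can stand in for missing parallel data, the idea can be illustrated with a minimal sketch. This is not the Candarin implementation: it simply substitutes dictionary words in Mandarin sentences to produce rough synthetic Cantonese, yielding sentence pairs that a translation model could then be trained on. The dictionary entries, function names, and longest-match strategy are all assumptions made for illustration; a real system would use a far larger lexicon and a trained model for back-translation.

```python
# Toy Cantonese -> Mandarin dictionary. These are common word pairs picked
# for illustration only; they are NOT the project's actual lexicon.
CANTONESE_TO_MANDARIN = {
    "佢": "他",
    "係": "是",
    "唔": "不",
    "嘅": "的",
    "喺": "在",
    "冇": "沒有",
}


def invert(dictionary):
    """Swap keys and values to get the Mandarin -> Cantonese direction."""
    return {value: key for key, value in dictionary.items()}


def substitute(sentence, dictionary):
    """Replace dictionary entries in `sentence`, trying longer entries first.

    Characters that match no entry are copied through unchanged.
    """
    keys = sorted(dictionary, key=len, reverse=True)
    out = []
    i = 0
    while i < len(sentence):
        for key in keys:
            if sentence.startswith(key, i):
                out.append(dictionary[key])
                i += len(key)
                break
        else:
            # No dictionary entry starts here; keep the character as-is.
            out.append(sentence[i])
            i += 1
    return "".join(out)


def make_synthetic_pairs(mandarin_sentences):
    """Build (synthetic Cantonese, Mandarin) pairs from Mandarin-only text."""
    mandarin_to_cantonese = invert(CANTONESE_TO_MANDARIN)
    return [(substitute(s, mandarin_to_cantonese), s) for s in mandarin_sentences]


if __name__ == "__main__":
    for cantonese, mandarin in make_synthetic_pairs(["他沒有書", "他不是學生"]):
        print(cantonese, "<->", mandarin)
```

Word-for-word substitution ignores differences in grammar and word order, which is why techniques such as back-translation and dual learning are used to refine a model beyond this starting point.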
Reflecting on what they gained from their four years at UM, Chong Iok Hei shares, ‘Besides teaching me the latest technical knowledge and information, UM has also helped me build relationships with others. At UM, I have met many like-minded friends and great teachers.’ Ku Su Wa says, ‘In this harmonious learning environment, I have met many new friends and professors. They have given me a lot of help and encouragement. Whenever I encountered difficulties, they sincerely helped me, which made me feel very warm. I have also developed self-management skills and learned how to arrange and manage my time appropriately.’ In the future, Chong Iok Hei hopes to work in an educational institution so that Macau children can acquire computer knowledge from an early age. Ku Su Wa adds that his graduation project deepened his interest in natural language processing, and he will continue to conduct in-depth research in this field.
Their project received the ‘Best Final Year Project of CIS’ award
The web and mobile versions of the translator “Candarin”