Conference Paper

Industry Experience: Chinese Names Duplicate Records Detection


Thong Tong Khin, Badrul Affandy Ahmad, Wang Xiaomei and Kandiah Arichandran


Soundex is the preferred method for detecting duplicate records among Malaysian Chinese names. These names are written in English text but are phonetic transliterations from various Chinese dialects. With the Russell Soundex method, the number of detected duplicates is high, and so is the number of false positives. Because Soundex is adaptable, it can be optimized for foreign-language names such as Chinese names. Through a series of tests, this study has optimized the Soundex codes for general Malaysian Chinese names. The test results show that a few short Chinese surnames contribute to false positives.
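For readers unfamiliar with the baseline, the following is a minimal Python sketch of the standard Russell (American) Soundex coding. It is not the authors' optimized variant: the function name `soundex`, the `codes` parameter, and the sample names are assumptions introduced here for illustration, and the sample names are not taken from the study's data. The letter-to-digit table is passed as an argument only to show where a language-specific adaptation could be plugged in.

```python
# Standard Russell Soundex letter groups (the unoptimized baseline).
RUSSELL_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name, codes=RUSSELL_CODES, length=4):
    """Return the Soundex code of `name` using the given letter-to-digit table.

    `codes` is a hypothetical hook: swapping in a different table is one way
    an adapted Soundex could be expressed.
    """
    # Keep only ASCII letters, upper-cased.
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""

    first = letters[0]
    result = [first]
    prev = codes.get(first, "")  # code of the last coded or separating letter

    for ch in letters[1:]:
        if ch in "HW":
            # H and W are skipped and do not separate equal codes.
            continue
        code = codes.get(ch, "")  # vowels and Y map to ""
        if code and code != prev:
            result.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept

    # Truncate or zero-pad to the fixed code length.
    return "".join(result)[:length].ljust(length, "0")


if __name__ == "__main__":
    for sample in ("Robert", "Rupert", "Tang", "Tong", "Lee", "Ng"):
        print(sample, soundex(sample))
    # Robert R163
    # Rupert R163
    # Tang T520
    # Tong T520
    # Lee L000
    # Ng N200
```

In this sketch, short names reduce to an initial letter plus very few digits, so differently spelled short names such as the illustrative "Tang" and "Tong" share one code. Whether such a pair is a true duplicate or a false positive cannot be decided from the code alone.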


The International Workshop on Knowledge Extraction and Semantic Annotation, 2017

