Guannan Zhang

2024

Large Language Models (LLMs) hold considerable promise for artificial general intelligence, given their intrinsic abilities to accomplish a wide range of open-domain tasks either independently or in tandem with specialized expert models. However, despite these capabilities, the performance of LLMs has yet to be comprehensively evaluated in realistic scenarios. To this end, in this work, we introduce a novel task, the Realistic Chinese Spell Checking (RCSC), to evaluate the effectiveness of existing methods comprehensively. In contrast to existing works that solely address Chinese character misspellings or pinyin conversions, our task aims to convert the realistic Chinese text into the corresponding correct text. The realistic Chinese text may potentially contain both Chinese misspellings and pinyin conversions. We first present the Realistic Chinese Spell Checking Benchmark (RCSCB), which consists of two subsets and contains a total of 581,657 samples. Then, we benchmark the performance of various baselines and find that all the existing methods, including instruction-based LLMs, achieve unsatisfactory results on RCSCB. To further improve the performance on RCSCB, we propose Pinyin-Enhanced Spell Checker (PESC), which is specifically designed to address pinyin-related misspellings. Experimental results demonstrate that PESC can achieve state-of-the-art performance on RCSCB. Despite the progress made, the current state-of-the-art performance is still far from satisfactory. We expect further progress on this crucial and challenging task.

2023

Hierarchical text classification (HTC) focuses on classifying one text into multiple labels, which are organized as a hierarchical taxonomy. Due to its wide involution in realistic scenarios, HTC attracts long-term attention from both industry and academia. However, the high cost of hierarchical multi-label annotation makes HTC suffer from the data scarcity problem. In view of the difficulty in balancing the controllability of multiple structural labels and text diversity, automatically generating high-quality data for HTC is challenging and under-explored. To fill this blank, we propose a novel data generation framework tailored for HTC, which can achieve both label controllability and text diversity by extracting high-quality semantic-level and phrase-level hierarchical label information. Experimental results on three benchmarks demonstrate that, compared with existing data augmentation methods, the data generated from our method can bring the most significant performance improvements of several strong HTC models. Extensive analysis confirms that the improvements yielded by our proposed method do correlate to the enhancement of label controllability and text diversity.

Co-authors

Guannan Zhang

2024

2023

Co-authors

Venues