Ke Zhang


2024

Prompt-based Generation of Natural Language Explanations of Synthetic Lethality for Cancer Drug Discovery
Ke Zhang | Yimiao Feng | Jie Zheng
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Synthetic lethality (SL) offers a promising approach for targeted anti-cancer therapy. Deeply understanding SL gene pair mechanisms is vital for anti-cancer drug discovery. However, current wet-lab and machine learning-based SL prediction methods lack user-friendly and quantitatively evaluable explanations. To address these problems, we propose a prompt-based pipeline for generating natural language explanations. We first construct a natural language dataset named NexLeth. This dataset is derived from New Bing through prompt-based queries and expert annotations and contains 707 instances. NexLeth enhances the understanding of SL mechanisms and it is a benchmark for evaluating SL explanation methods. For the task of natural language generation for SL explanations, we combine subgraph explanations from an SL knowledge graph (KG) with instructions to construct novel personalized prompts, so as to inject the domain knowledge into the generation process. We then leverage the prompts to fine-tune pre-trained biomedical language models on our dataset. Experimental results show that the fine-tuned model equipped with designed prompts performs better than existing biomedical language models in terms of text quality and explainability, suggesting the potential of our dataset and the fine-tuned model for generating understandable and reliable explanations of SL mechanisms.

Typos Correction Training against Misspellings from Text-to-Text Transformers
Guicai Xie | Ke Zhang | Lei Duan | Wei Zhang | Zeqian Huang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Dense retrieval (DR) has become a mainstream approach to information seeking, where a system is required to return relevant information to a user query. In real-life applications, typoed queries resulting from the users’ mistyping words or phonetic typing errors exist widely in search behaviors. Current dense retrievers experience a significant drop in retrieval effectiveness when they encounter typoed queries. Therefore, the search system requires the extra introduction of spell-checkers to deal with typos and then applies the DR model to perform robust matching. Herein, we argue that directly conducting the typos correction training would be beneficial to make an end-to-end retriever against misspellings. To this end, we propose a novel approach that can facilitate the incorporation of the spelling correction objective into the DR model using the encoder-decoder architecture. During typos correction training, we also develop a prompt-based augmentation technique to enhance the DR space alignment of the typoed query and its original query. Extensive experiments demonstrate that the effectiveness of our proposed end-to-end retriever significantly outperforms existing typos-aware training approaches and sophisticated training advanced retrievers. Our code is available at https://github.com/striver314/ToCoTR.

2023

BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of Faithfulness Metrics
Liang Ma | Shuyang Cao | Robert L Logan IV | Di Lu | Shihao Ran | Ke Zhang | Joel Tetreault | Alejandro Jaimes
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The proliferation of automatic faithfulness metrics for summarization has produced a need for benchmarks to evaluate them. While existing benchmarks measure the correlation with human judgements of faithfulness on model-generated summaries, they are insufficient for diagnosing whether metrics are: 1) consistent, i.e., indicate lower faithfulness as errors are introduced into a summary, 2) effective on human-written texts, and 3) sensitive to different error types (as summaries can contain multiple errors). To address these needs, we present a benchmark of unfaithful minimal pairs (BUMP), a dataset of 889 human-written, minimally different summary pairs, where a single error is introduced to a summary from the CNN/DailyMail dataset to produce an unfaithful summary. We find BUMP complements existing benchmarks in a number of ways: 1) the summaries in BUMP are harder to discriminate and less probable under SOTA summarization models, 2) unlike non-pair-based datasets, BUMP can be used to measure the consistency of metrics, and reveals that the most discriminative metrics tend not to be the most consistent, and 3) unlike datasets containing generated summaries with multiple errors, BUMP enables the measurement of metrics’ performance on individual error types.

2022

CrisisLTLSum: A Benchmark for Local Crisis Event Timeline Extraction and Summarization
Hossein Rajaby Faghihi | Bashar Alhafni | Ke Zhang | Shihao Ran | Joel Tetreault | Alejandro Jaimes
Findings of the Association for Computational Linguistics: EMNLP 2022

Social media has increasingly played a key role in emergency response: first responders can use public posts to better react to ongoing crisis events and deploy the necessary resources where they are most needed. Timeline extraction and abstractive summarization are critical technical tasks to leverage large numbers of social media posts about events. Unfortunately, there are few datasets for benchmarking technical approaches for those tasks. This paper presents CrisisLTLSum, the largest dataset of local crisis event timelines available to date. CrisisLTLSum contains 1,000 crisis event timelines across four domains: wildfires, local fires, traffic, and storms. We built CrisisLTLSum using a semi-automated cluster-then-refine approach to collect data from the public Twitter stream. Our initial experiments indicate a significant gap between the performance of strong baselines compared to the human performance on both tasks. Our dataset, code, and models are publicly available (https://github.com/CrisisLTLSum/CrisisTimelines).

Mapping the Design Space of Human-AI Interaction in Text Summarization
Ruijia Cheng | Alison Smith-Renner | Ke Zhang | Joel Tetreault | Alejandro Jaimes-Larrarte
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Automatic text summarization systems commonly involve humans for preparing data or evaluating model performance, yet there is no systematic understanding of humans’ roles, experience, and needs when interacting with or being assisted by AI. From a human-centered perspective, we map the design opportunities and considerations for human-AI interaction in text summarization and broader text generation tasks. We first conducted a systematic literature review of 70 papers, developing a taxonomy of five interactions in AI-assisted text generation and relevant design dimensions. We designed text summarization prototypes for each interaction. We then interviewed 16 users, aided by the prototypes, to understand their expectations, experience, and needs regarding efficiency, control, and trust with AI in text summarization and propose design considerations accordingly.

An Exploration of Post-Editing Effectiveness in Text Summarization
Vivian Lai | Alison Smith-Renner | Ke Zhang | Ruijia Cheng | Wenjuan Zhang | Joel Tetreault | Alejandro Jaimes-Larrarte
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Automatic summarization methods are efficient but can suffer from low quality. In comparison, manual summarization is expensive but produces higher quality. Can humans and AI collaborate to improve summarization performance? In similar text generation tasks (e.g., machine translation), human-AI collaboration in the form of “post-editing” AI-generated text reduces human workload and improves the quality of AI output. Therefore, we explored whether post-editing offers advantages in text summarization. Specifically, we conducted an experiment with 72 participants, comparing post-editing provided summaries with manual summarization for summary quality, human efficiency, and user experience on formal (XSum news) and informal (Reddit posts) text. This study sheds valuable insights on when post-editing is useful for text summarization: it helped in some cases (e.g., when participants lacked domain knowledge) but not in others (e.g., when provided summaries include inaccurate information). Participants’ different editing strategies and needs for assistance offer implications for future human-AI summarization systems.

2021

Olá, Bonjour, Salve! XFORMAL: A Benchmark for Multilingual Formality Style Transfer
Eleftheria Briakou | Di Lu | Ke Zhang | Joel Tetreault
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We take the first step towards multilingual style transfer by creating and releasing XFORMAL, a benchmark of multiple formal reformulations of informal text in Brazilian Portuguese, French, and Italian. Results on XFORMAL suggest that state-of-the-art style transfer approaches perform close to simple baselines, indicating that style transfer is even more challenging when moving multilingual.

A Review of Human Evaluation for Style Transfer
Eleftheria Briakou | Sweta Agrawal | Ke Zhang | Joel Tetreault | Marine Carpuat
Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021)

This paper reviews and summarizes human evaluation practices described in 97 style transfer papers with respect to three main evaluation aspects: style transfer, meaning preservation, and fluency. In principle, evaluations by human raters should be the most reliable. However, in style transfer papers, we find that protocols for human evaluations are often underspecified and not standardized, which hampers the reproducibility of research in this field and progress toward better human and automatic evaluation methods.