Research Projects

Research projects in Arase Lab

Semantic Textual Similarity Estimation through Alignment

STS

Apr 1, 2024

#STS

Measuring the similarity between two texts is a core technology in NLP and the foundation for applications such as information retrieval, question answering, and conversational systems. Our primary emphasis is on paraphrases, which express an equivalent meaning through diverse wordings and structures. We are dedicated to advancing paraphrase recognition and generation models, with a particular focus on word and phrase alignment. Alignment explains why two expressions convey the same meaning, enriching our understanding of meaning composition.

Keywords

Paraphrase
Entailment
Word and phrase alignment

Selected Publications

Y. Arase, H. Bao, and S. Yokoi. Unbalanced Optimal Transport for Unbalanced Word Alignment, in Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL 2023), pp. 3966–3986 (July 2023).
S. Kadotani and Y. Arase. Monolingual Phrase Alignment as Parse Forest Mapping, in Proc. of the Joint Conference on Lexical and Computational Semantics (*SEM 2023), pp. 449–455 (July 2023).
J. Takayama, T. Kajiwara, and Y. Arase. DIRECT: Direct and Indirect Responses in Conversational Text Corpus, in Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 1980–1989 (Nov. 2021).
Y. Arase and J. Tsujii. Compositional Phrase Alignment and Beyond, in Proc. of Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pp. 1611–1623 (Nov. 2020).
Y. Arase and J. Tsujii. Transfer Fine-Tuning: A BERT Case Study, in Proc. of Conference on Empirical Methods in Natural Language Processing (EMNLP 2019), pp. 5393–5404 (Nov. 2019).
Y. Arase and J. Tsujii. Monolingual Phrase Alignment on Parse Forests, in Proc. of Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), pp. 1–11 (Sept. 2017).

Computer Assisted Language Learning

CALL

Apr 1, 2024

#CALL

We employ paraphrasing techniques to construct a system to aid language learning and education. In language learning, abundant exposure to authentic English texts authored by native speakers, tailored to learners’ proficiency levels, is crucial. However, such texts are often scarce. To tackle this challenge, we are developing text simplification methods capable of transforming advanced texts (written by native speakers) into simpler versions. Additionally, we are creating language resources for CALL and developing models for automatic difficulty prediction.

Keywords

Text simplification
Difficulty (level) assessment
Language resource creation

Selected Publications

X. Wu and Y. Arase. An In-depth Evaluation of GPT-4 in Sentence Simplification with Error-based Human Assessment. arXiv:2403.04963.
R. Miyano, T. Kajiwara, and Y. Arase. Self-Ensemble of N-best Generation Hypotheses by Lexically Constrained Decoding, in Proc. of Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), pp. 14653–14661 (Dec. 2023).
Y. Arase, S. Uchida, and T. Kajiwara. CEFR-based Sentence Difficulty Annotation and Assessment, in Proc. of Conference on Empirical Methods in Natural Language Processing (EMNLP 2022), pp. 6206–6219 (Dec. 2022).
H. Huang, T. Kajiwara, and Y. Arase. Definition Modelling for Appropriate Specificity, in Proc. of Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), pp. 2499–2509 (Nov. 2021).
D. Nishihara, T. Kajiwara, and Y. Arase. Controllable Text Simplification with Lexical Constraint, in Proc. of Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pp. 260–266 (July 2019).

NLP for Medical Documents

Medical NLP

Apr 1, 2024

#Medical NLP

Medical practitioners spend a significant amount of time writing textual records, including medical and nursing reports and examination findings. Yet the majority of these documents remain unstructured, which makes effective and efficient reuse difficult. In response, our project focuses on developing technologies to intelligently process medical texts. By facilitating their effective use, we aim to enhance the quality of medical care.

Keywords

Medical NLP
Summarization
Domain adaptation
Data augmentation

Selected Publications

S. Ohashi, J. Takayama, T. Kajiwara, and Y. Arase. Distinct Label Representations for Few-Shot Text Classification, in Proc. of the Annual Meeting of the Association for Computational Linguistics and International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021), pp. 831–836 (Aug. 2021).
S. Ohashi, J. Takayama, T. Kajiwara, C. Chu, and Y. Arase. Text Classification with Negative Supervision, in Proc. of Annual Meeting of the Association for Computational Linguistics (ACL 2020), pp. 351–357 (July 2020).
Y. Arase, T. Kajiwara, and C. Chu. Annotation of Adverse Drug Reactions in Patients’ Weblogs, in Proc. of International Conference on Language Resources and Evaluation (LREC 2020), pp. 6769–6776 (May 2020).

Evaluation of LLM Misalignment

LLM Evaluation

Apr 1, 2024

#LLM Evaluation

[New project starting in 2024]

LLMs have dramatically advanced language processing and have been widely accepted by society as fundamental information technologies. However, despite their widespread adoption, LLMs are not flawless; issues such as hallucinations, adversarial use for disinformation generation, and the amplification of social biases are well-documented concerns. To ensure the healthy and safe use of LLMs, it is imperative to identify and rectify such misalignments. This project aims to develop technology capable of systematically detecting and evaluating misalignments in LLMs.

Keywords

Large language models (LLMs)
Misalignment
Disinformation
Hallucination
