Juhyeon Kim
Ph.D. Student, Seoul National University

I am a Ph.D. student at Seoul National University, conducting research in causal discovery and large language models (LLMs). My work broadly focuses on building reliable, interpretable, and generalizable causal reasoning systems, particularly in settings where data is sparse, non-stationary, or embedded in unstructured text. My early research addressed the automated extraction of causal relationships from text using part-of-speech-based data augmentation, which I presented as first author at CASE at EMNLP 2022. Since then, I have been actively exploring how pre-trained language models can be leveraged to incorporate prior knowledge into causal structure learning, co-authoring papers presented at the NeurIPS 2024 Workshop on Causality and Large Models and currently under review at IEEE Access. Alongside this, I am the lead inventor of a causal discovery framework that integrates LLMs into graph-learning pipelines.

My recent efforts involve both theoretical and empirical investigations of causal discovery for non-stationary time series and spatio-temporal data. In addition, I am developing a series of studies on LLM citation reliability, including saliency-based methods, and on diffusion-based causal discovery using mixture-of-experts (MoE) architectures. My research aims to bridge language understanding, causal inference, and dynamic systems, with the long-term goal of building robust AI systems that can reason about change, uncertainty, and cause-and-effect relationships in real-world settings.


Education
  • Seoul National University
    Department of Data Science
    Ph.D. Student
    Mar. 2023 - present
  • Seoul National University
    M.S. in Data Science
    Mar. 2021 - Feb. 2023
  • Korea University
    B.S. in Public Administration & Business
    Mar. 2014 - Feb. 2021
News
2025
Feb 27: Completed Ph.D. coursework in Data Science at Seoul National University.
2024
Jul 01: Conducted data science training and project collaboration with employees at Hyundai Motor Group as part of a corporate bootcamp initiative.
Mar 05: Filed a patent as lead inventor for “A Causal Discovery Framework Leveraging Prior Knowledge from Large Language Models,” in collaboration with LG AI Research.
2023
Apr 01: Participated in the SNU Medical AI Talent Development Program, organized by the Ministry of Health and Welfare of Korea (2023- ).
Mar 31: Participated in the project “Causal Discovery for Time Series Data Guided by Data-driven Causal Knowledge” in collaboration with LG AI Research (2023- ).
Mar 01: Contributed to the project “Development of Machine Learning Models and Algorithms Based on Causality,” funded by the Ministry of Science and ICT in Korea (2023- ).
Feb 28: Participated in the “Center for Optimization of Foundation Models and AI Platforms,” funded by the Ministry of Science and ICT in Korea (2023- ).
2022
Mar 31: Participated in the National Research Foundation of Korea (NRF) project on “Developing Self-directed AI Technologies for Solving Emerging Problems” (2022-2023).
Selected Publications
On Incorporating Prior Knowledge Extracted from Pre-trained Language Models into Causal Discovery

Juhyeon Kim, Chanhui Lee, LG AI Institute, Sanghack Lee

Neural Information Processing Systems (NeurIPS) Workshop on Causality and Large Models 2024 Spotlight

Pre-trained Language Models (PLMs) can reason about causality by leveraging vast pre-trained knowledge and text descriptions of datasets, proving effective even when data is scarce. However, current PLM-based causal reasoning methods have crucial limitations: i) PLMs cannot utilize large datasets in a prompt due to context-length limits, and ii) these methods are not adept at comprehending whole interconnected causal structures. On the other hand, data-driven causal discovery can recover the causal structure as a whole, although it works well only when the number of observations is sufficiently large. To overcome each approach’s limitations, we propose a new framework that integrates PLM-based causal reasoning into data-driven causal discovery, resulting in improved and robust performance. Furthermore, our framework extends to time-series data and exhibits superior performance.

Detecting Causality by Data Augmentation via Part-of-Speech tagging

Juhyeon Kim, Yesong Choi, Sanghack Lee

CASE at EMNLP 2022 Spotlight

Finding causal relations in texts has been a challenge, since it requires methods ranging from defining event ontologies to developing proper algorithmic approaches. In this paper, we developed a framework that classifies whether a given sentence contains a causal event. We exploited an external corpus with causal labels to overcome the small size of the original corpus (Causal News Corpus) provided by the task organizers. Further, we employed a data augmentation technique utilizing Part-of-Speech (POS) tagging, based on our observation that some parts of speech are more (or less) relevant to causality. Our approach especially improved the recall of detecting causal events in sentences.
