Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue]: LLM tends can invent descriptions for entities and relationships #1543

Open
1 of 3 tasks
EvenStarArwen opened this issue Dec 20, 2024 · 3 comments
Open
1 of 3 tasks
Labels
triage Default label assignment, indicates new issue needs reviewed by a maintainer

Comments

@EvenStarArwen
Copy link

Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the issue

During entity & relationship extraction, the LLM (experimented with GPT-4o and GPT-4o-mini) are likely to describe entities/relationships in its own words when relevant information is actually not mentioned.

For example, for a paper abstract like below:

"Evolvability is the ability of a biological system to produce phenotypic variation that is both heritable and adaptive. It has long been the subject of anecdotal observations and theoretical work. In recent years, however, the molecular causes of evolvability have been an increasing focus of experimental work. Here, we review recent experimental progress in areas as different as the evolution of drug resistance in cancer cells and the rewiring of transcriptional regulation circuits in vertebrates..."

As you can imagine, as a paper abstract, entities like "cancer cells" are only mentioned, and there are no further introduction/descriptions of such concepts. However, in the LLM's output, there will be descriptions like:

(“entity”{tuple_delimiter}CANCER CELLS{tuple_delimiter}CELLS{tuple_delimiter}Cancer cells are a type of cell that have undergone genetic changes, allowing them to resist treatments and proliferate uncontrollably)

Such information is never mentioned in the paper abstract. And the same could occur for relationships. Given the facts that the goal of RAG is to augment the factuality of LLM's outputs and to avoid hallucination made by it, having LLM-generated descriptions in the knowledge graph used for RAG is definitely not desired. While in my examples above, the descriptions seem to be acceptable, but you could surely imagine there will be some entities for which LLMs could provide problematic descriptions for them, and thereby posing risks to the downstream RAG tasks.

I suppose this could be addressed by polishing the prompts and few-shot examples to explicitly discourage such LLM-invented descriptions that have no supporting information from the original text units.

Steps to reproduce

No response

GraphRAG Config Used

To reproduce the examples above, you could try the following prompts, which is based on entity_extraction.txt. The only changes I made to this file is that I have put real data (i.e., entity types and text input) into this prompt to allow for testing the entity extraction quality of LLMs independently (instead of testing the whole GraphRAG system). You can send this directly to an API call or to the web platforms.

-Goal-
Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.
 
-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: One of the following types: [{entity_types}]
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)
 
2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
 Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_strength>)
 
3. Return output in English as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.
 
4. When finished, output {completion_delimiter}
 
######################
-Examples-
######################
Example 1:
Entity_types: ORGANIZATION,PERSON
Text:
The Verdantis's Central Institution is scheduled to meet on Monday and Thursday, with the institution planning to release its latest policy decision on Thursday at 1:30 p.m. PDT, followed by a press conference where Central Institution Chair Martin Smith will take questions. Investors expect the Market Strategy Committee to hold its benchmark interest rate steady in a range of 3.5%-3.75%.
######################
Output:
("entity"{tuple_delimiter}CENTRAL INSTITUTION{tuple_delimiter}ORGANIZATION{tuple_delimiter}The Central Institution is the Federal Reserve of Verdantis, which is setting interest rates on Monday and Thursday)
{record_delimiter}
("entity"{tuple_delimiter}MARTIN SMITH{tuple_delimiter}PERSON{tuple_delimiter}Martin Smith is the chair of the Central Institution)
{record_delimiter}
("entity"{tuple_delimiter}MARKET STRATEGY COMMITTEE{tuple_delimiter}ORGANIZATION{tuple_delimiter}The Central Institution committee makes key decisions about interest rates and the growth of Verdantis's money supply)
{record_delimiter}
("relationship"{tuple_delimiter}MARTIN SMITH{tuple_delimiter}CENTRAL INSTITUTION{tuple_delimiter}Martin Smith is the Chair of the Central Institution and will answer questions at a press conference{tuple_delimiter}9)
{completion_delimiter}

######################
Example 2:
Entity_types: ORGANIZATION
Text:
TechGlobal's (TG) stock skyrocketed in its opening day on the Global Exchange Thursday. But IPO experts warn that the semiconductor corporation's debut on the public markets isn't indicative of how other newly listed companies may perform.

TechGlobal, a formerly public company, was taken private by Vision Holdings in 2014. The well-established chip designer says it powers 85% of premium smartphones.
######################
Output:
("entity"{tuple_delimiter}TECHGLOBAL{tuple_delimiter}ORGANIZATION{tuple_delimiter}TechGlobal is a stock now listed on the Global Exchange which powers 85% of premium smartphones)
{record_delimiter}
("entity"{tuple_delimiter}VISION HOLDINGS{tuple_delimiter}ORGANIZATION{tuple_delimiter}Vision Holdings is a firm that previously owned TechGlobal)
{record_delimiter}
("relationship"{tuple_delimiter}TECHGLOBAL{tuple_delimiter}VISION HOLDINGS{tuple_delimiter}Vision Holdings formerly owned TechGlobal from 2014 until present{tuple_delimiter}5)
{completion_delimiter}

######################
Example 3:
Entity_types: ORGANIZATION,GEO,PERSON
Text:
Five Aurelians jailed for 8 years in Firuzabad and widely regarded as hostages are on their way home to Aurelia.

The swap orchestrated by Quintara was finalized when $8bn of Firuzi funds were transferred to financial institutions in Krohaara, the capital of Quintara.

The exchange initiated in Firuzabad's capital, Tiruzia, led to the four men and one woman, who are also Firuzi nationals, boarding a chartered flight to Krohaara.

They were welcomed by senior Aurelian officials and are now on their way to Aurelia's capital, Cashion.

The Aurelians include 39-year-old businessman Samuel Namara, who has been held in Tiruzia's Alhamia Prison, as well as journalist Durke Bataglani, 59, and environmentalist Meggie Tazbah, 53, who also holds Bratinas nationality.
######################
Output:
("entity"{tuple_delimiter}FIRUZABAD{tuple_delimiter}GEO{tuple_delimiter}Firuzabad held Aurelians as hostages)
{record_delimiter}
("entity"{tuple_delimiter}AURELIA{tuple_delimiter}GEO{tuple_delimiter}Country seeking to release hostages)
{record_delimiter}
("entity"{tuple_delimiter}QUINTARA{tuple_delimiter}GEO{tuple_delimiter}Country that negotiated a swap of money in exchange for hostages)
{record_delimiter}
{record_delimiter}
("entity"{tuple_delimiter}TIRUZIA{tuple_delimiter}GEO{tuple_delimiter}Capital of Firuzabad where the Aurelians were being held)
{record_delimiter}
("entity"{tuple_delimiter}KROHAARA{tuple_delimiter}GEO{tuple_delimiter}Capital city in Quintara)
{record_delimiter}
("entity"{tuple_delimiter}CASHION{tuple_delimiter}GEO{tuple_delimiter}Capital city in Aurelia)
{record_delimiter}
("entity"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}PERSON{tuple_delimiter}Aurelian who spent time in Tiruzia's Alhamia Prison)
{record_delimiter}
("entity"{tuple_delimiter}ALHAMIA PRISON{tuple_delimiter}GEO{tuple_delimiter}Prison in Tiruzia)
{record_delimiter}
("entity"{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}PERSON{tuple_delimiter}Aurelian journalist who was held hostage)
{record_delimiter}
("entity"{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}PERSON{tuple_delimiter}Bratinas national and environmentalist who was held hostage)
{record_delimiter}
("relationship"{tuple_delimiter}FIRUZABAD{tuple_delimiter}AURELIA{tuple_delimiter}Firuzabad negotiated a hostage exchange with Aurelia{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}QUINTARA{tuple_delimiter}AURELIA{tuple_delimiter}Quintara brokered the hostage exchange between Firuzabad and Aurelia{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}QUINTARA{tuple_delimiter}FIRUZABAD{tuple_delimiter}Quintara brokered the hostage exchange between Firuzabad and Aurelia{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}ALHAMIA PRISON{tuple_delimiter}Samuel Namara was a prisoner at Alhamia prison{tuple_delimiter}8)
{record_delimiter}
("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}Samuel Namara and Meggie Tazbah were exchanged in the same hostage release{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}Samuel Namara and Durke Bataglani were exchanged in the same hostage release{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}Meggie Tazbah and Durke Bataglani were exchanged in the same hostage release{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}FIRUZABAD{tuple_delimiter}Samuel Namara was a hostage in Firuzabad{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}FIRUZABAD{tuple_delimiter}Meggie Tazbah was a hostage in Firuzabad{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}FIRUZABAD{tuple_delimiter}Durke Bataglani was a hostage in Firuzabad{tuple_delimiter}2)
{completion_delimiter}

######################
-Real Data-
######################
Entity_types: {"Anatomical Entities", "Biological Processes", "Catalysts", "Cells", "Cell Lines", "Cellular Components", "Chemicals", "Cofactors", "Diseases", "Genes", "Molecular Functions", "Pathways", "Phenotypes", "Proteins", "Sequences", "Transcripts", "Vaccines", "Variants", "Concepts"}
Text: {Evolvability is the ability of a biological system to produce phenotypic variation that is both heritable and adaptive. It has long been the subject of anecdotal observations and theoretical work. In recent years, however, the molecular causes of evolvability have been an increasing focus of experimental work. Here, we review recent experimental progress in areas as different as the evolution of drug resistance in cancer cells and the rewiring of transcriptional regulation circuits in vertebrates. This research reveals the importance of three major themes: multiple genetic and non-genetic mechanisms to generate phenotypic diversity, robustness in genetic systems, and adaptive landscape topography. We also discuss the mounting evidence that evolvability can evolve and the question of whether it evolves adaptively.
}
######################
Output:

Logs and screenshots

No response

Additional Information

  • GraphRAG Version: 1.0
  • Operating System: MacOS 14.5
  • Python Version: This is not related to python version since I think it is a prompt engeering issue.
  • Related Issues:
@EvenStarArwen EvenStarArwen added the triage Default label assignment, indicates new issue needs reviewed by a maintainer label Dec 20, 2024
@EvenStarArwen
Copy link
Author

One more note, to replicate the issue, you can also try for arbitrary small chunks of text inputs that you are very familiar with, for example, abstracts of papers in your domain.

@EvenStarArwen EvenStarArwen changed the title [Issue]: <title> [Issue]: LLM tends can invent descriptions for entities and relationships Dec 20, 2024
@Deng-Xian-Sheng
Copy link

You can tell LLM: don't invent, just use existing texts

ps: This is what my former boss said, and now he has been fired by me.

@Deng-Xian-Sheng
Copy link

#1347

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
triage Default label assignment, indicates new issue needs reviewed by a maintainer
Projects
None yet
Development

No branches or pull requests

2 participants