bug fix for transform when using cuml.hdbscan &calculate_probabilities=True #1543

Closed
wants to merge 1 commit into from

@wwdda wwdda commented Sep 22, 2023

Hi @MaartenGr,

Regarding issue #1463, I ran into the same problem when calling transform() with cuML HDBSCAN and calculate_probabilities=True.

Below is the code for reproducing this issue:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all')['data']
topic_model = BERTopic().fit(docs)
topics, probs = topic_model.transform(docs)

from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
from bertopic.cluster._utils import hdbscan_delegator, is_supported_hdbscan

# Create instances of GPU-accelerated UMAP and HDBSCAN
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

# Pass the above models to be used in BERTopic
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, calculate_probabilities=True)

topic_model = topic_model.fit(docs)

print(type(topic_model.hdbscan_model))
print(is_supported_hdbscan(topic_model.hdbscan_model))
print(topic_model.calculate_probabilities)

topic_model.transform(docs)

I modified the membership_vector clause in hdbscan_delegator, which fixed the issue. I think this change also brings that clause in line with the other two conditional clauses.
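To make the idea concrete, here is a minimal, self-contained sketch of a delegator in which every function name is handled for every backend in the same way. It is not BERTopic's actual hdbscan_delegator: CpuClusterer, GpuClusterer, sketch_delegator, and the placeholder computations are all hypothetical stand-ins, and detecting the backend from the model's type is an assumption made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Rows = List[List[float]]


class CpuClusterer:
    """Hypothetical stand-in for a CPU HDBSCAN model."""


class GpuClusterer:
    """Hypothetical stand-in for a GPU (cuML-style) HDBSCAN model."""


def _backend_of(model: object) -> str:
    # Assumption: the backend is detected from the model's type.
    if isinstance(model, GpuClusterer):
        return "gpu"
    if isinstance(model, CpuClusterer):
        return "cpu"
    raise TypeError(f"Unsupported clustering model: {type(model).__name__}")


def _uniform_membership(rows: Rows, n_clusters: int = 2) -> Rows:
    # Placeholder computation: equal membership in each cluster for every row.
    return [[1.0 / n_clusters] * n_clusters for _ in rows]


def _all_zero_labels(rows: Rows) -> List[int]:
    # Placeholder computation: assign every row to cluster 0.
    return [0 for _ in rows]


# One entry per (backend, function) pair, so no combination is silently missing.
_DISPATCH: Dict[Tuple[str, str], Callable[[Rows], object]] = {
    ("cpu", "approximate_predict"): _all_zero_labels,
    ("gpu", "approximate_predict"): _all_zero_labels,
    ("cpu", "membership_vector"): _uniform_membership,
    ("gpu", "membership_vector"): _uniform_membership,
}


def sketch_delegator(model: object, func: str, embeddings: Rows) -> object:
    """Route `func` to the implementation matching the model's backend."""
    key = (_backend_of(model), func)
    if key not in _DISPATCH:
        # Fail loudly instead of returning None for an unhandled combination.
        raise NotImplementedError(
            f"{func!r} is not available for backend {key[0]!r}"
        )
    return _DISPATCH[key](embeddings)
```

The point of the table is that a missing GPU branch shows up as an explicit error rather than as a confusing failure later in transform().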

For reference, here is my environment:

Operating System: Ubuntu 20.04 LTS
Version: bertopic 0.15.0
Other: installed by pip, python=3.10.13, cuml=23.08(stable)

@MaartenGr
Owner

@wwdda Thanks for the PR! This might be a duplicate of #1324. Could you check whether that PR solves your issue? If so, then I will go ahead and merge that one.

@wwdda
Author

wwdda commented Sep 24, 2023

Thanks for your response, @MaartenGr. Sorry for the duplicate; I didn't notice #1324 when I searched for a solution to the issue.
Indeed, #1324 solves my issue, and its implementation includes batch_size, which is very helpful.
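
As a rough illustration of why a batch_size parameter helps, the sketch below processes the rows in fixed-size chunks and concatenates the results, so only one chunk is in flight at a time. It does not reproduce the code from #1324: apply_in_batches and the toy scoring function are hypothetical, and the assumption is that batch_size caps the number of rows handled per call.

```python
from typing import Callable, List

Row = List[float]


def apply_in_batches(
    fn: Callable[[List[Row]], List[Row]],
    rows: List[Row],
    batch_size: int,
) -> List[Row]:
    """Apply `fn` to `rows` in chunks of at most `batch_size` rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    results: List[Row] = []
    for start in range(0, len(rows), batch_size):
        # Only this slice is passed to `fn`, which bounds peak memory per call.
        results.extend(fn(rows[start:start + batch_size]))
    return results


def toy_scores(batch: List[Row]) -> List[Row]:
    # Hypothetical per-row computation standing in for a membership calculation.
    return [[sum(row)] for row in batch]


embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
batched = apply_in_batches(toy_scores, embeddings, batch_size=2)
```

This only works when the per-row result does not depend on which other rows share its batch, as is the case for the toy function here.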

@MaartenGr MaartenGr closed this Jan 3, 2024