TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k. -> _bertopic.py", line 3742, in _reduce_dimensionality #2218

Open
1 task done
miha42-github opened this issue Nov 18, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@miha42-github

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Describe the bug

Description

When generating topics for the attached file, the following traceback was produced:

/home/mediumroast/.local/lib/python3.11/site-packages/umap/spectral.py:521: RuntimeWarning: k >= N for N * N square matrix. Attempting to use scipy.linalg.eigh instead.
  eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(
Traceback (most recent call last):
  File "/home/mediumroast/.local/lib/python3.11/site-packages/bertopic/_bertopic.py", line 3742, in _reduce_dimensionality
    self.umap_model.fit(embeddings, y=y)
  File "/home/mediumroast/.local/lib/python3.11/site-packages/umap/umap_.py", line 2784, in fit
    self.embedding_, aux_data = self._fit_embed_data(
                                ^^^^^^^^^^^^^^^^^^^^^
  File "/home/mediumroast/.local/lib/python3.11/site-packages/umap/umap_.py", line 2830, in _fit_embed_data
    return simplicial_set_embedding(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mediumroast/.local/lib/python3.11/site-packages/umap/umap_.py", line 1107, in simplicial_set_embedding
    embedding = spectral_layout(
                ^^^^^^^^^^^^^^^^
  File "/home/mediumroast/.local/lib/python3.11/site-packages/umap/spectral.py", line 304, in spectral_layout
    return _spectral_layout(
           ^^^^^^^^^^^^^^^^^
  File "/home/mediumroast/.local/lib/python3.11/site-packages/umap/spectral.py", line 521, in _spectral_layout
    eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mediumroast/.local/lib/python3.11/site-packages/scipy/sparse/linalg/_eigen/arpack/arpack.py", line 1608, in eigsh
    raise TypeError("Cannot use scipy.linalg.eigh for sparse A with "
TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mediumroast/mr_caffeine/./caffeine_svc.py", line 131, in <module>
    asyncio.run(main())
  File "/usr/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/mediumroast/mr_caffeine/./caffeine_svc.py", line 103, in main
    await asyncio.gather(*tasks)
  File "/home/mediumroast/mr_caffeine/./caffeine_svc.py", line 31, in run_pipeline
    await processing_pipeline.run_pipeline(tenant, env)
  File "/home/mediumroast/mr_caffeine/lib/pipeline.py", line 286, in run_pipeline
    _intermediate_results = await _interactions_pipe(prepared_interactions[2]['partial_data'], tenant, my_env, u)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mediumroast/mr_caffeine/lib/pipeline.py", line 102, in _interactions_pipe
    processed_interactions = await interaction_processor.run_subpipeline(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mediumroast/mr_caffeine/lib/interactions.py", line 123, in run_subpipeline
    topics, reprocess = await self._generate_topics(interaction)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mediumroast/mr_caffeine/lib/interactions.py", line 67, in _generate_topics
    topics_result, reprocess = await topic_modeler.model_document()
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mediumroast/mr_caffeine/lib/model.py", line 441, in model_document
    return await self._bertopic_model()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mediumroast/mr_caffeine/lib/model.py", line 419, in _bertopic_model
    topics = self._get_topics(document)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mediumroast/mr_caffeine/lib/model.py", line 371, in _get_topics
    topics, probs = model.fit_transform(document)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mediumroast/.local/lib/python3.11/site-packages/bertopic/_bertopic.py", line 449, in fit_transform
    umap_embeddings = self._reduce_dimensionality(embeddings, y)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mediumroast/.local/lib/python3.11/site-packages/bertopic/_bertopic.py", line 3744, in _reduce_dimensionality
    self.umap_model.fit(embeddings)
  File "/home/mediumroast/.local/lib/python3.11/site-packages/umap/umap_.py", line 2784, in fit
    self.embedding_, aux_data = self._fit_embed_data(
                                ^^^^^^^^^^^^^^^^^^^^^
  File "/home/mediumroast/.local/lib/python3.11/site-packages/umap/umap_.py", line 2830, in _fit_embed_data
    return simplicial_set_embedding(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mediumroast/.local/lib/python3.11/site-packages/umap/umap_.py", line 1107, in simplicial_set_embedding
    embedding = spectral_layout(
                ^^^^^^^^^^^^^^^^
  File "/home/mediumroast/.local/lib/python3.11/site-packages/umap/spectral.py", line 304, in spectral_layout
    return _spectral_layout(
           ^^^^^^^^^^^^^^^^^
  File "/home/mediumroast/.local/lib/python3.11/site-packages/umap/spectral.py", line 521, in _spectral_layout
    eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mediumroast/.local/lib/python3.11/site-packages/scipy/sparse/linalg/_eigen/arpack/arpack.py", line 1608, in eigsh
    raise TypeError("Cannot use scipy.linalg.eigh for sparse A with "
TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.
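
The final TypeError in the traceback can be reproduced outside BERTopic and UMAP. The sketch below is a minimal illustration: the 3 x 3 matrix and the helper name `try_eigsh` are hypothetical, and the matrix only stands in for whatever UMAP builds internally when it receives very few data points. It shows that `scipy.sparse.linalg.eigsh` refuses a sparse input when the requested number of eigenvectors `k` is not smaller than the matrix size `N`.

```python
import warnings

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import eigsh

# Hypothetical stand-in: the Laplacian of a 3-node path graph, stored sparse.
# With only three rows, N = 3.
laplacian = csr_matrix(np.array([
    [ 1.0, -1.0,  0.0],
    [-1.0,  2.0, -1.0],
    [ 0.0, -1.0,  1.0],
]))


def try_eigsh(matrix, k):
    """Return the TypeError message eigsh raises for this k, or None on success."""
    with warnings.catch_warnings():
        # eigsh first emits a RuntimeWarning about k >= N; silence it here.
        warnings.simplefilter("ignore", RuntimeWarning)
        try:
            eigsh(matrix, k=k)
        except TypeError as exc:
            return str(exc)
    return None


too_many = try_eigsh(laplacian, k=3)    # k == N: rejected for sparse input
few_enough = try_eigsh(laplacian, k=1)  # k < N: accepted

print("k=3:", too_many)
print("k=1:", few_enough)
```

If the same condition is what UMAP hits here, the practical remedy is more input rows (documents) rather than a change to SciPy.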

Attachment

A PDF representation of a web page, with text extracted and cleaned using Unstructured.io tooling wrapped in LangChain.
How to use Jira Product Discovery - Official guide.pdf

Reproduction

import time

from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from sklearn.feature_extraction.text import CountVectorizer
from langchain_community.document_loaders import UnstructuredFileLoader
from unstructured.cleaners.core import (
    clean, clean_bullets, clean_dashes, clean_ligatures,
    clean_non_ascii_chars, clean_extra_whitespace,
    replace_mime_encodings, replace_unicode_quotes
)

def get_topics(document):
    start_time = time.time()
    # Instantiate the BERTopic model
    model = BERTopic(
        vectorizer_model=CountVectorizer(stop_words="english"),
        verbose=False,
        n_gram_range=(3, 5),
        representation_model=MaximalMarginalRelevance(),
        min_topic_size=2,
        nr_topics=10,
    )

    # Train the model
    topics, probs = model.fit_transform(document)

    # Get the topics
    results = model.get_topics()
    topics_with_name = model.get_topic_info()  # Returns a pandas DataFrame
    duration = time.time() - start_time
    print(f"Modeled topics in [{duration}] seconds.")
    return topics_with_name.to_dict()  # Convert the DataFrame to a dict

def get_clean_text_langchain(path_to_file):
    file_loader = UnstructuredFileLoader(
        file_path=path_to_file,
        mode='elements',
        post_processors=[
            clean,
            clean_extra_whitespace,
            clean_bullets,
            clean_dashes,
            clean_ligatures,
            clean_non_ascii_chars,
            replace_mime_encodings,
            replace_unicode_quotes
        ]
    )
    # Load the document
    file_elements = file_loader.load()
    # Obtain the important text from the partition
    selected_elements = [e for e in file_elements if e.metadata['category'] == "NarrativeText"]
    # Combine the text
    full_clean = " ".join([e.page_content for e in selected_elements])

    # Finished extracting text; return it to the caller
    return full_clean

# Text extracted using LangChain and Unstructured
cleaned_text = get_clean_text_langchain('/path/to/file')
topics = get_topics(cleaned_text)

BERTopic Version

0.16.4

@miha42-github added the bug label Nov 18, 2024
@MaartenGr
Owner

It seems to be an issue related to UMAP. How many documents are you passing to BERTopic? If it's just 1, then that makes sense. You would have to pass at least hundreds of documents.
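
One way to act on this is to split the extracted text into many smaller documents and check the count before fitting. The following is only a sketch under stated assumptions: the helper names `split_into_documents` and `fit_topics`, the naive regex sentence split, and the `MIN_DOCS` threshold of 100 are all hypothetical choices for illustration, not part of BERTopic.

```python
import re

# Hypothetical threshold, loosely echoing "hundreds of documents"; tune as needed.
MIN_DOCS = 100


def split_into_documents(text: str) -> list[str]:
    """Naively split one long string into sentence-level documents."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def fit_topics(text: str):
    """Fit BERTopic on sentence-level documents, refusing inputs that are too small."""
    docs = split_into_documents(text)
    if len(docs) < MIN_DOCS:
        raise ValueError(
            f"Only {len(docs)} documents after splitting; "
            f"expected at least {MIN_DOCS} before fitting."
        )
    # Imported lazily so the splitting helper can be used without BERTopic installed.
    from bertopic import BERTopic

    model = BERTopic(min_topic_size=2)
    return model.fit_transform(docs)


print(split_into_documents("First sentence. Second one! A third?"))
```

Failing early with a clear message keeps the short-input case from surfacing as an opaque error deep inside UMAP and SciPy.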

    model = BERTopic(
        vectorizer_model=CountVectorizer(stop_words="english"),
        verbose=False,
        n_gram_range=(3,5),
        representation_model=MaximalMarginalRelevance(),
        min_topic_size=2,
        nr_topics=10,
    )

Note that if you are using vectorizer_model, then n_gram_range will not be used; you will have to add that parameter to your vectorizer_model instead (for CountVectorizer, it is called ngram_range).
