chunking functions now accept new docs as parameters instead of tokens #113

Open
wants to merge 1 commit into
base: main

Conversation

DeoLeung

fix #111

gusye1234 requested a review from rangehow on December 26, 2024 at 10:25
Comment on lines +11 to +29
from ._utils import clean_str
from ._utils import compute_mdhash_id
from ._utils import decode_tokens_by_tiktoken
from ._utils import encode_string_by_tiktoken
from ._utils import is_float_regex
from ._utils import list_of_list_to_csv
from ._utils import logger
from ._utils import pack_user_ass_to_openai_messages
from ._utils import split_string_by_multi_markers
from ._utils import truncate_list_by_token_size
from .base import BaseGraphStorage
from .base import BaseKVStorage
from .base import BaseVectorStorage
from .base import CommunitySchema
from .base import QueryParam
from .base import SingleCommunitySchema
from .base import TextChunkSchema
from .prompt import GRAPH_FIELD_SEP
from .prompt import PROMPTS
Collaborator

Please make this part of the code cleaner.

@rangehow
Collaborator

From my personal perspective, the scope of this PR is much broader than the custom length mentioned in #111. This PR moves most of the logic from the user-invisible get_chunks to the user-customizable chunk_func, offering users a lot of flexibility for customization. However, I am concerned that such a degree of freedom might cause some issues, so I think it would be worth considering adding some mandatory checks in get_chunks for the parameters and return values of chunk_func, and returning a custom warning for non-compliant cases.
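
To make the suggestion concrete, here is a minimal sketch of what such checks might look like. It is not the code in this PR: the single-argument `chunk_func(new_docs)` call, the required chunk keys, the `ChunkFuncWarning` class, and the simplified `get_chunks` signature are all assumptions made for illustration.

```python
import inspect
import warnings
from typing import Any, Callable, Dict, List

# Assumed keys for each chunk record (hypothetical; adjust to the real schema).
REQUIRED_CHUNK_KEYS = {"tokens", "content", "full_doc_id", "chunk_order_index"}


class ChunkFuncWarning(UserWarning):
    """Hypothetical custom warning for a non-compliant chunk_func."""


def check_chunk_func_signature(chunk_func: Callable[..., Any]) -> bool:
    """Return True if chunk_func can be called with one positional argument."""
    try:
        # bind() raises TypeError when the arguments do not fit the signature.
        inspect.signature(chunk_func).bind({})
    except (TypeError, ValueError) as exc:
        warnings.warn(
            f"chunk_func cannot be called as chunk_func(new_docs): {exc}",
            ChunkFuncWarning,
            stacklevel=2,
        )
        return False
    return True


def validate_chunks(chunks: Any) -> List[Dict[str, Any]]:
    """Keep well-formed chunk records and warn about everything else."""
    if not isinstance(chunks, list):
        warnings.warn(
            f"chunk_func returned {type(chunks).__name__}, expected a list",
            ChunkFuncWarning,
            stacklevel=2,
        )
        return []
    valid = []
    for index, chunk in enumerate(chunks):
        if not isinstance(chunk, dict):
            warnings.warn(
                f"chunk {index} is not a dict and was dropped",
                ChunkFuncWarning,
                stacklevel=2,
            )
            continue
        missing = REQUIRED_CHUNK_KEYS.difference(chunk)
        if missing:
            warnings.warn(
                f"chunk {index} is missing keys {sorted(missing)} and was dropped",
                ChunkFuncWarning,
                stacklevel=2,
            )
            continue
        valid.append(chunk)
    return valid


def get_chunks(new_docs: Dict[str, Any], chunk_func: Callable[..., Any]) -> List[Dict[str, Any]]:
    """Simplified stand-in for get_chunks: check the callable, then its output."""
    if not check_chunk_func_signature(chunk_func):
        return []
    return validate_chunks(chunk_func(new_docs))
```

In this sketch, non-compliant chunks are dropped with a warning rather than raising, so one malformed record does not abort the whole insert. Raising an exception instead would be the stricter alternative.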

Development

Successfully merging this pull request may close these issues.

customizing chunking function