Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardize DocumentLoader docstrings and integration docs #22866

Closed
1 task done
isahers1 opened this issue Jun 13, 2024 · 2 comments
Closed
1 task done

Standardize DocumentLoader docstrings and integration docs #22866

isahers1 opened this issue Jun 13, 2024 · 2 comments
Labels
🤖:docs Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder help wanted Good issue for contributors

Comments

@isahers1
Copy link
Collaborator

Privileged issue

  • I am a LangChain maintainer, or was asked directly by a LangChain maintainer to create an issue here.

Issue Content

Issue

To make our document loader integrations as easy to use as possible we need to make sure the docs for them are thorough and standardized. There are two parts to this: updating the document loader docstrings and updating the actual integration docs.

This needs to be done for each DocumentLoader integration, ideally with one PR per DocumentLoader.

Related to broader issues #21983 and #22005.

Docstrings

Each DocumentLoader class docstring should have the sections shown in the Appendix below. The sections should have input and output code blocks when relevant. See RecursiveUrlLoader docstrings and corresponding API reference for an example.

Doc Pages

Each DocumentLoader docs page should follow this template. See RecursiveUrlLoader for an example.

You can use the langchain-cli to quickly get started with a new document loader integration docs page (run from root of repo):

poetry run pip install -e libs/cli
poetry run langchain-cli integration create-doc --name "foo-bar" --name-class FooBar --component-type "DocumentLoader" --destination-dir ./docs/docs/integrations/document_loaders/

where --name is the integration package name without the "langchain-" prefix and --name-class is the class name without the "Loader" suffix. This will create a template doc with some autopopulated fields at docs/docs/integrations/document_loaders/foo_bar.ipynb.

To build a preview of the docs you can run (from root):

make docs_clean
make docs_build
cd docs/build/output-new
yarn
yarn start

Appendix

"""
ModuleName document loader integration

# TODO: Replace with relevant packages, env vars.
Setup:
    Install ``__package_name__`` and set environment variable ``__MODULE_NAME___API_KEY``.

    .. code-block:: bash

        pip install -U __package_name__
        export __MODULE_NAME___API_KEY="your-api-key"

# TODO: Replace with relevant init params.
Instantiate:
    .. code-block:: python

        from __module_name__ import __ModuleName__Loader

        loader = __ModuleName__Loader(
            url = "https://docs.python.org/3.9/",
            # otherparams = ...
        )

Load:
    .. code-block:: python

        docs = loader.load()
        print(docs[0].page_content[:100])
        print(docs[0].metadata)

    .. code-block:: python

        TODO: Example output

# TODO: Delete if async load is not implemented
Async load:
    .. code-block:: python

        docs = await loader.aload()
        print(docs[0].page_content[:100])
        print(docs[0].metadata)

    .. code-block:: python

        TODO: Example output

Lazy load:
    .. code-block:: python

        docs = []
        docs_lazy = loader.lazy_load()

        # async variant:
        # docs_lazy = await loader.alazy_load()

        for doc in docs_lazy:
            docs.append(doc)
        print(docs[0].page_content[:100])
        print(docs[0].metadata)

    .. code-block:: python

        TODO: Example output  
"""
@dosubot dosubot bot added the 🤖:docs Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder label Jun 13, 2024
@isahers1 isahers1 added the help wanted Good issue for contributors label Jun 14, 2024
@lucas-tucker
Copy link
Contributor

I will work on this today

hwchase17 pushed a commit that referenced this issue Jun 18, 2024
**Standardizing DocumentLoader docstrings (of which there are many)**

This PR addresses issue #22866 and adds docstrings according to the
issue's specified format (in the appendix) for files csv_loader.py and
json_loader.py in langchain_community.document_loaders. In particular,
the following sections have been added to both CSVLoader and JSONLoader:
Setup, Instantiate, Load, Async load, and Lazy load. It may be worth
adding a 'Metadata' section to the JSONLoader docstring to clarify how
we want to extract the JSON metadata (using the `metadata_func`
argument). The files I used to walkthrough the various sections were
`example_2.json` from
[HERE](https://support.oneskyapp.com/hc/en-us/articles/208047697-JSON-sample-files)
and `hw_200.csv` from
[HERE](https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html).

---------

Co-authored-by: lucast2021 <[email protected]>
Co-authored-by: isaac hershenson <[email protected]>
hinthornw pushed a commit that referenced this issue Jun 20, 2024
**Standardizing DocumentLoader docstrings (of which there are many)**

This PR addresses issue #22866 and adds docstrings according to the
issue's specified format (in the appendix) for files csv_loader.py and
json_loader.py in langchain_community.document_loaders. In particular,
the following sections have been added to both CSVLoader and JSONLoader:
Setup, Instantiate, Load, Async load, and Lazy load. It may be worth
adding a 'Metadata' section to the JSONLoader docstring to clarify how
we want to extract the JSON metadata (using the `metadata_func`
argument). The files I used to walkthrough the various sections were
`example_2.json` from
[HERE](https://support.oneskyapp.com/hc/en-us/articles/208047697-JSON-sample-files)
and `hw_200.csv` from
[HERE](https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html).

---------

Co-authored-by: lucast2021 <[email protected]>
Co-authored-by: isaac hershenson <[email protected]>
arunraja1 added a commit to skypointcloud/skypoint-langchain that referenced this issue Jun 27, 2024
* partners: fix numpy dep (#22858)

Following https://github.com/langchain-ai/langchain/pull/22813, which
added python 3.12 to CI, here we update numpy accordingly in partner
packages.

* [docs]: added info for TavilySearchResults (#22765)

* experimental[major]: Force users to opt-in into code that relies on the python repl (#22860)

This should make it obvious that a few of the agents in langchain
experimental rely on the python REPL as a tool under the hood, and will
force users to opt-in.

* community[patch]: FAISS VectorStore deserializer should be opt-in (#22861)

FAISS deserializer uses pickle module. Users have to opt-in to
de-serialize.

* ci: Add script to check for pickle usage in community (#22863)

Add script to check for pickle usage in community.

* experimental[patch]/docs[patch]: Update links to security docs (#22864)

Minor update to newest version of security docs (content should be
identical).

* core: In astream_events v2 propagate cancel/break to the inner astream call (#22865)

- previous behavior was for the inner astream to continue running with
no interruption
- also propagate break in core runnable methods

* core[patch]: Treat type as a special field when merging lists (#22750)

Should we even log a warning? At least for Anthropic, it's expected to
get e.g. `text_block` followed by `text_delta`.

@ccurme @baskaryan @efriis

* core: release 0.2.6 (#22868)

* langchain: release 0.2.4 (#22872)

* Fix: lint errors and update Field alias in models.py and AutoSelectionScorer initialization (#22846)

This PR addresses several lint errors in the core package of LangChain.
Specifically, the following issues were fixed:

1.Unexpected keyword argument "required" for "Field"  [call-arg]
2.tests/integration_tests/chains/test_cpal.py:263: error: Unexpected
keyword argument "narrative_input" for "QueryModel" [call-arg]

* Fix typo in vearch.md (#22840)

Fix typo

* docs: s/path_images/images/ for ImageCaptionLoader keyword arguments (#22857)

Quick update to `ImageCaptionLoader` documentation to reflect what's in
code.

* docs: update NVIDIA Riva tool to use NVIDIA NIM for LLM (#22873)

**Description:**
Update the NVIDIA Riva tool documentation to use NVIDIA NIM for the LLM.
Show how to use NVIDIA NIMs and link to documentation for LangChain with
NIM.

---------

Co-authored-by: Hayden Wolff <[email protected]>
Co-authored-by: Isaac Francisco <[email protected]>

* docs, cli[patch]: document loaders doc template (#22862)

From: https://github.com/langchain-ai/langchain/pull/22290

---------

Co-authored-by: Eugene Yurtsev <[email protected]>

* cli[patch]: Release 0.0.25 (#22876)

* infra: lint new docs to match doc loader template (#22867)

* docs: fixes for Elasticsearch integrations, cache doc and providers list (#22817)

Some minor fixes in the documentation:
 - ElasticsearchCache initilization is now correct
 - List of integrations for ES updated

* docs: `ReAct` reference (#22830)

The `ReAct` is used all across LangChain but it is not referenced
properly.
Added references to the original paper.

* community[minor]: Implement ZhipuAIEmbeddings interface (#22821)

- **Description:** Implement ZhipuAIEmbeddings interface, include:
     - The `embed_query` method
     - The `embed_documents` method

refer to [ZhipuAI
Embedding-2](https://open.bigmodel.cn/dev/api#text_embedding)

---------

Co-authored-by: Eugene Yurtsev <[email protected]>

* docs: Astra DB vectorstore, adjust syntax for automatic-embedding example (#22833)

Description: Adjusting the syntax for creating the vectorstore
collection (in the case of automatic embedding computation) for the most
idiomatic way to submit the stored secret name.

Co-authored-by: Bagatur <[email protected]>

* community[minor]: Prem Templates (#22783)

This PR adds the feature add Prem Template feature in ChatPremAI.
Additionally it fixes a minor bug for API auth error when API passed
through arguments.

* qdrant[patch]: Use collection_exists API instead of exceptions (#22764)

## Description

Currently, the Qdrant integration relies on exceptions raised by
[`get_collection`
](https://qdrant.tech/documentation/concepts/collections/#collection-info)
to check if a collection exists.

Using
[`collection_exists`](https://qdrant.tech/documentation/concepts/collections/#check-collection-existence)
is recommended to avoid missing any unhandled exceptions. This PR
addresses this.

## Testing
All integration and unit tests pass. No user-facing changes.

* docs: Standardize ChatGroq (#22751)

Updated ChatGroq doc string as per issue
https://github.com/langchain-ai/langchain/issues/22296:"langchain_groq:
updated docstring for ChatGroq in langchain_groq to match that of the
description (in the appendix) provided in issue
https://github.com/langchain-ai/langchain/issues/22296. "

Issue: This PR is in response to issue
https://github.com/langchain-ai/langchain/issues/22296, and more
specifically the ChatGroq model. In particular, this PR updates the
docstring for langchain/libs/partners/groq/langchain_groq/chat_model.py
by adding the following sections: Instantiate, Invoke, Stream, Async,
Tool calling, Structured Output, and Response metadata. I used the
template from the Anthropic implementation and referenced the Appendix
of the original issue post. I also noted that: `usage_metadata `returns
none for all ChatGroq models I tested; there is no mention of image
input in the ChatGroq documentation; unlike that of ChatHuggingFace,
`.stream(messages)` for ChatGroq returned blocks of output.

---------

Co-authored-by: lucast2021 <[email protected]>
Co-authored-by: Bagatur <[email protected]>

* anthropic[patch]: always add tool_result type to ToolMessage content (#22721)

Anthropic tool results can contain image data, which are typically
represented with content blocks having `"type": "image"`. Currently,
these content blocks are passed as-is as human/user messages to
Anthropic, which raises BadRequestError as it expects a tool_result
block to follow a tool_use.

Here we update ChatAnthropic to nest the content blocks inside a
tool_result content block.

Example:
```python
import base64

import httpx
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import AIMessage, HumanMessage, ToolMessage
from langchain_core.pydantic_v1 import BaseModel, Field


# Fetch image
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
image_data = base64.b64encode(httpx.get(image_url).content).decode("utf-8")


class FetchImage(BaseModel):
    should_fetch: bool = Field(..., description="Whether an image is requested.")


llm = ChatAnthropic(model="claude-3-sonnet-20240229").bind_tools([FetchImage])

messages = [
    HumanMessage(content="Could you summon a beautiful image please?"),
    AIMessage(
        content=[
            {
                "type": "tool_use",
                "id": "toolu_01Rn6Qvj5m7955x9m9Pfxbcx",
                "name": "FetchImage",
                "input": {"should_fetch": True},
            },
        ],
        tool_calls=[
            {
                "name": "FetchImage",
                "args": {"should_fetch": True},
                "id": "toolu_01Rn6Qvj5m7955x9m9Pfxbcx",
            },
        ],
    ),
    ToolMessage(
        name="FetchImage",
        content=[
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",
                    "data": image_data,
                },
            },
        ],
        tool_call_id="toolu_01Rn6Qvj5m7955x9m9Pfxbcx",
    ),
]

llm.invoke(messages)
```

Trace:
https://smith.langchain.com/public/d27e4fc1-a96d-41e1-9f52-54f5004122db/r

* docs[patch]: Expand embeddings docs (#22881)

* docs: generate table for document loaders (#22871)

Co-authored-by: Bagatur <[email protected]>

* docs: doc loader feat table alignment (#22900)

* community[minor]: add chat model llamacpp (#22589)

- **PR title**: [community] add chat model llamacpp


- **PR message**:
- **Description:** This PR introduces a new chat model integration with
llamacpp_python, designed to work similarly to the existing ChatOpenAI
model.
      + Work well with instructed chat, chain and function/tool calling.
+ Work with LangGraph (persistent memory, tool calling), will update
soon

- **Dependencies:** This change requires the llamacpp_python library to
be installed.
    
@baskaryan

---------

Co-authored-by: Bagatur <[email protected]>
Co-authored-by: Bagatur <[email protected]>

* docs: Fix typo in tutorial about structured data extraction (#22888)

[Fixed typo](docs: Fix typo in tutorial about structured data
extraction)

* [Community]: HuggingFaceCrossEncoder `score` accounting for <not-relevant score,relevant score> pairs. (#22578)

- **Description:** Some of the Cross-Encoder models provide scores in
pairs, i.e., <not-relevant score (higher means the document is less
relevant to the query), relevant score (higher means the document is
more relevant to the query)>. However, the `HuggingFaceCrossEncoder`
`score` method does not currently take into account the pair situation.
This PR addresses this issue by modifying the method to consider only
the relevant score if score is being provided in pair. The reason for
focusing on the relevant score is that the compressors select the top-n
documents based on relevance.
    - **Issue:** #22556 
- Please also refer to this
[comment](https://github.com/UKPLab/sentence-transformers/issues/568#issuecomment-729153075)

* fireworks[patch]: add usage_metadata to (a)invoke and (a)stream (#22906)

* anthropic[minor]: Adds streaming tool call support for Anthropic (#22687)

Preserves string content chunks for non tool call requests for
convenience.

One thing - Anthropic events look like this:

```
RawContentBlockStartEvent(content_block=TextBlock(text='', type='text'), index=0, type='content_block_start')
RawContentBlockDeltaEvent(delta=TextDelta(text='<thinking>\nThe', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockDeltaEvent(delta=TextDelta(text=' provide', type='text_delta'), index=0, type='content_block_delta')
...
RawContentBlockStartEvent(content_block=ToolUseBlock(id='toolu_01GJ6x2ddcMG3psDNNe4eDqb', input={}, name='get_weather', type='tool_use'), index=1, type='content_block_start')
RawContentBlockDeltaEvent(delta=InputJsonDelta(partial_json='', type='input_json_delta'), index=1, type='content_block_delta')
```

Note that `delta` has a `type` field. With this implementation, I'm
dropping it because `merge_list` behavior will concatenate strings.

We currently have `index` as a special field when merging lists, would
it be worth adding `type` too?

If so, what do we set as a context block chunk? `text` vs.
`text_delta`/`tool_use` vs `input_json_delta`?

CC @ccurme @efriis @baskaryan

* core[patch]: fix validation of @deprecated decorator (#22513)

This PR moves the validation of the decorator to a better place to avoid
creating bugs while deprecating code.

Prevent issues like this from arising:
https://github.com/langchain-ai/langchain/issues/22510

we should replace with a linter at some point that just does static
analysis

* community[patch]: SitemapLoader restrict depth of parsing sitemap (CVE-2024-2965) (#22903)

This PR restricts the depth to which the sitemap can be parsed.

Fix for: CVE-2024-2965

* community[major], experimental[patch]: Remove Python REPL from community (#22904)

Remove the REPL from community, and suggest an alternative import from
langchain_experimental.

Fix for this issue:
https://github.com/langchain-ai/langchain/issues/14345

This is not a bug in the code or an actual security risk. The python
REPL itself is behaving as expected.

The PR is done to appease blanket security policies that are just
looking for the presence of exec in the code.

---------

Co-authored-by: Erick Friis <[email protected]>

* community[minor]: Fix long_context_reorder.py async (#22839)

Implement `async def atransform_documents( self, documents:
Sequence[Document], **kwargs: Any ) -> Sequence[Document]` for
`LongContextReorder`

* core[patch]: Fix FunctionCallbackHandler._on_tool_end (#22908)

If the global `debug` flag is enabled, the agent will get the following
error in `FunctionCallbackHandler._on_tool_end` at runtime.

```
Error in ConsoleCallbackHandler.on_tool_end callback: AttributeError("'list' object has no attribute 'strip'")
```

By calling str() before strip(), the error was avoided.
This error can be seen at
[debugging.ipynb](https://github.com/langchain-ai/langchain/blob/master/docs/docs/how_to/debugging.ipynb).

- Issue: NA
- Dependencies: NA
- Twitter handle: https://x.com/kiarina37

* dcos: Add admonition to PythonREPL tool (#22909)

Add admonition to the documentation to make sure users are aware that
the tool allows execution of code on the host machine using a python
interpreter (by design).

* docs: add groq to chatmodeltabs (#22913)

* experimental: LLMGraphTransformer - added relationship properties.  (#21856)

- **Description:** 
The generated relationships in the graph had no properties, but the
Relationship class was properly defined with properties. This made it
very difficult to transform conditional sentences into a graph. Adding
properties to relationships can solve this issue elegantly.
The changes expand on the existing LLMGraphTransformer implementation
but add the possibility to define allowed relationship properties like
this: LLMGraphTransformer(llm=llm, relationship_properties=["Condition",
"Time"],)
- **Issue:** 
    no issue found
 - **Dependencies:**
    n/a
- **Twitter handle:** 
    @IstvanSpace


-Quick Test
=================================================================
from dotenv import load_dotenv
import os
from langchain_community.graphs import Neo4jGraph
from langchain_experimental.graph_transformers import
LLMGraphTransformer
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.documents import Document

load_dotenv()
os.environ["NEO4J_URI"] = os.getenv("NEO4J_URI")
os.environ["NEO4J_USERNAME"] = os.getenv("NEO4J_USERNAME")
os.environ["NEO4J_PASSWORD"] = os.getenv("NEO4J_PASSWORD")
graph = Neo4jGraph()
llm = ChatOpenAI(temperature=0, model_name="gpt-4o")
llm_transformer = LLMGraphTransformer(llm=llm)
#text = "Harry potter likes pies, but only if it rains outside"
text = "Jack has a dog named Max. Jack only walks Max if it is sunny
outside."
documents = [Document(page_content=text)]
llm_transformer_props = LLMGraphTransformer(
    llm=llm,
    relationship_properties=["Condition"],
)
graph_documents_props =
llm_transformer_props.convert_to_graph_documents(documents)
print(f"Nodes:{graph_documents_props[0].nodes}")
print(f"Relationships:{graph_documents_props[0].relationships}")
graph.add_graph_documents(graph_documents_props)

---------

Co-authored-by: Istvan Lorincz <[email protected]>
Co-authored-by: Bagatur <[email protected]>

* core: in astream_events v2 always await task even if already finished (#22916)

- this ensures exceptions propagate to the caller

* core: release 0.2.7 (#22917)

* infra: remove nvidia from monorepo scheduled tests (#22915)

Scheduled tests run in
https://github.com/langchain-ai/langchain-nvidia/tree/main

* docs: Fix wrongly referenced class name in confluence.py (#22879)

Fixes #22542

Changed ConfluenceReader to ConfluenceLoader

* templates: remove lockfiles (#22920)

poetry will default to latest versions without

* langchain: release 0.2.5 (#22922)

* text-splitters[patch]: Fix HTMLSectionSplitter (#22812)

Update former pull request:
https://github.com/langchain-ai/langchain/pull/22654.

Modified `langchain_text_splitters.HTMLSectionSplitter`, where in the
latest version `dict` data structure is used to store sections from a
html document, in function `split_html_by_headers`. The header/section
element names serve as dict keys. This can be a problem when duplicate
header/section element names are present in a single html document.
Latter ones can replace former ones with the same name. Therefore some
contents can be miss after html text splitting is conducted.

Using a list to store sections can hopefully solve the problem. A Unit
test considering duplicate header names has been added.

---------

Co-authored-by: Bagatur <[email protected]>

* community: release 0.2.5 (#22923)

* cli[minor]: remove redefined DEFAULT_GIT_REF (#21471)

remove redefined DEFAULT_GIT_REF

Co-authored-by: Isaac Francisco <[email protected]>

* experimental: release 0.0.61 (#22924)

* docs: add ollama json mode (#22926)

fixes #22910

* community: 'Solve the issue where the _search function in ElasticsearchStore supports passing a query_vector parameter, but the parameter does not take effect. (#21532)

**Issue:**
When using the similarity_search_with_score function in
ElasticsearchStore, I expected to pass in the query_vector that I have
already obtained. I noticed that the _search function does support the
query_vector parameter, but it seems to be ineffective. I am attempting
to resolve this issue.

Co-authored-by: Isaac Francisco <[email protected]>

* Update ollama.py with optional raw setting. (#21486)

Ollama has a raw option now. 

https://github.com/ollama/ollama/blob/main/docs/api.md

Thank you for contributing to LangChain!

- [ ] **PR title**: "package: description"
- Where "package" is whichever of langchain, community, core,
experimental, etc. is being modified. Use "docs: ..." for purely docs
changes, "templates: ..." for template changes, "infra: ..." for CI
changes.
  - Example: "community: add foobar LLM"


- [ ] **PR message**: ***Delete this entire checklist*** and replace
with
    - **Description:** a description of the change
    - **Issue:** the issue # it fixes, if applicable
    - **Dependencies:** any dependencies required for this change
- **Twitter handle:** if your PR gets announced, and you'd like a
mention, we'll gladly shout you out!


- [ ] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory.


- [ ] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, hwchase17.

---------

Co-authored-by: Isaac Francisco <[email protected]>
Co-authored-by: isaac hershenson <[email protected]>

* docs:Fix mispelling in streaming doc (#22936)

Description: Fix mispelling
Issue: None
Dependencies: None
Twitter handle: None

Co-authored-by: qcloud <[email protected]>

* docs: update ZhipuAI ChatModel docstring (#22934)

- **Description:** Update ZhipuAI ChatModel rich docstring
- **Issue:** the issue #22296

* Improve llm graph transformer docstring (#22939)

* infra: update integration test workflow (#22945)

* community(you): Better support for You.com News API (#22622)

## Description
While `YouRetriever` supports both You.com's Search and News APIs, news
is supported as an afterthought.
More specifically, not all of the News API parameters are exposed for
the user, only those that happen to overlap with the Search API.

This PR:
- improves support for both APIs, exposing the remaining News API
parameters while retaining backward compatibility
- refactor some REST parameter generation logic
- updates the docstring of `YouSearchAPIWrapper`
- add input validation and warnings to ensure parameters are properly
set by user
- 🚨 Breaking: Limit the news results to `k` items

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

* docs: nim model name update (#22943)

NIM Model name change in a notebook and mdx file.

Thanks!

* standard-tests[patch]: don't require str chunk contents (#22965)

* Update sql_qa.ipynb (#22966)

fixes #22798 
fixes #22963

* docs: update databricks.ipynb (#22949)

arbitary -> arbitrary

* docs: Standardise formatting (#22948)

Standardised formatting 


![image](https://github.com/langchain-ai/langchain/assets/73015364/ea3b5c5c-e7a6-4bb7-8c6b-e7d8cbbbf761)

* [Partner]: Add metadata to stream response (#22716)

Adds `response_metadata` to stream responses from OpenAI. This is
returned with `invoke` normally, but wasn't implemented for `stream`.

---------

Co-authored-by: Chester Curme <[email protected]>

* standard-tests[patch]: Release 0.1.1 (#22984)

* docs: Update llamacpp ntbk (#22907)

Co-authored-by: Bagatur <[email protected]>

* community[minor]: add `ChatSnowflakeCortex` chat model (#21490)

**Description:** This PR adds a chat model integration for [Snowflake
Cortex](https://docs.snowflake.com/en/user-guide/snowflake-cortex/llm-functions),
which gives an instant access to industry-leading large language models
(LLMs) trained by researchers at companies like Mistral, Reka, Meta, and
Google, including [Snowflake
Arctic](https://www.snowflake.com/en/data-cloud/arctic/), an open
enterprise-grade model developed by Snowflake.

**Dependencies:** Snowflake's
[snowpark](https://pypi.org/project/snowflake-snowpark-python/) library
is required for using this integration.

**Twitter handle:** [@gethouseware](https://twitter.com/gethouseware)

- [x] **Add tests and docs**:
1. integration tests:
`libs/community/tests/integration_tests/chat_models/test_snowflake.py`
2. unit tests:
`libs/community/tests/unit_tests/chat_models/test_snowflake.py`
  3. example notebook: `docs/docs/integrations/chat/snowflake.ipynb`


- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

* openai[patch]: add stream_usage parameter (#22854)

Here we add `stream_usage` to ChatOpenAI as:

1. a boolean attribute
2. a kwarg to _stream and _astream.

Question: should the `stream_usage` attribute be `bool`, or `bool |
None`?

Currently I've kept it `bool` and defaulted to False. It was implemented
on
[ChatAnthropic](https://github.com/langchain-ai/langchain/blob/e832bbb48627aa9f00614e82e7ace60b7d8957c6/libs/partners/anthropic/langchain_anthropic/chat_models.py#L535)
as a bool. However, to maintain support for users who access the
behavior via OpenAI's `stream_options` param, this ends up being
possible:
```python
llm = ChatOpenAI(model_kwargs={"stream_options": {"include_usage": True}})
assert not llm.stream_usage
```
(and this model will stream token usage).

Some options for this:
- it's ok
- make the `stream_usage` attribute bool or None
- make an \_\_init\_\_ for ChatOpenAI, set a `._stream_usage` attribute
and read `.stream_usage` from a property

Open to other ideas as well.

* community: Add Baichuan Embeddings batch size (#22942)

- **Support batch size** 
Baichuan updates the document, indicating that up to 16 documents can be
imported at a time

- **Standardized model init arg names**
    - baichuan_api_key -> api_key
    - model_name  -> model

* Add RAG to conceptual guide (#22790)

Co-authored-by: jacoblee93 <[email protected]>

* docs: update universal init title (#22990)

* community[minor]: add tool calling for DeepInfraChat (#22745)

DeepInfra now supports tool calling for supported models.

---------

Co-authored-by: Bagatur <[email protected]>

* docs[patch]: Reorder streaming guide, add tags (#22993)

CC @hinthornw

* docs: Add some 3rd party tutorials (#22931)

Langchain is very popular among developers in China, but there are still
no good Chinese books or documents, so I want to add my own Chinese
resources on langchain topics, hoping to give Chinese readers a better
experience using langchain. This is not a translation of the official
langchain documentation, but my understanding.

---------

Co-authored-by: ccurme <[email protected]>

* standard-tests[patch]: Update chat model standard tests (#22378)

- Refactor standard test classes to make them easier to configure
- Update openai to support stop_sequences init param
- Update groq to support stop_sequences init param
- Update fireworks to support max_retries init param
- Update ChatModel.bind_tools to type tool_choice
- Update groq to handle tool_choice="any". **this may be controversial**

---------

Co-authored-by: Chester Curme <[email protected]>

* core: run_in_executor: Wrap StopIteration in RuntimeError (#22997)

- StopIteration can't be set on an asyncio.Future it raises a TypeError
and leaves the Future pending forever so we need to convert it to a
RuntimeError

* infra: test all dependents on any change (#22994)

* core[patch]: Release 0.2.8 (#23012)

* community: OCI GenAI embedding batch size (#22986)

Thank you for contributing to LangChain!

- [x] **PR title**: "community: OCI GenAI embedding batch size"



- [x] **PR message**:
    - **Issue:** #22985 


- [ ] **Add tests and docs**: N/A


- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

---------

Signed-off-by: Anders Swanson <[email protected]>
Co-authored-by: Chester Curme <[email protected]>

* docs: add bing search integration to agent (#22929)

- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

* core[minor]: message transformer utils (#22752)

* docs[patch]: Update docs links (#23013)

* docs[patch]: Adds evaluation sections (#23050)

Also want to add an index/rollup page to LangSmith docs to enable
linking to a how-to category as a group (e.g.
https://docs.smith.langchain.com/how_to_guides/evaluation/)

CC @agola11 @hinthornw

* docs: Update how to docs for pydantic compatibility (#22983)

Add missing imports in docs from langchain_core.tools  BaseTool

---------

Co-authored-by: Eugene Yurtsev <[email protected]>

* Include "no escape" and "inverted section" mustache vars in Prompt.input_variables and Prompt.input_schema (#22981)

* [Community]: FIxed the DocumentDBVectorSearch `_similarity_search_without_score` (#22970)

- **Description:** The PR #22777 introduced a bug in
`_similarity_search_without_score` which was raising the
`OperationFailure` error. The mistake was syntax error for MongoDB
pipeline which has been corrected now.
    - **Issue:** #22770

* community: Fix #22975 (Add SSL Verification Option to Requests Class in langchain_community) (#22977)

- **PR title**: "community: Fix #22975 (Add SSL Verification Option to
Requests Class in langchain_community)"
- **PR message**: 
    - **Description:**
- Added an optional verify parameter to the Requests class with a
default value of True.
- Modified the get, post, patch, put, and delete methods to include the
verify parameter.
- Updated the _arequest async context manager to include the verify
parameter.
- Added the verify parameter to the GenericRequestsWrapper class and
passed it to the Requests class.
    - **Issue:** This PR fixes issue #22975.
- **Dependencies:** No additional dependencies are required for this
change.
    - **Twitter handle:** @lunara_x

You can check this change with below code.
```python
from langchain_openai.chat_models import ChatOpenAI
from langchain.requests import RequestsWrapper
from langchain_community.agent_toolkits.openapi import planner
from langchain_community.agent_toolkits.openapi.spec import reduce_openapi_spec

with open("swagger.yaml") as f:
    data = yaml.load(f, Loader=yaml.FullLoader)
swagger_api_spec = reduce_openapi_spec(data)

llm = ChatOpenAI(model='gpt-4o')
swagger_requests_wrapper = RequestsWrapper(verify=False) # modified point
superset_agent = planner.create_openapi_agent(swagger_api_spec, swagger_requests_wrapper, llm, allow_dangerous_requests=True, handle_parsing_errors=True)

superset_agent.run(
    "Tell me the number and types of charts and dashboards available."
)
```

---------

Co-authored-by: Harrison Chase <[email protected]>

* [Community]: Fixed DDG DuckDuckGoSearchResults Docstring  (#22968)

- **Description:** A very small fix in the Docstring of
`DuckDuckGoSearchResults` identified in the following issue.
- **Issue:** #22961

---------

Co-authored-by: Harrison Chase <[email protected]>

* docs: embeddings classes (#22927)

Added a table with all Embedding classes.

* docs: Standardize DocumentLoader docstrings (#22932)

**Standardizing DocumentLoader docstrings (of which there are many)**

This PR addresses issue #22866 and adds docstrings according to the
issue's specified format (in the appendix) for files csv_loader.py and
json_loader.py in langchain_community.document_loaders. In particular,
the following sections have been added to both CSVLoader and JSONLoader:
Setup, Instantiate, Load, Async load, and Lazy load. It may be worth
adding a 'Metadata' section to the JSONLoader docstring to clarify how
we want to extract the JSON metadata (using the `metadata_func`
argument). The files I used to walkthrough the various sections were
`example_2.json` from
[HERE](https://support.oneskyapp.com/hc/en-us/articles/208047697-JSON-sample-files)
and `hw_200.csv` from
[HERE](https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html).

---------

Co-authored-by: lucast2021 <[email protected]>
Co-authored-by: isaac hershenson <[email protected]>

* langchain[patch]: add tool messages formatter for tool calling agent (#22849)

- **Description:** add tool_messages_formatter for tool calling agent,
make tool messages can be formatted in different ways for your LLM.
  - **Issue:** N/A
  - **Dependencies:** N/A

* langchain: add id_key option to EnsembleRetriever for metadata-based document merging (#22950)

**Description:**
- What I changed
- By specifying the `id_key` during the initialization of
`EnsembleRetriever`, it is now possible to determine which documents to
merge scores for based on the value corresponding to the `id_key`
element in the metadata, instead of `page_content`. Below is an example
of how to use the modified `EnsembleRetriever`:
    ```python
retriever = EnsembleRetriever(retrievers=[ret1, ret2], id_key="id") #
The Document returned by each retriever must keep the "id" key in its
metadata.
    ```

- Additionally, I added a script to easily test the behavior of the
`invoke` method of the modified `EnsembleRetriever`.

- Why I changed
- There are cases where you may want to calculate scores by treating
Documents with different `page_content` as the same when using
`EnsembleRetriever`. For example, when you want to ensemble the search
results of the same document described in two different languages.
- The previous `EnsembleRetriever` used `page_content` as the basis for
score aggregation, making the above usage difficult. Therefore, the
score is now calculated based on the specified key value in the
Document's metadata.

**Twitter handle:** @shimajiroxyz

* community: add KafkaChatMessageHistory (#22216)

Add chat history store based on Kafka.

Files added: 
`libs/community/langchain_community/chat_message_histories/kafka.py`
`docs/docs/integrations/memory/kafka_chat_message_history.ipynb`

New issue to be created for future improvement:
1. Async method implementation.
2. Message retrieval based on timestamp.
3. Support for other configs when connecting to cloud hosted Kafka (e.g.
add `api_key` field)
4. Improve unit testing & integration testing.

* LanceDB integration update (#22869)

Added : 

- [x] relevance search (w/wo scores)
- [x] maximal marginal search
- [x] image ingestion
- [x] filtering support
- [x] hybrid search w reranking 

make test, lint_diff and format checked.

* SemanticChunker : Feature Addition ("Semantic Splitting with gradient") (#22895)

```SemanticChunker``` currently provide three methods to split the texts semantically:
- percentile
- standard_deviation
- interquartile

I propose new method ```gradient```. In this method, the gradient of distance is used to split chunks along with the percentile method (technically) . This method is useful when chunks are highly correlated with each other or specific to a domain e.g. legal or medical. The idea is to apply anomaly detection on gradient array so that the distribution become wider and easy to identify boundaries in highly semantic data.
I have tested this merge on a set of 10 domain specific documents (mostly legal).

Details : 
    - **Issue:** Improvement
    - **Dependencies:** NA
    - **Twitter handle:** [x.com/prajapat_ravi](https://x.com/prajapat_ravi)


@hwchase17

---------

Co-authored-by: Raviraj Prajapat <[email protected]>
Co-authored-by: isaac hershenson <[email protected]>

* docs: `AWS` platform page update (#23063)

Added a reference to the `GlueCatalogLoader` new document loader.

* Update Fireworks link (#23058)

* docs: add trim_messages to chatbot (#23061)

* LanceDB example minor change (#23069)

Removed package version `0.6.13` in the example.

* core[patch],community[patch],langchain[patch]: `tenacity` dependency to version `>=8.1.0,<8.4.0` (#22973)

Fix https://github.com/langchain-ai/langchain/issues/22972.

- [x] **PR title**: "package: description"
- Where "package" is whichever of langchain, community, core,
experimental, etc. is being modified. Use "docs: ..." for purely docs
changes, "templates: ..." for template changes, "infra: ..." for CI
changes.
  - Example: "community: add foobar LLM"


- [x] **PR message**: ***Delete this entire checklist*** and replace
with
    - **Description:** a description of the change
    - **Issue:** the issue # it fixes, if applicable
    - **Dependencies:** any dependencies required for this change
- **Twitter handle:** if your PR gets announced, and you'd like a
mention, we'll gladly shout you out!


- [x] **Add tests and docs**: If you're adding a new integration, please
include
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in
`docs/docs/integrations` directory.


- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.

If no one reviews your PR within a few days, please @-mention one of
baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

* core[patch]: Document BaseStore (#23082)

Add doc-string to BaseStore

* community: glob multiple patterns when using DirectoryLoader (#22852)

- **Description:** Updated
*community.langchain_community.document_loaders.directory.py* to enable
the use of multiple glob patterns in the `DirectoryLoader` class. Now,
the glob parameter is of type `list[str] | str` and still defaults to
the same value as before. I updated the docstring of the class to
reflect this, and added a unit test to
*community.tests.unit_tests.document_loaders.test_directory.py* named
`test_directory_loader_glob_multiple`. This test also shows an example
of how to use the new functionality.
- ~~Issue:~~**Discussion Thread:**
https://github.com/langchain-ai/langchain/discussions/18559
- **Dependencies:** None
- **Twitter handle:** N/a

- [x] **Add tests and docs**
    - Added test (described above)
    - Updated class docstring

- [x] **Lint and test**

---------

Co-authored-by: isaac hershenson <[email protected]>
Co-authored-by: Harrison Chase <[email protected]>
Co-authored-by: Isaac Francisco <[email protected]>

* core[patch]: Release 0.2.9 (#23091)

* community: add args_schema to SearxSearch  (#22954)

This change adds args_schema (pydantic BaseModel) to SearxSearchRun for
correct schema formatting on LLM function calls

Issue: currently using SearxSearchRun with OpenAI function calling
returns the following error "TypeError: SearxSearchRun._run() got an
unexpected keyword argument '__arg1' ".

This happens because the schema sent to the LLM is "input:
'{"__arg1":"foobar"}'" while the method should be called with the
"query" parameter.

---------

Co-authored-by: Harrison Chase <[email protected]>

* core[minor]: Support multiple keys in get_from_dict_or_env (#23086)

Support passing multiple keys for ge_from_dict_or_env

* community[minor]: Implement Doctran async execution (#22372)

**Description**

The DoctranTextTranslator has an async transform function that was not
implemented because [the doctran
library](https://github.com/psychic-api/doctran) uses a sync version of
the `execute` method.

- I implemented the `DoctranTextTranslator.atransform_documents()`
method using `asyncio.to_thread` to run the function in a separate
thread.
- I updated the example in the Notebook with the new async version.
- The performance improvements can be appreciated when a big document is
divided into multiple chunks.

Relates to:
- Issue #14645: https://github.com/langchain-ai/langchain/issues/14645
- Issue #14437: https://github.com/langchain-ai/langchain/issues/14437
- https://github.com/langchain-ai/langchain/pull/15264

---------

Co-authored-by: Eugene Yurtsev <[email protected]>

* docs: Fix URL formatting in deprecation warnings (#23075)

**Description**

Updated the URLs in deprecation warning messages. The URLs were
previously written as raw strings and are now formatted to be clickable
HTML links.

Example of a broken link in the current API Reference:
https://api.python.langchain.com/en/latest/chains/langchain.chains.openai_functions.extraction.create_extraction_chain_pydantic.html

<img width="942" alt="Screenshot 2024-06-18 at 13 21 07"
src="https://github.com/langchain-ai/langchain/assets/4854600/a1b1863c-cd03-4af2-a9bc-70375407fb00">

* langchain[patch]: fix `OutputType` of OutputParsers and fix legacy API in OutputParsers (#19792)

# Description

This pull request aims to address specific issues related to the
ambiguity and error-proneness of the output types of certain output
parsers, as well as the absence of unit tests for some parsers. These
issues could potentially lead to runtime errors or unexpected behaviors
due to type mismatches when used, causing confusion for developers and
users. Through clarifying output types, this PR seeks to improve the
stability and reliability.

Therefore, this pull request

- fixes the `OutputType` of OutputParsers to be the expected type;
- e.g. `OutputType` property of `EnumOutputParser` raises `TypeError`.
This PR introduce a logic to extract `OutputType` from its attribute.
- and fixes the legacy API in OutputParsers like `LLMChain.run` to the
modern API like `LLMChain.invoke`;
- Note: For `OutputFixingParser`, `RetryOutputParser` and
`RetryWithErrorOutputParser`, this PR introduces `legacy` attribute with
False as default value in order to keep the backward compatibility
- and adds the tests for the `OutputFixingParser` and
`RetryOutputParser`.

The following table shows my expected output and the actual output of
the `OutputType` of OutputParsers.
I have used this table to fix `OutputType` of OutputParsers.

| Class Name of OutputParser | My Expected `OutputType` (after this PR)|
Actual `OutputType` [evidence](#evidence) (before this PR)| Fix Required
|
|---------|--------------|---------|--------|
| BooleanOutputParser | `<class 'bool'>` | `<class 'bool'>` | NO |
| CombiningOutputParser | `typing.Dict[str, Any]` | `TypeError` is
raised | YES |
| DatetimeOutputParser | `<class 'datetime.datetime'>` | `<class
'datetime.datetime'>` | NO |
| EnumOutputParser(enum=MyEnum) | `MyEnum` | `TypeError` is raised | YES
|
| OutputFixingParser | The same type as `self.parser.OutputType` | `~T`
| YES |
| CommaSeparatedListOutputParser | `typing.List[str]` |
`typing.List[str]` | NO |
| MarkdownListOutputParser | `typing.List[str]` | `typing.List[str]` |
NO |
| NumberedListOutputParser | `typing.List[str]` | `typing.List[str]` |
NO |
| JsonOutputKeyToolsParser | `typing.Any` | `typing.Any` | NO |
| JsonOutputToolsParser | `typing.Any` | `typing.Any` | NO |
| PydanticToolsParser | `typing.Any` | `typing.Any` | NO |
| PandasDataFrameOutputParser | `typing.Dict[str, Any]` | `TypeError` is
raised | YES |
| PydanticOutputParser(pydantic_object=MyModel) | `<class
'__main__.MyModel'>` | `<class '__main__.MyModel'>` | NO |
| RegexParser | `typing.Dict[str, str]` | `TypeError` is raised | YES |
| RegexDictParser | `typing.Dict[str, str]` | `TypeError` is raised |
YES |
| RetryOutputParser | The same type as `self.parser.OutputType` | `~T` |
YES |
| RetryWithErrorOutputParser | The same type as `self.parser.OutputType`
| `~T` | YES |
| StructuredOutputParser | `typing.Dict[str, Any]` | `TypeError` is
raised | YES |
| YamlOutputParser(pydantic_object=MyModel) | `MyModel` | `~T` | YES |

NOTE: In "Fix Required", "YES" means that it is required to fix in this
PR while "NO" means that it is not required.

# Issue

No issues for this PR.

# Twitter handle

- [hmdev3](https://twitter.com/hmdev3)

# Questions:

1. Is it required to create tests for legacy APIs `LLMChain.run` in the
following scripts?
   - libs/langchain/tests/unit_tests/output_parsers/test_fix.py;
   - libs/langchain/tests/unit_tests/output_parsers/test_retry.py.

2. Is there a more appropriate expected output type than I expect in the
above table?
- e.g. the `OutputType` of `CombiningOutputParser` should be
SOMETHING...

# Actual outputs (before this PR)

<div id='evidence'></div>

<details><summary>Actual outputs</summary>

## Requirements

- Python==3.9.13
- langchain==0.1.13

```python
Python 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import langchain
>>> langchain.__version__
'0.1.13'
>>> from langchain import output_parsers
```

### `BooleanOutputParser`

```python
>>> output_parsers.BooleanOutputParser().OutputType
<class 'bool'>
```

### `CombiningOutputParser`

```python
>>> output_parsers.CombiningOutputParser(parsers=[output_parsers.DatetimeOutputParser(), output_parsers.CommaSeparatedListOutputParser()]).OutputType
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\workspace\venv\lib\site-packages\langchain_core\output_parsers\base.py", line 160, in OutputType
    raise TypeError(
TypeError: Runnable CombiningOutputParser doesn't have an inferable OutputType. Override the OutputType property to specify the output type.
```

### `DatetimeOutputParser`

```python
>>> output_parsers.DatetimeOutputParser().OutputType
<class 'datetime.datetime'>
```

### `EnumOutputParser`

```python
>>> from enum import Enum
>>> class MyEnum(Enum):
...     a = 'a'
...     b = 'b'
...
>>> output_parsers.EnumOutputParser(enum=MyEnum).OutputType
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\workspace\venv\lib\site-packages\langchain_core\output_parsers\base.py", line 160, in OutputType
    raise TypeError(
TypeError: Runnable EnumOutputParser doesn't have an inferable OutputType. Override the OutputType property to specify the output type.
```

### `OutputFixingParser`

```python
>>> output_parsers.OutputFixingParser(parser=output_parsers.DatetimeOutputParser()).OutputType
~T
```

### `CommaSeparatedListOutputParser`

```python
>>> output_parsers.CommaSeparatedListOutputParser().OutputType
typing.List[str]
```

### `MarkdownListOutputParser`

```python
>>> output_parsers.MarkdownListOutputParser().OutputType
typing.List[str]
```

### `NumberedListOutputParser`

```python
>>> output_parsers.NumberedListOutputParser().OutputType
typing.List[str]
```

### `JsonOutputKeyToolsParser`

```python
>>> output_parsers.JsonOutputKeyToolsParser(key_name='tool').OutputType
typing.Any
```

### `JsonOutputToolsParser`

```python
>>> output_parsers.JsonOutputToolsParser().OutputType
typing.Any
```

### `PydanticToolsParser`

```python
>>> from langchain.pydantic_v1 import BaseModel
>>> class MyModel(BaseModel):
...     a: int
...
>>> output_parsers.PydanticToolsParser(tools=[MyModel, MyModel]).OutputType
typing.Any
```

### `PandasDataFrameOutputParser`

```python
>>> output_parsers.PandasDataFrameOutputParser().OutputType
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\workspace\venv\lib\site-packages\langchain_core\output_parsers\base.py", line 160, in OutputType
    raise TypeError(
TypeError: Runnable PandasDataFrameOutputParser doesn't have an inferable OutputType. Override the OutputType property to specify the output type.
```

### `PydanticOutputParser`

```python
>>> output_parsers.PydanticOutputParser(pydantic_object=MyModel).OutputType
<class '__main__.MyModel'>
```

### `RegexParser`

```python
>>> output_parsers.RegexParser(regex='$', output_keys=['a']).OutputType
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\workspace\venv\lib\site-packages\langchain_core\output_parsers\base.py", line 160, in OutputType
    raise TypeError(
TypeError: Runnable RegexParser doesn't have an inferable OutputType. Override the OutputType property to specify the output type.
```

### `RegexDictParser`

```python
>>> output_parsers.RegexDictParser(output_key_to_format={'a':'a'}).OutputType
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\workspace\venv\lib\site-packages\langchain_core\output_parsers\base.py", line 160, in OutputType
    raise TypeError(
TypeError: Runnable RegexDictParser doesn't have an inferable OutputType. Override the OutputType property to specify the output type.
```

### `RetryOutputParser`

```python
>>> output_parsers.RetryOutputParser(parser=output_parsers.DatetimeOutputParser()).OutputType
~T
```

### `RetryWithErrorOutputParser`

```python
>>> output_parsers.RetryWithErrorOutputParser(parser=output_parsers.DatetimeOutputParser()).OutputType
~T
```

### `StructuredOutputParser`

```python
>>> from langchain.output_parsers.structured import ResponseSchema
>>> response_schemas = [ResponseSchema(name="foo",description="a list of strings",type="List[string]"),ResponseSchema(name="bar",description="a string",type="string"), ]
>>> output_parsers.StructuredOutputParser.from_response_schemas(response_schemas).OutputType
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\workspace\venv\lib\site-packages\langchain_core\output_parsers\base.py", line 160, in OutputType
    raise TypeError(
TypeError: Runnable StructuredOutputParser doesn't have an inferable OutputType. Override the OutputType property to specify the output type.
```

### `YamlOutputParser`

```python
>>> output_parsers.YamlOutputParser(pydantic_object=MyModel).OutputType
~T
```


<div>

---------

Co-authored-by: Eugene Yurtsev <[email protected]>

* core[patch]: Pin pydantic in py3.12.4 (#23130)

* core[minor]: handle boolean data in draw_mermaid (#23135)

This change should address graph rendering issues for edges with boolean
data

Example from langgraph:

```python
from typing import Annotated, TypedDict

from langchain_core.messages import AnyMessage
from langgraph.graph import END, START, StateGraph
from langgraph.graph.message import add_messages


class State(TypedDict):
    messages: Annotated[list[AnyMessage], add_messages]


def branch(state: State) -> bool:
    return 1 + 1 == 3


graph_builder = StateGraph(State)
graph_builder.add_node("foo", lambda state: {"messages": [("ai", "foo")]})
graph_builder.add_node("bar", lambda state: {"messages": [("ai", "bar")]})

graph_builder.add_conditional_edges(
    START,
    branch,
    path_map={True: "foo", False: "bar"},
    then=END,
)

app = graph_builder.compile()
print(app.get_graph().draw_mermaid())
```

Previous behavior:

```python
AttributeError: 'bool' object has no attribute 'split'
```

Current behavior:

```python
%%{init: {'flowchart': {'curve': 'linear'}}}%%
graph TD;
	__start__[__start__]:::startclass;
	__end__[__end__]:::endclass;
	foo([foo]):::otherclass;
	bar([bar]):::otherclass;
	__start__ -. ('a',) .-> foo;
	foo --> __end__;
	__start__ -. ('b',) .-> bar;
	bar --> __end__;
	classDef startclass fill:#ffdfba;
	classDef endclass fill:#baffc9;
	classDef otherclass fill:#fad7de;
```

* docs: use trim_messages in chatbot how to (#23139)

* docs[patch]: Adds feedback input after thumbs up/down (#23141)

CC @baskaryan

* docs[patch]: Fix typo in feedback (#23146)

* core[patch]: runnablewithchathistory from core.runnables (#23136)

* openai[patch], standard-tests[patch]: don't pass in falsey stop vals (#23153)

adds an image input test to standard-tests as well

* anthropic: docstrings (#23145)

Added missed docstrings. Format docstrings to the consistent format
(used in the API Reference)

* anthropic[patch]: test image input (#23155)

* text-splitters: Introduce Experimental Markdown Syntax Splitter (#22257)

#### Description
This MR defines a `ExperimentalMarkdownSyntaxTextSplitter` class. The
main goal is to replicate the functionality of the original
`MarkdownHeaderTextSplitter` which extracts the header stack as metadata
but with one critical difference: it keeps the whitespace of the
original text intact.

This draft reimplements the `MarkdownHeaderTextSplitter` with a very
different algorithmic approach. Instead of marking up each line of the
text individually and aggregating them back together into chunks, this
method builds each chunk sequentially and applies the metadata to each
chunk. This makes the implementation simpler. However, since it's
designed to keep white space intact its not a full drop in replacement
for the original. Since it is a radical implementation change to the
original code and I would like to get feedback to see if this is a
worthwhile replacement, should be it's own class, or is not a good idea
at all.

Note: I implemented the `return_each_line` parameter but I don't think
it's a necessary feature. I'd prefer to remove it.

This implementation also adds the following additional features:
- Splits out code blocks and includes the language in the `"Code"`
metadata key
- Splits text on the horizontal rule `---` as well
- The `headers_to_split_on` parameter is now optional - with sensible
defaults that can be overridden.

#### Issue
Keeping the whitespace keeps the paragraphs structure and the formatting
of the code blocks intact which allows the caller much more flexibility
in how they want to further split the individuals sections of the
resulting documents. This addresses the issues brought up by the
community in the following issues:
- https://github.com/langchain-ai/langchain/issues/20823
- https://github.com/langchain-ai/langchain/issues/19436
- https://github.com/langchain-ai/langchain/issues/22256

#### Dependencies
N/A

#### Twitter handle
@RyanElston

---------

Co-authored-by: isaac hershenson <[email protected]>

* ibm: docstrings (#23149)

Added missed docstrings. Format docstrings to the consistent format
(used in the API Reference)

* community: add **request_kwargs and expect TimeError AsyncHtmlLoader (#23068)

- **Description:** add `**request_kwargs` and expect `TimeError` in
`_fetch` function for AsyncHtmlLoader. This allows you to fill in the
kwargs parameter when using the `load()` method of the `AsyncHtmlLoader`
class.

Co-authored-by: Yucolu <[email protected]>

* docs: Overhaul Databricks components documentation (#22884)

**Description:** Documentation at
[integrations/llms/databricks](https://python.langchain.com/v0.2/docs/integrations/llms/databricks/)
is not up-to-date and includes examples about chat model and embeddings,
which should be located in the different corresponding subdirectories.
This PR split the page into correct scope and overhaul the contents.

**Note**: This PR might be hard to review on the diffs view, please use
the following preview links for the changed pages.
- `ChatDatabricks`:
https://langchain-git-fork-b-step62-chat-databricks-doc-langchain.vercel.app/v0.2/docs/integrations/chat/databricks/
- `Databricks`:
https://langchain-git-fork-b-step62-chat-databricks-doc-langchain.vercel.app/v0.2/docs/integrations/llms/databricks/
- `DatabricksEmbeddings`:
https://langchain-git-fork-b-step62-chat-databricks-doc-langchain.vercel.app/v0.2/docs/integrations/text_embedding/databricks/

- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

---------

Signed-off-by: B-Step62 <[email protected]>

* text-splitters: Fix/recursive json splitter data persistence issue (#21529)

Thank you for contributing to LangChain!

**Description:** Noticed an issue with when I was calling
`RecursiveJsonSplitter().split_json()` multiple times that I was getting
weird results. I found an issue where `chunks` list in the `_json_split`
method. If chunks is not provided when _json_split (which is the case
when split_json calls _json_split) then the same list is used for
subsequent calls to `_json_split`.


You can see this in the test case i also added to this commit.

Output should be: 
```
[{'a': 1, 'b': 2}]
[{'c': 3, 'd': 4}]
```

Instead you get:
```
[{'a': 1, 'b': 2}]
[{'a': 1, 'b': 2, 'c': 3, 'd': 4}]
```

---------

Co-authored-by: Nuno Campos <[email protected]>
Co-authored-by: isaac hershenson <[email protected]>
Co-authored-by: Isaac Francisco <[email protected]>

* docs[patch]: Standardize prerequisites in tutorial docs (#23150)

CC @baskaryan

* ai21: docstrings (#23142)

Added missed docstrings. Format docstrings to the consistent format
(used in the API Reference)

* core[patch]: Update documentation in LLM namespace (#23138)

Update documentation in lllm namespace.

* core[patch]: Document embeddings namespace (#23132)

Document embeddings namespace

* core[patch]: Expand documentation in the indexing namespace (#23134)

* community: move test to integration tests (#23178)

Tests failing on master with

> FAILED
tests/unit_tests/embeddings/test_ovhcloud.py::test_ovhcloud_embed_documents
- ValueError: Request failed with status code: 401, {"message":"Bad
token; invalid JSON"}

* community[patch]: Prevent unit tests from making network requests (#23180)

* Prevent unit tests from making network requests

* core[patch]: Add doc-string to document compressor (#23085)

* core[patch]: Add documentation to load namespace (#23143)

Document some of the modules within the load namespace

* core[patch]: update test to catch circular imports (#23172)

This raises ImportError due to a circular import:
```python
from langchain_core import chat_history
```

This does not:
```python
from langchain_core import runnables
from langchain_core import chat_history
```

Here we update `test_imports` to run each import in a separate
subprocess. Open to other ways of doing this!

* core[patch]: Add an example to the Document schema doc-string (#23131)

Add an example to the document schema

* core[patch]: fix chat history circular import (#23182)

* fix: MoonshotChat fails when setting the moonshot_api_key through the OS environment. (#23176)

Close #23174

Co-authored-by: tianming <[email protected]>

* docs: add bing search tool to ms platform (#23183)

- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

* prompty: docstring (#23152)

Added missed docstrings. Format docstrings to the consistent format
(used in the API Reference)

---------

Co-authored-by: ccurme <[email protected]>

* core[patch[: add exceptions propagation test for astream_events v2 (#23159)

**Description:** `astream_events(version="v2")` didn't propagate
exceptions in `langchain-core<=0.2.6`, fixed in the #22916. This PR adds
a unit test to check that exceptions are propagated upwards.

Co-authored-by: Sergey Kozlov <[email protected]>

* langchain[small]: Change type to BasePromptTemplate (#23083)

```python
Change from_llm(
 prompt: PromptTemplate 
 ...
 )
```
 to
```python
Change from_llm(
 prompt: BasePromptTemplate 
 ...
 )
```

* community[patch]: update sambastudio embeddings (#23133)

Description: update sambastudio embeddings integration, now compatible
with generic endpoints and CoE endpoints

* community[patch]: sambanova llm integration improvement (#23137)

- **Description:** sambanova sambaverse integration improvement: removed
input parsing that was changing raw user input, and was making to use
process prompt parameter as true mandatory

* openai[patch]: image token counting (#23147)

Resolves #23000

---------

Co-authored-by: isaac hershenson <[email protected]>
Co-authored-by: ccurme <[email protected]>

* upstage: move to external repo (#22506)

* community[patch]: restore compatibility with SQLAlchemy 1.x (#22546)

- **Description:** Restores compatibility with SQLAlchemy 1.4.x that was
broken since #18992 and adds a test run for this version on CI (only for
Python 3.11)
- **Issue:** fixes #19681
- **Dependencies:** None
- **Twitter handle:** `@krassowski_m`

---------

Co-authored-by: Erick Friis <[email protected]>

* infra: add more formatter rules to openai (#23189)

Turns on
https://docs.astral.sh/ruff/settings/#format_docstring-code-format and
https://docs.astral.sh/ruff/settings/#format_skip-magic-trailing-comma

```toml
[tool.ruff.format]
docstring-code-format = true
skip-magic-trailing-comma = true
```

* core[patch]: Add doc-strings to outputs, fix @root_validator (#23190)

- Document outputs namespace
- Update a vanilla @root_validator that was missed

* core[patch]: Document messages namespace (#23154)

- Moved doc-strings below attribtues in TypedDicts -- seems to render
better on APIReference pages.
* Provided more description and some simple code examples

* infra: run CI on large diffs (#23192)

currently we skip CI on diffs >= 300 files. think we should just run it
on all packages instead

---------

Co-authored-by: Erick Friis <[email protected]>

* core[patch]: Document agent schema (#23194)

* Document agent schema
* Refer folks to langgraph for more information on how to create agents.

* fireworks[patch]: fix api_key alias in Fireworks LLM (#23118)

Thank you for contributing to LangChain!

**Description**
The current code snippet for `Fireworks` had incorrect parameters. This
PR fixes those parameters.

---------

Co-authored-by: Chester Curme <[email protected]>
Co-authored-by: Bagatur <[email protected]>

* core:Add optional max_messages to MessagePlaceholder (#16098)

- **Description:** Add optional max_messages to MessagePlaceholder
- **Issue:**
[16096](https://github.com/langchain-ai/langchain/issues/16096)
- **Dependencies:** None
- **Twitter handle:** @davedecaprio

Sometimes it's better to limit the history in the prompt itself rather
than the memory. This is needed if you want different prompts in the
chain to have different history lengths.

---------

Co-authored-by: Harrison Chase <[email protected]>

* docs: standard params (#23199)

* standard-tests[patch]: test stop not stop_sequences (#23200)

* fix https://github.com/langchain-ai/langchain/issues/23215 (#23216)

fix bug 
The ZhipuAIEmbeddings class is not working.

Co-authored-by: xu yandong <[email protected]>

* huggingface[patch]: fix CI for python 3.12 (#23197)

* huggingface: docstrings (#23148)

Added missed docstrings. Format docstrings to the consistent format
(used in the API Reference)

Co-authored-by: ccurme <[email protected]>

* Docs: Update Rag tutorial so it includes an additional notebook cell with pip installs of required langchain_chroma and langchain_community. (#23204)

Description: Update Rag tutorial notebook so it includes an additional
notebook cell with pip installs of required langchain_chroma and
langchain_community.

This fixes the issue with the rag tutorial gives you a 'missing modules'
error if you run code in the notebook as is.

---------

Co-authored-by: Chester Curme <[email protected]>

* doc: replace function all with tool call (#23184)

- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

* partners[minor]: Fix value error message for with_structured_output (#22877)

Currently, calling `with_structured_output()` with an invalid method
argument raises `Unrecognized method argument. Expected one of
'function_calling' or 'json_format'`, but the JSON mode option [is now
referred
to](https://python.langchain.com/v0.2/docs/how_to/structured_output/#the-with_structured_output-method)
by `'json_mode'`. This fixes that.

Co-authored-by: Harrison Chase <[email protected]>

* community: docstrings (#23202)

Added missed docstrings. Format docstrings to the consistent format
(used in the API Reference)

* docs: Update clean up API reference (#23221)

- Fix bug with TypedDicts rendering inherited methods if inherting from
  typing_extensions.TypedDict rather than typing.TypedDict
- Do not surface inherited pydantic methods for subclasses of BaseModel
- Subclasses of RunnableSerializable will not how methods inherited from
  Runnable or from BaseModel
- Subclasses of Runnable that not pydantic models will include a link to
RunnableInterface (they still show inherited methods, we can fix this
later)

* docs: API reference remove Prev/Up/Next buttons (#23225)

These do not work anyway. Let's remove them for now for simplicity.

* core[minor]: Adds an in-memory implementation of RecordM…
ccurme pushed a commit that referenced this issue Jul 5, 2024
- [X] **PR title**
- [X] **PR message**: ***Delete this entire checklist*** and replace
with
    - **Description:** Update of docstrings and docpages
- **Issue:**
[22866](#22866)

- [X] **Add tests and docs**

- [X] **Lint and test**
@baskaryan baskaryan changed the title Standardize DocumentLoader docstrings and integration do Standardize DocumentLoader docstrings and integration docs Jul 31, 2024
Copy link

dosubot bot commented Nov 26, 2024

Hi, @isahers1. I'm Dosu, and I'm helping the LangChain team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • The issue aims to improve documentation for DocumentLoader integrations by standardizing docstrings and integration docs.
  • Each DocumentLoader class should have detailed docstrings with specific sections and code examples.
  • Integration docs should follow a given template.
  • User "lucas-tucker" volunteered to work on this task, and you reacted positively.

Next Steps:

  • Please let me know if this issue is still relevant to the latest version of the LangChain repository. If so, you can keep the discussion open by commenting on the issue.
  • Otherwise, the issue will be automatically closed in 7 days.

Thank you for your understanding and contribution!

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Nov 26, 2024
@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 3, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Dec 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖:docs Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder help wanted Good issue for contributors
Projects
None yet
Development

No branches or pull requests

2 participants