Automatic data type detection

The add method automatically tries to detect the data_type, based on your input for the source argument. So app.add('https://www.youtube.com/watch?v=dQw4w9WgXcQ') is enough to embed a YouTube video.

This detection is implemented for all formats. It is based on factors such as whether it's a URL, a local file, the source data type, etc.
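The kind of checks described above can be sketched as follows. This is an illustrative heuristic only, not Embedchain's actual detection code: the function name guess_data_type is hypothetical, and the returned labels (including local_file) are assumptions chosen to resemble data_type keywords.

```python
import os
from urllib.parse import urlparse


def guess_data_type(source: str) -> str:
    """Hypothetical heuristic; NOT Embedchain's real detection logic."""
    parsed = urlparse(source)

    # Case 1: the source looks like a web URL
    if parsed.scheme in ("http", "https") and parsed.netloc:
        host = parsed.netloc.lower()
        path = parsed.path.lower()
        if host in ("youtube.com", "youtu.be") or host.endswith(".youtube.com"):
            return "youtube_video"
        if path.endswith(".pdf"):
            return "pdf_file"
        if path.endswith("sitemap.xml"):
            return "sitemap"
        return "web_page"

    # Case 2: the source is an existing local file
    if os.path.isfile(source):
        # "local_file" is an illustrative label, not a confirmed keyword
        return "pdf_file" if source.lower().endswith(".pdf") else "local_file"

    # Case 3: anything else falls through to raw text
    return "text"
```

Note the fall-through in the last case: in a scheme like this, a path that does not exist is neither a URL nor a file, so it is silently treated as raw text.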

Debugging automatic detection

Set log_level: DEBUG in the config yaml to check whether the data type was detected correctly. Otherwise, you will not know when, for instance, an invalid filepath is interpreted as raw text instead.
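As a rough sketch, the same setting can be written as a Python dict in the style of the config examples in this document. The placement of log_level under app.config, next to id, is an assumption and should be checked against the config reference for your Embedchain version.

```python
# Assumption: log_level lives under app.config, next to "id"
config = {
    "app": {
        "config": {
            "id": "app-1",
            "log_level": "DEBUG",
        }
    }
}

# With Embedchain installed, the dict would be passed in the same way as in
# the other examples in this document:
#   from embedchain import App
#   app = App.from_config(config=config)
```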

Forcing a data type

To avoid any issues with the data type detection, you can force a data_type by passing it as an argument to the add method. The examples below show you the keyword to force the respective data_type.

Forcing can also be used for edge cases, such as interpreting a sitemap as a web_page, to read its raw text instead of following its links.
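The sitemap edge case might look like the sketch below. The helper name add_sitemap_as_web_page and the example URL are hypothetical; the only thing taken from the text above is that add accepts a data_type argument, passed here as a keyword.

```python
def add_sitemap_as_web_page(app, url: str):
    """Add a sitemap URL but force it to be read as a single web page.

    `app` is assumed to be an Embedchain App (or any object whose add
    method accepts a data_type keyword argument).
    """
    # Forcing data_type bypasses automatic detection for this source
    return app.add(url, data_type="web_page")
```

Usage, assuming an existing app object: add_sitemap_as_web_page(app, "https://example.com/sitemap.xml").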

Remote data types

Use local files in remote data types

Some data_types are meant for remote content and only work with URLs. You can pass local files by formatting the path using the file: URI scheme, e.g. file:///info.pdf.
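One way to build such a URI without formatting it by hand is Python's standard pathlib module. This is a minimal sketch; the helper name to_file_uri is hypothetical and not part of Embedchain.

```python
from pathlib import Path


def to_file_uri(path: str) -> str:
    """Convert a local path into a file: URI."""
    # as_uri() requires an absolute path, so resolve() it first;
    # special characters such as spaces are percent-encoded.
    return Path(path).resolve().as_uri()
```

The resulting string can then be passed as the source argument in place of a URL.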

Reusing a vector database

By default, a persistent vector db is created in the directory ./db. You can split your application into two Python scripts: one to create a local vector db and the other to reuse this local persistent vector db. This is useful when you want to index hundreds of documents and separately implement a chat interface.

Create a local index:

from embedchain import App

config = {
    "app": {
        "config": {
            "id": "app-1"
        }
    }
}
naval_chat_bot = App.from_config(config=config)
naval_chat_bot.add("https://www.youtube.com/watch?v=3qHkcs3kG44")
naval_chat_bot.add("https://navalmanack.s3.amazonaws.com/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf")

You can reuse the local index with the same code, but without adding new documents:

from embedchain import App

config = {
    "app": {
        "config": {
            "id": "app-1"
        }
    }
}
naval_chat_bot = App.from_config(config=config)
print(naval_chat_bot.query("What unique capacity does Naval argue humans possess when it comes to understanding explanations or concepts?"))

Resetting an app and vector database

You can reset the app by calling the reset method. This deletes the vector database and all other app-related files.

from embedchain import App

config = {
    "app": {
        "config": {
            "id": "app-1"
        }
    }
}
app = App.from_config(config=config)
app.add("https://www.youtube.com/watch?v=3qHkcs3kG44")
# Deletes the vector database and all other app-related files
app.reset()