Description

Retrieval-Augmented Generation (RAG) Workflows in R with Local and Web Search.

Enables Retrieval-Augmented Generation (RAG) workflows in R by combining local vector search using 'DuckDB' with optional web search via the 'Tavily' API. Supports 'OpenAI'- and 'Ollama'-compatible embedding models, full-text and 'HNSW' (Hierarchical Navigable Small World) indexing, and modular large language model (LLM) invocation. Designed for advanced question-answering, chat-based applications, and production-ready AI pipelines. This package is the R equivalent of the 'Python' package 'RAGFlowChain' available at <https://pypi.org/project/RAGFlowChain/>.

RAGFlowChainR

Overview

RAGFlowChainR is an R package that brings Retrieval-Augmented Generation (RAG) capabilities to R, inspired by LangChain. It enables intelligent retrieval of documents from a local vector store (DuckDB), optional web search, and seamless integration with Large Language Models (LLMs).

Features include:

📂 Ingest files and websites
🔍 Semantic search using vector embeddings
🧠 RAG chain execution with conversational memory and dynamic prompt construction
🔌 Plug-and-play with OpenAI, Ollama, Groq, and Anthropic

Python version: RAGFlowChain (PyPI), https://pypi.org/project/RAGFlowChain/
GitHub (Python): RAGFlowChain

Installation

install.packages("RAGFlowChainR")

Development version

To get the latest features or bug fixes, you can install the development version of RAGFlowChainR from GitHub:

# If needed
install.packages("remotes")

remotes::install_github("knowusuboaky/RAGFlowChainR")

See the full function reference or the package website for more details.

🔐 Environment Setup

Sys.setenv(TAVILY_API_KEY    = "your-tavily-api-key")
Sys.setenv(OPENAI_API_KEY    = "your-openai-api-key")
Sys.setenv(GROQ_API_KEY      = "your-groq-api-key")
Sys.setenv(ANTHROPIC_API_KEY = "your-anthropic-api-key")

To persist across sessions, add these to your ~/.Renviron file.
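
Entries in ~/.Renviron are plain KEY=value lines that R reads at startup. Before running the examples, it can help to confirm which keys are actually visible to the session. The sketch below uses only base R; `missing_keys()` is a hypothetical helper written for this illustration, not part of RAGFlowChainR.

```r
# Hypothetical helper (not part of RAGFlowChainR): return the names of
# environment variables that are unset or empty in the current session.
missing_keys <- function(keys) {
  values <- Sys.getenv(keys, unset = "")
  keys[!nzchar(values)]
}

required <- c("TAVILY_API_KEY", "OPENAI_API_KEY",
              "GROQ_API_KEY", "ANTHROPIC_API_KEY")

absent <- missing_keys(required)
if (length(absent) > 0) {
  message("Not set: ", paste(absent, collapse = ", "))
}
```

Only the keys for the providers you actually call need to be set.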

Usage

1. Data Ingestion

library(RAGFlowChainR)

local_files <- c("tests/testthat/test-data/sprint.pdf", 
                 "tests/testthat/test-data/introduction.pptx",
                 "tests/testthat/test-data/overview.txt")
website_urls <- c("https://www.r-project.org")
crawl_depth <- 1

response <- fetch_data(
  local_paths = local_files,
  website_urls = website_urls,
  crawl_depth = crawl_depth
)

response
#>                                source                                      title ...
#> 1                 documents/sprint.pdf                                       <NA> ...
#> 2          documents/introduction.pptx                                       <NA> ...
#> 3               documents/overview.txt                                       <NA> ...
#> 4            https://www.r-project.org R: The R Project for Statistical Computing ...
#> ...

cat(response$content[1])
#> Getting Started with Scrum\nCodeWithPraveen.com ...
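
The content column can hold an entire document, which is why the next step passes a chunk_chars argument to insert_vectors(). How the package splits text internally is not shown here; the sketch below is only an assumption of what fixed-width character chunking might look like, and `chunk_text()` is a hypothetical name.

```r
# Hypothetical illustration of fixed-width character chunking.
# This is NOT the package's implementation, only a sketch of the idea.
chunk_text <- function(text, chunk_chars = 12000) {
  stopifnot(is.character(text), length(text) == 1, chunk_chars > 0)
  n <- nchar(text)
  if (n == 0) return(character(0))
  starts <- seq(1, n, by = chunk_chars)      # first character of each chunk
  ends <- pmin(starts + chunk_chars - 1, n)  # last character, capped at n
  substring(text, starts, ends)
}

sample_text <- paste(rep("abcdefghij", 5), collapse = "")  # 50 characters
chunks <- chunk_text(sample_text, chunk_chars = 20)
nchar(chunks)
#> [1] 20 20 10
```

Real chunkers often split on sentence or paragraph boundaries and overlap adjacent chunks; this sketch does neither.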

2. Vector Store & Semantic Search

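# Create (or overwrite) a DuckDB-backed vector store
con <- create_vectorstore("tests/testthat/test-data/my_vectors.duckdb", overwrite = TRUE)

docs <- data.frame(head(response))  # reuse from fetch_data()

# Embed the documents and insert them into the store
insert_vectors(
  con = con,
  df = docs,
  embed_fun = embed_openai(),
  chunk_chars = 12000
)

# Build a vector-similarity ("vss") and a full-text ("fts") index
build_vector_index(con, type = c("vss", "fts"))

# Retrieve the 5 entries closest to the query
response <- search_vectors(con, query_text = "Tell me about R?", top_k = 5)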

response
#>    id page_content                                                dist
#> 1   5 [Home]\nDownload\nCRAN\nR Project...\n...                0.2183
#> 2   6 [Home]\nDownload\nCRAN\nR Project...\n...                0.2183
#> ...

cat(response$page_content[1])
#> [Home]\nDownload\nCRAN\nR Project\nAbout R\nLogo\n...
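
The dist column ranks results by how far each stored embedding is from the query embedding, with smaller values meaning a closer match. The toy example below illustrates that idea with brute-force cosine distance in base R. It assumes cosine distance for illustration only; the metric and the indexed search the package actually uses inside DuckDB may differ, and `cosine_distance()` and `top_k_search()` are hypothetical names.

```r
# Toy brute-force nearest-neighbour search (illustration only; the package
# performs its search inside DuckDB and may use a different metric).
cosine_distance <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

top_k_search <- function(query_vec, doc_matrix, top_k = 5) {
  # One distance per row (document embedding) of doc_matrix
  dists <- apply(doc_matrix, 1, cosine_distance, b = query_vec)
  ord <- order(dists)[seq_len(min(top_k, length(dists)))]
  data.frame(id = ord, dist = dists[ord])
}

# Three made-up 3-dimensional "embeddings"
doc_vectors <- rbind(c(1, 0, 0), c(0.9, 0.1, 0), c(0, 1, 0))
result <- top_k_search(c(1, 0, 0), doc_vectors, top_k = 2)
result$id
#> [1] 1 2
```

An HNSW index avoids comparing the query against every stored vector, at the cost of returning approximate rather than exact nearest neighbours.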

3. RAG Chain Querying

rag_chain <- create_rag_chain(
  llm = call_llm,
  vector_database_directory = "tests/testthat/test-data/my_vectors.duckdb",
  method = "DuckDB",
  embedding_function = embed_openai(),
  use_web_search = FALSE
)

response <- rag_chain$invoke("Tell me about R")

response
#> $input
#> [1] "Tell me about R"
#>
#> $chat_history
#> [[1]] $role: "human", $content: "Tell me about R"
#> [[2]] $role: "assistant", $content: "R is a programming language..."
#>
#> $answer
#> [1] "R is a programming language and software environment commonly used for statistical computing and graphics..."

cat(response$answer)
#> R is a programming language and software environment commonly used for statistical computing and graphics...
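
The chat_history element shows that the chain keeps earlier turns, and the feature list mentions dynamic prompt construction. The package's actual prompt template is not shown here, so the sketch below is only an assumption about how retrieved context, prior turns, and a new question could be combined into one prompt. `build_prompt()` is a hypothetical helper.

```r
# Hypothetical prompt builder (not the package's template): combines
# retrieved chunks, earlier chat turns, and the new question.
build_prompt <- function(question, context_chunks, chat_history = list()) {
  history_lines <- vapply(
    chat_history,
    function(m) paste0(m$role, ": ", m$content),
    character(1)
  )
  paste(
    "Answer the question using only the context below.",
    "Context:",
    paste(context_chunks, collapse = "\n---\n"),
    "Conversation so far:",
    if (length(history_lines) > 0) paste(history_lines, collapse = "\n") else "(none)",
    paste0("Question: ", question),
    sep = "\n"
  )
}

history <- list(
  list(role = "human", content = "Tell me about R"),
  list(role = "assistant", content = "R is a programming language...")
)
prompt <- build_prompt(
  question = "What is it commonly used for?",
  context_chunks = c("R is used for statistical computing and graphics."),
  chat_history = history
)
cat(prompt)
```

Including earlier turns lets a follow-up such as "What is it commonly used for?" resolve "it" without the user repeating the subject.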

LLM Support

call_llm(
  prompt = "Summarize the capital of France.",
  provider = "groq",
  model = "llama3-8b",
  temperature = 0.7,
  max_tokens = 200
)
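
To reuse one provider configuration across many calls, the fixed arguments can be captured in a closure. The sketch below uses a stand-in function, `fake_llm()`, so that it runs offline; both `make_llm()` and `fake_llm()` are hypothetical names. Whether create_rag_chain() expects a single-argument function of this shape is an assumption.

```r
# Hypothetical factory: fixes provider settings once and returns a
# function that takes only a prompt.
make_llm <- function(llm_fun, ...) {
  fixed_args <- list(...)
  function(prompt) do.call(llm_fun, c(list(prompt = prompt), fixed_args))
}

# Stand-in for a real provider call so the sketch runs without API keys.
fake_llm <- function(prompt, provider, model,
                     temperature = 0.7, max_tokens = 200) {
  paste0("[", provider, "/", model, "] ", prompt)
}

groq_llm <- make_llm(fake_llm, provider = "groq", model = "llama3-8b")
groq_llm("Summarize the capital of France.")
#> [1] "[groq/llama3-8b] Summarize the capital of France."
```

Replacing fake_llm with call_llm would, under the assumption above, give a preconfigured function to pass as the llm argument.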

📦 Related Package: chatLLM

The chatLLM package (now available on CRAN 🎉) offers a modular interface for interacting with LLM providers including OpenAI, Groq, Anthropic, DeepSeek, DashScope, and GitHub Models.

install.packages("chatLLM")

Features:

🔄 Uniform API across providers
🗣 Multi‑message context (system/user/assistant roles)
🔁 Retries & backoff with clear timeout handling
🔈 Verbose control (verbose = TRUE/FALSE)
⚙️ Discover models via list_models()
🏗 Factory interface for repeated calls
🌐 Custom endpoint override and advanced tuning
🔌 Native integration with RAGFlowChainR
🔐 .Renviron-based key management
