Description

A Lightweight and Versatile NLP Toolkit.

A toolkit for web scraping, modular NLP pipelines, and text preparation for large language models. Organized around four core actions: fetching, reading, processing, and searching. Covers the full pipeline from raw web data acquisition to structural text processing and BM25 indexing. Supports multiple retrieval strategies including regex, dictionary matching, and ranked keyword search. Pipe-friendly with no heavy dependencies; all outputs are plain data frames or data.tables.

textpress

textpress is an R toolkit for building text corpora and searching them -- no custom object classes, just plain data frames from start to finish. It covers the full arc from URL to retrieved passage through a consistent four-step API: Fetch, Read, Process, Search. Traditional tools (KWIC, BM25, dictionary matching) sit alongside modern ones (semantic search, LLM-ready chunking), all composing cleanly with |>.


Installation

From CRAN:

install.packages("textpress")

Development version:

remotes::install_github("jaytimm/textpress")

The textpress API

Conventions: corpus is a data frame with a text column plus one or more identifier columns, which are passed to the by argument (default doc_id). All outputs are plain data frames or data.tables, so each step pipes into the next.
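
To make the corpus convention concrete, here is a minimal sketch in base R. The two rows and the length threshold are invented for illustration; only the doc_id and text column names come from the convention above.

```r
# A toy corpus following the stated convention: one identifier column
# (`doc_id`) and one `text` column. The rows are made-up examples.
corpus <- data.frame(
  doc_id = c("a", "b"),
  text = c(
    "Text analysis starts with a clean corpus.",
    "Plain data frames pass through the native pipe."
  ),
  stringsAsFactors = FALSE
)

# Because the corpus is an ordinary data frame, base R verbs and the
# native pipe (R >= 4.1) work on it without any conversion step.
long_docs <- corpus |> subset(nchar(text) > 45)
```

Anything that accepts a data frame can sit between two pipeline steps, which is the practical upshot of avoiding custom object classes.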

1. Fetch (fetch_*)

Find URLs and metadata -- not full text. Pass results to read_urls() to get content.

  • fetch_urls(query, n_pages, date_filter) -- Search engine query; returns candidate URLs with metadata.
  • fetch_wiki_urls(query, limit) -- Wikipedia article URLs matching a search phrase.
  • fetch_wiki_refs(url, n) -- External citation URLs from a Wikipedia article's References section.

2. Read (read_*)

Scrape and parse URLs into a structured corpus.

  • read_urls(urls, ...) -- Character vector of URLs → list(text, meta). text is one row per node (headings, paragraphs, lists); meta is one row per URL. For Wikipedia, exclude_wiki_refs = TRUE drops References / See also / Bibliography sections.

3. Process (nlp_*)

Prepare text for search or indexing.

  • nlp_split_paragraphs() -- Break documents into structural blocks.
  • nlp_split_sentences() -- Segment blocks into individual sentences.
  • nlp_tokenize_text() -- Normalize text into a clean token stream.
  • nlp_index_tokens() -- Build a weighted BM25 index for ranked retrieval.
  • nlp_roll_chunks() -- Roll sentences into fixed-size chunks with surrounding context (RAG-style).
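
For readers unfamiliar with BM25, the following base R sketch shows the kind of weighting a BM25 index encodes. It is not the textpress implementation: the tokenizer, the function name bm25_score, the toy documents, and the defaults k1 = 1.2 and b = 0.75 are all assumptions, and the IDF form used is one common variant among several.

```r
# Toy documents (invented for illustration).
docs <- c(
  d1 = "the cat sat on the mat",
  d2 = "the dog chased the cat",
  d3 = "stock markets fell sharply"
)

# Hypothetical tokenizer: lowercase, split on non-alphanumeric runs.
tokenize <- function(x) {
  tok <- strsplit(tolower(x), "[^a-z0-9]+")[[1]]
  tok[nzchar(tok)]
}
tokens <- lapply(docs, tokenize)

# Score every document against a set of query terms with BM25.
bm25_score <- function(query_terms, tokens, k1 = 1.2, b = 0.75) {
  n_docs  <- length(tokens)
  doc_len <- lengths(tokens)
  avg_len <- mean(doc_len)
  scores  <- setNames(numeric(n_docs), names(tokens))
  for (term in query_terms) {
    tf <- vapply(tokens, function(tok) sum(tok == term), numeric(1))
    df <- sum(tf > 0)
    if (df == 0) next  # term absent from the collection
    idf <- log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # Term frequency saturates via k1; b controls length normalization.
    scores <- scores + idf * tf * (k1 + 1) /
      (tf + k1 * (1 - b + b * doc_len / avg_len))
  }
  sort(scores, decreasing = TRUE)
}

ranked <- bm25_score(c("cat", "mat"), tokens)
```

Here d1 matches both query terms and ranks first, d2 matches only "cat", and d3 scores zero.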

4. Search (search_*)

Four retrieval modes over the same corpus. Data-first, pipe-friendly.

| Function | Query type | Use case |
|---|---|---|
| search_regex(corpus, query) | Regex pattern | Specific strings, KWIC with inline highlighting. |
| search_dict(corpus, terms) | Term vector | Exact phrases and MWEs; built-in dict_generations, dict_political. |
| search_index(index, query) | Keywords | BM25 ranked retrieval over a token index. |
| search_vector(embeddings, query) | Numeric vector | Semantic nearest-neighbor search; use util_fetch_embeddings() to embed. |
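
As a rough illustration of what nearest-neighbor search over a numeric query vector involves, the sketch below ranks rows of an embedding matrix by cosine similarity in base R. The function cosine_rank, the three-dimensional toy embeddings, and the choice of cosine similarity are assumptions made for this example; they are not taken from the textpress source.

```r
# Toy embedding matrix: one row per chunk (values invented).
embeddings <- rbind(
  c1 = c(0.9, 0.1, 0.0),
  c2 = c(0.1, 0.8, 0.1),
  c3 = c(0.7, 0.3, 0.1)
)
query_vec <- c(1, 0, 0)

# Hypothetical helper: return the n rows most similar to the query.
cosine_rank <- function(embeddings, query_vec, n = 2) {
  sims <- as.vector(embeddings %*% query_vec) /
    (sqrt(rowSums(embeddings^2)) * sqrt(sum(query_vec^2)))
  names(sims) <- rownames(embeddings)
  head(sort(sims, decreasing = TRUE), n)
}

nearest <- cosine_rank(embeddings, query_vec)
```

With this toy data the query points along the first dimension, so c1 and c3 come back ahead of c2.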

RAG & LLM pipelines

textpress is designed to compose cleanly into retrieval-augmented generation pipelines.

Hybrid retrieval -- run search_index() and search_vector() over the same chunks, then merge with reciprocal rank fusion (RRF). Chunks that rank well under both term frequency and meaning rise to the top.
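
Reciprocal rank fusion can be sketched in a few lines of base R. The helper rrf_merge and the chunk ids below are hypothetical; the constant k = 60 is the value conventionally used for RRF, not a textpress setting. Each input is a character vector of chunk ids ordered best-first, as one might extract from a keyword ranking and a vector ranking.

```r
# Hypothetical helper: fuse any number of best-first rankings with RRF.
rrf_merge <- function(..., k = 60) {
  rankings <- list(...)
  ids <- unique(unlist(rankings))
  scores <- vapply(ids, function(id) {
    # Each ranking contributes 1 / (k + rank); absent ids contribute 0.
    sum(vapply(rankings, function(r) {
      pos <- match(id, r)
      if (is.na(pos)) 0 else 1 / (k + pos)
    }, numeric(1)))
  }, numeric(1))
  out <- data.frame(chunk_id = ids, rrf = unname(scores),
                    stringsAsFactors = FALSE)
  out[order(-out$rrf), ]
}

# Invented rankings standing in for keyword and semantic results.
bm25_ranked   <- c("c2", "c1", "c4")
vector_ranked <- c("c1", "c3", "c2")

fused <- rrf_merge(bm25_ranked, vector_ranked)
```

Chunks c1 and c2 appear in both lists and therefore outrank c3 and c4, which each appear in only one.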

Context assembly -- nlp_roll_chunks() with context_size > 0 gives each chunk a focal sentence plus surrounding context, so retrieved passages are self-contained when passed to an LLM.
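
The idea of a focal span plus surrounding context can be illustrated with a small base R function. This is a sketch of the general technique rather than the behavior of nlp_roll_chunks(): the function roll_chunks, the chunk_size argument, and the output column names are assumptions, and only context_size is named in the text above.

```r
# Placeholder sentences standing in for a sentence-split document.
sentences <- c("S1.", "S2.", "S3.", "S4.", "S5.")

# Hypothetical helper: group sentences into focal chunks of `chunk_size`,
# then widen each chunk by `context_size` sentences on either side.
roll_chunks <- function(sentences, chunk_size = 2, context_size = 1) {
  n <- length(sentences)
  starts <- seq(1, n, by = chunk_size)
  do.call(rbind, lapply(seq_along(starts), function(i) {
    focal <- starts[i]:min(starts[i] + chunk_size - 1, n)
    lo <- max(1, min(focal) - context_size)
    hi <- min(n, max(focal) + context_size)
    data.frame(
      chunk_id = i,
      chunk = paste(sentences[focal], collapse = " "),
      chunk_plus_context = paste(sentences[lo:hi], collapse = " "),
      stringsAsFactors = FALSE
    )
  }))
}

chunks <- roll_chunks(sentences)
```

In a retrieval pipeline one would typically score against the focal text and hand the wider window to the LLM, so a retrieved passage does not start or end mid-thought.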

Agent tool-calling -- the consistent API and plain data-frame outputs map naturally to tool use:

| Agent task | Function |
|---|---|
| "Find recent articles on X" | fetch_urls() |
| "Scrape these pages" | read_urls() |
| "Find all mentions of these entities" | search_dict() |
| "Follow citations from this Wikipedia article" | fetch_wiki_refs() |

License

MIT © Jason Timm

Citation

citation("textpress")

Issues

Report bugs or request features at https://github.com/jaytimm/textpress/issues.

Metadata

Version

1.1.1

License

Unknown

Platforms (80)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arc-linux
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • sh4-linux
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows