MyNixOS website logo
Description

A Lightweight and Versatile NLP Toolkit.

A simple Natural Language Processing (NLP) toolkit focused on search-centric workflows with minimal dependencies. The package offers key features for web scraping, text processing, corpus search, and text embedding generation via the 'HuggingFace API' <https://huggingface.co/docs/api-inference/index>.

R buildstatus

textpress

A lightweight, versatile NLP package for R, focused on search-centric workflows with minimal dependencies and easy data-frame integration. This package provides key functionalities for:

  • Web Search: Perform search engine queries to retrieve relevant URLs.

  • Web Scraping: Extract URL content, including some relevant metadata.

  • Text Processing & Chunking: Segment text into meaningful units, eg, sentences, paragraphs, and larger chunks. Designed to support tasks related to retrieval-augmented generation (RAG).

  • Corpus Search: Perform keyword, phrase, and pattern-based searches across processed corpora, supporting both traditional in-context search techniques (e.g., KWIC, regex matching) and advanced semantic searches using embeddings.

  • Embedding Generation: Generate embeddings using the HuggingFace API for enhanced semantic search.

Ideal for users who need a basic, unobtrusive NLP toolkit in R.

Installation

devtools::install_github("jaytimm/textpress")

Usage

Web search

sterm <- 'AI and education'

yresults <- textpress::web_search(search_term = sterm, 
                                  search_engine = "Yahoo News", 
                                  num_pages = 5)

yresults |> select(2) |>  sample_n(5) |> knitr::kable()
raw_url
https://gulfbusiness.com/ai-is-transforming-healthcare-heres-how/
https://campustechnology.com/Articles/2023/04/06/What-the-Past-Can-Teach-Us-About-the-Future-of-AI-and-Education.aspx
https://natlawreview.com/article/decoding-californias-recent-flurry-ai-laws
https://www.zdnet.com/article/pearson-launches-new-ai-certification-with-focus-on-practical-use-in-the-workplace/
https://campustechnology.com/Articles/2024/02/21/Creating-Guidelines-for-the-Use-of-Gen-AI-Across-Campus.aspx

Web Scraping

arts <- yresults$raw_url |> 
  textpress::web_scrape_urls(cores = 4)

Text Processing & Chunking

nlp_split_paragraphs() < nlp_split_sentences() < nlp_build_chunks()

articles <- arts |>  
  mutate(doc_id = row_number())|>
  
  textpress::nlp_split_paragraphs(paragraph_delim = "\\n+") |>
  textpress::nlp_split_sentences(text_hierarchy = c('doc_id', 
                                                    'paragraph_id')) |>
  
  textpress::nlp_build_chunks(text_hierarchy = c('doc_id', 
                                                 'paragraph_id', 
                                                 'sentence_id'),
                              chunk_size = 1,
                              context_size = 1) |>
  
  mutate(id = paste(doc_id, paragraph_id, chunk_id, sep = '.'))
idchunkchunk_plus_context
1.1.1‘TO AI OR NOT TO AI?’‘TO AI OR NOT TO AI?’ This is one of the most pressing questions that today’s educators and higher education leaders face.
1.1.2This is one of the most pressing questions that today’s educators and higher education leaders face.‘TO AI OR NOT TO AI?’ This is one of the most pressing questions that today’s educators and higher education leaders face. While there is no doubt that artificial intelligence (AI) will play an increasingly central role in people’s lives, many in the education sector remain skeptical — with some even deeming it a harbinger of educational doom.
1.1.3While there is no doubt that artificial intelligence (AI) will play an increasingly central role in people’s lives, many in the education sector remain skeptical — with some even deeming it a harbinger of educational doom.This is one of the most pressing questions that today’s educators and higher education leaders face. While there is no doubt that artificial intelligence (AI) will play an increasingly central role in people’s lives, many in the education sector remain skeptical — with some even deeming it a harbinger of educational doom. In a study conducted by global educational technology or edtech leader Anthology, 30% or three in every 10 university leaders in the Philippines see generative AI as unethical and should be banned from being used in educational settings.

KWIC Search

sterm2 <- c('\\bhigher education\\b',
            '\\bsecondary education\\b')
            # '\\S+ education\\b',
            # '\\b\\w{4,}\\b education\\b')

kwics <- articles |>
  rename(text = chunk) |>
  textpress::sem_search_corpus(search = sterm2,
                               text_hierarchy = c('doc_id', 
                                                  'paragraph_id', 
                                                  'chunk_id'))

kwics |> 
  mutate(id = paste(doc_id, 
                    paragraph_id, 
                    chunk_id, 
                    sep = '.')) |>
  select(id, pattern, text) |> 
  sample_n(5) |> knitr::kable()
idpatterntext
1.3.1higher educationThe study conducted across 11 countries including the Philippines involved 5,000 higher education leaders and students.
1.8.1higher educationAI is a game-changer in higher education, bridging gaps in accessibility and quality.
1.2.3higher educationIt revealed that university leaders have certain reservations around allowing AI in higher education, perceiving it as being unethical.
9.2.1Higher EducationIn 1998, noted technology critic and historian of automation David Noble published his influential article “Digital Diploma Mills: The Automation of Higher Education,” in which he warned about the negative impacts the internet would have on education.
15.4.2secondary education“It underscores the urgent need to address the looming AI knowledge gap in schools—for both students and teachers—to raise parental awareness and increase their involvement in AI conversations, and push for stronger AI integration in American primary and secondary education.”

Semantic search

HuggingFace embeddings

api_url <- "https://api-inference.huggingface.co/models/BAAI/bge-base-en-v1.5"

vstore <- articles |>
  rename(text = chunk) |>
  textpress::api_huggingface_embeddings(
    text_hierarchy = c('doc_id', 
                       'paragraph_id',
                       'chunk_id'),
    verbose = F,
    api_url = api_url,
    dims = 768, #1024, 768, 384
    api_token = api_token)

Embedd query

“How can AI personalize learning experiences for students?”

q <- "How can AI personalize learning experiences for students?"

query <- textpress::api_huggingface_embeddings(
  query = q,
  api_url = api_url,
  dims = 768,
  api_token = api_token)
rags <- textpress::sem_nearest_neighbors(
  x = query,
  matrix = vstore,
  n = 20) |>
  left_join(articles, by = c("term2" = "id"))

Relevant chunks

idcos_simchunk_plus_context
14.10.20.8441. Personalized learning: AI can analyze data to understand each student’s learning style, strengths and areas for improvement. For example, an AI-driven platform could identify that a particular student struggles with reading comprehension and then provide tailored exercises that improve the student’s skills.
7.8.20.836There’s a better way. This is where AI-assisted learning steps in to create personalized lesson plans. In our schools, we’ve transformed the traditional teacher’s role into that of a “guide.”
22.2.20.835Artificial intelligence has permeated nearly every industry, and higher education is no exception. AI-powered solutions promise to revolutionize learning by providing personalized and adaptive experiences.
22.7.50.810Consider how new tools integrate with existing platforms and map to the entire learner lifecycle. AI should simplify, not complicate, the student experience. With thoughtful implementation, these intelligent technologies can personalize learning and improve outcomes from start to finish.
7.10.10.810AI is revolutionizing the role of teachers by excelling at delivering personalized learning experiences. These advanced AI programs can swiftly and accurately pinpoint what a student knows and doesn’t know in each subject, allowing lessons to be designed around their unique aptitudes without any judgment.

Summary.

Metadata

Version

1.0.0

License

Unknown

Platforms (77)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    WASI
    Windows
Show all
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-darwin
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-darwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows