Description

Extracting Semantic Motifs from Textual Data.

A framework for extracting semantic motifs around entities in textual data. It implements an entity-centered semantic grammar that distinguishes six classes of motifs: actions of an entity, treatments of an entity, agents acting upon an entity, patients acted upon by an entity, characterizations of an entity, and possessions of an entity. Motifs are identified by applying a set of extraction rules to a parsed text object that includes part-of-speech tags and dependency annotations, such as those generated by 'spacyr'. For further reference, see: Stuhler (2022) <doi: 10.1177/00491241221099551>.

semgram: Extracting Semantic Motifs from Textual Data

semgram extracts semantic motifs around entities in textual data. For details, see Stuhler (2022), cited below. It uses an entity-centered semantic grammar that distinguishes six classes of motifs: actions of an entity, treatments of an entity, agents acting upon an entity, patients acted upon by an entity, characterizations of an entity, and possessions of an entity. To recover these motifs, semgram applies a comprehensive set of extraction rules to dependency trees (the output of dependency parsers). A short demo is also available.

semgram builds on spacyr for dependency parsing and on rsyntax for implementing rules that query dependency trees. If you want to extract relations other than those covered by the semgram grammar and don't mind implementing the formal rules from scratch, rsyntax is the way to go. You might also find their rsyntaxRecipes useful.

If you use semgram in your research, please cite as follows:

Stuhler, Oscar (2022). "Who does What to Whom? Making Text Parsers Work for Sociological Inquiry." Sociological Methods & Research. doi: 10.1177/00491241221099551.

Installation

Assuming you have installed devtools, you can install the development version of the package by running the following.

devtools::install_github("omstuhler/semgram")

Example

The first step in extracting semantic motifs from text is to pass it through an annotation pipeline. You can do this by running spacyr::spacy_parse().

text <- "Emil chased the thief."
# dependency = TRUE adds the head_token_id and dep_rel columns
tokens_df <- spacyr::spacy_parse(text, dependency = TRUE)
tokens_df

#>   doc_id sentence_id token_id  token lemma   pos head_token_id dep_rel
#> 1  text1           1        1   Emil  Emil PROPN             2   nsubj
#> 2  text1           1        2 chased chase  VERB             2    ROOT
#> 3  text1           1        3    the   the   DET             4     det
#> 4  text1           1        4  thief thief  NOUN             2    dobj
#> 5  text1           1        5      .     . PUNCT             2   punct
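
To make the link between this table and the motifs more concrete, here is a minimal base R sketch that rebuilds the parse above by hand and reads the subject and direct object of the root verb off the `head_token_id` and `dep_rel` columns. It is only an illustration of the idea: `children_of` is a hypothetical helper, not a semgram or rsyntax function, and semgram's actual extraction rules are considerably more comprehensive than this single lookup.

```r
# Hand-built copy of the parse shown above, so the sketch runs without spaCy.
tokens_df <- data.frame(
  doc_id        = "text1",
  sentence_id   = 1L,
  token_id      = 1:5,
  token         = c("Emil", "chased", "the", "thief", "."),
  lemma         = c("Emil", "chase", "the", "thief", "."),
  pos           = c("PROPN", "VERB", "DET", "NOUN", "PUNCT"),
  head_token_id = c(2L, 2L, 4L, 2L, 2L),
  dep_rel       = c("nsubj", "ROOT", "det", "dobj", "punct"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of semgram): lemmas of the tokens that attach
# to a given head with a given dependency relation. It assumes a single
# sentence, because token ids restart in every sentence.
children_of <- function(df, head_id, relation) {
  df$lemma[df$head_token_id == head_id &
           df$dep_rel == relation &
           df$token_id != head_id]
}

root_id <- tokens_df$token_id[tokens_df$dep_rel == "ROOT"]
verb    <- tokens_df$lemma[tokens_df$token_id == root_id]
subject <- children_of(tokens_df, root_id, "nsubj")
object  <- children_of(tokens_df, root_id, "dobj")

# The subject, verb lemma and direct object of the example sentence.
cat(subject, verb, object, "\n")
```

The subject of the root verb corresponds to the entity performing an action, and the direct object to the patient of that action.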

The workhorse of semgram is the extract_motifs function, to which we pass an annotated tokens object. We can also specify which entity we are interested in (here "Emil"). By default, extract_motifs extracts motifs for all motif classes (actions, patients, treatments, etc.).

In the example sentence, we find an action motif (a_chase) as well as a composite action-Patient motif (aP_chase_thief). For more functionality, check out the demo.

extract_motifs(tokens = tokens_df, entities = c("Emil"), markup = TRUE)

#> List of 8
#>  $actions
#>    doc_id ann_id    Entity action markup
#>    text1  text1.1.1 Emil   chase  a_chase
#>  $treatments
#>    character(0)
#>  $characterizations
#>    character(0)
#>  $possessions
#>    character(0)
#>  $agent_treatments
#>    character(0)
#>  $action_patients
#>    doc_id ann_id    Entity action Patient markup
#>    text1  text1.1.2 Emil   chase  thief   aP_chase_thief
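
If you want to work with the markup strings downstream, a small base R sketch like the following can split them into the motif class prefix and the lemmas that follow. It assumes the underscore-separated form seen in the output above (a_chase, aP_chase_thief); `split_markup` is a hypothetical helper and not part of semgram, and it would need adjusting if a lemma itself contained an underscore.

```r
# Hypothetical helper (not part of semgram): split a markup string into its
# class prefix and the lemmas that follow it.
split_markup <- function(markup) {
  parts <- strsplit(markup, "_", fixed = TRUE)[[1]]
  list(class = parts[1], lemmas = parts[-1])
}

a  <- split_markup("a_chase")         # action motif
aP <- split_markup("aP_chase_thief")  # composite action-Patient motif

str(a)
str(aP)
```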
Metadata

Version

0.1.0

License

Unknown

Platforms (75)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    WASI
    Windows
  • aarch64-darwin
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-darwin
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-darwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows