Detect and Parse Historic Dates.
unstruwwel
Overview
This R package detects and parses historic dates and converts them to standardized formats, for example ISO 8601-2:2019. It automatically converts language-specific verbal information, e.g., “circa 1st half of the 19th century,” into its standardized numerical counterpart, e.g., “1801-01-01~/1850-12-31~.” The package follows the recommendations of MIDAS (Marburger Informations-, Dokumentations- und Administrations-System); see, e.g., https://doi.org/10.11588/artdok.00003770. Internally, it uses lubridate. The name of the package is inspired by Heinrich Hoffmann’s rhymed story “Struwwelpeter”, which goes as follows:
Just look at him! there he stands, with his nasty hair and hands. See! his nails are never cut; they are grimed as black as soot; and the sloven, I declare, never once has combed his hair; anything to me is sweeter than to see Shock-headed Peter.
For the German-language original text, see the online digital library Wikisource.
Installation
You can install the released version of unstruwwel from CRAN with:
install.packages("unstruwwel")
To install the development version from GitHub use:
# install.packages("devtools")
devtools::install_github("stefanieschneider/unstruwwel")
Usage
The unstruwwel package contains only one function, unstruwwel(), that does all the magic language-specific standardization. unstruwwel() returns a named list, where each element is the result of applying the function to the corresponding element in the input vector.
English-language examples
library(unstruwwel)
library(magrittr) # provides the %>% pipe used below

dates <- c(
"5th century b.c.", "unknown", "late 16th century", "mid-12th century",
"mid-1880s", "June 1963", "August 11, 1958", "ca. 1920", "before 1856"
)
# returns valid ISO 8601-2:2019 dates
unlist(unstruwwel(dates, "en", scheme = "iso-format"), use.names = FALSE)
#> [1] "-0500-12-31/-0401-01-01" NA "1586-01-01/1600-12-31"
#> [4] "1146-01-01/1155-12-31" "1884-01-01/1885-12-31" "1963-06-01/1963-06-30"
#> [7] "1958-08-11/1958-08-11" "1920-01-01~/1920-12-31~" "..1855-12-31"
# returns a numerical interval of length 2
unstruwwel(dates, language = "en", scheme = "time-span") %>%
tibble::as_tibble() %>% dplyr::mutate(id = dplyr::row_number()) %>%
tidyr::gather(key = id) %>% tidyr::unnest_wider(value) %>%
dplyr::rename_all(dplyr::funs(c("text", "start", "end")))
#> # A tibble: 9 × 3
#> text start end
#> <chr> <dbl> <dbl>
#> 1 5th century b.c. -500 -401
#> 2 unknown NA NA
#> 3 late 16th century 1586 1600
#> 4 mid-12th century 1146 1155
#> 5 mid-1880s 1884 1885
#> 6 June 1963 1963 1963
#> 7 August 11, 1958 1958 1958
#> 8 ca. 1920 1920 1920
#> 9 before 1856 -Inf 1855
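Because the time-span scheme yields a named list of numeric intervals, a comparable table can also be assembled with base R alone. The following is a minimal sketch and not part of the package: `spans` is a hand-written stand-in that copies three rows from the output above and is only assumed to mirror what unstruwwel() returns, and `to_span_table()` is a hypothetical helper name.

```r
# Hand-written stand-in for unstruwwel(dates, "en", scheme = "time-span").
# The values are copied from the table above. Assumption: each element is a
# numeric vector of the form c(start, end), or NA for input that is not parsed.
spans <- list(
  "June 1963" = c(1963, 1963),
  "before 1856" = c(-Inf, 1855),
  "unknown" = NA
)

# Hypothetical helper: flatten the named list into one row per input string.
to_span_table <- function(x) {
  data.frame(
    text = names(x),
    # as.numeric(el)[i] yields NA when an element has fewer than i values
    start = vapply(x, function(el) as.numeric(el)[1], numeric(1)),
    end = vapply(x, function(el) as.numeric(el)[2], numeric(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

span_table <- to_span_table(spans)
print(span_table)
```

With real output, `spans` would be replaced by the list returned from unstruwwel(); whether its elements have exactly this shape should be checked with str() first.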
German-language examples
dates <- c(
"letztes Drittel 15. und 1. Hälfte 16. Jahrhundert", "undatiert", "1460?",
"wohl nach 1923", "spätestens 1750er Jahre", "1897 (Guss vmtl. vor 1906)"
)
# returns valid ISO 8601-2:2019 dates
unlist(unstruwwel(dates, "de", scheme = "iso-format"), use.names = FALSE)
#> [1] "1467-01-01/1550-12-31" NA "1460-01-01~/1460-12-31~"
#> [4] "1924-01-01?.." "..1749-12-31" "..1905-12-31?"
# returns a numerical interval of length 2
unstruwwel(dates, language = "de", scheme = "time-span") %>%
tibble::as_tibble() %>% dplyr::mutate(id = dplyr::row_number()) %>%
tidyr::gather(key = id) %>% tidyr::unnest_wider(value) %>%
dplyr::rename_all(dplyr::funs(c("text", "start", "end")))
#> # A tibble: 6 × 3
#> text start end
#> <chr> <dbl> <dbl>
#> 1 letztes Drittel 15. und 1. Hälfte 16. Jahrhundert 1467 1550
#> 2 undatiert NA NA
#> 3 1460? 1460 1460
#> 4 wohl nach 1923 1924 Inf
#> 5 spätestens 1750er Jahre -Inf 1749
#> 6 1897 (Guss vmtl. vor 1906) -Inf 1905
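The ISO strings carry more information than the numeric intervals: in ISO 8601-2:2019, `~` marks a date as approximate, `?` marks it as uncertain, and `..` denotes an open interval boundary. As a rough sketch in base R, using only strings copied from the outputs above, these markers can be turned into logical columns for filtering. The column names are illustrative choices and not part of the package.

```r
# ISO strings copied from the example outputs above
iso <- c(
  "1963-06-01/1963-06-30", "1920-01-01~/1920-12-31~",
  "..1855-12-31", "1924-01-01?.."
)

# One logical column per qualifier; fixed = TRUE treats "~" and "?" as
# literal characters rather than regular-expression syntax.
flags <- data.frame(
  iso = iso,
  approximate = grepl("~", iso, fixed = TRUE),
  uncertain = grepl("?", iso, fixed = TRUE),
  open_start = startsWith(iso, ".."),
  open_end = endsWith(iso, ".."),
  stringsAsFactors = FALSE
)

# Keep only fully specified intervals without any qualifier
exact <- flags$iso[!(flags$approximate | flags$uncertain |
  flags$open_start | flags$open_end)]
print(flags)
print(exact)
```

Keeping the qualifiers as separate columns makes it possible to decide per analysis whether approximate or open-ended dates should be included.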
Contributing
Please report issues, feature requests, and questions to the GitHub issue tracker. We have a Contributor Code of Conduct. By participating in unstruwwel you agree to abide by its terms.