Read 'ELAN' Files.
readelan 
The goal of {readelan} is to access the data (i.e., annotations) and metadata (e.g., tier types/structure, linked media and controlled vocabularies) in ELAN annotation files (.eaf), template files (.etf) and external controlled vocabulary files (.ecv) directly in R.
The {readelan} package is a fast yet lightweight package with only one dependency outside of core packages, namely {xml2}. The {xml2} package is central to access the inner structure of ELAN files, which are all fundamentally XML files.
Installation
You can install the development version of {readelan} like so:
pak::pak("borstell/readelan")
You can also get the official release version from CRAN:
install.packages("readelan")
Basic use
Reading annotations
The most obvious use of {readelan} is to be able to read annotations directly from an .eaf file:
library(readelan)
eaf_file <- system.file("extdata",
"example.eaf",
package = "readelan")
read_eaf(eaf_file)
#> filename a time_1 time_2 annotation tier tier_type participant
#> 1 example.eaf a1 ts1 ts2 I words default-lt s001
#> 2 example.eaf a2 ts3 ts4 like words default-lt s001
#> 3 example.eaf a3 ts5 ts6 cats words default-lt s001
#> 4 example.eaf a4 ts1 ts2 pronoun pos pos s001
#> 5 example.eaf a7 ts1 ts2 subject syntax syntax s001
#> 6 example.eaf a5 ts3 ts4 verb pos pos s001
#> 7 example.eaf a8 ts3 ts4 predicate syntax syntax s001
#> 8 example.eaf a6 ts5 ts6 noun pos pos s001
#> 9 example.eaf a9 ts5 ts6 object syntax syntax s001
#> annotator parent_ref a_ref language_ref cv_id
#> 1 ABC <NA> <NA> <NA> <NA>
#> 2 ABC <NA> <NA> <NA> <NA>
#> 3 ABC <NA> <NA> <NA> <NA>
#> 4 DEF words a1 eng pos
#> 5 GHI words a1 eng syntax
#> 6 DEF words a2 eng pos
#> 7 GHI words a2 eng syntax
#> 8 DEF words a3 eng pos
#> 9 GHI words a3 eng syntax
#> cve_ref start end duration time_unit
#> 1 <NA> 490 1590 1100 milliseconds
#> 2 <NA> 2010 3670 1660 milliseconds
#> 3 <NA> 4370 5980 1610 milliseconds
#> 4 cveid_89b2ac38-7313-4737-aa5f-19e1231ccb18 490 1590 1100 milliseconds
#> 5 cveid_4990ed36-c1d1-40c6-800e-2bd264d9a89b 490 1590 1100 milliseconds
#> 6 cveid_d5558ab7-11c3-47d5-9a0f-403724b0e0b7 2010 3670 1660 milliseconds
#> 7 cveid_de8c8f23-6b49-42da-9bd4-5b8b59d8b1da 2010 3670 1660 milliseconds
#> 8 cveid_f2eec815-c427-4b61-84eb-cd1e1601c9b4 4370 5980 1610 milliseconds
#> 9 cveid_f37323ed-9b7e-48d9-bbd6-b9105186ed02 4370 5980 1610 milliseconds
The tiers and xpath arguments in read_eaf() allow the user to specify tiers of interest already on reading the files, which simplifies and speeds up the process in cases where there are many tiers but only some of which are necessary to import:
library(readelan)
eaf_file <- system.file("extdata",
"example.eaf",
package = "readelan")
read_eaf(eaf_file, tiers = list(tier = "words"))
#> filename a time_1 time_2 annotation tier tier_type participant
#> 1 example.eaf a1 ts1 ts2 I words default-lt s001
#> 2 example.eaf a2 ts3 ts4 like words default-lt s001
#> 3 example.eaf a3 ts5 ts6 cats words default-lt s001
#> annotator parent_ref a_ref language_ref cv_id cve_ref start end duration
#> 1 ABC <NA> <NA> <NA> <NA> <NA> 490 1590 1100
#> 2 ABC <NA> <NA> <NA> <NA> <NA> 2010 3670 1660
#> 3 ABC <NA> <NA> <NA> <NA> <NA> 4370 5980 1610
#> time_unit
#> 1 milliseconds
#> 2 milliseconds
#> 3 milliseconds
Reading controlled vocabularies (CVs)
Some ELAN files may contain controlled vocabularies, which are pre-defined sets of labels that can be used as annotation values. The function read_cv() can extract all CVs from a file – whether a regular annotation file (.eaf), a template file (.etf) or an external CV file (.ecv):
library(readelan)
eaf_file <- system.file("extdata",
"example.eaf",
package = "readelan")
dplyr::glimpse(read_cv(eaf_file))
#> Rows: 12
#> Columns: 7
#> $ filename <chr> "example.eaf", "example.eaf", "example.eaf", "example.eaf", "…
#> $ url <chr> NA, NA, NA, NA, NA, NA, "./inst/extdata/syntax.ecv", "./inst/…
#> $ cv_id <chr> "pos", "pos", "pos", "pos", "pos", "pos", "syntax", "syntax",…
#> $ cve_id <chr> "cveid_89b2ac38-7313-4737-aa5f-19e1231ccb18", "cveid_89b2ac38…
#> $ lang_ref <chr> "eng", "fin", "eng", "fin", "eng", "fin", "eng", "fin", "eng"…
#> $ language <chr> "English (eng)", "Finnish (fin)", "English (eng)", "Finnish (…
#> $ value <chr> "pronoun", "pronomini", "verb", "verbi", "noun", "substantiiv…
etf_file <- system.file("extdata",
"CLP_annotation_v2-2.etf",
package = "readelan")
dplyr::glimpse(read_cv(etf_file))
#> Rows: 92
#> Columns: 7
#> $ filename <chr> "CLP_annotation_v2-2.etf", "CLP_annotation_v2-2.etf", "CLP_an…
#> $ url <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ cv_id <chr> "CLP scene", "CLP scene", "CLP scene", "CLP scene", "CLP scen…
#> $ cve_id <chr> "cveid_be1d709d-9158-4278-a220-620d2df5ceef", "cveid_321f1558…
#> $ lang_ref <chr> "und", "und", "und", "und", "und", "und", "und", "und", "und"…
#> $ language <chr> "undetermined (und)", "undetermined (und)", "undetermined (un…
#> $ value <chr> "1a", "1b", "1c", "2a", "2b", "2c", "2d", "3a", "3b", "3c", "…
ecv_file <- system.file("extdata",
"syntax.ecv",
package = "readelan")
dplyr::glimpse(read_cv(ecv_file))
#> Rows: 6
#> Columns: 7
#> $ filename <chr> "syntax.ecv", "syntax.ecv", "syntax.ecv", "syntax.ecv", "synt…
#> $ url <chr> NA, NA, NA, NA, NA, NA
#> $ cv_id <chr> "syntax", "syntax", "syntax", "syntax", "syntax", "syntax"
#> $ cve_id <chr> "cveid_4990ed36-c1d1-40c6-800e-2bd264d9a89b", "cveid_4990ed36…
#> $ lang_ref <chr> "eng", "fin", "eng", "fin", "eng", "fin"
#> $ language <chr> "English (eng)", "Finnish (fin)", "English (eng)", "Finnish (…
#> $ value <chr> "subject", "subjekti", "predicate", "predikaatti", "object", …
Reading tier metadata
To access information about tiers in an annotation file (.eaf) or template file (.etf), the read_tiers() function can be used:
library(readelan)
eaf_file <- system.file("extdata",
"example.eaf",
package = "readelan")
dplyr::glimpse(read_tiers(eaf_file))
#> Rows: 3
#> Columns: 9
#> $ filename <chr> "example.eaf", "example.eaf", "example.eaf"
#> $ tier_type <chr> "default-lt", "pos", "syntax"
#> $ tier <chr> "words", "pos", "syntax"
#> $ participant <chr> "s001", "s001", "s001"
#> $ annotator <chr> "ABC", "DEF", "GHI"
#> $ parent_ref <chr> NA, "words", "words"
#> $ language_ref <chr> NA, "eng", "eng"
#> $ constraint <chr> NA, "Symbolic_Association", "Symbolic_Association"
#> $ cv_ref <chr> NA, "pos", "syntax"
etf_file <- system.file("extdata",
"CLP_annotation_v2-2.etf",
package = "readelan")
dplyr::glimpse(read_tiers(etf_file))
#> Rows: 14
#> Columns: 9
#> $ filename <chr> "CLP_annotation_v2-2.etf", "CLP_annotation_v2-2.etf", "CL…
#> $ tier_type <chr> "CLP main", "CLP congruence", "CLP event", "CLP interacti…
#> $ tier <chr> "CLP handshape", "CLP congruence", "CLP event", "CLP inte…
#> $ participant <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
#> $ annotator <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
#> $ parent_ref <chr> NA, "CLP handshape", "CLP handshape", "CLP handshape", "C…
#> $ language_ref <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
#> $ constraint <chr> NA, "Symbolic_Association", "Symbolic_Association", "Symb…
#> $ cv_ref <chr> "CLP handshape", "CLP congruence", "CLP event", "CLP inte…
Reading media metadata
To access information about linked media file in an annotation file (.eaf), the read_media() function can be used:
library(readelan)
eaf_file <- system.file("extdata",
"example.eaf",
package = "readelan")
read_media(eaf_file)
#> filename item attribute value
#> 1 example.eaf 1 MEDIA_URL file:///Users/user01/video_01.mp4
#> 2 example.eaf 1 MIME_TYPE video/mp4
#> 3 example.eaf 1 RELATIVE_MEDIA_URL ./video_01.mp4
#> 4 example.eaf 2 MEDIA_URL file:///Users/user01/video_02.mp4
#> 5 example.eaf 2 MIME_TYPE video/mp4
#> 6 example.eaf 2 RELATIVE_MEDIA_URL ./video_02.mp4
#> 7 example.eaf 3 MEDIA_URL file:///Users/user01/audio_02.mp3
#> 8 example.eaf 3 MIME_TYPE audio/*
#> 9 example.eaf 3 RELATIVE_MEDIA_URL ./audio_02.mp3
Iterating over multiple files
The read_ functions all allow for either a single file or a vector of files to be used as file input. With multiple files, the function will iterate over them, and return a single data frame:
library(readelan)
eaf_file2 <- system.file("extdata",
"example_no_annotations.eaf",
package = "readelan")
dplyr::glimpse(read_eaf(c(eaf_file, eaf_file2)))
#> Rows: 9
#> Columns: 18
#> $ filename <chr> "example.eaf", "example.eaf", "example.eaf", "example.eaf…
#> $ a <chr> "a1", "a2", "a3", "a4", "a7", "a5", "a8", "a6", "a9"
#> $ time_1 <chr> "ts1", "ts3", "ts5", "ts1", "ts1", "ts3", "ts3", "ts5", "…
#> $ time_2 <chr> "ts2", "ts4", "ts6", "ts2", "ts2", "ts4", "ts4", "ts6", "…
#> $ annotation <chr> "I", "like", "cats", "pronoun", "subject", "verb", "predi…
#> $ tier <chr> "words", "words", "words", "pos", "syntax", "pos", "synta…
#> $ tier_type <chr> "default-lt", "default-lt", "default-lt", "pos", "syntax"…
#> $ participant <chr> "s001", "s001", "s001", "s001", "s001", "s001", "s001", "…
#> $ annotator <chr> "ABC", "ABC", "ABC", "DEF", "GHI", "DEF", "GHI", "DEF", "…
#> $ parent_ref <chr> NA, NA, NA, "words", "words", "words", "words", "words", …
#> $ a_ref <chr> NA, NA, NA, "a1", "a1", "a2", "a2", "a3", "a3"
#> $ language_ref <chr> NA, NA, NA, "eng", "eng", "eng", "eng", "eng", "eng"
#> $ cv_id <chr> NA, NA, NA, "pos", "syntax", "pos", "syntax", "pos", "syn…
#> $ cve_ref <chr> NA, NA, NA, "cveid_89b2ac38-7313-4737-aa5f-19e1231ccb18",…
#> $ start <dbl> 490, 2010, 4370, 490, 490, 2010, 2010, 4370, 4370
#> $ end <dbl> 1590, 3670, 5980, 1590, 1590, 3670, 3670, 5980, 5980
#> $ duration <dbl> 1100, 1660, 1610, 1100, 1100, 1660, 1660, 1610, 1610
#> $ time_unit <chr> "milliseconds", "milliseconds", "milliseconds", "millisec…
Note: In this case, the resulting data frame is the same as in the first example, since the second .eaf file does not contain any annotations.
When iterating over a large number of files, setting the progress argument to TRUE can be advisable, in order to get visual information of the progress.
Similar packages
There are other packages that also contain functions to read ELAN annotation files, e.g., {act} and {phonfieldwork}. However, the goal of {readelan} is to be a lightweight package dedicated to ELAN files (including .etf and .ecv files), and which is also fast. Benchmarking against these other packages indicate that {readelan} is fast option.
eaf_file <- system.file("extdata",
"example.eaf",
package = "readelan")
microbenchmark::microbenchmark(
readelan::read_eaf(eaf_file),
phonfieldwork::eaf_to_df(eaf_file),
act::import_eaf(eaf_file),
times = 100
)
#> Unit: milliseconds
#> expr min lq mean median
#> readelan::read_eaf(eaf_file) 3.54554 3.752337 4.467007 3.916893
#> phonfieldwork::eaf_to_df(eaf_file) 12.21567 13.002686 17.925177 13.628694
#> act::import_eaf(eaf_file) 59.55592 63.103674 66.648167 63.936351
#> uq max neval cld
#> 4.522892 11.86703 100 a
#> 15.973282 376.29846 100 b
#> 65.180649 142.63887 100 c
Additionally, since {readelan}’s read_eaf() function can inherently iterate over many files, and specify target tiers or tier types using the tiers or xpath arguments, it becomes much faster to read larger sets of annotation files for which only specific tiers are of interest to import.