An overview of rezonateR for Rezonator users
Are you already familiar with Rezonator and want to do quantitative
analyses with it, but don’t yet know how to use rezonateR?
If so, then this overview is for you. This overview will not go through
all the details of rezonateR, but we will provide a
snapshot of the most important functions in the package that will help
you in your annotation and analysis. If you aren’t very familiar with
the companion tool Rezonator and want to know what can be done with this
tool, the toy example vignette("sample_proj") gives a
concrete example of a mini-research project using Rezonator and
rezonateR, and will give you a feel for what sorts of
projects are possible with Rezonator + rezonateR. It would
also be helpful to check out the official Rezonator guides and
try some simple annotations in Rezonator to get a feel for how it works.
If you already know how rezonateR works, you can start from
vignette("import_save_basics") to learn the nitty-gritty of
coding in rezonateR.
Preliminaries
Why use rezonateR?
If you’re reading this, you’re probably already using Rezonator to do
your daily work. Rezonator is a powerful tool for annotating and
visualising the dynamics of human engagement. The purpose of
rezonateR is to add a series of additional tools that
enhance the functionality of Rezonator and increase your productivity!
My goal is to minimise the time you spend on coding and annotating so
you have more time for thinking.
Rezonator has many cool features, but due to technical restrictions, there are certain things it can’t do, such as:
- Dividing annotations into layers
- Adding chunks and track chain entries that span multiple units
- Linking tree entries to the corresponding chunks
- Guessing the value of a field by looking at the values of other fields
- Automatically updating the values of certain fields using information from other fields
rezonateR can do all these, and more!
Moreover, rezonateR is geared toward people of all skill
levels in R. I do a lot of the heavy lifting for you. As long as you
have some familiarity with base R, you can quickly pick up the basic
functions, but more advanced users can also extend it as they see
fit.
The rezonateR engine is based heavily on Tidyverse
packages, particularly rlang and dplyr. Although you don’t need to be
familiar with Tidyverse to use the basic functionality, Tidyverse users
will be happy to see a wide range of functions that mimic Tidyverse
functions in appearance, but with additional fields to support the wide
range of functionality in rezonateR.
If you’re wondering when you should start learning
rezonateR, the best time is now! Although you can get
started with the Rezonator GUI relatively easily, if there
are certain things you know you will use in rezonateR -
such as multi-line chunks - you probably want to have the
rezonateR post-processing in mind, even when you’re
annotating in Rezonator.
install.packages("devtools")
library(devtools)
install_github("rezonators/rezonateR")
Some folks have reported that this does not work. If this is the case, you may want to try
adding the code options(download.file.method = "auto")
before running install_github().
A quick import
Let’s start by loading rezonateR.
library(rezonateR)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#> Loading required package: readr
#> Loading required package: stringr
#> Loading required package: rlang
Now let’s import our first file, a short spoken text in Lhasa Tibetan (you can find the original video here: https://av.mandala.library.virginia.edu/video/couple-must-part-threes-company-02). This file contains a number of chunks, track chains and trees, and we will deal with them in this vignette:
path = system.file("extdata", "virginia-library-20766.rez", package = "rezonateR")
layerRegex = list(
track = list(field = "trailLayer", regex = c("clausearg", "discdeix"), names = c("clausearg", "discdeix", "refexpr")),
chunk = list(field = "chunkLayer", regex = c("verb", "adv", "predadj"), names = c("verb", "adv", "predadj", "refexpr")))
myRez = importRez(path, layerRegex = layerRegex, concatFields = c("word", "wordWylie"))
#> Import starting - please be patient ...
#> Creating node maps ...
#> Creating rezrDFs ...
#> Adding foreign fields to rezrDFs and sorting (this is the slowest step) ...
#> >Adding to unit entry DF ...
#> >Adding to unit DF ...
#> >Adding to chunk DF ...
#> >Adding to track DFs ...
#> >Adding to tree DFs ...
#> Splitting rezrDFs into layers ...
#> A few finishing touches ...
#> Done!
The layerRegex object is a series of instructions to
tell importRez how to divide chunks and track chains into
different layers. In this case, I placed a field called
‘trailLayer’ on track chains, which has three possible
values: clausearg, discdeix, and nothing.
These are captured in the regex field. The first two
regexes correspond to the two names ‘clausearg’ and
‘discdeix’, and the default case, where neither of the first
two regexes is detected, is ‘refexpr’. I have done the same
thing to chunks, as you can see above. If you don’t want to use layers,
you don’t have to specify layerRegex. In that case, I will
create a single layer called ‘default’ for you.
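As a rough mental model (and not rezonateR's actual implementation), the layer assignment described above can be pictured in base R. The helper assign_layer() is hypothetical, the toy values are made up, and the rule that earlier regexes win when several match is my own assumption:

```r
# Hypothetical helper, NOT part of rezonateR: assign each value to a layer.
# `regex` holds one pattern per named layer; the extra last name is the default.
assign_layer <- function(values, regex, names) {
  stopifnot(length(names) == length(regex) + 1)
  layer <- rep(names[length(names)], length(values))  # start with the default
  # Go through the regexes in reverse so that earlier ones take priority
  # (an assumption made for this sketch).
  for (i in rev(seq_along(regex))) {
    layer[grepl(regex[i], values)] <- names[i]
  }
  layer
}

# Toy values of a 'trailLayer' field; "" stands for 'nothing'.
trail_layer <- c("clausearg", "", "discdeix", "clausearg", "")
layers <- assign_layer(trail_layer,
                       regex = c("clausearg", "discdeix"),
                       names = c("clausearg", "discdeix", "refexpr"))
print(layers)
```

The names vector is one element longer than the regex vector, just as in the layerRegex list above: the extra name is what rows receive when no regex matches.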
The other mysterious argument in the importRez function is
concatFields. These are fields belonging to tokens that you
would like to concatenate for higher-level entities like chunks and units.
For example, if tokens 1 and 2 are ‘happy’ and ‘person’, and you have a
chunk that contains these two tokens, you would want the whole string
‘happy person’ to be associated with the chunk. Typically, you should
specify at least one field for doing this. In this case, we will
concatenate the fields ‘word’ and ‘wordWylie’
(Wylie is the most common Romanisation system for Tibetan). It is
important not to overdo it, and specify too many fields to concatenate,
as this step can slow down your import considerably.
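To make the idea of concatenation concrete, here is a small base R sketch of the ‘happy person’ example. It only illustrates the concept: the table layout, the chunk_tokens list and the concat_field() helper are all hypothetical, and rezonateR's internals may work differently:

```r
# Toy token table (made-up data).
tokens <- data.frame(
  id   = c("t1", "t2", "t3"),
  word = c("happy", "person", "."),
  stringsAsFactors = FALSE
)

# Each chunk lists the IDs of the tokens it contains.
chunk_tokens <- list(c1 = c("t1", "t2"))

# Hypothetical helper: look up the tokens of one chunk and paste one field.
concat_field <- function(token_ids, tokens, field) {
  paste(tokens[[field]][match(token_ids, tokens$id)], collapse = " ")
}

chunk_word <- vapply(chunk_tokens, concat_field, character(1),
                     tokens = tokens, field = "word")
print(chunk_word)
```

Every extra field means one more lookup of this kind for each chunk, unit and so on, which gives an intuition for why long lists of concatFields slow the import down.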
The result of the import is an object called rezrObj,
which we will discuss below. When you import a more substantial file,
say around 30 minutes, the import speed can be rather slow. Please be
patient! The good news is that rezonateR contains
functionality for saving and loading rezrObj objects, so
you don’t have to import each time you work on a file in R.
Introduction to rezrObjs and nodeMaps
rezrObjs
There are three main kinds of objects in rezonateR that
you will interact with directly, namely rezrObj,
nodeMap and rezrDF. rezrObjs and
nodeMaps will be covered in this section.
rezrDFs are relatively complex, and will form the bulk of
our discussion in this vignette.
rezrObj objects contain one single nodeMap
and several rezrDFs:
print("Item in myRez:")
#> [1] "Item in myRez:"
names(myRez)
#> [1] "nodeMap" "cardDF" "chunkDF" "docDF" "entryDF"
#> [6] "linkDF" "mergedDF" "stackDF" "tokenDF" "trackDF"
#> [11] "trailDF" "treeDF" "treeEntryDF" "treeLinkDF" "unitDF"
The relationships between the various entities are as follows:
- Chunks and tree entries are built on top of tokens
- Entries correspond exactly to tokens, and units are built on entries
- Tracks refer to both chunks and tokens
This has some ramifications for updating, which we’ll come to later.
nodeMaps
The nodeMap is similar to the internal representation of
the file used in Rezonator-generated .rez files. The node
map in a .rez file is a disorganised list of nodes, each of
which corresponds to an entity inside Rezonator: units, tokens, and so
on. The rezonateR nodeMap is similar. The major difference
is that nodes are organised into sub-categories, according to the type
of entity that the node is encoding. Let’s have a sneak peek at these
categories:
print("Items in the nodeMap:")
#> [1] "Items in the nodeMap:"
names(myRez$nodeMap)
#> [1] "token" "entry" "unit" "track" "chunk" "card"
#> [7] "link" "trail" "stack" "corpus" "doc" "treeEntry"
#> [13] "treeLink" "tree"
In practice, you will not interact with most of these
nodeMaps except token. Most of the time, you will only be
dealing with rezrDFs, which are much easier to work with.
Let’s take a look at them.
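Before moving on, the difference between a flat list of nodes and a nodeMap organised by category can be pictured with a few lines of base R. This is only a sketch with made-up nodes and a hypothetical ‘type’ element, not the actual structure of a .rez file:

```r
# A flat, 'disorganised' list of nodes (made-up data).
nodes <- list(
  n1 = list(type = "token", text = "happy"),
  n2 = list(type = "token", text = "person"),
  n3 = list(type = "unit", unitSeq = 1)
)

# Group the nodes by the type of entity they encode.
node_types <- vapply(nodes, function(node) node$type, character(1))
node_map <- split(nodes, node_types)

print(names(node_map))
print(names(node_map$token))
```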
Introduction to rezrDFs
A rezrDF is like a normal data frame that you know from
base R. Here’s the beginning of the unit table:
print(head(myRez$unitDF %>% select(id, unitSeq, srtLineBo)))
#> # A tibble: 6 × 3
#> id unitSeq srtLineBo
#> <chr> <dbl> <chr>
#> 1 22C69D930B150 1 "བཀྲ་-"
#> 2 2456BC9E7C5D5 2 "བཀྲ་ཤིས་བདེ་ལེགས། "
#> 3 201A180965346 3 "ཕེབས་གནང་། "
#> 4 2BD385D3E52C7 4 "རྒན་ངག་དབང་ལགས་ཡིན་ནམ། "
#> 5 2B558CAD0F0B4 5 "ལགས་ཡིན། ང་"
#> 6 1213A3F8E28E0 6 "ལགས་ཡིན། ང་"
chunkDF, trackDF, rezDF,
stackDF, etc., are divided into layers. If you directly
access the ‘chunkDF’ and ‘trackDF’ components
of myRez, you will get a list of rezrDFs, one
for each layer. Here are the names of our chunk layers that you might
remember from the introduction, along with the beginning of one of the
associated rezrDFs:
print("Component DFs of chunkDF:")
names(myRez$chunkDF)
print(head(myRez$chunkDF$refexpr) %>% select(id, word))
rezrDF inherits from the Tidyverse ‘tibble’ structure,
so all tibble-related functions can be used with rezrDFs. However, using
classic Tidyverse functions on rezrDFs is often dangerous,
as rezrDFs have additional functionality that goes beyond
classic tibbles.
There are three main differences that make rezrDF
special:
Perk 1: Field access labels
Field access labels prevent you from accidentally changing things
that you shouldn’t be changing. Let’s look at the field access values of
the unitDF:
print("fieldaccess:")
#> [1] "fieldaccess:"
fieldaccess(myRez$unitDF)
#> id doc unitStart unitEnd
#> "key" "core" "core" "core"
#> unitSeq pID srtLineStart srtLineBo
#> "core" "core" "flex" "flex"
#> iuEnd iuStart srtLineEn iu
#> "flex" "flex" "flex" "flex"
#> srtLineID iuText speaker srtLineEnd
#> "flex" "flex" "flex" "flex"
#> word wordWylie docTokenSeqFirst docTokenSeqLast
#> "foreign" "foreign" "foreign" "foreign"
There are five possible field access values:
- ‘key’: The primary key of the table. You are not allowed to change it (unless you turn it into a non-key field, but this is not encouraged since you will basically break everything). If you try to update these fields using rezonateR functions, I will stop you with an error.
- ‘core’: Core fields, mostly generated by Rezonator. You can change them, but I will give you a warning if you do, because changing a core field has strong potential to break things.
- ‘flex’: Flexible fields, usually fields whose values you enter into Rezonator, though there are also flex fields automatically generated by Rezonator.
- ‘auto’: Fields whose values are automatically generated using information from the SAME rezrDF.
- ‘foreign’: Fields whose values are automatically generated using information from a DIFFERENT rezrDF (or several different rezrDFs, but this is an advanced feature we will stay away from in this vignette).
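The behaviour for ‘key’ and ‘core’ fields can be sketched in base R as follows. This is an illustration of the idea only: check_edit() is a hypothetical function, not a rezonateR one, and the messages are invented:

```r
# Field access labels for a toy table.
field_access <- c(id = "key", unitSeq = "core", speaker = "flex")

# Hypothetical guard: refuse edits to key fields, warn about core fields.
check_edit <- function(field, access) {
  label <- access[[field]]
  if (label == "key") {
    stop("'", field, "' is a key field and cannot be changed.")
  }
  if (label == "core") {
    warning("'", field, "' is a core field; changing it may break things.")
  }
  invisible(label)
}

result_flex <- check_edit("speaker", field_access)
result_key  <- tryCatch(check_edit("id", field_access),
                        error = function(e) "error")
result_core <- tryCatch(check_edit("unitSeq", field_access),
                        warning = function(w) "warning")
print(c(result_flex, result_key, result_core))
```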
Perk 2: Reloads
The reload() function is one of the core features of
rezonateR that makes it so convenient to use. The
reload() feature is based on updateFunctions.
You can access the updateFunctions of a table using
updateFunct():
print("updateFunct of unitDF:")
#> [1] "updateFunct of unitDF:"
updateFunct(myRez$unitDF)
#> $word
#> function (df, rezrObj)
#> updateLowerToHigher(df, rezrObj, address, fkeyAddress, action,
#> field, fkeyInDF, seqName)
#> <bytecode: 0x000001da3b578920>
#> <environment: 0x000001da3b5686a8>
#> attr(,"class")
#> [1] "updateFunction" "function"
#> attr(,"deps")
#> [1] "entryDF/word"
#>
#> $wordWylie
#> function (df, rezrObj)
#> updateLowerToHigher(df, rezrObj, address, fkeyAddress, action,
#> field, fkeyInDF, seqName)
#> <bytecode: 0x000001da3b57f670>
#> <environment: 0x000001da3b578290>
#> attr(,"class")
#> [1] "updateFunction" "function"
#> attr(,"deps")
#> [1] "entryDF/wordWylie"
#>
#> $docTokenSeqFirst
#> function (df, rezrObj)
#> updateLowerToHigher(df, rezrObj, address, fkeyAddress, action,
#> field, fkeyInDF, seqName)
#> <bytecode: 0x000001da3b57e1e0>
#> <environment: 0x000001da3b57efe0>
#> attr(,"class")
#> [1] "updateFunction" "function"
#> attr(,"deps")
#> [1] "entryDF/docTokenSeq"
#>
#> $docTokenSeqLast
#> function (df, rezrObj)
#> updateLowerToHigher(df, rezrObj, address, fkeyAddress, action,
#> field, fkeyInDF, seqName)
#> <bytecode: 0x000001da3b572c00>
#> <environment: 0x000001da3b573a00>
#> attr(,"class")
#> [1] "updateFunction" "function"
#> attr(,"deps")
#> [1] "entryDF/docTokenSeq"
There are three reload functions: reloadLocal(),
reloadForeign() and reload().
reloadLocal() only takes a rezrDF, and only
updates auto fields. reloadForeign() and
reload() take a rezrDF and a
rezrObj, and update the rezrDF using the
rezrObj (which may or may not contain the
rezrDF).
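The general idea behind a local reload can be sketched in base R: each auto field keeps a function that knows how to recompute it, and reloading simply reruns those functions. The names reload_local and update_functions below are hypothetical stand-ins; rezonateR stores its updateFunctions differently, as the output above shows:

```r
# Toy table with one ordinary field.
df <- data.frame(word = c("a", "bcd"), stringsAsFactors = FALSE)

# One stored update function per auto field (hypothetical structure).
update_functions <- list(
  wordLength = function(df) nchar(df$word)
)

# Recompute every auto field from the current contents of the table.
reload_local <- function(df, update_functions) {
  for (field in names(update_functions)) {
    df[[field]] <- update_functions[[field]](df)
  }
  df
}

df <- reload_local(df, update_functions)
df$word[1] <- "abcde"                     # edit the field that wordLength depends on
df <- reload_local(df, update_functions)  # the auto field catches up
print(df)
```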
Let’s take a look at reload(), which is the most useful
function of these. Here, in the original data, when there are zero
mentions, only the orthographic representation is written as <0>;
the Wylie romanisation is a blank string. I want to change the Wylie
romanisation to also contain <0>s. I do that by using the
rez_mutate() function on the tokenDF first
(don’t worry about what that means yet; we’ll cover it later). After
that, I reload the entryDF, and then reload the
unitDF. (Recall that the units depend on entries which in
turn depend on tokens; that’s why we can’t just reload the
unitDF directly.) After the update, the unitDF
is updated with <0>s appearing in the Wylie romanisation:
print("Before the update")
#> [1] "Before the update"
myRez$unitDF %>% filter(str_detect(word, "<0>")) %>% rez_select(id, word, wordWylie) %>% head
#> # A tibble: 6 × 3
#> id word wordWylie
#> <chr> <chr> <chr>
#> 1 201A180965346 <0> ཕེབས་ གནང་ ། " phebs gnang …
#> 2 2BD385D3E52C7 <0> རྒན་ ངག་དབང་ ལགས་ ཡིན་ ནམ ། " rgan ngag dba…
#> 3 2B558CAD0F0B4 ལགས་ <0> ཡིན ། "lags yin /"
#> 4 1A5B143F0646A ལགས་ <0> <0> རེད ། "lags red /"
#> 5 2414F2965B74A ང་ འདི ར་ སྤྱིར་བཏང་ <0> སློབ་སྦྱོང་ བྱེད་ གར་ ཡོང་བ་ ཡིན ། "nga 'di ra sp…
#> 6 1FA517FFC43A <0> སློབ་ཕྲུག་ ཡིན་ X X " slob phrug yi…
#Change something in the token rezrDF that is significant for the unit rezrDF
myRez$tokenDF = myRez$tokenDF %>% rez_mutate(
wordWylie = case_when(word == "<0>" ~ "<0>", T ~ wordWylie))
myRez$entryDF = myRez$entryDF %>% reload(myRez)
myRez$unitDF = myRez$unitDF %>% reload(myRez)
print("After the update")
#> [1] "After the update"
myRez$unitDF %>% filter(str_detect(word, "<0>")) %>% rez_select(id, word, wordWylie) %>% head
#> # A tibble: 6 × 3
#> id word wordWylie
#> <chr> <chr> <chr>
#> 1 201A180965346 <0> ཕེབས་ གནང་ ། <0> phebs gnang…
#> 2 2BD385D3E52C7 <0> རྒན་ ངག་དབང་ ལགས་ ཡིན་ ནམ ། <0> rgan ngag d…
#> 3 2B558CAD0F0B4 ལགས་ <0> ཡིན ། lags <0> yin /
#> 4 1A5B143F0646A ལགས་ <0> <0> རེད ། lags <0> <0> re…
#> 5 2414F2965B74A ང་ འདི ར་ སྤྱིར་བཏང་ <0> སློབ་སྦྱོང་ བྱེད་ གར་ ཡོང་བ་ ཡིན ། nga 'di ra spy…
#> 6 1FA517FFC43A <0> སློབ་ཕྲུག་ ཡིན་ X X <0> slob phrug …
You might be wondering how to reload an entire rezrObj.
Because tables often depend on each other (for example, field A in table
X relies on field B in table Y which in turn relies on field C in table
X), this is technically difficult, but I plan to add this function
before the 1.0 release. Stay tuned!
Perk 3: Correspondences to nodeMaps
rezrDFs encode information about whether a field is in
the nodeMap or not:
print("inNodeMap:")
#> [1] "inNodeMap:"
inNodeMap(myRez$unitDF)
#> id doc unitStart unitEnd
#> "key" "primary" "primary" "primary"
#> unitSeq pID srtLineStart srtLineBo
#> "primary" "primary" "tagmap" "tagmap"
#> iuEnd iuStart srtLineEn iu
#> "tagmap" "tagmap" "tagmap" "tagmap"
#> srtLineID iuText speaker srtLineEnd
#> "tagmap" "tagmap" "tagmap" "tagmap"
#> word wordWylie docTokenSeqFirst docTokenSeqLast
#> "no" "no" "no" "no"
This doesn’t do much yet, since you’re not yet allowed to push a
field created in a rezrDF back to a nodeMap.
This will be available in the 1.0 release.
Editing rezrDFs
One of the core features of rezonateR is to facilitate
the automatic and semi-automatic creation of fields, which is currently
not supported in Rezonator. There are also other operations you may want
to perform on rezrDFs.
To cater to users of different habits and skill levels, I have
introduced three different levels of functions for editing rezrDFs.
- EasyEdit can be quickly picked up by everyone, including base R users, and covers the most basic operations you would want to do to a rezrDF (e.g. addField(), changeFieldForeign())
- TidyRez is easy to pick up for tidyverse users, though there is some learning curve for others (e.g. rez_mutate(), rez_left_join())
- Core engine: These are mostly functions that I use within rezonateR under the hood. Users who want maximum flexibility may also use them (e.g. lowerToHigher(), createLeftJoinUpdate()), but do be aware that I may make changes to these without notice, since I will assume that most users have little use for them.
Crucially, while the functions within EasyEdit share a very similar
syntax, as do the functions within TidyRez, functions within the core
engine are a lot more divergent, and EasyEdit and
TidyRez also differ considerably from each other in their syntax. So if you
are comfortable with TidyRez, minimising the use of
EasyEdit may make your code look more consistent, and vice
versa.
EasyEdit
EasyEdit consists of four commonly used functions,
addFieldLocal(), addFieldForeign(),
changeFieldLocal() and changeFieldForeign().
There are also some less useful functions not covered in this vignette,
but which you can find in the references, like
addRow().
All of the four basic functions can be applied to both
rezrDFs and rezrObjs. In this vignette, we
will mainly apply them to rezrObjs. If you want to deal
with emancipated rezrDFs, i.e. rezrDFs that
are not part of a rezrObj, you will want to use the
versions that apply to rezrDFs, but those are simpler than
the rezrObj versions, so you should be able to pick them up
quickly using the manual.
Let’s start by looking at addFieldLocal().
addField() is a shortcut for addFieldLocal(),
and we will be using this shortcut name throughout. Our first example is
very simple. In our tokenDF, let’s add a field that
automatically calculates the length of a word in characters. Here,
‘entity’ specifies the name of the entity you would like to
change, ‘layer’ specifies the layer within that entity
(which is an empty string since there are no token layers),
fieldName is the name of the field we’re adding, expression
is the R expression with which we calculate the new field, and
fieldaccess tells rezonateR to make this an
auto field with an updateFunction that will be attached to
the table:
myRez = addField(myRez, entity = "token", layer = "",
fieldName = "orthoLength",
expression = nchar(word),
fieldaccess = "auto")
print("A fragment of the updated table:")
#> [1] "A fragment of the updated table:"
head(myRez$tokenDF %>% rez_select(id, word, orthoLength))
#> # A tibble: 6 × 3
#> id word orthoLength
#> <chr> <chr> <int>
#> 1 2ECADE1029CD3 བཀྲ་- 5
#> 2 15A5089F6157A བཀྲ་ཤིས་ 8
#> 3 197AA4A0C625F བདེ་ལེགས 8
#> 4 2C7746BD6F150 ། 1
#> 5 354CBFB3632B6 <0> 3
#> 6 2D1DAD2FFF22A ཕེབས་ 5
print("The updateFunction:")
#> [1] "The updateFunction:"
updateFunct(myRez$tokenDF, "orthoLength")
#> function (df)
#> updateMutate(df, field, x)
#> <environment: 0x000001da3bdb8c18>
#> attr(,"class")
#> [1] "updateFunction" "function"
#> attr(,"deps")
#> character(0)
Now let’s spice this up a bit by adding a complex field. A complex
field takes information from multiple rows of a table. In this case, we
are working with the tokenDF, but want the new column to be
the length of the longest word that appears in the unit that the token
comes from. In this case, the groupField is ‘unit’, and we
specify the field type as ‘complex’. The expression uses the function
longestLength(), which is a rezonateR function
that returns the length of the longest word in a series of words.
myRez = addField(myRez, entity = "token", layer = "",
fieldName = "longestWordInUnit",
expression = longestLength(word),
type = "complex",
groupField = "unit",
fieldaccess = "auto")
head(myRez$tokenDF %>% select(id, word, longestWordInUnit))
#> # A tibble: 6 × 3
#> id word longestWordInUnit
#> <chr> <chr> <int>
#> 1 2ECADE1029CD3 བཀྲ་- 5
#> 2 15A5089F6157A བཀྲ་ཤིས་ 8
#> 3 197AA4A0C625F བདེ་ལེགས 8
#> 4 2C7746BD6F150 ། 8
#> 5 354CBFB3632B6 <0> 5
#> 6 2D1DAD2FFF22A ཕེབས་ 5
Now let’s add a simple foreign field. Let’s say when we look at the
tokenDF, we also want to know what the whole unit’s words
are. The source is the ‘word’ field of units, and we are
creating a new field for tokens called ‘unitWord’. The
foreign key is the field that contains IDs of the source table inside
the target table, in this case the ‘unit’ field of
tokenDF:
myRez = addFieldForeign(myRez,
targetEntity = "token", targetLayer = "",
sourceEntity = "unit", sourceLayer = "",
targetForeignKeyName = "unit",
targetFieldName = "unitWord", sourceFieldName = "word",
fieldaccess = "foreign")
head(myRez$tokenDF %>% select(id, word, unitWord))
#> # A tibble: 6 × 3
#> id word unitWord
#> <chr> <chr> <chr>
#> 1 2ECADE1029CD3 བཀྲ་- བཀྲ་-
#> 2 15A5089F6157A བཀྲ་ཤིས་ བཀྲ་ཤིས་ བདེ་ལེགས །
#> 3 197AA4A0C625F བདེ་ལེགས བཀྲ་ཤིས་ བདེ་ལེགས །
#> 4 2C7746BD6F150 ། བཀྲ་ཤིས་ བདེ་ལེགས །
#> 5 354CBFB3632B6 <0> <0> ཕེབས་ གནང་ །
#> 6 2D1DAD2FFF22A ཕེབས་ <0> ཕེབས་ གནང་ །
Now let’s wrap it up with a complex foreign field. Here, we’re
going to add a field in the unitDF that tells us the length
of the shortest word within the unit. We’re going to base this off the
entryDF.
However, because the entries that correspond to units are given in
the nodeMap, you also need to supply the list of entries
inside the unit nodeMap - here it’s called
entryList. Chunks and tree entries are built on tokens
instead, so they have a list called tokenList. Instead of
‘expression’, complex foreign fields have a field called
complexAction, which is a function performed on the source
field of the source table:
myRez = addFieldForeign(myRez,
targetEntity = "unit", targetLayer = "",
sourceEntity = "entry", sourceLayer = "",
targetForeignKeyName = "entryList",
targetFieldName = "shortestWordLength",
sourceFieldName = "word",
type = "complex",
complexAction = shortestLength,
fieldaccess = "foreign")
head(myRez$unitDF %>% select(id, word, shortestWordLength))
#> # A tibble: 6 × 3
#> id word shortestWordLength
#> <chr> <chr> <int>
#> 1 22C69D930B150 བཀྲ་- 5
#> 2 2456BC9E7C5D5 བཀྲ་ཤིས་ བདེ་ལེགས ། 1
#> 3 201A180965346 <0> ཕེབས་ གནང་ ། 1
#> 4 2BD385D3E52C7 <0> རྒན་ ངག་དབང་ ལགས་ ཡིན་ ནམ ། 1
#> 5 2B558CAD0F0B4 ལགས་ <0> ཡིན ། 1
#> 6 1213A3F8E28E0 ང་ 2
(Because of the technicality that punctuation counts as a character,
most of these values are 1. There are ways we can fix this using
isWord conditions, as we’ll discuss below.)
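Conceptually, the complex foreign field above does something like the following base R sketch: for each unit, look up its entries through a list of entry IDs, then apply one summarising function to the source field. The data are made up, and shortest_length() is a hypothetical stand-in that I assume behaves like shortestLength():

```r
# Toy source table (entries) and target table (units); made-up data.
entries <- data.frame(
  id   = c("e1", "e2", "e3", "e4"),
  word = c("bkra shis", "bde legs", "/", "nga"),
  stringsAsFactors = FALSE
)
units <- data.frame(id = c("u1", "u2"), stringsAsFactors = FALSE)

# Which entries belong to which unit (the role played by entryList).
entry_list <- list(u1 = c("e1", "e2", "e3"), u2 = "e4")

# Hypothetical stand-in for the complexAction.
shortest_length <- function(words) min(nchar(words))

units$shortestWordLength <- vapply(units$id, function(unit_id) {
  words <- entries$word[match(entry_list[[unit_id]], entries$id)]
  shortest_length(words)
}, integer(1), USE.NAMES = FALSE)
print(units)
```

In this toy data the punctuation mark "/" is what brings the first unit down to 1, which mirrors the technicality mentioned above.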
So far we’ve only looked at addField(), but the good
news is that changeField() works in exactly the same way!
Here’s an example, changing our orthoLength field to depend
on the Romanisation instead of the original Tibetan script:
myRez = changeField(myRez, entity = "token", layer = "",
fieldName = "orthoLength",
expression = nchar(wordWylie),
fieldaccess = "auto")
print("A fragment of the updated table:")
#> [1] "A fragment of the updated table:"
head(myRez$tokenDF %>% rez_select(id, word, orthoLength))
#> # A tibble: 6 × 3
#> id word orthoLength
#> <chr> <chr> <int>
#> 1 2ECADE1029CD3 བཀྲ་- 8
#> 2 15A5089F6157A བཀྲ་ཤིས་ 10
#> 3 197AA4A0C625F བདེ་ལེགས 8
#> 4 2C7746BD6F150 ། 1
#> 5 354CBFB3632B6 <0> 3
#> 6 2D1DAD2FFF22A ཕེབས་ 6
Note that if you don’t specify the field access value, I will
automatically change it to flex, even in
changeField(). This is to force you to remember that
you are not only changing the value of the field itself, but also how it
will be updated in the future. If you don’t supply a field access value
and it’s originally an auto or foreign field, I will warn you about this
change, so you can run changeField() again if you want to
change your mind.
TidyRez
In general, TidyRez functions are called by adding
‘rez_’ in front of a dplyr function name, such as
rez_group_by or rez_mutate. Using
TidyRez functions allows you to keep and/or update your
field access values, inNodeMap values, and
updateFunctions. Using base R or classic dplyr functions
with rezrDFs may result in reload failures, unless
supplemented by core engine functions, which are not covered in this
vignette.
A few dplyr functions are completely safe to use in
rezonateR, mostly those that focus on selecting rows of a
table, such as filter(), arrange() or
slice(). Currently implemented TidyRez
functions include:
- rez_add_row() for adding new entries
- rez_mutate() for adding and editing columns
- rez_rename() for renaming columns
- rez_bind_rows() for combining rezrDFs vertically
- rez_group_split() for splitting rezrDFs vertically
- rez_group_by() and rez_ungroup() for grouping
- rez_select() for selecting certain columns inside a rezrDF
- rez_left_join() for left joins
A few other planned ones include rez_bind_cols() and
rez_outer_join(), which will be especially useful for the
calculation of inter-annotator agreement.
Because TidyRez is relatively straightforward for
Tidyverse users, this vignette will focus on what TidyRez
adds on top of Tidyverse. If you want to learn about basic Tidyverse,
there are many existing tutorials on the Internet.
To see the power of TidyRez, let’s try creating an
emancipated rezrDF with only a subset of the original
columns. Here, we take trackDF$refexpr, the table of
referential expressions. We then damage one of the fields using a
classic dplyr function. As you can see here, the
emancipated rezrDF can still be updated using the
rezrObj, effectively overriding the damage:
refTable = myRez$trackDF$refexpr %>% rez_select(id, token, chain, name, word, tokenOrderLast)
print("Before:")
#> [1] "Before:"
head(refTable %>% select(id, tokenOrderLast))
#> # A tibble: 6 × 2
#> id tokenOrderLast
#> <chr> <dbl>
#> 1 3D148B2FEEA8 1
#> 2 1CA978DCDE1DB 1
#> 3 14C5BE6658C39 4
#> 4 1FE8CC91923D1 2
#> 5 33A483E30E811 1
#> 6 280CA1A8A425 1
refTable = refTable %>% mutate(tokenOrderLast = 1) #Damage the tokenOrderLast field of refTable with a classic dplyr function
print("After:")
#> [1] "After:"
refTable = refTable %>% reload(myRez)
head(refTable %>% select(id, tokenOrderLast))
#> # A tibble: 6 × 2
#> id tokenOrderLast
#> <chr> <dbl>
#> 1 3D148B2FEEA8 1
#> 2 1CA978DCDE1DB 1
#> 3 14C5BE6658C39 4
#> 4 1FE8CC91923D1 2
#> 5 33A483E30E811 1
#> 6 280CA1A8A425 1
A warning is in order: TidyRez only updates the
current table. If other tables have references to the table you’re
editing, they will not be updated. You must bear this in mind when using
rez_select() and rez_rename(). No problems
will arise if you use these functions on emancipated
rezrDFs. However, if you use these functions on
rezrDFs within rezrObjs, you should manually
update any fields in other rezrDFs that refer to the field
you’ve deleted or renamed. I plan to add a rename feature to
EasyEdit in the near future that will update references
from other rezrDFs.
The syntax of most TidyRez functions deviates from dplyr
only minimally, in ways that you can read about in the documentation.
However, rez_left_join() is worth a quick mention. In
addition to a fieldaccess argument and a rezrObj
argument, which are self-explanatory, there is an fkey argument
and a df2Address argument. fkey is the name of
the field in the first data.frame that corresponds to IDs of the second
data.frame. df2Address is a string that tells
rez_left_join() how to find the source rezrDF
next time. If the source rezrDF doesn’t belong to a layer,
e.g. tokenDF, just type that. If the source
rezrDF belongs to a layer, put a ‘/’ between the table and
the layer, e.g. ‘trackDF/refexpr’.
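To see how such an address string can point at a table, here is a base R sketch that resolves ‘tokenDF’ and ‘trackDF/refexpr’ against a nested list. The resolve_address() helper and the toy object are hypothetical; I am only assuming that the part before the ‘/’ names the table and the part after it names the layer, as described above:

```r
# Hypothetical helper: find a table inside a nested list from its address.
resolve_address <- function(rezr_obj, address) {
  parts <- strsplit(address, "/", fixed = TRUE)[[1]]
  table <- rezr_obj[[parts[1]]]       # e.g. "trackDF"
  if (length(parts) > 1) {
    table <- table[[parts[2]]]        # e.g. the "refexpr" layer
  }
  table
}

# Toy stand-in for a rezrObj: one unlayered table, one layered table.
toy_obj <- list(
  tokenDF = data.frame(id = c("t1", "t2"), stringsAsFactors = FALSE),
  trackDF = list(
    refexpr   = data.frame(id = "r1", stringsAsFactors = FALSE),
    clausearg = data.frame(id = "c1", stringsAsFactors = FALSE)
  )
)

print(resolve_address(toy_obj, "tokenDF"))
print(resolve_address(toy_obj, "trackDF/refexpr"))
```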
An interlude: Time and sequence
Before we continue our adventure, let’s look at a couple of ways we
can upgrade our rezrObj to contain even more
information.
The first thing we can do, which we hinted at before, is to set
certain tokens as non-words. You can do this with the
addIsWordField function. One immediate benefit of this is
that we get new sequence values. The tokenOrder and
docTokenSeq fields in the original rezrDF
count all tokens, whereas wordOrder and
docWordSeq only count tokens that qualify as words
according to some criterion. Let’s set our criterion to
!str_detect(wordWylie, "/"), i.e. the token must not
contain the main punctuation mark in Tibetan. Notice that
wordOrder is generally slightly lower than
tokenOrder, and that tokens that do not count as words get the value 0:
myRez = addIsWordField(myRez, !str_detect(wordWylie, "/"))
head(myRez$tokenDF %>% select(id, tokenOrder, docTokenSeq, wordOrder, docWordSeq))
#> # A tibble: 6 × 5
#> id tokenOrder docTokenSeq wordOrder docWordSeq
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 2ECADE1029CD3 1 1 1 1
#> 2 15A5089F6157A 1 2 1 2
#> 3 197AA4A0C625F 2 3 2 3
#> 4 2C7746BD6F150 3 4 0 0
#> 5 354CBFB3632B6 1 5 1 4
#> 6 2D1DAD2FFF22A 2 6 2 5
By default, unitSeq information is not available to
rezrDFs other than unitDF. You can change this
using the addUnitSeq() feature, which can add
unitSeq information up to track chains:
myRez = addUnitSeq(myRez, "track")
This adds a unitSeqFirst and unitSeqLast
field to chunk and track chain entries, and a unitSeq
field to tokens.
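Going back to the word-based sequence values from the first half of this interlude, the counting logic can be reproduced in base R on a toy table. This is a sketch of the idea rather than the addIsWordField() implementation, and the Wylie strings are made up; the unit structure is chosen to mirror the six tokens shown in the output above:

```r
# Toy token table; the Wylie strings are made up for illustration.
tokens <- data.frame(
  unit      = c("u1", "u2", "u2", "u2", "u3", "u3"),
  wordWylie = c("bkra", "bkra shis", "bde legs", "/", "<0>", "phebs"),
  stringsAsFactors = FALSE
)

# Same criterion as in the text: a token is a word if it contains no "/".
tokens$isWord <- !grepl("/", tokens$wordWylie, fixed = TRUE)

# Document-level counter over words only; non-words get 0.
tokens$docWordSeq <- ifelse(tokens$isWord, cumsum(tokens$isWord), 0)

# Unit-level counter over words only, restarted within each unit.
tokens$wordOrder <- ave(as.integer(tokens$isWord), tokens$unit,
                        FUN = cumsum) * tokens$isWord
print(tokens)
```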
Updating rezonateR using external information
Some annotation actions are easier with a spreadsheet than in
Rezonator, so one action you will frequently perform is to do
annotations in a spreadsheet programme and then integrate that
information back into a rezrObj. Fortunately,
rezonateR contains functionality that can facilitate this
and minimise errors generated in the process.
Let’s say we want to annotate the person of the referential
expressions inside trackDF$refexpr. Before we start
annotating manually, I wrote some simple rules that guess the person
correctly in most situations, so we will only have to correct from
this baseline:
myRez$trackDF$refexpr = myRez$trackDF$refexpr %>% rez_mutate(person = case_when(word == "ང" | str_starts(word, "ང་") ~ 1,
str_starts(word, "(ཁྱེད|ཁྱོད|ཇོ་ལགས|ཨ་ཅག་ལགས|རྒན་ལགས)") ~ 2,
str_ends(word, "(ལགས|<0>)") ~ 0, #Multiple likely scenarios
T ~ 3))
Before we export this as a CSV for annotation, I would like to add a
column inside the rezrDF that gives us the words of the
entire unit. (Since this document currently does not have multi-unit
track entries, it will suffice to use unitSeqLast or
unitSeqFirst.) It will be useful to be able to see this column
while making manual annotations:
myRez$trackDF$refexpr = myRez$trackDF$refexpr %>%
rez_left_join(myRez$unitDF %>% rez_select(unitSeq, word), by = c(unitSeqLast = "unitSeq"), suffix = c("", "_unit"), df2key = "unitSeq", df2Address = "unitDF", fkey = "unitSeqLast") %>%
rez_rename(unitLastWord = word_unit)
#> Tip: When performed on a rezrDF inside a rezrObj, rez_rename is a potentially destructive action. It is NOT recommended to assign it back to a rezrDF inside a rezrObj. If you must do so, be careful to update all addresses from other DFs to this DF.
The next step is to write the CSV file. rez_write_csv()
allows us to do this easily. The third argument of
rez_write_csv() is a vector of field names that we want to
export. It is advisable to keep the number of exported fields small to
make the spreadsheet more manageable and require less scrolling:
rez_write_csv(myRez$trackDF$refexpr, "refexpr.csv", c("id", "name", "unitLastWord", "unitSeqLast", "word", "docTokenSeqLast", "entityType", "roleType", "person"))
After editing the CSV in a spreadsheet programme, let’s import it back
using rez_read_csv(). (I’ve renamed the edited CSV - in
general, I recommend doing this to avoid accidentally overwriting your
edited file by running the export code again.) The origDF
argument tells rezonateR to look in the original
rezrDF that produced the CSV, and determine the data types
accordingly:
changeDF = rez_read_csv("refexpr_edited.csv", origDF = myRez$trackDF$refexpr)
Finally, the updateFromDF() function allows us to update
the original rezrDF using information from the new
rezrDF. There are many fancy options you can choose from,
such as deciding whether to delete rows, add rows, add columns, etc. We
will only use the most vanilla options, and update the ‘person’
column:
myRez$trackDF$refexpr = myRez$trackDF$refexpr %>% updateFromDF(changeDF, changeCols = 'person')
#>
#> "id"
#> NULL
head(myRez$trackDF$refexpr %>% select(id, word, person))
#> # A tibble: 6 × 3
#> id word person
#> <chr> <chr> <dbl>
#> 1 3D148B2FEEA8 <0> NA
#> 2 1CA978DCDE1DB <0> NA
#> 3 14C5BE6658C39 རྒན་ ངག་དབང་ ལགས་ NA
#> 4 1FE8CC91923D1 <0> NA
#> 5 33A483E30E811 ང་ NA
#> 6 280CA1A8A425 འདི་ NA
Analysing track chains with EasyTrack
Now that we’ve looked at an example of semi-automatic annotation, let’s move on to some full automation! We will be looking in particular at coreference chains. rezonateR contains a suite of functions for generating features useful for analysing the choice of referential forms, reference comprehension, and similar topics.
Anaphoric and cataphoric distance
Let’s first find out how many units we are from the previous mention
of something. This is equivalent to the gapUnits column that
Rezonator generates automatically:
myRez$trackDF$refexpr = myRez$trackDF$refexpr %>%
rez_mutate(unitsToLastMention = unitsToLastMention(unitSeqLast))
myRez$trackDF$refexpr %>% select(id, gapUnits, unitsToLastMention) %>% slice(10:16)
#> # A tibble: 7 × 3
#> id gapUnits unitsToLastMention
#> <chr> <chr> <dbl>
#> 1 134D42FB97545 0 1
#> 2 183DC7932D931 7 7
#> 3 1373D1F88358 N/A NA
#> 4 2DBE5E8F59A6D 1 1
#> 5 F20CE11F519F 1 1
#> 6 1366768617 2 2
#> 7 287C6AAAA209D N/A NA
Now let’s count the tokens from the last mention using the
tokensToLastMention() function. This one has a couple of
complications. The first is which sequence value to count with. In the interlude, we
mentioned that in addition to docTokenSeq, we have a
sequence value called docWordSeq that excludes nonwords. We
will use that value in counting. The second complication is how to
treat zero mentions. Zeros are not overtly expressed, so they
have no time position to speak of. Here the ‘zeroProtocol’ argument is
‘unitFinal’, telling rezonateR to count from the
last word of whatever unit the zero comes from. Finally, since we’re
dealing with units, we need to pass the unitDF to ensure
that tokensToLastMention() has access to unit
information:
myRez$trackDF$refexpr = myRez$trackDF$refexpr %>%
rez_mutate(wordsToLastMention = tokensToLastMention(
docWordSeqLast, #What seq to use
zeroProtocol = "unitFinal", #How to treat zeroes
zeroCond = (word == "<0>"),
unitDF = myRez$unitDF)) #Additional argument for unitFinal protocol
myRez$trackDF$refexpr %>% select(id, wordsToLastMention) %>% slice(10:16)
#> # A tibble: 7 × 2
#> id wordsToLastMention
#> <chr> <dbl>
#> 1 134D42FB97545 0
#> 2 183DC7932D931 22
#> 3 1373D1F88358 NA
#> 4 2DBE5E8F59A6D 0
#> 5 F20CE11F519F 0
#> 6 1366768617 0
#> 7 287C6AAAA209D NA
Note that unitsToNextMention() and
tokensToNextMention() work in the same way.
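To make the idea of anaphoric distance concrete, here is a minimal sketch in plain base R on a made-up table. It does not call rezonateR, and it is not how unitsToLastMention() is implemented; the chain column and the toy values are assumptions for illustration only, and rezonateR's own conventions (for example for mentions within the same unit) may differ:

```r
# Toy track table: one row per mention, already sorted in document order.
# 'chain' and the values below are illustrative stand-ins, not Rezonator output.
mentions <- data.frame(
  id          = c("m1", "m2", "m3", "m4", "m5"),
  chain       = c("A",  "B",  "A",  "A",  "B"),
  unitSeqLast = c(1,    2,    4,    4,    9)
)

# Within one chain: current unit minus the unit of the preceding mention.
# The first mention of a chain has no antecedent, hence NA.
unitsBack <- function(units) c(NA, diff(units))

# ave() applies unitsBack() separately to each chain and returns the
# results in the original row order.
mentions$unitsToLastMention <- ave(mentions$unitSeqLast, mentions$chain,
                                   FUN = unitsBack)

print(mentions)
```

The cataphoric counterpart would simply look at the following mention in each chain instead of the preceding one.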
Tallying preceding and following mentions
We can also count how many previous mentions of something there were
within a window of units. Most people use a window of 5 or 20 units. Let’s try this
with 5. The countPrevMentions() function allows us to do this
(countNextMentions() does the same for the succeeding
context):
myRez$trackDF$refexpr = myRez$trackDF$refexpr %>% rez_mutate(noPrevMentionsIn5 = countPrevMentions(5))
myRez$trackDF$refexpr %>% select(id, noPrevMentionsIn5) %>% slice(10:16)
#> # A tibble: 7 × 2
#> id noPrevMentionsIn5
#> <chr> <int>
#> 1 134D42FB97545 2
#> 2 183DC7932D931 0
#> 3 1373D1F88358 0
#> 4 2DBE5E8F59A6D 1
#> 5 F20CE11F519F 1
#> 6 1366768617 2
#> 7 287C6AAAA209D 0
Sometimes, we may want to count mentions conditionally,
e.g. only count subject mentions or zero mentions. The functions
countPrevMentionsIf() and countNextMentionsIf()
allow us to define such a condition. Let’s try counting the number of
coming zero mentions. Here, we use the condition
word == "<0>", i.e. the word is a zero, and the
window is Inf, i.e. there’s no limit on how far in the
future we look:
myRez$trackDF$refexpr = myRez$trackDF$refexpr %>% rez_mutate(noComingZeroes = countNextMentionsIf(Inf, word == "<0>"))
myRez$trackDF$refexpr %>% select(id, noComingZeroes) %>% slice(10:16)
#> # A tibble: 7 × 2
#> id noComingZeroes
#> <chr> <int>
#> 1 134D42FB97545 1
#> 2 183DC7932D931 9
#> 3 1373D1F88358 0
#> 4 2DBE5E8F59A6D 9
#> 5 F20CE11F519F 0
#> 6 1366768617 8
#> 7 287C6AAAA209D 1
Counting competitors
We may also want to count competing mentions, that is, recent
mentions not coreferential to the current mention.
countCompetitors() tallies the number of competitors
intervening between the previous and current mention, possibly within a
window. Here is one example with no window:
myRez$trackDF$refexpr = myRez$trackDF$refexpr %>% rez_mutate(noCompetitors = countCompetitors())
myRez$trackDF$refexpr %>% select(id, noComingZeroes) %>% slice(10:16)
#> # A tibble: 7 × 2
#> id noComingZeroes
#> <chr> <int>
#> 1 134D42FB97545 1
#> 2 183DC7932D931 9
#> 3 1373D1F88358 0
#> 4 2DBE5E8F59A6D 9
#> 5 F20CE11F519F 0
#> 6 1366768617 8
#> 7 287C6AAAA209D 1
All of the functions introduced in this section have additional arguments that allow for further customisation. Please feel free to refer to the manual for more information.
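If it helps to see what these window-based counts are doing, here is a rough base R sketch on a toy table. The helper countInWindow() is hypothetical and is not part of rezonateR. It is also a simplification: it counts every earlier mention in the window, whereas countCompetitors() as described above counts competitors intervening between the previous and current mention.

```r
# Toy track table in document order; all names and values are illustrative.
mentions <- data.frame(
  id          = c("m1", "m2", "m3", "m4", "m5", "m6"),
  chain       = c("A",  "B",  "A",  "C",  "A",  "B"),
  unitSeqLast = c(1,    2,    3,    5,    9,    10)
)

# Hypothetical helper: for each mention, count the earlier rows whose unit
# lies at most 'window' units back.
# sameChain = TRUE  counts coreferential mentions (previous mentions);
# sameChain = FALSE counts mentions of other chains (a crude competitor count).
countInWindow <- function(df, window, sameChain = TRUE) {
  vapply(seq_len(nrow(df)), function(i) {
    earlier    <- seq_len(i - 1)
    inWindow   <- df$unitSeqLast[i] - df$unitSeqLast[earlier] <= window
    chainMatch <- (df$chain[earlier] == df$chain[i]) == sameChain
    sum(inWindow & chainMatch)
  }, integer(1))
}

prevIn5   <- countInWindow(mentions, 5, sameChain = TRUE)
othersIn5 <- countInWindow(mentions, 5, sameChain = FALSE)
print(cbind(mentions, prevIn5, othersIn5))
```

Passing Inf as the window removes the distance limit, in the same spirit as the Inf window used with countNextMentionsIf() above.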
Adding tree information
Now let’s add some information from trees. The first thing to do is
to run the getAllTreeCorrespondences() function, which adds
a treeEntry column to non-tree tables. If you select
entity = "track", this column will be added to
tokenDF, chunkDF and trackDF.
myRez = getAllTreeCorrespondences(myRez, entity = "track")
myRez$trackDF$refexpr %>% select(id, treeEntry) %>% slice(10:16)
#> # A tibble: 7 × 2
#> id treeEntry
#> <chr> <chr>
#> 1 134D42FB97545 23E6B9A16C426
#> 2 183DC7932D931 253E6EE11F308
#> 3 1373D1F88358 1E1D73A61D4B0
#> 4 2DBE5E8F59A6D 5E28BE8ED183
#> 5 F20CE11F519F 363329E4BBB94
#> 6 1366768617 281F8FC643F88
#> 7 287C6AAAA209D 364D86848B187
The best thing that trees do for us is connecting verb information,
stored in chunks, to track chain entry (i.e. referential expression)
information. We can do this in two steps. First we add a
treeParent column to trackDF$refexpr that
takes the value of the ‘parent’ column of treeEntryDF; in
simple terms, this means we’re getting the parent tree entry’s ID into
trackDF$refexpr. We then use this parent tree entry’s ID to
find the corresponding verb chunk, and with this, we have successfully
put the verb on the trackDF$refexpr table.
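As a rough picture of this two-step lookup, here is a toy version in plain base R. The three small tables and all their IDs are invented for illustration, and match() stands in for the joins; it is only a sketch of the logic, not a replacement for the rezonateR functions:

```r
# Toy stand-ins for the three tables involved; all IDs are made up.
treeEntries <- data.frame(
  id     = c("t1", "t2", "t3"),
  parent = c("t3", "t3", NA)       # t1 and t2 are children of t3
)
verbs <- data.frame(
  id        = "v1",
  word      = "go",
  treeEntry = "t3"                 # the verb chunk corresponds to tree entry t3
)
refexpr <- data.frame(
  id        = c("r1", "r2"),
  treeEntry = c("t1", "t2")
)

# Step 1: find each mention's tree entry and copy over its parent's ID.
refexpr$treeParent <- treeEntries$parent[match(refexpr$treeEntry, treeEntries$id)]

# Step 2: find the verb chunk whose tree entry is that parent.
verbRow <- match(refexpr$treeParent, verbs$treeEntry)
refexpr$verbID   <- verbs$id[verbRow]
refexpr$verbWord <- verbs$word[verbRow]

print(refexpr)
```

A mention whose parent has no matching verb chunk would simply get NA in the verb columns.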
myRez = myRez %>% addFieldForeign("track", "refexpr", "treeEntry", "default", "treeEntry", "treeParent", "parent", fieldaccess = "foreign")
myRez$trackDF$refexpr = myRez$trackDF$refexpr %>%
  rez_left_join(myRez$chunkDF$verb %>% select(id, word, treeEntry),
    by = c(treeParent = "treeEntry"), suffix = c("", "_verb"),
    df2Address = "chunkDF/verb", fkey = "treeParent", df2key = "treeEntry",
    rezrObj = myRez) %>%
  rename(verbID = id_verb, verbWord = word_verb)
myRez$trackDF$refexpr %>% select(id, treeParent, verbID, verbWord) %>% slice(10:16)
#> # A tibble: 7 × 4
#> id treeParent verbID verbWord
#> <chr> <chr> <chr> <chr>
#> 1 134D42FB97545 16CA7768343C0 2CFB905E634DE རེད
#> 2 183DC7932D931 2EA73E6E3BED4 30E122DAF5010 ཕེབས་པ་ ཡིན་ ནམ
#> 3 1373D1F88358 2EA73E6E3BED4 30E122DAF5010 ཕེབས་པ་ ཡིན་ ནམ
#> 4 2DBE5E8F59A6D 359A5ED8BE360 2CCFFAD41AD38 ཡོང་པ་ ཡིན
#> 5 F20CE11F519F 359A5ED8BE360 2CCFFAD41AD38 ཡོང་པ་ ཡིན
#> 6 1366768617 A4DB76F5C0EB A4DC56F0B165 ཡོང་བ་ ཡིན
#> 7 287C6AAAA209D A4DB76F5C0EB A4DC56F0B165 ཡོང་བ་ ཡིན
Advanced: Chunk mergers
The last topic to cover is merging chunks, which is most useful for creating multi-line chunks. There are several steps to merging chunks:
- Create constituent chunks that together span the entire merged chunk.
- Create a tree leaf that contains all tokens in the merged chunk, and put the leaf in a tree.
- Use the mergeChunksWithTree() command in rezonateR to merge them.
mergeChunksWithTree() is very easy to use. After you
call this command, the merged chunks will be added to the bottom of the
corresponding chunk rezrDF. Chunk tags are taken from the
first constituent chunk of each merger by default; see the manual for
setting custom conditions. There will in addition be a column called
combinedChunk that tells you whether a chunk is a combined
chunk, a member of a combined chunk, or neither.
myRez = mergeChunksWithTree(myRez)
myRez$chunkDF$refexpr %>% filter(combinedChunk != "") %>% select(id, name, word, combinedChunk) #Showing only combined chunks and their members
#> # A tibble: 6 × 4
#> id name word combi…¹
#> <chr> <chr> <chr> <chr>
#> 1 51A902A7C6CD Chunk 49 ཁོང་ ཚོ་ ནས་ |infom…
#> 2 2E2B9BB6462BC Chunk 51 དཔེ་སྐྲུན་ ཞུས་པ་ དེ་ |membe…
#> 3 17842C4087863 Chunk 235 ཕྱི་རྒྱལ་ ནུབ་ཕྱོགས་པ འི་ མི་རིགས་ འདི་འདྲས་ |infom…
#> 4 EF4325E7D5A9 Chunk 81 བོད་སྐད་ ནང་ བྲིས་པ འི་ རྒྱལ་རབས་ ཀྱི་ དེབ་ |membe…
#> 5 6ru0c9hVoSLwZ New Chunk 1 ཁོང་ ཚོ་ ནས་ དཔེ་སྐྲུན་ ཞུས་པ་ དེ་ combin…
#> 6 HpsUkPzuAeFcm New Chunk 2 ཕྱི་རྒྱལ་ ནུབ་ཕྱོགས་པ འི་ མི་རིགས་ འདི་འདྲས་ བོད་སྐད་ ནང་… combin…
#> # … with abbreviated variable name ¹combinedChunk
You may also augment the trackDF with the merged chunks;
the combinedChunk column works similarly:
myRez = mergedChunksToTrack(myRez, "refexpr")
Where to go from here?
Now that you’ve seen the bare-bones basics of using
rezonateR, if you want to dive in and start using it, you
can proceed to our sequence of detailed tutorials starting from
vignette("import_save_basics"). If you want to see a
concrete example of a mini-project, take a look at
vignette("sample_proj").