An overview of rezonateR for Rezonator users
Are you already familiar with Rezonator and want to do quantitative
analyses with it, but don’t yet know how to use rezonateR? If so, then
this overview is for you. This overview will not go through all the
details of rezonateR, but it will provide a snapshot of the most
important functions in the package that will help you in your
annotation and analysis. If you aren’t very familiar with the companion
tool Rezonator and want to know what can be done with it, the toy
example vignette("sample_proj") gives a concrete example of a
mini-research project using Rezonator and rezonateR, and will give you
a feel for what sorts of projects are possible with Rezonator +
rezonateR. It would also be helpful to check out the official Rezonator
guides and try some simple annotations in Rezonator to get a feel for
how it works. If you already know how rezonateR works, you can start
from vignette("import_save_basics") to learn the nitty-gritty of coding
in rezonateR.
Preliminaries
Why use rezonateR?
If you’re reading this, you’re probably already using Rezonator to do
your daily work. Rezonator is a powerful tool for annotating and
visualising the dynamics of human engagement. The purpose of rezonateR
is to add a series of additional tools that enhance the functionality
of Rezonator and increase your productivity!
My goal is to minimise the time you spend on coding and annotating so
you have more time for thinking.
Rezonator has many cool features, but due to technical restrictions, there are certain things it can’t do, such as:
- Dividing annotations into layers
- Adding chunks and track chain entries that span multiple units
- Linking tree entries to the corresponding chunks
- Guessing the value of a field by looking at the values of other fields
- Automatically updating the values of certain fields using information from other fields
rezonateR can do all these, and more!
Moreover, rezonateR is geared toward people of all skill levels in R. I
do a lot of the heavy lifting for you. As long as you have some
familiarity with base R, you can quickly pick up the basic functions,
but more advanced users can also extend it as they see fit.
The rezonateR engine is based heavily on Tidyverse packages,
particularly rlang and dplyr. Although you don’t need to be familiar
with Tidyverse to use the basic functionality, Tidyverse users will be
happy to see a wide range of functions that mimic Tidyverse functions
in appearance, but with additional arguments to support the wide range
of functionality in rezonateR.
If you’re wondering when you should start learning rezonateR, the best
time is now! Although you can get started with the Rezonator GUI
relatively easily, if there are certain things you know you will use in
rezonateR, such as multi-line chunks, you probably want to have the
rezonateR post-processing in mind, even when you’re annotating in
Rezonator.
install.packages("devtools")
library(devtools)
install_github("rezonators/rezonateR")
Some folks have reported that this does not work. If this is the case
for you, you may want to try running
options(download.file.method = "auto") before running install_github().
A quick import
Let’s start by importing our first file.
library(rezonateR)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#> Loading required package: readr
#> Loading required package: stringr
#> Loading required package: rlang
The file we will import is a short spoken text in Lhasa Tibetan (you can find the original video here: https://av.mandala.library.virginia.edu/video/couple-must-part-threes-company-02). This file contains a number of chunks, track chains, and trees, and we will deal with them in this vignette:
path = system.file("extdata", "virginia-library-20766.rez", package = "rezonateR")
layerRegex = list(
  track = list(field = "trailLayer",
               regex = c("clausearg", "discdeix"),
               names = c("clausearg", "discdeix", "refexpr")),
  chunk = list(field = "chunkLayer",
               regex = c("verb", "adv", "predadj"),
               names = c("verb", "adv", "predadj", "refexpr")))
myRez = importRez(path, layerRegex = layerRegex, concatFields = c("word", "wordWylie"))
#> Import starting - please be patient ...
#> Creating node maps ...
#> Creating rezrDFs ...
#> Adding foreign fields to rezrDFs and sorting (this is the slowest step) ...
#> >Adding to unit entry DF ...
#> >Adding to unit DF ...
#> >Adding to chunk DF ...
#> >Adding to track DFs ...
#> >Adding to tree DFs ...
#> Splitting rezrDFs into layers ...
#> A few finishing touches ...
#> Done!
The layerRegex object is a series of instructions that tell importRez
how to divide chunks and track chains into different layers. In this
case, I placed a field called ‘trailLayer’ on track chains, which has
three possible values: clausearg, discdeix, and nothing. The first two
are captured in the regex field. The two regexes correspond to the
first two names, ‘clausearg’ and ‘discdeix’, and the default case,
where neither regex is matched, is ‘refexpr’. I have done the same
thing for chunks, as you can see above. If you don’t want to use
layers, you don’t have to specify layerRegex. In that case, I will
create a single layer called ‘default’ for you.
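To make the regex-to-layer logic concrete, here is a minimal base R sketch. The helper assignLayer() is hypothetical and not part of rezonateR, and the rule that the first matching regex wins is my assumption about the behaviour described above, not a statement about how importRez is implemented.

```r
# Hypothetical helper (not a rezonateR function): assign each value to a
# layer by trying the regexes in order and falling back to the last name.
assignLayer <- function(values, regex, names) {
  stopifnot(length(names) == length(regex) + 1)
  layer <- rep(names[length(names)], length(values))  # default layer
  # Loop in reverse so that earlier regexes take priority (an assumption)
  for (i in rev(seq_along(regex))) {
    layer[grepl(regex[i], values)] <- names[i]
  }
  layer
}

# Toy values of a 'trailLayer' field; the empty string stands for 'nothing'
trailLayer <- c("clausearg", "", "discdeix", "")
layers <- assignLayer(trailLayer,
                      regex = c("clausearg", "discdeix"),
                      names = c("clausearg", "discdeix", "refexpr"))
print(layers)
```

The two entries with an empty trailLayer fall through to the default layer ‘refexpr’, which is why the names vector is one element longer than the regex vector.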
The other mysterious argument in the importRez function is
concatFields. These are fields belonging to tokens that you would like
to concatenate for higher-level entities like chunks and units. For
example, if tokens 1 and 2 are ‘happy’ and ‘person’, and you have a
chunk that contains these two tokens, you would want the whole string
‘happy person’ to be associated with the chunk. Typically, you should
specify at least one field for doing this. In this case, we will
concatenate the fields ‘word’ and ‘wordWylie’ (Wylie is the most common
Romanisation system for Tibetan). It is important not to overdo it and
specify too many fields to concatenate, as this step can slow down your
import considerably.
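The ‘happy person’ example can be sketched in base R as follows. The toy data frame and the chunkTokenList object are made up for illustration; this only mimics the idea of concatenation and is not how rezonateR stores or computes these fields.

```r
# Toy token table (hypothetical data, not from the sample file)
tokens <- data.frame(
  id = c("t1", "t2", "t3"),
  word = c("happy", "person", "smiles"),
  stringsAsFactors = FALSE
)

# One chunk containing tokens t1 and t2
chunkTokenList <- list(c1 = c("t1", "t2"))

# Look up each chunk's tokens and paste their 'word' values together
chunkWord <- vapply(chunkTokenList, function(ids) {
  paste(tokens$word[match(ids, tokens$id)], collapse = " ")
}, character(1))
print(chunkWord)
```

Every field you concatenate requires a lookup like this for every higher-level entity, which gives an intuition for why long lists of fields make the import slower.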
The result of the import is a rezrObj object, which we will discuss
below. When you import a more substantial file, say one of around 30
minutes, the import can be rather slow. Please be patient! The good
news is that rezonateR contains functionality for saving and loading
rezrObj objects, so you don’t have to import each time you work on a
file in R.
Introduction to rezrObjs and nodeMaps
rezrObjs
There are three main kinds of objects in rezonateR that you will
interact with directly, namely rezrObj, nodeMap and rezrDF. rezrObjs
and nodeMaps will be covered in this section. rezrDFs are relatively
complex, and will form the bulk of our discussion in this vignette.
rezrObj objects contain one single nodeMap and several rezrDFs:
print("Item in myRez:")
#> [1] "Item in myRez:"
names(myRez)
#> [1] "nodeMap" "cardDF" "chunkDF" "docDF" "entryDF"
#> [6] "linkDF" "mergedDF" "stackDF" "tokenDF" "trackDF"
#> [11] "trailDF" "treeDF" "treeEntryDF" "treeLinkDF" "unitDF"
The relationships between the various entities are as follows:
- Chunks and tree entries are built on top of tokens.
- Entries correspond exactly to tokens, and units build on entries.
- Tracks refer to both chunks and tokens.
This has some ramifications for updating, which we’ll come to later.
nodeMaps
The nodeMap is similar to the internal representation of the file used
in Rezonator-generated .rez files. The node map in a .rez file is a
disorganised list of nodes, each of which corresponds to an entity
inside Rezonator: units, tokens, and so on. The rezonateR nodeMap is
similar. The major difference is that nodes are organised into
sub-categories, according to the type of entity that the node is
encoding. Let’s have a sneak peek at these categories:
print("Items in the nodeMap:")
#> [1] "Items in the nodeMap:"
names(myRez$nodeMap)
#> [1] "token" "entry" "unit" "track" "chunk" "card"
#> [7] "link" "trail" "stack" "corpus" "doc" "treeEntry"
#> [13] "treeLink" "tree"
In practice, you will not interact with most of these nodeMaps except
token. Most of the time, you will only be dealing with rezrDFs, which
are much easier to work with. Let’s take a look at them.
Introduction to rezrDFs
A rezrDF is like a normal data frame that you know from base R. Here’s
the beginning of the unit table:
print(head(myRez$unitDF %>% select(id, unitSeq, srtLineBo)))
#> # A tibble: 6 × 3
#> id unitSeq srtLineBo
#> <chr> <dbl> <chr>
#> 1 22C69D930B150 1 "བཀྲ་-"
#> 2 2456BC9E7C5D5 2 "བཀྲ་ཤིས་བདེ་ལེགས། "
#> 3 201A180965346 3 "ཕེབས་གནང་། "
#> 4 2BD385D3E52C7 4 "རྒན་ངག་དབང་ལགས་ཡིན་ནམ། "
#> 5 2B558CAD0F0B4 5 "ལགས་ཡིན། ང་"
#> 6 1213A3F8E28E0 6 "ལགས་ཡིན། ང་"
chunkDF, trackDF, rezDF, stackDF, etc., are divided into layers. If you
directly access the ‘chunkDF’ and ‘trackDF’ components of myRez, you
will get a list of rezrDFs, one for each layer. Here are the names of
our chunk layers, which you might remember from the import step, along
with the beginning of one of the associated rezrDFs:
print("Component DFs of chunkDF:")
names(myRez$chunkDF)
print(head(myRez$chunkDF$refexpr) %>% select(id, word))
rezrDF inherits from the Tidyverse ‘tibble’ structure, so all
tibble-related functions can be used with rezrDFs. However, using
classic Tidyverse functions on rezrDFs is often dangerous, as rezrDFs
have additional functionality that goes beyond classic tibbles.
There are three main differences that make rezrDFs special:
Perk 1: Field access labels
Field access labels prevent you from accidentally changing things that
you shouldn’t be changing. Let’s look at the field access values of the
unitDF:
print("fieldaccess:")
#> [1] "fieldaccess:"
fieldaccess(myRez$unitDF)
#> id doc unitStart unitEnd
#> "key" "core" "core" "core"
#> unitSeq pID srtLineStart srtLineBo
#> "core" "core" "flex" "flex"
#> iuEnd iuStart srtLineEn iu
#> "flex" "flex" "flex" "flex"
#> srtLineID iuText speaker srtLineEnd
#> "flex" "flex" "flex" "flex"
#> word wordWylie docTokenSeqFirst docTokenSeqLast
#> "foreign" "foreign" "foreign" "foreign"
There are five possible field access values:
- ‘key’: The primary key of the table. You are not allowed to change it (unless you turn it into a non-key field, but this is not encouraged since you will basically break everything). If you try to update these fields using rezonateR functions, I will stop you with an error.
- ‘core’: Core fields, mostly generated by Rezonator. You can change them, but I will give you a warning if you do, because changing a core field has strong potential to break things.
- ‘flex’: Flexible fields, usually fields whose values you enter into Rezonator, though there are also flex fields automatically generated by Rezonator.
- ‘auto’: Fields whose values are automatically generated using information from the SAME rezrDF.
- ‘foreign’: Fields whose values are automatically generated using information from a DIFFERENT rezrDF (or several different rezrDFs, but this is an advanced feature we will stay away from in this vignette).
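As a rough illustration of the key and core behaviour described in this list, here is a toy guard function in base R. checkAccess() is a hypothetical name invented for this sketch; rezonateR performs its own checks inside its editing functions, and this is not its actual code.

```r
# Hypothetical guard (not a rezonateR function): error on key fields,
# warn on core fields, stay silent for anything else.
checkAccess <- function(access, field) {
  level <- access[[field]]
  if (level == "key") {
    stop("Field '", field, "' is a key field and cannot be changed.")
  } else if (level == "core") {
    warning("Field '", field, "' is a core field; changing it may break things.")
  }
  invisible(level)
}

# A cut-down field access vector modelled on the unitDF output above
access <- c(id = "key", unitSeq = "core", speaker = "flex")

checkAccess(access, "speaker")  # flex field: passes silently
keyResult <- tryCatch(checkAccess(access, "id"),
                      error = function(e) "error")
coreResult <- tryCatch(checkAccess(access, "unitSeq"),
                       warning = function(w) "warning")
print(c(keyResult, coreResult))
```

The auto and foreign labels play a different role: rather than guarding against edits, they record where a field's values come from when the table is refreshed.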
Perk 2: Reloads
The reload() function is one of the core features of rezonateR that
makes it so convenient to use. The reload() feature is based on
updateFunctions. You can access the updateFunctions of a table using
updateFunct():
print("updateFunct of unitDF:")
#> [1] "updateFunct of unitDF:"
updateFunct(myRez$unitDF)
#> $word
#> function (df, rezrObj)
#> updateLowerToHigher(df, rezrObj, address, fkeyAddress, action,
#> field, fkeyInDF, seqName)
#> <bytecode: 0x000001da3b578920>
#> <environment: 0x000001da3b5686a8>
#> attr(,"class")
#> [1] "updateFunction" "function"
#> attr(,"deps")
#> [1] "entryDF/word"
#>
#> $wordWylie
#> function (df, rezrObj)
#> updateLowerToHigher(df, rezrObj, address, fkeyAddress, action,
#> field, fkeyInDF, seqName)
#> <bytecode: 0x000001da3b57f670>
#> <environment: 0x000001da3b578290>
#> attr(,"class")
#> [1] "updateFunction" "function"
#> attr(,"deps")
#> [1] "entryDF/wordWylie"
#>
#> $docTokenSeqFirst
#> function (df, rezrObj)
#> updateLowerToHigher(df, rezrObj, address, fkeyAddress, action,
#> field, fkeyInDF, seqName)
#> <bytecode: 0x000001da3b57e1e0>
#> <environment: 0x000001da3b57efe0>
#> attr(,"class")
#> [1] "updateFunction" "function"
#> attr(,"deps")
#> [1] "entryDF/docTokenSeq"
#>
#> $docTokenSeqLast
#> function (df, rezrObj)
#> updateLowerToHigher(df, rezrObj, address, fkeyAddress, action,
#> field, fkeyInDF, seqName)
#> <bytecode: 0x000001da3b572c00>
#> <environment: 0x000001da3b573a00>
#> attr(,"class")
#> [1] "updateFunction" "function"
#> attr(,"deps")
#> [1] "entryDF/docTokenSeq"
There are three reload functions, reloadLocal(), reloadForeign() and
reload(). reloadLocal() only takes a rezrDF, and only updates auto
fields. reloadForeign() and reload() take a rezrDF and a rezrObj, and
update the rezrDF using the rezrObj (which may or may not contain the
rezrDF).
Let’s take a look at reload(), which is the most useful of these
functions. In the original data, zero mentions are written as <0> only
in the orthographic representation; the Wylie romanisation is a blank
string. I want to change the Wylie romanisation to also contain <0>s. I
do that by using the rez_mutate() function on the tokenDF first (don’t
worry about what that means yet; we’ll cover it later). After that, I
reload the entryDF, and then reload the unitDF. (Recall that the units
depend on entries which in turn depend on tokens; that’s why we can’t
just reload the unitDF directly.) After the update, <0>s appear in the
Wylie romanisation of the unitDF:
print("Before the update")
#> [1] "Before the update"
myRez$unitDF %>% filter(str_detect(word, "<0>")) %>% rez_select(id, word, wordWylie) %>% head
#> # A tibble: 6 × 3
#> id word wordWylie
#> <chr> <chr> <chr>
#> 1 201A180965346 <0> ཕེབས་ གནང་ ། " phebs gnang …
#> 2 2BD385D3E52C7 <0> རྒན་ ངག་དབང་ ལགས་ ཡིན་ ནམ ། " rgan ngag dba…
#> 3 2B558CAD0F0B4 ལགས་ <0> ཡིན ། "lags yin /"
#> 4 1A5B143F0646A ལགས་ <0> <0> རེད ། "lags red /"
#> 5 2414F2965B74A ང་ འདི ར་ སྤྱིར་བཏང་ <0> སློབ་སྦྱོང་ བྱེད་ གར་ ཡོང་བ་ ཡིན ། "nga 'di ra sp…
#> 6 1FA517FFC43A <0> སློབ་ཕྲུག་ ཡིན་ X X " slob phrug yi…
#Change something in the token rezrDF that is significant for the unit rezrDF
myRez$tokenDF = myRez$tokenDF %>% rez_mutate(
  wordWylie = case_when(word == "<0>" ~ "<0>", TRUE ~ wordWylie))
#Reload the entry rezrDF first, then the unit rezrDF that depends on it
myRez$entryDF = myRez$entryDF %>% reload(myRez)
myRez$unitDF = myRez$unitDF %>% reload(myRez)
print("After the update")
#> [1] "After the update"
myRez$unitDF %>% filter(str_detect(word, "<0>")) %>% rez_select(id, word, wordWylie) %>% head
#> # A tibble: 6 × 3
#> id word wordWylie
#> <chr> <chr> <chr>
#> 1 201A180965346 <0> ཕེབས་ གནང་ ། <0> phebs gnang…
#> 2 2BD385D3E52C7 <0> རྒན་ ངག་དབང་ ལགས་ ཡིན་ ནམ ། <0> rgan ngag d…
#> 3 2B558CAD0F0B4 ལགས་ <0> ཡིན ། lags <0> yin /
#> 4 1A5B143F0646A ལགས་ <0> <0> རེད ། lags <0> <0> re…
#> 5 2414F2965B74A ང་ འདི ར་ སྤྱིར་བཏང་ <0> སློབ་སྦྱོང་ བྱེད་ གར་ ཡོང་བ་ ཡིན ། nga 'di ra spy…
#> 6 1FA517FFC43A <0> སློབ་ཕྲུག་ ཡིན་ X X <0> slob phrug …
You might be wondering how to reload an entire rezrObj. Because tables
often depend on each other (for example, field A in table X relies on
field B in table Y which in turn relies on field C in table X), this is
technically difficult, but I plan to add this function before the 1.0
release. Stay tuned!
Perk 3: Correspondences to nodeMaps
rezrDFs encode information about whether a field is in the nodeMap or
not:
print("inNodeMap:")
#> [1] "inNodeMap:"
inNodeMap(myRez$unitDF)
#> id doc unitStart unitEnd
#> "key" "primary" "primary" "primary"
#> unitSeq pID srtLineStart srtLineBo
#> "primary" "primary" "tagmap" "tagmap"
#> iuEnd iuStart srtLineEn iu
#> "tagmap" "tagmap" "tagmap" "tagmap"
#> srtLineID iuText speaker srtLineEnd
#> "tagmap" "tagmap" "tagmap" "tagmap"
#> word wordWylie docTokenSeqFirst docTokenSeqLast
#> "no" "no" "no" "no"
This doesn’t do so much yet, since you’re not yet allowed to push a
field created in a rezrDF back to a nodeMap. This will be available in
the 1.0 release.
Editing rezrDFs
One of the core features of rezonateR is to facilitate the automatic
and semi-automatic creation of fields, which is currently not supported
in Rezonator. There are also other operations you may want to perform
on rezrDFs.
To cater to users of different habits and skill levels, I have
introduced three different levels of functions for editing rezrDFs.
- EasyEdit can be quickly picked up by everyone, including base R users, and covers the most basic operations you would want to do to a rezrDF (e.g. addField(), changeFieldForeign()).
- TidyRez is easy to pick up for tidyverse users, though there is some learning curve for others (e.g. rez_mutate(), rez_left_join()).
- Core engine: These are mostly functions that I use within rezonateR under the hood. Users who want maximum flexibility may also use them (e.g. lowerToHigher(), createLeftJoinUpdate()), but do be aware that I may make changes to these without notice, since I will assume that most users have little use for them.
Crucially, while the syntax of functions within EasyEdit and within
TidyRez is very consistent, functions within the core engine are a lot
more divergent, and EasyEdit and TidyRez also differ considerably from
each other in their syntax. So if you are comfortable with TidyRez,
minimising the use of EasyEdit may make your code look more consistent,
and vice versa.
EasyEdit
EasyEdit consists of four commonly used functions, addFieldLocal(),
addFieldForeign(), changeFieldLocal() and changeFieldForeign(). There
are also some less useful functions not covered in this vignette, but
which you can find in the references, like addRow().
All four basic functions can be applied to both rezrDFs and rezrObjs.
In this vignette, we will mainly apply them to rezrObjs. If you want to
deal with emancipated rezrDFs, i.e. rezrDFs that are not part of a
rezrObj, you will want to use the versions that apply to rezrDFs, but
those are simpler than the rezrObj versions, so you should be able to
pick them up quickly using the manual.
Let’s start by looking at addFieldLocal(). addField() is a shortcut
for addFieldLocal(), and we will be using this shortcut name
throughout. Our first example is very simple. In our tokenDF, let’s add
a field that automatically calculates the length of a word in
characters. Here, ‘entity’ specifies the name of the entity you would
like to change, ‘layer’ specifies the layer within that entity (which
is an empty string since there are no token layers), fieldName is the
name of the field we’re adding, expression is the R expression with
which we calculate the new field, and fieldaccess tells rezonateR to
make this an auto field with an updateFunction that will be attached to
the table:
myRez = addField(myRez, entity = "token", layer = "",
fieldName = "orthoLength",
expression = nchar(word),
fieldaccess = "auto")
print("A fragment of the updated table:")
#> [1] "A fragment of the updated table:"
head(myRez$tokenDF %>% rez_select(id, word, orthoLength))
#> # A tibble: 6 × 3
#> id word orthoLength
#> <chr> <chr> <int>
#> 1 2ECADE1029CD3 བཀྲ་- 5
#> 2 15A5089F6157A བཀྲ་ཤིས་ 8
#> 3 197AA4A0C625F བདེ་ལེགས 8
#> 4 2C7746BD6F150 ། 1
#> 5 354CBFB3632B6 <0> 3
#> 6 2D1DAD2FFF22A ཕེབས་ 5
print("The updateFunction:")
#> [1] "The updateFunction:"
updateFunct(myRez$tokenDF, "orthoLength")
#> function (df)
#> updateMutate(df, field, x)
#> <environment: 0x000001da3bdb8c18>
#> attr(,"class")
#> [1] "updateFunction" "function"
#> attr(,"deps")
#> character(0)
Now let’s spice this up a bit by adding a complex field. A complex
field takes information from multiple rows of a table. In this case, we
are working with the tokenDF, but want the new column to be the length
of the longest word in the unit that the token comes from. So the
groupField is ‘unit’, and we specify the field type as ‘complex’. The
expression uses the function longestLength(), which is a rezonateR
function that returns the length of the longest word in a series of
words.
myRez = addField(myRez, entity = "token", layer = "",
fieldName = "longestWordInUnit",
expression = longestLength(word),
type = "complex",
groupField = "unit",
fieldaccess = "auto")
head(myRez$tokenDF %>% select(id, word, longestWordInUnit))
#> # A tibble: 6 × 3
#> id word longestWordInUnit
#> <chr> <chr> <int>
#> 1 2ECADE1029CD3 བཀྲ་- 5
#> 2 15A5089F6157A བཀྲ་ཤིས་ 8
#> 3 197AA4A0C625F བདེ་ལེགས 8
#> 4 2C7746BD6F150 ། 8
#> 5 354CBFB3632B6 <0> 5
#> 6 2D1DAD2FFF22A ཕེབས་ 5
Now let’s add a simple foreign field. Let’s say when we look at the
tokenDF, we also want to know what the whole unit’s words are. The
source is the ‘word’ field of units, and we are creating a new field
for tokens called ‘unitWord’. The foreign key is the field that
contains IDs of the source table inside the target table, in this case
the ‘unit’ field of tokenDF:
myRez = addFieldForeign(myRez,
targetEntity = "token", targetLayer = "",
sourceEntity = "unit", sourceLayer = "",
targetForeignKeyName = "unit",
targetFieldName = "unitWord", sourceFieldName = "word",
fieldaccess = "foreign")
head(myRez$tokenDF %>% select(id, word, unitWord))
#> # A tibble: 6 × 3
#> id word unitWord
#> <chr> <chr> <chr>
#> 1 2ECADE1029CD3 བཀྲ་- བཀྲ་-
#> 2 15A5089F6157A བཀྲ་ཤིས་ བཀྲ་ཤིས་ བདེ་ལེགས །
#> 3 197AA4A0C625F བདེ་ལེགས བཀྲ་ཤིས་ བདེ་ལེགས །
#> 4 2C7746BD6F150 ། བཀྲ་ཤིས་ བདེ་ལེགས །
#> 5 354CBFB3632B6 <0> <0> ཕེབས་ གནང་ །
#> 6 2D1DAD2FFF22A ཕེབས་ <0> ཕེབས་ གནང་ །
Now let’s wrap it up with a complex foreign field. Here, we’re going
to add a field in the unitDF that tells us the length of the shortest
word within the unit. We’re going to base this off the entryDF.
However, because the entries that correspond to units are given in the
nodeMap, you also need to supply the name of the list of entries inside
the unit nodeMap - here it’s called entryList. Chunks and tree entries
are built on tokens instead, so they have a list called tokenList.
Instead of ‘expression’, complex foreign fields have an argument called
complexAction, which is a function performed on the source field of the
source table:
myRez = addFieldForeign(myRez,
targetEntity = "unit", targetLayer = "",
sourceEntity = "entry", sourceLayer = "",
targetForeignKeyName = "entryList",
targetFieldName = "shortestWordLength",
sourceFieldName = "word",
type = "complex",
complexAction = shortestLength,
fieldaccess = "foreign")
head(myRez$unitDF %>% select(id, word, shortestWordLength))
#> # A tibble: 6 × 3
#> id word shortestWordLength
#> <chr> <chr> <int>
#> 1 22C69D930B150 བཀྲ་- 5
#> 2 2456BC9E7C5D5 བཀྲ་ཤིས་ བདེ་ལེགས ། 1
#> 3 201A180965346 <0> ཕེབས་ གནང་ ། 1
#> 4 2BD385D3E52C7 <0> རྒན་ ངག་དབང་ ལགས་ ཡིན་ ནམ ། 1
#> 5 2B558CAD0F0B4 ལགས་ <0> ཡིན ། 1
#> 6 1213A3F8E28E0 ང་ 2
(Because of the technicality that punctuation counts as a character,
most of these values are 1. There are ways we can fix this using isWord
conditions, as we’ll discuss below.)
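To see why punctuation drags the value down to 1, here is a small base R sketch with a made-up unit in Wylie romanisation. It uses plain min() and nchar() as a stand-in for the idea behind shortestLength(); I am not claiming this is how the rezonateR function is implemented.

```r
# Toy unit: three tokens plus the punctuation mark '/'
words <- c("<0>", "phebs", "gnang", "/")

# Treat any token containing '/' as a non-word (an assumption for this sketch)
isWord <- !grepl("/", words, fixed = TRUE)

shortestAll <- min(nchar(words))                # punctuation included
shortestWordsOnly <- min(nchar(words[isWord]))  # punctuation excluded
print(c(all = shortestAll, wordsOnly = shortestWordsOnly))
```

With the punctuation mark included the minimum is 1; once it is filtered out, the minimum reflects an actual token.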
So far we’ve only looked at addField(), but the good news is that
changeField() works in exactly the same way! Here’s an example,
changing our orthoLength field to depend on the Romanisation instead of
the original Tibetan script:
myRez = changeField(myRez, entity = "token", layer = "",
fieldName = "orthoLength",
expression = nchar(wordWylie),
fieldaccess = "auto")
print("A fragment of the updated table:")
#> [1] "A fragment of the updated table:"
head(myRez$tokenDF %>% rez_select(id, word, orthoLength))
#> # A tibble: 6 × 3
#> id word orthoLength
#> <chr> <chr> <int>
#> 1 2ECADE1029CD3 བཀྲ་- 8
#> 2 15A5089F6157A བཀྲ་ཤིས་ 10
#> 3 197AA4A0C625F བདེ་ལེགས 8
#> 4 2C7746BD6F150 ། 1
#> 5 354CBFB3632B6 <0> 3
#> 6 2D1DAD2FFF22A ཕེབས་ 6
Note that if you don’t specify the field access value, I will
automatically change it to flex, even in changeField(). This is to
force you to remember that you are not only changing the value of the
field itself, but also how it will be updated in the future. If you
don’t supply a field access value and it’s originally an auto or
foreign field, I will warn you about this change, so you can run
changeField() again if you want to change your mind.
TidyRez
In general, TidyRez functions are called by adding ‘rez_’ in front of
a dplyr function name, such as rez_group_by or rez_mutate. Using
TidyRez functions allows you to keep and/or update your field access
values, inNodeMap values, and updateFunctions. Using base R or classic
dplyr functions with rezrDFs may result in reload failures, unless
supplemented by core engine functions, which are not covered in this
vignette.
A few dplyr functions are completely safe to use in rezonateR, mostly
those that focus on selecting rows of a table, such as filter(),
arrange() or slice(). Currently implemented TidyRez functions include:
- rez_add_row() for adding new entries
- rez_mutate() for adding and editing columns
- rez_rename() for renaming columns
- rez_bind_rows() for combining rezrDFs vertically
- rez_group_split() for splitting rezrDFs vertically
- rez_group_by() and rez_ungroup() for grouping
- rez_select() for selecting certain columns inside a rezrDF
- rez_left_join() for left joins
A few other planned ones include rez_bind_cols() and rez_outer_join(),
which will be especially useful for the calculation of inter-annotator
agreement.
Because TidyRez is relatively straightforward for Tidyverse users, this
vignette will focus on what TidyRez adds on top of Tidyverse. If you
want to learn about basic Tidyverse, there are many existing tutorials
on the Internet.
To see the power of TidyRez, let’s try creating an emancipated rezrDF
with only a subset of the original columns. Here, we take
trackDF$refexpr, the table of referential expressions. We then damage
one of the fields using a classic dplyr function. As you can see here,
the emancipated rezrDF can still be updated using the rezrObj,
effectively overriding the damage:
refTable = myRez$trackDF$refexpr %>% rez_select(id, token, chain, name, word, tokenOrderLast)
print("Before:")
#> [1] "Before:"
head(refTable %>% select(id, tokenOrderLast))
#> # A tibble: 6 × 2
#> id tokenOrderLast
#> <chr> <dbl>
#> 1 3D148B2FEEA8 1
#> 2 1CA978DCDE1DB 1
#> 3 14C5BE6658C39 4
#> 4 1FE8CC91923D1 2
#> 5 33A483E30E811 1
#> 6 280CA1A8A425 1
refTable = refTable %>% mutate(tokenOrderLast = 1) #Damage refTable with a classic dplyr function
print("After:")
#> [1] "After:"
refTable = refTable %>% reload(myRez)
head(refTable %>% select(id, tokenOrderLast))
#> # A tibble: 6 × 2
#> id tokenOrderLast
#> <chr> <dbl>
#> 1 3D148B2FEEA8 1
#> 2 1CA978DCDE1DB 1
#> 3 14C5BE6658C39 4
#> 4 1FE8CC91923D1 2
#> 5 33A483E30E811 1
#> 6 280CA1A8A425 1
A warning is in order: TidyRez only updates the current table. If other
tables have references to the table you’re editing, they will not be
updated. You must bear this in mind when using rez_select() and
rez_rename(). No problems will arise if you use these functions on
emancipated rezrDFs. However, if you use these functions on rezrDFs
within rezrObjs, you should manually update any fields in other rezrDFs
that refer to the field you’ve deleted or renamed. I plan to add a
rename feature to EasyEdit in the near future that will update
references from other rezrDFs.
Most of the TidyRez functions deviate from dplyr syntax only minimally,
in ways that you can read about in the documentation. However,
rez_left_join() is worth a quick mention. In addition to a fieldaccess
argument and a rezrObj argument, which are self-explanatory, there is
an fkey argument and a df2Address argument. fkey is the name of the
field in the first data.frame that corresponds to IDs of the second
data.frame. df2Address is a string that tells rez_left_join() how to
find the source rezrDF next time. If the source rezrDF doesn’t belong
to a layer, e.g. tokenDF, just type that. If the source rezrDF belongs
to a layer, put a ‘/’ between the table and the layer, e.g.
‘trackDF/refexpr’.
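The address format can be illustrated with a few lines of base R. splitAddress() is a hypothetical helper written for this sketch; it only shows how a string like ‘trackDF/refexpr’ breaks down into a table and a layer, and says nothing about how rezonateR resolves addresses internally.

```r
# Hypothetical helper (not a rezonateR function): split an address string
# into its table and layer parts. A missing layer becomes an empty string.
splitAddress <- function(address) {
  parts <- strsplit(address, "/", fixed = TRUE)[[1]]
  list(table = parts[1],
       layer = if (length(parts) > 1) parts[2] else "")
}

noLayer <- splitAddress("tokenDF")
withLayer <- splitAddress("trackDF/refexpr")
str(noLayer)
str(withLayer)
```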
An interlude: Time and sequence
Before we continue our adventure, let’s look at a couple of ways we can
upgrade our rezrObj to contain even more information.
The first thing we can do, which we hinted at before, is to set certain
tokens as non-words. You can do this with the addIsWordField()
function. One immediate benefit of this is that we get new sequence
values. The tokenOrder and docTokenSeq fields in the original rezrDF
count all tokens, whereas wordOrder and docWordSeq only count tokens
that qualify as words according to some criterion. Let’s set our
criterion to !str_detect(wordWylie, "/"), i.e. the token must not
contain ‘/’, the Wylie rendering of the main punctuation mark in
Tibetan. Notice that wordOrder is never higher than tokenOrder, and
that non-word tokens get a value of 0:
myRez = addIsWordField(myRez, !str_detect(wordWylie, "/"))
head(myRez$tokenDF %>% select(id, tokenOrder, docTokenSeq, wordOrder, docWordSeq))
#> # A tibble: 6 × 5
#> id tokenOrder docTokenSeq wordOrder docWordSeq
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 2ECADE1029CD3 1 1 1 1
#> 2 15A5089F6157A 1 2 1 2
#> 3 197AA4A0C625F 2 3 2 3
#> 4 2C7746BD6F150 3 4 0 0
#> 5 354CBFB3632B6 1 5 1 4
#> 6 2D1DAD2FFF22A 2 6 2 5
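The numbering pattern in this table can be reproduced with a short base R sketch. The unit and isWord vectors below are toy data mirroring the six rows shown; the cumsum-based computation is my own illustration of the pattern, not rezonateR's implementation.

```r
# Toy data mirroring the six tokens above: unit membership and word status
unit <- c(1, 2, 2, 2, 3, 3)
isWord <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)

# Document-level word sequence: running count of words, 0 for non-words
docWordSeq <- ifelse(isWord, cumsum(isWord), 0)

# Unit-level word order: running count of words within each unit,
# again set to 0 for non-words
wordOrder <- ave(as.numeric(isWord), unit, FUN = cumsum) * isWord

print(data.frame(unit, isWord, wordOrder, docWordSeq))
```

The fourth token, the punctuation mark, gets 0 in both columns, and the tokens after it are numbered as if it were not there.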
By default, unitSeq information is not available to rezrDFs other than
unitDF. You can change this using the addUnitSeq() function, which can
add unitSeq information up to track chains:
myRez = addUnitSeq(myRez, "track")
This adds unitSeqFirst and unitSeqLast fields to chunks and track chain
entries, and a unitSeq field to tokens.
Updating rezonateR using external information
Some annotation actions are easier with a spreadsheet than in
Rezonator, so one action you will frequently perform is to do
annotations in a spreadsheet programme and then integrate that
information back into a rezrObj. Fortunately, rezonateR contains
functionality that can facilitate this and minimise errors generated in
the process.
Let’s say we want to annotate the person of the referential
expressions inside trackDF$refexpr. Before we start annotating
manually, I wrote some simple rules that guess the person correctly in
most situations, so we will only have to correct from this baseline:
myRez$trackDF$refexpr = myRez$trackDF$refexpr %>%
  rez_mutate(person = case_when(
    word == "ང" | str_starts(word, "ང་") ~ 1, #First person
    str_starts(word, "(ཁྱེད|ཁྱོད|ཇོ་ལགས|ཨ་ཅག་ལགས|རྒན་ལགས)") ~ 2, #Second person
    str_ends(word, "(ལགས|<0>)") ~ 0, #Multiple likely scenarios
    TRUE ~ 3)) #Everything else defaults to third person
Before we export this as a CSV for annotation, I would like to add a
column inside the rezrDF that gives us the words of the entire unit.
(Since this document currently does not have multi-unit track entries,
it will suffice to use unitSeqLast or unitSeqFirst.) It will be useful
to be able to see this column while making manual annotations:
myRez$trackDF$refexpr = myRez$trackDF$refexpr %>%
rez_left_join(myRez$unitDF %>% rez_select(unitSeq, word), by = c(unitSeqLast = "unitSeq"), suffix = c("", "_unit"), df2key = "unitSeq", df2Address = "unitDF", fkey = "unitSeqLast") %>%
rez_rename(unitLastWord = word_unit)
#> Tip: When performed on a rezrDF inside a rezrObj, rez_rename is a potentially destructive action. It is NOT recommended to assign it back to a rezrDF inside a rezrObj. If you must do so, be careful to update all addresses from other DFs to this DF.
The next step is to write the CSV file. rez_write_csv() allows us to
do this easily. The third argument of rez_write_csv() is a vector of
field names that we want to export. It is advisable to keep the number
of exported fields small to make the spreadsheet more manageable and
require less scrolling:
rez_write_csv(myRez$trackDF$refexpr, "refexpr.csv", c("id", "name", "unitLastWord", "unitSeqLast", "word", "docTokenSeqLast", "entityType", "roleType", "person"))
After editing the CSV in a spreadsheet program, let’s import it back
using rez_read_csv(). (I’ve renamed the edited CSV - in general, I
recommend doing this to avoid accidentally overwriting your edited file
by running the export code again.) The origDF argument tells rezonateR
to look in the original rezrDF that produced the CSV, and determine the
data types accordingly:
changeDF = rez_read_csv("refexpr_edited.csv", origDF = myRez$trackDF$refexpr)
Finally, the updateFromDF()
function allows us to update
the original rezrDF
using information from the new
rezrDF
. There are many fancy options you can choose from,
such as deciding whether to delete rows, add rows, add columns, etc. We
will only use the most vanilla options, and update the ‘person’
column:
myRez$trackDF$refexpr = myRez$trackDF$refexpr %>% updateFromDF(changeDF, changeCols = 'person')
#>
#> "id"
#> NULL
head(myRez$trackDF$refexpr %>% select(id, word, person))
#> # A tibble: 6 × 3
#> id word person
#> <chr> <chr> <dbl>
#> 1 3D148B2FEEA8 <0> NA
#> 2 1CA978DCDE1DB <0> NA
#> 3 14C5BE6658C39 རྒན་ ངག་དབང་ ལགས་ NA
#> 4 1FE8CC91923D1 <0> NA
#> 5 33A483E30E811 ང་ NA
#> 6 280CA1A8A425 འདི་ NA
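For readers who want to see the logic of the round trip in isolation, here is a rough base R sketch on a toy table: export a few columns, edit the file, read it back, and copy one column over by matching on id. This is only an illustration under my own assumptions (the toy values and temporary file paths are made up). It is not how rez_write_csv(), rez_read_csv() or updateFromDF() are implemented.

```r
# Toy table standing in for a rezrDF (hypothetical values)
refexpr <- data.frame(
  id     = c("a1", "b2", "c3"),
  word   = c("<0>", "I", "this"),
  person = c(0, 1, 3),
  stringsAsFactors = FALSE
)

# 1. Export only the columns needed for manual checking
csv_path <- tempfile(fileext = ".csv")
write.csv(refexpr[, c("id", "word", "person")], csv_path, row.names = FALSE)

# 2. Stand-in for manual editing: correct one value, save under a new name
edited <- read.csv(csv_path, stringsAsFactors = FALSE)
edited$person[edited$id == "a1"] <- 2
edited_path <- tempfile(fileext = ".csv")
write.csv(edited, edited_path, row.names = FALSE)

# 3. Re-import and copy the corrected column back, matching rows by id
changes <- read.csv(edited_path, stringsAsFactors = FALSE)
idx <- match(refexpr$id, changes$id)
refexpr$person <- ifelse(is.na(idx), refexpr$person, changes$person[idx])
refexpr
```

Matching on id rather than on row position means the update still works if the spreadsheet was sorted or filtered during editing.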
Analysing track chains with EasyTrack
Now that we’ve looked at an example of semi-automatic annotation, let’s move on to some full automation! We will be looking in particular at coreference chains. rezonateR contains a suite of functions for generating features useful for analysing the choice of referential forms, reference comprehension, and similar topics.
Anaphoric and cataphoric distance
Let’s first find out how many units we are from the previous mention
of something. This is equivalent to the gapUnits
column that
Rezonator generates automatically:
myRez$trackDF$refexpr = myRez$trackDF$refexpr %>%
rez_mutate(unitsToLastMention = unitsToLastMention(unitSeqLast))
myRez$trackDF$refexpr %>% select(id, gapUnits, unitsToLastMention) %>% slice(10:16)
#> # A tibble: 7 × 3
#> id gapUnits unitsToLastMention
#> <chr> <chr> <dbl>
#> 1 134D42FB97545 0 1
#> 2 183DC7932D931 7 7
#> 3 1373D1F88358 N/A NA
#> 4 2DBE5E8F59A6D 1 1
#> 5 F20CE11F519F 1 1
#> 6 1366768617 2 2
#> 7 287C6AAAA209D N/A NA
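The idea behind this measure can be sketched in a few lines of base R. The toy table and its column names (`chain`, `unitSeq`, `unitsToLast`) are assumptions made for this illustration, and the rows are assumed to be sorted by position in the text; the actual function may handle edge cases differently.

```r
# Toy track entries, sorted by position in the text (hypothetical values)
mentions <- data.frame(
  chain   = c("A", "B", "A", "A", "B"),
  unitSeq = c(1, 2, 2, 5, 9)
)

# Within each chain, subtract the previous mention's unit from the current one;
# the first mention of a chain has no antecedent, hence NA
mentions$unitsToLast <- ave(
  mentions$unitSeq, mentions$chain,
  FUN = function(u) c(NA, diff(u))
)
mentions
```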
Now let’s count the tokens from the last mention using the
tokensToLastMention()
function. This one has a couple of
complications. The first one is which seq to count. In the interlude, we
mentioned that in addition to docTokenSeq
, we have a
sequence value called docWordSeq
that excludes nonwords. We
will use that value in counting. The second complication is how we will
treat zero mentions. Zeros do not actually exist in the world, so they
have no time to speak of. The ‘zeroProtocol
’ argument is
‘unitFinal
’, telling rezonateR
to count the
last word of whatever unit the zero comes from. Finally, since we’re
dealing with units, we need to pass the unitDF
to ensure
that tokensToLastMention
can have access to unit
information:
myRez$trackDF$refexpr = myRez$trackDF$refexpr %>%
rez_mutate(wordsToLastMention = tokensToLastMention(
docWordSeqLast, #What seq to use
zeroProtocol = "unitFinal", #How to treat zeroes
zeroCond = (word == "<0>"),
unitDF = myRez$unitDF)) #Additional argument for unitFinal protocol
myRez$trackDF$refexpr %>% select(id, wordsToLastMention) %>% slice(10:16)
#> # A tibble: 7 × 2
#> id wordsToLastMention
#> <chr> <dbl>
#> 1 134D42FB97545 0
#> 2 183DC7932D931 22
#> 3 1373D1F88358 NA
#> 4 2DBE5E8F59A6D 0
#> 5 F20CE11F519F 0
#> 6 1366768617 0
#> 7 287C6AAAA209D NA
Note that unitsToNextMention
and
tokensToNextMention
work in the same way.
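To make the unit-final treatment of zeroes more concrete, here is a small base R sketch of just the position-assignment step: a zero borrows the position of the last word of its unit. The tables and column names (`lastWordSeq`, `wordSeq`, `position`) are hypothetical, and this sketch says nothing about how the distances themselves are then computed.

```r
# Hypothetical unit table: the word position at which each unit ends
units <- data.frame(unitSeq = 1:3, lastWordSeq = c(4, 9, 12))

# Hypothetical mentions; the zero has no word position of its own
mentions <- data.frame(
  word    = c("I", "<0>", "this"),
  unitSeq = c(1, 2, 3),
  wordSeq = c(2, NA, 11),
  stringsAsFactors = FALSE
)

# Unit-final treatment: a zero takes the position of its unit's last word
is_zero  <- mentions$word == "<0>"
unit_end <- units$lastWordSeq[match(mentions$unitSeq, units$unitSeq)]
mentions$position <- ifelse(is_zero, unit_end, mentions$wordSeq)
mentions
```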
Tallying preceding and following mentions
We can also count how many previous mentions of something there were
within a window of units. Most people use a window of 5 or 20 units. Let’s try this
with 5. The countPrevMentions()
function allows us to do this
(countNextMentions()
does this but for the succeeding
context):
myRez$trackDF$refexpr = myRez$trackDF$refexpr %>% rez_mutate(noPrevMentionsIn5 = countPrevMentions(5))
myRez$trackDF$refexpr %>% select(id, noPrevMentionsIn5) %>% slice(10:16)
#> # A tibble: 7 × 2
#> id noPrevMentionsIn5
#> <chr> <int>
#> 1 134D42FB97545 2
#> 2 183DC7932D931 0
#> 3 1373D1F88358 0
#> 4 2DBE5E8F59A6D 1
#> 5 F20CE11F519F 1
#> 6 1366768617 2
#> 7 287C6AAAA209D 0
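Conceptually, a windowed count of previous mentions looks something like the base R sketch below. The toy data and column names are my own, and I have assumed that a mention exactly `window` units back still counts; the real function may draw the boundary differently.

```r
# Toy track entries, sorted by position in the text (hypothetical values)
mentions <- data.frame(
  chain   = c("A", "A", "B", "A", "A"),
  unitSeq = c(1, 2, 3, 6, 12),
  stringsAsFactors = FALSE
)
window <- 5

# For each mention, count earlier rows of the same chain whose unit lies
# no more than `window` units back (boundary treated as inclusive here)
mentions$prevIn5 <- vapply(seq_len(nrow(mentions)), function(i) {
  earlier <- seq_len(i - 1)
  sum(mentions$chain[earlier] == mentions$chain[i] &
        mentions$unitSeq[i] - mentions$unitSeq[earlier] <= window)
}, integer(1))
mentions
```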
Sometimes, we may want to extract previous mentions conditionally,
e.g. only count subject mentions or zero mentions. The functions
countPrevMentionsIf()
and countNextMentionsIf()
allow us to define such a condition. Let’s try counting the number of
coming zero mentions. Here, we use the condition
word == "<0>"
, i.e. the word is a zero, and the
window is Inf
, i.e. there’s no limit on how far in the
future we look:
myRez$trackDF$refexpr = myRez$trackDF$refexpr %>% rez_mutate(noComingZeroes = countNextMentionsIf(Inf, word == "<0>"))
myRez$trackDF$refexpr %>% select(id, noComingZeroes) %>% slice(10:16)
#> # A tibble: 7 × 2
#> id noComingZeroes
#> <chr> <int>
#> 1 134D42FB97545 1
#> 2 183DC7932D931 9
#> 3 1373D1F88358 0
#> 4 2DBE5E8F59A6D 9
#> 5 F20CE11F519F 0
#> 6 1366768617 8
#> 7 287C6AAAA209D 1
Counting competitors
We may also want to count competing mentions, that is, recent
mentions not coreferential to the current mention.
countCompetitors()
tallies the number of competitors
intervening between the previous and current mention, possibly within a
window. Here is one example with no window:
myRez$trackDF$refexpr = myRez$trackDF$refexpr %>% rez_mutate(noCompetitors = countCompetitors())
myRez$trackDF$refexpr %>% select(id, noComingZeroes) %>% slice(10:16)
#> # A tibble: 7 × 2
#> id noComingZeroes
#> <chr> <int>
#> 1 134D42FB97545 1
#> 2 183DC7932D931 9
#> 3 1373D1F88358 0
#> 4 2DBE5E8F59A6D 9
#> 5 F20CE11F519F 0
#> 6 1366768617 8
#> 7 287C6AAAA209D 1
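As a rough illustration of what a competitor is, the following base R sketch counts the mentions of other chains that fall between a mention and its most recent coreferential antecedent. The toy data are hypothetical, and returning NA for first mentions is my own choice for this sketch, not a statement about what countCompetitors() returns.

```r
# Toy sequence of mentions in text order; each letter is a different chain
mentions <- data.frame(
  chain = c("A", "B", "C", "A", "B", "A"),
  stringsAsFactors = FALSE
)

mentions$competitors <- vapply(seq_len(nrow(mentions)), function(i) {
  same <- which(mentions$chain[seq_len(i - 1)] == mentions$chain[i])
  if (length(same) == 0) return(NA_integer_)  # first mention: no antecedent
  prev <- max(same)                           # most recent coreferential mention
  # Every row strictly between prev and i belongs to some other chain
  as.integer(i - prev - 1)
}, integer(1))
mentions
```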
All of the functions introduced in this section have additional arguments that allow for further customisation. Please feel free to refer to the manual for more information.
Adding tree information
Now let’s add some information from trees. The first thing to do is
to run the getAllTreeCorrespondences()
function, which adds
a treeEntry
column to non-tree tables. If you select
entity = "track"
, this column will be added to
tokenDF
, chunkDF
and trackDF
.
myRez = getAllTreeCorrespondences(myRez, entity = "track")
myRez$trackDF$refexpr %>% select(id, treeEntry) %>% slice(10:16)
#> # A tibble: 7 × 2
#> id treeEntry
#> <chr> <chr>
#> 1 134D42FB97545 23E6B9A16C426
#> 2 183DC7932D931 253E6EE11F308
#> 3 1373D1F88358 1E1D73A61D4B0
#> 4 2DBE5E8F59A6D 5E28BE8ED183
#> 5 F20CE11F519F 363329E4BBB94
#> 6 1366768617 281F8FC643F88
#> 7 287C6AAAA209D 364D86848B187
The best thing that trees do for us is connecting verb information,
stored in chunks, to track chain entry (i.e. referential expression)
information. We can do this in two steps. First we add a
treeParent
column to trackDF$refexpr
that
takes the value of the ‘parent’ column of treeEntryDF
; in
simple terms, this means we’re getting the parent tree entry’s ID into
trackDF$refexpr
. We then use this parent tree entry’s ID to
find the corresponding verb chunk, and with this, we have successfully
put the verb on the trackDF$refexpr
table.
myRez = myRez %>% addFieldForeign("track", "refexpr", "treeEntry", "default", "treeEntry", "treeParent", "parent", fieldaccess = "foreign")
myRez$trackDF$refexpr = myRez$trackDF$refexpr %>% rez_left_join(myRez$chunkDF$verb %>% select(id, word, treeEntry), by = c(treeParent = "treeEntry"), suffix = c("", "_verb"), df2Address = "chunkDF/verb", fkey = "treeParent", df2key = "treeEntry", rezrObj = myRez) %>% rename(verbID = id_verb, verbWord = word_verb)
myRez$trackDF$refexpr %>% select(id, treeParent, verbID, verbWord) %>% slice(10:16)
#> # A tibble: 7 × 4
#> id treeParent verbID verbWord
#> <chr> <chr> <chr> <chr>
#> 1 134D42FB97545 16CA7768343C0 2CFB905E634DE རེད
#> 2 183DC7932D931 2EA73E6E3BED4 30E122DAF5010 ཕེབས་པ་ ཡིན་ ནམ
#> 3 1373D1F88358 2EA73E6E3BED4 30E122DAF5010 ཕེབས་པ་ ཡིན་ ནམ
#> 4 2DBE5E8F59A6D 359A5ED8BE360 2CCFFAD41AD38 ཡོང་པ་ ཡིན
#> 5 F20CE11F519F 359A5ED8BE360 2CCFFAD41AD38 ཡོང་པ་ ཡིན
#> 6 1366768617 A4DB76F5C0EB A4DC56F0B165 ཡོང་བ་ ཡིན
#> 7 287C6AAAA209D A4DB76F5C0EB A4DC56F0B165 ཡོང་བ་ ཡིན
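Stripped of the rezonateR machinery, the two-step lookup is just two chained matches. The base R sketch below uses three hypothetical toy tables (`tree`, `verbs`, `refexpr`) to show the idea; it does not keep track of the field access and foreign key information that addFieldForeign() and rez_left_join() record for later updates.

```r
# Hypothetical tree entries: each has an id and the id of its parent
tree <- data.frame(
  id     = c("t1", "t2", "t3"),
  parent = c("t3", "t3", NA),
  stringsAsFactors = FALSE
)
# Hypothetical verb chunk, linked to a tree entry
verbs <- data.frame(
  id = "v1", word = "came", treeEntry = "t3",
  stringsAsFactors = FALSE
)
# Hypothetical referential expressions, each linked to a tree entry
refexpr <- data.frame(
  id = c("r1", "r2"), treeEntry = c("t1", "t2"),
  stringsAsFactors = FALSE
)

# Step 1: fetch the parent tree entry of each referential expression
refexpr$treeParent <- tree$parent[match(refexpr$treeEntry, tree$id)]
# Step 2: find the verb chunk attached to that parent entry
hit <- match(refexpr$treeParent, verbs$treeEntry)
refexpr$verbID   <- verbs$id[hit]
refexpr$verbWord <- verbs$word[hit]
refexpr
```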
Advanced: Chunk mergers
The last topic to cover is merging chunks, most useful for creating multi-line chunks. There are several steps to merging chunks:
1. Create constituent chunks that together span the entire merged chunk.
2. Create a tree leaf that contains all tokens in the merged chunk, and put the leaf in a tree.
3. Use the mergeChunksWithTree() command in rezonateR to merge them.
mergeChunksWithTree()
is very easy to use. After you
call this command, the merged chunks will be added to the bottom of the
corresponding chunk rezrDF
. Chunk tags are taken from the
first constituent chunk of each merger by default; see the manual for
setting custom conditions. There will in addition be a column called
combinedChunk
that tells you whether a chunk is a combined
chunk, a member of a combined chunk, or neither.
myRez = mergeChunksWithTree(myRez)
myRez$chunkDF$refexpr %>% filter(combinedChunk != "") %>% select(id, name, word, combinedChunk) #Showing only combined chunks and their members
#> # A tibble: 6 × 4
#> id name word combi…¹
#> <chr> <chr> <chr> <chr>
#> 1 51A902A7C6CD Chunk 49 ཁོང་ ཚོ་ ནས་ |infom…
#> 2 2E2B9BB6462BC Chunk 51 དཔེ་སྐྲུན་ ཞུས་པ་ དེ་ |membe…
#> 3 17842C4087863 Chunk 235 ཕྱི་རྒྱལ་ ནུབ་ཕྱོགས་པ འི་ མི་རིགས་ འདི་འདྲས་ |infom…
#> 4 EF4325E7D5A9 Chunk 81 བོད་སྐད་ ནང་ བྲིས་པ འི་ རྒྱལ་རབས་ ཀྱི་ དེབ་ |membe…
#> 5 6ru0c9hVoSLwZ New Chunk 1 ཁོང་ ཚོ་ ནས་ དཔེ་སྐྲུན་ ཞུས་པ་ དེ་ combin…
#> 6 HpsUkPzuAeFcm New Chunk 2 ཕྱི་རྒྱལ་ ནུབ་ཕྱོགས་པ འི་ མི་རིགས་ འདི་འདྲས་ བོད་སྐད་ ནང་… combin…
#> # … with abbreviated variable name ¹combinedChunk
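The shape of the result can be imitated with a small base R sketch: member chunks are pasted together, the tag comes from the first member, and the merged row is appended at the bottom. The `group` column, the English toy data and the labels "member" and "combined" are assumptions for this illustration; in rezonateR the grouping comes from the tree, and the actual values of combinedChunk differ.

```r
# Hypothetical chunks; those sharing a group value are to be merged
chunks <- data.frame(
  id    = c("c1", "c2", "c3"),
  word  = c("the people", "who published it", "a book"),
  tag   = c("subject", "modifier", "object"),
  group = c("g1", "g1", NA),
  stringsAsFactors = FALSE
)

members <- chunks[!is.na(chunks$group), ]
merged <- do.call(rbind, lapply(split(members, members$group), function(g) {
  data.frame(
    id    = paste(g$id, collapse = "+"),
    word  = paste(g$word, collapse = " "),  # concatenate the members' text
    tag   = g$tag[1],                       # tag taken from the first member
    group = g$group[1],
    stringsAsFactors = FALSE
  )
}))

# Mark members and merged rows, then append merged rows at the bottom
chunks$combinedChunk <- ifelse(is.na(chunks$group), "", "member")
merged$combinedChunk <- "combined"
result <- rbind(chunks, merged)
result
```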
You may also augment the trackDF
with the merged chunks;
the combinedChunk
column works similarly:
myRez = mergedChunksToTrack(myRez, "refexpr")
Where to go from here?
Now that you’ve seen the bare-bones basics of using
rezonateR
, if you want to dive in and start using it, you
can proceed to our sequence of detailed tutorials starting from
vignette("import_save_basics")
. If you want to see a
concrete example of a mini-project, take a look at
vignette("sample_proj")
.