An overview of rezonateR for Rezonator users

Are you already familiar with Rezonator and want to do quantitative analyses with it, but don’t yet know how to use rezonateR? If so, then this overview is for you. This overview will not go through all the details of rezonateR, but we will provide a snapshot of the most important functions in the package that will help you in your annotation and analysis. If you aren’t very familiar with the companion tool Rezonator and want to know what can be done with this tool, the toy example vignette("sample_proj") gives a concrete example of a mini-research project using Rezonator and rezonateR, and will give you a feel for what sorts of projects are possible with Rezonator + rezonateR. It would also be helpful to check out the official Rezonator guides and try some simple annotations in Rezonator to get a feel for how it works. If you already know how rezonateR works, you can start from vignette("import_save_basics") to learn the nitty-gritty of coding in rezonateR.

Preliminaries

Why use rezonateR?

If you’re reading this, you’re probably already using Rezonator to do your daily work. Rezonator is a powerful tool for annotating and visualising the dynamics of human engagement. The purpose of rezonateR is to add a series of additional tools that enhance the functionality of Rezonator and increase your productivity! My goal is to minimise the time you spend on coding and annotating so you have more time for thinking.

Rezonator has many cool features, but due to technical restrictions, there are certain things it can’t do, such as:

Dividing annotations into layers
Adding chunks and track chain entries that span multiple units
Linking tree entries to the corresponding chunks
Guessing the value of a field by looking at the values of other fields
Automatically updating the values of certain fields using information from other fields

rezonateR can do all these, and more!

Moreover, rezonateR is geared toward people of all skill levels in R. I do a lot of the heavy lifting for you. As long as you have some familiarity with base R, you can quickly pick up the basic functions, but more advanced users can also extend it as they see fit.

The rezonateR engine is based heavily on Tidyverse packages, particularly rlang and dplyr. Although you don’t need to be familiar with Tidyverse to use the basic functionality, Tidyverse users will be happy to see a wide range of functions that mimic Tidyverse functions in appearance, but with additional arguments to support the wide range of functionality in rezonateR.

If you’re wondering when you should start learning rezonateR, the best time is now! Although you can get started with the Rezonator GUI relatively easily, if there are certain things you know you will use in rezonateR - such as multi-line chunks - you probably want to have the rezonateR post-processing in mind, even when you’re annotating in Rezonator.

install.packages("devtools")
library(devtools)
install_github("rezonators/rezonateR")

Some folks have reported that this does not work. You may want to try adding the code options(download.file.method = "auto") before running install_github() if this is the case.

A quick import

Let’s start by loading the package.

library(rezonateR)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Loading required package: readr
#> Loading required package: stringr
#> Loading required package: rlang

Now let’s import our first file, a short spoken text in Lhasa Tibetan (you can find the original video here: https://av.mandala.library.virginia.edu/video/couple-must-part-threes-company-02). This file contains a number of chunks, track chains, as well as trees, and we will deal with them in this vignette:

path = system.file("extdata", "virginia-library-20766.rez", package = "rezonateR")

layerRegex = list(
  track = list(field = "trailLayer", regex = c("clausearg", "discdeix"), names = c("clausearg", "discdeix", "refexpr")),
  chunk = list(field = "chunkLayer", regex = c("verb", "adv", "predadj"), names = c("verb", "adv", "predadj", "refexpr")))

myRez = importRez(path, layerRegex = layerRegex, concatFields = c("word", "wordWylie"))
#> Import starting - please be patient ...
#> Creating node maps ...
#> Creating rezrDFs ...
#> Adding foreign fields to rezrDFs and sorting (this is the slowest step) ...
#> >Adding to unit entry DF ...
#> >Adding to unit DF ...
#> >Adding to chunk DF ...
#> >Adding to track DFs ...
#> >Adding to tree DFs ...
#> Splitting rezrDFs into layers ...
#> A few finishing touches ...
#> Done!

The layerRegex object is a series of instructions that tells importRez() how to divide chunks and track chains into different layers. In this case, I placed a field called ‘trailLayer’ on track chains, which has three possible values: clausearg, discdeix, and nothing. These are captured in the regex field. The first two regexes correspond to the two names ‘clausearg’ and ‘discdeix’, and the default case, where neither of the first two regexes is detected, is ‘refexpr’. I have done the same thing to chunks, as you can see above. If you don’t want to use layers, you don’t have to specify layerRegex. In that case, I will create a single layer called ‘default’ for you.
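
The layer assignment described above can be pictured with a small base R sketch. This is only an illustration of the idea, not the code that importRez() actually runs: assign_layer() is a hypothetical helper, and the sketch assumes that an earlier regex takes priority over a later one when both match.

```r
# Hypothetical helper, not part of rezonateR: map field values to layer names.
# `regex` holds one pattern per named layer; the extra last name is the default.
assign_layer <- function(values, regex, names) {
  stopifnot(length(names) == length(regex) + 1)
  # Start with the default layer everywhere
  layer <- rep(names[length(names)], length(values))
  # Go through the patterns in reverse so earlier patterns overwrite later ones
  for (i in rev(seq_along(regex))) {
    layer[grepl(regex[i], values)] <- names[i]
  }
  layer
}

# Toy trailLayer values, including an empty string and a missing value
trailLayer <- c("clausearg", "", "discdeix", NA)
layers <- assign_layer(trailLayer,
                       regex = c("clausearg", "discdeix"),
                       names = c("clausearg", "discdeix", "refexpr"))
print(layers)
```

Entries that match neither pattern, including empty and missing values, fall into the default ‘refexpr’ layer in this sketch.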

The other mysterious argument of the importRez() function is concatFields. These are fields belonging to tokens that you would like to concatenate for higher-level entities like chunks and units. For example, if tokens 1 and 2 are ‘happy’ and ‘person’, and you have a chunk that contains these two tokens, you would want the whole string ‘happy person’ to be associated with the chunk. Typically, you should specify at least one field for doing this. In this case, we will concatenate the fields ‘word’ and ‘wordWylie’ (Wylie is the most common Romanisation system for Tibetan). It is important not to overdo it by specifying too many fields to concatenate, as this step can slow down your import considerably.
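
To make the ‘happy person’ example concrete, here is a minimal base R sketch of what concatenating a token field per chunk amounts to. The table and column names are made up for illustration, and this is not how importRez() is implemented internally.

```r
# Toy token table; the column names are illustrative, not rezonateR's own
tokens <- data.frame(
  chunk = c("c1", "c1", "c2"),
  word  = c("happy", "person", "today"),
  stringsAsFactors = FALSE
)

# Join the words of each chunk with a space, keeping token order
chunk_word <- vapply(split(tokens$word, tokens$chunk), paste,
                     character(1), collapse = " ")
print(chunk_word)
```

Each extra field listed in concatFields means one more pass of this kind for every higher-level table, which gives a rough sense of why long lists slow the import down.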

The result of the import is an object called rezrObj, which we will discuss below. When you import a more substantial file, say around 30 minutes, the import speed can be rather slow. Please be patient! The good news is that rezonateR contains functionality for saving and loading rezrObj objects, so you don’t have to import each time you work on a file in R.

savePath = "myRez.Rdata"
rez_save(myRez, savePath)
#> Saving rezrObj ...
myRez = rez_load(savePath)
#> Loading rezrObj ...

Introduction to rezrObjs and nodeMaps

rezrObjs

There are three main kinds of objects in rezonateR that you will interact with directly, namely rezrObj, nodeMap and rezrDF. rezrObjs and nodeMaps will be covered in this section. rezrDFs are relatively complex, and will form the bulk of our discussion in this vignette.

rezrObj objects contain one single nodeMap and several rezrDFs:

print("Items in myRez:")
#> [1] "Items in myRez:"
names(myRez)
#>  [1] "nodeMap"     "cardDF"      "chunkDF"     "docDF"       "entryDF"    
#>  [6] "linkDF"      "mergedDF"    "stackDF"     "tokenDF"     "trackDF"    
#> [11] "trailDF"     "treeDF"      "treeEntryDF" "treeLinkDF"  "unitDF"

The relationships between the various entities are as follows:

chunks and tree entries are built on top of tokens
entries correspond exactly to tokens, and units build on entries
track refers to both chunks and tokens

This has some ramifications for updating, which we’ll come to later.

nodeMaps

The nodeMap is similar to the internal representation of the file used in Rezonator-generated .rez files. The node map in a .rez file is a disorganised list of nodes, each of which corresponds to an entity inside Rezonator: units, tokens, and so on. The rezonateR nodeMap is similar. The major difference is that nodes are organised into sub-categories, according to the type of entity that the node is encoding. Let’s have a sneak peek at these categories:

print("Items in the nodeMap:")
#> [1] "Items in the nodeMap:"
names(myRez$nodeMap)
#>  [1] "token"     "entry"     "unit"      "track"     "chunk"     "card"     
#>  [7] "link"      "trail"     "stack"     "corpus"    "doc"       "treeEntry"
#> [13] "treeLink"  "tree"

In practice, you will not interact with most of these nodeMaps except token. Most of the time, you will only be dealing with rezrDFs, which are much easier to work with. Let’s take a look at them.

Introduction to rezrDFs

A rezrDF is like a normal data frame that you know from base R. Here’s the beginning of the unit table:

print(head(myRez$unitDF %>% select(id, unitSeq, srtLineBo)))
#> # A tibble: 6 × 3
#>   id            unitSeq srtLineBo              
#>   <chr>           <dbl> <chr>                  
#> 1 22C69D930B150       1 "བཀྲ་-"                 
#> 2 2456BC9E7C5D5       2 "བཀྲ་ཤིས་བདེ་ལེགས། "       
#> 3 201A180965346       3 "ཕེབས་གནང་། "           
#> 4 2BD385D3E52C7       4 "རྒན་ངག་དབང་ལགས་ཡིན་ནམ། "
#> 5 2B558CAD0F0B4       5 "ལགས་ཡིན། ང་"           
#> 6 1213A3F8E28E0       6 "ལགས་ཡིན། ང་"

chunkDF, trackDF, rezDF, stackDF, etc., are divided into layers. If you directly access the ‘chunkDF’ and ‘trackDF’ components of myRez, you will get a list of rezrDFs, one for each layer. Here are the names of our chunk layers that you might remember from the introduction, along with the beginning of one of the associated rezrDFs:

print("Component DFs of chunkDF:")
names(myRez$chunkDF)
print(head(myRez$chunkDF$refexpr) %>% select(id, word))

rezrDF inherits from the Tidyverse ‘tibble’ structure, so all tibble-related functions can be used with rezrDFs. However, using classic Tidyverse functions on rezrDFs is often dangerous, as rezrDFs have additional functionality that goes beyond classic tibbles.

There are three main differences that make rezrDF special:

Perk 1: Field access labels

Field access labels prevent you from accidentally changing things that you shouldn’t be changing. Let’s look at the field access values of the unitDF:

print("fieldaccess:")
#> [1] "fieldaccess:"
fieldaccess(myRez$unitDF)
#>               id              doc        unitStart          unitEnd 
#>            "key"           "core"           "core"           "core" 
#>          unitSeq              pID     srtLineStart        srtLineBo 
#>           "core"           "core"           "flex"           "flex" 
#>            iuEnd          iuStart        srtLineEn               iu 
#>           "flex"           "flex"           "flex"           "flex" 
#>        srtLineID           iuText          speaker       srtLineEnd 
#>           "flex"           "flex"           "flex"           "flex" 
#>             word        wordWylie docTokenSeqFirst  docTokenSeqLast 
#>        "foreign"        "foreign"        "foreign"        "foreign"

There are five possible field access values:

‘key’: The primary key of the table. You are not allowed to change it (unless you turn it into a non-key field, but this is not encouraged since you will basically break everything). If you try to update these fields using rezonateR functions, I will stop you with an error.
‘core’: Core fields, mostly generated by Rezonator. You can change them, but I will give you a warning if you do, because changing a core field has strong potential to break things.
‘flex’: Flexible fields, usually fields whose values you enter into Rezonator, though there are also flex fields automatically generated by Rezonator.
‘auto’: Fields whose values are automatically generated using information from the SAME rezrDF.
‘foreign’: Fields whose values are automatically generated using information from a DIFFERENT rezrDF (or several different rezrDFs, but this is an advanced feature we will stay away from in this vignette.)
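
The way access labels gate edits can be sketched in a few lines of base R. The set_field() helper below is hypothetical and only mimics the behaviour described in the list above (an error for ‘key’, a warning for ‘core’); rezonateR’s own checks live inside its editing functions and may differ in detail.

```r
# Hypothetical guard, loosely modelled on the behaviour described above
set_field <- function(df, access, field, values) {
  label <- access[[field]]
  if (label == "key") {
    stop("'", field, "' is a key field and cannot be changed.")
  }
  if (label == "core") {
    warning("'", field, "' is a core field; changing it may break things.")
  }
  df[[field]] <- values
  df
}

# Toy unit table with one access label per column
units  <- data.frame(id = c("u1", "u2"), unitSeq = 1:2, speaker = c("A", "B"),
                     stringsAsFactors = FALSE)
access <- c(id = "key", unitSeq = "core", speaker = "flex")

# Changing a flex field goes through silently
units <- set_field(units, access, "speaker", c("A", "C"))

# Changing the key is stopped with an error, which we catch here
key_result <- tryCatch(set_field(units, access, "id", c("x", "y")),
                       error = function(e) "blocked")
print(key_result)
```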

Perk 2: Reloads

The reload() function is one of the core features of rezonateR that makes it so convenient to use. The reload() feature is based on updateFunctions. You can access the updateFunctions of a table using updateFunct():

print("updateFunct of unitDF:")
#> [1] "updateFunct of unitDF:"
updateFunct(myRez$unitDF)
#> $word
#> function (df, rezrObj) 
#> updateLowerToHigher(df, rezrObj, address, fkeyAddress, action, 
#>     field, fkeyInDF, seqName)
#> <bytecode: 0x000001da3b578920>
#> <environment: 0x000001da3b5686a8>
#> attr(,"class")
#> [1] "updateFunction" "function"      
#> attr(,"deps")
#> [1] "entryDF/word"
#> 
#> $wordWylie
#> function (df, rezrObj) 
#> updateLowerToHigher(df, rezrObj, address, fkeyAddress, action, 
#>     field, fkeyInDF, seqName)
#> <bytecode: 0x000001da3b57f670>
#> <environment: 0x000001da3b578290>
#> attr(,"class")
#> [1] "updateFunction" "function"      
#> attr(,"deps")
#> [1] "entryDF/wordWylie"
#> 
#> $docTokenSeqFirst
#> function (df, rezrObj) 
#> updateLowerToHigher(df, rezrObj, address, fkeyAddress, action, 
#>     field, fkeyInDF, seqName)
#> <bytecode: 0x000001da3b57e1e0>
#> <environment: 0x000001da3b57efe0>
#> attr(,"class")
#> [1] "updateFunction" "function"      
#> attr(,"deps")
#> [1] "entryDF/docTokenSeq"
#> 
#> $docTokenSeqLast
#> function (df, rezrObj) 
#> updateLowerToHigher(df, rezrObj, address, fkeyAddress, action, 
#>     field, fkeyInDF, seqName)
#> <bytecode: 0x000001da3b572c00>
#> <environment: 0x000001da3b573a00>
#> attr(,"class")
#> [1] "updateFunction" "function"      
#> attr(,"deps")
#> [1] "entryDF/docTokenSeq"

There are three reload functions: reloadLocal(), reloadForeign() and reload(). reloadLocal() only takes a rezrDF, and only updates auto fields. reloadForeign() and reload() take a rezrDF and a rezrObj, and update the rezrDF using the rezrObj (which may or may not contain the rezrDF).

Let’s take a look at reload(), which is the most useful function of these. Here, in the original data, when there are zero mentions, only the orthographic representation is written as <0>; the Wylie romanisation is a blank string. I want to change the Wylie romanisation to also contain <0>s. I do that by using the rez_mutate() function on the tokenDF first (don’t worry about what that means yet; we’ll cover it later). After that, I reload the entryDF, and then reload the unitDF. (Recall that the units depend on entries which in turn depend on tokens; that’s why we can’t just reload the unitDF directly.) After the update, the unitDF is updated with <0>s appearing in the Wylie romanisation:

print("Before the update")
#> [1] "Before the update"
myRez$unitDF %>% filter(str_detect(word, "<0>")) %>% rez_select(id, word, wordWylie) %>% head
#> # A tibble: 6 × 3
#>   id            word                                           wordWylie        
#>   <chr>         <chr>                                          <chr>            
#> 1 201A180965346 <0> ཕེབས་ གནང་ །                                " phebs  gnang  …
#> 2 2BD385D3E52C7 <0> རྒན་ ངག་དབང་ ལགས་ ཡིན་ ནམ །                  " rgan  ngag dba…
#> 3 2B558CAD0F0B4 ལགས་ <0> ཡིན །                                  "lags   yin /"   
#> 4 1A5B143F0646A ལགས་ <0> <0> རེད །                              "lags    red /"  
#> 5 2414F2965B74A ང་ འདི ར་ སྤྱིར་བཏང་ <0> སློབ་སྦྱོང་ བྱེད་ གར་ ཡོང་བ་ ཡིན ། "nga  'di ra  sp…
#> 6 1FA517FFC43A  <0> སློབ་ཕྲུག་ ཡིན་ X X                             " slob phrug  yi…

#Change something in the token rezrDF that is significant for the unit rezrDF
myRez$tokenDF = myRez$tokenDF %>% rez_mutate(
  wordWylie = case_when(word == "<0>" ~ "<0>", TRUE ~ wordWylie))
myRez$entryDF = myRez$entryDF %>% reload(myRez)
myRez$unitDF = myRez$unitDF %>% reload(myRez)

print("After the update")
#> [1] "After the update"
myRez$unitDF %>% filter(str_detect(word, "<0>")) %>% rez_select(id, word, wordWylie) %>% head
#> # A tibble: 6 × 3
#>   id            word                                           wordWylie        
#>   <chr>         <chr>                                          <chr>            
#> 1 201A180965346 <0> ཕེབས་ གནང་ །                                <0> phebs  gnang…
#> 2 2BD385D3E52C7 <0> རྒན་ ངག་དབང་ ལགས་ ཡིན་ ནམ །                  <0> rgan  ngag d…
#> 3 2B558CAD0F0B4 ལགས་ <0> ཡིན །                                  lags  <0> yin /  
#> 4 1A5B143F0646A ལགས་ <0> <0> རེད །                              lags  <0> <0> re…
#> 5 2414F2965B74A ང་ འདི ར་ སྤྱིར་བཏང་ <0> སློབ་སྦྱོང་ བྱེད་ གར་ ཡོང་བ་ ཡིན ། nga  'di ra  spy…
#> 6 1FA517FFC43A  <0> སློབ་ཕྲུག་ ཡིན་ X X                             <0> slob phrug  …
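
Why the order of reloads matters can be seen in a toy model built from plain data frames. The refresh_entries() and refresh_units() functions below are hypothetical stand-ins for what reload() does for these two tables, under the assumption that entries copy values from tokens and units paste together the values of their entries.

```r
# Toy stand-ins for tokenDF, entryDF and unitDF (column names are illustrative)
tokens  <- data.frame(id = c("t1", "t2"), wordWylie = c("", "yin"),
                      stringsAsFactors = FALSE)
entries <- data.frame(id = c("e1", "e2"), token = c("t1", "t2"),
                      unit = c("u1", "u1"), wordWylie = c("", "yin"),
                      stringsAsFactors = FALSE)
units   <- data.frame(id = "u1", wordWylie = " yin", stringsAsFactors = FALSE)

# Hypothetical refresh steps: entries copy from tokens, units paste entries
refresh_entries <- function(entries, tokens) {
  entries$wordWylie <- tokens$wordWylie[match(entries$token, tokens$id)]
  entries
}
refresh_units <- function(units, entries) {
  joined <- vapply(split(entries$wordWylie, entries$unit), paste,
                   character(1), collapse = " ")
  units$wordWylie <- unname(joined[units$id])
  units
}

# Edit the lowest level: give the zero token a Wylie form
tokens$wordWylie[tokens$id == "t1"] <- "<0>"

# Refreshing units straight away still sees the old entries
stale <- refresh_units(units, entries)

# Refreshing entries first lets the change travel all the way up
entries <- refresh_entries(entries, tokens)
fresh   <- refresh_units(units, entries)

print(stale$wordWylie)
print(fresh$wordWylie)
```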

You might be wondering how to reload an entire rezrObj. Because tables often depend on each other (for example, field A in table X relies on field B in table Y which in turn relies on field C in table X), this is technically difficult, but I plan to add this function before the 1.0 release. Stay tuned!

Perk 3: Correspondences to nodeMaps

rezrDFs encode information about whether a field is in the nodeMap or not:

print("inNodeMap:")
#> [1] "inNodeMap:"
inNodeMap(myRez$unitDF)
#>               id              doc        unitStart          unitEnd 
#>            "key"        "primary"        "primary"        "primary" 
#>          unitSeq              pID     srtLineStart        srtLineBo 
#>        "primary"        "primary"         "tagmap"         "tagmap" 
#>            iuEnd          iuStart        srtLineEn               iu 
#>         "tagmap"         "tagmap"         "tagmap"         "tagmap" 
#>        srtLineID           iuText          speaker       srtLineEnd 
#>         "tagmap"         "tagmap"         "tagmap"         "tagmap" 
#>             word        wordWylie docTokenSeqFirst  docTokenSeqLast 
#>             "no"             "no"             "no"             "no"

This doesn’t do so much yet, since you’re not yet allowed to push a field created in a rezrDF back to a nodeMap. This will be available in the 1.0 release.

Editing rezrDFs

One of the core features of rezonateR is to facilitate the automatic and semi-automatic creation of fields, which is currently not supported in Rezonator. There are also other operations you may want to perform on rezrDFs.

To cater to users of different habits and skill levels, I have introduced three different levels of functions for editing rezrDFs.

EasyEdit can be quickly picked up by everyone, including base R users, and covers the most basic operations you would want to do to a rezrDF (e.g. addField(), changeFieldForeign())
TidyRez is easy to pick up for tidyverse users, though there is some learning curve for others (e.g. rez_mutate(), rez_left_join())
Core engine: These are mostly functions that I use within rezonateR under the hood. Users who want maximum flexibility may also use them (e.g. lowerToHigher(), createLeftJoinUpdate()), but do be aware that I may make changes to these without notice, since I will assume that most users have little use for them.

Crucially, while the functions within EasyEdit resemble one another in syntax, as do those within TidyRez, functions within the core engine are a lot more divergent, and EasyEdit and TidyRez also differ considerably from each other. So if you are comfortable with TidyRez, minimising the use of EasyEdit may make your code look more consistent, and vice versa.

EasyEdit

EasyEdit consists of four commonly used functions, addFieldLocal(), addFieldForeign(), changeFieldLocal() and changeFieldForeign(). There are also some less useful functions not covered in this vignette, but which you can find in the references, like addRow().

All of the four basic functions can be applied to both rezrDFs and rezrObjs. In this vignette, we will mainly apply them to rezrObjs. If you want to deal with emancipated rezrDFs, i.e. rezrDFs that are not part of a rezrObj, you will want to use the versions that apply to rezrDFs, but those are simpler than the rezrObj versions, so you should be able to pick them up quickly using the manual.

Let’s start by looking at addFieldLocal(). addField() is a shortcut for addFieldLocal(), and we will be using this shortcut name throughout. Our first example is very simple. In our tokenDF, let’s add a field that automatically calculates the length of a word in characters. Here, ‘entity’ specifies the name of the entity you would like to change, ‘layer’ specifies the layer within that entity (which is an empty string since there are no token layers), fieldName is the name of the field we’re adding, expression is the R expression with which we calculate the new field, and fieldaccess tells rezonateR to make this an auto field with an updateFunction that will be attached to the table:

myRez = addField(myRez, entity = "token", layer = "",
                 fieldName = "orthoLength",
                 expression = nchar(word),
                 fieldaccess = "auto")
print("A fragment of the updated table:")
#> [1] "A fragment of the updated table:"
head(myRez$tokenDF %>% rez_select(id, word, orthoLength))
#> # A tibble: 6 × 3
#>   id            word   orthoLength
#>   <chr>         <chr>        <int>
#> 1 2ECADE1029CD3 བཀྲ་-             5
#> 2 15A5089F6157A བཀྲ་ཤིས་           8
#> 3 197AA4A0C625F བདེ་ལེགས           8
#> 4 2C7746BD6F150 །                1
#> 5 354CBFB3632B6 <0>              3
#> 6 2D1DAD2FFF22A ཕེབས་             5
print("The updateFunction:")
#> [1] "The updateFunction:"
updateFunct(myRez$tokenDF, "orthoLength")
#> function (df) 
#> updateMutate(df, field, x)
#> <environment: 0x000001da3bdb8c18>
#> attr(,"class")
#> [1] "updateFunction" "function"      
#> attr(,"deps")
#> character(0)

Now let’s spice this up a bit by adding a complex field. A complex field takes information from multiple rows of a table. In this case, we are working with the tokenDF, but want the new column to be the length of the longest word in the unit that the token comes from. To do this, we set the groupField to ‘unit’ and specify the field type as ‘complex’. The expression uses longestLength(), a rezonateR function that returns the length of the longest word in a series of words.

myRez = addField(myRez, entity = "token", layer = "",
                 fieldName = "longestWordInUnit",
                 expression = longestLength(word),
                 type = "complex",
                 groupField = "unit",
                 fieldaccess = "auto")
head(myRez$tokenDF %>% select(id, word, longestWordInUnit))
#> # A tibble: 6 × 3
#>   id            word   longestWordInUnit
#>   <chr>         <chr>              <int>
#> 1 2ECADE1029CD3 བཀྲ་-                   5
#> 2 15A5089F6157A བཀྲ་ཤིས་                 8
#> 3 197AA4A0C625F བདེ་ལེགས                 8
#> 4 2C7746BD6F150 །                      8
#> 5 354CBFB3632B6 <0>                    5
#> 6 2D1DAD2FFF22A ཕེབས་                   5
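
Conceptually, a complex field is a grouped computation: the expression is evaluated once per group, and every row in the group receives the result. Here is a rough base R equivalent on toy data, using max(nchar()) in place of longestLength(); in dplyr terms this would be a group_by() followed by a mutate(). It only illustrates the idea and is not rezonateR’s implementation.

```r
# Toy token table with a unit column to group by
tokens <- data.frame(
  unit = c("u1", "u2", "u2", "u2"),
  word = c("hi", "good", "morning", "!"),
  stringsAsFactors = FALSE
)

# ave() applies FUN within each unit and returns one value per row
tokens$longestWordInUnit <- ave(nchar(tokens$word), tokens$unit, FUN = max)
print(tokens)
```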

Now let’s add a simple foreign field. Let’s say when we look at the tokenDF, we also want to know what the whole unit’s words are. The source is the ‘word’ field of units, and we are creating a new field for tokens called ‘unitWord’. The foreign key is the field that contains IDs of the source table inside the target table, in this case the ‘unit’ field of tokenDF:

myRez = addFieldForeign(myRez,
                targetEntity = "token", targetLayer = "",
                sourceEntity = "unit", sourceLayer = "",
                targetForeignKeyName = "unit",
                targetFieldName = "unitWord", sourceFieldName = "word",
                fieldaccess = "foreign")
head(myRez$tokenDF %>% select(id, word, unitWord))
#> # A tibble: 6 × 3
#>   id            word   unitWord       
#>   <chr>         <chr>  <chr>          
#> 1 2ECADE1029CD3 བཀྲ་-   བཀྲ་-           
#> 2 15A5089F6157A བཀྲ་ཤིས་ བཀྲ་ཤིས་ བདེ་ལེགས །
#> 3 197AA4A0C625F བདེ་ལེགས བཀྲ་ཤིས་ བདེ་ལེགས །
#> 4 2C7746BD6F150 །      བཀྲ་ཤིས་ བདེ་ལེགས །
#> 5 354CBFB3632B6 <0>    <0> ཕེབས་ གནང་ །
#> 6 2D1DAD2FFF22A ཕེབས་   <0> ཕེབས་ གནང་ །
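
A simple foreign field boils down to a lookup through the foreign key. The base R sketch below shows the idea on toy tables with made-up values; what addFieldForeign() adds on top of this is the bookkeeping (field access label and updateFunction) that makes later reloads possible.

```r
# Toy source table (units) and target table (tokens)
units  <- data.frame(id = c("u1", "u2"),
                     word = c("hello", "good morning"),
                     stringsAsFactors = FALSE)
tokens <- data.frame(id = c("t1", "t2", "t3"),
                     unit = c("u1", "u2", "u2"),   # foreign key into units
                     word = c("hello", "good", "morning"),
                     stringsAsFactors = FALSE)

# For each token, find the row of its unit and copy that unit's word
tokens$unitWord <- units$word[match(tokens$unit, units$id)]
print(tokens)
```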

Now let’s wrap it up with a complex foreign field. Here, we’re going to add a field in the unitDF that tells us the length of the shortest word within the unit. We’re going to base this off the entryDF.

However, because the entries that correspond to units are given in the nodeMap, you also need to supply the name of the list of entries inside the unit nodeMap, which here is entryList. Chunks and tree entries are built on tokens instead, so they have a list called tokenList. Instead of ‘expression’, complex foreign fields take an argument called complexAction, which is a function performed on the source field of the source table:

myRez = addFieldForeign(myRez,
                targetEntity = "unit", targetLayer = "",
                sourceEntity = "entry", sourceLayer = "",
                targetForeignKeyName = "entryList",
                targetFieldName = "shortestWordLength",
                sourceFieldName = "word",
                type = "complex",
                complexAction = shortestLength,
                fieldaccess = "foreign")
head(myRez$unitDF %>% select(id, word, shortestWordLength))
#> # A tibble: 6 × 3
#>   id            word                          shortestWordLength
#>   <chr>         <chr>                                      <int>
#> 1 22C69D930B150 བཀྲ་-                                           5
#> 2 2456BC9E7C5D5 བཀྲ་ཤིས་ བདེ་ལེགས །                                1
#> 3 201A180965346 <0> ཕེབས་ གནང་ །                                1
#> 4 2BD385D3E52C7 <0> རྒན་ ངག་དབང་ ལགས་ ཡིན་ ནམ །                  1
#> 5 2B558CAD0F0B4 ལགས་ <0> ཡིན །                                  1
#> 6 1213A3F8E28E0 ང་                                             2

(Because of the technicality that punctuation counts as a character, most of these values are 1. There are ways we can fix this using isWord conditions, as we’ll discuss below.)
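
The role of entryList can be illustrated with a toy list that maps each unit to the IDs of its entries. In the base R sketch below, the function passed to vapply() plays the part of complexAction: it receives the source values belonging to one unit and reduces them to a single number. The data and the use of min(nchar()) in place of shortestLength() are assumptions made for illustration.

```r
# Toy source table (entries)
entries <- data.frame(id = c("e1", "e2", "e3", "e4"),
                      word = c("good", "morning", "!", "hi"),
                      stringsAsFactors = FALSE)

# Toy stand-in for the entryList stored in the unit nodeMap
entryList <- list(u1 = c("e1", "e2", "e3"), u2 = "e4")

# For each unit, collect its entries' words and take the shortest length
shortest <- vapply(entryList, function(ids) {
  min(nchar(entries$word[match(ids, entries$id)]))
}, integer(1))
print(shortest)
```

As in the real output above, the punctuation entry drags the value for the first unit down to 1.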

So far we’ve only looked at addField(), but the good news is that changeField() works in exactly the same way! Here’s an example, changing our orthoLength field to depend on the Romanisation instead of the original Tibetan script:

myRez = changeField(myRez, entity = "token", layer = "",
                 fieldName = "orthoLength",
                 expression = nchar(wordWylie),
                 fieldaccess = "auto")
print("A fragment of the updated table:")
#> [1] "A fragment of the updated table:"
head(myRez$tokenDF %>% rez_select(id, word, orthoLength))
#> # A tibble: 6 × 3
#>   id            word   orthoLength
#>   <chr>         <chr>        <int>
#> 1 2ECADE1029CD3 བཀྲ་-             8
#> 2 15A5089F6157A བཀྲ་ཤིས་          10
#> 3 197AA4A0C625F བདེ་ལེགས           8
#> 4 2C7746BD6F150 །                1
#> 5 354CBFB3632B6 <0>              3
#> 6 2D1DAD2FFF22A ཕེབས་             6

Note that if you don’t specify the field access value, I will automatically change it to flex, even in changeField(). This is to force you to remember that you are not only changing the value of the field itself, but also how it will be updated in the future. If you don’t supply a field access value and it’s originally an auto or foreign field, I will warn you about this change, so you can run changeField() again if you want to change your mind.

TidyRez

In general, TidyRez functions are named by adding ‘rez_’ in front of a dplyr function name, such as rez_group_by() or rez_mutate(). Using TidyRez functions allows you to keep and/or update your field access values, inNodeMap values, and updateFunctions. Using base R or classic dplyr functions with rezrDFs may cause reloads to fail, unless supplemented by core engine functions, which are not covered in this vignette.

A few dplyr functions are completely safe to use in rezonateR, mostly those that focus on selecting rows of a table, such as filter(), arrange() or slice(). Currently implemented TidyRez functions include:

rez_add_row() for adding new entries
rez_mutate() for adding and editing columns
rez_rename() for renaming columns
rez_bind_rows() for combining rezrDFs vertically
rez_group_split() for splitting rezrDFs vertically
rez_group_by() and rez_ungroup() for grouping
rez_select() for selecting certain columns inside a rezrDF
rez_left_join() for left joins

A few other planned ones include rez_bind_cols() and rez_outer_join(), which will be especially useful for the calculation of inter-annotator agreement.

Because TidyRez is relatively straightforward for Tidyverse users, this vignette will focus on what TidyRez adds on top of Tidyverse. If you want to learn about basic Tidyverse, there are many existing tutorials on the Internet.

To see the power of TidyRez, let’s try creating an emancipated rezrDF with only a subset of the original columns. Here, we take trackDF$refexpr, the table of referential expressions. We then damage one of the fields using a classic dplyr function. As you can see here, the emancipated rezrDF can still be updated using the rezrObj, effectively overriding the damage:

refTable = myRez$trackDF$refexpr %>% rez_select(id, token, chain, name, word, tokenOrderLast)
print("Before:")
#> [1] "Before:"
head(refTable %>% select(id, tokenOrderLast))
#> # A tibble: 6 × 2
#>   id            tokenOrderLast
#>   <chr>                  <dbl>
#> 1 3D148B2FEEA8               1
#> 2 1CA978DCDE1DB              1
#> 3 14C5BE6658C39              4
#> 4 1FE8CC91923D1              2
#> 5 33A483E30E811              1
#> 6 280CA1A8A425               1
refTable = refTable %>% mutate(tokenOrderLast = 1) #Damage refTable with a classic dplyr function
print("After:")
#> [1] "After:"
refTable = refTable %>% reload(myRez)
head(refTable %>% select(id, tokenOrderLast))
#> # A tibble: 6 × 2
#>   id            tokenOrderLast
#>   <chr>                  <dbl>
#> 1 3D148B2FEEA8               1
#> 2 1CA978DCDE1DB              1
#> 3 14C5BE6658C39              4
#> 4 1FE8CC91923D1              2
#> 5 33A483E30E811              1
#> 6 280CA1A8A425               1

A warning is in order: TidyRez only updates the current table. If other tables have references to the table you’re editing, they will not be updated. You must bear this in mind when using rez_select() and rez_rename(). No problems will arise if you use these functions on emancipated rezrDFs. However, if you use these functions on rezrDFs within rezrObjs, you should manually update any fields in other rezrDFs that refer to the field you’ve deleted or added. I plan to add a rename feature to EasyEdit in the near future that will update references from other rezrDFs.

The syntax of most TidyRez functions deviates from dplyr only minimally, in ways that you can read about in the documentation. However, rez_left_join() is worth a quick mention. In addition to the fieldaccess and rezrObj arguments, which are self-explanatory, there are fkey and df2Address arguments. fkey is the name of the field in the first data frame that corresponds to IDs of the second data frame. df2Address is a string that tells rez_left_join() how to find the source rezrDF next time. If the source rezrDF doesn’t belong to a layer, e.g. tokenDF, just type its name. If the source rezrDF belongs to a layer, put a ‘/’ between the table and the layer, e.g. ‘trackDF/refexpr’.
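
To see how an address string of this shape can point at a table, here is a minimal base R sketch. resolve_address() is a hypothetical helper written for this illustration, and toyRez is a plain nested list rather than a real rezrObj; rezonateR’s own lookup may work differently.

```r
# Hypothetical helper: walk down a nested list following a '/'-separated path
resolve_address <- function(rezrObj, address) {
  parts <- strsplit(address, "/", fixed = TRUE)[[1]]
  Reduce(function(node, key) node[[key]], parts, rezrObj)
}

# Toy object: tokenDF has no layers, trackDF has one layer called refexpr
toyRez <- list(
  tokenDF = data.frame(id = "t1", stringsAsFactors = FALSE),
  trackDF = list(refexpr = data.frame(id = "r1", stringsAsFactors = FALSE))
)

print(resolve_address(toyRez, "tokenDF")$id)
print(resolve_address(toyRez, "trackDF/refexpr")$id)
```

Storing the address as a string rather than the table itself means the lookup always picks up the current version of the source table at reload time.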

An interlude: Time and sequence

Before we continue our adventure, let’s look at a couple of ways we can upgrade our rezrObj to contain even more information.

The first thing we can do, which we hinted at before, is to set certain tokens as non-words. You can do this with the addIsWordField() function. One immediate benefit of this is that we get a new set of sequence values. The tokenOrder and docTokenSeq fields in the original rezrDF count all tokens, whereas wordOrder and docWordSeq only count tokens that qualify as words according to some criterion. Let’s set our criterion to !str_detect(wordWylie, "/"), i.e. the token must not contain the main punctuation mark in Tibetan. Notice that non-word tokens get a value of 0 in the new fields, and that docWordSeq falls behind docTokenSeq once a punctuation mark has been passed:

myRez = addIsWordField(myRez, !str_detect(wordWylie, "/"))
head(myRez$tokenDF %>% select(id, tokenOrder, docTokenSeq, wordOrder, docWordSeq))
#> # A tibble: 6 × 5
#>   id            tokenOrder docTokenSeq wordOrder docWordSeq
#>   <chr>              <dbl>       <dbl>     <dbl>      <dbl>
#> 1 2ECADE1029CD3          1           1         1          1
#> 2 15A5089F6157A          1           2         1          2
#> 3 197AA4A0C625F          2           3         2          3
#> 4 2C7746BD6F150          3           4         0          0
#> 5 354CBFB3632B6          1           5         1          4
#> 6 2D1DAD2FFF22A          2           6         2          5
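
If you are curious how word-based sequence values relate to token-based ones, the following base R sketch reproduces the pattern in the table above on toy data. The data frame and its column values are invented for illustration, and this is only my guess at the logic, not the actual rezonateR implementation.

```r
# Toy token table (invented data): three units, one punctuation token "/"
tokens <- data.frame(
  unit  = c(1, 2, 2, 2, 3, 3),
  token = c("a", "b", "c", "/", "d", "e"),
  stringsAsFactors = FALSE
)

# A token counts as a word if it does not contain "/"
tokens$isWord <- !grepl("/", tokens$token, fixed = TRUE)

# Document-level sequences: all tokens vs. words only (non-words get 0)
tokens$docTokenSeq <- seq_len(nrow(tokens))
tokens$docWordSeq  <- ifelse(tokens$isWord, cumsum(tokens$isWord), 0)

# Unit-level word order: restart the count within each unit
tokens$wordOrder <- ifelse(
  tokens$isWord,
  ave(as.integer(tokens$isWord), tokens$unit, FUN = cumsum),
  0
)
```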

By default, unitSeq information is not available to rezrDFs other than unitDF. You can change this using the addUnitSeq() function, which can add unitSeq information up to track chains:

myRez = addUnitSeq(myRez, "track")

This adds unitSeqFirst and unitSeqLast fields to chunks and track chain entries, and a unitSeq field to tokens.

Updating rezonateR using external information

Some annotation actions are easier with a spreadsheet than in Rezonator, so one action you will frequently perform is to annotate in a spreadsheet programme and then integrate that information back into a rezrObj. Fortunately, rezonateR has functionality that facilitates this and minimises the errors generated in the process.

Let’s say we want to annotate the person of the referential expressions inside trackDF$refexpr. Before we start annotating manually, I wrote some simple rules that guess the person correctly in most situations, so we will only have to correct from this baseline:

myRez$trackDF$refexpr = myRez$trackDF$refexpr %>% rez_mutate(person = case_when(word == "ང" | str_starts(word, "ང་") ~ 1,
str_starts(word, "(ཁྱེད|ཁྱོད|ཇོ་ལགས|ཨ་ཅག་ལགས|རྒན་ལགས)") ~ 2,
str_ends(word, "(ལགས|<0>)") ~ 0, #Multiple likely scenarios
TRUE ~ 3)) #Default: third person

Before we export this as a CSV for annotation, I would like to add a column to the rezrDF that gives us the text of the entire unit. (Since this document currently does not have multi-unit track entries, it suffices to use either unitSeqLast or unitSeqFirst.) It will be useful to be able to see this column while making manual annotations:

myRez$trackDF$refexpr = myRez$trackDF$refexpr %>%
  rez_left_join(myRez$unitDF %>% rez_select(unitSeq, word), by = c(unitSeqLast = "unitSeq"), suffix = c("", "_unit"), df2key = "unitSeq", df2Address = "unitDF", fkey = "unitSeqLast") %>%
  rez_rename(unitLastWord = word_unit)
#> Tip: When performed on a rezrDF inside a rezrObj, rez_rename is a potentially destructive action. It is NOT recommended to assign it back to a rezrDF inside a rezrObj. If you must do so, be careful to update all addresses from other DFs to this DF.

The next step is to write the CSV file. rez_write_csv() allows us to do this easily. The third argument of rez_write_csv() is a vector of field names that we want to export. It is advisable to keep the number of exported fields small to make the spreadsheet more manageable and require less scrolling:

rez_write_csv(myRez$trackDF$refexpr, "refexpr.csv", c("id", "name", "unitLastWord", "unitSeqLast", "word", "docTokenSeqLast", "entityType", "roleType", "person"))

After editing the CSV in a spreadsheet program, let’s import it back using rez_read_csv(). (I’ve renamed the edited CSV - in general, I recommend doing this to avoid accidentally overwriting your edited file by running the export code again.) The origDF argument tells rezonateR to look in the original rezrDF that produced the CSV, and determine the data types accordingly:

changeDF = rez_read_csv("refexpr_edited.csv", origDF = myRez$trackDF$refexpr)

Finally, the updateFromDF() function allows us to update the original rezrDF using information from the new rezrDF. There are many fancy options you can choose from, such as deciding whether to delete rows, add rows, add columns, etc. We will only use the most vanilla options, and update the ‘person’ column:

myRez$trackDF$refexpr = myRez$trackDF$refexpr %>% updateFromDF(changeDF, changeCols = 'person')
#>      
#> "id" 
#> NULL
head(myRez$trackDF$refexpr %>% select(id, word, person))
#> # A tibble: 6 × 3
#>   id            word             person
#>   <chr>         <chr>             <dbl>
#> 1 3D148B2FEEA8  <0>                  NA
#> 2 1CA978DCDE1DB <0>                  NA
#> 3 14C5BE6658C39 རྒན་ ངག་དབང་ ལགས་     NA
#> 4 1FE8CC91923D1 <0>                  NA
#> 5 33A483E30E811 ང་                   NA
#> 6 280CA1A8A425  འདི་                  NA
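
For readers who want to see what this round trip boils down to, here is a rough base R sketch of the same idea: export a few columns, edit them, and copy one column back by matching on id. Everything here (the toy data, the file path, the simulated edit) is invented for illustration; rez_write_csv(), rez_read_csv() and updateFromDF() do considerably more than this, such as handling data types and field access.

```r
# Toy table standing in for a rezrDF (invented data)
orig <- data.frame(
  id     = c("a1", "b2", "c3"),
  word   = c("x", "y", "z"),
  person = c(1, 3, 0),
  stringsAsFactors = FALSE
)

# Export only the columns needed for annotation
csvPath <- tempfile(fileext = ".csv")
write.csv(orig[, c("id", "word", "person")], csvPath, row.names = FALSE)

# Read the file back; this assignment stands in for a manual spreadsheet edit
edited <- read.csv(csvPath, stringsAsFactors = FALSE)
edited$person[edited$id == "c3"] <- 3

# Update only the 'person' column, matching rows by id rather than by position
rowMatch <- match(orig$id, edited$id)
hasMatch <- !is.na(rowMatch)
orig$person[hasMatch] <- edited$person[rowMatch[hasMatch]]
```

Matching on id rather than row position means the update still works if the spreadsheet was sorted or filtered during annotation.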

Analysing track chains with EasyTrack

Now that we’ve looked at an example of semi-automatic annotation, let’s move on to some full automation! We will be looking in particular at coreference chains. rezonateR contains a suite of functions for generating features useful for analysing the choice of referential forms, reference comprehension, and similar topics.

Anaphoric and cataphoric distance

Let’s first find out how many units away we are from the previous mention of something. This corresponds to the gapUnits column that Rezonator already generates automatically:

myRez$trackDF$refexpr = myRez$trackDF$refexpr %>%
  rez_mutate(unitsToLastMention = unitsToLastMention(unitSeqLast))
myRez$trackDF$refexpr %>% select(id, gapUnits, unitsToLastMention) %>% slice(10:16)
#> # A tibble: 7 × 3
#>   id            gapUnits unitsToLastMention
#>   <chr>         <chr>                 <dbl>
#> 1 134D42FB97545 0                         1
#> 2 183DC7932D931 7                         7
#> 3 1373D1F88358  N/A                      NA
#> 4 2DBE5E8F59A6D 1                         1
#> 5 F20CE11F519F  1                         1
#> 6 1366768617    2                         2
#> 7 287C6AAAA209D N/A                      NA
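
Conceptually, this kind of distance is just the difference between a mention’s unit number and the unit number of the previous mention in the same chain. The base R sketch below shows that idea on invented data; it assumes rows are already in document order, and the real unitsToLastMention() may treat edge cases differently.

```r
# Toy mentions in document order (invented data): two chains, A and B
mentions <- data.frame(
  chain   = c("A", "B", "A", "A", "B"),
  unitSeq = c(1, 2, 3, 4, 9),
  stringsAsFactors = FALSE
)

# Within each chain, shift unit numbers down by one row to get the
# previous mention's unit; the first mention of a chain gets NA
prevUnit <- ave(
  mentions$unitSeq, mentions$chain,
  FUN = function(x) c(NA_real_, head(x, -1))
)

# Distance in units to the previous mention of the same chain
mentions$unitsToLast <- mentions$unitSeq - prevUnit
```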

Now let’s count the tokens from the last mention using the tokensToLastMention() function. This one has a couple of complications. The first is which sequence value to count. In the interlude, we mentioned that in addition to docTokenSeq, we have a sequence value called docWordSeq that excludes non-words. We will use that value in counting. The second complication is how to treat zero mentions. Zeros do not actually exist in the speech stream, so they have no time position to speak of. Here the zeroProtocol argument is ‘unitFinal’, which tells rezonateR to treat each zero as if it were located at the last word of the unit it belongs to. Finally, since this protocol relies on units, we need to pass the unitDF so that tokensToLastMention() has access to unit information:

myRez$trackDF$refexpr =  myRez$trackDF$refexpr %>%
  rez_mutate(wordsToLastMention = tokensToLastMention(
    docWordSeqLast, #What seq to use
    zeroProtocol = "unitFinal", #How to treat zeroes
    zeroCond = (word == "<0>"),
    unitDF = myRez$unitDF)) #Additional argument for unitFinal protocol
myRez$trackDF$refexpr %>% select(id, wordsToLastMention) %>% slice(10:16)
#> # A tibble: 7 × 2
#>   id            wordsToLastMention
#>   <chr>                      <dbl>
#> 1 134D42FB97545                  0
#> 2 183DC7932D931                 22
#> 3 1373D1F88358                  NA
#> 4 2DBE5E8F59A6D                  0
#> 5 F20CE11F519F                   0
#> 6 1366768617                     0
#> 7 287C6AAAA209D                 NA

Note that unitsToNextMention and tokensToNextMention work in the same way.

Tallying preceding and following mentions

We can also count how many previous mentions of something there were within a window of units. Most people use a window of five or 20 units. Let’s try this with five. The countPrevMentions() function allows us to do this (countNextMentions() does the same for the following context):

myRez$trackDF$refexpr = myRez$trackDF$refexpr %>% rez_mutate(noPrevMentionsIn5 = countPrevMentions(5))
myRez$trackDF$refexpr %>% select(id, noPrevMentionsIn5)  %>% slice(10:16)
#> # A tibble: 7 × 2
#>   id            noPrevMentionsIn5
#>   <chr>                     <int>
#> 1 134D42FB97545                 2
#> 2 183DC7932D931                 0
#> 3 1373D1F88358                  0
#> 4 2DBE5E8F59A6D                 1
#> 5 F20CE11F519F                  1
#> 6 1366768617                    2
#> 7 287C6AAAA209D                 0
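
Here is a small base R sketch of what a windowed count of previous mentions might look like under the hood. The function countPrevInWindow() and the data are made up for illustration, and I am assuming a mention counts if it is at most `window` units back; the real countPrevMentions() may draw the boundary differently.

```r
# Hypothetical helper (NOT part of rezonateR): for each mention, count
# earlier mentions of the same chain within `window` units.
countPrevInWindow <- function(chain, unitSeq, window) {
  vapply(seq_along(chain), function(i) {
    earlier <- seq_len(i - 1)  # rows before the current one
    sum(chain[earlier] == chain[i] &
          unitSeq[i] - unitSeq[earlier] <= window)
  }, integer(1))
}

# Toy mentions in document order (invented data)
chain   <- c("A", "B", "A", "A", "B")
unitSeq <- c(1, 2, 3, 4, 9)

prevIn5 <- countPrevInWindow(chain, unitSeq, window = 5)
```

In this toy example the last mention of chain B gets a count of zero, because its only earlier mention lies seven units back, outside the window.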

Sometimes, we may want to count mentions conditionally, e.g. only count subject mentions or zero mentions. The functions countPrevMentionsIf() and countNextMentionsIf() allow us to define such a condition. Let’s try counting the number of upcoming zero mentions. Here, we use the condition word == "<0>", i.e. the word is a zero, and the window is Inf, i.e. there’s no limit on how far into the future we look:

myRez$trackDF$refexpr = myRez$trackDF$refexpr %>% rez_mutate(noComingZeroes = countNextMentionsIf(Inf, word == "<0>"))
myRez$trackDF$refexpr %>% select(id, noComingZeroes)  %>% slice(10:16)
#> # A tibble: 7 × 2
#>   id            noComingZeroes
#>   <chr>                  <int>
#> 1 134D42FB97545              1
#> 2 183DC7932D931              9
#> 3 1373D1F88358               0
#> 4 2DBE5E8F59A6D              9
#> 5 F20CE11F519F               0
#> 6 1366768617                 8
#> 7 287C6AAAA209D              1
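
The conditional variant simply adds one more filter to the count. Below is a base R sketch of the forward-looking version on invented data; countNextIf() is a hypothetical stand-in I wrote for illustration, not the rezonateR function, and it takes the condition as a plain logical vector.

```r
# Hypothetical helper (NOT part of rezonateR): for each mention, count
# later mentions of the same chain that satisfy `cond`, within `window` units.
countNextIf <- function(chain, unitSeq, cond, window = Inf) {
  n <- length(chain)
  vapply(seq_len(n), function(i) {
    later <- if (i < n) (i + 1):n else integer(0)  # rows after the current one
    sum(chain[later] == chain[i] &
          unitSeq[later] - unitSeq[i] <= window &
          cond[later])
  }, integer(1))
}

# Toy mentions in document order (invented data)
chain   <- c("A", "B", "A", "A", "B")
unitSeq <- c(1, 2, 3, 4, 9)
word    <- c("she", "<0>", "<0>", "<0>", "he")

comingZeroes <- countNextIf(chain, unitSeq, word == "<0>")
```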

Counting competitors

We may also want to count competing mentions, that is, recent mentions not coreferential to the current mention. countCompetitors() tallies the number of competitors intervening between the previous and current mention, possibly within a window. Here is one example with no window:

myRez$trackDF$refexpr = myRez$trackDF$refexpr %>% rez_mutate(noCompetitors = countCompetitors())
myRez$trackDF$refexpr %>% select(id, noCompetitors)  %>% slice(10:16)
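
To illustrate the notion of a competitor, here is a base R sketch that counts, for each mention, the mentions of other chains that fall between it and the previous mention of the same chain. countCompetitorsSketch() and the data are invented for illustration; returning NA for first mentions is a choice I made for this sketch, and the real countCompetitors() may behave differently.

```r
# Hypothetical helper (NOT part of rezonateR): count mentions of OTHER
# chains between the previous mention of a chain and the current one.
countCompetitorsSketch <- function(chain) {
  vapply(seq_along(chain), function(i) {
    earlier  <- seq_len(i - 1)
    prevSame <- earlier[chain[earlier] == chain[i]]
    # No previous mention: there is no interval to look in
    if (length(prevSame) == 0) return(NA_integer_)
    between <- earlier[earlier > max(prevSame)]
    sum(chain[between] != chain[i])
  }, integer(1))
}

# Toy mentions in document order (invented data)
competitors <- countCompetitorsSketch(c("A", "B", "C", "A", "B"))
```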

All of the functions introduced in this section have additional arguments that allow for further customisation. Please refer to the manual for more information.

Adding tree information

Now let’s add some information from trees. The first thing to do is to run the getAllTreeCorrespondences() function, which adds a treeEntry column to non-tree tables. If you select entity = "track", this column will be added to tokenDF, chunkDF and trackDF.

myRez = getAllTreeCorrespondences(myRez, entity = "track")
myRez$trackDF$refexpr %>% select(id, treeEntry) %>% slice(10:16)
#> # A tibble: 7 × 2
#>   id            treeEntry    
#>   <chr>         <chr>        
#> 1 134D42FB97545 23E6B9A16C426
#> 2 183DC7932D931 253E6EE11F308
#> 3 1373D1F88358  1E1D73A61D4B0
#> 4 2DBE5E8F59A6D 5E28BE8ED183 
#> 5 F20CE11F519F  363329E4BBB94
#> 6 1366768617    281F8FC643F88
#> 7 287C6AAAA209D 364D86848B187

The best thing that trees do for us is connecting verb information, stored in chunks, to track chain entry (i.e. referential expression) information. We can do this in two steps. First we add a treeParent column to trackDF$refexpr that takes the value of the ‘parent’ column of treeEntryDF; in simple terms, this means we’re getting the parent tree entry’s ID into trackDF$refexpr. We then use this parent tree entry’s ID to find the corresponding verb chunk, and with this, we have successfully put the verb on the trackDF$refexpr table.

myRez = myRez %>%
  addFieldForeign("track", "refexpr", "treeEntry", "default", "treeEntry", "treeParent", "parent", fieldaccess = "foreign")
myRez$trackDF$refexpr = myRez$trackDF$refexpr %>%
  rez_left_join(myRez$chunkDF$verb %>% select(id, word, treeEntry),
                by = c(treeParent = "treeEntry"),
                suffix = c("", "_verb"),
                df2Address = "chunkDF/verb",
                fkey = "treeParent",
                df2key = "treeEntry",
                rezrObj = myRez) %>%
  rename(verbID = id_verb, verbWord = word_verb)
myRez$trackDF$refexpr %>% select(id, treeParent, verbID, verbWord) %>% slice(10:16)
#> # A tibble: 7 × 4
#>   id            treeParent    verbID        verbWord     
#>   <chr>         <chr>         <chr>         <chr>        
#> 1 134D42FB97545 16CA7768343C0 2CFB905E634DE རེད           
#> 2 183DC7932D931 2EA73E6E3BED4 30E122DAF5010 ཕེབས་པ་ ཡིན་ ནམ
#> 3 1373D1F88358  2EA73E6E3BED4 30E122DAF5010 ཕེབས་པ་ ཡིན་ ནམ
#> 4 2DBE5E8F59A6D 359A5ED8BE360 2CCFFAD41AD38 ཡོང་པ་ ཡིན     
#> 5 F20CE11F519F  359A5ED8BE360 2CCFFAD41AD38 ཡོང་པ་ ཡིན     
#> 6 1366768617    A4DB76F5C0EB  A4DC56F0B165  ཡོང་བ་ ཡིན     
#> 7 287C6AAAA209D A4DB76F5C0EB  A4DC56F0B165  ཡོང་བ་ ཡིན
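
The two-step lookup may be easier to follow on a tiny example. The base R sketch below uses three invented tables and match() to do the same thing in spirit: first fetch each referential expression’s parent tree entry, then find the verb chunk sitting at that parent. This is only meant to illustrate the logic, not how addFieldForeign() or rez_left_join() work internally.

```r
# Toy tables (invented data)
treeEntries <- data.frame(
  id     = c("t1", "t2", "t3"),
  parent = c("t3", "t3", NA),  # t1 and t2 are children of t3
  stringsAsFactors = FALSE
)
refexpr <- data.frame(
  id        = c("r1", "r2"),
  treeEntry = c("t1", "t2"),
  stringsAsFactors = FALSE
)
verbs <- data.frame(
  id        = "v1",
  word      = "came",
  treeEntry = "t3",
  stringsAsFactors = FALSE
)

# Step 1: bring the parent tree entry's ID into the refexpr table
refexpr$treeParent <- treeEntries$parent[match(refexpr$treeEntry, treeEntries$id)]

# Step 2: find the verb chunk whose tree entry is that parent
verbRow <- match(refexpr$treeParent, verbs$treeEntry)
refexpr$verbID   <- verbs$id[verbRow]
refexpr$verbWord <- verbs$word[verbRow]
```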

Advanced: Chunk mergers

The last topic to cover is merging chunks, which is most useful for creating multi-line chunks. There are several steps to merging chunks:

Create constituent chunks that together span the entire merged chunk.
Create a tree leaf that contains all tokens in the merged chunk, and put the leaf in a tree.
Use the mergeChunksWithTree() command in rezonateR to merge them.

mergeChunksWithTree() is very easy to use. After you call this command, the merged chunks will be added to the bottom of the corresponding chunk rezrDF. Chunk tags are taken from the first constituent chunk of each merger by default; see the manual for setting custom conditions. In addition, there will be a column called combinedChunk that tells you whether a chunk is a combined chunk, a member of a combined chunk, or neither.

myRez = mergeChunksWithTree(myRez)
myRez$chunkDF$refexpr %>% filter(combinedChunk != "") %>% select(id, name, word, combinedChunk) #Showing only combined chunks and their members
#> # A tibble: 6 × 4
#>   id            name        word                                         combi…¹
#>   <chr>         <chr>       <chr>                                        <chr>  
#> 1 51A902A7C6CD  Chunk 49    ཁོང་ ཚོ་ ནས་                                   |infom…
#> 2 2E2B9BB6462BC Chunk 51    དཔེ་སྐྲུན་ ཞུས་པ་ དེ་                              |membe…
#> 3 17842C4087863 Chunk 235   ཕྱི་རྒྱལ་ ནུབ་ཕྱོགས་པ འི་ མི་རིགས་ འདི་འདྲས་             |infom…
#> 4 EF4325E7D5A9  Chunk 81    བོད་སྐད་ ནང་ བྲིས་པ འི་ རྒྱལ་རབས་ ཀྱི་ དེབ་            |membe…
#> 5 6ru0c9hVoSLwZ New Chunk 1 ཁོང་ ཚོ་ ནས་ དཔེ་སྐྲུན་ ཞུས་པ་ དེ་                   combin…
#> 6 HpsUkPzuAeFcm New Chunk 2 ཕྱི་རྒྱལ་ ནུབ་ཕྱོགས་པ འི་ མི་རིགས་ འདི་འདྲས་ བོད་སྐད་ ནང་… combin…
#> # … with abbreviated variable name ¹combinedChunk

You may also augment the trackDF with the merged chunks; the combinedChunk column works similarly:

myRez = mergedChunksToTrack(myRez, "refexpr")

Where to go from here?

Now that you’ve seen the bare-bones basics of using rezonateR, if you want to dive in and start using it, you can proceed to our sequence of detailed tutorials starting from vignette("import_save_basics"). If you want to see a concrete example of a mini-project, take a look at vignette("sample_proj").