Editing I: Using EasyEdit • rezonateR

This vignette will use the file saved at the end of vignette("time_seq"). As always, you don’t have to have read that tutorial beforehand, though it may be helpful if you are new to rezonateR.

library(rezonateR)
path = system.file("extdata", "rez007_edit1.Rdata", package = "rezonateR", mustWork = T)
rez007 = rez_load(path)
#> Loading rezrObj ...

Editing rezrDFs: Some preliminaries

Before editing, let’s familiarise ourselves with some basic properties of rezrDFs to keep in mind when editing them.

Can you edit this?

As you know, editing data can be tricky. If you accidentally remove information you should not have, the results could be disastrous. Field access labels prevent you from accidentally changing things that you shouldn’t be changing. Let’s look at the field access values of the unitDF:

fieldaccess(rez007$unitDF)
#>                 id                doc          unitStart            unitEnd 
#>              "key"             "core"             "core"             "core" 
#>            unitSeq                pID             unitId          unitStart 
#>             "core"             "core"             "flex"             "flex" 
#>              docId            unitDur           pSentSeq   unitDurSkipPause 
#>             "flex"             "flex"             "flex"             "flex" 
#>            unitEnd unitStartSkipPause           sequence        participant 
#>             "flex"             "flex"             "flex"             "flex" 
#>            turnSeq               text         transcript   docTokenSeqFirst 
#>             "flex"          "foreign"          "foreign"          "foreign" 
#>    docTokenSeqLast    docWordSeqFirst     docWordSeqLast 
#>          "foreign"          "foreign"          "foreign"

There are five possible field access values:

key: The primary key of the table. You are not allowed to change it (unless you turn it into a non-key field, but this is not encouraged since you will basically break everything). If you try to update these fields using rezonateR functions, I will stop you with an error.
core: Core fields, mostly generated by Rezonator. You can change them, but I will give you a warning if you do, because changing a core field has strong potential to break things.
flex: Flexible fields, usually fields whose values you enter into Rezonator, though there are also flex fields automatically generated by Rezonator. If you add fields in rezonateR that you would like to manually correct later, setting it to flex is also a good idea.
auto: Fields whose values are automatically generated using information from the same rezrDF. This should be used for fields that do not need to be manually annotated or corrected.
foreign: Fields whose values are automatically generated using information from a different rezrDF. Fields like text and docTokenSeqFirst in the unitDF we’ve just seen, for example, come from the entryDF and are therefore foreign.
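
Judging from the printout above, fieldaccess() appears to return a named character vector, so ordinary base R subsetting is enough to list the fields of a given access type. The sketch below uses a hand-made stand-in vector: the object fa and its contents are illustrative and not taken from a real rezrDF.

```r
# Stand-in for the named character vector printed by fieldaccess();
# with real data you would presumably start from fieldaccess(rez007$unitDF).
fa <- c(id = "key", doc = "core", unitSeq = "core",
        participant = "flex", text = "foreign", docTokenSeqFirst = "foreign")

# Names of all foreign fields
foreignFields <- names(fa)[fa == "foreign"]
print(foreignFields)

# Field names grouped by access type
byAccess <- split(names(fa), fa)
print(byAccess)
```

Listing fields this way is a quick check on which columns will be recomputed automatically and which ones you are expected to maintain by hand.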

Update functions and reloads

Whenever you have auto and foreign fields in a table, that means you will want them to be automatically updated as your annotations progress. The reload() function is one of the core features of rezonateR and allows you to do this. The reload() feature calls functions called updateFunctions. You can access the updateFunctions of a table using updateFunct():

updateFunct(rez007$unitDF)
#> $text
#> function(df, rezrObj) updateLowerToHigher(df, rezrObj, address, fkeyAddress, action, field, fkeyInDF, seqName)
#> <environment: 0x00000204e7cc7968>
#> attr(,"class")
#> [1] "updateFunction" "function"      
#> attr(,"deps")
#> [1] "entryDF/text"
#> 
#> $transcript
#> function(df, rezrObj) updateLowerToHigher(df, rezrObj, address, fkeyAddress, action, field, fkeyInDF, seqName)
#> <environment: 0x00000204e7c21fe0>
#> attr(,"class")
#> [1] "updateFunction" "function"      
#> attr(,"deps")
#> [1] "entryDF/transcript"
#> 
#> $docTokenSeqFirst
#> function(df, rezrObj) updateLowerToHigher(df, rezrObj, address, fkeyAddress, action, field, fkeyInDF, seqName)
#> <environment: 0x00000204e7c253c0>
#> attr(,"class")
#> [1] "updateFunction" "function"      
#> attr(,"deps")
#> [1] "entryDF/docTokenSeq"
#> 
#> $docTokenSeqLast
#> function(df, rezrObj) updateLowerToHigher(df, rezrObj, address, fkeyAddress, action, field, fkeyInDF, seqName)
#> <environment: 0x00000204e7c22460>
#> attr(,"class")
#> [1] "updateFunction" "function"      
#> attr(,"deps")
#> [1] "entryDF/docTokenSeq"
#> 
#> $docWordSeqFirst
#> function(df, rezrObj) updateLowerToHigher(df, rezrObj, address, fkeyAddress, action, field, fkeyInDF, seqName)
#> <environment: 0x00000204e7c23420>
#> attr(,"class")
#> [1] "updateFunction" "function"      
#> attr(,"deps")
#> [1] "entryDF/docWordSeq"
#> 
#> $docWordSeqLast
#> function(df, rezrObj) updateLowerToHigher(df, rezrObj, address, fkeyAddress, action, field, fkeyInDF, seqName)
#> <environment: 0x00000204e7c0c2d0>
#> attr(,"class")
#> [1] "updateFunction" "function"      
#> attr(,"deps")
#> [1] "entryDF/docWordSeq"

There are three reload functions:

reloadLocal() only takes a rezrDF, and only updates auto fields.
reloadForeign() takes a rezrDF and a rezrObj, and updates the foreign fields of the rezrDF using the rezrObj (which may or may not contain the rezrDF).
reload() combines the two.
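
To build intuition for what a reload does, here is a toy model in base R. It is not how rezonateR implements reloads: the helper reloadToy() and the updateFuns list are hypothetical names made up for this sketch. Each automatically generated column is stored alongside a function that recomputes it from the rest of the table, and a "reload" simply re-runs those functions after an edit.

```r
# Toy table: 'len' plays the role of an auto field derived from 'text'
df <- data.frame(text = c("God", "said", "<0>"), stringsAsFactors = FALSE)

# Hypothetical registry of update functions, one per auto column
updateFuns <- list(len = function(d) nchar(d$text))

# Hypothetical helper: recompute every registered column from the current table
reloadToy <- function(d, funs) {
  for (field in names(funs)) d[[field]] <- funs[[field]](d)
  d
}

df <- reloadToy(df, updateFuns)   # len is computed from text
df$text[3] <- "0"                 # edit the underlying data; len is now stale
staleLen <- df$len
df <- reloadToy(df, updateFuns)   # the reload brings len back in sync
print(df)
```

The point of the sketch is only that derived columns go stale when their source changes, and that re-running the stored functions repairs them.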

Once we start editing fields, we will experience the power of reloads. Let’s now first take a look at how we’ll be editing …

The core four

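As you probably guessed from the title, this vignette covers the EasyEdit series of functions in rezonateR, which are simple but powerful functions for editing rezrDFs, and can be learnt even by users with no exposure to dplyr. EasyEdit consists of four core functions, along with a bunch of useful helpers. The four core functions are:

addFieldLocal(): adds a new field using information from the same rezrDF.
changeFieldLocal(): changes an existing field using information from the same rezrDF.
addFieldForeign(): adds a new field using information from a different rezrDF.
changeFieldForeign(): changes an existing field using information from a different rezrDF.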

The terms ‘local’ and ‘foreign’ are inspired by, but extended from, database terminology. They refer to the source of information you are drawing from to create or change the field. The two ‘local’ functions add or change fields using information from the current rezrDF, and the two ‘foreign’ ones add or change fields using information from other rezrDFs. The word ‘local’ can be dropped when you are using the local functions.

All four basic functions can be applied to both rezrDFs and rezrObjs. In general, whenever the rezrDF you are editing lives inside a rezrObj, it is safest to apply the function to the rezrObj itself. However, if you are working with an emancipated rezrDF - that is, a rezrDF stored in a variable outside of a rezrObj - then you will want to apply these functions to the single rezrDF. In practice, the main difference between the rezrDF and rezrObj versions is that the latter require you to specify the entity type and layer. This tutorial will mainly use the rezrObj versions; simply omit the entity and layer arguments when applying these functions to rezrDFs.

The change functions act in more or less the same way as the add functions, the only difference being that they work on an existing field instead of adding a new one. So our tutorial will mostly work with the add functions.

Staying local

Let’s start by looking at addFieldLocal() using a simple application: In our tokenDF, let’s add a field that automatically calculates the length of a word in characters.

In this function, entity specifies the name of the entity you would like to change, layer specifies the layer within that entity (which is an empty string since there are no token layers), fieldName is the name of the field we’re adding, expression is the R expression with which we calculate the new field, and fieldaccess tells rezonateR to make this an auto field with an updateFunction that will be attached to the table. Let’s try this, and look at both the results and the updateFunction:

rez007 = addField(rez007, entity = "token", layer = "",
                 fieldName = "orthoLength",
                 expression = nchar(text),
                 fieldaccess = "auto")
print("A fragment of the updated table:")
#> [1] "A fragment of the updated table:"
head(rez007$tokenDF %>% rez_select(id, text, orthoLength))
#> # A tibble: 6 × 3
#>   id            text  orthoLength
#>   <chr>         <chr>       <int>
#> 1 31F282855E95E (...)           5
#> 2 363C1D373B2F7 God             3
#> 3 3628E4BD4CC05 ,               1
#> 4 37EFCBECFD691 I               1
#> 5 12D67756890C1 said            4
#> 6 936363B71D59  I               1
print("The updateFunction:")
#> [1] "The updateFunction:"
updateFunct(rez007$tokenDF, "orthoLength")
#> function (df) 
#> updateMutate(df, field, x)
#> <environment: 0x00000204e6df5630>
#> attr(,"class")
#> [1] "updateFunction" "function"      
#> attr(,"deps")
#> character(0)

You might notice that (...) has an orthoLength of 5. What if we decide that we don’t want to count these non-words? One feasible solution is to use isWord, which we added in vignette("time_seq"): if a token is a word, then we set orthoLength to the number of characters in the text column as before; if not, we set it to 0.

Although EasyEdit functions do not require users to use Tidyverse functions, I still suggest that the Tidyverse function dplyr::case_when() is the best for this purpose, and it can be easily combined with EasyEdit functions. It allows you to create a vector whose values are calculated differently depending on certain conditions. The syntax of dplyr::case_when() is simple: each argument of the function is a condition ~ value pair, and if you want an ‘else’ statement, simply use T as the condition in the last condition-value pair. In this case, we can use this function to create a vector that contains an empty string when a token is not a word, and the text of the token when it is a word:

rez007 = changeField(rez007, entity = "token", layer = "",
                 fieldName = "orthoLength",
                 expression = nchar(case_when(isWord ~ text, T ~ "")),
                 fieldaccess = "auto")
print("A fragment of the updated table:")
#> [1] "A fragment of the updated table:"
head(rez007$tokenDF %>% rez_select(id, text, orthoLength))
#> # A tibble: 6 × 3
#>   id            text  orthoLength
#>   <chr>         <chr>       <int>
#> 1 31F282855E95E (...)           0
#> 2 363C1D373B2F7 God             3
#> 3 3628E4BD4CC05 ,               0
#> 4 37EFCBECFD691 I               1
#> 5 12D67756890C1 said            4
#> 6 936363B71D59  I               1

Notice that (...) now has an orthoLength of 0.
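
If you would rather not load dplyr, a similar conditional can be written with base R's ifelse(). The sketch below runs on toy vectors standing in for the text and isWord columns (the isWord values are assumed for illustration). It has not been tested inside changeField() here, so treat it as an outline of the logic only.

```r
# Toy stand-ins for the text and isWord columns of the tokenDF
text   <- c("(...)", "God", ",", "I", "said", "I")
isWord <- c(FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)

# Keep the text for words, use an empty string otherwise, then count characters
orthoLength <- nchar(ifelse(isWord, text, ""))
print(orthoLength)
```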

Now let’s spice this up a bit by adding a complex field. A complex field takes information from multiple rows of a table. Let’s say we are working with the tokenDF, but want the new column to be the length of the longest token that appears in the unit that the token comes from. In this case, we set the groupField argument, which we haven’t seen before, to "unit", and we specify the field type as "complex". The expression uses longestLength(), a rezonateR function that returns the length of the longest token in a series of tokens.

rez007 = addField(rez007, entity = "token", layer = "",
                 fieldName = "longestWordInUnit",
                 expression = longestLength(text),
                 type = "complex",
                 groupField = "unit",
                 fieldaccess = "auto")
head(rez007$tokenDF %>% select(id, text, longestWordInUnit))
#> # A tibble: 6 × 3
#>   id            text  longestWordInUnit
#>   <chr>         <chr>             <int>
#> 1 31F282855E95E (...)                 5
#> 2 363C1D373B2F7 God                   5
#> 3 3628E4BD4CC05 ,                     5
#> 4 37EFCBECFD691 I                     7
#> 5 12D67756890C1 said                  7
#> 6 936363B71D59  I                     7

longestLength() belongs to a small collection of functions useful for extracting information from a bunch of strings:

shortestLength(): Find the shortest token’s length within the group.
longestLength(): Find the longest token’s length.
shortest(): Get the shortest token’s text.
longest(): Get the longest token’s text.
concatenateAll(): Concatenate all the tokens together.
inLength(): Gives the size of the group (may be used with non-strings), possibly with isWord information.

Some base R functions that might be useful for numeric values include max(), min(), range(), mean(), etc.
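
To make the grouping behind a complex field concrete, here is a base R sketch of the same idea on a toy table. It uses ave() rather than any rezonateR machinery, and the table and its unit IDs are made up for illustration.

```r
# Toy token table: two units, three tokens each
tokens <- data.frame(
  unit = c("u1", "u1", "u1", "u2", "u2", "u2"),
  text = c("(...)", "God", ",", "I", "said", "anymore"),
  stringsAsFactors = FALSE
)

# ave() applies max() within each unit and repeats the result on every row
tokens$longestWordInUnit <- ave(nchar(tokens$text), tokens$unit, FUN = max)
print(tokens)
```

Every row in a unit receives the same value, which is the pattern visible in the longestWordInUnit output above.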

Note that both times we added a field, we’ve set the field access to auto. If you do not set the field access, I will automatically set it to flex, which means that the new column will not be affected by reloads.

Going foreign

Now let’s add a simple foreign field. Let’s say when we look at the tokenDF, we also want to know what the whole unit’s words are. (This will come in handy when we want to do external editing!)

The trickiest part of addFieldForeign is keeping track of the source we’re getting information from, and the target we’re aiming to add information to. We need to know:

The source of information that our new field is created from: this means we need to know sourceEntity, sourceLayer, sourceFieldName.
The location of our new target field: this means we need targetEntity, targetLayer, targetFieldName.
How to link the source and target tables. This is handled by the argument targetForeignKeyName. We need to give the name of the column containing IDs of the source table inside the target table, i.e. the column of the target table that tells us which row of the source table to look at. In this case, that is the unit field of tokenDF.

In our specific example:

The source is the ‘text’ field of units, so we set sourceEntity to "unit" and sourceLayer to the empty string, and sourceFieldName to "text".
The target is the ‘unitText’ field of tokens, so we set targetEntity to "token" and targetLayer to the empty string, and targetFieldName to "unitText".
The foreign key, targetForeignKeyName, is the unit field of tokenDF.

Let’s put it to practice:

rez007 = addFieldForeign(rez007,
                targetEntity = "token", targetLayer = "",
                sourceEntity = "unit", sourceLayer = "",
                targetForeignKeyName = "unit",
                targetFieldName = "unitText", sourceFieldName = "text",
                fieldaccess = "foreign")
head(rez007$tokenDF %>% select(id, text, unitText))
#> # A tibble: 6 × 3
#>   id            text  unitText                                 
#>   <chr>         <chr> <chr>                                    
#> 1 31F282855E95E (...) (...) God ,                              
#> 2 363C1D373B2F7 God   (...) God ,                              
#> 3 3628E4BD4CC05 ,     (...) God ,                              
#> 4 37EFCBECFD691 I     I said I was n't gon na do this anymore .
#> 5 12D67756890C1 said  I said I was n't gon na do this anymore .
#> 6 936363B71D59  I     I said I was n't gon na do this anymore .
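
Conceptually, a simple foreign field is a lookup through a foreign key. The base R sketch below mimics this with match() on two toy tables. The IDs are invented, and this is an illustration of the idea rather than what addFieldForeign() does internally.

```r
# Source table: units
units <- data.frame(
  id   = c("u1", "u2"),
  text = c("(...) God ,", "I said I was n't gon na do this anymore ."),
  stringsAsFactors = FALSE
)

# Target table: tokens, with 'unit' as the foreign key into units$id
tokens <- data.frame(
  id   = c("t1", "t2", "t3", "t4"),
  unit = c("u1", "u1", "u2", "u2"),
  text = c("(...)", "God", "I", "said"),
  stringsAsFactors = FALSE
)

# For each token, find the row of its unit and copy that unit's text
tokens$unitText <- units$text[match(tokens$unit, units$id)]
print(tokens)
```

Here tokens$unit plays the role of targetForeignKeyName, units$text is the source field, and unitText is the target field.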

Like its local counterpart, addFieldForeign() also has a complex flavour, i.e. we can draw from multiple rows of a different table. This is probably the hardest part of this tutorial, so buckle up!

Here, we’re going to add a field in the unitDF that tells us the average length of words within the unit. We’re going to base this off the entryDF.

This time, targetForeignKeyName works a bit differently. Because the entries that correspond to each unit are given in the nodeMap, you also need to supply the list of entries inside a unit node - that is, entryList, as you may recall from vignette("import_save_basics").

addFieldForeign() has an argument called complexAction, a function applied to the source field of the source table. It can be any aggregating function (including the longestLength() series of functions that we have seen before). In this instance, we take the mean() of the character counts given by nchar():

rez007 = addFieldForeign(rez007,
                targetEntity = "unit", targetLayer = "",
                sourceEntity = "entry", sourceLayer = "",
                targetForeignKeyName = "entryList",
                targetFieldName = "averageWordLength",
                sourceFieldName = "text",
                type = "complex",
                complexAction = function(x) mean(nchar(x)),
                fieldaccess = "foreign")
head(rez007$unitDF %>% select(id, text, averageWordLength))
#> # A tibble: 6 × 3
#>   id            text                                                     avera…¹
#>   <chr>         <chr>                                                      <dbl>
#> 1 2AD10A854E6D3 (...) God ,                                                 3   
#> 2 BDD7D839325A  I said I was n't gon na do this anymore .                   2.82
#> 3 2752E3B395FC1 (...) <0> Stay up late .                                    3.17
#> 4 8487A33D1DF2  (...) <0> Kinda defeats the purpose of getting up in th…    4.15
#> 5 107F655C3299D (...) I know .                                              2.75
#> 6 307808364906D (.) And it 's a hard habit to break .                       2.8 
#> # … with abbreviated variable name ¹averageWordLength
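
The logic of a complex foreign field can be sketched in base R as "for each target row, collect the source rows listed in its key, then aggregate". The toy tables below are built by hand with invented IDs, using the tokens of the first two units shown above. How rezonateR actually resolves entryList through the nodeMap is not reproduced here.

```r
# Source table: one row per entry
entries <- data.frame(
  id   = paste0("e", 1:14),
  text = c("(...)", "God", ",",
           "I", "said", "I", "was", "n't", "gon", "na", "do", "this", "anymore", "."),
  stringsAsFactors = FALSE
)

# Target table: one row per unit, with a list column of entry IDs
units <- data.frame(id = c("u1", "u2"), stringsAsFactors = FALSE)
units$entryList <- list(paste0("e", 1:3), paste0("e", 4:14))

# Same aggregating function as in the call above
complexAction <- function(x) mean(nchar(x))

# For each unit, look up its entries and aggregate their text
units$averageWordLength <- sapply(units$entryList, function(ids) {
  complexAction(entries$text[match(ids, entries$id)])
})
print(round(units$averageWordLength, 2))
```

The two values agree with the first two rows of the averageWordLength column printed above.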

Reloads revisited

Having created a bunch of auto fields, naturally we will want to try out our reloads! Let’s try replacing the zero sign <0> with ∅ in the text column, which is more commonly used in linguistics papers. After doing this, we can then reload unitDF to look at the impact on our freshly created averageWordLength. Notice that we have to reload rezrDFs in order: first the entryDF, using information from tokenDF, then the unitDF, using information from entryDF (please be patient if running this on your computer, since reloads can take time):

#unitDF before the update
rez007$unitDF %>% filter(str_detect(text, "<0>")) %>% rez_select(id, text, averageWordLength) %>% head
#> # A tibble: 6 × 3
#>   id            text                                                     avera…¹
#>   <chr>         <chr>                                                      <dbl>
#> 1 2752E3B395FC1 (...) <0> Stay up late .                                    3.17
#> 2 8487A33D1DF2  (...) <0> Kinda defeats the purpose of getting up in th…    4.15
#> 3 786FB0DC8416  (...) I wan na <0> spend time with Ron ,                    3.1 
#> 4 3171F4905628A (.) If we sit down and <0> set some rules ,                 3   
#> 5 37BE6893BC78E (H) <0> going to (...) our parents ,                        3.62
#> 6 7B0D1EF95CEF  and (...) <0> complaining about one another ,               4.75
#> # … with abbreviated variable name ¹averageWordLength

#Change the zero format
rez007$tokenDF = changeFieldLocal(rez007$tokenDF,
                                  fieldName = "text",
                                  expression = case_when(text == "<0>" ~ "∅", T ~ text))
rez007$entryDF = rez007$entryDF %>% reload(rez007)
rez007$unitDF = rez007$unitDF %>% reload(rez007)

#unitDF after the update
rez007$unitDF %>% filter(str_detect(text, "∅")) %>% rez_select(id, text, averageWordLength) %>% head
#> # A tibble: 6 × 3
#>   id            text                                                     avera…¹
#>   <chr>         <chr>                                                      <dbl>
#> 1 2752E3B395FC1 (...) ∅ Stay up late .                                      2.83
#> 2 8487A33D1DF2  (...) ∅ Kinda defeats the purpose of getting up in the …    4   
#> 3 786FB0DC8416  (...) I wan na ∅ spend time with Ron ,                      2.9 
#> 4 3171F4905628A (.) If we sit down and ∅ set some rules ,                   2.82
#> 5 37BE6893BC78E (H) ∅ going to (...) our parents ,                          3.38
#> 6 7B0D1EF95CEF  and (...) ∅ complaining about one another ,                 4.5 
#> # … with abbreviated variable name ¹averageWordLength
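
The change in the first row can be checked by hand: the three-character token <0> becomes a single character, so the unit loses two characters over six tokens. A small base R verification on that unit's tokens (typed in manually here):

```r
# Tokens of the first unit before and after the replacement;
# "\u2205" is the escape for the empty-set sign used as the new zero
before <- c("(...)", "<0>", "Stay", "up", "late", ".")
after  <- c("(...)", "\u2205", "Stay", "up", "late", ".")

meanBefore <- mean(nchar(before))   # 19 characters over 6 tokens
meanAfter  <- mean(nchar(after))    # 17 characters over 6 tokens
print(round(c(meanBefore, meanAfter), 2))
```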

Dealing with categorical variables

The tidyverse package forcats is much more powerful for dealing with categorical variables, but for those of us who don’t want to learn an entirely new package, rezonateR provides a few easy ways to deal with categories.

mergeCats() allows you to merge two categories. It takes a vector, normally a column, as the first argument. Subsequently, the name of each argument is a new category, and the value of each argument is a vector of names of old categories that the new category will encompass (as character values, even if the original column contains factors).

For example, the Santa Barbara Corpus categorises laughter as separate from other vocalisms. If you want to merge "Laugh" into "Vocalism", keeping everything else, then you can use this code:

#Laughter tokens before
head(rez007$tokenDF %>% select(id, doc, unit, text, kind) %>% filter(str_detect(text, "@")))
#> # A tibble: 6 × 5
#>   id            doc    unit          text  kind 
#>   <chr>         <chr>  <chr>         <chr> <chr>
#> 1 280154406371  sbc007 1E534FE7565B6 @     Laugh
#> 2 8C04EB23AE43  sbc007 274331D2D2283 @@    Laugh
#> 3 221FD665A9FA4 sbc007 EFCC48B9EDFA  @@@@@ Laugh
#> 4 15B804DDB7C66 sbc007 3759142333BAE @@@@@ Laugh
#> 5 261421712169B sbc007 A744F59296C2  @     Laugh
#> 6 2D0821FE951A4 sbc007 273B241320492 @@@   Laugh

rez007 = changeField(rez007, entity = "token", layer = "",
                 fieldName = "kind",
                 expression = mergeCats(kind, Vocalism = c("Laugh", "Vocalism"))) 
#Laughter tokens after
head(rez007$tokenDF %>% select(id, doc, unit, text, kind) %>% filter(str_detect(text, "@")))
#> # A tibble: 6 × 5
#>   id            doc    unit          text  kind    
#>   <chr>         <chr>  <chr>         <chr> <chr>   
#> 1 280154406371  sbc007 1E534FE7565B6 @     Vocalism
#> 2 8C04EB23AE43  sbc007 274331D2D2283 @@    Vocalism
#> 3 221FD665A9FA4 sbc007 EFCC48B9EDFA  @@@@@ Vocalism
#> 4 15B804DDB7C66 sbc007 3759142333BAE @@@@@ Vocalism
#> 5 261421712169B sbc007 A744F59296C2  @     Vocalism
#> 6 2D0821FE951A4 sbc007 273B241320492 @@@   Vocalism
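
For intuition about what the merge does to the column, here is a base R sketch on a toy character vector. It is not the implementation of mergeCats(), and the category names other than "Laugh" and "Vocalism" are assumed for illustration.

```r
# Toy stand-in for the kind column
kind <- c("Word", "Laugh", "Vocalism", "Punctuation", "Laugh")

# Every old category listed on the right is relabelled with the new name;
# all other values are left untouched
merged <- kind
merged[merged %in% c("Laugh", "Vocalism")] <- "Vocalism"
print(table(merged))
```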

renameCats() has identical syntax. For example, if you want to rename Vocalism further to Voc:

#Breath tokens before
head(rez007$tokenDF %>% select(id, doc, unit, text, kind) %>% filter(str_detect(text, "\\(H\\)")))
#> # A tibble: 6 × 5
#>   id            doc    unit          text  kind    
#>   <chr>         <chr>  <chr>         <chr> <chr>   
#> 1 B7392EC0FE18  sbc007 1EFAE7BD41500 (H)   Vocalism
#> 2 34038223F0E2  sbc007 EDCE0A20D5B3  (H)   Vocalism
#> 3 2C6A695AAE730 sbc007 3759142333BAE (H)   Vocalism
#> 4 1094B3337D73D sbc007 1F858888540F6 (H)   Vocalism
#> 5 3127D3EC0E7C5 sbc007 1017EC66C38EB (H)   Vocalism
#> 6 2EA63A8716B28 sbc007 10B9591B876F8 (H)   Vocalism

rez007 = changeField(rez007, entity = "token", layer = "",
                 fieldName = "kind",
                 expression = renameCats(kind, Voc = "Vocalism")) 
#Breath tokens after
head(rez007$tokenDF %>% select(id, doc, unit, text, kind) %>% filter(str_detect(text, "\\(H\\)")))
#> # A tibble: 6 × 5
#>   id            doc    unit          text  kind 
#>   <chr>         <chr>  <chr>         <chr> <chr>
#> 1 B7392EC0FE18  sbc007 1EFAE7BD41500 (H)   Voc  
#> 2 34038223F0E2  sbc007 EDCE0A20D5B3  (H)   Voc  
#> 3 2C6A695AAE730 sbc007 3759142333BAE (H)   Voc  
#> 4 1094B3337D73D sbc007 1F858888540F6 (H)   Voc  
#> 5 3127D3EC0E7C5 sbc007 1017EC66C38EB (H)   Voc  
#> 6 2EA63A8716B28 sbc007 10B9591B876F8 (H)   Voc

Adding rows

Adding rows is not an operation you will do all the time. In general, it is recommended to just re-import the whole thing and run all the code again. In the occasional situation where you do have to add a row, addRow() comes in handy.

You only need to add core and flex fields when you add a row. An ID will be automatically generated, and the foreign/auto fields will be automatically added. After specifying the rezrDF or rezrObj with entity and layer, each argument name is a column name, and its value is the value in the column.

rez007 = addRow(rez007, "trail", "default",
                doc = "sbc007",
                chainCreateSeq = max(rez007$trailDF$default$chainCreateSeq) + 1,
                name = "Danae",
                chainSize = 1)
tail(rez007$trailDF$default)
#> # A tibble: 6 × 6
#>   id            doc    chainCreateSeq name                      chainSize layer 
#>   <chr>         <chr>           <dbl> <chr>                         <dbl> <chr> 
#> 1 E57D22CC7190  sbc007             16 time (spending)                   1 defau…
#> 2 2E79208C27DC2 sbc007             17 Ron's work                        1 defau…
#> 3 21B1DDD41EFB0 sbc007             50 impersonal you                    3 defau…
#> 4 368B8B1D9E62C sbc007             56 up there                          1 defau…
#> 5 251AE6A659F6F sbc007              3 the purpose of getting up         1 defau…
#> 6 Sg8GbsAlUW0LD sbc007             88 Danae                             1 defau…
#Note: chainSize is currently flex as it is supplied by Rezonator and not calculated by rezonateR, but this may change in the future.

Onwards!

The next tutorial, vignette("edit_tidyRez"), will be relatively short, because it will assume familiarity with the Tidyverse package dplyr. If you would like to do something that goes beyond the capabilities of EasyEdit, it is recommended that you familiarise yourself with the relevant dplyr function first, and then read the TidyRez vignette. If you are not familiar with Tidyverse and have no intention to learn it yet, you may elect to skip the next tutorial and go straight to vignette("edit_external").

And lest we forget, always save!

savePath = "rez007.Rdata"
rez_save(rez007, savePath)
#> Saving rezrObj ...