Editing I: Using EasyEdit
edit_easyEdit.Rmd
This vignette uses the file saved at the end of vignette("time_seq"). As always, you don’t have to have read that tutorial beforehand, though it may be helpful if you are new to rezonateR.
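library(rezonateR)
library(dplyr)    # %>%, select(), filter() and case_when() are used below
library(stringr)  # str_detect() is used below
path = system.file("extdata", "rez007_edit1.Rdata", package = "rezonateR", mustWork = TRUE)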
rez007 = rez_load(path)
#> Loading rezrObj ...
Editing rezrDFs: Some preliminaries
Before editing, let’s familiarise ourselves with some basic properties of rezrDFs to keep in mind when editing them.
Can you edit this?
As you know, editing data can be tricky. If you accidentally remove information you should not have, the results could be disastrous. Field access labels prevent you from accidentally changing things that you shouldn’t be changing. Let’s look at the field access values of the unitDF:
fieldaccess(rez007$unitDF)
#> id doc unitStart unitEnd
#> "key" "core" "core" "core"
#> unitSeq pID unitId unitStart
#> "core" "core" "flex" "flex"
#> docId unitDur pSentSeq unitDurSkipPause
#> "flex" "flex" "flex" "flex"
#> unitEnd unitStartSkipPause sequence participant
#> "flex" "flex" "flex" "flex"
#> turnSeq text transcript docTokenSeqFirst
#> "flex" "foreign" "foreign" "foreign"
#> docTokenSeqLast docWordSeqFirst docWordSeqLast
#> "foreign" "foreign" "foreign"
There are five possible field access values:
- key: The primary key of the table. You are not allowed to change it (unless you turn it into a non-key field, but this is not encouraged since you will basically break everything). If you try to update these fields using rezonateR functions, I will stop you with an error.
- core: Core fields, mostly generated by Rezonator. You can change them, but I will give you a warning if you do, because changing a core field has strong potential to break things.
- flex: Flexible fields, usually fields whose values you enter into Rezonator, though there are also flex fields automatically generated by Rezonator. If you add fields in rezonateR that you would like to manually correct later, setting them to flex is also a good idea.
- auto: Fields whose values are automatically generated using information from the same rezrDF. This should be used for fields that do not need to be manually annotated or corrected.
- foreign: Fields whose values are automatically generated using information from a different rezrDF. Fields like text and docTokenSeqFirst in the unitDF we’ve just seen, for example, come from the entryDF and are therefore foreign.
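To make the key/core behaviour described above concrete, here is a toy sketch in base R. It is not rezonateR’s implementation: check_edit() and the access vector are hypothetical names made up for illustration, and the sketch only mimics the "error on key, warning on core" rule.

```r
# Toy model of field access checks (NOT rezonateR's implementation).
# `access` is a named character vector shaped like the fieldaccess() output.
check_edit <- function(access, field) {
  level <- access[[field]]
  if (level == "key") {
    # key fields: refuse the edit outright
    stop("'", field, "' is a key field and cannot be changed.")
  } else if (level == "core") {
    # core fields: allow the edit, but warn
    warning("'", field, "' is a core field; changing it may break things.")
  }
  invisible(level)
}

# Hypothetical subset of a field access vector
access <- c(id = "key", unitSeq = "core", pID = "flex", text = "foreign")

flex_result <- check_edit(access, "pID")  # silent
key_result  <- tryCatch(check_edit(access, "id"),
                        error = function(e) "blocked")
core_result <- tryCatch(check_edit(access, "unitSeq"),
                        warning = function(w) "warned")
```

Here key_result is "blocked" and core_result is "warned", while editing the flex field passes silently.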
Update functions and reloads
Whenever you have auto and foreign fields in a table, that means you will want them to be automatically updated as your annotations progress. The reload() function is one of the core features of rezonateR and allows you to do this.
The reload() feature calls functions called updateFunctions. You can access the updateFunctions of a table using updateFunct():
updateFunct(rez007$unitDF)
#> $text
#> function(df, rezrObj) updateLowerToHigher(df, rezrObj, address, fkeyAddress, action, field, fkeyInDF, seqName)
#> <environment: 0x00000204e7cc7968>
#> attr(,"class")
#> [1] "updateFunction" "function"
#> attr(,"deps")
#> [1] "entryDF/text"
#>
#> $transcript
#> function(df, rezrObj) updateLowerToHigher(df, rezrObj, address, fkeyAddress, action, field, fkeyInDF, seqName)
#> <environment: 0x00000204e7c21fe0>
#> attr(,"class")
#> [1] "updateFunction" "function"
#> attr(,"deps")
#> [1] "entryDF/transcript"
#>
#> $docTokenSeqFirst
#> function(df, rezrObj) updateLowerToHigher(df, rezrObj, address, fkeyAddress, action, field, fkeyInDF, seqName)
#> <environment: 0x00000204e7c253c0>
#> attr(,"class")
#> [1] "updateFunction" "function"
#> attr(,"deps")
#> [1] "entryDF/docTokenSeq"
#>
#> $docTokenSeqLast
#> function(df, rezrObj) updateLowerToHigher(df, rezrObj, address, fkeyAddress, action, field, fkeyInDF, seqName)
#> <environment: 0x00000204e7c22460>
#> attr(,"class")
#> [1] "updateFunction" "function"
#> attr(,"deps")
#> [1] "entryDF/docTokenSeq"
#>
#> $docWordSeqFirst
#> function(df, rezrObj) updateLowerToHigher(df, rezrObj, address, fkeyAddress, action, field, fkeyInDF, seqName)
#> <environment: 0x00000204e7c23420>
#> attr(,"class")
#> [1] "updateFunction" "function"
#> attr(,"deps")
#> [1] "entryDF/docWordSeq"
#>
#> $docWordSeqLast
#> function(df, rezrObj) updateLowerToHigher(df, rezrObj, address, fkeyAddress, action, field, fkeyInDF, seqName)
#> <environment: 0x00000204e7c0c2d0>
#> attr(,"class")
#> [1] "updateFunction" "function"
#> attr(,"deps")
#> [1] "entryDF/docWordSeq"
There are three reload functions:
- reloadLocal() only takes a rezrDF, and only updates auto fields.
- reloadForeign() takes a rezrDF and a rezrObj, and updates the foreign fields of the rezrDF using the rezrObj (which may or may not contain the rezrDF).
- reload() combines the two.
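The idea behind a local reload can be sketched with a toy model in base R. This is a simplified illustration, not rezonateR internals: reload_local_sketch(), update_functions and textLength are hypothetical names, and the real updateFunctions carry extra attributes such as deps.

```r
# Toy model of auto fields and local reloads (NOT rezonateR internals).
# Each auto field is stored together with the function that recomputes it.
df <- data.frame(text = c("God", "said"), stringsAsFactors = FALSE)

update_functions <- list(
  textLength = function(d) nchar(d$text)  # hypothetical auto field
)

# Recompute every auto field from the current state of the table
reload_local_sketch <- function(d, update_functions) {
  for (field in names(update_functions)) {
    d[[field]] <- update_functions[[field]](d)
  }
  d
}

df <- reload_local_sketch(df, update_functions)
first_pass <- df$textLength        # 3 4

df$text[2] <- "whispered"          # edit the field that textLength depends on
df <- reload_local_sketch(df, update_functions)
second_pass <- df$textLength       # 3 9
```

The point of the design is that the recipe for a field lives with the table, so an edit to text only needs a reload, not a repeat of the original calculation.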
Once we start editing fields, we will experience the power of reloads. Let’s now first take a look at how we’ll be editing …
The core four
As you probably guessed from the title, this vignette covers the EasyEdit series of functions in rezonateR, which are simple but powerful functions for editing rezrDFs, and can be learnt even by users with no exposure to dplyr. EasyEdit consists of four core functions, along with a bunch of useful helpers. The four core functions are:
- addFieldLocal()
- changeFieldLocal()
- addFieldForeign()
- changeFieldForeign()
The terms ‘local’ and ‘foreign’ are inspired by, but extended from, database terminology. They refer to what source of information you are drawing from to create or change the field. The two ‘local’ functions add or change fields using information from the current rezrDF, and the two ‘foreign’ ones add or change fields using information from other rezrDFs. The word ‘local’ can be dropped when you are using the local functions, so addField() and changeField() work too.
All four core functions can be applied to both rezrDFs and rezrObjs. In general, whenever the rezrDF you are editing lives inside a rezrObj, it is safest to apply the function to the rezrObj itself. However, if you are working with an emancipated rezrDF - that is, a rezrDF stored in a variable outside of a rezrObj - then you will want to apply these functions to that single rezrDF. In practice, the main difference between the rezrDF and rezrObj versions is that the latter requires you to specify the entity type and layer. This tutorial will mainly use the rezrObj versions; simply omit the entity type and layer arguments when applying these functions to rezrDFs.
The change functions act in more or less the same way as the add functions, the only difference being that they work on an existing field instead of adding a new one. So this tutorial will mostly work with the add functions.
Staying local
Let’s start by looking at addFieldLocal() using a simple application: In our tokenDF, let’s add a field that automatically calculates the length of a word in characters.
In this function, entity specifies the name of the entity you would like to change, layer specifies the layer within that entity (which is an empty string since there are no token layers), fieldName is the name of the field we’re adding, expression is the R expression with which we calculate the new field, and fieldaccess tells rezonateR to make this an auto field with an updateFunction that will be attached to the table. Let’s try this, and look at both the results and the updateFunction:
rez007 = addField(rez007, entity = "token", layer = "",
                  fieldName = "orthoLength",
                  expression = nchar(text),
                  fieldaccess = "auto")
print("A fragment of the updated table:")
#> [1] "A fragment of the updated table:"
head(rez007$tokenDF %>% rez_select(id, text, orthoLength))
#> # A tibble: 6 × 3
#> id text orthoLength
#> <chr> <chr> <int>
#> 1 31F282855E95E (...) 5
#> 2 363C1D373B2F7 God 3
#> 3 3628E4BD4CC05 , 1
#> 4 37EFCBECFD691 I 1
#> 5 12D67756890C1 said 4
#> 6 936363B71D59 I 1
print("The updateFunction:")
#> [1] "The updateFunction:"
updateFunct(rez007$tokenDF, "orthoLength")
#> function (df)
#> updateMutate(df, field, x)
#> <environment: 0x00000204e6df5630>
#> attr(,"class")
#> [1] "updateFunction" "function"
#> attr(,"deps")
#> character(0)
You might notice that (...) has an orthoLength of 5. What if we decide that we don’t want to count these non-words? One feasible solution is to use isWord, which we added in vignette("time_seq"): if a token is a word, then we set orthoLength to the number of characters in the text column as before; if not, we set it to 0.
Although EasyEdit functions do not require users to use Tidyverse functions, I still suggest that the Tidyverse function dplyr::case_when() is the best for this purpose, and it can be easily combined with EasyEdit functions. It allows you to create a vector whose value is calculated differently depending on certain conditions. The syntax of dplyr::case_when() is simple: each argument of the function is a condition ~ value pair, and if you want an ‘else’ statement, simply use TRUE as the condition in the last condition-value pair. In this case, we can use this function to create a vector of values that is an empty string when a token is not a word, and the text of the token when it is a word:
rez007 = changeField(rez007, entity = "token", layer = "",
                     fieldName = "orthoLength",
                     expression = nchar(case_when(isWord ~ text, TRUE ~ "")),
                     fieldaccess = "auto")
print("A fragment of the updated table:")
#> [1] "A fragment of the updated table:"
head(rez007$tokenDF %>% rez_select(id, text, orthoLength))
#> # A tibble: 6 × 3
#> id text orthoLength
#> <chr> <chr> <int>
#> 1 31F282855E95E (...) 0
#> 2 363C1D373B2F7 God 3
#> 3 3628E4BD4CC05 , 0
#> 4 37EFCBECFD691 I 1
#> 5 12D67756890C1 said 4
#> 6 936363B71D59 I 1
Notice that (...) now has an orthoLength of 0.
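If case_when() is new to you, it may help to try the same expression outside of rezonateR first. The sketch below uses two small hand-made vectors (the values are copied from the first four rows of the table above; the isWord values are assumed for illustration) and only needs dplyr.

```r
library(dplyr)

# Hand-made stand-ins for the text and isWord columns of tokenDF
text   <- c("(...)", "God", ",", "I")
isWord <- c(FALSE, TRUE, FALSE, TRUE)

# Words keep their text; everything else becomes an empty string,
# so nchar() gives 0 for non-words
ortho_length <- nchar(case_when(isWord ~ text, TRUE ~ ""))
ortho_length
# 0 3 0 1
```

Inside changeField(), the expression is the same; the only difference is that text and isWord are looked up as columns of the rezrDF.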
Now let’s spice this up a bit by adding a complex field. A complex field takes information from multiple rows of a table. Let’s say we are working with the tokenDF, but want the new column to be the length of the longest word that appears in the unit that the token comes from. In this case, we set the groupField argument, which we haven’t seen before, to unit, and we specify the field type as "complex". The expression uses the function longestLength(), which is a rezonateR function that returns the length of the longest word in a series of words.
rez007 = addField(rez007, entity = "token", layer = "",
                  fieldName = "longestWordInUnit",
                  expression = longestLength(text),
                  type = "complex",
                  groupField = "unit",
                  fieldaccess = "auto")
head(rez007$tokenDF %>% select(id, text, longestWordInUnit))
#> # A tibble: 6 × 3
#> id text longestWordInUnit
#> <chr> <chr> <int>
#> 1 31F282855E95E (...) 5
#> 2 363C1D373B2F7 God 5
#> 3 3628E4BD4CC05 , 5
#> 4 37EFCBECFD691 I 7
#> 5 12D67756890C1 said 7
#> 6 936363B71D59 I 7
longestLength() belongs to a small collection of functions useful for extracting information from a group of strings:
- shortestLength(): Find the shortest token’s length within the group.
- longestLength(): Find the longest token’s length.
- shortest(): Get the shortest token’s text.
- longest(): Get the longest token’s text.
- concatenateAll(): Concatenate all the tokens together.
- inLength(): Give the size of the group (may be used with non-strings), possibly with isWord information.
Some base R functions that might be useful for numeric values include max(), min(), range(), mean(), etc.
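Conceptually, a complex local field is a grouped calculation whose result is written back to every row of the group. A rough base R equivalent, assuming a small hand-made tokens table (the unit IDs u1 and u2 are made up), might look like this; it uses ave() rather than any rezonateR function.

```r
# Hand-made stand-in for tokenDF: two units, five tokens
tokens <- data.frame(
  unit = c("u1", "u1", "u1", "u2", "u2"),
  text = c("(...)", "God", ",", "I", "anymore"),
  stringsAsFactors = FALSE
)

# For each unit, take the maximum token length and repeat it on every row
tokens$longestWordInUnit <- ave(nchar(tokens$text), tokens$unit, FUN = max)
tokens$longestWordInUnit
# 5 5 5 7 7
```

The rezonateR version additionally attaches an updateFunction, so the value can be recomputed on reload.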
Note that both times we added a field, we set the field access to auto. If you do not set the field access, I will automatically set it to flex, which means that the new column - orthoLength or longestWordInUnit in this case - will not be updated by reloads.
Going foreign
Now let’s add a simple foreign field. Let’s say when we look at the tokenDF, we also want to know what the whole unit’s words are. (This will come in handy when we want to do external editing!)
The trickiest part of addFieldForeign() is keeping track of the source we’re getting information from, and the target we’re aiming to add information to. We need to know:
- The source of information that our new field is created from: this means we need to know sourceEntity, sourceLayer, sourceFieldName.
- The location of our new target field: this means we need targetEntity, targetLayer, targetFieldName.
- How to link the source and target tables. This is handled by the argument targetForeignKeyName. We need to give the name of the column containing IDs of the source table inside the target table, i.e. the column of the target table that tells us which row of the source table to look at. In this case, this is the unit field of tokenDF.
In our specific example:
- The source is the ‘text’ field of units, so we set sourceEntity to "unit", sourceLayer to the empty string, and sourceFieldName to "text".
- The target is the ‘unitText’ field of tokens, so we set targetEntity to "token", targetLayer to the empty string, and targetFieldName to "unitText".
- The foreign key, targetForeignKeyName, is the unit field of tokenDF.
Let’s put it to practice:
rez007 = addFieldForeign(rez007,
                         targetEntity = "token", targetLayer = "",
                         sourceEntity = "unit", sourceLayer = "",
                         targetForeignKeyName = "unit",
                         targetFieldName = "unitText", sourceFieldName = "text",
                         fieldaccess = "foreign")
head(rez007$tokenDF %>% select(id, text, unitText))
#> # A tibble: 6 × 3
#> id text unitText
#> <chr> <chr> <chr>
#> 1 31F282855E95E (...) (...) God ,
#> 2 363C1D373B2F7 God (...) God ,
#> 3 3628E4BD4CC05 , (...) God ,
#> 4 37EFCBECFD691 I I said I was n't gon na do this anymore .
#> 5 12D67756890C1 said I said I was n't gon na do this anymore .
#> 6 936363B71D59 I I said I was n't gon na do this anymore .
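The source/target/foreign-key relationship used here is essentially a table lookup. As a rough illustration in base R, assuming two small hand-made tables (the IDs t1, u1 and so on are made up), the same link can be expressed with match(); this is a sketch of the idea, not what addFieldForeign() does internally.

```r
# Source table: units, with the text we want to copy
units <- data.frame(
  id   = c("u1", "u2"),
  text = c("(...) God ,", "I said"),
  stringsAsFactors = FALSE
)

# Target table: tokens, whose unit column holds IDs of the units table
tokens <- data.frame(
  id   = c("t1", "t2", "t3"),
  unit = c("u1", "u1", "u2"),   # the foreign key
  text = c("(...)", "God", "I"),
  stringsAsFactors = FALSE
)

# For each token, find the row of its unit and copy that unit's text
tokens$unitText <- units$text[match(tokens$unit, units$id)]
tokens$unitText
# "(...) God ," "(...) God ," "I said"
```

In these terms, tokens$unit plays the role of targetForeignKeyName, units$text is the source field, and tokens$unitText is the target field.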
Like its local counterpart, addFieldForeign() also has a complex flavour, i.e. we can draw from multiple rows of a different table. This is probably the hardest part of this tutorial, so buckle up!
Here, we’re going to add a field in the unitDF that tells us the average length of words within the unit. We’re going to base this off the entryDF.
This time, targetForeignKeyName works a bit differently. Because the entries that correspond to each unit are given in the nodeMap, you also need to supply the list of entries inside a unit node - that is, entryList, as you may recall from vignette("import_save_basics").
addFieldForeign() has an argument called complexAction, which is a function performed on the source field of the source table. It can be any aggregating function (including the longestLength() series of functions that we have seen before). In this instance, we use mean():
rez007 = addFieldForeign(rez007,
                         targetEntity = "unit", targetLayer = "",
                         sourceEntity = "entry", sourceLayer = "",
                         targetForeignKeyName = "entryList",
                         targetFieldName = "averageWordLength",
                         sourceFieldName = "text",
                         type = "complex",
                         complexAction = function(x) mean(nchar(x)),
                         fieldaccess = "foreign")
head(rez007$unitDF %>% select(id, text, averageWordLength))
#> # A tibble: 6 × 3
#> id text avera…¹
#> <chr> <chr> <dbl>
#> 1 2AD10A854E6D3 (...) God , 3
#> 2 BDD7D839325A I said I was n't gon na do this anymore . 2.82
#> 3 2752E3B395FC1 (...) <0> Stay up late . 3.17
#> 4 8487A33D1DF2 (...) <0> Kinda defeats the purpose of getting up in th… 4.15
#> 5 107F655C3299D (...) I know . 2.75
#> 6 307808364906D (.) And it 's a hard habit to break . 2.8
#> # … with abbreviated variable name ¹averageWordLength
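To see what the complex foreign field is doing, it can help to imitate it on toy data. The sketch below assumes a hand-made entries table and a hand-made list of entry IDs per unit (standing in for entryList in the nodeMap; all IDs are made up), and applies the same complexAction to each unit’s entries.

```r
# Hand-made stand-in for entryDF
entries <- data.frame(
  id   = c("e1", "e2", "e3", "e4", "e5"),
  text = c("(...)", "God", ",", "I", "know"),
  stringsAsFactors = FALSE
)

# Hand-made stand-in for each unit's entryList
entry_lists <- list(
  u1 = c("e1", "e2", "e3"),
  u2 = c("e4", "e5")
)

# Same aggregating function as in the addFieldForeign() call above
complex_action <- function(x) mean(nchar(x))

# For each unit, look up its entries' text and aggregate
averageWordLength <- sapply(entry_lists, function(ids) {
  complex_action(entries$text[match(ids, entries$id)])
})
averageWordLength
# u1: (5 + 3 + 1) / 3 = 3; u2: (1 + 4) / 2 = 2.5
```

The value for u1 agrees with the first row of the real output above, where the unit (...) God , has an averageWordLength of 3.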
Reloads revisited
Having created a bunch of auto fields, naturally we will want to try out our reloads! Let’s try replacing the zero sign <0> with ∅ in the text column, which is more commonly used in linguistics papers. After doing this, we can then reload unitDF to look at the impact on our freshly created averageWordLength.
Notice that we have to reload rezrDFs in order: first the entryDF, using information from tokenDF, then the unitDF, using information from entryDF (please be patient if running this on your computer, since reloads can take time):
#unitDF before the update
rez007$unitDF %>% filter(str_detect(text, "<0>")) %>% rez_select(id, text, averageWordLength) %>% head
#> # A tibble: 6 × 3
#> id text avera…¹
#> <chr> <chr> <dbl>
#> 1 2752E3B395FC1 (...) <0> Stay up late . 3.17
#> 2 8487A33D1DF2 (...) <0> Kinda defeats the purpose of getting up in th… 4.15
#> 3 786FB0DC8416 (...) I wan na <0> spend time with Ron , 3.1
#> 4 3171F4905628A (.) If we sit down and <0> set some rules , 3
#> 5 37BE6893BC78E (H) <0> going to (...) our parents , 3.62
#> 6 7B0D1EF95CEF and (...) <0> complaining about one another , 4.75
#> # … with abbreviated variable name ¹averageWordLength
#Change the zero format
rez007$tokenDF = changeFieldLocal(rez007$tokenDF,
                                  fieldName = "text",
                                  expression = case_when(text == "<0>" ~ "∅", TRUE ~ text))
rez007$entryDF = rez007$entryDF %>% reload(rez007)
rez007$unitDF = rez007$unitDF %>% reload(rez007)
#unitDF after the update
rez007$unitDF %>% filter(str_detect(text, "∅")) %>% rez_select(id, text, averageWordLength) %>% head
#> # A tibble: 6 × 3
#> id text avera…¹
#> <chr> <chr> <dbl>
#> 1 2752E3B395FC1 (...) ∅ Stay up late . 2.83
#> 2 8487A33D1DF2 (...) ∅ Kinda defeats the purpose of getting up in the … 4
#> 3 786FB0DC8416 (...) I wan na ∅ spend time with Ron , 2.9
#> 4 3171F4905628A (.) If we sit down and ∅ set some rules , 2.82
#> 5 37BE6893BC78E (H) ∅ going to (...) our parents , 3.38
#> 6 7B0D1EF95CEF and (...) ∅ complaining about one another , 4.5
#> # … with abbreviated variable name ¹averageWordLength
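The change in averageWordLength can be checked by hand for the first unit shown. Assuming that unit consists of exactly the six tokens displayed in its text column, the three-character <0> becomes the one-character ∅, so the total drops from 19 to 17 characters:

```r
# Tokens of the first unit, as displayed in the text column above
before <- c("(...)", "<0>", "Stay", "up", "late", ".")

# Replace the zero sign; "\u2205" is the empty-set character
after <- replace(before, before == "<0>", "\u2205")

mean_before <- round(mean(nchar(before)), 2)  # 19 / 6
mean_after  <- round(mean(nchar(after)), 2)   # 17 / 6
c(mean_before, mean_after)
# 3.17 2.83
```

These match the values in the two tables, which confirms that the reload picked up the edited text.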
Dealing with categorical variables
The tidyverse package forcats is much more powerful for dealing with categorical variables, but for those of us who don’t want to learn an entirely new package, rezonateR provides a few easy ways to deal with categories.
mergeCats() allows you to merge two or more categories. It takes a vector, normally a column, as the first argument. Subsequently, the name of each argument is a new category, and the value of each argument is a vector of names of old categories that the new category will encompass (as character values, even if the original column contains factors).
For example, the Santa Barbara Corpus categorises laughter as separate from other vocalisms. If you want to merge "Laugh" into "Vocalism", keeping everything else, then you can use this code:
#Laughter tokens before
head(rez007$tokenDF %>% select(id, doc, unit, text, kind) %>% filter(str_detect(text, "@")))
#> # A tibble: 6 × 5
#> id doc unit text kind
#> <chr> <chr> <chr> <chr> <chr>
#> 1 280154406371 sbc007 1E534FE7565B6 @ Laugh
#> 2 8C04EB23AE43 sbc007 274331D2D2283 @@ Laugh
#> 3 221FD665A9FA4 sbc007 EFCC48B9EDFA @@@@@ Laugh
#> 4 15B804DDB7C66 sbc007 3759142333BAE @@@@@ Laugh
#> 5 261421712169B sbc007 A744F59296C2 @ Laugh
#> 6 2D0821FE951A4 sbc007 273B241320492 @@@ Laugh
rez007 = changeField(rez007, entity = "token", layer = "",
fieldName = "kind",
expression = mergeCats(kind, Vocalism = c("Laugh", "Vocalism")))
#Laughter tokens after
head(rez007$tokenDF %>% select(id, doc, unit, text, kind) %>% filter(str_detect(text, "@")))
#> # A tibble: 6 × 5
#> id doc unit text kind
#> <chr> <chr> <chr> <chr> <chr>
#> 1 280154406371 sbc007 1E534FE7565B6 @ Vocalism
#> 2 8C04EB23AE43 sbc007 274331D2D2283 @@ Vocalism
#> 3 221FD665A9FA4 sbc007 EFCC48B9EDFA @@@@@ Vocalism
#> 4 15B804DDB7C66 sbc007 3759142333BAE @@@@@ Vocalism
#> 5 261421712169B sbc007 A744F59296C2 @ Vocalism
#> 6 2D0821FE951A4 sbc007 273B241320492 @@@ Vocalism
renameCats() has identical syntax. For example, if you want to rename Vocalism further to Voc:
#Breath tokens before
head(rez007$tokenDF %>% select(id, doc, unit, text, kind) %>% filter(str_detect(text, "\\(H\\)")))
#> # A tibble: 6 × 5
#> id doc unit text kind
#> <chr> <chr> <chr> <chr> <chr>
#> 1 B7392EC0FE18 sbc007 1EFAE7BD41500 (H) Vocalism
#> 2 34038223F0E2 sbc007 EDCE0A20D5B3 (H) Vocalism
#> 3 2C6A695AAE730 sbc007 3759142333BAE (H) Vocalism
#> 4 1094B3337D73D sbc007 1F858888540F6 (H) Vocalism
#> 5 3127D3EC0E7C5 sbc007 1017EC66C38EB (H) Vocalism
#> 6 2EA63A8716B28 sbc007 10B9591B876F8 (H) Vocalism
rez007 = changeField(rez007, entity = "token", layer = "",
fieldName = "kind",
expression = renameCats(kind, Voc = "Vocalism"))
#Breath tokens after
head(rez007$tokenDF %>% select(id, doc, unit, text, kind) %>% filter(str_detect(text, "\\(H\\)")))
#> # A tibble: 6 × 5
#> id doc unit text kind
#> <chr> <chr> <chr> <chr> <chr>
#> 1 B7392EC0FE18 sbc007 1EFAE7BD41500 (H) Voc
#> 2 34038223F0E2 sbc007 EDCE0A20D5B3 (H) Voc
#> 3 2C6A695AAE730 sbc007 3759142333BAE (H) Voc
#> 4 1094B3337D73D sbc007 1F858888540F6 (H) Voc
#> 5 3127D3EC0E7C5 sbc007 1017EC66C38EB (H) Voc
#> 6 2EA63A8716B28 sbc007 10B9591B876F8 (H) Voc
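The calling convention shared by mergeCats() and renameCats() (new category as the argument name, old categories as the value) can be imitated in a few lines of base R. The function below, merge_cats_sketch(), is a hypothetical stand-in written for illustration; it is not the rezonateR source, and the category names other than Laugh and Vocalism are made up.

```r
# Toy re-implementation of the "new = c(old1, old2)" convention
merge_cats_sketch <- function(x, ...) {
  mapping <- list(...)   # names are new categories, values are old ones
  x <- as.character(x)   # work on character values, even for factors
  for (new_name in names(mapping)) {
    x[x %in% mapping[[new_name]]] <- new_name
  }
  x
}

kind <- factor(c("Word", "Laugh", "Vocalism", "Punctuation"))

merged  <- merge_cats_sketch(kind, Vocalism = c("Laugh", "Vocalism"))
renamed <- merge_cats_sketch(merged, Voc = "Vocalism")
merged
# "Word" "Vocalism" "Vocalism" "Punctuation"
renamed
# "Word" "Voc" "Voc" "Punctuation"
```

Categories that are not mentioned in any argument are left untouched, which is the "keeping everything else" behaviour described above.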
Adding rows
Adding rows is not an operation you will do all the time. In general, it is recommended to just re-import the whole thing and run all the code again. On the rare occasion where you do have to add a row by hand, addRow() comes in handy.
You only need to add core and flex fields when you add a row. An ID will be automatically generated, and the foreign/auto fields will be automatically added. After specifying the rezrDF or rezrObj with entity and layer, each argument name is a column name, and its value is the value in the column.
rez007 = addRow(rez007, "trail", "default",
                doc = "sbc007",
                chainCreateSeq = max(rez007$trailDF$default$chainCreateSeq) + 1,
                name = "Danae",
                chainSize = 1)
tail(rez007$trailDF$default)
#> # A tibble: 6 × 6
#> id doc chainCreateSeq name chainSize layer
#> <chr> <chr> <dbl> <chr> <dbl> <chr>
#> 1 E57D22CC7190 sbc007 16 time (spending) 1 defau…
#> 2 2E79208C27DC2 sbc007 17 Ron's work 1 defau…
#> 3 21B1DDD41EFB0 sbc007 50 impersonal you 3 defau…
#> 4 368B8B1D9E62C sbc007 56 up there 1 defau…
#> 5 251AE6A659F6F sbc007 3 the purpose of getting up 1 defau…
#> 6 Sg8GbsAlUW0LD sbc007 88 Danae 1 defau…
#Note: chainSize is currently flex as it is supplied by Rezonator and not calculated by rezonateR, but this may change in the future.
Onwards!
The next tutorial, vignette("edit_tidyRez"), will be relatively short, because it will assume familiarity with the Tidyverse package dplyr. If you would like to do something that goes beyond the capabilities of EasyEdit, it is recommended that you familiarise yourself with the relevant dplyr function first, and then read the TidyRez vignette. If you are not familiar with Tidyverse and have no intention to learn it yet, you may skip the next tutorial and go straight to vignette("edit_external").
And lest we forget, always save!
savePath = "rez007.Rdata"
rez_save(rez007, savePath)
#> Saving rezrObj ...