Time and sequence
time_seq.Rmd
Getting started
In this tutorial, we will learn how time and sequence are handled in rezonateR. More features relating to finer-grained time may be available later, so watch this space!
This file will use the file saved at the end of vignette("import_save_basics"). You don’t have to have read that tutorial beforehand, though it may be helpful if you are new to rezonateR.
library(rezonateR)
path = system.file("extdata", "rez007_time.Rdata", package = "rezonateR", mustWork = TRUE)
rez007 = rez_load(path)
#> Loading rezrObj ...
Token order and sequence
Rezonator by default provides two fields related to the position of a token, which you will see as columns in tokenDF:
- docTokenSeq - refers to the order of a token within the entire text
- tokenOrder - refers to the position of a token within its intonation unit
Both orders count all tokens. In the Santa Barbara Corpus text we are using, this includes endnotes (such as , and .), transcriptions of vocalisms (such as (H) for in-breaths and @@@ for laughter), and so on. You can see these in the tokenDF of our sample file here:
head(rez007$tokenDF)
#> # A tibble: 6 × 19
#> id doc unit docTo…¹ token…² kind place text trans…³ endNote order
#> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 31F282855… sbc0… 2AD1… 1 1 Pause "" (...) (...) "" 1
#> 2 363C1D373… sbc0… 2AD1… 2 2 Word "1" God God "" 2
#> 3 3628E4BD4… sbc0… 2AD1… 3 3 EndN… "" , , "conti… 3
#> 4 37EFCBECF… sbc0… BDD7… 4 1 Word "1" I I "" 1
#> 5 12D677568… sbc0… BDD7… 5 2 Word "2" said said "" 2
#> 6 936363B71… sbc0… BDD7… 6 3 Word "3" I I "" 3
#> # … with 8 more variables: negPlace <chr>, corpusSeq <chr>, pSentOrder <chr>,
#> # POS_dft <chr>, tokenSeq <chr>, chunkType <chr>, turnOrder <chr>,
#> # largerChunk <chr>, and abbreviated variable names ¹docTokenSeq,
#> # ²tokenOrder, ³transcript
#> # ℹ Use `colnames()` to see all variable names
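To make the difference between the two fields concrete, here is a minimal base R sketch on a made-up table. The toy data frame and its values are illustrative assumptions; this is not how Rezonator itself computes the fields. docTokenSeq numbers tokens across the whole text, while tokenOrder restarts in every unit.

```r
# Toy token table: two intonation units, every token counted
# (hypothetical data for illustration only).
toy <- data.frame(
  unit = c("u1", "u1", "u1", "u2", "u2", "u2"),
  text = c("(...)", "God", ",", "I", "said", "I"),
  stringsAsFactors = FALSE
)

# Position within the whole text: a running count over all rows.
toy$docTokenSeq <- seq_len(nrow(toy))

# Position within the unit: the count restarts for each unit.
toy$tokenOrder <- ave(toy$docTokenSeq, toy$unit, FUN = seq_along)

print(toy)
```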
‘Larger’ elements that span multiple tokens have four token sequence-related fields:
- docTokenSeqFirst - refers to the docTokenSeq of the first token.
- docTokenSeqLast - refers to the docTokenSeq of the last token.
- tokenOrderFirst - refers to the tokenOrder of the first token.
- tokenOrderLast - refers to the tokenOrder of the last token.
You can see these fields in action in the chunkDF:
head(rez007$chunkDF$refexpr %>% select(id, doc, name, text, tokenOrderFirst, docTokenSeqFirst, tokenOrderLast, docTokenSeqLast))
#> # A tibble: 6 × 8
#> id doc name text token…¹ docTo…² token…³ docTo…⁴
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 35E3E0AB6803A sbc007 Chunk 1 Stay up late 3 17 5 19
#> 2 1F6B5F0B3FF59 sbc007 Chunk 2 the purpose of … 5 25 12 32
#> 3 24FE2B219BD21 sbc007 Chunk 55 getting up in t… 8 28 12 32
#> 4 158B579C1BA49 sbc007 Chunk 3 the morning 11 31 12 32
#> 5 2B6521E881365 sbc007 Chunk 4 all this other … 2 144 5 147
#> 6 5B854594DD34 sbc007 Chunk 5 the way (...) t… 5 156 10 161
#> # … with abbreviated variable names ¹tokenOrderFirst, ²docTokenSeqFirst,
#> # ³tokenOrderLast, ⁴docTokenSeqLast
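The First and Last fields can be thought of as the token positions at the two edges of the larger element. The sketch below works this out in base R for a hypothetical chunk covering three rows of a made-up token table, with values chosen to resemble the first row of the output above. It is an illustration under those assumptions, not rezonateR's own code.

```r
# Toy token table (hypothetical ids and values).
toy <- data.frame(
  id          = c("t1", "t2", "t3", "t4", "t5"),
  docTokenSeq = 15:19,
  tokenOrder  = 1:5,
  stringsAsFactors = FALSE
)

# A hypothetical chunk made up of the last three tokens.
chunkTokens <- c("t3", "t4", "t5")
rows <- match(chunkTokens, toy$id)

# The bounds are the smallest and largest positions among the chunk's tokens.
chunkBounds <- data.frame(
  tokenOrderFirst  = min(toy$tokenOrder[rows]),
  docTokenSeqFirst = min(toy$docTokenSeq[rows]),
  tokenOrderLast   = max(toy$tokenOrder[rows]),
  docTokenSeqLast  = max(toy$docTokenSeq[rows])
)
print(chunkBounds)
```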
Setting isWord status
The Rezonator edition of the Santa Barbara Corpus contains other position-related fields by default, provided through the tagMap, such as these:
- corpusSeq - the position of a token within the entire corpus.
- place - the position of a word within the intonation unit, excluding elements like endnotes and vocalisms.
- negPlace - the position of a word within the intonation unit, counting backwards from the last token.
However, if you are working with your own texts rather than the Santa Barbara Corpus, you will not have access to place by default. We will therefore have to create it ourselves.
To create something that functions similarly to place, you must first define what a word is. For example, if you are dealing with a very ‘clean’ transcription that ignores elements like breaths and laughter, then you may simply create a regular expression that captures all the punctuation, and count every token that does not match it as a word.
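As a rough illustration of such a regular expression, the base R sketch below flags tokens consisting solely of punctuation as non-words. The token vector and the pattern are assumptions made up for this example; your own transcription conventions may call for a different pattern.

```r
# Hypothetical tokens from a 'clean' transcription.
text <- c("I", "said", ",", "do", "n't", "go", ".", "--")

# A token counts as punctuation if it consists only of punctuation characters.
isPunct <- grepl("^[[:punct:]]+$", text)

# Everything else is treated as a word.
isWord <- !isPunct
print(data.frame(text, isWord))
```

Anchoring the pattern with ^ and $ matters here: a token such as n't contains a punctuation character but is not made up of punctuation alone, so it still counts as a word.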
The function addIsWordField adds a column isWord to a rezrDF or rezrObj stating whether a token is a word. For the Santa Barbara Corpus, this is simple since the kind column is set to "Word" for actual words. Thus, we can use the expression kind == "Word" for the definition of what counts as a word. This expression is passed to addIsWordField as its second parameter (the first parameter being the rezrDF or rezrObj). Note that if addIsWordField is called on a rezrObj, then both tokenDF and entryDF will have this new field.
By default, addIsWordField also adds the fields docWordSeq and wordOrder (= place) to the tokenDF and entryDF, and the columns wordOrderFirst, wordOrderLast, docWordSeqFirst and docWordSeqLast will be added to unitDF, chunkDF and trackDF. These work similarly to their token counterparts, except that non-words are given 0 values, and only words are counted in determining order.
rez007 = addIsWordField(rez007, kind == "Word")
head(rez007$tokenDF %>% select(id, tokenOrder, docTokenSeq, wordOrder, docWordSeq))
#> # A tibble: 6 × 5
#> id tokenOrder docTokenSeq wordOrder docWordSeq
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 31F282855E95E 1 1 0 0
#> 2 363C1D373B2F7 2 2 1 1
#> 3 3628E4BD4CC05 3 3 0 0
#> 4 37EFCBECFD691 1 4 1 2
#> 5 12D67756890C1 2 5 2 3
#> 6 936363B71D59 3 6 3 4
head(rez007$chunkDF$refexpr %>% select(id, tokenOrderFirst, tokenOrderLast, docTokenSeqFirst, docTokenSeqLast, wordOrderFirst, wordOrderLast, docWordSeqFirst, docWordSeqLast))
#> # A tibble: 6 × 9
#> id tokenO…¹ token…² docTo…³ docTo…⁴ wordO…⁵ wordO…⁶ docWo…⁷ docWo…⁸
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 35E3E0AB6803A 3 5 17 19 1 3 12 14
#> 2 1F6B5F0B3FF59 5 12 25 32 3 10 17 24
#> 3 24FE2B219BD21 8 12 28 32 6 10 20 24
#> 4 158B579C1BA49 11 12 31 32 9 10 23 24
#> 5 2B6521E881365 2 5 144 147 2 5 96 99
#> 6 5B854594DD34 5 10 156 161 2 6 103 107
#> # … with abbreviated variable names ¹tokenOrderFirst, ²tokenOrderLast,
#> # ³docTokenSeqFirst, ⁴docTokenSeqLast, ⁵wordOrderFirst, ⁶wordOrderLast,
#> # ⁷docWordSeqFirst, ⁸docWordSeqLast
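The logic of counting only words, with 0 for non-words, can be reproduced in a few lines of base R. This is a sketch on made-up data that mirrors the pattern of the first six tokens shown above; it is not the implementation used by addIsWordField.

```r
# Toy token table: isWord is given directly (hypothetical values).
toy <- data.frame(
  unit   = c("u1", "u1", "u1", "u2", "u2", "u2"),
  isWord = c(FALSE, TRUE, FALSE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Running count of words over the whole text; non-words get 0.
toy$docWordSeq <- ifelse(toy$isWord, cumsum(toy$isWord), 0)

# Running count of words within each unit; non-words get 0.
wordCountInUnit <- ave(as.integer(toy$isWord), toy$unit, FUN = cumsum)
toy$wordOrder <- ifelse(toy$isWord, wordCountInUnit, 0)

print(toy)
```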
Unit order
The ordering of units, unitSeq, is not available to rezrDFs other than unitDF by default. The addUnitSeq() function adds unitSeq to other rezrDFs. This function allows you to set the entity type (and optionally, the layer) at which unit order is added. Whichever entity type you specify, everything ‘below’ it will also get unit orders. For example, if you specify ‘track’ as the level at which you want unit order, then tokenDF and chunkDF will get it too. Similarly, if you specify ‘stack’ as the level, then cardDF will also get it.
rez007 = addUnitSeq(rez007, "track")
rez007 = addUnitSeq(rez007, "stack")
head(rez007$trackDF$default %>% select(id, text, unitSeqFirst, unitSeqLast))
#> # A tibble: 6 × 4
#> id text unitSeqFirst unitSeqLast
#> <chr> <chr> <dbl> <dbl>
#> 1 1096E4AFFFE65 I 2 2
#> 2 92F20ACA5F06 I 2 2
#> 3 7E5BB65072C was n't gon na do 2 2
#> 4 1F74D2B049FA4 this 2 2
#> 5 2485C4F740FC0 <0> 3 3
#> 6 1BF2260B4AB78 Stay up late 3 3
head(rez007$stackDF %>% select(id, name, unitSeqFirst, unitSeqLast))
#> # A tibble: 6 × 4
#> id name unitSeqFirst unitSeqLast
#> <chr> <chr> <dbl> <dbl>
#> 1 16CE3C9693F60 Stack 91 391 392
#> 2 AC5AC90A6E65 Stack 175 689 690
#> 3 1A87960035B5A Stack 38 225 225
#> 4 31A9F84B7EE07 Stack 133 540 541
#> 5 1E013C5A85CA9 Stack 47 247 251
#> 6 3872B2E956219 Stack 154 609 609
Note that chunkDF and the layers that depend on it get unitSeqFirst and unitSeqLast (rather than a single unitSeq), because of a chunk combination feature that will be discussed in the trees chapter.
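Conceptually, passing unit order down to other tables is a lookup from each token to its unit, followed by taking the range for anything that spans several tokens. The base R sketch below shows this idea with hypothetical tables and ids; addUnitSeq() may well proceed differently internally.

```r
# Hypothetical unit table with its unit order.
unitDF <- data.frame(
  id      = c("u1", "u2", "u3"),
  unitSeq = 1:3,
  stringsAsFactors = FALSE
)

# Hypothetical token table: each token records the unit it belongs to.
tokenDF <- data.frame(
  id   = c("t1", "t2", "t3", "t4", "t5"),
  unit = c("u1", "u1", "u2", "u2", "u3"),
  stringsAsFactors = FALSE
)

# Tokens inherit unitSeq by looking up their unit.
tokenDF$unitSeq <- unitDF$unitSeq[match(tokenDF$unit, unitDF$id)]

# A hypothetical multi-token element running from t2 to t4 starts in
# unit 1 and ends in unit 2.
elementTokens <- c("t2", "t3", "t4")
span <- range(tokenDF$unitSeq[tokenDF$id %in% elementTokens])
unitSeqFirst <- span[1]
unitSeqLast  <- span[2]
```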
The function getSeqBounds() is mostly used internally by rezonateR, though more advanced users may use it to create functions similar to addIsWordField() and addUnitSeq().
Advanced sequence operations
In addition to the two most common operations we covered above, rezonateR also has other functions to deal with common problems related to time and sequence. These require knowledge of editing, so if you are not yet familiar with it, you can skip this section for now and come back after having read vignette("edit_easyEdit").
Generally, when elements in a file belong to a larger structure, there are three ways of representing them inside the rezrDF that houses the smaller structure:
- Sequence of the larger structure: unitSeq in tokenDF is an example.
- Order (position) of the smaller element within the larger structure: tokenOrder in tokenDF is an example.
- BILUO: Is the current element the beginning (B) of the larger structure, an intermediate (I) element, the last (L) element, the only (U) element, or not within the larger structure (O)? Generally used for artificial intelligence applications.
The following functions are currently available for dealing with these and related issues:
- getOrderFromSeq(): Converts from the first representation to the second one.
- getSeqFromOrder(): Converts from the second representation to the first one.
- isInitial(): Is the current element the initial member of a larger structure?
- isFinal(): Is the current element the final member of a larger structure? Used after inLength().
- getBiluoFromOrder(): Converts from the second representation to the third one.
getOrderFromSeq() is straightforward. For example, if we want to replicate tokenOrder from unitSeq, we can do this:
rez007$tokenDF = addFieldLocal(rez007$tokenDF, "tokenOrder2", getOrderFromSeq(unitSeq))
#Check that tokenOrder and tokenOrder2 are identical
all(rez007$tokenDF$tokenOrder == rez007$tokenDF$tokenOrder2)
#> [1] TRUE
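For intuition, the same conversion can be written in plain base R: within each group of equal sequence values, the elements are simply numbered from 1. The vector below is made up, and this is a sketch of the idea rather than the code behind getOrderFromSeq().

```r
# Hypothetical unitSeq values for eight tokens in three units.
unitSeq <- c(1, 1, 1, 2, 2, 3, 3, 3)

# Order within the larger structure: number the elements from 1
# within each group of equal sequence values.
tokenOrder <- ave(unitSeq, unitSeq, FUN = seq_along)
print(tokenOrder)
```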
getSeqFromOrder() is also straightforward. If we want to get the prosodic sentence in which a token is found, which is not given in the Rezonator version of the Santa Barbara Corpus, we can get it from the place within the prosodic sentence (pSentOrder):
rez007$tokenDF = addFieldLocal(rez007$tokenDF, "pSent", getSeqFromOrder(pSentOrder))
head(rez007$tokenDF %>% select(id, text, pSent, pSentOrder))
#> # A tibble: 6 × 4
#> id text pSent pSentOrder
#> <chr> <chr> <dbl> <chr>
#> 1 31F282855E95E (...) 1 1
#> 2 363C1D373B2F7 God 1 2
#> 3 3628E4BD4CC05 , 1 3
#> 4 37EFCBECFD691 I 1 4
#> 5 12D67756890C1 said 1 5
#> 6 936363B71D59 I 1 6
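The reverse direction can be sketched in base R as well: every time the order value drops back to 1, a new larger structure has started, so a running count of the 1s gives the sequence. The values below are made up, and the sketch assumes every element belongs to some larger structure (no 0 values); it is not the code behind getSeqFromOrder().

```r
# Hypothetical order values, stored as character as in the output above.
pSentOrder <- c("1", "2", "3", "1", "2", "1")

# Convert to numbers before comparing.
ord <- as.integer(pSentOrder)

# Each time the order resets to 1, the sequence number goes up by one.
pSent <- cumsum(ord == 1)
print(pSent)
```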
isInitial() and isFinal() tell us whether something is the initial or final member of a larger unit. isInitial() is simple: Its only parameter is the order (i.e. the second representation).
For example, the following code determines whether a token is the start of the prosodic sentence:
rez007$tokenDF = addFieldLocal(rez007$tokenDF, "isPSentInitial", isInitial(pSentOrder))
head(rez007$tokenDF %>% select(id, text, pSentOrder, isPSentInitial))
#> # A tibble: 6 × 4
#> id text pSentOrder isPSentInitial
#> <chr> <chr> <chr> <lgl>
#> 1 31F282855E95E (...) 1 TRUE
#> 2 363C1D373B2F7 God 2 FALSE
#> 3 3628E4BD4CC05 , 3 FALSE
#> 4 37EFCBECFD691 I 4 FALSE
#> 5 12D67756890C1 said 5 FALSE
#> 6 936363B71D59 I 6 FALSE
inLength() gives the length of the larger unit and its result is used by isFinal(), which takes the length of the larger unit in addition to the order value. (Note that for pSentLength, isWord is simply set to the text not being a zero (<0>), since we count non-words like breaths, endnotes, etc., as part of the prosodic sentence.)
rez007$tokenDF = addFieldLocal(rez007$tokenDF, "pSentLength", inLength(pSentOrder, isWord = (text != "<0>")), type = "complex", groupField = "pSent")
rez007$tokenDF = addFieldLocal(rez007$tokenDF, "isPSentFinal", isFinal(pSentOrder, pSentLength))
head(rez007$tokenDF %>% select(id, text, pSentOrder, pSentLength, isPSentFinal))
#> # A tibble: 6 × 5
#> id text pSentOrder pSentLength isPSentFinal
#> <chr> <chr> <chr> <int> <lgl>
#> 1 31F282855E95E (...) 1 14 FALSE
#> 2 363C1D373B2F7 God 2 14 FALSE
#> 3 3628E4BD4CC05 , 3 14 FALSE
#> 4 37EFCBECFD691 I 4 14 FALSE
#> 5 12D67756890C1 said 5 14 FALSE
#> 6 936363B71D59 I 6 14 FALSE
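The reason isFinal() needs the length is easy to see in a base R analogue: an element is initial when its order is 1, but it is final only when its order equals the size of its group. The vectors below are made up, and the sketch is not rezonateR's implementation of isInitial(), inLength() or isFinal().

```r
# Hypothetical order values for six tokens in three larger structures.
ord   <- c(1, 2, 3, 1, 2, 1)
pSent <- cumsum(ord == 1)

# Initial: the order is 1.
isInit <- ord == 1

# Length of the larger structure, repeated for each of its members.
len <- ave(ord, pSent, FUN = length)

# Final: the order equals the length of the structure.
isFin <- ord == len

print(data.frame(ord, pSent, isInit, len, isFin))
```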
getBiluoFromOrder() is similar in requiring order and length. Let’s get the BILUO values for prosodic sentences:
rez007$tokenDF = addFieldLocal(rez007$tokenDF, "pSentBiluo", getBiluoFromOrder(pSentOrder, pSentLength))
head(rez007$tokenDF %>% select(id, text, pSentOrder, pSentLength, pSentBiluo))
#> # A tibble: 6 × 5
#> id text pSentOrder pSentLength pSentBiluo
#> <chr> <chr> <chr> <int> <chr>
#> 1 31F282855E95E (...) 1 14 B
#> 2 363C1D373B2F7 God 2 14 I
#> 3 3628E4BD4CC05 , 3 14 I
#> 4 37EFCBECFD691 I 4 14 I
#> 5 12D67756890C1 said 5 14 I
#> 6 936363B71D59 I 6 14 I
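The mapping from order and length to BILUO labels can be spelled out as a small decision table. The base R sketch below uses made-up vectors and assumes that an order of 0 marks an element outside any larger structure; that convention is an assumption of this example, and the code is not the implementation of getBiluoFromOrder().

```r
# Hypothetical order and length values; the last element is assumed
# to lie outside any larger structure (order 0).
ord <- c(1, 2, 3, 1, 2, 1, 0)
len <- c(3, 3, 3, 2, 2, 1, 0)

biluo <- ifelse(ord == 0, "O",          # outside any structure
         ifelse(len == 1, "U",          # only member of its structure
         ifelse(ord == 1, "B",          # beginning
         ifelse(ord == len, "L", "I"))))  # last, otherwise intermediate

print(biluo)
```

The single-member case has to be tested before the beginning case, since a lone element has order 1 as well.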
Onwards!
Let’s practice saving our data again:
savePath = "rez007.Rdata"
rez_save(rez007, savePath)
#> Saving rezrObj ...
Now that you know how to deal with time and sequence, it will be even easier to work with rezonateR. If you’re following the whole tutorial series, the next tutorial is vignette("edit_easyEdit"), where you will start learning to add automatic annotations to rezrDFs!