Time and sequence
time_seq.Rmd
Getting started
In this tutorial, we will learn about how time and sequence are
handled in rezonateR. More features relating to more
fine-grained time may be available later, so watch this space!
This file will use the file saved at the end of
vignette("import_save_basics"). You don’t have to have read
that tutorial beforehand, though it may be helpful if you are new to
rezonateR.
library(rezonateR)
path = system.file("extdata", "rez007_time.Rdata", package = "rezonateR", mustWork = TRUE)
rez007 = rez_load(path)
#> Loading rezrObj ...
Token order and sequence
Rezonator by default provides two fields related to the position of a
token, which you will see in tokenDF as columns:
- docTokenSeq - refers to the order of a token within the entire text
- tokenOrder - refers to the position of a token within its intonation unit
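As a rough illustration of how the two fields relate, the base-R sketch below derives both from a toy token table. The data frame `toy` and its `unit` labels are made up for this example; rezonateR supplies these fields for you, and this is not how the package computes them internally.

```r
# Toy token table (hypothetical data, not read from a Rezonator file):
# each row is a token, and `unit` identifies its intonation unit.
toy = data.frame(
  unit = c("u1", "u1", "u1", "u2", "u2", "u2"),
  text = c("(...)", "God", ",", "I", "said", "I")
)

# docTokenSeq: running position of each token in the whole text.
toy$docTokenSeq = seq_len(nrow(toy))

# tokenOrder: position of each token, restarting at 1 in every unit.
toy$tokenOrder = ave(toy$docTokenSeq, toy$unit, FUN = seq_along)

print(toy)
```

The sketch assumes the rows are already sorted in document order.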
Both orders count all tokens. In the Santa Barbara Corpus text we are
using, this includes endnotes (such as "," and "."), transcriptions of
vocalisms (such as (H) for in-breaths and @@@ for laughter), and so on.
You can see these in the tokenDF of our sample file here:
head(rez007$tokenDF)
#> # A tibble: 6 × 19
#> id doc unit docTo…¹ token…² kind place text trans…³ endNote order
#> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 31F282855… sbc0… 2AD1… 1 1 Pause "" (...) (...) "" 1
#> 2 363C1D373… sbc0… 2AD1… 2 2 Word "1" God God "" 2
#> 3 3628E4BD4… sbc0… 2AD1… 3 3 EndN… "" , , "conti… 3
#> 4 37EFCBECF… sbc0… BDD7… 4 1 Word "1" I I "" 1
#> 5 12D677568… sbc0… BDD7… 5 2 Word "2" said said "" 2
#> 6 936363B71… sbc0… BDD7… 6 3 Word "3" I I "" 3
#> # … with 8 more variables: negPlace <chr>, corpusSeq <chr>, pSentOrder <chr>,
#> # POS_dft <chr>, tokenSeq <chr>, chunkType <chr>, turnOrder <chr>,
#> # largerChunk <chr>, and abbreviated variable names ¹docTokenSeq,
#> # ²tokenOrder, ³transcript
#> # ℹ Use `colnames()` to see all variable names
‘Larger’ elements that span multiple tokens have four token sequence-related fields:
- docTokenSeqFirst - refers to the docTokenSeq of the first token.
- docTokenSeqLast - refers to the docTokenSeq of the last token.
- tokenOrderFirst - refers to the tokenOrder of the first token.
- tokenOrderLast - refers to the tokenOrder of the last token.
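The relationship between these four fields and the token-level fields can be sketched in base R as follows. The table `toyTokens`, its `chunk` column and its values are hypothetical and only serve the illustration; this is not rezonateR's own implementation.

```r
# Toy tokens belonging to two hypothetical chunks (values are illustrative).
toyTokens = data.frame(
  chunk       = c("c1", "c1", "c1", "c2", "c2"),
  text        = c("Stay", "up", "late", "the", "morning"),
  tokenOrder  = c(3, 4, 5, 11, 12),
  docTokenSeq = c(17, 18, 19, 31, 32)
)

# For each chunk, keep the values of its first and last token.
firstLast = do.call(rbind, lapply(split(toyTokens, toyTokens$chunk), function(d) {
  d = d[order(d$docTokenSeq), ]  # make sure tokens are in document order
  data.frame(
    chunk            = d$chunk[1],
    tokenOrderFirst  = d$tokenOrder[1],
    docTokenSeqFirst = d$docTokenSeq[1],
    tokenOrderLast   = d$tokenOrder[nrow(d)],
    docTokenSeqLast  = d$docTokenSeq[nrow(d)]
  )
}))

print(firstLast)
```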
You can see these fields in action in the chunkDF:
head(rez007$chunkDF$refexpr %>% select(id, doc, name, text, tokenOrderFirst, docTokenSeqFirst, tokenOrderLast, docTokenSeqLast))
#> # A tibble: 6 × 8
#> id doc name text token…¹ docTo…² token…³ docTo…⁴
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 35E3E0AB6803A sbc007 Chunk 1 Stay up late 3 17 5 19
#> 2 1F6B5F0B3FF59 sbc007 Chunk 2 the purpose of … 5 25 12 32
#> 3 24FE2B219BD21 sbc007 Chunk 55 getting up in t… 8 28 12 32
#> 4 158B579C1BA49 sbc007 Chunk 3 the morning 11 31 12 32
#> 5 2B6521E881365 sbc007 Chunk 4 all this other … 2 144 5 147
#> 6 5B854594DD34 sbc007 Chunk 5 the way (...) t… 5 156 10 161
#> # … with abbreviated variable names ¹tokenOrderFirst, ²docTokenSeqFirst,
#> # ³tokenOrderLast, ⁴docTokenSeqLast
Setting isWord status
The Santa Barbara Corpus by default contains other token position fields,
provided through the tagMap in the Rezonator edition, such as
these:
- corpusSeq - the position of a token within the entire corpus.
- place - the position of a word within the intonation unit, excluding elements like endnotes and vocalisms.
- negPlace - the position of a word within the intonation unit, counting backwards from the last token.
However, if you are working with your own texts rather than the Santa
Barbara Corpus, you will not have access to place by
default. We will therefore have to create place by
ourselves.
To create something that functions similarly to place,
you must first define what a word is. For example, if you are dealing
with a very ‘clean’ transcription that ignores elements like breaths and
laughter, then you may simply create a regular expression that captures
all the punctuation.
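A minimal sketch of such a regex-based definition in base R is shown below. The token vector and the pattern are assumptions made for illustration; the right pattern depends on your own transcription conventions.

```r
# Hypothetical tokens from a 'clean' transcription.
text = c("God", ",", "I", "said", ".", "do", "n't", "?")

# A token counts as a word unless it consists only of punctuation.
isWord = !grepl("^[[:punct:]]+$", text)

print(data.frame(text, isWord))
```

Tokens such as n't contain letters as well as punctuation, so under this pattern they still count as words.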
The function addIsWordField adds a column
isWord to a rezrDF or rezrObj
stating whether a token is a word. For the Santa Barbara Corpus, this is
simple since the kind column is set to "Word"
for actual words. Thus, we can use the expression
kind == "Word" for the definition of what counts as a word.
This expression is passed to addIsWordField as its second
parameter (the first parameter being the rezrDF or
rezrObj). Note that if addIsWordField is
called for a rezrObj, then both tokenDF and
entryDF will have this new field.
By default, addIsWordField also adds the fields
docWordSeq and wordOrder (=place)
to the tokenDF and entryDF, and the columns
wordOrderFirst, wordOrderLast,
docWordSeqFirst and docWordSeqLast will be
added to unitDF, chunkDF and
trackDF. These work similarly to their token
counterparts, except non-words are given 0 values, and only words are
counted in determining order.
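The counting logic can be sketched in base R on a toy table modelled on the first six tokens of the sample text. The `unit` labels are hypothetical, and the code only illustrates the idea; it is not what addIsWordField does internally.

```r
# First six tokens of the sample text, with hypothetical unit labels.
toy = data.frame(
  unit   = c("u1", "u1", "u1", "u2", "u2", "u2"),
  text   = c("(...)", "God", ",", "I", "said", "I"),
  isWord = c(FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
)

# docWordSeq: count words across the whole text; non-words get 0.
toy$docWordSeq = ifelse(toy$isWord, cumsum(toy$isWord), 0)

# wordOrder: count words within each unit; non-words get 0.
wordCount = ave(as.integer(toy$isWord), toy$unit, FUN = cumsum)
toy$wordOrder = ifelse(toy$isWord, wordCount, 0)

print(toy)
```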
rez007 = addIsWordField(rez007, kind == "Word")
head(rez007$tokenDF %>% select(id, tokenOrder, docTokenSeq, wordOrder, docWordSeq))
#> # A tibble: 6 × 5
#> id tokenOrder docTokenSeq wordOrder docWordSeq
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 31F282855E95E 1 1 0 0
#> 2 363C1D373B2F7 2 2 1 1
#> 3 3628E4BD4CC05 3 3 0 0
#> 4 37EFCBECFD691 1 4 1 2
#> 5 12D67756890C1 2 5 2 3
#> 6 936363B71D59 3 6 3 4
head(rez007$chunkDF$refexpr %>% select(id, tokenOrderFirst, tokenOrderLast, docTokenSeqFirst, docTokenSeqLast, wordOrderFirst, wordOrderLast, docWordSeqFirst, docWordSeqLast))
#> # A tibble: 6 × 9
#> id tokenO…¹ token…² docTo…³ docTo…⁴ wordO…⁵ wordO…⁶ docWo…⁷ docWo…⁸
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 35E3E0AB6803A 3 5 17 19 1 3 12 14
#> 2 1F6B5F0B3FF59 5 12 25 32 3 10 17 24
#> 3 24FE2B219BD21 8 12 28 32 6 10 20 24
#> 4 158B579C1BA49 11 12 31 32 9 10 23 24
#> 5 2B6521E881365 2 5 144 147 2 5 96 99
#> 6 5B854594DD34 5 10 156 161 2 6 103 107
#> # … with abbreviated variable names ¹tokenOrderFirst, ²tokenOrderLast,
#> # ³docTokenSeqFirst, ⁴docTokenSeqLast, ⁵wordOrderFirst, ⁶wordOrderLast,
#> # ⁷docWordSeqFirst, ⁸docWordSeqLast
Unit order
The ordering of units, unitSeq, is not available to
rezrDFs other than unitDF by default. The
addUnitSeq() function adds unitSeq to other
rezrDFs. This function allows you to set which entity type (and optionally,
which layer) should receive it. Whichever entity type you specify, everything
‘below’ it will also get unit orders. For example, if you specify ‘track’
as the level at which you want unit order, then tokenDF and
chunkDF will get it too. Similarly, if you specify
‘stack’ as the level, then cardDF will also
get it.
rez007 = addUnitSeq(rez007, "track")
rez007 = addUnitSeq(rez007, "stack")
head(rez007$trackDF$default %>% select(id, text, unitSeqFirst, unitSeqLast))
#> # A tibble: 6 × 4
#> id text unitSeqFirst unitSeqLast
#> <chr> <chr> <dbl> <dbl>
#> 1 1096E4AFFFE65 I 2 2
#> 2 92F20ACA5F06 I 2 2
#> 3 7E5BB65072C was n't gon na do 2 2
#> 4 1F74D2B049FA4 this 2 2
#> 5 2485C4F740FC0 <0> 3 3
#> 6 1BF2260B4AB78 Stay up late 3 3
head(rez007$stackDF %>% select(id, name, unitSeqFirst, unitSeqLast))
#> # A tibble: 6 × 4
#> id name unitSeqFirst unitSeqLast
#> <chr> <chr> <dbl> <dbl>
#> 1 16CE3C9693F60 Stack 91 391 392
#> 2 AC5AC90A6E65 Stack 175 689 690
#> 3 1A87960035B5A Stack 38 225 225
#> 4 31A9F84B7EE07 Stack 133 540 541
#> 5 1E013C5A85CA9 Stack 47 247 251
#> 6 3872B2E956219 Stack 154 609 609
Note that chunkDF and the layers that depend on it get
unitSeqFirst and unitSeqLast rather than a single unitSeq, because of a
chunk combination feature that will be discussed in the trees
chapter.
The function getSeqBounds() is mostly used
rezonateR-internally, though more advanced users may use it
to create functions similar to addIsWordField() and
addUnitSeq().
Advanced sequence operations
In addition to the two most common operations we covered above,
rezonateR also has other functions to deal with common
problems related to time and sequence. These require knowledge of
editing, so if you are not yet familiar with editing rezrDFs, you can
skip this for now and come back after having read
vignette("edit_easyEdit").
Generally, when elements in a file belong to a larger structure,
there are three ways of representing them inside the rezrDF that houses
the smaller structure:
- Sequence of the larger structure: unitSeq in tokenDF is an example.
- Order (position) of the smaller element within the larger structure: tokenOrder in tokenDF is an example.
- BILUO: Is the current element the beginning (B) of the larger structure, an intermediate (I) element, the last (L) element, the only (U) element, or not within the larger structure (O)? Generally used for artificial intelligence applications.
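To make the first two representations concrete, here is a base-R sketch that converts between them on toy data. The helpers `orderFromSeq` and `seqFromOrder` are hypothetical names written for this example, not rezonateR functions, and they assume the elements are sorted in document order.

```r
# Representation 1: sequence of the larger structure for each element.
unitSeq = c(1, 1, 1, 2, 2, 2, 2, 3)

# Hypothetical helper: position of each element within its larger structure.
orderFromSeq = function(seqVals) {
  ave(seq_along(seqVals), seqVals, FUN = seq_along)
}

# Hypothetical helper: a new larger structure starts whenever order resets to 1.
seqFromOrder = function(orderVals) {
  cumsum(orderVals == 1)
}

tokenOrder = orderFromSeq(unitSeq)     # representation 2
unitSeq2   = seqFromOrder(tokenOrder)  # back to representation 1

print(data.frame(unitSeq, tokenOrder, unitSeq2))
```

Going from order back to sequence only works when every larger structure begins with an element whose order is 1.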
The following functions are currently available for dealing with these and related issues:
- getOrderFromSeq(): Converts from the first representation to the second one.
- getSeqFromOrder(): Converts from the second representation to the first one.
- isInitial(): Is the current element the initial member of a larger structure?
- isFinal(): Is the current element the final member of a larger structure? Used after inLength().
- getBiluoFromOrder(): Converts from the second representation to the third one.
getOrderFromSeq() is straightforward. For example, if we
want to replicate tokenOrder from unitSeq, we
can do this:
rez007$tokenDF = addFieldLocal(rez007$tokenDF, "tokenOrder2", getOrderFromSeq(unitSeq))
#Check that tokenOrder and tokenOrder2 are identical
all(rez007$tokenDF$tokenOrder == rez007$tokenDF$tokenOrder2)
#> [1] TRUE
getSeqFromOrder() is also straightforward. If we want to
get the prosodic sentence in which a token is found, which is not given
in the Rezonator version of the Santa Barbara Corpus, we can get it from
the place within the prosodic sentence (pSentOrder):
rez007$tokenDF = addFieldLocal(rez007$tokenDF, "pSent", getSeqFromOrder(pSentOrder))
head(rez007$tokenDF %>% select(id, text, pSent, pSentOrder))
#> # A tibble: 6 × 4
#> id text pSent pSentOrder
#> <chr> <chr> <dbl> <chr>
#> 1 31F282855E95E (...) 1 1
#> 2 363C1D373B2F7 God 1 2
#> 3 3628E4BD4CC05 , 1 3
#> 4 37EFCBECFD691 I 1 4
#> 5 12D67756890C1 said 1 5
#> 6 936363B71D59 I 1 6
isInitial() and isFinal() tell us whether
something is the initial or final member of a larger unit.
isInitial() is simple: Its only parameter is the order
(i.e. second representation).
For example, the following code determines whether a token is the start of the prosodic sentence:
rez007$tokenDF = addFieldLocal(rez007$tokenDF, "isPSentInitial", isInitial(pSentOrder))
head(rez007$tokenDF %>% select(id, text, pSentOrder, isPSentInitial))
#> # A tibble: 6 × 4
#> id text pSentOrder isPSentInitial
#> <chr> <chr> <chr> <lgl>
#> 1 31F282855E95E (...) 1 TRUE
#> 2 363C1D373B2F7 God 2 FALSE
#> 3 3628E4BD4CC05 , 3 FALSE
#> 4 37EFCBECFD691 I 4 FALSE
#> 5 12D67756890C1 said 5 FALSE
#> 6 936363B71D59 I 6 FALSE
inLength() gives the length of the larger unit, and its
result is used by isFinal(), which takes the length of the
larger unit in addition to the order value. (Note that for
pSentLength, isWord is simply set to the text not
being a zero (<0>), since we count non-words like breaths, endnotes, etc., as
part of the prosodic sentence.)
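The logic behind length, finality and BILUO tags can be sketched in base R on toy data. The vectors and the names `isFinalToy` and `biluoToy` are hypothetical, and the sketch leaves out the O tag; it illustrates the idea rather than reproducing rezonateR's functions.

```r
# Toy data: the larger structure each element belongs to, and its order within it.
group = c(1, 1, 1, 2, 3, 3)
ord   = c(1, 2, 3, 1, 1, 2)

# Length of the larger structure, repeated for each of its members.
len = ave(ord, group, FUN = length)

# An element is final when its order equals the length of its structure.
isFinalToy = ord == len

# BILUO tags from order and length (the O tag is not covered here).
biluoToy = ifelse(len == 1, "U",
           ifelse(ord == 1, "B",
           ifelse(ord == len, "L", "I")))

print(data.frame(group, ord, len, isFinalToy, biluoToy))
```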
rez007$tokenDF = addFieldLocal(rez007$tokenDF, "pSentLength", inLength(pSentOrder, isWord = (text != "<0>")), type = "complex", groupField = "pSent")
rez007$tokenDF = addFieldLocal(rez007$tokenDF, "isPSentFinal", isFinal(pSentOrder, pSentLength))
head(rez007$tokenDF %>% select(id, text, pSentOrder, pSentLength, isPSentFinal))
#> # A tibble: 6 × 5
#> id text pSentOrder pSentLength isPSentFinal
#> <chr> <chr> <chr> <int> <lgl>
#> 1 31F282855E95E (...) 1 14 FALSE
#> 2 363C1D373B2F7 God 2 14 FALSE
#> 3 3628E4BD4CC05 , 3 14 FALSE
#> 4 37EFCBECFD691 I 4 14 FALSE
#> 5 12D67756890C1 said 5 14 FALSE
#> 6 936363B71D59 I 6 14 FALSE
getBiluoFromOrder() is similar in requiring order and
length. Let’s get the BILUO values for prosodic sentences:
rez007$tokenDF = changeFieldLocal(rez007$tokenDF, "pSentBiluo", getBiluoFromOrder(pSentOrder, pSentLength))
head(rez007$tokenDF %>% select(id, text, pSentOrder, pSentLength, pSentBiluo))
#> # A tibble: 6 × 5
#> id text pSentOrder pSentLength pSentBiluo
#> <chr> <chr> <chr> <int> <chr>
#> 1 31F282855E95E (...) 1 14 B
#> 2 363C1D373B2F7 God 2 14 I
#> 3 3628E4BD4CC05 , 3 14 I
#> 4 37EFCBECFD691 I 4 14 I
#> 5 12D67756890C1 said 5 14 I
#> 6 936363B71D59 I 6 14 I
Onwards!
Let’s practice saving our data again:
savePath = "rez007.Rdata"
rez_save(rez007, savePath)
#> Saving rezrObj ...
Now that you know how to deal with time and sequence, it will be even
easier to work with rezonateR. If you’re following the
whole tutorial series, the next tutorial is
vignette("edit_easyEdit"), where you will start learning to
add automatic annotations to rezrDFs!