Getting started

In this tutorial, we will learn how time and sequence are handled in rezonateR. Features for more fine-grained timing may become available later, so watch this space!

This file will use the file saved at the end of vignette("import_save_basics"). You don’t have to have read that tutorial beforehand, though it may be helpful if you are new to rezonateR.

path = system.file("extdata", "rez007_time.Rdata", package = "rezonateR", mustWork = TRUE)
rez007 = rez_load(path)
#> Loading rezrObj ...

Token order and sequence

Rezonator by default provides two fields related to the position of a token, which you will see in tokenDF as columns:

  • docTokenSeq - refers to the order of a token within the entire text
  • tokenOrder - refers to the position of a token within its intonation unit

Both orders count all tokens. In the Santa Barbara Corpus text we are using, this includes endnotes (such as , and .), transcriptions of vocalisms (such as (H) for in-breaths and @@@ for laughter), and so on. You can see these in the tokenDF of our sample file here:

head(rez007$tokenDF)
#> # A tibble: 6 × 19
#>   id         doc   unit  docTo…¹ token…² kind  place text  trans…³ endNote order
#>   <chr>      <chr> <chr>   <dbl>   <dbl> <chr> <chr> <chr> <chr>   <chr>   <chr>
#> 1 31F282855… sbc0… 2AD1…       1       1 Pause ""    (...) (...)   ""      1    
#> 2 363C1D373… sbc0… 2AD1…       2       2 Word  "1"   God   God     ""      2    
#> 3 3628E4BD4… sbc0… 2AD1…       3       3 EndN… ""    ,     ,       "conti… 3    
#> 4 37EFCBECF… sbc0… BDD7…       4       1 Word  "1"   I     I       ""      1    
#> 5 12D677568… sbc0… BDD7…       5       2 Word  "2"   said  said    ""      2    
#> 6 936363B71… sbc0… BDD7…       6       3 Word  "3"   I     I       ""      3    
#> # … with 8 more variables: negPlace <chr>, corpusSeq <chr>, pSentOrder <chr>,
#> #   POS_dft <chr>, tokenSeq <chr>, chunkType <chr>, turnOrder <chr>,
#> #   largerChunk <chr>, and abbreviated variable names ¹​docTokenSeq,
#> #   ²​tokenOrder, ³​transcript
#> # ℹ Use `colnames()` to see all variable names
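
To make the difference between the two fields concrete, here is a minimal base-R sketch on a toy data frame. It does not use rezonateR; the data frame `tokens` and its `unit` labels are made up for illustration, loosely modelled on the rows above.

```r
# Toy tokens in two intonation units (hypothetical data, not read from the file).
tokens <- data.frame(
  unit = c("u1", "u1", "u1", "u2", "u2", "u2"),
  text = c("(...)", "God", ",", "I", "said", "I"),
  stringsAsFactors = FALSE
)

# Position within the whole text: a running count over all tokens.
tokens$docTokenSeq <- seq_len(nrow(tokens))

# Position within each unit: the count restarts for every unit.
tokens$tokenOrder <- ave(tokens$docTokenSeq, tokens$unit, FUN = seq_along)

print(tokens)
```

Note that the pause and the endnote are counted just like ordinary words in both columns.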

‘Larger’ elements that span multiple tokens have four token sequence-related fields:

  • docTokenSeqFirst - refers to the docTokenSeq of the first token.
  • docTokenSeqLast - refers to the docTokenSeq of the last token.
  • tokenOrderFirst - refers to the tokenOrder of the first token.
  • tokenOrderLast - refers to the tokenOrder of the last token.

You can see these fields in action in the chunkDF:

head(rez007$chunkDF$refexpr %>% select(id, doc, name, text, tokenOrderFirst, docTokenSeqFirst, tokenOrderLast, docTokenSeqLast))
#> # A tibble: 6 × 8
#>   id            doc    name     text             token…¹ docTo…² token…³ docTo…⁴
#>   <chr>         <chr>  <chr>    <chr>              <dbl>   <dbl>   <dbl>   <dbl>
#> 1 35E3E0AB6803A sbc007 Chunk 1  Stay up late           3      17       5      19
#> 2 1F6B5F0B3FF59 sbc007 Chunk 2  the purpose of …       5      25      12      32
#> 3 24FE2B219BD21 sbc007 Chunk 55 getting up in t…       8      28      12      32
#> 4 158B579C1BA49 sbc007 Chunk 3  the morning           11      31      12      32
#> 5 2B6521E881365 sbc007 Chunk 4  all this other …       2     144       5     147
#> 6 5B854594DD34  sbc007 Chunk 5  the way (...) t…       5     156      10     161
#> # … with abbreviated variable names ¹​tokenOrderFirst, ²​docTokenSeqFirst,
#> #   ³​tokenOrderLast, ⁴​docTokenSeqLast
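
The four fields are simply the token-level values of the first and last token of the larger element. The following base-R sketch illustrates this with a toy three-token chunk; the token rows are hypothetical and this is not how rezonateR computes the fields internally.

```r
# Hypothetical tokens belonging to one chunk.
chunkTokens <- data.frame(
  text        = c("Stay", "up", "late"),
  tokenOrder  = c(3, 4, 5),
  docTokenSeq = c(17, 18, 19),
  stringsAsFactors = FALSE
)

# Identify the first and last token by their position in the text.
first <- chunkTokens[which.min(chunkTokens$docTokenSeq), ]
last  <- chunkTokens[which.max(chunkTokens$docTokenSeq), ]

chunk <- data.frame(
  text             = paste(chunkTokens$text, collapse = " "),
  tokenOrderFirst  = first$tokenOrder,
  docTokenSeqFirst = first$docTokenSeq,
  tokenOrderLast   = last$tokenOrder,
  docTokenSeqLast  = last$docTokenSeq
)
print(chunk)
```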

Setting isWord status

The Santa Barbara Corpus by default also contains other position-related fields, provided through the tagMap in the Rezonator edition, such as these:

  • corpusSeq - the position of a token within the entire corpus.
  • place - the position of a word within the intonation unit, excluding elements like endnotes and vocalisms.
  • negPlace - the position of a word within the intonation unit, counting backwards from the last token.

However, if you are working with your own texts rather than the Santa Barbara Corpus, you will not have access to place by default. We will therefore have to create it ourselves.

To create something that functions similarly to place, you must first define what a word is. For example, if you are dealing with a very ‘clean’ transcription that ignores elements like breaths and laughter, then you may simply create a regular expression that captures all the punctuation, and treat every other token as a word.
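
As a rough illustration of the regular expression approach, the base-R sketch below treats any token consisting only of punctuation characters as a non-word. The token vector is made up, and the pattern is only one possible definition; you would need to adapt it to your own transcription conventions.

```r
# Hypothetical tokens from a 'clean' transcription.
text <- c("God", ",", "I", "said", ".", "was", "n't", "?")

# A token made up entirely of punctuation characters is not a word.
isPunct <- grepl("^[[:punct:]]+$", text)
isWord  <- !isPunct

print(data.frame(text, isWord))
```

Here n't still counts as a word, because the pattern only matches tokens that contain nothing but punctuation.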

The function addIsWordField adds a column isWord to a rezrDF or rezrObj stating whether a token is a word. For the Santa Barbara Corpus, this is simple since the kind column is set to "Word" for actual words. Thus, we can use the expression kind == "Word" as the definition of what counts as a word. This expression is passed to addIsWordField as its second parameter (the first parameter being the rezrDF or rezrObj). Note that if addIsWordField is called on a rezrObj, then both tokenDF and entryDF will have this new field.

By default, addIsWordField also adds the fields docWordSeq and wordOrder (=place) to the tokenDF and entryDF, and adds the columns wordOrderFirst, wordOrderLast, docWordSeqFirst and docWordSeqLast to unitDF, chunkDF and trackDF. These work similarly to their token counterparts, except that non-words are given 0 values, and only words are counted in determining order.

rez007 = addIsWordField(rez007, kind == "Word")
head(rez007$tokenDF %>% select(id, tokenOrder, docTokenSeq, wordOrder, docWordSeq))
#> # A tibble: 6 × 5
#>   id            tokenOrder docTokenSeq wordOrder docWordSeq
#>   <chr>              <dbl>       <dbl>     <dbl>      <dbl>
#> 1 31F282855E95E          1           1         0          0
#> 2 363C1D373B2F7          2           2         1          1
#> 3 3628E4BD4CC05          3           3         0          0
#> 4 37EFCBECFD691          1           4         1          2
#> 5 12D67756890C1          2           5         2          3
#> 6 936363B71D59           3           6         3          4
head(rez007$chunkDF$refexpr %>% select(id, tokenOrderFirst, tokenOrderLast, docTokenSeqFirst, docTokenSeqLast, wordOrderFirst, wordOrderLast, docWordSeqFirst, docWordSeqLast))
#> # A tibble: 6 × 9
#>   id            tokenO…¹ token…² docTo…³ docTo…⁴ wordO…⁵ wordO…⁶ docWo…⁷ docWo…⁸
#>   <chr>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#> 1 35E3E0AB6803A        3       5      17      19       1       3      12      14
#> 2 1F6B5F0B3FF59        5      12      25      32       3      10      17      24
#> 3 24FE2B219BD21        8      12      28      32       6      10      20      24
#> 4 158B579C1BA49       11      12      31      32       9      10      23      24
#> 5 2B6521E881365        2       5     144     147       2       5      96      99
#> 6 5B854594DD34         5      10     156     161       2       6     103     107
#> # … with abbreviated variable names ¹​tokenOrderFirst, ²​tokenOrderLast,
#> #   ³​docTokenSeqFirst, ⁴​docTokenSeqLast, ⁵​wordOrderFirst, ⁶​wordOrderLast,
#> #   ⁷​docWordSeqFirst, ⁸​docWordSeqLast
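
The word-based counts can be reproduced by hand with cumulative sums. The following base-R sketch shows the logic on toy data that mirrors the first six tokens above; it is an illustration of the idea under our own assumptions, not rezonateR's implementation, and the `unit` labels are made up.

```r
# Toy tokens: unit membership and whether each token is a word.
tokens <- data.frame(
  unit   = c("u1", "u1", "u1", "u2", "u2", "u2"),
  isWord = c(FALSE, TRUE, FALSE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)
w <- as.integer(tokens$isWord)

# Running count of words over the whole text; non-words get 0.
tokens$docWordSeq <- cumsum(w) * w

# The same count, restarted within each unit; non-words get 0.
tokens$wordOrder <- ave(w, tokens$unit, FUN = cumsum) * w

print(tokens)
```

The resulting wordOrder and docWordSeq columns match the values shown in the tokenDF output above.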

Unit order

The ordering of units, unitSeq, is not available to rezrDFs other than unitDF by default. The addUnitSeq() function adds unitSeq to other rezrDFs. This function allows you to specify the entity type (and optionally, the layer) to which unit order should be added. Whichever entity type you specify, everything ‘below’ it will also get unit orders. For example, if you specify ‘track’ as the level at which you want unit order, then tokenDF and chunkDF will get it too. Similarly, if you specify ‘stack’ as the level, then cardDF will also get it.

rez007 = addUnitSeq(rez007, "track")
rez007 = addUnitSeq(rez007, "stack")
head(rez007$trackDF$default %>% select(id, text, unitSeqFirst, unitSeqLast))
#> # A tibble: 6 × 4
#>   id            text              unitSeqFirst unitSeqLast
#>   <chr>         <chr>                    <dbl>       <dbl>
#> 1 1096E4AFFFE65 I                            2           2
#> 2 92F20ACA5F06  I                            2           2
#> 3 7E5BB65072C   was n't gon na do            2           2
#> 4 1F74D2B049FA4 this                         2           2
#> 5 2485C4F740FC0 <0>                          3           3
#> 6 1BF2260B4AB78 Stay up late                 3           3
head(rez007$stackDF %>% select(id, name, unitSeqFirst, unitSeqLast))
#> # A tibble: 6 × 4
#>   id            name      unitSeqFirst unitSeqLast
#>   <chr>         <chr>            <dbl>       <dbl>
#> 1 16CE3C9693F60 Stack 91           391         392
#> 2 AC5AC90A6E65  Stack 175          689         690
#> 3 1A87960035B5A Stack 38           225         225
#> 4 31A9F84B7EE07 Stack 133          540         541
#> 5 1E013C5A85CA9 Stack 47           247         251
#> 6 3872B2E956219 Stack 154          609         609

Note that chunkDF and the layers that depend on it get unitSeqFirst and unitSeqLast, because of a chunk combination feature that will be discussed in the trees chapter.
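
Conceptually, adding unit order to a 'lower' table is a lookup against the unit table, and elements that can span several units get a first and a last value. The base-R sketch below shows this idea on hypothetical tables; the names `units`, `tokens` and `chunkTokens` are invented for the example and this is not the code that addUnitSeq() runs.

```r
# Hypothetical unit table with its ordering.
units <- data.frame(unit = c("u1", "u2", "u3"), unitSeq = 1:3,
                    stringsAsFactors = FALSE)

# Hypothetical tokens, each belonging to one unit.
tokens <- data.frame(
  token = c("t1", "t2", "t3", "t4", "t5"),
  unit  = c("u1", "u1", "u2", "u2", "u3"),
  stringsAsFactors = FALSE
)

# Each token inherits the unitSeq of its unit.
tokens$unitSeq <- units$unitSeq[match(tokens$unit, units$unit)]

# A multi-token element may cross a unit boundary, so it gets two values.
chunkTokens <- c("t2", "t3")
inChunk <- tokens$token %in% chunkTokens
chunk <- data.frame(
  unitSeqFirst = min(tokens$unitSeq[inChunk]),
  unitSeqLast  = max(tokens$unitSeq[inChunk])
)
print(chunk)
```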

The function getSeqBounds() is mostly used internally by rezonateR, though more advanced users may use it to create functions similar to addIsWordField() and addUnitSeq().

Advanced sequence operations

In addition to the two most common operations we covered above, rezonateR also has other functions to deal with common problems related to time and sequence. These require knowledge of editing, so you may want to skip this section for now and come back after having read vignette("edit_easyEdit").

Generally, when elements in a file belong to a larger structure, there are three ways of representing them inside the rezrDF that houses the smaller structure:

  • Sequence of the larger structure: unitSeq in tokenDF is an example.
  • Order (position) of the smaller element within the larger structure: tokenOrder in tokenDF is an example.
  • BILUO: Is the current element the beginning (B) of the larger structure, an intermediate (I) element, the last (L) element, the only (U) element, or not within the larger structure (O)? Generally used for artificial intelligence applications.

The following functions are currently available for dealing with these and related issues:

getOrderFromSeq() is straightforward. For example, if we want to replicate tokenOrder from unitSeq, we can do this:

rez007$tokenDF = addFieldLocal(rez007$tokenDF, "tokenOrder2", getOrderFromSeq(unitSeq))
# Check that tokenOrder and tokenOrder2 are identical
all(rez007$tokenDF$tokenOrder == rez007$tokenDF$tokenOrder2)
#> [1] TRUE
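
For intuition, going from sequence to order amounts to counting upwards within each group of identical sequence values. Here is a base-R sketch of that idea on a made-up vector; it is an illustration only and not the source of getOrderFromSeq().

```r
# Hypothetical unitSeq values for seven consecutive tokens.
unitSeq <- c(1, 1, 1, 2, 2, 2, 3)

# Count upwards within each group of identical sequence values.
tokenOrder <- ave(seq_along(unitSeq), unitSeq, FUN = seq_along)

print(tokenOrder)
```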

getSeqFromOrder() is also straightforward. If we want to get the prosodic sentence in which a token is found, which is not given in the Rezonator version of the Santa Barbara Corpus, we can get it from the place within the prosodic sentence (pSentOrder):

rez007$tokenDF = addFieldLocal(rez007$tokenDF, "pSent", getSeqFromOrder(pSentOrder))
head(rez007$tokenDF %>% select(id, text, pSent, pSentOrder))
#> # A tibble: 6 × 4
#>   id            text  pSent pSentOrder
#>   <chr>         <chr> <dbl> <chr>     
#> 1 31F282855E95E (...)     1 1         
#> 2 363C1D373B2F7 God       1 2         
#> 3 3628E4BD4CC05 ,         1 3         
#> 4 37EFCBECFD691 I         1 4         
#> 5 12D67756890C1 said      1 5         
#> 6 936363B71D59  I         1 6
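
The reverse direction can be sketched in base R as well: every time the order goes back to 1, a new larger unit has started. The vector below is made up, and the sketch assumes that every larger unit begins with an element of order 1; it is not the source of getSeqFromOrder().

```r
# Hypothetical order values, stored as text like pSentOrder above.
pSentOrder <- c("1", "2", "3", "1", "2", "1")
ord <- as.numeric(pSentOrder)

# A new larger unit starts wherever the order is 1.
pSent <- cumsum(ord == 1)

print(pSent)
```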

isInitial() and isFinal() tell us whether something is the initial or final member of a larger unit. isInitial() is simple: its only parameter is the order (i.e. the second representation listed above).

For example, the following code determines whether a token is the start of the prosodic sentence:

rez007$tokenDF = addFieldLocal(rez007$tokenDF, "isPSentInitial", isInitial(pSentOrder))
head(rez007$tokenDF %>% select(id, text, pSentOrder, isPSentInitial))
#> # A tibble: 6 × 4
#>   id            text  pSentOrder isPSentInitial
#>   <chr>         <chr> <chr>      <lgl>         
#> 1 31F282855E95E (...) 1          TRUE          
#> 2 363C1D373B2F7 God   2          FALSE         
#> 3 3628E4BD4CC05 ,     3          FALSE         
#> 4 37EFCBECFD691 I     4          FALSE         
#> 5 12D67756890C1 said  5          FALSE         
#> 6 936363B71D59  I     6          FALSE

inLength() gives the length of the larger unit. Its result is used by isFinal(), which takes the length of the larger unit in addition to the order value. (Note that for pSentLength, isWord is simply set to the text not being a zero (<0>), since we count non-words like breaths, endnotes, etc., as part of the prosodic sentence.)

rez007$tokenDF = addFieldLocal(rez007$tokenDF, "pSentLength", inLength(pSentOrder, isWord = (text != "<0>")), type = "complex", groupField = "pSent")
rez007$tokenDF = addFieldLocal(rez007$tokenDF, "isPSentFinal", isFinal(pSentOrder, pSentLength))
head(rez007$tokenDF %>% select(id, text, pSentOrder, pSentLength, isPSentFinal))
#> # A tibble: 6 × 5
#>   id            text  pSentOrder pSentLength isPSentFinal
#>   <chr>         <chr> <chr>            <int> <lgl>       
#> 1 31F282855E95E (...) 1                   14 FALSE       
#> 2 363C1D373B2F7 God   2                   14 FALSE       
#> 3 3628E4BD4CC05 ,     3                   14 FALSE       
#> 4 37EFCBECFD691 I     4                   14 FALSE       
#> 5 12D67756890C1 said  5                   14 FALSE       
#> 6 936363B71D59  I     6                   14 FALSE
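
The logic behind initial and final flags can be sketched in base R as follows. In this toy version, the length of a larger unit is simply taken to be the highest order value within it, which is an assumption of the sketch rather than a description of what inLength() does.

```r
# Hypothetical order values for three larger units of sizes 3, 2 and 1.
ord  <- c(1, 2, 3, 1, 2, 1)
sent <- cumsum(ord == 1)            # which larger unit each element is in

# Initial elements have order 1.
isInit <- ord == 1

# Length of each larger unit, repeated for every element in it.
len <- ave(ord, sent, FUN = max)

# Final elements are those whose order equals the length.
isFin <- ord == len

print(data.frame(ord, sent, len, isInit, isFin))
```

An element in a larger unit of length 1 is both initial and final.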

getBiluoFromOrder() is similar in requiring order and length. Let’s get the BILUO values for prosodic sentences:

rez007$tokenDF = addFieldLocal(rez007$tokenDF, "pSentBiluo", getBiluoFromOrder(pSentOrder, pSentLength))
head(rez007$tokenDF %>% select(id, text, pSentOrder, pSentLength, pSentBiluo))
#> # A tibble: 6 × 5
#>   id            text  pSentOrder pSentLength pSentBiluo
#>   <chr>         <chr> <chr>            <int> <chr>     
#> 1 31F282855E95E (...) 1                   14 B         
#> 2 363C1D373B2F7 God   2                   14 I         
#> 3 3628E4BD4CC05 ,     3                   14 I         
#> 4 37EFCBECFD691 I     4                   14 I         
#> 5 12D67756890C1 said  5                   14 I         
#> 6 936363B71D59  I     6                   14 I
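
To see how BILUO labels follow from order and length, consider this base-R sketch. The helper `biluo_from_order` is hypothetical and written only for this illustration; in particular, treating an order of 0 as 'outside' (O) is our assumption and may differ from how getBiluoFromOrder() handles such cases.

```r
# Hypothetical helper: derive BILUO labels from order and length.
biluo_from_order <- function(ord, len) {
  out <- rep("I", length(ord))       # default: intermediate
  out[ord == 1] <- "B"               # first element
  out[ord == len] <- "L"             # last element
  out[ord == 1 & len == 1] <- "U"    # only element
  out[ord == 0] <- "O"               # assumed: order 0 means outside
  out
}

ord <- c(1, 2, 3, 1, 2, 1, 0)
len <- c(3, 3, 3, 2, 2, 1, 0)
biluo <- biluo_from_order(ord, len)
print(biluo)
```

The later assignments deliberately overwrite the earlier ones, so that single-element units end up as U rather than B or L.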


Let’s practice saving our data again:

savePath = "rez007.Rdata"
rez_save(rez007, savePath)
#> Saving rezrObj ...

Now that you know how to deal with time and sequence, it will be even easier to work with rezonateR. If you’re following the whole tutorial series, the next tutorial is vignette("edit_easyEdit"), where you will start learning to add automatic annotations to rezrDFs!