Tracking down coreference phenomena • rezonateR

This tutorial discusses the handling of trails and tracks in Rezonator using the EasyTrack series of functions. In more generally accepted linguistic terms, a trail is a coreference chain, and a track is a mention or referential expression within a coreference chain.

We will be using the same Santa Barbara Corpus annotations as before:

library(rezonateR)
path = system.file("extdata", "rez007_track.Rdata", package = "rezonateR", mustWork = TRUE)
rez007 = rez_load(path)
#> Loading rezrObj ...

The file contains coreference annotations for roughly the first fifth of the text, and the .Rdata file imported here has already been processed to include the tree information covered in vignette("trees"). This tutorial will make use of that information.

This tutorial will build towards a very simple toy analysis at the end, using all the changes that have been made to the rezrObj so far, to show the capabilities of rezonateR.

Getting information from previous mentions

Anaphoric and cataphoric distance

In studying coreference, we often want to know the distance from the current mention to the previous mention. EasyTrack takes care of this using a family of functions:

lastMentionUnit() and nextMentionUnit(): Give you the unit ID of the previous and next mention, respectively.
lastMentionToken() and nextMentionToken(): Give you the token ID of the previous and next mention, respectively.
unitsToLastMention() and unitsToNextMention(): Give you the number of units from the current mention to the last mention and to the next mention, respectively.
tokensToLastMention() and tokensToNextMention(): Give you the number of tokens from the current mention to the last mention and to the next mention, respectively.

The first four functions are rarely used in practice, so we will focus on the last four, which build on the first four.

Let’s first find out how many units we are from the previous mention of something using unitsToLastMention(). This corresponds to the gapUnits column that is automatically generated by Rezonator. There are two optional arguments:

unitSeq: The unit order values where the mentions appeared. Here, we use the unitSeqLast column, which is the default value, though unitSeqFirst is also possible.
chain: The column that gives the chain that each track belongs to. Typically there is no reason to touch this parameter; just leave it blank, and the column chain will be used.

The value will be NA if there are no previous mentions:

rez007$trackDF$default = rez007$trackDF$default %>%
  rez_mutate(unitsToLastMention = unitsToLastMention(unitSeqLast))
rez007$trackDF$default %>% select(id, gapUnits, unitsToLastMention) %>% slice(1:20)
#> # A tibble: 20 × 3
#>    id            gapUnits unitsToLastMention
#>    <chr>         <chr>                 <dbl>
#>  1 1096E4AFFFE65 N/A                      NA
#>  2 92F20ACA5F06  0                        NA
#>  3 7E5BB65072C   N/A                      NA
#>  4 1F74D2B049FA4 N/A                      NA
#>  5 2485C4F740FC0 1                         1
#>  6 1BF2260B4AB78 1                         1
#>  7 6B37B5A80F2A  1                         1
#>  8 259C2C2979B6C N/A                      NA
#>  9 1D1F2B7054E32 N/A                      NA
#> 10 1FA3806680C84 N/A                      NA
#> 11 4B32FD84BA10  N/A                      NA
#> 12 3098AB24A0FA6 2                         2
#> 13 2D5885FCA1E15 N/A                      NA
#> 14 1C311FD331AC4 2                         2
#> 15 2E01153F693D3 2                         2
#> 16 28CFE0272CE1C 2                         2
#> 17 24229602BDC4C 2                         2
#> 18 1A598AE39592B N/A                      NA
#> 19 38C628AB4DA4D N/A                      NA
#> 20 36B8918BB6E64 8                         8
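
To make the logic concrete, here is a minimal base R sketch of the idea behind unit distances. It uses a made-up track table (toy_tracks and its values are hypothetical), and it is only an illustration of the concept, not how rezonateR implements unitsToLastMention() internally. Within each chain, we subtract the previous mention's unit from the current mention's unit:

```r
# Hypothetical toy track table, listed in text order (not taken from rez007)
toy_tracks <- data.frame(
  id = c("t1", "t2", "t3", "t4", "t5"),
  chain = c("Mary", "Mary", "habit", "Mary", "habit"),
  unitSeqLast = c(1, 2, 2, 4, 7)
)

# Within each chain, subtract the previous mention's unit from the current one;
# the first mention of a chain has no predecessor, so it gets NA
toy_tracks$unitsToLast <- ave(
  toy_tracks$unitSeqLast, toy_tracks$chain,
  FUN = function(u) c(NA, diff(u))
)
toy_tracks
```

As in the real table, first mentions of a chain come out as NA. How the package treats edge cases, such as two mentions in the same unit, may differ from this simplification.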

Now let’s count the tokens from the last mention using the tokensToLastMention() function instead. This one is a little more complicated, as there are more parameters to fill in:

tokenOrder: Similar to unitSeq, but for counting tokens. Common choices are docTokenSeqFirst, docTokenSeqLast, docWordSeqFirst and docWordSeqLast (see vignette("time_seq") for the last two). By default it’s docTokenSeqLast.
chain: As above.
zeroProtocol: How the positions of zeroes are determined. By default, it is "literal", i.e. the position at which the zero was inserted. If "unitFinal", zeroes will be treated as being located at the end of the unit. If "unitFirst", they will be treated as the first word of the unit.
zeroCond: A condition for determining whether a token is zero. Normally, this is (word column) == "<0>" if the default Rezonator zero is used, though others may prefer "<ZERO>" or similar.
unitSeq: As above. Required when using the unitFinal and unitFirst protocols.
unitTokenSeqName: The name of the tokenSeq column to be used in the unitDF.
unitDF: The rezrDF containing the units.


rez007$trackDF$default = rez007$trackDF$default %>%
  rez_mutate(wordsToLastMention = tokensToLastMention(
    docWordSeqFirst, # Which sequence column to count with
    zeroProtocol = "unitInitial", # How to position zeroes
    zeroCond = (text == "<0>"), # How to identify zeroes
    unitDF = rez007$unitDF, # Needed for the unit-based zero protocols
    unitTokenSeqName = "docWordSeqFirst")) # Needed for the unit-based zero protocols
rez007$trackDF$default %>% select(id, wordsToLastMention) %>% slice(1:20)
#> # A tibble: 20 × 2
#>    id            wordsToLastMention
#>    <chr>                      <dbl>
#>  1 1096E4AFFFE65                 NA
#>  2 92F20ACA5F06                  NA
#>  3 7E5BB65072C                   NA
#>  4 1F74D2B049FA4                 NA
#>  5 2485C4F740FC0                 10
#>  6 1BF2260B4AB78                 10
#>  7 6B37B5A80F2A                   3
#>  8 259C2C2979B6C                 NA
#>  9 1D1F2B7054E32                 NA
#> 10 1FA3806680C84                 NA
#> 11 4B32FD84BA10                  NA
#> 12 3098AB24A0FA6                 13
#> 13 2D5885FCA1E15                 NA
#> 14 1C311FD331AC4                 11
#> 15 2E01153F693D3                 12
#> 16 28CFE0272CE1C                  8
#> 17 24229602BDC4C                 13
#> 18 1A598AE39592B                 NA
#> 19 38C628AB4DA4D                 NA
#> 20 36B8918BB6E64                 29
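
The effect of the zero protocol is easiest to see on a tiny example. The following base R sketch uses made-up positions for three mentions of a single chain (the toy data frame and the column names tokenSeq and unitLastToken are hypothetical) and compares the literal treatment with a unit-final treatment. It only illustrates the idea and makes no claim about how tokensToLastMention() is implemented:

```r
# Hypothetical mentions of a single chain, in text order
toy <- data.frame(
  text = c("I", "<0>", "I"),
  tokenSeq = c(3, 9, 15),       # literal token position of each mention
  unitLastToken = c(6, 12, 18)  # position of the last token in the mention's unit
)

# The condition identifying zeroes, analogous to zeroCond
is_zero <- toy$text == "<0>"

# Literal treatment: zeroes stay where they were inserted
dist_literal <- c(NA, diff(toy$tokenSeq))

# Unit-final treatment: zeroes are moved to the end of their unit first
pos_unit_final <- ifelse(is_zero, toy$unitLastToken, toy$tokenSeq)
dist_unit_final <- c(NA, diff(pos_unit_final))

data.frame(text = toy$text, dist_literal, dist_unit_final)
```

Moving the zero changes both the distance from the preceding mention to the zero and the distance from the zero to the following mention, which is why the choice of protocol matters for any analysis based on token distance.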

The functions unitsToNextMention() and tokensToNextMention() work in the same way, except that they deal with following rather than preceding mentions.

Extracting features from previous mentions

In addition to getting the location of a previous mention, we might also want to extract a property of it:

getPrevMentionField() and getNextMentionField(): Extract a feature of the previous or next mention.

Let’s try to extract the subject status of the previous mention (using the Relation field annotated on treeLinks). First, we have to add this Relation field to rez007$trackDF$default, then replace the NA entries with "NonSubj" so that they are treated as a meaningful value rather than as missing data (fieldaccess is changed to "flex" so that future reloads do not undo this):

rez007$trackDF$default = rez007$trackDF$default %>%
  addFieldForeign(sourceDF = rez007$treeEntryDF$default,
                  targetForeignKeyName = "treeEntry",
                  targetFieldName = "Relation",
                  sourceFieldName = "Relation",
                  fieldaccess = "foreign")
rez007$trackDF$default = rez007$trackDF$default %>%
  rez_mutate(Relation = coalesce(Relation, "NonSubj"), fieldaccess = "flex")
#> Note that you are changing a field Relation from auto to flex This field will no longer reload.

The first and obligatory argument of getPrevMentionField() is the name of the column, or feature, you’re extracting. The other arguments, tokenOrder and chain, work the same way as before.

rez007$trackDF$default = rez007$trackDF$default %>%
  addFieldLocal(fieldName = "prevRelation",
                expression = getPrevMentionField(Relation),
                fieldaccess = "auto")
head(rez007$trackDF$default) %>% select(id, text, name, Relation, prevRelation)
#> # A tibble: 6 × 5
#>   id            text              name            Relation prevRelation
#>   <chr>         <chr>             <chr>           <chr>    <chr>       
#> 1 1096E4AFFFE65 I                 Mary            NonSubj  NA          
#> 2 92F20ACA5F06  I                 Mary            Subj     NonSubj     
#> 3 7E5BB65072C   was n't gon na do Trail 87        NonSubj  NA          
#> 4 1F74D2B049FA4 this              Staying up late NonSubj  NA          
#> 5 2485C4F740FC0 <0>               Mary            Subj     Subj        
#> 6 1BF2260B4AB78 Stay up late      Staying up late NonSubj  NonSubj
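
Conceptually, extracting a feature of the previous mention is a lag within each chain. Here is a minimal base R sketch on made-up data (the toy data frame is hypothetical, and this is not the internal implementation of getPrevMentionField()):

```r
# Hypothetical track table in text order
toy <- data.frame(
  chain = c("Mary", "Mary", "habit", "Mary"),
  Relation = c("NonSubj", "Subj", "NonSubj", "Subj"),
  stringsAsFactors = FALSE
)

# Within each chain, shift the Relation values down by one mention,
# so that each row receives the value of the preceding mention (NA if none)
toy$prevRelation <- ave(
  toy$Relation, toy$chain,
  FUN = function(v) c(NA, head(v, -1))
)
toy
```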

Tallying preceding and following mentions

Apart from looking only at the previous or next mention, we can also count how many mentions of something there were within a window of units before or after a mention, optionally with additional conditions. Here are the relevant functions:

countPrevMentions() and countNextMentions(): Get the number of previous or following mentions within a specified window of units.
countPrevMentionsIf() and countNextMentionsIf(): Get the number of previous or following mentions within a specified window of units that satisfy certain conditions (which do not depend on the current mention).
countPrevMentionsMatch() and countNextMentionsMatch(): Get the number of previous or following mentions within a specified window of units that have the same value as the current mention for some field.

These functions take the following arguments, in order:

windowSize: How many IUs before / after the current one do you want to count?
cond (countPrevMentionsIf() and countNextMentionsIf() only): The condition that mentions must satisfy to count.
field (countPrevMentionsMatch() and countNextMentionsMatch() only): The field whose value is to be matched.
unitSeq: As before.
chain: As before.

Let’s try all three functions. We will count the number of previous mentions in the previous 20 units, the previous subject mentions, and the previous mentions whose subject/nonsubject value agrees with the present mention:

rez007$trackDF$default = rez007$trackDF$default %>%
  rez_mutate(noPrevMentionsIn20 = countPrevMentions(20),
             noPrevSubjMentionsIn20 = countPrevMentionsIf(20, Relation == "Subj"),
             noPrevSubjMatchMentionsIn20 = countPrevMentionsMatch(20, "Relation"))
rez007$trackDF$default %>% select(id, noPrevMentionsIn20, noPrevSubjMentionsIn20, noPrevSubjMatchMentionsIn20)  %>% slice(1:20)
#> # A tibble: 20 × 4
#>    id            noPrevMentionsIn20 noPrevSubjMentionsIn20 noPrevSubjMatchMent…¹
#>    <chr>                      <int>                  <int>                 <int>
#>  1 1096E4AFFFE65                  0                      0                     0
#>  2 92F20ACA5F06                   0                      0                     0
#>  3 7E5BB65072C                    0                      0                     0
#>  4 1F74D2B049FA4                  0                      0                     0
#>  5 2485C4F740FC0                  2                      1                     2
#>  6 1BF2260B4AB78                  1                      0                     1
#>  7 6B37B5A80F2A                   2                      0                     2
#>  8 259C2C2979B6C                  0                      0                     0
#>  9 1D1F2B7054E32                  0                      0                     0
#> 10 1FA3806680C84                  0                      0                     0
#> 11 4B32FD84BA10                   0                      0                     0
#> 12 3098AB24A0FA6                  3                      1                     3
#> 13 2D5885FCA1E15                  0                      0                     0
#> 14 1C311FD331AC4                  1                      1                     1
#> 15 2E01153F693D3                  4                      2                     4
#> 16 28CFE0272CE1C                  2                      2                     2
#> 17 24229602BDC4C                  3                      3                     3
#> 18 1A598AE39592B                  0                      0                     0
#> 19 38C628AB4DA4D                  0                      0                     0
#> 20 36B8918BB6E64                  5                      3                     5
#> # … with abbreviated variable name ¹noPrevSubjMatchMentionsIn20

If you don’t want a window restriction, you can set the window to Inf. Here’s an example where we extract the number of future zero mentions, regardless of how far they are from the current one:

rez007$trackDF$default = rez007$trackDF$default %>% rez_mutate(noComingZeroes = countNextMentionsIf(Inf, text == "<0>"))
rez007$trackDF$default %>% select(id, noComingZeroes)  %>% slice(1:20)
#> # A tibble: 20 × 2
#>    id            noComingZeroes
#>    <chr>                  <int>
#>  1 1096E4AFFFE65              1
#>  2 92F20ACA5F06               1
#>  3 7E5BB65072C                0
#>  4 1F74D2B049FA4              1
#>  5 2485C4F740FC0              0
#>  6 1BF2260B4AB78              1
#>  7 6B37B5A80F2A               0
#>  8 259C2C2979B6C              0
#>  9 1D1F2B7054E32              0
#> 10 1FA3806680C84              0
#> 11 4B32FD84BA10               7
#> 12 3098AB24A0FA6              0
#> 13 2D5885FCA1E15              0
#> 14 1C311FD331AC4              7
#> 15 2E01153F693D3              0
#> 16 28CFE0272CE1C              7
#> 17 24229602BDC4C              7
#> 18 1A598AE39592B              0
#> 19 38C628AB4DA4D              0
#> 20 36B8918BB6E64              0
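
For intuition, windowed counting can be sketched in a few lines of base R. The helper count_prev_mentions() below is hypothetical, as is the toy data, and the exact boundary conventions (here: strictly earlier units, no more than window units back) are assumptions that may not match rezonateR's:

```r
# Hypothetical helper: for each mention, count earlier mentions of the same
# chain that lie within `window` units and satisfy the `keep` condition
count_prev_mentions <- function(unit, chain, window,
                                keep = rep(TRUE, length(unit))) {
  vapply(seq_along(unit), function(i) {
    sum(chain == chain[i] & keep &
          unit < unit[i] & unit >= unit[i] - window)
  }, integer(1))
}

# Hypothetical toy data in text order
unit  <- c(1, 2, 3, 5, 30)
chain <- c("M", "M", "H", "M", "M")
subj  <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

n_prev_20      <- count_prev_mentions(unit, chain, 20)        # plain count
n_prev_subj_20 <- count_prev_mentions(unit, chain, 20, subj)  # with a condition
n_prev_all     <- count_prev_mentions(unit, chain, Inf)       # no window limit
```

Setting the window to Inf works here for the same reason as above: every earlier unit then falls inside the window. Counting following mentions would simply reverse the two comparisons.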

Counting competitors

We may also want to count competing mentions, that is, recent mentions that are not coreferential with the current mention. The presence of competitors usually suggests that a referential form is more likely to be explicit. countCompetitors() tallies the number of recent competitors. It takes the following parameters, all of which are optional:

cond: The condition under which something counts as a competitor (other than being non-coreferential with the present mention). By default, anything goes.
windowSize: How far back (in units) do you want to look? By default, there is no limit.
tokenSeq: As before.
unitSeq: As before.
chain: As before.
between: Do you count only competitors between the current mention and the previous mention in the same trail, or do you also count mentions from before the previous mention?

The function countMatchingCompetitors() is similar, but instead of cond, there is an argument matchCol, where you should put the name of a field on which competitors must match the current mention in order to be counted.

Here is one example. noCompetitors uses a window of 10 units and may look beyond the previous mention (between = FALSE), whereas noMatchingCompetitors uses the same settings but only counts mentions with matching Relation values:

rez007$trackDF$default = rez007$trackDF$default %>%
  rez_mutate(noCompetitors = countCompetitors(windowSize = 10, between = FALSE),
             noMatchingCompetitors = countMatchingCompetitors(Relation, windowSize = 10, between = FALSE))
rez007$trackDF$default %>% select(id, text, noCompetitors, noMatchingCompetitors)  %>% slice(1:20)
#> # A tibble: 20 × 4
#>    id            text                                     noCompetitors noMatc…¹
#>    <chr>         <chr>                                            <int>    <int>
#>  1 1096E4AFFFE65 I                                                    0        0
#>  2 92F20ACA5F06  I                                                    0        0
#>  3 7E5BB65072C   was n't gon na do                                    2        1
#>  4 1F74D2B049FA4 this                                                 3        2
#>  5 2485C4F740FC0 <0>                                                  2        0
#>  6 1BF2260B4AB78 Stay up late                                         2        1
#>  7 6B37B5A80F2A  <0>                                                  1        1
#>  8 259C2C2979B6C the purpose of getting up in the morning             0        0
#>  9 1D1F2B7054E32 getting up in the morning                            0        0
#> 10 1FA3806680C84 the morning                                          0        0
#> 11 4B32FD84BA10  I                                                    3        0
#> 12 3098AB24A0FA6 it                                                   4        1
#> 13 2D5885FCA1E15 a hard habit to break                                1        0
#> 14 1C311FD331AC4 I                                                    2        1
#> 15 2E01153F693D3 It                                                   2        1
#> 16 28CFE0272CE1C I                                                    1        1
#> 17 24229602BDC4C I                                                    0        0
#> 18 1A598AE39592B midnight                                             1        0
#> 19 38C628AB4DA4D What                                                 0        0
#> 20 36B8918BB6E64 I                                                    1        0
#> # … with abbreviated variable name ¹noMatchingCompetitors
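
The notion of a competitor can also be sketched in base R. The helper count_competitors() and the toy data below are hypothetical. The sketch assumes that rows are in text order and that a competitor is any earlier mention from a different chain within the window, which is a simplification of what countCompetitors() offers (there is no cond or between here):

```r
# Hypothetical helper: for each mention, count earlier mentions belonging to
# OTHER chains that lie no more than `window` units back
count_competitors <- function(unit, chain, window = Inf) {
  idx <- seq_along(unit)
  vapply(idx, function(i) {
    sum(chain != chain[i] & idx < i & unit >= unit[i] - window)
  }, integer(1))
}

# Hypothetical toy data in text order
unit  <- c(1, 1, 2, 4, 15)
chain <- c("M", "H", "M", "T", "M")

n_competitors <- count_competitors(unit, chain, window = 10)
```

The last mention gets a count of zero because all mentions from other chains are more than 10 units back, even though there are several of them earlier in the text.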

Adding verb information to the track table, and vice versa

We often want to connect information about verbs to their arguments. We may either put verb information in a track table, or put track information in a verb table. The former approach can be taken when investigating issues like coreference, and the latter for issues like argument structure.

If we want to add verb information to the track table, we can do this in two steps:

Add a treeParent column to trackDF$default that takes the value of the parent column of treeEntryDF.
Using the treeParent column, find the corresponding verb in the verb table chunkDF$verb through the treeEntry column of chunkDF$verb.

rez007 = rez007 %>%
  addFieldForeign("track", "default", "treeEntry", "default", "treeEntry", "treeParent", "parent", fieldaccess = "foreign")
rez007$trackDF$default = rez007$trackDF$default %>%
  rez_left_join(rez007$chunkDF$verb %>% select(id, text, treeEntry),
                by = c(treeParent = "treeEntry"),
                suffix = c("", "_verb"),
                df2Address = "chunkDF/verb",
                fkey = "treeParent",
                df2key = "treeEntry",
                rezrObj = rez007) %>%
  rename(verbID = id_verb, verbText = text_verb)
rez007$trackDF$default %>% select(id, treeParent, verbID, verbText) %>% slice(1:20)
#> # A tibble: 20 × 4
#>    id            treeParent    verbID        verbText         
#>    <chr>         <chr>         <chr>         <chr>            
#>  1 1096E4AFFFE65 3005263A8352B ABe614xvgQ7XT said             
#>  2 92F20ACA5F06  35B1DC6EA25E5 744AD104FE64  was n't gon na do
#>  3 7E5BB65072C   NA            NA            NA               
#>  4 1F74D2B049FA4 35B1DC6EA25E5 744AD104FE64  was n't gon na do
#>  5 2485C4F740FC0 27760709F2C9B 15B9BB5D5086C Stay             
#>  6 1BF2260B4AB78 NA            NA            NA               
#>  7 6B37B5A80F2A  37FF92CF72A49 1A14BB68EDAC9 defeats          
#>  8 259C2C2979B6C 37FF92CF72A49 1A14BB68EDAC9 defeats          
#>  9 1D1F2B7054E32 NA            NA            NA               
#> 10 1FA3806680C84 NA            NA            NA               
#> 11 4B32FD84BA10  1CC3629361ED5 2CCD5F5A950BE know             
#> 12 3098AB24A0FA6 1CAE658BC6B04 1F1135429F347 's               
#> 13 2D5885FCA1E15 1CAE658BC6B04 1F1135429F347 's               
#> 14 1C311FD331AC4 3342A614FCFB  30F18BABDB83C do n't           
#> 15 2E01153F693D3 BCC81F077D99  94188DEA38BB  is               
#> 16 28CFE0272CE1C 6B9C62B7793B  13A37EC5064A2 do n't stay      
#> 17 24229602BDC4C 23BEEC722C8A0 29C877785FC4B 'm               
#> 18 1A598AE39592B NA            NA            NA               
#> 19 38C628AB4DA4D EC3E7F082CF3  1981C9A8D5DD8 do               
#> 20 36B8918BB6E64 EC3E7F082CF3  1981C9A8D5DD8 do

Now with the verbID in place, we can do the reverse process of putting argument information in the verb table too. The reverse process is similar, but a little more dangerous, because you could potentially create duplicate rows in chunkDF$verb if you are not careful! Let’s say we want to add information about the subject to chunkDF$verb, such as its ID and text. You would first have to make sure that each verb has only one subject! If you have made a lot of annotations, mistakes are likely to happen, so make sure to check first. In this code, we extract the verb IDs of each subject, and make sure that the verb IDs are unique:

verbsBySubject = rez007$trackDF$default %>%
  filter(Relation == "Subj", !is.na(verbID)) %>%
  pull(verbID)
if(length(verbsBySubject) != length(unique(verbsBySubject))) print("Error found!") else print("You're good!")
#> [1] "You're good!"
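
If a check like this fails, it helps to know which verbs are the culprits. Here is a small base R sketch on made-up IDs (the verb_ids vector is hypothetical) that detects repeats and lists the offending values:

```r
# Hypothetical verb IDs collected from subject mentions; "v2" appears twice
verb_ids <- c("v1", "v2", "v3", "v2")

# anyDuplicated() returns the index of the first repeat, or 0 if there is none
has_dupes <- anyDuplicated(verb_ids) > 0

# List each repeated ID once so that the annotations can be inspected
dupes <- unique(verb_ids[duplicated(verb_ids)])
dupes
```

Any ID listed in dupes would produce more than one row for that verb after a left join.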

Now we can safely move this information to the verb table:

rez007$chunkDF$verb = rez007$chunkDF$verb %>%
  rez_left_join(rez007$trackDF$default %>% select(id, text, verbID),
                by = c(id = "verbID"),
                suffix = c("", "_subj"),
                df2Address = "trackDF/default",
                fkey = "id",
                df2key = "verbID",
                rezrObj = rez007) %>%
  rename(subjID = id_subj, subjText = text_subj)
rez007$chunkDF$verb %>% select(id, text, subjID, subjText)
#> # A tibble: 163 × 4
#>    id            text                  subjID        subjText                   
#>    <chr>         <chr>                 <chr>         <chr>                      
#>  1 210FB26A315A  said                  NA            NA                         
#>  2 744AD104FE64  was n't gon na do     92F20ACA5F06  I                          
#>  3 744AD104FE64  was n't gon na do     1F74D2B049FA4 this                       
#>  4 15B9BB5D5086C Stay                  2485C4F740FC0 <0>                        
#>  5 1A14BB68EDAC9 defeats               6B37B5A80F2A  <0>                        
#>  6 1A14BB68EDAC9 defeats               259C2C2979B6C the purpose of getting up …
#>  7 2CCD5F5A950BE know                  4B32FD84BA10  I                          
#>  8 1F1135429F347 's                    3098AB24A0FA6 it                         
#>  9 1F1135429F347 's                    2D5885FCA1E15 a hard habit to break      
#> 10 2F857247FD9D5 a hard habit to break NA            NA                         
#> # … with 153 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Bridging

Natively, Rezonator does not yet support bridging annotation. rezonateR handles bridging a bit unusually, since it is difficult to do direct bridging annotation without an annotation interface like Rezonator’s.

The first step in doing bridging annotation is to create a frameMatrix, a data frame which is used to enter framing relationships between chains (for example, the car’s engine has a part-whole relationship with the car). But before doing that, we must ensure that there are no chains with repeated names. We use the function undupeLayers(), which has three arguments: the rezrObj, the layer you want to undupe, and the field/column you want to undupe. This will add numbers next to duplicated chain names. Then we can call addFrameMatrix() on the rezrObj. The following code does these steps, then shows the first 10 rows and 12 columns of the resulting frameMatrix (the first two columns give the IDs and names of the chains):

rez007 = undupeLayers(rez007, "trail", "name")
rez007 = addFrameMatrix(rez007)
frameMatrix(rez007)[1:10, 1:12]
#> # A tibble: 10 × 12
#>    id      name  the w…¹ right…² The t…³ ànyth…⁴ the t…⁵ Trail…⁶ our p…⁷ All t…⁸
#>    <chr>   <chr> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
#>  1 2053E9… the … ""      ""      ""      ""      ""      ""      ""      ""     
#>  2 2DFABF… righ… ""      ""      ""      ""      ""      ""      ""      ""     
#>  3 13B30F… The … ""      ""      ""      ""      ""      ""      ""      ""     
#>  4 21E91E… ànyt… ""      ""      ""      ""      ""      ""      ""      ""     
#>  5 14CE0A… the … ""      ""      ""      ""      ""      ""      ""      ""     
#>  6 2B6707… Trai… ""      ""      ""      ""      ""      ""      ""      ""     
#>  7 4D1295… our … ""      ""      ""      ""      ""      ""      ""      ""     
#>  8 2A0D42… All … ""      ""      ""      ""      ""      ""      ""      ""     
#>  9 976E7D… hypo… ""      ""      ""      ""      ""      ""      ""      ""     
#> 10 C0687A… Tim'… ""      ""      ""      ""      ""      ""      ""      ""     
#> # … with 2 more variables: `hypothetical Tim complaint` <chr>,
#> #   `Tim's first reaction` <chr>, and abbreviated variable names
#> #   ¹`the way they were feeling`, ²`right now`, ³`The two couples`,
#> #   ⁴`ànything positive about Tim's uncomfortable situation that he could do`,
#> #   ⁵`the thing that really scares Alice`, ⁶`Trail 87`, ⁷`our parents`,
#> #   ⁸`All this other shit (other than talking to Ben)`
#> # ℹ Use `colnames()` to see all variable names

The second step is to export the frameMatrix and populate it with actual annotations. Although it does not apply to our case, in many cases the function reduceFrameMatrix() will be useful for removing rows and columns that do not actually participate in framing relations, or for dividing the matrix into subparts that don’t have framing relationships with each other, so that we end up with a cleaner CSV to annotate.

We use the function obscureUpper() to obscure the upper triangular matrix of ‘repeat’ entries so that we don’t duplicate our annotation efforts: if we’ve already annotated that the car’s engine and the car have a part-whole relationship in the lower triangular matrix, we don’t need to annotate that the car and the car’s engine have a whole-part relationship!

After obscureUpper() and (optionally) reduceFrameMatrix(), the frameMatrix can be exported using rez_write_csv() and edited in an external editor. A spreadsheet program is highly recommended for this; you can use the freeze frame feature to annotate these relations more easily. When entering relationships, use the format ‘(role of the row entity)-(role of the column entity)’. For example, if the car’s engine is the row and the car is the column, type part-whole, not whole-part. Alternatively, if there are only a few of these relations and you already remember which ones they are, you can edit those relations in R directly. You can simply use base R assignment for this.

For this text, we will mainly be annotating individual-group relationships.

rez_write_csv(obscureUpper(frameMatrix(rez007)), "rez007_frame.csv")
newFrame = rez_read_csv("rez007_frame_edited.csv", origDF = frameMatrix(rez007))
newFrame[10:20,c(1,2,12:22)]
#> # A tibble: 11 × 13
#>    id    name  Tim's…¹ what …² imper…³ a har…⁴ nine …⁵ self …⁶ Tim   Mary  Ron  
#>    <chr> <chr> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr> <chr> <chr>
#>  1 C068… Tim'… /       /       /       /       /       /       /     /     /    
#>  2 1AE4… what… NA      /       /       /       /       /       /     /     /    
#>  3 318B… impe… NA      NA      /       /       /       /       /     /     /    
#>  4 2DAC… a ha… NA      NA      NA      /       /       /       /     /     /    
#>  5 30AF… nine… NA      NA      NA      NA      /       /       /     /     /    
#>  6 1685… self… NA      NA      NA      NA      NA      /       /     /     /    
#>  7 5F7A… Tim   NA      NA      NA      NA      NA      NA      /     /     /    
#>  8 278D… Mary  NA      NA      NA      NA      NA      NA      NA    /     /    
#>  9 29D3… Ron   NA      NA      NA      NA      NA      NA      NA    NA    /    
#> 10 309A… shit… NA      NA      NA      NA      NA      NA      NA    NA    NA   
#> 11 34F8… Tim … NA      NA      NA      NA      NA      NA      grou… NA    NA   
#> # … with 2 more variables: `shit (nothing)` <chr>, `Tim and Mandy` <chr>, and
#> #   abbreviated variable names ¹`Tim's first reaction`,
#> #   ²`what they don't realize`, ³`impersonal you 3 (trained by hard times)`,
#> #   ⁴`a hard habit to break`, ⁵`nine o'clock`, ⁶`self assertiveness`
#> # ℹ Use `colnames()` to see all variable names

Rather than updateFromDF(), we use updateFrameMatrixFromDF() to update the frameMatrix. This function will ‘flip’ the relationships for the upper triangular matrix. For example, if in the lower triangular matrix you annotated the car’s engine as having a ‘part-whole’ relationship with the car, then ‘whole-part’ will show up in the upper triangular matrix for the row ‘the car’ and the column ‘the car’s engine’.

frameMatrix(rez007) = updateFrameMatrixFromDF(frameMatrix(rez007), newFrame)
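
The flipping step can be illustrated with a plain character matrix in base R. Everything in this sketch is hypothetical (the flip_relation() helper, the three toy chains, and the use of a bare matrix instead of a frameMatrix), and it is not the implementation of updateFrameMatrixFromDF(). It also assumes that role names contain no hyphens:

```r
# Hypothetical helper: turn "part-whole" into "whole-part"; empty cells stay empty
flip_relation <- function(rel) {
  parts <- strsplit(rel, "-", fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) == 2) paste(rev(p), collapse = "-") else ""
  }, character(1))
}

# Toy relation matrix: rows and columns are chains, cells are
# "(role of the row entity)-(role of the column entity)"
chains <- c("car", "engine", "driver")
m <- matrix("", nrow = 3, ncol = 3, dimnames = list(chains, chains))
m["engine", "car"] <- "part-whole"  # annotated in the lower triangle only

# Fill the upper triangle with the flipped mirror image of the lower triangle
upper <- upper.tri(m)
m[upper] <- flip_relation(t(m)[upper])
m
```

After this step, the row for the car and the column for the engine contain whole-part, so each relation only needs to be annotated once.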

After having updated the frame matrix, we can easily extract bridging information using the following functions:

lastBridgeUnit(): Get the location (in unit) of the previous unit with a bridge to this unit.
lastBridgeToken(): Get the location (in tokens) of the bridging expression to this unit.
unitsToLastBridge(): Get the number of units between the closest unit with a bridge to the current unit, and the current unit.
tokensToLastBridge(): Get the number of tokens between the bridging expression and this unit.
countPrevBridges(): Count the number of previous bridging expressions in a specified window.

In practice, we will generally be using unitsToLastBridge() and tokensToLastBridge(). The arguments needed are very similar to what we saw for unitsToLastMention() and tokensToLastMention(). Here are the arguments for unitsToLastBridge():

frameMatrix: The frameMatrix.
unitSeq: As before.
chain: As before.
tokenOrderLast: The token sequence value for the last token in an expression, by default docTokenSeqLast.
tokenOrderFirst: The token sequence value for the first token in an expression, by default docTokenSeqFirst.
inclRelations: Vector of relations that will be counted. This allows you to, for example, count part-whole relations but not whole-part. If left blank, everything will be counted.

The arguments of tokensToLastBridge() are again mostly things we have seen before:

frameMatrix: As before.
firstOrLast: Do you count the first or last token of the previous bridging expression? Either "first" or, by default, "last".
tokenOrderFirst: As before.
tokenOrderLast: As before.
chain: As before.
zeroProtocol: As before.
zeroCond: As before.
unitSeq: As before.
unitDF: As before.
inclRelations: As before.

Let’s say you want to extract the units and tokens to the previous bridge, including only the "individual-group" relation (so that it doesn’t count as a bridge if the current mention is a group that includes the previous mention, but it does count if the current mention is an individual that is included by the previous mention). Here is the code:

rez007$trackDF$default = rez007$trackDF$default %>%
  rez_mutate(bridgeDistUnit = unitsToLastBridge(frameMatrix(rez007),
                                                     inclRelations = "individual-group"),
             bridgeDistToken = tokensToLastBridge(frameMatrix(rez007),
                                                     inclRelations = "individual-group"))
rez007$trackDF$default %>% select(id, text, bridgeDistUnit, bridgeDistToken) %>% slice(53:63)
#> # A tibble: 11 × 4
#>    id            text                            bridgeDistUnit bridgeDistToken
#>    <chr>         <chr>                                    <dbl>           <dbl>
#>  1 16A52A7994637 Tim                                          8              46
#>  2 4A063419D556  he                                          NA              NA
#>  3 36C85595E9EA4 every little dime that he makes             NA              NA
#>  4 1EA9E1A9D3BEE he                                          NA              NA
#>  5 290A040DBE92D He                                          NA              NA
#>  6 513F50ED54C2  any breaks                                  NA              NA
#>  7 382D95A5FE64F Tim                                         15              80
#>  8 21B834B5E4F57 salary                                      NA              NA
#>  9 CE3A09CF9407  he                                          16              86
#> 10 15C2ADACD3CE5 leave                                       NA              NA
#> 11 15408B58961EE he                                          19              97

Notice that if there has been no preceding bridge, the value is NA.
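
To see what such a computation involves, here is a rough base R sketch of a unit distance to the last bridge. The helper units_to_last_bridge(), the toy relation matrix and the toy mentions are all hypothetical. The sketch assumes that rows are in text order and that a cell of the matrix gives the role of the row chain followed by the role of the column chain. It is not rezonateR's implementation:

```r
# Hypothetical helper: distance in units to the most recent earlier mention of
# any chain that stands in one of the relations in `incl` to the current chain
units_to_last_bridge <- function(unit, chain, frame, incl) {
  idx <- seq_along(unit)
  vapply(idx, function(i) {
    related <- colnames(frame)[frame[chain[i], ] %in% incl]
    prior <- unit[chain %in% related & idx < i]
    if (length(prior) == 0) NA_real_ else unit[i] - max(prior)
  }, numeric(1))
}

# Toy relation matrix between two chains
chains <- c("Tim", "Tim and Mandy")
frame <- matrix("", nrow = 2, ncol = 2, dimnames = list(chains, chains))
frame["Tim", "Tim and Mandy"] <- "individual-group"
frame["Tim and Mandy", "Tim"] <- "group-individual"

# Toy mentions in text order
unit  <- c(1, 3, 9)
chain <- c("Tim and Mandy", "Tim", "Tim and Mandy")

bridge_dist <- units_to_last_bridge(unit, chain, frame,
                                    incl = "individual-group")
```

Only the mention of Tim receives a distance, since it is an individual preceded by a mention of a group that includes it. The later mention of the group gets NA because the group-individual relation is not among the included relations.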

A toy analysis

Now that we’ve come so far, let’s try to do a mockup of a real linguistic analysis! Of course, an actual analysis would take far more data than what we have here, as well as a more carefully designed annotation scheme. But what we do here should suffice to demonstrate how an analysis might be done.

Let’s try to predict the number of characters inside a referential expression from five variables:

noPrevSubjMentionsIn20: The number of coreferent subject mentions within the 20 previous units.
noPrevNonSubjMentionsIn20: The number of coreferent non-subject mentions within the 20 previous units.
noCompetitors: The number of competitors within the ten previous units.
Relation: Is the current mention a subject?
number: Is the current mention singular or plural?

We will use a linear regression for this prediction. (This probably isn’t the best model, but let’s keep it simple for this demonstration.) First we need to create noPrevNonSubjMentionsIn20 (which we do in an emancipated rezrDF to avoid clogging up the main table), then we’ll convert Relation to a factor, and then we’ll use the lm() function in base R to do the prediction:

analysis_df = rez007$trackDF$default %>% rez_mutate(
  noPrevNonSubjMentionsIn20 = noPrevMentionsIn20 - noPrevSubjMentionsIn20,
  Relation = stringToFactor(Relation)
)
lm_nochar = lm(charCount ~ noPrevSubjMentionsIn20 + noPrevNonSubjMentionsIn20 + noCompetitors + Relation + number, data = analysis_df)
lm_nochar
#> 
#> Call:
#> lm(formula = charCount ~ noPrevSubjMentionsIn20 + noPrevNonSubjMentionsIn20 + 
#>     noCompetitors + Relation + number, data = analysis_df)
#> 
#> Coefficients:
#>               (Intercept)     noPrevSubjMentionsIn20  
#>                   15.0460                    -0.7583  
#> noPrevNonSubjMentionsIn20              noCompetitors  
#>                   -0.9617                    -0.5511  
#>              RelationSubj                   numbersg  
#>                   -7.8235                    -0.3911
anova(lm_nochar)
#> Analysis of Variance Table
#> 
#> Response: charCount
#>                            Df  Sum Sq Mean Sq F value    Pr(>F)    
#> noPrevSubjMentionsIn20      1  2387.2 2387.18 23.6741 2.140e-06 ***
#> noPrevNonSubjMentionsIn20   1   399.1  399.05  3.9575   0.04787 *  
#> noCompetitors               1     1.3    1.26  0.0125   0.91102    
#> Relation                    1  3162.2 3162.21 31.3602 6.191e-08 ***
#> number                      1     5.2    5.23  0.0518   0.82008    
#> Residuals                 226 22788.8  100.84                      
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

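As we can see, noPrevSubjMentionsIn20 and Relation emerge as highly significant predictors (p < 0.001), and noPrevNonSubjMentionsIn20 is marginally significant (p ≈ 0.048), while noCompetitors and number are not significant. The negative coefficients suggest that entities mentioned more often in the previous 20 units, and mentions that are subjects, tend to be lighter.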

This concludes our journey through the basic functionality of rezonateR. As usual, don’t forget to save:

savePath = "rez007.Rdata"
rez_save(rez007, savePath)
#> Saving rezrObj ...