Tracking down coreference phenomena
track.Rmd
This tutorial discusses the handling of trails and tracks in Rezonator using the EasyTrack series of functions. In more generally accepted linguistic terms, a trail is a coreference chain, and a track is a mention or referential expression within a coreference chain.
We will be using the same Santa Barbara Corpus annotations as before:
library(rezonateR)
path = system.file("extdata", "rez007_track.Rdata", package = "rezonateR", mustWork = T)
rez007 = rez_load(path)
#> Loading rezrObj ...
The file contains coreference annotations for the first fifth of the
text or so, and the .Rdata
file imported here has been
processed to include information on trees in
vignette("trees")
. This tutorial will make use of this
feature.
This tutorial will build towards a very simple toy analysis at the
end, using all the changes that have been made to the
rezrObj
so far, to show the capabilities of
rezonateR
.
Getting information from previous mentions
Anaphoric and cataphoric distance
In studying coreference, we often want to know the distance from
the current mention to the previous mention. EasyTrack
takes care of this using a family of functions:
- lastMentionUnit() and nextMentionUnit(): Give you the unit ID of the previous and next mention, respectively.
- lastMentionToken() and nextMentionToken(): Give you the token ID of the previous and next mention, respectively.
- unitsToLastMention() and unitsToNextMention(): Give you the number of units from the current mention to the last mention and to the next mention, respectively.
- tokensToLastMention() and tokensToNextMention(): Give you the number of tokens from the current mention to the last mention and to the next mention, respectively.
The first four functions are rarely used in practice, so we will focus on the last four, which build on the first four.
Let’s first find out how many units we are from the previous mention
of something using unitsToLastMention(). This corresponds to the gapUnits
column that Rezonator generates automatically (in the output below, the two
agree except in row 2, where gapUnits is 0 but unitsToLastMention() gives NA).
There are two optional arguments:
- unitSeq: The unit order values where the mentions appeared. Here, we use the unitSeqLast column, which is the default value, though unitSeqFirst is also possible.
- chain: The column that gives the chain that each track belongs to. Typically there is no reason to touch this parameter; just leave it blank, and the column chain will be used.
The value will be NA if there are no previous mentions:
rez007$trackDF$default = rez007$trackDF$default %>%
rez_mutate(unitsToLastMention = unitsToLastMention(unitSeqLast))
rez007$trackDF$default %>% select(id, gapUnits, unitsToLastMention) %>% slice(1:20)
#> # A tibble: 20 × 3
#> id gapUnits unitsToLastMention
#> <chr> <chr> <dbl>
#> 1 1096E4AFFFE65 N/A NA
#> 2 92F20ACA5F06 0 NA
#> 3 7E5BB65072C N/A NA
#> 4 1F74D2B049FA4 N/A NA
#> 5 2485C4F740FC0 1 1
#> 6 1BF2260B4AB78 1 1
#> 7 6B37B5A80F2A 1 1
#> 8 259C2C2979B6C N/A NA
#> 9 1D1F2B7054E32 N/A NA
#> 10 1FA3806680C84 N/A NA
#> 11 4B32FD84BA10 N/A NA
#> 12 3098AB24A0FA6 2 2
#> 13 2D5885FCA1E15 N/A NA
#> 14 1C311FD331AC4 2 2
#> 15 2E01153F693D3 2 2
#> 16 28CFE0272CE1C 2 2
#> 17 24229602BDC4C 2 2
#> 18 1A598AE39592B N/A NA
#> 19 38C628AB4DA4D N/A NA
#> 20 36B8918BB6E64 8 8
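To make the idea of unit distance concrete, here is a minimal base R sketch on a hypothetical toy table (the tracks data frame and its values are invented for illustration). It is not how rezonateR implements unitsToLastMention(); it only shows the underlying logic of looking up the previous mention in the same chain and subtracting unit positions, assuming the rows are already sorted by position in the text.

```r
# Hypothetical toy track table: one row per mention, sorted by position in the text
tracks = data.frame(
  id = c("m1", "m2", "m3", "m4", "m5"),
  chain = c("A", "B", "A", "A", "B"),
  unitSeqLast = c(1, 1, 2, 5, 9)
)

# Within each chain, shift the unit positions down by one row, so that every
# mention sees the unit of the previous mention in its chain (NA for the first)
lastUnit = ave(tracks$unitSeqLast, tracks$chain,
               FUN = function(x) c(NA, head(x, -1)))

# Distance in units from each mention back to the previous one
tracks$unitsToLast = tracks$unitSeqLast - lastUnit
tracks
```

The first mention of each chain gets NA, just as in the rezonateR output above.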
Now let’s count the tokens from the last mention using the
tokensToLastMention() function instead. This one has a
couple of complications, and there are more parameters to fill in this
time:
- tokenOrder: Similar to unitSeq, but for counting tokens. Common choices are docTokenSeqFirst, docTokenSeqLast, wordTokenSeqFirst and wordTokenseqLast (see vignette("time_seq") for the last two). By default it’s docTokenSeqLast.
- chain: As above.
- zeroProtocol: How the positions of zeroes are determined. By default, it is "literal", i.e. the position at which the zero was inserted. If "unitFinal", zeroes will be treated as being located at the end of the unit. If "unitInitial" (the value used in the code below), they will be treated as the first word of the unit.
- zeroCond: A condition for determining whether a token is zero. Normally, this is (word column) == "<0>" if the default Rezonator zero is used, though others may prefer "<ZERO>" or similar.
- unitSeq: As above. Required when using the "unitFinal" and "unitInitial" protocols.
- unitTokenSeqName: The name of the tokenSeq column to be used in the unitDF.
- unitDF: The rezrDF containing the units.
rez007$trackDF$default = rez007$trackDF$default %>%
rez_mutate(wordsToLastMention = tokensToLastMention(
docWordSeqFirst, #What seq to use
zeroProtocol = "unitInitial", #How to treat zeroes
zeroCond = (text == "<0>"),
unitDF = rez007$unitDF,
unitTokenSeqName = "docWordSeqFirst")) #Additional arguments for the unit-based zero protocols
rez007$trackDF$default %>% select(id, wordsToLastMention) %>% slice(1:20)
#> # A tibble: 20 × 2
#> id wordsToLastMention
#> <chr> <dbl>
#> 1 1096E4AFFFE65 NA
#> 2 92F20ACA5F06 NA
#> 3 7E5BB65072C NA
#> 4 1F74D2B049FA4 NA
#> 5 2485C4F740FC0 10
#> 6 1BF2260B4AB78 10
#> 7 6B37B5A80F2A 3
#> 8 259C2C2979B6C NA
#> 9 1D1F2B7054E32 NA
#> 10 1FA3806680C84 NA
#> 11 4B32FD84BA10 NA
#> 12 3098AB24A0FA6 13
#> 13 2D5885FCA1E15 NA
#> 14 1C311FD331AC4 11
#> 15 2E01153F693D3 12
#> 16 28CFE0272CE1C 8
#> 17 24229602BDC4C 13
#> 18 1A598AE39592B NA
#> 19 38C628AB4DA4D NA
#> 20 36B8918BB6E64 29
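The effect of the zero protocol can be illustrated with a small base R sketch. Everything here is hypothetical (the units and mentions tables, their column names and their values), and the code only mimics the idea of relocating zeroes within their unit; it is not the rezonateR implementation.

```r
# Hypothetical toy data: two units with their first and last token positions
units = data.frame(unitSeq = c(1, 2), firstToken = c(1, 6), lastToken = c(5, 9))

# Two mentions in the same chain; the second is a zero inserted at token 7
mentions = data.frame(text = c("I", "<0>"), unitSeq = c(1, 2), tokenSeq = c(2, 7))

isZero = mentions$text == "<0>"               # plays the role of zeroCond
unitRow = match(mentions$unitSeq, units$unitSeq)

# Literal: take the position where the zero was inserted at face value
literalPos = mentions$tokenSeq
# Unit-final: treat zeroes as sitting on the last token of their unit
finalPos = ifelse(isZero, units$lastToken[unitRow], mentions$tokenSeq)
# Unit-initial: treat zeroes as sitting on the first token of their unit
initialPos = ifelse(isZero, units$firstToken[unitRow], mentions$tokenSeq)

# Token distance from the zero back to the previous mention under each protocol
dists = c(literal = diff(literalPos),
          unitFinal = diff(finalPos),
          unitInitial = diff(initialPos))
dists
```

The same pair of mentions yields three different distances, which is why the choice of protocol should be made deliberately and reported.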
The functions unitsToNextMention()
and
tokensToNextMention()
work in the same way, except that
they deal with following rather than preceding mentions.
Extracting features from previous mentions
In addition to getting the location of a previous mention, we might also want to extract a property of it:
- getPrevMentionField() and getNextMentionField(): Extract a feature of the previous or next mention.
Let’s try to extract the subject status (using the Relation field
annotated on treeLinks). First, we have to add this Relation field to
rez007$trackDF$default, then replace the NA entries with "NonSubj"
so that the missing values are treated as meaningful, and not just
missing (fieldaccess is changed to "flex" to avoid future reloads
messing this up):
rez007$trackDF$default = rez007$trackDF$default %>% addFieldForeign(sourceDF = rez007$treeEntryDF$default, targetForeignKeyName = "treeEntry", targetFieldName = "Relation", sourceFieldName = "Relation", fieldaccess = "foreign")
rez007$trackDF$default = rez007$trackDF$default %>% rez_mutate(Relation = coalesce(Relation, "NonSubj"), fieldaccess = "flex")
#> Note that you are changing a field Relation from auto to flex This field will no longer reload.
The first and obligatory argument of
getPrevMentionField()
is the name of the column, or
feature, you’re extracting. The other arguments, tokenOrder
and chain
, work the same way as before.
rez007$trackDF$default = rez007$trackDF$default %>%
addFieldLocal(fieldName = "prevRelation",
expression = getPrevMentionField(Relation),
fieldaccess = "auto")
head(rez007$trackDF$default) %>% select(id, text, name, Relation, prevRelation)
#> # A tibble: 6 × 5
#> id text name Relation prevRelation
#> <chr> <chr> <chr> <chr> <chr>
#> 1 1096E4AFFFE65 I Mary NonSubj NA
#> 2 92F20ACA5F06 I Mary Subj NonSubj
#> 3 7E5BB65072C was n't gon na do Trail 87 NonSubj NA
#> 4 1F74D2B049FA4 this Staying up late NonSubj NA
#> 5 2485C4F740FC0 <0> Mary Subj Subj
#> 6 1BF2260B4AB78 Stay up late Staying up late NonSubj NonSubj
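Conceptually, fetching a property of the previous mention is a within-chain shift of a column. The following base R sketch shows this on a hypothetical toy table (the data and the helper shiftDown() are made up for illustration, and this is not rezonateR's own code):

```r
# Hypothetical toy track table, sorted by position in the text
tracks = data.frame(
  id = c("m1", "m2", "m3", "m4", "m5"),
  chain = c("A", "B", "A", "A", "B"),
  Relation = c("NonSubj", "Subj", "Subj", "NonSubj", "Subj"),
  stringsAsFactors = FALSE
)

# Shift a vector down by one position, padding the top with NA
shiftDown = function(x) c(NA, head(x, -1))

# Apply the shift separately within each chain, then restore the original row order
byChain = split(tracks$Relation, tracks$chain)
tracks$prevRelation = unsplit(lapply(byChain, shiftDown), tracks$chain)
tracks
```

As in the real output, the first mention of each chain has no previous mention, so its prevRelation is NA.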
Tallying preceding and following mentions
Apart from looking only at the previous or next mention, we can also count how many mentions of something there were within a window of units before or after a mention, optionally with additional conditions. Here are the relevant functions:
- countPrevMentions() and countNextMentions(): Get the number of previous or following mentions within a specified window of units.
- countPrevMentionsIf() and countNextMentionsIf(): Get the number of previous or following mentions within a specified window of units, given that they satisfy certain conditions (which do not depend on the current mention).
- countPrevMentionsMatch() and countNextMentionsMatch(): Get the number of previous or following mentions within a specified window of units, given that they have the same value as the current mention for some field.
These functions have the following arguments, in order:
- windowSize: How many units before / after the current one do you want to count?
- cond (countPrevMentionsIf only): The condition that mentions must satisfy to count.
- field (countPrevMentionsMatch only): The field whose value is to be matched.
- unitSeq: As before.
- chain: As before.
Let’s try all three functions. We will count the number of previous mentions in the previous 20 units, the previous subject mentions, and the previous mentions whose subject/nonsubject value agrees with the present mention:
rez007$trackDF$default = rez007$trackDF$default %>%
rez_mutate(noPrevMentionsIn20 = countPrevMentions(20),
noPrevSubjMentionsIn20 = countPrevMentionsIf(20, Relation == "Subj"),
noPrevSubjMatchMentionsIn20 = countPrevMentionsMatch(20, "Relation"))
rez007$trackDF$default %>% select(id, noPrevMentionsIn20, noPrevSubjMentionsIn20, noPrevSubjMatchMentionsIn20) %>% slice(1:20)
#> # A tibble: 20 × 4
#> id noPrevMentionsIn20 noPrevSubjMentionsIn20 noPrevSubjMatchMent…¹
#> <chr> <int> <int> <int>
#> 1 1096E4AFFFE65 0 0 0
#> 2 92F20ACA5F06 0 0 0
#> 3 7E5BB65072C 0 0 0
#> 4 1F74D2B049FA4 0 0 0
#> 5 2485C4F740FC0 2 1 2
#> 6 1BF2260B4AB78 1 0 1
#> 7 6B37B5A80F2A 2 0 2
#> 8 259C2C2979B6C 0 0 0
#> 9 1D1F2B7054E32 0 0 0
#> 10 1FA3806680C84 0 0 0
#> 11 4B32FD84BA10 0 0 0
#> 12 3098AB24A0FA6 3 1 3
#> 13 2D5885FCA1E15 0 0 0
#> 14 1C311FD331AC4 1 1 1
#> 15 2E01153F693D3 4 2 4
#> 16 28CFE0272CE1C 2 2 2
#> 17 24229602BDC4C 3 3 3
#> 18 1A598AE39592B 0 0 0
#> 19 38C628AB4DA4D 0 0 0
#> 20 36B8918BB6E64 5 3 5
#> # … with abbreviated variable name ¹noPrevSubjMatchMentionsIn20
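The difference between the three counting functions is easiest to see on a tiny example. Below is a base R sketch with a hypothetical toy table and a made-up helper, countPrevToy(). It assumes that a mention exactly windowSize units back still counts; whether rezonateR treats the window boundary the same way is not something this sketch establishes.

```r
# Hypothetical toy track table, sorted by position in the text
tracks = data.frame(
  chain = c("A", "A", "B", "A", "A"),
  unitSeq = c(1, 3, 4, 10, 30),
  Relation = c("Subj", "NonSubj", "Subj", "Subj", "Subj")
)
windowSize = 20
rows = seq_len(nrow(tracks))

# Count earlier mentions of the same chain within the window,
# optionally restricted to the rows flagged in `keep`
countPrevToy = function(i, keep = rep(TRUE, nrow(tracks))) {
  sameChain = tracks$chain == tracks$chain[i]
  earlier = i > rows
  inWindow = windowSize >= tracks$unitSeq[i] - tracks$unitSeq
  sum(sameChain & earlier & inWindow & keep)
}

# Plain count, conditional count, and count of mentions matching the current one
noPrev = sapply(rows, countPrevToy)
noPrevSubj = sapply(rows, countPrevToy, keep = tracks$Relation == "Subj")
noPrevMatch = sapply(rows, function(i) {
  countPrevToy(i, keep = tracks$Relation == tracks$Relation[i])
})
cbind(tracks, noPrev, noPrevSubj, noPrevMatch)
```

Note that the condition in the second count is fixed for all rows, whereas the third count compares each earlier mention against the current row.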
If you don’t want a window restriction, you can set the window to
Inf. Here’s an example where we extract the number of
future zero mentions, regardless of how far they are from the current
one:
rez007$trackDF$default = rez007$trackDF$default %>% rez_mutate(noComingZeroes = countNextMentionsIf(Inf, text == "<0>"))
rez007$trackDF$default %>% select(id, noComingZeroes) %>% slice(1:20)
#> # A tibble: 20 × 2
#> id noComingZeroes
#> <chr> <int>
#> 1 1096E4AFFFE65 1
#> 2 92F20ACA5F06 1
#> 3 7E5BB65072C 0
#> 4 1F74D2B049FA4 1
#> 5 2485C4F740FC0 0
#> 6 1BF2260B4AB78 1
#> 7 6B37B5A80F2A 0
#> 8 259C2C2979B6C 0
#> 9 1D1F2B7054E32 0
#> 10 1FA3806680C84 0
#> 11 4B32FD84BA10 7
#> 12 3098AB24A0FA6 0
#> 13 2D5885FCA1E15 0
#> 14 1C311FD331AC4 7
#> 15 2E01153F693D3 0
#> 16 28CFE0272CE1C 7
#> 17 24229602BDC4C 7
#> 18 1A598AE39592B 0
#> 19 38C628AB4DA4D 0
#> 20 36B8918BB6E64 0
Counting competitors
We may also want to count competing mentions, that is, recent
mentions not coreferential to the current mention. The presence of
competitors usually suggests that a referential form is more likely to
be explicit. countCompetitors() tallies the number of recent
competitors. The following parameters are available, all of which
are optional:
- cond: The condition under which something counts as a competitor (other than being non-coreferential with the present mention). By default, anything goes.
- windowSize: How far back (in units) do you want to look? By default, there is no limit.
- tokenSeq: As before.
- unitSeq: As before.
- chain: As before.
- between: Do you count only competitors between the current mention and the previous mention in the same trail, or do you also count mentions from before the previous mention?
The function countMatchingCompetitors() is similar, but
instead of cond, there is a field matchCol,
where you should put the name of a field in which competitors should
match the current mention in order to be counted.
Here is one example. noCompetitors uses a window of 10
units, and may look beyond the previous mention (between = F).
noMatchingCompetitors uses the same settings, but only counts
mentions with matching Relation values:
rez007$trackDF$default = rez007$trackDF$default %>%
rez_mutate(noCompetitors = countCompetitors(windowSize = 10, between = F),
noMatchingCompetitors = countMatchingCompetitors(Relation, windowSize = 10, between = F))
rez007$trackDF$default %>% select(id, text, noCompetitors, noMatchingCompetitors) %>% slice(1:20)
#> # A tibble: 20 × 4
#> id text noCompetitors noMatc…¹
#> <chr> <chr> <int> <int>
#> 1 1096E4AFFFE65 I 0 0
#> 2 92F20ACA5F06 I 0 0
#> 3 7E5BB65072C was n't gon na do 2 1
#> 4 1F74D2B049FA4 this 3 2
#> 5 2485C4F740FC0 <0> 2 0
#> 6 1BF2260B4AB78 Stay up late 2 1
#> 7 6B37B5A80F2A <0> 1 1
#> 8 259C2C2979B6C the purpose of getting up in the morning 0 0
#> 9 1D1F2B7054E32 getting up in the morning 0 0
#> 10 1FA3806680C84 the morning 0 0
#> 11 4B32FD84BA10 I 3 0
#> 12 3098AB24A0FA6 it 4 1
#> 13 2D5885FCA1E15 a hard habit to break 1 0
#> 14 1C311FD331AC4 I 2 1
#> 15 2E01153F693D3 It 2 1
#> 16 28CFE0272CE1C I 1 1
#> 17 24229602BDC4C I 0 0
#> 18 1A598AE39592B midnight 1 0
#> 19 38C628AB4DA4D What 0 0
#> 20 36B8918BB6E64 I 1 0
#> # … with abbreviated variable name ¹noMatchingCompetitors
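To see what the between option changes, consider this base R sketch. The toy table and the helper countCompetitorsToy() are hypothetical, and the treatment of first mentions under between = TRUE (all earlier mentions in other chains count) is an assumption of the sketch, not a statement about rezonateR's behaviour.

```r
# Hypothetical toy track table: two chains alternating, one mention per unit
tracks = data.frame(
  chain = c("A", "B", "A", "B", "A"),
  unitSeq = 1:5
)
rows = seq_len(nrow(tracks))

# Count earlier mentions that belong to a different chain
countCompetitorsToy = function(i, windowSize = Inf, between = FALSE) {
  otherChain = tracks$chain != tracks$chain[i]
  earlier = i > rows
  inWindow = windowSize >= tracks$unitSeq[i] - tracks$unitSeq
  if (between) {
    # Only look at rows after the previous mention of the same chain, if any
    prevSame = which(earlier & !otherChain)
    if (length(prevSame) > 0) earlier = earlier & rows > max(prevSame)
  }
  sum(otherChain & earlier & inWindow)
}

anywhere = sapply(rows, countCompetitorsToy)
sinceLast = sapply(rows, countCompetitorsToy, between = TRUE)
cbind(tracks, anywhere, sinceLast)
```

With between = TRUE, the counts can only stay the same or shrink, since the search stops at the previous coreferential mention.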
Adding verb information to the track table, and vice versa
We often want to connect information about verbs to their arguments. We may either put verb information in a track table, or put track information in a verb table. The former approach can be taken when investigating issues like coreference, and the latter for issues like argument structure.
If we want to add verb information to the track table, we can do this in two steps:
- Add a treeParent column to trackDF$default that takes the value of the parent column of treeEntryDF.
- Using the treeParent column, find the corresponding verb in the verb table chunkDF$verb through the treeEntry column of chunkDF$verb.
rez007 = rez007 %>%
addFieldForeign("track", "default", "treeEntry", "default", "treeEntry", "treeParent", "parent", fieldaccess = "foreign")
rez007$trackDF$default = rez007$trackDF$default %>%
rez_left_join(rez007$chunkDF$verb %>% select(id, text, treeEntry),
by = c(treeParent = "treeEntry"),
suffix = c("", "_verb"),
df2Address = "chunkDF/verb",
fkey = "treeParent",
df2key = "treeEntry",
rezrObj = rez007) %>%
rename(verbID = id_verb, verbText = text_verb)
rez007$trackDF$default %>% select(id, treeParent, verbID, verbText) %>% slice(1:20)
#> # A tibble: 20 × 4
#> id treeParent verbID verbText
#> <chr> <chr> <chr> <chr>
#> 1 1096E4AFFFE65 3005263A8352B ABe614xvgQ7XT said
#> 2 92F20ACA5F06 35B1DC6EA25E5 744AD104FE64 was n't gon na do
#> 3 7E5BB65072C NA NA NA
#> 4 1F74D2B049FA4 35B1DC6EA25E5 744AD104FE64 was n't gon na do
#> 5 2485C4F740FC0 27760709F2C9B 15B9BB5D5086C Stay
#> 6 1BF2260B4AB78 NA NA NA
#> 7 6B37B5A80F2A 37FF92CF72A49 1A14BB68EDAC9 defeats
#> 8 259C2C2979B6C 37FF92CF72A49 1A14BB68EDAC9 defeats
#> 9 1D1F2B7054E32 NA NA NA
#> 10 1FA3806680C84 NA NA NA
#> 11 4B32FD84BA10 1CC3629361ED5 2CCD5F5A950BE know
#> 12 3098AB24A0FA6 1CAE658BC6B04 1F1135429F347 's
#> 13 2D5885FCA1E15 1CAE658BC6B04 1F1135429F347 's
#> 14 1C311FD331AC4 3342A614FCFB 30F18BABDB83C do n't
#> 15 2E01153F693D3 BCC81F077D99 94188DEA38BB is
#> 16 28CFE0272CE1C 6B9C62B7793B 13A37EC5064A2 do n't stay
#> 17 24229602BDC4C 23BEEC722C8A0 29C877785FC4B 'm
#> 18 1A598AE39592B NA NA NA
#> 19 38C628AB4DA4D EC3E7F082CF3 1981C9A8D5DD8 do
#> 20 36B8918BB6E64 EC3E7F082CF3 1981C9A8D5DD8 do
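The two-step lookup is an ordinary chain of key matches. Here is a base R sketch on hypothetical toy tables (the three data frames only stand in for treeEntryDF, trackDF and chunkDF$verb; they do not reproduce the real rezrDF structure or the rezonateR join functions):

```r
# Hypothetical toy tables
treeEntries = data.frame(id = c("t1", "t2", "t3"),
                         parent = c("t2", NA, "t2"),
                         stringsAsFactors = FALSE)
tracks = data.frame(id = c("m1", "m2", "m3"),
                    treeEntry = c("t1", "t3", NA),
                    stringsAsFactors = FALSE)
verbs = data.frame(id = "v1", text = "said", treeEntry = "t2",
                   stringsAsFactors = FALSE)

# Step 1: look up the tree parent of each mention's tree entry
tracks$treeParent = treeEntries$parent[match(tracks$treeEntry, treeEntries$id)]

# Step 2: find the verb whose own tree entry is that parent
verbRow = match(tracks$treeParent, verbs$treeEntry)
tracks$verbID = verbs$id[verbRow]
tracks$verbText = verbs$text[verbRow]
tracks
```

Mentions without a tree entry, or whose parent is not a verb, simply end up with NA, as in rows 3 and 6 of the output above.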
Now with the verbID in place, we can do the reverse process of
putting argument information in the verb table too. The reverse process
is similar, but a little more dangerous because you could potentially
create duplicate rows in chunkDF$verb
if you are not
careful! Let’s say we want to add information about the subject to
chunkDF$verb
- say, the number of the subject. You would
first have to make sure that each verb only has one subject! If you have
made a lot of annotations, mistakes are likely to happen, so make sure
to check first. In this code, we extract the verb IDs of each subject,
and make sure that the verb IDs are unique:
verbsBySubject = rez007$trackDF$default %>%
filter(Relation == "Subj", !is.na(verbID)) %>%
pull(verbID)
# Compare the number of verb IDs with the number of distinct verb IDs
if(length(verbsBySubject) != length(unique(verbsBySubject))) print("Error found!") else print("You're good!")
#> [1] "You're good!"
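If the check fails, it helps to know which verbs are affected. A small base R sketch, using a hypothetical vector of verb IDs rather than the real data:

```r
# Hypothetical verb IDs, one per subject mention
verbIDs = c("v1", "v2", "v2", "v3")

# TRUE if at least one verb ID occurs more than once
hasDuplicates = anyDuplicated(verbIDs) > 0

# The IDs that are repeated, so that the annotations can be inspected
repeated = unique(verbIDs[duplicated(verbIDs)])
repeated
```

The same two calls can be applied to verbsBySubject to locate verbs that were annotated with more than one subject.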
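Now we can move this information to the verb table. Note that the join below takes every track with a verbID, not only the subjects, so a verb with more than one argument appears on more than one row of the output (for example, was n't gon na do); to keep one row per verb, filter trackDF$default to Relation == "Subj" before joining: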
rez007$chunkDF$verb = rez007$chunkDF$verb %>%
rez_left_join(rez007$trackDF$default %>% select(id, text, verbID),
by = c(id = "verbID"),
suffix = c("", "_subj"),
df2Address = "trackDF/default",
fkey = "id",
df2key = "verbID",
rezrObj = rez007) %>%
rename(subjID = id_subj, subjText = text_subj)
rez007$chunkDF$verb %>% select(id, text, subjID, subjText)
#> # A tibble: 163 × 4
#> id text subjID subjText
#> <chr> <chr> <chr> <chr>
#> 1 210FB26A315A said NA NA
#> 2 744AD104FE64 was n't gon na do 92F20ACA5F06 I
#> 3 744AD104FE64 was n't gon na do 1F74D2B049FA4 this
#> 4 15B9BB5D5086C Stay 2485C4F740FC0 <0>
#> 5 1A14BB68EDAC9 defeats 6B37B5A80F2A <0>
#> 6 1A14BB68EDAC9 defeats 259C2C2979B6C the purpose of getting up …
#> 7 2CCD5F5A950BE know 4B32FD84BA10 I
#> 8 1F1135429F347 's 3098AB24A0FA6 it
#> 9 1F1135429F347 's 2D5885FCA1E15 a hard habit to break
#> 10 2F857247FD9D5 a hard habit to break NA NA
#> # … with 153 more rows
#> # ℹ Use `print(n = ...)` to see more rows
Bridging
Natively, Rezonator does not yet support bridging annotation.
rezonateR
handles bridging a bit unusually, since it is
difficult to do direct bridging annotation without an annotation
interface like Rezonator’s.
The first step in doing bridging annotation is to create a
frameMatrix
, a data frame which is used to enter framing
relationships between chains (for example, the car’s engine has
a part-whole relationship with the car). But before doing that,
we must ensure that there are no chains with repeat names. We use the
function undupeLayers()
, which has three arguments: the
rezrObj
, the layer you want to undupe, and the field/column
you want to undupe. This will add numbers next to duplicated chain
names. Then we can call addFrameMatrix()
on the
rezrObj
. The following code does these steps, then shows
the first 10 rows and 12 columns of the frameMatrix
(the first two
columns give the IDs and names of the chains):
rez007 = undupeLayers(rez007, "trail", "name")
rez007 = addFrameMatrix(rez007)
frameMatrix(rez007)[1:10, 1:12]
#> # A tibble: 10 × 12
#> id name the w…¹ right…² The t…³ ànyth…⁴ the t…⁵ Trail…⁶ our p…⁷ All t…⁸
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2053E9… the … "" "" "" "" "" "" "" ""
#> 2 2DFABF… righ… "" "" "" "" "" "" "" ""
#> 3 13B30F… The … "" "" "" "" "" "" "" ""
#> 4 21E91E… ànyt… "" "" "" "" "" "" "" ""
#> 5 14CE0A… the … "" "" "" "" "" "" "" ""
#> 6 2B6707… Trai… "" "" "" "" "" "" "" ""
#> 7 4D1295… our … "" "" "" "" "" "" "" ""
#> 8 2A0D42… All … "" "" "" "" "" "" "" ""
#> 9 976E7D… hypo… "" "" "" "" "" "" "" ""
#> 10 C0687A… Tim'… "" "" "" "" "" "" "" ""
#> # … with 2 more variables: `hypothetical Tim complaint` <chr>,
#> # `Tim's first reaction` <chr>, and abbreviated variable names
#> # ¹`the way they were feeling`, ²`right now`, ³`The two couples`,
#> # ⁴`ànything positive about Tim's uncomfortable situation that he could do`,
#> # ⁵`the thing that really scares Alice`, ⁶`Trail 87`, ⁷`our parents`,
#> # ⁸`All this other shit (other than talking to Ben)`
#> # ℹ Use `colnames()` to see all variable names
The second step is to export the frameMatrix
and
populate it with actual annotations. Although it does not apply to our
case, in many cases the function reduceFrameMatrix() will
be useful for removing rows and columns that do not actually participate
in framing relations, or for dividing the matrix into subparts that don’t have
framing relationships with each other, so that we end up with a cleaner
CSV to annotate.
We use the function obscureUpper()
to obscure the upper
triangular matrix of ‘repeat’ entries so that we don’t duplicate our
annotation efforts: if we’ve already annotated that the car’s
engine and the car have a part-whole relationship in the
lower triangular matrix, we don’t need to annotate that the car
and the car’s engine have a whole-part relationship!
After obscureUpper()
and (optionally)
reduceFrameMatrix()
, the frameMatrix can be exported using
rez_write_csv()
and edited in an external editor. A
spreadsheet program is highly recommended for this; you can use the
freeze frame feature to annotate these relations more easily. When
entering relationships, use the format ‘(role of the row entity)-(role
of the column entity)’. For example, if the car’s engine is the
row and the car is the column, type part-whole
,
not whole-part
. Alternatively, if there are only a few of
these relations and you already remember which ones they are, you can
edit those relations in R directly. You can simply use base R assignment
for this.
For this text, we will mainly be annotating individual-group relationships.
rez_write_csv(obscureUpper(frameMatrix(rez007)), "rez007_frame.csv")
newFrame = rez_read_csv("rez007_frame_edited.csv", origDF = frameMatrix(rez007))
newFrame[10:20,c(1,2,12:22)]
#> # A tibble: 11 × 13
#> id name Tim's…¹ what …² imper…³ a har…⁴ nine …⁵ self …⁶ Tim Mary Ron
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 C068… Tim'… / / / / / / / / /
#> 2 1AE4… what… NA / / / / / / / /
#> 3 318B… impe… NA NA / / / / / / /
#> 4 2DAC… a ha… NA NA NA / / / / / /
#> 5 30AF… nine… NA NA NA NA / / / / /
#> 6 1685… self… NA NA NA NA NA / / / /
#> 7 5F7A… Tim NA NA NA NA NA NA / / /
#> 8 278D… Mary NA NA NA NA NA NA NA / /
#> 9 29D3… Ron NA NA NA NA NA NA NA NA /
#> 10 309A… shit… NA NA NA NA NA NA NA NA NA
#> 11 34F8… Tim … NA NA NA NA NA NA grou… NA NA
#> # … with 2 more variables: `shit (nothing)` <chr>, `Tim and Mandy` <chr>, and
#> # abbreviated variable names ¹`Tim's first reaction`,
#> # ²`what they don't realize`, ³`impersonal you 3 (trained by hard times)`,
#> # ⁴`a hard habit to break`, ⁵`nine o'clock`, ⁶`self assertiveness`
#> # ℹ Use `colnames()` to see all variable names
Rather than updateFromDF()
, we use
updateFrameMatrixFromDF()
to update the
frameMatrix
. This function will ‘flip’ the relationships
for the upper triangular matrix. For example, if in the lower triangular
matrix you annotated the car’s engine as having a part-whole
relationship with the car, then ‘whole-part’ will show up in
the upper triangular matrix for the row ‘car’ and the column the
car’s engine.
frameMatrix(rez007) = updateFrameMatrixFromDF(frameMatrix(rez007), newFrame)
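The obscure-then-flip workflow can be mimicked with a plain character matrix. The sketch below is only an illustration in base R: the chain names, the helper flipRelation() and the loop are hypothetical and are not how rezonateR stores or updates its frameMatrix.

```r
# Hypothetical chains and an empty relation matrix
chains = c("the car", "the car's engine", "Tim")
fm = matrix("", nrow = 3, ncol = 3, dimnames = list(chains, chains))

# Obscure the diagonal and upper triangle so each pair is annotated only once
obscured = fm
obscured[upper.tri(obscured, diag = TRUE)] = "/"

# Annotate in the lower triangle as (role of row)-(role of column)
obscured["the car's engine", "the car"] = "part-whole"

# Swap the two roles of a relation label, e.g. "part-whole" becomes "whole-part"
flipRelation = function(rel) {
  paste(rev(strsplit(rel, "-", fixed = TRUE)[[1]]), collapse = "-")
}

# Rebuild the full matrix: keep each lower-triangle entry and mirror it, flipped
full = fm
filled = which(lower.tri(obscured) & obscured != "", arr.ind = TRUE)
for (k in seq_len(nrow(filled))) {
  r = filled[k, "row"]
  cl = filled[k, "col"]
  full[r, cl] = obscured[r, cl]
  full[cl, r] = flipRelation(obscured[r, cl])
}
full
```

This simple flip assumes that role names themselves contain no hyphen; labels such as individual-group satisfy that.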
After having updated the frame matrix, we can easily extract information about bridging with the following functions:
- lastBridgeUnit(): Get the location (in units) of the previous unit with a bridge to this unit.
- lastBridgeToken(): Get the location (in tokens) of the bridging expression to this unit.
- unitsToLastBridge(): Get the number of units between the closest unit with a bridge to the current unit, and the current unit.
- tokensToLastBridge(): Get the number of tokens between the bridging expression and this unit.
- countPrevBridges(): Count the number of previous bridging expressions in a specified window.
In practice, we will generally be using
unitsToLastBridge()
and tokensToLastBridge()
.
The arguments needed are very similar to what we saw for
unitsToLastMention() and tokensToLastMention(). Here are the
arguments for unitsToLastBridge():
- frameMatrix: The frameMatrix.
- unitSeq: As before.
- chain: As before.
- tokenOrderLast: The token sequence value for the last token in an expression, by default docTokenSeqLast.
- tokenOrderFirst: The token sequence value for the first token in an expression, by default docTokenSeqFirst.
- inclRelations: Vector of relations that will be counted. This allows you to, for example, count part-whole relations but not whole-part. If left blank, everything will be counted.
The arguments of tokensToLastBridge() are again
mostly things we have seen before:
- frameMatrix: As before.
- firstOrLast: Do you count the first or last token of the previous bridging expression? Either "first" or, by default, "last".
- tokenOrderFirst: As before.
- tokenOrderLast: As before.
- chain: As before.
- zeroProtocol: As before.
- zeroCond: As before.
- unitSeq: As before.
- unitDF: As before.
- inclRelations: As before.
Let’s say you want to extract the units and tokens to the previous
bridge, including only the "individual-group" relation (so
that it doesn’t count as a bridge if the current mention is a group that
includes the previous mention, but it does count if the current mention
is an individual that is included in the previous mention). Here is the
code:
rez007$trackDF$default = rez007$trackDF$default %>%
rez_mutate(bridgeDistUnit = unitsToLastBridge(frameMatrix(rez007),
inclRelations = "individual-group"),
bridgeDistToken = tokensToLastBridge(frameMatrix(rez007),
inclRelations = "individual-group"))
rez007$trackDF$default %>% select(id, text, bridgeDistUnit, bridgeDistToken) %>% slice(53:63)
#> # A tibble: 11 × 4
#> id text bridgeDistUnit bridgeDistToken
#> <chr> <chr> <dbl> <dbl>
#> 1 16A52A7994637 Tim 8 46
#> 2 4A063419D556 he NA NA
#> 3 36C85595E9EA4 every little dime that he makes NA NA
#> 4 1EA9E1A9D3BEE he NA NA
#> 5 290A040DBE92D He NA NA
#> 6 513F50ED54C2 any breaks NA NA
#> 7 382D95A5FE64F Tim 15 80
#> 8 21B834B5E4F57 salary NA NA
#> 9 CE3A09CF9407 he 16 86
#> 10 15C2ADACD3CE5 leave NA NA
#> 11 15408B58961EE he 19 97
Notice that if there has been no preceding bridge, the value is
NA
.
A toy analysis
Now that we’ve come so far, let’s try to do a mockup of a real linguistic analysis! Of course, an actual analysis would take far more data than what we have here, as well as a more carefully designed annotation scheme. But what we do here should suffice to demonstrate how an analysis might be done.
Let’s try to predict the number of characters inside a referential expression from five variables:
- noPrevSubjMentionsIn20: The number of coreferent subject mentions within the 20 previous units.
- noPrevNonSubjMentionsIn20: The number of coreferent non-subject mentions within the 20 previous units.
- noCompetitors: The number of competitors within the ten previous units.
- Relation: Is the current mention a subject?
- number: Is the current mention singular or plural?
We will use a linear regression for this prediction. (This probably
isn’t the best model, but let’s keep it simple for this demonstration.)
First we need to create noPrevNonSubjMentionsIn20
(which we
do in an emancipated rezrDF
to avoid clogging up the main
table), then we’ll convert Relation
to a factor, and then
we’ll use the lm()
function in base R to do the
prediction:
analysis_df = rez007$trackDF$default %>% rez_mutate(
noPrevNonSubjMentionsIn20 = noPrevMentionsIn20 - noPrevSubjMentionsIn20,
Relation = stringToFactor(Relation)
)
lm_nochar = lm(charCount ~ noPrevSubjMentionsIn20 + noPrevNonSubjMentionsIn20 + noCompetitors + Relation + number, data = analysis_df)
lm_nochar
#>
#> Call:
#> lm(formula = charCount ~ noPrevSubjMentionsIn20 + noPrevNonSubjMentionsIn20 +
#> noCompetitors + Relation + number, data = analysis_df)
#>
#> Coefficients:
#> (Intercept) noPrevSubjMentionsIn20
#> 15.0460 -0.7583
#> noPrevNonSubjMentionsIn20 noCompetitors
#> -0.9617 -0.5511
#> RelationSubj numbersg
#> -7.8235 -0.3911
anova(lm_nochar)
#> Analysis of Variance Table
#>
#> Response: charCount
#> Df Sum Sq Mean Sq F value Pr(>F)
#> noPrevSubjMentionsIn20 1 2387.2 2387.18 23.6741 2.140e-06 ***
#> noPrevNonSubjMentionsIn20 1 399.1 399.05 3.9575 0.04787 *
#> noCompetitors 1 1.3 1.26 0.0125 0.91102
#> Relation 1 3162.2 3162.21 31.3602 6.191e-08 ***
#> number 1 5.2 5.23 0.0518 0.82008
#> Residuals 226 22788.8 100.84
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
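As we can see, noPrevSubjMentionsIn20 and Relation emerge as strong
predictors (p < 0.001), and noPrevNonSubjMentionsIn20 is marginally
significant (p = 0.048), whereas noCompetitors and number are not
significant. The coefficients are negative: entities mentioned more often
in the previous 20 units, and entities that are subjects, have a much
stronger tendency to be light.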
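One caveat when reading such a table: anova() on a single lm fit reports sequential sums of squares, so each term is tested after the terms listed before it, and the result can change with the order of the formula. The sketch below illustrates this with R's built-in mtcars data (chosen only because it ships with R; it has nothing to do with the corpus), and shows drop1() as one order-independent alternative.

```r
# Two fits with the same predictors, entered in a different order
fitA = lm(mpg ~ wt + hp, data = mtcars)
fitB = lm(mpg ~ hp + wt, data = mtcars)

# Sequential sums of squares: the value for wt depends on whether hp came first
seqA = anova(fitA)
seqB = anova(fitB)
seqA["wt", "Sum Sq"]
seqB["wt", "Sum Sq"]

# drop1() tests each term given all the others, so the order does not matter
margA = drop1(fitA, test = "F")
margB = drop1(fitB, test = "F")
margA["wt", "Sum of Sq"]
margB["wt", "Sum of Sq"]
```

Applied to lm_nochar, drop1(lm_nochar, test = "F") would give a test for each predictor that does not depend on its position in the formula.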
And so our journey ends here.
This concludes our journey through the basic functionality of
rezonateR. As usual, let’s not forget to save, for one last time:
savePath = "rez007.Rdata"
rez_save(rez007, savePath)
#> Saving rezrObj ...