I have another question about the kwic() function from the quanteda package. I want to extract the five words on either side of a given keyword (in the example below, "stack overflow" and "radio star"). However, after I remove stopwords during tokenization, kwic() returns fewer than five words before and after the keyword instead of a full five-word window. Is there a way to tell kwic() to ignore stopwords when counting the words in the context window?
Reprex below:
library(quanteda)
speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stack overflow. However there are so many more words that I am not interested in assessing the sentiment of. Now I am also adding a few words that would not be removed as stopwords, as follows: Maintenance, Television, Superstar, Textual Analysis. Video killed the radio star is another sentence I would like to include.",
"This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stack overflow. Once again adding some non-stopwords: Maintenance, television, superstar, textual analysis. Video killed the radio star is another sentence I would like to include.",
"Finally, this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech. Here are some more non-stopwords: Maintenance, television, superstar, textual analysis")
data <- data.frame(id = 1:3,
                   speechContent = speech)
test_corpus <- corpus(data,
                      docid_field = "id",
                      text_field = "speechContent")
test_tokens <- tokens(test_corpus,
                      remove_punct = TRUE,
                      remove_numbers = TRUE) %>%
  tokens_remove(stopwords("en"), padding = TRUE) %>%
  tokens_compound(pattern = phrase(c("stack overflow*", "radio star*")),
                  concatenator = " ")
test_kwic <- kwic(test_tokens,
                  pattern = c("stack overflow", "radio star"),
                  window = 5)
As @phiver suggested, using padding = FALSE when removing stopwords solved the problem. Thank you!
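Use padding = FALSE when removing stopwords. With padding = TRUE, each removed stopword leaves an empty placeholder token behind, and those placeholders still occupy positions in the kwic() window.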
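To illustrate the difference, here is a minimal sketch, assuming a reasonably recent quanteda version. The one-sentence text and the object names (txt, toks_pad, toks_nopad, res) are made up for this illustration and are not part of the original reprex.

```r
library(quanteda)

# Hypothetical one-document example, shortened from the reprex above
txt <- c(d1 = "Video killed the radio star is another sentence I would like to include.")
toks <- tokens(txt, remove_punct = TRUE)

# padding = TRUE: every removed stopword is replaced by an empty "" token,
# so the original token positions are preserved
toks_pad <- tokens_remove(toks, stopwords("en"), padding = TRUE)

# padding = FALSE: stopwords are dropped and the remaining tokens close up
toks_nopad <- tokens_remove(toks, stopwords("en"), padding = FALSE)

# Compound the keyword as in the question, then look at its context
toks_nopad <- tokens_compound(toks_nopad,
                              pattern = phrase("radio star*"),
                              concatenator = " ")
res <- kwic(toks_nopad, pattern = "radio star", window = 5)
print(res)
```

Printing toks_pad next to toks_nopad should show the empty placeholders that take up window slots in the first case and are absent in the second.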