Problem
potmergednew <- structure(list(macroscopic_description = c(
"POT A: The stomach and fundus are unremarkable. The incisura is slightly inflamed. The antrum shows mild erythema. The pylorus appears normal. The body is intact.\nPOT B: The body and pylorus are in good condition. The stomach exhibits no abnormalities. The incisura is clear. The antrum has a smooth appearance. The fundus shows no lesions.\nPOT C: The fundus and body appear healthy. The stomach lining is intact. The incisura is not inflamed. The antrum shows minor irritation. The pylorus is well-formed.\nPOT D: The pylorus and incisura look normal. The stomach has no visible issues. The antrum shows some redness. The fundus is clear of lesions. The body is unremarkable.\nPOT E: The antrum and body are in good shape. The stomach shows no signs of disease. The pylorus appears healthy. The incisura is normal. The fundus is clear.",
"POT A: The body and pylorus are normal. The incisura is slightly inflamed. The stomach is in good shape. The antrum is healthy. The fundus is unremarkable.\nPOT B: The fundus and incisura look normal. The pylorus is in good condition. The antrum shows minor redness. The body appears healthy. The stomach is intact.\nPOT C: The stomach and fundus are clear. The antrum has no lesions. The body shows no abnormalities. The incisura is unremarkable. The pylorus is well-formed.\nPOT D: The antrum and pylorus are in good shape. The stomach lining is healthy. The incisura shows no inflammation. The body appears normal. The fundus is clear.\nPOT E: The incisura and body look normal. The antrum shows mild erythema. The pylorus is clear. The stomach is unremarkable. The fundus is healthy.",
"POT A: The antrum and body are clear. The stomach is unremarkable. The pylorus appears normal. The incisura is not inflamed. The fundus is in good condition.\nPOT B: The stomach and incisura look healthy. The antrum is normal. The body is clear. The pylorus shows no abnormalities. The fundus is intact.\nPOT C: The pylorus and stomach are in good condition. The body appears healthy. The incisura is slightly inflamed. The antrum is clear. The fundus shows no lesions.\nPOT D: The body and incisura are unremarkable. The stomach has no visible issues. The antrum is healthy. The fundus looks normal. The pylorus is clear.\nPOT E: The fundus and body appear normal. The stomach is clear. The antrum shows no abnormalities. The incisura is healthy. The pylorus looks good.",
"POT A: The stomach and pylorus are clear. The incisura appears normal. The antrum is in good shape. The fundus shows no lesions. The body is healthy.\nPOT B: The body and fundus are unremarkable. The incisura shows no inflammation. The stomach is healthy. The antrum is clear. The pylorus is normal.\nPOT C: The antrum and body look normal. The pylorus is clear. The stomach is unremarkable. The incisura is healthy. The fundus shows no abnormalities.\nPOT D: The fundus and incisura are in good condition. The body shows no lesions. The stomach is healthy. The antrum is normal. The pylorus is clear.\nPOT E: The body and antrum are clear. The fundus appears healthy. The stomach is normal. The incisura shows no abnormalities. The pylorus is unremarkable.",
"POT A: The fundus and antrum are clear. The stomach shows no issues. The body is healthy. The incisura is normal. The pylorus is clear.\nPOT B: The incisura and body are unremarkable. The stomach looks healthy. The antrum is clear. The pylorus appears normal. The fundus is intact.\nPOT C: The body and pylorus are normal. The incisura shows no inflammation. The stomach is healthy. The antrum is clear. The fundus looks good.\nPOT D: The fundus and stomach appear normal. The antrum is healthy. The incisura is clear. The body shows no lesions. The pylorus is unremarkable.\nPOT E: The stomach and antrum look healthy. The body is in good shape. The incisura is normal. The pylorus is clear. The fundus shows no abnormalities."
)), class = "data.frame", row.names = c(NA, -5L))
Desired outcome
I have a list of terms that I would like to extract, for example
terms <- c("stomach", "antrum", "incisura")
I would like to extract these terms according to the POT they appear in, into a separate column, so that, for example, the output for the first two rows would be
POT A stomach, fundus, incisura, antrum, pylorus, body POT B body, pylorus, stomach, incisura, antrum, fundus POT C fundus, body, stomach, incisura, antrum, pylorus POT D pylorus, incisura, stomach, antrum, fundus, body POT E antrum, body, stomach, pylorus
POT A body, pylorus, incisura, stomach, antrum, fundus POT B fundus, incisura, pylorus, antrum, body, stomach POT C stomach, fundus, antrum, body, incisura, pylorus POT D antrum, pylorus, stomach, incisura, body, fundus POT E incisura, body, pylorus, stomach, fundus
Attempts
I tried using str_extract_all from stringr:
potmergednew$macroscopic_descriptionClean <- sapply(
str_extract_all(potmergednew$macroscopic_description, paste(terms, collapse = "|")),
function(x) tolower(x)
)
but this just extracts the terms without grouping them by POT.
Any chance you could include a sample from your actual dataset? For example, the output of dput(head(potmergednew[,"macroscopic_description", drop = FALSE]))
I have added it now.
Why should body, pylorus and fundus appear in the output? And should the desired result be a separate data frame or vector with the combined matches from all POTs, so that, for example, POT A of every row of the input data frame ends up in a single row of the output?
Ah, sorry, yes, you are right. I will correct my mistake. Thanks.
Here is one possible solution, though I am not sure it is what you need. The extracted terms are added to a new column.
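library(dplyr)
library(stringr)
# `data` is a data frame with one POT entry per row in the column `Description`
data |>
  mutate(
    Extracted_Terms = sapply(str_extract_all(Description, paste(terms, collapse = "|")),
                             function(x) paste(x, collapse = ", "))
  )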
  Description                 Extracted_Terms
1 POT A stomach intact        stomach
2 POT B antrum available      antrum
3 POT C antrum and incisura   antrum, incisura
4 POT D Kidneys
Thanks. The input is text separated by newline characters within a single cell, not separate strings or rows of a data frame, so unfortunately this does not produce the requested result.
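To make the mismatch concrete: each cell of macroscopic_description holds several POT entries separated by "\n", so a row-wise extraction treats them as one string. Below is a minimal base R sketch of reshaping such a cell into one row per POT first. It assumes every entry has the form "POT X: text"; the toy data frame `df` and the column names `id`, `pot` and `text` are made up for illustration and are not from the original post.

```r
# Toy input: one cell holds several POT entries separated by "\n".
df <- data.frame(
  macroscopic_description = c(
    "POT A: The stomach is intact.\nPOT B: The antrum is clear.",
    "POT A: The incisura is normal."
  ),
  stringsAsFactors = FALSE
)

# Split each cell into its individual POT entries.
parts <- strsplit(df$macroscopic_description, "\n", fixed = TRUE)

# One row per POT entry, keeping the original row number as `id`.
long <- data.frame(
  id   = rep(seq_along(parts), lengths(parts)),
  line = unlist(parts),
  stringsAsFactors = FALSE
)

# Separate the "POT X" label from the free text after the colon.
long$pot  <- sub(":.*$", "", long$line)
long$text <- sub("^[^:]*:[[:space:]]*", "", long$line)
long$line <- NULL

print(long)
```

Once the data is in this long shape, a per-row extraction such as the mutate() call above would apply to each POT separately.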
With str_extract_all and strsplit:
terms <- c("stomach", "antrum", "incisura", "fundus", "pylorus", "body")
library(stringr)
potmergednew$terms <-
sapply(strsplit(potmergednew$macroscopic_description, "\n"), \(x)
paste(sapply(x, \(y)
unlist(str_extract_all(y, paste0("POT...|", terms, collapse = "|")))),
collapse = " "))
Output:
potmergednew$terms
[1] "POT A: stomach fundus incisura antrum pylorus body POT B: body pylorus stomach incisura antrum fundus POT C: fundus body stomach incisura antrum pylorus POT D: pylorus incisura stomach antrum fundus body POT E: antrum body stomach pylorus incisura fundus"
[2] "POT A: body pylorus incisura stomach antrum fundus POT B: fundus incisura pylorus antrum body stomach POT C: stomach fundus antrum body incisura pylorus POT D: antrum pylorus stomach incisura body fundus POT E: incisura body antrum pylorus stomach fundus"
[3] "POT A: antrum body stomach pylorus incisura fundus POT B: stomach incisura antrum body pylorus fundus POT C: pylorus stomach body incisura antrum fundus POT D: body incisura stomach antrum fundus pylorus POT E: fundus body stomach antrum incisura pylorus"
[4] "POT A: stomach pylorus incisura antrum fundus body POT B: body fundus incisura stomach antrum pylorus POT C: antrum body pylorus stomach incisura fundus POT D: fundus incisura body stomach antrum pylorus POT E: body antrum fundus stomach incisura pylorus"
[5] "POT A: fundus antrum stomach body incisura pylorus POT B: incisura body stomach antrum pylorus fundus POT C: body pylorus incisura stomach antrum fundus POT D: fundus stomach antrum incisura body pylorus POT E: stomach antrum body incisura pylorus fundus"
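One caveat about the call above: it relies on the inner sapply() simplifying to a matrix, which only happens when every POT yields the same number of matches. With uneven counts sapply() returns a list, and paste() would then deparse its elements. A hedged alternative sketch in base R that handles each POT on its own is shown below. The helper name `extract_by_pot` and the toy `cells` vector are hypothetical, and the sketch assumes each line starts with a "POT X:" label.

```r
# Hypothetical helper: for one cell, return "POT X: term term ..." per line.
extract_by_pot <- function(cell, terms) {
  pattern <- paste(terms, collapse = "|")
  entries <- strsplit(cell, "\n", fixed = TRUE)[[1]]
  per_pot <- vapply(entries, function(entry) {
    label <- sub(":.*$", "", entry)          # e.g. "POT A"
    text  <- sub("^[^:]*:", "", entry)       # free text after the label
    found <- regmatches(text, gregexpr(pattern, text))[[1]]
    # An entry without matches keeps only its label.
    paste(c(paste0(label, ":"), found), collapse = " ")
  }, character(1), USE.NAMES = FALSE)
  paste(per_pot, collapse = " ")
}

terms <- c("stomach", "antrum", "incisura")
cells <- c(
  "POT A: The stomach and antrum are clear.\nPOT B: The incisura is normal.",
  "POT A: The body is intact.\nPOT B: The antrum is healthy."
)

result <- vapply(cells, extract_by_pot, character(1),
                 terms = terms, USE.NAMES = FALSE)
print(result)
```

Because vapply() insists on exactly one string per entry, a POT with zero or many matches cannot change the shape of the result.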
With the tidytext tokenizer, perhaps something like this:
library(tidyverse)
library(tidytext)
potmergednew <- structure(list(macroscopic_description = c(
"POT A: The stomach and fundus are unremarkable. The incisura is slightly inflamed. The antrum shows mild erythema. The pylorus appears normal. The body is intact.\nPOT B: The body and pylorus are in good condition. The stomach exhibits no abnormalities. The incisura is clear. The antrum has a smooth appearance. The fundus shows no lesions.\nPOT C: The fundus and body appear healthy. The stomach lining is intact. The incisura is not inflamed. The antrum shows minor irritation. The pylorus is well-formed.\nPOT D: The pylorus and incisura look normal. The stomach has no visible issues. The antrum shows some redness. The fundus is clear of lesions. The body is unremarkable.\nPOT E: The antrum and body are in good shape. The stomach shows no signs of disease. The pylorus appears healthy. The incisura is normal. The fundus is clear.",
"POT A: The body and pylorus are normal. The incisura is slightly inflamed. The stomach is in good shape. The antrum is healthy. The fundus is unremarkable.\nPOT B: The fundus and incisura look normal. The pylorus is in good condition. The antrum shows minor redness. The body appears healthy. The stomach is intact.\nPOT C: The stomach and fundus are clear. The antrum has no lesions. The body shows no abnormalities. The incisura is unremarkable. The pylorus is well-formed.\nPOT D: The antrum and pylorus are in good shape. The stomach lining is healthy. The incisura shows no inflammation. The body appears normal. The fundus is clear.\nPOT E: The incisura and body look normal. The antrum shows mild erythema. The pylorus is clear. The stomach is unremarkable. The fundus is healthy.",
"POT A: The antrum and body are clear. The stomach is unremarkable. The pylorus appears normal. The incisura is not inflamed. The fundus is in good condition.\nPOT B: The stomach and incisura look healthy. The antrum is normal. The body is clear. The pylorus shows no abnormalities. The fundus is intact.\nPOT C: The pylorus and stomach are in good condition. The body appears healthy. The incisura is slightly inflamed. The antrum is clear. The fundus shows no lesions.\nPOT D: The body and incisura are unremarkable. The stomach has no visible issues. The antrum is healthy. The fundus looks normal. The pylorus is clear.\nPOT E: The fundus and body appear normal. The stomach is clear. The antrum shows no abnormalities. The incisura is healthy. The pylorus looks good.",
"POT A: The stomach and pylorus are clear. The incisura appears normal. The antrum is in good shape. The fundus shows no lesions. The body is healthy.\nPOT B: The body and fundus are unremarkable. The incisura shows no inflammation. The stomach is healthy. The antrum is clear. The pylorus is normal.\nPOT C: The antrum and body look normal. The pylorus is clear. The stomach is unremarkable. The incisura is healthy. The fundus shows no abnormalities.\nPOT D: The fundus and incisura are in good condition. The body shows no lesions. The stomach is healthy. The antrum is normal. The pylorus is clear.\nPOT E: The body and antrum are clear. The fundus appears healthy. The stomach is normal. The incisura shows no abnormalities. The pylorus is unremarkable.",
"POT A: The fundus and antrum are clear. The stomach shows no issues. The body is healthy. The incisura is normal. The pylorus is clear.\nPOT B: The incisura and body are unremarkable. The stomach looks healthy. The antrum is clear. The pylorus appears normal. The fundus is intact.\nPOT C: The body and pylorus are normal. The incisura shows no inflammation. The stomach is healthy. The antrum is clear. The fundus looks good.\nPOT D: The fundus and stomach appear normal. The antrum is healthy. The incisura is clear. The body shows no lesions. The pylorus is unremarkable.\nPOT E: The stomach and antrum look healthy. The body is in good shape. The incisura is normal. The pylorus is clear. The fundus shows no abnormalities."
)), class = "data.frame", row.names = c(NA, -5L)) |>
as_tibble()
terms <- tibble(tokens = c("stomach", "antrum", "incisura", "fundus", "pylorus", "body"))
potmergednew |>
rowid_to_column("id") |>
separate_longer_delim(macroscopic_description, delim = "\n") |>
separate_wider_regex(macroscopic_description, patterns = c(pot = "^POT \\w", ":\\s+", text = ".*")) |>
unnest_tokens(tokens, text) |>
# filter by terms
semi_join(terms) |>
# from long to collapsed wide; perhaps consider sort(tokens)
summarise(terms = str_c(tokens, collapse = ", "), .by = c(id, pot)) |>
pivot_wider(names_from = pot, values_from = terms)
#> Joining with `by = join_by(tokens)`
#> # A tibble: 5 × 6
#> id `POT A` `POT B` `POT C` `POT D` `POT E`
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 stomach, fundus, incisura, antrum, pylo… body, … fundus… pyloru… antrum…
#> 2 2 body, pylorus, incisura, stomach, antru… fundus… stomac… antrum… incisu…
#> 3 3 antrum, body, stomach, pylorus, incisur… stomac… pyloru… body, … fundus…
#> 4 4 stomach, pylorus, incisura, antrum, fun… body, … antrum… fundus… body, …
#> 5 5 fundus, antrum, stomach, body, incisura… incisu… body, … fundus… stomac…
Created on 2024-06-13 with reprex v2.1.0
Do you need to include or exclude partial matches? For example, should "POT A stomach intact" and terms <- c("mach") give a result with a match of stomach, a match of mach, or no match at all?
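For reference, whether a partial match counts depends on the pattern: a bare alternation also matches inside longer words, whereas wrapping it in \\b word boundaries restricts it to whole words. A small base R sketch comparing the two, using the toy string from the comment rather than the real data (the variable names `loose` and `strict` are made up for illustration):

```r
text  <- "POT A stomach intact"
terms <- c("mach")

# Bare pattern: also matches "mach" inside "stomach".
loose <- regmatches(text, gregexpr(paste(terms, collapse = "|"), text))[[1]]

# Word-bounded pattern: only whole-word matches count.
strict_pattern <- paste0("\\b(", paste(terms, collapse = "|"), ")\\b")
strict <- regmatches(text, gregexpr(strict_pattern, text, perl = TRUE))[[1]]

print(loose)   # the substring "mach"
print(strict)  # empty, since "mach" is not a whole word here
```

The tidytext approach splits the text into whole words before joining, so it behaves like the word-bounded variant.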