How to extract terms from a list for each paragraph of text

Problem

potmergednew <- structure(list(macroscopic_description = c(
  "POT A: The stomach and fundus are unremarkable. The incisura is slightly inflamed. The antrum shows mild erythema. The pylorus appears normal. The body is intact.\nPOT B: The body and pylorus are in good condition. The stomach exhibits no abnormalities. The incisura is clear. The antrum has a smooth appearance. The fundus shows no lesions.\nPOT C: The fundus and body appear healthy. The stomach lining is intact. The incisura is not inflamed. The antrum shows minor irritation. The pylorus is well-formed.\nPOT D: The pylorus and incisura look normal. The stomach has no visible issues. The antrum shows some redness. The fundus is clear of lesions. The body is unremarkable.\nPOT E: The antrum and body are in good shape. The stomach shows no signs of disease. The pylorus appears healthy. The incisura is normal. The fundus is clear.",
  "POT A: The body and pylorus are normal. The incisura is slightly inflamed. The stomach is in good shape. The antrum is healthy. The fundus is unremarkable.\nPOT B: The fundus and incisura look normal. The pylorus is in good condition. The antrum shows minor redness. The body appears healthy. The stomach is intact.\nPOT C: The stomach and fundus are clear. The antrum has no lesions. The body shows no abnormalities. The incisura is unremarkable. The pylorus is well-formed.\nPOT D: The antrum and pylorus are in good shape. The stomach lining is healthy. The incisura shows no inflammation. The body appears normal. The fundus is clear.\nPOT E: The incisura and body look normal. The antrum shows mild erythema. The pylorus is clear. The stomach is unremarkable. The fundus is healthy.",
  "POT A: The antrum and body are clear. The stomach is unremarkable. The pylorus appears normal. The incisura is not inflamed. The fundus is in good condition.\nPOT B: The stomach and incisura look healthy. The antrum is normal. The body is clear. The pylorus shows no abnormalities. The fundus is intact.\nPOT C: The pylorus and stomach are in good condition. The body appears healthy. The incisura is slightly inflamed. The antrum is clear. The fundus shows no lesions.\nPOT D: The body and incisura are unremarkable. The stomach has no visible issues. The antrum is healthy. The fundus looks normal. The pylorus is clear.\nPOT E: The fundus and body appear normal. The stomach is clear. The antrum shows no abnormalities. The incisura is healthy. The pylorus looks good.",
  "POT A: The stomach and pylorus are clear. The incisura appears normal. The antrum is in good shape. The fundus shows no lesions. The body is healthy.\nPOT B: The body and fundus are unremarkable. The incisura shows no inflammation. The stomach is healthy. The antrum is clear. The pylorus is normal.\nPOT C: The antrum and body look normal. The pylorus is clear. The stomach is unremarkable. The incisura is healthy. The fundus shows no abnormalities.\nPOT D: The fundus and incisura are in good condition. The body shows no lesions. The stomach is healthy. The antrum is normal. The pylorus is clear.\nPOT E: The body and antrum are clear. The fundus appears healthy. The stomach is normal. The incisura shows no abnormalities. The pylorus is unremarkable.",
  "POT A: The fundus and antrum are clear. The stomach shows no issues. The body is healthy. The incisura is normal. The pylorus is clear.\nPOT B: The incisura and body are unremarkable. The stomach looks healthy. The antrum is clear. The pylorus appears normal. The fundus is intact.\nPOT C: The body and pylorus are normal. The incisura shows no inflammation. The stomach is healthy. The antrum is clear. The fundus looks good.\nPOT D: The fundus and stomach appear normal. The antrum is healthy. The incisura is clear. The body shows no lesions. The pylorus is unremarkable.\nPOT E: The stomach and antrum look healthy. The body is in good shape. The incisura is normal. The pylorus is clear. The fundus shows no abnormalities."
)), class = "data.frame", row.names = c(NA, -5L))

Desired result

I have a list of terms that I would like to extract, for example

terms <- c("stomach", "antrum", "incisura")

I would like to extract these terms according to the POT they appear in, into a separate column, so that, for example, the output for the first two rows would be

POT A stomach, fundus, incisura, antrum, pylorus, body
POT B body, pylorus, stomach, incisura, antrum, fundus
POT C fundus, body, stomach, incisura, antrum, pylorus
POT D pylorus, incisura, stomach, antrum, fundus, body
POT E antrum, body, stomach, pylorus

POT A body, pylorus, incisura, stomach, antrum, fundus
POT B fundus, incisura, pylorus, antrum, body, stomach
POT C stomach, fundus, antrum, body, incisura, pylorus
POT D antrum, pylorus, stomach, incisura, body, fundus
POT E incisura, body, pylorus, stomach, fundus

Attempts

I have tried using str_extract_all from stringr

potmergednew$macroscopic_descriptionClean <- sapply(
  str_extract_all(potmergednew$macroscopic_description, paste(terms, collapse = "|")),
  function(x) tolower(x)
)

but this only extracts the terms, without assigning them to their individual POTs.

Do you need to include or exclude partial matches? For example, should POT A stomach intact with terms<-("mach") yield a match for stomach, a match for mach, or no match at all?

margusl 13.06.2024 11:11

Is there any chance you could include a sample of your actual dataset? For example, the output of dput(head(potmergednew[,"macroscopic_description", drop = F]))

margusl 13.06.2024 12:27

I have added it now

Sebastian Zeki 13.06.2024 12:42

Why should Body, Pylorus and Fundus be present in the output? Is the desired result a separate data frame or a vector with the combined matches from all POTs, so that, e.g., POT A of each row of the input data frame ends up in one row of the output?

Andre Wildberg 13.06.2024 13:09

Ahh, sorry, yes, you're right. I'll correct my mistake. Thanks

Sebastian Zeki 13.06.2024 13:14
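
On the partial-match question raised in the comments: a pattern built with paste(terms, collapse = "|") matches the terms anywhere, including inside longer words. A minimal base-R sketch of the difference, assuming whole-word matching is what is wanted; the input string x is made up for illustration and is not part of the question's data.

```r
# Hypothetical one-line input, used only for illustration
x <- "POT A: The stomach is intact. The stomachache subsided."
terms <- c("stomach", "antrum", "incisura")

# Plain alternation: also matches "stomach" inside "stomachache"
loose <- paste(terms, collapse = "|")

# Alternation wrapped in word boundaries: whole words only
strict <- paste0("\\b(", paste(terms, collapse = "|"), ")\\b")

loose_hits  <- regmatches(x, gregexpr(loose, x, perl = TRUE))[[1]]
strict_hits <- regmatches(x, gregexpr(strict, x, perl = TRUE))[[1]]

print(loose_hits)
print(strict_hits)
```

With the boundaries in place, a term such as "mach" would no longer match inside "stomach" either.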

Answers: 3

Here is one possible solution, though I am not sure it is what you need. The extracted terms are added to a new column.

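library(dplyr)
library(stringr)

# `data` is the answerer's own example frame with a `Description` column (see the output below)
data |> 
  mutate(
    Extracted_Terms = sapply(str_extract_all(Description, paste(terms, collapse = "|")), 
                       function(x) paste(x, collapse = ", "))
  )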

                Description        Extracted_Terms
1      POT A stomach intact          stomach
2    POT B antrum available           antrum
3 POT C antrum and incisura antrum, incisura
4             POT D Kidneys 

Thanks. The input is text separated by newline characters, not separate strings or rows in a data frame, so unfortunately this does not produce the specified output.

Sebastian Zeki 13.06.2024 12:02
Accepted answer

With str_extract_all and strsplit

terms <- c("stomach", "antrum", "incisura", "fundus", "pylorus", "body")

library(stringr)

potmergednew$terms <- 
  sapply(strsplit(potmergednew$macroscopic_description, "\n"), \(x) 
    paste(sapply(x, \(y) 
      unlist(str_extract_all(y, paste0("POT...|", terms, collapse = "|")))), 
      collapse = " "))

Output

potmergednew$terms
[1] "POT A: stomach fundus incisura antrum pylorus body POT B: body pylorus stomach incisura antrum fundus POT C: fundus body stomach incisura antrum pylorus POT D: pylorus incisura stomach antrum fundus body POT E: antrum body stomach pylorus incisura fundus"
[2] "POT A: body pylorus incisura stomach antrum fundus POT B: fundus incisura pylorus antrum body stomach POT C: stomach fundus antrum body incisura pylorus POT D: antrum pylorus stomach incisura body fundus POT E: incisura body antrum pylorus stomach fundus"
[3] "POT A: antrum body stomach pylorus incisura fundus POT B: stomach incisura antrum body pylorus fundus POT C: pylorus stomach body incisura antrum fundus POT D: body incisura stomach antrum fundus pylorus POT E: fundus body stomach antrum incisura pylorus"
[4] "POT A: stomach pylorus incisura antrum fundus body POT B: body fundus incisura stomach antrum pylorus POT C: antrum body pylorus stomach incisura fundus POT D: fundus incisura body stomach antrum pylorus POT E: body antrum fundus stomach incisura pylorus"
[5] "POT A: fundus antrum stomach body incisura pylorus POT B: incisura body stomach antrum pylorus fundus POT C: body pylorus incisura stomach antrum fundus POT D: fundus stomach antrum incisura body pylorus POT E: stomach antrum body incisura pylorus fundus"
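
If the comma-separated layout from the question is preferred, the same split-then-extract idea can be sketched in base R. This is only an illustration under stated assumptions: extract_by_pot and demo are hypothetical names, the label is taken to be everything before the first colon of each paragraph, and the terms are matched as whole words, case-sensitively.

```r
terms <- c("stomach", "antrum", "incisura", "fundus", "pylorus", "body")
term_pattern <- paste0("\\b(", paste(terms, collapse = "|"), ")\\b")

# Hypothetical helper: summarise one multi-paragraph description
extract_by_pot <- function(description) {
  # One element per paragraph ("POT A: ...", "POT B: ...", ...)
  paragraphs <- strsplit(description, "\n", fixed = TRUE)[[1]]
  # Label = text before the first colon, e.g. "POT A"
  labels <- sub(":.*$", "", paragraphs)
  # All whole-word term matches per paragraph, in order of appearance
  hits <- regmatches(paragraphs, gregexpr(term_pattern, paragraphs, perl = TRUE))
  found <- vapply(hits, paste, character(1), collapse = ", ")
  # One "label: terms" line per paragraph
  paste(paste0(labels, ": ", found), collapse = "\n")
}

# Shortened, made-up input in the same shape as the question's data
demo <- "POT A: The stomach and fundus are clear.\nPOT B: The antrum is normal. The body is intact."
cat(extract_by_pot(demo), "\n")
```

Applied to the real data, this would be something like vapply(potmergednew$macroscopic_description, extract_by_pot, character(1), USE.NAMES = FALSE), which keeps one string per input row.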

With the tidytext tokenizer, something like this is possible:

library(tidyverse)
library(tidytext)

potmergednew <- structure(list(macroscopic_description = c(
  "POT A: The stomach and fundus are unremarkable. The incisura is slightly inflamed. The antrum shows mild erythema. The pylorus appears normal. The body is intact.\nPOT B: The body and pylorus are in good condition. The stomach exhibits no abnormalities. The incisura is clear. The antrum has a smooth appearance. The fundus shows no lesions.\nPOT C: The fundus and body appear healthy. The stomach lining is intact. The incisura is not inflamed. The antrum shows minor irritation. The pylorus is well-formed.\nPOT D: The pylorus and incisura look normal. The stomach has no visible issues. The antrum shows some redness. The fundus is clear of lesions. The body is unremarkable.\nPOT E: The antrum and body are in good shape. The stomach shows no signs of disease. The pylorus appears healthy. The incisura is normal. The fundus is clear.",
  "POT A: The body and pylorus are normal. The incisura is slightly inflamed. The stomach is in good shape. The antrum is healthy. The fundus is unremarkable.\nPOT B: The fundus and incisura look normal. The pylorus is in good condition. The antrum shows minor redness. The body appears healthy. The stomach is intact.\nPOT C: The stomach and fundus are clear. The antrum has no lesions. The body shows no abnormalities. The incisura is unremarkable. The pylorus is well-formed.\nPOT D: The antrum and pylorus are in good shape. The stomach lining is healthy. The incisura shows no inflammation. The body appears normal. The fundus is clear.\nPOT E: The incisura and body look normal. The antrum shows mild erythema. The pylorus is clear. The stomach is unremarkable. The fundus is healthy.",
  "POT A: The antrum and body are clear. The stomach is unremarkable. The pylorus appears normal. The incisura is not inflamed. The fundus is in good condition.\nPOT B: The stomach and incisura look healthy. The antrum is normal. The body is clear. The pylorus shows no abnormalities. The fundus is intact.\nPOT C: The pylorus and stomach are in good condition. The body appears healthy. The incisura is slightly inflamed. The antrum is clear. The fundus shows no lesions.\nPOT D: The body and incisura are unremarkable. The stomach has no visible issues. The antrum is healthy. The fundus looks normal. The pylorus is clear.\nPOT E: The fundus and body appear normal. The stomach is clear. The antrum shows no abnormalities. The incisura is healthy. The pylorus looks good.",
  "POT A: The stomach and pylorus are clear. The incisura appears normal. The antrum is in good shape. The fundus shows no lesions. The body is healthy.\nPOT B: The body and fundus are unremarkable. The incisura shows no inflammation. The stomach is healthy. The antrum is clear. The pylorus is normal.\nPOT C: The antrum and body look normal. The pylorus is clear. The stomach is unremarkable. The incisura is healthy. The fundus shows no abnormalities.\nPOT D: The fundus and incisura are in good condition. The body shows no lesions. The stomach is healthy. The antrum is normal. The pylorus is clear.\nPOT E: The body and antrum are clear. The fundus appears healthy. The stomach is normal. The incisura shows no abnormalities. The pylorus is unremarkable.",
  "POT A: The fundus and antrum are clear. The stomach shows no issues. The body is healthy. The incisura is normal. The pylorus is clear.\nPOT B: The incisura and body are unremarkable. The stomach looks healthy. The antrum is clear. The pylorus appears normal. The fundus is intact.\nPOT C: The body and pylorus are normal. The incisura shows no inflammation. The stomach is healthy. The antrum is clear. The fundus looks good.\nPOT D: The fundus and stomach appear normal. The antrum is healthy. The incisura is clear. The body shows no lesions. The pylorus is unremarkable.\nPOT E: The stomach and antrum look healthy. The body is in good shape. The incisura is normal. The pylorus is clear. The fundus shows no abnormalities."
)), class = "data.frame", row.names = c(NA, -5L)) |>
  as_tibble()

terms <- tibble(tokens = c("stomach", "antrum", "incisura", "fundus", "pylorus", "body"))

potmergednew |>
  rowid_to_column("id") |>
  separate_longer_delim(macroscopic_description, delim = "\n") |>
  separate_wider_regex(macroscopic_description, patterns = c(pot = "^POT \\w", ":\\s+", text = ".*")) |>
  unnest_tokens(tokens, text) |>
  # filter by terms
  semi_join(terms) |>
  # from long to collapsed wide; perhaps consider sort(tokens)
  summarise(terms = str_c(tokens, collapse = ", "), .by = c(id, pot)) |>
  pivot_wider(names_from = pot, values_from = terms)
#> Joining with `by = join_by(tokens)`
#> # A tibble: 5 × 6
#>      id `POT A`                                  `POT B` `POT C` `POT D` `POT E`
#>   <int> <chr>                                    <chr>   <chr>   <chr>   <chr>  
#> 1     1 stomach, fundus, incisura, antrum, pylo… body, … fundus… pyloru… antrum…
#> 2     2 body, pylorus, incisura, stomach, antru… fundus… stomac… antrum… incisu…
#> 3     3 antrum, body, stomach, pylorus, incisur… stomac… pyloru… body, … fundus…
#> 4     4 stomach, pylorus, incisura, antrum, fun… body, … antrum… fundus… body, …
#> 5     5 fundus, antrum, stomach, body, incisura… incisu… body, … fundus… stomac…

Created on 2024-06-13 with reprex v2.1.0
