Extracting text from a web scrape

I am trying to get text from a website. My code works (sort of):

for (i in 1:no_urls) {
  this_url=urls_meetings[[i]]
  page=read_html(this_url)
  
  text=page |> html_elements("body") |> html_text2()
  text_date=text[1]
  date<- str_extract(text_date, "\\b\\w+ \\d{1,2}, \\d{4}\\b")
  # Convert the abbreviated month name to its full form
  date_str <- gsub("^(.*)\\s(\\d{1,2}),\\s(\\d{4})$", "\\1 \\2, \\3", date)

  # Convert to Date object
  date <- mdy(date_str)
  date_1=as.character(date)
  date_1=gsub("-", "", date_1)


  text=text[2]
  statements_list[[i]]=text
  names(statements_list)[i] <- date_1

}

The problem is the output of the line

text=page |> html_elements("body") |> html_text2()

which gives me all of the text on the page:

[1] "\r \r\r \r\nRelease Date: January 29, 2003\r\n\n\n\n\n\r For immediate release\r\n\n\r\n\n\r\r\n\n\r The Federal Open Market Committee decided today to keep its target for the federal funds rate unchanged at 1-1/4 percent. \r\n\n\r Oil price premiums and other aspects of geopolitical risks have reportedly fostered continued restraint on spending and hiring by businesses. However, the Committee believes that as those risks lift, as most analysts expect, the accommodative stance of monetary policy, coupled with ongoing growth in productivity, will provide support to an improving economic climate over time.\r\n\n\r In these circumstances, the Committee believes that, against the background of its long-run goals of price stability and sustainable economic growth and of the information currently available, the risks are balanced with respect to the prospects for both goals for the foreseeable future. \r\n\n\r Voting for the FOMC monetary policy action were Alan Greenspan, Chairman; William J. McDonough, Vice Chairman; Ben S. Bernanke, Susan S. Bies; J. Alfred Broaddus, Jr.; Roger W. Ferguson, Jr.; Edward M. Gramlich; Jack Guynn; Donald L. Kohn; Michael H. Moskow; Mark W. Olson, and Robert T. Parry. 
\r \r \r\n\n\r -----------------------------------------------------------------------------------------\r DO NOT REMOVE: Wireless Generation\r ------------------------------------------------------------------------------------------\r 2003 Monetary policy \r\n\nHome | News and \r events\nAccessibility\r\n\r Last update: January 29, 2003\r\r \r\n(function(){if (!document.body) return;var js = \"window['__CF$cv$params'] = {r:'8775c6b49a2a2015',t:'MTcxMzYyMjgzOC41MjIwMDA='};_cpo=document.createElement('script');_cpo.nonce='',_cpo.src='/cdn-cgi/challenge-platform/scripts/jsd/main.js',document.getElementsByTagName('head')[0].appendChild(_cpo);\";var _0xh = document.createElement('iframe');_0xh.height = 1;_0xh.width = 1;_0xh.style.position = 'absolute';_0xh.style.top = 0;_0xh.style.left = 0;_0xh.style.border = 'none';_0xh.style.visibility = 'hidden';document.body.appendChild(_0xh);function handler() {var _0xi = _0xh.contentDocument || _0xh.contentWindow.document;if (_0xi) {var _0xj = _0xi.createElement('script');_0xj.innerHTML = js;_0xi.getElementsByTagName('head')[0].appendChild(_0xj);}}if (document.readyState !== 'loading') {handler();} else if (window.addEventListener) {document.addEventListener('DOMContentLoaded', handler);} else {var prev = document.onreadystatechange || function () {};document.onreadystatechange = function (e) {prev(e);if (document.readyState !== 'loading') {document.onreadystatechange = prev;handler();}};}})();"

I need to keep only the relevant text. I have tried all sorts of things, such as

str_extract(text, "(?<=The Federal Open Market)(.*?)(?=Voting)")


 str_match(text, "The Federal Open Market(.*?)Voting")

but they all give me an empty result in return.

The ideal result is:

The Federal Open Market Committee decided today to keep its target for the federal funds rate unchanged at 1-1/4 percent. \r\n\n\r Oil price premiums and other aspects of geopolitical risks have reportedly fostered continued restraint on spending and hiring by businesses. However, the Committee believes that as those risks lift, as most analysts expect, the accommodative stance of monetary policy, coupled with ongoing growth in productivity, will provide support to an improving economic climate over time.\r\n\n\r In these circumstances, the Committee believes that, against the background of its long-run goals of price stability and sustainable economic growth and of the information currently available, the risks are balanced with respect to the prospects for both goals for the foreseeable future.

You need to define "relevant". You know what you mean. We don't.

— 20.04.2024 17:00

Could you state the desired result for this site? It does not include "Voting", so your regex cannot find a match.

— 20.04.2024 17:01

html r web-scraping rvest

20.04.2024 16:41

3 Answers

Accepted answer

The `.` character does not match newlines by default.

Your pattern fails because your string contains newlines. By definition, the . metacharacter matches any character except a newline.

Here is a shorter example:

txt <- "there are some\r\nwords here"
str_extract(txt, "some.+words")
# [1] NA

Overriding the defaults

To override the defaults in stringr::str_extract(), you need to use stringr::regex() with the appropriate option. In this case:

You can allow . to match everything, including \n, by setting dotall = TRUE:

str_extract(txt, regex("some.+words", dotall = TRUE))
# [1] "some\r\nwords"

Or, in the case of your string:

str_extract(text, regex("(?<=The Federal Open Market)(.*?)(?=Voting)", dotall = TRUE))  |> 
    trimws()
# [1] "Committee decided today to keep its target for the federal funds rate unchanged at 1-1/4 percent. \r\n\n\r Oil price premiums and other aspects of geopolitical risks have reportedly fostered continued restraint on spending and hiring by businesses. However, the Committee believes that as those risks lift, as most analysts expect, the accommodative stance of monetary policy, coupled with ongoing growth in productivity, will provide support to an improving economic climate over time.\r\n\n\r In these circumstances, the Committee believes that, against the background of its long-run goals of price stability and sustainable economic growth and of the information currently available, the risks are balanced with respect to the prospects for both goals for the foreseeable future."

I also piped the result to trimws() to remove leading and trailing whitespace.
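The ideal result in the question keeps the leading "The Federal Open Market", which the lookbehind above drops. A minimal sketch of one way to keep it: match the opening words literally and use only a lookahead for "Voting". The `text` value below is a shortened stand-in for the scraped string, not the real page.

```r
library(stringr)

# Shortened stand-in for the scraped page text (an assumption for illustration)
text <- paste0(
  "Release Date: January 29, 2003\r\n\n\r ",
  "The Federal Open Market Committee decided today.\r\n\n\r ",
  "Risks are balanced. \r\n\n\r ",
  "Voting for the FOMC monetary policy action were ..."
)

# Match the opening words themselves, then lazily everything up to "Voting";
# dotall = TRUE lets . cross the \r and \n characters
statement <- str_extract(
  text,
  regex("The Federal Open Market.*?(?=Voting)", dotall = TRUE)
) |>
  trimws()

statement
```

On a page that has no "Voting" paragraph, as one of the comments points out, this pattern still finds no match and returns NA, so the result should be checked with is.na() before it is stored.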

Other multiline options

If you want your regex anchors to work across multiple lines (rather than only changing the . metacharacter), you can use regex(pattern, multiline = TRUE):

For multiline strings, you can use regex(multiline = TRUE). This changes the behaviour of ^ and $, and introduces three new operators:
^ now matches the start of each line.
$ now matches the end of each line.
\A matches the start of the input.
\z matches the end of the input.
\Z matches the end of the input, but before the final line terminator, if it exists.

For more information, see the stringr documentation.
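The effect of multiline = TRUE on the anchors can be seen in a small sketch. The sample string and variable names are made up for illustration.

```r
library(stringr)

# Three lines separated by \n (sample data, not from the scraped page)
txt <- "alpha one\nbeta two\ngamma three"

# Default: ^ anchors only at the start of the whole input
first_default <- str_extract_all(txt, "^\\w+")[[1]]

# multiline = TRUE: ^ anchors at the start of every line
first_multiline <- str_extract_all(txt, regex("^\\w+", multiline = TRUE))[[1]]

# \A anchors at the start of the input even in multiline mode
first_input <- str_extract_all(txt, regex("\\A\\w+", multiline = TRUE))[[1]]

first_default    # "alpha"
first_multiline  # "alpha" "beta" "gamma"
first_input      # "alpha"
```

Note that multiline changes only the anchors; it does not make . match newlines, which is what dotall = TRUE is for.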

20.04.2024 17:06

Looking at the HTML, it seems you can extract just the table:

this_url <- "https://www.federalreserve.gov/boarddocs/press/general/2002/20020130/"
page=read_html(this_url)

text=page |> html_elements("table") |> html_text() |> trimws()

And simply get this:

[1] "The Federal Open Market Committee decided today to keep its target for the federal funds rate unchanged at 1-3/4 percent.\r\n\r\nSigns that weakness in demand is abating and economic activity is beginning to firm have become more prevalent. With the forces restraining the economy starting to diminish, and with the long-term prospects for productivity growth remaining favorable and monetary policy accommodative, the outlook for economic recovery has become more promising.\r\n\r\nThe degree of any strength in business capital and household spending, however, is still uncertain. Hence, the Committee continues to believe that, against the background of its long-run goals of price stability and sustainable economic growth and of the information currently available, the risks are weighted mainly toward conditions that may generate economic weakness in the foreseeable future.\r\n\r\n\r\n\r\n2002 Monetary policy \r\n\r\nHome |\r\nNews and events\r\nAccessibility\r\n\r\nLast update: January 30, 2002"

Or split it into paragraphs like this:

page |> html_elements("table p") |> html_text() |> trimws()
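To try the selector without hitting the live site, here is a minimal sketch on an inline document built with rvest::minimal_html(). The markup is a hypothetical stand-in for the page structure, not the real source.

```r
library(rvest)

# Hypothetical stand-in for the page structure (an assumption, not the real markup)
page <- minimal_html('
  <table><tr><td>
    <p>First paragraph of the statement.</p>
    <p>Second paragraph of the statement.</p>
    <p>Home | News and events</p>
  </td></tr></table>
')

# One character element per <p> inside the table
paragraphs <- page |> html_elements("table p") |> html_text() |> trimws()

# Drop the trailing navigation paragraph by position
statement <- head(paragraphs, -1)
statement
```

Dropping by position only works if the navigation block is always the last paragraph, which is worth checking on a few pages first.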

20.04.2024 17:22

Assuming the structure is more or less stable, you can specify which paragraphs to include or exclude by position.

library(rvest)
library(stringr)

# list of urls
urls_ <- c("https://www.federalreserve.gov/boarddocs/press/general/2002/20020130/")

collect_text <- function(url_){
  html <- read_html(url_)
  
  release_date <- 
    html_element(html, "body > font > i") |> 
    html_text() |> 
    str_split_i(": ", 2) |> 
    lubridate::mdy()
  
  text <- 
    # all p elements in td, except the last one
    html_elements(html, "td > p:not(p:last-of-type)") |> 
    html_text(trim = TRUE) |> 
    str_c(collapse = "") |>
    str_squish()

  # return named list
  list(release_date = release_date, text = text)
}

df <- 
  lapply(urls_, collect_text) |>
  dplyr::bind_rows()
df  
#> # A tibble: 1 × 2
#>   release_date text                                                             
#>   <date>       <chr>                                                            
#> 1 2002-01-30   The Federal Open Market Committee decided today to keep its targ…

str_view(df[1,"text"])
#> [1] │ The Federal Open Market Committee decided today to keep its target for the federal funds rate unchanged at 1-3/4 percent.Signs that weakness in demand is abating and economic activity is beginning to firm have become more prevalent. With the forces restraining the economy starting to diminish, and with the long-term prospects for productivity growth remaining favorable and monetary policy accommodative, the outlook for economic recovery has become more promising.The degree of any strength in business capital and household spending, however, is still uncertain. Hence, the Committee continues to believe that, against the background of its long-run goals of price stability and sustainable economic growth and of the information currently available, the risks are weighted mainly toward conditions that may generate economic weakness in the foreseeable future.

20.04.2024 17:32