Title: | Be Nice on the Web |
---|---|
Description: | Be responsible when scraping data from websites by following polite principles: introduce yourself, ask for permission, take slowly and never ask twice. |
Authors: | Dmytro Perepolkin [aut, cre] |
Maintainer: | Dmytro Perepolkin <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.3 |
Built: | 2024-12-19 04:23:20 UTC |
Source: | https://github.com/dmi3kno/polite |
Introduce yourself to the host
bow(
  url,
  user_agent = "polite R package",
  delay = 5,
  times = 3,
  force = FALSE,
  verbose = FALSE,
  ...
)

is.polite(x)
url | URL |
user_agent | character value passed to user agent string |
delay | desired delay between scraping attempts, in seconds. Final value will be the maximum of desired and mandated delay, as stipulated by robots.txt |
times | number of times to attempt scraping. Default is 3. |
force | refresh all memoised functions. Clears up |
verbose | TRUE/FALSE |
... | other curl parameters wrapped into |
x | object of class polite, session |
object of class polite, session
library(polite)

host <- "https://www.cheese.com"
session <- bow(host)
session
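The rule stated for the delay argument (the final value is the larger of the desired delay and the delay mandated by the host) can be sketched in base R. The helper name effective_delay is hypothetical and not part of polite; it only illustrates the arithmetic, assuming a missing mandated delay is represented as NA.

```r
# Hypothetical helper, NOT part of the polite package.
# Illustrates the documented rule: final delay = max(desired, mandated).
effective_delay <- function(desired, mandated = NA_real_) {
  # na.rm = TRUE falls back to the desired delay when the host
  # does not mandate one (assumed here to be represented as NA)
  max(desired, mandated, na.rm = TRUE)
}

effective_delay(desired = 5, mandated = 10)  # host asks for a longer delay
effective_delay(desired = 5)                 # host mandates nothing
```

Under this sketch a host can only lengthen the wait between requests, never shorten it.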
Guess download file name from the URL
guess_basename(x)
x | URL to guess the basename from |
guessed file name
guess_basename("https://bit.ly/polite_sticker")
Convert collection of html nodes into data frame
html_attrs_dfr(
  x,
  attrs = NULL,
  trim = FALSE,
  defaults = NA_character_,
  add_text = TRUE
)
x | |
attrs | character vector of attribute names. If missing, all attributes will be used |
trim | if |
defaults | character vector of default values to be passed to |
add_text | if |
data frame with one row per xml node, consisting of an html_text column with text and additional columns with attributes
library(polite)
library(rvest)

bow("https://en.wikipedia.org/wiki/List_of_cognitive_biases") %>%
  scrape() %>%
  html_nodes("tr td:nth-child(1) a") %>%
  html_attrs_dfr()
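To try html_attrs_dfr() without touching the network, the same call can be run on an inline HTML fragment. This is a minimal sketch: the HTML snippet is made up for illustration, and it assumes the function accepts any node set returned by rvest::html_nodes().

```r
library(polite)
library(rvest)

# Made-up HTML fragment used only for illustration (no request is sent)
page <- read_html(
  '<ul>
     <li><a href="/alpha" title="First">Alpha</a></li>
     <li><a href="/beta">Beta</a></li>
   </ul>'
)

links <- html_nodes(page, "a")

# One row per node; the second link has no title attribute,
# so that cell is filled from `defaults`
res <- html_attrs_dfr(links, attrs = c("href", "title"))
res
```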
Agree on a modification of the session path with the host
nod(bow, path, verbose = FALSE)
bow | object of class polite, session |
path | string value of path/URL to follow. The function accepts either a path (string part of URL following domain name) or a full URL |
verbose | |
object of class polite, session with modified URL
library(polite)

host <- "https://www.cheese.com"
session <- bow(host) %>%
  nod(path = "by_type")
session
Give your web-scraping function good manners polite
politely(
  fun,
  user_agent = paste0("polite ", getOption("HTTPUserAgent"), " bot"),
  robots = TRUE,
  force = FALSE,
  delay = 5,
  verbose = FALSE,
  cache = memoise::cache_memory()
)
fun | function to be turned "polite". Must contain an argument named |
user_agent | optional, user agent string to be used. Defaults to |
robots | optional, should robots.txt be consulted for permissions. Default is TRUE |
force | whether or not to force a fresh download of robots.txt |
delay | minimum delay in seconds, not less than 1. Default is 5. |
verbose | output more information about querying process |
cache | memoise cache function for storing results. Default |
polite function
polite_GET <- politely(httr::GET)
Print host introduction object
## S3 method for class 'polite'
print(x, ...)
x | object of class polite, session |
... | other parameters passed to methods |
Polite file download
rip(
  bow,
  destfile = NULL,
  ...,
  mode = "wb",
  path = tempdir(),
  overwrite = FALSE
)
bow | host introduction object of class polite, session |
destfile | optional new file name to use when saving the file. If missing, it will be guessed from basename(url) |
... | other parameters passed to |
mode | character. The mode with which to write the file. Useful values are |
path | character. Path where the destfile is saved. The default is the temporary directory created with tempdir() |
overwrite | if |
Full path to the locally saved file indicated by the user in destfile (and path)
library(polite)

bow("https://en.wikipedia.org/") %>%
  nod("wiki/Flag_of_the_United_States#/media/File:Flag_of_the_United_States.svg") %>%
  rip()
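The way destfile and path combine into the returned location can be sketched with base R alone. The helper resolve_destination and the example.com URL are hypothetical; the sketch follows the argument descriptions (destfile guessed from basename(url) when missing, saved under path) and is not the package's actual implementation, which may guess the name differently.

```r
# Hypothetical helper, NOT part of the polite package.
# Mirrors the documented defaults: destfile = NULL, path = tempdir().
resolve_destination <- function(url, destfile = NULL, path = tempdir()) {
  if (is.null(destfile)) {
    # Fall back to the last path segment of the URL
    destfile <- basename(url)
  }
  file.path(path, destfile)
}

# Name guessed from the URL, saved in the session's temporary directory
resolve_destination("https://example.com/files/report.pdf")

# Explicit file name and target directory
resolve_destination(
  "https://example.com/files/report.pdf",
  destfile = "copy.pdf",
  path = "downloads"
)
```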
Scrape the content of authorized page/API
scrape(
  bow,
  query = NULL,
  params = NULL,
  accept = "html",
  content = NULL,
  verbose = FALSE
)
bow | host introduction object of class polite, session |
query | named list of parameters to be appended to URL in the format |
params | deprecated. Use query instead. |
accept | character value of expected data type to be returned by host (e.g. |
content | MIME type (aka internet media type) used to override the content type returned by the server. See http://en.wikipedia.org/wiki/Internet_media_type for a list of common types. You can add the |
verbose | extra feedback from the function. Defaults to FALSE |
Object of class httr::response which can be further processed by functions in the rvest package
library(polite)
library(rvest)

bow("https://en.wikipedia.org/wiki/List_of_cognitive_biases") %>%
  scrape(content = "text/html; charset=UTF-8") %>%
  html_nodes(".wikitable") %>%
  html_table()
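The query argument takes a named list that ends up appended to the URL. As a rough illustration of what that means, the sketch below uses httr::modify_url() to turn a named list into a query string. Whether scrape() builds its request URL in exactly this way is an assumption, and the example.com address and parameter names are placeholders.

```r
library(httr)

# Placeholder parameters; names become keys, values become values
params <- list(q = "cheddar", page = 2)

# modify_url() appends the named list as a URL query string
url_with_query <- modify_url("https://example.com/search", query = params)
url_with_query
```

With a polite session, the equivalent list would be passed as scrape(session, query = params).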
Reset scraping/ripping rate limit
set_scrape_delay(delay)

set_rip_delay(delay)
delay | Delay between subsequent requests. The package default is 5 seconds. It can be set lower only if a custom user-agent string is specified. |
Updates the rate-limit property of the scrape and rip functions, respectively.
library(polite)

host <- "https://www.cheese.com"
session <- bow(host)
session
Creates a collection of polite functions for scraping and downloading
use_manners(save_as = "R/polite-scrape.R", open = TRUE)
save_as | File where the functions should be created. Defaults to "R/polite-scrape.R" |
open | if |