How do I take a cell in Excel, which has text that is hyperlinked, and extract the hyperlink part?
- 
                    Do you just need to identify if a cell contains a hyperlink? – hrbrmstr Jun 10 '14 at 20:07
- 
                    1Can you read in the file and use gsub() or a variant to locate and extract the hyperlinks? – lawyeR Jun 10 '14 at 20:14
- 
                    With `xlsx`, once you have something akin to a `cells` variable after a call to `getCells(…)`, then you can do a `getCallValue` setting `encoding` to `"unknown"` and then use various other R code to test if it's a hyperlink. – hrbrmstr Jun 10 '14 at 20:21
- 
                    I submitted this question as I was running out the door. Perhaps I could have done a better job. I know which column contains hyperlinks. When I read the Excel file, all I get is the text, but I need to extract the address/URL. – rrs Jun 10 '14 at 20:40
- 
                    You can use the regex from [this SO thread](http://stackoverflow.com/a/163684/1457051) to find and extract pretty much any URI/URL. – hrbrmstr Jun 10 '14 at 20:50
- 
                    what do you mean by "I need to extract the address"? How is that different from getting the text? Are you looking for `browseURL`, `download.file`, or `RCurl::getURL`? – GSee Jun 10 '14 at 22:32
- 
                    @GSee no, the Excel file contains hyperlinks in one of the columns, meaning there's text that you can click on which will take you to a file or URL. I need to extract this address in R, but I'm only getting whatever is visible when you open the file in Excel. – rrs Jun 10 '14 at 22:43
- 
                    This just bit me. With @hrbrmstr 's comments, I was able to come up with a solution that needs paring down, but is a complete reprex. It is here: https://stackoverflow.com/q/70013136/1022967 – mpettis Nov 18 '21 at 00:25
2 Answers
I found a super convoluted way to extract the hyperlinks:
library(XML)
# rename file to .zip
my.zip.file <- sub("xlsx", "zip", my.excel.file)
file.copy(from = my.excel.file, to = my.zip.file)
# unzip the file
unzip(my.zip.file)
# unzipping produces a bunch of files which we can read using the XML package
# assume sheet1 has our data
xml <- xmlParse("xl/worksheets/sheet1.xml")
# finally grab the hyperlinks
hyperlinks <- xpathApply(xml, "//x:hyperlink/@display", namespaces="x")
Derived from this blogpost.
 
    
    - 9,615
- 4
- 28
- 38
- 
                    1(a) great find; (b) not so convoluted. [RExcelXML](http://www.omegahat.org/RExcelXML/) does something similar. Don't forget to mark your own question as answered! – hrbrmstr Jun 11 '14 at 00:37
- 
                    2It's clear from your comments what packages you're using, but you should still include the appropriate preamble (*i.e.* `library(XML)`) in your answer. – Hugh Jun 11 '14 at 01:48
- 
                    
- 
                    1my.excel.file = "your_excel_file.xlsx" at that moment he hasn´t read the excel file yet, just created a character vector with the same name of his excel file, but with .zip as the file extension. Then he creates a copy of its excel file but to the excel file wit the .zip extension, then he unzips it and then he reads sheet1 as an xml with xml <- xmlParse("xl/worksheets/sheet1.xml") – csmontt Jan 19 '18 at 21:04
Inspired by @rrs's post above, I put together a function to extract all hyperlinks in a workbook with reference (sheet!A1 format), with which you can look up the hyperlink in any cell.
**Some notes explain the snippet below: **
-- Hyperlinks are stored in two files after unzipping the excel file, (a) xl/worksheets/sheet1.xml and xl/worksheets/_rels/sheet1.xml.res. While the xml one has cell position to rId (relationship id?) table, the xml.res one has the actually rId and hyperlinks table
-- embedded function read_relationships parse XML and join them
-- temp_base_dir is used host the file operation
-- map_df stack hyperlinks from all tabs together
-- Output dataframe has 5 columns. i.e. id (relationship id), target (the hyperlink), ref (cell reference in sheet!A1 format), tab_idx (sheet index), tab (sheet name)
Snippet:
library(tidyverse)
library(XML)
extract_hyperlinks_from_excel <- function(aExcelFile, aRefOutputFile = NULL){
  sheets <- readxl::excel_sheets(aExcelFile)
  
  read_relationships <- function(aSheetIndex){
    filename <- file.path(tmp_base_dir, 'xl', 'worksheets', '_rels', paste0('sheet', aSheetIndex, '.xml.rels'))    
    rel <- xmlParse(filename)
    rel <- xmlToList(rel)
    rel <- purrr::map_dfr(rel, as.list)
    rel <- rel[, c('Id', 'Target')]
    names(rel) <- c('id', 'target')
    if(nrow(rel) == 0){
      return(NULL)
    }
    
    filename <- file.path(tmp_base_dir, 'xl', 'worksheets', paste0('sheet', aSheetIndex, '.xml'))    
    pos <- xmlParse(filename)
    pos <- xmlToList(pos)
    if(is.null(pos$hyperlinks)){
      return(NULL)
    }
    
    pos <- purrr::map_dfr(pos$hyperlinks, as.list)
    pos <- pos[, c('ref', 'id')]
    
    ret <- inner_join(rel, pos, by = 'id')
    ret$tab_idx <- aSheetIndex
    
    return(ret)
  }
  
  EXCEL_TEMP_NAME <- 'unzipped_excel'
  
  tmp_base_dir <- file.path(tempdir(),
                            paste0('tmpexcl', 
                                   as.character(round(runif(1, 1000000000000, 9999999999999)))))
  dir.create(tmp_base_dir)
  on.exit(unlink(tmp_base_dir))
  
  zipfile <- file.path(tmp_base_dir, paste0(EXCEL_TEMP_NAME, '.zip'))
  file.copy(from = aExcelFile, to = zipfile)
  unzip(zipfile, exdir = tmp_base_dir)
  
  ret <- map_df(seq_along(sheets), read_relationships)
  
  ret %>% 
    mutate(tab = sheets[tab_idx]) %>% 
    mutate(ref = paste0("'", tab, "'!", ref)) %>% 
    select(id,tab_idx, tab, ref, target)  ->
    ret
  if(!is.null(aRefOutputFile)){
    write_csv(ret, aRefOutputFile)
  }
  
  return(ret)
}
 
    
    - 48
- 6
- 
                    
- 
                    I tried this with [my own use case](https://stackoverflow.com/questions/62506660) and the map_df function returned `Error: XML content does not seem to be XML: '' `. – NotReallyHere12 Jan 13 '22 at 19:47
- 
                    What is the version of your Excel file. The code works on XLSX format. – Sean Yang Mar 02 '22 at 15:34
- 
                    
- 
                    @NotReallyHere12. It seems the error is related to either hidden sheet in Excel or temporary folder. I updated the code. Hope it works on your ends. – Sean Yang Mar 10 '22 at 16:56
- 
                    
- 
                    @AuronusBen: XLS is binary file which can not be parsed, whereas XLSX is XML/zip which is text based. – Sean Yang Nov 18 '22 at 14:15
- 
                    @SeanYang Thanks for this clarification. So there's no way to extract hyperlinks from a XLS file? – AnonX Nov 21 '22 at 08:12
