Related to but different from a previous question of mine, Extracting p within h1 with Python/Scrapy, I've come across a situation where Scrapy (for Python) will not extract a span tag within an h4 tag.
Example HTML is:
<div class="event-specifics">
 <div class="event-location">
  <h3>   Gourmet Matinee </h3>
  <h4>
   <span id="spanEventDetailPerformanceLocation">Knight Grove</span>
  </h4>
</div>
</div>
I'm attempting to grab the text "Knight Grove" within the span tags. When using scrapy shell on the command line,
response.xpath('.//div[@class="event-location"]//span//text()').extract()
returns:
['Knight Grove']
And
response.xpath('.//div[@class="event-location"]/node()').extract()
returns the entire node, viz:
['\n                    ', '<h3>\n                        Gourmet Matinee</h3>', '\n                    ', '<h4><span id="spanEventDetailPerformanceLocation"><p>Knight Grove</p></span></h4>', '\n                ']
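The first expression can also be checked offline against the sample markup above, which rules out the XPath itself as the problem. This is only a minimal sketch: it assumes the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors (its XPath subset has no text() step, so the text is read from the matched element instead), and extract_locations is a hypothetical helper name, not part of Scrapy.

```python
import xml.etree.ElementTree as ET

# Sample markup copied from the question (the variant without a <p> tag).
SAMPLE_HTML = """
<div class="event-specifics">
 <div class="event-location">
  <h3>   Gourmet Matinee </h3>
  <h4>
   <span id="spanEventDetailPerformanceLocation">Knight Grove</span>
  </h4>
 </div>
</div>
"""


def extract_locations(markup):
    """Return the text of every span under div.event-location (hypothetical helper)."""
    root = ET.fromstring(markup.strip())
    # Same path as in the question, minus the trailing //text() step,
    # which ElementTree's limited XPath support does not understand.
    spans = root.findall('.//div[@class="event-location"]//span')
    # itertext() also picks up text in nested children such as <p>.
    return ["".join(span.itertext()).strip() for span in spans]


locations = extract_locations(SAMPLE_HTML)
print(locations)  # ['Knight Grove']
```

So with static markup like this, a span inside an h4 is matched without trouble.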
BUT, when the same XPath is run within a spider, nothing is returned. Take for instance the following spider code, written to scrape the page from which the above sample HTML was taken, https://www.clevelandorchestra.com/17-blossom--summer/1718-gourmet-matinees/2017-07-11-gourmet-matinee/. (Some of the code has been removed since it doesn't relate to the question):
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from concertscraper.items import Concert
from scrapy.contrib.loader import XPathItemLoader
from scrapy import Selector
from scrapy.http import XmlResponse
class ClevelandOrchestra(CrawlSpider):
    name = 'clev2'
    allowed_domains = ['clevelandorchestra.com']
    start_urls = ['https://www.clevelandorchestra.com/']
    rules = (
        Rule(LinkExtractor(allow=''), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        thisconcert = ItemLoader(item=Concert(), response=response)
        for concert in response.xpath('.//div[@class="event-wrap"]'):
            thisconcert.add_xpath('location', './/div[@class="event-location"]//span//text()')
        return thisconcert.load_item()
This returns no item['location']. I've also tried:
thisconcert.add_xpath('location','.//div[@class="event-location"]/node()')
Unlike in the question above regarding p within h, span tags are permitted within h tags in HTML, unless I am mistaken?
For clarity, the 'location' field is defined within the Concert() object, and I have all pipelines disabled in order to troubleshoot.
Is it possible that a span within an h4 is in some way invalid HTML? If not, what could be causing this?
Interestingly, going about the same task using add_css(), like this:
thisconcert.add_css('location','.event-location')
yields a node with the span tags present but the internal text missing:
['<div class="event-location">\r\n'
          '                    <h3>\r\n'
          '                        BLOSSOM MUSIC FESTIVAL </h3>\r\n'
          '                    <h4><span '
          'id="spanEventDetailPerformanceLocation"></span></h4>\r\n'
          '                </div>']
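To put the two outputs side by side, one can test whether the span carries any text in the markup that was actually received. The following is a rough sketch only: it uses the standard library's html.parser rather than Scrapy, the two fragments are abbreviated from the outputs quoted in this question, and SpanTextCollector and span_text are hypothetical helper names.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collects text inside the element with a given id (hypothetical helper).

    Limitation: unclosed void tags such as <br> inside the target would
    throw off the depth counter; the fragments here contain none.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # 0 means we are outside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def span_text(markup, target_id="spanEventDetailPerformanceLocation"):
    """Return the stripped text found inside the element with target_id."""
    parser = SpanTextCollector(target_id)
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks).strip()


# Fragment as seen in scrapy shell vs. fragment as seen by add_css() in the spider.
shell_fragment = '<h4><span id="spanEventDetailPerformanceLocation"><p>Knight Grove</p></span></h4>'
spider_fragment = '<h4><span id="spanEventDetailPerformanceLocation"></span></h4>'

print(repr(span_text(shell_fragment)))   # 'Knight Grove'
print(repr(span_text(spider_fragment)))  # ''
```

In other words, in the spider's case the span element is there but arrives empty, so the difference appears to be in the markup received rather than in the selector.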
To confirm this is not a duplicate: it is true that in this particular example there is a p tag inside a span tag, which is itself inside the h4 tag; however, the same behavior occurs when no p tag is involved, such as at: https://www.clevelandorchestra.com/1718-concerts-pdps/1718-rental-concerts/1718-rentals-other/2017-07-21-cooper-competition/?performanceNumber=16195.
