Scrapy XMLFeedSpider: parsing the XML media:group tag - xml, scrapy

I'm having trouble with an XPath selector in Scrapy: I can't parse the media tag. Can anyone help with some ideas or sample code? Thanks. This is my spider:

import scrapy
from scrapy.spiders import XMLFeedSpider
from crawler.items import News


class CNNSpider(XMLFeedSpider):
    name = "cnn"
    start_urls = [
        "http://rss.cnn.com/rss/edition.rss",  # Top stories
        # "http://rss.cnn.com/rss/cnn_latest.rss",  # Most recent
    ]
    iterator = "iternodes"  # This is actually unnecessary, since it's the default value
    itertag = "item"

    def parse_node(self, response, node):
        # Called once for every <item> node in the feed
        item = News()
        item["title"] = node.xpath("./title/text()").extract()
        item["description"] = node.xpath("./description/text()").extract()
        item["link"] = node.xpath("./link/text()").extract()
        item["media"] = node.xpath("./media:group/media:content/@url").extract()
        item["pubDate"] = node.xpath("./pubDate/text()").extract()
        print(item["media"])

And my XML feed:

<item>
<title><![CDATA[More than 200 dead in Mexico quake, buildings toppled]]></title>
<link>http://www.cnn.com/collections/mexico-city-earthquake-intl/</link>
<guid isPermaLink="true">http://www.cnn.com/collections/mexico-city-earthquake-intl/</guid>
<pubDate>Wed, 20 Sep 2017 10:03:24 GMT</pubDate>
<media:group>
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-super-169.jpg" height="619" width="1100" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-large-11.jpg" height="300" width="300" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-vertical-large-gallery.jpg" height="552" width="414" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-video-synd-2.jpg" height="480" width="640" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-live-video.jpg" height="324" width="576" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-t1-main.jpg" height="250" width="250" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-vertical-gallery.jpg" height="360" width="270" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-story-body.jpg" height="169" width="300" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-t1-main.jpg" height="250" width="250" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-assign.jpg" height="186" width="248" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-hp-video.jpg" height="144" width="256" />
</media:group>
</item>

Answers

0 for answer № 1

You need to use the XPath below:

item["media"] = node.xpath("./*[local-name()='group']/*[local-name()='content']/@url").extract()
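
To illustrate what matching on local-name() amounts to, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors, so it runs without a crawl. The feed snippet, the example.com URLs, and the helper names local_name and media_urls_by_local_name are assumptions made up for this example; they are not part of the question's code.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the feed; the example.com URLs are placeholders.
FEED = """<rss xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Sample story</title>
      <media:group>
        <media:content medium="image" url="http://example.com/a.jpg" />
        <media:content medium="image" url="http://example.com/b.jpg" />
      </media:group>
    </item>
  </channel>
</rss>"""


def local_name(element):
    """Return the tag without its '{namespace-uri}' part, like XPath local-name()."""
    return element.tag.rsplit("}", 1)[-1]


def media_urls_by_local_name(item):
    """Collect the url attribute of group/content children, ignoring namespaces."""
    urls = []
    for group in item:
        if local_name(group) != "group":
            continue
        for content in group:
            if local_name(content) == "content":
                urls.append(content.get("url"))
    return urls


item = ET.fromstring(FEED).find("./channel/item")
print(media_urls_by_local_name(item))
```

The trade-off of this style of matching is that it accepts a group or content element from any namespace, which is convenient for a quick fix but less strict than naming the namespace explicitly.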

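Basically, the problem is that those nodes use a namespace: the media: prefix is only an alias for a namespace URI, and the selector does not know that alias until it is registered. Alternatively, you can register the namespace inside your parse_node function and keep your original XPath: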

node.register_namespace("media", "http://search.yahoo.com/mrss/")
item["media"] = node.xpath("./media:group/media:content/@url").extract()
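
The same idea, binding the media prefix to its URI before querying, can be tried outside Scrapy. The following is a rough sketch with the standard library's xml.etree.ElementTree, which takes the prefix map as an argument rather than through register_namespace. The feed snippet, the example.com URLs, and the names MEDIA_NS and media_urls are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map; the URI is the one used in the answer above.
MEDIA_NS = {"media": "http://search.yahoo.com/mrss/"}

# Shortened stand-in for the feed; the example.com URLs are placeholders.
FEED = """<rss xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Sample story</title>
      <media:group>
        <media:content medium="image" url="http://example.com/a.jpg" />
        <media:content medium="image" url="http://example.com/b.jpg" />
      </media:group>
    </item>
  </channel>
</rss>"""


def media_urls(item, namespaces=None):
    """Return the url attribute of every media:content under media:group."""
    nodes = item.findall("./media:group/media:content", namespaces)
    return [node.get("url") for node in nodes]


item = ET.fromstring(FEED).find("./channel/item")

# Without a prefix map, ElementTree cannot resolve "media:" and raises SyntaxError.
try:
    media_urls(item)
except SyntaxError as exc:
    print("lookup failed:", exc)

# With the prefix map, the prefixed path resolves as intended.
print(media_urls(item, MEDIA_NS))
```

Only the URI has to match the feed's declaration; the prefix used in the query is a local alias and could be any name.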