/ / स्क्रैपी XMLFeedSpider पार्स xml समूह सूचक - xml, स्क्रैपी

स्केपर XMLFeedSpider पार्स एक्सएमएल समूह संकेतक - एक्सएमएल, स्केपर

मुझे स्क्रेपी में xpath चयनकर्ता के साथ कुछ समस्या है। मैं "मीडिया पार्स" नहीं कर सकता। क्या आप मेरी मदद कर सकते हैं, कुछ विचार, कुछ उदाहरण कोड। धन्यवाद यह मेरे मकड़ी ys

import scrapy
from scrapy.spiders import XMLFeedSpider
from crawler.items import News

class CNNSpider(XMLFeedSpider):
name = "cnn"
start_urls = [
"http://rss.cnn.com/rss/edition.rss", # Top stories
#"http://rss.cnn.com/rss/cnn_latest.rss", # most recerent
]
iterator = "iternodes"  # This is actually unnecessary, since it"s the default value
itertag = "item"

def parse_node(self, response, node):
item = News()
item["title"] = node.xpath("./title/text()").extract()
item["description"] = node.xpath("./description/text()").extract()
item["link"] = node.xpath("./link/text()").extract()
item["media"] = node.xpath("./media:group/media:content/@url").extract()
item["pubDate"] = node.xpath("./pubDate/text()").extract()
print item["media"]

और मेरा xml फ़ीड:

<item>
<title><![CDATA[More than 200 dead in Mexico quake, buildings toppled]]></title>
<link>http://www.cnn.com/collections/mexico-city-earthquake-intl/</link>
<guid isPermaLink="true">http://www.cnn.com/collections/mexico-city-earthquake-intl/</guid>
<pubDate>Wed, 20 Sep 2017 10:03:24 GMT</pubDate>
<media:group>
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-super-169.jpg" height="619" width="1100" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-large-11.jpg" height="300" width="300" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-vertical-large-gallery.jpg" height="552" width="414" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-video-synd-2.jpg" height="480" width="640" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-live-video.jpg" height="324" width="576" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-t1-main.jpg" height="250" width="250" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-vertical-gallery.jpg" height="360" width="270" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-story-body.jpg" height="169" width="300" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-t1-main.jpg" height="250" width="250" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-assign.jpg" height="186" width="248" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-hp-video.jpg" height="144" width="256" />
</media:group>
</item>

उत्तर:

जवाब के लिए 0 № 1

आपको Xpath के नीचे उपयोग करने की आवश्यकता है

item["media"] = node.xpath("./*[local-name()="group"]/*[local-name()="content"]/@url").extract()

मूल रूप से मुद्दा यह है कि नोड्स नाम-स्थान का उपयोग कर रहे हैं। या आप अपने अंदर नेमस्पेस रजिस्टर कर सकते हैं parse_node कार्य करें और इसे काम करें

node.register_namespace("media", "http://search.yahoo.com/mrss/")
item["media"] = node.xpath("./media:group/media:content/@url").extract()