Issue
This Content is from Stack Overflow. Question asked by Legion Inc.
I am trying to scrape products (not something surprising) – but honestly, defining the CSS selector for the product descriptions that works on any product page gives me a headache.
I am looking for a selector that matches the product description on the following page:
The selector is:
#inner > div > div.col-lg-12-full.col-md-12-full > div:nth-child(1) > div:nth-child(12)
Alternatively, the selector can be:
div.pd_description:nth-of-type(6)
But sometimes the selector changes:
Here is the selector:
#inner > div > div.col-lg-12-full.col-md-12-full > div:nth-child(1) > div:nth-child(11)
Alternatively, the selector can be:
div.pd_description:nth-of-type(5)
When I look at the source code, the section of product description is defined with
.pd_description
But that class is too general: it is used for other sections of the page as well.
I can’t figure out how to solve this problem.
My spider runs correctly, but for some products I get an empty description (because of the issue described above).
def parse_product(self, response):
    for product in response.css("body"):
        yield {
            "brand": product.css("div.pd_inforow:nth-of-type(4) span::text").extract(),
            "item_name": product.css("h1::text").extract(),
            # Positional selector: only matches when the description is the 12th child
            "description": product.css("#inner > div > div.col-lg-12-full.col-md-12-full > div:nth-child(1) > div:nth-child(12)").extract_first(),
        }
Why doesn’t my CSS selector match the product description on every page?
Solution
Use an XPath selector (get the div with class equal to pd_description that contains an h4 with the text Produktbeschreibung):
product.xpath('.//div[@class="pd_description"][h4[.="Produktbeschreibung"]]').get()
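The idea of selecting by heading text instead of by position can be tried outside the spider. The following is only a minimal sketch: the two markup snippets, the extra headings, and the helper name description_text are made up for illustration, and it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors so that it runs without a crawl. ElementTree supports only a subset of XPath, so the predicate is written as [h4='Produktbeschreibung'], which tests the same condition as the nested predicate in the answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. On PAGE_B one block is missing, so the
# description sits at a different position than on PAGE_A.
PAGE_A = """
<div id="inner">
  <div class="pd_description"><h4>Lieferumfang</h4><p>Scope A</p></div>
  <div class="pd_description"><h4>Hinweise</h4><p>Notes A</p></div>
  <div class="pd_description"><h4>Produktbeschreibung</h4><p>Beschreibung A</p></div>
</div>
"""

PAGE_B = """
<div id="inner">
  <div class="pd_description"><h4>Lieferumfang</h4><p>Scope B</p></div>
  <div class="pd_description"><h4>Produktbeschreibung</h4><p>Beschreibung B</p></div>
</div>
"""

# Select the block by its heading text, not by its index among siblings.
DESCRIPTION_XPATH = ".//div[@class='pd_description'][h4='Produktbeschreibung']"


def description_text(markup):
    """Return the paragraph text of the description block, or None if absent."""
    root = ET.fromstring(markup)
    block = root.find(DESCRIPTION_XPATH)
    if block is None:
        return None
    return block.findtext("p")


print(description_text(PAGE_A))
print(description_text(PAGE_B))
```

Because the match depends on the heading rather than on nth-child or nth-of-type, it keeps working when other pd_description blocks are added or removed. Note that @class="pd_description" is an exact comparison; if the real pages ever put additional classes on that div, the test would need to be loosened, for example with contains(@class, "pd_description") in Scrapy.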
This question was asked on Stack Overflow by Legion Inc. and answered by gangabass. It is licensed under the terms of CC BY-SA 2.5 - CC BY-SA 3.0 - CC BY-SA 4.0.