[SOLVED] Scrapy: CSS selector matches several sections on a product detail page

Issue

This content is from Stack Overflow. Question asked by Legion Inc.

I am trying to scrape products, and defining a CSS selector for the product description that works on every product page is giving me a headache.

I am looking for a selector that matches the product description on the following page:

https://www.onlinebaufuchs.de/Werkzeug-Technik/Elektrowerkzeuge/Akku-Geraete/Akku-Schlagschrauber/Guede-Akku-Schlagschrauber-BSS-18-1-4-Zoll-0-Akkuschrauber-ohne-Akku-Ladegeraet::7886.html

The selector is:

#inner > div > div.col-lg-12-full.col-md-12-full > div:nth-child(1) > div:nth-child(12)

Alternatively, the selector can be:

div.pd_description:nth-of-type(6)

But on other pages the position of the description changes:

https://www.onlinebaufuchs.de/Werkzeug-Technik/Elektrowerkzeuge/Akku-Geraete/18-Volt-Lithium-Ionen-Akkusystem/Guede-Ladegeraet-LG-18-05-0-5-A-Aufladegeraet-fuer-diverse-Guede-Akku-Werkzeuge::7852.html

Here the selector is:

#inner > div > div.col-lg-12-full.col-md-12-full > div:nth-child(1) > div:nth-child(11)

Alternatively, the selector can be:

div.pd_description:nth-of-type(5)

When I look at the source code, the product description section is marked with the class

.pd_description

But that class is too general: it is also used for other sections on the same page.

I can’t figure out how to solve this problem.

My spider runs correctly, but for some products I get an empty description (because of the issue described above).

def parse_product(self, response):
    for product in response.css("body"):
        yield {
            "brand": product.css("div.pd_inforow:nth-of-type(4) span::text").extract(),
            "item_name": product.css("h1::text").extract(),
            # Position-based selector: it only matches pages where the description is the 12th child.
            "description": product.css("#inner > div > div.col-lg-12-full.col-md-12-full > div:nth-child(1) > div:nth-child(12)").extract_first(),
        }

Why doesn’t my CSS selector match the product description on all pages?



Solution

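Use an XPath selector instead of a position-based CSS selector. The expression below selects the div whose class is pd_description and that contains an h4 with the text Produktbeschreibung, so it does not depend on where the block sits on the page: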

product.xpath('.//div[@class="pd_description"][h4[.="Produktbeschreibung"]]').get()
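To see why anchoring on the heading is more stable than counting positions, here is a minimal sketch. It makes several assumptions: the two markup snippets are hypothetical and heavily simplified (the real pages contain more blocks, and the Lieferumfang heading is made up for illustration), and it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors so it runs without Scrapy installed. ElementTree supports only a subset of XPath, so the predicate is written as h4='Produktbeschreibung' rather than h4[.="Produktbeschreibung"]; the idea is the same.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for two product pages. Only the varying
# position of the description block matters for this illustration.
PAGE_A = """
<div id="inner">
  <div class="pd_description"><h4>Lieferumfang</h4><p>1 x Schlagschrauber</p></div>
  <div class="pd_description"><h4>Produktbeschreibung</h4><p>Akku-Schlagschrauber BSS 18</p></div>
</div>
"""

PAGE_B = """
<div id="inner">
  <div class="pd_description"><h4>Produktbeschreibung</h4><p>Ladegeraet LG 18-05</p></div>
  <div class="pd_description"><h4>Lieferumfang</h4><p>1 x Ladegeraet</p></div>
</div>
"""

# Same idea as the XPath above, in ElementTree's XPath subset: a div with
# class pd_description that has an h4 child with exactly this text.
DESCRIPTION_PATH = ".//div[@class='pd_description'][h4='Produktbeschreibung']"


def description_text(markup):
    """Return the description paragraph text, or None if no block matches."""
    root = ET.fromstring(markup)
    block = root.find(DESCRIPTION_PATH)
    if block is None:
        return None
    return block.findtext("p")


def nth_block_heading(markup, index):
    """Return the heading of the index-th pd_description block (0-based)."""
    root = ET.fromstring(markup)
    blocks = root.findall(".//div[@class='pd_description']")
    return blocks[index].findtext("h4")


if __name__ == "__main__":
    for name, page in (("A", PAGE_A), ("B", PAGE_B)):
        # The second block is a different section on each page ...
        print(name, "second block:", nth_block_heading(page, 1))
        # ... while the heading-based path finds the description on both.
        print(name, "description:", description_text(page))
```

In the spider itself the answer's expression can be used as is. If the real class attribute ever carries additional classes, the exact comparison @class="pd_description" will no longer match; contains(@class, "pd_description") is a common, looser alternative in that case.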


This question was asked on Stack Overflow by Legion Inc. and answered by gangabass. It is licensed under the terms of CC BY-SA 2.5, CC BY-SA 3.0, CC BY-SA 4.0.
