Issue
This content is from Stack Overflow. Question asked by hareko.
I am trying to scrape soccer players’ data using python’s Scrapy package. The website I’m scraping has the format
https://www.example.com/players — I’ll refer to it as “Homepage”
Here, there is a list of the players in the league. To get to the data I’m looking for from the start URL, I have to click a player’s name, which takes me to that player’s overview page. To get the data for the second player, I have to go back to the Homepage and click the second player’s name, scrape the data, go back to the Homepage again, click the third player’s name, and so on.
So how should I go about this task? Should I use a basic Spider or a CrawlSpider? How do I tell Scrapy to go into a specific page (a player’s overview page) and then back out to the Homepage, where the list of all players is, so that I can move on to the next player and repeat the process? Thank you in advance!
Solution
Assuming the page isn’t rendered with JavaScript, Scrapy would be a great tool for this.
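A quick way to test that assumption is to look at the HTML the server sends before any JavaScript runs: if the links to the player pages are already in it, Scrapy can follow them directly. The sketch below uses only the standard library; the `LinkCollector` class, the `extract_links` helper, and the sample markup are made up for illustration and are not part of Scrapy.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url):
    """Return absolute URLs for all links present in the raw HTML."""
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.hrefs]


# Hypothetical markup standing in for the raw (non-rendered) list page.
sample_html = """
<ul>
  <li><a href="/players/1">Player One</a></li>
  <li><a href="/players/2">Player Two</a></li>
</ul>
"""

links = extract_links(sample_html, "https://www.example.com/players")
print(links)
```

On the real site you would pass the downloaded page source instead of `sample_html`. If the result is empty for a page that clearly shows players in the browser, the list is probably filled in by JavaScript. Running `scrapy shell` against the URL gives a similar view of the raw response.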
I would suggest reading the installation docs and the tutorial to get a general understanding of how it works, where to begin and how to start a new project.
Here is an example of what your spider could look like, using a basic Spider:
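import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["https://example.com/homepage"]

    def parse(self, response):
        # Placeholder selector: replace it with one that matches the player links on the list page
        for url in response.css("a.player::attr(href)").getall():
            # response.follow resolves relative URLs against the current page
            yield response.follow(url, callback=self.parse_player)

    def parse_player(self, response):
        # Scrape the player data into a dictionary and yield it as an item
        # (the fields and selectors here are placeholders)
        yield {"name": response.css("h1::text").get(), "url": response.url}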
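The spider never has to go "back" to the Homepage: `parse` runs once on the list page, yields one request per player, and each player page is then downloaded and handed to `parse_player` on its own. The following is a deliberately simplified, standard-library model of that request-and-callback flow, not Scrapy's actual engine; the `FAKE_SITE` dictionary, the `crawl` function, and the page contents are invented for illustration.

```python
from collections import deque

# Hypothetical stand-in for the website: URL -> page content.
FAKE_SITE = {
    "/players": ["/players/1", "/players/2", "/players/3"],
    "/players/1": {"name": "Player One"},
    "/players/2": {"name": "Player Two"},
    "/players/3": {"name": "Player Three"},
}


def parse(page):
    # List page: yield one (url, callback) "request" per player link.
    for url in page:
        yield (url, parse_player)


def parse_player(page):
    # Player page: yield the scraped data as an item (a plain dict).
    yield {"name": page["name"]}


def crawl(start_url):
    """Process queued requests until none are left and return the items."""
    queue = deque([(start_url, parse)])
    items = []
    while queue:
        url, callback = queue.popleft()
        page = FAKE_SITE[url]  # "download" the page
        for result in callback(page):
            if isinstance(result, dict):
                items.append(result)  # an item: keep it
            else:
                queue.append(result)  # a new request: schedule it
    return items


items = crawl("/players")
print(items)
```

The list page is fetched exactly once, and every player request comes from that single pass over its links. Real Scrapy downloads pages concurrently, so items may arrive in a different order than the links appear on the page.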
This question was asked on Stack Overflow by hareko and answered by Alexander. It is licensed under the terms of CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.