How do I scrape Unicode text properly?


This content is from Stack Overflow. Question asked by Indiego.

👋🏼 I’m trying to scrape this list of options:

from lxml import html 
import requests as req

ifb_resp = req.get(
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36',
             'accept-language': 'en-US,en;q=0.9,fa;q=0.8'})
tree = html.fromstring(html=ifb_resp.content)
instruments = tree.xpath('//select[@id="ContentPlaceHolder1_SymbolCombo"]/option')
a1 = instruments[1]

But the text of each option is in Farsi (Persian), and it comes out like this:

' Ø§Ø¹ØªØ¶Ø§Ø¯ ØºØ¯Û\x8cØ±1_Ø¨Ø§Ø²Ø§Ø± Ø³Ù\x88Ù\x85'

I tried encoding it with ‘utf-8’ and got this:

b' xc3x98xc2xa7xc3x98xc2xb9xc3x98xc2xaaxc3x98xc2xb6xc3x98xc2xa7xc3x98xc2xaf xc3x98xc2xbaxc3x98xc2xafxc3x9bxc2x8cxc3x98xc2xb11_xc3x98xc2xa8xc3x98xc2xa7xc3x98xc2xb2xc3x98xc2xa7xc3x98xc2xb1 xc3x98xc2xb3xc3x99xc2x88xc3x99xc2x85'

Why does it turn into binary? I’m so lost here. How do I get the text as it appears on the page?
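
On the "binary" part: in Python 3, calling .encode('utf-8') on a str always returns a bytes object, so that output is expected and is not a fix for the garbling. The sketch below is not from the original post. It assumes the garbled string was produced by UTF-8 bytes being read as Latin-1 (ISO-8859-1), which is consistent with the byte pattern quoted in the question, and shows that the round trip can be reversed. The variable names are illustrative only.

```python
# Bytes quoted in the question: the output of calling .encode('utf-8')
# on the string that was already garbled.
double_encoded = (
    b' \xc3\x98\xc2\xa7\xc3\x98\xc2\xb9\xc3\x98\xc2\xaa\xc3\x98\xc2\xb6\xc3\x98\xc2\xa7\xc3\x98\xc2\xaf'
    b' \xc3\x98\xc2\xba\xc3\x98\xc2\xaf\xc3\x9b\xc2\x8c\xc3\x98\xc2\xb11_'
    b'\xc3\x98\xc2\xa8\xc3\x98\xc2\xa7\xc3\x98\xc2\xb2\xc3\x98\xc2\xa7\xc3\x98\xc2\xb1'
    b' \xc3\x98\xc2\xb3\xc3\x99\xc2\x88\xc3\x99\xc2\x85'
)

# Undo the extra .encode('utf-8') to get back the garbled str.
garbled = double_encoded.decode('utf-8')

# Assumption: the garbled str is UTF-8 data that was decoded as Latin-1.
# Encoding as Latin-1 recovers the raw bytes; decoding as UTF-8 reads them correctly.
repaired = garbled.encode('latin-1').decode('utf-8')

print(repr(garbled))   # the mojibake seen in the question
print(repaired)        # the Persian text
```

If the assumption holds, repaired is the readable option text (" اعتضاد غدیر1_بازار سوم"). This kind of repair is a diagnostic more than a solution; decoding the response correctly in the first place is the cleaner route.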

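
One possible way to get the text as it appears on the page is to tell lxml the encoding explicitly instead of letting it guess from raw bytes. This is a minimal sketch, not a confirmed answer: it assumes the page is served as UTF-8 without a charset declaration that lxml picks up. The helper parse_utf8_html and the sample markup are hypothetical stand-ins for the real response in ifb_resp.content.

```python
from lxml import html


def parse_utf8_html(raw: bytes):
    """Parse raw HTML bytes, telling lxml explicitly that they are UTF-8."""
    parser = html.HTMLParser(encoding="utf-8")
    return html.fromstring(raw, parser=parser)


# Hypothetical stand-in for ifb_resp.content: UTF-8 bytes with no charset declared.
sample = (
    '<html><body><select id="ContentPlaceHolder1_SymbolCombo">'
    '<option>first</option>'
    '<option> اعتضاد غدیر1_بازار سوم</option>'
    '</select></body></html>'
).encode("utf-8")

tree = parse_utf8_html(sample)
instruments = tree.xpath('//select[@id="ContentPlaceHolder1_SymbolCombo"]/option')
a1 = instruments[1]
print(a1.text)
```

With the code from the question, this would mean calling parse_utf8_html(ifb_resp.content). Another route worth trying is to decode first and parse the resulting str, for example html.fromstring(ifb_resp.content.decode('utf-8')), or to set ifb_resp.encoding = 'utf-8' and parse ifb_resp.text. Which one fits depends on what encoding the server actually sends.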



This question and answer were collected from Stack Overflow and tested by the JTuto community, and are licensed under the terms of CC BY-SA 2.5, CC BY-SA 3.0, CC BY-SA 4.0.
