psf/requests-html

When requesting a page that is ISO-8859-1 encoded, HTML is still interpreted as UTF-8

Open

#442 opened on Jan 27, 2021

help wanted

Description

When requesting a page that is ISO-8859-1 encoded:

>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://gerda.geus.dk/Gerda/Search')
>>> r.encoding
'ISO-8859-1'
>>> r.html.default_encoding
'ISO-8859-1'
>>> r.html.encoding
'utf8'
>>> r.html.find("option")[-1].text
'Bygge-anl�g'

Expected behavior:

>>> r.html.find("option")[-1].text
'Bygge-anlæg'

As far as I can see, there are two problems:

  • r.html.encoding is incorrectly set to 'utf8', even though r.encoding and r.html.default_encoding are both 'ISO-8859-1'
  • r.html.element (the PyQuery instance) does not take the encoding into account at all; it simply assumes UTF-8
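
The garbled output can be reproduced without the library or the network. This is a minimal sketch, not the library's code path: the sample string is assumed to stand in for the bytes the server sends, and the variable names `raw`, `garbled`, and `correct` are only illustrative.

```python
# Bytes as an ISO-8859-1 server would send them
# (hypothetical sample; not fetched from the site in the report).
raw = "Bygge-anlæg".encode("iso-8859-1")

# Decoding with the wrong codec: 0xE6 ("æ" in ISO-8859-1) starts a
# multi-byte UTF-8 sequence that is never completed, so it is replaced
# by U+FFFD, the replacement character.
garbled = raw.decode("utf-8", errors="replace")

# Decoding with the encoding that r.encoding reports gives the expected text.
correct = raw.decode("iso-8859-1")

print(repr(garbled))
print(repr(correct))
```

If the page bytes are decoded as UTF-8 somewhere between `r.content` and the PyQuery element, this is the kind of substitution that would produce the text shown above.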
