Decoding Chinese Characters with BeautifulSoup

When scraping a Chinese website with Python (on Windows 10) and requests, the page content is often encoded in “gb2312”. However, if you do not set the encoding on the response yourself, requests uses its default guess (ISO-8859-1 when the HTTP headers declare no charset for a text response), which is not “gb2312”, so the Chinese characters come out garbled. The following example shows one way to scrape Chinese content and save it to a text file encoded as UTF-8.
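The mismatch can be reproduced without any network access. The snippet below is a minimal sketch: the sample string is a made-up stand-in for page content, and ISO-8859-1 is used as the wrong codec only for illustration.

```python
# Hypothetical sample text standing in for the content of a Chinese page.
original = "中文网站"

# The bytes as a GB2312-encoded page would deliver them.
raw_bytes = original.encode("gb2312")

# Decoding with the wrong codec raises no error; it silently yields garbled text.
wrong = raw_bytes.decode("iso-8859-1")

# Decoding with the codec the bytes were written in recovers the text.
right = raw_bytes.decode("gb2312")

print(wrong == original)
print(right == original)
```

Because the wrong decode does not raise an exception, the problem usually only shows up later as unreadable output.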

import requests

from bs4 import BeautifulSoup

# Python 3 strings are already Unicode, so there is no need to change the
# interpreter's default encoding (sys.setdefaultencoding() does not exist in Python 3)

# scrape the website with requests
url_to_scrape = 'http://www.mitbbs.com'
readOut = requests.get(url_to_scrape)

# requests exposes apparent_encoding, its guess at the real encoding based on
# the response body, so set the encoding to apparent_encoding before reading .text
readOut.encoding = readOut.apparent_encoding

# use BeautifulSoup to parse the decoded text (the "lxml" parser requires the lxml package)
textSoup = BeautifulSoup(readOut.text, "lxml")

# printing content from textSoup now displays the Chinese characters correctly
print(textSoup.title.string)

# open the file with an explicit UTF-8 encoding so the Chinese characters are
# written correctly regardless of the platform's default encoding
with open("fileToWrite.txt", "w", encoding="utf-8") as fileToWrite:
    fileToWrite.write("%s" % textSoup.title.string)
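One way to confirm that a file really was written as UTF-8 is to read it back, both as raw bytes and as decoded text. The following is a rough sketch only: the title string is a hypothetical stand-in for textSoup.title.string, and the file goes to a temporary directory so nothing in the working directory is touched.

```python
import os
import tempfile

# Hypothetical stand-in for the scraped title (textSoup.title.string).
title = "中文标题"

# Write to a temporary location; the file name mirrors the example above.
path = os.path.join(tempfile.mkdtemp(), "fileToWrite.txt")

# Write with an explicit encoding instead of relying on the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write(title)

# Read the raw bytes to inspect what actually landed on disk.
with open(path, "rb") as f:
    raw = f.read()

# Read the text back with the same encoding to check the round trip.
with open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

print(read_back == title)
print(len(raw))
```

Passing the encoding explicitly on both the write and the read keeps the result independent of the Windows locale settings.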
