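While scraping Baidu Tieba I ran into a problem: the HTML that actually contains the data arrives commented out. The parser treats it as a comment rather than as elements, so XPath queries in Python find nothing. The fix is to uncomment that markup before parsing.
A normal HTML comment looks like this: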
<!-- code -->
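So it is enough to break one half of the comment delimiter by replacing it with some other character. Replacing the opening <!-- is the safe choice, because without it the parser never starts a comment.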
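The effect can be checked offline before touching the real site. The following is a minimal sketch, not the Tieba page itself: SAMPLE is a made-up snippet, and LinkTextCollector and extract_titles are hypothetical helper names. It uses the standard-library html.parser instead of lxml so that it runs without third-party packages; the idea is the same.

```python
from html.parser import HTMLParser

# Hypothetical sample markup (an assumption, not real Tieba HTML):
# the link we want is wrapped in an HTML comment.
SAMPLE = '<div><!-- <a class="j_th_tit ">First thread</a> --></div>'


class LinkTextCollector(HTMLParser):
    """Collect the text found inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        # Only keep non-empty text that appears inside a link
        if self.in_link and data.strip():
            self.titles.append(data.strip())


def extract_titles(html_text):
    """Parse html_text and return the link texts the parser can see."""
    parser = LinkTextCollector()
    parser.feed(html_text)
    parser.close()
    return parser.titles


# Commented-out markup is handed to the parser as a comment, not as tags
before = extract_titles(SAMPLE)
# Replacing the comment opener turns the same markup into real elements;
# the leftover "-->" is then just plain text outside the link
after = extract_titles(SAMPLE.replace('<!--', ' '))

print(before)
print(after)
```

Here before is empty and after contains the link text, which is the same behaviour the lxml script below relies on.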
from lxml import etree
import requests

url = 'https://tieba.baidu.com/f?kw=%E5%AD%99%E7%AC%91%E5%B7%9D'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'
}
r = requests.get(url=url, headers=headers)
# Remove the comment opener so the commented-out markup is parsed as real elements
html = etree.HTML(r.text.replace('<!--', ' '))
# Select thread-title links; the class value is matched exactly, including its trailing space
data_list = html.xpath('//a[@class="j_th_tit "]/text()')
for data in data_list:
    print(data)
Output: