Scraping notes from a Xiaohongshu blogger's profile page: get note data in one click

Pake

2025-07-16 10:03


The images in this post no longer load, so please refer to this article instead:

https://cibdglc9vxp.feishu.cn/wiki/YyctwHXAUiB6mEkCAb5cTpLOn9c?from=from_copylink

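Getting note links

When scraping Xiaohongshu note data with the lazy-load command, an "element lookup failed" error is often reported.

Opening the F12 developer tools shows that although the page has already loaded more than two hundred elements, validation finds only 21 of them. The reason is that Xiaohongshu uses a virtual list rather than lazy loading: only the items currently in view are kept in the DOM.

The solution is an infinite loop combined with a dictionary.

The method is explained in detail below.

Configure a custom dialog box for entering the link and the save path.

Each pass of the infinite loop fetches the list of similar elements again, and that list may contain duplicates, so I use the data-index attribute as the dictionary key to guarantee uniqueness. (Different notes can share the same title, so the title is not used as the key.)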
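The loop-plus-dictionary idea can be sketched in plain Python. This is only an illustration, not the RPA flow itself: `snapshots` stands in for the element lists captured on each pass, each card is simplified to a dict with `index` and `title`, and the `max_idle_rounds` stop condition is an assumption (the flow above only says "infinite loop").

```python
def collect_cards(snapshots, max_idle_rounds=3):
    """Merge repeated snapshots of a virtual list into one dict keyed by data-index.

    snapshots: iterable of lists; each list holds the cards visible on one pass.
    max_idle_rounds: hypothetical stop condition, stop after this many
    consecutive passes that add nothing new.
    """
    cards = {}
    idle_rounds = 0
    for visible in snapshots:
        before = len(cards)
        for card in visible:
            # setdefault keeps the first copy and ignores later duplicates
            cards.setdefault(card["index"], card)
        if len(cards) == before:
            idle_rounds += 1
            if idle_rounds >= max_idle_rounds:
                break
        else:
            idle_rounds = 0
    return cards


# Two notes share a title and the windows overlap; data-index still tells them apart.
snapshots = [
    [{"index": "0", "title": "OOTD"}, {"index": "1", "title": "OOTD"}],
    [{"index": "1", "title": "OOTD"}, {"index": "2", "title": "Travel"}],
    [{"index": "2", "title": "Travel"}],
]
cards = collect_cards(snapshots, max_idle_rounds=1)
print(sorted(cards))  # ['0', '1', '2']
```

Keying on the title instead would have merged notes "0" and "1" into a single entry.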

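The note link cannot be obtained from “//*[@id="userPostedFeeds"]/section”, so one more element list should be captured: “//*[@id="userPostedFeeds"]/section/div”.

After getting the source code of each note element, call a Python module to clean it and extract the index, title, and link.

The Python code is as follows: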

from common.util.shared_variables import GetSharedVariable
from common.util.elements_util import elementsFormatNew
from projects.rpaRoot import SZEnv
from ..global_data import SHIZAI_ELEMENT_DICT, globalVar
from ..global_data import run_module, print

from bs4 import BeautifulSoup


def data_index(html_content: str):
    """Return the data-index value of the first element that has one, or None."""
    soup = BeautifulSoup(html_content, 'html.parser')
    element = soup.find(attrs={"data-index": True})
    return element.get("data-index") if element else None


def data_title(html_content: str):
    """Return the note title, or None if no title is found."""
    soup = BeautifulSoup(html_content, 'html.parser')
    # The title text sits in a <span> inside <a class="title">
    element = soup.find('a', {'class': 'title'})
    span = element.find('span') if element else None
    return span.text.strip() if span else None


def data_link(html_content: str):
    """Return the absolute note URL, or None if no link is found."""
    soup = BeautifulSoup(html_content, 'html.parser')
    # A list passed to class_ matches an <a> tag carrying any one of these classes
    a_tag = soup.find('a', class_=['cover', 'mask', 'ld'])
    if a_tag and a_tag.has_attr('href'):
        return f'https://www.xiaohongshu.com{a_tag["href"]}'
    return None

def main(args):
    pass

Once the notes have been collected, loop over the dictionary to write the data to an Excel sheet.
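Outside the RPA tool, the same step might look like the sketch below. It swaps in the standard-library `csv` module in place of the Excel command (Excel opens the resulting file directly), and it assumes each dictionary value is a dict with `title` and `link` keys; that layout, the function name, and the placeholder links are illustrative, not taken from the flow.

```python
import csv
import os
import tempfile


def write_notes_csv(notes, path):
    """Write {data-index: {"title": ..., "link": ...}} to a CSV file Excel can open."""
    # utf-8-sig adds a BOM so Excel detects UTF-8 and shows Chinese titles correctly
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["index", "title", "link"])
        # data-index values are strings, so sort numerically to keep page order
        for index in sorted(notes, key=int):
            note = notes[index]
            writer.writerow([index, note["title"], note["link"]])


notes = {
    "10": {"title": "Travel", "link": "https://www.xiaohongshu.com/placeholder-10"},
    "2": {"title": "OOTD", "link": "https://www.xiaohongshu.com/placeholder-2"},
}
path = os.path.join(tempfile.mkdtemp(), "notes.csv")
write_notes_csv(notes, path)

# Read the file back to check the row order
with open(path, newline="", encoding="utf-8-sig") as f:
    rows = list(csv.reader(f))
```

Sorting with `key=int` matters because a plain string sort would put "10" before "2".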

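Getting detailed data

Add a Get Cookies command after the Open Web Page command so that later steps can use the cookies.

The cookies value returned is a dictionary, so it has to be processed by iterating over it.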
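One plausible form of that processing is to join the dictionary into a single `Cookie` header string, which is what an HTTP request header expects. The sketch below assumes the Get Cookies command returns a flat name-to-value dict; the function name and the cookie names and values are placeholders.

```python
def cookie_header(cookies: dict) -> str:
    """Join a {name: value} cookie dict into a single Cookie header string."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


cookies = {"name1": "value1", "name2": "value2"}  # placeholder names and values
print(cookie_header(cookies))  # name1=value1; name2=value2
```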

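Open Flow Management and create a new Python flow under Python Module Management.

Copy the following code into it:

# Shizai code editor usage notes:
# 1. Common helpers for getting elements, using global variables, and calling flow blocks are imported by default
# 2. This Python module can be called from flow blocks and subflows with the "Call Python Module" component
# 3. For more help, press "Alt + Shift + F1"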

from common.util.shared_variables import GetSharedVariable
from common.util.elements_util import elementsFormatNew
from projects.rpaRoot import SZEnv
from ..global_data import SHIZAI_ELEMENT_DICT, globalVar
from ..global_data import run_module, print

import re

import requests
from bs4 import BeautifulSoup


def _meta_content(html, name):
    """Return the content attribute of <meta name="...">, or None if the tag is absent."""
    tag = BeautifulSoup(html, "html.parser").find("meta", {"name": name})
    return tag.get("content") if tag else None


class xhs:
    @staticmethod
    def image(html):  # Image download links (including video covers)
        pattern = r'<meta name="og:image" content="(http[^"]+)"'
        return re.findall(pattern, html)

    @staticmethod
    def video(html):  # Video download link; None when the note has no video
        return _meta_content(html, "og:video")

    @staticmethod
    def title(html):
        title = _meta_content(html, "og:title")
        # Strip the trailing " - 小红书" site suffix
        return re.sub(r'\s-\s小红书$', '', title) if title else title

    @staticmethod
    def content(html):
        return _meta_content(html, "description")

    @staticmethod
    def keywords(html):
        return _meta_content(html, "keywords")

    @staticmethod
    def note_like(html):
        return _meta_content(html, "og:xhs:note_like")

    @staticmethod
    def note_collect(html):
        return _meta_content(html, "og:xhs:note_collect")

    @staticmethod
    def note_comment(html):
        return _meta_content(html, "og:xhs:note_comment")


def xhs_detail(url, Cookie):
    # Send the cookie and a browser User-Agent only when a cookie is supplied
    if Cookie is not None:
        headers = {
            'Cookie': Cookie,
            'User-Agent': 'Mozilla/5.0...'
        }
        response = requests.get(url, headers=headers)
    else:
        response = requests.get(url)
    html = response.text
    detail = {}
    detail['标题'] = xhs.title(html)
    detail['内容'] = xhs.content(html)
    detail['话题'] = xhs.keywords(html)
    detail['点赞'] = xhs.note_like(html)
    detail['收藏'] = xhs.note_collect(html)
    detail['评论'] = xhs.note_comment(html)
    detail['图片'] = xhs.image(html)
    detail['视频'] = xhs.video(html)
    return detail

def main():
    pass

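The return value is a dictionary. The type of each value is as follows:

标题 (title): string; 内容 (content): string; 话题 (topics): string; 点赞 (likes): string; 收藏 (favorites): string; 评论 (comments): string; 图片 (images): list; 视频 (video): string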
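Because 图片 is a list and 视频 is None for notes without a video, the dictionary may need flattening before each value goes into a single spreadsheet cell. The helper below is a hypothetical sketch of one way to do that; `flatten_detail` is not part of the module, and the sample values are made up.

```python
def flatten_detail(detail: dict) -> dict:
    """Turn every value into a plain string so it fits one spreadsheet cell."""
    flat = {}
    for key, value in detail.items():
        if value is None:
            flat[key] = ""                 # missing field becomes an empty cell
        elif isinstance(value, list):
            flat[key] = "\n".join(value)   # one image URL per line
        else:
            flat[key] = value
    return flat


sample = {
    "标题": "OOTD",
    "点赞": "12",
    "图片": ["https://example.com/1.jpg", "https://example.com/2.jpg"],
    "视频": None,
}
flat = flatten_detail(sample)
print(flat["图片"].count("\n"))  # 1
```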

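Call this module while iterating over the dictionary and write each result to the Excel sheet to complete the data collection.

The result of the run is as follows: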
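As a rough sketch of that loop in plain Python: `collect_details` is a hypothetical helper, the `fetch` parameter stands in for `xhs_detail` so the example runs without network access, and the notes dictionary layout (`link` under each data-index) and the skip-on-error behavior are both assumptions.

```python
def collect_details(notes, fetch, cookie=None):
    """Call fetch(link, cookie) for every note and return one row dict per note."""
    rows = []
    for index, note in notes.items():
        try:
            detail = fetch(note["link"], cookie)
        except Exception as exc:
            # Assumption: record the failure and continue with the next note
            detail = {"error": str(exc)}
        rows.append({"index": index, "link": note["link"], **detail})
    return rows


def fake_fetch(url, cookie):
    """Stand-in for xhs_detail; fails for one URL to exercise the error path."""
    if url.endswith("bad"):
        raise ValueError("request failed")
    return {"标题": "demo", "点赞": "1"}


notes = {
    "0": {"link": "https://www.xiaohongshu.com/placeholder-ok"},
    "1": {"link": "https://www.xiaohongshu.com/placeholder-bad"},
}
rows = collect_details(notes, fake_fetch, cookie="name1=value1")
print(len(rows))  # 2
```

In the real flow, `fetch` would be `xhs_detail` and each row would be written to the sheet as it is produced.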
