Bug fixes for the recognizers #437

Closed
ddfinshes wants to merge 58 commits into
ccprocessor:dev from
ddfinshes:dev

Conversation

@ddfinshes ddfinshes commented Jun 6, 2025

Contributor
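  1. Paragraph
  • Updated the check that decides whether two inline text fragments are joined with a space (' ') or with nothing (''). This avoids inserting a space after Chinese double quotes, single quotes, (, [ and similar characters, which left extra spaces inside a line, for example '( APC)' and '“ I do wear..'.
  • Added more tags to the paragraph inline-tag check.
  • Prevented <sup> and <sub> tags from being escaped to <sup&gt; and <sub&gt;, which broke superscripts and subscripts in the text; replace_entities() was rewritten for this.
  2. Code
  • Previously, any element whose class name contained the word "code" was labeled as code, so audio and list content was misrecognized as code. Added a check: elements whose tag is audio, ul, etc. are no longer recognized as code.
  3. List
  • When list items are stored in the content item list, added a check so that items equal to '' or '-' are no longer stored in content_list.
  • Handles nested list structures such as <ul><ul></ul></ul> or <ul><div></div></ul>, which previously were not recognized, so part of the list content was missing from the main text.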
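The separator rule from the paragraph item can be sketched roughly as follows. This is only an illustration: `join_inline` and `NO_SPACE_AFTER` are hypothetical names, and the character set is an assumption based on the characters named above, not the set used in llm_web_kit.

```python
# Hypothetical helper: the name and the exact character set are assumptions,
# not the code in llm_web_kit.
NO_SPACE_AFTER = set('([{“‘（【《')


def join_inline(left: str, right: str) -> str:
    """Join two inline text fragments, choosing '' or ' ' as the separator."""
    if not left or not right:
        return left + right
    # A boundary that already has whitespace needs no extra separator.
    if left[-1].isspace() or right[0].isspace():
        return left + right
    # No space after an opening bracket or opening quote.
    if left[-1] in NO_SPACE_AFTER:
        return left + right
    return left + ' ' + right


print(join_inline('(', 'APC)'))
print(join_inline('“', 'I do wear..'))
print(join_inline('Hello', 'world'))
```

With this rule the two examples from the description come out as '(APC)' and '“I do wear..', while ordinary words are still separated by a single space.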
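One possible way to keep superscript and subscript markup while escaping everything else is shown below. It is a sketch under assumptions: `escape_keep_sup_sub` is a hypothetical function built on the standard library's `html.escape`, and it is not the rewritten replace_entities() from this pull request, whose behavior is not shown here.

```python
import html
import re

# Matches the escaped forms of <sup>, </sup>, <sub> and </sub>.
_ESCAPED_SUP_SUB = re.compile(r'&lt;(/?)(sup|sub)&gt;')


def escape_keep_sup_sub(text: str) -> str:
    """Escape HTML special characters but leave bare sup/sub tags intact.

    Hypothetical helper for illustration only.
    """
    escaped = html.escape(text, quote=False)
    # Restore only the attribute-free sup/sub tags; everything else stays escaped.
    return _ESCAPED_SUP_SUB.sub(r'<\1\2>', escaped)


print(escape_keep_sup_sub('x<sup>2</sup> < y & z'))
```

Escaping first and then restoring a small allow-list keeps the function simple, at the cost of not handling sup/sub tags that carry attributes.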
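The extra check described in the code item might look like the sketch below. `looks_like_code` and `NON_CODE_TAGS` are hypothetical names; the description names only audio and ul "etc.", so the full exclusion set is an assumption.

```python
# Tags that should never be treated as code, even if the class name contains
# "code". Only audio and ul are named in the description; anything beyond
# that would be an assumption.
NON_CODE_TAGS = {'audio', 'ul'}


def looks_like_code(tag: str, class_name: str) -> bool:
    """Return True if an element should be labeled as code (illustrative only)."""
    if tag.lower() in NON_CODE_TAGS:
        return False
    return 'code' in class_name.lower()


print(looks_like_code('div', 'highlight code-block'))
print(looks_like_code('audio', 'wp-audio-shortcode'))
```

A class such as `wp-audio-shortcode` shows why the substring test alone misfires: "shortcode" contains "code" even though the element is an audio player.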
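The list item filter can be illustrated with a few lines. `keep_list_items` is a hypothetical name, and stripping whitespace before the comparison is an assumption that goes slightly beyond the '' and '-' cases named above.

```python
def keep_list_items(items: list[str]) -> list[str]:
    """Drop list items that are empty or consist only of a dash (illustrative)."""
    # Assumption: surrounding whitespace is ignored when testing for '' or '-'.
    return [item for item in items if item.strip() not in ('', '-')]


print(keep_list_items(['first', '', '-', ' second ', ' - ']))
```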

codecov Bot commented Jun 6, 2025

Codecov Report

Attention: Patch coverage is 92.30769% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
..._web_kit/extractor/html/recognizer/code/classes.py 77.77% 2 Missing ⚠️

@@            Coverage Diff             @@
##              dev     #437      +/-   ##
==========================================
- Coverage   89.97%   89.47%   -0.50%     
==========================================
  Files         107      102       -5     
  Lines        8288     8294       +6     
==========================================
- Hits         7457     7421      -36     
- Misses        831      873      +42     
Files with missing lines Coverage Δ
llm_web_kit/extractor/html/recognizer/list.py 97.31% <100.00%> (+0.09%) ⬆️
llm_web_kit/extractor/html/recognizer/text.py 94.54% <100.00%> (+0.17%) ⬆️
llm_web_kit/libs/html_utils.py 92.39% <ø> (-1.14%) ⬇️
..._web_kit/extractor/html/recognizer/code/classes.py 86.20% <77.77%> (-3.80%) ⬇️

... and 4 files with indirect coverage changes

@ddfinshes ddfinshes closed this Jun 16, 2025