Add interfaces for element recognition and magic-html extraction #457
Merge branch 'dev' of https://github.com/dt-yy/llm-webkit-mirror into dev
Codecov Report
Attention: Patch coverage is
@@ Coverage Diff @@
## dev #457 +/- ##
==========================================
- Coverage 89.97% 89.69% -0.28%
==========================================
Files 107 102 -5
Lines 8288 8433 +145
==========================================
+ Hits 7457 7564 +107
- Misses 831 869 +38
... and 6 files with indirect coverage changes
return result

def __extract_magic_html(url:str, html_str: str, page_layout_type:str) -> DataJson:
page_layout_type should be an enum. With a plain string, it is hard to find out which values are valid.
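A minimal sketch of what this suggestion could look like. The names `PageLayoutType` and `parse_page_layout_type` are hypothetical, not part of this PR; only the value `'article'` appears in the diff, and `FORUM` is an assumed second member included purely for illustration.

```python
from enum import Enum


class PageLayoutType(str, Enum):
    """Hypothetical enum of accepted page layout types."""
    ARTICLE = 'article'  # the only value that appears in this PR
    FORUM = 'forum'      # assumed member, for illustration only


def parse_page_layout_type(value: str) -> PageLayoutType:
    """Convert a raw string to the enum, failing loudly on unknown values."""
    try:
        return PageLayoutType(value)
    except ValueError:
        allowed = ', '.join(member.value for member in PageLayoutType)
        raise ValueError(
            f'unknown page_layout_type {value!r}; expected one of: {allowed}'
        ) from None
```

Because the enum also derives from `str`, existing callers that pass or compare plain strings such as `'article'` would keep working, while new callers get discoverable, autocompletable values.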
return result

def extract_pure_html_to_md(url:str, html_content: str) -> str:
--> extract_xx_html_to_md: this function and extract_html_to_md below can be merged into one, distinguished by a parameter clip_html=True|False. The name pure_html is somewhat ambiguous.
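A rough sketch of the merged entry point, under stated assumptions: the two private helpers are placeholders standing in for the real extraction paths (they only tag their input so the dispatch is visible), and the default `clip_html=True` is an assumption, not something the review specifies.

```python
def _extract_clipped(url: str, html_content: str) -> str:
    """Placeholder for the path that clips the main content first."""
    return f'clipped:{url}'


def _extract_unclipped(url: str, html_content: str) -> str:
    """Placeholder for the path that converts the HTML as given."""
    return f'unclipped:{url}'


def extract_html_to_md(url: str, html_content: str, clip_html: bool = True) -> str:
    """Single entry point; clip_html selects the extraction path."""
    extract = _extract_clipped if clip_html else _extract_unclipped
    return extract(url, html_content)
```

With this shape, callers choose the behaviour through one boolean instead of having to remember two similarly named functions.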
return result.get_content_list().to_mm_md()

def extract_magic_html(url:str, html_str: str, page_layout_type:str = 'article') -> str:
Suggested function name: extract_main_html_by_magic_html
class HTMLFileFormatExtractor(BaseFileFormatExtractor):
class HTMLFileFormatExtractor(PureHTMLFileFormatExtractor):
In terms of file organization, extractor.py looks more like the base class and pure_extractor.py like the subclass.
Also, I suggest renaming PureHTMLFileFormatExtractor and HTMLFileFormatExtractor as follows:
HTMLFileFormatExtractor -> MagicHTMLFileFormatExtractor
PureHTMLFileFormatExtractor -> NoClipHTMLFileFormatExtractor
In the future there may also be an LLM7BHTMLFileFormatExtractor ...
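To illustrate the proposed naming, here is a sketch that keeps the inheritance direction shown in the diff. `BaseFileFormatExtractor` below is only a stand-in for the project's real base class, and the methods `prepare_html` and `clip_main_html` are hypothetical names invented for this example.

```python
class BaseFileFormatExtractor:
    """Stand-in for the project's base class; its real interface is not shown here."""

    def __init__(self, config: dict):
        self.config = config


class NoClipHTMLFileFormatExtractor(BaseFileFormatExtractor):
    """Uses the HTML as given, without main-content clipping."""

    def prepare_html(self, html: str) -> str:
        return html


class MagicHTMLFileFormatExtractor(NoClipHTMLFileFormatExtractor):
    """Clips the main content first, then reuses the parent behaviour."""

    def clip_main_html(self, html: str) -> str:
        # Placeholder: a real implementation would call magic-html here.
        return html.strip()

    def prepare_html(self, html: str) -> str:
        return super().prepare_html(self.clip_main_html(html))
```

Naming each class after its clipping strategy leaves room for further variants, such as the LLM-based one the reviewer mentions, without the names implying which one is "the" HTML extractor.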
def __extract_pure_html(url:str, html_content: str) -> DataJson:
    extractor = PureHTMLFileFormatExtractor(load_pipe_tpl('html'))
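https://github.com/ccprocessor/llm-webkit-mirror/blob/dev/llm_web_kit/config/pipe_tpl/html.jsonc In theory, this configuration would never select PureHTMLFileFormatExtractor. Either the code here is ambiguous, or the intended functionality has not actually been implemented.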
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it get feedback more easily. If you do not understand some items, don't worry: just make the pull request and seek help from the maintainers.
Motivation
Please describe the motivation of this PR and the goal you want to achieve through this PR.
Add interfaces for element recognition and magic-html extraction.
Modification
Please briefly describe what modification is made in this PR.
BC-breaking (Optional)
Does the modification introduce changes that break the backward compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here and update the documentation.