36 changes: 18 additions & 18 deletions docs/llm_web_kit/model/html_simplify_classify.md
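@@ -14,23 +14,7 @@

 ## HTML simplify

-This section simplifies HTML to improve web page layout classification. The `general_simplify_html_str` function in `simplify.py`, under the `llm_web_kit/model/html_lib/html_lib` directory, implements the simplification of HTML strings.
-
-To download the model automatically, add the following configuration to the config file at its default path, `~/.llm-web-kit.jsonc`:
-
-```json
-{
-    "resources": {
-        "common":{
-            "cache_path": "~/.llm_web_kit_cache"
-        },
-        "html_cls-25m2": {
-            "download_path": "s3://web-parse-huawei/shared_resource/html_layout_cls/html_cls_25m2.zip",
-            "md5": "e15ea22a9aa65aa8c7c3a0e3c2e0c98a"
-        },
-    }
-}
-```
+This section simplifies HTML to improve web page layout classification. The `general_simplify_html_str` function in `llm_web_kit/model/html_lib/simplify.py` implements the simplification of HTML strings. The implementation has been migrated to the `html_alg_lib` library; the function now directly calls `process_to_cls_alg_html` from `html_alg_lib.simplify`.

 Usage:
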
@@ -47,7 +31,7 @@ print(simp_html)

 ## HTML classification

-Classifies the simplified HTML into three categories: article, forum, and other. In `llm_web_kit/model/html_layout_cls.py`, the `HTMLLayoutClassifier` class handles automatic checkpoint download and inference.
+Classifies the simplified HTML into four categories: Article, Forum_or_Article_with_commentsection, Content Listing, and Other. In `llm_web_kit/model/html_layout_cls.py`, the `HTMLLayoutClassifier` class handles automatic checkpoint download and inference.

 Usage:

@@ -59,3 +43,19 @@ html_str_input = ['<html>layout1</html>', '<html>layout2</html>']
 layout_type = model.predict(html_str_input)
 print(layout_type)
 ```
+
+To download the model automatically, add the following configuration to the config file at its default path, `~/.llm-web-kit.jsonc`:
+
+```json
+{
+    "resources": {
+        "common":{
+            "cache_path": "~/.llm_web_kit_cache"
+        },
+        "html_cls-25m2": {
+            "download_path": "s3://web-parse-huawei/shared_resource/html_layout_cls/html_cls_25m4.zip",
+            "md5": "31b4889b4d9c8a1a6da7a5c58270e611"
+        },
+    }
+}
+```
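
The `md5` field in the configuration above suggests the downloaded archive is checked against a known digest. A minimal sketch of such a check, using only the standard library, might look like the following. The helper names `file_md5` and `matches_config` are hypothetical and are not taken from `llm_web_kit`; how the library actually verifies downloads is not shown in this diff.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def matches_config(path: str, resource: dict) -> bool:
    """Hypothetical helper: compare a local file with the resource's `md5` field."""
    return file_md5(path) == resource['md5'].lower()
```

A caller could pass the path of the downloaded zip together with the matching entry under `resources` and re-download when the check returns `False`.
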
11 changes: 5 additions & 6 deletions llm_web_kit/model/html_classify/model.py
@@ -1,3 +1,5 @@
+import os
+
 import torch

 from llm_web_kit.model.resource_utils import import_transformer
@@ -6,13 +8,12 @@
 class Markuplm():
     def __init__(self, path, device):
         self.path = path
-        self.model_path = self.path + '/markuplm-base'
-        self.checkpoint_path = self.path + '/markuplm_202501222031_epoch_2.pt'
+        self.model_path = os.path.join(self.path, 'markuplm-base')

         self.device = device
-        self.num_labels = 3
+        self.num_labels = 4
         self.max_tokens = 512
-        self.label2id = {0:'article', 1:'forum', 2:'other'}
+        self.label2id = {0: 'Article', 1: 'Forum_or_Article_with_commentsection', 2: 'Content Listing', 3: 'Other'}

         self.model = self.load_model()
         self.tokenizer = self.load_tokenizer()
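
The hunk above replaces string concatenation with `os.path.join` when building the model path. A small sketch of why this can matter: concatenation produces a doubled separator when the root already ends in `/`, whereas joining does not. The root path below is made up for illustration, and `posixpath` is used in place of `os.path` only so the result is the same on every platform.

```python
import posixpath


def model_dir_concat(root: str) -> str:
    """Old approach from the diff: plain string concatenation."""
    return root + '/markuplm-base'


def model_dir_join(root: str) -> str:
    """New approach: join path components (POSIX rules, for a deterministic demo)."""
    return posixpath.join(root, 'markuplm-base')


# Hypothetical cache directory with a trailing slash.
root = '/cache/html_cls/'
print(model_dir_concat(root))  # doubled separator before 'markuplm-base'
print(model_dir_join(root))    # single separator
```
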
@@ -25,8 +26,6 @@ def load_tokenizer(self):
     def load_model(self):
         transformers = import_transformer()
         model = transformers.MarkupLMForSequenceClassification.from_pretrained(self.model_path, num_labels=self.num_labels)
-        # load checkpoint
-        model.load_state_dict(torch.load(self.checkpoint_path, map_location=self.device))
         model.to(self.device)
         model.eval()
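
With `num_labels` raised to 4, each prediction is a row of four logits that has to be mapped back to a label name. The sketch below shows one way to do that in plain Python, assuming the logits arrive as a list of floats. The mapping is copied from the diff (where it is named `label2id`, although it maps id to label); `softmax` and `decode` are hypothetical helpers, not the repository's inference code.

```python
import math

# Mapping copied from the diff; keys are class indices, values are label names.
ID2LABEL = {
    0: 'Article',
    1: 'Forum_or_Article_with_commentsection',
    2: 'Content Listing',
    3: 'Other',
}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Hypothetical helper: return (label, probability) for one row of logits."""
    if len(logits) != len(ID2LABEL):
        raise ValueError(f'expected {len(ID2LABEL)} logits, got {len(logits)}')
    probs = softmax(logits)
    # Index of the highest probability is the predicted class id.
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]
```

The length check would catch the case where a three-class checkpoint is paired with the four-class mapping.
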

4 changes: 2 additions & 2 deletions llm_web_kit/model/html_layout_cls.py
@@ -15,8 +15,8 @@ def __init__(self, model_path: str = None, device: str = 'cuda'):
         self.model = Markuplm(model_path, device)

     def auto_download(self) -> str:
-        """Default download the html_cls_25m2.zip model."""
-        resource_name = 'html_cls-25m2'
+        """Default download the html_cls_25m4.zip model."""
+        resource_name = 'html_cls-25m4'
         resource_config = load_config()['resources']
         print(resource_config)
         model_config: dict = resource_config[resource_name]
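
Note that this hunk looks up the resource name `html_cls-25m4`, while the configuration shown in the updated documentation still uses the key `html_cls-25m2`. If both are taken as written, the plain dictionary lookup would raise a `KeyError`. The sketch below illustrates that situation with a hypothetical helper, `get_resource`, which reports the available keys; it is not part of `llm_web_kit`, and the dictionary mirrors the documented configuration.

```python
def get_resource(resources: dict, name: str) -> dict:
    """Hypothetical helper: look up a resource entry with a readable error."""
    try:
        return resources[name]
    except KeyError:
        available = ', '.join(sorted(resources)) or '(none)'
        raise KeyError(f'resource {name!r} not configured; available: {available}') from None


# Mirrors the configuration shown in the documentation diff.
resources = {
    'common': {'cache_path': '~/.llm_web_kit_cache'},
    'html_cls-25m2': {
        'download_path': 's3://web-parse-huawei/shared_resource/html_layout_cls/html_cls_25m4.zip',
        'md5': '31b4889b4d9c8a1a6da7a5c58270e611',
    },
}

try:
    get_resource(resources, 'html_cls-25m4')
except KeyError as err:
    print(err)  # names the missing key and lists the configured ones
```
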
97 changes: 0 additions & 97 deletions llm_web_kit/model/html_lib/base_func.py

This file was deleted.

17 changes: 0 additions & 17 deletions llm_web_kit/model/html_lib/merge_tags.py

This file was deleted.

46 changes: 0 additions & 46 deletions llm_web_kit/model/html_lib/modify_tags.py

This file was deleted.

133 changes: 0 additions & 133 deletions llm_web_kit/model/html_lib/remove_tags.py

This file was deleted.
