Switching to html-parse for faster tokenisation (1/10th of time and peak memory allocated) #115
benjaminweb wants to merge 6 commits into fimad:master
fimad left a comment
> @fimad might require renaming of any function with Tag to Token -- would that be a breaking change?

Let's not change function names. If we were starting from scratch, maybe it would make sense to call things tokens. But "tag" is a reasonable enough term for HTML outside the context of TagSoup that I don't think it is worth a breaking change just to follow our underlying HTML parser's terminology.
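One way to picture the point about names: a public name can keep the word "tag" while the values underneath are the parser's tokens. The following is only a minimal sketch under stated assumptions. `Token`, `Tag` and `countOpenTags` are hypothetical stand-ins written for illustration; they are not scalpel's or html-parse's actual definitions.

```haskell
module Main where

-- Hypothetical stand-in for the token type of an underlying HTML parser.
data Token
  = TagOpen String
  | TagClose String
  | ContentText String
  deriving (Eq, Show)

-- The public name keeps the "Tag" terminology; it is only an alias.
type Tag = Token

-- Hypothetical public function whose name still says "Tag".
-- Callers keep using this name regardless of the parser's vocabulary.
countOpenTags :: [Tag] -> Int
countOpenTags tags = length [() | TagOpen _ <- tags]

main :: IO ()
main = print (countOpenTags [TagOpen "div", ContentText "hi", TagClose "div"])
```

Note that keeping names stable does not by itself make a change of the underlying type non-breaking for callers who pattern-match on it.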
 -- | A value of 'Scraper' @a@ defines a web scraper that is capable of consuming
--- a list of 'TagSoup.Tag's and optionally producing a value of type @a@.
-type Scraper str = ScraperT str Identity
+-- a list of 'HP.Tag's and optionally producing a value of type @a@.
'HP.Tag' should be 'HP.Token'? See a couple instances throughout.
Resolved with 69b57d4.
(The remaining matches are entities from HP, that is, Text.HTML.Parser itself.)
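For readers unfamiliar with html-parse, here is a small sketch of what `HP.Token` values look like in use. It assumes the `html-parse` and `text` packages are available and uses the qualified import name `HP` as in the review comments. The helper `openTagNames` is a hypothetical name introduced for illustration and is not part of this pull request.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Text (Text)
import qualified Text.HTML.Parser as HP

-- Collect the names of all opening tags, in document order.
-- HP.parseTokens turns strict Text into a list of HP.Token values;
-- HP.TagOpen carries the tag name and its attributes.
openTagNames :: Text -> [Text]
openTagNames html = [name | HP.TagOpen name _ <- HP.parseTokens html]

main :: IO ()
main = print (openTagNames "<div><a href=\"x\">hi</a></div>")
```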
Will fix those.
On Tue, 3 Jun 2025, at 06:29, Will Coster wrote:
> ***@***.*** requested changes on this pull request.
>
> > @fimad <https://github.com/fimad> might require renaming of any function with Tag to Token -- would that be a breaking change?
>
> Let's not change function names. If we were starting from scratch, maybe it would make sense to call things tokens. But "tag" is a reasonable enough term for HTML outside the context of TagSoup that I don't think it is worth a breaking change just to follow our underlying HTML parser's terminology.
>
> In scalpel-core/src/Text/HTML/Scalpel/Internal/Scrape.hs <#115 (comment)>:
>
> >      ask = MkScraper (lift . lift $ ask)
> >      local f (MkScraper op) = (fmap MkScraper . mapReaderT . local) f op
> >  -- | A value of 'Scraper' @a@ defines a web scraper that is capable of consuming
> > --- a list of 'TagSoup.Tag's and optionally producing a value of type @a@.
> > -type Scraper str = ScraperT str Identity
> > +-- a list of 'HP.Tag's and optionally producing a value of type @a@.
>
> 'HP.Tag' should be 'HP.Token'? See a couple instances throughout.
The CI is currently failing because […]. After adding it, it looks like there are still some build errors in the examples.
Added html-parse as stack extra-dep with c9edff2.