|
I deleted an earlier comment about misunderstanding a Sorbet error. All good for review now! |
yob
left a comment
Thanks for the well written submission ❤️
I've been flat out, but I'll try and take a look soon.
It's definitely the case that Page#text isn't very versatile, returning all the text in plain text with no markup.
There is Page#runs which returns a lower level view of the text of a page, including positioning data. I've flip flopped over time on how much I want to add to PDF::Reader directly, and how much I want to encourage folks to build their own code on top of Page#runs. I'll take this for a spin and see how it feels though, cheers!
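As a rough illustration of what building on top of Page#runs can look like, here is a minimal sketch that groups runs into lines using only their positions. The `Run` struct is a hypothetical stand-in for the run objects so that the example works without the gem; the assumption is that real runs expose `x`, `y`, and `text` in a similar way. `group_into_lines` and its `tolerance` parameter are made up for this example and are not part of PDF::Reader or of this PR.

```ruby
# Hypothetical stand-in for the objects returned by Page#runs.
# Assumption: real runs expose x, y and text in a comparable way.
Run = Struct.new(:x, :y, :text)

# Group runs into lines. A run joins the current line when its y value is
# within `tolerance` points of that line's first run; otherwise it starts
# a new line.
def group_into_lines(runs, tolerance: 2.0)
  lines = []
  # PDF coordinates start at the bottom-left, so descending y is top-first.
  runs.sort_by { |run| [-run.y, run.x] }.each do |run|
    current = lines.last
    if current && (current.first.y - run.y).abs <= tolerance
      current << run
    else
      lines << [run]
    end
  end
  # Within each line, order the runs left to right and join their text.
  lines.map { |line| line.sort_by(&:x).map(&:text).join(" ") }
end

runs = [
  Run.new(72.0, 700.0, "Hello"),
  Run.new(110.0, 700.5, "world"),
  Run.new(72.0, 686.0, "Second"),
  Run.new(115.0, 686.0, "line"),
]
puts group_into_lines(runs)
```

With the gem installed, the input array would presumably come from a page's `runs` call instead of hand-built structs. Splitting multi-column layouts would need an extra step on top of this, for example looking for large gaps in `x` before grouping into lines.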
|
@yob I hope you're well! Thanks! :) |
|
@judy I was interested in this but it looks like the branch has gone from the upstream? EDIT: it's ok, found github's special refs for this repo's PR: |
We're using PDF::Reader at Zipline for parsing content out of PDFs. (I also forked this project on our team repo here.) We have a number of cases where we want to pull out all of the text from multi-column PDF layouts, but PDF::Reader's visually-aligned output via page.text was still difficult for us to programmatically parse.
I saw that borb has a paragraph extraction feature (code here), and this is my attempt to implement something similar in Ruby.
I'm leaving this in Draft until I can finish the remaining todo items below. Any feedback is appreciated.
To-do: