<!DOCTYPE html>
<html lang="en-US">
<head>
    <title>cold</title>
    <link rel="stylesheet" href="https://caltechlibrary.github.io/css/site.css">
    <link rel="stylesheet" href="https://media.library.caltech.edu/cl-webcomponents/css/code-blocks.css">
    <script type="module" src="https://media.library.caltech.edu/cl-webcomponents/copyToClipboard.js"></script>
    <script type="module" src="https://media.library.caltech.edu/cl-webcomponents/footer-global.js"></script>
</head>
<body>
<header>
<a href="https://library.caltech.edu"><img src="https://media.library.caltech.edu/assets/caltechlibrary-logo.png" alt="Caltech Library logo"></a>
</header>
<nav>
<ul>
    <li><a href="/">Home</a></li>
    <li><a href="index.html">README</a></li>
    <li><a href="LICENSE">LICENSE</a></li>
    <li><a href="INSTALL.html">INSTALL</a></li>
    <li><a href="user_manual.html">User Manual</a></li>
    <li><a href="about.html">About</a></li>
    <li><a href="search.html">Search</a></li>
    <li><a href="https://github.com/caltechlibrary/cold">GitHub</a></li>
</ul>
</nav>
<section>
<h1 id="development-notes">Development Notes</h1>
<p>NOTE: this document describes my thinking during the development
process. It is not necessarily a description of how things actually
wound up being implemented.</p>
<h2 id="application-layout-and-structure">Application layout and
structure</h2>
<p>The primary task of the COLD UI is to provide a means of curating our
list of objects and vocabularies. Each list is held in a dataset
collection. Datasetd provides a JSON API for curating the
collections. TypeScript middleware, compiled via Deno, ties our JSON API
to our static content. The front end web server
(i.e. Apache 2) provides single sign-on and access control (e.g. via
Shibboleth).</p>
<p>I am relying on feeds.library.caltech.edu to provide the public
facing API. Data is transferred to feeds via scripts run on a schedule
or “on demand” in the reports module.</p>
<h2 id="the-go-dataset-collections">The Go dataset collections</h2>
<p>The <code>datasetd</code> program provides localhost static file and
JSON API access for managing multiple dataset collections. The middleware uses the
<code>https://caltechlibrary.github.io/ts_dataset/mod.ts</code> module
for working with the datasetd JSON API. The middleware also provides pass
through proxy services to the localhost instance of the datasetd API for
selected queries (e.g. people, groups and ror lookups).</p>
<p>The public API isn’t part of COLD. The reports system can replicate
COLD public data to feeds if that is appropriate.</p>
<h2 id="data-enhancement">Data enhancement</h2>
<p>The content curated in COLD can be enhanced from external sources.
This is done via scheduled tasks. Initially these tasks are going to be
run from cron. An example is importing biographical information
published in the Caltech Directory. For a subset of CaltechPEOPLE we
know their IMSS userid. Using that we can query the public directory
website and retrieve biographical details such as their faculty role
and title, division, and educational background. We only harvest those
records that both have a directory user id and are marked for inclusion
in feeds.</p>
<p>External data sources:</p>
<ul>
<li>Caltech Directory</li>
<li>orcid.org</li>
<li>ror.org</li>
</ul>
<h2 id="reports">Reports</h2>
<p>Reports are often needed for managing library data and systems.
COLD’s focus is on managing lists and data but it can also serve as a
reports request hub.</p>
<p>Many of the reports require aggregation across data sources and often
these will take too long or require too many resources to be run
directly on our application server. That suggests what should run on the
application server is a simple reports request management interface.
That in turn suggests the following requirements.</p>
<ul>
<li>A way to make a request for a report</li>
<li>A means of indicating a report status (e.g. requested or scheduled,
processing, available or problem indicator)</li>
<li>A means of notifying the requester(s) when the report is available</li>
<li>A means of purging old reports from the reports status list</li>
</ul>
<p>These features can be implemented as a simple queue. The metadata
needed to manage report requests and their life cycle is as
follows.</p>
<ul>
<li>name of report</li>
<li>any additional options needed by the report program</li>
<li>the email address(es) to contact when the report is ready</li>
<li>current status of the report (e.g. requested, processing, available,
problem)</li>
<li>a link to where the report can be “picked up”</li>
<li>the report’s content type, (e.g. application/json, application/yaml,
application/x-sqlite3, text/csv, text/plain, text/x-markdown)</li>
<li>the date the report was requested</li>
<li>the date the request was last updated (when the status last changed)</li>
</ul>
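<p>The metadata listed above could be modeled as a single record type. The
following is a minimal sketch only: the field names, the
<code>newReportRequest</code> helper and the use of an empty string for a
link that is not yet known are all assumptions made for illustration, not a
description of the schema COLD actually uses.</p>

```typescript
// Hypothetical record shape for a report request; field names are assumptions.
type ReportStatus = "requested" | "processing" | "available" | "problem";

interface ReportRequest {
  id: string; // identifier of this request
  report_name: string; // name of the report
  options: string[]; // additional options needed by the report program
  emails: string[]; // address(es) to contact when the report is ready
  status: ReportStatus; // current status of the report
  link: string; // where the report can be picked up ("" until known)
  content_type: string; // e.g. "text/csv"
  requested: string; // timestamp of the request
  updated: string; // timestamp of the last status change
}

// Build a new request in the "requested" state. The caller supplies the
// current time so the function stays deterministic.
function newReportRequest(
  id: string,
  reportName: string,
  emails: string[],
  contentType: string,
  now: Date,
): ReportRequest {
  const stamp = now.toJSON();
  return {
    id: id,
    report_name: reportName,
    options: [],
    emails: emails,
    status: "requested",
    link: "",
    content_type: contentType,
    requested: stamp,
    updated: stamp,
  };
}
```

<p>Keeping <code>requested</code> and <code>updated</code> as separate
fields lets a purge script work from the age of the last status change
rather than the age of the request.</p>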
<p>The user interface would consist of a simple web form to request one of a
predefined set of reports and a list of reports available, processing,
requested or scheduled.</p>
<p>The reports themselves can be implemented as command line programs in
a language of your choice. The report runner will be responsible for
checking the queue and updating the queue. The report would be
responsible for notification (e.g. if there is an email list then send
out an email with the report link). In principle, since our GitHub
actions are accessible via the GitHub APIs, a report could be implemented
as a GitHub action.</p>
<p>The advantage of this approach is that it avoids the problems of slow
running or resource intensive reports running directly on the
application server. COLD just manages the report queue.</p>
<p>An advantage of narrowing COLD’s reporting role to managing a report queue
is that it separates the concerns (e.g. resource management, security,
report access).</p>
<p>For the report management interface to be useful you do need a report
runner. The report runner would be responsible for checking the report
queue, updating the status of requests in the queue and invoking the
requested report.</p>
<p>NOTE: the runner doesn’t need to run on your apps server. It just
needs access to the queue.</p>
<p>A report would need to implement a few things.</p>
<ul>
<li>accept the metadata held in the report queue</li>
<li>store the report result</li>
<li>return the result needed by the runner to update the report queue
(success or failure, and the link to the result)</li>
</ul>
<p>QUESTION: Should the report be responsible for notification or the
runner?</p>
<p>The individual reports can be implemented as a script (e.g. Bash), a
program (e.g. something in Python) or even externally (e.g. GitHub
action). The interface for the report system takes advantage of standard
input and standard output. This simplifies writing the report programs.
An example would be to process a JSON expression from standard input and
return a JSON expression via standard output to the runner along with an
exit code (i.e. zero means no problem, non-zero means there was a problem). The
report script or program would use a link to indicate where the report
can be picked up and be responsible for placing content in a storage
location accessible via the link.</p>
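<p>The standard input and standard output contract described above could
look like the following sketch. It is kept as a pure function so the
protocol is easy to see. The <code>handleRequest</code> name, the JSON field
names and the <code>https://example.edu/reports/</code> link prefix are
hypothetical, and actually generating and storing the report content is left out.</p>

```typescript
// Hypothetical report body: takes the JSON text read from standard input and
// returns the JSON text to write to standard output plus the exit code.
function handleRequest(input: string): { output: string; exitCode: number } {
  try {
    const request = JSON.parse(input);
    if (typeof request.report_name !== "string" || request.report_name === "") {
      throw new Error("missing report_name");
    }
    // A real report would generate and store its content here, then
    // report the link where that content can be picked up.
    const link = "https://example.edu/reports/" +
      encodeURIComponent(request.report_name);
    return {
      output: JSON.stringify({ status: "available", link: link }),
      exitCode: 0, // zero: no problem
    };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return {
      output: JSON.stringify({ status: "problem", error: message }),
      exitCode: 1, // non-zero: there was a problem
    };
  }
}
```

<p>A small wrapper would read standard input, call the function, write
<code>output</code> to standard output and exit with
<code>exitCode</code>. That wrapper is the only part that depends on the
language or runtime the report is written in.</p>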
<p>Report status:</p>
<dl>
<dt>requested</dt>
<dd>
An entry that a request has been made and is waiting to be serviced
</dd>
<dt>processing</dt>
<dd>
The report request is being serviced but is not yet available
</dd>
<dt>available</dt>
<dd>
A report result is available and the link indicates where you can pick
it up
</dd>
<dt>problem</dt>
<dd>
The report request could not be completed and the link indicates where
the details can be found about what went wrong.
</dd>
</dl>
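<p>The four statuses form a small state machine. The sketch below encodes
one plausible reading of it, where “available” and “problem” are final
states. The transition table and the <code>canTransition</code> helper are
assumptions for illustration, since the notes do not say whether a failed
request may be retried.</p>

```typescript
type ReportStatus = "requested" | "processing" | "available" | "problem";

// Assumed transition table: a request is picked up once and then ends
// either as "available" or as "problem".
const allowedTransitions: { [From in ReportStatus]: ReportStatus[] } = {
  requested: ["processing"],
  processing: ["available", "problem"],
  available: [],
  problem: [],
};

// Report whether a status change follows the assumed life cycle.
function canTransition(from: ReportStatus, to: ReportStatus): boolean {
  return allowedTransitions[from].some((next) => next === to);
}
```

<p>Checking transitions in one place would keep a runner from, for
example, marking a request “available” without it ever having been
“processing”.</p>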
<p>Report identifiers:</p>
<p>There are two basic report types. Those which are run on a schedule
(e.g. recent grant report from thesis or creators report from authors)
and those which are requested then run. For the scheduled reports the
identifier would be the report’s unique name. For requested reports
another mechanism may be required. A good candidate for the identifier
would be UUID v5. Since the report script or program is responsible for
storing the results it would also be responsible for versioning the
stored results if needed. By separating the ID from the report instance,
the report decides the name of the stored result while the queue is
still able to map a request to the link for that instance.</p>
<p>Reports can be of different content types. Most reports we generate
manually today are either CSV, tab delimited or Excel files. By allowing
reports to have different content types we also allow for the report to
be provided in a relevant type. E.g. a report could be generated as a
PDF or even an SQLite3 database.</p>
<h3 id="exploring-the-report-runner">Exploring the report runner</h3>
<p>COLD provides a collection called “reports.ds”. Assuming that
collection is readable on your data processing machine a runner needs to
be able to do several actions.</p>
<ol type="1">
<li>Retrieve the next report to initiate</li>
<li>Update the report status (e.g. requested -&gt; processing)</li>
<li>Execute the shell command that implements the
report</li>
<li>Update the report status (e.g. processing -&gt; available or
processing -&gt; problem)</li>
</ol>
<p>The report runner repeats these four steps until there are no more
requests available. At that time it can sleep for a designated period of
time then start the loop again when requests are available.</p>
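<p>One pass of that loop could be sketched as follows. The queue is an
in-memory array and the command execution is passed in as a function, so
the sketch shows only the status handling. The type names,
<code>processQueue</code> and the choice to store an error message in
<code>link</code> on failure are assumptions, not the actual runner.</p>

```typescript
type ReportStatus = "requested" | "processing" | "available" | "problem";

interface QueueItem {
  id: string;
  report_name: string;
  status: ReportStatus;
  link: string; // link to the report, or error detail on failure
}

interface RunResult {
  ok: boolean;
  link: string;
}

// Service every "requested" item once and return how many were handled.
// `run` stands in for executing the command that implements the report.
function processQueue(
  queue: QueueItem[],
  run: (item: QueueItem) => RunResult,
): number {
  let handled = 0;
  for (const item of queue) {
    if (item.status !== "requested") continue; // 1. next report to initiate
    item.status = "processing"; // 2. requested -> processing
    let result: RunResult;
    try {
      result = run(item); // 3. execute the report
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      result = { ok: false, link: message };
    }
    item.status = result.ok ? "available" : "problem"; // 4. final status
    item.link = result.link;
    handled += 1;
  }
  return handled;
}
```

<p>A long running runner would call this in a loop and sleep for the
designated period whenever it returns zero.</p>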
<p>To control what is executed it is desirable to have a specific
configurable task runner available. This will prevent arbitrary commands
from running.</p>
<p>Off the shelf task runners include</p>
<ul>
<li>Make, a build system dating back to the early days of Unix</li>
<li>just, a new simpler command runner that is cross platform and
language agnostic</li>
</ul>
<p>The report runner would take the report request record, set the status to
processing and then pass the report name and options to the task runner.
When the task completes (either successfully or with a failure) the result
would be captured and stored in a designated storage system
(e.g. G-Drive) and the report request record would be updated
with the final status and a link to the report or error report.</p>
<h2 id="date-handling">Date Handling</h2>
<p>The difference between date formats, languages and representation can
be considerable. The TypeScript/JavaScript Date object does not render
dates as “YYYY-MM-DD” by default. The <code>toDateString()</code>
instance method returns a form like “Fri Jan 05 2024”, and
<code>toLocaleDateString()</code> is locale dependent (e.g. “1/5/2024”
for en-US). Our databases and most of our code base expect dates to
be formatted as “YYYY-MM-DD” so I am using two TypeScript/JavaScript
methods to achieve that. First you use <code>.toJSON()</code> to render
the date in JSON format (an ISO 8601 timestamp in UTC) then you trim the
result to its first 10 characters using
<code>.substring(0,10)</code>.</p>
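<p>As a small sketch of that approach (the <code>toISODate</code> name and
the explicit check for invalid dates are additions for illustration):</p>

```typescript
// Format a Date as "YYYY-MM-DD" by trimming its JSON (ISO 8601, UTC) form.
function toISODate(d: Date): string {
  // toJSON() yields null for an invalid Date, so reject those up front.
  if (isNaN(d.getTime())) {
    throw new RangeError("invalid date");
  }
  return d.toJSON().substring(0, 10);
}
```

<p>Because <code>.toJSON()</code> reports the time in UTC, a local time
close to midnight can land on the neighboring calendar day.</p>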
<h2 id="booleans-and-webforms">Booleans and webforms</h2>
<p>When a web form is submitted, a checked checkbox sends the value “on”
(an unchecked one sends nothing). We want these to be actual JSON booleans so the middleware has
a function that checks for “true” or “on” before setting the value to
the boolean <code>true</code>. This helps normalize changed and
saved records.</p>
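<p>A minimal sketch of such a normalizing function follows. The
<code>formValueToBoolean</code> name is hypothetical, and trimming and
lower-casing the input is an assumption beyond the “true” or “on” check
described above.</p>

```typescript
// Normalize a submitted form value to a real boolean.
// "true" and "on" (any case, surrounding space ignored) become true;
// everything else, including a missing value, becomes false.
function formValueToBoolean(value: unknown): boolean {
  if (typeof value === "boolean") {
    return value; // already normalized, e.g. from a saved record
  }
  if (typeof value === "string") {
    const v = value.trim().toLowerCase();
    return v === "true" || v === "on";
  }
  return false;
}
```

<p>Treating a missing value as <code>false</code> matters because
browsers omit unchecked checkboxes from the submitted form entirely.</p>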
<h2 id="reports-implementation">Reports Implementation</h2>
<p>Reports are implemented as scripts or programs that are defined in a
YAML file (e.g. reports.yaml). Reports can be slow to run so COLD
implements a naive queue system. The reports.ds collection holds report
requests. Those marked as “requested” are picked up by a runner that then
attempts to execute the report. Since reports run as
executables on the system outside the runner, reports MUST be defined in
the YAML configuration file. There are zero user controlled options.
This removes the attack surface of using COLD’s report system to
compromise the application server. Additionally, the scripts/programs
implementing the reports are retrieved and run by a data processing service on a
different machine. It is on this machine that the reports are defined.
This machine is not directly accessible from the web and should be
configured to restrict non-campus network access as an additional step
to minimize the attack surface.</p>
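<p>The rule that only configured reports may run can be sketched as a
lookup against the parsed configuration. The shape of
<code>reportsConfig</code> below, including the <code>cmd</code> and
<code>content_type</code> keys and the example entry, is hypothetical; it is
not the actual layout of reports.yaml.</p>

```typescript
interface ReportDefinition {
  cmd: string; // command that implements the report
  content_type: string; // content type of the stored result
}

// Hypothetical parsed form of the YAML configuration, keyed by report name.
const reportsConfig: { [name: string]: ReportDefinition } = {
  example_report: { cmd: "./example_report.bash", content_type: "text/csv" },
};

// Return the configured definition, or undefined when the requested name
// is not defined. Nothing from the request itself is ever executed.
function resolveReport(
  config: { [name: string]: ReportDefinition },
  name: string,
): ReportDefinition | undefined {
  // Own-property check so names like "toString" cannot match by accident.
  if (Object.prototype.hasOwnProperty.call(config, name)) {
    return config[name];
  }
  return undefined;
}
```

<p>A runner following this pattern would mark a request as “problem” when
the lookup returns <code>undefined</code> instead of trying to run
anything.</p>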
<p>The report scripts/programs should return an error message or a link
where the report can be picked up. This is used by the runner to
resolve the final report request status.</p>
<p>The individual reports can be written in your language of choice
(e.g. Python, Bash, TypeScript). The primary requirements are that reports
store their own results and write a link or error
message to standard out when completed. Since they are just programs
that write results to standard out they are able to interact with any
necessary systems they are allowed to talk to (e.g. databases, external
services, etc).</p>
<p>A garbage collection script should clear out old requests in a
timely fashion (e.g. once a week or once a month).</p>
<h3 id="requests-and-runner">Requests and Runner</h3>
<p>A request queue is implemented to track report requests made via the COLD UI.
A separate process reads the queue, renders the reports and then updates
the queue upon completion or error. If email addresses are provided then
they will be contacted with the result of the report request. The message
should include the report’s request id, name, status and link or error
message.</p>
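<p>A notification body carrying those four pieces of information could be
assembled as in the sketch below. The <code>ReportNotice</code> type, the
<code>formatNotice</code> name and the line layout are assumptions, and
sending the email itself is out of scope here.</p>

```typescript
interface ReportNotice {
  id: string; // report request id
  report_name: string;
  status: "available" | "problem";
  detail: string; // link when available, error message when problem
}

// Render the plain text body of a notification message.
function formatNotice(notice: ReportNotice): string {
  const label = notice.status === "available" ? "Link" : "Error";
  return [
    "Report request: " + notice.id,
    "Report name: " + notice.report_name,
    "Status: " + notice.status,
    label + ": " + notice.detail,
  ].join("\n");
}
```

<p>Keeping the formatting separate from delivery leaves open the earlier
question of whether the report or the runner sends the message.</p>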
<h2 id="web-ui-and-javascript-behaviors">Web UI and JavaScript
behaviors</h2>
<p>Some of the objects managed by COLD are complex in the sense that
each has nested structure, e.g. a list of groups a person is
associated with. These need to be validated both server and browser
side. Since COLD is being developed primarily as a Deno+TypeScript
application, the code that validates can be used both server and browser
side too. This is accomplished by cross compiling the TypeScript to
JavaScript using Deno’s emit package. There is a task called “htdocs”
defined in the “deno.json” file. This in turn calls “build.ts” which
uses the “emit” package to generate the JavaScript used by the browser.
The generated JavaScript is written to “htdocs/modules”.</p>
</section>
<footer-global></footer-global>
</body>
</html>