Northern lights journalism
April 16, 2025 at 10:39 AM by Dr. Drang
I read a story on Apple News this morning, Northern lights may be viewable in some US states this week: Where and when to see it, and I want to complain about it. Unlike my previous two posts, this will be short.
The article is from USA Today, and you can read it on their site in your browser, which you may prefer, as that gives you the opportunity to skip the ads via Reader Mode. But my complaint isn’t about ads in Apple News, it’s about the article itself.
You might think from the article’s title that its main topic is where you’ll be able to see tonight’s northern lights. And it does get to that, but only after 400 words in a 600-word article, a frustrating inversion of the inverted pyramid. The 400 words of throat clearing explains what the northern lights are, what causes them, that they’re also called the aurora borealis, that there’s a similar phenomenon in the southern hemisphere called the aurora australis, and how NASA uses satellites to predict aurorae. These are all useful bits of information to people who don’t already know them, but they don’t belong before the list of states where you have a decent chance of seeing the northern lights.
(In some ways, of course, this is a complaint about ads. The inverted inverted pyramid structure is there to force you to scroll through the ads that Apple and USA Today want you to see.)
BBEdit says I’m now about 250 words into this post, so let me give you the list of states:
- North Dakota
- Montana
- Minnesota
- Washington
- Michigan
- Wisconsin
- Maine
- Oregon
- Idaho
- Wyoming
- Iowa
- New York
- Nebraska
- Illinois
- Vermont
- New Hampshire
- Pennsylvania
That’s the same order as in the article, and if you can figure out the rationale behind it, you’re doing better than I am. I thought at first that it put the northernmost states first, but that doesn’t explain Vermont and New Hampshire coming after Illinois. And you may be wondering how someone could lead off with North Dakota and include Nebraska but somehow omit South Dakota. 🤷
By the way, the article was written yesterday afternoon. As of 10:30 this morning (CDT), the NOAA Space Weather Prediction Center says the likelihood of seeing northern lights in the southernmost of these states isn’t very good.
If you’re like me, the red-on-black text at the bottom of the image is hard to read. It says
View Line Indicates The Southern Extent of Where Aurora Might Be Seen on the Northern Horizon
So the southern extent is now just north of the Illinois/Wisconsin border. I’ll probably go out tonight to take a quick look anyway.
HTML man pages
April 15, 2025 at 7:48 AM by Dr. Drang
Near the end of my last post, there was a link to the xargs man page. If you followed that link, you saw something like this:
It’s a web page, hosted here at leancrew.com, with what almost looks like a plain text version of the xargs man page you’d see if you ran
man xargs
at the Terminal. Some things that would typically be in bold aren’t, but more importantly, the reference to the find command down near the bottom of the window looks like a link. And it is. If you were to click on it, you’d be taken to a similar web page, but for find.
Disgusted with Apple’s decision, made many years ago, to remove all the man pages from its website, I decided to bite the bullet. I built a set of interlinked web pages, over 4,000 of them, that cover all of the man pages I’m ever likely to refer to here on ANIAT. They’re hosted on leancrew.com with URLs like
https://leancrew.com/all-this/man/man1/xargs.html
This structure mimics the usual directory structure in which man pages are stored on Unix/Linux computers. This post will explain how I made them.
To start off, here’s how I didn’t make them. Man pages are built using the old Unix nroff/troff text formatting system. Historically, the nroff command formatted text for terminal display, while troff formatted it for typesetters. Nowadays, people use the GNU system known as groff, which can output formatted text in a variety of forms, including HTML. This would seem like an ideal way to make HTML versions of man pages, but it isn’t.
The problem is that groff-generated HTML output doesn’t make links. If you run
groff -mandoc -Thtml /usr/share/man/man1/xargs.1 > xargs.html
you’ll get a decent-looking HTML page, but it won’t make a link to the find command (or any of the other referenced commands). I’ve seen some projects that try to do a better job—Keith Smiley’s xcode-man-pages, for instance—but I’ve never found one that does what I want. Hence the set of scripts described below.
I’ve mentioned in other posts that you can get a plain text version of a man page through this pipeline
man xargs | col -bx
The -b option to the col command turns off the generation of backspaces, which is how bold text is produced. The -x option converts tabs to spaces in the output. The combination gives you plain text in which everything lines up vertically if you use a monospaced font. This will be our starting point.
The key script for turning the plain text into HTML is htmlman. The goals of htmlman are as follows:
- Replace any reference to another command—which is the command name followed by the section number in parentheses—with a link to the HTML page for that command.
- Replace any less-than, greater-than, or ampersand symbol with its corresponding HTML entity.
- Replace any URL in the man page with a link to that URL.
- Collect all the references to other commands described in Item 1.
- Add code before and after the man page to make it valid HTML and set the format.
Item 4 needs a little more explanation. I don’t intend to generate HTML pages for all the man pages on my Mac. That would be, I think, over 15,000 pages, most of which I would never link to here. So I’ve decided to limit myself to these man pages:
- Those in Section 1, which covers general commands.
- Those in Section 8, which covers system administration commands.
- Those referenced in the man pages of Section 1 and Section 8.
I think of the pages in the first two groups as the “top level.” They’re the commands that are most likely to appear in my scripts, so they’re the ones I’m most likely to link to. The third group are “one level deep” from Sections 1 and 8. Readers who follow a top level link may see a reference in that man page to another command and want to follow up. That’s where the chain stops, though. Links to referenced man pages that are in Sections 1 and 8 can, of course, be followed, but other links are not guaranteed to have an HTML man page.
So when htmlman is making HTML for Section 1 and Section 8 commands, it’s also making a list of commands referenced by those top level pages. I later run htmlman on all of those second-level pages but ignore the references therein. You’ll see how that works in a bit.
Here’s the Python code for htmlman:
python:
1: #!/usr/bin/env python3
2:
3: import re
4: import sys
5:
6: # Regular expressions for references and HTML entities.
7: manref = re.compile(r'([0-9a-zA-Z_.:+-]+?)\(([1-9][a-zA-Z]*?)\)')
8: entity = re.compile(r'<|>|&')
9:
10: # Regular expression for bare URLs taken from
11: # https://stackoverflow.com/questions/3809401/what-is-a-good-regular-expression-to-match-a-url
12: # I removed the & from the last character class to avoid interference
13: # with HTML entities. I don't think there will be any URLs with ampersands
14: # in the man pages.
15: url = re.compile(r'''https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?//=]*)''')
16:
17: # Functions for substitution.
18: def manrepl(m):
19: 'Replace man page references with links'
20: return f'<a href="../man{m.group(2)}/{m.group(1)}.html">{m.group(1)}({m.group(2)})</a>'
21:
22: def entityrepl(m):
23: 'Replace HTML special characters with entities'
24:  e = {'<': '&lt;', '>': '&gt;', '&': '&amp;'}
25: return e[m.group(0)]
26:
27: def urlrepl(m):
28: 'Replace http and https URLs with links'
29: return f'<a href="{m.group(0)}">{m.group(0)}</a>'
30:
31: # Beginning and ending of HTML file.
32: header = r'''<html>
33: <head>
34: <link rel="stylesheet" href="../man.css" />
35: <title>{0} man page</title>
36: </head>
37: <body>
38: <pre>'''
39:
40: footer = r'''</pre>
41: </body>
42: </html>'''
43:
44: # Initialize.
45: html = []
46: refs = set()
47:
48: # The man page section and name are the first and second arguments.
49: section = sys.argv[1]
50: name = sys.argv[2]
51: title = f'{name}({section})'
52:
53: # Read the plain text man page contents from standard input.
54: # into a list of lines
55: text = sys.stdin.readlines()
56:
57: # Convert references to other man pages to links,
58: # change <, >, and & into HTML entities, and
59: # turn bare URLs into links.
60: # Build list of man pages referred to.
61: # Leave the first and last lines as-is.
62: html.append(text[0])
63: for line in text[1:-1]:
64: for m in manref.finditer(line):
65: refs.add(f'{m.group(2)[0]} {m.group(1)}')
66: line = entity.sub(entityrepl, line)
67: line = url.sub(urlrepl, line)
68: html.append(manref.sub(manrepl, line))
69: html.append(text[-1])
70:
71: # Print the HTML.
72: print(header.format(title))
73: print(''.join(html))
74: print(footer)
75:
76: # Write the references to stderr.
77: if len(refs) > 0:
78: sys.stderr.write('\n'.join(refs) + '\n')
htmlman is intended to be called like this:1
man 1 xargs | col -bx | ./htmlman 1 xargs
It takes two arguments, the section number and command name, and gets the plain text of the man page fed to it through standard input.
I know I usually give explanations of my code, but I really think htmlman is pretty self-explanatory. Only a few things need justifying.
First, the goal is to put the HTML man pages into the appropriate subfolders of this directory structure:
To make it easy if I ever decide to move this directory structure, the links to other man pages are relative. Their href attributes go up one directory and then back down. For example, the links on the xargs page are
<a href="../man1/find.html">find(1)</a>
<a href="../man1/echo.html">echo(1)</a>
<a href="../man5/compat.html">compat(5)</a>
<a href="../man3/execvp.html">execvp(3)</a>
You see this on Line 20 in the manrepl function.
Second, you may be wondering about the comment on Line 61 about leaving the first and last lines as-is. I leave them alone because the first line of a man page usually has text that would match the manref regex:
XARGS(1) General Commands Manual XARGS(1)
It would be silly to turn XARGS(1) into a link on the xargs man page itself. Sometimes the last line of a man page has a similar line (not for xargs, though), and the same logic applies there.
Finally, htmlman prints the HTML to standard output (Lines 71–74) and the list of references to other man pages to standard error (Lines 76–78). This list is a series of lines with the section followed by the command name. For xargs the list of references sent to standard error is
1 find
3 execvp
5 compat
1 echo
You may feel this is an abuse of stderr, and I wouldn’t disagree. But I decided to use stderr for something that’s not an error because it makes the shell scripts that use htmlman so much easier to write. Sue me.
With htmlman in hand, it’s time to gather the names of all the commands in Sections 1 and 8. This is done with another Python script, list-mans. It’s called this way:
./list-mans 1 > man1-list.txt
./list-mans 8 > man8-list.txt
The first of these looks through all the Section 1 man page directories and produces output that looks like this:
[…]
1 cal
1 calendar
1 cancel
1 cap_mkdb
1 captoinfo
1 case
1 cat
1 cc
1 cc_fips_test
1 cd
[…]
This is the same format that htmlman sends to stderr, which is not a coincidence. The second command above does the same thing, but with the Section 8 directories. Here’s the code for list-mans:
python:
1: #!/usr/bin/env python3
2:
3: import os
4: import os.path
5: import subprocess
6: import sys
7:
8: def all_mans(d, s):
9: 'Return a set of all available "section name" arguments for the man command'
10:
11: comp_extensions = ['.Z', '.gz', '.zip', '.bz2']
12: man_args = set()
13: for fname in os.listdir(d):
14: # Skip files that aren't man pages
15: if fname == '.DS_Store' or fname == 'README.md':
16: continue
17: # Strip any compression extensions
18: bname, ext = os.path.splitext(fname)
19: if ext in comp_extensions:
20: fname = bname
21: # Strip the remaining extension
22: page_name, ext = os.path.splitext(fname)
23: # Add "section command" to the set if the extension matches
24: if ext[1] == s:
25: man_args.add(f"{s} {page_name.replace(' ', '\\ ')}")
26:
27: return man_args
28:
29: # Get all the man directories using the `manpath` command
30: manpath = subprocess.run('manpath', capture_output=True, text=True).stdout.strip()
31:
32: # Add the subdirectory given on the command line
33: section = sys.argv[1]
34: man_dirs = [ p + f'/man{section}' for p in manpath.split(':') ]
35:
36:
37: args = set()
38: for d in man_dirs:
39: if os.path.isdir(d):
40: args |= all_mans(d, section)
41:
42: print('\n'.join(sorted(args)))
It uses the subprocess library to call manpath, which returns a colon-separated string of paths to the top-level man directories. There are a bunch of them. On my Mac, this is what manpath produces:
/opt/homebrew/share/man:
/opt/homebrew/opt/curl/share/man:
/usr/local/share/man:
/usr/local/man:
/System/Cryptexes/App/usr/share/man:
/usr/share/man:
/opt/X11/share/man:
/Library/TeX/texbin/man:
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/share/man:
/Applications/Xcode.app/Contents/Developer/usr/share/man:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/share/man
where I’ve put the directories on separate (sometimes very long) lines. There’s a set of man1, man2, etc. subdirectories under each of these.
list-mans uses the os.path library to pull out all the files in the manX subdirectories, where X is the argument given to it. The only thing that isn’t straightforward about this is that some man page files are compressed, and the all_mans function (Lines 8–27) has to account for common compression extensions when pulling out the base of each filename. Also, there are the ubiquitous .DS_Store files and an occasional README that have to be skipped over.
With man1-list.txt and man8-list.txt in hand, I run
cat man1-list.txt man8-list.txt | ./build-man-refs
to build all the Section 1 and Section 8 man pages and create a file, allref-list.txt, with a list of all the references to other man pages. Here’s the code for build-man-refs:
bash:
1: #!/usr/bin/env bash
2:
3: while read sec cmd; do
4: man $sec "$cmd" 2> /dev/null |\
5: col -bx |\
6: ./htmlman $sec "$cmd" > "man/man$sec/$cmd.html" 2>>allref-list.txt
7: done
For each line of input, it reads the section number and command name and then executes the man command with those as arguments. Any error messages from this go to /dev/null. The man page is then piped through col -bx as discussed above, and then through htmlman. Standard output (the HTML) is saved to the appropriate man subdirectory, and standard error (the list of referenced man pages) is added to the end of allref-list.txt.
At this point, all the Section 1 and Section 8 man pages are built and we have a file with a list of all the referenced man pages. This list will have a lot of repetition (e.g., many man pages refer to find) and it will have many entries for man pages we’ve already built. These three commands get rid of the duplicates and the entries for Section 1 and Section 8 pages:
sort -u allref-list.txt | pbcopy
pbpaste > allref-list.txt
sed -i.bak '/^[18] /d' allref-list.txt
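(A side note on those first two commands: the clipboard round-trip is there because redirecting sort’s output straight back into allref-list.txt would truncate the file before sort reads it. sort’s -o option is safe to use with its own input file, so this pair of commands should accomplish the same thing without involving the clipboard:)
sort -u -o allref-list.txt allref-list.txt
sed -i.bak '/^[18] /d' allref-list.txt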
Now we have a list of sections and commands that we want to make HTML pages from. Almost. It turns out that many of these referenced man pages don’t exist. Presumably Apple—who didn’t write most of the man pages itself, but took them from the FreeBSD and GNU projects—decided not to include lots of man pages in macOS but didn’t remove references to them from the man pages they did include. That would be a lot of work, and you can’t expect a $3 trillion company to put in that much effort.
So, I run a filter, goodrefs, to get rid of the references to non-existent pages:
./goodrefs < allref-list.txt > goodref-list.txt
Here’s the code for goodrefs:
bash:
1: #!/usr/bin/env bash
2:
3: while read sec cmd; do
4: man $sec $cmd &> /dev/null && echo "$sec $cmd"
5: done
It uses the short-circuiting feature of &&. The echo part of the command runs only if the man part was successful.
Now it’s time to make HTML man pages for all the commands listed in goodref-list.txt. Recall, though, that this time we’re not going to collect the references to other man pages. So we run
./build-man-norefs < goodref-list.txt
where build-man-norefs is basically the same as build-man-refs, but the redirection of stderr goes to /dev/null instead of a file:
bash:
1: #!/usr/bin/env bash
2:
3: while read section command; do
4: man $section "$command" 2> /dev/null |\
5: col -bx |\
6: ./htmlman $section "$command" > "man/man$section/$command.html" 2> /dev/null
7: done
And with that, all the HTML man pages I want have been made and are in a nice directory structure that I can upload to the leancrew server. According to my notes, the whole process of building these pages takes about six minutes on my M4 MacBook Pro. That’s a long time, but I don’t expect to rebuild the pages more than once a year, whenever there’s a new major release of macOS.
- All of the scripts in this post are kept in a directory called man-pages, and all the commands described here are executed within that directory. Because man-pages is not in my $PATH, I use the ./ prefix when calling the scripts. ↩
Adding books to my library database
April 12, 2025 at 8:54 AM by Dr. Drang
To follow up on the two posts I wrote a couple of weeks ago, I thought I’d explain how I use the Library of Congress catalog to add entries to the SQLite database of my technical books.
As a reminder, the database consists of three tables, book, author, and book_author. The fields for each table are listed below:
| book      | author | book_author |
|-----------|--------|-------------|
| id        | id     | id          |
| title     | name   | book_id     |
| subtitle  |        | author_id   |
| volume    |        |             |
| edition   |        |             |
| publisher |        |             |
| published |        |             |
| lccn      |        |             |
| loc       |        |             |
| added     |        |             |
In each table, the id field is a sequential integer that’s automatically generated when a new record is added. The book_id and author_id fields in the book_author table are references to the id values in the other two tables. They tie book and author together and handle the many-to-many relationship between the two.
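I haven’t shown the schema itself, but here’s a minimal sketch of table definitions consistent with the fields above. The column types and constraints are assumptions for illustration, not necessarily what’s in library.db:

python:
import sqlite3

# A guess at the table definitions; types and constraints are illustrative only.
schema = '''
create table if not exists book (
    id        integer primary key,   -- generated automatically on insert
    title     text,
    subtitle  text,
    volume    text,
    edition   text,
    publisher text,
    published text,
    lccn      text,
    loc       text,
    added     text                   -- yyyy-mm-dd string; SQLite has no date type
);
create table if not exists author (
    id   integer primary key,
    name text                        -- "last, first [middle]"
);
create table if not exists book_author (
    id        integer primary key,
    book_id   integer references book(id),
    author_id integer references author(id)
);
'''

con = sqlite3.connect('library.db')
con.executescript(schema)
con.close()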
The author table is about as simple as it can be. I don’t care about breaking names into given and surname, nor do I care about their birth/death dates. The names are saved in last, first [middle] format, i.e.,
Ang, Alfredo Hua-Sing
Gere, James M.
King, Wilton W.
McGill, David J.
Tang, Wilson H.
Timoshenko, Stephen
Most of the book fields are self-explanatory. The published field is the publication date (just a year), and added is the date I added the book to the database, which helps when I want to print out information about recently added books. SQLite doesn’t have a date datatype, so added is a string in yyyy-mm-dd format. The loc is the Library of Congress Classification, an alphanumeric identifier similar to the Dewey Decimal system. The lccn is the Library of Congress Control Number, which is basically a serial number that’s prefixed by the year. It has nothing to do with classification by topic or shelving, but it’s the key to quickly collecting all the other data on books in the Library of Congress catalog.
I shelve my books according to the Library of Congress Classification. All other things being equal, I’d prefer to use the Dewey Decimal system because that’s the system used by most of the libraries I’ve patronized.1 But all other things are not equal.
- Virtually all of my technical books are in the Library of Congress.
- The LoC catalog is freely available online.
- The LoC’s records can be downloaded in a convenient format.
- Unfortunately, many of the LoC records do not include a Dewey Decimal number.
The advantages of using the LoC Classification far outweigh my short-lived comfort. I’m getting used to my structural engineering books being in the TA600 series instead of the 624.1 series.
The Library of Congress keeps its catalog records in a few different formats. There’s the venerable MARC format, which uses numbers to identify fields and letters to identify subfields. There’s MARCXML, which is a more or less direct translation of MARC into XML. Neither of these was appealing to me. But there’s also the MODS format, which uses reasonable names for the various elements. For example, here’s the MODS record for An Introduction to Dynamics by McGill & King:
xml:
<?xml version="1.0" encoding="UTF-8"?><mods xmlns="http://www.loc.gov/mods/v3" xmlns:zs="http://docs.oasis-open.org/ns/search-ws/sruResponse" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="3.8" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-8.xsd">
<titleInfo>
<nonSort xml:space="preserve">An </nonSort>
<title>introduction to dynamics</title>
</titleInfo>
<titleInfo type="alternative">
<title>Engineering mechanics</title>
</titleInfo>
<name type="personal" usage="primary">
<namePart>McGill, David J.,</namePart>
<namePart type="date">1939-</namePart>
</name>
<name type="personal">
<namePart>King, Wilton W.,</namePart>
<namePart type="date">1937-</namePart>
</name>
<typeOfResource>text</typeOfResource>
<originInfo>
<place>
<placeTerm authority="marccountry" type="code">cau</placeTerm>
</place>
<dateIssued encoding="marc">1984</dateIssued>
<issuance>monographic</issuance>
<place>
<placeTerm type="text">Monterey, Calif</placeTerm>
</place>
<agent>
<namePart>Brooks/Cole Engineering Division</namePart>
</agent>
<dateIssued>c1984</dateIssued>
</originInfo>
<language>
<languageTerm authority="iso639-2b" type="code">eng</languageTerm>
</language>
<physicalDescription>
<form authority="marcform">print</form>
<extent>xv, 608 p. : ill. (some col.) ; 25 cm.</extent>
</physicalDescription>
<note type="statement of responsibility">David J. McGill and Wilton W. King.</note>
<note>Cover title: Engineering mechanics.</note>
<note>Includes index.</note>
<subject authority="lcsh">
<topic>Dynamics</topic>
</subject>
<classification authority="lcc">TA352 .M385 1984</classification>
<classification authority="ddc" edition="19">620.1/04</classification>
<identifier type="isbn">0534029337</identifier>
<identifier type="lccn">83025283</identifier>
<recordInfo>
<descriptionStandard>aacr</descriptionStandard>
<recordContentSource authority="marcorg">DLC</recordContentSource>
<recordCreationDate encoding="marc">831128</recordCreationDate>
<recordChangeDate encoding="iso8601">19840409000000.0</recordChangeDate>
<recordIdentifier>4242715</recordIdentifier>
<recordOrigin>Converted from MARCXML to MODS version 3.8 using MARC21slim2MODS3-8_XSLT1-0.xsl
(Revision 1.172 20230208)</recordOrigin>
</recordInfo>
</mods>
OK, XML isn’t as nice as JSON, but there are Python modules for parsing it, and it’s relatively easy to pick out the elements I want to put in my database.
And if you know a book’s LCCN, you can get its MODS record using a simple URL. The LCCN of McGill & King’s book is 83025283, and the URL to download it is
https://lccn.loc.gov/83025283/mods
That’s very convenient.
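If you just want to look at the raw record before doing anything with it, you can download it from that URL with curl (the -L flag follows any redirect, and the output filename is whatever you like):
curl -L -o 83025283-mods.xml https://lccn.loc.gov/83025283/mods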
It does raise the question, though, of how you get the LCCN of a book. In many of my books, especially the older ones, the LCCN is printed on the copyright page. Here’s a photo of the copyright page of Timoshenko & Gere’s Theory of Elastic Stability:
The name of the LCCN has changed over the years, and you’ll sometimes see it like this with a dash between the year and the serial number, but it’s easy to convert it to the current canonical form.
If the LCCN isn’t in the book, I use the LoC’s Advanced Search form to find it. This allows searches by title, author, ISBN (my older books don’t have ISBNs, but all the newer books do), or combinations. The record that comes up will always have the LCCN.
However I manage to get the LCCN, I then run the lccn2library command with the LCCN as its argument. That adds the necessary entries to the book, author, and book_author tables and returns the loc catalog value. For McGill & King’s book, it would work like this:
lccn2library 83025283
which returns
TA352 .M385 1984
This typically gets printed on a label that I stick on the spine of the book before shelving it.
Here’s the code for lccn2library. It’s longer than the scripts I usually post here, but that’s because there are a lot of details that have to be handled.
python:
1: #!/usr/bin/env python3
2:
3: import sys
4: import re
5: import sqlite3
6: import requests
7: from unicodedata import normalize
8: import xml.etree.ElementTree as et
9: from datetime import date
10: import time
11:
12: ########## Functions ##########
13:
14: def canonicalLCCN(lccn):
15: """Return an LCCN with no hyphens and the correct number of digits.
16:
17: 20th century LCCNs have a 2-digit year. 21st century LCCNs have a
18: 4-digit year. The serial number needs to be 6 digits. Pad with
19: zeros if necessary."""
20:
21: # All digits
22: if re.search(r'^\d+$', lccn):
23: if len(lccn) == 8 or len(lccn) == 10:
24: return lccn
25: else:
26: return lccn[:2] + f'{int(lccn[2:]):06d}'
27: # 1-3 lowercase letters followed by digits
28: elif m := re.search(r'^([a-z]{1,3})(\d+)$', lccn):
29: if len(m.group(2)) == 8 or len(m.group(2)) == 10:
30: return lccn
31: else:
32: return m.group(1) + m.group(2)[:2] + f'{int(m.group(2)[2:]):06d}'
33: # 20th century books are sometimes given with a hyphen after the year
34: elif m := re.search(r'^(\d\d)-(\d+)$', lccn):
35: return m.group(1) + f'{int(m.group(2)):06d}'
36: else:
37: raise ValueError(f'{lccn} is in an unknown form')
38:
39: def correctName(name):
40: """Return the author name without spurious trailing commas,
41: space, or periods."""
42:
43: # Regex for finding trailing periods that are not from initials
44: trailingPeriod = re.compile(r'([^A-Z])\.$')
45:
46: name = name.rstrip(', ')
47: name = trailingPeriod.sub(r'\1', name)
48: return(name)
49:
50: def dbAuthors(cur):
51: """Return a dictionary of all the authors in the database.
52: Keys are names and values are IDs."""
53:
54: res = cur.execute('select name, id from author')
55: authorList = res.fetchall()
56: return dict(authorList)
57:
58: def addAuthor(cur, name):
59: """Add a new author to the database and return the ID."""
60:
61: params = [name]
62: insertCmd = 'insert into author(name) values(?)'
63: res = cur.execute(insertCmd, params)
64: params = [name]
65: idCmd = 'select id from author where name = ?'
66: res = cur.execute(idCmd, params)
67: return res.fetchone()[0]
68:
69: def bookData(root):
70: """Return a dictionary of information about the book.
71:
72: Keys are field names and values are from the book.
73: If a field name is missing, it's given the value None."""
74:
75: # Initialize the dictionary
76: book = dict()
77:
78: # Collect the book information from the MODS XML record
79: # Use the order in the database: title, subtitle, volume,
80: # edition, publisher, published, lccn, loc
81:
82: # The default namespace for mods in XPath searches
83: ns = {'m': 'http://www.loc.gov/mods/v3'}
84:
85: # Get the title, subtitle, and part/volume
86: for t in root.findall('m:titleInfo', ns):
87: if len(t.attrib.keys()) == 0:
88: # Title
89: try:
90: starter = t.find('m:nonSort', ns).text
91: except AttributeError:
92: starter = ''
93: book['title'] = starter + t.find('m:title', ns).text.rstrip(', ')
94:
95: # Subtitle
96: try:
97: book['subtitle'] = t.find('m:subTitle', ns).text.rstrip(', ')
98: except AttributeError:
99: book['subtitle'] = None
100:
101: # Part/volume
102: try:
103:              book['volume'] = t.find('m:partName', ns).text.rstrip(', ')
104: except AttributeError:
105: book['volume'] = None
106:
107: # Get the origin/publishing information
108: # Edition
109: try:
110: book['edition'] = root.find('m:originInfo/m:edition', ns).text
111: except AttributeError:
112: book['edition'] = None
113:
114: # Publisher
115: try:
116: book['publisher'] = root.find('m:originInfo/m:agent/m:namePart', ns).text
117: except AttributeError:
118: book['publisher'] = None
119:
120: # Date published
121: try:
122: book['published'] = root.find('m:originInfo/m:dateIssued', ns).text
123: except AttributeError:
124: book['published'] = None
125:
126: # ID numbers
127: # LCCN (must be present)
128: book['lccn'] = root.find('m:identifier[@type="lccn"]', ns).text
129:
130: # LOC classification number (must be present)
131: book['loc'] = root.find('m:classification[@authority="lcc"]', ns).text
132:
133: # Date added to database is today
134: book['added'] = date.today().strftime('%Y-%m-%d')
135:
136: return book
137:
138: def authorData(cur, root):
139: """Return a dictionary of authors of the book, primary first.
140:
141: The keys are the author names and values are their IDs.
142: Authors not already in the database are added to it."""
143:
144: # The default namespace for mods in XPath searches
145: ns = {'m': 'http://www.loc.gov/mods/v3'}
146:
147: # Get all the authors (primary and secondary) of the book
148: # The primary author goes first in the authors list
149: authors = []
150: names = root.findall('m:name', ns)
151: pnames = root.findall('m:name[@usage="primary"]', ns)
152: snames = list(set(names) - set(pnames))
153: pnames = [ correctName(n.find('m:namePart', ns).text) for n in pnames ]
154: snames = [ correctName(n.find('m:namePart', ns).text) for n in snames ]
155:
156: # Get the authors already in the database
157: existingAuthors = dbAuthors(cur)
158:
159: # Determine which authors are new to the database and add them.
160: # The primary author comes first.
161: authors = dict()
162: for n in pnames:
163: if n in existingAuthors.keys():
164: authors[n] = existingAuthors[n]
165: else:
166: newID = addAuthor(cur, n)
167: authors[n] = newID
168: for n in snames:
169: if n in existingAuthors.keys():
170: authors[n] = existingAuthors[n]
171: else:
172: newID = addAuthor(cur, n)
173: authors[n] = newID
174:
175: return authors
176:
177:
178: ########## Main program ##########
179:
180: # Connect to the database
181: con = sqlite3.connect('library.db')
182: cur = con.cursor()
183:
184: # Get the LCCN from the argument
185: lccn = canonicalLCCN(sys.argv[1])
186:
187: # Get and parse the MODS data for the LCCN
188: r = requests.get(f'https://lccn.loc.gov/{lccn}/mods')
189: mods = normalize('NFC', r.content.decode())
190: root = et.fromstring(mods)
191:
192: # Collect the book data and add it to the book table
193: book = bookData(root)
194: params = list(book.values())
195: insertCmd = 'insert into book(title, subtitle, volume, edition, publisher, published, lccn, loc, added) values(?, ?, ?, ?, ?, ?, ?, ?, ?);'
196: res = cur.execute(insertCmd, params)
197: params = [book["lccn"]]
198: idCmd = f'select id from book where lccn = ?'
199: res = cur.execute(idCmd, params)
200: bookID = res.fetchone()[0]
201:
202: # Collect the authors, adding the new ones to the author table
203: authors = authorData(cur, root)
204:
205: # Add entries to the book_author table
206: for authorID in authors.values():
207: params = [bookID, authorID]
208: insertCmd = f'insert into book_author(book_id, author_id) values(?, ?)'
209: res = cur.execute(insertCmd, params)
210:
211: # Commit and close the database
212: con.commit()
213: con.close()
214:
215: # Print the LOC classification number
216: print(book['loc'])
The script starts with a couple of utility functions, canonicalLCCN and correctName. The former (Lines 14–37) takes an LCCN as its argument and returns it in the form needed in the URL we talked about above. For 20th century books, that form is a two-digit year followed by a six-digit serial number. For 21st century books, it’s a four-digit year followed by a six-digit serial number. In both cases, the serial number part is padded with zeros to make it six digits long. Hyphens are removed. Oh, and there can sometimes be 1–3 lowercase letters in front of the digits.
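To make that concrete, here are a few sample calls and what canonicalLCCN returns for them. The first is the LCCN of McGill & King’s book; the others are made-up values chosen to exercise the other branches:

python:
canonicalLCCN('83025283')     # '83025283'    already eight digits, returned as-is
canonicalLCCN('62-9999')      # '62009999'    hyphen dropped, serial padded to six digits
canonicalLCCN('2021012345')   # '2021012345'  21st century form, already ten digits
canonicalLCCN('agr25881')     # 'agr25000881' letter prefix kept, serial padded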
correctName (Lines 39–48) is necessary because I noticed that sometimes the authors’ names are followed by spurious commas or periods. You can see trailing commas in both authors’ names in the MODS file shown above. I think these extra bits of punctuation made some sense in the MARC format, but I don’t want them in my database.
Often, the book I’m adding to the database has one or more authors that are already entered in the author table. I don’t want them entered again, so I use the dbAuthors function (Lines 50–56) to query the database for all the authors and put them in a dictionary. The dictionary may seem backwards—its keys are the names and the values are the IDs—but that makes it easy to look up an author’s ID by their name.
The addAuthor function (Lines 58–67) does what you’d expect: it executes an SQL command to add a new author to the author table. The return value is the author’s ID.
The bookData function (Lines 69–136) is by far the longest function. It starts at the root of the XML element tree and pulls out all of the elements needed for the book table entry. It returns a dictionary in which the keys are the book field names (other than id, which will be automatically generated), and the values are the corresponding entries in the MODS file. If there is no entry for, say, a subtitle or volume number, the dictionary is given a None value for that key.
I’m using the ElementTree module for parsing and searching the MODS, and its find and findall functions want the namespace of the elements they’re looking for when the XML data has more than one namespace. As you can see in the first line of the example above, there are three namespaces in MODS, the first of which is
http://www.loc.gov/mods/v3
That’s the reason for the ns dictionary defined on Line 83 and the m: prefix in all the element names.
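For example, with the MODS record above parsed into root, a search that leaves out the namespace comes up empty, because ElementTree stores each tag with its namespace attached:

python:
ns = {'m': 'http://www.loc.gov/mods/v3'}
root.findall('titleInfo')        # [] because the tags are stored as '{http://www.loc.gov/mods/v3}titleInfo'
root.findall('m:titleInfo', ns)  # finds both titleInfo elements in the record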
Searching for the book fields takes up many lines of code, partly because MODS has nested data and partly because some of the fields may not be present. That’s why there are several try/except blocks.
One last thing: the added field (Line 134) comes from the date on which lccn2library is run. It has nothing to do with the MODS data.
The last function is authorData (Lines 138–175). It pulls the names of the authors from the MODS data, distinguishing between the primary author and the others. It then uses dbAuthors (Line 157) to figure out which of this book’s authors are already in the database. Those that aren’t in the database are added to it using the addAuthor function described above. A dictionary of all the book’s authors is returned. As with dbAuthors, the keys are the author names and the values are the author IDs. The primary author comes first in the dictionary.2
The main program starts on Line 181 by connecting to the database and setting up a “cursor” for executing commands. The LCCN argument is put in canonical form (Line 185), which is then used with the requests library to download the MODS data.
Before parsing the XML, I normalize the Unicode data into the NFC format on Line 189. This means that something like é is made into a single character, rather than an e followed by a combining acute accent character. I did this because some utilities I use to format the results of database queries don’t like combining characters.
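Here’s a small illustration of the difference, using a made-up name rather than anything from the catalog:

python:
from unicodedata import normalize

decomposed = 'Ame\u0301de\u0301e'      # 'Amédée' written with combining acute accents (8 characters)
composed = normalize('NFC', decomposed)
print(len(decomposed), len(composed))  # 8 6
print(composed == 'Am\u00e9d\u00e9e')  # True: the accents are folded into single characters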
With the root of the MODS element tree defined in Line 190, the script then calls bookData to get the dictionary of book info. That is then inserted into the book table on Lines 194–196 using SQL command execution with parameters. The ID of the newly added book—which was automatically generated by the insert command—is then gathered from the database and put in the bookID variable in Lines 197–200.
The authors are added to the author table by calling authorData on Line 203. The dictionary of author names and IDs is saved in the authors variable.
The entries in the book_author table are inserted in Lines 206–209 using the bookID and authors values.
With all the table insertions done, the changes are committed and the database connection closed in Lines 212–213. The loc field for the book is printed out by Line 216.
Phew!
As you might imagine, I didn’t run lccn2library by hand hundreds of times when I was initially populating the database. No, I made files with lists of LCCNs, one per line, and ran commands like
xargs -n 1 lccn2library < lccn-list.txt
Giving the -n 1 option to xargs ensures that lccn2library will consume only one argument each time it’s run.
I typically did this in chunks of 20–50 LCCNs, mainly because every time I thought I finally had lccn2library debugged, some new MODS input would reveal an error. I feel certain there are still bugs in it, but none have turned up in a while.
There are currently about 500 books in the database, and I think I’ve cleaned out every nook and cranny in the house where books might be hiding. Of course, books still somehow show up when I visit used book stores or (this is the most dangerous) AbeBooks.
- Shout out to librarians for keeping the word “patron” alive and kicking. ↩
- For the last several point releases of Python, dictionaries maintain the order in which they were built. Before that, you had to use the OrderedDict class. The value of putting the primary author first is that it ensures that scripts like bytitle and byauthor—discussed in my earlier posts—will return the list of authors with the primary author first. That’s how I think of the books. It’s McGill & King, not King & McGill. ↩
Human nature
April 8, 2025 at 4:54 PM by Dr. Drang
On my walk this morning, I was greeted by this scene at the entrance to the Springbrook Prairie Preserve.
By the rocks in the lower right corner and elsewhere on that side of the path were about a dozen colorful bags of dogshit.
I’ve downsized the image so you can’t zoom in to read the little white sign stuck in the ground near the center of the frame. Here it is:
It says
Please! Keep our beautiful preserve clean. Haul your dog waste bags out of the park with you. Thank you.
Bags of shit left behind by dog walkers are not uncommon in Springbrook, but I’ve never seen so many in one spot. Maybe this is my general misanthropy talking, but I couldn’t help but think that the bags were deliberately carried and dropped there as a “fuck you” to the sign writer. “You can’t tell me what to do.”
No question, the sign writer was being passive aggressive and probably should have expected that reaction. I would have, but that’s probably my misanthropy coming out again.
I just got back from my second walk to Springbrook today. I took a large bag, gathered up all the small bags, and put them in my garbage bin at home. The bins go out tonight, so it’s not like I’m hoarding dogshit.
I didn’t do this because I’m a good person. My wife was the good person, and she would have been appalled. Not by the dogshit but by the plastic bags, which would tear apart (some already had) and spread tiny bits of plastic over the prairie to be eaten by the birds and coyotes who live there. She would’ve gone back to pick up the bags, so I had to.