At the 2009 SOSP, David Andersen and co-authors from CMU presented
FAWN, the Fast Array of Wimpy Nodes. It inspired me to suggest, in my 2010
JCDL keynote, that the cost savings FAWN realized, without a performance penalty, by distributing computation across a very large number of very low-power nodes might also apply to storage.
The following year, Ian Adams and Ethan Miller of UC Santa Cruz's
Storage Systems Research Center and I examined this possibility more closely in a Technical Report entitled
Using Storage Class Memory for Archives with DAWN, a Durable Array of Wimpy Nodes. We showed that it was indeed plausible that, even at then-current flash prices, the long-term total cost of ownership of a storage system built from very
low-power system-on-chip technology and flash memory would be competitive with disk, while providing high performance and enabling self-healing.
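The intuition behind that claim can be made concrete with a toy cost model. The sketch below is mine, not the report's (whose model was considerably more detailed), and every number in it is an illustrative assumption: flash pays a higher purchase price per terabyte, but lower power draw, longer media life and lower running costs (space, cooling, admin) claw much of that back over a long archival horizon:

```python
# Toy long-term total-cost-of-ownership model, flash vs. disk.
# Every number here is an illustrative assumption, not a figure
# from the DAWN technical report.
import math

def tco_per_tb(capex, watts_per_tb, opex_per_tb_year,
               media_life_years, horizon_years, kwh_cost=0.20):
    """Cost per TB over the horizon: media purchases (replaced at
    end of life), electricity, and other running costs."""
    media = math.ceil(horizon_years / media_life_years) * capex
    energy = watts_per_tb / 1000 * 8760 * horizon_years * kwh_cost
    opex = opex_per_tb_year * horizon_years
    return media + energy + opex

H = 20  # a long archival horizon, in years
disk = tco_per_tb(capex=40, watts_per_tb=1.5, opex_per_tb_year=20,
                  media_life_years=5, horizon_years=H)
flash = tco_per_tb(capex=240, watts_per_tb=0.2, opex_per_tb_year=5,
                   media_life_years=10, horizon_years=H)
print(f"disk:  ${disk:6.0f}/TB over {H} years")
print(f"flash: ${flash:6.0f}/TB over {H} years")
```

With these particular inputs the two technologies land within about 5% of each other. The exact numbers don't matter; the point is that over a long enough horizon the comparison is dominated by running costs rather than purchase price, which is where flash wins.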
Although flash remains more expensive than hard disk, since 2011 the gap has narrowed from a factor of about 12 to about 6. Pure Storage recently announced FlashBlade, an object storage fabric composed of large numbers of blades,
each equipped with:
- Compute – an 8-core Xeon system-on-a-chip, plus an Elastic Fabric Connector for external, off-blade, 40GbitE networking,
- Storage – NAND flash with 8TB or 52TB of raw capacity, plus on-board NV-RAM with a super-capacitor-backed write buffer, a pair of ARM CPU cores, and an FPGA,
- On-blade networking – a PCIe card linking the compute and storage cards via a proprietary protocol.
Chris Mellor at The Register has
details and
two commentaries.
FlashBlade clearly isn't DAWN. Each blade is much bigger, much more powerful, and much more expensive than a DAWN node. No-one could call a node with an 8-core Xeon, two ARM cores, and 52TB of flash "wimpy", and it'll clearly be too expensive for long-term bulk storage. But it is a big step in the direction of the DAWN architecture.
DAWN exploits two separate sets of synergies:
- Like FlashBlade, it moves the computation to where the data is, rather than moving the data to where the computation is, reducing both latency and power consumption. The further data moves along wires from the storage medium, the more power and time it takes. This is why Berkeley's ASPIRE project's architecture is based on optical interconnect technology, which, when it becomes mainstream, will be both faster and lower-power than wires. In the meantime, we have to use wires.
- Unlike FlashBlade, it divides the object storage fabric into a much larger number of much smaller nodes, implemented using the very low-power ARM chips found in cellphones. Because the power a CPU needs tends to grow faster than linearly with its performance, this additional parallelism delivers comparable aggregate performance at lower power, as the sketch below illustrates.
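A back-of-the-envelope model shows why. Assume per-chip power grows as performance raised to an exponent alpha greater than one (alpha = 3 is a common rule of thumb from dynamic CMOS power scaling, where P ∝ V²f and voltage tracks frequency; the real exponent varies with workload and process). Splitting a fixed aggregate workload across N chips then scales total power as N^(1-alpha):

```python
# Why many slow nodes can beat one fast node on power.
# Assumption: per-chip power ~ k * perf**alpha with alpha > 1.
# alpha = 3 is an illustrative rule of thumb, not a measured value.

def total_power(nodes, aggregate_perf, alpha=3.0, k=1.0):
    """Aggregate power when `nodes` chips share `aggregate_perf` equally."""
    per_node_perf = aggregate_perf / nodes
    return nodes * k * per_node_perf ** alpha

# Hold aggregate performance fixed at 100 (arbitrary units) and
# vary the number of nodes it is spread across.
for nodes in (1, 10, 100, 1000):
    print(f"{nodes:5d} nodes: {total_power(nodes, 100.0):14.3f} power units")
```

Under this assumption each ten-fold increase in node count cuts total power a hundred-fold at constant aggregate throughput. In practice fixed per-node overheads such as DRAM, networking and packaging flatten the curve, which is what limits how wimpy a node it pays to build.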
So FlashBlade currently exploits only one of the two sets of synergies. But once Pure Storage has deployed this architecture in its current relatively high-cost and high-power technology, re-implementing it in lower-cost, lower-power technology should be easy and non-disruptive. They have done the harder of the two parts.