mgr's weblog

AI web scraping running wild

July 26, 2024, Miscellaneous
Last edited on July 26, 2024

I have been running https://www.poezio.net for more than 20 years now. It is a poetry database written for the hundreds of Esperanto translations by my father, and at the same time it is one of my first Common Lisp programs. Output can be arranged dynamically in multi-column PDF documents generated by TeX with support for many languages.

It has been running almost unchanged for many years. It was originally written in 2003, with one overhaul in 2010 adding new CSS and a logo to give it a fresh look. Occasionally, I had to move it to a new host, then into a virtual server, which I had to convert again. Now it's behind some proxy servers but still the same behind the scenes, unchanged for more than a decade.

At the beginning of 2010 performance degraded, as Google started to crawl it constantly for new combinations of translations and have them exported as PDF. As the different language versions can be dynamically combined, the site looked huge to a crawler. So I had to throttle Google a bit and change the logic so that the UI would not allow repeating the same translation multiple times. That was it. For more than 14 years.

Until now.

In April 2024 this completely changed. The page is constantly overloaded. And it is not only one bot, Googlebot/2.1, but also: AhrefsBot/7.0, Amazonbot/0.1, Applebot/0.1, bingbot/2.0, Bytespider, ClaudeBot/1.0, DotBot/1.2, DuckDuckBot/1.1, PetalBot, SeekportBot, SemrushBot/7~bl, serpstatbot/2.1, YandexBot/3.0, CCBot/2.0, ChatGPT-User/1.0, MojeekBot/0.11, Mail.RU_Bot/2.0, DataForSeoBot/1.0, GPTBot/1.2.

I had to cater for it on 2024-05-14, and again on 2024-07-24, adding more bots. The gap before that was much longer: the previous modification was on 2010-01-20. And it is still not in a good state. Every day I get mails from my monitoring that the availability is reduced.

And this is only this little web site. Imagine the impact worldwide. This constant scraping must be truly massive and cause immense power consumption across servers and networks around the world.

Pushing the logic to the data – Running Dydra's revisioning algorithm within RonDB's data nodes

July 23, 2024, Lisp
Last edited on July 26, 2024

July 23, 2024. News release.

Datagraph GmbH, Berlin. – We are working on ways to distribute not just data storage, but also query processing. Dydra, our graph database, can use a RonDB NDB Cluster as storage backend for large repositories of billions of triples. We discussed with Mikael Ronström of RonDB and Hopsworks how we could improve our use of RonDB and its advanced features, and proposed to investigate how extensions to the RonDB interpreted code language could make it possible to move our revision visibility test from the core of our graph database system to an implementation that runs in RonDB's data nodes. That would distribute the processing load and reduce the transferred data.

That is, we could push the test to the data instead of having to pull all data to the test.

In Dydra, all data can be revisioned. For that, each statement can have a vector of revision ordinals associated with it to describe its visibility, and thus its full history. Now, if the revision visibility test runs in the data nodes already, queries that involve scans over the data will fetch only those statements that match the specified revision, and the visibility information will not leave the data nodes, greatly reducing the data that has to move.
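
To make the idea of a visibility test over a vector of revision ordinals more concrete, here is a minimal sketch in plain Common Lisp. The encoding is an assumption made purely for illustration and is not taken from Dydra: the vector is taken to be sorted and to hold the revisions at which a statement toggles between visible and invisible (inserted, deleted, re-inserted, and so on). The function names are hypothetical as well.

```lisp
;; ASSUMPTION (not Dydra's actual format): VISIBILITY is a sorted vector of
;; revision ordinals at which the statement's visibility toggles, starting
;; with the revision that inserted it.

(defun count-ordinals-up-to (visibility revision)
  "Binary search: number of entries in sorted VISIBILITY that are <= REVISION."
  (let ((low 0)
        (high (length visibility)))
    (loop while (< low high)
          do (let ((mid (floor (+ low high) 2)))
               (if (<= (aref visibility mid) revision)
                   (setf low (1+ mid))     ; entry counts, search to the right
                   (setf high mid))))      ; entry is too new, search to the left
    low))

(defun visible-p (visibility revision)
  "True if an odd number of toggles happened at or before REVISION."
  (oddp (count-ordinals-up-to visibility revision)))
```

With this toy encoding, a statement inserted in revision 3, deleted in 7 and re-inserted in 12 is described by `#(3 7 12)`, so `(visible-p #(3 7 12) 5)` is true and `(visible-p #(3 7 12) 7)` is false. Only the boolean result has to leave the node that holds the vector, which is the point of pushing the test to the data.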

In recent weeks, Mikael not only implemented the minimal set of our proposal but took it as an opportunity to overhaul the interpreted code language greatly, adding dozens of new commands, changing the instruction format to allow for even more commands, supporting both full and partial reads of data columns into memory and then copying out parts of that data into registers of the register machine, and more. This will allow for interesting new optimizations of many applications based on RonDB.

Max-Gerd Retzlaff of Dydra and Datagraph implemented a nifty little compiler on top of the interpreter that makes it possible to write NDB interpreted code (NDB IC) for RonDB in high-level Lisp rather than as raw instructions for the virtual register machine that executes them. So you can write your logic with high-level conditionals such as IF, WHEN and COND (which is Lisp's IF..ELSEIF..ELSE construct) rather than having to define labels and use branch and jump instructions, which is more in the fashion of writing assembly.
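
The general technique of lowering a high-level conditional to labels and branches can be sketched in a few lines of Common Lisp. This is a toy, not the Lisp NDB IC compiler: the instruction names `:eval`, `:branch-if-false`, `:jump` and `:label` are made up for this illustration and are not NDB interpreted code instructions, and `compile-form` and `next-label` are hypothetical helpers.

```lisp
(defvar *label-counter* 0
  "Source of fresh label numbers for the toy compiler.")

(defun next-label ()
  (incf *label-counter*))

(defun compile-form (form)
  "Return a flat list of made-up branch-and-label instructions for FORM."
  (cond ((null form) '())
        ((and (consp form) (eq (first form) 'if))
         (destructuring-bind (test then &optional else) (rest form)
           (let ((else-label (next-label))
                 (end-label (next-label)))
             (append (compile-form test)
                     (list (list :branch-if-false else-label))
                     (compile-form then)
                     (list (list :jump end-label)
                           (list :label else-label))
                     (compile-form else)
                     (list (list :label end-label))))))
        ;; WHEN is rewritten into IF with an implicit PROGN body.
        ((and (consp form) (eq (first form) 'when))
         (compile-form (list 'if (second form) (cons 'progn (cddr form)))))
        ((and (consp form) (eq (first form) 'progn))
         (loop for sub in (rest form) append (compile-form sub)))
        ;; Everything else is treated as an opaque primitive operation.
        (t (list (list :eval form)))))
```

Starting from a label counter of 0, `(compile-form '(if (> a b) (set x 1) (set x 2)))` returns `((:eval (> a b)) (:branch-if-false 1) (:eval (set x 1)) (:jump 2) (:label 1) (:eval (set x 2)) (:label 2))`, which is the kind of label bookkeeping one would otherwise write by hand.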

At the same time, this Lisp NDB IC Compiler can not only compile to NDB interpreted code instructions but also to regular Common Lisp. This allows for testing and debugging of algorithms within the Lisp development image with all its tooling available, and for shifting over to NDB IC instructions only when new code has passed all tests.

Max used the new compiler to reimplement, and test, Dydra's revision visibility algorithm, which is based on binary search with a number of corner cases, to directly work on the visibility data stored as a VARCHAR column in RonDB.

Data that used to be opaque to RonDB, and had to be retrieved from the data nodes and interpreted by the Dydra query processor, is now analyzed by NDB interpreted code within the RonDB cluster itself.

More information and detailed performance testing to follow.

Availability

The Lisp NDB IC compiler is part of a development branch of CL-NDBAPI, our Open Source Common Lisp bindings to the C++ NDB API of RonDB, available at https://github.com/datagraph/cl-ndbapi.

This branch is currently based on the preview version rondb-22.10.97 of RonDB, which was made for the pull request "RONDB-671: Add a set of new instructions to interpreter making it more complete" at https://github.com/logicalclocks/rondb/pull/472. These changes are scheduled to be in the RonDB 24.10 development tree in late August.

Our work will be made available in CL-NDBAPI's repository at https://github.com/datagraph/cl-ndbapi when work on the new RonDB branch and, in turn, our development version has stabilized.

Got my German Class A amateur radio license

February 21, 2023, Electronics
Last edited on February 21, 2023

Happy in front of the BNetzA with my new certificate yesterday:

[Photo: me in front of the BNetzA]

And one day later I already received my admission to participate in the amateur radio service by letter post. The people at the Bundesnetzagentur are really quick (and on top of that very friendly as well)!

I am going to write a few posts on that over the next few days. Maybe on calculators, the wonderful app Funktrainer by Dominik Mayer…

uLisp on M5Stack (ESP32):
support for the LED matrix of the M5Atom Matrix

December 9, 2021, Lisp
Last edited on December 14, 2021

I got a good friend to join the uLisp fun and he extended my support for the single LED of the M5Atom Lite to support the 25 LEDs of the M5Atom Matrix. The single LED has just the same interface as the LED matrix, as expected.

Thanks, Thorsten!

It has a nice backwards-compatible interface: the functions atomled (for C) and atom-led (for Lisp) just have a new second argument index, which is 0 by default, for the first (or, in the case of the M5Atom Lite, only) LED.

The C function you can call like this:

atomled(0x00ff00);
/* or: */
atomled(0x00ff00, 23);

where 0x00ff00 describes an RGB color in 32 bits.

And the uLisp function you can call very similarly like this:

(atom-led #xffff00)
#| or: |#
(atom-led #xffff00 23)
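
If you would rather not write the color as a hex literal, it can be assembled from its components. The following is a small sketch under the assumption that the value is laid out as 0xRRGGBB, which is the layout that `Adafruit_NeoPixel::Color()` produces; the helper name `rgb` is hypothetical and not part of ulisp-esp-m5stack.

```lisp
;; ASSUMPTION: colors are packed as #xRRGGBB.
(defun rgb (red green blue)
  "Pack three 8-bit components into one integer of the form #xRRGGBB."
  (logior (ash (logand red #xff) 16)
          (ash (logand green #xff) 8)
          (logand blue #xff)))
```

Under that assumption `(rgb 255 255 0)` evaluates to `#xffff00`, so `(atom-led (rgb 255 255 0) 23)` would be equivalent to the second call above.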

I have merged it into my repository ulisp-esp-m5stack already. Activate the new flag #define enable_m5atom_led_matrix in addition to #define enable_m5atom_led to use the whole LED matrix of the M5Atom Matrix instead of just the first LED.


See also built-in LED of the M5Atom Lite.

uLisp on M5Stack (ESP32):
built-in LED of the M5Atom Lite

December 8, 2021, Lisp

I just published support of the M5Atom Lite LED at ulisp-esp-m5stack.

There is a C function that you can call like this:

atomled(0x00ff00);

where 0x00ff00 describes an RGB color in 32 bits.

And a uLisp function that you can call very similarly like this:

(atom-led #xffff00)

Activate #define enable_m5atom_led to get it. That will also automatically call init_atomled(); in setup() after booting the ESP32.

I have actually tried the libraries FastLED (by Daniel Garcia, version 3.4.0), Easy Neopixels (by Evelyn Masso, version 0.2.3), and NeoPixelBus (by Makuna, version 2.6.9) as well, but settled on the library Adafruit NeoPixel (by Adafruit, version 1.10.0). It is small, doesn't have tons of bloat, works for me and has a nice interface that makes my implementation so tiny you would think it was almost no work.

uLisp on M5Stack (ESP32):
new version published

December 6, 2021, Lisp
Last edited on July 26, 2024

I got notified that I haven't updated ulisp-esp-m5stack at GitHub for quite a while. Sorry for that. Over the last months I worked on a commercial project using uLisp and forgot to update the public repository. At least I have now bumped ulisp-esp-m5stack to my version of it from May 13th, 2021.

It is a then-unpublished version of uLisp named 3.6b which contains a bug fix for a GC bug in with-output-to-string and a bug fix for lispstring, both authored by David Johnson-Davies, who sent them to me via email for testing. Thanks a lot again! It seems they are also included in the uLisp Version 3.6b that David published on 20th June 2021.

I know that David has published a couple of new releases of uLisp in the meantime with many more interesting improvements, but this is the version I have been using since May, together with a lot of changes by me which I hope to find time to release as well in the near future.

Error-handling in uLisp by Goheeca

I have been using Goheeca's Error-handling code since June and I couldn't work without it anymore. I just noticed that he allowed me to push his work to my repository in July already. So I just also published my branch error-handling to ulisp-esp-m5stack/error-handling. It's Goheeca's patches together with a few small commits by me on top of it, mainly to achieve this (as noted in the linked forum thread already):

To circumvent the limitation of the missing multiple-values that you mentioned with regard to ignore-errors, I have added a GlobalErrorString to hold the last error message and a function get-error to retrieve it. I consider this to be a workaround but it is good enough to show error messages in the little REPL of the Lisp handheld.


See also "Stand-alone uLisp computer (with code!)".

uLisp on M5Stack (ESP32):
controlling relays connected to I2C via a PCF8574

October 18, 2021, Lisp
Last edited on October 18, 2021

[Photo: relay module connected to I2C via a PCF8574]

Looking at the data sheet of the PCF8574 I found that it is trivially simple to use it to control a relay board without any lower level Arduino library: Just write a second byte in addition to the address to the I2C bus directly with uLisp's WITH-I2C.

Each bit of the byte describes the state of one of the eight outputs, or rather its inverted state as the PCF8574 has open-drain outputs and thus setting an output to LOW opens a connection to ground (with up to 25 mA), while HIGH disables the relay. (The data sheets actually say they are push-pull outputs but as high-level output the maximum current is just 1 mA which is not much and for this purpose certainly not enough.)

The whole job can basically be done with one or two lines. Here is switching on the fourth relay (that is number 3 with zero-based counting):

(with-i2c (str #x20)
  (write-byte (logand #xff (lognot (ash 1 3))) str))
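
To see which byte actually goes over the bus, the arithmetic can be checked without any hardware. This is a small sketch in plain Common Lisp; `byte-for-relays` is a hypothetical helper for illustration only and does no I2C access.

```lisp
(defun byte-for-relays (&rest active)
  "Return the byte to send when exactly the relays in ACTIVE (zero-based) are on.
Bits are inverted because a LOW output switches a relay on."
  (let ((state 0))
    (dolist (n active)
      (setf state (logior state (ash 1 n))))   ; set bit N for relay N
    (logand #xff (lognot state))))             ; invert and keep 8 bits
```

For the fourth relay alone, `(byte-for-relays 3)` gives `#xF7`, that is `#b11110111`: every output HIGH except bit 3. With no relay active, `(byte-for-relays)` gives `#xFF`.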

Here is my whole initial library:

#| control a relay module connected to I2C via a PCF8574 module |#
#| written by Max-Gerd Retzlaff <m.retzlaff@gmx.net>, 2021 |#
 
#| the current state of the relay |#
(defvar *relay* 0)
 
#| address of the PCF8574 module connected to the relay |#
(defvar *relay-address* #x20)
 
#| show state of relay as binary number |#
(defun show-relay ()
  (format nil "~8,'0b" *relay*))
 
#| translate *relay* to relay byte as sent to the PCF8574 |#
(defun relay-byte ()
  (logand #xff (lognot *relay*)))
 
#| actually set real relay via i2c |#
(defun set-relay ()
  (with-i2c (str *relay-address*)
    (write-byte (relay-byte) str)))
 
#| initialize relay |#
(defvar init-relay set-relay)
 
#| switch on relay N |#
(defun relay-on (n)
  (setf *relay* (logior *relay* (ash 1 n)))
  (set-relay))
 
#| switch off relay N |#
(defun relay-off (n)
  (setf *relay* (logand *relay* (lognot (ash 1 n))))
  (set-relay))
 
#| set relay N to STATE |#
(defun relay! (n state)
  ((if state relay-on relay-off) n))
 
#| query state of relay N |#
(defun relay? (n)
  (= 1 (logand 1 (ash *relay* (- n)))))

Be sure to read the newer data sheets "PCF8574 Remote 8-Bit I/O Expander for I2C Bus" by Texas Instruments, revised in March 2015, or "PCF8574; PCF8574A – Remote 8-bit I/O expander for I2C-bus with interrupt" by NXP, revised on 27 May 2013, and not the ancient one by Philips of 2002 that many link to. The new ones are much more detailed and explanatory.


See also "Stand-alone uLisp computer (with code!)", "temperature sensors via one wire", "Curl/Wget for uLisp", time via NTP, lispstring without escaping and more space, flash support, muting of the speaker and backlight control and uLisp on M5Stack (ESP32).
