Drone
Fredrik Svanström
Examiner: Professor Slawomir Nowaczyk
Using sensor fusion, the system is made more robust than the individual sensors. It is observed that with the proposed sensor fusion approach, the system output is more stable and the number of false detections is reduced.
CONTENTS
1 introduction
1.1 Related work
1.1.1 Thermal infrared sensors
1.1.2 Sensors in the visible range
1.1.3 Acoustic sensors
1.1.4 RADAR
1.1.5 Other drone detection techniques
1.1.6 Sensor fusion
1.1.7 Drone detection datasets
1.2 Thesis scope
1.2.1 Thesis system specifications and limitations
2 methods and materials
2.1 Proposed methodology
2.2 System architecture
2.3 Hardware
2.3.1 Thermal infrared camera
2.3.2 Video camera
2.3.3 Fisheye lens camera
2.3.4 Microphone
2.3.5 RADAR module
2.3.6 ADS-B receiver
2.3.7 GPS receiver
2.3.8 Pan/tilt platform including servo controller
2.3.9 Laptop
2.4 Software
2.4.1 System software
2.4.2 Support software
2.5 Graphical user interface
2.6 Dataset for training and evaluation
3 results
3.1 Performance of the individual sensors
3.1.1 Thermal infrared detector and classifier
3.1.2 Video detector and classifier
3.1.3 Fisheye lens camera motion detector
3.1.4 Acoustic classifier
3.1.5 RADAR module
3.2 Sensor fusion and system performance
3.3 Drone detection dataset
4 discussion
5 conclusions
bibliography
1 INTRODUCTION
Small and remotely controlled unmanned aerial vehicles (UAVs), hereinafter referred to as drones, can be useful and of benefit to society. Examples of their usefulness include delivering automated external defibrillators [1], fighting fires more effectively [2], and supporting law enforcement. Moreover, their low cost and ease of operation make drones suitable for entertainment and amusement purposes [3].
Figure 1: Search trend for "drone detection" over the last ten years. From [6].
1 Hubsan H107D+, DJI Phantom 4 Pro and DJI Flame Wheel F450
The sensors that may be considered for drone detection and classification tasks, and hence can be found in the related scientific literature, are: RADAR (on several different frequency bands, both active and passive), cameras for the visible spectrum, cameras detecting thermal infrared emission (IR), microphones for the detection of acoustic vibrations, i.e. sound, sensors detecting the radio frequency signals to and from the drone and the drone controller (RF), scanning lasers (lidar) and, as mentioned in [7] and explored further in [8], even humans. Recently it has also been successfully demonstrated that animals can be trained to fulfil an anti-drone role [9].
A weakness of [5] is that it does not include any papers that make use of thermal infrared cameras and, as we will see next, such sensors are of importance to this thesis.
The work described in [11], from 2017, does not utilize any form of machine learning; instead, the drone detection and classification are done by a human observing the output video stream in real time.
With the background from this paper and the ones above, this thesis will try to extend these findings using a higher-resolution sensor (FLIR Boson with 320x256 pixels) in combination with machine learning.
The possible advantages of using not only video in the visible range
but also a thermal infrared camera are explored to some extent in
[12]. In the paper, from 2019, the authors describe how they combine
the sensors with deep-learning-based detection and tracking modules.
The IR-videos used are stated to have a resolution of 1920x1080, but
unfortunately, the sensor is not specified in any further detail.
The paper presents results using precision and recall curves for the detector, and the authors conclude that "In both real-world and synthetic thermal datasets, the thermal detector achieves better performance than the visible detector". Sadly, any details on the distance between the sensor and the target drone are omitted.
Finally, the authors claim that the visible and thermal datasets used
are made available as the "USC drone detection and tracking dataset",
without giving any link to it in the paper. This dataset is also men-
tioned in [13], but the link found in that paper is unfortunately not
working.
The fact that the two primary sensors (IR and visible) of the drone detector system in this thesis are placed on a pan/tilt platform also gives rise to the need for a wide-angle sensor to steer them towards directions where suspicious objects appear. In [20] a static camera with 110° field of view (FoV) is used to align a rotating narrow-field camera (the images from this camera are analysed with the YOLOv3 model as pointed out above). To find the objects to be investigated further by the narrow-field camera, the video stream from the wide-angle one is analysed by means of a Gaussian Mixture Model (GMM) foreground detector in one of the setups evaluated in [20].
1.1.4 RADAR
The F450 drone is also used in [32], where the micro-Doppler characteristics of drones are investigated. These are typically echoes from the moving blades of the propellers and can be detected on top of the bulk motion Doppler signal of the drone. Since the propellers are generally made from plastic materials, the RCS of these parts is even smaller, and in [29] it is stated that the echoes from the propellers are 20 to 25 dB weaker than that of the drone body itself. Nevertheless, papers like [33], [34] and [35] all accompany [32] in exploring the possibility of classifying drones using the micro-Doppler signature.
Very few drones are autonomous in the flight phase. Generally, they are controlled by means of manned ground equipment, and often the drones themselves also send out information on some radio frequency (RF). The three drones used in this thesis are all controlled in real time. The information sent out by the drones can range from simple telemetry data such as battery level (DJI Flame Wheel F450), to a live video stream (Hubsan H107D+), to both a video stream and extensive position and status information (DJI Phantom 4 Pro).
As shown in [7], the fusion of data from multiple sensors, i.e. using several sensors in combination to achieve more accurate results than any single sensor could provide while compensating for their individual weaknesses, is well founded for the drone detection task.
1 Shape factor, kurtosis (the tailedness of the energy curve), and the variance
Just as in [7], this thesis also considers early and late sensor fusion and differentiates these two principles based on whether the sensor data is fused before or after the detection element.
In the references of [7], there are two useful links to datasets for visible video detectors. One of these is [41], where 500 annotated drone images can be found. The other link leads to the dataset [42] of the Drone-vs-Bird challenge held by the Horizon2020 SafeShore project consortium. However, the dataset is only available upon request and with restrictions on the usage and sharing of the data. The Drone-vs-Bird challenge is also mentioned in [18], [19] and by the winning team of the 2017 challenge [43]. The results from the 2019 Drone-vs-Bird challenge are presented in [44].
There are also some limitations to the scope of this thesis.
Finally, the methods used for the composition of the drone detec-
tion dataset are described, including the division of the dataset ac-
cording to the sensor type, target class and sensor-to-target distance
bin.
2 METHODS AND MATERIALS
Combining the data from several sensors under the time constraints described above must be kept simple and streamlined. This, together with the fact that very few papers explore sensor fusion techniques, is the motivation for a system where the inclusion and weights of the sensors can be altered at runtime to find a feasible setting.
1 Cooperative aircraft are here defined as being equipped with equipment that broadcasts the aircraft's position, velocity vectors and identification information
Figure 3: The main parts of the drone detection system. On the lower left the
microphone and above that the fisheye lens camera. On the pan
and tilt platform in the middle are the IR- and video cameras. The
holder for the servo controller and power relay boards is placed
behind the pan servo inside the aluminium mounting channel.
2.3 hardware
To have a stable base for the system, all hardware components, except the laptop, are mounted on a standard surveyor's tripod. This solution also facilitates the deployment of the system outdoors, as shown in Figure 4. Due to the nature of the system, it must also be easy to transport to and from any deployment. Hence, a transport solution is available where the system can be disassembled into a few large parts and placed in a transport box.
Figure 4: The system deployed just north of the runway at Halmstad airport.
Notably, the Boson sensor of the FLIR Breach has a higher resolution than the one used in [11], where a FLIR Lepton sensor with 80x60 pixels was used. In that paper, the authors were able to detect three different drone types up to a distance of 100 m; however, this detection was done manually by a person looking at the live video stream and not, as in this thesis, by means of a trained embedded and intelligent system.
The IRcam has two output formats, a raw 320x256 pixels format (Y16 with 16-bit greyscale) and an interpolated 640x512 pixels image in the I420 format (12 bits per pixel). For the interpolated image format, the colour palette can be changed, and several other image processing features are also available. In the system, the raw format is used to avoid the extra overlaid text information of the interpolated image. This choice also gives better control of the image processing operations, since they are instead implemented in Matlab.
The output from the IRcam is sent to the laptop via a USB-C port
at a rate of 60 frames per second (FPS). The IRcam is also powered
via the USB connection.
The Vcam has an adjustable zoom lens, and with this, the field of view can be set to be either wider or narrower than that of the IRcam described above. The Vcam is set to have about the same field of view as the IRcam.
2.3.4 Microphone
To be able to capture the distinct sound that drones emit when flying, a Boya BY-MM1 mini cardioid directional microphone is also connected to the laptop.
1 Phase-locked loop
2 Global Navigation Satellite System
2.3.7 GPS receiver

To present the decoded ADS-B data correctly, the system is also equipped with a G-STAR IV BU-353S4 GPS receiver connected via USB. The receiver outputs messages following the National Marine Electronics Association (NMEA) format standard.
2.3.8 Pan/tilt platform including servo controller

To be able to detect targets in a wider field of view than just 24° horizontally and 19° vertically, the IR and video cameras are mounted on a pan/tilt platform. This is the Servocity DDT-560H direct drive tilt platform together with the DDP-125 pan assembly, also from Servocity. To achieve the pan/tilt motion, two Hitec HS-7955TG servos are used.
To supply the servos with the necessary voltage and power, both a mains adapter and a DC-DC converter are available. The DC-DC solution is used when the system is deployed outdoors and, for simplicity, it uses the same battery type as one of the available drones.
Some other parts from Actobotics are also used in the mounting of the system, and the following have been designed and 3D-printed: adapters for the IR, video and fisheye lens cameras, a radar module mounting plate, and a case for the servo controller and power relay boards.
2.3.9 Laptop
2.4 software
The software used in the thesis can be divided into two parts. First, there is the software running in the system when it is deployed, as depicted in Figure 2. Additionally, there is a set of support software used for tasks such as forming the training datasets and training the system.
The IRcam and Vcam workers are similar in their basic structure, and both import and run a trained YOLOv2 detector and classifier. The information sent to the main script is the class of the detected target, the confidence, and the horizontal and vertical offsets in degrees from the centre point of the image. The latter information is used by the main script to calculate servo commands when an object is being tracked by the system.
The Audio worker sends information about the class and confidence to the main script. Unlike the others, the ADS-B worker has two output queues: one consisting of the current tracks and the other of the track histories. This is done so that the presentation clearly shows the heading and altitude changes of the targets.
Looking further into the different output classes or labels that the main code can receive from the workers, as shown in Table 1, it is clear that not all sensors can output all the target classes used in this thesis. Furthermore, the audio worker has an additional background class, and the ADS-B worker will output the NoData class if the vehicle category is not included in the received messages.

Table 1: The output classes of the sensors, and their corresponding class colours.
The main script and all the workers are set up so that they can be
run independently of each other in a stand-alone mode. The reason
for this is to facilitate the development and tuning without having to
set up the whole physical system.
The main script is the core of the system software, as shown in Fig-
ure 2. Besides initially starting the five workers (threads) and setting
up the queues to communicate with these, the main script also sends
commands to and reads data from the servo controller and the GPS-
receiver, respectively. After the start-up sequence, the script goes into
a loop that runs until the program is stopped by the user via the
graphical user interface (GUI).
Updating the GUI and reading user inputs are the most frequent
tasks done on every iteration of the loop. At regular intervals, the
main script also interacts with the workers and the servo controller.
Servo positions are read and queues are polled ten times a second.
Within this part, the system results, i.e. the system output label and
confidence, are also calculated using the most recent results from the
workers. Furthermore, at a rate of 5 Hz new commands are sent to the
servo controller for execution. Every two seconds the ADS-B plot is
updated. Having different intervals for various tasks makes the script more efficient since, for example, an aircraft sends out its position via ADS-B every second, and hence updating the ADS-B plots too often would only be a waste of computational resources. The main script pseudocode is shown in Algorithm 1.
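As an illustration of this scheduling, a simplified sketch of such a rate-scheduled loop is shown below. The helper functions are hypothetical placeholders for the tasks described above, not the actual Algorithm 1.

    % Simplified sketch of the rate-scheduled main loop (hypothetical helpers).
    lastPoll = tic; lastServo = tic; lastAdsb = tic;
    running = true;
    while running
        running = updateGUIAndReadUserInput();        % every iteration
        if toc(lastPoll) >= 0.1                       % 10 Hz: poll workers and servos
            results  = pollWorkerQueues();
            servoPos = readServoPositions();
            [sysLabel, sysConf] = fuseResults(results);
            lastPoll = tic;
        end
        if toc(lastServo) >= 0.2                      % 5 Hz: send servo commands
            sendServoCommands(sysLabel, servoPos);
            lastServo = tic;
        end
        if toc(lastAdsb) >= 2                         % 0.5 Hz: update the ADS-B plot
            updateAdsbPlot();
            lastAdsb = tic;
        end
    end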
The use of workers also allows the different detectors to run asynchronously, i.e. each handling as many frames per second as possible without any inter-sensor delays or waiting times.
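A minimal sketch of how such an asynchronous worker could be launched and polled, assuming the Parallel Computing Toolbox and using a toy stand-in for the real worker function, is given below.

    % Launch a worker asynchronously and poll its result queue (sketch).
    pool = gcp('nocreate');
    if isempty(pool)
        pool = parpool(2);                      % small pool, one process per worker
    end
    q = parallel.pool.PollableDataQueue;
    % Toy stand-in for a sensor worker: posts a single fake result to the queue.
    workerFcn = @(queue) send(queue, struct('label', "drone", 'confidence', 0.9));
    parfeval(pool, workerFcn, 0, q);            % runs without blocking the main script

    [msg, gotData] = poll(q, 5);                % wait up to five seconds for a result
    if gotData
        fprintf('%s (%.2f)\n', msg.label, msg.confidence);
    end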
The sensor fusion is an essential part of the main code, and every time the main script polls the worker queues it puts the results in a 4x4 matrix, organized so that each class is a column and each sensor is a row. The value depends not only on the class label and the confidence but also on the setting of which sensors to include and the weight of the specific sensor, i.e. how much we trust it at the moment. The matrix is then summed column-wise into a 1x4 array, and the column with the highest value will be the output system class.
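A minimal sketch of this fusion step is given below, under the assumption that each worker result has already been reduced to a class index and a confidence value; the variable names and the accessor function are hypothetical.

    % Sketch of the weighted sensor fusion (hypothetical variable names).
    classes = ["airplane" "bird" "drone" "helicopter"];
    F       = zeros(4, 4);               % rows = sensors, columns = classes
    include = [1 1 1 1];                 % user setting: which sensors to include
    weights = [1.0 1.0 1.0 1.0];         % user setting: how much each sensor is trusted

    for s = 1:4
        [cls, conf] = latestResult(s);   % hypothetical accessor for the latest worker result
        if include(s) && cls > 0
            F(s, cls) = weights(s) * conf;
        end
    end

    score = sum(F, 1);                   % 1x4 array with one value per class
    [~, bestClass] = max(score);
    systemLabel = classes(bestClass);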
The next actions taken at start-up are loading the detector and connecting to the IR camera. The video stream is thereafter started and set to be continuous using the triggerconfig command, so that the worker can use the getsnapshot function to read an image at any time within the loop it goes into when running.
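A minimal sketch of this camera connection is shown below. The adaptor name, device ID and format string are assumptions that depend on the actual installation, and the conversion to 8-bit is only one possible way of preparing the raw frame for the detector.

    % Connect to the camera and read frames on demand (sketch).
    vid = videoinput('winvideo', 1, 'Y16_320x256');   % assumed adaptor and format
    vid.FramesPerTrigger = Inf;
    triggerconfig(vid, 'manual');                     % continuous acquisition
    start(vid);

    while true                                        % simplified worker loop
        frame  = getsnapshot(vid);                    % latest raw 16-bit frame
        frame8 = im2uint8(mat2gray(frame));           % scale to 8 bit for the detector
        % ... run the YOLOv2 detector on frame8 and post the results to the queue ...
    end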
Information about the current state of the detector and its performance in terms of frames per second (FPS) is also inserted in the top left corner of the image. The current FPS processed by the worker is also sent to the main script together with the detection results.
The YOLOv2 detector is set up and trained using one of the scripts belonging to the support software. The YOLOv2 network is formed by modifying a pretrained backbone network. Six detection layers and three final layers are also added to the network. Besides setting the number of output classes of the final layers, the anchor boxes used are also specified. Using more anchor boxes can improve the detection performance; on the other hand, it will also increase the computational cost and may lead to overfitting. After assessing the plot, the number of anchor boxes is chosen to be three, and their sizes, with a scaling factor of 0.8 in width to match the downsizing of the input layer from 320 to 256 pixels, are taken from the output of the estimateAnchorBoxes function.
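A minimal sketch of this anchor box estimation is given below. The file and variable names are assumptions; the training data is assumed to be a Matlab table with an image-filename column followed by one column of bounding boxes per class.

    % Estimate anchor boxes from the annotated training data (sketch).
    data = load('irTrainingData.mat');                 % hypothetical file
    trainingData = data.trainingData;

    blds = boxLabelDatastore(trainingData(:, 2:end));  % box columns only
    numAnchors = 3;
    [anchorBoxes, meanIoU] = estimateAnchorBoxes(blds, numAnchors);

    % Scale the anchor widths by 0.8 to match the downsizing of the input
    % layer from 320 to 256 pixels in the horizontal direction.
    anchorBoxes(:, 2) = round(0.8 * anchorBoxes(:, 2));  % anchors are [height width]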
The detector is trained using data picked from the available dataset after the evaluation data have been selected and put aside, as described in Section 2.6. The training data for the IRcam YOLOv2 detector consists of 120 video clips, each one just over 10 seconds long and evenly distributed among all classes and all distance bins, making the total number of annotated images in the training set 37428. The detector is trained for five epochs using the stochastic gradient descent with momentum (SGDM) optimizer and an initial learning rate of 0.001.
Since the training data is specified as a table and not as a datastore, the trainYOLOv2ObjectDetector function performs preprocessing augmentation automatically. The augmentation implemented comprises reflection, scaling, and changes in brightness, hue, saturation and contrast.
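A minimal training sketch is given below, reusing the trainingData and anchorBoxes variables from the sketch above. The backbone network, feature extraction layer, three-channel input depth and mini-batch size are assumptions, since they are not specified here.

    % Assemble and train a YOLOv2 detector (sketch with an assumed backbone).
    inputSize  = [256 256 3];              % downsized input layer, three channels assumed
    numClasses = 4;                        % airplane, bird, drone, helicopter
    lgraph = yolov2Layers(inputSize, numClasses, anchorBoxes, ...
        resnet18, 'res4b_relu');           % assumed backbone and feature extraction layer

    options = trainingOptions('sgdm', ...
        'InitialLearnRate', 0.001, ...
        'MaxEpochs', 5, ...
        'MiniBatchSize', 16, ...           % assumed batch size
        'Shuffle', 'every-epoch');

    % With a table as input, the built-in augmentation described above is applied.
    [detector, info] = trainYOLOv2ObjectDetector(trainingData, lgraph, options);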
The Vcam worker is very similar to the IRcam worker, with some
exceptions. The input image from the Vcam is 1280x720 pixels, and
directly after the getsnapshot-function, it is resized to 640x512 pix-
els. This is the only image processing operation done to the visible
video image. The input layer of the YOLOv2 detector has a size of
416x416x3. The increased size, compared to the detector in the IRcam
worker, is directly reflected in the FPS performance of the detector.
Due to the increased image size used to train the detector, the train-
ing time is also extended compared to the IR case. When using a com-
puter with an Nvidia GeForce RTX2070 8GB GPU, the time for one
epoch is 2 h 25 min, which is significantly longer than what is pre-
sented in Table 2. The training set consists of 37519 images, and the
detector is trained for five epochs just as the detector of the IRcam
worker.
Initially, the fisheye lens camera was mounted facing upwards, but,
as it turned out, this caused the image distortion to be significant in
the area just above the horizon where the interesting targets usually
appear. After turning the camera so that it faces forward, as can be
seen in Figure 3, the motion detector is less affected by the image
distortion, and since half of the field of view is not used anyway, this
is a feasible solution. Initially, the image was cropped so that the half
where the pan/tilt platform obscures the view was not used. Now
instead, the part of the image covering the area below the horizon in
front of the system is not processed.
The Fcam worker sets up queues and connects to the camera much
in the same way as the IRcam and Vcam workers. The input im-
age from the camera is 1024x768 pixels, and immediately after the
getsnapshot-function the lower part of the image is cropped so that
1024x384 pixels remain to be processed.
To get rid of noise from the parts of the square-shaped image sensor
that lie outside the circular area seen by the fisheye lens, the mask im-
age is also multiplied with another binary mask before it is processed
by the BlobAnalysis-function. This outputs an array of centroids and
bounding boxes for all objects that are considered to be moving.
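A minimal sketch of this motion detection chain is given below. The parameter values, the camera object and the precomputed lens mask are assumptions, not the tuned values used in the system.

    % Foreground detection and blob analysis for the fisheye camera (sketch).
    foregroundDet = vision.ForegroundDetector('NumGaussians', 3, ...
        'NumTrainingFrames', 50, 'MinimumBackgroundRatio', 0.7);
    blobAnalyser  = vision.BlobAnalysis('AreaOutputPort', false, ...
        'CentroidOutputPort', true, 'BoundingBoxOutputPort', true, ...
        'MinimumBlobArea', 25);
    lensMask = true(384, 1024);            % placeholder for the precomputed fisheye mask

    frame = getsnapshot(fcam);             % fcam: fisheye camera object, assumed connected
    frame = frame(1:384, :, :);            % keep the 1024x384 part above the horizon
    mask  = foregroundDet(rgb2gray(frame));% GMM foreground mask
    mask  = mask & lensMask;               % suppress the area outside the fisheye circle
    [centroids, bboxes] = blobAnalyser(mask);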
All these centroids and bounding boxes are sent to a Kalman filter multi-object tracker, which is a customised version of a script available in one of the Matlab computer vision toolbox tutorials [51]. Out of the tracks started and updated by the Kalman filter, the one with the longest track history is picked out and marked as the best one. In the Fcam presentation window, all tracks, both updated and predicted, are visualised, and the track considered to be the best is also marked in red. With the press of a button in the GUI, the user can choose to show the moving object mask in the presentation window instead of the normal fisheye lens camera image. This is shown in Figure 10.
Figure 10: The normal Fcam image and track presentation above, and the
moving object mask image below.
The output from the Fcam worker is the FPS status, together with the elevation and azimuth angles of the best track, if such a track exists at the moment. Out of all the workers, the Fcam is the one with the most tuning parameters. This involves choosing and tuning the image processing operations, the foreground detector and blob analysis settings, and finally the multi-object Kalman filter tracker parameters.
The audio worker uses the attached directional microphone and collects acoustic data in a one-second-long buffer (44100 samples), set to be updated 20 times per second. To classify the source of the sound in the buffer, it is first processed with the mfcc function from the Audio toolbox. Based on empirical trials, the parameter LogEnergy is set to Ignore, and then the extracted features are sent to the classifier.
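A minimal sketch of this feature extraction step is shown below; the random buffer is only a stand-in for the one-second microphone buffer described above.

    % Extract MFCC features from a one-second audio buffer (sketch).
    fs          = 44100;
    audioBuffer = randn(fs, 1);            % placeholder for the real microphone buffer
    coeffs      = mfcc(audioBuffer, fs, 'LogEnergy', 'Ignore');
    % coeffs has one row per analysis window and one column per coefficient;
    % this matrix is what is passed on to the trained audio classifier.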
Figure 11: The audio classifier trained for 250 epochs showing signs of over-
fitting.
Figure 13: Two examples of audio worker plots with output classification la-
bels, and below that the audio input amplitudes and the extracted
MFCC-features.
As mentioned above, not all aircraft will send out their vehicle category as part of the ADS-B squitter message. Looking at how to implement the decoding of the ADS-B messages, two alternative solutions arise. The first is to use the Dump1090 software [53], then import the information into Matlab and have the worker just sort the data to suit the main script. The other alternative is to implement the ADS-B decoding in Matlab using functions from the Communications toolbox.
One might wonder if there are any aircraft actually sending out that they belong to the UAV vehicle category. Examples are in fact found when looking at the Flightradar24 service [54]. Here we can find one such drone, as shown in Figure 14, flying at Gothenburg City Airport, one of the locations used when collecting the dataset of this thesis. The drone is operated by the company Everdrone AB [55], involved in the automated external defibrillator delivery trials of [1].
Figure 15: Surveillance drone over the Strait of Dover. From [54].
2.5 graphical user interface

The graphical user interface (GUI) is a part of the main script but is described separately here. The GUI presents the results from the different sensors/workers and also provides possibilities for the user to easily control the system without having to stop the code and change it manually to get the desired configuration. The principal layout of the GUI is shown in Figure 16.
The gap under the results panel is intentional, making the Matlab
command window visible at all times, so that messages, for example
exceptions, can be monitored during the development and use of the
system.
ADS-B targets are presented using the class label colours, as seen
in Figure 17, together with the track history plots. The presentation
of the altitude information is done in a logarithmic plot so that the
lower altitude portion is more prominent.
Figure 17: ADS-B presentation area. The circles of the PPI are 10 km apart.
Following the general GUI layout from Figure 16, the area directly below the ADS-B presentation is the control panel, as shown in Figure 18. Starting from the top left corner, we have radio buttons for the range settings of the ADS-B PPI and altitude presentations. Next is the number of ADS-B targets currently received and, below that, the set orientation angle of the system relative to north. The Close GUI button is used to shut down the main script and the workers.
The servo settings can be controlled with buttons in the middle column of Figure 18. To complement the Fcam in finding interesting objects, the pan/tilt can be set to move in two different search patterns: one where the search is done from side to side using a static elevation of 10°, so that the area from the horizon up to 20° is covered, and one where the search is done with two elevation angles to increase the coverage.
The results panel features settings for the sensor fusion and presents
the workers and system results to the user. The servo controller col-
umn seen in Figure 19 indicates the source of information currently
controlling the servos of the pan/tilt platform.
Figure 19: The results panel. Here a bird is detected and tracked by the
IRcam worker.
In the lower-left corner of Figure 19, the angles of the servos are
presented. The settings for the sensor fusion and the detection results
presentation are found in the middle of the panel, as described in
Section 2.4.1.1. The information in the right part of the panel is the
current time and the position of the system. The system elevation and
azimuth relative to the north are also presented here. Note the differ-
ence in azimuth angle compared to the lower-left corner where the
system internal angle of the pan/tilt platform is presented.
The last part of Figure 19 presents offset angles for the ADS-B tar-
get, if one is present at the moment in the field of view of the IR-
and video cameras. These values are used to detect systematic errors
in the orientation of the system. The sloping distance to the ADS-B
target is also presented here. See the bottom part of Figure 8 for an
example of this.
Figure 20 shows the whole GUI, including the video displays. The image is turned 90° to use the space of the document better.

Figure 20: The graphical user interface. Here the IRcam, Vcam and audio workers all detect and classify the drone correctly. The Fcam worker is also tracking the drone, and the pan/tilt platform is for the moment controlled by the IRcam worker. The performance of the workers, in terms of frames per second, is also shown.
2.6 dataset for training and evaluation

Three different drones are used to collect and compose the dataset. These are of the following types: the Hubsan H107D+, a small-sized first-person-view (FPV) drone, the high-performance DJI Phantom 4 Pro, and finally, the medium-sized kit drone DJI Flame Wheel, which can be built either as a quadcopter (F450) or in a hexacopter configuration (F550). The version used in this thesis is an F450 quadcopter. All three types can be seen in Figure 21.
Figure 21: (a) Hubsan H107D+, (b) DJI Phantom 4 Pro, (c) DJI Flame Wheel F450.
These drones differ a bit in size, with the Hubsan H107D+ being the smallest, having a motor-to-motor side length of 0.1 m. The Phantom 4 Pro and the DJI Flame Wheel F450 are a bit larger, with 0.3 and 0.4 m motor-to-motor side lengths, respectively.
The drone flights during the data collection and system evalua-
tion are all done in compliance with the national rules for unmanned
aircraft found in [58]. The most important points applicable to the
drones and locations used in this thesis are:
• When flown, the unmanned aircraft shall be within its opera-
tional range and well within the pilot’s visual line of sight
• When flown in uncontrolled airspace, the drone must stay be-
low 120 m from the ground
• When flying within airports’ control zones or traffic information
zones and if you do not fly closer than 5 km from any section
of the airport’s runway(s), you may fly without clearance if you
stay below 50 m from the ground
1 The airport code as defined by the International Air Transport Association (IATA)
2 The airport code as defined by the International Civil Aviation Organization (ICAO)
Since the drones must be flown within visual range, the dataset is recorded in daylight, even though the system designed in this thesis can, to some extent, be effective even in complete darkness using the thermal infrared and acoustic sensors. The ADS-B reception naturally also works as usual at night. The weather in the dataset ranges from clear and sunny to scattered clouds and completely overcast, as shown in Figure 22.
The audio in the dataset is taken from the videos or recorded separately using one of the support software scripts. Both the videos and the audio files are cut into ten-second clips to make them easier to annotate. To get a more comprehensive dataset, both in terms of aircraft types and sensor-to-target distances, it has also been complemented with non-copyrighted material from the YouTube channel "Virtual Airfield operated by SK678387" [59]. This adds a total of 11 and 38 video clips in the airplane and helicopter categories, respectively.
Table 3: The distance bin division for the different target classes.
1 Bell 429, one of the helicopter types in the dataset, has a length of 12.7 m
2 Saab 340 has a length of 19.7 m and a wingspan of 21.4 m
Figure 25: Objects on the limit between the close and medium distance bins.
At this level, we can not only detect but also recognize the different objects, albeit without necessarily identifying them, i.e. explicitly telling what kind of helicopter it is and so on.
Note that exactly the same distance limits used in this division are also applied to the visible video data, notwithstanding the fact that the input layer of the Vcam worker YOLOv2 detector has 1.6 times the resolution of the IRcam one, as described in Section 2.4.1.2 and Section 2.4.1.3.
The annotation of the video dataset is done using the Matlab video
labeller app. An example from a labelling session is shown in Fig-
ure 26.
From the dataset, 120 clips (5 from each class and target bin) were
put aside to form the evaluation dataset. Out of the remaining videos
240 were then picked as evenly distributed as possible to create the
training set.
3 RESULTS
Recall that the scope of this thesis work was twofold: first, to explore the possibilities and limitations of designing and constructing a multi-sensor drone detection system while building on state-of-the-art methods and techniques, and second, to collect, compose and publish a drone detection dataset. In what follows, the results of the thesis are outlined, starting with the performance of the drone detection system, on both the sensor and system levels, and ending with a presentation of the dataset and its contents.
3.1 performance of the individual sensors

To evaluate the individual sensors, a part of the dataset was put aside at an early stage and hence kept out of the training process. The evaluation set for the audio classifier contains five 10-second clips from each output category. Since the classifier processes a one-second input buffer, the evaluation set is also cut into clips of that length, and using an overlap of 0.5, there are a total of 297 clips in the evaluation set, 99 from each class.
1 How many of the selected items are relevant? $\mathrm{Precision} = \frac{tp}{tp+fp}$
2 How many of the relevant items are selected? $\mathrm{Recall} = \frac{tp}{tp+fn}$
3 The F1-score is defined to be the harmonic mean of the precision and recall, hence $F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
For the detectors of the IRcam and Vcam workers, not only the classification label but also the placement of the bounding box must be taken into consideration. For this, the IoU, as defined in Section 2.4.1.2, is used once more. Using similar IoU requirements makes the results easier to compare, and looking at the related work, we see that an IoU of 0.5 is used in [15], [18] and [19], while a lower IoU of 0.2 is used in [43]. This thesis conforms to the use of an IoU requirement of 0.5.
Using the distance bin division described in Section 2.6, the precision and recall of the IRcam worker detector, with a confidence threshold set to 0.5 and an IoU requirement of 0.5, are shown in Table 5, Table 6 and Table 7 below.
Table 5: Precision and recall of the IRcam worker detector for the distance bin Close.
Table 6: Precision and recall of the IRcam worker detector for the distance bin Medium.
Table 7: Precision and recall of the IRcam worker detector for the distance bin Distant.
Taking the average result from each of these distance bins and calculating their respective F1-scores, we obtain Table 8.
We can observe that the precision and recall values are well bal-
anced using a detection threshold of 0.5, and altering the setting con-
firms that a higher threshold leads to higher precision, at the cost
of a lower recall value, as shown in Table 9. The drop in recall with
increasing sensor-to-target distance is also prominent.
Table 9: Precision and recall values of the IRcam worker detector, averaged
over all classes, using a detection threshold of 0.8 instead of 0.5.
Figure 27: The F1-score of the IRcam worker detector as a function of de-
tection threshold, using all the 18691 images in the evaluation
dataset.
Using not only the bounding boxes and class labels but also the confidence scores, the detector can be evaluated using the Matlab evaluateDetectionPrecision function. From this, we obtain plots of the PR-curves, as shown below in Figure 28. Note that the average precision output from the Matlab evaluateDetectionPrecision function is defined to be the area under the PR-curve, and hence it is not the same as the actual average precision values of the detector on the evaluation dataset, as presented in Table 5, Table 6 and Table 7 above.
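A minimal sketch of this evaluation is shown below. The file and variable names are assumptions; detectionResults must be a table with Boxes, Scores and Labels columns, and evalData a table with one box column per class.

    % Compute average precision and PR-curves for the detector (sketch).
    S = load('vcamEvaluationResults.mat');          % hypothetical file
    [ap, recall, precision] = evaluateDetectionPrecision( ...
        S.detectionResults, S.evalData, 0.5);       % IoU threshold of 0.5

    plot(recall{3}, precision{3})                   % PR-curve for one of the classes
    xlabel('Recall'); ylabel('Precision'); grid on
    title(sprintf('AP = %.2f', ap(3)))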
Figure 28: Precision and recall curves for the IRcam worker detector. The
achieved values with a detection threshold of 0.5 are marked by
stars.
The choice of the detection threshold will affect the achieved pre-
cision and recall values. By plotting the values from Table 5, Table 6
and Table 7 as stars in Figure 28, we can conclude that a threshold
of 0.5 results in a balanced precision-recall combination near the top
right edge of the respective curves. Compare this to the precision and
recall values we obtain when using a detection threshold of 0.8, as
shown in Table 9 above.
Calculating the mAP for the IRcam worker we obtain Table 10.
Table 10: The mean values, over all classes, of the area under the PR-curve
(mAP) of the IRcam worker detector for the different distance bins
including the average of these values.
Figure 29: A false target of the IRcam worker detector caused by a small
cloud lit by the sun.
Table 11: Precision and recall of the Vcam worker detector for the distance bin Close.
Table 12: Precision and recall of the Vcam worker detector for the distance bin Medium.
Table 13: Precision and recall of the Vcam worker detector for the distance bin Distant.
These results differ by no more than 3% from the results of the IRcam worker detector. Recall that the input layers of the YOLOv2 detectors are different, so that the resolution of the Vcam worker (416x416 pixels) is 1.625 times higher than that of the IRcam worker (256x256 pixels). So even with a lower resolution, and the fact that the image is in greyscale and not in colour, the IR sensor performs as well as the visible one. This conforms well with the related work.
However, one notable difference lies in the fact that the detector in [15] has only one output class. This could confirm the doctrine of this thesis, i.e. that the detectors should also be trained to recognize objects easily confused with drones. Unfortunately, there is no indication of the sensor-to-target distance other than that "75% of the drones have widths smaller than 100 pixels". Since the authors implement an original YOLOv2 model from darknet, it is assumed that the input size of the detector is 416x416 pixels.
Just as for the IRcam, as shown in Figure 27, we can also explore the
effects of the detection threshold setting. This can be seen in Figure 30
below.
Figure 30: The F1-score of the Vcam worker detector as a function of de-
tection threshold, using all the 18773 images in the evaluation
dataset.
The PR-curves of the Vcam worker detector for the different target
classes and distance bins are shown in Figure 31.
Figure 31: Precision and recall curves for the Vcam worker detector. The
achieved values with a detection threshold of 0.5 are marked by
stars.
Calculating the mAP for the Vcam worker from the results above
we obtain Table 15.
Table 15: The mean values, taken over all classes, of the area under the PR-
curve (mAP) of the Vcam worker detector for the different distance
bins including the average of these values.
Once again, this is not far from the 0.7097 mAP of the IRcam worker detector. The result is also close to what is presented in [18], where a mAP of 0.66 is achieved, albeit using a detector with drones as the only output class and giving no information about the sensor-to-target distances.
Table 16: Results from the related work and the IRcam and Vcam worker
detectors.
Figure 32: An airplane detected and classified correctly by the Vcam worker at a sloping distance of more than 35000 m. The reason why the ADS-B FoV-target distance display is empty is that the main script will not present that information until the target is within 30000 m horizontal distance. This limit is set based on the assumption that no target beyond 30000 m should be detectable, an assumption that this detection shows does not always hold.
The most frequent problem of the video part, when running the drone detector system outside, is the autofocus feature of the video camera. Unlike for the Fcam and IRcam, clear skies are not the ideal weather, but rather a scenery with some objects that can help the camera set the focus correctly. However, note that this does not affect the evaluation results of the Vcam worker detector performance, as presented above, since only videos where the objects are seen clearly, and hence are possible to annotate, are used.
Figure 33: The drone detected only by the IRcam since the autofocus of the
video camera is set wrongly.
Figure 34: Airplanes, a helicopter and birds detected when running the
Vcam worker detector as a stand-alone application on an aircraft
video from [59]. Note the FPS-performance.
1 This is a threshold that controls which pixels are to be considered as part of the foreground or the background
Figure 37: The results panel just after Figure 36. Note the time stamp.
Using the 297 clips of the evaluation audio dataset and the Matlab
confusionchart-function we obtain the confusion matrix shown in
Figure 38.
Figure 38: Confusion matrix from the evaluation of the audio classifier
So from this, we can put together Table 17 with the precision and
recall results for the different classes and can thereby also calculate
the average over all classes.
Table 17: Precision and recall from the evaluation of the audio classifier.
Table 18: Results from the related work and the audio worker classifier.
From the datasheet of the K-MD2 [47], we have that it can detect a person with a Radar Cross Section (RCS) of 1 m² up to a distance of 100 m. Since we have from [29] that the RCS of the F450 drone is 0.02 m², and since the detection range scales with the fourth root of the target RCS, it is straightforward to calculate that, theoretically, the F450 should be possible to detect up to a distance of
$$\sqrt[4]{\frac{0.02}{1}} \cdot 100\ \mathrm{m} = 37.6\ \mathrm{m}.$$
Furthermore, given that the micro-Doppler echoes from the rotors are 20 dB (a factor of 100 in power) below that of the drone body, these should be detectable up to a distance of
$$\sqrt[4]{\frac{0.02}{1 \cdot 100}} \cdot 100\ \mathrm{m} = 11.9\ \mathrm{m}.$$
In practice, the F450 drone is detected and tracked by the K-MD2 up to a maximum distance of 24 m, as shown in Figure 39. This is, however, the maximum recorded distance, and it is observed that the drone is generally detected up to a distance of 18 m.
Figure 39: The maximum recorded detection distance of the K-MD 2 radar
module against the F450 drone
Figure 41: The echo of a person walking in front of the radar module
3.2 sensor fusion and system performance

The choice of early or late sensor fusion has also been investigated.
By early sensor fusion, we here mean to fuse the images from the IR-
and video cameras before processing them in a detector and classi-
fier. Late sensor fusion will, in this case, be to combine the output
results from the separate detectors running on the IR- and video cam-
era streams, respectively.
Figure 42: From the left: The video image, the thermal infrared camera im-
age and the fused image.
have not been possible to achieve within the scope of this thesis.
Evaluating the system at the system level has also turned out to be even harder than expected due to the current situation. For example, all regular flights to and from the local airport have been cancelled, and hence the possibility of a thorough system evaluation against airplanes decreased drastically.
Comparing the system results after the sensor fusion1 with the out-
put from the respective sensors, we can observe that the system out-
puts a drone classification at some time in 78% of the detection op-
portunities. Closest to this is the performance of the Vcam detector
that outputs a drone classification in 67% of the opportunities.
1 Having all sensors included with weight 1.0, and the minimum number of sensors
set to two
Table 20: False detections appearing in a ten-minute-long section of screen recording from an evaluation session, including the type of object causing the false detection.
The false detections caused by insects flying just in front of the sen-
sors are very short-lived. The ones caused by clouds can last longer,
sometimes several seconds. Figure 43 shows the false detections of
the IRcam at 02:34 and the Vcam at 02:58 from Table 20.
Figure 43: The false detections of the IRcam at 02:34 and the Vcam at 02:58.
Nevertheless, after the sensor fusion, the system output class is observed to be robust, as shown in Figure 44 and Figure 45, where the IRcam and Vcam classify the drone incorrectly but the system output is still correct.
Since the output class depends on the confidence score, the result is
sometimes also the opposite, as shown in Figure 46, so that one very
confident sensor causes the system output to be wrong. If this turns
Looking at Figure 49, we can see that when the airplane comes
within 30000 m horizontal distance from the system the ADS-B infor-
mation will appear in the results panel. At this moment, the sloping
distance is 32000 m and the offset between the camera direction and
the calculated one is zero. Moreover, since the system has not yet re-
ceived the vehicle category information at this moment the target is
marked with a square in the ADS-B presentation area, and the confi-
dence score of the ADS-B result is 0.75.
The next interesting event, shown in Figure 50, is when the system
receives the vehicle category message. To indicate this, the symbol in
the ADS-B presentation is changed into a circle, and the confidence is
set to one since we are now sure that it is an airplane. At a distance
of 20800 m, it is also detected and classified correctly by the IRcam
worker, as shown in Figure 51.
Figure 44: A correct system output even if the Vcam classifies the drone as being a bird.
Figure 45: A correct system output even if the IRcam classifies the drone as being a helicopter.
Figure 46: The high confidence score of the audio classifier causes the system output to be incorrect, just as the audio output.
Figure 47: The time smoothing part of the sensor fusion reduces the effect of an occasional misclassification, even if it has a high confidence score.
Figure 48: The drone is misclassified by several sensors at the same time.
Figure 49: When the airplane is within 30000 m horizontal distance, the ADS-B information is presented in the results panel.
Figure 50: The vehicle category message has been received by the system, so the confidence for the airplane classification is set to 1. The airplane is now at a distance of 23900 m.
Figure 51: The airplane is detected by the IRcam worker at a distance of 20800 m.
3.3 drone detection dataset
Since one of the objectives of this thesis is to collect, compose and pub-
lish a multi-sensor dataset for drone detection this is also described
as part of the results chapter.
Bin       Class
Close       9    10    24    15
Medium     25    23    94    20
Distant    40    46    39    20

Bin       Class
Close      17    10    21    27
Medium     17    21    68    24
Distant    25    20    25    10
This is done to ensure that the objects are clearly visible when an-
notating the videos. The visible videos have a resolution of 640x512
pixels.
The filenames of the videos start with the sensor type, followed by the target type and a serial number, e.g. IR_DRONE_001.mp4. The annotation of the respective clip has the additional name LABELS, e.g. IR_DRONE_001_LABELS.mat. These files are Matlab GroundTruth objects, and using the Matlab video labeller app, the videos and respective label files can easily be opened, inspected and even edited. If the dataset is to be used in another development environment, the label files can be opened in Matlab, and the content can be copy-pasted into Excel and saved in the desired format, for example .csv. Importing .mat files into the Python environment can also be done using the scipy.io.loadmat command.
To retrieve the images from the videos and form a dataset of individual images, the selectLabels and objectDetectorTrainingData functions are recommended.
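A minimal sketch of this step is shown below, following the file naming convention described above; the variable name inside the label file and the label name are assumptions.

    % Turn one of the provided label files into a training table (sketch).
    gt     = load('IR_DRONE_001_LABELS.mat');      % contains a groundTruth object
    gTruth = gt.gTruth;                            % assumed variable name inside the file

    gtDrone      = selectLabels(gTruth, 'DRONE');  % assumed label name
    trainingData = objectDetectorTrainingData(gtDrone, ...
        'SamplingFactor', 1);                      % extract every annotated frame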
It has not been possible to film all types of suitable targets, so, as
mentioned in Section 2.6, some clips are taken from longer aircraft
videos downloaded from [59]. This is a total of 49 clips (11 airplane
and 38 helicopter clips) in the visible video dataset.
Since the distance bin information of a clip is not included in the filename, there is also an associated Excel sheet where this is shown in a table. This table also contains information about the exact drone type and whether the clip comes from the Internet or not.
Figure 52: Two drones detected and classified correctly by the IRcam
worker.
4 DISCUSSION
An unforeseen problem occurring when designing the system was actually of a mechanical nature. Even though the system uses a pan/tilt platform with ball bearings and very high-end titanium-gear digital servos, the platform was observed to oscillate in some situations. This phenomenon was mitigated by carefully balancing the tilt platform and by introducing some friction in the pivot point of the pan segment. It might also be the case that such problems could be overcome using a servo programmer. Changing the internal settings of the servos can also increase their maximum range from 90° to 180°. This would extend the volume covered by the IRcam and Vcam, so that all targets tracked by the Fcam could be investigated, not just a portion of them, as now.
One thing not explored in this thesis is the use of the Fcam together with the audio classifier as a means to output the position and label of a system detection. Implementing a YOLOv2 detector on the Fcam could also be considered. However, a dataset for training this must either be collected separately or be created by skewing images from the visible video dataset so that the distortion of the Fcam image is matched. Neither is the performance of the audio classifier as a function of sensor-to-target distance explored in the same way as for the IR and visible sensors.
Due to the very short practical detection range, the RADAR module was unfortunately not included in the final system setup. Having a RADAR with a useful range would have contributed significantly to the system results, since it is the only one of the available sensors able to measure the distance to the target efficiently. Another way to increase the efficiency of the system could be to exploit the temporal dimension better, i.e. to use the flight paths and the behaviour of the objects to classify them better.
Figure 53: A drone from the "spy birds" programme led by Song Bifeng, professor at the Northwestern Polytechnical University in Xi'an. From [64].
The work done in this thesis is also applicable to other areas. One such area that immediately springs to mind is road traffic surveillance. Except for the ADS-B receiver, all other parts and scripts could be adapted and retrained to detect and track pedestrians or just a specific vehicle type, e.g. motorcycles.
Due to the lack of a publicly available dataset, the other main con-
cern of this thesis was to contribute with such a multi-sensor dataset.
This dataset is especially suited for the comparison of infrared and
visible video detectors due to the similarities in conditions and target
types in the set.
BIBLIOGRAPHY

[7] S. Samaras et al. Deep learning on multi sensor data for counter UAV applications — a systematic review. Sensors, 19(4837), 2019.
[19] C. Aker and S. Kalkan. Using deep networks for drone detection. 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2017.
[21] J. Kim et al. Real-time UAV sound detection and analysis system. IEEE Sensors Applications Symposium (SAS), 2017.
[23] S. Park et al. Combination of radar and audio sensors for identification of rotor-type unmanned aerial vehicles (UAVs). IEEE SENSORS, 2015.
[49] J. Redmon et al. You Only Look Once: Unified, real-time object detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.