Designing efficient machine learning algorithms for near-sensor data processing on
the edge has been at the research forefront in recent years. To achieve the required
edge processing constraints, massively parallel binary neural networks have been
developed. Binary neural networks implemented in purely combinational circuits
provide efficient resource utilization and high performance. This thesis describes and investigates massively parallel combinational binary neural network logic and its use in real-world deployments, including the training and construction of networks for a variety of examples. A high-level synthesis toolchain
is designed, which enables users to produce the hardware description language
models of combinational binary neural network circuits directly from application
datasets. Standard and optimized combinational architectures are built for
different edge processing applications by using this toolchain. For machine vision, Ethernet packet classification, and experimental physics as edge processing examples, hardwired Verilog hardware description language code is generated using the toolchain and synthesized for an FPGA system to create designs for a set of concrete edge processing problems. Synthesis results show that massively
parallel binary neural networks use minimal resources, achieving inference delays below 30 ns (crucial for high-speed applications), power consumption below 2 W, and fewer than 60,000 FPGA slices. This shows that parallel
binary neural networks enable efficient hardware machine learning performance
for a variety of edge processing problems. However, both these examples and previous work lead to the conclusion that more efficient circuit design and optimization algorithms are still needed. Therefore, I design, describe, and implement in the toolchain three novel optimization techniques that require fewer adders and
overall operations for parallel neuron activation computations. The first proposed
optimization algorithm looks for similarities between the neurons to reduce the number and size of adders needed. It achieves a 39.9 % improvement in terms of
FPGA slice usage, a 28.2 % improvement in nets used, and a 51.9 % reduction in
power consumption compared to the naive implementation. Using the second optimization algorithm, called the genetically optimized ripple architecture, networks are constructed and trained to classify Ethernet packets efficiently for intrusion detection systems. Shallow,
single-hidden-layer binary neural networks are trained on benchmark NSL-KDD
and UNSW-NB15 datasets and achieve accuracy rates (77.77 % to 98.96 %) comparable
to those of similar compact networks used for detecting intrusions. These
networks are then implemented in FPGA using this novel combinational ripple
architecture, which is optimized using a genetic algorithm and exploits neuron-to-neuron similarities to achieve state-of-the-art performance in terms of resource usage (8,606 to 17,990 lookup tables) and classification latency (16–19 ns). With
the third optimization algorithm, I present the development and simulation of a ship-detecting edge-processing system for deployment on an aerial FPGA platform.
A ship detection chain was developed with imager-specific pre-processing
algorithms, massively parallel FPGA neural network inference, and host post-processing
procedures. The ship detection binary neural network is implemented in combinational logic, which enables high frame and detection rates and achieves 93.59 % patch classification accuracy. A new algorithm for optimizing a combinational
binary neural network circuit is presented that merges multiple neurons in
a network layer by taking advantage of similarities between neuron weights, which
leads to lower adder logic size and power consumption. Thus, state-of-the-art
performance is achieved in comparison to the naive implementation and similar
previous works using combinational binary neural networks, achieving 38.2 ns
inference latency, 0.425 W of power dissipation, and only 19,000 FPGA slices.