Commit e1dfb964 authored by Jules Saget

Work on report

parent 8f8cdf76
@@ -23,9 +23,11 @@
backend = biber,
style = alphabetic,
%style = alphabetic,
sorting = nyt,
maxbibnames = 99,
maxcitenames = 3,
@@ -46,11 +48,11 @@
\title{--- CEPHALOPODE ---\\
Design of execution unit in\\
@@ -62,7 +64,7 @@
@@ -89,40 +91,83 @@
My main task was to design an Arithmetical Unit that could handle arbitrary
precision integers.
Arbitrary precision is traditionally implemented in software and we have
not been able to find any documented hardware implementation.
By the end of the internship, the AU could perform additions, subtractions
and multiplications.
\paragraph{Note:} This internship led to the submission of two papers:
\cite{CEPHALOPODE} and \cite{stately}.
Some of the content of this report is directly taken from what I have written
for these papers.
When a part is mostly taken from one of these papers, the reader will be warned
by a note saying so.
\section*{Preamble: conditions of the internship}
Because the internship took place during the 2020 coronavirus outbreak, I
believe it is relevant to describe the conditions I worked in.
I worked at Chalmers for two weeks before it closed.
From that point, I worked from my apartment in Sweden.
I had a meeting with Carl Seger and Jeremy Pope every week and a meeting with
the rest of the Octopi team every two weeks.
I went back to France at the beginning of June, when it was clear that the
university would not reopen by the end of the internship.
The division meetings with the whole team stopped, but the meeting frequency
with Carl and Jeremy stayed the same.
Due to the stress brought by the pandemic and the fact that I had no proper
office in my Swedish apartment (I had to work at my kitchen table), I
sometimes lost focus and struggled to work for a few days.
I had to take a week off in May to get back to work.
Apart from that, there were no other difficulties caused by the situation.
The Internet of Things (IoT) is a system of interconnected (embedded) devices
that can communicate with each other without human interaction.
The IoT domain is relatively young and is expanding into many products (cars,
watches, home assistants, etc.).
Compared to most personal devices, the security of IoT devices is critical
because potential attackers can gain access to sensitive information.
There are already examples of such vulnerabilities: access to the content of
children's tablets \cite{VTech}, a smart fridge hacked to reveal Gmail
passwords \cite{fridge} and even remotely controlled cars \cite{cars}.
To prevent this, the Octopi team is building a secure environment for
designing IoT devices.
To guarantee security properties without asking the developers to use complex
methods (such as higher-order logic), the idea is to use high-level pure lazy
functional languages, like Haskell, where security and privacy guarantees can
be proved more easily.
Unfortunately, running a functional language on traditional hardware is very
demanding and thus not suitable for IoT devices (some of them need to run on
battery for several years).
This is why the Octopi team is designing a custom processor that can run a
functional language with respect to these three constraints: energy consumption,
then soundness, then performance\footnote{Performance is still a good thing to
have, but we are ready to trade it for power efficiency.}.
To achieve this, our idea was to design a processor based on combinator graph
reduction (see subsection \ref{subsec:progrep} or \cite{Pey87} for more
details) using \emph{lazy evaluation}: this guarantees that the processor will
only compute something when needed, leading to minimal energy consumption.
One of the goals of the Octopi project is to find out whether this trade-off is
relevant in the case of IoT devices.
To add security, we wanted to prevent overflows in integer operations.
To do so, we decided to use arbitrary precision integers, which consist of
@@ -131,36 +176,234 @@
a list of digits of base $2^{\texttt{size\_of\_int}}$, see subsection
\paragraph{My contribution}
What I did was mostly design the Arithmetical and Logical Unit (see section
\ref{sec:au}).
The main objective of the internship was to come up with basic implementations
of arithmetical operations, so that proving their correctness would be easier.
A secondary objective was to find out the tools we did not have (and therefore
should make) that could help us design hardware, such as
\section{Tools and theory}
I do not know what to write in here.
Because our processor will only execute functional languages, we decided to
step away from the traditional Von Neumann architecture and use a graph
reduction architecture \cite{sem&prag}, in which there is no difference between
program and data.
Before explaining how our architecture works, I first need to describe how we
represent a functional language program at a low level.
\subsection{Functional assembly code}
A program written in a functional language is in practice a
$\lambda$-expression.
As described in \cite{Pey87}, we can compile this expression into another one
that does not contain variables, but instead contains combinators.
To give an intuition of how combinators work, let us have a look at \emph{S},
\emph{K} and \emph{I}.
Here is how they are defined:
I x & \rightarrow_\beta x \\
K x y & \rightarrow_\beta x \\
S x y z & \rightarrow_\beta x z (y z) \\
Their behaviour is shown in figure \ref{fig:ski}.
Note that \emph{K} and \emph{I} create an \emph{indirection node} (represented
by the $=$ symbol), which is similar to a symbolic link in Linux.
These nodes ensure laziness is not lost \cite{Pey87}.
Intuitively, these nodes are invisible and are simply bypassed when the content
of a node is needed.
\caption{Behavior of the \emph{S}, \emph{K} and \emph{I} combinators}
Consider this $\lambda$-expression:
(\lambda x.x+1) 2
After compiling this to combinators using a standard “bracketing” approach
\cite{Pey87}, we would obtain the leftmost graph shown in figure
\caption{A simple graph reduction example}
Note that the reduction of \emph{I} is done before the application of $+$.
This is because addition is strict in its arguments: it will use them both, so
both of them need to be reduced.
The way we handled the order of the reductions is described in
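To make these rewrite rules concrete, here is a minimal Python sketch of graph reduction (an illustrative software model, not the actual circuit): application nodes are rewritten in place, and \emph{K} and \emph{I} leave indirection nodes behind, as in figure \ref{fig:ski}. The node shapes and names are assumptions of the sketch.

```python
# Illustrative model only: nodes are mutable lists so that rewriting a node
# in place updates every reference to it, as in a real graph reducer.
# ['@', f, a] application, ['C', name] combinator,
# ['=', t] indirection,   ['#', n] integer.

def mk_app(f, a):
    return ['@', f, a]

def follow(n):
    # Bypass indirection nodes, as the memory does transparently.
    while n[0] == '=':
        n = n[1]
    return n

def reduce_whnf(root):
    """Reduce a graph to weak head normal form."""
    spine = []                        # application nodes unwound so far
    n = follow(root)
    while True:
        while n[0] == '@':            # unwind the spine
            spine.append(n)
            n = follow(n[1])
        if n[0] == 'C' and n[1] == 'I' and len(spine) >= 1:
            top = spine.pop()
            top[:] = ['=', follow(top[2])]            # I x => = x
        elif n[0] == 'C' and n[1] == 'K' and len(spine) >= 2:
            x = follow(spine.pop()[2])
            top = spine.pop()
            top[:] = ['=', x]                         # K x y => = x
        elif n[0] == 'C' and n[1] == 'S' and len(spine) >= 3:
            x = spine.pop()[2]
            y = spine.pop()[2]
            top = spine.pop()
            z = top[2]
            top[:] = ['@', mk_app(x, z), mk_app(y, z)]  # S x y z => x z (y z)
        else:
            return follow(root)       # nothing left to rewrite at the head
        n = follow(top)

# S K K x reduces to x, since S K K behaves like I:
expr = mk_app(mk_app(mk_app(['C', 'S'], ['C', 'K']), ['C', 'K']), ['#', 42])
assert reduce_whnf(expr) == ['#', 42]
```

Note how the \emph{K} rule leaves an indirection node in place of the application, so any other part of the graph sharing that node sees the result without recomputing it.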
\subsection{Low-level data representation}
\subsection{Processor structure}
In a graph reduction engine, there is no difference between \emph{instructions}
and \emph{data}: everything is a node.
In our architecture, nodes are currently stored in 27 bits: the first three
bits determine the type of the node and how to read the remaining bits.
The different types include:
\item Application node (@): contains two addresses to nodes $u$ and $v$ and
means that $u$ is applied to $v$.
\item Combinator node: contains a combinator name (\emph{S}, \emph{K} or
\item Primitive node: contains a primitive function performed by the ALU:
arithmetic operations, integer comparisons, logical operators (and, or,
not, if then else).
\item Indirection node (=): points to another node.
\item Integer node: contains a value (stored in 16 bits) and an address to the
next part of the integer. More details in subsection \ref{subsubsec:api}.
Apart from indirection nodes, which are handled directly by the memory, all
nodes are decoded by the reduction engine and processed accordingly.
When it detects the application of a primitive function to two or three
arguments, it sends the data to the ALU to be processed.
\subsubsection{Arbitrary precision integers}
\emph{Note: most of this section is taken from \cite{CEPHALOPODE}}
Before describing how arbitrary precision works, I shall give some motivation
on why we decided to go with it in the first place.
A common function among many IoT programs is to filter input data
and compare the resulting value against some threshold.
For example, consider a climate-control IoT device that should allow the user to
set the desired temperature to anything between -30 and +30 degrees centigrade
on their smartphone.
The IoT device takes a temperature reading every minute.
In order to smooth out transient readings, the program averages the
last four readings.
This average computation is done by adding the four values together and
dividing the sum by 4.
If the average is less than or equal to some set temperature the IoT device
starts a heater, and if the average is greater than the set temperature
it turns the heater off.
Suppose now that the initial design has a heater that
can change the temperature in the room by at most one degree per minute
and that all computations are performed with 8-bit 2's complement arithmetic.
With some effort, it is not too difficult to show that this IoT controlled
system will guarantee that the temperature never goes above 33 degrees.
Suppose now that the heater is replaced with a more efficient one that
can raise the temperature 3 degrees per minute.
Now the system can fail disastrously.
To see this, consider the situation in which the system is turned on,
set to 30 degrees, in a room that is already at 30 degrees.
The heater will turn on and after one minute the room will be at 33 degrees.
However, the average will still be 30 (remember we are using integer
arithmetic) and thus the heater will remain on.
After two minutes, the temperature in the room will be 36 degrees and
now we encounter a problem.
The summation in the average computation will overflow!
As a result, the average will be -32 and thus the heater will stay on,
likely burning out the system or possibly even igniting a fire!
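The failure is easy to reproduce in software. The sketch below models every intermediate sum with 8-bit 2's complement wraparound; the reading sequence is one plausible instance of the scenario (the exact average depends on which four readings are summed).

```python
def wrap8(x):
    """Interpret x modulo 256 as an 8-bit 2's complement value."""
    x &= 0xFF
    return x - 256 if x >= 128 else x

def avg4(readings):
    """Average the last four readings, with 8-bit intermediate sums."""
    s = 0
    for r in readings[-4:]:
        s = wrap8(s + r)              # every intermediate sum is 8 bits wide
    return s // 4

assert avg4([30, 30, 30, 30]) == 30   # steady state: no problem
# With the 3-degree-per-minute heater the readings climb 30, 33, 36, 39:
# the true sum 138 no longer fits in 8 bits and wraps around,
# so the average goes negative and the heater stays on.
assert avg4([30, 33, 36, 39]) < 0
```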
This type of unintended side effect is very difficult to anticipate, since the
original system worked perfectly fine: it is a good example of how overflow can
cause unintended behaviour with potentially catastrophic consequences.
Similarly, overflows can easily lead to array bounds checks not catching
illegal access attempts, thus allowing access to confidential data, which is
particularly troublesome for IoT devices.
With this in mind, we decided that Cephalopode should use arbitrary
precision arithmetic natively in the processor.
In other words, arbitrary precision integers should be supported directly in
the hardware, and no fixed-size integers would be available.
Arbitrary precision arithmetic is traditionally implemented in software
and we have not been able to find any documented dedicated arbitrary precision
hardware implementation.
We believe that not having dedicated hardware makes operations significantly
slower: our contention is that managing arbitrary precision directly
in hardware will make it fast enough so that all arithmetic can be done
using it.
One design goal we had for the arithmetic unit was to make arithmetic
on small integers (ones that fit in a single chunk) as fast as possible
and (ideally) no slower than had we implemented a fixed integer format.
In the remainder of this section, we describe the arbitrary
precision arithmetic and logic unit (ALU) that we designed
to prevent overflows as much as possible.
For Cephalopode, we selected 2's complement as the underlying
integer representation, but split a number into a list of chunks.
In other words, we represent an integer as a list of
\texttt{(int, address)} pairs.
Since we can use as many fixed precision integer chunks as
we need, we can represent numbers of any size.
With this representation, overflows can only occur when there is no
more memory in the system, which makes them much less likely to occur
and, most importantly, a system-wide issue rather than an arithmetic issue.
To motivate our representation, consider how numbers are represented
with various numbers of bits, as shown in Table~\ref{tab:arith_2cpl}.
When going from 2 bits to 3 bits, only the leftmost bit needs to be
copied (sign extension).
In fact, this property holds true for every number of bits: we can always sign
extend existing representations when we increase the number of bits.
This is what allows us to cut integers into chunks.
To illustrate the representation, assume we use chunks of 3 bits, and
try to represent the number -42.
\item Write 42 in binary: \texttt{0101010}.
\item Perform 2's complement: \texttt{1010110}.
\item Cut in chunks of three bits, starting with the least significant part:
\item Sign extend the most significant chunk to three bits:
In hardware, integers are effectively represented as an \texttt{(int,
address)} pair (where \texttt{int} has a fixed size) with a special address
\texttt{addr\_end} that marks the end of the integer.
Integers are linked using little-endianness: the first element in the list
represents the least significant chunk.
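As an illustrative software model of this representation (3-bit chunks, matching the -42 example above; the hardware uses chunks of $\texttt{size\_of\_int}$ bits), the conversions in both directions can be sketched as follows. The function names are assumptions of the sketch.

```python
def to_chunks(n, bits=3):
    """Split n into little-endian 2's complement chunks of `bits` bits."""
    chunks = []
    while True:
        chunks.append(n & ((1 << bits) - 1))
        n >>= bits                    # Python's shift is arithmetic: sign kept
        sign = (chunks[-1] >> (bits - 1)) & 1
        # Stop once the rest of n is pure sign extension of the top chunk.
        if (n == 0 and sign == 0) or (n == -1 and sign == 1):
            return chunks

def from_chunks(chunks, bits=3):
    """Reassemble a little-endian chunk list into a Python integer."""
    n = 0
    for i, c in enumerate(chunks):
        n |= c << (i * bits)
    total = len(chunks) * bits
    if (n >> (total - 1)) & 1:        # top chunk's sign bit set: negative
        n -= 1 << total
    return n

# -42 = 2's complement 1010110, cut into 110, 010 and sign-extended 111:
assert to_chunks(-42) == [0b110, 0b010, 0b111]
assert from_chunks([0b110, 0b010, 0b111]) == -42
```

Note that `to_chunks` already produces the canonical form: it never emits a trailing chunk that is pure sign extension.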
Whether to use little-endian or big-endian is a trade-off between
efficiency of arithmetic operations versus comparison operations.
Arithmetic operations prefer little-endian because of carry propagation.
On the other hand, to determine an integer's sign, the most significant chunk
is needed first.
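To illustrate why little-endian is convenient for arithmetic, here is a sketch of chunk-wise addition (a software model with 3-bit chunks, not the hardware adder): the carry produced by each chunk feeds directly into the next chunk read from the list, and one extra output chunk absorbs a possible overflow of the top chunk.

```python
def add_chunks(a, b, bits=3):
    """Add two little-endian 2's complement chunk lists."""
    mask = (1 << bits) - 1
    # A negative number extends with all-ones chunks, a positive one with
    # zero chunks: plain sign extension, one chunk at a time.
    ext = lambda xs: mask if (xs[-1] >> (bits - 1)) & 1 else 0
    out, carry = [], 0
    for i in range(max(len(a), len(b)) + 1):  # +1 chunk absorbs overflow
        s = (a[i] if i < len(a) else ext(a)) \
          + (b[i] if i < len(b) else ext(b)) + carry
        out.append(s & mask)
        carry = s >> bits
    return out                        # may be non-canonical: needs cleaning

# 5 is [101, 000] (a zero chunk is needed) and 3 is [011]; 5 + 3 = 8:
assert add_chunks([0b101, 0b000], [0b011]) == [0b000, 0b001, 0b000]
```

The result may contain redundant trailing chunks (for example, $-42 + 42$ comes out as four zero chunks), which is why a cleaning step is needed afterwards.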
Finally, we require integers to be stored as efficiently as possible and in
a canonical form: it is forbidden to have \texttt{111·111·010·110} because
the last chunk is redundant.
This is important because of the following property: when integers are stored
with the minimal number of chunks, if two numbers have the same sign then the
strictly longer one in terms of chunks has the greater absolute value.
To ensure this property, the ALU includes a ``cleaning'' unit that
deletes redundant trailing chunks if needed.
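As a software model of this cleaning step (same little-endian 3-bit-chunk lists as above; a sketch, not the hardware unit), a trailing chunk is redundant exactly when it is pure sign extension of the chunk below it:

```python
def clean(chunks, bits=3):
    """Drop redundant trailing chunks so the representation is canonical."""
    mask = (1 << bits) - 1
    while len(chunks) > 1:
        top, below = chunks[-1], chunks[-2]
        sign_below = (below >> (bits - 1)) & 1
        if (top == 0 and sign_below == 0) or (top == mask and sign_below == 1):
            chunks.pop()              # top chunk carries no information
        else:
            break
    return chunks

# The forbidden 111·111·010·110, stored little-endian, loses its last chunk:
assert clean([0b110, 0b010, 0b111, 0b111]) == [0b110, 0b010, 0b111]
assert clean([0, 0, 0, 0]) == [0]     # canonical zero is a single chunk
```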
\subsection{Tools for designing hardware}
\section{Designing the arithmetical unit}
\subsection{General structure}
\subsection{Addition Unit}