Biometric approach to user identification
Rapid developments in IT, DLT, and AI are prompting the biometrics industry to innovate constantly and make the most of market demand. According to recent reports, the global biometrics market is forecast to reach between $82.8 billion and nearly $100 billion by 2027, growing at a Compound Annual Growth Rate (CAGR) of more than 19.3% from an estimated $24.1 billion in 2020. The same reports project that the multimodal biometric systems segment will grow in revenue at a significant CAGR during the forecast period.
In terms of authentication type, voice recognition is expected to see significant growth due to consumer demand for a safer identity mechanism. Facial recognition is also poised for growth, boosted by the launch of Apple’s Face ID system.
In 2020, the global market for mobile biometrics was estimated at $18 billion, and it is projected to reach a revised size of $79.8 billion by 2027, growing at a CAGR of 23.7% over the analysis period 2020–2027. Growth in the scanner segment is readjusted to a revised 20.1% CAGR for the next seven-year period.
Furthermore, the post-COVID-19 global digital identity verification market is forecast to grow from $7.6 billion in 2020 to $15.8 billion by 2025, at a CAGR of 15.6%.
The ability to privately secure user authentication through biometrics has been the goal of many cryptographic researchers. For the last two decades, cryptographers have concentrated their efforts on solving the problem of biometric protection against malicious activities of the verifier. Solutions like BioHashing, Biometric Cryptosystems, and cancelable biometrics were all researched and proven to be inefficient or insecure for a hypothetical user (G. Davida et al., 1998; N. Ratha, J. Connell & R.M. Bolle, 2001, 2002; A.T.B Jin, D.N.C Ling & A. Goh, 2004; A. Kong, 2006; A.B.J. Teoh, Y.W. Kuan & S. Lee, 2008; C. Rathgeb & A. Uhl, 2011; M.A Syarif, et al., 2014; B.J. Jisha Nair & S. Ranjitha Kumari, 2015).
Until recently, biometric identification methods carried a heavy risk to personal privacy. Biometric data is considered very sensitive, as it can be uniquely associated with a human being. Passwords, by contrast, are not considered PII (Personally Identifiable Information), as they can be changed and are not directly associated with any person. In the past, the main risks of biometric matching stemmed from the fact that the biometric data had to be visible at some point during the process.
The privacy and security of the biometric data have been among the most critical aspects to take into account when deciding on a technology to use in Humanode. Biometric registration and authentication are carried out through a novel method based on cryptographically secure neural networks for the private classification of images of users' faces so that we can:
- Guarantee the image's privacy, performing all operations without the biometrics of the user's face having to leave the device.
- Obtain a certificate or proof that the operations are carried out correctly, without malicious manipulation.
- Resist different attacks, such as the Sybil attack and the replay attack.
- Carry out all registration and authentication operations without the need for a central entity or authority that handles the issuance and registration of users' cryptographic keys.
- Compare the feature vector each time the user wants to authenticate in a cryptographically secure way.
Let's now see how the different technologies that we use to perform the registration and authentication of users are broken down, guaranteeing privacy in a decentralized environment.
Traditionally, neural networks are used to identify an image. A neural network is a particular case of machine learning technique that consists of a series of so-called nodes structured in layers. These nodes or neurons are mathematical functions that perform a specific operation according to the layer they belong to.
For example, the convolutional layer is in charge of filtering the information to determine the similarity between the original image covered by a filter and the filter itself. The activation layer also determines if the filter pattern defined in the convolutional layer is present at a particular position in the image. There is also a layer called max-pooling that modifies the data to make it easier to handle.
When the user logs into the system for the first time, the neural network gives us a unique feature vector that identifies the user. Once this vector is registered, we can store it for future comparisons when the user wishes to authenticate.
The main objective of the biometric registration and authentication system is to protect the images of users throughout the whole process and on the different layers of the neural network. It is required that the operations are carried out effectively and efficiently, preventing unauthorized access to the data, from when it is obtained on the user's device to it being processed in the neural network and registered in the system.
A malicious user gaining access to the neural network should not be able to obtain any sensitive information. This is why Humanode's biometric system architecture is designed to run neural networks locally on the user's device and only send the proof that all the neural network layers were executed. The user will also send the neural network's output in the form of an encrypted feature vector.
Often referred to as CNNs or ConvNets, Convolutional Neural Networks specialize in processing data that is grid-like in topology, such as images.
A digital image consists of a series of pixels arranged in a grid-like format. Each pixel contains a binary value that denotes how bright it is and what color it should be.
The human brain processes enormous amounts of information as soon as it sees an image. Each neuron works in its own receptive field and is interconnected with other neurons so that the entire visual field is covered.
In the same way that each neuron in the biological vision system responds to stimuli only in its receptive field, each neuron in a CNN also processes information only within its receptive field. With a CNN, one can enable computers to sense simpler patterns (lines, curves, etc.) at the beginning and more complex patterns (faces, objects, etc.) as they progress.
There are four main types of layers in a CNN: the convolutional layer, the pooling layer, the fully connected layer, and activation layers.
The convolutional layer carries the bulk of the network's computational load.
Using this layer, we perform a dot product between two matrices, one that contains the set of learnable parameters, known as a kernel, and the other that contains the restricted portion of the receptive field.
In the case of an image composed of three (RGB) channels, the kernel height and width will be smaller than the image, but the depth will encompass all three channels.
During the forward pass, the kernel slides across the height and width of the image, producing an image representation of each receptive region. The result is a two-dimensional activation map that gives the response of the kernel at each spatial position of the image. The stride is the step size with which the kernel slides. If we have an input of size W x W x D, Dout kernels with a spatial size of F, a stride S, and an amount of padding P, the size of the output volume can be calculated as follows:
$$W_{out} = \frac{W - F + 2P}{S} + 1$$
This will yield an output volume of size Wout x Wout x Dout.
Figure 8. Convolution Operation (Source: Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville)
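The relation between input size, kernel size, stride, and padding can be checked with a few lines of Python. This is a minimal sketch, assuming the standard relation Wout = (W - F + 2P)/S + 1 and a single-channel image; the function names `conv_output_size` and `conv2d_single_channel` are hypothetical and are not part of Humanode's code.

```python
def conv_output_size(w: int, f: int, s: int = 1, p: int = 0) -> int:
    """Spatial output size W_out = (W - F + 2P) / S + 1 for square inputs."""
    span = w - f + 2 * p
    if span < 0 or span % s != 0:
        raise ValueError("kernel, stride and padding do not tile the input")
    return span // s + 1


def conv2d_single_channel(image, kernel, stride=1):
    """Naive sliding-window dot product of one channel with one kernel (no padding)."""
    w, f = len(image), len(kernel)
    out = conv_output_size(w, f, stride)
    return [[sum(image[r * stride + i][c * stride + j] * kernel[i][j]
                 for i in range(f) for j in range(f))
             for c in range(out)]
            for r in range(out)]


image = [[1, 2, 0, 1],
         [0, 1, 3, 1],
         [2, 1, 0, 0],
         [1, 0, 1, 2]]
kernel = [[1, 0],
          [0, 1]]
feature_map = conv2d_single_channel(image, kernel)  # 3 x 3 activation map
```

With a 4 x 4 input, a 2 x 2 kernel, stride 1, and no padding, the activation map is 3 x 3, as the formula predicts.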
In the pooling layer, summary statistics are derived from nearby outputs and used to replace the output of the network at certain locations. As a result, the size of the representation is reduced, which decreases the amount of computation and the number of weights. The pooling operation is applied to every slice of the representation in turn.
There are several pooling functions, such as the average of the rectangular neighborhood, the L2 norm of the rectangular neighborhood, and a weighted average based on the distance from the central pixel. Max pooling, however, which reports the maximum output from the neighborhood, is the most commonly used.
Figure 9. Example of Max-Pooling Operation
If we have an activation map with dimensions W x W x D, a pooling kernel of spatial size F, and a stride S, the size of the output volume can be determined by this formula:
$$W_{out} = \frac{W - F}{S} + 1$$
This generates an output volume of Wout x Wout x D.
The translation invariance of pooling makes it possible to recognize objects regardless of where they appear in the frame.
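The pooling step can be illustrated with a small sketch. It assumes the standard output-size relation Wout = (W - F)/S + 1 and a single square slice of the activation map; `pool_output_size` and `max_pool2d` are hypothetical helper names.

```python
def pool_output_size(w: int, f: int, s: int) -> int:
    """Spatial output size W_out = (W - F) / S + 1."""
    return (w - f) // s + 1


def max_pool2d(activation_map, f=2, stride=2):
    """Replace each f x f neighborhood with its maximum value."""
    out = pool_output_size(len(activation_map), f, stride)
    return [[max(activation_map[r * stride + i][c * stride + j]
                 for i in range(f) for j in range(f))
             for c in range(out)]
            for r in range(out)]


activation_map = [[1, 3, 2, 4],
                  [5, 6, 1, 2],
                  [7, 2, 9, 0],
                  [1, 4, 3, 8]]
pooled = max_pool2d(activation_map)  # [[6, 4], [7, 9]]
```

A 4 x 4 slice pooled with F = 2 and S = 2 shrinks to 2 x 2, a fourfold reduction in the data passed to the next layer.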
As in a regular fully connected neural network (FCNN), neurons in this layer are fully connected to all neurons in the preceding and following layers. Its output can therefore be computed as usual, by a matrix multiplication followed by a bias offset. This layer helps map the representation between the input and the output.
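As a minimal sketch of the "matrix multiplication followed by a bias offset", the layer can be written as y = Wx + b. The weights and inputs below are arbitrary illustrative integers, and `fully_connected` is a hypothetical name.

```python
def fully_connected(x, weights, bias):
    """Compute y = W x + b, with one row of weights per output neuron."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(weights, bias)]


x = [1, 2, 3]                      # flattened input representation
weights = [[1, 0, -1],             # weights of output neuron 0
           [2, 1, 0]]              # weights of output neuron 1
bias = [1, -4]
y = fully_connected(x, weights, bias)  # [-1, 0]
```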
Non-linear layers are often placed directly after the convolutional layer to introduce non-linearity to the activation map, due to the linear nature of convolution and the non-linear nature of images.
The sigmoid nonlinearity has the mathematical form σ(κ) = 1/(1 + e^(−κ)). It takes a real-valued number and "squashes" it into the range between 0 and 1. However, the gradient of the sigmoid is almost zero when the activation is at either tail. In backpropagation, if the local gradient becomes very small, it will effectively "kill" the gradient. Furthermore, because the output of the sigmoid is always positive, the inputs to the next layer are all positive, so the gradients on that layer's weights are either all positive or all negative, resulting in a zig-zag trend in gradient updates for the weights.
Tanh squashes a real-valued number into the range between −1 and 1. Like the sigmoid, its activation saturates, but unlike the sigmoid, its output is zero-centered.
In the last few years, Rectified Linear Units (ReLUs) have become very popular. A ReLU computes the function ƒ(κ) = max(0, κ). In other words, the activation is simply thresholded at zero. ReLU has been reported to converge about six times faster than sigmoid and tanh.
The disadvantage of ReLU is that it can be fragile during training. It can be updated by a large gradient in such a way that the neuron is never further updated. This can be addressed by setting a learning rate that is appropriate.
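The three nonlinearities discussed above can be compared directly. The sketch below uses only the Python standard library; `sigmoid_grad` is included to show how the sigmoid gradient vanishes at the tails.

```python
import math


def sigmoid(k: float) -> float:
    """Squash a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-k))


def sigmoid_grad(k: float) -> float:
    """Derivative of the sigmoid; close to zero when |k| is large."""
    s = sigmoid(k)
    return s * (1.0 - s)


def tanh(k: float) -> float:
    """Squash a real number into (-1, 1); zero-centered."""
    return math.tanh(k)


def relu(k: float) -> float:
    """Threshold the activation at zero."""
    return max(0.0, k)
```

For example, `sigmoid_grad(0.0)` is 0.25, while `sigmoid_grad(10.0)` is below 1e-4, which is the saturation effect that "kills" the gradient during backpropagation.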
The Humanode facial recognition system uses a modified ResNet architecture for facial feature extraction and cosine similarity for matching.
Cosine Similarity is a measurement that quantifies the similarity between two or more vectors. It is measured by the cosine of the angle between vectors and determines whether two vectors are pointing in roughly the same direction. The vectors are typically non-zero and are within an inner product space.
Cosine similarity is computed by dividing the dot product of the vectors by the product of their Euclidean norms (magnitudes).
In general, cosine similarity takes values in the constrained range between −1 and 1; it lies between 0 and 1 when all components of both vectors are non-negative. The similarity measurement is the cosine of the angle between the two non-zero vectors A and B.
Assume the angle between the two vectors is 90 degrees. The cosine similarity will be zero in that case. This indicates that the two vectors are orthogonal or perpendicular to each other. The angle between the two vectors A and B decreases as the cosine similarity measurement approaches 1. The image below illustrates this more clearly.
Figure 10. Two vectors with 96% similarity based on the cosine of the angle between the vectors.
Figure 11. Two vectors with 34% similarity based on the cosine of the angle between the vectors.
Humanode uses cosine similarity in the facial feature vector matching part.
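The measurement can be sketched in a few lines of Python. The vectors below are arbitrary illustrative values rather than real facial feature vectors, and `cosine_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


orthogonal = cosine_similarity([1, 0], [0, 1])       # vectors at 90 degrees
parallel = cosine_similarity([1, 2, 3], [2, 4, 6])   # same direction
opposite = cosine_similarity([1, 0], [-1, 0])        # opposite directions
```

Orthogonal vectors give 0, vectors pointing in the same direction give a value of 1 up to rounding, and vectors pointing in opposite directions give −1.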
Enterprises use face recognition for onboarding, validating, and approving customers due to its reliability and ease of use. The demand for liveness detection is growing rapidly. Liveness detection identifies presentation attacks like photo or video spoofing, deepfakes, 3D masks or models, rather than matching the facial features.
This makes it much harder for an adversary to spoof an identity. Facial recognition determines whether the person is unique and the same whereas liveness detection determines whether the person is a living human being. Liveness detection confirms the presence of a user’s identification credentials and that the user is physically present, whether on a mobile phone, a computer or tablet or on any camera-enabled device.
There are two methods in facial liveness detection: active and passive.
The active liveness detection method asks the user to do something to confirm that they are a live person. A user would normally be asked to change their head position, nod, blink, or follow a mark on their device’s screen with their eyes. Even so, fraudsters can fool the active method with a so-called presentation attack, the kind of attack that presentation attack detection (PAD) is meant to catch. Scammers can use various gadgets or "artifacts" to fool the system, some of which are remarkably low-tech.
The Humanode active liveness detection model asks the user to turn their face left or right, blink, or show emotions such as happiness, anger, or surprise, and determines whether the user is real or fake depending on the result.
With passive liveness detection the user is not asked to do anything. This provides end users with a modernized and convenient experience. It is an excellent method for determining whether the user is present without any specific movement or gesture. Passive methods use a single image, which is examined for an array of multiple characteristics to determine if a live person is present.
The Humanode passive liveness detection model determines whether a live person is present based on texture and local shape analysis, distortion analysis, and edge analysis:
- Texture and local shape analysis: analyzes the texture of the input image through image quality assessment, characterization of printing artifacts, and differences in light reflection.
- Distortion analysis: analyzes the input image using the IDA (image distortion analysis) feature vector, which consists of four different features: specular reflection, blurriness, chromatic moment, and color diversity.
- Edge analysis: analyzes the edges of the input image to find out whether an edge component is present.
Figure 12. Analysis types in liveness detection
While the active liveness detection process is going on, passive liveness detection is performed in the background.
By combining the advantages of active and passive liveness detection approaches, we made our liveness detection system more secure.
The use of biometrics, the science of analyzing physical or behavioral characteristics unique to each individual to recognize their identity, has many benefits. However, there are some risks associated with biometric authentication, which are as follows.
Table 2. Merits and demerits of biometric identification
When the user registers in the system, the executed private neural network extracts the feature vector from the user's face for the first time. It is essential to store this vector safely so that it can be evaluated each subsequent time the user wants to authenticate in the system. This storage must be encrypted. Moreover, we cannot decrypt the data to compare the new vector with the one already stored. For this, there is a method called homomorphic encryption.
Homomorphic encryption is an encryption algorithm with the additional characteristic that certain operations are preserved by encryption: operating on ciphertexts yields the encryption of the result of operating on the plaintexts.
In mathematics, an operation is preserved when we have a function between two spaces, each equipped with an operation, such that applying the operation and then the function gives the same result as applying the function and then the operation.
Formally, we say that $f$ from space $A$ to space $B$ is homomorphic if, given two elements $a, b \in A$, we have that:
$$f(a \cdot b) = f(a) \cdot f(b).$$
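A familiar function that preserves an operation is modular exponentiation, which turns addition of exponents into multiplication of group elements. The sketch below is only an illustration of the homomorphic property with toy parameters chosen here (`P` and `G` are assumptions); it is not an encryption scheme.

```python
P = 2039   # small prime modulus, chosen only for illustration
G = 4      # base of the exponentiation


def f(x: int) -> int:
    """Map an integer exponent x to the group element G^x mod P."""
    return pow(G, x, P)


a, b = 123, 456
lhs = f(a + b)            # apply the operation (+) first, then f
rhs = (f(a) * f(b)) % P   # apply f first, then the operation (*) in the target space
```

Both sides agree, so f maps addition in the source space to multiplication in the target space.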
This section will discuss a method used in neural networks to evaluate the similarity between two feature vectors. Then, we will define the homomorphic encryption method that will allow us to store the encrypted feature vector and perform the similarity operation without decrypting the vector.
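As mentioned above, one of the most efficient and natural ways to find the similarity between two feature vectors in neural networks is cosine similarity. Let $a = (a_1, \dots, a_n)$ and $b = (b_1, \dots, b_n)$ be two vectors in $\mathbb{R}^n$; the cosine similarity between $a$ and $b$ is defined by the equation
$$\cos(a, b) = \frac{\langle a, b \rangle}{\|a\| \, \|b\|}, \qquad (1)$$
where $\|a\| = \sqrt{\langle a, a \rangle}$ is the norm of the vector $a$.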
From (1), if we calculate the inner product between two vectors, we can directly determine whether they are similar. In simple terms, the cosine of the angle between two vectors tells us whether they point in the same direction.
If, in addition, the vectors are normalized ($\|a\| = \|b\| = 1$), then it is evident that:
$$\cos(a, b) = \langle a, b \rangle.$$
In the cryptobiometric authentication system, we must define an encryption scheme that allows us to calculate the inner product between two vectors, which will give us the similarity between them. This calculation will be carried out on the encrypted vectors without the need to decrypt them.
It is natural to look for a homomorphic encryption scheme where the calculations to determine similarity are performed in the encrypted space.
A traditional encryption scheme only protects the data in transit: the party computing the similarity would have to hold the private keys with which the users encrypted the data, decrypt the vectors, and then carry out the similarity calculation on clear data. From a decentralized perspective, this approach is flawed, because users' private keys would be exposed in an environment where peers are by nature untrusted. In a decentralized environment, there is no trusted third party to handle the keys securely.
There are different proposals for encryption schemes that preserve operations in a homomorphic manner through the encryption function. In particular, one of the most straightforward and efficient is encryption based on learning with errors (LWE). In this section, we look at the mathematical preliminaries of this cipher and the algorithms that compose it, namely:
- Key generation
- Encryption
- Decryption
- Homomorphic operations
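In group theory, a lattice in $\mathbb{R}^n$ is an algebraic subgroup of $\mathbb{R}^n$ that spans the vector space $\mathbb{R}^n$ with integer coefficients in its basis. Formally, let $B \in \mathbb{R}^{n \times n}$ be a matrix, and $b_i$ the $i$-th row of $B$ with $1 \leq i \leq n$. Then the set of integer linear combinations of the $b_i$, defined as
$$L(B) = \left\{ \sum_{i=1}^{n} m_i b_i : m_i \in \mathbb{Z} \right\},$$
is a subgroup of $\mathbb{R}^n$. If the $b_i$ are linearly independent, we say that $L(B)$ is a lattice in $\mathbb{R}^n$ of dimension $n$.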
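To make the definition concrete, the sketch below enumerates a finite window of a two-dimensional lattice, that is, all integer combinations of the basis rows with coefficients in a small range. The basis and the helper name `lattice_points` are illustrative assumptions.

```python
from itertools import product


def lattice_points(basis, coeff_range):
    """Return all sums m_1*b_1 + ... + m_n*b_n with integer m_i in coeff_range."""
    dim = len(basis[0])
    points = set()
    for coeffs in product(coeff_range, repeat=len(basis)):
        point = tuple(sum(m * b[k] for m, b in zip(coeffs, basis))
                      for k in range(dim))
        points.add(point)
    return points


B = [(2, 0), (1, 3)]                      # linearly independent basis rows b_1, b_2
points = lattice_points(B, range(-2, 3))  # coefficients m_i in {-2, ..., 2}
```

The point (3, 3) = b_1 + b_2 belongs to the lattice, whereas (1, 0) does not, because no integer coefficients produce it.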
Lattice-based ciphers are among the leading candidates for post-quantum cryptographic algorithms, that is, encryption schemes that can resist attacks even if an efficient quantum computer is ever built. In 1994, Shor demonstrated theoretically that a quantum computer could solve in polynomial time the problems on which most public-key ciphers, such as RSA, Diffie-Hellman, and elliptic-curve cryptosystems, are based.
The computational complexity of the problem that shapes cryptosystems based on lattices ensures their quantum resistance.
Furthermore, the LWE-based cryptosystem can be fully homomorphic: it is homomorphic with respect to both addition and multiplication. This is very useful for calculating the inner product and, consequently, the cosine similarity.
Let’s see in detail how the ring-LWE encryption scheme works and how the homomorphic operations are defined.
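First of all, we need to define certain general parameters to be used in the key generation algorithm:
- Set $n$ as a degree parameter.
- Let $q$ be a prime number, defining the ring $R_q = \mathbb{Z}_q[x]/(x^n + 1)$. This ring is the ciphertext space.
- Take $t$ as an arbitrary integer, with $t < q$, defining the ring $R_t = \mathbb{Z}_t[x]/(x^n + 1)$. This ring is the plaintext space.
- The standard deviation $\sigma$, as the parameter for the discrete Gaussian distribution $\chi = D_{\mathbb{Z}^n, \sigma}$.
First we sample random elements as follows:
- Sample $s$ from the Gaussian distribution $\chi$.
- Take a random $p_1 \in R_q$ and the error $e$ sampled from $\chi$.
Then the public key is defined as $pk = (p_0, p_1)$ with $p_0 = -(p_1 s + t e)$, and the secret key is $sk = s$.
After encoding the plaintext $m$ as an element in $R_t$ and given the public key $pk = (p_0, p_1)$, we sample $u, f, g$ from the distribution $\chi$ and compute
$$\mathrm{Enc}(m, pk) = (c_0, c_1) = (p_0 u + t g + m,\; p_1 u + t f).$$
If $c = (c_0, \dots, c_\xi)$ is a ciphertext and $sk = s$ the private key, then the decryption is simply
$$\mathrm{Dec}(c, sk) = [\tilde{m}]_q \bmod t, \qquad \tilde{m} = \sum_{i=0}^{\xi} c_i s^i,$$
where $[\cdot]_q$ denotes centered reduction modulo $q$. If we write the secret key vector $\mathbf{s} = (1, s, s^2, \dots, s^\xi)$, then $\mathrm{Dec}(c, sk) = [\langle c, \mathbf{s} \rangle]_q \bmod t$.
Now, if we have two elements in the encrypted space, $c = (c_0, \dots, c_\xi)$ and $c' = (c'_0, \dots, c'_\eta)$, the homomorphic operations are given by
$$c + c' = (c_0 + c'_0, \dots, c_{\max(\xi, \eta)} + c'_{\max(\xi, \eta)}), \qquad c * c' = (\hat{c}_0, \dots, \hat{c}_{\xi + \eta}),$$
where missing components are padded with zeros and $\sum_i \hat{c}_i z^i = \left(\sum_i c_i z^i\right)\left(\sum_j c'_j z^j\right)$ for a symbolic variable $z$.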
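The following toy implementation sketches how such a ring-LWE scheme fits together. It assumes the standard construction with public key (p0, p1), p0 = −(p1·s + t·e), ciphertext (p0·u + t·g + m, p1·u + t·f), and decryption by evaluating the ciphertext at s. The parameters N, Q, T, and SIGMA are illustrative values chosen here and offer no security, and the error samples are clamped to [−3, 3] so that decryption of this toy example always succeeds.

```python
import random

N = 8            # degree parameter n (toy size)
Q = 998244353    # ciphertext modulus q, a prime with q = 1 mod 2n
T = 16           # plaintext modulus t < q
SIGMA = 1.0      # standard deviation of the error distribution


def poly_mul(a, b, mod):
    """Multiply two polynomials in Z_mod[x]/(x^N + 1)."""
    out = [0] * N
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            if i + j < N:
                out[i + j] = (out[i + j] + a_i * b_j) % mod
            else:  # x^N = -1 wraps the term around with a sign change
                out[i + j - N] = (out[i + j - N] - a_i * b_j) % mod
    return out


def poly_add(a, b, mod):
    return [(x + y) % mod for x, y in zip(a, b)]


def scale(a, k, mod):
    return [(k * x) % mod for x in a]


def sample_chi(rng):
    """Rounded Gaussian sample, clamped to [-3, 3] (toy simplification)."""
    return [max(-3, min(3, round(rng.gauss(0, SIGMA)))) for _ in range(N)]


def keygen(rng):
    s = sample_chi(rng)
    p1 = [rng.randrange(Q) for _ in range(N)]
    e = sample_chi(rng)
    p0 = [(-x) % Q for x in poly_add(poly_mul(p1, s, Q), scale(e, T, Q), Q)]
    return (p0, p1), s


def encrypt(m, pk, rng):
    p0, p1 = pk
    u, f, g = sample_chi(rng), sample_chi(rng), sample_chi(rng)
    c0 = poly_add(poly_add(poly_mul(p0, u, Q), scale(g, T, Q), Q), m, Q)
    c1 = poly_add(poly_mul(p1, u, Q), scale(f, T, Q), Q)
    return [c0, c1]


def decrypt(c, s):
    """Evaluate sum_i c_i * s^i mod q, center the result, then reduce mod t."""
    acc, s_pow = [0] * N, [1] + [0] * (N - 1)
    for c_i in c:
        acc = poly_add(acc, poly_mul(c_i, s_pow, Q), Q)
        s_pow = poly_mul(s_pow, s, Q)
    centered = [x - Q if x > Q // 2 else x for x in acc]
    return [x % T for x in centered]


def he_add(c, d):
    """Component-wise addition of two ciphertexts of equal length."""
    return [poly_add(x, y, Q) for x, y in zip(c, d)]


def he_mul(c, d):
    """Product of two fresh (two-component) ciphertexts; the result has three components."""
    return [poly_mul(c[0], d[0], Q),
            poly_add(poly_mul(c[0], d[1], Q), poly_mul(c[1], d[0], Q), Q),
            poly_mul(c[1], d[1], Q)]


rng = random.Random(0)
pk, sk = keygen(rng)
m1 = [1, 2, 3, 0, 0, 0, 0, 0]
m2 = [4, 5, 0, 0, 0, 0, 0, 1]
ct1, ct2 = encrypt(m1, pk, rng), encrypt(m2, pk, rng)
```

Decrypting `he_add(ct1, ct2)` yields the sum of the plaintexts in the plaintext ring, and decrypting `he_mul(ct1, ct2)` yields their product. Each multiplication increases the noise, which is why the modulus q must be much larger than t.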
As we saw, the cosine similarity operation requires the calculation of the inner product in the encrypted space. Thus, let $F, Q$ be transformations onto the plaintext ring $R_t$ that map the vectors $A = (a_0, \dots, a_{n-1})$ and $B = (b_0, \dots, b_{n-1})$ to the polynomials
$$F(A) = \sum_{i=0}^{n-1} a_i x^i, \qquad Q(B) = -\sum_{j=0}^{n-1} b_j x^{n-j}.$$
If we multiply $F(A)$ and $Q(B)$ in the ring, where $x^n = -1$, we obtain
$$F(A) \cdot Q(B) = \langle A, B \rangle + (\text{terms of degree} \geq 1).$$
Thus, if we encrypt $F(A)$ and $Q(B)$, thanks to the homomorphic properties of the encryption scheme, we can extract the inner product as a constant term from the encrypted result:
$$\mathrm{Dec}\big(\mathrm{Enc}(F(A)) * \mathrm{Enc}(Q(B))\big) = \langle A, B \rangle + (\text{terms of degree} \geq 1) \pmod{t}.$$
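The packing idea can be verified on plaintexts without any encryption. The sketch below assumes the encodings F(A) = Σ a_i x^i and Q(B) = −Σ b_j x^(n−j) in the ring Z_t[x]/(x^n + 1); the function names and the modulus `T_MOD` are hypothetical choices made for illustration.

```python
T_MOD = 1009  # plaintext modulus, large enough to hold the toy inner product


def negacyclic_mul(a, b, t):
    """Multiply two polynomials in Z_t[x]/(x^n + 1), where n = len(a)."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                out[i + j] = (out[i + j] + a[i] * b[j]) % t
            else:  # x^n = -1
                out[i + j - n] = (out[i + j - n] - a[i] * b[j]) % t
    return out


def pack_f(vec):
    """F(A): coefficient i of the polynomial is a_i."""
    return list(vec)


def pack_q(vec):
    """Q(B) = -sum_j b_j x^(n-j); the j = 0 term equals +b_0 because x^n = -1."""
    n = len(vec)
    out = [0] * n
    out[0] = vec[0]
    for j in range(1, n):
        out[n - j] = -vec[j]
    return out


A = [1, 2, 3, 4]
B = [5, 6, 7, 8]
product = negacyclic_mul(pack_f(A), pack_q(B), T_MOD)
inner = product[0]  # constant term: 1*5 + 2*6 + 3*7 + 4*8 = 70
```

Because the same ring multiplication is available on ciphertexts, a single homomorphic multiplication is enough to obtain the encrypted inner product.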
In our setup, nodes do not trust one another. A node may be assumed to follow the network protocol, but it cannot be trusted to have carried out the feature extraction and liveness detection computations correctly.
During the registration process, a node will extract a feature vector from the face image and then send it to a peer node. The problem is how does the peer node trust the feature vector? A node may or may not have followed the feature extraction process as required. In this situation, zero-knowledge-based verifiable computation comes to the rescue.
Verifiable computation is a technique to prove that the computation process was followed correctly by an untrusted party. Let $y = f(x)$ be the result of computation on input $x$. The prover generates a proof of computation, $\pi$, along with the result and sends $(y, \pi)$ to the verifier. Using $(y, \pi)$ and verification keys, the verifier verifies the correctness of the proof.
Several verifiable computation schemes for machine learning models have already been proposed:
- SafetyNet: a specialized interactive proof protocol for verifiable execution of a class of deep neural networks. It supports only quadratic activation functions, but in our NN model ReLU is necessary to achieve higher accuracy.
- zkDT: verifiable inference and accuracy schemes for decision trees. Decision trees are simple and quite different from a neural network architecture.
- vCNN: a verifiable inference scheme for neural networks with zero-knowledge. It optimizes only convolution. vCNN mixes QAP (quadratic arithmetic program) and QPP (quadratic polynomial program), with CP-SNARK making the connection between the two. QAP works at the arithmetic circuit level and is costly in terms of computation.
- ZEN: an R1CS-friendly, optimized zero-knowledge neural network inference scheme. It proposes an R1CS-friendly quantization technique and uses an arithmetic-level circuit and Groth's zero-knowledge proof.
- zkCNN: an interactive zero-knowledge proof scheme for convolutional neural networks. It proposes a new sum-check protocol and uses the GKR protocol.
vCNN, ZEN, and zkCNN are most closely related to our scenario but all of them reduce the computation program to arithmetic circuit level and then use Groth zkp protocol for verification.
Any verifiable computation scheme utilizes the homomorphic property of the underlying primitive for verification. Therefore, it can support computation that involves either addition or multiplication. Since neural network computations are often complex and non-linear, researchers convert the program to the arithmetic circuit level, which involves only addition and multiplication at the bit level, and then use a zk-SNARK-type proof. This is a more generalized technique that works for any circuit. However, if the circuit involves only addition and multiplication at the integer level, there is no need to convert it to the arithmetic circuit level.
Our idea is to break down the neural network model of feature extraction into different layers and then prove the computation of individual layers separately. There are four main layers: the convolution layer, the batch-normalization layer, the ReLU layer, and the average pooling layer. Of these, only the ReLU layer is not expressed in terms of addition and multiplication.
So, to make it compatible with our idea, we replaced the ReLU function with the bit-decomposition of ReLU, which involves bit-level addition and multiplication. After this, we used the idea of Verifiable Private Polynomial Evaluation (PIPE), where an untrusted cloud server proves that the polynomial computation, $y = f(x)$, is correct without revealing the coefficients of the polynomial $f$. We are aware of other similar schemes such as Pinocchio, PolyCommit by Kate et al., and other garbled-circuit-based schemes, but PIPE is best suited to our decentralized, untrusted P2P network scenario.
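One way to express ReLU with only additions and multiplications over bits is sketched below. The text does not spell out the exact decomposition, so this is an assumption: values are taken to be fixed-width two's-complement integers (the width `BITS` is chosen here), and ReLU is computed as (1 − sign bit) multiplied by the value of the remaining bits. The function names are hypothetical.

```python
BITS = 8  # assumed width of the two's-complement representation


def to_bits(x: int, bits: int = BITS):
    """Two's-complement bits of x, least significant bit first."""
    if not -(1 << (bits - 1)) <= x < (1 << (bits - 1)):
        raise ValueError("value does not fit in the assumed bit width")
    unsigned = x % (1 << bits)
    return [(unsigned >> i) & 1 for i in range(bits)]


def relu_from_bits(b):
    """ReLU(x) = (1 - sign) * sum_i b_i * 2^i, using only + and * on the bits."""
    sign = b[-1]  # the most significant bit is 1 exactly when x < 0
    magnitude = sum(bit * (1 << i) for i, bit in enumerate(b[:-1]))
    return (1 - sign) * magnitude


examples = {x: relu_from_bits(to_bits(x)) for x in (-5, 0, 7)}  # {-5: 0, 0: 0, 7: 7}
```

In a proof system, the prover would additionally have to show that each b_i is a bit, for example through the constraint b_i·(1 − b_i) = 0, which is again a relation built only from addition and multiplication.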
Our scenario is similar but slightly different. We assume that the neural network parameters are available to each node. That means the coefficients of the kernel in the convolution layer are available to each node. For input $x = (x_1, \dots, x_n)$, the output of convolution can be represented as:
$$y = \sum_{i=1}^{n} a_i x_i.$$
In the PIPE scheme, $a_i$ is kept secret from the verifier, whereas in our scenario, $x_i$ (which represents the input image) is kept secret from the verifier. Moreover, in the PIPE scheme, the input and output are available to the verifier in plain form. However, we cannot reveal the inputs and outputs of the neural network or of its intermediate layers due to privacy concerns. That means we had to modify the PIPE scheme in such a way that the verifier can still verify the correctness of the computation using encrypted inputs and outputs.
Finally, here is what we have in a ZKP system for the feature vector extraction process.
Figure 12. ZKP for the feature vector
The prover picks an input $x$ and performs the computation $y = f(x)$. Since the verifier does not trust the prover, the prover needs to prove that the output $y$ is computed correctly.
Requirement: The coefficients of the computation, $a_i$, are public and known to verifiers. The prover can’t disclose $x$ and $y$ to the verifier due to privacy concerns.
We combined Feldman’s Verifiable Secret Sharing, the ElGamal cryptosystem, and a non-interactive zero-knowledge proof.
- Feldman’s Verifiable Secret Sharing:
It is a secret sharing scheme where each share is a point $(x, y)$ on a secret polynomial $f$. In Feldman’s VSS, given a share $(a, b)$, anybody can verify the validity of the share using some public values corresponding to the secret polynomial $f$. This means anyone can check whether $b = f(a)$ without knowing the coefficients of the polynomial $f$.
Let $f(x) = \sum_{i=0}^{k} a_i x^i$ be a $k$-degree polynomial with $a_i \in \mathbb{Z}_p$. Let $G$ be a multiplicative group of prime order $p$ and $g$ be a generator of $G$. For each $i \in \{0, \dots, k\}$, set $h_i = g^{a_i}$ and make $h_i$ public. Given a share $(a, b)$, one can check the validity of the share by verifying the following equation:
$$g^b = \prod_{i=0}^{k} h_i^{a^i}.$$
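The share check can be sketched as follows, assuming the standard Feldman relation g^b = Π h_i^(a^i) with public commitments h_i = g^(a_i). The group is a toy subgroup of prime order 1019 inside the integers modulo 2039; these parameters and the helper names are assumptions made for illustration and are far too small for real use.

```python
P_MOD = 2039   # toy safe prime, P_MOD = 2 * ORDER + 1
ORDER = 1019   # prime order p of the subgroup G
GEN = 4        # generator g of the order-p subgroup


def poly_eval(coeffs, x):
    """Evaluate f(x) = sum_i a_i * x^i modulo the group order."""
    return sum(a_i * pow(x, i, ORDER) for i, a_i in enumerate(coeffs)) % ORDER


def commitments(coeffs):
    """Public values h_i = g^(a_i), one per secret coefficient."""
    return [pow(GEN, a_i, P_MOD) for a_i in coeffs]


def verify_share(share, h):
    """Check g^b == prod_i h_i^(a^i) without knowing the coefficients a_i."""
    a, b = share
    rhs = 1
    for i, h_i in enumerate(h):
        rhs = rhs * pow(h_i, pow(a, i, ORDER), P_MOD) % P_MOD
    return pow(GEN, b, P_MOD) == rhs


secret_coeffs = [7, 3, 5]            # f(x) = 7 + 3x + 5x^2
h = commitments(secret_coeffs)
share = (12, poly_eval(secret_coeffs, 12))
```

A verifier holding only `h` accepts the honest share and rejects a share whose second component has been altered.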
Note: There are two concerns here. First, the share $(a, b)$ is in plain form; hence, if we use it as is in our scenario, we have to reveal the input and output to the verifier. The second concern is that $h_i$ hides $a_i$ only under the assumption that it is difficult to solve $h_i = g^{a_i}$ for $a_i$ (the Discrete Logarithm assumption). However, if $a_i$ is a small value, it is very easy to find $a_i$ by exhaustive search. In neural network computation, the values (input and weight parameters) are always small, so this commitment cannot hide them properly.
- Feldman’s VSS with encrypted input and output:
To hide input and output, we need to encrypt both in such a way that we can perform some operation over encrypted value. That means we have to use some homomorphic encryption scheme.
We use ElGamal encryption mainly because it is homomorphic with respect to plaintext multiplication and scalar multiplication as well which suits our system perfectly.
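ElGamal key pair: $(pk, sk) = (h, x)$ with $h = g^x$.
Finally, we obtain a ciphertext that is an ElGamal encryption of $g^y$. The prover now needs to convince the verifier that this ciphertext, computed from the encrypted input, is a valid ciphertext of $g^y$. Here, we use a NIZKP of discrete-logarithm equality.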
- Non-Interactive Zero-Knowledge Proof:
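If we generalize the above logarithm equation, we have $\log_{g_1} h_1 = \log_{g_2} h_2$. In 1993, David Chaum and T. P. Pedersen proposed a NIZKP to prove exactly this.
NIZKP LogEq: Let $G$ be a multiplicative group of prime order $p$ and $H: \{0,1\}^* \to \mathbb{Z}_p^*$ be a hash function. Let the language $L$ be the set of all $(g_1, h_1, g_2, h_2) \in G^4$ where $\log_{g_1} h_1 = \log_{g_2} h_2$. The NIZKP LogEq = (prove, verify) is as follows:
- $\mathrm{prove}((g_1, h_1, g_2, h_2), w)$: Using the witness $w = \log_{g_1} h_1$, it picks a random $r$ from $\mathbb{Z}_p^*$ and computes $A = g_1^r$, $B = g_2^r$, $z = H(A, B)$, and $\omega = r + w \cdot z$. It outputs the proof $\pi = (A, B, \omega)$.
- $\mathrm{verify}((g_1, h_1, g_2, h_2), \pi)$: Using $\pi = (A, B, \omega)$, it computes $z = H(A, B)$. If $g_1^{\omega} = A \cdot h_1^{z}$ and $g_2^{\omega} = B \cdot h_2^{z}$, then it outputs 1, else it outputs 0.
We achieve the ZKP system for an individual layer of our NN model by combining Feldman’s VSS, the ElGamal cryptosystem, and NIZKP LogEq properly. Our ZKP system is unconditionally ZK-secure and UNF-secure in the Random Oracle Model. It is also privacy-preserving under the DDH assumption in the Random Oracle Model.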
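A proof of discrete-logarithm equality in the style of Chaum and Pedersen can be sketched as follows. The sketch assumes the usual commitments A = g1^r and B = g2^r, a challenge z = H(A, B), and a response ω = r + w·z mod p. The toy group, the SHA-256-based hash `hash_to_zp_star`, and the function names are assumptions made for illustration; production implementations typically also hash the statement (g1, h1, g2, h2) into the challenge.

```python
import hashlib
import secrets

P_MOD = 2039   # toy safe prime, P_MOD = 2 * ORDER + 1 (far too small for real use)
ORDER = 1019   # prime order p of the subgroup G


def hash_to_zp_star(*elements):
    """Hash group elements to a challenge in {1, ..., p - 1}."""
    data = ",".join(str(e) for e in elements).encode()
    digest = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return 1 + digest % (ORDER - 1)


def prove(g1, h1, g2, h2, w):
    """Prove knowledge of w with h1 = g1^w and h2 = g2^w."""
    r = 1 + secrets.randbelow(ORDER - 1)
    A, B = pow(g1, r, P_MOD), pow(g2, r, P_MOD)
    z = hash_to_zp_star(A, B)
    return A, B, (r + w * z) % ORDER


def verify(g1, h1, g2, h2, proof):
    """Accept iff g1^omega = A * h1^z and g2^omega = B * h2^z."""
    A, B, omega = proof
    z = hash_to_zp_star(A, B)
    return (pow(g1, omega, P_MOD) == A * pow(h1, z, P_MOD) % P_MOD
            and pow(g2, omega, P_MOD) == B * pow(h2, z, P_MOD) % P_MOD)


g1, g2, w = 4, 16, 77                      # two subgroup generators and a witness
h1, h2 = pow(g1, w, P_MOD), pow(g2, w, P_MOD)
proof = prove(g1, h1, g2, h2, w)
```

The verifier accepts `proof` for (g1, h1, g2, h2) and rejects it when h2 is replaced by a value with a different discrete logarithm.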
We generalize the input image as a higher-dimensional vector $x = (x_1, \dots, x_n)$. Similarly, we assume the output of each layer is again a higher-dimensional vector $y = (y_1, \dots, y_m)$. For each layer, we encrypt its input and output using ElGamal encryption.
The ElGamal Public Key Encryption scheme is defined as follows:
- $\mathrm{Gen}$: It returns $pk = (G, p, g, h)$ and $sk = x$, where $G$ is a multiplicative group of prime order $p$ generated by $g$ and $h = g^x$.
- $\mathrm{E}_{pk}(m)$: It returns $(c, d) = (g^r, h^r \cdot m)$, where $r$ is a randomly chosen integer between 1 and $(p-1)$.
- $\mathrm{D}_{sk}((c, d))$: It returns $m = d / c^x$.
In our scheme, we use a 1024-bit prime p to achieve the recommended security. Note that ElGamal encryption is randomized, not deterministic: if the same message is encrypted twice, the two ciphertexts will be different. Thus, each transaction is indistinguishable from the others, which preserves the privacy of the user. Moreover, ElGamal encryption is homomorphic with respect to plaintext multiplication and scalar multiplication.
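The sketch below shows how the homomorphic property allows a verifier to combine encrypted inputs with public coefficients. It assumes the textbook ElGamal scheme over a toy subgroup (parameters chosen here, far smaller than the 1024-bit prime mentioned above), inputs encoded as g^(x_i), and hypothetical helper names; it does not reproduce the exact protocol.

```python
import secrets

P_MOD = 2039   # toy safe prime, P_MOD = 2 * ORDER + 1
ORDER = 1019   # prime order p of the subgroup
GEN = 4        # generator g of the subgroup


def gen():
    """Return (h, x): public value h = g^x and secret key x."""
    x = 1 + secrets.randbelow(ORDER - 1)
    return pow(GEN, x, P_MOD), x


def enc(h, m):
    """Encrypt subgroup element m as (g^r, h^r * m) with fresh randomness r."""
    r = 1 + secrets.randbelow(ORDER - 1)
    return pow(GEN, r, P_MOD), pow(h, r, P_MOD) * m % P_MOD


def dec(x, ct):
    """Recover m = d / c^x; c^(ORDER - x) is the inverse of c^x in the subgroup."""
    c, d = ct
    return d * pow(c, ORDER - x, P_MOD) % P_MOD


def ct_mul(ct1, ct2):
    """Component-wise product: encrypts the product of the two plaintexts."""
    return ct1[0] * ct2[0] % P_MOD, ct1[1] * ct2[1] % P_MOD


def ct_pow(ct, k):
    """Component-wise power: encrypts the plaintext raised to the power k."""
    return pow(ct[0], k, P_MOD), pow(ct[1], k, P_MOD)


h, x = gen()
inputs = [3, 1, 4]        # secret values x_i
coeffs = [2, 5, 7]        # public coefficients a_i
cts = [enc(h, pow(GEN, x_i, P_MOD)) for x_i in inputs]

combined = (1, 1)         # trivial encryption of the identity element
for ct, a_i in zip(cts, coeffs):
    combined = ct_mul(combined, ct_pow(ct, a_i))

y = sum(a_i * x_i for a_i, x_i in zip(coeffs, inputs))  # 39
```

Decrypting `combined` gives g^y, even though whoever combined the ciphertexts never saw the individual inputs.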
The result of liveness detection is proved by sending the output of the detection algorithm. This output comes in the form of a yes or no. That is a Boolean result.
In a centralized system the algorithm runs in a controlled environment where the central authority manages the input and output.
When the user is given the ability to run the liveness detection algorithm on their own there is the risk of a malicious user tampering with the result of the algorithm. Errors can also occur in the transmission of data or local failures in executing the algorithm and obtaining the results.
The system's decentralization includes the need to prove that the result is obtained through a correct execution of the algorithm. That is why in Humanode, we have an algorithm to generate proof of the correctness of each function of the liveness detection process. In addition, there will be a verification algorithm for the said proof, thus having a Zero-Knowledge Proof System suitable for decentralized testing of the correct execution of liveness detection.
One of the most critical problems to solve when defining encryption schemes in decentralized environments is the handling of cryptographic keys, where in addition, the calculations are performed and verified by peers through multi-party computation.
In this sense, we will consider a subgroup of the Humanode network, whom we will call Collective Authority, whose objective is to generate the collective keys for homomorphic encryption and also verify the calculations performed by each peer.
In simple terms, the collective authority works as a trusted third party for key generation and verification but is also composed of several peers within the network.
During the Setup process, the collective authority is the one who defines the generic parameters for the establishment of the cryptographic protocols. The security that this collective authority provides us is that each peer takes these generic parameters and locally generates its public and private keys, as we saw in section 2.2.2.
Each user keeps his private key secured locally but sends the public key to the collective authority. After collecting the public keys from each user, the collective authority constructs a collective public key and distributes it back to all users. This collective public key is the one used to encrypt the feature vectors.
If a malicious user intercepts the public key in a traditional cryptosystem, obtaining the private key is computationally challenging. In our case, if the collective public key is intercepted, the perpetrator cannot obtain the private keys, because they would also need to know which partial element belongs to which peer. Thus, we have an additional layer of security on top of the public-key cryptosystem, in what we can call a lattice-based decentralized public-key cryptosystem.
The Biometric Identification Matrix was created by the Humanode core to understand which of the existing biometric modalities are the most suitable and superior and, therefore, to choose the proper ones for Humanode biometric processing methods.
According to recent studies, there are three types of biometric measurements (G. Kaur et al., 2014):
- Physiological measurement includes face recognition, finger or palm prints, hand geometry, vein pattern, eye (iris and retina), ear shape, DNA, etc.
- Behavioral measurement relates to human behavior, which can vary over time, and includes keystroke pattern, signature, and gait (S. Jaiswal et al., 2011).
- There are also some biometric traits that act as both physiological and behavioral characteristics (e.g., brain waves or electroencephalography (EEG)). EEG depends on the head or skull shape and size, but it changes from time to time depending on circumstances and varies according to age.
In light of the latest developments, we propose a fourth measurement—neurological—as a part of both physiological (internal) and behavioral measurements. We believe that neurosignature, the technology of reading a human's state of mind, i.e., signals that trigger a unique and distinct pattern of nerve cell firing and chemical release that can be activated by appropriate stimuli, should be developed and implemented in the Humanode as the most reliable and secure way of biometric processing.
Until then, Humanode implements a multimodal biometric system that combines several biometric modalities. Each biometric modality has its own merits and demerits, which makes a direct comparison laborious. Since the end of the 1990s, when A. K. Jain, R. M. Bolle, and S. Pankanti conducted their comprehensive research on all existing biometrics (Jain et al., 1999), seven significant factors have been used to study and compare the biometric types: acceptability, universality, uniqueness (distinctiveness), permanence, collectability, performance, and resistance to circumvention. These are also known as ‘the seven pillars of biometrics’ (A. K. Jain, A. Ross & S. Prabhakar, 2004).
Based on Jain et al.’s classification and recent all-encompassing surveys on various biometric systems (A. C. Weaver 2006; T. Sabhanayagam, V. Prasanna Venkatesan & K. Senthamaraikannan, 2018), cancelable systems (B. Choudhury, P. Then, B. Issac & V. Raman, 2018), and unimodal, multimodal biometrics and fusion techniques (A.S. Raju & V. Udayashankara, 2018), we provide a comparison study of different biometric modalities, and propose a ‘Biometric Identification Matrix’, by studying and combining characteristics revealed in the aforementioned works and by adding factors we found necessary to examine. Thus, we divided the ‘Performance’ category proposed by Jain et al., which relates to the accuracy, speed, and robustness of technology used, into two sub-categories (‘Accuracy’ and ‘Processing Speed’) to study the space in more detail. To grasp how easy it is to collect biometric data on a person, we decided to add the ‘Security’ category which refers to vulnerability to attack vectors, as paths or means by which attackers can gain access to biometric data to deliver malicious actions. The category ‘Hardware’ which relates to the type of hardware, its prevalence, and cost, was added to understand which devices are required to be used nowadays and which are best to use in the network.
‘Acceptability’ relates to the relevant population’s willingness to use a certain modality of biometrics, their acceptance of the technology, and their readiness to have their biometrics trait captured and assessed.
Complex and intrusive technologies have low levels of public acceptance. Retina recognition is not socially acceptable, as it is not a very user-friendly method because of the highly intrusive authentication process using retina scanning (J. Mazumdar, 2018). Electrophysiological methods (EEG, ECG) and neurosignatures are not highly accepted nowadays, as they are intricate and not yet well-known or fully developed.
An active liveness detection technology may be uncomfortable for the average user if the trait acquisition method tends to be demanding or time-consuming. Even in the absence of physical contact with sensors, many users still develop a natural apathy for the entire liveness detection process, describing it as over intrusive (K. Okereafor & Clement E. Onime, 2016).
‘Collectability’ refers to the ease of data capturing, measuring, and processing, reflecting how easy this biometric modality is for both the user and the personnel involved.
Fingerprint and hand geometry recognition techniques are very easy to use. Their template sizes are small and so matching is fast (S. Jaiswal et al., 2011). Similarly, the advantage of face biometrics is that it is contactless and the acquisition process is simple. The advantage of all behavioral recognition methods is the ease of acquisition as well.
‘Permanence’ relates to long-term stability—how a modality varies over time. More specifically, a modality with 'high' permanence will be invariant over time with respect to the specific matching algorithm.
Physiological measurements tend to be permanent, while behavioral measurements are usually not stable in the long term. Behavioral modalities therefore have a low or medium level of permanence.
The same person can sign in different ways, as it is affected by physical conditions and feelings. Voice is not constant, as it may change based on an individual's emotion, sickness, or age (L. Rabiner & B.-H. Juang, 1993).
Facial traits are persistent but may change and vary over time, although heat generated by the facial tissues has a measurable, repeatable pattern that can be more stable than the facial structure (Hanmandlu et al. 2012). Finger and palm prints and vein patterns tend to remain constant. Hand geometry is more likely to be affected by diseases, weight loss/gain, or injury. However, the results of hand geometry recognition are less affected by age-related changes in skin moisture or texture. Ear size changes over time (S. Jaiswal et al., 2011; Abaza et al. 2013). DNA is highly permanent. The iris remains the same throughout life (G. Kaur et al., 2014; Bowyer et al. 2008). However, diabetes and some other serious diseases cause alterations in it. Likewise, the otherwise stable retina pattern changes during medical conditions like pregnancy, blood pressure, other ailments, etc. (G. Kaur et al., 2014).
‘Universality’ means that every person using a system should possess the modality.
Different biometric systems have their own limitations, as do the modalities. For example, some people have damaged or eliminated fingerprints, hand geometry is efficient only for adults, etc. The biological/chemical, electrophysiological, and (in theory) neurological measurement categories should have the highest level of universality.
‘Uniqueness’ relates to characteristics that should be sufficiently different for individuals such that they can be distinguished from one another.
Every person has a unique walking style as well as writing style, and hence a person has his own gait and signature. Voice recognition technology identifies the distinct vocal characteristics of the individual. Even so, human behavior is not as unique as physiological patterns.
Finger and palm prints are extremely distinctive. The blood vessels underneath the skin are also unique from person to person. The iris is highly unique and rich in texture. Moreover, the textures of a person's two eyes differ from each other. Each person has a unique body odor, and the chemical agents of human body odor can be extracted from the pores to recognize a person (M. Shu et al. 2014). Although neuroscientists once thought brain activity was pretty much the same from one person to another, people display a distinct ‘brain signature’ when they are processing information, similar to fingerprints (E. Finn et al., 2015, 2019; A. Demertzi et al., 2019).
Nevertheless, even physical modalities have limitations. Faces seem to be unique; however, in the case of twins, distinctiveness is not guaranteed. DNA is unique for each individual, except for identical twins, and therefore achieves high accuracy. Retina recognition, by contrast, is highly reliable, since no two people have the same retinal pattern and even identical twins have distinct patterns. We assume that neurosignature will become one of the premier biometric technologies on the grounds of the unique nature of human thoughts, memories, and other mental conditions.
‘Accuracy’ is a part of the ‘Performance’ category. It describes how well a biometric modality can tell individuals apart. This is partially determined by the amount of information gathered as well as the quality of the neural network resulting in higher or lower false acceptance and false rejection rates.
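The two error rates mentioned above can be made concrete with a short sketch. It assumes a matcher that outputs a similarity score and accepts a sample when the score reaches a threshold; the function name `far_frr` and all score values are made up for illustration and do not come from any measured system.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) for similarity scores where score >= threshold means 'accept'."""
    # False acceptance: an impostor comparison that is wrongly accepted.
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    # False rejection: a genuine comparison that is wrongly rejected.
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    far = false_accepts / len(impostor_scores)
    frr = false_rejects / len(genuine_scores)
    return far, frr


# Made-up similarity scores, for illustration only.
genuine = [0.91, 0.85, 0.78, 0.95]
impostor = [0.20, 0.55, 0.81, 0.40]
far, frr = far_frr(genuine, impostor, threshold=0.80)
```

Raising the threshold lowers the false acceptance rate at the expense of a higher false rejection rate, and vice versa, which is why both rates are usually reported together.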
2D facial recognition may give inaccurate results, as facial features tend to change over time due to expression and other external factors. It is also highly dependent on lighting for correct input. Thermograms perform far better: they are easy to obtain and process, are invariant to illumination, and work accurately even in dim light.
3D face recognition has the potential to achieve greater accuracy than its 2D counterpart by measuring the geometry of facial features. It avoids such pitfalls of 2D face recognition as lighting, makeup, etc. It is worth noting that 3D face recognition with liveness detection is considered the best in accuracy.
Palm prints show a higher level of accuracy than fingerprints. Considering the number of minutiae points of all five fingers, the palm print has more minutiae points to help make comparisons during the matching process compared to fingerprints alone (A. Kong et al. 2009).
The iris provides a high degree of accuracy (iris patterns match for 1 in 10 billion people; J. Daugman, 2004), but recognition can still be affected by glasses or contact lenses. Similarly, retina recognition is a highly accurate technology; however, diseases such as cataracts, glaucoma, diabetes, etc. may affect the results.
‘Security’ refers to vulnerability to attack vectors, as paths or means by which attackers can gain access to users’ biometric data to deliver malicious actions.
Vascular biometrics ranks first as the safest because of the many benefits it inherently offers: it is simple and contact-free, as well as resilient to presentation attacks. This applies to both hand and eye vein recognition. The vein pattern is not visible and cannot be easily collected, unlike facial features, fingerprints, voice, or DNA, which stay exposed and can be collected without a person’s consent.
However, face recognition offers appropriate security if the biometric system employs anti-spoofing and liveness detection so that an imposter may not gain access with presentation attacks. 3D templates and the requirement of blinking eyes or smiling for a successful face scan are some of the techniques that improve the security of face recognition.
‘Processing Speed’ is a part of the ‘Performance’ category. It is related to the time it takes a biometric technology to identify an individual.
As different modalities have different computation requirements, the processing power of the systems used varies. Fingerprint and face recognition are still the fastest in the identification process. The time vein recognition systems need to compare the current data against the recorded database is also very impressive and reliable; currently, the time taken to verify each individual is shorter than with other methods (average is 1/2 second; P. O'Neill, 2011). Iris and retina recognition have a small template size, hence a promising processing speed (2 to 5 seconds). Ear shape recognition techniques also demonstrate fast identification results. The more complicated the procedure, the longer it takes. Behavioral modality identification is fast in processing: signature, voice, and lip motion recognition take a few seconds. The EEG and ECG processes differ. Acquisition of a DNA sample requires a long procedure to return results (S. Bhable et al., 2015).
‘Circumvention’ relates to an act of cheating; thus, the identifying characteristic used must be hard to deceive and imitate using an artifact or a substitute.
Nearly every modality may become an easy target for forgers. Signatures can be effortlessly mimicked by professional attackers; voices can be simply spoofed. Fingerprint scanners are easily deceived with artificial fingers made of wax, gelatin, or clay. Iris-based systems can be attacked with fake irises printed on paper or wearable plastic lenses, while face-based systems without 5 levels of liveness detection can be fooled with sophisticated 3D masks (A. Babu & V. Paul, 2016). Even vein patterns can be imitated by developing a hand substitute.
Although our DNA is left everywhere and has no inherent liveness, it is believed to be the most difficult characteristic to dupe, as the DNA of each person is unique (Maestre, 2009). Brain activity and heartbeat patterns are also hard to emulate.
The ‘Hardware’ category refers to the type and cost of hardware required to use a given biometric modality.
Nowadays, there is no need for extra devices for biometric recognition if you have a smartphone. Facial recognition and fingerprint scanning are common features of smartphones. For lip motion recognition, existing image capturing devices, i.e., cameras, can be used. Thermograms need specialized sensor cameras. Voice recognition is also easy to implement on smartphones or any audio device. Hand vein recognition has a low cost in terms of installation and equipment. Nowadays, mobile apps for vascular biometric recognition are integrated using the palm vein modality (R. Garcia-Martin & R. Sanchez-Reillo, 2020). For eye vein identification, smartphone solutions are currently in development, while retina recognition is still an expensive technology with a high equipment cost. Keystrokes need no special hardware or new sensors, and low-cost identification is fast and secure. Image-based smartphone application prototypes for ear biometrics are in development (S. Bargal & A. Welles, 2015; A. F. Abate, M. Nappi & S. Ricciardi, 2016), as well as mobile apps with digital signatures (E. Rahmawati, M. Listyasari, A. S. Aziz & S. Sukaridhoto, 2017).
In the meantime, electroencephalographs are needed for EEG, and electrocardiographs for ECG. Brain-computer interfaces (BCI) are needed for neurosignature. Special expensive equipment and hardware are needed for DNA matching procedures.
We assume that a combination of the aforementioned biometrics methods (and even multimodal biometrics) is not one hundred percent safe/secure. In the future, we plan to expand the system with this multimodal scheme, making neurosignature one of the main methods of Humanode user identification/verification.
Other emerging modalities to research and to possibly implement in Humanode’s verification system are as follows (Goudelis et al. 2009): smile recognition, thermal palm recognition, hand/finger knuckle, magnetic fingerprints/smart magnet, nail ID, eye movement, skin spectroscopy, body salinity, otoacoustic emission recognition (OAE), mouse dynamics, palate, dental biometrics, cognitive biometrics.
Table 3. ‘Biometric Identification Matrix’: Biometrics Techniques Comparison
The advantages and disadvantages associated with each of the biometric techniques discussed are listed in Table 4.
Table 4. ‘Biometric Identification Matrix’: Biometrics Techniques Pros and Cons
We assigned each factor its own value point depending on its effectiveness for the enrollment of new human nodes to the network:
- Acceptability (6)
- Collectability (6)
- Permanence (5)
- Universality (5)
- Uniqueness (10)
- Accuracy (8)
- Security (10)
- Processing Speed (3)
- Circumvention (10)
- Hardware (8)
Thus, we assume that the most significant factors for the network are the ‘Uniqueness’ and ‘Security’ of the biometric modality, the ‘Accuracy’ of the biometric method, a low level of ‘Circumvention,’ and the ‘Hardware’ type used.
To evaluate every aforementioned biometrics modality technique, we proposed the ‘Humanode Biometric Modalities Score,’ based on the ‘Biometric Identification Matrix’ analyzed.
The study revealed that the 3D facial recognition technique has the highest score (198), with facial thermography recognition (192) and iris recognition (190) not far behind. Eye vein recognition (178) and retina recognition (176) also received quite high scores, as did neurosignature (173), which scores lower because it is not yet fully developed or widely adopted.
Table 5. ‘Biometric Identification Matrix’: Modalities Scores
* When calculating the scores, we swapped the levels (numbers) for the ‘Circumvention’ factor so that it correlates with the other factors: a ‘High’ level of circumvention means it is easy to imitate the body part or modality by using an artifact or substitute, while a ‘Low’ level of circumvention means this is practically impossible to do. In our model, ‘Low’ gets 3 while ‘High’ gets 1.
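The scoring described above can be sketched as a weighted sum. The factor weights are the value points listed earlier; the three-level scale (‘Low’ = 1, ‘Medium’ = 2, ‘High’ = 3) and the formula (sum of weight times level) are assumptions inferred from the footnote, and `modality_score` is a hypothetical name. The example ratings are not taken from Table 3.

```python
# Value points assigned to each factor in the text.
WEIGHTS = {
    "Acceptability": 6, "Collectability": 6, "Permanence": 5, "Universality": 5,
    "Uniqueness": 10, "Accuracy": 8, "Security": 10, "Processing Speed": 3,
    "Circumvention": 10, "Hardware": 8,
}
LEVELS = {"Low": 1, "Medium": 2, "High": 3}  # assumed three-level scale


def modality_score(ratings):
    """Weighted sum of factor levels; 'Circumvention' is inverted (Low -> 3, High -> 1)."""
    total = 0
    for factor, weight in WEIGHTS.items():
        level = LEVELS[ratings[factor]]
        if factor == "Circumvention":
            level = 4 - level  # swap so that hard-to-spoof modalities score higher
        total += weight * level
    return total


# Hypothetical ratings (not from Table 3): the best possible rating on every factor.
best_case = {factor: "High" for factor in WEIGHTS}
best_case["Circumvention"] = "Low"
max_score = modality_score(best_case)  # 213 under these assumptions
```

Under these assumptions the maximum attainable score is 213, which is consistent with the reported top score of 198 for 3D facial recognition.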
Diagram 1. ‘Biometric Identification Matrix’: Modalities Scores
To create a human node, only those modalities will be used that have score points above the median value (>147), i.e., 2D facial recognition, 3D facial recognition, facial thermography recognition, iris, retina, finger/hand vein recognition, eye vein recognition, ECG, DNA matching, and neurosignature (in future).
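The selection rule above amounts to a simple filter over the modality scores. A minimal sketch follows; the function name `eligible_modalities` is hypothetical, and the scores in the usage example are placeholders rather than values from Table 5.

```python
from statistics import median


def eligible_modalities(scores):
    """Return the modalities whose score is strictly above the median of all scores."""
    cutoff = median(scores.values())
    return sorted(name for name, score in scores.items() if score > cutoff)


# Placeholder scores, for illustration only (not the values from Table 5).
example_scores = {
    "modality_a": 120,
    "modality_b": 150,
    "modality_c": 147,
    "modality_d": 198,
    "modality_e": 130,
}
eligible = eligible_modalities(example_scores)
```

With these placeholder values the median is 147, so only `modality_b` and `modality_d` pass the filter.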
Due to the possible development of cheap methods of attacks on the current biometric security set-up in the future, the Humanode network will require human nodes to provide additional biometric data during network upgrades. For instance, once iris verification is proven to be secure on smartphone devices, it will be added as an additional minimum requirement to deploy a node. While Samsung already has made attempts to deploy consumer-scale iris recognition into its smartphones, its quality and security levels are quite low compared to specialized hardware.
On top of this, in order to increase the cost of possible attacks on biometrics, the Humanode network requires high standards for the multimodal biometric system used for granting permission to launch a human node. The network starts with only 3D facial recognition and liveness detection; later on, one will have to go through multimodal biometric processing.
Also, the ability to create several wallets and to choose their types in the system will be correlated with the biometric modalities selected. For example, to create a high-value wallet, a more secure and complex verification technique should be chosen, and vice versa.
Currently, there are eight possible attacks against biometric systems.
Figure 13. Possible attacks on biometric verification systems
Attackers can present fake biometrics in front of sensors (Jain et al. 2008). For example, someone can make a fake hand with fake vein patterns, or finger with fake wax fingerprint; wear special-made lenses to bypass the iris scanner; other intruders can create images of a legitimate user to bypass the face recognition system, etc. The possible solutions for this type of attack are multimodal biometrics, liveness detection, as well as soft biometrics (Kamaldeep, 2011).
Multimodal biometrics is the main way to prevent attacks and make the biometric system more secure. Multimodal biometrics refers to methods in which several biometric features are considered for enrollment and authentication. When multiple biometric characteristics are used, it becomes difficult for an attacker to gain access to all of them.
Humanode utilizes multimodal biometrics. The network has three tiers with combined biometric modalities that are required to set up a human node (read more in the ‘Humanode Biometric Modalities Score’ section).
Liveness detection uses different physiological properties to differentiate between real and fake biometric traits. It is an AI computer system’s ability to determine that it is interfacing with a physically present human being and not an inanimate spoof artifact.
A non-living object that exhibits human traits is called an ‘artifact’. The goal of the artifact is to fool biometric sensors into believing that they are interacting with a real human being instead of an artificial copycat. When an artifact tries to bypass a biometric sensor, it's called a ‘spoof.’ Artifacts include photos, videos, masks, deepfakes and many other sophisticated methods of fooling the AI. Another method of trying to bypass the sensors is by trying to insert already captured data into a system directly without camera interaction. The latter is referred to as ‘bypass’.
In the biometric authentication process, liveness data should be valid only for a set period of time (up to several minutes) and should then be deleted. As this data is not stored, it can’t be used to spoof liveness detection with corresponding artifacts to try and bypass the system.
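The time-limited validity of liveness data can be sketched as a small in-memory store with an expiry time per session. This is an illustrative assumption about how such a policy could be enforced, not a description of Humanode's implementation; the class `LivenessSessionStore`, its methods, and the 120-second default are hypothetical.

```python
import time


class LivenessSessionStore:
    """Keeps liveness data only for a limited time window, then deletes it."""

    def __init__(self, ttl_seconds=120.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable clock, which makes testing easy
        self._sessions = {}         # session_id -> (expiry_time, data)

    def put(self, session_id, data):
        """Store liveness data together with its expiry time."""
        self._sessions[session_id] = (self.clock() + self.ttl, data)

    def get(self, session_id):
        """Return the data if still valid; otherwise delete it and return None."""
        entry = self._sessions.get(session_id)
        if entry is None:
            return None
        expires_at, data = entry
        if self.clock() >= expires_at:
            del self._sessions[session_id]
            return None
        return data

    def purge_expired(self):
        """Delete every session whose validity window has passed."""
        now = self.clock()
        expired = [sid for sid, (exp, _) in self._sessions.items() if now >= exp]
        for sid in expired:
            del self._sessions[sid]
```

Passing the clock in as a parameter keeps the expiry logic deterministic under test; a production system would also need to make sure that expired data is erased from any backing storage, not only from memory.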
The security of liveness detection depends heavily on the amount of data the sensor is able to capture. That is why low-resolution cameras might never be totally secure. For example, if we take a low-res camera and put a 4K monitor in front of it, then weak liveness detection methods such as turning your head, blinking, smiling, speaking random words, etc. can be easily emulated to fool the system.
In 2017, the International Organization for Standardization (ISO) published the ISO/IEC 30107-3:2017 standard for presentation attacks, which went over ways to stop artifacts such as high-resolution photos, commercially available lifelike dolls, silicone 3D masks, etc. from spoofing identities. Since then, sanctioned PAD (Presentation Attack Detection) tests for biometric authentication solutions have been created so that any new solutions meet the specified requirements before hitting the market. The most famous of them all is the iBeta PAD Test. It is a strict and thorough evaluation of biometric processing solutions designed to establish whether they can withstand the most intense presentation attacks. Four years have passed since then, and this standard is now regarded as outdated by many specialists in the field, while iBeta PAD tests have gradually become easy to pass with modern sophisticated spoofing methods.
FaceTec, one of the leading companies in liveness detection, divides attacks into 5 categories that go way beyond those stated in the 30107-3:2017 standard and represent real-world threats much more precisely.
Depending on the artifact type, there are three levels of PAD attacks:
- Level 1: Hi-Res digital photos, HD videos, and paper masks.
- Level 2: Commercially available lifelike dolls, latex & silicone 3D masks.
- Level 3: Ultra-realistic artifacts like 3D masks and wax heads.
Furthermore, depending on the bypass type, FaceTec researchers identify Level 4 & 5 biometric template tampering, and virtual-camera & video injection attacks:
- Level 4: Decrypt & edit the contents of a 3D FaceMap™ to contain synthetic data not collected from the session, have the server process and respond with ‘Liveness Success’.
- Level 5: Take over the camera feed & inject previously captured video frames or a deepfake puppet that results in the FaceTec AI responding with ‘Liveness Success’.
Figure 14. 5 levels of liveness
Almost all liveness detection methods as well as those described above in the Humanode approach to user identification are software-based and available for any modern smartphone. In hardware-based methods an additional device is installed on the sensor to detect the properties of a living person: fingerprint sweat, blood pressure, or specific reflection properties of the eye.
With liveness detection, the chances of successful spoofing become low enough to make the cost of an attack higher by an order of magnitude in comparison to the potential transaction fees collected by an artificially created human node minus costs to run a node.
The Humanode network implements 3D facial liveness detection starting from the testnet.
A replay attack is an attack on the communication channel between the sensors and the feature extractor module. In this attack, an impostor can steal biometric data and later can submit old recorded data to bypass the feature extraction module (Jain et al. 2008).
Traditional solutions to prevent this kind of attack are as follows.
- Steganography is the way by which biometric characteristics can be securely communicated without giving any clue to the intruders. It is mainly used for covert communication and therefore biometric data can be transmitted to different modules of the biometric system within an unsuspected host image.
- Watermarking is a similar technique where an identifying pattern is embedded in a signal to avoid forging. It is a way to combat replay attacks, but only if that data has been seen before or the watermark can't be removed.
- A challenge-response system, in which a task or a question is given to the person as a challenge and the person responds to the challenge voluntarily or involuntarily (Kamaldeep, 2011).
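At the channel level, the challenge-response idea is commonly realized with a fresh random nonce, so that a recording of an earlier exchange is useless later. The sketch below is an illustration of that general technique rather than Humanode's protocol: it assumes the sensor and the verifier share a secret key, and the names `Verifier`, `issue_challenge`, `verify`, and `respond` are hypothetical.

```python
import hashlib
import hmac
import secrets


class Verifier:
    """Issues single-use nonces and checks responses bound to them."""

    def __init__(self, shared_key):
        self.shared_key = shared_key
        self._pending = set()  # nonces issued but not yet used

    def issue_challenge(self):
        nonce = secrets.token_bytes(16)
        self._pending.add(nonce)
        return nonce

    def verify(self, nonce, payload, tag):
        # A nonce that was never issued, or was already used, is rejected.
        if nonce not in self._pending:
            return False
        self._pending.discard(nonce)  # each nonce is accepted at most once
        expected = hmac.new(self.shared_key, nonce + payload, hashlib.sha256).digest()
        return hmac.compare_digest(expected, tag)


def respond(shared_key, nonce, payload):
    """Sensor side: authenticate the captured payload together with the fresh nonce."""
    return hmac.new(shared_key, nonce + payload, hashlib.sha256).digest()
```

Because every response is tied to a nonce that the verifier accepts only once, resubmitting previously recorded data with its old tag fails even if the attacker captured the whole earlier exchange.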
The attacker intrudes into the channel to modify the existing data or to replay old data. Traditionally, this attack can be prevented with the same solutions as a replay attack: challenge-response systems, watermarking, and steganographic techniques (Bolle et al. 2002).
The attacker can intervene in the database where the templates are stored to compromise the biometric characteristics of a user, replace, modify, or delete the existing templates.
There are two common template protection schemes to counter this attack:
- Cancelable biometrics, in which the intruder cannot get access to the original biometric pattern from the database because instead of the original data, a distorted version is stored.
- Cryptobiometrics, where all data is encrypted before being sent to the database while the original template is deleted. It is therefore quite difficult for the attacker to steal the original template, as it exists only for a few seconds on the user’s device.
The Humanode network uses the second type.
As the software application may have bugs, an intruder can override the actual decision made by the matcher.
Humanode ensures that nobody but the protocol knows the actual matching decision before this decision is executed. This attack can be prevented using soft biometrics as well (Kamaldeep, 2011).
This attack relates to overriding the feature extractor to produce predetermined feature sets, as the feature extractor is substituted and controlled remotely to intercept the biometric system.
In the Humanode system, feature extraction takes place on the client's device. The human node encrypts the embedded feature vector using the public key and gets the encrypted feature vector. It further provides a zero-knowledge proof that the feature vector was extracted through the system's feature extraction process only; as a result, the attacker is unable to override it.
Overriding the matcher to output high scores compromises system security. In this way, the intruder can control the matching score and generate a high matching score to confirm authentication to the imposter.
In Humanode, the matching score is computed over an encrypted feature vector. Moreover, the matcher is required to provide proof of correctness for the matched score. As a result, the attacker cannot override matchers to generate a high matching score for a target feature vector.
The route from the feature extractor to the matcher is intercepted to steal the feature vector of the authorized user. Using the legitimate feature vector, the attacker then iteratively changes the false data, retaining only those changes that improve the score until an acceptable match score is generated and the biometric system accepts the false data. The legitimate feature sets are replayed later with synthetic feature sets to bypass the matcher (Bolle et al. 2002; Jain et al. 2008; Kamaldeep, 2011).
In the Humanode system there is a private channel between the feature extractor and the matcher while the feature vector is always kept in encrypted form and is never available in the plain form to the attacker. Therefore these kinds of attacks are not possible.
Recently, G. Mai et al. (2018) proposed a neighborly de-convolutional neural network (NbNet) to reconstruct face images from their deep templates. In a distributed P2P network, a node can have access to a biometric template database, and it can use NbNet to reconstruct a corresponding 2D or 3D mask that passes verification with a very high success probability.
Robust liveness detection prevents the use of a reconstructed 2D or 3D mask, but it does not protect the privacy of the corresponding user. For protecting privacy, there are several solutions based on user-specific randomness in deep networks and user-specific subject keys. Along with using robust liveness detection, Humanode stores all biometric templates in encrypted form, and they are never available in plain form to the attacker.
Table 6. Attacks on biometric systems and their possible solutions
With the evolution of neural implants, it became possible to convert the neuroactivity of the brain into electronic signals that can be comprehended by modern computers. Since the 1960s, the neurotech field has moved from simple electroencephalography (EEG) recordings to real brain-computer communication and the creation of sophisticated BCI-controlled applications. Since the late 2010s, large companies have begun to actively pursue brain-computer interface (BCI) development, rapidly approaching its adoption. In 2014, Brainlab developed a prototype that allows a Google Glass user to interface with and give commands to the device using evoked brain responses rather than swipes or voice commands. In 2015, Afergan et al. developed an fNIRS-based BCI using OST-HMD called Phylter, a control system connected to Google Glass that helped prevent the user from getting flooded by notifications. In 2017, Facebook announced the BCI program, outlining its goal to build a non-invasive, wearable device that lets people type by simply imagining themselves talking. In March 2020, the company published the results of a study that set a new benchmark for decoding speech directly from brain activity. Companies, like BrainGate and Neuralink, have manufactured working prototypes of invasive and noninvasive brain-computer interfaces that build a digital link between brains and computers. Even with the immeasurable complexity of neurons and ridiculous entanglement of somas, axons, and dendrites, the above-mentioned projects were able to create devices that not only stimulate and capture the output but also distinguish patterns of signals from one another.
A person will be able to use his own mental state, conscious state, or simply signals from the motor cortex to initiate node deployment and verify transactions without compromising the data itself.