
Rubika Cube Auth: 3x3x3 Cube Puzzle Authentication



Rubika Cube Auth: 3x3x3 Cube Puzzle Authentication

Chema Alonso¹, Jorge Rivera², Víctor Mundilla³, Julio J. García⁴, Enrique Blanco⁵
CDO, Telefónica, Madrid, Spain
{chema¹, jorge.rivera², victor.mundilla³, julio.garcia⁴}@11paths.com
{enrique.blancohenriquez⁵}@telefonica.com

Abstract- Cybersecurity is becoming an increasingly important issue in today's society, especially given the growing connectivity between devices and services. Password theft can cause severe damage to businesses and the general public, so awareness and prevention must be promoted in every sphere of society to reduce the negative effects of credential stealing. This paper proposes a proof-of-concept authentication service for a Web Service, based on Machine Learning analysis of the movements of a 3x3x3 cube. It makes use of the resolution sequences of cubes that can send both their rotation sequences and their positioning through the BLE protocol. To make this possible, both the hardware of the authentication device and the software of the platform, interface and authentication engine in the web service have been developed. In this work, a 3x3x3 cube called Cube11Paths has been designed and manufactured by ElevenPaths. This device is capable of transmitting over a Bluetooth channel not only the sequences of turns but also positioning sequences, which differentiates it from the rest of today's commercial puzzles. To allow authentication in a Web Service using a 3x3x3 cube, a Machine Learning engine dedicated to binary classification is proposed, using Logistic Regression, Support Vector Machine and Random Forest Classifier algorithms on the most representative characteristics of the resolutions of each user. In this research, the Random Forest Classifier and Logistic Regression algorithms have proven to perform best as engines for user authentication.
Mean scores for Random Forest Classifiers and for Logistic Regression, considering all users, were obtained.

Index Terms- 3-D Combination Puzzle, Biometrics, Authentication, Pattern recognition, 2 Factor-Authentication, Machine Learning, Logistic Regression, Support Vector Machine, Random Forest Classifier.

Type of contribution: Original investigation

I. INTRODUCTION

Biometrics, understood as the identification of an individual according to their physical characteristics or to their behavior when interacting with a certain element, is an emerging discipline as an authentication mechanism in multiple services thanks to its robustness, speed of verification and simplicity of interaction for users. A further advantage is that the attributes that characterize a biometric pattern are, in general, impossible to lose or forget. Physiological characteristics such as the face [1], fingerprints, the geometry of the fingers and/or hands, the size of the retina [2] and the iris, or user behavior patterns such as the handwritten signature [3], the voice, the gait or keystroke dynamics, are resources that can be used to characterize a user in a robust way [4].

In theory, an authentication mechanism must ensure that the user requesting access to a resource is a legitimate user. These mechanisms are designed so that different authentication factors can be validated safely. The authentication factors allow users to prove that they are who they claim to be. Different strategies can be considered when proposing authentication factors, based on the following categories.

Something that the user knows. For example, an access password, where the system verifies that the user knows it.

Something that the user has or is provided with. The possession of a mobile device or an RSA token gives the system that must verify the identity of a user an alternative channel to communicate with that user.
One of the most common uses is generating tokens or sending text messages with a code through these channels, so that the user can complete a second authentication step.

Something the user is. Feature verification, such as biometrics, has recently become widely used in the security world to allow a system to authenticate a user.

Something the user does. Ultimately, all users interact with information systems in a way that can be modeled. This pattern allows an expected behavior to be defined and used as an authentication factor.

The purpose of this project is to provide an authentication tool for a specific Web Service that authenticates users, by means of Machine Learning, from the steps they take to solve a 3x3x3 cube starting from a random state of disorder. The
intention is to characterize the cube resolutions of each user and use them as a biometric feature that allows, using Machine Learning algorithms, their authentication and the consequent access to a service.

A "3x3x3 Cube" is a three-dimensional mechanical puzzle solved by combining sequential movements. The device has the form of a regular hexahedron or "cube", where each of its six square faces has a different color and is divided into 3 rows and 3 columns, for a total of 54 squares, 9 of each of the 6 possible colors. Since it is a three-dimensional polyhedron, each square on the edge of a face is joined to the square of the adjoining face, forming 26 different pieces: 12 edges, 8 vertices and 6 centers. Its invention was the subject of several patent disputes, but it is generally attributed to the Hungarian professor Ernő Rubik, who devised it as an architectural design exercise in the mid-1970s, although it became known worldwide at the beginning of the 1980s, when it was marketed as an entertainment toy.

The cube allows the complete turn of a face in steps of +90 or -90 degrees. Each turn changes the position and orientation of 8 pieces: 4 edges and 4 vertices (the center is considered immobile, since it turns on itself). The 9 squares of the turned face therefore move, as do the 3 adjoining squares on each of the 4 contiguous faces. The total number of possible positions is $43{,}252{,}003{,}274{,}489{,}856{,}000 \approx 4.3 \times 10^{19}$. The challenge is to make a series of sequential face turns until reaching a position where all the squares of the same color are on the same side (solved state), starting from a position of random disorder. There are different algorithms that solve the cube from any position of random disorder, but the way to apply them varies greatly from one person to another, since many sequences lead to a solution. This variability is the basis of the characterization that this work aims to demonstrate using Machine Learning techniques.
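As a worked check of the size of the state space described above, the number of reachable positions of a 3x3x3 cube can be counted from its pieces. This is a minimal sketch using the standard counting argument, which is not spelled out in the paper: 8 vertices can be permuted and twisted, 12 edges can be permuted and flipped, and parity constraints remove part of the combinations. The function name is illustrative.

```python
from math import factorial


def cube_state_count() -> int:
    """Count the reachable positions of a 3x3x3 cube."""
    # 8 vertex pieces: any permutation; the twist of the last one is
    # fixed by the other seven, hence 3**7 orientations.
    vertex_arrangements = factorial(8) * 3**7
    # 12 edge pieces: any permutation; the flip of the last one is
    # fixed by the other eleven, hence 2**11 orientations.
    edge_arrangements = factorial(12) * 2**11
    # Vertex and edge permutations must have the same parity,
    # which halves the total.
    return vertex_arrangements * edge_arrangements // 2


print(f"{cube_state_count():,}")
```

The size of this space is one reason why two users rarely follow the same sequence of turns, even when they apply the same solving method.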
Several investigations use this type of puzzle as a device for encrypting or encoding information, based on the mapping of its configuration or on the changes in the arrangement of its faces after a certain number of preset turns [5] [6]. The purpose of this research is to expand the use of a 3x3x3 cube in the field of cybersecurity beyond what already exists, by providing a Machine Learning based solution that allows the resolution sequences of a 3x3x3 cube to be used as a Second Authentication Factor.

With the approach chosen for this Proof of Concept (PoC), the four points on which the Authentication Factors are based are covered. The user makes use of something they have (in this case, a 3x3x3 cube with a Bluetooth Low Energy (BLE) connection) to authenticate in a service based on what they know (their experience in solving a puzzle of this type from a state of disorder), depending on what they do (the set of turns and the positioning of the device during the resolution). This set of movements and positioning defines what the user is, making it possible to find a biometric pattern for the users supported by the System based on their 3x3x3 cube resolutions.

To carry out this PoC, ElevenPaths has developed a 3x3x3 cube capable of transmitting in real time the face turns made by the user together with the relative position (tracking) of the cube on three axes (yaw, pitch and roll), through radio frequency at 2.4 GHz using the BLE standard.

The paper is structured as follows. Section I provides a summary of the main components of the System's architecture, including details on the HW and SW specifications of the 3x3x3 cube, as well as the features supported by both the Frontend and the Backend.
In Section II, the architecture of the system is explained in detail and the functionality of its components is explored, in particular the 3x3x3 cube developed by ElevenPaths, the Frontend in charge of managing the training and authentication flows, and the Machine Learning algorithms supported by the Backend, which authenticate the users registered in the system through the characterization of their solving sequences. Section III specifies the theoretical bases of the Machine Learning models implemented in the Backend by the Chief Data Officer (CDO) team for the classification of sequences. Section IV summarizes the usage logic of the System, showing both the training and the authentication flows. Section V provides an analysis of the performance of the models for the use case in this paper. Finally, the conclusions and future work are summarized in Section VI.

II. RUBIKA CUBE AUTH SYSTEM OVERVIEW

The System created in the present research is composed of a set of interacting hardware and software elements whose purpose is to collect the movements and positioning of a 3x3x3 cube in order to train a Machine Learning classifier that authenticates the people registered in the service.

Fig. 1. Rubika Cube Auth Web Service Authentication System Architecture scheme

As depicted in Fig. 1, the building blocks of the authentication platform are the following:

• A 3x3x3 cube; the system supports this kind of device in its generic version (i.e., GiiKER® type) or as the Cube11Paths of ElevenPaths. The difference between both devices is that Cube11Paths not only transmits the turns of the faces, but also provides the spatial positioning of
the cube.
• A Frontend, which allows every user to complete the enrollment, login, training and authentication phases. More information on the Frontend features is provided in Section II.B, and the System usage and the users' interaction with the Frontend are described in Section IV.
• A Backend, which interacts with the Frontend by answering its requests and hosts the Machine Learning engine dedicated to authentication. Training and testing routines are supported regardless of the type of 3x3x3 cube used by the user.

A. 3x3x3 cube

As indicated at the beginning of this section, the System can link via the BLE protocol with any physical 3x3x3 cube that supports this functionality, and can collect every movement of two different types of physical cube.

Cube11Paths, a cube entirely designed and integrated by ElevenPaths (see Fig. 2), is capable of transmitting the rotations of its faces and its spatial position on three axes in real time. All this information is acquired by an Inertial Measurement Unit (IMU) and provided through the radio frequency BLE standard, implemented from v4.0 of the Bluetooth protocol.

The system is also able to interact with any generic GiiKER® cube that can transmit the rotation of its faces through the BLE protocol.

It should therefore be noted that Cube11Paths provides its positioning together with its turns, a feature not supported by any puzzle of this type currently on the market.

Fig. 2 Cube11Paths 3x3x3 cube developed by ElevenPaths

B. Frontend

This section summarizes the functionalities of the different Frontend components. The Web Application is built with Flask / Python, and its component responsible for the BLE connection is developed in JavaScript. It dynamically shows the turns and orientation of the cube in 3D (HTML5 / CSS3 / JavaScript) and prepares them for later provision to the Backend. As outlined in Fig.
1, the Frontend contains the database that stores all the information related to the users registered in the system, as well as all the cube resolution movements used to train the binary classification models. All this information is stored in a MongoDB database [7] with four collections defined to hold information about each user:

• users: this collection registers all the users enrolled in the System, as well as their authentication status and records;
• movements: this collection stores all the movements gathered from the users, so they can be filtered by their origin and by the sequence they belong to;
• trainingsequences: this collection maps each sequence id code to its user;
• logins: both successful and unsuccessful login attempts of every user enrolled in the platform are recorded here together with the authentication probability.

The MongoDB database formed by these four collections provides all the information the Backend requires to train the Machine Learning algorithms.

The Frontend also supports a Login Page where the credential that identifies the user is entered. This page gives access to an Enrollment Page if the user is not registered in the system; to a Training Page if the user exists in the system but is not yet authenticable, that is, does not have enough resolution sequences for an authentication attempt to be considered reliable; and to an Authentication Page if the user is already authenticable. Section IV provides more detailed information on the navigation flow between these pages.

C. Auth Backend ML

The component of the system identified as Auth Backend ML is responsible for managing and answering every request received from the Frontend, both to initiate training and to launch an authentication process for a user declared as authenticable.
This component is, therefore, responsible for hosting the Machine Learning algorithms dedicated to modeling the user-generated sequences of movements, as well as for evaluating and predicting the authorship of a set of consecutive resolution movements as part of the authentication process.

Regardless of the features of the sequences chosen to characterize the user's biometric pattern, which are explained in Section V, the problem addressed is one of classifying data sequences. That is, given a set of features that characterize a resolution sequence of the 3x3x3 cube, binary classification models are trained so that they learn to discern the authorship of the sequences subsequently provided for each user.

In its training routine, the Machine Learning engine supports the following activities, listed in order of execution:

• Reading, filtering and preprocessing of movement and positioning data from the Frontend database;
• Model building and compilation;
• Grid search routines for hyperparameter tuning for each model and user;
• When training Machine Learning models, a checkpoint is the set of model weights that may be used directly for testing, or as the starting point for a new run, picking up where it left
off. Checkpoint storing once the training of the models is finished is supported. Reloading of checkpoints is also included for authentication purposes or new training attempts;
• Model fitting to the features derived from the current status of the dataset.

The authentication routine supports the following functionalities:

• Reception of the queued authentication movements: construction and compilation of the model and loading of the most recent checkpoint;
• Model evaluation once all the expected movements of the user have been received;
• Reporting to the Frontend the validity of the binary label predicted by the trained models.

III. MACHINE LEARNING MODELS DESCRIPTION

Three Machine Learning models have been chosen and implemented for incorporation into the Auth Backend ML:

• A first one devoted to binary Logistic Regression;
• A second one based on a Support Vector Machine (SVM) Classifier;
• A last one that uses a Random Forest Classifier to authenticate users.

All three algorithms use data sequences and cube positioning patterns to learn and classify their authorship. A classifier or classification rule is a function $\phi: \mathcal{X} \to \mathcal{Y}$, where $\mathcal{Y}$ is a finite set of classes or labels $\{c_1, c_2, \dots, c_J\}$.

The Machine Learning engine has been developed in Python and uses libraries such as scikit-learn for data mining and data analysis tasks [8], built on NumPy [9], SciPy [10] and matplotlib [11].

Each user has a set of resolution sequences that include the timestamps of the rotations, the turns performed by the user and a set of positions for each turn. The task of the generated models is to learn to map several characteristics derived from these data to a unique label assigned to each user, as outlined in Fig. 3.
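The three classifier families named above can be instantiated with scikit-learn roughly as follows. This is a minimal sketch, not the authors' implementation: the data are synthetic stand-ins for per-sequence feature vectors, the user names and helper names are hypothetical, and wrapping the SVM in `CalibratedClassifierCV` to obtain probabilities is an assumption made here for illustration. Labels follow a one-user-versus-the-rest scheme (1 for the target user, 0 for everyone else).

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


def build_models(random_state=0):
    """Return one untrained classifier per algorithm family."""
    return {
        "logistic_regression": LogisticRegression(max_iter=1000),
        # SVC has no probability output by default; calibration adds one.
        "svm": CalibratedClassifierCV(SVC(), cv=3),
        "random_forest": RandomForestClassifier(
            n_estimators=100, random_state=random_state
        ),
    }


def one_vs_rest_labels(owners, target_user):
    """Label 1 for the target user's sequences, 0 for all others."""
    return np.array([1 if owner == target_user else 0 for owner in owners])


# Synthetic stand-in for feature vectors of two hypothetical users.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 4)), rng.normal(3.0, 1.0, (30, 4))])
owners = ["alice"] * 30 + ["bob"] * 30
y = one_vs_rest_labels(owners, "alice")

probabilities = {}
for name, model in build_models().items():
    model.fit(X, y)
    # Column 1 of predict_proba is the probability of label 1.
    probabilities[name] = float(model.predict_proba(X[:1])[0, 1])
```

Each entry of `probabilities` is a value between 0 and 1 that can be read as the model's confidence that the first sequence belongs to the target user.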
The percentages of movements of each side, taking into account whether they are clockwise or counter-clockwise, the duration of each sequence and the mean temporal distance between movements in each resolution are some of the most important features considered for training and testing the models. More information on feature extraction is provided in Section V.A.

The Machine Learning algorithms used in this project follow the supervised learning paradigm, which maps features derived from the sequences directly to known labels. The most commonly used algorithms in Machine Learning are those capable of automating a process after having learned from a set of known examples. The user provides the algorithm with example input-output pairs, and the algorithm learns the relationship so that it can subsequently produce an output for an unseen input without supervision. In supervised learning problems, the algorithm is trained on data that is already labeled with the correct answer [12] [13]. Once the training is completed, unseen data is provided without the correct labels, and the learning algorithm uses the experience acquired during the training stage to predict a result (see Fig. 3).

Fig. 3 Training and testing diagram for every algorithm / model assessed during the present research

These models support both sequences with positioning and sequences without gyroscopic information. A user can enter sequences with different cubes (Cube11Paths or GiiKER® cubes) and train and test on them regardless of whether the device has a gyroscope or not.

Each time a user's training is launched, and before any training routine starts, a randomized hyperparameter Grid Search is performed to maximize the precision on the dataset available for each user. With the optimal hyperparameters, the training of a model dedicated exclusively to the authentication of that user is launched.
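A randomized hyperparameter search of the kind described above could look like the following with scikit-learn's `RandomizedSearchCV`. This is a sketch under stated assumptions: the search space, the number of iterations, the number of folds and the synthetic data are illustrative choices, not the values used in the paper. Precision is used as the selection score because the text says the search maximizes precision.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in: label 1 is the target user, label 0 everyone else.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (40, 5)), rng.normal(2.0, 1.0, (40, 5))])
y = np.array([1] * 40 + [0] * 40)

# Hypothetical search space, chosen only for illustration.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [None, 3, 5],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,             # number of random configurations to try
    scoring="precision",  # select the configuration with best precision
    cv=3,                 # 3-fold cross-validation per configuration
    random_state=0,
)
search.fit(X, y)

# By default the best configuration is refitted on the whole dataset.
best_model = search.best_estimator_
print(search.best_params_)
```

Running one such search per user and per algorithm yields the per-user hyperparameters that the subsequent training step would apply.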
Training a model dedicated to a single user means that all the sequences belonging to that user are tagged with 1, while those of the other users are tagged with 0. Our models therefore output a number between 0 and 1, which corresponds to the probability that the person being authenticated is the expected one, with 1 being total certainty that the sequence belongs to the person who asks for authentication.

A. Logistic Regression Classifier as authentication algorithm

As indicated at the beginning of the present Section, the first of the three Machine Learning models assessed is a simple logistic regression classifier. To measure the likelihood that the model determines the correct label, the well-known odds ratio can be used:

$$\text{odds} = \frac{p}{1 - p}$$

This describes the ratio between the probability that a certain positive event occurs and the probability that it does not occur, where positive $y = 1$ refers to the "event" that we want to predict, i.e., $p = P(y = 1 \mid \mathbf{x})$, where $\mathbf{x}$ is the feature vector of a sequence and $y \in \{0, 1\}$.

There is a direct relationship between the coefficients produced by logit and the odds ratios produced by logistic. A logit is defined as the log base e (log) of the odds:

$$\operatorname{logit}(p) = \log \frac{p}{1 - p}$$

With the inverse of the previous function it is possible to obtain the probability $p$ from $z = \operatorname{logit}(p)$, which in logistic regression is modeled as a linear function of the features. This inverse is identified as the sigmoid function:

$$p = \sigma(z) = \frac{1}{1 + e^{-z}}$$
With that probability, it is possible to predict confidence scores for samples with reference to the supported classes of our problem.

B. Support Vector Machine Classifier as authentication algorithm

Support Vector Classification was also used as an engine for user authentication. SVMs have proven to be a useful technique for data classification. Each instance in the training set contains one target value (i.e., the class label) and several features or observed variables. The aim of an SVM is to produce a model, based on the training data, which predicts the target values of the test data given only the test data attributes.

Given a training set of instance-label pairs $(\mathbf{x}_i, y_i)$, $i = 1, \dots, l$, where $\mathbf{x}_i \in \mathbb{R}^n$ and $y_i \in \{1, -1\}$, Support Vector Machines [14] require the solution of the following optimization problem:

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2} \mathbf{w}^{\top} \mathbf{w} + C \sum_{i=1}^{l} \xi_i$$

With the following condition to be applied:

$$y_i \left( \mathbf{w}^{\top} \phi(\mathbf{x}_i) + b \right) \geq 1 - \xi_i, \qquad \xi_i \geq 0$$

Here the training vectors $\mathbf{x}_i$ are mapped into a higher dimensional space by the function $\phi$, which is used to define the kernel function of the algorithm, $K(\mathbf{x}_i, \mathbf{x}_j) \equiv \phi(\mathbf{x}_i)^{\top} \phi(\mathbf{x}_j)$. The SVM finds a linear separating hyperplane with the maximal margin in this higher dimensional space. $C > 0$ is the penalty parameter of the error term.

C. Random Forest Classifier as authentication algorithm

The capability of the Random Forest Classifier to authenticate the user correctly has also been investigated. This kind of algorithm is built by combining the predictions of several trees, each of which is trained in isolation. Typically, the trees are trained independently and their predictions are combined through averaging.

Decision trees and, by extension, all tree-based methods are widely used in this kind of task because of several factors that make them quite attractive in practice [15]:

• Decision trees are non-parametric.
They can model arbitrarily complex relations between inputs and outputs, without any a priori assumption;
• Decision trees handle heterogeneous data (ordered or categorical variables, or a mix of both);
• Decision trees intrinsically implement feature selection, making them robust to irrelevant or noisy variables (at least to some extent);
• Decision trees are robust to outliers or errors in labels;
• Decision trees are easily interpretable, even for non-statistically oriented users.

In fact, decision trees are at the foundation of many modern and state-of-the-art algorithms, including forests of randomized trees and boosting methods. Random forests construct many individual decision trees during training. The predictions from all trees are pooled to make the final prediction: the mode of the classes for classification, or the mean prediction for regression. As they use a collection of results to make a final decision, they are referred to as Ensemble techniques.

The next section summarizes the interaction between the user and the System's Frontend, as well as the operations the user has to perform to trigger the training and authentication routines at the Backend level.

IV. SYSTEM USAGE

In this paper, we present a tool that is able to authenticate users via the web from their 3x3x3 cube solving sequences by means of binary logistic regression. Next, we explain how our hardware and software work.

The tool has a friendly interface that allows the user to interact with it in a simple and easy way. The user navigates to the authentication Frontend through the url https://cubeauth.e-paths.com, which automatically redirects to the Login Page https://cubeauth.e-paths.com/login.

Fig. 4. Cube Auth Login Page

If the user has never interacted with the authentication platform, it is mandatory to complete an enrollment phase.
To be correctly incorporated into the platform, the user shall navigate to the Enroll Page https://cubeauth.e-paths.com/signup by clicking the “Regístrate” button, as depicted in Fig. 4.

Fig. 5. Rubika Cube Auth Enroll Page

After the user enters the email to be used to enroll in the authentication platform, an email is sent with a session token that is valid for 24 hours (see Fig. 5). A sample of the text that the Frontend sends to the user via email, together with a link including the valid session token, is provided in Fig. 6.
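The enrollment step above relies on a session token with a 24-hour validity period. A minimal sketch of issuing and checking such a token is shown below; the function names and the in-memory record are hypothetical, the way the real Frontend generates and stores tokens is not described in the paper, and the link follows the `train?token=` url structure quoted in Section IV.

```python
import secrets
from datetime import datetime, timedelta, timezone

TOKEN_LIFETIME = timedelta(hours=24)  # validity period stated in the text


def issue_session_token(email, now=None):
    """Create a random token record for an enrolling user (hypothetical schema)."""
    now = now or datetime.now(timezone.utc)
    return {
        "email": email,
        "token": secrets.token_urlsafe(32),
        "expires_at": now + TOKEN_LIFETIME,
    }


def is_token_valid(record, presented_token, now=None):
    """Accept the token only if it matches and has not expired."""
    now = now or datetime.now(timezone.utc)
    # compare_digest avoids leaking information through timing differences.
    matches = secrets.compare_digest(record["token"], presented_token)
    return matches and now < record["expires_at"]


def training_link(record):
    """Build the link that would be sent by email to the user."""
    return f"https://cubeauth.e-paths.com/train?token={record['token']}"


issued_at = datetime(2020, 1, 1, tzinfo=timezone.utc)
record = issue_session_token("user@example.com", now=issued_at)
print(is_token_valid(record, record["token"], now=issued_at + timedelta(hours=23)))
print(is_token_valid(record, record["token"], now=issued_at + timedelta(hours=25)))
```

Passing `now` explicitly keeps the expiry logic easy to test without waiting for real time to elapse.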
Fig. 6. Rubika Cube Auth email notification with token provision for training attempts

The user can access the Training Page by clicking “Entrenar”, which redirects to a url with the structure https://cubeauth.e-paths.com/train?token=token, where the user can pair a 3x3x3 cube via BLE before attempting the first training sequences (see Fig. 7).

Fig. 7. Rubika Cube Auth Training Page before BLE connection is established

The rendering of the 3x3x3 cube on the web page differs depending on the type of cube the user has available. Not only do the colors change to match the Cube11Paths and GiiKER® color codes, but the movement synchronization also depends on whether the cube can provide its positioning. If the paired cube is of the GiiKER® type, the rendered 3x3x3 cube does not rotate at all (see Fig. 8).

Fig. 8. Rubika Cube Auth Training Page after BLE connection is established

After the cube has successfully connected to the platform, the user can start solving the cube from a random disordered state a fixed minimum number of times, so that the system can train a model with a reliable amount of information and authenticate as accurately as possible. Note that the System lets the user cancel any solving attempt if mistakes or wrong movements have occurred during a training attempt, as indicated in Fig. 9.

Fig. 9. Rubika Cube Auth Training Page during training sequence collection

If the user can already be considered authenticable, training for this user with all the algorithms supported in the Backend is launched once the hyperparameter grid search has ended, as indicated in Section II.C. When training is completed, a checkpoint is stored for further training attempts of the user or for authentication / testing purposes.
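Storing a checkpoint after training and reloading the most recent one for authentication can be sketched with the standard library alone. The directory layout, the version numbering and the function names below are assumptions made for illustration; the paper does not describe how the Backend persists its models. A plain dictionary stands in for a trained model, since any picklable object (including a fitted scikit-learn estimator) is handled the same way.

```python
import pickle
import tempfile
from pathlib import Path


def save_checkpoint(model, user_id, version, directory):
    """Persist a trained model under <directory>/<user_id>/<version>.pkl."""
    user_dir = Path(directory) / user_id
    user_dir.mkdir(parents=True, exist_ok=True)
    # Zero-padded versions sort correctly as plain strings.
    path = user_dir / f"{version:06d}.pkl"
    with path.open("wb") as handle:
        pickle.dump(model, handle)
    return path


def load_latest_checkpoint(user_id, directory):
    """Return the most recent checkpoint of a user, or None if there is none."""
    candidates = sorted((Path(directory) / user_id).glob("*.pkl"))
    if not candidates:
        return None
    with candidates[-1].open("rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as checkpoint_dir:
    save_checkpoint({"weights": [0.1, 0.2]}, "alice", 1, checkpoint_dir)
    save_checkpoint({"weights": [0.3, 0.4]}, "alice", 2, checkpoint_dir)
    latest = load_latest_checkpoint("alice", checkpoint_dir)
    missing = load_latest_checkpoint("bob", checkpoint_dir)
```

Pickle files should only be loaded from trusted storage, since unpickling can execute arbitrary code.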
An authenticable user can go through the login page at https://cubeauth.e-paths.com/login to authenticate using a resolution sequence of the cube as a factor. As in the training page, the rendering of the cube on this page (see Fig. 10) varies depending on the type of 3x3x3 cube linked. After the cube is successfully paired, the user manipulates the puzzle to solve it.

Fig. 10. Rubika Cube Auth Login Page

Once the user has completed the movements needed to solve the puzzle, the system launches the
authentication process, so the Frontend sends the movements to the ML Engine to launch an authentication attempt with the model. The Machine Learning Engine or Backend is responsible for compiling the model and restoring the weights learned for that user. As shown in Fig. 3, the resolution sequence is preprocessed to obtain a probability of authentication based on the features extracted from the movements. If the probability returned by the model is greater than a reference value, the authentication attempt of the user is considered successful, as indicated in Fig. 11.

Fig. 11. Notification of Successful Authentication Attempt in Login Page

If that reference value is not reached, the Frontend sends an email to the user warning of a failed access attempt. In addition, the user who attempted authentication is shown an error notification, along with the probability obtained in the failed attempt, as shown in Fig. 12.

Fig. 12. Notification of Unsuccessful Authentication Attempt in Login Page

The next section provides more details about the performance of this system, together with the characteristics of the dataset gathered from the users for algorithm training.

V. EXPERIMENTAL EVALUATION AND PERFORMANCE ANALYSIS

This section offers an evaluation of the features derived from the available dataset with which the models have been trained, together with the training methodology chosen and the performance observed in this proof of concept for the two algorithms supported in Auth Backend ML.

Each user was asked to perform a minimum number of resolution sequences starting from an arbitrary state of disorder. At the time of writing, the sequences gathered from all authenticable users, and the movements they contain, were stored in the database.

A. 
Feature extraction

A significant amount of information is stored in the database after each resolution attempt of the 3x3x3 cube by the users during the training phase. The possible combinations of rotations and resolution speeds per user are so diverse (see Section I) that, in order to find discernible resolution structures, the present research tries to find the biometric signature of each user by extracting the following features:

• Cube 3x3x3 type. Type string. Supported values: ‘11Paths’ or ‘GiiKER’®.
• Delta timestamps for each turn, where $n$ is the number of movements in a sequence. Type float.
• Solving sequence duration in seconds. Type float.
• Number of movements per sequence. Type integer.
• The median (second quartile) of the delta timestamps distribution for each sequence, in seconds. Type float.
• Percentage of turns of each supported turn code in each sequence. Type float.
• Yaw-pitch-roll angles between movements, in degrees. Type float. If Cube11Paths is being used, the database registers all the positions between turns. Every $\tau$ milliseconds, the 3x3x3 cube sends its positioning in yaw-pitch-roll, so the number of positions provided in each turn is not necessarily the same and depends on the time separation between turns. If a GiiKER® cube is being used, no positioning data is provided, so a placeholder value is stored in the database instead. The following features are derived from the cube positioning as an input to the model:
  - Standard deviations for the mean of yaw, pitch and roll for the entire sequence.
  - Mean values of the standard deviations of yaw, pitch and roll during the whole sequence.
  - Mean values of the angular speed in each rotation axis of the 3x3x3 cube, derived from the delta timestamps for each turn.

All the previous features can be obtained from every resolution sequence, together with its corresponding and unique label.
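The turn-based features listed above can be computed from a time-ordered list of movements. The sketch below is illustrative only: the input format, the feature names and the twelve turn codes (one letter per face, with a prime marking counter-clockwise turns) are assumptions, since the paper does not list the codes the cubes actually emit. Positioning features are left out for brevity.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical turn codes: one per face, prime = counter-clockwise.
TURN_CODES = ["U", "U'", "D", "D'", "L", "L'", "R", "R'", "F", "F'", "B", "B'"]


def extract_features(moves):
    """Compute per-sequence features from (timestamp_seconds, turn_code) pairs."""
    if len(moves) < 2:
        raise ValueError("at least two movements are needed")
    timestamps = [timestamp for timestamp, _ in moves]
    # Time elapsed between each pair of consecutive turns.
    deltas = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    counts = Counter(code for _, code in moves)
    features = {
        "duration_s": timestamps[-1] - timestamps[0],
        "n_moves": len(moves),
        "median_delta_s": median(deltas),
        "mean_delta_s": mean(deltas),
    }
    # Share of each turn code, as a percentage of all turns in the sequence.
    for code in TURN_CODES:
        features[f"pct_{code}"] = 100.0 * counts[code] / len(moves)
    return features


example = [(0.0, "U"), (0.5, "R"), (1.5, "U'"), (2.0, "U")]
features = extract_features(example)
print(features["duration_s"], features["median_delta_s"], features["pct_U"])
```

Keeping a fixed list of turn codes guarantees that every sequence yields a feature vector of the same length, which the classifiers require.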
B. Feature scaling and normalization

Scaling and normalization before applying the classification models included in the present Auth Backend ML are very important, especially for SVMs. The main advantage of scaling is that it prevents attributes in greater numeric ranges from dominating those in smaller numeric ranges, which is something that occurs in our dataset. Another advantage is that it avoids numerical difficulties during the calculation [16]. For SVMs, kernel values usually depend on the dot products of feature vectors, so large feature values might cause numerical problems. Logistic Regression copes better with unscaled data, but feature scaling has also been applied when training with this technique, as better performance was observed.

All the data extracted from the database collection named “movements”, where all the movements of the cube are stored as indicated in Section II.B, are loaded into a Pandas [17] DataFrame for manipulation. All the extracted features are numerical, so all of them were normalized to zero mean and unit variance, as this was the normalization technique that gave the best results when training both algorithms.

No normalization has been applied to the data when training and testing the Random Forest Classifier. Random Forest is invariant to transformations of individual features: translations or per-feature scalings do not change anything for it. For this kind of algorithm, since one feature is never compared in magnitude to other features, the ranges do not matter; only the range of one feature is split at each stage.

It is important to remark that the feature extraction strategy does not change depending on whether the cube supports position provisioning or not.

C. 
Machine Learning Classifiers training
Before training of the supported models is launched, and given the great diversity of features derived from the users' solving sequences, it is necessary to perform a Grid Search to fine-tune the hyperparameters, aiming at the maximum possible accuracy for each user. In this work, a random Grid Search was conducted over the applicable parameters of each algorithm: Logistic Regression, Support Vector Machine and Random Forest Classifier. Given the large number of features extracted from the available dataset, this search was needed to select the hyperparameters that optimize the behaviour of the classifiers at the training stage. Such experiments are common in the empirical machine learning literature, where they are used to optimize the hyperparameters of learning algorithms. It is also common to perform multistage, multi-resolution grid experiments that are more or less automated [18].
The training procedure, with normalized data for Logistic Regression and Support Vector Classifier and with unscaled data for Random Forest Classifier, is performed after the grid search is finished, so that the optimal hyperparameters can be applied for every user.
Even though a minimum number of solving sequences was required from each user, only a scarce dataset was available. Thus, it was necessary to apply data augmentation techniques without distorting the characteristics of the solving sequences. In addition, when feeding our model, a clear case of Imbalanced Learning is faced: the number of sequences not belonging to the user is much higher than the number of that user's own sequences, which are the ones we want the model to generalize. Unbalanced classes create an important problem: the accuracy (i.e. the
ratio of test samples for which we predicted the correct class) is no longer a good measure of the model performance [19], as the training process might arrive at a local optimum that always predicts that the sequence fed to the model does not belong to that user, making it hard to further improve the model.
Our desired trade-off between sensitivity and specificity is achieved by an up-sampling procedure applied to the original dataset. Up-sampling, in this case, consists of first randomly replicating the observations of the minority class (label equal to 1 in the scenario of the present publication) up to a given fraction of the size of the majority class, in order to reinforce its effect on the training of the model in both the training and test datasets. After that, the Synthetic Minority Over-sampling Technique (SMOTE) [20] is used to over-sample the minority class again by creating synthetic minority class examples. Otherwise, the model would only be able to predict 0, which means it would completely ignore the minority class in favor of the majority class. The accuracy would be very high, but it would not reflect the effectiveness in predicting the positive class.
In order to incorporate the desired trade-off into the training process, the samples of the different classes need to contribute differently to the loss. Besides, the available dataset with all solving sequences may not be enough on its own to produce the desired accuracy in our model. Once the under-representation of the user's sequences is compensated, the whole dataset is split into training and validation sets at a fixed ratio.
D.
Machine Learning Classifiers performance evaluation
Different assessment methods, such as accuracy or precision, are sensitive to imbalanced data, that is, when the samples of one class in a dataset outnumber the samples of the other class or classes [21] [22]. The following metrics are considered:
• Recall, sensitivity, true positive rate or hit rate of a classifier represents the ratio of correctly classified positive samples to the total number of positive samples, and it is estimated according to the following expression:
$$\mathrm{Recall} = \frac{tp}{tp + fn}$$
where $tp$ are the true positives and $fn$ are the false negatives.
• Additionally, predictive values, positive and negative, reflect the performance of the prediction. The positive predictive value, or precision, represents the proportion of positive samples that were correctly classified to the total number of samples predicted as positive [22]:
$$\mathrm{Precision} = \frac{tp}{tp + fp}$$
where $fp$ are the false positives, that is, the samples wrongly classified as performed by the user seeking authentication.
• As indicated above, accuracy can be dominated by a large number of true negatives if the dataset is imbalanced. The 𝐹1-score is a better measure when a balance between precision and recall is sought and an imbalanced class distribution is observed in the dataset, for instance a large number of actual negatives:
$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Recall, precision and 𝐹1-score are considered as the reference metrics for the goodness of our classifier.
In order to evaluate the performance of our estimators on the database built from the contributions of the users, a manual k-Fold Cross Validation of the dataset has been conducted to robustly estimate the performance of the supported algorithms on unseen data. The performance measure is then averaged across all the models that are created.
When the current paper was issued, there were no solving sequences with positioning available; that is, all the training was done with a 3x3x3 GiiKER® cube, so the impact that incorporating the positioning could have on the user authentication capacity has not been assessed.
Table 1 presents the results of the conducted experiments: precision, recall and F1-score for every user and for the three models supported in the System. For each user, the first row gives the metrics for the Logistic Regression model, the second row gives the result for the Support Vector Machine Classifier and the third one refers to the prediction capabilities of the Random Forest Classifier.
Table 1. Precision/recall/F1-Score for every authenticable user in the System considering the three supported algorithms.
[Table 1 body: one LR, SVM and RF row per user; the numeric values are not available here.]
Logistic Regression and Random Forest Classifiers have shown the best performance, as indicated in Table 1. The high variance of the scores across users can be attributed to the difficulty of the assessed task, as well as to the fact that not all the users considered in the experiments solved the 3x3x3 cube the same number of times. However, the inference capabilities of both models appear adequate, as reflected in the mean 𝐹1 scores obtained for Logistic Regression and for Random Forest Classifiers considering all users.
In the Random Forest model, all precisions are close to 1, while recalls are smaller. This indicates that false negatives may occur if this algorithm is chosen as the main engine of the Auth Backend ML, while false positives seem quite unlikely to happen during authentication attempts.
Logistic Regression shows a different picture. In this case, recall is close to 1 while precision values are lower. This means that this model, even though its performance is good, is more likely to make a mistake, since a negative sequence is more likely to be classified as positive (a false positive).
The performance of the Support Vector Machine Classifier is not adequate. Precision, recall and 𝐹1 scores different from zero were obtained only for those users with very short and chaotic solving sequences with the 3x3x3 cube. Therefore, this type of algorithm, combined with the features extracted from each sequence as indicated in section V.A., cannot correctly infer the biometric pattern of a user to ensure reliable authentication attempts.
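The difference between a high-precision, lower-recall engine and a high-recall, lower-precision engine can be made concrete with a small sketch. The labels below are made up for illustration and are not results from this work; the function names are hypothetical, label 1 stands for "sequence belongs to the user", and the helper that averages scores over folds only mirrors the averaging step of a k-Fold Cross Validation. Only the Python standard library is used.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives (label 1 = user)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, fn


def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, f1); a ratio with an empty denominator is 0.0."""
    tp, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def mean_over_folds(fold_scores):
    """Average (precision, recall, f1) tuples obtained on each fold."""
    n = len(fold_scores)
    return tuple(sum(scores[i] for scores in fold_scores) / n for i in range(3))


# Made-up labels: four sequences from the user, four from other users.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
strict = [1, 1, 0, 0, 0, 0, 0, 0]   # rejects when in doubt: no false positives
lenient = [1, 1, 1, 1, 1, 1, 0, 0]  # accepts when in doubt: no false negatives

strict_scores = precision_recall_f1(y_true, strict)
lenient_scores = precision_recall_f1(y_true, lenient)
averaged = mean_over_folds([(1.0, 0.5, 0.5), (0.5, 1.0, 0.5)])
```

Here the strict classifier never accepts a sequence from another user but rejects half of the legitimate ones, while the lenient classifier accepts every legitimate sequence at the cost of admitting two foreign ones, which is the kind of trade-off an authentication service has to weigh.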
In an ideal scenario, precision and recall should both be as close as possible to 1 but, as this is not always possible, the best alternative has to be chosen: it is more important for the Web Service to provide a low rate of false positives, so that only legitimate users have a chance of successfully using the authentication System. According to this argument, the Random Forest Classifier, the simplest Machine Learning algorithm supported in the service, becomes the most suitable one to be used as an authentication engine.
VI. CONCLUSIONS AND FUTURE WORK
In this paper, a Web Service authentication proof-of-concept service is proposed, based on Machine Learning analysis of the movements of a 3x3x3 cube and making use of the solving sequences of cubes that can send both their rotation sequences and positioning through the BLE protocol. In order to make this possible, both a hardware development of the authentication
device and software implementation of the platform, interface and authentication engine in the web service have been performed. In this work, a 3x3x3 cube called Cube11Paths has been designed and manufactured by ElevenPaths. This device is capable of transmitting via a BLE channel not only the sequences of turns, but also positioning sequences, which differentiates it from the rest of today’s commercial puzzles.
To allow authentication in a Web Service using a 3x3x3 cube, a machine learning engine dedicated to binary classification is proposed, using Logistic Regression, Support Vector Machine and Random Forest Classifier algorithms on the most representative characteristics of each user's solving sequences. The previous section showed that Random Forest Classifiers and Logistic Regression algorithms have the best performance as engines for user authentication, based on the mean 𝐹1 scores obtained for both algorithms considering all users.
The present investigation highlights that authentication making use of this type of 3x3x3 puzzle can be a challenging task. On the one hand, obtaining a suitably populated dataset is time-consuming for each user. On the other hand, the large number of features to be mapped in the addressed problem makes the training of Machine Learning models complex.
As part of the future work, it is proposed to take into account the positioning data of the 3x3x3 cube through the specified characteristics (see Section V.A.), to apply a stricter Grid Search for hyperparameter fine-tuning, and to apply dimensionality reduction techniques [23] on the dataset to check whether the authentication capacity of the System is improved.
Considering the high dimensionality of the collected dataset, not only will new Machine Learning algorithms that could be more effective be investigated, but Deep Learning techniques are also intended to be used to address the problem approached in this paper. Among the many fields of Machine Learning, Deep Learning is one of those that has grown the most in recent years [24], thanks to its high capacity for abstraction and simplification of the data its architectures are fed with. Deep Learning, in both its supervised and unsupervised learning strategies, is able to automatically learn hierarchical representations, which makes it well suited to processing large volumes of unstructured and heterogeneous data, which is what is faced when all the movements generated by the users registered in the System are processed. The recent rise of Deep Learning has allowed rapid and effective development in fields such as computer vision or the processing of sequences and the recognition of patterns in them, which encompass tasks as well known today as Natural Language Processing (NLP), Sentiment Analysis or the recognition of biometric patterns [25] [26]. We therefore suspect that this type of architecture could be effective in the task addressed in this research, possibly through a combination of Recurrent Neural Networks [27] [28] [29] and Convolutional Neural Networks [4].
ACKNOWLEDGEMENT
The authentication system has been deployed on an Amazon Web Services (AWS) instance with the following specifications: Ubuntu Server operating system, 16 GB of RAM and 31 GB of storage. This is a work developed by the CDO unit of Telefónica.
REFERENCES
[1] M. Wang, W. Deng (2018). “Deep Face Recognition: A Survey”. https://arxiv.org/pdf/1804.06655.pdf
[2] A. S. Al‑Waisy, R. Qahwaji, S. Ipson, S. Al‑Fahdawi (2017).
“A multi-biometric iris recognition system based on a deep learning approach”. https://bradscholars.brad.ac.uk/handle/10454/15682
[3] B. Balci, D. Saadati, D. Shiferaw. Handwritten Text Recognition using Deep Learning. Available: http://cs231n.stanford.edu/reports/2017/pdfs/810.pdf
[4] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech, and time series,” The handbook of brain theory and neural networks, pp. 255–258, 1998.
[5] Rajavel Dhandabani, Shantharajah S. Periyasamy, Padma Theagarajan, Arun Kumar Sangaiah, "Six-face cubical key encryption and decryption based on product cipher using hybridisation and Rubik's cubes", IET Networks, vol. 7, no. 5, pp. 313-320, 2018.
[6] Govinda K.; Prasanna S. A generic image cryptography based on Rubik's cube.
[7] Chodorow, Kristina; Dirolf, Michael (September 23, 2010), MongoDB: The Definitive Guide (1st ed.), O'Reilly Media, p. 216, ISBN 978-1-4493-8156-1.
[8] F. Pedregosa et al. Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, 12, 2825-2830 (2011).
[9] Travis E. Oliphant. A guide to NumPy, USA: Trelgol Publishing (2006).
[10] Jones E, Oliphant E, Peterson P, et al. SciPy: Open Source Scientific Tools for Python, 2001-, http://www.scipy.org/
[11] John D. Hunter. Matplotlib: A 2D Graphics Environment, Computing in Science & Engineering, 9, 90-95 (2007), DOI:10.1109/MCSE.2007.55
[12] Travis E. Oliphant. Python for Scientific Computing, Computing in Science & Engineering, 9, 10-20 (2007), DOI:10.1109/MCSE.2007.58
[13] K. Jarrod Millman and Michael Aivazis. Python for Scientists and Engineers, Computing in Science & Engineering, 13, 9-12 (2011), DOI:10.1109/MCSE.2011.36
[14] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A Practical Guide to Support Vector Classification. Department of Computer Science, National Taiwan University, Taipei 106, Taiwan.
http://www.csie.ntu.edu.tw/~cjlin Last updated: May 19, 2016.
[15] Louppe, G. Understanding Random Forests: From Theory to Practice, 2015. URL https://arxiv.org/abs/1407.7502
[16] W. S. Sarle. Neural Network FAQ, 1997. URL ftp://ftp.sas.com/pub/neural/FAQ.html. Periodic posting to the Usenet newsgroup comp.ai.neural-nets.
[17] Wes McKinney. Data Structures for Statistical Computing in Python, Proceedings of the 9th Python in Science Conference, 51-56 (2010).
[18] J. Bergstra, Y. Bengio. Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research 13 (2012) 281-305. URL http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
[19] H. He, E. A. Garcia. Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, September 2009.
[20] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, 321-357, 2002. https://jair.org/index.php/jair/article/view/10302/24590
[21] M. Sokolova, N. Japkowicz, S. Szpakowicz. Beyond accuracy, f-score and roc: a family of discriminant measures for performance evaluation. Australasian Joint Conference on Artificial Intelligence, Springer (2006), pp. 1015-1021.
[22] C. Torrano, P. Recuero, F. J. Ramírez, S. Hernández, J. Torres. Machine Learning aplicado a Ciberseguridad. Técnicas y ejemplos en la detección de amenazas. Pages 113-115. Ed. 0xWORD. First edition. ISBN: 978-84-09-06918-7.
[23] L. van der Maaten, E. Postma, J. van den Herik. Dimensionality Reduction: A Comparative Review. October 26, 2009. https://lvdmaaten.github.io/publications/papers/TR_Dimensionality_Reduction_Review_2009.pdf
[24] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016. Available online: http://www.deeplearningbook.org
[25] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Netw., vol. 61, pp. 85–117, Jan. 2015.
[26] Bengio, Yoshua (2009). "Learning Deep Architectures for AI". Available: https://www.iro.umontreal.ca/~lisa/pointeurs/TR1312.pdf
[27] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[28] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2014, pp. 3104–3112.
[29] A. Graves, A. R. Mohamed, and G. Hinton, “Towards end-to-end speech recognition with recurrent neural networks,” in Proc. Int. Conf. Mach. Learn., vol. 14, 2014, pp. 1764–1772.