Artificial intelligence (AI) is primarily a math problem. We finally have enough data and processing power to take full advantage of deep neural networks, a type of AI that learns to discover patterns in data, and about a decade ago they began to surpass standard algorithms.
Today's neural networks are even hungrier for data and processing power. Training them requires fine-tuning the values of millions, if not billions, of parameters that define these networks and represent the strengths of the connections between artificial neurons. The goal is to find near-perfect values for these parameters, a process known as optimization, but training the networks to reach that point is difficult.
That could change shortly.
Researchers at the University of Guelph have created and trained a "hypernetwork" (a kind of overlord for other neural networks) that could help speed up the training process. The hypernetwork forecasts the parameters of a new, untrained deep neural network tailored to some task in a fraction of a second, potentially eliminating the need for training.
The finding may have more profound theoretical ramifications because the hypernetwork learns the incredibly intricate patterns in the designs of deep neural networks.
For the time being, the hypernetwork functions admirably in some situations, but there is still potential for improvement, which is understandable given the scope of the problem.
A technique known as stochastic gradient descent (SGD) is currently the best way to train and improve deep neural networks. Training aims to minimize the network's errors on a particular task, such as image recognition.
An SGD algorithm churns through labeled data to adjust the network's parameters and reduce the errors, or loss. Gradient descent is the iterative process of descending from high values of the loss function toward a minimum value that represents the best possible parameter values.
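To make that loop concrete, here is a minimal sketch of a standard SGD training loop in PyTorch. The model, data loader, learning rate, and epoch count are placeholders for illustration, not details from the paper.

```python
import torch
import torch.nn.functional as F

def train_with_sgd(model, data_loader, lr=0.1, epochs=10):
    """Plain stochastic gradient descent: repeatedly nudge the parameters
    downhill on the loss computed from labeled data."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in data_loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(images), labels)  # how wrong the network is
            loss.backward()                                # gradient of the loss w.r.t. each parameter
            optimizer.step()                               # take one step downhill
    return model
```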
However, this method is only effective if you have a network to optimize.
Engineers must rely on intuition and rules of thumb to create the initial neural network, which comprises numerous layers of artificial neurons that lead from an input to an output. In theory, one could start with many architectures, optimize each one, and then choose the best.
In practice, however, it is nearly impossible to train and test every candidate architecture; the approach simply doesn't scale, especially when millions of different designs are taken into account.
Given a set of potential architectures, a graph hypernetwork (GHN) was created to find the best deep neural network architecture for a given task. The strategy is summed up in the name.
The term “graph” alludes to the idea that a deep neural network’s design can be compared to a mathematical graph, consisting of a collection of points, or nodes, connected by lines or edges. The nodes represent computing units (often a whole layer of a neural network), and the edges indicate the connections between these units.
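As a rough illustration of this graph view (a toy encoding, not the exact representation used in the paper), a small convolutional network could be stored as a set of operation-labeled nodes plus directed edges describing where each unit's output flows:

```python
# Hypothetical toy encoding of a small convolutional network as a graph.
# Each node is a computational unit; each directed edge is a data dependency.
nodes = {
    0: {"op": "input"},
    1: {"op": "conv3x3", "channels": 64},
    2: {"op": "relu"},
    3: {"op": "pool2x2"},
    4: {"op": "linear", "out_features": 10},
}
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]  # input -> conv -> relu -> pool -> linear
```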
Any architecture that needs to be optimized (let's call it the candidate) is the starting point for a graph hypernetwork. It then does its best to predict the candidate's optimal parameters. The team then sets the parameters of a real neural network to the predicted values and puts it to the test on a specific task. If the prediction isn't good enough, the network can still be trained further with gradient descent.
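Put together, that workflow might look roughly like the sketch below, where `to_graph`, `load_predicted_params`, and `evaluate` are hypothetical helper functions standing in for the real implementation:

```python
def test_candidate(candidate_net, hypernetwork, test_loader):
    """Score one candidate architecture using only predicted parameters."""
    graph = to_graph(candidate_net)                  # encode the architecture as nodes and edges
    predicted = hypernetwork(graph)                  # forecast parameters in one forward pass
    load_predicted_params(candidate_net, predicted)  # calibrate the real network to those values
    return evaluate(candidate_net, test_loader)      # measure accuracy on the task, no training
```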
GHN-2, the second version of the technique, improved on two critical aspects of the original GHN. First, it relied on GHN's representation of a neural network's architecture as a graph. Each node in the graph represents a subset of neurons that do some specific computation. The graph's edges depict how information flows from node to node, from input to output.
The second idea was to train the hypernetwork to produce predictions for new candidate designs.
This requires two additional neural networks. The first performs computations on the original candidate graph, updating the information associated with each node. The second takes the updated nodes as input and predicts the parameters for the candidate neural network's corresponding computational units. These two networks have their own sets of parameters, which must be optimized before the hypernetwork can predict parameter values correctly.
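A heavily simplified sketch of this two-part design is shown below. It assumes fixed-size node features and a single flat parameter block per node; the real system has to handle variable parameter shapes and multiple rounds of message passing.

```python
import torch
import torch.nn as nn

class TinyGraphHypernetwork(nn.Module):
    """Toy graph hypernetwork: one message-passing step updates the node
    representations, then a decoder maps each node to a flat block of
    predicted parameters for the corresponding computational unit."""

    def __init__(self, node_dim=32, max_params_per_node=4096):
        super().__init__()
        self.message = nn.Linear(node_dim, node_dim)   # network 1: computations on the graph
        self.update = nn.GRUCell(node_dim, node_dim)
        self.decoder = nn.Sequential(                  # network 2: parameter prediction
            nn.Linear(node_dim, 256), nn.ReLU(),
            nn.Linear(256, max_params_per_node),
        )

    def forward(self, node_feats, adjacency):
        # node_feats: (num_nodes, node_dim); adjacency: (num_nodes, num_nodes) float matrix
        messages = adjacency @ self.message(node_feats)  # aggregate information from neighbors
        updated = self.update(messages, node_feats)      # update each node's representation
        return self.decoder(updated)                     # one predicted parameter block per node
```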
You'll need training data for this; in this case, it's a random sample of possible artificial neural network (ANN) architectures. For each architecture in the sample, you start with its graph, use the graph hypernetwork to predict parameters, and initialize the candidate ANN with those predicted parameters.
The ANN then performs a specified task, such as image recognition.
You compute the ANN's loss but then, instead of updating the ANN's parameters, you update the parameters of the hypernetwork that made the prediction in the first place. This lets the hypernetwork do better the next time around.
Iterate over every image in a labeled training data set and over every ANN in the random sample of architectures, reducing the loss at each step, until the hypernetwork can do no better. Eventually, you have a trained hypernetwork.
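Assuming hypothetical helpers like those in the earlier sketches (`to_graph`, plus a `run_with_params` function that runs an architecture functionally with the predicted tensors), that training loop could look like the following. The key point, reflected in the comments, is that the loss computed on the candidate's outputs is backpropagated into the hypernetwork's parameters, not the candidate's; the choice of optimizer here is arbitrary.

```python
import torch
import torch.nn.functional as F

def train_hypernetwork(hypernetwork, architectures, data_loader, lr=1e-3, epochs=10):
    """Train the hypernetwork by judging the networks it parameterizes."""
    optimizer = torch.optim.Adam(hypernetwork.parameters(), lr=lr)
    for _ in range(epochs):
        for arch in architectures:                        # random sample of ANN designs
            graph = to_graph(arch)                        # hypothetical helper
            for images, labels in data_loader:            # labeled images, e.g. CIFAR-10
                predicted = hypernetwork(graph)           # forecast the candidate's parameters
                logits = run_with_params(arch, predicted, images)  # hypothetical helper
                loss = F.cross_entropy(logits, labels)    # the candidate's loss on the task...
                optimizer.zero_grad()
                loss.backward()                           # ...updates the hypernetwork instead
                optimizer.step()
    return hypernetwork
```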
To ensure that GHN-2 learns to forecast parameters for a wide range of target neural network architectures, a unique data set of 1 million different structures was constructed. As a result, the prediction abilities of GHN-2 are more likely to translate effectively to unknown target architectures.
For example, the sample accounts for all of the typical state-of-the-art designs that people use.
Results
Of course, the real test was putting GHN-2 to work.
After training it to predict parameters for a specific task, such as classifying images within a particular data set, Knyazev and his team tested how well it could predict parameters for a random candidate architecture.
This new contender could have traits that are comparable to those of the million structures in the training data set, or it could be unique — an outlier.
In the first scenario, the target architecture is said to be in distribution; in the second, it is out of distribution. Deep neural networks frequently fail on out-of-distribution data, so testing GHN-2 on such data was crucial.
The scientists used a fully trained GHN-2 to predict parameters for 500 random target network designs it had never seen before. These 500 networks, with their parameters set to the predicted values, were then pitted against the same networks trained with stochastic gradient descent. The new hypernetwork held its own against hundreds of iterations of SGD, and in some cases did even better, although the results were mixed.
On a data set known as CIFAR-10, GHN-2's average accuracy on in-distribution designs was 66.9%, close to the 69.2% average accuracy achieved by networks trained with 2,500 iterations of SGD. For out-of-distribution architectures, GHN-2 did surprisingly well, reaching roughly 60% accuracy. In particular, it achieved a respectable 58.6% accuracy on a well-known deep neural network architecture called ResNet-50.
Most importantly, GHN-2 predicted parameters for ImageNet in less than a second. By contrast, using SGD (the current way of training deep neural networks) to reach the same performance as the predicted parameters took, on average, 10,000 times longer on their graphics processing unit.
When GHN-2 selects the best neural network for a task from a set of candidates and that best option isn't good enough, the winner is at least partially trained and can be refined further. Instead of unleashing SGD on a network with random parameter values, one can use GHN-2's predictions as the starting point.
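As a minimal sketch of that warm start, reusing the hypothetical helpers from the earlier snippets: initialize the candidate with the hypernetwork's prediction, then let ordinary SGD refine it.

```python
def warm_start_and_finetune(candidate_net, hypernetwork, train_loader):
    # Start from the hypernetwork's prediction rather than random parameter values...
    predicted = hypernetwork(to_graph(candidate_net))
    load_predicted_params(candidate_net, predicted)
    # ...then let ordinary SGD refine the already partially trained network.
    return train_with_sgd(candidate_net, train_loader, lr=0.01, epochs=5)
```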
GHN-2 and Beyond
Machine learning experts used to prefer hand-crafted algorithms to the mysterious deep nets. However, this changed as gigantic deep nets trained on massive amounts of data began to outperform standard algorithms.
There are numerous areas where it could improve. For example, GHN-2 can only be trained to predict parameters for a specific task, such as classifying CIFAR-10 or ImageNet images, but not both at once.
If these hypernetworks become popular, the creation and development of revolutionary deep neural networks will no longer be limited to organizations with significant finances and access to large amounts of data. Anyone could participate in the game.
However, if hypernetworks like GHN-2 become the usual way to optimize neural networks, you will have a neural network, itself effectively a black box, anticipating the parameters of another neural network. So when it makes a mistake, you have no way of explaining why.
GHN-2 demonstrates the ability of graph neural networks to detect patterns in complex data. Deep neural networks typically look for patterns in images, text, or audio signals, which are all highly structured types of data. GHN-2, by contrast, looks for patterns in the graphs of completely random neural network architectures.
GHN-2 can generalize, producing reasonable parameter predictions for unseen and even out-of-distribution network structures. This suggests that many architectures share similar patterns, and that a model can learn to transfer knowledge from one design to another.
This would lead to a better understanding of the mysterious black boxes.
Reference: https://www.quantamagazine.org/researchers-build-ai-that-builds-ai-20220125/
Paper 1: https://arxiv.org/abs/2110.13100
Paper 2: https://arxiv.org/abs/1810.05749