Fully Connected Neural Network: Building a Neural Network Library from Scratch
In the world of machine learning, neural networks have become a cornerstone of modern AI applications. While many developers rely on high-level libraries like TensorFlow or PyTorch, there’s something exhilarating about building a neural network library from scratch. In this article, we will walk through creating a fully connected neural network (FCNN) with C++ host code and OpenCL kernels, focusing on the intricacies of the implementation and the underlying concepts.
Why C++ and OpenCL?
You might wonder why we would choose C++ over a more user-friendly language like Python. The primary reason is performance. When it comes to training neural networks, especially on large datasets, efficiency is key. GPUs are built for massive parallelism, but they cannot execute Python, or even ordinary C++, directly. OpenCL bridges that gap: the compute kernels are written in a C dialect that the OpenCL runtime compiles for the GPU, while C++ code on the host manages memory and dispatches those kernels.
Setting Up the Library Skeleton
Before diving into the complexities of GPU programming, we need to establish the basic structure of our neural network library. Here’s a simple setup to get us started:
// Initialize the OpenCL context, device, command queue, and program.
OpenCL::initialize_OpenCL();
// Training data and held-out test data.
std::vector<std::vector<float>> inputs, targets;
std::vector<std::vector<float>> testinputs;
std::vector<float> testtargets;
ConvNN m_nn;
// Network shape: 1024 inputs (a flattened 32x32 image) and 10 outputs.
std::vector<int> netVec;
netVec = { 1024, 10 };
// The 1 and 32 are assumed to be a softmax flag and the input image width.
m_nn.createFullyConnectedNN(netVec, 1, 32);
// Train for 50,000 iterations, then measure accuracy on 2,000 test samples.
m_nn.trainFCNN(inputs, targets, testinputs, testtargets, 50000);
m_nn.trainingAccuracy(testinputs, testtargets, 2000, 1);
This snippet outlines the usual stages of a machine learning pipeline: build a model, train it, and evaluate it. But instead of calling into libraries like Scikit-learn or TensorFlow, we will implement every step ourselves in C++. So far this is only the interface; the real work is still ahead.
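The ConvNN class itself is not shown above. As a rough sketch, its interface might look like the following; the exact signatures and member names are assumptions for illustration, not the library’s actual API:

#include <vector>

class ConvNN {
public:
    // Build the fully connected layers from a vector of layer sizes.
    // softflag is assumed to toggle a softmax output; inpdim declares
    // the width of the (square) input image.
    void createFullyConnectedNN(std::vector<int> &newNetVec, int softflag, int inpdim);
    // Run forward and backward passes for the given number of iterations.
    void trainFCNN(std::vector<std::vector<float>> &inputs,
                   std::vector<std::vector<float>> &targets,
                   std::vector<std::vector<float>> &testinputs,
                   std::vector<float> &testtargets,
                   int iterations);
    // Report classification accuracy over a number of test samples.
    void trainingAccuracy(std::vector<std::vector<float>> &testinputs,
                          std::vector<float> &testtargets,
                          int numSamples, int flag);
};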
Understanding Nodes and Layers
The fundamental building blocks of any neural network are nodes (or neurons) and layers. A node represents a single unit of computation, while a layer is a collection of nodes. Because these structures must also be visible to the OpenCL kernels, we define them as plain C structs:
typedef struct Node {
    int numberOfWeights;  // number of incoming connections
    float weights[1200];  // one weight per incoming connection
    float output;         // activation produced in the forward pass
    float delta;          // error term computed during backpropagation
} Node;
typedef struct Layer {
    int numOfNodes;
    Node nodes[1200];
} Layer;
Since these structs are consumed by the OpenCL C kernels, we cannot use std::vector inside them. Instead, we define fixed-size arrays. This caps every layer at 1,200 nodes and every node at 1,200 incoming weights, and it allocates that maximum regardless of the actual layer size, but it keeps the implementation straightforward.
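A quick sanity check quantifies that trade-off. Given the struct definitions above, each Node occupies roughly 4.7 KB, so a full Layer weighs in at about 5.5 MB whether or not all 1,200 slots are used:

#include <stdio.h>

int main(void) {
    printf("sizeof(Node)  = %zu bytes\n", sizeof(Node));  // ~4,812 bytes
    printf("sizeof(Layer) = %zu bytes\n", sizeof(Layer)); // ~5.5 MB
    return 0;
}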
Constructing the Neural Network
With our basic structures in place, we can now create the neural network itself, which is essentially a stack of layers:
// Store the layer sizes, then build the layers one by one.
h_netVec = newNetVec;
// The input layer has no incoming weights.
Layer *inputLayer = layer(h_netVec[0], 0);
h_layers.push_back(*inputLayer);
// Each subsequent layer has one weight per node of the previous layer.
for (unsigned int i = 1; i < h_netVec.size(); i++) {
    Layer *hidlayer = layer(h_netVec[i], h_netVec[i - 1]);
    h_layers.push_back(*hidlayer);
}
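The layer() factory function is not shown in the snippet above. A minimal sketch, assuming weights start as small random values, might look like this:

#include <stdlib.h>

Layer *layer(int numOfNodes, int numOfWeights) {
    Layer *l = (Layer *)malloc(sizeof(Layer));
    l->numOfNodes = numOfNodes;
    for (int i = 0; i < numOfNodes; i++) {
        l->nodes[i].numberOfWeights = numOfWeights;
        // Small random initialization (an assumption; the actual
        // scheme may differ in the real implementation).
        for (int j = 0; j < numOfWeights; j++)
            l->nodes[i].weights[j] = ((float)rand() / RAND_MAX - 0.5f) * 0.1f;
        l->nodes[i].output = 0.0f;
        l->nodes[i].delta = 0.0f;
    }
    return l;
}

Because push_back copies the struct into h_layers, the object returned by layer() is otherwise leaked; the caller should free() it after the copy.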
At this point, we have a simple neural network represented as a vector of layers, each holding a fixed-size array of nodes. However, our work is far from complete. We need to train the network on actual data, and this is where OpenCL comes into play.
OpenCL Buffers
To enable GPU access, we must convert our vectors into OpenCL buffers. Buffers are essential for transferring data between the host (CPU) and the device (GPU). Here’s how we can create and populate these buffers:
// Buffer holding one flattened input image (inpdim x inpdim floats).
d_InputBuffer = cl::Buffer(OpenCL::clcontext, CL_MEM_READ_WRITE, sizeof(float) * inpdim * inpdim);
// Create a device buffer for the input layer and copy its nodes over.
tempbuf = cl::Buffer(OpenCL::clcontext, CL_MEM_READ_WRITE, sizeof(Node) * h_layers[0].numOfNodes);
(OpenCL::clqueue).enqueueWriteBuffer(tempbuf, CL_TRUE, 0, sizeof(Node) * h_layers[0].numOfNodes, h_layers[0].nodes);
d_layersBuffers.push_back(tempbuf);
// Do the same for every remaining layer.
for (unsigned int i = 1; i < h_layers.size(); i++) {
    tempbuf = cl::Buffer(OpenCL::clcontext, CL_MEM_READ_WRITE, sizeof(Node) * h_layers[i].numOfNodes);
    (OpenCL::clqueue).enqueueWriteBuffer(tempbuf, CL_TRUE, 0, sizeof(Node) * h_layers[i].numOfNodes, h_layers[i].nodes);
    d_layersBuffers.push_back(tempbuf);
}
This code snippet creates one buffer per layer in the network. The enqueueWriteBuffer call transfers the data from the host to the device; passing CL_TRUE makes the write blocking, so the call returns only after the copy has completed.
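The mirror-image operation matters just as much: after training, the updated weights live only on the GPU. Here is a sketch of copying each layer back to the host, using the same names as above:

for (unsigned int i = 0; i < h_layers.size(); i++) {
    // Blocking read (CL_TRUE): returns once the layer's nodes are back on the host.
    (OpenCL::clqueue).enqueueReadBuffer(d_layersBuffers[i], CL_TRUE, 0,
        sizeof(Node) * h_layers[i].numOfNodes, h_layers[i].nodes);
}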
Defining OpenCL Kernels
Next, we need to define the OpenCL kernels that will execute on the GPU. Kernels are the functions that perform computations. For our fully connected neural network, we need three kernels:
- Forward Propagation Kernel
- Backward Propagation Kernel for the Output Layer
- Backward Propagation Kernel for the Hidden Layers
Here’s how we create handles to these kernels from the compiled OpenCL program:
compoutKern = cl::Kernel(OpenCL::clprogram, "compout");
backpropoutKern = cl::Kernel(OpenCL::clprogram, "backpropout");
backprophidKern = cl::Kernel(OpenCL::clprogram, "backprophid");
These kernels will handle the computations for forward and backward propagation, which are crucial for training our neural network.
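To see how these handles are used, here is a sketch of dispatching the forward-propagation kernel for layer i, with one work-item per node. The argument order matches the kernel signature shown in the next section; the variable names are assumptions:

// Current layer, previous layer, and the softmax flag.
compoutKern.setArg(0, d_layersBuffers[i]);
compoutKern.setArg(1, d_layersBuffers[i - 1]);
compoutKern.setArg(2, softflag);
// Launch one work-item per node in the layer.
(OpenCL::clqueue).enqueueNDRangeKernel(compoutKern, cl::NullRange,
    cl::NDRange(h_layers[i].numOfNodes), cl::NullRange);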
Implementing Backpropagation
Backpropagation is the algorithm used to train neural networks by minimizing the error between predicted and actual outputs. Here’s a simplified version of the forward and backward propagation kernels:
Forward Propagation Kernel
kernel void compout(global Node* nodes, global Node* prevnodes, int softflag) {
    // One work-item per node in the current layer.
    const int i = get_global_id(0);
    // Weighted sum of the previous layer's outputs.
    float t = 0;
    for (int j = 0; j < nodes[i].numberOfWeights; j++)
        t += nodes[i].weights[j] * prevnodes[j].output;
    // Fixed bias term; a learnable bias would be an obvious improvement.
    t += 0.1f;
    nodes[i].output = sigmoid(t);
    // Note: softflag (softmax on the output layer) is ignored in this
    // simplified listing.
}
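The kernel calls sigmoid, and the backward kernels below call devsigmoid; neither is an OpenCL built-in. They would be defined in the same kernel source, along these lines:

// Logistic activation.
float sigmoid(float x) {
    return 1.0f / (1.0f + exp(-x));
}
// Derivative of the sigmoid, written in terms of its output y = sigmoid(x):
// sigmoid'(x) = y * (1 - y), which is why the kernels pass nodes[i].output.
float devsigmoid(float y) {
    return y * (1.0f - y);
}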
Backward Propagation Kernel for Hidden Layers
kernel void backprophid(global Node* nodes, global Node* prevnodes, global Node* nextnodes, int nextnumNodes, float a) {
    const int i = get_global_id(0);
    // Accumulate the error flowing back from the next layer: each of its
    // deltas, weighted by the connection from this node.
    float delta = 0;
    for (int j = 0; j != nextnumNodes; j++)
        delta += nextnodes[j].delta * nextnodes[j].weights[i];
    // Scale by the derivative of the activation.
    delta *= devsigmoid(nodes[i].output);
    nodes[i].delta = delta;
    // Gradient-descent step on every incoming weight, with learning rate a.
    for (int j = 0; j != nodes[i].numberOfWeights; j++)
        nodes[i].weights[j] -= a * delta * prevnodes[j].output;
}
Backward Propagation Kernel for Output Layer
kernel void backpropout(global Node* nodes, global Node* prevnodes, global float* targets, float a, int softflag) {
    const int i = get_global_id(0);
    // Error at the output: (prediction - target), scaled by the activation
    // derivative. This is the delta rule for a squared-error loss.
    float delta = (nodes[i].output - targets[i]) * devsigmoid(nodes[i].output);
    // Gradient-descent step on every incoming weight.
    for (int j = 0; j != nodes[i].numberOfWeights; j++)
        nodes[i].weights[j] -= a * delta * prevnodes[j].output;
    nodes[i].delta = delta;
    // As before, softflag is ignored in this simplified listing.
}
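In equation form, with learning rate $a$, these kernels implement the standard gradient-descent updates for a squared-error loss, where the sigmoid derivative is expressed through the node’s own output:

$$\delta_i^{(\mathrm{out})} = (o_i - t_i)\,o_i(1 - o_i), \qquad \delta_i^{(\mathrm{hid})} = o_i(1 - o_i)\sum_k \delta_k\,w_{ki}, \qquad w_{ij} \leftarrow w_{ij} - a\,\delta_i\,o_j^{(\mathrm{prev})}$$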
These kernels perform the necessary computations for both forward and backward propagation, allowing our neural network to learn from the data.
Conclusion
Congratulations! You have successfully built a fully connected neural network library from scratch using C++ and OpenCL. This implementation allows you to train your neural network on a GPU, significantly improving performance compared to CPU-only training.
For the complete code and further exploration, feel free to visit my GitHub repository: Neural Network Library.
In the next part, we will extend this library to include Convolutional Neural Networks (CNNs), which are essential for image processing tasks. Stay tuned for more exciting developments in deep learning!