Visualizing a decision tree

A couple of days ago, a friend of mine asked me to do a mini-project on decision tree classification. The catch was that I couldn't use packages such as scikit-learn; I had to write everything from scratch. I completed it and handed it over. Then he asked me to do the bonus part of the project, which was visualizing the tree, again without using any packages. I had some fun doing it, so I thought I would share it here ;).
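To give a taste of the idea, here is a minimal sketch, using only the standard library, of printing a tree recursively as indented text. This is not the code from the post; the Node class is a hypothetical stand-in for whatever the from-scratch classifier produces.

```python
# Minimal sketch: print a binary decision tree as indented pseudo-code.
# The Node structure is hypothetical, not the post's actual implementation.

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature = feature      # index of the feature used for the split
        self.threshold = threshold  # split threshold
        self.left = left            # subtree for feature <= threshold
        self.right = right          # subtree for feature > threshold
        self.label = label          # class label if this is a leaf

def print_tree(node, depth=0):
    indent = "    " * depth
    if node.label is not None:      # leaf node: just print the prediction
        print(f"{indent}predict: {node.label}")
        return
    print(f"{indent}if x[{node.feature}] <= {node.threshold}:")
    print_tree(node.left, depth + 1)
    print(f"{indent}else:")
    print_tree(node.right, depth + 1)

# Example: a tiny hand-built tree
tree = Node(feature=0, threshold=2.5,
            left=Node(label="A"),
            right=Node(feature=1, threshold=1.0,
                       left=Node(label="B"),
                       right=Node(label="C")))
print_tree(tree)
```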

Read More

What should you do if training loss does not decrease? (part 1)

I was working on a time-series classification problem for my PhD thesis a few months ago. I was experimenting with a decoder-less transformer, similar to BERT. The problem was that no matter what I did, the training loss would not decrease. I worked on this problem for almost 5 weeks. At first, I focused on tuning hyper-parameters. My first thought was that the issue was the learning rate, so I read up on learning rate choices and the well-known learning rate schedulers. I suspected I was using an incorrect learning rate, which resulted in either overshooting the minimum or never reaching it (as shown in the image below).
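As a rough illustration of the kind of learning-rate sweep I mean, here is a minimal PyTorch sketch, not the actual thesis code; the tiny model, random data, and choice of cosine annealing are placeholders for the real setup.

```python
# Minimal sketch of a learning-rate sweep with a scheduler (placeholder model/data).
import torch
import torch.nn as nn

x, y = torch.randn(256, 32), torch.randint(0, 2, (256,))   # dummy data
loss_fn = nn.CrossEntropyLoss()

for lr in (1e-2, 1e-3, 1e-4):                               # candidate learning rates
    # Re-initialize the model for each candidate so runs are comparable.
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # Cosine annealing is one of the well-known schedulers mentioned above.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
    for epoch in range(50):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()
    print(f"lr={lr:.0e}  final training loss: {loss.item():.4f}")
```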

Read More

Why do we need multiple hidden layers?

Deep neural networks stack numerous hidden layers, although the reasoning behind this is still not entirely clear. However, there are a few strong arguments that we can accept. The first is that as a network's depth (number of layers) increases, the width (size) each layer needs for a fixed accuracy decreases. In other words, a shallow network requires significantly more neurons to reach the same accuracy as a deep, narrow network. Therefore, instead of a large shallow network with many neurons, we can have an efficient and equally effective deep network with fewer neurons. The following two images are taken from the book Deep Learning by Aaron Courville, Ian Goodfellow, and Yoshua Bengio (2015).
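A rough way to see the parameter savings is to simply count weights; the sketch below compares a shallow, wide MLP with a deep, narrow one. The layer sizes are illustrative choices of mine, not taken from the book, and equal parameter counts do not by themselves guarantee equal accuracy.

```python
# Sketch: parameter counts for a shallow-wide versus a deep-narrow MLP.
import torch.nn as nn

def count_params(model):
    return sum(p.numel() for p in model.parameters())

shallow_wide = nn.Sequential(            # one hidden layer with many neurons
    nn.Linear(10, 2048), nn.ReLU(),
    nn.Linear(2048, 1),
)

deep_narrow = nn.Sequential(             # several hidden layers, each small
    nn.Linear(10, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

print("shallow, wide :", count_params(shallow_wide))   # ~24.6k parameters
print("deep, narrow  :", count_params(deep_narrow))    # ~9.1k parameters
```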

Read More