
We looked at recurrent neural networks: the idea that if you want to be able to do something with a sequence, you start by augmenting the network with recurrence, a kind of feedback, and you train it with backpropagation through time. We also recognized some of the limitations of backpropagation through time. The algorithm is susceptible to vanishing gradients, and to make a prediction at time t it is using information from x at time t - 1, and so on and so forth, so it is plagued by what is known as short-term memory.
The evolution of that was long short-term memory. [inaudible] By doing so, these networks can remember for a little bit longer term. They are an improvement over RNNs, but they still have issues.
Then came the idea of Transformers, and all these large language models are built off of them. So here I want to take a look at the architecture and talk a little bit about what's happening in a Transformer, because it certainly is highly related to deep learning, particularly in the area of natural language. [inaudible] If you are interested, there is a course in data science, course number 691, that is offered quite often, and related, pretty technical courses are offered fairly often as well. Ultimately, to fully understand a lot of the stuff that is happening in Transformers, it helps to have a lot of background in NLP. Nevertheless, I think we can still talk about them.
Large language models and Transformers all date back to an important paper, "Attention Is All You Need". There the authors proposed that, rather than processing one observation at a time in a sequence, it is better to provide the entire sequence at once and allow the system to learn what is important in that sequence. That gives it a lot better context for what it is predicting, hence this idea of attention: you give it the whole sequence, and it is going to figure out where to put its attention in order to make its prediction. The architecture that they created they referred to as a Transformer. This is a visualization, from their paper, of what the architecture of a Transformer looks like.
It has two sides to it. One side, over here, is referred to as the encoder, and over on the right-hand side you have what is referred to as the decoder. After the decoding there is typically a linear layer and a softmax, sized by the number of things that we want to predict. While Transformers are used in a lot of different domains, let's say that is the number of words, so that we are able to predict the probability of, essentially, the next word.
To start understanding what's happening in a Transformer, we'll start off with how data forward propagates, which is also how they are used, and then we'll look at backpropagation.
The idea is that we start off with some input. The input typically has a maximum sequence length that the model will support, so we add some kind of padding as needed, up to the length it will support. And then there is some kind of embedding of the input.
The output is going to be generated autoregressively. Initially the sequence of outputs is empty: we have no outputs. What we actually end up having at the very beginning is just a single thing, a start token, to start generating output, and every other place is again padding, because we haven't generated anything yet, up to the maximum length supported.
Ultimately we are going to use this architecture by forward propagating through it over and over, each time generating a new output to add to our list of outputs, until eventually what gets output is the end token. Each time, we take the original input and the current set of outputs. We encode the input, then we take the current outputs and decode them, given the encodings of the input, and that generates the next token in our output. So we are generating the output one token at a time until eventually we get the end token.
Let's take a little bit deeper dive into how it does all of this. The first thing is some sort of embedding.
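The generation loop described above can be sketched in a few lines. This is only an illustration, not the lecture's code: `greedy_decode`, `toy_next_token_probs`, the `<start>` and `<end>` token strings, and the choice of always taking the most probable token are all assumptions made for the example. The toy function stands in for a full encoder and decoder pass and simply follows a made-up script.

```python
from typing import Callable, Dict, List, Sequence

START, END = "<start>", "<end>"  # hypothetical token names


def greedy_decode(
    source: Sequence[str],
    next_token_probs: Callable[[Sequence[str], Sequence[str]], Dict[str, float]],
    max_len: int = 10,
) -> List[str]:
    """Generate one token at a time until END or max_len is reached."""
    output = [START]  # at the beginning the output holds only the start token
    while len(output) < max_len:
        # One forward pass: original input plus everything generated so far.
        probs = next_token_probs(source, output)
        token = max(probs, key=probs.get)  # assumption: pick the most probable
        output.append(token)
        if token == END:
            break
    return output


def toy_next_token_probs(source, output):
    """Stand-in for encoder + decoder + softmax; follows a fixed script."""
    script = ["de", "nada", END]
    step = len(output) - 1  # number of tokens generated so far
    target = script[min(step, len(script) - 1)]
    return {tok: (0.9 if tok == target else 0.05) for tok in script}


print(greedy_decode(["you", "are", "welcome"], toy_next_token_probs))
# ['<start>', 'de', 'nada', '<end>']
```

In a real model the encoder output would be computed once and reused on every pass, since the input does not change between steps.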
We need a strategy to create an embedding, a representation of each word in the sequence. Again, there are a lot of different embedding strategies, and they are discussed in great detail elsewhere. The simplest thing we can imagine is maybe doing some sort of one-hot encoding, with a column for each word that we want to support. [inaudible] So maybe my input is three words long; I could one-hot encode it, something like that. I do the same thing for my output: I take my current output, whatever it is, and I create an embedding, a representation, for it. So this step is creating a representation.
The next thing that happens is that we take our embedding and add in information about each token's location in the sequence. This is known as the positional encoding. To every single location in our embedding we add an amount that depends on pos, the position, which is the row in our embedding, and i, which is the column. At the i-th location I would have a sine, and at the i-plus-one location I would have a cosine. Before adding that positional encoding, they also suggested multiplying the embedding by the square root of the dimensionality of our embedding, so that the embedding and the encoding have a similar scale.
That gets our data prepared to feed into our encoders. We have our embeddings now, and they get fed into our encoder and decoder, and the first thing you can see in there is multi-head attention. Attention, as described by the authors of that paper, works with a query, keys, and values. In the encoder these are the encoded embeddings of the input, but in the case of the decoder the query is the encoded embeddings of the output. Then we have the keys and the values, and we are using them to do a kind of stochastic lookup: how well does the query match each key, and so what is the probability of each of the values for that query? You could think about this as looking things up in a dictionary, but stochastically: given a query, we get a probability for each value. Numerically, it is computed like this.
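The input preparation described above (one-hot embedding, scaling, and positional encoding) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are made up for the example, and the base of 10000 in the sinusoid is taken from the "Attention Is All You Need" paper rather than from the lecture itself.

```python
import numpy as np


def one_hot(token_ids, vocab_size: int) -> np.ndarray:
    """Simplest possible embedding: one column per supported word."""
    return np.eye(vocab_size)[np.asarray(token_ids)]


def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal encoding: sine in even columns, cosine in odd columns."""
    pos = np.arange(max_len)[:, None]          # row index = position
    two_i = np.arange(0, d_model, 2)[None, :]  # even column indices
    angles = pos / np.power(10000.0, two_i / d_model)  # 10000: paper's constant
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles[:, : d_model // 2])
    return pe


def prepare_input(embedding: np.ndarray) -> np.ndarray:
    """Scale the embedding by sqrt(d_model), then add the positional encoding."""
    max_len, d_model = embedding.shape
    return embedding * np.sqrt(d_model) + positional_encoding(max_len, d_model)


# A three-token input over a hypothetical four-word vocabulary.
x = prepare_input(one_hot([0, 2, 1], vocab_size=4))
print(x.shape)  # (3, 4)
```

The scaling step matters because a one-hot or learned embedding can have much smaller entries than the sines and cosines, which range over [-1, 1].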
We have our query, and we take the dot product of it with each of the keys. Then we apply the softmax to that, which gives us a probability for each value. We multiply those probabilities by the values, and now we get this kind of attention score for each of these outputs. That is going to be the key idea with attention.
Now there is also something known as masked attention. If I go back to the architecture here, you can see that, in particular in the decoder, they are using masked multi-head attention. The idea is that there are three times where we might want to mask. First, the input to the encoder typically has some padding in it, up to the maximum input length, so you might want to mask things that aren't really part of the input sequence. Second, the decoder's input during training also has zero padding, and therefore you might need to mask there as well. And third, during training and evaluation we want to make sure that the decoder is only using things that have already been output, not things that come later.
To do masking, one thing we can do is just apply an element-wise binary mask to the output of our attention. Going back to the equation, it depends on what's going on in the inputs. Imagine, for the decoder, that the only queries that are valid are the ones up to time t: anything before that is valid, and anything after that is not a valid query. And since the attention that the decoder is using is self-attention, the same goes for its keys and values as well.
So one option would be to apply a binary mask to the output of the attention process. The other approach, which you see in most implementations, is to just add a mask that has zeros for valid locations and negative infinity for invalid locations. I have my Q times K transpose, which in the end is going to give me the probability of each value, but I add the mask to it first, such that when I apply the softmax, things that were invalid and have negative infinity get zero probability. [inaudible] So that is the idea of attention and masking.
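A minimal NumPy sketch of the attention computation and the additive mask described above. The function names are made up for the example, and dividing the scores by the square root of the key dimension is an assumption taken from the "Attention Is All You Need" paper; the lecture text above only mentions the dot product, the softmax, and the multiplication by the values.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - np.max(x, axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)                                # exp(-inf) is exactly 0
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V, mask=None):
    """Return (output, weights); rows of Q, K, V are sequence positions."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaling by sqrt(d_k): paper's choice
    if mask is not None:
        scores = scores + mask       # 0 keeps a location, -inf removes it
    weights = softmax(scores, axis=-1)  # probability of each value
    return weights @ V, weights


def causal_mask(n: int) -> np.ndarray:
    """Position t may only look at positions <= t."""
    return np.triu(np.full((n, n), -np.inf), k=1)


def padding_mask(n_queries: int, n_keys: int, n_valid_keys: int) -> np.ndarray:
    """Hide key positions that are only padding."""
    mask = np.zeros((n_queries, n_keys))
    mask[:, n_valid_keys:] = -np.inf
    return mask


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))  # four positions, three features
out, w = attention(X, X, X, mask=causal_mask(4))
print(np.round(w, 2))        # upper triangle is all zeros
```

Adding the mask before the softmax, rather than multiplying the weights by a binary mask afterwards, keeps each row of weights summing to one over the valid locations. Every row needs at least one valid location, otherwise the softmax is undefined.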
Then there is multi-head attention: the idea that we want to have multiple attention units. They take the same kind of keys, queries, and values, but each one is going to have its own weight matrices associated with the keys, queries, and values. So ultimately the output of a head looks like kind of the same thing as before, but now we can see that, for the i-th head, the query, the key, and the value each have their own weight matrix. Ultimately, to get the output of the multi-head attention, we take the concatenation of the outputs of each of the heads and then apply yet another matrix to give us our final multi-head attention output. So the multi-head attention has these matrices as its weights. That is the idea of multi-head attention and masking.
It is also worth noting that in some of these multi-head attention blocks the key, the value, and the query are all the same thing, so we call this self-attention. Over here in the decoder, however, the keys and values are both coming from the output of the encoder, whereas the query is coming from the decoder. So this multi-head attention layer here is doing what is known as cross-attention: the keys and values are coming from the encoder, and the query is coming from the decoder.
Other things are happening in this architecture. You're seeing a lot of these Add & Norm blocks. The add is providing a residual connection, and the norm is layer normalization. And then the other piece in the architecture is the feed forward. The feed forward is really just two fully connected layers with a ReLU activation function, taking the dimensionality of our data to some [inaudible]. Those are really all of the pieces that define a Transformer, and all the pieces that define an encoder.
Now, in the literature, I believe in the initial paper, they had on the left side a stack of encoders, a bunch of these one after another. [inaudible] On the left side, as I kind of went through, we have multi-head self-attention, where, as for the number of weight matrices in there, if we have n heads it has 3n + 1. Then a residual with layer norm. That goes into the feed-forward network, which is two fully connected layers with a ReLU in between, then another residual and layer norm. Then theoretically this would go to the next encoder, and so on and so forth, for as many encoder blocks as we have.
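The multi-head computation described above can be sketched like this. It is a minimal illustration, not the lecture's implementation: the function and variable names, the sizes (two heads, model width 4, head width 2), and the scaling of the scores by the square root of the head width are assumptions for the example. The same function covers self-attention and cross-attention; the only difference is where the query, key, and value inputs come from.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V, mask=None):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # scaling is the paper's choice
    if mask is not None:
        scores = scores + mask
    return softmax(scores) @ V


def multi_head_attention(q_in, k_in, v_in, W_q, W_k, W_v, W_o, mask=None):
    """W_q, W_k, W_v are lists with one weight matrix per head."""
    heads = [
        attention(q_in @ wq, k_in @ wk, v_in @ wv, mask)
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    # Concatenate the head outputs, then apply the output matrix W_o.
    return np.concatenate(heads, axis=-1) @ W_o


rng = np.random.default_rng(0)
n_heads, d_model, d_head = 2, 4, 2  # hypothetical sizes
W_q = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
W_k = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
W_v = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model))
n_weight_matrices = len(W_q) + len(W_k) + len(W_v) + 1  # 3n + 1

enc = rng.normal(size=(5, d_model))  # encoder output, 5 input positions
dec = rng.normal(size=(2, d_model))  # decoder state, 2 output positions

self_out = multi_head_attention(enc, enc, enc, W_q, W_k, W_v, W_o)
cross_out = multi_head_attention(dec, enc, enc, W_q, W_k, W_v, W_o)
print(self_out.shape, cross_out.shape)  # (5, 4) (2, 4)
```

Note that the output always has one row per query position, which is why the cross-attention result has the decoder's length even though the keys and values come from the encoder.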
On the decoder side, things are pretty similar. Again we have our position-encoded embedding. We send it through what is now a masked multi-head attention, with a residual and normalization. Then there is another multi-head attention; however, now the keys and the values are coming from the output of the last encoder, whereas the query is coming from the output of the masked attention below it. Then the skip residual and normalization, then the feed forward, and residual and normalization again. We do that for as many decoder blocks as we have, and then we send that through a linear layer and a softmax to give us a probability.
So that at least lets you see the idea of the architecture of a Transformer, and a little bit about how data forward propagates through it. As usual, the most difficult thing is the learning process, the backpropagation process. We'll look at that in a little bit. [inaudible]
So, just looking at how this works: say we have some source input that we want to translate, "you are welcome". It gets embedded with its positional encoding and pushed through our encoder. Initially our output is just the start token. That would get embedded and go to our decoder: masked attention, then attention against the encoder output, through as many decoders as we need, and then it goes through the linear transformation and the softmax, and maybe the most probable next word after the start of the sequence would be "de". That "de" gets put into the output sequence, and now we repeat. The encoder output I already have; nothing has changed on that side of things. I send the output so far through my decoder to get its representation, and now the most likely next word would be "nada". That gets put into my output. Now all of that gets sent through the decoder again, and it spits out, as most probable, the end token, which gets put in the output sequence and marks the ending. So that is forward propagating through this architecture.
What about training? [inaudible] As usual, I am going to need to establish the gradients. The linear layer is just a fully connected layer, so we know how to do backpropagation through that and update its weights. What we need is backpropagation through the encoders and decoders, and also how to update their weights. At the output of the decoder I should be able to start backpropagating, since I know what my target word was at that time.
I can compare it with my output probability vector at that time, and then I can just backpropagate through the softmax and the linear layer. Then it is about backpropagating through layer norm. [inaudible] The feed forward is just two fully connected layers with a ReLU in between them. Those are things we have done. You'd also have to go backwards again through layer normalization, which we haven't talked about. [inaudible] Our gradients will then pass back through the multi-head attention, which we also haven't talked about. [inaudible] So obviously the main thing that we haven't talked about is backpropagating through the multi-head attention.
So what's happening? Each head works this way: the output of the i-th head is computed from this formula. Ultimately, what we need is for each head to know the gradient of our objective with respect to its weights. Since we are also going to backpropagate further, we are also going to need to know the gradient with respect to the query, key, and value.
First, for the entire multi-head unit, there is W^O. Imagine that we put the outputs of all of our attention units together into the concatenation. To get the gradient of the objective with respect to that concatenation, I take W^O transposed and apply it to the gradient that was coming backwards into my attention. To figure out the gradient for W^O itself, I take the concatenation, which is my heads side by side, and then just take the transpose of that and apply it to the incoming gradient. It is very similar to what we did with the fully connected layer.
Then, to propagate into a particular head, we just need to split off the gradient for that head. The gradient with respect to the concatenation is itself a concatenation of a bunch of dJ/dh_i, and we just split off what we need for that particular head.
Now, for that particular head, I can take my dJ/dh_i and use it to give me my dJ/dQ, dJ/dK, and dJ/dV, the gradients for the query, key, and value as they pertain to this head. Ultimately that is done using this equation, which is kind of more or less what we have done before with matrix products and the softmax activation function. [inaudible]
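The step through the output matrix and the split into per-head gradients can be sketched as below. This is a sketch under assumptions, not the lecture's code: the names are made up, rows are taken to be sequence positions, and the multi-head output is taken to be the concatenation of the heads times W_o with no bias.

```python
import numpy as np


def output_projection_backward(heads, W_o, grad_out):
    """Backpropagate through out = concat(heads) @ W_o.

    heads    : list of per-head outputs, each of shape (n, d_head_i)
    grad_out : gradient of the objective J with respect to out
    Returns the gradient for W_o and a list with one gradient per head.
    """
    concat = np.concatenate(heads, axis=-1)
    grad_W_o = concat.T @ grad_out   # same rule as a fully connected layer
    grad_concat = grad_out @ W_o.T   # gradient flowing back into the heads
    # The concatenated gradient is itself a concatenation of per-head pieces,
    # so split it at the same column boundaries.
    split_points = np.cumsum([h.shape[-1] for h in heads])[:-1]
    grad_heads = np.split(grad_concat, split_points, axis=-1)
    return grad_W_o, grad_heads


rng = np.random.default_rng(0)
heads = [rng.normal(size=(3, 2)) for _ in range(2)]  # two heads, width 2
W_o = rng.normal(size=(4, 5))
grad_out = rng.normal(size=(3, 5))

grad_W_o, grad_heads = output_projection_backward(heads, W_o, grad_out)
print(grad_W_o.shape, [g.shape for g in grad_heads])  # (4, 5) [(3, 2), (3, 2)]
```

A quick way to gain confidence in a hand-written backward pass like this is to compare individual entries against a finite-difference estimate of the objective.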
Obviously, in doing this, we can go ahead and use these gradients to update the weights of that particular head. So these are used to update the weights, but these right here are ultimately going to be used to backpropagate further.
What we backpropagate kind of depends on whether we are doing self-attention or cross-attention. Remember, in self-attention the key, value, and query are all the same thing. So when I have self-attention and I am trying to backpropagate out of my multi-head attention, I am adding together dJ/dQ, dJ/dK, and dJ/dV as they were obtained from each of the heads. In the case of cross-attention, this right here would be basically my dJ/dQ, and this right here would be my dJ/dK and dJ/dV. So the information flowing backwards to the encoder is just going to be dJ/dK and dJ/dV added together over the heads, and what flows back into the decoder will just be the accumulation of dJ/dQ over the heads. That is pretty much the idea. It isn't too bad.
Now we really also have everything that we need to backpropagate through the encoder. I can take this gradient with respect to the encoder output that I have just obtained and pass it back: back through the norm, back through the feed forward, which is going to include the two fully connected layers and the ReLU activation function, back through the norm, and now back through the multi-head attention. And in general that would then go back to the previous encoder, if there was another, backwards layer by layer. So that is the general idea. [inaudible]
[inaudible] In 2017, a Transformer had around 100 million parameters. [inaudible] You can play with the architecture, but think about how long it takes to train: that many parameters takes a large amount of resources. [inaudible] There are a great many learnable parameters, and they are trained on trillions of words. [inaudible] Still, it is worthwhile to see in the context of what we've been talking about throughout the course. If you are interested in all that stuff, I would certainly encourage you to take natural language processing.
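The per-head gradients and the way they are combined for self-attention versus cross-attention can be sketched as follows. To keep it short, this sketch makes some assumptions that go beyond the lecture text: it covers a single head with no per-head weight matrices and no mask, it scales the scores by the square root of the key width as in the paper, and all names are made up for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_forward(Q, K, V):
    P = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # attention probabilities
    return P @ V, P


def attention_backward(Q, K, V, P, grad_H):
    """Given dJ/dH for H = P @ V, return dJ/dQ, dJ/dK, dJ/dV."""
    scale = np.sqrt(Q.shape[-1])
    grad_V = P.T @ grad_H
    grad_P = grad_H @ V.T
    # Row-wise softmax derivative applied to the incoming gradient.
    grad_S = P * (grad_P - np.sum(grad_P * P, axis=-1, keepdims=True))
    grad_Q = grad_S @ K / scale
    grad_K = grad_S.T @ Q / scale
    return grad_Q, grad_K, grad_V


def self_attention_input_grad(X, grad_H):
    """Query, key, and value are all X, so the three gradients are added."""
    _, P = attention_forward(X, X, X)
    grad_Q, grad_K, grad_V = attention_backward(X, X, X, P, grad_H)
    return grad_Q + grad_K + grad_V


def cross_attention_input_grads(dec, enc, grad_H):
    """Query from the decoder; keys and values from the encoder."""
    _, P = attention_forward(dec, enc, enc)
    grad_Q, grad_K, grad_V = attention_backward(dec, enc, enc, P, grad_H)
    return grad_Q, grad_K + grad_V  # (back to decoder, back to encoder)


rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))
G = rng.normal(size=(4, 3))      # pretend gradient arriving from above
grad_X = self_attention_input_grad(X, G)

dec, enc = rng.normal(size=(2, 3)), rng.normal(size=(5, 3))
grad_dec, grad_enc = cross_attention_input_grads(dec, enc, rng.normal(size=(2, 3)))
print(grad_X.shape, grad_dec.shape, grad_enc.shape)  # (4, 3) (2, 3) (5, 3)
```

With several heads, the same sums would additionally be accumulated over the heads, as described above.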