How to use the rope scaling with x_transformers ? #159
@cutoken oh hey do you mean the interpolation of the positions? do you have a pretrained model with RoPE you are finetuning? |
you are referring to lucidrains/rotary-embedding-torch@e7ce8e0 ? |
@cutoken try using this setting 8fa7b4c#diff-2e64ac8840195d7dc3e07a3aac70b50bbab1cdf80f3a7432be40105e6097fc0aR896 |
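For readers following along, here is a minimal sketch of what "interpolation of the positions" means for rotary embeddings. It is plain Python, not the x_transformers API: the function name `rotary_angles`, the `interpolation_factor` argument, and the context lengths are all hypothetical and chosen only for illustration. The idea is that positions are divided by a factor, so a longer sequence maps back into the position range seen during pretraining.

```python
import math

def rotary_angles(positions, dim, base=10000.0, interpolation_factor=1.0):
    """Return angles[p][i] = (position / factor) * inverse_frequency_i.

    Hypothetical helper for illustration only. A factor > 1 squeezes
    positions back into the range the model saw during pretraining.
    """
    assert dim % 2 == 0, "rotary dim must be even"
    assert interpolation_factor >= 1.0
    # One inverse frequency per pair of feature dimensions.
    inv_freq = [base ** (-2 * i / dim) for i in range(dim // 2)]
    return [[(p / interpolation_factor) * f for f in inv_freq] for p in positions]

# Assumed lengths: pretrained at 512 tokens, fine-tuning at 2048 tokens.
train_len, target_len = 512, 2048
factor = target_len / train_len  # 4.0

plain = rotary_angles(range(target_len), dim=64)
scaled = rotary_angles(range(target_len), dim=64, interpolation_factor=factor)

# Without interpolation the last position is far outside the trained range;
# with interpolation it stays below train_len.
print(plain[-1][0], scaled[-1][0])

# The rotation applied to a feature pair uses cos/sin of these angles.
print(math.cos(scaled[4][0]), math.sin(scaled[4][0]))
```

With this scaling, position 4 of the long sequence gets the same angles as position 1 of the original one, which is why the model still needs some fine-tuning to adapt to the denser spacing.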
Thank you @lucidrains for your quick help. To answer your other query I have some small models trained on TinyStories with RoPE which I wanted to use to recreate the interpolation paper results with. So far no luck :) |
It is actually quite weird. I'm able to recreate the issue of increasing loss on context increase but the interpolation solution doesn't really work for me. Have you had success recreating the paper on any smaller model ? |
@cutoken do you mean you tried it just now and it didn't work? did you follow their recipe of fine tuning on 1k longer context samples, with an |
@cutoken if i hear it doesn't work from a few people, i may just remove it in favor of xpos |
Will confirm once again in a fresh experiment @lucidrains |
share it with w&b! |
Can confirm it is actually worse than directly fine tuning without interpolation. I'm using a really small model, but I don't see why that should matter, as the interpolation factor follows the paper's guidelines. I can send you the weights and biases of the smaller pre-interpolation model if needed, so that you can also test it (please provide your mail id in that case) |
@cutoken that would be great! could you share the training script too? i think i'll go ahead and remove it from this repository until i hear more feedback (or see a paper that corroborates the technique) |
@cutoken do you want to double check your experiments? i see someone legit corroborating the results from Meta https://kaiokendev.github.io/context (seems to be concurrent work) |
@cutoken they also released a new long context model, LongChat, at 16k |
I hope it works as well :) The link below contains the training script I have used and also the checkpoint with rope enabled (both checkpoints are the same; I just made a backup in case you end up overwriting it). You can use it as the starting point to train with and without the newly added parameter to see the difference. You will need the TinyStories dataset from huggingface. I have already added the sentencepiece vocab file, so you would only need sentencepiece available - no need to tokenize the dataset again. Let me know if you face any issues running it. |
@cutoken thanks! i'll allot some time this Sunday to do some training |
Hello, it is really cool you added this feature! One thing I would mention is that in my case, the intuition is that the large model may overfit to the position embedding, such that it is easier to train on the interpolated position than using OOD positions. The counter point is that small models may not be overfit in the same way - I see Meta only trained on 7B parameters and up, so it's possible the effect decreases for smaller models. There was also no ablation performed for non-LLaMA RoPE models, so it is unknown how much it depends on other factors as well. Just a thought |
@kaiokendev, that sounds like a good explanation of why I'm not seeing the same results with a smaller model (50M params). |
@kaiokendev oh interesting; you can probably run a few experiments to back up your idea and share it on twitter. it would be an important caveat that should be noted in their paper! |
@kaiokendev there's actually a number of papers popping up here and there that try to reduce overfitting of the positional embeddings. the two i've seen are (1) randomly offset positions by some constant and (2) within a range of 0 to a length L where L > maximum number of tokens, use a random subset of positions from that range, ascending |
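The two position-randomization ideas described above can be sketched as follows. This is only an illustration under assumptions: the function names, the `max_offset` bound, and the example lengths are hypothetical and not taken from any of the papers or from x_transformers.

```python
import random

def offset_positions(seq_len, max_offset, rng):
    """Idea (1): shift every position by one random constant.

    `max_offset` is an assumed upper bound on the shift.
    """
    shift = rng.randint(0, max_offset)
    return [shift + i for i in range(seq_len)]

def random_subset_positions(seq_len, max_len, rng):
    """Idea (2): draw seq_len distinct positions from [0, max_len), ascending.

    max_len plays the role of L and must exceed the number of tokens.
    """
    assert max_len > seq_len
    return sorted(rng.sample(range(max_len), seq_len))

rng = random.Random(0)  # seeded only so the sketch is reproducible
shifted = offset_positions(seq_len=8, max_offset=100, rng=rng)
subset = random_subset_positions(seq_len=8, max_len=32, rng=rng)
print(shifted)
print(subset)
```

In both cases the sampled list would replace the usual `0, 1, ..., n - 1` positions fed to the positional embedding during training, so the model does not tie itself to one fixed set of absolute positions.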
@kaiokendev have you seen https://arxiv.org/abs/2307.03170 ? |
I did skim it, there are a lot of external memory approaches I saw, but I do not really play with the cases involving approximated attention like that one |
Hi,
Thank you for this wonderful library. I'm wondering if there is a way to use the recent rope paper's scaling workaround with x_transformers. I have seen your recent change to the rotary position encoding repo, but I wasn't able to identify where to make a similar modification in the x_transformers repo.