Aakriti Budhraja

About Aakriti Budhraja

Aakriti Budhraja is a Research Scholar at the Indian Institute of Technology Madras, with an h-index of 2 overall and 2 since 2020, working in NLP, deep learning, and machine learning.

Recent articles by Aakriti Budhraja reflect a range of research interests and contributions to the field:

On the prunability of attention heads in multilingual BERT

The heads hypothesis: A unifying statistical approach towards understanding multi-headed attention in BERT

On the weak link between importance and prunability of attention heads

On the importance of local information in transformer based models

Aakriti Budhraja Information

University: Indian Institute of Technology Madras
Position: Research Scholar
Citations (all): 23
Citations (since 2020): 23
Cited by: 1
h-index (all): 2
h-index (since 2020): 2
i10-index (all): 1
i10-index (since 2020): 1
Email:
University Profile Page: Indian Institute of Technology Madras

Aakriti Budhraja Skills & Research Interests

NLP

Deep Learning

Machine Learning

Top articles of Aakriti Budhraja

On the prunability of attention heads in multilingual BERT

Authors: Aakriti Budhraja, Madhura Pande, Pratyush Kumar, Mitesh M Khapra
Journal: arXiv preprint arXiv:2109.12683
Published Date: 2021/9/26

Large multilingual models, such as mBERT, have shown promise in crosslingual transfer. In this work, we employ pruning to quantify the robustness and interpret layer-wise importance of mBERT. On four GLUE tasks, the relative drops in accuracy due to pruning have almost identical results on mBERT and BERT suggesting that the reduced attention capacity of the multilingual models does not affect robustness to pruning. For the crosslingual task XNLI, we report higher drops in accuracy with pruning indicating lower robustness in crosslingual transfer. Also, the importance of the encoder layers sensitively depends on the language family and the pre-training corpus size. The top layers, which are relatively more influenced by fine-tuning, encode important information for languages similar to English (SVO) while the bottom layers, which are relatively less influenced by fine-tuning, are particularly important for agglutinative and low-resource languages.
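As a rough illustration of the kind of head pruning this abstract describes (a minimal sketch, not the authors' code), the snippet below removes a chosen set of attention heads from mBERT using the Hugging Face Transformers prune_heads API; the layer and head indices are illustrative placeholders, not the configurations studied in the paper.

# Minimal sketch (not the authors' code): pruning attention heads from mBERT
# with the Hugging Face Transformers API. Layer/head choices are illustrative.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-multilingual-cased")

# Map of layer index -> list of head indices to remove in that layer.
# Here we prune six heads in each of the bottom two layers as an example.
heads_to_prune = {0: [0, 1, 2, 3, 4, 5], 1: [0, 1, 2, 3, 4, 5]}
model.prune_heads(heads_to_prune)

# The pruned model has fewer parameters in the affected attention layers;
# robustness to pruning would then be quantified by fine-tuning and measuring
# the drop in accuracy on a downstream task (e.g. a GLUE or XNLI task).
print(model.config.pruned_heads)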

The heads hypothesis: A unifying statistical approach towards understanding multi-headed attention in BERT

Authors: Madhura Pande, Aakriti Budhraja, Preksha Nema, Pratyush Kumar, Mitesh M Khapra
Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Published Date: 2021/5/18

Multi-headed attention heads are a mainstay in transformer-based models. Different methods have been proposed to classify the role of each attention head based on the relations between tokens which have high pair-wise attention. These roles include syntactic (tokens with some syntactic relation), local (nearby tokens), block (tokens in the same sentence) and delimiter (the special [CLS], [SEP] tokens). There are two main challenges with existing methods for classification: (a) there are no standard scores across studies or across functional roles, and (b) these scores are often average quantities measured across sentences without capturing statistical significance. In this work, we formalize a simple yet effective score that generalizes to all the roles of attention heads and employs hypothesis testing on this score for robust inference. This provides us the right lens to systematically analyze attention heads and confidently comment on many commonly posed questions on analyzing the BERT model. In particular, we comment on the co-location of multiple functional roles in the same attention head, the distribution of attention heads across layers, and the effect of fine-tuning for specific NLP tasks on these functional roles. The code is made publicly available at https://github.com/iitmnlp/heads-hypothesis
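To illustrate the general idea of scoring a head for a functional role and applying a hypothesis test (an assumption-laden sketch, not the paper's exact formulation), the snippet below scores a head for the "local" role as the fraction of attention mass within a one-token window and runs a one-sample t-test against a chance baseline; local_score, head_is_local, and the 0.5 baseline are hypothetical choices introduced here for illustration.

# Minimal sketch (assumptions, not the paper's exact score or test): measure a
# head's "local" score as the fraction of attention mass within a +/- 1 token
# window, then test across sentences whether the mean exceeds a chance baseline.
import numpy as np
from scipy import stats

def local_score(attn):
    """attn: (seq_len, seq_len) attention matrix for one head on one sentence."""
    n = attn.shape[0]
    window = np.abs(np.subtract.outer(np.arange(n), np.arange(n))) <= 1
    return float((attn * window).sum() / attn.sum())

def head_is_local(attn_per_sentence, baseline=0.5, alpha=0.05):
    """Test H0: mean local score == baseline against H1: mean > baseline."""
    scores = [local_score(a) for a in attn_per_sentence]
    t_stat, p_value = stats.ttest_1samp(scores, popmean=baseline, alternative="greater")
    return p_value < alpha, float(np.mean(scores)), float(p_value)

# Illustrative run on random attention matrices (toy data only).
rng = np.random.default_rng(0)
attns = [rng.dirichlet(np.ones(12), size=12) for _ in range(50)]
print(head_is_local(attns))

Repeating such a test for every (head, role) pair would give the kind of statistically grounded labelling of attention heads that the abstract argues for.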

On the weak link between importance and prunability of attention heads

Authors: Aakriti Budhraja, Madhura Pande, Preksha Nema, Pratyush Kumar, Mitesh M Khapra
Published Date: 2020/11

Given the success of Transformer-based models, two directions of study have emerged: interpreting the role of individual attention heads and down-sizing the models for efficiency. Our work straddles these two streams: we analyse the importance of basing pruning strategies on the interpreted role of the attention heads. We evaluate this on Transformer and BERT models on multiple NLP tasks. Firstly, we find that a large fraction of the attention heads can be randomly pruned with limited effect on accuracy. Secondly, for Transformers, we find no advantage in pruning attention heads identified to be important based on existing studies that relate importance to the location of a head. On the BERT model too, we find no preference for top or bottom layers, though the latter are reported to have higher importance. However, strategies that avoid pruning middle layers and consecutive layers perform better. Finally, during fine-tuning, the compensation for pruned attention heads is roughly equally distributed across the un-pruned heads. Our results thus suggest that interpretation of attention heads does not strongly inform pruning.
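The random-pruning baseline mentioned in the abstract can be mocked up as follows; this is an illustrative sketch rather than the paper's experimental code, and the 50% pruning fraction and the bert-base-uncased checkpoint are arbitrary choices made here for the example.

# Illustrative sketch (not the paper's setup): randomly prune a fraction of
# BERT's attention heads; the pruned model would then be fine-tuned and
# evaluated to measure the effect on accuracy.
import random
from collections import defaultdict
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
n_layers = model.config.num_hidden_layers   # 12 for bert-base
n_heads = model.config.num_attention_heads  # 12 per layer

all_heads = [(layer, head) for layer in range(n_layers) for head in range(n_heads)]
random.seed(0)
pruned = random.sample(all_heads, k=len(all_heads) // 2)  # randomly drop 50% of heads

heads_to_prune = defaultdict(list)
for layer, head in pruned:
    heads_to_prune[layer].append(head)
model.prune_heads(dict(heads_to_prune))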

On the importance of local information in transformer based models

Authors: Madhura Pande, Aakriti Budhraja, Preksha Nema, Pratyush Kumar, Mitesh M Khapra
Journal: arXiv preprint arXiv:2008.05828
Published Date: 2020/8/13

The self-attention module is a key component of Transformer-based models, wherein each token pays attention to every other token. Recent studies have shown that these heads exhibit syntactic, semantic, or local behaviour. Some studies have also identified promise in restricting this attention to be local, i.e., a token attending to other tokens only in a small neighbourhood around it. However, no conclusive evidence exists that such local attention alone is sufficient to achieve high accuracy on multiple NLP tasks. In this work, we systematically analyse the role of locality information in learnt models and contrast it with the role of syntactic information. More specifically, we first do a sensitivity analysis and show that, at every layer, the representation of a token is much more sensitive to tokens in a small neighborhood around it than to tokens which are syntactically related to it. We then define an attention bias metric to determine whether a head pays more attention to local tokens or to syntactically related tokens. We show that a larger fraction of heads have a locality bias as compared to a syntactic bias. Having established the importance of local attention heads, we train and evaluate models where varying fractions of the attention heads are constrained to be local. Such models would be more efficient as they would have fewer computations in the attention layer. We evaluate these models on 4 GLUE datasets (QQP, SST-2, MRPC, QNLI) and 2 MT datasets (En-De, En-Ru) and clearly demonstrate that such constrained models have comparable performance to the unconstrained models. Through this systematic evaluation we establish that attention in …
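As an illustration of what an attention bias metric contrasting local and syntactic attention might look like (a hedged sketch under assumptions, not the paper's exact definition), the snippet below compares the attention mass a head places on neighbouring tokens against the mass it places on syntactically related token pairs; the attention_bias function, the window size, and the toy dependency pairs are all hypothetical.

# Sketch (assumptions, not the paper's exact metric): per-head bias score that
# contrasts attention mass on local neighbours with mass on syntactically
# related tokens; positive values suggest a locality bias.
import numpy as np

def attention_bias(attn, syntactic_pairs, window=1):
    """attn: (seq_len, seq_len) attention matrix for one head.
    syntactic_pairs: iterable of (i, j) token index pairs with a dependency relation."""
    n = attn.shape[0]
    idx = np.arange(n)
    local_mask = np.abs(np.subtract.outer(idx, idx)) <= window
    syn_mask = np.zeros((n, n), dtype=bool)
    for i, j in syntactic_pairs:
        syn_mask[i, j] = syn_mask[j, i] = True
    local_mass = float((attn * local_mask).sum() / attn.sum())
    syn_mass = float((attn * syn_mask).sum() / attn.sum())
    return local_mass - syn_mass  # > 0 suggests a locality bias, < 0 a syntactic bias

# Illustrative example with a random attention matrix and a toy parse.
rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(8), size=8)
print(attention_bias(attn, {(0, 3), (3, 5), (5, 7)}))

Counting the heads with a positive score under such a metric would give the kind of locality-biased fraction the abstract refers to.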


Aakriti Budhraja FAQs

What is Aakriti Budhraja's h-index at Indian Institute of Technology Madras?

Aakriti Budhraja's h-index is 2 overall and 2 since 2020.

What are Aakriti Budhraja's top articles?

Aakriti Budhraja's top articles at the Indian Institute of Technology Madras are:

On the prunability of attention heads in multilingual BERT

The heads hypothesis: A unifying statistical approach towards understanding multi-headed attention in BERT

On the weak link between importance and prunability of attention heads

On the importance of local information in transformer based models

What are Aakriti Budhraja's research interests?

Aakriti Budhraja's research interests are NLP, Deep Learning, and Machine Learning.

What is Aakriti Budhraja's total number of citations?

Aakriti Budhraja has 23 citations in total.
