Abdelrahman Shaker


Ain Shams University

H-index: 7

Egypt

About Abdelrahman Shaker

Abdelrahman Shaker is a researcher at Ain Shams University with an h-index of 7 overall and 7 since 2020. He specializes in computer vision, deep learning, and signal processing.

His recent articles reflect a diverse array of research interests and contributions to the field:

Efficient Video Object Segmentation via Modulated Cross-Attention Memory

PALO: A Polyglot Large Multimodal Model for 5B People

Learnable weight initialization for volumetric medical image segmentation

Xraygpt: Chest radiographs summarization using medical vision-language models

Arabic Mini-ClimateGPT: A Climate Change and Sustainability Tailored Arabic LLM

Glamm: Pixel grounding large multimodal model

SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications

UNETR++: delving into efficient and accurate 3D medical image segmentation

Abdelrahman Shaker Information

University

Ain Shams University

Position

Teaching Assistant

Citations (all)

325

Citations (since 2020)

324

Cited by

22

h-index (all)

7

h-index (since 2020)

7

i10-index (all)

5

i10-index (since 2020)

5


University Profile Page

Ain Shams University

Abdelrahman Shaker Skills & Research Interests

Computer Vision - Deep Learning - Signal Processing

Top articles of Abdelrahman Shaker

Efficient Video Object Segmentation via Modulated Cross-Attention Memory

Authors

Abdelrahman Shaker, Syed Talal Wasim, Martin Danelljan, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan

Journal

arXiv preprint arXiv:2403.17937

Published Date

2024/3/26

Recently, transformer-based approaches have shown promising results for semi-supervised video object segmentation. However, these approaches typically struggle on long videos due to increased GPU memory demands, as they frequently expand the memory bank every few frames. We propose a transformer-based approach, named MAVOS, that introduces an optimized and dynamic long-term modulated cross-attention (MCA) memory to model temporal smoothness without requiring frequent memory expansion. The proposed MCA effectively encodes both local and global features at various levels of granularity while efficiently maintaining consistent speed regardless of the video length. Extensive experiments on multiple benchmarks, LVOS, Long-Time Video, and DAVIS 2017, demonstrate the effectiveness of our proposed contributions leading to real-time inference and markedly reduced memory demands without any degradation in segmentation accuracy on long videos. Compared to the best existing transformer-based approach, our MAVOS increases the speed by 7.6x, while significantly reducing the GPU memory by 87% with comparable segmentation performance on short and long video datasets. Notably on the LVOS dataset, our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU. Our code and models will be publicly available at: https://github.com/Amshaker/MAVOS.
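
The central efficiency claim, a memory that does not grow with video length, can be illustrated with a toy sketch. The block below keeps a fixed number of memory slots and lets the current frame's tokens cross-attend to them; the ring-buffer EMA write rule and all sizes are assumptions for illustration, not the paper's actual MCA mechanism.

```python
import torch
import torch.nn as nn

class FixedSizeMemory(nn.Module):
    """A memory bank with a constant number of slots: cross-attention cost
    stays flat no matter how many frames have been seen. The EMA ring-buffer
    write below is an assumed placeholder, not the paper's MCA update."""

    def __init__(self, dim: int = 256, slots: int = 64, heads: int = 4):
        super().__init__()
        self.register_buffer("bank", torch.zeros(slots, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.write_ptr = 0

    @torch.no_grad()
    def write(self, frame_tokens: torch.Tensor, momentum: float = 0.9) -> None:
        # Blend a summary of the new frame into one slot, overwriting oldest
        # first, so the memory never expands for long videos.
        summary = frame_tokens.mean(dim=(0, 1))                  # (dim,)
        slot = self.write_ptr % self.bank.size(0)
        self.bank[slot] = momentum * self.bank[slot] + (1 - momentum) * summary
        self.write_ptr += 1

    def read(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (B, N, dim); queries attend to the fixed-size bank.
        mem = self.bank.unsqueeze(0).expand(frame_tokens.size(0), -1, -1)
        out, _ = self.attn(frame_tokens, mem, mem)
        return out

mem = FixedSizeMemory()
frame = torch.randn(1, 196, 256)      # tokens of one video frame
fused = mem.read(frame)               # (1, 196, 256)
mem.write(frame)                      # cost and memory stay constant per frame
```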

PALO: A Polyglot Large Multimodal Model for 5B People

Authors

Muhammad Maaz, Hanoona Rasheed, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Tim Baldwin, Michael Felsberg, Fahad S Khan

Journal

arXiv preprint arXiv:2402.14818

Published Date

2024/2/22

In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called PALO. PALO offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese, which together span 5B people (65% of the world population). Our approach involves a semi-automated translation pipeline that adapts the multimodal instruction dataset from English to the target languages using a fine-tuned Large Language Model, thereby ensuring high linguistic fidelity while allowing scalability due to minimal manual effort. The incorporation of diverse instruction sets helps us boost overall performance across multiple languages, especially those that are underrepresented, like Hindi, Arabic, Bengali, and Urdu. The resulting models are trained across three scales (1.7B, 7B, and 13B parameters) to show the generalization and scalability, where we observe substantial improvements compared to strong baselines. We also propose the first multilingual multimodal benchmark for forthcoming approaches to evaluate their vision-language reasoning capabilities across languages. Code: https://github.com/mbzuai-oryx/PALO.

Learnable weight initialization for volumetric medical image segmentation

Authors

Shahina Kunhimon, Abdelrahman Shaker, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan

Journal

Artificial Intelligence in Medicine

Published Date

2024/4/3

Hybrid volumetric medical image segmentation models, combining the advantages of local convolution and global attention, have recently received considerable attention. While mainly focusing on architectural modifications, most existing hybrid approaches still use conventional data-independent weight initialization schemes, which restrict their performance by ignoring the inherent volumetric nature of medical data. To address this issue, we propose a learnable weight initialization approach that utilizes the available medical training data to effectively learn contextual and structural cues via the proposed self-supervised objectives. Our approach is easy to integrate into any hybrid model and requires no external training data. Experiments on multi-organ and lung cancer segmentation tasks demonstrate the effectiveness of our approach, leading to state-of-the-art segmentation performance. Our proposed …
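
The two-stage recipe, learn an initialization from the training volumes themselves and then fine-tune with labels, can be sketched as follows. The masked-voxel reconstruction pretext, the tiny network, and all hyperparameters are placeholders; the paper defines its own self-supervised volumetric objectives.

```python
import torch
import torch.nn as nn

# Tiny 3D encoder-decoder standing in for a hybrid segmentation backbone.
class Tiny3DNet(nn.Module):
    def __init__(self, out_ch: int):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv3d(16, 32, 3, padding=1), nn.ReLU())
        self.head = nn.Conv3d(32, out_ch, 1)

    def forward(self, x):
        return self.head(self.enc(x))

# Stage 1: self-supervised "initialization learning": reconstruct randomly
# masked voxels from the same training volumes (an assumed pretext task).
net = Tiny3DNet(out_ch=1)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(10):                      # a few warm-up steps
    vol = torch.randn(2, 1, 32, 32, 32)  # dummy CT patches
    mask = (torch.rand_like(vol) > 0.3).float()
    recon = net(vol * mask)
    loss = ((recon - vol) ** 2 * (1 - mask)).mean()  # loss on masked voxels
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: reuse the learned encoder weights to initialize the supervised
# segmentation model, then fine-tune on labels as usual.
seg = Tiny3DNet(out_ch=4)                # e.g. 4 organ classes
seg.enc.load_state_dict(net.enc.state_dict())
```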

Xraygpt: Chest radiographs summarization using medical vision-language models

Authors

Omkar Thawkar, Abdelrahman Shaker, Sahal Shaji Mullappilly, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Jorma Laaksonen, Fahad Shahbaz Khan

Journal

arXiv preprint arXiv:2306.07971

Published Date

2023/6/13

The latest breakthroughs in large vision-language models, such as Bard and GPT-4, have showcased extraordinary abilities in performing a wide range of tasks. Such models are trained on massive datasets comprising billions of public image-text pairs with diverse tasks. However, their performance on task-specific domains, such as radiology, is still under-investigated and potentially limited due to a lack of sophistication in understanding biomedical images. On the other hand, conversational medical models have exhibited remarkable success but have mainly focused on text-based analysis. In this paper, we introduce XrayGPT, a novel conversational medical vision-language model that can analyze and answer open-ended questions about chest radiographs. Specifically, we align a medical visual encoder (MedCLIP) with a fine-tuned large language model (Vicuna), using a simple linear transformation. This alignment enables our model to possess exceptional visual conversation abilities, grounded in a deep understanding of radiographs and medical domain knowledge. To enhance the performance of LLMs in the medical context, we generate ~217k interactive and high-quality summaries from free-text radiology reports. These summaries serve to enhance the performance of LLMs through the fine-tuning process. Our approach opens up new avenues of research for advancing the automated analysis of chest radiographs. Our open-source demos, models, and instruction sets are available at: https://github.com/mbzuai-oryx/XrayGPT.
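
The alignment step described above is conceptually simple: a single trained linear layer bridges the frozen visual encoder and the frozen LLM. The sketch below shows this shape-level idea with made-up dimensions; it is not the actual XrayGPT code.

```python
import torch
import torch.nn as nn

# Minimal sketch of the alignment idea: one linear layer maps frozen
# visual-encoder features into the LLM's token-embedding space. The
# dimensions below are illustrative, not XrayGPT's actual configuration.
vis_dim, llm_dim = 512, 4096
proj = nn.Linear(vis_dim, llm_dim)           # the only trained bridge

image_feats = torch.randn(1, 49, vis_dim)    # frozen MedCLIP-style patch features
visual_tokens = proj(image_feats)            # (1, 49, llm_dim)

text_embeds = torch.randn(1, 20, llm_dim)    # embedded prompt tokens (frozen LLM)
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)
# llm_input is fed to the language model, which now "sees" the radiograph.
```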

Arabic Mini-ClimateGPT: A Climate Change and Sustainability Tailored Arabic LLM

Authors

Sahal Shaji Mullappilly, Abdelrahman Shaker, Omkar Thawakar, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan

Journal

arXiv preprint arXiv:2312.09366

Published Date

2023/12/14

Climate change is one of the most significant challenges we face together as a society. Creating awareness and educating policy makers about the wide-ranging impact of climate change is an essential step towards a sustainable future. Recently, Large Language Models (LLMs) like ChatGPT and Bard have shown impressive conversational abilities and excel in a wide variety of NLP tasks. While these models are closed-source, alternative open-source LLMs such as Stanford Alpaca and Vicuna have recently shown promising results. However, these open-source models are not specifically tailored for climate-related, domain-specific information and also struggle to generate meaningful responses in other languages, such as Arabic. To this end, we propose a lightweight Arabic Mini-ClimateGPT that is built on an open-source LLM and is specifically fine-tuned on a curated conversational-style Arabic instruction-tuning dataset, Clima500-Instruct, with over 500k instructions about climate change and sustainability. Further, our model also utilizes a vector-embedding-based retrieval mechanism during inference. We validate our proposed model through quantitative and qualitative evaluations on climate-related queries. Our model surpasses the baseline LLM in 88.3% of cases during ChatGPT-based evaluation. Furthermore, our human expert evaluation reveals an 81.6% preference for our model's responses over multiple popular open-source models. Our open-source demos, code-base and models are available here: https://github.com/mbzuai-oryx/ClimateGPT.
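
The inference-time retrieval mechanism mentioned above can be sketched as a cosine-similarity lookup over precomputed passage embeddings, with the best hit prepended to the prompt. The embed() function and the corpus here are placeholders, not the system's actual components.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(texts):
    # Hypothetical embedding function; a real system would call a
    # sentence-embedding model here.
    return rng.normal(size=(len(texts), 384)).astype(np.float32)

corpus = ["Impact of rising sea levels ...", "Renewable energy in MENA ..."]
corpus_vecs = embed(corpus)
corpus_vecs /= np.linalg.norm(corpus_vecs, axis=1, keepdims=True)

def retrieve(query: str, k: int = 1):
    # Embed the query, score every passage by cosine similarity,
    # and return the top-k passages.
    q = embed([query])[0]
    q /= np.linalg.norm(q)
    scores = corpus_vecs @ q
    top = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in top]

context = retrieve("How does climate change affect coastlines?")
prompt = f"Context: {context[0]}\nQuestion: ..."   # grounded prompt for the LLM
```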

Glamm: Pixel grounding large multimodal model

Authors

Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, Fahad S Khan

Journal

arXiv preprint arXiv:2311.03356

Published Date

2023/11/6

Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial efforts towards LMMs used holistic images and text prompts to generate ungrounded textual responses. Very recently, region-level LMMs have been used to generate visually grounded responses. However, they are limited to referring to only a single object category at a time, require users to specify the regions in inputs, or cannot offer dense pixel-wise object grounding. In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks. GLaMM not only grounds objects appearing in the conversations but is flexible enough to accept both textual and optional visual prompts (regions of interest) as input. This empowers users to interact with the model at various levels of granularity, both in textual and visual domains. Due to the lack of standard benchmarks for the novel setting of generating visually grounded detailed conversations, we introduce a comprehensive evaluation protocol with our curated grounded conversations. Our proposed Grounded Conversation Generation (GCG) task requires densely grounded concepts in natural scenes at a large scale. To this end, we propose a densely annotated Grounding-anything Dataset (GranD) using our proposed automated annotation pipeline that encompasses 7.5M unique concepts grounded in a total of 810M regions available with segmentation masks. Besides GCG, GLaMM also performs effectively on several downstream tasks e.g., referring expression segmentation, image and region …

SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications

Authors

Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan

Published Date

2023

Self-attention has become a de facto choice for capturing global context in various vision applications. However, its quadratic computational complexity with respect to image resolution limits its use in real-time applications, especially for deployment on resource-constrained mobile devices. Although hybrid approaches have been proposed to combine the advantages of convolutions and self-attention for a better speed-accuracy trade-off, the expensive matrix multiplication operations in self-attention remain a bottleneck. In this work, we introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications. Our design shows that the key-value interaction can be replaced with a linear layer without sacrificing any accuracy. Unlike previous state-of-the-art methods, our efficient formulation of self-attention enables its usage at all stages of the network. Using our proposed efficient additive attention, we build a series of models called "SwiftFormer" which achieves state-of-the-art performance in terms of both accuracy and mobile inference speed. Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on iPhone 14, which is more accurate and 2x faster compared to MobileViT-v2. Our code and models: https://tinyurl.com/5ft8v46w
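
The abstract's key idea, replacing quadratic query-key matrix multiplication with linear element-wise operations, can be sketched as follows. This follows the spirit of additive attention (token scores from a learned vector, a global query, element-wise query-key interaction) but simplifies the normalizations and residual paths of the actual SwiftFormer block.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Sketch of efficient additive attention: per-token scores come from a
    dot product with a learned vector, a global query is their weighted sum,
    and the query-key interaction is element-wise, so cost is linear in the
    number of tokens N (no N x N attention matrix)."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.w = nn.Parameter(torch.randn(dim))   # learned scoring vector
        self.scale = dim ** -0.5
        self.proj = nn.Linear(dim, dim)           # linear layer replacing K-V matmul

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k = self.to_q(x), self.to_k(x)                        # (B, N, d)
        scores = torch.softmax(q @ self.w * self.scale, dim=1)   # (B, N)
        global_q = (scores.unsqueeze(-1) * q).sum(1, keepdim=True)  # (B, 1, d)
        return self.proj(global_q * k) + q       # element-wise, linear in N

x = torch.randn(2, 196, 64)
y = AdditiveAttention(64)(x)                     # y.shape == (2, 196, 64)
```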

UNETR++: delving into efficient and accurate 3D medical image segmentation

Authors

Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan

Journal

arXiv preprint arXiv:2212.04497

Published Date

2022/12/8

Owing to the success of transformer models, recent works study their applicability in 3D medical segmentation tasks. Within the transformer models, the self-attention mechanism is one of the main building blocks that strives to capture long-range dependencies. However, the self-attention operation has quadratic complexity, which proves to be a computational bottleneck, especially in volumetric medical imaging, where the inputs are 3D with numerous slices. In this paper, we propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed. The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features using a pair of inter-dependent branches based on spatial and channel attention. Our spatial attention formulation is efficient, having linear complexity with respect to the input sequence length. To enable communication between the spatial and channel-focused branches, we share the weights of the query and key mapping functions, which provides a complementary benefit (paired attention) while also reducing the overall network parameters. Our extensive evaluations on five benchmarks, Synapse, BTCV, ACDC, BraTS, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy. On Synapse, our UNETR++ sets a new state-of-the-art with a Dice Score of 87.2%, while being significantly efficient with a reduction of over 71% in terms of both parameters and FLOPs, compared to the …
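
A minimal sketch of the paired-attention idea, two branches sharing query/key projections while keeping separate values, is given below. For clarity the spatial branch uses plain token attention; the paper instead uses a projected, linear-complexity spatial formulation, so treat this as a structural illustration only.

```python
import torch
import torch.nn as nn

class PairedAttention(nn.Module):
    """Sketch of the paired-attention structure: spatial and channel branches
    share the same query/key projections, while each keeps its own value
    projection. Shapes and normalizations are simplified."""

    def __init__(self, dim: int):
        super().__init__()
        self.qk = nn.Linear(dim, 2 * dim)      # shared between both branches
        self.v_spatial = nn.Linear(dim, dim)
        self.v_channel = nn.Linear(dim, dim)
        self.proj = nn.Linear(2 * dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k = self.qk(x).chunk(2, dim=-1)     # (B, N, d) each
        # Channel branch: a d x d attention map (cheap when d << N).
        ch = torch.softmax(q.transpose(1, 2) @ k * self.scale, dim=-1)  # (B, d, d)
        ch_out = self.v_channel(x) @ ch        # (B, N, d)
        # Spatial branch: plain token attention here for clarity only.
        sp = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, N, N)
        sp_out = sp @ self.v_spatial(x)        # (B, N, d)
        return self.proj(torch.cat([sp_out, ch_out], dim=-1))

x = torch.randn(1, 512, 96)                    # 512 voxel tokens, 96 channels
y = PairedAttention(96)(x)
```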

Edgenext: efficiently amalgamated cnn-transformer architecture for mobile vision applications

Authors

Muhammad Maaz, Abdelrahman Shaker, Hisham Cholakkal, Salman Khan, Syed Waqas Zamir, Rao Muhammad Anwer, Fahad Shahbaz Khan

Published Date

2022/10/23

In the pursuit of ever-increasing accuracy, large and complex neural networks are usually developed. Such models demand high computational resources and therefore cannot be deployed on edge devices. It is of great interest to build resource-efficient general-purpose networks due to their usefulness in several application areas. In this work, we strive to effectively combine the strengths of both CNN and Transformer models and propose a new efficient hybrid architecture, EdgeNeXt. Specifically, in EdgeNeXt, we introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups and utilizes depth-wise convolution along with self-attention across channel dimensions to implicitly increase the receptive field and encode multi-scale features. Our extensive experiments on classification, detection, and segmentation tasks reveal the merits of the proposed …
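
A rough sketch of the SDTA idea follows: depth-wise convolutions over channel splits, then self-attention applied across the channel dimension (so its cost is quadratic in channels rather than in spatial positions). This is a simplified reading, not the exact EdgeNeXt block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SDTASketch(nn.Module):
    """Simplified split depth-wise transpose attention: channel groups go
    through depth-wise convs (widening the receptive field cheaply), then
    attention is computed over a C x C channel map instead of an
    (HW) x (HW) spatial map."""

    def __init__(self, dim: int = 64, groups: int = 4):
        super().__init__()
        g = dim // groups
        self.dw = nn.ModuleList(
            nn.Conv2d(g, g, 3, padding=1, groups=g) for _ in range(groups))
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); depth-wise convs on channel splits.
        splits = x.chunk(len(self.dw), dim=1)
        x = torch.cat([conv(s) for conv, s in zip(self.dw, splits)], dim=1)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)             # (B, HW, C)
        q, k, v = self.qkv(t).chunk(3, dim=-1)
        q = F.normalize(q, dim=1)                    # normalize over tokens
        k = F.normalize(k, dim=1)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)  # (B, C, C)
        out = v @ attn                               # (B, HW, C)
        return self.proj(out).transpose(1, 2).reshape(b, c, h, w)

y = SDTASketch()(torch.randn(1, 64, 32, 32))
```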

INSTA-YOLO: Real-time instance segmentation

Authors

Eslam Mohamed, Abdelrahman Shaker, Ahmad El-Sallab, Mayada Hadhoud

Published Date

2021/2/12

Instance segmentation has recently gained huge attention in various computer vision applications. It aims at providing different IDs to different objects of the scene, even if they belong to the same class. Instance segmentation is usually performed as a two-stage pipeline: first, an object is detected, and then semantic segmentation is performed within the detected box area, which involves costly up-sampling. In this paper, we propose Insta-YOLO, a novel one-stage end-to-end deep learning model for real-time instance segmentation. Instead of pixel-wise prediction, our model predicts instances as object contours represented by 2D points in Cartesian space. We evaluate our model on three datasets, namely Carvana, Cityscapes, and Airbus. We compare our results to the state-of-the-art models for instance segmentation. The results show that our model achieves competitive accuracy in terms of mAP at twice the speed on a GTX-1080 GPU.
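
The contour-as-points output can be sketched as a detection-style head that regresses 2K polygon coordinates per grid cell instead of a dense mask, which is what removes the up-sampling stage. K, the head layout, and the loss below are illustrative guesses, not the paper's exact design.

```python
import torch
import torch.nn as nn

# Sketch of the contour-as-points output: each detected instance is a polygon
# of K vertices regressed directly, so no per-pixel mask or up-sampling is
# needed. K and the head layout are illustrative.
K, num_classes = 16, 3

head = nn.Conv2d(256, num_classes + 2 * K, kernel_size=1)  # per-cell prediction

feat = torch.randn(1, 256, 13, 13)       # backbone feature map
pred = head(feat)                        # (1, classes + 2K, 13, 13)
cls_logits = pred[:, :num_classes]       # class scores per grid cell
contours = pred[:, num_classes:]         # (x1, y1, ..., xK, yK) per cell

# Training would regress these 2K numbers toward the ground-truth polygon
# vertices (e.g., with an L1 loss), one polygon per positive cell.
target = torch.randn(1, 2 * K, 13, 13)
loss = nn.functional.l1_loss(contours, target)
```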

Deep convolutional neural networks for ECG heartbeat classification using two-stage hierarchical method

Authors

Abdelrahman M Shaker, Manal Tantawi, Howida A Shedeed, Mohamed F Tolba

Published Date

2021

The electrocardiogram (ECG) is widely used in computer-aided systems for arrhythmia detection because it provides essential information about heart function. Cardiologists use it to diagnose and detect abnormalities of the heart. Hence, automating the process of ECG heartbeat classification plays a vital role in clinical diagnosis. In this paper, a two-stage hierarchical method is proposed using deep Convolutional Neural Networks (CNNs) to determine the category of the heartbeats in the first stage and then classify the classes belonging to that category in the second stage. This work is based on 16 different classes from the public MIT-BIH arrhythmia dataset. However, the MIT-BIH dataset is imbalanced, which degrades the classification accuracy of deep learning models. This problem is solved by using an adaptive synthetic sampling technique to generate synthetic heartbeats to restore the …
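
The hierarchy at inference time can be sketched as a coarse classifier routing each beat to one of several category-specific classifiers. The 4x4 grouping of the 16 classes below is invented for illustration; the paper uses the MIT-BIH category structure.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Toy 1D-CNN classifier used for both hierarchy stages."""
    def __init__(self, n_out: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, 5, padding=2), nn.ReLU(), nn.AdaptiveAvgPool1d(1),
            nn.Flatten(), nn.Linear(16, n_out))

    def forward(self, x):
        return self.net(x)

stage1 = SmallCNN(n_out=4)                          # 4 coarse categories
stage2 = nn.ModuleList(SmallCNN(n_out=4) for _ in range(4))  # 4 classes each

beat = torch.randn(1, 1, 280)                       # one segmented heartbeat
category = stage1(beat).argmax(dim=1).item()        # stage 1: coarse decision
final_class = stage2[category](beat).argmax(dim=1).item()  # stage 2: refine
print(category, final_class)                        # hierarchical (coarse, fine) label
```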

Generalization of convolutional neural networks for ECG classification using generative adversarial networks

Authors

Abdelrahman M Shaker, Manal Tantawi, Howida A Shedeed, Mohamed F Tolba

Journal

IEEE Access

Published Date

2020/2/17

Electrocardiograms (ECGs) play a vital role in the clinical diagnosis of heart diseases. An ECG record of the heart signal over time can be used to discover numerous arrhythmias. Our work is based on 15 different classes from the MIT-BIH arrhythmia dataset. However, the MIT-BIH dataset is strongly imbalanced, which impairs the accuracy of deep learning models. We propose a novel data-augmentation technique using generative adversarial networks (GANs) to restore the balance of the dataset. Two deep learning approaches, an end-to-end approach and a two-stage hierarchical approach, both based on deep convolutional neural networks (CNNs), are used to eliminate hand-engineered features by combining feature extraction, feature reduction, and classification into a single learning method. Results show that augmenting the original imbalanced dataset with generated heartbeats by using the proposed techniques more …
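
The rebalancing idea can be sketched with a minimal GAN trained on beats of a minority class, whose samples are then added back to the training set. The MLP generator/discriminator and all sizes below are placeholders, not the paper's architectures.

```python
import torch
import torch.nn as nn

L, Z = 280, 64                                    # beat length, latent size
G = nn.Sequential(nn.Linear(Z, 128), nn.ReLU(), nn.Linear(128, L), nn.Tanh())
D = nn.Sequential(nn.Linear(L, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, L)                         # stand-in minority-class beats
for _ in range(5):                                # a few illustrative steps
    # Discriminator step: real beats vs. generated beats.
    fake = G(torch.randn(32, Z)).detach()
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step: try to fool the discriminator.
    fake = G(torch.randn(32, Z))
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

synthetic = G(torch.randn(100, Z))   # synthetic beats added to rebalance a class
```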

Combination of convolutional and recurrent neural networks for heartbeat classification

Authors

Abdelrahman M Shaker, Manal Tantawi, Howida A Shedeed, Mohamed F Tolba

Published Date

2020

The electrocardiogram (ECG) plays an essential role in the medical field: it records the electrical activity of the heart over time and provides information about the heart's condition. Hence, cardiologists use it to detect abnormalities of the heart and to diagnose heart diseases. Convolutional Neural Networks (CNNs) have proven their ability to extract the most important features, while Long Short-Term Memory (LSTM) networks are capable of learning temporal dependencies in sequential data. In this paper, a novel method based on the combination of CNN and LSTM is proposed to classify 15 classes of the MIT-BIH dataset automatically, without any hand-engineered feature extraction. The proposed method consists of data filtering, a dynamic technique for heartbeat segmentation, and a CNN-LSTM model consisting of 12 layers. Our experimental results of the proposed method …
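
The described combination can be sketched as convolutions extracting local beat morphology followed by an LSTM summarizing the resulting sequence. Layer counts and sizes below are illustrative and do not reproduce the paper's 12-layer model.

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Sketch of the CNN + LSTM combination: convolutions extract local
    morphology from the heartbeat, the LSTM models temporal dependencies
    over the resulting feature sequence."""

    def __init__(self, n_classes: int = 15):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 32, 5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, 5, padding=2), nn.ReLU(), nn.MaxPool1d(2))
        self.lstm = nn.LSTM(input_size=64, hidden_size=64, batch_first=True)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):                 # x: (B, 1, T) raw heartbeat
        f = self.cnn(x)                   # (B, 64, T/4) local features
        f = f.transpose(1, 2)             # (B, T/4, 64) as a sequence
        _, (h, _) = self.lstm(f)          # last hidden state summarizes the beat
        return self.fc(h[-1])             # (B, n_classes)

logits = CNNLSTM()(torch.randn(8, 1, 280))
```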

Heartbeat classification using 1D convolutional neural networks

Authors

Abdelrahman M Shaker, Manal Tantawi, Howida A Shedeed, Mohamed F Tolba

Published Date

2020

The electrocardiogram (ECG) is an essential source of information for heart disease classification; hence, it is used by cardiologists to diagnose heart attacks and detect abnormalities of the heart. The automatic classification of ECG signals plays a vital role in the clinical diagnosis of heart diseases. In this paper, an end-to-end classification method is proposed using 1D Convolutional Neural Networks (CNNs) to extract the important features from the input signals and classify them automatically. The main advantage of a CNN compared to related methods is that it eliminates hand-crafted features by combining feature extraction and classification into a single learning method without any human supervision. The proposed solution consists of data filtering, dynamic heartbeat segmentation, and a 1D-CNN consisting of 10 layers, excluding the input and output layers. Our …
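
A minimal end-to-end 1D-CNN in the spirit described, raw segmented heartbeats in, class scores out, with no hand-crafted features, might look like this. The depth, widths, and class count are illustrative, not the paper's 10-layer configuration.

```python
import torch
import torch.nn as nn

# Minimal end-to-end 1D-CNN classifier: filtered, segmented heartbeats in,
# class probabilities out, with no hand-engineered features.
model = nn.Sequential(
    nn.Conv1d(1, 16, 7, padding=3), nn.ReLU(), nn.MaxPool1d(2),
    nn.Conv1d(16, 32, 5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
    nn.Conv1d(32, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(64, 16),                      # 16 classes, an assumed count
)

beats = torch.randn(8, 1, 280)              # batch of segmented heartbeats
probs = model(beats).softmax(dim=1)         # per-class probabilities
```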


Abdelrahman Shaker FAQs

What is Abdelrahman Shaker's h-index at Ain Shams University?

Abdelrahman Shaker's h-index is 7 in total and 7 since 2020.

What are Abdelrahman Shaker's top articles?

The top articles of Abdelrahman Shaker at Ain Shams University are:

Efficient Video Object Segmentation via Modulated Cross-Attention Memory

PALO: A Polyglot Large Multimodal Model for 5B People

Learnable weight initialization for volumetric medical image segmentation

Xraygpt: Chest radiographs summarization using medical vision-language models

Arabic Mini-ClimateGPT: A Climate Change and Sustainability Tailored Arabic LLM

Glamm: Pixel grounding large multimodal model

SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications

UNETR++: delving into efficient and accurate 3D medical image segmentation

...


What are Abdelrahman Shaker's research interests?

The research interests of Abdelrahman Shaker are: Computer Vision - Deep Learning - Signal Processing

What is Abdelrahman Shaker's total number of citations?

Abdelrahman Shaker has 325 citations in total.

Who are the co-authors of Abdelrahman Shaker?

Abdelrahman Shaker's co-authors include Ming-Hsuan Yang and Salman Khan.

Co-Authors

Ming-Hsuan Yang (H-index: 157)
University of California, Merced

Salman Khan (H-index: 61)
Australian National University
