ModernBERT FlexAttention #35423

staghado · 2024-12-26T16:42:27Z

What does this PR do?

This PR adds FlexAttention support for ModernBERT:

Combines sliding window and document masking to implement the alternating local/global attention pattern in ModernBERT.
Mask creation is expensive, so the two masks are cached at the model level and reused across layers.
Similar to the FA2 path, it works directly on the unpadded sequences.
Reuses the existing ModernBertRotaryEmbedding to avoid requiring FA2.
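
For readers unfamiliar with the masking scheme, here is a minimal, framework-free sketch of the idea described above. It is not the code from this PR: all helper names (`document_ids_from_cu_seqlens`, `make_document_mask`, `make_sliding_window_mask`, `combine`, `mask_for_layer`) are hypothetical, the `(b, h, q_idx, kv_idx)` signature follows the mask function convention used by FlexAttention, and the window size and the "global every 3rd layer" schedule are assumptions chosen for illustration. With plain Python ints it can be run without torch.

```python
from typing import Callable, List, Sequence

MaskMod = Callable[[int, int, int, int], bool]


def document_ids_from_cu_seqlens(cu_seqlens: Sequence[int]) -> List[int]:
    """Map every token of the unpadded (concatenated) batch to its sequence index."""
    doc_ids: List[int] = []
    for doc, (start, end) in enumerate(zip(cu_seqlens[:-1], cu_seqlens[1:])):
        doc_ids.extend([doc] * (end - start))
    return doc_ids


def make_document_mask(doc_ids: Sequence[int]) -> MaskMod:
    """Tokens may only attend to tokens of the same sequence."""
    def mask_mod(b, h, q_idx, kv_idx):
        return doc_ids[q_idx] == doc_ids[kv_idx]
    return mask_mod


def make_sliding_window_mask(window_size: int) -> MaskMod:
    """Bidirectional window: attend to at most window_size // 2 tokens on each side."""
    half = window_size // 2

    def mask_mod(b, h, q_idx, kv_idx):
        return abs(q_idx - kv_idx) <= half
    return mask_mod


def combine(first: MaskMod, second: MaskMod) -> MaskMod:
    """Logical AND of two mask functions."""
    def mask_mod(b, h, q_idx, kv_idx):
        return first(b, h, q_idx, kv_idx) & second(b, h, q_idx, kv_idx)
    return mask_mod


# Two sequences of lengths 3 and 2 packed into one unpadded batch.
cu_seqlens = [0, 3, 5]
doc_ids = document_ids_from_cu_seqlens(cu_seqlens)

# Build both masks once (the "cache") and share them across all layers.
cached_masks = {
    "global": make_document_mask(doc_ids),
    # window_size=2 is an illustrative value, not ModernBERT's setting.
    "local": combine(make_document_mask(doc_ids), make_sliding_window_mask(2)),
}


def mask_for_layer(layer_idx: int, global_every_n: int = 3) -> MaskMod:
    """Assumed schedule: every global_every_n-th layer is global, the rest local."""
    key = "global" if layer_idx % global_every_n == 0 else "local"
    return cached_masks[key]
```

In an actual FlexAttention setup, `doc_ids` would presumably be a tensor on the same device as the inputs, and each mask function would be turned into a block mask once and passed to every layer of the matching type, which is what makes caching worthwhile.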

Note:
The current version requires a recent torch nightly (e.g., 2.6.0.dev20241112).
IIUC, transformers does not currently allow compiling the flex_attention function.

staghado and others added 5 commits December 26, 2024 00:03

FlexAttention support 1/n

cd78f8d

FlexAttention with unpadded RoPE using GemmaRotaryEmbeddings

88cd884

Merge branch 'huggingface:main' into flexattn-modernbert

56c3846

fix check

d8d468e

fix: simplify condition

47758db

staghado mentioned this pull request Dec 27, 2024

FlexAttention slower than eager in HF transformers pytorch-labs/attention-gym#95

Open
