Comment by zdevito
13 hours ago
I tried to do something similar with 'first-class' dimension objects in PyTorch https://github.com/pytorch/pytorch/blob/main/functorch/dim/R... . For instance, multi-head attention looks like:
    import torch
    from torchdim import dims, softmax

    def multiheadattention(q, k, v, num_attention_heads, dropout_prob, use_positional_embedding):
        batch, query_sequence, key_sequence, heads, features = dims(5)
        heads.size = num_attention_heads
        # bind dimensions, unflattening the heads from the feature dimension
        q = q[batch, query_sequence, [heads, features]]
        k = k[batch, key_sequence, [heads, features]]
        v = v[batch, key_sequence, [heads, features]]
        # einsum-style operators to calculate scores
        attention_scores = (q*k).sum(features) * (features.size ** -0.5)
        # use a first-class dim to specify the dimension for softmax
        attention_probs = softmax(attention_scores, dim=key_sequence)
        # dropout works pointwise, following Rule #1
        attention_probs = torch.nn.functional.dropout(attention_probs, p=dropout_prob)
        # another matrix product
        context_layer = (attention_probs*v).sum(key_sequence)
        # flatten heads back into features
        return context_layer.order(batch, query_sequence, [heads, features])
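A rough usage sketch of the above, assuming the q/k/v inputs are ordinary positional tensors with the heads already folded into the last dimension; the concrete sizes and keyword values below are made up for illustration:

    import torch

    # hypothetical sizes, purely for illustration
    B, S, H, F = 2, 16, 8, 64
    q = torch.randn(B, S, H * F)
    k = torch.randn(B, S, H * F)
    v = torch.randn(B, S, H * F)

    out = multiheadattention(q, k, v, num_attention_heads=H,
                             dropout_prob=0.1, use_positional_embedding=False)
    # out is a plain tensor again, shape [B, S, H * F], because
    # .order() flattens heads back into the feature dimension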
However, my impression from trying to build a wider user base is that while numpy-style APIs are maybe not as good as some better array language, they are not the bottleneck for getting things done in PyTorch. Other domains might suffer more, though, and I am very excited to see a better array language catch on.