Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization
In zero-shot text-to-image (T2I) customization, the visual and textual contextual embeddings conflict when the prompt varies the subject's pose. We resolve this conflict with orthogonalization and an attention swap.
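
To illustrate the orthogonalization idea, here is a minimal sketch in plain Python. It assumes that orthogonalizing means removing from one embedding its component along the other (a standard vector projection). The direction of the projection, the function names `dot` and `orthogonalize`, and the toy vectors are all hypothetical choices made for illustration; they are not taken from the paper.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(visual: Sequence[float], textual: Sequence[float]) -> List[float]:
    """Remove from `visual` its component along `textual`.

    Assumption: a plain vector projection, used here only as an
    illustration of what orthogonalizing two embeddings can mean.
    """
    denom = dot(textual, textual)
    if denom == 0.0:
        # A zero textual vector defines no direction; leave visual unchanged.
        return list(visual)
    scale = dot(visual, textual) / denom
    return [v - scale * t for v, t in zip(visual, textual)]


# Toy 3-dimensional embeddings (hypothetical values).
visual = [1.0, 2.0, 3.0]
textual = [0.0, 1.0, 0.0]
harmonized = orthogonalize(visual, textual)
```

After the projection, `harmonized` has no component along `textual`, so in this simplified picture the two embeddings no longer compete along the same direction.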