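OpenAI released two new neural networks: CLIP and DALL·E. They combine NLP (natural language processing) with image recognition to better understand the images and language of everyday life.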
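Previously, search was text-to-text or image-to-image. With the CLIP model, text can be used to search for images, and images can be used to search for text. The idea is to map images and text into the same vector space, which makes cross-modal similarity comparison and retrieval between images and text possible.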
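The shared-vector-space idea can be illustrated with a minimal Python sketch. It assumes that an image and several texts have already been encoded into vectors of the same dimension, and it ranks the texts by cosine similarity to the image. The 4-dimensional vectors and the helper names (`l2_normalize`, `cosine_similarity`, `rank_by_similarity`) are hypothetical and chosen only for illustration. Real CLIP embeddings are produced by the model, and this toy sketch is separate from the program whose log output this document shows.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_by_similarity(query, candidates):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings (assumption: not real CLIP output).
image_embedding = [0.9, 0.1, 0.0, 0.4]
text_embeddings = {
    "two dogs in the snow": [0.8, 0.2, 0.1, 0.5],
    "a cat on the table": [0.1, 0.9, 0.3, 0.0],
    "London at night": [0.0, 0.2, 0.9, 0.1],
}

ranking = rank_by_similarity(image_embedding, text_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

Because both modalities live in one space, the same function works in either direction: pass a text vector as the query and image vectors as candidates to search images by text.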
After a successful run, the command line should show the following output:
# Test texts:
[INFO ] - texts: [在雪地里有两条狗, 一只猫在桌子上, 夜晚的伦敦]
# Test image: (image not included)
# Vector dimension:
[INFO ] - Vector dimension: 512
# Generate the image embedding:
[INFO ] - image embeddings: [0.22221693, 0.16178696, ..., -0.06122274, 0.13340257]
# Chinese word segmentation & translation (top 5 candidates):
[INFO ] - Tokens : [在, 雪地, 里, 有, 两条, 狗]
[INFO ] - 在雪地里有两条狗:
[ There are two dogs in the snow, In the snow there are two dogs, There were two dogs in the snow, There are two dogs in the snow., There are two dogs in the snow@@ .@@ ()]
[INFO ] - Tokens : [一只, 猫, 在, 桌子, 上]
[INFO ] - 一只猫在桌子上:
[ A cat is on the table, A cat is on the desk, A cat is on the desk., A cat is on the table@@ .@@ 3, A cat is on the table@@ .@@ 7@@ 16]
[INFO ] - Tokens : [夜晚, 的, 伦敦]
[INFO ] - 夜晚的伦敦:
[ Night in London, London at night, Even@@ ing London, Late at night in London, Late in London]
# Generate text embeddings (from the first translation of each text) & compute similarity:
[INFO ] - text [在雪地里有两条狗] embeddings: [0.111746386, 0.08818339, ..., -0.15732905, -0.54234475]
[INFO ] - Similarity: 28.510675%
[INFO ] - text [一只猫在桌子上] embeddings: [0.08841644, 0.043696217, ..., -0.16612083, -0.11383227]
[INFO ] - Similarity: 12.206457%
[INFO ] - text [夜晚的伦敦] embeddings: [-0.038869947, 0.003223464, ..., -0.177596, 0.114676386]
[INFO ] - Similarity: 14.038936%
# Softmax confidence calculation:
[INFO ] - texts: [在雪地里有两条狗, 一只猫在桌子上, 夜晚的伦敦]
[INFO ] - probs: [0.9956493, 0.0019198752, 0.0024309014]
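
The softmax step turns a list of per-text scores into probabilities that sum to 1, so the best-matching text stands out. Below is a minimal Python sketch of a numerically stable softmax. The log above does not state which scores or scaling are fed into softmax, so the `logits` values here are hypothetical and are not expected to reproduce the logged `probs`.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    if not scores:
        raise ValueError("scores must not be empty")
    # Subtract the maximum before exponentiating to avoid overflow.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate texts (assumption: illustrative only).
logits = [6.0, 0.0, 0.2]
probs = softmax(logits)
print(probs)
```

A larger gap between scores produces a sharper distribution, which is why one candidate can take almost all of the probability mass even when the raw similarity values look moderately close.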