tests: PyTorch multimodal using keyword match (#4215)

* keyword accuracy check for pytorch multimodal

Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>

* Change keywords for some prompts

Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>

* Delete full text answers

Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>

* Cleanup debug code

Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>

---------

Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
Anurag Mukkara 2025-05-14 02:18:43 -07:00 committed by GitHub
parent 5e634dd1bd
commit b15f57763d

@@ -18,7 +18,6 @@ import re
import shutil
import sys
import tempfile
from difflib import SequenceMatcher
from pathlib import Path
from typing import Optional, Tuple, Union
@@ -1467,96 +1466,59 @@ def test_ptp_quickstart_multimodal(llm_root, llm_venv, model_name, model_path,
],
},
}
expected_answers = {
expected_keywords = {
"NVILA-8B-FP16": {
"image": [
[
"The image features a stormy ocean with large waves crashing, a gray sky with white clouds, and a dark gray horizon.",
"The image features a stormy ocean with large waves crashing, a dark gray sky with white clouds, and a grayish-blue water surface."
],
"The object is a large rock formation, and the weather condition is sunny with a blue sky and white clouds.",
[
"The road is busy with multiple cars, including a blue car, a silver SUV, and a black car, all driving in the same direction.",
"The road is busy with multiple cars, including a blue car, a white car, a black car, and a silver car, all driving in the same direction.",
"The road is busy with multiple cars, including a blue car, a white car, a black car, and a green double-decker bus."
],
["stormy", "ocean", "waves", "clouds", "gray", "sky"],
["rock", "formation", "sunny", "sky", "clouds"],
["road", "busy", "car", "black", "blue"],
],
"video": [
["woman", "street", "night", "walking", "camera"],
[
"The video depicts a woman walking down a city street at night. She is wearing a black leather jacket, a red dress, and black boots. The woman is carrying a black purse and has sunglasses on. The street is wet, and there are many people walking around. The woman is looking at the camera.",
"The video depicts a woman walking down a city street at night. She is wearing a black leather jacket, a red dress, and black boots. The woman is carrying a black purse and is wearing sunglasses. The street is wet, and there are many people walking around. The woman is walking towards the camera, and the"
],
[
"The video depicts a stunning view of Earth from space, showcasing the planet's curvature and the vastness of space. The Earth is illuminated by the sun, with the left side appearing darker and the right side brighter. The image captures the beauty of our home planet, highlighting its unique features and the contrast between day and night",
"The video depicts a stunning view of Earth from space, showcasing the planet's vibrant blue oceans and the intricate patterns of city lights illuminating the continents. The image captures the curvature of the Earth, with the dark side of the planet visible, and the bright side displaying the illuminated city lights. The contrast between the illuminated and"
"stunning", "earth", "space", "planet", "curvature", "dark",
"bright", "contrast", "illuminate"
],
],
},
"llava-v1.6-mistral-7b": {
"image": [
[
"The image depicts a dramatic ocean scene under a cloudy sky. The ocean is characterized by large, powerful waves that are breaking and crashing onto the shore. The waves are white and frothy, indicating that they are in the process of breaking. The water appears to be a deep blue-green color, suggesting",
"The image depicts a dramatic natural environment. The sky is overcast with dark, heavy clouds, suggesting a stormy or gloomy weather condition. The ocean is in motion, with large waves that are breaking and crashing onto the shore. The water appears choppy and turbulent, with white foam and spray visible",
"ocean", "cloud", "waves", "white", "shore", "large",
"dramatic", "breaking"
],
[
"The image shows a scenic landscape with a prominent rock formation, which appears to be a large, flat-topped mountain or butte. The rock formation is rugged and has a smooth, flat top, suggesting it could be a natural landmark or a geological feature. The sky is clear with a few",
"The image shows a majestic mountain with a flat top, which is characteristic of buttes. The mountain is prominently featured in the background, with a clear blue sky above it and a few scattered clouds. The weather appears to be and clear, with no visible signs of rain or storms.",
],
"The image shows a multi-lane highway with several vehicles in motion. There are cars and a bus visible, and the traffic appears to be moderate, with no significant congestion. The road is divided by a central divider, and there are green trees lining the sides of the highway, indicating a suburban",
["mountain", "butte", "flat", "top", "sky"],
["highway", "vehicles", "traffic", "divider", "suburban"],
],
},
"qwen2-vl-7b-instruct": {
"image": [
["ocean", "waves", "shore", "natural", "clouds", "turbulent"],
[
"The image depicts a vast ocean with waves crashing against the shore. The sky is filled with dark clouds, creating a dramatic and moody atmosphere. The waves are powerful and turbulent, suggesting a stormy weather condition. The overall scene conveys a sense of raw natural beauty and the raw power of the ocean.",
"The image depicts a vast ocean with waves crashing against the shore. The sky is filled with dark clouds, creating a dramatic and moody atmosphere. The waves are powerful and turbulent, with white foam at their crests, indicating strong winds and rough sea conditions. The overall scene conveys a sense of raw natural power and"
],
[
"The image depicts a scenic mountainous landscape. The central object is a large, prominent rock formation known as Half Dome, which is a well-known landmark in Yosemite National Park, California. The weather appears to be clear and sunny, with a bright blue sky and some scattered clouds. The visibility is excellent, allowing for a",
"The image depicts a scenic mountainous landscape with a prominent rock formation in the background. The rock formation is a large, steep, and pointed peak, which appears to be a well-known natural landmark. The sky is clear with a few scattered clouds, indicating fair weather conditions. The lighting suggests it is a sunny day,",
"The image depicts a scenic mountainous landscape with a prominent, steep, and rocky peak in the background. The peak is characterized by its sharp, jagged edges and a smooth, polished surface, suggesting it might be a well-known natural landmark. The sky is clear with a few scattered clouds, indicating fair weather conditions."
],
[
"The traffic condition on the road in the image appears to be moderate. There are several vehicles traveling in both directions, including cars, a bus, and a police car. The road is divided into multiple lanes, and the vehicles are maintaining a safe distance from each other. The overall scene suggests a typical day with moderate traffic",
"The traffic condition on the road in the image appears to be moderate. There are several vehicles traveling in both directions, including cars, a bus, and a truck. The road is divided into multiple lanes, and the vehicles are maintaining a safe distance from each other. The overall flow of traffic seems to be smooth, with",
"The traffic condition on the road in the image appears to be moderate. There are several vehicles traveling in both directions, including cars, a bus, and a police car. The road is divided into multiple lanes, and the vehicles are maintaining a safe distance from each other. The overall flow of traffic seems to be smooth,"
"mountainous", "landscape", "rock", "peak", "weather",
"steep"
],
["traffic", "vehicles", "moderate", "lanes", "road"],
],
"video": [
[
"The video shows a person walking down a busy city street at night. The street is illuminated by numerous bright lights and signs, creating a vibrant and lively atmosphere. The person is wearing a black leather jacket, a red dress, and large sunglasses, and is carrying a black handbag. The street appears to be wet,",
"The video shows a person walking down a busy city street at night. The street is illuminated by numerous bright lights and signs, creating a vibrant and lively atmosphere. The person is wearing a black leather jacket, a red dress, and large sunglasses, and is carrying a black bag. The street appears to be wet, reflecting"
],
[
"The video shows a spinning Earth with a black background. The Earth is mostly dark, with some parts illuminated by lights."
],
["city", "night", "lights", "jacket", "wet"],
["earth", "spinning", "black", "illuminated", "lights"],
],
},
"qwen2.5-vl-7b-instruct": {
"image":
[[
"The image depicts a dramatic and moody natural environment, featuring a large wave breaking on the shore. The sky is overcast with dark, heavy clouds, suggesting an impending storm or a generally stormy weather condition. The ocean appears turbulent, with the wave creating a frothy white crest as it crashes. The overall atmosphere",
"The image depicts a dramatic and moody seascape. The sky is filled with dark, heavy clouds, suggesting an overcast or stormy weather condition. The ocean is turbulent, with large waves crashing and creating white foam, indicating strong winds and possibly rough seas. The overall atmosphere is one of intensity and natural power"
"image": [
["dramatic", "moody", "stormy", "turbulent", "wave"],
[
"dome", "yosemite", "landmark", "sunny", "rock", "clouds",
"pleasant"
],
["highway", "traffic", "vehicles", "bus", "police"],
],
[
"The image features a large, iconic granite rock formation, which is likely Half Dome, a famous landmark in Yosemite National Park, California. The rock formation is surrounded by a clear blue sky with a few scattered clouds, indicating a sunny and pleasant day. The road in the foreground curves gently, and there are trees on either",
"The image features a large, iconic granite rock formation, which is likely Half Dome, a famous landmark in Yosemite National Park, California. The rock formation is surrounded by a clear blue sky with a few scattered clouds, indicating a sunny and pleasant day. The road in the foreground is empty, and the trees on either side",
"The image features a large, prominent rock formation, likely Half Dome, which is a famous landmark in Yosemite National Park, California. The rock formation is surrounded by a clear blue sky with a few scattered clouds, indicating a sunny and pleasant day. The road in the foreground is empty, and the trees on either side of",
"The image features a large, iconic granite rock formation, which is likely Half Dome, a famous landmark in Yosemite National Park, California. The rock formation is surrounded by a clear blue sky with a few scattered clouds, indicating a sunny and pleasant day. The road in the foreground curves gently, and there are trees on both",
"The image features a large, iconic granite rock formation, which appears to be Half Dome, a famous landmark in Yosemite National Park, California. The rock formation is surrounded by a clear blue sky with a few scattered clouds, indicating a sunny and pleasant day. The foreground shows a paved road curving around the base of the",
],
[
"The image shows a multi-lane highway with traffic flowing in both directions. The road appears to be relatively clear, with a few vehicles visible on the road. There is a bus in the right lane, a police car in the middle lane, and a few other vehicles scattered across the lanes. The traffic seems to be",
"The image shows a multi-lane highway with traffic flowing in both directions. The road appears to be relatively clear, with a few vehicles visible on the road. There is a bus on the right side of the road, and a police car is seen in the middle lane, possibly indicating a traffic check or an incident.",
"The image shows a multi-lane highway with traffic flowing in both directions. The road appears to be relatively clear, with a few vehicles visible on the road. There is a bus on the right side of the road, and a police car is seen in the middle lane. The traffic seems to be moving smoothly, with"
]],
"video":
[[
"The video depicts a woman walking down a vibrant, neon-lit street at night. She is dressed in a stylish outfit, featuring a black leather jacket, a red dress, and red boots. She carries a small handbag and wears large sunglasses. The street is wet, reflecting the colorful lights from the surrounding buildings,",
"video": [
["woman", "neon", "night", "jacket", "wet"],
["earth", "rotating", "night", "lights", "cities"],
],
[
"The video shows a rotating Earth at night. The illuminated areas represent cities and populated regions, with lights visible in various parts of the world. The Earth is depicted with a dark blue ocean and a lighter blue landmass, and the night sky is black. The rotation of the Earth is smooth, giving a sense of continuous",
]],
},
}
@@ -1593,15 +1555,17 @@ def test_ptp_quickstart_multimodal(llm_root, llm_venv, model_name, model_path,
item = item[end:]
return results
match_ratio = 0.9
for output, expected_answer in zip(parse_output(output),
expected_answers[model_name][modality]):
if not isinstance(expected_answer, list):
expected_answer = [expected_answer]
assert any(
SequenceMatcher(a=output, b=answer).ratio() > match_ratio
for answer in expected_answer
), f"Wrong answer!\nGenerated \"{output}\"\nExpected \"{expected_answer}\"\nMatch ratio: {[SequenceMatcher(a=output, b=answer).ratio() for answer in expected_answer]} all below threshold {match_ratio}"
match_ratio = 4.0 / 5
if model_name == "qwen2-vl-7b-instruct" and modality == "image":
match_ratio = 4.0 / 6
for prompt_output, prompt_keywords in zip(
parse_output(output), expected_keywords[model_name][modality]):
matches = [
keyword in prompt_output.lower() for keyword in prompt_keywords
]
obs_match_ratio = 1. * sum(matches) / len(matches)
assert obs_match_ratio >= match_ratio, f"Incorrect output!\nGenerated \"{prompt_output}\"\nExpected keywords \"{prompt_keywords}\"\n Matched keywords: {matches}\n Observed match ratio {obs_match_ratio} below threshold {match_ratio}"
print("All answers are correct!")
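
For readers who want to try the keyword check outside the test, the following is a minimal standalone sketch of the same idea, not the code from this commit. The helper names `keyword_match_ratio` and `failing_prompts` and the default threshold argument are hypothetical and chosen for illustration only; the 0.8 default mirrors the `4.0 / 5` ratio in the hunk above.

```python
from typing import List, Sequence


def keyword_match_ratio(output: str, keywords: Sequence[str]) -> float:
    """Return the fraction of keywords found as substrings of the lower-cased output."""
    if not keywords:
        raise ValueError("keywords must not be empty")
    text = output.lower()
    hits = [keyword in text for keyword in keywords]
    return sum(hits) / len(hits)


def failing_prompts(outputs: Sequence[str],
                    expected: Sequence[Sequence[str]],
                    threshold: float = 0.8) -> List[int]:
    """Return the indices of prompts whose observed ratio is below the threshold."""
    return [
        index
        for index, (output, keywords) in enumerate(zip(outputs, expected))
        if keyword_match_ratio(output, keywords) < threshold
    ]


if __name__ == "__main__":
    generated = "The road is busy with multiple cars, including a blue car and a silver SUV."
    keywords = ["road", "busy", "car", "black", "blue"]
    # "black" is absent, so four of the five keywords match.
    print(keyword_match_ratio(generated, keywords))
```

Because the check is a plain substring test on lower-cased text, a keyword such as "car" also matches "cars", and "illuminate" matches "illuminated". That tolerance for small wording changes between runs is presumably why keyword lists replace the full-text answers here, at the cost of occasional accidental matches inside longer words.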