Comment by sebastiennight
3 days ago
Very interesting project, and I found two things particularly smart and well executed in the demo:
1. Using a "painter commenter" feedback loop to make sure the slides are correctly laid out with no overflowing or overlapping elements.
2. Having the audio/subtitles not read word-for-word the detailed contents that are added to the slides, but instead rewording that content to flow more naturally and be closer to how a human presenter would cover the slide.
A couple of things in the prompts for the reasoning features could probably be improved, e.g. in `answer_question_from_image.yaml`, which currently instructs:
1. Study the poster image along with the "questions" provided.
2. For each question:
• Decide if the poster clearly supports one of the four options (A, B, C, or D). If so, pick that answer.
• Otherwise, if the poster does not have adequate information, use "NA" for the answer.
3. Provide a brief reference indicating where in the poster you found the answer. If no reference is available (i.e., your answer is "NA"), use "NA" for the reference too.
4. Format your output strictly as a JSON object with this pattern:
{
  "Question 1": {
    "answer": "X",
    "reference": "some reference or 'NA'"
  },
  "Question 2": {
    "answer": "X",
    "reference": "some reference or 'NA'"
  },
  ...
}
I'd assume you would get better results by asking for the reference first and then the answer. Otherwise you probably end up with quite a number of answers where the model just "knows" the answer from its own training rather than reading it off the image, which would bias the benchmark.
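Concretely, that would mean swapping the two fields in the required output pattern so the model must locate the evidence before committing to an answer. A sketch of what the reordered schema could look like (field names are taken from the original prompt; only the ordering is my suggestion):

```json
{
  "Question 1": {
    "reference": "some reference or 'NA'",
    "answer": "X"
  },
  "Question 2": {
    "reference": "some reference or 'NA'",
    "answer": "X"
  }
}
```

Since autoregressive models generate the fields in order, emitting the reference first conditions the answer token on the evidence already written, rather than letting the model rationalize a reference after the fact.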