I repeated my experiment with the Miss Manners benchmark using Google Gemini. The results were inferior to those from ChatGPT.

For details on the Miss Manners benchmark and the results with ChatGPT, refer to my previous post. See also the results with Mistral.AI and the results with GPT4.

View the transcript here.

First, an invalid solution was generated (swapping the gender of guest 5 and guest 16).

The next solution was also invalid, with two males and two females seated next to each other.

The third solution was also invalid: “Male (10) and Male (5) are seated next to each other. Female (16) and Female (14) are seated next to each other.”
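The kind of check behind these verdicts can be automated. The following is a minimal sketch, not the method used in the transcript: the function name, the `(gender, number)` tuple representation and the `circular` flag are all assumptions made for illustration, and the example list is a hypothetical fragment rather than Gemini's full seating.

```python
from typing import List, Tuple

Guest = Tuple[str, int]  # (gender, guest number), e.g. ("M", 10); an assumed representation


def same_gender_neighbours(
    seating: List[Guest], circular: bool = True
) -> List[Tuple[Guest, Guest]]:
    """Return every pair of adjacent guests that share a gender."""
    n = len(seating)
    if n < 2:
        return []
    # For a round table the last guest also sits next to the first one.
    last = n if circular else n - 1
    clashes = []
    for i in range(last):
        left, right = seating[i], seating[(i + 1) % n]
        if left[0] == right[0]:
            clashes.append((left, right))
    return clashes


# Hypothetical fragment containing the two clashes quoted above.
fragment = [("F", 1), ("M", 10), ("M", 5), ("F", 16), ("F", 14), ("M", 2)]
for left, right in same_gender_neighbours(fragment, circular=False):
    print(f"{left} and {right} are seated next to each other")
```

An empty result means the seating alternates genders. Whether the table should be treated as round is an assumption here, so the flag is left to the caller.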

The fourth attempt at a solution was valid:

Prompt:

Guest Female(7) is recently divorced from guest Male(4). Could you take that into account?

Gemini was not able to find a solution with this additional constraint. It alternated between this invalid solution, which did not allocate a seat to Male (4):

And this solution that seated the divorced couple together:
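Both failure modes (a guest left without a seat, and the divorced couple seated together) are easy to detect mechanically. Below is a minimal sketch under stated assumptions: the function name, the string labels and the `circular` flag are hypothetical, and the four-guest example is made up for illustration rather than taken from the transcript.

```python
from typing import FrozenSet, List, Set


def check_extra_constraints(
    seating: List[str],
    guests: Set[str],
    apart: Set[FrozenSet[str]],
    circular: bool = True,
) -> List[str]:
    """List guests without a seat and forbidden pairs seated side by side."""
    problems = [f"{g} has no seat" for g in sorted(guests - set(seating))]
    n = len(seating)
    if n < 2:
        return problems
    # For a round table the last guest also sits next to the first one.
    last = n if circular else n - 1
    for i in range(last):
        a, b = seating[i], seating[(i + 1) % n]
        if frozenset((a, b)) in apart:
            problems.append(f"{a} sits next to {b}")
    return problems


# Hypothetical four-guest party, used only to exercise the checks.
guests = {"Female(1)", "Male(2)", "Male(4)", "Female(7)"}
apart = {frozenset({"Female(7)", "Male(4)"})}

print(check_extra_constraints(["Female(7)", "Male(2)", "Female(1)"], guests, apart, circular=False))
print(check_extra_constraints(["Female(1)", "Male(4)", "Female(7)", "Male(2)"], guests, apart, circular=False))
```

The first call reports the missing seat for Male(4) and the second reports the divorced couple as neighbours. A check like this could be run on each proposed seating before accepting it.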

Prompt:

What are the problems with this arrangement?

Incorrect response:

This particular arrangement still has a couple of potential issues:

Limited conversation opportunities for Male (4): While he's not directly next to his former partner, he's positioned between two females whose hobbies are currently unknown. This could limit his potential conversation topics, especially if they don't share any hobbies with him.

Potential discomfort for other guests: Although the arrangement prioritizes the comfort of the recently divorced couple, it might lead to unintended consequences. Having two individuals with unknown hobbies seated next to each other could create awkward silences or discomfort for them if they struggle to find common ground for conversation.

It's important to note that these are potential issues, and the actual experience of the dinner party might be different depending on the personalities and social skills of the guests.

Conclusions

This was the weakest set of results, so far, when compared to ChatGPT, GPT4 and Mistral (Large):

  1. With repeated prodding/prompting, Gemini found a good solution to the basic 16-person seating problem.
  2. 👍The statement of the recent divorce between two guests was correctly inferred to mean they should not be seated next to each other.
  3. 👎However, once the additional divorce constraint was added, Gemini started to struggle. It failed to find a valid solution and instead confidently offered invalid solutions. When prompted to correct its errors, it again confidently offered an invalid solution, getting stuck in a loop of invalid solutions.
  4. 👎When Gemini was prompted to analyse its solution, it confidently generated an invalid explanation for why the solution was invalid.