🤖🏆 The Ultimate Battle: Chatbots in the Ring.

Date

August 30, 2023

Author

Matias Hoyl

Which is the best chatbot?

It's a question I get asked often, and it's not easy to answer.

What does it mean for one chatbot to be better than another? Does it write better poems? Is it more logical? Does it write better code? Is it faster? Does it have more up-to-date information? Is it friendlier?

These types of questions obsess the group of nerds who maintain the Chatbot Arena Leaderboard.

As the name suggests, it’s a kind of “fight ring” among chatbots to see who is the best.

The process is simple:

  • You are presented with two chatbots. You don’t know which ones they are.

  • You can chat with them simultaneously.

  • In the end, you have to choose who did a better job. You can also declare a tie or say that both were bad.

After many rounds of “fighting,” the page ranks the chatbots according to their score.

Here is the current leaderboard.

The models are ranked by their Elo score, a rating system originally developed to rank chess players.
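
If you're curious how those scores evolve, here is a minimal sketch of an Elo-style update in Python. The K-factor and starting ratings below are illustrative assumptions on my part, not the Arena's exact settings.

```python
# Minimal sketch of an Elo-style rating update after one "battle".
# K-factor and ratings are illustrative assumptions, not the Arena's settings.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32):
    """Return the new ratings after a single head-to-head vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - exp_a))
    return new_a, new_b

# Example: an underdog (1000) beats a favorite (1200) and gains more points.
print(update(1000, 1200, a_won=True))  # roughly (1024.3, 1175.7)
```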

Five things catch my attention:

  • GPT-4, which is the model behind ChatGPT Plus, leads by a wide margin.

  • The models powering the Claude chatbot occupy the rest of the podium.

  • Google, with infinite resources, appears only in 11th place.

  • Meta appears two places later. Notably, even though it is a private company, Meta is the only one of the big players that has opened its model to the open-source community.

  • Speaking of the open-source community, it’s surprising that they have created more than half of the top 15 models.

Perhaps this table and these numbers mean nothing to you; you may never have heard of many of these models.

So let’s move on to something practical and more entertaining.

My Own Battle of Chatbots

I will test the five most well-known chatbots (Bing Chat, Bard, Claude, free ChatGPT, and ChatGPT Plus) across seven categories.

You will notice that I am not considering the creative or writing ability of these chatbots. This is for two reasons:

  1. It’s very difficult to compare and decide who is the best when it comes to writing something. It’s very subjective.

  2. Generally, all chatbots are already good enough at writing.

For this reason, I focus on somewhat more objective and distinct categories.

Let’s go.

1. Logic

I presented each chatbot with this logic case:

And these were the results:

Since they are probabilistic models (they don’t always give the same answer), I gave each one three attempts. Even so, Bing, Bard, and free ChatGPT could not solve it in any of them.

The one that performed best, and most consistently, was Claude.

2. Mathematics

Now let’s do some math. This is the problem I gave to the chatbots:

It’s a tricky exercise because if you solve it, you find that the result is 42.5 small dogs. This number doesn’t make sense (we can’t have a half dog), which adds an additional level of difficulty to the problem.
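
To see where the 42.5 comes from, here is a quick check. The exact prompt isn't reproduced above, so this sketch assumes the widely circulated version of the puzzle: 49 dogs are signed up for a show, and there are 36 more small dogs than large dogs.

```python
# Assumed puzzle (not the verbatim prompt): 49 dogs total, with 36 more
# small dogs than large dogs. How many small dogs are there?
total_dogs = 49
extra_small_dogs = 36

# small + large = 49 and small - large = 36  =>  small = (49 + 36) / 2
small = (total_dogs + extra_small_dogs) / 2
large = total_dogs - small

print(small, large)  # 42.5 6.5 -- half a dog, which is exactly the trap
```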

Are the chatbots able to notice this subtlety?

Several interesting things happened:

  • Both Bing Chat and both versions of ChatGPT realized they couldn’t leave a dog cut in half and justified their answers accordingly.

  • Bard always reported 42.5 as the result. Mathematically, that’s correct (and that’s why I gave it the point), but it never flagged the problem of leaving half a dog, haha.

  • Claude rounded its result to a whole number, but without any justification. I tried several times and it always did the same, so even though it was close, I marked it as incorrect.

3. Riddles

Let’s go with a classic:

Only Claude could not solve it.

4. Updated Information

When we talk to a chatbot, we want it to be able to discuss current topics. But that’s not always the case. Most chatbots have been trained with data up to a certain date. For example, ChatGPT only “knows” information up to September 2021.

For this reason, the ability to connect to the internet and retrieve updated information is key.

In this test, I asked them:

And this is what they responded:

Several interesting things happened:

  • Neither Claude nor free ChatGPT has the ability to connect to the internet.

  • For ChatGPT Plus to do so, I had to install a plugin, so it’s not a skill it has “out of the box.”

  • Both Bard and Bing Chat have integrated internet browsing natively. Bing even showed me a card with the weather forecast in the same chat.

If you need to work with up-to-date information (for example, for reports that require fresh data), your best option is Bing Chat. Bard also works well, but it’s not as good at writing or reasoning.

5. Image Analysis

Chatbots have gradually diversified the kinds of input they can handle. Some are no longer limited to text: you can also upload files such as images, PDFs, and datasets.

In this case, I uploaded this image:

And I asked them, why is this image funny?

Here are some relevant points to consider:

  • Free ChatGPT does not have an option to upload files, so it is disqualified from the start.

  • Although Claude does have the option, it could not recognize the image or read what was on it, despite my trying several times.

  • Bard only allows file uploads if you are in the USA. Additionally, you can only ask it in English. That’s why its response is also in English.

  • ChatGPT Plus technically could not see the image (the koala) but was able to read the text using Code Interpreter. With that limited information, it managed to understand the joke, so I gave it the point.

  • The only one that did it without problems and understood the meme perfectly was Bing Chat.

6. Document Reading

I gave them this 43-page paper and asked them to give me the most important points.

And this was the result:

Some things to consider:

  • Bard, for some reason, refused to read the document because it didn’t want to “share personal information about other people.” Another relevant point: Bard only accepts PDFs that are already online, supplied via a link.

  • If I had to make a podium for the quality of the response, it would be:

7. Data Analysis

I have written before about how powerful ChatGPT Plus's Code Interpreter is for analyzing data. It feels like having access to a personal data analyst.

Let’s see if the rest can compete with it.

I gave them this Excel file containing data on the highest-rated movies on IMDb:

And here’s how the models performed:

Bing Chat, free ChatGPT, and Bard cannot read Excel or CSV files, so they are out of the competition in this category.

Something I didn’t know is that Claude can. It provided a comprehensive analysis, counting the directors with the most movies and the highest-rated films, among other things. However, when I went to verify the information, I realized that some of the figures were wrong. In short, Claude is good at giving a general overview but not so good at exact counts.

ChatGPT Plus is the clear winner in this category. Since it can run code, its calculations are precise, and it can also create data visualizations.
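
To give an idea of what Code Interpreter does behind the scenes, here is a hypothetical pandas sketch of that kind of analysis. The file name and column names (“Director”, “Title”, “Rating”) are assumptions of mine, not the actual ones from my spreadsheet.

```python
# Hypothetical sketch of the kind of script Code Interpreter writes for this task.
# The file name and column names are assumed; adjust them to match the real file.
import pandas as pd

df = pd.read_excel("imdb_top_movies.xlsx")

# Directors with the most movies in the list
top_directors = df["Director"].value_counts().head(5)

# Highest-rated films
top_movies = df.sort_values("Rating", ascending=False).head(5)[["Title", "Rating"]]

print(top_directors)
print(top_movies)
```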

Summary and Conclusions

If we count all the points, the table looks like this:

And these are my main conclusions:

  1. Bing Chat is the best chatbot for the average user. It runs the same model as ChatGPT Plus, it is connected to the internet, and it can read documents and images.

  2. I wasn’t familiar with Claude before this. While it didn’t perform very well in the “reasoning” tests, it shines at summarizing documents and doing simple data analysis. Something that sets it apart from the rest is its context window of about 75,000 words, which means your prompts can be extremely long; the others accept a maximum of roughly 3,000 to 6,000 words.

  3. The clear winner is ChatGPT Plus. That matches my own experience: it’s the chatbot I use the most. But there are three considerations to keep in mind:

Matias Hoyl · mhoyl@stanford.edu

© 2024 Matías Hoyl. All Rights Reserved.
