

The LLM Models Showdown: Open Source vs Closed Source – Who is Winning the Battle for AI Dominance?

Updated: 6 days ago





The competition between Open Source and Closed Source Large Language Models (LLMs) has become one of the most intense rivalries in the world of artificial intelligence. Recent benchmark evaluations shed light on how these models perform across a variety of challenging tasks.


In this post, we’ll dive into the strengths and weaknesses of both sides, highlight standout performances, and explore what the future holds for the next generation of AI models.


But first, let's understand the benchmarks so we have the right context before the debate.


Key Metrics Explained


  1. Doc VQA (Document Visual Question Answering): This evaluates the model’s ability to answer questions related to text within images, typically document-based images such as invoices or reports.

  2. Ai2D (Allen Institute Diagram Understanding): Tests the model’s ability to understand and interpret diagrams, which is critical for tasks involving charts, figures, and technical illustrations.

  3. Chart QA (Chart-based Question Answering): Evaluates how well the model can answer questions about charts and graphs, which requires interpreting both the text and visual data in graphical formats.

  4. Text-based Tasks (MMMLU, GSM8K, Math): These text-only tests cover general knowledge, text comprehension, and mathematical reasoning. MMMLU assesses broad knowledge across many subjects, while GSM8K and Math focus on math problem-solving.

  5. MMMU (Massive Multi-discipline Multimodal Understanding): This benchmark reflects a model’s capability to handle questions from many disciplines that combine text and visual inputs.

  6. Text VQA (Text Visual Question Answering): This metric assesses the model’s ability to answer questions that involve both image and text interpretation.

  7. VQAv2 (Visual Question Answering v2): This tests the model’s proficiency in answering questions related to images, with a focus on understanding visual content in complex scenarios.

  8. Math Vista: Focuses on math-specific questions involving visual aids like diagrams and charts, testing the model’s logical and mathematical reasoning skills in a visual context.

  9. Real-World QA (Question Answering in Real-World Scenarios): This metric tests how well models can answer questions based on real-world situations, often requiring a blend of reasoning, contextual understanding, and common-sense knowledge.
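To make the idea of a VQA-style score more concrete, here is a minimal sketch of the consensus rule commonly associated with VQAv2, where a predicted answer earns full credit once at least three human annotators gave the same answer. The function name and the sample answers are hypothetical, and official evaluation scripts apply more answer normalization and averaging than is shown here.

```python
from collections import Counter


def vqa_consensus_accuracy(prediction, human_answers):
    """Score one prediction against several human answers (simplified)."""
    # Light normalization only: trim whitespace and lowercase.
    def norm(text):
        return text.strip().lower()

    counts = Counter(norm(answer) for answer in human_answers)
    # Full credit once three annotators agree with the prediction.
    return min(counts[norm(prediction)] / 3.0, 1.0)


# Hypothetical annotator answers for a single question (illustrative only).
answers = ["invoice"] * 2 + ["receipt"] * 7 + ["bill"]
print(vqa_consensus_accuracy("Receipt", answers))  # 7 matches, capped at 1.0
print(vqa_consensus_accuracy("invoice", answers))  # 2 matches, partial credit
```

A benchmark score is then the average of these per-question values over the whole test set. Other benchmarks in the list above use different scoring rules, so this sketch should not be read as the method behind every metric.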



Now that the context is set, let's dive into the debate: are open-source models better, or do closed-source models, where the private equity and cash are pumped in, come out ahead?


For this comparison, we will use radar charts to visualize the strengths and weaknesses of these models, highlighting where they excel and where improvements can be made.
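For readers curious about how a radar chart is built, here is a minimal sketch of the underlying geometry: each metric gets its own evenly spaced axis, and a model's score sets the distance from the center along that axis. The function name `radar_polygon`, the 0 to 100 score scale, and the score values are assumptions made for illustration; the numbers are placeholders, not benchmark results.

```python
import math


def radar_polygon(scores, max_score=100.0):
    """Return closed (x, y) vertices for one model's radar polygon."""
    if not scores:
        return []
    n = len(scores)
    points = []
    for i, value in enumerate(scores):
        angle = 2 * math.pi * i / n   # evenly spaced axes, one per metric
        radius = value / max_score    # scale the score to the unit circle
        points.append((radius * math.cos(angle), radius * math.sin(angle)))
    points.append(points[0])          # repeat the first vertex to close the shape
    return points


metrics = ["Doc VQA", "Ai2D", "Chart QA", "Text VQA"]
placeholder_scores = [80.0, 60.0, 70.0, 90.0]  # made-up values for illustration
polygon = radar_polygon(placeholder_scores)
for name, (x, y) in zip(metrics, polygon):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

The resulting vertices can be handed to any plotting library that draws polygons. Overlaying one polygon per model on the same axes is what makes relative strengths and weaknesses easy to spot.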





Open-Source AI Models

Let’s explore the radar charts for various open-source models: InternVL2 76B, Llama 3-V 70B, Llama 3-V 405B, and NVLM-D 72B. Each of these models is evaluated against the nine metrics mentioned above.


Key Observations:

  • InternVL2 76B performs excellently in metrics such as Doc VQA, Ai2D, and VQAv2, showing strength in tasks involving visual and textual comprehension.

  • Llama 3-V 405B is strong in visual and multimodal tasks, with a well-rounded performance across most metrics, slightly lagging in Real-World QA.

  • NVLM-D 72B maintains solid performance across most metrics but doesn’t stand out in Real-World QA and Math Vista, which indicates some room for improvement in tasks requiring logical reasoning.


Closed-Source AI Models

Next, let’s examine the performance of the closed-source models, namely Claude 3.5, Gemini 1.5 Pro, GPT-4V, and GPT-4o. These models are often more powerful but come with restricted access.


Key Observations:

  • Claude 3.5 leads in several categories, especially in Real-World QA and Ai2D, showing its ability to handle practical, real-world scenarios more efficiently than its open-source counterparts.

  • Gemini 1.5 Pro delivers balanced performance but seems to struggle with Math Vista and Real-World QA, making it less ideal for tasks involving mathematical reasoning.

  • GPT-4V shows excellent performance in Text-based Tasks, but lags behind in Math Vista, which could indicate weaknesses in logical problem-solving or complex mathematical questions.

  • Some metrics are missing for GPT-4o; however, it is a promising model that is currently available to everyone in ChatGPT.
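When some metrics are missing for a model, one reasonable way to keep a comparison fair is to average only over the metrics that both models report. The sketch below illustrates that idea; the function name, the model names `model_x` and `model_y`, and all scores are hypothetical placeholders rather than real benchmark results.

```python
def compare_on_shared_metrics(scores_a, scores_b):
    """Average two models over only the metrics both of them report."""
    shared = [
        metric for metric, value in scores_a.items()
        if value is not None and scores_b.get(metric) is not None
    ]
    if not shared:
        return None  # no common metrics, so no fair comparison is possible
    mean_a = sum(scores_a[m] for m in shared) / len(shared)
    mean_b = sum(scores_b[m] for m in shared) / len(shared)
    return {"shared_metrics": shared, "mean_a": mean_a, "mean_b": mean_b}


# Placeholder numbers for two hypothetical models; None marks a missing metric.
model_x = {"Doc VQA": 90.0, "Chart QA": 80.0, "Math Vista": None}
model_y = {"Doc VQA": 85.0, "Chart QA": 75.0, "Math Vista": 60.0}
result = compare_on_shared_metrics(model_x, model_y)
print(result["shared_metrics"], result["mean_a"], result["mean_b"])
```

Treating a missing score as zero would unfairly penalize the model with gaps, which is why the sketch drops those metrics from both sides instead.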


Here are the performance metrics for each model individually:



Here is a comparison of the models against each metric:



The Road Ahead: Open vs Closed Collaboration


One key theme emerges from these benchmarks: Open Source models are catching up fast, and in some cases, even surpassing their Closed Source competitors. They thrive in areas requiring broad knowledge bases, such as document comprehension and VQA. Meanwhile, Closed Source models dominate specialized tasks like OCR and chart analysis, benefiting from cutting-edge proprietary algorithms and fine-tuned architectures.


But where do we go from here?

  • Open Source Potential: The Open Source community has the advantage of large-scale collaboration, rapid iteration, and increased accessibility. The impressive scores of models like Llama 3-V and InternVL2 are testament to the power of open development.

  • Closed Source Dominance: The proprietary LLMs benefit from intensive resource allocation, fine-tuning capabilities, and vertical integration within companies. Models like Claude 3.5 and GPT-4V are optimized for certain niche applications and are already integrated into enterprise ecosystems, giving them an edge in performance and scalability.


Conclusion: Who Will Win?


As LLM technology advances, the gap between Open and Closed Source models continues to narrow. While proprietary models still hold some advantages, the rapid pace of development in the Open Source space is undeniable. Whether Open or Closed will dominate the AI landscape in the future remains to be seen, but one thing is clear: the competition is pushing innovation to new heights.


It’s not just about one model outperforming another; it’s about the collaborative push for better technology and more accessible AI for everyone.


Annex: Tabular Raw Data


