
A Scientific Large Language Model in Geoscience

  • Technical report is HERE!
  • The data pre-processing toolkits are open-sourced in sciparser!

Statement

Due to certain oversights and unclear statements on copyright-related issues while organizing our work, especially regarding the training data, we will take down the links to the GeoGalactica model family from GitHub to suspend public dissemination of the GeoGalactica model. We apologize to the publishers and organizations affected. Respecting copyright and academic norms has always been our team’s stance.

Taking reasonable suggestions fully into account, we will properly use geoscience data collected from open platforms, combined with our team’s previously accumulated data, to carry out a new round of model training. The data for this round will come from 540 open-access journals in Earth sciences, natural sciences, computer science, and other geoscience-related fields. The articles in these open-access journals adhere to the principle of open sharing and are published on the journals’ websites. Using public, open web data such as CommonCrawl, we have reacquired a considerable corpus; together with the abstract data accumulated earlier, we expect to collect a sufficiently large corpus that does not involve copyright disputes.

The new version of our model, trained on the newer, selected corpora, is currently under development, and we will release the new model parameters on our GitHub account once they are ready.

Introduction

GeoGalactica is obtained by further pre-training Galactica, a top-performing LLM trained on a large number of scientific documents. In this work, we take an initial step toward leveraging LLMs for science through a rather straightforward approach: we specialize an open-source LLM for geoscience by further pre-training it on a vast amount of geoscience text, then applying supervised fine-tuning (SFT) with our custom-collected instruction-tuning dataset. The result is GeoGalactica, a model with 30 billion parameters. To the best of our knowledge, it is the largest language model for the geoscience domain.

Resources

  • Paper: https://github.com/geobrain-ai/geogalactica
  • Data: https://huggingface.co/datasets/daven3/geobench, https://huggingface.co/datasets/daven3/geosignal, and https://github.com/zthang/geotools
  • Model: https://huggingface.co/geobrain-ai/geogalactica
  • Checkpoints: https://huggingface.co/geobrain-ai/geogalactica-ckpt
  • Plot: https://github.com/dbylynn/GeoGalactica_Analysis
  • Sciparser: https://github.com/davendw49/sciparser

Quick Start

A simple script (tools/prediction/demo.py) is provided to generate the model’s output text for a single input. Running it requires more than 140 GB of memory. The example_data folder shows the data file format used during training.
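As a rough sanity check on the memory requirement, the sketch below estimates the memory needed for the weights alone of a 30-billion-parameter model at several numeric precisions. The helper name `weight_memory_gb` and the precision table are illustrative assumptions and are not part of this repository; the precision actually used by the demo script is not stated here. Activations, caches, and framework overhead come on top of the weights, so observed usage will be higher than these figures.

```python
# Back-of-the-envelope estimate of the memory taken by model weights alone.
# Hypothetical helper for illustration; not part of the GeoGalactica code base.

# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Return the weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Parameter count stated in the Introduction: 30 billion.
NUM_PARAMS = 30_000_000_000

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(NUM_PARAMS, dtype):.0f} GB")
```

Under these assumptions, full-precision (float32) weights already occupy 120 GB before any runtime overhead is counted.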

Contributors

This project was founded by Acemap at Shanghai Jiao Tong University and is led by Zhouhan Lin, with a group of students including Cheng Deng* (student leader), Le Zhou, Tianhang Zhang, Yi Xu, Yutong Xu, Beiya Dai, Qiyuan Chen, Yuanyuan Shi, and Zhongmou He, supervised by Zhouhan Lin, Junxian He, Xinbing Wang, and Chenghu Zhou.

Acknowledgements

GeoGalactica draws on the following open-source projects. We want to express our gratitude and respect to the researchers behind them.

  • Facebook Galactica: https://galactica.org/
  • Facebook LLaMA: https://github.com/facebookresearch/llama
  • Stanford Alpaca: https://github.com/tatsu-lab/stanford_alpaca
  • alpaca-lora by @tloen: https://github.com/tloen/alpaca-lora
  • alpaca-gpt4 by Chansung Park: https://github.com/tloen/alpaca-lora/issues/340
  • K2 by Cheng Deng: https://github.com/davendw49/k2

We would also like to thank the students at CAS for their data processing and annotation work.

License

GeoGalactica is a research preview intended for non-commercial use only, subject to the model license of Galactica and the Terms of Use of the data generated by OpenAI. Please contact us if you find any potential violations. The code is released under the Apache License 2.0. The GeoSignal and GeoBench datasets are open-sourced by K2.
