Migration from LUBW/NAZKA ChatBot KarlA
Analyzed dataset from LUBW
Extracted chatbot signatures for intents and responses
Extracted data on plants, animals and the Rheinauen area
Intents also relate to access to current weather conditions and environmental data.
Initial decoding and routing are implemented in Python. Input matching uses rapidfuzz (a text-matching library), with a fallback to a remote LLM if required. There are options for vector search, but an initial version with BM25 was not very helpful. A future test with bge-m3 is pending (vectors from the intent samples have already been generated).
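
The matching-with-fallback flow described above could look roughly like the following sketch. The intent names, sample texts, the threshold of 80 and the `route` helper are illustrative assumptions, not taken from the actual code. `fuzz.ratio` is a real rapidfuzz scorer, and a `difflib` stand-in is used when rapidfuzz is not installed so the sketch still runs.

```python
# Sketch only: intent names, sample texts and the threshold are hypothetical.
try:
    from rapidfuzz import fuzz

    def similarity(a: str, b: str) -> float:
        """Normalized similarity in the range 0..100."""
        return fuzz.ratio(a, b)
except ImportError:
    # Stdlib stand-in so the sketch runs without rapidfuzz installed.
    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        """Normalized similarity in the range 0..100."""
        return 100.0 * SequenceMatcher(None, a, b).ratio()

# Hypothetical excerpt in the spirit of intents.json: intent id -> sample texts.
SAMPLES = {
    "wetter_heute": ["Wie wird das Wetter heute?"],
    "tiere_biber": ["Was frisst ein Biber?"],
}
THRESHOLD = 80.0  # assumed cut-off; would need tuning on real data


def route(text, samples=SAMPLES, threshold=THRESHOLD):
    """Return ("intent", intent_id, score) or ("llm_fallback", None, score)."""
    query = text.strip().lower()
    best_intent, best_score = None, 0.0
    for intent_id, texts in samples.items():
        for sample in texts:
            score = similarity(query, sample.lower())
            if score > best_score:
                best_intent, best_score = intent_id, score
    if best_score >= threshold:
        return ("intent", best_intent, best_score)
    # No sample is close enough: hand the request to the remote LLM.
    return ("llm_fallback", None, best_score)
```

With the samples above, `route("Was frist ein Biber")` still resolves to `tiere_biber` despite the typo, while an unrelated input falls through to the LLM branch.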
- tiere_pflanzen_auen.json: knowledge base; dataset for animals, plants and some Rheinauen types.
- intents.json: intents with sample texts and utterance (if any)
- intent_vectors.json: vectorized (bge-m3 embeddings) text samples and the corresponding intent_id. Requires Git LFS.
- tagsAndInstructions.json: additional info for original bot decoding and routing
- pflanzenKeys.json: Parameters for plant descriptions
- tiereKeys.json: Parameters for animal descriptions
- taskList.json: decoded signatures. If utter is present, it should be used as the response. Otherwise, the intent should start either with tp_, tiere_ or pflanzen_, which then addresses the data of the corresponding types (or both), or with wetter or messdaten. How to reference the few Rheinauen datasets still has to be defined.
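
The routing rule described for taskList.json can be sketched as below. This is a minimal illustration under assumptions: the entry keys `intent` and `utter`, the returned dictionary shape, and the reading of tp_ as "both animals and plants" are guesses, not the confirmed file format.

```python
def resolve_task(task: dict) -> dict:
    """Decide how a taskList entry should be answered.

    Assumed entry shape (hypothetical): {"intent": "...", "utter": "..."}.
    """
    utter = task.get("utter")
    if utter:
        # A fixed utterance always wins over any data lookup.
        return {"kind": "utter", "response": utter}

    intent = task.get("intent", "")
    if intent.startswith("tp_"):
        # Assumption: tp_ addresses both animal and plant data.
        return {"kind": "data", "types": ["tiere", "pflanzen"]}
    if intent.startswith("tiere_"):
        return {"kind": "data", "types": ["tiere"]}
    if intent.startswith("pflanzen_"):
        return {"kind": "data", "types": ["pflanzen"]}
    if intent.startswith("wetter"):
        return {"kind": "wetter"}
    if intent.startswith("messdaten"):
        return {"kind": "messdaten"}
    # Rheinauen references and anything else are not defined yet.
    return {"kind": "unknown"}
```

Checking `utter` first keeps fixed responses independent of the intent naming scheme, so renamed intents do not silently lose their canned answer.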
512*512, generated by flux-1-schnell.
Create vector embeddings for all intent texts. Set up a database with vectors, full text and intent names. Test the chatbot's response to arbitrary requests.
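
One possible shape for such a database is sketched below, using SQLite from the Python standard library and a brute-force cosine search. The table layout, the helper names and the three-dimensional toy vectors are assumptions for illustration only; real bge-m3 embeddings are much longer, and the query vector would come from the same embedding model.

```python
import json
import math
import sqlite3


def cosine(a, b):
    """Cosine similarity of two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_store(rows):
    """Create an in-memory table holding intent name, full text and vector."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE intent_samples (intent_id TEXT, text TEXT, vector TEXT)"
    )
    con.executemany(
        "INSERT INTO intent_samples VALUES (?, ?, ?)",
        [(intent_id, text, json.dumps(vec)) for intent_id, text, vec in rows],
    )
    return con


def nearest_intent(con, query_vector):
    """Return (intent_id, text, score) of the most similar stored sample."""
    best = (None, None, -1.0)
    query = "SELECT intent_id, text, vector FROM intent_samples"
    for intent_id, text, vec_json in con.execute(query):
        score = cosine(query_vector, json.loads(vec_json))
        if score > best[2]:
            best = (intent_id, text, score)
    return best


# Toy rows with made-up vectors standing in for real embeddings.
ROWS = [
    ("wetter_heute", "Wie wird das Wetter heute?", [0.9, 0.1, 0.0]),
    ("tiere_biber", "Was frisst ein Biber?", [0.1, 0.9, 0.2]),
]
```

A linear scan like this is usually sufficient for a few thousand intent samples; a dedicated vector index would only become relevant at a much larger scale.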
Add data access to weather conditions, environmental data, Wikidata images and audio files. A source still has to be found, probably NAZKA or https://www.museumfuernaturkunde.berlin/de/forschung/tierstimmenarchiv. MP3 files were missing in the input dataset.