BERT models with millisecond inference — The Pattern project
A newsletter from NVIDIA about running BERT with 1.2 millisecond inference on the latest hardware reminded me that I ran BERT QA (bert-large-uncased-whole-word-masking-finetuned-squad) in 1.46 ms on a laptop with an Intel i7-10875H, on RedisAI, during the Redis Labs RedisConf 2021 hackathon.
BERT QA requires user input (a question), so the response cannot be pre-calculated on the server, as it can be for summarization.
At the same time, NLP-based machine learning relies on a tokenisation step: converting plain text into numbers before running inference. The most common pipeline is therefore tokenisation, then inference, then selecting the most relevant answer. During my participation in hackathons in 2020, I observed that it is not trivial to load a GPU to 100% for NLP tasks, so I came up with the process below:
- Convert and pre-load BERT models on each shard of the Redis Cluster (code)
- Pre-tokenise all potential answers and distribute them across the shards of the Redis Cluster using RedisGears (code for batch and for event-based RedisGears functions)
- Amend the calling API to direct the question query to the shard with the most likely answers (code). The call uses graph-based ranking and zrangebyscore to find the highest-ranked sentences for the question, then gets the relevant hashtag from the sentence key
- Tokenise the question (code). Tokenisation happens on the shard via the integration between RedisGears and RedisAI
- Concatenate the user question and the pre-tokenised potential answers (code)
- Run inference using RedisAI (code). The model runs in async mode without blocking the main Redis thread, so the shard can still serve users
- Select the answer using the max score and convert tokens back to words (code)
- Cache the answer using Redis: the next hit on the API with the same question returns the answer in nanoseconds (code). This function uses the 'keymiss' event
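To illustrate the last two inference steps (select the answer by max score, convert tokens back to words), here is a self-contained sketch; it is not the hackathon code, and the token list and scores below are invented examples. It picks the highest-scoring start/end pair from the model's per-token logits and merges WordPiece sub-tokens (the '##' prefix) back into words.

```python
# Hypothetical sketch of the answer-selection step: given per-token
# start/end scores from the QA model, pick the best span and join the
# tokens back into words. The real pipeline uses the transformers
# tokenizer's decode() for the final conversion.

def select_answer(tokens, start_scores, end_scores):
    """Pick the span with the highest start score and the highest
    end score at or after the start position."""
    start = max(range(len(start_scores)), key=start_scores.__getitem__)
    # the end index must not precede the start index
    end = max(range(start, len(end_scores)), key=end_scores.__getitem__)
    # WordPiece sub-tokens are prefixed with '##'; merge them back
    answer = ""
    for tok in tokens[start:end + 1]:
        answer += tok[2:] if tok.startswith("##") else (" " + tok if answer else tok)
    return answer

tokens = ["redis", "##ai", "runs", "models", "fast"]
start_scores = [0.1, 0.2, 0.9, 0.3, 0.1]
end_scores = [0.0, 0.1, 0.2, 0.4, 0.8]
print(select_answer(tokens, start_scores, end_scores))  # → runs models fast
```

The start-before-end constraint is the standard extractive-QA trick: without it, a high end score early in the passage can produce an empty or inverted span.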
The code is easier to follow than any description of it; mind the casting to the relevant data types.
Benchmark hardware: Clevo laptop with Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz, 64 GB RAM, SSD.
Warning: pre-tokenisation of answers, while mathematically simple, is really tedious to debug, and it breaks easily due to constant changes inside the transformers library. It was working on 15 May; re-validate it before using it now.
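For context on what the pre-tokenisation step stores, the real pipeline uses the transformers BertTokenizer, but a greedy WordPiece-style tokenizer over a toy vocabulary shows the idea. The vocabulary and token ids below are invented for illustration; they are not BERT's actual vocabulary.

```python
# Toy sketch of pre-tokenisation: greedy longest-match WordPiece over a
# made-up vocabulary. The resulting id list is what would be stored on
# the shard next to each potential answer, ready to concatenate with a
# tokenised question at query time.

VOCAB = {"redis": 5, "##ai": 6, "runs": 7, "fast": 8, "[UNK]": 0}

def wordpiece(word, vocab):
    """Greedily split one word into the longest matching sub-tokens."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            return ["[UNK]"]  # no sub-token matched at this position
    return pieces

def pre_tokenise(sentence, vocab):
    """Token ids to store on the shard for one potential answer."""
    tokens = [p for w in sentence.lower().split() for p in wordpiece(w, vocab)]
    return [vocab[t] for t in tokens]

print(pre_tokenise("RedisAI runs fast", VOCAB))  # → [5, 6, 7, 8]
```

The fragility mentioned above comes from relying on the library's internal tokeniser output staying stable; a toy version like this never changes, but also never matches the real model's vocabulary.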
There are other ways to hide inference delay from the user. For example, I play a "Please wait while I retrieve an answer" text-to-speech prompt, which covers a few seconds.
There are also other ways to speed up inference itself: quantise the model and serve BERT from ONNX.
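The quantisation route can be sketched with a toy symmetric int8 scheme; this illustrates the idea that tools like onnxruntime's dynamic quantisation apply to BERT's weight matrices (4x smaller weights, faster int8 matmuls on CPU). It is an illustration only, not the production code path.

```python
# Toy symmetric int8 quantisation: map float weights to int8 values plus
# a single scale factor, then map back. Real dynamic quantisation does
# this per weight tensor inside the ONNX graph.

def quantise_int8(weights):
    """Return int8-range values and the scale that restores them."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale

def dequantise(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.27, 0.02, 1.0]
q, scale = quantise_int8(w)
restored = dequantise(q, scale)
# each restored weight is within one quantisation step of the original
assert all(abs(a - b) <= scale for a, b in zip(w, restored))
```

The accuracy cost of this lossy rounding is usually small for BERT QA, while the smaller weights and integer arithmetic cut CPU inference latency noticeably.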
Written on August 7, 2021 by Alex Mikhalev.
Originally published on Medium