A while back I played around with the HASS Voice Assistant, and pretty easily got to a point where STT and TTS were working really well on my local installation. I also got the hardware to build Wyoming satellites with wakeword recognition.

However, what kept me from going through the effort of setting everything up properly (and finally getting fucking Alexa out of my house) was the “all or nothing” approach HASS seemingly has to intent recognition. You either:

  • use the built-in Assistant conversation agent, which is a pain in the ass because it matches what your STT recognized 1:1, letter by letter, so it’s almost impossible to actually get it to do something unless you spoke perfectly (and forget, for example, about putting something on your ToDo list; Todo, todo, To-Do,… are all not recognized, and have fun getting your STT to reliably generate the ToDo spelling!), or
  • you slap a full-blown LLM behind it, either forcing you to again rely on a shitty company, or host the LLM locally; but even in the latter case and on decent (not H100, of course, but with a GPU at least) hardware, the results were slow and shit, and due to context size limitations, you can just forget about exposing all your entities to the LLM Agent.
  • You also have the option of combining the two approaches: match exactly first, and if no intent is recognized, forward to the LLM. In practice, that just means that sometimes you get what you wanted (“all lights off” with a 70% success rate, I’d say), and still a lot of the time you have to wait ages for an LLM response that may be correct, but often isn’t.

What I’d like is a third option, doing fuzzy matching on what the STT generated. Indeed, there seem to have been multiple options for that through Rhasspy, but that project appears to be dead? The HASS integration has not been updated in over 4 years, and the Rhasspy repos are archived as of earlier this month.

Besides, it was not entirely clear to me if you could just use the intent recognition part of the project, forgoing the rest in favor of what HASS already brings to the table.

At this point, I am willing to implement a custom conversation agent, but wanted to make sure first that I haven’t simply missed an obvious setting/addon/… for HASS.

My questions are:

  • are you using the HASS Voice Assistant without an LLM?
  • if so, how do you get your intents to be recognized reliably?
  • do you know of any setting/project/addon helping with that?

Cheers! Have a good start to the working week…!

  • dust_accelerator@discuss.tchncs.de
    4 days ago

    Hmm. I had pretty much the same experience, and wondered about having multiple conversation agents for specific tasks, but didn’t get around to trying that out. Currently, I am using it without an LLM, albeit with GPU-accelerated Whisper (and other custom CV tasks for camera feeds). This gives me fairly accurate STT, and I have defined a plethora of variable sentences for hassil (the intent matcher), so I often get the correct match. Templates can contain optional words in [square brackets] and alternatives in (parentheses|separated by pipes), for instance:

    sentences:
     - (start|begin|fire) [the] [one] vacuum [clean(er|ing)] [robot] [session]
    

    So this would match “start vacuum”, but also “fire one vacuum cleaning session”
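To make the bracket and parenthesis syntax concrete, here is a rough Python sketch that translates such a template into a regular expression. It is not how hassil works internally: `template_to_regex` and `matches` are hypothetical helpers, the sketch only covers `[optional]` and `(a|b)`, and it drops whitespace and punctuation on both sides to keep the translation short (so word boundaries are lost). The template in the example wraps `clean(er|ing)` in brackets so that the short form "start vacuum" also matches.

```python
import re


def template_to_regex(template: str) -> re.Pattern:
    """Translate a small subset of the sentence-template syntax into a regex.

    [x] becomes an optional group, (a|b) a non-capturing alternation.
    Whitespace in the template is dropped (a simplification of this sketch).
    """
    parts = []
    for ch in template.lower():
        if ch in "[(":
            parts.append("(?:")
        elif ch == "]":
            parts.append(")?")
        elif ch == ")":
            parts.append(")")
        elif ch == "|":
            parts.append("|")
        elif not ch.isspace():
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def matches(template: str, text: str) -> bool:
    """Check whether text fits the template, ignoring spaces and punctuation."""
    squeezed = re.sub(r"[^a-z0-9]", "", text.lower())
    return template_to_regex(template).fullmatch(squeezed) is not None


TEMPLATE = "(start|begin|fire) [the] [one] vacuum [clean(er|ing)] [robot] [session]"

print(matches(TEMPLATE, "start vacuum"))                      # True
print(matches(TEMPLATE, "fire one vacuum cleaning session"))  # True
print(matches(TEMPLATE, "stop vacuum"))                       # False
```

Every optional group roughly doubles the number of sentences a template accepts, which is why a handful of templates can cover a lot of phrasings, but none of this helps when the STT output contains a word that is not in the template at all.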

    Of course, this is substantial effort initially, but once configured and debugged (punctuation is poison!) it works pretty well. As an aside, using the Atom Echo satellites gave me a lot of errors, simply because the microphones are bad. With a better quality satellite device (the voice preview) the success rate is much higher, almost flawless.

    That all said, if you find a better intent matcher or another solution, please do report back as I am very interested in an easier solution that does not require me to think of all possible sentences ahead of time.

    • stochastictrebuchet@sh.itjust.works
      4 days ago

      ML engineer here. My intuition says you won’t get better accuracy than with sentence template matching, provided your matching rules are free of contradictions. Of course, the downside is you need to remember (and teach others) the precise phrasing to trigger a certain intent. Refining your matching rules is probably a good task for a coding agent.

      Back in the pre-LLM days, we used simpler statistical models for intent classification. These were way smaller and could easily run on CPU. Check out random forests or SVMs that take bags of words as input. You need enough examples though to train them on.

      With an LLM you can reframe the problem as getting the model to generate the right ‘tool’ call. Most intents are a form of relation extraction: there’s an ‘action’ (verb) and one or more participants (subject, object, etc.). You could imagine a single tool definition (call it ‘SpeakerIntent’) that outputs the intent type (from an enum) as well as the arguments involved. Then you can link that to the final intent with some post-processing. There’s a 100M version of gemma3 that’s apparently not bad at tool calling.
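A minimal sketch of what such a single tool definition and its post-processing could look like, in Python. Everything here is an assumption made for illustration: the `IntentType` values, the `target` field, the JSON-schema layout of `SPEAKER_INTENT_TOOL` (tool-calling APIs differ in the exact envelope they expect), and the mapping table. `HassTurnOn` and `HassTurnOff` are used as example targets for the final intent.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class IntentType(str, Enum):
    """Closed set of actions the model is allowed to pick from (hypothetical)."""
    TURN_ON = "turn_on"
    TURN_OFF = "turn_off"


@dataclass
class SpeakerIntent:
    intent: IntentType
    target: str  # the participant the action applies to, e.g. "all lights"


# Schema handed to the model as its only tool (layout is an assumption).
SPEAKER_INTENT_TOOL = {
    "name": "SpeakerIntent",
    "description": "Extract the action and its target from a voice command.",
    "parameters": {
        "type": "object",
        "properties": {
            "intent": {"type": "string", "enum": [i.value for i in IntentType]},
            "target": {"type": "string"},
        },
        "required": ["intent", "target"],
    },
}

# Post-processing: link the extracted relation to the final intent name.
FINAL_INTENTS = {IntentType.TURN_ON: "HassTurnOn", IntentType.TURN_OFF: "HassTurnOff"}


def parse_tool_call(arguments_json: str) -> Optional[SpeakerIntent]:
    """Validate the model's tool-call arguments; return None on anything unexpected."""
    try:
        data = json.loads(arguments_json)
        return SpeakerIntent(IntentType(data["intent"]), str(data["target"]))
    except (KeyError, TypeError, ValueError):
        return None


call = parse_tool_call('{"intent": "turn_off", "target": "all lights"}')
if call is not None:
    print(FINAL_INTENTS[call.intent], call.target)
```

Because the model only ever fills in an enum value and a target string, a malformed or made-up answer is rejected by `parse_tool_call` instead of reaching the home automation side.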

      • smiletolerantly@awful.systems (OP)
        4 days ago

        Thanks for your input! The problem with the LLM approach for me is mostly that I have so many entities that exposing them all through HASS (or even just the subset I really, really want) already produces a context big enough to slow everything to a crawl and to get bad results from every model I’ve tried. I’ll give the model you mentioned another shot though.

        However, I really don’t want to use an LLM for this. It seems brittle and like overkill at the same time. As you said, intent classification is a wee bit older than LLMs.

        Unfortunately, the sentence template matching approach alone isn’t sufficient, because quite frequently, the STT is imperfect. With Home Assistant, currently the intent “turn off all lights” is, for example, not understood if STT produces “turn off all light”. And sure, you can extend the template for that. But what about

        • turn of all lights
        • turn off wall lights
        • turnip off all lights
        • off all lights
        • off all fights

        A human would go “huh? oh, sure, I’ll turn off all lights”. An LLM might as well. But a fuzzy matching / closest Levenshtein distance approach should be more than sufficient for this, too.
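A minimal sketch of the closest-Levenshtein-distance idea, in plain Python with no dependencies. The sentence list, the intent names, and the `max_ratio` cutoff of 0.4 are assumptions chosen for illustration, not values taken from HASS; `closest_intent` is a hypothetical helper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def closest_intent(text, sentences, max_ratio=0.4):
    """Return the intent of the nearest known sentence, or None if nothing is close."""
    text = text.lower().strip()
    best_intent, best_ratio = None, 1.0
    for sentence, intent in sentences.items():
        ratio = levenshtein(text, sentence) / max(len(text), len(sentence))
        if ratio < best_ratio:
            best_intent, best_ratio = intent, ratio
    return best_intent if best_ratio <= max_ratio else None


# Hypothetical sentence-to-intent table.
SENTENCES = {
    "turn off all lights": "LightsOff",
    "turn on all lights": "LightsOn",
    "add milk to my todo list": "TodoAdd",
}

for heard in ["turn of all lights", "turn off wall lights", "turnip off all lights",
              "off all lights", "off all fights", "play some jazz"]:
    print(heard, "->", closest_intent(heard, SENTENCES))
```

One caveat this makes visible: "turn of all lights" is a single edit away from both the "on" and the "off" sentence, and only the length normalization breaks the tie here. A real implementation would probably want to weight short, meaning-carrying words like on/off more heavily than plain character distance does.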

        Basically, I generally like the sentence template approach used by HASS, but it just needs that little bit of additional robustness against imperfections.

        • Jayjader@jlai.lu
          4 days ago

          From my understanding of word embeddings (as used by LLMs), you could skip the LLM and directly compare the similarity of what the STT outputs to each task or phrase in a list you have prepared. You’d need to test it out a few times to see what threshold works, but even testing against dozens of phrases should be much faster than spinning up an LLM - and it should be fully deterministic.
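The compare-against-a-prepared-list idea can be sketched as follows. Note the loud assumption: `toy_embed` is NOT a learned word embedding, it just counts character trigrams so that the example runs without any model. With a real sentence-embedding model you would swap in its encode function for the `embed` parameter and keep the rest. The phrase list and the 0.5 threshold are made up for illustration.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: character-trigram counts (not a learned embedding)."""
    text = text.lower().strip()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def most_similar(text, phrases, embed=toy_embed, threshold=0.5):
    """Return (best phrase or None, score); None means nothing cleared the threshold."""
    query = embed(text)
    score, phrase = max((cosine(query, embed(p)), p) for p in phrases)
    return (phrase, score) if score >= threshold else (None, score)


# Hypothetical list of prepared phrases.
PHRASES = ["turn off all lights", "start the vacuum", "add milk to the todo list"]

print(most_similar("turnip off all lights", PHRASES)[0])  # turn off all lights
print(most_similar("play some jazz", PHRASES)[0])         # None
```

The prepared phrases only need to be embedded once and can be cached, so at runtime this is one embedding call plus a handful of dot products, and the same input always yields the same result.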

          • smiletolerantly@awful.systems (OP)
            4 days ago

            Yep, that’s the idea! This post basically boils down to “does this exist for HASS already, or do I need to implement it?” and the answer, unfortunately, seems to be the latter.

    • smiletolerantly@awful.systems (OP)
      4 days ago

      Thanks for sharing your experience! I have actually mostly been testing with a good desk mic, and expect recognition to get worse with room mics… The hardware I bought is a set of Seeed ReSpeaker mic arrays; I am somewhat hopeful about them.

      Adding a lot of alternative sentences does indeed help, at least to a certain degree. However, my issue is less with “it should recognize various different commands for the same action”, and more “if I mumble, misspeak, or add a swear word on my third attempt, it should still just pick the most likely intent”, and that’s what’s currently missing from the ecosystem, as far as I can tell.

      Though I must concede, copying your strategy might be a viable stop-gap solution to get rid of Alexa. I’ll have to play around with it a bit more.

      “That all said, if you find a better intent matcher or another solution, please do report back as I am very interested in an easier solution that does not require me to think of all possible sentences ahead of time.”

      Roger.

  • JoeyJoeJoeJr@lemmy.ml
    4 days ago

    I don’t have as much experience with HASS, but I did use Mycroft for quite a while (stopped only because I had multiple big moves, and ended up in a place small enough voice control didn’t really make sense any more). There were a few intent parsers used with/made for that:

    https://github.com/MycroftAI/adapt https://github.com/MycroftAI/padatious https://github.com/MycroftAI/padaos

    In my experience, Adapt was far and away the most reliable. If you go the route of rolling your own solution, I’d recommend checking that out, and using the absolute minimum number of words to design your intents. E.g. require “off” and an entity, and nothing else, so that “AC off,” “turn off the AC,” and “turn the AC off” all work. This reduces the number of words your STT has to transcribe correctly, and allows flexibility in command phrasing.
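The "require a keyword and an entity, and nothing else" idea can be sketched in a few lines of plain Python. This does not use the Adapt library or its API; `match_keywords`, the action table, and the entity IDs are all hypothetical and only illustrate the design.

```python
import re

# Hypothetical vocabularies: one required action keyword, one required entity name.
ACTIONS = {"off": "turn_off", "on": "turn_on"}
ENTITIES = {"ac": "climate.ac", "kitchen light": "light.kitchen"}


def match_keywords(text):
    """Return (action, entity_id) if both a keyword and an entity occur, else None."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    normalized = " ".join(words)
    # The first action keyword found wins; all other words are ignored.
    action = next((ACTIONS[w] for w in words if w in ACTIONS), None)
    # Entity names may span several words, so search the normalized sentence.
    entity = next(
        (entity_id for name, entity_id in ENTITIES.items()
         if re.search(rf"\b{re.escape(name)}\b", normalized)),
        None,
    )
    return (action, entity) if action and entity else None


for heard in ["AC off,", "turn off the AC", "turn the AC off", "hello there"]:
    print(repr(heard), "->", match_keywords(heard))
```

Since everything except the two required slots is ignored, filler words and word order stop mattering, and the STT only has to get two tokens right instead of a whole sentence.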

    If you borrow a little more from Mycroft, they had “fallback” skills that were triggered when an intent couldn’t be matched. You could use the same idea, and use https://github.com/seatgeek/thefuzz to fuzzy match entities and keywords, to try to handle remaining cases where STT fails. I believe that is what this community-made skill attempted to do: https://github.com/MycroftAI/skill-homeassistant (I think there was more than one HASS skill implementation, so I could be conflating this with another).

    Another comment mentioned OVOS/Neon - those forked off of Mycroft, so you may see overlap if you investigate those as well.

    • smiletolerantly@awful.systems (OP)
      2 days ago

      Thanks for the recommendation! That looks interesting indeed.

      This entire topic is probably a sinkhole of complexity. It’s great to have somewhere to look for inspiration!

    • smiletolerantly@awful.systems (OP)
      4 days ago

      Thanks, had not heard of this before! From skimming the link, it seems that the integration with HASS mostly focuses on providing wyoming endpoints (STT, TTS, wakeword), right? (Un)fortunately, that’s the part that’s already working really well 😄

      However, the idea of just writing a stand-alone application with Ollama-compatible endpoints, but not actually putting an LLM behind it, is genius; I had not thought about that. That could really simplify stuff if I decide to write a custom intent handler. So, yeah, thanks for the link!!

      • Canuck@sh.itjust.works
        4 days ago

        You can connect Ollama and cloud providers like ChatGPT into OVOS/Neon, so that when you ask questions it doesn’t know how to handle, it can respond using the LLM.

        • smiletolerantly@awful.systems (OP)
          4 days ago

          Please read the title of the post again. I do not want to use an LLM. Selfhosted is bad enough, but feeding my data to OpenAI is worse.

  • tyler@programming.dev
    4 days ago

    Would love to know what you find. I started to use Willow months before the creator passed away and it seemed like the only option available (not the best option, literally the only option due to all the reasons you listed). If you find something I’d love to know.

    • smiletolerantly@awful.systems (OP)
      4 days ago

      Never heard about willow before - is it this one? Seems there is still recent activity in the repo - did the creator only recently pass away? Or did someone continue the project?

      How’s your experience been with it?

      And sure, will do!

      • tyler@programming.dev
        3 days ago

        Yeah it was relatively recent. I think earlier this year. Can’t remember exactly, it’s been a longgggg year. I never managed to get it integrated with HA and the creator passed away and nobody knew if it was going to get picked up by anyone else so I just fully stopped trying.

  • Zos_Kia@lemmynsfw.com
    4 days ago

    I don’t know about other STTs, but if you’re using Whisper you can “prompt” it for consistent spelling. If you put “todo” in the prompt it should always spell it like that.

    Have you tried using a vector DB with an embedder? It may give decent performance without the need for a full-blown LLM.