I have a program that requires all keywords to be in a single paragraph, usually separated by commas.

For example:

I have these terms:

1-Term
1.1-Term
2-Term
3-Term
4-Term

which I collected and organized into groups and subgroups with titles and subtitles:

Title

  • 1-Term

  • 1.1-Term

  • 2-Term

    • Sub-Title
      • 3-Term
      • 4-Term

But then I want to turn them into:

1-Term, 1.1-Term, 2-Term, 3-Term, 4-Term 
 

That means removing certain marked words (titles and subtitles), any empty or blank space, and line breaks, while adding commas between the terms. I want to keep certain dashes “-” (like the ones inside words):

1-Term,1.1-Term,2-Term,3-Term,4-Term

  • bus_factor@lemmy.world
    1 day ago

    Your description is too vague to really get a good answer. In general, if you’re doing complex string manipulation, you’ll use a full-fledged programming language with regex support, like Python, Perl or Awk, possibly piped into each other and/or other tools like Sed or Cut. I can’t be more specific than that without a more specific description where you describe the actual data and criteria.

    Are you starting with the first or second example? Why do the prefix numbers change between examples? How do you tell text and title/subtitle apart?

    • Cactus_Head@programming.devOP
      1 day ago

      Why do the prefix numbers change between examples?

      My bad, I fixed it.

      I want to show that two terms are related, e.g. Star Wars and Jedi, by grouping them together:

      Franchises

      Stars wars
      Jedi

      Transformers


      Also, I am not able to add line breaks between bullet points in markdown, so instead I get this:

      Franchises

      • Stars wars

      • Jedi

      • Transformers

      So I can’t show the grouping here on Lemmy. I would also have liked the list I make to be markdown compatible, but I guess that’s a separate issue.

    • Cactus_Head@programming.devOP
      1 day ago

      Basically, I collect keywords (e.g. Transformers, A Deep Dive, Harry Potter, The Worst, Xbox, Star Wars, Jedi) from videos on my YouTube home page and organize them into lists:

      • YouTuber terms:

        • A Deep Dive
        • The Worst

      • Franchises:
        • Star wars
        • Jedi
        • Harry Potter
        • Transformers

      • Companies:

        • Xbox

      And turn it into:

      A Deep Dive,The Worst, Star wars, Jedi, Harry Potter, Transformers,Xbox  
      
      

      Removing the titles and subtitles.

      How do you tell text and title/subtitle apart

      I was thinking of putting a symbol like “#”, for example, in front of the title:

      # - YouTuber terms:  
      

      so the script knows to ignore that whole line, like a comment in most programming languages.
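The "#" marker idea could be sketched in Python roughly like this. It is only an illustration under assumptions: the function name `flatten_keywords`, the `marker` parameter, and the sample text are made up here, and list bullets are assumed to be "-" or "•".

```python
def flatten_keywords(text: str, marker: str = "#") -> str:
    """Join every non-blank line that is not marked as a title with commas."""
    terms = []
    for line in text.splitlines():
        # Drop indentation and a leading list bullet ("-" or "•").
        # Dashes inside a term, such as "1-Term", are left alone.
        term = line.strip().lstrip("-•").strip()
        # Skip blank lines and lines flagged as titles with the marker.
        if not term or term.startswith(marker):
            continue
        terms.append(term)
    return ",".join(terms)


# Hypothetical sample input using "#" in front of titles.
sample = """\
# - YouTuber terms:
  - A Deep Dive
  - The Worst

# - Franchises:
  - Star wars
  - Jedi
"""

print(flatten_keywords(sample))  # A Deep Dive,The Worst,Star wars,Jedi
```

Checking for the marker after removing the bullet means that both "# - Title" and "- # Title" are treated as titles.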

      • a14o@feddit.org
        1 day ago

        This is not difficult to achieve at all with tools like sed or awk. But unless you provide a concrete example input file or files, all we can do is point to those tools.

        • Cactus_Head@programming.devOP
          1 day ago

          Something like this?

          - Franchise(Title): 
          
            - Harry potter
          
            - Perfect Blue
          
            - Jurassic world
            - Jurassic Park
          
            - Jedi
            - Star wars
            - The clone wars
          
            - MCU
          
            - Cartoons(Sub-Title):
          
              - Gumball 
          
              - Flapjack
          
              - Steven Universe
          
              - Stars vs. the forces of Evil
          
              - Wordgril
          
              - Flapjack
          
          

          Turned into

          Harry potter,Perfect Blue,Jurassic world,Flapjack,Jedi,Star wars,The clone wars,MCU,Gumball,Flapjack,Steven Universe,Stars vs. the forces of Evil
          

          Both “Franchise” and “Cartoons” were removed, i.e. not included with the other words.

          • bus_factor@lemmy.world
            20 hours ago

            If you wanted a somewhat cruder approach using basically ubiquitous tools, you could do something like this:

            $ grep '^ *-' /tmp/foo.txt | grep -v ': *$' | sed 's/ *- //' | tr '\n' ',' | sed s'/,$/\n/'
            Harry potter,Perfect Blue,Jurassic world,Jurassic Park,Jedi,Star wars,The clone wars,MCU,Gumball ,Flapjack,Steven Universe,Stars vs. the forces of Evil,Wordgril,Flapjack 
            

            Here I’m first using grep '^ *-' to get all lines starting with any number of spaces followed by a dash, then piping that to grep -v ': *$' to remove anything with a colon at the end (including lines with spaces after the colon). Next, sed 's/ *- //' strips the leading spaces and the dash, tr '\n' ',' replaces all newlines with commas, and finally sed s'/,$/\n/' replaces the trailing comma with a newline again (although sed is finicky across platforms with regard to newlines, so you may want to just replace it with an empty string instead).

            The above is hardly an efficient approach, but it does the job.
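For comparison, the same filtering steps can be written as a small Python function. This is a rough sketch rather than an exact port: the name `extract_terms` and the sample lines are invented for illustration, and unlike the shell pipeline it also trims trailing spaces from each term.

```python
def extract_terms(lines):
    """Roughly mimic the grep | grep -v | sed | tr pipeline on a list of lines."""
    terms = []
    for line in lines:
        stripped = line.lstrip(" ")
        # grep '^ *-': keep only lines that start with spaces and a dash.
        if not stripped.startswith("-"):
            continue
        # grep -v ': *$': drop lines that end in a colon (titles).
        if stripped.rstrip().endswith(":"):
            continue
        # sed 's/ *- //': remove the dash; trailing spaces are trimmed as well.
        terms.append(stripped[1:].strip())
    # tr '\n' ',': join with commas, without leaving a trailing comma.
    return ",".join(terms)


# Hypothetical input in the same shape as the example list.
sample = [
    "- Franchise(Title): ",
    "",
    "  - Harry potter",
    "  - Gumball ",
    "  - Cartoons(Sub-Title):",
    "    - Flapjack",
]

print(extract_terms(sample))  # Harry potter,Gumball,Flapjack
```

Because `",".join` never produces a trailing comma, the final sed step has no counterpart here.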

            • Cactus_Head@programming.devOP
              6 hours ago

              I think this is the solution that makes the most sense to me.

              But I don’t understand what sed does here:

              replace the trailing comma with a newline again

              Why do we replace the commas again with new lines?


              Also, I figured out a better way to group related terms:

              Stars Wars;Clone Wars;Jedi
              

              Using semicolons “;”.
              I figure I can replace them with commas using the tr command:

              tr ';' ',' 
              

              But do I just pipe

              tr '\n' ','
              

              Into

              tr ';' ',' 
              

              Or is there a way to combine them? I don’t see an option to do more than one operation in the tr manual.


              Lastly, I have been trying to use regex to match

              What "X" Says About
              

              To

              What The MCU Says About The Comics Industry 
              

              I just need to match the “X” there; the program takes care of the rest.

              I tried

              What \w+\s+ Says About
              

              On this website to match

              What The MCU Says About The Comics Industry

              But using the debugger, it only recognizes “The” and then stops.

              • bus_factor@lemmy.world
                4 hours ago

                Why do we replace the commas again with new lines?

                Consider this two-line output:

                $ printf 'a\nb\n'
                a
                b
                $
                

                We convert the newlines to commas. Now there is a comma after the last line as well, and because there is no trailing newline anymore, the next prompt lands at the end of the output:

                $ printf 'a\nb\n' | tr '\n' ,
                a,b,$
                

                Substituting only the last comma ($ means end of line) gives us the output we expected:

                $ printf 'a\nb\n' | tr '\n' , | sed 's/,$/\n/'
                a,b
                $
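The trailing-comma effect is easy to reproduce in Python as well. A minimal sketch, where the helper name `newlines_to_commas` is just an illustrative choice:

```python
def newlines_to_commas(text: str) -> str:
    """Replace newlines with commas, then trade the trailing comma for a newline."""
    with_commas = text.replace("\n", ",")        # like: tr '\n' ,
    if with_commas.endswith(","):
        with_commas = with_commas[:-1] + "\n"    # like: sed 's/,$/\n/'
    return with_commas


print(repr("a\nb\n".replace("\n", ",")))   # 'a,b,'
print(repr(newlines_to_commas("a\nb\n")))  # 'a,b\n'
```

The first line of output shows the extra comma that comes from the final newline; the second shows it swapped back for a newline.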
                

                Or is there a way to combine them

                These two commands have equivalent output:

                tr '\n' ',' | tr ';' ',' 
                tr '\n;' ',,'
                

                tr takes a list of characters in its first argument and converts each one to the character at the same position in its second argument. There’s a little more to it (it supports ranges, for example), but this will do the job. To learn more, you can run man tr to read its documentation.
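Python has a similar position-based mapping in `str.maketrans` and `str.translate`, which may make the idea easier to see. A small sketch, with the function name `to_commas` and the sample string made up for illustration:

```python
# Map each character in the first string to the character at the same
# position in the second string, like: tr '\n;' ',,'
TABLE = str.maketrans("\n;", ",,")


def to_commas(text: str) -> str:
    """Turn both newlines and semicolons into commas in one pass."""
    return text.translate(TABLE)


grouped = "Star Wars;Clone Wars;Jedi\nMCU\n"
print(to_commas(grouped))  # Star Wars,Clone Wars,Jedi,MCU,
```

As with tr, the final newline becomes a trailing comma that still has to be removed afterwards.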

                I tried What \w+\s+ Says About

                \w+\s+ matches “at least one word character and then at least one whitespace character”, and that’s not what you want. “The MCU” is one or more word characters, then a space, and then one or more word characters again, and that second part you’re not matching at all. In this case, you’re probably better off using a negated character class, which makes sure you don’t match across separators. What [^,;]+ Says About would match anything that’s not a comma or semicolon, for instance.

                The other problem with regex is that every implementation does things differently. For example, sed in its default (basic regex) mode would interpret that plus as a literal +, so for sed syntax you’d need to use \+ instead. \w and \s are not portable in sed either (GNU sed accepts them as extensions), and whether to use ( or \( for a literal parenthesis also varies between implementations. I often switch to Perl if I need to do some more complex regex shenanigans.
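To see the difference between the two patterns, here is a quick check using Python's `re` module. Python is only a stand-in here, chosen because these two patterns use nothing Python-specific; the engine your add-on uses may differ in other details.

```python
import re

title = "What The MCU Says About The Comics Industry"

# Original attempt: one word, some whitespace, then another literal space
# before "Says". After "The " the next character is "M", so it cannot match.
attempt = re.compile(r"What \w+\s+ Says About")

# Negated character class: any run of characters that are not "," or ";".
fixed = re.compile(r"What [^,;]+ Says About")

print(attempt.search(title))         # None
print(fixed.search(title).group(0))  # What The MCU Says About
```

The second pattern still matches single-word cases such as "What Xbox Says About", because one or more non-separator characters is all it asks for.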

                • Cactus_Head@programming.devOP
                  3 hours ago

                  second part you’re not matching at all.

                  That’s because the program/add-on I am using only requires certain keywords to blacklist videos,

                  so if it finds What "X" Says About in a video title, it doesn’t need the rest of the sentence to blacklist the video.

                  The other problem with regex is that every implementation does things differently

                  The developer links to Firefox’s developer regex documentation.

                  Regex
                  
                  You can use Regex to match very specific patterns of text.
                  
                  /aaa+/i: will block content that include aaaAAAAAaaaaAAAaaa or aaaaaaaa
                  /top \d+/: will block content that include top 10 movies, top 5 upcoming movies
                  
                  Supports negative too, by adding ! (exclamation mark) before the regex.
                  Example: !/^a/i will block content that does not start with a 
                  
                  

                  This is a snippet of the add-on’s guide. I can’t link to it because for some reason it’s only available inside the extension, but here is the add-on’s page

                  • bus_factor@lemmy.world
                    2 hours ago

                    We’re talking about different halves. The regex \w+\s+ matches "The " (“The” followed by a space), not “The MCU”.

            • bus_factor@lemmy.world
              20 hours ago

              If you’re feeling a little old school (and some might say masochistic), you could do a similar crude parser with a Perl one-liner. This would be more efficient compute-wise, but it’s a bit of an acquired taste readability-wise:

              $ perl -ne 'chomp; push @a, $1 if /^\s*-\s*(.*[^:\s])\s*$/; END{print join(",", @a), "\n"}' /tmp/foo.txt
              Harry potter,Perfect Blue,Jurassic world,Jurassic Park,Jedi,Star wars,The clone wars,MCU,Gumball,Flapjack,Steven Universe,Stars vs. the forces of Evil,Wordgril,Flapjack
              

              Here perl -n makes Perl process the file line by line, chomp strips the trailing newline, and we match each line against /^\s*-\s*(.*[^:\s])\s*$/ (optional whitespace, a dash, then text whose last non-whitespace character is not a colon). The content of the capturing parentheses is appended to an implicitly declared array @a. Finally, the END{} block runs after all lines have been read and prints the array joined on commas.
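The same regular expression also works unchanged in Python's `re` module, which may be easier to read for some. A sketch under assumptions: the names `ITEM` and `collect_items` and the sample text are invented for illustration, and it needs Python 3.8 or newer for the `:=` operator.

```python
import re

# Same pattern as in the Perl one-liner: a dash-prefixed line whose last
# non-whitespace character is not a colon. Group 1 is the term itself.
ITEM = re.compile(r"^\s*-\s*(.*[^:\s])\s*$")


def collect_items(text: str) -> str:
    """Return the captured terms of all matching lines, joined with commas."""
    return ",".join(
        m.group(1) for line in text.splitlines() if (m := ITEM.match(line))
    )


# Hypothetical input in the same shape as the example list.
sample = """\
- Franchise(Title): 

  - Harry potter
  - Gumball 
  - Cartoons(Sub-Title):
    - Flapjack
"""

print(collect_items(sample))  # Harry potter,Gumball,Flapjack
```

The capture group ends at the last non-whitespace character, which is why the trailing space after "Gumball" disappears.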

          • bus_factor@lemmy.world
            20 hours ago

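            If you can’t install a dedicated tool like yq but don’t mind creating a standalone script, Python can do this on pretty much any computer, calculator or toaster you can get your hands on in 2026. Note that the yaml module comes from the third-party PyYAML package rather than the standard library, so it may need to be installed first: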

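            #! /usr/bin/env python3
            
            import sys
            
            import yaml  # provided by the third-party PyYAML package
            
            def parse_yaml(filename):
                """Load the file as YAML and return the parsed data."""
                with open(filename) as fd:
                    return yaml.safe_load(fd)
            
            def get_leaf_nodes(data_iterable):
                """Recursively collect every scalar value, skipping mapping keys."""
                output = []
                for v in data_iterable:
                    if isinstance(v, dict):
                        # Keys are the titles; only descend into the values.
                        output += get_leaf_nodes(v.values())
                    elif isinstance(v, list):
                        output += get_leaf_nodes(v)
                    else:
                        # str() keeps join() from failing on numbers or booleans.
                        output.append(str(v))
                return output
            
            print(",".join(get_leaf_nodes(parse_yaml(sys.argv[1]))))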
            
            $ /tmp/foo.py /tmp/foo.txt
            Harry potter,Perfect Blue,Jurassic world,Jurassic Park,Jedi,Star wars,The clone wars,MCU,Gumball,Flapjack,Steven Universe,Stars vs. the forces of Evil,Wordgril,Flapjack
            

            This takes the first argument on the command line, parses it as yaml, finds all leaf nodes recursively, and prints a comma-separated list of the results.

          • bus_factor@lemmy.world
            20 hours ago

            If you can stick to valid YAML like your example is, you can use a reasonably short yq command to get a comma-separated string of all scalar values:

            $ yq -r '[.. | scalars] | join(",")' /tmp/foo.txt                
            Harry potter,Perfect Blue,Jurassic world,Jurassic Park,Jedi,Star wars,The clone wars,MCU,Gumball,Flapjack,Steven Universe,Stars vs. the forces of Evil,Wordgril,Flapjack
            

            .. goes down the tree recursively, scalars filters out only scalar values, [] around those two makes them an array, and piping it all to join(",") makes it into a comma-separated string.

          • moonpiedumplings@programming.dev
            1 day ago

            This is technically YAML, I think: a list (with one entry) of lists that contain mostly single items but also one other list. You should be able to parse this with a YAML parser, such as the PyYAML package for Python (Python itself does not ship one).

            Note that YAML is picky about syntax, though, so it wouldn’t be able to handle deviations.