Find Any URL In Text String Exactly Like Twitter Uses

There are many similar questions, but none of them handles a URL that lacks www., http://, etc. What I'm looking to do is check whether or not a string contains a u

Solution 1:

I don't think there's a good way to do this reliably (over time). Now that the new gTLDs are coming, it's going to be hard to keep up. Anyway, I gave it a shot.

/
  (
    \b
      (?:(https?|ftp):\/\/)?
      (
        (?:www\d{0,3}\.)?
        (
          [a-z0-9.-]+\.
          (?:[a-z]{2,4}|museum|travel)
          (?:\/[^\/\s]+)*
        )
      )
    \b
  )
/ix

Capture groups

  1. The entire URL, ex: http://www.google.com/anyquerystringSAY/Rfy/srA/yh
  2. The protocol, ex: http
  3. URL including www., ex: www.google.com/swrua8rua8rUWRWAURHAJSrjuhFAhjT/Rtgfsbdh
  4. URL excluding www., ex: google.com/sarwar8wa8r/R/A(R8 or images.google.com/w9r89w9ar8a9sjfriJRIUS(RY/(YUr
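
To check the four capture groups in practice, here is a minimal Python sketch. It assumes Python's re module, with re.VERBOSE and re.IGNORECASE standing in for the x and i flags. The helper name find_urls and the sample sentence are illustrative and not part of the original answer.

```python
import re

# The pattern from above. The only change is that "/" is no longer escaped,
# because Python does not use "/" as a regex delimiter.
URL_PATTERN = re.compile(
    r"""
    (
      \b
        (?:(https?|ftp)://)?
        (
          (?:www\d{0,3}\.)?
          (
            [a-z0-9.-]+\.
            (?:[a-z]{2,4}|museum|travel)
            (?:/[^/\s]+)*
          )
        )
      \b
    )
    """,
    re.VERBOSE | re.IGNORECASE,
)


def find_urls(text):
    """Hypothetical helper: one (url, protocol, with_www, without_www) tuple per match."""
    return [match.groups() for match in URL_PATTERN.finditer(text)]


sample = "Visit google.com/search or http://www.example.org today"
for groups in find_urls(sample):
    # The protocol group is None when the URL has no scheme.
    print(groups)
```

Note that the pattern is deliberately loose: anything shaped like name.tld matches, whether or not the domain exists.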

Optionally, you can replace the (?:[a-z]{2,4}|museum|travel) bit with the full list of valid TLDs, but that list is never going to stop growing, so I doubt it's worth it. (You can see I added the two exceptions museum and travel.)

Also notice I added ftp; feel free to remove it if you don't need it.

Hope this helps.


Solution 2:

(# Scheme
 [a-z][a-z0-9+\-.]*:
 (# Authority & path
  //
  ([a-z0-9\-._~%!$&'()*+,;=]+@)?              # User
  ([a-z0-9\-._~%]+                            # Named host
  |\[[a-f0-9:.]+\]                            # IPv6 host
  |\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])  # IPvFuture host
  (:[0-9]+)?                                  # Port
  (/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?          # Path
 |# Path without authority
  (/?[a-z0-9\-._~%!$&'()*+,;=:@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?
 )
|# Relative URL (no scheme or authority)
 ([a-z0-9\-._~%!$&'()*+,;=@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?  # Relative path
 |(/[a-z0-9\-._~%!$&'()*+,;=:@]+)+/?)                            # Absolute path
)
# Query
(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
# Fragment
(\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?

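This pattern follows RFC 3986: it validates whether a string holds a URL as specified there. Both absolute and relative URLs are supported. Because the pattern contains comments and whitespace, it must be compiled in free-spacing (x) mode, and it must be case-insensitive, since the character classes list only lowercase letters.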
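
As a rough illustration, the pattern can be compiled in Python with re.VERBOSE and re.IGNORECASE and applied with re.fullmatch, so that the whole string has to be a URL. The function name is_rfc3986_url is an assumption made for this sketch.

```python
import re

# The pattern above, verbatim. re.VERBOSE keeps the inline comments legal;
# re.IGNORECASE is needed because the classes only list lowercase letters.
RFC3986_PATTERN = re.compile(
    r"""
    (# Scheme
     [a-z][a-z0-9+\-.]*:
     (# Authority & path
      //
      ([a-z0-9\-._~%!$&'()*+,;=]+@)?              # User
      ([a-z0-9\-._~%]+                            # Named host
      |\[[a-f0-9:.]+\]                            # IPv6 host
      |\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])  # IPvFuture host
      (:[0-9]+)?                                  # Port
      (/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?          # Path
     |# Path without authority
      (/?[a-z0-9\-._~%!$&'()*+,;=:@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?
     )
    |# Relative URL (no scheme or authority)
     ([a-z0-9\-._~%!$&'()*+,;=@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?  # Relative path
     |(/[a-z0-9\-._~%!$&'()*+,;=:@]+)+/?)                            # Absolute path
    )
    # Query
    (\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
    # Fragment
    (\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
    """,
    re.VERBOSE | re.IGNORECASE,
)


def is_rfc3986_url(candidate):
    """Hypothetical helper: True if the entire string matches the pattern."""
    return RFC3986_PATTERN.fullmatch(candidate) is not None


for candidate in ["http://example.com/path?x=1#frag", "hello", "not a url"]:
    print(candidate, is_rfc3986_url(candidate))
```

Because relative URLs are allowed, a bare word such as hello also validates. That makes the pattern suited to checking a single candidate string rather than to finding links inside free text.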


Solution 3:

The answer is: you can't.

Twitter, for example, treats the name of the singer Will.I.Am as a URL (.am is a valid TLD).

Without knowing all the domain registration rules for every TLD, there's no way of knowing whether a URL is valid without testing it.

Here is what I propose you do.

  1. Be generous with your script. Accept almost any string with a "." in it.
  2. Perform an HTTP HEAD request to see whether the URL exists.
  3. Do a WHOIS lookup to see whether the domain has been registered (even if the exact URL doesn't match).

Of course, this doesn't take into account that someone may have posted a link to their intranet, which would work for some of their followers.
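
The first two steps of the proposal above could be sketched in Python roughly as follows. This is a sketch under assumptions: the function names find_candidates and responds_to_head are made up for illustration, http:// is assumed when a candidate has no scheme, and the WHOIS step is left out because the standard library has no WHOIS client.

```python
import http.client
import urllib.request

# Punctuation that commonly trails a link in running text (an assumption).
TRAILING_PUNCTUATION = ".,;:!?()[]<>\"'"


def find_candidates(text):
    """Step 1: be generous and keep any token that still contains a '.'."""
    candidates = []
    for token in text.split():
        token = token.strip(TRAILING_PUNCTUATION)
        if "." in token:
            candidates.append(token)
    return candidates


def responds_to_head(candidate, timeout=5.0):
    """Step 2: send an HTTP HEAD request and report whether it succeeded."""
    # Assumption: fall back to http:// when no scheme was given.
    url = candidate if "://" in candidate else "http://" + candidate
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # DNS failures, refused connections, HTTP errors, and malformed URLs
        # all count as "does not exist" in this sketch.
        return False


print(find_candidates("I love Will.I.Am and example.com. Really!"))
```

Calling responds_to_head on each candidate then filters out the strings that do not resolve, at the cost of one network round trip per candidate.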


Solution 4:

My simple JavaScript library, FuncJS, has a function called findLinks() that should do what you want.

Say you have a string with links inside it. Simply pass it to the function as a parameter, like this:

findLinks("Visit my website at http://website.com and visit my profile on Twitter at http://twitter.com/yourProfile!");

Then output the result with a method such as document.write, and the string should be displayed with its links highlighted.

For more detail on this function, please read the documentation at http://docs.funcjs.webege.com/findLinks().html.

Hope this helps you out and anyone else wanting to do this! :)

