Find Any URL In Text String Exactly Like Twitter Uses
Solution 1:
I don't think there's a good way to do this reliably (over time). Now that the new gTLDs are coming, it's going to be hard to keep up. Anyway, I gave it a shot.
/
(
\b
(?:(https?|ftp):\/\/)?
(
(?:www\d{0,3}\.)?
(
[a-z0-9.-]+\.
(?:[a-z]{2,4}|museum|travel)
(?:\/[^\/\s]+)*
)
)
\b
)
/ix
Capture groups
- The entire URL, ex: http://www.google.com/anyquerystringSAY/Rfy/srA/yh
- The protocol, ex: http
- URL including www., ex: www.google.com/swrua8rua8rUWRWAURHAJSrjuhFAhjT/Rtgfsbdh
- URL excluding www., ex: google.com/sarwar8wa8r/R/A(R8 or images.google.com/w9r89w9ar8a9sjfriJRIUS(RY/(YUr
Optionally, you can replace the (?:[a-z]{2,4}|museum|travel) bit with an explicit alternation of every known TLD, but that list is never going to stop growing, so I doubt it's worth it. (You can see I added the two exceptions museum and travel, which are longer than four letters.)
Also notice I added ftp; feel free to remove it if you don't need it.
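To see the four capture groups in practice, here is a minimal Python sketch. It assumes the pattern above is used unchanged, with the /ix flags mapped to re.IGNORECASE and re.VERBOSE. The names URL_PATTERN and find_urls and the sample sentence are made up for illustration.

```python
import re

# The pattern from this answer, unchanged. The /ix flags become
# IGNORECASE (i) and VERBOSE (x, free-spacing) in Python.
URL_PATTERN = re.compile(
    r"""
    (
      \b
      (?:(https?|ftp):\/\/)?
      (
        (?:www\d{0,3}\.)?
        (
          [a-z0-9.-]+\.
          (?:[a-z]{2,4}|museum|travel)
          (?:\/[^\/\s]+)*
        )
      )
      \b
    )
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_urls(text):
    """Return one 4-tuple per match:
    (entire URL, protocol or None, URL with www., URL without www.)."""
    return [match.groups() for match in URL_PATTERN.finditer(text)]


# Hypothetical sample text, only for demonstration.
sample = "Docs at http://www.google.com/search and example.org too"
for groups in find_urls(sample):
    print(groups)
```

The protocol group is None when the text contains a bare domain such as example.org, so check for that before using it.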
Hope this helps.
Solution 2:
(# Scheme
[a-z][a-z0-9+\-.]*:
(# Authority & path
//
([a-z0-9\-._~%!$&'()*+,;=]+@)? # User
([a-z0-9\-._~%]+ # Named host
|\[[a-f0-9:.]+\] # IPv6 host
|\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\]) # IPvFuture host
(:[0-9]+)? # Port
(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/? # Path
|# Path without authority
(/?[a-z0-9\-._~%!$&'()*+,;=:@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?
)
|# Relative URL (no scheme or authority)
([a-z0-9\-._~%!$&'()*+,;=@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/? # Relative path
|(/[a-z0-9\-._~%!$&'()*+,;=:@]+)+/?) # Absolute path
)
# Query
(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
# Fragment
(\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
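This regex validates whether a string holds a URL as specified in RFC 3986. Both absolute and relative URLs are supported. Because the pattern contains inline comments and only lowercase character ranges, it has to be run in free-spacing (x) and case-insensitive (i) mode.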
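As a rough illustration, the pattern can be dropped into Python as-is. This is a sketch, not a definitive validator: the names RFC3986_URL and is_rfc3986_url are invented here, and re.fullmatch is used because the pattern as posted carries no start or end anchors of its own.

```python
import re

# The pattern from this answer, copied unchanged. VERBOSE lets the inline
# "# ..." comments and layout whitespace stay in place.
RFC3986_URL = re.compile(
    r"""
    (# Scheme
     [a-z][a-z0-9+\-.]*:
     (# Authority & path
      //
      ([a-z0-9\-._~%!$&'()*+,;=]+@)? # User
      ([a-z0-9\-._~%]+ # Named host
      |\[[a-f0-9:.]+\] # IPv6 host
      |\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\]) # IPvFuture host
      (:[0-9]+)? # Port
      (/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/? # Path
     |# Path without authority
      (/?[a-z0-9\-._~%!$&'()*+,;=:@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?
     )
    |# Relative URL (no scheme or authority)
     ([a-z0-9\-._~%!$&'()*+,;=@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/? # Relative path
     |(/[a-z0-9\-._~%!$&'()*+,;=:@]+)+/?) # Absolute path
    )
    # Query
    (\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
    # Fragment
    (\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
    """,
    re.IGNORECASE | re.VERBOSE,
)


def is_rfc3986_url(candidate):
    """True if the whole string matches the pattern, not just a part of it."""
    return RFC3986_URL.fullmatch(candidate) is not None


for text in ("http://example.com/path?x=1#top", "/relative/path", "not a url"):
    print(text, is_rfc3986_url(text))
```

Note that this validates a string that is already isolated. Because relative URLs are accepted, almost any single word passes, so it is not suited to picking links out of running text.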
Solution 3:
The answer is: you can't.
Twitter, for example, treats the name of the singer Will.I.Am as a URL (.am is a valid TLD).
Without knowing the domain registration rules of every TLD, there's no way to know whether a URL is valid without testing it.
Here is what I propose you do.
- Be generous with your script. Accept almost any string with a "." in it.
- Perform an HTTP HEAD request to see whether the URL exists.
- Do a WHOIS lookup to see whether the domain has been registered (even if the exact URL doesn't match).
Of course, this doesn't take into account that someone may have posted a link to their intranet, which would work for only some of their followers.
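The first two steps of the proposal could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the helper names candidate_urls and url_exists, the punctuation that gets stripped, the http:// default for bare hosts, and the timeout. The WHOIS step is left out because the standard library has no client for it.

```python
import http.client
import urllib.error
import urllib.request


def candidate_urls(text):
    """Generous first pass: keep any whitespace-delimited token that still
    contains a dot after surrounding punctuation is stripped."""
    found = []
    for token in text.split():
        token = token.strip(".,;:!?()\"'")
        if "." in token:
            found.append(token)
    return found


def url_exists(candidate, timeout=5.0):
    """Second pass: send an HTTP HEAD request and report whether it succeeded.
    Bare hosts are assumed to be http:// (an assumption of this sketch)."""
    url = candidate if "://" in candidate else "http://" + candidate
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, ValueError, OSError):
        # DNS failure, refused connection, HTTP error status, malformed URL, ...
        return False


# Hypothetical sample text, only for demonstration.
sample = "I saw Will.I.Am live. See example.com/tour, ok?"
print(candidate_urls(sample))
```

The generous pass deliberately keeps Will.I.Am as a candidate. Only the network check in url_exists can tell whether it actually resolves to something.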
Solution 4:
My simple JavaScript library, FuncJS, has a function called findLinks() that should be able to do what you want.
Say you have a string with links inside it; simply pass it as the function's argument, like this:
findLinks("Visit my website at http://website.com and visit my profile on Twitter at http://twitter.com/yourProfile!");
Then output the result with a method such as document.write, and the string should appear with its links highlighted.
For a better understanding of this function, please read the documentation at http://docs.funcjs.webege.com/findLinks().html.
Hope this helps you out and anyone else wanting to do this! :)