Extracting text from HTML with JavaScript

Posted on 2024-11-09 05:12:27

I would like to extract text from HTML with pure JavaScript (this is for a Chrome extension).

Specifically, I would like to be able to find text on a page and extract text after it.

Even more specifically, on a page like

https://picasaweb.google.com/kevin.smilak/BestOfAmericaSGrandCircle#4974033581081755666

I would like to find the text "Latitude" and extract the value that follows it. The HTML there is not very structured.

What would be an elegant way to do this?

Comments (5)

爱冒险 2024-11-16 05:12:27

In my opinion there is no elegant solution because, as you said, the HTML is not structured, and the words "Latitude" and "Longitude" depend on the page's localization.
The best I can think of is relying on the cardinal points, which are less likely to change:

var data = document.getElementById("lhid_tray").innerHTML;
var lat = data.match(/((\d)*\.(\d)*)°(\s*)(N|S)/)[1];
var lon = data.match(/((\d)*\.(\d)*)°(\s*)(E|W)/)[1];
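
One caveat with the snippet above: `match` returns `null` when the pattern is not found, so indexing `[1]` directly would throw. A minimal sketch of a null-safe variant follows. The helper name `extractCoordinate` and the sample string are illustrative assumptions, not part of the original answer; the function accepts any string, for example the text of the tray element.

```javascript
// Hypothetical helper: returns the first number that is followed by "°" and one
// of the given hemisphere letters (e.g. "NS" or "EW"), or null if none is found.
function extractCoordinate(text, hemispheres) {
    var pattern = new RegExp('(\\d+(?:\\.\\d+)?)\\s*°\\s*[' + hemispheres + ']');
    var m = text.match(pattern);
    return m ? m[1] : null;
}

// Sample text using the values quoted in this thread.
var sample = 'Latitude: 36.872068° N Longitude: 111.387291° W';
var lat = extractCoordinate(sample, 'NS');
var lon = extractCoordinate(sample, 'EW');
console.log(lat, lon);
```

Returning `null` instead of throwing lets the extension decide what to do on pages that carry no coordinates.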

神仙妹妹 2024-11-16 05:12:27

You could do:

var str = document.getElementsByClassName("gphoto-exifbox-exif-field")[4].innerHTML;
var latPos = str.indexOf('Latitude');
// Take the text between the first <em> and </em> that follow the label.
var lat = str.substring(str.indexOf('<em>', latPos) + 4, str.indexOf('</em>', latPos));
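
Because `indexOf` returns -1 when a marker is missing, the `substring` call above can silently return the wrong slice. The following is a hedged sketch of the same idea with explicit checks. The helper name `emValueAfter` and the sample markup are assumptions made for illustration; the real page markup may differ.

```javascript
// Hypothetical helper: returns the text inside the first <em>...</em> that
// follows `label` in `html`, or null if any of the three markers is missing.
function emValueAfter(html, label) {
    var labelPos = html.indexOf(label);
    if (labelPos === -1) return null;
    var open = html.indexOf('<em>', labelPos);
    if (open === -1) return null;
    var close = html.indexOf('</em>', open);
    if (close === -1) return null;
    return html.substring(open + '<em>'.length, close);
}

// Assumed markup shape, for illustration only.
var sampleHtml = 'Latitude: <em>36.872068° N</em><br>Longitude: <em>111.387291° W</em>';
console.log(emValueAfter(sampleHtml, 'Latitude'));
console.log(emValueAfter(sampleHtml, 'Altitude'));
```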

递刀给你 2024-11-16 05:12:27

The text you're interested in is found inside of a div with class gphoto-exifbox-exif-field. Since this is for a Chrome extension, we have document.querySelectorAll which makes selecting that element easy:

var div = document.querySelectorAll('div.gphoto-exifbox-exif-field')[4],
    text = div.innerText;

/* text looks like:
"Filename: img_3474.jpg
Camera: Canon
Model: Canon EOS DIGITAL REBEL
ISO: 800
Exposure: 1/60 sec
Aperture: 5.0
Focal Length: 18mm
Flash Used: No
Latitude: 36.872068° N
Longitude: 111.387291° W"
*/

It's easy to get what you want now:

var lng = text.split('Longitude:')[1].trim(); // "111.387291° W"

I used trim() instead of split('Longitude: ') because the character after the colon in the innerText is not an ordinary space. URL-encoded it is %C2%A0, which is the UTF-8 encoding of U+00A0, a non-breaking space; trim() removes it along with regular whitespace.
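
The same approach can be generalized to any label in the text. This is a minimal sketch under the assumption that each field sits on its own line as "Label: value"; the helper name `valueAfterLabel` and the sample string are illustrative, not from the original answer.

```javascript
// Hypothetical helper: finds "<label>:" in `text` and returns the rest of that
// line, trimmed. Returns null when the label does not occur.
function valueAfterLabel(text, label) {
    // Escape regex metacharacters so the label is matched literally.
    var escaped = label.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    // "." does not match line breaks, so the capture stops at the end of the line.
    var m = text.match(new RegExp(escaped + ':(.*)'));
    return m ? m[1].trim() : null;
}

// "\u00A0" stands for the non-breaking space reported after each colon.
var exifText = 'Flash Used:\u00A0No\nLatitude:\u00A036.872068° N\nLongitude:\u00A0111.387291° W';
console.log(valueAfterLabel(exifText, 'Latitude'));
```

In the extension, `text` would be the `innerText` of the field element selected above.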

菊凝晚露 2024-11-16 05:12:27

I would query the DOM and just collect the image information into an object, so you can reference any property you want.

E.g.

function getImageData() {
    var props = {};
    Array.prototype.forEach.apply(
        document.querySelectorAll('.gphoto-exifbox-exif-field > em'),
        [function (prop) {
            props[prop.previousSibling.nodeValue.replace(/[\s:]+/g, '')] = prop.textContent;
        }]
    );
    return props;
}

var data = getImageData();
console.log(data.Latitude); // 36.872068° N
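
Note that `prop.previousSibling.nodeValue` throws if an `<em>` has no preceding sibling, and `nodeValue` is `null` when that sibling is an element rather than a text node. Below is a defensive sketch of the same collection step. The name `collectFields` is hypothetical, and the function takes any array-like of nodes so that it can be tried outside a browser; on the page you would pass `document.querySelectorAll('.gphoto-exifbox-exif-field > em')`.

```javascript
// Hypothetical variant of getImageData(): takes an array-like of <em> nodes and
// skips any whose previous sibling is missing or carries no text.
function collectFields(emNodes) {
    var props = {};
    Array.prototype.forEach.call(emNodes, function (em) {
        var labelNode = em.previousSibling;
        var label = labelNode && labelNode.nodeValue;
        if (!label) return; // no text label in front of this value
        props[label.replace(/[\s:]+/g, '')] = em.textContent;
    });
    return props;
}

// Plain objects standing in for DOM nodes, so the sketch also runs under Node.js.
var fakeNodes = [
    { previousSibling: { nodeValue: 'Latitude: ' }, textContent: '36.872068° N' },
    { previousSibling: { nodeValue: 'Focal Length: ' }, textContent: '18mm' },
    { previousSibling: null, textContent: 'orphan' }
];
var fields = collectFields(fakeNodes);
console.log(fields.Latitude, fields.FocalLength);
```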

安稳善良 2024-11-16 05:12:27

Well, if a more general answer is needed for other sites, you can try something like:

var text = document.body.innerHTML;
text = text.replace(/(<([^>]+)>)/ig,"");  //strip out all HTML tags
var latArray = text.match(/Latitude:?\s*[^0-9]*[0-9]*\.?[0-9]*\s*°\s*[NS]/gim);
//search for and return an array of all matches of:
//"Latitude", an optional ":", whitespace, any non-digit characters, a number, whitespace, "°" (required), whitespace, N or S
//flags: g = global, i = case-insensitive, m = multiline (no effect here, since the pattern has no ^ or $)

For the example page, this returns a one-element array containing "Latitude: 36.872068° N" (which should be easy to parse).
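
As one possible way to parse such a match, the sketch below converts it to a signed number. The helper name `toSignedDegrees` is hypothetical, and treating south and west as negative is a common convention assumed here, not something stated in the answer.

```javascript
// Hypothetical helper: converts "Latitude: 36.872068° N" style strings to a
// signed number (south and west negative). Returns null if the string does not fit.
function toSignedDegrees(s) {
    var m = s.match(/(\d+(?:\.\d+)?)\s*°\s*([NSEW])/i);
    if (!m) return null;
    var degrees = parseFloat(m[1]);
    var hemisphere = m[2].toUpperCase();
    return (hemisphere === 'S' || hemisphere === 'W') ? -degrees : degrees;
}

console.log(toSignedDegrees('Latitude: 36.872068° N'));
console.log(toSignedDegrees('Longitude: 111.387291° W'));
```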
